# Evaluate localization strategies

This notebook compares different localization strategies.
- Defines a base interface for localization
- Implements a few localization strategies
- Defines an evaluator that runs a test suite on those localization strategies
- Evaluator dumps the results in a pandas dataframe
- Uses Milvus as the vector database
- Uses OpenAI's embeddings model
- Uses langchain's abstractions for processing

In [None]:
import os
import pandas as pd
from typing import Dict, List, Tuple
from abc import ABC, abstractmethod

from se_agent.localizer import localize_issue
from se_agent.project import Project
from se_agent.project_manager import ProjectManager

## Base interface for localization strategies

In [None]:
class Strategy(ABC):
    @abstractmethod
    def localize(self, issue: Dict[str, str], top_n: int) -> List[Tuple[str, str]]:
        """
        Localizes the issue to a set of relevant packages and files.

        Args:
            issue (Dict[str, str]): A dictionary containing issue details with at least:
                - `title` (str): The title of the issue.
                - `description` (str): The detailed description of the issue.
            top_n (int): The maximum number of localization results to return.

        Returns:
            List[Tuple[str, str]]: A list of tuples representing relevant localization results,
                each containing `package` (str) and `file` (str).
        """
        pass
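
To make the contract concrete, here is a minimal sketch of a strategy that satisfies the interface. `KeywordOverlapStrategy` and its keyword data are hypothetical and exist only for illustration; they are not part of `se_agent`. The interface is repeated so the snippet runs on its own. The sketch also sets a `strategy_name` attribute, because the evaluator later in this notebook reads it to label result columns.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class Strategy(ABC):
    """Same interface as above, repeated so this sketch is self-contained."""

    @abstractmethod
    def localize(self, issue: Dict[str, str], top_n: int) -> List[Tuple[str, str]]:
        ...


class KeywordOverlapStrategy(Strategy):
    """Hypothetical toy strategy: ranks files by words shared with the issue text."""

    def __init__(self, file_keywords: Dict[Tuple[str, str], List[str]],
                 strategy_name: str = "Keyword Overlap"):
        # Maps (package, file) to a list of keywords; illustrative data only.
        self.file_keywords = file_keywords
        self.strategy_name = strategy_name

    def localize(self, issue: Dict[str, str], top_n: int) -> List[Tuple[str, str]]:
        text = f'{issue["title"]} {issue["description"]}'.lower()
        words = set(text.split())
        scored = [
            (len(words & set(keywords)), key)
            for key, keywords in self.file_keywords.items()
        ]
        # Keep only files sharing at least one word, best match first.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [key for _, key in ranked[:top_n]]


toy = KeywordOverlapStrategy({
    ("se_agent", "localizer"): ["localize", "localization", "package"],
    ("se_agent", "project"): ["project", "repository", "clone"],
})
toy_results = toy.localize(
    {"title": "Localization misses a package", "description": "wrong package returned"},
    top_n=1,
)
print(toy_results)
```

Any class with this shape can be added to the list of strategies that the evaluator compares.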

## Hierarchical localization strategy

Instead of semantic vector search, this strategy uses the completion API to generate localization results, which requires inlining the context in the prompt. Using every file in the repository as context would far exceed the token limit of the completion API, so the strategy uses generated semantic summaries of the code files instead. For large repositories, and depending on the model, even the file summaries may exceed the limit, so it also generates higher-level summaries per package. Assuming the aggregated package summaries fit within the token limit, the strategy operates as follows:

- **Package level**: Given an issue, it first identifies the packages relevant to the issue, using the package summaries as inline context.
- **File level**: It then identifies the relevant files within those packages, using the file summaries of the selected packages as inline context.

This strategy is more expensive than the semantic vector search strategy, since localizing each issue requires completion calls with the summaries inlined in the prompt.
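
The two-stage flow can be sketched as below. This is an illustration under stated assumptions, not the actual `localize_issue` implementation: `two_stage_localize`, the prompt wording, and the `complete` callable (a stand-in for a completion API call) are all hypothetical, and the model reply is assumed to contain one name per line.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a completion API call: takes a prompt, returns text.
CompletionFn = Callable[[str], str]


def two_stage_localize(
    issue: Dict[str, str],
    package_summaries: Dict[str, str],
    file_summaries: Dict[str, Dict[str, str]],
    complete: CompletionFn,
    top_n: int,
) -> List[Tuple[str, str]]:
    issue_text = f'Issue: {issue["title"]}\n\nDescription: {issue["description"]}'

    # Stage 1: inline only the package summaries and ask for relevant packages.
    package_context = "\n".join(f"{name}: {s}" for name, s in package_summaries.items())
    reply = complete(
        f"{issue_text}\n\nPackages:\n{package_context}\n\nList relevant packages, one per line."
    )
    # Ignore any name in the reply that is not a known package.
    packages = [p.strip() for p in reply.splitlines() if p.strip() in package_summaries]

    # Stage 2: inline file summaries for the selected packages only.
    results: List[Tuple[str, str]] = []
    for package in packages:
        files = file_summaries.get(package, {})
        file_context = "\n".join(f"{name}: {s}" for name, s in files.items())
        reply = complete(
            f"{issue_text}\n\nFiles in {package}:\n{file_context}\n\nList relevant files, one per line."
        )
        results.extend((package, f.strip()) for f in reply.splitlines() if f.strip() in files)
    return results[:top_n]


def fake_complete(prompt: str) -> str:
    # Canned replies so the sketch runs without a model.
    return "core" if "Packages:" in prompt else "parser\nunknown_file"


demo_results = two_stage_localize(
    {"title": "Parser crash", "description": "Input with tabs fails"},
    package_summaries={"core": "Parsing and evaluation", "cli": "Command-line entry points"},
    file_summaries={"core": {"parser": "Tokenizes input", "evaluator": "Runs programs"},
                    "cli": {"main": "Argument handling"}},
    complete=fake_complete,
    top_n=5,
)
print(demo_results)
```

Only the summaries of packages chosen in the first stage are inlined in the second, which is what keeps each prompt within the token limit.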

In [None]:
class HierarchicalLocalizationStrategy(Strategy):
    def __init__(self, project: Project, strategy_name: str = "Hierarchical Completion"):
        self.project = project
        self.strategy_name = strategy_name

    def localize(self, issue: Dict[str, str], top_n: int) -> List[Tuple[str, str]]:
        """
        Localizes an issue to specific files by first identifying relevant packages
        and then narrowing down to specific files in those packages.
        """
        # Wrap the issue as a single-turn user conversation for the localizer
        issue_conversation = {
            "title": issue["title"],
            "conversation": [{'role': 'user', 'content': f'Issue: {issue["title"]}\n\nDescription: {issue["description"]}'}]
        }

        # Localize the issue using the hierarchical approach
        localization_suggestions = localize_issue(self.project, issue, issue_conversation)

        if localization_suggestions is None:
            return []  # If localization fails, return an empty list

        # Format up to top_n suggestions as (package, file) tuples, in the order returned, dropping file extensions
        return [(suggestion.package, os.path.splitext(suggestion.file)[0]) for suggestion in localization_suggestions[:top_n]]

## Dataset

In [None]:
import os
from typing import Dict, Iterator, List

import yaml

class Issue:
    def __init__(self, id: str, title: str, content: str, expected_results: List[str]):
        self.id = id
        self.title = title
        self.content = content
        self.expected_results = expected_results

    def to_dict(self) -> Dict[str, str]:
        """Returns the issue data as a dictionary for easy access."""
        return {"title": self.title, "description": self.content}

class Dataset:
    def __init__(self, yaml_path: str):
        self.yaml_dir = os.path.dirname(yaml_path)  # Get the directory containing the YAML file
        with open(yaml_path, 'r', encoding='utf-8') as f:
            data = yaml.safe_load(f)
        self.test_cases = data["test_cases"]

    def __iter__(self) -> Iterator[Issue]:
        """Allows iteration over Issue instances created from test cases."""
        for case in self.test_cases:
            # Construct the full path to the markdown file
            full_path = os.path.join(self.yaml_dir, case["filepath"])
            # Load the content from the markdown file
            with open(full_path, 'r', encoding='utf-8') as f:
                content = f.read()
            # Create an Issue instance for each test case
            yield Issue(
                id=case["id"],
                title=case["title"],
                content=content,
                expected_results=case["expected_results"]
            )

dataset = Dataset("test/dataset.yaml")
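
The loader above reads four keys from each entry under `test_cases`: `id`, `title`, `filepath` (resolved relative to the YAML file's directory), and `expected_results`. The sketch below shows the structure that `yaml.safe_load` would need to return, along with a small check for missing keys. The helper `find_invalid_cases` and all example values are hypothetical; the real `test/dataset.yaml` is not shown in this notebook.

```python
from typing import Any, Dict, List

# Keys that Dataset.__iter__ reads from every test case.
REQUIRED_KEYS = {"id", "title", "filepath", "expected_results"}


def find_invalid_cases(data: Dict[str, Any]) -> List[str]:
    """Returns a message for each test case missing a key the loader reads."""
    problems = []
    for position, case in enumerate(data.get("test_cases", [])):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"test case {position}: missing {sorted(missing)}")
    return problems


# Parsed form of a dataset file; every value here is made up for illustration.
example_data = {
    "test_cases": [
        {
            "id": "example-1",
            "title": "Example issue title",
            "filepath": "issues/example-1.md",
            "expected_results": ["localizer"],
        },
        {"id": "example-2", "title": "Incomplete case"},
    ]
}
problems = find_invalid_cases(example_data)
print(problems)
```

Running a check like this before evaluation surfaces a malformed entry up front instead of as a `KeyError` partway through iteration.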

## Evaluator

In [None]:
class LocalizationEvaluator:
    def __init__(self, dataset: Dataset, strategies_to_evaluate: List[Strategy]):
        self.dataset = dataset
        self.strategies = strategies_to_evaluate

    def calculate_score(self, expected_results: List[str], actual_results: List[str]) -> float:
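        """Scores the results in [0, 1], where k is len(expected_results).

        An expected item ranked below the top-k loses 20% of its 1/k share per
        position past k; a missing item loses its full 1/k share.
        """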
        score = 1.0  # Start with a perfect score of 1

        for expected in expected_results:
            if expected in actual_results:
                index = actual_results.index(expected)
                # Check if expected item is within the top-k
                if index >= len(expected_results):
                    # Distance-based partial penalty if it's outside top-k but present in results
                    distance_factor = index - len(expected_results) + 1
                    penalty = (1 / len(expected_results)) * distance_factor * 0.2
                    score -= penalty
            else:
                # Full penalty if expected item is missing altogether
                score -= 1 / len(expected_results)

        return max(score, 0.0)  # Ensure score doesn't go below 0

    def evaluate(self) -> pd.DataFrame:
        """Evaluates each strategy on all test issues and returns a DataFrame with results and scores."""
        columns = ["Issue Title", "Expected Results"] + [f"Results ({strategy.strategy_name})" for strategy in self.strategies]

        # Total score per strategy, accumulated across issues
        total_scores = {strategy.strategy_name: 0.0 for strategy in self.strategies}

        # Collect rows in a list and build the DataFrame once, rather than concatenating per row
        rows = []

        # Iterate over each Issue in the dataset
        for issue in self.dataset:
            issue_data = issue.to_dict()  # {"title": ..., "description": ...}
            row_data = {
                "Issue Title": issue.title,
                "Expected Results": issue.expected_results
            }

            for strategy in self.strategies:
                # Keep only the file part of each (package, file) result
                actual_results = [res[1] for res in strategy.localize(issue_data, top_n=5)]
                score = self.calculate_score(issue.expected_results, actual_results)
                total_scores[strategy.strategy_name] += score  # Accumulate score for total

                # Store the score followed by the ranked results
                row_data[f"Results ({strategy.strategy_name})"] = f"{score:.2f} {actual_results}"

            rows.append(row_data)

        # Final row with the total score per strategy
        total_row = {"Issue Title": "Total", "Expected Results": ""}
        for strategy in self.strategies:
            total_row[f"Results ({strategy.strategy_name})"] = f"{total_scores[strategy.strategy_name]:.2f}"
        rows.append(total_row)

        df = pd.DataFrame(rows, columns=columns)
        return df
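
A worked example makes the scoring rule easier to follow. The function below is a standalone copy of the logic in `calculate_score`, written as a plain function so it runs without the evaluator; the name `score` and the sample inputs are illustrative. With `k = len(expected)`, each expected item owns a `1/k` share of the score.

```python
from typing import List


def score(expected: List[str], actual: List[str]) -> float:
    """Standalone copy of the scoring rule, with k = len(expected)."""
    k = len(expected)
    total = 1.0
    for item in expected:
        if item not in actual:
            total -= 1 / k  # missing: the item's full share is lost
        else:
            index = actual.index(item)
            if index >= k:
                # found below the top-k: 20% of the share per position past k
                total -= (1 / k) * (index - k + 1) * 0.2
    return max(total, 0.0)


# Both expected files are in the top 2 (order within the top-k does not matter).
perfect = score(["a", "b"], ["b", "a", "x"])
# "b" is at index 3, two positions past the top 2: penalty = 0.5 * 2 * 0.2 = 0.2.
late_hit = score(["a", "b"], ["a", "x", "y", "b"])
# "b" is absent: penalty = 0.5.
one_missing = score(["a", "b"], ["a", "x", "y"])
print(perfect, late_hit, one_missing)
```

Because the evaluator calls `localize` with `top_n=5`, an item can sit at most four positions past the top-k, so a late hit always costs less than a miss.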

**Test setup**

In [None]:
# Machine-specific settings; adjust these to your environment
projects_store = "/Users/pdhoolia/projects-store"
repo_full_name = "pdhoolia/se-agent"
src_dir = "se_agent"

project_manager = ProjectManager(projects_store)
project_info = project_manager.get_project(repo_full_name)
project = Project(os.getenv("GITHUB_TOKEN"), projects_store, project_info)

**Strategies**

In [None]:
hierarchical_strategy = HierarchicalLocalizationStrategy(project, strategy_name="Hierarchical Localization")
strategies_to_evaluate = [hierarchical_strategy]

**Evaluate**

In [None]:
evaluator = LocalizationEvaluator(
    dataset=dataset,
    strategies_to_evaluate=strategies_to_evaluate
)

evaluation_results = evaluator.evaluate()

**Display results**

In [None]:
# Create a copy of the DataFrame for display purposes
display_df = evaluation_results.copy()

# Set the index to start from 1
display_df.index = display_df.index + 1

# Apply left alignment to all columns, including headers
df_style = (
    display_df.style
    .set_table_attributes("style='width:100%'")
    .set_properties(**{'text-align': 'left'})
    .set_table_styles([{
        'selector': 'th',
        'props': [('text-align', 'left')]
    }])
)

df_style