# Benchmarking: project-level code completion evaluation

This notebook demonstrates how to benchmark the Granite models using Long Code Arena's [project-level code completion](https://huggingface.co/datasets/JetBrains-Research/lca-project-level-code-completion) dataset.

Key concepts:
1. Dataset preprocessing: We present multiple ways to prepare the completion context from a GitHub repository.
2. Tokenization and truncation: We analyze the token count of our input and truncate it to ensure it fits within the model's context window.
3. Prompt engineering: We showcase how building the prompt in different ways can influence the prediction.
4. Metrics: We present how to evaluate the predictions using exact match and edit similarity.

By the end of this notebook, you'll see how to use an existing code dataset to benchmark the performance of a model.

## Prerequisites

Before we get started, let's make sure you have the following installed:

1. Python 3.10 or later (you can use [pyenv](https://github.com/pyenv/pyenv) to install Python)
2. [Ollama](https://ollama.com/)
3. [Granite code 8b model](https://ollama.com/library/granite-code:8b), which will serve as our LLM for this tutorial.

See the [Coding_Assistant_in_VSCode](../Coding_Assistant_in_VSCode/Coding_Assistant_in_VSCode.ipynb) recipe for instructions on setting up Ollama and installing the Granite models.

## Install Dependencies

Before we begin, we need to install the required Python packages. We'll be using:

- `git+https://github.com/ibm-granite-community/granite-kitchen`: To interact with the Ollama API
- `datasets`: To download and process Hugging Face datasets
- `transformers`: For tokenization and working with language models
- `evaluate`: For using the exact match metric
- `thefuzz`: For computing the edit similarity between two strings

These packages will be installed using pip, Python's package installer. If you're running this notebook in a fresh environment, make sure you have pip installed and updated.

In [None]:
!pip install git+https://github.com/ibm-granite-community/granite-kitchen transformers[torch] datasets evaluate thefuzz

In [None]:
MODEL_ID = "granite-code:8b"

## Create the Ollama client

For this example, we're setting the temperature to 0 since we want greedy decoding.

You can adjust the context window up to 128000 tokens to see the impact of additional code context on the results.

In [None]:
from langchain_ollama.llms import OllamaLLM

# maximum context size in tokens; you can use up to 128000
MAX_LENGTH = 4096

# we're only trying to predict a single line of code, so 100 tokens are enough
model = OllamaLLM(model=MODEL_ID, temperature=0, num_predict=100, num_ctx=MAX_LENGTH)

## Tokenization

Load the model's tokenizer and define a helper function to truncate the model's prompt when it goes over the imposed limit.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-8b-code-instruct-128k", truncation_side="left")

def truncate_prompt(prompt: str, max_length: int = MAX_LENGTH) -> str:
    tokenized_prompt = tokenizer(prompt, return_tensors="pt", padding=False, truncation=True, add_special_tokens=False, max_length=max_length)
    return tokenizer.decode(tokenized_prompt["input_ids"][0])
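
To make the effect of `truncation_side="left"` concrete, here is a minimal sketch that uses whitespace-separated words as a stand-in for tokens. `truncate_left` is a hypothetical helper written for this illustration, not part of `transformers`; the real tokenizer works on subword tokens, so actual counts will differ.

```python
def truncate_left(prompt: str, max_length: int) -> str:
    """Keep only the last `max_length` whitespace-separated words (toy stand-in for tokens)."""
    if max_length <= 0:
        return ""
    words = prompt.split()
    return " ".join(words[-max_length:])


# Hypothetical prompt: repository context first, the code to complete last.
toy_prompt = "far_file.py contents near_file.py contents def add(a, b): return"
truncated = truncate_left(toy_prompt, 4)
print(truncated)  # def add(a, b): return
```

Dropping content from the left preserves the code immediately before the line being completed, which is usually the most relevant part of the prompt.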

## Dataset

Load the [project level code completion](https://huggingface.co/datasets/JetBrains-Research/lca-project-level-code-completion) dataset. Each data point contains the snapshot of a Python repository along with a number of lines that need to be completed. The lines are split into 6 categories:

1. infile – a line contains at least one function or class that was declared in the completion file.
2. inproject – a line contains at least one function or class that was declared in the repository snapshot files.
3. common – a line contains at least one function or class that was classified to be common, e.g., main, get, etc.
4. committed – a line contains at least one function or class that was declared in the files that were created in the same commit as the completion file (excluding the completion file).
5. non-informative – a line that satisfies at least one of the following criteria: (i) shorter than 5 characters or longer than 150 characters, (ii) a line with print, (iii) a line with import, (iv) a declaration of a function or a class, (v) a comment or contains an inline comment.
6. random – all the lines that do not have any category.

We're going to use the `small_context` configuration of the dataset, which has up to 48K characters of context.

In [None]:
DATASET_PATH = "JetBrains-Research/lca-project-level-code-completion"
DATASET_NAME = "small_context"

from datasets import load_dataset

ds = load_dataset(path=DATASET_PATH, name=DATASET_NAME, split="test")
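
The fields used in the rest of this notebook can be pictured with a hand-written record. The values below are invented for illustration and only mirror the field names accessed by the code in this notebook (`repo`, `repo_snapshot`, `completion_file`, `completion_lines`); real records are much larger and may contain additional fields.

```python
# Hypothetical record: the field names follow this notebook, the values are made up.
toy_example = {
    "repo": "acme/calculator",
    "repo_snapshot": {
        "filename": ["calc/ops.py", "README.md"],
        "content": ["def add(a, b):\n    return a + b\n", "# Calculator\n"],
    },
    "completion_file": {
        "filename": "calc/main.py",
        "content": "from calc.ops import add\n\nresult = add(1, 2)\ntotal = result * 2\n",
    },
    # category -> line numbers to complete; the helpers in this notebook
    # treat them as zero-based indices into the completion file's lines
    "completion_lines": {"inproject": [2], "random": [3]},
}

lines = toy_example["completion_file"]["content"].split("\n")
for category, line_numbers in toy_example["completion_lines"].items():
    for line_num in line_numbers:
        print(f"{category}: line {line_num} -> {lines[line_num]!r}")
```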

## Context building

Define different ways of organizing the additional context coming from all the repository's files apart from the one that we're trying to complete:

 - concatenate the file path and content in the same order defined in the dataset.
 - concatenate the file path and content starting from the file path that is furthest away from the completion file.

These are inspired by the original [Long Code Arena baselines code](https://github.com/JetBrains-Research/lca-baselines/tree/main/project_level_code_completion/composers). Check it out if you'd like to see additional ways of organizing the context.

In [None]:
import os

sep_symbol = 'METASEP\n'

def path_distance(path_from: str, path_to: str) -> int:
    """
    Compute the number of steps needed to go from one file system path to another.
    """

    divided_path_from = os.path.normpath(path_from).split(os.path.sep)
    divided_path_to = os.path.normpath(path_to).split(os.path.sep)
    common_len = 0
    for el1, el2 in zip(divided_path_from, divided_path_to):
        if el1 == el2:
            common_len += 1
        else:
            break

    # -1 to ignore the file itself
    return (len(divided_path_from) - 1 - common_len) + (len(divided_path_to) - 1 - common_len)

def sort_filepaths(path_from: str, list_of_filepaths: list[str]) -> list[str]:
    """
    Sorts the list of file system paths by how close they are to the provided path.
    """

    max_len = max([len(os.path.normpath(path).split(os.path.sep)) for path in list_of_filepaths])
    max_len += len(os.path.normpath(path_from).split(os.path.sep))
    paths_by_distance = [list() for _ in range(max_len)]

    for path_to in list_of_filepaths:
        dist = path_distance(path_from, path_to)
        paths_by_distance[dist].append(path_to)

    return [path for path_group in paths_by_distance for path in path_group]

def default_composer(example):
    """
    Construct the prompt context using the file order present in the dataset.
    """

    filenames, contents = example["repo_snapshot"]["filename"], example["repo_snapshot"]["content"]
    context_dict = {filename: content for filename, content in zip(filenames, contents)}
    repo_name = example["repo"]
    context = [path + sep_symbol + content for path, content in context_dict.items()]

    # End the context with the completion file's path header, as path_distance_composer does.
    completion_path = example["completion_file"]["filename"]
    context.append(completion_path + sep_symbol)
    example["context"] = f"{repo_name}{sep_symbol}" + "".join(context)

    # Keep only the raw file content: the line numbers in `completion_lines` index into it,
    # so prepending header lines here would shift every prefix and ground-truth line.
    example["completion"] = example["completion_file"]["content"]

    return example

def path_distance_composer(example):
    """
    Construct the prompt context by ordering the files based on how close they are to the completion file.
    """

    filenames, contents = example["repo_snapshot"]["filename"], example["repo_snapshot"]["content"]
    context_dict = {filename: content for filename, content in zip(filenames, contents)}
    repo_name = example["repo"]

    completion_path = example["completion_file"]["filename"]
    sorted_paths = sort_filepaths(completion_path, list(context_dict))
    context_content = [path + sep_symbol + context_dict[path] for path in sorted_paths[::-1]]

    context_content.append(completion_path + sep_symbol)
    example["context"] = f"{repo_name}{sep_symbol}" + "".join(context_content)

    example["completion"] = example["completion_file"]["content"]

    return example
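
As a worked example of the path distance ordering, the sketch below restates the distance computation for `/`-separated relative paths and applies it to a few made-up file names. `toy_path_distance` and the paths are hypothetical and exist only for this illustration.

```python
def toy_path_distance(path_from: str, path_to: str) -> int:
    """Directory steps between two files, for '/'-separated relative paths."""
    parts_from, parts_to = path_from.split("/"), path_to.split("/")
    common = 0
    for a, b in zip(parts_from, parts_to):
        if a != b:
            break
        common += 1
    # -1 on each side ignores the file name itself
    return (len(parts_from) - 1 - common) + (len(parts_to) - 1 - common)


completion_path = "src/pkg/a.py"
others = ["README.md", "src/pkg/b.py", "tests/unit/deep/test_a.py", "src/utils/c.py"]

distances = {path: toy_path_distance(completion_path, path) for path in others}
# Sort closest first (the sort is stable, so ties keep their input order),
# then reverse so that the closest file ends up last in the prompt.
farthest_first = sorted(others, key=distances.get)[::-1]

print(distances)
print(farthest_first)
```

Here `src/pkg/b.py` has distance 0 because it shares the completion file's directory, so it is placed last. Since the tokenizer truncates from the left, the files placed last are the ones that survive truncation.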

## Generation

Define helper functions for building the prompt and generating the predictions.

In [None]:
def get_prefix(line_num: int, code: str) -> str:
    lines = code.split('\n')
    return '\n'.join(lines[:line_num])

def get_line(line_num: int, code: str) -> str:
    lines = code.split('\n')
    return lines[line_num]

def generate(example, use_context: bool = True):
    """
    Generate the prediction. If `use_context` is `False` use only the contents of the file
    missing the line of code, else use also the contents of the other files in the repository. 
    """

    completions = example["completion_lines"]
    completion = example["completion"]
    context = example["context"]

    generation_results = {}
    for completion_type, completion_lines in completions.items():
        generation_results[completion_type] = list()
        for line_num in completion_lines:
            if use_context:
                prompt = "\n".join([context] + [get_prefix(line_num, completion)])
            else:
                prompt = get_prefix(line_num, completion)
            # leave room in the context window for the 100 tokens requested via `num_predict`
            prompt = truncate_prompt(prompt, MAX_LENGTH - 100)
            ground_truth = get_line(line_num, completion)
            prediction = model.invoke(prompt)
            prediction = prediction.strip("\n")
            prediction_line = prediction.split("\n")[0]
            generation_results[completion_type].append({"ground_truth": ground_truth, "prediction": prediction_line})
    
    example["generation_results"] = generation_results

    return example
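
The sketch below walks through the prefix, ground truth, and post-processing steps on a toy file. The file content and the raw model output are made up for illustration; no model is called.

```python
toy_file = "import math\n\ndef area(r):\n    return math.pi * r ** 2\n"
line_num = 3  # zero-based index of the line to predict

lines = toy_file.split("\n")
prefix = "\n".join(lines[:line_num])  # everything before the target line
ground_truth = lines[line_num]        # the line the model should produce

# Hypothetical raw output: a leading newline, the target line, then extra lines.
raw_prediction = "\n    return math.pi * r ** 2\n\nprint(area(1))"

# Same post-processing as in `generate`: drop surrounding newlines, keep the first line.
prediction_line = raw_prediction.strip("\n").split("\n")[0]

print(repr(prefix))
print(repr(prediction_line))
print(prediction_line == ground_truth)
```

The prefix ends without a trailing newline, so a model may plausibly start its output with one. Stripping newlines before taking the first line keeps that case from producing an empty prediction.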

Do 3 generations to see how the additional context influences the predictions:

1. Generate results using only the content of the completion file.
2. Generate results with the original file content ordering.
3. Generate results with the file path distance ordering.

In [None]:
# reduce the number of samples to keep the run short; use the entire dataset if you want accurate results
SAMPLES = 2
if len(ds) > SAMPLES:
    ds = ds.select(range(SAMPLES))

original_order_ds = ds.map(default_composer)
path_distance_order_ds = ds.map(path_distance_composer)

# generate predictions using only the completion file's content
no_context_ds = original_order_ds.map(generate, fn_kwargs={"use_context": False})

original_order_context_ds = original_order_ds.map(generate, fn_kwargs={"use_context": True})
path_distance_order_context_ds = path_distance_order_ds.map(generate, fn_kwargs={"use_context": True})

## Metrics

Define the metrics we'll use to evaluate the predictions:

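1. Exact match: 1 if the prediction and the ground truth are identical after stripping surrounding whitespace, 0 otherwise. The reported score is the average over all evaluated lines.
2. Edit similarity: a similarity score based on the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), normalized to the range [0, 100], where 100 means the two strings are identical.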

In [None]:
from evaluate import load
from thefuzz import fuzz

exact_match = load("exact_match", module_type="metric")
    
def calculate_exact_match(generation_results):
    results = dict()
    for sc_name, gen_res in generation_results.items():
        if len(gen_res) > 0:
            results[sc_name] = exact_match.compute(
                references=[item["ground_truth"].strip() for item in gen_res],
                predictions=[item["prediction"].strip() for item in gen_res],
            )
    return results

def calculate_edit_similarity(generation_results):
    results = dict()
    for sc_name, gen_res in generation_results.items():
        similarity = 0.
        count = 0
        for item in gen_res:
            similarity += fuzz.ratio(item["prediction"], item["ground_truth"])
            count += 1
        if count > 0:
            results[sc_name] = {'edit_similarity': similarity / count}
    return results
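
To build intuition for the two metrics without installing anything, here is a dependency-free sketch on two made-up predictions. `toy_edit_similarity` is a hypothetical stand-in that scores two strings as `2 * LCS / (len(a) + len(b))`, scaled to [0, 100]; it is an assumption made for illustration, and `fuzz.ratio` returns rounded integers and may compute its score differently depending on the installed backend.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def toy_edit_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 100]; not guaranteed to match thefuzz."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total


# Hypothetical generation results in the format produced by `generate`.
pairs = [
    {"ground_truth": "return x + 1", "prediction": "return x + 1"},
    {"ground_truth": "return x + 1", "prediction": "return x + 2"},
]

exact = sum(p["prediction"].strip() == p["ground_truth"].strip() for p in pairs) / len(pairs)
similarity = sum(toy_edit_similarity(p["prediction"], p["ground_truth"]) for p in pairs) / len(pairs)

print(exact)
print(round(similarity, 2))
```

The second prediction differs by a single character: it counts as a complete miss for exact match but still scores above 90 on the similarity scale, which is why the two metrics are reported side by side.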

Compute the metrics and aggregate the results:

In [None]:
def get_generations(ds):
    generations = {"all": list()}

    def gather_generations(example):
        for key, values in example["generation_results"].items():
            if key not in generations:
                generations[key] = list()
            for result in values:
                generations[key].append(result)
                generations["all"].append(result)

    ds.map(gather_generations)

    return generations

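def merge(d: dict, w: dict) -> dict:
    """
    Merge two dictionaries of per-category metrics without modifying the inputs.
    """
    res = {k: dict(v) for k, v in d.items()}
    for k, v in w.items():
        res.setdefault(k, dict()).update(v)

    return res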

results = list()

for dataset in [no_context_ds, original_order_context_ds, path_distance_order_context_ds]:
    generations = get_generations(dataset)
    em = calculate_exact_match(generations)
    es = calculate_edit_similarity(generations)
    metrics = merge(em, es)

    flattened_metrics = dict()
    for key, values in metrics.items():
        flattened_metrics[f"{key}-em"] = values["exact_match"]
        flattened_metrics[f"{key}-es"] = values["edit_similarity"]

    results.append(flattened_metrics)
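
The merge and flatten steps can be traced on two small, made-up metric dictionaries. The numbers below are placeholders, not benchmark results.

```python
# Hypothetical per-category metrics, shaped like the outputs of
# `calculate_exact_match` and `calculate_edit_similarity`.
em = {"all": {"exact_match": 0.5}, "infile": {"exact_match": 1.0}}
es = {"all": {"edit_similarity": 95.8}, "infile": {"edit_similarity": 100.0}}

# Combine the two metric dictionaries category by category.
merged = {key: {**em[key], **es[key]} for key in em}

# Flatten into one row with a "<category>-<metric>" column per value.
flattened = {}
for key, values in merged.items():
    flattened[f"{key}-em"] = values["exact_match"]
    flattened[f"{key}-es"] = values["edit_similarity"]

print(flattened)
```

Each flattened dictionary becomes one row of the results table, with one column per category and metric.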

## Display results

In [None]:
names = [
    "Results without additional context",
    "Results using original order with additional context",
    "Results using path distance order with additional context",
]

import pandas as pd

pd.DataFrame.from_records(results, index=names)