# JetBrains AI Code Completion Internship task

### Setup
Installing necessary libraries

In [152]:
import os
import sys
!{sys.executable} -m pip install -q transformers
!{sys.executable} -m pip install -q torch torchvision torchaudio
!{sys.executable} -m pip install -q editdistance
!{sys.executable} -m pip install -q rouge_score
!{sys.executable} -m pip install -q sacrebleu
!{sys.executable} -m pip install -q nltk
!{sys.executable} -m pip install -q pandas




[notice] A new release of pip is available: 23.2.1 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip







### Preparing dataset
I had two main requirements in mind: the programming language should be popular, and the code should avoid heavy use of external libraries, so that the model has a fair chance of understanding the examples. For this reason, I selected some files from my first-year project for an Object-Oriented Programming class. These files are relatively small, so the entire file can serve as context for each example, and the model's goal is to predict a single line.

In [153]:
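def generate_completion_examples(file_path, num_examples=10):
    """Split a source file into evenly spaced single-line completion examples."""
    with open(file_path, 'r') as file:
        lines = file.readlines()

    examples = []
    length = len(lines)
    # Never request more examples than the file has lines; this keeps the
    # interval at 1 or more, so each example targets a different line.
    num_examples = min(num_examples, length)
    if num_examples == 0:
        return examples
    interval = length // num_examples

    for i in range(num_examples):
        pred = i * interval  # index of the line the model must predict
        prefix = ''.join(lines[:pred])
        middle = lines[pred]
        suffix = ''.join(lines[pred + 1:])

        examples.append({
            "prefix": prefix,
            "middle": middle,
            "suffix": suffix
        })

    return examples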
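
To make the split concrete, here is a minimal sketch of the same prefix/middle/suffix idea applied to an in-memory list of lines. The helper name `split_at_line` and the three-line Java snippet are assumptions made up for illustration; they are not part of the project files.

```python
def split_at_line(lines, index):
    """Split lines into (prefix, middle, suffix) around the line at `index`."""
    prefix = ''.join(lines[:index])
    middle = lines[index]
    suffix = ''.join(lines[index + 1:])
    return prefix, middle, suffix


# Hypothetical three-line Java snippet, used only for illustration.
sample_lines = [
    "int sum = 0;\n",
    "for (int x : values) sum += x;\n",
    "return sum;\n",
]

prefix, middle, suffix = split_at_line(sample_lines, 1)
# Concatenating the three parts must reproduce the original text.
assert prefix + middle + suffix == ''.join(sample_lines)
```

Here the model would see the first and last lines as context and would be asked to produce the loop line in between.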

In [154]:
examples = []
directory = 'examples'

for filename in os.listdir(directory):
    file_path = os.path.join(directory, filename)
    
    if os.path.isfile(file_path) and filename.endswith('.java'):
        print(f"Processing file: {filename}")
        
        examples += generate_completion_examples(file_path)

print("All examples prepared")

Processing file: Diagonal.java
Processing file: DoubleMatrix.java
Processing file: Full.java
All examples prepared


### Running experiments

The model suggested in the task description, tiny_starcoder, proved capable enough to draw meaningful conclusions, so I decided not to change it.

In addition to exact match and the chrF score, I included edit similarity, which I learned about in Computational Genomics classes. This metric is directly relevant to code generation: the lower the edit distance, the fewer corrections a programmer using the tool has to make.
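
As a rough illustration of edit similarity, the sketch below computes a character-level Levenshtein distance in plain Python and normalizes it by the longer string's length. The function names and the sample strings are assumptions for illustration; the notebook itself relies on the `editdistance` package for the distance.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def edit_similarity(prediction, reference):
    """Return 1 - distance / max_len, so 1.0 means identical strings."""
    max_len = max(len(prediction), len(reference))
    if max_len == 0:
        return 1.0
    return 1 - levenshtein(prediction, reference) / max_len


# Hypothetical example: the prediction has one extra trailing character.
print(edit_similarity("return sum;)", "return sum;"))
```

In this made-up case a single deletion fixes the prediction, so the similarity is one edit short of perfect over twelve characters.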

Another metric I included is ROUGE-L, which is based on the longest common subsequence. It proved useful because, during testing, I noticed that the model often generated fundamentally correct predictions but appended unnecessary characters, such as extra whitespace or closing brackets. ROUGE-L recognises such cases well.
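
The idea behind ROUGE-L can be sketched as an F-measure over the longest common subsequence of tokens. The version below splits on whitespace, which is a simplifying assumption: the `rouge_score` library applies its own tokenization, so its numbers will not match this sketch exactly. The function names and sample strings are also illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for tok_a in a:
        current = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference, prediction):
    """LCS-based F-measure over whitespace tokens (a simplification)."""
    ref_tokens = reference.split()
    pred_tokens = prediction.split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical case: a correct line followed by an unnecessary closing bracket.
print(rouge_l_f1("return a + b;", "return a + b; }"))
```

Because every reference token still appears in order, recall stays at 1 and only precision drops, so the extra bracket costs far less than it would under exact match.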

In [155]:
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer
from rouge_score import rouge_scorer
from nltk.translate.chrf_score import sentence_chrf
import editdistance

checkpoint = "bigcode/tiny_starcoder_py"
device = "cpu"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
line_breaker = "========\n"

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

completions = []

for example in examples:
    input_text = '<fim_prefix>' + example['prefix'] + '<fim_suffix>' + example['suffix'] + '<fim_middle>'
    inputs = tokenizer(input_text, return_tensors="pt").to(device)

    output = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=10,
        pad_token_id=tokenizer.eos_token_id
    )
    
    completion_text = tokenizer.decode(output[0], skip_special_tokens=False)
    completion_text = completion_text.replace('<|endoftext|>', '').split('<fim_middle>')[1]

    em = int(example["middle"] == completion_text)

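    # Normalized edit similarity: 1 means identical, 0 means entirely different.
    distance = editdistance.eval(completion_text, example["middle"])
    max_len = max(len(completion_text), len(example["middle"]))
    edit = 1 - (distance / max_len) if max_len else 1.0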

    chrf_score = sentence_chrf(example["middle"], completion_text)
    rouge_l = scorer.score(example['middle'], completion_text)['rougeL'].fmeasure
    
    completions.append({
        "Prefix": example["prefix"],
        "Suffix": example["suffix"],
        "Expected Middle": example["middle"],
        "Model Completion": completion_text,
        "Exact Match": em,
        "Edit Similarity": edit,
        "CHRF Score": chrf_score,
        "ROUGE-L": rouge_l
    })

print(len(completions))
df = pd.DataFrame(completions)

df.to_csv("completions.csv", index=False)

print("Dataset saved")


30
Dataset saved


### Drawing conclusions
I used the Pearson correlation coefficient to assess how well each automated metric agrees with my own judgement of the completions. Edit similarity, chrF, and ROUGE-L all correlate strongly with my scores (about 0.87, 0.88, and 0.93 respectively), while exact match correlates only weakly (about 0.31), since it cannot distinguish bad predictions from partially correct ones. That said, I don't believe a dataset of 30 examples is large enough to say with confidence which of the proposed metrics captures human judgement best.

In [163]:
df = pd.read_csv("completions_labeled.csv")
numeric_df = df[['Exact Match', 'Edit Similarity', 'CHRF Score', 'ROUGE-L', 'My Score']]

correlations = numeric_df.corr()['My Score'][['Exact Match', 'Edit Similarity', 'CHRF Score', 'ROUGE-L']]

print("Correlation with My Score:")
print(correlations)


Correlation with My Score:
Exact Match        0.309423
Edit Similarity    0.867961
CHRF Score         0.876278
ROUGE-L            0.931328
Name: My Score, dtype: float64
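
For reference, the Pearson coefficient that `DataFrame.corr()` reports by default can be written out by hand. The sketch below is a plain-Python version under that assumption; the function name `pearson` and the score lists are made up for illustration and are not taken from the labeled dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        # A constant column has no defined correlation.
        return float("nan")
    return covariance / (spread_x * spread_y)


# Made-up scores for illustration only; they are not taken from the dataset.
metric_scores = [0.1, 0.4, 0.5, 0.9]
human_scores = [0.0, 0.5, 0.5, 1.0]
print(pearson(metric_scores, human_scores))
```

Pearson measures linear agreement only, so with a small sample a few outlying examples can shift the coefficient noticeably.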
