# Finding the file to modify

In [1]:
import ast
import os
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModel, AutoTokenizer
from tqdm import tqdm

## Creating the models

- [CodeT5+](https://github.com/salesforce/CodeT5) translates a function into a sentence that summarizes it (the function must not exceed 512 tokens).
- BERT ([MiniLM](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)) embeds both the text to compare and the summarized code.

Set the `device` variable to 'cuda' to run CodeT5+ on the GPU, which should speed up summary generation.

In [2]:
checkpoint = "Salesforce/codet5p-220m-bimodal"
device = "cpu"  

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
codeT5 = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

bert = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')




## Extracting the functions

The two code blocks below isolate and extract the functions found in a source file.\
To do so, we use ASTs, but this approach only works on Python files.\
To support other languages, we would need either a more complete AST tool or, alternatively, regular expressions.
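
To make the AST approach concrete, here is a minimal sketch that runs on an in-memory source string instead of the `samples/` directory. The `SAMPLE` string and the `list_functions` helper are hypothetical names introduced for illustration only. As an assumption beyond the notebook, the sketch also matches `ast.AsyncFunctionDef`, which a check on `ast.FunctionDef` alone would skip. Note that `ast.unparse` requires Python 3.9 or later.

```python
import ast

# Hypothetical sample source, used only for illustration.
SAMPLE = '''
def move(block, dx):
    return block + dx

class Board:
    def collides(self, block):
        return block < 0

async def tick():
    pass
'''

def list_functions(source):
    """Return (name, source) pairs for every function found in `source`."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        # AsyncFunctionDef is checked as well (an assumption of this sketch).
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found.append((node.name, ast.unparse(node)))
    return found

for name, src in list_functions(SAMPLE):
    print(name, len(src.splitlines()))
```

Because `ast.walk` visits every node in the tree, methods defined inside classes and nested functions are returned as well, which matches the behavior of the extraction loop in this notebook.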

In [3]:
def extract_function_source(node):
    if isinstance(node, ast.FunctionDef):
        return ast.unparse(node)

In [4]:
functions_sources = []
for filename in tqdm(os.listdir('samples/')):
    path = os.path.join('samples/', filename)
    if os.path.isfile(path):
        with open(path, "r") as file:
            file_content = file.read()

        # Parse the file into an AST (only valid Python sources are supported)
        parsed_tree = ast.parse(file_content)

        # Collect the source of every function definition, including methods and nested functions
        functions = [extract_function_source(node) for node in ast.walk(parsed_tree) if isinstance(node, ast.FunctionDef)]
        for function in functions:
            # Each entry is [file path, function source]; the summary is appended later
            functions_sources.append([
                path,
                function,
            ])

    

100%|██████████| 15/15 [00:00<00:00, 360.25it/s]


## Summarization

This part turns the code into (English) text.

In [5]:
for function_source in tqdm(functions_sources):
    # Tokenize the function source (function_source[1])
    input_ids = tokenizer(function_source[1], return_tensors="pt").input_ids.to(device)

    # Generate a short summary (at most 20 tokens) and store it as function_source[2]
    generated_ids = codeT5.generate(input_ids, max_length=20)
    function_source.append(tokenizer.decode(generated_ids[0], skip_special_tokens=True))



 61%|██████    | 83/136 [00:48<00:29,  1.78it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (608 > 512). Running this sequence through the model will result in indexing errors
100%|██████████| 136/136 [01:19<00:00,  1.72it/s]
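
The warning above shows that at least one function (608 tokens) exceeds the 512-token limit of the model. One possible way to handle this, sketched below under stated assumptions, is to separate the functions that fit from those that do not before summarizing. `split_by_token_limit` and `count_tokens` are hypothetical names, and the whitespace counter is only a stand-in so the sketch runs without downloading a model. In this notebook, one could presumably pass something like `lambda s: len(tokenizer(s).input_ids)` instead.

```python
def split_by_token_limit(entries, count_tokens, limit=512):
    """Split [path, source] entries into those within `limit` tokens and those above it."""
    within, too_long = [], []
    for entry in entries:
        n_tokens = count_tokens(entry[1])
        (within if n_tokens <= limit else too_long).append(entry)
    return within, too_long

# Stand-in counter: whitespace-separated words, NOT the CodeT5+ tokenizer.
def whitespace_count(source):
    return len(source.split())

# Hypothetical entries in the same [path, source] shape as `functions_sources`.
entries = [
    ["samples/a.py", "def f(): return 1"],
    ["samples/b.py", "x = 1 " * 600],
]
within, too_long = split_by_token_limit(entries, whitespace_count, limit=512)
print(len(within), len(too_long))
```

Functions in `too_long` could then be skipped, truncated, or split into smaller pieces, depending on how much of the function body the summary needs to reflect.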


## Computing the similarity

The `text` variable holds the sentence that reports the issue, the bug, and so on.\
In the rest of the block, we compare the text with the summary of each function to find the file that may need to be modified.\
We use cosine similarity for the comparison.
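
For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms, so it depends only on the angle between them and not on their lengths. The sketch below illustrates this on toy vectors in pure Python. `cosine_similarity` and `best_match` are hypothetical helpers written for this example, and the three-dimensional vectors merely stand in for real sentence embeddings. The notebook itself relies on `util.pytorch_cos_sim`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_match(query, candidates):
    """Return the (label, score) pair whose vector is most similar to `query`."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in candidates]
    return max(scored, key=lambda pair: pair[1])

# Toy 3-dimensional vectors standing in for real sentence embeddings.
query = [1.0, 0.0, 1.0]
candidates = [
    ("samples/a.py", [1.0, 0.0, 0.9]),
    ("samples/b.py", [0.0, 1.0, 0.0]),
]
label, score = best_match(query, candidates)
print(label, round(score, 4))
```

A score close to 1 means the two vectors point in nearly the same direction, while a score of 0 means they are orthogonal, as is the case for the query and `samples/b.py` here.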


In [6]:
text = 'I have a problem with the colisions of my block'
file = ''
max_similitude = float('-inf')

# Encode the issue text once; it is the same for every comparison
text_embedding = bert.encode(text, convert_to_tensor=True)

for function_source in tqdm(functions_sources):
    # function_source is [file path, function source, generated summary]
    summary_embedding = bert.encode(function_source[2], convert_to_tensor=True)

    # Cosine similarity between the issue text and the function summary
    similitude = util.pytorch_cos_sim(text_embedding, summary_embedding)
    if max_similitude < similitude:
        max_similitude = similitude
        file = function_source[0]

print(f"The best-matching file is {file} with a cosine similarity of {max_similitude}")

100%|██████████| 136/136 [00:03<00:00, 43.15it/s]

The best-matching file is samples/tetris.py with a cosine similarity of tensor([[0.6263]])



