# Talking with a Repository - Understanding RAG

The purpose of this project is to be able to talk with your own repository. This notebook defines a pipeline for working with the repository's documentation. The main steps are:

- Prepare the environment
- Get the repository
- Run the ingest step
...


## Preparing the environment

These are the main libraries used in each step of the pipeline:

`Get the repository`
- GitPython 

`Ingest step`
- LangChain to load the [directory](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory)
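
Before running the cells below, it can help to check that the required packages are importable. The following is a minimal sketch and not part of the original notebook: `missing_modules` is a hypothetical helper, and the module list is an assumption based on the imports used in this notebook (GitPython is imported as `git`).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names assumed from the imports used in this notebook.
required = ["git", "transformers", "torch"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If `git` is reported as missing, installing the `GitPython` package (for example with `pip install GitPython`) should resolve it.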

## Get the git repository

In [10]:
# Import libraries

import os
from git import Repo

ModuleNotFoundError: No module named 'git'

Clone a GitHub repository locally.

In [2]:
import shutil

# Remove a previous clone, if any, so the clone below starts from a clean directory
repo_path = os.path.join(os.getcwd(), "repo")
if os.path.exists(repo_path):
    shutil.rmtree(repo_path)

repo_url = "https://github.com/pablotoledo/the-mergementor.git"
# clone_from returns the Repo object for the new working copy
repo = Repo.clone_from(repo_url, repo_path)
assert not repo.bare

# Move to the main branch
repo.git.checkout('main')

"Your branch is up to date with 'origin/main'."

Read every `.py`, `.md`, and `.txt` file in the repository.

In [3]:
file_contents = []
for subdir, dirs, files in os.walk(repo_path):
    for file in files:
        filepath = os.path.join(subdir, file)

        if filepath.endswith((".py", ".md", ".txt")):
            with open(filepath, 'r', encoding='utf-8') as f:
                # Store a (file path, file content) tuple
                file_contents.append((filepath, f.read()))
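
The loop above also walks the `.git` directory and raises an error on any matching file that is not valid UTF-8. A possible variant is sketched below, under the assumption that skipping `.git` and replacing undecodable bytes is acceptable for this pipeline. `read_text_files` and its parameters are hypothetical names, not part of the original notebook.

```python
import os


def read_text_files(root, suffixes=(".py", ".md", ".txt"), skip_dirs=(".git",)):
    """Collect (file path, file content) tuples for text files under root."""
    contents = []
    for subdir, dirs, files in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in sorted(files):
            if not name.endswith(suffixes):
                continue
            path = os.path.join(subdir, name)
            # errors="replace" substitutes undecodable bytes instead of raising.
            with open(path, "r", encoding="utf-8", errors="replace") as f:
                contents.append((path, f.read()))
    return contents
```

With this helper, the cell above could be reduced to `file_contents = read_text_files(repo_path)`.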

## Embeddings

Embeddings represent text as vectors: words (or longer spans, such as the file segments used in this notebook) with similar meaning map to nearby points in the vector space. This is useful for many NLP tasks, such as sentiment analysis, text classification, and machine translation.
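
To make "close to each other" concrete, here is a small sketch that compares vectors with cosine similarity, a common closeness measure for embeddings. The three-dimensional vectors are made up for illustration only; real embeddings from the model used in this notebook have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors, for illustration only.
toy_embeddings = {
    "clone": [0.9, 0.1, 0.0],
    "checkout": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["clone"], toy_embeddings["checkout"])
sim_unrelated = cosine_similarity(toy_embeddings["clone"], toy_embeddings["banana"])
print(f"clone vs checkout: {sim_related:.3f}")
print(f"clone vs banana:   {sim_unrelated:.3f}")
```

In this toy data the two Git-related words score much higher with each other than either does with the unrelated word, which is the property a retrieval step relies on.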

In [8]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = AutoModel.from_pretrained("microsoft/graphcodebert-base")

# Split a text into overlapping segments, measured in words
def divide_en_segmentos(texto, max_length=512, overlap=50):
    palabras = texto.split()
    segmentos = [' '.join(palabras[i:i + max_length]) for i in range(0, len(palabras), max_length - overlap)]
    return segmentos

# Encode each file segment by segment and average the segment embeddings
def encode_texts(texts_with_paths, max_length=512, overlap=50):
    results = []
    for filepath, text in texts_with_paths:
        segmentos = divide_en_segmentos(text, max_length, overlap)
        if not segmentos:
            # Skip empty files: torch.stack raises on an empty list
            continue
        embeddings_segmento = []
        for seg in segmentos:
            # Segments are counted in words but truncated at max_length tokens,
            # so the tail of a long segment may be dropped here
            encoded_input = tokenizer([seg], padding=True, truncation=True, return_tensors='pt', max_length=max_length)
            with torch.no_grad():
                model_output = model(**encoded_input)
            # pooler_output depends on the pooler weights, which the warning
            # printed after this cell reports as newly initialized
            embeddings_segmento.append(model_output.pooler_output)
        # Average the segment embeddings
        embeddings_promedio = torch.mean(torch.stack(embeddings_segmento), dim=0)
        # Store a (file path, averaged embedding) tuple
        results.append((filepath, embeddings_promedio))
    return results

# Encode the contents of the files read earlier
file_embeddings = encode_texts(file_contents)

Some weights of RobertaModel were not initialized from the model checkpoint at microsoft/graphcodebert-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
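
To see how the overlapping segmentation behaves, the sketch below applies the same sliding-window idea as `divide_en_segmentos` to a ten-word string with small window sizes. `split_into_segments` is a hypothetical stand-in written for this illustration, and the guard on `overlap` is an added assumption (the original function does not check it).

```python
def split_into_segments(text, max_length=512, overlap=50):
    """Split text into word-based windows that share `overlap` words with the next one."""
    if overlap >= max_length:
        # A non-positive step would make range() fail or never advance.
        raise ValueError("overlap must be smaller than max_length")
    words = text.split()
    step = max_length - overlap
    return [" ".join(words[i:i + max_length]) for i in range(0, len(words), step)]


sample = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
segments = split_into_segments(sample, max_length=4, overlap=1)
for segment in segments:
    print(segment)
```

With these settings the windows start every three words, so each one repeats the last word of the previous window. The final window contains only `w9`, which the previous window already covers. Because the notebook averages segment embeddings with equal weight, a short tail like this counts as much as a full segment.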
