# Talking with a Repository - Understanding RAG

The purpose of this Project is be able to talk with your own repository. This notebook will try to define a pipeline to work with the documentation. The main steps will be:

- Prepare the environment
- Get the repository
- Do the Ingest step 
...


## Preparing the environment

These are the main libraries used for each step in the Pipeline:

`Get the repository`
- GitPython 

`Ingest step`
- The idea is use all-MiniLM-L6-v2 from HuggingFace as a model trained to obtain embeddings from texts and files.
- To split files and text in chunks we will use LangChain

## Get the git repository step

In [2]:
# Import libraries

import os
from git import Repo

Clone locally a repository from GitHub.

In [3]:
import shutil

# remove /repo directory if it exists using python
if os.path.exists("repo"):
    shutil.rmtree("repo")

repo_path = os.getcwd() + "/repo"
repo_url = "https://github.com/pablotoledo/the-mergementor.git"
Repo.clone_from(repo_url, repo_path)

repo = Repo(repo_path)
assert not repo.bare

# Move to the main branch
repo.git.checkout('main')

"Your branch is up to date with 'origin/main'."

## Ingest step

### Getting the content we want from the repository

This notebook is using a Python repository as example, so, we filter only the files we need:

In [20]:
file_contents = []
for subdir, dirs, files in os.walk(repo_path):
    for file in files:
        filepath = subdir + os.sep + file

        if filepath.endswith((".py", ".md", ".txt")):
            with open(filepath, 'r', encoding='utf-8') as f:
                # Guarda una tupla de (ruta del archivo, contenido del archivo)
                file_contents.append((filepath, f.read()))

### Embeddings

Embeddings are a way to represent words in a vector space. The idea is to represent words in a way that similar words are close to each other in the vector space. This is useful for many NLP tasks, such as sentiment analysis, text classification, and machine translation.

In [29]:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Define la función de división en segmentos
def divide_en_segmentos(texto, max_length=512, overlap=50):
    palabras = texto.split()
    segmentos = [' '.join(palabras[i:min(i+max_length, len(palabras))]) for i in range(0, len(palabras), max_length - overlap)]
    return segmentos

# Modifica la función encode_texts para procesar cada segmento de texto
def encode_texts(texts_with_paths, max_length=512, overlap=50):
    results = []
    for filepath, text in texts_with_paths:
        segmentos = divide_en_segmentos(text, max_length, overlap)
        embeddings_segmento = []
        for seg in segmentos:
            encoded_input = tokenizer([seg], padding=True, truncation=True, return_tensors='pt', max_length=max_length)
            with torch.no_grad():
                model_output = model(**encoded_input)
            embeddings_segmento.append(model_output.pooler_output)
        # Promedia los embeddings de los segmentos
        embeddings_promedio = torch.mean(torch.stack(embeddings_segmento), dim=0)
        # Guarda una tupla de (ruta del archivo, embedding promediado)
        results.append((filepath, embeddings_promedio))
    return results

# Repite para el contenido de los archivos si es necesario
file_embeddings = encode_texts(file_contents)


Guardando

In [28]:
import numpy as np
import faiss

# Asume que tienes tus embeddings y rutas como listas separadas
embeddings = [tensor.numpy() for _, tensor in file_embeddings]  # Convierte a numpy si no lo están
embeddings = np.vstack(embeddings)  # Convierte la lista de numpy arrays en un único numpy array
file_paths = [path for path, _ in file_embeddings]

dimension = embeddings.shape[1]  # La dimensión de tus embeddings
index = faiss.IndexFlatL2(dimension)  # Crea un índice para L2 distance

# Añade los embeddings al índice
index.add(embeddings)

# Para realizar una consulta
query_vector = np.random.rand(dimension).astype('float32')  # Un vector de consulta aleatorio para el ejemplo
k = 10  # Número de vecinos más cercanos que queremos recuperar
distances, indices = index.search(np.array([query_vector]), k)

# Muestra los resultados
for i, idx in enumerate(indices[0]):
    print(f"Ruta: {file_paths[idx]}, Distancia: {distances[0][i]}")


Ruta: /home/pablo/testing-rag/rag_repository/repo/README.md, Distancia: 123.15203857421875
Ruta: /home/pablo/testing-rag/rag_repository/repo/review_onpremise.py, Distancia: 123.2261962890625
Ruta: /home/pablo/testing-rag/rag_repository/repo/_requirements.txt, Distancia: 123.24508666992188
Ruta: /home/pablo/testing-rag/rag_repository/repo/_requirements.txt, Distancia: 3.4028234663852886e+38
Ruta: /home/pablo/testing-rag/rag_repository/repo/_requirements.txt, Distancia: 3.4028234663852886e+38
Ruta: /home/pablo/testing-rag/rag_repository/repo/_requirements.txt, Distancia: 3.4028234663852886e+38
Ruta: /home/pablo/testing-rag/rag_repository/repo/_requirements.txt, Distancia: 3.4028234663852886e+38
Ruta: /home/pablo/testing-rag/rag_repository/repo/_requirements.txt, Distancia: 3.4028234663852886e+38
Ruta: /home/pablo/testing-rag/rag_repository/repo/_requirements.txt, Distancia: 3.4028234663852886e+38
Ruta: /home/pablo/testing-rag/rag_repository/repo/_requirements.txt, Distancia: 3.4028234663

In [30]:
import faiss
import torch
from transformers import AutoTokenizer, AutoModel

# Cargar el modelo LLM y el tokenizador
model_name = 'llama2'  # Reemplaza esto con el nombre de tu modelo LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Cargar el índice FAISS
# index = faiss.read_index('ruta_a_tu_indice.faiss')  # Reemplaza esto con la ruta a tu índice FAISS

# Consultar el índice FAISS
query = 'tu_consulta'  # Reemplaza esto con tu consulta
tokens = tokenizer(query, return_tensors='pt')
with torch.no_grad():
    embeddings = model(**tokens).last_hidden_state.mean(dim=1).numpy()

# Buscar en el índice FAISS
D, I = index.search(embeddings, k=10)  # Buscar los 10 vecinos más cercanos

# D contiene las distancias a los vecinos más cercanos
# I contiene los índices de los vecinos más cercanos

OSError: llama2 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`