# Talking with a Repository - Understanding RAG

The purpose of this project is to be able to talk with your own repository. This notebook defines a pipeline for working with the repository's documentation. The main steps are:

- Prepare the environment
- Get the repository
- Run the ingest step
...

This notebook is based on https://github.com/ZeusSama0001/RAG-chatbot

## Preparing the environment

These are the main libraries used in each step of the pipeline:

`Get the repository`
- GitPython 

`Ingest step`
- all-MiniLM-L6-v2 from HuggingFace, a model trained to produce embeddings from text.
- LangChain, to load and split the files and to store the embeddings in a FAISS index.
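
Before running the steps, it can help to confirm that each library is importable. The sketch below is not part of the original notebook: `check_modules` is a hypothetical helper, and the module list assumes the usual import names (`git` for GitPython, `faiss` for the FAISS bindings, `sentence_transformers` and `ctransformers` for the models).

```python
import importlib.util

# Import names assumed for the libraries this notebook relies on
REQUIRED_MODULES = [
    "git",                    # GitPython
    "langchain",
    "langchain_community",
    "faiss",                  # FAISS bindings
    "sentence_transformers",  # backs HuggingFaceEmbeddings
    "ctransformers",          # backs the CTransformers LLM wrapper
]


def check_modules(names):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report which modules are available without actually importing them
for name, available in check_modules(REQUIRED_MODULES).items():
    print(f"{name}: {'ok' if available else 'missing'}")
```

Any module reported as missing needs to be installed before the corresponding step will run.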

## Get the git repository step

In [2]:
# Import libraries

import os
from git import Repo

Clone a repository from GitHub into a local directory.

In [3]:
import shutil

# Remove the local repo directory if it already exists, so the clone starts clean
if os.path.exists("repo"):
    shutil.rmtree("repo")

repo_path = os.path.join(os.getcwd(), "repo")
repo_url = "https://github.com/pablotoledo/the-mergementor.git"
Repo.clone_from(repo_url, repo_path)

repo = Repo(repo_path)
assert not repo.bare

# Move to the main branch
repo.git.checkout('main')

"Your branch is up to date with 'origin/main'."

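Once the clone finishes, it may be worth checking which files the ingest step is likely to see. The sketch below walks the `repo` directory and skips anything under `.git`. It only approximates the `exclude='**/.git/**'` filter used later, and `list_ingestable_files` is a hypothetical helper, not something the loader provides.

```python
from pathlib import Path


def list_ingestable_files(root):
    """Return the files under root, skipping anything inside a .git directory."""
    root = Path(root)
    if not root.is_dir():
        return []
    files = []
    for path in sorted(root.rglob("*")):
        # Compare path components relative to root so only nested .git folders match
        if path.is_file() and ".git" not in path.relative_to(root).parts:
            files.append(path)
    return files


# Prints nothing if the repository has not been cloned yet
for path in list_ingestable_files("repo"):
    print(path)
```
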
## Ingest step

### Getting the content and creating the embeddings

Embeddings represent text as vectors. The idea is that texts with similar meaning end up close to each other in the vector space. This is useful for many NLP tasks, such as sentiment analysis, text classification, and machine translation. In this notebook, it lets the query step find the chunks of the repository that are most relevant to a question.
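
To make "close in the vector space" concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors are invented for illustration (all-MiniLM-L6-v2 produces 384-dimensional vectors), and the labels on them are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only
vec_git_clone = [0.9, 0.1, 0.0]     # imagine: "how to clone the repository"
vec_git_checkout = [0.8, 0.2, 0.1]  # imagine: "how to switch branches"
vec_license = [0.0, 0.1, 0.9]       # imagine: "license terms"

similar = cosine_similarity(vec_git_clone, vec_git_checkout)
dissimilar = cosine_similarity(vec_git_clone, vec_license)
print(f"related texts:   {similar:.3f}")
print(f"unrelated texts: {dissimilar:.3f}")
```

The related pair scores much closer to 1 than the unrelated pair, which is the property the retrieval step relies on.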

In [10]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter 

DATA_PATH = 'repo/'
DB_FAISS_PATH = 'vectorstore/db_faiss'

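# Build the FAISS vector database from the cloned repository
def create_vector_db():
    # Load the files under DATA_PATH as plain text, excluding the .git directory
    loader = DirectoryLoader(DATA_PATH, silent_errors=True, loader_cls=TextLoader, exclude='**/.git/**')
    documents = loader.load()

    # Split the documents into chunks of up to 500 characters with a 50-character overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    texts = text_splitter.split_documents(documents)

    # Embed each chunk on the CPU with all-MiniLM-L6-v2
    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2', model_kwargs={'device': 'cpu'})

    # Index the embeddings in FAISS and persist the index to disk
    db = FAISS.from_documents(texts, embeddings)
    db.save_local(DB_FAISS_PATH)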

if __name__ == "__main__":
    create_vector_db()
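
The `chunk_size=500` and `chunk_overlap=50` settings mean that each chunk holds up to 500 characters and shares 50 characters with the next one, so text near a boundary appears in both neighboring chunks. The sketch below shows that idea with a fixed sliding window. It is a simplification: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, so its real chunk boundaries will differ. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# 1200 characters of filler text, invented for illustration
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

With the default settings, consecutive windows start 450 characters apart, which is why the last 50 characters of one chunk match the first 50 of the next.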

## Query step
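
The query step has two parts: retrieve the `k` chunks closest to the question, then place ("stuff") them into the `{context}` slot of a prompt before calling the LLM. The sketch below imitates that flow in plain Python under stated assumptions: the chunk texts and vectors are invented, `top_k` and `stuff_prompt` are hypothetical helpers, the template is a shortened stand-in, and cosine similarity stands in for whatever distance the FAISS index actually uses.

```python
import math

# Shortened stand-in for the notebook's prompt template
PROMPT_TEMPLATE = "Context: {context}\nQuestion: {question}\nHelpful answer:"


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, indexed_chunks, k=2):
    """Return the texts of the k chunks whose vectors are most similar to query_vec."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def stuff_prompt(question, chunks):
    """Join the retrieved chunks and insert them into the prompt template."""
    return PROMPT_TEMPLATE.format(context="\n\n".join(chunks), question=question)


# Toy index: (chunk text, embedding) pairs invented for illustration
index = [
    ("This repository contains a code review model.", [0.9, 0.1, 0.0]),
    ("Permission is hereby granted, free of charge.", [0.1, 0.9, 0.0]),
    ("Run the test suite before opening a pull request.", [0.0, 0.1, 0.9]),
]
question = "What is the repository about?"
query_vec = [0.8, 0.3, 0.0]  # pretend embedding of the question

selected = top_k(query_vec, index, k=2)
prompt = stuff_prompt(question, selected)
print(prompt)
```

Only the two best-matching chunks reach the prompt, which is the role `search_kwargs={'k': 2}` plays in the chain built in the cell that follows.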

In [16]:
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.prompts import PromptTemplate
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import CTransformers
from langchain.chains import RetrievalQA

DB_FAISS_PATH = 'vectorstore/db_faiss'

custom_prompt_template = """Use the following pieces of information to answer the user's question. You are an expert developer who answers questions about a repository.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}
Question: {question}

Only return the helpful answer below and nothing else.
Helpful answer:
"""

def set_custom_prompt():
    """
    Prompt template for QA retrieval for each vectorstore
    """
    prompt = PromptTemplate(template=custom_prompt_template,
                            input_variables=['context', 'question'])
    return prompt

# Retrieval QA chain
def retrieval_qa_chain(llm, prompt, db):
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=db.as_retriever(search_kwargs={'k': 2}),
        return_source_documents=True,
        chain_type_kwargs={'prompt': prompt},
    )
    return qa_chain

# Load the model
def load_llm():
    llm = CTransformers(
        model="TheBloke/Llama-2-7B-Chat-GGML",
        model_type="llama",
        max_new_tokens=512,
        temperature=0.5,
    )
    return llm

def qa_bot():
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", model_kwargs={'device': 'cpu'})
    db = FAISS.load_local(DB_FAISS_PATH, embeddings, allow_dangerous_deserialization=True)
    llm = load_llm()
    qa_prompt = set_custom_prompt()
    qa = retrieval_qa_chain(llm, qa_prompt, db)

    return qa

def final_result(query):
    qa_result = qa_bot()
    # invoke() is the current LangChain entry point; calling the chain directly is deprecated
    response = qa_result.invoke({'query': query})
    return response

final_result("What is the repository about? and which llm pretrained model is being used?")

Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 3228.87it/s]
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 4815.50it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


{'query': 'What is the repository about? and which llm pretrained model is being used?',
 'result': 'The repository contains a pre-trained language model called CodeReviewer from Microsoft, licensed under MIT License. The model is trained on a large corpus of code and can be fine-tuned for various NLP tasks like code completion, code search, and code generation.',
 'source_documents': [Document(page_content='---\n\nThis repository contains the CodeReviewer model from Microsoft, which is licensed under the MIT License. \n\nMicrosoft CodeReviewer model\nhttps://huggingface.co/microsoft/codereviewer\n\nCopyright (c) Microsoft', metadata={'source': 'repo/LICENSE'}),
  Document(page_content='MIT License\n\nCopyright (c) 2023 pablotoledo\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the "Software"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify
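
The response is a dictionary with `query`, `result`, and `source_documents` keys, so the answer and the files it was drawn from can be pulled out directly. The sketch below shows one way to do that. `summarize_response` is a hypothetical helper, and the response is a hand-written stand-in (using `SimpleNamespace` in place of LangChain `Document` objects) rather than real model output.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Return the answer text and the de-duplicated list of source paths."""
    answer = response["result"].strip()
    sources = []
    for doc in response.get("source_documents", []):
        source = doc.metadata.get("source", "unknown")
        if source not in sources:
            sources.append(source)
    return answer, sources


# Hand-written stand-in with the same shape as the chain's response
fake_response = {
    "query": "What is the repository about?",
    "result": " The repository contains the CodeReviewer model. ",
    "source_documents": [
        SimpleNamespace(page_content="...", metadata={"source": "repo/LICENSE"}),
        SimpleNamespace(page_content="...", metadata={"source": "repo/LICENSE"}),
    ],
}

answer, sources = summarize_response(fake_response)
print(answer)
print(sources)
```

Listing the sources next to the answer makes it easier to see when the retrieved chunks come from only one file, as in the output above.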

## main.py

Now that this notebook has shown how RAG works, `main.py` is ready to be used with Chainlit:

```bash
chainlit run main.py
```