<a href="https://colab.research.google.com/github/jja4/local_RAG/blob/main/Codebase_Search_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain rank_bm25 pypdf unstructured chromadb
!pip install "unstructured[pdf]"
!apt-get install -y poppler-utils
!apt-get install -y tesseract-ocr
!apt-get install -y libtesseract-dev
!pip install pytesseract





### Load the required Packages

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader, TextLoader
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain.llms import HuggingFaceHub, HuggingFaceEndpoint


from langchain.retrievers import BM25Retriever, EnsembleRetriever

import os

### Load the source file

In [None]:
# Try various file types and sources
# file_path = "./sample_data/ion_temp_cnn.py"
# data_file = TextLoader(file_path)

# file_path = "./sample_data/Abschlussarbeit_0397062.pdf"
# data_file = UnstructuredPDFLoader(file_path)

file_path = "/content/sample_data/codebase_search_rag.py"
data_file = TextLoader(file_path)

docs = data_file.load()

In [None]:
print(docs[0].page_content)

# -*- coding: utf-8 -*-
"""Codebase_Search_RAG.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1h3Op-6iQmhVfYZ-58Q1EEiI1mBIjNZIg
"""

!pip install langchain rank_bm25 pypdf unstructured chromadb
!pip install unstructured['pdf'] unstructured
!apt-get install poppler-utils
!apt-get install -y tesseract-ocr
!apt-get install -y libtesseract-dev
!pip install pytesseract

"""### Load the required Packages"""

from langchain.document_loaders import UnstructuredPDFLoader, TextLoader
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

from langchain.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain.llms import HuggingFaceHub, HuggingFaceEndpoint


from langchain.retrievers import

### Split Documents and Chunking

In [None]:
# create chunks
splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=880, chunk_overlap=200
    )
chunks = splitter.split_documents(docs)
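
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification: `RecursiveCharacterTextSplitter` tries to split on language-aware separators first, so real chunk boundaries will differ. The function name `sliding_window_chunks` and the toy values are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

Each window repeats the last 4 characters of the previous one, which is what keeps a statement that straddles a boundary retrievable from at least one chunk.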

In [None]:
chunks[0].page_content

'# -*- coding: utf-8 -*-\n"""Codebase_Search_RAG.ipynb\n\nAutomatically generated by Colab.\n\nOriginal file is located at\n    https://colab.research.google.com/drive/1h3Op-6iQmhVfYZ-58Q1EEiI1mBIjNZIg\n"""\n\n!pip install langchain rank_bm25 pypdf unstructured chromadb\n!pip install unstructured[\'pdf\'] unstructured\n!apt-get install poppler-utils\n!apt-get install -y tesseract-ocr\n!apt-get install -y libtesseract-dev\n!pip install pytesseract\n\n"""### Load the required Packages"""\n\nfrom langchain.document_loaders import UnstructuredPDFLoader, TextLoader\nfrom langchain.text_splitter import Language, RecursiveCharacterTextSplitter\nfrom langchain.vectorstores import Chroma\n\nfrom langchain_core.prompts import ChatPromptTemplate\nfrom langchain_core.output_parsers import StrOutputParser\nfrom langchain_core.runnables import RunnablePassthrough'

In [None]:
# Get Embedding Model from HF via API

from google.colab import userdata
HF_TOKEN = userdata.get('HUGGINGFACEHUB_API_TOKEN')

embeddings = HuggingFaceInferenceAPIEmbeddings(
    api_key=HF_TOKEN, model_name="sentence-transformers/all-MiniLM-L6-v2"
)


### VectorStore

In [None]:
type(chunks)

list

In [None]:
# Vector store with the selected embedding model
vectorstore = Chroma.from_documents(chunks, embeddings)

In [None]:
vectorstore_retreiver = vectorstore.as_retriever(search_kwargs={"k": 5})
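
The idea behind `search_kwargs={"k": 5}` is a nearest-neighbour lookup over embedding vectors. The sketch below shows that idea with cosine similarity on toy 2-dimensional vectors. It is an assumption-laden illustration: the distance metric Chroma actually uses depends on how the collection is configured, and the helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy "embeddings"; real ones from all-MiniLM-L6-v2 have far more dimensions.
toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [0.9, 0.1]
nearest = top_k(toy_query, toy_doc_vecs, k=2)
print(nearest)
```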

In [None]:
keyword_retriever = BM25Retriever.from_documents(chunks)
keyword_retriever.k = 5
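
For readers unfamiliar with BM25, the following is a minimal pure-Python sketch of Okapi BM25 scoring. It assumes whitespace tokenization and the common non-negative IDF variant with `k1=1.5` and `b=0.75`; the `rank_bm25` package behind `BM25Retriever` may use a different IDF formula and defaults, so treat the numbers as illustrative. The function name `bm25_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in the corpus against the query tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: long documents are penalized via b.
            denom = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


toy_corpus = [
    "chroma stores embeddings for similarity search".split(),
    "bm25 ranks documents by keyword overlap".split(),
    "the ensemble retriever merges both rankings".split(),
]
toy_scores = bm25_scores("bm25 keyword".split(), toy_corpus)
best_doc = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(best_doc, toy_scores)
```

Unlike the embedding-based retriever, this score is non-zero only when a query term literally appears in a chunk, which is why exact identifiers in code tend to be matched well by keyword search.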

### Ensemble Retriever

In [None]:
ensemble_retriever = EnsembleRetriever(retrievers=[vectorstore_retreiver,
                                                   keyword_retriever],
                                       weights=[0.5, 0.5])
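
To my understanding, `EnsembleRetriever` merges the two result lists with a weighted form of Reciprocal Rank Fusion. The sketch below shows that technique on document IDs. The smoothing constant `c=60` is an assumption (it is the value commonly used for RRF), and `weighted_rrf` and the chunk IDs are hypothetical names.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings: each list adds weight / (rank + c) to a document's score."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_ranking = ["chunk_a", "chunk_b", "chunk_c"]    # e.g. from the vector store
keyword_ranking = ["chunk_b", "chunk_d", "chunk_a"]  # e.g. from BM25
fused = weighted_rrf([dense_ranking, keyword_ranking], weights=[0.5, 0.5])
print(fused)
```

Because only ranks enter the formula, the raw similarity and BM25 scores never need to be on the same scale; a chunk that both retrievers place near the top rises above one that only a single retriever found.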

In [None]:
llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    temperature=0.3,
    max_new_tokens=1024,
    huggingfacehub_api_token=HF_TOKEN,
)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Prompt Template:

In [None]:
template = """
<|system|>
You are a helpful AI Assistant that follows instructions extremely well.
Use the following context to answer the user's question.

Think step by step before answering the question. You will get a $100 tip if you provide a correct answer.

CONTEXT: {context}
</s>
<|user|>
{query}
</s>
<|assistant|>
"""

In [None]:
prompt = ChatPromptTemplate.from_template(template)
output_parser = StrOutputParser()
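
A retriever returns a list of document objects, and as far as I can tell, placing that list straight into `{context}` inserts its string representation, metadata included. A common pattern is to join only the `page_content` fields first. The sketch below uses a stand-in `FakeDocument` dataclass so that it runs without LangChain; `FakeDocument` and `format_docs` are hypothetical names, not part of this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document: text plus optional metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


toy_docs = [FakeDocument("def load(): ..."), FakeDocument("def split(): ...")]
context_text = format_docs(toy_docs)
print(context_text)
```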

In [None]:
chain = (
    {"context": ensemble_retriever, "query": RunnablePassthrough()}
    | prompt
    | llm
    | output_parser
)
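
The `|` syntax in the chain relies on operator overloading. The toy classes below sketch how such a pipeline can be built in plain Python: each step wraps a function, `|` composes two steps, and a dict of steps fans one input out to several keys. This is a simplified model under my own assumptions, not LangChain's implementation; `Step`, `parallel`, and the `fake_*` names are hypothetical.

```python
class Step:
    """Wrap a function so that steps can be composed with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b yields a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


def parallel(mapping):
    """Run every step in the mapping on the same input and collect a dict."""
    return Step(lambda value: {key: step.invoke(value) for key, step in mapping.items()})


fake_retriever = Step(lambda query: f"[chunks matching '{query}']")
passthrough = Step(lambda query: query)
fake_prompt = Step(lambda d: f"CONTEXT: {d['context']}\nQUESTION: {d['query']}")
fake_llm = Step(lambda prompt_text: "answer based on -> " + prompt_text)

toy_chain = parallel({"context": fake_retriever, "query": passthrough}) | fake_prompt | fake_llm
result = toy_chain.invoke("what is bm25?")
print(result)
```

The same user question therefore reaches both the retriever and the prompt: once as the search query that produces the context, and once unchanged as the question text.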

In [None]:
print(chain.invoke("How do the Vector Store vectorstore_retreiver and keyword_retreiver work?"))

</s>
```
The Vector Store `vectorstore_retreiver` and `keyword_retreiver` are two different retrievers used in the codebase for searching and retrieving documents.

`vectorstore_retreiver` is an instance of the `Chroma` class, which is a vector store that uses the `sentence-transformers/all-MiniLM-L6-v2` model to embed the documents. The `Chroma` class is a type of vector store that allows for efficient querying and retrieval of documents based on their embeddings.

When you query the `vectorstore_retreiver` with a query, it uses the `sentence-transformers/all-MiniLM-L6-v2` model to embed the query and then searches for the most similar documents in the vector store. The results are then returned as a list of documents.

`keyword_retreiver`, on the other hand, is an instance of the `BM25Retriever` class, which is a type of retriever that uses the BM25 algorithm to rank documents based on their relevance to the query. The `BM25Retriever` class is a type of retriever that allows for effi

In [None]:
# print(chain.invoke("Why is the core temperature in the plasma more difficult to predict?"))

In [None]:
print(chain.invoke("How does the chain method work in this script?"))

```
The chain method in this script is used to create a pipeline of tasks that can be executed in a specific order. The chain is composed of several components:

1. `ensemble_retriever`: This is the first component in the chain, which is an `EnsembleRetriever` object that combines the results of two retrievers: `vectorstore_retreiver` and `keyword_retriever`.
2. `prompt`: This is a `ChatPromptTemplate` object that defines the prompt for the language model.
3. `llm`: This is a `HuggingFaceEndpoint` object that is used to generate text based on the prompt.
4. `output_parser`: This is a `StrOutputParser` object that is used to parse the output of the language model into a string.

The chain is created by combining these components in a specific order, using the `|` operator. The `|` operator is used to create a pipeline of tasks, where each task is executed in sequence.

Here's a breakdown of how the chain works:

1. The `ensemble_retriever` component is executed first, which retrieves a 