# Setting Up a Basic RAG System

## Introduction
In this notebook, we'll walk through setting up a basic Retrieval-Augmented Generation (RAG) system. RAG systems improve the accuracy of large language models (LLMs) by retrieving relevant, up-to-date information from documents and supplying it to the model as context. Let's dive into the steps and techniques involved.

## 1. Installation
First, we need to install the necessary libraries. These include `datasets`, `langchain_core`, `langchain_huggingface`, `langchain_community`, `chromadb`, and `tiktoken`.


In [None]:
! pip install datasets
! pip install langchain_core
! pip install langchain_huggingface
# ! pip install sentence-transformers
# ! pip install torch
! pip install tiktoken
! pip install chromadb
! pip install langchain_community

## 2. Import Libraries
Next, we'll import the libraries we'll be using throughout this notebook.

In [None]:
import os
import csv
import json
import zipfile
import bs4
import uuid

import requests
import pandas as pd
import torch
from tqdm import tqdm
from datasets import Dataset
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline, HuggingFaceEmbeddings

from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever

In [None]:
from huggingface_hub import login

os.environ["HUGGINGFACEHUB_API_TOKEN"] = 'your api key here'

login(token = os.environ["HUGGINGFACEHUB_API_TOKEN"])

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## 3. Data Preparation

If you worked through the previous notebook, this step will look familiar. It downloads a dataset and then selects every file whose abstract contains one of our keywords.

In [None]:
# URL of the dataset
url = 'https://prod-dcd-datasets-cache-zipfiles.s3.eu-west-1.amazonaws.com/zm33cdndxs-2.zip'
keywords = ['fitness', 'workout',  'sports', 'strength training',]

def download_file(url, local_filename):
    if not os.path.exists(local_filename):
        with requests.get(url, stream=True) as response:
            response.raise_for_status()
            with open(local_filename, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
        print(f'Dataset downloaded and saved as {local_filename}')

def extract_zip(zip_file_path, extract_to, max_files=None):
    if not os.path.exists(extract_to):
        os.makedirs(extract_to, exist_ok=True)
        with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
            total_files = len(zip_ref.namelist())
            for i, file in enumerate(zip_ref.namelist()):
                zip_ref.extract(file, extract_to)
                if max_files and i + 1 >= max_files:
                    break
                print(f'Extracted {i + 1}/{total_files} files', end='\r')
        print(f'\nAll files extracted to {extract_to}')

# Local filenames and paths
local_filename = 'all_data.zip'
all_data_name = 'all_data_folder'
all_text_name = 'all_text'
exercise_files_name = 'exercise_file_names.csv'

#Download and extract zips
download_file(url, local_filename)
extract_zip(local_filename, all_data_name)
all_data_zip = os.path.join(all_data_name, 'json-articals.zip')
extract_zip(all_data_zip, all_text_name)

Dataset downloaded and saved as all_data.zip
Extracted 7/7 files
All files extracted to all_data_folder
Extracted 40002/40002 files
All files extracted to all_text


In [None]:
def is_text_is_about_topic(keywords, text):
    is_about_exercise_science = any(keyword in text.lower() for keyword in keywords)
    return is_about_exercise_science

def sort_through_files_and_create_csv(folder_path, output_csv):
    if not os.path.exists(output_csv):
      matching_files = []
      #Open files
      for filename in tqdm(os.listdir(folder_path)):
          file_path = os.path.join(folder_path, filename)
          if os.path.isfile(file_path) and filename.endswith('.json'):
              with open(file_path, 'r', encoding='utf-8') as file:
                  data = json.load(file)

                  #Find if text files mention exercise science
                  if 'abstract' in data and is_text_is_about_topic(keywords, data['abstract']):
                      matching_files.append(filename)

      # Write the matching file names to a CSV
      with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:
          writer = csv.writer(csvfile)
          writer.writerow(['Filename'])
          for file_name in matching_files:
              writer.writerow([file_name])

      print(f'CSV file created with {len(matching_files)} matching files.')

sort_through_files_and_create_csv(os.path.join(all_text_name,'json'), exercise_files_name)

100%|██████████| 40001/40001 [02:08<00:00, 312.28it/s]

CSV file created with 222 matching files.





In [None]:
all_text_name = 'all_text'
exercise_files_name = 'exercise_file_names.csv'

def combine_body_text(data):
    body_text = data['body_text']
    sorted_body_text = sorted(body_text, key=lambda x: x['startOffset'])
    combined_text = ' '.join(item['sentence'] for item in sorted_body_text)
    return combined_text

def load_json_files(json_folder:str, file_names:list)->list:
  documents = []
  for file_name in tqdm(file_names):
    file_path = os.path.join(json_folder, file_name)
    if os.path.isfile(file_path):
      with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
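        # Build one Document per file (not one per sentence), and only
        # if the file actually contains sentence entries
        if any('sentence' in item for item in data.get('body_text', [])):
          combined_text = combine_body_text(data)
          document = Document(page_content=combined_text)
          documents.append(document)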
  return documents

json_folder_path = os.path.join(all_text_name, 'json')
csv_data = pd.read_csv(exercise_files_name)
file_names = csv_data['Filename'].tolist()

documents = load_json_files(json_folder_path, file_names)

100%|██████████| 222/222 [00:07<00:00, 27.76it/s]
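
To make the loading step concrete, the sketch below applies the same sort-and-join logic to a small record. The record itself is hypothetical: its field names (`body_text`, `sentence`, `startOffset`) are taken from the code above, but the values are made up for illustration.

```python
def combine_body_text(data):
    # Order sentence entries by their position in the article, then join them
    sorted_items = sorted(data["body_text"], key=lambda item: item["startOffset"])
    return " ".join(item["sentence"] for item in sorted_items)

# Hypothetical record: entries are deliberately stored out of order
sample_record = {
    "abstract": "A made-up abstract about strength training.",
    "body_text": [
        {"sentence": "Second sentence.", "startOffset": 40},
        {"sentence": "First sentence.", "startOffset": 0},
    ],
}

combined = combine_body_text(sample_record)
print(combined)
```

Sorting by `startOffset` matters because the sentence entries are not guaranteed to be stored in reading order.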


## 4. Document Embedding
Now we will split each document into smaller chunks of 200 tokens, with a 25-token overlap between consecutive chunks. We then embed the chunks and store them in a vector store. The retriever is configured to return only the single most similar chunk for each query.
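
The splitter used in this section is built with `from_tiktoken_encoder`, so it measures chunk size in tokens. As a rough illustration of what "chunk size" and "overlap" mean, here is a minimal sliding-window sketch. It is a simplification: `chunk_tokens` is a hypothetical helper, whitespace-separated words stand in for real tokens, and the actual `RecursiveCharacterTextSplitter` also tries to split on natural separators such as paragraphs and sentences.

```python
def chunk_tokens(tokens, chunk_size=200, chunk_overlap=25):
    """Split a token list into windows of at most chunk_size tokens,
    where each window repeats the last chunk_overlap tokens of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Stand-in "tokens": 500 placeholder words instead of real tokenizer output
words = [f"w{i}" for i in range(500)]
chunks = chunk_tokens(words, chunk_size=200, chunk_overlap=25)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.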

In [None]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=200, chunk_overlap=25)

In [None]:
splits = text_splitter.split_documents(documents[0:10])
# We only embed 10 documents because Google Colab cannot handle much more than that

In [None]:
embedding_model = HuggingFaceEmbeddings(model_name = "sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(documents=splits,
                                    embedding=embedding_model)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
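
Conceptually, a retriever with `k=1` embeds the query, scores it against every stored chunk vector, and returns the best match. The sketch below shows that idea with plain Python lists. The vectors are toy values, `top_k` is a hypothetical helper, and cosine similarity is used for illustration only; the metric Chroma actually applies depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=1):
    # Score every chunk, then keep the indices of the k highest scores
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy 3-dimensional "embeddings" (real ones have hundreds of dimensions)
toy_chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_query_vector = [0.9, 0.1, 0.0]

best = top_k(toy_query_vector, toy_chunk_vectors, k=1)
print(best)
```

Returning a single chunk keeps the prompt short, but it also means the answer depends entirely on that one chunk being relevant.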

## Defining the Prompt and Language Model
In this cell, we define a prompt template to guide the model's responses and set up the language model using HuggingFace's GPT-2 for text generation, specifying the maximum number of tokens for the output.
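
As a rough picture of what the prompt template does, the sketch below fills the same template text with plain `str.format`. This is only an approximation: `fill_prompt` is a hypothetical helper, and `ChatPromptTemplate` additionally wraps the text as a chat message (which is why the generated output in this notebook begins with `Human:`).

```python
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

def fill_prompt(context, question):
    # Substitute the two placeholders in the template
    return template.format(context=context, question=question)

example = fill_prompt(
    context="Chunk text returned by the retriever would go here.",
    question="What does the chunk say?",
)
print(example)
```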


In [None]:
# Prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

llm = HuggingFacePipeline.from_model_id(
    model_id="openai-community/gpt2",
    task="text-generation",
    pipeline_kwargs={'max_new_tokens':500},
    )

chat_model = ChatHuggingFace(llm=llm)


## Define the RAG Chain
Next, we define the RAG chain that will handle document retrieval and generation.
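
The chain is a pipeline: retrieve context for the question, fill the prompt, pass the prompt to the model, and return the text. The sketch below expresses the same flow with ordinary functions. Everything here is a stand-in: `make_rag_pipeline` and the `fake_*` functions are hypothetical and do not call LangChain or a real model.

```python
def make_rag_pipeline(retrieve, build_prompt, generate):
    """Compose retrieval, prompt construction, and generation into one callable."""
    def run(question):
        context = retrieve(question)
        prompt_text = build_prompt(context=context, question=question)
        return generate(prompt_text)
    return run

# Stand-ins for the retriever, the prompt template, and the language model
def fake_retrieve(question):
    return "Stub context for: " + question

def fake_build_prompt(context, question):
    return (
        "Answer the question based only on the following context:\n"
        f"{context}\n\nQuestion: {question}\n"
    )

def fake_generate(prompt_text):
    # This stub echoes the prompt and appends a placeholder continuation
    return prompt_text + "[model continuation would appear here]"

pipeline = make_rag_pipeline(fake_retrieve, fake_build_prompt, fake_generate)
result = pipeline("What is RAG?")
print(result)
```

In the LangChain version, the `|` operator plays the role of this composition, and `RunnablePassthrough()` forwards the original question unchanged into the prompt.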

In [None]:
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
question = 'What are the main challenges and strategies in managing the spread of invasive non-native species?'

In [None]:
rag_chain.invoke(question)

"Human: Answer the question based only on the following context:\n[Document(page_content='bacteria often have very high effective population sizes favouring efficient selection, be it natural or sexual, and making the evolution of even subtle adaptations more likely. For instance, Fisherian sexual selection in particular is known to be a weak selective process [59]. Second, great variation in the genetics and physiology of transformation is found in the two dozen or so species used as model systems for transformation [29]. Considering that a trillion species of bacteria and archaea remain to be discovered [60], there is enormous potential to find new types of genome organization, physiology, ecological life-styles, and even methods of active gene transfer (exemplified by the relatively recent finding of gene transfer mediated by nanotubes [61]) that could be more permissive to forms of sexual selection than those currently known. The dynamics of lateral gene transfer processes mediated

## Query Decomposition Overview
Query decomposition breaks down a complex question into simpler sub-questions. This technique ensures more accurate and relevant information retrieval by addressing each sub-question individually before synthesizing the final answer.


1. Define a prompt template to generate multiple sub-questions related to an input question, aiming to break it down into manageable sub-problems.
2. Generate sub-questions from the input question by chaining the prompt template with the model and a string output parser.
3. Define another prompt template for the RAG (Retrieval-Augmented Generation) system to use retrieved context to answer the sub-questions concisely.
4. Write a function to handle the entire RAG process. It decomposes the main question into sub-questions, retrieves relevant documents for each sub-question, and then uses the RAG chain to generate answers.
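
The steps above can be sketched with plain functions before wiring in the real model. This is a minimal illustration under stated assumptions: `answer_with_decomposition`, `parse_sub_questions`, and the `fake_*` functions are hypothetical stand-ins, and the "model output" is a hard-coded string rather than a real generation.

```python
def parse_sub_questions(raw_output):
    # One sub-question per line; drop blank lines and surrounding whitespace
    return [line.strip() for line in raw_output.split("\n") if line.strip()]

def answer_with_decomposition(question, decompose, retrieve, answer):
    """Decompose the question, then retrieve and answer for each sub-question."""
    sub_questions = parse_sub_questions(decompose(question))
    answers = [answer(sq, retrieve(sq)) for sq in sub_questions]
    return sub_questions, answers

# Stand-ins for the decomposition model, the retriever, and the RAG chain
def fake_decompose(question):
    return "1. What is X?\n\n2. Why does X matter?\n3. How is X managed?\n"

def fake_retrieve(sub_question):
    return f"context for '{sub_question}'"

def fake_answer(sub_question, context):
    return f"answer using {context}"

subs, sub_answers = answer_with_decomposition(
    "Tell me about X", fake_decompose, fake_retrieve, fake_answer
)
for sq, ans in zip(subs, sub_answers):
    print(sq, "->", ans)
```

Filtering blank lines matters in practice, because a model's output often contains empty lines that would otherwise be sent to the retriever as empty queries.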




In [None]:
template = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n
Generate multiple search queries related to: {question} \n
Output (3 queries):"""
prompt_decomposition = ChatPromptTemplate.from_template(template)

In [None]:
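# Split the model output into one sub-question per line, skipping blank lines
generate_queries_decomposition = (
    prompt_decomposition
    | llm
    | StrOutputParser()
    | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])
)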

In [None]:
prompt_rag = ChatPromptTemplate.from_template("""You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
""")

In [None]:

def retrieve_and_rag(question,prompt_rag,sub_question_generator_chain):
    """RAG on each sub-question"""

    # Decompose the main question into sub-questions
    sub_questions = sub_question_generator_chain.invoke({"question":question})

    # Initialize a list to hold RAG chain results
    rag_results = []

    for sub_question in sub_questions:

        # Retrieve documents for each sub-question
        retrieved_docs = retriever.invoke(sub_question)

        # Use retrieved documents and sub-question in RAG chain
        answer = (prompt_rag | llm | StrOutputParser()).invoke({"context": retrieved_docs,
                                                                "question": sub_question})
        rag_results.append(answer)

    return rag_results,sub_questions

# Run decomposition, retrieval, and answering for the main question
answers, questions = retrieve_and_rag(question, prompt_rag, generate_queries_decomposition)



In [None]:
def format_qa_pairs(questions, answers):
    """Format Q and A pairs"""

    formatted_string = ""
    for i, (question, answer) in enumerate(zip(questions, answers), start=1):
        formatted_string += f"Question {i}: {question}\nAnswer {i}: {answer}\n\n"
    return formatted_string.strip()

context = format_qa_pairs(questions, answers)

# Prompt
template = """Here is a set of Q+A pairs:

{context}

Use these to synthesize an answer to the question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"context":context,"question":question})

Token indices sequence length is longer than the specified maximum sequence length for this model (19727 > 1024). Running this sequence through the model will result in indexing errors


IndexError: index out of range in self
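
The warning above reports a prompt of 19727 tokens against a model limit of 1024, which is why the call fails. One possible mitigation is to trim the context to a budget before building the prompt. The sketch below is an approximation only: `trim_to_budget` is a hypothetical helper, whitespace-separated words are used as a stand-in for tokens, and the budget of 400 is an arbitrary example. A real fix would count tokens with the model's own tokenizer and leave room for the question and the generated tokens.

```python
def trim_to_budget(text, max_tokens):
    """Keep at most max_tokens whitespace-separated words from the start of text."""
    words = text.split()
    if len(words) <= max_tokens:
        return text
    return " ".join(words[:max_tokens])

# Stand-in for an oversized context string
long_context = " ".join(f"word{i}" for i in range(5000))
trimmed = trim_to_budget(long_context, max_tokens=400)
print(len(trimmed.split()))
```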

## Multi-representation Indexing Overview
Multi-representation indexing improves the efficiency and relevance of document retrieval by creating and indexing summaries of documents. This allows for quicker and more accurate document retrieval based on queries.

Here is what the code below does:
1. We start by defining a prompt template that instructs the model to summarize documents and set up a chain that processes the documents through this template and the language model, parsing the output into summaries.
2. We initialize a Chroma vector store to index the summaries and an in-memory byte store to store the original documents, allowing us to link summaries back to their full documents.
3. We initialize the retriever, using both the vector store and the byte store, assigning unique IDs to each document, and linking summaries to their corresponding documents.
4. We perform a similarity search on the vector store using a query to find the most relevant summary, then retrieve and display the full documents corresponding to these summaries.
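
The core idea in the steps above is that search runs over summaries while the returned result is the full document, with a shared ID linking the two. The sketch below shows that linkage with dictionaries. It is a toy under stated assumptions: the helper names and documents are hypothetical, the "summary" is just the first sentence, and word overlap replaces embedding similarity.

```python
import uuid

def build_multi_representation_index(docs, summarize):
    """Store full documents by ID and index their summaries with the same ID."""
    doc_store = {}
    summary_index = []
    for doc in docs:
        doc_id = str(uuid.uuid4())
        doc_store[doc_id] = doc
        summary_index.append({"summary": summarize(doc), "doc_id": doc_id})
    return summary_index, doc_store

def word_overlap(a, b):
    # Crude relevance score: number of shared lowercase words
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve_full_doc(query, summary_index, doc_store):
    # Search over summaries, then follow the ID back to the full document
    best = max(summary_index, key=lambda entry: word_overlap(query, entry["summary"]))
    return doc_store[best["doc_id"]]

def first_sentence(doc):
    return doc.split(".")[0]

# Made-up documents for illustration
docs = [
    "Strength training builds muscle. Programs usually progress load over several weeks.",
    "Endurance running improves aerobic capacity. Long runs are typically done at an easy pace.",
]

index, store = build_multi_representation_index(docs, first_sentence)
full = retrieve_full_doc("how does strength training work", index, store)
print(full)
```

Keeping the full document in a separate store means the model still sees the complete text even though only the short summary was embedded.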

In [None]:
prompt = ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")

chain = (
    {"doc": lambda x: x.page_content}
    | prompt
    | llm
    | StrOutputParser()
)

summaries = chain.batch(splits, {"max_concurrency": 5})

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

In [None]:
# The vectorstore used to index the summaries
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=embedding_model)

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in splits]

# Docs linked to summaries
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, splits)))

In [None]:
query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query,k=1)
sub_docs[0]

In [None]:
retrieved_docs = retriever.invoke(query)
retrieved_docs[0].page_content[0:500]