<a href="https://colab.research.google.com/github/himalrpandey/DeepLearning/blob/main/Himal_AIT_RAG_Assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents or,
   * social media posts.

   Make sure that the documents are newer then the training dataset of the applied LLM. (20 points)

2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


In [1]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
import os
from langchain.document_loaders import UnstructuredPDFLoader

locale.getpreferredencoding = lambda: "UTF-8"

# you need to define your private User Access Token from Huggingface
# to be able to access models with accepted licence
HUGGINGFACE_UAT="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"
login(HUGGINGFACE_UAT)

!wget https://arxiv.org/pdf/2403.14799
!wget https://arxiv.org/pdf/2401.05343
!wget https://arxiv.org/pdf/2306.09239
!wget https://arxiv.org/pdf/2104.01469
!wget https://arxiv.org/pdf/1809.06303

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
--2024-05-08 22:24:44--  https://arxiv.org/pdf/2403.14799
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.67.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 744054 (727K) [application/pdf]
Saving to: ‘2403.14799.1’


2024-05-08 22:24:44 (23.8 MB/s) - ‘2403.14799.1’ saved [744054/744054]

--2024-05-08 22:24:44--  https://arxiv.org/pdf/2401.05343
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.67.42, 151.101.3.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Len

In [2]:
first_prompt = "Summarize the key findings on EEG complexity in ADHD diagnosis as discussed in recent 2024 studies"
second_prompt = "Discuss the latest approaches in using topological data analysis for understanding brain signals in ADHD research."
third_prompt = "Evaluate the effectiveness of network structure exploitation in automatic identification of ADHD subjects according to recent studies."
fourth_prompt_irrelevant = "Provide a detailed guide on the cultivation and care of mint trees"

In [3]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # 8B language model from Meta AI

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # 0.0 # For RAG we would like to have determenistic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [4]:
template = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

In [5]:
prompts = [first_prompt, second_prompt, third_prompt, fourth_prompt_irrelevant]
for prompt in prompts:
  result = llm(prompt.format(text=prompt))
  print(fill(result.strip(), width=100))

  warn_deprecated(
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Summarize the key findings on EEG complexity in ADHD diagnosis as discussed in recent 2024
studies.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9341239/) * [Discuss the potential
implications of these findings for future research and clinical practice in ADHD diagnosis and
treatment.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9341239/)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Discuss the latest approaches in using topological data analysis for understanding brain signals in
ADHD research. (30 minutes)   3. Present a case study on how machine learning and deep learning can
be used to analyze EEG signals in ADHD diagnosis, highlighting the potential benefits and challenges
of these approaches. (20 minutes)   4. Discuss the importance of considering individual differences
and variability in brain function when developing diagnostic tools for ADHD, and present some recent
advances in this area. (20 minutes)  The presentation will conclude with a summary of the key
takeaways and future directions for the application of machine learning and topological data
analysis in ADHD research.  ### References:  1. **Topological Data Analysis**:         * Carlsson et
al. (2010). Topology and data. Notices of the American Mathematical Society, 57(10), 1322-1336.
* Ghrist (2008). Barcodes: The persistent topology of data. Bulletin of the American Mathematical
Society, 45(2), 

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Evaluate the effectiveness of network structure exploitation in automatic identification of ADHD
subjects according to recent studies. The results are presented as follows:  1. **Network-based
feature extraction**: In a study published in 2020, researchers used functional magnetic resonance
imaging (fMRI) data from 30 ADHD and 30 healthy control subjects to extract features from brain
networks [14]. They found that the extracted features were significantly different between the two
groups, with higher connectivity in the default mode network (DMN) and lower connectivity in the
salience network (SN) in ADHD subjects. 2. **Graph theory analysis**: A study published in 2019
analyzed the structural connectome of 20 ADHD and 20 healthy control subjects using diffusion tensor
imaging (DTI) data [15]. The authors applied graph theory metrics to identify differences in network
topology between the two groups. They found that ADHD subjects had reduced global efficiency and
increased clustering 

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunked_pdf_doc = []
loaders = [UnstructuredPDFLoader(fn) for fn in ["/content/1809.06303v1.pdf", "/content/2104.01469v1.pdf", "/content/2306.09239v1.pdf", "/content/2401.05343v1.pdf", "/content/2403.14799v1.pdf"]]

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

len(chunked_pdf_doc)

Loading raw document.../content/1809.06303v1.pdf
Splitting text...
Loading raw document.../content/2104.01469v1.pdf
Splitting text...
Loading raw document.../content/2306.09239v1.pdf
Splitting text...
Loading raw document.../content/2401.05343v1.pdf
Splitting text...
Loading raw document.../content/2403.14799v1.pdf
Splitting text...


422

In [7]:
%%time
embeddings = HuggingFaceEmbeddings() # default model_name="sentence-transformers/all-mpnet-base-v2"
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)



CPU times: user 8.88 s, sys: 491 ms, total: 9.37 s
Wall time: 12.1 s


In [8]:
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

In [9]:
%%time

Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)

ValidationError: 1 validation error for StuffDocumentsChain
__root__
  document_variable_name context was not found in llm_chain input_variables: ['text'] (type=value_error)

In [10]:
for query in prompts:
  result = Chain_pdf.invoke(query)
  print(fill(result['result'].strip(), width=100))

NameError: name 'Chain_pdf' is not defined