<a href="https://colab.research.google.com/github/lydyang/AIT-Deep-Learning/blob/main/Lydia_Yang_AIT_NLP_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<pre>
Modified version of: https://github.com/muntasirhsn/Retrieval-Augmented-Generation-with-Llama-2/blob/main/RAG_with_Lama_2_and_LangChain_2.ipynb

Modified and updated by:  Bálint Gyires-Tóth
</pre>

# RAG + LLM Assessment
Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training data of the LLM you use. (20 points)

2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)

# 1. Domain and Data
I picked five LinkedIn posts from 2024 to use as my dataset.
* https://www.linkedin.com/posts/robschluensen_hello-fall-2024-andor-spring-2025-computer-activity-7191611032388132864-q8J3?utm_source=combined_share_message&utm_medium=member_desktop

* https://www.linkedin.com/posts/seth-ingall-20ab71226_last-fall-after-more-than-3-decades-of-service-activity-7193585428199809024-gJhT?utm_source=combined_share_message&utm_medium=member_desktop

* https://www.linkedin.com/posts/costafilhoe_fully-funded-phd-student-position-at-south-activity-7184803939345727488-pUDS?utm_source=combined_share_message&utm_medium=member_desktop

* https://www.linkedin.com/posts/xiao-kuang-9aa578184_scientist-xiao-kuang-activity-7193411289589723137-vD7O?utm_source=combined_share_message&utm_medium=member_desktop

* https://www.linkedin.com/posts/seungdeog-choi-b0a7246_pees-lab-activity-7193383940835995648-b1cp?utm_source=combined_share_message&utm_medium=member_desktop



# 2. Prompts
My relevant prompts are:

1.   "Who sent the LinkedIn post 3 weeks ago in April 2024 for a fully-funded Ph.D. student position in ecohydrology and remote sensing, starting in Spring 2025 at South Dakota School of Mines, Rapid City, SD.?"
2.   "After how many years did Seth Ingall serve before depart from GEICO in spring of 2024?"
3. "Who sent the LinkedIn post for a Full-Stack Software Engineering Intern role for the IT Business System Engineering team in Fall 2024 and/or Spring 2025 Computer Science students?"

My irrelevant prompt is:

1. "What color are my shoes today?"

# 3. Load the LLM



In [1]:
# Quote version specifiers so the shell does not treat ">" as a redirect,
# and send pip's output to /dev/null instead of a file named "null".
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > /dev/null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > /dev/null
!pip install langchain > /dev/null
!pip install chromadb > /dev/null
!pip install sentence_transformers > /dev/null  # optionally pin ==2.2.2
!pip install unstructured > /dev/null
!pip install pdf2image > /dev/null
!pip install pdfminer.six > /dev/null
!pip install unstructured-pytesseract > /dev/null
!pip install unstructured-inference > /dev/null
!pip install faiss-gpu > /dev/null
!pip install pikepdf > /dev/null
!pip install pypdf > /dev/null
!pip install accelerate > /dev/null
!pip install pillow_heif > /dev/null
!pip install -i https://pypi.org/simple/ bitsandbytes > /dev/null

In [None]:
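import os
# Kill the current process so that Colab restarts the runtime and picks up the newly installed packages.
os.kill(os.getpid(), 9)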

In [1]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"

# Gated models require a private Hugging Face User Access Token from an
# account that has accepted the model licence. Prompt for the token at run
# time instead of hard-coding it in the notebook.
from getpass import getpass
HUGGINGFACE_UAT = getpass("Hugging Face token: ")
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [2]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # 8B language model from Meta AI

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens = 512
gen_cfg.temperature = 0.0000001  # near zero: for RAG we would like to have deterministic answers
gen_cfg.return_full_text = True
gen_cfg.do_sample = True
gen_cfg.repetition_penalty = 1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
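
The generation settings above keep `do_sample=True` but push the temperature close to zero. A minimal sketch of why that behaves almost like greedy decoding, using made-up logits for three candidate tokens (the function name and the numbers are illustrative assumptions, not part of the model code):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.5, 0.5]  # hypothetical scores, not real model output

# At temperature 1.0 every token keeps a noticeable share of the probability.
print(softmax_with_temperature(toy_logits, 1.0))

# At the notebook's temperature the top token takes essentially all of it.
print(softmax_with_temperature(toy_logits, 0.0000001))
```

In `transformers`, setting `do_sample=False` is the usual way to request greedy decoding directly; the near-zero temperature is an approximation of the same effect.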


# 4. Testing the LLM with My Prompts

In [3]:
template_llama3 = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

template=template_llama3

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)


In [11]:
text = "Who sent the LinkedIn post 3 weeks ago in April 2024 for a fully-funded Ph.D. student position in ecohydrology and remote sensing, starting in Spring 2025 at South Dakota School of Mines, Rapid City, SD.?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Who sent the LinkedIn post 3 weeks ago in
April 2024 for a fully-funded Ph.D. student position in ecohydrology and remote sensing, starting in
Spring 2025 at South Dakota School of Mines, Rapid City,
SD.?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I'm not aware of any specific LinkedIn
post that was sent three weeks ago in April 2024 for a fully-funded Ph.D. student position in
ecohydrology and remote sensing at South Dakota School of Mines, Rapid City, SD.  It's possible that
you may have missed the post or it might be a private message or an email that wasn't shared
publicly on LinkedIn. Additionally, job postings can be removed or updated frequently, so even if
there was a posting, it may no longer be active.  If you're interested in pursuing a Ph.D. program
in ecohydrology and remote sensing, I recommend searching for current job openings on websites like
AcademicJobs.net, HigherEdJobs.com, or university-specifi

In [5]:
text = "Who sent the LinkedIn post for a Full-Stack Software Engineering Intern role for the IT Business System Engineering team in Fall 2024 and/or Spring 2025 Computer Science students?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Who sent the LinkedIn post for a Full-
Stack Software Engineering Intern role for the IT Business System Engineering team in Fall 2024
and/or Spring 2025 Computer Science students?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
I'm happy to help you with your question! However, I need more information. Can you please provide
me with more context or details about the LinkedIn post you're referring to? Such as:  * The company
name or industry? * The specific job title "Full-Stack Software Engineering Intern"? * Any other
relevant keywords or phrases from the post?  This will help me narrow down my search and try to find
the correct answer for you.


In [6]:
text = "After how many years did Seth Ingall serve before depart from GEICO in spring of 2024?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  After how many years did Seth Ingall
serve before depart from GEICO in spring of
2024?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but I couldn't find any
information on a person named Seth Ingall serving at GEICO or departing the company in the spring of
2024. It's possible that you may have incorrect or outdated information. If you could provide more
context or clarify who Seth Ingall is and what his role was at GEICO, I'd be happy to try and help
you further.


In [12]:
text = "What color are my shoes today?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What color are my shoes
today?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I'm just an AI, I don't have the
ability to see or know what you're wearing. I can only respond based on our conversation. Since
we've just started talking, I don't have any information about your shoes. Can you tell me what
color they are?
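
The printed results above echo the whole prompt before the model's reply. A small sketch of how the reply alone could be pulled out, assuming the output contains the Llama 3 assistant header exactly as it appears in the template (`extract_answer` is a hypothetical helper, not a LangChain or `transformers` function):

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"

def extract_answer(full_text: str) -> str:
    """Return only the text that follows the last assistant header."""
    _, found, tail = full_text.rpartition(ASSISTANT_HEADER)
    if not found:
        # No header present: assume the text is already just the answer.
        return full_text.strip()
    return tail.replace("<|eot_id|>", "").strip()

sample_output = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>  "
    "What color are my shoes today?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|> I can't see your shoes."
)
print(extract_answer(sample_output))
```

It could be applied as `extract_answer(result)` before printing, which would make the four answers easier to compare.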


# 5. Create a LangChain-based RAG system

In [27]:
# URLs of the five LinkedIn posts that form the knowledge base
links = [
    "https://www.linkedin.com/posts/robschluensen_hello-fall-2024-andor-spring-2025-computer-activity-7191611032388132864-q8J3?utm_source=combined_share_message&utm_medium=member_desktop",
    "https://www.linkedin.com/posts/seth-ingall-20ab71226_last-fall-after-more-than-3-decades-of-service-activity-7193585428199809024-gJhT?utm_source=combined_share_message&utm_medium=member_desktop",
    "https://www.linkedin.com/posts/costafilhoe_fully-funded-phd-student-position-at-south-activity-7184803939345727488-pUDS?utm_source=combined_share_message&utm_medium=member_desktop",
    "https://www.linkedin.com/posts/xiao-kuang-9aa578184_scientist-xiao-kuang-activity-7193411289589723137-vD7O?utm_source=combined_share_message&utm_medium=member_desktop",
    "https://www.linkedin.com/posts/seungdeog-choi-b0a7246_pees-lab-activity-7193383940835995648-b1cp?utm_source=combined_share_message&utm_medium=member_desktop",
]

# Download each page and parse it into text elements
web_loader = UnstructuredURLLoader(urls=links, mode="elements", strategy="fast")
web_doc = web_loader.load()
# Drop metadata values that are not str, int, float or bool
updated_web_doc = filter_complex_metadata(web_doc)
# Split the text into chunks of up to 2048 characters with 512 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=512)
chunked_web_doc = text_splitter.split_documents(updated_web_doc)
# Embed the chunks and index them in a FAISS vector store
embeddings = HuggingFaceEmbeddings()  # default model_name="sentence-transformers/all-mpnet-base-v2"
db_web = FAISS.from_documents(chunked_web_doc, embeddings)
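
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified character-level sliding window. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, so its real chunks will differ, and `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size pieces where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks

# Toy values chosen so the overlap is easy to see.
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.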



# 6. Providing my Prompts to the RAG System

In [28]:
%%time
prompt_template_llama3 = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt_template=prompt_template_llama3


CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 7.87 µs


In [29]:
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_web.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 10, 'score_threshold': 0.1}),
    chain_type_kwargs={"prompt": prompt},
)
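
The retriever above returns at most `k` chunks and drops any chunk whose relevance score falls below `score_threshold`. The toy sketch below mimics that idea with cosine similarity over hand-written vectors. It assumes cosine similarity as the score, which may not match how the LangChain FAISS wrapper computes relevance, and all names and vectors are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, k, score_threshold):
    """Return up to k document names scoring at least score_threshold, best first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    kept = [(score, name) for score, name in scored if score >= score_threshold]
    kept.sort(reverse=True)
    return [name for _, name in kept[:k]]

# Hypothetical 3-dimensional "embeddings" for two posts.
toy_docs = {"phd_post": [0.9, 0.1, 0.0], "intern_post": [0.1, 0.9, 0.0]}

# A query close to the first post retrieves it first.
print(retrieve([1.0, 0.0, 0.0], toy_docs, k=10, score_threshold=0.1))

# A query unrelated to both posts retrieves nothing, so the context stays empty.
print(retrieve([0.0, 0.0, 1.0], toy_docs, k=10, score_threshold=0.1))
```

An empty retrieval result is what lets the system prompt's "not enough information" instruction take effect for an irrelevant question.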


In [30]:
query = "Who sent the LinkedIn post for a Full-Stack Software Engineering Intern role for the IT Business System Engineering team in Fall 2024 and/or Spring 2025 Computer Science students?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  LinkedIn  LinkedIn  LinkedIn  LinkedIn  LinkedIn  Hello Fall 2024 and/or Spring 2025
Computer Science students.  I have 1 remaining Full-Stack Software Engineering Intern role for the
IT Business System Engineering team.  The impact you will have: Work with the IT Business System
Engineering team to roll out new features and processes to help several teams such as HR, Finance
and Customer Success. Participate in full Development Lifecycle: design, development, testing,
documentation and operations in production environment. Build customized applications with a high
degree of reliability, scalability and security.  What we look for: You will graduate in Fall 2024
or Spring 2025

In [31]:
%%time
query = "After how many years did Seth Ingall serve before depart from GEICO in spring of 2024?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  Congrats Seth on your career at GEICO and a well deserved hiatus 30 years later, which I
am sure will be short. Can't wait to learn about what comes next. All the best.  Seth Ingall
Operational Executive | Chief Claims Officer | Chief Legal Officer | Strategic Corporate Leader and
Former GEICO Executive
1mo
Report this post
Congratulations!
Joe Pusateri                                                     Chief Marketer | Performance &
Brand Marketing VP  | ROI-minded Leader | Omnichannel Advertiser | Passionate Team Builder | DEI
Advocate | Ex-GEICO
1mo                                                                    Edited  Congratulations,
Seth. Your dedication and commitm

In [32]:
%%time
query = "Who sent the LinkedIn post 3 weeks ago in April 2024 for a fully-funded Ph.D. student position in ecohydrology and remote sensing, starting in Spring 2025 at South Dakota School of Mines, Rapid City, SD.?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  Opportunity for a Fully-Funded Ph.D. degree at South Dakota School of Mines
--------------------------------------------------------------------------------------- I am pleased
to announce that I am seeking a highly motivated individual for a fully-funded Ph.D. student
position in ecohydrology and remote sensing, starting in Spring 2025 at South Dakota School of
Mines, Rapid City, SD. The start date of this Ph.D. program and research is January 13, 2025.
Application Deadline:  The deadline for submissions is May 10, 2024.  Interested candidates are
invited to apply by visiting the following link: https://shorturl.at/hzDT6. Please, complete the
screening questionnaire and uploa

In [33]:
query = "What color are my shoes today?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  <|eot_id|><|start_header_id|>user<|end_header_id|>  What color are my shoes
today?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I don't have enough information to
answer that question. I'm a large language model, I don't have access to your physical appearance or
surroundings, so I wouldn't know what color your shoes are today.
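
The task asks for an empty answer to the irrelevant prompt, whereas the model replies with a refusal sentence. One possible post-processing step, sketched under the assumption that refusals reuse the wording requested in the system prompt, maps such replies to an empty string (`answer_or_empty` and the marker list are hypothetical):

```python
# Phrases assumed to signal a refusal; they mirror the system prompt's wording.
REFUSAL_MARKERS = (
    "don't have enough information",
    "do not have enough information",
)

def answer_or_empty(answer: str) -> str:
    """Return an empty string for refusal-style replies, else the trimmed answer."""
    lowered = answer.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return ""
    return answer.strip()

print(repr(answer_or_empty("I don't have enough information to answer that question.")))
print(repr(answer_or_empty("  More than three decades.  ")))
```

This check is deliberately simple; a reply that refuses in different words would pass through unchanged.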
