# RAG

## Requirements

In [1]:
%%capture
!pip install transformers bitsandbytes langchain langchain-community sentence-transformers faiss-gpu pandas gdown

In [2]:
! pip install -U 'accelerate==0.21.0'



In [3]:
import accelerate

accelerate.__version__

'0.21.0'

## Dataset

In [4]:
!gdown --fuzzy https://drive.google.com/file/d/1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI/view?usp=sharing

Downloading...
From (original): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI
From (redirected): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI&confirm=t&uuid=5374c700-28a6-4995-948c-3caf71d33d17
To: /content/IMDB_crawled.json
100% 292M/292M [00:01<00:00, 179MB/s]


## Config

In [5]:
class Config:
    EMBEDDING_MODEL_NAME="thenlper/gte-base"
    LLM_MODEL_NAME="HuggingFaceH4/zephyr-7b-beta"
    K = 5 # top K retrieval

## Preprocessing

In [7]:
import pandas as pd

df = pd.read_json('IMDB_crawled.json')
df

Unnamed: 0,id,title,first_page_summary,release_year,mpaa,budget,gross_worldwide,rating,directors,writers,stars,related_links,languages,countries_of_origin,summaries,synposis,reviews,genres
0,tt0071562,The Godfather Part II,The early life and career of Vito Corleone in ...,1974,R,"$13,000,000 (estimated)","$47,962,683",9.0,[Francis Ford Coppola],,"[Al Pacino, Robert De Niro, Robert Duvall]",[https://imdb.com/title/tt0068646/?ref_=tt_sim...,"[English, Italian, Spanish, Latin, Sicilian]",[United States],[The early life and career of Vito Corleone in...,[The Godfather Part II presents two parallel s...,"[[Coppola's masterpiece is rivaled only by ""Th...","[Crime, Drama]"
1,tt0120737,The Lord of the Rings: The Fellowship of the Ring,A meek Hobbit from the Shire and eight compani...,2001,PG-13,"$93,000,000 (estimated)","$884,041,698",8.9,[Peter Jackson],,"[Elijah Wood, Ian McKellen, Orlando Bloom]",[https://imdb.com/title/tt0167261/?ref_=tt_sim...,"[English, Sindarin]","[New Zealand, United States]",[A meek Hobbit from the Shire and eight compan...,[Galadriel (Cate Blanchett) (The Elven co-rule...,"[[Here is one film that lived up to its hype, ...","[Action, Adventure, Drama]"
2,tt0110912,Pulp Fiction,"The lives of two mob hitmen, a boxer, a gangst...",1994,R,"$8,000,000 (estimated)","$213,928,762",8.9,[Quentin Tarantino],,"[John Travolta, Uma Thurman, Samuel L. Jackson]",[https://imdb.com/title/tt0137523/?ref_=tt_sim...,"[English, Spanish, French]",[United States],"[The lives of two mob hitmen, a boxer, a gangs...",[Narrative structure\nPulp Fiction's narrative...,[[I like the bit with the cheeseburger. It mak...,"[Crime, Drama]"
3,tt0068646,The Godfather,The aging patriarch of an organized crime dyna...,1972,R,"$6,000,000 (estimated)","$250,342,030",9.2,[Francis Ford Coppola],,"[Marlon Brando, Al Pacino, James Caan]",[https://imdb.com/title/tt0071562/?ref_=tt_sim...,"[English, Italian, Latin]",[United States],[The aging patriarch of an organized crime dyn...,"[In late summer 1945, guests are gathered for ...",[['The Godfather' is the pinnacle of flawless ...,"[Crime, Drama]"
4,tt0111161,The Shawshank Redemption,"Over the course of several years, two convicts...",1994,R,"$25,000,000 (estimated)","$28,904,232",9.3,[Frank Darabont],"[Stephen King, Frank Darabont]","[Tim Robbins, Morgan Freeman, Bob Gunton]",[https://imdb.com/title/tt0468569/?ref_=tt_sim...,[English],[United States],"[Over the course of several years, two convict...","[In 1947, Andy Dufresne (Tim Robbins), a banke...",[[The Shawshank Redemption is written and dire...,[Drama]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9945,tt0052190,The Sheepman,A brash stranger and his sheep arrive in a sma...,1958,Passed,"$1,283,000 (estimated)",,6.8,[George Marshall],,"[Glenn Ford, Shirley MacLaine, Leslie Nielsen]",[https://imdb.com/title/tt0049201/?ref_=tt_sim...,[English],[United States],[A brash stranger and his sheep arrive in a sm...,"[In The Sheepman, Glenn Ford arrives at a smal...",[[The Sheepman is directed by George Marshall ...,[Western]
9946,tt0062865,Day of the Evil Gun,A woman and two children are kidnapped by Apac...,1968,Approved,,,6.4,[Jerry Thorpe],,"[Glenn Ford, Arthur Kennedy, Dean Jagger]",[https://imdb.com/title/tt0061893/?ref_=tt_sim...,"[English, Spanish]",[United States],[A woman and two children are kidnapped by Apa...,,[[Glenn Ford plays here a character close to t...,[Western]
9947,tt0061893,The Last Challenge,A deadly gunslinger travels to a town to shoot...,1967,Approved,,,6.0,[Richard Thorpe],,"[Glenn Ford, Angie Dickinson, Chad Everett]",[https://imdb.com/title/tt0062865/?ref_=tt_sim...,[English],[United States],[A deadly gunslinger travels to a town to shoo...,,[[With elements of the TV western Gunsmoke and...,[Western]
9948,tt0059661,The Rounders,"In Sedona, two aging cowpokes bust broncos, ch...",1965,TV-PG,,,6.1,[Burt Kennedy],"[Max Evans, Burt Kennedy]","[Glenn Ford, Henry Fonda, Sue Ane Langdon]",[https://imdb.com/title/tt0064409/?ref_=tt_sim...,[English],[United States],"[In Sedona, two aging cowpokes bust broncos, c...",,[[So said the agreeable Henry Fonda to just ab...,"[Comedy, Western]"


In [21]:
import os
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

os.makedirs('data', exist_ok=True)

# preprocess your data and only store the needed data as the context window for embedding model is limited
def clean_text(text):
    text = re.sub(r'\W', ' ', str(text))  # Remove non-word characters
    text = text.lower()  # Lowercase the text
    text = re.sub(r'\s+[a-z]\s+', ' ', text)  # Remove single characters surrounded by spaces
    text = re.sub(r'^[a-z]\s+', ' ', text)  # Remove single characters at the start
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    text = text.strip()  # Strip leading and trailing whitespace
    stop_words = set(stopwords.words('english')) # Remove stop words
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text


df = df[['title', 'first_page_summary', 'release_year', 'genres',]]

df = df.dropna(subset=['title', 'first_page_summary', 'release_year', 'genres'])
df = df.drop_duplicates(subset=['first_page_summary'])
df['first_page_summary'] = df['first_page_summary'].apply(clean_text)
# df['first_page_summary'].astype(str)
print(df['first_page_summary'].apply(lambda x: isinstance(x, str)).all())
df['data'] = df.apply(
    lambda row: f"Title: {row['title']}\nRelease year: {row['release_year']}\nSummary: {row['first_page_summary']}",
    axis=1
)
df.to_csv('data/imdb.csv', index=False)

df.head(10)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True


Unnamed: 0,title,first_page_summary,release_year,genres,data
0,The Godfather Part II,early life career vito corleone 1920s new york...,1974,"['Crime', 'Drama']",Title: The Godfather Part II\nRelease year: 19...
1,The Lord of the Rings: The Fellowship of the Ring,meek hobbit shire eight companions set journey...,2001,"['Action', 'Adventure', 'Drama']",Title: The Lord of the Rings: The Fellowship o...
2,Pulp Fiction,lives two mob hitmen boxer gangster wife pair ...,1994,"['Crime', 'Drama']",Title: Pulp Fiction\nRelease year: 1994\nSumma...
3,The Godfather,aging patriarch organized crime dynasty transf...,1972,"['Crime', 'Drama']",Title: The Godfather\nRelease year: 1972\nSumm...
4,The Shawshank Redemption,course several years two convicts form friends...,1994,['Drama'],Title: The Shawshank Redemption\nRelease year:...
5,Schindler's List,german occupied poland world war ii industrial...,1993,"['Biography', 'Drama', 'History']",Title: Schindler's List\nRelease year: 1993\nS...
6,One Flew Over the Cuckoo's Nest,fall 1963 korean war veteran criminal pleads i...,1975,['Drama'],Title: One Flew Over the Cuckoo's Nest\nReleas...
7,Fight Club,insomniac office worker devil may care soap ma...,1999,['Drama'],Title: Fight Club\nRelease year: 1999\nSummary...
8,The Dark Knight,menace known joker wreaks havoc chaos people g...,2008,"['Action', 'Crime', 'Drama']",Title: The Dark Knight\nRelease year: 2008\nSu...
9,12 Angry Men,jury new york city murder trial frustrated sin...,1957,"['Crime', 'Drama']",Title: 12 Angry Men\nRelease year: 1957\nSumma...


## Vectorizer

load the CSV file and vectorize the rows using HuggingFaceEmbeddings.
Store the results using FAISS vectorstore.
Save the vectorestore in a pickle file for future usages.

In [22]:
import pickle

from langchain.document_loaders.csv_loader import CSVLoader
from langchain.vectorstores.utils import DistanceStrategy
from langchain.vectorstores.faiss import FAISS
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain_core.documents import Document

from langchain_community.embeddings import HuggingFaceEmbeddings

from tqdm.notebook import tqdm, trange

# load the csv
data = pd.read_csv('data/imdb.csv').dropna()
documents = []
a = 0
for index, row in data.iterrows():
  a = max(a, len(row['data'].split()))
  d = Document(
      page_content=row['data'],
      metadata={"genres": row['genres']}
    )
  documents.append(d)
# load the embeddings model
embedder = HuggingFaceEmbeddings(model_name=Config.EMBEDDING_MODEL_NAME)

# save embed the documents using the model in a vectorstore
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
vectorstore = await FAISS.afrom_documents(docs, embedder)

with open("data/vectorstore.pkl", "wb") as f:
    pickle.dump(vectorstore, f)



load the vectorstore as a retriever.

In [23]:
with open("data/vectorstore.pkl", "rb") as f:
    vectorstore = pickle.load(f)

# load the retriever from the vectorstore
retriever = vectorstore.as_retriever(k=5)

testing retriever

In [24]:
query = ["running in the jungle toward the mountains while the ring weakens us",
         "Im an space and galaxy enthusiast",
         "in the dark he tries to save people"]
for q in query:
    print('QUERY:',q)
    results = retriever.invoke(q)
    for result in results:
        print(result.page_content)

QUERY: running in the jungle toward the mountains while the ring weakens us
Title: Running on the Sun: The Badwater 135
Release year: 2000
Summary: forty runners compete grueling race earth badwater film documents trials tribulations athletes run 135 miles death valley july read
Title: Wailing in the Forest
Release year: 2016
Summary: indigenous family one last forest people old copes unusual changes environment including imposing lifestyle customs sociable tribes th read
Title: Day of the Animals
Release year: 1977
Summary: battle survival ensues group hikers encounters chemically imbalanced forest
Title: 127 Hours
Release year: 2010
Summary: mountain climber becomes trapped boulder canyoneering alone near moab utah resorts desperate measures order survive
QUERY: Im an space and galaxy enthusiast
Title: Cosmos
Release year: 1980
Summary: astronomer carl sagan leads us engaging guided tour various elements cosmological theories universe
Title: Cosmos: A Spacetime Odyssey
Release year: 

## LLM

load the quantized LLM.

restart the runtime when you get error for accelerate!

In [18]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import pipeline

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

# load the quantization config
bnb_config = BitsAndBytesConfig()


#needs accelerate v.0.21.0, not newer!
model = AutoModelForCausalLM.from_pretrained(Config.LLM_MODEL_NAME, quantization_config=bnb_config, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(Config.LLM_MODEL_NAME)

# init the pipeline
READER_LLM = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=50)

llm = HuggingFacePipeline(
    pipeline=READER_LLM,
)

config.json:   0%|          | 0.00/638 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/8 [00:00<?, ?it/s]

model-00001-of-00008.safetensors:   0%|          | 0.00/1.89G [00:00<?, ?B/s]

model-00002-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00003-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00004-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00005-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00006-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00007-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00008-of-00008.safetensors:   0%|          | 0.00/816M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/168 [00:00<?, ?B/s]

  warn_deprecated(


initialize the prompt template for the query chain. query chain is used to get a query from the chat history. you may change the prompt as you like to get better results.

In [37]:
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableBranch

class LoggerStrOutputParser(StrOutputParser):
    def parse(self, text: str) -> str:
        # process the LLM output
        print(f"QUERY: {text}")
        return text

query_transform_prompt = PromptTemplate(
    input_variables=["messages"],
    template="""<|system|>You are a helpful assistant.
{messages}
<|user|>
give me the search query about the above conversation.
<|assistant|>"""
)

#chat api?llm.  config?

query_transforming_retriever_chain = RunnableBranch(
            (
                lambda x: len(x.get("messages", [])) == 1,
                query_transform_prompt | llm | LoggerStrOutputParser() | retriever,
            ),
            query_transform_prompt | llm | LoggerStrOutputParser() | retriever,
        ).with_config(run_name="chat_retriever_chain")

initialize the main retrieval chain that gives the resulting documents to LLM and gets the output back.

In [38]:
from langchain.chains.combine_documents import create_stuff_documents_chain

prompt = PromptTemplate(
    input_variables=["context", "messages"],
    template="""<|system|>You are a helpful assistant.

Here are the movies you MUST choose from:

{context}
-----------------
{messages}
<|assistant|>""")


chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = (
            RunnablePassthrough.assign(
                context=query_transforming_retriever_chain,
            ).assign(
                answer=chain,
            )
        )

write the conversation helper class for easier testing.

In [43]:
class Conversation:
    def __init__(self):
        self.messages = []

    def add_assistant_message(self, message):
        self.messages.append(('assistant', message))

    def add_user_message(self, message):
        self.messages.append(('user', message))

    def get_messages(self):
        # concatenate the messages with the roles in the instruction format
        formatted_messages = "\n".join([f"{role}: {message}" for role, message in self.messages])
        return formatted_messages

    def chat(self, message):
        self.add_user_message(message)
        response = retrieval_chain.invoke({"messages": self.messages})
        self.add_assistant_message(response['answer'])
        return response['answer']


'\n    def add_user(self, user):\n        self.history[user] = ChatMessageHistory()\n        logger.info(f"user {user} added")\n\n    def has_user(self, user):\n        return user in self.history\n\n    def chat(self, user, message):\n        if not self.has_user(user):\n            self.add_user(user)\n        logger.info(f"Q: {message}")\n        self.history[user].add_user_message(message)\n        response = self.retrieval_chain.invoke({"messages": self.history[user].messages})\n        logger.info(f"A: {response}")\n        self.history[user].add_ai_message(response[\'answer\'])\n        return response[\'answer\']\n'

## Test

talk with the RAG to see how good it performs.

In [44]:
c = Conversation()
A = c.chat('give me a cool gangster movie')
print("*"*40)
print(A)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


QUERY: <|system|>You are a helpful assistant.
[('user', 'give me a cool gangster movie')]
<|user|>
give me the search query about the above conversation.
<|assistant|>
"Generate a search query based on a conversation between a user and an assistant discussing a cool gangster movie recommendation."

Search query: "Recommend a stylish and intense gangster movie with captivating characters and a thrilling
****************************************
<|system|>You are a helpful assistant.

Here are the movies you MUST choose from:

Title: Searchers 2.0
Release year: 2007
Summary: hollywood western actors mel torres fred fletcher hear fritz frobisher attend screening one movies arizona decide go exact revenge fo read

Title: American Gangster
Release year: 2006–2009
Summary: follows lives american gangsters

Title: The Assistants
Release year: 2009
Summary: group hollywood assistants strive something bigger conspire make movie behind bosses backs using resources

Title: Q&A
Release year: 1990
S

In [45]:
A = c.chat('give me a newer one')
print("*"*40)
print(A)

QUERY: <|system|>You are a helpful assistant.
[('user', 'give me a cool gangster movie'), ('assistant', '<|system|>You are a helpful assistant.\n\nHere are the movies you MUST choose from:\n\nTitle: Searchers 2.0\nRelease year: 2007\nSummary: hollywood western actors mel torres fred fletcher hear fritz frobisher attend screening one movies arizona decide go exact revenge fo read\n\nTitle: American Gangster\nRelease year: 2006–2009\nSummary: follows lives american gangsters\n\nTitle: The Assistants\nRelease year: 2009\nSummary: group hollywood assistants strive something bigger conspire make movie behind bosses backs using resources\n\nTitle: Q&A\nRelease year: 1990\nSummary: dirty cop mike brennan thinks got away murder routine righteous assistant da finds clue sets collision course\n-----------------\n[(\'user\', \'give me a cool gangster movie\')]\n<|assistant|>\nBased on your request, I would recommend "American Gangster" (2006-2009) as the movie for you. It follows the lives of two

as you can see it is currently working and even retrieves a newer movie on demand but the answer is not in a good format I wish I had more time so I could fix this too!