## RAG Architecture

A brief introduction and demo

Let's explore our dataset. It is based on movie information scraped from the IMDb website and is available on Kaggle at:
https://www.kaggle.com/datasets/utsh0dey/25k-movie-dataset

In [5]:
import pandas as pd

df = pd.read_csv("../data/unprocessed/25k-imdb-movie-dataset.csv")
df.head(3)

Unnamed: 0,movie title,Run Time,Rating,User Rating,Generes,Overview,Plot Kyeword,Director,Top 5 Casts,Writer,year,path
0,Top Gun: Maverick,"$170,000,000 (estimated)",8.6,187K,"['Action', 'Drama']",After more than thirty years of service as one...,"['fighter jet', 'sequel', 'u.s. navy', 'fighte...",Joseph Kosinski,"['Jack Epps Jr.', 'Peter Craig', 'Tom Cruise',...",Jim Cash,-2022,/title/tt1745960/
1,Jurassic World Dominion,2 hours 27 minutes,6.0,56K,"['Action', 'Adventure', 'Sci-Fi']",Four years after the destruction of Isla Nubla...,"['dinosaur', 'jurassic park', 'tyrannosaurus r...",Colin Trevorrow,"['Colin Trevorrow', 'Derek Connolly', 'Chris P...",Emily Carmichael,-2022,/title/tt8041270/
2,Top Gun,"$15,000,000 (estimated)",6.9,380K,"['Action', 'Drama']",As students at the United States Navy's elite ...,"['pilot', 'male camaraderie', 'u.s. navy', 'gr...",Tony Scott,"['Jack Epps Jr.', 'Ehud Yonay', 'Tom Cruise', ...",Jim Cash,-1986,/title/tt0092099/


Let's do some transformations

In [9]:
from ast import literal_eval

def concat_list(list_: str) -> str:
  """Parses a stringified list and joins its items with ' '."""
  list_ = literal_eval(list_)
  return ' '.join(list_)

def string_to_list(string: str) -> list:
  """Parses a stringified list into a Python list."""
  list_ = literal_eval(string)
  return list_
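
The list-like columns in the CSV are stored as strings such as `"['Action', 'Drama']"`, which is why both helpers call `literal_eval` before doing anything else. A minimal standalone sketch of that behavior, using a sample value copied from the `Generes` column shown above (the name `join_stringified_list` is hypothetical and only used here):

```python
from ast import literal_eval

def join_stringified_list(raw: str) -> str:
    """Parse a string that holds a Python list literal and join its items with spaces."""
    items = literal_eval(raw)  # only Python literals are evaluated, unlike eval()
    return " ".join(items)

genres = join_stringified_list("['Action', 'Drama']")
print(genres)
```

Note that `literal_eval` raises an error on input that is not a valid Python literal, so a column with malformed or empty values would need handling before applying it.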

In [7]:
# Fill NAs, clean keywords, stars, genres and ratings
df = df.fillna(' ')
df['Keywords'] = df['Plot Kyeword'].apply(concat_list)
df['Stars'] = df['Top 5 Casts'].apply(concat_list)
df['Generes'] = df['Generes'].apply(concat_list)
df['Rating'] = pd.to_numeric(df['Rating'], errors="coerce").fillna(0).astype("float")

# Concatenate all to have a more complete description
df['text'] = df.apply(lambda x : str(x['Overview']) + ' ' + x['Keywords'] + ' ' + x['Stars'], axis=1)

Drop the columns we no longer need

In [8]:
df.drop(['Plot Kyeword','Top 5 Casts', 'Run Time', 'User Rating', 'path', 'year'],axis=1, inplace=True)

Generate Embeddings

In [9]:
# sentence_transformers runs locally (faster with a GPU); alternatively, you can use OpenAI embeddings via the API
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Generate Embeddings
embeddings = model.encode(df['text'], batch_size=64, show_progress_bar=True)

Batches: 100%|██████████| 382/382 [09:08<00:00,  1.43s/it]


In [10]:
# Assign embeddings to a column
df['embeddings'] = embeddings.tolist()

# The vector store needs an id for every record
df['ids'] = df.index
df['ids'] = df['ids'].astype('str')

In [8]:
import pandas as pd

In [10]:
# Load the processed dataset and restore the ids and embeddings columns
df = pd.read_csv("../data/processed/embeddings_dataset.csv")
df['ids'] = df.index
df['ids'] = df['ids'].astype('str')
df['embeddings'] = df['embeddings'].apply(string_to_list)
df.head(2)

Unnamed: 0,movie title,Rating,Generes,Overview,Director,Writer,Keywords,Stars,text,embeddings,ids
0,Top Gun: Maverick,8.6,Action Drama,After more than thirty years of service as one...,Joseph Kosinski,Jim Cash,fighter jet sequel u.s. navy fighter aircraft ...,Jack Epps Jr. Peter Craig Tom Cruise Jennifer ...,After more than thirty years of service as one...,"[-0.07095595449209213, -0.009480987675487995, ...",0
1,Jurassic World Dominion,6.0,Action Adventure Sci-Fi,Four years after the destruction of Isla Nubla...,Colin Trevorrow,Emily Carmichael,dinosaur jurassic park tyrannosaurus rex veloc...,Colin Trevorrow Derek Connolly Chris Pratt Bry...,Four years after the destruction of Isla Nubla...,"[-0.025362147018313408, -0.06149573251605034, ...",1


## Vector Store or Vector Database

Chromadb

Chroma is the open-source embedding database. Chroma makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs.

In [1]:
import chromadb
from chromadb.utils import embedding_functions

Instantiate the client and persist the embeddings database

In [2]:
chroma_client = chromadb.Client()
client_persistent = chromadb.PersistentClient(path='../data/data_embeddings')

### Sentence Transformers: open-source embeddings model
Define Embedding function

In [5]:
sentence_transformer_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)



Create the collection

In [6]:
db_embeddings = client_persistent.get_or_create_collection(
    name='movies_db_embeddings', 
    embedding_function= sentence_transformer_function
)

For testing purposes, we will keep only the first 3,500 rows

In [22]:
df = df.head(3500)

Add rows to vector database

In [26]:
db_embeddings.add(
    ids=df['ids'].tolist(),
    embeddings=df['embeddings'].tolist(),
    metadatas=df.drop(['ids','embeddings','text'],axis=1).to_dict('records')
)

Let's test it with a query

In [15]:
results = db_embeddings.query(
    query_texts=['a history about a theft'],
    n_results=2
)

In [16]:
results

{'ids': [['2394', '1312']],
 'distances': [[0.8909051418304443, 0.9687187075614929]],
 'metadatas': [[{'Director': 'Brian A. Miller',
    'Generes': 'Action Drama Thriller',
    'Keywords': 'car stolen money gunfight firefight shot to death amnesia hidden loot gunshot wound wounded in the head head injury',
    'Overview': 'The lone surviving thief of a violent armored car robbery is sprung from a high security facility and administered an experimental drug.',
    'Rating': 3.8,
    'Stars': 'Ryan Guzman Sylvester Stallone Meadow Williams Brian A. Miller Mike Maples',
    'Writer': 'Mike Maples',
    'movie title': 'Backtrace'},
   {'Director': 'Gemma Mc Carthy',
    'Generes': 'Action',
    'Overview': 'A thief gets out of prison after many years and decides to try and locate someone from his past. After a series of events, he ends up becoming sheriff. Now he has to face his past while in his new position as sheriff.',
    'Rating': 8.5,
    'Stars': "Jonathan O' Dwyer Sean Flood Fran
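
The `distances` field in the result is what the ranking is based on: a smaller distance means the stored embedding is closer to the query embedding. As a rough conceptual sketch, the brute-force search below ranks toy 3-dimensional vectors, assuming squared Euclidean distance as the metric. The vectors, ids, and the `top_k` helper are made up for illustration; a real vector store such as Chroma relies on an approximate nearest-neighbor index instead of scanning every record.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = [(id_, squared_l2(query, vec)) for id_, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy "collection": ids mapped to tiny hand-written embeddings.
toy_vectors = {
    "0": [1.0, 0.0, 0.0],
    "1": [0.0, 1.0, 0.0],
    "2": [0.9, 0.1, 0.0],
}
nearest = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(nearest)
```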

### OpenAI Embeddings: private embeddings model
Define Embedding function

In [1]:
import os
openai_api_key = os.getenv("OPENAI_API_KEY")

In [12]:
openai_function = embedding_functions.OpenAIEmbeddingFunction(
    api_key=openai_api_key,
    model_name = 'text-embedding-ada-002'
)

Create Embeddings

In [15]:
import mapply
mapply.init(chunk_size=1)

In [17]:
from openai import OpenAI

client = OpenAI(api_key=openai_api_key)

def get_embedding(text: str, model: str = 'text-embedding-ada-002') -> list:
    # Replace newlines with spaces so adjacent words are not glued together
    text = text.replace('\n', ' ')
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

df["embeddings"] = df["text"].mapply(get_embedding)

100%|██████████| 56/56 [01:42<00:00,  1.82s/it]
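
Calling the API once per row means one HTTP request per movie. The embeddings endpoint also accepts a list of inputs, so requests can be grouped into batches. The sketch below shows only the batching logic: `embed_batch` is a stand-in callable (for example, a thin wrapper around `client.embeddings.create(input=batch, model=...)`), `embed_in_batches` and `fake_embed` are hypothetical names, and the batch size is an arbitrary assumption.

```python
from typing import Callable

def embed_in_batches(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed texts in groups of batch_size, preserving the input order."""
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_batch(batch))
    return embeddings

# Stand-in embedder for demonstration: one fake 1-dimensional vector per text.
def fake_embed(batch: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)
```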


Create collection

In [18]:
db_embeddings = client_persistent.get_or_create_collection(
    name='movies_db_embeddings_openai', 
    embedding_function= openai_function
)

Add rows to vector database

In [19]:
db_embeddings.add(
    ids=df['ids'].tolist(),
    embeddings=df['embeddings'].tolist(),
    metadatas=df.drop(['ids','embeddings','text'],axis=1).to_dict('records')
)

## LangChain
LangChain is a framework for developing applications powered by language models. 

It is very useful for building RAG pipelines.

In [1]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("../data/unstructured/atenttion_is_all_you_need.pdf")
data = loader.load()
print(f"This document has {len(data)} pages")

This document has 15 pages


### Text splitters

We need to break documents into smaller chunks. This helps both when indexing the data and when passing it to a model: large chunks are harder to search over and may not fit in the model's context window.

A common starting point is chunks of 500 to 1,000 characters.
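
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines); `split_with_overlap` is a hypothetical helper used only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of chunk_size characters; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "x" * 1200
chunks = split_with_overlap(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.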

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    length_function=len,
    chunk_overlap=50
)

documents = text_splitter.split_documents(data)
print(f"Now we have {len(documents)} documents!")

Now we have 93 documents!


Generate Embeddings

In [6]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Instantiate OpenAI Embeddings
embedding_openai = OpenAIEmbeddings(model = "text-embedding-ada-002")

# Create Vector Store with Embeddings
vector_store = Chroma.from_documents(
    documents = documents,
    embedding = embedding_openai,
    persist_directory = "../data/data_embeddings/attention-embeddings"
)

vector_store.persist()

Let's test it with a query

In [14]:
query = "What are transformers?"

docs = vector_store.similarity_search_with_score(query, k=5)
docs

[(Document(page_content='The Transformer allows for significantly more parallelization and can reach a new state of the art in\ntranslation quality after being trained for as little as twelve hours on eight P100 GPUs.\n2 Background\nThe goal of reducing sequential computation also forms the foundation of the Extended Neural GPU\n[16], ByteNet [ 18] and ConvS2S [ 9], all of which use convolutional neural networks as basic building', metadata={'page': 1, 'source': '../data/unstructured/atenttion_is_all_you_need.pdf'}),
  0.37473738402405193),
 (Document(page_content='language modeling tasks [34].\nTo the best of our knowledge, however, the Transformer is the first transduction model relying\nentirely on self-attention to compute representations of its input and output without using sequence-\naligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate\nself-attention and discuss its advantages over models such as [17, 18] and [9].\n3 Model Architect

Let's create a retriever

In [31]:
retriever = vector_store.as_retriever(
    search_kwargs = {"k":4}
)
query = "What are transformers?"
retriever.get_relevant_documents(query)

[Document(page_content='The Transformer allows for significantly more parallelization and can reach a new state of the art in\ntranslation quality after being trained for as little as twelve hours on eight P100 GPUs.\n2 Background\nThe goal of reducing sequential computation also forms the foundation of the Extended Neural GPU\n[16], ByteNet [ 18] and ConvS2S [ 9], all of which use convolutional neural networks as basic building', metadata={'page': 1, 'source': '../data/unstructured/atenttion_is_all_you_need.pdf'}),
 Document(page_content='language modeling tasks [34].\nTo the best of our knowledge, however, the Transformer is the first transduction model relying\nentirely on self-attention to compute representations of its input and output without using sequence-\naligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate\nself-attention and discuss its advantages over models such as [17, 18] and [9].\n3 Model Architecture', metadata={'page': 1,

Finally, we can create a chain to ask questions based on our knowledge base (the vector store)

In [32]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(
    model_name = "gpt-3.5-turbo",
    temperature = 0.0
)

qa_chain_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm = llm,
    chain_type = "stuff",
    retriever = retriever,
    return_source_documents=True,
    verbose = True
)
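
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below illustrates that idea in plain Python; the prompt wording, the sample chunks, and the `build_stuff_prompt` helper are assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(question: str, chunks: list[dict]) -> str:
    """Concatenate every retrieved chunk, tagged with its source, into one prompt."""
    context = "\n\n".join(
        f"Content: {chunk['text']}\nSource: {chunk['source']}" for chunk in chunks
    )
    return (
        "Answer the question using only the context below and cite the sources.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical retrieved chunks standing in for the retriever's output.
retrieved = [
    {"text": "The Transformer relies entirely on self-attention.", "source": "paper.pdf"},
    {"text": "It allows for significantly more parallelization.", "source": "paper.pdf"},
]
prompt = build_stuff_prompt("What are transformers?", retrieved)
print(prompt)
```

Because everything goes into one prompt, the number of retrieved documents (`k`) and the chunk size together determine how much context the model receives.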

Let's use our RAG pipeline 😎

In [33]:
qa_chain_with_sources.invoke({"question":"What are transformers?"})



> Entering new RetrievalQAWithSourcesChain chain...

> Finished chain.


{'question': 'What are transformers?',
 'answer': 'Transformers are a type of sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. They allow for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on GPUs.\n',
 'sources': '../data/unstructured/atenttion_is_all_you_need.pdf',
 'source_documents': [Document(page_content='The Transformer allows for significantly more parallelization and can reach a new state of the art in\ntranslation quality after being trained for as little as twelve hours on eight P100 GPUs.\n2 Background\nThe goal of reducing sequential computation also forms the foundation of the Extended Neural GPU\n[16], ByteNet [ 18] and ConvS2S [ 9], all of which use convolutional neural networks as basic building', metadata={'page': 1, 'source': '../data/unstructured/