## RAG Architecture

A brief introduction and demo

Let's explore our dataset. It contains movie information scraped from the IMDb website and is available on Kaggle at:
https://www.kaggle.com/datasets/utsh0dey/25k-movie-dataset

In [1]:
import pandas as pd

df = pd.read_csv("../data/unprocessed/25k-imdb-movie-dataset.csv")
df.head(3)

Unnamed: 0,movie title,Run Time,Rating,User Rating,Generes,Overview,Plot Kyeword,Director,Top 5 Casts,Writer,year,path
0,Top Gun: Maverick,"$170,000,000 (estimated)",8.6,187K,"['Action', 'Drama']",After more than thirty years of service as one...,"['fighter jet', 'sequel', 'u.s. navy', 'fighte...",Joseph Kosinski,"['Jack Epps Jr.', 'Peter Craig', 'Tom Cruise',...",Jim Cash,-2022,/title/tt1745960/
1,Jurassic World Dominion,2 hours 27 minutes,6.0,56K,"['Action', 'Adventure', 'Sci-Fi']",Four years after the destruction of Isla Nubla...,"['dinosaur', 'jurassic park', 'tyrannosaurus r...",Colin Trevorrow,"['Colin Trevorrow', 'Derek Connolly', 'Chris P...",Emily Carmichael,-2022,/title/tt8041270/
2,Top Gun,"$15,000,000 (estimated)",6.9,380K,"['Action', 'Drama']",As students at the United States Navy's elite ...,"['pilot', 'male camaraderie', 'u.s. navy', 'gr...",Tony Scott,"['Jack Epps Jr.', 'Ehud Yonay', 'Tom Cruise', ...",Jim Cash,-1986,/title/tt0092099/


Let's do some transformations

In [2]:
from ast import literal_eval

def concat_list(list_: str) -> str:
  """Parse a stringified list (e.g. "['a', 'b']") and join its items with ' '."""
  list_ = literal_eval(list_)
  return ' '.join(list_)

def string_to_list(string: str) -> list:
  """Parse a stringified list into a real Python list with literal_eval."""
  list_ = literal_eval(string)
  return list_
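
The list-like columns arrive from the CSV as plain strings, which is why both helpers go through `literal_eval`. One caveat: `literal_eval` raises on a blank or malformed cell, so a row with a missing value would stop the `apply`. Below is a minimal sketch of a more tolerant variant. The name `safe_concat` and the sample strings are made up for illustration and are not part of the original notebook.

```python
from ast import literal_eval


def safe_concat(cell: str) -> str:
    """Hypothetical variant of concat_list that tolerates blank or malformed cells."""
    try:
        items = literal_eval(cell)
    except (ValueError, SyntaxError):
        # Blank strings and free text are not valid Python literals.
        return ""
    if not isinstance(items, (list, tuple)):
        # A valid literal that is not a sequence (e.g. a number) is ignored.
        return ""
    return " ".join(str(item) for item in items)


# Made-up sample cells, shaped like the 'Plot Kyeword' values shown above.
good_cell = "['fighter jet', 'sequel', 'u.s. navy']"
blank_cell = " "

joined = safe_concat(good_cell)
empty = safe_concat(blank_cell)
print(repr(joined))
print(repr(empty))
```

Whether an empty string is the right fallback depends on how you want missing keywords or cast lists to affect the final `text` field.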

In [3]:
# Fill NAs, then clean keywords, stars, genres and ratings
df = df.fillna(' ')
df['Keywords'] = df['Plot Kyeword'].apply(concat_list)
df['Stars'] = df['Top 5 Casts'].apply(concat_list)
df['Generes'] = df['Generes'].apply(string_to_list)
df['Rating'] = pd.to_numeric(df['Rating'], errors="coerce").fillna(0).astype("float")

# Concatenate overview, keywords and stars into one richer description
df['text'] = df['Overview'].astype(str) + ' ' + df['Keywords'] + ' ' + df['Stars']

Drop the source columns that are no longer needed

In [6]:
df = df.drop(columns=['Plot Kyeword', 'Top 5 Casts'])

Generate Embeddings

In [None]:
# Use sentence_transformers if you have a GPU; otherwise you can use OpenAI embeddings via the API
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Generate embeddings (pass a plain list of strings rather than a pandas Series)
embeddings = model.encode(df['text'].tolist(), batch_size=64, show_progress_bar=True)

In [None]:
# Assign embeddings to a column
df['embeddings'] = embeddings.tolist()

# The vector store needs a string id for every record
df['ids'] = df.index
df['ids'] = df['ids'].astype('str')

In [6]:
# TODO: delete this cell; it reloads the precomputed embeddings from disk
df = pd.read_csv("../data/processed/embeddings_dataset.csv")
df.head(2)

Unnamed: 0,movie title,Run Time,Rating,User Rating,Generes,Overview,Director,Writer,year,path,Keywords,Stars,text,ids,embeddings
0,Top Gun: Maverick,"$170,000,000 (estimated)",8.6,187K,"['Action', 'Drama']",After more than thirty years of service as one...,Joseph Kosinski,Jim Cash,-2022,/title/tt1745960/,fighter jet sequel u.s. navy fighter aircraft ...,Jack Epps Jr. Peter Craig Tom Cruise Jennifer ...,After more than thirty years of service as one...,0,"[-0.07095592468976974, -0.009481011889874935, ..."
1,Jurassic World Dominion,2 hours 27 minutes,6.0,56K,"['Action', 'Adventure', 'Sci-Fi']",Four years after the destruction of Isla Nubla...,Colin Trevorrow,Emily Carmichael,-2022,/title/tt8041270/,dinosaur jurassic park tyrannosaurus rex veloc...,Colin Trevorrow Derek Connolly Chris Pratt Bry...,Four years after the destruction of Isla Nubla...,1,"[-0.0253621693700552, -0.06149572879076004, 0...."
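
Note that after a round trip through CSV the `embeddings` column comes back as text (the values above are quoted strings), so it has to be parsed into numeric lists again before it can be handed to a vector store. A minimal sketch, assuming each cell is a bracketed, comma-separated list of finite numbers; the helper name `parse_embedding` and the short sample vector are hypothetical.

```python
import json


def parse_embedding(cell: str) -> list[float]:
    """Turn a stringified vector such as '[0.1, -0.2]' back into a list of floats."""
    return [float(value) for value in json.loads(cell)]


# Made-up three-dimensional sample; real vectors from the model are much longer.
sample_cell = "[-0.0709, -0.0094, 0.0312]"
vector = parse_embedding(sample_cell)
print(vector)

# With the DataFrame loaded above, the same helper could be applied column-wide:
# df['embeddings'] = df['embeddings'].apply(parse_embedding)
```

The `Generes` column is in the same situation: it is read back as a string and can be restored with the `string_to_list` helper defined earlier.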


## Vector Store or Vector Database

In [3]:
import chromadb
from chromadb.utils import embedding_functions