<a href="https://colab.research.google.com/github/muqadasuet41/RAG-Movie-SelectionFromCsv/blob/main/RAG_Movie_MFA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Step 1: Install required libraries
!pip install pandas faiss-cpu sentence-transformers transformers



In [None]:

# Step 2: Import libraries
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss
from transformers import pipeline

# Step 3: Load the CSV file and preview the first rows
# Replace '/content/movies.csv' with the path to your CSV file
df = pd.read_csv('/content/movies.csv')
print("Data loaded successfully:\n", df.head())

# Step 4: Create embeddings for each entry in the CSV file
# all-MiniLM-L6-v2 is a small Sentence Transformer model that produces 384-dimensional embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Build one string per row by joining every column; these strings are what gets embedded
data_texts = (
    df['Film'].fillna('') + " " + df['Genre'].fillna('') + " " + df['Lead Studio'].fillna('')
    + " " + df['Audience score %'].astype(str) + " " + df['Profitability'].astype(str)
    + " " + df['Rotten Tomatoes %'].astype(str) + " " + df['Worldwide Gross'].astype(str)
    + " " + df['Year'].astype(str)
)

# Generate embeddings
embeddings = embedding_model.encode(data_texts.tolist(), show_progress_bar=True)

# Step 5: Set up FAISS for similarity search
# Initialize FAISS index
embedding_dim = embeddings.shape[1]  # Dimensionality of embeddings
index = faiss.IndexFlatL2(embedding_dim)  # L2 distance index
index.add(embeddings)  # Add embeddings to the index

# Step 6: Retrieval-augmented generation (RAG)
# Set up the generator model
generator = pipeline("text-generation", model="gpt2", max_new_tokens=50)  # Use max_new_tokens to limit output length

def retrieve_and_generate(query, top_k=5):
    # Step 1: Encode the query
    query_embedding = embedding_model.encode([query])

    # Step 2: Retrieve top_k similar entries
    _, top_k_indices = index.search(query_embedding, top_k)
    retrieved_texts = data_texts.iloc[top_k_indices[0]].values  # Retrieve top_k entries

    # Step 3: Concatenate retrieved information for generation prompt
    prompt = " ".join(retrieved_texts) + " Based on the information provided, " + query

    # Step 4: Generate response using the generator model
    response = generator(prompt, max_new_tokens=50, num_return_sequences=1)  # Generate with limited output tokens

    # Step 5: Return the generated text (by default this is the prompt followed by the continuation)
    return response[0]['generated_text']

# Test the RAG function
query = "tell me the names of comedy movies"
generated_response = retrieve_and_generate(query)
print("Generated response:\n", generated_response)


Data loaded successfully:
                                  Film    Genre            Lead Studio  \
0          Zack and Miri Make a Porno  Romance  The Weinstein Company   
1                     Youth in Revolt   Comedy  The Weinstein Company   
2  You Will Meet a Tall Dark Stranger   Comedy            Independent   
3                        When in Rome   Comedy                 Disney   
4               What Happens in Vegas   Comedy                    Fox   

   Audience score %  Profitability  Rotten Tomatoes % Worldwide Gross  Year  
0                70       1.747542                 64         $41.94   2008  
1                52       1.090000                 68         $19.62   2010  
2                35       1.211818                 43         $26.66   2010  
3                44       0.000000                 15         $43.04   2010  
4                72       6.267647                 28        $219.37   2008  





Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated response:
 Mamma Mia! Comedy Universal 76 9.234453864 53 $609.47  2008 Mamma Mia! Comedy Universal 76 9.234453864 53 $609.47  2008 It's Complicated Comedy Universal 63 2.642352941 56 $224.60  2009 The Invention of Lying Comedy Warner Bros. 47 1.751351351 56 $32.40  2009 Beginners Comedy Independent 80 4.471875 84 $14.31  2011 Based on the information provided, tell me the names of comedy movies that have made it in the past (and/or that we'll never see again...) comedy 80 4.471875 84 $14.31  2011 An Introduction to Modern Comedy Comedy Independent 65 6.558714090 64 $22
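The response above repeats the whole prompt, and the retrieved context contains the same "Mamma Mia!" row twice. A minimal sketch of one way to tidy this up is shown below, using only the standard library. The helper names `dedupe_preserve_order`, `build_prompt`, and `strip_prompt` and the `Movies:` / `Answer:` prompt layout are assumptions made for illustration, not part of the notebook.

```python
def dedupe_preserve_order(rows):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def build_prompt(rows, query):
    """Put each unique retrieved row on its own line, then append the question."""
    context = "\n".join(f"- {row}" for row in dedupe_preserve_order(rows))
    return f"Movies:\n{context}\n\nBased on the information provided, {query}\nAnswer:"


def strip_prompt(generated_text, prompt):
    """Return only the continuation if the model output starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


# Rows copied from the generated response above (note the repeated first row).
retrieved = [
    "Mamma Mia! Comedy Universal 76 9.234453864 53 $609.47  2008",
    "Mamma Mia! Comedy Universal 76 9.234453864 53 $609.47  2008",
    "It's Complicated Comedy Universal 63 2.642352941 56 $224.60  2009",
]
prompt = build_prompt(retrieved, "tell me the names of comedy movies")
print(prompt)
```

Inside `retrieve_and_generate`, `build_prompt(retrieved_texts, query)` could stand in for the string concatenation, and `strip_prompt(response[0]['generated_text'], prompt)` for the final return value. The `text-generation` pipeline also accepts `return_full_text=False`, which should have a similar effect to `strip_prompt`. Neither change stops a small model such as GPT-2 from inventing titles that are not in the CSV.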
