**Document Summarization using Retrieval-Augmented Generation (RAG)**
**Step 1: Document Ingestion**

In [1]:
!pip install transformers datasets langchain faiss-cpu sentence-transformers


In [2]:
!pip install requests beautifulsoup4



**Load Base Anime Dataset:**

In [3]:
##Load Base Anime Dataset:
import pandas as pd

# Correct file paths (adjust filenames as necessary)
animes_df = pd.read_csv("/kaggle/input/myanimelist-dataset-animes-profiles-reviews/animes.csv")
profiles_df = pd.read_csv("/kaggle/input/myanimelist-dataset-animes-profiles-reviews/profiles.csv")
reviews_df = pd.read_csv("/kaggle/input/myanimelist-dataset-animes-profiles-reviews/reviews.csv")

# Print summary
print(f"Loaded {len(animes_df)} anime entries")
print(f"Loaded {len(profiles_df)} profiles")
print(f"Loaded {len(reviews_df)} reviews")
print("Animes columns:", animes_df.columns.tolist())
print("Profiles columns:", profiles_df.columns.tolist())
print("Reviews columns:", reviews_df.columns.tolist())


Loaded 19311 anime entries
Loaded 81727 profiles
Loaded 192112 reviews
Animes columns: ['uid', 'title', 'synopsis', 'genre', 'aired', 'episodes', 'members', 'popularity', 'ranked', 'score', 'img_url', 'link']
Profiles columns: ['profile', 'gender', 'birthday', 'favorites_anime', 'link']
Reviews columns: ['uid', 'profile', 'anime_uid', 'text', 'score', 'scores', 'link']


**Filter Character Data from the MyAnimeList Dataset:**

In [4]:
## Filter character data from the MyAnimeList dataset (no live scraping is performed):

import pandas as pd
import re
from bs4 import BeautifulSoup

# Load datasets (already done, but included for completeness)
animes_df = pd.read_csv("/kaggle/input/myanimelist-dataset-animes-profiles-reviews/animes.csv")
reviews_df = pd.read_csv("/kaggle/input/myanimelist-dataset-animes-profiles-reviews/reviews.csv")

# Merge animes and reviews on anime_uid
merged_df = reviews_df.merge(animes_df[['uid', 'title', 'synopsis']], left_on='anime_uid', right_on='uid', how='left')

# Clean text function
def clean_text(text):
    if pd.isna(text):
        return ""
    text = BeautifulSoup(text, "html.parser").get_text()  # Remove HTML tags
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace
    return text

# Apply cleaning to synopsis and review text
merged_df['synopsis'] = merged_df['synopsis'].apply(clean_text)
merged_df['text'] = merged_df['text'].apply(clean_text)

# Combine synopsis and review text for each anime
merged_df['combined_text'] = merged_df['synopsis'] + " " + merged_df['text']

# Filter documents mentioning the character
def filter_by_character(df, character_name):
    return df[df['combined_text'].str.contains(character_name, case=False, na=False)]

# Example: Filter for Naruto Uzumaki
character_name = "Naruto Uzumaki"
character_docs = filter_by_character(merged_df, character_name)

# Save filtered documents to CSV to inspect and use later
character_docs.to_csv('/kaggle/working/character_docs.csv', index=False)

# Print results
print(f"Found {len(character_docs)} documents for {character_name}")
print("Sample document (first 200 characters):")
if len(character_docs) > 0:
    print(character_docs['combined_text'].iloc[0][:200] + "...")
else:
    print("No documents found.")
print("Columns in character_docs:", character_docs.columns.tolist())

Found 3123 documents for Naruto Uzumaki
Sample document (first 200 characters):
Moments prior to Naruto Uzumaki's birth, a huge demon known as the Kyuubi, the Nine-Tailed Fox, attacked Konohagakure, the Hidden Leaf Village, and wreaked havoc. In order to put an end to the Kyuubi'...
Columns in character_docs: ['uid_x', 'profile', 'anime_uid', 'text', 'score', 'scores', 'link', 'uid_y', 'title', 'synopsis', 'combined_text']
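
The filter above hands the character name to `str.contains`, which treats its pattern as a regular expression by default. A minimal sketch of a literal, case-insensitive match using only the standard library follows. The helper name `mentions_character` and the sample strings are hypothetical, not part of the notebook. It also shows that matching on the full name skips documents that mention only "Naruto".

```python
import re

def mentions_character(text, character_name):
    """Return True if text contains character_name as a literal, case-insensitive match."""
    if not text:
        return False
    # re.escape makes characters such as "." or "(" match literally
    pattern = re.escape(character_name)
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Hypothetical sample documents for illustration only
docs = [
    "Moments prior to Naruto Uzumaki's birth, a huge demon attacked.",
    "NARUTO UZUMAKI is a knuckle-headed ninja.",
    "Naruto wants to be Hokage.",  # first name only, so the full-name filter skips it
    "Edward Elric is an alchemist.",
]
matches = [d for d in docs if mentions_character(d, "Naruto Uzumaki")]
print(len(matches))  # 2
```

If documents that use only a first name should count, the name could be split into parts and any part matched, at the cost of more false positives.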


**Adding Character Reviews Too**

In [5]:
## Adding character review too

import pandas as pd

# Load the filtered documents from Step 1
character_docs = pd.read_csv('/kaggle/working/character_docs.csv')

# Print details of a sample document to confirm synopsis and review inclusion
print(f"Total documents for Naruto Uzumaki: {len(character_docs)}")
print("\nSample document details:")
if len(character_docs) > 0:
    sample_doc = character_docs.iloc[0]
    print(f"Anime Title: {sample_doc['title']}")
    print(f"Synopsis (first 200 chars): {sample_doc['synopsis'][:200]}...")
    print(f"Review Text (first 200 chars): {sample_doc['text'][:200]}...")
    print(f"Combined Text (first 400 chars): {sample_doc['combined_text'][:400]}...")
else:
    print("No documents found.")

Total documents for Naruto Uzumaki: 3123

Sample document details:
Anime Title: Naruto
Synopsis (first 200 chars): Moments prior to Naruto Uzumaki's birth, a huge demon known as the Kyuubi, the Nine-Tailed Fox, attacked Konohagakure, the Hidden Leaf Village, and wreaked havoc. In order to put an end to the Kyuubi'...
Review Text (first 200 chars): more pics Overall 5 Story 5 Animation 5 Sound 7 Character 4 Enjoyment 8 **Has Been Re-Edited, Watched as of Episode 62** Story - 5.6/10 (F+) What story? Oh the Ninja boy wants to be the best. Where ha...
Combined Text (first 400 chars): Moments prior to Naruto Uzumaki's birth, a huge demon known as the Kyuubi, the Nine-Tailed Fox, attacked Konohagakure, the Hidden Leaf Village, and wreaked havoc. In order to put an end to the Kyuubi's rampage, the leader of the village, the Fourth Hokage, sacrificed his life and sealed the monstrous beast inside the newborn Naruto. Now, Naruto is a hyperactive and knuckle-headed ninja still livin...


**Step 2: Embedding & Retrieval**

In [6]:
%%time
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pandas as pd
import json
# Step 2: Embedding & Retrieval

# Load filtered documents
character_docs = pd.read_csv('/kaggle/working/character_docs.csv')
print(f"Original document count: {len(character_docs)}")

# Deduplicate documents based on combined_text
character_docs = character_docs.drop_duplicates(subset=['combined_text'])
print(f"Document count after deduplicating combined_text: {len(character_docs)}")

# Group reviews by anime_uid to consolidate reviews per anime
# For each anime_uid, combine all reviews into a single text field
grouped_docs = character_docs.groupby('anime_uid').agg({
    'title': 'first',  # Take the first title (should be consistent per anime_uid)
    'synopsis': 'first',  # Take the first synopsis (consistent per anime_uid)
    'text': lambda x: " ".join(x),  # Concatenate all reviews
    'combined_text': lambda x: " ".join(x)  # Concatenate all combined_text (synopsis + reviews)
}).reset_index()
print(f"Document count after grouping by anime_uid: {len(grouped_docs)}")

# Save deduplicated and grouped documents for reference
grouped_docs.to_csv('/kaggle/working/grouped_docs.csv', index=False)

# Initialize sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for deduplicated documents
documents = grouped_docs['combined_text'].tolist()
print(f"Generating embeddings for {len(documents)} documents...")
embeddings = model.encode(documents, show_progress_bar=True, batch_size=32)

# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

# Query embedding for character name
character_name = "Naruto Uzumaki"
query_embedding = model.encode([character_name])

# Retrieve top-k documents (increased k for diversity)
k = 10
distances, indices = index.search(query_embedding, k)

# Get relevant documents, attaching each one's distance before filtering
# so the distance stays paired with its document after duplicates are dropped
retrieved_docs = grouped_docs.iloc[indices[0]][['title', 'synopsis', 'text', 'combined_text']].to_dict('records')
for doc, dist in zip(retrieved_docs, distances[0]):
    doc['distance'] = float(dist)  # lower distance means a closer match

# Filter for diversity: Keep only one document per unique title
unique_titles = []
diverse_docs = []
for doc in retrieved_docs:
    if doc['title'] not in unique_titles:
        unique_titles.append(doc['title'])
        diverse_docs.append(doc)
retrieved_docs = diverse_docs[:5]  # Keep up to 5 diverse documents
print(f"Selected {len(retrieved_docs)} diverse documents")

# Save retrieved documents for Step 3
with open('/kaggle/working/retrieved_docs.json', 'w') as f:
    json.dump(retrieved_docs, f, indent=2)

# Print results
print(f"\nTop {len(retrieved_docs)} retrieved documents for {character_name}:")
for i, doc in enumerate(retrieved_docs):
    print(f"{i+1}. {doc['title']}")
    print(f"  Synopsis (first 200 chars): {doc['synopsis'][:200]}...")
    print(f"  Review (first 200 chars): {doc['text'][:200]}...")
    print(f"  Distance: {doc['distance']:.4f}")

# Print shape of embeddings
print(f"Embeddings shape: {embeddings.shape}")


Original document count: 3123
Document count after deduplicating combined_text: 927
Document count after grouping by anime_uid: 20


Generating embeddings for 20 documents...


Selected 5 diverse documents

Top 5 retrieved documents for Naruto Uzumaki:
1. Naruto: Shippuuden
  Synopsis (first 200 chars): It has been two and a half years since Naruto Uzumaki left Konohagakure, the Hidden Leaf Village, for intense training following events which fueled his desire to be stronger. Now Akatsuki, the myster...
  Review (first 200 chars): more pics Overall 10 Story 10 Animation 10 Sound 10 Character 10 Enjoyment 10 If you want to be hung in suspense Shippuuden is for you. One of the best anime pick-upd ever. Naruto Shippuuden this is t...
  Distance: 0.6966
2. Boruto: Naruto the Movie
  Synopsis (first 200 chars): The spirited Boruto Uzumaki, son of Seventh Hokage Naruto, is a skilled ninja who possesses the same brashness and passion his father once had. However, the constant absence of his father, who is busy...
  Review (first 200 chars): more pics Overall 7 Story 6 Animation 8 Sound 8 Character 7 Enjoyment 0 I really wanted to review this movie, because it's offi
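
The `Distance` values printed above come from `faiss.IndexFlatL2`, which reports squared Euclidean distances, so lower means closer. If the embeddings are unit-length (an assumption here; it could be enforced by passing `normalize_embeddings=True` to `encode`), a squared distance d² maps to cosine similarity as 1 − d²/2. The sketch below checks that identity on toy vectors with the standard library only. The vectors and helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance, the quantity IndexFlatL2 reports."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(d2):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - d2 / 2.0

# Toy vectors standing in for a query embedding and a document embedding
query = l2_normalize([1.0, 2.0, 2.0])
doc = l2_normalize([2.0, 1.0, 2.0])

d2 = squared_l2(query, doc)
cos_direct = sum(x * y for x, y in zip(query, doc))
cos_from_d2 = cosine_from_squared_l2(d2)
print(round(d2, 4), round(cos_direct, 4), round(cos_from_d2, 4))
```

Under the same assumption, the top result's 0.6966 would correspond to a cosine similarity of roughly 0.65.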

**Step 3: Summary Generation**

In [7]:
%%time 
from transformers import BartTokenizer, BartForConditionalGeneration
import torch
import json
import time
import re

# Step 3: Summary Generation

# Load retrieved documents
with open('/kaggle/working/retrieved_docs.json', 'r') as f:
    retrieved_docs = json.load(f)

# Enhanced review cleaning to extract coherent opinions
def clean_review_text(text):
    # Remove metadata and noise
    text = re.sub(r'more pics\s*', '', text, flags=re.IGNORECASE)
    text = re.sub(r'(Overall|Story|Animation|Sound|Character|Enjoyment)\s*\d+(\.\d+)?\s*(\w+\s*\d+(\.\d+)?)*', '', text)
    text = re.sub(r'\*\*.*?\*\*|\[.*?\]', '', text)  # Remove markdown
    # Split into sentences and filter for opinions about Naruto
    sentences = [s.strip() for s in text.split('.')]
    opinion_sentences = [s for s in sentences if any(word in s.lower() for word in ['naruto', 'character', 'love', 'great', 'best', 'inspiring', 'epic', 'fan']) and len(s) > 15]
    return ". ".join(opinion_sentences[:3]) + "." if opinion_sentences else "No clear fan opinions found."

# Prepare synopsis and review inputs (use top 3 docs)
max_chars = 400
synopsis_texts = []
review_texts = []
for doc in retrieved_docs[:3]:
    synopsis = doc['synopsis'][:max_chars]
    review = clean_review_text(doc['text'])[:max_chars]
    synopsis_texts.append(synopsis)
    review_texts.append(review)

# Concatenate with prompts
synopsis_prompt = "Summarize the character description of Naruto Uzumaki, focusing on his role and traits."
synopsis_input = synopsis_prompt + " " + " ".join(synopsis_texts)
review_prompt = "Summarize fan opinions about Naruto Uzumaki, highlighting what audiences think of him."
review_input = review_prompt + " " + " ".join(review_texts)

# Initialize BART model and tokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
bart_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

# Move to GPU if available
if torch.cuda.is_available():
    bart_model = bart_model.to('cuda')

# Function to summarize text
def summarize(text, max_len=100, min_len=30):
    inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")
    if torch.cuda.is_available():
        inputs = {k: v.to('cuda') for k, v in inputs.items()}
    summary_ids = bart_model.generate(
        inputs['input_ids'],
        max_length=max_len,
        min_length=min_len,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True), len(inputs['input_ids'][0])

# Summarize synopsis
start_time = time.time()
synopsis_summary, synopsis_tokens = summarize(synopsis_input)
synopsis_latency = time.time() - start_time

# Summarize reviews
start_time = time.time()
review_summary, review_tokens = summarize(review_input)
review_latency = time.time() - start_time

# Combine summaries, separated by a blank line
final_summary = f"{synopsis_summary}\n\n{review_summary}"

# Total metrics
total_tokens = synopsis_tokens + review_tokens
total_latency = synopsis_latency + review_latency

# Save summary and metrics
output = {
    "character_name": "Naruto Uzumaki",
    "synopsis_summary": synopsis_summary,
    "review_summary": review_summary,
    "final_summary": final_summary,
    "token_usage": int(total_tokens),
    "latency": total_latency
}
with open('/kaggle/working/summary_output.json', 'w') as f:
    json.dump(output, f, indent=2)

# Print results with boxed format
print("\nSummary for Naruto Uzumaki:")
print("┌" + "─" * 80 + "┐")
print(f"│ Description of the character: {synopsis_summary:<72} │")
print("├" + "─" * 80 + "┤")
print(f"│ Reviews of the Audience on character: {review_summary:<76} │")
print("└" + "─" * 80 + "┘")
print(f"│ Combined review of the character: {final_summary:<76} │")
print("└" + "─" * 80 + "┘")
print(f"Token count: {total_tokens}")
print(f"Latency: {total_latency:.2f} seconds")


Summary for Naruto Uzumaki:
┌────────────────────────────────────────────────────────────────────────────────┐
│ Description of the character: It has been two and a half years since Naruto Uzumaki left Konohagakure, the Hidden Leaf Village, for intense training following events which fueled his desire to be stronger. Now Akatsuki, the mysterious organization of elite rogue ninja, is closing in on their grand plan which may threaten the safety of the entire shinobi world. │
├────────────────────────────────────────────────────────────────────────────────┤
│ Reviews of the Audience on character: Summarize fan opinions about Naruto Uzumaki, highlighting what audiences think of him. One of the best anime pick-upd ever. A show with action, Drama, and Fantasy. │
└────────────────────────────────────────────────────────────────────────────────┘
│ Combined review of the character: It has been two and a half years since Naruto Uzumaki left Konohagakure, the Hidden Leaf Village, for intense tra
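
In the output above, the audience summary begins with the instruction sentence itself. `facebook/bart-large-cnn` is a summarization model rather than an instruction-following one, so it tends to treat the prompt as part of the text to summarize. One possible workaround, sketched below under that assumption, is to strip a leading copy of the prompt from the generated text. The helper name `strip_prompt_echo` is hypothetical; dropping the prompt from the model input altogether is another option.

```python
def strip_prompt_echo(summary, prompt):
    """Remove a leading copy of the prompt from a generated summary, if present."""
    cleaned = summary.strip()
    prompt = prompt.strip()
    # Compare case-insensitively, then cut by the prompt's length
    if cleaned.lower().startswith(prompt.lower()):
        cleaned = cleaned[len(prompt):].lstrip()
    return cleaned

review_prompt = "Summarize fan opinions about Naruto Uzumaki, highlighting what audiences think of him."
# Text copied from the recorded output above
review_summary = (
    "Summarize fan opinions about Naruto Uzumaki, highlighting what audiences think of him. "
    "One of the best anime pick-upd ever. A show with action, Drama, and Fantasy."
)
print(strip_prompt_echo(review_summary, review_prompt))
```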

**Step 4: Output Presentation**

In [8]:
%%time
import pandas as pd
import json
import re  # needed by clean_review_snippet below
from IPython.display import Image, display

# Step 4: Output Presentation


# Load summary output
with open('/kaggle/working/summary_output.json', 'r') as f:
    summary_output = json.load(f)

# Load retrieved documents
with open('/kaggle/working/retrieved_docs.json', 'r') as f:
    retrieved_docs = json.load(f)

# Load animes.csv for image URL
animes_df = pd.read_csv("/kaggle/input/myanimelist-dataset-animes-profiles-reviews/animes.csv")

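# Distances copied by hand from the Step 2 output (FAISS L2 distances: lower means closer).
# These values are hard-coded, so update them whenever Step 2 is re-run.
distances = [0.6966, 0.8878, 0.9313, 0.9402, 0.9446]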

# Clean review snippets for display
def clean_review_snippet(text):
    text = re.sub(r'more pics\s*', '', text, flags=re.IGNORECASE)
    text = re.sub(r'(Overall|Story|Animation|Sound|Character|Enjoyment)\s*\d+(\.\d+)?\s*(\w+\s*\d+(\.\d+)?)*', '', text)
    text = re.sub(r'\*\*.*?\*\*|\[.*?\]', '', text)
    return text[:100] + "..."

# Prepare output
output = {
    "character_name": summary_output["character_name"],
    "anime_titles": [doc["title"] for doc in retrieved_docs],
    "summary": summary_output["final_summary"],
    "context": [
        {
            "title": doc["title"],
            "synopsis_snippet": doc["synopsis"][:100] + "...",
            "review_snippet": clean_review_snippet(doc["text"]),
            "similarity_score": distances[i]
        } for i, doc in enumerate(retrieved_docs)
    ],
    "token_usage": summary_output["token_usage"],
    "latency": summary_output["latency"],
    "image_url": animes_df[animes_df["title"] == retrieved_docs[0]["title"]]["img_url"].iloc[0]
}

# Print output with boxed format
print("\n=== Anime Character RAG Summarization Output ===")
print("┌" + "─" * 100 + "┐")
print(f"│ Character: {output['character_name']:<90} │")
print("├" + "─" * 100 + "┤")
print(f"│ Anime Titles: {', '.join(output['anime_titles'][:3]) + ('...' if len(output['anime_titles']) > 3 else ''):<90} │")
print("├" + "─" * 100 + "┤")
print(f"│ Summary:                                                                                      │")
for line in output["summary"].split("\n"):
    print(f"│   {line[:96]:<96} │")
print("├" + "─" * 100 + "┤")
print(f"│ Context:                                                                                      │")
for ctx in output["context"]:
    print(f"│   {ctx['title']}: Synopsis: {ctx['synopsis_snippet'][:46]}... Review: {ctx['review_snippet'][:46]}... (Score: {ctx['similarity_score']:.4f}) │")
print("├" + "─" * 100 + "┤")
print(f"│ Token Usage: {output['token_usage']:<88} │")
print(f"│ Latency: {output['latency']:.2f} seconds{'':<76} │")
print(f"│ Image URL: {output['image_url'][:88]:<88} │")
print("└" + "─" * 100 + "┘")

# Display image (if valid URL)
try:
    display(Image(url=output["image_url"]))
except Exception:
    print("Image could not be displayed. Check the URL:", output["image_url"])

# Save final output
with open('/kaggle/working/final_output.json', 'w') as f:
    json.dump(output, f, indent=2)


=== Anime Character RAG Summarization Output ===
┌────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Character: Naruto Uzumaki                                                                             │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Anime Titles: Naruto: Shippuuden, Boruto: Naruto the Movie, Naruto...                                    │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Summary:                                                                                      │
│   It has been two and a half years since Naruto Uzumaki left Konohagakure, the Hidden Leaf Village │
│                                                                                                    │
│   Summarize fan opinions about Naruto Uzumaki, highlighting what audiences think of him. One of th │
├──────────────────

CPU times: user 248 ms, sys: 47 ms, total: 295 ms
Wall time: 297 ms
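
Format specifiers such as `:<90` pad short strings but never truncate or wrap long ones, which is why some rows in the box above run past the right border. A minimal sketch of a box printer that wraps its content with `textwrap` is shown below. The function name `boxed` and the default width are hypothetical choices.

```python
import textwrap

def boxed(title, text, width=60):
    """Return text wrapped inside a box whose rows all have the same length."""
    inner = width - 2  # one space of padding on each side
    lines = ["┌" + "─" * width + "┐"]
    # textwrap.wrap returns [] for empty input, so fall back to one blank row
    for line in textwrap.wrap(f"{title}: {text}", width=inner) or [""]:
        lines.append("│ " + line.ljust(inner) + " │")
    lines.append("└" + "─" * width + "┘")
    return "\n".join(lines)

print(boxed(
    "Summary",
    "It has been two and a half years since Naruto Uzumaki left Konohagakure, "
    "the Hidden Leaf Village, for intense training.",
))
```

This assumes a monospaced display where each character occupies one column.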


**RAG FOR MULTIPLE CHARACTER RETRIEVAL**

**RAG for Characters Naruto Uzumaki, Ichigo Kurosaki, Edward Elric**

In [9]:
%%time
import pandas as pd
import re
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import BartTokenizer, BartForConditionalGeneration
import torch
import json
import time
from IPython.display import Image, display


# RAG for multiple character retrieval

# Initialize global models
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
bart_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
if torch.cuda.is_available():
    bart_model = bart_model.to('cuda')

# Load datasets
animes_df = pd.read_csv("/kaggle/input/myanimelist-dataset-animes-profiles-reviews/animes.csv")
reviews_df = pd.read_csv("/kaggle/input/myanimelist-dataset-animes-profiles-reviews/reviews.csv")

# Predefined characters
characters = ["Naruto Uzumaki", "Ichigo Kurosaki", "Edward Elric"]
precomputed_data = {}

# Step 1: Document Ingestion
def clean_text(text):
    if pd.isna(text):
        return ""
    text = BeautifulSoup(text, "html.parser").get_text()
    text = re.sub(r'\s+', ' ', text).strip()
    return text

merged_df = reviews_df.merge(animes_df[['uid', 'title', 'synopsis', 'img_url']], left_on='anime_uid', right_on='uid', how='left')
merged_df['synopsis'] = merged_df['synopsis'].apply(clean_text)
merged_df['text'] = merged_df['text'].apply(clean_text)
merged_df['combined_text'] = merged_df['synopsis'] + " " + merged_df['text']

for character in characters:
    print(f"\nProcessing {character}...")
    
    # Filter documents
    character_docs = merged_df[merged_df['combined_text'].str.contains(character, case=False, na=False)]
    character_docs = character_docs.drop_duplicates(subset=['combined_text'])
    
    # Group by anime_uid
    grouped_docs = character_docs.groupby('anime_uid').agg({
        'title': 'first',
        'synopsis': 'first',
        'text': lambda x: " ".join(x),
        'combined_text': lambda x: " ".join(x),
        'img_url': 'first'
    }).reset_index()
    grouped_docs.to_csv(f'/kaggle/working/{character.replace(" ", "_")}_docs.csv', index=False)
    print(f"Documents for {character}: {len(grouped_docs)}")

    # Step 2: Embedding & Retrieval
    documents = grouped_docs['combined_text'].tolist()
    if not documents:
        print(f"No documents found for {character}. Skipping...")
        continue
    
    embeddings = sentence_model.encode(documents, show_progress_bar=True, batch_size=32)
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings)
    
    query_embedding = sentence_model.encode([character])
    k = 10
    distances, indices = index.search(query_embedding, k)
    
    retrieved_docs = grouped_docs.iloc[indices[0]][['title', 'synopsis', 'text', 'combined_text', 'img_url']].to_dict('records')
    unique_titles = []
    diverse_docs = []
    for doc in retrieved_docs:
        if doc['title'] not in unique_titles:
            unique_titles.append(doc['title'])
            diverse_docs.append(doc)
    retrieved_docs = diverse_docs[:5]
    
    with open(f'/kaggle/working/{character.replace(" ", "_")}_retrieved_docs.json', 'w') as f:
        json.dump(retrieved_docs, f, indent=2)
    
    # Step 3: Summary Generation
    def clean_review_text(text):
        text = re.sub(r'more pics\s*', '', text, flags=re.IGNORECASE)
        text = re.sub(r'(Overall|Story|Animation|Sound|Character|Enjoyment)\s*\d+(\.\d+)?\s*(\w+\s*\d+(\.\d+)?)*', '', text)
        text = re.sub(r'\*\*.*?\*\*|\[.*?\]', '', text)
        sentences = [s.strip() for s in text.split('.')]
        opinion_sentences = [s for s in sentences if any(word in s.lower() for word in ['character', 'love', 'great', 'best', 'inspiring', 'epic', 'fan']) and len(s) > 15]
        return ". ".join(opinion_sentences[:3]) + "." if opinion_sentences else "No clear fan opinions found."
    
    max_chars = 400
    synopsis_texts = []
    review_texts = []
    for doc in retrieved_docs[:3]:
        synopsis = doc['synopsis'][:max_chars]
        review = clean_review_text(doc['text'])[:max_chars]
        synopsis_texts.append(synopsis)
        review_texts.append(review)
    
    synopsis_prompt = f"Summarize the character description of {character}, focusing on their role and traits."
    synopsis_input = synopsis_prompt + " " + " ".join(synopsis_texts)
    review_prompt = f"Summarize fan opinions about {character}, highlighting what audiences think."
    review_input = review_prompt + " " + ". ".join(review_texts)
    
    def summarize(text, max_len=100, min_len=30):
        inputs = bart_tokenizer(text, max_length=512, truncation=True, return_tensors="pt")
        if torch.cuda.is_available():
            inputs = {k: v.to('cuda') for k, v in inputs.items()}
        summary_ids = bart_model.generate(
            inputs['input_ids'],
            max_length=max_len,
            min_length=min_len,
            length_penalty=2.0,
            num_beams=4,
            early_stopping=True
        )
        return bart_tokenizer.decode(summary_ids[0], skip_special_tokens=True), len(inputs['input_ids'][0])
    
    start_time = time.time()
    synopsis_summary, synopsis_tokens = summarize(synopsis_input)
    synopsis_latency = time.time() - start_time
    
    start_time = time.time()
    review_summary, review_tokens = summarize(review_input)
    review_latency = time.time() - start_time
    
    final_summary = f"{synopsis_summary}\n\n{review_summary}"
    
    precomputed_data[character] = {
        "synopsis_summary": synopsis_summary,
        "review_summary": review_summary,
        "final_summary": final_summary,
        "token_usage": int(synopsis_tokens + review_tokens),
        "latency": synopsis_latency + review_latency,
        "retrieved_docs": retrieved_docs,
        "distances": distances[0].tolist()
    }

# Save precomputed data
with open('/kaggle/working/precomputed_data.json', 'w') as f:
    json.dump(precomputed_data, f, indent=2)

# Step 4: Dynamic Output Presentation
def display_character_output(character_name):
    if character_name not in precomputed_data:
        print(f"No data found for {character_name}. Please preprocess this character.")
        return
    
    output = precomputed_data[character_name]
    anime_titles = [doc["title"] for doc in output["retrieved_docs"]]
    
    print(f"\n=== RAG Summarization Output for {character_name} ===")
    print("┌" + "─" * 100 + "┐")
    print(f"│ Character: {character_name:<90} │")
    print("├" + "─" * 100 + "┤")
    print(f"│ Anime Titles: {', '.join(anime_titles[:3]) + ('...' if len(anime_titles) > 3 else ''):<90} │")
    print("├" + "─" * 100 + "┤")
    print(f"│ Summary:                                                                                      │")
    for line in output["final_summary"].split("\n"):
        print(f"│   {line[:96]:<96} │")
    print("├" + "─" * 100 + "┤")
    print(f"│ Context:                                                                                      │")
    for i, doc in enumerate(output["retrieved_docs"]):
        synopsis_snippet = doc["synopsis"][:100] + "..."
        review_snippet = clean_review_text(doc["text"])[:100] + "..."
        print(f"│   {doc['title']}: Synopsis: {synopsis_snippet[:46]}... Review: {review_snippet[:46]}... (Score: {output['distances'][i]:.4f}) │")
    print("├" + "─" * 100 + "┤")
    print(f"│ Token Usage: {output['token_usage']:<88} │")
    print(f"│ Latency: {output['latency']:.2f} seconds{'':<76} │")
    print(f"│ Image URL: {output['retrieved_docs'][0]['img_url'][:88]:<88} │")
    print("└" + "─" * 100 + "┘")
    
    # Display image
    try:
        display(Image(url=output["retrieved_docs"][0]["img_url"]))
    except Exception:
        print(f"Image for {character_name} could not be displayed. Check URL: {output['retrieved_docs'][0]['img_url']}")

# Example: Display output for all precomputed characters
for character in characters:
    display_character_output(character)



Processing Naruto Uzumaki...
Documents for Naruto Uzumaki: 20



Processing Ichigo Kurosaki...
Documents for Ichigo Kurosaki: 16



Processing Edward Elric...
Documents for Edward Elric: 41



=== RAG Summarization Output for Naruto Uzumaki ===
┌────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Character: Naruto Uzumaki                                                                             │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Anime Titles: Naruto: Shippuuden, Boruto: Naruto the Movie, Naruto...                                    │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Summary:                                                                                      │
│   It has been two and a half years since Naruto Uzumaki left Konohagakure, the Hidden Leaf Village │
│                                                                                                    │
│   Summarize fan opinions about Naruto Uzumaki, highlighting what audiences think. One of the best  │
├───────────────


=== RAG Summarization Output for Ichigo Kurosaki ===
┌────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Character: Ichigo Kurosaki                                                                            │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Anime Titles: Bleach, Bleach Movie 3: Fade to Black - Kimi no Na wo Yobu, Bleach Movie 4: Jigoku-hen...  │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Summary:                                                                                      │
│   Ichigo Kurosaki is an ordinary high schooler until his family is attacked by a Hollow, a corrupt │
│                                                                                                    │
│   This anime probably has my favorite cast of characters. I am kind of surprised other recurring B │
├──────────────


=== RAG Summarization Output for Edward Elric ===
┌────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Character: Edward Elric                                                                               │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Anime Titles: Fullmetal Alchemist, Fullmetal Alchemist: The Conqueror of Shamballa, Fullmetal Alchemist: Brotherhood... │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Summary:                                                                                      │
│   Edward Elric, a young, brilliant alchemist, has lost much in his twelve-year life. When he and h │
│                                                                                                    │
│   Summarize fan opinions about Edward Elric, highlighting what audiences think. To best describe t │
├──

CPU times: user 2min 54s, sys: 2.63 s, total: 2min 57s
Wall time: 2min 54s
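
The multi-character cell stores `distances[0].tolist()` for all ten hits and later indexes that list by position in the deduplicated document list. Whenever a duplicate title is dropped, the scores shift onto the wrong documents. A minimal sketch of one way to keep them aligned is to pair each document with its distance before filtering. The helper name `dedupe_by_title` and the toy data are hypothetical.

```python
def dedupe_by_title(docs, distances, limit=5):
    """Keep the first document per title, carrying its own distance along."""
    seen = set()
    kept = []
    for doc, dist in zip(docs, distances):
        if doc["title"] in seen:
            continue
        seen.add(doc["title"])
        # Copy the record and store the distance on it
        kept.append({**doc, "distance": float(dist)})
        if len(kept) == limit:
            break
    return kept

# Toy retrieval result: the second hit repeats the first title
docs = [{"title": "A"}, {"title": "A"}, {"title": "B"}]
distances = [0.10, 0.20, 0.30]

kept = dedupe_by_title(docs, distances)
print([(d["title"], d["distance"]) for d in kept])  # [('A', 0.1), ('B', 0.3)]
```

With positional indexing, "B" would have been reported with 0.20, the distance of the dropped duplicate.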
