# RAG con Qdrant (usando busqueda hibrida) + OpenAI

### Requisitos de instalación

Antes de ejecutar el proyecto, asegurate de tener instaladas las siguientes librerías de Python:

```bash
pip install \
  fastembed>=0.7.1 \
  ipywidgets>=8.1.7 \
  pandas>=2.3.0 \
  notebook>=7.4.3 \
  openai>=1.93.0 \
  qdrant-client>=1.14.3
```


###  Descarga y procesamiento de documentos

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5979622.svg)](https://doi.org/10.5281/zenodo.5979622)

In [1]:
import pandas as pd

url = 'https://zenodo.org/records/5979622/files/peliculas.csv?download=1'
df = pd.read_csv(url, header=0, index_col=0)

df.head()

Unnamed: 0,title,year,synopsis,critic_score,people_score,consensus,total_reviews,total_ratings,type,rating,...,release_date_(theaters),release_date_(streaming),box_office_(gross_usa),runtime,production_co,sound_mix,aspect_ratio,view_the_collection,crew,link
0,Black Panther,2018,"After the death of his father, T'Challa return...",96,79.0,Black Panther elevates superhero cinema to thr...,519,"50,000+",Action & Adventure,PG-13 (Sequences of Action Violence|A Brief Ru...,...,"Feb 16, 2018 wide","May 2, 2018",$700.2M,2h 14m,Walt Disney Pictures,"DTS, Dolby Atmos",Scope (2.35:1),Marvel Cinematic Universe,"Chadwick Boseman, Michael B. Jordan, Lupita Ny...",http://www.rottentomatoes.com/m/black_panther_...
1,Avengers: Endgame,2019,"Adrift in space with no food or water, Tony St...",94,90.0,"Exciting, entertaining, and emotionally impact...",538,"50,000+",Action & Adventure,PG-13 (Sequences of Sci-Fi Violence|Action|Som...,...,"Apr 26, 2019 wide","Jul 30, 2019",$858.4M,3h 1m,"Marvel Studios, Walt Disney Pictures","Dolby Atmos, DTS, Dolby Digital, SDDS",Scope (2.35:1),Marvel Cinematic Universe,"Robert Downey Jr., Chris Evans, Mark Ruffalo, ...",http://www.rottentomatoes.com/m/avengers_endgame
2,Mission: Impossible -- Fallout,2018,Ethan Hunt and the IMF team join forces with C...,97,88.0,"Fast, sleek, and fun, Mission: Impossible - Fa...",433,"10,000+",Action & Adventure,PG-13 (Intense Sequences of Action|Brief Stron...,...,"Jul 27, 2018 wide","Nov 20, 2018",$220.1M,2h 27m,"Bad Robot, Tom Cruise","DTS, Dolby Atmos, Dolby Digital",Scope (2.35:1),,"Tom Cruise, Henry Cavill, Ving Rhames, Simon P...",http://www.rottentomatoes.com/m/mission_imposs...
3,Mad Max: Fury Road,2015,"Years after the collapse of civilization, the ...",97,86.0,With exhilarating action and a surprising amou...,427,"100,000+",Action & Adventure,R (Intense Sequences of Violence|Disturbing Im...,...,"May 15, 2015 wide","Aug 10, 2016",$153.6M,2h,"Kennedy Miller Mitchell, Village Roadshow Pict...",Dolby Atmos,Scope (2.35:1),,"Tom Hardy, Charlize Theron, Nicholas Hoult, Hu...",http://www.rottentomatoes.com/m/mad_max_fury_road
4,Spider-Man: Into the Spider-Verse,2018,"Bitten by a radioactive spider in the subway, ...",97,93.0,Spider-Man: Into the Spider-Verse matches bold...,387,"10,000+",Action & Adventure,PG (Mild Language|Frenetic Action Violence|The...,...,"Dec 14, 2018 wide","Mar 7, 2019",$190.2M,1h 57m,"Lord Miller, Sony Pictures Animation, Pascal P...","Dolby Atmos, DTS, Dolby Digital, SDDS",Scope (2.35:1),,"Shameik Moore, Hailee Steinfeld, Mahershala Al...",http://www.rottentomatoes.com/m/spider_man_int...


In [2]:
df = df[['title', 'synopsis', 'consensus', 'critic_score', 'people_score', 'genre', 'director', 'writer']]

documents = df.to_dict(orient='records')
documents[:1]

[{'title': 'Black Panther',
  'synopsis': "After the death of his father, T'Challa returns home to the African nation of Wakanda to take his rightful place as king. When a powerful enemy suddenly reappears, T'Challa's mettle as king -- and as Black Panther -- gets tested when he's drawn into a conflict that puts the fate of Wakanda and the entire world at risk. Faced with treachery and danger, the young king must rally his allies and release the full power of Black Panther to defeat his foes and secure the safety of his people.",
  'consensus': "Black Panther elevates superhero cinema to thrilling new heights while telling one of the MCU's most absorbing stories -- and introducing some of its most fully realized characters.",
  'critic_score': 96,
  'people_score': 79.0,
  'genre': 'adventure, action, fantasy',
  'director': 'Ryan Coogler',
  'writer': 'Ryan Coogler, Joe Robert Cole'}]

### Construcción de la colección en Qdrant

In [3]:
from qdrant_client import QdrantClient, models

qd_client = QdrantClient("http://localhost", port=6333)
collection_name = "movie_catalog_hybrid"
embedding_dim = 768
model_dense = "BAAI/bge-base-en"
model_sparse = "Qdrant/bm25"


# Solo borra la colección si existe (evita error o espera innecesaria)
if collection_name in [col.name for col in qd_client.get_collections().collections]:
    qd_client.delete_collection(collection_name=collection_name)

# Usa create_collection para crear una collection
qd_client.create_collection(
    collection_name=collection_name,
    vectors_config={
        "dense_movie": models.VectorParams(
            size=embedding_dim,
            distance=models.Distance.COSINE,
        ),
    },
    sparse_vectors_config={
        "sparse_movie": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    },
    timeout=60
)

True

### Inserción de documentos vectorizados

In [4]:
from qdrant_client.models import Document, PointStruct
from tqdm import tqdm
import concurrent.futures

# Barra de progreso
points = []

def build_point(i, doc):
    text = f"{'synopsis'} {'consensus'} - critic score:{'critic_score'} people score: {'people_score'}"
    vector = {"dense_movie": Document(text=text, model=model_dense), 
              "sparse_movie": Document(text=text, model=model_sparse)}
    
    return PointStruct(id=i, vector=vector, payload=doc)

# Generar puntos en paralelo (si hay muchos documentos)
with concurrent.futures.ThreadPoolExecutor() as executor:
    points = list(tqdm(executor.map(lambda x: build_point(*x), enumerate(documents)), total=len(documents)))

# Usar upsert en batches para reducir el uso de memoria y red
BATCH_SIZE = 128

for i in range(0, len(points), BATCH_SIZE):
    batch = points[i:i + BATCH_SIZE]
    qd_client.upsert(collection_name=collection_name, points=batch, wait=True)

100%|████████████████████████████████████████████████████████████████████████████████████████| 1610/1610 [00:00<00:00, 115610.84it/s]


### Función de búsqueda vectorial con filtros

In [5]:
from qdrant_client import QdrantClient, models

def vector_search(
    question,
    limit,
    model_dense,
    model_sparse,
    collection_name
):
    client = QdrantClient("http://localhost", port=6333)
    print(f"[vector_search_hybrid] Searching for: '{question}'")
    
    # Construcción explícita del filtro para claridad
    search_prefetch = [
        models.Prefetch(
            query=models.Document(text=question, model=model_dense),
            using="dense_movie", 
            limit=(limit * 4)),
        models.Prefetch(
            query=models.Document(text=question, model=model_sparse), 
            using="sparse_movie", 
            limit=(limit * 4))
    ]

    # Busqueda hibrida
    hybrid_results = client.query_points(
        collection_name=collection_name,
        prefetch=search_prefetch,
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=limit,
        with_payload=True
    )
    
    return [point.payload for point in hybrid_results.points]

### Construcción del prompt para el LLM

In [21]:
def build_prompt(query, search_results):
    prompt_template = (
        "You are a movie expert. Answer the QUESTION based solely on the information in the CONTEXT.\n"
        "Do not make up information. Use only the data provided..\n\n"
        "Please include the synopsis, critic score, and people score for each movie, and provide a summary of the consensus.s\n\n"
        "QUESTION: {question}\n\n"
        "CONTEXT:\n{context}"
    )


    context = "\n\n".join(
        f"Title: {doc.get('title', 'N/A')}\n"
        f"Synopsis: {doc.get('synopsis', 'N/A')}\n"
        f"Consensus: {doc.get('consensus', 'N/A')}\n"
        f"Critic score: {doc.get('critic_score', 'N/A')}\n"
        f"People score: {doc.get('people_score', 'N/A')}\n"
        f"Genre: {doc.get('genre', 'N/A')}\n"
        f"Director: {doc.get('director', 'N/A')}\n"
        f"Writer: {doc.get('writer', 'N/A')}"
        for doc in search_results
    )

    return prompt_template.format(question=query, context=context)

### Generación de respuestas con un LLM

In [7]:
import time
from openai import OpenAIError  # O el error correcto según tu cliente
from openai import OpenAI

def llm(prompt, model="gpt-4o-mini", max_retries=3):

    openai_client = OpenAI()
    for attempt in range(max_retries):
        try:
            response = openai_client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content.strip()
        except OpenAIError as e:
            print(f"[llm] Error (attempt {attempt + 1}/{max_retries}): {e}")
            time.sleep(1.5 * (attempt + 1))

    raise RuntimeError("LLM request failed after multiple attempts.")

### El pipeline completo: función rag

In [8]:
def rag(
    query,
    limit=5,
    model='gpt-4o-mini',
    model_dense = "BAAI/bge-base-en",
    model_sparse = "Qdrant/bm25",
    collection_name='movie_catalog_hybrid'
):
    search_results = vector_search(
        question=query,
        limit=limit,
        model_dense=model_dense,
        model_sparse=model_sparse,
        collection_name=collection_name
    )
    
    if not search_results:
        return "No relevant documents found to answer the question."

    prompt = build_prompt(query, search_results)
    answer = llm(prompt, model=model)
    
    return answer

In [22]:
rag("Which movies have received positive reviews from both the audience and critics, and whose synopsis reflects deep or emotional themes?")

[vector_search_hybrid] Searching for: 'Which movies have received positive reviews from both the audience and critics, and whose synopsis reflects deep or emotional themes?'


"Based on the provided information, the movies that have received positive reviews from both the audience and critics, and whose synopsis reflects deep or emotional themes are:\n\n1. **Apocalypse Now**\n   - **Synopsis**: In Vietnam in 1970, Captain Willard (Martin Sheen) takes a perilous and increasingly hallucinatory journey upriver to find and terminate Colonel Kurtz (Marlon Brando), a once-promising officer who has reportedly gone completely mad. In the company of a Navy patrol boat filled with street-smart kids, a surfing-obsessed Air Cavalry officer (Robert Duvall), and a crazed freelance photographer (Dennis Hopper), Willard travels further and further into the heart of darkness.\n   - **Critic Score**: 98\n   - **People Score**: 94.0\n   - **Consensus**: Francis Ford Coppola's haunting, hallucinatory Vietnam War epic is cinema at its most audacious and visionary.\n\n2. **A Fistful of Dollars**\n   - **Synopsis**: The Man With No Name (Clint Eastwood) enters the Mexican village 

In [23]:
rag("Which fantasy or action movies based on books have been highly rated by audiences and even if critics were less favorable?")

[vector_search_hybrid] Searching for: 'Which fantasy or action movies based on books have been highly rated by audiences and even if critics were less favorable?'


'Based on the provided information, there are no fantasy or action movies specifically based on books that meet the criteria of being highly rated by audiences while having less favorable critic scores. All the movies listed have high critic scores along with high audience scores.\n\nHere’s a summary of the ratings for the movies mentioned:\n\n1. **The French Connection**\n   - **Synopsis**: New York Detective "Popeye" Doyle and his partner chase a French heroin smuggler.\n   - **Critic Score**: 98\n   - **People Score**: 87.0\n   - **Consensus**: Realistic, fast-paced, and smart, bolstered by stellar performances.\n\n2. **Apocalypse Now**\n   - **Synopsis**: Captain Willard takes a perilous journey in Vietnam to find and terminate Colonel Kurtz.\n   - **Critic Score**: 98\n   - **People Score**: 94.0\n   - **Consensus**: A haunting, audacious Vietnam War epic.\n\n3. **A Fistful of Dollars**\n   - **Synopsis**: The Man With No Name inserts himself into a power struggle among the Rojo b

In [24]:
rag("Which action movies based on books have been highly rated by both audiences and critics?")

[vector_search_hybrid] Searching for: 'Which action movies based on books have been highly rated by both audiences and critics?'


'Based on the provided context, the action movies based on books that have been highly rated by both audiences and critics are:\n\n1. **Apocalypse Now**\n   - **Synopsis:** In Vietnam in 1970, Captain Willard (Martin Sheen) takes a perilous and increasingly hallucinatory journey upriver to find and terminate Colonel Kurtz (Marlon Brando), a once-promising officer who has reportedly gone completely mad. With a diverse crew, Willard travels further into the heart of darkness.\n   - **Critic score:** 98\n   - **People score:** 94.0\n   - **Consensus:** Francis Ford Coppola\'s haunting, hallucinatory Vietnam War epic is cinema at its most audacious and visionary.\n\n2. **A Fistful of Dollars**\n   - **Synopsis:** The Man With No Name (Clint Eastwood) enters the Mexican village of San Miguel amidst a power struggle among the Rojo brothers and sheriff John Baxter. He inserts himself into the battle, selling false information to both sides for his own gain.\n   - **Critic score:** 98\n   - **

In [25]:
rag("Which action movies have been highly rated by both audiences and critics?")

[vector_search_hybrid] Searching for: 'Which action movies have been highly rated by both audiences and critics?'


'The following action movies have been highly rated by both audiences and critics:\n\n### 1. The French Connection\n- **Synopsis**: New York Detective "Popeye" Doyle (Gene Hackman) and his partner (Roy Scheider) chase a French heroin smuggler.\n- **Critic Score**: 98\n- **People Score**: 87.0\n- **Consensus**: Realistic, fast-paced and uncommonly smart, The French Connection is bolstered by stellar performances by Gene Hackman and Roy Scheider, not to mention William Friedkin\'s thrilling production.\n\n### 2. Apocalypse Now\n- **Synopsis**: In Vietnam in 1970, Captain Willard (Martin Sheen) takes a perilous and increasingly hallucinatory journey upriver to find and terminate Colonel Kurtz (Marlon Brando), a once-promising officer who has reportedly gone completely mad. In the company of a Navy patrol boat filled with street-smart kids, a surfing-obsessed Air Cavalry officer (Robert Duvall), and a crazed freelance photographer (Dennis Hopper), Willard travels further and further into t

In [27]:
rag("Which highly rated movies are based in or originate from Eastern countries?.")

[vector_search_hybrid] Searching for: 'Which highly rated movies are based in or originate from Eastern countries?.'


'Based on the context provided, the highly rated movies that are based in or originate from Eastern countries are:\n\n### 1. Apocalypse Now\n- **Synopsis**: In Vietnam in 1970, Captain Willard (Martin Sheen) takes a perilous and increasingly hallucinatory journey upriver to find and terminate Colonel Kurtz (Marlon Brando), a once-promising officer who has reportedly gone completely mad. In the company of a Navy patrol boat filled with street-smart kids, a surfing-obsessed Air Cavalry officer (Robert Duvall), and a crazed freelance photographer (Dennis Hopper), Willard travels further and further into the heart of darkness.\n- **Critic score**: 98\n- **People score**: 94.0\n- **Consensus**: Francis Ford Coppola\'s haunting, hallucinatory Vietnam War epic is cinema at its most audacious and visionary.\n\n### 2. A Fistful of Dollars\n- **Synopsis**: The Man With No Name (Clint Eastwood) enters the Mexican village of San Miguel in the midst of a power struggle among the three Rojo brothers