# Basic example of how to use Qdrant to build a movie recommendation system
In this notebook, we build a basic recommendation system using the MovieLens dataset and Qdrant.
Vector databases like Qdrant are essential for recommendation systems. They store and manage high-dimensional data, including user and item embeddings, which represent detailed user preferences and item characteristics. With advanced indexing techniques, vector databases enable quick retrieval of similar users or items, ensuring that the suggested items or content closely align with the user's interests, leading to more personalized and relevant recommendations.

The MovieLens dataset contains a list of movies and the ratings that users gave them.
Our approach will use **collaborative filtering** to build a recommendation system based on this data.

The idea behind collaborative filtering is that users with similar tastes tend to like similar movies. We will use this idea to find the users whose ratings are most similar to ours and see which movies they liked that we haven't seen yet. To compare users, we represent each user's ratings as a sparse vector in a high-dimensional space, with one dimension per movie. We index these vectors in Qdrant and then query the same collection with a vector built from our own ratings. Once we have identified similar users, we check which movies they liked that we haven't rated yet.
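
To make the idea concrete, here is a minimal pure-Python sketch of comparing users through sparse rating vectors. The user names, movie IDs, and ratings are made up for illustration, and the dot product is assumed as the similarity measure (Qdrant's documentation describes sparse vector similarity in terms of the dot product); it does not reproduce what Qdrant does internally.

```python
# Toy illustration: each user is a sparse vector {movie_id: rating}.
# All names and numbers here are hypothetical, not taken from MovieLens.
def sparse_dot(a, b):
    """Dot product over the movie IDs present in both sparse vectors."""
    return sum(value * b[movie_id] for movie_id, value in a.items() if movie_id in b)

toy_users = {
    "alice": {1: 1.0, 2: -1.0, 3: 1.0},
    "bob": {1: 1.0, 2: 1.0},
    "carol": {2: -1.0, 3: 1.0, 4: 1.0},
}
me = {1: 1.0, 2: -1.0}  # we like movie 1 and dislike movie 2

# Score every user against our own ratings and rank them.
similarity = {name: sparse_dot(me, vec) for name, vec in toy_users.items()}
ranked = sorted(similarity, key=similarity.get, reverse=True)
print(ranked)  # ['alice', 'carol', 'bob']
```

Agreeing on a like or a dislike raises the score, while disagreeing lowers it, which is why negative ratings are useful in this representation.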

First, we need to download the dataset:
```shell
mkdir -p data
wget https://files.grouplens.org/datasets/movielens/ml-1m.zip
unzip ml-1m.zip -d data
```

#### Getting started
First, install all dependencies:
```shell
!pip install -U  \
    pandas  \
    qdrant-client \
    python-dotenv
```

Set the secret values in a `.env` file:
```bash
QDRANT_HOST
QDRANT_API_KEY
```

In [1]:
# load all environment variables
import os
from dotenv import load_dotenv
load_dotenv('./.env')

True

## 1 - Setup and loading data

In [2]:
from qdrant_client import QdrantClient, models
import pandas as pd

In [3]:
# load users
users = pd.read_csv('data/ml-1m/users.dat', sep='::', names=['user_id', 'gender', 'age', 'occupation', 'zip'], engine='python')
users.head()

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [4]:
movies = pd.read_csv('data/ml-1m/movies.dat', sep='::', names=['movie_id', 'title', 'genres'], engine='python', encoding='latin-1')
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
ratings = pd.read_csv( 'data/ml-1m/ratings.dat', sep='::', names=['user_id', 'movie_id', 'rating', 'timestamp'], engine='python')
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


#### Normalize ratings
> Sparse vectors can take advantage of negative values, so we normalize the ratings to have mean 0 and standard deviation 1.
> This way, movies we dislike are also taken into account.

In [6]:
ratings.rating = (ratings.rating - ratings.rating.mean()) / ratings.rating.std()

In [7]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,1.269746,978300760
1,1,661,-0.520601,978302109
2,1,914,-0.520601,978301968
3,1,3408,0.374572,978300275
4,1,2355,1.269746,978824291
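
As a small worked example of the same z-score formula, the sketch below normalizes only the five raw ratings shown in the earlier `ratings.head()` output. The mean and standard deviation are therefore computed from those five values alone, so the results differ from the notebook output above, which uses statistics of the full dataset. `statistics.stdev` is the sample standard deviation, which matches the default of pandas' `.std()`.

```python
from statistics import mean, stdev

# The five raw ratings of user 1 from the first rows of the dataset.
raw = [5, 3, 3, 4, 5]

mu = mean(raw)      # 4
sigma = stdev(raw)  # sample standard deviation (ddof=1), here 1.0

# Ratings above the mean become positive, ratings below it negative.
normalized = [(r - mu) / sigma for r in raw]
print(normalized)
```

After normalization, a rating of 3 contributes a negative value to the user's vector instead of a small positive one, so it counts as a mild dislike rather than a weak like.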


## 2 - Preparing the data and creating a collection
- Create a collection configured with sparse vectors.
- Sparse vectors don't require a dimension to be specified; it is inferred from the data automatically.

In [8]:
# Convert ratings to sparse vectors

from collections import defaultdict

user_sparse_vectors = defaultdict(lambda: {"values": [], "indices": []})

for row in ratings.itertuples():
    user_sparse_vectors[row.user_id]["values"].append(row.rating)
    user_sparse_vectors[row.user_id]["indices"].append(row.movie_id)

In [9]:
client = QdrantClient(
    url = os.getenv("QDRANT_HOST"),
    api_key = os.getenv("QDRANT_API_KEY")
)

client.create_collection(
    "movielens",
    vectors_config={},
    sparse_vectors_config={
        "ratings": models.SparseVectorParams()
    }
)

True

***Upload all users' ratings as sparse vectors***:

In [10]:
def data_generator():
    for user in users.itertuples():
        yield models.PointStruct(
            id=user.user_id,
            vector={
                "ratings": user_sparse_vectors[user.user_id]
            },
            payload=user._asdict()
        )

client.upload_points(
    "movielens",
    data_generator()
)

## 3 - Making a recommendation for ourselves
Let's try to recommend something for ourselves. We rate a few movies as follows:
- `1` = like
- `-1` = dislike

In [11]:
# To find a movie ID, search the movies table by title, e.g.:
# movies[movies.title.str.contains("Matrix", case=False)]

my_ratings = { 
    2571: 1,  # Matrix
    329: 1,   # Star Trek
    260: 1,   # Star Wars
    2288: -1, # The Thing
    1: 1,     # Toy Story
    1721: -1, # Titanic
    296: -1,  # Pulp Fiction
    356: 1,   # Forrest Gump
    2116: 1,  # Lord of the Rings
    1291: -1, # Indiana Jones
    1036: -1  # Die Hard
}

inverse_ratings = {k: -v for k, v in my_ratings.items()}

def to_vector(ratings):
    vector = models.SparseVector(
        values=[],
        indices=[]
    )
    for movie_id, rating in ratings.items():
        vector.values.append(rating)
        vector.indices.append(movie_id)
    return vector

In [12]:
# Find users with similar taste

results = client.search(
    "movielens",
    query_vector=models.NamedSparseVector(
        name="ratings",
        vector=to_vector(my_ratings)
    ),
    with_vectors=True, # We will use those to find new movies
    limit=20
)

In [13]:
# Sum the normalized ratings that similar users gave to each movie we haven't rated

def results_to_scores(results):
    movie_scores = defaultdict(lambda: 0)

    for user in results:
        user_scores = user.vector['ratings']
        for idx, rating in zip(user_scores.indices, user_scores.values):
            if idx in my_ratings:
                continue
            movie_scores[idx] += rating

    return movie_scores

In [14]:
# Sort movies by score and print top 5

movie_scores = results_to_scores(results)
top_movies = sorted(movie_scores.items(), key=lambda x: x[1], reverse=True)

for movie_id, score in top_movies[:5]:
    print(movies[movies.movie_id == movie_id].title.values[0], score)

Star Wars: Episode V - The Empire Strikes Back (1980) 20.02387858
Star Wars: Episode VI - Return of the Jedi (1983) 16.443184379999998
Princess Bride, The (1987) 15.840068229999996
Raiders of the Lost Ark (1981) 14.94489462
Sixth Sense, The (1999) 14.570322149999999
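
The scoring step can be checked by hand on toy data. The sketch below mirrors the aggregation logic of `results_to_scores` using plain dictionaries instead of Qdrant results; the movie IDs, ratings, and the names `my_toy_ratings` and `similar_users` are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: movies we already rated, and the sparse rating
# vectors of two "similar users" as {movie_id: normalized_rating}.
my_toy_ratings = {10: 1, 20: -1}
similar_users = [
    {10: 1.0, 30: 1.0, 40: -0.5},
    {20: -1.0, 30: 0.5, 50: 1.0},
]

scores = defaultdict(float)
for user_vector in similar_users:
    for movie_id, rating in user_vector.items():
        if movie_id in my_toy_ratings:
            continue  # skip movies we have already rated
        scores[movie_id] += rating

# Highest total first: movies liked by several similar users rise to the top.
top = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(top)  # [(30, 1.5), (50, 1.0), (40, -0.5)]
```

Because the ratings are summed rather than averaged, a movie that many similar users rated highly outranks one that a single user loved, and disliked movies are pushed down by their negative values.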
