# Movie Recommender System

In this example you will create a movie recommender system.

The system will extract feature vectors from metadata about films using SentenceTransformers, import those vectors into Milvus, and import the metadata into Redis.When a user submits information about movies they're interested in, you'll search Milvus for similar films and provide searchers movie info from Redis using the results.

## Requirements
- Python 3.x.
- Docker
- A system with at least 32GB of RAM, or a Zilliz cloud account

## Data

In this project, you'll use [The Movies Dataset from Kaggle](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset/data). This dataset contains metadata on more than 45k movies.

The dataset has several files, but you'll only need **movies_metadata.csv,** the main Movies Metadata file. You can use this notebook as a starting point and modify it to take advantage of the rest of this dataset.

# Requirements

First, install the Python packages needed for this project.

In [None]:
! python -m pip install pymilvus redis pandas sentence_transformers kaggle

## Download dataset

Now you'll download the dataset. You'll use the [Kaggle API](https://github.com/Kaggle/kaggle-api) to retrieve the data. 

Set your login information below, or download **kaggle.json** to a location where the API will find it. 

In [None]:
%env KAGGLE_USERNAME=username
%env KAGGLE_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

%env TOKENIZERS_PARALLELISM=true

Download the data and unzip it to the **dataset** directory.

In [22]:
import kaggle

kaggle.api.authenticate()
kaggle.api.dataset_download_files('rounakbanik/the-movies-dataset', path='dataset', unzip=True)

## Milvus Server

You're going to create vector embeddings from the movie's descriptions. So, you need a way to store, index, and search on those embeddngs. That's where Milvus comes in.

This is a relatively large dataset, at least for a server running on a personal computer. So, you may want to use a Zilliz Cloud instance to store these vectors.

But, if you want to stay with a local instance, you can download a docker compose configuration and run that.

Here's how to get the compose file, and start the server.

In [None]:
! wget https://github.com/milvus-io/milvus/releases/download/v2.3.0/milvus-standalone-docker-compose.yml -O docker-compose.yml
! docker-compose up -d

But, if you want to use cloud, sign up for an account [here](https://cloud.zilliz.com).

## Redis Server

You're using Redis as a metadata storage service. Start it with docker.

In [None]:
! docker run --name redis -d -p 6379:6379 redis

Read in the movie metadata and see what it has to offer.

In [25]:
import pandas as pd

movies=pd.read_csv('dataset/movies_metadata.csv',low_memory=False)
movies.shape

(45466, 24)

Youhave more than 45k movies, with 24 columns of metadata.

List the columns.

In [26]:
movies.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

There's no need to store all these columns in Redis. Trim them down to the essentials.

In [27]:
trimmed_movies = movies[["id", "title", "overview", "release_date", "genres"]]
trimmed_movies.head(5)

Unnamed: 0,id,title,overview,release_date,genres
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,1995-12-15,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,1995-12-22,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",1995-12-22,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,1995-02-10,"[{'id': 35, 'name': 'Comedy'}]"


Now, it's time to connect to Milvus and Redis so you can start uploading data.

Start with Redis:

In [28]:
import redis                    
redis_server = redis.StrictRedis(host="localhost", port=6379)
redis_server.flushall()

True

Next, connect to Milvus. 

Here's the code for connecting to a cloud instance. Replace the URI and TOKEN with the correct values for your instance. 

You can find them in your Zilliz dashboard:


![image.png](cluster_info.png)


In [29]:
from pymilvus import *
import redis

milvus_uri="https://in03-75204f04fc4368d.api.gcp-us-west1.zillizcloud.com"
token="12ac8d384a8bae99c7fe00e34c112aa4095034cf45c7db6e5d8be9f974dbb737b116e983fbd350f6cf2e75406a74b7d030fc814e"
connections.connect("default",
                        uri=milvus_uri,
                        token=token)
print("Connected!")

Connected!


Start by importing the movie metadata into Redis. While you're doing this, clean up the data by verifying that all the fields are present before inserting the entry into Redis. Many of the movies are missing fields, especailly the critical **overview**.

By removing the bad entries from the list, you'll leave yourself with a dataset that matches Redis and is save to transform into embeddings.

In [30]:
import json
from pprint import pprint

movies_dict = trimmed_movies.to_dict('records')

for movie in movies_dict:
    if not movie["id"] or movie["overview"] != movie["overview"] or movie["release_date"] != movie["release_date"] or movie["genres"] != movie["genres"]:
        print("Dropping:")
        pprint(movie)
        movies_dict.remove(movie)

    redis_server.set(str(movie['id']), json.dumps(movie))

print("Done importing movies into redis.")

Dropping:
{'genres': "[{'id': 10749, 'name': 'Romance'}, {'id': 12, 'name': "
           "'Adventure'}]",
 'id': '78802',
 'overview': nan,
 'release_date': '1996-09-18',
 'title': 'Wings of Courage'}
Dropping:
{'genres': "[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}]",
 'id': '161495',
 'overview': nan,
 'release_date': '1995-03-01',
 'title': 'Roommates'}
Dropping:
{'genres': "[{'id': 35, 'name': 'Comedy'}]",
 'id': '287305',
 'overview': nan,
 'release_date': '1996-03-21',
 'title': 'Peanuts – Die Bank zahlt alles'}
Dropping:
{'genres': "[{'id': 35, 'name': 'Comedy'}]",
 'id': '10801',
 'overview': nan,
 'release_date': '1996-03-06',
 'title': 'The Superwife'}
Dropping:
{'genres': "[{'id': 18, 'name': 'Drama'}]",
 'id': '278978',
 'overview': nan,
 'release_date': '1996-02-29',
 'title': 'Und keiner weint mir nach'}
Dropping:
{'genres': "[{'id': 18, 'name': 'Drama'}, {'id': 10769, 'name': 'Foreign'}]",
 'id': '117730',
 'overview': nan,
 'release_date': '1994-01-01',
 

See how many movies you have left, and make sure you can retrieve a movie by ID. 862 is the id for Toy Story. You should see some metadata for that film if you retrieve that key.

In [31]:
print(len(movies_dict))
data = redis_server.get("862")
movie = json.loads(data)
print(movie)

44535
{'id': '862', 'title': 'Toy Story', 'overview': "Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.", 'release_date': '1995-10-30', 'genres': "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"}


So, with the meta data stored in Redis, it's time to calculate the embeddings and add them to Milvus.

First, you need a collection to store them in. Create a simple one that stores the movie ID and embeddings for in the **Movies** field.

Then, you'll index that field to make searches more efficent.

In [32]:
COLLECTION_NAME = 'film_vectors'
PARTITION_NAME = 'Movie'

pk = FieldSchema(name='ID', dtype=DataType.INT64, is_primary=True, auto_id=False)
field = FieldSchema(name='Movies', dtype=DataType.FLOAT_VECTOR, dim=384)
schema = CollectionSchema(fields=[pk, field], description="movie recommender: film vectors")

if utility.has_collection(COLLECTION_NAME): # drop the same collection created before
    collection = Collection(COLLECTION_NAME)
    collection.drop()
    
collection = Collection(name=COLLECTION_NAME, schema=schema)
print("Collection created.")

index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}

collection.create_index(field_name="Movies", index_params=index_params)
collection.load()

print("Collection indexed!")
 

Collection created.


alloc_timestamp unimplemented, ignore it


Collection indexed!


Now, you need a function to create the embeddings.

The primary artifact for movie information is the overview, but including the genre and release date in complete sentences may help with search accuracy.

Create a transformer and call it from a simple function:
- extract the id field
- creates an embed from the overview, genre and release date
- inserts the vector into Milvus.

You'll reuse the **build_genres** function below for searching.

In [34]:
from sentence_transformers import SentenceTransformer
import ast

def build_genres(data):
    genres = data['genres']
    genre_list = ""
    entries= ast.literal_eval(genres)
    genres = ""
    for entry in entries:
        genre_list = genre_list + entry["name"] + ", "
    genres += genre_list
    genres = "".join(genres.rsplit(",", 1))
    return genres

transformer = SentenceTransformer('all-MiniLM-L6-v2')

def embed_movie(data):
    embed = "{} Released on {}. Genres are {}.".format(data["overview"], data["release_date"], build_genres(data))    
    embeddings = transformer.encode(embed)
    return embeddings

Now, you can create the embeddings. This dataset is too large to send to Milvus in a single insert statement, but sending them one at a time would create unnecessary network traffic and add too much time. So, this code uses batches. You can play with the batch size to suit your individual needs and preferences.

A few movies will fail for ids that cannot be cast to integers. You could fix this above with a schema change or by verifying their format. 

In [35]:
# Parallel arrays for movies ids and embeddings
ids = []
embeds = []

# Loop counter for batching and showing progress
j = 0

for movie_dict in movies_dict:
    try:
        ids.append(int(movie_dict["id"]))
        embeds.append(embed_movie(movie_dict))
        j += 1
        if j % 5000 == 0:
            print("Embedded {} records".format(j))
            data = [ids, embeds]
            collection.insert(data)
            print("Batch insert completed")
            ids = []
            embeds = []
    except ValueError as ve:
        print("Invalid id: {} for record {}".format(movie_dict["id"], j))

data = [ids, embeds]
collection.insert(data)
print("Final batch completed")
print("Finished with {} embeddings".format(j))

Embedded 5000 records
Batch insert completed
Embedded 10000 records
Batch insert completed
Embedded 15000 records
Batch insert completed
Invalid id: 1997-08-20 for record 19595
Embedded 20000 records
Batch insert completed
Embedded 25000 records
Batch insert completed
Invalid id: 2012-09-29 for record 29175
Embedded 30000 records
Batch insert completed
Embedded 35000 records
Batch insert completed
Invalid id: 2014-01-01 for record 35039
Embedded 40000 records
Batch insert completed
Final batch completed
Finished with 44532 embeddings


Now you can search for movies that match viewer criteria. To do this, you need a few more functions.

First, you need a transformer to convert the user's search string to an embedding. For this, **embed_search** takes their criteria and passed it to the same transformer you used to populate Milvus.

Milvus will return a set of movie ids. You need to use them to retrieve data about those ids from Redis. This happens in **collate_results**.

Finally, **search_for_movies** performs the actual vector search, using the other two functions for support.

In [36]:
collection.load() # load collection memory before search

# Set search parameters
topK = 5
SEARCH_PARAM = {
    "metric_type":"L2",
    "params":{"nprobe": 20},
}


def embed_search(search_string):
    search_embeddings = transformer.encode(search_string)
    return search_embeddings


def collate_results(ids):
    movies = []    
    for id in ids:
        data = redis_server.get(str(id))
        if data:
            movie = json.loads(data)
            genres = build_genres(movie)
            movie["genres"] = genres
            movies.append(movie)
    return movies


def search_for_movies(search_string):
    user_vector = embed_search(search_string)
    results = collection.search([user_vector],"Movies",param=SEARCH_PARAM, limit=topK, expr=None, output_fields=None)
    return collate_results(results[0].ids)


So, put this search to work!

This search is looking for 1990s comedies with Vampires. The first hit is exactly that, but as the vector distance increases you can see that the films move further away from what you're looking for.

You can play around with different search criteria.

In [38]:
from pprint import pprint


search_string = "A comedy from the 1990s set in a hospital. The main characters are in their 20s and are trying to stop a vampire."
results = search_for_movies(search_string)
pprint(results)




[{'genres': 'Thriller, Comedy ',
  'id': '281289',
  'overview': 'A sarcastic comedy thriller about a vampire who sets out to '
              'infect the blood of a hospital and the twenty-something '
              'slackers who set out to stop him.',
  'release_date': '1998-05-18',
  'title': 'Tainted'},
 {'genres': 'Comedy, Horror ',
  'id': '65416',
  'overview': 'A group of friends go on a vacation with some new friends that '
              'turn out to be vampires, they hire a famous vampire killer to '
              'help them.',
  'release_date': '1997-01-24',
  'title': 'An American Vampire Story'},
 {'genres': 'Horror, Science Fiction ',
  'id': '40028',
  'overview': 'A science fiction vampire movie. The Vampire is an emissary '
              'from an embattled world near destruction who teleports to Earth '
              'to see if they can live here. He finds that our blood is '
              'nourishing and that at least one source of it is a steady '
              'stream