# Recommender System

In this example we are creating a movie recommender system by extracting feature vectors using SentenceTransformers, importing movie vectors into Milvus, metadata into Redis, searching in Milvus, and providing searchers movie data from Redis.

## Data

This is wrong.

In this project, we use [MovieLens 1M](https://grouplens.org/datasets/movielens/1m/). This dataset contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users. 

We use the following files:
- movies.dat: Contains movie information.
- movie_vectors.txt: Contains movie vectors that can be imported to Milvus easily.

File structure:

 - movMovieID::Title::Genres   

     - Titles are identical to titles provided by the IMDB (includingyear of release)
 
     - Genres are pipe-separated

     - Some MovieIDs do not correspond to a movie due to accidental duplicate entries and/or test entries
 
    - Movies are mostly entered by hand, so errors and inconsistencies may exist

# Requirements

bla bla bla

In [24]:
! python -m pip install pymilvus redis pandas sentence_transformers

Collecting sentence_transformers
  Using cached sentence-transformers-2.2.2.tar.gz (85 kB)
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Collecting transformers<5.0.0,>=4.6.0 (from sentence_transformers)
  Obtaining dependency information for transformers<5.0.0,>=4.6.0 from https://files.pythonhosted.org/packages/1a/06/3817f9bb923437ead9a794f0ac0d03b8b5e0478ab112db4c413dd37c09da/transformers-4.33.2-py3-none-any.whl.metadata
  Downloading transformers-4.33.2-py3-none-any.whl.metadata (119 kB)
[2K     [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.9/119.9 kB[0m [31m7.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tqdm (from sentence_transformers)
  Obtaining dependency information for tqdm from https://files.pythonhosted.org/packages/00/e5/f12a80907d0884e6dff9c16d0c0114d81b8cd07dc3ae54c5e962cc83037e/tqdm-4.66.1-py3-none-any.whl.metadata
  Usi

##Milvus Server

We're going to create vector embedding from the movie's descriptions. So, we need a way to store, index, and search on those embeddngs. That's where Milvus comes in.

We're using a relatively large dataset, at least for a personal computer. So, you may want to use a Zilliz Cloud instance to store these vectors.

But, if yu want to stay with a local instance, you can download a docker compose configuration and run that.

Here's how to get the compose file, and start the server.

In [None]:
! wget https://github.com/milvus-io/milvus/releases/download/v2.3.0/milvus-standalone-docker-compose.yml -O docker-compose.yml
! docker-compose up -d

If you want to use cloud, sign up for an account [here](https://cloud.zilliz.com).

## Redis Server

We are using Redis as a metadata storage service.

In [1]:
! docker run --name redis -d -p 6379:6379 redis

1ce442c4e0ddb27fe1b19c23a173876833132c61a3985fbe84af4b54308dfd6f


Let's read in the movie metadata and see what it has to offer.

In [2]:
import pandas as pd

movies=pd.read_csv('dataset/movies_metadata.csv',low_memory=False)
movies.shape

(45466, 24)

We have more than 45k movies, with 24 coum,ns of metadata.

Let's list the columns.

In [3]:
movies.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

We don't need to store all these columns in Redis. Let's trim them down to the essentials.

In [4]:
trimmed_movies = movies[["id", "title", "overview", "release_date", "genres"]]
trimmed_movies.head(5)

Unnamed: 0,id,title,overview,release_date,genres
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",1995-10-30,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,1995-12-15,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,1995-12-22,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",1995-12-22,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,1995-02-10,"[{'id': 35, 'name': 'Comedy'}]"


Now, it's time to connect to Milvus and Redis so you can start uploading data.

Start with Redis:

In [13]:
import redis                    
redis_server = redis.StrictRedis(host="localhost", port=6379)
redis_server.flushall()

True

Next, connect to Milvus. 

Here's the code for connecting to a cloud instance. Replace the URI and TOKEN with the correct values for your instance. 

You can find them in your Zilliz dashboard:


![image.png](cluster_info.png)


In [6]:
from pymilvus import *
import redis

milvus_uri="https://in03-75204f04fc4368d.api.gcp-us-west1.zillizcloud.com"
token="12ac8d384a8bae99c7fe00e34c112aa4095034cf45c7db6e5d8be9f974dbb737b116e983fbd350f6cf2e75406a74b7d030fc814e"
connections.connect("default",
                        uri=milvus_uri,
                        token=token)
print("Connected!")

Connected!


Now, let's import the movie metadat into redis. While we're doing this, we'll clean up the data by verifying that all the fields are present before insert the entry into Redis. Many of the movies are missing fields.

In [14]:
import json

movies_dict = trimmed_movies.to_dict('records')

for movie in movies_dict:
    if not movie["id"] or movie["overview"] != movie["overview"] or movie["release_date"] != movie["release_date"] or movie["genres"] != movie["genres"]:
        print("Dropping {}".format(movie))
        movies_dict.remove(movie)

    redis_server.set(str(movie['id']), json.dumps(movie))

print("Done importing movies into redis.")

Dropping {'id': '78802', 'title': 'Wings of Courage', 'overview': nan, 'release_date': '1996-09-18', 'genres': "[{'id': 10749, 'name': 'Romance'}, {'id': 12, 'name': 'Adventure'}]"}
Dropping {'id': '161495', 'title': 'Roommates', 'overview': nan, 'release_date': '1995-03-01', 'genres': "[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}]"}
Dropping {'id': '287305', 'title': 'Peanuts – Die Bank zahlt alles', 'overview': nan, 'release_date': '1996-03-21', 'genres': "[{'id': 35, 'name': 'Comedy'}]"}
Dropping {'id': '10801', 'title': 'The Superwife', 'overview': nan, 'release_date': '1996-03-06', 'genres': "[{'id': 35, 'name': 'Comedy'}]"}
Dropping {'id': '278978', 'title': 'Und keiner weint mir nach', 'overview': nan, 'release_date': '1996-02-29', 'genres': "[{'id': 18, 'name': 'Drama'}]"}
Dropping {'id': '117730', 'title': 'Under The Domim Tree', 'overview': nan, 'release_date': '1994-01-01', 'genres': "[{'id': 18, 'name': 'Drama'}, {'id': 10769, 'name': 'Foreign'}]"}
Dropping {'

Let's see how many movies we have left, and make sure we can retrieve a movie by ID. 862 is the id for Toy Story. We should see some metadata for them film if we retrieve that key.

In [15]:
print(len(movies_dict))
data = redis_server.get("862")
movie = json.loads(data)
print(movie)

44535
{'id': '862', 'title': 'Toy Story', 'overview': "Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.", 'release_date': '1995-10-30', 'genres': "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"}


So, with the meta data stored in Redis, it's time to calculate the embeddings and add them to Milvus.

First, we need a collection to store them in. Let's create a simple one that stores the movie ID and embeddings for the *overview* field.

Then, we'll index the overview to make searches more efficent.

In [48]:
COLLECTION_NAME = 'film_vectors'
PARTITION_NAME = 'Movie'

pk = FieldSchema(name='ID', dtype=DataType.INT64, is_primary=True, auto_id=False)
field = FieldSchema(name='Movies', dtype=DataType.FLOAT_VECTOR, dim=384)
schema = CollectionSchema(fields=[pk, field], description="movie recommender: film vectors")

if utility.has_collection(COLLECTION_NAME): # drop the same collection created before
    collection = Collection(COLLECTION_NAME)
    collection.drop()
    
collection = Collection(name=COLLECTION_NAME, schema=schema)
print("Collection created.")

index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}

collection.create_index(field_name="Movies", index_params=index_params)
collection.load()

print("Collection indexed!")
 

Collection created.


alloc_timestamp unimplemented, ignore it


Collection indexed!


Now, we need a function to create the embeddings.

The primary artifact for movie information is the overview, but including the genre and release date in complete sentences may help with search accuracy.

Let's create a transformer and call it from a simple function that extracts the id field and creates an embed from the overview, genre and release date and inserts the vector into Milvus.

In [62]:
from sentence_transformers import SentenceTransformer
import ast

def build_genres(data):
    genres = data['genres']
    genre_list = ""
    entries= ast.literal_eval(genres)
    genres = ""
    for entry in entries:
        genre_list = genre_list + entry["name"] + ", "
    genres += genre_list
    genres = "".join(genres.rsplit(",", 1))
    return genres

transformer = SentenceTransformer('all-MiniLM-L6-v2')


def embed_movie(data):
    embed = "{} Released on {}. Genres are {}.".format(data["overview"], data["release_date"], build_genres(data))    
    embeddings = transformer.encode(embed)
    return embeddings

Now, we can create the embeddings. The data isn't perfectly clean, so we need to check the fields before processing. This should be done in an intial ingestion step, but we're saving an extra pass through the data by doing it here.

This dataset is too large to send to Milvus in a single insert statement, but sending them one at a time would create unnecessary network traffic and add too much time. So, this code uses batches. You can play with the batch size to suit your individual needs and preferences.


In [50]:
# Parallel arrays for movies ids and embeddings
ids = []
embeds = []

# Loop counter for batching and showing progress
j = 0

for movie_dict in movies_dict:
    try:
        ids.append(int(movie_dict["id"]))
        embeds.append(embed_movie(movie_dict))
        j += 1
        if j % 5000 == 0:
            print("Embedded {} records".format(j))
            data = [ids, embeds]
            collection.insert(data)
            print("Insert completed")
            ids = []
            embeds = []
    except ValueError as ve:
        print("Invalid id: {} for record {}".format(movie_dict["id"], j))

data = [ids, embeds]
collection.insert(data)
print("Final insert completed")
print("Finished with {} embeddings".format(j))

Embedded 5000 records
Insert completed
Embedded 10000 records
Insert completed
Embedded 15000 records
Insert completed
Invalid id: 1997-08-20 for record 19595
Embedded 20000 records
Insert completed
Embedded 25000 records
Insert completed
Invalid id: 2012-09-29 for record 29175
Embedded 30000 records
Insert completed
Embedded 35000 records
Insert completed
Invalid id: 2014-01-01 for record 35039
Embedded 40000 records
Insert completed
Final insert completed
Finished with 44532 embeddings


So, now we can search for movies that match viewer criteria. TO do this, we need several functions.

First, we need a transformer to convert the user's search string to an embedding. **embed_search** does this.

Second, Milvus will return a set of movie ids to us. We need to use them to retrieve movie data from Redis. This happens in **collate_results**.

Finally, **search_for_movies** performs the actual vector search.

In [69]:
collection.load() # load collection memory before search

# Set search parameters
topK = 5
SEARCH_PARAM = {
    "metric_type":"L2",
    "params":{"nprobe": 20},
}


def embed_search(search_string):
    search_embeddings = transformer.encode(search_string)
    return search_embeddings


def collate_results(ids):
    movies = []    
    for id in ids:
        data = redis_server.get(str(id))
        if data:
            movie = json.loads(data)
            genres = build_genres(movie)
            movie["genres"] = genres
            movies.append(movie)
    return movies


def search_for_movies(search_string):
    user_vector = embed_search(search_string)
    results = collection.search([user_vector],"Movies",param=SEARCH_PARAM, limit=topK, expr=None, output_fields=None)
    return collate_results(results[0].ids)


So, let's put this search to work!

In [73]:
from pprint import pprint


search_string = "A movie from the 1990s about vampires."
results = search_for_movies(search_string)
pprint(results)




[{'genres': 'Horror, Science Fiction ',
  'id': '40028',
  'overview': 'A science fiction vampire movie. The Vampire is an emissary '
              'from an embattled world near destruction who teleports to Earth '
              'to see if they can live here. He finds that our blood is '
              'nourishing and that at least one source of it is a steady '
              'stream of transfusions. He hypnotizes a Dr. to provide them and '
              'has his blond nurse move in to administer them.',
  'release_date': '1988-05-20',
  'title': 'Not of This Earth'},
 {'genres': 'Comedy, Horror ',
  'id': '65416',
  'overview': 'A group of friends go on a vacation with some new friends that '
              'turn out to be vampires, they hire a famous vampire killer to '
              'help them.',
  'release_date': '1997-01-24',
  'title': 'An American Vampire Story'},
 {'genres': 'Horror, Thriller ',
  'id': '18392',
  'overview': 'An 18-year-old college freshman is seduced by a hand