# Using a Vector Database to Recommend Movies

Vector search is critical for generative AI, but it also has many other interesting applications. A very common one is building personalized recommendations. In this exercise, we'll take a small detour and build a quick movie recommender using a vector database.

For this exercise we'll use the [MovieLens Latest Small Dataset](https://grouplens.org/datasets/movielens/latest/), which contains 100,000 ratings and 3,600 tags applied to 9,000 movies by 600 users. The strategy is to create an embedding for each movie based on the user ratings. Then, if a user rated a particular movie highly, we'll recommend "similar" movies, as determined by the embeddings.

In [None]:
import lancedb

import numpy as np
import pandas as pd

The dataset is included along with this exercise:

In [None]:
!ls ml-latest-small

## Loading data

Let's start by reading in the `ratings.csv` file. We'll use it to compute the content embeddings.

In [None]:
ratings = pd.read_csv('./ml-latest-small/ratings.csv', header=0)
ratings

## Computing ratings

Use the `ratings` dataframe from above to create a new dataframe named `reviewmatrix`, with users as the index and movies as the columns. Each entry (i, j) is the rating that user_i gave to movie_j. If a user did not rate a movie, fill in the value 0.

**HINT** In Excel this would be called a **pivot** table
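
As a rough illustration of the pivot idea on made-up data (the `toy_ratings` values and the choice of `pivot_table` are assumptions for this sketch, not necessarily the intended solution):

```python
import pandas as pd

# Toy ratings in the same long format as ratings.csv (values are made up)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.5, 2.0],
})

# One row per user, one column per movie; unrated pairs become 0
toy_matrix = toy_ratings.pivot_table(
    index="userId", columns="movieId", values="rating", fill_value=0
)
print(toy_matrix)
```

Note that `pivot_table` averages duplicate (user, movie) pairs by default, which only matters if the same pair appears more than once.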

In [None]:
<fill me in>

## Computing embeddings

Now let's use [matrix factorization](https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture25-mf.pdf) to extract content embeddings.

Please compute the content embeddings from the `reviewmatrix` dataframe and name the result `embeddings`.

**HINT**
1. SVD is a popular matrix factorization technique
2. If you're not sure which of the SVD results to use as the content embeddings, look at the shape of the results
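
To make the shape hint concrete, here is a small sketch on a made-up 4-user by 3-movie matrix. It assumes NumPy's `np.linalg.svd` with `full_matrices=False`; other SVD implementations would work as well.

```python
import numpy as np

# Toy ratings matrix: 4 users (rows) x 3 movies (columns), made-up values
toy = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
    [0.0, 1.0, 4.0],
])

# full_matrices=False returns the compact ("economy") factorization
u, s, vh = np.linalg.svd(toy, full_matrices=False)

print(u.shape)   # (4, 3): one row per user
print(s.shape)   # (3,): singular values, largest first
print(vh.shape)  # (3, 3): one column per movie

# Transposing vh gives one row (embedding) per movie
toy_embeddings = vh.T
```

Only the factor whose shape lines up with the number of movies can serve as a per-movie embedding, which is why the transpose of `vh` is the one to keep.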

In [None]:
matrix = reviewmatrix.values
_, _, vh = <fill me in>
embeddings = vh.T

## Metadata

Read in the `movies.csv` and `links.csv` files and make sure each one is aligned with the columns of `reviewmatrix`, so that row i of the metadata describes the same movie as row i of `embeddings`.

**HINT** pandas provides `reindex` functionality to help with data alignment
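
A minimal sketch of what alignment with `reindex` looks like, using hypothetical metadata (`toy_movies`) and a hypothetical column order (`rated_ids`):

```python
import pandas as pd

# Hypothetical metadata, listed in a different order than the ratings columns
toy_movies = pd.DataFrame({
    "movieId": [30, 10, 20, 40],
    "title": ["C", "A", "B", "D"],
})

# Column order of a hypothetical ratings matrix
rated_ids = pd.Index([10, 20, 30], name="movieId")

# Rows are reordered to follow rated_ids; ids absent from rated_ids are dropped
aligned = toy_movies.set_index("movieId").reindex(rated_ids)
print(aligned["title"].tolist())
```

If an id in `rated_ids` were missing from the metadata, `reindex` would keep that row and fill it with `NaN` rather than raise an error.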

In [None]:
movies = pd.read_csv('./ml-latest-small/movies.csv', header=0)
movies = movies.set_index("movieId").reindex(reviewmatrix.columns)
movies

In [None]:
# now do this for links

<fill me in>

## Create vector database table

Let's create a table with the following fields:

1. an integer movie id field
2. a vector field of embeddings
3. a string field of genres
4. a string field for the movie title
5. an integer field for the imdb_id

First, we'll create a pydantic model named `Content` for these fields. For the vector field, use `lancedb.pydantic.vector` as shorthand for the field type. Note that you'll need to pass in the number of dimensions.

In [None]:
from lancedb.pydantic import vector, LanceModel

class Content(LanceModel):
    movie_id: int
    vector: vector(embeddings.shape[1])
    genres: str
    title: str
    imdb_id: int
        
    @property
    def imdb_url(self) -> str:
        # IMDb title ids are zero-padded to at least 7 digits (e.g. tt0114709)
        return f"https://www.imdb.com/title/tt{self.imdb_id:07d}"

Let's prepare a list of Python dicts with all of the data.
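
The pattern here is to zip the columns into row tuples and then pair each tuple with the field names. A small standalone sketch with made-up columns (all names and values below are hypothetical):

```python
# Hypothetical column data; in the notebook these come from the dataframes
ids = [1, 2]
titles = ["Toy Story (1995)", "Jumanji (1995)"]
genres = ["Animation", "Adventure"]
extra = ["unused-a", "unused-b"]  # a column with no matching key

keys = ["movie_id", "title", "genres"]

# zip(ids, titles, ...) yields one tuple per row
rows = list(zip(ids, titles, genres, extra))

# dict(zip(keys, row)) pairs each key with the value in the same position;
# zip stops at the shorter input, so the fourth value is silently dropped
records = [dict(zip(keys, row)) for row in rows]
print(records[0])
```

Because the pairing is purely positional, the columns must be listed in the same order as the keys.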

In [None]:
# The column order must match the field order of Content
values = list(zip(reviewmatrix.columns,
                  embeddings,
                  movies["genres"],
                  movies["title"],
                  links["imdbId"]))
keys = Content.__annotations__.keys()
data = [dict(zip(keys, v)) for v in values]

data[0]

Now please connect to the local database at `~/.lancedb` and create a LanceDB table named `movielens_small`.

**HINT** you've seen this in a previous exercise

In [None]:
import pyarrow as pa
table_name = "movielens_small"
data = pa.Table.from_pylist(data, schema=Content.to_arrow_schema())

<fill me in>

## Generating recommendations

Finally, we're ready to generate recommendations based on content vector similarity.

Please fill in the rest of the function below to generate recommendations.

**HINT** It's easier if you use the pydantic integration to convert results
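
For intuition about what the similarity search does under the hood, here is a brute-force sketch in plain NumPy. It is not the LanceDB API; the helper `top_k_similar` and the use of cosine similarity are assumptions made for illustration, and the metric a vector database actually uses depends on how the search is configured.

```python
import numpy as np

def top_k_similar(vectors: np.ndarray, query_idx: int, k: int = 5) -> list[int]:
    """Return indices of the k rows most similar to row query_idx (cosine)."""
    # Normalize rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_idx]
    # Exclude the query itself, then sort from most to least similar
    scores[query_idx] = -np.inf
    return np.argsort(-scores)[:k].tolist()

# Made-up 2-d "embeddings" for four items
toy_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
print(top_k_similar(toy_vectors, query_idx=0, k=2))  # [1, 2]
```

A vector database performs the same kind of nearest-neighbor lookup, typically with an index so it does not have to scan every row.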

In [None]:
def get_recommendations(title: str) -> list[tuple[int, str, str]]:
    # First we retrieve the vector for the input title
    query_vector = (table.to_lance()
                    .to_table(filter=f"title='{title}'")["vector"].to_numpy()[0])
    # Please write the code to search for the 5 most similar titles
    <fill me in>
    # For each result, return the movie_id, title, and imdb_url
    return <fill me in>

If a user watched the movie titled "Moana (2016)", what should we recommend to the user?

In [None]:
get_recommendations("Moana (2016)")

What about "Rogue One: A Star Wars Story (2016)"?

In [None]:
get_recommendations("Rogue One: A Star Wars Story (2016)")

Do these look reasonable? How would you improve this recommender system?