In [2]:
import pandas as pd
import numpy as np

## Content-Based Filtering


We want to build a system that recommends movies that are similar to a particular movie. We will use the plot descriptions as well as other metadata to create content-based recommendations.

> ### Creating Features from Plot Descriptions
> To use the `plot_keywords` for content-based filtering, we first have to convert them into numeric features. Use what we learned about TF-IDF in the last class to create new features based on the plot keywords.

In [1]:
# Read in the dataset
path = r'C:\Users\user\DS-SF-41\data\movie_metadata.csv'

> Replace the '`|`' separators in `plot_keywords` with a single space (i.e. ' ')

In [3]:
df = pd.read_csv(path)
df.head()
# regex=False treats '|' as a literal character rather than regex alternation
df['plot_keywords'] = df['plot_keywords'].str.replace('|', ' ', regex=False)
df['plot_keywords']

0                  avatar future marine native paraplegic
1       goddess marriage ceremony marriage proposal pi...
2                     bomb espionage sequel spy terrorist
3       deception imprisonment lawlessness police offi...
4                                                     NaN
5       alien american civil war male nipple mars prin...
6               sandman spider man symbiote venom villain
7       17th century based on fairy tale disney flower...
8       artificial intelligence based on comic book ca...
9                        blood book love potion professor
10      based on comic book batman sequel to a reboot ...
11      crystal epic lex luthor lois lane return to earth
12      action hero attempted rape bond girl official ...
13      box office hit giant squid heart liar's dice m...
14                  horse outlaw texas texas ranger train
15      based on comic book british actor playing amer...
16      brother brother relationship brother sister re...
17        alie
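
The pipe character is also the alternation operator in regular expressions, so it can help to make the literal replacement explicit. A minimal sketch on a made-up two-row Series (the sample values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample mimicking the pipe-delimited plot_keywords format
keywords = pd.Series(["bomb|espionage|spy", None])

# regex=False matches '|' literally; missing values stay missing
cleaned = keywords.str.replace("|", " ", regex=False)
print(cleaned.tolist())
```

Missing entries pass through untouched, which is why they are filled with empty strings only later, just before vectorizing.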

> Replace the non-breaking space in `movie_title` (the UTF-8 bytes `\xc2\xa0`, which Python 3 decodes to the single character `\xa0`) with an empty string (i.e. '')

In [32]:
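# Once the CSV is decoded as UTF-8 (the read_csv default), the bytes \xc2\xa0 are the single character '\xa0'
df['movie_title'] = df['movie_title'].str.replace('\xa0', '', regex=False)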
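
To see why the byte sequence and the decoded character differ, here is a small sketch. The title string is a made-up example, and it assumes the text was decoded as UTF-8:

```python
import pandas as pd

# The UTF-8 bytes C2 A0 decode to one character: U+00A0, the non-breaking space
nbsp = b"\xc2\xa0".decode("utf-8")

# Hypothetical title with a trailing non-breaking space
titles = pd.Series(["Avatar" + nbsp])
cleaned = titles.str.replace(nbsp, "", regex=False)
print(repr(titles[0]), "->", repr(cleaned[0]))
```

Without this cleanup, an exact lookup such as `movie_title == 'Avatar'` would fail because of the invisible extra character.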


> Use `TfidfVectorizer` to generate features from `plot_keywords`

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

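# Missing keywords (NaN) become empty strings so the vectorizer can process every row
pWords = df['plot_keywords'].fillna('')

# Drop common English stop words; every remaining token becomes one feature column
vectorizer = TfidfVectorizer(stop_words='english')

# Learn the vocabulary and build the sparse document-term matrix in one step
X = vectorizer.fit_transform(pWords)

X.shape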

(5043, 5997)
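
To get a feel for what the matrix contains, the same steps can be run on a tiny corpus. This is only a sketch: the three keyword strings are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up keyword strings standing in for plot_keywords rows
docs = [
    "bomb espionage sequel spy terrorist",
    "espionage spy agent",
    "horse outlaw texas train",
]

vectorizer = TfidfVectorizer(stop_words="english")
X_small = vectorizer.fit_transform(docs)

# One row per document, one column per distinct token kept in the vocabulary
print(X_small.shape)
print(sorted(vectorizer.vocabulary_))

# With the default norm='l2', every non-empty row has unit length
row_norms = np.sqrt(X_small.multiply(X_small).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

The shape of the full matrix reads the same way: one row per movie and one column per distinct keyword token.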

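> With this matrix in hand, you can now compute a similarity score between every pair of movies. There are several candidate measures, but the most popular is cosine similarity: the cosine of the angle between two movies' TF-IDF vectors. Because TF-IDF weights are non-negative, the score ranges from 0 (no shared keywords) to 1 (identical keyword profiles).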

> Use `cosine_similarity` to compute the cosine similarity scores and store them in `similarities`

In [None]:
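from sklearn.metrics.pairwise import cosine_similarity
similarities = pd.DataFrame(cosine_similarity(X))  # a DataFrame, because the function below indexes it with .iloc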
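
As an illustration of what the scores look like, here is a sketch on three invented keyword strings (not rows from the dataset). Documents that share keywords score above zero, while documents with no keywords in common score exactly zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "bomb espionage sequel spy terrorist",  # doc 0
    "espionage spy agent",                  # doc 1: shares two keywords with doc 0
    "horse outlaw texas train",             # doc 2: shares nothing with doc 0 or 1
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Square, symmetric matrix: entry [i, j] is the similarity of doc i and doc j
sims = cosine_similarity(tfidf)
print(sims.round(2))
```

The diagonal is always 1, since every document is identical to itself.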

> Now use this function to get the top recommendations based on a given movie

In [None]:
def get_top_recommendations(movie, top_n=10):
    """Return the titles of the top_n movies most similar to `movie`."""
    # Get the index of the movie we care about
    try:
        index = df[df['movie_title'] == movie].index[0]
    except IndexError:
        raise ValueError('"{}" not found in list of known movies!'.format(movie))

    # Rank the other movies by similarity (dropping the movie itself) and return their titles
    recs = similarities.iloc[index, :].drop(index).sort_values(ascending=False).head(top_n).index
    return df.iloc[recs, :]['movie_title']
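
The ranking step inside the function can be tried in isolation. The sketch below uses a hand-written similarity matrix and invented titles, so the numbers are purely illustrative:

```python
import pandas as pd

# Hypothetical titles and a made-up symmetric similarity matrix
titles = pd.Series(["Spy A", "Spy B", "Western", "Romance"])
toy_similarities = pd.DataFrame([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.4],
    [0.0, 0.3, 0.4, 1.0],
])

index = 1  # row of "Spy B"

# Take that movie's row, drop its similarity to itself, keep the two best matches
recs = toy_similarities.iloc[index, :].drop(index).sort_values(ascending=False).head(2).index
print(titles.iloc[recs].tolist())  # ['Spy A', 'Romance']
```

Dropping the movie's own index matters because its similarity to itself is always 1 and would otherwise rank first.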

## Collaborative Filtering

> ### Using Surprise
> Surprise is a Python scikit for building and analyzing recommender systems.
> http://surpriselib.com/

In [None]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k').build_full_trainset()

# Use the famous SVD algorithm.
algo = SVD()

algo.fit(data)

In [None]:
uid = str(196)  # raw user id (as in the ratings file)
iid = str(302)  # raw item id (as in the ratings file)

# Get a prediction for a specific user and item; r_ui is the known true rating, passed along for comparison.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)