## Recommender System
Tutorial: https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system/notebook  
TMDB Dataset: https://www.kaggle.com/tmdb/tmdb-movie-metadata#tmdb_5000_credits.csv

- [Demographic Filtering](#Demographic-Filtering)
- [Content Based Filtering](#Content-Based-Filtering)
- [Collaborative Filtering](#Collaborative-Filtering)
    - User based filtering
    - Item based filtering

In [1]:
import pandas as pd 
import numpy as np 

In [2]:
ls

Recommender System.ipynb*  tmdb_5000_credits.csv*  tmdb_5000_movies.csv*


In [3]:
df1 = pd.read_csv('tmdb_5000_credits.csv')
df2 = pd.read_csv('tmdb_5000_movies.csv')

In [4]:
df2.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

In [5]:
df1.head(2)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [6]:
df2.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [7]:
# Drop the title column from the credits table; df2 already has one
df1 = df1.drop(columns=['title'])

In [8]:
df1.columns = ['id', 'cast', 'crew']
df2 = df2.merge(df1, on='id')

In [9]:
df2.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


### 1. Demographic Filtering<a name="Demographic-Filtering"></a>

- Calculate a weighted-rating score for every movie
- Sort by score and recommend the top-rated movies to every user

In [10]:
df2[['id', 'title', 'vote_count', 'vote_average']].head()

Unnamed: 0,id,title,vote_count,vote_average
0,19995,Avatar,11800,7.2
1,285,Pirates of the Caribbean: At World's End,4500,6.9
2,206647,Spectre,4466,6.3
3,49026,The Dark Knight Rises,9106,7.6
4,49529,John Carter,2124,6.1


In [11]:
# Mean of vote_average

C= df2['vote_average'].mean()
C

6.092171559442016

In [12]:
# Use the 90th percentile of vote_count as the cutoff m
# To qualify, a movie must have at least as many votes as 90% of the movies in the list

m = df2['vote_count'].quantile(0.9)
m

1838.4000000000015

In [13]:
# Keep only the movies that have at least m votes

q_movies = df2.copy().loc[df2['vote_count'] >= m]
q_movies.shape

(481, 22)

In [14]:
# Weighted rating

def weighted_rating(x, m=m, C=C):
    # m: the minimum votes required to be listed in the chart
    # C: the mean vote
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)

In [15]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`

q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

In [16]:
#Sort movies based on score calculated above

q_movies = q_movies.sort_values('score', ascending=False)
q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)

Unnamed: 0,title,vote_count,vote_average,score
1881,The Shawshank Redemption,8205,8.5,8.059258
662,Fight Club,9413,8.3,7.939256
65,The Dark Knight,12002,8.2,7.92002
3232,Pulp Fiction,8428,8.3,7.904645
96,Inception,13752,8.1,7.863239
3337,The Godfather,5893,8.4,7.851236
95,Interstellar,10867,8.1,7.809479
809,Forrest Gump,7927,8.2,7.803188
329,The Lord of the Rings: The Return of the King,8064,8.1,7.727243
1990,The Empire Strikes Back,5879,8.2,7.697884
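
The scores in the table can be reproduced by hand. The sketch below is a dependency-free check that plugs the `m` and `C` values printed earlier, together with the vote counts and averages of the first two rows, into the same formula. The function name `weighted_rating_scalar` is only illustrative.

```python
# Values of m and C as printed by the notebook above
m = 1838.4000000000015
C = 6.092171559442016

def weighted_rating_scalar(v, R, m=m, C=C):
    """IMDB-style weighted rating for a single movie.

    v: number of votes for the movie, R: its average vote.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

shawshank = weighted_rating_scalar(v=8205, R=8.5)
fight_club = weighted_rating_scalar(v=9413, R=8.3)
print(round(shawshank, 4), round(fight_club, 4))
```

The more votes a movie has relative to `m`, the closer its score stays to its own average `R`; with few votes the score is pulled toward the global mean `C`.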


In [17]:
pop= df2.sort_values('popularity', ascending=False)
import matplotlib.pyplot as plt
plt.figure(figsize=(12,4))

plt.barh(pop['title'].head(6),pop['popularity'].head(6), align='center',
        color='skyblue')
plt.gca().invert_yaxis()
plt.xlabel("Popularity")
plt.title("Popular Movies")

Text(0.5,1,'Popular Movies')

### 2. Content Based Filtering<a name="Content-Based-Filtering"></a>

In [18]:
df2.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


In [19]:
df2['overview'].head(5)

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
3    Following the death of District Attorney Harve...
4    John Carter is a war-weary, former military ca...
Name: overview, dtype: object

In [20]:
# Convert the overviews to TF-IDF vectors
# using scikit-learn's built-in TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF vectorizer object. Remove English stop words such as 'the', 'a', 'him'
# and keep only the 5000 most frequent terms (max_features=5000)
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)


In [21]:
#Replace NaN with an empty string
df2['overview'] = df2['overview'].fillna('')

tfidf_matrix = tfidf.fit_transform(df2['overview'])

tfidf_matrix.shape

(4803, 5000)
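
For intuition about what the matrix holds, here is a small pure-Python sketch of the inverse document frequency part of TF-IDF. It assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The toy corpus and the helper name `smoothed_idf` are made up for illustration.

```python
import math

# Toy corpus standing in for movie overviews (not from the dataset)
docs = [
    "marine fights on alien moon",
    "pirate captain sails the ocean",
    "marine biologist studies the ocean",
]

def smoothed_idf(term, documents):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(documents)
    df = sum(1 for d in documents if term in d.split())
    return math.log((1 + n) / (1 + df)) + 1

idf_ocean = smoothed_idf("ocean", docs)   # appears in 2 of 3 documents
idf_alien = smoothed_idf("alien", docs)   # appears in 1 of 3 documents
print(idf_ocean, idf_alien)
```

Rarer terms get a larger weight, which is why distinctive words in an overview dominate the similarity scores.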

In [22]:
# Cosine similarity scores

# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

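# Compute the cosine similarity matrix.
# TfidfVectorizer L2-normalizes each row by default, so the dot product
# computed by linear_kernel equals the cosine similarity.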
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
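
`linear_kernel` computes plain dot products. On unit-length vectors this is the same number as the cosine similarity. A minimal pure-Python sketch of that equivalence, using two made-up vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

cos_ab = cosine(a, b)
dot_normalized = dot(l2_normalize(a), l2_normalize(b))
print(cos_ab, dot_normalized)
```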

In [23]:
indices = pd.Series(df2.index, index=df2['title']).drop_duplicates()
indices.head(5)

title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
dtype: int64

In [24]:
idx = indices['Minions']

In [25]:
sim_scores = list(enumerate(cosine_sim[idx]))

In [26]:
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

In [27]:
# Keep the scores of the 10 most similar movies
# Skip the first entry: it is the movie itself (cosine similarity 1)

sim_scores = sim_scores[1:11]

In [28]:
sim_scores

[(506, 0.31946218327791664),
 (221, 0.15894260833906343),
 (3944, 0.15351595242076924),
 (2511, 0.1532620287858949),
 (3188, 0.14646533026049918),
 (4726, 0.13321251143490587),
 (70, 0.1314920932121),
 (1733, 0.12968712190689358),
 (1218, 0.1287943590242467),
 (3042, 0.12533423629087226)]

In [29]:
movie_indices = [i[0] for i in sim_scores]
movie_indices

[506, 221, 3944, 2511, 3188, 4726, 70, 1733, 1218, 3042]

In [30]:
print(df2[['title', 'overview']].iloc[idx])
print(df2[['title', 'overview']].iloc[movie_indices])

title                                                 Minions
overview    Minions Stuart, Kevin and Bob are recruited by...
Name: 546, dtype: object
                  title                                           overview
506     Despicable Me 2  Gru is recruited by the Anti-Villain League to...
221     Stuart Little 2  Stuart, an adorable white mouse, still lives h...
3944            Freeway  Following the arrest of her mother, Ramona, yo...
2511         Home Alone  Eight-year-old Kevin McCallister makes the mos...
3188    Velvet Goldmine  Almost a decade has elapsed since Bowie esque ...
4726         The Mighty  This tells the story of a strong friendship be...
70       Wild Wild West  Legless Southern inventor Dr. Arliss Loveless ...
1733  The Spy Next Door  Former CIA spy Bob Ho takes on his toughest as...
1218     The Guilt Trip  An inventor and his mom hit the road together ...
3042           The Gift  A husband and wife try to reinvigorate their r...


In [31]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df2['title'].iloc[movie_indices]
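
Sorting every score just to keep ten is more work than needed. As an optional variant (not what the notebook does), `heapq.nlargest` can pick the top k directly, and filtering by index avoids assuming that the movie itself always sorts first. The helper name `top_k_similar` and the toy scores are hypothetical.

```python
import heapq

def top_k_similar(scores, idx, k=10):
    """Return the k (index, score) pairs with the highest score, excluding idx."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != idx)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

row = [0.10, 1.00, 0.32, 0.05, 0.16]   # toy similarity row for movie 1
top2 = top_k_similar(row, idx=1, k=2)
print(top2)
```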

In [32]:
get_recommendations('Minions')

506       Despicable Me 2
221       Stuart Little 2
3944              Freeway
2511           Home Alone
3188      Velvet Goldmine
4726           The Mighty
70         Wild Wild West
1733    The Spy Next Door
1218       The Guilt Trip
3042             The Gift
Name: title, dtype: object

In [33]:
get_recommendations('The Avengers')

3311              Thank You for Smoking
256                           Allegiant
3144                            Plastic
1715                            Timecop
7               Avengers: Age of Ultron
4124                 This Thing of Ours
1468                       The Fountain
3033                      The Corruptor
4112                      Clockwatchers
588     Wall Street: Money Never Sleeps
Name: title, dtype: object

In [34]:
# Convert str to list type

type(df2.iloc[0]['genres'])

str

In [35]:
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(literal_eval)

In [36]:
type(df2.iloc[0]['genres'])

list

In [37]:
df2.iloc[0][['cast', 'crew', 'keywords', 'genres']].values[3]

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]
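
The raw columns store lists of dictionaries as strings, which is why `literal_eval` is needed before the names can be extracted. A small standalone sketch on a shortened copy of the `genres` value shown above:

```python
from ast import literal_eval

# Shortened copy of the raw 'genres' string for the first movie
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)              # str -> list of dicts
names = [g["name"] for g in parsed]
print(type(parsed).__name__, names)
```

Because these strings also look like valid JSON, `json.loads` would likely work too; `literal_eval` is what the notebook uses.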

In [38]:
# Returns the first 3 elements of the list, or the entire list if it has fewer than 3.
# Applied to 'cast', 'keywords' and 'genres'

def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [39]:
# Get the director's name from the crew feature. If director is not listed, return NaN
# 'crew'

def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [40]:
# Define new director, cast, genres and keywords features that are in a suitable form.
df2['director'] = df2['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    df2[feature] = df2[feature].apply(get_list)

In [41]:
# Print the new features of the first 3 films
df2[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron,"[culture clash, future, space war]","[Action, Adventure, Fantasy]"
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski,"[ocean, drug abuse, exotic island]","[Adventure, Fantasy, Action]"
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux]",Sam Mendes,"[spy, based on novel, secret agent]","[Action, Adventure, Crime]"


In [42]:
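# Lowercase and replace spaces with underscores

# Function to convert all strings to lower case and join multi-word names
# with underscores, so that e.g. "James Cameron" becomes the single token "james_cameron"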
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "_")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", "_"))
        else:
            return ''

In [43]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    df2[feature] = df2[feature].apply(clean_data)

In [44]:
df2[features]

Unnamed: 0,cast,keywords,director,genres
0,"[sam_worthington, zoe_saldana, sigourney_weaver]","[culture_clash, future, space_war]",james_cameron,"[action, adventure, fantasy]"
1,"[johnny_depp, orlando_bloom, keira_knightley]","[ocean, drug_abuse, exotic_island]",gore_verbinski,"[adventure, fantasy, action]"
2,"[daniel_craig, christoph_waltz, léa_seydoux]","[spy, based_on_novel, secret_agent]",sam_mendes,"[action, adventure, crime]"
3,"[christian_bale, michael_caine, gary_oldman]","[dc_comics, crime_fighter, terrorist]",christopher_nolan,"[action, crime, drama]"
4,"[taylor_kitsch, lynn_collins, samantha_morton]","[based_on_novel, mars, medallion]",andrew_stanton,"[action, adventure, science_fiction]"
5,"[tobey_maguire, kirsten_dunst, james_franco]","[dual_identity, amnesia, sandstorm]",sam_raimi,"[fantasy, action, adventure]"
6,"[zachary_levi, mandy_moore, donna_murphy]","[hostage, magic, horse]",byron_howard,"[animation, family]"
7,"[robert_downey_jr., chris_hemsworth, mark_ruff...","[marvel_comic, sequel, superhero]",joss_whedon,"[action, adventure, science_fiction]"
8,"[daniel_radcliffe, rupert_grint, emma_watson]","[witch, magic, broom]",david_yates,"[adventure, fantasy, family]"
9,"[ben_affleck, henry_cavill, gal_gadot]","[dc_comics, vigilante, superhero]",zack_snyder,"[action, adventure, fantasy]"


In [45]:
# Create the metadata soup: a string that contains all the metadata
# that we want to feed to our vectorizer (namely keywords, actors, director and genres)

def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

df2['soup'] = df2.apply(create_soup, axis=1)
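
Joining names with underscores matters once the soup is tokenized. The sketch below imitates `CountVectorizer`'s default token pattern (`\b\w\w+\b`) with the `re` module, on the Avatar row shown earlier, to show that each name stays a single token. It is an illustration of the tokenization step only, not the vectorizer itself.

```python
import re

# Same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

# Cleaned features of the first movie, copied from the table above
row = {
    "keywords": ["culture_clash", "future", "space_war"],
    "cast": ["sam_worthington", "zoe_saldana", "sigourney_weaver"],
    "director": "james_cameron",
    "genres": ["action", "adventure", "fantasy"],
}

def create_soup(x):
    return (" ".join(x["keywords"]) + " " + " ".join(x["cast"]) + " "
            + x["director"] + " " + " ".join(x["genres"]))

soup = create_soup(row)
tokens = TOKEN_PATTERN.findall(soup)
print(tokens)

# With a space instead of an underscore, the name splits into two tokens
split_tokens = TOKEN_PATTERN.findall("james cameron")
```

With separate tokens, a director named James would partially match any actor named James; the underscore form avoids that.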

In [48]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df2['soup'])

In [49]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)


In [50]:
# Reset index of our main DataFrame and construct reverse mapping as before
df2 = df2.reset_index()
indices = pd.Series(df2.index, index=df2['title'])

In [52]:
get_recommendations('The Dark Knight Rises', cosine_sim2)

65               The Dark Knight
119                Batman Begins
4638    Amidst the Devil's Wings
1196                The Prestige
3073           Romeo Is Bleeding
3326              Black November
1503                      Takers
1986                      Faster
303                     Catwoman
747               Gangster Squad
Name: title, dtype: object

In [54]:
get_recommendations('Minions', cosine_sim2)

67                                 Monsters vs Aliens
1426                                          Valiant
358                         Atlantis: The Lost Empire
302     Legend of the Guardians: The Owls of Ga'Hoole
2464                           The Master of Disguise
294                                              Epic
418       Cats & Dogs 2 : The Revenge of Kitty Galore
479                            Walking With Dinosaurs
1620                                  Winnie the Pooh
2823        Harold & Kumar Escape from Guantanamo Bay
Name: title, dtype: object

### 3. Collaborative Filtering<a name="Collaborative-Filtering"></a>
- User based filtering
- Item based filtering 

In [55]:
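# Note: `evaluate` and `Dataset.split` belong to older Surprise releases and were
# removed in Surprise 1.1; `cross_validate` (used further below) replaces them.
from surprise import Reader, Dataset, SVD, evaluate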
reader = Reader()
ratings = pd.read_csv('ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [58]:
ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100004.0,100004.0,100004.0,100004.0
mean,347.01131,12548.664363,3.543608,1129639000.0
std,195.163838,26369.198969,1.058064,191685800.0
min,1.0,1.0,0.5,789652000.0
25%,182.0,1028.0,3.0,965847800.0
50%,367.0,2406.5,4.0,1110422000.0
75%,520.0,5418.0,4.0,1296192000.0
max,671.0,163949.0,5.0,1476641000.0


In [59]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
data.split(n_folds=5)

In [61]:
svd = SVD()
evaluate(svd, data, measures=['RMSE', 'MAE'])



Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8955
MAE:  0.6918
------------
Fold 2
RMSE: 0.9000
MAE:  0.6913
------------
Fold 3
RMSE: 0.9015
MAE:  0.6933
------------
Fold 4
RMSE: 0.8907
MAE:  0.6862
------------
Fold 5
RMSE: 0.8995
MAE:  0.6947
------------
------------
Mean RMSE: 0.8974
Mean MAE : 0.6915
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.8954879608797603,
                             0.8999869655266691,
                             0.9014667595224868,
                             0.8907473578505002,
                             0.8994941841780016],
                            'mae': [0.691816986835082,
                             0.6913489055073314,
                             0.6933459565739744,
                             0.6862242013474676,
                             0.6946998085235183]})
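
For reference, RMSE and MAE are both averages over the prediction errors; RMSE squares the errors first, so large misses weigh more. Below is a pure-Python sketch with made-up ratings, followed by a check that the reported mean RMSE is the plain average of the five fold values printed above.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)

# (true rating, estimated rating) pairs; values are illustrative only
pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 2.5)]
toy_rmse = rmse(pairs)
toy_mae = mae(pairs)

# Per-fold RMSE values copied from the output above
fold_rmse = [0.8954879608797603, 0.8999869655266691, 0.9014667595224868,
             0.8907473578505002, 0.8994941841780016]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(round(toy_rmse, 4), round(toy_mae, 4), round(mean_rmse, 4))
```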

In [141]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x143f290b8>

In [142]:
ratings[ratings['userId'] == 1].sample(5)

Unnamed: 0,userId,movieId,rating,timestamp
15,1,2193,2.0,1260759198
0,1,31,2.5,1260759144
3,1,1129,2.0,1260759185
8,1,1339,3.5,1260759125
6,1,1287,2.0,1260759187


In [145]:
# predict() takes raw ids; the result shows the true rating passed in (r_ui)
# and the estimated rating (est)
svd.predict(uid=1, iid=3671, r_ui=3)

Prediction(uid=1, iid=3671, r_ui=3, est=3.0020178344719644, details={'was_impossible': False})

In [146]:
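# estimate() works on the trainset's inner ids (not raw ids) and does not clip
# to the rating scale, so it need not match predict() for the same numbers
svd.estimate(u=1, i=3671)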

2.9633632823225216
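
The two calls differ because `predict` takes raw ids (the values in `ratings`), converts them to the inner ids of the trainset, and clips the result to the rating scale, while `estimate` takes inner ids and returns the unclipped value. The sketch below mimics the biased matrix-factorization rule that Surprise documents for `SVD`, est = mu + b_u + b_i + q_i · p_u, with invented numbers; none of these values come from the fitted model, and the 1 to 5 clipping range assumes the default `Reader()` scale.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i):
    """Global mean + user bias + item bias + dot product of the latent factors."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def clip(value, low=1.0, high=5.0):
    """Clip an estimate to the rating scale (assumed here to be 1 to 5)."""
    return min(high, max(low, value))

# Hypothetical parameters for one user/item pair
est = svd_estimate(mu=3.5, b_u=0.2, b_i=-0.1, p_u=[0.1, -0.2], q_i=[0.3, 0.4])

# An estimate above the scale is clipped back to the maximum
high_est = svd_estimate(mu=3.5, b_u=1.2, b_i=0.9, p_u=[0.0, 0.0], q_i=[0.3, 0.4])
print(round(est, 2), clip(high_est))
```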

### Other experiments

In [74]:
from surprise.model_selection import cross_validate

data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

# Run 5-fold cross-validation and print results
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8968  0.9030  0.8990  0.8921  0.8956  0.8973  0.0036  
MAE (testset)     0.6924  0.6947  0.6921  0.6879  0.6872  0.6908  0.0029  
Fit time          5.42    5.14    6.36    6.00    5.39    5.66    0.45    
Test time         0.30    0.16    0.25    0.17    0.17    0.21    0.05    


{'test_rmse': array([0.89683245, 0.90295076, 0.89896674, 0.89205451, 0.89562324]),
 'test_mae': array([0.69235758, 0.69470395, 0.69207994, 0.68793067, 0.68716636]),
 'fit_time': (5.420982837677002,
  5.135545253753662,
  6.359534978866577,
  5.9995317459106445,
  5.386353969573975),
 'test_time': (0.3005411624908447,
  0.16461968421936035,
  0.2543821334838867,
  0.17480134963989258,
  0.17417621612548828)}

In [106]:
data.df.head()

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


In [108]:
from surprise.model_selection import train_test_split
from surprise import accuracy

trainset, testset = train_test_split(data, random_state=1, shuffle=True, test_size=.25)

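# Note: svd is not refit on this trainset. If it was last fit on data that
# includes these test ratings, the RMSE below is optimistic.
predictions = svd.test(testset)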
predictions[0]

Prediction(uid=387, iid=3801, r_ui=4.0, est=4.214361688006636, details={'was_impossible': False})

In [104]:
accuracy.rmse(predictions)

RMSE: 0.7047


0.7047280313442099

In [168]:
testset[:5]

[(387, 3801, 4.0),
 (534, 507, 4.0),
 (480, 8874, 5.0),
 (575, 3469, 4.0),
 (214, 1219, 4.0)]

In [169]:
trainset.ir[0][:5]

[(0, 2.5), (6, 3.0), (30, 4.0), (31, 4.0), (35, 3.0)]

In [162]:
svd.compute_similarities()[1, 18]

0.3968253968253968

In [157]:
i = 1  # inner item id
u = 1  # inner user id

In [163]:
# Compute the similarity matrix once instead of once per neighbor
sims = svd.compute_similarities()
neighbors = [(v, sims[u, v]) for (v, r) in trainset.ir[i]]

In [164]:
neighbors = sorted(neighbors, key=lambda x: x[1], reverse=True)

In [167]:
neighbors[:3]

[(34, 1.0), (174, 0.75), (528, 0.6260869565217391)]

In [165]:
print('The 3 nearest neighbors of user', str(u), 'are:')
for v, sim_uv in neighbors[:3]:
    print('user {0:} with sim {1:1.2f}'.format(v, sim_uv))

The 3 nearest neighbors of user 1 are:
user 34 with sim 1.00
user 174 with sim 0.75
user 528 with sim 0.63
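
The similarity values look like mean squared difference (MSD) similarities, which is Surprise's default when no `sim_options` are given; that the default applies here is an assumption, since the notebook never sets it. Under MSD, the similarity of two users is 1 / (msd + 1), where msd is the mean squared rating difference over the items both have rated. A pure-Python sketch with made-up ratings:

```python
def msd_similarity(ratings_u, ratings_v):
    """MSD similarity of two users, given dicts mapping item -> rating."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)

# Hypothetical users; none of these ratings come from ratings_small.csv
user_a = {"item1": 4.0, "item2": 3.0, "item3": 5.0, "item4": 2.0}
user_b = {"item1": 3.0, "item2": 3.0, "item3": 5.0}
user_c = {"item1": 4.0, "item3": 5.0}

sim_ab = msd_similarity(user_a, user_b)
sim_ac = msd_similarity(user_a, user_c)
print(sim_ab, sim_ac)
```

A similarity of 1.0 only means that the ratings agree on the items both users rated, which may be very few items, so a top neighbor with sim 1.00 is not necessarily a close match overall.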
