# Recommender System Modeling
In this notebook, we work through multiple iterations of our movie recommender system model. First, we create a simple model using only user ratings (the MovieLens reviews, matched to IMDb IDs). This gives us a baseline model that we can potentially improve on by adding more detailed features.

In [12]:
# Imports
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import pairwise_distances

## Selecting Data for Recommender System
This recommender system uses the cosine similarity of user ratings to recommend a group of movies. It takes a movie title as input and returns the movies whose user ratings are most similar to those of the searched movie.
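To make the idea concrete, here is a minimal sketch of cosine similarity between movies on a made-up ratings matrix. The values and the helper name `cosine_similarity_rows` are illustrative assumptions, not part of the notebook's pipeline, and the sketch assumes every movie has at least one rating (an all-zero row would cause a division by zero).

```python
import numpy as np

# Toy ratings matrix: rows are movies, columns are users, 0 means "not rated".
# The values are made up for illustration only.
ratings = np.array([
    [10.0, 8.0, 0.0],   # movie A
    [ 9.0, 7.0, 0.0],   # movie B
    [ 0.0, 0.0, 6.0],   # movie C
])

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows (hypothetical helper)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_rows = matrix / norms          # scale each row to length 1
    return unit_rows @ unit_rows.T      # dot products of unit vectors

toy_similarities = cosine_similarity_rows(ratings)
print(np.round(toy_similarities, 3))
```

Movies A and B were rated by the same users in similar proportions, so their similarity is close to 1. Movie C shares no raters with them, so its similarity to both is 0.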

### Choosing data for model
To limit the computing power required, we will use only a subset of the reviews that we have in our recommender system. The same method can be replicated with larger data sets if more computing power is available.

In [6]:
# Let's read in our MovieLens reviews that were formatted in the Data Collection notebook
ml_reviews = pd.read_csv('../Data/Large-Data/ml_reviews.csv')
ml_reviews

Unnamed: 0,user_id,imdb_id,scaled_rating,title
0,1,tt0110912,10.0,Pulp Fiction
1,1,tt0111495,7.0,Trois couleurs: Rouge
2,1,tt0108394,10.0,Trois couleurs: Bleu
3,1,tt0114787,10.0,Underground
4,1,tt0045152,7.0,Singin' in the Rain
...,...,...,...,...
24969860,162541,tt0382932,9.0,Ratatouille
24969861,162541,tt0389790,5.0,Bee Movie
24969862,162541,tt0952640,4.0,Alvin and the Chipmunks
24969863,162541,tt0468569,8.0,The Dark Knight


In [7]:
# Let's take about half of our ratings by using only reviews from users with an ID below 81000
sampled_reviews = ml_reviews[ml_reviews['user_id']<81000]
sampled_reviews

Unnamed: 0,user_id,imdb_id,scaled_rating,title
0,1,tt0110912,10.0,Pulp Fiction
1,1,tt0111495,7.0,Trois couleurs: Rouge
2,1,tt0108394,10.0,Trois couleurs: Bleu
3,1,tt0114787,10.0,Underground
4,1,tt0045152,7.0,Singin' in the Rain
...,...,...,...,...
12511161,80999,tt5657846,9.0,Daddy's Home 2
12511162,80999,tt5095030,6.0,Ant-Man and the Wasp
12511163,80999,tt6921996,9.0,Johnny English Strikes Again
12511164,80999,tt1727824,10.0,Bohemian Rhapsody


In [8]:
# Let's see how much data we have in this sample
print(f"Reviews: {len(sampled_reviews['imdb_id'])}")
print(f"Movies: {sampled_reviews['imdb_id'].nunique()}")
print(f"Users: {sampled_reviews['user_id'].nunique()}")

Reviews: 12511166
Movies: 53786
Users: 80999
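As a rough check on why a sparse matrix is a sensible container for these ratings, the sketch below computes the fraction of movie-user pairs that actually have a rating. The three counts are copied from the output above. The variable names are illustrative.

```python
# Counts copied from the printed output above.
n_reviews = 12_511_166
n_movies = 53_786
n_users = 80_999

# Fraction of movie-user cells that actually hold a rating.
density = n_reviews / (n_movies * n_users)
print(f"Density: {density:.4%}")
```

Well under 1% of the possible movie-user pairs are filled, so almost every cell of the ratings matrix is empty.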


### Titles in Recommender

In [15]:
imdb_titles = pd.read_csv('../Data/Large-Data/imdb_titles.csv')

In [16]:
imdb_titles.head()

Unnamed: 0,titleId,title
0,tt0000001,Carmencita
1,tt0000002,Le clown et ses chiens
2,tt0000003,Pauvre Pierrot
3,tt0000004,Un bon bock
4,tt0000005,Blacksmith Scene


In [134]:
movies_in_sample = sampled_reviews.drop_duplicates(subset = 'imdb_id').sort_values(by='imdb_id', ascending=True)[['imdb_id', 'title']]
movies_in_sample

Unnamed: 0,imdb_id,title
1929959,tt0000001,Carmencita
1929826,tt0000003,Pauvre Pierrot
3257986,tt0000007,Corbett and Courtney Before the Kinetograph
1632384,tt0000008,Edison Kinetoscopic Record of a Sneeze
826401,tt0000010,La sortie de l'usine Lumière à Lyon
...,...,...
8058613,tt9866700,Paranormal Investigation
5029638,tt9872556,Momenti di trascurabile felicità
11731769,tt9876160,Koridor bessmertiya
6035809,tt9900060,Lupin the IIIrd: Mine Fujiko no Uso


In [135]:
movies_in_sample['title']

1929959                                      Carmencita
1929826                                  Pauvre Pierrot
3257986     Corbett and Courtney Before the Kinetograph
1632384          Edison Kinetoscopic Record of a Sneeze
826401              La sortie de l'usine Lumière à Lyon
                               ...                     
8058613                        Paranormal Investigation
5029638                Momenti di trascurabile felicità
11731769                            Koridor bessmertiya
6035809             Lupin the IIIrd: Mine Fujiko no Uso
6172650                                          Kaithi
Name: title, Length: 53786, dtype: object

In [10]:
# Create a list of titles ordered by IMDb ID for rows and columns of recommender
ordered_titles = sampled_reviews.drop_duplicates(subset = 'imdb_id').sort_values(by='imdb_id', ascending=True)['title']

In [122]:
ordered_titles

1929959                                      Carmencita
1929826                                  Pauvre Pierrot
3257986     Corbett and Courtney Before the Kinetograph
1632384          Edison Kinetoscopic Record of a Sneeze
826401              La sortie de l'usine Lumière à Lyon
                               ...                     
8058613                        Paranormal Investigation
5029638                Momenti di trascurabile felicità
11731769                            Koridor bessmertiya
6035809             Lupin the IIIrd: Mine Fujiko no Uso
6172650                                          Kaithi
Name: title, Length: 53786, dtype: object

## Creating Recommender System DataFrame


In [141]:
def create_matrix(df):

    # Sorted unique IDs: a movie's position is its row in X, a user's position is its column
    user_ids = np.unique(df["user_id"])
    movie_ids = np.unique(df["imdb_id"])
    n = len(user_ids)
    m = len(movie_ids)

    # Map IDs to indices
    user_mapper = dict(zip(user_ids, range(n)))
    movie_mapper = dict(zip(movie_ids, range(m)))

    # Map indices to IDs
    user_inv_mapper = dict(zip(range(n), user_ids))
    movie_inv_mapper = dict(zip(range(m), movie_ids))

    user_index = [user_mapper[i] for i in df['user_id']]
    movie_index = [movie_mapper[i] for i in df['imdb_id']]

    # Rows are movies and columns are users, so row-wise similarity compares movies
    X = csr_matrix((df["scaled_rating"], (movie_index, user_index)), shape=(m, n))

    # One row per movie, sorted by IMDb ID so the titles line up with the rows of X
    movies_in_sample = df.drop_duplicates(subset = 'imdb_id').sort_values(by='imdb_id', ascending=True)[['imdb_id', 'title']]

    return X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper, user_index, movie_index, movies_in_sample
     


In [142]:
X, user_mapper, movie_mapper, user_inv_mapper, movie_inv_mapper, user_index, movie_index, movies_in_sample = create_matrix(sampled_reviews)

In [143]:
len(movie_mapper)

53786
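The ID-to-index mapping inside `create_matrix` is easier to follow on a tiny example. The sketch below repeats the same steps on a made-up three-row ratings table. The data and the names `toy`, `movie_to_row`, `user_to_col`, and `toy_X` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings; the column names match sampled_reviews.
toy = pd.DataFrame({
    "user_id": [7, 7, 9],
    "imdb_id": ["tt0000003", "tt0000001", "tt0000001"],
    "scaled_rating": [8.0, 6.0, 10.0],
})

# np.unique returns sorted unique values, so index 0 is the smallest ID.
movie_ids = np.unique(toy["imdb_id"])
user_ids = np.unique(toy["user_id"])
movie_to_row = {movie_id: row for row, movie_id in enumerate(movie_ids)}
user_to_col = {user_id: col for col, user_id in enumerate(user_ids)}

# Translate every rating's IDs into (row, column) coordinates.
rows = toy["imdb_id"].map(movie_to_row).to_numpy()
cols = toy["user_id"].map(user_to_col).to_numpy()
toy_X = csr_matrix(
    (toy["scaled_rating"].to_numpy(), (rows, cols)),
    shape=(len(movie_ids), len(user_ids)),
)

print(toy_X.toarray())
```

Because the movie IDs are sorted before indices are assigned, row 0 belongs to `tt0000001` even though it appears second in the table. This is why a title list sorted by `imdb_id` lines up with the rows of the matrix.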

In [145]:
#movie_mapper

In [16]:
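# Cosine similarity is 1 minus cosine distance; the rows of X are movies, so this compares every pair of movies
similarities = 1.0 - pairwise_distances(X, metric = 'cosine')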

In [17]:
similarities.shape

(53786, 53786)
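A quick back-of-the-envelope estimate shows why computing power is the limiting factor here. The sketch below assumes the similarity matrix is stored as dense 64-bit floats. That is an assumption about the dtype, not something the output above confirms.

```python
n_movies = 53_786          # from similarities.shape above
bytes_per_value = 8        # assumption: float64 values

# A dense movie-by-movie matrix stores one value for every pair of movies.
dense_bytes = n_movies * n_movies * bytes_per_value
print(f"Dense similarity matrix: about {dense_bytes / 1e9:.1f} GB")
```

Under that assumption the matrix alone needs roughly 23 GB of memory, before any copies made while wrapping it in a DataFrame.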

In [146]:
#recommender = pd.DataFrame(similarities, index = ordered_titles, columns=ordered_titles)
recommender = pd.DataFrame(similarities, index = movies_in_sample['title'], columns=movies_in_sample['title'])

recommender.head()

title,Carmencita,Pauvre Pierrot,Corbett and Courtney Before the Kinetograph,Edison Kinetoscopic Record of a Sneeze,La sortie de l'usine Lumière à Lyon,L'arrivée d'un train à La Ciotat,Le débarquement du congrès de photographie à Lyon,L'arroseur arrosé,Barque sortant du port,Les forgerons,...,Fin de siglo,Kaijû no kodomo,Falling Inn Love,Nate Bargatze: The Tennessee Kid,The Far Green Country,Paranormal Investigation,Momenti di trascurabile felicità,Koridor bessmertiya,Lupin the IIIrd: Mine Fujiko no Uso,Kaithi
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Carmencita,1.0,0.687353,0.243162,0.221897,0.367293,0.254815,0.630711,0.459319,0.6742,0.667711,...,0.0,0.0,0.0,0.0,0.256323,0.0,0.232391,0.0,0.0,0.0
Pauvre Pierrot,0.687353,1.0,0.229815,0.199731,0.428713,0.254607,0.575176,0.373734,0.707992,0.696796,...,0.0,0.0,0.0,0.0,0.242253,0.0,0.219635,0.0,0.0,0.0
Corbett and Courtney Before the Kinetograph,0.243162,0.229815,1.0,0.095388,0.127221,0.240332,0.399556,0.353928,0.360668,0.401846,...,0.0,0.0,0.0,0.0,0.411365,0.0,0.372957,0.0,0.0,0.0
Edison Kinetoscopic Record of a Sneeze,0.221897,0.199731,0.095388,1.0,0.236683,0.15914,0.347252,0.184956,0.235091,0.234647,...,0.006218,0.0,0.002047,0.002789,0.100551,0.0,0.091163,0.0,0.0,0.0
La sortie de l'usine Lumière à Lyon,0.367293,0.428713,0.127221,0.236683,1.0,0.407914,0.369062,0.493358,0.372334,0.367538,...,0.0,0.0,0.0,0.0,0.134107,0.0,0.121586,0.0,0.0,0.0



In [21]:
# Get the unique IMDb IDs from our movie mapper
movie_mapper_ids = set(movie_mapper.keys())

# imdb_titles has the columns 'titleId' and 'title'
# Keep only the rows whose titleId exists in movie_mapper_ids
titles_in_movie_mapper = imdb_titles[imdb_titles['titleId'].isin(movie_mapper_ids)]

# Show the first rows of the filtered DataFrame
titles_in_movie_mapper.head()

Unnamed: 0,titleId,title
0,tt0000001,Carmencita
2,tt0000003,Pauvre Pierrot
6,tt0000007,Corbett and Courtney Before the Kinetograph
7,tt0000008,Edison Kinetoscopic Record of a Sneeze
9,tt0000010,La sortie de l'usine Lumière à Lyon


### Getting Recommendations

In [198]:
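def get_recommendations():
    x = input('What movie did you just watch?')
    #titles = titles_in_movie_mapper
    titles = movies_in_sample
    # Match the input as a literal substring rather than as a regular expression
    for title in titles.loc[titles['title'].str.contains(x, regex=False), 'title']:
        print(title)
        # Skip position 0 (expected to be the searched movie itself) and show the next ten
        print(recommender[title].sort_values(ascending = False)[1:11])
        print()
        print()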

In [199]:
get_recommendations()

Toy Story
title
Star Wars                                         0.564555
Toy Story 2                                       0.564336
Back to the Future                                0.550556
Forrest Gump                                      0.546750
Jurassic Park                                     0.541597
Independence Day                                  0.537916
Star Wars: Episode VI - Return of the Jedi        0.537413
The Lion King                                     0.528811
Aladdin                                           0.525246
Star Wars: Episode V - The Empire Strikes Back    0.513098
Name: Toy Story, dtype: float64


Toy Story 2
title
Toy Story             0.564336
A Bug's Life          0.562676
Monsters, Inc.        0.520420
Shrek                 0.505918
Finding Nemo          0.480358
Ghostbusters          0.470100
Chicken Run           0.467211
Men in Black          0.464324
Back to the Future    0.461655
The Incredibles       0.457856
Name: Toy Story 2, dtype: float64

### Getting Recommendations for a Specific Title

In [164]:
def get_recommendations_exact_title():
    x = input('What movie did you just watch?')
    print(f"Movies like: {x}")
    print('='*20)
    print(recommender[x].sort_values(ascending = False)[1:11])

In [165]:
# run function searching for Toy Story
get_recommendations_exact_title()

Movies like: Toy Story
title
Star Wars                                         0.564555
Toy Story 2                                       0.564336
Back to the Future                                0.550556
Forrest Gump                                      0.546750
Jurassic Park                                     0.541597
Independence Day                                  0.537916
Star Wars: Episode VI - Return of the Jedi        0.537413
The Lion King                                     0.528811
Aladdin                                           0.525246
Star Wars: Episode V - The Empire Strikes Back    0.513098
Name: Toy Story, dtype: float64
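The `[1:11]` slice relies on the searched movie ranking first in its own column. A variant that removes the searched title by label and takes the title as an argument (instead of calling `input`) may be easier to reuse and test. The sketch below is one possible way to do that. The function name `top_similar` and the toy table are hypothetical, and it assumes every title is unique.

```python
import pandas as pd

def top_similar(similarity_df, title, n=10):
    """Return the n most similar titles to `title`, excluding `title` itself."""
    scores = similarity_df[title]
    # Drop the searched title by label rather than by position.
    scores = scores.drop(labels=title)
    return scores.sort_values(ascending=False).head(n)

# Hypothetical 3x3 similarity table standing in for `recommender`.
titles = ["Movie A", "Movie B", "Movie C"]
toy_recommender = pd.DataFrame(
    [[1.0, 0.2, 0.7],
     [0.2, 1.0, 0.1],
     [0.7, 0.1, 1.0]],
    index=titles,
    columns=titles,
)

print(top_similar(toy_recommender, "Movie A", n=2))
```

With the real data, the equivalent call would be `top_similar(recommender, 'Toy Story')`, provided the title appears only once in the sample.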


### Specific Movie Search that can handle Duplicates

In [169]:
len(movies_in_sample[movies_in_sample['title'] == 'dsfsdsdvsd'])

0

In [194]:
def get_recommendations_exact_title_w_dups():
    #x = input('What movie did you just watch?')
    x = 'Bad Boys'
    # Count how many movies in the sample have exactly this title
    n_matches = len(movies_in_sample[movies_in_sample['title'] == x])
    if n_matches == 1:
        print(f"Movies like: {x}")
        print('='*20)
        # recommender[x] is a Series here, so sort_values takes no `by` argument
        print(recommender[x].sort_values(ascending = False)[1:11])
    elif n_matches > 1:
        print(f"There are {n_matches} movies named {x}.")
        #for title in movies_in_sample.loc[movies_in_sample['title'] == x]:
            #print(title)
            #print(recommender[title].sort_values(ascending = False)[1:11])
    else:
        print(f'Sorry, {x} was not found in our system.')

    

In [197]:
get_recommendations_exact_title_w_dups()

There are 2 movies named Bad Boys.
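One possible way to return recommendations for each of the duplicate titles is to index the similarity table by IMDb ID, which is unique, and translate IDs back to titles only for display. The sketch below illustrates the idea on made-up data. The IDs, titles, scores, and the function name `recommend_for_title` are all hypothetical, and this is not the approach the notebook itself implements.

```python
import pandas as pd

# Hypothetical stand-in for movies_in_sample: two films share one title.
toy_movies = pd.DataFrame({
    "imdb_id": ["tt001", "tt002", "tt003"],
    "title": ["Same Name", "Same Name", "Other Film"],
})

# Hypothetical similarity table indexed by IMDb ID instead of title.
toy_similarities = pd.DataFrame(
    [[1.0, 0.3, 0.8],
     [0.3, 1.0, 0.4],
     [0.8, 0.4, 1.0]],
    index=toy_movies["imdb_id"],
    columns=toy_movies["imdb_id"],
)
id_to_title = toy_movies.set_index("imdb_id")["title"]

def recommend_for_title(title, n=10):
    """Return one ranked Series per movie whose title matches exactly."""
    results = {}
    for imdb_id in toy_movies.loc[toy_movies["title"] == title, "imdb_id"]:
        scores = toy_similarities[imdb_id].drop(labels=imdb_id)
        top = scores.sort_values(ascending=False).head(n)
        # Swap IDs for readable titles only after ranking.
        top.index = top.index.map(id_to_title)
        results[imdb_id] = top
    return results

for imdb_id, top in recommend_for_title("Same Name", n=2).items():
    print(imdb_id)
    print(top)
```

An unknown title yields an empty dictionary, which the caller can use to print a "not found" message.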
