## 10.2 Assignment: Recommender System

Using the small MovieLens data set, create a recommender system that allows users to input a movie they like (in the data set) and recommends ten other movies for them to watch. In your write-up, clearly explain the recommender system process and all steps performed. If you are using a method found online, be sure to reference the source.


You can use R or Python to complete this assignment. Submit your code and output to the submission link. Make sure to add comments to all of your code and to document your steps, process, and analysis.

##### Please Note: For this assignment, I tried three different methods and compared how similar or dissimilar their recommendations were.

### Initial Method: Creating a Content-Based (Tags) Recommender Function

#### 1. Import libraries and pull in the data.

In [127]:
## Import libraries.
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

I will first load each .csv file into its own dataframe (they are saved as separate files) and combine them later.

In [128]:
## Load/pull-in the movies, ratings, and tags data.
## A separate dataframe for each to start (since they each have their own .csv file).

movies_df = pd.read_csv(r"C:\Users\Madeleine's PC\Documents\Madeleine\Documents\Bellevue University Courses\Masters in DS\BU DSC630\Data\movies.csv")

ratings_df = pd.read_csv(r"C:\Users\Madeleine's PC\Documents\Madeleine\Documents\Bellevue University Courses\Masters in DS\BU DSC630\Data\ratings.csv")

tags_df = pd.read_csv(r"C:\Users\Madeleine's PC\Documents\Madeleine\Documents\Bellevue University Courses\Masters in DS\BU DSC630\Data\tags.csv")
                    

In [129]:
## View the data.
## Movies.

movies_df.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [130]:
## Ratings.

ratings_df.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [131]:
## Tags.

tags_df.head(5)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


#### 2. Transform and prepare the data to use in the recommender system.

In [132]:
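## Next, combine the datasets together.
## Both merges are left joins, so every rating is kept; a rating row repeats once for each tag that the same user gave the same movie.
combined_movie_df = ratings_df.merge(movies_df, on='movieId', how='left').merge(tags_df, on=['movieId', 'userId'], how='left')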

combined_movie_df.head(10)

Unnamed: 0,userId,movieId,rating,timestamp_x,title,genres,tag,timestamp_y
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,,
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance,,
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller,,
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,,
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,,
5,1,70,3.0,964982400,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller,,
6,1,101,5.0,964980868,Bottle Rocket (1996),Adventure|Comedy|Crime|Romance,,
7,1,110,4.0,964982176,Braveheart (1995),Action|Drama|War,,
8,1,151,5.0,964984041,Rob Roy (1995),Action|Drama|Romance|War,,
9,1,157,5.0,964984100,Canadian Bacon (1995),Comedy|War,,


#### 3. Organize the data so it can be used more efficiently.

First, I want to aggregate the movie ratings to get an overall (mean) rating per title, along with the number of ratings that mean is based on.

In [133]:
## Using aggregation methods, create a new ratings table.

aggregate_rate_df = pd.DataFrame(combined_movie_df.groupby('title')['rating'].mean())

aggregate_rate_df['totaloverall_ratings'] = pd.DataFrame(combined_movie_df.groupby('title')['rating'].count())

aggregate_rate_df = aggregate_rate_df.reset_index()

In [134]:
## View.
aggregate_rate_df.head(10)

Unnamed: 0,title,rating,totaloverall_ratings
0,'71 (2014),4.0,1
1,'Hellboy': The Seeds of Creation (2004),4.0,1
2,'Round Midnight (1986),3.5,2
3,'Salem's Lot (2004),5.0,1
4,'Til There Was You (1997),4.0,2
5,'Tis the Season for Love (2015),1.5,1
6,"'burbs, The (1989)",3.176471,17
7,'night Mother (1986),3.0,1
8,(500) Days of Summer (2009),3.857143,49
9,*batteries not included (1987),3.285714,7
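One caveat about the counts above: because `combined_movie_df` was built by merging the ratings with the tags on `['movieId', 'userId']`, a rating row is repeated once for every tag that user gave that movie. This is why (500) Days of Summer shows 49 ratings here but 42 in the ratings-only count later in this notebook. The sketch below reproduces the effect on made-up data; the `*_toy` names and values are hypothetical and are not taken from MovieLens.

```python
import pandas as pd

# Hypothetical toy data: two users rate the same movie, and user 1 adds three tags.
ratings_toy = pd.DataFrame({
    "userId": [1, 2],
    "movieId": [10, 10],
    "rating": [4.0, 3.0],
})
tags_toy = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [10, 10, 10],
    "tag": ["funny", "quirky", "romance"],
})

# Same kind of left merge as in the notebook: user 1's rating is repeated per tag.
merged_toy = ratings_toy.merge(tags_toy, on=["movieId", "userId"], how="left")

inflated_count = len(merged_toy)              # 4 rows instead of 2
inflated_mean = merged_toy["rating"].mean()   # user 1's rating is weighted three times

# Aggregating the ratings table directly avoids the duplication.
true_count = len(ratings_toy)
true_mean = ratings_toy["rating"].mean()

print(inflated_count, inflated_mean, true_count, true_mean)
```

If the duplication is unwanted, one option is to compute the mean and count from `ratings_df` merged with `movies_df` only, and to use the tags table separately for tag matching.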


The data set includes a tags file, and tags label a movie with certain themes or topic areas. Tags can therefore be used to classify movies and to recommend movies whose tags are similar to those of the movie a user searches for. In other words, the recommender "matches" the tags of the recommended movies to the tags of the searched movie.

In [135]:
## Create a new dataframe to include up to the top 7 "tags" per movie.
top7tags_df = (combined_movie_df[['title', 'tag']].reset_index().groupby(['title', 'tag']).count()
               .reset_index().set_axis(['title', 'tag', 'count'], axis=1)
               .sort_values('count', ascending=False).groupby('title').head(7))

In [136]:
## View the data.
top7tags_df.head(5)

Unnamed: 0,title,tag,count
814,Donnie Darko (2001),dreamlike,3
3159,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),time travel,3
1008,Fight Club (1999),dark comedy,3
2862,Step Brothers (2008),funny,3
1470,Inception (2010),thought-provoking,3


Next, I will create a list of all of the movies present in the data set. This way, if a user searches for a movie that does not exist in the data, I can show a message asking them to choose a different movie (since no recommendations can be made for a title that is not present).

In [137]:
## Create a sorted list of the available movie titles.
## It is used to validate the user's input.
available_movies = list(combined_movie_df.sort_values('title', ascending=True)['title'].unique())

In [138]:
## View the list. 
available_movies

["'71 (2014)",
 "'Hellboy': The Seeds of Creation (2004)",
 "'Round Midnight (1986)",
 "'Salem's Lot (2004)",
 "'Til There Was You (1997)",
 "'Tis the Season for Love (2015)",
 "'burbs, The (1989)",
 "'night Mother (1986)",
 '(500) Days of Summer (2009)',
 '*batteries not included (1987)',
 '...All the Marbles (1981)',
 '...And Justice for All (1979)',
 '00 Schneider - Jagd auf Nihil Baxter (1994)',
 '1-900 (06) (1994)',
 '10 (1979)',
 '10 Cent Pistol (2015)',
 '10 Cloverfield Lane (2016)',
 '10 Items or Less (2006)',
 '10 Things I Hate About You (1999)',
 '10 Years (2011)',
 '10,000 BC (2008)',
 '100 Girls (2000)',
 '100 Streets (2016)',
 '101 Dalmatians (1996)',
 '101 Dalmatians (One Hundred and One Dalmatians) (1961)',
 "101 Dalmatians II: Patch's London Adventure (2003)",
 '101 Reykjavik (101 Reykjavík) (2000)',
 '102 Dalmatians (2000)',
 '10th & Wolf (2006)',
 '10th Kingdom, The (2000)',
 '10th Victim, The (La decima vittima) (1965)',
 '11\'09"01 - September 11 (2002)',
 '11:14 (2

#### 4. Create the recommender system (define the function).

For my recommender system, I am going to use a function that takes the top tags of the movie searched by the user, finds the other movies that share any of those tags, and returns the 10 of them with the highest average rating.

In [186]:
## Recommender System.
## Create and define a function that will obtain the 10 movie recommendations.

def get_movie_recommendations(movie):
    if movie in available_movies:
        ## Tags (up to 7) attached to the searched movie.
        tag_list = top7tags_df[top7tags_df['title'] == movie]['tag'].unique().tolist()
        movie_list = []

        ## Collect every other title that carries a matching tag (plain substring match).
        for tag in tag_list:
            for title in top7tags_df[top7tags_df['tag'].str.contains(tag, regex=False)]['title'].unique().tolist():
                if title != movie:
                    movie_list.append(title)

        ## Remove duplicates, then keep the 10 candidates with the highest mean rating.
        movie_list = list(set(movie_list))
        recs = aggregate_rate_df[aggregate_rate_df['title'].isin(movie_list)].nlargest(10, 'rating')['title'].to_list()
        print(recs)

    else:
        print("Oops! That movie is not present within the database. Please pick another movie to get your recommendations!")
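As a self-contained illustration of the same tag-matching logic, here is a minimal sketch that runs on made-up data and returns the list instead of printing it. The function name `recommend_by_tags`, the `*_toy` dataframes, and the single-letter titles are hypothetical; only the steps (collect the seed movie's tags, find other titles with a matching tag, rank by mean rating) follow the function above.

```python
import pandas as pd

# Hypothetical toy data standing in for top7tags_df and aggregate_rate_df.
tags_toy = pd.DataFrame({
    "title": ["A", "A", "B", "C", "D"],
    "tag": ["funny", "pixar", "funny", "pixar", "dark"],
})
ratings_toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating": [4.0, 3.0, 4.5, 5.0],
})


def recommend_by_tags(movie, tags_df, ratings_df, n=10):
    """Return up to n titles that share a tag with `movie`, best-rated first."""
    seed_tags = tags_df.loc[tags_df["title"] == movie, "tag"].unique()
    candidates = set()
    for tag in seed_tags:
        # Plain substring match, as in the notebook function.
        matches = tags_df["tag"].str.contains(tag, regex=False)
        candidates.update(tags_df.loc[matches, "title"])
    candidates.discard(movie)  # never recommend the seed movie itself
    ranked = ratings_df[ratings_df["title"].isin(candidates)].nlargest(n, "rating")
    return ranked["title"].tolist()


toy_recs = recommend_by_tags("A", tags_toy, ratings_toy)
print(toy_recs)
```

Movie "D" has the highest rating in the toy data but shares no tag with "A", so it is not recommended. Returning a list rather than printing makes the result easier to compare across methods.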

#### 5. Try out the recommender.

In [203]:
## Try it out!
## Select a movie to see what other recommendations "pop-up".
movie = 'Toy Story'

## Call the function to obtain the recommendations for the movie entered above.
get_movie_recommendations(movie)

Oops! That movie is not present within the database. Please pick another movie to get your recommendations!
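The lookup above fails because the titles in the data set include the release year, so 'Toy Story' is not an exact match for any entry. One possible way to help the user, sketched below, is to list the titles that contain the search text. The helper `find_titles` and the short `titles_toy` list are hypothetical and are not part of the notebook; in practice the list of available titles would be passed in.

```python
def find_titles(query, titles):
    """Return every title that contains `query`, ignoring upper/lower case."""
    q = query.casefold()
    return [t for t in titles if q in t.casefold()]


# Hypothetical short list standing in for the full list of available titles.
titles_toy = ["Toy Story (1995)", "Toy Story 2 (1999)", "Heat (1995)"]

matches = find_titles("toy story", titles_toy)
print(matches)
```

The user could then pick one of the suggested full titles and pass it to the recommender function.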


In [202]:
## Try it now with the year.
movie = 'Toy Story (1995)'

## Call the function to obtain the recommendations for the movie entered above.
get_movie_recommendations(movie)

['Brothers Bloom, The (2008)', 'Graduate, The (1967)', 'Guardians of the Galaxy (2014)', 'Game Night (2018)', 'The Lego Movie (2014)', 'Big Hero 6 (2014)', 'Guardians of the Galaxy 2 (2017)', 'Big Short, The (2015)', 'Kung Fury (2015)', 'Zombieland (2009)']


Not bad! Some of these movies are good recommendations, while I am less sure about others. For example, Guardians of the Galaxy (and its sequel), The Lego Movie, and Big Hero 6 are strong recommendations with similar themes.

### Alternative Methods: Correlation and KNN

While the above is a simple methodology, I wanted to try some others to see how they compare. 

#### 1. Movie correlations method.

In [208]:
## Load and view data (again).
## Ratings.
ratings2_df = pd.read_csv(r"C:\Users\Madeleine's PC\Documents\Madeleine\Documents\Bellevue University Courses\Masters in DS\BU DSC630\Data\ratings.csv")
ratings2_df.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [209]:
## Movies.
movies2_df = pd.read_csv(r"C:\Users\Madeleine's PC\Documents\Madeleine\Documents\Bellevue University Courses\Masters in DS\BU DSC630\Data\movies.csv")
movies2_df.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [210]:
## Merge.
ratings2_df = ratings2_df.merge(movies2_df, on='movieId', how='left')

## View.
ratings2_df.head(5)

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


In [211]:
## Create rating counts (by movie/title).
total_ratings = pd.DataFrame(ratings2_df.groupby('title')['rating'].count())
total_ratings = total_ratings.reset_index()

In [212]:
## Rename column.
total_ratings = total_ratings.rename(columns={'rating':'number of ratings'})

## View.
total_ratings.head(10)

Unnamed: 0,title,number of ratings
0,'71 (2014),1
1,'Hellboy': The Seeds of Creation (2004),1
2,'Round Midnight (1986),2
3,'Salem's Lot (2004),1
4,'Til There Was You (1997),2
5,'Tis the Season for Love (2015),1
6,"'burbs, The (1989)",17
7,'night Mother (1986),1
8,(500) Days of Summer (2009),42
9,*batteries not included (1987),7


In [213]:
## Merge again.
merged_df = ratings2_df.merge(total_ratings, on='title', how='left')

## View.
merged_df.head(5)

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,number of ratings
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance,52
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller,102
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,203
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,204


In [216]:
## Keep only movies with more than 10 ratings.
merged_df = merged_df[merged_df['number of ratings'] > 10]

In [217]:
## Pivot the table
user_df = merged_df.pivot_table(index='userId', columns='title', values='rating')

In [218]:
## View.
user_df.head(5)

title,"'burbs, The (1989)",(500) Days of Summer (2009),10 Cloverfield Lane (2016),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),...,Zack and Miri Make a Porno (2008),Zero Dark Thirty (2012),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zootopia (2016),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
userId
1,,,,,,,,,,,...,,,,,,,,,,4.0
2,,,,,,,,,,,...,,,,,3.0,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,5.0,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


In [219]:
## Obtain movie suggestions.
## Input movie title.
movie = 'Toy Story (1995)'

In [220]:
## Calculate correlations.
correlations = user_df.corrwith(user_df[movie]).sort_values(ascending=False)

In [221]:
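## Print movie suggestions.
## Position 0 is skipped on the assumption that it is the input movie itself (correlation of 1.0).
print('Your Top 10 movie suggestions are:\n')
for i in range(1, 11):
    print(correlations.index[i])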

Your Top 10 movie suggestions are:

The Nice Guys (2016)
Avengers: Infinity War - Part I (2018)
Blues Brothers 2000 (1998)
Singles (1992)
22 Jump Street (2014)
Passion of the Christ, The (2004)
Untitled Spider-Man Reboot (2017)
Dunkirk (2017)
48 Hrs. (1982)
Darjeeling Limited, The (2007)


This method returns quite different recommendations. While a couple seem to fit, overall they are weaker than those from the previous method: most are aimed at adults, whereas Toy Story is geared more towards children.
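One possible reason for the weak results (an assumption on my part, not something verified on the data) is that a correlation between two movies is computed only from the users who rated both. When very few users overlap, the correlation can be a perfect 1.0 by chance. The sketch below shows one way to guard against this on made-up data by requiring a minimum number of shared raters through the `min_periods` argument of pandas' `Series.corr`. The function `top_correlated`, the `min_overlap` parameter, and the toy ratings are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie ratings: "Sparse" shares only two raters with "Seed".
user_toy = pd.DataFrame({
    "Seed":   [5.0, 4.0, 3.0, 2.0, 1.0, 4.0],
    "Broad":  [5.0, 4.0, 3.0, 2.0, 2.0, 4.0],
    "Sparse": [np.nan, np.nan, np.nan, np.nan, 1.0, 4.0],
})


def top_correlated(df, movie, n=10, min_overlap=3):
    """Rank the other movies by correlation with `movie`, requiring min_overlap shared raters."""
    target = df[movie]
    others = df.drop(columns=movie)  # drop the seed explicitly instead of skipping position 0
    corr = others.apply(lambda col: col.corr(target, min_periods=min_overlap))
    return corr.dropna().sort_values(ascending=False).head(n)


strict = top_correlated(user_toy, "Seed")                # "Sparse" is excluded
loose = top_correlated(user_toy, "Seed", min_overlap=2)  # "Sparse" ranks first
print(strict)
print(loose)
```

With only two shared raters, "Sparse" correlates perfectly with "Seed" and would top the list; requiring at least three shared raters removes it. A sensible threshold for the real data would need to be chosen by experiment.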

#### 2. KNN (nearest neighbors) method.

In [222]:
## Import additional libraries (matplotlib and seaborn were already imported above).
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

In [223]:
## Load data.
movies = pd.read_csv(r"C:\Users\Madeleine's PC\Documents\Madeleine\Documents\Bellevue University Courses\Masters in DS\BU DSC630\Data\movies.csv")

ratings = pd.read_csv(r"C:\Users\Madeleine's PC\Documents\Madeleine\Documents\Bellevue University Courses\Masters in DS\BU DSC630\Data\ratings.csv")

In [224]:
## View.
movies.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [225]:
## View.
ratings.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [226]:
## Pivot.
final_dataset = ratings.pivot(index='movieId',columns='userId',values='rating')
final_dataset.head(5)

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
movieId
1,4.0,,,,4.0,,4.5,,,,...,4.0,,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,,,,,,4.0,,4.0,,,...,,4.0,,5.0,3.5,,,2.0,,
3,4.0,,,,,5.0,,,,,...,,,,,,,,2.0,,
4,,,,,,3.0,,,,,...,,,,,,,,,,
5,,,,,,5.0,,,,,...,,,,3.0,,,,,,


In [227]:
## Replace missing ratings (NaN) with 0 so the table can later be converted to a sparse matrix.
final_dataset.fillna(0,inplace=True)
final_dataset.head(5)

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
movieId
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


In [228]:
## Count the ratings each movie received and the ratings each user submitted.
no_user_voted = ratings.groupby('movieId')['rating'].agg('count')
no_movies_voted = ratings.groupby('userId')['rating'].agg('count')

In [229]:
## Keep only movies that received more than 5 ratings.
final_dataset = final_dataset.loc[no_user_voted[no_user_voted > 5].index,:]

In [230]:
## Keep only users who submitted more than 10 ratings.
final_dataset=final_dataset.loc[:,no_movies_voted[no_movies_voted > 10].index]
final_dataset

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
movieId
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
177765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
179401,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
179819,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
180031,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [231]:
## Illustrate sparsity (the share of zero entries) on a small sample matrix.
sample = np.array([[0,0,3,0,0],[4,0,0,0,2],[0,0,0,0,1]])
sparsity = 1.0 - ( np.count_nonzero(sample) / float(sample.size) )
print(sparsity)

0.7333333333333334


In [232]:
## The CSR (compressed sparse row) format stores only the non-zero entries.
csr_sample = csr_matrix(sample)
print(csr_sample)

  (0, 2)	3
  (1, 0)	4
  (1, 4)	2
  (2, 4)	1


In [233]:
## Convert the ratings matrix to CSR format, then move movieId from the index back into a column.
csr_data = csr_matrix(final_dataset.values)
final_dataset.reset_index(inplace=True)

In [234]:
## Create the recommendation model.
knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=10, n_jobs=-1)
knn.fit(csr_data)

NearestNeighbors(algorithm='brute', metric='cosine', n_jobs=-1, n_neighbors=10)

In [235]:
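## Recommendation function.
def get_movie_recommendation(movie_name):
    n_movies_to_recommend = 10
    ## Plain substring match on the title (regex=False so characters such as parentheses are taken literally).
    movie_list = movies[movies['title'].str.contains(movie_name, regex=False)]
    if len(movie_list):
        ## Use the first matching title and find its row position in the filtered ratings matrix.
        movie_id = movie_list.iloc[0]['movieId']
        matching_rows = final_dataset[final_dataset['movieId'] == movie_id].index
        if len(matching_rows) == 0:
            return "Oops! That movie does not have enough ratings to generate recommendations. Please pick another movie!"
        movie_idx = matching_rows[0]
        ## Ask for one extra neighbor because the closest match is the movie itself.
        distances, indices = knn.kneighbors(csr_data[movie_idx], n_neighbors=n_movies_to_recommend+1)
        ## Sort by distance, drop the movie itself, and reverse, so the closest neighbor ends up last (row 10).
        rec_movie_indices = sorted(list(zip(indices.squeeze().tolist(), distances.squeeze().tolist())), key=lambda x: x[1])[:0:-1]
        recommend_frame = []
        for val in rec_movie_indices:
            movie_id = final_dataset.iloc[val[0]]['movieId']
            idx = movies[movies['movieId'] == movie_id].index
            recommend_frame.append({'Title': movies.iloc[idx]['title'].values[0], 'Distance': val[1]})
        df = pd.DataFrame(recommend_frame, index=range(1, n_movies_to_recommend+1))
        return df
    else:
        return "Oops! That movie is not present within the database. Please pick another movie to get your recommendations!"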

In [236]:
## Search and get recommendations!
## Try a new movie here.
get_movie_recommendation('Toy Story')

Unnamed: 0,Title,Distance
1,Back to the Future (1985),0.469619
2,Groundhog Day (1993),0.465831
3,Mission: Impossible (1996),0.461087
4,Star Wars: Episode VI - Return of the Jedi (1983),0.458911
5,"Lion King, The (1994)",0.458855
6,Forrest Gump (1994),0.452904
7,Star Wars: Episode IV - A New Hope (1977),0.442612
8,Independence Day (a.k.a. ID4) (1996),0.435738
9,Jurassic Park (1993),0.434363
10,Toy Story 2 (1999),0.427399


These recommendations are more in line with the first method's and noticeably better than the second method's. Many of them fit the same theme and would be good movies for children to watch as well. Interestingly, this is the only recommender that recommended Toy Story 2. Note that the table is ordered from the largest to the smallest cosine distance, so Toy Story 2, listed 10th with the smallest distance (0.427399), is actually the closest match. Overall, I think this method performed best and gave the most relevant recommendations.
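In the KNN table above, the Distance column decreases from row 1 to row 10, so the row labeled 10 is the nearest neighbor. To show what the cosine metric measures and how a nearest-first ordering could be produced, here is a minimal sketch that computes cosine distance by hand with NumPy on a made-up movie-by-user matrix. It does not call scikit-learn; the function `nearest_first` and the matrix values are hypothetical.

```python
import numpy as np

# Hypothetical ratings matrix: rows are movies, columns are users, 0 means "not rated".
item_user = np.array([
    [5.0, 4.0, 0.0, 5.0],  # movie 0 (the seed)
    [5.0, 5.0, 0.0, 4.0],  # movie 1: rated much like the seed
    [0.0, 0.0, 5.0, 0.0],  # movie 2: no raters in common with the seed
    [4.0, 0.0, 0.0, 5.0],  # movie 3: partly overlapping raters
])


def nearest_first(matrix, seed_row, n=2):
    """Return (row, cosine distance) pairs for the n closest rows, closest first."""
    norms = np.linalg.norm(matrix, axis=1)  # assumes no all-zero rows
    cosine_sim = matrix @ matrix[seed_row] / (norms * norms[seed_row])
    distances = 1.0 - cosine_sim
    # argsort is ascending, so the smallest distance (most similar) comes first.
    order = [i for i in np.argsort(distances) if i != seed_row]
    return [(int(i), float(distances[i])) for i in order[:n]]


neighbors = nearest_first(item_user, seed_row=0)
print(neighbors)
```

Listing the smallest distance first puts the most similar movie in position 1, which is usually what a reader expects from a ranked recommendation list.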

Conclusions:

* The first method was a content-based method that used movie tags to make the ten movie recommendations. The recommendations were pretty good overall.
* The second method used correlation to make movie recommendations (recommending the movies whose ratings were most correlated with the ratings of the movie entered). Its recommendations were a poor fit, in my opinion.
* The final method (KNN) was a collaborative-filtering recommender. Overall, I believe this one had the best and most relevant recommendations - it was also more "machine-learning-esque" than the other two.

Overall, I enjoyed this project - it is neat to see how programming and mathematical methods can be used in "real-life" applications - I have no doubt that streaming services (like Netflix) use a more fancy-pants version of the above to regularly make recommendations for their customers.

To complete this assignment, I consulted a variety of resources. Please see below.

##### References/Resources Used:

* https://www.analyticsvidhya.com/blog/2020/11/create-your-own-movie-movie-recommendation-system/
* https://stackabuse.com/creating-a-simple-recommender-system-in-python-using-pandas/
* https://analyticsindiamag.com/how-to-build-your-first-recommender-system-using-python-movielens-dataset/
* https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-1-knn-item-based-collaborative-filtering-637969614ea