In [2]:
import pandas as pd

### The Data and what we want to do with it:

These datasets were originally from Kaggle. Here is the description they gave:

This dataset contains 27753444 ratings and 1108997 tag applications across 58098 movies. These data sets were created by 283228 users between January 09, 1995 and September 26, 2018. This dataset was generated on September 26, 2018.

Goal: Produce a simple recommendation engine that accepts a movie title and then recommends two similar movies.

### Reading the Data and Initial Stats

In [3]:
movies_df = pd.read_csv('input/all_movies.csv')
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|animation|CHILDREN|comedy|Fantasy
1,2,Jumanji (1995),Adventure|children|FANTASY
2,3,Grumpier Old Men (1995),Comedy|romance
3,4,Waiting to Exhale (1995),Comedy|drama|ROMANCE
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
58093,193876,The Great Glinka (1946),(no genres listed)
58094,193878,Les tribulations d'une caissière (2011),Comedy
58095,193880,Her Name Was Mumu (2016),Drama
58096,193882,Flora (2017),Adventure|drama|HORROR|sci-fi


In [4]:
ratings_df = pd.read_csv('input/ratings.csv')
ratings_df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,307,3.5,1256677221
1,1,481,3.5,1256677456
2,1,1091,1.5,1256677471
3,1,1257,4.5,1256677460
4,1,1449,4.5,1256677264
...,...,...,...,...
27753439,283228,8542,4.5,1379882795
27753440,283228,8712,4.5,1379882751
27753441,283228,34405,4.5,1379882889
27753442,283228,44761,4.5,1354159524


## Step 1: Cleaning
You shouldn't necessarily assume that your data is good. It could be very sparse. There could be duplication, poorly recorded or empty values, or, in large open text fields, a lot of garbage.

Some things we'll do here:
1. Get the number of rows and columns by looking at the shape
2. Determine the number of non-null rows for each given column (This would tell us if a column is especially sparse)
3. Check for duplicate rows

As we find things that need to be cleaned (bad text, duplicates, etc.) we will write tested cleaning functions to clean up our input data.
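
The three checks listed above can be bundled into one small helper. This is a minimal sketch on toy data: `summarize_frame` and `toy_df` are hypothetical names for illustration, not part of the notebook's cleaning module.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Collect the basic health checks for a data frame in one place."""
    return {
        "shape": df.shape,                             # (rows, columns)
        "non_null": df.notnull().sum().to_dict(),      # non-null count per column
        "duplicate_rows": int(df.duplicated().sum()),  # rows identical to an earlier row
    }


# Toy data (hypothetical), not the real movies file.
toy_df = pd.DataFrame(
    {
        "title": ["A (2000)", "A (2000)", "B (2001)"],
        "genres": ["drama", "drama", None],
    }
)
summary = summarize_frame(toy_df)
print(summary)
```

A sparse column shows up as a low `non_null` count relative to the row count in `shape`.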

In [5]:
movies_df.shape

(58098, 3)

In [6]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58098 entries, 0 to 58097
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  58098 non-null  int64 
 1   title    58098 non-null  object
 2   genres   58098 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.3+ MB


#### Checking for Duplicates
You can add assert statements throughout. They make good checks because they can translate directly into your tests or simply live in your scripts.

Here we need to define what a duplicate is. If you don't specify column names, then pandas will only call something a duplicate when all columns match.

For movies, it seems like a "duplicate" would be rows that have exactly matching titles and genres
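
To make the two definitions concrete, here is a small sketch on made-up rows (the ids and the `toy_movies` frame are hypothetical). It relies on `DataFrame.duplicated`, which by default marks every occurrence after the first.

```python
import pandas as pd

toy_movies = pd.DataFrame(
    {
        "movieId": [1, 2, 3, 4],
        "title": ["Home (2008)", "Home (2008)", "Emma (1996)", "Emma (1996)"],
        "genres": ["drama", "drama", "romance", "comedy|romance"],
    }
)

# No subset: every column must match, and movieId never does.
all_columns = int(toy_movies.duplicated().sum())

# Subset: only the named columns are compared.
by_title_and_genres = int(toy_movies.duplicated(["title", "genres"]).sum())
by_title = int(toy_movies.duplicated(["title"]).sum())

print(all_columns, by_title_and_genres, by_title)
```

Because `movieId` is unique, checking all columns finds nothing, which is why the subset has to be chosen deliberately.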

In [7]:
assert (len(movies_df) - len(movies_df.drop_duplicates(['title', 'genres']))) == 0

AssertionError: 

That is expected, but let's also check whether the number of duplicated titles equals the number of duplicated title/genres combos.

In [8]:
movies_df.loc[movies_df.duplicated(['title', 'genres'])].shape

(14, 3)

In [9]:
movies_df.loc[movies_df.duplicated(['title'])].shape

(78, 3)

#### Cleaning the duplicated/bad columns

Notice that the genres column is inconsistent in its capitalization. We probably also want to make sure any leading or trailing spaces are removed. So let's start by cleaning up the genres column.

In [10]:
def lower_case_and_strip_spaces(text: str) -> str:
    # Named `text` rather than `input` to avoid shadowing the built-in.
    return text.lower().strip()

We can start by writing tests in the notebook.  Our tests right now won't need any extra libraries like pytest.  They will simply be a function that runs our cleaning code against an input and asserts that the output is what we expect.

In [11]:
initial: str = "Crime|drama|HORROR"
expected: str = "crime|drama|horror"

initial_2: str = " CRIME|DRAMA|HORROR "
expected_2: str = "crime|drama|horror"

initial_3: str = " CRIME "
expected_3: str = "crime"

initial_4: str = " comedy "
expected_4: str = "comedy"
actual = lower_case_and_strip_spaces(initial)
assert actual == expected
actual = lower_case_and_strip_spaces(initial_2)
assert actual == expected_2
actual = lower_case_and_strip_spaces(initial_3)
assert actual == expected_3
actual = lower_case_and_strip_spaces(initial_4)
assert actual == expected_4
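
The same checks can also be written as a table of cases, which makes it cheap to add another input later. This is only a sketch of one possible layout: `CASES` and `test_lower_case_and_strip_spaces` are hypothetical names, and the cleaning function is repeated so the block runs on its own.

```python
def lower_case_and_strip_spaces(text: str) -> str:
    return text.lower().strip()


# (input, expected) pairs; add a row to add a test case.
CASES = [
    ("Crime|drama|HORROR", "crime|drama|horror"),
    (" CRIME|DRAMA|HORROR ", "crime|drama|horror"),
    (" CRIME ", "crime"),
    (" comedy ", "comedy"),
]


def test_lower_case_and_strip_spaces() -> None:
    for raw, expected in CASES:
        actual = lower_case_and_strip_spaces(raw)
        assert actual == expected, f"{raw!r}: expected {expected!r}, got {actual!r}"


test_lower_case_and_strip_spaces()
```

A function named `test_...` like this can later be moved into a test file unchanged.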

In [12]:
movies_df['genres'] = movies_df['genres'].apply(lower_case_and_strip_spaces)
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),adventure|animation|children|comedy|fantasy
1,2,Jumanji (1995),adventure|children|fantasy
2,3,Grumpier Old Men (1995),comedy|romance
3,4,Waiting to Exhale (1995),comedy|drama|romance
4,5,Father of the Bride Part II (1995),comedy
...,...,...,...
58093,193876,The Great Glinka (1946),(no genres listed)
58094,193878,Les tribulations d'une caissière (2011),comedy
58095,193880,Her Name Was Mumu (2016),drama
58096,193882,Flora (2017),adventure|drama|horror|sci-fi


In [13]:
movies_df = movies_df.loc[movies_df['genres'] != '(no genres listed)']

In [14]:
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),adventure|animation|children|comedy|fantasy
1,2,Jumanji (1995),adventure|children|fantasy
2,3,Grumpier Old Men (1995),comedy|romance
3,4,Waiting to Exhale (1995),comedy|drama|romance
4,5,Father of the Bride Part II (1995),comedy
...,...,...,...
58092,193874,Blondie's Big Moment (1947),comedy
58094,193878,Les tribulations d'une caissière (2011),comedy
58095,193880,Her Name Was Mumu (2016),drama
58096,193882,Flora (2017),adventure|drama|horror|sci-fi


Now we can move our cleaning function to a Python file. From here on out we can test-drive and write other cleaning functions in that file. Then we can import the code to use in our notebook.

In [15]:
duplicated_titles_and_genres = movies_df.loc[movies_df.duplicated(['title', 'genres'])]
duplicated_titles_and_genres

Unnamed: 0,movieId,title,genres
15902,80330,Offside (2006),comedy|drama
20835,101212,"Girl, The (2012)",drama
25046,115777,Beneath (2013),horror
27572,122940,Clear History (2013),comedy
29852,128991,Johnny Express (2014),animation|comedy|sci-fi
30226,130062,Darling (2007),drama
36172,143978,Home (2008),drama
38804,150310,Macbeth (2015),drama
44387,163246,Seven Years Bad Luck (1921),comedy
48620,172427,Little Man (2006),comedy


In [16]:
duplicated_titles = movies_df.loc[movies_df.duplicated(['title'])]
duplicated_titles

Unnamed: 0,movieId,title,genres
9142,26958,Emma (1996),romance
9157,26982,Men with Guns (1997),drama
13309,64997,War of the Worlds (2005),action|sci-fi
13395,65665,Hamlet (2000),drama
13614,67459,Chaos (2005),crime|drama|horror
...,...,...,...
56950,190881,The Boss (2016),documentary
57238,191713,Noise (2007),crime|drama|thriller
57269,191775,Berlin Calling (2008),comedy|drama
57361,192003,Journey to the Center of the Earth (2008),action|adventure|fantasy|sci-fi


The two "duplicate" categories (duplicate based on title, and duplicate based on title & genres) still contain different numbers of rows. So let's continue our analysis and see why that is.

It would be good to see an example of a title that is a duplicate title, but has a different list of genres in each entry.

To do that we can create lists of the title column from both data sets and find the difference between the two lists.

In [17]:
from typing import List

list_of_titles_from_duplicated_titles_and_genres: List = duplicated_titles_and_genres['title'].to_list()
list_of_titles_from_duplicated_titles: List = duplicated_titles['title'].to_list()
duplicated_titles = list(set(list_of_titles_from_duplicated_titles).difference(set(list_of_titles_from_duplicated_titles_and_genres)))
duplicated_titles

['Aladdin (1992)',
 'Paradise (2013)',
 'Truth (2015)',
 'Slow Burn (2000)',
 'Office (2015)',
 'Noise (2007)',
 'Eros (2004)',
 'Chaos (2005)',
 'Veronica (2017)',
 'Saturn 3 (1980)',
 'Classmates (2016)',
 'The Break-In (2016)',
 'Lagaan: Once Upon a Time in India (2001)',
 'Shelter (2015)',
 'Rose (2011)',
 '20,000 Leagues Under the Sea (1997)',
 'Frozen (2010)',
 'The Boss (2016)',
 'Ecstasy (2011)',
 'Weekend (2011)',
 'Aftermath (2012)',
 'Men with Guns (1997)',
 'Clockstoppers (2002)',
 'Free Fall (2014)',
 'Tag (2015)',
 'Hostage (2005)',
 'Sing (2016)',
 'Holiday (2014)',
 'Forsaken (2016)',
 'The Midnight Man (2016)',
 'Cargo (2017)',
 'Hamlet (2000)',
 'War of the Worlds (2005)',
 'The Void (2016)',
 'Gossip (2000)',
 'Absolution (2015)',
 'The Reunion (2011)',
 'Delirium (2014)',
 'The Dream Team (2012)',
 'Stranded (2015)',
 'Journey to the Center of the Earth (2008)',
 'Blackout (2007)',
 'Casanova (2005)',
 'Deranged (2012)',
 'Interrogation (2016)',
 'Confessions of a D

Now we can see which titles are duplicates by title but not duplicates by title & genre.  Let's look at an example:

In [18]:
movies_df.loc[movies_df['title'] == 'Aladdin (1992)']

Unnamed: 0,movieId,title,genres
582,588,Aladdin (1992),adventure|animation|children|comedy|musical
24657,114240,Aladdin (1992),adventure|animation|children|comedy|fantasy


It was possible that a movie title could have been assigned to two different movies that have different genres, but based on our example it looks like this really is the same movie with two slightly different genre lists. So now let's write a couple of cleaning functions in our cleaning module that can combine these lists for us.
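
The body of the real cleaning function is not shown in this notebook, so the following is only a sketch of the core idea under an assumption: that combining means taking the union of the pipe-separated genres while keeping first-seen order. `combine_genre_strings` is a hypothetical helper name.

```python
from typing import List


def combine_genre_strings(genre_strings: List[str]) -> str:
    """Union pipe-separated genre strings, keeping first-seen order."""
    seen: List[str] = []
    for genre_string in genre_strings:
        for genre in genre_string.split("|"):
            if genre not in seen:
                seen.append(genre)
    return "|".join(seen)


# The two Aladdin (1992) genre lists from the example above.
combined = combine_genre_strings(
    [
        "adventure|animation|children|comedy|musical",
        "adventure|animation|children|comedy|fantasy",
    ]
)
print(combined)
```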

In [19]:
from analysis.utils.cleaning import find_duplicates_and_combine

movies_with_combined_genres_df = find_duplicates_and_combine(movies_df, duplicated_titles)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, val, pi)
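
The message above is pandas' `SettingWithCopyWarning`. It typically appears when a column is assigned on a frame that was itself produced by filtering another frame, which is likely the case for `movies_df` after the "(no genres listed)" rows were dropped. One common way to avoid it, sketched here on hypothetical toy data, is to take an explicit copy right after filtering.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"title": ["A (2000)", "B (2001)"], "genres": ["drama", "(no genres listed)"]}
)

# .copy() makes the filtered frame independent of toy_df,
# so later column assignments are unambiguous.
filtered_df = toy_df.loc[toy_df["genres"] != "(no genres listed)"].copy()
filtered_df["genres"] = filtered_df["genres"].str.upper()

print(filtered_df)
```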


The way we have written it, the movies data frame should not lose any rows. Let's assert that this remains true.

In [20]:
assert len(movies_with_combined_genres_df.index) == len(movies_df.index)

### Ratings Data Frame
Same Steps
1. Easily get the number of rows and columns by looking at the shape
2. Determine the number of non-null rows for each given column
3. Check for duplicate rows

In [21]:
ratings_df.shape

(27753444, 4)

In [22]:
ratings_df.notnull().sum()

userId       27753444
movieId      27753444
rating       27753444
timestamp    27753444
dtype: int64

Here we should again define what it means to be a "duplicate":

In [23]:
assert (len(ratings_df) - len(ratings_df.drop_duplicates(subset=['userId', 'movieId', 'rating', 'timestamp']))) == 0

## Step 2: Joining Movies to Ratings
We want to be able to recommend movies to users. What are some things we could do next?

Answer:
- Join movies to ratings
- Group by movie title to find an average rating for each movie

Before we join we need to figure out which direction to join, so we should ask ourselves:

- Why are we joining the two data frames?
- How many rows should the resulting data frame have?
- Do we expect a given column to have any nulls after the join?

Here we will left join movies to their ratings. If we expect all movies to have ratings, then we can start with an assertion that indicates that.
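
One way to answer these questions in code is to let the join itself report which rows matched. The sketch below uses `DataFrame.merge` with `indicator=True` on hypothetical toy frames. The notebook itself uses `join` on an index, which should produce the same rows for a left join.

```python
import pandas as pd

toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame(
    {"movieId": [1, 1, 2], "userId": [10, 11, 10], "rating": [4.0, 5.0, 3.5]}
)

# Left join: keep every movie, matched or not.
joined = toy_movies.merge(toy_ratings, on="movieId", how="left", indicator=True)

# "left_only" marks movies that found no rating.
unrated = joined.loc[joined["_merge"] == "left_only", "title"].tolist()

print(len(joined), unrated)
```

A movie with several ratings contributes one row per rating, and a movie with none contributes a single row with nulls in the rating columns.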

In [24]:
movies_ratings_joined_df = movies_df.set_index('movieId').join(ratings_df.set_index('movieId')).reset_index()

In [25]:
movies_with_ratings = movies_ratings_joined_df.loc[movies_ratings_joined_df['rating'].notna()]

In [26]:
assert len(movies_ratings_joined_df.index) == len(movies_with_ratings.index)

AssertionError: 

Well, our assertion fails, but what does that mean?

In [27]:
movies_missing_ratings = movies_ratings_joined_df.loc[movies_ratings_joined_df['rating'].isna()]

In [28]:
movies_missing_ratings

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
21871974,25817,Break of Hearts (1935),drama|romance,,,
21914110,26361,Baby Blue Marine (1976),drama,,,
21971501,27153,Can't Be Heaven (Forever Together) (2000),children|comedy|drama|romance,,,
21986044,27433,Bark! (2002),comedy|drama,,,
22225201,31945,Always a Bridesmaid (2000),documentary,,,
...,...,...,...,...,...,...
27737891,192399,Under Wraps (1997),children|comedy|horror,,,
27738267,192933,Rosie (2018),drama,,,
27738335,193109,Ach śpij kochanie (2017),crime|thriller,,,
27738455,193321,Pledges (2018),comedy|horror,,,


Joining movies to their ratings means that we have a subset of movies that don't have ratings.  We actually may want to know about these so it's good to know that they exist. For now though, we'll drop the records where the rating is NaN and then assert we get the number of rows we expect.

In [29]:
movies_ratings_joined_df = movies_ratings_joined_df.loc[movies_ratings_joined_df['rating'].notna()]

In [30]:
movies_ratings_joined_df

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),adventure|animation|children|comedy|fantasy,4.0,4.0,1.113766e+09
1,1,Toy Story (1995),adventure|animation|children|comedy|fantasy,10.0,5.0,9.488858e+08
2,1,Toy Story (1995),adventure|animation|children|comedy|fantasy,14.0,4.5,1.442169e+09
3,1,Toy Story (1995),adventure|animation|children|comedy|fantasy,15.0,4.0,1.370810e+09
4,1,Toy Story (1995),adventure|animation|children|comedy|fantasy,22.0,4.0,1.237623e+09
...,...,...,...,...,...,...
27738725,193878,Les tribulations d'une caissière (2011),comedy,176871.0,2.0,1.537875e+09
27738726,193880,Her Name Was Mumu (2016),drama,81710.0,2.0,1.537886e+09
27738727,193882,Flora (2017),adventure|drama|horror|sci-fi,33330.0,2.0,1.537891e+09
27738728,193886,Leal (2018),action|crime|drama,206009.0,2.5,1.537918e+09


Next let's group by title and collect each movie's ratings into a list, so that we can compute an average rating per movie later.

In [31]:
movies_with_all_ratings_df = movies_ratings_joined_df.groupby(['title']).agg({'rating': lambda x : x.to_list(), 'genres': lambda x: x.to_list()[0]}).reset_index()

In [32]:
movies_with_all_ratings_df.sample(20)

Unnamed: 0,title,rating,genres
10026,"Crowd, The (1928)","[3.5, 5.0, 3.0, 3.0, 4.0, 4.5, 5.0, 4.0, 3.5, ...",drama
35605,"Sea Hawk, The (1940)","[4.5, 4.5, 3.5, 0.5, 4.5, 3.5, 3.5, 3.5, 3.0, ...",action|adventure|romance
23052,Krummerne (1991),"[4.0, 4.0]",children|comedy
31175,Parthibhan Kanavu (2003),[3.5],drama|romance
29191,Night Patrol (1984),"[3.0, 5.0, 3.0, 1.0, 3.5, 1.0, 2.0, 2.5, 3.0, ...",comedy
58,08/15 (1954),[4.0],drama|war
33135,Quiz Show (1994),"[5.0, 4.0, 4.0, 3.0, 2.0, 3.0, 4.0, 3.5, 5.0, ...",drama
8,"$1,000 on the Black (1966)","[4.0, 3.0, 3.0]",western
20692,Indecent Proposal (1993),"[3.5, 3.0, 2.0, 3.0, 4.0, 3.0, 3.0, 1.0, 4.0, ...",drama|romance
36944,Sky Of Love (2007),"[5.0, 3.0, 3.5, 5.0, 2.0, 3.5, 3.0, 5.0, 3.5, ...",drama|romance


Anything else we want to check at this point?

Answer: Worth it at this point to check that we still have all the movies that we expect to have.

Remember the movies that had no ratings?  Good thing we grabbed those :) let's use them now.

In [33]:
movies_missing_ratings

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
21871974,25817,Break of Hearts (1935),drama|romance,,,
21914110,26361,Baby Blue Marine (1976),drama,,,
21971501,27153,Can't Be Heaven (Forever Together) (2000),children|comedy|drama|romance,,,
21986044,27433,Bark! (2002),comedy|drama,,,
22225201,31945,Always a Bridesmaid (2000),documentary,,,
...,...,...,...,...,...,...
27737891,192399,Under Wraps (1997),children|comedy|horror,,,
27738267,192933,Rosie (2018),drama,,,
27738335,193109,Ach śpij kochanie (2017),crime|thriller,,,
27738455,193321,Pledges (2018),comedy|horror,,,


What can we assert to verify that we have all the movies we expect to have?

We should check that:

Original Number of Movies = # Movies Missing Ratings + # Movies with Ratings + # Movies with duplicated Titles

Why?

Remember that we kept all rows when we were combining genres above (we didn't create one version of the title with all genres).  We did this because ratings were joined to movies on movie id.  By keeping multiple rows of the movie we were able to capture all of the movie's ratings when we joined to the ratings df.  Then we grouped by movie title - which means we should only have one row for each movie title and it should have all the movie's ratings as a list in the rating column.
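
A toy version of this bookkeeping, on made-up rows, may make the identity easier to see. It counts the extra rows directly as rated rows minus distinct rated titles. Note that the `duplicated_titles` list built earlier leaves out titles whose genres also match, so the two counts need not agree.

```python
# Toy (title, has_rating) rows; hypothetical data.
# "A" appears under two different movie ids.
rows = [("A", True), ("A", True), ("B", True), ("C", False)]

original_rows = len(rows)
unrated_rows = [title for title, has_rating in rows if not has_rating]
rated_rows = [title for title, has_rating in rows if has_rating]
rated_titles = set(rated_rows)

# Extra rows created by repeated titles among the rated movies.
extra_rows = len(rated_rows) - len(rated_titles)

assert original_rows == len(unrated_rows) + len(rated_titles) + extra_rows
print(original_rows, len(unrated_rows), len(rated_titles), extra_rows)
```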

In [34]:
len(movies_df.index)

53832

In [35]:
(len(movies_missing_ratings.index)+len(movies_with_all_ratings_df.index)+len(duplicated_titles))

53823

In [36]:
assert len(movies_df.index) == (len(movies_missing_ratings.index)+len(movies_with_all_ratings_df.index)+len(duplicated_titles))

AssertionError: 

Let's check out Aladdin again:

In [37]:
movies_with_all_ratings_df.loc[movies_with_all_ratings_df['title'] == 'Aladdin (1992)']

Unnamed: 0,title,rating,genres
1900,Aladdin (1992),"[2.5, 4.0, 3.5, 4.0, 5.0, 3.5, 4.0, 4.0, 4.5, ...",adventure|animation|children|comedy|musical|fa...


Now let's assign each movie an average rating.

In [38]:
def average_list_items(ratings_list: List):
    return sum(ratings_list)/len(ratings_list)

In [39]:
movies_with_all_ratings_df['average_rating'] = movies_with_all_ratings_df['rating'].apply(average_list_items)

Let's think about our recommendation engine again. Let's say that we want to recommend the movies with the highest ratings and the most similar genres list.

To find the movies with the "most similar genres list" we can use the nearest neighbors algorithm.  The first step in this would involve transforming the genres list to a vector.
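
As a rough illustration of what "transforming the genres list to a vector" means, the sketch below encodes each genre string as a row of 0s and 1s and compares rows with cosine similarity. The notebook uses TF-IDF weights rather than plain 0/1 values, and `GENRES`, `to_vector`, and `cosine_similarity` are hypothetical names used only here.

```python
import math
from typing import List

GENRES = ["adventure", "animation", "children", "comedy", "fantasy", "romance"]


def to_vector(genre_string: str) -> List[int]:
    """Multi-hot encode a pipe-separated genre string against GENRES."""
    present = set(genre_string.split("|"))
    return [1 if genre in present else 0 for genre in GENRES]


def cosine_similarity(a: List[int], b: List[int]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


toy_story = to_vector("adventure|animation|children|comedy|fantasy")
jumanji = to_vector("adventure|children|fantasy")
grumpier = to_vector("comedy|romance")

print(cosine_similarity(toy_story, jumanji))   # shares three genres
print(cosine_similarity(toy_story, grumpier))  # shares one genre
```

The more genres two movies share relative to how many each has, the closer the score is to 1.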

First let's drop some rows that could mess up our recommendation engine.  You may have noticed with some of the dataframe samples that some movies have the genre "(no genres listed)"  - that sort of feels like a miscellaneous category that would hurt us more than help us. So let's get rid of those movies.

In [40]:
movies_with_ratings_and_genres = movies_with_all_ratings_df.loc[movies_with_all_ratings_df['genres'] != '(no genres listed)']

In [41]:
movies_with_ratings_and_genres.sample(20)

Unnamed: 0,title,rating,genres,average_rating
5494,"Birds, the Bees and the Italians, The (Signore...","[4.0, 3.5]",comedy,3.75
44555,The Work and the Glory III: A House Divided (2...,[3.5],drama,3.5
40295,The Avenging Eagle (1978),"[3.5, 4.0]",action|adventure,3.75
1353,About Last Night... (1986),"[3.0, 4.0, 2.0, 1.0, 3.0, 1.0, 3.0, 3.0, 5.0, ...",comedy|drama|romance,3.0927
48360,Wheels of Fire (1985),[3.5],action|adventure|sci-fi,3.5
17186,Goofy Movies Number One (1933),"[2.5, 1.5]",comedy,2.0
41759,The Good Witch (2008),"[4.0, 4.0, 5.0, 3.5, 4.0, 3.0, 3.0, 3.0, 4.5, ...",children|drama|fantasy,3.642857
47047,VHS Forever?: Psychotronic People (2014),"[2.0, 3.0]",documentary,2.5
31923,"Place for Lovers, A (Amanti) (1968)",[1.0],drama|romance,1.0
22619,"Killer Shrews, The (1959)","[4.0, 1.0, 2.0, 1.0, 5.0, 2.0, 2.0, 2.5, 2.0, ...",horror|sci-fi,2.142857


Now let's drop movies that have 5 or fewer ratings. We probably don't have enough data to know whether those ratings actually reflect how good the movie is.

TODO: We could probably come up with a less arbitrary cutoff than 5.

In [80]:
movies_with_ratings_and_genres['number of ratings'] = movies_with_ratings_and_genres['rating'].apply(lambda x: len(x))

In [81]:
movies_with_many_ratings = movies_with_ratings_and_genres.loc[movies_with_ratings_and_genres['number of ratings'] > 5]

In [82]:
movies_with_many_ratings.loc[movies_with_many_ratings['genres'] == '(no genres listed)']

Unnamed: 0,title,rating,genres,average_rating,number of ratings


In [83]:
movies_with_many_ratings['genres'] = movies_with_many_ratings['genres'].apply(lambda x: x.split('|'))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [84]:
movies_with_many_ratings.sample(20)

Unnamed: 0,title,rating,genres,average_rating,number of ratings
39920,Terms and Conditions May Apply (2013),"[4.5, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, ...",[documentary],3.685484,62
30883,"Package, The (1989)","[4.0, 3.0, 2.5, 3.0, 4.0, 1.0, 4.0, 2.0, 2.0, ...","[action, thriller]",3.260465,215
16012,"Further Gesture, A (1996)","[0.5, 1.0, 4.0, 3.0, 3.0, 3.5, 2.5, 1.0, 3.0, ...",[drama],2.25,10
2516,American Pastoral (2016),"[3.0, 2.0, 3.0, 3.0, 3.0, 3.0, 5.0, 3.0, 3.0, ...",[drama],3.060606,33
23418,Lake Eerie (2016),"[0.5, 0.5, 0.5, 1.5, 2.5, 1.5, 2.5, 3.0, 2.5]","[horror, sci-fi, thriller]",1.666667,9
48071,"Wedding Banquet, The (Xi yan) (1993)","[4.5, 4.0, 4.0, 3.0, 4.0, 4.0, 4.5, 4.0, 3.0, ...","[comedy, drama, romance]",3.923034,903
39549,"Talk of the Town, The (1942)","[4.5, 3.0, 4.0, 3.5, 5.0, 4.0, 3.0, 4.0, 5.0, ...","[comedy, romance, thriller]",3.773109,238
48121,Ween Live in Chicago (2004),"[5.0, 1.0, 3.0, 0.5, 4.0, 5.0, 0.5, 4.5, 5.0, ...",[documentary],2.84375,16
35915,"Serpent and the Rainbow, The (1988)","[3.5, 3.0, 3.5, 0.5, 2.0, 3.5, 4.0, 3.5, 2.5, ...",[horror],3.221847,888
10272,Daawat-e-Ishq (2014),"[2.5, 4.0, 3.0, 3.0, 4.0, 4.0]","[comedy, drama, romance]",3.416667,6


Next let's build the list of distinct genres. It will serve as the vocabulary for the TF-IDF vectorizer below.

In [85]:
# Distinct genres in sorted order; avoids summing every other column in a groupby.
genres_list = sorted(movies_with_many_ratings.explode('genres')['genres'].unique())

In [86]:
len(genres_list)

19

In [87]:
movies_with_many_ratings['genres_as_string'] = movies_with_many_ratings['genres'].apply(lambda x: ' '.join(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [94]:
movies_with_many_ratings = movies_with_many_ratings.reset_index(drop=True)

In [95]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 1),min_df=0, stop_words='english', vocabulary=genres_list)
tfidf_matrix = tf.fit_transform(movies_with_many_ratings['genres_as_string'])

In [96]:
pd.DataFrame(tfidf_matrix.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,1.0,0.000000,0.0,0.0,0.000000,0.0,0.0
1,0.0,0.0,0.000000,0.000000,0.773345,0.000000,0.0,0.633986,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
2,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.292683,0.0,0.0,0.530887,0.0,0.0,0.647573,0.0,0.0,0.461677,0.0,0.0
3,0.0,0.0,0.000000,0.000000,1.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
4,0.0,0.0,0.000000,0.000000,0.500969,0.761814,0.0,0.410693,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28317,0.0,0.0,1.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
28318,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.384096,0.0,0.0,0.696698,0.0,0.0,0.000000,0.0,0.0,0.605873,0.0,0.0
28319,0.0,0.0,0.000000,0.875695,0.482865,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
28320,0.0,0.0,0.704134,0.710067,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0
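
One thing worth checking with this setup is how hyphenated genres are tokenized. `TfidfVectorizer`'s default `token_pattern` is `(?u)\b\w\w+\b`, which splits on hyphens, so a vocabulary entry such as "sci-fi" may never be matched and its column may stay at zero. The sketch below uses only `re` to show the effect. The whitespace pattern is an assumption about what a fix could look like (for example, passed as `token_pattern`), not something the notebook does.

```python
import re

DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's documented default
WHITESPACE_TOKEN_PATTERN = r"[^ ]+"       # hypothetical alternative: split on spaces only

genres_as_string = "action sci-fi thriller"

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, genres_as_string)
whitespace_tokens = re.findall(WHITESPACE_TOKEN_PATTERN, genres_as_string)

print(default_tokens)
print(whitespace_tokens)
```

If the all-zero columns in the matrix above line up with the hyphenated genres, this is a likely cause.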


In [97]:
from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

If n is the number of movies we have at this point, then the cosine_sim array should have dimensions n x n.
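
`linear_kernel` computes plain dot products. It can stand in for cosine similarity here because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and the dot product of two unit-length vectors equals their cosine. A small pure-Python sketch of that fact, using made-up weights and hypothetical helper names:

```python
import math
from typing import List


def l2_normalize(vector: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical unnormalized weights for three movies.
rows = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [3.0, 0.0, 4.0]]
unit_rows = [l2_normalize(row) for row in rows]

# n x n similarity matrix: entry [i][j] compares movie i with movie j.
similarity = [[dot(a, b) for b in unit_rows] for a in unit_rows]

for line in similarity:
    print([round(value, 3) for value in line])
```

The result is symmetric with ones on the diagonal, since every movie is perfectly similar to itself.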

In [98]:
cosine_sim.shape

(28322, 28322)

You can test-drive the next part, because now we'll add logic to grab the top 20 movie titles by index.

We can build this in a separate module and import it here to see some results
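
The body of the recommendation function is not shown in this notebook, so the following is a guess at the logic rather than the actual implementation. It assumes a two-stage ranking: shortlist the 20 most similar movies, then return the 2 with the highest average rating. `top_similar_titles` and all of its parameters are hypothetical, and plain lists stand in for the similarity array and the data frame.

```python
from typing import List, Sequence


def top_similar_titles(
    title: str,
    similarity: Sequence[Sequence[float]],
    titles: List[str],
    average_ratings: List[float],
    candidates: int = 20,
    recommendations: int = 2,
) -> List[str]:
    """Take the most similar movies, then keep the best rated among them."""
    query = titles.index(title)
    # Every other movie, most similar first.
    others = [i for i in range(len(titles)) if i != query]
    others.sort(key=lambda i: similarity[query][i], reverse=True)
    shortlist = others[:candidates]
    # Re-rank the shortlist by average rating.
    shortlist.sort(key=lambda i: average_ratings[i], reverse=True)
    return [titles[i] for i in shortlist[:recommendations]]


titles = ["A", "B", "C", "D"]
average_ratings = [4.0, 3.0, 4.5, 5.0]
similarity = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.5, 0.2],
    [0.8, 0.5, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]

print(top_similar_titles("A", similarity, titles, average_ratings, candidates=2))
```

Keeping this logic in a plain function with list inputs makes it easy to test with a tiny hand-written matrix like the one above.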

In [99]:
from analysis.utils.recommendation import get_similar_movies

similar_movies = get_similar_movies('Toy Story (1995)', cosine_sim, movies_with_many_ratings)
similar_movies

['Antz (1998)', 'Asterix and the Vikings (Astérix et les Vikings) (2006)']