In [3]:
import pandas as pd
from typing import List

from analysis.utils.cleaning import lower_case_and_strip_spaces

### Reading the Data and Initial Stats

In [4]:
movies_df: pd.DataFrame = pd.read_csv('input/all_movies.csv')
movies_df.sample(20)

Unnamed: 0,movieId,title,genres
54917,186219,The Phantom of the Opera (1962),Drama|horror|MYSTERY
27573,122942,The Identical (2014),Drama
57252,191741,Golden Slumber (2010),Action|comedy|DRAMA|mystery|Thriller
22237,106032,Chastity Bites (2013),Comedy|horror
23660,110750,Heaven Is for Real (2014),Drama
25966,118330,Blood (2012),Crime|drama|THRILLER
13429,65909,Maria (2003),Drama
27845,123559,Spitfire (1995),(no genres listed)
35615,142663,Babes in Toyland (1997),Animation|children
56822,190539,Odnoklassniki.ru: The Magic Laptop (2013),Comedy|romance


In [5]:
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|animation|CHILDREN|comedy|Fantasy
1,2,Jumanji (1995),Adventure|children|FANTASY
2,3,Grumpier Old Men (1995),Comedy|romance
3,4,Waiting to Exhale (1995),Comedy|drama|ROMANCE
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
58093,193876,The Great Glinka (1946),(no genres listed)
58094,193878,Les tribulations d'une caissière (2011),Comedy
58095,193880,Her Name Was Mumu (2016),Drama
58096,193882,Flora (2017),Adventure|drama|HORROR|sci-fi


In [6]:
ratings_df = pd.read_csv('input/ratings.csv')
ratings_df.sample(20)

Unnamed: 0,userId,movieId,rating,timestamp
24582124,251314,57368,3.5,1291056475
26168009,267254,800,5.0,1083267679
4021611,41414,2324,4.0,974698686
26975036,275449,2111,3.0,916515465
15864147,161996,296,5.0,843835883
5251317,54040,72,5.0,954272955
1758695,17980,2244,3.0,1037354046
21210854,216701,2683,2.0,1147979998
24151213,246915,3253,3.0,1002247212
13558274,138555,127098,3.5,1496220691


## Step 1: Cleaning
Don't assume that your data is clean. It may be very sparse, it may contain duplicates or poorly recorded or empty values, and open text fields in particular can hold a lot of garbage.

Some things we'll do here:
1. Get the number of rows and columns by looking at the shape
2. Determine the number of non-null rows for each given column (This would tell us if a column is especially sparse)
3. Check for duplicate rows

As we find things that need cleaning (bad text, duplicates, etc.), we will write tested cleaning functions to clean up our input data.
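As an illustration of what a "tested cleaning function" can look like, here is a minimal sketch of a helper in the spirit of `lower_case_and_strip_spaces`. The real implementation in `analysis.utils.cleaning` is not shown in this notebook, so the body below is an assumption: it lower-cases the text and trims only the surrounding whitespace, which keeps a value such as `(no genres listed)` intact.

```python
def lower_case_and_strip_spaces_sketch(text: str) -> str:
    """Hypothetical stand-in for the real helper: lower-case and trim the ends."""
    return text.strip().lower()


def test_lower_case_and_strip_spaces_sketch() -> None:
    # Mixed-case genre strings become uniformly lower case.
    assert lower_case_and_strip_spaces_sketch(" Drama|horror|MYSTERY ") == "drama|horror|mystery"
    # Interior spaces are preserved (assumption, see the note above).
    assert lower_case_and_strip_spaces_sketch("(no genres listed)") == "(no genres listed)"


test_lower_case_and_strip_spaces_sketch()
```

Keeping the function tiny and pure makes it easy to cover with a handful of assertions like these.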

In [7]:
movies_df.shape

(58098, 3)

In [8]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58098 entries, 0 to 58097
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  58098 non-null  int64 
 1   title    58098 non-null  object
 2   genres   58098 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.3+ MB


### Checking for Duplicates

#### We need to define what a "duplicate" is.

Get a series with True or False that indicates if the title was duplicated or not.

In [9]:
movies_duplicated_title = movies_df.duplicated(['title'])

Locate all records where the title was duplicated.

In [10]:
movies_duplicated_title_df = movies_df.loc[movies_duplicated_title]

Get a series with True or False that indicates if the title and genre were both duplicated.

In [11]:
movies_duplicated_title_and_genre = movies_df.duplicated(['title', 'genres'])

Locate all records where the title and genre were duplicated.

In [12]:
movies_duplicated_title_and_genre_df = movies_df.loc[movies_duplicated_title_and_genre]
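It helps to be precise about what `duplicated` flags. The sketch below uses a made-up three-row frame (the rows are illustrative, not taken from the dataset) to show that, by default, only the later copies are marked, and that `keep=False` marks every member of a duplicate group.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "title": ["Aladdin (1992)", "Aladdin (1992)", "Emma (1996)"],
    "genres": ["Adventure|Musical", "Adventure|Fantasy", "Comedy"],
})

# Default keep='first': the first "Aladdin" row is NOT flagged, the second is.
later_copies = toy_df.duplicated(["title"])

# keep=False flags every row whose title occurs more than once.
all_copies = toy_df.duplicated(["title"], keep=False)

# With both columns in the subset, title AND genres must match to count.
same_title_and_genres = toy_df.duplicated(["title", "genres"])

print(later_copies.tolist(), all_copies.tolist(), same_title_and_genres.tolist())
```

This is why the frames built from these masks contain one row fewer per title than the number of copies in `movies_df`.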

We'll need a set of titles from each category.

In [13]:
titles_duplicated_title_only = set(movies_duplicated_title_df['title'])
titles_duplicated_title_only

{'20,000 Leagues Under the Sea (1997)',
 'Absolution (2015)',
 'Aftermath (2012)',
 'Aladdin (1992)',
 'Another World (2014)',
 'Apparition (2014)',
 'Ava (2017)',
 'Beneath (2013)',
 'Berlin Calling (2008)',
 'Black Field (2009)',
 'Blackout (2007)',
 'Cargo (2017)',
 'Casanova (2005)',
 'Chaos (2005)',
 'Classmates (2016)',
 'Clear History (2013)',
 'Clockstoppers (2002)',
 'Confessions of a Dangerous Mind (2002)',
 'Contact (1992)',
 'Darling (2007)',
 'Delirium (2014)',
 'Deranged (2012)',
 'Detour (2017)',
 'Dracula (1931)',
 'Ecstasy (2011)',
 'Eden (2014)',
 'Emma (1996)',
 'Eros (2004)',
 'Escalation (1968)',
 'Escape Room (2017)',
 'Family Life (1971)',
 'Forsaken (2016)',
 'Free Fall (2014)',
 'Frozen (2010)',
 'Girl, The (2012)',
 'Good People (2014)',
 'Gossip (2000)',
 'Grace (2014)',
 'Hamlet (2000)',
 'Holiday (2014)',
 'Home (2008)',
 'Hostage (2005)',
 'Inside (2012)',
 'Interrogation (2016)',
 'Johnny Express (2014)',
 'Journey to the Center of the Earth (2008)',
 'La

In [14]:
titles_duplicated_title_and_genre = set(movies_duplicated_title_and_genre_df['title'])
titles_duplicated_title_and_genre

{'Beneath (2013)',
 'Berlin Calling (2008)',
 'Clear History (2013)',
 'Darling (2007)',
 'Detour (2017)',
 'Dracula (1931)',
 'Girl, The (2012)',
 'Home (2008)',
 'Johnny Express (2014)',
 'Little Man (2006)',
 'Lucky (2017)',
 'Macbeth (2015)',
 'Offside (2006)',
 'Seven Years Bad Luck (1921)'}

We have two sets of titles:
1. All the titles that appear more than once. Some of these titles belong to records with identical genre strings and some belong to records with different genre strings.
2. All the titles that appear more than once with the same genre string.

With these two sets, how can we get a set of ONLY the titles where the title was duplicated but the genre string was not?

In [15]:
movies_duplicated_by_title_not_genre = titles_duplicated_title_only.difference(titles_duplicated_title_and_genre)
movies_duplicated_by_title_not_genre

{'20,000 Leagues Under the Sea (1997)',
 'Absolution (2015)',
 'Aftermath (2012)',
 'Aladdin (1992)',
 'Another World (2014)',
 'Apparition (2014)',
 'Ava (2017)',
 'Black Field (2009)',
 'Blackout (2007)',
 'Cargo (2017)',
 'Casanova (2005)',
 'Chaos (2005)',
 'Classmates (2016)',
 'Clockstoppers (2002)',
 'Confessions of a Dangerous Mind (2002)',
 'Contact (1992)',
 'Delirium (2014)',
 'Deranged (2012)',
 'Ecstasy (2011)',
 'Eden (2014)',
 'Emma (1996)',
 'Eros (2004)',
 'Escalation (1968)',
 'Escape Room (2017)',
 'Family Life (1971)',
 'Forsaken (2016)',
 'Free Fall (2014)',
 'Frozen (2010)',
 'Good People (2014)',
 'Gossip (2000)',
 'Grace (2014)',
 'Hamlet (2000)',
 'Holiday (2014)',
 'Hostage (2005)',
 'Inside (2012)',
 'Interrogation (2016)',
 'Journey to the Center of the Earth (2008)',
 'Lagaan: Once Upon a Time in India (2001)',
 'Let There Be Light (2017)',
 'Men with Guns (1997)',
 'Noise (2007)',
 'Office (2015)',
 'Paradise (2013)',
 'Rose (2011)',
 'Saturn 3 (1980)',
 '

In [16]:
movies_df.loc[movies_df['title'] == 'Aladdin (1992)']

Unnamed: 0,movieId,title,genres
582,588,Aladdin (1992),Adventure|animation|CHILDREN|comedy|Musical
24657,114240,Aladdin (1992),Adventure|animation|CHILDREN|comedy|Fantasy


#### Let's do some cleanup.

In [47]:
movies_df.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),adventure|animation|children|comedy|fantasy
1,2,Jumanji (1995),adventure|children|fantasy
2,3,Grumpier Old Men (1995),comedy|romance
3,4,Waiting to Exhale (1995),comedy|drama|romance
4,5,Father of the Bride Part II (1995),comedy


In [18]:
movies_df['genres'] =  movies_df['genres'].apply(lower_case_and_strip_spaces)

In [19]:
movies_df = movies_df.loc[movies_df['genres'] != '(no genres listed)']
movies_df.shape

(53832, 3)

## Step 2: Feature Preparation

We can focus on the genres to recommend movies.  Let's prepare our genres list. First we need to group by movie.

In [29]:
movies_grouped_by_title = movies_df.groupby('title').agg({'genres': lambda x: list(x.unique())}).reset_index()

In [30]:
movies_grouped_by_title.loc[movies_grouped_by_title['title'] == 'Aladdin (1992)']

Unnamed: 0,title,rating,genres
1900,Aladdin (1992),"[2.5, 4.0, 3.5, 4.0, 5.0, 3.5, 4.0, 4.0, 4.5, ...","[adventure|animation|children|comedy|musical, ..."


Clean up the genres lists.

In [31]:
from analysis.utils.cleaning import combine_genres_list

movies_grouped_by_title['genres'] = movies_grouped_by_title['genres'].apply(combine_genres_list)
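The source of `combine_genres_list` is not part of this notebook. As a rough sketch of what such a helper might do (the name `combine_genres_list_sketch` and its body are assumptions), the function below merges several pipe-separated genre strings into one, keeping each genre once in first-seen order.

```python
from typing import List


def combine_genres_list_sketch(genre_strings: List[str]) -> str:
    """Hypothetical helper: merge pipe-separated genre strings without repeats."""
    seen: List[str] = []
    for genre_string in genre_strings:
        for genre in genre_string.split("|"):
            # Skip empty fragments and genres we have already collected.
            if genre and genre not in seen:
                seen.append(genre)
    return "|".join(seen)


combined = combine_genres_list_sketch([
    "adventure|animation|children|comedy|musical",
    "adventure|animation|children|comedy|fantasy",
])
print(combined)
```

With only a handful of genres per title, the list-based membership test is fast enough and preserves the order in which genres first appear.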

In [32]:
movies_grouped_by_title

Unnamed: 0,title,rating,genres
0,"""Great Performances"" Cats (1998)","[3.0, 5.0, 4.0, 1.0, 0.5, 1.5, 0.5, 3.5, 5.0, ...",musical
1,#1 Cheerleader Camp (2010),"[3.0, 1.0, 1.5, 3.0, 4.0, 2.0, 4.0, 5.0, 1.5]",comedy|drama
2,#Captured (2017),[2.5],horror
3,#Horror (2015),"[1.0, 1.0, 3.0, 3.5, 0.5, 2.5, 1.5, 2.0, 3.0, ...",drama|horror|mystery|thriller
4,#SCREAMERS (2016),[2.5],horror
...,...,...,...
50091,ארבינקא (1967),[3.5],comedy|crime|romance
50092,…And the Fifth Horseman Is Fear (1965),"[3.5, 3.0]",drama|war
50093,キサラギ (2007),"[3.5, 4.5, 3.0]",comedy|mystery
50094,チェブラーシカ (2010),"[5.0, 4.0, 2.5, 3.5, 3.5, 4.5, 3.0, 3.0, 2.0, ...",animation|children


In [33]:
movies_grouped_by_title.loc[movies_grouped_by_title['title'] == 'Aladdin (1992)']

Unnamed: 0,title,rating,genres
1900,Aladdin (1992),"[2.5, 4.0, 3.5, 4.0, 5.0, 3.5, 4.0, 4.0, 4.5, ...",adventure|animation|children|comedy|musical|fa...


Now let's think about our recommendation engine again. Let's say that we want to recommend the movies whose genre lists are most similar to that of a given movie.

In order to use TF-IDF we need a list of all the "words" (genres) used in our corpus. This is easy for us to do. We can make a list of all the genres by:
1. Creating a column with a list of genres
2. Grouping by the genre
3. Aggregating the results
4. Transforming the resulting series into a list of genres

Column with list of genres

In [34]:
movies_grouped_by_title['genres_list'] = movies_grouped_by_title['genres'].apply(lambda x: x.split('|'))

Group by Genre

In [35]:
movies_grouped_by_genre = movies_grouped_by_title.drop(columns=['rating']).explode('genres_list').groupby('genres_list')

Aggregate the results

In [36]:
# Only the group keys are needed, so count rows per genre instead of summing every column.
all_genres = movies_grouped_by_genre.size().reset_index()['genres_list']

Create a list of genres

In [37]:
all_genres_list = list(all_genres)
all_genres_list

['action',
 'adventure',
 'animation',
 'children',
 'comedy',
 'crime',
 'documentary',
 'drama',
 'fantasy',
 'film-noir',
 'horror',
 'imax',
 'musical',
 'mystery',
 'romance',
 'sci-fi',
 'thriller',
 'war',
 'western']
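The same vocabulary can also be collected in a single expression. The sketch below runs on a made-up three-row frame and is only an alternative to the four steps above, not a replacement for them: it splits each genre string, flattens the lists with `explode`, and sorts the unique values.

```python
import pandas as pd

toy_titles = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["comedy|drama", "horror", "drama|sci-fi"],
})

# Split on the pipe, flatten to one genre per row, then deduplicate and sort.
all_genres_sketch = sorted(
    toy_titles["genres"].str.split("|").explode().unique()
)
print(all_genres_sketch)
```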

Let's turn our genres column into a space separated list of genres (as if they were words in a document)

In [38]:
# Work on a copy so that movies_grouped_by_title keeps its pipe-separated genres.
movies_with_genres_as_space_separated_string = movies_grouped_by_title.copy()
movies_with_genres_as_space_separated_string['genres'] = movies_with_genres_as_space_separated_string['genres'].str.replace('|', ' ', regex=False)
movies_with_genres_as_space_separated_string.sample(5)

Unnamed: 0,title,rating,genres,genres_list
45496,Tom Thumb (1958),"[3.0, 0.5, 3.5, 3.0, 3.0, 3.0, 3.0, 1.0, 3.0, ...",children fantasy musical,"[children, fantasy, musical]"
30452,"Open Road, The (2009)","[3.5, 2.5, 4.5, 3.5, 2.0, 3.5, 1.0, 4.0, 4.0, ...",comedy drama,"[comedy, drama]"
15665,Free Fall (2014),"[3.0, 0.5, 3.5, 4.0, 3.0, 3.5, 4.0]",drama,[drama]
44993,Three Faces East (1930),[3.5],drama mystery war,"[drama, mystery, war]"
44477,The Wicked Gift (2017),[4.0],horror thriller,"[horror, thriller]"


Reset our index again

In [39]:
movies_ready_for_vectorizer = movies_with_genres_as_space_separated_string.reset_index(drop=True)

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer
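# token_pattern=r'\S+' keeps hyphenated genres such as 'sci-fi' and 'film-noir' whole;
# the default pattern would split them, and the pieces would never match the fixed vocabulary.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), token_pattern=r'\S+', vocabulary=all_genres_list)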
tfidf_matrix = tf.fit_transform(movies_ready_for_vectorizer['genres'])
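One thing worth checking when a fixed vocabulary contains hyphenated genres such as `sci-fi` and `film-noir` is how the text gets tokenized. scikit-learn's documented default `token_pattern` keeps only runs of two or more word characters, so hyphenated genres are broken into pieces that no longer match the vocabulary. The sketch below reproduces that tokenization with the standard `re` module and compares it with a whitespace-based pattern (`\S+`, one possible choice).

```python
import re

# scikit-learn's documented default token_pattern for its text vectorizers.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Assumption for this sketch: one token per space-separated genre.
WHITESPACE_TOKEN_PATTERN = r"\S+"

document = "drama film-noir sci-fi"

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, document)
whitespace_tokens = re.findall(WHITESPACE_TOKEN_PATTERN, document)

print(default_tokens)     # hyphenated genres are split at the hyphen
print(whitespace_tokens)  # each genre survives as a single token
```

A quick sanity check on the real matrix is to confirm that the `sci-fi` and `film-noir` columns are not all zeros.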

To see what our matrix looks like we can do the following:

In [41]:
pd.DataFrame(tfidf_matrix.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,1.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0
1,0.0,0.0,0.000000,0.000000,0.775664,0.000000,0.0,0.631147,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0
2,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.0,1.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0
3,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.292225,0.0,0.0,0.529955,0.0,0.0,0.645119,0.000000,0.0,0.466448,0.000000,0.0
4,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.0,1.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50091,0.0,0.0,0.000000,0.000000,0.440955,0.671256,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.595797,0.0,0.000000,0.000000,0.0
50092,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.378841,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.925462,0.0
50093,0.0,0.0,0.000000,0.000000,0.486407,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.873732,0.000000,0.0,0.000000,0.000000,0.0
50094,0.0,0.0,0.707986,0.706226,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0
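To make the numbers in this matrix less mysterious, here is a hand-rolled sketch of the weighting that scikit-learn documents for `TfidfVectorizer` with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The toy corpus and the helper name `tfidf_rows` are made up for illustration.

```python
import math
from typing import Dict, List


def tfidf_rows(documents: List[List[str]], vocabulary: List[str]) -> List[List[float]]:
    """Hypothetical helper: smoothed TF-IDF with L2-normalized rows."""
    n_documents = len(documents)
    # Document frequency: in how many documents does each term occur?
    document_frequency: Dict[str, int] = {
        term: sum(term in doc for doc in documents) for term in vocabulary
    }
    idf = {
        term: math.log((1 + n_documents) / (1 + document_frequency[term])) + 1
        for term in vocabulary
    }
    rows = []
    for doc in documents:
        weights = [doc.count(term) * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return rows


toy_documents = [["musical"], ["comedy", "drama"], ["drama", "horror"]]
toy_vocabulary = ["comedy", "drama", "horror", "musical"]
toy_matrix = tfidf_rows(toy_documents, toy_vocabulary)
for row in toy_matrix:
    print([round(value, 3) for value in row])
```

As in the output above, a title with a single genre gets a weight of 1.0 for it, and within a row the rarer genre receives the larger weight.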


Now let's calculate the dot product of tfidf_matrix with itself in order to get a cosine similarity matrix.

In [43]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
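The link between the dot product and cosine similarity can be checked on a few small vectors. The sketch below (toy vectors, not rows from the real matrix) shows that once vectors are scaled to unit length, a plain dot product gives the same value as the cosine formula. `TfidfVectorizer` normalizes rows to unit L2 length by default, which is why the two are interchangeable here.

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([3.0, 4.0, 0.0])
b = np.array([4.0, 3.0, 0.0])
c = np.array([0.0, 0.0, 5.0])

# Scale each vector to unit length, as TF-IDF rows are by default.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

print(cosine(a, b))        # similarity from the full formula
print(float(a_unit @ b_unit))  # same value from a plain dot product
print(cosine(a, c))        # vectors with no shared component
```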

What do we expect the dimensions of this matrix to be?

In [44]:
cosine_sim

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.37728826, 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        1.        ],
       ...,
       [0.        , 0.37728826, 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        1.        ]])

Notice that this matrix also has 1s along the diagonal. Why is that?

In [45]:
cosine_sim.shape

(50096, 50096)
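A square matrix of this size is expensive to hold in memory. The back-of-the-envelope calculation below assumes the matrix is dense and stored as 64-bit floats (8 bytes per entry).

```python
n_titles = 50096
bytes_per_float64 = 8  # assumption: dense float64 storage

dense_bytes = n_titles * n_titles * bytes_per_float64
dense_gib = dense_bytes / 2**30

print(f"{dense_bytes:,} bytes, about {dense_gib:.1f} GiB")
```

If that is more than the machine can spare, one option is to compute similarities one row (or one batch of rows) at a time instead of all at once.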

The next part is a good candidate for test-driving, because now we'll add logic to grab the top 20 movie titles by index.

We can build this in a separate module and import it here to see some results.
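The module itself is not shown in this notebook, so the following is only a sketch of how such a lookup might work. The function name `get_similar_movies_sketch`, its `top_n` parameter, and the toy data are all assumptions. It expects a frame with a `title` column whose row order matches the rows of the similarity matrix.

```python
from typing import List

import numpy as np
import pandas as pd


def get_similar_movies_sketch(
    title: str, similarity: np.ndarray, movies: pd.DataFrame, top_n: int = 20
) -> List[str]:
    """Hypothetical lookup: titles most similar to `title`, best first."""
    positions = np.flatnonzero((movies["title"] == title).to_numpy())
    if positions.size == 0:
        raise KeyError(f"Unknown title: {title}")
    row = int(positions[0])
    # Sort by descending similarity; a stable sort keeps row order for ties.
    ranked = np.argsort(-similarity[row], kind="stable")
    # Drop the movie itself, then keep the best top_n.
    ranked = ranked[ranked != row][:top_n]
    return movies["title"].iloc[ranked].tolist()


toy_movies = pd.DataFrame({"title": ["A (2000)", "B (2001)", "C (2002)"]})
toy_similarity = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
])
toy_result = get_similar_movies_sketch("A (2000)", toy_similarity, toy_movies, top_n=2)
print(toy_result)
```

A function shaped like this is easy to test-drive, because a tiny hand-written similarity matrix fully determines the expected ranking.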

In [46]:
from analysis.utils.recommendation import get_similar_movies

# Pass the frame whose rows line up with cosine_sim (the one fed to the vectorizer).
similar_movies = get_similar_movies('Toy Story (1995)', cosine_sim, movies_ready_for_vectorizer)
similar_movies

THAT'S IT! :)