In [1]:
import pandas as pd
from typing import List

from analysis.utils.cleaning import lower_case_and_strip_spaces

### Reading the Data and Initial Stats

In [2]:
movies_df: pd.DataFrame = pd.read_csv('input/all_movies.csv')
movies_df.sample(20)

Unnamed: 0,movieId,title,genres
19809,97238,Making 'Do the Right Thing' (1989),Documentary
3592,3682,Magnum Force (1973),Action|crime|DRAMA|thriller
11164,46790,"Kremlin Letter, The (1970)",Drama|thriller
36888,145610,"Teodora, Slave Empress (1954)",Adventure|drama|ROMANCE
11603,50574,Crawlspace (1986),Horror
52471,180915,Specter of the Rose (1946),Drama|thriller
27799,123451,Time Bomb (2006),Action|drama|THRILLER
31478,132987,Kill 'em All (2012),Action|crime|THRILLER
15470,78490,Othello (1965),Drama|romance
23313,109669,"Traveling Executioner, The (1970)",Comedy|drama|WESTERN


In [3]:
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|animation|CHILDREN|comedy|Fantasy
1,2,Jumanji (1995),Adventure|children|FANTASY
2,3,Grumpier Old Men (1995),Comedy|romance
3,4,Waiting to Exhale (1995),Comedy|drama|ROMANCE
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
58093,193876,The Great Glinka (1946),(no genres listed)
58094,193878,Les tribulations d'une caissière (2011),Comedy
58095,193880,Her Name Was Mumu (2016),Drama
58096,193882,Flora (2017),Adventure|drama|HORROR|sci-fi


## Step 1: Cleaning
Don't assume that your data is good. It could be sparse, with few populated values. There could be duplicates, poorly recorded or empty values, or, in a large open text field, a lot of garbage.

Some things we'll do here:
1. Get the number of rows and columns by looking at the shape
2. Determine the number of non-null rows for each given column (This would tell us if a column is especially sparse)
3. Check for duplicate rows

As we find things that need to be cleaned (bad text, duplicates, etc.), we will write tested cleaning functions to clean up our input data.
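
To illustrate what a tested cleaning function might look like, here is a minimal sketch. The real `lower_case_and_strip_spaces` lives in `analysis.utils.cleaning` and is not shown in this notebook, so the body below is an assumption based on its name, and `lower_case_and_strip_spaces_sketch` is a hypothetical stand-in.

```python
def lower_case_and_strip_spaces_sketch(text: str) -> str:
    """Hypothetical stand-in: lower-case the text and trim surrounding whitespace."""
    return text.strip().lower()


def test_lower_case_and_strip_spaces_sketch() -> None:
    # Mixed-case genre strings like the ones in the raw data become uniform.
    raw = 'Action|crime|DRAMA|thriller'
    assert lower_case_and_strip_spaces_sketch(raw) == 'action|crime|drama|thriller'
    # Surrounding whitespace is removed.
    assert lower_case_and_strip_spaces_sketch('  Comedy ') == 'comedy'


test_lower_case_and_strip_spaces_sketch()
```

Keeping the function small and pure makes it easy to test on its own before applying it to a whole column.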

In [4]:
movies_df.shape

(58098, 3)

In [5]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58098 entries, 0 to 58097
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  58098 non-null  int64 
 1   title    58098 non-null  object
 2   genres   58098 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.3+ MB


#### Cleanup

In [6]:
movies_cleaned_df = movies_df.copy()
movies_cleaned_df['genres'] = movies_cleaned_df['genres'].apply(lower_case_and_strip_spaces)

In [7]:
movies_cleaned_df.sample(20)

Unnamed: 0,movieId,title,genres
48021,171195,Chasing Two Hares (1961),comedy
9680,31247,"Fighting Sullivans, The (Sullivans, The) (1944)",drama
8380,25789,Shanghai Express (1932),adventure|drama|romance
6990,7101,Doc Hollywood (1991),comedy|romance
8194,8877,Second Chorus (1940),comedy|musical|romance
2396,2480,Dry Cleaning (Nettoyage à sec) (1997),drama
21982,105163,Ruins (2013),documentary
5843,5941,Drumline (2002),comedy|drama|musical|romance
7659,8127,"First $20 Million Is Always the Hardest, The (...",comedy
15490,78622,Nekromantik 2 (1991),horror


In [8]:
movies_cleaned_df = movies_cleaned_df.loc[movies_cleaned_df['genres'] != '(no genres listed)']

In [9]:
movies_cleaned_df.shape

(53832, 3)

In [10]:
assert movies_cleaned_df.shape[0] < movies_df.shape[0]

### Checking for Duplicates

#### We need to define what a "duplicate" is.

Get a series with True or False that indicates if the title was duplicated or not.
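
Before running this on the real data, it may help to see what `duplicated` flags. The sketch below uses toy data (not rows from the dataset) to show that, with the default `keep='first'`, the first occurrence is not marked, while `keep=False` marks every copy.

```python
import pandas as pd

# Toy data, not taken from the real dataset.
toy_df = pd.DataFrame({
    'title': ['A (2000)', 'B (2001)', 'A (2000)', 'A (2000)'],
    'genres': ['drama', 'comedy', 'drama', 'horror'],
})

# Default keep='first': the first occurrence is NOT flagged, later ones are.
later_copies = toy_df.duplicated(['title'])
# keep=False flags every row whose title occurs more than once.
all_copies = toy_df.duplicated(['title'], keep=False)
# Passing two columns flags rows only when both values repeat together.
title_and_genre_copies = toy_df.duplicated(['title', 'genres'])

print(later_copies.tolist())            # [False, False, True, True]
print(all_copies.tolist())              # [True, False, True, True]
print(title_and_genre_copies.tolist())  # [False, False, True, False]
```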

In [11]:
movies_cleaned_df.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),adventure|animation|children|comedy|fantasy
1,2,Jumanji (1995),adventure|children|fantasy
2,3,Grumpier Old Men (1995),comedy|romance
3,4,Waiting to Exhale (1995),comedy|drama|romance
4,5,Father of the Bride Part II (1995),comedy


In [12]:
duplicated_by_title = movies_cleaned_df.duplicated(['title'])

In [13]:
duplicated_by_title_df = movies_cleaned_df.loc[duplicated_by_title]
duplicated_by_title_df

Unnamed: 0,movieId,title,genres
9142,26958,Emma (1996),romance
9157,26982,Men with Guns (1997),drama
13309,64997,War of the Worlds (2005),action|sci-fi
13395,65665,Hamlet (2000),drama
13614,67459,Chaos (2005),crime|drama|horror
...,...,...,...
56950,190881,The Boss (2016),documentary
57238,191713,Noise (2007),crime|drama|thriller
57269,191775,Berlin Calling (2008),comedy|drama
57361,192003,Journey to the Center of the Earth (2008),action|adventure|fantasy|sci-fi


Locate all records where the title was duplicated.

Get a series with True or False that indicates if the title and genre were both duplicated.

In [14]:
duplicated_by_title_and_genre = movies_cleaned_df.duplicated(['title', 'genres'])

Locate all records where the title and genre were duplicated.

In [15]:
duplicated_by_title_and_genre_df = movies_cleaned_df.loc[duplicated_by_title_and_genre]
duplicated_by_title_and_genre_df

Unnamed: 0,movieId,title,genres
15902,80330,Offside (2006),comedy|drama
20835,101212,"Girl, The (2012)",drama
25046,115777,Beneath (2013),horror
27572,122940,Clear History (2013),comedy
29852,128991,Johnny Express (2014),animation|comedy|sci-fi
30226,130062,Darling (2007),drama
36172,143978,Home (2008),drama
38804,150310,Macbeth (2015),drama
44387,163246,Seven Years Bad Luck (1921),comedy
48620,172427,Little Man (2006),comedy


We'll need the set of titles from each category.

In [16]:
records_duplicated_by_title_and_genre = set(duplicated_by_title_and_genre_df['title'])
len(records_duplicated_by_title_and_genre)

14

In [17]:
records_duplicated_by_title = set(duplicated_by_title_df['title'])
len(records_duplicated_by_title)

66

In [18]:
records_duplicated_by_title

{'20,000 Leagues Under the Sea (1997)',
 'Absolution (2015)',
 'Aftermath (2012)',
 'Aladdin (1992)',
 'Beneath (2013)',
 'Berlin Calling (2008)',
 'Blackout (2007)',
 'Cargo (2017)',
 'Casanova (2005)',
 'Chaos (2005)',
 'Classmates (2016)',
 'Clear History (2013)',
 'Clockstoppers (2002)',
 'Confessions of a Dangerous Mind (2002)',
 'Darling (2007)',
 'Delirium (2014)',
 'Deranged (2012)',
 'Detour (2017)',
 'Dracula (1931)',
 'Ecstasy (2011)',
 'Eden (2014)',
 'Emma (1996)',
 'Eros (2004)',
 'Forsaken (2016)',
 'Free Fall (2014)',
 'Frozen (2010)',
 'Girl, The (2012)',
 'Good People (2014)',
 'Gossip (2000)',
 'Grace (2014)',
 'Hamlet (2000)',
 'Holiday (2014)',
 'Home (2008)',
 'Hostage (2005)',
 'Interrogation (2016)',
 'Johnny Express (2014)',
 'Journey to the Center of the Earth (2008)',
 'Lagaan: Once Upon a Time in India (2001)',
 'Little Man (2006)',
 'Lucky (2017)',
 'Macbeth (2015)',
 'Men with Guns (1997)',
 'Noise (2007)',
 'Office (2015)',
 'Offside (2006)',
 'Paradise (

We have two sets of titles:
1. All the titles where the title was duplicated. Some of these titles represent records with identical genre strings, and some represent records with different genre strings.
2. All the titles where both the title and the genre string were duplicated.

With these two sets, how can we get ONLY the titles where the title was duplicated but the genre was not?
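
One way to answer this is a set difference. Here is a small sketch on toy sets, using a few titles that appear in the outputs above; the variable names are illustrative only.

```python
# Toy sets using a few titles that appear in the outputs above.
by_title = {'Aladdin (1992)', 'Emma (1996)', 'Macbeth (2015)'}
by_title_and_genre = {'Macbeth (2015)'}

# Set difference keeps titles in the first set that are absent from the second.
# The `-` operator is equivalent to by_title.difference(by_title_and_genre).
title_only = by_title - by_title_and_genre

print(sorted(title_only))  # ['Aladdin (1992)', 'Emma (1996)']
```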

In [19]:
records_duplicated_by_title_only = records_duplicated_by_title.difference(records_duplicated_by_title_and_genre)
records_duplicated_by_title_only

{'20,000 Leagues Under the Sea (1997)',
 'Absolution (2015)',
 'Aftermath (2012)',
 'Aladdin (1992)',
 'Blackout (2007)',
 'Cargo (2017)',
 'Casanova (2005)',
 'Chaos (2005)',
 'Classmates (2016)',
 'Clockstoppers (2002)',
 'Confessions of a Dangerous Mind (2002)',
 'Delirium (2014)',
 'Deranged (2012)',
 'Ecstasy (2011)',
 'Eden (2014)',
 'Emma (1996)',
 'Eros (2004)',
 'Forsaken (2016)',
 'Free Fall (2014)',
 'Frozen (2010)',
 'Good People (2014)',
 'Gossip (2000)',
 'Grace (2014)',
 'Hamlet (2000)',
 'Holiday (2014)',
 'Hostage (2005)',
 'Interrogation (2016)',
 'Journey to the Center of the Earth (2008)',
 'Lagaan: Once Upon a Time in India (2001)',
 'Men with Guns (1997)',
 'Noise (2007)',
 'Office (2015)',
 'Paradise (2013)',
 'Rose (2011)',
 'Saturn 3 (1980)',
 'Shelter (2015)',
 'Sing (2016)',
 'Slow Burn (2000)',
 'Stranded (2015)',
 'Tag (2015)',
 'The Boss (2016)',
 'The Break-In (2016)',
 'The Connection (2014)',
 'The Dream Team (2012)',
 'The Midnight Man (2016)',
 'The P

Now we can locate an example using the titles in our list.

In [20]:
ALADDIN = 'Aladdin (1992)'

In [21]:
def get_aladdin_example(df: pd.DataFrame) -> pd.DataFrame:
    return df.loc[df['title'] == ALADDIN]

In [22]:
get_aladdin_example(movies_cleaned_df)

Unnamed: 0,movieId,title,genres
582,588,Aladdin (1992),adventure|animation|children|comedy|musical
24657,114240,Aladdin (1992),adventure|animation|children|comedy|fantasy


## Step 2: Feature Preparation

What is a feature?

A descriptive attribute that can be used in our algorithms.

Examples:
- If we are trying to predict house prices, square footage could be a feature we use to predict the price.
- In our case, as we try to find movies similar to a given movie, the "feature" we will focus on is the "genres" description.
- We need to "prepare" the column's data in such a way that we can compare one genre description to another and get some measure of similarity.

We can focus on the genres to recommend movies.  Let's prepare our genres list. First we need to group by movie.

In [23]:
movies_grouped_by_title_df = movies_cleaned_df.copy()
movies_grouped_by_title_df = movies_grouped_by_title_df.groupby('title').agg({'genres': lambda x: x.to_list()}).reset_index()

In [24]:
movies_grouped_by_title_df

Unnamed: 0,title,genres
0,"""Great Performances"" Cats (1998)",[musical]
1,#1 Cheerleader Camp (2010),[comedy|drama]
2,#Captured (2017),[horror]
3,#Horror (2015),[drama|horror|mystery|thriller]
4,#SCREAMERS (2016),[horror]
...,...,...
53761,ארבינקא (1967),[comedy|crime|romance]
53762,…And the Fifth Horseman Is Fear (1965),[drama|war]
53763,キサラギ (2007),[comedy|mystery]
53764,チェブラーシカ (2010),[animation|children]


In [25]:
get_aladdin_example(movies_grouped_by_title_df)

Unnamed: 0,title,genres
2023,Aladdin (1992),"[adventure|animation|children|comedy|musical, ..."


Clean up the genres lists.
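
The source of `combine_genres_list` is not shown in this notebook. As a rough idea of what such a helper could do, the sketch below (with the hypothetical name `combine_genres_list_sketch`) splits each pipe-separated string and merges the pieces into one set, which matches the set-style output displayed in the cells that follow.

```python
from typing import List, Set


def combine_genres_list_sketch(genre_strings: List[str]) -> Set[str]:
    """Hypothetical stand-in: split each pipe-separated string and merge into one set."""
    combined: Set[str] = set()
    for genre_string in genre_strings:
        combined.update(genre_string.split('|'))
    return combined


# The two Aladdin (1992) rows shown earlier differ only in musical vs. fantasy.
aladdin_genres = combine_genres_list_sketch([
    'adventure|animation|children|comedy|musical',
    'adventure|animation|children|comedy|fantasy',
])
print(sorted(aladdin_genres))
# ['adventure', 'animation', 'children', 'comedy', 'fantasy', 'musical']
```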

In [26]:
from analysis.utils.cleaning import combine_genres_list

movies_unique_genres_df = movies_grouped_by_title_df.copy()
movies_unique_genres_df['genres'] = movies_unique_genres_df['genres'].apply(combine_genres_list)

In [27]:
movies_unique_genres_df.sample(20)

Unnamed: 0,title,genres
25770,Level Up (2016),{thriller}
38003,Second Wind (Le deuxième souffle) (Second Brea...,"{film-noir, crime, drama}"
14265,Emma (1932),"{romance, drama, comedy}"
31330,No Down Payment (1957),{drama}
33407,Peacock (2010),{thriller}
7678,Burn Notice: The Fall of Sam Axe (2011),"{crime, drama, mystery}"
46016,The Music Lovers (1970),{drama}
19688,Helen of Troy (1956),"{war, romance, action, drama, adventure}"
383,24: Redemption (2008),"{action, drama, thriller, crime, adventure}"
17488,Germany in Autumn (1978),"{documentary, drama}"


In [29]:
get_aladdin_example(movies_unique_genres_df)

Unnamed: 0,title,genres
2023,Aladdin (1992),"{children, musical, comedy, animation, adventu..."


Let's think about our recommendation engine again. Say we want to recommend movies by picking the ones with the most similar genres lists.

In order to use TF-IDF we need a list of all the "words" (genres) used in our corpus. This is easy for us to do. We can make a list of all the genres by:
1. Exploding the genres column so that each row holds a single genre
2. Taking the unique values of the exploded column
3. Transforming the resulting array into a list of genres
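
The notebook cells that follow use `explode` and `unique` for this. A toy-sized sketch of the same idea, on made-up data shaped like `movies_unique_genres_df`:

```python
import pandas as pd

# Toy frame with one set of genres per title (made-up rows).
toy_df = pd.DataFrame({
    'title': ['A (2000)', 'B (2001)'],
    'genres': [{'drama', 'comedy'}, {'drama'}],
})

# explode() gives each genre its own row; unique() drops the repeats.
one_genre_per_row = toy_df['genres'].explode()
# Sets have no fixed order, so sort to get a reproducible list.
toy_genres = sorted(one_genre_per_row.unique())

print(toy_genres)  # ['comedy', 'drama']
```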

Collect Unique List of Genres

In [34]:
for_genres_list_df = movies_unique_genres_df.copy()
for_genres_list_df = for_genres_list_df['genres'].explode().reset_index()

In [35]:
for_genres_list_df.genres.unique()

array(['musical', 'drama', 'comedy', 'horror', 'thriller', 'mystery',
       'documentary', 'crime', 'western', 'animation', 'war', 'action',
       'adventure', 'fantasy', 'romance', 'children', 'sci-fi',
       'film-noir', 'imax'], dtype=object)

In [36]:
all_genres = list(for_genres_list_df.genres.unique())

In [37]:
len(all_genres)

19

Let's turn our genres column into a space-separated string of genres (as if they were words in a document).

In [38]:
movies_with_document_description_df = movies_unique_genres_df.copy()
movies_with_document_description_df['genres'] = movies_with_document_description_df['genres'].apply(lambda x: ' '.join(x))

In [39]:
movies_with_document_description_df

Unnamed: 0,title,genres
0,"""Great Performances"" Cats (1998)",musical
1,#1 Cheerleader Camp (2010),drama comedy
2,#Captured (2017),horror
3,#Horror (2015),horror thriller drama mystery
4,#SCREAMERS (2016),horror
...,...,...
53761,ארבינקא (1967),romance crime comedy
53762,…And the Fifth Horseman Is Fear (1965),war drama
53763,キサラギ (2007),mystery comedy
53764,チェブラーシカ (2010),children animation


In [40]:
get_aladdin_example(movies_with_document_description_df)

Unnamed: 0,title,genres
2023,Aladdin (1992),children musical comedy animation adventure fa...


In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(vocabulary=all_genres)
tfidf_matrix = tf.fit_transform(movies_with_document_description_df['genres'])
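
One thing worth checking here: the default `token_pattern` of `TfidfVectorizer` is `r"(?u)\b\w\w+\b"`, which splits on hyphens, so hyphenated genres such as `sci-fi` and `film-noir` may never match their vocabulary entries. The sketch below only applies that regular expression with the standard `re` module. Whether this affects the results in this notebook is not verified here, and the whitespace-only pattern is one possible alternative, not the notebook's setting.

```python
import re

DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's documented default
WHITESPACE_TOKEN_PATTERN = r"[^ ]+"       # assumption: one possible alternative

document = 'action sci-fi film-noir'

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, document)
whitespace_tokens = re.findall(WHITESPACE_TOKEN_PATTERN, document)

print(default_tokens)     # ['action', 'sci', 'fi', 'film', 'noir']
print(whitespace_tokens)  # ['action', 'sci-fi', 'film-noir']
```

If the hyphenated genres do turn out to be dropped, passing `token_pattern=r"[^ ]+"` to `TfidfVectorizer` would be one way to keep them whole.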

In [42]:
pd.DataFrame(tfidf_matrix.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,1.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
1,0.0,0.630926,0.775843,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
2,0.0,0.000000,0.000000,1.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
3,0.0,0.292446,0.000000,0.530876,0.467441,0.643541,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
4,0.0,0.000000,0.000000,1.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53761,0.0,0.000000,0.442574,0.000000,0.000000,0.000000,0.0,0.670223,0.0,0.000000,0.000000,0.0,0.0,0.0,0.595759,0.000000,0.0,0.0,0.0
53762,0.0,0.379998,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.924987,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
53763,0.0,0.000000,0.487812,0.000000,0.000000,0.872949,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0
53764,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.709951,0.000000,0.0,0.0,0.0,0.000000,0.704251,0.0,0.0,0.0


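Now let's calculate the dot product of the tfidf_matrix with itself in order to get a cosine similarity matrix. TfidfVectorizer L2-normalizes each row by default, so the dot product of two rows is their cosine similarity.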
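
As a hand check of that idea, the sketch below takes the rounded non-zero weights of rows 1 and 53763 from the TF-IDF matrix printed above and computes their dot product and cosine similarity in plain Python. The genre labels assume the columns follow the order of `all_genres`, and the helper names are illustrative only.

```python
import math
from typing import Dict

# Non-zero TF-IDF weights copied (rounded) from the matrix printed above.
row_1: Dict[str, float] = {'drama': 0.630926, 'comedy': 0.775843}
row_53763: Dict[str, float] = {'comedy': 0.487812, 'mystery': 0.872949}


def dot(a: Dict[str, float], b: Dict[str, float]) -> float:
    # Only genres present in both rows contribute to the sum.
    return sum(weight * b[genre] for genre, weight in a.items() if genre in b)


def norm(a: Dict[str, float]) -> float:
    return math.sqrt(sum(weight * weight for weight in a.values()))


dot_product = dot(row_1, row_53763)
cosine = dot_product / (norm(row_1) * norm(row_53763))

print(round(dot_product, 4))  # 0.3785
print(round(cosine, 4))       # 0.3785
```

Both values agree, up to rounding, with the 0.37846524 entry in the `cosine_sim` output, because each row already has length close to 1.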

In [43]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

What do we expect the dimensions of this matrix to be?

In [44]:
cosine_sim

array([[1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 1.        , 0.        , ..., 0.37846524, 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        1.        ],
       ...,
       [0.        , 0.37846524, 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        1.        ]])

Notice that this matrix also has 1's along the diagonal. Why is that?

In [45]:
cosine_sim.shape

(53766, 53766)

You can test-drive the next part, because now we'll add logic to grab the top 30 movie titles by index.

We can build this in a separate module and import it here to see some results
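
The source of `get_similar_movies` is not shown here. As a rough, test-first sketch of the kind of logic such a module could contain, the hypothetical `top_similar_titles` below ranks titles by one row of a similarity matrix. It takes plain lists rather than a DataFrame, which is a simplification and not the real function's signature.

```python
from typing import List, Sequence


def top_similar_titles(title: str,
                       similarity: Sequence[Sequence[float]],
                       titles: List[str],
                       n: int) -> List[str]:
    """Hypothetical sketch: return the n titles most similar to `title`."""
    row = similarity[titles.index(title)]
    # Sort indices by similarity, highest first. sorted() is stable,
    # so titles with equal scores keep their original order.
    ranked = sorted(range(len(titles)), key=lambda i: row[i], reverse=True)
    return [titles[i] for i in ranked[:n]]


# Toy data: 'A' is most similar to itself, then to 'C'.
toy_titles = ['A', 'B', 'C']
toy_similarity = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.0],
    [0.9, 0.0, 1.0],
]

assert top_similar_titles('A', toy_similarity, toy_titles, 2) == ['A', 'C']
```

Note that in this sketch the queried title ranks first, since every movie has similarity 1 with itself; the real output below also includes 'Toy Story (1995)'.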

In [46]:
from analysis.utils.recommendation import get_similar_movies

similar_movies = get_similar_movies('Toy Story (1995)', cosine_sim, movies_with_document_description_df, 30)
similar_movies

['Antz (1998)',
 'Asterix and the Vikings (Astérix et les Vikings) (2006)',
 'Boxtrolls, The (2014)',
 'Brother Bear 2 (2006)',
 'DuckTales: The Movie - Treasure of the Lost Lamp (1990)',
 "Emperor's New Groove, The (2000)",
 'Home (2015)',
 'Moana (2016)',
 'Monsters, Inc. (2001)',
 "Olaf's Frozen Adventure (2017)",
 'Penguin Highway (2018)',
 'Puss in Book: Trapped in an Epic Tale (2017)',
 'Scooby-Doo! Mask of the Blue Falcon (2012)',
 'Shrek the Third (2007)',
 'Space Jam (1996)',
 'Tale of Despereaux, The (2008)',
 'Tangled: Before Ever After (2017)',
 'The Croods 2 (2017)',
 'The Dragon Spell (2016)',
 'The Good Dinosaur (2015)',
 'The Magic Crystal (2011)',
 'Toy Story (1995)',
 'Toy Story 2 (1999)',
 'Toy Story Toons: Hawaiian Vacation (2011)',
 'Toy Story Toons: Small Fry (2011)',
 'Trolls Holiday (2017)',
 'Turbo (2013)',
 'Wild, The (2006)',
 'Inside Out (2015)',
 'Pokémon the Movie: I Choose You! (2017)']

THAT'S IT! :)