In [18]:
import pandas as pd

### The Data and what we want to do with it:

These datasets were originally from Kaggle. Here is the description they gave:

This dataset contains 27753444 ratings and 1108997 tag applications across 58098 movies. These data sets were created by 283228 users between January 09, 1995 and September 26, 2018. This dataset was generated on September 26, 2018.

Goal: Produce a simple recommendation engine that accepts a movie title and then recommends two similar movies.

### Reading the Data and Initial Stats

In [19]:
movies_df = pd.read_csv('input/all_movies.csv')
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|animation|CHILDREN|comedy|Fantasy
1,2,Jumanji (1995),Adventure|children|FANTASY
2,3,Grumpier Old Men (1995),Comedy|romance
3,4,Waiting to Exhale (1995),Comedy|drama|ROMANCE
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
58093,193876,The Great Glinka (1946),(no genres listed)
58094,193878,Les tribulations d'une caissière (2011),Comedy
58095,193880,Her Name Was Mumu (2016),Drama
58096,193882,Flora (2017),Adventure|drama|HORROR|sci-fi


In [20]:
ratings_df = pd.read_csv('input/ratings.csv')
ratings_df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,307,3.5,1256677221
1,1,481,3.5,1256677456
2,1,1091,1.5,1256677471
3,1,1257,4.5,1256677460
4,1,1449,4.5,1256677264
...,...,...,...,...
27753439,283228,8542,4.5,1379882795
27753440,283228,8712,4.5,1379882751
27753441,283228,34405,4.5,1379882889
27753442,283228,44761,4.5,1354159524


## Step 1: Cleaning
You shouldn't assume that your data is good. It could be very sparse. There could be duplicated rows or poorly recorded or empty values, and a column that came from an open text field could contain a lot of garbage.

Some things we'll do here:
1. Get the number of rows and columns by looking at the shape
2. Determine the number of non-null rows for each given column (This would tell us if a column is especially sparse)
3. Check for duplicate rows

As we find things that need to be cleaned (bad text, duplicates, etc.) we will write tested cleaning functions to clean up our input data.

In [21]:
movies_df.shape

(58098, 3)

In [22]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58098 entries, 0 to 58097
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  58098 non-null  int64 
 1   title    58098 non-null  object
 2   genres   58098 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.3+ MB


#### Checking for Duplicates
You can add assert statements throughout. They make good checks because they can translate directly into your tests, or simply live in your scripts.

Here we need to define what a duplicate is. If you don't specify column names then pandas will only call something a duplicate when all columns match.

For movies, it seems like a "duplicate" would be rows that have exactly matching titles and genres

In [23]:
assert (len(movies_df) - len(movies_df.drop_duplicates(['title', 'genres']))) == 0

AssertionError: 

The failure is expected: some rows share both a title and a genres list. Let's also check whether the number of duplicated titles is equal to the number of duplicated title/genres combinations.

In [24]:
movies_df.loc[movies_df.duplicated(['title', 'genres'])].shape

(14, 3)

In [25]:
movies_df.loc[movies_df.duplicated(['title'])].shape

(78, 3)
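
To make the two notions of "duplicate" concrete, here is a minimal sketch on toy data. The rows and `movieId` values are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Emma (1996)", "Emma (1996)", "Home (2008)", "Home (2008)"],
    "genres": ["romance", "comedy|romance", "drama", "drama"],
})

# No subset: every column must match, and movieId never does here
all_columns_dupes = toy_movies.duplicated().sum()

# Title only: the second "Emma" row and the second "Home" row are flagged
title_dupes = toy_movies.duplicated(["title"]).sum()

# Title and genres: only the second "Home" row is flagged
title_genre_dupes = toy_movies.duplicated(["title", "genres"]).sum()

print(all_columns_dupes, title_dupes, title_genre_dupes)
```

`duplicated` marks every occurrence after the first by default, so each count is the number of extra rows, not the number of distinct repeated values.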

#### Cleaning the duplicated/bad columns

Notice that the genres column is inconsistent in its capitalization. We probably also want to make sure any leading or trailing spaces are removed. So let's start by cleaning up the genres column.

In [26]:
def lower_case_and_strip_spaces(text: str) -> str:
    # `text` rather than `input`, so the built-in input() is not shadowed
    return text.lower().strip()

We can start by writing tests in the notebook. Our tests right now won't need any extra libraries like pytest. They will simply run our cleaning code against an input and assert that the output is what we expect.

In [27]:
initial: str = "Crime|drama|HORROR"
expected: str = "crime|drama|horror"

initial_2: str = " CRIME|DRAMA|HORROR "
expected_2: str = "crime|drama|horror"

initial_3: str = " CRIME "
expected_3: str = "crime"

initial_4: str = " comedy "
expected_4: str = "comedy"
actual = lower_case_and_strip_spaces(initial)
assert actual == expected
actual = lower_case_and_strip_spaces(initial_2)
assert actual == expected_2
actual = lower_case_and_strip_spaces(initial_3)
assert actual == expected_3
actual = lower_case_and_strip_spaces(initial_4)
assert actual == expected_4
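
The same checks could also be wrapped in a single test function, which makes them easy to move into a test file later. This is only a sketch: the function name `test_lower_case_and_strip_spaces` is an assumption, and the cleaning function is repeated so the block runs on its own.

```python
def lower_case_and_strip_spaces(text: str) -> str:
    # Repeated from the notebook so this block is self-contained
    return text.lower().strip()


def test_lower_case_and_strip_spaces() -> None:
    # (input, expected output) pairs, same cases as the cell above
    cases = [
        ("Crime|drama|HORROR", "crime|drama|horror"),
        (" CRIME|DRAMA|HORROR ", "crime|drama|horror"),
        (" CRIME ", "crime"),
        (" comedy ", "comedy"),
    ]
    for initial, expected in cases:
        actual = lower_case_and_strip_spaces(initial)
        assert actual == expected, f"{initial!r} gave {actual!r}, expected {expected!r}"


# Calling the function runs every case; it raises AssertionError on a mismatch
test_lower_case_and_strip_spaces()
```

A runner such as pytest would pick up a function with this naming pattern without changes, though nothing here depends on it.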

In [28]:
movies_df['genres'] = movies_df['genres'].apply(lower_case_and_strip_spaces)
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),adventure|animation|children|comedy|fantasy
1,2,Jumanji (1995),adventure|children|fantasy
2,3,Grumpier Old Men (1995),comedy|romance
3,4,Waiting to Exhale (1995),comedy|drama|romance
4,5,Father of the Bride Part II (1995),comedy
...,...,...,...
58093,193876,The Great Glinka (1946),(no genres listed)
58094,193878,Les tribulations d'une caissière (2011),comedy
58095,193880,Her Name Was Mumu (2016),drama
58096,193882,Flora (2017),adventure|drama|horror|sci-fi


Now we can move our cleaning function to a Python file. From here on out we can test-drive other cleaning functions in that file and import the code to use in our notebook.

In [29]:
duplicated_titles_and_genres = movies_df.loc[movies_df.duplicated(['title', 'genres'])]
duplicated_titles_and_genres

Unnamed: 0,movieId,title,genres
15902,80330,Offside (2006),comedy|drama
20835,101212,"Girl, The (2012)",drama
25046,115777,Beneath (2013),horror
27572,122940,Clear History (2013),comedy
29852,128991,Johnny Express (2014),animation|comedy|sci-fi
30226,130062,Darling (2007),drama
36172,143978,Home (2008),drama
38804,150310,Macbeth (2015),drama
44387,163246,Seven Years Bad Luck (1921),comedy
48620,172427,Little Man (2006),comedy


In [30]:
duplicated_titles = movies_df.loc[movies_df.duplicated(['title'])]
duplicated_titles

Unnamed: 0,movieId,title,genres
9142,26958,Emma (1996),romance
9157,26982,Men with Guns (1997),drama
13309,64997,War of the Worlds (2005),action|sci-fi
13395,65665,Hamlet (2000),drama
13614,67459,Chaos (2005),crime|drama|horror
...,...,...,...
57269,191775,Berlin Calling (2008),comedy|drama
57305,191867,Let There Be Light (2017),documentary
57361,192003,Journey to the Center of the Earth (2008),action|adventure|fantasy|sci-fi
57463,192243,Contact (1992),drama|horror|mystery|thriller


We still have the same number of rows in each "duplicate" category (duplicate based on title and duplicate based on title & genre).  So let's continue our analysis and see why that is.

It would be good to see an example of a title that is a duplicate title, but has a different list of genres in each entry.

To do that we can create lists of the title column from both data sets and find the difference between the two lists.

In [33]:
from typing import List

list_of_titles_from_duplicated_titles_and_genres: List = duplicated_titles_and_genres['title'].to_list()
list_of_titles_from_duplicated_titles: List = duplicated_titles['title'].to_list()
duplicated_titles = list(set(list_of_titles_from_duplicated_titles).difference(set(list_of_titles_from_duplicated_titles_and_genres)))
duplicated_titles

['Ecstasy (2011)',
 'The Promise (2016)',
 'The Midnight Man (2016)',
 'Good People (2014)',
 'Casanova (2005)',
 'The Void (2016)',
 'Office (2015)',
 'Slow Burn (2000)',
 'Family Life (1971)',
 'Saturn 3 (1980)',
 'Clockstoppers (2002)',
 'Eros (2004)',
 'Cargo (2017)',
 'Let There Be Light (2017)',
 'Tag (2015)',
 'Hamlet (2000)',
 'Inside (2012)',
 'Absolution (2015)',
 'Interrogation (2016)',
 'Rose (2011)',
 'The Connection (2014)',
 'Aladdin (1992)',
 'Sing (2016)',
 'Truth (2015)',
 'The Dream Team (2012)',
 'Ava (2017)',
 'Blackout (2007)',
 'Another World (2014)',
 'The Break-In (2016)',
 'Grace (2014)',
 'Free Fall (2014)',
 'Frozen (2010)',
 'Delirium (2014)',
 'Confessions of a Dangerous Mind (2002)',
 'Escape Room (2017)',
 'Deranged (2012)',
 '20,000 Leagues Under the Sea (1997)',
 'The Tunnel (1933)',
 'Classmates (2016)',
 'Forsaken (2016)',
 'Paradise (2013)',
 'The Forest (2016)',
 'Weekend (2011)',
 'Men with Guns (1997)',
 'Chaos (2005)',
 'Black Field (2009)',
 'S

Now we can see which titles are duplicates by title but not duplicates by title & genre.  Let's look at an example:

In [32]:
movies_df.loc[movies_df['title'] == 'Aladdin (1992)']

Unnamed: 0,movieId,title,genres
582,588,Aladdin (1992),adventure|animation|children|comedy|musical
24657,114240,Aladdin (1992),adventure|animation|children|comedy|fantasy


It was possible that a movie title could have been assigned to two different movies that have different genres, but based on our example it looks like this really is the same movie with two slightly different genre lists. So now let's write a couple of cleaning functions in our cleaning module that can combine these lists for us.

In [34]:
from analysis.utils.cleaning import find_duplicates_and_combine

movies_with_combined_genres_df = find_duplicates_and_combine(movies_df, duplicated_titles)

Before we move on: the way we have written this function, the movies data frame should not lose any rows. Let's assert that this remains true.

In [35]:
assert len(movies_with_combined_genres_df.index) == len(movies_df.index)
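
The source of `find_duplicates_and_combine` is not shown in this notebook. As a rough sketch of one way such a function might work, the hypothetical `combine_genres_for_titles` below gives every row of a duplicated title the union of that title's genres and keeps the row count unchanged. The name and implementation are assumptions, not the module's actual code.

```python
from typing import List

import pandas as pd


def combine_genres_for_titles(df: pd.DataFrame, titles: List[str]) -> pd.DataFrame:
    """Hypothetical stand-in: union the genres of each listed title across its rows."""
    result = df.copy()
    for title in titles:
        mask = result["title"] == title
        merged: List[str] = []
        # Collect genres in first-seen order, skipping repeats
        for genres in result.loc[mask, "genres"]:
            for genre in genres.split("|"):
                if genre not in merged:
                    merged.append(genre)
        # Write the combined list back to every row of this title
        result.loc[mask, "genres"] = "|".join(merged)
    return result


toy = pd.DataFrame({
    "movieId": [588, 114240, 1],
    "title": ["Aladdin (1992)", "Aladdin (1992)", "Toy Story (1995)"],
    "genres": [
        "adventure|animation|children|comedy|musical",
        "adventure|animation|children|comedy|fantasy",
        "adventure|animation|children|comedy|fantasy",
    ],
})

combined = combine_genres_for_titles(toy, ["Aladdin (1992)"])
assert len(combined) == len(toy)  # no rows are dropped
print(combined["genres"].tolist())
```

Keeping both rows matters because each row carries its own `movieId`, and ratings are attached by `movieId`.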

### Ratings Data Frame
Same Steps
1. Easily get the number of rows and columns by looking at the shape
2. Determine the number of non-null rows for each given column
3. Check for duplicate rows

In [36]:
ratings_df.shape

(27753444, 4)

In [37]:
ratings_df.notnull().sum()

userId       27753444
movieId      27753444
rating       27753444
timestamp    27753444
dtype: int64

Here we should again define what it means to be a "duplicate":

In [38]:
assert (len(ratings_df) - len(ratings_df.drop_duplicates(subset=['userId', 'movieId', 'rating', 'timestamp']))) == 0

## Step 2: Feature Creation
We want to be able to recommend movies to users. What are some things we could do next?

Answer:
- Join movies to ratings
- Group by movie title to find an average rating for each movie

Before we join we need to figure which direction to join - so we should ask ourselves:

Why are we joining the two data frames?
How many rows should the resulting data frame have?
Do we expect a given column to have any nulls after the join?

Here we will left join movies to their ratings. If we expect all movies to have ratings, then we can start with an assertion that indicates that.

In [39]:
movies_ratings_joined_df = movies_df.set_index('movieId').join(ratings_df.set_index('movieId')).reset_index()

In [40]:
movies_with_ratings = movies_ratings_joined_df.loc[movies_ratings_joined_df['rating'].notna()]

In [41]:
assert len(movies_ratings_joined_df.index) == len(movies_with_ratings.index)

AssertionError: 

Well, our assertion fails, but what does that mean?

In [42]:
movies_missing_ratings = movies_ratings_joined_df.loc[movies_ratings_joined_df['rating'].isna()]

In [43]:
movies_missing_ratings

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
21871974,25817,Break of Hearts (1935),drama|romance,,,
21914110,26361,Baby Blue Marine (1976),drama,,,
21971501,27153,Can't Be Heaven (Forever Together) (2000),children|comedy|drama|romance,,,
21986044,27433,Bark! (2002),comedy|drama,,,
22225201,31945,Always a Bridesmaid (2000),documentary,,,
...,...,...,...,...,...,...
27756743,192399,Under Wraps (1997),children|comedy|horror,,,
27757129,192933,Rosie (2018),drama,,,
27757205,193109,Ach śpij kochanie (2017),crime|thriller,,,
27757340,193321,Pledges (2018),comedy|horror,,,


Joining movies to their ratings means that we have a subset of movies that don't have ratings.  We actually may want to know about these so it's good to know that they exist. For now though, we'll drop the records where the rating is NaN and then assert we get the number of rows we expect.

In [44]:
movies_ratings_joined_df = movies_ratings_joined_df.loc[movies_ratings_joined_df['rating'].notna()]
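
The questions about join direction, row count, and nulls can be answered up front on toy data. The sketch below uses `merge` with `indicator=True` instead of the index-based `join` used in this notebook; the toy frames and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames; movie 3 has no ratings
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 3.5],
})

# indicator=True adds a _merge column: "both" or "left_only" for a left join
joined = toy_movies.merge(toy_ratings, on="movieId", how="left", indicator=True)

rated = joined.loc[joined["_merge"] == "both"]
unrated = joined.loc[joined["_merge"] == "left_only"]

# Every rating row survives, plus one placeholder row per unrated movie.
# This holds here because movieId is unique in toy_movies and every rated
# movieId exists in toy_movies.
assert len(joined) == len(toy_ratings) + len(unrated)
print(unrated["title"].tolist())
```

The placeholder rows are also why `userId` and `timestamp` display as floats after the join: a column that contains NaN can no longer be stored as plain integers.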

In [45]:
movies_ratings_joined_df

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),adventure|animation|children|comedy|fantasy,4.0,4.0,1.113766e+09
1,1,Toy Story (1995),adventure|animation|children|comedy|fantasy,10.0,5.0,9.488858e+08
2,1,Toy Story (1995),adventure|animation|children|comedy|fantasy,14.0,4.5,1.442169e+09
3,1,Toy Story (1995),adventure|animation|children|comedy|fantasy,15.0,4.0,1.370810e+09
4,1,Toy Story (1995),adventure|animation|children|comedy|fantasy,22.0,4.0,1.237623e+09
...,...,...,...,...,...,...
27757648,193878,Les tribulations d'une caissière (2011),comedy,176871.0,2.0,1.537875e+09
27757649,193880,Her Name Was Mumu (2016),drama,81710.0,2.0,1.537886e+09
27757650,193882,Flora (2017),adventure|drama|horror|sci-fi,33330.0,2.0,1.537891e+09
27757651,193886,Leal (2018),action|crime|drama,206009.0,2.5,1.537918e+09


Next let's group by title, collecting each movie's ratings into a list so that we can compute an average rating for each movie.

In [117]:
movies_with_all_ratings_df = movies_ratings_joined_df.groupby(['title']).agg({'rating': lambda x : x.to_list(), 'genres': lambda x: x.to_list()[0], 'timestamp': lambda x: x.to_list()}).reset_index()

In [118]:
movies_with_all_ratings_df.sample(20)

Unnamed: 0,title,rating,genres,timestamp
19310,Hatfields & McCoys (2012),"[4.5, 0.5, 4.0, 3.5, 3.5, 3.0, 4.0, 3.5, 0.5, ...",drama|romance,"[1506498749.0, 1458413253.0, 1372018854.0, 146..."
43963,The Deceased (1965),"[4.0, 3.0, 3.0]",(no genres listed),"[1437950130.0, 1479221019.0, 1443828957.0]"
35107,Puck Hogs (2009),[3.5],comedy,[1494626223.0]
11154,Daria: Is It College Yet? (2002),"[3.0, 5.0, 4.5, 3.5, 0.5, 4.0, 4.0, 5.0, 3.5, ...",animation|comedy,"[1409524566.0, 1256702125.0, 1484306566.0, 130..."
36927,Room (2015),"[5.0, 4.0, 4.5, 4.0, 4.0, 2.5, 5.0, 3.5, 4.5, ...",drama,"[1472945467.0, 1466733971.0, 1485359233.0, 145..."
25661,Les invincibles (2013),"[2.5, 2.0, 3.5]",comedy,"[1416478412.0, 1450162909.0, 1494623338.0]"
44694,The Good Die Young (1954),[2.5],crime,[1479221022.0]
16256,For Ever Mozart (1996),"[5.0, 4.0, 2.5, 3.5, 5.0, 3.5, 4.0, 2.0, 1.5, ...",drama,"[1006781344.0, 891478200.0, 1155132769.0, 1166..."
13240,Dr. Jekyll and Mr. Hyde (1941),"[3.5, 5.0, 3.0, 5.0, 2.0, 3.0, 2.0, 3.5, 4.0, ...",drama|horror,"[1073325922.0, 1073290271.0, 1074438813.0, 110..."
12263,"Deux mondes, Les (2007)","[4.0, 2.5, 2.5, 1.5, 4.0, 2.5, 3.5, 3.0, 5.0, ...",comedy|fantasy,"[1500487736.0, 1262124442.0, 1483348116.0, 150..."


Anything else we want to check at this point?

Answer: Worth it at this point to check that we still have all the movies that we expect to have.

Remember the movies that had no ratings?  Good thing we grabbed those :) let's use them now.

In [119]:
movies_missing_ratings

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
21871974,25817,Break of Hearts (1935),drama|romance,,,
21914110,26361,Baby Blue Marine (1976),drama,,,
21971501,27153,Can't Be Heaven (Forever Together) (2000),children|comedy|drama|romance,,,
21986044,27433,Bark! (2002),comedy|drama,,,
22225201,31945,Always a Bridesmaid (2000),documentary,,,
...,...,...,...,...,...,...
27756743,192399,Under Wraps (1997),children|comedy|horror,,,
27757129,192933,Rosie (2018),drama,,,
27757205,193109,Ach śpij kochanie (2017),crime|thriller,,,
27757340,193321,Pledges (2018),comedy|horror,,,


What can we assert to verify that we have all the movies we expect to have?

We should check that:

Original Number of Movies = # Movies Missing Ratings + # Movies with Ratings + # Movies with duplicated Titles

Why?

Remember that we kept all rows when we were combining genres above (we didn't create one version of the title with all genres).  We did this because ratings were joined to movies on movie id.  By keeping multiple rows of the movie we were able to capture all of the movie's ratings when we joined to the ratings df.  Then we grouped by movie title - which means we should only have one row for each movie title and it should have all the movie's ratings as a list in the rating column.

In [120]:
len(movies_df.index)

58098

In [121]:
(len(movies_missing_ratings.index)+len(movies_with_all_ratings_df.index)+len(duplicated_titles))

58090

In [122]:
assert len(movies_df.index) == (len(movies_missing_ratings.index)+len(movies_with_all_ratings_df.index)+len(duplicated_titles))

AssertionError: 
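
When reconciling row counts like this, it helps to be precise about what is being counted. The sketch below, on made-up titles, shows an identity that holds by construction and how it differs from counting distinct repeated titles. It does not diagnose the failure above.

```python
import pandas as pd

# Hypothetical titles: "A" appears three times, "C" twice
toy = pd.DataFrame({"title": ["A", "A", "A", "B", "C", "C"]})

unique_titles = toy["title"].nunique()

# Rows beyond the first occurrence of each title
extra_rows = toy.duplicated(["title"]).sum()

# Distinct titles that appear more than once
titles_with_repeats = toy.loc[toy.duplicated(["title"]), "title"].nunique()

# Always true: total rows = one row per title + the extra rows
assert len(toy) == unique_titles + extra_rows
print(unique_titles, extra_rows, titles_with_repeats)
```

The two counts differ whenever a title appears more than twice, so a list of repeated titles can undercount the rows that a group-by collapses.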

Let's check out Aladdin again:

In [124]:
movies_with_all_ratings_df.loc[movies_with_all_ratings_df['title'] == 'Aladdin (1992)']

Unnamed: 0,title,rating,genres,timestamp
2055,Aladdin (1992),"[2.5, 4.0, 3.5, 4.0, 5.0, 3.5, 4.0, 4.0, 4.5, ...",adventure|animation|children|comedy|musical|fa...,"[1113797093.0, 948885872.0, 1442171461.0, 8295..."


Now let's assign each movie an average rating.

In [125]:
def average_list_items(ratings_list: List):
    return sum(ratings_list)/len(ratings_list)

In [126]:
movies_with_all_ratings_df['average_rating'] = movies_with_all_ratings_df['rating'].apply(average_list_items)
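
As a cross-check, the same average can be computed straight from the long-format ratings with a group-by, without building lists first. This is a sketch on made-up ratings, not a change to the approach in this notebook.

```python
import pandas as pd

# Hypothetical long-format ratings
toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "rating": [4.0, 5.0, 3.0],
})

# List-based route, as in the notebook
as_lists = toy.groupby("title")["rating"].agg(list)
from_lists = as_lists.apply(lambda ratings: sum(ratings) / len(ratings))

# Direct route: mean and count in one pass
summary = toy.groupby("title")["rating"].agg(["mean", "count"])

# Both routes agree up to floating-point rounding
assert (from_lists - summary["mean"]).abs().max() < 1e-12
print(summary)
```

The `count` column gives the number of ratings per title at the same time, which is useful when filtering out movies with few ratings.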

Let's think about our recommendation engine again. Let's say that we want to recommend the movies with the highest ratings and the most similar genres lists.

To find the movies with the "most similar genres list" we can use the nearest neighbors algorithm.  The first step in this would involve transforming the genres list to a vector.
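
To give a feel for the idea before any vectors are built, here is a minimal brute-force sketch that ranks movies by Jaccard distance between genre sets. Jaccard distance and the helper names `jaccard_distance` and `nearest_titles` are assumptions chosen for illustration; they stand in for whichever nearest neighbors implementation and distance metric end up being used.

```python
from typing import Dict, List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |intersection| / |union|; 0.0 means identical genre sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearest_titles(query: str, genres_by_title: Dict[str, Set[str]], k: int = 2) -> List[str]:
    """Return the k titles whose genre sets are closest to the query's."""
    query_genres = genres_by_title[query]
    others = [title for title in genres_by_title if title != query]
    others.sort(key=lambda title: jaccard_distance(query_genres, genres_by_title[title]))
    return others[:k]


# Genre sets for the first four movies shown earlier in the notebook
toy = {
    "Toy Story (1995)": {"adventure", "animation", "children", "comedy", "fantasy"},
    "Jumanji (1995)": {"adventure", "children", "fantasy"},
    "Grumpier Old Men (1995)": {"comedy", "romance"},
    "Waiting to Exhale (1995)": {"comedy", "drama", "romance"},
}

print(nearest_titles("Toy Story (1995)", toy))
# → ['Jumanji (1995)', 'Grumpier Old Men (1995)']
```

A genre set carries the same information as a 0/1 vector with one position per genre, so the vector transformation described above changes the representation rather than the underlying comparison.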

First let's drop some rows that could mess up our recommendation engine.  You may have noticed with some of the dataframe samples that some movies have the genre "(no genres listed)"  - that sort of feels like a miscellaneous category that would hurt us more than help us. So let's get rid of those movies.

In [138]:
movies_with_ratings_and_genres = movies_with_all_ratings_df.loc[movies_with_all_ratings_df['genres'] != '(no genres listed)']

In [139]:
movies_with_ratings_and_genres.sample(20)

Unnamed: 0,title,rating,genres,timestamp,average_rating
10792,Curse of Chucky (Child's Play 6) (2013),"[3.5, 5.0, 0.5, 1.5, 3.0, 2.5, 4.0, 2.0, 3.0, ...",horror|thriller,"[1449326707.0, 1507403422.0, 1515971446.0, 145...",2.751678
16146,"Flowers of St. Francis (Francesco, giullare di...","[4.0, 2.0, 4.0, 4.5, 2.5, 2.5, 3.5, 3.0, 4.5, ...",drama,"[1425927458.0, 1506498752.0, 1221567153.0, 122...",3.633333
1838,After (2009),[2.5],drama,[1535602359.0],2.5
33286,Parker (2013),"[4.5, 3.0, 3.0, 3.5, 4.0, 2.5, 3.0, 4.0, 4.0, ...",crime|thriller,"[1459980853.0, 1499887120.0, 1370377622.0, 139...",3.188433
36869,Roman Polanski: Odd Man Out (2012),"[3.0, 4.0, 3.5]",documentary,"[1450163663.0, 1435540481.0, 1459091569.0]",3.5
16874,"Front, The (1976)","[3.5, 3.5, 1.5, 4.0, 5.0, 4.0, 4.0, 2.0, 3.5, ...",comedy|drama,"[1105378043.0, 1506648127.0, 1224457482.0, 115...",3.693662
31500,Noah's Ark (1999),"[2.5, 1.5, 3.0, 0.5]",adventure|drama|romance,"[1422767382.0, 1447929546.0, 1422195322.0, 145...",1.875
32390,One More Time (1970),"[2.0, 1.0, 2.0, 3.5]",action|comedy,"[1480641538.0, 1511809520.0, 1484395413.0, 150...",2.125
32836,Outbreak (1995),"[3.5, 4.0, 5.0, 4.5, 1.5, 3.0, 4.0, 4.0, 3.0, ...",action|drama|sci-fi|thriller,"[1113767157.0, 845061417.0, 836433605.0, 12252...",3.420117
51788,Wet Behind the Ears (2013),[3.5],comedy,[1536459350.0],3.5


Now let's drop movies that have 5 or fewer ratings. We probably don't have enough data to know whether those ratings actually reflect how good the movie is.

TODO: We could probably come up with a less arbitrary cutoff than 5.
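
One possible way to pick a less arbitrary cutoff is to look at the distribution of rating counts and see how many movies each candidate threshold would keep. The sketch below uses made-up counts, and the helper `movies_kept` is hypothetical.

```python
import pandas as pd

# Hypothetical rating counts per movie
toy_counts = pd.Series([1, 1, 2, 3, 7, 10, 35, 205, 665, 3329], name="number of ratings")


def movies_kept(counts: pd.Series, threshold: int) -> int:
    """Number of movies with strictly more than `threshold` ratings."""
    return int((counts > threshold).sum())


# Quartiles show how skewed the counts are
print(toy_counts.quantile([0.25, 0.5, 0.75]))

# How many movies survive each candidate cutoff
for threshold in (5, 10, 50):
    print(threshold, movies_kept(toy_counts, threshold))
```

On real data the counts are likely to be heavily skewed, so a cutoff based on a quantile may be easier to justify than a round number.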

In [140]:
movies_with_ratings_and_genres['number of ratings'] = movies_with_ratings_and_genres['rating'].apply(lambda x: len(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_with_ratings_and_genres['number of ratings'] = movies_with_ratings_and_genres['rating'].apply(lambda x: len(x))
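
The warning appears because the filtered frame was derived from another data frame and pandas cannot tell whether the assignment is meant to affect the original. A common way to avoid it, sketched here on toy data, is to take an explicit `.copy()` when filtering.

```python
import pandas as pd

# Hypothetical toy frame with list-valued ratings
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["drama", "(no genres listed)", "comedy"],
    "rating": [[3.5, 4.0], [2.0], [5.0, 4.5, 4.0]],
})

# .copy() makes the filtered frame independent of `toy`
with_genres = toy.loc[toy["genres"] != "(no genres listed)"].copy()

# Adding a column now modifies only the copy, with no warning
with_genres["number of ratings"] = with_genres["rating"].apply(len)

print(with_genres["number of ratings"].tolist())
```

The original frame is left untouched, which also makes it safe to re-run the filtering cell later.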


In [141]:
movies_with_many_ratings = movies_with_ratings_and_genres.loc[movies_with_ratings_and_genres['number of ratings'] > 5]

In [142]:
movies_with_many_ratings.loc[movies_with_many_ratings['genres'] == '(no genres listed)']

Unnamed: 0,title,rating,genres,timestamp,average_rating,number of ratings


In [143]:
movies_with_many_ratings['genres'] = movies_with_many_ratings['genres'].apply(lambda x: x.split('|'))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies_with_many_ratings['genres'] = movies_with_many_ratings['genres'].apply(lambda x: x.split('|'))


In [144]:
movies_with_many_ratings.sample(20)

Unnamed: 0,title,rating,genres,timestamp,average_rating,number of ratings
29144,Miss Firecracker (1989),"[4.0, 2.5, 3.0, 3.5, 4.0, 3.0, 4.0, 4.0, 3.0, ...",[comedy],"[1092789641.0, 1069145669.0, 1091379617.0, 111...",3.217073,205
20239,Hollywood Ending (2002),"[3.0, 2.0, 2.5, 4.5, 3.0, 3.0, 3.5, 1.0, 5.0, ...","[comedy, drama]","[1032811534.0, 1105388200.0, 1160077571.0, 150...",3.013534,665
28289,"Matchmaker, The (1958)","[3.5, 3.0, 4.5, 2.5, 3.5, 5.0, 3.5, 3.5, 3.5, ...","[comedy, romance]","[1212240077.0, 1180426876.0, 1166624947.0, 122...",3.369565,46
49804,Twin Sitters (1994),"[1.5, 3.0, 2.5, 3.5, 3.5, 4.0, 1.5, 3.5, 3.5, ...",[thriller],"[1120626269.0, 1503470251.0, 1506362390.0, 149...",3.0,35
20303,Home Alone: The Holiday Heist (2012),"[0.5, 2.5, 0.5, 1.5, 3.5, 3.0, 0.5, 0.5, 4.0, ...","[children, comedy, crime]","[1470694163.0, 1496771247.0, 1463282894.0, 148...",2.176471,34
23539,Just Before Nightfall (1971),"[3.5, 3.0, 3.5, 3.5, 5.0, 4.0, 4.0]","[crime, drama]","[1485254290.0, 1535602331.0, 1462783924.0, 145...",3.785714,7
41375,Style (2001),"[0.5, 3.5, 1.0, 0.5, 2.0, 3.5, 4.0, 1.5, 3.0, ...",[thriller],"[1462734110.0, 1525819896.0, 1455216261.0, 146...",2.35,10
50102,Uncle Buck (1989),"[2.0, 4.5, 3.0, 2.5, 3.5, 4.0, 5.0, 2.0, 3.5, ...",[comedy],"[1145937939.0, 1275404425.0, 1011899371.0, 111...",3.313157,3329
35194,Pure Luck (1991),"[2.0, 3.5, 3.0, 3.0, 2.5, 1.0, 3.0, 1.0, 2.0, ...","[comedy, crime]","[1070432222.0, 1466323232.0, 1074829793.0, 112...",2.717391,138
30929,Nevada Smith (1966),"[3.0, 3.5, 3.0, 4.0, 3.0, 2.0, 3.5, 3.5, 2.5, ...",[western],"[1487636845.0, 1075437357.0, 1183581671.0, 150...",3.19,100


Next let's one-hot encode the genres list:
1. Get a list of all the genres
2. Use that to encode each genres list (this can be a function in the transformations code)

In [146]:
# Unique genres in sorted order; avoids summing every other column per group
genres_list = sorted(movies_with_many_ratings['genres'].explode().unique())
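
The two steps above can be sketched on a few rows as follows. The transformations code is not shown in this notebook, so the function name `one_hot_encode` and the `genres_vector` column are assumptions made for illustration.

```python
from typing import List

import pandas as pd


def one_hot_encode(genres: List[str], all_genres: List[str]) -> List[int]:
    """Hypothetical helper: 1 where the movie has the genre, 0 otherwise."""
    present = set(genres)
    return [1 if genre in present else 0 for genre in all_genres]


# Three rows with genre lists as they appear in the sample above
toy = pd.DataFrame({
    "title": ["Miss Firecracker (1989)", "Hollywood Ending (2002)", "Nevada Smith (1966)"],
    "genres": [["comedy"], ["comedy", "drama"], ["western"]],
})

# Step 1: every genre that occurs, in a fixed (sorted) order
all_genres = sorted(toy["genres"].explode().unique())

# Step 2: encode each movie against that fixed order
toy["genres_vector"] = toy["genres"].apply(lambda genres: one_hot_encode(genres, all_genres))

print(all_genres)
print(toy["genres_vector"].tolist())
```

Fixing the genre order once is what makes the vectors comparable: position 0 means the same genre for every movie.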