## DATA CLEANING AND EXPLORATION

In this notebook I sort, clean and prepare [The Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset) from Kaggle for use in our movie recommendation system, [Deep Film](https://github.com/marreche/deep_film).

In [1]:
import pandas as pd

### 1.1 Loading csv data to a pandas DataFrame

In [2]:
movies = pd.read_csv("data/movies_metadata.csv")
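
Reading a wide CSV like this one can make pandas emit a `DtypeWarning` when a column mixes numbers and strings. The sketch below uses a made-up three-row CSV (not the real file) to show two standard `read_csv` options that avoid chunk-wise type guessing. The toy data and variable names are assumptions for illustration only.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file: the "id" column mixes
# integers with a stray date string, as a malformed row might.
csv_text = "id,title\n862,Toy Story\n1997-08-20,Broken row\n8844,Jumanji\n"

# Option 1: parse the file in a single pass instead of in chunks,
# so each column gets one consistent inferred type.
toy = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: declare the suspicious column as text up front.
toy_str = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(toy["id"].tolist())
print(toy_str["id"].tolist())
```

Either way the column stays textual until the bad values are handled, so the problem is visible instead of hidden behind a warning.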


### 1.2 Taking a look into the data

In [3]:
movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [4]:
movies.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [5]:
movies.shape

(45466, 24)

### 1.3 Filtering Data

After a first look at the dataset, we filter it to fit our needs. To begin, we check for null values and for columns that need attention.

In [6]:
movies.isna().sum()

adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
homepage                 37684
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64

The following columns have too many null values to be useful:
- belongs_to_collection
- homepage

We therefore drop these columns.

In [7]:
movies = movies.drop(columns=["belongs_to_collection", "homepage"])

We can also drop the "tagline" column: more than half of its values are null, and the taglines that are present add little for our purposes.

In [8]:
movies = movies.drop(columns=["tagline"])
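
The three columns dropped so far are null in roughly 90%, 83% and 55% of the 45466 rows. The same selection can be derived from the null share per column rather than read off by eye. This is a minimal sketch on a toy DataFrame; the 0.5 cut-off and the names `null_share`, `threshold` and `sparse_cols` are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: two sparse columns and one mostly complete column.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "homepage": [np.nan, np.nan, np.nan, "http://example.com"],
    "tagline": [np.nan, "x", np.nan, np.nan],
    "runtime": [81.0, np.nan, 101.0, 127.0],
})

# isna().mean() gives the fraction of nulls per column.
null_share = toy.isna().mean()

threshold = 0.5  # hypothetical cut-off: drop columns that are more than half null
sparse_cols = null_share[null_share > threshold].index.tolist()

trimmed = toy.drop(columns=sparse_cols)
print(sparse_cols)
print(trimmed.columns.tolist())
```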

In [9]:
movies.isna().sum()

adult                     0
budget                    0
genres                    0
id                        0
imdb_id                  17
original_language        11
original_title            0
overview                954
popularity                5
poster_path             386
production_companies      3
production_countries      3
release_date             87
revenue                   6
runtime                 263
spoken_languages          6
status                   87
title                     6
video                     6
vote_average              6
vote_count                6
dtype: int64

Our DataFrame looks a bit cleaner now that those columns are gone, so we can dig deeper into the data.

Let's look at the following columns:
- production_companies
- production_countries
- revenue
- status
- video
- original_title/title

In [10]:
movies["production_companies"]

0           [{'name': 'Pixar Animation Studios', 'id': 3}]
1        [{'name': 'TriStar Pictures', 'id': 559}, {'na...
2        [{'name': 'Warner Bros.', 'id': 6194}, {'name'...
3        [{'name': 'Twentieth Century Fox Film Corporat...
4        [{'name': 'Sandollar Productions', 'id': 5842}...
                               ...                        
45461                                                   []
45462               [{'name': 'Sine Olivia', 'id': 19653}]
45463    [{'name': 'American World Pictures', 'id': 6165}]
45464                 [{'name': 'Yermoliev', 'id': 88753}]
45465                                                   []
Name: production_companies, Length: 45466, dtype: object

This information could be useful in a movie description, but it is of little use to our machine learning model, so we drop this column as well.

In [11]:
movies = movies.drop(columns=["production_companies"])
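
The values in this column are strings that look like Python lists of dicts, and the "genres" and "spoken_languages" columns we keep use the same layout. If the names are needed later, one way to recover them is `ast.literal_eval` from the standard library. The helper `extract_names` below is a hypothetical name introduced for illustration.

```python
import ast

def extract_names(cell):
    """Return the 'name' values from a stringified list of dicts.

    Anything that cannot be parsed (NaN, free text) yields an empty list.
    """
    try:
        items = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return []
    if not isinstance(items, list):
        return []
    return [d["name"] for d in items if isinstance(d, dict) and "name" in d]

raw = "[{'name': 'Pixar Animation Studios', 'id': 3}]"
print(extract_names(raw))
print(extract_names("[]"))
```

On a DataFrame this could be applied column-wise, for example with `movies["genres"].apply(extract_names)`.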

In [12]:
movies["production_countries"]

0        [{'iso_3166_1': 'US', 'name': 'United States o...
1        [{'iso_3166_1': 'US', 'name': 'United States o...
2        [{'iso_3166_1': 'US', 'name': 'United States o...
3        [{'iso_3166_1': 'US', 'name': 'United States o...
4        [{'iso_3166_1': 'US', 'name': 'United States o...
                               ...                        
45461               [{'iso_3166_1': 'IR', 'name': 'Iran'}]
45462        [{'iso_3166_1': 'PH', 'name': 'Philippines'}]
45463    [{'iso_3166_1': 'US', 'name': 'United States o...
45464             [{'iso_3166_1': 'RU', 'name': 'Russia'}]
45465     [{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]
Name: production_countries, Length: 45466, dtype: object

In [13]:
movies["revenue"]

0        373554033.0
1        262797249.0
2                0.0
3         81452156.0
4         76578911.0
            ...     
45461            0.0
45462            0.0
45463            0.0
45464            0.0
45465            0.0
Name: revenue, Length: 45466, dtype: float64

In [14]:
movies["status"]

0        Released
1        Released
2        Released
3        Released
4        Released
           ...   
45461    Released
45462    Released
45463    Released
45464    Released
45465    Released
Name: status, Length: 45466, dtype: object

In [15]:
movies["video"]

0        False
1        False
2        False
3        False
4        False
         ...  
45461    False
45462    False
45463    False
45464    False
45465    False
Name: video, Length: 45466, dtype: object

In [16]:
movies["original_title"]

0                          Toy Story
1                            Jumanji
2                   Grumpier Old Men
3                  Waiting to Exhale
4        Father of the Bride Part II
                    ...             
45461                        رگ خواب
45462            Siglo ng Pagluluwal
45463                       Betrayal
45464            Satana likuyushchiy
45465                       Queerama
Name: original_title, Length: 45466, dtype: object

All of this information can be useful but is not crucial for our model, so we drop these columns from our DataFrame, together with "budget" and "poster_path".

In [17]:
movies = movies.drop(columns=["budget", "poster_path", "original_title", "video", "status", "revenue", "production_countries"])

In [18]:
movies.head()

Unnamed: 0,adult,genres,id,imdb_id,original_language,overview,popularity,release_date,runtime,spoken_languages,title,vote_average,vote_count
0,False,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,tt0114709,en,"Led by Woody, Andy's toys live happily in his ...",21.9469,1995-10-30,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Toy Story,7.7,5415.0
1,False,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,tt0113497,en,When siblings Judy and Peter discover an encha...,17.0155,1995-12-15,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Jumanji,6.9,2413.0
2,False,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,tt0113228,en,A family wedding reignites the ancient feud be...,11.7129,1995-12-22,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Grumpier Old Men,6.5,92.0
3,False,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,tt0114885,en,"Cheated on, mistreated and stepped on, the wom...",3.85949,1995-12-22,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Waiting to Exhale,6.1,34.0
4,False,"[{'id': 35, 'name': 'Comedy'}]",11862,tt0113041,en,Just when George Banks has recovered from his ...,8.38752,1995-02-10,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Father of the Bride Part II,5.7,173.0


In [19]:
movies.isna().sum()

adult                  0
genres                 0
id                     0
imdb_id               17
original_language     11
overview             954
popularity             5
release_date          87
runtime              263
spoken_languages       6
title                  6
vote_average           6
vote_count             6
dtype: int64

In [20]:
movies['overview'] = movies['overview'].fillna('')

In [21]:
movies[movies["runtime"].isna()]

Unnamed: 0,adult,genres,id,imdb_id,original_language,overview,popularity,release_date,runtime,spoken_languages,title,vote_average,vote_count
634,False,"[{'id': 35, 'name': 'Comedy'}]",287305,tt0117312,de,,0.066123,1996-03-21,,[],Peanuts – Die Bank zahlt alles,4.0,1.0
635,False,"[{'id': 35, 'name': 'Comedy'}]",339428,tt0116485,de,,0.002229,1996-03-14,,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Happy Weekend,0.0,0.0
644,False,"[{'id': 18, 'name': 'Drama'}]",278978,tt0118026,de,,0.439989,1996-02-29,,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Und keiner weint mir nach,0.0,0.0
802,False,"[{'id': 18, 'name': 'Drama'}]",282919,tt0112865,de,,0.106345,1996-06-20,,[],Diebinnen,4.0,1.0
863,False,"[{'id': 53, 'name': 'Thriller'}]",253632,tt0094822,fr,,0.437895,1988-10-08,,[],Baton Rouge,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
45246,False,"[{'id': 35, 'name': 'Comedy'}]",231216,tt0441908,de,,0.002513,2004-12-02,,"[{'iso_639_1': 'de', 'name': 'Deutsch'}]",Villa Henriette,0.0,0.0
45310,False,[],418757,tt4112020,pl,,0.030803,2014-08-01,,[],Między nami dobrze jest,0.0,0.0
45313,False,"[{'id': 18, 'name': 'Drama'}]",369444,tt0098035,pl,,0.000102,1989-10-27,,[],Ostatni dzwonek,0.0,0.0
45377,False,"[{'id': 12, 'name': 'Adventure'}]",317389,tt0070695,es,,0.006352,1973-07-22,,"[{'iso_639_1': 'it', 'name': 'Italiano'}]",Simbad e il califfo di Bagdad,0.0,0.0


In [22]:
movies['runtime'] = movies['runtime'].fillna('')
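
Filling a numeric column with an empty string typically converts it to `object` dtype, so later numeric operations on "runtime" may no longer work as expected. A hedged alternative, shown on toy data, is to fill with a numeric value such as the median and keep a flag recording which rows were missing. The choice of the median and the column names `runtime_filled` and `runtime_missing` are assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"runtime": [81.0, np.nan, 101.0, 127.0]})

# Remember which rows had no runtime before filling them.
toy["runtime_missing"] = toy["runtime"].isna()

# The median ignores NaN by default, so it is computed from the known values.
median_runtime = toy["runtime"].median()
toy["runtime_filled"] = toy["runtime"].fillna(median_runtime)

print(toy)
```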

In [23]:
movies.isna().sum()

adult                 0
genres                0
id                    0
imdb_id              17
original_language    11
overview              0
popularity            5
release_date         87
runtime               0
spoken_languages      6
title                 6
vote_average          6
vote_count            6
dtype: int64

We have significantly reduced the number of NaNs in our dataset, so we can now prepare the data for our machine learning model.

In [24]:
movies.head()

Unnamed: 0,adult,genres,id,imdb_id,original_language,overview,popularity,release_date,runtime,spoken_languages,title,vote_average,vote_count
0,False,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,tt0114709,en,"Led by Woody, Andy's toys live happily in his ...",21.9469,1995-10-30,81,"[{'iso_639_1': 'en', 'name': 'English'}]",Toy Story,7.7,5415.0
1,False,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,tt0113497,en,When siblings Judy and Peter discover an encha...,17.0155,1995-12-15,104,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Jumanji,6.9,2413.0
2,False,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,tt0113228,en,A family wedding reignites the ancient feud be...,11.7129,1995-12-22,101,"[{'iso_639_1': 'en', 'name': 'English'}]",Grumpier Old Men,6.5,92.0
3,False,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,tt0114885,en,"Cheated on, mistreated and stepped on, the wom...",3.85949,1995-12-22,127,"[{'iso_639_1': 'en', 'name': 'English'}]",Waiting to Exhale,6.1,34.0
4,False,"[{'id': 35, 'name': 'Comedy'}]",11862,tt0113041,en,Just when George Banks has recovered from his ...,8.38752,1995-02-10,106,"[{'iso_639_1': 'en', 'name': 'English'}]",Father of the Bride Part II,5.7,173.0


Next we join the credits dataset, which contains the director and cast of each film. We merge the two datasets on the 'id' column.

In [25]:
df = pd.read_csv("data/credits.csv")
df.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [26]:
df['id']

0           862
1          8844
2         15602
3         31357
4         11862
          ...  
45471    439050
45472    111109
45473     67758
45474    227506
45475    461257
Name: id, Length: 45476, dtype: int64

In [27]:
movies['id']

0           862
1          8844
2         15602
3         31357
4         11862
          ...  
45461    439050
45462    111109
45463     67758
45464    227506
45465    461257
Name: id, Length: 45466, dtype: object

First we need to convert the movies['id'] column, since its dtype is object, which suggests it holds strings.

In [28]:
movies.loc[movies['id'] == '1997-08-20']

Unnamed: 0,adult,genres,id,imdb_id,original_language,overview,popularity,release_date,runtime,spoken_languages,title,vote_average,vote_count
19730,- Written by Ørnås,"[{'name': 'Carousel Productions', 'id': 11176}...",1997-08-20,0,104.0,Released,,1,,,,,


In [29]:
movies = movies.drop([19730])

In [30]:
movies.loc[movies['id'] == '2012-09-29']

Unnamed: 0,adult,genres,id,imdb_id,original_language,overview,popularity,release_date,runtime,spoken_languages,title,vote_average,vote_count
29503,Rune Balot goes to a casino connected to the ...,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...",2012-09-29,0,68.0,Released,,12,,,,,


In [31]:
movies = movies.drop([29503])

In [32]:
movies.loc[movies['id'] == '2014-01-01']

Unnamed: 0,adult,genres,id,imdb_id,original_language,overview,popularity,release_date,runtime,spoken_languages,title,vote_average,vote_count
35587,Avalanche Sharks tells the story of a bikini ...,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...",2014-01-01,0,82.0,Released,Beware Of Frost Bites,22,,,,,


In [33]:
movies = movies.drop([35587])

In [34]:
movies['id'] = movies['id'].astype(int)
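
The malformed rows dropped above were looked up one date string at a time. As a sketch of a more general approach, `pd.to_numeric` with `errors="coerce"` turns every non-numeric id into NaN, which locates all bad rows in one pass. The toy data and the names `parsed`, `bad_rows` and `clean` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["862", "8844", "1997-08-20", "15602"],
    "title": ["Toy Story", "Jumanji", None, "Grumpier Old Men"],
})

# Values that cannot be parsed as numbers become NaN instead of raising.
parsed = pd.to_numeric(toy["id"], errors="coerce")

# Rows whose id failed to parse: inspect these before discarding them.
bad_rows = toy[parsed.isna()]

# Keep only the parseable rows and store the id as an integer.
clean = toy[parsed.notna()].copy()
clean["id"] = parsed[parsed.notna()].astype(int)

print(bad_rows.index.tolist())
print(clean["id"].tolist())
```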

Some rows contained faulty data (a date in the 'id' field) that prevented a direct conversion. With those rows deleted, the column converts to integers and we can finally merge both datasets.

In [35]:
movies = movies.merge(df, on='id')
movies.head()

Unnamed: 0,adult,genres,id,imdb_id,original_language,overview,popularity,release_date,runtime,spoken_languages,title,vote_average,vote_count,cast,crew
0,False,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,tt0114709,en,"Led by Woody, Andy's toys live happily in his ...",21.9469,1995-10-30,81,"[{'iso_639_1': 'en', 'name': 'English'}]",Toy Story,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,False,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,tt0113497,en,When siblings Judy and Peter discover an encha...,17.0155,1995-12-15,104,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Jumanji,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de..."
2,False,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,tt0113228,en,A family wedding reignites the ancient feud be...,11.7129,1995-12-22,101,"[{'iso_639_1': 'en', 'name': 'English'}]",Grumpier Old Men,6.5,92.0,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de..."
3,False,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,tt0114885,en,"Cheated on, mistreated and stepped on, the wom...",3.85949,1995-12-22,127,"[{'iso_639_1': 'en', 'name': 'English'}]",Waiting to Exhale,6.1,34.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de..."
4,False,"[{'id': 35, 'name': 'Comedy'}]",11862,tt0113041,en,Just when George Banks has recovered from his ...,8.38752,1995-02-10,106,"[{'iso_639_1': 'en', 'name': 'English'}]",Father of the Bride Part II,5.7,173.0,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de..."
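
An inner merge on 'id' repeats a movie once for every matching credits row, so repeated ids on either side inflate the result. The differing row counts above (45476 in credits, 45466 in movies before the three drops) suggest this may be worth checking. The sketch below uses toy frames; `movies_toy` and `credits_toy` are made-up names and values.

```python
import pandas as pd

movies_toy = pd.DataFrame({
    "id": [862, 8844, 15602],
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
})
# The credits side repeats id 862 and has no entry for 15602.
credits_toy = pd.DataFrame({
    "id": [862, 862, 8844],
    "cast": ["c1", "c1", "c2"],
})

# A plain merge keeps both copies of the repeated id.
naive = movies_toy.merge(credits_toy, on="id")

# De-duplicate the key first, then let pandas verify the relationship.
credits_unique = credits_toy.drop_duplicates(subset="id")
merged = movies_toy.merge(
    credits_unique, on="id", how="inner", validate="one_to_one"
)

print(len(naive), len(merged))
```

`validate="one_to_one"` raises a `MergeError` if either side still has repeated keys, which makes silent row duplication harder to miss.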
