# Project 3: Data Cleaning - Tidy Up Messy Datasets (Movies Dataset)

### First Glance

In [1]:
import pandas as pd
pd.options.display.max_columns = 30

Pass low_memory=False when reading the file: some columns contain mixed types, and this setting makes pandas read the whole file at once instead of inferring dtypes chunk by chunk.

In [2]:
df = pd.read_csv("movies_metadata.csv",low_memory=False)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

The output shows that four columns have a float data type, while the remaining 20 hold text or mixed data types. For example, the budget column has mixed types but should be numeric, and release_date should be a datetime column. Some columns also have missing values: the homepage column, for instance, has only 7782 non-missing values. Let's investigate some nested columns to understand the data.

In [4]:
df.genres[0]

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

This column holds stringified JSON (a plain string), not an actual list.

In [5]:
df.belongs_to_collection[0]

"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}"

This column is also a string, here a stringified dictionary rather than an actual dict. We should flatten and clean these columns.

### Dropping Irrelevant Columns

We can drop some columns that we likely don't need for the exploratory analysis and data presentation.

In [6]:
df.drop(columns = ['adult'], inplace = True)

In [7]:
df.drop(columns = ['imdb_id'], inplace = True)

In [8]:
df.drop(columns = ['original_title'], inplace = True)

In [9]:
df.drop(columns = ['video'], inplace= True)

In [10]:
df.drop(columns = ['homepage'], inplace= True)
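
The five cells above each drop one column. As a minimal sketch on a made-up one-row frame (the names `toy` and `unwanted` are illustrative and not part of the notebook), the same columns could be dropped with a single call:

```python
import numpy as np
import pandas as pd

# Hypothetical one-row frame with the five unwanted columns plus one we keep.
toy = pd.DataFrame({
    "adult": ["False"],
    "imdb_id": ["tt0114709"],
    "original_title": ["Toy Story"],
    "video": [False],
    "homepage": [np.nan],
    "title": ["Toy Story"],
})

# One list, one drop call; assigning back avoids inplace=True.
unwanted = ["adult", "imdb_id", "original_title", "video", "homepage"]
toy = toy.drop(columns=unwanted)
```

Keeping the names in one list also documents in a single place which columns were discarded.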

### Let's Handle Stringified JSON Columns

Let's load the modules we need. In Python, ast stands for "Abstract Syntax Tree"; the module provides the literal_eval function used below.

In [11]:
import json
import ast

To automate the process, we list all the stringified JSON columns.

In [12]:
json_col = ["belongs_to_collection", "genres", "production_countries", 
            "production_companies", "spoken_languages"]

Let's look at one of the JSON columns.

In [13]:
df.belongs_to_collection[0]

"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}"

The first approach replaces single quotation marks with double quotation marks so that json.loads() can parse the string. json.loads() parses a JSON-formatted string into a Python data structure. Note that this simple replacement breaks if a value itself contains an apostrophe.

In [14]:
df.genres.apply(lambda x: json.loads(x.replace("'", '"')))[0]

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

The second approach uses the literal_eval function from the ast module. It evaluates an expression node or a string consisting of a Python literal or container display (strings, numbers, tuples, lists, dicts, and so on).

In [15]:
df.genres.apply(ast.literal_eval)[0]

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]
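
To make the difference between the two approaches concrete, here is a small sketch on a made-up value (the string `raw` is hypothetical and does not come from the dataset). It contains an apostrophe inside a name, which is the case where the quote-replacement trick stops producing valid JSON:

```python
import ast
import json

# Hypothetical value in the same style as the genres column,
# but with an apostrophe inside the name.
raw = "[{'id': 1, 'name': \"Director's Cut\"}]"

# literal_eval understands Python literals, so mixed quotes are fine.
parsed = ast.literal_eval(raw)

# Replacing every single quote also rewrites the apostrophe,
# so the result is no longer valid JSON.
try:
    json.loads(raw.replace("'", '"'))
    json_ok = True
except json.JSONDecodeError:
    json_ok = False
```

On this input `parsed` is a regular list of dicts while `json_ok` ends up `False`, which is one reason to prefer literal_eval for this kind of column.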

We could then apply this function to all stringified JSON columns. However, these columns do not consist of strings only: missing entries are stored as NaN (a float), which literal_eval cannot parse. Applying it to all JSON columns at once (as in the commented-out expression below) therefore does not give the desired result. Instead, we guard each call with a type check and return np.nan from the NumPy library for non-string values.

In [16]:
#df.loc[:, json_col].apply(ast.literal_eval, axis = 0)

In [17]:
import numpy as np

Let's check whether the values in one of the columns are strings. Where they are, we can apply literal_eval to turn the stringified JSON into a Python data structure. For the belongs_to_collection column, the first entry becomes a dictionary after the transformation.

In [18]:
df.belongs_to_collection.apply(lambda x: isinstance(x, str))

0         True
1        False
2         True
3        False
4         True
         ...  
45461    False
45462    False
45463    False
45464    False
45465    False
Name: belongs_to_collection, Length: 45466, dtype: bool

In [19]:
df.belongs_to_collection = df.belongs_to_collection.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) 
                                                          else np.nan)

In [20]:
df.belongs_to_collection[0]

{'id': 10194,
 'name': 'Toy Story Collection',
 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg',
 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}

Let's apply this method to the other JSON columns.

In [21]:
df.spoken_languages = df.spoken_languages.apply(lambda x: ast.literal_eval(x) 
                                                if isinstance(x, str) else np.nan)

In [22]:
df.production_countries = df.production_countries.apply(lambda x: ast.literal_eval(x) 
                                                        if isinstance(x, str) else np.nan)

In [23]:
df.production_companies = df.production_companies.apply(lambda x: ast.literal_eval(x) 
                                                        if isinstance(x, str) else np.nan)

In [24]:
df.genres = df.genres.apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)
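
The cells above repeat one pattern per column. As a sketch of how that could be written once, the helper `parse_literal` and the two-row frame `toy` below are hypothetical names introduced for illustration. The loop is meant for raw string columns only: running it on values that are already parsed would turn them into NaN, because they are no longer strings.

```python
import ast

import numpy as np
import pandas as pd


def parse_literal(value):
    """Parse a stringified Python literal; return NaN for non-string input."""
    return ast.literal_eval(value) if isinstance(value, str) else np.nan


# Hypothetical raw data in the same shape as the movies file.
toy = pd.DataFrame({
    "genres": ["[{'id': 16, 'name': 'Animation'}]", "[]"],
    "belongs_to_collection": [
        "{'id': 10194, 'name': 'Toy Story Collection'}",
        np.nan,
    ],
})

# Same guard as in the cells above, applied column by column.
for column in ["genres", "belongs_to_collection"]:
    toy[column] = toy[column].apply(parse_literal)
```

With the real data, the list of columns would be `json_col` from earlier in the notebook.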

### Let's Flatten the Nested Columns

We should extract the selected data from the lists or dictionaries. For instance, we can extract the name of the collection that the movie belongs to.

In [25]:
df.belongs_to_collection = df.belongs_to_collection.apply(lambda x: x['name'] if isinstance(x, dict) else np.nan) 

In [26]:
df.belongs_to_collection.value_counts(dropna = False).head(20)

NaN                                       40975
The Bowery Boys                              29
Totò Collection                              27
James Bond Collection                        26
Zatôichi: The Blind Swordsman                26
The Carry On Collection                      25
Pokémon Collection                           22
Charlie Chan (Sidney Toler) Collection       21
Godzilla (Showa) Collection                  16
Uuno Turhapuro                               15
Dragon Ball Z (Movie) Collection             15
Charlie Chan (Warner Oland) Collection       15
The Land Before Time Collection              14
Monster High Collection                      14
Sharpe Collection                            13
George Carlin Comedy Collection              13
Johan Falk GSI Collection                    12
Sherlock Holmes (1939 series)                12
Friday the 13th Collection                   12
The Amityville Horror Collection             12
Name: belongs_to_collection, dtype: int6

We can see that 40975 movies do not belong to a collection, and that the largest collection is The Bowery Boys with 29 movies.

Let's extract the genre names from the genres column and join them with a "|" separator when a movie has more than one genre.

In [27]:
df.genres = df.genres.apply(lambda x: "|".join(i['name'] for i in x))

In [28]:
df.genres.value_counts(dropna = False).head(20)

Drama                   5000
Comedy                  3621
Documentary             2723
                        2442
Drama|Romance           1301
Comedy|Drama            1135
Horror                   974
Comedy|Romance           930
Comedy|Drama|Romance     593
Drama|Comedy             532
Horror|Thriller          528
Drama|Thriller           497
Thriller                 465
Crime|Drama              430
Romance|Drama            343
Western                  318
Action|Thriller          301
Drama|Foreign            283
Action                   278
Drama|History            267
Name: genres, dtype: int64

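The empty strings stand for movies without any listed genre, so we replace them with NaN.

In [29]:
# Assign the result back: a chained inplace call may not reach df in newer pandas versions.
df.genres = df.genres.replace("", np.nan)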
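
As a variation on the two steps above (join, then replace empty strings), the sketch below folds both into one hypothetical helper, `join_names`, which returns NaN directly for empty lists and for non-list values. The sample series is made up for illustration.

```python
import numpy as np
import pandas as pd


def join_names(items, sep="|"):
    """Join the 'name' fields of a list of dicts; NaN if there is nothing to join."""
    if not isinstance(items, list) or not items:
        return np.nan
    return sep.join(item["name"] for item in items)


# Hypothetical parsed values: two genres, no genres, and a missing entry.
toy = pd.Series([
    [{"id": 16, "name": "Animation"}, {"id": 35, "name": "Comedy"}],
    [],
    np.nan,
])

flat = toy.apply(join_names)
```

The first entry becomes `"Animation|Comedy"`, and the other two become NaN without a separate replace step.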

Let's apply a similar process to the other columns.

In [30]:
df.spoken_languages = df.spoken_languages.apply(lambda x: "|".join(i['name'] for i in x) 
                                                if isinstance(x, list) else np.nan) 

In [31]:
df.spoken_languages.value_counts(dropna = False).head(20)

English             22395
                     3952
Français             1853
日本語                  1289
Italiano             1218
Español               902
Pусский               807
Deutsch               762
English|Français      681
English|Español       572
हिन्दी                481
English|Deutsch       462
한국어/조선말               425
普通话                   347
English|Italiano      326
svenska               311
No Language           303
suomi                 275
Português             275
Polski                213
Name: spoken_languages, dtype: int64

In [32]:
df.spoken_languages = df.spoken_languages.replace("", np.nan)

In [33]:
df.production_countries = df.production_countries.apply(lambda x: "|".join(i['name'] for i in x) 
                                                        if isinstance(x, list) else np.nan)

In [34]:
df.production_countries.value_counts(dropna = False).head(20)

United States of America                   17851
                                            6282
United Kingdom                              2238
France                                      1654
Japan                                       1356
Italy                                       1030
Canada                                       840
Germany                                      749
India                                        735
Russia                                       735
United Kingdom|United States of America      569
South Korea                                  432
Spain                                        398
Hong Kong                                    365
Canada|United States of America              365
Australia                                    336
Sweden                                       332
Finland                                      271
France|Italy                                 235
Germany|United States of America             214
Name: production_cou

In [35]:
df.production_countries = df.production_countries.replace("", np.nan)

In [36]:
df.production_companies = df.production_companies.apply(lambda x: "|".join(i['name'] for i in x) if isinstance(x, list) else np.nan) 

In [37]:
df.production_companies.value_counts(dropna = False).head(20)

                                          11875
Metro-Goldwyn-Mayer (MGM)                   742
Warner Bros.                                540
Paramount Pictures                          505
Twentieth Century Fox Film Corporation      439
Universal Pictures                          320
RKO Radio Pictures                          247
Columbia Pictures Corporation               207
Columbia Pictures                           146
Mosfilm                                     145
Walt Disney Pictures                         85
Universal International Pictures (UI)        82
New Line Cinema                              75
Walt Disney Productions                      75
Shaw Brothers                                71
Touchstone Pictures                          70
Toho Company                                 65
TriStar Pictures                             62
Orion Pictures                               61
Hammer Film Productions                      60
Name: production_companies, dtype: int64

In [38]:
df.production_companies = df.production_companies.replace("", np.nan)

Keep in mind that missing values do not by themselves mean that the data is not clean.

### Let's Clean the Numerical Columns

df.budget.astype("float") does not work here. If you try it on the budget column, you will see that one data point contains a string that cannot be converted to float. pd.to_numeric with errors="coerce" is a more convenient way to do this transformation.

In [39]:
df.budget = pd.to_numeric(df.budget, errors = "coerce") # invalid parsing will be set as NaN
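
The contrast between the two conversions can be seen on a small made-up series (the values in `raw` are hypothetical, with one entry that is clearly not a number):

```python
import pandas as pd

# Hypothetical budget values read as text, including one non-numeric entry.
raw = pd.Series(["30000000", "0", "not_a_number.jpg", "65000000"])

# astype("float") fails on the first value it cannot convert.
try:
    raw.astype("float")
    astype_ok = True
except ValueError:
    astype_ok = False

# to_numeric with errors="coerce" converts what it can and marks the rest as NaN.
clean = pd.to_numeric(raw, errors="coerce")
```

Here `astype_ok` is `False`, while `clean` is a float series with a single NaN in place of the bad entry.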

In [40]:
df.budget.value_counts()

0.0            36573
5000000.0        286
10000000.0       259
20000000.0       243
2000000.0        242
               ...  
270000000.0        1
923.0              1
72500000.0         1
2160000.0          1
1254040.0          1
Name: budget, Length: 1223, dtype: int64

As we can see, 36573 movies have a budget of 0. This is unrealistic and probably indicates a missing value, so we should replace the 0's with NaN.

In [41]:
df.budget = df.budget.replace(0, np.nan)

To increase readability, we can divide every budget by one million and apply the same process to the revenue column. Lastly, we can rename the columns.

In [42]:
df.budget = df.budget.div(1000000)

In [43]:
df.revenue = df.revenue.replace(0, np.nan)
df.revenue = df.revenue.div(1000000)

In [44]:
df.rename(columns = {"revenue":"revenue_musd", "budget":"budget_musd"}, inplace = True)
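
The budget and revenue steps follow the same recipe: coerce to numeric, treat 0 as missing, scale to millions. A minimal sketch of that recipe as one function is shown below; `to_millions` and the sample frame are hypothetical names and values, not part of the notebook.

```python
import numpy as np
import pandas as pd


def to_millions(series):
    """Coerce to numeric, treat 0 as missing, and express the values in millions."""
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.replace(0, np.nan).div(1_000_000)


# Hypothetical raw values: budget as text, revenue already numeric.
toy = pd.DataFrame({
    "budget": ["30000000", "0", "bad"],
    "revenue": [373554033.0, 0.0, 12000000.0],
})

toy = toy.assign(
    budget_musd=to_millions(toy["budget"]),
    revenue_musd=to_millions(toy["revenue"]),
).drop(columns=["budget", "revenue"])
```

Writing the recipe once keeps the two columns consistent if the rule for zeros or the unit ever changes.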

We should also turn 0's into NaN in the runtime column.

In [46]:
df.runtime = df.runtime.replace(0, np.nan)

The id and popularity columns should be numeric.

In [47]:
df.id = pd.to_numeric(df.id, errors = "coerce")

In [48]:
df.popularity = pd.to_numeric(df.popularity, errors='coerce')

Movies without any votes cannot have a meaningful average rating, so we set vote_average to NaN where vote_count is 0.

In [52]:
df.loc[df.vote_count == 0, "vote_average"] = np.nan

### Let's Clean DateTime Columns

Let's look at some elements of the release_date column.

In [53]:
df.release_date 

0        1995-10-30
1        1995-12-15
2        1995-12-22
3        1995-12-22
4        1995-02-10
            ...    
45461           NaN
45462    2011-11-17
45463    2003-08-01
45464    1917-10-21
45465    2017-06-09
Name: release_date, Length: 45466, dtype: object

In [54]:
df.release_date = pd.to_datetime(df.release_date, errors = "coerce") # unparseable or missing dates become NaT
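
A short sketch of what errors="coerce" does for dates, on a made-up series with one valid date, one missing value, and one string that is not a date:

```python
import numpy as np
import pandas as pd

# Hypothetical release dates as they might appear in the raw file.
raw = pd.Series(["1995-10-30", np.nan, "not a date"])

# Valid strings become timestamps; missing and unparseable entries become NaT.
dates = pd.to_datetime(raw, errors="coerce")

# Once the column is datetime, the .dt accessor is available.
years = dates.dt.year
```

Only the first entry survives as a timestamp; the other two are NaT, which pandas treats as missing.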

### Let's Clean Text/String Columns

In [55]:
df.overview = df.overview.replace("No overview found.", np.nan)

In [56]:
df.overview = df.overview.replace("No Overview", np.nan)

In [57]:
df.overview = df.overview.replace("No movie overview available.", np.nan)

In [58]:
df.overview = df.overview.replace(" ", np.nan)

In [59]:
df.overview = df.overview.replace("No overview yet.", np.nan)

In [60]:
df.tagline = df.tagline.replace("-", np.nan)
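
Since `Series.replace` also accepts a list of values, the overview placeholders above could be handled in one call. The sketch below uses a made-up series; the name `placeholders` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# The placeholder strings replaced one by one in the cells above.
placeholders = [
    "No overview found.",
    "No Overview",
    "No movie overview available.",
    " ",
    "No overview yet.",
]

# Hypothetical overview values: one real text and three placeholders.
toy = pd.Series(["A toy story.", "No Overview", " ", "No overview yet."])

# Every listed placeholder is replaced by NaN in a single pass.
clean = toy.replace(placeholders, np.nan)
```

New placeholders found later only need to be added to the list.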

### Let's Remove Duplicates

In the movies data set, every movie has an id that should be unique. We first drop fully duplicated rows and then rows that share an id.

In [62]:
df.drop_duplicates(inplace = True)

In [64]:
df.drop_duplicates(subset = "id", inplace = True)

In [65]:
df.id.value_counts(dropna = False)

862.0       1
74458.0     1
296206.0    1
107308.0    1
16247.0     1
           ..
44399.0     1
10138.0     1
32084.0     1
42191.0     1
461257.0    1
Name: id, Length: 45434, dtype: int64
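
The difference between the two drop_duplicates calls can be sketched on a made-up frame (all values below are hypothetical). One row is an exact copy, and another shares an id but differs in its title:

```python
import pandas as pd

# Hypothetical data: row 1 copies row 0; row 3 reuses id 2 with another title.
toy = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "title": ["A", "A", "B", "B2"],
})

# Rows that repeat an earlier row in every column.
n_exact = toy.duplicated().sum()

# Rows that repeat an earlier id, whatever the other columns contain.
n_same_id = toy.duplicated(subset="id").sum()

# Keeps the first row for each id (the default keep="first").
dedup = toy.drop_duplicates(subset="id")
```

Checking `duplicated(subset="id")` before dropping shows how many rows would be lost and lets you inspect which version of a movie is kept.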

### Handling Missing Values & Removing Observations


Let's check the number of missing values for every column in the data set.

In [66]:
df.isna().sum()

belongs_to_collection    40946
budget_musd              36554
genres                    2442
id                           1
original_language           11
overview                  1104
popularity                   4
poster_path                386
production_companies     11872
production_countries      6283
release_date                88
revenue_musd             38036
runtime                   1819
spoken_languages          3954
status                      85
tagline                  25037
title                        4
vote_average              2900
vote_count                   4
dtype: int64

In [68]:
df.dropna(subset = ["id", "title"], inplace = True)

Let's count the non-missing values per row.

In [69]:
df.notna().sum(axis = 1).value_counts().sort_values(ascending = False)

15    12522
16    11454
14     5424
17     4265
18     3859
13     3040
12     1891
19     1132
11     1020
10      511
9       184
8       104
7        20
6         4
dtype: int64

We can set a threshold for the number of non-missing values and drop the rows that fall below it. Movies with at least 10 non-missing values seem okay, which is what thresh = 10 keeps.

In [70]:
df.dropna(thresh = 10, inplace = True)
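
To illustrate how `thresh` works, here is a small sketch on a made-up three-column frame. The threshold is the minimum number of non-missing values a row needs in order to be kept:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with 3, 2 and 0 non-missing values.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan],
    "c": [3.0, 3.0, np.nan],
})

# Non-missing values per row, the same count used in the notebook.
counts = toy.notna().sum(axis=1)

# Keep rows with at least 2 non-missing values.
kept = toy.dropna(thresh=2)
```

The row with exactly 2 non-missing values is kept, so the threshold is inclusive.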

### Final Steps

Let's check the status column. We are only interested in movies that have been released.

In [72]:
df.status.value_counts()

Released           44691
Rumored              226
Post Production       98
In Production         20
Planned               15
Canceled               2
Name: status, dtype: int64

In [73]:
df = df.loc[df.status == "Released"].copy()

In [74]:
df.drop(columns = ["status"], inplace = True)

Let's rearrange all the columns by importance.

In [75]:
col = ["id", "title", "tagline", "release_date", "genres", "belongs_to_collection", 
       "original_language", "budget_musd", "revenue_musd", "production_companies",
       "production_countries", "vote_count", "vote_average", "popularity", "runtime",
       "overview", "spoken_languages", "poster_path"]

In [76]:
df = df.loc[:, col]

We can now reset the index and discard the old one.

In [77]:
df.reset_index(drop = True, inplace =True)

Finally, let's turn the poster path into an HTML image tag.

In [78]:
df.poster_path[0]

'/rhIRbceoE9lR4veEXuwCC2wARtG.jpg'

In [79]:
base_poster_url = 'http://image.tmdb.org/t/p/w185/'
df.poster_path = "<img src='" + base_poster_url + df.poster_path + "' style='height:100px;'>"
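
One detail of the string concatenation above is that rows with a missing poster path stay missing instead of turning into a broken tag. The sketch below shows this on a two-element series; the second value is a made-up missing entry.

```python
import numpy as np
import pandas as pd

base_poster_url = 'http://image.tmdb.org/t/p/w185/'

# One real-looking path from the notebook output and one missing value.
paths = pd.Series(['/rhIRbceoE9lR4veEXuwCC2wARtG.jpg', np.nan])

# Element-wise concatenation: missing paths remain missing.
tags = "<img src='" + base_poster_url + paths + "' style='height:100px;'>"
```

In a notebook, such tags would typically be displayed by rendering the frame as HTML without escaping, for example with `IPython.display.HTML(df.to_html(escape=False))`.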

Let's export the almost clean data as a csv file.

In [80]:
df.to_csv("movies_clean.csv", index = False)