# Project 3: Data Cleaning - Tidy up messy Datasets (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 3 on your own. If you need any help or inspiration, have a look at the videos or the Jupyter Notebook with the full code. <br> <br>
Keep in mind that it's all about __getting the right results/conclusions__. It's not about finding identical code. Things can be coded in many different ways. Even if you come to the same conclusions, it's very unlikely that your code will be identical to ours. 

## First Steps 

In [234]:
import pandas as pd
pd.options.display.max_columns = 30

1. __Load__ and __inspect__ the messy dataset __movies_metadata.csv__. Identify columns with nested / stringified JSON data.

In [235]:
df = pd.read_csv('movies_metadata.csv', low_memory=False)


In [236]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re
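Besides reading the `df.info()` output, candidate columns can be flagged programmatically by checking which columns hold strings that start with `[` or `{`. The sketch below runs on a small made-up frame (`toy`), not on the real file, and the helper name `looks_nested` is hypothetical.

```python
import pandas as pd

# Toy stand-in for movies_metadata.csv (values are illustrative only)
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "genres": ["[{'id': 16, 'name': 'Animation'}]", "[{'id': 12, 'name': 'Adventure'}]"],
    "belongs_to_collection": ["{'id': 10194, 'name': 'Toy Story Collection'}", None],
    "budget": ["30000000", "65000000"],
})

def looks_nested(col: pd.Series) -> bool:
    """Return True if any non-missing value starts with '[' or '{'."""
    strings = col.dropna().astype(str).str.strip()
    return bool(strings.str[:1].isin(["[", "{"]).any())

nested_cols = [c for c in toy.columns if looks_nested(toy[c])]
print(nested_cols)
```

This is only a heuristic: a free-text column whose value happens to start with a bracket would be flagged as well, so the result still needs a manual check.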

In [237]:
df.belongs_to_collection[0]

"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}"

In [238]:
df.genres[0]

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

## Dropping irrelevant Columns

2. __Drop__ the irrelevant columns 'adult', 'imdb_id', 'original_title', 'video' and 'homepage'.

In [239]:
# Drop all irrelevant columns in a single call
df.drop(columns=['adult', 'imdb_id', 'original_title', 'video', 'homepage'], inplace=True)

## How to handle stringified JSON columns

3. __Evaluate__ Python Expressions in the stringified columns ["belongs_to_collection", "genres", "production_countries", "production_companies", "spoken_languages"] and __remove quotes__ ("") where possible.

In [240]:
import ast
import numpy as np

In [241]:
json_cols = ["belongs_to_collection", "genres", "production_countries", "production_companies", "spoken_languages"]

# Parse each stringified Python literal; non-string values (missing data) become NaN
for col in json_cols:
    df[col] = df[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [242]:
df.belongs_to_collection[0]

{'id': 10194,
 'name': 'Toy Story Collection',
 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg',
 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}

In [243]:
df.genres[0]

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

In [244]:
df.production_countries[0]

[{'iso_3166_1': 'US', 'name': 'United States of America'}]

In [245]:
df.production_companies[0]

[{'name': 'Pixar Animation Studios', 'id': 3}]

In [246]:
df.spoken_languages[0]

[{'iso_639_1': 'en', 'name': 'English'}]
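Although these columns are described as stringified JSON, the strings use single quotes, which makes them Python literals rather than valid JSON. That is why `ast.literal_eval` is used here instead of `json.loads`. A minimal sketch of the difference, using a shortened copy of the first `genres` value:

```python
import ast
import json

raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# json.loads rejects the string because JSON requires double-quoted keys and strings
try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval parses it as a Python literal (a list of dicts)
parsed = ast.literal_eval(raw)

print(parsed_as_json)      # False
print(parsed[0]["name"])   # Animation
```

`ast.literal_eval` only evaluates literals (strings, numbers, lists, dicts and so on), so it is a safer choice than `eval` for data read from a file.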

## How to flatten nested Columns

4. __Extract__ only the __collection name__ from the column "belongs_to_collection" and __overwrite__ "belongs_to_collection". <br> For example: The value in the first row (Toy Story) should be 'Toy Story Collection'.

In [247]:
df.belongs_to_collection = df.belongs_to_collection.apply(lambda x: x['name'] if isinstance(x,dict) else np.nan)

In [248]:
df.belongs_to_collection.value_counts(dropna=False).head(20)

NaN                                       40975
The Bowery Boys                              29
Totò Collection                              27
James Bond Collection                        26
Zatôichi: The Blind Swordsman                26
The Carry On Collection                      25
Pokémon Collection                           22
Charlie Chan (Sidney Toler) Collection       21
Godzilla (Showa) Collection                  16
Uuno Turhapuro                               15
Dragon Ball Z (Movie) Collection             15
Charlie Chan (Warner Oland) Collection       15
The Land Before Time Collection              14
Monster High Collection                      14
Sharpe Collection                            13
George Carlin Comedy Collection              13
Johan Falk GSI Collection                    12
Sherlock Holmes (1939 series)                12
Friday the 13th Collection                   12
The Amityville Horror Collection             12
Name: belongs_to_collection, dtype: int64

5. __Extract__ all __genre names__ from the column "genres" and __overwrite__ "genres". If a movie has more than one genre, __separate genres by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Animation|Comedy|Family'.

In [249]:
df.genres = df.genres.apply(lambda x: '|'.join(i['name'] for i in x) if isinstance(x, list) else np.nan)

In [250]:
df.genres = df.genres.replace("", np.nan)

In [251]:
df.genres[0]

'Animation|Comedy|Family'

6. __Extract__ all __spoken language names__ from the column "spoken_languages" and __overwrite__ "spoken_languages". If a movie has more than one spoken language, __separate spoken languages by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'English'.

In [252]:
df.spoken_languages = df.spoken_languages.apply(lambda x: '|'.join(i['name'] for i in x) if isinstance(x,list) else np.nan)

In [253]:
df.spoken_languages = df.spoken_languages.replace("", np.nan)

In [254]:
df.spoken_languages[0]

'English'

7. __Extract__ all __production country names__ from the column "production_countries" and __overwrite__ "production_countries". If a movie has more than one production country, __separate production countries by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'United States of America'.

In [255]:
df.production_countries = df.production_countries.apply(lambda x: '|'.join(i['name'] for i in x) if isinstance(x,list) else np.nan)

In [256]:
df.production_countries = df.production_countries.replace("", np.nan)

In [257]:
df.production_countries[0]

'United States of America'

8. __Extract__ all __production company names__ from the column "production_companies" and __overwrite__ "production_companies". If a movie has more than one production company, __separate production companies by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Pixar Animation Studios'

In [258]:
df.production_companies = df.production_companies.apply(lambda x: '|'.join(i['name'] for i in x) if isinstance(x,list) else np.nan)


In [259]:
df.production_companies = df.production_companies.replace("", np.nan)

In [260]:
df.production_companies[0]

'Pixar Animation Studios'
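Steps 5 to 8 repeat the same pattern: join the `name` values with a pipe, then turn empty strings into NaN. One possible way to express this once is a small helper. The function name `join_names` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def join_names(items, sep="|"):
    """Join the 'name' values of a list of dicts; return NaN for empty or non-list input."""
    if not isinstance(items, list):
        return np.nan
    names = [d["name"] for d in items if isinstance(d, dict) and "name" in d]
    return sep.join(names) if names else np.nan

# Toy frame with already-parsed (not stringified) values
toy = pd.DataFrame({
    "genres": [[{"id": 16, "name": "Animation"}, {"id": 35, "name": "Comedy"}], []],
    "spoken_languages": [[{"iso_639_1": "en", "name": "English"}], np.nan],
})

for col in ["genres", "spoken_languages"]:
    toy[col] = toy[col].apply(join_names)

print(toy)
```

Because the helper returns NaN directly for empty lists, the separate `replace("", np.nan)` step is not needed with this variant.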

9. __Inspect__ all columns above with value_counts(). Do you see anything strange? __Take reasonable measures__!

In [261]:
df.isna().sum()

belongs_to_collection    40975
budget                       0
genres                    2442
id                           0
original_language           11
overview                   954
popularity                   5
poster_path                386
production_companies     11881
production_countries      6288
release_date                87
revenue                      6
runtime                    263
spoken_languages          3958
status                      87
tagline                  25054
title                        6
vote_average                 6
vote_count                   6
dtype: int64

## Cleaning Numerical Columns

10. __Convert__ the datatype in the columns __"budget"__, __"id"__ and __"popularity"__ __to numeric__. Set invalid values as NaN.

In [262]:
df.budget = pd.to_numeric(df.budget, errors='coerce')
df.id = pd.to_numeric(df.id, errors='coerce')
df.popularity = pd.to_numeric(df.popularity, errors='coerce')
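To see which raw values `errors='coerce'` turns into NaN, the column can be compared before and after conversion. The values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up raw values: two valid numbers, one invalid string, one missing value
budget_raw = pd.Series(["30000000", "0", "/some_poster_path.jpg", None])
budget_num = pd.to_numeric(budget_raw, errors="coerce")

# Values that were present before conversion but are NaN afterwards were invalid
invalid = budget_raw[budget_raw.notna() & budget_num.isna()]

print(budget_num)
print(invalid.tolist())
```

Inspecting `invalid` before overwriting the column shows whether the coerced rows are genuinely broken records or a formatting issue that could be repaired instead.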

11. __Analyze__ the columns __"budget"__, __"revenue"__ and __"runtime"__. Look at movies with a budget/revenue/runtime of 0. Do you think the value 0 is the most appropriate value? __Take reasonable measures__! 

In [263]:
df.budget = df.budget.replace(0, np.nan)
df.revenue = df.revenue.replace(0, np.nan)
df.runtime = df.runtime.replace(0, np.nan)
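Before replacing the zeros, it can help to quantify how common they are. A minimal sketch on made-up numbers (the toy frame is an assumption, not real data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "budget": [30000000.0, 0.0, 0.0, 65000000.0],
    "revenue": [373554033.0, 0.0, 12000.0, 0.0],
    "runtime": [81.0, 0.0, 104.0, 101.0],
})

cols = ["budget", "revenue", "runtime"]

# Share of rows per column where the value is exactly 0
zero_share = toy[cols].eq(0).mean()
print(zero_share)

# Treat 0 as "unknown" and replace it with NaN in all three columns at once
toy[cols] = toy[cols].replace(0, np.nan)
print(toy.isna().sum())
```

A large share of zeros supports the reading that 0 stands for "unknown" rather than for a true budget, revenue or runtime of zero.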

12. The columns "budget" and "revenue" shall show values in Million USD. __Convert and Overwrite__!

In [264]:
df.budget = df.budget.div(1000000)
df.revenue = df.revenue.div(1000000)

13. __Analyze__ movies with a __vote_count of 0__. What's the __vote_average__ for those movies? Do you think this value is the most appropriate value? __Take reasonable measures__!

In [265]:
df.loc[df.vote_count == 0, "vote_average"] = np.nan

In [266]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   belongs_to_collection  4491 non-null   object 
 1   budget                 8890 non-null   float64
 2   genres                 43024 non-null  object 
 3   id                     45463 non-null  float64
 4   original_language      45455 non-null  object 
 5   overview               44512 non-null  object 
 6   popularity             45460 non-null  float64
 7   poster_path            45080 non-null  object 
 8   production_companies   33585 non-null  object 
 9   production_countries   39178 non-null  object 
 10  release_date           45379 non-null  object 
 11  revenue                7408 non-null   float64
 12  runtime                43645 non-null  float64
 13  spoken_languages       41508 non-null  object 
 14  status                 45379 non-null  object 
 15  ta

## Cleaning DateTime Columns

14. __Convert__ the datatype in the column __"release_date"__ __to datetime__. Set invalid values to NaT (the datetime equivalent of NaN).

In [267]:
df.release_date = pd.to_datetime(df.release_date, errors='coerce')
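With `errors='coerce'`, values that cannot be parsed become `NaT`, which `isna()` treats as missing. A small sketch with made-up values:

```python
import pandas as pd

# One valid date, one invalid string, one missing value (illustrative only)
dates_raw = pd.Series(["1995-10-30", "not a date", None])
dates = pd.to_datetime(dates_raw, errors="coerce")

print(dates)
print(dates.isna().tolist())   # [False, True, True]
```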

## Cleaning Text / String Columns

15. __Analyze__ the text columns "overview" and "tagline". Try to identify __missing data that is not represented by NaN__ (e.g. "No Data"). __Replace it with NaN__ (np.nan)!

In [268]:
df.overview.value_counts(dropna = False).head(20)

NaN                   954
No overview found.

In [269]:
# Replace all placeholder strings that stand for a missing overview in one call
df.overview = df.overview.replace(["No overview found.", "No Overview", " ", "No movie overview available."], np.nan)

In [270]:
df.tagline.value_counts(dropna = False).head(20)

NaN                                                           25054
Based on a true story.                                            7
Trust no one.                                                     4
Be careful what you wish for.                                     4
-                                                                 4
Classic Albums                                                    3
Some doors should never be opened.                                3
A Love Story                                                      3
Drama                                                             3
Know Your Enemy                                                   3
Which one is the first to return - memory or the murderer?        3
How far would you go?                                             3
The end is near.                                                  3
There is no turning back                                          3
There are two sides to every love story.        

In [271]:
df.tagline = df.tagline.replace("-", np.nan)
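Scanning `value_counts()` finds frequent placeholders, but rare ones can slip through. One possible complement is to flag values that contain no word characters at all. The series and the `placeholders` list below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

tagline = pd.Series(["Trust no one.", "-", " ", "No Data", np.nan, "A Love Story"])

# Candidates: values without any letter or digit after stripping whitespace
stripped = tagline.dropna().str.strip()
no_content = stripped[~stripped.str.contains(r"\w", regex=True)]
print(no_content.tolist())   # ['-', '']

# Replace a list of known placeholders in one call
placeholders = ["-", " ", "No Data"]
tagline_clean = tagline.replace(placeholders, np.nan)
print(tagline_clean.isna().sum())   # 4
```

Text placeholders such as "No Data" still have to be spotted by eye, since they look like ordinary strings to a pattern check.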

## Removing Duplicates

16. __Identify__ and __remove__ duplicates!

In [272]:
df[df.duplicated(keep =  False)].sort_values(by = "id")

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
7345,,,Crime|Drama|Thriller,5511.0,fr,Hitman Jef Costello is a perfectionist who alw...,9.091288,/cvNW8IXigbaMNo4gKEIps0NGnhA.jpg,Fida cinematografica|Compagnie Industrielle et...,France|Italy,1967-10-25,0.039481,105.0,Français,Released,There is no solitude greater than that of the ...,Le Samouraï,7.9,187.0
9165,,,Crime|Drama|Thriller,5511.0,fr,Hitman Jef Costello is a perfectionist who alw...,9.091288,/cvNW8IXigbaMNo4gKEIps0NGnhA.jpg,Fida cinematografica|Compagnie Industrielle et...,France|Italy,1967-10-25,0.039481,105.0,Français,Released,There is no solitude greater than that of the ...,Le Samouraï,7.9,187.0
24844,,,Comedy|Drama,11115.0,en,As an ex-gambler teaches a hot-shot college ki...,6.880365,/kHaBqrrozaG7rj6GJg3sUCiM29B.jpg,Andertainment Group|Crescent City Pictures|Tag...,United States of America,2008-01-29,,85.0,English,Released,,Deal,5.2,22.0
14012,,,Comedy|Drama,11115.0,en,As an ex-gambler teaches a hot-shot college ki...,6.880365,/kHaBqrrozaG7rj6GJg3sUCiM29B.jpg,Andertainment Group|Crescent City Pictures|Tag...,United States of America,2008-01-29,,85.0,English,Released,,Deal,5.2,22.0
22151,,,Action|Horror|Science Fiction,18440.0,en,When a comet strikes Earth and kicks up a clou...,1.436085,/tWCyKXHuSrQdLAvNeeVJBnhf1Yv.jpg,,United States of America,2007-01-01,,89.0,English,Released,,Days of Darkness,5.0,5.0
14000,,,Action|Horror|Science Fiction,18440.0,en,When a comet strikes Earth and kicks up a clou...,1.436085,/tWCyKXHuSrQdLAvNeeVJBnhf1Yv.jpg,,United States of America,2007-01-01,,89.0,English,Released,,Days of Darkness,5.0,5.0
8068,,,Adventure|Animation|Drama|Action|Foreign,23305.0,en,"In feudal India, a warrior (Khan) who renounce...",1.967992,/9GlrmbZO7VGyqhaSR1utinRJz3L.jpg,Filmfour,France|Germany|India|United Kingdom,2001-09-23,,86.0,हिन्दी,Released,,The Warrior,6.3,15.0
9327,,,Adventure|Animation|Drama|Action|Foreign,23305.0,en,"In feudal India, a warrior (Khan) who renounce...",1.967992,/9GlrmbZO7VGyqhaSR1utinRJz3L.jpg,Filmfour,France|Germany|India|United Kingdom,2001-09-23,,86.0,हिन्दी,Released,,The Warrior,6.3,15.0
17229,,,Drama,25541.0,da,Former Danish servicemen Lars and Jimmy are th...,2.587911,/q19Q5BRZpMXoNCA4OYodVozfjUh.jpg,,Sweden|Denmark,2009-10-21,,90.0,Dansk,Released,,Brotherhood,7.1,21.0
23044,,,Drama,25541.0,da,Former Danish servicemen Lars and Jimmy are th...,2.587911,/q19Q5BRZpMXoNCA4OYodVozfjUh.jpg,,Sweden|Denmark,2009-10-21,,90.0,Dansk,Released,,Brotherhood,7.1,21.0


In [273]:
df.drop_duplicates(inplace = True)

In [274]:
df[df.duplicated(subset='id', keep =  False)].sort_values(by = "id")

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
33826,,30.0,Comedy|Crime|Drama|Romance|Thriller,4912.0,en,"Television made him famous, but his biggest hi...",7.645827,/o3Im9nPLAgtlw1j2LtpMebAotSe.jpg,Miramax Films|Allied Filmmakers|Mad Chance,United States of America,2002-12-30,33.013805,113.0,English,Released,Some things are better left top secret.,Confessions of a Dangerous Mind,6.6,281.0
5865,,30.0,Comedy|Crime|Drama|Romance|Thriller,4912.0,en,"Television made him famous, but his biggest hi...",11.331072,/o3Im9nPLAgtlw1j2LtpMebAotSe.jpg,Miramax Films|Allied Filmmakers|Mad Chance,United States of America,2002-12-30,33.013805,113.0,English,Released,Some things are better left top secret.,Confessions of a Dangerous Mind,6.6,281.0
4114,Pokémon Collection,16.0,Adventure|Fantasy|Animation|Action|Family,10991.0,ja,When Molly Hale's sadness of her father's disa...,10.264597,/5ILjS6XB5deiHop8SXPsYxXWVPE.jpg,TV Tokyo|4 Kids Entertainment|Nintendo|Pikachu...,Japan,2000-07-08,68.411275,93.0,English,Released,Pokémon: Spell of the Unknown,Pokémon: Spell of the Unknown,6.0,143.0
44821,Pokémon Collection,16.0,Adventure|Fantasy|Animation|Action|Family,10991.0,ja,When Molly Hale's sadness of her father's disa...,6.480376,/5ILjS6XB5deiHop8SXPsYxXWVPE.jpg,TV Tokyo|4 Kids Entertainment|Nintendo|Pikachu...,Japan,2000-07-08,68.411275,93.0,English,Released,Pokémon: Spell of the Unknown,Pokémon: Spell of the Unknown,6.0,144.0
44826,Pokémon Collection,,Adventure|Fantasy|Animation|Science Fiction|Fa...,12600.0,ja,"All your favorite Pokémon characters are back,...",6.080108,/bqL0PVHbQ8Jmw3Njcl38kW0CoeM.jpg,,Japan|United States of America,2001-07-06,28.023563,75.0,日本語,Released,,Pokémon 4Ever: Celebi - Voice of the Forest,5.7,82.0
5535,Pokémon Collection,,Adventure|Fantasy|Animation|Science Fiction|Fa...,12600.0,ja,"All your favorite Pokémon characters are back,...",7.072301,/bqL0PVHbQ8Jmw3Njcl38kW0CoeM.jpg,,Japan|United States of America,2001-07-06,28.023563,75.0,日本語,Released,,Pokémon 4Ever: Celebi - Voice of the Forest,5.7,82.0
15765,,0.0025,Drama|Comedy|Foreign,13209.0,fa,"Since women are banned from soccer matches, Ir...",1.529879,/nfkOkpudNNIjRrf0mTFVoiGzHyc.jpg,Jafar Panahi Film Productions,Iran,2006-05-26,,93.0,فارسی,Released,,Offside,6.7,27.0
11342,,0.0025,Drama|Comedy|Foreign,13209.0,fa,"Since women are banned from soccer matches, Ir...",1.52896,/nfkOkpudNNIjRrf0mTFVoiGzHyc.jpg,Jafar Panahi Film Productions,Iran,2006-05-26,,93.0,فارسی,Released,,Offside,6.7,27.0
10419,,1.6,Drama|Crime|Mystery,14788.0,en,Set against the backdrop of a decaying Midwest...,3.185256,/w56oo9nREcF54sNXVYuE9QxZFjT.jpg,Magnolia Pictures|Extension 765,United States of America,2005-09-03,,73.0,English,Released,,Bubble,6.4,36.0
12066,,1.6,Drama|Crime|Mystery,14788.0,en,Set against the backdrop of a decaying Midwest...,3.008299,/w56oo9nREcF54sNXVYuE9QxZFjT.jpg,Magnolia Pictures|Extension 765,United States of America,2005-09-03,,73.0,English,Released,,Bubble,6.4,36.0


In [275]:
df.drop_duplicates(subset='id', inplace = True)
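`drop_duplicates(subset='id')` keeps the first occurrence of each id, whichever row that happens to be. If a specific record should survive, for example the one with the higher `vote_count`, the frame can be sorted first. This is an alternative sketch, not the approach used above; the toy frame reuses three ids and vote counts from the output shown.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [10991, 10991, 4912],
    "title": ["Pokémon: Spell of the Unknown"] * 2 + ["Confessions of a Dangerous Mind"],
    "vote_count": [143.0, 144.0, 281.0],
})

# Default behaviour: keep the first occurrence of each id
first_kept = toy.drop_duplicates(subset="id")

# Alternative: keep the row with the highest vote_count per id,
# then restore the original row order
most_votes = (toy.sort_values("vote_count", ascending=False)
                 .drop_duplicates(subset="id")
                 .sort_index())

print(first_kept.vote_count.tolist())   # [143.0, 281.0]
print(most_votes.vote_count.tolist())   # [144.0, 281.0]
```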

## Handling Missing Values & Removing Observations

17. __Drop__ all rows/movies with unknown __id__ or __title__.

In [276]:
df.isna().sum()

belongs_to_collection    40946
budget                   36554
genres                    2442
id                           1
original_language           11
overview                  1102
popularity                   4
poster_path                386
production_companies     11872
production_countries      6283
release_date                88
revenue                  38036
runtime                   1819
spoken_languages          3954
status                      85
tagline                  25037
title                        4
vote_average              2900
vote_count                   4
dtype: int64

In [277]:
df[df.title.isna()]

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
19729,,,Action|Thriller|Drama,82663.0,en,British soldiers force a recently captured IRA...,,,,,NaT,,,,,,,,
19730,,,Carousel Productions|Vision View Entertainment...,,104.0,Released,,Midnight Man,,,NaT,,,,,,,,
29502,Mardock Scramble Collection,,Animation|Science Fiction,122662.0,ja,Third film of the Mardock Scramble series.,,,,,NaT,,,,,,,,
35586,,,TV Movie|Action|Horror|Science Fiction,249260.0,en,A group of skiers are terrorized during spring...,,,,,NaT,,,,,,,,


In [278]:
df.dropna(subset = ["id", "title"], inplace = True)

In [279]:
df.id = df.id.astype('int')

18. __Keep__ only those rows/movies in the df with __10 or more non-NaN__ values.

In [280]:
df.notna().sum(axis = 1).value_counts().sort_values(ascending = False)

15    12522
16    11454
14     5424
17     4265
18     3859
13     3041
12     1890
19     1132
11     1020
10      512
9       183
8       104
7        20
6         4
dtype: int64

In [281]:
df[df.notna().sum(axis = 1) == 7]

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
2140,,,,77314,fr,,0.0375,/8prmlT6iOYl3zFsDDJl9oDMUGeD.jpg,,,1991-12-04,,,,,,The Cabinet of Dr. Ramirez,,0.0
4130,,,Drama|Thriller|Romance,109472,en,,0.001653,,,,2001-06-06,,,,,,The Girl,,0.0
14890,,,,174748,no,,0.0,,,,1984-12-30,,,,Released,,Lars i porten,,0.0
18572,,,Documentary,404471,fi,,0.0,,,,NaT,,,,Released,,Pölynimurikauppiaat,,0.0
19955,,,,397339,en,Black and White,0.0,,,,NaT,,,,Released,,The Awful Truth,,0.0
20301,,,,367678,en,American Documentary,0.0,,,,NaT,,,,Released,,Enola Gay and the Atomic Bombing of Japan,,0.0
22798,,,,158517,en,On her way home from an evening shift at a men...,0.000143,,,,NaT,,,,Released,,Lain ulkopuolella,,0.0
24157,,,,287831,en,"Harry Raymond, a foreign ambassador in Moscow,...",0.0,,,,NaT,,,,Released,,External Affairs,,0.0
29309,,,,335141,fr,,0.001648,,,,1998-01-01,,,,Released,,Bob le magnifique,,0.0
35652,,,,374698,nl,,0.00129,,,,2001-10-24,,,,Released,,Vallen,,0.0


In [282]:
df.dropna(thresh = 10, inplace = True)

In [283]:
df.notna().sum(axis = 1).value_counts().sort_values(ascending = False)

15    12522
16    11454
14     5424
17     4265
18     3859
13     3041
12     1890
19     1132
11     1020
10      512
dtype: int64
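The `thresh` parameter of `dropna` is the minimum number of non-NaN values a row needs in order to be kept. A small sketch on a made-up frame shows that it matches a boolean mask built from the per-row count:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, np.nan, 3],
    "b": [np.nan, np.nan, 6],
    "c": [7, 8, 9],
})

# thresh=2 keeps rows with at least 2 non-NaN values
kept = toy.dropna(thresh=2)

# Equivalent boolean mask built from the per-row non-NaN count
mask = toy.notna().sum(axis=1) >= 2

print(kept.index.tolist())        # [0, 2]
print(kept.equals(toy[mask]))     # True
```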

## Final (Cleaning) Steps

19. __Keep__ only those rows/movies in the df with __status "Released"__. Then __drop__ the column "status".

In [284]:
df.status.value_counts()

Released           44692
Rumored              226
Post Production       98
In Production         20
Planned               15
Canceled               2
Name: status, dtype: int64

In [287]:
df = df.loc[df.status == "Released"].copy()

In [288]:
df.drop(columns = ["status"], inplace = True)

20. The order of the columns should be as follows (with "budget" and "revenue" renamed to "budget_musd" and "revenue_musd"): 

In [289]:
columns = ["id", "title", "tagline", "release_date", "genres", "belongs_to_collection", 
"original_language", "budget_musd", "revenue_musd", "production_companies",
"production_countries", "vote_count", "vote_average", "popularity", "runtime",
"overview", "spoken_languages", "poster_path"]

In [291]:
df.rename(columns = {"revenue":"revenue_musd", "budget":"budget_musd"}, inplace = True)

In [292]:
df = df.loc[:,columns]

21. __Reset__ the Index and create a __RangeIndex__.

In [293]:
df.reset_index(drop=True, inplace=True)

22. __Save__ the cleaned dataset in a __csv-file__.

In [294]:
df.to_csv('movies_clean.csv', index=False)
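A CSV file does not store dtypes, so a datetime column is read back as plain text unless pandas is told to parse it. The sketch below uses an in-memory buffer and a two-row toy frame instead of the real file.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "id": [862, 8844],
    "release_date": pd.to_datetime(["1995-10-30", "1995-12-15"]),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without parse_dates the column is read back as strings
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates the datetime dtype is restored
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["release_date"])

print(plain.release_date.dtype, parsed.release_date.dtype)
```

The same `parse_dates=["release_date"]` argument can be passed when reading `movies_clean.csv` in a later project.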

In [296]:
pd.read_csv('movies_clean.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44692 entries, 0 to 44691
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     44692 non-null  int64  
 1   title                  44692 non-null  object 
 2   tagline                20284 non-null  object 
 3   release_date           44658 non-null  object 
 4   genres                 42587 non-null  object 
 5   belongs_to_collection  4463 non-null   object 
 6   original_language      44682 non-null  object 
 7   budget_musd            8854 non-null   float64
 8   revenue_musd           7385 non-null   float64
 9   production_companies   33356 non-null  object 
 10  production_countries   38835 non-null  object 
 11  vote_count             44692 non-null  float64
 12  vote_average           42077 non-null  float64
 13  popularity             44692 non-null  float64
 14  runtime                43179 non-null  float64
 15  ov