# Project 3: Data Cleaning - Tidy up messy Datasets (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 3 on your own. If you need help or inspiration, have a look at the videos or the Jupyter Notebook with the full code. <br> <br>
Keep in mind that it's all about __getting the right results/conclusions__, not about writing identical code. Things can be coded in many different ways, so even if you reach the same conclusions, your code will very likely differ from ours.

## First Steps 

1. __Load__ and __inspect__ the messy dataset __movies_metadata.csv__. Identify columns with nested / stringified JSON data.

In [118]:
import pandas as pd
import matplotlib.pyplot as plt 
import numpy as np
import json, ast
pd.set_option('display.max_rows', 20)
df = pd.read_csv("movies_metadata.csv", low_memory=False)
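
One possible way to spot the stringified columns during inspection is to check which columns contain strings that begin with `[` or `{`. The sketch below runs on a small made-up frame (the `sample` data and the helper name `looks_nested` are assumptions for illustration, not part of the project code); the same check could be applied to `df` after loading.

```python
import pandas as pd

# Toy frame standing in for movies_metadata.csv (values are illustrative only).
sample = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "genres": ["[{'id': 16, 'name': 'Animation'}]", "[{'id': 12, 'name': 'Adventure'}]"],
    "belongs_to_collection": ["{'id': 1, 'name': 'Toy Story Collection'}", None],
    "budget": ["30000000", "65000000"],
})

def looks_nested(series):
    """Return True if any non-missing value, read as text, starts with '[' or '{'."""
    text = series.dropna().astype(str).str.strip()
    # str.match anchors the pattern at the start of each string.
    return bool(text.str.match(r"[\[{]").any())

nested_cols = [col for col in sample.columns if looks_nested(sample[col])]
print(nested_cols)
```

This is only a heuristic: a free-text column whose value happens to start with a bracket would also be flagged, so the result is a shortlist to check by eye.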


## Dropping irrelevant Columns

2. __Drop__ the irrelevant columns 'adult', 'imdb_id', 'original_title', 'video' and 'homepage'.

In [119]:
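# Drops three of the five columns named in the task; "video" and "homepage"
# are removed later, when the columns are selected and reordered in step 20.
df.drop(columns=["adult", "imdb_id", "original_title"], inplace=True)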

In [120]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   belongs_to_collection  4494 non-null   object 
 1   budget                 45466 non-null  object 
 2   genres                 45466 non-null  object 
 3   homepage               7782 non-null   object 
 4   id                     45466 non-null  object 
 5   original_language      45455 non-null  object 
 6   overview               44512 non-null  object 
 7   popularity             45461 non-null  object 
 8   poster_path            45080 non-null  object 
 9   production_companies   45463 non-null  object 
 10  production_countries   45463 non-null  object 
 11  release_date           45379 non-null  object 
 12  revenue                45460 non-null  float64
 13  runtime                45203 non-null  float64
 14  spoken_languages       45460 non-null  object 
 15  st

## How to handle stringified JSON columns

3. __Evaluate__ Python Expressions in the stringified columns ["belongs_to_collection", "genres", "production_countries", "production_companies", "spoken_languages"] and __remove quotes__ ("") where possible.

In [121]:
def convert_json(a):
    if isinstance(a, str):
        return ast.literal_eval(a)
    return np.nan
cols = ["belongs_to_collection", "genres", "production_countries", "production_companies","spoken_languages"]
for col in cols:
    df[col] = df[col].apply(func = convert_json)
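
The cell above uses `ast.literal_eval` rather than `json.loads`. A likely reason is that the stored strings are Python literals with single quotes, which are not valid JSON. The sketch below illustrates this on a made-up string and adds a hypothetical `safe_eval` wrapper (not part of the project code) that returns NaN instead of raising when a value cannot be parsed.

```python
import ast
import json

# Illustrative value in the same style as the stringified columns.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# JSON requires double-quoted strings, so json.loads rejects this text.
try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval parses Python literals (lists, dicts, strings, numbers).
parsed = ast.literal_eval(raw)

def safe_eval(value):
    """Parse a Python literal; return NaN for non-strings or unparsable text."""
    if not isinstance(value, str):
        return float("nan")
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return float("nan")

print(parsed_as_json)
print(parsed[0]["name"])
```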

## How to flatten nested Columns

4. __Extract__ only the __collection name__ from the column "belongs_to_collection" and __overwrite__ "belongs_to_collection". <br> For example: The value in the first row (Toy Story) should be 'Toy Story Collection'.

In [None]:
df.belongs_to_collection = df.belongs_to_collection.apply(lambda x : x["name"] if isinstance(x, dict) else np.nan)

5. __Extract__ all __genre names__ from the column "genres" and __overwrite__ "genres". If a movie has more than one genre, __separate genres by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Animation|Comedy|Family'.

In [132]:
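def convert_genres(x):
    # Each value is a list of dicts such as {'id': ..., 'name': ...}.
    if isinstance(x, list):
        ls = [i["name"] for i in x]
        return "|".join(ls)  # an empty list becomes an empty string
    return np.nan
df.genres = df.genres.apply(convert_genres)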
df.genres.value_counts()

Drama                              5000
Comedy                             3621
Documentary                        2723
                                   2442
Drama|Romance                      1301
                                   ... 
Action|Drama|Comedy|Documentary       1
War|Drama|History|Thriller            1
Horror|Drama|History|Thriller         1
Comedy|Crime|Action|Drama             1
Family|Animation|Romance|Comedy       1
Name: genres, Length: 4069, dtype: int64
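
The unlabeled entry in the output above is consistent with empty lists being joined into an empty string. If those movies should count as missing instead, a helper along the following lines could be used. This is a minimal sketch: the name `join_values` and its `key`/`sep` parameters are assumptions for illustration, and the toy Series is made up.

```python
import numpy as np
import pandas as pd

def join_values(items, key="name", sep="|"):
    """Join one field of a list of dicts; empty lists and non-lists become NaN."""
    if not isinstance(items, list) or len(items) == 0:
        return np.nan
    return sep.join(str(item[key]) for item in items
                    if isinstance(item, dict) and key in item)

# Toy data: a normal list, an empty list, and a missing value.
genres = pd.Series([
    [{"id": 16, "name": "Animation"}, {"id": 35, "name": "Comedy"}],
    [],
    np.nan,
])
flat = genres.apply(join_values)
print(flat)
```

Because the field name is a parameter, the same helper could also serve the other list-valued columns.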

6. __Extract__ all __spoken language names__ from the column "spoken_languages" and __overwrite__ "spoken_languages". If a movie has more than one spoken language, __separate spoken languages by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'English'.

In [124]:
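# Step 4 again, written as a named function instead of a lambda.
# Run either this cell or the lambda cell above, not both: after the first
# pass the values are plain strings, and a second pass would turn them into NaN.
def convert_collection(x):
    if isinstance(x, dict):
        return x["name"]
    return np.nan

df.belongs_to_collection = df.belongs_to_collection.apply(convert_collection)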
df.belongs_to_collection.value_counts()

The Bowery Boys                  29
Totò Collection                  27
James Bond Collection            26
Zatôichi: The Blind Swordsman    26
The Carry On Collection          25
                                 ..
Glass Tiger collection            1
Kathleen Madigan Collection       1
The Big Bottom Box                1
Joséphine - Saga                  1
Red Lotus Collection              1
Name: belongs_to_collection, Length: 1695, dtype: int64

In [125]:
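def convert_language(x):
    if isinstance(x, list):
        # Uses the ISO 639-1 code (e.g. "en"); use i["name"] instead to get
        # the language name the task asks for (e.g. "English").
        ls = [i["iso_639_1"] for i in x]
        return "|".join(ls)
    return np.nan
df.spoken_languages = df.spoken_languages.apply(convert_language)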
df.spoken_languages.value_counts(dropna = False)

en                22395
                   3829
fr                 1853
ja                 1289
it                 1218
                  ...  
en|hu|sh              1
en|de|pl|es           1
la|en|fr|ja|zh        1
en|fr|ar|ru|es        1
ff|en                 1
Name: spoken_languages, Length: 1932, dtype: int64

7. __Extract__ all __production countries names__ from the column "production_countries" and __overwrite__ "production_countries". If a movie has more than one production country, __separate production countries by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'United States of America'.

In [126]:
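def convert_country(x):
    if isinstance(x, list):
        # Uses the ISO 3166-1 code (e.g. "US"); use i["name"] instead to get
        # the country name the task asks for (e.g. "United States of America").
        ls = [i["iso_3166_1"] for i in x]
        return "|".join(ls)
    return np.nan
df.production_countries = df.production_countries.apply(convert_country)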
df.production_countries.value_counts(dropna = False)

US             17851
                6282
GB              2238
FR              1654
JP              1356
               ...  
RO|GB|CA           1
FI|DE|NL           1
FR|DK|ES|SE        1
FR|US|CA           1
EG|IT|US           1
Name: production_countries, Length: 2391, dtype: int64

8. __Extract__ all __production companies names__ from the column "production_companies" and __overwrite__ "production_companies". If a movie has more than one production company, __separate production companies by a pipe__ "|".<br>
For example: The value in the first row (Toy Story) should be 'Pixar Animation Studios'.

In [127]:
df.production_companies

0           [{'name': 'Pixar Animation Studios', 'id': 3}]
1        [{'name': 'TriStar Pictures', 'id': 559}, {'na...
2        [{'name': 'Warner Bros.', 'id': 6194}, {'name'...
3        [{'name': 'Twentieth Century Fox Film Corporat...
4        [{'name': 'Sandollar Productions', 'id': 5842}...
                               ...                        
45461                                                   []
45462               [{'name': 'Sine Olivia', 'id': 19653}]
45463    [{'name': 'American World Pictures', 'id': 6165}]
45464                 [{'name': 'Yermoliev', 'id': 88753}]
45465                                                   []
Name: production_companies, Length: 45466, dtype: object

In [128]:
def convert_production(x):
    # Each value is a list of dicts such as {'name': ..., 'id': ...}.
    if isinstance(x, list):
        ls = [i["name"] for i in x]
        return "|".join(ls)
    return np.nan
df.production_companies = df.production_companies.apply(convert_production)
df.production_companies.value_counts(dropna = False)

                                                                                                                                              11875
Metro-Goldwyn-Mayer (MGM)                                                                                                                       742
Warner Bros.                                                                                                                                    540
Paramount Pictures                                                                                                                              505
Twentieth Century Fox Film Corporation                                                                                                          439
                                                                                                                                              ...  
HBO Films|Moving Pictures                                                                                       

## Cleaning Numerical Columns

10. __Convert__ the datatype in the columns __"budget"__, __"id"__ and __"popularity"__ __to numeric__. Set invalid values as NaN.

In [129]:
df.budget = pd.to_numeric(df.budget, errors= 'coerce')
df.id = pd.to_numeric(df.id, errors='coerce')
df.popularity = pd.to_numeric(df.popularity, errors='coerce')
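
To see what `errors='coerce'` does, and which raw entries it turns into NaN, a small experiment can help. The Series below is made up for illustration (the file-name string is a hypothetical stand-in for a non-numeric entry).

```python
import pandas as pd

# Toy "budget" column: two numeric strings, one stray text value, one missing value.
budget_raw = pd.Series(["30000000", "0", "not_a_number.jpg", None])

# Unparsable strings become NaN instead of raising an error.
budget = pd.to_numeric(budget_raw, errors="coerce")

# Entries that were present in the raw data but could not be converted.
invalid = budget_raw[budget.isna() & budget_raw.notna()]

print(budget)
print(invalid)
```

Inspecting the coerced entries before overwriting the column makes it easier to judge whether they are genuinely invalid or a sign of shifted rows.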

In [130]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   belongs_to_collection  4491 non-null   object 
 1   budget                 45463 non-null  float64
 2   genres                 45466 non-null  object 
 3   homepage               7782 non-null   object 
 4   id                     45463 non-null  float64
 5   original_language      45455 non-null  object 
 6   overview               44512 non-null  object 
 7   popularity             45460 non-null  float64
 8   poster_path            45080 non-null  object 
 9   production_companies   45460 non-null  object 
 10  production_countries   45460 non-null  object 
 11  release_date           45379 non-null  object 
 12  revenue                45460 non-null  float64
 13  runtime                45203 non-null  float64
 14  spoken_languages       45460 non-null  object 
 15  st

In [133]:
df

Unnamed: 0,belongs_to_collection,budget,genres,homepage,id,original_language,overview,popularity,poster_path,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,Toy Story Collection,30000000.0,Animation|Comedy|Family,http://toystory.disney.com/toy-story,862.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,...,1995-10-30,373554033.0,81.0,en,Released,,Toy Story,False,7.7,5415.0
1,,65000000.0,Adventure|Fantasy|Family,,8844.0,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,...,1995-12-15,262797249.0,104.0,en|fr,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,Grumpy Old Men Collection,0.0,Romance|Comedy,,15602.0,en,A family wedding reignites the ancient feud be...,11.712900,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,...,1995-12-22,0.0,101.0,en,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,,16000000.0,Comedy|Drama|Romance,,31357.0,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,...,1995-12-22,81452156.0,127.0,en,Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,Father of the Bride Collection,0.0,Comedy,,11862.0,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,...,1995-02-10,76578911.0,106.0,en,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,0.0,Drama|Family,http://www.imdb.com/title/tt6209470/,439050.0,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,...,,0.0,90.0,fa,Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,,0.0,Drama,,111109.0,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,Sine Olivia,...,2011-11-17,0.0,360.0,tl,Released,,Century of Birthing,False,9.0,3.0
45463,,0.0,Action|Drama|Thriller,,67758.0,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,American World Pictures,...,2003-08-01,0.0,90.0,en,Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,,0.0,,,227506.0,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,Yermoliev,...,1917-10-21,0.0,87.0,,Released,,Satan Triumphant,False,0.0,0.0


11. __Analyze__ the columns __"budget"__, __"revenue"__ and __"runtime"__, in particular movies with a budget/revenue/runtime of 0. Do you think the value 0 is the most appropriate value? __Take reasonable measures__! 

In [134]:
for col in "budget revenue runtime".split():
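    # Assign the result back instead of calling replace(..., inplace=True) on a
    # column selection, which newer pandas versions warn about or ignore.
    df[col] = df[col].replace(0, np.nan)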

In [135]:
df.runtime.value_counts(dropna = False)

90.0     2556
NaN      1821
100.0    1470
95.0     1412
93.0     1214
         ... 
410.0       1
283.0       1
238.0       1
566.0       1
780.0       1
Name: runtime, Length: 353, dtype: int64
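
Before replacing zeros, it can be useful to count how many there are per column, since a large share of zeros suggests they stand for "unknown" rather than a real value. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a few zero placeholders and one genuine missing value.
toy = pd.DataFrame({
    "budget": [30000000.0, 0.0, 0.0, 16000000.0],
    "revenue": [373554033.0, 0.0, 81452156.0, 0.0],
    "runtime": [81.0, 104.0, 0.0, np.nan],
})

# Number of exact zeros per column.
zero_counts = (toy == 0).sum()

# Replace the zeros with NaN and count missing values afterwards.
cleaned = toy.replace(0, np.nan)
missing_counts = cleaned.isna().sum()

print(zero_counts)
print(missing_counts)
```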

12. The columns "budget" and "revenue" shall show values in Million USD. __Convert and Overwrite__!

In [136]:
df.budget = df.budget.div(1000000)
df.revenue = df.revenue.div(1000000)

## Cleaning DateTime Columns

14. __Convert__ the datatype in the column __"release_date"__ __to datetime__. Set invalid values as NaT.

In [137]:
df.release_date = pd.to_datetime(df.release_date , errors="coerce")
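
For datetime conversion, coerced values show up as `NaT` (the datetime counterpart of NaN). The sketch below uses a made-up Series and passes an explicit `format`, which is an assumption about how the dates are written; the cell above lets pandas infer the format instead.

```python
import pandas as pd

# Toy dates: two valid ISO dates, one invalid string, one missing value.
dates_raw = pd.Series(["1995-10-30", "not a date", None, "1995-12-15"])

# Invalid or missing entries become NaT instead of raising an error.
dates = pd.to_datetime(dates_raw, format="%Y-%m-%d", errors="coerce")

print(dates)
print(dates.dt.year)
```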

## Removing Duplicates

16. __Identify__ and __remove__ duplicates!

In [138]:
df.drop_duplicates(subset = "id", inplace = True)
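
The cell above removes duplicates in one step. To identify them first, `DataFrame.duplicated` with `keep=False` marks every row that shares its `id` with another row. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy frame in which id 862 appears twice.
toy = pd.DataFrame({
    "id": [862.0, 8844.0, 862.0, 15602.0],
    "title": ["Toy Story", "Jumanji", "Toy Story", "Grumpier Old Men"],
})

# keep=False flags all members of a duplicate group, not only the repeats.
dupes = toy[toy.duplicated(subset="id", keep=False)]

# drop_duplicates keeps the first occurrence of each id by default.
deduped = toy.drop_duplicates(subset="id")

print(dupes)
print(deduped)
```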

## Handling Missing Values & Removing Observations

17. __Drop__ all rows/movies with unknown __id__ or __title__.

In [139]:
df.dropna(subset=["id", "title"], inplace = True)
df

Unnamed: 0,belongs_to_collection,budget,genres,homepage,id,original_language,overview,popularity,poster_path,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,Toy Story Collection,30.0,Animation|Comedy|Family,http://toystory.disney.com/toy-story,862.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,...,1995-10-30,373.554033,81.0,en,Released,,Toy Story,False,7.7,5415.0
1,,65.0,Adventure|Fantasy|Family,,8844.0,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,...,1995-12-15,262.797249,104.0,en|fr,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,Grumpy Old Men Collection,,Romance|Comedy,,15602.0,en,A family wedding reignites the ancient feud be...,11.712900,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,...,1995-12-22,,101.0,en,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,,16.0,Comedy|Drama|Romance,,31357.0,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,...,1995-12-22,81.452156,127.0,en,Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,Father of the Bride Collection,,Comedy,,11862.0,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,...,1995-02-10,76.578911,106.0,en,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,,Drama|Family,http://www.imdb.com/title/tt6209470/,439050.0,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,...,NaT,,90.0,fa,Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,,,Drama,,111109.0,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,Sine Olivia,...,2011-11-17,,360.0,tl,Released,,Century of Birthing,False,9.0,3.0
45463,,,Action|Drama|Thriller,,67758.0,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,American World Pictures,...,2003-08-01,,90.0,en,Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,,,,,227506.0,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,Yermoliev,...,1917-10-21,,87.0,,Released,,Satan Triumphant,False,0.0,0.0


18. __Keep__ only those rows/movies in the df with __10 or more non-NaN__ values.

In [140]:
df.dropna(thresh=10, inplace = True)
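
The `thresh` argument counts non-missing values per row: a row is kept only if it has at least that many. The toy example below (made-up data, with `thresh=2` instead of 10) shows the counting explicitly.

```python
import numpy as np
import pandas as pd

# Toy frame: the first row is complete, the other two have one value each.
toy = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 5, np.nan],
    "c": [3, np.nan, 9],
})

# Non-missing values per row; this is what thresh is compared against.
non_null = toy.notna().sum(axis=1)

# Keep rows with at least 2 non-missing values.
kept = toy.dropna(thresh=2)

print(non_null)
print(kept)
```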

## Final (Cleaning) Steps

19. __Keep__ only those rows/movies in the df with __status "Released"__. Then __drop__ the column "status".

In [141]:
# Missing status values compare as False, so no separate notna() check is needed.
mask = df.status == "Released"
df = df[mask].copy()

In [142]:
df.drop("status", axis=1,inplace = True)
df

Unnamed: 0,belongs_to_collection,budget,genres,homepage,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,tagline,title,video,vote_average,vote_count
0,Toy Story Collection,30.0,Animation|Comedy|Family,http://toystory.disney.com/toy-story,862.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,Pixar Animation Studios,US,1995-10-30,373.554033,81.0,en,,Toy Story,False,7.7,5415.0
1,,65.0,Adventure|Fantasy|Family,,8844.0,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,TriStar Pictures|Teitler Film|Interscope Commu...,US,1995-12-15,262.797249,104.0,en|fr,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,Grumpy Old Men Collection,,Romance|Comedy,,15602.0,en,A family wedding reignites the ancient feud be...,11.712900,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,Warner Bros.|Lancaster Gate,US,1995-12-22,,101.0,en,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,,16.0,Comedy|Drama|Romance,,31357.0,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,Twentieth Century Fox Film Corporation,US,1995-12-22,81.452156,127.0,en,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,Father of the Bride Collection,,Comedy,,11862.0,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,Sandollar Productions|Touchstone Pictures,US,1995-02-10,76.578911,106.0,en,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,,Drama|Family,http://www.imdb.com/title/tt6209470/,439050.0,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,,IR,NaT,,90.0,fa,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,,,Drama,,111109.0,tl,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,Sine Olivia,PH,2011-11-17,,360.0,tl,,Century of Birthing,False,9.0,3.0
45463,,,Action|Drama|Thriller,,67758.0,en,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,American World Pictures,US,2003-08-01,,90.0,en,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,,,,,227506.0,en,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,Yermoliev,RU,1917-10-21,,87.0,,,Satan Triumphant,False,0.0,0.0


20. The order of the columns should be as follows: 

In [143]:
df.columns

Index(['belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'original_language', 'overview', 'popularity', 'poster_path',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [144]:
col = ["id", "title", "tagline", "release_date", "genres", "belongs_to_collection", 
"original_language", "budget", "revenue", "production_companies",
"production_countries", "vote_count", "vote_average", "popularity", "runtime",
"overview", "spoken_languages", "poster_path"]
df = df.loc[:, col]
df

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget,revenue,production_companies,production_countries,vote_count,vote_average,popularity,runtime,overview,spoken_languages,poster_path
0,862.0,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,30.0,373.554033,Pixar Animation Studios,US,5415.0,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",en,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1,8844.0,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,65.0,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,US,2413.0,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,en|fr,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2,15602.0,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,US,92.0,6.5,11.712900,101.0,A family wedding reignites the ancient feud be...,en,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
3,31357.0,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,16.0,81.452156,Twentieth Century Fox Film Corporation,US,34.0,6.1,3.859495,127.0,"Cheated on, mistreated and stepped on, the wom...",en,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
4,11862.0,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,US,173.0,5.7,8.387519,106.0,Just when George Banks has recovered from his ...,en,/e64sOI48hQXyru7naBFyssKFxVd.jpg
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,439050.0,Subdue,Rising and falling between a man and woman,NaT,Drama|Family,,fa,,,,IR,1.0,4.0,0.072051,90.0,Rising and falling between a man and woman.,fa,/jldsYflnId4tTWPx8es3uzsB1I8.jpg
45462,111109.0,Century of Birthing,,2011-11-17,Drama,,tl,,,Sine Olivia,PH,3.0,9.0,0.178241,360.0,An artist struggles to finish his work while a...,tl,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg
45463,67758.0,Betrayal,A deadly game of wits.,2003-08-01,Action|Drama|Thriller,,en,,,American World Pictures,US,6.0,3.8,0.903007,90.0,"When one of her hits goes wrong, a professiona...",en,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg
45464,227506.0,Satan Triumphant,,1917-10-21,,,en,,,Yermoliev,RU,0.0,0.0,0.003503,87.0,"In a small town live two brothers, one a minis...",,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg


22. __Save__ the cleaned dataset in a __csv-file__.

In [146]:
df.to_json("clean_data.json",orient="records")

In [147]:
df.to_csv("clean_data.csv",index = False)
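
A CSV file does not store dtypes, so the datetime column comes back as plain text unless it is parsed again on load. The sketch below shows a round trip through an in-memory buffer on a made-up frame (the `cleaned` data is an assumption for illustration); with the real file, the same `parse_dates` argument could be passed when reading "clean_data.csv".

```python
import io
import pandas as pd

# Toy stand-in for the cleaned dataset.
cleaned = pd.DataFrame({
    "id": [862, 8844],
    "title": ["Toy Story", "Jumanji"],
    "release_date": pd.to_datetime(["1995-10-30", "1995-12-15"]),
    "genres": ["Animation|Comedy|Family", "Adventure|Fantasy|Family"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV itself cannot keep.
reloaded = pd.read_csv(buffer, parse_dates=["release_date"])

print(reloaded.dtypes)
```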