# 0036 
# Cleaning Datasets

In [1]:
import pandas as pd
pd.options.display.max_columns = 30

In [2]:
df = pd.read_csv('movies_metadata.csv', low_memory= False)

#### Explanation of ```low_memory=False```:

By default (`low_memory=True`), Pandas parses the CSV file internally in chunks and infers the data type of each column chunk by chunk. This keeps memory use lower while parsing, but if a column holds different kinds of values in different chunks, Pandas can end up with mixed types in that column and emits a `DtypeWarning`.

Setting low_memory=False tells Pandas to process the whole file in one pass before deciding on each column's type. Type inference becomes consistent and the warning goes away, at the cost of higher peak memory use while the file is being read.

Note that `low_memory` does not change what is returned: either way the entire file ends up in a single DataFrame. To actually read a file piece by piece, use the `chunksize` or `iterator` arguments instead.

For a file of this size (about 45,000 rows) the extra memory is not a concern. For very large files or limited memory, a common alternative is to keep the default and declare the column types explicitly with the `dtype` argument, so that no inference is needed for those columns.
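One way to sidestep type inference for a column is to declare its type with the `dtype` argument of `read_csv` and convert it afterwards. The sketch below assumes a small in-memory CSV with made-up rows (it is not the real `movies_metadata.csv`) in which a numeric-looking column contains one stray text value.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the last row has text where a number is expected.
csv_text = "id,budget\n862,30000000\n8844,65000000\n999,not_a_number\n"

# Read the column as plain strings so pandas does not have to guess its type.
sample = pd.read_csv(io.StringIO(csv_text), dtype={"budget": str})

# Convert explicitly; values that are not numbers become NaN instead of raising.
budget = pd.to_numeric(sample["budget"], errors="coerce")
print(budget)
```

Reading as strings first and converting with `errors="coerce"` makes the handling of bad values an explicit step rather than a side effect of inference.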

In [3]:
# check the dataframe
df

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,[],"[{'iso_3166_1': 'IR', 'name': 'Iran'}]",,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,"[{'name': 'Sine Olivia', 'id': 19653}]","[{'iso_3166_1': 'PH', 'name': 'Philippines'}]",2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]","[{'iso_3166_1': 'RU', 'name': 'Russia'}]",1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


In [4]:
# check the information of df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [5]:
df['genres'][0]

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

In [6]:
df['belongs_to_collection'][0]

"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}"

#### Explanation of the output of the previous cells:

***The `genres` value is a string, as the surrounding double quotes show, but its content looks like a Python list. Within the square brackets there are three dictionaries, each enclosed in curly braces. Each dictionary has two key-value pairs: 'id', the unique identifier of the genre, and 'name', the name of the genre.***

***In this case, the first dictionary has an 'id' of 16 and a 'name' of 'Animation'. The second dictionary has an 'id' of 35 and a 'name' of 'Comedy'. The third dictionary has an 'id' of 10751 and a 'name' of 'Family'.***

***The `belongs_to_collection` value follows the same pattern, except that the string holds a single dictionary (with the keys 'id', 'name', 'poster_path' and 'backdrop_path') rather than a list.***

***This format is commonly used to store nested or structured data in a single column of a CSV file. To work with this data, you can use the ast.literal_eval() function to convert the string representation into a list of dictionaries.***
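As a minimal sketch of that conversion, the snippet below parses the `genres` string shown in cell 5 and pulls out the genre names. The variable names `raw`, `parsed` and `names` are chosen here for illustration only.

```python
import ast

# The string exactly as it appears in the 'genres' column for the first row.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

# literal_eval turns the text into real Python objects: a list of dicts.
parsed = ast.literal_eval(raw)

# Once parsed, the usual list and dict operations apply.
names = [genre["name"] for genre in parsed]
print(type(parsed), names)
```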

# 0037
#### Dropping Irrelevant Columns:  

***Dropping irrelevant columns is a common task when working with data in Pandas. Sometimes a dataset contains columns that are not useful for a particular analysis or that consist mostly of missing values. In these cases, dropping them simplifies the data and reduces memory usage.***

***To drop columns in Pandas, use the drop() method on a DataFrame and pass the column names through the `columns` argument. By default, drop() returns a new DataFrame with those columns removed and leaves the original untouched; with inplace=True, as in the next cell, it modifies the DataFrame in place and returns None.***
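The sketch below shows `drop()` on a tiny made-up DataFrame (the rows and the name `toy` are assumptions for illustration). It shows that the call returns a new DataFrame by default, and that `errors="ignore"` skips names that are not present.

```python
import pandas as pd

# Hypothetical two-row frame with a couple of columns we do not need.
toy = pd.DataFrame({
    "title": ["Toy Story", "Jumanji"],
    "adult": ["False", "False"],
    "video": [False, False],
})

# Without inplace=True the original frame is left untouched.
trimmed = toy.drop(columns=["adult", "video"])

# errors="ignore" silently skips labels that do not exist in the frame.
lenient = toy.drop(columns=["adult", "not_a_column"], errors="ignore")

print(list(toy.columns), list(trimmed.columns), list(lenient.columns))
```

Assigning the result to a new name keeps the original available for comparison, which can be handy while exploring.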

In [7]:
df.drop(columns= ['adult', 'imdb_id', 'original_title', 'video', 'homepage'], inplace = True)
# we do not need these columns, so we drop them

In [8]:
# check information of df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   belongs_to_collection  4494 non-null   object 
 1   budget                 45466 non-null  object 
 2   genres                 45466 non-null  object 
 3   id                     45466 non-null  object 
 4   original_language      45455 non-null  object 
 5   overview               44512 non-null  object 
 6   popularity             45461 non-null  object 
 7   poster_path            45080 non-null  object 
 8   production_companies   45463 non-null  object 
 9   production_countries   45463 non-null  object 
 10  release_date           45379 non-null  object 
 11  revenue                45460 non-null  float64
 12  runtime                45203 non-null  float64
 13  spoken_languages       45460 non-null  object 
 14  status                 45379 non-null  object 
 15  ta

# 0038
# How To Handle Stringified JSON Columns

#### Explanation of Stringified JSON Columns:

A stringified JSON column is a column in a dataset that contains JSON data in string format. JSON stands for JavaScript Object Notation, and it is a lightweight data interchange format that is easy for humans to read and write, and easy for machines to parse and generate.

In a stringified JSON column, the JSON data is represented as a string, rather than as a Python object or other data type. This can make it difficult to work with the JSON data, especially if you need to extract specific values or perform calculations on the data.

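To work with stringified JSON columns in a Pandas dataset, you need to convert each string to a Python object. Strict JSON can be parsed with json.loads() from Python's json module. The values in this dataset, however, use single quotes, which is Python literal syntax rather than valid JSON, so the cells below use ast.literal_eval() instead.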
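A quick check on a value shaped like the ones in this dataset shows a caveat: the strings use single quotes, so the `json` parser rejects them, while `ast.literal_eval()` accepts them. The shortened string below is an illustrative sample, not a full record.

```python
import ast
import json

# Single-quoted keys and values: valid Python literal syntax, not valid JSON.
raw = "{'id': 10194, 'name': 'Toy Story Collection'}"

try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    # JSON requires double quotes around property names and strings.
    parsed_as_json = False

# The same text is a valid Python dict literal, so literal_eval can read it.
parsed = ast.literal_eval(raw)
print(parsed_as_json, parsed["name"])
```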

In [9]:
# import libs
import json
import ast

In [10]:
json_cols = ['belongs_to_collection', 'genres', 'production_companies'
            , 'production_countries', 'spoken_languages']

In [11]:
df['belongs_to_collection'][0]
# stringified JSON column

"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}"

In [12]:
json1 = '{"dog": 3, "cat": 5}'
json1

'{"dog": 3, "cat": 5}'

In [13]:
ast.literal_eval(json1)

{'dog': 3, 'cat': 5}

We have a string called json1 that contains a JSON object in string format, and we want to convert this JSON object to a Python dictionary. One way to do this is to use the ast.literal_eval() function from the Python ast module.  

In this code, we call ast.literal_eval() on json1, which safely evaluates the string as a Python expression and returns a dictionary with keys and values corresponding to the keys and values in the JSON object. In this case, the resulting dictionary will have a key 'dog' with a value of 3, and a key 'cat' with a value of 5.
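Two properties of `ast.literal_eval()` are worth keeping in mind, sketched below under the assumption of small hand-written inputs. It only accepts literals, so text containing a function call is rejected rather than executed, and it does not understand JSON-only tokens such as `null`.

```python
import ast
import json

# A string that is both valid JSON and a valid Python literal parses fine.
ok = ast.literal_eval('{"dog": 3, "cat": 5}')

# Anything that is not a plain literal (here, a function call) is refused.
try:
    ast.literal_eval("__import__('os').getcwd()")
    call_rejected = False
except ValueError:
    call_rejected = True

# JSON's null is not a Python literal, so literal_eval refuses it too,
# while json.loads maps it to None.
try:
    ast.literal_eval('{"dog": null}')
    null_rejected = False
except ValueError:
    null_rejected = True

from_json = json.loads('{"dog": null}')
print(ok, call_rejected, null_rejected, from_json)
```

So `json1` works with either parser, but that is a property of this particular string rather than of JSON in general.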

#### Test on Dataset

In [14]:
df['genres'].apply(ast.literal_eval)[0]

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

In [15]:
df_genres = df['genres'].apply(ast.literal_eval)
df_genres

0        [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
1        [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2        [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...
3        [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
4                           [{'id': 35, 'name': 'Comedy'}]
                               ...                        
45461    [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...
45462                        [{'id': 18, 'name': 'Drama'}]
45463    [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...
45464                                                   []
45465                                                   []
Name: genres, Length: 45466, dtype: object

# 0039
# How to Handle Stringified JSON Columns Part 2

In [16]:
# necessary lib
import numpy as np

#### Work on ```belongs_to_collection``` column

In [17]:
df['belongs_to_collection'][0]

"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}"

In [18]:
# check whether each element is an instance of str
df['belongs_to_collection'].apply(lambda x: isinstance(x, str))

0         True
1        False
2         True
3        False
4         True
         ...  
45461    False
45462    False
45463    False
45464    False
45465    False
Name: belongs_to_collection, Length: 45466, dtype: bool

In [19]:
df['belongs_to_collection'] = df['belongs_to_collection'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

#### Explanation of ```apply()``` & ```lambda```:

The apply() method in Pandas can be used to apply a function to each element in a DataFrame or Series. In this case, we're applying a lambda function to the 'belongs_to_collection' column of the df DataFrame.

The lambda function checks if the input is a string using the isinstance() function. If the input is a string, it is assumed to be a JSON string, and the ast.literal_eval() function is used to convert the string to a Python object (in this case, a dictionary). If the input is not a string, the function returns np.nan.

The resulting output will be a new Series where each element is either a dictionary or np.nan, depending on the values in the 'belongs_to_collection' column of the df DataFrame. This can be useful for working with JSON data in a Pandas DataFrame, for example to extract specific values from a nested dictionary.
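The same logic can be written as a named function, which leaves room to handle strings that fail to parse. This is a sketch rather than the notebook's method: the helper name `parse_literal` and the three sample values are assumptions made for illustration.

```python
import ast

import numpy as np
import pandas as pd


def parse_literal(value):
    """Return the parsed Python object, or NaN if the value cannot be parsed."""
    if not isinstance(value, str):
        return np.nan
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Malformed text is treated as missing instead of stopping the run.
        return np.nan


# Hypothetical column: one good record, one missing value, one broken string.
raw = pd.Series(["{'id': 1, 'name': 'A Collection'}", np.nan, "not a literal {"])
parsed = raw.apply(parse_literal)
print(parsed)
```

Whether malformed strings should become NaN or raise an error depends on the analysis; silently dropping them is only one possible choice.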

In [20]:
df['belongs_to_collection'][0]

{'id': 10194,
 'name': 'Toy Story Collection',
 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg',
 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}

In [21]:
# check the type of first record
type(df['belongs_to_collection'][0])

dict

________________

#### Work on ```spoken_languages``` column:

In [22]:
# check what it looks like
df['spoken_languages'][0]

"[{'iso_639_1': 'en', 'name': 'English'}]"

In [23]:
# check the type of first record of this col
type(df['spoken_languages'][0])

str

In [24]:
df['spoken_languages'] = df['spoken_languages'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [25]:
df['spoken_languages'][0]

[{'iso_639_1': 'en', 'name': 'English'}]

In [26]:
type(df['spoken_languages'][0])

list

***The type of the records in this column changed from str to list.***

______________________________________

#### Work on ```production_countries``` column:

In [27]:
# check the status of first record
df['production_countries'][0]

"[{'iso_3166_1': 'US', 'name': 'United States of America'}]"

In [28]:
# check the type of first record
type(df['production_countries'][0])

str

In [29]:
df['production_countries'] = df['production_countries'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [30]:
# check what the first record looks like
df['production_countries'][0]

[{'iso_3166_1': 'US', 'name': 'United States of America'}]

In [31]:
# check the type of first record of this col
type(df['production_countries'][0])

list

____________________________________

#### Work on ```production_companies``` column

In [32]:
# check the first record to see what it looks like
df['production_companies'][0]

"[{'name': 'Pixar Animation Studios', 'id': 3}]"

In [33]:
# check type of it
type(df['production_companies'][0])

str

In [34]:
df['production_companies'] = df['production_companies'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [35]:
# check first record again 
df['production_companies'][0]

[{'name': 'Pixar Animation Studios', 'id': 3}]

In [36]:
# check type of first record
type(df['production_companies'][0])

list

_______________________________

#### Work on `genres` column

In [37]:
df['genres'][0]

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

In [38]:
type(df['genres'][0])

str

In [39]:
df['genres'] = df['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan)

In [40]:
# check the first record again
df['genres'][0]

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

In [41]:
type(df['genres'][0])

list
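The conversions in this section all apply the same lambda, so they could also be written as one loop over a list of column names such as `json_cols`. Below is a minimal sketch on a two-column toy DataFrame; the data and the names `toy` and `cols` are made up for illustration.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical frame with two stringified columns, one of them with a missing value.
toy = pd.DataFrame({
    "genres": ["[{'id': 16, 'name': 'Animation'}]", "[]"],
    "spoken_languages": ["[{'iso_639_1': 'en', 'name': 'English'}]", np.nan],
})

cols = ["genres", "spoken_languages"]

# Apply the same string-to-object conversion to every listed column.
for col in cols:
    toy[col] = toy[col].apply(
        lambda x: ast.literal_eval(x) if isinstance(x, str) else np.nan
    )

print(toy)
```

Converting column by column, as the notebook does, makes it easier to inspect each result along the way; the loop is simply more compact once the pattern is clear.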

#### Check the DataFrame :

In [42]:
df.head()

Unnamed: 0,belongs_to_collection,budget,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,7.7,5415.0
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,en,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,en,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0
3,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0
4,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",11862,en,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   belongs_to_collection  4494 non-null   object 
 1   budget                 45466 non-null  object 
 2   genres                 45466 non-null  object 
 3   id                     45466 non-null  object 
 4   original_language      45455 non-null  object 
 5   overview               44512 non-null  object 
 6   popularity             45461 non-null  object 
 7   poster_path            45080 non-null  object 
 8   production_companies   45463 non-null  object 
 9   production_countries   45463 non-null  object 
 10  release_date           45379 non-null  object 
 11  revenue                45460 non-null  float64
 12  runtime                45203 non-null  float64
 13  spoken_languages       45460 non-null  object 
 14  status                 45379 non-null  object 
 15  ta

# 0040
# How to Flatten Nested Columns

In [44]:
df['belongs_to_collection'] = df['belongs_to_collection'].apply(lambda x: x['name'] if isinstance(x, dict) else np.nan)

As in the previous section, apply() runs a lambda function on each element of the 'belongs_to_collection' column of the df DataFrame.

The lambda function checks whether the element is a dictionary using isinstance(). If it is, the function returns the value stored under the 'name' key, which is the name of the collection. Otherwise, for example for the np.nan values of movies that belong to no collection, it returns np.nan.

The result is a Series in which each element is either a collection name or np.nan. Assigning it back to df['belongs_to_collection'] replaces the nested dictionaries with plain strings, which are much easier to count, filter and group.
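One variation worth considering is `dict.get()`, which also covers a dictionary that happens to lack the 'name' key (indexing with `x['name']` would raise a `KeyError` there). The sketch below uses a made-up three-element Series; the third record is hypothetical and only included to show that case.

```python
import numpy as np
import pandas as pd

collections = pd.Series([
    {"id": 10194, "name": "Toy Story Collection"},
    np.nan,      # movie that belongs to no collection
    {"id": 1},   # hypothetical record with no 'name' key
])

# .get() returns the fallback instead of raising when the key is absent.
names = collections.apply(
    lambda x: x.get("name", np.nan) if isinstance(x, dict) else np.nan
)
print(names)
```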

In [45]:
df['belongs_to_collection'].value_counts(dropna = False).head(20)

belongs_to_collection
NaN                                       40975
The Bowery Boys                              29
Totò Collection                              27
James Bond Collection                        26
Zatôichi: The Blind Swordsman                26
The Carry On Collection                      25
Pokémon Collection                           22
Charlie Chan (Sidney Toler) Collection       21
Godzilla (Showa) Collection                  16
Uuno Turhapuro                               15
Dragon Ball Z (Movie) Collection             15
Charlie Chan (Warner Oland) Collection       15
The Land Before Time Collection              14
Monster High Collection                      14
Sharpe Collection                            13
George Carlin Comedy Collection              13
Johan Falk GSI Collection                    12
Sherlock Holmes (1939 series)                12
Friday the 13th Collection                   12
The Amityville Horror Collection             12
Name: count, dtype

_________________________________________

#### Flatten Data on `genres` column

In [46]:
# check the first row
df['genres'][0]

[{'id': 16, 'name': 'Animation'},
 {'id': 35, 'name': 'Comedy'},
 {'id': 10751, 'name': 'Family'}]

In [47]:
df['genres'] = df['genres'].apply(lambda x: "|".join(i['name'] for i in x))

After the conversion in the previous section, each value in the 'genres' column is a list of dictionaries, where each dictionary represents one genre and stores its name under the 'name' key.

The lambda function passed to apply() uses a generator expression to pull the 'name' value out of each dictionary in the list, and then joins those values into a single string separated by the '|' character. An empty list produces an empty string.

For example, if the 'genres' value of a row is the list `[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]`, the lambda function joins the 'name' values into the string 'Animation|Comedy|Family'.

Because the result is assigned back to df['genres'], the column now holds one string per movie listing its genres. This is useful for filtering or grouping by genre.
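To illustrate how such pipe-separated strings can be used afterwards, the sketch below filters on one genre and counts individual genres. The four sample values are made up, and the names `genres`, `is_comedy` and `counts` are chosen for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical flattened column, including one missing value.
genres = pd.Series(["Animation|Comedy|Family", "Comedy", np.nan, "Drama|Comedy"])

# Boolean mask of rows whose genre string mentions 'Comedy'; NaN counts as False.
is_comedy = genres.str.contains("Comedy", na=False)

# Split each string back into a list, give every genre its own row, then count.
counts = genres.str.split("|", regex=False).explode().value_counts()

print(is_comedy.tolist())
print(counts)
```

Note that `str.contains` matches substrings, so a genre whose name is contained in another genre's name would need a stricter pattern.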

In [48]:
# check the first row again
df['genres'][0]

'Animation|Comedy|Family'

In [49]:
df['genres'].value_counts(dropna = False).head(20)

genres
Drama                   5000
Comedy                  3621
Documentary             2723
                        2442
Drama|Romance           1301
Comedy|Drama            1135
Horror                   974
Comedy|Romance           930
Comedy|Drama|Romance     593
Drama|Comedy             532
Horror|Thriller          528
Drama|Thriller           497
Thriller                 465
Crime|Drama              430
Romance|Drama            343
Western                  318
Action|Thriller          301
Drama|Foreign            283
Action                   278
Drama|History            267
Name: count, dtype: int64

In [50]:
# assign the result back; inplace=True on a column selection may not update df in recent pandas
df['genres'] = df['genres'].replace("", np.nan)

In the output of cell 49, the fourth entry has a blank label and a count of 2442:

Documentary             2723  
`                        2442`  
Drama|Romance           1301  

These 2442 rows have no genres: their genre list was empty, so the join produced an empty string. The previous cell replaces those empty strings with `np.nan`, so that they are counted as missing values.

In [51]:
df['genres'].value_counts(dropna = False).head(20)

genres
Drama                   5000
Comedy                  3621
Documentary             2723
NaN                     2442
Drama|Romance           1301
Comedy|Drama            1135
Horror                   974
Comedy|Romance           930
Comedy|Drama|Romance     593
Drama|Comedy             532
Horror|Thriller          528
Drama|Thriller           497
Thriller                 465
Crime|Drama              430
Romance|Drama            343
Western                  318
Action|Thriller          301
Drama|Foreign            283
Action                   278
Drama|History            267
Name: count, dtype: int64
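The join and the empty-string replacement can also be folded into one helper that returns NaN directly for empty lists and non-list values. This is a sketch under assumptions: the function name `join_names` and the sample Series are hypothetical, not part of the notebook.

```python
import numpy as np
import pandas as pd


def join_names(items, sep="|"):
    """Join the 'name' values of a list of dicts; NaN for empty or non-list input."""
    if not isinstance(items, list) or not items:
        return np.nan
    return sep.join(d["name"] for d in items)


# Hypothetical column: a populated list, an empty list, and a missing value.
nested = pd.Series([
    [{"id": 16, "name": "Animation"}, {"id": 35, "name": "Comedy"}],
    [],
    np.nan,
])

flat = nested.apply(join_names)
print(flat)
```

With a helper like this, the separate `replace("", np.nan)` step would no longer be needed for columns flattened this way.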

_______________

#### Flatten Data on `spoken_languages` column

In [52]:
df['spoken_languages'][0]

[{'iso_639_1': 'en', 'name': 'English'}]

In [53]:
df['spoken_languages'] = df['spoken_languages'].apply(lambda x: "|".join(i['name'] for i in x) if isinstance(x, list) else np.nan)

In [54]:
df['spoken_languages'].value_counts(dropna = False).head(20)

spoken_languages
English             22395
                     3952
Français             1853
日本語                  1289
Italiano             1218
Español               902
Pусский               807
Deutsch               762
English|Français      681
English|Español       572
हिन्दी                481
English|Deutsch       462
한국어/조선말               425
普通话                   347
English|Italiano      326
svenska               311
No Language           303
suomi                 275
Português             275
Polski                213
Name: count, dtype: int64

In [55]:
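# assign the result back; inplace=True on a column selection may not update df in recent pandas
df['spoken_languages'] = df['spoken_languages'].replace("", np.nan)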

In [56]:
df['spoken_languages'].value_counts(dropna = False).head(20)

spoken_languages
English             22395
NaN                  3958
Français             1853
日本語                  1289
Italiano             1218
Español               902
Pусский               807
Deutsch               762
English|Français      681
English|Español       572
हिन्दी                481
English|Deutsch       462
한국어/조선말               425
普通话                   347
English|Italiano      326
svenska               311
No Language           303
Português             275
suomi                 275
Polski                213
Name: count, dtype: int64

__________________________

#### Flatten Data on `production_countries` column

In [57]:
df['production_countries'][0]

[{'iso_3166_1': 'US', 'name': 'United States of America'}]

In [58]:
df['production_countries'] = df['production_countries'].apply(lambda x : "|".join(i['name'] for i in x) if isinstance(x, list) else np.nan)

In [59]:
df['production_countries'].value_counts(dropna = False).head(20)

production_countries
United States of America                   17851
                                            6282
United Kingdom                              2238
France                                      1654
Japan                                       1356
Italy                                       1030
Canada                                       840
Germany                                      749
India                                        735
Russia                                       735
United Kingdom|United States of America      569
South Korea                                  432
Spain                                        398
Hong Kong                                    365
Canada|United States of America              365
Australia                                    336
Sweden                                       332
Finland                                      271
France|Italy                                 235
Germany|United States of America             214

In [60]:
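# assign the result back; inplace=True on a column selection may not update df in recent pandas
df['production_countries'] = df['production_countries'].replace("", np.nan)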

In [61]:
df['production_countries'].value_counts(dropna = False).head(20)

production_countries
United States of America                   17851
NaN                                         6288
United Kingdom                              2238
France                                      1654
Japan                                       1356
Italy                                       1030
Canada                                       840
Germany                                      749
India                                        735
Russia                                       735
United Kingdom|United States of America      569
South Korea                                  432
Spain                                        398
Hong Kong                                    365
Canada|United States of America              365
Australia                                    336
Sweden                                       332
Finland                                      271
France|Italy                                 235
Germany|United States of America             214

______________

#### Flatten Data on `production_companies` column

In [62]:
df['production_companies'][0:5]

0       [{'name': 'Pixar Animation Studios', 'id': 3}]
1    [{'name': 'TriStar Pictures', 'id': 559}, {'na...
2    [{'name': 'Warner Bros.', 'id': 6194}, {'name'...
3    [{'name': 'Twentieth Century Fox Film Corporat...
4    [{'name': 'Sandollar Productions', 'id': 5842}...
Name: production_companies, dtype: object

In [63]:
df['production_companies'] = df['production_companies'].apply(lambda x: "|".join(i['name'] for i in x) if isinstance(x, list) else np.nan)

In [64]:
df['production_companies'].value_counts(dropna = False).head(20)

production_companies
                                          11875
Metro-Goldwyn-Mayer (MGM)                   742
Warner Bros.                                540
Paramount Pictures                          505
Twentieth Century Fox Film Corporation      439
Universal Pictures                          320
RKO Radio Pictures                          247
Columbia Pictures Corporation               207
Columbia Pictures                           146
Mosfilm                                     145
Walt Disney Pictures                         85
Universal International Pictures (UI)        82
New Line Cinema                              75
Walt Disney Productions                      75
Shaw Brothers                                71
Touchstone Pictures                          70
Toho Company                                 65
TriStar Pictures                             62
Orion Pictures                               61
Hammer Film Productions                      60
Name: count, dtype:

In [65]:
df['production_companies'] = df['production_companies'].replace("", np.nan)
df['production_companies'].value_counts(dropna = False).head(20)

production_companies
NaN                                       11881
Metro-Goldwyn-Mayer (MGM)                   742
Warner Bros.                                540
Paramount Pictures                          505
Twentieth Century Fox Film Corporation      439
Universal Pictures                          320
RKO Radio Pictures                          247
Columbia Pictures Corporation               207
Columbia Pictures                           146
Mosfilm                                     145
Walt Disney Pictures                         85
Universal International Pictures (UI)        82
Walt Disney Productions                      75
New Line Cinema                              75
Shaw Brothers                                71
Touchstone Pictures                          70
Toho Company                                 65
TriStar Pictures                             62
Orion Pictures                               61
Hammer Film Productions                      60
Name: count, dtype:
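
The join-then-clean steps above can be sketched on a toy Series. This is a minimal sketch, not the notebook's exact code: the sample values and the `join_names` helper are made up for illustration (including the 'Example Studio' entry), and it assumes the column already holds Python lists rather than raw strings. Returning NaN for empty lists directly avoids the separate empty-string replacement.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the parsed column (hypothetical values):
# lists of dicts, an empty list, and a missing value
companies = pd.Series([
    [{'name': 'Pixar Animation Studios', 'id': 3}],
    [{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Example Studio', 'id': 1}],
    [],        # no companies listed; a plain join would give ""
    np.nan,    # value that is not a list
], dtype=object)

def join_names(value):
    """Join the 'name' fields with '|'; return NaN for non-lists and empty lists."""
    if isinstance(value, list) and value:
        return "|".join(item['name'] for item in value)
    return np.nan

flat = companies.apply(join_names)
print(flat.tolist())
```

With this variant, `value_counts(dropna=False)` reports a single NaN bucket right away instead of an empty-string bucket plus a NaN bucket.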

________________________

In [66]:
df.isna().sum()

belongs_to_collection    40975
budget                       0
genres                    2442
id                           0
original_language           11
overview                   954
popularity                   5
poster_path                386
production_companies     11881
production_countries      6288
release_date                87
revenue                      6
runtime                    263
spoken_languages          3958
status                      87
tagline                  25054
title                        6
vote_average                 6
vote_count                   6
dtype: int64

In [67]:
# original dataframe
pd.read_csv('movies_metadata.csv', low_memory= False).isna().sum()

adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
homepage                 37684
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64

# 0041
# Cleaning Numerical Columns
***Budget & BoxOffice Col***

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   belongs_to_collection  4491 non-null   object 
 1   budget                 45466 non-null  object 
 2   genres                 43024 non-null  object 
 3   id                     45466 non-null  object 
 4   original_language      45455 non-null  object 
 5   overview               44512 non-null  object 
 6   popularity             45461 non-null  object 
 7   poster_path            45080 non-null  object 
 8   production_companies   33585 non-null  object 
 9   production_countries   39178 non-null  object 
 10  release_date           45379 non-null  object 
 11  revenue                45460 non-null  float64
 12  runtime                45203 non-null  float64
 13  spoken_languages       41508 non-null  object 
 14  status                 45379 non-null  object 
 15  ta

#### Explanation of `pd.to_numeric()`
`pd.to_numeric()` converts a Series (such as a DataFrame column) to a numeric dtype. Here the `budget` column of `df` is passed in together with `errors='coerce'`, which controls how values that cannot be parsed as numbers are handled.

With `errors='coerce'`, any value that cannot be parsed is replaced by NaN ("Not a Number") instead of raising an error. This is useful for handling invalid entries in a column.

The result is assigned back to `df['budget']`, so every value in the column is either numeric or NaN. The column can then be used in arithmetic and comparisons.
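
As a small illustration of the coercion behaviour, here is a sketch on made-up values (the `raw` Series is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical raw values: numeric strings mixed with a stray file name
raw = pd.Series(['30000000', '0', '/poster.jpg', '65000000'])

numeric = pd.to_numeric(raw, errors='coerce')

print(numeric.dtype)         # float64, because NaN needs a float column
print(numeric.isna().sum())  # 1: only '/poster.jpg' could not be parsed
```

Comparing `isna().sum()` before and after the conversion is a quick way to see how many values were coerced.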

In [73]:
df['budget'] = pd.to_numeric(df['budget'], errors='coerce')

#### **Replace the 0 with np.nan**

budget  
`0.0           36573  `
5000000.0       286  
10000000.0      259  
20000000.0      243  
2000000.0       242  

In [74]:
df['budget'] = df['budget'].replace(0, np.nan)

In [75]:
df['budget'].value_counts(dropna = False) 

budget
NaN            36576
5000000.0        286
10000000.0       259
20000000.0       243
2000000.0        242
               ...  
270000000.0        1
923.0              1
72500000.0         1
2160000.0          1
1254040.0          1
Name: count, Length: 1223, dtype: int64

#### Convert Values of `budget` to Millions by Dividing by 1,000,000

In [77]:
df['budget'] = df['budget'].div(1000000)

In [78]:
df['budget'].value_counts(dropna = False)

budget
NaN             36576
5.000000e-06      286
1.000000e-05      259
2.000000e-05      243
2.000000e-06      242
                ...  
2.700000e-04        1
9.230000e-10        1
7.250000e-05        1
2.160000e-06        1
1.254040e-06        1
Name: count, Length: 1223, dtype: int64

These values are a factor of one million too small: a budget of 5,000,000 divided by 1,000,000 should appear as `5.0`, not `5.000000e-06`. The output is consistent with the division cell having been executed twice (the cell counter jumps from 75 to 77). Restarting the kernel and running each cell once should give budgets in millions of USD.

_____________________

#### `Revenue` Column

In [79]:
df['revenue'].value_counts(dropna = False)

revenue
0.0           38052
12000000.0       20
10000000.0       19
11000000.0       19
2000000.0        18
              ...  
36565280.0        1
439564.0          1
35610100.0        1
10217873.0        1
1413000.0         1
Name: count, Length: 6864, dtype: int64

In [80]:
df['revenue'] = df['revenue'].replace(0, np.nan)
df['revenue'] = df['revenue'].div(1000000)

Calling the `div()` method on the `revenue` column of `df` with the argument `1000000` divides each value in the column by 1,000,000, converting the revenue values from their original units to millions of dollars.
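
The zero-to-NaN replacement and the scaling can be sketched on a toy Series. The values and the names `raw_revenue` and `revenue_musd` are assumptions for illustration. Writing the result to a new name instead of overwriting the input keeps the step safe to rerun, since dividing an already scaled column again would shrink it by another factor of one million.

```python
import numpy as np
import pandas as pd

raw_revenue = pd.Series([373554033.0, 0.0, 12000000.0])  # toy values in USD

# 0 stands for "unknown" here, so mask it before scaling to millions
revenue_musd = raw_revenue.replace(0, np.nan).div(1_000_000)

print(revenue_musd.tolist())
```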

##### Changing the Names of the `revenue` & `budget` Columns

In [81]:
df.rename(columns={'revenue' : 'revenue_musd', 'budget' : 'budget_musd'}, inplace = True)

In [82]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   belongs_to_collection  4491 non-null   object 
 1   budget_musd            8890 non-null   float64
 2   genres                 43024 non-null  object 
 3   id                     45466 non-null  object 
 4   original_language      45455 non-null  object 
 5   overview               44512 non-null  object 
 6   popularity             45461 non-null  object 
 7   poster_path            45080 non-null  object 
 8   production_companies   33585 non-null  object 
 9   production_countries   39178 non-null  object 
 10  release_date           45379 non-null  object 
 11  revenue_musd           7408 non-null   float64
 12  runtime                45203 non-null  float64
 13  spoken_languages       41508 non-null  object 
 14  status                 45379 non-null  object 
 15  ta

##### `df.rename` function :
***The rename() method in Pandas can be used to rename one or more columns (or row labels) in a DataFrame. In this case, you're calling the rename() method on the df DataFrame, and passing the argument columns={'revenue': 'revenue_musd', 'budget': 'budget_musd'} to specify the new names for the 'revenue' and 'budget' columns.***

***The columns argument is a dictionary where the keys are the original column names, and the values are the new column names. In this case, the original 'revenue' column is being renamed to 'revenue_musd', and the original 'budget' column is being renamed to 'budget_musd'.***
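
A minimal sketch of the same call on a toy frame (the `toy` DataFrame is made up) shows that columns missing from the mapping are left alone and that, without `inplace=True`, the original frame is not modified:

```python
import pandas as pd

toy = pd.DataFrame({'revenue': [373.55], 'budget': [30.0], 'title': ['Toy Story']})

# rename returns a new DataFrame; columns not in the mapping keep their names
renamed = toy.rename(columns={'revenue': 'revenue_musd', 'budget': 'budget_musd'})

print(list(renamed.columns))  # ['revenue_musd', 'budget_musd', 'title']
print(list(toy.columns))      # ['revenue', 'budget', 'title']
```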

# 0042
# Cleaning Numerical Columns P2

In [83]:
df['runtime'].value_counts(dropna = False).head(20)

runtime
90.0     2556
0.0      1558
100.0    1470
95.0     1412
93.0     1214
96.0     1104
92.0     1080
94.0     1062
91.0     1057
88.0     1032
97.0     1027
85.0     1024
98.0     1019
105.0    1002
89.0      958
87.0      919
110.0     850
86.0      846
99.0      794
102.0     791
Name: count, dtype: int64

One of the most frequent values of `runtime` is `0`, which is not realistic for a movie. It is better to change the 0 values to `np.nan`.

runtime
90.0     2556    
`0.0      1558`  
100.0    1470  
95.0     1412  
93.0     1214  
96.0     1104  
92.0     1080  
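
The replacement suggested above is not shown as a cell in this section. A minimal sketch on toy data, mirroring how `budget` was handled (the `runtime` Series here is made up):

```python
import numpy as np
import pandas as pd

runtime = pd.Series([90.0, 0.0, 100.0, np.nan, 0.0])  # toy runtimes in minutes

# Treat a runtime of 0 minutes as missing
runtime_clean = runtime.replace(0, np.nan)

print(runtime_clean.isna().sum())  # 3: two zeros plus the original NaN
```

On the real DataFrame the equivalent would presumably be `df['runtime'] = df['runtime'].replace(0, np.nan)`.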

In [84]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   belongs_to_collection  4491 non-null   object 
 1   budget_musd            8890 non-null   float64
 2   genres                 43024 non-null  object 
 3   id                     45466 non-null  object 
 4   original_language      45455 non-null  object 
 5   overview               44512 non-null  object 
 6   popularity             45461 non-null  object 
 7   poster_path            45080 non-null  object 
 8   production_companies   33585 non-null  object 
 9   production_countries   39178 non-null  object 
 10  release_date           45379 non-null  object 
 11  revenue_musd           7408 non-null   float64
 12  runtime                45203 non-null  float64
 13  spoken_languages       41508 non-null  object 
 14  status                 45379 non-null  object 
 15  ta

_____________________

#### Convert `id` Column Dtype to Numeric

As the output of `df.info()` shows, the `id` column has dtype `object`, which is not appropriate for an identifier made of numbers.  
We convert it to a numeric dtype using `pd.to_numeric()`.

In [85]:
df['id'] = pd.to_numeric(df['id'], errors = 'coerce')

In [86]:
df['id'].value_counts(dropna = False).head(20)

id
NaN         3
141971.0    3
11115.0     2
25541.0     2
15028.0     2
132641.0    2
84198.0     2
13209.0     2
77221.0     2
152795.0    2
12600.0     2
10991.0     2
42495.0     2
14788.0     2
18440.0     2
168538.0    2
105045.0    2
159849.0    2
22649.0     2
4912.0      2
Name: count, dtype: int64

***We have duplicate and missing values in the `id` column***
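
Rather than reading the counts off `value_counts()`, both problems can be quantified directly. A minimal sketch on a made-up `ids` Series:

```python
import numpy as np
import pandas as pd

ids = pd.Series([862.0, 141971.0, 141971.0, 141971.0, 11115.0, 11115.0, np.nan])

n_missing = ids.isna().sum()
# keep=False flags every member of a duplicated group, not only the repeats
n_in_dup_groups = ids.duplicated(keep=False).sum()

print(n_missing, n_in_dup_groups)  # 1 5
```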

________________

In [87]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   belongs_to_collection  4491 non-null   object 
 1   budget_musd            8890 non-null   float64
 2   genres                 43024 non-null  object 
 3   id                     45463 non-null  float64
 4   original_language      45455 non-null  object 
 5   overview               44512 non-null  object 
 6   popularity             45461 non-null  object 
 7   poster_path            45080 non-null  object 
 8   production_companies   33585 non-null  object 
 9   production_countries   39178 non-null  object 
 10  release_date           45379 non-null  object 
 11  revenue_musd           7408 non-null   float64
 12  runtime                45203 non-null  float64
 13  spoken_languages       41508 non-null  object 
 14  status                 45379 non-null  object 
 15  ta

As the output of `df.info()` shows, the `popularity` column also has dtype `object`, which is not appropriate for a numeric score.  
We convert it to a numeric dtype using `pd.to_numeric()` as well.

In [88]:
df['popularity'] = pd.to_numeric(df['popularity'], errors = 'coerce')

In [89]:
df['popularity'].value_counts(dropna = False).head(20)

popularity
0.000000    66
0.000001    56
0.000308    43
0.000220    40
0.000844    38
0.000578    38
0.001177    38
0.002001    28
0.003013    21
0.001393    19
0.003530    19
0.036471    18
0.002353    18
0.000603    16
0.001586    15
0.004425    14
0.001021    13
0.000431    13
0.004706    12
0.001247    11
Name: count, dtype: int64

### Analysis of `vote_count` and `vote_average` Columns

In [90]:
df['vote_count'].value_counts(dropna = False).head(20)

vote_count
1.0     3264
2.0     3132
0.0     2899
3.0     2787
4.0     2480
5.0     2097
6.0     1747
7.0     1570
8.0     1359
9.0     1194
10.0    1171
11.0     944
12.0     859
13.0     733
14.0     700
15.0     674
16.0     601
17.0     554
18.0     497
20.0     463
Name: count, dtype: int64

In [91]:
df['vote_average'].value_counts(dropna = False).head(20)

vote_average
0.0    2998
6.0    2468
5.0    2001
7.0    1886
6.5    1722
6.3    1603
5.5    1381
5.8    1369
6.4    1350
6.7    1342
6.8    1324
6.1    1281
6.6    1263
6.2    1253
5.9    1196
5.3    1082
5.7    1046
6.9    1037
5.6    1006
7.3    1000
Name: count, dtype: int64

In [92]:
df.loc[df['vote_count'] == 0 , 'vote_count']

83       0.0
107      0.0
126      0.0
132      0.0
137      0.0
        ... 
45432    0.0
45434    0.0
45452    0.0
45464    0.0
45465    0.0
Name: vote_count, Length: 2899, dtype: float64

The `loc[]` indexer selects rows and columns of a DataFrame by label or by boolean mask. Here the mask `df['vote_count'] == 0` selects the rows of `df` whose `vote_count` is 0.

The second argument, `'vote_count'`, then picks that single column from the selected rows, so the result is a Series rather than a DataFrame.

The resulting Series keeps the index labels of the original `df`, but contains only the `vote_count` values of the 2,899 rows where `vote_count` is 0.
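
The selection in the cell above can be reproduced on a toy frame. This is a sketch with made-up values; `votes`, `mask` and `zero_votes` are illustrative names:

```python
import pandas as pd

votes = pd.DataFrame({'vote_count': [5415.0, 0.0, 92.0, 0.0],
                      'vote_average': [7.7, 0.0, 6.5, 0.0]})

mask = votes['vote_count'] == 0            # boolean Series, one entry per row
zero_votes = votes.loc[mask, 'vote_count']  # rows by mask, one column by label

print(zero_votes.index.tolist())  # [1, 3]: original row labels are preserved
```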

______________

# 0043
# How to Clean Columns With Datetime Information
***`release_date` Column***

In [93]:
# check what the release_date column looks like
df['release_date'][0]

'1995-10-30'

In [95]:
df['release_date'] = pd.to_datetime(df['release_date'], errors = 'coerce')

In [96]:
df['release_date'].value_counts(dropna = False).head(10)

release_date
2008-01-01    136
2009-01-01    121
2007-01-01    118
2005-01-01    111
2006-01-01    101
2002-01-01     96
2004-01-01     90
NaT            90
2001-01-01     84
2003-01-01     76
Name: count, dtype: int64

In [97]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   belongs_to_collection  4491 non-null   object        
 1   budget_musd            8890 non-null   float64       
 2   genres                 43024 non-null  object        
 3   id                     45463 non-null  float64       
 4   original_language      45455 non-null  object        
 5   overview               44512 non-null  object        
 6   popularity             45460 non-null  float64       
 7   poster_path            45080 non-null  object        
 8   production_companies   33585 non-null  object        
 9   production_countries   39178 non-null  object        
 10  release_date           45376 non-null  datetime64[ns]
 11  revenue_musd           7408 non-null   float64       
 12  runtime                45203 non-null  float64       
 13  s

***The `release_date` column's dtype has been changed to datetime by the `pd.to_datetime()` function***
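
As with `pd.to_numeric()`, `errors='coerce'` turns unparseable entries into a missing value, here `NaT`. A minimal sketch on made-up strings:

```python
import pandas as pd

raw_dates = pd.Series(['1995-10-30', '2008-01-01', 'not a date', None])

# Unparseable strings and missing values both become NaT
dates = pd.to_datetime(raw_dates, errors='coerce')

print(dates.dtype)
print(dates.isna().sum())
```

Once the column is a datetime dtype, components such as the year are available through the `.dt` accessor, for example `dates.dt.year`.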

# 004
# How to Clean String Text Columns
***`original_language` and `overview`***

In [98]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   belongs_to_collection  4491 non-null   object        
 1   budget_musd            8890 non-null   float64       
 2   genres                 43024 non-null  object        
 3   id                     45463 non-null  float64       
 4   original_language      45455 non-null  object        
 5   overview               44512 non-null  object        
 6   popularity             45460 non-null  float64       
 7   poster_path            45080 non-null  object        
 8   production_companies   33585 non-null  object        
 9   production_countries   39178 non-null  object        
 10  release_date           45376 non-null  datetime64[ns]
 11  revenue_musd           7408 non-null   float64       
 12  runtime                45203 non-null  float64       
 13  s

In [99]:
df['original_language'].value_counts(dropna = False).head(10)

original_language
en    32269
fr     2438
it     1529
ja     1350
de     1080
es      994
ru      826
hi      508
ko      444
zh      409
Name: count, dtype: int64

In [100]:
df['title']

0                          Toy Story
1                            Jumanji
2                   Grumpier Old Men
3                  Waiting to Exhale
4        Father of the Bride Part II
                    ...             
45461                         Subdue
45462            Century of Birthing
45463                       Betrayal
45464               Satan Triumphant
45465                       Queerama
Name: title, Length: 45466, dtype: object

In [101]:
df['title'].value_counts(dropna = False).head(10)

title
Cinderella              11
Alice in Wonderland      9
Hamlet                   9
Les Misérables           8
Beauty and the Beast     8
Treasure Island          7
A Christmas Carol        7
The Three Musketeers     7
Blackout                 7
Home                     6
Name: count, dtype: int64

#### ***Cleaning Values on Overview Column***

In [102]:
df['overview'][0]

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."

In [104]:
df['overview'].value_counts(dropna = False).head(20)

overview
NaN                   954
No overview found.

In [105]:
df['overview'][0]

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."

In [106]:
df['overview'] = df['overview'].replace('No overview found.', np.nan)

In [107]:
df['overview'].value_counts(dropna = False).head(20)

overview
NaN            1087
No Overview

In [108]:
df['overview'] = df['overview'].replace('No Overview', np.nan)

In [110]:
df['overview'] = df['overview'].replace('No movie overview available.', np.nan)

In [111]:
df['overview'] = df['overview'].replace(" ", np.nan)

In [112]:
df['overview'] = df['overview'].replace('No overview yet', np.nan)

In [113]:
df['overview'].value_counts(dropna = False).head(20)

overview
NaN    1102
Recovering from a nail gun shot to the head and 13 months of coma, doctor Pekka Valinta starts to unravel the mystery of his past, still suffering from total amnesia.
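
The placeholder strings handled one by one above can also be replaced in a single call by passing a list to `replace()`. A minimal sketch on a toy Series; the list contains only the placeholders that appear in the cells above, and the names `overview`, `placeholders` and `overview_clean` are illustrative:

```python
import numpy as np
import pandas as pd

overview = pd.Series([
    'A real plot summary.',
    'No overview found.',
    'No Overview',
    ' ',
    'No overview yet',
    'No movie overview available.',
])

# Placeholder texts that carry no information
placeholders = ['No overview found.', 'No Overview', ' ',
                'No overview yet', 'No movie overview available.']

overview_clean = overview.replace(placeholders, np.nan)

print(overview_clean.isna().sum())  # 5
```

Keeping the placeholders in one list makes it easy to extend when `value_counts()` reveals another variant.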

#### Cleaning `tagline` column

In [114]:
df['tagline'].value_counts(dropna = False).head(50)

tagline
NaN                                                                                                                      25054
Based on a true story.                                                                                                       7
Trust no one.                                                                                                                4
Be careful what you wish for.                                                                                                4
-                                                                                                                            4
Classic Albums                                                                                                               3
Some doors should never be opened.                                                                                           3
A Love Story                                                                                           

In [115]:
df['tagline'] = df['tagline'].replace('-', np.nan)

In [116]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   belongs_to_collection  4491 non-null   object        
 1   budget_musd            8890 non-null   float64       
 2   genres                 43024 non-null  object        
 3   id                     45463 non-null  float64       
 4   original_language      45455 non-null  object        
 5   overview               44364 non-null  object        
 6   popularity             45460 non-null  float64       
 7   poster_path            45080 non-null  object        
 8   production_companies   33585 non-null  object        
 9   production_countries   39178 non-null  object        
 10  release_date           45376 non-null  datetime64[ns]
 11  revenue_musd           7408 non-null   float64       
 12  runtime                45203 non-null  float64       
 13  s

# 0045
# Removing Duplicate Rows

In [117]:
df[df.duplicated(keep = False)].sort_values(by = 'id')

Unnamed: 0,belongs_to_collection,budget_musd,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue_musd,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
7345,,,Crime|Drama|Thriller,5511.0,fr,Hitman Jef Costello is a perfectionist who alw...,9.091288,/cvNW8IXigbaMNo4gKEIps0NGnhA.jpg,Fida cinematografica|Compagnie Industrielle et...,France|Italy,1967-10-25,0.039481,105.0,Français,Released,There is no solitude greater than that of the ...,Le Samouraï,7.9,187.0
9165,,,Crime|Drama|Thriller,5511.0,fr,Hitman Jef Costello is a perfectionist who alw...,9.091288,/cvNW8IXigbaMNo4gKEIps0NGnhA.jpg,Fida cinematografica|Compagnie Industrielle et...,France|Italy,1967-10-25,0.039481,105.0,Français,Released,There is no solitude greater than that of the ...,Le Samouraï,7.9,187.0
24844,,,Comedy|Drama,11115.0,en,As an ex-gambler teaches a hot-shot college ki...,6.880365,/kHaBqrrozaG7rj6GJg3sUCiM29B.jpg,Andertainment Group|Crescent City Pictures|Tag...,United States of America,2008-01-29,,85.0,English,Released,,Deal,5.2,22.0
14012,,,Comedy|Drama,11115.0,en,As an ex-gambler teaches a hot-shot college ki...,6.880365,/kHaBqrrozaG7rj6GJg3sUCiM29B.jpg,Andertainment Group|Crescent City Pictures|Tag...,United States of America,2008-01-29,,85.0,English,Released,,Deal,5.2,22.0
22151,,,Action|Horror|Science Fiction,18440.0,en,When a comet strikes Earth and kicks up a clou...,1.436085,/tWCyKXHuSrQdLAvNeeVJBnhf1Yv.jpg,,United States of America,2007-01-01,,89.0,English,Released,,Days of Darkness,5.0,5.0
14000,,,Action|Horror|Science Fiction,18440.0,en,When a comet strikes Earth and kicks up a clou...,1.436085,/tWCyKXHuSrQdLAvNeeVJBnhf1Yv.jpg,,United States of America,2007-01-01,,89.0,English,Released,,Days of Darkness,5.0,5.0
8068,,,Adventure|Animation|Drama|Action|Foreign,23305.0,en,"In feudal India, a warrior (Khan) who renounce...",1.967992,/9GlrmbZO7VGyqhaSR1utinRJz3L.jpg,Filmfour,France|Germany|India|United Kingdom,2001-09-23,,86.0,हिन्दी,Released,,The Warrior,6.3,15.0
9327,,,Adventure|Animation|Drama|Action|Foreign,23305.0,en,"In feudal India, a warrior (Khan) who renounce...",1.967992,/9GlrmbZO7VGyqhaSR1utinRJz3L.jpg,Filmfour,France|Germany|India|United Kingdom,2001-09-23,,86.0,हिन्दी,Released,,The Warrior,6.3,15.0
17229,,,Drama,25541.0,da,Former Danish servicemen Lars and Jimmy are th...,2.587911,/q19Q5BRZpMXoNCA4OYodVozfjUh.jpg,,Sweden|Denmark,2009-10-21,,90.0,Dansk,Released,,Brotherhood,7.1,21.0
23044,,,Drama,25541.0,da,Former Danish servicemen Lars and Jimmy are th...,2.587911,/q19Q5BRZpMXoNCA4OYodVozfjUh.jpg,,Sweden|Denmark,2009-10-21,,90.0,Dansk,Released,,Brotherhood,7.1,21.0


In [118]:
df.drop_duplicates(inplace = True)

In [121]:
df[df.duplicated(subset = 'id', keep= False)].sort_values(by = 'id')

Unnamed: 0,belongs_to_collection,budget_musd,genres,id,original_language,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue_musd,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
33826,,3e-05,Comedy|Crime|Drama|Romance|Thriller,4912.0,en,"Television made him famous, but his biggest hi...",7.645827,/o3Im9nPLAgtlw1j2LtpMebAotSe.jpg,Miramax Films|Allied Filmmakers|Mad Chance,United States of America,2002-12-30,33.013805,113.0,English,Released,Some things are better left top secret.,Confessions of a Dangerous Mind,6.6,281.0
5865,,3e-05,Comedy|Crime|Drama|Romance|Thriller,4912.0,en,"Television made him famous, but his biggest hi...",11.331072,/o3Im9nPLAgtlw1j2LtpMebAotSe.jpg,Miramax Films|Allied Filmmakers|Mad Chance,United States of America,2002-12-30,33.013805,113.0,English,Released,Some things are better left top secret.,Confessions of a Dangerous Mind,6.6,281.0
4114,Pokémon Collection,1.6e-05,Adventure|Fantasy|Animation|Action|Family,10991.0,ja,When Molly Hale's sadness of her father's disa...,10.264597,/5ILjS6XB5deiHop8SXPsYxXWVPE.jpg,TV Tokyo|4 Kids Entertainment|Nintendo|Pikachu...,Japan,2000-07-08,68.411275,93.0,English,Released,Pokémon: Spell of the Unknown,Pokémon: Spell of the Unknown,6.0,143.0
44821,Pokémon Collection,1.6e-05,Adventure|Fantasy|Animation|Action|Family,10991.0,ja,When Molly Hale's sadness of her father's disa...,6.480376,/5ILjS6XB5deiHop8SXPsYxXWVPE.jpg,TV Tokyo|4 Kids Entertainment|Nintendo|Pikachu...,Japan,2000-07-08,68.411275,93.0,English,Released,Pokémon: Spell of the Unknown,Pokémon: Spell of the Unknown,6.0,144.0
44826,Pokémon Collection,,Adventure|Fantasy|Animation|Science Fiction|Fa...,12600.0,ja,"All your favorite Pokémon characters are back,...",6.080108,/bqL0PVHbQ8Jmw3Njcl38kW0CoeM.jpg,,Japan|United States of America,2001-07-06,28.023563,75.0,日本語,Released,,Pokémon 4Ever: Celebi - Voice of the Forest,5.7,82.0
5535,Pokémon Collection,,Adventure|Fantasy|Animation|Science Fiction|Fa...,12600.0,ja,"All your favorite Pokémon characters are back,...",7.072301,/bqL0PVHbQ8Jmw3Njcl38kW0CoeM.jpg,,Japan|United States of America,2001-07-06,28.023563,75.0,日本語,Released,,Pokémon 4Ever: Celebi - Voice of the Forest,5.7,82.0
15765,,2.5e-09,Drama|Comedy|Foreign,13209.0,fa,"Since women are banned from soccer matches, Ir...",1.529879,/nfkOkpudNNIjRrf0mTFVoiGzHyc.jpg,Jafar Panahi Film Productions,Iran,2006-05-26,,93.0,فارسی,Released,,Offside,6.7,27.0
11342,,2.5e-09,Drama|Comedy|Foreign,13209.0,fa,"Since women are banned from soccer matches, Ir...",1.52896,/nfkOkpudNNIjRrf0mTFVoiGzHyc.jpg,Jafar Panahi Film Productions,Iran,2006-05-26,,93.0,فارسی,Released,,Offside,6.7,27.0
10419,,1.6e-06,Drama|Crime|Mystery,14788.0,en,Set against the backdrop of a decaying Midwest...,3.185256,/w56oo9nREcF54sNXVYuE9QxZFjT.jpg,Magnolia Pictures|Extension 765,United States of America,2005-09-03,,73.0,English,Released,,Bubble,6.4,36.0
12066,,1.6e-06,Drama|Crime|Mystery,14788.0,en,Set against the backdrop of a decaying Midwest...,3.008299,/w56oo9nREcF54sNXVYuE9QxZFjT.jpg,Magnolia Pictures|Extension 765,United States of America,2005-09-03,,73.0,English,Released,,Bubble,6.4,36.0


#### Explanation of `df.duplicated()`
The `duplicated()` method identifies duplicate rows in a DataFrame based on one or more columns. Here it is called on `df` with `subset='id'`, so only the `id` column is compared, and with `keep=False`, so every row in a group of duplicates is marked, rather than all but the first or last occurrence.

The result is a Boolean Series where each value is True if the corresponding row shares its `id` with another row, and False otherwise. Using that Series as a mask on `df` returns the duplicated rows themselves.
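
The effect of `subset` and `keep` can be seen on a toy frame. This is a sketch: the `movies` frame is made up, loosely modelled on the rows above where two entries share an `id` but differ in `popularity`.

```python
import pandas as pd

movies = pd.DataFrame({'id': [4912, 4912, 862],
                       'popularity': [7.645827, 11.331072, 21.946943]})

# Whole-row comparison: the two 4912 rows differ in popularity
print(movies.duplicated().tolist())                         # [False, False, False]
# Compare only 'id'; by default the first occurrence is not flagged
print(movies.duplicated(subset='id').tolist())              # [False, True, False]
# keep=False flags every row of the duplicated group
print(movies.duplicated(subset='id', keep=False).tolist())  # [True, True, False]

# drop_duplicates keeps the first row of each group by default
deduped = movies.drop_duplicates(subset='id')
print(deduped['popularity'].tolist())                       # [7.645827, 21.946943]
```

This is why the plain `drop_duplicates()` call did not remove these rows, and why the second call with `subset='id'` was needed.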

In [122]:
df.drop_duplicates(subset = 'id', inplace = True)

In [123]:
df['id'].value_counts(dropna = False)

id
862.0       1
74458.0     1
296206.0    1
107308.0    1
16247.0     1
           ..
44399.0     1
10138.0     1
32084.0     1
42191.0     1
461257.0    1
Name: count, Length: 45434, dtype: int64

________________________

# Final Steps of Cleaning Df

In [124]:
df['status'].value_counts()

status
Released           44985
Rumored              229
Post Production       98
In Production         20
Planned               15
Canceled               2
Name: count, dtype: int64

In [125]:
df = df.loc[df['status'] == 'Released'].copy()

The loc[] indexer in Pandas selects rows and columns by label or by a Boolean mask. Here, the mask df['status'] == 'Released' keeps only the rows of df whose 'status' value is 'Released'.

You're then calling the copy() method on the result so that the new DataFrame is independent of the original. Without it, Pandas may not be able to tell whether a later assignment is meant to change the selection or the original data, and can raise a SettingWithCopyWarning.

After this step, df contains only the 44985 released movies. The 364 rows with another status are dropped, as are any rows whose 'status' is missing.
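
The filter-then-copy pattern can be sketched on a small example. The DataFrame `toy` and its contents are assumptions made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical) with a 'status' column like the one above.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'status': ['Released', 'Rumored', 'Released'],
})

# Boolean mask: True where the row should be kept.
mask = toy['status'] == 'Released'

# .copy() gives an independent DataFrame, so later edits
# cannot touch `toy` and do not trigger SettingWithCopyWarning.
released = toy.loc[mask].copy()

# Modify the copy only.
released['title'] = released['title'].str.lower()

print(released)
print(toy)  # unchanged
```

Note that the filtered frame keeps the original row labels (0 and 2 here), which is why the index is reset later in the notebook.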

In [126]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 44985 entries, 0 to 45465
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   belongs_to_collection  4463 non-null   object        
 1   budget_musd            8855 non-null   float64       
 2   genres                 42601 non-null  object        
 3   id                     44985 non-null  float64       
 4   original_language      44975 non-null  object        
 5   overview               43920 non-null  object        
 6   popularity             44985 non-null  float64       
 7   poster_path            44612 non-null  object        
 8   production_companies   33357 non-null  object        
 9   production_countries   38839 non-null  object        
 10  release_date           44907 non-null  datetime64[ns]
 11  revenue_musd           7385 non-null   float64       
 12  runtime                44734 non-null  float64       
 13  spoken

In [127]:
df.drop(columns=['status'], inplace = True)

In [128]:
col = ['id', 'title', 'tagline', 'release_date', 'genres', 'belongs_to_collection',
      'original_language', 'budget_musd', 'revenue_musd', 'production_companies', 
      'production_countries', 'vote_count', 'vote_average', 'popularity', 'runtime', 'overview',
      'spoken_languages', 'poster_path']

In [129]:
df = df.loc[:, col]

The list col names the columns to keep, in the order they should appear. df.loc[:, col] then selects all rows (:) and exactly those columns, so the result is a DataFrame with the same rows as before and its columns arranged as in col.

This is useful both for keeping only the columns that are relevant to your analysis and for putting them in a more readable order.
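
As a small illustration of selecting and reordering columns with a list, consider the sketch below. The names `toy`, `wanted` and `subset` are hypothetical; the values are borrowed from the first two rows shown above.

```python
import pandas as pd

# Toy frame (hypothetical) whose columns are in an inconvenient order.
toy = pd.DataFrame({
    'runtime': [81.0, 104.0],
    'id': [862, 8844],
    'title': ['Toy Story', 'Jumanji'],
})

# Columns to keep, in the order they should appear.
wanted = ['id', 'title']

# ':' selects every row; the list selects and orders the columns.
subset = toy.loc[:, wanted]

print(subset.columns.tolist())  # ['id', 'title']
print(subset)
```

The same list can also be passed directly as `toy[wanted]`; the loc form just makes the "all rows, these columns" intent explicit.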

In [130]:
df.head(5)

Unnamed: 0,id,title,tagline,release_date,genres,belongs_to_collection,original_language,budget_musd,revenue_musd,production_companies,production_countries,vote_count,vote_average,popularity,runtime,overview,spoken_languages,poster_path
0,862.0,Toy Story,,1995-10-30,Animation|Comedy|Family,Toy Story Collection,en,3e-05,373.554033,Pixar Animation Studios,United States of America,5415.0,7.7,21.946943,81.0,"Led by Woody, Andy's toys live happily in his ...",English,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg
1,8844.0,Jumanji,Roll the dice and unleash the excitement!,1995-12-15,Adventure|Fantasy|Family,,en,6.5e-05,262.797249,TriStar Pictures|Teitler Film|Interscope Commu...,United States of America,2413.0,6.9,17.015539,104.0,When siblings Judy and Peter discover an encha...,English|Français,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
2,15602.0,Grumpier Old Men,Still Yelling. Still Fighting. Still Ready for...,1995-12-22,Romance|Comedy,Grumpy Old Men Collection,en,,,Warner Bros.|Lancaster Gate,United States of America,92.0,6.5,11.7129,101.0,A family wedding reignites the ancient feud be...,English,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg
3,31357.0,Waiting to Exhale,Friends are the people who let you be yourself...,1995-12-22,Comedy|Drama|Romance,,en,1.6e-05,81.452156,Twentieth Century Fox Film Corporation,United States of America,34.0,6.1,3.859495,127.0,"Cheated on, mistreated and stepped on, the wom...",English,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg
4,11862.0,Father of the Bride Part II,Just When His World Is Back To Normal... He's ...,1995-02-10,Comedy,Father of the Bride Collection,en,,76.578911,Sandollar Productions|Touchstone Pictures,United States of America,173.0,5.7,8.387519,106.0,Just when George Banks has recovered from his ...,English,/e64sOI48hQXyru7naBFyssKFxVd.jpg


In [131]:
df.reset_index(drop = True, inplace = True)

#### Explanation of `reset_index()`:

The reset_index() method in Pandas replaces the index of a DataFrame or Series with the default integer index 0, 1, 2, and so on. It is needed here because the earlier filtering steps removed rows and left gaps in the index: df.info() above reports 44985 entries with labels running from 0 to 45465.

The first parameter, drop=True, discards the old index instead of inserting it into the DataFrame as a new column.

The second parameter, inplace=True, modifies df directly rather than returning a new DataFrame with the new index.

After this call, df has the same rows in the same order, labelled consecutively from 0.
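
The difference between drop=True and the default drop=False can be seen in a minimal sketch. The DataFrame `toy` is made up for illustration.

```python
import pandas as pd

# Toy frame (hypothetical); dropping row 1 leaves a gap in the index.
toy = pd.DataFrame({'title': ['A', 'B', 'C', 'D']})
filtered = toy.loc[[0, 2, 3]].copy()
print(filtered.index.tolist())  # [0, 2, 3]

# Default drop=False: the old index is kept as a new 'index' column.
with_old = filtered.reset_index()
print(with_old.columns.tolist())  # ['index', 'title']

# drop=True discards the old labels; inplace=True modifies `filtered`.
filtered.reset_index(drop=True, inplace=True)
print(filtered.index.tolist())  # [0, 1, 2]
```

Because inplace=True returns None, the call should not be assigned back to a variable; `filtered = filtered.reset_index(drop=True)` is the equivalent non-inplace form.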

#### Correct the poster path :

In [132]:
# The poster paths already begin with '/', so the base URL has no trailing slash.
base_poster_url = 'http://image.tmdb.org/t/p/w185'
df['poster_path'] = "<img src='" + base_poster_url + df['poster_path'] + "' style='height:100px;'>"

In [133]:
df['poster_path'][0]

"<img src='http://image.tmdb.org/t/p/w185/rhIRbceoE9lR4veEXuwCC2wARtG.jpg' style='height:100px;'>"

## Save the dataframe to a new file

In [None]:
df.to_csv('movies_cleaned1.csv', index= False)