## **Data Cleaning Using Python**

#### **Introduction**
---
Data cleaning is one of the most important steps to take before analysis begins. It can make or mar your analysis. It is safe to say that dirty data cannot produce a clean analysis.

#### **What is Data Cleaning?**

Data cleaning is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in data in order to improve its quality and ensure the data is accurate and reliable for analysis. Here, I chose Python because of its many libraries and tools for data cleaning. Stay with me and grab a snack!

#### **Data Source** | **Backstory** | **Summary**
---
This dataset was obtained from kaggle.com and was scraped from IMDb's top Netflix movies and TV shows. It contains 9 columns and 9,999 rows. The data is completely raw.

#### **Data Cleaning Process** 
---
These are the steps I carried out in the course of this project.

- Import the libraries 
- Load the data
- Check for Null Values
- Drop or Replace Nulls
- Check and convert datatype 
- Check for Duplicates 
- Drop duplicates
- Check for string inconsistency 
- Check for whitespace and irrelevant punctuation

##### **_Importing the basics_**
The first step is to import the necessary libraries, then load the dataset.
Here, I am using pandas and numpy, as they are among the most popular libraries for cleaning data.

In [3]:
#first, I import the data into my Jupyter notebook
import pandas as pd
import numpy as np

file_path = "C:/Users/user/Desktop/archive122/movies.csv" 
#read the data into the notebook
movie_df = pd.read_csv(file_path)
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870,25.0,
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805,44.0,
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849,23.0,
4,Army of Thieves,(2021),"\nAction, Crime, Horror",,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,,,
...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,,,
9995,Arcane,(2021– ),"\nAnimation, Action, Adventure",,\nAdd a Plot\n,\n,,,
9996,Heart of Invictus,(2022– ),"\nDocumentary, Sport",,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,,,
9997,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,,,


A quick summary of the dataset...

In [4]:
#a sneak peek at the dataset
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   MOVIES    9999 non-null   object 
 1   YEAR      9355 non-null   object 
 2   GENRE     9919 non-null   object 
 3   RATING    8179 non-null   float64
 4   ONE-LINE  9999 non-null   object 
 5   STARS     9999 non-null   object 
 6   VOTES     8179 non-null   object 
 7   RunTime   7041 non-null   float64
 8   Gross     460 non-null    object 
dtypes: float64(2), object(7)
memory usage: 703.2+ KB


#### **_Handling Missing Values_**

In the output above, notice that the non-null counts differ from column to column, which shows that there are null values in the dataset. So I looked further.

In [5]:
#the first step in my cleaning process is to deal with null values.
#the non-null counts above differ between columns, meaning there are null values in there. let's take a look
movie_df.isna()

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,False,False,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False,False,True
2,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,True
4,False,False,False,True,False,False,True,True,True
...,...,...,...,...,...,...,...,...,...
9994,False,False,False,True,False,False,True,True,True
9995,False,False,False,True,False,False,True,True,True
9996,False,False,False,True,False,False,True,True,True
9997,False,False,False,True,False,False,True,True,True


In [6]:
#this supports the claim, but let's look at the count of null cells in each column
movie_df.isna().sum()
#there are quite a lot of null values in the dataset.

MOVIES         0
YEAR         644
GENRE         80
RATING      1820
ONE-LINE       0
STARS          0
VOTES       1820
RunTime     2958
Gross       9539
dtype: int64
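
Raw counts are easier to judge as a share of all rows (for example, 9539 missing `Gross` values out of 9999 rows is roughly 95%). Below is a minimal sketch of that calculation on a small made-up frame; `toy_df` and its values are assumptions for illustration, not part of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movie_df (all values are made up).
toy_df = pd.DataFrame({
    "MOVIES": ["A", "B", "C", "D"],
    "RATING": [6.1, np.nan, 8.2, np.nan],
    "Gross": [np.nan, np.nan, np.nan, 1.0],
})

# isna() gives True/False per cell; the mean of booleans is the share of True.
missing_share = toy_df.isna().mean().mul(100).round(1)

# Sort so the columns with the most missing values come first.
missing_share = missing_share.sort_values(ascending=False)
print(missing_share)
```

A column that is almost entirely empty may call for a different treatment than one missing only a handful of values, so this view can help when deciding whether to drop or replace.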

In [7]:
#let us treat them one column at a time.
#for the YEAR column, we drop the rows where YEAR is null.

movie_df = movie_df.dropna(subset = ['YEAR'])
#dropping rows with null values made more sense for an unbiased analysis.
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870,25.0,
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805,44.0,
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849,23.0,
4,Army of Thieves,(2021),"\nAction, Crime, Horror",,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,,,
...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,,,
9995,Arcane,(2021– ),"\nAnimation, Action, Adventure",,\nAdd a Plot\n,\n,,,
9996,Heart of Invictus,(2022– ),"\nDocumentary, Sport",,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,,,
9997,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,,,


In [8]:
#the rows with null values have been dropped. 
#let's take a look
movie_df.isna().sum()

MOVIES         0
YEAR           0
GENRE         39
RATING      1176
ONE-LINE       0
STARS          0
VOTES       1176
RunTime     2342
Gross       8895
dtype: int64

In [9]:
#next is the GENRE column. Imagine a movie without a genre classification; such rows should be dropped to avoid confusion
movie_df = movie_df.dropna(subset = ['GENRE'])
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870,25.0,
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805,44.0,
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849,23.0,
4,Army of Thieves,(2021),"\nAction, Crime, Horror",,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,,,
...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,,,
9995,Arcane,(2021– ),"\nAnimation, Action, Adventure",,\nAdd a Plot\n,\n,,,
9996,Heart of Invictus,(2022– ),"\nDocumentary, Sport",,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,,,
9997,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,,,


In [10]:
#so let's check
movie_df.isna().sum()

MOVIES         0
YEAR           0
GENRE          0
RATING      1148
ONE-LINE       0
STARS          0
VOTES       1148
RunTime     2311
Gross       8856
dtype: int64
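
The two drops above could also be done in one call, since `dropna` accepts several column names in `subset`. This is only a sketch on a made-up frame (`toy_df` is an assumption), and it also resets the index, which the walkthrough above does not do.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movie_df (all values are made up).
toy_df = pd.DataFrame({
    "MOVIES": ["A", "B", "C", "D"],
    "YEAR": ["(2021)", np.nan, "(2019)", "(2020)"],
    "GENRE": ["Drama", "Comedy", np.nan, "Action"],
    "RATING": [7.0, 6.5, np.nan, np.nan],
})

# Drop a row if YEAR or GENRE is missing; nulls in other columns are kept.
cleaned = toy_df.dropna(subset=["YEAR", "GENRE"])

# reset_index(drop=True) renumbers the remaining rows from 0.
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Row "D" survives even though its RATING is null, because only the columns listed in `subset` are checked.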

In [11]:
#next is the RATING column.
#nulls in RATING can be replaced with the average rating in the column.
movie_df['RATING'].fillna(movie_df['RATING'].mean(), inplace = True)
#this raised a warning.
#print the dataset
movie_df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().fillna(


Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.100000,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.000000,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870,25.0,
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.200000,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805,44.0,
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.200000,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849,23.0,
4,Army of Thieves,(2021),"\nAction, Crime, Horror",6.921658,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,,,
...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",6.921658,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,,,
9995,Arcane,(2021– ),"\nAnimation, Action, Adventure",6.921658,\nAdd a Plot\n,\n,,,
9996,Heart of Invictus,(2022– ),"\nDocumentary, Sport",6.921658,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,,,
9997,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",6.921658,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,,,


I got hit with the SettingWithCopyWarning and was upset at this point, until I got help from a senior colleague. Then it was suppressed!

In [12]:
#now, I want to round the column to 1 d.p.
#before proceeding, I silence the warning (this hides the warning rather than removing its cause).
pd.options.mode.chained_assignment = None
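
Suppressing the warning works, but an alternative is to avoid triggering it. The warning likely appears because `movie_df` came out of `dropna`, so pandas cannot tell whether an in-place change on one of its columns is meant to reach the original data. A minimal sketch of that alternative, on a made-up frame (`toy_df` and `subset_df` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movie_df (all values are made up).
toy_df = pd.DataFrame({
    "GENRE": ["Drama", np.nan, "Comedy", "Action"],
    "RATING": [6.0, 7.0, np.nan, 8.0],
})

# .copy() makes the filtered frame independent of the original,
# so later assignments are unambiguous.
subset_df = toy_df.dropna(subset=["GENRE"]).copy()

# Assign the filled column back instead of using inplace=True on a column.
mean_rating = subset_df["RATING"].mean()
subset_df["RATING"] = subset_df["RATING"].fillna(mean_rating).round(1)
print(subset_df)
```

The original `toy_df` is left untouched, and no global pandas option has to be changed.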


In [13]:
movie_df['RATING'] = movie_df['RATING'].round(1)
#print the dataset 
movie_df

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870,25.0,
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805,44.0,
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849,23.0,
4,Army of Thieves,(2021),"\nAction, Crime, Horror",6.9,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,,,
...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,,,
9995,Arcane,(2021– ),"\nAnimation, Action, Adventure",6.9,\nAdd a Plot\n,\n,,,
9996,Heart of Invictus,(2022– ),"\nDocumentary, Sport",6.9,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,,,
9997,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,,,


In [14]:
#confirm changes and proceed.
movie_df.isna().sum()

MOVIES         0
YEAR           0
GENRE          0
RATING         0
ONE-LINE       0
STARS          0
VOTES       1148
RunTime     2311
Gross       8856
dtype: int64

In [15]:
#on to the next column, VOTES. I need to replace the null values with the mean value; however, there is a problem: the column holds strings,
#so I need to convert it first before finding the mean value.
movie_df['VOTES'] = movie_df['VOTES'].astype(float)
#converting directly did not go through because there were ',' in the values.

ValueError: could not convert string to float: '21,062'

In [16]:
#after the ValueError, let's try another approach
try:
    #remove the ',' thousands separator, then convert to float
    movie_df['VOTES'] = movie_df['VOTES'].str.replace(',', '', regex = False).astype(float)
except ValueError as e:
    #print the error, if any
    print(f'Error: {e}')
#print the dataset
print(movie_df)


                                   MOVIES         YEAR  \
0                           Blood Red Sky       (2021)   
1     Masters of the Universe: Revelation     (2021– )   
2                        The Walking Dead  (2010–2022)   
3                          Rick and Morty     (2013– )   
4                         Army of Thieves       (2021)   
...                                   ...          ...   
9994                       The Imperfects     (2021– )   
9995                               Arcane     (2021– )   
9996                    Heart of Invictus     (2022– )   
9997                       The Imperfects     (2021– )   
9998                       The Imperfects     (2021– )   

                                           GENRE  RATING  \
0         \nAction, Horror, Thriller                 6.1   
1     \nAnimation, Action, Adventure                 5.0   
2          \nDrama, Horror, Thriller                 8.2   
3     \nAnimation, Adventure, Comedy                 9.2   
4  
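
As a variation on the conversion above, `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into `NaN` instead of raising, so no `try`/`except` is needed. This is a sketch on a made-up Series; the `"n/a"` entry is a hypothetical bad value, not something seen in the dataset.

```python
import numpy as np
import pandas as pd

# Toy VOTES column: two real-looking values, a null, and a made-up bad entry.
votes = pd.Series(["21,062", "885,805", np.nan, "n/a"], name="VOTES")

# Remove the thousands separator as a literal string (regex=False),
# then convert; errors="coerce" turns unparseable entries into NaN.
votes_numeric = pd.to_numeric(
    votes.str.replace(",", "", regex=False), errors="coerce"
)
print(votes_numeric)
```

The trade-off is that bad values are silently turned into nulls, so it is worth re-checking `isna().sum()` afterwards.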

In [17]:
#next, I get the mean value and replace the null values with it.
#assigning the result back is safer than fillna(inplace = True) on a column selection
movie_df['VOTES'] = movie_df['VOTES'].fillna(movie_df['VOTES'].mean())
#next, I want to round to 1 d.p.
movie_df['VOTES'] = movie_df['VOTES'].round(1)
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,
4,Army of Thieves,(2021),"\nAction, Crime, Horror",6.9,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,15144.3,,
...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,15144.3,,
9995,Arcane,(2021– ),"\nAnimation, Action, Adventure",6.9,\nAdd a Plot\n,\n,15144.3,,
9996,Heart of Invictus,(2022– ),"\nDocumentary, Sport",6.9,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,15144.3,,
9997,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,15144.3,,


In [18]:
#next is the RunTime column
#assigning the result back is safer than fillna(inplace = True) on a column selection
movie_df['RunTime'] = movie_df['RunTime'].fillna(movie_df['RunTime'].mean())
#next, I round to 1 d.p.
movie_df['RunTime'] = movie_df['RunTime'].round(1)
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,
4,Army of Thieves,(2021),"\nAction, Crime, Horror",6.9,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,15144.3,68.8,
...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,15144.3,68.8,
9995,Arcane,(2021– ),"\nAnimation, Action, Adventure",6.9,\nAdd a Plot\n,\n,15144.3,68.8,
9996,Heart of Invictus,(2022– ),"\nDocumentary, Sport",6.9,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,15144.3,68.8,
9997,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,15144.3,68.8,
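
RATING, VOTES and RunTime all received the same treatment: fill the nulls with the column mean, then round to 1 d.p. That repeated step could be wrapped in a small helper. `fill_with_mean` below is a hypothetical function written for illustration, shown on made-up data.

```python
import numpy as np
import pandas as pd


def fill_with_mean(df, columns, decimals=1):
    """Return a copy of df with NaN in `columns` replaced by the column mean."""
    result = df.copy()
    for col in columns:
        col_mean = result[col].mean()  # mean() skips NaN by default
        result[col] = result[col].fillna(col_mean).round(decimals)
    return result


# Toy frame standing in for movie_df (all values are made up).
toy_df = pd.DataFrame({
    "VOTES": [100.0, np.nan, 300.0],
    "RunTime": [90.0, 120.0, np.nan],
})
filled = fill_with_mean(toy_df, ["VOTES", "RunTime"])
print(filled)
```

One thing to keep in mind is that the mean is pulled up by very large values (a single title here has 885,805 votes), so the median is sometimes preferred for skewed columns.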


In [19]:
#next is the gross column
movie_df.isna().sum()
#there are 8856 null values here

MOVIES         0
YEAR           0
GENRE          0
RATING         0
ONE-LINE       0
STARS          0
VOTES          0
RunTime        0
Gross       8856
dtype: int64

In [20]:
#Gross has so many null values that I replace them with zero.
movie_df['Gross'] = movie_df['Gross'].fillna(0)
movie_df
#that job is done. next, I will change the columns to their appropriate data types.

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,0
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,0
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,0
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,0
4,Army of Thieves,(2021),"\nAction, Crime, Horror",6.9,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,15144.3,68.8,0
...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,15144.3,68.8,0
9995,Arcane,(2021– ),"\nAnimation, Action, Adventure",6.9,\nAdd a Plot\n,\n,15144.3,68.8,0
9996,Heart of Invictus,(2022– ),"\nDocumentary, Sport",6.9,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,15144.3,68.8,0
9997,The Imperfects,(2021– ),"\nAdventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,15144.3,68.8,0


#### **_Handling Incorrect Data Types_**

It is common for columns to come with the wrong data types. The job here is to convert each one to the appropriate type; some columns have already been treated in the previous step.

In [21]:
#first, let me check the affected columns
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9316 entries, 0 to 9998
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   MOVIES    9316 non-null   object 
 1   YEAR      9316 non-null   object 
 2   GENRE     9316 non-null   object 
 3   RATING    9316 non-null   float64
 4   ONE-LINE  9316 non-null   object 
 5   STARS     9316 non-null   object 
 6   VOTES     9316 non-null   float64
 7   RunTime   9316 non-null   float64
 8   Gross     9316 non-null   object 
dtypes: float64(3), object(6)
memory usage: 727.8+ KB


In [22]:
#YEAR is stored as a string; it should be changed to a date data type.
movie_df['YEAR']
#to achieve that, I must first strip every character that is not a digit from the column

0            (2021)
1          (2021– )
2       (2010–2022)
3          (2013– )
4            (2021)
           ...     
9994       (2021– )
9995       (2021– )
9996       (2022– )
9997       (2021– )
9998       (2021– )
Name: YEAR, Length: 9316, dtype: object

In [23]:
#the regex [^0-9] matches every character that is not a digit, so only the digits are kept.
new_date = movie_df['YEAR'].str.replace('[^0-9]', '', regex = True)
#print the result
new_date
#some rows hold extra digits besides the first year (e.g. 20102022), as shown below.

0           2021
1           2021
2       20102022
3           2013
4           2021
          ...   
9994        2021
9995        2021
9996        2022
9997        2021
9998        2021
Name: YEAR, Length: 9316, dtype: object
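
Values such as `20102022` appear because a range like `(2010–2022)` contains two years. As an alternative to stripping non-digits and trimming afterwards, `str.extract` can pull out the first four-digit run in one step. This is a sketch on a made-up Series; `"(TV Special)"` is a hypothetical entry used to show what happens when no year is present.

```python
import pandas as pd

# Toy YEAR column: three formats seen in the dataset plus one made-up entry.
years = pd.Series(
    ["(2021)", "(2021– )", "(2010–2022)", "(TV Special)"], name="YEAR"
)

# (\d{4}) captures the first run of four digits; rows without one become NaN.
first_year = years.str.extract(r"(\d{4})", expand=False)
print(first_year.tolist())
```

Rows without any year come back as null rather than as an empty string, which makes them easy to spot with `isna()`.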

In [24]:
#trim each string to its first 4 digits, dropping any extra digits after them.
trimmed_dates = [date[:4] for date in new_date]
#display the trimmed strings
for trimmed_date in trimmed_dates:
    print(trimmed_date)
#next, replace the YEAR column with these values

2021
2021
2010
2013
2021
2020
2021
2006
2020
2019
2021
2016
2021
2021
2011
2005
2008
2017
2017
2016
2021
1994
2014
2013
2021
2021
2013
2015
2005
2013
2021
2016
2013
2003
2019
2009
2018
2019
2010
2013
2011
2017
2015
2014
2015
2011
2017
2016
2021
2021
2014
2005
2009
2008
2016
2009
2020
2019
2018
2017
2020
2016
1987
2021
2021
2019
2015
2018
2012
2014
2011
2005
2019
2021
2021
2021
2021
2017
2007
2018
2000
2021
2021
2021
2007
1993
1999
2015
2017
2016
2018
2019
2014
2016
2013
2016
2021
2012
2013
2007
2020
2018
2011
2017
2010
2021
2016
2000
2019
2015
2014
2001
2019
1997
2011
2017
2011
2014
1993
1989
2010
2010
2015
2017
2003
2019
2018
2017
1975
2005
2017
1995
2006
2015
2021
2008
1984
2021
2010
2021
2021
2019
2021
2014
2017
2000
2013
2015
2021
2019
2014
2009
2016
2013
2009
2021
2003
2021
2017
2006
1998
2017
2010
1966
2009
2019
1990
2021
2017
2019
2021
2012
1995
2013
2020
2017
2016
2020
2017
2020
2013
2015
2019
2021
2017
2020
2021
2013
2012
2015
2013
2019
2020
2003
2017
2017
2002
2019
2020
2020


2010
2019
2018
2014
2001
2019
2019
2013
2016
2018
2020
2015
2018
2012
2016
2016
2015
2019
2017
2019
2020
2019
2021
2020
2022
2020
2019
2020
2020
2018
2017
2019
2017
2011
2020
2020
2016
2011
2018
2008
2012
2018
2018
2015
2017
2020
2019
2020
2021
2020
2016
2018
2018
2020
2018
2018
2019
2020
2013
2018
2019
2018

2022
2019
2018
2019
2015
2018
2019
2019
2017
2016
2019
2021
2020
2017
2014
2013
2020
2019
2018
2016
2020
2007
2017
2020
2018
2009
2021
2015
2020
2009
2016
2015
2020
2018
2019
2019
2012
2021
2019
2019
2020
2021
1997
2014
2020
2021
2020
2018
2021
2010
2015
2018
2016
2017
2010
2016
2018
2015
2012
2020
2019
2018
2019
2019
2021
2019
2018
2017
2020
2020
2022
2020
2019
2020
2020
1946
2019
2019
2016
2014
2019
2017
2010
2020
2017
2020
2019
2020
2011
2017
2019
2014
2020
2014
2012
2013
2020
2016
2020
2014
2020
2017
2019
2019
2016
2020
2022
2020
2020
2020
2013
2019
1981
2022
2016
2019
2019
2020
2019
2016
2017
2018
2020
2017
2021
2015
2014
2017
2020
2018
2017
2021
2015
2018
2021
2021
2019
2017

2017
2021
2010
2017
2019
2019
2019
2019
2021
2017
2020
2015
2020
2016
2018
2017
1972
2019
2019
2021
2019
2016
2020
2014
2018
2019
2017
2010
2000
2010
2017
2015
2015
2018
2016
2020
2021
2019
2020
2015
2020
2018
2017
2018
2021
2013
2020
2014
2021
2020
2021
2018
2006
2016
2016
2015
2020
2018
2019
2016
2018
2015
2019
2017
2020
2019
2018
2017
2017
2017
2018
2020
2019
2018
2016
2015
2011
2011
2020
2021
2018
2017
2022
2014
2020
2014
2002
2009
2018
2020
2021
2018
2021
2018
2015
2017
2018
2019
2016
2020
2016
2020
2010
2019
2021
2014
2020
2010
2017
2009
2019
2019
2022
2020
2017
2021
2016
2016
2020
2015
2004
2003
2018
2016
2017
2016
2018
2019
2018
2019
2017
2017
2019
2015
2016
2016
2019
2016
2022
2020
2019
2011
2017
2020
2012
2020
2015
2017
2021
2018
2017
2018
2014
2021
2014
2019
2021
2019
2007
2014
2019
2018
2016
2017
2014
2019
2015
2020
2015
2019
2019
2020
2013
2015
2019
2022
2014
2017
2017
2012
2020
2016
2021
2015
2016
2013
2015
2013
2019
2020
2018
2020
2018
2020
2019
2020
2020
2018
2018
2021


2015
2017
2014
2019
2015
2019
2019
2020
2001
2013
2016
2018
2015
2015
2018
2010
2014
2011
2017
2022
2017
2020

2015
2013
2018
2006

2017
2015
2014
2014
2015
2015
2017
2013
2011
2011
2006
2021
2015
2017

2013
2017
2013
2017
2006
2020
2022
2005
2015
2017
2014
2016
2019
2016
2010
2015
2019
2011
2013
2018
1997
2010
2018
2018
2020
2018
2018
2022
2003
2012
2017
2017
2018
2014
2009
2011
2018
2015
2021
2021

2012
2013
2016
2016
2021
2014
2001
2019
2013
2019
2018
2019
2018
2021
2016
2017
2007
2018
2020
2017
2013
2018
2021
2016
2017
2011
2010
2020
2020

2016
2012
2013
2015
2017
2021
2022
2013
2016
2006
2019
2013
2015

2012
2017
2021
2015
2020
2016
2016
2006
2006
2014
2020
2007
2019
2017
2018
2006
2021
2017
2010
2016
2015
2008
2018
2013
2009
2017
2013
2000
2019

2018

2015
2016
2018
2016
2018
2017
2006
2017
2016
2018
2019

2019
2018
2001
2021
2000
1995
2014
2020
2008
2017
2017
2017
2011
2019
2010
2003
2003
2021
2016
2015
2016
2016
2004
2019
2016
2008
2017
2016
1998
2010
2001
2011
2017
2021
2018
2

2021
2021
2015
2019
2019
2019
2019
2019
2019
2005
2020
2017
2019
2019
2019
2019
2019
2019
2019
2015
2017
2005
2021
2021
2018
2020
2020
2020
2005
2015
2020
2020
2020
2020
2021
2015
2015
2016
2015
2017
2020
2020
2020
2020
2020
2013
2013
2018
2020
2020
2020
2020
2020
1997
2017
2017
2017
2015
2017
2017
2020
2017
2020
2020
2019
2019
2019
2019
2019
2019
2019
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2017
2017
2014
2013
2005
2020
2015
2020
2017
2020
2018
2018
2020
2020
2020
2020
2020
2020
2020
2020
2020
2020
2015
2015
2015
2020
2020
2020
2020
2020
2020
2020
2020
2020
2021
2021
2017
2017
2015
2018
2018
2018
2018
2018
2014
2018
2018
2018
2018
2018
2021
2017
2017
2017
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2021
2015
2017
2005
2005
2020
2015
2019
2020
2020
2005
2020
2020
2020
2020
2020
2020
2020
2020
2020
2019
2019
2019
2019
2019
2016
2016
2016
2016
2016
2020
2020
2019
2019
2019
2019
2019
2020
2020
2020
2020
2020
2020
2020
2020
2005


In [25]:
#replace the column with the trimmed date values
movie_df.loc[:, 'YEAR'] = trimmed_dates
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,2021,"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,0
1,Masters of the Universe: Revelation,2021,"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,0
2,The Walking Dead,2010,"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,0
3,Rick and Morty,2013,"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,0
4,Army of Thieves,2021,"\nAction, Crime, Horror",6.9,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,15144.3,68.8,0
...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,2021,"\nAdventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,15144.3,68.8,0
9995,Arcane,2021,"\nAnimation, Action, Adventure",6.9,\nAdd a Plot\n,\n,15144.3,68.8,0
9996,Heart of Invictus,2022,"\nDocumentary, Sport",6.9,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,15144.3,68.8,0
9997,The Imperfects,2021,"\nAdventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,15144.3,68.8,0


In [26]:
#to double check the column
movie_df['YEAR'].unique()

array(['2021', '2010', '2013', '2020', '2006', '2019', '2016', '2011',
       '2005', '2008', '2017', '1994', '2014', '2015', '2003', '2009',
       '2018', '1987', '2012', '2007', '2000', '1993', '1999', '2001',
       '1997', '1989', '1975', '1995', '1984', '1998', '1966', '1990',
       '2002', '1976', '1978', '2022', '1982', '1968', '2004', '1996',
       '1971', '1980', '1962', '1991', '1960', '1988', '1969', '1961',
       '1979', '1956', '1983', '1986', '1967', '1974', '', '1992', '1958',
       '1932', '1941', '1950', '1946', '1981', '1952', '1957', '1954',
       '1955', '1948', '1947', '1977', '2023', '1945', '1953', '1985',
       '1973', '1972', '1965', '1944', '1933', '1938'], dtype=object)

In [27]:
#next, I will change the data type to datetime
#('datetime64[ns]' spells out the unit, which newer pandas versions expect)
movie_df['YEAR'] = movie_df['YEAR'].astype('datetime64[ns]')
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,2021-01-01,"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,0
1,Masters of the Universe: Revelation,2021-01-01,"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,0
2,The Walking Dead,2010-01-01,"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,0
3,Rick and Morty,2013-01-01,"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,0
4,Army of Thieves,2021-01-01,"\nAction, Crime, Horror",6.9,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,15144.3,68.8,0
...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,2021-01-01,"\nAdventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,15144.3,68.8,0
9995,Arcane,2021-01-01,"\nAnimation, Action, Adventure",6.9,\nAdd a Plot\n,\n,15144.3,68.8,0
9996,Heart of Invictus,2022-01-01,"\nDocumentary, Sport",6.9,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,15144.3,68.8,0
9997,The Imperfects,2021-01-01,"\nAdventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,15144.3,68.8,0
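
The `unique()` output earlier includes an empty string `''`, so some rows had no digits left after cleaning. A sketch of a more explicit conversion is `pd.to_datetime` with a format and `errors="coerce"`, which turns such entries into `NaT`. The Series below is made up for illustration.

```python
import pandas as pd

# Toy YEAR column after trimming; "" stands for a row with no digits left.
years = pd.Series(["2021", "2010", "", "1994"], name="YEAR")

# format="%Y" reads each string as a four-digit year;
# errors="coerce" converts anything that does not match (here "") to NaT.
year_dt = pd.to_datetime(years, format="%Y", errors="coerce")
print(year_dt)

# If only the year number is needed, .dt.year extracts it
# (rows holding NaT come back as NaN).
print(year_dt.dt.year)
```

Each year becomes the first of January of that year, which matches the `2021-01-01` style values in the table above.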


In [28]:
#next, the GENRE column.
#let me get the distinct values first.
movie_df['GENRE'].unique()
#this shows how dirty the data is.

array(['\nAction, Horror, Thriller            ',
       '\nAnimation, Action, Adventure            ',
       '\nDrama, Horror, Thriller            ',
       '\nAnimation, Adventure, Comedy            ',
       '\nAction, Crime, Horror            ',
       '\nAction, Crime, Drama            ',
       '\nDrama, Romance            ',
       '\nCrime, Drama, Mystery            ', '\nComedy            ',
       '\nAction, Adventure, Thriller            ',
       '\nCrime, Drama, Fantasy            ',
       '\nDrama, Horror, Mystery            ',
       '\nComedy, Drama, Romance            ',
       '\nCrime, Drama, Thriller            ', '\nDrama            ',
       '\nComedy, Drama            ',
       '\nDrama, Fantasy, Horror            ',
       '\nComedy, Romance            ',
       '\nAction, Adventure, Drama            ',
       '\nCrime, Drama            ',
       '\nDrama, History, Romance            ',
       '\nHorror, Mystery            ', '\nComedy, Crime            ',
     

In [29]:
#first, I will get rid of the whitespace.
movie_df['GENRE'] = movie_df['GENRE'].str.strip()
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,2021-01-01,"Action, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,0
1,Masters of the Universe: Revelation,2021-01-01,"Animation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,0
2,The Walking Dead,2010-01-01,"Drama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,0
3,Rick and Morty,2013-01-01,"Animation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,0
4,Army of Thieves,2021-01-01,"Action, Crime, Horror",6.9,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,15144.3,68.8,0
...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,2021-01-01,"Adventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,15144.3,68.8,0
9995,Arcane,2021-01-01,"Animation, Action, Adventure",6.9,\nAdd a Plot\n,\n,15144.3,68.8,0
9996,Heart of Invictus,2022-01-01,"Documentary, Sport",6.9,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,15144.3,68.8,0
9997,The Imperfects,2021-01-01,"Adventure, Drama, Fantasy",6.9,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,15144.3,68.8,0
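
To see what `.str.strip()` does in isolation, here is a minimal sketch on a few made-up values shaped like the raw GENRE strings. The values and variable names are my own illustration, not rows from the dataset.

```python
import pandas as pd

# Toy values shaped like the raw GENRE strings (made up for illustration).
genre = pd.Series(["\nAction, Horror, Thriller            ", "\nComedy            ", None])

# .str.strip() removes leading and trailing whitespace, including newlines.
# Missing values stay missing instead of raising an error.
cleaned = genre.str.strip()

print(cleaned.iloc[0])  # Action, Horror, Thriller
print(cleaned.iloc[1])  # Comedy
```

Only the ends of each string are touched, so the `, ` separators between genres survive, which matters for the split in the next step.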


In [30]:
#i want to split the keywords into columns
#get the unique values first
movie_df['GENRE'].unique()

array(['Action, Horror, Thriller', 'Animation, Action, Adventure',
       'Drama, Horror, Thriller', 'Animation, Adventure, Comedy',
       'Action, Crime, Horror', 'Action, Crime, Drama', 'Drama, Romance',
       'Crime, Drama, Mystery', 'Comedy', 'Action, Adventure, Thriller',
       'Crime, Drama, Fantasy', 'Drama, Horror, Mystery',
       'Comedy, Drama, Romance', 'Crime, Drama, Thriller', 'Drama',
       'Comedy, Drama', 'Drama, Fantasy, Horror', 'Comedy, Romance',
       'Action, Adventure, Drama', 'Crime, Drama',
       'Drama, History, Romance', 'Horror, Mystery', 'Comedy, Crime',
       'Action, Drama, History', 'Action, Adventure, Crime',
       'Action, Adventure, Fantasy', 'Action, Crime, Mystery',
       'Drama, Fantasy, Romance', 'Drama, Sci-Fi, Thriller',
       'Biography, Drama, History', 'Crime, Thriller',
       'Comedy, Crime, Drama', 'Drama, Mystery, Thriller',
       'Action, Adventure, Mystery', 'Action, Comedy',
       'Crime, Drama, Horror', 'Drama, Mystery, Sc

In [31]:
#split the keywords into separate indicator columns using get_dummies
#get_dummies creates one integer column per genre, where 1 means the genre is present and 0 means it is not
genre_df = movie_df['GENRE'].str.get_dummies(', ')
#print the new indicator columns
genre_df

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,1,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,0,1,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
9995,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9996,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
9997,0,1,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
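
A small sketch of what `str.get_dummies` produces, using three made-up genre strings rather than the real column. It assumes the separator is exactly a comma followed by one space, as in the cleaned values above.

```python
import pandas as pd

# Made-up genre strings, already stripped of whitespace.
genre = pd.Series(["Action, Horror, Thriller", "Comedy", "Action, Comedy"])

# One 0/1 column per distinct keyword; columns come back sorted alphabetically.
dummies = genre.str.get_dummies(sep=", ")

print(dummies.columns.tolist())  # ['Action', 'Comedy', 'Horror', 'Thriller']
print(dummies.loc[2].tolist())   # [1, 1, 0, 0]
```

If some values used a different separator (for example a comma with no space), they would end up as their own combined column, so it is worth checking the column names after the split.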


In [32]:
#change the indicator columns to a categorical data type.
genre_df = genre_df.astype('category')
#print a summary of the new data frame
genre_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9316 entries, 0 to 9998
Data columns (total 27 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   Action       9316 non-null   category
 1   Adventure    9316 non-null   category
 2   Animation    9316 non-null   category
 3   Biography    9316 non-null   category
 4   Comedy       9316 non-null   category
 5   Crime        9316 non-null   category
 6   Documentary  9316 non-null   category
 7   Drama        9316 non-null   category
 8   Family       9316 non-null   category
 9   Fantasy      9316 non-null   category
 10  Film-Noir    9316 non-null   category
 11  Game-Show    9316 non-null   category
 12  History      9316 non-null   category
 13  Horror       9316 non-null   category
 14  Music        9316 non-null   category
 15  Musical      9316 non-null   category
 16  Mystery      9316 non-null   category
 17  News         9316 non-null   category
 18  Reality-TV   9316 non-null  

In [33]:
#next, i'll drop the old GENRE column so the new data frame can take its place
movie_df.drop('GENRE', axis = 1, inplace = True)
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,2021-01-01,6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,0
1,Masters of the Universe: Revelation,2021-01-01,5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,0
2,The Walking Dead,2010-01-01,8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,0
3,Rick and Morty,2013-01-01,9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,0
4,Army of Thieves,2021-01-01,6.9,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,15144.3,68.8,0
...,...,...,...,...,...,...,...,...
9994,The Imperfects,2021-01-01,6.9,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,15144.3,68.8,0
9995,Arcane,2021-01-01,6.9,\nAdd a Plot\n,\n,15144.3,68.8,0
9996,Heart of Invictus,2022-01-01,6.9,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,15144.3,68.8,0
9997,The Imperfects,2021-01-01,6.9,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,15144.3,68.8,0


In [34]:
#then i will concatenate the new dataframe to the old dataframe, column-wise
movie_df = pd.concat([movie_df, genre_df], axis=1)
#print the dataset after joining the new columns
movie_df

Unnamed: 0,MOVIES,YEAR,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross,Action,Adventure,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,Blood Red Sky,2021-01-01,6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,Masters of the Universe: Revelation,2021-01-01,5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,The Walking Dead,2010-01-01,8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Rick and Morty,2013-01-01,9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Army of Thieves,2021-01-01,6.9,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,15144.3,68.8,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,2021-01-01,6.9,\nAdd a Plot\n,\n \n Stars:\nMorgan Taylor Camp...,15144.3,68.8,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9995,Arcane,2021-01-01,6.9,\nAdd a Plot\n,\n,15144.3,68.8,0,1,1,...,0,0,0,0,0,0,0,0,0,0
9996,Heart of Invictus,2022-01-01,6.9,\nAdd a Plot\n,\n Director:\nOrlando von Einsiedel\n| \n ...,15144.3,68.8,0,0,0,...,0,0,0,0,0,1,0,0,0,0
9997,The Imperfects,2021-01-01,6.9,\nAdd a Plot\n,\n Director:\nJovanka Vuckovic\n| \n Sta...,15144.3,68.8,0,0,1,...,0,0,0,0,0,0,0,0,0,0
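
The drop and concat steps can also be sketched together on a toy frame. The frame below is made up; the point it illustrates is that `pd.concat(..., axis=1)` matches rows by index label, which is why it still works after earlier steps removed rows and left gaps in the index.

```python
import pandas as pd

# Toy frame with a non-contiguous index, like a frame after rows were dropped.
movies = pd.DataFrame(
    {"MOVIES": ["A", "B"], "GENRE": ["Action, Drama", "Drama"]},
    index=[0, 5],
)

# The indicator columns inherit the same index labels (0 and 5).
dummies = movies["GENRE"].str.get_dummies(sep=", ")

# Drop the original column and attach the indicators in one expression.
combined = pd.concat([movies.drop(columns="GENRE"), dummies], axis=1)

print(combined.columns.tolist())  # ['MOVIES', 'Action', 'Drama']
```

Because both pieces share the same index, no row is shifted or duplicated. If the two frames had different indexes, the mismatched rows would be filled with NaN.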


In [35]:
#a quick look at the table summary
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9316 entries, 0 to 9998
Data columns (total 35 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   MOVIES       9316 non-null   object        
 1   YEAR         9227 non-null   datetime64[ns]
 2   RATING       9316 non-null   float64       
 3   ONE-LINE     9316 non-null   object        
 4   STARS        9316 non-null   object        
 5   VOTES        9316 non-null   float64       
 6   RunTime      9316 non-null   float64       
 7   Gross        9316 non-null   object        
 8   Action       9316 non-null   category      
 9   Adventure    9316 non-null   category      
 10  Animation    9316 non-null   category      
 11  Biography    9316 non-null   category      
 12  Comedy       9316 non-null   category      
 13  Crime        9316 non-null   category      
 14  Documentary  9316 non-null   category      
 15  Drama        9316 non-null   category      
 16  Family

In [36]:
#next is the ONE-LINE column
#trim the whitespace off
movie_df['ONE-LINE'] = movie_df['ONE-LINE'].str.strip()
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross,Action,Adventure,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,Blood Red Sky,2021-01-01,6.1,A woman with a mysterious illness is forced in...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,Masters of the Universe: Revelation,2021-01-01,5.0,The war for Eternia begins again in what may b...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,The Walking Dead,2010-01-01,8.2,Sheriff Deputy Rick Grimes wakes up from a com...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Rick and Morty,2013-01-01,9.2,An animated series that follows the exploits o...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Army of Thieves,2021-01-01,6.9,"A prequel, set before the events of Army of th...",\n Director:\nMatthias Schweighöfer\n| \n ...,15144.3,68.8,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,2021-01-01,6.9,Add a Plot,\n \n Stars:\nMorgan Taylor Camp...,15144.3,68.8,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9995,Arcane,2021-01-01,6.9,Add a Plot,\n,15144.3,68.8,0,1,1,...,0,0,0,0,0,0,0,0,0,0
9996,Heart of Invictus,2022-01-01,6.9,Add a Plot,\n Director:\nOrlando von Einsiedel\n| \n ...,15144.3,68.8,0,0,0,...,0,0,0,0,0,1,0,0,0,0
9997,The Imperfects,2021-01-01,6.9,Add a Plot,\n Director:\nJovanka Vuckovic\n| \n Sta...,15144.3,68.8,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [37]:
#checking for any other irregularities
movie_df['ONE-LINE']

0       A woman with a mysterious illness is forced in...
1       The war for Eternia begins again in what may b...
2       Sheriff Deputy Rick Grimes wakes up from a com...
3       An animated series that follows the exploits o...
4       A prequel, set before the events of Army of th...
                              ...                        
9994                                           Add a Plot
9995                                           Add a Plot
9996                                           Add a Plot
9997                                           Add a Plot
9998                                           Add a Plot
Name: ONE-LINE, Length: 9316, dtype: object

In [38]:
#next is the STARS column
#first, get rid of the whitespace
movie_df['STARS'] = movie_df['STARS'].str.strip()
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross,Action,Adventure,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,Blood Red Sky,2021-01-01,6.1,A woman with a mysterious illness is forced in...,Director:\nPeter Thorwarth\n| \n Stars:\nPe...,21062.0,121.0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,Masters of the Universe: Revelation,2021-01-01,5.0,The war for Eternia begins again in what may b...,"Stars:\nChris Wood, \nSarah Michelle Gellar, \...",17870.0,25.0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,The Walking Dead,2010-01-01,8.2,Sheriff Deputy Rick Grimes wakes up from a com...,"Stars:\nAndrew Lincoln, \nNorman Reedus, \nMel...",885805.0,44.0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Rick and Morty,2013-01-01,9.2,An animated series that follows the exploits o...,"Stars:\nJustin Roiland, \nChris Parnell, \nSpe...",414849.0,23.0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Army of Thieves,2021-01-01,6.9,"A prequel, set before the events of Army of th...",Director:\nMatthias Schweighöfer\n| \n Star...,15144.3,68.8,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,2021-01-01,6.9,Add a Plot,"Stars:\nMorgan Taylor Campbell, \nChris Cope, ...",15144.3,68.8,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9995,Arcane,2021-01-01,6.9,Add a Plot,,15144.3,68.8,0,1,1,...,0,0,0,0,0,0,0,0,0,0
9996,Heart of Invictus,2022-01-01,6.9,Add a Plot,Director:\nOrlando von Einsiedel\n| \n Star...,15144.3,68.8,0,0,0,...,0,0,0,0,0,1,0,0,0,0
9997,The Imperfects,2021-01-01,6.9,Add a Plot,Director:\nJovanka Vuckovic\n| \n Stars:\nM...,15144.3,68.8,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [39]:
#let's take a closer look at the column again
movie_df['STARS']
#i need to remove the '\n' characters and replace the '|' separator with a space
#regex=False treats '|' as a literal character instead of a regex alternation
movie_df['STARS'] = movie_df['STARS'].str.replace('\n', '', regex=False).str.replace('|', ' ', regex=False)
movie_df['STARS']
#okay, this looks better.

0       Director:Peter Thorwarth      Stars:Peri Baume...
1       Stars:Chris Wood, Sarah Michelle Gellar, Lena ...
2       Stars:Andrew Lincoln, Norman Reedus, Melissa M...
3       Stars:Justin Roiland, Chris Parnell, Spencer G...
4       Director:Matthias Schweighöfer      Stars:Matt...
                              ...                        
9994    Stars:Morgan Taylor Campbell, Chris Cope, Iñak...
9995                                                     
9996    Director:Orlando von Einsiedel      Star:Princ...
9997    Director:Jovanka Vuckovic      Stars:Morgan Ta...
9998    Director:Jovanka Vuckovic      Stars:Morgan Ta...
Name: STARS, Length: 9316, dtype: object
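
The output above still has runs of spaces between the director and the stars. A possible extra step, sketched here on one made-up value (the names are placeholders, not from the dataset), is to collapse repeated whitespace with a regular expression after the literal replacements.

```python
import pandas as pd

# One made-up value shaped like the raw STARS strings.
stars = pd.Series(["Director:\nJane Doe\n| \n    Stars:\nJohn Roe, \nMary Major"])

cleaned = (
    stars.str.replace("\n", "", regex=False)   # literal newline removal
    .str.replace("|", " ", regex=False)        # literal pipe, not regex alternation
    .str.replace(r"\s+", " ", regex=True)      # collapse runs of whitespace
    .str.strip()
)

print(cleaned.iloc[0])  # Director:Jane Doe Stars:John Roe, Mary Major
```

Passing `regex=False` for the first two calls makes the intent explicit, since `|` has a special meaning in regular expressions.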

In [40]:
#next is the Gross column, let's take a look
movie_df['Gross'].unique()

#i will change the data type, but first i have to strip the '$' and 'M' characters off.
movie_df['Gross'] = movie_df['Gross'].str.replace('$M', '', regex=False)

#check for the changes made
movie_df['Gross'].unique()
#that did not go as planned: '$M' is only removed where the two characters sit next to each other, so i will try again

movie_df['Gross'] = movie_df['Gross'].str.replace('$', '', regex=False).str.replace('M', '', regex=False)
movie_df['Gross']
#okay, let's look at the unique values

movie_df['Gross'].unique()
#next, i will change the data type to float

#astype('numeric') raises a TypeError because 'numeric' is not a valid dtype name, so the line is commented out
#movie_df['Gross'] = movie_df['Gross'].astype('numeric')

#pd.to_numeric does the conversion; errors='coerce' turns anything that cannot be parsed into NaN
movie_df['Gross'] = pd.to_numeric(movie_df['Gross'], errors='coerce')
#round the Gross column to 1 decimal place
movie_df['Gross'] = movie_df['Gross'].round(1)
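
The whole Gross conversion can be sketched end to end on a few made-up strings. I am assuming here that the raw values look like a dollar sign, a number, and a trailing `M`; the sample values are illustrative only.

```python
import pandas as pd

# Made-up values: two parseable amounts, one missing, one unparseable.
gross = pd.Series(["$75.47M", "$0.26M", None, "n/a"])

# Remove the currency symbol and the unit as literal characters.
stripped = gross.str.replace("$", "", regex=False).str.replace("M", "", regex=False)

# errors="coerce" turns anything that is not a number into NaN instead of raising.
as_float = pd.to_numeric(stripped, errors="coerce").round(1)

print(as_float.isna().tolist())  # [False, False, True, True]
```

Coercing is convenient, but it also hides bad values silently, so comparing `isna().sum()` before and after the conversion is a cheap way to see how many entries were lost.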


In [41]:
#the values that could not be converted have been replaced with NaN; next i will change the NaN to zero.
movie_df['Gross']
movie_df['Gross'] = movie_df['Gross'].fillna(0)
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross,Action,Adventure,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,Blood Red Sky,2021-01-01,6.1,A woman with a mysterious illness is forced in...,Director:Peter Thorwarth Stars:Peri Baume...,21062.0,121.0,0.0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,Masters of the Universe: Revelation,2021-01-01,5.0,The war for Eternia begins again in what may b...,"Stars:Chris Wood, Sarah Michelle Gellar, Lena ...",17870.0,25.0,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,The Walking Dead,2010-01-01,8.2,Sheriff Deputy Rick Grimes wakes up from a com...,"Stars:Andrew Lincoln, Norman Reedus, Melissa M...",885805.0,44.0,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Rick and Morty,2013-01-01,9.2,An animated series that follows the exploits o...,"Stars:Justin Roiland, Chris Parnell, Spencer G...",414849.0,23.0,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Army of Thieves,2021-01-01,6.9,"A prequel, set before the events of Army of th...",Director:Matthias Schweighöfer Stars:Matt...,15144.3,68.8,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,2021-01-01,6.9,Add a Plot,"Stars:Morgan Taylor Campbell, Chris Cope, Iñak...",15144.3,68.8,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
9995,Arcane,2021-01-01,6.9,Add a Plot,,15144.3,68.8,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
9996,Heart of Invictus,2022-01-01,6.9,Add a Plot,Director:Orlando von Einsiedel Star:Princ...,15144.3,68.8,0.0,0,0,...,0,0,0,0,0,1,0,0,0,0
9997,The Imperfects,2021-01-01,6.9,Add a Plot,Director:Jovanka Vuckovic Stars:Morgan Ta...,15144.3,68.8,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
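
Replacing NaN with 0 is a choice that affects later statistics: a zero is counted in an average, while NaN is skipped. A minimal sketch with made-up numbers shows the difference.

```python
import numpy as np
import pandas as pd

# Made-up Gross values: two known amounts and two unknown ones.
gross = pd.Series([75.5, np.nan, np.nan, 402.4])

# mean() ignores NaN by default, so only the two known values are averaged.
mean_skipping_nan = gross.mean()

# After fillna(0) the unknown values count as real zeros.
mean_after_fill = gross.fillna(0).mean()

print(mean_skipping_nan > mean_after_fill)  # True
```

Whether 0 is the right fill depends on the analysis. If a missing gross means "unknown" rather than "earned nothing", leaving it as NaN may be the safer option.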


In [42]:
#check for unique values to know if the changes have been made
movie_df['Gross'].unique()

array([0.000e+00, 7.550e+01, 4.024e+02, 8.920e+01, 3.155e+02, 5.700e+01,
       2.600e+02, 1.324e+02, 1.678e+02, 4.045e+02, 1.510e+01, 7.010e+01,
       2.106e+02, 3.275e+02, 3.905e+02, 3.030e+02, 5.660e+01, 5.810e+01,
       3.530e+02, 4.690e+01, 7.000e+00, 3.778e+02, 1.078e+02, 4.037e+02,
       3.168e+02, 1.006e+02, 2.830e+01, 1.888e+02, 2.135e+02, 2.260e+02,
       4.081e+02, 1.010e+01, 1.480e+01, 1.680e+02, 1.836e+02, 3.426e+02,
       9.650e+01, 1.402e+02, 1.726e+02, 3.304e+02, 1.780e+01, 2.000e-01,
       5.680e+01, 6.620e+01, 7.560e+01, 4.600e+00, 1.066e+02, 5.000e+00,
       2.270e+01, 1.029e+02, 1.105e+02, 5.040e+02, 2.690e+01, 2.000e+00,
       3.270e+01, 3.380e+01, 4.007e+02, 1.900e+01, 3.630e+01, 6.700e+00,
       4.550e+01, 7.570e+01, 3.000e-01, 4.000e+00, 2.020e+01, 9.590e+01,
       1.765e+02, 1.267e+02, 8.010e+01, 4.230e+01, 4.700e+00, 1.177e+02,
       2.291e+02, 3.370e+01, 9.770e+01, 2.120e+01, 2.780e+01, 2.500e+01,
       1.009e+02, 8.000e-01, 3.120e+01, 1.435e+02, 

#### **_Handling Duplicated Values_**

Duplicated values can be handled in different ways. Some can be tolerated, depending on the column they appear in and the type of data it holds. For a dataset like this, I treat the MOVIES column as a unique identifier, so it should not contain duplicated values. Alright, let's continue.
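
Before applying this to the real data, here is a small sketch of how the `keep` argument changes the result. The frame is made up for illustration.

```python
import pandas as pd

# Toy frame: title "A" appears twice with identical values.
df = pd.DataFrame({"MOVIES": ["A", "B", "A", "C"], "RATING": [7.0, 6.5, 7.0, 8.1]})

# duplicated() flags every repeat after the first occurrence.
flags = df.duplicated()

# keep="first" keeps one copy of each duplicated row.
keep_first = df.drop_duplicates(keep="first")

# keep=False removes every row whose title occurs more than once.
keep_none = df.drop_duplicates(subset="MOVIES", keep=False)

print(keep_first["MOVIES"].tolist())  # ['A', 'B', 'C']
print(keep_none["MOVIES"].tolist())   # ['B', 'C']
```

The difference matters: `keep=False` throws away the original as well as the copies, so it shrinks the dataset much more than `keep="first"`.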

In [43]:
#i want to check for duplicates and remove them, first by row and then by column.
#print the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross,Action,Adventure,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,Blood Red Sky,2021-01-01,6.1,A woman with a mysterious illness is forced in...,Director:Peter Thorwarth Stars:Peri Baume...,21062.0,121.0,0.0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,Masters of the Universe: Revelation,2021-01-01,5.0,The war for Eternia begins again in what may b...,"Stars:Chris Wood, Sarah Michelle Gellar, Lena ...",17870.0,25.0,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,The Walking Dead,2010-01-01,8.2,Sheriff Deputy Rick Grimes wakes up from a com...,"Stars:Andrew Lincoln, Norman Reedus, Melissa M...",885805.0,44.0,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Rick and Morty,2013-01-01,9.2,An animated series that follows the exploits o...,"Stars:Justin Roiland, Chris Parnell, Spencer G...",414849.0,23.0,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Army of Thieves,2021-01-01,6.9,"A prequel, set before the events of Army of th...",Director:Matthias Schweighöfer Stars:Matt...,15144.3,68.8,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,The Imperfects,2021-01-01,6.9,Add a Plot,"Stars:Morgan Taylor Campbell, Chris Cope, Iñak...",15144.3,68.8,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
9995,Arcane,2021-01-01,6.9,Add a Plot,,15144.3,68.8,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
9996,Heart of Invictus,2022-01-01,6.9,Add a Plot,Director:Orlando von Einsiedel Star:Princ...,15144.3,68.8,0.0,0,0,...,0,0,0,0,0,1,0,0,0,0
9997,The Imperfects,2021-01-01,6.9,Add a Plot,Director:Jovanka Vuckovic Stars:Morgan Ta...,15144.3,68.8,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [44]:
#show the duplicated rows
movie_df[movie_df.duplicated()]
#drop fully duplicated rows, keeping the first occurrence
movie_df = movie_df.drop_duplicates(keep = 'first')
#view the changes made in the dataframe
movie_df

Unnamed: 0,MOVIES,YEAR,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross,Action,Adventure,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,Blood Red Sky,2021-01-01,6.1,A woman with a mysterious illness is forced in...,Director:Peter Thorwarth Stars:Peri Baume...,21062.0,121.0,0.0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,Masters of the Universe: Revelation,2021-01-01,5.0,The war for Eternia begins again in what may b...,"Stars:Chris Wood, Sarah Michelle Gellar, Lena ...",17870.0,25.0,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,The Walking Dead,2010-01-01,8.2,Sheriff Deputy Rick Grimes wakes up from a com...,"Stars:Andrew Lincoln, Norman Reedus, Melissa M...",885805.0,44.0,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Rick and Morty,2013-01-01,9.2,An animated series that follows the exploits o...,"Stars:Justin Roiland, Chris Parnell, Spencer G...",414849.0,23.0,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Army of Thieves,2021-01-01,6.9,"A prequel, set before the events of Army of th...",Director:Matthias Schweighöfer Stars:Matt...,15144.3,68.8,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9993,Totenfrau,2022-01-01,6.9,Add a Plot,"Director:Nicolai Rohde Stars:Felix Klare,...",15144.3,68.8,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
9995,Arcane,2021-01-01,6.9,Add a Plot,,15144.3,68.8,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
9996,Heart of Invictus,2022-01-01,6.9,Add a Plot,Director:Orlando von Einsiedel Star:Princ...,15144.3,68.8,0.0,0,0,...,0,0,0,0,0,1,0,0,0,0
9997,The Imperfects,2021-01-01,6.9,Add a Plot,Director:Jovanka Vuckovic Stars:Morgan Ta...,15144.3,68.8,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [45]:
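#several rows have been dropped, but to make sure no duplicates remain, we check the MOVIES column on its own
movie_df['MOVIES'].duplicated()
#drop duplicates within the column
#note: keep=False discards every title that appears more than once, and assigning the result back aligns on the index, so those titles become NaN
movie_df['MOVIES'] = movie_df['MOVIES'].drop_duplicates(keep = False)
#check for duplicates
movie_df['MOVIES'].duplicated()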

0       False
1       False
2       False
3       False
4       False
        ...  
9993    False
9995    False
9996    False
9997     True
9998     True
Name: MOVIES, Length: 8989, dtype: bool
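
The `True` values at the end of that output are worth a closer look. A minimal sketch on a made-up frame suggests where they come from: assigning a shorter, de-duplicated Series back to a column does not remove rows, it fills the unmatched index labels with NaN.

```python
import pandas as pd

# Toy frame: title "A" appears at index 0 and 2.
df = pd.DataFrame({"MOVIES": ["A", "B", "A", "C"]})

# keep=False leaves only the titles that occur exactly once (index 1 and 3).
unique_titles = df["MOVIES"].drop_duplicates(keep=False)

# Assignment aligns on the index: rows 0 and 2 have no match and become NaN.
df["MOVIES"] = unique_titles

print(len(df))                        # 4
print(df["MOVIES"].isna().tolist())   # [True, False, True, False]
```

The frame keeps all four rows, and the repeated NaN values are themselves reported as duplicates. To remove rows rather than blank out titles, `drop_duplicates` has to be called on the DataFrame with `subset='MOVIES'`.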

In [46]:
#i want to show the duplicated values in the MOVIES column, for a closer look
duplicate = movie_df[movie_df.duplicated(subset = 'MOVIES', keep = False)]
duplicate
#the movie title column now contains NaN (left behind by the previous step); this is unacceptable, as the column is a unique identifier.

Unnamed: 0,MOVIES,YEAR,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross,Action,Adventure,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
34,,2019-01-01,8.4,While strange rumors about their ill King grip...,"Stars:Ju Ji-Hoon, Bae Doona, Kim Sungkyu, Hye-...",34906.0,45.0,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0
139,,2021-01-01,6.8,Millions in stolen cash. Missing luxury bourbo...,"Stars:William Guirola, Megan Barlow, Scott Ben...",846.0,41.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
177,,2020-01-01,6.9,Seven years after the world has become a froze...,"Stars:Daveed Diggs, Iddo Goldberg, Mickey Sumn...",39433.0,60.0,0.0,1,0,...,0,0,0,1,0,0,0,0,0,0
235,,2021-01-01,4.7,"Hoping to say goodbye to superficial dating, r...",Star:Rob Delaney,592.0,68.8,0.0,0,0,...,0,1,1,0,0,0,0,0,0,0
247,,2009-01-01,9.1,Two brothers search for a Philosopher's Stone ...,"Stars:Kent Williams, Iemasa Kayumi, Vic Mignog...",134855.0,24.0,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9983,,2015-01-01,6.9,Add a Plot,Director:Paul Wilmshurst Stars:Alexander ...,15144.3,68.8,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0
9986,,2015-01-01,6.9,Add a Plot,,15144.3,68.8,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0
9987,,2015-01-01,6.9,Add a Plot,Director:Anthony Philipson,15144.3,68.8,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0
9997,,2021-01-01,6.9,Add a Plot,Director:Jovanka Vuckovic Stars:Morgan Ta...,15144.3,68.8,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [49]:
#i want to drop the NaN values from the dataset, but i would take a closer look before i do that
movie_df['MOVIES']

0                             Blood Red Sky
1       Masters of the Universe: Revelation
2                          The Walking Dead
3                            Rick and Morty
4                           Army of Thieves
                       ...                 
9993                              Totenfrau
9995                                 Arcane
9996                      Heart of Invictus
9997                                    NaN
9998                                    NaN
Name: MOVIES, Length: 8989, dtype: object

In [51]:
#there are definitely NaN values there, so i will drop them now.
#every NaN counts as a duplicate of the others, so keep=False removes all of those rows
movie_df = movie_df.drop_duplicates(subset = 'MOVIES', keep = False)
movie_df

Unnamed: 0,MOVIES,YEAR,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross,Action,Adventure,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,Blood Red Sky,2021-01-01,6.1,A woman with a mysterious illness is forced in...,Director:Peter Thorwarth Stars:Peri Baume...,21062.0,121.0,0.0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,Masters of the Universe: Revelation,2021-01-01,5.0,The war for Eternia begins again in what may b...,"Stars:Chris Wood, Sarah Michelle Gellar, Lena ...",17870.0,25.0,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,The Walking Dead,2010-01-01,8.2,Sheriff Deputy Rick Grimes wakes up from a com...,"Stars:Andrew Lincoln, Norman Reedus, Melissa M...",885805.0,44.0,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Rick and Morty,2013-01-01,9.2,An animated series that follows the exploits o...,"Stars:Justin Roiland, Chris Parnell, Spencer G...",414849.0,23.0,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Army of Thieves,2021-01-01,6.9,"A prequel, set before the events of Army of th...",Director:Matthias Schweighöfer Stars:Matt...,15144.3,68.8,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9845,Disparu à jamais,2021-01-01,6.9,Add a Plot,Director:Juan Carlos Medina Star:Bojesse ...,15144.3,68.8,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
9901,Modern Family,2009-01-01,8.2,"Jay must adapt to his young new wife, Gloria a...","Director:Jason Winer Stars:Ed O'Neill, So...",3404.0,23.0,0.0,0,0,...,0,0,1,0,0,0,0,0,0,0
9993,Totenfrau,2022-01-01,6.9,Add a Plot,"Director:Nicolai Rohde Stars:Felix Klare,...",15144.3,68.8,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
9995,Arcane,2021-01-01,6.9,Add a Plot,,15144.3,68.8,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0


In [52]:
#check to see if the changes were okay
movie_df['MOVIES'].duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
9845    False
9901    False
9993    False
9995    False
9996    False
Name: MOVIES, Length: 5955, dtype: bool
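
Reading a long boolean Series by eye is error-prone because the middle is elided. A more compact check, sketched on made-up titles, is to ask for a single answer with `is_unique` or to count the flagged rows.

```python
import pandas as pd

# Made-up titles with one repeat.
titles = pd.Series(["A", "B", "C", "B"])

print(titles.is_unique)           # False
print(titles.duplicated().sum())  # 1

# After removing the repeat, the column is unique.
deduped = titles.drop_duplicates(keep="first")
print(deduped.is_unique)          # True
```

On the real frame the equivalent would be `movie_df['MOVIES'].is_unique`, which returns one boolean instead of thousands.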

In [58]:
#next, i want to round VOTES to a whole number
movie_df['VOTES'] = movie_df['VOTES'].round(0)

#then i want to round the RunTime column to a whole number
movie_df['RunTime'] = movie_df['RunTime'].round(0)
movie_df

#votes are counts, so they should be ints, not floats
movie_df['VOTES'] = movie_df['VOTES'].astype('int64')
movie_df

#RunTime is in minutes, so int is the right data type for it too.
movie_df['RunTime'] = movie_df['RunTime'].astype('int64')
movie_df

Unnamed: 0,MOVIES,YEAR,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross,Action,Adventure,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,Blood Red Sky,2021-01-01,6.1,A woman with a mysterious illness is forced in...,Director:Peter Thorwarth Stars:Peri Baume...,21062,121,0.0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,Masters of the Universe: Revelation,2021-01-01,5.0,The war for Eternia begins again in what may b...,"Stars:Chris Wood, Sarah Michelle Gellar, Lena ...",17870,25,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,The Walking Dead,2010-01-01,8.2,Sheriff Deputy Rick Grimes wakes up from a com...,"Stars:Andrew Lincoln, Norman Reedus, Melissa M...",885805,44,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Rick and Morty,2013-01-01,9.2,An animated series that follows the exploits o...,"Stars:Justin Roiland, Chris Parnell, Spencer G...",414849,23,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Army of Thieves,2021-01-01,6.9,"A prequel, set before the events of Army of th...",Director:Matthias Schweighöfer Stars:Matt...,15144,69,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9845,Disparu à jamais,2021-01-01,6.9,Add a Plot,Director:Juan Carlos Medina Star:Bojesse ...,15144,69,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
9901,Modern Family,2009-01-01,8.2,"Jay must adapt to his young new wife, Gloria a...","Director:Jason Winer Stars:Ed O'Neill, So...",3404,23,0.0,0,0,...,0,0,1,0,0,0,0,0,0,0
9993,Totenfrau,2022-01-01,6.9,Add a Plot,"Director:Nicolai Rohde Stars:Felix Klare,...",15144,69,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
9995,Arcane,2021-01-01,6.9,Add a Plot,,15144,69,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
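
Rounding before the cast is not just cosmetic. A short sketch with made-up numbers shows that `astype('int64')` on its own truncates, and that the nullable `Int64` dtype is an option when a column still contains NaN (an assumption for illustration, since the columns above have no missing values left).

```python
import numpy as np
import pandas as pd

values = pd.Series([21062.0, 15144.3, 68.8])

# Casting directly drops the fractional part: 68.8 becomes 68.
truncated = values.astype("int64")

# Rounding first gives the nearest whole number: 68.8 becomes 69.
rounded = values.round(0).astype("int64")

# With NaN present, plain int64 cannot be used; nullable Int64 keeps the gap.
with_gap = pd.Series([121.0, np.nan, 68.8])
nullable = with_gap.round(0).astype("Int64")

print(truncated.tolist())  # [21062, 15144, 68]
print(rounded.tolist())    # [21062, 15144, 69]
```

The capital `I` in `Int64` is significant: it selects the pandas nullable integer type rather than the NumPy one.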


In [59]:
#that is a wrap. 
#what a journey that was!
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5955 entries, 0 to 9996
Data columns (total 35 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   MOVIES       5955 non-null   object        
 1   YEAR         5892 non-null   datetime64[ns]
 2   RATING       5955 non-null   float64       
 3   ONE-LINE     5955 non-null   object        
 4   STARS        5955 non-null   object        
 5   VOTES        5955 non-null   int64         
 6   RunTime      5955 non-null   int64         
 7   Gross        5955 non-null   float64       
 8   Action       5955 non-null   category      
 9   Adventure    5955 non-null   category      
 10  Animation    5955 non-null   category      
 11  Biography    5955 non-null   category      
 12  Comedy       5955 non-null   category      
 13  Crime        5955 non-null   category      
 14  Documentary  5955 non-null   category      
 15  Drama        5955 non-null   category      
 16  Family

In [60]:
#final description of the data.
movie_df.describe()

Unnamed: 0,RATING,VOTES,RunTime,Gross
count,5955.0,5955.0,5955.0,5955.0
mean,6.619647,21219.01,78.239127,3.247221
std,1.167889,80856.98,45.487856,24.696203
min,1.1,5.0,1.0,0.0
25%,5.9,461.5,55.0,0.0
50%,6.8,2047.0,70.0,0.0
75%,7.4,13817.0,98.0,0.0
max,9.4,1713028.0,853.0,486.3


In [61]:
#final shape of the dataset
movie_df.shape

(5955, 35)

In [62]:
#final look at the dataset
movie_df

Unnamed: 0,MOVIES,YEAR,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross,Action,Adventure,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,Blood Red Sky,2021-01-01,6.1,A woman with a mysterious illness is forced in...,Director:Peter Thorwarth Stars:Peri Baume...,21062,121,0.0,1,0,...,0,0,0,0,0,0,0,1,0,0
1,Masters of the Universe: Revelation,2021-01-01,5.0,The war for Eternia begins again in what may b...,"Stars:Chris Wood, Sarah Michelle Gellar, Lena ...",17870,25,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,The Walking Dead,2010-01-01,8.2,Sheriff Deputy Rick Grimes wakes up from a com...,"Stars:Andrew Lincoln, Norman Reedus, Melissa M...",885805,44,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,Rick and Morty,2013-01-01,9.2,An animated series that follows the exploits o...,"Stars:Justin Roiland, Chris Parnell, Spencer G...",414849,23,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Army of Thieves,2021-01-01,6.9,"A prequel, set before the events of Army of th...",Director:Matthias Schweighöfer Stars:Matt...,15144,69,0.0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9845,Disparu à jamais,2021-01-01,6.9,Add a Plot,Director:Juan Carlos Medina Star:Bojesse ...,15144,69,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
9901,Modern Family,2009-01-01,8.2,"Jay must adapt to his young new wife, Gloria a...","Director:Jason Winer Stars:Ed O'Neill, So...",3404,23,0.0,0,0,...,0,0,1,0,0,0,0,0,0,0
9993,Totenfrau,2022-01-01,6.9,Add a Plot,"Director:Nicolai Rohde Stars:Felix Klare,...",15144,69,0.0,0,0,...,0,0,0,0,0,0,0,1,0,0
9995,Arcane,2021-01-01,6.9,Add a Plot,,15144,69,0.0,1,1,...,0,0,0,0,0,0,0,0,0,0


In [64]:
#finally, i would save the dataset to my desktop
file_path = 'C:/Users/user/Desktop/cleaned_movie_data.csv'
movie_df.to_csv(file_path, index = False)
#see you next time! ciao!
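
One thing to keep in mind when saving to CSV is that the file stores plain text, so dtypes such as `category` and `datetime64` are not preserved. The sketch below uses a made-up frame and a temporary directory (not the desktop path above) to show a save followed by a reload.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame with a datetime column and a categorical indicator column.
df = pd.DataFrame(
    {
        "MOVIES": ["A", "B"],
        "YEAR": pd.to_datetime(["2021-01-01", "2010-01-01"]),
        "Action": pd.Series([1, 0]).astype("category"),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "cleaned_movie_data.csv"
    df.to_csv(out_path, index=False)
    # parse_dates restores the datetime column; the category dtype is not restored.
    reloaded = pd.read_csv(out_path, parse_dates=["YEAR"])

print(reloaded.dtypes)
```

If the dtypes need to survive the round trip, the conversions have to be reapplied after loading, or the frame can be stored in a format that keeps type information.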