In this notebook, I will walk you through the steps for cleaning and carrying out exploratory data analysis (EDA) on the IMDb TV series data from Kaggle: https://www.kaggle.com/datasets/suraj520/imdb-tv-series-data
The dataset contains information about TV series from IMDb, including details such as title, IMDb ID, release year, genre, cast, synopsis, rating, runtime, certificate, number of votes, and gross revenue. The data is scraped from the IMDb website using web scraping techniques and is organized into separate CSV files for each genre.

Importing Required Libraries:
* glob: This library is used to retrieve file paths matching specified patterns.
* pandas: It provides data manipulation and analysis capabilities.
* os: This library provides a way to interact with the operating system, including file and directory operations.
* zipfile: This library offers tools to create, read, write, and extract files from ZIP archives.

In [2]:
import glob
import pandas as pd
import os
from zipfile import ZipFile

The IMDb data ships as a ZIP file called archive.zip. To find it, start with an empty list called zip_files that will store the paths of any ZIP files found.
Walk the directory tree with os.walk("/home/anees/projects/EDA_and_data_cleaning").
For each directory, examine the files within it.
If a file name ends with ".zip", add its full path to the zip_files list.
Continue until all directories have been traversed.
Finally, print the zip_files list to display the paths of all ZIP files found.

In [3]:
zip_files = []
for root, dirs, files in os.walk("/home/anees/projects/EDA_and_data_cleaning"):
    for file in files:
        if file.endswith(".zip"):
            zip_files.append(os.path.join(root, file))

print(zip_files)

['/home/anees/projects/EDA_and_data_cleaning/venv/lib/python3.8/site-packages/importlib_resources/tests/zipdata02/ziptestdata.zip', '/home/anees/projects/EDA_and_data_cleaning/venv/lib/python3.8/site-packages/importlib_resources/tests/zipdata01/ziptestdata.zip', '/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/archive.zip']


After obtaining the list of ZIP file paths, you will notice that the "archive.zip" file is located at the last index. Utilizing the zipfile library, you can extract the contents of this ZIP file to a folder named "imdb_files" within the "raw_data/imdb" directory. Once extracted, you can verify the extraction process by printing the paths of the extracted CSV files.

In [4]:
with ZipFile(zip_files[-1], "r") as file:
    file.extractall(path="/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files")

csv_files = glob.glob("/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/*.csv")
for csv_file in csv_files:
    print(csv_file)

/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/adventure_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/drama_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/fantasy_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/history_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/biography_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/crime_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/sci-fi_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/music_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/western_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/action_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/thriller_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw
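Picking the archive with zip_files[-1] depends on the order in which os.walk happens to visit directories, and the search above also returned test ZIP files from the virtual environment. As a minimal sketch, assuming the archive is always named archive.zip, you could select it by name instead. The helper find_archive and the temporary folder layout below are illustrative and not part of the original notebook:

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile


def find_archive(project_root, name="archive.zip"):
    """Return the first file called `name` under project_root, ignoring venv folders."""
    root = Path(project_root)
    for path in sorted(root.rglob(name)):
        # Skip anything that lives inside a virtual environment directory.
        if "venv" not in path.relative_to(root).parts:
            return path
    raise FileNotFoundError(f"{name} not found under {root}")


# Demonstration on a throwaway directory that mimics raw_data/imdb/archive.zip.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "raw_data" / "imdb"
    data_dir.mkdir(parents=True)
    with ZipFile(data_dir / "archive.zip", "w") as zf:
        zf.writestr("drama_series.csv", "Title,Rating\nExample,7.0\n")

    archive = find_archive(tmp)
    target = archive.parent / "imdb_files"
    with ZipFile(archive) as zf:
        zf.extractall(target)
    extracted = sorted(p.name for p in target.glob("*.csv"))

print(extracted)
```

In the real project you would call find_archive with the project directory instead of the temporary one.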

To process the extracted CSV files and create a consolidated DataFrame, you can iterate through the list of CSV file paths, read each CSV file using pd.read_csv(), and append the resulting DataFrames to a list called dataframes. Finally, you can use pd.concat() to concatenate the DataFrames into a single DataFrame called imdb_df.
By executing this code, you will obtain the consolidated DataFrame imdb_df, which contains the data from all the extracted CSV files.

In [5]:
dataframes = []
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    dataframes.append(df)
imdb_df = pd.concat(dataframes, ignore_index=True)
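Because the data is split into one CSV per genre, it can help to remember which file each row came from, for example when investigating repeated rows later on. The sketch below is one possible way to do that; the source_file column name and the in-memory stand-ins for the CSV files are assumptions made for illustration:

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for two of the per-genre CSV files.
fake_files = {
    "drama_series.csv": "Title,IMDb ID\nShow A,tt0000001\n",
    "crime_series.csv": "Title,IMDb ID\nShow A,tt0000001\nShow B,tt0000002\n",
}

# Read each file and tag its rows with the file they came from.
frames = [
    pd.read_csv(io.StringIO(text)).assign(source_file=name)
    for name, text in fake_files.items()
]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With real files you would pass each path to pd.read_csv() and use the file name as the tag.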

To retrieve information about the structure and summary of the consolidated DataFrame imdb_df, you can use the info() method.
Here's the code:

In [6]:
imdb_df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236828 entries, 0 to 236827
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Title            236828 non-null  object 
 1   IMDb ID          236828 non-null  object 
 2   Release Year     236819 non-null  object 
 3   Genre            236828 non-null  object 
 4   Cast             235956 non-null  object 
 5   Synopsis         236828 non-null  object 
 6   Rating           236828 non-null  float64
 7   Runtime          216983 non-null  object 
 8   Certificate      169091 non-null  object 
 9   Number of Votes  236828 non-null  object 
 10  Gross Revenue    45611 non-null   object 
dtypes: float64(1), object(10)
memory usage: 19.9+ MB


The imdb_df.info() command provides you with information about the structure and summary of the consolidated DataFrame imdb_df. 
The output provides the following details:
* The DataFrame has a RangeIndex with 236,828 entries, ranging from index 0 to index 236,827.
* There are 11 columns in the DataFrame.
* Each column is listed along with its non-null count and data type.
* The DataFrame contains a mix of data types, with 10 columns being of type object and 1 column being of type float64.
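* The memory usage of the DataFrame is reported as 19.9+ MB. The + marks this as a lower bound, because pandas does not measure the contents of object columns unless you call info(memory_usage="deep").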
This information helps you understand the composition and structure of the DataFrame, including the number of entries, data types, and missing values for each column.
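The non-null counts reported by info() can also be expressed as the share of missing values per column, which is often easier to compare across columns. Here is a minimal sketch on a small made-up frame; the values are placeholders, not rows from the dataset:

```python
import pandas as pd

# Small illustrative frame with gaps in two columns.
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "runtime": ["135 min", None, "91 min", None],
    "gross_revenue": [None, None, None, "54,239,856"],
})

# isna() marks missing cells; the column mean of booleans is the missing share.
missing_share = demo.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to imdb_df, the same expression would rank columns such as gross_revenue and certificate by how incomplete they are.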

To enhance the DataFrame's structure, you can modify the column names to adhere to professional conventions.

You can make the column names consistent by replacing any spaces with underscores (_) and converting them to lowercase. You can then call the info() method again to verify that the changes have taken effect.

Consider the following code:

In [7]:
imdb_df.columns = imdb_df.columns.str.replace(' ', '_').str.lower()
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236828 entries, 0 to 236827
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   title            236828 non-null  object 
 1   imdb_id          236828 non-null  object 
 2   release_year     236819 non-null  object 
 3   genre            236828 non-null  object 
 4   cast             235956 non-null  object 
 5   synopsis         236828 non-null  object 
 6   rating           236828 non-null  float64
 7   runtime          216983 non-null  object 
 8   certificate      169091 non-null  object 
 9   number_of_votes  236828 non-null  object 
 10  gross_revenue    45611 non-null   object 
dtypes: float64(1), object(10)
memory usage: 19.9+ MB
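The single str.replace(' ', '_') call works for this dataset. If column names might also carry leading or trailing spaces or runs of whitespace, a slightly more defensive variant could look like the sketch below. The messy column names are invented for illustration:

```python
import pandas as pd

# An empty frame is enough to demonstrate renaming the columns.
demo = pd.DataFrame(columns=["Title", " IMDb ID", "Number of  Votes ", "Gross Revenue"])

# Trim the ends, lowercase, then collapse any run of whitespace into one underscore.
demo.columns = (
    demo.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
)
clean_names = list(demo.columns)
print(clean_names)
```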


To preview the data in the DataFrame, you can use the head(), tail(), and sample() methods. By default, head() shows the first 5 rows of the DataFrame, tail() shows the last 5 rows, and sample() returns a single random row. Each method accepts an argument specifying the number of rows to return.
For example, imdb_df.head(10) will display the top 10 rows, imdb_df.tail(3) will display the last 3 rows and imdb_df.sample(6) randomly selects and returns 6 rows from the DataFrame.

By utilizing these methods, you can preview the data in the DataFrame and get a sense of its contents and structure.

In [9]:
imdb_df.head(3)

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
0,The Little Mermaid,tt5971474,I) (2023,"Adventure, Family, Fantasy","Director:, Rob Marshall, | , Stars:, Halle...",A young mermaid makes a deal with a sea witch ...,7.2,135 min,PG,69638,
1,Spider-Man: Across the Spider-Verse,tt9362722,2023,"Animation, Action, Adventure","Directors:, Joaquim Dos Santos, , Kemp Powers,...","Miles Morales catapults across the Multiverse,...",9.1,140 min,PG,71960,
2,FUBAR,tt13064902,2023–,"Action, Adventure, Thriller","Stars:, Arnold Schwarzenegger, , Monica Barbar...",A C.I.A. operative on the edge of retirement d...,6.5,,TV-MA,15422,


In [10]:
imdb_df.tail(2)

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
236826,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663
236827,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856


In [11]:
imdb_df.sample(5)

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
87138,Rhythm and Weep,tt0038881,1946,"Short, Comedy, Music","Director:, Jules White, | , Stars:, Moe Ho...",The stooges are actors who can't seem to find ...,7.0,17 min,,399,
85250,Karan Aujla: Na Na Na,tt9603874,2018 Music Video,Music,"Director:, Rupan Bal, | , Stars:, Sukh Baj...",Add a Plot,6.5,4 min,,10,
195620,Byker Grove,tt0163437,1989–2006,"Drama, Family, Romance","Stars:, Daymon Britton, , Brett Adams, , Kerry...",Byker Grove follows the lives and relationship...,6.1,30 min,,905,
59897,The Morning After,tt0091554,1986,"Crime, Mystery, Romance","Director:, Sidney Lumet, | , Stars:, Jane ...","A washed up, alcoholic actress who is prone to...",5.9,103 min,R,7051,25147055.0
213037,Rocky,tt0075148,1976,"Drama, Sport","Director:, John G. Avildsen, | , Stars:, S...",A small-time Philadelphia boxer gets a supreme...,8.1,120 min,PG,598945,117235247.0
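Because sample() picks rows at random, each run of the cell shows different rows. If you want a preview that other readers can reproduce, you can pass a fixed random_state. A short sketch on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({"title": [f"Show {i}" for i in range(20)]})

# The same seed yields the same random selection on every run.
first = demo.sample(5, random_state=42)
second = demo.sample(5, random_state=42)
print(first.equals(second))
```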


One thing to decide early in a data cleaning and EDA process is which column to use as the index. An index column provides labels for the rows in the DataFrame and lets you uniquely identify and access specific rows by those labels. If you don't specify an index column, pandas assigns a default integer index starting from 0. Setting a meaningful index, however, can make later analysis and manipulation easier.

Looking at the output of the info, head, tail and sample methods, one column that looks like it can be used as an index is imdb_id. Let's explore further. To see the content of the column, use the code below:

In [12]:
imdb_df["imdb_id"].sample(10)

106903     tt2309320
219118     tt6004144
93032      tt0046422
191554    tt11083696
37875      tt0082883
76887      tt3234050
216473    tt10442802
60482     tt12825632
208682     tt0950775
99578      tt1192628
Name: imdb_id, dtype: object

Since you want the column to uniquely identify each row in the DataFrame, you also have to check that it contains no duplicate values:

In [13]:
imdb_df["imdb_id"].duplicated().sum()

127631

There are 127,631 duplicated values in the column. You can confirm whether these are genuinely duplicated rows or whether the same imdb_id is attached to different records.

In [15]:
imdb_df.head(100).duplicated()

0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Length: 100, dtype: bool

Selecting the first 100 rows with head() shows no duplicates. Keep in mind that duplicated() on a slice only compares rows within that slice. Let's try tail() instead; if that shows nothing either, we can fall back on sample().

In [16]:
imdb_df.tail(100).duplicated()

236728    False
236729    False
236730    False
236731    False
236732    False
          ...  
236823     True
236824     True
236825     True
236826     True
236827     True
Length: 100, dtype: bool

The tail() output, on the other hand, shows that records at the end of the DataFrame are duplicated. Select the last record in the DataFrame using iloc:

In [17]:
last_row = imdb_df.iloc[-1]
print(last_row)

title                                                      Evil Dead
imdb_id                                                    tt1288558
release_year                                                    2013
genre                                                         Horror
cast               Director:, Fede Alvarez, | ,     Stars:, Jane ...
synopsis           Five friends head to a remote cabin, where the...
rating                                                           6.5
runtime                                                       91 min
certificate                                                        R
number_of_votes                                               187215
gross_revenue                                             54,239,856
Name: 236827, dtype: object


Using the returned information, look up every occurrence of Evil Dead to confirm whether the rows are duplicates:

In [18]:
imdb_df[imdb_df["title"] == "Evil Dead"]

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
225299,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
234759,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
234806,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
234853,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
234900,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
234947,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
234994,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
235041,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
235088,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
235135,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856


From the result, we can see that every row titled Evil Dead has the same imdb_id and the same values in the displayed columns, so they are genuine duplicates. To be safe, let's pick another duplicated row and repeat the process.

In [19]:
second_last_row = imdb_df.iloc[-2]
print(second_last_row)

title                                                         Scream
imdb_id                                                    tt0117571
release_year                                                    1996
genre                                                Horror, Mystery
cast               Director:, Wes Craven, | ,     Stars:, Neve Ca...
synopsis           A year after the murder of her mother, a teena...
rating                                                           7.4
runtime                                                      111 min
certificate                                                        R
number_of_votes                                               364584
gross_revenue                                            103,046,663
Name: 236826, dtype: object


In [20]:
imdb_df[imdb_df["title"] == 'Scream']

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
112330,Scream,tt11245972,I) (2022,"Horror, Mystery, Thriller","Directors:, Matt Bettinelli-Olpin, , Tyler Gil...",25 years after a streak of brutal murders shoc...,6.3,114 min,R,137554,81641405
181578,Scream,tt11245972,I) (2022,"Horror, Mystery, Thriller","Directors:, Matt Bettinelli-Olpin, , Tyler Gil...",25 years after a streak of brutal murders shoc...,6.3,114 min,R,137554,81641405
181618,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364588,103046663
186596,Scream,tt0086262,1981,"Horror, Mystery","Director:, Byron Quisenberry, | , Stars:, ...",A group of friends on a rafting trip down a ri...,2.9,82 min,R,866,1083395
225285,Scream,tt11245972,I) (2022,"Horror, Mystery, Thriller","Directors:, Matt Bettinelli-Olpin, , Tyler Gil...",25 years after a streak of brutal murders shoc...,6.3,114 min,R,137541,81641405
...,...,...,...,...,...,...,...,...,...,...,...
236732,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663
236766,Scream,tt11245972,I) (2022,"Horror, Mystery, Thriller","Directors:, Matt Bettinelli-Olpin, , Tyler Gil...",25 years after a streak of brutal murders shoc...,6.3,114 min,R,137541,81641405
236779,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663
236813,Scream,tt11245972,I) (2022,"Horror, Mystery, Thriller","Directors:, Matt Bettinelli-Olpin, , Tyler Gil...",25 years after a streak of brutal murders shoc...,6.3,114 min,R,137541,81641405


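Even though several different productions share the title Scream, each one has its own imdb_id (tt11245972, tt0117571 and tt0086262), and rows that share an imdb_id describe the same production. Some of those repeated rows differ only in number_of_votes (for example 364588 versus 364584 for tt0117571), presumably because the files were scraped at slightly different times. We can therefore conclude that each imdb_id identifies a single title, and safely drop the rows whose imdb_id has already appeared.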

In [21]:
imdb_df = imdb_df.drop_duplicates(subset="imdb_id")
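The spot checks on Evil Dead and Scream can be generalized to the whole DataFrame, so the decision to drop rows does not rest on two examples. The sketch below uses a small made-up frame whose values loosely mirror the outputs above; the shortened IDs are placeholders. It checks whether any imdb_id is attached to more than one title and compares exact-row duplicates with ID-level duplicates:

```python
import pandas as pd

df = pd.DataFrame({
    "imdb_id": ["tt1", "tt1", "tt2", "tt3", "tt3"],
    "title": ["Scream", "Scream", "Scream", "Evil Dead", "Evil Dead"],
    "number_of_votes": ["364588", "364584", "866", "187215", "187215"],
})

# An ID that maps to more than one title would signal a real conflict.
titles_per_id = df.groupby("imdb_id")["title"].nunique()
conflicting_ids = titles_per_id[titles_per_id > 1]

# Rows identical in every column versus rows that merely repeat an ID.
exact_dupes = int(df.duplicated().sum())
id_dupes = int(df["imdb_id"].duplicated().sum())

# keep="first" retains the first row seen for each ID.
deduped = df.drop_duplicates(subset="imdb_id", keep="first")
print(len(conflicting_ids), exact_dupes, id_dupes, len(deduped))
```

When id_dupes is larger than exact_dupes, some repeated IDs differ in at least one column, which is why the drop is keyed on imdb_id rather than on whole rows.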


Then verify that the duplicates have been dropped.

In [22]:
imdb_df["imdb_id"].duplicated().sum()

0

This time the result is zero, meaning there are no duplicate values left in the column. We can now safely use imdb_id as our index column.

In [25]:
imdb_df.set_index("imdb_id", inplace=True)
imdb_df.info()

KeyError: "None of ['imdb_id'] are in the columns"

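The KeyError above appears because this cell was run a second time: after the first run imdb_id is already the index, so it no longer exists as a column and set_index cannot find it. On a first run, set_index succeeds and info() no longer lists imdb_id among the columns. Moving on, we can continue cleaning by dropping any columns that are not needed for the analysis.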

In [28]:
imdb_df.drop("synopsis", axis=1, inplace=True)
imdb_df.head()

KeyError: "['synopsis'] not found in axis"
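Both KeyError messages above are typical of running a notebook cell more than once: the first run changes the DataFrame in place, and the second run no longer finds the column. One way to make such steps safe to re-run is sketched below. The helper name prepare and the tiny frame are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "imdb_id": ["tt1", "tt2"],
    "title": ["A", "B"],
    "synopsis": ["x", "y"],
})


def prepare(frame):
    """Set imdb_id as the index and drop synopsis, tolerating repeated calls."""
    if "imdb_id" in frame.columns:
        frame = frame.set_index("imdb_id")
    # errors="ignore" skips the drop when the column is already gone.
    return frame.drop(columns=["synopsis"], errors="ignore")


once = prepare(df)
twice = prepare(once)  # a second call changes nothing and raises no error
print(twice)
```

Returning a new DataFrame instead of using inplace=True also leaves the original frame untouched.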

The KeyError above is again the result of re-running the cell: synopsis had already been dropped on the first run. By now you might have noticed that the release_year column contains characters that do not belong in a year, such as "I) (2023" or "2018 Music Video". Since we know the column should hold only digits, you can clean it by removing every character that is not a digit.
The code uses the str.replace() method with the regular expression pattern r'[^\d]' to match any character that is not a digit. The regex=True parameter ensures that the pattern is interpreted as a regular expression. Replacing the matched characters with an empty string removes all non-digit characters from the 'release_year' column.

After executing the code, the 'release_year' column will only contain the cleaned numerical values.

In [29]:
imdb_df['release_year'] = imdb_df['release_year'].str.replace(r'[^\d]', '', regex=True)
imdb_df.head()

Unnamed: 0_level_0,title,release_year,genre,cast,rating,runtime,certificate,number_of_votes,gross_revenue
imdb_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
tt5971474,The Little Mermaid,2023,"Adventure, Family, Fantasy","Director:, Rob Marshall, | , Stars:, Halle...",7.2,135 min,PG,69638,
tt9362722,Spider-Man: Across the Spider-Verse,2023,"Animation, Action, Adventure","Directors:, Joaquim Dos Santos, , Kemp Powers,...",9.1,140 min,PG,71960,
tt13064902,FUBAR,2023,"Action, Adventure, Thriller","Stars:, Arnold Schwarzenegger, , Monica Barbar...",6.5,,TV-MA,15422,
tt5433140,Fast X,2023,"Action, Adventure, Crime","Director:, Louis Leterrier, | , Stars:, Vi...",6.3,141 min,PG-13,39326,
tt6791350,Guardians of the Galaxy Vol. 3,2023,"Action, Adventure, Comedy","Director:, James Gunn, | , Stars:, Chris P...",8.2,150 min,PG-13,160447,


After this step the 'release_year' column contains only digits.
However, the clean-up introduces a new problem for entries that span a range of years: removing every non-digit character also removes the dash that separates the two years, so a value such as 1989–2006 becomes 19892006.

In [33]:
imdb_df.sample()

Unnamed: 0_level_0,title,release_year,genre,cast,rating,runtime,certificate,number_of_votes,gross_revenue
imdb_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
tt0198181,Mickey Mouse Works,1999-2001,"Animation, Short, Comedy","Stars:, Bill Farmer, , Tony Anselmo, , Wayne A...",7.1,30 min,TV-Y,422,


To address this problem, you can use a regular expression to put a hyphen back between the two years.
The following code uses the str.replace() method with the pattern (\d{4})(\d{4}) to match a string of eight digits representing two years. The pattern captures the first four digits and the next four digits separately, and the replacement \1-\2 inserts a hyphen '-' between the two captured groups. Note that entries for ongoing series, such as 2023–, were reduced to a single year by the previous step and are not affected by this pattern.

In [34]:
imdb_df['release_year'] = imdb_df['release_year'].str.replace(r'(\d{4})(\d{4})', r'\1-\2', regex=True)
imdb_df.tail()

Unnamed: 0_level_0,title,release_year,genre,cast,rating,runtime,certificate,number_of_votes,gross_revenue
imdb_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
tt0070445,Terror Circus,1973,Horror,"Director:, Alan Rudolph, | , Stars:, Andre...",4.1,86 min,R,732,
tt1971558,The Sleeper,2012,"Horror, Thriller","Director:, Justin Russell, | , Stars:, Bri...",3.9,90 min,Not Rated,1359,
tt13964002,The Fear Footage: 3AM,2021,Horror,"Director:, Ricky Umberger, | , Stars:, Ale...",4.2,71 min,,258,
tt5532370,Two Pigeons,2017,"Comedy, Horror, Thriller","Director:, Dominic Bridges, | , Stars:, Ko...",5.5,80 min,TV-MA,1253,
tt15485388,Death Bed Game,2022,"Short, Action, Horror","Director:, Rui Constantino, | , Stars:, Ru...",7.7,32 min,,246,
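The two-step approach above (strip everything, then re-insert the hyphen) can also be replaced by extracting the years directly from the raw strings. The sketch below is an alternative, not the method used in this notebook; the column names start_year and end_year are illustrative, and the sample values are taken from the previews earlier on:

```python
import pandas as pd

raw = pd.Series(["I) (2023", "2023–", "1989–2006", "2018 Music Video", None])

# First four-digit run is the start year; an optional second run is the end year.
years = raw.str.extract(r"(?P<start_year>\d{4})(?:\D*(?P<end_year>\d{4}))?")
print(years)
```

Keeping the two years in separate columns also makes it straightforward to convert them to numbers later.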


Next, rename the runtime column to runtime(in_minutes) and remove the "min" suffix from its values. Then change the data type from object to a numeric type so that calculations can be performed on it.

In [35]:
imdb_df = imdb_df.rename(columns={"runtime":"runtime(in_minutes)"})
imdb_df.head()

Unnamed: 0_level_0,title,release_year,genre,cast,rating,runtime(in_minutes),certificate,number_of_votes,gross_revenue
imdb_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
tt5971474,The Little Mermaid,2023,"Adventure, Family, Fantasy","Director:, Rob Marshall, | , Stars:, Halle...",7.2,135 min,PG,69638,
tt9362722,Spider-Man: Across the Spider-Verse,2023,"Animation, Action, Adventure","Directors:, Joaquim Dos Santos, , Kemp Powers,...",9.1,140 min,PG,71960,
tt13064902,FUBAR,2023,"Action, Adventure, Thriller","Stars:, Arnold Schwarzenegger, , Monica Barbar...",6.5,,TV-MA,15422,
tt5433140,Fast X,2023,"Action, Adventure, Crime","Director:, Louis Leterrier, | , Stars:, Vi...",6.3,141 min,PG-13,39326,
tt6791350,Guardians of the Galaxy Vol. 3,2023,"Action, Adventure, Comedy","Director:, James Gunn, | , Stars:, Chris P...",8.2,150 min,PG-13,160447,


In [36]:
imdb_df["runtime(in_minutes)"] = imdb_df["runtime(in_minutes)"].str.replace("min", "")
imdb_df.sample(5)

Unnamed: 0_level_0,title,release_year,genre,cast,rating,runtime(in_minutes),certificate,number_of_votes,gross_revenue
imdb_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
tt2450334,No Clue,2013,"Comedy, Crime, Mystery","Director:, Carl Bessai, | , Stars:, Brent ...",5.7,96,Not Rated,1461,
tt3598496,Texas Rising,2015,"Drama, History, Western","Stars:, Bill Paxton, , Jeffrey Dean Morgan, , ...",6.7,60,TV-14,4362,
tt0033352,Arizona Cyclone,1941,"Drama, Music, Western","Director:, Joseph H. Lewis, | , Stars:, Jo...",5.3,59,Passed,62,
tt4320352,Zindagi Ek Safar,1988,"Documentary, Biography, Musical","Director:, Sandip Ray, | , Stars:, Kishore...",8.4,120,,36,
tt2552394,Barney Thomson,2015,"Comedy, Crime","Director:, Robert Carlyle, | , Stars:, Rob...",6.2,96,Unrated,6287,


In [24]:
imdb_df["runtime(in_minutes)"] = imdb_df["runtime(in_minutes)"].astype("int64")
print(imdb_df["runtime(in_minutes)"].dtype)

ValueError: cannot convert float NaN to integer

The error "ValueError: cannot convert float NaN to integer" occurs because the column contains NaN (missing) values, and NumPy's int64 type has no way to represent a missing value, so astype("int64") fails.

To get past this, deal with the NaN values before converting the column. One approach is to use fillna() to replace them with a default value (e.g., 0), although a placeholder like 0 will distort statistics such as the mean runtime. Another is to use dropna() to remove the rows with missing runtimes, which is what the next cell does:

In [37]:
imdb_df.dropna(subset=["runtime(in_minutes)"], inplace=True)

In [38]:
imdb_df["runtime(in_minutes)"] = imdb_df["runtime(in_minutes)"].astype("int64")
print(imdb_df["runtime(in_minutes)"].dtype)

ValueError: invalid literal for int() with base 10: '1,256 '

The error "invalid literal for int() with base 10" occurs when a string that is not a valid integer representation is converted to an integer. Here the offending value is '1,256 '.

The string contains a comma and a trailing space, neither of which is valid in an integer literal. To resolve this, remove every non-digit character from the column before converting its data type. The pattern \D matches any character that is not a digit:

In [39]:
imdb_df["runtime(in_minutes)"] = imdb_df["runtime(in_minutes)"].str.replace(r'\D', '', regex=True)

In [42]:
imdb_df["runtime(in_minutes)"] = imdb_df["runtime(in_minutes)"].astype("Int64")
print(imdb_df["runtime(in_minutes)"])

imdb_id
tt5971474     135
tt9362722     140
tt5433140     141
tt6791350     150
tt2906216     134
             ... 
tt0070445      86
tt1971558      90
tt13964002     71
tt5532370      80
tt15485388     32
Name: runtime(in_minutes), Length: 97918, dtype: Int64

In [43]:
imdb_df.dtypes

title                   object
release_year            object
genre                   object
cast                    object
rating                 float64
runtime(in_minutes)      Int64
certificate             object
number_of_votes         object
gross_revenue           object
dtype: object
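The runtime clean-up above took several cells and dropped every row without a runtime. As a hedged alternative, the sketch below does the conversion in one pass and keeps missing runtimes as <NA> by using the nullable Int64 dtype. The sample values are made up, apart from '1,256', which echoes the value that caused the error earlier:

```python
import pandas as pd

runtime = pd.Series(["135 min", None, "1,256 min", "91 min"])

# Remove every non-digit character, then parse what is left as numbers.
digits = runtime.str.replace(r"\D", "", regex=True)
minutes = pd.to_numeric(digits, errors="coerce").astype("Int64")
print(minutes)
```

Whether to keep or drop rows with a missing runtime depends on the analysis you plan to run; this variant only postpones that decision.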

The title column currently has the object dtype, which is roughly analogous to str in native Python. It is the catch-all type for any field that does not fit neatly as numerical or categorical data. You can leave a column as object, but more specific data types are generally preferable because they make the intent explicit and can improve memory usage and performance. A first attempt at converting the column to strings is astype(str). As the output below shows, in the pandas version used here the column is still reported as object afterwards, because astype(str) converts the values to Python strings without changing the column's dtype:

In [29]:
imdb_df["title"] = imdb_df["title"].astype(str)
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 109197 entries, tt5971474 to tt15485388
Data columns (total 10 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   title            109197 non-null  object 
 1   release_year     109191 non-null  object 
 2   genre            109197 non-null  object 
 3   cast             108567 non-null  object 
 4   synopsis         109197 non-null  object 
 5   rating           109197 non-null  float64
 6   runtime          97918 non-null   object 
 7   certificate      60292 non-null   object 
 8   number_of_votes  109197 non-null  object 
 9   gross_revenue    13253 non-null   object 
dtypes: float64(1), object(9)
memory usage: 9.2+ MB


In [28]:
print(imdb_df["title"].dtype)

object
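Since the column is still reported as object, one option is pandas' dedicated string dtype, requested with astype("string"). The following is a minimal sketch on a made-up Series rather than on imdb_df itself:

```python
import pandas as pd

titles = pd.Series(["Evil Dead", "Scream", None], dtype="object")

# "string" selects pandas' StringDtype; missing values stay missing.
as_string = titles.astype("string")
print(as_string.dtype)
```

On the real data this would be imdb_df["title"].astype("string").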


Moving on, you can check the total number of fully duplicated rows in the DataFrame with the following:

In [None]:
imdb_df.duplicated().sum()

In [None]:
imdb_df["title"].value_counts()

In [None]:
imdb_df[imdb_df['title'] == "Boys"]

In [None]:
imdb_df.tail()