# Task 1: Data Cleaning and Preprocessing
<b>Objective:</b> Clean and prepare a raw dataset (with nulls, duplicates, inconsistent formats).<br><br>
<b>Tools:</b> Excel / Python (Pandas)<br><br>
<b>Dataset from Kaggle</b>: https://www.kaggle.com/datasets/shivamb/netflix-shows<br><br>
<b>Deliverables (Summary of Changes):</b>
- Loaded and explored the Netflix dataset.
- Dropped unnecessary columns: description and date_added.
- Handled missing data by removing rows with null values, which reduced the dataset from 8,807 to 5,332 rows.
- Renamed all columns for consistency and clarity (e.g., show_id to Show ID).
- Ran .drop_duplicates() to remove duplicate records; none were found, so the row count stayed at 5,332.
- Converted the Release Year column from int64 to datetime64[ns] using pd.to_datetime(), so each year is stored as January 1 of that year (e.g., 1993 becomes 1993-01-01).
- Displayed the dataset shape and column types, and verified the changes with .info() and .head().
<br><br>

<b>Hints / Mini Guide:</b>
- Identify and handle missing values using .isnull() in Python or filters in Excel.
- Remove duplicate rows using .drop_duplicates() or Excel’s “Remove Duplicates”.
- Standardize text values like gender, country names, etc.
- Convert date formats to a consistent type (e.g., dd-mm-yyyy).
- Rename column headers to be clean and uniform (e.g., lowercase, no spaces).
- Check and fix data types (e.g., age should be int, date as datetime).
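
The text and date hints above can be sketched on a small made-up frame. The column names and values below are illustrative assumptions and are not taken from the Netflix dataset.

```python
import pandas as pd

# Hypothetical sample with inconsistent text and date strings.
raw = pd.DataFrame({
    "country": [" india", "INDIA ", "United states"],
    "joined": ["25-09-2021", "24-09-2021", "not a date"],
})

clean = raw.copy()
# Trim whitespace and normalise capitalisation so equal values compare equal.
clean["country"] = clean["country"].str.strip().str.title()
# Parse dd-mm-yyyy strings; unparseable entries become NaT instead of raising.
clean["joined"] = pd.to_datetime(clean["joined"], format="%d-%m-%Y", errors="coerce")

print(clean)
print(clean.dtypes)
```

Using `errors="coerce"` is a design choice: it keeps the conversion from failing on a single bad value, at the cost of having to check for `NaT` afterwards.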

In [1]:
import pandas as pd

In [2]:
# Read the file as UTF-8 so that accented characters in names are decoded correctly.
df = pd.read_csv(r'D:\Mahi\Work\ELEVATE LABS Data Analyst Internship\Task 1\netflix_titles.csv', encoding='utf-8')
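
The `encoding` argument decides how non-ASCII characters, such as accented names, are decoded. A minimal sketch in plain Python follows, assuming the CSV bytes are UTF-8 (an assumption about the Kaggle file, not something the notebook confirms). It compares decoding such bytes with `unicode_escape` and with `utf-8`; the name is only an illustrative value.

```python
# Illustrative name containing a non-ASCII character.
name = "Jannis Niewöhner"
utf8_bytes = name.encode("utf-8")

# Decoding UTF-8 bytes with 'unicode_escape' treats each byte as a
# separate Latin-1 character, so one accented letter becomes two symbols.
garbled = utf8_bytes.decode("unicode_escape")
# Decoding with the codec the bytes were written in round-trips cleanly.
restored = utf8_bytes.decode("utf-8")

print(repr(garbled))
print(repr(restored))
```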

In [3]:
df.shape

(8807, 12)

In [4]:
df.head()

,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [6]:
df.drop(['description','date_added'], axis=1, inplace=True)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   release_year  8807 non-null   int64 
 7   rating        8803 non-null   object
 8   duration      8804 non-null   object
 9   listed_in     8807 non-null   object
dtypes: int64(1), object(9)
memory usage: 688.2+ KB


In [8]:
pd.isnull(df)

,show_id,type,title,director,cast,country,release_year,rating,duration,listed_in
0,False,False,False,False,True,False,False,False,False,False
1,False,False,False,True,False,False,False,False,False,False
2,False,False,False,False,False,True,False,False,False,False
3,False,False,False,True,True,True,False,False,False,False
4,False,False,False,True,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8802,False,False,False,False,False,False,False,False,False,False
8803,False,False,False,True,True,True,False,False,False,False
8804,False,False,False,False,False,False,False,False,False,False
8805,False,False,False,False,False,False,False,False,False,False


In [9]:
pd.isnull(df).sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
release_year       0
rating             4
duration           3
listed_in          0
dtype: int64
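
Raw null counts are easier to judge as a share of all rows: `director`, for example, is missing in 2634 of 8807 rows, or about 30%. A small sketch of that calculation follows, using a hypothetical toy frame rather than the real data.

```python
import pandas as pd

# Hypothetical miniature of a catalogue with gaps.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["PG", "TV-MA", None, "TV-14"],
})

# isnull().mean() is the fraction of missing values per column.
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```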

In [10]:
df.shape

(8807, 10)

In [14]:
df.dropna(inplace=True)

In [15]:
df.shape

(5332, 10)
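
Dropping every row that contains any null removes 3,475 of the 8,807 rows. As a hedged alternative, which is not what this notebook does, descriptive columns could be filled with a placeholder and only rows missing an essential column dropped. The toy frame, the `"Unknown"` placeholder, and the choice of `rating` as the essential column are all assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["X", None, "Y"],
    "rating": ["PG", "TV-MA", None],
})

# Option 1 (as in the notebook): drop any row containing a null.
dropped = toy.dropna()

# Option 2: fill a descriptive column with a placeholder, then drop
# only the rows that lack a column treated as essential.
filled = toy.fillna({"director": "Unknown"}).dropna(subset=["rating"])

print(len(dropped), len(filled))
```

Which option fits depends on the later analysis: a placeholder keeps more rows but adds an artificial category that has to be handled when grouping by director.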

In [16]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country',
       'release_year', 'rating', 'duration', 'listed_in'],
      dtype='object')

In [20]:
df.rename(columns={'show_id':'Show ID','type':'Type','title':'Title','director':'Director','cast':'Cast','country':'Country','release_year':'Release Year','rating':'Rating','duration':'Duration','listed_in':'Listed In'}, inplace=True)
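
The same renaming can be done without typing every pair by hand. The sketch below assumes snake_case input names; `prettify` and `overrides` are hypothetical helpers introduced only for this example.

```python
import pandas as pd

# Empty frame with a few of the original column names.
toy = pd.DataFrame(columns=["show_id", "release_year", "listed_in"])

def prettify(name: str) -> str:
    """Turn snake_case into Title Case with spaces."""
    return name.replace("_", " ").title()

# Acronyms need an explicit override: str.title() would give 'Show Id'.
overrides = {"show_id": "Show ID"}
toy = toy.rename(columns=lambda c: overrides.get(c, prettify(c)))

print(list(toy.columns))
```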

In [23]:
df.head()

,Show ID,Type,Title,Director,Cast,Country,Release Year,Rating,Duration,Listed In
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies"
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,2021,TV-14,9 Seasons,"British TV Shows, Reality TV"
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,2021,PG-13,104 min,"Comedies, Dramas"
12,s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel,...","Germany, Czech Republic",2021,TV-MA,127 min,"Dramas, International Movies"
24,s25,Movie,Jeans,S. Shankar,"Prashanth, Aishwarya Rai Bachchan, Sri Lakshmi...",India,1998,TV-14,166 min,"Comedies, International Movies, Romantic Movies"


In [24]:
df.drop_duplicates(inplace=True)
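
Counting duplicates before removing them shows whether the step changed anything. A minimal sketch on a hypothetical toy frame follows; the `subset` check on `Title` is an illustrative assumption, not a step taken in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Show ID": ["s1", "s2", "s2", "s3"],
    "Title": ["A", "B", "B", "B"],
})

# Count fully identical rows before removing them.
n_full = int(toy.duplicated().sum())
# Count rows that repeat on a chosen key column only.
n_by_title = int(toy.duplicated(subset=["Title"]).sum())

# Remove exact duplicates and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_full, n_by_title, deduped.shape)
```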

In [25]:
df.head()

,Show ID,Type,Title,Director,Cast,Country,Release Year,Rating,Duration,Listed In
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies"
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,2021,TV-14,9 Seasons,"British TV Shows, Reality TV"
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,2021,PG-13,104 min,"Comedies, Dramas"
12,s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel,...","Germany, Czech Republic",2021,TV-MA,127 min,"Dramas, International Movies"
24,s25,Movie,Jeans,S. Shankar,"Prashanth, Aishwarya Rai Bachchan, Sri Lakshmi...",India,1998,TV-14,166 min,"Comedies, International Movies, Romantic Movies"


In [19]:
df.dtypes

Show ID         object
Type            object
Title           object
Director        object
Cast            object
Country         object
Release Year     int64
Rating          object
Duration        object
Listed In       object
dtype: object

In [32]:
df['Release Year'] = pd.to_datetime(df['Release Year'], format='%Y')

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5332 entries, 7 to 8806
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Show ID       5332 non-null   object        
 1   Type          5332 non-null   object        
 2   Title         5332 non-null   object        
 3   Director      5332 non-null   object        
 4   Cast          5332 non-null   object        
 5   Country       5332 non-null   object        
 6   Release Year  5332 non-null   datetime64[ns]
 7   Rating        5332 non-null   object        
 8   Duration      5332 non-null   object        
 9   Listed In     5332 non-null   object        
dtypes: datetime64[ns](1), object(9)
memory usage: 458.2+ KB


In [34]:
df.head()

,Show ID,Type,Title,Director,Cast,Country,Release Year,Rating,Duration,Listed In
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...",1993-01-01,TV-MA,125 min,"Dramas, Independent Movies, International Movies"
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,2021-01-01,TV-14,9 Seasons,"British TV Shows, Reality TV"
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,2021-01-01,PG-13,104 min,"Comedies, Dramas"
12,s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel,...","Germany, Czech Republic",2021-01-01,TV-MA,127 min,"Dramas, International Movies"
24,s25,Movie,Jeans,S. Shankar,"Prashanth, Aishwarya Rai Bachchan, Sri Lakshmi...",India,1998-01-01,TV-14,166 min,"Comedies, International Movies, Romantic Movies"
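
Parsing a bare year with `format='%Y'` yields a full timestamp, which is why every value above shows `-01-01`. The sketch below, on a hypothetical three-value series, shows the conversion and how the integer year can be recovered if only the year is needed later.

```python
import pandas as pd

years = pd.Series([1993, 2021, 1998], name="Release Year")

# Parsing with format='%Y' yields a timestamp at 1 January of each year.
as_dates = pd.to_datetime(years, format="%Y")
# The integer year is still recoverable through the .dt accessor.
back_to_int = as_dates.dt.year

print(as_dates.tolist())
print(back_to_int.tolist())
```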
