<a href="https://www.kaggle.com/code/preetsinghsebh/cleaning-netflix-movies-tv-shows-dataset?scriptVersionId=289569805" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Netflix Movies & TV Shows – Data Cleaning

This notebook focuses on cleaning the Netflix Movies & TV Shows dataset using Python and pandas.

### Objectives
- Handle missing values
- Standardize column names
- Clean text-based fields
- Convert data types
- Prepare a clean dataset for analysis




In [1]:
import pandas as pd

In [2]:
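# Load the raw data and keep it untouched so it can be compared with the cleaned copy later
raw_titles = pd.read_csv(
    "/kaggle/input/netflix-tv-shows-and-movies/titles.csv"
)

# All cleaning steps below operate on this copy
titles = raw_titles.copy()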

In [3]:
# Rename a few columns to shorter, more consistent names
titles = titles.rename(columns={
    "release_year": "year",
    "age_certification": "age_rating",
    "imdb_score": "imdb_rating",
    "imdb_votes": "imdb_votes_count"
})

In [4]:
# Drop rows where title is missing
titles = titles.dropna(subset=["title"])

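# Candidate text columns to fill with a placeholder.
# Of these, only "description" exists in titles.csv; the rest are skipped by the check below.
text_cols = ["description", "rating", "listed_in", "country"]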

for col in text_cols:
    if col in titles.columns:
        titles[col] = titles[col].fillna("Unknown")
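
The existence check in the loop silently skips column names that are not in the frame. A minimal sketch of the same idea that also reports what was skipped, using a small made-up frame (`demo` and `candidate_cols` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for titles; the values are made up for illustration.
demo = pd.DataFrame({
    "title": ["A", None, "C"],
    "description": ["First", "Second", None],
})

# Candidate columns; "country" is deliberately absent from demo.
candidate_cols = ["description", "country"]

# Drop rows without a title, then split the candidates by whether they exist.
demo = demo.dropna(subset=["title"])
present = [col for col in candidate_cols if col in demo.columns]
skipped = [col for col in candidate_cols if col not in demo.columns]

# Fill only the columns that are really there.
demo[present] = demo[present].fillna("Unknown")

print("filled:", present)
print("skipped (not in frame):", skipped)
print(demo)
```

Printing the skipped names makes it obvious when a column list was written for a different dataset.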

In [5]:
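# year and runtime have no missing values (see the summary printed in cell 9), so a plain int cast is safe
titles["year"] = titles["year"].astype(int)
titles["runtime"] = titles["runtime"].astype(int)
# imdb_rating still contains missing values, which float represents as NaN
titles["imdb_rating"] = titles["imdb_rating"].astype(float)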
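
The plain `int` cast works here only because `year` and `runtime` have no gaps. For a column with missing values, such as `seasons`, one option is pandas' nullable `Int64` dtype. This is a sketch on made-up data, not a step the notebook performs:

```python
import pandas as pd

# Made-up values; "seasons" mimics a column that has gaps.
demo = pd.DataFrame({
    "year": [2019, 2021, 2020],
    "seasons": [1.0, None, 3.0],
})

# astype(int) would raise on the missing value, so use the nullable "Int64" dtype,
# which keeps the gap as <NA> while storing the rest as integers.
demo["seasons"] = demo["seasons"].astype("Int64")

# Columns without gaps can use a plain int cast, as in the cell above.
demo["year"] = demo["year"].astype(int)

print(demo.dtypes)
print(demo)
```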

In [6]:
titles = titles.drop_duplicates()
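
`drop_duplicates()` with no arguments removes only rows that match on every column. A short sketch, on made-up data, of counting duplicates first and of checking a key column separately (treating `id` as a unique key is an assumption about the dataset):

```python
import pandas as pd

# Made-up rows: one exact duplicate (t2) and one repeated id with a different runtime (t3).
demo = pd.DataFrame({
    "id": ["t1", "t2", "t2", "t3", "t3"],
    "title": ["A", "B", "B", "C", "C"],
    "runtime": [90, 45, 45, 120, 121],
})

# Rows identical to an earlier row in every column.
full_dupes = int(demo.duplicated().sum())

# Rows whose id already appeared, even if other columns differ.
id_dupes = int(demo.duplicated(subset=["id"]).sum())

deduped = demo.drop_duplicates()

print("exact duplicates:", full_dupes)
print("repeated ids:", id_dupes)
print("rows after drop_duplicates():", len(deduped))
```

Repeated ids that survive `drop_duplicates()` usually need a manual decision about which row to keep.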

In [7]:
# Flatten the list-like strings, e.g. "['drama', 'crime']" becomes "drama, crime"
titles["genres"] = (
    titles["genres"]
    .str.replace("[", "", regex=False)
    .str.replace("]", "", regex=False)
    .str.replace("'", "", regex=False)
)
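
Stripping the brackets and quotes gives a readable string. If per-genre counts are needed later, an alternative is to parse the original strings into real lists. The sketch below assumes the raw values look like `"['drama', 'crime']"`, so it would have to run on the unmodified strings (for example `raw_titles["genres"]`); the rows shown are made up:

```python
import ast
import pandas as pd

# Made-up rows in the same "['a', 'b']" string format.
demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['drama', 'crime']", "['comedy']", "[]"],
})

# Parse each string into a real Python list (literal_eval only accepts literals).
demo["genre_list"] = demo["genres"].apply(ast.literal_eval)

# One row per (title, genre); an empty list becomes a single row with a missing genre.
per_genre = demo.explode("genre_list")

print(per_genre["genre_list"].value_counts())
```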

In [8]:
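def age_group(rating):
    """Map an age rating to a coarse audience group; any value not listed, including a missing one, is "Unknown"."""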
    if rating in ["G", "PG"]:
        return "Kids"
    elif rating in ["PG-13", "TV-14"]:
        return "Teens"
    elif rating in ["R", "TV-MA"]:
        return "Adults"
    else:
        return "Unknown"

titles["audience_group"] = titles["age_rating"].apply(age_group)
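
The same grouping can also be written as a lookup table with `Series.map`, which keeps the rating lists in one place. This is a sketch of an equivalent approach on a made-up sample (`AUDIENCE_BY_RATING` is an illustrative name):

```python
import pandas as pd

# Same groupings as age_group above, expressed as a lookup table.
AUDIENCE_BY_RATING = {
    "G": "Kids", "PG": "Kids",
    "PG-13": "Teens", "TV-14": "Teens",
    "R": "Adults", "TV-MA": "Adults",
}

# Made-up sample, including a rating outside the table and a missing value.
ratings = pd.Series(["PG", "TV-MA", "TV-Y", None, "PG-13"])

# Ratings not in the table map to NaN, which is then replaced by "Unknown".
groups = ratings.map(AUDIENCE_BY_RATING).fillna("Unknown")

print(groups.tolist())
# → ['Kids', 'Adults', 'Unknown', 'Unknown', 'Teens']
```

Ratings that are not in the table, such as `TV-Y` in this sample, end up as "Unknown" in both versions.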

In [9]:
print("Before cleaning shape:", raw_titles.shape)
print("After cleaning shape:", titles.shape)

print("\nBefore cleaning missing values:")
print(raw_titles.isnull().sum())

print("\nAfter cleaning missing values:")
print(titles.isnull().sum())

Before cleaning shape: (5850, 15)
After cleaning shape: (5849, 16)

Before cleaning missing values:
id                         0
title                      1
type                       0
description               18
release_year               0
age_certification       2619
runtime                    0
genres                     0
production_countries       0
seasons                 3744
imdb_id                  403
imdb_score               482
imdb_votes               498
tmdb_popularity           91
tmdb_score               311
dtype: int64

After cleaning missing values:
id                         0
title                      0
type                       0
description                0
year                       0
age_rating              2618
runtime                    0
genres                     0
production_countries       0
seasons                 3743
imdb_id                  403
imdb_rating              481
imdb_votes_count         497
tmdb_popularity           90
tmdb_score    

In [10]:
titles.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5849 entries, 0 to 5849
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    5849 non-null   object 
 1   title                 5849 non-null   object 
 2   type                  5849 non-null   object 
 3   description           5849 non-null   object 
 4   year                  5849 non-null   int64  
 5   age_rating            3231 non-null   object 
 6   runtime               5849 non-null   int64  
 7   genres                5849 non-null   object 
 8   production_countries  5849 non-null   object 
 9   seasons               2106 non-null   float64
 10  imdb_id               5446 non-null   object 
 11  imdb_rating           5368 non-null   float64
 12  imdb_votes_count      5352 non-null   float64
 13  tmdb_popularity       5759 non-null   float64
 14  tmdb_score            5539 non-null   float64
 15  audience_group        5849

In [11]:
titles.to_csv("netflix_titles_cleaned.csv", index=False)
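
One thing worth checking after export: with default settings, `pd.read_csv` treats empty fields as missing, so a `genres` value that became an empty string (which is what `"[]"` turns into after the replacements in cell 7) would come back as NaN. A small round-trip sketch on made-up data, using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Made-up rows; "" is what "[]" becomes after the bracket and quote replacements.
demo = pd.DataFrame({
    "title": ["A", "B"],
    "genres": ["drama, crime", ""],
    "description": ["First", "Unknown"],
})

# Write to an in-memory buffer the same way the notebook writes the file.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default read: the empty field is parsed as a missing value.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings instead.
buffer.seek(0)
reloaded_raw = pd.read_csv(buffer, keep_default_na=False)

print("missing genres after default reload:", int(reloaded["genres"].isna().sum()))
print("genres with keep_default_na=False:", reloaded_raw["genres"].tolist())
```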

### Conclusion
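The cleaned dataset has standardized column names, no missing titles or descriptions, a flattened genres field, and a new audience_group column, and is saved as netflix_titles_cleaned.csv. Missing values remain in age_rating, seasons, and the IMDb and TMDB columns, so exploratory data analysis or machine learning work should handle those explicitly.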
