## Notebook 1: Data Cleaning (`1_Data_Cleaning.ipynb`)

**Purpose:** To prepare the raw dataset for analysis by fixing errors, handling missing data, and formatting columns correctly.

*   **Loading Data:** We start by importing `pandas` and reading the raw CSV file.
*   **Handling Missing Values:**
    *   `director`, `cast`, and `country` have many missing values (2,634, 825, and 831 of 8,807 rows, respectively). Dropping those rows would discard too much data, so we fill them with placeholders instead: "No Director", "No Cast", and "Country Unavailable".
    *   `date_added` and `rating` are missing in only 10 and 4 rows, respectively, so we drop those rows. This keeps the later time-series and categorical analysis free of gaps in these columns.
*   **Date Formatting:** The `date_added` column is loaded as a string (text). We convert it to a `datetime` column so pandas treats it as a date. This makes it easy to extract the **Year** and the **Month** name into new columns (`year_added`, `month_added`).
*   **Export:** The cleaned dataframe is saved as `netflix_cleaned.csv`. This is the file used by all subsequent notebooks.


In [None]:
import pandas as pd
import numpy as np

# Load the dataset
file_path = "/Users/apple/Documents/Uni/Personal /Netlfix_Project/Data/netflix_dataset.csv"
df = pd.read_csv(file_path)

print(df.info())
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
None
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

In [2]:
# Fill missing values for categorical columns
df['director'] = df['director'].fillna('No Director')
df['cast'] = df['cast'].fillna('No Cast')
df['country'] = df['country'].fillna('Country Unavailable')

# Drop the few rows that are missing a critical value
df.dropna(subset=['date_added', 'rating'], inplace=True)
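
A quick way to confirm that the fill-and-drop step behaved as intended is to count missing values again afterwards. The sketch below does this on a small toy frame rather than the real dataset; the `toy` frame and its values are invented for illustration only.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "director": ["A. Director", None, "B. Director"],
    "date_added": ["September 25, 2021", "August 1, 2020", None],
    "rating": ["PG-13", "TV-MA", "TV-14"],
})

# Fill the sparse descriptive column with a placeholder
toy["director"] = toy["director"].fillna("No Director")

# Drop rows that lack a value in either critical column
rows_before = len(toy)
toy = toy.dropna(subset=["date_added", "rating"])
rows_dropped = rows_before - len(toy)

# The handled columns should contain no missing values now
remaining = toy[["director", "date_added", "rating"]].isnull().sum()
print(f"Dropped {rows_dropped} row(s)")
print(remaining)
```

On the real data, note that `duration` had 3 missing values in the raw file and is not handled by this step, so it may still contain gaps.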


In [None]:
# Convert 'date_added' to datetime
# .str.strip() removes leading/trailing whitespace that could otherwise break date parsing
df['date_added'] = pd.to_datetime(df['date_added'].str.strip())

df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month_name()
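
To see what the conversion does in isolation, here is a minimal sketch on a few hand-written strings. The sample values and the `"%B %d, %Y"` format are assumptions about what the column looks like, not something confirmed by the output above. `errors="coerce"` is an optional safeguard that is not used in the cell above.

```python
import pandas as pd

# Hypothetical sample values; the stray spaces mimic whitespace in the raw file
dates = pd.Series([" September 25, 2021", "August 1, 2020 ", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y", errors="coerce")

# Derive the same helper columns as the notebook
year_added = parsed.dt.year
month_added = parsed.dt.month_name()

# Count how many values could not be parsed
n_unparsed = int(parsed.isna().sum())
print(parsed)
print(f"Unparsed values: {n_unparsed}")
```

Counting the `NaT` values after a coerced parse is one way to find out how many rows have an unexpected date format before deciding whether to fix or drop them.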


In [6]:
# Save to a new CSV for the next notebooks
df.to_csv('/Users/apple/Documents/Uni/Personal /Netlfix_Project/Data/netflix_cleaned.csv', index=False)
print("Data cleaned and saved to 'netflix_cleaned.csv'")


Data cleaned and saved to 'netflix_cleaned.csv'
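
One thing to keep in mind for the later notebooks: a CSV file stores dates as plain text, so the `datetime` dtype is not preserved on export. The sketch below illustrates this with an in-memory buffer and a one-row stand-in frame (both are illustrative, not part of the project), and shows `parse_dates` as one way to restore the dtype when reading the file back.

```python
import io
import pandas as pd

# Small stand-in for the cleaned frame (values are illustrative only)
cleaned = pd.DataFrame({
    "title": ["Example Show"],
    "date_added": pd.to_datetime(["2021-09-25"]),
})

# Write to an in-memory buffer instead of a real file for this sketch
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# A plain read brings the dates back as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype on load
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["date_added"])

print(plain["date_added"].dtype)
print(reloaded["date_added"].dtype)
```

In a later notebook, the same `parse_dates=["date_added"]` argument could be passed when reading `netflix_cleaned.csv`.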
