# Data Cleaning and Preprocessing Demonstration
#### Using the Netflix Movies and TV Shows Dataset from Kaggle


Notes:
- After running this notebook, `cleaned_netflix_titles.csv` is written to the notebook's working directory (the repository root if you launch it from there).

### Step 1 — Install dependencies
This cell installs required Python packages from `requirements.txt`. Run this once after activating your environment.

In [None]:
%pip install -r requirements.txt

### Step 2 — Imports
Imports core libraries used for cleaning (`pandas`, `numpy`). Add other imports here if you extend the notebook.

In [None]:
import pandas as pd
import numpy as np

### Step 3 — Load dataset
Reads `netflix_titles.csv` into a pandas DataFrame and records the initial row count for later comparison.

In [None]:
df = pd.read_csv('netflix_titles.csv')
initial_count = df.shape[0]
print(df.shape)
print(initial_count)
df.head(10)

### Step 4 — Remove duplicates
Calls `df.drop_duplicates()` to remove exact duplicate rows, so redundant records do not carry into later cleaning steps, then shows sample rows.

In [None]:
df = df.drop_duplicates()
df.head(10)

### Step 5 — Report duplicate removal
Computes how many rows were removed by deduplication and prints the new shape and null counts to inspect missing data.

In [None]:
count_after_drop_duplicates = df.shape[0]
count_drop = initial_count - count_after_drop_duplicates
print("Number of rows dropped: ", count_drop)
print(df.shape)
df.isnull().sum()
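
`drop_duplicates()` with no arguments only removes rows that match in every column. As a rough sketch of the difference, the toy frame below (made-up values, not taken from the Netflix file) compares exact duplicates with duplicates on a key column. Treating `show_id` as the key is an assumption for illustration; the notebook itself only removes exact duplicates.

```python
import pandas as pd

# Toy frame standing in for the real data; all values are invented for illustration.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s3"],
    "title": ["A", "B", "B", "C"],
    "country": ["US", "IN", "India", "UK"],
})

# Exact duplicates: every column must match, so the two 's2' rows are NOT flagged.
exact_dupes = int(toy.duplicated().sum())

# Key-based duplicates: rows sharing a show_id are flagged (assumed key column).
key_dupes = int(toy.duplicated(subset=["show_id"]).sum())

# Keep the first row for each show_id.
deduped = toy.drop_duplicates(subset=["show_id"], keep="first").reset_index(drop=True)
print(exact_dupes, key_dupes, len(deduped))
```

Whether key-based deduplication is appropriate depends on whether the key really identifies a title uniquely, so it is worth inspecting the flagged rows before dropping them.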

### Step 6 — Drop rows missing key fields
Drops rows where `title` or `type` is missing, since these are treated as essential fields, then resets the DataFrame index and prints how many rows were dropped.

In [None]:
df = df.dropna(subset=['title', 'type']).reset_index(drop=True)  # only require the essential fields
# df = df.dropna().reset_index(drop=True)  # too aggressive: would also drop rows missing any other column
print(df.shape)
count_after_dropping_na = df.shape[0]
count_drop_na = count_after_drop_duplicates - count_after_dropping_na
print("Number of rows dropped (NA): ", count_drop_na)
df.head(10)
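
Steps 5 and 6 track row counts by hand with one variable per step. One possible way to keep this bookkeeping in a single place is a small logging helper, sketched below on a toy frame. The helper name `log_step` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def log_step(log, label, frame):
    """Record (label, row count, rows dropped since the previous step) and return frame."""
    previous = log[-1][1] if log else len(frame)
    log.append((label, len(frame), previous - len(frame)))
    return frame

# Toy data: one exact duplicate, one row missing title, one row missing type.
toy = pd.DataFrame({
    "title": ["A", "A", None, "B"],
    "type": ["movie", "movie", "movie", None],
})

log = []
toy = log_step(log, "loaded", toy)
toy = log_step(log, "drop_duplicates", toy.drop_duplicates())
toy = log_step(log, "dropna", toy.dropna(subset=["title", "type"]).reset_index(drop=True))

for label, rows, dropped in log:
    print(f"{label}: {rows} rows ({dropped} dropped)")
```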

### Step 7 — Normalize text columns
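Strips whitespace and converts selected text columns to lowercase to standardize values for later processing. Because `astype(str)` turns missing values into the literal string `'nan'`, those strings are converted back to a proper `np.nan`. Note that a genuine text value of `'nan'` would also be treated as missing.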

In [None]:
text_cols = ['country', 'type', 'director', 'cast', 'listed_in']
for col in text_cols:
    if col in df.columns:
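        # astype(str) turns missing values into the string 'nan'; map those back to NaN
        df[col] = df[col].astype(str).str.strip().replace('nan', np.nan)
        # lowercase only the non-missing values
        df[col] = df[col].where(df[col].isna(), df[col].str.lower())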
df.head(10)
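
As an alternative sketch, the pandas `.str` accessor leaves missing values missing, so the round trip through the string `'nan'` can be avoided when the columns already hold strings. The helper name `normalize_text` and the toy values are hypothetical. Treating whitespace-only strings as missing is an extra assumption that the cell above does not make.

```python
import numpy as np
import pandas as pd

def normalize_text(series: pd.Series) -> pd.Series:
    """Strip and lowercase string values; missing values stay missing."""
    cleaned = series.str.strip().str.lower()
    # Assumption: strings that are empty after stripping count as missing.
    return cleaned.replace("", np.nan)

# Toy data with padding, mixed case, and missing entries (invented values).
toy = pd.DataFrame({
    "country": ["  United States ", None, "INDIA"],
    "director": ["Jane Doe", "  ", np.nan],
})

for col in ["country", "director"]:
    toy[col] = normalize_text(toy[col])

print(toy)
```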

### Step 8 — Parse & format dates, convert types
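Parses `date_added` with `dateutil`'s flexible parser (falling back to the explicit `Month day, year` format) and stores the result as a `dd-mm-yyyy` string; values that cannot be parsed become missing. Note that the column ends up as text, not as a datetime dtype. Also converts `release_year` to the nullable integer type `Int64`, and casts `rating` and `type` to categorical dtype where present.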

In [None]:
if 'date_added' in df.columns:
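    from dateutil import parser
    import warnings
    # Note: this silences all warnings for the rest of the session, not only parsing warnings
    warnings.filterwarnings('ignore')

    def parse_date_flexible(date_str):
        """Return a datetime for date_str, or None if it is missing or unparseable."""
        if pd.isna(date_str):
            return None
        try:
            # dateutil handles most common layouts, e.g. 'September 25, 2021'
            return parser.parse(str(date_str))
        except (ValueError, OverflowError):
            try:
                # Fallback: explicit 'Month day, year' format
                return pd.to_datetime(date_str, format='%B %d, %Y')
            except (ValueError, TypeError):
                return None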
    
    df['date_added'] = df['date_added'].apply(parse_date_flexible)
    
    df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce').dt.strftime('%d-%m-%Y')

if 'release_year' in df.columns:
    df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce').astype('Int64')
    
if 'rating' in df.columns:
    df['rating'] = df['rating'].astype('category')
    
if 'type' in df.columns:
    df['type'] = df['type'].astype('category')
    
df.head(10)
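
If the dates are known to follow a single layout, a vectorized parse is a simpler option than a row-by-row `apply`. The sketch below assumes the `Month day, year` layout used as the fallback above and runs on invented sample strings. It also keeps the parsed datetimes in a separate variable, since the `dd-mm-yyyy` text form no longer supports date operations.

```python
import pandas as pd

# Invented sample values: padded, clean, unparseable, and missing.
raw = pd.Series([" September 25, 2021", "January 1, 2020", "not a date", None])

# Strip stray whitespace, then parse with an explicit format; failures become NaT.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Format as dd-mm-yyyy text only for display or export.
as_text = parsed.dt.strftime("%d-%m-%Y")

print(parsed.dt.year.tolist())
print(as_text.tolist())
```

Keeping a datetime column until the final export allows sorting and filtering by date, which the formatted strings do not support directly.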

### Step 9 — Normalize column names
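Converts column headers to lowercase and replaces spaces with underscores for consistency in downstream code (e.g., `Date Added` -> `date_added`). Earlier steps already refer to lowercase names such as `date_added`, so for a file whose headers are not already in this form, this step would need to run immediately after loading.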

In [None]:
df.columns = [col.strip().lower().replace(' ', '_') for col in df.columns]
df.head(10)
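
The list comprehension above replaces each single space, so headers with repeated spaces or hyphens would come out uneven. A slightly more forgiving variant is sketched below. The helper name `normalize_columns` and the sample headers are hypothetical, and mapping hyphens to underscores is an added assumption.

```python
import re
import pandas as pd

def normalize_columns(columns):
    """Trim, lowercase, and collapse runs of whitespace or hyphens into one underscore."""
    return [re.sub(r"[\s\-]+", "_", str(col).strip().lower()) for col in columns]

# Empty toy frame with messy, invented headers.
toy = pd.DataFrame(columns=[" Date Added", "Release  Year", "show-id"])
toy.columns = normalize_columns(toy.columns)

print(list(toy.columns))
```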

### Step 10 — Final checks & save
Prints final diagnostics (shape, null counts, dtypes) and writes the cleaned dataset to `cleaned_netflix_titles.csv`.

In [None]:
print('Final shape:', df.shape)
print('\nNull counts after cleaning:\n', df.isnull().sum(), '\n')
print(df.dtypes)
df.to_csv('cleaned_netflix_titles.csv', index=False)
print('Saved cleaned dataset to cleaned_netflix_titles.csv')
df.head(10)
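
CSV stores only text, so the `category` and `Int64` dtypes set in Step 8 are not preserved in the saved file. The sketch below shows one way to restore them when reading the file back, using an in-memory buffer and invented toy values in place of the real file.

```python
import io
import pandas as pd

# Toy frame mirroring the cleaned dtypes (values are invented).
toy = pd.DataFrame({
    "type": pd.Categorical(["movie", "tv show"]),
    "release_year": pd.array([2020, None], dtype="Int64"),
    "date_added": ["25-09-2021", None],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Re-declare the dtypes on load, then parse the dd-mm-yyyy strings.
reloaded = pd.read_csv(buffer, dtype={"type": "category", "release_year": "Int64"})
reloaded["date_added"] = pd.to_datetime(
    reloaded["date_added"], format="%d-%m-%Y", errors="coerce"
)

print(reloaded.dtypes)
```

For the real file, the same `dtype` mapping could be passed to `pd.read_csv('cleaned_netflix_titles.csv', ...)`.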