# YouTube Success Prediction ML: Data Exploration

This notebook explores the dataset used by the platform, in both its raw and processed forms.

## Objectives
- Validate the raw vs. processed data contracts
- Inspect missingness patterns
- Review category and country distributions
- Profile key numeric features

## 1) Environment Setup

In [None]:
from pathlib import Path
import sys

import pandas as pd
import plotly.express as px

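# Locate the repository root: fall back to the parent directory when the
# notebook is launched from a direct subdirectory of the root.
ROOT = Path.cwd().resolve()
if not (ROOT / "src").exists() and (ROOT.parent / "src").exists():
    ROOT = ROOT.parent

# Make the package under src/ importable; skip the insert if this cell is re-run.
SRC_DIR = str(ROOT / "src")
if SRC_DIR not in sys.path:
    sys.path.insert(0, SRC_DIR)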

from youtube_success_ml.config import DEFAULT_DATA_PATH
from youtube_success_ml.data.loader import load_raw_dataset, load_dataset

pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 180)


## 2) Load Raw + Processed Data

In [None]:
raw_df = load_raw_dataset(DEFAULT_DATA_PATH)
processed_df = load_dataset(DEFAULT_DATA_PATH)

print(f"Input path     : {DEFAULT_DATA_PATH}")
print(f"Raw shape      : {raw_df.shape}")
print(f"Processed shape: {processed_df.shape}")

In [None]:
raw_df.head(5)
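
The shapes printed above show how many rows and columns each frame has, but not which columns differ. A minimal sketch of a contract comparison follows. The helper `compare_contracts` and the toy frames (including the column name `channel_name`) are hypothetical and stand in for `raw_df` and `processed_df`; they are not part of `youtube_success_ml`.

```python
import pandas as pd


def compare_contracts(raw: pd.DataFrame, processed: pd.DataFrame) -> dict:
    """Summarise how the column set and row count differ between two frames."""
    raw_cols, processed_cols = set(raw.columns), set(processed.columns)
    return {
        "only_in_raw": sorted(raw_cols - processed_cols),
        "only_in_processed": sorted(processed_cols - raw_cols),
        "shared": sorted(raw_cols & processed_cols),
        "row_delta": len(processed) - len(raw),
    }


# Toy frames with made-up values; swap in raw_df / processed_df in the notebook.
toy_raw = pd.DataFrame({"channel_name": ["a", "b", "c"], "subscribers": [10, 20, 30]})
toy_processed = pd.DataFrame({"subscribers": [10, 20], "age": [3, 5]})

contract_report = compare_contracts(toy_raw, toy_processed)
print(contract_report)
```

In the notebook, `compare_contracts(raw_df, processed_df)` would list the columns that processing drops or adds, which is usually the first thing to check before trusting downstream features.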

## 3) Missingness Review

In [None]:
def missingness(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null count, null percentage, and dtype, most-missing first."""
    out = pd.DataFrame(
        {
            "column": df.columns,
            "null_count": [int(df[c].isna().sum()) for c in df.columns],
            "null_pct": [float(df[c].isna().mean() * 100.0) for c in df.columns],
            "dtype": [str(df[c].dtype) for c in df.columns],
        }
    )
    return out.sort_values(["null_pct", "null_count"], ascending=False).reset_index(drop=True)

missingness(raw_df).head(20)

In [None]:
missingness(processed_df).head(20)
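
Reading the two tables separately makes it hard to see what processing changed. The sketch below puts the null percentages side by side for the columns both frames share. The helper `missingness_delta` is hypothetical, and the toy frames assume a fill-style cleanup (an `"Unknown"` label and a numeric fill) purely for illustration; what `load_dataset` actually does to missing values is not shown in this notebook.

```python
import pandas as pd


def missingness_delta(raw: pd.DataFrame, processed: pd.DataFrame) -> pd.DataFrame:
    """Null percentage per shared column, before and after processing."""
    shared = [c for c in raw.columns if c in processed.columns]
    out = pd.DataFrame(
        {
            "raw_null_pct": raw[shared].isna().mean() * 100.0,
            "processed_null_pct": processed[shared].isna().mean() * 100.0,
        }
    )
    # Negative values mean processing reduced the share of nulls.
    out["delta_pct"] = out["processed_null_pct"] - out["raw_null_pct"]
    return out.sort_values("delta_pct")


# Toy frames with made-up values; the fill strategy here is an assumption.
toy_raw = pd.DataFrame(
    {"country": ["US", None, None, "IN"], "uploads": [1.0, 2.0, None, 4.0]}
)
toy_processed = pd.DataFrame(
    {"country": ["US", "Unknown", "Unknown", "IN"], "uploads": [1.0, 2.0, 2.0, 4.0]}
)

delta = missingness_delta(toy_raw, toy_processed)
print(delta)
```

Columns whose `delta_pct` is zero were left untouched with respect to nulls, so they are the ones to check if missing values are expected to be handled before modelling.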

## 4) Category + Country Distribution

In [None]:
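# "count" tallies non-null subscriber values, so a channel with a missing
# subscriber figure is not included in channel_count.
category_distribution = (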
    processed_df.groupby("category", as_index=False)["subscribers"]
    .agg(channel_count="count", total_subscribers="sum")
    .sort_values("channel_count", ascending=False)
)

category_distribution.head(15)

In [None]:
top_countries = (
    processed_df.groupby("country", as_index=False)["subscribers"]
    .sum()
    .rename(columns={"subscribers": "total_subscribers"})
    .sort_values("total_subscribers", ascending=False)
    .head(20)
)

px.bar(top_countries, x="country", y="total_subscribers", title="Top 20 Countries by Subscribers")

## 5) Numeric Profiling

In [None]:
numeric_cols = [
    "uploads",
    "subscribers",
    "highest_yearly_earnings",
    "growth_target",
    "age",
]

processed_df[numeric_cols].describe(percentiles=[0.01, 0.05, 0.5, 0.95, 0.99]).transpose()
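
The extreme percentiles in the table above hint at how heavy each feature's tail is. A small sketch that condenses this into two numbers per column, sample skewness and the ratio of the 99th percentile to the median, is given below. The helper `tail_summary` is hypothetical and the toy values are made up; only the column names `subscribers` and `age` are taken from `numeric_cols`.

```python
import pandas as pd


def tail_summary(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Skewness and p99/median ratio per column, ignoring nulls."""
    rows = []
    for col in cols:
        values = df[col].dropna()
        median = values.median()
        rows.append(
            {
                "column": col,
                "skew": float(values.skew()),
                # Guard against division by zero when the median is 0.
                "p99_over_median": (
                    float(values.quantile(0.99) / median) if median != 0 else float("nan")
                ),
            }
        )
    return pd.DataFrame(rows).set_index("column")


# Toy frame with made-up values: one heavy-tailed column, one symmetric column.
toy = pd.DataFrame({"subscribers": [1, 2, 3, 4, 1000], "age": [1, 2, 3, 4, 5]})

summary = tail_summary(toy, ["subscribers", "age"])
print(summary)
```

A large positive skew or a high `p99_over_median` is a common signal that a log transform or clipping may be worth considering before modelling, though whether that applies here depends on the real data.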