# 01 – Exploratory Data Analysis (EDA) for US YouTube Trending Videos

This notebook performs a basic exploratory data analysis of the US YouTube
trending dataset. The goals are to:

- Inspect the raw structure of the data
- Understand missing values
- Summarize key numerical columns
- Visualize distributions and engagement ratios
- Inspect the distribution of `category_id`

The output is purely exploratory and helps guide later feature engineering
in `02_feature_engineering.ipynb`.


## 1. Imports

In [None]:
# 01_eda.ipynb
# Exploratory Data Analysis for US YouTube Trending Videos

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("ggplot")

print("EDA notebook ready.")

## 2. Load Raw Dataset

We load the original US videos dataset from the `../data/raw` folder.


In [None]:
# Load raw dataset
df = pd.read_csv("../data/raw/USvideos.csv")
print("Raw dataframe shape:", df.shape)
df.head()
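
Before going further, it can help to confirm that the columns used later in
this notebook are actually present in the loaded file. The sketch below is
only illustrative: `REQUIRED_COLUMNS`, `missing_columns`, and the toy
dataframe are hypothetical names introduced here, and the column list is
simply the set of columns referenced in the following sections.

```python
import pandas as pd

# Columns referenced later in this notebook (assumed to exist in the raw CSV).
REQUIRED_COLUMNS = ["views", "likes", "dislikes", "comment_count", "category_id"]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]


# Toy frame standing in for `df`; it deliberately lacks two columns.
toy = pd.DataFrame({"views": [1], "likes": [0], "category_id": [10]})
absent = missing_columns(toy)
print("Missing columns:", absent)
```

On the real data, `missing_columns(df)` returning an empty list would mean the
rest of the notebook can run as written.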

## 3. Basic Structure and Missing Values

We first inspect the dataframe structure and check how many missing values each
column contains.


In [None]:
# Basic info
df.info()

In [None]:
# Missing values per column
df.isnull().sum()
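
Raw counts are easier to judge next to the share of rows they represent. A
minimal sketch of such a summary follows; `missing_summary` and the toy
dataframe are hypothetical and only stand in for `df`.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first.

    Columns without any missing values are left out of the result.
    """
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for `df`: only `description` has gaps.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "description": ["x", None, None, "y"],
    "views": [10, 20, 30, 40],
})
report = missing_summary(toy)
print(report)
```

Calling `missing_summary(df)` on the real data would list only the columns
that need attention.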

## 4. Numerical Summaries

We focus on the main numerical engagement metrics:

- `views`
- `likes`
- `dislikes`
- `comment_count`

We compute their descriptive statistics and visualize their distributions.


In [None]:
# Numerical summaries
num_cols = ["views", "likes", "dislikes", "comment_count"]
df[num_cols].describe()

In [None]:
# Histograms of main numerical columns (one subplot per column)
axes = df[num_cols].hist(bins=50, figsize=(10, 6))
plt.suptitle("Distributions of views / likes / dislikes / comment_count")
plt.tight_layout()
plt.show()
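
Count data such as views and likes is often heavily right-skewed, in which
case most of the histogram mass collapses into the first bin. Whether that
holds for this dataset is an assumption to check against the plots above. If
it does, a log transform is one common way to make the shape readable. The
sketch below uses a toy dataframe in place of `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `df[num_cols]`; values span several orders of magnitude.
toy = pd.DataFrame({
    "views": [0, 9, 99, 999_999],
    "likes": [0, 1, 10, 50_000],
})

# log1p computes log(1 + x), so zero counts map to 0 instead of -inf.
log_df = np.log1p(toy)
print(log_df.round(3))
```

On the real data, `np.log1p(df[num_cols]).hist(bins=50, figsize=(10, 6))`
would draw the same grid of histograms on the log scale.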

## 5. Engagement Ratios

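To normalize engagement across videos with different view counts, we compute:

- `like_view_ratio = likes / views`
- `comment_view_ratio = comment_count / views`

In the code, a small constant (`1e-6`) is added to `views` so that a video
with zero views does not cause a division by zero. These ratios will later be
used as features in the modeling pipeline.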


In [None]:
# Engagement ratios
# The 1e-6 term guards against division by zero when views == 0.
df["like_view_ratio"] = df["likes"] / (df["views"] + 1e-6)
df["comment_view_ratio"] = df["comment_count"] / (df["views"] + 1e-6)

df[["like_view_ratio", "comment_view_ratio"]].describe()
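
One side effect of adding a small constant to the denominator is that a row
with zero views but non-zero likes gets a very large ratio (5 likes over
`1e-6` is about five million), which can distort the summary statistics. An
alternative, sketched below as a variation and not as this notebook's method,
is to mark such rows as missing. `safe_ratio` and the toy dataframe are
hypothetical.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Element-wise division that yields NaN where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


# Toy frame standing in for `df`; the second row has zero views.
toy = pd.DataFrame({
    "views": [100, 0, 2000],
    "likes": [10, 5, 50],
    "comment_count": [1, 0, 20],
})
toy["like_view_ratio"] = safe_ratio(toy["likes"], toy["views"])
toy["comment_view_ratio"] = safe_ratio(toy["comment_count"], toy["views"])
print(toy)
```

With this approach, `describe()` skips the NaN rows, so zero-view videos no
longer influence the mean or maximum of the ratios.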

## 6. Category Distribution

Finally, we inspect the distribution of `category_id` among US trending videos.
This gives a sense of which categories dominate the trending list.


In [None]:
# Category distribution
df["category_id"].value_counts().plot(kind="bar", figsize=(10, 4))
plt.title("Distribution of category_id in US trending videos")
plt.xlabel("category_id")
plt.ylabel("count")
plt.show()
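
To put numbers on which categories dominate, the counts can also be expressed
as shares of all rows. This is a small sketch on a toy dataframe; the
`category_id` values in it are made up for illustration, and `category_shares`
is a hypothetical helper.

```python
import pandas as pd


def category_shares(frame: pd.DataFrame, column: str = "category_id") -> pd.DataFrame:
    """Share and cumulative share of rows per category, largest first."""
    shares = frame[column].value_counts(normalize=True)
    return pd.DataFrame({"share": shares, "cumulative_share": shares.cumsum()})


# Toy frame standing in for `df`; the ids are illustrative only.
toy = pd.DataFrame({"category_id": [24, 10, 24, 25, 24, 10]})
table = category_shares(toy)
print(table.round(3))
```

On the real data, the `cumulative_share` column of `category_shares(df)` shows
how many categories are needed to cover most of the trending list.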