# YouTube Trend Analyzer — Full Jupyter Notebook Script

This notebook contains a complete pipeline: data loading, cleaning, EDA,
visualizations, machine learning, and insights.

------------------------------------------------------------------------

``` python
# YouTube Trend Analyzer — Full Notebook

# 1. Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import joblib

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

sns.set()

# 2. Load Dataset
# Loads the synthetic dataset; point the path at real API data once it has been fetched and cleaned
df = pd.read_csv("data/processed/youtube_trends_full.csv")
df.head()
```

------------------------------------------------------------------------

``` python
# 3. Data Cleaning & Feature Engineering

df = df.drop_duplicates(subset=['video_id'])
for col in ['views','likes','comments']:
    df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0).astype(int)

df['title'] = df['title'].fillna('')
df['tags'] = df['tags'].fillna('')

# Engagement ratio
df['engagement_ratio'] = (df['likes'] + df['comments']) / (df['views'] + 1)

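# Success label: split engagement_ratio into three bands at the 33rd and 66th percentiles
q1 = df['engagement_ratio'].quantile(0.33)  # Low / Medium boundary
q2 = df['engagement_ratio'].quantile(0.66)  # Medium / High boundary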

def label(x):
    if x >= q2:
        return 'High'
    elif x >= q1:
        return 'Medium'
    else:
        return 'Low'

df['success_level'] = df['engagement_ratio'].apply(label)

df[['views','likes','comments','engagement_ratio','success_level']].head()
```
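The labeling step above can also be sketched with `pd.qcut`, which bins on quantiles directly. This is an alternative under stated assumptions, not the notebook's method: it cuts at exact thirds rather than at 0.33 and 0.66, and the `toy` frame and its values are made up purely for illustration.

```python
import pandas as pd

# Toy data standing in for the real DataFrame (values are invented for illustration).
toy = pd.DataFrame({
    "views":    [1000, 5000, 200, 80000, 1500, 300],
    "likes":    [10,   400,  1,   9000,  60,   30],
    "comments": [1,    50,   0,   700,   5,    6],
})

# Same ratio as in the cell above; the +1 avoids division by zero for zero-view rows.
toy["engagement_ratio"] = (toy["likes"] + toy["comments"]) / (toy["views"] + 1)

# qcut splits on quantiles. Ranking with method="first" breaks ties,
# so the bin edges stay unique even when many rows share the same ratio.
toy["success_level"] = pd.qcut(
    toy["engagement_ratio"].rank(method="first"),
    q=3,
    labels=["Low", "Medium", "High"],
)

print(toy[["engagement_ratio", "success_level"]])
```

Compared with the `label` function, this yields an ordered categorical column and near-equal class sizes by construction, at the cost of slightly different cut points.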

------------------------------------------------------------------------

``` python
# 4. Exploratory Data Analysis

# Average views by category
by_cat = df.groupby('category')['views'].mean().sort_values(ascending=False)
plt.figure(figsize=(8,5))
by_cat.plot.bar()
plt.title('Average Views by Category')
plt.ylabel('Average Views')
plt.show()

# Success level distribution
plt.figure(figsize=(6,6))
df['success_level'].value_counts().plot.pie(autopct='%1.1f%%')
plt.title('Success Level Distribution')
plt.ylabel('')
plt.show()

# Correlation heatmap
plt.figure(figsize=(6,4))
sns.heatmap(df[['views','likes','comments','engagement_ratio']].corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation")
plt.show()
```

------------------------------------------------------------------------

``` python
# 5. Machine Learning — RandomForest

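# Combine title and tags into a single text field for TF-IDF
df['text'] = df['title'] + ' ' + df['tags']

# Encode the target labels (High / Low / Medium) as integers
le = LabelEncoder()
y = le.fit_transform(df['success_level'])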

# Text vectorization
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
X_text = tfidf.fit_transform(df['text'])

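# Numeric features
# Note: success_level is derived from these same columns via engagement_ratio,
# so the model can recover the label from them (target leakage) and the
# accuracy reported below will look optimistic.
numeric = df[['views','likes','comments']].fillna(0)
# Convert to CSR so the combined matrix supports row indexing
X = hstack([X_text, numeric.values]).tocsr()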

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=le.classes_))
```

------------------------------------------------------------------------

``` python
# 6. Save Model

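# joblib.dump does not create missing directories, so make sure models/ exists
from pathlib import Path
Path('models').mkdir(parents=True, exist_ok=True)

joblib.dump({
    'model': clf,
    'tfidf': tfidf,
    'label_encoder': le
}, 'models/rf_success.pkl')
print("Model saved to models/rf_success.pkl")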
```
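The saved bundle holds everything needed to score a new video later: the classifier, the fitted TF-IDF vectorizer, and the label encoder. Below is a minimal sketch of reloading it and predicting. The helper `predict_success` is hypothetical, and the tiny model trained here is a throwaway stand-in so the example runs on its own; with the real notebook you would load `models/rf_success.pkl` instead.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder


def predict_success(bundle, title, tags, views, likes, comments):
    """Hypothetical helper: rebuild one feature row as in training and predict its label."""
    text = f"{title} {tags}"                      # training text was title + ' ' + tags
    X_text = bundle["tfidf"].transform([text])    # transform only; never refit on new data
    X_num = np.array([[views, likes, comments]])  # same column order as in training
    X = hstack([X_text, X_num]).tocsr()
    pred = bundle["model"].predict(X)
    return bundle["label_encoder"].inverse_transform(pred)[0]


# Throwaway stand-in model on made-up data, so the sketch is self-contained.
texts = ["music video official", "gaming stream highlights",
         "cooking pasta recipe", "music live concert"]
labels = ["High", "Low", "Medium", "High"]
nums = np.array([[1000, 200, 30], [5000, 10, 1], [800, 40, 5], [2000, 500, 60]])

tfidf = TfidfVectorizer()
le = LabelEncoder()
X = hstack([tfidf.fit_transform(texts), nums]).tocsr()
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, le.fit_transform(labels))

# Round-trip through joblib using the same dictionary layout as the cell above.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "rf_success.pkl"
    joblib.dump({"model": clf, "tfidf": tfidf, "label_encoder": le}, path)
    bundle = joblib.load(path)

result = predict_success(bundle, "music video", "official", 1500, 300, 40)
print(result)
```

Keeping the vectorizer and encoder in the same file as the model avoids the common mistake of pairing a model with a vocabulary it was not trained on.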

------------------------------------------------------------------------

``` python
# 7. Insights

# Top performing categories
print("Top categories by average views:\n", by_cat.head(5))

# Engagement distribution summary
print(df.groupby('success_level')['engagement_ratio'].mean())

# Example prediction pipeline
# Feature order must match training: TF-IDF text columns first, then views, likes, comments
sample = ["Amazing Music Official Trailer"]
X_sample = tfidf.transform(sample)
X_num = np.array([[100000, 5000, 200]])  # views, likes, comments
X_input = hstack([X_sample, X_num])
pred = clf.predict(X_input)
print("Prediction for sample video:", le.inverse_transform(pred))
```

------------------------------------------------------------------------

### ✅ Notebook Covers:

-   Data cleaning
-   EDA (plots: bar, pie, correlation heatmap)
-   RandomForest ML classifier
-   TF-IDF feature engineering on titles & tags
-   Saving the trained model
-   Generating predictions

You can copy this into a `.ipynb` file or run it in Jupyter or Colab. It works
with both the **synthetic dataset** (`youtube_trends_full.csv`) and **real
API data**, once that data has been fetched and cleaned.
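
Before swapping in real API data, it may help to confirm that the file has the columns the cells above read: `video_id`, `title`, `tags`, `category`, `views`, `likes`, and `comments`. A minimal sketch of such a check follows. The helper `missing_columns` and the `api_like` frame are hypothetical, and the column names a real API export uses may differ and need renaming first.

```python
import pandas as pd

# Columns referenced by the cleaning, EDA, and modeling cells of this notebook.
REQUIRED_COLUMNS = ["video_id", "title", "tags", "category", "views", "likes", "comments"]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Hypothetical helper: list required columns that are absent from the frame."""
    return [col for col in required if col not in frame.columns]


# Made-up frame imitating an incomplete export.
api_like = pd.DataFrame({
    "video_id": ["a1"],
    "title": ["Demo"],
    "views": [10],
    "likes": [1],
})

missing = missing_columns(api_like)
print(missing)
```

An empty list means the frame can go through the pipeline as written; anything else names the columns to add or rename.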