# 🎬 Trending YouTube Dataset Project (2025-26)
**Author:** Hussein Shamas

**Description:**
This Jupyter Notebook solves all 15 tasks required for the 2025–26 YouTube Trending Dataset Project.
It uses pandas, numpy, matplotlib, and seaborn.

---
⚠️ **Notes:**
- Place all CSV files in `data/csv/`
- Place all JSON files in `data/json/`
- Each CSV represents a country (e.g., `USvideos.csv`, `GBvideos.csv`)
- The project must be committed to GitHub.
- You may use AI tools, but describe their contribution below.

---

### 🤖 AI Tools Usage Declaration
This notebook was partially generated with the assistance of ChatGPT (OpenAI GPT-5). AI was used to:
- Structure the workflow and provide clean, commented code.
- Suggest best practices for Pandas manipulation and visualization.

I fully understand every section of the code and manually verified all results.

In [None]:

import pandas as pd
import numpy as np
import json
import glob
import os
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")
pd.set_option("display.max_columns", None)


## 1️⃣ Combine all CSV files into a single DataFrame with 'country' column

In [None]:

csv_files = glob.glob("data/csv/*.csv")
dfs = []

for file in csv_files:
    country = os.path.basename(file).split("videos")[0].upper()
    df = pd.read_csv(file, encoding="utf-8")
    df["country"] = country
    dfs.append(df)

youtube = pd.concat(dfs, ignore_index=True)
print("✅ Combined shape:", youtube.shape)
youtube.head()
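
The loop above reads every file as UTF-8. If one of the country files was saved in a different encoding, `pd.read_csv` would stop with a `UnicodeDecodeError`. The sketch below shows one possible fallback; the helper name `read_csv_with_fallback`, the encoding list, and the demo file `XXvideos.csv` are hypothetical and not part of the project data.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return the first successful parse of `path` among the given encodings."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # keep the error and try the next encoding
    raise last_error


# Demo on a throwaway file saved as Latin-1, which is not valid UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "XXvideos.csv")  # hypothetical file name
    with open(demo_path, "wb") as fh:
        fh.write("title,views\ncafé,10\n".encode("latin-1"))
    demo = read_csv_with_fallback(demo_path)

print(demo)
```

Latin-1 accepts any byte sequence, so the fallback never fails, but it may decode some characters incorrectly. It is worth spot-checking titles from any file that needed it.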


## 2️⃣ Extract all videos with no tag

In [None]:

no_tag_videos = youtube[youtube["tags"] == "[none]"]
print("Videos without tags:", len(no_tag_videos))
no_tag_videos.head()


## 3️⃣ Total number of views per channel

In [None]:

channel_views = youtube.groupby("channel_title")["views"].sum().reset_index().sort_values("views", ascending=False)
channel_views.head()


## 4️⃣ Create 'excluded' DataFrame for disabled comments/ratings or removed videos

In [None]:

excluded = youtube[
    (youtube["comments_disabled"] == True) |
    (youtube["ratings_disabled"] == True) |
    (youtube["video_error_or_removed"] == True)
]
print("Excluded videos:", excluded.shape)

youtube = youtube.drop(excluded.index)
print("Remaining videos:", youtube.shape)


## 5️⃣ Add 'like_ratio' column

In [None]:

youtube["like_ratio"] = youtube["likes"] / youtube["dislikes"].replace(0, np.nan)
youtube["like_ratio"] = youtube["like_ratio"].fillna(0)
youtube.head()
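
To make the zero-dislike handling concrete, here is a minimal sketch on a made-up three-row frame (`toy` and its values are illustrative only). It applies the same two steps as the cell above.

```python
import numpy as np
import pandas as pd

# Toy data: one normal row, one row with likes but no dislikes, one empty row.
toy = pd.DataFrame({"likes": [100, 50, 0], "dislikes": [10, 0, 0]})

# Replace zero dislikes with NaN so the division yields NaN instead of inf.
ratio = toy["likes"] / toy["dislikes"].replace(0, np.nan)

# Undefined ratios are then mapped to 0, as in the notebook cell.
toy["like_ratio"] = ratio.fillna(0)
print(toy)
```

With this convention, a video with 50 likes and no dislikes gets the same ratio (0) as a video with no likes at all. Keep that in mind when averaging `like_ratio` later.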


## 6️⃣ Cluster publish times into 10-minute intervals

In [None]:

youtube["publish_time"] = pd.to_datetime(youtube["publish_time"], errors="coerce")
# "10min" is used instead of the "10T" alias, which newer pandas versions deprecate.
youtube["time_interval"] = youtube["publish_time"].dt.floor("10min")
youtube.head()
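
As a quick illustration of how flooring assigns timestamps to 10-minute buckets, consider three made-up publish times (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Three sample timestamps around a bucket boundary.
times = pd.to_datetime(
    pd.Series(["2017-11-13T17:13:01", "2017-11-13T17:19:59", "2017-11-13T17:20:00"])
)

# Each timestamp is rounded down to the start of its 10-minute interval.
buckets = times.dt.floor("10min")
print(buckets)
```

The first two timestamps fall into the 17:10 bucket, while 17:20:00 starts a new one.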


## 7️⃣ For each interval: number of videos, average likes/dislikes

In [None]:

interval_stats = youtube.groupby("time_interval").agg(
    videos_count=("video_id", "count"),
    avg_likes=("likes", "mean"),
    avg_dislikes=("dislikes", "mean")
).reset_index()
interval_stats.head()


## 8️⃣ & 9️⃣ Tag analysis: count and find most common tags

In [None]:

all_tags = youtube["tags"].dropna().str.split("|").explode()
tag_counts = all_tags.value_counts().reset_index()
tag_counts.columns = ["tag", "count"]
print("Top 10 tags:")
tag_counts.head(10)
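
The count above also includes the `[none]` placeholder and treats `"cat"` (with quotes) and `cat` as different tags. The following sketch shows one possible cleanup on made-up data. It assumes that some tags are wrapped in double quotes and that case should be ignored; both are assumptions about the CSV contents.

```python
import pandas as pd

# Toy tag strings: pipe-separated, some quoted, one "[none]" placeholder.
toy = pd.DataFrame({"tags": ['cat|"funny cat"|pets', "[none]", 'pets|"dog"']})

tags = (
    toy.loc[toy["tags"] != "[none]", "tags"]  # drop videos without tags
    .str.split("|")                           # one list of tags per video
    .explode()                                # one row per tag
    .str.strip('"')                           # remove surrounding quotes
    .str.lower()                              # normalise case
)
tag_counts_clean = tags.value_counts()
print(tag_counts_clean)
```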


## 🔟 For each (tag, country) pair, average like/dislike ratio

In [None]:

youtube["tags_split"] = youtube["tags"].str.split("|")
tags_country = youtube.explode("tags_split")
tag_country_ratio = tags_country.groupby(["tags_split", "country"])["like_ratio"].mean().reset_index()
tag_country_ratio.head()


## 1️⃣1️⃣ For each (trending_date, country) pair, video with most views

In [None]:

top_video_per_day = youtube.loc[youtube.groupby(["trending_date", "country"])["views"].idxmax()]
top_video_per_day.head()
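
The `idxmax` pattern above returns, for each group, the index label of the row with the highest `views`; `.loc` then retrieves those full rows. A minimal sketch on made-up data (all values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "trending_date": ["17.14.11", "17.14.11", "17.15.11", "17.14.11"],
    "country": ["US", "US", "US", "GB"],
    "video_id": ["a", "b", "c", "d"],
    "views": [100, 250, 80, 40],
})

# Index label of the most-viewed row within each (date, country) group.
best_rows = toy.groupby(["trending_date", "country"])["views"].idxmax()

# Look up the complete rows for those labels.
top = toy.loc[best_rows]
print(top)
```

If several videos tie for the maximum in a group, `idxmax` keeps only the first one it encounters.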


## 1️⃣2️⃣ Split trending_date into year, month, and day

In [None]:

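youtube["trending_date"] = youtube["trending_date"].astype(str)
# Assumption: trending_date is stored as YY.DD.MM (e.g. "17.14.11"), so the day precedes the month.
youtube[["year", "day", "month"]] = youtube["trending_date"].str.split(".", expand=True)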
youtube.head()
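
An alternative to splitting the string is to parse it as a date. The sketch below assumes the values follow a `YY.DD.MM` layout such as `17.14.11`. That layout is an assumption about the CSV contents, and the two sample values are made up.

```python
import pandas as pd

# Two sample values in the assumed YY.DD.MM layout.
dates = pd.Series(["17.14.11", "18.01.02"])

# Parsing with an explicit format gives integer year/month/day components.
parsed = pd.to_datetime(dates, format="%y.%d.%m")
parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
})
print(parts)
```

Parsing also fails loudly if a value does not match the expected layout, which a plain string split would not do.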


## 1️⃣3️⃣ For each (month, country), video with largest views

In [None]:

top_month_video = youtube.loc[youtube.groupby(["month", "country"])["views"].idxmax()]
top_month_video.head()


## 1️⃣4️⃣ Read all JSON category files

In [None]:

json_files = glob.glob("data/json/*.json")
categories = []

for file in json_files:
    country = os.path.basename(file).split("_")[0].upper()
    with open(file, "r", encoding="utf-8") as f:
        data = json.load(f)
        if "items" in data:
            for item in data["items"]:
                categories.append({"country": country, "id": item["id"], "title": item["snippet"]["title"]})

categories_df = pd.DataFrame(categories)
categories_df.head()


## 1️⃣5️⃣ For each country, count videos with unassignable category

In [None]:

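youtube["category_id"] = youtube["category_id"].astype(str)

# Rename the lookup columns so "title" does not collide with the video title column
# (a collision would produce title_x/title_y and break the null check below).
category_lookup = (
    categories_df.astype({"id": str})
    .rename(columns={"id": "category_id", "title": "category_title"})
    .drop_duplicates(["category_id", "country"])
)

merged = youtube.merge(category_lookup, on=["category_id", "country"], how="left")
unassigned = merged[merged["category_title"].isnull()]
unassigned_count = unassigned.groupby("country")["video_id"].count().reset_index(name="unassignable_count")
unassigned_count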
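
Another way to find rows without a matching category is the `indicator` option of `merge`, which records whether each row was matched. The sketch below uses made-up frames; `videos`, `cats`, and the column name `category_title` are illustrative.

```python
import pandas as pd

videos = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "country": ["US", "US", "GB"],
    "category_id": ["10", "99", "10"],
})
cats = pd.DataFrame({
    "country": ["US", "GB"],
    "category_id": ["10", "24"],
    "category_title": ["Music", "Entertainment"],
})

# indicator=True adds a "_merge" column; validate guards against duplicate lookup keys.
checked = videos.merge(
    cats,
    on=["category_id", "country"],
    how="left",
    indicator=True,
    validate="many_to_one",
)

# Rows marked "left_only" found no category for their (category_id, country) pair.
missing = checked[checked["_merge"] == "left_only"]
counts = missing.groupby("country")["video_id"].count()
print(counts)
```

The `validate="many_to_one"` check raises an error if the lookup table has duplicate keys, which would otherwise silently duplicate video rows.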


## 📊 Bonus Visualization: Top 10 Channels by Views

In [None]:

plt.figure(figsize=(10,5))
sns.barplot(x="views", y="channel_title", data=channel_views.head(10))
plt.title("Top 10 Channels by Total Views")
plt.xlabel("Total Views")
plt.ylabel("Channel")
plt.show()
