📌 Project Introduction

This project analyzes the Netflix Movies & TV Shows dataset as part of my data science portfolio.
The goal is to practice core data analysis skills by:

- Cleaning and preparing a real-world dataset
- Exploring trends in Netflix content (Movies vs TV Shows, growth over time)
- Identifying the leading countries and most popular genres
- Highlighting insights relevant to Australia
- Summarizing the results in both visuals and a Markdown report

The outputs include charts (PNG images) and a concise findings summary file, which can be shared on GitHub or LinkedIn to demonstrate skills in Python, Pandas, visualization, and storytelling with data.

In [1]:
#@title 🛠️ Setup (creates folders, sets defaults)
import os
import shutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

ME = "Issa Mohamud"
PROJECT = "Netflix Movies & Shows Analysis"
SEED = 42
np.random.seed(SEED)

# Clean plot style
plt.rcParams.update({
    "figure.figsize": (10, 5),
    "axes.grid": True,
    "grid.alpha": 0.3,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "font.size": 11
})

# Folders
os.makedirs("outputs", exist_ok=True)
os.makedirs("notebook_assets", exist_ok=True)

print("✅ Setup complete.")


✅ Setup complete.


In [3]:
#@title ⬆️ Upload `netflix_titles.csv`
from google.colab import files
uploaded = files.upload()  # select your netflix_titles.csv

CSV = None
for name in uploaded.keys():
    if name.lower().endswith(".csv"):
        CSV = name
        break

if CSV is None:
    raise FileNotFoundError("No CSV file found in the upload. Please upload netflix_titles.csv.")

print(f"✅ Found CSV: {CSV}")


Saving netflix_titles.csv to netflix_titles (1).csv
✅ Found CSV: netflix_titles (1).csv
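
The upload cell depends on `google.colab`, so it only runs inside Colab. As a rough sketch of an alternative for a local Jupyter session, the helper below looks for the file on disk instead. The name `find_csv` and its `stem` parameter are hypothetical and not part of the notebook; choosing the newest match is an assumption made because re-uploads are saved under names like `netflix_titles (1).csv`.

```python
from pathlib import Path
from typing import Optional


def find_csv(folder: str = ".", stem: str = "netflix_titles") -> Optional[Path]:
    """Return the most recently modified CSV in `folder` whose name starts with `stem`."""
    candidates = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".csv" and p.name.startswith(stem)
    ]
    if not candidates:
        return None
    # Re-uploaded copies get suffixes such as " (1)"; prefer the newest file.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    print(find_csv("."))  # None when no matching CSV is in the working directory
```

If a path is returned, it can be passed straight to `pd.read_csv` in place of `CSV`.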


In [4]:
#@title 📦 Load & Tidy Data
df_raw = pd.read_csv(CSV)

def tidy_netflix(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # canonical cols
    df.columns = [c.strip().replace(" ", "_").lower() for c in df.columns]

    # fill missing
    for col in ["country", "rating", "listed_in", "director", "cast"]:
        if col in df.columns:
            df[col] = df[col].fillna("Unknown")

    # dates
    if "date_added" in df.columns:
        df["date_added"] = pd.to_datetime(df["date_added"], errors="coerce")
        df["year_added"] = df["date_added"].dt.year

    # duration → minutes / seasons
    if "duration" in df.columns:
        mins = df["duration"].str.extract(r"(\d+)\s*min", expand=False).astype("float")
        seasons = df["duration"].str.extract(r"(\d+)\s*Season", expand=False).astype("float")
        df["duration_minutes"] = mins
        df["seasons"] = seasons

    # first country / first genre
    df["primary_country"] = df["country"].str.split(",").str[0].str.strip()
    df["primary_genre"] = df["listed_in"].str.split(",").str[0].str.strip()
    return df

df = tidy_netflix(df_raw)
print("✅ Loaded rows:", len(df))
df.head(3)


✅ Loaded rows: 8807


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,duration_minutes,seasons,primary_country,primary_genre
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,Unknown,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",2021.0,90.0,,United States,Documentaries
1,s2,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",2021.0,,2.0,South Africa,International TV Shows
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",Unknown,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,2021.0,,1.0,Unknown,Crime TV Shows
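
One thing worth checking after tidying is how many `date_added` values failed to parse, because `errors="coerce"` silently turns them into `NaT` and those rows then drop out of any count by `year_added`. The sketch below assumes the raw strings look like `September 25, 2021` and that some may carry stray leading or trailing whitespace; both are assumptions about the raw file, and the sample values are made up. `parse_date_added` is a hypothetical helper, not the notebook's own code.

```python
import pandas as pd


def parse_date_added(raw: pd.Series) -> pd.Series:
    """Parse strings like 'September 25, 2021', tolerating stray whitespace."""
    cleaned = raw.str.strip()  # remove leading/trailing spaces before parsing
    return pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")


# Made-up sample values; the second one has a leading space.
sample = pd.Series(["September 25, 2021", " August 4, 2017", None, "not a date"])
parsed = parse_date_added(sample)

print(parsed.dt.year.tolist())
print("unparsed rows:", int(parsed.isna().sum()))
```

Comparing the unparsed count with the number of rows whose raw value was actually missing shows whether any dates were lost to formatting rather than to missing data.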


In [6]:
#@title 🔧 Plot helpers
def save_plot(fig, name):
    path = f"outputs/{name}.png"
    fig.savefig(path, bbox_inches="tight", dpi=180)
    plt.close(fig)
    return path

def autopct_pct(values):
    def _p(pct):
        total = np.sum(values)
        val = int(round(pct * total / 100.0))
        return f"{pct:.1f}%\n(n={val})"
    return _p

print("✅ Helpers ready.")


✅ Helpers ready.
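
`autopct_pct` works because Matplotlib hands an `autopct` callable the slice's percentage, and the helper multiplies that back by the total to recover the count. The snippet below is a small, plot-free illustration of that round trip. The counts are hypothetical, and `pct_to_count` is just the inner arithmetic of `_p` pulled out under a made-up name.

```python
def pct_to_count(pct: float, total: int) -> int:
    """Invert a pie-slice percentage back to the count it came from."""
    return int(round(pct * total / 100.0))


# Hypothetical category counts, for illustration only.
counts = [7, 2, 4]
total = sum(counts)

for n in counts:
    pct = 100.0 * n / total  # the percentage passed to an autopct callable
    print(f"{pct:.1f}% (n={pct_to_count(pct, total)})")
```

The `round` call matters: the percentage is a float, so truncating with `int` alone could land one below the true count.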


In [7]:
#@title 🍿 Movies vs TV Shows
counts_type = df["type"].value_counts().sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(6,6))
ax.pie(counts_type.values, labels=counts_type.index, autopct=autopct_pct(counts_type.values))
ax.set_title("Movies vs TV Shows on Netflix")
p1 = save_plot(fig, "01_movies_vs_tv")

print("✅ Saved:", p1)


✅ Saved: outputs/01_movies_vs_tv.png


In [8]:
#@title 📈 Growth of catalog by year added
growth = df.dropna(subset=["year_added"]).groupby("year_added").size()

fig, ax = plt.subplots()
ax.plot(growth.index, growth.values, marker="o")
ax.set_title("Growth of Netflix Catalog by Year Added")
ax.set_xlabel("Year added")
ax.set_ylabel("Titles added")
p2 = save_plot(fig, "02_growth_by_year_added")

print("✅ Saved:", p2)


✅ Saved: outputs/02_growth_by_year_added.png
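
The growth chart counts all titles together. Since one stated goal is to compare Movies and TV Shows over time, a possible extension is to split the yearly counts by `type`. This is a sketch on a tiny made-up frame that reuses the notebook's column names; it is not a result from the real dataset.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the tidied data.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "year_added": [2019, 2019, 2019, 2020, None],
})

by_type = (
    toy.dropna(subset=["year_added"])      # rows without a parsed date cannot be placed in a year
       .astype({"year_added": int})
       .groupby(["year_added", "type"])
       .size()
       .unstack(fill_value=0)              # one column per type; missing combinations become 0
)
print(by_type)
```

Calling `by_type.plot(ax=ax, marker="o")` on the real data would then draw one line per content type on an existing axes.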


In [9]:
#@title 🌍 Top 10 countries (primary)
top_countries = df["primary_country"].value_counts().head(10)[::-1]

fig, ax = plt.subplots()
ax.barh(top_countries.index, top_countries.values)
ax.set_title("Top 10 Countries by Number of Titles (Primary Country)")
ax.set_xlabel("Count of titles")
p3 = save_plot(fig, "03_top_countries")

print("✅ Saved:", p3)


✅ Saved: outputs/03_top_countries.png
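
Ranking by `primary_country` credits each title only to the first country listed, and the `"Unknown"` fill value competes with real countries. A hedged alternative is to count every listed country and drop the placeholder. `count_all_countries` is a hypothetical helper and the sample rows are invented; the same idea would apply to `listed_in` for genres.

```python
import pandas as pd


def count_all_countries(country: pd.Series) -> pd.Series:
    """Count every country in a comma-separated field, not only the first."""
    parts = country.str.split(",").explode().str.strip()
    # Drop empty tokens (from stray commas) and the "Unknown" fill value.
    parts = parts[(parts != "") & (parts != "Unknown")]
    return parts.value_counts()


# Made-up rows, for illustration only.
toy = pd.Series(["United States", "Australia, United States", "Unknown", "India,"])
print(count_all_countries(toy))
```

Totals from this approach exceed the number of titles, because a co-production is counted once per country.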


In [10]:
#@title 🎭 Top genres (primary)
top_genres = df["primary_genre"].value_counts().head(12)[::-1]

fig, ax = plt.subplots()
ax.barh(top_genres.index, top_genres.values)
ax.set_title("Top Genres (Primary)")
ax.set_xlabel("Count of titles")
p4 = save_plot(fig, "04_top_genres")

print("✅ Saved:", p4)


✅ Saved: outputs/04_top_genres.png


In [11]:
#@title 🏷️ Ratings distribution (Movies vs TV)
def plot_rating_dist(subset, title, fname):
    vc = subset["rating"].value_counts().head(10)[::-1]
    fig, ax = plt.subplots()
    ax.barh(vc.index, vc.values)
    ax.set_title(title)
    ax.set_xlabel("Count of titles")
    return save_plot(fig, fname)

p5a = plot_rating_dist(df[df["type"]=="Movie"], "Rating Distribution (Movies)", "05a_ratings_movies")
p5b = plot_rating_dist(df[df["type"]=="TV Show"], "Rating Distribution (TV Shows)", "05b_ratings_tv")

print("✅ Saved:", p5a, "and", p5b)


✅ Saved: outputs/05a_ratings_movies.png and outputs/05b_ratings_tv.png


In [12]:
#@title 🇦🇺 Australia spotlight
au = df[df["primary_country"].str.contains("Australia", case=False, na=False)]
au_titles = len(au)

au_growth = au.dropna(subset=["year_added"]).groupby("year_added").size()
fig, ax = plt.subplots()
ax.plot(au_growth.index, au_growth.values, marker="o")
ax.set_title("Australia: Titles Added by Year")
ax.set_xlabel("Year added")
ax.set_ylabel("Titles added")
p6 = save_plot(fig, "06_au_growth")

print(f"✅ AU titles: {au_titles} | Saved:", p6)
au.head(5)


✅ AU titles: 117 | Saved: outputs/06_au_growth.png


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,year_added,duration_minutes,seasons,primary_country,primary_genre
25,s26,TV Show,Love on the Spectrum,Unknown,Brooke Satchwell,Australia,2021-09-21,2021,TV-14,2 Seasons,"Docuseries, International TV Shows, Reality TV",Finding love can be hard for anyone. For young...,2021.0,,2.0,Australia,Docuseries
108,s109,TV Show,Dive Club,Unknown,"Aubri Ibrag, Sana'a Shaik, Miah Madden, Mercy ...",Australia,2021-09-03,2021,TV-G,1 Season,"Kids' TV, TV Dramas, Teen TV Shows","On the shores of Cape Mercy, a skillful group ...",2021.0,,1.0,Australia,Kids' TV
120,s121,TV Show,Heroes of Goo Jit Zu,Unknown,"Jon Allen, Kellen Goff, Joe Hernandez, Kaiji Tang",Australia,2021-09-02,2021,TV-Y7,1 Season,"Kids' TV, TV Comedies","After a meteor crash, a group of zoo animals t...",2021.0,,1.0,Australia,Kids' TV
137,s138,Movie,Crocodile Dundee in Los Angeles,Simon Wincer,"Paul Hogan, Linda Kozlowski, Jere Burns, Jonat...","Australia, United States",2021-09-01,2001,PG,95 min,"Action & Adventure, Comedies","When Mick ""Crocodile"" Dundee and his family la...",2021.0,95.0,,Australia,Action & Adventure
393,s394,Movie,A Second Chance: Rivals!,Clay Glen,"Emily Morris, Stella Shute, Eva Grados, India ...",Australia,2021-07-23,2021,PG,91 min,"Children & Family Movies, Sports Movies",Crushed when she doesn't qualify for the Olymp...,2021.0,91.0,,Australia,Children & Family Movies
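
The spotlight above filters on `primary_country`, so a co-production that lists Australia second is not included. If the intent were to capture any Australian involvement, one option is to test each comma-separated entry of `country`. The function below is a minimal sketch with a hypothetical name, `lists_country`; it compares whole tokens, so similar-looking names do not match by accident.

```python
def lists_country(country_field: str, target: str) -> bool:
    """True if `target` is one of the comma-separated countries in `country_field`."""
    tokens = [t.strip().casefold() for t in country_field.split(",")]
    return target.casefold() in tokens


print(lists_country("Australia, United States", "Australia"))  # listed first
print(lists_country("United States, Australia", "Australia"))  # co-production, listed second
print(lists_country("Austria", "Australia"))                   # different country
```

On the tidied frame this could be applied as `df["country"].apply(lists_country, target="Australia")`, which works because missing countries were already filled with the string `"Unknown"`.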


In [15]:
#@title 🧾 Write short findings (Markdown)

findings = []

# 1) Movies vs TV
mv_pct = round(100 * (df["type"] == "Movie").mean(), 1)
tv_pct = round(100 - mv_pct, 1)  # round again: a bare subtraction can print float noise such as 30.400000000000006
findings.append(f"- Mix: ~{mv_pct}% Movies vs ~{tv_pct}% TV Shows.")

# 2) Growth (peak year)
if not growth.empty:
    peak_year = int(growth.idxmax())
    peak_val = int(growth.max())
    findings.append(f"- Peak content additions in **{peak_year}** (~{peak_val} titles).")

# 3) Top countries
tc = df["primary_country"].value_counts().head(3).index.tolist()
if tc:
    findings.append(f"- Top countries: **{', '.join(tc)}**.")

# 4) Top genres
tg = df["primary_genre"].value_counts().head(3).index.tolist()
if tg:
    findings.append(f"- Popular genres: **{', '.join(tg)}**.")

# 5) Australia note
findings.append(f"- Australia spotlight: **{len(au)}** titles in the catalog.")

# Join outside the f-string: before Python 3.12, an f-string expression cannot contain a backslash
findings_text = "\n".join(findings)

summary_md = (
    f"# {PROJECT} — Key Findings (by {ME})\n\n"
    f"{findings_text}\n\n"
    f"> Notes:\n"
    f"> - Dataset: `netflix_titles.csv`\n"
    f"> - Steps: tidy data → EDA → visuals → findings\n"
    f"> - Author: {ME}\n"
)

with open("outputs/findings.md", "w", encoding="utf-8") as f:
    f.write(summary_md)

print("✅ Wrote outputs/findings.md")
print(summary_md)


✅ Wrote outputs/findings.md
# Netflix Movies & Shows Analysis — Key Findings (by Issa Mohamud)

- Mix: ~69.6% Movies vs ~30.4% TV Shows.
- Peak content additions in **2019** (~1999 titles).
- Top countries: **United States, India, Unknown**.
- Popular genres: **Dramas, Comedies, Action & Adventure**.
- Australia spotlight: **117** titles in the catalog.

> Notes:
> - Dataset: `netflix_titles.csv`
> - Steps: tidy data → EDA → visuals → findings
> - Author: Issa Mohamud
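
A note on the percentages in the findings: subtracting a rounded float from 100 can reintroduce binary floating-point noise, so it is usually safer to keep full precision during the arithmetic and fix the number of decimals only when formatting. The helper below is a sketch with a hypothetical name, `format_mix`; the share `0.696` is used only because it matches the movie percentage reported above.

```python
def format_mix(movie_share: float) -> str:
    """Format a Movies/TV split, given the movie share as a fraction between 0 and 1."""
    movie_pct = 100 * movie_share
    tv_pct = 100 - movie_pct
    # Decimals are fixed at output time by the :.1f format spec.
    return f"- Mix: ~{movie_pct:.1f}% Movies vs ~{tv_pct:.1f}% TV Shows."


print(100 - 69.6)         # a bare subtraction, printed at full float precision
print(format_mix(0.696))  # the same split, formatted to one decimal place
```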



In [16]:
#@title 📦 Make ZIP & Download
from google.colab import files

zip_path = "issa_netflix_outputs.zip"
if os.path.exists(zip_path):
    os.remove(zip_path)

shutil.make_archive("issa_netflix_outputs", "zip", "outputs")
files.download(zip_path)

print("✅ Download started for:", zip_path)


✅ Download started for: issa_netflix_outputs.zip
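
Before sharing the archive it can be useful to confirm what actually went into it. `shutil.make_archive` returns the path of the file it wrote, and `zipfile.ZipFile.namelist` lists the members. The sketch below runs on throwaway folders with placeholder files so it does not touch `outputs/`; `zip_folder` and the file names are illustrative, not part of the notebook.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path
from typing import List


def zip_folder(src_dir: str, archive_base: str) -> List[str]:
    """Zip `src_dir` into `archive_base`.zip and return the sorted member names."""
    archive_path = shutil.make_archive(archive_base, "zip", src_dir)
    with zipfile.ZipFile(archive_path) as zf:
        return sorted(zf.namelist())


# Demo on temporary folders; the archive is written outside the folder being zipped.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / "findings.md").write_text("# placeholder\n", encoding="utf-8")
    (Path(src) / "01_movies_vs_tv.png").write_bytes(b"")
    members = zip_folder(src, str(Path(dst) / "demo_outputs"))
    print(members)
```

Writing the archive outside the source folder, as the notebook also does, avoids the archive trying to include itself.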
