---
title: "Assignment 04 — Lightcast Job Market Analysis"
author:
  - name: "Othmane Elouardi"
    affiliations:
      - id: bu
        name: "Boston University"
        city: "Boston"
        state: "MA"
date: 2025-10-08
number-sections: true
format:
  html:
    theme:
      light: lux
      dark: slate
    toc: true
    toc-depth: 3
    toc-location: right
    smooth-scroll: true
    code-fold: true
    code-tools: true
    code-line-numbers: true
    highlight-style: a11y
    page-layout: article
    css: styles.css
    grid:
      body-width: 900px     
      margin-width: 280px   
execute:
  echo: true
  warning: false
  error: false
  freeze: auto
jupyter: env
---

# Introduction

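This report analyzes job postings from the **Lightcast Job Market dataset**. It covers salary distributions by employment type and industry, posting volume over time, the most common job titles, remote-work arrangements, skill demand by industry, salaries by occupation, and how job titles map to ONET occupations.
All visualizations are interactive Plotly figures, so you can hover over any element to see the underlying values.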

---

# Load the Dataset

In [None]:
#| label: load-data
#| echo: true
#| warning: false

import pandas as pd
import numpy as np
from pathlib import Path

DATA_PATH = Path("data/lightcast_job_postings.csv")

# 1) File sanity check
assert DATA_PATH.exists(), f"CSV not found at {DATA_PATH.resolve()}"

# 2) Robust read (handles wide schema + mixed types)
df = pd.read_csv(
    DATA_PATH,
    low_memory=False,        # avoid dtype guessing issues
    parse_dates=False,       # we’ll parse dates explicitly later
    dtype=str                # keep raw text first; coerce below
)

print("✅ Dataset loaded successfully!")
print(f"Rows: {len(df):,}  |  Columns: {len(df.columns):,}")

# 3) Quick schema peek (first 12 columns to keep output tidy)
preview_cols = list(df.columns[:12])
display(df[preview_cols].head(5))

# 4) Helpful normalized aliases (so later sections work even if column names vary a bit)
#    Feel free to add more aliases if your CSV headers differ.
ALIASES = {
    "EMPLOYMENT_TYPE_NAME": ["EMPLOYMENT_TYPE_NAME", "EMPLOYMENT_TYPE", "EMP_TYPE"],
    "SALARY_FROM":          ["SALARY_FROM", "SAL_FROM", "MIN_SALARY", "SALARY_MIN"],
    "SALARY_TO":            ["SALARY_TO", "SAL_TO", "MAX_SALARY", "SALARY_MAX"],
    "INDUSTRY_NAME":        ["INDUSTRY_NAME", "NAICS2_NAME", "NAICS_NAME"],
    "JOB_TITLE":            ["JOB_TITLE", "TITLE_NAME", "TITLE"],
    "POSTED":               ["POSTED", "POSTED_DATE", "DATE_POSTED"],
    "REMOTE_TYPE_NAME":     ["REMOTE_TYPE_NAME", "REMOTE_TYPE", "REMOTE"],
    "SKILL_NAME":           ["SKILL_NAME", "SKILL"]
}

def pick(existing: list[str], candidates: list[str]) -> str | None:
    for c in candidates:
        if c in existing:
            return c
    return None

use = {k: pick(df.columns.tolist(), v) for k, v in ALIASES.items()}
print("Resolved column names:", use)

# 5) Minimal cleaning: numbers/dates we’ll need later
if use["SALARY_FROM"]:
    df["SALARY_FROM_NUM"] = pd.to_numeric(df[use["SALARY_FROM"]], errors="coerce")
if use["SALARY_TO"]:
    df["SALARY_TO_NUM"] = pd.to_numeric(df[use["SALARY_TO"]], errors="coerce")
if use["POSTED"]:
    df["POSTED_DATE"] = pd.to_datetime(df[use["POSTED"]], errors="coerce", utc=True).dt.date

# 6) Tiny health report
health = {
    "non-null rows (any)": int(df.dropna(how="all").shape[0]),
    "with salary_from": int(df["SALARY_FROM_NUM"].notna().sum() if "SALARY_FROM_NUM" in df else 0),
    "with salary_to": int(df["SALARY_TO_NUM"].notna().sum() if "SALARY_TO_NUM" in df else 0),
    "with posted_date": int(df["POSTED_DATE"].notna().sum() if "POSTED_DATE" in df else 0),
}
health

# Salary Distribution by Employment Type


In [None]:
#| label: salary-by-employment-type
#| echo: true
#| warning: false

import plotly.express as px

# usable columns
col_emp = use.get("EMPLOYMENT_TYPE_NAME")
col_sal = "SALARY_FROM_NUM" if "SALARY_FROM_NUM" in df else use.get("SALARY_FROM")

if col_emp and col_sal:
    # Drop missing values for clean plotting
    subset = df.dropna(subset=[col_emp, col_sal])
    
    # filter extreme outliers for clearer visualization
    q_low, q_high = subset[col_sal].quantile([0.05, 0.95])
    subset = subset[(subset[col_sal] >= q_low) & (subset[col_sal] <= q_high)]

    # Create the boxplot
    fig = px.box(
        subset,
        x=col_emp,
        y=col_sal,
        color=col_emp,
        title="Salary Distribution by Employment Type",
        template="plotly_dark",
        color_discrete_sequence=px.colors.qualitative.Bold
    )

    fig.update_layout(
        xaxis_title="Employment Type",
        yaxis_title="Salary (From)",
        title_font=dict(size=20, family="Inter", color="#1f6feb"),
        font=dict(family="Inter", size=13),
        plot_bgcolor="rgba(0,0,0,0)",
        paper_bgcolor="rgba(0,0,0,0)",
        margin=dict(t=60, l=60, r=40, b=60)
    )

    fig.show()
else:
    print("❌ Required columns not found for salary distribution plot.")

## ✏️ Explanation

The box plot shows how salary levels vary across employment types, based on the lower bound of each posted salary range (`SALARY_FROM`) after dropping values outside the 5th to 95th percentiles. Full-time positions generally have higher median salaries and a wider pay range, reflecting greater earning potential but also more variability.
In contrast, part-time and contract roles exhibit lower median salaries with tighter ranges, suggesting more consistency but fewer high-paying opportunities.
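
The comparison of medians and ranges above can also be read off a small summary table instead of the plot. The following is a minimal sketch on toy data, not the Lightcast file: the column names match the ones used in this report, but the values and the helper names `q25` and `q75` are illustrative assumptions.

```python
import pandas as pd

# Toy data standing in for the postings table (values are made up for illustration)
toy = pd.DataFrame({
    "EMPLOYMENT_TYPE_NAME": ["Full-time"] * 5 + ["Part-time"] * 5,
    "SALARY_FROM_NUM": [60000, 80000, 100000, 120000, 140000,
                        30000, 35000, 40000, 45000, 50000],
})

def q25(s):
    """25th percentile of a salary series."""
    return s.quantile(0.25)

def q75(s):
    """75th percentile of a salary series."""
    return s.quantile(0.75)

# One row per employment type: count, median, and quartiles
summary = (
    toy.groupby("EMPLOYMENT_TYPE_NAME")["SALARY_FROM_NUM"]
       .agg(n="size", median="median", q25=q25, q75=q75)
)

# The interquartile range is the height of each box in the box plot
summary["iqr"] = summary["q75"] - summary["q25"]
print(summary)
```

A higher `median` with a larger `iqr` corresponds to the "higher pay, wider range" pattern described for full-time roles.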




# Salary Distribution by Industry

In [None]:
#| label: salary-by-industry
#| echo: true
#| warning: false

import numpy as np
import pandas as pd
import plotly.express as px

# Resolve column names dynamically (using your earlier `use` dict if present)
col_ind = (use.get("INDUSTRY_NAME") if "use" in locals() else
           ("INDUSTRY_NAME" if "INDUSTRY_NAME" in df.columns else None))
col_sal = ("SALARY_FROM_NUM" if "SALARY_FROM_NUM" in df.columns else
           (use.get("SALARY_FROM") if "use" in locals() else
            ("SALARY_FROM" if "SALARY_FROM" in df.columns else None)))

if col_ind and col_sal:
    # Keep only rows with industry and salary
    dfi = df.dropna(subset=[col_ind, col_sal]).copy()

    # Pick top-N industries by posting volume to keep the chart readable
    TOP_N = 10
    top_inds = (dfi[col_ind]
                .value_counts(dropna=False)
                .nlargest(TOP_N)
                .index)
    dfi = dfi[dfi[col_ind].isin(top_inds)]

    # Trim extreme outliers globally for clarity (5th–95th percentile)
    ql, qh = dfi[col_sal].quantile([0.05, 0.95])
    dfi = dfi[(dfi[col_sal] >= ql) & (dfi[col_sal] <= qh)]

    # Order industries by median salary (descending)
    medians = dfi.groupby(col_ind)[col_sal].median().sort_values(ascending=False)
    category_order = medians.index.tolist()

    # Box plot (clean + interactive)
    fig = px.box(
        dfi,
        x=col_ind,
        y=col_sal,
        color=col_ind,
        category_orders={col_ind: category_order},
        title="Salary Distribution by Industry (Top 10 by Postings)",
        template="plotly_dark",
        color_discrete_sequence=px.colors.qualitative.Bold,
        points=False  # hide raw points to keep it tidy
    )

    fig.update_layout(
        xaxis_title="Industry",
        yaxis_title="Salary (From)",
        title_font=dict(size=20, family="Inter", color="#1f6feb"),
        font=dict(family="Inter", size=13),
        xaxis_tickangle=35,
        plot_bgcolor="rgba(0,0,0,0)",
        paper_bgcolor="rgba(0,0,0,0)",
        margin=dict(t=60, l=60, r=40, b=80),
        showlegend=False
    )

    fig.show()
else:
    print("❌ Required columns not found for industry salary plot.")


In [None]:
#| label: salary-by-industry-stats
#| echo: false
#| warning: false

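if col_ind and col_sal:
    # Note: this table ranks all industries by median salary on the untrimmed data,
    # so its top 10 can differ from the chart above (top 10 by posting volume, trimmed).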
    stats = (df.dropna(subset=[col_ind, col_sal])
               .groupby(col_ind)[col_sal]
               .agg(N="size", mean="mean", median="median", p25=lambda s: s.quantile(0.25),
                    p75=lambda s: s.quantile(0.75))
               .sort_values("median", ascending=False)
               .head(10))

    # Round for neatness
    display(stats.round(0))


## ✏️ Explanation

The chart shows that salary levels vary notably across industries. The Information sector and the Accommodation and Food Services sector exhibit the highest median and upper-range salaries, suggesting strong compensation potential in these fields.
Meanwhile, industries like Administrative Support and Retail Trade tend to offer lower median salaries, reflecting more standardized pay structures and fewer high-paying roles.
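
The industry chart trims outliers with one global 5th to 95th percentile cut-off, which applies the same bounds to every industry regardless of its own pay level. One alternative, sketched here on toy data, is to trim within each industry. The function name `trim_within_group` and the toy values are assumptions for illustration, not part of the analysis above.

```python
import pandas as pd

# Toy data: industry "A" has one extreme value, industry "B" is evenly spread
toy = pd.DataFrame({
    "industry": ["A"] * 5 + ["B"] * 5,
    "salary": [10, 20, 30, 40, 1000, 100, 200, 300, 400, 500],
})

def trim_within_group(frame, group_col, value_col, lo=0.05, hi=0.95):
    """Keep rows whose value lies inside the [lo, hi] quantile band of their own group."""
    grouped = frame.groupby(group_col)[value_col]
    # transform broadcasts each group's quantile back onto that group's rows
    low = grouped.transform(lambda s: s.quantile(lo))
    high = grouped.transform(lambda s: s.quantile(hi))
    return frame[(frame[value_col] >= low) & (frame[value_col] <= high)]

trimmed = trim_within_group(toy, "industry", "salary")
print(trimmed.groupby("industry")["salary"].agg(["min", "max", "size"]))
```

Whether global or per-group trimming is preferable depends on the question: global bounds keep industries comparable on one scale, while per-group bounds protect each industry's own distribution.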


# Job Posting Trends Over Time


In [None]:
#| label: job-posting-trends
#| echo: true
#| warning: false

import pandas as pd
import plotly.express as px

# Ensure date column exists and is parsed
if "POSTED" in df.columns:
    df["POSTED_DATE"] = pd.to_datetime(df["POSTED"], errors="coerce")

    # Aggregate daily counts
    trend = (df.dropna(subset=["POSTED_DATE"])
               .groupby("POSTED_DATE")
               .size()
               .reset_index(name="Job_Postings"))

    # Create line chart
    fig = px.line(
        trend,
        x="POSTED_DATE",
        y="Job_Postings",
        title="Job Posting Trends Over Time",
        template="plotly_dark",
        color_discrete_sequence=["#37f3c0"]
    )

    fig.update_layout(
        xaxis_title="Posted Date",
        yaxis_title="Number of Job Postings",
        font=dict(family="Inter", size=13),
        title_font=dict(size=20, family="Inter", color="#1f6feb"),
        hovermode="x unified",
        plot_bgcolor="rgba(0,0,0,0)",
        paper_bgcolor="rgba(0,0,0,0)",
        margin=dict(t=60, l=60, r=40, b=80),
    )

    fig.update_traces(line=dict(width=2.5))
    fig.show()
else:
    print("❌ 'POSTED' column not found.")


## ✏️ Explanation

The trend line reveals noticeable fluctuations in job posting activity, indicating that hiring demand changes frequently over time. Peaks suggest periods of intensified recruitment, possibly driven by seasonal hiring cycles or new project launches, while the dips represent slower hiring phases.
Overall, the data highlights a dynamic job market with recurring surges in posting volume.
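
Daily counts are noisy, so peaks and dips are easier to judge after smoothing. The sketch below, on a handful of made-up dates, shows two common options: a 7-day rolling mean and weekly totals. The dates are illustrative assumptions and the variable names are not used elsewhere in this report.

```python
import pandas as pd

# Toy posting dates (one entry per posting); not taken from the dataset
dates = pd.to_datetime([
    "2024-05-01", "2024-05-01", "2024-05-02",
    "2024-05-04", "2024-05-04", "2024-05-04",
    "2024-05-09",
])

# Daily counts, with explicit zeros for days that had no postings
daily = pd.Series(1, index=dates).resample("D").sum()

# 7-day rolling mean smooths out day-to-day noise
rolling = daily.rolling(window=7, min_periods=1).mean()

# Weekly totals (weeks ending on Sunday) give a coarser view of the same series
weekly = daily.resample("W").sum()

print(daily)
print(weekly)
```

Filling missing days with zero matters here: grouping by date alone skips days without postings, which makes a line chart connect across gaps and hides quiet periods.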


# Top 10 Job Titles by Count


In [None]:
#| label: top-job-titles
#| echo: true
#| warning: false

import pandas as pd
import plotly.express as px

if "TITLE_NAME" in df.columns:
    # Count occurrences and select top 10
    top_jobs = df["TITLE_NAME"].value_counts().nlargest(10)

    # Create bar chart
    fig = px.bar(
        x=top_jobs.index,
        y=top_jobs.values,
        title="Top 10 Job Titles by Count",
        text_auto=True,
        color=top_jobs.values,
        color_continuous_scale="tealgrn",
        template="plotly_dark"
    )

    fig.update_layout(
        xaxis_title="Job Title",
        yaxis_title="Number of Postings",
        font=dict(family="Inter", size=13),
        title_font=dict(size=20, family="Inter", color="#1f6feb"),
        xaxis_tickangle=40,
        plot_bgcolor="rgba(0,0,0,0)",
        paper_bgcolor="rgba(0,0,0,0)",
        margin=dict(t=60, l=60, r=40, b=100)
    )

    fig.show()
else:
    print("❌ 'TITLE_NAME' column not found in dataset.")


## ✏️ Explanation
The chart shows that Data Analyst is by far the most frequently posted job title, indicating high market demand for data-focused professionals. Enterprise Architect and Data Engineer also appear prominently, reflecting the need for both analytical and technical expertise in data-driven organizations. Unclassified ranks high as well, but it appears to be a catch-all label for postings without a standardized title rather than an actual role, so it should not be read as demand for a specific job.
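
If placeholder labels should not compete with real titles in the ranking, they can be filtered out before counting. This is a minimal sketch on toy titles. The set `PLACEHOLDERS` is an assumption about which labels count as placeholders and would need to be checked against the actual `TITLE_NAME` values.

```python
import pandas as pd

# Toy titles, including a placeholder label, stray whitespace, and a missing value
titles = pd.Series([
    "Data Analyst", "Unclassified", "Data Analyst",
    "Data Engineer ", "unclassified", None, "Data Analyst",
])

# Assumed placeholder labels (lowercase); adjust after inspecting the real data
PLACEHOLDERS = {"unclassified", ""}

clean = titles.dropna().str.strip()
is_placeholder = clean.str.lower().isin(PLACEHOLDERS)

# Count only real titles
top = clean[~is_placeholder].value_counts()
print(top)
```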

# Remote vs On-Site Job Postings


In [None]:
#| label: remote-vs-onsite
#| echo: true
#| warning: false

import plotly.express as px

if "REMOTE_TYPE_NAME" in df.columns:
    remote_counts = df["REMOTE_TYPE_NAME"].value_counts().reset_index()
    remote_counts.columns = ["Remote Type", "Count"]

    fig = px.pie(
        remote_counts,
        names="Remote Type",
        values="Count",
        title="Remote vs On-Site Job Postings",
        color_discrete_sequence=px.colors.qualitative.Pastel
    )

    fig.update_traces(textposition="inside", textinfo="percent+label")
    fig.update_layout(
        title_font=dict(size=20, family="Inter", color="#1f6feb"),
        font=dict(family="Inter", size=14),
        showlegend=False,
        paper_bgcolor="rgba(0,0,0,0)",
        plot_bgcolor="rgba(0,0,0,0)"
    )

    fig.show()
else:
    print("❌ 'REMOTE_TYPE_NAME' column not found in dataset.")


## ✏️ Explanation
The chart shows that a large majority of job postings do not specify a remote type, while around 17% explicitly offer remote positions. A smaller portion of listings are hybrid or partially remote. Because unspecified postings dominate, the chart cannot tell how many of them are actually on-site; it only shows that explicit remote and hybrid offers are a minority of all postings.
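
The percentages in the pie chart can be reproduced as a table, which also makes it explicit how unspecified postings are handled. The sketch below uses toy values. It assumes that missing entries and a literal `"[None]"` label both mean "unspecified", which should be verified against the real `REMOTE_TYPE_NAME` column.

```python
import pandas as pd

# Toy remote-type values; the labels are assumptions for illustration
remote = pd.Series([
    "Remote", "[None]", None, "Hybrid Remote", "[None]",
    "Remote", "Not Remote", None, "[None]", "[None]",
])

# Fold missing values and the assumed "[None]" label into one category
labels = remote.fillna("Unspecified").replace({"[None]": "Unspecified"})

# Share of postings per category, in percent
shares = labels.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Note that `value_counts()` drops missing values by default, so without the `fillna` step any postings with a truly empty remote type would silently disappear from the denominator.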


# Skill Demand Analysis by Industry (Stacked Bar Chart)


In [None]:
#| label: skill-demand-by-industry
#| echo: true
#| warning: false
#| message: false

import pandas as pd
import numpy as np
import plotly.express as px
import ast

# --- 1) Robustly parse SKILLS_NAME into lists, then explode ---
def to_list(value):
    """Convert SKILLS_NAME cells to a list of clean strings."""
    if pd.isna(value) or value == "":
        return []
    if isinstance(value, list):
        return [str(v).strip() for v in value]
    s = str(value).strip()

    # If it looks like a Python list literal, parse safely
    if s.startswith("[") and s.endswith("]"):
        try:
            parsed = ast.literal_eval(s)
            return [str(v).strip() for v in parsed if str(v).strip()]
        except Exception:
            pass

    # Fall back to common delimiters
    for sep in ["|", ";", " / ", "/", ","]:
        if sep in s:
            return [t.strip() for t in s.split(sep) if t.strip()]

    # Otherwise treat the whole thing as one skill
    return [s]

required = {"NAICS2_NAME", "SKILLS_NAME"}
if required.issubset(df.columns):
    skills = df.loc[:, ["NAICS2_NAME", "SKILLS_NAME"]].copy()
    skills["SKILLS_NAME"] = skills["SKILLS_NAME"].apply(to_list)
    skills = skills.explode("SKILLS_NAME", ignore_index=True)
    skills = skills[skills["SKILLS_NAME"].notna() & (skills["SKILLS_NAME"] != "")]
    skills.rename(columns={"NAICS2_NAME": "Industry", "SKILLS_NAME": "Skill"}, inplace=True)

    # --- 2) Limit to top skills & top industries (keeps chart readable) ---
    TOP_SKILLS = 8
    TOP_INDS   = 10
    top_skills = skills["Skill"].value_counts().head(TOP_SKILLS).index
    top_inds   = skills["Industry"].value_counts().head(TOP_INDS).index
    skills_top = skills[skills["Skill"].isin(top_skills) & skills["Industry"].isin(top_inds)]

    agg = (skills_top
           .groupby(["Industry", "Skill"])
           .size()
           .reset_index(name="Count"))

    # --- 3) Horizontal stacked bar (better for long labels) ---
    fig = px.bar(
        agg,
        x="Count",
        y="Industry",
        color="Skill",
        orientation="h",
        barmode="stack",
        title="Skill Demand by Industry (Top Skills & Industries)",
        color_discrete_sequence=px.colors.qualitative.Vivid
    )
    fig.update_layout(
        height=640,
        font=dict(family="Inter", size=13),
        title_font=dict(size=20, family="Inter", color="#1f6feb"),
        xaxis_title="Skill Count",
        yaxis_title="Industry",
        yaxis=dict(categoryorder="total ascending"),
        plot_bgcolor="rgba(0,0,0,0)",
        paper_bgcolor="rgba(0,0,0,0)",
        legend_title_text="Skill",
        margin=dict(l=10, r=10, t=60, b=10)
    )
    fig.show()
else:
    missing = required - set(df.columns)
    print(f"❌ Missing required columns: {sorted(missing)}")


## ✏️ Explanation

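The chart shows that the Professional, Scientific, and Technical Services industry accounts for the most skill mentions, especially communication, data analysis, and leadership. Because the bars show raw counts, this lead may partly reflect a larger number of postings in that industry rather than more skills required per posting.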
Across most industries, soft skills like communication and problem-solving appear as consistently essential, highlighting their broad value alongside technical expertise such as SQL and computer science.
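
To compare the skill mix between industries independently of their size, the counts can be converted to within-industry shares. The following sketch uses a toy table of (industry, skill) pairs, shaped like the exploded `skills` frame above; the industry and skill values are made up.

```python
import pandas as pd

# Toy exploded table: one row per (posting, skill) mention
pairs = pd.DataFrame({
    "Industry": ["Tech", "Tech", "Tech", "Tech", "Retail", "Retail"],
    "Skill": ["SQL", "SQL", "Communication", "SQL", "Communication", "SQL"],
})

# Raw counts: what the stacked bar chart displays
counts = pd.crosstab(pairs["Industry"], pairs["Skill"])

# Row-normalized shares: each industry's skills sum to 1
shares = pd.crosstab(pairs["Industry"], pairs["Skill"], normalize="index")

print(counts)
print(shares.round(2))
```

With shares, a small industry that relies heavily on one skill stands out even though its bar would be short in the count-based chart.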




# Salary Analysis by ONET Occupation Type (Bubble Chart)

In [None]:
# --- Salary Analysis by Occupation (Bubble Chart) - robust ---

import pandas as pd
import numpy as np
import plotly.express as px
from pathlib import Path

# Load once if df isn't already present
if "df" not in globals():
    df = pd.read_csv(Path("data/lightcast_job_postings.csv"), low_memory=False)

work = df.copy()

# ---------- Salary normalization (annualize) ----------
# Make numeric
for c in ["SALARY_FROM", "SALARY_TO", "SALARY"]:
    if c in work.columns:
        work[c] = pd.to_numeric(work[c], errors="coerce")

# Helper: annualize by period
def annualize(row):
    # prefer bounds if present, else SALARY
    base = np.nan
    if pd.notna(row.get("SALARY_FROM")) and pd.notna(row.get("SALARY_TO")):
        base = 0.5 * (row["SALARY_FROM"] + row["SALARY_TO"])
    elif pd.notna(row.get("SALARY_FROM")):
        base = row["SALARY_FROM"]
    elif pd.notna(row.get("SALARY_TO")):
        base = row["SALARY_TO"]
    elif pd.notna(row.get("SALARY")):
        base = row["SALARY"]

    if pd.isna(base):
        return np.nan

    period = str(row.get("ORIGINAL_PAY_PERIOD", "")).strip().upper()
    if period == "HOUR":
        return base * 2080            # 40h * 52w
    elif period == "DAY":
        return base * 260             # 5d * 52w
    elif period == "WEEK":
        return base * 52
    elif period == "MONTH":
        return base * 12
    # assume already annual
    return base

# Compute annual salary
work["SALARY_ANNUAL"] = work.apply(annualize, axis=1)

# Keep a wide annual band to avoid clipping true values
work = work[work["SALARY_ANNUAL"].between(15000, 500000, inclusive="both")]

# ---------- Pick a grouping column that has variety among salary rows ----------
candidates = [
    "SOC_2021_4_NAME", "SOC_2021_3_NAME", "SOC_2021_2_NAME",
    "ONET_2019_NAME", "ONET_NAME",
    "TITLE_NAME"  # final fallback if none of the above has variety
]

def pick_group_col(frame, cols, min_unique=6):
    avail = []
    for c in cols:
        if c in frame.columns:
            n = frame[c].dropna().nunique()
            avail.append((c, n))
    # sort by original order but filter by variety
    for c, n in avail:
        if n >= min_unique:
            return c, avail
    # otherwise take the one with max variety among the available
    if avail:
        return max(avail, key=lambda x: x[1])[0], avail
    return None, []

group_col, availability = pick_group_col(work.dropna(subset=["SALARY_ANNUAL"]), candidates)

print("Grouping candidates (non-null uniques among rows with salary):")
for c, n in availability:
    print(f"  - {c}: {n}")

if not group_col:
    print("❌ No suitable occupation/title column found.")
else:
    print(f"✅ Using group column: {group_col}")

    # ---------- Aggregate ----------
    salary_df = (
        work.dropna(subset=[group_col, "SALARY_ANNUAL"])
            .groupby(group_col, as_index=False)
            .agg(
                Median_Salary=("SALARY_ANNUAL", "median"),
                Job_Postings=(group_col, "size")
            )
            .sort_values("Job_Postings", ascending=False)
    )

    # If there are still tons of unique groups, keep the busiest N
    top_n = 20
    if len(salary_df) > top_n:
        salary_df = salary_df.head(top_n)

    # If everything collapsed to 1, show a short sample so you can sanity check
    if salary_df[group_col].nunique() <= 1:
        print("ℹ️ Still only one group after cleaning. Here are the first 10 rows with salary:")
        # Show only columns that exist, and avoid listing group_col twice
        wanted = [group_col, "TITLE_NAME", "SALARY_ANNUAL", "ORIGINAL_PAY_PERIOD"]
        sample_cols = [c for c in dict.fromkeys(wanted) if c in work.columns]
        print(work.loc[work["SALARY_ANNUAL"].notna(), sample_cols].head(10))

    # ---------- Plot ----------
    fig = px.scatter(
        salary_df,
        x=group_col,
        y="Median_Salary",
        size="Job_Postings",
        color="Median_Salary",
        size_max=42,
        color_continuous_scale="Viridis",
        title="Salary Analysis by Occupation",
        hover_data={group_col: True, "Median_Salary": ":,.0f", "Job_Postings": True}
    )

    fig.update_layout(
        xaxis_title="Occupation / Title",
        yaxis_title="Median Salary ($, annualized)",
        plot_bgcolor="rgba(0,0,0,0)",
        paper_bgcolor="rgba(0,0,0,0)",
        font=dict(family="Inter, 'Fira Sans', Arial", size=13),
        height=620
    )
    fig.update_xaxes(tickangle=35, tickfont=dict(size=11))

    fig.show()


## ✏️ Explanation

This chart shows how median annualized salaries vary across the most common job titles in the dataset.
Larger bubbles represent job titles with a higher number of postings, while color indicates the relative salary level.
Technical and leadership roles such as “Data Architects,” “Principal Data Engineers,” and “SAP Data Analytics Managers” tend to offer the highest median salaries (often exceeding $160K). More common positions like “Data Analyst” or “Business Intelligence Analyst” have lower median pay but significantly higher posting volume, reflecting strong demand for analytical talent across industries.
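
The annualization step behind this chart runs row by row, which can be slow on a large table. A vectorized variant is sketched below on toy rows. It uses the same multipliers as the `annualize` helper above, but it is a simplified assumption-laden version: it ignores the `SALARY` fallback column, and the mapping name `PERIOD_MULTIPLIER` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same multipliers as the row-wise helper: 40h * 52w, 5d * 52w, 52 weeks, 12 months
PERIOD_MULTIPLIER = {"HOUR": 2080, "DAY": 260, "WEEK": 52, "MONTH": 12, "YEAR": 1}

# Toy rows (values are made up)
toy = pd.DataFrame({
    "SALARY_FROM": [25.0, 4000.0, 90000.0, np.nan],
    "SALARY_TO": [35.0, np.nan, 110000.0, np.nan],
    "ORIGINAL_PAY_PERIOD": ["hour", "Month", None, "year"],
})

# Row mean skips NaN: midpoint if both bounds exist, the single bound otherwise,
# and NaN when both are missing
base = toy[["SALARY_FROM", "SALARY_TO"]].mean(axis=1)

# Normalize the period text; unknown or missing periods are treated as already annual
period = toy["ORIGINAL_PAY_PERIOD"].fillna("").str.strip().str.upper()
factor = period.map(PERIOD_MULTIPLIER).fillna(1)

toy["SALARY_ANNUAL"] = base * factor
print(toy)
```

Treating unknown periods as annual mirrors the fallback in the original helper. If the source salaries are already annualized upstream, the multipliers would overstate pay, so it is worth checking a few known postings by hand.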


# Career Pathway Trends (Sankey Diagram)

In [None]:
# --- Career Pathway Trends (Sankey): TITLE_NAME → ONET_NAME ---

import pandas as pd
import numpy as np
import plotly.graph_objects as go

REQ = {"TITLE_NAME", "ONET_NAME"}
if not REQ.issubset(df.columns):
    print(f"❌ Missing columns for Sankey: {sorted(REQ - set(df.columns))}")
else:
    # Params you can tweak
    TOP_TITLES = 20      # keep top-N job titles by postings
    TOP_ONETS  = 20      # keep top-N ONET occupations by postings
    MIN_FLOW   = 10      # drop very tiny flows to declutter

    # Clean & filter
    sankey_df = (
        df[["TITLE_NAME", "ONET_NAME"]]
        .dropna()
        .assign(
            TITLE_NAME=lambda d: d["TITLE_NAME"].str.strip(),
            ONET_NAME=lambda d: d["ONET_NAME"].str.strip(),
        )
        .query("TITLE_NAME != '' and ONET_NAME != ''")
    )

    # Keep only top titles / onets to make the diagram readable
    top_titles = sankey_df["TITLE_NAME"].value_counts().nlargest(TOP_TITLES).index
    top_onets  = sankey_df["ONET_NAME"].value_counts().nlargest(TOP_ONETS).index
    sankey_df  = sankey_df[sankey_df["TITLE_NAME"].isin(top_titles) & sankey_df["ONET_NAME"].isin(top_onets)]

    flows = (
        sankey_df
        .value_counts(["TITLE_NAME", "ONET_NAME"])
        .rename("count")
        .reset_index()
    )
    flows = flows[flows["count"] >= MIN_FLOW].sort_values("count", ascending=False)

    # Diagnostics
    diag = {
        "Rows considered": int(len(sankey_df)),
        "Unique Titles": int(sankey_df["TITLE_NAME"].nunique()),
        "Unique ONETs": int(sankey_df["ONET_NAME"].nunique()),
        "Flows kept": int(len(flows)),
        "Min flow shown": int(MIN_FLOW),
    }
    print("ℹ️ Sankey diagnostics:", diag)

    if flows.empty:
        print("⚠️ No flows after filtering. Try lowering MIN_FLOW or increasing TOP_TITLES/TOP_ONETS.")
    else:
        # Build node list
        src_labels = flows["TITLE_NAME"].tolist()
        tgt_labels = flows["ONET_NAME"].tolist()
        all_labels = pd.Index(src_labels + tgt_labels).unique().tolist()

        # indices for plotly
        label_to_idx = {lab: i for i, lab in enumerate(all_labels)}
        sources = flows["TITLE_NAME"].map(label_to_idx).tolist()
        targets = flows["ONET_NAME"].map(label_to_idx).tolist()
        values  = flows["count"].tolist()

        # Optional: color nodes by side (titles vs onets)
        node_colors = []
        for lab in all_labels:
            if lab in top_titles:
                node_colors.append("rgba(56, 182, 255, 0.8)")  # blue-ish for titles
            else:
                node_colors.append("rgba(72, 201, 176, 0.8)")  # teal-ish for ONET

        link_color = "rgba(200, 200, 200, 0.35)"

        fig = go.Figure(data=[go.Sankey(
            arrangement="snap",
            node=dict(
                pad=16,
                thickness=18,
                line=dict(width=0.5, color="rgba(0,0,0,0.25)"),
                label=all_labels,
                color=node_colors
            ),
            link=dict(
                source=sources,
                target=targets,
                value=values,
                color=link_color
            )
        )])

        fig.update_layout(
            title="Career Pathway Trends (Job Title → ONET Occupation)",
            font=dict(size=12),
            height=650,
            margin=dict(l=10, r=10, t=50, b=10),
            paper_bgcolor="rgba(0,0,0,0)",
            plot_bgcolor="rgba(0,0,0,0)"
        )
        fig.show()


## ✏️ Explanation
The Sankey diagram shows how various job titles, such as Data Analyst, Data Modeler, and ERP Business Analyst, flow into a single ONET occupation category, Business Intelligence Analysts.
This highlights that many data-related positions ultimately map to the same occupational classification, reflecting the overlap and convergence of skills within business intelligence and analytics roles.
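
One detail to watch when building the node list for such a diagram: if a job title and an ONET occupation happen to share the exact same text, a single shared label list would merge them into one node and create a self-loop. The sketch below keeps the two sides in separate index ranges. The `flows` table is toy data, and the ONET name "Data Analyst" is a contrived collision for illustration.

```python
import pandas as pd

# Toy flows; the last row deliberately reuses a title string as an ONET name
flows = pd.DataFrame({
    "TITLE_NAME": ["Data Analyst", "Data Modeler", "Data Analyst"],
    "ONET_NAME": ["Business Intelligence Analysts",
                  "Business Intelligence Analysts",
                  "Data Analyst"],
    "count": [120, 30, 15],
})

# Separate node lists per side, preserving first-seen order
titles = flows["TITLE_NAME"].drop_duplicates().tolist()
onets = flows["ONET_NAME"].drop_duplicates().tolist()

# Titles occupy indices 0..len(titles)-1; ONET nodes follow after them
labels = titles + onets
title_idx = {t: i for i, t in enumerate(titles)}
onet_idx = {o: len(titles) + j for j, o in enumerate(onets)}

sources = flows["TITLE_NAME"].map(title_idx).tolist()
targets = flows["ONET_NAME"].map(onet_idx).tolist()
values = flows["count"].tolist()

print(labels)
print(sources, targets, values)
```

The resulting `labels`, `sources`, `targets`, and `values` lists have the shape expected by the `node` and `link` arguments of `go.Sankey`, with every link running from the title side to the ONET side.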