# 02 - Analysis & Visualizations

**SC/MATH 1130 A - Introduction to Data Science**

**Project**: The Impact of AI on Canadian Wages

---

## Analysis Sections

- **Section 2.1**: Historical Wage Growth (2012-2024)
- **Section 2.2**: Entry-Level Job Market Impact
- **Section 2.3**: Occupational Vulnerability Assessment

---


## Setup & Imports


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

warnings.filterwarnings("ignore")

# Setup paths
BASE_DIR = Path("..")
MERGED_DATA = BASE_DIR / "data" / "merged"
OUTPUTS_FIGURES = BASE_DIR / "outputs" / "figures"
OUTPUTS_TABLES = BASE_DIR / "outputs" / "tables"

# Create output directories
OUTPUTS_FIGURES.mkdir(parents=True, exist_ok=True)
OUTPUTS_TABLES.mkdir(parents=True, exist_ok=True)

# Set plotting style
plt.style.use("seaborn-v0_8-darkgrid")
sns.set_palette("husl")

print("✅ Setup complete!")
print(f"Output figures: {OUTPUTS_FIGURES}")
print(f"Output tables: {OUTPUTS_TABLES}")

## Load Master Dataset


In [None]:
# Load merged dataset from cleaning notebook
df = pd.read_csv(MERGED_DATA / "master_dataset.csv")

print(f"Dataset shape: {df.shape}")
print(f"Years: {df['Reference_Period'].min()} - {df['Reference_Period'].max()}")
print(f"\nColumns: {df.columns.tolist()}")
df.head()

In [None]:
# Quick data quality check
print("Missing values per column:")
print(df.isnull().sum())

print("\nData types:")
print(df.dtypes)

---
# Section 2.1: Historical Wage Growth Analysis

## Research Question:
Have AI-adopting industries experienced different wage growth patterns (2012-2024) compared to low-AI sectors, and does this create widening income inequality?

## Tasks:
1. Calculate wage growth rates by industry and year
2. Compare AI-adopting vs non-AI industries
3. Visualize trends over time
4. Compute inequality metrics (wage spread, Gini coefficient, etc.)
5. Generate summary statistics
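
Task 4 names the Gini coefficient as one inequality metric. As a minimal sketch in plain Python, it could be computed as below; `gini` is a hypothetical helper (not part of the project code) and the wage lists are made-up illustrative values, not project data.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Illustrative values only (assumption), not taken from the dataset
equal_wages = [25.0, 25.0, 25.0, 25.0]
unequal_wages = [15.0, 20.0, 30.0, 60.0]

gini_equal = gini(equal_wages)
gini_unequal = gini(unequal_wages)

# Wage spread as the simple max-min range of the same values
wage_spread = max(unequal_wages) - min(unequal_wages)

print(f"Gini (equal wages):   {gini_equal:.2f}")
print(f"Gini (unequal wages): {gini_unequal:.2f}")
print(f"Wage spread: {wage_spread:.2f}")
```

In the notebook, the same helper could be applied to a wage column per year or per industry group, for example through `groupby(...).apply(gini)`.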

---


In [None]:
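# Define pre-AI and post-AI periods
PRE_AI_YEARS = range(2012, 2020)  # 2012-2019: before COVID & LLMs
POST_AI_YEARS = range(2020, 2025)  # 2020-2024: after COVID & LLMs


def label_period(year):
    """Label a reference year; years outside both ranges stay unlabeled."""
    if year in PRE_AI_YEARS:
        return "Pre-AI (2012-2019)"
    if year in POST_AI_YEARS:
        return "Post-AI (2020-2024)"
    return pd.NA


df["Period"] = df["Reference_Period"].apply(label_period)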

print("Period distribution:")
print(df["Period"].value_counts())

In [None]:
# TODO: Calculate wage growth rates by industry
# Hint: Group by industry and year, calculate percent change

# YOUR CODE HERE
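
One possible approach to the hint above, shown on toy data: take one median wage per industry and year, then apply `pct_change` within each industry. The column names `Industry` and `Median_Wage_Hourly` are assumptions here (only `Reference_Period` is known from the dataset), and the numbers are invented for illustration.

```python
import pandas as pd

# Toy data; "Industry" and "Median_Wage_Hourly" are assumed column names
toy = pd.DataFrame({
    "Industry": ["Tech", "Tech", "Tech", "Retail", "Retail", "Retail"],
    "Reference_Period": [2012, 2013, 2014, 2012, 2013, 2014],
    "Median_Wage_Hourly": [40.0, 42.0, 44.1, 20.0, 20.5, 21.0],
})

# One median wage per industry-year, sorted so that pct_change
# compares consecutive years within the same industry
yearly = (
    toy.groupby(["Industry", "Reference_Period"], as_index=False)["Median_Wage_Hourly"]
    .median()
    .sort_values(["Industry", "Reference_Period"])
)

# Year-over-year growth in percent; the first year of each industry is NaN
yearly["Wage_Growth_Pct"] = (
    yearly.groupby("Industry")["Median_Wage_Hourly"].pct_change() * 100
)

print(yearly)
```

Grouping before calling `pct_change` matters: without it, the first year of one industry would be compared against the last year of the previous one.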


In [None]:
# TODO: Create line plot - Median wage over time by AI exposure
# Compare high-AI industries vs low-AI industries

fig, ax = plt.subplots(figsize=(12, 6))

# YOUR CODE HERE

plt.title("Median Wage Trends: AI-Adopting vs Non-AI Industries (2012-2024)", fontsize=14, fontweight="bold")
plt.xlabel("Year", fontsize=12)
plt.ylabel("Median Hourly Wage ($)", fontsize=12)
plt.legend()
plt.tight_layout()
plt.savefig(OUTPUTS_FIGURES / "subtopic1_wage_trends.png", dpi=300)
plt.show()

In [None]:
# TODO: Create box plot - Wage distribution by AI exposure
# Show inequality within and between industries

fig, ax = plt.subplots(figsize=(10, 6))

# YOUR CODE HERE

plt.title("Wage Distribution: AI-Adopting vs Non-AI Industries", fontsize=14, fontweight="bold")
plt.ylabel("Hourly Wage ($)", fontsize=12)
plt.tight_layout()
plt.savefig(OUTPUTS_FIGURES / "subtopic1_wage_distribution.png", dpi=300)
plt.show()

In [None]:
# TODO: Calculate summary statistics
# Mean wage growth, median, std dev by industry category

# YOUR CODE HERE

# Save summary table
# summary_stats.to_csv(OUTPUTS_TABLES / 'subtopic1_summary_stats.csv', index=False)
# print("✅ Saved: subtopic1_summary_stats.csv")
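
A sketch of how the `summary_stats` table referenced in the commented-out save line might be built, assuming a growth table already exists. `AI_Category` and `Wage_Growth_Pct` are hypothetical column names and the values are toy numbers.

```python
import pandas as pd

# Toy growth table; column names and values are assumptions for illustration
growth = pd.DataFrame({
    "AI_Category": ["High-AI", "High-AI", "High-AI", "Low-AI", "Low-AI", "Low-AI"],
    "Wage_Growth_Pct": [4.0, 5.0, 6.0, 1.0, 2.0, 3.0],
})

# Mean, median, sample standard deviation, and row count per category
summary_stats = (
    growth.groupby("AI_Category")["Wage_Growth_Pct"]
    .agg(["mean", "median", "std", "count"])
    .reset_index()
)

print(summary_stats)
```

Keeping `count` next to the other statistics makes it easier to spot categories whose averages rest on very few rows.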

### Key Findings (Section 2.1)

_Write your observations here after completing the analysis:_

1.
2.
3.


---
# Section 2.2: Entry-Level Job Market Impact

## Research Question:
Has AI adoption reduced entry-level job opportunities and compressed starting wages in high-exposure occupations?

## Tasks:
1. Filter for entry-level positions (`is_entry_level == 1`)
2. Compare pre-AI vs post-AI starting wages
3. Analyze Low/Median wage ratios
4. Visualize entry-level wage compression
5. Identify most affected occupations

---


In [None]:
# Filter for entry-level positions
df_entry = df[df["is_entry_level"] == 1].copy()

print(f"Total entry-level records: {len(df_entry):,}")
print(f"Unique occupations: {df_entry['NOC_Title_Standardized'].nunique()}")
print("\nEntry-level jobs by period:")
print(df_entry["Period"].value_counts())

In [None]:
# TODO: Compare entry-level wages pre-AI vs post-AI
# Calculate average starting wage (Low_Wage_Hourly) for each period

# YOUR CODE HERE
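
The comparison described above could look roughly like this on toy data. The `Period` labels and `Low_Wage_Hourly` come from the notebook; the wage values are invented for illustration.

```python
import pandas as pd

# Toy entry-level records (values are assumptions, not project data)
entry_toy = pd.DataFrame({
    "Period": ["Pre-AI (2012-2019)"] * 3 + ["Post-AI (2020-2024)"] * 3,
    "Low_Wage_Hourly": [14.0, 15.0, 16.0, 16.0, 17.0, 18.0],
})

# Average starting wage per period
period_means = entry_toy.groupby("Period")["Low_Wage_Hourly"].mean()

pre_mean = period_means["Pre-AI (2012-2019)"]
post_mean = period_means["Post-AI (2020-2024)"]

# Relative change from the pre-AI to the post-AI average, in percent
change_pct = (post_mean - pre_mean) / pre_mean * 100

print(period_means)
print(f"Change in average starting wage: {change_pct:.1f}%")
```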


In [None]:
# TODO: Create histogram - Entry-level wage distribution
# Compare 2012-2019 vs 2020-2024

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# YOUR CODE HERE

plt.suptitle("Entry-Level Wage Distribution: Pre-AI vs Post-AI", fontsize=14, fontweight="bold", y=1.02)
plt.tight_layout()
plt.savefig(OUTPUTS_FIGURES / "subtopic2_entry_wage_distribution.png", dpi=300)
plt.show()

In [None]:
# TODO: Analyze Low/Median wage ratio by AI exposure
# Lower ratio = entry-level (low) wages sit further below the median, i.e. compressed starting wages

# YOUR CODE HERE
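
A minimal sketch of the ratio analysis, on toy data. `Low_Wage_Hourly` appears in the notebook; `Median_Wage_Hourly` and `AI_Exposure_Group` are assumed column names that may differ in the master dataset.

```python
import pandas as pd

# Toy rows; "Median_Wage_Hourly" and "AI_Exposure_Group" are assumed names
ratio_toy = pd.DataFrame({
    "AI_Exposure_Group": ["High", "High", "Low", "Low"],
    "Low_Wage_Hourly": [18.0, 20.0, 15.0, 16.0],
    "Median_Wage_Hourly": [36.0, 40.0, 20.0, 20.0],
})

# Ratio of the low (entry-level) wage to the median wage, per row
ratio_toy["Low_Median_Ratio"] = (
    ratio_toy["Low_Wage_Hourly"] / ratio_toy["Median_Wage_Hourly"]
)

# Average ratio within each exposure group
ratio_by_group = ratio_toy.groupby("AI_Exposure_Group")["Low_Median_Ratio"].mean()

print(ratio_by_group)
```

Adding `Period` to the `groupby` keys would show whether the ratio shifted between the two periods within each exposure group.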


In [None]:
# TODO: Identify top 20 occupations with largest entry-level wage decline

# YOUR CODE HERE

# Save to CSV
# top_declining.to_csv(OUTPUTS_TABLES / 'subtopic2_declining_occupations.csv', index=False)
# print("✅ Saved: subtopic2_declining_occupations.csv")
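
One way the `top_declining` table might be produced, sketched on toy data: pivot the average starting wage by period, compute the percent change, and keep the smallest values. Occupation names and wages here are invented for illustration.

```python
import pandas as pd

PRE = "Pre-AI (2012-2019)"
POST = "Post-AI (2020-2024)"

# Toy entry-level records (assumed values, not project data)
occ_toy = pd.DataFrame({
    "NOC_Title_Standardized": ["Clerk", "Clerk", "Analyst", "Analyst", "Cashier", "Cashier"],
    "Period": [PRE, POST] * 3,
    "Low_Wage_Hourly": [20.0, 18.0, 30.0, 33.0, 15.0, 14.7],
})

# One row per occupation, one column per period
by_period = occ_toy.pivot_table(
    index="NOC_Title_Standardized",
    columns="Period",
    values="Low_Wage_Hourly",
    aggfunc="mean",
)

# Percent change in average starting wage; negative = decline
by_period["Change_Pct"] = (by_period[POST] - by_period[PRE]) / by_period[PRE] * 100

# The 20 smallest changes (all rows if there are fewer than 20)
top_declining = by_period.nsmallest(20, "Change_Pct").reset_index()

print(top_declining)
```

Note that `nsmallest` also returns occupations whose wages rose if fewer than 20 declined; filtering on `Change_Pct < 0` first would restrict the table to actual declines.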

### Key Findings (Section 2.2)

_Write your observations here:_

1.
2.
3.


---
# Section 2.3: Occupational Vulnerability Assessment

## Research Question:
Which occupational groups (NOC) and geographic regions may benefit from AI adoption, and which are likely to face negative impacts?

## Tasks:
1. Create vulnerability index for each occupation
2. Identify top 20 vulnerable and top 20 safe occupations
3. Analyze vulnerability by province/region
4. Visualize heatmaps and rankings
5. Generate policy recommendations

---


In [None]:
# TODO: Create vulnerability index
# Formula: AI_Exposure_Score × Wage_Decline × Entry_Level_Risk
# Higher score = more vulnerable

# Step 1: Normalize AI exposure (0-1 scale)
# Step 2: Calculate wage decline rate
# Step 3: Weight by entry-level proportion
# Step 4: Combine into single score

# YOUR CODE HERE
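
The four steps above can be sketched as follows. This is one possible reading of the formula, not a definitive implementation: the column names `AI_Exposure_Score`, `Wage_Decline_Rate`, and `Entry_Level_Share` are assumptions, the values are toy numbers, and clipping wage increases to zero is a design choice made only for this example.

```python
import pandas as pd

# Toy per-occupation table; all columns except the title are assumed names
vuln_toy = pd.DataFrame({
    "NOC_Title_Standardized": ["Clerk", "Analyst", "Cashier"],
    "AI_Exposure_Score": [80.0, 60.0, 20.0],   # raw exposure score
    "Wage_Decline_Rate": [0.10, -0.05, 0.02],  # positive = wages fell
    "Entry_Level_Share": [0.5, 0.2, 0.8],      # proportion of entry-level records
})

# Step 1: min-max normalize AI exposure to the 0-1 range
score = vuln_toy["AI_Exposure_Score"]
vuln_toy["AI_Exposure_Norm"] = (score - score.min()) / (score.max() - score.min())

# Step 2: keep only declines, so rising wages do not produce negative scores
decline = vuln_toy["Wage_Decline_Rate"].clip(lower=0)

# Steps 3-4: weight by entry-level share and combine into a single score
vuln_toy["Vulnerability_Index"] = (
    vuln_toy["AI_Exposure_Norm"] * decline * vuln_toy["Entry_Level_Share"]
)

# Higher score = more vulnerable
ranked = vuln_toy.sort_values("Vulnerability_Index", ascending=False)

print(ranked[["NOC_Title_Standardized", "Vulnerability_Index"]])
```

Two properties of this sketch are worth keeping in mind: a pure product is zero whenever any single factor is zero, and min-max normalization divides by zero if every exposure score is identical.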


In [None]:
# TODO: Rank occupations by vulnerability
# Top 20 most vulnerable and top 20 safest

# YOUR CODE HERE


In [None]:
# TODO: Create bar chart - Top 20 vulnerable occupations

fig, ax = plt.subplots(figsize=(12, 8))

# YOUR CODE HERE

plt.title("Top 20 Most Vulnerable Occupations to AI Displacement", fontsize=14, fontweight="bold")
plt.xlabel("Vulnerability Index", fontsize=12)
plt.ylabel("Occupation", fontsize=12)
plt.tight_layout()
plt.savefig(OUTPUTS_FIGURES / "subtopic3_vulnerable_occupations.png", dpi=300)
plt.show()

In [None]:
# TODO: Create heatmap - Vulnerability by province and NOC category

fig, ax = plt.subplots(figsize=(14, 10))

# YOUR CODE HERE

plt.title("AI Vulnerability Heatmap: Province × NOC Category", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.savefig(OUTPUTS_FIGURES / "subtopic3_vulnerability_heatmap.png", dpi=300)
plt.show()
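
A heatmap needs its data as a province-by-category matrix. As a hedged sketch, `pivot_table` could reshape a long table into that form; `Province`, `NOC_Category`, and `Vulnerability_Index` are assumed column names and the values are toy numbers.

```python
import pandas as pd

# Toy long-format table; all column names are assumptions
heat_toy = pd.DataFrame({
    "Province": ["ON", "ON", "BC", "BC"],
    "NOC_Category": ["Sales", "Business", "Sales", "Business"],
    "Vulnerability_Index": [0.4, 0.6, 0.2, 0.8],
})

# Rows = provinces, columns = NOC categories, cells = mean vulnerability
heat_matrix = heat_toy.pivot_table(
    index="Province",
    columns="NOC_Category",
    values="Vulnerability_Index",
    aggfunc="mean",
)

print(heat_matrix)
```

The resulting matrix can be passed to `sns.heatmap(heat_matrix, annot=True, ax=ax)` in the plotting cell.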

In [None]:
# TODO: Generate policy recommendations table
# Top 10 NOC codes requiring immediate intervention

# YOUR CODE HERE

# Save to CSV
# intervention_list.to_csv(OUTPUTS_TABLES / 'subtopic3_intervention_priorities.csv', index=False)
# print("✅ Saved: subtopic3_intervention_priorities.csv")

### Key Findings (Section 2.3)

_Write your observations here:_

1.
2.
3.


---

## Summary & Next Steps

### Completed:

- ✅ Section 2.1: Historical wage growth analysis
- ✅ Section 2.2: Entry-level job market impact
- ✅ Section 2.3: Occupational vulnerability assessment

### Outputs Generated:

- Figures saved to: `outputs/figures/`
- Tables saved to: `outputs/tables/`

### Next Steps:

1. Review all visualizations and tables
2. Write final report (≤3 pages)
3. Prepare video presentation (5-6 minutes)
4. Submit deliverables
