# Income vs Safety Scatter Plot

Explores the correlation between median household income and safety score across Toronto neighbourhoods.

## 1. Data Reference

### Source Tables

| Table | Grain | Key Columns |
|-------|-------|-------------|
| `mart_neighbourhood_overview` | neighbourhood × year | neighbourhood_name, median_household_income, safety_score, population |

### SQL Query

In [1]:
import os

import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

# Load .env from project root
load_dotenv("../../.env")

engine = create_engine(os.environ["DATABASE_URL"])

query = """
SELECT
    neighbourhood_name,
    median_household_income,
    safety_score,
    population,
    livability_score,
    crime_rate_per_100k
FROM mart_toronto.mart_neighbourhood_overview
WHERE year = (SELECT MAX(year) FROM mart_toronto.mart_neighbourhood_overview)
  AND median_household_income IS NOT NULL
  AND safety_score IS NOT NULL
ORDER BY median_household_income DESC
"""

df = pd.read_sql(query, engine)
print(f"Loaded {len(df)} neighbourhoods with income and safety data")

Loaded 158 neighbourhoods with income and safety data


### Transformation Steps

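1. Filter out rows with a null income or safety score (handled by the `WHERE` clause in the SQL query)
2. Scale income to thousands of dollars for axis readability
3. Convert the rows to a list of records and pass them to the scatter figure factory with a trendline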

In [2]:
# Scale income to thousands for better axis readability
df["income_thousands"] = df["median_household_income"] / 1000

# Prepare data for figure factory
data = df.to_dict("records")

### Sample Output

In [3]:
df[
    [
        "neighbourhood_name",
        "median_household_income",
        "safety_score",
        "crime_rate_per_100k",
    ]
].head(10)

| | neighbourhood_name | median_household_income | safety_score | crime_rate_per_100k |
|---|--------------------|-------------------------|--------------|---------------------|
| 0 | Bridle Path-Sunnybrook-York Mills | 222000.0 | 79.0 | 153.227143 |
| 1 | Kingsway South | 184000.0 | 61.1 | 174.266667 |
| 2 | Lawrence Park North | 168000.0 | 91.1 | 138.771667 |
| 3 | Lawrence Park South | 162000.0 | 92.4 | 138.368333 |
| 4 | Princess-Rosethorn | 162000.0 | 53.5 | 184.873333 |
| 5 | Leaside-Bennington | 148000.0 | 69.4 | 164.13 |
| 6 | Runnymede-Bloor West Village | 138000.0 | 35.0 | 216.393333 |
| 7 | Bedford Park-Nortown | 135000.0 | 74.5 | 159.358333 |
| 8 | Centennial Scarborough | 134000.0 | 98.7 | 114.715714 |
| 9 | Rosedale-Moore Park | 122000.0 | 8.3 | 344.961667 |


## 2. Data Visualization

### Figure Factory

Uses `create_scatter_figure` from `portfolio_app.figures.toronto.scatter`.

**Key Parameters:**
- `x_column`: 'income_thousands' (median household income in $K)
- `y_column`: 'safety_score' (0-100 percentile rank)
- `name_column`: 'neighbourhood_name' (hover label)
- `size_column`: 'population' (optional, bubble size)
- `trendline`: True (adds OLS regression line)

In [4]:
import sys

sys.path.insert(0, "../..")

from portfolio_app.figures.toronto.scatter import create_scatter_figure

fig = create_scatter_figure(
    data=data,
    x_column="income_thousands",
    y_column="safety_score",
    name_column="neighbourhood_name",
    size_column="population",
    title="Income vs Safety by Neighbourhood",
    x_title="Median Household Income ($K)",
    y_title="Safety Score (0-100)",
    trendline=True,
)

fig.show()

### Interpretation

This scatter plot reveals the relationship between income and safety:

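- **Positive correlation**: Higher-income neighbourhoods tend to have higher safety scores, but the relationship is weak to moderate (r ≈ 0.315), so income alone explains only a small share of the variation in safety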
- **Bubble size**: Represents population (larger = more people)
- **Trendline**: Orange dashed line shows the overall trend
- **Outliers**: Neighbourhoods far from the trendline are the ones whose safety score deviates most from what income alone would suggest
  - Above line: Safer than income would predict
  - Below line: Less safe than income would predict
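
The "above line" and "below line" reading can be made concrete by computing residuals, the gap between each observed safety score and the value the fitted line predicts. The following is a minimal sketch in plain Python, assuming an ordinary least squares line like the one described for the trendline. It uses only the ten rows shown under Sample Output rather than the full dataset, and `rows` and `fit_line` are hypothetical names introduced here for illustration, not part of `portfolio_app`.

```python
# Ten highest-income rows copied from the Sample Output table:
# (neighbourhood_name, median_household_income, safety_score).
# This is NOT the full 158-row dataset, so the fitted line is illustrative only.
rows = [
    ("Bridle Path-Sunnybrook-York Mills", 222000.0, 79.0),
    ("Kingsway South", 184000.0, 61.1),
    ("Lawrence Park North", 168000.0, 91.1),
    ("Lawrence Park South", 162000.0, 92.4),
    ("Princess-Rosethorn", 162000.0, 53.5),
    ("Leaside-Bennington", 148000.0, 69.4),
    ("Runnymede-Bloor West Village", 138000.0, 35.0),
    ("Bedford Park-Nortown", 135000.0, 74.5),
    ("Centennial Scarborough", 134000.0, 98.7),
    ("Rosedale-Moore Park", 122000.0, 8.3),
]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Same scaling as the notebook: income in thousands of dollars.
xs = [income / 1000 for _, income, _ in rows]
ys = [safety for _, _, safety in rows]
slope, intercept = fit_line(xs, ys)

# Residual = observed safety score minus the score the line predicts.
# Positive: safer than income would predict. Negative: less safe.
residuals = sorted(
    (safety - (intercept + slope * income / 1000), name)
    for name, income, safety in rows
)

for residual, name in residuals:
    side = "above" if residual > 0 else "below"
    print(f"{name:<36} {residual:+7.1f}  ({side} line)")
```

On these ten rows, Rosedale-Moore Park sits furthest below the fitted line and Centennial Scarborough furthest above it. A fit over all 158 neighbourhoods would produce a different line, so the ranking here should be read as a demonstration of the method rather than a finding.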

In [5]:
# Calculate correlation coefficient
correlation = df["median_household_income"].corr(df["safety_score"])
print(f"Correlation coefficient (Income vs Safety): {correlation:.3f}")

Correlation coefficient (Income vs Safety): 0.315
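
The coefficient is computed on raw dollars, while the plot uses income in thousands. That mismatch is harmless: pandas `Series.corr` defaults to Pearson's r, which does not change when a variable is rescaled by a positive constant. The sketch below checks this by hand on the ten rows from Sample Output. `pearson_r` is a hypothetical helper written for illustration, and the value it produces for ten rows is not expected to match the 0.315 reported for the full dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Income and safety score for the ten rows in the Sample Output table.
income = [222000.0, 184000.0, 168000.0, 162000.0, 162000.0,
          148000.0, 138000.0, 135000.0, 134000.0, 122000.0]
safety = [79.0, 61.1, 91.1, 92.4, 53.5, 69.4, 35.0, 74.5, 98.7, 8.3]

r_raw = pearson_r(income, safety)
r_scaled = pearson_r([x / 1000 for x in income], safety)

print(f"r on raw dollars:       {r_raw:.3f}")
print(f"r on income in $K:      {r_scaled:.3f}")
print(f"identical up to rounding: {math.isclose(r_raw, r_scaled, rel_tol=1e-9)}")
```

Because the scale factor cancels between the numerator and denominator, either `median_household_income` or `income_thousands` could be passed to the correlation without affecting the result.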
