# Project 2: Violent Crime and Income Across NYS Counties
**Objective:**  
Investigate the relationship between **violent crime** and **family income** across all counties in New York State using two separate datasets. 

**Research question:**  
How does median family income relate to violent crime levels (counts and rates) across New York State counties in 2023?

**Hypothesis:**
I expect a strong negative correlation between violent crime rates and family income, on the assumption that richer counties have lower violent crime rates.

**Datasets used:**
1. `NYS Index Crimes by County in 2023.csv`  
   - Source: DCJS, Uniform Crime Reporting (UCR) data
   - Source website: https://www.criminaljustice.ny.gov/crimnet/ojsa/tableau_index_crime.htm  
   - Contains: county name, year, population, index crime counts, **violent crime counts**

2. `HDPulse_data_export.csv`  
   - Source: HDPulse (NIMHD)  
   - Source website: https://hdpulse.nimhd.nih.gov/data-portal/social/table?age=001&age_options=ageall_1&demo=00010&demo_options=income_3&race=00&race_options=race_7&sex=0&sex_options=sexboth_1&socialtopic=030&socialtopic_options=social_6&statefips=36&statefips_options=area_states
   - Contains: **median family income** by county 
   - For this dataset, Median Family Income data is based on a 5-year estimate from 2019-2023 as the sample size from a single year would not be large enough to be reliable for those areas.

**Note on data preparation:**
1. I did some simple preparation in Excel: I extracted the 2023 rows for `NYS Index Crimes by County in 2023.csv` from a larger dataset that includes New York State crimes by county from 2000 to 2024, and deleted the columns for specific crime types (murder count, robbery count, etc.).
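
The same preparation could also be done in pandas rather than Excel. This is a minimal sketch, assuming the larger file has one row per county and year. The crime-type column names (`Murder Count`, `Robbery Count`) and all values in the toy frame are hypothetical stand-ins, not the actual DCJS layout.

```python
import pandas as pd

# Toy stand-in for the multi-year file; column names and values are hypothetical.
full = pd.DataFrame({
    "County": ["Albany", "Albany", "Bronx", "Bronx"],
    "Year": [2022, 2023, 2022, 2023],
    "Population": [1000, 1100, 2000, 2100],
    "Index Total Count": [50, 55, 90, 95],
    "Violent Total Count": [5, 6, 20, 21],
    "Murder Count": [0, 1, 2, 3],
    "Robbery Count": [1, 2, 5, 6],
})

# Summary columns to keep; the specific crime-type columns are dropped.
keep_cols = ["County", "Year", "Population", "Index Total Count", "Violent Total Count"]

# Keep only the 2023 rows and only the summary columns.
crime_2023 = full.loc[full["Year"] == 2023, keep_cols].reset_index(drop=True)
print(crime_2023)
```

Doing this step in code keeps the filtering reproducible if the source file is updated.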

In [97]:
import pandas as pd
import plotly.express as px

import plotly.io as pio
pio.renderers.default = "notebook_connected"


## 1. Load Data

In [98]:
crime_path = "NYS Index Crimes by County in 2023.csv"
income_path = "HDPulse_data_export.csv"

crime_raw = pd.read_csv(crime_path)
income_raw = pd.read_csv(income_path)

crime_raw.head()


Unnamed: 0,"NYS Index Crimes by County in 2023 Data Source: DCJS, Uniform Crime Reporting File. Includes all reports received as of 11/3/2025.",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,County/Region,Year,Population,Index Total Count,Violent Total Count
1,Albany,2023,318366,9830,1354
2,Allegany,2023,46624,415,88
3,Bronx,2023,1344322,50895,16552
4,Broome,2023,197632,4849,555


## 2. Clean Crime Data
For the crime file:
- The first row contains the true column headers  
- The remaining rows contain data for each county

Steps:
1. Rename columns to simpler labels.
2. Drop the header row from the data.
3. Convert numeric columns to proper numeric types.
4. Create a cleaned county name column for merging later.

In [99]:
crime = crime_raw.copy()
# Rename columns to simpler names
crime.columns = ["County", "Year", "Population", "Index_Total_Count", "Violent_Total_Count"]
crime = crime.iloc[1:].reset_index(drop=True)

# Convert numeric columns
for col in ["Year", "Population", "Index_Total_Count", "Violent_Total_Count"]:
    crime[col] = pd.to_numeric(crime[col], errors="coerce")

# Create a cleaned county name for merging later and fix specific spelling differences between files (St Lawrence vs St. Lawrence)
crime["County_clean"] = crime["County"].str.strip().replace({"St Lawrence": "St. Lawrence"})

crime.head()

Unnamed: 0,County,Year,Population,Index_Total_Count,Violent_Total_Count,County_clean
0,Albany,2023,318366,9830,1354,Albany
1,Allegany,2023,46624,415,88,Allegany
2,Bronx,2023,1344322,50895,16552,Bronx
3,Broome,2023,197632,4849,555,Broome
4,Cattaraugus,2023,74831,1168,149,Cattaraugus
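
Because `errors="coerce"` silently turns unparseable values into `NaN`, it can be worth checking whether any value was lost in the conversion. The sketch below uses a small toy frame with a deliberately malformed value (`"1,354"`); the toy data is an assumption for illustration and is not taken from the crime file.

```python
import pandas as pd

# Toy frame with one malformed value ("1,354") to show what coercion does.
toy = pd.DataFrame({
    "County": ["Albany", "Allegany"],
    "Violent_Total_Count": ["1,354", "88"],
})

converted = pd.to_numeric(toy["Violent_Total_Count"], errors="coerce")

# Rows that had a value before conversion but are NaN afterwards were coerced.
coerced = toy[converted.isna() & toy["Violent_Total_Count"].notna()]
print(coerced)
```

An empty result from the same check on `crime` would indicate that every numeric column converted cleanly.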


## 3. Clean Income Data
The HDPulse income file includes:
- Descriptive metadata in the first few rows
- The real column header in the 4th row (index 3) of the loaded DataFrame
- Footnote rows at the bottom

Steps:
1. Use the 4th row as the header row.
2. Keep only rows that look like real counties (have a valid FIPS code).
3. Extract and clean the median family income column (remove commas).
4. Create a cleaned county name column that matches the crime dataset.

In [100]:
# Copy and re-assign header from the 4th row (index 3)
income = income_raw.copy()
income.columns = ["col0", "col1", "col2", "col3"]

header = income.iloc[3]
income = income.iloc[4:].reset_index(drop=True)
income.columns = header.values
income.head()

Unnamed: 0,County,FIPS,Value (Dollars),Rank within US (of 3139 counties)
0,Bronx County,36005,59322,2883
1,Fulton County,36035,73183,2123
2,Cattaraugus County,36009,75127,1980
3,Franklin County,36033,75978,1918
4,Chautauqua County,36013,76149,1903


In [101]:
# Keep only rows with a non-missing FIPS value (these correspond to real counties)
income = income[income["FIPS"].notna()].copy()
income["County"] = income["County"].astype(str)

# Remove the literal ' County' suffix to match the crime dataset naming
income["County_short"] = income["County"].str.replace(" County", "", regex=False)
income["County_clean"] = income["County_short"]

# Clean the median family income column: remove commas and convert to float
income["Median_family_income"] = income["Value (Dollars)"].astype(str).str.replace(",", "", regex=False).astype(float)

income[["County", "County_clean", "Median_family_income"]].head()

Unnamed: 0,County,County_clean,Median_family_income
0,Bronx County,Bronx,59322.0
1,Fulton County,Fulton,73183.0
2,Cattaraugus County,Cattaraugus,75127.0
3,Franklin County,Franklin,75978.0
4,Chautauqua County,Chautauqua,76149.0
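
The non-missing FIPS filter removes footnote rows, but it would also keep any state-level summary row if the export contains one. A stricter filter is sketched below, assuming FIPS codes are stored as strings and that New York county codes are five digits starting with `36`. The `New York` row with code `36000` in the toy frame is a hypothetical example of a summary row, not something shown in the output above.

```python
import pandas as pd

# Toy rows: a hypothetical state summary row, two counties, and a footnote row.
toy = pd.DataFrame({
    "County": ["New York", "Bronx County", "Fulton County", "Footnote text"],
    "FIPS": ["36000", "36005", "36035", None],
})

# County-level FIPS: "36" followed by three digits that are not "000".
# astype(str) turns missing values into the string "None", which fails the match.
is_county = toy["FIPS"].astype(str).str.fullmatch(r"36(?!000)\d{3}")

counties = toy[is_county].reset_index(drop=True)
print(counties)
```

In this notebook the inner merge on `County_clean` would drop such a row anyway, so this is only an extra safeguard.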


## 4. Merge Datasets
Merge the two datasets on the cleaned county name (`County_clean`).  

In [102]:
merged = pd.merge(
    crime,
    income[["County_clean", "Median_family_income"]],
    on="County_clean",
    how="inner"
)

print("Rows:", merged.shape[0])
merged.head()

Rows: 62


Unnamed: 0,County,Year,Population,Index_Total_Count,Violent_Total_Count,County_clean,Median_family_income
0,Albany,2023,318366,9830,1354,Albany,115490.0
1,Allegany,2023,46624,415,88,Allegany,80013.0
2,Bronx,2023,1344322,50895,16552,Bronx,59322.0
3,Broome,2023,197632,4849,555,Broome,83422.0
4,Cattaraugus,2023,74831,1168,149,Cattaraugus,75127.0
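
The 62 rows match the number of counties in New York State, so the inner merge did not drop any county. If the count had come out lower, an outer merge with `indicator=True` is one way to find the names that failed to match. The sketch below uses toy frames with made-up values to show the idea.

```python
import pandas as pd

# Toy frames with made-up values; one county name is spelled differently in each.
crime_toy = pd.DataFrame({
    "County_clean": ["Albany", "St Lawrence"],
    "Violent_Total_Count": [10, 20],
})
income_toy = pd.DataFrame({
    "County_clean": ["Albany", "St. Lawrence"],
    "Median_family_income": [1.0, 2.0],
})

# The outer merge keeps unmatched rows; "_merge" records which side each row came from.
check = pd.merge(crime_toy, income_toy, on="County_clean", how="outer", indicator=True)

unmatched = check[check["_merge"] != "both"]
print(unmatched[["County_clean", "_merge"]])
```

Rows marked `left_only` or `right_only` point to spelling differences such as the `St Lawrence` case handled earlier.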


## 5. Compute Violent Crime Rate per 100k

Raw crime counts are heavily influenced by county population. To make comparisons fair, I compute a violent crime rate per 100,000 residents. This gives us a standardized metric to compare across counties.

In [103]:
merged['Violent_rate_per_100k'] = merged['Violent_Total_Count'] / merged['Population'] * 100000
merged.head()

Unnamed: 0,County,Year,Population,Index_Total_Count,Violent_Total_Count,County_clean,Median_family_income,Violent_rate_per_100k
0,Albany,2023,318366,9830,1354,Albany,115490.0,425.296671
1,Allegany,2023,46624,415,88,Allegany,80013.0,188.743995
2,Bronx,2023,1344322,50895,16552,Bronx,59322.0,1231.252631
3,Broome,2023,197632,4849,555,Broome,83422.0,280.824968
4,Cattaraugus,2023,74831,1168,149,Cattaraugus,75127.0,199.11534
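
As a quick check of the formula, the rate for a single county can be reproduced by hand. The sketch below uses the Albany row from the table above.

```python
# Albany, 2023: values taken from the merged table above.
violent_count = 1354
population = 318366

# Rate per 100,000 residents = count / population * 100,000
rate = violent_count / population * 100_000
print(round(rate, 2))
```

The result agrees with the `Violent_rate_per_100k` value shown for Albany.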


## 6. Preliminary Visualization: Violent Crime Count vs Income
This scatterplot shows the violent crime counts (from the crime dataset) against median family income (from the income dataset) for each county.  

In [104]:
fig = px.scatter(
    merged,
    x="Median_family_income",
    y="Violent_Total_Count",
    hover_name="County",
    trendline="ols",
    trendline_color_override="red",
    title="Violent Crime Count vs Median Family Income (NY Counties, 2023)",
    labels={
        "Median_family_income": "Median Family Income (USD)",
        "Violent_Total_Count": "Violent Crime Count (2023)"
    }
)
fig

**Chart conclusion:**
There is almost no relationship between violent crime count and median family income, because raw counts are driven mainly by county population. The violent crime rate per 100,000 residents is a more meaningful measure.
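
The claim that raw counts track population can be checked by correlating the two columns directly. The sketch below uses only the five counties shown in the preview tables, so the number it produces illustrates the check and is not a statewide result. On the full data the equivalent call would be `merged["Population"].corr(merged["Violent_Total_Count"])`.

```python
import pandas as pd

# The five counties shown in the preview tables (not the full 62-county dataset).
preview = pd.DataFrame({
    "County": ["Albany", "Allegany", "Bronx", "Broome", "Cattaraugus"],
    "Population": [318366, 46624, 1344322, 197632, 74831],
    "Violent_Total_Count": [1354, 88, 16552, 555, 149],
})

# Pearson correlation between population and raw violent crime count.
corr_pop_count = preview["Population"].corr(preview["Violent_Total_Count"])
print(corr_pop_count)
```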

## 7. Further Visualization: Violent Crime Rate vs Income 
Since the relationship between violent crime count and median family income is not apparent, I plot the violent crime rate per 100k (computed in Section 5) instead to see whether the data shows a clearer pattern.

In [105]:
fig2 = px.scatter(
    merged,
    x="Median_family_income",
    y="Violent_rate_per_100k",
    hover_name="County",
    trendline="ols",
    trendline_color_override="red",
    title="Violent Crime Rate per 100,000 vs Median Family Income (NY, 2023)",
    labels={
        "Median_family_income": "Median Family Income (USD)",
        "Violent_rate_per_100k": "Violent Crime Rate per 100,000"
    }
)
fig2

**Chart conclusion:**
The relationship between violent crime rate and median family income is slightly negative. 

## 8. Correlation Analysis
To quantify the strength and direction of the relationship between income and violent crime, I compute Pearson correlation coefficients:

- Between **median family income** and **violent crime counts**
- Between **median family income** and **violent crime rate per 100k**

In [106]:
corr_counts = merged["Median_family_income"].corr(merged["Violent_Total_Count"])
corr_rates = merged["Median_family_income"].corr(merged["Violent_rate_per_100k"])

corr_counts, corr_rates

(np.float64(0.03443959659053857), np.float64(-0.10499780775349558))
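
For reference, the Pearson coefficient returned by `.corr()` is the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations. The sketch below computes it by hand on the five preview counties and compares it with pandas. It also computes a rank-based (Spearman) coefficient as Pearson on ranks, which is less sensitive to extreme values such as the Bronx rate. The five-row subset is used only to illustrate the calculation; it does not reproduce the statewide coefficients above.

```python
import numpy as np
import pandas as pd

# Five preview counties: Albany, Allegany, Bronx, Broome, Cattaraugus.
income = pd.Series([115490.0, 80013.0, 59322.0, 83422.0, 75127.0])
rate = pd.Series([425.296671, 188.743995, 1231.252631, 280.824968, 199.115340])

# Pearson r by hand: sum of deviation products over the root of the
# product of the summed squared deviations.
dx = income - income.mean()
dy = rate - rate.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The pandas result should agree with the manual calculation.
r_pandas = income.corr(rate)

# Spearman rho computed as Pearson correlation of the ranks.
rho = income.rank().corr(rate.rank())

print(r_manual, r_pandas, rho)
```

On the full data, the rank-based version would be `merged["Median_family_income"].rank().corr(merged["Violent_rate_per_100k"].rank())`.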

## 9. Takeaways

### Observations

- Counties with larger populations tend to have higher violent crime counts, which is expected, regardless of income. This makes it hard to interpret counts alone.
- Comparing the two correlations shows that only the **violent crime rate per 100,000 residents** has any negative relationship with median family income, and it is weak:
  - The correlation between income and violent crime **counts** is close to zero (r ≈ 0.03), indicating little to no linear relationship in raw counts.
  - The correlation between income and violent crime **rates** is slightly negative (r ≈ -0.10), suggesting that higher-income counties tend to have somewhat lower violent crime rates, but the effect is small.
- There is substantial variation: some mid-income counties have relatively high violent crime rates (such as Queens), and some lower-income counties have moderate rates (such as Delaware). This suggests that factors beyond income (e.g., urban vs. rural, policing, demographics) likely play an important role.

### Overall conclusion

Based on these datasets, median family income alone is not a strong predictor of violent crime at the county level in New York State. The income variable used here is median family income, which may not fully represent individual-level economic circumstances. Family income aggregates multiple earners within a household and can differ significantly from personal or per-capita income. This means the relationship between income and violent crime might appear weaker because the metric does not directly capture individual economic disadvantage.
  
There is at most a weak negative relationship between family income and violent crime rates, and any conclusions should be made cautiously, considering other social and structural factors that were not included in this analysis.
