# Melanoma Health Disparities Analysis

A personal project examining racial disparities in melanoma survival outcomes using SEER cancer registry data.

### Purpose
This notebook explores patterns and relationships in the SEER dataset:
- Distributions of key variables
- Crosstabs of race with other variables
- Create visualizations

### Dataset

**Source:** SEER Research Data, 17 Registries, Nov 2024 Sub (2000-2022)  
**Final sample:** 226,696 cutaneous melanoma cases across 13 variables

The data has been processed to include only:
- Microscopy-confirmed malignant cutaneous melanoma
- Known stage at diagnosis
- First primary tumors only
- Known survival time
- Known race

**Note:** Individual patient-level data cannot be shared publicly per SEER Research Data Agreement. 
<br>Instructions for requesting access and recreating this dataset can be found in the [data README](../data/README.md).

### Research Question

Are melanoma survival disparities by race explained by later stage at diagnosis and socioeconomic factors, or do disparities persist independent of these factors?

### Analysis Workflow

This is the second notebook in a three-part series:

1. **01_data_cleaning.ipynb** - Data cleaning and filtering
2. **02_exploratory_analysis.ipynb** *(this notebook)* - Exploratory data analysis and visualization
3. **03_survival_analysis.ipynb** - Kaplan-Meier curves and Cox regression models

### GitHub Repository

**GitHub:** https://github.com/kpannoni/melanoma-project

---

## Step 1: Load the cleaned dataset
Load the cleaned dataset that we filtered and processed in the first notebook, `01_data_cleaning.ipynb`.

In [44]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned data
mel_data = pd.read_csv('../data/melanoma_data_clean.csv')

# Quick verification
print(f"Dataset loaded: {len(mel_data):,} cases")
print(f"Variables: {mel_data.shape[1]}")
print("\nColumn names:")
print(mel_data.columns.tolist())

Dataset loaded: 226,696 cases
Variables: 13

Column names:
['age_group', 'sex', 'race', 'year_diag', 'survival_months', 'stage', 'cause_death', 'vital_status', 'histology', 'primary_site', 'marital_status', 'median_income', 'rural_urban']


## Step 2: Look at the distribution of patient survival and outcome by race
Take a look at the distributions of survival time, cancer stage, and vital status to establish a baseline for disparities in patient outcomes.

In [33]:
# Summary statistics for survival time by race
survival_by_race = mel_data.groupby('race')['survival_months'].describe()
print(survival_by_race)

# Summary statistics for vital status by race
vital_by_race = mel_data.groupby('race')['vital_status'].describe()
print(vital_by_race)

                                               count        mean        std  \
race                                                                          
Hispanic (All Races)                          8077.0  111.641575  73.841803   
Non-Hispanic American Indian/Alaska Native     575.0  120.288696  73.500078   
Non-Hispanic Asian or Pacific Islander        1598.0  105.993742  76.112290   
Non-Hispanic Black                            1027.0   94.595910  75.167657   
Non-Hispanic White                          215419.0  122.820884  71.432517   

                                            min   25%    50%    75%    max  
race                                                                        
Hispanic (All Races)                        0.0  57.0  101.0  167.0  275.0  
Non-Hispanic American Indian/Alaska Native  0.0  67.0  110.0  175.5  275.0  
Non-Hispanic Asian or Pacific Islander      0.0  39.0   95.0  165.0  274.0  
Non-Hispanic Black                          0.0  27.5   76.0 

### Subset the data to look at patients who died due to melanoma
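
One possible way to do this subsetting is sketched below. Because the patient-level SEER data cannot be shared, the sketch runs on a small invented DataFrame (`demo`) that reuses the column names from the cleaned dataset. The value labels, in particular `"Melanoma"` in `cause_death`, are assumptions for illustration and may not match the coding produced in `01_data_cleaning.ipynb`.

```python
import pandas as pd

# Toy stand-in for mel_data. Column names match the cleaned dataset,
# but every value below is invented for illustration.
demo = pd.DataFrame({
    "race": ["Non-Hispanic White", "Non-Hispanic Black", "Hispanic (All Races)",
             "Non-Hispanic White", "Non-Hispanic Black"],
    "survival_months": [120, 30, 85, 200, 14],
    "vital_status": ["Dead", "Dead", "Alive", "Alive", "Dead"],
    "cause_death": ["Melanoma", "Other cause", "Alive", "Alive", "Melanoma"],
})

# Assumed label: check mel_data['cause_death'].unique() for the real coding.
MELANOMA_DEATH_LABEL = "Melanoma"

# Keep only rows whose recorded cause of death is melanoma.
# .copy() avoids SettingWithCopyWarning if columns are added later.
melanoma_deaths = demo[demo["cause_death"] == MELANOMA_DEATH_LABEL].copy()

print(f"Melanoma deaths: {len(melanoma_deaths)} of {len(demo)} cases")
print(melanoma_deaths.groupby("race")["survival_months"].median())
```

On the real data, the same pattern would apply with `mel_data` in place of `demo` once the actual `cause_death` labels are confirmed.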

### Create crosstabs of patient outcomes by race
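
A minimal sketch of such crosstabs, using `pd.crosstab` on an invented DataFrame (`demo`) rather than the restricted SEER data. The column names come from the cleaned dataset, while the stage and vital status labels (`"Localized"`, `"Regional"`, `"Distant"`, `"Alive"`, `"Dead"`) are assumed for illustration.

```python
import pandas as pd

# Toy stand-in for mel_data; all values are invented for illustration.
demo = pd.DataFrame({
    "race": ["Non-Hispanic White"] * 4 + ["Non-Hispanic Black"] * 4,
    "stage": ["Localized", "Localized", "Localized", "Distant",
              "Localized", "Regional", "Distant", "Distant"],
    "vital_status": ["Alive", "Alive", "Alive", "Dead",
                     "Alive", "Dead", "Dead", "Dead"],
})

# Raw counts of vital status by race, with row and column totals.
counts = pd.crosstab(demo["race"], demo["vital_status"], margins=True)

# Row percentages of stage within each race group.
stage_pct = (
    pd.crosstab(demo["race"], demo["stage"], normalize="index")
    .mul(100)
    .round(1)
)

print(counts)
print(stage_pct)
```

Row-normalizing with `normalize="index"` is likely the more useful view here, since the race groups differ greatly in size (215,419 Non-Hispanic White cases versus 575 Non-Hispanic American Indian/Alaska Native cases), which makes raw counts hard to compare.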

### Make a barplot to show average survival times by race
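
One way this plot could be built is sketched below with pandas and plain matplotlib (the notebook also imports seaborn, which would work equally well). The DataFrame `demo` and its group labels are invented placeholders, not SEER values.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for mel_data; group labels and values are invented.
demo = pd.DataFrame({
    "race": ["Group A", "Group A", "Group B", "Group B", "Group C", "Group C"],
    "survival_months": [100, 140, 60, 80, 90, 130],
})

# Mean survival per group, sorted so the bars read from shortest to longest.
mean_survival = demo.groupby("race")["survival_months"].mean().sort_values()

# Horizontal bars leave room for the long race labels used in the real data.
fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(mean_survival.index, mean_survival.values, color="steelblue")
ax.set_xlabel("Mean survival (months)")
ax.set_ylabel("Race")
ax.set_title("Average survival time by race (toy data)")
fig.tight_layout()
# In a notebook the figure renders when the cell finishes;
# in a script, call plt.show() here.
```

Note that a plain mean of `survival_months` includes patients who were still alive at last follow-up, so it is a descriptive summary only; the Kaplan-Meier and Cox models in `03_survival_analysis.ipynb` are the place where censoring is handled.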