# Melanoma Health Disparities Analysis

A personal project examining racial disparities in melanoma survival outcomes using SEER cancer registry data.

### Purpose
This notebook explores patterns and relationships in the SEER dataset:
- Distributions of key variables
- Crosstabs of race with other variables
- Data visualizations

### Dataset

**Source:** SEER Research Data, 17 Registries, Nov 2024 Sub (2000-2022)  
**Final sample:** 226,696 cutaneous melanoma cases across 13 variables

The data has been processed to include only:
- Microscopy-confirmed malignant cutaneous melanoma
- Known stage at diagnosis
- First primary tumors only
- Known survival time
- Known race

**Note:** Individual patient-level data cannot be shared publicly per SEER Research Data Agreement. 
<br>Instructions for requesting access and recreating this dataset can be found in the [data README](../data/README.md).

### Research Question

Are racial disparities in melanoma survival explained by a later stage at diagnosis and socioeconomic factors, or do disparities persist even after controlling for these factors?

### Analysis Workflow

This is the second notebook in a three-part series:

1. **01_data_cleaning.ipynb** - Data cleaning and filtering
2. **02_exploratory_analysis.ipynb** *(this notebook)* - Exploratory data analysis and visualization
3. **03_survival_analysis.ipynb** - Kaplan-Meier curves and Cox regression models

### GitHub Repository

**GitHub:** https://github.com/kpannoni/melanoma-project

---

## Step 1: Load the cleaned dataset
Load the cleaned dataset that we filtered and processed in the first notebook `01_data_cleaning.ipynb`.

In [3]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned data
mel_data = pd.read_csv('../data/melanoma_data_clean.csv', header=0, low_memory=False)

# Quick verification of the dataset
print(f"Dataset loaded: {len(mel_data):,} cases")
print(f"Variables: {mel_data.shape[1]}")
print("\nColumn names:")
print(mel_data.columns.tolist())


Dataset loaded: 226,696 cases
Variables: 14

Column names:
['age_group', 'sex', 'race', 'year_diag', 'survival_months', 'stage', 'cause_death', 'vital_status', 'histology', 'primary_site', 'marital_status', 'median_income', 'rural_urban', 'race_labels']


## Step 2: Disparities in patient survival time and outcome
#### First, we'll look at the distributions of survival time by race to establish a baseline for disparities in patient outcomes.

In [5]:
# Show the summary statistics for survival time by race
survival_by_race = round(mel_data.groupby('race_labels')['survival_months'].describe()[['count','50%','std']],1)
survival_by_race = survival_by_race.rename(columns={'50%': 'median'}) # rename median column
survival_by_race[['count', 'median']] = survival_by_race[['count', 'median']].astype(int)

# Sort by median survival time ascending
survival_by_race = survival_by_race.sort_values(by='median')
# remove the index name for display purposes
survival_by_race.index.name = None

print("Median survival time by race: \n", survival_by_race, "\n")

# Create a boxplot to show the distribution of survival time by race

# Set up the plot aesthetics
sns.set_style("ticks")
custom_pal = ['#5790c4', '#e89c5e', '#6db388', '#c377a3', '#8c7fb8']

# get the order to plot the data from the summary stats table (ordered by median)
order = survival_by_race.index.tolist()

# Create the box plot (hue_order keeps the palette colors aligned with the row order)
plt.figure(figsize=(7, 2.5))
sns.boxplot(data=mel_data, x='survival_months', y='race_labels', hue='race_labels', order=order,
             hue_order=order, palette=custom_pal, gap=0.15,
             medianprops=dict(color='#333333', linewidth=1.5, solid_capstyle='butt'))

# Move the y-axis to the right for aesthetics
ax = plt.gca()
ax.yaxis.tick_right()
ax.yaxis.set_label_position("right")

# remove the left axis lines, keep the right
sns.despine(left=True, right=False)

# Force ticks to show on the right
ax.tick_params(axis='y', which='both', right=True, left=False, pad=5)
# Adjust the y axis margins
plt.margins(y=0.07)

# Title the plot and the axes
plt.xlabel("Survival Time (Months)", fontsize=10, labelpad=7, fontweight="bold")
plt.ylabel(None)
plt.title("Survival Time by Race", fontsize=12)

# Add median survival time as text labels
for i, race in enumerate(order):
    median_val = survival_by_race.loc[race, 'median']
    # Print the text label on the plot
    ax.text(median_val + 3, i, f'{median_val}', 
            va='center', ha='left', fontsize=8, fontweight="bold", 
            color='white')

# Save the boxplot as a PNG image
plt.savefig('../images/boxplot_survival_time_by_race.png', dpi=175, bbox_inches='tight')

plt.close()  # Don't display in output

Median survival time by race: 
                                 count  median   std
Black                            1027      76  75.2
Asian or Pacific Islander        1598      95  76.1
Hispanic                         8077     101  73.8
American Indian/Alaska Native     575     110  73.5
White                          215419     115  71.4 



<img src="../images/boxplot_survival_time_by_race.png" width="75%">

Black patients have a median survival time of 76 months compared to 115 months for White patients, a difference of **39 months**. This points to a substantial disparity in survival time by race. However, other variables such as stage at diagnosis and socioeconomic factors may contribute to this disparity, so we examine them in the following steps.

Notably, American Indian/Alaska Native patients have a median survival time (110 months) close to that of White patients, while Hispanic and Asian or Pacific Islander patients fall between Black and White patients.
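The gap quoted above can be reproduced directly from the summary table. The sketch below re-enters the medians by hand from the printed output so that it runs on its own; the names `medians` and `gap_vs_white` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Median survival (months), copied from the printed summary table
medians = pd.Series({
    'Black': 76,
    'Asian or Pacific Islander': 95,
    'Hispanic': 101,
    'American Indian/Alaska Native': 110,
    'White': 115,
})

# Difference in median survival relative to White patients
gap_vs_white = medians['White'] - medians
print(gap_vs_white)
```

Inside the notebook, the same subtraction should work on `survival_by_race['median']` without retyping the values.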

#### Next, we'll look at death rates and cause of death by race.

In [8]:
# Group the data by race and cause of death.
cause_of_death = pd.crosstab(mel_data['race_labels'], mel_data['cause_death'], 
                          normalize='index') * 100
# Round the data
cause_of_death = cause_of_death.round(1)
# Rename the columns to simplify
cause_of_death.columns = ['Alive / Other', 'Melanoma', 'Unknown']

# Sort the table in the same order as above (by median survival time ascending)
cause_of_death = cause_of_death.reindex(order)

# For display, remove the index name
cause_of_death.index.name = None

print("Cause of death by race (%):")
cause_of_death


Cause of death by race (%):


| | Alive / Other | Melanoma | Unknown |
|:--|--:|--:|--:|
| Black | 66.8 | 32.4 | 0.8 |
| Asian or Pacific Islander | 74.3 | 23.4 | 2.3 |
| Hispanic | 81.3 | 17.1 | 1.7 |
| American Indian/Alaska Native | 83.1 | 15.8 | 1.0 |
| White | 87.9 | 11.7 | 0.4 |


Here we note a similar trend as seen for survival time by race, with Black patients dying of melanoma at nearly 3× the rate of White patients (32.4% vs 11.7%). <br>**These findings confirm that there is a significant racial disparity in survival for melanoma patients.**
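To see where "nearly 3×" comes from, we can divide each group's melanoma death percentage by the White percentage. This is a rough sketch using the values from the table above, re-entered by hand; `melanoma_pct` and `ratio_vs_white` are illustrative names. These are raw proportions, not adjusted for follow-up time or any covariate.

```python
import pandas as pd

# Percentage of patients with melanoma as cause of death, from the crosstab output
melanoma_pct = pd.Series({
    'Black': 32.4,
    'Asian or Pacific Islander': 23.4,
    'Hispanic': 17.1,
    'American Indian/Alaska Native': 15.8,
    'White': 11.7,
})

# Ratio of each group's melanoma death percentage to the White percentage
ratio_vs_white = (melanoma_pct / melanoma_pct['White']).round(2)
print(ratio_vs_white)
```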

## Step 3: Disparities in cancer stage at diagnosis
Is there a disparity in the stage of cancer diagnosis by race? Are minorities diagnosed at later stages?

#### Create a crosstab to look at cancer stage at diagnosis by race.

In [12]:
# Group the data by race and stage at diagnosis.
stage_by_race = pd.crosstab(mel_data['race_labels'], mel_data['stage'], 
                          normalize='index') * 100
# Round the data
stage_by_race = stage_by_race.round(1)

# Sort the table in the same order (by median survival time ascending)
stage_by_race = stage_by_race.reindex(order)
# Also sort stage in order of increasingly worse prognosis
stage_by_race = stage_by_race[['Localized', 'Regional', 'Distant']] 

# For display, remove the index and column names
stage_by_race.index.name = None
stage_by_race.columns.name = None

print("Cancer Stage at Diagnosis by Race (%):")
stage_by_race

Cancer Stage at Diagnosis by Race (%):


| | Localized | Regional | Distant |
|:--|--:|--:|--:|
| Black | 59.9 | 25.8 | 14.3 |
| Asian or Pacific Islander | 69.6 | 19.8 | 10.6 |
| Hispanic | 76.6 | 16.0 | 7.5 |
| American Indian/Alaska Native | 80.9 | 14.3 | 4.9 |
| White | 87.2 | 8.9 | 3.9 |


#### Visualize differences in cancer stage at diagnosis by race with a bar plot.

In [14]:
# We want a stacked bar plot showing the proportion of patients with each cancer stage for each racial group

# Define the bar colors for each stage
stage_colors = {'Localized': '#6db388',
                'Regional': '#e89c5e',
                'Distant': '#e15759', }    

# Create abbreviated labels for the plot
short_labels = {
    'Black': 'Black',
    'Asian or Pacific Islander': 'API',
    'Hispanic': 'Hispanic',
    'American Indian/Alaska Native': 'AI/AN',
    'White': 'White' }

# Apply the short labels to the index before plotting
stage_by_race_plot = stage_by_race.copy()
stage_by_race_plot.index = stage_by_race_plot.index.map(short_labels)

# Create the bar plot
stage_by_race_plot.plot(kind='bar', stacked=True, figsize=(6, 4),
                   color=[stage_colors[col] for col in stage_by_race.columns], width=0.7)

# Format the plot title and labels
plt.ylabel('Percentage of Patients (%)', fontweight='bold', fontsize=10)
plt.xlabel('')  # Remove x-label
plt.title('Cancer Stage at Diagnosis by Race', fontsize=14, y=1.11)

# Get current axes
ax = plt.gca()

# Add percentage labels to each segment
for container in ax.containers:
    labels = [f'{v:.0f}%' if v > 3 else '' for v in container.datavalues]  # Only show label if >3%
    ax.bar_label(container, labels=labels, label_type='center', 
                 fontsize=10, color='white', fontweight='bold')

# Format the axes
plt.xticks(rotation=0, fontsize=11, fontweight="bold")
plt.yticks(fontsize=10)
ax.tick_params(axis='x', pad=3) 
plt.ylim(0, 100)

# Format the legend
plt.legend(ncol=3, loc='upper center', frameon=False, bbox_to_anchor=(0.5, 1.11), fontsize=9)
sns.despine() # remove upper and right axes lines

# Save the barplot as a PNG image
plt.savefig('../images/barplot_stage_by_race.png', dpi=150, bbox_inches='tight')

plt.close()  # Don't display in output


<img src="../images/barplot_stage_by_race.png" width="50%">

*API = Asian or Pacific Islander; AI/AN = American Indian/Alaska Native*

Overall, Black patients are 3.7× more likely than White patients to be diagnosed with distant melanoma (14.3% vs 3.9%), which has a worse prognosis, while White patients are predominantly diagnosed at the earlier localized stage (87.2%). **Differences in stage at diagnosis likely account for much of the disparity in survival time across racial groups.**
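The 3.7× figure, along with a combined "non-localized" share (regional plus distant), can be checked from the crosstab values. The sketch below rebuilds the table by hand from the output above; `stage_pct`, `non_localized`, and `distant_ratio` are illustrative names rather than notebook variables.

```python
import pandas as pd

# Stage at diagnosis by race (%), copied from the crosstab output
stage_pct = pd.DataFrame(
    {
        'Localized': [59.9, 69.6, 76.6, 80.9, 87.2],
        'Regional': [25.8, 19.8, 16.0, 14.3, 8.9],
        'Distant': [14.3, 10.6, 7.5, 4.9, 3.9],
    },
    index=['Black', 'Asian or Pacific Islander', 'Hispanic',
           'American Indian/Alaska Native', 'White'],
)

# Share of patients diagnosed beyond the localized stage
non_localized = (stage_pct['Regional'] + stage_pct['Distant']).round(1)

# Distant-stage percentage relative to White patients
distant_ratio = (stage_pct['Distant'] / stage_pct.loc['White', 'Distant']).round(1)

print(pd.DataFrame({'non_localized_pct': non_localized,
                    'distant_ratio_vs_white': distant_ratio}))
```

By this measure, about 40% of Black patients are diagnosed beyond the localized stage, compared with about 13% of White patients.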

## Step 4: Socioeconomic disparities
The county median household income serves as an area-based proxy for socioeconomic status.

**We'll categorize the county median household income into three groups:**

<img src="../images/median_income_groups.png" width="20%">


#### Create a crosstab of household income group by race.

In [61]:
# Create the income group mapping
def income_groups(income):
    if income == 'Unknown/missing/no match/Not 1990-2023':
        return 'Unknown'
    elif income in ['< $40,000', '$40,000 - $44,999', '$45,000 - $49,999', 
                    '$50,000 - $54,999', '$55,000 - $59,999']:
        return 'Low'
    elif income in ['$60,000 - $64,999', '$65,000 - $69,999', '$70,000 - $74,999',
                    '$75,000 - $79,999', '$80,000 - $84,999', '$85,000 - $89,999']:
        return 'Mid'
    else:  # $90,000+
        return 'High'

mel_data['income_tier'] = mel_data['median_income'].apply(income_groups)

# Median household income by race
income_by_race = pd.crosstab(mel_data['race_labels'], mel_data['income_tier'], 
                          normalize='index') * 100
# Round the data
income_by_race = income_by_race.round(1)

# Sort the table in the same order (by median survival time ascending)
income_by_race = income_by_race.reindex(order)

# Set the order of the columns
income_by_race = income_by_race[['Low', 'Mid', 'High', 'Unknown']]

# For display, remove the index and column names
income_by_race.index.name = None
income_by_race.columns.name = None

print("Household income group by race (%):")
income_by_race


Household income group by race (%):


| | Low | Mid | High | Unknown |
|:--|--:|--:|--:|--:|
| Black | 22.4 | 57.4 | 20.2 | 0.1 |
| Asian or Pacific Islander | 2.1 | 45.3 | 52.6 | 0.0 |
| Hispanic | 5.9 | 63.2 | 30.9 | 0.1 |
| American Indian/Alaska Native | 16.2 | 54.1 | 29.7 | 0.0 |
| White | 11.4 | 53.9 | 34.7 | 0.0 |


Black patients, who have the worst melanoma survival outcomes, live disproportionately in low-income counties compared to the other racial groups. Interestingly, Asian or Pacific Islander patients have the highest proportion living in high-income counties (52.6%), yet still show worse survival outcomes than White patients.

**This suggests that socioeconomic factors alone do not fully explain racial disparities in melanoma outcomes.**
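One thing worth checking in the income grouping above: the final `else` branch of `income_groups` labels anything it does not recognize as `'High'`, so a misspelled or unexpected bracket would be counted as high income without any warning. The sketch below repeats the function's branching and runs it on a few made-up labels to show the effect. The two labels marked hypothetical are assumptions for illustration, not values confirmed to exist in the SEER field.

```python
import pandas as pd

UNKNOWN = 'Unknown/missing/no match/Not 1990-2023'
LOW = ['< $40,000', '$40,000 - $44,999', '$45,000 - $49,999',
       '$50,000 - $54,999', '$55,000 - $59,999']
MID = ['$60,000 - $64,999', '$65,000 - $69,999', '$70,000 - $74,999',
       '$75,000 - $79,999', '$80,000 - $84,999', '$85,000 - $89,999']

def income_groups(income):
    # Same branching as the notebook's function
    if income == UNKNOWN:
        return 'Unknown'
    elif income in LOW:
        return 'Low'
    elif income in MID:
        return 'Mid'
    else:
        return 'High'

labels = pd.Series([
    '< $40,000',
    '$60,000 - $64,999',
    '$95,000 - $99,999',   # hypothetical high-income bracket
    '$55,000-$59,999',     # hypothetical typo: missing spaces around the hyphen
    UNKNOWN,
])

tiers = labels.map(income_groups)

# Every label that reached the else branch, for manual review
fell_through = labels[tiers == 'High'].tolist()
print(fell_through)
```

In the notebook, the equivalent check would be to inspect `mel_data.loc[mel_data['income_tier'] == 'High', 'median_income'].unique()` and confirm that every listed bracket really is $90,000 or above.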

#### Create a crosstab of rural-urban continuum by race.
Patients in large or medium-sized metro areas likely have easier access to health care than patients living in remote rural areas.

In [86]:
# Rural-urban continuum by race
rural_urban_by_race = pd.crosstab(mel_data['race_labels'], mel_data['rural_urban'], 
                          normalize='index') * 100
# Round the data
rural_urban_by_race = rural_urban_by_race.round(1)

# Sort the table in the same order (by median survival time ascending)
rural_urban_by_race = rural_urban_by_race.reindex(order)

# Drop the two "unknown" columns
rural_urban_clean = rural_urban_by_race.drop([
    'Unknown/missing/no match (Alaska or Hawaii - Entire State)', 
    'Unknown/missing/no match/Not 1990-2023'
], axis=1)

# Simplify column names to be more informative
rural_urban_clean.columns = [
    'Large Metro (1M+)',
    'Medium Metro (250K-1M)', 
    'Small Metro (<250K)',
    'Nonmetro Adjacent',
    'Nonmetro Remote'
]

# For display, remove the index and column names
rural_urban_clean.index.name = None
rural_urban_clean.columns.name = None

print("Rural-Urban continuum by race (%):\n")
rural_urban_clean

Rural-Urban continuum by race (%):



| | Large Metro (1M+) | Medium Metro (250K-1M) | Small Metro (<250K) | Nonmetro Adjacent | Nonmetro Remote |
|:--|--:|--:|--:|--:|--:|
| Black | 59.7 | 19.1 | 9.3 | 8.7 | 3.1 |
| Asian or Pacific Islander | 63.0 | 25.7 | 4.0 | 1.6 | 5.8 |
| Hispanic | 68.8 | 21.4 | 5.6 | 2.3 | 1.9 |
| American Indian/Alaska Native | 38.4 | 20.7 | 15.5 | 9.7 | 9.4 |
| White | 58.5 | 20.6 | 8.5 | 7.1 | 5.3 |


Hispanic and Asian or Pacific Islander patients have the highest proportions located in large metropolitan areas (68.8% and 63.0%), which typically offer better access to specialized care. Despite this geographic advantage, these groups still show worse survival outcomes than White patients. Likewise, American Indian/Alaska Native patients are the most likely to live in remote rural areas (9.4%), yet have survival outcomes similar to White patients.

**This further suggests that factors beyond socioeconomic status and healthcare access contribute to melanoma outcome disparities.**
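A caveat when reading the rural-urban table: the percentages are normalized across all categories before the two "unknown" columns are dropped, so the displayed rows need not sum to 100. The sketch below sums each row using values re-entered by hand from the output above; `ru_pct`, `row_sums`, and `dropped_share` are illustrative names.

```python
import pandas as pd

# Rural-urban continuum by race (%), copied from the table output
ru_pct = pd.DataFrame(
    {
        'Large Metro (1M+)': [59.7, 63.0, 68.8, 38.4, 58.5],
        'Medium Metro (250K-1M)': [19.1, 25.7, 21.4, 20.7, 20.6],
        'Small Metro (<250K)': [9.3, 4.0, 5.6, 15.5, 8.5],
        'Nonmetro Adjacent': [8.7, 1.6, 2.3, 9.7, 7.1],
        'Nonmetro Remote': [3.1, 5.8, 1.9, 9.4, 5.3],
    },
    index=['Black', 'Asian or Pacific Islander', 'Hispanic',
           'American Indian/Alaska Native', 'White'],
)

# Sum of the displayed categories, and the share left in the dropped "unknown" columns
row_sums = ru_pct.sum(axis=1).round(1)
dropped_share = (100 - row_sums).round(1)
print(pd.DataFrame({'row_sum': row_sums, 'dropped_share': dropped_share}))
```

Most rows sum to 100 within rounding, but the American Indian/Alaska Native row sums to about 93.7, so roughly 6% of that group falls into the dropped "unknown" categories. That gap is worth keeping in mind when comparing this group's percentages with the others.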

#### Look at age at diagnosis by race
Because older melanoma patients likely have a shorter survival time than patients diagnosed younger, we should look at age at diagnosis as a potential confounding factor.

To do this, we will group patients into three clinically relevant age groups:
* <50 years
* 50-69 years
* 70+ years

In [106]:
# Create age grouping function
def categorize_age(age_group):
    # Handle special cases
    if age_group == '00 years':
        return '<50'
    elif '90+' in age_group:
        return '70+'
    
    # Extract age from age ranges (lower bound)
    age = int(age_group.split('-')[0])
    
    if age < 50:
        return '<50'
    elif age < 70:
        return '50-69'
    else:
        return '70+'

# Apply to create new age column
mel_data['age_category'] = mel_data['age_group'].apply(categorize_age)

# Create a crosstab of age at diagnosis by race
age_by_race = pd.crosstab(mel_data['race_labels'], mel_data['age_category'], 
                          normalize='index') * 100

# Round the data
age_by_race = age_by_race.round(1)

# Order columns properly
age_by_race = age_by_race[['<50', '50-69', '70+']]

# Order race by median survival time (same order as previous crosstabs)
age_by_race = age_by_race.reindex(order) 

# Remove the index and column names
age_by_race.index.name = None
age_by_race.columns.name = None

print()
age_by_race

| | <50 | 50-69 | 70+ |
|:--|--:|--:|--:|
| Black | 27.9 | 41.5 | 30.6 |
| Asian or Pacific Islander | 34.6 | 39.0 | 26.3 |
| Hispanic | 41.5 | 38.3 | 20.2 |
| American Indian/Alaska Native | 36.5 | 43.5 | 20.0 |
| White | 28.2 | 44.4 | 27.4 |


Age distributions are broadly similar across racial groups, with Black and White patients showing similar proportions in the higher-risk age group (30.6% vs 27.4% aged 70+). Notably, Hispanic patients are diagnosed at younger ages (41.5% under 50), yet still experience worse survival outcomes than White patients.

**This suggests age does not explain the observed racial disparities in melanoma survival.**
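Because the age bins depend on parsing the text labels in `age_group`, a quick check of the bin edges is cheap insurance. The sketch below repeats the logic of `categorize_age` and runs it on sample labels. The label format (for example `'45-49 years'`) is an assumption inferred from the special cases the function handles, so the samples should be swapped for the actual values of `mel_data['age_group'].unique()` when used in the notebook.

```python
def categorize_age(age_group):
    # Same logic as the notebook's function, repeated so this block runs on its own
    if age_group == '00 years':
        return '<50'
    elif '90+' in age_group:
        return '70+'
    age = int(age_group.split('-')[0])  # lower bound of the range
    if age < 50:
        return '<50'
    elif age < 70:
        return '50-69'
    return '70+'

# Assumed label format; covers each branch and both bin edges
samples = ['00 years', '01-04 years', '45-49 years', '50-54 years',
           '65-69 years', '70-74 years', '90+ years']

result = {label: categorize_age(label) for label in samples}
for label, category in result.items():
    print(f'{label:>12} -> {category}')
```

Note that any label without a hyphen, other than the two special cases, would raise a `ValueError` at the `int(...)` call, which at least makes unexpected formats visible instead of binning them incorrectly.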

## Summary of Key Findings

- There are significant racial disparities in survival time and cause-specific melanoma deaths, with Black patients having the worst melanoma outcomes and White patients having the best outcomes.
- Worse survival outcomes strongly correlate with a later cancer stage at diagnosis. Black patients are diagnosed with more advanced distant melanoma at 3.7× the rate of White patients, suggesting that late stage diagnosis could be contributing to survival differences.
- Socioeconomic factors such as county median household income and rural-urban continuum do not fully explain the observed racial disparities in melanoma survival.
- Age at diagnosis also does not appear to explain racial disparities in melanoma survival.

**These findings suggest that multivariable survival analysis is needed to determine whether racial disparities persist after controlling for stage, socioeconomic factors, and age simultaneously.**
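As an intermediate step before a full model, a stratified table shows the basic idea of "controlling for" a factor: compare outcomes by race within each stage instead of across all stages pooled. The sketch below uses a tiny made-up DataFrame purely to illustrate the mechanics. The rows are not SEER data, and `died_melanoma` is a hypothetical 0/1 indicator that would need to be derived from `cause_death` in the notebook. It also ignores follow-up time and censoring, which the survival models in the third notebook are designed to handle.

```python
import pandas as pd

# Toy records, invented for illustration only (not SEER data)
toy = pd.DataFrame({
    'race_labels': ['Black', 'Black', 'Black', 'Black',
                    'White', 'White', 'White', 'White'],
    'stage': ['Localized', 'Localized', 'Distant', 'Distant',
              'Localized', 'Localized', 'Distant', 'Distant'],
    'died_melanoma': [0, 1, 1, 1, 0, 0, 1, 0],  # hypothetical indicator
})

# Proportion of melanoma deaths within each stage, split by race
stratified = (
    toy.groupby(['stage', 'race_labels'])['died_melanoma']
       .mean()
       .unstack('race_labels')
)
print(stratified)
```

If a gap between groups remains within each stage row, stage alone does not account for it; if the gap shrinks or disappears, stage is doing much of the explaining.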