# Melanoma Health Disparities Analysis
A personal project examining racial disparities in melanoma survival outcomes using SEER cancer registry data.

#### Research Question
*Are racial disparities in melanoma survival explained by a later stage at diagnosis and socioeconomic factors, or do disparities persist even after controlling for these factors?*

### Notebook Overview

This notebook explores patterns and relationships in the SEER dataset through variable distributions and crosstabs.

 **Input:** Cleaned data from `01_data_cleaning.ipynb`
 <br>**Output:** Plots, summary tables, statistics and `melanoma_data_final.csv` ready for survival analysis

**Derived variables for survival analysis:**
- *Stage:* Early, Advanced
- *Age:* <50, 50-69, 70+
- *Income:* Low, Mid, High
- *Metro Area:* 1 (Metro), 0 (Non-Metro)
- *Acral Melanoma:* 1 (Acral); 0 (Non-Acral)

#### Exploratory Analysis
**Step 1:** Load the cleaned dataset  
**Step 2:** Examine disparities in patient survival time and outcomes  
**Step 3:** Examine disparities in cancer stage at diagnosis  
**Step 4:** Examine socioeconomic disparities  
**Step 5:** Finalize variables for survival analysis  
**Step 6:** Create summary table of risk factors by race

### Dataset

**Source:** SEER Research Data, 17 Registries, Nov 2024 Sub (2000-2022)  
**Final sample:** 226,587 cutaneous melanoma cases across 14 variables

The data has been processed to include only:
- Microscopy-confirmed malignant cutaneous melanoma
- Known stage at diagnosis
- First primary tumors only
- Known survival time
- Known race

**Note:** Individual patient-level data cannot be shared publicly per SEER Research Data Agreement. 
<br>Instructions for requesting access and recreating this dataset can be found in the [data README](../data/README.md).

### Analysis Workflow

This is the second notebook in a three-part series:

1. **01_data_cleaning.ipynb** - Data cleaning and filtering
2. **02_exploratory_analysis.ipynb** *(this notebook)* - Exploratory data analysis and visualization
3. **03_survival_analysis.ipynb** - Kaplan-Meier curves and Cox regression models

### GitHub Repository

**GitHub:** https://github.com/kpannoni/melanoma-project

---

## Step 1: Load the cleaned dataset
Load the cleaned dataset that we filtered and processed in the first notebook `01_data_cleaning.ipynb`.

In [3]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Load the cleaned data
mel_data = pd.read_csv('../data/melanoma_data_clean.csv', header=0, low_memory=False)

# Quick verification of the dataset
print(f"Dataset loaded: {len(mel_data):,} cases")
print(f"Variables: {mel_data.shape[1]}")
print(f"\nColumn names:")
print(mel_data.columns.tolist())


Dataset loaded: 226,587 cases
Variables: 14

Column names:
['age_group', 'sex', 'race', 'year_diag', 'survival_months', 'stage', 'cause_death', 'vital_status', 'histology', 'primary_site', 'marital_status', 'median_income', 'rural_urban', 'race_labels']


## Step 2: Examine disparities in patient survival time and outcome
*Is there a disparity in survival time or melanoma-specific death rates by race?*
#### Distribution of survival time by race

In [5]:
# Show the summary statistics for survival time by race
survival_by_race = round(mel_data.groupby('race_labels')['survival_months'].describe()[['count','50%','std']],1)
survival_by_race = survival_by_race.rename(columns={'50%': 'median'}) # rename median column
survival_by_race[['count', 'median']] = survival_by_race[['count', 'median']].astype(int)

# Sort by median survival time ascending
survival_by_race = survival_by_race.sort_values(by='median')
# remove the index name for display purposes
survival_by_race.index.name = None

print("Median survival time by race: \n", survival_by_race, "\n")

# Create a boxplot to show the distribution of survival time by race

# Set up the plot aesthetics
sns.set_style("ticks")
custom_pal = ['#5790c4', '#e89c5e', '#6db388', '#c377a3', '#8c7fb8']

# Map each race label to a fixed color so the plots stay consistent across the notebook
race_colors = {
    'White': '#5790c4',
    'Black': '#8c7fb8',
    'Asian or Pacific Islander': '#6db388',
    'Hispanic': '#e89c5e',
    'American Indian/Alaska Native': '#c377a3'
}

# get the order to plot the data from the summary stats table (ordered by median)
order = survival_by_race.index.tolist()

# Create the box plot
plt.figure(figsize=(7, 2.5))
sns.boxplot(data=mel_data, x='survival_months', y='race_labels', hue='race_labels', order=order,
             palette=race_colors, gap=0.15, medianprops=dict(color='#333333', linewidth=1.5, solid_capstyle='butt'))

# Move the y-axis to the right for aesthetics
ax = plt.gca()
ax.yaxis.tick_right()
ax.yaxis.set_label_position("right")

# remove the left axis lines, keep the right
sns.despine(left=True, right=False)

# Force ticks to show on the right
ax.tick_params(axis='y', which='both', right=True, left=False, pad=5)
# Adjust the y axis margins
plt.margins(y=0.07)

# Title the plot and the axes
plt.xlabel("Survival Time (Months)", fontsize=10, labelpad=7, fontweight="bold")
plt.ylabel(None)
plt.title("Survival Time by Race", fontsize=12)

# Add median survival time as text labels
for i, race in enumerate(order):
    median_val = survival_by_race.loc[race, 'median']
    # Print the text label on the plot
    ax.text(median_val + 3, i, f'{median_val}', 
            va='center', ha='left', fontsize=8, fontweight="bold", 
            color='white')

# Save the boxplot as a PNG image
plt.savefig('../images/boxplot_survival_time_by_race.png', dpi=175, bbox_inches='tight')

plt.close()  # Don't display in output

Median survival time by race: 
                                 count  median   std
Black                            1026      76  75.2
Asian or Pacific Islander        1598      95  76.1
Hispanic                         8070     101  73.8
American Indian/Alaska Native     539     111  74.1
White                          215354     115  71.4 



<img src="../images/boxplot_survival_time_by_race.png" width="74%">

Black patients have a median survival time of 76 months compared to 115 months for White patients, a difference of **39 months**. This points to a substantial disparity in survival time by race. However, other variables such as stage at diagnosis and socioeconomic factors may contribute to this disparity, and we examine them in the following steps.

Notably, American Indian/Alaska Native patients have a median survival time similar to that of White patients (111 vs 115 months), while Hispanic and Asian or Pacific Islander patients fall between Black patients and these two groups.

#### Distribution of cause of death by race

In [8]:
# Group the data by race and cause of death.
cause_of_death = pd.crosstab(mel_data['race_labels'], mel_data['cause_death'], 
                          normalize='index') * 100
# Round the data
cause_of_death = cause_of_death.round(1)
# Rename the columns to simplify
cause_of_death.columns = ['Alive / Other', 'Melanoma', 'Unknown']

# Sort the table in the same order as above (by median survival time ascending)
cause_of_death = cause_of_death.reindex(order)

# For display, remove the index name
cause_of_death.index.name = None

print("Cause of death by race (%):")
cause_of_death


Cause of death by race (%):


Unnamed: 0,Alive / Other,Melanoma,Unknown
Black,66.8,32.5,0.8
Asian or Pacific Islander,74.3,23.4,2.3
Hispanic,81.3,17.1,1.7
American Indian/Alaska Native,83.9,15.0,1.1
White,87.9,11.7,0.4


Here we note a similar trend as seen for survival time by race, with Black patients dying of melanoma at nearly 3× the rate of White patients (32.5% vs 11.7%). <br>**These findings confirm that there is a significant racial disparity in survival for melanoma patients.**
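
The "nearly 3×" figure can be reproduced directly from the crosstab percentages. The following is a minimal sketch with the percentages copied by hand from the table above; the names `melanoma_death_pct` and `ratio_vs_white` are illustrative and do not appear elsewhere in the notebook.

```python
# Melanoma-specific death percentages copied from the crosstab above
melanoma_death_pct = {
    "Black": 32.5,
    "Asian or Pacific Islander": 23.4,
    "Hispanic": 17.1,
    "American Indian/Alaska Native": 15.0,
    "White": 11.7,
}

# Ratio of each group's percentage to the White reference group
reference = melanoma_death_pct["White"]
ratio_vs_white = {
    race: round(pct / reference, 1) for race, pct in melanoma_death_pct.items()
}

for race, ratio in ratio_vs_white.items():
    print(f"{race}: {ratio}x")
```

These are unadjusted ratios of crude percentages, so they do not account for follow-up time or any other covariate.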

#### Barplot of Melanoma-specific deaths by race

In [11]:
# Create abbreviated labels for the plot
short_labels = {
    'Black': 'Black',
    'Asian or Pacific Islander': 'API',
    'Hispanic': 'Hispanic',
    'American Indian/Alaska Native': 'AI/AN',
    'White': 'White' }

# Create the figure for the barplot
fig, ax = plt.subplots(figsize=(6.5, 5))

# Convert the melanoma percentages from the crosstab into a format we can plot
mel_rates = cause_of_death.loc[order, "Melanoma"].reset_index()
mel_rates.columns = ['Race', 'Percent']
mel_rates['Race_Short'] = mel_rates['Race'].map(short_labels) # map the short labels

# Create the barplot
ax = sns.barplot(data=mel_rates, x='Race_Short', y='Percent', hue='Race', 
                 palette=race_colors, legend=False, width=0.7, edgecolor='#333333')

# Set title and axes properties
ax.set_xlabel(None)
ax.set_ylabel('Melanoma Deaths (%)', fontsize=14)
ax.set_title('Melanoma-Specific Mortality Rate by Race', fontsize=16)
ax.set_ylim(0, max(mel_rates.Percent) * 1.15)
# Format the axis labels
plt.xticks(fontsize=12, fontweight='bold')
plt.yticks(fontsize=12)
ax.tick_params(axis='x', pad=4)

# Add percentage labels on bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.1f%%', padding=3, fontsize=12, fontweight='bold')

# remove upper and right axes lines
sns.despine()

# Save the barplot as a PNG image
plt.savefig('../images/barplot_melanoma_deaths_by_race.png', dpi=175, bbox_inches='tight')
plt.close()

<img src="../images/barplot_melanoma_deaths_by_race.png" width="45%">

## Step 3: Examine disparities in cancer stage at diagnosis
*Is there a disparity in the stage of cancer diagnosis by race? Are minorities diagnosed at later stages?*

#### Crosstab of cancer stage at diagnosis by race

In [15]:
# Group the data by race and stage at diagnosis.
stage_by_race = pd.crosstab(mel_data['race_labels'], mel_data['stage'], 
                          normalize='index') * 100
# Round the data
stage_by_race = stage_by_race.round(1)

# Sort the table in the same order (by median survival time ascending)
stage_by_race = stage_by_race.reindex(order)
# Also sort stage in order of increasingly worse prognosis
stage_by_race = stage_by_race[['Localized', 'Regional', 'Distant']] 

# For display, remove the index and column names
stage_by_race.index.name = None
stage_by_race.columns.name = None

print("\nCancer Stage at Diagnosis by Race (%):")
stage_by_race


Cancer Stage at Diagnosis by Race (%):


Unnamed: 0,Localized,Regional,Distant
Black,59.9,25.7,14.3
Asian or Pacific Islander,69.6,19.8,10.6
Hispanic,76.5,16.0,7.5
American Indian/Alaska Native,80.7,15.0,4.3
White,87.2,8.9,3.9


#### Bar plot of cancer stage at diagnosis by race
We'll create a stacked bar plot showing the proportion of patients at each cancer stage for each racial group.

In [17]:
# Define the bar colors for each stage
stage_colors = {'Localized': '#6db388',
                'Regional': '#e89c5e',
                'Distant': '#e15759', }    

# Apply the abbreviated race labels to the index before plotting
stage_by_race_plot = stage_by_race.copy()
stage_by_race_plot.index = stage_by_race_plot.index.map(short_labels)

# Create the bar plot
stage_by_race_plot.plot(kind='bar', stacked=True, figsize=(6, 4),
                   color=[stage_colors[col] for col in stage_by_race.columns], width=0.7)

# Format the plot title and labels
plt.ylabel('Percentage of Patients (%)', fontweight='bold', fontsize=10)
plt.xlabel('')  # Remove x-label
plt.title('Cancer Stage at Diagnosis by Race', fontsize=14, y=1.11)

# Get current axes
ax = plt.gca()

# Add percentage labels to each segment
for container in ax.containers:
    labels = [f'{v:.0f}%' if v > 3 else '' for v in container.datavalues]  # Only show label if >3%
    ax.bar_label(container, labels=labels, label_type='center', 
                 fontsize=10, color='white', fontweight='bold')

# Format the axes
plt.xticks(rotation=0, fontsize=11, fontweight="bold")
plt.yticks(fontsize=10)
ax.tick_params(axis='x', pad=3) 
plt.ylim(0, 100)

# Format the legend
plt.legend(ncol=3, loc='upper center', frameon=False, bbox_to_anchor=(0.5, 1.13), fontsize=11)
sns.despine() # remove upper and right axes lines

# Save the barplot as a PNG image
plt.savefig('../images/barplot_stage_by_race.png', dpi=150, bbox_inches='tight')

plt.close()  # Don't display in output


<img src="../images/barplot_stage_by_race.png" width="50%">

*API = Asian or Pacific Islander; AI/AN = American Indian/Alaska Native*

Overall, Black patients are 3.7× more likely to be diagnosed with distant melanoma (14.3% vs 3.9%), which has a worse prognosis, while White patients are predominantly diagnosed at the earlier localized stage (87.2%). **Differences in stage at diagnosis likely account for much of the disparity in survival time across racial groups.**

### Check sample sizes by race and cancer stage
With distant stage representing only 4.1% of cases, we need to verify adequate sample sizes across racial groups for reliable survival analysis. Small sample sizes with few events can produce unstable survival estimates, particularly for time-to-event modeling.

In [21]:
# Check sample sizes by race within distant stage
distant_stage = mel_data[mel_data['stage'] == 'Distant']

print("Sample size for distant melanoma by race:\n")

# Create a dataframe with the sample sizes
sample_sizes = distant_stage['race_labels'].value_counts().reindex(order).to_frame()
sample_sizes.columns = ['Count']
# Calculate the percentages
sample_sizes['Percent'] = (sample_sizes['Count'] / len(distant_stage) * 100).round(1)
sample_sizes.index.name = None # for display

print(sample_sizes)
print(f"\nTotal distant stage: {len(distant_stage):,}")

# Keep the index when saving so the race labels are written to the file
sample_sizes.to_csv('../data/distant_stage_sample_sizes.csv')


Sample size for distant melanoma by race:

                               Count  Percent
Black                            147      1.6
Asian or Pacific Islander        170      1.8
Hispanic                         604      6.5
American Indian/Alaska Native     23      0.2
White                           8418     89.9

Total distant stage: 9,362


American Indian/Alaska Native patients account for only **0.2%** of distant-stage cases (n=23), which is too sparse for stable survival modeling.

To address this limitation, regional and distant stages will be combined into a single "Advanced" category.
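
The 0.2% figure is a share of all distant-stage cases, which is different from the share of each group's own cases that were distant at diagnosis (4.3% for American Indian/Alaska Native patients in the stage crosstab). A minimal sketch of both calculations follows, using counts copied from the tables in this notebook; the column names are illustrative.

```python
import pandas as pd

# Distant-stage counts and total cases per race, copied from this notebook's tables
counts = pd.DataFrame(
    {
        "distant": [147, 170, 604, 23, 8418],
        "all_stages": [1026, 1598, 8070, 539, 215354],
    },
    index=[
        "Black",
        "Asian or Pacific Islander",
        "Hispanic",
        "American Indian/Alaska Native",
        "White",
    ],
)

# Share of all distant-stage cases contributed by each race
counts["pct_of_distant_cases"] = (
    counts["distant"] / counts["distant"].sum() * 100
).round(1)

# Share of each race's own cases that were distant at diagnosis
counts["pct_distant_within_race"] = (
    counts["distant"] / counts["all_stages"] * 100
).round(1)

print(counts)
```

The first percentage mostly reflects how large each group is in the dataset, while the second describes risk within a group, so the sparse-data concern rests on the raw count (n=23) rather than on either percentage.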

### Create two-stage categorization
We'll create a variable with the new stage categories and verify adequate sample sizes and events for survival analysis.

* **Early Stage:** Localized  
* **Advanced Stage:** Regional + Distant

In [24]:
# Define the stage groups
def stage_category(stage):
    if stage == 'Regional' or stage == 'Distant':
        return 'Advanced'
    else: # Localized stage
        return 'Early'

# Apply the stage category mapping
mel_data['stage_category'] = mel_data['stage'].apply(stage_category)

# Check the distribution of the new stage category variable
# Crosstab of stage_category counts by race
stage_cat_by_race = pd.crosstab(mel_data['race_labels'], mel_data['stage_category'])

# Sort the table in the same order (by median survival time ascending)
stage_cat_by_race = stage_cat_by_race.reindex(order)

# Set the order of the columns
stage_cat_by_race = stage_cat_by_race[['Early', 'Advanced']]

# For display, remove the index and column names
stage_cat_by_race.index.name = None
stage_cat_by_race.columns.name = None

print("\nAll cases by race and stage category:")
display(stage_cat_by_race)

# Check how many melanoma-specific deaths we have in our smallest category

# Calculate melanoma-specific deaths by race and stage_category
mel_death_table = mel_data.groupby(['race_labels', 'stage_category'])['cause_death'].apply(
    lambda x: (x == 'Dead (attributable to this cancer dx)').sum()
).unstack(fill_value=0)

# Sort the table in the same order (by median survival time ascending)
mel_death_table = mel_death_table.reindex(order)

# Set the order of the columns
mel_death_table = mel_death_table[['Early', 'Advanced']]

# For display, remove the index and column names
mel_death_table.index.name = None
mel_death_table.columns.name = None

print("\nMelanoma-specific deaths by race and stage category:")
display(mel_death_table)



All cases by race and stage category:


Unnamed: 0,Early,Advanced
Black,615,411
Asian or Pacific Islander,1112,486
Hispanic,6177,1893
American Indian/Alaska Native,435,104
White,187747,27607



Melanoma-specific deaths by race and stage category:


Unnamed: 0,Early,Advanced
Black,96,237
Asian or Pacific Islander,100,274
Hispanic,454,923
American Indian/Alaska Native,30,51
White,11770,13423


With our new cancer stage categories, there are a total of 104 American Indian / Alaska Native cases at Advanced stage (our smallest group). We have at least 30 melanoma-specific deaths in each of our groupings. This is a large enough sample size for a robust time-to-event analysis in the next notebook.
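
The "at least 30 deaths per grouping" check can be made explicit by locating the sparsest cell in the deaths table. This is a small sketch with the counts copied from the table above; the variable names are illustrative.

```python
import pandas as pd

# Melanoma-specific deaths by race and stage category, copied from the table above
events = pd.DataFrame(
    {
        "Early": [96, 100, 454, 30, 11770],
        "Advanced": [237, 274, 923, 51, 13423],
    },
    index=[
        "Black",
        "Asian or Pacific Islander",
        "Hispanic",
        "American Indian/Alaska Native",
        "White",
    ],
)

# Find the race with the smallest cell, then the stage category of that cell
sparsest_race = events.min(axis=1).idxmin()
sparsest_stage = events.loc[sparsest_race].idxmin()
smallest_count = int(events.loc[sparsest_race, sparsest_stage])

print(f"Smallest cell: {sparsest_race} / {sparsest_stage} (n={smallest_count})")
```

The sparsest cell is worth keeping in mind when interpreting stratified estimates in the next notebook, since it has far fewer events than any other grouping.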

### Proportion of Black patients diagnosed at advanced stage compared to White patients

In [27]:
# New crosstab to calculate percentages for each group
stage_by_race_perc = pd.crosstab(mel_data['race_labels'], mel_data['stage_category'], 
                                normalize='index') * 100

# Pull specific percents from the crosstab
black_advanced_perc = stage_by_race_perc.loc['Black', 'Advanced']
white_advanced_perc = stage_by_race_perc.loc['White', 'Advanced']

# Calculate ratio of Black:White patients
ratio = black_advanced_perc / white_advanced_perc

# Print out summary
print("Comparing the percentage of Black patients and White patients at advanced stage:\n")
print(f"Black patients: {black_advanced_perc:.1f}%")
print(f"White patients: {white_advanced_perc:.1f}%")
print(f"Ratio: {ratio:.1f}×")

Comparing the percentage of Black patients and White patients at advanced stage:

Black patients: 40.1%
White patients: 12.8%
Ratio: 3.1×


Black patients are diagnosed with advanced stage melanoma at **3.1×** the rate of White patients (40.1% vs 12.8%). 
<br>**This strongly suggests that later stage at diagnosis contributes to the observed racial disparities in survival outcomes.**

## Step 4: Examine socioeconomic disparities
County median household income serves as an area-based proxy for socioeconomic status.

**We'll categorize the county median household income into three groups:**

<img src="../images/median_income_groups.png" width="20%">


### Crosstab of household income group by race

In [31]:
# Create the income group mapping
def income_groups(income):
    if income in ['< $40,000', '$40,000 - $44,999', '$45,000 - $49,999', 
                    '$50,000 - $54,999', '$55,000 - $59,999']:
        return 'Low'
    elif income in ['$60,000 - $64,999', '$65,000 - $69,999', '$70,000 - $74,999',
                    '$75,000 - $79,999', '$80,000 - $84,999', '$85,000 - $89,999']:
        return 'Mid'
    else:  # $90,000+
        return 'High'

mel_data['income_tier'] = mel_data['median_income'].apply(income_groups)

# Crosstab of median household income by race
income_by_race = pd.crosstab(mel_data['race_labels'], mel_data['income_tier'], 
                          normalize='index') * 100
# Round the data
income_by_race = income_by_race.round(1)

# Sort the table in the same order (by median survival time ascending)
income_by_race = income_by_race.reindex(order)

# Set the order of the columns
income_by_race = income_by_race[['Low', 'Mid', 'High']]

# For display, remove the index and column names
income_by_race.index.name = None
income_by_race.columns.name = None

print("\nHousehold income group by race (%):")
income_by_race



Household income group by race (%):


Unnamed: 0,Low,Mid,High
Black,22.4,57.4,20.2
Asian or Pacific Islander,2.1,45.3,52.6
Hispanic,5.9,63.2,30.9
American Indian/Alaska Native,17.3,57.7,25.0
White,11.4,53.9,34.7


Black patients, who have the worst melanoma survival outcomes, live disproportionately in low-income counties compared to the other racial groups. Interestingly, Asian or Pacific Islander patients have the highest proportion living in high-income counties (52.6%), yet still show worse survival outcomes than White patients.

**This suggests that socioeconomic factors alone do not fully explain racial disparities in melanoma outcomes.**
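
One caveat about the income grouping used above: its final `else` branch assigns `'High'` to any value that is not in the Low or Mid lists, which would include missing values if any exist in `median_income`. The following is a hedged sketch of a variant that keeps missing values missing. The helper name `income_tier_or_missing`, the `'$100,000+'` label, and the toy data are assumptions made for illustration only.

```python
import pandas as pd

LOW = ['< $40,000', '$40,000 - $44,999', '$45,000 - $49,999',
       '$50,000 - $54,999', '$55,000 - $59,999']
MID = ['$60,000 - $64,999', '$65,000 - $69,999', '$70,000 - $74,999',
       '$75,000 - $79,999', '$80,000 - $84,999', '$85,000 - $89,999']

# Lookup table from income label to tier for the Low and Mid brackets
tier_lookup = {**dict.fromkeys(LOW, 'Low'), **dict.fromkeys(MID, 'Mid')}

def income_tier_or_missing(label):
    """Return 'Low', 'Mid' or 'High', but leave missing values missing."""
    if pd.isna(label):
        return None
    # Any non-missing label outside the Low/Mid lists is treated as $90,000+
    return tier_lookup.get(label, 'High')

# Toy input: '$100,000+' and the missing value are hypothetical examples
toy = pd.Series(['< $40,000', '$70,000 - $74,999', '$100,000+', None])
toy_tiers = toy.apply(income_tier_or_missing)
print(toy_tiers)
```

If the cleaned dataset has no missing income values, both versions produce the same tiers.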

### Crosstab of rural-urban continuum by race
Patients in large or medium-sized metro areas likely have easier access to health care than patients living in remote rural areas.

In [34]:
# Rural-urban continuum by race
rural_urban_by_race = pd.crosstab(mel_data['race_labels'], mel_data['rural_urban'], 
                          normalize='index') * 100
# Round the data
rural_urban_by_race = rural_urban_by_race.round(1)

# Sort the table in the same order (by median survival time ascending)
rural_urban_by_race = rural_urban_by_race.reindex(order)

# Simplify column names to be more informative
rural_urban_by_race.columns = [
    'Large Metro (1M+)',
    'Medium Metro (250K-1M)', 
    'Small Metro (<250K)',
    'Nonmetro Adjacent',
    'Nonmetro Remote'
]

# For display, remove the index and column names
rural_urban_by_race.index.name = None
rural_urban_by_race.columns.name = None

print("\nRural-Urban continuum by race (%):\n")
rural_urban_by_race


Rural-Urban continuum by race (%):



Unnamed: 0,Large Metro (1M+),Medium Metro (250K-1M),Small Metro (<250K),Nonmetro Adjacent,Nonmetro Remote
Black,59.7,19.1,9.4,8.7,3.1
Asian or Pacific Islander,63.0,25.7,4.0,1.6,5.8
Hispanic,68.9,21.4,5.6,2.3,1.9
American Indian/Alaska Native,41.0,22.1,16.5,10.4,10.0
White,58.6,20.6,8.5,7.1,5.3


Hispanic and Asian/Pacific Islander patients have the highest proportions living in large metropolitan areas (68.9% and 63.0%), which typically offer better access to specialized care. Despite this geographic advantage, these groups still show worse survival outcomes than White patients. Conversely, American Indian/Alaska Native patients are the most likely to live in remote rural areas (10.0%), yet have survival outcomes similar to those of White patients.

**This further suggests that factors beyond socioeconomic status and healthcare access contribute to melanoma outcome disparities.**

### Metro vs Non-Metro Areas
For survival analysis, rural-urban status will be simplified to a binary variable for Metropolitan vs Non-Metropolitan areas to capture a key distinction in healthcare access.

*1* = Metro area
<br>*0* = Non-Metro area

In [37]:
# Create the binary metro variable
mel_data['metro'] = mel_data['rural_urban'].str.contains('metropolitan areas', na=False).astype(int)

# Check the value counts for the new metro variable
print(mel_data['metro'].value_counts())

metro
1    199218
0     27369
Name: count, dtype: int64
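
The substring match works because only the metro categories contain the plural phrase "metropolitan areas". The sketch below illustrates this on toy data. The five label strings are an assumption about how the rural-urban categories are worded in the raw data (they are not printed in this notebook), so they should be checked against `mel_data['rural_urban'].unique()`.

```python
import pandas as pd

# Assumed wording of the rural-urban continuum labels (not shown in this notebook)
labels = pd.Series([
    "Counties in metropolitan areas ge 1 million pop",
    "Counties in metropolitan areas of 250,000 to 1 million pop",
    "Counties in metropolitan areas of lt 250 thousand pop",
    "Nonmetropolitan counties adjacent to a metropolitan area",
    "Nonmetropolitan counties not adjacent to a metropolitan area",
])

# Same expression as the notebook: 1 if the label mentions 'metropolitan areas'
metro_flag = labels.str.contains('metropolitan areas', na=False).astype(int)
print(metro_flag.tolist())

# Share of non-metro cases, from the value counts printed above
non_metro_pct = round(27369 / (199218 + 27369) * 100, 1)
print(f"Non-metro share: {non_metro_pct}%")
```

Under these assumed labels, the two nonmetro categories end in the singular "metropolitan area" and are therefore coded 0. Note that `na=False` also codes any missing value as 0 (non-metro).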


## Step 5: Finalize variables for survival analysis 
### Age at diagnosis by race
Because older melanoma patients likely have a shorter survival time than patients diagnosed younger, we should look at age at diagnosis as a potential confounding factor.

To do this, we will group patients into three clinically-relevant age groups:
* <50 years
* 50-69 years
* 70+ years

In [39]:
# Create age grouping function
def categorize_age(age_group):
    # Handle special cases
    if age_group == '00 years':
        return '<50'
    elif '90+' in age_group:
        return '70+'
    
    # Extract age from age ranges (lower bound)
    age = int(age_group.split('-')[0])
    
    if age < 50:
        return '<50'
    elif age < 70:
        return '50-69'
    else:
        return '70+'

# Apply the function to create the new age column
mel_data['age_category'] = mel_data['age_group'].apply(categorize_age)

# Create a crosstab of age at diagnosis by race
age_by_race = pd.crosstab(mel_data['race_labels'], mel_data['age_category'], 
                          normalize='index') * 100

# Round the data
age_by_race = age_by_race.round(1)

# Order columns properly
age_by_race = age_by_race[['<50', '50-69', '70+']]

# Order race by median survival time (same order as previous crosstabs)
age_by_race = age_by_race.reindex(order) 

# Remove the index and column names
age_by_race.index.name = None
age_by_race.columns.name = None

print("\nAge categories by race (%):")
age_by_race


Age categories by race (%):


Unnamed: 0,<50,50-69,70+
Black,27.9,41.5,30.6
Asian or Pacific Islander,34.6,39.0,26.3
Hispanic,41.5,38.3,20.2
American Indian/Alaska Native,35.6,44.3,20.0
White,28.2,44.4,27.4


Age distributions are broadly similar across racial groups, with Black and White patients showing similar proportions in the higher-risk age group (30.6% vs 27.4% aged 70+). Notably, Hispanic patients are diagnosed at younger ages (41.5% under 50), yet still experience worse survival outcomes than White patients.

**This suggests age does not explain the observed racial disparities in melanoma survival.**
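
Because the age categories depend on parsing label strings, it can be worth checking the cut points explicitly. The sketch below repeats the grouping logic so that it runs on its own and tests a few boundary labels. Apart from `'00 years'` and the `'90+'` marker, which appear in the code above, the label strings are assumed to follow an `'NN-NN years'` pattern.

```python
def categorize_age(age_group):
    """Same logic as the notebook function, repeated so this check is standalone."""
    if age_group == '00 years':
        return '<50'
    if '90+' in age_group:
        return '70+'
    lower = int(age_group.split('-')[0])  # lower bound of the age range
    if lower < 50:
        return '<50'
    return '50-69' if lower < 70 else '70+'

# Boundary labels (assumed format) mapped to the category we expect
boundary_cases = {
    '00 years': '<50',
    '45-49 years': '<50',
    '50-54 years': '50-69',
    '65-69 years': '50-69',
    '70-74 years': '70+',
    '90+ years': '70+',
}

# Collect any label whose computed category differs from the expected one
mismatches = {
    label: categorize_age(label)
    for label, expected in boundary_cases.items()
    if categorize_age(label) != expected
}
print("Mismatches:", mismatches)
```

An empty dictionary means every boundary label lands in the intended category.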

### Acral vs Non-Acral Melanoma by Race
Acral melanoma is melanoma of the palms, soles, or under the nails. This subtype is more common in Black, Asian, and Hispanic populations and has a worse prognosis than other melanoma subtypes. To examine whether it is a potential factor in survival disparities, we will create a binary variable for acral melanoma.

*1* = Acral melanoma
<br>*0* = Non-acral melanoma

In [42]:
# Create the binary acral variable
mel_data['acral'] = (mel_data['histology'] == '8744/3: Acral lentiginous melanoma, malignant').astype(int)

# Create a crosstab to check the distribution of acral melanoma by race
acral_by_race = pd.crosstab(mel_data['race_labels'], mel_data['acral'], 
                          normalize='index') * 100

# Round the data
acral_by_race = acral_by_race.round(1)

# Rename the columns to be more informative
acral_by_race.columns = ['Non-Acral', 'Acral']
acral_by_race = acral_by_race[['Acral', 'Non-Acral']] # Reorder columns

# Order race by median survival time (same order as previous crosstabs)
acral_by_race = acral_by_race.reindex(order) 

# Remove the index and column names
acral_by_race.index.name = None
acral_by_race.columns.name = None

print("\nAcral Melanoma by Race (%):")
acral_by_race


Acral Melanoma by Race (%):


Unnamed: 0,Acral,Non-Acral
Black,17.4,82.6
Asian or Pacific Islander,11.8,88.2
Hispanic,4.9,95.1
American Indian/Alaska Native,2.2,97.8
White,0.7,99.3


Black and Asian/Pacific Islander patients have dramatically higher rates of acral melanoma (17.4% and 11.8%) compared to White patients (0.7%). Acral melanoma is known to have worse prognosis and may contribute to the observed survival disparities.

## Step 6: Create a summary table of risk factors by race
**For each racial group, we will include:**
- Cases in the dataset
- Median survival time
- % age 70+ 
- % Advanced stage
- % Acral Melanoma
- % Low income
- % Non-Metro area

In [45]:
# Define a function to calculate the percent in a specific category
def perc_in_category(category):
    """A function that calculates % in specified category"""
    return lambda x: (x == category).mean() * 100

# Define another function to format percents: whole numbers at 5% or above, one decimal below 5%
# (not applied below; the summary table is rounded to whole numbers)
def smart_round(val):
    return f"{val:.0f}" if val >= 5 else f"{val:.1f}"

# Create summary table grouped by race
summary_by_race = mel_data.groupby('race_labels').agg({
    'race_labels': 'count',
    'survival_months': 'median',
    'age_category': perc_in_category('70+'),
    'stage_category': perc_in_category('Advanced'),
    'acral': lambda x: x.mean() * 100, # get percent of acral
    'income_tier': perc_in_category('Low'),
    'metro': lambda x: 100 - (x.mean() * 100) # get percent of non-metro
})

# Rename the columns
summary_by_race.columns = ['N', 'Median Survival (mo)', '% Age 70+', 
                            '% Advanced Stage', '% Acral Melanoma', '% Low Income', "% Non-Metro"]

# Round percents and convert columns to integer
summary_by_race = summary_by_race.round(0).astype(int)

# Order races by median survival time (asc)
summary_by_race = summary_by_race.reindex(order)

# Remove the index name
summary_by_race.index.name = None

# Save the table as a CSV file
summary_by_race.to_csv('../data/risk_summary_by_race.csv')

print("\nRisk Summary by Race:")
summary_by_race


Risk Summary by Race:


Unnamed: 0,N,Median Survival (mo),% Age 70+,% Advanced Stage,% Acral Melanoma,% Low Income,% Non-Metro
Black,1026,76,31,40,17,22,12
Asian or Pacific Islander,1598,95,26,30,12,2,7
Hispanic,8070,101,20,23,5,6,4
American Indian/Alaska Native,539,111,20,19,2,17,20
White,215354,115,27,13,1,11,12
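
Rounding every column to a whole number hides some detail in the small percentages. For example, the 0.7% acral rate for White patients appears as 1. As an illustration of what the `smart_round` helper defined above would show instead, here is a short sketch applied to the acral percentages copied from the earlier crosstab. The dictionary names are illustrative.

```python
def smart_round(val):
    """Whole percent at 5% or above, one decimal below 5% (same rule as the notebook)."""
    return f"{val:.0f}" if val >= 5 else f"{val:.1f}"

# Acral melanoma percentages copied from the acral crosstab
acral_pct = {
    "Black": 17.4,
    "Asian or Pacific Islander": 11.8,
    "Hispanic": 4.9,
    "American Indian/Alaska Native": 2.2,
    "White": 0.7,
}

# Format each percentage as a display string
acral_display = {race: smart_round(pct) for race, pct in acral_pct.items()}
print(acral_display)
```

The helper returns strings, so it suits a display copy of the table rather than the numeric table that is saved to CSV.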


#### Export the melanoma data with the newly derived variables.

In [47]:
# Export the final analysis-ready dataframe to a csv file
mel_data.to_csv('../data/melanoma_data_final.csv', index=False)

## Summary of Key Findings

- There are substantial racial disparities in survival time and melanoma-specific deaths, with Black patients having the worst outcomes and White patients the best.
- Worse survival outcomes track with a later cancer stage at diagnosis. Black patients are diagnosed with advanced-stage melanoma at 3.1× the rate of White patients (40.1% vs 12.8%), suggesting that late-stage diagnosis contributes to survival differences.
- Acral melanoma is about 25× more prevalent in Black patients than in White patients (17.4% vs 0.7%). This subtype is associated with a worse prognosis and may contribute to the observed survival disparities.
- Socioeconomic factors such as county median household income and rural-urban continuum do not fully explain the observed racial disparities in melanoma survival.
- Age at diagnosis also does not explain racial disparities in melanoma survival.

**These findings suggest the need for a multivariable survival analysis to determine whether racial disparities persist after controlling for these clinical and demographic factors.**
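
One preparatory step a time-to-event model typically needs is a 0/1 event indicator alongside the follow-up time. The sketch below shows one way to derive it from `cause_death` on toy rows. The `'Dead (attributable to this cancer dx)'` label is the one used earlier in this notebook, while the `'Alive or dead of other cause'` label and the column names `event` and `advanced` are assumptions. Whether `03_survival_analysis.ipynb` prepares the data this way is not shown here.

```python
import pandas as pd

# Label used earlier in this notebook for melanoma-specific deaths
MELANOMA_DEATH = 'Dead (attributable to this cancer dx)'

# Toy rows; the 'Alive or dead of other cause' label is an assumption
toy = pd.DataFrame({
    'survival_months': [12, 60, 115],
    'cause_death': [MELANOMA_DEATH, 'Alive or dead of other cause', MELANOMA_DEATH],
    'stage_category': ['Advanced', 'Early', 'Early'],
})

# 1 = melanoma-specific death observed, 0 = censored (alive or died of another cause)
toy['event'] = (toy['cause_death'] == MELANOMA_DEATH).astype(int)

# 0/1 indicator for advanced stage, with 'Early' as the reference level
toy['advanced'] = (toy['stage_category'] == 'Advanced').astype(int)

print(toy[['survival_months', 'event', 'advanced']])
```

With this coding, patients who are alive or died of other causes are treated as censored at their last follow-up time.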