# Melanoma Health Disparities Analysis

A personal project examining racial disparities in melanoma survival outcomes using SEER cancer registry data.

### Purpose
A survival analysis to identify factors that contribute to racial disparities in melanoma survival:
- Kaplan-Meier curves
- Cox regression analysis

**Input:** `melanoma_data_final.csv` from `02_exploratory_analysis.ipynb`
<br>**Output:** Kaplan-Meier curves, log-rank test statistics, and Cox regression results

### Dataset

**Source:** SEER Research Data, 17 Registries, Nov 2024 Sub (2000-2022)  
**Final sample:** 226,696 cutaneous melanoma cases across 13 variables

The data has been processed to include only:
- Microscopy-confirmed malignant cutaneous melanoma
- Known stage at diagnosis
- First primary tumors only
- Known survival time
- Known race

**Note:** Individual patient-level data cannot be shared publicly per SEER Research Data Agreement. 
<br>Instructions for requesting access and recreating this dataset can be found in the [data README](../data/README.md).

### Research Question

Are racial disparities in melanoma survival explained by a later stage at diagnosis and socioeconomic factors, or do disparities persist even after controlling for these factors?

### Analysis Workflow

This is the third notebook in a three-part series:

1. **01_data_cleaning.ipynb** - Data cleaning and filtering
2. **02_exploratory_analysis.ipynb** - Exploratory data analysis and visualization
3. **03_survival_analysis.ipynb** *(this notebook)* - Kaplan-Meier curves and Cox regression models

### GitHub Repository

**GitHub:** https://github.com/kpannoni/melanoma-project

---

## Step 1: Import the final melanoma data

In [145]:
# Import necessary packages
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import multivariate_logrank_test
from lifelines.statistics import logrank_test
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned data
mel_data = pd.read_csv('../data/melanoma_data_final.csv', header=0, low_memory=False)

# Set the order of racial groups to match previous analysis (ascending by median survival time)
order = ['Black', 'Asian or Pacific Islander', 'Hispanic', 
         'American Indian/Alaska Native', 'White']

# Quick verification of the dataset
print(f"Dataset loaded: {len(mel_data):,} cases")
print(f"Variables: {mel_data.shape[1]}")
print("\nColumn names:")
print(mel_data.columns.tolist())

print("\nNew groupings for median household income:")
print(mel_data["income_tier"].unique())
print("\nNew groupings for age at diagnosis:")   
print(mel_data["age_category"].unique())


Dataset loaded: 226,696 cases
Variables: 16

Column names:
['age_group', 'sex', 'race', 'year_diag', 'survival_months', 'stage', 'cause_death', 'vital_status', 'histology', 'primary_site', 'marital_status', 'median_income', 'rural_urban', 'race_labels', 'income_tier', 'age_category']

New groupings for median household income:
['High' 'Mid' 'Low' 'Unknown']

New groupings for age at diagnosis:
['50-69' '70+' '<50']


The dataset contains the original 13 variables plus three derived columns (16 in total): the simplified `race_labels` and the broader groupings `income_tier` and `age_category`.

## Step 2: Create an event variable for melanoma-specific death
**Event:**  died of melanoma (1)
<br>**Censored:**  alive or died of other cause (0)

In [149]:
# Create an event indicator for melanoma-specific death
# 1 = event occurred (died of melanoma)
# 0 = censored (alive, died of other cause, or unknown cause of death)
mel_data['event'] = (mel_data['cause_death'] == 'Dead (attributable to this cancer dx)').astype(int)

# Check the distribution of the event variable
print("\nEvent distribution:")
print(mel_data['event'].value_counts(), "\n")
print(mel_data['cause_death'].value_counts())

censored_events = mel_data['event'].value_counts()[0]

print(f"\nMelanoma-specific deaths: {mel_data['event'].sum():,} ({mel_data['event'].mean()*100:.1f}%)")
print(f"Censored cases: {censored_events:,} ({censored_events/len(mel_data['event'])*100:.1f}%)")


Event distribution:
event
0    199322
1     27374
Name: count, dtype: int64 

cause_death
Alive or dead of other cause             198338
Dead (attributable to this cancer dx)     27374
Dead (missing/unknown COD)                  984
Name: count, dtype: int64

Melanoma-specific deaths: 27,374 (12.1%)
Censored cases: 199,322 (87.9%)


In this dataset, the overall proportion of melanoma-specific deaths is **12.1%**.
<br>Fewer than 1% of cases (984) have an unknown cause of death; these are treated as censored in this survival analysis.

## Step 3: Kaplan-Meier Survival Curves
### Create K-M survival curves by racial group.

In [286]:
# Initialize the K-M fitter
kmf = KaplanMeierFitter()

# Create figure
fig, ax = plt.subplots(figsize=(8, 4))

# Define color palette (same as EDA)
colors = {
    'Black': '#5790c4',        
    'Asian or Pacific Islander': '#e89c5e',
    'Hispanic': '#6db388',        
    'American Indian/Alaska Native': '#c377a3', 
    'White': '#8c7fb8' 
}

# Fit and plot for each racial group
for race in order:
    data = mel_data[mel_data['race_labels'] == race]
    
    kmf.fit(durations=data['survival_months'], 
            event_observed=data['event'],
            label=race)
    
    # Plot explicitly on the axes created above rather than relying on the current axes
    kmf.plot_survival_function(ax=ax, color=colors.get(race), linewidth=2.5, ci_show=False)

# Format the plot and axes titles
plt.xlabel('Time (Months)', fontweight='bold', fontsize=11)
plt.ylabel('Survival Probability', fontweight='bold', fontsize=11)
plt.title('Kaplan-Meier Survival Curves by Race', fontsize=14)

# format the axes limits
plt.ylim(0, 1)
plt.xlim(0, mel_data['survival_months'].max())

# Further formatting of the plot
ax.legend(frameon=False, loc='lower left')
plt.grid(False)
sns.despine()

# Save the plot as a PNG image
plt.savefig('../images/km_curves_by_race.png', dpi=150, bbox_inches='tight')
plt.close()

<img src="../images/km_curves_by_race.png" width="60%">
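
For readers who want to see what `KaplanMeierFitter` is estimating, the product-limit calculation behind a Kaplan-Meier curve can be sketched in plain Python. This is an illustrative toy, not the lifelines implementation: `km_estimate` is a hypothetical helper, and the durations and events are made up rather than taken from SEER.

```python
def km_estimate(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    Hypothetical helper for illustration only.
    durations: follow-up time for each patient
    events:    1 if the event was observed, 0 if the patient was censored
    """
    # Distinct times at which at least one event (not censoring) occurred
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})

    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t
        at_risk = sum(1 for dur in durations if dur >= t)
        # Events observed exactly at time t
        deaths = sum(1 for dur, e in zip(durations, events) if dur == t and e == 1)
        # Product-limit update: multiply by the conditional survival at t
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Made-up toy data (months of follow-up and event indicators)
toy_durations = [5, 8, 8, 12, 15, 20]
toy_events = [1, 0, 1, 1, 0, 1]

toy_curve = km_estimate(toy_durations, toy_events)
for t, s in toy_curve:
    print(f"t={t:>2} months  S(t)={s:.3f}")
```

Censored patients never trigger a drop in the curve, but they do count in the at-risk denominator until their follow-up ends. That is why cases with an unknown cause of death still contribute information when coded as censored.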

#### Log Rank Statistical Tests
Use log rank statistical tests to compare differences in survival time across racial groups. 

We will start with an overall log rank test. If it's significant, we'll proceed by comparing each minority group to White patients.
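
As background, the two-group log-rank statistic compares the observed number of events in one group with the number expected if both groups shared the same survival curve. The sketch below is a plain-Python illustration under that standard formulation. `logrank_two_groups` is a hypothetical helper, the data are made up, and the analysis itself relies on the lifelines functions imported in Step 1.

```python
import math


def logrank_two_groups(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test (hypothetical helper for illustration).

    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    # Pool both groups, tagging each record with its group (0 = A, 1 = B)
    records = [(t, e, 0) for t, e in zip(durations_a, events_a)]
    records += [(t, e, 1) for t, e in zip(durations_b, events_b)]
    event_times = sorted({t for t, e, _ in records if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for dur, _, g in records if dur >= t and g == 0)
        at_risk_b = sum(1 for dur, _, g in records if dur >= t and g == 1)
        deaths_a = sum(1 for dur, e, g in records if dur == t and e == 1 and g == 0)
        deaths_b = sum(1 for dur, e, g in records if dur == t and e == 1 and g == 1)
        total_at_risk = at_risk_a + at_risk_b
        total_deaths = deaths_a + deaths_b

        observed_a += deaths_a
        # Expected events in group A if both groups had the same hazard
        expected_a += total_deaths * at_risk_a / total_at_risk
        # Hypergeometric variance term (undefined when only one patient is at risk)
        if total_at_risk > 1:
            variance += (total_deaths
                         * (at_risk_a / total_at_risk)
                         * (at_risk_b / total_at_risk)
                         * (total_at_risk - total_deaths)
                         / (total_at_risk - 1))

    statistic = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up toy groups: one with early events, one with late events
toy_early = [1, 2, 3, 4, 5]
toy_late = [10, 11, 12, 13, 14]
all_events = [1, 1, 1, 1, 1]

toy_statistic, toy_p = logrank_two_groups(toy_early, all_events, toy_late, all_events)
print(f"statistic={toy_statistic:.2f}, p-value={toy_p:.4f}")
```

The statistic grows with both the size of the gap between groups and the number of events, so a large value in a big registry dataset reflects sample size as well as effect size.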


In [185]:
# Test if there's any difference across racial groups with a multivariate log rank test
multi_var_result = multivariate_logrank_test(
    mel_data['survival_months'],
    mel_data['race_labels'],
    mel_data['event']
)

# Print the results of the multivariate test
print("Multivariate log-rank test (all groups):")
print(f"Test statistic: {multi_var_result.test_statistic:.2f}")
print(f"p-value: {multi_var_result.p_value:.2f}")
print(f"Degrees of freedom: {multi_var_result.degrees_of_freedom}")

# Determine if the test is significant
if multi_var_result.p_value < 0.05:
    print("\nThere is a significant difference between racial groups.\n")

    # Since the overall result is significant, proceed with pairwise comparisons
    white_data = mel_data[mel_data['race_labels'] == 'White']
    
    pairwise_results = []

    # Loop over every other race to compare to the White data
    for race in ['Black', 'Asian or Pacific Islander', 'Hispanic', 'American Indian/Alaska Native']:
        race_data = mel_data[mel_data['race_labels'] == race] # get the comparison race

        # run the log rank test
        result = logrank_test(
            durations_A=white_data['survival_months'],
            durations_B=race_data['survival_months'],
            event_observed_A=white_data['event'],
            event_observed_B=race_data['event']
        )
        
        pairwise_results.append({
            'Comparison': f'{race} vs White',
            'Statistic': f'{result.test_statistic:.2f}',
            'p-value': f'{result.p_value:.4f}',
            'Significant?': 'Yes' if result.p_value < 0.05 else 'No'
        })
    
    # Display the pairwise results as clean table
    pairwise_results_df = pd.DataFrame(pairwise_results)
    print("Pairwise log-rank tests:\n")
    print(pairwise_results_df.to_string(index=False))
    
else:
    print("\nThere is no significant difference between racial groups.")


Multivariate log-rank test (all groups):
Test statistic: 1005.62
p-value: 0.00
Degrees of freedom: 4

There is a significant difference between racial groups.

Pairwise log-rank tests:

                            Comparison Statistic p-value Significant?
                        Black vs White    531.59  0.0000          Yes
    Asian or Pacific Islander vs White    249.92  0.0000          Yes
                     Hispanic vs White    262.16  0.0000          Yes
American Indian/Alaska Native vs White      9.11  0.0025          Yes


The multivariate log-rank test confirms significant differences in melanoma survival across racial groups (overall p < 0.00001). In the pairwise tests, every minority group differs significantly from White patients (p < 0.003 for all comparisons), and the K-M curves show that each difference is in the direction of worse survival. Black patients have the largest test statistic (531.59, p < 0.00001).
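
Because four pairwise comparisons are run at the 0.05 level, it may be worth checking whether the conclusions hold under a multiple-comparison correction. The sketch below applies a Bonferroni adjustment to the p-values as displayed in the table above. Bonferroni is an assumption here, chosen because it is the simplest conservative option. It is not part of the original analysis.

```python
# p-values as displayed in the pairwise table above
# (rounded to 4 decimals, so 0.0000 means p < 0.00005)
displayed_p_values = {
    'Black vs White': 0.0000,
    'Asian or Pacific Islander vs White': 0.0000,
    'Hispanic vs White': 0.0000,
    'American Indian/Alaska Native vs White': 0.0025,
}

alpha = 0.05
# Bonferroni: divide the significance level by the number of comparisons
bonferroni_threshold = alpha / len(displayed_p_values)

still_significant = {
    comparison: p < bonferroni_threshold
    for comparison, p in displayed_p_values.items()
}

print(f"Bonferroni-adjusted threshold: {bonferroni_threshold:.4f}")
for comparison, passes in still_significant.items():
    print(f"{comparison}: {'Yes' if passes else 'No'}")
```

With four comparisons the adjusted threshold is 0.0125, and all four displayed p-values fall below it, so the pairwise conclusions would not change under this correction.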

### Stratify the K-M survival curves by cancer stage at diagnosis
For the same cancer stage (localized, regional, distant), are there differences in survival time by race?

In [283]:
# Create figure with a subplot for each stage
fig, axes = plt.subplots(1, 3, figsize=(16, 3), dpi=150, sharey=True)

# order cancer stage by best to worst prognosis
stages = ['Localized', 'Regional', 'Distant']

# loop over each stage to plot the K-M curve
for idx, stage in enumerate(stages):
    ax = axes[idx] # get axis for current plot
    # pull the data for the current stage
    stage_data = mel_data[mel_data['stage'] == stage]
    
    # Plot each race for the current stage
    for race in order:
        race_stage_data = stage_data[stage_data['race_labels'] == race]
        
        kmf = KaplanMeierFitter()
        kmf.fit(durations=race_stage_data['survival_months'],
                event_observed=race_stage_data['event'],
                label=race)
        
        kmf.plot_survival_function(ax=ax, color=colors.get(race),
                                   linewidth=2.5, ci_show=False, legend=False)
    
    # Formatting for each subplot
    ax.set_title(f'{stage} Stage', fontweight='bold', fontsize=12)
    ax.set_xlabel('Time (Months)', fontweight='bold')
    if idx == 0: # set Y-axis label for the first plot
        ax.set_ylabel('Survival Probability', fontweight='bold')
    ax.set_ylim(0, 1)
    
    # Add sample size to the plot
    n = len(stage_data)
    ax.text(5, 0.05, f'n={n:,}',
            ha='left', va='bottom', fontsize=10, style='italic')

# Set overall plot title
plt.suptitle("K-M Survival Curves by Race Stratified by Cancer Stage", fontsize=18, y=1.1)

# Get handles and labels for the legend
handles, labels = axes[-1].get_legend_handles_labels()

# Create the legend below the plot
fig.legend(handles, labels, 
          loc='lower center',
          ncol=5,
          bbox_to_anchor=(0.5, -0.25),
          fontsize = 12)

# despine the plot
sns.despine()

# Save the plot as a PNG image file
plt.savefig('../images/km_curves_strat_by_stage.png', dpi=150, bbox_inches='tight')
plt.close()

<img src="../images/km_curves_strat_by_stage.png" width="95%">

For the localized and regional stages, the K-M curves show the same pattern by race as the unstratified curves above, but the distant stage shows a very different pattern.

Relatively few patients were diagnosed with distant melanoma (n = 9,371 total), and some minority groups make up a small percentage of cases overall. Before interpreting the distant-stage plot, we need to check that each racial group has enough distant-stage cases.


In [331]:
# Check sample sizes by race within distant stage
distant_stage = mel_data[mel_data['stage'] == 'Distant']

print("Sample size for distant melanoma by race:\n")

# Create a dataframe with the sample sizes
sample_sizes = distant_stage['race_labels'].value_counts().reindex(order).to_frame()
sample_sizes.columns = ['Count']
# Calculate the percentages
sample_sizes['Percent'] = (sample_sizes['Count'] / len(distant_stage) * 100).round(1)
sample_sizes.index.name = None # for display

print(sample_sizes)
print(f"\nTotal distant stage: {len(distant_stage):,}")

Sample size for distant melanoma by race:

                               Count  Percent
Black                            147      1.6
Asian or Pacific Islander        170      1.8
Hispanic                         604      6.4
American Indian/Alaska Native     28      0.3
White                           8422     89.9

Total distant stage: 9,371


With only 28 American Indian/Alaska Native patients with distant melanoma in the dataset, survival estimates for this group are statistically unreliable, particularly at later timepoints when even fewer patients remain at risk. The different pattern seen for the distant stage is therefore likely a sampling artifact rather than a true difference.
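
To give a rough sense of how much precision the group sizes allow, the sketch below computes an approximate 95% confidence half-width for a survival proportion using the simple binomial standard error. This ignores censoring, so it understates the real uncertainty. The survival probability of 0.30 is an illustrative assumption, not an estimate from the data. Only the counts come from the table above.

```python
import math

# Distant-stage counts copied from the table above
distant_counts = {
    'Black': 147,
    'Asian or Pacific Islander': 170,
    'Hispanic': 604,
    'American Indian/Alaska Native': 28,
    'White': 8422,
}

# ASSUMPTION: illustrative survival probability, not estimated from the data
assumed_survival = 0.30

half_widths = {}
for race, n in distant_counts.items():
    # Binomial standard error of a proportion (no censoring adjustment)
    standard_error = math.sqrt(assumed_survival * (1 - assumed_survival) / n)
    half_widths[race] = 1.96 * standard_error
    print(f"{race:<30} n={n:>5,}  95% CI half-width ~ +/-{half_widths[race]:.3f}")
```

Under these assumptions the half-width is about 0.17 for n = 28 compared with about 0.01 for n = 8,422, so the smallest group's curve could plausibly sit almost anywhere relative to the others.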

#### Log rank statistical tests for differences in survival time by race stratified by stage

In [336]:
# Repeat log rank tests for each stage

### Key Findings From Kaplan-Meier Survival Curves
**Localized Stage:** 
- All races have relatively high survival rates at this stage (>80%)
- Black patients have the lowest survival at each timepoint

**Regional Stage:**
- Survival rates are overall lower than the localized stage
- Similar pattern to the localized stage, but disparities are even more prominent

**Distant Stage:**
- Survival rates drop below 30% for all races by 250 months
- Survival pattern is hard to interpret due to the small sample size of the American Indian/Alaska Native group