# Melanoma Health Disparities Analysis

A personal project examining racial disparities in melanoma survival outcomes using SEER cancer registry data.

### Purpose
A survival analysis to establish factors influencing racial disparities:
- Kaplan-Meier curves
- Log-rank statistical tests
- Cox regression analysis

 **Input:** `melanoma_data_final.csv` from `02_exploratory_analysis.ipynb`
 <br>**Output:** Kaplan-Meier curves, log-rank test statistics, and Cox regression results

### Dataset

**Source:** SEER Research Data, 17 Registries, Nov 2024 Sub (2000-2022)  
**Final sample:** 226,696 cutaneous melanoma cases across 13 variables

The data has been processed to include only:
- Microscopy-confirmed malignant cutaneous melanoma
- Known stage at diagnosis
- First primary tumors only
- Known survival time
- Known race

**Note:** Individual patient-level data cannot be shared publicly per SEER Research Data Agreement. 
<br>Instructions for requesting access and recreating this dataset can be found in the [data README](../data/README.md).

### Research Question

Are racial disparities in melanoma survival explained by a later stage at diagnosis and socioeconomic factors, or do disparities persist even after controlling for these factors?

### Analysis Workflow

This is the last notebook in a three-part series:

1. **01_data_cleaning.ipynb** - Data cleaning and filtering
2. **02_exploratory_analysis.ipynb** - Exploratory data analysis and visualization
3. **03_survival_analysis.ipynb** *(this notebook)* - Kaplan-Meier curves and Cox regression models

### GitHub Repository

**GitHub:** https://github.com/kpannoni/melanoma-project

---

## Step 1: Import the final melanoma data

In [282]:
# Import necessary packages
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import multivariate_logrank_test
from lifelines.statistics import logrank_test
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Load the final data ready for analysis
mel_data = pd.read_csv('../data/melanoma_data_final.csv', header=0, low_memory=False)

# Set the order of racial groups to match previous analysis (ascending by median survival time)
order = ['Black', 'Asian or Pacific Islander', 'Hispanic', 
         'American Indian/Alaska Native', 'White']

# Quick verification of the dataset
print(f"Dataset loaded: {len(mel_data):,} cases")
print(f"Variables: {mel_data.shape[1]}")
print(f"\nColumn names:")
print(mel_data.columns.tolist())

print("\nNew groupings for median household income:")
print(mel_data["income_tier"].unique())
print("\nNew groupings for age at diagnosis:")   
print(mel_data["age_category"].unique())


Dataset loaded: 226,696 cases
Variables: 16

Column names:
['age_group', 'sex', 'race', 'year_diag', 'survival_months', 'stage', 'cause_death', 'vital_status', 'histology', 'primary_site', 'marital_status', 'median_income', 'rural_urban', 'race_labels', 'income_tier', 'age_category']

New groupings for median household income:
['High' 'Mid' 'Low' 'Unknown']

New groupings for age at diagnosis:
['50-69' '70+' '<50']


Note that we have our original 13 variables, with the addition of the simplified "race_labels" and broader groupings for "income_tier" and "age_category".

## Step 2: Create an event variable for melanoma-specific death
**Event:**  died of melanoma (1)
<br>**Censored:**  alive or died of other cause (0)

In [286]:
# Create an event indicator for melanoma-specific death
# 1 = event occurred (died of melanoma)
# 0 = censored (alive or died of other cause)
mel_data['event'] = (mel_data['cause_death'] == 'Dead (attributable to this cancer dx)').astype(int)

# Check the distribution of the event variable
print("\nEvent distribution:")
print(mel_data['event'].value_counts(), "\n")
print(mel_data['cause_death'].value_counts())

# Get the number of events and censored cases
mel_events = mel_data['event'].sum()
mel_perc = mel_data['event'].mean()*100

censored_events = (mel_data['event'] == 0).sum()
censored_perc = censored_events/len(mel_data)*100

# Look up the censored subgroups by label rather than by position in the sorted counts
cod_counts = mel_data['cause_death'].value_counts()
censored_alive_other = cod_counts.loc['Alive or dead of other cause']
censored_unknown = cod_counts.loc['Dead (missing/unknown COD)']

print(f"\nMelanoma-specific deaths: {mel_events:,} ({mel_perc:.1f}%)")
print(f"Censored cases: {censored_events:,} ({censored_perc:.1f}%)")


Event distribution:
event
0    199322
1     27374
Name: count, dtype: int64 

cause_death
Alive or dead of other cause             198338
Dead (attributable to this cancer dx)     27374
Dead (missing/unknown COD)                  984
Name: count, dtype: int64

Melanoma-specific deaths: 27,374 (12.1%)
Censored cases: 199,322 (87.9%)


In this dataset, melanoma has an overall cause-specific mortality of **12.1%**. 
<br>Fewer than 1% of cases have an unknown cause of death; these are treated as censored in this survival analysis.
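
As a rough sensitivity check on that censoring choice, the counts printed above bound the cause-specific mortality. The lower bound treats every unknown-cause death as censored (the choice made in this notebook), and the upper bound treats every one as a melanoma death. This is only arithmetic on the reported counts, not a re-analysis of the data.

```python
# Counts taken from the cell output above
total_cases = 226_696
melanoma_deaths = 27_374
unknown_cod = 984  # "Dead (missing/unknown COD)"

# Lower bound: unknown-cause deaths are censored (the choice made in this notebook)
lower = melanoma_deaths / total_cases * 100

# Upper bound: every unknown-cause death is counted as a melanoma death
upper = (melanoma_deaths + unknown_cod) / total_cases * 100

print(f"Cause-specific mortality: {lower:.1f}% to {upper:.1f}%")
```

The two bounds differ by well under one percentage point, which suggests the handling of unknown causes of death is unlikely to change the overall picture.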

### Create a summary table of the cohort

In [290]:
# We need to define some variables first
original_n = 234818
final_n = len(mel_data)

# get the N for each cancer stage
localized_n = mel_data['stage'].value_counts().loc["Localized"]
regional_n = mel_data['stage'].value_counts().loc["Regional"]
distant_n = mel_data['stage'].value_counts().loc["Distant"]

# Now we build the table
cohort_summary = []

# add each row with the N cases to the cohort summary
cohort_summary.append({"Description": "Original SEER cases", "Cases": f"{original_n:,}", "%": "100%"})
cohort_summary.append({"Description": "Final Dataset Analyzed", "Cases": f"{final_n:,}", "%": f"{final_n/original_n*100:.1f}%"})
cohort_summary.append({"Description": "", "Cases": "", "%": ""}) # spacer
# Events and Censored N
cohort_summary.append({"Description": "Melanoma Deaths", "Cases": f"{mel_events:,}", "%": f"{mel_perc:.1f}%"})
cohort_summary.append({"Description": "Censored (total)", "Cases": f"{censored_events:,}", "%": f"{censored_perc:.1f}%"})
cohort_summary.append({"Description": "* Censored (Alive or Other)", "Cases": f"{censored_alive_other:,}", "%": f"{censored_alive_other/final_n*100:.1f}%"})
cohort_summary.append({"Description": "* Censored (Unknown)", "Cases": f"{censored_unknown:,}", "%": f"{censored_unknown/final_n*100:.1f}%"})
cohort_summary.append({"Description": "", "Cases": "", "%": ""}) # spacer
# Cancer Stage N
cohort_summary.append({"Description": "Localized Cancer Stage", "Cases": f"{localized_n:,}", "%": f"{localized_n/final_n*100:.1f}%"})
cohort_summary.append({"Description": "Regional Cancer Stage", "Cases": f"{regional_n:,}", "%": f"{regional_n/final_n*100:.1f}%"})
cohort_summary.append({"Description": "Distant Cancer Stage", "Cases": f"{distant_n:,}", "%": f"{distant_n/final_n*100:.1f}%"})

# Make the cohort summary into a clean dataframe
cohort_summary_df = pd.DataFrame(cohort_summary)

# Save the cohort summary as a CSV file
cohort_summary_df.to_csv('../data/cohort_summary.csv', index=False)

cohort_summary_df.style.hide(axis='index')


Description,Cases,%
Original SEER cases,234818.0,100%
Final Dataset Analyzed,226696.0,96.5%
,,
Melanoma Deaths,27374.0,12.1%
Censored (total),199322.0,87.9%
* Censored (Alive or Other),198338.0,87.5%
* Censored (Unknown),984.0,0.4%
,,
Localized Cancer Stage,196180.0,86.5%
Regional Cancer Stage,21145.0,9.3%
Distant Cancer Stage,9371.0,4.1%


## Step 3: Kaplan-Meier Survival Curves
### Create K-M survival curves by racial group.

In [293]:
# Initialize the K-M fitter
kmf = KaplanMeierFitter()

# Create figure
fig, ax = plt.subplots(figsize=(8, 4))

# Define color palette (same as EDA)
colors = {
    'Black': '#8c7fb8',  # purple
    'Asian or Pacific Islander': '#6db388', # green
    'Hispanic': '#e89c5e', # orange
    'American Indian/Alaska Native': '#c377a3', # pink
    'White': '#5790c4' # blue
}

# Fit and plot for each racial group
for race in order:
    data = mel_data[mel_data['race_labels'] == race]
    
    kmf.fit(durations=data['survival_months'], 
            event_observed=data['event'],
            label=race)
    
    kmf.plot_survival_function(color=colors.get(race), linewidth=2.5, ci_show=False)

# Format the plot and axes titles
plt.xlabel('Time (Months)', fontweight='bold', fontsize=11)
plt.ylabel('Survival Probability', fontweight='bold', fontsize=11)
plt.title('Kaplan-Meier Survival Curves by Race', fontsize=14)

# format the axes limits
plt.ylim(0, 1)
plt.xlim(0, mel_data['survival_months'].max())

# Further formatting of the plot
ax.legend(frameon=False, loc='lower left')
plt.grid(False)
sns.despine()

# Save the plot as a PNG image
plt.savefig('../images/km_curves_by_race.png', dpi=150, bbox_inches='tight')
plt.close()

<img src="../images/km_curves_by_race.png" width="60%">
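
To make the estimator behind these curves concrete, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit calculation. The function `kaplan_meier` and the toy data are hypothetical and exist only for illustration; the notebook itself relies on `lifelines.KaplanMeierFitter`. The sketch assumes the usual convention that deaths at a tied time are counted before censored cases at that time.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        # Deaths at time t (event flag 1) and everyone leaving the risk set at t
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        leaving = sum(1 for d in durations if d == t)
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Censored cases shrink the risk set without changing the survival estimate
        at_risk -= leaving
    return curve

# Hypothetical toy data: months of follow-up and event flag
# (1 = melanoma death, 0 = censored)
toy_months = [6, 12, 12, 20, 30, 30]
toy_events = [1, 0, 1, 1, 0, 0]

km_curve = kaplan_meier(toy_months, toy_events)
for month, s in km_curve:
    print(f"month {month}: S(t) = {s:.3f}")
```

Censored patients contribute to the denominator for as long as they are followed, which is why the curve only steps down at months with an observed melanoma death.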

#### Log-Rank Statistical Tests
Run log-rank tests to compare survival across racial groups. 

We'll start with a multivariate log-rank test to check for an overall difference. If that test is significant, we'll follow up with pairwise tests comparing each minority group to White patients.
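
For intuition about what each pairwise test computes, the sketch below implements the textbook two-group log-rank statistic in pure Python: at every event time it compares the observed deaths in one group with the number expected if both groups shared the same hazard. The function `logrank_two_groups` and the toy data are hypothetical; the analysis itself uses `lifelines.statistics.logrank_test`, and this sketch is not claimed to reproduce its output digit for digit.

```python
import math

def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value) on 1 df."""
    # Tag each subject with its group: 0 = group A, 1 = group B
    subjects = [(t, e, 0) for t, e in zip(times_a, events_a)]
    subjects += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in subjects if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n = sum(1 for s in subjects if s[0] >= t)                  # at risk overall
        n_a = sum(1 for s in subjects if s[0] >= t and s[2] == 0)  # at risk in group A
        d = sum(1 for s in subjects if s[0] == t and s[1] == 1)    # events at time t
        d_a = sum(1 for s in subjects
                  if s[0] == t and s[1] == 1 and s[2] == 0)        # events in group A
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group A event count at time t
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail probability for 1 df
    return chi2, p_value

# Hypothetical toy data: every subject in group A dies before anyone in group B
chi2, p_value = logrank_two_groups([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

Note that the statistic measures how strongly the groups differ, not which group fares better; the direction of a disparity has to be read from the survival curves.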


In [306]:
# Test if there's any difference across racial groups with a multivariate log rank test
multivar_result = multivariate_logrank_test(
    mel_data['survival_months'],
    mel_data['race_labels'],
    mel_data['event']
)

# Report the results of the multivariate test
print("Multivariate log-rank test (all groups):")

# Use the same threshold (p < 0.05) as the significance check below
multivar_sig = "Yes" if multivar_result.p_value < 0.05 else "No"

multivar_result_df = pd.DataFrame({"Test": ["multivariate log-rank"], "Statistic": [f"{multivar_result.test_statistic:.2f}"], 
                                   "p-value": [f"{multivar_result.p_value:.4f}"], "Sig": [multivar_sig], "df": [multivar_result.degrees_of_freedom]})

# Save the results of the multivariate test
multivar_result_df.to_csv('../data/logrank_overall_results.csv', index=False)

display(multivar_result_df.style.hide(axis="index"))

# Determine if the test is significant
if multivar_result.p_value < 0.05:
    print("\nThere is a significant difference between racial groups.\n")

    # Since the overall result is significant, proceed with pairwise comparisons
    white_data = mel_data[mel_data['race_labels'] == 'White']
    
    pairwise_results = []

    # Loop over every other race to compare to the White data
    for race in ['Black', 'Asian or Pacific Islander', 'Hispanic', 'American Indian/Alaska Native']:
        race_data = mel_data[mel_data['race_labels'] == race] # get the comparison race

        # run the log rank test
        result = logrank_test(
            durations_A=white_data['survival_months'],
            durations_B=race_data['survival_months'],
            event_observed_A=white_data['event'],
            event_observed_B=race_data['event']
        )
        
        pairwise_results.append({
            'Comparison': f'{race} vs White',
            'Statistic': f'{result.test_statistic:.2f}',
            'p-value': f'{result.p_value:.4f}',
            'Significant?': 'Yes' if result.p_value < 0.05 else 'No'
        })
    
    # Display the pairwise results as clean table
    pairwise_results_df = pd.DataFrame(pairwise_results)
    print("Pairwise log-rank tests:")

    # Save the pairwise comparisons as a CSV file
    pairwise_results_df.to_csv('../data/logrank_pairwise_results.csv', index=False)

    display(pairwise_results_df.style.hide(axis="index"))
    
else:
    print("\nThere is no significant difference between racial groups.")



Multivariate log-rank test (all groups):


Test,Statistic,p-value,Sig,df
multivariate log-rank,1005.62,0.0,Yes,4



There is a significant difference between racial groups.

Pairwise log-rank tests:


Comparison,Statistic,p-value,Significant?
Black vs White,531.59,0.0,Yes
Asian or Pacific Islander vs White,249.92,0.0,Yes
Hispanic vs White,262.16,0.0,Yes
American Indian/Alaska Native vs White,9.11,0.0025,Yes


The multivariate log-rank test confirms significant differences in melanoma survival across racial groups (overall p < 0.00001). All minority groups have significantly worse survival than White patients (p < 0.003 for all comparisons), with Black patients showing the largest disparity (test statistic = 531.59, p < 0.00001).
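
Several p-values in the tables above appear as 0.0 because they are rounded to four decimal places before being displayed. A small formatting helper along these lines would report such values as a bound instead. The name `format_p` and its `decimals` parameter are hypothetical and not part of the notebook.

```python
def format_p(p_value, decimals=4):
    """Format a p-value for a results table, reporting tiny values as '<0.0001'."""
    floor = 10 ** -decimals
    if p_value < floor:
        # Too small to show at this precision: report an upper bound instead of 0.0000
        return f"<{floor:.{decimals}f}"
    return f"{p_value:.{decimals}f}"

# Illustrative values: one extremely small, two that print normally
for p in (1e-20, 0.0025, 0.0515):
    print(format_p(p))
```

Used in place of `f'{result.p_value:.4f}'`, this would keep the saved CSV files consistent with the "p < ..." wording in the interpretation text.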

### Stratify the K-M survival curves by cancer stage at diagnosis
For the same cancer stage (localized, regional, distant), are there differences in survival time by race?

In [44]:
# Create figure with a subplot for each stage
fig, axes = plt.subplots(1, 3, figsize=(16, 3), dpi=150, sharey=True)

# order cancer stage by best to worst prognosis
stages = ['Localized', 'Regional', 'Distant']

# loop over each stage to plot the K-M curve
for idx, stage in enumerate(stages):
    ax = axes[idx] # get axis for current plot
    # pull the data for the current stage
    stage_data = mel_data[mel_data['stage'] == stage]
    
    # Plot each race for the current stage
    for race in order:
        race_stage_data = stage_data[stage_data['race_labels'] == race]
        
        kmf = KaplanMeierFitter()
        kmf.fit(durations=race_stage_data['survival_months'],
                event_observed=race_stage_data['event'],
                label=race)
        
        kmf.plot_survival_function(ax=ax, color=colors.get(race),
                                   linewidth=2.5, ci_show=False, legend=False)
    
    # Formatting for each subplot
    ax.set_title(f'{stage} Stage', fontweight='bold', fontsize=12)
    ax.set_xlabel('Time (Months)', fontweight='bold')
    if idx == 0: # set Y-axis label for the first plot
        ax.set_ylabel('Survival Probability', fontweight='bold')
    ax.set_ylim(0, 1)
    
    # Add sample size to the plot
    n = len(stage_data)
    ax.text(5, 0.05, f'n={n:,}',
            ha='left', va='bottom', fontsize=10, style='italic')

# Set overall plot title
plt.suptitle("K-M Survival Curves by Race Stratified by Cancer Stage", fontsize=18, y=1.1)

# Get handles and labels for the legend
handles, labels = axes[-1].get_legend_handles_labels()

# Create the legend below the plot
fig.legend(handles, labels, 
          loc='lower center',
          ncol=5,
          bbox_to_anchor=(0.5, -0.25),
          fontsize = 12)

# despine the plot
sns.despine()

# Save the plot as a PNG image file
plt.savefig('../images/km_curves_strat_by_stage.png', dpi=150, bbox_inches='tight')
plt.close()

<img src="../images/km_curves_strat_by_stage.png" width="90%">

The K-M survival curves for localized and regional cancer stages show a pattern similar to the one we saw before, with White patients having the best survival outcome and Black patients having the worst. However, the distant stage has a distinct pattern that appears inverted. 

Given the smaller number of patients diagnosed with distant melanoma (n = 9,371 total) and the fact that minority groups make up a small percentage of cases overall, we need to check that each racial group has enough distant-stage cases for the plot to be interpretable.


#### Check minority sample sizes at each cancer stage before we can interpret the curves.

In [308]:
# Check sample sizes by race within distant stage
distant_stage = mel_data[mel_data['stage'] == 'Distant']

print("Sample size for distant melanoma by race:\n")

# Create a dataframe with the sample sizes
sample_sizes = distant_stage['race_labels'].value_counts().reindex(order).to_frame()
sample_sizes.columns = ['Count']
# Calculate the percentages
sample_sizes['Percent'] = (sample_sizes['Count'] / len(distant_stage) * 100).round(1)
sample_sizes.index.name = None # for display

print(sample_sizes)
print(f"\nTotal distant stage: {len(distant_stage):,}")

# Keep the index when saving so the race labels are written to the CSV file
sample_sizes.to_csv('../data/distant_stage_sample_sizes.csv', index_label='race')

Sample size for distant melanoma by race:

                               Count  Percent
Black                            147      1.6
Asian or Pacific Islander        170      1.8
Hispanic                         604      6.4
American Indian/Alaska Native     28      0.3
White                           8422     89.9

Total distant stage: 9,371


With only 28 American Indian/Alaska Native patients with distant melanoma in the dataset, survival estimates for this group are statistically unreliable, particularly at later timepoints. Thus, the inverted survival pattern seen for distant stage is likely a sampling artifact.
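
To give a sense of scale for that unreliability, the sketch below compares the approximate 95% confidence interval half-width for a survival proportion at two of the distant-stage sample sizes from the table above. It makes two loud simplifications: there is no censoring (so the estimate reduces to a binomial proportion), and the survival value of 0.30 is purely illustrative rather than an estimate from the data. The helper `ci_half_width` is hypothetical.

```python
import math

def ci_half_width(survival, n, z=1.96):
    """Approximate 95% CI half-width for a survival proportion, ignoring censoring."""
    return z * math.sqrt(survival * (1 - survival) / n)

assumed_survival = 0.30  # illustrative value, not an estimate from the data

# Sample sizes taken from the distant-stage table above
for group, n in [("American Indian/Alaska Native", 28), ("White", 8422)]:
    half_width = ci_half_width(assumed_survival, n)
    print(f"{group} (n={n:,}): +/- {half_width:.3f}")
```

Under these assumptions the interval for the smallest group is roughly 17 times wider than for White patients, and real intervals would widen further at later timepoints as censoring shrinks the number at risk.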

#### Log-rank statistical tests for differences in survival by race stratified by stage
We will repeat the multivariate log-rank test for each cancer stage to determine if survival differs significantly across racial groups. 

Due to the low sample sizes for minority groups at distant stage, we will not perform pairwise comparisons for that stage, as the results would be unreliable. We will also not perform pairwise comparisons for localized stage, where curves show minimal separation and the large sample size may produce statistically significant results that lack clinical or practical significance.

In [314]:
# Repeat the multivariate log rank test for each cancer stage
# Empty list for the multivariate results
stage_stat_results = []

# Stages are already defined in a list, so we can loop over each stage
for stage in stages:
    # Grab the data for the current stage
    stage_data = mel_data[mel_data['stage'] == stage]
    
    # Run multivariate test
    multi_result = multivariate_logrank_test(
        stage_data['survival_months'],
        stage_data['race_labels'],
        stage_data['event']
    )

    # Append the results to the table
    stage_stat_results.append({
        'Stage': f'{stage}',
        'Statistic': f'{multi_result.test_statistic:.2f}',
        'p-value': f'{multi_result.p_value:.4f}',
        'Significant?': 'Yes' if multi_result.p_value < 0.05 else 'No'
        })
    
    # If significant, do pairwise comparisons for Regional stage only
    # (check this stage's multivariate result, not a leftover pairwise result)
    if stage == 'Regional' and multi_result.p_value < 0.05:
        # Get the data for White patients at the regional stage
        white_data = stage_data[stage_data['race_labels'] == 'White']

        # to store the pairwise results
        pairwise_results = []
    
        # Loop over every other race to compare to the White data
        for race in ['Black', 'Asian or Pacific Islander', 'Hispanic', 'American Indian/Alaska Native']:
            race_data = stage_data[stage_data['race_labels'] == race] # get the comparison race
    
            # run the log rank test
            result = logrank_test(
                durations_A=white_data['survival_months'],
                durations_B=race_data['survival_months'],
                event_observed_A=white_data['event'],
                event_observed_B=race_data['event']
            )

            # Append the results to the list
            pairwise_results.append({
                'Comparison': f'{race} vs White',
                'Statistic': f'{result.test_statistic:.2f}',
                'p-value': f'{result.p_value:.4f}',
                'Significant?': 'Yes' if result.p_value < 0.05 else 'No'
            })

# Display the multivariate results as clean table
stage_stats_df = pd.DataFrame(stage_stat_results)
print("Multivariate log-rank tests by cancer stage:\n")
display(stage_stats_df.style.hide(axis="index"))

# If regional stage is significant, print and save the pairwise results
regional_sig = stage_stats_df.loc[stage_stats_df["Stage"] == "Regional", "Significant?"].iloc[0]
if regional_sig == "Yes":
    print("\nThe multivariate test was significant at the regional stage. \nPairwise log-rank comparisons across race have been done for this stage.\n")
    # Save the regional stage pairwise results as a dataframe
    pairwise_results_df = pd.DataFrame(pairwise_results)
    print("Regional stage pairwise log-rank results:\n")
    display(pairwise_results_df.style.hide(axis="index"))

    # Save the pairwise test results for regional stage as a CSV file
    pairwise_results_df.to_csv('../data/logrank_by_stage_pairwise_regional.csv', index=False)
else:
    print("The multivariate test was not significant for the regional stage, thus pairwise log-rank comparisons were not done.")

# Save the multivariate test results as a CSV file
stage_stats_df.to_csv('../data/logrank_by_stage_results.csv', index=False)


Multivariate log-rank tests by cancer stage:



Stage,Statistic,p-value,Significant?
Localized,133.07,0.0,Yes
Regional,50.82,0.0,Yes
Distant,12.92,0.0117,Yes



The multivariate test was significant at the regional stage. 
Pairwise log-rank comparisons across race have been done for this stage.

Regional stage pairwise log-rank results:



Comparison,Statistic,p-value,Significant?
Black vs White,31.96,0.0,Yes
Asian or Pacific Islander vs White,9.23,0.0024,Yes
Hispanic vs White,8.62,0.0033,Yes
American Indian/Alaska Native vs White,3.79,0.0515,No


**Multivariate log-rank tests show significant differences across all stages.**

At regional stage, where sample sizes are adequate, pairwise comparisons reveal that Black, Asian/Pacific Islander, and Hispanic patients have significantly worse survival than White patients (all p < 0.004). However, American Indian/Alaska Native patients show no significant difference from White patients at regional stage (p = 0.0515), consistent with their similar median survival times seen in the exploratory analysis.
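
Because four pairwise tests are run against the same White reference group, one optional robustness check is a Bonferroni adjustment, which divides the significance threshold by the number of comparisons. This adjustment is not part of the original analysis; the sketch below only applies it to the p-values displayed above, where a displayed 0.0 means the value rounded to below 0.00005.

```python
# Regional-stage pairwise p-values as displayed above
# (0.0 means the value rounded to below 0.00005)
regional_p_values = {
    "Black vs White": 0.0,
    "Asian or Pacific Islander vs White": 0.0024,
    "Hispanic vs White": 0.0033,
    "American Indian/Alaska Native vs White": 0.0515,
}

alpha = 0.05
# Bonferroni: split the overall alpha evenly across the four comparisons
bonferroni_alpha = alpha / len(regional_p_values)

still_significant = {
    comparison: p < bonferroni_alpha
    for comparison, p in regional_p_values.items()
}

print(f"Adjusted threshold: {bonferroni_alpha:.4f}")
for comparison, significant in still_significant.items():
    print(f"{comparison}: {'Yes' if significant else 'No'}")
```

With the adjusted threshold of 0.0125, the same three comparisons remain significant, so the conclusions at regional stage do not depend on whether the correction is applied.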

### Key Findings and Interpretation of Survival Curves

<img src="../images/km_curves_strat_by_stage.png" width="90%">

**Localized Stage:** 
- All races have relatively high survival rates at this stage (>80%).
- Racial disparities are statistically significant, but differences are minimal.

**Regional Stage:**
- Survival rates are overall lower for all racial groups and racial disparities are more pronounced.
- Black, Asian/Pacific Islander, and Hispanic patients demonstrate significantly worse survival than White patients, with Black patients having the worst survival outcomes.

**Distant Stage:**
- Survival rates drop below 30% for all races by 250 months.
- Small sample sizes for some minority groups limit interpretation.