# Melanoma Health Disparities Analysis

A personal project examining racial disparities in melanoma survival outcomes using SEER cancer registry data.

### Purpose
This notebook cleans and processes raw SEER data to prepare it for survival analysis.

### Dataset

**Source:** SEER Research Data, 17 Registries, Nov 2024 Sub (2000-2022)  
**Original Size:** 234,836 cutaneous melanoma patients

The data was pre-filtered in SEER*Stat to include only:
- Microscopy-confirmed malignant cutaneous melanoma
- Known stage at diagnosis
- First primary tumors only

**Note:** Individual patient-level data cannot be shared publicly under the SEER Research Data Agreement. 
<br>Instructions for requesting access and recreating this dataset are in the [data README](../data/README.md).

### Research Question

Are melanoma survival disparities by race explained by later stage at diagnosis and socioeconomic factors, or do disparities persist independent of these factors?

### Analysis Workflow

This is the first notebook in a three-part series:

1. **01_data_cleaning.ipynb** (this notebook) - Data cleaning and filtering
2. **02_exploratory_analysis.ipynb** - Exploratory data analysis and visualization
3. **03_survival_analysis.ipynb** - Kaplan-Meier curves and Cox regression models

### GitHub Repository

**GitHub:** https://github.com/kpannoni/melanoma-project

---

## Step 1: Load the data exported from SEER

In [302]:
import pandas as pd
import os

# Load the melanoma data from CSV file
mel_data = pd.read_csv('../data/melanoma_data_SEER.csv', header=0, low_memory=False)

# Check the columns and data types in the dataframe
mel_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234836 entries, 0 to 234835
Data columns (total 13 columns):
 #   Column                                                      Non-Null Count   Dtype 
---  ------                                                      --------------   ----- 
 0   Age recode with <1 year olds and 90+                        234836 non-null  object
 1   Sex                                                         234836 non-null  object
 2   Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)  234836 non-null  object
 3   Year of diagnosis                                           234836 non-null  int64 
 4   Survival months                                             234836 non-null  object
 5   Summary stage 2000 (1998-2017)                              234836 non-null  object
 6   SEER cause-specific death classification                    234836 non-null  object
 7   Vital status recode (study cutoff used)                     234836 non-null  object

The dataset contains 234,836 cases and 13 variables. Most columns have the *object* dtype; only "Year of diagnosis" is *int64*. We will look more closely at the data to see why "Survival months" was not read as a numeric type.
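One way to find out why a column was read as *object* is to coerce it to numeric and list the values that fail to parse. The sketch below uses a small made-up frame (`toy`), not the SEER export, so the values are purely illustrative.

```python
import pandas as pd

# Toy stand-in for the SEER export (illustrative values, not real data)
toy = pd.DataFrame({"Survival months": ["0000", "0064", "Unknown", "0275"]})

# Coerce to numeric; anything that cannot be parsed becomes NaN
parsed = pd.to_numeric(toy["Survival months"], errors="coerce")

# The entries that failed to parse are what keep the column as object dtype
non_numeric = toy.loc[parsed.isna(), "Survival months"].unique().tolist()
print("Non-numeric entries:", non_numeric)
```

Applied to the real column, the same pattern would list every label that blocks a direct conversion to a numeric dtype.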

## Step 2: Rename columns to simplify variable names
Because the column names exported from SEER are long and cumbersome to work with, we will rename them to shorter, simpler names.

In [305]:
# Rename columns to shorter, cleaner names
mel_data = mel_data.rename(columns={
    'Age recode with <1 year olds and 90+': 'age_group',
    'Sex': 'sex',
    'Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)': 'race',
    'Marital status at diagnosis': 'marital_status',
    'Year of diagnosis': 'year_diag',
    'ICD-O-3 Hist/behav, malignant': 'histology',
    'Summary stage 2000 (1998-2017)': 'stage',
    'Primary Site - labeled': 'primary_site',
    'Vital status recode (study cutoff used)': 'vital_status',
    'SEER cause-specific death classification': 'cause_death',
    'Survival months': 'survival_months',
    'Median household income inflation adj to 2023': 'median_income',
    'Rural-Urban Continuum Code': 'rural_urban'
})

print("New column names:", list(mel_data.columns))

New column names: ['age_group', 'sex', 'race', 'year_diag', 'survival_months', 'stage', 'cause_death', 'vital_status', 'histology', 'primary_site', 'marital_status', 'median_income', 'rural_urban']
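`DataFrame.rename` ignores mapping keys that do not match any column (its default is `errors="ignore"`), so a typo in a long SEER column name would pass silently. A minimal sketch of a check for that, using a toy frame and a hypothetical `rename_map` in which one key is deliberately absent:

```python
import pandas as pd

# Toy frame with two of the exported SEER column names (illustrative values)
toy = pd.DataFrame({"Sex": ["Male"], "Year of diagnosis": [2010]})

# Hypothetical mapping; the last key does not exist in the toy frame
rename_map = {
    "Sex": "sex",
    "Year of diagnosis": "year_diag",
    "Survival months": "survival_months",
}

# List mapping keys that have no matching column before renaming
unmatched = [col for col in rename_map if col not in toy.columns]
print("Keys with no matching column:", unmatched)

# Unmatched keys are skipped without an error
toy = toy.rename(columns=rename_map)
print("New column names:", list(toy.columns))
```

Passing `errors="raise"` to `rename` is an alternative that makes pandas raise a `KeyError` for unmatched keys instead.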


## Step 3: Take a closer look at the data
Let's take a look at the data values for the key variables of interest and check for missing or unknown data.

In [307]:
# Look at the value counts for "survival_months", "race", and "vital_status"
print(mel_data['survival_months'].value_counts(), "\n")
print(mel_data['race'].value_counts(), "\n")
print(mel_data['vital_status'].value_counts())

print("\nChecking for missing data...")

# Check for NaNs in the data (count per column, then overall total)
nan_count = mel_data.isna().sum()
n_missing = nan_count.sum()

print(f"There are {n_missing} missing values in the data.")

# If there is any missing data, show which columns
if n_missing > 0:
    print(nan_count)

survival_months
0000       2516
0064       1621
0066       1535
0090       1521
0065       1512
           ... 
0274        358
0264        347
0273        335
0275        302
Unknown      18
Name: count, Length: 277, dtype: int64 

race
Non-Hispanic White                            215435
Non-Hispanic Unknown Race                       8122
Hispanic (All Races)                            8077
Non-Hispanic Asian or Pacific Islander          1598
Non-Hispanic Black                              1028
Non-Hispanic American Indian/Alaska Native       576
Name: count, dtype: int64 

vital_status
Alive    159697
Dead      75139
Name: count, dtype: int64

Checking for missing data...
There are 0 missing values in the data.


There are no *NaN* values in the data. However, two of our key variables contain "unknown" values: 18 cases have *survival_months* recorded as "Unknown", and 8,122 cases have *race* recorded as "Non-Hispanic Unknown Race". <br>For this analysis, we don't want to include cases where race or survival months is not known.
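Because `isna()` only detects true missing values, text placeholders such as "Unknown" have to be searched for explicitly. The sketch below scans every column of a toy frame for values containing "unknown", ignoring case. The frame and the search term are assumptions for illustration; other exports may use different placeholder labels.

```python
import pandas as pd

# Toy stand-in using labels that appear in the value counts above
toy = pd.DataFrame({
    "survival_months": ["0012", "Unknown", "0100"],
    "race": [
        "Non-Hispanic White",
        "Non-Hispanic Unknown Race",
        "Hispanic (All Races)",
    ],
    "vital_status": ["Alive", "Dead", "Alive"],
})

# Count, per column, the rows whose value contains "unknown" (case-insensitive)
unknown_counts = {
    col: int(toy[col].astype(str).str.contains("unknown", case=False).sum())
    for col in toy.columns
}
print(unknown_counts)
```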

## Step 4: Remove cases where race or survival time is not known
We will filter the dataset to remove cases that are "unknown" for either *race* or *survival_months*.

In [322]:
n_original = len(mel_data) # number of patients before filtering
print(f"Original dataset: {n_original:,} cases")

print("\nRemoving patients where 'survival months' or 'race' is 'unknown'...\n")

# Remove unknown survival_months
mel_data_clean = mel_data[mel_data['survival_months'] != 'Unknown'].copy()
survival_removed = n_original - len(mel_data_clean)
print(f"Patients with unknown survival removed: {survival_removed:,} cases")

# Convert survival_months from zero-padded strings to a numeric (int64) dtype
mel_data_clean['survival_months'] = pd.to_numeric(mel_data_clean['survival_months'])

# Number of cases after removing unknown survival
n_after_survival = len(mel_data_clean)

# Remove unknown race
mel_data_clean = mel_data_clean[mel_data_clean['race'] != 'Non-Hispanic Unknown Race'].copy()
n_final = len(mel_data_clean) # size of final dataset
race_removed = n_after_survival - n_final
print(f"Patients with unknown race removed: {race_removed:,} cases")

# Get the total number of patients removed by filtering
removed = n_original - n_final

# Print the percentage of the data removed by filtering
print(f"\nTotal: {removed:,} patients were removed ({removed / n_original:.1%} of the data)")

print(f"\nThe cleaned dataset has {n_final:,} cases.")

Original dataset: 234,836 cases

Removing patients where 'survival months' or 'race' is 'unknown'...

Patients with unknown survival removed: 18 cases
Patients with unknown race removed: 8,122 cases

Total: 8,140 patients were removed (3.5% of the data)

The cleaned dataset has 226,696 cases.
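The two filters run in sequence, so a case with both unknown survival and unknown race would be counted only under the first filter. In the run above this does not matter: the original race counts showed 8,122 unknown-race cases and the same number was removed after the survival filter, so the two groups do not overlap. As an alternative sketch (on a toy frame, not the SEER data), boolean masks make any overlap explicit and apply both exclusions in one step:

```python
import pandas as pd

# Toy stand-in; the last row is unknown for both variables
toy = pd.DataFrame({
    "survival_months": ["0012", "Unknown", "0100", "Unknown"],
    "race": [
        "Non-Hispanic White",
        "Non-Hispanic Black",
        "Non-Hispanic Unknown Race",
        "Non-Hispanic Unknown Race",
    ],
})

# One boolean mask per exclusion criterion
unknown_survival = toy["survival_months"] == "Unknown"
unknown_race = toy["race"] == "Non-Hispanic Unknown Race"

# Keep rows that meet neither criterion, then convert survival to numeric
toy_clean = toy[~(unknown_survival | unknown_race)].copy()
toy_clean["survival_months"] = pd.to_numeric(toy_clean["survival_months"])

print(f"Unknown survival: {unknown_survival.sum()}")
print(f"Unknown race: {unknown_race.sum()}")
print(f"Unknown for both: {(unknown_survival & unknown_race).sum()}")
print(f"Rows kept: {len(toy_clean)}")
```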


## The Cleaned Dataset
**Exclusions:** 8,140 patients with unknown race or survival time were removed (3.5% of original data)
<br>**Final sample:** 226,696 cutaneous melanoma cases across 13 variables

### Key Characteristics

**Race/Ethnicity:**
- Non-Hispanic White: 95% of cases
- All other racial/ethnic groups: <5% combined

**Vital Status:**
- Alive at last follow-up: 67%
- Dead (melanoma-specific): 12%
- Dead (other causes or unknown): 21%

Although the sample is predominantly Non-Hispanic White, the dataset still provides 11,277 minority cases for a robust disparity analysis. The 12% melanoma-specific mortality reflects the generally favorable prognosis of cutaneous malignant melanoma.

In [316]:
print("Breakdown of Race (%):\n")
print(round(mel_data_clean['race'].value_counts(normalize=True) * 100,2))

# Get the total number of minority cases in the data
minority_cases = len(mel_data_clean[mel_data_clean['race'] != "Non-Hispanic White"])
print(f"\nCombined minority cases in the data: {minority_cases:,}")

print("\nBreakdown of Vital Status (%):\n")
print(round(mel_data_clean['vital_status'].value_counts(normalize=True) * 100,2), "\n")
print(round(mel_data_clean['cause_death'].value_counts(normalize=True) * 100,2))

Breakdown of Race (%):

race
Non-Hispanic White                            95.03
Hispanic (All Races)                           3.56
Non-Hispanic Asian or Pacific Islander         0.70
Non-Hispanic Black                             0.45
Non-Hispanic American Indian/Alaska Native     0.25
Name: proportion, dtype: float64

Combined minority cases in the data: 11,277

Breakdown of Vital Status (%):

vital_status
Alive    66.99
Dead     33.01
Name: proportion, dtype: float64 

cause_death
Alive or dead of other cause             87.49
Dead (attributable to this cancer dx)    12.08
Dead (missing/unknown COD)                0.43
Name: proportion, dtype: float64
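The 21% "Dead (other causes or unknown)" figure under Key Characteristics is not printed directly. It follows from subtracting the melanoma-specific deaths from all deaths: 33.01% - 12.08% = 20.93%. A cross-tabulation of the two columns shows the same split in one table. The sketch below uses a five-row toy frame, so its percentages are illustrative and do not match the SEER output.

```python
import pandas as pd

# Toy stand-in using the category labels from the output above
toy = pd.DataFrame({
    "vital_status": ["Alive", "Alive", "Dead", "Dead", "Dead"],
    "cause_death": [
        "Alive or dead of other cause",
        "Alive or dead of other cause",
        "Alive or dead of other cause",
        "Dead (attributable to this cancer dx)",
        "Dead (missing/unknown COD)",
    ],
})

# Percentage of all cases in each vital_status / cause_death combination
breakdown = pd.crosstab(toy["vital_status"], toy["cause_death"], normalize="all") * 100
print(breakdown.round(1))

# Deaths not attributed to this cancer = all deaths minus cancer-specific deaths
dead_total = breakdown.loc["Dead"].sum()
dead_cancer = breakdown.loc["Dead", "Dead (attributable to this cancer dx)"]
dead_other = dead_total - dead_cancer
print(f"Dead, other or unknown cause: {dead_other:.1f}%")
```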


**The data is now clean and ready for exploratory analysis!**
<br>Next, please run the notebook: `02_exploratory_analysis.ipynb`