Code with new CSV

Research questions: What is the relationship between income & breast cancer survival months? Is there one?
What is the relationship between race & breast cancer survival months? Is there one?
What is the relationship between age & breast cancer survival months? Is there one?
What is the relationship between income & stage at diagnosis? Is there one?
What is the relationship between race & stage at diagnosis? Is there one?
What is the relationship between age & stage at diagnosis? Is there one?
How do the breast cancer stages (T, N, M) affect survival months?

Below are the imports for the project

In [1]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from scipy.stats import spearmanr
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer


import pandas as pd
import numpy as np
import sklearn

CSV File

In [2]:
data = pd.read_csv("bcfile.csv")

In [4]:
df = pd.DataFrame(data)

Getting a sense of how the data looks

In [5]:
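# NOTE: without parentheses this echoes the bound method (as in the output below);
# call df.head() to preview only the first rows
df.head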

<bound method NDFrame.head of         Race recode (W, B, AI, API) Breast - Adjusted AJCC 6th T (1988-2015)  \
0                             White                                 Blank(s)   
1                             White                                 Blank(s)   
2                             White                                 Blank(s)   
3                             White                                 Blank(s)   
4                             White                                 Blank(s)   
...                             ...                                      ...   
8720790                       White                                 Blank(s)   
8720791                       White                                 Blank(s)   
8720792                       White                                 Blank(s)   
8720793                       White                                 Blank(s)   
8720794                       White                                 Blank(s)   

        B

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()

Remove 'Blank(s)', NaN, and 'Unknown' values

In [6]:
df.isna().any()

Race recode (W, B, AI, API)                      False
Breast - Adjusted AJCC 6th T (1988-2015)          True
Breast - Adjusted AJCC 6th N (1988-2015)          True
Breast - Adjusted AJCC 6th M (1988-2015)          True
Patient ID                                       False
Age recode with single ages and 85+              False
Median household income inflation adj to 2019    False
Survival months                                  False
dtype: bool

In [None]:
df.dropna(inplace=True)

In [None]:
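# dropna() only removes NaN; also filter out rows still marked 'Blank(s)' in the T column
# regex=False matches the literal text, so the parentheses need no escaping
df = df[~df['Breast - Adjusted AJCC 6th T (1988-2015)'].str.contains('Blank(s)', regex=False)]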

In [26]:
# dropna() does not touch 'Unknown' strings; filter those rows out of survival months
df = df[~df['Survival months'].str.contains('Unknown')]
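
As a minimal sketch of the same cleanup done in one pass, the sentinel strings can be masked to NaN first and then dropped together with the real NaN values. The toy frame and its shortened column names are assumptions for illustration; only the sentinel strings 'Blank(s)' and 'Unknown' come from the data above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real file (hypothetical values, shortened column names)
toy = pd.DataFrame({
    "T stage": ["T1", "Blank(s)", "T2", np.nan],
    "Survival months": ["0208", "0049", "Unknown", "0014"],
})

# Mark the sentinel strings as missing, then drop every incomplete row
sentinels = ["Blank(s)", "Unknown"]
clean = toy.mask(toy.isin(sentinels)).dropna().reset_index(drop=True)

print(clean)
```

This checks every column, so a sentinel hiding in a column other than the two filtered above would also be caught.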

In [25]:
print(df['Survival months'].unique())

['0208' '0049' '0143' '0000' '0026' '0014' '0154' '0094' '0023' '0056'
 '0053' '0022' '0081' '0004' '0007' '0047' '0224' '0011' '0106' '0024'
 '0061' '0065' '0171' '0128' '0048' '0003' '0050' '0037' '0130' '0028'
 '0082' '0087' '0075' '0069' '0127' '0083' '0218' '0214' '0001' '0151'
 '0089' '0182' '0005' '0066' '0010' '0129' '0074' '0192' '0191' '0036'
 '0040' '0125' '0095' '0017' '0033' '0002' '0029' '0088' '0225' '0160'
 '0027' '0035' '0015' '0057' '0135' '0084' '0238' '0098' '0034' '0006'
 '0044' '0080' '0008' '0039' '0013' '0059' '0031' '0019' '0120' '0085'
 '0025' '0052' '0041' '0016' '0108' '0018' '0147' '0062' '0096' '0161'
 '0105' '0072' '0099' '0067' '0051' '0012' '0097' '0030' '0060' '0146'
 '0136' '0231' '0211' '0116' '0070' '0043' '0076' '0157' '0038' '0045'
 '0032' '0115' '0109' '0090' '0020' '0093' '0046' '0058' '0073' '0101'
 '0104' '0068' '0103' '0054' '0107' '0086' '0112' '0118' '0113' '0139'
 '0229' '0009' '0234' '0078' '0063' '0021' '0055' '0220' '0164' '0199'
 '0123

In [None]:
print(df['Breast - Adjusted AJCC 6th T (1988-2015)'].unique())

In [None]:
print(df['Breast - Adjusted AJCC 6th N (1988-2015)'].unique())

In [None]:
print(df['Breast - Adjusted AJCC 6th M (1988-2015)'].unique())

In [None]:
# checks for nan's and blanks after
df.isna().any()

In [None]:
import matplotlib.pyplot as plt

for col in df.columns:
    plt.hist(df[col], bins=20)
    plt.title(col)
    plt.xlabel('Value')
    plt.ylabel('Frequency')
    plt.show()

One hot encoding for columns

In [None]:
race_encode = 'Race recode (W, B, AI, API)'

race_encoded = pd.get_dummies(df[race_encode], prefix = race_encode)

df = pd.concat([df, race_encoded], axis=1)

df.drop('Race recode (W, B, AI, API)', axis=1, inplace=True)

In [None]:
sixthstage_encode = 'Breast - Adjusted AJCC 6th T (1988-2015)'

sixthstage_encoded = pd.get_dummies(df[sixthstage_encode], prefix = sixthstage_encode)

df = pd.concat([df, sixthstage_encoded], axis=1)

df.drop('Breast - Adjusted AJCC 6th T (1988-2015)', axis=1, inplace=True)

In [None]:
nstage_encode = 'Breast - Adjusted AJCC 6th N (1988-2015)'

nstage_encoded = pd.get_dummies(df[nstage_encode], prefix = nstage_encode)

df = pd.concat([df, nstage_encoded], axis=1)

df.drop('Breast - Adjusted AJCC 6th N (1988-2015)', axis=1, inplace=True)

In [None]:
mstage_encode = 'Breast - Adjusted AJCC 6th M (1988-2015)'

mstage_encoded = pd.get_dummies(df[mstage_encode], prefix = mstage_encode)

df = pd.concat([df, mstage_encoded], axis=1)

df.drop('Breast - Adjusted AJCC 6th M (1988-2015)', axis=1, inplace=True)
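
The four encoding cells above repeat the same three steps (encode, concat, drop). A possible shortcut, sketched here on a toy frame with hypothetical values and shortened column names, is a single pd.get_dummies call with the columns argument. Note that this would not leave separate frames such as race_encoded around, which later cells rely on.

```python
import pandas as pd

# Toy stand-in (hypothetical values; column names shortened for readability)
toy = pd.DataFrame({
    "Race": ["White", "Black", "White"],
    "T": ["T1", "T2", "T1"],
    "N": ["N0", "N1", "N0"],
    "Age": [54, 61, 47],
})

# get_dummies encodes the listed columns, prefixes each new column with the
# source column name, and drops the originals in one step
encoded = pd.get_dummies(toy, columns=["Race", "T", "N"])

print(encoded.columns.tolist())
```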

Cleaning data for age, income, and race

In [None]:
# instead of encoding age, remove "years" from value and change it to int type
df['Age recode with single ages and 85+'] = df['Age recode with single ages and 85+'].str.replace(' years', '')

# the oldest group is recorded as '85+', so replace '85+' with 85
df['Age recode with single ages and 85+'] = df['Age recode with single ages and 85+'].replace('85+', '85')

# convert 'Age recode' column to int
df['Age recode with single ages and 85+'] = df['Age recode with single ages and 85+'].astype(int)
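
A compact alternative, shown only as a sketch: pull the leading digits out with a regular expression and convert in one chain. The sample strings are assumptions that mimic the formats seen in this column ('00 years', '85+ years').

```python
import pandas as pd

# Sample values in the same format as the age column (assumed examples)
age_raw = pd.Series(["00 years", "54 years", "85+ years"])

# Keep the first run of digits and convert it to int; '85+' becomes 85
age = age_raw.str.extract(r"(\d+)", expand=False).astype(int)

print(age.tolist())
```

Either way, 85 stands for "85 or older", so ages in that group are capped rather than exact.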

In [None]:
print(df['Age recode with single ages and 85+'].unique())

In [None]:
income_encode = 'Median household income inflation adj to 2019'

income_encoded = pd.get_dummies(df[income_encode], prefix = income_encode)

df = pd.concat([df, income_encoded], axis=1)

df.drop('Median household income inflation adj to 2019', axis=1, inplace=True)

In [None]:
print(income_encoded.columns.unique())

In [None]:
print(race_encoded.columns.unique())

In [None]:
print(df['Survival months'].unique())

In [None]:
df['Survival months'] = df['Survival months'].astype(int)

In [None]:
# convert survival months to bool for kfold validation

In [None]:
print(df.dtypes)

Rerun original code

In [None]:
folds = 5

kf = KFold(n_splits=folds, shuffle=True, random_state=42)

# Define the machine learning model
model = LinearRegression()

Try age alone as the only feature, as a sanity check on whether the model's score is meaningful.

In [None]:
#create pipeline (the scaler has to exist before it is passed in)
scaler = StandardScaler()
pipeline = make_pipeline(scaler, model)

In [None]:
# Feature columns: one-hot race, numeric age, one-hot income
feature_cols = (
    list(race_encoded.columns)
    + ['Age recode with single ages and 85+']
    + list(income_encoded.columns)
)
# Alternatives tried:
# feature_cols = ['Age recode with single ages and 85+']
# feature_cols = list(df.columns.drop('Survival months'))

# Iterate over the K folds
for i, (train_index, val_index) in enumerate(kf.split(df), start=1):

    # Split the data into training and validation sets
    train_data = df.iloc[train_index]
    val_data = df.iloc[val_index]

    # Define training vars
    X_train = train_data[feature_cols]
    X_val = val_data[feature_cols]
    y_train = train_data['Survival months']
    y_val = val_data['Survival months']

    # Fit the pipeline on the training data
    pipeline.fit(X_train, y_train)

    # Calculate mean squared error and R2 on the validation fold
    y_pred = pipeline.predict(X_val)
    mse = mean_squared_error(y_val, y_pred)
    r2 = r2_score(y_val, y_pred)

    print(f"Fold {i} MSE: {mse:.2f} R2: {r2:.2f}")
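
The manual fold loop can also be written with scikit-learn's cross_validate, which handles the splitting, fitting, and scoring. The sketch below runs on synthetic stand-in data (an assumption, not the cancer dataset) so that it is self-contained; the pipeline and KFold settings mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): one numeric feature, noisy linear target
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({"age": rng.integers(20, 86, size=200)})
y_toy = 120 - 0.8 * X_toy["age"] + rng.normal(0, 10, size=200)

pipe = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# The scaler is refit inside each training fold, so no validation data leaks in
scores = cross_validate(
    pipe, X_toy, y_toy, cv=cv,
    scoring=("neg_mean_squared_error", "r2"),
)

# scikit-learn reports MSE negated so that higher is always better; flip the sign
mse_per_fold = -scores["test_neg_mean_squared_error"]
r2_per_fold = scores["test_r2"]
for fold, (mse, r2) in enumerate(zip(mse_per_fold, r2_per_fold), start=1):
    print(f"Fold {fold} MSE: {mse:.2f} R2: {r2:.2f}")
```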


Linear Regression with PCA and onehot encoding separate

In [46]:
X = df[['Race recode (W, B, AI, API)', 'Age recode with single ages and 85+', 'Median household income inflation adj to 2019']]
y = df['Survival months']

In [27]:
X = df.drop(columns=['Survival months'])
y = df['Survival months']

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

In [48]:
#instantiate column selectors instead of onehotencode
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')

In [49]:
#instantiate imputers for missing values
freq_imputer = SimpleImputer(strategy='most_frequent')
mean_imputer = SimpleImputer(strategy='mean')

In [51]:
#instantiate the encoder and scalers
scaler = StandardScaler()
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

In [52]:
#instantiate the numerical pipeline
num_pipe = make_pipeline(mean_imputer, scaler)
num_pipe

In [53]:
#instantiate the categorical pipeline
cat_pipe = make_pipeline(freq_imputer, ohe)
cat_pipe

In [54]:
#create the tuple for column transformer
num_tuple = (num_pipe, num_selector)
cat_tuple = (cat_pipe, cat_selector)

#create the preprocessor column transformer
preprocessor = make_column_transformer(num_tuple, cat_tuple, remainder = 'passthrough')

In [55]:
#transform the data
#fit only the train data
preprocessor.fit(X_train)

#transform train and test data
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

In [36]:
#see the column transformer steps
preprocessor.named_transformers_

{'pipeline-1': Pipeline(steps=[('simpleimputer', SimpleImputer()),
                 ('standardscaler', StandardScaler())]),
 'pipeline-2': Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')),
                 ('onehotencoder',
                  OneHotEncoder(handle_unknown='ignore', sparse_output=False))])}

In [56]:
#extract feature names from ohe
cat_feature_names = preprocessor.named_transformers_['pipeline-2']\
                    .named_steps['onehotencoder']\
                    .get_feature_names_out(cat_selector(X_train))
cat_feature_names

array(['Race recode (W, B, AI, API)_American Indian/Alaska Native',
       'Race recode (W, B, AI, API)_Asian or Pacific Islander',
       'Race recode (W, B, AI, API)_Black',
       'Race recode (W, B, AI, API)_Unknown',
       'Race recode (W, B, AI, API)_White',
       'Age recode with single ages and 85+_00 years',
       'Age recode with single ages and 85+_01 years',
       'Age recode with single ages and 85+_02 years',
       'Age recode with single ages and 85+_03 years',
       'Age recode with single ages and 85+_04 years',
       'Age recode with single ages and 85+_05 years',
       'Age recode with single ages and 85+_06 years',
       'Age recode with single ages and 85+_07 years',
       'Age recode with single ages and 85+_08 years',
       'Age recode with single ages and 85+_09 years',
       'Age recode with single ages and 85+_10 years',
       'Age recode with single ages and 85+_11 years',
       'Age recode with single ages and 85+_12 years',
       'Age recode 

In [57]:
final_cols = num_selector(X_train) + list(cat_feature_names)

In [58]:
#view transformed data as a dataframe
X_train_df = pd.DataFrame(X_train_processed, columns = final_cols)
X_test_df = pd.DataFrame(X_test_processed, columns = final_cols)

display(X_train_df.head())
X_test_df.head()

Unnamed: 0,"Race recode (W, B, AI, API)_American Indian/Alaska Native","Race recode (W, B, AI, API)_Asian or Pacific Islander","Race recode (W, B, AI, API)_Black","Race recode (W, B, AI, API)_Unknown","Race recode (W, B, AI, API)_White",Age recode with single ages and 85+_00 years,Age recode with single ages and 85+_01 years,Age recode with single ages and 85+_02 years,Age recode with single ages and 85+_03 years,Age recode with single ages and 85+_04 years,...,"Median household income inflation adj to 2019_$40,000 - $44,999","Median household income inflation adj to 2019_$45,000 - $49,999","Median household income inflation adj to 2019_$50,000 - $54,999","Median household income inflation adj to 2019_$55,000 - $59,999","Median household income inflation adj to 2019_$60,000 - $64,999","Median household income inflation adj to 2019_$65,000 - $69,999","Median household income inflation adj to 2019_$70,000 - $74,999","Median household income inflation adj to 2019_$75,000+","Median household income inflation adj to 2019_< $35,000",Median household income inflation adj to 2019_Unknown/missing/no match/Not 1990-2018
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,"Race recode (W, B, AI, API)_American Indian/Alaska Native","Race recode (W, B, AI, API)_Asian or Pacific Islander","Race recode (W, B, AI, API)_Black","Race recode (W, B, AI, API)_Unknown","Race recode (W, B, AI, API)_White",Age recode with single ages and 85+_00 years,Age recode with single ages and 85+_01 years,Age recode with single ages and 85+_02 years,Age recode with single ages and 85+_03 years,Age recode with single ages and 85+_04 years,...,"Median household income inflation adj to 2019_$40,000 - $44,999","Median household income inflation adj to 2019_$45,000 - $49,999","Median household income inflation adj to 2019_$50,000 - $54,999","Median household income inflation adj to 2019_$55,000 - $59,999","Median household income inflation adj to 2019_$60,000 - $64,999","Median household income inflation adj to 2019_$65,000 - $69,999","Median household income inflation adj to 2019_$70,000 - $74,999","Median household income inflation adj to 2019_$75,000+","Median household income inflation adj to 2019_< $35,000",Median household income inflation adj to 2019_Unknown/missing/no match/Not 1990-2018
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [59]:
#inspect the result of scaled data and ohe
print(np.isnan(X_train_processed).sum().sum(), 'missing values in training data')
print(np.isnan(X_test_processed).sum().sum(), 'missing values in testing data')
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('shape of data is', X_train_processed.shape)
print('\n')
X_train_processed

0 missing values in training data
0 missing values in testing data


All data in X_train_processed are float64
All data in X_test_processed are float64


shape of data is (6039893, 102)




array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.]])

In [60]:
#instantiate the model
linreg = LinearRegression()

#create pipeline
linreg_pipe = make_pipeline(preprocessor, linreg)

#fit the pipeline
linreg_pipe.fit(X_train, y_train)

In [61]:
#function that shows all needed values
def eval_model(true, pred):
  mse = mean_squared_error(true, pred)
  r2 = r2_score(true, pred)
  print(f' MSE: {mse}, \n R2: {r2}')

#results
print('Train Evaluation')
eval_model(y_train, linreg_pipe.predict(X_train))
print('\nTest Evaluation')
eval_model(y_test, linreg_pipe.predict(X_test))

Train Evaluation
 MSE: 3572.840155440204, 
 R2: 0.09483364648555004

Test Evaluation
 MSE: 3565.366375477891, 
 R2: 0.09468032504917367
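
MSE is in squared months, which is hard to read directly. As a quick sketch using the test values printed above, taking the square root gives the typical error in months.

```python
import math

# Test-set values copied from the evaluation output above
test_mse = 3565.366375477891
test_r2 = 0.09468032504917367

# RMSE is back in the target's unit (months)
rmse = math.sqrt(test_mse)

print(f"Test RMSE: {rmse:.1f} months")
print(f"Share of variance explained: {test_r2:.1%}")
```

An error of roughly 60 months alongside an R2 below 0.1 suggests that these three features explain very little of the variation in survival months.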


Measure variance

In [63]:
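# standardize the data
# NOTE: X still holds string columns (e.g. race), so StandardScaler fails as shown below;
# the encoded matrix X_train_processed from the preprocessor is numeric and could be used instead
scaler = StandardScaler()
X_std = scaler.fit_transform(X)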

ValueError: could not convert string to float: 'White'

In [None]:
#instantiate the model
model = LinearRegression()

#fit on the preprocessed (numeric) training data; the raw X_train still contains strings
model.fit(X_train_processed, y_train)

In [None]:
# perform PCA
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_std)

In [None]:
# print the explained variance ratio
print(pca.explained_variance_ratio_)

The three principal components together explain only about 27.16% of the total variance, so most of the variability in the data is not captured by them. I assume this happens because one value predominates (in race_encoded), which makes the data more homogeneous.
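
To make the total easier to read off, the per-component ratios can be summed cumulatively. The sketch below uses a small synthetic frame (an assumption, with made-up race and age values) and encodes the categorical column before scaling, since StandardScaler and PCA only accept numeric input.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): one categorical and one numeric column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "race": rng.choice(
        ["White", "Black", "Asian or Pacific Islander"],
        size=300, p=[0.8, 0.1, 0.1],
    ),
    "age": rng.integers(20, 86, size=300),
})

# Encode strings first, then standardize every column
X_num = pd.get_dummies(toy, columns=["race"]).astype(float)
X_std = StandardScaler().fit_transform(X_num)

pca = PCA(n_components=3)
pca.fit(X_std)

# Each ratio is one component's share; the running sum shows the total captured
ratios = pca.explained_variance_ratio_
print("Per component:", np.round(ratios, 4))
print("Cumulative:   ", np.round(np.cumsum(ratios), 4))
```

The last cumulative value is the figure to compare against the roughly 27% quoted above.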

Experiment with the Spearman rank correlation coefficient (recommended by Sheena) to see whether there is a relationship between the variables. The result is displayed as a matrix, which doesn't tell me much.

In [62]:
corr, pval = spearmanr(X, y)

# corr is a matrix here, so print it as-is instead of using a %0.3f placeholder
print("Spearman correlation matrix:")
print(corr)

Spearman correlation matrix:
[[ 1.          0.08280474 -0.02216576  0.03121668]
 [ 0.08280474  1.         -0.00771402 -0.28686586]
 [-0.02216576 -0.00771402  1.          0.03880116]
 [ 0.03121668 -0.28686586  0.03880116  1.        ]]
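
When spearmanr receives a 2-D X plus y, it treats every column as a variable and returns the full pairwise matrix, with rows and columns in the order of X's columns followed by y. For a single, easier-to-read number per research question, two 1-D inputs can be passed instead. The sketch below uses synthetic age and survival values (an assumption, not the real data).

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-in (assumption): survival falls with age, plus noise
rng = np.random.default_rng(1)
age = rng.integers(20, 86, size=500)
survival = 150 - age + rng.normal(0, 20, size=500)

# Two 1-D inputs give one coefficient and one p-value
rho, p_value = spearmanr(age, survival)

print(f"Spearman rho = {rho:.3f}, p-value = {p_value:.3g}")
```

The sign of rho gives the direction of the monotonic relationship and its magnitude gives the strength.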


What is the relationship between income & breast cancer survival months? Is there one?

Linear regression with and without PCA gives the same results: high MSE and low R2. A different model may be needed.