# Cervical Cancer Risk

## Purpose

Cervical cancer is a preventable type of cancer that affects about 11,000 women each year in the U.S. The number of new cases has been declining due to increased screening and early detection with the Pap test. However, it still kills about 4,000 women in the U.S. and 300,000 worldwide annually. Risk factors include human papillomavirus (HPV) infection, sexual activity with infected partners, having multiple sexual partners, a history of chlamydia, long-term use of oral contraception, having many children, smoking, a weakened immune system, and exposure to diethylstilbestrol (DES). While the median age of diagnosis is 48 years, 15% of women develop cervical cancer between ages 20-30. African American and Hispanic women have a higher risk of cervical cancer due to social and economic disparities that limit access to screening services. This project aims to identify which features are most important for predicting a positive cervical cancer biopsy.

## Introduction

In [1]:
# import libraries
import pandas as pd 
import plotly.express as px
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import roc_auc_score, roc_curve, auc, f1_score

# show graphs in html
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"

In [149]:
# read dataframe
df = pd.read_csv('datasets/kag_risk_factors_cervical_cancer.csv')

In [150]:
# look at dataframe
df.head()

Unnamed: 0,age,number_of_sexual_partners,first_sexual_intercourse,num_of_pregnancies,smokes,smokes_years,smokes_packs/year,hormonal_contraceptives,hormonal_contraceptives_years,iud,...,std_ time_since_first_diagnosis,std_time_since_last_diagnosis,dx_cancer,dx_cin,dx_hpv,dx,hinselmann,schiller,cytology,biopsy
0,18,4,15,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
1,15,1,14,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
2,34,1,?,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
3,52,5,16,4,1,37,37,1,3,0,...,?,?,1,0,1,0,0,0,0,0
4,46,3,21,4,0,0,0,1,15,0,...,?,?,0,0,0,0,0,0,0,0


In [151]:
# look at columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 858 entries, 0 to 857
Data columns (total 36 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   age                                858 non-null    int64 
 1   number_of_sexual_partners          858 non-null    object
 2   first_sexual_intercourse           858 non-null    object
 3   num_of_pregnancies                 858 non-null    object
 4   smokes                             858 non-null    object
 5   smokes_years                       858 non-null    object
 6   smokes_packs/year                  858 non-null    object
 7   hormonal_contraceptives            858 non-null    object
 8   hormonal_contraceptives_years      858 non-null    object
 9   iud                                858 non-null    object
 10  iud_years                          858 non-null    object
 11  stds                               858 non-null    object
 12  stds_num

In [152]:
# looking for missing values
df.isna().sum().sum()

0

In [153]:
# looking for duplicates
df.duplicated().sum()

23

In [154]:
# inspect the duplicated rows
df[df.duplicated()]

Unnamed: 0,age,number_of_sexual_partners,first_sexual_intercourse,num_of_pregnancies,smokes,smokes_years,smokes_packs/year,hormonal_contraceptives,hormonal_contraceptives_years,iud,...,std_ time_since_first_diagnosis,std_time_since_last_diagnosis,dx_cancer,dx_cin,dx_hpv,dx,hinselmann,schiller,cytology,biopsy
66,34,3,19,3,0,0,0,1,5,0,...,?,?,0,0,0,0,0,0,0,0
234,25,?,18,2,0,0,0,?,?,?,...,?,?,0,0,0,0,0,0,0,0
255,25,2,18,2,0,0,0,1,0.25,0,...,?,?,0,0,0,0,0,0,0,0
356,18,1,17,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
395,18,1,18,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
406,17,1,17,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
419,19,4,14,1,0,0,0,?,?,?,...,?,?,0,0,0,0,0,0,0,0
431,18,1,14,2,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
435,17,2,15,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
440,15,1,14,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0


These do not appear to be true duplicates, so we will keep them.
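
The column list contains no patient identifier, so two patients who gave the same answers produce identical rows. A minimal sketch, on a toy frame with hypothetical values, of how `duplicated(keep=False)` can show every member of a duplicate group instead of only the later copies:

```python
import pandas as pd

# Toy frame (hypothetical values) containing one repeated row
toy = pd.DataFrame({
    "age": [18, 18, 25, 34],
    "num_of_pregnancies": [1, 1, 2, 3],
    "biopsy": [0, 0, 0, 1],
})

# Default keep='first' flags only the later copies of a repeated row
n_later_copies = toy.duplicated().sum()

# keep=False flags every row that has an identical twin
all_copies = toy[toy.duplicated(keep=False)]

print(n_later_copies)
print(all_copies)
```

Viewing both copies side by side makes it easier to judge whether a repeated row is plausible for two different people.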

In [155]:
# replace question marks with NaN
df.replace('?', np.nan, inplace=True)

In [156]:
# recheck for missing values
df.isna().sum()

age                                    0
number_of_sexual_partners             26
first_sexual_intercourse               7
num_of_pregnancies                    56
smokes                                13
smokes_years                          13
smokes_packs/year                     13
hormonal_contraceptives              108
hormonal_contraceptives_years        108
iud                                  117
iud_years                            117
stds                                 105
stds_number                          105
std_condylomatosis                   105
std_cervical condylomatosis          105
std_vaginal_condylomatosis           105
std_vulvo_perineal_condylomatosis    105
std_syphilis                         105
std_pelvic_inflammatory_disease      105
std_genital_herpes                   105
std_molluscum_contagiosum            105
std_aids                             105
std_hiv                              105
std_hep_B                            105
std_hpv         

In [157]:
# looking at column names
df.columns

Index(['age', 'number_of_sexual_partners', 'first_sexual_intercourse',
       'num_of_pregnancies', 'smokes', 'smokes_years', 'smokes_packs/year',
       'hormonal_contraceptives', 'hormonal_contraceptives_years', 'iud',
       'iud_years', 'stds', 'stds_number', 'std_condylomatosis',
       'std_cervical condylomatosis', 'std_vaginal_condylomatosis',
       'std_vulvo_perineal_condylomatosis', 'std_syphilis',
       'std_pelvic_inflammatory_disease', 'std_genital_herpes',
       'std_molluscum_contagiosum', 'std_aids', 'std_hiv', 'std_hep_B',
       'std_hpv', 'std_number_of_diagnosis', 'std_ time_since_first_diagnosis',
       'std_time_since_last_diagnosis', 'dx_cancer', 'dx_cin', 'dx_hpv', 'dx',
       'hinselmann', 'schiller', 'cytology', 'biopsy'],
      dtype='object')

We read the dataset and noticed that many columns had '?' values. We replaced them with NaN so that we could impute them later.
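
Replacing '?' with NaN does not by itself change a column's dtype: columns that were read as `object` stay that way. A minimal sketch, assuming a toy frame with hypothetical values, of one way to cast such columns to numeric before computing statistics:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): numbers stored as text, '?' marks a missing entry
toy = pd.DataFrame({
    "age": [18, 15, 34],
    "first_sexual_intercourse": ["15.0", "14.0", "?"],
    "smokes": ["0.0", "?", "1.0"],
})

toy = toy.replace("?", np.nan)

# Cast every column to a numeric dtype; errors="coerce" turns any
# remaining non-numeric text into NaN instead of raising
toy_numeric = toy.apply(pd.to_numeric, errors="coerce")

print(toy_numeric.dtypes)
print(toy_numeric.isna().sum())
```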

## EDA

In [158]:
# summary statistics
df.describe()

Unnamed: 0,age,std_number_of_diagnosis,dx_cancer,dx_cin,dx_hpv,dx,hinselmann,schiller,cytology,biopsy
count,858.0,858.0,858.0,858.0,858.0,858.0,858.0,858.0,858.0,858.0
mean,26.820513,0.087413,0.020979,0.01049,0.020979,0.027972,0.040793,0.086247,0.051282,0.064103
std,8.497948,0.302545,0.143398,0.101939,0.143398,0.164989,0.197925,0.280892,0.220701,0.245078
min,13.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,84.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [159]:
# correlations
px.imshow(df.corr(), text_auto=True, aspect='auto')

Features with the highest correlation to biopsy are schiller, hinselmann, and cytology. 
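
With 36 columns the heatmap can be hard to read. A minimal sketch, on a toy frame with hypothetical values, of ranking features by their correlation with `biopsy` directly. `numeric_only=True` is used so that non-numeric columns are skipped rather than raising an error in recent pandas versions:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "schiller": [0, 0, 1, 1, 0, 1, 0, 0],
    "smokes":   [0, 1, 0, 1, 0, 0, 1, 0],
    "note":     list("abcdefgh"),   # non-numeric column, skipped below
    "biopsy":   [0, 0, 1, 1, 0, 0, 0, 0],
})

# Correlation of each numeric feature with the target, strongest first
target_corr = (
    toy.corr(numeric_only=True)["biopsy"]
    .drop("biopsy")
    .sort_values(ascending=False)
)
print(target_corr)
```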

In [160]:
# data skew
df.skew()

age                                   1.394279
number_of_sexual_partners             5.454649
first_sexual_intercourse              1.564375
num_of_pregnancies                    1.423514
smokes                                2.013621
smokes_years                          4.465484
smokes_packs/year                     9.308806
hormonal_contraceptives              -0.590551
hormonal_contraceptives_years         2.626438
iud                                   2.465451
iud_years                             5.001759
stds                                  2.583687
stds_number                           3.402849
std_condylomatosis                    3.772582
std_cervical condylomatosis           0.000000
std_vaginal_condylomatosis           13.638036
std_vulvo_perineal_condylomatosis     3.824978
std_syphilis                          6.246054
std_pelvic_inflammatory_disease      27.440845
std_genital_herpes                   27.440845
std_molluscum_contagiosum            27.440845
std_aids     

We can see many of these categorical features are heavily skewed to the right. 
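
For a 0/1 feature, skewness is driven entirely by how rare the positive class is. A minimal sketch, under the assumption (not confirmed by the output above) that a column has a single positive among 753 non-missing rows (858 rows minus the 105 missing values reported earlier), gives a skew close to the 27.44 shown for several STD columns:

```python
import pandas as pd

n_rows = 858 - 105   # non-missing rows for the STD columns, per df.isna().sum()

# Assumed for illustration: exactly one positive case in the column
rare = pd.Series([1] + [0] * (n_rows - 1))

# pandas reports the bias-corrected sample skewness
rare_skew = rare.skew()
print(round(rare_skew, 4))

# A constant column has no spread; pandas reports its skew as 0
constant_skew = pd.Series([0] * n_rows).skew()
print(constant_skew)
```

Under this reading, a skew of 0 for a binary column would indicate that it never takes a second value, which makes it uninformative for modeling.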

In [161]:
# distributions
columns = ['age', 'number_of_sexual_partners', 'first_sexual_intercourse',
       'num_of_pregnancies', 'smokes', 'smokes_years', 'smokes_packs/year', 'hormonal_contraceptives', 'hormonal_contraceptives_years', 'iud',
       'iud_years', 'stds', 'stds_number', 'std_condylomatosis',
       'std_cervical condylomatosis', 'std_vaginal_condylomatosis',
       'std_vulvo_perineal_condylomatosis', 'std_syphilis',
       'std_pelvic_inflammatory_disease', 'std_genital_herpes',
       'std_molluscum_contagiosum', 'std_aids', 'std_hiv', 'std_hep_B',
       'std_hpv', 'std_number_of_diagnosis', 'std_ time_since_first_diagnosis',
       'std_time_since_last_diagnosis', 'dx_cancer', 'dx_cin', 'dx_hpv', 'dx',
       'hinselmann', 'schiller', 'cytology', 'biopsy']
for column in columns:
    px.histogram(df[column].sort_values(), title='Distribution of ' + str.upper(column).replace('_', ' '), template='ggplot2', labels={'value':str.upper(column).replace('_', ' ')}).show()

Age has a mean of about 27, and most individuals are under 30. The most common number of sexual partners is 2, then 1, and then 3. The number of partners tails off after 5. Age at first sexual intercourse appears roughly normally distributed around 16 to 17 years. The most frequent number of pregnancies is 1, with frequencies decreasing steadily through 2, 3, 4, 5, and 6. A large majority of individuals do not smoke. Those who do tend to have smoked for less than a year, at less than one pack per year.

More individuals use hormonal contraceptives than not, but the duration of use is concentrated below 1.5 years. Most individuals do not use an IUD, and the duration of use is also usually less than 1 year. Sexually transmitted diseases are not prevalent, and those who do have STDs usually have one or two. The individual STD features are rare in the data. The diagnosis features are likewise dominated by negative values.

In [35]:
columns = ['age', 'number_of_sexual_partners', 'first_sexual_intercourse',
       'num_of_pregnancies', 'smokes_years', 'smokes_packs/year',
       'hormonal_contraceptives_years', 'iud_years', 'stds_number', 'std_number_of_diagnosis', 'std_ time_since_first_diagnosis',
       'std_time_since_last_diagnosis']
for column in columns:
    px.box(df[column], title='Distribution of ' + str.upper(column).replace('_', ' '), template='ggplot2', labels={'value':str.upper(column).replace('_', ' ')}).show()

In [180]:
px.scatter(df, x='schiller', y='biopsy', color='biopsy')

## Modeling

In [36]:
# split data into features and target
X = df.drop(columns='biopsy')
y = df.biopsy 


In [37]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=19)
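
The mean of `biopsy` in `df.describe()` is 0.064, which corresponds to about 55 positives in 858 rows, so a purely random split can leave the test set with very few positive cases. A hedged sketch, on synthetic labels, of the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. This is an alternative to the split above, not a description of it:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target: 858 rows, 55 positives (about 6.4%)
y_toy = np.array([1] * 55 + [0] * 803)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)   # placeholder feature

# stratify=y_toy preserves the positive rate in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=19, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```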

In [58]:
# Classifier pipelines for imbalanced data:
# scale, impute, oversample the minority class with SMOTE, then classify
pipe_lr = Pipeline([('scaler', StandardScaler()), ('imputer', IterativeImputer()), ('oversampler', SMOTE(random_state=19)), ('lr_classifier', LogisticRegression(random_state=19, max_iter=1000))])
pipe_dt = Pipeline([('scaler', StandardScaler()), ('imputer', IterativeImputer()), ('oversampler', SMOTE(random_state=19)), ('ada_classifier', AdaBoostClassifier(random_state=19))])
pipe_rf = Pipeline([('scaler', StandardScaler()), ('imputer', IterativeImputer()), ('oversampler', SMOTE(random_state=19)), ('rf_classifier', RandomForestClassifier(random_state=19))])
pipe_sv = Pipeline([('scaler', StandardScaler()), ('imputer', IterativeImputer()), ('oversampler', SMOTE(random_state=19)), ('svm_classifier', svm.SVC(random_state=19))])
pipe_cb = Pipeline([('scaler', StandardScaler()), ('imputer', IterativeImputer()), ('oversampler', SMOTE(random_state=19)), ('cb_classifier', CatBoostClassifier(iterations=10, early_stopping_rounds=5))])
pipe_kn = Pipeline([('scaler', StandardScaler()), ('imputer', IterativeImputer()), ('oversampler', SMOTE(random_state=19)), ('knn_classifier', KNeighborsClassifier(n_neighbors=3))])
pipe_xg = Pipeline([('scaler', StandardScaler()), ('imputer', IterativeImputer()), ('oversampler', SMOTE(random_state=19)), ('xgb_classifier', XGBClassifier(random_state=19))])
pipe_lg = Pipeline([('scaler', StandardScaler()), ('imputer', IterativeImputer()), ('oversampler', SMOTE(random_state=19)), ('lgbm_classifier', LGBMClassifier(random_state=19))])


pipelines = [pipe_lr, pipe_dt, pipe_rf, pipe_sv, pipe_cb, pipe_kn, pipe_xg, pipe_lg]

best_score = 0
best_classifier = 0
best_pipeline = ""

pipe_dict = {0: 'Logistic Regression', 1: 'ADA Boost', 2: 'Random Forest', 3: 'SVM', 4: 'Cat Boost', 5: 'KNN', 6: 'XG Boost', 7:'LGBM'}

# Use cross-validation to evaluate the models (mean F1 over 5 folds)
for i, model in enumerate(pipelines):
    scores = cross_val_score(model, X, y, cv=5, scoring='f1')
    print('{} Cross-Validation F1: {:.2f}'.format(pipe_dict[i], scores.mean()))
    if scores.mean() > best_score:
        best_score = scores.mean()
        best_pipeline = model
        best_classifier = i

# Print the best classifier
print('\nClassifier with the best F1 Score: {}'.format(pipe_dict[best_classifier]))


[IterativeImputer] Early stopping criterion not reached.

Logistic Regression Cross-Validation F1: 0.66
ADA Boost Cross-Validation F1: 0.66
Random Forest Cross-Validation F1: 0.63
SVM Cross-Validation F1: 0.60
Cat Boost Cross-Validation F1: 0.74
KNN Cross-Validation F1: 0.50
XG Boost Cross-Validation F1: 0.68
LGBM Cross-Validation F1: 0.62

Classifier with the best F1 Score: Cat Boost


In [59]:
# series of model scores
data = {'Logistic Regression': 0.66, 'ADA Boost': 0.66 , 'Random Forest': 0.63, 'SVM': 0.60, 'Cat Boost': 0.74, 'KNN': 0.50, 'XG Boost': 0.68, 'LGBM': 0.62}
comp = pd.Series(data, name='F1 Score')

In [60]:
# model scores
px.scatter(comp, color=comp.index, size=comp, title='Model Comparison', symbol=comp, labels={'index': 'Model', 'value': 'F1 Score'})

We see from the comparisons that the best performing model is CatBoost, and the worst is KNN. Consequently, we will pick CatBoost as our final model.
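
The models are compared with `scoring='f1'`, which is a more informative yardstick than accuracy for a target this imbalanced: the mean of `biopsy` is 0.064, so roughly 55 of the 858 rows are positive. A minimal sketch with synthetic labels shows that a model predicting "negative" for everyone would reach high accuracy while its F1 is zero:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with approximately the dataset's class ratio
y_true = np.array([1] * 55 + [0] * 803)

# Trivial model: always predict the negative class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(f"accuracy={acc:.3f}, f1={f1:.3f}")
```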

## Final Model

In [81]:
# Create the pipeline
pipeline = Pipeline([
    ('imputer', IterativeImputer(random_state=0)),
    ('scaler', StandardScaler()),
    ('classifier', CatBoostClassifier(task_type='GPU', loss_function='Logloss', eval_metric='F1', 
                                      iterations=100, random_seed=19, use_best_model=True))
])

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train, classifier__eval_set=(X_test, y_test))
pipeline_pred = pipeline.predict(X_test)
# Evaluate the pipeline on the test data
score = f1_score(y_test, pipeline_pred)
print(f'Test score: {score:.4f}')



[IterativeImputer] Early stopping criterion not reached.



Learning rate set to 0.190738
0:	learn: 0.7766990	test: 0.0000000	best: 0.0000000 (0)	total: 12.9ms	remaining: 1.27s
1:	learn: 0.8631579	test: 0.0000000	best: 0.0000000 (0)	total: 27.1ms	remaining: 1.33s
2:	learn: 0.8541667	test: 0.0000000	best: 0.0000000 (0)	total: 46.2ms	remaining: 1.49s
3:	learn: 0.8510638	test: 0.0000000	best: 0.0000000 (0)	total: 67.3ms	remaining: 1.61s
4:	learn: 0.8936170	test: 0.0000000	best: 0.0000000 (0)	total: 87.3ms	remaining: 1.66s
5:	learn: 0.9032258	test: 0.0000000	best: 0.0000000 (0)	total: 105ms	remaining: 1.65s
6:	learn: 0.8791209	test: 0.0000000	best: 0.0000000 (0)	total: 125ms	remaining: 1.67s
7:	learn: 0.9010989	test: 0.0000000	best: 0.0000000 (0)	total: 142ms	remaining: 1.63s
8:	learn: 0.9010989	test: 0.0000000	best: 0.0000000 (0)	total: 159ms	remaining: 1.6s
9:	learn: 0.8913043	test: 0.0000000	best: 0.0000000 (0)	total: 173ms	remaining: 1.55s
10:	learn: 0.9130435	test: 0.0000000	best: 0.0000000 (0)	total: 188ms	remaining: 1.52s
11:	learn: 0.923076

The best F1 score we could achieve on our test set was 0.625. This is a mediocre score, likely due to the imbalance of numerous features and of the target. SMOTE balances the target classes, not the feature values, and our EDA showed that most features were dominated by their negative category. We would likely see improvements with a larger dataset that has more balanced feature categories.
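
One point worth checking in the cell above: fit parameters routed with the `classifier__` prefix reach the final estimator unchanged, so the `X_test` passed as `eval_set` does not go through the imputer or scaler. A minimal sketch, using scikit-learn's `Pipeline` and a random toy array (both assumptions, not the notebook's code), of fitting the preprocessing steps on the training data and applying them to held-out data before it is handed to a model as a validation set:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (random, for illustration only) with injected missing values
rng = np.random.default_rng(19)
X_tr = rng.normal(size=(40, 3))
X_val = rng.normal(size=(10, 3))
X_tr[::7, 1] = np.nan
X_tr[3::7, 2] = np.nan
X_val[::4, 2] = np.nan

# Same preprocessing steps as the final pipeline, without the classifier
preprocess = Pipeline([
    ("imputer", IterativeImputer(random_state=0)),
    ("scaler", StandardScaler()),
])

# Fit on the training data only, then reuse the fitted steps on held-out data
X_tr_ready = preprocess.fit_transform(X_tr)
X_val_ready = preprocess.transform(X_val)

print(np.isnan(X_val_ready).sum())
```

The transformed `X_val_ready`, paired with its labels, is what would be passed as the validation set. Carving that set out of the training data rather than the test set would also keep the reported test score independent of model selection.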

In [98]:
# CatBoost pipeline feature importance


cat_classifier = pipeline.named_steps['classifier']
cat_importances = cat_classifier.feature_importances_
cat_indices = np.argsort(cat_importances)[::-1]

top_10_features = []
for f in range(10):
    feature_index = cat_indices[f]
    feature_name = X_train.columns[feature_index]
    top_10_features.append((feature_name, cat_importances[feature_index]))

cat_top_10_df = pd.DataFrame(top_10_features, columns=['Feature', 'Importance'])
print("CatBoost Top 10 Feature Importance:")
print(cat_top_10_df)

CatBoost Top 10 Feature Importance:
                         Feature  Importance
0                       schiller   85.724047
1                       citology    7.742365
2  std_time_since_last_diagnosis    6.533588
3      number_of_sexual_partners    0.000000
4       first_sexual_intercourse    0.000000
5             num_of_pregnancies    0.000000
6                         smokes    0.000000
7                   smokes_years    0.000000
8              smokes_packs/year    0.000000
9     std_vaginal_condylomatosis    0.000000


In [103]:
px.bar(cat_top_10_df.head(3), y='Importance', x='Feature', title='Most Important Features in Predicting Cervical Cancer', template='ggplot2')

During Schiller's test, an iodine solution is applied to the cervix under direct vision. Normal cervical mucosa contains glycogen and stains brown. Abnormal areas, such as early cervical cancer, do not take up the stain. Cervical cytology, known as a Pap smear, is a cervical screening method used to detect potentially precancerous and cancerous processes in the cervix. Abnormal results are followed up by more sensitive diagnostic procedures, and sometimes by interventions that aim to prevent progression to cervical cancer.

Medical research suggests that smoking and HPV increase the likelihood of cervical cancer, yet we did not see that in this dataset. This was likely due to the imbalance of those features, as well as of other important ones.

In [84]:
# Display AUC ROC
probabilities_valid = pipeline.predict_proba(X_test)
probabilities_one_valid = probabilities_valid[:, 1]

auc_roc = roc_auc_score(y_test, probabilities_one_valid)

print(auc_roc)

# ROC AUC curve of results
fpr, tpr, thresholds = roc_curve(y_test, probabilities_one_valid)

fig = px.area(
    x=fpr, y=tpr,
    title=f'ROC Curve (AUC={auc(fpr, tpr):.4f})',
    labels=dict(x='False Positive Rate', y='True Positive Rate'),
    width=800, height=600
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)

fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()

0.7791411042944786


The AUC ROC score shows the model is better than random predictions, which would have a score of 0.5. However, this dataset had bias towards the negative cases for many features, which severely reduces the validity of the results when used to make general conclusions. Overall, imbalanced features lead to a biased model that performs poorly on new data, especially considering features of the minority class. 
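
The AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (ties counting one half), which is why 0.5 corresponds to random ranking. A minimal sketch with made-up scores that checks this reading against `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores, for illustration only
y_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90])

pos = scores[y_toy == 1]
neg = scores[y_toy == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count one half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sk_auc = roc_auc_score(y_toy, scores)
print(pairwise_auc, sk_auc)
```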

## Conclusions

Overall, we were not satisfied with the data used for this project. Both the features and the target were severely imbalanced. Consequently, the model was biased, and we were unable to obtain reliable feature importance comparisons. We would need a larger dataset with more balanced features and a balanced target.