The goal of this project is to build a model that predicts biopsy results for patients at risk of developing cervical cancer. The dataset was accessed on Kaggle and contains 36 columns: 35 predictor variables consisting primarily of patient history information, and a 'Biopsy' column containing a binary biopsy result for each patient. Because a biopsy is an invasive medical procedure, the ability to predict biopsy results from patient history is a valuable tool that could help prevent unnecessary biopsies. The model used for this project is a support vector machine (SVM).

A primary challenge in developing this model is dealing with missing data. The full dataset contains data from 858 patients. However, only 59 of those 858 patients have complete rows without any missing values. A model built only on these 59 patients would be far less reliable than one built on the full dataset. To use the full dataset, missing values must be replaced with values that allow those patients to be included without heavily influencing the model's predictions. This is done with a combination of linear and logistic regression, using each patient's age to estimate missing values that are consistent with the rest of the dataset.

The first step in this project is to import the necessary Python libraries.

In [1]:
import pandas
import sklearn
from sklearn import svm
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LinearRegression
import math
import random

Next, the dataset is imported from the CSV file downloaded from Kaggle. All rows containing a '?', which marks a missing data point, are then dropped to perform a preliminary analysis on the complete rows of this dataset.

In [2]:
#Import dataset as Pandas df
dataset = pandas.read_csv('/Users/milesmarkey/Downloads/kag_risk_factors_cervical_cancer.csv')
dataset.head()

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18,4.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,?,?,0,0,0,0,0,0,0,0
1,15,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,?,?,0,0,0,0,0,0,0,0
2,34,1.0,?,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,?,?,0,0,0,0,0,0,0,0
3,52,5.0,16.0,4.0,1.0,37.0,37.0,1.0,3.0,0.0,...,?,?,1,0,1,0,0,0,0,0
4,46,3.0,21.0,4.0,0.0,0.0,0.0,1.0,15.0,0.0,...,?,?,0,0,0,0,0,0,0,0


In [3]:
#Data preprocessing - replace all '?' with NaN, then remove all rows containing NaN

dataset.replace('?',np.nan,inplace=True)
droppedDs = dataset.dropna(axis=0,how='any',inplace=False)
print(droppedDs.shape)
#Note the resulting dataset is very small. 

(59, 36)
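
The '?' placeholders can also be converted while the file is read, which leaves numeric columns with a numeric dtype instead of strings. The following is a minimal sketch on a small inline CSV; the three-row sample and the names `sample_csv`, `df` and `complete_rows` are illustrative and are not part of the Kaggle file.

```python
import io
import pandas as pd

# Tiny stand-in for the Kaggle CSV; '?' marks a missing value.
sample_csv = io.StringIO(
    "Age,Smokes,Biopsy\n"
    "18,0.0,0\n"
    "34,?,0\n"
    "52,1.0,1\n"
)

# na_values='?' turns every '?' into NaN at load time.
df = pd.read_csv(sample_csv, na_values='?')

print(df.dtypes)        # 'Smokes' is parsed as float64 rather than object
print(df.isna().sum())  # number of missing values per column

# Keep only the rows without any missing value.
complete_rows = df.dropna(axis=0, how='any')
print(complete_rows.shape)
```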


The SVM model is built by separating the 'Biopsy' column from the other 35 columns. The Biopsy column is used as the dependent variable while the other columns are used as the independent variables. These variables are then split 80/20 for training/testing data, then fit to a linear SVM model. 

In [4]:
#Separate independent variables from the dependent variable (Here the variable to be predicted is 'Biopsy')

X = droppedDs.drop('Biopsy',axis=1)
y = droppedDs.Biopsy

In [5]:
#Use scikit-learn's train/test split to randomly split data for training/testing (80/20 split)

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

In [6]:
#Initialize SVM classifier with Linear Kernel function:
Linear_classifier = svm.SVC(kernel='linear')

In [7]:
#Train the classifiers

Linear_classifier.fit(X_train,y_train)

SVC(kernel='linear')

In [8]:
y_pred = Linear_classifier.predict(X_test)
accuracy = sklearn.metrics.accuracy_score(y_test,y_pred)
print(accuracy)
print(confusion_matrix(y_test,y_pred))

0.8333333333333334
[[8 0]
 [2 2]]


We can see here that the model has a reasonably high accuracy, but the sample size is very small because every row containing a NaN was dropped from the dataset. Dropping so much data has also introduced a bias: the remaining rows include only patients who have had an STD diagnosis, since all other patients have a NaN in the 27th and 28th columns. To address this problem, columns 27 and 28 (containing the time since first and last STD diagnosis, which the large majority of patients do not have) will be dropped first, and then all rows containing NaNs will be dropped. This leaves a much larger dataset than before, since we are temporarily ignoring the two columns where most of the missing data is concentrated.

After doing this, the data is divided into X/y variables, split for training/testing, and used to fit an SVM model once again, as was done previously.

In [9]:
std_first = dataset['STDs: Time since first diagnosis']
std_last = dataset['STDs: Time since last diagnosis']
newDs1 = dataset.drop('STDs: Time since first diagnosis',axis=1)
newDs2 = newDs1.drop('STDs: Time since last diagnosis',axis=1)
newDs3 = newDs2.dropna(axis=0,how='any',inplace=False)
print(newDs3.shape)

(668, 34)


In [10]:
#Separate independent/dependent variable of new dataset
X2 = newDs3.drop('Biopsy',axis=1)
y2 = newDs3.Biopsy

In [11]:
#Train/test split
X_train,X_test,y_train,y_test = train_test_split(X2,y2,test_size=0.2)

In [12]:
Linear_classifier = svm.SVC(kernel='linear')
Linear_classifier.fit(X_train,y_train)

SVC(kernel='linear')

In [13]:
y_pred = Linear_classifier.predict(X_test)
accuracy = sklearn.metrics.accuracy_score(y_test,y_pred)
print(accuracy)
print(confusion_matrix(y_test,y_pred))

0.917910447761194
[[110   9]
 [  2  13]]


This model is considerably stronger than the original model. The accuracy is somewhat higher (about 0.92 versus 0.83), but more important is the fact that the model is now trained and tested on a much larger dataset, which makes the results more reliable.
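
Accuracy alone can hide how well the positive biopsies are detected, so it helps to read a few more numbers off the two confusion matrices printed above. The sketch below recomputes them by hand; the helper name `summarize` is hypothetical, and it assumes scikit-learn's layout (rows are true labels, columns are predicted labels) with the negative class as the majority.

```python
import numpy as np

def summarize(cm):
    """Return accuracy, positive-class recall and precision, and the
    accuracy obtained by always predicting the negative class."""
    cm = np.asarray(cm)
    tn, fp = cm[0]  # true label 0: predicted 0, predicted 1
    fn, tp = cm[1]  # true label 1: predicted 0, predicted 1
    total = cm.sum()
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "majority_baseline": (tn + fp) / total,
    }

# Confusion matrices copied from the two outputs above.
first_model = summarize([[8, 0], [2, 2]])
second_model = summarize([[110, 9], [2, 13]])

for name, stats in (("59 rows", first_model), ("668 rows", second_model)):
    print(name, {k: round(float(v), 3) for k, v in stats.items()})
```

On these particular test splits, the second model identifies 13 of the 15 positive biopsies (recall of about 0.87) compared with 2 of 4 for the first, and its accuracy of about 0.918 should be read against the 0.888 that would be obtained by always predicting a negative biopsy.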

The final model will be built using the full dataset (i.e., all 858 rows, rather than the 668 used previously). This will be done by replacing all remaining NaN values with sensible values. While these values will not represent actual data points, they will allow the patients to be included in the model training. This will be done column by column.

In [14]:
age = dataset.Age
test = age.dropna(axis=0,how='any',inplace=False)
print(age.shape)
print(test.shape)
#All patients have an age. This will be used to approximate other values for these patients

(858,)
(858,)


In [15]:
columnsMissingData = []
for (columnName,columnData) in newDs2.items():  #items() replaces iteritems(), which was removed in pandas 2.0
    varNonNanData = columnData.dropna(axis=0,how='any',inplace=False)
    if not (len(varNonNanData) == len(newDs2)):
        columnsMissingData.append(columnName)
print(columnsMissingData)

['Number of sexual partners', 'First sexual intercourse', 'Num of pregnancies', 'Smokes', 'Smokes (years)', 'Smokes (packs/year)', 'Hormonal Contraceptives', 'Hormonal Contraceptives (years)', 'IUD', 'IUD (years)', 'STDs', 'STDs (number)', 'STDs:condylomatosis', 'STDs:cervical condylomatosis', 'STDs:vaginal condylomatosis', 'STDs:vulvo-perineal condylomatosis', 'STDs:syphilis', 'STDs:pelvic inflammatory disease', 'STDs:genital herpes', 'STDs:molluscum contagiosum', 'STDs:AIDS', 'STDs:HIV', 'STDs:Hepatitis B', 'STDs:HPV']


This shows a list of the columns that contain at least one NaN value. For some of these columns (Number of sexual partners, First sexual intercourse, etc.) it may make sense to replace the NaN values with the average value of the column. Others, such as 'Smokes' or 'IUD', are binary, so their NaN values should be replaced with a '1' or a '0', depending on the frequency of each value.
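
As a point of reference, the basic replacement described here can be sketched in a few lines of pandas. The data frame `toy` below is made up for illustration (the column names mirror the dataset, the values do not), and using the mode for the binary column is one reasonable reading of "depending on the frequency".

```python
import numpy as np
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "Num of pregnancies": [1.0, 4.0, np.nan, 3.0],
    "Smokes": [0.0, 1.0, 0.0, np.nan],
})

filled = toy.copy()

# Continuous column: replace NaN with the column mean.
filled["Num of pregnancies"] = filled["Num of pregnancies"].fillna(
    filled["Num of pregnancies"].mean()
)

# Binary column: replace NaN with the most frequent value (the mode).
filled["Smokes"] = filled["Smokes"].fillna(filled["Smokes"].mode()[0])

print(filled)
```
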
One potential improvement on this basic method of replacement by average value or most frequent value is to account for a patient's age, since many of these variables will logically depend on age. This can be done by creating a simple regression model with age as the independent variable and the variable of interest as the dependent variable. Given the age of a patient with a missing data point, the model returns an age-adjusted estimate for the missing variable. This is done using the functions below:

In [16]:
def truncate(n, decimals=0):
    multiplier = 10 ** decimals
    return int(n * multiplier) / multiplier

def createRegressionModel(X,y):
    X = X.to_numpy().reshape(-1,1)
    y = y.to_numpy()
    model = LinearRegression()
    model.fit(X,y)
    return model

def fillInMissingValues(inputDataset,dependentVariable):
    #Fit age -> dependentVariable using every patient with a known value for that variable
    subDataset = dataset[['Age',dependentVariable]]
    nonMissingDataset = subDataset.dropna(axis=0,how='any',inplace=False)
    X = nonMissingDataset['Age']
    y = nonMissingDataset[dependentVariable].astype(float)
    trainedModel = createRegressionModel(X,y)
    for idx in inputDataset.index:
        if np.isnan(float(inputDataset[dependentVariable][idx])):
            #Predict the missing value from the patient's age, truncated to one decimal place
            patientAge = np.array([[inputDataset['Age'][idx]]])
            replaceVal = truncate(trainedModel.predict(patientAge)[0],1)
            inputDataset.loc[idx, dependentVariable] = replaceVal
    


fillInMissingValues(newDs2,'Number of sexual partners')
fillInMissingValues(newDs2,'First sexual intercourse')
fillInMissingValues(newDs2,'Num of pregnancies')
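
The same age-based fill can also be written without the row loop. The sketch below is an alternative formulation on made-up data, not the code used in this notebook; the helper name `impute_from_age` is hypothetical. It fits on the rows where the target is known and predicts all missing rows in one call.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_from_age(df, target):
    """Hypothetical helper: fill NaNs in `target` with a linear fit on 'Age'."""
    out = df.copy()
    out[target] = pd.to_numeric(out[target], errors="coerce")
    known = out[target].notna()
    if known.all():
        return out  # nothing to fill
    # Fit on rows with a known target, then predict the missing rows.
    model = LinearRegression().fit(out.loc[known, ["Age"]], out.loc[known, target])
    out.loc[~known, target] = model.predict(out.loc[~known, ["Age"]])
    return out

# Illustrative data: the known points lie on the line y = 0.1 * Age - 1.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 35],
    "Num of pregnancies": [1.0, 2.0, 3.0, 4.0, np.nan],
})
result = impute_from_age(toy, "Num of pregnancies")
print(result)
```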

In [17]:
columnsMissingData = []
for (columnName,columnData) in newDs2.items():  #items() replaces iteritems(), which was removed in pandas 2.0
    varNonNanData = columnData.dropna(axis=0,how='any',inplace=False)
    if not (len(varNonNanData) == len(newDs2)):
        columnsMissingData.append(columnName)
print(columnsMissingData)

['Smokes', 'Smokes (years)', 'Smokes (packs/year)', 'Hormonal Contraceptives', 'Hormonal Contraceptives (years)', 'IUD', 'IUD (years)', 'STDs', 'STDs (number)', 'STDs:condylomatosis', 'STDs:cervical condylomatosis', 'STDs:vaginal condylomatosis', 'STDs:vulvo-perineal condylomatosis', 'STDs:syphilis', 'STDs:pelvic inflammatory disease', 'STDs:genital herpes', 'STDs:molluscum contagiosum', 'STDs:AIDS', 'STDs:HIV', 'STDs:Hepatitis B', 'STDs:HPV']
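
The column check used in the loop above can be condensed with `DataFrame.isna`. A short sketch on illustrative data (the frame `toy` is made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [18, 15, 34],
    "Smokes": [0.0, np.nan, 1.0],
    "IUD": [np.nan, 0.0, 0.0],
})

# isna().any() is True for each column holding at least one NaN.
columns_missing = toy.columns[toy.isna().any()].tolist()
print(columns_missing)

# Number of missing values per column.
print(toy.isna().sum())
```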


In [18]:
print(newDs2.shape)
newDs2.head()

(858, 34)


Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs:HPV,STDs: Number of diagnosis,Dx:Cancer,Dx:CIN,Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18,4.0,15.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
1,15,1.0,14.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
2,34,1.0,18.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0
3,52,5.0,16.0,4.0,1.0,37.0,37.0,1.0,3.0,0.0,...,0.0,0,1,0,1,0,0,0,0,0
4,46,3.0,21.0,4.0,0.0,0.0,0.0,1.0,15.0,0.0,...,0.0,0,0,0,0,0,0,0,0,0


In [19]:
age = newDs2.Age
test = age.dropna(axis=0,how='any',inplace=False)
print(age.shape)
print(test.shape)

(858,)
(858,)


The output above shows that this function replaced the missing continuous values with approximations based on the patient's age. For the non-continuous variables, however, a different function must be used. The goal of this function is to approximate an individual's likelihood of falling into a particular category based on their age. Logistic regression will be used to do this:

In [20]:
from sklearn.linear_model import LogisticRegression

def createLogRegressionModel(X,y):
    X = X.to_numpy().reshape(-1,1)
    y = y.to_numpy()
    model = LogisticRegression()
    model.fit(X,y)
    return model

def fillInMissingCategoricalValues(inputDataset,dependentVariable):
    #Fit age -> P(dependentVariable == 1) using every patient with a known value for that variable
    subDataset = dataset[['Age',dependentVariable]]
    nonMissingDataset = subDataset.dropna(axis=0,how='any',inplace=False)
    X = nonMissingDataset['Age']
    y = nonMissingDataset[dependentVariable].astype(float).astype(int)
    trainedModel = createLogRegressionModel(X,y)
    for idx in inputDataset.index:
        if np.isnan(float(inputDataset[dependentVariable][idx])):
            #predict_proba returns [P(0), P(1)]; take P(1) for this patient's age
            patientAge = np.array([[inputDataset['Age'][idx]]])
            replaceValProbability = trainedModel.predict_proba(patientAge)[0][1]
            #Assign 1 with that probability so the filled values follow the age trend
            if (random.random()<replaceValProbability):
                replaceVal = 1
            else:
                replaceVal = 0
            inputDataset.loc[idx, dependentVariable] = replaceVal

fillInMissingCategoricalValues(newDs2,'Smokes')
fillInMissingCategoricalValues(newDs2,'Hormonal Contraceptives')
fillInMissingCategoricalValues(newDs2,'IUD')

In [21]:
columnsMissingData = []
for (columnName,columnData) in newDs2.items():  #items() replaces iteritems(), which was removed in pandas 2.0
    varNonNanData = columnData.dropna(axis=0,how='any',inplace=False)
    if not (len(varNonNanData) == len(newDs2)):
        columnsMissingData.append(columnName)
print(columnsMissingData)

['Smokes (years)', 'Smokes (packs/year)', 'Hormonal Contraceptives (years)', 'IUD (years)', 'STDs', 'STDs (number)', 'STDs:condylomatosis', 'STDs:cervical condylomatosis', 'STDs:vaginal condylomatosis', 'STDs:vulvo-perineal condylomatosis', 'STDs:syphilis', 'STDs:pelvic inflammatory disease', 'STDs:genital herpes', 'STDs:molluscum contagiosum', 'STDs:AIDS', 'STDs:HIV', 'STDs:Hepatitis B', 'STDs:HPV']


Now the categorical variables 'Smokes', 'Hormonal Contraceptives' and 'IUD' have been filled in. The next step is to fill in the missing values in the related columns:
'Smokes (years)', 'Smokes (packs/year)', 'Hormonal Contraceptives (years)', 'IUD (years)'
These variables are unique because they will only be non-zero for patients where 'Smokes', 'Hormonal Contraceptives' or 'IUD' == 1. This means that they cannot be filled in by simple linear regression on age across all patients in the dataframe, as the other continuous variables were. Instead, we will first fill in '0' wherever the parent variable ('Smokes', 'Hormonal Contraceptives' or 'IUD') == 0 (including those that were randomly assigned by our logistic regression). For the remaining missing values, linear regression will be performed based only on the patients whose parent variable == 1. This is done below.
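
A minimal sketch of this two-step fill, on made-up data, is given here. It is an illustration under stated assumptions rather than the notebook's own implementation: the helper name `fill_child_column` is hypothetical, and clipping predictions at zero is an added assumption (a duration cannot be negative).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fill_child_column(df, parent, child):
    """Hypothetical helper: fill NaNs in `child` given a binary `parent` column."""
    out = df.copy()
    out[child] = pd.to_numeric(out[child], errors="coerce")
    parent_vals = pd.to_numeric(out[parent], errors="coerce")

    # Step 1: the child must be 0 wherever the parent is 0.
    out.loc[out[child].isna() & (parent_vals == 0), child] = 0.0

    # Step 2: fit age -> child only on patients whose parent == 1
    # and whose child value is known, then predict the remaining gaps.
    users = parent_vals == 1
    train = users & out[child].notna()
    todo = users & out[child].isna()
    if todo.any() and train.any():
        model = LinearRegression().fit(out.loc[train, ["Age"]], out.loc[train, child])
        predicted = model.predict(out.loc[todo, ["Age"]])
        # Assumption: clip at 0 because a duration cannot be negative.
        out.loc[todo, child] = np.clip(predicted, 0, None)
    return out

# Illustrative data: among smokers, years smoked = Age - 18.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 25],
    "Smokes": [1, 1, 0, 1, 0],
    "Smokes (years)": [2.0, 12.0, 0.0, np.nan, np.nan],
})
result = fill_child_column(toy, "Smokes", "Smokes (years)")
print(result)
```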

The final set of variables that need to be replaced are the STD variables. The primary condition that must be upheld for these variables is that the sum of all child variables must equal the parent value, 'STDs'. This will be done by first assigning each categorical child variable based on logistic regression, with age as the independent variable, then assigning the parent value for each row by finding the sum of the child values in that row. This is done below:
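
One way to sketch this procedure on made-up data follows. The helper name `fill_std_columns`, the seeded random generator, and the fallback used when only one class is observed are all assumptions introduced for illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fill_std_columns(df, parent, children, seed=0):
    """Hypothetical helper: sample missing binary children from an age-based
    logistic fit, then set `parent` to the row sum of the children."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for child in children:
        known = out[child].notna()
        missing = ~known
        if not missing.any():
            continue
        if out.loc[known, child].nunique() < 2:
            # Assumption: with fewer than two observed classes no logistic
            # model can be fit, so reuse the observed value (or 0 if none).
            out.loc[missing, child] = out.loc[known, child].iloc[0] if known.any() else 0.0
            continue
        model = LogisticRegression().fit(
            out.loc[known, ["Age"]], out.loc[known, child].astype(int)
        )
        # Column 1 of predict_proba is P(child == 1) for each missing row.
        p_one = model.predict_proba(out.loc[missing, ["Age"]])[:, 1]
        out.loc[missing, child] = (rng.random(p_one.shape[0]) < p_one).astype(float)
    # Parent value is the sum of the child values in each row.
    out[parent] = out[children].sum(axis=1)
    return out

# Illustrative data only.
toy = pd.DataFrame({
    "Age": [18, 22, 30, 41, 47, 52, 26, 38],
    "STDs:HIV": [0, 0, 0, 1, 1, 1, np.nan, 0],
    "STDs:syphilis": [0, 0, 0, 0, 0, 0, 0, np.nan],
    "STDs": [0, 0, 0, 1, 1, 1, np.nan, np.nan],
})
children = ["STDs:HIV", "STDs:syphilis"]
result = fill_std_columns(toy, "STDs", children)
print(result)
```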