# Intro

In this notebook, we will run some models that help us predict whether a patient has cervical cancer. Then, we will answer some questions about the models themselves and their predictions. The point of this exercise is for you to look at the results from the models, not to practice your coding, so I am providing all the necessary code for you. This also means that you are expected to discuss the questions and type your answers.

The data is from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/383/cervical+cancer+risk+factors). This is the description from the source website: 

    This dataset focuses on the prediction of indicators/diagnosis of cervical cancer. The features cover demographic information, habits, and historic medical records. The dataset was collected at 'Hospital Universitario de Caracas' in Caracas, Venezuela. The dataset comprises demographic information, habits, and historic medical records of 858 patients. Several patients decided not to answer some of the questions because of privacy concerns (missing values).
    
|Variable Name | Type| Meaning |
| --- | --- | --- |
|Age| int | Age of subject |
|n_pregnancy| int | Num of pregnancies|
|Smokes_years| int | Number of years smoking|
|horm_cont_yrs| int | Number of years with hormonal contraceptives|
|IUD_yrs| int | Number of years with an IUD|
|dx_cancer|bool|Previously diagnosed with cancer|
|dx_hpv|bool|Previously diagnosed with HPV|
|Biopsy|bool|Results from biopsy|


# Necessary Code

Update the path for the data file in the second cell. Then, run each of the following cells. The easiest way to do this is by going to "Cell" and then selecting "Run All."


In [166]:
# Loading necessary packages
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [177]:
## Loading data
clean_cervical_cancer = pd.read_csv("/Users/marilyn.vazquez/Documents/GitHub/BryanProgram2024/WorkshopActivities/updated_risk_factors_cervical_cancer.csv")
print('Original shape: ' + str(clean_cervical_cancer.shape))

## Cleaning the data

# Getting rid of rows with missing values
clean_cervical_cancer = clean_cervical_cancer.replace("?",np.nan)
print('Number of missing values = ' + str(clean_cervical_cancer.isnull().sum().sum()))
rows_not_nans = ~clean_cervical_cancer.isnull().any(axis=1)
clean_cervical_cancer = clean_cervical_cancer.loc[rows_not_nans]
print('New shape after removing observations with missing data: ' + str(clean_cervical_cancer.shape))
print('Number of missing values = ' + str(clean_cervical_cancer.isnull().sum().sum()))

clean_cervical_cancer.head()

Original shape: (858, 8)
Number of missing values = 294
New shape after removing observations with missing data: (689, 8)
Number of missing values = 0


Unnamed: 0,Age,n_pregnancy,Smokes_years,horm_cont_yrs,IUD_yrs,dx_cancer,dx_hpv,Biopsy
0,18,1,0,0,0,0,0,0
1,15,1,0,0,0,0,0,0
2,34,1,0,0,0,0,0,0
3,52,4,37,3,0,1,1,0
4,46,4,0,15,0,0,0,0
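
The cleaning steps above (replace `"?"` with `NaN`, then keep only complete rows) can also be written more compactly. The following is a minimal sketch, assuming a tiny made-up CSV in place of the real file; the sample values are illustrative only and are not taken from the dataset.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for the real CSV (illustrative values only)
sample_csv = io.StringIO(
    "Age,n_pregnancy,Smokes_years,Biopsy\n"
    "18,1,0,0\n"
    "15,?,0,0\n"
    "34,1,?,1\n"
    "52,4,37,0\n"
)

# na_values="?" converts the "?" placeholder to NaN while reading the file
df = pd.read_csv(sample_csv, na_values="?")
n_missing = int(df.isnull().sum().sum())

# dropna() keeps only the rows that have no missing entries
clean = df.dropna()

print("Number of missing values =", n_missing)
print("Shape before:", df.shape, "and after:", clean.shape)
```

Reading with `na_values` also lets pandas parse the affected columns as numbers instead of text, which is one reason this form is sometimes preferred.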


In [117]:
## Splitting the data for training and testing

# Dividing features from labels
cols = clean_cervical_cancer.columns[0:7]
risk_factors = np.array(clean_cervical_cancer.iloc[:,0:7]).astype('float')
biopsy = np.array(clean_cervical_cancer['Biopsy']).astype('float')

# Transforming the data for training and testing
risk_factors = (risk_factors - risk_factors.mean())/risk_factors.std()

# Putting it back to pandas
risk_factors = pd.DataFrame(risk_factors,columns=cols)

# Splitting
X_train, X_test, y_train, y_test = train_test_split(
    risk_factors, biopsy, test_size=0.25, random_state=0)

print('Number of risk factors = ' + str(np.shape(X_train)[1]))
print('Number of training observations = '+ str(np.shape(X_train)[0]))
print('Number of testing observations = ' + str(np.shape(X_test)[0]))

Number of risk factors = 7
Number of training observations = 516
Number of testing observations = 173
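
One detail of the transformation step is worth noticing: on a NumPy array, `.mean()` and `.std()` with no `axis` argument return a single number computed over every entry, so all columns are shifted and scaled by the same amount. The sketch below, using a made-up feature matrix, compares that with per-column standardization (`axis=0`). It is shown for comparison only, and the outputs in this notebook come from the code as written above.

```python
import numpy as np

# Made-up feature matrix: 4 observations, 2 features on very different scales
X = np.array([[18.0, 0.0],
              [34.0, 1.0],
              [52.0, 0.0],
              [46.0, 1.0]])

# Global scaling: one mean and one standard deviation for the whole array
X_global = (X - X.mean()) / X.std()

# Per-column scaling: axis=0 gives one mean and one standard deviation per feature
X_per_column = (X - X.mean(axis=0)) / X.std(axis=0)

print("Column means after global scaling:    ", X_global.mean(axis=0))
print("Column means after per-column scaling:", X_per_column.mean(axis=0))
```

With per-column scaling every feature ends up with mean 0 and standard deviation 1, which matters for distance-based methods such as KNN.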


### Questions about the data

The previous code went through some cleaning. Before we move on to the actual models, let's make sure we understand the basics of the data and what was done to clean it.

Answer the following questions:

1. What are the risk factors for cervical cancer in this data set?

2. How many patients were in the original data?

3. How many patients had missing information in the original data?

4. How many missing values are there in the clean version of the data?

5. What type of data is contained in the column "Biopsy" and how is it encoded?

6. How many observations ended up in the training data? What about the testing data?

7. What types of plots would you do to explore this dataset?

# Models

Now we will move on to training and evaluating two models: logistic regression and K-Nearest Neighbors. 

In [157]:
# Training the log regression
log_reg_model = LogisticRegression(random_state=0).fit(X_train, y_train)
log_reg_model.predict(X_train)

# Pretty table: one row per risk factor, with its coefficient and odds ratio
coefficients = log_reg_model.coef_.ravel()
pd.DataFrame({'Coefficients': coefficients,
              'Odds Ratio': np.exp(coefficients)},
             index=cols)


Unnamed: 0,Coefficients,Odds Ratio
Age,0.064651,1.066787
n_pregnancy,0.039389,1.040175
Smokes_years,0.381729,1.464815
horm_cont_yrs,0.748853,2.114573
IUD_yrs,0.65017,1.915866
dx_cancer,0.26947,1.30927
dx_hpv,0.278363,1.320966
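
The "Odds Ratio" column is the exponential of the "Coefficients" column, as the table code shows. A minimal sketch that reproduces it from the coefficients printed above, using only the standard library:

```python
import math

# Coefficients copied from the table above
coefficients = {
    "Age": 0.064651,
    "n_pregnancy": 0.039389,
    "Smokes_years": 0.381729,
    "horm_cont_yrs": 0.748853,
    "IUD_yrs": 0.65017,
    "dx_cancer": 0.26947,
    "dx_hpv": 0.278363,
}

# Odds ratio = exp(coefficient): the multiplicative change in the odds of a
# positive biopsy for a one-unit increase in that (rescaled) feature
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.6f}")
```

Because the features were rescaled before training, "one unit" here means one unit of the rescaled feature, not one year or one pregnancy.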


In [159]:
# Evaluating the log regression on the training set
print('Average accuracy of Log regression on training set:')
log_reg_model.score(X_train,y_train)

Average accuracy of Log regression on training set:


0.9282945736434108

8. After running the previous code, interpret the results.


In [163]:
# Training the KNN
knn_model = KNeighborsClassifier(n_neighbors=16).fit(X_train, y_train)

In [164]:
# Testing the KNN
print('Average accuracy of KNN on testing set:')
knn_model.score(X_test,y_test)

Average accuracy of KNN on testing set:


0.9421965317919075
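
When comparing the two scores, it may help to turn each accuracy back into a count of correct predictions. The sketch below does that arithmetic with the numbers printed above. Note that, in the code as written, the logistic regression is scored on `X_train` while the KNN is scored on `X_test`, so the two numbers describe different sets of patients.

```python
# Values copied from the outputs above
log_reg_accuracy, n_train = 0.9282945736434108, 516  # scored on X_train
knn_accuracy, n_test = 0.9421965317919075, 173       # scored on X_test

# Accuracy = correct predictions / total predictions, so multiply back
log_reg_correct = round(log_reg_accuracy * n_train)
knn_correct = round(knn_accuracy * n_test)

print(f"Log regression: {log_reg_correct} of {n_train} correct")
print(f"KNN:            {knn_correct} of {n_test} correct")
print(f"Difference in accuracy: {knn_accuracy - log_reg_accuracy:.4f}")
```

A like-for-like comparison would score both models on the same set, for example by calling `score(X_test, y_test)` on each.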

9. After running the previous code, interpret the results.

10. Next, compare and contrast what you learned from both models.

11. Which method was more accurate and by how much?

12. When would you use the logistic model? Why?

13. When would you use KNN? Why?

14. Suppose that the same values in the data are used to train and test both models, but instead of the columns representing risk factors for cervical cancer, they represent characteristics of applicants for a job. Also, instead of the labels being biopsy results, they indicate whether or not the applicant got the job. Since the numbers are the same, when you feed them to the models, the resulting values will be the same. How do the interpretations change?