# Mammography Classification Model
### By Entai Lin

Throughout this project, you will see me clean a mammography dataset (cited below) and use it to try to construct a model that can predict a patient's diagnosis of cancer (cancer_c) a year after the mammography screening.

### Key skills
- Data cleaning
- Exploratory data analysis
- Model building
- Model optimisation
- Using the scikit-learn library
- Machine Learning


#### Citations

Data collection and sharing was supported by the National Cancer Institute-funded Breast Cancer Surveillance Consortium (HHSN261201100031C). You can learn more about the BCSC at: http://www.bcsc-research.org/.

Dataset Documentation : https://www.bcsc-research.org/index.php/datasets/mammography_dataset/dataset_documentation

In [2]:
import numpy as np
import pandas as pd
mammography = pd.read_csv('mammography.csv')
mammography

Unnamed: 0,age_c,assess_c,cancer_c,compfilm_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,CaTypeO,bmi_c,ptid
0,62,1,0,1,2,0,0,1,0,1,8,24.023544,1
1,65,1,0,1,4,0,0,1,0,1,8,-99.000000,2
2,69,0,0,1,2,0,0,1,0,1,8,29.052429,3
3,64,2,0,1,2,0,0,1,0,1,8,-99.000000,4
4,63,3,0,1,2,0,0,1,1,1,8,33.729523,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,80,1,0,1,1,0,0,1,0,1,8,-99.000000,36711
39996,78,1,0,1,3,0,0,1,0,1,8,-99.000000,36712
39997,77,1,0,1,2,0,0,1,0,1,8,-99.000000,36712
39998,66,1,0,1,2,0,0,1,1,1,8,-99.000000,36713


# Data cleaning

## Process Outline:

- Create NA values for values coded as missing
- Remove obsolete variables 
- Fill missing values using KNNImputer
- Move the target variable (cancer_c) to the last column in the dataframe
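
The steps in this outline can be tried on a toy table first. The following is a minimal sketch, assuming a made-up three-row dataframe called `toy` (not the BCSC data) that uses the same missing-value codes: 9 for a missing categorical value and -99 for a missing BMI.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data using the same missing-value codes as the dataset
toy = pd.DataFrame({
    "age_c": [62, 63, 80],
    "famhx_c": [0, 9, 1],          # 9 = missing
    "bmi_c": [24.0, 25.0, -99.0],  # -99 = missing
    "cancer_c": [0, 0, 1],
})

# 1. Turn the missing-value codes into NaN
toy["famhx_c"] = toy["famhx_c"].replace(9, np.nan)
toy["bmi_c"] = toy["bmi_c"].replace(-99, np.nan)

# 2. Fill each NaN with the value from the single nearest row
imputer = KNNImputer(n_neighbors=1)
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# 3. Move the target variable to the last column
target = toy_imputed.pop("cancer_c")
toy_imputed["cancer_c"] = target
print(toy_imputed)
```

With `n_neighbors=1` every filled value is copied from one existing row, so imputed categorical codes stay valid codes rather than becoming averages.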

In [4]:
from sklearn.impute import KNNImputer
fill_imputer = KNNImputer(n_neighbors=1)
# I chose 1 nearest neighbor in the imputer because most of our variables are categorical variables coded as numeric

In [5]:
#compfilm gets dropped because we do not have prior examination data
mammography = mammography.drop("compfilm_c", axis = 1)

In [6]:
#famhx, hrt, prvmam, biophx all have missing values entered as 9
#replace the 9s with NaN so they can be filled later using KNN imputation
#whole columns are assigned by name, so the cell does not depend on column positions and only needs to run once

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

coded_as_9 = ["famhx_c", "hrt_c", "prvmam_c", "biophx_c"]
mammography[coded_as_9] = mammography[coded_as_9].replace(9, np.nan)

mammography['hrt_c'].unique()

array([ 0., nan,  1.])

In [7]:
#drop CaTypeO since it is not obtainable prior to a cancer diagnosis and thus shouldn't be used to make predictions
mammography = mammography.drop("CaTypeO", axis = 1)

In [8]:
#replace -99 bmi values with NA
mammography["bmi_c"] = mammography["bmi_c"].replace(-99,np.nan)
mammography["bmi_c"].unique()

array([24.0235443,        nan, 29.0524292, ..., 29.8625793, 19.6293335,
       35.680542 ])

In [9]:
#drop patient ID variable because we don't want this to be influencing our model
mammography = mammography.drop("ptid", axis = 1)

In [10]:
#move the target variable (cancer_c) to the end
move = mammography.pop("cancer_c")
mammography.insert(9, "cancer_c", move)

In [11]:
#final dataset after imputation
mammography_imputed = pd.DataFrame(fill_imputer.fit_transform(mammography), columns=mammography.columns)
mammography_imputed

Unnamed: 0,age_c,assess_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,bmi_c,cancer_c
0,62.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,24.023544,0.0
1,65.0,1.0,4.0,0.0,0.0,1.0,0.0,1.0,22.142502,0.0
2,69.0,0.0,2.0,0.0,0.0,1.0,0.0,1.0,29.052429,0.0
3,64.0,2.0,2.0,0.0,0.0,1.0,0.0,1.0,23.649673,0.0
4,63.0,3.0,2.0,0.0,0.0,1.0,1.0,1.0,33.729523,0.0
...,...,...,...,...,...,...,...,...,...,...
39995,80.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,31.093048,0.0
39996,78.0,1.0,3.0,0.0,0.0,1.0,0.0,1.0,24.802704,0.0
39997,77.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,34.335449,0.0
39998,66.0,1.0,2.0,0.0,0.0,1.0,1.0,1.0,25.685745,0.0


## Iteration 1

- Defining independent and dependent variables
- Splitting test and train set
- Scaling variables between 0 and 1
- Building and training model
- An analysis of the confusion matrix

In [13]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [14]:
X = mammography_imputed.iloc[:,0:9]
X

Unnamed: 0,age_c,assess_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,bmi_c
0,62.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,24.023544
1,65.0,1.0,4.0,0.0,0.0,1.0,0.0,1.0,22.142502
2,69.0,0.0,2.0,0.0,0.0,1.0,0.0,1.0,29.052429
3,64.0,2.0,2.0,0.0,0.0,1.0,0.0,1.0,23.649673
4,63.0,3.0,2.0,0.0,0.0,1.0,1.0,1.0,33.729523
...,...,...,...,...,...,...,...,...,...
39995,80.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,31.093048
39996,78.0,1.0,3.0,0.0,0.0,1.0,0.0,1.0,24.802704
39997,77.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,34.335449
39998,66.0,1.0,2.0,0.0,0.0,1.0,1.0,1.0,25.685745


In [15]:
y = mammography_imputed.iloc[:,9]
y

0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
        ... 
39995    0.0
39996    0.0
39997    0.0
39998    0.0
39999    0.0
Name: cancer_c, Length: 40000, dtype: float64

In [16]:
#train test set splitting
X_train, X_test, y_train, y_test = train_test_split(X,y, shuffle = True, test_size = 0.2)

In [17]:
#scale every variable to the range 0 to 1 so that each one carries equal weight in the distance calculation
scaler = MinMaxScaler()

In [18]:
#learn the scaling from the training set only, then apply the same scaling to the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train

array([[0.72413793, 0.2       , 0.33333333, ..., 0.        , 0.        ,
        0.07214903],
       [0.17241379, 0.        , 0.66666667, ..., 0.        , 0.        ,
        0.10330553],
       [0.17241379, 0.2       , 0.        , ..., 0.        , 0.        ,
        0.59938855],
       ...,
       [0.51724138, 0.2       , 0.33333333, ..., 1.        , 0.        ,
        0.29236697],
       [0.55172414, 0.4       , 0.33333333, ..., 0.        , 0.        ,
        0.25084417],
       [0.65517241, 0.4       , 0.33333333, ..., 0.        , 1.        ,
        0.13143523]])
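
Scaling limits are normally learned from the training split and then reused on the test split. One way to make that automatic is a scikit-learn pipeline. The following is a minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the mammography features; the names `X_demo`, `y_demo` and `pipe` are illustrative, and `stratify=y_demo` is an extra assumption that keeps the share of the rare class equal in both splits.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in data (not the BCSC dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=9, weights=[0.95, 0.05], random_state=0
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The pipeline fits MinMaxScaler on the training data only and
# reuses that min/max on anything passed to predict or score
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```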

### Model Training

In [20]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [21]:
knn_model = KNeighborsClassifier(n_neighbors=5)

In [22]:
knn_model.fit(X_train,y_train)

In [23]:
y_pred = knn_model.predict(X_test)
y_pred

array([0., 0., 0., ..., 0., 0., 0.])

In [24]:
#accuracy (keep in mind that the dataset is heavily unbalanced)

knn_model.score(X_test,y_test)

0.991375

In [25]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[7928,   15],
       [  54,    3]])

In [26]:
cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      7943
         1.0       0.17      0.05      0.08        57

    accuracy                           0.99      8000
   macro avg       0.58      0.53      0.54      8000
weighted avg       0.99      0.99      0.99      8000

### Feedback

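This first, preliminary model uses all variables with equal weighting, and the results are poor. Despite 99% accuracy, it finds only 3 of the 57 cancer cases in the test set (recall 0.05), and only 3 of its 18 positive classifications are correct (precision 0.17). We will now investigate how to improve the model and iterate to optimise it.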
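
The figures in the classification report can be recomputed by hand from the confusion matrix. A short worked example, using the matrix printed above (rows are actual classes, columns are predicted classes); the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix reported above: rows = actual, columns = predicted
cm_it1 = np.array([[7928, 15],
                   [54, 3]])
tn, fp, fn, tp = cm_it1.ravel()

acc_it1 = (tp + tn) / cm_it1.sum()   # share of all predictions that are correct
prec_it1 = tp / (tp + fp)            # share of positive predictions that are correct
rec_it1 = tp / (tp + fn)             # share of actual cancers that were found

print(f"accuracy={acc_it1:.6f} precision={prec_it1:.3f} recall={rec_it1:.3f}")
```

The accuracy is driven almost entirely by the 7928 true negatives, which is why it says little about how well the cancer class is detected.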

# Exploratory Data Analysis

- Assess correlation matrix
- Observe anomaly in assess_c variable
- Analyse the anomaly
- Propose how to recode variables

In [29]:
#start by assessing the correlation between each variable and cancer.
corr_m = round(mammography_imputed.corr(),3).iloc[:,9]
corr_m

age_c        0.010
assess_c    -0.080
density_c    0.015
famhx_c      0.005
hrt_c        0.002
prvmam_c    -0.007
biophx_c     0.022
mammtype    -0.005
bmi_c        0.012
cancer_c     1.000
Name: cancer_c, dtype: float64

Something interesting about this matrix is that assess_c is negatively correlated with cancer_c, which would imply that a higher numerical code indicates a lower risk of cancer. This is strange, given that high assess_c values are coded to mean that the radiologist sees a high risk of malignancy.

Variables that are positively correlated with breast cancer are age, breast density, family history, hormone therapy, breast biopsy history and BMI, although every coefficient is below 0.03.

Variables that are negatively correlated with breast cancer are prior mammograms and mammogram type, although just by intuition I don't think this is significant.

Let's start by investigating why the radiologist assessment variable is negatively correlated with cancer diagnosis.

In [32]:
mamm = mammography_imputed.copy()
mamm.head()

Unnamed: 0,age_c,assess_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,bmi_c,cancer_c
0,62.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,24.023544,0.0
1,65.0,1.0,4.0,0.0,0.0,1.0,0.0,1.0,22.142502,0.0
2,69.0,0.0,2.0,0.0,0.0,1.0,0.0,1.0,29.052429,0.0
3,64.0,2.0,2.0,0.0,0.0,1.0,0.0,1.0,23.649673,0.0
4,63.0,3.0,2.0,0.0,0.0,1.0,1.0,1.0,33.729523,0.0


In [33]:
assess_canc = mamm.iloc[:,[1,9]].astype(int)
assess_canc["assess_c"].value_counts()

assess_c
1    26032
2    10718
0     3049
3      139
4       57
5        5
Name: count, dtype: int64

### Calculating the cancer proportion for unique values in selected variables, as well as bins for BMI and Age

In [35]:
mamm.groupby("assess_c")["cancer_c"].mean()

assess_c
0.0    0.064939
1.0    0.000768
2.0    0.001120
3.0    0.007194
4.0    0.421053
5.0    0.800000
Name: cancer_c, dtype: float64

In [36]:
mamm.groupby("density_c")["cancer_c"].mean()

density_c
1.0    0.004973
2.0    0.005506
3.0    0.008979
4.0    0.005405
Name: cancer_c, dtype: float64

In [37]:
mamm.groupby("famhx_c")["cancer_c"].mean()

famhx_c
0.0    0.006291
1.0    0.007375
Name: cancer_c, dtype: float64

In [38]:
mamm.groupby("hrt_c")["cancer_c"].mean()

hrt_c
0.0    0.006415
1.0    0.006957
Name: cancer_c, dtype: float64

In [39]:
mamm.groupby("prvmam_c")["cancer_c"].mean()

prvmam_c
0.0    0.012821
1.0    0.006425
Name: cancer_c, dtype: float64

In [40]:
mamm.groupby("biophx_c")["cancer_c"].mean()

biophx_c
0.0    0.005388
1.0    0.009460
Name: cancer_c, dtype: float64

In [41]:
binned_bmi = pd.cut(mamm["bmi_c"], bins=[0,20,25,30,35,40,45,50])
mamm.groupby(binned_bmi)["cancer_c"].mean()

bmi_c
(0, 20]     0.006093
(20, 25]    0.006215
(25, 30]    0.004986
(30, 35]    0.009690
(35, 40]    0.007058
(40, 45]    0.010274
(45, 50]    0.012232
Name: cancer_c, dtype: float64

In [42]:
split_bmi = pd.cut(mamm["bmi_c"], bins=[0,30,50])
mamm.groupby(split_bmi)["cancer_c"].mean()

bmi_c
(0, 30]     0.005657
(30, 50]    0.009178
Name: cancer_c, dtype: float64

In [43]:
mamm["age_c"].sort_values()

7434     60.0
29122    60.0
29114    60.0
2475     60.0
16933    60.0
         ... 
28998    89.0
17436    89.0
32359    89.0
30868    89.0
934      89.0
Name: age_c, Length: 40000, dtype: float64

In [44]:
binned_age = pd.cut(mamm["age_c"], bins=[60,65,70,75,80,85,90])
mamm.groupby(binned_age)["cancer_c"].mean()

age_c
(60, 65]    0.006748
(65, 70]    0.006353
(70, 75]    0.006002
(75, 80]    0.007614
(80, 85]    0.007551
(85, 90]    0.010204
Name: cancer_c, dtype: float64

In [45]:
split_age = pd.cut(mamm["age_c"], bins=[60,75,90])
mamm.groupby(split_age)["cancer_c"].mean()

age_c
(60, 75]    0.006430
(75, 90]    0.007847
Name: cancer_c, dtype: float64
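
One detail worth checking in the binned tables: `pd.cut` builds right-closed intervals by default, so a value equal to the lowest bin edge (an age of exactly 60 here) falls outside the first bin and is left out of the group means. A small sketch with made-up ages shows the effect, and how `include_lowest=True` changes it.

```python
import pandas as pd

# Made-up ages sitting on and around the bin edges
ages = pd.Series([60, 61, 75, 76])

# Default: right-closed bins, so 60 is outside (60, 75]
default_bins = pd.cut(ages, bins=[60, 75, 90])

# include_lowest=True also captures the left edge of the first bin
inclusive_bins = pd.cut(ages, bins=[60, 75, 90], include_lowest=True)

print(default_bins.isna().tolist())
print(inclusive_bins.isna().tolist())
```

The same sketch shows that an age of exactly 75 lands in the lower bin, which matters when the bin edge is later turned into a yes/no variable.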

## Analysis conclusions 


### Assess_c
For assess_c, those who need additional testing (coded 0) are about 85 times more likely to have cancer than those who received a 1, a negative result (0.064939 versus 0.000768). This is why the correlation coefficient is negative: the 0 value breaks the ordinal nature of the variable.
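
One way to restore an ordinal scale is to rank the assess_c codes by their observed cancer proportion. The following is a minimal sketch using the proportions from the groupby output above; the names `rates`, `ordered_codes` and `recode` are illustrative, and starting the new codes at 1 is an arbitrary choice.

```python
# Cancer proportion per assess_c code, copied from the groupby output above
rates = {0: 0.064939, 1: 0.000768, 2: 0.001120,
         3: 0.007194, 4: 0.421053, 5: 0.800000}

# Rank the codes from lowest to highest observed cancer proportion
ordered_codes = sorted(rates, key=rates.get)

# New ordinal code = rank, starting at 1
recode = {old: rank for rank, old in enumerate(ordered_codes, start=1)}

print(ordered_codes)
print(recode)
print(round(rates[0] / rates[1], 1))  # how many times riskier code 0 is than code 1
```

Under this ranking, codes 1 to 3 keep their values, code 0 moves up to 4, and codes 4 and 5 shift to 5 and 6.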

### BMI
For BMI, there is a noticeable cut-off at a BMI of 30. To the left of this critical value the proportion of cancer diagnoses is 0.005657; to the right it is 0.009178, about 1.6 times higher.

### Age
For age, we expect to see less of a correlation because the only people in our dataset are those aged 60 to 89. Research generally shows that breast cancer is more likely in those aged 50+, and everyone in the dataset is already in that group, so age may play less of a role in our model. However, there seems to be a critical value at the 75-year point, where there is a difference of 0.001417 in cancer proportion.

## Iteration 3
- Define a function that generalises the process of model building for potential further iterations
- Recode numeric variables into a metric suitable for hamming distance
- Utilise hamming distance metric
- Use distance-based weighting

In [48]:
# Function whose parameters are the X variable table, the y variable table and the number of neighbors.
# It returns the confusion matrix of the model, from which you can calculate metrics measuring the effectiveness of the model
def knn_func(X,y,n):
    scaler = MinMaxScaler()
    
    X_tr, X_te, y_tr, y_te = train_test_split(X,y, shuffle = True, test_size = 0.2)
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)   #reuse the scaling learned from the training set
    knn_model = KNeighborsClassifier(n_neighbors=n)
    knn_model.fit(X_tr,y_tr)
    y_pr = knn_model.predict(X_te)
    cm = confusion_matrix(y_te, y_pr)
    return cm

In [49]:
#original encoding for this dataset is pretty close to being suitable for hamming distance

mamm3 = pd.read_csv('mammography.csv')
mamm3

Unnamed: 0,age_c,assess_c,cancer_c,compfilm_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,CaTypeO,bmi_c,ptid
0,62,1,0,1,2,0,0,1,0,1,8,24.023544,1
1,65,1,0,1,4,0,0,1,0,1,8,-99.000000,2
2,69,0,0,1,2,0,0,1,0,1,8,29.052429,3
3,64,2,0,1,2,0,0,1,0,1,8,-99.000000,4
4,63,3,0,1,2,0,0,1,1,1,8,33.729523,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,80,1,0,1,1,0,0,1,0,1,8,-99.000000,36711
39996,78,1,0,1,3,0,0,1,0,1,8,-99.000000,36712
39997,77,1,0,1,2,0,0,1,0,1,8,-99.000000,36712
39998,66,1,0,1,2,0,0,1,1,1,8,-99.000000,36713


In [50]:
mamm3["above75"] = mamm3["age_c"] >= 75
mamm3["above75"] = mamm3["above75"].astype(int)
mamm3

Unnamed: 0,age_c,assess_c,cancer_c,compfilm_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,CaTypeO,bmi_c,ptid,above75
0,62,1,0,1,2,0,0,1,0,1,8,24.023544,1,0
1,65,1,0,1,4,0,0,1,0,1,8,-99.000000,2,0
2,69,0,0,1,2,0,0,1,0,1,8,29.052429,3,0
3,64,2,0,1,2,0,0,1,0,1,8,-99.000000,4,0
4,63,3,0,1,2,0,0,1,1,1,8,33.729523,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,80,1,0,1,1,0,0,1,0,1,8,-99.000000,36711,1
39996,78,1,0,1,3,0,0,1,0,1,8,-99.000000,36712,1
39997,77,1,0,1,2,0,0,1,0,1,8,-99.000000,36712,1
39998,66,1,0,1,2,0,0,1,1,1,8,-99.000000,36713,0


In [51]:
mamm3["overweight"] = np.nan
mamm3

Unnamed: 0,age_c,assess_c,cancer_c,compfilm_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,CaTypeO,bmi_c,ptid,above75,overweight
0,62,1,0,1,2,0,0,1,0,1,8,24.023544,1,0,
1,65,1,0,1,4,0,0,1,0,1,8,-99.000000,2,0,
2,69,0,0,1,2,0,0,1,0,1,8,29.052429,3,0,
3,64,2,0,1,2,0,0,1,0,1,8,-99.000000,4,0,
4,63,3,0,1,2,0,0,1,1,1,8,33.729523,5,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,80,1,0,1,1,0,0,1,0,1,8,-99.000000,36711,1,
39996,78,1,0,1,3,0,0,1,0,1,8,-99.000000,36712,1,
39997,77,1,0,1,2,0,0,1,0,1,8,-99.000000,36712,1,
39998,66,1,0,1,2,0,0,1,1,1,8,-99.000000,36713,0,


In [52]:
#0 = BMI below 30, 1 = BMI of 30 or above, 2 = BMI missing (coded as -99)
bmi = mamm3["bmi_c"]
mamm3["overweight"] = np.select([bmi == -99, bmi >= 30], [2, 1], default=0)
mamm3

Unnamed: 0,age_c,assess_c,cancer_c,compfilm_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,CaTypeO,bmi_c,ptid,above75,overweight
0,62,1,0,1,2,0,0,1,0,1,8,24.023544,1,0,0
1,65,1,0,1,4,0,0,1,0,1,8,-99.000000,2,0,2
2,69,0,0,1,2,0,0,1,0,1,8,29.052429,3,0,0
3,64,2,0,1,2,0,0,1,0,1,8,-99.000000,4,0,2
4,63,3,0,1,2,0,0,1,1,1,8,33.729523,5,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,80,1,0,1,1,0,0,1,0,1,8,-99.000000,36711,1,2
39996,78,1,0,1,3,0,0,1,0,1,8,-99.000000,36712,1,2
39997,77,1,0,1,2,0,0,1,0,1,8,-99.000000,36712,1,2
39998,66,1,0,1,2,0,0,1,1,1,8,-99.000000,36713,0,2


### Selecting variables for X3
Based on the exploratory data analysis, for this iteration I will use:
- above 75
- overweight
- biophx_c
- famhx_c
- density_c
- assess_c


In [54]:
X3 = mamm3.iloc[:,[1,4,5,8,13,14]]
X3

Unnamed: 0,assess_c,density_c,famhx_c,biophx_c,above75,overweight
0,1,2,0,0,0,0
1,1,4,0,0,0,2
2,0,2,0,0,0,0
3,2,2,0,0,0,2
4,3,2,0,1,0,1
...,...,...,...,...,...,...
39995,1,1,0,0,1,2
39996,1,3,0,0,1,2
39997,1,2,0,0,1,2
39998,1,2,0,1,0,2


In [55]:
y3 = mamm3.iloc[:,2]
y3

0        0
1        0
2        0
3        0
4        0
        ..
39995    0
39996    0
39997    0
39998    0
39999    0
Name: cancer_c, Length: 40000, dtype: int64

### Defining a hamming distance function:
For two arrays, if a variable is not the same, add one to distance

In [57]:
def ham_dist(x,y):
    dist = 0
    for i in range(len(x)):
        if x[i] != y[i]:
            dist += 1
    return dist
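
The same count can be written without an explicit loop. The following is a minimal sketch using NumPy; `ham_dist_np` is a hypothetical name, and the two example rows are made up. Note that `scipy.spatial.distance.hamming`, which is imported later in this notebook, returns the fraction of positions that differ rather than the count.

```python
import numpy as np

def ham_dist_np(x, y):
    """Count the positions where two equal-length arrays differ."""
    x, y = np.asarray(x), np.asarray(y)
    return int(np.count_nonzero(x != y))

# Two made-up rows with six categorical variables each
a = [1, 2, 0, 0, 0, 0]
b = [1, 4, 0, 0, 0, 2]

print(ham_dist_np(a, b))           # number of differing positions
print(ham_dist_np(a, b) / len(a))  # fraction of differing positions
```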

In [58]:
knn_func(X3,y3,3)

array([[7936,    9],
       [  53,    2]])

## Assessing iterations

- Define a function that takes a sample of n confusion matrices
- Use the sample of matrices to calculate average model performance and 95% confidence interval

In [61]:
from scipy.spatial import distance

In [62]:
def matrices_sample(X,y,n,s,function):
    matrices = []
    for i in range(s):
        matrices.append(function(X,y,n))
    return matrices

In [63]:
def accuracy_value(list_of_matrices):
    acc = []
    for cm in list_of_matrices:
        total_correct = cm[0,0]+cm[1,1]
        total = cm.sum()   #number of observations in the test set
        acc.append(total_correct/total)
    return acc

def precision(list_of_matrices):
    prec = []
    for cm in list_of_matrices:
        total_positive_classifications = cm[0,1]+cm[1,1]
        total_true_positive = cm[1,1]
        #precision is undefined when nothing is classified as positive; it is recorded as 0 here
        if total_positive_classifications==0:
            prec.append(0)
        else:
            prec.append(total_true_positive/total_positive_classifications)
    return prec

def recall(list_of_matrices):
    rec = []
    for cm in list_of_matrices:
        total_positive_incidences = cm[1,0]+cm[1,1]
        total_true_positive = cm[1,1]
        rec.append(total_true_positive/total_positive_incidences)
    return rec

In [64]:
def confidence95(mean,sd,n=50):
    """ Returns a list of two numbers: the lower and upper bounds of the 95% confidence interval
    for a mean calculated from a sample of size n (50 by default)"""
    standard_error = sd/np.sqrt(n)
    return([mean-standard_error*1.96,mean+standard_error*1.96])
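
As a worked example of the interval arithmetic, the mean and standard deviation of iteration 1's precision (reported further down in this notebook) can be plugged into the formula by hand. The variable names below are illustrative only.

```python
import math

# Mean and standard deviation of iteration 1's precision over 50 runs,
# copied from the notebook output
mean_prec, sd_prec, n_runs = 0.3588571428571429, 0.2987387449637298, 50

# Half-width of a normal-approximation 95% interval: 1.96 * sd / sqrt(n)
half_width = 1.96 * sd_prec / math.sqrt(n_runs)
low, high = mean_prec - half_width, mean_prec + half_width

print(round(half_width, 4), round(low, 4), round(high, 4))
```

The half-width is the "±" part of the result, so this precision can be read as roughly 36% plus or minus 8 percentage points.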

Our four model-building functions are listed below

In [66]:
def knn_func(X,y,n):
    scaler = MinMaxScaler()
    
    X_tr, X_te, y_tr, y_te = train_test_split(X,y, shuffle = True, test_size = 0.2)
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)   #reuse the scaling learned from the training set
    knn_model = KNeighborsClassifier(n_neighbors=n)
    knn_model.fit(X_tr,y_tr)
    y_pr = knn_model.predict(X_te)
    cm = confusion_matrix(y_te, y_pr)
    return(cm)

In [67]:
def knn_func_weighted(X,y,n):
    scaler = MinMaxScaler()
    
    X_tr, X_te, y_tr, y_te = train_test_split(X,y, shuffle = True, test_size = 0.2)
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)   #reuse the scaling learned from the training set
    knn_model = KNeighborsClassifier(n_neighbors=n,weights = "distance")
    knn_model.fit(X_tr,y_tr)
    y_pr = knn_model.predict(X_te)
    cm = confusion_matrix(y_te, y_pr)
    return(cm)

In [68]:
def knn_hamm_weighted(X,y,n):
    scaler = MinMaxScaler()
    
    X_tr, X_te, y_tr, y_te = train_test_split(X,y, shuffle = True, test_size = 0.2)
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)   #reuse the scaling learned from the training set
    knn_model = KNeighborsClassifier(n_neighbors=n, metric = distance.hamming, weights = "distance")
    knn_model.fit(X_tr,y_tr)
    y_pr = knn_model.predict(X_te)
    cm = confusion_matrix(y_te, y_pr)
    return(cm)

In [69]:
def knn_hamm(X,y,n):
    scaler = MinMaxScaler()
    
    X_tr, X_te, y_tr, y_te = train_test_split(X,y, shuffle = True, test_size = 0.2)
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)   #reuse the scaling learned from the training set
    knn_model = KNeighborsClassifier(n_neighbors=n, metric = distance.hamming, weights = "uniform")
    knn_model.fit(X_tr,y_tr)
    y_pr = knn_model.predict(X_te)
    cm = confusion_matrix(y_te, y_pr)
    return(cm)

## Iteration1 uses the following independent variables

In [71]:
X = mammography_imputed.iloc[:,0:8]
y = mammography_imputed.iloc[:,9]
X

Unnamed: 0,age_c,assess_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype
0,62.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0
1,65.0,1.0,4.0,0.0,0.0,1.0,0.0,1.0
2,69.0,0.0,2.0,0.0,0.0,1.0,0.0,1.0
3,64.0,2.0,2.0,0.0,0.0,1.0,0.0,1.0
4,63.0,3.0,2.0,0.0,0.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...
39995,80.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0
39996,78.0,1.0,3.0,0.0,0.0,1.0,0.0,1.0
39997,77.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0
39998,66.0,1.0,2.0,0.0,0.0,1.0,1.0,1.0


In [72]:
iteration1 = matrices_sample(X,y,5,50,knn_func)

In [73]:
it1acc, it1prec, it1rec = np.mean(accuracy_value(iteration1)), np.mean(precision(iteration1)), np.mean(recall(iteration1))
it1acc, it1prec, it1rec

(0.9933000000000001, 0.3588571428571429, 0.023842985528143807)

In [74]:
it1accsd, it1precsd, it1recsd = np.std(accuracy_value(iteration1)), np.std(precision(iteration1)), np.std(recall(iteration1))
it1accsd, it1precsd, it1recsd

(0.0007466592261534132, 0.2987387449637298, 0.01724402995857133)

In [159]:
it1accCI, it1precCI, it1recCI = confidence95(it1acc,it1accsd), confidence95(it1prec,it1precsd), confidence95(it1rec,it1recsd)
np.round(it1accCI,4), np.round(it1precCI,4), np.round(it1recCI,4)

(array([0.9931, 0.9935]), array([0.2761, 0.4417]), array([0.0191, 0.0286]))

## Iteration2
Iteration 2 reorders the values of assess_c so that they are ordinal, ranked by proportion of cancer diagnosis. It also differs from iteration 1 in that BMI is included and neighbors are weighted by distance.

In [77]:
mamm5 = mammography_imputed.copy()
mamm5

Unnamed: 0,age_c,assess_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,bmi_c,cancer_c
0,62.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,24.023544,0.0
1,65.0,1.0,4.0,0.0,0.0,1.0,0.0,1.0,22.142502,0.0
2,69.0,0.0,2.0,0.0,0.0,1.0,0.0,1.0,29.052429,0.0
3,64.0,2.0,2.0,0.0,0.0,1.0,0.0,1.0,23.649673,0.0
4,63.0,3.0,2.0,0.0,0.0,1.0,1.0,1.0,33.729523,0.0
...,...,...,...,...,...,...,...,...,...,...
39995,80.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,31.093048,0.0
39996,78.0,1.0,3.0,0.0,0.0,1.0,0.0,1.0,24.802704,0.0
39997,77.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,34.335449,0.0
39998,66.0,1.0,2.0,0.0,0.0,1.0,1.0,1.0,25.685745,0.0


In [78]:
#reorder assess_c by cancer proportion: 1, 2 and 3 stay the same, 0 becomes 4, 4 becomes 5 and 5 becomes 6
mamm5["assess_c"] = mamm5["assess_c"].replace({0: 4, 4: 5, 5: 6})

In [79]:
mamm5

Unnamed: 0,age_c,assess_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,bmi_c,cancer_c
0,62.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,24.023544,0.0
1,65.0,1.0,4.0,0.0,0.0,1.0,0.0,1.0,22.142502,0.0
2,69.0,4.0,2.0,0.0,0.0,1.0,0.0,1.0,29.052429,0.0
3,64.0,2.0,2.0,0.0,0.0,1.0,0.0,1.0,23.649673,0.0
4,63.0,3.0,2.0,0.0,0.0,1.0,1.0,1.0,33.729523,0.0
...,...,...,...,...,...,...,...,...,...,...
39995,80.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,31.093048,0.0
39996,78.0,1.0,3.0,0.0,0.0,1.0,0.0,1.0,24.802704,0.0
39997,77.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,34.335449,0.0
39998,66.0,1.0,2.0,0.0,0.0,1.0,1.0,1.0,25.685745,0.0


### This model uses all nine independent variables: age, radiologist assessment, density, family history, hormone therapy, prior mammogram, biopsy history, mammogram type and BMI

In [81]:
X2 = mamm5.iloc[:,0:9]
y2 = mamm5.iloc[:,9]
X2

Unnamed: 0,age_c,assess_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,bmi_c
0,62.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,24.023544
1,65.0,1.0,4.0,0.0,0.0,1.0,0.0,1.0,22.142502
2,69.0,4.0,2.0,0.0,0.0,1.0,0.0,1.0,29.052429
3,64.0,2.0,2.0,0.0,0.0,1.0,0.0,1.0,23.649673
4,63.0,3.0,2.0,0.0,0.0,1.0,1.0,1.0,33.729523
...,...,...,...,...,...,...,...,...,...
39995,80.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,31.093048
39996,78.0,1.0,3.0,0.0,0.0,1.0,0.0,1.0,24.802704
39997,77.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,34.335449
39998,66.0,1.0,2.0,0.0,0.0,1.0,1.0,1.0,25.685745


In [82]:
iteration2 = matrices_sample(X2,y2,5,50,knn_func_weighted)

In [83]:
it2acc, it2prec, it2rec = np.mean(accuracy_value(iteration2)), np.mean(precision(iteration2)), np.mean(recall(iteration2))
it2acc, it2prec, it2rec

(0.9908925, 0.1411440761214806, 0.06616897101700736)

In [84]:
it2accsd, it2precsd, it2recsd = np.std(accuracy_value(iteration2)), np.std(precision(iteration2)), np.std(recall(iteration2))
it2accsd, it2precsd, it2recsd

(0.0013181450034044086, 0.08242082174868125, 0.03549599082812246)

In [161]:
it2accCI, it2precCI, it2recCI = confidence95(it2acc,it2accsd), confidence95(it2prec,it2precsd), confidence95(it2rec,it2recsd)
np.round(it2accCI,4), np.round(it2precCI,4), np.round(it2recCI,4)

(array([0.9905, 0.9913]), array([0.1183, 0.164 ]), array([0.0563, 0.076 ]))

In [86]:
# iteration3 = matrices_sample(X3,y3,3,3,knn_hamm)
# iteration3

In [87]:
# iteration4 = matrices_sample(X3,y3,3,knn_hamm_weighted)
# iteration4

## Analysis notes
For iterations 3 and 4, producing a large sample of confusion matrices with hamming distance was much more computationally expensive than with Euclidean distance. A likely contributor is that the metric was passed as a Python function (distance.hamming), which scikit-learn has to call once for every pair of rows it compares instead of using its optimised built-in routines. As implemented, this distance measurement is better suited to smaller datasets and smaller vector sizes. I was not able to assess the effectiveness of the hamming-distance models, so instead we will compare model 1 and model 2.
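
A cheaper route that may be worth trying is scikit-learn's built-in hamming metric, selected with the string `metric="hamming"` rather than a Python function. The following is a minimal sketch on synthetic categorical data (not the BCSC set); `X_cat`, `y_cat` and `knn_hamming` are illustrative names, and the speed-up on the real data has not been measured here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic categorical features and labels (stand-in data)
X_cat = rng.integers(0, 3, size=(2000, 6))
y_cat = rng.integers(0, 2, size=2000)

# metric="hamming" selects scikit-learn's built-in implementation
# instead of calling a Python function for every pair of rows
knn_hamming = KNeighborsClassifier(n_neighbors=3, metric="hamming")
knn_hamming.fit(X_cat[:1600], y_cat[:1600])
preds = knn_hamming.predict(X_cat[1600:])
print(preds.shape)
```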

# Final Conclusions

## Iteration 1 Summary:
Variables used: 
- Age
- Radiologist assessment
- Breast tissue density
- Family history
- Hormone therapy
- Prior mammogram
- Biopsy history
- Mammogram type

Methods used
- Uniform weighting
- 5 nearest neighbors

## Iteration 1 Results:
#### Accuracy confidence interval: [0.9931, 0.9935]
#### Precision confidence interval: [0.2761, 0.4417]
#### Recall confidence interval: [0.0191, 0.0286]

## Iteration 2 Summary:
Variables used: 
- Age
- BMI
- Radiologist assessment (reordered into ordinal)
- Breast tissue density
- Family history
- Hormone therapy
- Prior mammogram
- Biopsy history
- Mammogram type

Methods used 
- Distance weighted
- 5 Nearest neighbors

## Iteration 2 Results:
#### Accuracy confidence interval: [0.9905, 0.9913]
#### Precision confidence interval: [0.1183, 0.1640]
#### Recall confidence interval: [0.0563, 0.0760]

# Which Model is better?
Both models have a classification accuracy above 99%, although iteration 1's is slightly higher. However, this accuracy is mostly misleading, since the dataset is dominated by patients with a negative cancer diagnosis: fewer than 1% of screenings are followed by a cancer diagnosis, so a model that always predicted "no cancer" would also be about 99% accurate.

Precision in this context is the proportion of people classified as having cancer who actually do have cancer.
Iteration 1's model had a precision of around 36±8%.
Iteration 2's model had a precision of around 14±2%.
The first model has a better precision score.

Recall in this context is the proportion of people who actually have cancer that the model classified as having cancer.
Iteration 1's model had a recall of around 2.4±0.5%.
Iteration 2's model had a recall of around 6.6±1.0%.
The second model has a better recall rate.

## Final Words: What does this all mean?
So what does this all mean? If you want a model that is better at avoiding false alarms, choose model 1, as about 36% of the people it classifies as having cancer actually do end up having cancer. On the other hand, if you would like a model that picks up more of the cancer diagnoses from this dataset's variables, choose model 2, as its recall is more than double that of model 1 (although it still detects fewer than 1 in 10 cancers).

In the business of disease diagnosis, doctors' decisions are by far the most important factor. Decisions shouldn't be made on a model alone, no matter how high the precision and recall. That being said, model 1 could still prove useful in the medical context as a preliminary screening aid. After all, around 1/3 of the individuals it identifies as having cancer do end up getting a diagnosis a year in the future!


## Future improvements:
- A thorough optimization on the number of nearest neighbors can prove to be useful.
- Creating a computationally cheap method of calculating hamming distance can allow us to change the distance metric and analyse that model.
- A more thorough optimization of the variables used, for example through regression, could help us choose more relevant variables.
- Next projects could also use: Logistic regression, Random Forest Classification
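
The first improvement could be sketched with a cross-validated grid search. The following is a minimal sketch, assuming synthetic imbalanced data from `make_classification` as a stand-in for the mammography features; the grid of values, the choice of recall as the score and all variable names are assumptions for illustration, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in data (not the BCSC dataset)
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=9, weights=[0.9, 0.1], random_state=0
)

# Scaling sits inside the pipeline so it is refit on each training fold
pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])

search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": [1, 3, 5, 7, 9],
                "knn__weights": ["uniform", "distance"]},
    scoring="recall",  # favour finding the rare positive class
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Swapping `scoring="recall"` for `"precision"` or `"f1"` would tune for a different trade-off between false alarms and missed cancers.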