# Mammography Classification Model
### Entai Lin

Throughout this project, you will see me clean a mammography dataset (cited below) and use it to try to construct a model that can predict a patient's diagnosis of cancer (cancer_c) a year after the mammography screening.

I then iterate the model to try and optimise it, with a summary of each iteration as follows:

### Iteration 1
- Euclidean distance
- All relevant variables used
- Variables scaled between 0 and 1

### Iteration 2
- A more specific set of variables was chosen through examining a correlation matrix
- A variable had a strange correlation coefficient, so I did an exploratory data analysis on why this was the case
- Proposed idea to recode said variable

### Iteration 3
- After a bit of research, I decided to change the model's distance metric to Hamming instead of Euclidean
- Recoded some numeric variables to better suit this new metric

### Iteration 4
- Added distance-based weighting to the model

### Key skills
- Data cleaning
- Exploratory data analysis
- Model building
- Model Optimisation
- Utilizing Sklearn library
- Machine Learning


#### Citations

Data collection and sharing was supported by the National Cancer Institute-funded Breast Cancer Surveillance Consortium (HHSN261201100031C). You can learn more about the BCSC at: http://www.bcsc-research.org/.

Dataset Documentation : https://www.bcsc-research.org/index.php/datasets/mammography_dataset/dataset_documentation

In [2]:
import numpy as np
import pandas as pd
mammography = pd.read_csv('mammography.csv')
mammography

Unnamed: 0,age_c,assess_c,cancer_c,compfilm_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,CaTypeO,bmi_c,ptid
0,62,1,0,1,2,0,0,1,0,1,8,24.023544,1
1,65,1,0,1,4,0,0,1,0,1,8,-99.000000,2
2,69,0,0,1,2,0,0,1,0,1,8,29.052429,3
3,64,2,0,1,2,0,0,1,0,1,8,-99.000000,4
4,63,3,0,1,2,0,0,1,1,1,8,33.729523,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,80,1,0,1,1,0,0,1,0,1,8,-99.000000,36711
39996,78,1,0,1,3,0,0,1,0,1,8,-99.000000,36712
39997,77,1,0,1,2,0,0,1,0,1,8,-99.000000,36712
39998,66,1,0,1,2,0,0,1,1,1,8,-99.000000,36713


# Data cleaning

## Process Outline:

- Create NA values for values coded as missing
- Remove obsolete variables 
- Fill missing values using KNNImputer
- Move prediction variable to the last column in the dataframe

In [4]:
from sklearn.impute import KNNImputer
fill_imputer = KNNImputer(n_neighbors=1)
# I chose 1 nearest neighbor in the imputer because most of our variables are categorical variables coded as numeric
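
To illustrate the reasoning behind a single neighbor, here is a minimal sketch on hypothetical toy data (the array `toy` is not part of the dataset). With `n_neighbors=1` the imputer copies the value from the closest row, whereas a larger `n_neighbors` averages several rows and could produce values such as 0.4 for a 0/1 code.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: two coded columns and one BMI-like column.
toy = np.array([
    [1.0, 0.0, 20.0],
    [1.0, 0.0, np.nan],   # value missing; coded columns match row 0 exactly
    [5.0, 1.0, 35.0],
])

# With n_neighbors=1 the missing value is copied from the single closest row,
# so an imputed category is always a value that already exists in the data.
toy_imputed = KNNImputer(n_neighbors=1).fit_transform(toy)
print(toy_imputed[1, 2])  # 20.0, taken from row 0
```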

In [5]:
#compfilm gets dropped because we do not have prior examination data
mammography = mammography.drop("compfilm_c", axis = 1)

In [6]:
#famhx, hrt, prvmam, biophx all have missing values entered as 9
#recode those 9s as NaN so they can be filled by KNN imputation later
#columns are selected by name rather than by position

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

missing_as_9 = ["famhx_c", "hrt_c", "prvmam_c", "biophx_c"]
mammography[missing_as_9] = mammography[missing_as_9].replace(9, np.nan)

mammography['hrt_c'].unique()

array([ 0., nan,  1.])

In [7]:
#drop CaTypeO since it is not information obtainable prior to cancer diagnosis and thus shouldn't be used to make predictions
mammography = mammography.drop("CaTypeO", axis = 1)

In [8]:
#replace -99 bmi values with NA
mammography["bmi_c"] = mammography["bmi_c"].replace(-99,np.nan)
mammography["bmi_c"].unique()

array([24.0235443,        nan, 29.0524292, ..., 29.8625793, 19.6293335,
       35.680542 ])

In [9]:
#drop patient ID variable because we don't want this to be influencing our model
mammography = mammography.drop("ptid", axis = 1)

In [10]:
#move the target variable (cancer_c) to the end
move = mammography.pop("cancer_c")
mammography.insert(9, "cancer_c", move)

In [11]:
#final dataset after imputation
mammography_imputed = pd.DataFrame(fill_imputer.fit_transform(mammography), columns=mammography.columns)
mammography_imputed

Unnamed: 0,age_c,assess_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,bmi_c,cancer_c
0,62.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,24.023544,0.0
1,65.0,1.0,4.0,0.0,0.0,1.0,0.0,1.0,22.142502,0.0
2,69.0,0.0,2.0,0.0,0.0,1.0,0.0,1.0,29.052429,0.0
3,64.0,2.0,2.0,0.0,0.0,1.0,0.0,1.0,23.649673,0.0
4,63.0,3.0,2.0,0.0,0.0,1.0,1.0,1.0,33.729523,0.0
...,...,...,...,...,...,...,...,...,...,...
39995,80.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,31.093048,0.0
39996,78.0,1.0,3.0,0.0,0.0,1.0,0.0,1.0,24.802704,0.0
39997,77.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,34.335449,0.0
39998,66.0,1.0,2.0,0.0,0.0,1.0,1.0,1.0,25.685745,0.0


## Iteration 1

- Defining independent and dependent variables
- Splitting test and train set
- Scaling variables between 0 and 1
- Building and training model
- An analysis of the confusion matrix

In [13]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [14]:
X = mammography_imputed.iloc[:,0:8]
X

Unnamed: 0,age_c,assess_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype
0,62.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0
1,65.0,1.0,4.0,0.0,0.0,1.0,0.0,1.0
2,69.0,0.0,2.0,0.0,0.0,1.0,0.0,1.0
3,64.0,2.0,2.0,0.0,0.0,1.0,0.0,1.0
4,63.0,3.0,2.0,0.0,0.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...
39995,80.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0
39996,78.0,1.0,3.0,0.0,0.0,1.0,0.0,1.0
39997,77.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0
39998,66.0,1.0,2.0,0.0,0.0,1.0,1.0,1.0


In [15]:
y = mammography_imputed.iloc[:,9]
y

0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
        ... 
39995    0.0
39996    0.0
39997    0.0
39998    0.0
39999    0.0
Name: cancer_c, Length: 40000, dtype: float64

In [16]:
#train test set splitting
X_train, X_test, y_train, y_test = train_test_split(X,y, shuffle = True, test_size = 0.2)
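
Because the split above is unseeded and unstratified, the number of cancer cases that land in the test set changes from run to run. As a hedged sketch of one alternative, `train_test_split` accepts `stratify` and `random_state`; the arrays `X_demo` and `y_demo` below are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 100 positives out of 1000 rows.
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(1000).reshape(-1, 1)

# stratify keeps the class ratio equal in both splits;
# random_state makes the split repeatable between runs.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=True, stratify=y_demo, random_state=0
)
print(y_te_demo.sum(), len(y_te_demo))  # 20 positives among 200 test rows
```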

In [17]:
#scale every variable to the range 0 to 1 so that each one carries equal weight in the distance calculation
scaler = MinMaxScaler()

In [18]:
#fit the scaler on the training set only, then apply the same scaling to the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train

array([[0.17241379, 0.4       , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.10344828, 0.2       , 0.        , ..., 1.        , 0.        ,
        1.        ],
       [0.06896552, 0.        , 0.66666667, ..., 1.        , 1.        ,
        0.        ],
       ...,
       [0.51724138, 0.2       , 0.66666667, ..., 1.        , 0.        ,
        1.        ],
       [0.        , 0.        , 0.33333333, ..., 1.        , 0.        ,
        1.        ],
       [0.27586207, 0.2       , 0.33333333, ..., 1.        , 0.        ,
        1.        ]])
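
A note on scaling: a scaler is normally fitted on the training data and then reused for the test data, because refitting it on the test set can map the same raw value to a different scaled value. The sketch below shows the difference on hypothetical one-column arrays (`train_demo` and `test_demo` are made up for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_demo = np.array([[0.0], [10.0]])   # hypothetical training column
test_demo = np.array([[5.0], [10.0]])    # hypothetical test column

scaler_demo = MinMaxScaler().fit(train_demo)

# Reusing the training min/max: 5 -> 0.5 and 10 -> 1.0
shared = scaler_demo.transform(test_demo)

# Refitting on the test set: 5 -> 0.0 and 10 -> 1.0,
# so the raw value 5 no longer means the same thing in both sets.
refit = MinMaxScaler().fit_transform(test_demo)

print(shared.ravel(), refit.ravel())
```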

### Model Training

In [20]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [21]:
knn_model = KNeighborsClassifier(n_neighbors=5)

In [22]:
knn_model.fit(X_train,y_train)

In [23]:
y_pred = knn_model.predict(X_test)
y_pred

array([0., 0., 0., ..., 0., 0., 0.])

In [24]:
#accuracy (keep in mind that the dataset is heavily unbalanced)

knn_model.score(X_test,y_test)

0.993125

In [25]:
cm = confusion_matrix(y_test, y_pred)
cm

array([[7944,    4],
       [  51,    1]])

In [26]:
cr = classification_report(y_test, y_pred)
print(cr)

              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      7948
         1.0       0.20      0.02      0.04        52

    accuracy                           0.99      8000
   macro avg       0.60      0.51      0.52      8000
weighted avg       0.99      0.99      0.99      8000

### Feedback

This first, preliminary model uses all variables with equal weighting, and the results are quite terrible: despite an accuracy of 99%, recall for the cancer class is only 0.02 (1 of 52 cases detected). We will now investigate how to improve the model and iterate to optimise it.
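
To make the gap between accuracy and usefulness concrete, the arithmetic behind the confusion matrix above can be written out as follows. The counts are copied from that matrix; the variable names are only for illustration.

```python
# Values copied from the confusion matrix above: rows are actual, columns predicted.
tn, fp, fn, tp = 7944, 4, 51, 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total    # matches the value returned by score()
recall = tp / (tp + fn)         # share of actual cancers that were found
precision = tp / (tp + fp)      # share of cancer predictions that were right
baseline = (tn + fp) / total    # accuracy of always predicting "no cancer"

print(f"accuracy={accuracy:.6f} recall={recall:.3f} "
      f"precision={precision:.2f} baseline={baseline:.4f}")
# accuracy=0.993125 recall=0.019 precision=0.20 baseline=0.9935
```

Always predicting "no cancer" would score 0.9935 on this test set, slightly above the model's 0.993125, which is why accuracy alone is not a useful measure on data this unbalanced.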

## Iteration 2

- Assess correlation matrix
- Observe anomaly in assess_c variable
- Analyse the anomaly
- Propose how to recode variables

In [29]:
#start by assessing the correlation between each variable and cancer.
corr_m = round(mammography_imputed.corr(),3).iloc[:,9]
corr_m

age_c        0.010
assess_c    -0.080
density_c    0.015
famhx_c      0.005
hrt_c        0.002
prvmam_c    -0.007
biophx_c     0.022
mammtype    -0.005
bmi_c        0.012
cancer_c     1.000
Name: cancer_c, dtype: float64

Something interesting about these correlations is that assess_c is negatively correlated with cancer_c. Taken at face value, this implies that a higher numerical code indicates a lower risk of cancer. This is strange, given that high assess_c values are coded to mean that the radiologist sees a high risk of malignancy.

Variables that are positively correlated with breast cancer are age, breast density, family history, hrt_c, breast biopsy history, and BMI.

Variables that are negatively correlated with breast cancer are prior mammograms and mammogram type, although the coefficients are so close to zero (-0.007 and -0.005) that I don't think this is meaningful.

Let's start by investigating why the radiologist assessment variable is negatively correlated with cancer diagnosis.

In [32]:
mamm = mammography_imputed.copy()
mamm.head()

Unnamed: 0,age_c,assess_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,bmi_c,cancer_c
0,62.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,24.023544,0.0
1,65.0,1.0,4.0,0.0,0.0,1.0,0.0,1.0,22.142502,0.0
2,69.0,0.0,2.0,0.0,0.0,1.0,0.0,1.0,29.052429,0.0
3,64.0,2.0,2.0,0.0,0.0,1.0,0.0,1.0,23.649673,0.0
4,63.0,3.0,2.0,0.0,0.0,1.0,1.0,1.0,33.729523,0.0


In [33]:
assess_canc = mamm.iloc[:,[1,9]].astype(int)
assess_canc["assess_c"].value_counts()

assess_c
1    26032
2    10718
0     3049
3      139
4       57
5        5
Name: count, dtype: int64

### Calculating the cancer proportion for unique values in selected variables, as well as bins for BMI and Age

In [35]:
mamm.groupby("assess_c")["cancer_c"].mean()

assess_c
0.0    0.064939
1.0    0.000768
2.0    0.001120
3.0    0.007194
4.0    0.421053
5.0    0.800000
Name: cancer_c, dtype: float64

In [36]:
mamm.groupby("density_c")["cancer_c"].mean()

density_c
1.0    0.004973
2.0    0.005506
3.0    0.008979
4.0    0.005405
Name: cancer_c, dtype: float64

In [37]:
mamm.groupby("famhx_c")["cancer_c"].mean()

famhx_c
0.0    0.006291
1.0    0.007375
Name: cancer_c, dtype: float64

In [38]:
mamm.groupby("hrt_c")["cancer_c"].mean()

hrt_c
0.0    0.006415
1.0    0.006957
Name: cancer_c, dtype: float64

In [39]:
mamm.groupby("prvmam_c")["cancer_c"].mean()

prvmam_c
0.0    0.012821
1.0    0.006425
Name: cancer_c, dtype: float64

In [40]:
mamm.groupby("biophx_c")["cancer_c"].mean()

biophx_c
0.0    0.005388
1.0    0.009460
Name: cancer_c, dtype: float64

In [41]:
binned_bmi = pd.cut(mamm["bmi_c"], bins=[0,20,25,30,35,40,45,50])
mamm.groupby(binned_bmi)["cancer_c"].mean()

bmi_c
(0, 20]     0.006093
(20, 25]    0.006215
(25, 30]    0.004986
(30, 35]    0.009690
(35, 40]    0.007058
(40, 45]    0.010274
(45, 50]    0.012232
Name: cancer_c, dtype: float64

In [42]:
split_bmi = pd.cut(mamm["bmi_c"], bins=[0,30,50])
mamm.groupby(split_bmi)["cancer_c"].mean()

bmi_c
(0, 30]     0.005657
(30, 50]    0.009178
Name: cancer_c, dtype: float64

In [43]:
mamm["age_c"].sort_values()

7434     60.0
29122    60.0
29114    60.0
2475     60.0
16933    60.0
         ... 
28998    89.0
17436    89.0
32359    89.0
30868    89.0
934      89.0
Name: age_c, Length: 40000, dtype: float64

In [44]:
binned_age = pd.cut(mamm["age_c"], bins=[60,65,70,75,80,85,90])
mamm.groupby(binned_age)["cancer_c"].mean()

age_c
(60, 65]    0.006748
(65, 70]    0.006353
(70, 75]    0.006002
(75, 80]    0.007614
(80, 85]    0.007551
(85, 90]    0.010204
Name: cancer_c, dtype: float64

In [45]:
split_age = pd.cut(mamm["age_c"], bins=[60,75,90])
mamm.groupby(split_age)["cancer_c"].mean()

age_c
(60, 75]    0.006430
(75, 90]    0.007847
Name: cancer_c, dtype: float64

## Analysis conclusions 


### Assess_c
For assess_c, those who need additional testing (coded 0) are roughly 85 times more likely to have cancer than those who received a 1, a negative result. This is why the correlation coefficient is slightly negative (-0.080) rather than positive: the 0 value breaks the ordinal nature of the variable.

### BMI
For BMI, there is a clear cut-off at a BMI of 30. To the left of the critical value of 30, the proportion of cancer diagnoses is 0.005657; to the right it is 0.009178, roughly 1.6 times higher.

### Age
For age, we expect to see less of a correlation because the only people in our dataset are aged 60 to 90. Since research generally finds that breast cancer is more likely in those aged 50+, everyone in the dataset is already in the higher-risk group, so age may play less of a role in our model. However, there seems to be a critical value at the 75-year point, where there is a difference of 0.001417 in cancer proportion.
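
The comparisons in these conclusions can be checked directly from the group means. The sketch below copies the proportions from the groupby outputs above; the dictionary names are just for illustration.

```python
# Cancer proportions copied from the groupby outputs above.
assess_rate = {0: 0.064939, 1: 0.000768}
bmi_rate = {"(0, 30]": 0.005657, "(30, 50]": 0.009178}
age_rate = {"(60, 75]": 0.006430, "(75, 90]": 0.007847}

assess_ratio = assess_rate[0] / assess_rate[1]
bmi_ratio = bmi_rate["(30, 50]"] / bmi_rate["(0, 30]"]
age_ratio = age_rate["(75, 90]"] / age_rate["(60, 75]"]
age_diff = age_rate["(75, 90]"] - age_rate["(60, 75]"]

print(round(assess_ratio, 1))  # 84.6
print(round(bmi_ratio, 2))     # 1.62
print(round(age_ratio, 2))     # 1.22
print(round(age_diff, 6))      # 0.001417
```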

## Iteration 3
- Define a function that generalises the process of model building for potential further iterations
- Recode numeric variables into a form suitable for Hamming distance
- Use the Hamming distance metric
- Use distance-based weighting

In [252]:
# Function whose parameters are the X variable table, the y variable table, and the number of neighbors.
# It returns the confusion matrix of the model, from which you can calculate metrics measuring the effectiveness of the model
def knn_func(X,y,n):
    scaler = MinMaxScaler()
    
    X_tr, X_te, y_tr, y_te = train_test_split(X,y, shuffle = True, test_size = 0.2)
    #fit the scaler on the training set only, then reuse it for the test set
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)
    knn_model = KNeighborsClassifier(n_neighbors=n)
    knn_model.fit(X_tr,y_tr)
    y_pr = knn_model.predict(X_te)
    cm = confusion_matrix(y_te, y_pr)
    return cm

In [49]:
#original encoding for this dataset is pretty close to being suitable for Hamming distance

mamm3 = pd.read_csv('mammography.csv')
mamm3

Unnamed: 0,age_c,assess_c,cancer_c,compfilm_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,CaTypeO,bmi_c,ptid
0,62,1,0,1,2,0,0,1,0,1,8,24.023544,1
1,65,1,0,1,4,0,0,1,0,1,8,-99.000000,2
2,69,0,0,1,2,0,0,1,0,1,8,29.052429,3
3,64,2,0,1,2,0,0,1,0,1,8,-99.000000,4
4,63,3,0,1,2,0,0,1,1,1,8,33.729523,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,80,1,0,1,1,0,0,1,0,1,8,-99.000000,36711
39996,78,1,0,1,3,0,0,1,0,1,8,-99.000000,36712
39997,77,1,0,1,2,0,0,1,0,1,8,-99.000000,36712
39998,66,1,0,1,2,0,0,1,1,1,8,-99.000000,36713


In [50]:
mamm3["above75"] = mamm3["age_c"] >= 75
mamm3["above75"] = mamm3["above75"].astype(int)
mamm3

Unnamed: 0,age_c,assess_c,cancer_c,compfilm_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,CaTypeO,bmi_c,ptid,above75
0,62,1,0,1,2,0,0,1,0,1,8,24.023544,1,0
1,65,1,0,1,4,0,0,1,0,1,8,-99.000000,2,0
2,69,0,0,1,2,0,0,1,0,1,8,29.052429,3,0
3,64,2,0,1,2,0,0,1,0,1,8,-99.000000,4,0
4,63,3,0,1,2,0,0,1,1,1,8,33.729523,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,80,1,0,1,1,0,0,1,0,1,8,-99.000000,36711,1
39996,78,1,0,1,3,0,0,1,0,1,8,-99.000000,36712,1
39997,77,1,0,1,2,0,0,1,0,1,8,-99.000000,36712,1
39998,66,1,0,1,2,0,0,1,1,1,8,-99.000000,36713,0


In [51]:
mamm3["overweight"] = np.nan
mamm3

Unnamed: 0,age_c,assess_c,cancer_c,compfilm_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,CaTypeO,bmi_c,ptid,above75,overweight
0,62,1,0,1,2,0,0,1,0,1,8,24.023544,1,0,
1,65,1,0,1,4,0,0,1,0,1,8,-99.000000,2,0,
2,69,0,0,1,2,0,0,1,0,1,8,29.052429,3,0,
3,64,2,0,1,2,0,0,1,0,1,8,-99.000000,4,0,
4,63,3,0,1,2,0,0,1,1,1,8,33.729523,5,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,80,1,0,1,1,0,0,1,0,1,8,-99.000000,36711,1,
39996,78,1,0,1,3,0,0,1,0,1,8,-99.000000,36712,1,
39997,77,1,0,1,2,0,0,1,0,1,8,-99.000000,36712,1,
39998,66,1,0,1,2,0,0,1,1,1,8,-99.000000,36713,0,


In [52]:
#recode BMI in one vectorised step: 2 = missing (-99), 1 = BMI of 30 or above, 0 = BMI below 30
mamm3["overweight"] = np.select(
    [mamm3["bmi_c"] == -99, mamm3["bmi_c"] >= 30],
    [2, 1],
    default=0,
)
mamm3

Unnamed: 0,age_c,assess_c,cancer_c,compfilm_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,CaTypeO,bmi_c,ptid,above75,overweight
0,62,1,0,1,2,0,0,1,0,1,8,24.023544,1,0,0
1,65,1,0,1,4,0,0,1,0,1,8,-99.000000,2,0,2
2,69,0,0,1,2,0,0,1,0,1,8,29.052429,3,0,0
3,64,2,0,1,2,0,0,1,0,1,8,-99.000000,4,0,2
4,63,3,0,1,2,0,0,1,1,1,8,33.729523,5,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,80,1,0,1,1,0,0,1,0,1,8,-99.000000,36711,1,2
39996,78,1,0,1,3,0,0,1,0,1,8,-99.000000,36712,1,2
39997,77,1,0,1,2,0,0,1,0,1,8,-99.000000,36712,1,2
39998,66,1,0,1,2,0,0,1,1,1,8,-99.000000,36713,0,2


### Selecting variables for X
Based on the exploratory data analysis, for this iteration I will use:
- above 75
- overweight
- biophx_c
- famhx_c
- density_c
- assess_c


In [54]:
X3 = mamm3.iloc[:,[1,4,5,8,13,14]]
X3

Unnamed: 0,assess_c,density_c,famhx_c,biophx_c,above75,overweight
0,1,2,0,0,0,0
1,1,4,0,0,0,2
2,0,2,0,0,0,0
3,2,2,0,0,0,2
4,3,2,0,1,0,1
...,...,...,...,...,...,...
39995,1,1,0,0,1,2
39996,1,3,0,0,1,2
39997,1,2,0,0,1,2
39998,1,2,0,1,0,2


In [99]:
y3 = mamm3.iloc[:,2]
y3

0        0
1        0
2        0
3        0
4        0
        ..
39995    0
39996    0
39997    0
39998    0
39999    0
Name: cancer_c, Length: 40000, dtype: int64

### Defining a hamming distance function:
For two arrays, if a variable is not the same, add one to distance

In [114]:
def ham_dist(x,y):
    dist = 0
    for i in range(len(x)):
        if x[i] != y[i]:
            dist += 1
    return dist
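
As a small check of what this function measures, the sketch below compares it with `scipy.spatial.distance.hamming`, which this notebook imports later. The SciPy version returns the fraction of differing positions rather than the count. For vectors of equal length that is the count divided by a constant, so the ranking of neighbours is the same. The vectors `a` and `b` are hypothetical.

```python
import numpy as np
from scipy.spatial import distance

def ham_dist(x, y):
    # Same counting rule as the function above: one unit per differing position.
    return sum(1 for a_i, b_i in zip(x, y) if a_i != b_i)

a = np.array([1, 2, 0, 0, 1, 2])  # hypothetical coded patient
b = np.array([1, 3, 0, 1, 1, 2])  # differs from a in two of six positions

count = ham_dist(a, b)              # number of differing positions
fraction = distance.hamming(a, b)   # proportion of differing positions
vectorised = int(np.sum(a != b))    # NumPy equivalent of the loop

print(count, vectorised, round(fraction, 4))  # 2 2 0.3333
```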

In [156]:
knn_func(X3,y3,3)

array([[7942,    1],
       [  55,    2]])

In [158]:
np.mean(y3)

0.006475

In [162]:
np.sum(y3)

259

In [160]:
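#share of patients predicted as cancer-free who were actually diagnosed (55 false negatives out of 7997 negative predictions)
55/(7942+55)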

0.0068775790921595595

In [178]:
X4 = mamm3.iloc[:,[1,4,14]]
X4

Unnamed: 0,assess_c,density_c,overweight
0,1,2,0
1,1,4,2
2,0,2,0
3,2,2,2
4,3,2,1
...,...,...,...
39995,1,1,2
39996,1,3,2
39997,1,2,2
39998,1,2,2


In [307]:
knn_func(X,y,3)

array([[7913,   24],
       [  60,    3]])

In [280]:
def knn_hamm_weighted(X,y,n):
    scaler = MinMaxScaler()
    
    X_tr, X_te, y_tr, y_te = train_test_split(X,y, shuffle = True, test_size = 0.2)
    #fit the scaler on the training set only so identical codes stay identical across both sets
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)
    knn_model = KNeighborsClassifier(n_neighbors=n, metric = ham_dist, weights = "distance")
    knn_model.fit(X_tr,y_tr)
    y_pr = knn_model.predict(X_te)
    cm = confusion_matrix(y_te, y_pr)
    return(cm)

In [282]:
knn_hamm(X3,y,3)

array([[7920,   31],
       [  44,    5]])

In [381]:
from scipy.spatial import distance

## Assessing iterations

- Define a function that collects a sample of confusion matrices from repeated random train/test splits (the sample size s is a parameter)
- Use the sample of matrices to calculate average model performance

In [363]:
def matrices_sample(X,y,n,s,function):
    matrices = []
    for i in range(s):
        matrices.append(function(X,y,n))
    return matrices

Our three functions are listed below

In [337]:
def knn_func(X,y,n):
    scaler = MinMaxScaler()
    
    X_tr, X_te, y_tr, y_te = train_test_split(X,y, shuffle = True, test_size = 0.2)
    #fit the scaler on the training set only, then reuse it for the test set
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)
    knn_model = KNeighborsClassifier(n_neighbors=n)
    knn_model.fit(X_tr,y_tr)
    y_pr = knn_model.predict(X_te)
    cm = confusion_matrix(y_te, y_pr)
    return(cm)

In [383]:
def knn_hamm_weighted(X,y,n):
    scaler = MinMaxScaler()
    
    X_tr, X_te, y_tr, y_te = train_test_split(X,y, shuffle = True, test_size = 0.2)
    #fit the scaler on the training set only so identical codes stay identical across both sets
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)
    knn_model = KNeighborsClassifier(n_neighbors=n, metric = distance.hamming, weights = "distance")
    knn_model.fit(X_tr,y_tr)
    y_pr = knn_model.predict(X_te)
    cm = confusion_matrix(y_te, y_pr)
    return(cm)

In [385]:
def knn_hamm(X,y,n):
    scaler = MinMaxScaler()
    
    X_tr, X_te, y_tr, y_te = train_test_split(X,y, shuffle = True, test_size = 0.2)
    #fit the scaler on the training set only so identical codes stay identical across both sets
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)
    knn_model = KNeighborsClassifier(n_neighbors=n, metric = distance.hamming, weights = "uniform")
    knn_model.fit(X_tr,y_tr)
    y_pr = knn_model.predict(X_te)
    cm = confusion_matrix(y_te, y_pr)
    return(cm)

In [379]:
iteration1 = matrices_sample(X,y,3,20,knn_func)
iteration1

[array([[7930,   15],
        [  51,    4]]),
 array([[7937,   14],
        [  48,    1]]),
 array([[7939,    5],
        [  56,    0]]),
 array([[7951,   11],
        [  35,    3]]),
 array([[7941,    7],
        [  50,    2]]),
 array([[7945,    7],
        [  47,    1]]),
 array([[7934,    6],
        [  56,    4]]),
 array([[7956,    3],
        [  38,    3]]),
 array([[7943,   10],
        [  45,    2]]),
 array([[7926,   10],
        [  64,    0]]),
 array([[7926,   14],
        [  59,    1]]),
 array([[7939,   13],
        [  47,    1]]),
 array([[7938,    3],
        [  58,    1]]),
 array([[7946,    7],
        [  45,    2]]),
 array([[7923,   17],
        [  58,    2]]),
 array([[7931,   10],
        [  57,    2]]),
 array([[7914,   28],
        [  54,    4]]),
 array([[7940,   11],
        [  46,    3]]),
 array([[7941,    7],
        [  50,    2]]),
 array([[7924,   15],
        [  58,    3]])]
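
One possible way to turn a sample like this into an average performance figure is sketched below. The helper name `summarise_matrices` is hypothetical and not part of the notebook. It pools the counts across runs before computing the metrics, which avoids dividing by zero in runs where no cancer cases were predicted.

```python
import numpy as np

def summarise_matrices(matrices):
    # Pool the counts across runs, then compute each metric once.
    tn, fp, fn, tp = np.sum(matrices, axis=0).ravel()
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp) if (tp + fp) > 0 else float("nan"),
    }

# First two matrices from the iteration1 output above.
sample = [np.array([[7930, 15], [51, 4]]), np.array([[7937, 14], [48, 1]])]
summary = summarise_matrices(sample)
print(summary)
```

Calling `summarise_matrices(iteration1)` would summarise all 20 runs in the same way.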

In [392]:
# iteration3 = matrices_sample(X3,y3,3,3,knn_hamm)
# iteration3

In [394]:
# iteration4 = matrices_sample(X3,y3,3,3,knn_hamm_weighted)
# iteration4

## Conclusions
For iterations 3 and 4, producing a large sample of confusion matrices with Hamming distance is much more computationally expensive than with Euclidean distance. The Hamming metric is passed to the classifier as a Python function, so every pairwise distance goes through a separate function call that compares the two vectors element by element, rather than through scikit-learn's optimised built-in routines. Therefore, I think it'll be better to use this distance measure, in this form, for smaller datasets and shorter vectors.
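
One possible way to reduce this cost, which I have not benchmarked here, is to pass the metric by name. `KNeighborsClassifier` accepts `metric="hamming"` as a string, which uses a built-in implementation instead of a Python callable. The sketch below runs on hypothetical toy codes (`X_demo`, `y_demo`), not on the mammography data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical categorical codes; no scaling is needed for Hamming distance.
X_demo = np.array([[0, 0, 0], [0, 0, 1], [1, 1, 1], [1, 1, 0]])
y_demo = np.array([0, 0, 1, 1])

# metric="hamming" selects the built-in implementation
# instead of calling a Python function for every pair of rows.
model = KNeighborsClassifier(n_neighbors=3, metric="hamming", weights="distance")
model.fit(X_demo, y_demo)

pred = model.predict(np.array([[0, 0, 0], [1, 1, 1]]))
print(pred)  # [0 1]
```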

Moving forward, we can recode the assess_c variable to be more suitable for Euclidean distance and take samples of confusion matrices to compare the models.

In [428]:
mamm5 = mammography_imputed.copy()
mamm5

Unnamed: 0,age_c,assess_c,density_c,famhx_c,hrt_c,prvmam_c,biophx_c,mammtype,bmi_c,cancer_c
0,62.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,24.023544,0.0
1,65.0,1.0,4.0,0.0,0.0,1.0,0.0,1.0,22.142502,0.0
2,69.0,0.0,2.0,0.0,0.0,1.0,0.0,1.0,29.052429,0.0
3,64.0,2.0,2.0,0.0,0.0,1.0,0.0,1.0,23.649673,0.0
4,63.0,3.0,2.0,0.0,0.0,1.0,1.0,1.0,33.729523,0.0
...,...,...,...,...,...,...,...,...,...,...
39995,80.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,31.093048,0.0
39996,78.0,1.0,3.0,0.0,0.0,1.0,0.0,1.0,24.802704,0.0
39997,77.0,1.0,2.0,0.0,0.0,1.0,0.0,1.0,34.335449,0.0
39998,66.0,1.0,2.0,0.0,0.0,1.0,1.0,1.0,25.685745,0.0


We will recode the value 0 as 4, and shift the values 4 and 5 to 5 and 6 respectively.

In [430]:
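#recode in one step; each value is looked up once, so a 0 that becomes 4 is not shifted again
#assigning the result back to the column avoids modifying a temporary copy of it
recode = {0: 4, 4: 5, 5: 6}
mamm5["assess_c"] = mamm5["assess_c"].map(lambda v: recode.get(v, v))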

In [444]:
corr_m = round(mamm5.corr(),3).iloc[:,9]
corr_m

age_c        0.010
assess_c     0.215
density_c    0.015
famhx_c      0.005
hrt_c        0.002
prvmam_c    -0.007
biophx_c     0.022
mammtype    -0.005
bmi_c        0.012
cancer_c     1.000
Name: cancer_c, dtype: float64

As you can see, by reordering assess_c, the correlation coefficient went from a small negative (-0.080) to a clearly positive value (0.215). The recoding places those who needed further screening between those originally assessed as 3 and 4, so they are now "closer" to the higher-risk groups.