## Modelling Notebook

Use this notebook to try out the models you are considering. No preprocessing is done here. Steps:

1. Read in `data/final_dataset.csv`, which you created in `Data Cleaning.ipynb`
2. Try various models and print appropriate metrics (accuracy, MSE, etc.)
3. Pick a final model and save it as `models/model.pkl`

## Importing and Loading

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

import pickle

In [2]:
df = pd.read_csv('../data/final_dataset.csv')
df

| Unnamed: 0 | diagnosis_result | perimeter | area | compactness | symmetry |
|---|---|---|---|---|---|
| 0 | 0 | 0.825000 | 0.448687 | 0.781759 | 0.633136 |
| 1 | 1 | 0.675000 | 0.670644 | 0.133550 | 0.272189 |
| 2 | 0 | 0.650000 | 0.597255 | 0.397394 | 0.426036 |
| 3 | 0 | 0.216667 | 0.109785 | 0.801303 | 0.739645 |
| 4 | 0 | 0.691667 | 0.653341 | 0.309446 | 0.272189 |
| ... | ... | ... | ... | ... | ... |
| 95 | 0 | 0.666667 | 0.633652 | 0.302932 | 0.443787 |
| 96 | 1 | 0.216667 | 0.148568 | 0.107492 | 0.325444 |
| 97 | 1 | 0.083333 | 0.055489 | 0.048860 | 0.000000 |
| 98 | 1 | 0.183333 | 0.125895 | 0.120521 | 0.159763 |


## Splitting

### Input and Output

In [3]:
X = df.drop(columns=['diagnosis_result'])
Y = df['diagnosis_result']

### Training and Testing

In [4]:
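# Stratify on Y so the train and test sets keep the same class balance
# (with no test_size given, 25% of the rows go to the test set)
x_train, x_test, y_train, y_test = train_test_split(X, Y, stratify=Y)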
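
To illustrate what `stratify` does, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo`, the 60/39 class counts, and `random_state=0` are assumptions made for the example; they are not taken from the real dataset. Passing `random_state` is optional, but it makes the split repeatable between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 99 rows, 4 features,
# with made-up class counts of 60 (class 0) and 39 (class 1).
rng = np.random.default_rng(0)
X_demo = rng.random((99, 4))
y_demo = np.array([0] * 60 + [1] * 39)

# stratify=y_demo keeps the class ratio nearly identical in both splits;
# random_state fixes the shuffle so the split can be reproduced.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

print("train rows:", len(y_tr), "test rows:", len(y_te))
print("share of class 1 in train:", round(y_tr.mean(), 3))
print("share of class 1 in test: ", round(y_te.mean(), 3))
```

With only about a hundred rows, an unstratified split could easily leave the test set with a noticeably different class mix, which would distort the per-class metrics.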

In [5]:
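def train_eval(model, x_train, x_test, y_train, y_test):
    """Fit `model` on the training split, print test-set metrics, return the fitted model."""
    model.fit(x_train, y_train)
    prd = model.predict(x_test)
    # Rows are true classes, columns are predicted classes
    print(confusion_matrix(y_test, prd))
    print(classification_report(y_test, prd))
    return model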

## Evaluating

### KNN Classifier

In [6]:
knn = train_eval(KNeighborsClassifier(), x_train, x_test, y_train, y_test)

[[13  2]
 [ 2  8]]
              precision    recall  f1-score   support

           0       0.87      0.87      0.87        15
           1       0.80      0.80      0.80        10

    accuracy                           0.84        25
   macro avg       0.83      0.83      0.83        25
weighted avg       0.84      0.84      0.84        25



### Support Vector Machine Classifier

In [7]:
svm = train_eval(SVC(), x_train, x_test, y_train, y_test)

[[14  1]
 [ 2  8]]
              precision    recall  f1-score   support

           0       0.88      0.93      0.90        15
           1       0.89      0.80      0.84        10

    accuracy                           0.88        25
   macro avg       0.88      0.87      0.87        25
weighted avg       0.88      0.88      0.88        25



### Decision Tree Classifier

In [8]:
dt = train_eval(DecisionTreeClassifier(), x_train, x_test, y_train, y_test)

[[13  2]
 [ 2  8]]
              precision    recall  f1-score   support

           0       0.87      0.87      0.87        15
           1       0.80      0.80      0.80        10

    accuracy                           0.84        25
   macro avg       0.83      0.83      0.83        25
weighted avg       0.84      0.84      0.84        25



## Saving the Best Model

SVM performed best on this test split (accuracy 0.88, against 0.84 for both KNN and the decision tree), so we save the SVM model. To judge how well a model has performed, look at the overall accuracy together with the per-class f1-scores; a model that scores high on both is generally considered good. Note that in multi-class classification, and whenever the classes are imbalanced, the per-class f1-scores can be more informative than overall accuracy.
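
To make the link between the confusion matrix and the report concrete, the sketch below recomputes accuracy, precision, recall, and f1 by hand from the SVM confusion matrix printed earlier. The helper `per_class_scores` is a hypothetical name introduced only for this illustration; it is not part of scikit-learn.

```python
# Confusion matrix printed for the SVM: rows are true labels, columns are predictions.
cm = [[14, 1],
      [2, 8]]

def per_class_scores(cm, k):
    """Return (precision, recall, f1) for class k of a square confusion matrix.

    Assumes class k was predicted at least once and occurs at least once,
    otherwise the divisions below would fail.
    """
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that truly is k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy is the diagonal (correct predictions) over all predictions.
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

print(f"accuracy: {accuracy:.2f}")
for k in range(len(cm)):
    p, r, f = per_class_scores(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The printed values agree with the SVM classification report: the f1-score for class 1 (0.84) is lower than the overall accuracy (0.88), which is the kind of gap that accuracy alone would hide.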

In [9]:
filename = '../models/model.pkl'
# Use a context manager so the file is flushed and closed after writing
with open(filename, 'wb') as f:
    pickle.dump(svm, f)

We can now load the saved model and check that it generates predictions.

In [10]:
with open(filename, 'rb') as f:
    model = pickle.load(f)
model.predict(x_test)

array([1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1], dtype=int64)
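
A stricter check than eyeballing the array is to confirm that the reloaded model predicts exactly what the original did. The following is a minimal, self-contained sketch on toy data; `X_demo`, `y_demo`, `original`, and `restored` are assumed names for illustration, and the round trip goes through bytes in memory rather than `../models/model.pkl`.

```python
import pickle

import numpy as np
from sklearn.svm import SVC

# Toy data standing in for the real features (values are made up).
rng = np.random.default_rng(0)
X_demo = rng.random((40, 4))
y_demo = np.arange(40) % 2  # alternating labels so both classes are present

original = SVC().fit(X_demo, y_demo)

# Serialise to bytes and back; the same comparison works for a file on disk.
restored = pickle.loads(pickle.dumps(original))

same = np.array_equal(original.predict(X_demo), restored.predict(X_demo))
print("predictions identical after reload:", same)
```

In the notebook itself, the equivalent comparison would be between `svm.predict(x_test)` and `model.predict(x_test)`. Two general cautions apply to pickled models: only unpickle files from a trusted source, and load them with the same scikit-learn version that saved them where possible.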