# Model Selection

In this notebook we train several scikit-learn classifiers on the same dataset and compare their F1 scores to find the model that performs best.

In [7]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def f1_scores(y_test, Z):
    # f1_score expects the true labels first, then the predictions.
    averages = ['macro', 'micro', 'weighted']
    for avg in averages:
        score = f1_score(y_test, Z, average=avg)
        print("f1 score ({}): {}".format(avg, score))

def test_model(X, y, model_name, model):
    print("MODEL: {}".format(model_name))
    # Hold out 20% of the rows; the split is random, so scores vary between runs.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model.fit(X_train, y_train)
    Z = model.predict(X_test)
    f1_scores(y_test, Z)
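
The three averages printed by `f1_scores` can differ noticeably when the classes are imbalanced. The toy example below is a minimal sketch of how they relate, assuming scikit-learn's documented `f1_score(y_true, y_pred)` argument order. The labels are hypothetical and are not taken from `data.csv`.

```python
from sklearn.metrics import f1_score

# Hypothetical labels for illustration only: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]  # one class-1 sample is misclassified as class 0

per_class = f1_score(y_true, y_pred, average=None)        # F1 of each class separately
macro = f1_score(y_true, y_pred, average="macro")         # unweighted mean of the per-class F1
micro = f1_score(y_true, y_pred, average="micro")         # computed from global TP/FP/FN counts
weighted = f1_score(y_true, y_pred, average="weighted")   # per-class F1 weighted by class support

print(per_class, macro, micro, weighted)
```

Here class 0 scores 8/9 and class 1 scores 2/3, so the macro average is 7/9 (about 0.778), the weighted average is 22/27 (about 0.815), and the micro average is 5/6 (about 0.833), which for single-label problems equals accuracy.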

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logistic Regression": LogisticRegression(),
    "SVC": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
    "GaussianNB": GaussianNB(),
    "Perceptron": Perceptron(),
    "SGDClassifier": SGDClassifier(),
    "Decision Tree": DecisionTreeClassifier()   
}

dataset_path = "data.csv"
data = pd.read_csv(dataset_path, sep=";")
X = data.drop("diagnosis", axis=1).values
y = data["diagnosis"]

for name, model in models.items():
    test_model(X, y, name, model)
    print()

MODEL: Logistic Regression
f1 score (macro): 0.7877711432321793
f1 score (micro): 0.8052930056710775
f1 score (weighted): 0.8056388319034243

MODEL: SVC
f1 score (macro): 0.8617902270640023
f1 score (micro): 0.8752362948960303
f1 score (weighted): 0.8749103296152537

MODEL: Random Forest
f1 score (macro): 0.881720430107527
f1 score (micro): 0.8960302457466919
f1 score (weighted): 0.8961080164838613

MODEL: K-Nearest Neighbors
f1 score (macro): 0.8192956534980678
f1 score (micro): 0.8431001890359168
f1 score (weighted): 0.8461997379340743

MODEL: GaussianNB
f1 score (macro): 0.7548035538313527
f1 score (micro): 0.775047258979206
f1 score (weighted): 0.7706522440457904

MODEL: Perceptron
f1 score (macro): 0.8084197141385527
f1 score (micro): 0.8298676748582231
f1 score (weighted): 0.8330182227605476

MODEL: SGDClassifier
f1 score (macro): 0.7320162107396151
f1 score (micro): 0.7353497164461249
f1 score (weighted): 0.7281177040659003

MODEL: Decision Tree
f1 score (macro): 0.8541459379757
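
Because `test_model` draws a new random split for every model, each score above comes from a different test set, so gaps of a point or two may be noise. One possible way to compare the models on equal footing is k-fold cross-validation on shared folds. The sketch below assumes synthetic data from `make_classification` as a stand-in for `data.csv`; the names `X_demo`, `y_demo`, `candidates`, and `cv_results` are illustrative and do not appear in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only as a placeholder for the real dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# A fixed, stratified 5-fold split that every model is scored on.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

cv_results = {}
for name, clf in candidates.items():
    # One macro F1 score per fold; the folds are identical for all models.
    scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1_macro")
    cv_results[name] = scores
    print("{}: mean={:.3f} std={:.3f}".format(name, scores.mean(), scores.std()))
```

Reporting the mean together with the standard deviation across folds gives a rough sense of whether the gap between two models is larger than the fold-to-fold variation.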

Random Forest achieved the highest macro F1 score in the run above, so we retrain it on a fresh split and inspect its confusion matrix and classification report.

In [26]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

model = RandomForestClassifier(n_estimators=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model.fit(X_train, y_train)
Z = model.predict(X_test)

print("Confusion Matrix")
print(confusion_matrix(y_test, Z))

print("\nClassification Report")
print(classification_report(y_test, Z))

Confusion Matrix
[[319  32]
 [ 25 153]]

Classification Report
             precision    recall  f1-score   support

          0       0.93      0.91      0.92       351
          1       0.83      0.86      0.84       178

avg / total       0.89      0.89      0.89       529
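
The per-class figures in the classification report can be recomputed by hand from the confusion matrix printed above. The following is a minimal sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[319, 32],
               [25, 153]])

tp = np.diag(cm)                  # correctly classified samples per class
support = cm.sum(axis=1)          # number of true samples per class
predicted = cm.sum(axis=0)        # number of samples predicted as each class

precision = tp / predicted
recall = tp / support
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

print(np.round(precision, 2))     # [0.93 0.83]
print(np.round(recall, 2))        # [0.91 0.86]
print(np.round(f1, 2))            # [0.92 0.84]
print(round(accuracy, 2))         # 0.89
```

The rounded values match the report: class 1 has lower precision and recall than class 0, with 25 class-1 samples predicted as class 0 and 32 class-0 samples predicted as class 1.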

