<a href="https://colab.research.google.com/github/paulowe/ml-lambda/blob/main/colab-train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import packages 


In [2]:
import sklearn
import pandas as pd
import numpy as np
import csv
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from sklearn import metrics
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.preprocessing import label_binarize
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_auc_score



- Verify that you are running version 0.23.1 or later of sklearn. Some of the functions used for model evaluation only work with this version or higher.

- Run <> to upgrade sklearn

In [3]:
sklearn.__version__

'0.22.2.post1'
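
The version string above can also be compared against the required minimum in code. The following is a minimal sketch, not part of the original notebook: the helper names `version_tuple` and `meets_minimum` are hypothetical, and the comparison only looks at the leading numeric components of the version string.

```python
def version_tuple(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as "post1" or "dev0"
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, required="0.23.1"):
    """True if the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)


print(meets_minimum("0.22.2.post1"))  # the version reported above -> False
print(meets_minimum("0.23.1"))        # -> True
```

In the notebook itself, the call would be `meets_minimum(sklearn.__version__)`.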

## Import Data

- `X`: all training examples
- `y`: all true labels

In [4]:
data = pd.read_csv('./syntheticData.csv')
X, y = data.iloc[:, 1:], data.iloc[:,0]
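
The slicing above assumes the label is stored in the first column and every remaining column is a feature. Here is a small sketch of the same pattern on an in-memory CSV. The column name `Conditions_name` comes from the output shown later in this notebook. The symptom and condition values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row CSV with the label in the first column.
csv_text = (
    "Conditions_name,Symptom A,Symptom B\n"
    "Condition 1,1,0\n"
    "Condition 2,0,1\n"
)

demo = pd.read_csv(io.StringIO(csv_text))

# Same split as above: features are columns 1..end, labels are column 0.
X_demo, y_demo = demo.iloc[:, 1:], demo.iloc[:, 0]

print(list(X_demo.columns))
print(y_demo.name)
```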

## Visualize Data 

(80100 × 377) training matrix

(80100 × 1) label vector


In [5]:
print(X.head())
print(X.shape)
print(y.head())
print(y.shape)

   Abdominal distention  ...  Wrist weakness
0                     0  ...               0
1                     0  ...               0
2                     0  ...               0
3                     0  ...               0
4                     0  ...               0

[5 rows x 377 columns]
(80100, 377)
0    Abdominal aortic aneurysm
1    Abdominal aortic aneurysm
2    Abdominal aortic aneurysm
3    Abdominal aortic aneurysm
4    Abdominal aortic aneurysm
Name: Conditions_name, dtype: object
(80100,)


## Split into training, cross validation and test sets

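- Shuffle the dataset (`train_test_split` shuffles by default)

- Perform a stratified 60-20-20 split: 40% is held out first, then divided equally into cross validation and test sets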

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, stratify=y)

X_cv, X_test, y_cv, y_test = train_test_split(X_test, y_test, test_size=0.5, stratify=y_test)

print("Training data dimensions")
print(X_train.shape)
print(y_train.shape)

print("Cross validation data dimensions")
print(X_cv.shape)
print(y_cv.shape)

print("Test data dimensions")
print(X_test.shape)
print(y_test.shape)


Training data dimensions
(48060, 377)
(48060,)
Cross validation data dimensions
(16020, 377)
(16020,)
Test data dimensions
(16020, 377)
(16020,)
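
The split above produces different partitions on every run because no seed is set. Below is a minimal sketch of the same two-step 60-20-20 split made repeatable with `random_state`. The toy arrays and the seed value are assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for X and y: 100 samples, two balanced classes.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

SEED = 42  # assumed value; any fixed integer makes the split repeatable

# Step 1: hold out 40% of the data, stratified by label.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.40, stratify=y_demo, random_state=SEED
)

# Step 2: divide the held-out 40% equally into cross validation and test sets.
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=SEED
)

print(len(X_tr), len(X_val), len(X_te))  # 60 20 20
```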


## Train default MLP Classifier

In [7]:
clf = MLPClassifier()
clf = clf.fit(X_train, y_train)
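
Calling `MLPClassifier()` with no arguments uses scikit-learn's defaults: one hidden layer of 100 units, ReLU activation and the Adam solver. The sketch below shows one way to inspect those settings and the training progress after fitting. It runs on a small synthetic dataset from `make_classification`, which is an assumption made to keep the example fast, so the numbers will not match the model trained above.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Small synthetic problem (illustrative only, not the notebook's data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

clf_demo = MLPClassifier(random_state=0).fit(X_demo, y_demo)

# Hyperparameters the estimator was built with.
params = clf_demo.get_params()
print(params["hidden_layer_sizes"], params["activation"], params["solver"])

# Training diagnostics recorded during fit.
print("iterations run:", clf_demo.n_iter_)
print("final training loss:", clf_demo.loss_)
```

If the optimizer stops at the iteration limit, scikit-learn emits a `ConvergenceWarning`, and raising `max_iter` is the usual response.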



## Training Variant: Bottom-up implementation

In this variant I re-implement the classifier trained above from the ground up. The objective is to expose the underlying components of the training process so that they can be optimized and monitored directly.

- Prediction function
- Sigmoid Activation
- Cost Function


In [None]:
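def sigmoid(z):
    """
    Logistic sigmoid activation, applied element-wise
    """
    return 1.0 / (1.0 + np.exp(-z))

def predict(Theta1, Theta2, X):
    """
    Predict the label of an input given a trained neural network
    """
    m = X.shape[0]
    X = np.hstack((np.ones((m, 1)), X)) # add the bias column to the input layer

    a1 = sigmoid(X @ Theta1.T)
    a1 = np.hstack((np.ones((m, 1)), a1)) # hidden layer (with bias unit)
    a2 = sigmoid(a1 @ Theta2.T) # output layer

    return np.argmax(a2, axis=1) + 1 # 1-based index of the most active output unit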
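
The component list above names a cost function, but the notebook does not show one. The following is a minimal sketch only. It assumes the unregularized cross-entropy cost that commonly accompanies a sigmoid output layer with one-hot labels. The function name `cost` and the argument `Y_onehot` are hypothetical. `scipy.special.expit` is used as the logistic sigmoid.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid


def cost(Theta1, Theta2, X, Y_onehot):
    """Mean cross-entropy of a two-layer sigmoid network (no regularization)."""
    m = X.shape[0]

    # Forward pass, mirroring the layer layout used by the prediction function.
    a0 = np.hstack((np.ones((m, 1)), X))                     # input + bias
    a1 = np.hstack((np.ones((m, 1)), expit(a0 @ Theta1.T)))  # hidden + bias
    a2 = expit(a1 @ Theta2.T)                                # output layer

    # Clip to avoid log(0) when an output saturates.
    eps = 1e-12
    a2 = np.clip(a2, eps, 1.0 - eps)

    return -np.sum(Y_onehot * np.log(a2) + (1 - Y_onehot) * np.log(1 - a2)) / m


# Sanity check: with all-zero weights every output is 0.5,
# so the cost is (number of classes) * log(2).
X_demo = np.array([[0.0, 1.0], [1.0, 0.0]])
Y_demo = np.array([[1, 0, 0], [0, 0, 1]])
Theta1_demo = np.zeros((4, 3))  # 4 hidden units, 2 inputs + bias
Theta2_demo = np.zeros((3, 5))  # 3 outputs, 4 hidden units + bias
print(cost(Theta1_demo, Theta2_demo, X_demo, Y_demo))
```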

## Model Evaluation

Evaluating a model on held-out data is how you find out how well it actually performs.

It is therefore crucial to choose a good evaluation metric that you can monitor. In our case accuracy makes the most sense.

We will monitor

- Accuracy on Test (clf)
- AUC (implementation requires sklearn v0.23.1+)

- Accuracy on Test (eng)
- AUC

- Accuracy on other variants (vnt)
- AUC


In [1]:
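# Accuracy on the test set
testsetPred = clf.predict(X_test)
accuracy_score(y_test, testsetPred)

# AUC: for multiclass targets roc_auc_score needs class probabilities, not hard predictions
#roc_auc_score(y_test, clf.predict_proba(X_test), multi_class='ovr')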

NameError: ignored
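
The cell above could not run, so here is a self-contained sketch of the two metrics on a small synthetic three-class problem. The dataset, the seed and all `_demo` names are assumptions for illustration, and the scores say nothing about the model in this notebook. The point it shows is that multiclass AUC is computed from `predict_proba` output, one column per class, while accuracy uses the hard predictions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic three-class data (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

model_demo = MLPClassifier(random_state=0).fit(X_tr, y_tr)

# Accuracy compares hard label predictions with the true labels.
acc = accuracy_score(y_te, model_demo.predict(X_te))

# One-vs-rest AUC needs a probability for every class: shape (n_samples, n_classes).
proba = model_demo.predict_proba(X_te)
auc = roc_auc_score(y_te, proba, multi_class="ovr")

print("accuracy:", acc)
print("one-vs-rest AUC:", auc)
```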

## Serialize Model Variant

Serialize whichever classifier you prefer:

(1) Default Sklearn Model (clf)

(2) Variant 1 (eng)

(3) Variant 2

(4) Variant 3

In [None]:
# Serialize the model to disk
joblib.dump(clf, 'mlp.pkl')
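
A serialized model is only useful if it can be loaded back and gives the same predictions. The sketch below shows a round trip with `joblib.dump` and `joblib.load`. It assumes a small randomly generated dataset and a temporary file so that it runs on its own, and the names `model_demo` and `mlp_demo.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neural_network import MLPClassifier

# Tiny binary-feature dataset with string labels (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(60, 5))
y_demo = np.array(["a", "b", "c"] * 20)

model_demo = MLPClassifier(max_iter=50, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mlp_demo.pkl")
    joblib.dump(model_demo, path)  # write the fitted model to disk
    restored = joblib.load(path)   # read it back as a new object

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model_demo.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickled models are generally tied to the library version that produced them, so it is safest to load the file with the same scikit-learn version used for training.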