<p align="center"><h1 align="center">GR5074 <br> Projects in Advanced Machine Learning <br> Spring 2022</h1></p>

---
<br>

This notebook contains starter code for our class workshop, where we'll:

1. learn how to create a data preprocessing module suitable for an ML pipeline (in our case, using the `scikit-learn` library)
2. get familiar with the **AI Model Share Initiative** API in order to
  * build all the elements needed to submit a model
  * submit a model
  * retrieve leaderboard information



## **(1) Preprocessor Function & Setup**

> ### A more advanced example demonstrating the flexibility of a new *Column Transformer* approach.

In [1]:
# note that tabular preprocessors require scikit-learn>=0.24.0
!pip install scikit-learn --upgrade 

% tensorflow_version 1.x

TensorFlow 1.x selected.


In [2]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

print(data.shape)

data.head()

(1309, 14)


Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [3]:
# Preprocess data using sklearn's Column Transformer approach

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']

# Replace missing values with the modal (most frequent) value, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Final preprocessor object set up with ColumnTransformer...
preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Target = survived
y = data['survived']
y = y.map({0: 'died', 1: 'survived'})

# keep only pclass, sex, age, fare, embarked as features 
X = data.drop(['survived','sibsp','parch','ticket','name','cabin','boat','body','home.dest'], axis=1)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# fit preprocessor to your data
preprocess = preprocess.fit(X_train)
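
The `handle_unknown='ignore'` setting on the one-hot encoder matters because the preprocessor is fitted on `X_train` only: a category that first appears at transform time is encoded as an all-zero row instead of raising an error. Here is a minimal sketch of that behavior on made-up values (the arrays are illustrative and not taken from the Titanic data):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories only (illustrative values).
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["C"], ["S"]], dtype=object))

# "Q" was never seen during fit, so its row is encoded as all zeros.
encoded = encoder.transform(np.array([["S"], ["Q"]], dtype=object)).toarray()
print(encoder.categories_)
print(encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise a `ValueError`.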

In [4]:
# Write function to transform data with preprocessor

def preprocessor(data):
    preprocessed_data=preprocess.transform(data)
    return preprocessed_data

In [5]:
X_train.shape

(1047, 5)

In [29]:
# Notice categorical feature columns have been one-hot encoded
preprocessor(X_train)

array([[-0.37016209, -0.50478215,  0.        , ...,  0.        ,
         0.        ,  1.        ],
       [ 0.90402864,  1.97155505,  1.        , ...,  1.        ,
         0.        ,  0.        ],
       [-0.13125133, -0.5085326 ,  0.        , ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [-0.13125133, -0.5085326 ,  0.        , ...,  0.        ,
         0.        ,  1.        ],
       [-0.7683467 ,  0.05915559,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       [ 0.18729636, -0.35658342,  0.        , ...,  0.        ,
         0.        ,  1.        ]])
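
The preprocessed matrix is wider than the five input columns because each categorical feature expands into one column per category. The sketch below rebuilds the same kind of `ColumnTransformer` on a small made-up frame (`toy` and its values are assumptions for illustration) and derives the expected output width from the fitted encoder:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with the same column names as the notebook's features.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, 8.05, np.nan],
    "embarked": ["S", "C", "S", "Q"],
    "sex": ["male", "female", "female", "male"],
    "pclass": [3, 1, 3, 2],
})
numeric_features = ["age", "fare"]
categorical_features = ["embarked", "sex", "pclass"]

toy_preprocess = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())]), numeric_features),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical_features),
])
transformed = toy_preprocess.fit_transform(toy)

# One output column per numeric feature plus one per learned category.
onehot = toy_preprocess.named_transformers_["cat"].named_steps["onehot"]
expected_width = len(numeric_features) + sum(len(c) for c in onehot.categories_)
print(transformed.shape, expected_width)
```

The same calculation on the fitted `preprocess` object gives the feature count needed later when declaring the ONNX input shape.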

## **(2) Build Your Model Using `sklearn`**

In [7]:
print(X_train.shape, X_test.shape, 
      y_train.shape, y_test.shape)

(1047, 5) (262, 5) (1047,) (262,)


In [8]:
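# L2-penalized logistic regression: tune the inverse regularization
# strength C with 10-fold cross-validation.

# Note: np.logspace(1, 10, 100) spans C from 1e1 to 1e10.
hyperparameters = {'C':np.logspace(1, 10, 100), 'penalty':['l2']}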

logit = LogisticRegression()
logit_cv = GridSearchCV(logit, hyperparameters, cv = 10)
logit_cv.fit(preprocessor(X_train), y_train)

print("Best parameters:", logit_cv.best_params_)

Best parameters: {'C': 10.0, 'penalty': 'l2'}


In [9]:
logit_cv.best_estimator_

LogisticRegression(C=10.0)
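
Two things are worth noting about this search. The selected `C=10.0` is the smallest value in the grid, so a grid that also covers values below 10 may be worth trying. In addition, the preprocessor was fitted on all of `X_train` before cross-validation, so each validation fold has already influenced the imputation and scaling statistics. One common alternative, sketched here on synthetic data (`toy_X`, `toy_y`, and the grid are assumptions for illustration), is to put the preprocessing and the model in a single `Pipeline` so that `GridSearchCV` refits both within each fold:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: the label depends on "sex" with about 20% noise.
rng = np.random.RandomState(0)
n = 200
toy_X = pd.DataFrame({
    "age": rng.uniform(1, 80, n),
    "fare": rng.exponential(30, n),
    "sex": rng.choice(["female", "male"], n),
})
is_female = (toy_X["sex"] == "female").to_numpy()
flip = rng.rand(n) < 0.2
toy_y = np.where(is_female ^ flip, "survived", "died")

toy_preprocess = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())]), ["age", "fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

# Preprocessing and model live in one estimator, so each CV fold
# fits the imputer, scaler, and encoder on its own training part only.
clf = Pipeline(steps=[
    ("preprocess", toy_preprocess),
    ("logit", LogisticRegression(max_iter=1000)),
])

# Step-name prefix "logit__" routes the parameter to the final estimator.
param_grid = {"logit__C": np.logspace(-3, 3, 7)}
search = GridSearchCV(clf, param_grid, cv=5)
search.fit(toy_X, toy_y)
print(search.best_params_, round(search.best_score_, 3))
```

This is a sketch of the general pattern, not a replacement for the workshop code: the AI Model Share steps later in the notebook expect a separate preprocessor function and model.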

In [35]:
model = LogisticRegression(C=10, penalty='l2')

model.fit(preprocessor(X_train), y_train) # Fitting to the training set.

model.score(preprocessor(X_train), y_train) # Training accuracy, on a 0-1 scale.

0.7793696275071633

In [36]:
y_pred = model.predict(preprocessor(X_test))

y_pred

array(['died', 'survived', 'died', 'died', 'died', 'survived', 'died',
       'died', 'died', 'died', 'died', 'died', 'died', 'survived',
       'survived', 'died', 'survived', 'died', 'survived', 'died', 'died',
       'died', 'died', 'survived', 'died', 'survived', 'died', 'died',
       'died', 'survived', 'survived', 'survived', 'survived', 'died',
       'survived', 'died', 'died', 'died', 'died', 'died', 'died', 'died',
       'died', 'died', 'survived', 'died', 'died', 'survived', 'died',
       'died', 'survived', 'died', 'died', 'survived', 'died', 'died',
       'survived', 'died', 'survived', 'survived', 'died', 'died',
       'survived', 'died', 'survived', 'survived', 'died', 'died', 'died',
       'survived', 'survived', 'died', 'died', 'died', 'survived',
       'survived', 'died', 'survived', 'survived', 'died', 'died',
       'survived', 'died', 'died', 'survived', 'survived', 'died', 'died',
       'died', 'died', 'died', 'died', 'died', 'died', 'died', 'died',
      

In [37]:
# Evaluate on held-out test data
from sklearn.metrics import accuracy_score

print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))

Accuracy: 79.01%
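
Accuracy is the share of predictions that match the true labels, which is the diagonal of the confusion matrix divided by the total count. The small example below uses hand-written labels (not the notebook's predictions) to show how the two relate, along with the per-class precision and recall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels for illustration only.
y_true = ["died", "died", "survived", "survived", "died", "survived"]
y_hat = ["died", "survived", "survived", "died", "died", "survived"]
labels = ["died", "survived"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=labels)
accuracy = accuracy_score(y_true, y_hat)

# Accuracy equals correct predictions (the diagonal) over all predictions.
accuracy_from_cm = np.trace(cm) / cm.sum()

# Per-class precision (column-wise) and recall (row-wise).
precision = np.diag(cm) / cm.sum(axis=0)
recall = np.diag(cm) / cm.sum(axis=1)
print(cm)
print(accuracy, accuracy_from_cm, precision, recall)
```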


## **(3) Submit Your Model to AI Model Share (via the `aimodelshare` API)**

#### Step (1) install `aimodelshare` library

In [13]:
! pip install aimodelshare --upgrade

Collecting aimodelshare
  Downloading aimodelshare-0.0.86-py3-none-any.whl (131 kB)
Collecting onnx>=1.9.0
  Downloading onnx-1.10.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (12.7 MB)
Collecting skl2onnx>=1.8.0
  Downloading skl2onnx-1.10.4-py2.py3-none-any.whl (273 kB)
Collecting Pympler==0.9
  Downloading Pympler-0.9.tar.gz (178 kB)
Collecting PyJWT==2.2.0
  Downloading PyJWT-2.2.0-py3-none-any.whl (16 kB)
Collecting scikit-learn==0.24.2
  Downloading scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
Collecting keras2onnx>=1.7.0
  Downloading keras2onnx-1.7.0-py3-none-any.whl (96 kB)

#### Step (2) import the `aimodelshare` library and export the preprocessor locally as a zip file containing everything `aimodelshare` needs for deployment

In [14]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"")

Your preprocessor is now saved to 'preprocessor.zip'


#### Step (3) import the preprocessor that was just created 

In [15]:
prep = ai.import_preprocessor("preprocessor.zip")
prep(X_test)

array([[ 0.66511788, -0.50535342,  0.        , ...,  0.        ,
         0.        ,  1.        ],
       [-0.68870978, -0.24898038,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       [ 0.98366557, -0.13159525,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [ 0.02802251, -0.40549389,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       [-0.29052517, -0.40549389,  0.        , ...,  0.        ,
         1.        ,  0.        ],
       [-0.13125133, -0.50233662,  0.        , ...,  0.        ,
         0.        ,  1.        ]])

#### Step (4) convert the fitted model object to an ONNX file

In [38]:
from aimodelshare.aimsonnx import model_to_onnx
from skl2onnx.common.data_types import FloatTensorType

# Get count of preprocessed features
feature_count = preprocessor(X_test).shape[1] 

# Insert correct number of preprocessed features
initial_type = [('float_input', FloatTensorType([None, feature_count]))]

# transform sklearn model to ONNX
onnx_model_sklearn = model_to_onnx(model, framework='sklearn', 
                                   initial_types=initial_type,
                                   transfer_learning=False,
                                   deep_learning=False, 
                                   task_type = 'classification')

# Save model to local .onnx file
with open("onnx_model_sklearn.onnx", "wb") as f:
    f.write(onnx_model_sklearn.SerializeToString())

#### Step (5) create model predictions for submission

In [39]:
predictions_sklearn = model.predict(preprocessor(X_test))

#### Step (6) add AI Model Share credentials for the Playground

In [41]:
# Set credentials for model submissions to this competition by running the
# function below, then entering your aimodelshare username and password.
# Public competitions allow any aimodelshare user to submit new models.

from aimodelshare.aws import set_credentials

apiurl = "https://wgwd00tice.execute-api.us-east-1.amazonaws.com/prod/m"

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


#### Step (7) Instantiate competition and submit model + predictions

In [42]:
# Instantiate Competition
mycompetition = ai.Competition(apiurl)

In [43]:
#-- Generate predicted values (a list of predicted labels "survived" or "died") (Model 1)
prediction_labels = model.predict(preprocessor(X_test))

In [44]:
# Submit model and predictions to competition leaderboard
mycompetition.submit_model(model_filepath = "onnx_model_sklearn.onnx",
                preprocessor_filepath="preprocessor.zip",
                prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional):  
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 13

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1187


#### Step (8) get leaderboard and learn from submissions

In [45]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,deep_learning,model_type,depth,num_params,dense_layers,dropout_layers,softmax_act,relu_act,loss,optimizer,model_config,username,version
0,82.82%,80.35%,83.47%,78.98%,sklearn,False,RandomForestClassifier,,,,,,,,,"{'bootstrap': True, 'ccp_alpha...",AdvProjectsinML,5
1,80.53%,78.48%,79.48%,77.83%,keras,True,Sequential,4.0,24162.0,4.0,,1.0,3.0,str,RMSprop,"{'name': 'sequential_1', 'laye...",AdvProjectsinML,9
2,80.15%,78.22%,78.88%,77.75%,sklearn,False,GradientBoostingClassifier,,,,,,,,,"{'ccp_alpha': 0.0, 'criterion'...",AdvProjectsinML,6
3,79.01%,77.44%,77.39%,77.50%,sklearn,False,LogisticRegression,,10.0,,,,,,liblinear,"{'C': 10, 'class_weight': None...",AdvProjectsinML,1
4,79.01%,77.44%,77.39%,77.50%,sklearn,False,LogisticRegression,,10.0,,,,,,liblinear,"{'C': 10, 'class_weight': None...",AdvProjectsinML,3
5,76.34%,72.49%,75.90%,71.44%,sklearn,False,LogisticRegression,,10.0,,,,,,lbfgs,"{'C': 0.01, 'class_weight': No...",AdvProjectsinML,2
6,76.34%,72.49%,75.90%,71.44%,sklearn,False,LogisticRegression,,10.0,,,,,,lbfgs,"{'C': 0.01, 'class_weight': No...",AdvProjectsinML,4
7,66.79%,48.19%,82.81%,54.69%,keras,True,Sequential,4.0,9154.0,4.0,,1.0,3.0,str,SGD,"{'name': 'sequential', 'layers...",AdvProjectsinML,7
8,65.27%,45.81%,71.30%,53.04%,keras,True,Sequential,7.0,18114.0,5.0,2.0,1.0,4.0,str,SGD,"{'name': 'sequential_1', 'laye...",AdvProjectsinML,8
9,55.34%,51.58%,51.60%,51.58%,sklearn,False,LogisticRegression,,10.0,,,,,,liblinear,"{'C': 10, 'class_weight': None...",mikedparrott,10
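
Because the leaderboard is returned as a pandas data frame, ordinary pandas operations can summarize it. The sketch below hand-copies a few rows from the table above into a small frame and assumes the metric columns are percent strings, as in the stylized display (the raw frame may store them differently). It reports the best accuracy per model type:

```python
import pandas as pd

# A few rows copied by hand from the leaderboard shown above.
leaderboard = pd.DataFrame({
    "accuracy": ["82.82%", "80.53%", "79.01%", "76.34%"],
    "model_type": ["RandomForestClassifier", "Sequential",
                   "LogisticRegression", "LogisticRegression"],
    "version": [5, 9, 1, 2],
})

# Assumption: metrics are strings such as "82.82%"; convert them to floats.
leaderboard["accuracy_pct"] = (
    leaderboard["accuracy"].str.rstrip("%").astype(float)
)

# Best accuracy achieved by each model type, highest first.
best_by_type = (
    leaderboard.groupby("model_type")["accuracy_pct"]
    .max()
    .sort_values(ascending=False)
)
print(best_by_type)
```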


#### Step (9) compare models for learning


In [46]:
# Compare two or more models
data = mycompetition.compare_models([1,2], verbose=1)
mycompetition.stylize_compare(data)

Unnamed: 0,param_name,default_value,model_version_1,model_version_2
0,C,1.000000,10,0.010000
1,class_weight,,,
2,dual,False,False,False
3,fit_intercept,True,True,True
4,intercept_scaling,1,1,1
5,l1_ratio,,,
6,max_iter,100,100,100
7,multi_class,auto,auto,auto
8,n_jobs,,,
9,penalty,l2,l1,l2
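
A similar side-by-side view can be built locally for any scikit-learn estimators with `get_params()`. The helper below (`diff_params` is a hypothetical name, and the two model configurations are made up for illustration) keeps only the parameters where at least one model departs from the library default:

```python
from sklearn.linear_model import LogisticRegression


def diff_params(default_model, *models):
    """Return {param: (default, value_1, value_2, ...)} for changed params."""
    defaults = default_model.get_params()
    rows = {}
    for name, default_value in defaults.items():
        values = [m.get_params()[name] for m in models]
        # Keep the row only if some model differs from the default.
        if any(v != default_value for v in values):
            rows[name] = (default_value, *values)
    return rows


# Illustrative configurations, not the submitted model versions.
model_a = LogisticRegression(C=10, solver="liblinear")
model_b = LogisticRegression(C=0.01)

changed = diff_params(LogisticRegression(), model_a, model_b)
for name, values in changed.items():
    print(name, values)
```

Filtering to changed parameters keeps the comparison short when most settings are left at their defaults.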





