## Titanic Survival Prediction Competition
Let's submit our models to a centralized leaderboard so that we can collaborate and learn from each other's model experiments.

**Instructions:**
1. Get data in and set up X_train / X_test / y_train
2. Preprocess data using sklearn's ColumnTransformer, then write and save a preprocessor function
3. Fit a model on the preprocessed data, then save the preprocessor function and the model
4. Generate predictions from X_test data and submit the model to the competition
5. Repeat the submission process to improve your place on the leaderboard



## 1. Get data in and set up X_train, X_test, y_train objects

In [None]:
#install aimodelshare library
! pip install aimodelshare --upgrade

In [2]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/x8e4b0t0/titanic_competition_data-repository:latest') 

Downloading [=>                                               ]

Data downloaded successfully.


In [4]:
# Separate data into X_train, y_train, and X_test
import pandas as pd
full_training_data=pd.read_csv("titanic_competition_data/training_data.csv")

X_train=full_training_data.iloc[:,full_training_data.columns!='survived']
X_test=pd.read_csv("titanic_competition_data/test_data.csv")
y_train=full_training_data['survived']

X_train.head()

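| | Unnamed: 0 | Unnamed: 0.1 | pclass | sex | age | fare | embarked |
|---|---|---|---|---|---|---|---|
| 0 | 1107 | 1107 | 3 | male | 21.0 | 8.6625 | S |
| 1 | 928 | 928 | 3 | female | NaN | 14.4542 | C |
| 2 | 347 | 347 | 2 | male | 42.0 | 13.0 | S |
| 3 | 819 | 819 | 3 | female | NaN | 7.75 | Q |
| 4 | 71 | 71 | 1 | male | 27.0 | 136.7792 | C |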
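
The `Unnamed: 0` columns in the preview look like row indices left over from earlier CSV exports rather than passenger features. The following is a minimal sketch, assuming that reading is right, of how they could be dropped while splitting off the target. The inline CSV is made-up illustration data, not the competition file, and the names `toy_csv`, `toy`, `features` and `target` are hypothetical.

```python
import io
import pandas as pd

# Made-up rows that mimic the layout of training_data.csv (not real competition data)
toy_csv = io.StringIO(
    "Unnamed: 0,Unnamed: 0.1,pclass,sex,age,fare,embarked,survived\n"
    "10,10,3,male,30.0,7.25,S,0\n"
    "11,11,1,female,,80.0,C,1\n"
)
toy = pd.read_csv(toy_csv)

# Drop leftover index columns, then separate the target from the features
toy = toy.loc[:, ~toy.columns.str.startswith("Unnamed")]
features = toy.drop(columns="survived")
target = toy["survived"]

print(list(features.columns))
```

Dropping these columns is optional in this notebook, because the ColumnTransformer in the next step selects its input columns by name and discards the rest by default.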


## 2. Preprocess data using Sklearn Column Transformer / Write and Save Preprocessor function


In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

#Preprocess data using sklearn's Column Transformer approach

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), #'imputer' names the step
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']

# Replace missing values with the most frequent (modal) value, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Final preprocessor object set up with ColumnTransformer...

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# fit preprocessor to your data
preprocess = preprocess.fit(X_train)

In [6]:
# Write function to transform data with preprocessor 
# In this case we use sklearn's Column transformer in our preprocessor function

def preprocessor(data):
    preprocessed_data=preprocess.transform(data)
    return preprocessed_data
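
The ONNX export later in this notebook needs the number of columns the preprocessor produces. The sketch below rebuilds the same kind of ColumnTransformer on a few made-up rows (the frame `toy` and the name `toy_preprocess` are hypothetical, not part of the notebook) and counts the output columns: two scaled numeric columns plus one indicator column per category level.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows covering every category seen in the preview (S/C/Q, male/female, classes 1-3)
toy = pd.DataFrame({
    "pclass": [3, 2, 1, 3],
    "sex": ["male", "female", "male", "female"],
    "age": [21.0, np.nan, 42.0, 35.0],
    "fare": [8.5, 14.0, 130.0, 7.75],
    "embarked": ["S", "C", "Q", "S"],
})

toy_preprocess = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["age", "fare"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     ["embarked", "sex", "pclass"]),
]).fit(toy)

# 2 numeric + 3 embarked + 2 sex + 3 pclass columns
n_features = toy_preprocess.transform(toy).shape[1]
print(n_features)
```

On the real data, `preprocessor(X_train).shape[1]` gives the corresponding count.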

## 3. Fit model on preprocessed data and save preprocessor function and model


In [7]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=10, penalty='l1', solver = 'liblinear')
model.fit(preprocessor(X_train), y_train)  # Fit on the preprocessed training set
model.score(preprocessor(X_train), y_train)  # Accuracy on the training data (0-1 scale)

0.7879656160458453
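
The score above is measured on the same rows the model was fitted to, so it can be optimistic. As a hedged sketch of one common alternative, the block below estimates accuracy with 5-fold cross-validation. It uses synthetic data from `make_classification` as a stand-in for `preprocessor(X_train)` and `y_train`, and the names `X_toy`, `y_toy`, `toy_model` and `cv_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for preprocessor(X_train) and y_train (10 features, binary target)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

toy_model = LogisticRegression(C=10, penalty="l1", solver="liblinear")

# Each of the 5 scores is the accuracy on a fold the model was not fitted on
cv_scores = cross_val_score(toy_model, X_toy, y_toy, cv=5)
print(cv_scores.mean())
```

In the notebook, the same call could be made with `model`, `preprocessor(X_train)` and `y_train` before deciding which model to submit.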

#### Save preprocessor function to local "preprocessor.zip" file

In [8]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


#### Save model to local ".onnx" file

In [9]:
# Save sklearn model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

# The input shape must match the number of features in the preprocessed data
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, 10]))]  # 10 = number of preprocessed features; replace with your own count

onnx_model = model_to_onnx(model, framework='sklearn',
                          initial_types=initial_type,
                          transfer_learning=False,
                          deep_learning=False)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

## 4. Generate predictions from X_test data and submit model to competition


In [17]:
#Set credentials using modelshare.org username/password

apiurl='https://r2okzbjyhh.execute-api.us-east-1.amazonaws.com/prod/m'

ai.set_credentials(apiurl=apiurl)


AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [18]:
#Instantiate Competition

mycompetition= ai.Competition(apiurl)

In [19]:
#Submit Model 1: 

#-- Generate predicted values (a list of predicted labels "survived" or "died") (Model 1)
prediction_labels = model.predict(preprocessor(X_test))

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): test
Provide any useful notes about your model (optional): test

Your model has been submitted as model version 1

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:603


In [20]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

| | accuracy | f1_score | precision | recall | ml_framework | transfer_learning | deep_learning | model_type | num_params | optimizer | model_config | username | version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 77.86% | 75.94% | 75.72% | 76.21% | sklearn | False | False | LogisticRegression | 10 | liblinear | {'C': 10, 'class_weight': None... | mikedparrott | 1 |
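
The leaderboard reports accuracy, F1, precision and recall. How AI Model Share averages these across classes is not stated in this notebook, so the following is only a sketch of the standard binary definitions for a single positive class, computed on made-up labels. `binary_metrics` is a hypothetical helper, not part of the aimodelshare library.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision asks how many predicted survivors really survived, while recall asks how many actual survivors were found. F1 is the harmonic mean of the two.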


## 5. Repeat submission process to improve place on leaderboard


In [38]:
# Train and submit model 2 using same preprocessor (note that you could save a new preprocessor, but we will use the same one for this example).
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=.01, penalty='l1', solver = 'liblinear')
model.fit(preprocessor(X_train), y_train)  # Fit on the preprocessed training set
model.score(preprocessor(X_train), y_train)  # Accuracy on the training data (0-1 scale)

0.7058261700095511

In [39]:
# Save sklearn model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

# The input shape must match the number of features in the preprocessed data
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, 10]))]  # 10 = number of preprocessed features; replace with your own count

onnx_model = model_to_onnx(model, framework='sklearn',
                          initial_types=initial_type,
                          transfer_learning=False,
                          deep_learning=False)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [40]:
#Submit Model 2: 

#-- Generate predicted values (a list of predicted labels "survived" or "died") (Model 2)
prediction_labels = model.predict(preprocessor(X_test))

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): test
Provide any useful notes about your model (optional): test

Your model has been submitted as model version 7

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:603


In [24]:
# Compare differences between models
# (Experimental, Git-like Diffs for Model Architectures)
mycompetition.compare_models([1,2])

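| | param_name | model_default | Model_1 | Model_2 |
|---|---|---|---|---|
| 0 | C | 1 | 10 | 0.01 |
| 1 | class_weight | | | |
| 2 | dual | False | False | False |
| 3 | fit_intercept | True | True | True |
| 4 | intercept_scaling | 1 | 1 | 1 |
| 5 | l1_ratio | | | |
| 6 | max_iter | 100 | 100 | 100 |
| 7 | multi_class | auto | auto | auto |
| 8 | n_jobs | | | |
| 9 | penalty | l2 | l1 | l1 |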


In [48]:
# Submit a third model using GridSearchCV

from sklearn.model_selection import GridSearchCV
import numpy as np

param_grid = {'C': np.arange(.1, 10, .1),'penalty':['l2']} # np.arange creates the sequence of candidate C values (0.1 up to, but not including, 10)

gridmodel = GridSearchCV(LogisticRegression(solver ='newton-cg'), param_grid=param_grid, cv=10)

# Use the GridSearchCV object like a regular model: it supports fit, score, and predict
gridmodel.fit(preprocessor(X_train), y_train)

# Extract the best score and parameters from the attributes "best_score_" and "best_params_"
print("best mean cross-validation score: {:.3f}".format(gridmodel.best_score_))
print("best parameters: {}".format(gridmodel.best_params_))


best mean cross-validation score: 0.786
best parameters: {'C': 0.30000000000000004, 'penalty': 'l2'}
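
The value `0.30000000000000004` in the output above is a floating-point artifact of building the grid with `np.arange` and a fractional step. It is harmless for the search itself. If cleaner values are preferred in logs and leaderboard metadata, one option (a sketch, not the notebook's method) is to build the grid from explicit endpoints with `np.linspace` and round it. The name `c_grid` is hypothetical.

```python
import numpy as np

# 99 candidate values covering 0.1 to 9.9 in steps of 0.1, rounded to one decimal
c_grid = np.round(np.linspace(0.1, 9.9, 99), 1)

# Same structure as the notebook's param_grid, with the cleaned-up C values
param_grid = {"C": c_grid, "penalty": ["l2"]}
print(c_grid[:3])
```

With `cv=10`, a grid of this size means 990 model fits plus one final refit on the full training set, which is worth keeping in mind before widening the search.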


In [58]:
# Save sklearn model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

# The input shape must match the number of features in the preprocessed data
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, 10]))]  # 10 = number of preprocessed features; replace with your own count

onnx_model = model_to_onnx(gridmodel, framework='sklearn',
                          initial_types=initial_type,
                          transfer_learning=False,
                          deep_learning=False)

with open("gridmodel.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [61]:
#Submit Model 3: 

#-- Generate predicted values (a list of predicted labels "survived" or "died")
prediction_labels = gridmodel.predict(preprocessor(X_test))

# Submit Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "gridmodel.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): test
Provide any useful notes about your model (optional): test

Your model has been submitted as model version 11

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:603


In [62]:
# Get leaderboard

data = mycompetition.get_leaderboard()
mycompetition.stylize_leaderboard(data)

| | accuracy | f1_score | precision | recall | ml_framework | transfer_learning | deep_learning | model_type | num_params | optimizer | model_config | username | version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 82.44% | 80.54% | 80.88% | 80.24% | sklearn | False | False | LogisticRegression | 10.0 | newton-cg | {'C': 0.30000000000000004, 'cl... | mikedparrott | 10 |
| 1 | 82.44% | 80.54% | 80.88% | 80.24% | sklearn | False | False | RandomForestClassifier | | | {'bootstrap': True, 'ccp_alpha... | mikedparrott | 11 |
| 2 | 77.86% | 75.94% | 75.72% | 76.21% | sklearn | False | False | LogisticRegression | 10.0 | liblinear | {'C': 10, 'class_weight': None... | mikedparrott | 1 |
| 3 | 77.86% | 75.94% | 75.72% | 76.21% | sklearn | False | False | LogisticRegression | 10.0 | liblinear | {'C': 4, 'class_weight': None,... | mikedparrott | 3 |
| 4 | 77.86% | 75.94% | 75.72% | 76.21% | sklearn | False | False | LogisticRegression | 10.0 | liblinear | {'C': 4, 'class_weight': None,... | mikedparrott | 4 |
| 5 | 77.86% | 75.94% | 75.72% | 76.21% | sklearn | False | False | LogisticRegression | 10.0 | liblinear | {'C': 4, 'class_weight': None,... | mikedparrott | 5 |
| 6 | 77.86% | 75.94% | 75.72% | 76.21% | sklearn | False | False | LogisticRegression | 10.0 | liblinear | {'C': 100, 'class_weight': Non... | mikedparrott | 6 |
| 7 | 77.86% | 75.83% | 75.71% | 75.96% | sklearn | False | False | LogisticRegression | 10.0 | liblinear | {'C': 1, 'class_weight': None,... | mikedparrott | 8 |
| 8 | 77.86% | 75.83% | 75.71% | 75.96% | sklearn | False | False | LogisticRegression | 10.0 | newton-cg | {'C': 0.30000000000000004, 'cl... | mikedparrott | 9 |
| 9 | 73.28% | 62.17% | 80.33% | 62.70% | sklearn | False | False | LogisticRegression | 10.0 | liblinear | {'C': 0.01, 'class_weight': No... | mikedparrott | 2 |
