# Model Training
The goal here is simply to build a functional machine learning model that can be serialized for reuse. The model is assembled with sklearn's `Pipeline` and tuned with `GridSearchCV`.

This model uses sklearn's [wine](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine) dataset.

## Helpful references
* [sklearn: Selecting dimensionality reduction with Pipeline and GridSearchCV](https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html)
* [sklearn: Putting it all together](https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html)
* [Importance of feature scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py)

## Documentation
* [sklearn: SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
* [sklearn: RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [sklearn: Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
* [sklearn: GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
* [sklearn: Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer)

In [1]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import Normalizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, accuracy_score, f1_score

features, target = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                    test_size=0.30)

In [2]:
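# Normalizer rescales each sample (row) to unit L2 norm; values stay non-negative, as chi2 requires
scaler = Normalizer()
# Starting hyperparameters; the grid search overrides n_estimators and min_samples_leaf
clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5)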

In [3]:
pipeline = Pipeline(steps=[('scaler', scaler),
                           ('reduce_dim', SelectKBest(chi2)),
                           ('clf', clf)])

In [4]:
n_feature_options = [2, 4, 8, 10, 12]
n_estimators_options = [10, 50, 100, 200]
min_samples_leaf_options = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
param_grid = [
    {
        'reduce_dim__k': n_feature_options,
        'clf__criterion': ['gini', 'entropy'],
        'clf__n_estimators': n_estimators_options,
        'clf__min_samples_leaf': min_samples_leaf_options
    }
]

# `iid` was deprecated in scikit-learn 0.22 and removed in 0.24, so it is not passed here
grid = GridSearchCV(pipeline, cv=5, param_grid=param_grid)
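
To get a feel for how much work this search involves, the number of model fits can be estimated before calling `fit`. The sketch below copies the option lists from the cell above and assumes the default `refit=True`, which adds one final fit on the full training set. The names `cv_folds`, `n_candidates`, and `n_fits` are introduced here for illustration only.

```python
from math import prod

# Same option lists as in the grid-search cell
param_grid = {
    'reduce_dim__k': [2, 4, 8, 10, 12],
    'clf__criterion': ['gini', 'entropy'],
    'clf__n_estimators': [10, 50, 100, 200],
    'clf__min_samples_leaf': list(range(1, 11)),
}
cv_folds = 5  # matches cv=5

# Every combination of options is one candidate pipeline
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is fit once per fold, plus one final refit of the best candidate
n_fits = n_candidates * cv_folds + 1

print('candidates:', n_candidates)
print('total fits:', n_fits)
```

If the search feels slow, `GridSearchCV` accepts an `n_jobs` argument to run fits in parallel, and trimming `min_samples_leaf_options` shrinks the grid quickly because the counts multiply.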

In [5]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('scaler', Normalizer(copy=True, norm='l2')), ('reduce_dim', SelectKBest(k=10, score_func=<function chi2 at 0x00000206F9AFB400>)), ('clf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
        ...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=False, n_jobs=None,
       param_grid=[{'reduce_dim__k': [2, 4, 8, 10, 12], 'clf__criterion': ['gini', 'entropy'], 'clf__n_estimators': [10, 50, 100, 200], 'clf__min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [6]:
grid.best_params_

{'clf__criterion': 'entropy',
 'clf__min_samples_leaf': 1,
 'clf__n_estimators': 200,
 'reduce_dim__k': 12}
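
`best_params_` reports how many features `SelectKBest` kept, but not which ones. A minimal sketch of how the selected feature names could be recovered from the fitted pipeline follows. It is self-contained, so it refits a deliberately small grid (`small_grid`, a hypothetical name) on the full dataset rather than reusing `grid`; the same `named_steps` lookup should work on `grid.best_estimator_`.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

wine = load_wine()

pipe = Pipeline(steps=[('scaler', Normalizer()),
                       ('reduce_dim', SelectKBest(chi2)),
                       ('clf', RandomForestClassifier(n_estimators=10,
                                                      random_state=0))])

# Small illustrative grid so the example runs in a few seconds
small_grid = GridSearchCV(pipe, cv=3, param_grid={'reduce_dim__k': [2, 4]})
small_grid.fit(wine.data, wine.target)

# The fitted selector lives inside the refit best pipeline
selector = small_grid.best_estimator_.named_steps['reduce_dim']

# get_support(indices=True) returns the column indices that were kept
selected_names = [wine.feature_names[i]
                  for i in selector.get_support(indices=True)]
print(selected_names)
```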

In [7]:
y_pred = grid.predict(X_test)
print('accuracy_score:', accuracy_score(y_test, y_pred))
print('f1_score:', f1_score(y_test, y_pred, average='weighted'))
print(classification_report(y_test, y_pred))

accuracy_score: 0.9259259259259259
f1_score: 0.9258020562368388
              precision    recall  f1-score   support

           0       0.92      1.00      0.96        11
           1       0.88      0.95      0.91        22
           2       1.00      0.86      0.92        21

   micro avg       0.93      0.93      0.93        54
   macro avg       0.93      0.94      0.93        54
weighted avg       0.93      0.93      0.93        54



## Upload serialized model to S3

In [8]:
import os
import boto3
import pickle

session = boto3.Session(aws_access_key_id=os.getenv('AWS_ADMIN_ACCESS'),
                        aws_secret_access_key=os.getenv('AWS_ADMIN_SECRET'))

s3 = session.resource('s3')

# Serialize the best pipeline found by the grid search (refit on all of X_train)
bytes_obj = pickle.dumps(grid.best_estimator_)

bucket = 'gwilson253awsprojects'
key = 'neptune/wine_model.pkl'

s3.Object(bucket,key).put(Body=bytes_obj)

{'ResponseMetadata': {'RequestId': '35F79263250A271E',
  'HostId': 'E6muT6bP9awIwuq4FXrgIvx7bzU1RY+HguBEhGyodu8JlOhKSUwPlt/Xm0Ta8DakvMTw/J9enoM=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'E6muT6bP9awIwuq4FXrgIvx7bzU1RY+HguBEhGyodu8JlOhKSUwPlt/Xm0Ta8DakvMTw/J9enoM=',
   'x-amz-request-id': '35F79263250A271E',
   'date': 'Sun, 30 Jun 2019 21:37:24 GMT',
   'etag': '"ed9440c442632621b608521b3f2650b8"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 1},
 'ETag': '"ed9440c442632621b608521b3f2650b8"'}
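
For completeness, here is a sketch of the reverse direction: downloading the object and unpickling it. `load_pickled_object` is a hypothetical helper, not part of boto3; it relies only on the resource API's `Object(...).get()['Body'].read()`. So that the example runs without AWS credentials, it is exercised against a small in-memory stand-in (`FakeS3`, also hypothetical) with a placeholder payload and bucket name.

```python
import pickle


def load_pickled_object(s3, bucket, key):
    """Download `key` from `bucket` via an S3 resource and unpickle it."""
    body = s3.Object(bucket, key).get()['Body'].read()
    return pickle.loads(body)


# In-memory stand-in that mimics only the calls used above (assumption:
# put(Body=...) stores bytes, get() returns a dict with a readable 'Body')
class _FakeBody:
    def __init__(self, data):
        self._data = data

    def read(self):
        return self._data


class _FakeObject:
    def __init__(self, store, bucket, key):
        self._store = store
        self._path = (bucket, key)

    def put(self, Body):
        self._store[self._path] = Body

    def get(self):
        return {'Body': _FakeBody(self._store[self._path])}


class FakeS3:
    def __init__(self):
        self._store = {}

    def Object(self, bucket, key):
        return _FakeObject(self._store, bucket, key)


fake_s3 = FakeS3()
payload = {'model': 'placeholder'}  # stands in for the fitted pipeline
fake_s3.Object('example-bucket', 'neptune/wine_model.pkl').put(
    Body=pickle.dumps(payload))

restored = load_pickled_object(fake_s3, 'example-bucket',
                               'neptune/wine_model.pkl')
print(restored == payload)
```

Against the real service, the `s3` resource created above would be passed in place of `fake_s3`. Unpickling executes code embedded in the payload, so this should only be done with objects from a trusted bucket, and the loading environment needs a compatible scikit-learn version installed.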