# Model Training
The purpose of this script is to build a working machine learning model that can be serialized for reuse. It uses sklearn's `Pipeline` and `GridSearchCV` utilities to chain preprocessing with a classifier and tune the hyperparameters.

This model uses sklearn's [wine](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine) dataset.
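
As a quick orientation, the snippet below is a minimal sketch of what the dataset looks like. It only inspects the bundled data; the non-negativity check is included on the assumption that a `chi2`-based selector (which requires non-negative inputs) will be applied to the raw features.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# 178 samples, 13 numeric features, 3 wine classes
print(wine.data.shape)
print(list(wine.target_names))
print(len(wine.feature_names))

# chi2 feature scoring requires non-negative feature values
all_non_negative = bool((wine.data >= 0).all())
print(all_non_negative)
```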

## Helpful references
* [sklearn: Selecting dimensionality reduction with Pipeline and GridSearchCV](https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html)
* [sklearn: Putting it all together](https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html)
* [Importance of feature scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py)

## Documentation
* [sklearn: SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
* [sklearn: RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [sklearn: Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
* [sklearn: GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
* [sklearn: Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer)

In [4]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import Normalizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, accuracy_score, f1_score

features, target = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                    test_size=0.30)

In [5]:
scaler = Normalizer()
clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5)

In [6]:
pipeline = Pipeline(steps=[('reduce_dim', SelectKBest(chi2)),
                           ('scaler', scaler),
                           ('clf', clf)])

In [7]:
n_feature_options = [2, 4, 8, 10, 12]
n_estimators_options = [10, 50, 100, 200]
min_samples_leaf_options = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
param_grid = [
    {
        'reduce_dim__k': n_feature_options,
        'clf__criterion': ['gini', 'entropy'],
        'clf__n_estimators': n_estimators_options,
        'clf__min_samples_leaf': min_samples_leaf_options
    }
]

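# `iid` was deprecated in scikit-learn 0.22 and removed in 0.24, so it is omitted;
# current versions behave as `iid=False` did.
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5)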
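
To get a feel for how much work this search involves, the sketch below counts the candidates by hand with `itertools.product`. The name `criterion_options` is introduced here only for illustration; the option lists are copied from the cell above.

```python
from itertools import product

n_feature_options = [2, 4, 8, 10, 12]
n_estimators_options = [10, 50, 100, 200]
min_samples_leaf_options = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
criterion_options = ['gini', 'entropy']  # illustrative name for the inline list
cv_folds = 5

# Every combination of the four option lists is one candidate pipeline
candidates = list(product(n_feature_options,
                          criterion_options,
                          n_estimators_options,
                          min_samples_leaf_options))

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # each candidate is fitted once per fold

print(n_candidates, n_fits)
```

With the default `refit=True`, GridSearchCV performs one additional fit of the best candidate on the full training set.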

In [8]:
grid.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('reduce_dim',
                                        SelectKBest(k=10,
                                                    score_func=<function chi2 at 0x000002527F5A01E0>)),
                                       ('scaler',
                                        Normalizer(copy=True, norm='l2')),
                                       ('clf',
                                        RandomForestClassifier(bootstrap=True,
                                                               class_weight=None,
                                                               criterion='gini',
                                                               max_depth=None,
                                                               max_features='auto',
                                                               max_leaf_nodes=None,
                          

In [9]:
grid.best_params_

{'clf__criterion': 'gini',
 'clf__min_samples_leaf': 1,
 'clf__n_estimators': 50,
 'reduce_dim__k': 10}
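
Besides `best_params_`, a fitted search also exposes `best_score_` and `cv_results_`. The following is a minimal, self-contained sketch of reading them. It uses a deliberately small, hypothetical grid (`small_pipeline`, `small_grid`) and the full dataset so that it runs quickly, so its scores will not match the search above.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_wine(return_X_y=True)

# Hypothetical reduced pipeline and grid, for illustration only
small_pipeline = Pipeline(steps=[
    ('reduce_dim', SelectKBest(chi2)),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])
small_grid = GridSearchCV(small_pipeline,
                          param_grid={'reduce_dim__k': [4, 10]},
                          cv=3)
small_grid.fit(X, y)

# cv_results_ is a dict of parallel arrays, one entry per candidate
results = small_grid.cv_results_
rows = zip(results['rank_test_score'],
           results['params'],
           results['mean_test_score'])
for rank, params, mean_score in sorted(rows, key=lambda row: row[0]):
    print(rank, params, round(float(mean_score), 3))

print('best mean CV accuracy:', small_grid.best_score_)
```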

In [10]:
y_pred = grid.predict(X_test)
print('accuracy_score:', accuracy_score(y_test, y_pred))
print('f1_score:', f1_score(y_test, y_pred, average='weighted'))
print(classification_report(y_test, y_pred))

accuracy_score: 0.9259259259259259
f1_score: 0.9253747795414462
              precision    recall  f1-score   support

           0       0.93      0.82      0.87        17
           1       0.86      0.95      0.90        20
           2       1.00      1.00      1.00        17

    accuracy                           0.93        54
   macro avg       0.93      0.92      0.93        54
weighted avg       0.93      0.93      0.93        54
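
The accuracy and weighted F1 above can be reproduced by hand. The sketch below assumes a confusion matrix that was inferred from the rounded precision, recall and support values in the report; it was not printed by the notebook itself.

```python
# Confusion matrix inferred from the report (an assumption, not notebook output).
# Rows are true classes, columns are predicted classes.
confusion = [
    [14, 3, 0],
    [1, 19, 0],
    [0, 0, 17],
]

n_classes = len(confusion)
support = [sum(row) for row in confusion]  # true samples per class
predicted = [sum(confusion[r][c] for r in range(n_classes))
             for c in range(n_classes)]    # predictions per class
total = sum(support)

# Accuracy: correctly classified samples over all samples
accuracy = sum(confusion[i][i] for i in range(n_classes)) / total

f1_per_class = []
for i in range(n_classes):
    tp = confusion[i][i]
    precision = tp / predicted[i]
    recall = tp / support[i]
    f1_per_class.append(2 * precision * recall / (precision + recall))

# average='weighted' weights each class's F1 by its support
weighted_f1 = sum(f * s for f, s in zip(f1_per_class, support)) / total

print('accuracy:', accuracy)
print('weighted f1:', weighted_f1)
```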



# GridSearchCV Introspection
We'll inspect which features the `SelectKBest` step of the best pipeline kept, since those are the inputs the classifier actually relies on for future predictions.

In [11]:
# Use a separate name so the feature matrix stored in `features` is not overwritten
feature_names = load_wine().feature_names
selected = grid.best_estimator_.named_steps['reduce_dim'].get_support()

print('< included >')
for f, s in zip(feature_names, selected):
    if s:
        print(f)

print('\n< excluded >')
for f, s in zip(feature_names, selected):
    if not s:
        print(f)

< included >
alcohol
malic_acid
alcalinity_of_ash
magnesium
total_phenols
flavanoids
proanthocyanins
color_intensity
od280/od315_of_diluted_wines
proline

< excluded >
ash
nonflavanoid_phenols
hue
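
For reference, here is a minimal standalone sketch of how `get_support()` relates to the selector's output. It fits `SelectKBest` on the full dataset rather than on the training split, so the selected features may differ from those listed above.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, chi2

wine = load_wine()

# Standalone selector, fitted on all rows purely for illustration
selector = SelectKBest(chi2, k=10).fit(wine.data, wine.target)

mask = selector.get_support()                 # boolean mask, one entry per feature
indices = selector.get_support(indices=True)  # integer column positions instead
kept = [name for name, keep in zip(wine.feature_names, mask) if keep]

# transform() drops the unselected columns
reduced = selector.transform(wine.data)

print(kept)
print(reduced.shape)
```

Because the selector is a step inside the pipeline, the fitted pipeline applies this column filtering itself at prediction time, so callers still pass all 13 original features.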


## Upload serialized model to S3

In [12]:
import os
import boto3
import pickle

session = boto3.Session(aws_access_key_id=os.getenv('AWS_ADMIN_ACCESS'),
                        aws_secret_access_key=os.getenv('AWS_ADMIN_SECRET'))

s3 = session.resource('s3')

# Serialize the whole fitted pipeline (feature selection, scaler and classifier)
bytes_obj = pickle.dumps(grid.best_estimator_, protocol=pickle.HIGHEST_PROTOCOL)

bucket = 'gwilson253awsprojects'
key = 'neptune/wine_model.pkl'

s3.Object(bucket, key).put(Body=bytes_obj)

{'ResponseMetadata': {'RequestId': 'B480EF1CE15D1B83',
  'HostId': 'VtqThZIGC78yWIjjanADjv81bygeKfjSkWns5PxnOYv+8efSubhIbKl6KvtMqJgBha52GhqLff0=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'VtqThZIGC78yWIjjanADjv81bygeKfjSkWns5PxnOYv+8efSubhIbKl6KvtMqJgBha52GhqLff0=',
   'x-amz-request-id': 'B480EF1CE15D1B83',
   'date': 'Wed, 17 Jul 2019 21:01:55 GMT',
   'etag': '"f772bc40fc2f01699373bc5131e7fc94"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 1},
 'ETag': '"f772bc40fc2f01699373bc5131e7fc94"'}
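
To reuse the model later, the uploaded object has to be downloaded and unpickled. The helper below is a minimal sketch of that step; `load_pickled_model` is a hypothetical name, and the function assumes it is given a boto3 S3 resource like the `s3` object created above.

```python
import pickle


def load_pickled_model(s3_resource, bucket, key):
    """Download an object from S3 and unpickle it (hypothetical helper)."""
    # Object.get() returns a dict whose 'Body' is a readable byte stream
    response = s3_resource.Object(bucket, key).get()
    return pickle.loads(response['Body'].read())


# Example usage (requires valid AWS credentials, so it is left commented out):
# import boto3
# model = load_pickled_model(boto3.resource('s3'),
#                            'gwilson253awsprojects',
#                            'neptune/wine_model.pkl')
# model.predict(X_test)
```

Unpickling executes code embedded in the payload, so this should only be done with objects from a bucket you control. The loading environment also needs a scikit-learn version compatible with the one used for training.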