## Assignment

Create a multi-layer perceptron neural network model to predict on a labeled dataset of your choosing. Compare this model to either a boosted tree or a random forest model and describe the relative tradeoffs between complexity and accuracy. Be sure to vary the hyperparameters of your MLP!

### Results
I achieved near parity between a partially tuned MLP (65% accuracy) and a completely untuned random forest classifier (RFC) model (64%). Previously, I had tuned a gradient boosting classifier (GBC) to ~71% accuracy, but that involved running multiple GridSearchCV optimizations and significant tinkering, neither of which I had time for on this project, given the run-time of an MLP.

Other caveats for this exercise are:
- The previous 71% was achieved with a sample size five times as large; I cut the sample size down to make the run-times for the MLP algorithm more manageable, but this understandably harms the algorithms' ability to learn accurately.
- The MLP has not been optimized to the extent that its performance is as high as it can go, only to the point where I am no longer able to make consistent improvement in its performance via tinkering with parameters and layer settings.
- This is an incredibly complex dataset, with multiple classes sharing many properties and broad swathes of outliers in every class that seem to defy classification; testing any algorithm on it is going to be challenging.


### Analysis
While the MLP scores slightly higher than the RFC, it's also notably more over-fitted, so there's a definite trade-off there.  Complexity-wise, MLP is _significantly_ slower to run, and requires more work to optimize than RFC, or even GBC, for (to my eyes) less resulting performance.  Perhaps if I'd had a week to tweak the parameters, run GridSearchCV, and/or use the full dataset, MLP might be able to do as well as the decision-tree models, but for this particular case it's distinctly sub-par from a practical standpoint.
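
One way to put numbers on the trade-off described above is to record the fit time together with the gap between training and test accuracy. The sketch below does this on a synthetic dataset from `make_classification`, not on the beer data; the model sizes and the helper name `fit_and_score` are illustrative assumptions rather than the settings used in this notebook.

```python
from timeit import default_timer as timer

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the beer data (assumption: 4 classes, 10 features).
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)


def fit_and_score(model):
    """Fit the model and return (fit seconds, train accuracy, test accuracy)."""
    start = timer()
    model.fit(X_train, y_train)
    elapsed = timer() - start
    return elapsed, model.score(X_train, y_train), model.score(X_test, y_test)


models = {
    'RFC': RandomForestClassifier(n_estimators=50, random_state=0),
    'MLP': MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=300, random_state=0),
}
results = {name: fit_and_score(model) for name, model in models.items()}

for name, (seconds, train_acc, test_acc) in results.items():
    # A large gap between train and test accuracy is a sign of over-fitting.
    print('{}: fit {:.2f}s, train {:.3f}, test {:.3f}, gap {:.3f}'.format(
        name, seconds, train_acc, test_acc, train_acc - test_acc))
```

Reporting the gap alongside the test score makes it easier to see whether a small accuracy advantage comes at the cost of a model that has memorized its training set.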



## RFC

RFC results:
 
Model score:
0.6442185514612452
 
Classification Report:
              precision    recall  f1-score   support

           0       0.51      0.55      0.53       529
           1       0.31      0.22      0.26        63
           2       0.68      0.72      0.70        29
           3       0.46      0.33      0.38       122
           4       0.76      0.55      0.64        40
           5       0.74      0.79      0.76       855
           6       0.55      0.36      0.44        44
           7       0.73      0.76      0.75       207
           8       0.71      0.45      0.56        11
           9       0.49      0.49      0.49       432
          10       0.53      0.49      0.51        63
          11       0.60      0.47      0.53       131
          12       0.83      0.86      0.85       424
          13       0.62      0.60      0.61       198

   micro avg       0.64      0.64      0.64      3148
   macro avg       0.61      0.55      0.57      3148
weighted

## MLP

In [143]:
mlp = MLPClassifier(hidden_layer_sizes=(200,150,100,50,25),
                    activation= 'logistic', 
                    solver='adam', 
                    alpha=0.0001, 
                    batch_size='auto', 
                    learning_rate='constant', 
                    learning_rate_init=0.001, 
                    power_t=0.5, 
                    max_iter=500, 
                    shuffle=True, 
                    random_state=105, 
                    tol=0.0001, 
                    verbose=False, 
                    warm_start=False, 
                    momentum=0.9, 
                    nesterovs_momentum=True, 
                    early_stopping=False, 
                    validation_fraction=0.1, 
                    beta_1=0.9, 
                    beta_2=0.999, 
                    epsilon=1e-08, 
                    n_iter_no_change=10)

Model score:
0.6550190597204575
 
Classification Report:
              precision    recall  f1-score   support

           0       0.55      0.52      0.53       529
           1       0.28      0.24      0.26        63
           2       0.70      0.55      0.62        29
           3       0.44      0.25      0.32       122
           4       0.70      0.57      0.63        40
           5       0.76      0.81      0.78       855
           6       0.38      0.52      0.44        44
           7       0.74      0.71      0.73       207
           8       0.70      0.64      0.67        11
           9       0.52      0.55      0.53       432
          10       0.59      0.57      0.58        63
          11       0.57      0.50      0.53       131
          12       0.84      0.87      0.86       424
          13       0.57      0.64      0.60       198

   micro avg       0.66      0.66      0.66      3148
   macro avg       0.60      0.57      0.58      3148
weighted avg       0.65



[0.61417323 0.60663507 0.59206349 0.63476874 0.63402889]


## Importing code

In [25]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
import math
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.neural_network import MLPClassifier

from IPython.display import display

# Display preferences.
pd.options.display.float_format = '{:.3f}'.format

# Suppress annoying harmless error.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd"
)

from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.decomposition import PCA
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.preprocessing import LabelEncoder, Imputer
from sklearn.model_selection import train_test_split

from timeit import default_timer as timer

import os

import pydotplus
from sklearn import tree
from sklearn import preprocessing
sns.set_style('white')

## Uploading data

In [3]:
# Upload dataset
basedata_beer_recipes = pd.read_csv('recipeData.csv', index_col='BeerID', encoding='latin1')


## Defining Processing Functions

I am defining functions so that different datasets (or different samples of the same dataset) can be processed repeatedly without taking up a large number of lines, and without the risk of the data-cleaning steps diverging because someone changed one cell but not another. With a standardized set of functions, every dataset goes through the *same* processing, with less room for human error.

In [4]:
# Drop excess columns
def drop_unnecessary_columns(beer_recipes):
    # Drop columns that are useless for our purposes
    beer_recipes.drop(['Name', 'URL', 'StyleID', 'UserId'], axis=1, inplace=True)

    # Drop columns that have a high % of NaN
    beer_recipes.drop(['PrimingMethod', 'PrimingAmount'], axis=1, inplace=True)

    return(beer_recipes)

In [5]:
# Add a column that takes the style of beer and converts it into one of a shorter list of broad styles
# Note that the order of these has been changed since v1 of the notebook
def add_broad_styles(beer_recipes):
    broad_styles = ['English Brown Ale', 'IPA', 'Pale Ale', 'Lager', 'Stout', 
                    'Bitter', 'Cider', 'Porter', 'Mead', 'Pilsner', 
                    'Weizen', 'Saison', 'Kölsch', 'American Wheat', 
                    'Fruit Beer','Barleywine', 'Ale']
    beer_recipes['BroadStyle'] = 'Other'
    beer_recipes['Style'].fillna('Unknown', inplace=True)

    for broad_style in broad_styles:
        beer_recipes.loc[beer_recipes['Style'].str.contains(broad_style), 'BroadStyle'] = broad_style

    # Manual overrides, applied after the generic substring matches
    beer_recipes.loc[beer_recipes['Style'].str.contains('Pils'), 'BroadStyle'] = 'Pilsner'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Melomel'), 'BroadStyle'] = 'Mead'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Metheglin'), 'BroadStyle'] = 'Mead'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Witbier'), 'BroadStyle'] = 'Weizen'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Weiss'), 'BroadStyle'] = 'Weizen'
    beer_recipes.loc[beer_recipes['Style'].str.contains('weiz'), 'BroadStyle'] = 'Weizen'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Gose'), 'BroadStyle'] = 'Weizen'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Schwarzbier'), 'BroadStyle'] = 'Lager'
    beer_recipes.loc[beer_recipes['Style'].str.contains('bock'), 'BroadStyle'] = 'Bock'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Winter'), 'BroadStyle'] = 'Winter Beer'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Tripel'), 'BroadStyle'] = 'Pale Ale'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Bock'), 'BroadStyle'] = 'Bock'
    
    drop_Other_style = beer_recipes.loc[beer_recipes['BroadStyle'] == 'Other']
    beer_recipes.drop(drop_Other_style.index, inplace=True)
    
    # drop "Style" so that the models don't get a chance to cheat
    beer_recipes.drop(['Style'], axis=1, inplace=True)

    return(beer_recipes)


In [6]:
def get_sg_from_plato(plato):
    sg = 1 + (plato / (258.6 - ( (plato/258.2) *227.1) ) )
    return sg

def apply_get_sg(beer_recipes):
    beer_recipes['OG_sg'] = beer_recipes.apply(lambda row: 
                                             get_sg_from_plato(row['OG']) 
                                             if row['SugarScale'] == 'Plato' 
                                             else row['OG'], axis=1)
    
    beer_recipes['FG_sg'] = beer_recipes.apply(lambda row: 
                                             get_sg_from_plato(row['FG']) 
                                             if row['SugarScale'] == 'Plato' 
                                             else row['FG'], axis=1)
    
    beer_recipes['BoilGravity_sg'] = beer_recipes.apply(lambda row: 
                                                      get_sg_from_plato(row['BoilGravity']) 
                                                      if row['SugarScale'] == 'Plato' 
                                                      else row['BoilGravity'], axis=1)

    # Drop the original columns, now that the issue's been dealt with
    beer_recipes.drop(['SugarScale','OG','FG','BoilGravity'], axis=1, inplace=True)
    
    return(beer_recipes)
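
As a quick sanity check on the conversion above, the sketch below repeats the same formula and evaluates it at two points: 0 °P (plain water) and a 12 °P wort, which brewing references usually quote as roughly 1.048. The variable names `sg_water` and `sg_12` are only for illustration.

```python
def get_sg_from_plato(plato):
    """Convert degrees Plato to specific gravity (same formula as above)."""
    return 1 + (plato / (258.6 - ((plato / 258.2) * 227.1)))


# 0 degrees Plato is plain water, so the specific gravity should be exactly 1.0.
sg_water = get_sg_from_plato(0.0)

# A 12 degree Plato wort should land near 1.048.
sg_12 = get_sg_from_plato(12.0)

print(sg_water)
print(round(sg_12, 4))
```

Checks like this are cheap insurance before the converted columns replace the originals, since a mistyped constant would otherwise only show up as odd-looking outliers later on.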


In [7]:
# Using a query similar to the one below, outliers were identified and functions were designed to remove them.
#beer_recipes.loc[beer_recipes['Size(L)'] > 2300].count()

def outlier_cleaner(beer_recipes, feature, outlier_limit):
    drop_feature_outliers = beer_recipes.loc[beer_recipes[feature] > outlier_limit]
    beer_recipes.drop(drop_feature_outliers.index, inplace=True)
    return(beer_recipes)

def clean_outliers(beer_recipes):
    outlier_cleaner(beer_recipes, 'Size(L)', 2300)
    outlier_cleaner(beer_recipes, 'ABV', 20)
    outlier_cleaner(beer_recipes, 'IBU', 350)
    outlier_cleaner(beer_recipes, 'Color', 50)
    outlier_cleaner(beer_recipes, 'BoilSize', 2000)
    outlier_cleaner(beer_recipes, 'BoilTime', 150)
    outlier_cleaner(beer_recipes, 'OG_sg', 1.2)
    outlier_cleaner(beer_recipes, 'FG_sg', 1.06)
    outlier_cleaner(beer_recipes, 'BoilGravity_sg', 1.3)
    drop_feature_outliers = beer_recipes.loc[beer_recipes['Efficiency'] < 20]
    beer_recipes.drop(drop_feature_outliers.index, inplace=True)

    return(beer_recipes)
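
The thresholds above could also be kept in a lookup table and applied in one pass. The following is a minimal sketch under that assumption: the names `UPPER_LIMITS`, `LOWER_LIMITS`, and `filter_outliers` are hypothetical, only a few of the limits are copied over, and the toy frame is made up. Unlike the functions above, it returns a filtered copy instead of modifying its input.

```python
import pandas as pd

# A few limits copied from clean_outliers above; the dict names are hypothetical.
UPPER_LIMITS = {'Size(L)': 2300, 'ABV': 20, 'IBU': 350, 'Color': 50}
LOWER_LIMITS = {'Efficiency': 20}


def filter_outliers(df, upper_limits, lower_limits):
    """Return a copy of df keeping rows within every limit (rows with NaN are kept)."""
    keep = pd.Series(True, index=df.index)
    for feature, limit in upper_limits.items():
        keep &= ~(df[feature] > limit)
    for feature, limit in lower_limits.items():
        keep &= ~(df[feature] < limit)
    return df.loc[keep].copy()


toy = pd.DataFrame({
    'Size(L)': [20.0, 5000.0, 21.0, 19.0],
    'ABV': [5.5, 6.0, 25.0, 4.8],
    'IBU': [30.0, 40.0, 35.0, None],
    'Color': [8.0, 9.0, 10.0, 6.0],
    'Efficiency': [70.0, 75.0, 72.0, 68.0],
})
cleaned = filter_outliers(toy, UPPER_LIMITS, LOWER_LIMITS)
print(cleaned)
```

Keeping the limits in one place makes it easier to see, and to change, what counts as an outlier for each feature.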

In [8]:
# Because some columns have a reasonable amount of useful data but too many null values to consider discarding
# every row afflicted by them, we need to insert the median values for each type of beer into those null values

def fillin_values(beer_recipes):
    nonnumeric_feats = list(beer_recipes.select_dtypes(include=object).columns)

    # separate out the nonnumeric, as they'll be removed by this process and we need to add them back in afterwards
    beer_recipes_nonnumeric = beer_recipes[nonnumeric_feats].copy()
    
    # fill in NaN based on the median of the respective broad style
    grouped = beer_recipes.groupby('BroadStyle')
    transformed = grouped.transform(lambda x: x.fillna(x.median()))
    beer_recipes = pd.concat([beer_recipes_nonnumeric, transformed], axis=1)
    
    return(beer_recipes)
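
A variation on the same idea is to restrict the group-wise fill to the numeric columns explicitly, so the text columns never pass through `transform` and do not need to be re-attached. This is a sketch on a made-up toy frame, not the function used in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'BroadStyle': ['IPA', 'IPA', 'IPA', 'Stout', 'Stout'],
    'BrewMethod': ['All Grain', 'extract', 'All Grain', 'BIAB', 'All Grain'],
    'PitchRate': [0.75, np.nan, 1.25, 0.5, np.nan],
})

# Only numeric columns are imputed; text columns are left untouched.
numeric_cols = list(toy.select_dtypes(include='number').columns)

filled = toy.copy()
filled[numeric_cols] = (
    toy.groupby('BroadStyle')[numeric_cols]
       .transform(lambda col: col.fillna(col.median()))
)
print(filled)
```

Each missing `PitchRate` is replaced by the median of its own broad style, so the IPA gap is filled from the IPA rows and the Stout gap from the Stout rows.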

In [9]:
# A function to encode any non-numeric features, so that models can use them more easily

def encoding_function(beer_recipes, features_list):
    selected_data = beer_recipes.loc[:, features_list]
    
    categorical_features = list(selected_data.select_dtypes(include=object).columns)
    for feature in categorical_features:
        encoder = LabelEncoder()
        selected_data[feature] = encoder.fit_transform(selected_data[feature])
    return(selected_data)
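
Because the encoder above is discarded after each column, the integer codes cannot be mapped back to their labels later. One option, sketched here with a hypothetical helper `encode_with_lookup` and a made-up toy frame, is to return the fitted encoders alongside the encoded data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    'BroadStyle': ['IPA', 'Stout', 'Ale', 'IPA'],
    'BrewMethod': ['All Grain', 'extract', 'All Grain', 'BIAB'],
    'ABV': [6.1, 5.2, 4.8, 6.5],
})


def encode_with_lookup(df):
    """Label-encode non-numeric columns and return the fitted encoders too."""
    encoded = df.copy()
    encoders = {}
    for feature in encoded.select_dtypes(exclude='number').columns:
        encoders[feature] = LabelEncoder()
        encoded[feature] = encoders[feature].fit_transform(encoded[feature])
    return encoded, encoders


encoded, encoders = encode_with_lookup(toy)

# LabelEncoder assigns integer codes in sorted order of the class labels.
print(list(encoders['BroadStyle'].classes_))

# Map the integer codes back to style names.
decoded = encoders['BroadStyle'].inverse_transform(encoded['BroadStyle'])
print(list(decoded))
```

With the encoders kept in a dict, the class numbers in a classification report can be translated back into style names without fitting a second encoder.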


In [10]:
# create a detailed report function that can be used for any model

def accuracy_report(testing_X, testing_Y, model, cv):
    print('Model score:')
    print(model.score(testing_X, testing_Y))
    print(" ")
    print("Classification Report:")
    y_prediction = model.predict(testing_X)
    print(classification_report(testing_Y, y_prediction))

    # Sometimes we don't want to spend the processor time on cross-validation, so we need a way to toggle it.
    if cv == 1:
        print(" ")
        print('Model cross-validation:')
        print(cross_val_score(model, testing_X, testing_Y, cv=5))
    return

In [90]:
# create a report function that can give a general score for a model tested on multiple different datasets

def mulifeature_accuracy_report(X_train, X_test, y_train, y_test, list_of_feature_lists, model):
    for counter, feature_list in enumerate(list_of_feature_lists, start=1):
        # the target is not a column of X, so leave it out of the selection
        predictors = [feature for feature in feature_list if feature != 'BroadStyle']
        selected_X_train = X_train.loc[:, predictors]
        selected_X_test = X_test.loc[:, predictors]
        model.fit(selected_X_train, y_train)
        print('Model score for feature list #{}: {}'.format(counter, model.score(selected_X_test, y_test)))
        os.system('say "all done."')  # this is going to take a while, let me know when it's done
    return

## Cleaning & Processing Data
The dataframe `beer_recipes` is the one all of those functions will be applied to. It is copied from `basedata_beer_recipes` so that we can reset `beer_recipes` later in the notebook without having to re-upload the dataset.

In [12]:
beer_recipes = basedata_beer_recipes.copy()

beer_recipes = drop_unnecessary_columns(beer_recipes)
beer_recipes = add_broad_styles(beer_recipes)
beer_recipes = apply_get_sg(beer_recipes)
beer_recipes = clean_outliers(beer_recipes)
beer_recipes = fillin_values(beer_recipes)
beer_recipes = beer_recipes.dropna()
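
A sequence of processing steps like this can also be written as a single chain with `DataFrame.pipe`. The sketch below uses two hypothetical stand-in steps (`drop_columns`, `drop_above`) on a made-up frame; it assumes each step returns a new frame instead of modifying its input, which is not how the functions above are written.

```python
import pandas as pd


def drop_columns(df, columns):
    """Return a copy without the given columns (hypothetical stand-in step)."""
    return df.drop(columns=columns)


def drop_above(df, feature, limit):
    """Return only the rows where feature does not exceed limit."""
    return df.loc[~(df[feature] > limit)]


raw = pd.DataFrame({
    'Name': ['a', 'b', 'c'],
    'ABV': [5.0, 30.0, 6.5],
    'IBU': [20.0, 40.0, None],
})

# Each step receives the previous step's result; raw itself is never modified.
processed = (
    raw
    .pipe(drop_columns, ['Name'])
    .pipe(drop_above, 'ABV', 20)
    .dropna()
)
print(processed)
```

Since `raw` is left untouched, the chain can be re-run from the original data without taking a fresh copy first.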


In [212]:
#beer_recipes.isnull().sum()
#beer_recipes.describe().T
#beer_recipes.head(5)

## Set up lists of features for experimentation
These lists are needed by several of the functions above, and they let us test a model on several different feature sets within a single cell without taking up too many lines.

In [13]:
all_features = ['BroadStyle', #target
                'OG_sg','FG_sg','ABV','IBU','Color', #standardized fields
                'BrewMethod', #categorical feature
                'Size(L)', 'BoilSize', 'BoilTime', 'BoilGravity_sg', 'Efficiency', 
                'MashThickness', 'PitchRate', 'PrimaryTemp' # other numerical features
                ]

# top X features taken from SelectKBest, immediately below
top_5_features = ['BroadStyle', #target
                  'OG_sg','IBU','Color','PitchRate','PrimaryTemp' # top 5 features according to SKB
                 ]

top_7_features = ['BroadStyle', #target
                  'OG_sg','FG_sg','IBU','ABV','Color','PitchRate','PrimaryTemp' # top 7 features according to SKB
                 ]

top_10_features = ['BroadStyle', #target
                  'OG_sg','FG_sg','BoilGravity_sg','IBU','ABV','Color','PitchRate',
                   'PrimaryTemp','Efficiency','MashThickness' # top 10 features according to SKB
                 ]

list_of_feature_lists = [all_features, top_10_features, top_7_features, top_5_features]


## Performing SelectKBest to identify the best features
This process was performed twice, once to identify the top 10 (listed in order, based on incrementing 'k' and noting which feature was added for each k += 1) out of the original features and once to identify the top 10 including the features added by the two PCA functions.

In [458]:
# Picking 5/10 SelectKBest features

skb_data = encoding_function(beer_recipes, alldata_allpca)


X_best = SelectKBest(f_classif, k=10).fit_transform(
    skb_data.drop(['BroadStyle'], axis=1), 
    skb_data['BroadStyle'])
X_best_df = pd.DataFrame({
    'best_1':X_best[:,0],
    'best_2':X_best[:,1],
    'best_3':X_best[:,2],
    'best_4':X_best[:,3],
    'best_5':X_best[:,4],
    'best_6':X_best[:,5],
    'best_7':X_best[:,6],
    'best_8':X_best[:,7],
    'best_9':X_best[:,8],
    'best_10':X_best[:,9]
})

X_best_df.head(5)

Unnamed: 0,best_1,best_2,best_3,best_4,best_5,best_6,best_7,best_8,best_9,best_10
0,1.055,17.65,4.83,0.75,17.78,-28.441,-7.441,-1.138,-28.368,-7.648
1,1.083,60.65,15.64,0.5,20.0,14.909,2.197,0.734,14.975,1.937
2,1.063,59.25,8.98,0.75,20.0,13.272,-4.482,0.764,13.338,-4.754
3,1.061,54.48,8.5,0.75,20.0,8.489,-4.825,0.799,8.555,-5.092
4,1.06,17.84,4.57,0.75,19.0,-28.272,-7.522,0.127,-28.169,-7.878


In [447]:
beer_recipes.head(5)

BeerID,BrewMethod,BroadStyle,Size(L),ABV,IBU,Color,BoilSize,BoilTime,Efficiency,MashThickness,PitchRate,PrimaryTemp,OG_sg,FG_sg,BoilGravity_sg,top_10_pca_1,top_10_pca_2,top_10_pca_3,top_10_pca_4,top_10_pca_5,properties_pca_1,properties_pca_2,properties_pca_3
1,All Grain,Ale,21.77,5.48,17.65,4.83,28.39,75,70.0,1.5,0.75,17.78,1.055,1.013,1.038,-28.441,-3.503,-7.441,-1.138,-0.253,-28.368,-7.648,0.114
2,All Grain,Winter Beer,20.82,8.16,60.65,15.64,24.61,60,70.0,1.5,0.5,20.0,1.083,1.021,1.07,14.909,-3.704,2.197,0.734,-1.67,14.975,1.937,1.688
3,extract,IPA,18.93,5.91,59.25,8.98,22.71,60,70.0,1.5,0.75,20.0,1.063,1.018,1.052,13.272,-4.097,-4.482,0.764,0.212,13.338,-4.754,-0.295
4,All Grain,IPA,22.71,5.8,54.48,8.5,26.5,60,70.0,1.5,0.75,20.0,1.061,1.017,1.052,8.489,-4.026,-4.825,0.799,0.222,8.555,-5.092,-0.309
5,All Grain,Ale,50.0,6.48,17.84,4.57,60.0,90,72.0,1.5,0.75,19.0,1.06,1.01,1.05,-28.272,-5.513,-7.522,0.127,-1.17,-28.169,-7.878,1.119


Results for top features out of the original set were:
1. Color
2. PitchRate
3. IBU
4. PrimaryTemp
5. OG_sg
6. ABV
7. FG_sg
8. BoilGravity_sg
9. Efficiency
10. MashThickness

Results for top features out of everything including the two PCAs were:
1. properties_pca_2
2. Color
3. top_10_pca_3
4. PitchRate
5. properties_pca_1
6. IBU
7. top_10_pca_1
8. PrimaryTemp
9. top_10_pca_4
10. OG_sg
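
Instead of incrementing `k` and comparing outputs by eye, a fitted `SelectKBest` exposes the selected columns through `get_support()` and the per-feature scores through `scores_`. The sketch below shows this on synthetic data; the column names `f0` to `f5` are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data; the column names are made up for illustration.
X_values, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                                  n_redundant=0, random_state=0)
X = pd.DataFrame(X_values, columns=['f{}'.format(i) for i in range(6)])

selector = SelectKBest(f_classif, k=3).fit(X, y)

# Names of the k selected columns, in their original column order.
selected = list(X.columns[selector.get_support()])

# Full ranking by ANOVA F-score, highest first.
ranking = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)

print(selected)
print(ranking)
```

The ranking gives the full ordering in a single fit, so the top 5, 7, and 10 lists can all be read from one table.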

## Creating PCA features
My approach here is to build one PCA that compresses the top 10 features into 5 components, and then a more focused PCA that creates 3 components from six of the more highly correlated properties: color, IBU, ABV, and the specific gravity trio. Neither improves the models hugely on its own, but the properties-based components do seem to increase performance by a couple of percentage points.

In [106]:
def PCA_engineering_1(beer_recipes):

    # calculate PCA based on top 10 features
    features_to_pca = beer_recipes[top_10_features].drop(['BroadStyle'], axis=1)
    pca = PCA(n_components=5)
    top_10_pca = pca.fit_transform(features_to_pca)

    # join the PCAs up with the input dataframe
    beer_recipes['top_10_pca_1'] = top_10_pca[:,0]
    beer_recipes['top_10_pca_2'] = top_10_pca[:,1]
    beer_recipes['top_10_pca_3'] = top_10_pca[:,2]
    beer_recipes['top_10_pca_4'] = top_10_pca[:,3]
    beer_recipes['top_10_pca_5'] = top_10_pca[:,4] 
    
    return(beer_recipes)

In [107]:
def PCA_engineering_2(beer_recipes):
    features = ['OG_sg','FG_sg','BoilGravity_sg','IBU','ABV','Color']
    # calculate PCA based on the more-correlated variables out of the top 10
    features_to_pca = beer_recipes[features]
    pca = PCA(n_components=3)
    properties_pca = pca.fit_transform(features_to_pca)

    # join the PCAs up with the input dataframe
    beer_recipes['properties_pca_1'] = properties_pca[:,0]
    beer_recipes['properties_pca_2'] = properties_pca[:,1]
    beer_recipes['properties_pca_3'] = properties_pca[:,2]
    
    return(beer_recipes)
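
PCA picks directions of largest variance, so when the inputs are on very different scales the first component can end up tracking whichever column has the widest numeric range. The sketch below illustrates this on two made-up columns and shows one common remedy, standardizing before the PCA. Whether scaling would help on the beer data is an assumption that has not been tested in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two made-up columns on very different scales (think IBU vs. a gravity reading).
wide = rng.normal(loc=40.0, scale=25.0, size=500)
narrow = rng.normal(loc=1.05, scale=0.01, size=500)
features = np.column_stack([wide, narrow])

# Without scaling, the first component is almost entirely the wide column.
raw_pca = PCA(n_components=2).fit(features)
raw_ratio = raw_pca.explained_variance_ratio_

# With scaling, each column contributes on an equal footing.
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(features)
scaled_ratio = scaled.named_steps['pca'].explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```

Inspecting `explained_variance_ratio_` is also a quick way to judge how many components are worth keeping.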

In [108]:
# create feature lists for the PCA features

top_10_pca = ['BroadStyle', #target
              'top_10_pca_1','top_10_pca_2','top_10_pca_3','top_10_pca_4','top_10_pca_5'
             ]

alldata_top10pca = ['BroadStyle', #target
                    'OG_sg','FG_sg','ABV','IBU','Color', #standardized fields
                    'BrewMethod', #categorical feature
                    'Size(L)', 'BoilSize', 'BoilTime', 'BoilGravity_sg', 'Efficiency', 
                    'MashThickness', 'PitchRate', 'PrimaryTemp', # other numerical features
                    'top_10_pca_1','top_10_pca_2','top_10_pca_3','top_10_pca_4','top_10_pca_5',
                   ] 

alldata_properties_pca = ['BroadStyle', #target
                    'OG_sg','FG_sg','ABV','IBU','Color', #standardized fields
                    'BrewMethod', #categorical feature
                    'Size(L)', 'BoilSize', 'BoilTime', 'BoilGravity_sg', 'Efficiency', 
                    'MashThickness', 'PitchRate', 'PrimaryTemp', # other numerical features
                    'properties_pca_1','properties_pca_2','properties_pca_3'
                   ] 

skb10_with_pca = ['BroadStyle', #target
                  'properties_pca_2', 'Color', 'top_10_pca_3', 'PitchRate', 'properties_pca_1',
                  'IBU', 'top_10_pca_1', 'PrimaryTemp', 'top_10_pca_4', 'OG_sg'
                 ]

alldata_allpca = ['BroadStyle', #target
                    'OG_sg','FG_sg','ABV','IBU','Color', #standardized fields
                    'BrewMethod', #categorical feature
                    'Size(L)', 'BoilSize', 'BoilTime', 'BoilGravity_sg', 'Efficiency', 
                    'MashThickness', 'PitchRate', 'PrimaryTemp', # other numerical features
                    'top_10_pca_1','top_10_pca_2','top_10_pca_3','top_10_pca_4','top_10_pca_5',
                    'properties_pca_1','properties_pca_2','properties_pca_3'
                   ] 

# update list_of_feature_lists with the PCA-based feature lists
list_of_feature_lists = [all_features, top_10_features, top_7_features, top_5_features, 
                         alldata_allpca, top_10_pca, alldata_top10pca, alldata_properties_pca, skb10_with_pca]
feature_list_str = ['all_features', 'top_10_features', 'top_7_features', 'top_5_features', 
                    'alldata_allpca', 'top_10_pca', 'alldata_top10pca', 'alldata_properties_pca', 'skb10_with_pca']


In [122]:
# Establish the new version of beer_recipes
beer_recipes = basedata_beer_recipes.copy()

# do all the same preprocessing to it (using the new version of add_broad_styles_v2)
beer_recipes = drop_unnecessary_columns(beer_recipes)
beer_recipes = add_broad_styles_v2(beer_recipes) # <- this is the new version
beer_recipes = apply_get_sg(beer_recipes)
beer_recipes = clean_outliers(beer_recipes)
beer_recipes = fillin_values(beer_recipes)
beer_recipes = beer_recipes.dropna()

# run PCA function
beer_recipes = PCA_engineering_1(beer_recipes)
beer_recipes = PCA_engineering_2(beer_recipes)

# Select the feature list to be used and encode categorical data with the encoding function
data_to_model = encoding_function(beer_recipes, alldata_allpca)

# reset the encoder so that it can be used to reverse-engineer the broad_styles afterwards
encoder = LabelEncoder()
encoded_broadtypes = encoder.fit_transform(beer_recipes['BroadStyle'])

os.system('say "all done."'); print('\a')  # this is going to take a while, let me know when it's done


### Sampling data for faster run-times

In [123]:
data_to_model.head()

BeerID,BroadStyle,OG_sg,FG_sg,ABV,IBU,Color,BrewMethod,Size(L),BoilSize,BoilTime,BoilGravity_sg,Efficiency,MashThickness,PitchRate,PrimaryTemp,top_10_pca_1,top_10_pca_2,top_10_pca_3,top_10_pca_4,top_10_pca_5,properties_pca_1,properties_pca_2,properties_pca_3
1,0,1.055,1.013,5.48,17.65,4.83,0,21.77,28.39,75,1.038,70.0,1.5,0.75,17.78,-28.783,-3.511,-7.437,-1.151,0.233,-28.71,-7.673,0.126
3,5,1.063,1.018,5.91,59.25,8.98,3,18.93,22.71,60,1.052,70.0,1.5,0.75,20.0,12.925,-4.102,-4.429,0.757,-0.238,12.994,-4.735,-0.29
4,5,1.061,1.017,5.8,54.48,8.5,0,22.71,26.5,60,1.052,70.0,1.5,0.75,20.0,8.143,-4.031,-4.778,0.791,-0.248,8.211,-5.079,-0.303
5,0,1.06,1.01,6.48,17.84,4.57,0,50.0,60.0,90,1.05,72.0,1.5,0.75,19.0,-28.616,-5.522,-7.501,0.113,1.167,-28.509,-7.903,1.131
6,9,1.055,1.013,5.58,40.12,8.0,0,24.61,29.34,70,1.047,79.0,1.5,1.0,20.0,-6.406,-12.718,-4.199,0.981,-0.282,-6.159,-5.164,-0.265


In [127]:
sample_to_model = data_to_model.sample(frac=.2, random_state=2, axis=0)

X = sample_to_model.iloc[:, 1:]
y = sample_to_model.iloc[:, 0] #the target was the first column included in features_list

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, stratify=y, random_state=35)

os.system('say "all done."'); print('\a')  # this is going to take a while, let me know when it's done
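
With classes as imbalanced as these, a plain random sample can under-represent the rarest styles. A stratified subsample is one alternative; the sketch below uses `train_test_split` purely as a stratified sampler on a made-up frame, and assumes the target has already been encoded.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced target (assumption: 'BroadStyle' is already encoded).
data = pd.DataFrame({
    'BroadStyle': [0] * 80 + [1] * 15 + [2] * 5,
    'ABV': [float(i % 10) for i in range(100)],
})

# Keep 20% of the rows while preserving the class proportions.
sample, _ = train_test_split(data, train_size=0.2,
                             stratify=data['BroadStyle'], random_state=2)

print(sample['BroadStyle'].value_counts().sort_index())
```

The discarded 80% is thrown away here, but it could equally be held back as extra evaluation data.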




### Running multiple different feature-sets to determine which is optimal

In [128]:
# Random Forest model
rfc = RandomForestClassifier()
print('Random Forest Classifier, multiple feature-list test series:')
mulifeature_accuracy_report(X_train, X_test, y_train, y_test, list_of_feature_lists, rfc)

Random Forest Classifier, multiple feature-list test series:
Model score for feature list #1: 0.6378077839555203
Model score for feature list #2: 0.6409849086576648
Model score for feature list #3: 0.6505162827640985
Model score for feature list #4: 0.625099285146942
Model score for feature list #5: 0.6378077839555203
Model score for feature list #6: 0.6020651310563939
Model score for feature list #7: 0.6282764098490866
Model score for feature list #8: 0.6481334392374901
Model score for feature list #9: 0.6195393169181891


In [130]:
# MLP model
mlp = MLPClassifier(hidden_layer_sizes=(100,200,100,100),
                    activation= 'logistic', 
                    solver='adam', 
                    alpha=0.0001, 
                    batch_size='auto', 
                    learning_rate='constant', 
                    learning_rate_init=0.001, 
                    power_t=0.5, 
                    max_iter=400, 
                    shuffle=True, 
                    random_state=105, 
                    tol=0.0001, 
                    verbose=False, 
                    warm_start=False, 
                    momentum=0.9, 
                    nesterovs_momentum=True, 
                    early_stopping=False, 
                    validation_fraction=0.1, 
                    beta_1=0.9, 
                    beta_2=0.999, 
                    epsilon=1e-08, 
                    n_iter_no_change=10)
print('MLP Classifier, multiple feature-list test series:')
mulifeature_accuracy_report(X_train, X_test, y_train, y_test, list_of_feature_lists, mlp)

MLP Classifier, multiple feature-list test series:
Model score for feature list #1: 0.6211278792692613
Model score for feature list #2: 0.6354249404289118
Model score for feature list #3: 0.6401906274821286
Model score for feature list #4: 0.6195393169181891
Model score for feature list #5: 0.5988880063542494
Model score for feature list #6: 0.6187450357426529
Model score for feature list #7: 0.6187450357426529
Model score for feature list #8: 0.6179507545671168
Model score for feature list #9: 0.613979348689436

### Update styles

In [17]:
# add a version 2 of the add_broad_styles function, one which removes the offending broad styles
def add_broad_styles_v2(beer_recipes):
    broad_styles = ['Ale', 'IPA', 'Pale Ale', 'Lager', 'Stout', 
                    'Bitter', 'Porter', 'Mead', 'Pilsner', 
                    'Weizen', 'Saison', 'Kölsch', 'American Wheat', 
                    'Barleywine', 'English Brown Ale']
    beer_recipes['BroadStyle'] = 'Other'
    beer_recipes['Style'].fillna('Unknown', inplace=True)

    for broad_style in broad_styles:
        beer_recipes.loc[beer_recipes['Style'].str.contains(broad_style), 'BroadStyle'] = broad_style

    # Manual overrides, applied after the generic substring matches
    beer_recipes.loc[beer_recipes['Style'].str.contains('Pils'), 'BroadStyle'] = 'Pilsner'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Melomel'), 'BroadStyle'] = 'Mead'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Metheglin'), 'BroadStyle'] = 'Mead'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Witbier'), 'BroadStyle'] = 'Weizen'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Weiss'), 'BroadStyle'] = 'Weizen'
    beer_recipes.loc[beer_recipes['Style'].str.contains('weiz'), 'BroadStyle'] = 'Weizen'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Gose'), 'BroadStyle'] = 'Weizen'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Schwarzbier'), 'BroadStyle'] = 'Lager'
    beer_recipes.loc[beer_recipes['Style'].str.contains('bock'), 'BroadStyle'] = 'Bock'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Tripel'), 'BroadStyle'] = 'Pale Ale'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Bock'), 'BroadStyle'] = 'Bock'
    beer_recipes.loc[beer_recipes['Style'].str.contains('Porter'), 'BroadStyle'] = 'Stout'
    
    drop_Other_style = beer_recipes.loc[beer_recipes['BroadStyle'] == 'Other']
    beer_recipes.drop(drop_Other_style.index, inplace=True)

    beer_recipes.drop(['Style'], axis=1, inplace=True)

    return(beer_recipes)
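
The substring rules in `add_broad_styles_v2` depend on ordering: a later match overwrites an earlier one, and the special cases win over the general list. The sketch below shows that behaviour on a toy frame. The style names and the shortened rule lists are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with made-up style names (illustrative only)
toy = pd.DataFrame({'Style': ['American Pale Ale', 'German Pils', 'Doppelbock',
                              'Robust Porter', 'Cream Soda']})
toy['BroadStyle'] = 'Other'

# General rules, in list order: 'Pale Ale' comes after 'Ale', so it overwrites it
for broad_style in ['Ale', 'IPA', 'Pale Ale', 'Lager', 'Stout', 'Porter']:
    mask = toy['Style'].str.contains(broad_style, regex=False)
    toy.loc[mask, 'BroadStyle'] = broad_style

# Special cases run last, so they always win over the general rules
special_cases = {'Pils': 'Pilsner', 'bock': 'Bock', 'Porter': 'Stout'}
for pattern, broad_style in special_cases.items():
    mask = toy['Style'].str.contains(pattern, regex=False)
    toy.loc[mask, 'BroadStyle'] = broad_style

print(toy)
```

Rows still labelled `Other` at this point are the ones the real function drops.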


### Setup comparison runs between RFC and MLP

In [131]:
# Establish the new version of beer_recipes
beer_recipes = basedata_beer_recipes.copy()

# do all the same preprocessing to it (using the new version of add_broad_styles_v2)
beer_recipes = drop_unnecessary_columns(beer_recipes)
beer_recipes = add_broad_styles_v2(beer_recipes) # <- this is the new version
beer_recipes = apply_get_sg(beer_recipes)
beer_recipes = clean_outliers(beer_recipes)
beer_recipes = fillin_values(beer_recipes)
beer_recipes = beer_recipes.dropna()

# run PCA function
beer_recipes = PCA_engineering_1(beer_recipes)
beer_recipes = PCA_engineering_2(beer_recipes)

# Select the feature list to be used and encode categorical data with the encoding function
data_to_model = encoding_function(beer_recipes, top_7_features)

# reset the encoder so that it can be used to reverse-engineer the broad_styles afterwards
encoder = LabelEncoder()
encoded_broadtypes = encoder.fit_transform(beer_recipes['BroadStyle'])

X = data_to_model.iloc[:, 1:]
y = data_to_model.iloc[:, 0] #the target was the first column included in features_list

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y, random_state=35)

os.system('say "all done."'); print('\a')  # this is going to take a while, let me know when it's done

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
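
The cell above refits a `LabelEncoder` so that the numeric class codes can be mapped back to style names later. A minimal sketch of that round trip follows, using made-up labels. It only reproduces the codes in the reports below if `encoding_function` also used a `LabelEncoder` fitted on the same `BroadStyle` values, which is an assumption, since that helper is not shown in this section.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for beer_recipes['BroadStyle']
styles = np.array(['Stout', 'IPA', 'Ale', 'IPA', 'Lager'])

encoder = LabelEncoder()
codes = encoder.fit_transform(styles)  # classes are sorted, then numbered from 0

# Code -> name lookup, handy for reading a numeric classification report
code_to_name = dict(enumerate(encoder.classes_))
print(code_to_name)

# inverse_transform maps model predictions (codes) back to style names
decoded = encoder.inverse_transform(codes)
print(decoded)
```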





### Sampling data for faster run-times

In [132]:
sample_to_model = data_to_model.sample(frac=.25, random_state=2, axis=0)

X = sample_to_model.iloc[:, 1:]
y = sample_to_model.iloc[:, 0] #the target was the first column included in features_list

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y, random_state=35)

os.system('say "all done."'); print('\a')  # this is going to take a while, let me know when it's done
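
`DataFrame.sample` draws rows uniformly, so the smallest classes can end up over- or under-represented in a 25% sample. One alternative, sketched here on synthetic data, is to reuse `train_test_split` with `stratify` and keep only the "train" side as the sample. The toy frame and its column names are hypothetical, and this is not what the notebook actually ran.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for data_to_model (target in the first column)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'target': np.repeat([0, 1, 2], [800, 160, 40]),
    'feature': rng.normal(size=1000),
})

# Plain random sample, as in the cell above: class shares can drift
plain = toy.sample(frac=0.25, random_state=2)

# Stratified sample: class shares match the full data as closely as possible
strat, _ = train_test_split(toy, train_size=0.25,
                            stratify=toy['target'], random_state=2)

print(plain['target'].value_counts().sort_index())
print(strat['target'].value_counts().sort_index())
```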




## RFC

In [133]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
print('RFC results:')
print(' ')
accuracy_report(X_test, y_test, rfc, 1)
os.system('say "all done."'); print('\a')  # this is going to take a while, let me know when it's done

RFC results:
 
Model score:
0.6442185514612452
 
Classification Report:
              precision    recall  f1-score   support

           0       0.51      0.55      0.53       529
           1       0.31      0.22      0.26        63
           2       0.68      0.72      0.70        29
           3       0.46      0.33      0.38       122
           4       0.76      0.55      0.64        40
           5       0.74      0.79      0.76       855
           6       0.55      0.36      0.44        44
           7       0.73      0.76      0.75       207
           8       0.71      0.45      0.56        11
           9       0.49      0.49      0.49       432
          10       0.53      0.49      0.51        63
          11       0.60      0.47      0.53       131
          12       0.83      0.86      0.85       424
          13       0.62      0.60      0.61       198

   micro avg       0.64      0.64      0.64      3148
   macro avg       0.61      0.55      0.57      3148
weighted
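
The report above lists classes by numeric code, which makes it hard to see which styles are being confused. A small sketch of passing the encoder's class names to `classification_report` follows. The labels and predictions are made up, and `accuracy_report` is the notebook's own helper, so how it calls scikit-learn internally is not known here.

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Made-up labels and predictions, purely for illustration
encoder = LabelEncoder().fit(['Ale', 'IPA', 'Stout'])
y_true = encoder.transform(['Ale', 'IPA', 'IPA', 'Stout', 'Ale', 'Stout'])
y_pred = encoder.transform(['Ale', 'IPA', 'Ale', 'Stout', 'Ale', 'IPA'])

# target_names replaces the numeric codes with readable style names
report = classification_report(y_true, y_pred,
                               target_names=list(encoder.classes_))
print(report)
```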

## MLP

In [142]:
# Establish and fit the model
mlp = MLPClassifier(hidden_layer_sizes=(200,150,100,50,25),
                    activation='logistic',
                    solver='adam', 
                    alpha=0.0001, 
                    batch_size='auto', 
                    learning_rate='constant', 
                    learning_rate_init=0.001, 
                    power_t=0.5, 
                    max_iter=500, 
                    shuffle=True, 
                    random_state=105, 
                    tol=0.0001, 
                    verbose=False, 
                    warm_start=False, 
                    momentum=0.9, 
                    nesterovs_momentum=True, 
                    early_stopping=False, 
                    validation_fraction=0.1, 
                    beta_1=0.9, 
                    beta_2=0.999, 
                    epsilon=1e-08, 
                    n_iter_no_change=10)
mlp.fit(X_train, y_train)
print(' ')
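
The settings above were found by hand-tuning. A more systematic way to vary MLP hyperparameters is a small grid search, sketched here on synthetic data so it runs in seconds. The grid values, the `StandardScaler` step, and `early_stopping=True` are illustrative assumptions and not the notebook's configuration. Whether the real features were already scaled is not shown in this section, but MLPs are generally sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(max_iter=200, early_stopping=True, random_state=105),
)

# Deliberately tiny grid: 2 x 2 x 2 = 8 candidates, 3 folds each
param_grid = {
    'mlpclassifier__hidden_layer_sizes': [(50,), (50, 25)],
    'mlpclassifier__activation': ['relu', 'logistic'],
    'mlpclassifier__alpha': [1e-4, 1e-2],
}

search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=1)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the full dataset each candidate is far more expensive, so it may be worth running the search on the 25% sample first and refitting only the best few settings on more data.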

 




In [143]:
accuracy_report(X_test, y_test, mlp, 1)


Model score:
0.6550190597204575
 
Classification Report:
              precision    recall  f1-score   support

           0       0.55      0.52      0.53       529
           1       0.28      0.24      0.26        63
           2       0.70      0.55      0.62        29
           3       0.44      0.25      0.32       122
           4       0.70      0.57      0.63        40
           5       0.76      0.81      0.78       855
           6       0.38      0.52      0.44        44
           7       0.74      0.71      0.73       207
           8       0.70      0.64      0.67        11
           9       0.52      0.55      0.53       432
          10       0.59      0.57      0.58        63
          11       0.57      0.50      0.53       131
          12       0.84      0.87      0.86       424
          13       0.57      0.64      0.60       198

   micro avg       0.66      0.66      0.66      3148
   macro avg       0.60      0.57      0.58      3148
weighted avg       0.65



[0.61417323 0.60663507 0.59206349 0.63476874 0.63402889]
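
The five numbers printed above look like per-fold cross-validation scores, though `accuracy_report` is not shown in this section, so that reading is an assumption. Comparing the fold mean and spread between models is one way to judge how stable each one is. The sketch below does this for an RFC and a small MLP on synthetic data. The model settings are illustrative only, and the numbers it prints say nothing about the beer dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=6, n_classes=3,
                                     random_state=0)

models = {
    'RFC': RandomForestClassifier(n_estimators=100, random_state=0),
    'MLP': MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=500,
                         random_state=105),
}

# Store (mean, std) of the 5-fold accuracy for each model
summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    summary[name] = (scores.mean(), scores.std())
    print(f'{name}: mean={scores.mean():.3f} std={scores.std():.3f}')
```

A single hold-out score that sits well above the cross-validated mean, as the MLP's 0.655 does relative to the folds above, may partly reflect a favourable split rather than a real gain.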


In [144]:
os.system('say "all done."'); print('\a')  # this could take a while, let me know when it's done


