### MACHINE LEARNING -- REBALANCE THE DATA 
#### The baseline model is set, I'll try to improve the model.  I'll eliminate imbalance, using oversampling or undersampling.  <br>  Here's the baseline for three classes of beer styles:
              precision    recall  f1-score   support

         IPA       0.84      0.61      0.71      2664
       Other       0.87      0.94      0.90     10883
       Stout       0.68      0.55      0.61      1196
   micro avg       0.85      0.85      0.85     14743
   macro avg       0.80      0.70      0.74     14743
weighted avg       0.85      0.85      0.84     14743

In [34]:
# IMPORT MODULES AND THE DATA SET
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

df = pd.read_csv('beer.csv', header=0)
df_copy = df  #save a copy of dataframe for reference. 
print('length',len(df))
pd.set_option('max_colwidth', 220)
df.head(3)

length 80818


Unnamed: 0,name,brewery,style,rating,review
0,Big Rock Ale,Big Rock Brewery,Scottish Ale,3.9,"smell  soft hop aroma with significant malt scents. this one smells very creamy. taste  and creamy it is. the traditional irish flavors come out at the tongue. this is creamy, not like a cream ale, but close. the m..."
1,Flip Ale,Dogfish Head Craft Brewery,Old Ale,4.08,on tap at dfh rehoboth... collab with eatily... cardamom and red wine must. golden orange. .no head. typical dfh yeast aroma. ..some spice and maybe a belgian influence. sweet spicy and somewhat fruity.. not much ol...
2,The Almond Marzen Project - Beer Camp #26,Sierra Nevada Brewing Co.,Märzen / Oktoberfest,3.78,"nice auburn impressions, tons of clarity, solid inch of off white head. aroma was a little bit sweet and nutty. taste gave a little more sweetness, stayed away from hops and bitterness, relatively light bodied. no..."


In [35]:
# DATA PREP
# drop all reviews with < 20 characters
df = df[df['review'].map(len) > 20]
# reset index for the shortened dataframe
df['index'] = np.arange(len(df))
df = df.set_index('index')

# Change review to a string of words.  remove non-letters, make lower case, split into words.  
# Remove stopwords (common words.)  Join back together into a long string of words. 
def review_to_words(review):
    letters_only = re.sub('[^a-zA-Z]',' ', review)
    words = letters_only.lower().split()
    stops = set(stopwords.words('english'))  
    good_words = [w for w in words if not w in stops]
    porter = PorterStemmer()
    stemmed = [porter.stem(word) for word in good_words]
    return(' '.join(stemmed))

# clean the reviews
df['clean_review'] = df['review'].apply(review_to_words)

df.head(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


Unnamed: 0_level_0,name,brewery,style,rating,review,clean_review
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,Big Rock Ale,Big Rock Brewery,Scottish Ale,3.9,"smell  soft hop aroma with significant malt scents. this one smells very creamy. taste  and creamy it is. the traditional irish flavors come out at the tongue. this is creamy, not like a cream ale, but close. the m...",smell soft hop aroma signific malt scent one smell creami tast creami tradit irish flavor come tongu creami like cream ale close malt big butteri smooth hop uniqu sharp hop flavor easi satur well mix blend play compl...
1,Flip Ale,Dogfish Head Craft Brewery,Old Ale,4.08,on tap at dfh rehoboth... collab with eatily... cardamom and red wine must. golden orange. .no head. typical dfh yeast aroma. ..some spice and maybe a belgian influence. sweet spicy and somewhat fruity.. not much ol...,tap dfh rehoboth collab eatili cardamom red wine must golden orang head typic dfh yeast aroma spice mayb belgian influenc sweet spici somewhat fruiti much old ale characterist light still tasti cardamom add nice flav...
2,The Almond Marzen Project - Beer Camp #26,Sierra Nevada Brewing Co.,Märzen / Oktoberfest,3.78,"nice auburn impressions, tons of clarity, solid inch of off white head. aroma was a little bit sweet and nutty. taste gave a little more sweetness, stayed away from hops and bitterness, relatively light bodied. no...",nice auburn impress ton clariti solid inch white head aroma littl bit sweet nutti tast gave littl sweet stay away hop bitter rel light bodi noth almond came obviou kind fanci oktoberfest good realli chang anyth use a...


In [36]:
# Create 3 classes: Stout, IPA, and other

three_styles = df 
ipa_list = ['American IPA','English India Pale Ale (IPA)','American Double / Imperial IPA',
           'Belgian IPA',]
three_styles['style'].replace(ipa_list, 'IPA', inplace=True)
stout_list = ['American Stout','English Stout','Milk / Sweet Stout','Oatmeal Stout',
             'Imperial Stout','American Double / Imperial Stout', ]
three_styles['style'].replace(stout_list, 'Stout', inplace=True)
other_list = ['Altbier', 'American Adjunct Lager', 'American Amber / Red Ale',
       'American Amber / Red Lager', 'American Barleywine', 'American Black Ale', 
       'American Blonde Ale', 'American Brown Ale', 'American Double / Imperial Pilsner', 
        'American Pale Ale (APA)', 'American Pale Lager', 'American Pale Wheat Ale', 
       'American Porter', 'American Stout', 'American Strong Ale', 'American Wild Ale',
       'Baltic Porter', 'Belgian Dark Ale', 'Belgian Pale Ale', 'Belgian Strong Dark Ale', 
       'Belgian Strong Pale Ale', 'Berliner Weissbier', 'Bière de Garde', 'Bock', 
       'California Common / Steam Beer', 'Chile Beer', 'Cream Ale', 'Czech Pilsener', 
       'Doppelbock', 'Dortmunder / Export Lager', 'Dubbel', 'Dunkelweizen', 
       'English Barleywine', 'English Bitter', 'English Brown Ale', 'English Pale Ale', 
       'English Dark Mild Ale',  'English Porter', 'English Strong Ale', 'Euro Dark Lager', 
       'Euro Pale Lager', 'Extra Special / Strong Bitter (ESB)', 'Flanders Oud Bruin', 
       'Flanders Red Ale', 'Foreign / Export Stout', 'Fruit / Vegetable Beer', 
       'German Pilsener', 'Gose', 'Hefeweizen', 'Herbed / Spiced Beer', 'Irish Dry Stout', 
       'Irish Red Ale', 'Kellerbier / Zwickelbier', 'Kölsch', 'Lambic - Fruit', 
       'Light Lager', 'Maibock / Helles Bock', 'Milk / Sweet Stout', 'Munich Dunkel Lager',  
       'Munich Helles Lager', 'Märzen / Oktoberfest', 'Old Ale', 'Pumpkin Ale', 
       'Quadrupel (Quad)', 'Rauchbier', 'Russian Imperial Stout', 'Rye Beer',
        'Saison / Farmhouse Ale', 'Schwarzbier', 'Scotch Ale / Wee Heavy', 'Scottish Ale', 
       'Smoked Beer', 'Tripel', 'Vienna Lager', 'Weizenbock', 'Wheatwine', 
       'Winter Warmer', 'Witbier','American Dark Wheat Ale', 'American Malt Liquor',       
       'Bière de Champagne / Bière Brut', 'Black & Tan', 'Braggot', 'Eisbock',
       'English Pale Mild Ale', 'Euro Strong Lager', 'Faro', 'Gueuze', 'Happoshu', 
       'Japanese Rice Lager', 'Kristalweizen', 'Kvass', 'Lambic - Unblended', 
       'Low Alcohol Beer', 'Roggenbier', 'Sahti', 'Scottish Gruit / Ancient Herbed Ale',
       'American Lager','Barleywine','Bitter', 'Brown Ale', 'Farm Ale', 'Lager',
       'Pale Ale', 'Porter','Wheat']
three_styles['style'].replace(other_list, 'Other', inplace=True)
styles = three_styles.groupby(['style']).size() 

In [37]:
labels = three_styles.groupby(['style']).size() 
print(labels)

style
IPA       8945
Other    36300
Stout     3896
dtype: int64


In [38]:
#pickle the clean data:
import pickle
filename = '3styles'
outfile = open(filename,'wb')
pickle.dump(three_styles,outfile)
outfile.close()

In [39]:
# retrieve the pickled data:
import pickle
filename = '3styles'
infile = open(filename,'rb')
three_styles = pickle.load(infile)
infile.close()

In [40]:
# VECTORIZE THE REVIEWS  
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split 

X = three_styles['clean_review'].values
y = three_styles['style'].values

# vectorize the train data, fit and transform into feature vectors
vectorizer = CountVectorizer(analyzer='word')
X_counts = vectorizer.fit_transform(X)
tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_counts)
scaler = Normalizer()
X_scaled = scaler.fit_transform(X_train_tfidf)

# split into train and test data
X_train,X_test,y_train,y_test = train_test_split(X_scaled,y, test_size=0.3, random_state=22)

#### NAIVE BAYES to predict style
#### Use classification report to see all scoring.  <br>  Compare train data and test data to detect overfitting.

In [41]:
# NAIVE BAYES PREDICTOR  BASELINE
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_curve

clf = MultinomialNB(alpha = 0.01)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
predicted_train = clf.predict(X_train)

print(classification_report(y_train, predicted_train))
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

         IPA       0.91      0.75      0.82      6281
       Other       0.92      0.96      0.94     25417
       Stout       0.81      0.77      0.79      2700

   micro avg       0.91      0.91      0.91     34398
   macro avg       0.88      0.83      0.85     34398
weighted avg       0.91      0.91      0.91     34398

              precision    recall  f1-score   support

         IPA       0.84      0.60      0.70      2664
       Other       0.86      0.94      0.90     10883
       Stout       0.68      0.55      0.61      1196

   micro avg       0.85      0.85      0.85     14743
   macro avg       0.80      0.70      0.74     14743
weighted avg       0.85      0.85      0.84     14743



#### Stout performs much better in the training set than the test set.  Thus, Stout is overfitted, perhaps because of the low support number.  I'll address the imbalance with sampling methods.  

### Parameter tuning for multnomial Naive Bayes

In [6]:
from sklearn.model_selection import GridSearchCV
param_grid = {'alpha': [1,0.1,0.01,0.001,0.0001]}
clf = MultinomialNB()
clf_cv = GridSearchCV(clf, param_grid, cv = 5)
clf_cv.fit(X_train, y_train)
print(clf_cv.best_params_,clf_cv.best_score_)

{'alpha': 0.01} 0.847781847782


#### According to grid search, alpha=0.01 is the best setting.  

### RE-SAMPLING DATA.  
#### Use SMOTE to oversample the small classes.  This creates more data to bolster the small classes, making all classes balanced.

In [44]:
# oversample using SMOTE to balance the classes
from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE().fit_resample(X_scaled, y)
# split into train and test data
X_train,X_test,y_train,y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=22)

In [45]:
# NAIVE BAYES PREDICTOR 
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_curve
clf = MultinomialNB(alpha = 0.01)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
predicted_train = clf.predict(X_train)

print("classification report on the TRAIN data")
print(classification_report(y_train, predicted_train))
print("classification report on the TEST data")
print(classification_report(y_test, predicted))

classification report on the TRAIN data
              precision    recall  f1-score   support

         IPA       0.92      0.93      0.92     25285
       Other       0.91      0.82      0.87     25476
       Stout       0.90      0.97      0.94     25469

   micro avg       0.91      0.91      0.91     76230
   macro avg       0.91      0.91      0.91     76230
weighted avg       0.91      0.91      0.91     76230

classification report on the TEST data
              precision    recall  f1-score   support

         IPA       0.90      0.92      0.91     11015
       Other       0.89      0.78      0.83     10824
       Stout       0.89      0.97      0.93     10831

   micro avg       0.89      0.89      0.89     32670
   macro avg       0.89      0.89      0.89     32670
weighted avg       0.89      0.89      0.89     32670



#### this boosted the scores a lot, and removed the overfitting problem.  

#### UNDERSAMPLING  <br>  try with random undersampling, then try NearMiss undersampling

In [30]:
# undersample using random to balance the classes
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X_scaled, y)
# split into train and test data
X_train,X_test,y_train,y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=22)

In [31]:
# NAIVE BAYES PREDICTOR 
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_curve
clf = MultinomialNB(alpha = 0.01)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
predicted_train = clf.predict(X_train)

print("classification report on the TRAIN data")
print(classification_report(y_train, predicted_train))
print("classification report on the TEST data")
print(classification_report(y_test, predicted))

classification report on the TRAIN data
              precision    recall  f1-score   support

         IPA       0.95      0.96      0.95      2703
       Other       0.97      0.87      0.92      2731
       Stout       0.91      0.99      0.95      2747

   micro avg       0.94      0.94      0.94      8181
   macro avg       0.94      0.94      0.94      8181
weighted avg       0.94      0.94      0.94      8181

classification report on the TEST data
              precision    recall  f1-score   support

         IPA       0.86      0.85      0.85      1193
       Other       0.79      0.71      0.74      1165
       Stout       0.84      0.93      0.88      1149

   micro avg       0.83      0.83      0.83      3507
   macro avg       0.83      0.83      0.83      3507
weighted avg       0.83      0.83      0.83      3507



In [32]:
# undersample using NearMiss to balance the classes
from imblearn.under_sampling import NearMiss
nm1 = NearMiss(version=1)
X_resampled_nm1, y_resampled = nm1.fit_resample(X_scaled, y)
# split into train and test data
X_train,X_test,y_train,y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=22)

In [33]:
# NAIVE BAYES PREDICTOR 
%%time
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_curve
clf = MultinomialNB(alpha = 0.01)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
predicted_train = clf.predict(X_train)

print("classification report on the TRAIN data")
print(classification_report(y_train, predicted_train))
print("classification report on the TEST data")
print(classification_report(y_test, predicted))

classification report on the TRAIN data
              precision    recall  f1-score   support

         IPA       0.95      0.96      0.95      2703
       Other       0.97      0.87      0.92      2731
       Stout       0.91      0.99      0.95      2747

   micro avg       0.94      0.94      0.94      8181
   macro avg       0.94      0.94      0.94      8181
weighted avg       0.94      0.94      0.94      8181

classification report on the TEST data
              precision    recall  f1-score   support

         IPA       0.86      0.85      0.85      1193
       Other       0.79      0.71      0.74      1165
       Stout       0.84      0.93      0.88      1149

   micro avg       0.83      0.83      0.83      3507
   macro avg       0.83      0.83      0.83      3507
weighted avg       0.83      0.83      0.83      3507



#### using SMOTE to oversample the small classes worked the best of the sampling methods I tried.  (There are other possible sampling methods, but I'm satisfied with this.)  
#### However, I see that I sampled before train test split.  This altered the test data.  I'll do it again, sampling after the split.

In [50]:
%%time
# oversample using SMOTE to balance the classes
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split

# VECTORIZE THE REVIEWS  
X = three_styles['clean_review'].values
y = three_styles['style'].values

# vectorize the train data, fit and transform into feature vectors
vectorizer = CountVectorizer(analyzer='word')
X_counts = vectorizer.fit_transform(X)
tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_counts)
scaler = Normalizer()
X_scaled = scaler.fit_transform(X_train_tfidf)

# split into train and test data
X_train,X_test,y_train,y_test = train_test_split(X_scaled,y, test_size=0.3, random_state=22)
# split into train and test data
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)

CPU times: user 1min 12s, sys: 4.65 s, total: 1min 17s
Wall time: 1min 19s


In [53]:
%%time
# NAIVE BAYES PREDICTOR 
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_curve
clf = MultinomialNB(alpha = 0.01)
clf.fit(X_resampled, y_resampled)
predicted = clf.predict(X_test)
predicted_train = clf.predict(X_resampled)

print("classification report on the TRAIN data")
print(classification_report(y_resampled, predicted_train))
print("classification report on the TEST data")
print(classification_report(y_test, predicted))

classification report on the TRAIN data
              precision    recall  f1-score   support

         IPA       0.91      0.95      0.93     25417
       Other       0.94      0.83      0.88     25417
       Stout       0.91      0.97      0.94     25417

   micro avg       0.92      0.92      0.92     76251
   macro avg       0.92      0.92      0.92     76251
weighted avg       0.92      0.92      0.92     76251

classification report on the TEST data
              precision    recall  f1-score   support

         IPA       0.65      0.85      0.74      2664
       Other       0.94      0.79      0.86     10883
       Stout       0.47      0.85      0.60      1196

   micro avg       0.80      0.80      0.80     14743
   macro avg       0.69      0.83      0.73     14743
weighted avg       0.85      0.80      0.81     14743

CPU times: user 1.92 s, sys: 27.8 ms, total: 1.95 s
Wall time: 2.03 s


In [None]:
 BASELINE: 
    IPA       0.84      0.61      0.71      2664
   Other       0.87      0.94      0.90     10883
   Stout       0.68      0.55      0.61      1196
    
    weighted avg 0.85 0.85 0.84 14743