### ANALYSIS BASELINE
#### Use the text of the reviews to predict the beer style.  
#### Vectorize the text in beer.review.
#### Diminish the importance of common words (TF-IDF weighting).
#### Use Naive Bayes to predict the style from the vectorized text.  
#### This will be my baseline.  I plan to improve it in several ways: eliminate the class imbalance, engineer features, and try different algorithms.

Compare ML algorithms that use the review text to predict beer.style
Compare ML algorithms that predict beer.rating

In [2]:
# IMPORT MODULES AND THE DATA SET
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split 
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('beer.csv', header=0)
df_copy = df.copy()  # save an independent copy of the dataframe for reference
print('length',len(df))
pd.set_option('max_colwidth', 220)
df.head(3)

length 80818


Unnamed: 0,name,brewery,style,rating,review
0,Big Rock Ale,Big Rock Brewery,Scottish Ale,3.9,"smell  soft hop aroma with significant malt scents. this one smells very creamy. taste  and creamy it is. the traditional irish flavors come out at the tongue. this is creamy, not like a cream ale, but close. the m..."
1,Flip Ale,Dogfish Head Craft Brewery,Old Ale,4.08,on tap at dfh rehoboth... collab with eatily... cardamom and red wine must. golden orange. .no head. typical dfh yeast aroma. ..some spice and maybe a belgian influence. sweet spicy and somewhat fruity.. not much ol...
2,The Almond Marzen Project - Beer Camp #26,Sierra Nevada Brewing Co.,Märzen / Oktoberfest,3.78,"nice auburn impressions, tons of clarity, solid inch of off white head. aroma was a little bit sweet and nutty. taste gave a little more sweetness, stayed away from hops and bitterness, relatively light bodied. no..."


In [3]:
df.shape

(80818, 5)

In [4]:
# DATA PREP
print('df original length',len(df))
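# drop all reviews with 20 characters or fewer; .copy() gives an independent
# DataFrame, so adding columns below does not raise SettingWithCopyWarning
df = df[df['review'].map(len) > 20].copy()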
print('length without short reviews',len(df))

# reset dataframe index for the shortened dataframe
df['index'] = np.arange(len(df))
df = df.set_index('index')

# Change a review to a cleaned string of words: remove non-letters, make lower case,
# split into words, remove stopwords (common words), and join back into one string.
STOPS = set(stopwords.words('english'))  # build the stopword set once, not per review

def review_to_words(review):
    letters_only = re.sub('[^a-zA-Z]', ' ', review)
    words = letters_only.lower().split()
    good_words = [w for w in words if w not in STOPS]
    # Words are returned unstemmed, matching the clean_review output shown below.
    # To stem them, apply PorterStemmer().stem to each word before joining.
    return ' '.join(good_words)

# clean the reviews
df['clean_review'] = df['review'].apply(review_to_words)

df.head(3)

df original length 80818
length without short reviews 49141




index,name,brewery,style,rating,review,clean_review
0,Big Rock Ale,Big Rock Brewery,Scottish Ale,3.9,"smell  soft hop aroma with significant malt scents. this one smells very creamy. taste  and creamy it is. the traditional irish flavors come out at the tongue. this is creamy, not like a cream ale, but close. the m...",smell soft hop aroma significant malt scents one smells creamy taste creamy traditional irish flavors come tongue creamy like cream ale close malt big buttery smooth hops unique sharp hop flavor easy saturated well m...
1,Flip Ale,Dogfish Head Craft Brewery,Old Ale,4.08,on tap at dfh rehoboth... collab with eatily... cardamom and red wine must. golden orange. .no head. typical dfh yeast aroma. ..some spice and maybe a belgian influence. sweet spicy and somewhat fruity.. not much ol...,tap dfh rehoboth collab eatily cardamom red wine must golden orange head typical dfh yeast aroma spice maybe belgian influence sweet spicy somewhat fruity much old ale characteristic light still tasty cardamom add ni...
2,The Almond Marzen Project - Beer Camp #26,Sierra Nevada Brewing Co.,Märzen / Oktoberfest,3.78,"nice auburn impressions, tons of clarity, solid inch of off white head. aroma was a little bit sweet and nutty. taste gave a little more sweetness, stayed away from hops and bitterness, relatively light bodied. no...",nice auburn impressions tons clarity solid inch white head aroma little bit sweet nutty taste gave little sweetness stayed away hops bitterness relatively light bodied nothing almond came obvious kind fancied oktober...


In [5]:
#pickle the clean data:
import pickle
filename = 'BeerReviews'
# the context manager closes the file even if pickling fails
with open(filename, 'wb') as outfile:
    pickle.dump(df, outfile)

In [6]:
# retrieve the pickled data:
filename = 'BeerReviews'
with open(filename, 'rb') as infile:
    df = pickle.load(infile)

In [7]:
df.shape

(49141, 6)

## MACHINE LEARNING 
### PREDICT STYLE FROM REVIEW
The most naive model would always predict the most-reviewed style, IPA.  It would be correct 13% of the time.  I'll use the Naive Bayes algorithm to improve on that, and the result will be my baseline.  Then I'll make various changes to improve the model.
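
The majority-class baseline described above can be computed directly from the label counts. This is a minimal sketch on made-up toy labels (not the beer data); the helper name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    return top_label, top_count / len(labels)

# Toy labels invented for illustration only.
toy_styles = ['IPA'] * 3 + ['Stout'] * 2 + ['Porter'] * 2 + ['Gose']
label, acc = majority_baseline_accuracy(toy_styles)
print(label, acc)
```

On the notebook's data the same helper could be called as `majority_baseline_accuracy(df['style'])`, since a pandas Series can be iterated like a list.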

## 1 Naive Bayes on all reviews
#### I'll vectorize the words, then use Naive Bayes to predict style based on the text in the reviews.  

In [8]:
# VECTORIZE THE REVIEWS  1.4 minutes
from sklearn.preprocessing import Normalizer

X = df['clean_review'].values
y = df['style'].values

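# vectorize all reviews: word counts first, then TF-IDF weighting.
# Note: the vocabulary and IDF weights are fitted on every review before the
# train/test split, so the test reviews influence the features.
vectorizer = CountVectorizer(analyzer='word')
X_counts = vectorizer.fit_transform(X)
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)
# TfidfTransformer already L2-normalizes each row by default,
# so this Normalizer step leaves the values unchanged
scaler = Normalizer()
X_scaled = scaler.fit_transform(X_tfidf)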

# split into train and test data
X_train,X_test,y_train,y_test = train_test_split(X_scaled,y, test_size=0.3, random_state=22)

In [9]:
X_scaled.shape

(49141, 73699)
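
The vectorizing cell above fits the vocabulary and IDF weights on all reviews and only then splits them. One possible alternative, sketched here on a toy corpus invented for illustration (not the beer data), is to split the raw text first and let a scikit-learn pipeline fit `TfidfVectorizer` on the training portion only. `TfidfVectorizer` combines `CountVectorizer` and `TfidfTransformer` and L2-normalizes rows by default, so no separate `Normalizer` is needed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus and labels, made up for this sketch.
texts = [
    'piney citrus hops bitter finish',
    'grapefruit hops resin bitter',
    'tropical hops juicy bitter aroma',
    'floral hops dry bitter',
    'roasted malt coffee chocolate',
    'dark roasted coffee creamy',
    'chocolate malt roasted smooth',
    'espresso roasted dark malt',
]
labels = ['IPA'] * 4 + ['Stout'] * 4

# Split the raw text first; stratify keeps the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=22, stratify=labels)

# The pipeline fits the vectorizer on the training text only,
# then applies the fitted vocabulary to the test text.
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.01))
model.fit(X_tr, y_tr)
preds = model.predict(X_te)
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

Scores from a split-first setup may differ slightly from the ones recorded in this notebook, because words that occur only in test reviews no longer enter the vocabulary.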

In [10]:
pd.set_option('display.max_rows', 110)
labels = df.groupby(['style']).size() 
print(labels)

style
Altbier                                 229
American Adjunct Lager                  339
American Amber / Red Ale               1424
American Amber / Red Lager              182
American Barleywine                     517
American Black Ale                      556
American Blonde Ale                     770
American Brown Ale                      877
American Dark Wheat Ale                  53
American Double / Imperial IPA         2677
American Double / Imperial Pilsner      127
American Double / Imperial Stout       1591
American IPA                           5552
American Malt Liquor                     87
American Pale Ale (APA)                2779
American Pale Lager                     483
American Pale Wheat Ale                 607
American Porter                        1578
American Stout                         1121
American Strong Ale                     438
American Wild Ale                      1769
Baltic Porter                           234
Belgian Dark Ale          

#### IMBALANCE <br>  This data set poses a problem: the classes are imbalanced.  Of course, American beer drinkers prefer 'IPA' to 'American Dark Wheat Ale' or 'Sahti', so there are more reviews for IPA.  The smallest class, 'Faro', holds 6 entries; the largest, 'American IPA', holds 5552.  
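
To put a number on the imbalance, the sketch below computes the largest-to-smallest class ratio and per-class weights using the common "balanced" heuristic, n_samples / (n_classes * class_count). The helper name `imbalance_summary` is hypothetical, and the toy labels contain only the two extreme classes quoted above, not the full data set.

```python
from collections import Counter

def imbalance_summary(labels):
    """Return (largest/smallest class ratio, balanced weight per class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    largest, smallest = max(counts.values()), min(counts.values())
    # Rare classes get large weights, frequent classes get small ones.
    weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}
    return largest / smallest, weights

# Only the two extremes mentioned in the text: 5552 vs 6 reviews.
toy_labels = ['American IPA'] * 5552 + ['Faro'] * 6
ratio, weights = imbalance_summary(toy_labels)
print(round(ratio), weights)
```

A ratio in the hundreds suggests that accuracy alone will be dominated by the large classes, which is why the per-class scores matter in what follows.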

In [11]:
# pickle the vectorized data (feature matrix and labels):
import pickle
filename = 'AllBeerVectors'
with open(filename, 'wb') as outfile:
    pickle.dump((X_scaled, y), outfile)

In [12]:
# retrieve the pickled data:
filename = 'AllBeerVectors'
with open(filename, 'rb') as infile:
    X_scaled, y = pickle.load(infile)

In [13]:
# NAIVE BAYES PREDICTOR
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_curve

clf = MultinomialNB(alpha=0.01)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
print("accuracy score: ", np.mean(predicted == y_test))

print(classification_report(y_test, predicted))

accuracy score:  0.423930000678




                                     precision    recall  f1-score   support

                            Altbier       0.83      0.13      0.23        75
             American Adjunct Lager       0.45      0.47      0.46       107
           American Amber / Red Ale       0.25      0.36      0.30       438
         American Amber / Red Lager       1.00      0.16      0.27        63
                American Barleywine       0.53      0.39      0.45       160
                 American Black Ale       0.83      0.26      0.40       172
                American Blonde Ale       0.37      0.23      0.28       216
                 American Brown Ale       0.44      0.27      0.34       267
            American Dark Wheat Ale       1.00      0.12      0.22        16
     American Double / Imperial IPA       0.60      0.44      0.51       795
 American Double / Imperial Pilsner       1.00      0.05      0.09        42
   American Double / Imperial Stout       0.37      0.79      0.50       48

#### NAIVE BAYES PREDICTOR:  <br> Overall precision is 0.57, recall 0.42, and F1 0.40.  That's not terrible; it's much better than random guessing, and better than always guessing IPA (which would be correct only about 13% of the time).  <br> Some classes have very little data, as the support column shows.  These often have lower scores, such as American Dark Wheat Ale (precision 1.00, recall 0.12, F1 0.22, support 16) and Black & Tan (precision 1.00, recall 0.12, F1 0.22, support 8).  Each Black & Tan prediction was correct, but the model missed most of them.  Bière Brut got a zero score.  <br>  Classes with more support generally scored better:  American IPA F1 0.49, American Double IPA F1 0.51, Saison F1 0.51.  <br>  Pumpkin Ale wins with F1 0.71, even though it didn't have a lot of support.  This style is distinctive, so perhaps its reviews were consistent with each other and different from those of other styles.    
#### This chart is hard to read.  Before I improve the algorithm, I want to reduce the number of classes so the results are easier to evaluate.  First I'll combine and drop styles, and then I'll try just a few styles.
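
The per-class scores discussed above follow from counts of true positives (tp), false positives (fp) and false negatives (fn). The sketch below uses counts that are an assumption, chosen to be consistent with the reported Black & Tan row (support 8, precision 1.00, recall 0.12), plus a hypothetical always-IPA predictor on 100 reviews of which 13 are IPA.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Assumed counts for a rare class: 1 correct prediction, 0 false alarms, 7 misses.
rare_p, rare_r, rare_f1 = precision_recall_f1(tp=1, fp=0, fn=7)

# Hypothetical "always guess IPA": all 13 IPAs found, 87 wrong IPA calls.
ipa_p, ipa_r, ipa_f1 = precision_recall_f1(tp=13, fp=87, fn=0)

print(rare_p, rare_r, rare_f1)
print(ipa_p, ipa_r, ipa_f1)
```

The rare class shows perfect precision with poor recall, while the always-IPA guess shows the opposite pattern: perfect recall for IPA but precision equal to the share of IPA reviews.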

## 2 reduce the number of classes
#### The number of beer styles is large.  Let's simplify the list:  <br> I'll combine similar styles.  Then I'll eliminate styles with very few reviews.

In [14]:
print('length',len(df))
styles = df.groupby(['style']).size() 
print('Number of styles used:', len(styles))
print('')
print(styles.index)

length 49141
Number of styles used: 104

Index(['Altbier', 'American Adjunct Lager', 'American Amber / Red Ale',
       'American Amber / Red Lager', 'American Barleywine',
       'American Black Ale', 'American Blonde Ale', 'American Brown Ale',
       'American Dark Wheat Ale', 'American Double / Imperial IPA',
       ...
       'Scotch Ale / Wee Heavy', 'Scottish Ale',
       'Scottish Gruit / Ancient Herbed Ale', 'Smoked Beer', 'Tripel',
       'Vienna Lager', 'Weizenbock', 'Wheatwine', 'Winter Warmer', 'Witbier'],
      dtype='object', name='style', length=104)


In [15]:
# COMBINE SIMILAR STYLES OF BEER, and eliminate the least common styles

# Work on a real copy so the merged labels do not leak back into df.
slim_df = df.copy()

# Each merged label and the original styles it absorbs.
style_groups = {
    'Farm Ale': ['Saison / Farmhouse Ale', 'Bière de Garde'],
    'IPA': ['American IPA', 'English India Pale Ale (IPA)', 'Belgian IPA'],
    'Scottish Ale': ['Scotch Ale / Wee Heavy'],
    'Pale Ale': ['American Pale Ale (APA)', 'English Pale Ale', 'Belgian Pale Ale'],
    'Brown Ale': ['American Brown Ale', 'English Brown Ale', 'English Dark Mild Ale'],
    'Stout': ['American Stout', 'English Stout', 'Milk / Sweet Stout', 'Oatmeal Stout'],
    # NOTE: 'Russian Imperial' matches no label; the data set appears to use
    # 'Russian Imperial Stout', which therefore stays a separate style in the
    # outputs below.
    'Imperial Stout': ['American Double / Imperial Stout', 'Russian Imperial'],
    'Porter': ['American Porter', 'Baltic Porter', 'English Porter'],
    'Lager': ['American Amber / Red Lager', 'Vienna Lager', 'German Pilsener',
              'Munich Helles Lager'],
    'American Lager': ['American Adjunct Lager', 'American Pale Lager'],
    'Barleywine': ['American Barleywine', 'English Barleywine'],
    'Bitter': ['English Bitter', 'Extra Special / Strong Bitter (ESB)'],
    'Wheat': ['American Pale Wheat Ale', 'Witbier'],
}
# Assign the result back instead of using inplace=True on a column.
for merged, originals in style_groups.items():
    slim_df['style'] = slim_df['style'].replace(originals, merged)

styles = slim_df.groupby(['style']).size() 
print('Number of styles after combining:', len(styles))

Number of styles after combining: 84


In [16]:
# remove uncommon styles (in EDA, I found some styles with fewer than 200 reviews,
# such as 'Eisbock', 'Faro', 'Gueuze', 'Happoshu')

labels = slim_df.groupby(['style']).size() 
uncommon = labels[labels<200]
# build the mask from slim_df itself, so mask and frame always match
slim_df = slim_df.loc[~slim_df['style'].isin(uncommon.index)]
styles = slim_df.groupby(['style']).size() 
print('Number of styles after removing uncommon:', len(styles))
print('New length',len(slim_df))

Number of styles after removing uncommon: 51
New length 46321


#### The number of reviews went down slightly, from 49141 to 46321.  The number of styles dropped by about half, from 104 to 51.  Perhaps this will be a manageable number.  Let's look at the number of reviews per class:

In [17]:
pd.set_option('display.max_rows', 110)
labels = slim_df.groupby(['style']).size() 
print(labels)

style
Altbier                            229
American Amber / Red Ale          1424
American Black Ale                 556
American Blonde Ale                770
American Double / Imperial IPA    2677
American Lager                     822
American Strong Ale                438
American Wild Ale                 1769
Barleywine                         799
Belgian Dark Ale                   205
Belgian Strong Dark Ale            407
Belgian Strong Pale Ale            474
Berliner Weissbier                 548
Bitter                            1171
Bock                               233
Brown Ale                         1494
Cream Ale                          286
Czech Pilsener                     429
Doppelbock                         294
Dubbel                             331
Dunkelweizen                       219
Euro Pale Lager                    566
Farm Ale                          2387
Fruit / Vegetable Beer            1010
Gose                               337
Hefeweizen         

#### IMBALANCE <br>  The imbalance problem still exists, but it's improved.  The smallest class holds 205 entries; the largest 6268.

In [18]:
# VECTORIZE THE REVIEWS
from sklearn.preprocessing import Normalizer

X = slim_df['clean_review'].values
y = slim_df['style'].values

# vectorize all reviews (fit and transform into feature vectors); the split happens below
vectorizer = CountVectorizer(analyzer='word')
X_counts = vectorizer.fit_transform(X)
tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_counts)
scaler = Normalizer()
X_scaled = scaler.fit_transform(X_train_tfidf)

# split into train and test data
X_train,X_test,y_train,y_test = train_test_split(X_scaled,y, test_size=0.3, random_state=22)

In [19]:
# NAIVE BAYES PREDICTOR
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_curve

clf = MultinomialNB(alpha = 0.01)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
print("accuracy score: ", np.mean(predicted == y_test))

print(classification_report(y_test, predicted))

accuracy score:  0.49521479456
                                precision    recall  f1-score   support

                       Altbier       1.00      0.13      0.24        89
      American Amber / Red Ale       0.35      0.23      0.27       400
            American Black Ale       0.95      0.21      0.35       174
           American Blonde Ale       0.67      0.14      0.23       243
American Double / Imperial IPA       0.70      0.40      0.51       817
                American Lager       0.53      0.51      0.52       251
           American Strong Ale       0.89      0.19      0.31       129
             American Wild Ale       0.57      0.70      0.63       538
                    Barleywine       0.62      0.62      0.62       229
              Belgian Dark Ale       1.00      0.12      0.21        68
       Belgian Strong Dark Ale       0.45      0.25      0.32       124
       Belgian Strong Pale Ale       0.74      0.12      0.21       141
            Berliner Weissbier  

#### NAIVE BAYES PREDICTOR with 51 classes:  <br>  The results are still hard to read, so I need to look at fewer classes.  <br>  The scores improved, both overall and for individual styles.  <br>  Overall F1 was 0.40 and is now 0.47.  IPAs scored 0.49 or worse before; the combined IPA F1 is now 0.57.  The Stout F1 score is 0.61, also much better than before.    

## 3 classes: IPA, Stout, other
#### Reduce to 3 styles: IPA, Stout, Other.  I'll combine all IPA styles into one style, and do the same for most stout styles (a few, such as Irish Dry Stout, fall under "Other" in the code below).  All other styles go into the "Other" class.

In [20]:
# 3 STYLES OF BEER: all IPAs: IPA.  all Stouts: Stout.  all others: Other

three_styles = df.copy()  # copy, so relabeling here does not change df
ipa_list = ['American IPA','English India Pale Ale (IPA)','American Double / Imperial IPA',
           'Belgian IPA',]
three_styles['style'].replace(ipa_list, 'IPA', inplace=True)
stout_list = ['American Stout','English Stout','Milk / Sweet Stout','Oatmeal Stout',
             'Imperial Stout','American Double / Imperial Stout', ]
three_styles['style'].replace(stout_list, 'Stout', inplace=True)
other_list = ['Altbier', 'American Adjunct Lager', 'American Amber / Red Ale',
       'American Amber / Red Lager', 'American Barleywine',
       'American Black Ale', 'American Blonde Ale', 'American Brown Ale',
        'American Double / Imperial Pilsner',
       'American Pale Ale (APA)', 'American Pale Lager',
       'American Pale Wheat Ale', 'American Porter', 'American Stout',
       'American Strong Ale', 'American Wild Ale', 'Baltic Porter',
       'Belgian Dark Ale', 'Belgian Pale Ale',
       'Belgian Strong Dark Ale', 'Belgian Strong Pale Ale',
       'Berliner Weissbier', 'Bière de Garde', 'Bock',
       'California Common / Steam Beer', 'Chile Beer', 'Cream Ale',
       'Czech Pilsener', 'Doppelbock', 'Dortmunder / Export Lager', 'Dubbel',
       'Dunkelweizen', 'English Barleywine', 'English Bitter',
       'English Brown Ale', 'English Dark Mild Ale',
        'English Pale Ale', 'English Porter',
        'English Strong Ale', 'Euro Dark Lager',
       'Euro Pale Lager', 'Extra Special / Strong Bitter (ESB)',
       'Flanders Oud Bruin', 'Flanders Red Ale', 'Foreign / Export Stout',
       'Fruit / Vegetable Beer', 'German Pilsener', 'Gose', 'Hefeweizen',
       'Herbed / Spiced Beer', 'Irish Dry Stout', 'Irish Red Ale',
       'Kellerbier / Zwickelbier', 'Kölsch', 'Lambic - Fruit', 'Light Lager',
       'Maibock / Helles Bock', 'Milk / Sweet Stout', 'Munich Dunkel Lager',
       'Munich Helles Lager', 'Märzen / Oktoberfest',
       'Old Ale', 'Pumpkin Ale', 'Quadrupel (Quad)', 'Rauchbier',
       'Russian Imperial Stout', 'Rye Beer', 'Saison / Farmhouse Ale',
       'Schwarzbier', 'Scotch Ale / Wee Heavy', 'Scottish Ale', 'Smoked Beer',
       'Tripel', 'Vienna Lager', 'Weizenbock', 'Wheatwine', 'Winter Warmer',
       'Witbier','American Dark Wheat Ale', 'American Malt Liquor',
       'Bière de Champagne / Bière Brut', 'Black & Tan', 'Braggot', 'Eisbock',
       'English Pale Mild Ale', 'Euro Strong Lager', 'Faro', 'Gueuze',
       'Happoshu', 'Japanese Rice Lager', 'Kristalweizen', 'Kvass',
       'Lambic - Unblended', 'Low Alcohol Beer', 'Roggenbier', 'Sahti',
       'Scottish Gruit / Ancient Herbed Ale','American Lager','Barleywine','Bitter',
        'Brown Ale', 'Farm Ale','Lager','Pale Ale', 'Porter','Wheat']
three_styles['style'].replace(other_list, 'Other', inplace=True)

styles = three_styles.groupby(['style']).size() 
print('Number of styles after combining:', len(styles))

Number of styles after combining: 3


In [21]:
print(styles.index)

Index(['IPA', 'Other', 'Stout'], dtype='object', name='style')
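
The long explicit `other_list` above can be avoided by treating "Other" as the default. This is a minimal sketch under that assumption, using the IPA and stout lists from the cell above and a toy Series invented for illustration; the helper name `to_three_classes` is hypothetical.

```python
import pandas as pd

IPA_STYLES = {'IPA', 'American IPA', 'English India Pale Ale (IPA)',
              'American Double / Imperial IPA', 'Belgian IPA'}
STOUT_STYLES = {'Stout', 'American Stout', 'English Stout', 'Milk / Sweet Stout',
                'Oatmeal Stout', 'Imperial Stout',
                'American Double / Imperial Stout'}

def to_three_classes(style):
    """Map a style name to 'IPA', 'Stout', or the default 'Other'."""
    if style in IPA_STYLES:
        return 'IPA'
    if style in STOUT_STYLES:
        return 'Stout'
    return 'Other'

# Toy style column, made up for this sketch.
toy = pd.Series(['American IPA', 'Oatmeal Stout', 'Gose', 'Belgian IPA', 'Witbier'])
three = toy.map(to_three_classes)
print(three.tolist())
```

With a default class, a style that is missing from every list still ends up in "Other" instead of surviving as a fourth class.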


Show the number of data points per class:

In [22]:
labels = three_styles.groupby(['style']).size() 
print(labels)

style
IPA       8945
Other    36300
Stout     3896
dtype: int64


In [23]:
# VECTORIZE THE REVIEWS  
from sklearn.preprocessing import Normalizer

X = three_styles['clean_review'].values
y = three_styles['style'].values

# vectorize all reviews (fit and transform into feature vectors); the split happens below
vectorizer = CountVectorizer(analyzer='word')
X_counts = vectorizer.fit_transform(X)
tfidf = TfidfTransformer()
X_train_tfidf = tfidf.fit_transform(X_counts)
scaler = Normalizer()
X_scaled = scaler.fit_transform(X_train_tfidf)

# split into train and test data
X_train,X_test,y_train,y_test = train_test_split(X_scaled,y, test_size=0.3, random_state=22)

In [28]:
# NAIVE BAYES PREDICTOR
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_curve

clf = MultinomialNB(alpha = 0.01)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)

print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

         IPA       0.84      0.61      0.71      2664
       Other       0.87      0.94      0.90     10883
       Stout       0.68      0.55      0.61      1196

   micro avg       0.85      0.85      0.85     14743
   macro avg       0.80      0.70      0.74     14743
weighted avg       0.85      0.85      0.84     14743
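
The macro and weighted averages in the report differ only in how the per-class scores are combined. The sketch below recomputes both F1 averages from the rounded per-class values and supports in the table above, so small rounding differences are possible.

```python
# Per-class F1 scores and supports copied from the three-class report.
f1 = {'IPA': 0.71, 'Other': 0.90, 'Stout': 0.61}
support = {'IPA': 2664, 'Other': 10883, 'Stout': 1196}

# Macro average: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(round(macro_f1, 2), round(weighted_f1, 2))
```

Because "Other" holds most of the test reviews, the weighted average sits close to its F1, while the macro average is pulled down by the smaller IPA and Stout classes.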



#### This chart is easy to evaluate, because I can see all the classes in one window.  <br>  Overall F1 improved again, from 0.47 to 0.84, and IPA's F1 rose from 0.57 to 0.71.  The Stout F1 score stayed at 0.61.  <br>  I improved the baseline by reducing the number of classes, and I hope to improve the scores much more.  Next I'll eliminate the imbalance, use feature engineering, tune the algorithm, and try different algorithms.  In the end, I hope the model will predict style from text accurately.  Perhaps I'll then try the improved model on the entire set of classes.  
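
One of the planned next steps is tuning the algorithm. A minimal sketch of what that could look like, assuming a toy corpus invented here and an arbitrary grid of `alpha` values: a cross-validated grid search over the Naive Bayes smoothing parameter, scored with macro F1 so that small classes count as much as large ones. The step names `'tfidf'` and `'nb'` are choices made for this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus and labels, made up for this sketch.
texts = [
    'piney citrus hops bitter finish',
    'grapefruit hops resin bitter',
    'tropical hops juicy bitter aroma',
    'floral hops dry bitter',
    'roasted malt coffee chocolate',
    'dark roasted coffee creamy',
    'chocolate malt roasted smooth',
    'espresso roasted dark malt',
]
labels = ['IPA'] * 4 + ['Stout'] * 4

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])

# 'nb__alpha' addresses the alpha parameter of the step named 'nb'.
grid = GridSearchCV(pipe, {'nb__alpha': [0.01, 0.1, 1.0]},
                    scoring='f1_macro', cv=2)
grid.fit(texts, labels)
best_alpha = grid.best_params_['nb__alpha']
print(best_alpha, grid.best_score_)
```

On the real reviews a larger number of folds and a wider grid would likely be more informative; the values here are placeholders.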