### ANALYSIS BASELINE
#### Use the text of the reviews to predict the beer style.
#### Vectorize the text in beer.review.
#### Diminish the importance of common words (TF-IDF weighting).
#### Use Naive Bayes to classify the style.
#### This will be my baseline. I plan to improve it in several ways: eliminating class imbalance, feature engineering, and trying different algorithms.

Compare ML algorithms that use the review text to predict beer.style.
Compare ML algorithms that predict beer.rating.
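
The plan above (vectorize, down-weight common words, classify with Naive Bayes) can be sketched as a single scikit-learn `Pipeline`. This is a minimal illustration on made-up reviews, not the notebook's actual run; `toy_reviews`, `toy_styles` and the step names are assumptions introduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for beer.review / beer.style (hypothetical data, not from beer.csv)
toy_reviews = [
    "piney citrus hops bitter finish",
    "resinous hops grapefruit bitter",
    "roasted coffee chocolate dark creamy",
    "dark roasted malt coffee notes",
]
toy_styles = ["IPA", "IPA", "Stout", "Stout"]

baseline = Pipeline([
    ("counts", CountVectorizer()),   # vectorize: word counts per review
    ("tfidf", TfidfTransformer()),   # diminish the weight of common words
    ("nb", MultinomialNB()),         # Naive Bayes classifier
])
baseline.fit(toy_reviews, toy_styles)

toy_pred = baseline.predict(["bitter grapefruit hops", "creamy chocolate coffee"])
print(toy_pred)
```

Wrapping the steps in a pipeline also means the vectorizer is fitted only on whatever data is passed to `fit`, which keeps test reviews out of the vocabulary when the pipeline is fitted on the training split.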

In [23]:
# IMPORT MODULES AND THE DATA SET
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split 
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('beer.csv', header=0)
df_copy = df.copy()  # keep an independent copy of the original dataframe for reference
print('length',len(df))
pd.set_option('max_colwidth', 220)
df.head(3)

length 80818


,name,brewery,style,rating,review
0,Big Rock Ale,Big Rock Brewery,Scottish Ale,3.9,"smell  soft hop aroma with significant malt scents. this one smells very creamy. taste  and creamy it is. the traditional irish flavors come out at the tongue. this is creamy, not like a cream ale, but close. the m..."
1,Flip Ale,Dogfish Head Craft Brewery,Old Ale,4.08,on tap at dfh rehoboth... collab with eatily... cardamom and red wine must. golden orange. .no head. typical dfh yeast aroma. ..some spice and maybe a belgian influence. sweet spicy and somewhat fruity.. not much ol...
2,The Almond Marzen Project - Beer Camp #26,Sierra Nevada Brewing Co.,Märzen / Oktoberfest,3.78,"nice auburn impressions, tons of clarity, solid inch of off white head. aroma was a little bit sweet and nutty. taste gave a little more sweetness, stayed away from hops and bitterness, relatively light bodied. no..."


In [24]:
df.shape

(80818, 5)

In [25]:
# DATA PREP
print('df original length',len(df))
# keep only reviews longer than 20 characters; .copy() gives an independent dataframe,
# so the column assignments below do not trigger SettingWithCopyWarning
df = df[df['review'].map(len) > 20].copy()
print('length without short reviews',len(df))

# reset dataframe index for the shortened dataframe
df['index'] = np.arange(len(df))
df = df.set_index('index')

# build the stopword set and the stemmer once, not on every call
stops = set(stopwords.words('english'))
porter = PorterStemmer()

# Change review to a string of words: remove non-letters, make lower case, split into words.
# Remove stopwords (common words). Join back together into one long string of words.
def review_to_words(review, stem=False):
    letters_only = re.sub('[^a-zA-Z]', ' ', review)
    words = letters_only.lower().split()
    good_words = [w for w in words if w not in stops]
    if stem:
        # optional: stemming is off by default (see the accuracy notes in the Naive Bayes cell)
        good_words = [porter.stem(word) for word in good_words]
    return ' '.join(good_words)

# clean the reviews
df['clean_review'] = df['review'].apply(review_to_words)

df.head(3)

df original length 80818
length without short reviews 49141


index,name,brewery,style,rating,review,clean_review
0,Big Rock Ale,Big Rock Brewery,Scottish Ale,3.9,"smell  soft hop aroma with significant malt scents. this one smells very creamy. taste  and creamy it is. the traditional irish flavors come out at the tongue. this is creamy, not like a cream ale, but close. the m...",smell soft hop aroma significant malt scents one smells creamy taste creamy traditional irish flavors come tongue creamy like cream ale close malt big buttery smooth hops unique sharp hop flavor easy saturated well m...
1,Flip Ale,Dogfish Head Craft Brewery,Old Ale,4.08,on tap at dfh rehoboth... collab with eatily... cardamom and red wine must. golden orange. .no head. typical dfh yeast aroma. ..some spice and maybe a belgian influence. sweet spicy and somewhat fruity.. not much ol...,tap dfh rehoboth collab eatily cardamom red wine must golden orange head typical dfh yeast aroma spice maybe belgian influence sweet spicy somewhat fruity much old ale characteristic light still tasty cardamom add ni...
2,The Almond Marzen Project - Beer Camp #26,Sierra Nevada Brewing Co.,Märzen / Oktoberfest,3.78,"nice auburn impressions, tons of clarity, solid inch of off white head. aroma was a little bit sweet and nutty. taste gave a little more sweetness, stayed away from hops and bitterness, relatively light bodied. no...",nice auburn impressions tons clarity solid inch white head aroma little bit sweet nutty taste gave little sweetness stayed away hops bitterness relatively light bodied nothing almond came obvious kind fancied oktober...


In [30]:
# ADDITIONAL FEATURE ENGINEERING
# review length
df['review_length'] = df['review'].apply(len)

# average word length
def avg_word_len(words):
    separate_words = words.split()
    count_words = (len(separate_words))    # number of words
    if count_words> 0:
        characters = len(words)  # length of text
        avg = (characters - count_words+1)/count_words
    else:
        avg = 5.65  # this is the mean of 49000 reviews    
    return avg   

df['avg_word_length'] = df['clean_review'].apply(avg_word_len)
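
The formula in `avg_word_len` subtracts the spaces from the total length: a string of n words joined by single spaces contains n - 1 spaces. A quick check of that reasoning, as a sketch; `mean_word_length` and `sample` are hypothetical names introduced for illustration.

```python
def mean_word_length(text, default=0.0):
    """Average number of letters per word in a whitespace-separated string."""
    words = text.split()
    if not words:
        return default
    return sum(len(w) for w in words) / len(words)

sample = "soft hop aroma"

# Direct computation: add up the word lengths and divide by the word count
direct = mean_word_length(sample)

# The notebook's formula: total characters minus the (n - 1) single spaces
n = len(sample.split())
via_formula = (len(sample) - n + 1) / n

print(direct, via_formula)
```

The two agree only when words are separated by exactly one space, which holds for `clean_review` because it is built with `' '.join(...)`.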

In [31]:
# pickle the clean data:
import pickle
filename = 'BeerReviews'
with open(filename, 'wb') as outfile:
    pickle.dump(df, outfile)

In [44]:
# retrieve the pickled data:
filename = 'BeerReviews'
with open(filename, 'rb') as infile:
    df = pickle.load(infile)

In [45]:
df.shape

(49141, 8)

## MACHINE LEARNING 
### PREDICT STYLE FROM REVIEW
The most naive model would always predict the most-reviewed style, IPA, and would be correct 13% of the time. I'll use the Naive Bayes algorithm to improve on that and treat the result as a baseline. Then I'll make various changes to improve the model.
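
The accuracy of the "always guess the most common style" model is simply the share of the largest class. A minimal sketch on made-up labels; `toy_labels` stands in for `df['style']` and is not the real data.

```python
import pandas as pd

# Hypothetical labels standing in for df['style']
toy_labels = pd.Series(["IPA", "IPA", "IPA", "Stout", "Porter", "Stout", "Tripel", "IPA"])

shares = toy_labels.value_counts(normalize=True)  # fraction of reviews per style, largest first
majority_style = shares.index[0]                  # the most-reviewed style
majority_accuracy = shares.iloc[0]                # accuracy of always predicting it

print(majority_style, round(majority_accuracy, 3))
```

Any trained model should be compared against this number before its accuracy is read as progress.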

## 1 all reviews

In [14]:
# VECTORIZE THE REVIEWS  1.4 minutes
from sklearn.preprocessing import Normalizer

X = df['clean_review'].values
y = df['style'].values

# vectorize all reviews (before the split): fit and transform into unigram + bigram counts
vectorizer = CountVectorizer(analyzer='word', min_df=3, ngram_range = (1,2))
#vectorizer = TfidfVectorizer(analyzer='word', min_df=2, ngram_range = (1,2))
X_counts = vectorizer.fit_transform(X)
# reweight the counts with TF-IDF to diminish the importance of common words
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)
# TfidfTransformer already L2-normalizes each row by default, so this step leaves the values unchanged
scaler = Normalizer()
X_scaled = scaler.fit_transform(X_tfidf)

# split into train and test data
X_train,X_test,y_train,y_test = train_test_split(X_scaled,y, test_size=0.3, random_state=22)

In [15]:
X_scaled.shape

(49141, 482524)
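
The 482,524 columns come from counting both single words and word pairs. How `ngram_range` and `min_df` drive the vocabulary size can be seen on a tiny example; `toy_docs` is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "sweet malt aroma",
    "sweet malt finish",
    "bitter hop aroma",
]

# Single words only
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(toy_docs)
# Single words plus adjacent word pairs
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(toy_docs)
# Same, but keep only terms that occur in at least 2 documents
uni_bi_pruned = CountVectorizer(ngram_range=(1, 2), min_df=2).fit(toy_docs)

n_uni = len(unigrams.vocabulary_)
n_uni_bi = len(uni_bi.vocabulary_)
n_pruned = len(uni_bi_pruned.vocabulary_)
print(n_uni, n_uni_bi, n_pruned)
```

Adding bigrams grows the vocabulary quickly, while raising `min_df` prunes terms that appear in too few reviews to be useful.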

In [21]:
# pickle the vectorized data (fitted vectorizer, feature matrix and labels):
import pickle
filename = 'AllBeerVectors'
outfile = open(filename,'wb')
pickle.dump((vectorizer, X_scaled, y), outfile)
outfile.close()

In [22]:
# retrieve the pickled data:
filename = 'AllBeerVectors'
infile = open(filename,'rb')
vectorizer, X_scaled, y = pickle.load(infile)
infile.close()
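
For the sparse feature matrix on its own, SciPy's `save_npz` / `load_npz` are a possible alternative to pickle. The sketch below assumes a CSR matrix like the one the vectorizer returns; the file name and toy documents are hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["sweet malt aroma", "bitter hop aroma"]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)   # sparse TF-IDF matrix

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vectors.npz")      # hypothetical file name
    sparse.save_npz(path, toy_matrix)                    # write the sparse matrix
    restored = sparse.load_npz(path)                     # read it back

# Compare as dense arrays (fine for a toy matrix, too large for the real one)
same = np.allclose(restored.toarray(), toy_matrix.toarray())
print(restored.shape, same)
```

This stores only the matrix; the fitted vectorizer and the labels would still need to be saved separately.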

In [19]:
# NAIVE BAYES PREDICTOR
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_curve

clf = MultinomialNB(alpha = 0.001)
# accuracy log: first pass .1195; after word cleaning .2076; after combining styles .2639
# ngrams (1,1): .2639;  ngrams (1,2): .2117;  ngrams (1,3): .20677
# after adding stemmer with ngrams (1,2): .1617
# after changing alpha to 0.001: 0.53
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
print("accuracy score: ", np.mean(predicted == y_test))

print(classification_report(y_test, predicted))

accuracy score:  0.498541680798




                                     precision    recall  f1-score   support

                            Altbier       0.63      0.36      0.46        75
             American Adjunct Lager       0.37      0.56      0.45       107
           American Amber / Red Ale       0.39      0.45      0.42       438
         American Amber / Red Lager       0.60      0.24      0.34        63
                American Barleywine       0.45      0.48      0.47       160
                 American Black Ale       0.68      0.59      0.63       172
                American Blonde Ale       0.50      0.40      0.45       216
                 American Brown Ale       0.52      0.35      0.42       267
            American Dark Wheat Ale       0.43      0.19      0.26        16
     American Double / Imperial IPA       0.61      0.54      0.57       795
 American Double / Imperial Pilsner       0.56      0.24      0.33        42
   American Double / Imperial Stout       0.44      0.72      0.54       48
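
The smoothing parameter `alpha` was tuned by hand in the cell above. A cross-validated search is one way to make that choice repeatable. This is a sketch on toy data, assuming `GridSearchCV` over a small hypothetical grid; it is not the procedure used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up reviews: 6 per class so that 3-fold stratified CV has 2 of each per fold
toy_reviews = [
    "piney citrus hops", "bitter grapefruit hops", "resinous hops bitter",
    "citrus bitter finish", "hoppy piney bitter", "grapefruit citrus hops",
    "roasted coffee chocolate", "dark roasted malt", "creamy chocolate dark",
    "coffee dark creamy", "roasted malt coffee", "chocolate creamy roasted",
]
toy_styles = ["IPA"] * 6 + ["Stout"] * 6

alpha_grid = [1.0, 0.1, 0.01, 0.001]   # hypothetical candidate values

search = GridSearchCV(
    Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())]),
    param_grid={"nb__alpha": alpha_grid},
    cv=3,
    scoring="accuracy",
)
search.fit(toy_reviews, toy_styles)

best_alpha = search.best_params_["nb__alpha"]
print(best_alpha, search.best_score_)
```

Because the search scores each `alpha` on held-out folds of the training data, the test split stays untouched until the final evaluation.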

## 2 half the reviews
#### The number of beer styles is large.  Let's simplify the list:

In [47]:
print('length',len(df))
styles = df.groupby(['style']).size() 
print('Number of styles used:', len(styles))
print('')
print(styles.index)

length 49141
Number of styles used: 104

Index(['Altbier', 'American Adjunct Lager', 'American Amber / Red Ale',
       'American Amber / Red Lager', 'American Barleywine',
       'American Black Ale', 'American Blonde Ale', 'American Brown Ale',
       'American Dark Wheat Ale', 'American Double / Imperial IPA',
       ...
       'Scotch Ale / Wee Heavy', 'Scottish Ale',
       'Scottish Gruit / Ancient Herbed Ale', 'Smoked Beer', 'Tripel',
       'Vienna Lager', 'Weizenbock', 'Wheatwine', 'Winter Warmer', 'Witbier'],
      dtype='object', name='style', length=104)


In [53]:
# COMBINE SIMILAR STYLES OF BEER, and eliminate the least common styles

slim_df = df  # NOTE: plain assignment does not copy; the in-place replacements below also modify df
slim_df['style'].replace(['Saison / Farmhouse Ale','Bière de Garde'], 'Farm Ale', inplace=True)
ipa_list = ['American IPA','English India Pale Ale (IPA)','Belgian IPA']
slim_df['style'].replace(ipa_list, 'IPA', inplace=True)
slim_df['style'].replace('Scotch Ale / Wee Heavy', 'Scottish Ale', inplace=True)
pale_list = ['American Pale Ale (APA)','English Pale Ale','Belgian Pale Ale']
slim_df['style'].replace(pale_list, 'Pale Ale', inplace=True)
brown_list = ['American Brown Ale','English Brown Ale','English Dark Mild Ale']
slim_df['style'].replace(brown_list, 'Brown Ale', inplace=True)
stout_list = ['American Stout','English Stout','Milk / Sweet Stout','Oatmeal Stout',]
slim_df['style'].replace(stout_list, 'Stout', inplace=True)
slim_df['style'].replace('American Double / Imperial Stout', 'Imperial Stout', inplace=True)
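# NOTE: the label in the data is 'Russian Imperial Stout' (see the list in section 3), so
# 'Russian Imperial' matches nothing; the count of 84 styles below reflects this line having no effect
slim_df['style'].replace('Russian Imperial', 'Imperial Stout', inplace=True)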
porter_list = ['American Porter','Baltic Porter','English Porter']
slim_df['style'].replace(porter_list, 'Porter', inplace=True)
lager_list = ['American Amber / Red Lager','Vienna Lager','German Pilsener','Munich Helles Lager']
slim_df['style'].replace(lager_list, 'Lager', inplace=True)
american_lager_list = ['American Adjunct Lager','American Pale Lager']
slim_df['style'].replace(american_lager_list, 'American Lager', inplace=True)
slim_df['style'].replace('American Barleywine', 'Barleywine', inplace=True)
slim_df['style'].replace('English Barleywine', 'Barleywine', inplace=True)
slim_df['style'].replace('English Bitter', 'Bitter', inplace=True)
slim_df['style'].replace('Extra Special / Strong Bitter (ESB)', 'Bitter', inplace=True)
slim_df['style'].replace(['American Pale Wheat Ale','Witbier'], 'Wheat', inplace=True)

styles = slim_df.groupby(['style']).size() 
print('Number of styles after combining:', len(styles))

# remove uncommon styles (in EDA, I found some styles with fewer than 200 reviews,
# such as 'Eisbock', 'Faro', 'Gueuze', 'Happoshu').

labels = slim_df.groupby(['style']).size() 
uncommon = labels[labels<200]
slim_df = slim_df.loc[~slim_df['style'].isin(uncommon.index)]
styles = slim_df.groupby(['style']).size() 
print('Number of styles after removing uncommon:', len(styles))
print('New length',len(slim_df))

Number of styles after combining: 84
Number of styles after removing uncommon: 51
New length 46321


So the number of reviews went down slightly, from 49141 to 46321. The number of styles dropped by about half, from 104 to 51.
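
The same combine-then-filter step can be written with a single mapping and an explicit copy, so the original dataframe keeps its labels. A sketch on toy data: `style_map` is a partial, illustrative mapping, and the threshold of 2 stands in for the 200 used above.

```python
import pandas as pd

toy = pd.DataFrame({"style": ["American IPA", "Belgian IPA", "American Stout",
                              "Oatmeal Stout", "American IPA", "Faro"]})

# Partial, illustrative mapping; the full lists are in the cell above
style_map = {
    "American IPA": "IPA",
    "Belgian IPA": "IPA",
    "American Stout": "Stout",
    "Oatmeal Stout": "Stout",
}

slim = toy.copy()                                  # independent copy; toy is left untouched
slim["style"] = slim["style"].replace(style_map)   # styles not in the mapping pass through

counts = slim["style"].value_counts()
common = counts[counts >= 2].index                 # toy threshold (the notebook uses 200)
slim = slim[slim["style"].isin(common)].reset_index(drop=True)

print(slim["style"].value_counts())
```

Assigning the result of `replace` back to the column avoids `inplace=True` on a column selection, and `.copy()` keeps the unmodified labels available for later experiments.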

In [54]:
# VECTORIZE THE REVIEWS  1.4 minutes
from sklearn.preprocessing import Normalizer

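# NOTE: this cell reads df, not slim_df. Because slim_df was created by plain assignment,
# df has the combined style names but still includes the uncommon styles
# (e.g. 'American Dark Wheat Ale' in the report below). Use slim_df to train on the filtered set.
X = df['clean_review'].values
y = df['style'].values

# vectorize all reviews (before the split): fit and transform into unigram + bigram counts
vectorizer = CountVectorizer(analyzer='word', min_df=3, ngram_range = (1,2))
#vectorizer = TfidfVectorizer(analyzer='word', min_df=2, ngram_range = (1,2))
X_counts = vectorizer.fit_transform(X)
# reweight the counts with TF-IDF to diminish the importance of common words
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)
# TfidfTransformer already L2-normalizes each row by default, so this step leaves the values unchanged
scaler = Normalizer()
X_scaled = scaler.fit_transform(X_tfidf)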

# split into train and test data
X_train,X_test,y_train,y_test = train_test_split(X_scaled,y, test_size=0.3, random_state=22)

In [55]:
# NAIVE BAYES PREDICTOR
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_curve

clf = MultinomialNB(alpha = 0.001)
# accuracy log: first pass .1195; after word cleaning .2076; after combining styles .2639
# ngrams (1,1): .2639;  ngrams (1,2): .2117;  ngrams (1,3): .20677
# after adding stemmer with ngrams (1,2): .1617
# after changing alpha to 0.001: 0.53
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
print("accuracy score: ", np.mean(predicted == y_test))

print(classification_report(y_test, predicted))

accuracy score:  0.534558773655




                                     precision    recall  f1-score   support

                            Altbier       0.71      0.33      0.45        75
           American Amber / Red Ale       0.49      0.36      0.41       438
                 American Black Ale       0.72      0.56      0.63       172
                American Blonde Ale       0.59      0.28      0.38       216
            American Dark Wheat Ale       0.50      0.19      0.27        16
     American Double / Imperial IPA       0.67      0.49      0.57       795
 American Double / Imperial Pilsner       0.62      0.24      0.34        42
                     American Lager       0.44      0.47      0.46       241
               American Malt Liquor       0.67      0.75      0.71        24
                American Strong Ale       0.45      0.27      0.34       139
                  American Wild Ale       0.60      0.69      0.64       539
                         Barleywine       0.51      0.67      0.58       23

## 3 IPA, Stout, other
#### reduce to 3 styles: IPA, Stout, other

In [59]:
# REDUCE THE STYLES TO THREE LABELS: IPA, Stout, Other

three_styles = df  # NOTE: plain assignment does not copy; the in-place replacements below also modify df
ipa_list = ['American IPA','English India Pale Ale (IPA)','American Double / Imperial IPA',
           'Belgian IPA',]
three_styles['style'].replace(ipa_list, 'IPA', inplace=True)
stout_list = ['American Stout','English Stout','Milk / Sweet Stout','Oatmeal Stout',
             'Imperial Stout','American Double / Imperial Stout', ]
three_styles['style'].replace(stout_list, 'Stout', inplace=True)
other_list = ['Altbier', 'American Adjunct Lager', 'American Amber / Red Ale',
       'American Amber / Red Lager', 'American Barleywine',
       'American Black Ale', 'American Blonde Ale', 'American Brown Ale',
        'American Double / Imperial Pilsner',
       'American Pale Ale (APA)', 'American Pale Lager',
       'American Pale Wheat Ale', 'American Porter', 'American Stout',
       'American Strong Ale', 'American Wild Ale', 'Baltic Porter',
       'Belgian Dark Ale', 'Belgian Pale Ale',
       'Belgian Strong Dark Ale', 'Belgian Strong Pale Ale',
       'Berliner Weissbier', 'Bière de Garde', 'Bock',
       'California Common / Steam Beer', 'Chile Beer', 'Cream Ale',
       'Czech Pilsener', 'Doppelbock', 'Dortmunder / Export Lager', 'Dubbel',
       'Dunkelweizen', 'English Barleywine', 'English Bitter',
       'English Brown Ale', 'English Dark Mild Ale',
        'English Pale Ale', 'English Porter',
        'English Strong Ale', 'Euro Dark Lager',
       'Euro Pale Lager', 'Extra Special / Strong Bitter (ESB)',
       'Flanders Oud Bruin', 'Flanders Red Ale', 'Foreign / Export Stout',
       'Fruit / Vegetable Beer', 'German Pilsener', 'Gose', 'Hefeweizen',
       'Herbed / Spiced Beer', 'Irish Dry Stout', 'Irish Red Ale',
       'Kellerbier / Zwickelbier', 'Kölsch', 'Lambic - Fruit', 'Light Lager',
       'Maibock / Helles Bock', 'Milk / Sweet Stout', 'Munich Dunkel Lager',
       'Munich Helles Lager', 'Märzen / Oktoberfest',
       'Old Ale', 'Pumpkin Ale', 'Quadrupel (Quad)', 'Rauchbier',
       'Russian Imperial Stout', 'Rye Beer', 'Saison / Farmhouse Ale',
       'Schwarzbier', 'Scotch Ale / Wee Heavy', 'Scottish Ale', 'Smoked Beer',
       'Tripel', 'Vienna Lager', 'Weizenbock', 'Wheatwine', 'Winter Warmer',
       'Witbier','American Dark Wheat Ale', 'American Malt Liquor',
       'Bière de Champagne / Bière Brut', 'Black & Tan', 'Braggot', 'Eisbock',
       'English Pale Mild Ale', 'Euro Strong Lager', 'Faro', 'Gueuze',
       'Happoshu', 'Japanese Rice Lager', 'Kristalweizen', 'Kvass',
       'Lambic - Unblended', 'Low Alcohol Beer', 'Roggenbier', 'Sahti',
       'Scottish Gruit / Ancient Herbed Ale','American Lager','Barleywine','Bitter',
        'Brown Ale', 'Farm Ale','Lager','Pale Ale', 'Porter','Wheat']
three_styles['style'].replace(other_list, 'Other', inplace=True)

styles = three_styles.groupby(['style']).size() 
print('Number of styles after combining:', len(styles))

Number of styles after combining: 13


In [60]:
print(styles.index)

Index(['American Lager', 'Barleywine', 'Bitter', 'Brown Ale', 'Farm Ale',
       'IPA', 'Imperial Stout', 'Lager', 'Other', 'Pale Ale', 'Porter',
       'Stout', 'Wheat'],
      dtype='object', name='style')
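
An alternative that does not depend on listing every remaining label is to keep the labels of interest and send everything else to 'Other'. A minimal sketch on toy labels, assuming 'Imperial Stout' should count as 'Stout' as in `stout_list` above; `toy_styles` is made up.

```python
import pandas as pd

toy_styles = pd.Series(["IPA", "Stout", "Imperial Stout", "Porter", "Wheat", "IPA"])

# Fold stout variants into 'Stout' first
merged = toy_styles.replace(["Imperial Stout"], "Stout")

# Keep IPA and Stout; every other label becomes 'Other'
three_way = merged.where(merged.isin(["IPA", "Stout"]), "Other")

print(sorted(three_way.unique()))
```

With this form, a style that is missing from a hand-written list cannot slip through as its own class.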
