# Beers Pipeline
* Import and concatenate beer style dfs
* Remove `'Brewery_Num'` and `'Beer_Name'` as they are not needed for analysis
* Convert `'ABV'` and `'Beer_Score'` to `float` and `'Num_Beer_Ratings'` to `int`
### Add the following columns:
* Total Number of Beer Reviews
* Standard deviation of alcohol content
* Number of Beer_Style produced by a brewery
* Average Beer_Score by Brewery
* Average Alcohol content of beers
* Highest average score for a style
* Lowest average score for a style  
* Max/min/mean beer score by style category


* Imputing missing ABV Values (Not really necessary...)


* Highest/Lowest score for any beer for a brewery
* Number of beers within particular styles
* Number of beers within style groups


#### Later add
* Percentage of beers by beer style
* Average number of beers by style (only counting styles that have at least one beer)
* Average number of ratings by beer style (number of ratings doesn't seem to be a good indicator)
* Max number of ratings by beer style (number of ratings doesn't seem to be a good indicator)
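
As a rough sketch of the first item, assuming "percentage of beers by beer style" means each style's share of a brewery's beers, `pd.crosstab` with `normalize='index'` gives one row per brewery and one column per style. The toy frame is made-up illustration data and `style_share` is a hypothetical name.

```python
import pandas as pd

# Made-up sample; the real frame would be beers_df.
toy = pd.DataFrame({
    'Brewery_Name': ['A', 'A', 'A', 'B'],
    'Beer_Style': ['IPA', 'IPA', 'Stout', 'Porter'],
})

# Rows: breweries, columns: styles, values: share of that brewery's beers.
# normalize='index' makes every row sum to 1.
style_share = pd.crosstab(toy['Brewery_Name'], toy['Beer_Style'],
                          normalize='index')
print(style_share)
```

Multiplying the result by 100 would turn the shares into percentages.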

## Desired Columns:
* Create boolean for each beer whether notes exists.
    * Possibly convert to percentage of beers that have notes
    * Could also have total beers with notes (boolean probably better)
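
A minimal sketch of the notes idea, assuming the scrape were extended with a free-text column. `'Notes'`, `'Has_Notes'`, `Pct_With_Notes` and `Num_With_Notes` are all hypothetical names; no such column exists in the data used in this notebook.

```python
import numpy as np
import pandas as pd

# 'Notes' is a hypothetical column; the sample rows are made up.
toy = pd.DataFrame({
    'Brewery_Name': ['A', 'A', 'B'],
    'Notes': ['Hazy and citrusy', np.nan, ''],
})

# True only when a non-empty note exists for the beer.
toy['Has_Notes'] = toy['Notes'].fillna('').str.strip().ne('')

# The mean of a boolean column is the share of beers with notes;
# the sum is the number of beers with notes.
notes_by_brewery = toy.groupby('Brewery_Name')['Has_Notes'].agg(
    Pct_With_Notes='mean',
    Num_With_Notes='sum',
)
print(notes_by_brewery)
```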

In [1]:
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings; warnings.simplefilter('ignore')

In [2]:
os.chdir('Data2/')

# Beers

### Import and concatenate Beer Style Dataframes

The code below was used to open all files saved to the `Data2/` folder. Columns were renamed to be consistent with the convention used in `brews_df`.

``` python
data_fols = os.listdir()

style_nums = []

for fol in data_fols:
    try:
        int(fol)
        style_nums.append(str(fol))
    except ValueError:  # skip entries whose names are not style numbers
        pass

beers_df = pd.DataFrame(columns = ['beer_name','brewery_name','abv',
                             'ratings','score','brewery_nums'])

for num in style_nums:
    pickled = pd.read_pickle(num)
    beers_df = pd.concat([beers_df,pickled],sort=True)
    
beers_df.columns = ['ABV','Beer_Name','Brewery_Name',
                    'Brewery_Num','Num_Beer_Ratings',
                    'Beer_Score','Beer_Style']
    
pd.to_pickle(beers_df,'Beer_Data')

```

In [3]:
beers_df = pd.read_pickle('Beer_Data')

Drop the `'Brewery_Num'` column, since those values were only needed to feed the Brewery Scraper function and will not be used in our model. `'Beer_Name'` is dropped as well because it is not needed for analysis.

### Drop unnecessary columns

In [4]:
beers_df.drop(['Brewery_Num','Beer_Name'],inplace=True,axis=1)

Remove extraneous characters from `'ABV'` and `'Num_Beer_Ratings'` columns. Convert `'ABV'` and `'Beer_Score'` to `float` and `'Num_Beer_Ratings'` to `int`.

### Convert numeric columns to float and int

In [5]:
# Assign the result back; an in-place replace on a selected column may not update beers_df
beers_df['ABV'] = beers_df['ABV'].replace(' ? ', np.nan)
beers_df.Num_Beer_Ratings = beers_df.Num_Beer_Ratings.str.replace(',','')

beers_df.ABV = beers_df.ABV.apply(float)
beers_df.Beer_Score = beers_df.Beer_Score.apply(float)
beers_df.Num_Beer_Ratings = beers_df.Num_Beer_Ratings.apply(int)
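
For reference, the same cleanup can be sketched with `pd.to_numeric`, which turns unparseable entries into `NaN` instead of raising. The small frame below is made-up illustration data, not taken from the scrape, and `raw` and `clean` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up sample mimicking the raw scraped strings (illustrative only).
raw = pd.DataFrame({
    'ABV': ['5.2', ' ? ', '7.0'],
    'Num_Beer_Ratings': ['1,204', '37', '5'],
    'Beer_Score': ['4.1', '3.85', '3.9'],
})

clean = raw.copy()
# errors='coerce' maps anything non-numeric (such as ' ? ') to NaN.
clean['ABV'] = pd.to_numeric(clean['ABV'], errors='coerce')
clean['Beer_Score'] = pd.to_numeric(clean['Beer_Score'], errors='coerce')
# Strip thousands separators before converting the counts to int.
clean['Num_Beer_Ratings'] = (
    clean['Num_Beer_Ratings'].str.replace(',', '', regex=False).astype(int)
)
print(clean.dtypes)
```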

Convert/impute missing values within ABV to the mean of ABV for that Beer_Style.  
I decided to use the mean rather than the median because the distribution of alcohol percentages within a beer style is roughly normal, but values cluster around certain points, typically integers. Since the median could land on one of these integer peaks, I judged the mean to be a more accurate measure of central tendency.
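
The notebook does not include the imputation code itself. A minimal sketch of what style-mean imputation could look like, using made-up values and the hypothetical name `style_mean_abv`:

```python
import numpy as np
import pandas as pd

# Made-up sample; the real frame would be beers_df.
toy = pd.DataFrame({
    'Beer_Style': ['IPA', 'IPA', 'IPA', 'Stout', 'Stout'],
    'ABV': [6.0, 7.0, np.nan, 8.0, np.nan],
})

# Mean ABV of each style, broadcast back to the original row order.
style_mean_abv = toy.groupby('Beer_Style')['ABV'].transform('mean')
# Fill only the missing entries; observed values are left untouched.
toy['ABV'] = toy['ABV'].fillna(style_mean_abv)
print(toy)
```

Swapping `'mean'` for `'median'` in the `transform` call would give the median-based alternative discussed above.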

### Add Total Number of Beer Reviews

In [6]:
total_beer_ratings = beers_df.groupby('Brewery_Name').sum().Num_Beer_Ratings
total_beer_ratings = pd.DataFrame(total_beer_ratings)
total_beer_ratings.columns = ['Total_Beer_Ratings']

### Determine Standard Deviation of `'ABV'`

It is possible that offering beers across a wider range of ABV is something customers seek in a brewery, and thus leads to a higher score.

In [7]:
ABV_std = beers_df.groupby('Brewery_Name').std().ABV
ABV_std = pd.DataFrame(ABV_std)
ABV_std.columns = ['ABV_std']

### Number / Counts of Beer_Style produced by Brewery

In [8]:
num_styles_df = pd.DataFrame(beers_df.groupby(['Brewery_Name','Beer_Style']).size())
num_styles_df.reset_index(inplace=True)
num_styles = num_styles_df.groupby('Brewery_Name').size()
num_styles = pd.DataFrame(num_styles,columns=['Num_Styles'])
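
The two-step group-then-count above could probably be collapsed into a single `nunique` call. A sketch on made-up data, with `num_styles_alt` as a hypothetical name:

```python
import pandas as pd

# Made-up sample; the real frame would be beers_df.
toy = pd.DataFrame({
    'Brewery_Name': ['A', 'A', 'A', 'B'],
    'Beer_Style': ['IPA', 'IPA', 'Stout', 'Porter'],
})

# Count distinct styles per brewery in one step.
num_styles_alt = (
    toy.groupby('Brewery_Name')['Beer_Style']
       .nunique()
       .to_frame('Num_Styles')
)
print(num_styles_alt)
```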

### Average `'Beer_Score'` by Brewery

In [9]:
mean_beer_score = beers_df.groupby('Brewery_Name').mean().Beer_Score
mean_beer_score = pd.DataFrame(mean_beer_score)
mean_beer_score.columns = ['Mean_Beer_Score']

### Average ABV of Beers

In [10]:
mean_abv = beers_df.groupby('Brewery_Name').mean().ABV
mean_abv = pd.DataFrame(mean_abv)
mean_abv.columns = ['Mean_ABV']

### Max Number of Beer Ratings by Brewery

In [11]:
max_beer_ratings = beers_df.groupby('Brewery_Name').max().Num_Beer_Ratings
max_beer_ratings = pd.DataFrame(max_beer_ratings)
max_beer_ratings.columns = ['Max_Num_Beer_Ratings']

### Highest/Lowest Average Score for any Style for a Brewery

In [12]:
mean_beer_score_by_style_df = pd.DataFrame(beers_df.groupby(['Brewery_Name','Beer_Style']).mean())
mean_beer_score_by_style_df.reset_index(inplace=True)

max_of_mean_beer_score = mean_beer_score_by_style_df.groupby('Brewery_Name').max().Beer_Score
max_of_mean_beer_score = pd.DataFrame(max_of_mean_beer_score)
max_of_mean_beer_score.columns = ['Max_Mean_Beer_Score']

min_of_mean_beer_score = mean_beer_score_by_style_df.groupby('Brewery_Name').min().Beer_Score
min_of_mean_beer_score = pd.DataFrame(min_of_mean_beer_score)
min_of_mean_beer_score.columns = ['Min_Mean_Beer_Score']

### Highest/Lowest Score for any Beer for a Brewery

In [13]:
max_beer_score = beers_df.groupby(['Brewery_Name']).max().Beer_Score
max_beer_score = pd.DataFrame(max_beer_score)
max_beer_score.columns = ['Max_Beer_Score']

min_beer_score = beers_df.groupby(['Brewery_Name']).min().Beer_Score
min_beer_score = pd.DataFrame(min_beer_score)
min_beer_score.columns = ['Min_Beer_Score']
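
The brewery-level features built so far all group on `'Brewery_Name'`, so they could plausibly be computed in one pass with named aggregation (available in pandas 0.25 and later). This is a sketch on made-up data, not the notebook's method; `brewery_stats` is a hypothetical name.

```python
import pandas as pd

# Made-up sample; the real frame would be beers_df.
toy = pd.DataFrame({
    'Brewery_Name': ['A', 'A', 'B', 'B'],
    'ABV': [5.0, 7.0, 4.0, 6.0],
    'Beer_Score': [4.0, 3.0, 3.5, 4.5],
    'Num_Beer_Ratings': [10, 30, 5, 15],
})

# Each keyword is an output column: (source column, aggregation).
brewery_stats = toy.groupby('Brewery_Name').agg(
    Total_Beer_Ratings=('Num_Beer_Ratings', 'sum'),
    Max_Num_Beer_Ratings=('Num_Beer_Ratings', 'max'),
    ABV_std=('ABV', 'std'),
    Mean_ABV=('ABV', 'mean'),
    Mean_Beer_Score=('Beer_Score', 'mean'),
    Max_Beer_Score=('Beer_Score', 'max'),
    Min_Beer_Score=('Beer_Score', 'min'),
)
print(brewery_stats)
```

Selecting columns explicitly also avoids aggregating the non-numeric `'Beer_Style'` column.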

## Creating Beer_Style Categories

### Max/min/mean by Beer_Style Category

In [15]:
bocks_df = beers_df[beers_df.Beer_Style.str.contains('Bock')]

max_bock = bocks_df.groupby('Brewery_Name').max().Beer_Score
max_bock = pd.DataFrame(max_bock)
max_bock.columns = ['Max_Bock']

min_bock = bocks_df.groupby('Brewery_Name').min().Beer_Score
min_bock = pd.DataFrame(min_bock)
min_bock.columns = ['Min_Bock']

mean_bock = bocks_df.groupby('Brewery_Name').mean().Beer_Score
mean_bock = pd.DataFrame(mean_bock)
mean_bock.columns = ['Mean_Bock']

count_bock = bocks_df.groupby('Brewery_Name').size()
count_bock = pd.DataFrame(count_bock)
count_bock.columns = ['Count_Bock']

In [17]:
brown_ales_pattern = 'Brown|Dark Mild|Altbier'
brown_ales_df = beers_df[beers_df.Beer_Style.str.contains(brown_ales_pattern)]

max_brown_ales = brown_ales_df.groupby('Brewery_Name').max().Beer_Score
max_brown_ales = pd.DataFrame(max_brown_ales)
max_brown_ales.columns = ['Max_Brown_Ales']

min_brown_ales = brown_ales_df.groupby('Brewery_Name').min().Beer_Score
min_brown_ales = pd.DataFrame(min_brown_ales)
min_brown_ales.columns = ['Min_Brown_Ales']

mean_brown_ales = brown_ales_df.groupby('Brewery_Name').mean().Beer_Score
mean_brown_ales = pd.DataFrame(mean_brown_ales)
mean_brown_ales.columns = ['Mean_Brown_Ales']

count_brown_ales = brown_ales_df.groupby('Brewery_Name').size()
count_brown_ales = pd.DataFrame(count_brown_ales)
count_brown_ales.columns = ['Count_Brown_Ales']

In [23]:
dark_ales_pattern = 'Black|Dark Ale|Dubbel|Roggenbier|Scottish Ale|Winter'
dark_ales_df = beers_df[beers_df.Beer_Style.str.contains(dark_ales_pattern)]

max_dark_ales = dark_ales_df.groupby('Brewery_Name').max().Beer_Score
max_dark_ales = pd.DataFrame(max_dark_ales)
max_dark_ales.columns = ['Max_Dark_Ales']

min_dark_ales = dark_ales_df.groupby('Brewery_Name').min().Beer_Score
min_dark_ales = pd.DataFrame(min_dark_ales)
min_dark_ales.columns = ['Min_Dark_Ales']

mean_dark_ales = dark_ales_df.groupby('Brewery_Name').mean().Beer_Score
mean_dark_ales = pd.DataFrame(mean_dark_ales)
mean_dark_ales.columns = ['Mean_Dark_Ales']

count_dark_ales = dark_ales_df.groupby('Brewery_Name').size()
count_dark_ales = pd.DataFrame(count_dark_ales)
count_dark_ales.columns = ['Count_Dark_Ales']

In [25]:
dark_lagers_pattern = 'Red Lager|Dark Lager|Märzen|Rauch|Schwarz|Dunkel Lager|Vien'
dark_lagers_df = beers_df[beers_df.Beer_Style.str.contains(dark_lagers_pattern)]

max_dark_lagers = dark_lagers_df.groupby('Brewery_Name').max().Beer_Score
max_dark_lagers = pd.DataFrame(max_dark_lagers)
max_dark_lagers.columns = ['Max_Dark_Lagers']

min_dark_lagers = dark_lagers_df.groupby('Brewery_Name').min().Beer_Score
min_dark_lagers = pd.DataFrame(min_dark_lagers)
min_dark_lagers.columns = ['Min_Dark_Lagers']

mean_dark_lagers = dark_lagers_df.groupby('Brewery_Name').mean().Beer_Score
mean_dark_lagers = pd.DataFrame(mean_dark_lagers)
mean_dark_lagers.columns = ['Mean_Dark_Lagers']

count_dark_lagers = dark_lagers_df.groupby('Brewery_Name').size()
count_dark_lagers = pd.DataFrame(count_dark_lagers)
count_dark_lagers.columns = ['Count_Dark_Lagers']

In [26]:
hybrid_beers_pattern = 'Cream|Champ|Cali'
hybrid_df = beers_df[beers_df.Beer_Style.str.contains(hybrid_beers_pattern)]

max_hybrid = hybrid_df.groupby('Brewery_Name').max().Beer_Score
max_hybrid = pd.DataFrame(max_hybrid)
max_hybrid.columns = ['Max_Hybrid']

min_hybrid = hybrid_df.groupby('Brewery_Name').min().Beer_Score
min_hybrid = pd.DataFrame(min_hybrid)
min_hybrid.columns = ['Min_Hybrid']

mean_hybrid = hybrid_df.groupby('Brewery_Name').mean().Beer_Score
mean_hybrid = pd.DataFrame(mean_hybrid)
mean_hybrid.columns = ['Mean_Hybrid']

count_hybrid = hybrid_df.groupby('Brewery_Name').size()
count_hybrid = pd.DataFrame(count_hybrid)
count_hybrid.columns = ['Count_Hybrid']

In [27]:
ipa_df = beers_df[beers_df.Beer_Style.str.contains('IPA')]

max_ipa = ipa_df.groupby('Brewery_Name').max().Beer_Score
max_ipa = pd.DataFrame(max_ipa)
max_ipa.columns = ['Max_IPA']

min_ipa = ipa_df.groupby('Brewery_Name').min().Beer_Score
min_ipa = pd.DataFrame(min_ipa)
min_ipa.columns = ['Min_IPA']

mean_ipa = ipa_df.groupby('Brewery_Name').mean().Beer_Score
mean_ipa = pd.DataFrame(mean_ipa)
mean_ipa.columns = ['Mean_IPA']

count_ipa = ipa_df.groupby('Brewery_Name').size()
count_ipa = pd.DataFrame(count_ipa)
count_ipa.columns = ['Count_IPA']

Combined IPA and Pale Ale

In [29]:
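# Adjacent literals are joined as-is; a backslash continuation would embed the indentation in the regex
ipa_pale_ale_pattern = ('Red Ale|Blonde Ale|American Pale Ale|Belgian Pale|'
                        'Belgian Saison|Bitter|English Pale|French|Kölsch|IPA')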
ipa_pale_ale_df = beers_df[beers_df.Beer_Style.str.contains(ipa_pale_ale_pattern)]

max_ipa_pale_ale = ipa_pale_ale_df.groupby('Brewery_Name').max().Beer_Score
max_ipa_pale_ale = pd.DataFrame(max_ipa_pale_ale)
max_ipa_pale_ale.columns = ['Max_IPA_Pale_Ale']

min_ipa_pale_ale = ipa_pale_ale_df.groupby('Brewery_Name').min().Beer_Score
min_ipa_pale_ale = pd.DataFrame(min_ipa_pale_ale)
min_ipa_pale_ale.columns = ['Min_IPA_Pale_Ale']

mean_ipa_pale_ale = ipa_pale_ale_df.groupby('Brewery_Name').mean().Beer_Score
mean_ipa_pale_ale = pd.DataFrame(mean_ipa_pale_ale)
mean_ipa_pale_ale.columns = ['Mean_IPA_Pale_Ale']

count_ipa_pale_ale = ipa_pale_ale_df.groupby('Brewery_Name').size()
count_ipa_pale_ale = pd.DataFrame(count_ipa_pale_ale)
count_ipa_pale_ale.columns = ['Count_IPA_Pale_Ale']
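
One thing to watch for when a regex alternation is split across source lines: a backslash continuation placed inside the quotes keeps the next line's leading whitespace as part of the string, so the first alternative on the continued line only matches text that starts with those spaces. The short demonstration below uses hypothetical variable names.

```python
import re

# Backslash continuation inside the quotes: the next line's indentation
# becomes part of the string, and therefore part of the regex.
with_backslash = 'Belgian Pale|\
    Belgian Saison'

# Adjacent literals inside parentheses are joined with no extra whitespace.
with_parens = ('Belgian Pale|'
               'Belgian Saison')

print(repr(with_backslash))
print(repr(with_parens))
print(bool(re.search(with_backslash, 'Belgian Saison')))
print(bool(re.search(with_parens, 'Belgian Saison')))
```

The first search fails because the second alternative now begins with spaces, while the parenthesized version matches.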

In [30]:
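# IPA styles have their own group, so the stray trailing 'I' alternative
# (which matched any style containing a capital I) is dropped here
pale_ale_pattern = ('Red Ale|Blonde Ale|American Pale Ale|Belgian Pale|'
                    'Belgian Saison|Bitter|English Pale|French|Kölsch')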
pale_ale_df = beers_df[beers_df.Beer_Style.str.contains(pale_ale_pattern)]

max_pale_ale = pale_ale_df.groupby('Brewery_Name').max().Beer_Score
max_pale_ale = pd.DataFrame(max_pale_ale)
max_pale_ale.columns = ['Max_Pale_Ale']

min_pale_ale = pale_ale_df.groupby('Brewery_Name').min().Beer_Score
min_pale_ale = pd.DataFrame(min_pale_ale)
min_pale_ale.columns = ['Min_Pale_Ale']

mean_pale_ale = pale_ale_df.groupby('Brewery_Name').mean().Beer_Score
mean_pale_ale = pd.DataFrame(mean_pale_ale)
mean_pale_ale.columns = ['Mean_Pale_Ale']

count_pale_ale = pale_ale_df.groupby('Brewery_Name').size()
count_pale_ale = pd.DataFrame(count_pale_ale)
count_pale_ale.columns = ['Count_Pale_Ale']

In [31]:
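# Adjacent literals are joined as-is; a backslash continuation would embed the indentation in the regex
pilsner_pattern = ('Adj|Pils|American L|Malt|Dort|'
                   'Pale L|Strong L|Helles|Keller')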
pilsner_df = beers_df[beers_df.Beer_Style.str.contains(pilsner_pattern)]

max_pilsner = pilsner_df.groupby('Brewery_Name').max().Beer_Score
max_pilsner = pd.DataFrame(max_pilsner)
max_pilsner.columns = ['Max_Pilsner']

min_pilsner = pilsner_df.groupby('Brewery_Name').min().Beer_Score
min_pilsner = pd.DataFrame(min_pilsner)
min_pilsner.columns = ['Min_Pilsner']

mean_pilsner = pilsner_df.groupby('Brewery_Name').mean().Beer_Score
mean_pilsner = pd.DataFrame(mean_pilsner)
mean_pilsner.columns = ['Mean_Pilsner']

count_pilsner = pilsner_df.groupby('Brewery_Name').size()
count_pilsner = pd.DataFrame(count_pilsner)
count_pilsner.columns = ['Count_Pilsner']

In [32]:
porter_df = beers_df[beers_df.Beer_Style.str.contains('Porter')]

max_porter = porter_df.groupby('Brewery_Name').max().Beer_Score
max_porter = pd.DataFrame(max_porter)
max_porter.columns = ['Max_Porter']

min_porter = porter_df.groupby('Brewery_Name').min().Beer_Score
min_porter = pd.DataFrame(min_porter)
min_porter.columns = ['Min_Porter']

mean_porter = porter_df.groupby('Brewery_Name').mean().Beer_Score
mean_porter = pd.DataFrame(mean_porter)
mean_porter.columns = ['Mean_Porter']

count_porter = porter_df.groupby('Brewery_Name').size()
count_porter = pd.DataFrame(count_porter)
count_porter.columns = ['Count_Porter']

In [34]:
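# Adjacent literals are joined as-is; a backslash continuation would embed the indentation in the regex
specialty_pattern = ('Chile|Sahti|Field|Spice|Japan|Low|'
                     'Pumpkin|Kvass|Rye|Gruit|Smoke Beer')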
specialty_df = beers_df[beers_df.Beer_Style.str.contains(specialty_pattern)]

max_specialty = specialty_df.groupby('Brewery_Name').max().Beer_Score
max_specialty = pd.DataFrame(max_specialty)
max_specialty.columns = ['Max_Specialty']

min_specialty = specialty_df.groupby('Brewery_Name').min().Beer_Score
min_specialty = pd.DataFrame(min_specialty)
min_specialty.columns = ['Min_Specialty']

mean_specialty = specialty_df.groupby('Brewery_Name').mean().Beer_Score
mean_specialty = pd.DataFrame(mean_specialty)
mean_specialty.columns = ['Mean_Specialty']

count_specialty = specialty_df.groupby('Brewery_Name').size()
count_specialty = pd.DataFrame(count_specialty)
count_specialty.columns = ['Count_Specialty']

In [40]:
stout_df = beers_df[beers_df.Beer_Style.str.contains('Stout')]

max_stout = stout_df.groupby('Brewery_Name').max().Beer_Score
max_stout = pd.DataFrame(max_stout)
max_stout.columns = ['Max_Stout']

min_stout = stout_df.groupby('Brewery_Name').min().Beer_Score
min_stout = pd.DataFrame(min_stout)
min_stout.columns = ['Min_Stout']

mean_stout = stout_df.groupby('Brewery_Name').mean().Beer_Score
mean_stout = pd.DataFrame(mean_stout)
mean_stout.columns = ['Mean_Stout']

count_stout = stout_df.groupby('Brewery_Name').size()
count_stout = pd.DataFrame(count_stout)
count_stout.columns = ['Count_Stout']

In [35]:
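# Adjacent literals are joined as-is; a backslash continuation would embed the indentation in the regex
strong_pattern = ('Barley|Imperial Red|Strong Ale|Wheatwine|'
                  'Quad|Strong Dark|Strong Pale|Tripel|Old|Wee')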
strong_df = beers_df[beers_df.Beer_Style.str.contains(strong_pattern)]

max_strong = strong_df.groupby('Brewery_Name').max().Beer_Score
max_strong = pd.DataFrame(max_strong)
max_strong.columns = ['Max_Strong']

min_strong = strong_df.groupby('Brewery_Name').min().Beer_Score
min_strong = pd.DataFrame(min_strong)
min_strong.columns = ['Min_Strong']

mean_strong = strong_df.groupby('Brewery_Name').mean().Beer_Score
mean_strong = pd.DataFrame(mean_strong)
mean_strong.columns = ['Mean_Strong']

count_strong = strong_df.groupby('Brewery_Name').size()
count_strong = pd.DataFrame(count_strong)
count_strong.columns = ['Count_Strong']

In [36]:
wheat_pattern = 'Wheat Ale|Witbier|Weisse|Dunkelweizen|Hefe|Kristal'
wheat_df = beers_df[beers_df.Beer_Style.str.contains(wheat_pattern)]

max_wheat = wheat_df.groupby('Brewery_Name').max().Beer_Score
max_wheat = pd.DataFrame(max_wheat)
max_wheat.columns = ['Max_Wheat']

min_wheat = wheat_df.groupby('Brewery_Name').min().Beer_Score
min_wheat = pd.DataFrame(min_wheat)
min_wheat.columns = ['Min_Wheat']

mean_wheat = wheat_df.groupby('Brewery_Name').mean().Beer_Score
mean_wheat = pd.DataFrame(mean_wheat)
mean_wheat.columns = ['Mean_Wheat']

count_wheat = wheat_df.groupby('Brewery_Name').size()
count_wheat = pd.DataFrame(count_wheat)
count_wheat.columns = ['Count_Wheat']

In [37]:
wild_sour_pattern = 'Brett|Wild|Faro|Lambic|Gue|Flanders|Leip'
wild_sour_df = beers_df[beers_df.Beer_Style.str.contains(wild_sour_pattern)]

max_wild_sour = wild_sour_df.groupby('Brewery_Name').max().Beer_Score
max_wild_sour = pd.DataFrame(max_wild_sour)
max_wild_sour.columns = ['Max_Wild_Sour']

min_wild_sour = wild_sour_df.groupby('Brewery_Name').min().Beer_Score
min_wild_sour = pd.DataFrame(min_wild_sour)
min_wild_sour.columns = ['Min_Wild_Sour']

mean_wild_sour = wild_sour_df.groupby('Brewery_Name').mean().Beer_Score
mean_wild_sour = pd.DataFrame(mean_wild_sour)
mean_wild_sour.columns = ['Mean_Wild_Sour']

count_wild_sour = wild_sour_df.groupby('Brewery_Name').size()
count_wild_sour = pd.DataFrame(count_wild_sour)
count_wild_sour.columns = ['Count_Wild_Sour']
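
The category cells above all repeat one pattern: filter by a style regex, then take the max, min, mean and count of `'Beer_Score'` per brewery. A helper along the following lines could reduce the copy-and-paste that makes slips easy. `category_features` and its argument names are hypothetical, and the sample data is made up.

```python
import pandas as pd


def category_features(beers, pattern, label):
    """Max/min/mean score and beer count per brewery for one style group.

    Hypothetical helper, not part of the original notebook.
    """
    # Keep only beers whose style matches the (case-sensitive) regex.
    subset = beers[beers['Beer_Style'].str.contains(pattern)]
    grouped = subset.groupby('Brewery_Name')['Beer_Score']
    return pd.DataFrame({
        f'Max_{label}': grouped.max(),
        f'Min_{label}': grouped.min(),
        f'Mean_{label}': grouped.mean(),
        f'Count_{label}': grouped.size(),
    })


# Made-up sample; the real frame would be beers_df.
toy = pd.DataFrame({
    'Brewery_Name': ['A', 'A', 'B', 'B'],
    'Beer_Style': ['Bock', 'Bock', 'American IPA', 'Bock'],
    'Beer_Score': [4.0, 3.0, 4.5, 3.5],
})

bock_features = category_features(toy, 'Bock', 'Bock')
print(bock_features)
```

Calling it once per pattern, for example `category_features(beers_df, wheat_pattern, 'Wheat')`, would yield the same four columns for each style group.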

# Compiling Beer Data

In [41]:
derived_from_beers = [total_beer_ratings, ABV_std, num_styles, 
                      mean_beer_score, mean_abv, max_beer_ratings, 
                      max_of_mean_beer_score, min_of_mean_beer_score, 
                      max_beer_score, min_beer_score, count_ipa_pale_ale,
                      max_bock, min_bock, mean_bock, max_brown_ales, 
                      min_brown_ales, mean_brown_ales, max_dark_ales, 
                      min_dark_ales, mean_dark_ales, max_dark_lagers, 
                      min_dark_lagers, mean_dark_lagers, max_hybrid, 
                      min_hybrid, mean_hybrid, max_ipa, min_ipa, 
                      mean_ipa, max_pale_ale, min_pale_ale, 
                      mean_pale_ale, max_pilsner, min_pilsner, 
                      mean_pilsner, max_porter, min_porter, 
                      mean_porter, max_specialty, min_specialty, 
                      mean_specialty, max_stout, min_stout, 
                      mean_stout, max_strong, min_strong, mean_strong, 
                      max_wheat, min_wheat, mean_wheat, max_wild_sour, 
                      min_wild_sour, mean_wild_sour]





In [42]:
df = pd.concat(derived_from_beers,axis=1,join='outer')

In [45]:
df.fillna(0,inplace=True)
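
Because the frames are joined on the brewery index with `join='outer'`, a brewery with no beers in a category ends up with `NaN` in that category's columns, and `fillna(0)` then records a score of 0. The sketch below, on made-up data, shows the effect and one possible way to keep "no beers in this category" distinguishable; `Has_IPA` is a hypothetical column that the notebook does not create.

```python
import pandas as pd

# Made-up per-brewery features; brewery B brews no IPA.
mean_ipa_toy = pd.DataFrame({'Mean_IPA': [4.2]}, index=['A'])
mean_stout_toy = pd.DataFrame({'Mean_Stout': [3.9, 4.1]}, index=['A', 'B'])

# Outer join keeps every brewery; missing combinations become NaN.
combined = pd.concat([mean_ipa_toy, mean_stout_toy], axis=1, join='outer')

# After filling, B's missing IPA mean looks like a score of 0.
filled = combined.fillna(0)
# Hypothetical indicator recording whether the brewery had any IPA at all.
filled['Has_IPA'] = combined['Mean_IPA'].notna()
print(filled)
```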

# Saving Clean Beer Data

In [47]:
pd.to_pickle(df,'Clean_Beer_Data')