# Great American Beer Awards Analysis

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib as mpl
import plotly.express as px
import re
import warnings

%matplotlib inline

In [2]:
# load data
df = pd.read_csv('beer_awards.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,medal,beer_name,brewery,city,state,category,year
0,1,Gold,Volksbier Vienna,Wibby Brewing,Longmont,CO,American Amber Lager,2020
1,2,Silver,Oktoberfest,Founders Brewing Co.,Grand Rapids,MI,American Amber Lager,2020
2,3,Bronze,Amber Lager,Skipping Rock Beer Co.,Staunton,VA,American Amber Lager,2020
3,4,Gold,Lager at World's End,Epidemic Ales,Concord,CA,American Lager,2020
4,5,Silver,Seismic Tremor,Seismic Brewing Co.,Santa Rosa,CA,American Lager,2020


Potentially find other datasets online that may show which qualities lead to a gold-medal beer.

## Describe

Provide a brief descriptive analysis of `df`. What is its shape? How many categories are there? How many years are represented? How many breweries are represented?

In [4]:
# Shape
df.shape

(4970, 8)

In [5]:
df.brewery.unique()

array(['Wibby Brewing', 'Founders Brewing Co.', 'Skipping Rock Beer Co.',
       ..., 'Yakima Brewing', 'Val Blatz Brewery', 'Hibernia Brewing Co.'],
      dtype=object)
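
The remaining questions in this section (how many categories, years, and breweries) could be answered with `nunique()`. Below is a minimal sketch on a small sample copied from the `df.head()` output; `sample` and `summary` are illustrative names, and on the real data the same call would run on `df`.

```python
import pandas as pd

# Small sample copied from the df.head() output; it stands in for the full df.
sample = pd.DataFrame({
    'medal': ['Gold', 'Silver', 'Bronze', 'Gold', 'Silver'],
    'brewery': ['Wibby Brewing', 'Founders Brewing Co.', 'Skipping Rock Beer Co.',
                'Epidemic Ales', 'Seismic Brewing Co.'],
    'state': ['CO', 'MI', 'VA', 'CA', 'CA'],
    'category': ['American Amber Lager', 'American Amber Lager', 'American Amber Lager',
                 'American Lager', 'American Lager'],
    'year': [2020, 2020, 2020, 2020, 2020],
})

# Number of distinct values in each column of interest.
summary = sample[['category', 'year', 'brewery']].nunique()
print(summary)
```

Counting breweries by name alone is an approximation: spelling variants of one brewery would be counted separately, and two breweries sharing a name would be merged.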

In [18]:
# Most Popular Categories
df['category'] = df['category'].str.lower()
df['state'] = df['state'].str.upper()
df.groupby('category').size().reset_index(name = 'count').sort_values(by = 'count', ascending=False)[:60]

Unnamed: 0,category,count
176,classic irish-style dry stout,62
96,american-style pale ale,61
449,robust porter,61
158,bock,61
353,imperial stout,60
329,german-style pilsener,59
333,german-style wheat ale,56
66,american-style amber lager,55
167,brown porter,55
365,irish-style red ale,53


__Category Tidying Tasks__
- convert category strings to lowercase
- create a column for the [x]-style prefix
- create columns for each of the most popular categories of beer (porter, stout, ipa, lager, helles, etc.)
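
The second and third tasks could be sketched as follows. This is one possible approach, not necessarily the one used later in the notebook: `cats`, `origin`, and the short keyword list are illustrative names, and the sample categories are taken from the outputs in this notebook.

```python
import pandas as pd

# Category names as they appear (lowercased) in this notebook's outputs.
cats = pd.DataFrame({'category': [
    'classic irish-style dry stout',
    'american-style pale ale',
    'robust porter',
    'german-style pilsener',
    'vienna style lagers',
]})

# Word immediately before "-style" (or " style", as in older entries);
# NaN when the category has no such prefix.
cats['origin'] = cats['category'].str.extract(r'(\w+)[- ]style', expand=False)

# One boolean column per style keyword (hypothetical short list).
for keyword in ['stout', 'porter', 'pale ale', 'lager']:
    column = keyword.replace(' ', '_')
    cats[column] = cats['category'].str.contains(keyword, regex=False)

print(cats)
```

Capturing only the last word before "style" means "classic irish-style" yields `irish`; a multi-word origin would need a wider pattern.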

## Tidy Dataset

### Split up Category


In [19]:
styles = ['india pale ale', 'pale ale', 'stout', 'sour', 'wheat beer', 'hefeweizen',
         'witbier','dunkelweizen', 'gose', 'lager', 'pilsner', 'pilsener', 'helles',
          'kolsch', 'porter', 'bock', 'ale', 'fruit beer', 'brett beer', 'saison']
pattern = r'|'.join(styles)

def pattern_searcher(search_str: str, search_list: str) -> str:
    """Return the first substring of search_str that matches the regex search_list, or 'NA'."""
    search_obj = re.search(search_list, search_str)
    if search_obj:
        return_str = search_obj.group()
    else:
        return_str = 'NA'
    return return_str

# the leftmost style keyword found in each category becomes its style
df['style'] = df['category'].apply(lambda x: pattern_searcher(search_str=x, search_list=pattern))

df.groupby('style').size().reset_index(name = 'count').sort_values(by = 'count', ascending = False)

df

Unnamed: 0.1,Unnamed: 0,medal,beer_name,brewery,city,state,category,year,style,ipa
0,1,Gold,Volksbier Vienna,Wibby Brewing,Longmont,CO,american amber lager,2020,lager,False
1,2,Silver,Oktoberfest,Founders Brewing Co.,Grand Rapids,MI,american amber lager,2020,lager,False
2,3,Bronze,Amber Lager,Skipping Rock Beer Co.,Staunton,VA,american amber lager,2020,lager,False
3,4,Gold,Lager at World's End,Epidemic Ales,Concord,CA,american lager,2020,lager,False
4,5,Silver,Seismic Tremor,Seismic Brewing Co.,Santa Rosa,CA,american lager,2020,lager,False
...,...,...,...,...,...,...,...,...,...,...
4965,4966,Gold,Boulder Stout,Rockies Brewing Co.,Boulder,CO,stouts,1987,stout,False
4966,4967,Silver,Grant's Imperial Stout,Yakima Brewing,Sunnyside,WA,stouts,1987,stout,False
4967,4968,Silver,Schild Brau,Millstream Brewing Co.,Amana,IA,vienna style lagers,1987,lager,False
4968,4969,Gold,Edelweiss,Val Blatz Brewery,Milwaukee,WI,wheat beers,1987,wheat beer,False


## Heatmaps

Create a heatmap of the United States showing which states have won the most gold medals. Perhaps create a separate heatmap for each category.

Which cities and/or states have the most breweries?
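
A gold-medal map could start from a per-state count. The sketch below uses hypothetical sample rows in the shape of `df`; `medals`, `gold_by_state`, and `plot_gold_medals` are illustrative names. The plotting call is kept in a function so the aggregation can be checked on its own.

```python
import pandas as pd

# Hypothetical sample rows in the shape of df; the notebook would use df itself.
medals = pd.DataFrame({
    'medal': ['Gold', 'Silver', 'Bronze', 'Gold', 'Silver', 'Gold'],
    'state': ['CO', 'MI', 'VA', 'CA', 'CA', 'CO'],
})

# Count gold medals per state.
gold_by_state = (
    medals[medals['medal'] == 'Gold']
    .groupby('state')
    .size()
    .reset_index(name='gold_medals')
)

def plot_gold_medals(counts):
    """Build a US choropleth from a frame with 'state' and 'gold_medals' columns."""
    import plotly.express as px  # imported here so the aggregation runs without plotly
    return px.choropleth(counts,
                         locations='state',
                         color='gold_medals',
                         locationmode='USA-states',
                         scope='usa',
                         title='Gold medals by state')

# plot_gold_medals(gold_by_state).show()
```

For a per-category map, the same aggregation could be run after filtering the rows to a single category or style.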

In [20]:
df['year'].unique()

array([2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010,
       2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999,
       1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988,
       1987], dtype=int64)

In [28]:
# heatmap of breweries by state
## one row per (brewery, city, state); the count is that brewery's number of rows in df
brew = df.groupby(['brewery', 'city', 'state']).size().reset_index(name = 'total_submissions')
## number of distinct breweries per state
brew_count = brew.groupby(['state']).size().reset_index(name = 'total_breweries')

fig = px.choropleth(brew_count,
                    locations='state',
                    color = 'total_breweries',
                    color_continuous_scale=px.colors.sequential.Viridis,
                    locationmode='USA-states',
                    scope = 'usa',
                    title = 'Total breweries in attendance at the Great American Beer Awards since 1987')
fig.show()

In [None]:
# heatmap of breweries per capita / gold medals per brewery
## assign gold = 3, silver = 2, bronze = 1. points = gold + silver + bronze
## calculate a quality score for each state: points / breweries
df.groupby(['brewery', 'city', 'state']).size().reset_index(name = 'total_submissions')
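
The scoring described in the comments above could be sketched as follows. The rows are hypothetical and only mimic the shape of `df`; `awards`, `points`, and `state_score` are illustrative names, and the sketch assumes the `medal` column contains only Gold, Silver, and Bronze.

```python
import pandas as pd

# Hypothetical sample rows in the shape of df.
awards = pd.DataFrame({
    'medal':   ['Gold', 'Silver', 'Bronze', 'Gold', 'Silver', 'Gold'],
    'brewery': ['A', 'B', 'C', 'D', 'E', 'A'],
    'city':    ['Longmont', 'Grand Rapids', 'Staunton', 'Concord', 'Santa Rosa', 'Longmont'],
    'state':   ['CO', 'MI', 'VA', 'CA', 'CA', 'CO'],
})

# gold = 3, silver = 2, bronze = 1; any other label would map to NaN.
points = {'Gold': 3, 'Silver': 2, 'Bronze': 1}
awards['points'] = awards['medal'].map(points)

# Total points per state.
state_points = awards.groupby('state')['points'].sum()

# Distinct breweries per state, identified by (brewery, city, state).
state_breweries = (
    awards.drop_duplicates(['brewery', 'city', 'state'])
    .groupby('state')
    .size()
)

# Quality score: points / breweries.
state_score = pd.DataFrame({'points': state_points, 'breweries': state_breweries})
state_score['quality'] = state_score['points'] / state_score['breweries']
print(state_score)
```

Only breweries that appear in the data are counted, so this score rewards states whose listed breweries win repeatedly rather than states with many breweries overall.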

## Timeline

Over the years, which categories have risen the most in popularity? This is partially dependent on how medals are assigned.
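
One way to approach this is to count medals per style per year and compare each style's share of that year's medals. The sketch below uses hypothetical rows and assumes the `style` column from the tidy step; `timeline`, `by_year`, `share`, and `change` are illustrative names.

```python
import pandas as pd

# Hypothetical sample: one row per medal, with the 'style' column from the tidy step.
timeline = pd.DataFrame({
    'year':  [1987, 1987, 1987, 2020, 2020, 2020, 2020],
    'style': ['stout', 'stout', 'lager', 'lager', 'lager', 'lager', 'stout'],
})

# Medals per style per year: one row per year, one column per style.
by_year = timeline.groupby(['year', 'style']).size().unstack(fill_value=0)

# Each style's share of that year's medals, so years with more
# medals overall remain comparable.
share = by_year.div(by_year.sum(axis=1), axis=0)

# Change in share between the earliest and latest year, largest gain first.
change = (share.iloc[-1] - share.iloc[0]).sort_values(ascending=False)
print(change)
```

Because the data records medals rather than entries, a rising share here likely reflects more judged categories for a style, which may or may not track its popularity among brewers or drinkers.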