# Beer Score Data Exploration and Analysis
This notebook transforms data sourced from [https://data.world/socialmediadata/beeradvocate](https://data.world/socialmediadata/beeradvocate) as part of creating a machine learning model for predicting beer scores.  The dataset contains ~1.5 million reviews, spanning more than 10 years **(insert date range)**.  Each review includes ratings from [ratebeer](https://www.ratebeer.com/Story.asp?StoryID=103) across five "features": appearance, aroma, palate, taste, and overall impression.  A quick explanation of each rating follows; for a full explanation, refer to the website.


### Index
* [Data Extraction](#dataextract)

* [Data Transformation](#datatransform)
    * [Exploratory Analysis](#dataexplore)
    * [Visualising the Data](#dataviz)
        
* [Adding Features](#dataadd)









### Beer Ratings
**Appearance:** (rating out of 5)
  After pouring, the rating includes observations on the visual appeal, including the colour, clarity, carbonation, and head size and longevity, as well as the extent and pattern of lacing on the glass

**Aroma:** (rating out of 10)
  Any attractive, unusual or bad aromas. Hop character, malts, sweetness, fruitiness and other aromas, including more subtle aromas released after swirling the glass

**Palate:** (rating out of 5)
  The “feel” of the beer inside the mouth, at the front, the back and as you swallow, concentrating on the body or fullness of the beer and any other special feature of how it feels in the mouth

**Flavour:** (rating out of 10)
  How the beer tastes and the number of different tastes and flavours that can be identified.  Consideration is also given to variation in flavour from the start, to the middle, the finish and then the aftertaste of the beer. Assessment includes the intensity of the bitterness, sweetness and sourness of the beer

**Overall:** (rating out of 20)
  A way of balancing up other features of the beer or anything else liked or disliked about it, e.g. price, likelihood of buying again, etc.





## Data Extraction <a id="dataextract"></a>


In [1]:
# import dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline


In [None]:
# create dataframe
beer_df = pd.read_csv("https://query.data.world/s/nuub3qupegsd33g3nimifjpajqeq2o")

# for local drive use
#beer_df = pd.read_csv("data/beer.gz")


In [None]:
# drop all rows with any NaN values
beer_df = beer_df.dropna()


In [None]:
# remove duplicate rows (drop_duplicates returns a new dataframe, so reassign the result)
beer_df = beer_df.drop_duplicates(keep="first")
beer_df = beer_df.reset_index(drop = True)

# show first 10 rows
beer_df.head(10)
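
As a side note on the de-duplication step, `DataFrame.drop_duplicates` returns a new dataframe by default, so its result only takes effect once it is assigned. The following is a minimal sketch on made-up rows (the `toy_df` frame and its values are illustrative assumptions, not part of the dataset):

```python
import pandas as pd

# made-up rows for illustration only; the first two rows are exact duplicates
toy_df = pd.DataFrame({
    "beer_name": ["A", "A", "B"],
    "review_overall": [4.0, 4.0, 3.5],
})

# calling drop_duplicates without assigning the result leaves toy_df unchanged
toy_df.drop_duplicates(keep="first")
rows_unassigned = len(toy_df)

# assigning the result (and resetting the index) keeps the de-duplicated rows
deduped = toy_df.drop_duplicates(keep="first").reset_index(drop=True)

print(rows_unassigned, len(deduped))
```

Comparing the row count before and after the step on the real dataframe is a quick way to see how many duplicates were removed.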


In [None]:
# save to csv file for local use (to_csv returns None when a path is given, so no assignment is needed)
beer_df.to_csv("data/beer.gz", header = True, compression="gzip", index = False)


## Data Transformation <a id="datatransform"></a>

### Exploratory analysis  <a id="dataexplore"></a>

An initial inspection of the dataset is done for information on the number of rows and columns, the data types, and for confirmation that no null values remain.  A basic statistical analysis is then completed using the **DataFrame.describe()** method to show, for each feature, the:
- mean or average value
- standard deviation, which shows the spread of the data
- range of the data (min and max)
- 25%, 50% and 75% quartiles, showing skewness in the data and the existence of outliers

Note that, as the default **DataFrame.describe()** method does not include categorical values in the summary, an additional check is done using the parameter _(include=[object])_ to examine the string columns.  This was chosen over the _(include="all")_ parameter to increase the readability of both summary outputs.
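
The difference between the two summaries can be sketched on a small made-up dataframe. The column names below mirror the dataset, but `toy_df` and its values are assumptions for illustration; the string column is cast to `object` explicitly so that the example behaves the same on pandas versions that infer a dedicated string dtype.

```python
import pandas as pd

# made-up values; only the column names are taken from the dataset
toy_df = pd.DataFrame({
    "beer_style": ["IPA", "Stout", "IPA", "Lager"],
    "review_overall": [4.0, 3.5, 4.5, 3.0],
    "beer_abv": [6.5, 8.0, 7.0, 4.8],
})
# force object dtype so include=[object] selects this column on any pandas version
toy_df["beer_style"] = toy_df["beer_style"].astype(object)

# default summary: numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = toy_df.describe()

# object summary: string columns only (count, unique, top, freq)
string_summary = toy_df.describe(include=[object])

print(numeric_summary.columns.tolist())
print(string_summary)
```

Keeping the two summaries separate avoids the many NaN cells that `include="all"` produces where a statistic does not apply to a column type.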

In [None]:
# inspect number of rows and columns 
beer_df.shape


In [None]:
# check additional information about dataframe ie spread of data types, null values, total number of records
beer_df.info()


In [None]:
# convert epoch unix time as integer to timestamp for plotting
beer_df["review_time"] = pd.to_datetime(beer_df["review_time"], unit = "s")


In [None]:
# check data type
beer_df["review_time"].dtypes


In [None]:
# check endianness
np.dtype("datetime64[ns]") == np.dtype("<M8[ns]")
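
The epoch conversion above can be sanity-checked on a few known values, and the same approach gives the earliest and latest timestamps in a column. This is a minimal sketch; the `epochs` series is a made-up assumption, not data from the reviews.

```python
import pandas as pd

# made-up epoch values in seconds: the epoch itself, one day later, and 1e9 seconds
epochs = pd.Series([0, 86400, 1_000_000_000])

# unit="s" tells pandas the integers are seconds since 1970-01-01 00:00:00
converted = pd.to_datetime(epochs, unit="s")

# earliest and latest timestamps in the column
earliest, latest = converted.min(), converted.max()

print(earliest, latest)
```

Applied to `beer_df["review_time"]`, the same `min()` and `max()` calls would give the date range covered by the reviews.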


In [None]:
# inspect basic statistical summary details (all numeric fields)
beer_df.describe().T


Preliminary observations based on the above include:
- converting **review_time** to a timestamp excludes the column from the **DataFrame.describe()** method calculations.  This feature will need to be investigated independently
- **brewery_id** and **beer_beerid** are identifiers rather than measurements, so no statistical inference can be made from the related statistics
- the min values of zero for **review_overall** and **review_appearance** indicate beers in the dataset with no ratings.  These will need to be removed from the dataset
- with the exception of **review_overall** and **review_appearance**, the remaining review ratings (review_aroma, review_palate and review_taste) fall within the range of 1-5 as expected and will not require additional cleaning
- the 25%, 50%, and 75% percentiles for all review ratings are largely consistent.  Worth noting, however, is the complete absence of a 3rd quartile (75%) for **review_aroma**, **review_appearance** and **review_palate**.  In addition, with each of these features having a mean that is lower than the median, they are distinctly skewed to the left (negatively skewed), with ratings concentrated at the upper end of the scale.  Whilst not as extreme, this is also the case for the remaining review features, **review_overall** and **review_taste**.  This will be confirmed later with a number of visual checks
- the max value for **beer_abv** is 57.7, which seems extremely high for an alcohol content and requires further investigation
- **review_time** has been converted from an integer to a datetime format
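
The relationship between mean, median and skew noted above can be sketched numerically. The ratings below are made up to resemble scores clustered near the top of a 1-5 scale with a few low values; they are an assumption for illustration, not values from the dataset.

```python
import pandas as pd

# made-up ratings: most scores sit near the top of the scale, a few are low
ratings = pd.Series([1.0, 2.5, 3.5, 4.0, 4.0, 4.0, 4.5, 4.5, 4.5, 5.0])

mean_rating = ratings.mean()
median_rating = ratings.median()

# Series.skew() returns the sample skewness; a negative value means a longer left tail
skewness = ratings.skew()

print(mean_rating, median_rating, skewness)
```

When the low scores pull the mean below the median, the skewness comes out negative, which is the pattern described for the review features.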


In [None]:
# filter for review_overall greater than zero and confirm results
beer_df = beer_df[beer_df["review_overall"] > 0]


Filtering to keep only rows where the **review_overall** value is greater than zero has also removed the zero values for **review_appearance**.
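
One way to confirm this kind of result is to count the zero values that remain in each rating column after the filter. The sketch below uses a made-up `toy_df` (an assumption for illustration) in which the zero ratings occur on the same rows, as the statement above describes for the real data.

```python
import pandas as pd

# made-up rows: the zero ratings appear together on the first and last rows
toy_df = pd.DataFrame({
    "review_overall": [0.0, 4.0, 3.5, 0.0],
    "review_appearance": [0.0, 4.5, 3.0, 0.0],
})

# keep only rows with a positive overall rating
filtered = toy_df[toy_df["review_overall"] > 0]

# count remaining zeros per column; every count should be 0
remaining_zeros = (filtered[["review_overall", "review_appearance"]] == 0).sum()

print(remaining_zeros)
```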

In [None]:
# re-check data type after filtering
beer_df["review_time"].dtypes


In [None]:
np.dtype("datetime64[ns]") == np.dtype("<M8[ns]")
    

### Visualising the data  <a id="dataviz"></a>

In [None]:
# inspect distribution of dataframe numeric columns
beer_df.hist(figsize = (20,20));


In [None]:
# inspect abv value skewness using a scatterplot
abv_check = beer_df.groupby(["beer_abv", "beer_name"]).size().reset_index(name="counts")

plt.scatter(abv_check["beer_abv"], abv_check["counts"])
plt.annotate("long tail of high abv beers",(25,1000));
plt.show()


In [None]:
# slice dataframe to list beers with values >=20% abv
new_abv_check = abv_check.loc[abv_check["beer_abv"] >= 20].sort_values(
    "beer_abv", ascending = False)
new_abv_check


A validation of the abv for each individual beer (20 in total) confirmed that **beer_abv** values are correct and should remain in the dataset.

In [None]:
# visualise check of review scores to identify outliers
beer_df.boxplot(column=["review_overall","review_aroma", "review_appearance","review_palate", "review_taste"],
                figsize = (15,6), return_type="axes", notch = True, 
                flierprops=dict(marker='s', markersize = 7, markerfacecolor="b"));

# add title and ticks
plt.title("Boxplot of Ratings", fontsize=20)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12);


Interestingly, there are 2 distinct groups within the 5 ratings features, which was not immediately evident from the initial histograms.  Also of note is that the mean and the median in the group of **review_aroma**, **review_appearance** and **review_palate** are so close that there is no _'notching'_ at the median points in the plot.  This shows how heavily the data is concentrated at the upper limit of the ratings, with the absence of the 75% quartile.
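
The numbers behind a boxplot can also be computed directly. The sketch below uses made-up ratings (an assumption, not dataset values) and the conventional 1.5 × IQR rule for flagging outliers. It illustrates one possible reading of a box with no visible upper quartile, namely that the median and the 75% quartile coincide; whether that is what happens in the real data would need to be checked against the actual quantiles.

```python
import pandas as pd

# made-up ratings with many identical scores at 4.0
ratings = pd.Series([1.0, 3.0, 3.5, 4.0, 4.0, 4.0, 4.0, 4.0, 4.5, 5.0])

# quartiles as drawn by a boxplot: 25%, 50% (median) and 75%
q1, q2, q3 = ratings.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# conventional whisker limits: 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# points outside the fences are drawn as individual outlier markers
outliers = ratings[(ratings < lower_fence) | (ratings > upper_fence)]

print(q1, q2, q3, sorted(outliers.tolist()))
```

In this toy case the median and the 75% quartile are equal, so the upper half of the box collapses onto the median line.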

In [None]:
# resize dataframe to analyse ratings
columns = ["beer_style","review_overall","review_aroma","review_appearance","review_palate","review_taste"]
reviews_df = beer_df.loc[:,columns]
#reviews_df = reviews_df.set_index("beer_style")

# plot heatmap to show correlations between the ratings features
# (correlate numeric columns only; beer_style is a string column)
corr = reviews_df.select_dtypes(include="number").corr()
plt.figure(figsize=(6,5))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
            cmap="GnBu", center=0, annot=True)

# add title and ticks
plt.title("Heatmap of Ratings", fontsize=20)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()


In [None]:
reviews_df

In [None]:
# plot pairwise relationships between the ratings features
# (pairplot creates its own figure, so no plt.figure() call is needed)
sns.pairplot(reviews_df, kind="scatter")
plt.show()


In [None]:
# check for large number of same reviewers ie reviewer bias
total_reviews = beer_df["review_profilename"].count()
reviewers = beer_df.groupby("review_profilename").size().reset_index(name="counts")
reviewers = reviewers.sort_values(by = "counts", ascending = False)

weighting = round(reviewers["counts"]/total_reviews * 100,2)
weighted_reviewers = pd.concat([reviewers, weighting], axis = 1)
weighted_reviewers.columns = ["review_profilename","ratings","% total"]
weighted_reviewers.head(15)
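
An alternative sketch of the same reviewer weighting uses `Series.value_counts`, which returns counts already sorted in descending order and can return proportions directly with `normalize=True`. The `names` series below is made up for illustration.

```python
import pandas as pd

# made-up reviewer names standing in for the review_profilename column
names = pd.Series(["ann", "bob", "ann", "cat", "ann", "bob"], name="review_profilename")

# number of reviews per reviewer, sorted from most to least active
counts = names.value_counts()

# share of all reviews per reviewer, as a percentage rounded to 2 decimals
share = (names.value_counts(normalize=True) * 100).round(2)

# combine into one table; both series are indexed by reviewer name
summary = pd.DataFrame({"ratings": counts, "% total": share})

print(summary)
```

Because both series share the reviewer name as their index, the two columns line up without a separate concatenation step.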


In [None]:
# inspect summary details for the string (object) columns
beer_df.describe(include=[object])

In [None]:
# save to csv file for import (to_csv returns None when a path is given, so no assignment is needed)
beer_df.to_csv("data/beer.gz", header = True, compression="gzip", index = False)
