<a href="https://colab.research.google.com/github/lytvyniuk/iowa_liquor_sales-Exploratory-Data-Analysis-/blob/master/Test_task_iowa_liquor_sales.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
# please run this cell and follow the link to authenticate
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

Analysing this dataset can show the best time to sell, the most profitable vendors, the most popular drinks, and so on. This information can eventually help businesses understand how to act to increase revenue.

I start with a general analysis of the data and go into detail when I find an interesting observation.

**Top 10 vendors with the most bottles sold**

Let's find out which vendors are the biggest on the market in terms of bottles sold. There are a lot of them, so I show only the top 10 on the plot for readability.

In [0]:
%%bigquery --project protean-genius-271221 df
-- total bottles sold per vendor, top 10 only
SELECT vendor_number, SUM(bottles_sold) AS bottles_sold
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY vendor_number ORDER BY bottles_sold DESC LIMIT 10

In [0]:
%%bigquery --project protean-genius-271221 df_names
-- distinct (vendor_number, vendor_name) pairs; the same vendor number can appear with
-- slightly different spellings (upper/lower case etc.), so names are deduplicated after the merge
SELECT DISTINCT vendor_number, vendor_name
FROM `bigquery-public-data.iowa_liquor_sales.sales`

In [0]:
# attach vendor names and keep one name per vendor_number
df = pd.merge(df, df_names, on = "vendor_number", how='left').drop_duplicates(subset='vendor_number', keep="first")
df
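
To see why `drop_duplicates` is needed after the merge, here is a minimal sketch on made-up data. The vendor numbers, names and totals below are hypothetical and are not taken from the dataset; only the merge and deduplication calls mirror the cell above.

```python
import pandas as pd

# Hypothetical totals per vendor (same columns as df).
totals = pd.DataFrame({
    "vendor_number": ["101", "202"],
    "bottles_sold": [500, 300],
})

# Hypothetical name table (same columns as df_names):
# vendor 101 appears under two spellings.
names = pd.DataFrame({
    "vendor_number": ["101", "101", "202"],
    "vendor_name": ["ACME SPIRITS", "Acme Spirits", "Blue River Distilling"],
})

# The left merge produces one row per spelling, so vendor 101 appears twice.
merged = pd.merge(totals, names, on="vendor_number", how="left")

# Keeping the first row per vendor_number restores one row per vendor.
deduped = merged.drop_duplicates(subset="vendor_number", keep="first")
print(deduped)
```

Which spelling survives depends only on row order, so the displayed name is arbitrary among the variants; the bottle totals are not affected.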


In [0]:
sns.barplot(x = 'vendor_name', y = 'bottles_sold',  data = df, color="green")
plt.xticks(rotation=50, horizontalalignment='right')
plt.xlabel("Vendor name")
plt.ylabel("Bottles sold")
plt.title("Top 10 biggest vendors")
plt.show()




**Let's see how many bottles are sold on different days, months and years across all vendors.**

In [0]:
%%bigquery --project protean-genius-271221 df_dates
SELECT sum(bottles_sold) as bottles_sold, FORMAT_DATE('%a',date) AS weekday FROM `bigquery-public-data.iowa_liquor_sales.sales` group by weekday

In [0]:
df_dates.dtypes

In [0]:
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", 'Sun']
sns.barplot(x = 'weekday', y = 'bottles_sold',  data = df_dates, color="blue", order = order)
plt.xlabel('Day of the week')
plt.ylabel('Bottles sold')
plt.title("Number of bottles sold by day of the week")
plt.show()



An interesting observation: almost no liquor was sold on weekends, most likely because of legal restrictions. However, there is a small number of sales on Saturday. The number of bottles sold from Monday to Thursday is at roughly the same level, and it is surprisingly higher than on Friday.
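
One way to put numbers on this comparison is to express each weekday as a share of the total. The sketch below uses invented figures in a frame shaped like `df_dates` (columns `weekday` and `bottles_sold`); the values are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical weekday totals, NOT the real query result.
toy = pd.DataFrame({
    "weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    "bottles_sold": [250, 250, 250, 200, 45, 5, 0],
})

# Share of all bottles sold that falls on each weekday, in percent.
toy["share_pct"] = 100 * toy["bottles_sold"] / toy["bottles_sold"].sum()
print(toy)
```

Applied to the real `df_dates`, the same line would show how small the Saturday and Sunday shares actually are.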

In [0]:
%%bigquery --project protean-genius-271221 df_dates
SELECT bottles_sold, FORMAT_DATE('%a', date) AS weekday FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE FORMAT_DATE('%a', date) = 'Sat'  -- WHERE cannot reference the SELECT alias, so the expression is repeated

In [0]:
%%bigquery --project protean-genius-271221 df_dates
-- group by calendar month (all years combined)
SELECT sum(bottles_sold) as bottles_sold, FORMAT_DATE('%m',date) AS month FROM `bigquery-public-data.iowa_liquor_sales.sales` group by month order by month

In [0]:
df_dates

In [0]:
sns.barplot(x = 'month', y = 'bottles_sold',  data = df_dates, color="blue")
plt.xlabel('Month')
plt.ylabel('Bottles sold')
plt.title("Number of bottles sold by month")
plt.show()

The data shows that sales peak in October and December, which is likely related to the upcoming holidays (at least in December). In the other months sales are lower, with no significant differences between them.

In [0]:
%%bigquery --project protean-genius-271221 df_dates
-- group by year
SELECT sum(bottles_sold) as bottles_sold, FORMAT_DATE('%Y',date) AS year FROM `bigquery-public-data.iowa_liquor_sales.sales` group by year order by year

In [0]:
df_dates

In [0]:
sns.barplot(x = 'year', y = 'bottles_sold',  data = df_dates, color="blue")
plt.xlabel('Year')
plt.ylabel('Bottles sold')
plt.title("Number of bottles sold by year")
plt.show()

Sales increase every year (the data for 2020 is not complete yet).
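
To quantify the trend, the incomplete year can be dropped before computing year-over-year growth. This is a minimal sketch on invented numbers shaped like `df_dates` (a string `year` column, as produced by `FORMAT_DATE('%Y', date)`, and `bottles_sold`); the figures are hypothetical.

```python
import pandas as pd

# Hypothetical yearly totals, NOT the real query result.
yearly = pd.DataFrame({
    "year": ["2017", "2018", "2019", "2020"],
    "bottles_sold": [100, 110, 121, 30],
})

# Exclude 2020, which only covers part of the year.
complete = yearly[yearly["year"] != "2020"].copy()

# Percentage change versus the previous year (NaN for the first year).
complete["yoy_pct"] = complete["bottles_sold"].pct_change() * 100
print(complete)
```

Leaving the partial year in would show a sharp drop that reflects missing data rather than falling sales.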