# Getting started

Once you've chosen your scenario, download the data from [the Iowa website](https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy) in CSV format. Start by loading the data with pandas. The date columns are read in as plain strings, so you may need to parse them explicitly.
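
As a minimal sketch of that loading step, the snippet below reads a tiny in-memory CSV rather than the real download. The two rows and the `parse_dollars` helper are made up for illustration; the column names follow the ones used later in this notebook.

```python
import io

import pandas as pd

# Two hypothetical rows shaped like the Iowa export (values are illustrative)
raw = io.StringIO(
    "Date,Store Number,Bottles Sold,Sale (Dollars)\n"
    "11/04/2015,3717,12,$81.00\n"
    "03/02/2016,2614,2,$41.26\n"
)


def parse_dollars(value):
    """Turn a string such as '$81.00' into the float 81.0 (hypothetical helper)."""
    return float(value.replace("$", ""))


# Convert the currency column while reading the file
sample = pd.read_csv(raw, converters={"Sale (Dollars)": parse_dollars})

# Parse the date column with an explicit format instead of relying on inference
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")
```

Stating the date format explicitly avoids ambiguity between month-first and day-first strings and is usually faster on a file of this size.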

In [209]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model, preprocessing, metrics
# sklearn.cross_validation was removed from scikit-learn; these helpers live in model_selection
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from collections import defaultdict
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 8, 8
%matplotlib inline

In [210]:
df = pd.read_csv('../../DSI-BOS-students/apasciuto/datasets/w3_liquorsale.csv')

In [211]:
df = df.drop(['County Number', 'Vendor Number','Item Number','Item Description', 'Volume Sold (Gallons)'], axis = 1)

In [212]:
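# Strip the "$" sign and convert the currency columns to floats (missing values stay NaN)
cols = ["State Bottle Cost", "State Bottle Retail", "Sale (Dollars)"]
for col in cols:
    df[col] = df[col].str.replace("$", "", regex=False).astype(float)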

In [213]:
df = df.dropna()

In [214]:
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

In [215]:
# Calculate margins, unit prices
df["Margin"] = (df["State Bottle Retail"] - df["State Bottle Cost"]) * df["Bottles Sold"]
df["Price per Liter"] = df["Sale (Dollars)"] / df["Volume Sold (Liters)"]
df["Price per Bottle"] = df["Sale (Dollars)"]/ df["Bottles Sold"]

In [216]:
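# Sales per store
# Filter by our start and end dates. Note that end_date below runs into Q1 2016;
# use "20151231" instead if you want calendar-year 2015 totals.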
df.sort_values(by=["Store Number", "Date"], inplace=True)
start_date = pd.Timestamp("20150101")
end_date = pd.Timestamp("20160331")
mask = (df['Date'] >= start_date) & (df['Date'] <= end_date)
sales = df[mask]

In [217]:
sales = sales.groupby(by="Store Number", as_index=False)

In [218]:
# Compute sums, means
sales = sales.agg({"County": lambda x: x.iloc[0],
                   "Sale (Dollars)": [np.sum, np.mean],
                   "Volume Sold (Liters)": [np.sum, np.mean],
                   "Margin": np.sum,
                   "Price per Liter": np.mean,
                   "Zip Code": lambda x: x.iloc[0], # just extract once, should be the same
                   "City": lambda x: x.iloc[0],
                   "Bottles Sold": [np.sum, np.mean], 
                   "Price per Bottle": np.mean})

In [219]:
# Collapse the column indices
sales.columns = [' '.join(col).strip() for col in sales.columns.values]

In [220]:
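# Rename the collapsed columns. The assignment is positional, so the names below must be
# listed in the same order as sales.columns; print sales.columns first and verify,
# because the order produced by a dict-based agg is not guaranteed on older Python versions.
sales.columns = ['Store Number', 'County', 'City', 'Zip Code', '2015 Sales', '2015 Sales mean', '2015 Margin', 'Total Bottles Sold',
        'Bottles Sold mean', 'Price per Bottle mean', '2015 Volume Sold (Liters)', '2015 Volume Sold (Liters) mean',
        'Price per Liter mean']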

In [221]:
# Determine which stores were open all of 2015
# Find the first and last sales date.
dates = df.groupby(by=["Store Number"], as_index=False)
# Use string aggregation names so the collapsed column names are stable across NumPy/pandas versions
dates = dates.agg({"Date": ["min", "max"]})
dates.columns = [' '.join(col).strip() for col in dates.columns.values]

# Sales 2015 Q1
# Filter out stores that opened or closed throughout the year
# You may want to save this step until you start modelling
lower_cutoff = pd.Timestamp("20150105")
upper_cutoff = pd.Timestamp("20150331")
mask = (dates['Date min'] < lower_cutoff) & (dates['Date max'] > upper_cutoff)
good_stores = dates[mask]["Store Number"]
data_for_open_stores = df[df["Store Number"].isin(good_stores)]

In [222]:
sales.head()

,Store Number,County,City,Zip Code,2015 Sales,2015 Sales mean,2015 Margin,Total Bottles Sold,Bottles Sold mean,Price per Bottle mean,2015 Volume Sold (Liters),2015 Volume Sold (Liters) mean,Price per Liter mean
0,2106,15.479095,11836.1,18.153528,Black Hawk,CEDAR FALLS,12573,19.283742,176517.45,270.732285,58916.88,50613,17.86911
1,2113,16.267717,836.85,4.548098,Webster,GOWRIE,830,4.51087,11376.12,61.826739,3802.53,50543,18.301651
2,2130,15.015197,8436.27,16.606831,Black Hawk,WATERLOO,9144,18.0,139440.02,274.488228,46517.61,50703,16.963739
3,2152,12.829193,720.87,4.477453,Cerro Gordo,ROCKWELL,670,4.161491,8625.74,53.576025,2891.61,50469,12.954562
4,2178,14.432203,2437.92,8.264136,Allamakee,WAUKON,2408,8.162712,29912.68,101.398915,10034.46,52172,15.866688


In [223]:
# Determine which stores were open all of 2016
# Find the first and last sales date.
dates = df.groupby(by=["Store Number"], as_index=False)
# Use string aggregation names so the collapsed column names are stable across NumPy/pandas versions
dates = dates.agg({"Date": ["min", "max"]})
dates.columns = [' '.join(col).strip() for col in dates.columns.values]

# Sales 2016 Q1
# Filter out stores that opened or closed throughout the year
# You may want to save this step until you start modelling
lower_cutoff = pd.Timestamp("20160105")
upper_cutoff = pd.Timestamp("20160331")
mask = (dates['Date min'] < lower_cutoff) & (dates['Date max'] > upper_cutoff)
good_stores = dates[mask]["Store Number"]
data_for_open_stores = df[df["Store Number"].isin(good_stores)]

In [224]:
sales.head()

,Store Number,County,City,Zip Code,2015 Sales,2015 Sales mean,2015 Margin,Total Bottles Sold,Bottles Sold mean,Price per Bottle mean,2015 Volume Sold (Liters),2015 Volume Sold (Liters) mean,Price per Liter mean
0,2106,15.479095,11836.1,18.153528,Black Hawk,CEDAR FALLS,12573,19.283742,176517.45,270.732285,58916.88,50613,17.86911
1,2113,16.267717,836.85,4.548098,Webster,GOWRIE,830,4.51087,11376.12,61.826739,3802.53,50543,18.301651
2,2130,15.015197,8436.27,16.606831,Black Hawk,WATERLOO,9144,18.0,139440.02,274.488228,46517.61,50703,16.963739
3,2152,12.829193,720.87,4.477453,Cerro Gordo,ROCKWELL,670,4.161491,8625.74,53.576025,2891.61,50469,12.954562
4,2178,14.432203,2437.92,8.264136,Allamakee,WAUKON,2408,8.162712,29912.68,101.398915,10034.46,52172,15.866688


In [335]:
# data.info()

In [336]:
data["Total Sales"] = data["State Bottle Retail"].mul(data["Bottles Sold"])

In [337]:
data["Sales Margin"] = (data["State Bottle Retail"] - data["State Bottle Cost"]) * data["Bottles Sold"]

In [340]:
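# Caution: "Volume Sold (Liters)" already appears to be the per-transaction total, so this product likely overstates volume
data["Total Volume Sold (Liters)"] = data["Volume Sold (Liters)"].mul(data["Bottles Sold"])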

In [341]:
market_data = data[["Store Number", "City", "Zip Code", "Bottles Sold", "Sales Margin", "Total Sales", "Item Description", "Category Name", "Total Volume Sold (Liters)"]]

In [342]:
market_data.head()

Date,Store Number,City,Zip Code,Bottles Sold,Sales Margin,Total Sales,Item Description,Category Name,Total Volume Sold (Liters)
2015-01-05,4303,MEDIAPOLIS,52637,4,5.52,16.52,Smirnoff Vodka 80 Prf,VODKA 80 PROOF,3.2
2015-01-05,2650,HARLAN,51537,3,30.96,92.85,Belvedere Vodka,IMPORTED VODKA,9.0
2015-01-05,2613,COUNCIL BLUFFS,51501,2,7.86,23.58,Smirnoff Silver Vodka 90 Prf,OTHER PROOF VODKA,3.0
2015-01-05,4806,WEST LIBERTY,52776,12,114.0,342.0,Ciroc Coconut,IMPORTED VODKA - MISC,108.0
2015-01-05,2650,HARLAN,51537,4,19.32,57.92,Titos Vodka,VODKA 80 PROOF,12.0


In [295]:
twentyfifteen = market_data.loc["2015-01-05":"2016-01-04"]

In [298]:
twentyfifteen_volume = twentyfifteen["Total Volume Sold (Liters)"].sum()
twentyfifteen_volume

158410014.54000542

In [297]:
twentyfifteen_sales = twentyfifteen["Total Sales"].sum()
twentyfifteen_sales

28644382.880000506

In [154]:
twenty_fifteen_one = data.loc["2015-01-05":"2015-04-04"]

In [155]:
twentyfifteen_one = twenty_fifteen_one[["Store Number", "City", "Zip Code", "Total Sales", "Total Volume Sold (Liters)", "Item Description", "Category Name"]]

In [156]:
twentyfifteen_one = market_data.loc["2015-01-05":"2015-04-04"]

Date,Store Number,City,Zip Code,Total Sales,Total Volume Sold (Liters),Item Description,Category Name
2015-01-05,4303,MEDIAPOLIS,52637,16.52,3.2,Smirnoff Vodka 80 Prf,VODKA 80 PROOF
2015-01-05,2650,HARLAN,51537,92.85,9.0,Belvedere Vodka,IMPORTED VODKA
2015-01-05,2613,COUNCIL BLUFFS,51501,23.58,3.0,Smirnoff Silver Vodka 90 Prf,OTHER PROOF VODKA
2015-01-05,4806,WEST LIBERTY,52776,342.0,108.0,Ciroc Coconut,IMPORTED VODKA - MISC
2015-01-05,2650,HARLAN,51537,57.92,12.0,Titos Vodka,VODKA 80 PROOF


In [136]:
twenty_fifteen_two = data.loc["2015-04-05":"2015-07-04"]

In [137]:
twenty_fifteen_three = data.loc["2015-07-05":"2015-10-04"]

In [138]:
twenty_fifteen_four = data.loc["2015-10-05":"2016-01-04"]

In [139]:
# First quarter of 2016, starting the day after the fourth 2015 quarter ends
twenty_sixteen_one = data.loc["2016-01-05":"2016-03-31"]

In [45]:
list(data.columns)

['Store Number',
 'City',
 'Zip Code',
 'County Number',
 'County',
 'Category',
 'Category Name',
 'Vendor Number',
 'Item Number',
 'Item Description',
 'Bottle Volume (ml)',
 'State Bottle Cost',
 'State Bottle Retail',
 'Bottles Sold',
 'Sale (Dollars)',
 'Volume Sold (Liters)',
 'Volume Sold (Gallons)']

In [72]:
# infer_datetime_format is deprecated in recent pandas; state the month/day/year format explicitly
data["Date"] = pd.to_datetime(data["Date"], format="%m/%d/%Y")

In [76]:
cols = ["State Bottle Cost", "State Bottle Retail", "Sale (Dollars)"]
for col in cols:
    data[col] = data[col].apply(lambda x: float(x.replace("$", "")))

In [77]:
data = data.dropna()

In [78]:
data.head()

,Date,Store Number,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
0,2015-11-04,3717,SUMNER,50674,9.0,Bremer,1051100.0,APRICOT BRANDIES,55,54436,Mr. Boston Apricot Brandy,750,4.5,6.75,12,81.0,9.0,2.38
1,2016-03-02,2614,DAVENPORT,52807,82.0,Scott,1011100.0,BLENDED WHISKIES,395,27605,Tin Cup,750,13.75,20.63,2,41.26,1.5,0.4
2,2016-02-11,2106,CEDAR FALLS,50613,7.0,Black Hawk,1011200.0,STRAIGHT BOURBON WHISKIES,65,19067,Jim Beam,1000,12.59,18.89,24,453.36,24.0,6.34
3,2016-02-03,2501,AMES,50010,85.0,Story,1071100.0,AMERICAN COCKTAILS,395,59154,1800 Ultimate Margarita,1750,9.5,14.25,6,85.5,10.5,2.77
4,2015-08-18,3654,BELMOND,50421,99.0,Wright,1031080.0,VODKA 80 PROOF,297,35918,Five O'clock Vodka,1750,7.2,10.8,12,129.6,21.0,5.55


In [79]:
data['State Bottle Cost'].describe()

count    269258.000000
mean          9.763293
std           7.039787
min           0.890000
25%           5.500000
50%           8.000000
75%          11.920000
max         425.000000
Name: State Bottle Cost, dtype: float64

# Explore the data

Perform some exploratory statistical analysis and make some plots, such as histograms of transaction totals, bottles sold, etc.

In [4]:
# import seaborn as sns
# import matplotlib.pyplot as plt
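
One possible starting point for this exploration is sketched below. It assumes a cleaned transaction table with `Sale (Dollars)` and `Bottles Sold` columns; the `toy` frame and its values are invented so the snippet runs on its own, and in the notebook you would substitute your own data frame.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical transactions standing in for the cleaned liquor sales data
toy = pd.DataFrame({
    "Sale (Dollars)": [81.0, 41.26, 453.36, 85.5, 129.6, 16.52, 92.85],
    "Bottles Sold": [12, 2, 24, 6, 12, 4, 3],
})

# Summary statistics give a quick numeric view before plotting
summary = toy.describe()

# One histogram per column, side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, toy.columns):
    ax.hist(toy[col], bins=5)
    ax.set_title(col)
    ax.set_ylabel("Number of transactions")
fig.tight_layout()
```

On the real data both distributions may be strongly right-skewed, in which case a log-scaled axis (`ax.set_xscale("log")`) can make the bulk of the transactions easier to see.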

## Record your findings

Be sure to write out any observations from your exploratory analysis.

# Mine the data
Now you are ready to compute the variables you will use for your regression from the data. For example, you may want to compute total sales per store from January to March of 2015, mean price per bottle, etc. Refer to the readme for more ideas appropriate to your scenario.

Pandas is your friend for this task. Take a look at the operations [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for ideas on how to make the best use of pandas and feel free to search for blog and Stack Overflow posts to help you group data by certain variables and compute sums, means, etc. You may find it useful to create a new data frame to house this summary data.
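
As a rough illustration of that kind of summary, the sketch below computes first-quarter totals per store with a `groupby`. The `toy` frame, its values, and the output column names (`q1_sales`, `q1_bottles`, `q1_price_per_bottle`) are assumptions made for this example.

```python
import pandas as pd

# Hypothetical transactions; the April row should be excluded by the Q1 filter
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2015-01-05", "2015-02-10", "2015-03-20", "2015-04-02", "2015-01-15"]
    ),
    "Store Number": [2106, 2106, 2113, 2113, 2113],
    "Sale (Dollars)": [100.0, 50.0, 30.0, 999.0, 20.0],
    "Bottles Sold": [10, 5, 3, 9, 1],
})

# Keep only January through March of 2015
in_q1 = (toy["Date"] >= pd.Timestamp("2015-01-01")) & (toy["Date"] <= pd.Timestamp("2015-03-31"))
q1 = toy[in_q1]

# Named aggregation yields flat, readable column names in a new summary frame
per_store = (
    q1.groupby("Store Number")
      .agg(q1_sales=("Sale (Dollars)", "sum"), q1_bottles=("Bottles Sold", "sum"))
      .reset_index()
)
per_store["q1_price_per_bottle"] = per_store["q1_sales"] / per_store["q1_bottles"]
```

Named aggregation avoids the multi-level column index that a dict of lists produces, so no separate step is needed to collapse or rename the columns.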

# Refine the data
Look for any statistical relationships, correlations, or other relevant properties of the dataset.
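
A correlation matrix is one simple way to start looking for such relationships. The sketch below assumes a per-store summary frame; the `stores` frame, its column names, and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical per-store summary; replace with your own summary data frame
stores = pd.DataFrame({
    "q1_sales": [150.0, 50.0, 400.0, 90.0, 260.0],
    "annual_sales": [640.0, 210.0, 1500.0, 380.0, 1100.0],
    "price_per_bottle": [10.0, 12.5, 9.0, 14.0, 11.0],
})

# Pairwise Pearson correlations between the numeric columns
corr = stores.corr()

# Which candidate predictor is most strongly correlated with annual sales?
strongest = corr["annual_sales"].drop("annual_sales").abs().idxmax()
```

Correlation only captures linear association, so it is worth pairing the matrix with scatter plots (for example `sns.pairplot(stores)`) before settling on predictors.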

# Build your models

Using scikit-learn or statsmodels, build the necessary models for your scenario. Evaluate model fit.

In [6]:
from sklearn import linear_model
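
The following is a minimal sketch of fitting and evaluating a linear model with scikit-learn. It uses synthetic data in which annual sales are roughly 4.2 times first-quarter sales plus noise; that relationship, the variable names, and the sample size are assumptions for the example, not findings from the Iowa data.

```python
import numpy as np
from sklearn import linear_model, metrics
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for per-store data: annual sales driven by Q1 sales plus noise
rng = np.random.RandomState(0)
q1_sales = rng.uniform(1000, 50000, size=200)
annual_sales = 4.2 * q1_sales + rng.normal(0, 2000, size=200)

X = q1_sales.reshape(-1, 1)  # scikit-learn expects a 2D feature matrix
y = annual_sales

# Hold out a test set so the fit is evaluated on data the model has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = linear_model.LinearRegression().fit(X_train, y_train)
r2_test = metrics.r2_score(y_test, model.predict(X_test))

# Five-fold cross-validated R^2 as a second check on model fit
cv_scores = cross_val_score(linear_model.LinearRegression(), X, y, cv=5)
```

Comparing `r2_test` with the spread of `cv_scores` gives a sense of whether a single train/test split is representative.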


## Plot your results

Again, make sure that you record any valuable information. For example, in the tax scenario, did you find the sales from the first three months of the year to be a good predictor of the total sales for the year? Plot the predictions versus the true values and discuss the successes and limitations of your models.
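
One common way to draw the predictions against the true values is a scatter plot with a reference diagonal, sketched below. The `y_true` and `y_pred` arrays are made-up numbers; in the notebook they would come from your held-out data and fitted model.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical true and predicted annual sales for five stores
y_true = np.array([640.0, 210.0, 1500.0, 380.0, 1100.0])
y_pred = np.array([600.0, 250.0, 1450.0, 420.0, 1150.0])

# Root mean squared error summarises the typical size of a miss
residuals = y_true - y_pred
rmse = float(np.sqrt(np.mean(residuals ** 2)))

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_true, y_pred)

# Points on the dashed diagonal are perfect predictions
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
ax.plot(lims, lims, linestyle="--")
ax.set_xlabel("True annual sales")
ax.set_ylabel("Predicted annual sales")
ax.set_title("Predicted vs. true (RMSE = {:.1f})".format(rmse))
```

Points that fall systematically above or below the diagonal at one end of the range suggest the model is biased for small or large stores, which is worth noting in the write-up.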

# Present the Results

Present your conclusions and results. If you have more than one interesting model, feel free to include each of them along with a discussion. Use your work in this notebook to prepare your write-up.