# Getting started

Once you've chosen your scenario, download the data from [the Iowa website](https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy) in CSV format. Start by loading the data with pandas. You may need to parse the date columns appropriately.

In [242]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
import matplotlib.pyplot as plt
%matplotlib inline

In [243]:
source = pd.read_csv("../../DSI-BOS-students/apasciuto/datasets/w3_liquorsale.csv", low_memory=False)

In [244]:
data = source

In [245]:
# infer_datetime_format is deprecated in pandas 2.x; to_datetime infers the format by default
data["Date"] = pd.to_datetime(data["Date"])

In [246]:
# Strip the leading "$" from the currency columns and convert them to floats
cols = ["State Bottle Cost", "State Bottle Retail", "Sale (Dollars)"]
for col in cols:
    data[col] = data[col].str.replace("$", "", regex=False).astype(float)

In [247]:
# Sort the cleaned frame chronologically
data = data.sort_values(by="Date", ascending=True)

In [248]:
# print(data.Date.min())
# print(data.Date.max())

In [249]:
# data.head()

In [250]:
data = data.set_index("Date")

In [251]:
data.dropna(inplace=True)

In [252]:
# data.info()

In [253]:
data["Total Sales"] = data["State Bottle Retail"].mul(data["Bottles Sold"])

In [254]:
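# Caution: "Volume Sold (Liters)" appears to be a per-transaction total already
# (the data.head() output further down lists 24 bottles of 1000 ml as 24.0 liters),
# so multiplying by "Bottles Sold" likely overstates the true volume.
data["Total Volume Sold (Liters)"] = data["Volume Sold (Liters)"].mul(data["Bottles Sold"])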

In [255]:
market_data = data[["Store Number", "City", "Zip Code", "Total Sales", "Item Description", "Category Name", "Total Volume Sold (Liters)"]]

In [256]:
market_data.head()

Date,Store Number,City,Zip Code,Total Sales,Item Description,Category Name,Total Volume Sold (Liters)
2015-01-05,4303,MEDIAPOLIS,52637,16.52,Smirnoff Vodka 80 Prf,VODKA 80 PROOF,3.2
2015-01-05,2650,HARLAN,51537,92.85,Belvedere Vodka,IMPORTED VODKA,9.0
2015-01-05,2613,COUNCIL BLUFFS,51501,23.58,Smirnoff Silver Vodka 90 Prf,OTHER PROOF VODKA,3.0
2015-01-05,4806,WEST LIBERTY,52776,342.0,Ciroc Coconut,IMPORTED VODKA - MISC,108.0
2015-01-05,2650,HARLAN,51537,57.92,Titos Vodka,VODKA 80 PROOF,12.0


In [257]:
twentyfifteen_sales = market_data.loc["2015-01-05":"2016-01-04", "Total Sales"].sum()
twentyfifteen_sales

28644382.880000506

In [201]:
twentyfifteen_volume = market_data.loc["2015-01-05":"2016-01-04", "Total Volume Sold (Liters)"].sum()
twentyfifteen_volume

158410014.54000542

In [154]:
twenty_fifteen_one = data.loc["2015-01-05":"2015-04-04"]

In [155]:
twentyfifteen_one = twenty_fifteen_one[["Store Number", "City", "Zip Code", "Total Sales", "Total Volume Sold (Liters)", "Item Description", "Category Name"]]

In [156]:
twentyfifteen_one.head()

Date,Store Number,City,Zip Code,Total Sales,Total Volume Sold (Liters),Item Description,Category Name
2015-01-05,4303,MEDIAPOLIS,52637,16.52,3.2,Smirnoff Vodka 80 Prf,VODKA 80 PROOF
2015-01-05,2650,HARLAN,51537,92.85,9.0,Belvedere Vodka,IMPORTED VODKA
2015-01-05,2613,COUNCIL BLUFFS,51501,23.58,3.0,Smirnoff Silver Vodka 90 Prf,OTHER PROOF VODKA
2015-01-05,4806,WEST LIBERTY,52776,342.0,108.0,Ciroc Coconut,IMPORTED VODKA - MISC
2015-01-05,2650,HARLAN,51537,57.92,12.0,Titos Vodka,VODKA 80 PROOF


In [136]:
twenty_fifteen_two = data.loc["2015-04-05":"2015-07-04"]

In [137]:
twenty_fifteen_three = data.loc["2015-07-05":"2015-10-04"]

In [138]:
twenty_fifteen_four = data.loc["2015-10-05":"2016-01-04"]

In [139]:
# First quarter of 2016
twenty_sixteen_one = data.loc["2016-01-05":"2016-03-31"]

In [None]:
# Equivalent boolean-mask selection; "Date" is the index after set_index("Date")
twenty_fifteen_one = data[(data.index >= "2015-01-05") & (data.index <= "2015-04-04")]

In [60]:
data["City"].value_counts()

DES MOINES         23724
CEDAR RAPIDS       18888
DAVENPORT          11580
WATERLOO            8425
COUNCIL BLUFFS      8060
SIOUX CITY          7992
IOWA CITY           7951
AMES                7548
WEST DES MOINES     7162
DUBUQUE             6915
CEDAR FALLS         5735
ANKENY              4836
MASON CITY          4191
BETTENDORF          3709
CORALVILLE          3490
MUSCATINE           3397
BURLINGTON          3144
CLINTON             3111
FORT DODGE          2989
WINDSOR HEIGHTS     2811
MARSHALLTOWN        2694
NEWTON              2544
STORM LAKE          2533
MARION              2489
URBANDALE           2438
OTTUMWA             2295
JOHNSTON            2141
ALTOONA             2113
CLEAR LAKE          2083
SPENCER             1910
                   ...  
GILMORE CITY          16
SCHALLER              15
VAN METER             15
WASHBURN              15
DANVILLE              15
GOLDFIELD             15
WALL LAKE             14
ALTA                  14
DOWS                  13


In [59]:
data["City"].unique()

array(['MEDIAPOLIS', 'HARLAN', 'COUNCIL BLUFFS', 'WEST LIBERTY',
       'CEDAR RAPIDS', 'ANKENY', 'MAXWELL', 'BURLINGTON', 'DAVENPORT',
       'BETTENDORF', 'INDIANOLA', 'WEST DES MOINES', 'DES MOINES',
       'WILTON', 'MAQUOKETA', 'MUSCATINE', 'CORALVILLE', 'IOWA FALLS',
       'JESUP', 'MOUNT PLEASANT', 'WATERLOO', 'CLIVE', 'FORT MADISON',
       'BONDURANT', 'MONONA', 'STUART', 'KEOKUK', 'EVANSDALE', 'FAIRFIELD',
       'SOLON', 'INDEPENDENCE', 'HAMPTON', 'PARKERSBURG', 'HIAWATHA',
       'GRUNDY CENTER', 'MOUNT VERNON', 'ANITA', 'ATLANTIC',
       'WEST BURLINGTON', 'WOODBINE', 'CRESCENT', 'DURANT', 'WELLMAN',
       'WASHINGTON', 'GUTHRIE CENTER', 'SIGOURNEY', 'ZWINGLE', 'MARION',
       'ALTOONA', 'COLUMBUS JUNCTION', 'MANCHESTER', 'NEWTON', 'IOWA CITY',
       'BELLEVUE', 'WAUKEE', 'JOHNSTON', 'AUDUBON', 'ANAMOSA',
       'GUTTENBURG', 'ELDON', 'CLEAR LAKE', 'BLUE GRASS', 'KEOSAUQUA',
       'LOGAN', 'WEST POINT', 'MISSOURI VALLEY', 'NORTH ENGLISH',
       'EDGEWOOD', 'DUNLAP',

In [45]:
list(data.columns)

['Store Number',
 'City',
 'Zip Code',
 'County Number',
 'County',
 'Category',
 'Category Name',
 'Vendor Number',
 'Item Number',
 'Item Description',
 'Bottle Volume (ml)',
 'State Bottle Cost',
 'State Bottle Retail',
 'Bottles Sold',
 'Sale (Dollars)',
 'Volume Sold (Liters)',
 'Volume Sold (Gallons)']

In [77]:
data = data.dropna()

In [78]:
data.head()

,Date,Store Number,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Volume Sold (Gallons)
0,2015-11-04,3717,SUMNER,50674,9.0,Bremer,1051100.0,APRICOT BRANDIES,55,54436,Mr. Boston Apricot Brandy,750,4.5,6.75,12,81.0,9.0,2.38
1,2016-03-02,2614,DAVENPORT,52807,82.0,Scott,1011100.0,BLENDED WHISKIES,395,27605,Tin Cup,750,13.75,20.63,2,41.26,1.5,0.4
2,2016-02-11,2106,CEDAR FALLS,50613,7.0,Black Hawk,1011200.0,STRAIGHT BOURBON WHISKIES,65,19067,Jim Beam,1000,12.59,18.89,24,453.36,24.0,6.34
3,2016-02-03,2501,AMES,50010,85.0,Story,1071100.0,AMERICAN COCKTAILS,395,59154,1800 Ultimate Margarita,1750,9.5,14.25,6,85.5,10.5,2.77
4,2015-08-18,3654,BELMOND,50421,99.0,Wright,1031080.0,VODKA 80 PROOF,297,35918,Five O'clock Vodka,1750,7.2,10.8,12,129.6,21.0,5.55


In [79]:
data['State Bottle Cost'].describe()

count    269258.000000
mean          9.763293
std           7.039787
min           0.890000
25%           5.500000
50%           8.000000
75%          11.920000
max         425.000000
Name: State Bottle Cost, dtype: float64

# Explore the data

Perform some exploratory statistical analysis and make some plots, such as histograms of transaction totals, bottles sold, etc.

In [4]:
# import seaborn as sns
# import matplotlib.pyplot as plt
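
As a starting point, the sketch below draws histograms of transaction totals and bottles sold. It runs on a small made-up frame called `toy` (a hypothetical stand-in, not the real data) so that it is self-contained; with the real data you would pass the cleaned `data` frame instead. The bin count is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the cleaned liquor data (assumption, not real values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Sale (Dollars)": rng.lognormal(mean=4.0, sigma=1.0, size=500).round(2),
    "Bottles Sold": rng.integers(1, 49, size=500),
})

# Numeric summary first: count, mean, std, quartiles, min and max per column.
summary = toy.describe()
print(summary)

# One histogram per column, side by side.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, ["Sale (Dollars)", "Bottles Sold"]):
    ax.hist(toy[col], bins=30)
    ax.set_title(col)
    ax.set_xlabel(col)
    ax.set_ylabel("Number of transactions")
fig.tight_layout()
```

Transaction totals tend to be right-skewed, so it may be worth repeating the first histogram on a log scale before drawing conclusions.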

## Record your findings

Be sure to write out any observations from your exploratory analysis.

# Mine the data
Now you are ready to compute the variables you will use for your regression from the data. For example, you may want to
compute total sales per store from Jan to March of 2015, mean price per bottle, etc. Refer to the readme for more ideas appropriate to your scenario.

Pandas is your friend for this task. Take a look at the operations [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for ideas on how to make the best use of pandas and feel free to search for blog and Stack Overflow posts to help you group data by certain variables and compute sums, means, etc. You may find it useful to create a new data frame to house this summary data.
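
A minimal sketch of this kind of summary, assuming a date-indexed frame like the one built earlier in the notebook. The `toy` frame and the output column names (`q1_sales`, `q1_bottles`, `n_transactions`, `mean_price_per_bottle`) are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Made-up transactions indexed by date (assumption, not real values).
toy = pd.DataFrame(
    {
        "Store Number": [2106, 2106, 2501, 2501, 2501],
        "Bottles Sold": [24, 6, 6, 12, 2],
        "Sale (Dollars)": [453.36, 85.50, 85.50, 129.60, 41.26],
    },
    index=pd.to_datetime(
        ["2015-01-05", "2015-02-11", "2015-03-02", "2015-06-18", "2015-01-20"]
    ).rename("Date"),
).sort_index()

# Keep only January through March 2015 (label slicing is inclusive on both ends).
q1 = toy.loc["2015-01-01":"2015-03-31"]

# One row per store, using named aggregation.
store_summary = q1.groupby("Store Number").agg(
    q1_sales=("Sale (Dollars)", "sum"),
    q1_bottles=("Bottles Sold", "sum"),
    n_transactions=("Sale (Dollars)", "size"),
)
store_summary["mean_price_per_bottle"] = (
    store_summary["q1_sales"] / store_summary["q1_bottles"]
)
print(store_summary)
```

Keeping the result in its own data frame, one row per store, makes it straightforward to join further per-store features later.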

# Refine the data
Look for any statistical relationships, correlations, or other relevant properties of the dataset.
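
One quick way to look for linear relationships is a pairwise correlation matrix. The sketch below uses a made-up per-store frame; the column names (`q1_sales`, `annual_sales`, `mean_price_per_bottle`) and the synthetic relationship between them are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up per-store summary (assumption, not real values).
rng = np.random.default_rng(1)
q1_sales = rng.uniform(1_000, 50_000, size=40)
store_summary = pd.DataFrame({
    "q1_sales": q1_sales,
    # Synthetic target: roughly four times Q1 sales plus noise.
    "annual_sales": 4 * q1_sales + rng.normal(0, 5_000, size=40),
    "mean_price_per_bottle": rng.uniform(8, 25, size=40),
})

# Pearson correlation between every pair of numeric columns.
corr = store_summary.corr()
print(corr.round(2))
```

A high correlation between a candidate feature and the target suggests it may be useful in a linear model, while high correlation between two features is a hint that they carry overlapping information.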

# Build your models

Using scikit-learn or statsmodels, build the necessary models for your scenario. Evaluate model fit.

In [6]:
from sklearn import linear_model
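
A minimal sketch of fitting and evaluating a linear model with scikit-learn, assuming a single feature (Q1 sales per store) and a yearly sales target. The data here is synthetic and the variable names are hypothetical; your scenario may call for different features, a regularized model, or cross-validation.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic per-store data (assumption, not real values).
rng = np.random.default_rng(2)
q1_sales = rng.uniform(1_000, 50_000, size=200)
annual_sales = 4 * q1_sales + rng.normal(0, 5_000, size=200)

X = q1_sales.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
y = annual_sales

# Hold out a quarter of the stores to evaluate fit on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = linear_model.LinearRegression()
model.fit(X_train, y_train)

# R^2 on the held-out stores.
r2 = r2_score(y_test, model.predict(X_test))
print("slope:", model.coef_[0], "intercept:", model.intercept_, "R^2:", r2)
```

Scoring on held-out stores rather than on the training data gives a more honest picture of how the model would perform on new stores.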


## Plot your results

Again, make sure that you record any valuable information. For example, in the tax scenario, did you find the sales from the first three months of the year to be a good predictor of the total sales for the year? Plot the predictions versus the true values and discuss the successes and limitations of your models.
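
A predicted-versus-actual plot could be sketched as follows. The arrays `y_true` and `y_pred` are synthetic placeholders (assumptions); in your notebook they would be the held-out targets and your model's predictions. Points on the dashed diagonal are perfect predictions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic placeholders for true values and model predictions (assumption).
rng = np.random.default_rng(3)
y_true = rng.uniform(5_000, 200_000, size=100)
y_pred = y_true + rng.normal(0, 10_000, size=100)

# Root mean squared error, in the same units as the target.
residuals = y_true - y_pred
rmse = float(np.sqrt(np.mean(residuals ** 2)))

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_true, y_pred, alpha=0.6)

# Diagonal reference line spanning the full range of both arrays.
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
ax.plot(lims, lims, linestyle="--", color="gray")

ax.set_xlabel("Actual annual sales")
ax.set_ylabel("Predicted annual sales")
ax.set_title(f"Predicted vs. actual (RMSE = {rmse:,.0f})")
```

Systematic departures from the diagonal, such as consistent under-prediction for the largest stores, are worth noting when you discuss the limitations of your models.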

# Present the Results

Present your conclusions and results. If you have more than one interesting model, feel free to include more than one, along with a discussion. Use your work in this notebook to prepare your write-up.