# Data Wrangling for Iowa Liquor Sales Database

Objective is to prepare this dataframe for the next stages of data analysis by having correct data types. Matching duplicate pairs of data such as: store number/ store name deal with missing values. Removing unneeded columns. 

In [25]:
# load packages

import pandas as pd
import numpy as np
import re

In [26]:
# CSV file is from "https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy"

# create variable to store dataset
iowaLiquorSales = pd.read_csv('/Users/joe/Desktop/IOWA LIQUOR/Iowa_Liquor_Sales.csv')

  exec(code_obj, self.user_global_ns, self.user_ns)


In [27]:
iowaLiquorSales = iowaLiquorSales.sample(10000)

In [28]:
# take a look at the data

iowaLiquorSales.head().T

Unnamed: 0,3129755,6851465,2086055,15991241,617545
Invoice/Item Number,INV-06063800007,S16412300003,INV-02780200026,INV-22880200023,INV-16322600066
Date,07/13/2017,12/20/2013,01/18/2017,10/29/2019,12/14/2018
Store Number,3054,4159,2600,5219,4947
Store Name,Mcnally's Super Valu,Fareway Stores #073 / Council Bluffs,Hy-Vee Food Store / Oskaloosa,Kirkwood Liquor & Tobacco,The Music Station
Address,1026 Main,310 MCKENZIE AVE,110 S D St,"300, Kirkwood Ave",1420 W First St
City,Grinnell,COUNCIL BLUFFS,Oskaloosa,Iowa City,Cedar Falls
Zip Code,50112.0,51503,52577.0,52240,50647.0
Store Location,,POINT (-95.81799100000002 41.280084),POINT (-92.649764 41.295218),POINT (-91.531628 41.649432),POINT (-92.462826 42.537839)
County Number,79.0,78.0,62.0,52.0,7.0
County,POWESHIEK,Pottawattamie,MAHASKA,JOHNSON,Black Hawk


In [29]:
# Right away, there seems to be some spelling differences to standardize
# some are in all caps, some are capitalized
# there are also some columns I have no interest in: invoice, volume of sales in gallons, address

In [30]:
# and it's size

iowaLiquorSales.shape

(10000, 24)

In [31]:
# 22M entries with 24 features

In [32]:
# checking for missing
iowaLiquorSales.isna().sum()

Invoice/Item Number        0
Date                       0
Store Number               0
Store Name                 0
Address                   37
City                      37
Zip Code                  37
Store Location           955
County Number             70
County                    70
Category                   6
Category Name              8
Vendor Number              0
Vendor Name                0
Item Number                0
Item Description           0
Pack                       0
Bottle Volume (ml)         0
State Bottle Cost          0
State Bottle Retail        0
Bottles Sold               0
Sale (Dollars)             0
Volume Sold (Liters)       0
Volume Sold (Gallons)      0
dtype: int64

### Missing values thoughts

Store location, GPS information is missing around 11%, 
at the moment I am uncertain if I will use this information.

A potential use for this information would be to determine a good location for a new store.
GPS data is coordinates on a plane. Basic statistics could be used to determine distance between stores

Do stores groupped together see greater sales or does isolated stores see greater sales. We also have zip codes for each store which are missing far fewer entries. Store density per zip code could also give an idea of this information, especially if acres/ sq. miles per zip is available

Before we simply drop nulls we need to know:
Are the NaN values specific to a certain location? Would dropping them under represent an area

To do this let's explore the location information we do have, zip code. Once zip code has been cleaned we will be able to tell if the missing data is concentrated to a certain area. If the missing data is concentrated we will need to decide how to proceed. Which perhaps means ignoring store location entirely

## Let the Wrangling Begin!

### [ ' Zip Code ' ]

In [33]:
# ZIP CODE
iowaLiquorSales['Zip Code'].nunique()

681

In [34]:
# a quick internet search states iowa has 1055 zip codes. at least the number isn't larger. 

In [35]:
# zip code data type
iowaLiquorSales['Zip Code'].dtypes

dtype('O')

In [36]:
#### Objective: 

# Change zipcode from object type to a numeric dtype

In [37]:
####### this code provides an error, but it let me know I had a non-numerical entry  ########

# iowaLiquorSales['Zip Code'] = iowaLiquorSales['Zip Code'].astype('float64')

# this let me know I need to look out for '712-2'

In [38]:
# how many have that entries have this faulty zip code?

iowaLiquorSales[iowaLiquorSales['Zip Code'] == '712-2'].groupby(['City']).describe()

Unnamed: 0_level_0,Store Number,Store Number,Store Number,Store Number,Store Number,Store Number,Store Number,Store Number,County Number,County Number,...,Volume Sold (Liters),Volume Sold (Liters),Volume Sold (Gallons),Volume Sold (Gallons),Volume Sold (Gallons),Volume Sold (Gallons),Volume Sold (Gallons),Volume Sold (Gallons),Volume Sold (Gallons),Volume Sold (Gallons)
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
City,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
DUNLAP,4.0,4307.0,0.0,4307.0,4307.0,4307.0,4307.0,4307.0,4.0,43.0,...,4.5,12.0,4.0,1.03,1.4352,0.16,0.235,0.395,1.19,3.17


In [39]:
# all of the entries are from the same city Dunlap. We also now know the 'city' feature has spelling issues

# Dunlap which has a actual zip code of 51529 and turns out their area code is 712
# so that is most likely the origin of the error

# objective: change all 712-2 zip codes to 51529

In [40]:
# replace all of those entries to the correct zip code

iowaLiquorSales['Zip Code'] = iowaLiquorSales['Zip Code'].replace({'712-2': 51529})

In [41]:
# examine different zip codes

#iowaLiquorSales['Zip Code'].unique()

In [42]:
# some float values, some are integers, some are strings and one nan.

# there is only 1 nan in 22M. we could easily drop it. with:
# iowaLiquorSales = iowaLiquorSales[iowaLiquorSales['Zip Code'].notna()]

In [43]:
# turn the nan into a zero
iowaLiquorSales['Zip Code'] = iowaLiquorSales['Zip Code'].replace(np.nan, 0)

# check out where that store location is
iowaLiquorSales[iowaLiquorSales['Zip Code'] == 0].groupby(['City']).describe()

Series([], dtype: float64)

In [44]:
# what is the zip code of stanton
iowaLiquorSales[iowaLiquorSales['City'] == 'Stanton'].groupby(['Zip Code']).describe()

Series([], dtype: float64)

In [45]:
# we already knew there were some data type duplicates str, int, float in the zip code column.
# but we did learn that Stanton should have a zip code of 51573

# Objective: change zipcode '0' to the correct value of 51573

# earlier we used the replace method because the zipcode had a '-' in it. if we use it again:
# iowaLiquorSales['Zip Code'] = iowaLiquorSales['Zip Code'].replace({0: 51573})
# all zipcodes containing a '0' would have their '0' replaced with five extra digits

# instead we will write over those specific entries
iowaLiquorSales.loc[iowaLiquorSales['Zip Code'] == 0, 'Zip Code'] = 51573

In [46]:
# now the zip codes should be ready for conversion
# converting all zip codes to numeric
iowaLiquorSales['Zip Code'] = iowaLiquorSales['Zip Code'].astype('int64')

In [47]:
# I know it was a while ago, but all that zip code cleaning was to see if the missing store location data
# is concentrated.

# we can groupby zipcode to see if the missing gps locations are in the same place

df = iowaLiquorSales[['Store Location', 'Zip Code']]

In [48]:
# get rid of the null values in dummy dataframe
df['Store Location'] = df['Store Location'].fillna(0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Store Location'] = df['Store Location'].fillna(0)


In [49]:
# focus on just the missing locations
dff = df[df['Store Location'] == 0]

In [50]:
# how many missing values are there per zip code
dff.value_counts()


Store Location  Zip Code
0               50701       64
                50010       62
                52804       56
                52302       48
                50315       47
                            ..
                51241        1
                51247        1
                51301        1
                51579        1
                51201        1
Length: 104, dtype: int64

In [51]:
# missing store locations might be grouped. There are 100,000 missing in multiple zip codes
# remember there are about 2M missing store locations

iowaLiquorSales[iowaLiquorSales['Zip Code'] == 52302].groupby(['City']).describe()

Unnamed: 0_level_0,Store Number,Store Number,Store Number,Store Number,Store Number,Store Number,Store Number,Store Number,Zip Code,Zip Code,...,Volume Sold (Liters),Volume Sold (Liters),Volume Sold (Gallons),Volume Sold (Gallons),Volume Sold (Gallons),Volume Sold (Gallons),Volume Sold (Gallons),Volume Sold (Gallons),Volume Sold (Gallons),Volume Sold (Gallons)
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
City,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
MARION,35.0,3330.428571,856.752291,2560.0,2560.0,2560.0,4180.0,4736.0,35.0,52302.0,...,9.0,105.0,35.0,2.240857,4.605667,0.05,0.45,1.19,2.38,27.74
Marion,59.0,3702.440678,1006.507024,2514.0,2560.0,4180.0,4180.0,6158.0,59.0,52302.0,...,9.0,18.0,59.0,1.509153,1.221543,0.02,0.53,1.18,2.37,4.75


In [52]:
# Just from the most common missing gps data. Ames Iowa. 131355 out of the 559568
# that is 20% which is over a years worth. I do not want to drop these missing values.

# from the 5th most common missing, ankeny 104475 / ~250000 that is ~40%

# 52302 (city of marion) has 121012 missing values  and about 200000 total which is 60%

# the missing gps data appears to be correlated. I will be dropping store location.

In [53]:
# drop unneeded columns

# invoice and pack columns are important for the distributers with warehouse inventory, not for store habits.
# store sell the individual items not cases most often.

iowa = iowaLiquorSales.drop(columns=['Invoice/Item Number','Address','Store Location','Pack','Volume Sold (Gallons)'])

In [54]:
# current state
iowa.head().T

Unnamed: 0,3129755,6851465,2086055,15991241,617545
Date,07/13/2017,12/20/2013,01/18/2017,10/29/2019,12/14/2018
Store Number,3054,4159,2600,5219,4947
Store Name,Mcnally's Super Valu,Fareway Stores #073 / Council Bluffs,Hy-Vee Food Store / Oskaloosa,Kirkwood Liquor & Tobacco,The Music Station
City,Grinnell,COUNCIL BLUFFS,Oskaloosa,Iowa City,Cedar Falls
Zip Code,50112,51503,52577,52240,50647
County Number,79.0,78.0,62.0,52.0,7.0
County,POWESHIEK,Pottawattamie,MAHASKA,JOHNSON,Black Hawk
Category,1012100.0,1062200.0,1031100.0,1031100.0,1062100.0
Category Name,Canadian Whiskies,PUERTO RICO & VIRGIN ISLANDS RUM,American Vodkas,American Vodkas,Gold Rum
Vendor Number,260.0,434.0,434.0,300.0,434.0


In [55]:
# check column data types
iowa.dtypes

Date                     object
Store Number              int64
Store Name               object
City                     object
Zip Code                  int64
County Number           float64
County                   object
Category                float64
Category Name            object
Vendor Number           float64
Vendor Name              object
Item Number              object
Item Description         object
Bottle Volume (ml)        int64
State Bottle Cost       float64
State Bottle Retail     float64
Bottles Sold              int64
Sale (Dollars)          float64
Volume Sold (Liters)    float64
dtype: object

In [56]:
# missing values
iowa.isna().sum()

Date                     0
Store Number             0
Store Name               0
City                    37
Zip Code                 0
County Number           70
County                  70
Category                 6
Category Name            8
Vendor Number            0
Vendor Name              0
Item Number              0
Item Description         0
Bottle Volume (ml)       0
State Bottle Cost        0
State Bottle Retail      0
Bottles Sold             0
Sale (Dollars)           0
Volume Sold (Liters)     0
dtype: int64

### [ ' Date ' ]

In [57]:
#DATE COLUMN
# we know that date is currently an object, we would like to have this be in datetime
iowa['Date'].describe(datetime_is_numeric=True)

count          10000
unique          2312
top       05/12/2014
freq              15
Name: Date, dtype: object

In [58]:
# there were 18000 deliveries on 12/22/2020.  There are only ~2600 different stores
# that is a busy day at the warehouse

In [59]:
# 'Date' column is also an object type. 
# let's change that to date time with pandas.to_datetime()

iowa['Date'] = pd.to_datetime(iowa['Date'])

### [ ' Store Number ' ] and [ ' Store Name ' ]

In [60]:
# STORE NUMBER COLUMN

# how many different store numbers are there?
iowa['Store Number'].nunique()

1667

In [61]:
# STORE NAME COLUMN

iowa['Store Name'].describe()

count                            10000
unique                            1719
top       Hy-Vee #3 / BDI / Des Moines
freq                                84
Name: Store Name, dtype: object

In [62]:
# there's 2839 unique store names and 2687 numbers
# so there are at least some spelling issues

In [63]:
# clean up some

iowa['Store Name'] = iowa['Store Name'].str.capitalize()

In [64]:
#iowa['Store Name'].nunique()

In [65]:
# we are going to locate the duplicate pairs

# create a dummy df with just the store number and name that has all the combinations
dup_stores = iowa.drop_duplicates(subset = ['Store Number', 'Store Name'])[['Store Number', 'Store Name']]

# sort these number/ name combinations by occurances of store number 
# (we have more names than numbers, so numbers will have certainly have duplicates)
dup_stores = dup_stores.groupby('Store Number').count()

# ascending false because we want the larger numbers aka duplicates
dup_stores = dup_stores.sort_values('Store Name', ascending = False)

# remove singles, or non-duplicates
dup_stores = dup_stores[dup_stores['Store Name'] > 1 ].reset_index()

# we now have a list/ df of store numbers who have duplicate entries
dup_stores

Unnamed: 0,Store Number,Store Name
0,4743,3
1,4152,3
2,2663,3
3,2625,2
4,5092,2
...,...,...
77,4540,2
78,3917,2
79,3920,2
80,4546,2


In [66]:
# list of stores
stores = iowa.groupby(['Store Number', 'Store Name']).count().reset_index()

# list of stores from the duplicate stores
wrong_names = stores[stores['Store Number'].isin(dup_stores['Store Number'])].sort_values('Store Number')[['Store Number','Store Name']]

# list of 'correct' names, most common name entry for each store
stores_map = wrong_names.groupby('Store Number').max()[['Store Name']].reset_index()

In [67]:
# merge 'correct' names with the df, merged on store number
iowa = pd.merge(left = iowa, right = stores_map, left_on = 'Store Number', right_on = 'Store Number', how = 'left')
iowa.head()

Unnamed: 0,Date,Store Number,Store Name_x,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Vendor Name,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Store Name_y
0,2017-07-13,3054,Mcnally's super valu,Grinnell,50112,79.0,POWESHIEK,1012100.0,Canadian Whiskies,260.0,DIAGEO AMERICAS,10790,Crown Royal Vanilla,375,8.0,12.0,3,36.0,1.12,
1,2013-12-20,4159,Fareway stores #073 / council bluffs,COUNCIL BLUFFS,51503,78.0,Pottawattamie,1062200.0,PUERTO RICO & VIRGIN ISLANDS RUM,434.0,Luxco-St Louis,45278,Paramount White Rum,1750,7.58,11.37,6,68.22,10.5,
2,2017-01-18,2600,Hy-vee food store / oskaloosa,Oskaloosa,52577,62.0,MAHASKA,1031100.0,American Vodkas,434.0,LUXCO INC,36305,Hawkeye Vodka,750,3.34,5.01,12,60.12,9.0,
3,2019-10-29,5219,Kirkwood liquor & tobacco,Iowa City,52240,52.0,JOHNSON,1031100.0,American Vodkas,300.0,McCormick Distilling Co.,36908,McCormick 80prf Vodka PET,1750,7.47,11.21,6,67.26,10.5,
4,2018-12-14,4947,The music station,Cedar Falls,50647,7.0,Black Hawk,1062100.0,Gold Rum,434.0,LUXCO INC,45245,Paramount Gold Rum PET,750,4.0,6.0,3,18.0,2.25,


In [68]:
# now our dataframe has multiple store name columns
# store_y is the cleaned names for duplicates
# fill in the missing values (the values where there was no duplication from the original name column)
iowa['Store Name'] = iowa['Store Name_y'].fillna(iowa['Store Name_x'])

In [69]:
# drop the extra store name column
iowa = iowa.drop(columns=['Store Name_x', 'Store Name_y'])

In [70]:
# where are we at now
iowa[['Store Number', "Store Name"]].nunique()

Store Number    1667
Store Name      1616
dtype: int64

In [71]:
# We had an assumption there store number was clean.  Something I have learned is do not assume this df is clean
# repeat the previous process but prioritizing name instead of number

# we are going to locate the duplicate pairs

# create a dummy df with just the store number and name that has all the combinations
dup_stores = iowa.drop_duplicates(subset = ['Store Number', 'Store Name'])[['Store Number', 'Store Name']]

# sort these number/ name combinations by occurances of store number 
# (we have more numbers than names this time, so names will have certainly have duplicates)
dup_stores = dup_stores.groupby('Store Name').count()

# ascending false because we want the larger numbers aka duplicates
dup_stores = dup_stores.sort_values('Store Number', ascending = False)

# remove singles, or non-duplicates
dup_stores = dup_stores[dup_stores['Store Number'] > 1 ].reset_index()

# we now have a list/ df of store numbers who have duplicate entries
dup_stores

Unnamed: 0,Store Name,Store Number
0,Liquor and tobacco outlet /,3
1,Jeff's market / wilton,3
2,Jeff's market / durant,3
3,Sauce,3
4,New star / fort dodge,3
5,Super mart / oelwein,2
6,Kum & go #4098 / windsor heights,2
7,Select mart / sioux city,2
8,Lil' chubs corner stop,2
9,Hometown foods / traer,2


In [72]:
# list of stores
stores = iowa.groupby(['Store Number', 'Store Name']).count().reset_index()

# list of stores from the duplicate stores
wrong_numbers = stores[stores['Store Name'].isin(dup_stores['Store Name'])].sort_values('Store Name')[['Store Number','Store Name']]

# list of 'correct' numbers, most common number entry for each store
stores_map = wrong_numbers.groupby('Store Name').max()[['Store Number']].reset_index()

In [73]:
# merge 'correct' numbers with the df, merged on store name
iowa = pd.merge(left = iowa, right = stores_map, left_on = 'Store Name', right_on = 'Store Name', how = 'left')
iowa.head()

Unnamed: 0,Date,Store Number_x,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Vendor Name,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Store Name,Store Number_y
0,2017-07-13,3054,Grinnell,50112,79.0,POWESHIEK,1012100.0,Canadian Whiskies,260.0,DIAGEO AMERICAS,10790,Crown Royal Vanilla,375,8.0,12.0,3,36.0,1.12,Mcnally's super valu,
1,2013-12-20,4159,COUNCIL BLUFFS,51503,78.0,Pottawattamie,1062200.0,PUERTO RICO & VIRGIN ISLANDS RUM,434.0,Luxco-St Louis,45278,Paramount White Rum,1750,7.58,11.37,6,68.22,10.5,Fareway stores #073 / council bluffs,
2,2017-01-18,2600,Oskaloosa,52577,62.0,MAHASKA,1031100.0,American Vodkas,434.0,LUXCO INC,36305,Hawkeye Vodka,750,3.34,5.01,12,60.12,9.0,Hy-vee food store / oskaloosa,
3,2019-10-29,5219,Iowa City,52240,52.0,JOHNSON,1031100.0,American Vodkas,300.0,McCormick Distilling Co.,36908,McCormick 80prf Vodka PET,1750,7.47,11.21,6,67.26,10.5,Kirkwood liquor & tobacco,
4,2018-12-14,4947,Cedar Falls,50647,7.0,Black Hawk,1062100.0,Gold Rum,434.0,LUXCO INC,45245,Paramount Gold Rum PET,750,4.0,6.0,3,18.0,2.25,The music station,


In [74]:
# now our dataframe has multiple store number columns
# store_y is the cleaned names for duplicates
# fill in the missing values (the values where there was no duplication from the original name column)
iowa['Store Number'] = iowa['Store Number_y'].fillna(iowa['Store Number_x'])

In [75]:
# drop the extra store name column
iowa = iowa.drop(columns=['Store Number_x','Store Number_y'])

# where are we at now
iowa[['Store Number', "Store Name"]].nunique()

Store Number    1616
Store Name      1616
dtype: int64

In [76]:
# Do all pairs match?
# if we groupby store number and name is the total unique values also 2555?
iowa.groupby(['Store Number','Store Name'])['Store Number'].nunique().sum()

1616

### [ ' City ' ]

In [77]:
# CITY
iowa["City"].describe()

count           9963
unique           609
top       Des Moines
freq             525
Name: City, dtype: object

In [78]:
# most common city is the largest city

In [79]:
# check for any incorrect entries
# first let's standardize capitalization

# convert all cities to upper case
iowa['City'] = iowa['City'].str.upper()

In [80]:
# look for duplicates
dummy = iowa['City'].drop_duplicates()
sorted(dummy.astype('str'))

['ACKLEY',
 'ADAIR',
 'ADEL',
 'AKRON',
 'ALBIA',
 'ALBION',
 'ALDEN',
 'ALGONA',
 'ALLISON',
 'ALTON',
 'ALTOONA',
 'AMES',
 'ANAMOSA',
 'ANITA',
 'ANKENY',
 'ANTHON',
 'ARMSTRONG',
 "ARNOLD'S PARK",
 'ARNOLDS PARK',
 'ATKINS',
 'ATLANTIC',
 'AUDUBON',
 'AVOCA',
 'BALDWIN',
 'BANCROFT',
 'BAXTER',
 'BEDFORD',
 'BELLE PLAINE',
 'BELLEVUE',
 'BELMOND',
 'BETTENDORF',
 'BEVINGTON',
 'BLOOMFIELD',
 'BLUE GRASS',
 'BONDURANT',
 'BOONE',
 'BRITT',
 'BROOKLYN',
 'BUFFALO',
 'BUFFALO CENTER',
 'BURLINGTON',
 'CAMANCHE',
 'CAMBRIDGE',
 'CARLISLE',
 'CARROLL',
 'CARTER LAKE',
 'CASCADE',
 'CASEY',
 'CEDAR FALLS',
 'CEDAR RAPIDS',
 'CENTER POINT',
 'CENTERVILLE',
 'CENTRAL CITY',
 'CHARITON',
 'CHARLES CITY',
 'CHEROKEE',
 'CLARINDA',
 'CLARION',
 'CLARKSVILLE',
 'CLEAR LAKE',
 'CLEARLAKE',
 'CLINTON',
 'CLIVE',
 'COLESBURG',
 'COLFAX',
 'COLO',
 'COLUMBUS JUNCTION',
 'CONRAD',
 'COON RAPIDS',
 'CORALVILLE',
 'CORNING',
 'CORWITH',
 'CORYDON',
 'COUNCIL BLUFFS',
 'CRESCO',
 'CRESTON',
 'CUMMING'

In [81]:
### first look throughthere are a bunch of duplicates
# and nan
# create a map to swap-out / correct the spelling
mapping = {"ARNOLD'S PARK" : 'ARNOLDS PARK',
           'CLEAR LAKE' : 'CLEARLAKE',
           'COLORADO SPRINGS' : np.nan,
           'FT. ATKINSON' : 'FORT ATKINSON',
           'GRAND MOUND' : 'GRAND MOUNDS',
           'GUTTENBERG' : 'GUTTENBURG',
           'KELLOG' : 'KELLOGG',
           'LECLAIRE' : 'LE CLAIRE',
           'LEMARS' : 'LE MARS',
           'MT PLEASANT' : 'MOUNT PLEASANT',
           'MT VERNON' : 'MOUNT VERNON',
           'OTTUWMA' : 'OTTUMWA',
           'OTUMWA' : 'OTTUMWA'}

In [82]:
#### map the correct spelllings onto their replacements
iowa['City'] = iowa['City'].replace(mapping.keys(), list(map(str, mapping.values())), regex=True)

In [83]:
## city has a 80000 nan values

# we are going to fill missing values with the preceeding value when sorted by that column (alphabetical)
# if we sort by zip then city all the na cities should be at the bottom and ffill will fill them mostly correctly

iowa = iowa.sort_values(by = ['Zip Code','City'])
iowa['City'] = iowa['City'].fillna(method='ffill')


### [ ' County Number ' ]

In [84]:
# COUNTY NUMBER

# this is an error
#iowa['County Number'] = iowa['County Number'].astype('int64')

# IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

iowa['County Number'] = iowa['County Number'].fillna(0)

In [85]:
# try this again
iowa['County Number'] = iowa['County Number'].astype('int64')

In [86]:
# check out some zip code combinations for the county 0
iowa.groupby('County Number').agg({'Zip Code': ['count','max','min',pd.Series.mode,pd.Series.nunique]})

Unnamed: 0_level_0,Zip Code,Zip Code,Zip Code,Zip Code,Zip Code
Unnamed: 0_level_1,count,max,min,mode,nunique
County Number,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
0,70,52732,50266,51573,22
1,23,50849,50002,50849,4
2,10,50841,50841,50841,1
3,43,52172,52146,52172,4
4,44,52571,52544,52544,2
...,...,...,...,...,...
95,46,50450,50424,50436,3
96,47,52144,52101,52101,2
97,315,51109,51004,51106,11
98,16,50459,50459,50459,1


In [87]:
# 98 different zip codes are represented in the missing county values
# one, 80904, appears to not be in iowa
iowa = iowa[iowa['County Number'] != 0]

In [88]:
iowa.isna().sum()

Date                    0
City                    0
Zip Code                0
County Number           0
County                  0
Category                6
Category Name           8
Vendor Number           0
Vendor Name             0
Item Number             0
Item Description        0
Bottle Volume (ml)      0
State Bottle Cost       0
State Bottle Retail     0
Bottles Sold            0
Sale (Dollars)          0
Volume Sold (Liters)    0
Store Name              0
Store Number            0
dtype: int64

In [89]:
# a quick internet search of: number of counties in iowa, 99 is correct

iowa['County Number'].nunique()

99

### [ ' County ' ]

In [90]:
# COUNTY

# we know it is supposed to be 99
iowa['County'].nunique()

199

In [91]:
# 200 > 99. Examine some values

# make all county names capitalized
iowa['County'] = iowa['County'].str.upper()

# did we catch 'em all?
iowa['County'].nunique()

103

In [92]:
# look for duplicates
dummy = iowa['County'].drop_duplicates()
sorted(dummy.astype('str'))

['ADAIR',
 'ADAMS',
 'ALLAMAKEE',
 'APPANOOSE',
 'AUDUBON',
 'BENTON',
 'BLACK HAWK',
 'BOONE',
 'BREMER',
 'BUCHANAN',
 'BUENA VIST',
 'BUENA VISTA',
 'BUTLER',
 'CALHOUN',
 'CARROLL',
 'CASS',
 'CEDAR',
 'CERRO GORD',
 'CERRO GORDO',
 'CHEROKEE',
 'CHICKASAW',
 'CLARKE',
 'CLAY',
 'CLAYTON',
 'CLINTON',
 'CRAWFORD',
 'DALLAS',
 'DAVIS',
 'DECATUR',
 'DELAWARE',
 'DES MOINES',
 'DICKINSON',
 'DUBUQUE',
 'EMMET',
 'FAYETTE',
 'FLOYD',
 'FRANKLIN',
 'FREMONT',
 'GREENE',
 'GRUNDY',
 'GUTHRIE',
 'HAMILTON',
 'HANCOCK',
 'HARDIN',
 'HARRISON',
 'HENRY',
 'HOWARD',
 'HUMBOLDT',
 'IDA',
 'IOWA',
 'JACKSON',
 'JASPER',
 'JEFFERSON',
 'JOHNSON',
 'JONES',
 'KEOKUK',
 'KOSSUTH',
 'LEE',
 'LINN',
 'LOUISA',
 'LUCAS',
 'LYON',
 'MADISON',
 'MAHASKA',
 'MARION',
 'MARSHALL',
 'MILLS',
 'MITCHELL',
 'MONONA',
 'MONROE',
 'MONTGOMERY',
 'MUSCATINE',
 "O'BRIEN",
 'OBRIEN',
 'OSCEOLA',
 'PAGE',
 'PALO ALTO',
 'PLYMOUTH',
 'POCAHONTAS',
 'POLK',
 'POTTAWATTA',
 'POTTAWATTAMIE',
 'POWESHIEK',
 'RINGGOL

In [93]:
# So we have 4 extra.

# only four so let's just manually fix them
iowa.loc[iowa['County'] == 'BUENA VIST', 'County'] = 'BUENA VISTA'
iowa.loc[iowa['County'] == 'CERRO GORD', 'County'] = 'CERRO GORDO'
iowa.loc[iowa['County'] == "OBRIEN", 'County'] = "O'BRIEN"
iowa.loc[iowa['County'] == 'POTTAWATTA', 'County'] = 'POTTAWATTAMIE'

In [94]:
# is it 99?
iowa['County'].nunique()

99

### [ ' Category ' ]  &  [ ' Category Name ' ]

In [95]:
# CATEGORY

# try standardizing letter case
iowa['Category Name'] = iowa['Category Name'].str.upper()
iowa['Category Name'].nunique()

93

In [96]:
iowa.groupby(['Category', 'Category Name'])['Category'].count()

Category   Category Name                  
1011100.0  BLENDED WHISKIES                   415
1011200.0  STRAIGHT BOURBON WHISKIES          589
1011250.0  SINGLE BARREL BOURBON WHISKIES       2
1011300.0  SINGLE BARREL BOURBON WHISKIES       9
           TENNESSEE WHISKIES                 107
                                             ... 
1101100.0  AMERICAN ALCOHOL                    15
1700000.0  TEMPORARY &  SPECIALTY PACKAGES      8
1701100.0  DECANTERS & SPECIALTY PACKAGES      15
           TEMPORARY & SPECIALTY PACKAGES      72
1901200.0  SPECIAL ORDER ITEMS                 10
Name: Category, Length: 105, dtype: int64

In [97]:
# I investigated this. They appear to have bought a barrel, so only one entry makes sense

iowa[iowa['Category Name'] =='AMERICAN WHISKIES']

Unnamed: 0,Date,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Vendor Name,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Store Name,Store Number


In [98]:
# look for duplicates
dummy = iowa['Category'].drop_duplicates()
sorted(dummy.astype('str'))

['1011100.0',
 '1011200.0',
 '1011250.0',
 '1011300.0',
 '1011400.0',
 '1011500.0',
 '1011600.0',
 '1011700.0',
 '1012100.0',
 '1012200.0',
 '1012210.0',
 '1012300.0',
 '1012400.0',
 '1022100.0',
 '1022200.0',
 '1022300.0',
 '1031080.0',
 '1031090.0',
 '1031100.0',
 '1031110.0',
 '1031200.0',
 '1032080.0',
 '1032100.0',
 '1032200.0',
 '1041100.0',
 '1041150.0',
 '1041200.0',
 '1041300.0',
 '1042100.0',
 '1051010.0',
 '1051100.0',
 '1051110.0',
 '1051120.0',
 '1051140.0',
 '1052010.0',
 '1052100.0',
 '1062050.0',
 '1062100.0',
 '1062200.0',
 '1062300.0',
 '1062310.0',
 '1062400.0',
 '1062500.0',
 '1071100.0',
 '1081000.0',
 '1081010.0',
 '1081015.0',
 '1081030.0',
 '1081100.0',
 '1081200.0',
 '1081210.0',
 '1081220.0',
 '1081230.0',
 '1081240.0',
 '1081250.0',
 '1081300.0',
 '1081305.0',
 '1081312.0',
 '1081315.0',
 '1081317.0',
 '1081330.0',
 '1081335.0',
 '1081340.0',
 '1081350.0',
 '1081355.0',
 '1081365.0',
 '1081370.0',
 '1081380.0',
 '1081390.0',
 '1081400.0',
 '1081500.0',
 '1081

In [99]:
# This category as one fewer character
iowa[iowa['Category'] == 101220.0]

Unnamed: 0,Date,City,Zip Code,County Number,County,Category,Category Name,Vendor Number,Vendor Name,Item Number,Item Description,Bottle Volume (ml),State Bottle Cost,State Bottle Retail,Bottles Sold,Sale (Dollars),Volume Sold (Liters),Store Name,Store Number


In [100]:
## the category name is missing 

In [101]:
# 101220.0 is scotch so let's fix that
# find the category number for scotch
iowa.loc[iowa['Category Name'] == 'SCOTCH WHISKIES', 'Category']


9032    1012200.0
2415    1012200.0
3935    1012200.0
4753    1012200.0
8626    1012200.0
          ...    
9100    1012200.0
8823    1012200.0
4220    1012200.0
6827    1012200.0
8423    1012200.0
Name: Category, Length: 193, dtype: float64

In [102]:
# fix the incorrect scotch label
iowa.loc[iowa['Category'] == 101220.0, 'Category'] = 1012200.0

In [103]:
# clean up missing category name
iowa['Category Name'] = iowa['Category Name'].fillna(0)

In [104]:
# clean up missing category
iowa['Category'] = iowa['Category'].fillna(0)

In [105]:
# condense categories
iowa['Category Name'] = iowa['Category Name'].str.replace('IMPORTED','')

In [106]:
# we filled nan values with 0 so we need to make those 0 --> '0'
iowa['Category Name'] = iowa['Category Name'].astype('str')

In [107]:
# let's clean up some Category names

# apply re to all column entries
def regularexpression(x):
    
    # apply the regular expression
    # removing all non alpha characters or spaces
    return re.sub(r'[^a-zA-Z ]', ' ', x)  

In [108]:
# Let's apply the RE to the category name column
iowa['Category Name'] = iowa['Category Name'].apply(lambda x: regularexpression(x))

In [109]:
# condense categories some more
iowa['Category Name'] = iowa['Category Name'].str.replace('AMERICAN ','')

iowa['Category Name'] = iowa['Category Name'].str.replace('IMPORTED','')

iowa['Category Name'] = iowa['Category Name'].str.replace('STRAIGHT','')

iowa['Category Name'] = iowa['Category Name'].str.replace('DRY','')

iowa['Category Name'] = iowa['Category Name'].str.replace('PROOF','')

iowa['Category Name'] = iowa['Category Name'].str.replace('GRAPE','')

In [110]:
## the nan column is now and empty string

### first look throughthere are a bunch of over laps
# create a map to swap-out / correct the spelling
CATE_MAP = {'' : 'EMPTY',
           'GINS' : 'GIN',
           'VODKAS' : 'VODKA',
           'LIQUEURS' : 'LIQUEUR',
           'SPIRITS' : 'SPIRIT',
           'WHISKIES' : 'WHISKY',
           'WHISKEY' : 'WHISKY',
           'VODKA  MISC' : 'VODKA FLAVORED',
           'TENNESSEE WHISKIES' : 'WHISKY',
           'AGAVE TEQUILA' : 'TEQUILA',
           'VODKA  CHERRY' : 'VODKA FLAVORED',
           'WHITE CREME DE CACAO' : 'LIQUEUR',
           'WHITE CREME DE MENTHE' : 'LIQUEUR'}

## there are some double spaces that will be taken care of last. Because we still might generate some more

In [111]:
#### map the correct spelllings onto their replacements
iowa['Category Name'] = iowa['Category Name'].replace(CATE_MAP.keys(), list(map(str, CATE_MAP.values())), regex=True)

In [112]:
# create a function to replace values

def type(x):
    #search through values to apply a new label
    if "IOWA" in x:
        return "IOWA LOCAL"
    elif "TEMPORARY" in x:
        return "SPECIAL PACKAGING"
    elif "HOLIDAY" in x:
        return "SPECIAL PACKAGING"
    elif 'SPECIALTY' in x:
        return "SPECIAL PACKAGING" 
    elif 'DISTILLED SPIRIT SPECIALTY' in x:
        return 'FLAVORED WHISKY'
    elif 'WHISKY LIQUEUR' in x:
        return 'FLAVORED WHISKY'
    elif "DELISTED" in x:
        return "DELISTED"
    elif "BRANDIES" in x:
        return "BRANDY"
    elif "SCHNAPPS" in x:
        return "SCHNAPPS"
    elif 'VODKA FLAVORED' in x:
        return 'FLAVORED VODKA'
    elif "FLAVORED" in x:
        return x
    elif 'ROCK  RYE' in x:
        return 'COCKTAILS RTD'
    elif "RYE" in x:
        return "RYE WHISKY"
    elif "CANADIAN WHISKY" in x:
        return x
    elif "CREME" in x:
        return "LIQUEUR"
    elif "SCOTCH" in x:
        return "SCOTCH"
    elif "BOURBON" in x:
        return "BOURBON WHISKY"
    elif "WHISK" in x:
        return "WHISKY"
    elif "LIQUEUR" in x:
        return "LIQUEUR"
    elif "AMARETTO" in x:
        return "LIQUEUR"
    elif 'TRIPLE SEC' in x:
        return "LIQUEUR"
    elif 'SLOE' in x:
        return "LIQUEUR"
    elif "RUM" in x:
        return "RUM"
    elif "MEZCAL" in x:
        return "MEZCAL" 
    elif "VODKA" in x:
        return "VODKA"
    elif "GIN" in x:
        return "GIN"
    elif 'NEUTRAL GRAIN SPIRIT FLAVORED' in x:
        return "FLAVORED WHISKY"
    elif "NEUTRAL" in x:
        return "NEUTRAL GRAIN"
    elif "ALCOHOL" in x:
        return "NEUTRAL GRAIN"
    elif "SPECIAL" in x:
        return "SPECIALTY"
    elif "COCKTAIL" in x:
        return 'COCKTAILS RTD'
    elif 'MIXTO TEQUILA' in x:
        return 'TEQUILA'
    elif 'ANISETTE' in x:
        return 'LIQUEUR'
    else:
        return x

In [113]:
# apply that function to create a new column
iowa['Category Name'] = iowa['Category Name'].apply(lambda x: type(x))

In [114]:
# remove the empty spaces
iowa['Category Name'] = iowa['Category Name'].str.strip()
iowa['Category Name'] = iowa['Category Name'].str.replace('  ',' ')

In [115]:
iowa = iowa[iowa['Category Name'] != 'EMPTY']
iowa = iowa[iowa['Category Name'] != 'HIGH BEER AMERICAN']
iowa = iowa[iowa['Category Name'] != 'DELISTED']

In [116]:
# look for easy over laps
dummy = iowa['Category Name'].drop_duplicates()
sorted(dummy.astype('str'))

['BOURBON WHISKY',
 'BRANDY',
 'CANADIAN WHISKY',
 'COCKTAILS RTD',
 'FLAVORED GIN',
 'FLAVORED RUM',
 'FLAVORED VODKA',
 'FLAVORED WHISKY',
 'GIN',
 'IOWA LOCAL',
 'LIQUEUR',
 'MEZCAL',
 'NEUTRAL GRAIN',
 'NEUTRAL GRAIN SPIRIT FLAVORED',
 'RUM',
 'RYE WHISKY',
 'SCHNAPPS',
 'SCOTCH',
 'SPECIAL PACKAGING',
 'SPECIALTY',
 'TEQUILA',
 'VODKA',
 'WHISKY',
 'nan']

In [117]:
iowa = iowa[iowa['Category Name'] != 'nan']

In [118]:
# where are we at now
iowa[['Category', 'Category Name']].nunique()

Category         87
Category Name    23
dtype: int64

In [119]:
# now to standardize the Category number

# duplicate method from earlier

# create a dummy df with just the category number and name that has all the combinations
dup_cate = iowa.drop_duplicates(subset = ['Category', 'Category Name'])[['Category', 'Category Name']]

# sort these number/ name combinations by occurances of category number 
# (we have more names than numbers, so numbers will have certainly have duplicates)
dup_cate = dup_cate.groupby('Category Name').count()

# ascending false because we want the larger numbers aka duplicates
dup_cate = dup_cate.sort_values('Category', ascending = False)

# remove singles, or non-duplicates
dup_cate = dup_cate[dup_cate['Category'] > 1 ].reset_index()

# we now have a list/ df of category numbers who have duplicate entries
dup_cate

Unnamed: 0,Category Name,Category
0,LIQUEUR,20
1,SCHNAPPS,16
2,VODKA,7
3,BRANDY,7
4,WHISKY,6
5,RUM,6
6,SPECIAL PACKAGING,5
7,BOURBON WHISKY,5
8,RYE WHISKY,3
9,SCOTCH,3


In [120]:
# list of category
category = iowa.groupby(['Category', 'Category Name']).count().reset_index()

# list of categories from the duplicates
wrong_names = category[category['Category Name'].isin(dup_cate['Category Name'])].sort_values('Category Name')[['Category','Category Name']]

# list of 'correct' names, most common name entry for each category
category_map = wrong_names.groupby('Category Name').max()[['Category']].reset_index()

In [121]:
# merge 'correct' names with the df, merged on category
iowa = pd.merge(left = iowa, right = category_map, left_on = 'Category Name', right_on = 'Category Name', how = 'left')

In [122]:
# now our dataframe has multiple category columns
# store_y is the cleaned names for duplicates
# fill in the missing values (the values where there was no duplication from the original name column)
iowa['Category'] = iowa['Category_y'].fillna(iowa['Category_x'])

# drop the extra category name column
iowa = iowa.drop(columns=['Category_x', 'Category_y'])

# where are we at now
iowa[['Category', 'Category Name']].nunique()

Category         22
Category Name    23
dtype: int64

### [ ' Vendor Number ' ]  &  [ ' Vendor Name ' ] 

In [123]:
# VENDOR NUMBER & NAME

# I am less interested in vendor name/number. 
iowa[['Vendor Number', 'Vendor Name']].nunique()

Vendor Number     95
Vendor Name      144
dtype: int64

In [124]:
#standardize 
iowa['Vendor Name'] = iowa['Vendor Name'].str.upper()

In [125]:
# now to standardize the Vendors

# duplicate method from earlier

# create a dummy df with just the vendors that has all the combinations
dup_vend = iowa.drop_duplicates(subset = ['Vendor Number', 'Vendor Name'])[['Vendor Number', 'Vendor Name']]

# sort these number/ name combinations by occurances of vendor number 
# (we have more names than numbers, so numbers will have certainly have duplicates)
dup_vend = dup_vend.groupby('Vendor Number').count()

# ascending false because we want the larger numbers aka duplicates
dup_vend = dup_vend.sort_values('Vendor Name', ascending = False)

# remove singles, or non-duplicates
dup_vend = dup_vend[dup_vend['Vendor Name'] > 1 ].reset_index()

# we now have a list/ df of category numbers who have duplicate entries
dup_vend

Unnamed: 0,Vendor Number,Vendor Name
0,469.0,3
1,322.0,3
2,389.0,3
3,154.0,3
4,35.0,2
5,370.0,2
6,285.0,2
7,297.0,2
8,300.0,2
9,301.0,2


In [126]:
# list of vendors
vendor = iowa.groupby(['Category', 'Category Name']).count().reset_index()

# list of vendors from the duplicates
wrong_names = vendor[vendor['Vendor Number'].isin(dup_vend['Vendor Number'])].sort_values('Vendor Number')[['Vendor Number','Vendor Name']]

# list of 'correct' names, most common name entry for each vendor
vendor_map = wrong_names.groupby('Vendor Number').max()[['Vendor Name']].reset_index()

In [127]:
# merge 'correct' names with the df, merged on vendor
iowa = pd.merge(left = iowa, right = vendor_map, left_on = 'Vendor Number', right_on = 'Vendor Number', how = 'left')

In [128]:
# now our dataframe has multiple category columns
# store_y is the cleaned names for duplicates
# fill in the missing values (the values where there was no duplication from the original name column)
iowa['Vendor Name'] = iowa['Vendor Name_y'].fillna(iowa['Vendor Name_x'])

# drop the extra category name column
iowa = iowa.drop(columns=['Vendor Name_x', 'Vendor Name_y'])

# where are we at now
iowa[['Vendor Number', 'Vendor Name']].nunique()

Vendor Number     95
Vendor Name      133
dtype: int64

In [129]:
# Remove the 7 nulls
iowa = iowa[iowa['Vendor Name'] != np.nan]

In [130]:
# manual cleaning
iowa['Vendor Name'] = iowa['Vendor Name'].str.replace('&', 'AND')

In [131]:
# there seems to be a lot of companies with '/' in their 
iowa['Vendor Name'] = iowa['Vendor Name'].str.split("/", expand=True)[0]

In [132]:
iowa['Vendor Name'] = iowa['Vendor Name'].astype('str')

In [133]:
# let's give regular expression a try
iowa['Vendor Name'] = iowa['Vendor Name'].apply(lambda x: regularexpression(x))

In [134]:
iowa['Vendor Name'] = iowa['Vendor Name'].str.replace('LLC', '')

In [135]:
iowa['Vendor Name'] = iowa['Vendor Name'].str.replace('INC', '')

In [136]:
iowa['Vendor Name'] = iowa['Vendor Name'].str.replace('CORPORATION', 'CO')

In [137]:
iowa['Vendor Name'] = iowa['Vendor Name'].str.replace('CORP', 'CO')

In [138]:
iowa['Vendor Name'] = iowa['Vendor Name'].str.replace('COMPANY', 'CO')

In [139]:
iowa['Vendor Name'] = iowa['Vendor Name'].str.replace('U S A','USA')

In [140]:
iowa['Vendor Name'] = iowa['Vendor Name'].str.replace('LIMITED L','')

In [141]:
iowa['Vendor Name'] = iowa['Vendor Name'].str.replace('SPIRITS','SPIRIT')

In [142]:
iowa['Vendor Name'] = iowa['Vendor Name'].str.replace('LTD','')

In [143]:
# remove the empty spaces
iowa['Vendor Name'] = iowa['Vendor Name'].str.strip()
iowa['Vendor Name'] = iowa['Vendor Name'].str.replace('  ',' ')

In [144]:
iowa = iowa.dropna()

### [ ' Item Number ' ]

In [145]:
# ITEM NUMBER

iowa['Item Number'].nunique()

1900

### [ ' Item Description ' ]    &    [ ' Bottles ' ]

In [146]:
# ITEM DESCRIPTION

iowa['Item Description'].nunique()

1691

In [147]:
# BOTTLE VOLUME (mL)
iowa['Bottle Volume (ml)'].nunique()

22

In [148]:
iowa['Bottle Volume (ml)'].unique()

array([  750,   375,  1750,  1000,   500,   600,   100,    50,  3000,
         200,   300,   800, 31500,   950,  1950,  2400,  1800,   502,
         850,  4800,  1200,   400])

In [149]:
# STATE BOTTLE COST
iowa['State Bottle Cost'].describe()

count    9922.000000
mean       10.143270
std         7.657424
min         0.000000
25%         5.510000
50%         8.250000
75%        12.500000
max       265.470000
Name: State Bottle Cost, dtype: float64

In [150]:
# STATE BOTTLE RETAIL
iowa['State Bottle Retail'].describe()

count    9922.000000
mean       15.225435
std        11.486532
min         0.000000
25%         8.270000
50%        12.380000
75%        18.750000
max       398.210000
Name: State Bottle Retail, dtype: float64

In [151]:
iowa['State Bottle Retail'].nunique()

1153

In [152]:
# BOTTLES SOLD
iowa['Bottles Sold']

0       12
1        6
2       24
3        6
4       12
        ..
9917     6
9918    12
9919     6
9920    12
9921    12
Name: Bottles Sold, Length: 9922, dtype: int64

In [153]:
# SALE (DOLLARS)
iowa['Sale (Dollars)']

0        60.12
1       134.94
2       110.40
3        99.00
4        96.00
         ...  
9917     64.80
9918    188.88
9919     58.50
9920    280.32
9921    117.00
Name: Sale (Dollars), Length: 9922, dtype: float64

In [154]:
# VOLUME SOLD (LITERS)
iowa['Volume Sold (Liters)'].nunique()

125

In [155]:
iowa = iowa.drop(columns=['Vendor Number', 'County Number', 'Store Number'])

In [156]:
# save progress as csv
#iowa.to_csv('iowa_clean.csv', index=False)