# Steps: 
1. Determine which variables to use
2. Cut down the size of the dataframe (maybe take the top 10 breeds and make dummies, or aggregate, e.g. # of pit bulls per month)
3. Create multivariate dataframe, including timestamp index
4. Scale, series_to_supervised
5. Split into train/test sets, fit the model
6. Use univariate LSTM, tune to lowest MSE
7. Compare univariate to multivariate scores, make analysis
8. Save model and fit into function
9. Write new prediction function? 
10. Conclude, interpret
11. Make more visualizations
12. Blog about it
13. Where do we go from here? 
14. What is the "so what"? 
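
Step 4 can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the `series_to_supervised` helper here is a simplified stand-in written for illustration, the min-max scaling is done by hand in pandas, and the `toy` frame holds made-up numbers.

```python
import pandas as pd

def series_to_supervised(frame: pd.DataFrame, n_in: int = 1) -> pd.DataFrame:
    """Pair each row with the n_in rows before it (lag features)."""
    pieces = []
    # Lagged inputs, oldest first: t-n_in ... t-1
    for lag in range(n_in, 0, -1):
        pieces.append(frame.shift(lag).add_suffix(f"(t-{lag})"))
    # The current row supplies the values to predict
    pieces.append(frame.add_suffix("(t)"))
    # The first n_in rows have no full history, so drop them
    return pd.concat(pieces, axis=1).dropna()

# Made-up daily counts, for illustration only
toy = pd.DataFrame(
    {"total": [10, 12, 9, 15, 11], "mixed": [3, 4, 2, 5, 3]},
    index=pd.date_range("2015-01-01", periods=5, freq="D"),
)

# Min-max scale every column to the range [0, 1]
scaled = (toy - toy.min()) / (toy.max() - toy.min())

supervised = series_to_supervised(scaled, n_in=2)
print(supervised)
```

In a real run the scaling minimum and maximum would normally be taken from the training split only, so that the test period does not leak into the inputs.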

In [19]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv('data/raw_data.csv')

In [4]:
df.set_index('ValidDate', inplace = True)

In [5]:
df = df.drop(['ExpYear'], axis = 1)

In [6]:
df.head()

ValidDate,LicenseType,Breed,Color,DogName,OwnerZip
2014-12-02 09:40:53,Dog Individual Neutered Male,COCKAPOO,BROWN,CHARLEY,15236
2014-12-02 09:45:25,Dog Senior Citizen or Disability Neutered Male,GER SHEPHERD,BLACK/BROWN,TACODA,15238
2014-12-02 09:47:55,Dog Individual Spayed Female,GER SHEPHERD,BLACK,CHARLY,15205
2014-12-02 10:02:33,Dog Individual Spayed Female,LABRADOR RETRIEVER,BLACK,ABBEY,15143
2014-12-02 10:05:50,Dog Individual Female,GER SHORTHAIR POINT,BROWN,CHARLEY,15228


In [7]:
df['Breed'].value_counts().sum()

286724

In [None]:
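#I would definitely need to cut that number down: 286,724 license rows spread over every breed would be way too noisy.
#(value_counts().sum() counts the rows that have a Breed value, not the number of distinct breeds.)
#My initial thought is to see what the top 10 breeds are.
#Maybe from there I can either do dummies or aggregate.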
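
The difference between counting rows and counting distinct breeds, and how much of the data a "top N" cut would keep, can be checked along these lines. This is a sketch on a made-up `toy` frame; the variable names are illustrative and not from the notebook.

```python
import pandas as pd

# Made-up rows standing in for the license data
toy = pd.DataFrame(
    {"Breed": ["MIXED", "MIXED", "LAB MIX", "BEAGLE", "MIXED", "LAB MIX", None]}
)

counts = toy["Breed"].value_counts()

n_rows_with_breed = counts.sum()            # rows that have a Breed value
n_distinct_breeds = toy["Breed"].nunique()  # number of different breed labels

# Share of rows that the two most common breeds account for
top_two = counts.head(2)
coverage = top_two.sum() / n_rows_with_breed

print(n_rows_with_breed, n_distinct_breeds, round(coverage, 3))
```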

In [8]:
# Slice 11 rows rather than 10: one entry ('TAG') is not a breed and is left out later
top_ten = df['Breed'].value_counts()[:11]

In [9]:
top_ten
#bar chart here

MIXED                  29009
LABRADOR RETRIEVER     19713
LAB MIX                17714
GOLDEN RETRIEVER        9344
GER SHEPHERD            8437
SHIH TZU                7976
BEAGLE                  7960
CHIHUAHUA               7664
TAG                     7475
AM PIT BULL TERRIER     7332
YORKSHIRE TERRIER       6268
Name: Breed, dtype: int64

In [None]:
#TAG is not a breed. These rows appear to record a tag being issued for a dog that is already licensed.

In [10]:
tag_df = df.loc[df['Breed'] == 'TAG']
tag_df.loc[(tag_df['DogName'] == 'SHADOW') & (tag_df['OwnerZip'] == 15102)]

ValidDate,LicenseType,Breed,Color,DogName,OwnerZip
2015-03-12 11:11:36,Dog Individual Spayed Female,TAG,BLACK,SHADOW,15102
2017-01-10 09:39:46,Dog Individual Spayed Female,TAG,BLACK,SHADOW,15102
2015-12-11 10:35:08,Dog Individual Spayed Female,TAG,BLACK,SHADOW,15102


In [11]:
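# TAG rows minus distinct dog names = rows whose name already appears in another TAG row
n_tag_rows = len(tag_df['DogName'])
n_unique_names = len(tag_df['DogName'].value_counts())
n_tag_rows - n_unique_names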

5708
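
A name alone is a loose key, since many dogs share a name. A slightly tighter check, sketched below on made-up rows, counts repeats of the (DogName, OwnerZip) pair instead. Even a matching pair does not prove it is the same dog, so treat the result as a hint; `toy_tags` and the other names are illustrative.

```python
import pandas as pd

# Made-up TAG rows, for illustration only
toy_tags = pd.DataFrame(
    {
        "Breed": ["TAG"] * 5,
        "DogName": ["SHADOW", "SHADOW", "SHADOW", "BELLA", "MAX"],
        "OwnerZip": [15102, 15102, 15102, 15237, 15102],
    }
)

# Rows per (name, ZIP) pair; pairs seen more than once suggest repeat tags
pair_counts = toy_tags.groupby(["DogName", "OwnerZip"]).size()
repeats = pair_counts[pair_counts > 1]

# Rows beyond the first occurrence of each pair
extra_rows = toy_tags.duplicated(subset=["DogName", "OwnerZip"]).sum()

print(repeats)
print(extra_rows)
```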

In [12]:
# The ten real breeds from the counts above ('TAG' is left out)
top_ten_breeds = ['MIXED', 'LABRADOR RETRIEVER', 'LAB MIX', 'GOLDEN RETRIEVER',
                  'GER SHEPHERD', 'SHIH TZU', 'BEAGLE', 'CHIHUAHUA',
                  'YORKSHIRE TERRIER', 'AM PIT BULL TERRIER']
top_ten_df = df[df['Breed'].isin(top_ten_breeds)]

In [None]:
#In the univariate model, we summed the total licenses per day. I think we could sum each breed per day,
#and also include a column that shows the % of dogs per day that were top 10
#Let's start by summing the top 10 per day
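
The "% of dogs per day that were top 10" column could be built roughly like this. It is a sketch on made-up rows: `top_breeds` stands in for the ten-breed list, and the index is parsed with `pd.to_datetime` first, on the assumption that `ValidDate` comes out of `read_csv` as plain strings.

```python
import pandas as pd

# Made-up license rows, for illustration only
toy = pd.DataFrame(
    {"Breed": ["MIXED", "BEAGLE", "PUG", "MIXED", "POODLE", "PUG"]},
    index=pd.to_datetime([
        "2014-12-02 09:40:53", "2014-12-02 10:02:33", "2014-12-02 11:25:48",
        "2014-12-03 08:48:30", "2014-12-03 08:48:32", "2014-12-03 09:06:37",
    ]),
)
top_breeds = ["MIXED", "BEAGLE"]  # stand-in for the ten-breed list

day = toy.index.normalize()             # drop the time of day
is_top = toy["Breed"].isin(top_breeds)  # True where the row is a top breed

daily = pd.DataFrame({
    "total": toy.groupby(day).size(),  # all licenses that day
    "top": is_top.groupby(day).sum(),  # licenses for top breeds that day
})
daily["pct_top"] = 100 * daily["top"] / daily["total"]
print(daily)
```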

In [17]:
# .copy() gives an independent frame, so adding a column does not raise SettingWithCopyWarning
mixed = top_ten_df.loc[top_ten_df['Breed'] == 'MIXED'].copy()
mixed['Sum'] = 1  # integer code for the breed, not a count
mixed.head()

ValidDate,LicenseType,Breed,Color,DogName,OwnerZip,Sum
2014-12-02 11:25:48,Dog Individual Neutered Male,MIXED,BROWN,SPAULDING,15120,1
2014-12-02 11:29:03,Dog Individual Neutered Male,MIXED,BROWN,ROCKY,15227,1
2014-12-02 11:48:53,Dog Senior Citizen or Disability Neutered Male,MIXED,WHITE,BUDDY,15229,1
2014-12-03 08:48:30,Dog Individual Neutered Male,MIXED,BROWN,DEWEY,15216,1
2014-12-03 08:48:32,Dog Individual Neutered Male,MIXED,BLACK,ROXIE,15145,1


In [18]:
# .copy() gives an independent frame, so adding a column does not raise SettingWithCopyWarning
lab = top_ten_df.loc[top_ten_df['Breed'] == 'LABRADOR RETRIEVER'].copy()
lab['Sum'] = 2  # integer code for the breed, not a count
lab.head()

ValidDate,LicenseType,Breed,Color,DogName,OwnerZip,Sum
2014-12-02 10:02:33,Dog Individual Spayed Female,LABRADOR RETRIEVER,BLACK,ABBEY,15143,2
2014-12-02 11:27:08,Dog Individual Male,LABRADOR RETRIEVER,YELLOW,BAILEY,15065,2
2014-12-02 11:29:37,Dog Individual Spayed Female,LABRADOR RETRIEVER,BLACK,GYPSY,15209,2
2014-12-03 08:48:32,Dog Individual Female,LABRADOR RETRIEVER,BROWN,ROXY,15237,2
2014-12-03 08:48:36,Dog Individual Spayed Female,LABRADOR RETRIEVER,YELLOW,DAISY,15143,2


In [21]:
lab_mix = top_ten_df.loc[top_ten_df['Breed'] == 'LAB MIX'].copy()
lab_mix['Sum'] = 3
lab_mix.head()

ValidDate,LicenseType,Breed,Color,DogName,OwnerZip,Sum
2014-12-02 10:53:56,Dog Individual Neutered Male,LAB MIX,RUST,RUSTY,15102,3
2014-12-02 13:29:21,Dog Senior Citizen or Disability Spayed Female,LAB MIX,BROWN,NALA,15104,3
2014-12-02 15:24:45,Dog Individual Spayed Female,LAB MIX,BLACK,WILLOW,15101,3
2014-12-03 09:06:37,Dog Individual Neutered Male,LAB MIX,SPOTTED,PATCHES,15108,3
2014-12-03 09:42:40,Dog Individual Neutered Male,LAB MIX,BLACK,JACK,15102,3


In [22]:
golden = top_ten_df.loc[top_ten_df['Breed'] == 'GOLDEN RETRIEVER'].copy()
golden['Sum'] = 4

In [23]:
ger_shepherd = top_ten_df.loc[top_ten_df['Breed'] == 'GER SHEPHERD'].copy()
ger_shepherd['Sum'] = 5

In [24]:
shitzu = top_ten_df.loc[top_ten_df['Breed'] == 'SHIH TZU'].copy()
shitzu['Sum'] = 6

In [25]:
beagle = top_ten_df.loc[top_ten_df['Breed'] == 'BEAGLE'].copy()
beagle['Sum'] = 7

In [26]:
chihuahua = top_ten_df.loc[top_ten_df['Breed'] == 'CHIHUAHUA'].copy()
chihuahua['Sum'] = 8

In [27]:
yorkie = top_ten_df.loc[top_ten_df['Breed'] == 'YORKSHIRE TERRIER'].copy()
yorkie['Sum'] = 9

In [28]:
pittie = top_ten_df.loc[top_ten_df['Breed'] == 'AM PIT BULL TERRIER'].copy()
pittie['Sum'] = 10

In [34]:
dogs = (mixed, lab, lab_mix, golden, ger_shepherd, shitzu, beagle, chihuahua, yorkie, pittie)
coded = pd.concat(dogs)
coded['Sum'].value_counts()

1     29009
2     19713
3     17714
4      9344
5      8437
6      7976
7      7960
8      7664
10     7332
9      6268
Name: Sum, dtype: int64
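
The same encoding can be produced in one step with a lookup dictionary and `Series.map`, instead of one cell per breed. This is a sketch on a made-up frame: the codes mirror the ones assigned above, while `breed_codes`, `toy` and `coded_toy` are illustrative names.

```python
import pandas as pd

# Same breed-to-code pairs as assigned cell by cell in the notebook
breed_codes = {
    "MIXED": 1, "LABRADOR RETRIEVER": 2, "LAB MIX": 3, "GOLDEN RETRIEVER": 4,
    "GER SHEPHERD": 5, "SHIH TZU": 6, "BEAGLE": 7, "CHIHUAHUA": 8,
    "YORKSHIRE TERRIER": 9, "AM PIT BULL TERRIER": 10,
}

# Made-up rows, for illustration only
toy = pd.DataFrame({"Breed": ["MIXED", "LAB MIX", "TAG", "BEAGLE", "MIXED"]})

# Keep only rows whose breed has a code, then look the code up
keep = toy["Breed"].isin(list(breed_codes))
coded_toy = toy[keep].assign(Sum=lambda d: d["Breed"].map(breed_codes))

print(coded_toy)
```

Unlike `pd.concat` over ten separate frames, this keeps the rows in their original order.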

In [None]:
#I now have an encoded dataframe: 1 = mixed, 2 = lab, etc.
#I still need to figure out how to use this to get a count of each breed for each day.
#I hope I didn't do this for nothing.
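
One possible way to get from the encoded frame to one row per day with a count per breed is a group-by on the calendar day and the breed, followed by `unstack`. The sketch below uses made-up rows and assumes the index has been converted to datetimes. Days with no licenses are filled with zeros so the series has no gaps.

```python
import pandas as pd

# Made-up slice of an encoded frame, for illustration only
toy_coded = pd.DataFrame(
    {
        "Breed": ["MIXED", "MIXED", "LAB MIX", "MIXED", "LAB MIX"],
        "Sum": [1, 1, 3, 1, 3],
    },
    index=pd.to_datetime([
        "2014-12-02 11:25:48", "2014-12-02 11:29:03", "2014-12-02 10:53:56",
        "2014-12-04 08:48:30", "2014-12-04 09:06:37",
    ]),
)

day = toy_coded.index.normalize()  # drop the time of day

daily_counts = (
    toy_coded.groupby([day, "Breed"]).size()  # licenses per day and breed
    .unstack(fill_value=0)                    # one column per breed
    .asfreq("D", fill_value=0)                # add missing days as zeros
)
print(daily_counts)
```

Grouping on `"Sum"` instead of `"Breed"` would give the same table with the integer codes as column names.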