# Data Analysis

Ryry :)

### Data analysis
1. Look at columns, graph out variance of columns (min/max) with .describe
    * Document output of .describe on each column
2. -> Find which columns hold a finite set of values vs. continuous values, to get to know the data better
    * Encode subcategories of categorical columns
    * Bar chart of how often a given hashed value appears
3. Scatter plot to find outliers
    * Hypothesize about what I can feed a model
4. Find best application of random forest on raw data
5. Then run a neural net on cleaned data
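
The first two steps of the plan above can be sketched roughly as follows. This is a minimal sketch, not the notebook's own code: `summarize_columns` and its `max_categories` cutoff are hypothetical names chosen for illustration, and the toy frame only stands in for the real sample.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame, max_categories: int = 50) -> pd.DataFrame:
    """Return one row per column: dtype, unique count, and a rough kind label."""
    rows = []
    for col in df.columns:
        n_unique = df[col].nunique()
        # Assumption: a column with few distinct values is treated as finite/categorical.
        kind = "finite" if n_unique <= max_categories else "continuous-like"
        rows.append({"column": col, "dtype": str(df[col].dtype),
                     "n_unique": n_unique, "kind": kind})
    return pd.DataFrame(rows).set_index("column")


# Toy data standing in for the real sample.
toy = pd.DataFrame({"click": [0, 1, 0, 0],
                    "site_id": ["a", "b", "a", "c"],
                    "value": [0.1, 0.5, 0.9, 1.3]})
summary = summarize_columns(toy, max_categories=3)
print(toy.describe())
print(summary)
```

The cutoff is a judgment call; a column can be integer-typed and still be purely categorical, so the unique count is only a first hint.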


### Experiment with different models on cleaned data
1. Start small — try one model with one feature to predict click
    * Run a NN on one feature to predict click
    * Then try another feature, and so on
2. Try 10 different models
    * SGD (stochastic gradient descent) classifier; look up different TensorFlow models, etc.
    * Document the value of each model
        * We might not find a perfect model, but we will know what was tried. Make a graph (width of the network vs. depth of the network vs. loss of the network)
        * Describe a line through time that shows how tweaking different hyperparameters affects the model (doing data discovery by letting the computer do it for us while we tweak the hyperparameters)
        * Document everything and ask questions like:
            * When does the model start overfitting?
            * What happens if my neural net is 100 neurons wide vs. 10 neurons wide? Compare.
            * Try five layers deep vs. one
            * In each comparison, was there any value? Is there a point of diminishing returns?
            * Try different GPUs, etc.
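
The width-vs-depth comparison described above could be organized along these lines. This is a sketch under assumptions: it uses scikit-learn's `MLPClassifier` on synthetic data rather than a TensorFlow model on the real features, and the grid values (10/100 wide, 1/5 deep) are taken from the questions in the list.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption: the real inputs would be the cleaned features).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

results = []
for width in (10, 100):
    for depth in (1, 5):
        # One hidden layer of `width` units, repeated `depth` times.
        model = MLPClassifier(hidden_layer_sizes=(width,) * depth,
                              max_iter=300, random_state=0)
        model.fit(X_tr, y_tr)
        results.append({
            "width": width,
            "depth": depth,
            "train_loss": log_loss(y_tr, model.predict_proba(X_tr)),
            "val_loss": log_loss(y_va, model.predict_proba(X_va)),
        })

for row in results:
    print(row)
```

Recording both losses per configuration makes the documentation step easy: a training loss that keeps falling while the validation loss rises is one common sign of overfitting.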

### Informed feature engineering (when we know something about the ad space)
Avoid feature engineering unless we have domain knowledge of the features. Let the models do the work first: emphasize normalizing/standardizing the data and controlling/understanding column variance to get better model performance before any feature engineering.

We can feature engineer things like banners, since we see ad banners every day. Try this:
1. Use K-Means on ads — find natural divisions in banners 
2. Try to find clusters of information around ads
3. Make classes for clusters of ads (i.e., square ads, long rectangular banners, etc.)
    * Make scatterplots of the heights, widths of ad banners/size
    * Turn those into one column: make ordinal 'types' of banners
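
A minimal sketch of the banner-clustering idea above, assuming scikit-learn's `KMeans` and a handful of made-up (width, height) pairs; the sizes are hypothetical and not taken from the dataset, and the choice of two clusters is only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (width, height) pairs: three wide strips and three boxy banners.
sizes = np.array([[320, 50], [320, 48], [300, 50],
                  [300, 250], [336, 280], [320, 250]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
banner_type = kmeans.fit_predict(sizes)  # one cluster id per banner

print(banner_type)
print(kmeans.cluster_centers_)
```

The cluster ids are arbitrary labels, so treating them as an ordinal 'type' column only makes sense after ordering the clusters by something meaningful, such as area.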

Subdivide by website — see if we can be more predictive based on the destination site_id, etc.
* Encode unique groups of hashed values of site_id column (one hot encoder)
* Then test; we may discover that, say, "site_a" gives higher accuracy than other sites, and so on.
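
As a rough sketch of subdividing by site, the snippet below computes a per-site click rate and one-hot encodes the site column. The site values `a`, `b`, `c` are made-up stand-ins for the hashed ids.

```python
import pandas as pd

# Toy rows; the site ids are made up.
toy = pd.DataFrame({
    "site_id": ["a", "a", "b", "b", "b", "c"],
    "click":   [1, 1, 0, 1, 0, 0],
})

# Click-through rate and number of impressions per site.
per_site = toy.groupby("site_id")["click"].agg(ctr="mean", impressions="count")
print(per_site.sort_values("ctr", ascending=False))

# One-hot encode the site_id column (one 0/1 column per distinct site).
site_dummies = pd.get_dummies(toy["site_id"], prefix="site", dtype=int)
print(site_dummies.head())
```

One-hot encoding creates a column per distinct value, so a high-cardinality column like site_id produces a very wide frame; grouping rare sites together first is one way to keep that manageable.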

Note: we don't need a recurrent neural net or a convolutional neural net; Josh recommends a basic neural net.

### References
 - CTR: https://towardsdatascience.com/mobile-ads-click-through-rate-ctr-prediction-44fdac40c6ff
 - the sage wisdom of AMLI instructors & TAs

## Import Libraries and Data

In [129]:
import pandas as pd
import numpy as np
import multiprocessing as mp
import psutil
import random
import datetime as datetime
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [67]:
# read in the Avazu - criteo labs - csv file
# rand_sample_csv is a randomized subset (1% the size) of the sample_csv which is ~400k instances 

df = pd.read_csv('rand_sample_eng.csv')


# Data Exploration

In [68]:
df.head(5)

Unnamed: 0.1,Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,...,C15,C16,C17,C18,C19,C20,C21,new_date,new_time,day_of_week
0,0,10004510652136496837,0,14102100,1005,0,543a539e,c7ca3108,3e814130,ecad2386,...,320,50,2333,0,39,-1,157,2014-10-21,00:00:00,1
1,1,10007164336863914220,1,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,...,320,50,1722,0,35,-1,79,2014-10-21,00:00:00,1
2,2,10076859283156800622,0,14102100,1002,0,f17ebd97,c4e18dd6,50e219e0,ecad2386,...,216,36,2497,3,43,100151,42,2014-10-21,00:00:00,1
3,3,10078825124049580646,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,...,320,50,1722,0,35,-1,79,2014-10-21,00:00:00,1
4,4,10085233430943183912,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,...,320,50,1722,0,35,-1,79,2014-10-21,00:00:00,1


In [69]:
df.describe()

Unnamed: 0.1,Unnamed: 0,id,click,hour,C1,banner_pos,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21,day_of_week
count,404410.0,404410.0,404410.0,404410.0,404410.0,404410.0,404410.0,404410.0,404410.0,404410.0,404410.0,404410.0,404410.0,404410.0,404410.0,404410.0,404410.0
mean,202204.5,9.213896e+18,0.169991,14102560.0,1004.967372,0.288638,1.014695,0.331688,18844.936193,318.795856,60.076581,2112.733881,1.431426,227.89114,53239.462209,83.304955,2.601259
std,116743.255518,5.319411e+18,0.375625,296.8127,1.090207,0.504033,0.523959,0.855877,4947.526554,20.667022,47.023691,607.929983,1.325359,351.686105,49955.39177,70.251537,1.727362
min,0.0,73068130000000.0,0.0,14102100.0,1001.0,0.0,0.0,0.0,375.0,120.0,20.0,112.0,0.0,33.0,-1.0,1.0,0.0
25%,101102.25,4.60724e+18,0.0,14102300.0,1005.0,0.0,1.0,0.0,16920.0,320.0,50.0,1863.0,0.0,35.0,-1.0,23.0,1.0
50%,202204.5,9.218443e+18,0.0,14102600.0,1005.0,0.0,1.0,0.0,20346.0,320.0,50.0,2323.0,2.0,39.0,100048.0,61.0,2.0
75%,303306.75,1.3822e+19,0.0,14102810.0,1005.0,1.0,1.0,0.0,21893.0,320.0,50.0,2526.0,3.0,171.0,100086.0,101.0,4.0
max,404409.0,1.844673e+19,1.0,14103020.0,1012.0,7.0,5.0,5.0,24043.0,1024.0,1024.0,2757.0,3.0,1839.0,100248.0,255.0,6.0


In [70]:
df.shape # this sample has 404,410 rows of data with 28 columns

(404410, 28)

In [71]:
df.dtypes
# avazu: "all integer features are categorical variables, all IDs, no numerical meaning"

Unnamed: 0           int64
id                  uint64
click                int64
hour                 int64
C1                   int64
banner_pos           int64
site_id             object
site_domain         object
site_category       object
app_id              object
app_domain          object
app_category        object
device_id           object
device_ip           object
device_model        object
device_type          int64
device_conn_type     int64
C14                  int64
C15                  int64
C16                  int64
C17                  int64
C18                  int64
C19                  int64
C20                  int64
C21                  int64
new_date            object
new_time            object
day_of_week          int64
dtype: object

In [72]:
# what do the columns mean?

 - id: ad identifier
 - click: 0/1 for non-click/click
 - hour: format is YYMMDDHH
 - C1 — anonymized categorical variable
 - banner_pos
 - site_id
 - site_domain
 - site_category
 - app_id
 - app_domain
 - app_category
 - device_id
 - device_ip
 - device_model
 - device_type
 - device_conn_type
 - C14-C21 — anonymized categorical variables

# Data Preprocessing

In [73]:
# Unnamed columns are created when a dataframe is written to a csv with its index and read back in.
# 'Unnamed: 0', 'Unnamed: 0.1' are row indexes which were written out as columns.
df_new = df.drop(['Unnamed: 0'], axis=1) 

In [74]:
df_new.shape

(404410, 27)

In [75]:
# check for missing values

df_new.isnull().sum()


id                  0
click               0
hour                0
C1                  0
banner_pos          0
site_id             0
site_domain         0
site_category       0
app_id              0
app_domain          0
app_category        0
device_id           0
device_ip           0
device_model        0
device_type         0
device_conn_type    0
C14                 0
C15                 0
C16                 0
C17                 0
C18                 0
C19                 0
C20                 0
C21                 0
new_date            0
new_time            0
day_of_week         0
dtype: int64

In [76]:
# count the unique values in each column of df_new, looking for inconsistencies

for col in df_new.columns.values:
    total = len(df_new[col].unique())
    print(str(col) + " " + "total: " + str(total))

id total: 404410
click total: 2
hour total: 240
C1 total: 7
banner_pos total: 7
site_id total: 2195
site_domain total: 2172
site_category total: 21
app_id total: 2305
app_domain total: 153
app_category total: 26
device_id total: 64913
device_ip total: 262453
device_model total: 4369
device_type total: 5
device_conn_type total: 4
C14 total: 2067
C15 total: 8
C16 total: 9
C17 total: 415
C18 total: 4
C19 total: 65
C20 total: 159
C21 total: 60
new_date total: 10
new_time total: 24
day_of_week total: 7


The values in these columns are hashed versions of original IDs.
Hashing was done to anonymize the services contributing ad data to this dataset.
For illustrative/descriptive purposes we will treat each hashed value as a name in a fictional context (e.g., '7801e8d9' = 'www.overstock.com'). (Thank you Naomi!)

# Feature Engineering
## Hour & Date 

In [77]:
# check hour column data type
df_new.hour.dtype

dtype('int64')

In [78]:
# parse the YYMMDDHH integers into timestamps so date and time can be separated
# (pd.datetime is deprecated and removed in recent pandas, so use pd.to_datetime)
df_new['new_hour'] = pd.to_datetime(df_new.hour.astype(str), format='%y%m%d%H')
df_new['new_hour']

0        2014-10-21 00:00:00
1        2014-10-21 00:00:00
2        2014-10-21 00:00:00
3        2014-10-21 00:00:00
4        2014-10-21 00:00:00
5        2014-10-21 00:00:00
6        2014-10-21 00:00:00
7        2014-10-21 00:00:00
8        2014-10-21 00:00:00
9        2014-10-21 00:00:00
10       2014-10-21 00:00:00
11       2014-10-21 00:00:00
12       2014-10-21 00:00:00
13       2014-10-21 00:00:00
14       2014-10-21 00:00:00
15       2014-10-21 00:00:00
16       2014-10-21 00:00:00
17       2014-10-21 00:00:00
18       2014-10-21 00:00:00
19       2014-10-21 00:00:00
20       2014-10-21 00:00:00
21       2014-10-21 00:00:00
22       2014-10-21 00:00:00
23       2014-10-21 00:00:00
24       2014-10-21 00:00:00
25       2014-10-21 00:00:00
26       2014-10-21 00:00:00
27       2014-10-21 00:00:00
28       2014-10-21 00:00:00
29       2014-10-21 00:00:00
                 ...        
404380   2014-10-30 23:00:00
404381   2014-10-30 23:00:00
404382   2014-10-30 23:00:00
404383   2014-

In [79]:
# check that column 'new_hour' was created and parsed to datetime
df_new.head(3)

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,C16,C17,C18,C19,C20,C21,new_date,new_time,day_of_week,new_hour
0,10004510652136496837,0,14102100,1005,0,543a539e,c7ca3108,3e814130,ecad2386,7801e8d9,...,50,2333,0,39,-1,157,2014-10-21,00:00:00,1,2014-10-21
1,10007164336863914220,1,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,50,1722,0,35,-1,79,2014-10-21,00:00:00,1,2014-10-21
2,10076859283156800622,0,14102100,1002,0,f17ebd97,c4e18dd6,50e219e0,ecad2386,7801e8d9,...,36,2497,3,43,100151,42,2014-10-21,00:00:00,1,2014-10-21


In [80]:
#confirm dtype of new_hour
df_new.new_hour.dtype

dtype('<M8[ns]')

In [81]:
# create date & time columns from the parsed new_hour column
df_new['date'] = [d.date() for d in df_new['new_hour']]
df_new['time'] = [d.time() for d in df_new['new_hour']]
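
For comparison, the same split can be done with vectorized pandas calls instead of list comprehensions. This is a small sketch on two sample values in the YYMMDDHH format; the column names in `parts` are chosen for illustration.

```python
import pandas as pd

hours = pd.Series([14102100, 14103023])  # YYMMDDHH integers, as in the hour column

parsed = pd.to_datetime(hours.astype(str), format="%y%m%d%H")
parts = pd.DataFrame({
    "date": parsed.dt.date,
    "hour_of_day": parsed.dt.hour,
    "day_of_week": parsed.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(parts)
```

The `.dt` accessor avoids a Python-level loop over every row, which matters on a frame with hundreds of thousands of rows.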

In [82]:
#check if columns were established properly
df_new.head(3)

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,C18,C19,C20,C21,new_date,new_time,day_of_week,new_hour,date,time
0,10004510652136496837,0,14102100,1005,0,543a539e,c7ca3108,3e814130,ecad2386,7801e8d9,...,0,39,-1,157,2014-10-21,00:00:00,1,2014-10-21,2014-10-21,00:00:00
1,10007164336863914220,1,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,0,35,-1,79,2014-10-21,00:00:00,1,2014-10-21,2014-10-21,00:00:00
2,10076859283156800622,0,14102100,1002,0,f17ebd97,c4e18dd6,50e219e0,ecad2386,7801e8d9,...,3,43,100151,42,2014-10-21,00:00:00,1,2014-10-21,2014-10-21,00:00:00


In [83]:
df_new.dtypes

id                          uint64
click                        int64
hour                         int64
C1                           int64
banner_pos                   int64
site_id                     object
site_domain                 object
site_category               object
app_id                      object
app_domain                  object
app_category                object
device_id                   object
device_ip                   object
device_model                object
device_type                  int64
device_conn_type             int64
C14                          int64
C15                          int64
C16                          int64
C17                          int64
C18                          int64
C19                          int64
C20                          int64
C21                          int64
new_date                    object
new_time                    object
day_of_week                  int64
new_hour            datetime64[ns]
date                

In [84]:
# drop redundant cols
df_tmp = df_new.drop(['new_hour', 'hour'], axis=1)

In [85]:
df_tmp.head(2)

Unnamed: 0,id,click,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,...,C17,C18,C19,C20,C21,new_date,new_time,day_of_week,date,time
0,10004510652136496837,0,1005,0,543a539e,c7ca3108,3e814130,ecad2386,7801e8d9,07d7df22,...,2333,0,39,-1,157,2014-10-21,00:00:00,1,2014-10-21,00:00:00
1,10007164336863914220,1,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,...,1722,0,35,-1,79,2014-10-21,00:00:00,1,2014-10-21,00:00:00


In [86]:
# sanity check: count the unique values in each column

for col in df_tmp.columns.values:
    total = len(df_tmp[col].unique())
    print(str(col) + " " + "total: " + str(total))

id total: 404410
click total: 2
C1 total: 7
banner_pos total: 7
site_id total: 2195
site_domain total: 2172
site_category total: 21
app_id total: 2305
app_domain total: 153
app_category total: 26
device_id total: 64913
device_ip total: 262453
device_model total: 4369
device_type total: 5
device_conn_type total: 4
C14 total: 2067
C15 total: 8
C16 total: 9
C17 total: 415
C18 total: 4
C19 total: 65
C20 total: 159
C21 total: 60
new_date total: 10
new_time total: 24
day_of_week total: 7
date total: 10
time total: 24


# Feature engineering cont. 
### device_type converted into binary columns by device_type 


In [87]:
# iterate through columns and print the unique values of each column
for col in df_tmp.columns.values:
    val = df_tmp[col].unique()
    print(str(col) + " " + ", val: " + str(val))

id , val: [10004510652136496837 10007164336863914220 10076859283156800622 ...
  9930625418032326788  9953588061726377330  9959058523366506236]
click , val: [0 1]
C1 , val: [1005 1002 1010 1007 1008 1012 1001]
banner_pos , val: [0 1 2 5 7 4 3]
site_id , val: ['543a539e' '1fbe01fe' 'f17ebd97' ... '9fd919ea' '1b72ccd8' '5a51436e']
site_domain , val: ['c7ca3108' 'f3845767' 'c4e18dd6' ... '0da06afc' '3e87e1c9' '645c06d3']
site_category , val: ['3e814130' '28905ebd' '50e219e0' '76b2941d' 'f028772b' 'f66779e6'
 '0569f928' '335d28a8' '72722551' '75fa27f6' 'c0dd3be3' 'a818d37a'
 '8fd0aea4' '70fb0e29' 'dedf689d' 'e787de0e' '5378d028' 'bcf865d9'
 '42a36e14' '9ccfa2ea' 'c706e647']
app_id , val: ['ecad2386' '1779deee' 'febd1138' ... '96f19b66' '5717fe5d' '404b2054']
app_domain , val: ['7801e8d9' '2347f47a' '82e27996' '45a51db4' '5c5a694b' 'afdf1f54'
 'aefc06bd' 'ae637522' 'd9b5648e' '828da833' '5b9c592b' '0654b444'
 '885c7f3f' 'b8d325c3' 'b5f3b24a' 'ad63ec9b' '33da2e74' '43cf4f06'
 '15ec7f39' '18eb

In [88]:
# use device_type as practice. There are 5 unique vals -- smaller number is easier to work with
df_tmp.device_type.nunique()

5

In [89]:
# store df_tmp.device_type as var for ease of re-use
dvc_type = df_tmp.device_type

In [90]:
# check instance of dvc_type
dvc_type[0]

1

In [91]:
# val counts gives me the count of each unique values 
dvc_type.value_counts()

1    373412
0     22074
4      7676
5      1247
2         1
Name: device_type, dtype: int64

In [92]:
# make var to hold col 'names' based off unique values stored as a list
col_names = df_tmp['device_type'].unique().tolist()
col_names

[1, 0, 4, 5, 2]

In [93]:
# check it
df_tmp.head(3)

Unnamed: 0,id,click,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,...,C17,C18,C19,C20,C21,new_date,new_time,day_of_week,date,time
0,10004510652136496837,0,1005,0,543a539e,c7ca3108,3e814130,ecad2386,7801e8d9,07d7df22,...,2333,0,39,-1,157,2014-10-21,00:00:00,1,2014-10-21,00:00:00
1,10007164336863914220,1,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,...,1722,0,35,-1,79,2014-10-21,00:00:00,1,2014-10-21,00:00:00
2,10076859283156800622,0,1002,0,f17ebd97,c4e18dd6,50e219e0,ecad2386,7801e8d9,07d7df22,...,2497,3,43,100151,42,2014-10-21,00:00:00,1,2014-10-21,00:00:00


In [94]:
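# NOTE: get_dummies returns its columns in sorted order (0, 1, 2, 4, 5), while
# col_names is in order of first appearance (1, 0, 4, 5, 2), and the two appear
# to be paired by position here. The means below bear that out: the column
# labelled 0 (mean 0.92) matches the share of device_type == 1 rows (373412 / 404410).
df_tmp[col_names] = pd.get_dummies(df_tmp['device_type'])
df_tmp[col_names].describe()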

Unnamed: 0,1,0,4,5,2
count,404410.0,404410.0,404410.0,404410.0,404410.0
mean,0.054583,0.92335,2e-06,0.018981,0.003084
std,0.227165,0.266036,0.001572,0.136457,0.055444
min,0.0,0.0,0.0,0.0,0.0
25%,0.0,1.0,0.0,0.0,0.0
50%,0.0,1.0,0.0,0.0,0.0
75%,0.0,1.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0
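
One way to keep indicator labels tied to the values they encode is to let `get_dummies` name the columns itself and then join them on. This is a sketch on toy data, not the notebook's code; the `device_type_` prefix is an illustrative choice.

```python
import pandas as pd

toy = pd.DataFrame({"device_type": [1, 0, 1, 4, 1, 5]})

# get_dummies names each column after the value it indicates, so the labels
# cannot drift out of step with the data.
dummies = pd.get_dummies(toy["device_type"], prefix="device_type", dtype=int)
encoded = pd.concat([toy, dummies], axis=1)

print(encoded)
print(encoded.filter(like="device_type_").mean())
```

With this layout the mean of each indicator column is simply the share of rows with that device type, which can be checked directly against `value_counts()`.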


In [95]:
# the column labelled 1 has 2 unique values (0 and 1), i.e. it is a binary indicator
df_tmp[col_names][1].nunique()

2

In [96]:
# check for missing values -- there are none, good.

df_tmp.isnull().sum()

id                  0
click               0
C1                  0
banner_pos          0
site_id             0
site_domain         0
site_category       0
app_id              0
app_domain          0
app_category        0
device_id           0
device_ip           0
device_model        0
device_type         0
device_conn_type    0
C14                 0
C15                 0
C16                 0
C17                 0
C18                 0
C19                 0
C20                 0
C21                 0
new_date            0
new_time            0
day_of_week         0
date                0
time                0
1                   0
0                   0
4                   0
5                   0
2                   0
dtype: int64

In [97]:
# yes, the final 5 columns are the indicator columns split out by device type
df_tmp.shape

(404410, 33)

In [98]:
# confirm successful selection of last 5 cols -- device types
df_tmp[df_tmp.columns[-5:]].head(3)

Unnamed: 0,1,0,4,5,2
0,0,1,0,0,0
1,0,1,0,0,0
2,1,0,0,0,0


In [99]:
# rename device_type col name for readability / understanding
df_dvtype = df_tmp
#df_dvtype.columns = ['a', 'b']
df_dvtype.head(3)

Unnamed: 0,id,click,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,...,new_date,new_time,day_of_week,date,time,1,0,4,5,2
0,10004510652136496837,0,1005,0,543a539e,c7ca3108,3e814130,ecad2386,7801e8d9,07d7df22,...,2014-10-21,00:00:00,1,2014-10-21,00:00:00,0,1,0,0,0
1,10007164336863914220,1,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,07d7df22,...,2014-10-21,00:00:00,1,2014-10-21,00:00:00,0,1,0,0,0
2,10076859283156800622,0,1002,0,f17ebd97,c4e18dd6,50e219e0,ecad2386,7801e8d9,07d7df22,...,2014-10-21,00:00:00,1,2014-10-21,00:00:00,1,0,0,0,0


In [101]:
df_dvtype = df_dvtype.rename({1:'device_type_a',0:'device_type_b',4:'device_type_c',5:'device_type_d', 2:'device_type_e'}, axis='columns')
df_dvtype.columns

Index(['id', 'click', 'C1', 'banner_pos', 'site_id', 'site_domain',
       'site_category', 'app_id', 'app_domain', 'app_category', 'device_id',
       'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14',
       'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'new_date', 'new_time',
       'day_of_week', 'date', 'time', 'device_type_a', 'device_type_b',
       'device_type_c', 'device_type_d', 'device_type_e'],
      dtype='object')

In [103]:
df_num = df_dvtype.drop(columns = ['id', 'new_time', 'new_date'])

In [104]:
df_num.columns

Index(['click', 'C1', 'banner_pos', 'site_id', 'site_domain', 'site_category',
       'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip',
       'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16',
       'C17', 'C18', 'C19', 'C20', 'C21', 'day_of_week', 'date', 'time',
       'device_type_a', 'device_type_b', 'device_type_c', 'device_type_d',
       'device_type_e'],
      dtype='object')

In [105]:
df_num.dtypes

click                int64
C1                   int64
banner_pos           int64
site_id             object
site_domain         object
site_category       object
app_id              object
app_domain          object
app_category        object
device_id           object
device_ip           object
device_model        object
device_type          int64
device_conn_type     int64
C14                  int64
C15                  int64
C16                  int64
C17                  int64
C18                  int64
C19                  int64
C20                  int64
C21                  int64
day_of_week          int64
date                object
time                object
device_type_a        uint8
device_type_b        uint8
device_type_c        uint8
device_type_d        uint8
device_type_e        uint8
dtype: object

In [110]:
# converting non-numerical columns into numerical columns
categories = ['site_id', 'site_domain','site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 'device_ip', 'device_model', 'date', 'time']

for category in categories:
  df_num[category] = df_num[category].astype('category').cat.codes
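
To make the conversion above concrete, here is a small sketch of what `.cat.codes` produces and how to keep a mapping back to the original values. The three hashes are sample values from the site_id column; `code_to_value` is a hypothetical helper name.

```python
import pandas as pd

site = pd.Series(["543a539e", "1fbe01fe", "f17ebd97", "1fbe01fe"])  # sample hashes

as_cat = site.astype("category")
codes = as_cat.cat.codes  # integer position of each value in the sorted categories

# Keep the code -> original value mapping so the encoding can be reversed.
code_to_value = dict(enumerate(as_cat.cat.categories))

print(codes.tolist())
print(code_to_value)
```

The codes follow the sorted order of the hashes, so their magnitude carries no meaning; they are labels, not measurements.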

In [112]:
# check dtypes
df_num.dtypes

click               int64
C1                  int64
banner_pos          int64
site_id             int16
site_domain         int16
site_category        int8
app_id              int16
app_domain          int16
app_category         int8
device_id           int32
device_ip           int32
device_model        int16
device_type         int64
device_conn_type    int64
C14                 int64
C15                 int64
C16                 int64
C17                 int64
C18                 int64
C19                 int64
C20                 int64
C21                 int64
day_of_week         int64
date                 int8
time                 int8
device_type_a       uint8
device_type_b       uint8
device_type_c       uint8
device_type_d       uint8
device_type_e       uint8
dtype: object

In [116]:
df_num.head(3)
df_num.tail(3)

Unnamed: 0,click,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,...,C20,C21,day_of_week,date,time,device_type_a,device_type_b,device_type_c,device_type_d,device_type_e
404407,0,1005,0,263,2056,1,2129,65,0,43137,...,-1,51,3,9,23,0,1,0,0,0
404408,0,1005,0,1145,1680,5,1396,17,24,43137,...,-1,221,3,9,23,0,1,0,0,0
404409,0,1005,0,263,2056,1,2129,65,0,43137,...,-1,51,3,9,23,0,1,0,0,0


In [126]:
df_num.device_id.nunique()

64913

In [125]:
df_tmp.device_id.nunique()

64913

In [121]:
# check the unique counts of every column once more, just in case
for col in df_num.columns.values:
    total = len(df_num[col].unique())
    print(str(col) + " " + "total: " + str(total))

click total: 2
C1 total: 7
banner_pos total: 7
site_id total: 2195
site_domain total: 2172
site_category total: 21
app_id total: 2305
app_domain total: 153
app_category total: 26
device_id total: 64913
device_ip total: 262453
device_model total: 4369
device_type total: 5
device_conn_type total: 4
C14 total: 2067
C15 total: 8
C16 total: 9
C17 total: 415
C18 total: 4
C19 total: 65
C20 total: 159
C21 total: 60
day_of_week total: 7
date total: 10
time total: 24
device_type_a total: 2
device_type_b total: 2
device_type_c total: 2
device_type_d total: 2
device_type_e total: 2


# Original features

 - Target feature : click
 - Site features : site_id, site_domain, site_category
 - App feature: app_id, app_domain, app_category
 - Device feature: device_id, device_ip, device_model, device_type, device_conn_type
 - Anonymized categorical features: C14-C21

# New Features

### All features are numerical (df_num)
 - Target feature : click
 - Site features : site_id, site_domain, site_category
 - App feature: app_id, app_domain, app_category
 - Device feature: device_id, device_ip, device_model, device_type, device_type_[a,b,c,d,e], device_conn_type
 - Anonymized numerical features: C14-C21

# Data Analysis


In [131]:
df_da = df_num

In [140]:
# mean of each column
for col in df_num.columns:
    stats = df_num[col].mean()
    print(str(col) + " " + "stats are: " + str(stats))

click stats are: 0.16999085086916743
C1 stats are: 1004.9673722212606
banner_pos stats are: 0.2886377686011721
site_id stats are: 1046.65908854875
site_domain stats are: 1453.6608763383695
site_category stats are: 8.643169061101357
app_id stats are: 1847.5357805197696
app_domain stats are: 62.23985806483519
app_category stats are: 2.9607205558715166
device_id stats are: 41269.23497193442
device_ip stats are: 131067.57986449395
device_model stats are: 2180.5250018545535
device_type stats are: 1.0146954823075591
device_conn_type stats are: 0.33168813827551247
C14 stats are: 18844.936193467027
C15 stats are: 318.79585569100664
C16 stats are: 60.07658069780668
C17 stats are: 2112.733881456937
C18 stats are: 1.4314260280408497
C19 stats are: 227.8911401795208
C20 stats are: 53239.462209144185
C21 stats are: 83.30495536707797
day_of_week stats are: 2.6012586236739943
date stats are: 4.47542840186939
time stats are: 11.290195593580773
device_type_a stats are: 0.05458321999950545
device_type_b

In [141]:
# standard deviation of each column
for col in df_num.columns:
    stats = df_num[col].std()
    print(str(col) + " " + "stats are: " + str(stats))

click stats are: 0.3756252259629682
C1 stats are: 1.0902065833754702
banner_pos stats are: 0.5040331994297257
site_id stats are: 538.5678552202846
site_domain stats are: 511.6638024997347
site_category stats are: 7.296516926140653
app_id stats are: 540.8971571035224
app_domain stats are: 24.969705836765403
app_category stats are: 6.26839398358437
device_id stats are: 8829.367337188078
device_ip stats are: 75558.98946310426
device_model stats are: 1236.5193840720847
device_type stats are: 0.5239590828986503
device_conn_type stats are: 0.8558771967421039
C14 stats are: 4947.526554252101
C15 stats are: 20.667022434797758
C16 stats are: 47.02369068653012
C17 stats are: 607.9299831803719
C18 stats are: 1.325358504992502
C19 stats are: 351.6861052006083
C20 stats are: 49955.39177024403
C21 stats are: 70.25153725940373
day_of_week stats are: 1.7273624895929598
date stats are: 2.9645545477538864
time stats are: 5.949772171333355
device_type_a stats are: 0.22716518152484272
device_type_b stats 

In [120]:
# correlations
df_da.corr()

Unnamed: 0,click,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,...,C20,C21,day_of_week,date,time,device_type_a,device_type_b,device_type_c,device_type_d,device_type_e
click,1.0,-0.036423,0.025642,-0.006881,-0.038776,-0.006758,0.067806,-0.009115,-0.047392,0.009507,...,-0.060644,-0.066561,0.016852,-0.007525,-0.001594,0.027055,-0.006894,-0.000712,-0.026526,-0.012464
C1,-0.036423,1.0,0.285905,0.002045,0.118206,0.032548,-0.220376,-0.003129,0.088804,0.048135,...,-0.034842,0.034021,0.009949,-0.004417,0.009499,-0.654006,0.175533,0.010144,0.642102,0.256732
banner_pos,0.025642,0.285905,1.0,0.294825,-0.414307,0.537784,0.141547,0.034286,-0.223802,0.036953,...,0.057116,-0.098572,0.007485,0.030877,0.002007,-0.137598,-0.045195,-0.000901,0.273107,0.108488
site_id,-0.006881,0.002045,0.294825,1.0,-0.248331,0.404284,-0.095017,-0.020184,0.086245,-0.035173,...,0.079073,-0.156752,-0.023902,0.036547,0.026704,0.045376,-0.053862,-0.000548,0.025399,0.010036
site_domain,-0.038776,0.118206,-0.414307,-0.248331,1.0,-0.581931,-0.230189,-0.048898,0.208938,-0.036862,...,-0.075283,0.121845,0.010981,-0.010469,0.016024,-0.104539,0.052591,-0.002433,0.061531,0.024602
site_category,-0.006758,0.032548,0.537784,0.404284,-0.581931,1.0,0.25982,0.055193,-0.235834,0.10554,...,0.107218,-0.114582,-0.01594,0.015969,-0.00805,-0.119973,0.143805,-0.000785,-0.069452,-0.027512
app_id,0.067806,-0.220376,0.141547,-0.095017,-0.230189,0.25982,1.0,0.143223,-0.517759,0.170603,...,-0.014258,-0.045694,0.021722,-0.010448,-0.016546,0.060291,0.090639,0.000818,-0.239244,-0.093142
app_domain,-0.009115,-0.003129,0.034286,-0.020184,-0.048898,0.055193,0.143223,1.0,-0.186938,0.065359,...,0.183078,-0.093566,0.014775,-0.08049,0.043936,0.026561,-0.031849,0.000174,0.015376,0.006148
app_category,-0.047392,0.088804,-0.223802,0.086245,0.208938,-0.235834,-0.517759,-0.186938,1.0,-0.123029,...,-0.05568,0.059922,-0.03377,0.033642,0.029856,-0.113491,0.076977,-0.000743,0.035163,0.009114
device_id,0.009507,0.048135,0.036953,-0.035173,-0.036862,0.10554,0.170603,0.065359,-0.123029,1.0,...,0.0228,0.021403,0.019533,-0.002309,-0.007749,-0.243278,0.291648,0.000333,-0.13775,-0.063632
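
When the full matrix is hard to scan, it can help to pull out just the target column and rank features by absolute correlation. This is a sketch on synthetic data; `feat_a` and `feat_b` are made-up columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({"feat_a": rng.normal(size=200),
                    "feat_b": rng.normal(size=200)})
# Synthetic target that depends on feat_a only.
toy["click"] = (toy["feat_a"] + 0.1 * rng.normal(size=200) > 0).astype(int)

corr_with_click = toy.corr()["click"].drop("click")
# Order the features from strongest to weakest absolute correlation.
order = corr_with_click.abs().sort_values(ascending=False).index
ranked = corr_with_click.reindex(order)
print(ranked)
```

Pearson correlation on integer-coded categorical columns should be read with caution, since the codes are arbitrary labels rather than quantities.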


# Standardize / Normalize Data

In [None]:
# filter out outliers ('Data' stands in for a column name; df_num has no such column)
# keep only the rows that are within +3 to -3 standard deviations in the column 'Data'
#df_normal = df_num[np.abs(df_num.Data-df_num.Data.mean()) <= (3*df_num.Data.std())]

# keep only the rows outside of +3 or -3 standard deviations
#df_outliers = df_num[np.abs(df_num.Data-df_num.Data.mean()) > (3*df_num.Data.std())]
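
The 3-standard-deviation rule from the comments above can be tried on a toy series like this. It is a sketch only: `Data` is the same placeholder name, and the values are invented so that exactly one point is extreme.

```python
import numpy as np
import pandas as pd

# Fifty ordinary values and one extreme one; 'Data' is a placeholder name.
values = pd.Series(np.r_[np.full(50, 10.0), 1000.0], name="Data")

# z-score: distance from the mean in units of standard deviation.
z = (values - values.mean()) / values.std()
inliers = values[z.abs() <= 3]
outliers = values[z.abs() > 3]

print(len(inliers), len(outliers))
```

This kind of filter fits genuinely numeric columns; for integer-coded categorical columns, a frequency threshold on `value_counts()` is usually a better notion of "rare".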


# Random Forest
## Classification using sklearn
### baseline metrics

In [None]:
# baseline random forest
features = ['C1', 'banner_pos', 'site_id', 'site_domain',
       'site_category', 'app_id', 'app_domain', 'app_category', 'device_id',
       'device_ip', 'device_model', 'device_type', 'device_conn_type', 'C14',
       'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'date', 'time', 'device_type_a', 'device_type_b', 'device_type_c', 'device_type_d', 'device_type_e']
target = ['click']

X_train, X_test, y_train, y_test = train_test_split(df_num[features], df_num[target], test_size = 0.2, random_state = 0)

In [136]:
y_train.head(3)

Unnamed: 0,click
77016,0
319841,0
385323,1


In [138]:
X_train.head(3)

Unnamed: 0,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,app_category,device_id,device_ip,...,C19,C20,C21,date,time,device_type_a,device_type_b,device_type_c,device_type_d,device_type_e
77016,1005,0,1145,1680,5,2029,51,3,43137,11488,...,39,100148,32,1,13,0,1,0,0,0
319841,1005,0,1269,1029,3,2129,65,0,43137,28681,...,39,100083,33,7,21,0,1,0,0,0
385323,1005,0,1145,1680,5,1396,17,24,43137,36498,...,47,-1,221,9,12,0,1,0,0,0


In [None]:
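# running the random forest algorithm
# click is a 0/1 label, so use the classifier rather than the regressor

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train.values.ravel())  # flatten the single-column target to 1-D
y_pred = classifier.predict(X_test)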
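
For baseline metrics, it helps to compare the forest against a trivial "never clicks" predictor, since only about 17% of rows are clicks. The following is a sketch on synthetic, similarly imbalanced data; the sample size, feature count, and `n_estimators=100` are arbitrary choices for a quick run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (roughly 17% positives, loosely mirroring the click rate).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.83],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # predicted probability of a click

# Baseline: always predict "no click".
baseline_acc = accuracy_score(y_te, np.zeros_like(y_te))
model_acc = accuracy_score(y_te, clf.predict(X_te))

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"forest accuracy:   {model_acc:.3f}")
print(f"forest log loss:   {log_loss(y_te, proba):.3f}")
```

On imbalanced data, accuracy alone can look good for a useless model, so a probability-based metric such as log loss is a useful companion.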



In [144]:
# the predictions are stored in lowercase `y_pred`; `Y_pred` raises a NameError
y_pred.shape