# Three approaches
I'm trying to take at least three different approaches for each step of my problems.  So for my data, I suspect that the numerical data set is going to be most predictive based off of initial analysis. The goal of this notebook is to evaluate which of these gives me the best performance.

## Create MEGA SET
This is simply a horizontal concatenation of numerical, categorical, and time stamp data.  It will be insanely wide.  I suspect this will be so wide that it is impractical to use.
### Questions I want to answer:
Will permutation importances help me identify my most performant features?  
Can a model with over 2k columns be trained in any meaningful way?  
Will the gods strike me down for my hubris?  

## Create meta-features
I suspect this might be the most practical use of these data sets.  I get the impression that the importance of these sets is roughly  numeric>timestamp>categorical.  But it would be nice to actually confirm this.  I suspect creating a meta feature of each data set and trying to combine them will give me some idea of which sets will perform better this way.
### Questions:
Can I "black box" my meta features with things like PCA and SVD? If so how much does that help my meta feature? Or is it just a time saver?  
How much perfomance do I lose from each metafeature (in comparison to MEGA SET)?

## Stacking
Bespoke model selection for each data set, stacked together in a pipeline.  Or using predict_proba, have weighted votes (kinda like meta features).
### Questions:
Does this out perform MEGA SET or meta features?

# Evaluation
I will judge each on precision and mathews coefficient.  I chose precision based off of a confusion matrix (see main notebook).  Mathews coefficient was the original competition score, and I'd like to see where I stack up.


## MEGA SET

In [1]:
#lets get started on megaset.  My methodology for this is going to read in a chunk of each
#data
import pandas as pd

def get_iters():

    folder = 'bosch-production-line-performance/'

    #iterables for each data set
    num_iter = pd.read_csv(folder + 'train_numeric.csv', iterator = True, chunksize = 1000)

    date_iter = pd.read_csv(folder + 'train_date.csv', iterator = True, chunksize = 1000)
    
    #during exploration I discovered that the categorical data set is all strings, 
    #so reading it in as such.
    cat_iter = pd.read_csv(folder + 'train_categorical.csv', 
                           dtype = str, iterator = True, chunksize = 1000)
    
    return(num_iter, date_iter, cat_iter)


In [2]:
num_iter, date_iter, cat_iter = get_iters()

In [3]:
cat = cat_iter.get_chunk()

num = num_iter.get_chunk()

date = date_iter.get_chunk()

In [4]:
cat.shape, num.shape, date.shape

((1000, 2141), (1000, 970), (1000, 1157))

In [5]:
cat.head()

Unnamed: 0,Id,L0_S1_F25,L0_S1_F27,L0_S1_F29,L0_S1_F31,L0_S2_F33,L0_S2_F35,L0_S2_F37,L0_S2_F39,L0_S2_F41,...,L3_S49_F4225,L3_S49_F4227,L3_S49_F4229,L3_S49_F4230,L3_S49_F4232,L3_S49_F4234,L3_S49_F4235,L3_S49_F4237,L3_S49_F4239,L3_S49_F4240
0,4,,,,,,,,,,...,,,,,,,,,,
1,6,,,,,,,,,,...,,,,,,,,,,
2,7,,,,,,,,,,...,,,,,,,,,,
3,9,,,,,,,,,,...,,,,,,,,,,
4,11,,,,,,,,,,...,,,,,,,,,,


In [6]:
num.head()

Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_F4245,L3_S50_F4247,L3_S50_F4249,L3_S50_F4251,L3_S50_F4253,L3_S51_F4256,L3_S51_F4258,L3_S51_F4260,L3_S51_F4262,Response
0,4,0.03,-0.034,-0.197,-0.179,0.118,0.116,-0.015,-0.032,0.02,...,,,,,,,,,,0
1,6,,,,,,,,,,...,,,,,,,,,,0
2,7,0.088,0.086,0.003,-0.052,0.161,0.025,-0.015,-0.072,-0.225,...,,,,,,,,,,0
3,9,-0.036,-0.064,0.294,0.33,0.074,0.161,0.022,0.128,-0.026,...,,,,,,,,,,0
4,11,-0.055,-0.086,0.294,0.33,0.118,0.025,0.03,0.168,-0.169,...,,,,,,,,,,0


In [7]:
date.head()

Unnamed: 0,Id,L0_S0_D1,L0_S0_D3,L0_S0_D5,L0_S0_D7,L0_S0_D9,L0_S0_D11,L0_S0_D13,L0_S0_D15,L0_S0_D17,...,L3_S50_D4246,L3_S50_D4248,L3_S50_D4250,L3_S50_D4252,L3_S50_D4254,L3_S51_D4255,L3_S51_D4257,L3_S51_D4259,L3_S51_D4261,L3_S51_D4263
0,4,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,...,,,,,,,,,,
1,6,,,,,,,,,,...,,,,,,,,,,
2,7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,...,,,,,,,,,,
3,9,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,...,,,,,,,,,,
4,11,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,...,,,,,,,,,,


#### Verifying each dataset is in the same order
Okay, I know for sure each set as an ID column that I can index with, so lets go ahead and just verify that they actually line up.

In [8]:
folder = 'bosch-production-line-performance/'

catid = pd.read_csv(folder + 'train_categorical.csv', usecols = ['Id'], squeeze = True)
numid = pd.read_csv(folder + 'train_numeric.csv', usecols = ['Id'], squeeze = True)
dateid = pd.read_csv(folder + 'train_date.csv', usecols = ['Id'], squeeze = True)
catid.shape, numid.shape, dateid.shape

((1183747,), (1183747,), (1183747,))

In [9]:
#looking good so far
if catid.equals(numid) and numid.equals(dateid):
    print('Each data set is in the same order.')

Each data set is in the same order.


In [10]:
#resetting all variables with ID in them.  Need to clear up memory for megaset
%reset_selective -f id

#### Does each set have my target column?
According to my data's [documentation](https://www.kaggle.com/c/bosch-production-line-performance/data) we are trying to predict for "Response" but I don't know if each dataset has that feature so lets find out.

In [11]:
try:
    print(num['Response'])
except:
    print('No response column in the numeric dataset')
try:
    print(cat['Response'])
except:
    print('No reponse column in the categorical dataset')
try:
    print(date['Response'])
except:
    print('No response column in the date dataset')

0      0
1      0
2      0
3      0
4      0
      ..
995    0
996    0
997    0
998    0
999    0
Name: Response, Length: 1000, dtype: int64
No reponse column in the categorical dataset
No response column in the date dataset


### Build the mega set

In [12]:
#easy enough, lets just concat everything
mega = pd.merge(num, date, how = 'outer', on = 'Id')
mega.shape
mega.head()

Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_D4246,L3_S50_D4248,L3_S50_D4250,L3_S50_D4252,L3_S50_D4254,L3_S51_D4255,L3_S51_D4257,L3_S51_D4259,L3_S51_D4261,L3_S51_D4263
0,4,0.03,-0.034,-0.197,-0.179,0.118,0.116,-0.015,-0.032,0.02,...,,,,,,,,,,
1,6,,,,,,,,,,...,,,,,,,,,,
2,7,0.088,0.086,0.003,-0.052,0.161,0.025,-0.015,-0.072,-0.225,...,,,,,,,,,,
3,9,-0.036,-0.064,0.294,0.33,0.074,0.161,0.022,0.128,-0.026,...,,,,,,,,,,
4,11,-0.055,-0.086,0.294,0.33,0.118,0.025,0.03,0.168,-0.169,...,,,,,,,,,,


In [13]:
#shaped as expected
mega['Id'].dtype

dtype('int64')

In [14]:
cat['Id'].dtype

dtype('O')

In [15]:
#okay not going to merge with different data types for identifiers.  Lets go ahead and cast the
#categorical column to a 
cat['Id'] = cat['Id'].apply(int)

In [16]:
cat['Id'].dtype

dtype('int64')

In [17]:
#okay I'm expecting this to be 4268 - 2 'Id' columns removed
mega = pd.merge(mega, cat, how = 'outer', on = 'Id')
mega.shape
mega.head()

Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S49_F4225,L3_S49_F4227,L3_S49_F4229,L3_S49_F4230,L3_S49_F4232,L3_S49_F4234,L3_S49_F4235,L3_S49_F4237,L3_S49_F4239,L3_S49_F4240
0,4,0.03,-0.034,-0.197,-0.179,0.118,0.116,-0.015,-0.032,0.02,...,,,,,,,,,,
1,6,,,,,,,,,,...,,,,,,,,,,
2,7,0.088,0.086,0.003,-0.052,0.161,0.025,-0.015,-0.072,-0.225,...,,,,,,,,,,
3,9,-0.036,-0.064,0.294,0.33,0.074,0.161,0.022,0.128,-0.026,...,,,,,,,,,,
4,11,-0.055,-0.086,0.294,0.33,0.118,0.025,0.03,0.168,-0.169,...,,,,,,,,,,


In [18]:
#I kinda have to keep tuning this as it kills my kernel, so the final numbers are the max
#that my machine can handle.

mega_chunks = [mega]

for i in range(300):
    num = num_iter.get_chunk()
    date = date_iter.get_chunk()
    cat = cat_iter.get_chunk()
    
    cat['Id'] = cat['Id'].apply(int)
    a = pd.merge(num, date, how = 'outer', on = 'Id')
    mega_chunks.append(pd.merge(a, date, how = 'outer', on = 'Id'))
    
mega = pd.concat(mega_chunks, ignore_index = True)
    


In [19]:
print(mega.shape)
mega.head()

(301000, 6578)


Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_D4246_y,L3_S50_D4248_y,L3_S50_D4250_y,L3_S50_D4252_y,L3_S50_D4254_y,L3_S51_D4255_y,L3_S51_D4257_y,L3_S51_D4259_y,L3_S51_D4261_y,L3_S51_D4263_y
0,4,0.03,-0.034,-0.197,-0.179,0.118,0.116,-0.015,-0.032,0.02,...,,,,,,,,,,
1,6,,,,,,,,,,...,,,,,,,,,,
2,7,0.088,0.086,0.003,-0.052,0.161,0.025,-0.015,-0.072,-0.225,...,,,,,,,,,,
3,9,-0.036,-0.064,0.294,0.33,0.074,0.161,0.022,0.128,-0.026,...,,,,,,,,,,
4,11,-0.055,-0.086,0.294,0.33,0.118,0.025,0.03,0.168,-0.169,...,,,,,,,,,,


In [20]:
%reset_selective -f "^mega_chunks$"
%reset_selective -f "^num$"
%reset_selective -f "^date$"
%reset_selective -f "^cat$"
%reset_selective -f "^a$"