# KDD Cup 2009: Customer Relationship Prediction

## Introduction

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

In a CRM system, the most practical way to build knowledge about customers is to produce scores. A score (the output of a model) is an evaluation, for every instance, of a target variable to be explained (i.e. churn, appetency or up-selling). Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed from the input variables that describe each instance. Scores are then used by the information system (IS), for example, to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction and indexation, based on an efficient model combined with variable selection regularization and a model averaging method. Its main characteristic is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. The rapid and robust detection of the variables that contributed most to the output prediction can be a key factor in a marketing application.

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables), and unbalanced class distributions. Time efficiency is often a crucial point. Therefore part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.

## Objective

Our objective here is to develop models to predict churn, appetency, and up-selling.

## Dataset

As described on the problem page, the dataset consists of 230 variables, 190 of them numerical and 40 categorical.
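As a small sketch of how this split can be referred to in code, the lists below assume that the numerical variables are `Var1` to `Var190` and the categorical ones are `Var191` to `Var230`, which is the convention the loops later in this notebook rely on. The names `numeric_cols` and `categorical_cols` are illustrative and not part of the original notebook.

```python
# Assumption: the first 190 variables are numerical, the last 40 categorical.
numeric_cols = [f"Var{i}" for i in range(1, 191)]        # Var1 .. Var190
categorical_cols = [f"Var{i}" for i in range(191, 231)]  # Var191 .. Var230

print(len(numeric_cols), len(categorical_cols))
```

Keeping the two lists around avoids rebuilding `'Var' + str(i)` strings in each loop.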

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('../orange_small_train.data/orange_small_train.data', delimiter='\t')
test_df = pd.read_csv('../orange_small_test.data/orange_small_test.data', delimiter='\t')
df

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Var6,Var7,Var8,Var9,Var10,...,Var221,Var222,Var223,Var224,Var225,Var226,Var227,Var228,Var229,Var230
0,,,,,,1526.0,7.0,,,,...,oslk,fXVEsaq,jySVZNlOJy,,,xb3V,RAYp,F2FyR07IdsN7I,,
1,,,,,,525.0,0.0,,,,...,oslk,2Kb5FSF,LM8l689qOp,,,fKCe,RAYp,F2FyR07IdsN7I,,
2,,,,,,5236.0,7.0,,,,...,Al6ZaUT,NKv4yOc,jySVZNlOJy,,kG3k,Qu4f,02N6s8f,ib5G6X1eUxUn6,am7c,
3,,,,,,,0.0,,,,...,oslk,CE7uk3u,LM8l689qOp,,,FSa2,RAYp,F2FyR07IdsN7I,,
4,,,,,,1029.0,7.0,,,,...,oslk,1J2cvxe,LM8l689qOp,,kG3k,FSa2,RAYp,F2FyR07IdsN7I,mj86,
5,,,,,,658.0,7.0,,,,...,zCkv,QqVuch3,LM8l689qOp,,,Qcbd,02N6s8f,Zy3gnGM,am7c,
6,,,,,,1680.0,7.0,,,,...,oslk,XlgxB9z,LM8l689qOp,,kG3k,FSa2,RAYp,55YFVY9,am7c,
7,,,,,,77.0,0.0,,,,...,oslk,R2LdzOv,,,,FSa2,RAYp,F2FyR07IdsN7I,,
8,,,,,,1176.0,7.0,,,,...,zCkv,K2SqEo9,jySVZNlOJy,,kG3k,PM2D,6fzt,am14IcfM7tWLrUmRT52KtA,am7c,
9,,,,,,1141.0,7.0,,,,...,oslk,EPqQcw6,LM8l689qOp,,kG3k,FSa2,RAYp,55YFVY9,,


In [3]:
test_df

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Var6,Var7,Var8,Var9,Var10,...,Var221,Var222,Var223,Var224,Var225,Var226,Var227,Var228,Var229,Var230
0,,,,,,1225.0,7.0,,,,...,Al6ZaUT,P6pu4Vl,LM8l689qOp,,ELof,7P5s,ZI9m,R4y5gQQWY8OodqDV,,
1,,,,,,259.0,0.0,,,,...,oslk,S46Rt72,LM8l689qOp,,,Qu4f,RAYp,F2FyR07IdsN7I,,
2,,,,,,861.0,14.0,,,,...,oslk,CcdTy9x,LM8l689qOp,,,7aLG,RAYp,F2FyR07IdsN7I,,
3,,,,,,1568.0,7.0,,,,...,oslk,Q53Rkup,LM8l689qOp,,kG3k,7P5s,RAYp,TCU50_Yjmm6GIBZ0lL_,am7c,
4,,,,,,1197.0,7.0,,,,...,Al6ZaUT,WfsWw2A,LM8l689qOp,,ELof,5Acm,ZI9m,iyHGyLCEkQ,am7c,
5,,,,,0.0,,,,,0.0,...,d0EEeJi,rV_GPRB,,,,Qu4f,02N6s8f,F2FyR07IdsN7I,,
6,,,,,,476.0,7.0,,,,...,oslk,yVBQv5o,LM8l689qOp,,kG3k,FSa2,RAYp,F2FyR07IdsN7I,am7c,
7,,,,,,812.0,7.0,,,,...,oslk,catzS2D,bCPvVye,,ELof,5Acm,ZI9m,iyHGyLCEkQ,am7c,
8,,,,,,2044.0,7.0,,,,...,d0EEeJi,WfsWw2A,LM8l689qOp,,ELof,FSa2,ZI9m,F2FcTt7IdMT_v,am7c,
9,,,,,,518.0,7.0,,,,...,oslk,MERl9if,LM8l689qOp,,ELof,uWr3,RAYp,F2FyR07IdsN7I,am7c,


To make the experiments easier to run, we will concatenate the train and test sets, apply the preprocessing once, and then split the result back into train and test. This way the preprocessing step does not have to be run twice.

In [4]:
train_and_test = [df, test_df]
df = pd.concat(train_and_test)
df

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Var6,Var7,Var8,Var9,Var10,...,Var221,Var222,Var223,Var224,Var225,Var226,Var227,Var228,Var229,Var230
0,,,,,,1526.0,7.0,,,,...,oslk,fXVEsaq,jySVZNlOJy,,,xb3V,RAYp,F2FyR07IdsN7I,,
1,,,,,,525.0,0.0,,,,...,oslk,2Kb5FSF,LM8l689qOp,,,fKCe,RAYp,F2FyR07IdsN7I,,
2,,,,,,5236.0,7.0,,,,...,Al6ZaUT,NKv4yOc,jySVZNlOJy,,kG3k,Qu4f,02N6s8f,ib5G6X1eUxUn6,am7c,
3,,,,,,,0.0,,,,...,oslk,CE7uk3u,LM8l689qOp,,,FSa2,RAYp,F2FyR07IdsN7I,,
4,,,,,,1029.0,7.0,,,,...,oslk,1J2cvxe,LM8l689qOp,,kG3k,FSa2,RAYp,F2FyR07IdsN7I,mj86,
5,,,,,,658.0,7.0,,,,...,zCkv,QqVuch3,LM8l689qOp,,,Qcbd,02N6s8f,Zy3gnGM,am7c,
6,,,,,,1680.0,7.0,,,,...,oslk,XlgxB9z,LM8l689qOp,,kG3k,FSa2,RAYp,55YFVY9,am7c,
7,,,,,,77.0,0.0,,,,...,oslk,R2LdzOv,,,,FSa2,RAYp,F2FyR07IdsN7I,,
8,,,,,,1176.0,7.0,,,,...,zCkv,K2SqEo9,jySVZNlOJy,,kG3k,PM2D,6fzt,am14IcfM7tWLrUmRT52KtA,am7c,
9,,,,,,1141.0,7.0,,,,...,oslk,EPqQcw6,LM8l689qOp,,kG3k,FSa2,RAYp,55YFVY9,,
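The concatenation above keeps the original row labels, so the two parts can later only be separated by row position. One possible alternative, shown here on toy data, is to pass `keys=` to `pd.concat` so that each row remembers which set it came from. The frames `train_toy` and `test_toy` are made up for illustration, and the `fillna` call only stands in for the shared preprocessing.

```python
import pandas as pd

# Toy stand-ins for the real train and test frames.
train_toy = pd.DataFrame({"Var6": [1526.0, 525.0, None]})
test_toy = pd.DataFrame({"Var6": [1225.0, 259.0]})

# keys= adds an outer index level recording the origin of each row.
combined = pd.concat([train_toy, test_toy], keys=["train", "test"])

# Placeholder for the shared preprocessing step.
combined = combined.fillna(-1)

# Recover the two parts by label instead of by row position.
train_back = combined.loc["train"]
test_back = combined.loc["test"]
```

With this layout the split no longer depends on knowing how many rows the training set has.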


From the sample shown above, we can see that the NaN values in the dataframe will need to be handled.

In [5]:
def show_missing(dataframe):
    missing = dataframe.columns[dataframe.isnull().any()].tolist()
    return missing

In [6]:
missing_values_count = df.isnull().sum()

In [7]:
missing_values_count

Var1       98610
Var2       97475
Var3       97478
Var4       96783
Var5       97007
Var6       11138
Var7       11162
Var8      100000
Var9       98610
Var10      97007
Var11      97478
Var12      98877
Var13      11162
Var14      97478
Var15     100000
Var16      97007
Var17      96783
Var18      96783
Var19      96783
Var20     100000
Var21      11138
Var22      10125
Var23      97007
Var24      14469
Var25      10125
Var26      97007
Var27      97007
Var28      10130
Var29      98610
Var30      98610
           ...  
Var201     74283
Var202         1
Var203       288
Var204         0
Var205      3877
Var206     11138
Var207         0
Var208       288
Var209    100000
Var210         0
Var211         0
Var212         0
Var213     97729
Var214     50696
Var215     98624
Var216         0
Var217      1360
Var218      1360
Var219     10474
Var220         0
Var221         0
Var222         0
Var223     10474
Var224     98325
Var225     52163
Var226         0
Var227         0
Var228        

In [8]:
# drop columns in which 80% or more of the values are missing
df = df.loc[:, df.isnull().mean() < .80]
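To make the filter easier to follow, here is a minimal sketch on a toy frame. The mean of the boolean missing-value mask is the fraction of missing values per column, and only columns below the threshold are kept. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "all_missing": [np.nan, np.nan, np.nan, np.nan, np.nan],  # 100% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0],             # 20% missing
    "complete": ["a", "b", "c", "d", "e"],                    # 0% missing
})

# Fraction of missing values in each column.
missing_fraction = toy.isnull().mean()

# Keep only columns with less than 80% of their values missing.
kept = toy.loc[:, missing_fraction < 0.80]
print(list(kept.columns))
```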

In [9]:
df

Unnamed: 0,Var6,Var7,Var13,Var21,Var22,Var24,Var25,Var28,Var35,Var38,...,Var219,Var220,Var221,Var222,Var223,Var225,Var226,Var227,Var228,Var229
0,1526.0,7.0,184.0,464.0,580.0,14.0,128.0,166.56,0.0,3570.0,...,FzaX,1YVfGrO,oslk,fXVEsaq,jySVZNlOJy,,xb3V,RAYp,F2FyR07IdsN7I,
1,525.0,0.0,0.0,168.0,210.0,2.0,24.0,353.52,0.0,4764966.0,...,FzaX,0AJo2f2,oslk,2Kb5FSF,LM8l689qOp,,fKCe,RAYp,F2FyR07IdsN7I,
2,5236.0,7.0,904.0,1212.0,1515.0,26.0,816.0,220.08,0.0,5883894.0,...,FzaX,JFM1BiF,Al6ZaUT,NKv4yOc,jySVZNlOJy,kG3k,Qu4f,02N6s8f,ib5G6X1eUxUn6,am7c
3,,0.0,0.0,,0.0,,0.0,22.08,0.0,0.0,...,FzaX,L91KIiz,oslk,CE7uk3u,LM8l689qOp,,FSa2,RAYp,F2FyR07IdsN7I,
4,1029.0,7.0,3216.0,64.0,80.0,4.0,64.0,200.00,0.0,0.0,...,FzaX,OrnLfvc,oslk,1J2cvxe,LM8l689qOp,kG3k,FSa2,RAYp,F2FyR07IdsN7I,mj86
5,658.0,7.0,3156.0,224.0,280.0,2.0,72.0,200.00,5.0,0.0,...,FzaX,KbkKEj0,zCkv,QqVuch3,LM8l689qOp,,Qcbd,02N6s8f,Zy3gnGM,am7c
6,1680.0,7.0,2952.0,308.0,385.0,4.0,128.0,176.56,0.0,13158.0,...,FzaX,JO03372,oslk,XlgxB9z,LM8l689qOp,kG3k,FSa2,RAYp,55YFVY9,am7c
7,77.0,0.0,0.0,32.0,40.0,2.0,16.0,230.56,0.0,3776496.0,...,,U8IKsQe,oslk,R2LdzOv,,,FSa2,RAYp,F2FyR07IdsN7I,
8,1176.0,7.0,2912.0,200.0,250.0,2.0,64.0,300.32,0.0,6014460.0,...,FzaX,ROeipLp,zCkv,K2SqEo9,jySVZNlOJy,kG3k,PM2D,6fzt,am14IcfM7tWLrUmRT52KtA,am7c
9,1141.0,7.0,164.0,208.0,260.0,2.0,72.0,166.56,5.0,5317974.0,...,FzaX,fabLnWA,oslk,EPqQcw6,LM8l689qOp,kG3k,FSa2,RAYp,55YFVY9,


In [10]:
# NaN values are replaced with -1
df = df.replace([np.nan],-1)
df

Unnamed: 0,Var6,Var7,Var13,Var21,Var22,Var24,Var25,Var28,Var35,Var38,...,Var219,Var220,Var221,Var222,Var223,Var225,Var226,Var227,Var228,Var229
0,1526.0,7.0,184.0,464.0,580.0,14.0,128.0,166.56,0.0,3570.0,...,FzaX,1YVfGrO,oslk,fXVEsaq,jySVZNlOJy,-1,xb3V,RAYp,F2FyR07IdsN7I,-1
1,525.0,0.0,0.0,168.0,210.0,2.0,24.0,353.52,0.0,4764966.0,...,FzaX,0AJo2f2,oslk,2Kb5FSF,LM8l689qOp,-1,fKCe,RAYp,F2FyR07IdsN7I,-1
2,5236.0,7.0,904.0,1212.0,1515.0,26.0,816.0,220.08,0.0,5883894.0,...,FzaX,JFM1BiF,Al6ZaUT,NKv4yOc,jySVZNlOJy,kG3k,Qu4f,02N6s8f,ib5G6X1eUxUn6,am7c
3,-1.0,0.0,0.0,-1.0,0.0,-1.0,0.0,22.08,0.0,0.0,...,FzaX,L91KIiz,oslk,CE7uk3u,LM8l689qOp,-1,FSa2,RAYp,F2FyR07IdsN7I,-1
4,1029.0,7.0,3216.0,64.0,80.0,4.0,64.0,200.00,0.0,0.0,...,FzaX,OrnLfvc,oslk,1J2cvxe,LM8l689qOp,kG3k,FSa2,RAYp,F2FyR07IdsN7I,mj86
5,658.0,7.0,3156.0,224.0,280.0,2.0,72.0,200.00,5.0,0.0,...,FzaX,KbkKEj0,zCkv,QqVuch3,LM8l689qOp,-1,Qcbd,02N6s8f,Zy3gnGM,am7c
6,1680.0,7.0,2952.0,308.0,385.0,4.0,128.0,176.56,0.0,13158.0,...,FzaX,JO03372,oslk,XlgxB9z,LM8l689qOp,kG3k,FSa2,RAYp,55YFVY9,am7c
7,77.0,0.0,0.0,32.0,40.0,2.0,16.0,230.56,0.0,3776496.0,...,-1,U8IKsQe,oslk,R2LdzOv,-1,-1,FSa2,RAYp,F2FyR07IdsN7I,-1
8,1176.0,7.0,2912.0,200.0,250.0,2.0,64.0,300.32,0.0,6014460.0,...,FzaX,ROeipLp,zCkv,K2SqEo9,jySVZNlOJy,kG3k,PM2D,6fzt,am14IcfM7tWLrUmRT52KtA,am7c
9,1141.0,7.0,164.0,208.0,260.0,2.0,72.0,166.56,5.0,5317974.0,...,FzaX,fabLnWA,oslk,EPqQcw6,LM8l689qOp,kG3k,FSa2,RAYp,55YFVY9,-1
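Replacing every NaN with -1 also places the integer -1 inside the string-valued columns, as the output above shows. A possible variation, not used in this notebook, is to fill numerical and categorical columns with different sentinels. The sketch below uses a toy frame and the hypothetical label `"missing"`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "num": [1.0, np.nan, 3.0],
    "cat": ["a", np.nan, "b"],
})

# Select columns by dtype so each group gets its own sentinel.
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

filled = toy.copy()
filled[num_cols] = filled[num_cols].fillna(-1)          # numeric sentinel
filled[cat_cols] = filled[cat_cols].fillna("missing")   # hypothetical label
```

Either way, the missing values end up as their own category once the categorical columns are one-hot encoded.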


Handling categorical inputs

In [11]:
def categorical_column_to_onehot_column(column_name, dataframe):
    a = pd.get_dummies(dataframe[column_name]).values.tolist()
    print(len(a[0]))
    return a

In [12]:
_=df.copy()

In [13]:
for i in range(191, 231):
    if 'Var'+str(i) in df.columns:
        a = len(list(set(_['Var'+str(i)].values)))
        print(i, a)

192 400
193 54
194 4
195 23
196 4
197 250
198 5796
199 8134
200 22629
201 3
202 6110
203 6
204 102
205 4
206 23
207 14
208 3
210 7
211 2
212 85
214 22629
216 2692
217 19231
218 3
219 23
220 5796
221 7
222 5796
223 5
225 4
226 25
227 7
228 30
229 5


In [14]:
print(_['Var200'].value_counts())

-1         50696
yP09M03      153
Uw6SDm8       99
Ipi9M03       88
EvCZGt8       57
5YIkUea       56
MF5S0rA       48
b1M9M03       48
NvI9wLk       45
gWmZGt8       42
Uw6kiXL       40
n6_SDm8       28
DlIkiXL       28
qRz9wLk       27
Uw69wLk       25
NDkZGt8       25
qRzZGt8       24
7aLZlRR       24
b1MS0rA       22
MF59M03       22
G6OkUea       22
MF5ZlRR       21
60P9wLk       21
5Ox9M03       21
jCr9wLk       19
NdsZGt8       19
QK29wLk       19
IpiZlRR       18
qVbZlRR       18
7aL9M03       17
           ...  
oiSeYgL        1
jbqMRi5        1
6UTh5Jf        1
M3O2S0R        1
_GpSLd4        1
abit0qI        1
9vyP7wZ        1
feRmaWF        1
1kDbUax        1
vjrGuNR        1
trdE7Ny        1
CuQS5Gz        1
QK2F8Lx        1
3Vudake        1
0eHEROy        1
XNXHqEk        1
eh2p_w9        1
NvIpZcL        1
FLXk3Sr        1
gWm6Y3Q        1
3mxshuR        1
5HPvD1h        1
Q3a80V7        1
qHmV0Gu        1
3mxs2hv        1
CdD1rmq        1
_VHFg8y        1
n6_yhTE       

Column Var200 has a very large number of categories, which can make the one-hot conversion difficult and the dataset very heavy. For this reason, we replace the classes that appear infrequently with 'other'.

In [15]:
def replace_low_freq(column, threshold=16, replacement='other'):
    c = column.value_counts()
    m = pd.Series(replacement, c.index[c <= threshold])
    return column.replace(m)
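The effect of this kind of replacement is easiest to see on a small series. The sketch below is an equivalent formulation using `map` and `where` rather than the function above. The helper name `collapse_rare` and the toy values are illustrative only.

```python
import pandas as pd

def collapse_rare(column, threshold=1, replacement="other"):
    """Replace values occurring `threshold` times or fewer with `replacement`."""
    # Map each entry to the number of times its value occurs in the column.
    counts = column.map(column.value_counts())
    # Keep frequent values, substitute the replacement everywhere else.
    return column.where(counts > threshold, replacement)

toy = pd.Series(["x", "x", "x", "y", "y", "z"])
collapsed = collapse_rare(toy, threshold=1)
print(collapsed.tolist())
```

Here only `"z"` occurs once, so it is the only value collapsed into `"other"`.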

In [16]:
for i in range(191, 231):
    if 'Var'+str(i) in df.columns:
        _['Var'+str(i)] = replace_low_freq(_['Var'+str(i)])
        a = len(list(set(_['Var'+str(i)].values)))
        print(i, a)

192 243
193 36
194 4
195 15
196 4
197 170
198 1087
199 562
200 31
201 3
202 1842
203 5
204 101
205 4
206 23
207 12
208 3
210 7
211 2
212 55
214 31
216 381
217 958
218 3
219 14
220 1087
221 7
222 1087
223 5
225 4
226 24
227 7
228 26
229 5


This reduces the load of categorical data, so the one-hot conversion becomes more manageable. It is important to note that this 'cleanup' did not remove unnecessary data. Rather, it removed the data I considered less important, in order to make it feasible to run the model.

In [17]:
for i in range(191, 231):
    if 'Var'+str(i) in df.columns:
        _['Var'+str(i)] = categorical_column_to_onehot_column('Var'+str(i), _)
        print(i)

243
192
36
193
4
194
15
195
4
196
170
197
1087
198
562
199
31
200
3
201
1842
202
5
203
101
204
4
205
23
206
12
207
3
208
7
210
2
211
55
212
31
214
381
216
958
217
3
218
14
219
1087
220
7
221
1087
222
5
223
4
225
24
226
7
227
26
228
5
229
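The cell above stores each one-hot vector as a Python list inside a single dataframe cell. Many estimators expect one column per category instead, so a possible alternative is to let `pd.get_dummies` expand the frame directly. This is a sketch of that option on three sample rows, not what the notebook does.

```python
import pandas as pd

# Three rows taken from the sample shown earlier (Var6 and Var221 only).
toy = pd.DataFrame({
    "Var6": [1526.0, 525.0, 5236.0],
    "Var221": ["oslk", "oslk", "Al6ZaUT"],
})

# One 0/1 column per category, prefixed with the source column name.
encoded = pd.get_dummies(toy, columns=["Var221"], dtype=int)
print(encoded.columns.tolist())
```

The numerical column is left untouched, and the result is a flat table of numbers.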


In [18]:
_.size

7600000

In [19]:
_

Unnamed: 0,Var6,Var7,Var13,Var21,Var22,Var24,Var25,Var28,Var35,Var38,...,Var219,Var220,Var221,Var222,Var223,Var225,Var226,Var227,Var228,Var229
0,1526.0,7.0,184.0,464.0,580.0,14.0,128.0,166.56,0.0,3570.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"
1,525.0,0.0,0.0,168.0,210.0,2.0,24.0,353.52,0.0,4764966.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"
2,5236.0,7.0,904.0,1212.0,1515.0,26.0,816.0,220.08,0.0,5883894.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"
3,-1.0,0.0,0.0,-1.0,0.0,-1.0,0.0,22.08,0.0,0.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"
4,1029.0,7.0,3216.0,64.0,80.0,4.0,64.0,200.00,0.0,0.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0]"
5,658.0,7.0,3156.0,224.0,280.0,2.0,72.0,200.00,5.0,0.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"
6,1680.0,7.0,2952.0,308.0,385.0,4.0,128.0,176.56,0.0,13158.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"
7,77.0,0.0,0.0,32.0,40.0,2.0,16.0,230.56,0.0,3776496.0,...,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"
8,1176.0,7.0,2912.0,200.0,250.0,2.0,64.0,300.32,0.0,6014460.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"
9,1141.0,7.0,164.0,208.0,260.0,2.0,72.0,166.56,5.0,5317974.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"


In [20]:
train_df = _[:50000]
test_df = _[50000:]

In [21]:
train_df

Unnamed: 0,Var6,Var7,Var13,Var21,Var22,Var24,Var25,Var28,Var35,Var38,...,Var219,Var220,Var221,Var222,Var223,Var225,Var226,Var227,Var228,Var229
0,1526.0,7.0,184.0,464.0,580.0,14.0,128.0,166.56,0.0,3570.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"
1,525.0,0.0,0.0,168.0,210.0,2.0,24.0,353.52,0.0,4764966.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"
2,5236.0,7.0,904.0,1212.0,1515.0,26.0,816.0,220.08,0.0,5883894.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"
3,-1.0,0.0,0.0,-1.0,0.0,-1.0,0.0,22.08,0.0,0.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"
4,1029.0,7.0,3216.0,64.0,80.0,4.0,64.0,200.00,0.0,0.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0]"
5,658.0,7.0,3156.0,224.0,280.0,2.0,72.0,200.00,5.0,0.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"
6,1680.0,7.0,2952.0,308.0,385.0,4.0,128.0,176.56,0.0,13158.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"
7,77.0,0.0,0.0,32.0,40.0,2.0,16.0,230.56,0.0,3776496.0,...,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"
8,1176.0,7.0,2912.0,200.0,250.0,2.0,64.0,300.32,0.0,6014460.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"
9,1141.0,7.0,164.0,208.0,260.0,2.0,72.0,166.56,5.0,5317974.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"


In [22]:
test_df

Unnamed: 0,Var6,Var7,Var13,Var21,Var22,Var24,Var25,Var28,Var35,Var38,...,Var219,Var220,Var221,Var222,Var223,Var225,Var226,Var227,Var228,Var229
0,1225.0,7.0,100.0,156.0,195.0,0.0,72.0,166.56,0.0,4259232.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[0, 1, 0, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"
1,259.0,0.0,0.0,192.0,240.0,0.0,40.0,300.32,5.0,4859550.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"
2,861.0,14.0,236.0,32.0,40.0,0.0,8.0,186.64,0.0,10038840.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"
3,1568.0,7.0,1232.0,448.0,560.0,4.0,88.0,166.56,0.0,116760.0,...,"[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[0, 0, 1, 0]","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...","[0, 1, 0, 0, 0]"
4,1197.0,7.0,204.0,100.0,125.0,8.0,40.0,133.12,0.0,257772.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[0, 1, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"
5,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.00,-1.0,-1.0,...,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]","[1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0]"
6,476.0,7.0,28.0,112.0,140.0,0.0,56.0,133.12,0.0,0.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"
7,812.0,7.0,2508.0,152.0,190.0,2.0,40.0,253.52,0.0,4952868.0,...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0]","[0, 1, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"
8,2044.0,7.0,196.0,184.0,230.0,6.0,80.0,166.56,0.0,3667740.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"
9,518.0,7.0,2248.0,180.0,225.0,10.0,56.0,280.24,0.0,4454778.0,...,"[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]","[0, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0]"


In [23]:
import pickle
def save_as_pickled_object(obj, filepath):
    """
    Defensive alternative to pickle.dump: writes the pickled bytes in chunks
    of at most 2**31 - 1 bytes, allowing for very large files on all platforms
    """
    max_bytes = 2**31 - 1
    bytes_out = pickle.dumps(obj)
    # len() is the exact payload size; sys.getsizeof would also count object overhead
    n_bytes = len(bytes_out)
    with open(filepath, 'wb') as f_out:
        for idx in range(0, n_bytes, max_bytes):
            f_out.write(bytes_out[idx:idx+max_bytes])

In [24]:
# save the dataframes so the preprocessing does not have to be run again
save_as_pickled_object(train_df, '../saved_train_dataframe_00.pkl')
save_as_pickled_object(test_df, '../saved_test_dataframe_00.pkl')
# _.to_pickle('./saved_dataframe_00.pkl')
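To read the saved files back, a matching loader can read the file in chunks of the same maximum size and unpickle the joined bytes. The function below is a sketch under that assumption. The name `load_pickled_object` is hypothetical, and the round trip at the end uses a small temporary file rather than the real dataframes.

```python
import os
import pickle
import tempfile

def load_pickled_object(filepath, max_bytes=2**31 - 1):
    """Read a pickle file in chunks of at most `max_bytes` and unpickle it."""
    n_bytes = os.path.getsize(filepath)
    buffer = bytearray()
    with open(filepath, "rb") as f_in:
        # One read per chunk, mirroring the chunked write.
        for _offset in range(0, n_bytes, max_bytes):
            buffer += f_in.read(max_bytes)
    return pickle.loads(bytes(buffer))

# Round trip on a small object written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkl")
    with open(path, "wb") as f_out:
        pickle.dump({"rows": 3}, f_out)
    restored = load_pickled_object(path)
```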