# KDD Cup 2009: Customer Relationship Prediction

## Introduction

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

In a CRM system, the most practical way to build knowledge about customers is to produce scores. A score (the output of a model) is an evaluation, for every instance, of a target variable to explain (i.e. churn, appetency, or up-selling). Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed from the input variables that describe each instance. Scores are then used by the information system (IS), for example, to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction, and indexation, based on an efficient model combined with variable selection regularization and a model averaging method. The main characteristic of this platform is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. Rapid and robust detection of the variables that contribute most to the output prediction can be a key factor in a marketing application.

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables) and unbalanced class distributions. Time efficiency is often a crucial point. Therefore, part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.

## Objective

Our objective here is to develop models to predict churn, appetency, and up-selling.

## Dataset

As described on the problem page, the dataset consists of 230 variables, 190 of them numerical and 40 categorical.

In [1]:
import seaborn as sns
import pandas as pd

In [2]:
df = pd.read_csv('/home/miguel/Downloads/orange_small_train.data/orange_small_train.data', delimiter='\t')
df

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Var6,Var7,Var8,Var9,Var10,...,Var221,Var222,Var223,Var224,Var225,Var226,Var227,Var228,Var229,Var230
0,,,,,,1526.0,7.0,,,,...,oslk,fXVEsaq,jySVZNlOJy,,,xb3V,RAYp,F2FyR07IdsN7I,,
1,,,,,,525.0,0.0,,,,...,oslk,2Kb5FSF,LM8l689qOp,,,fKCe,RAYp,F2FyR07IdsN7I,,
2,,,,,,5236.0,7.0,,,,...,Al6ZaUT,NKv4yOc,jySVZNlOJy,,kG3k,Qu4f,02N6s8f,ib5G6X1eUxUn6,am7c,
3,,,,,,,0.0,,,,...,oslk,CE7uk3u,LM8l689qOp,,,FSa2,RAYp,F2FyR07IdsN7I,,
4,,,,,,1029.0,7.0,,,,...,oslk,1J2cvxe,LM8l689qOp,,kG3k,FSa2,RAYp,F2FyR07IdsN7I,mj86,
5,,,,,,658.0,7.0,,,,...,zCkv,QqVuch3,LM8l689qOp,,,Qcbd,02N6s8f,Zy3gnGM,am7c,
6,,,,,,1680.0,7.0,,,,...,oslk,XlgxB9z,LM8l689qOp,,kG3k,FSa2,RAYp,55YFVY9,am7c,
7,,,,,,77.0,0.0,,,,...,oslk,R2LdzOv,,,,FSa2,RAYp,F2FyR07IdsN7I,,
8,,,,,,1176.0,7.0,,,,...,zCkv,K2SqEo9,jySVZNlOJy,,kG3k,PM2D,6fzt,am14IcfM7tWLrUmRT52KtA,am7c,
9,,,,,,1141.0,7.0,,,,...,oslk,EPqQcw6,LM8l689qOp,,kG3k,FSa2,RAYp,55YFVY9,,


From the sample shown, we can see that the NaN values in the dataframe will need to be handled.

In [3]:
def show_missing(dataframe):
    missing = dataframe.columns[dataframe.isnull().any()].tolist()
    return missing

In [4]:
missing_values_count = df.isnull().sum()

In [5]:
missing_values_count

Var1      49298
Var2      48759
Var3      48760
Var4      48421
Var5      48513
Var6       5529
Var7       5539
Var8      50000
Var9      49298
Var10     48513
Var11     48760
Var12     49442
Var13      5539
Var14     48760
Var15     50000
Var16     48513
Var17     48421
Var18     48421
Var19     48421
Var20     50000
Var21      5529
Var22      5009
Var23     48513
Var24      7230
Var25      5009
Var26     48513
Var27     48513
Var28      5011
Var29     49298
Var30     49298
          ...  
Var201    37217
Var202        1
Var203      143
Var204        0
Var205     1934
Var206     5529
Var207        0
Var208      143
Var209    50000
Var210        0
Var211        0
Var212        0
Var213    48871
Var214    25408
Var215    49306
Var216        0
Var217      703
Var218      703
Var219     5211
Var220        0
Var221        0
Var222        0
Var223     5211
Var224    49180
Var225    26144
Var226        0
Var227        0
Var228        0
Var229    28432
Var230    50000
Length: 230, dtype: int64
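
Several columns in the counts above (for example Var8, Var15, and Var20) are missing in all 50000 rows, so they carry no information. The following is a minimal sketch of one way to act on such counts, shown on toy data. The helper name `drop_sparse_columns` and the 0.9 threshold are assumptions made for illustration; they do not come from the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(dataframe, max_missing_ratio=0.9):
    """Drop columns whose fraction of missing values exceeds the threshold.

    Hypothetical helper; the 0.9 default is an illustrative assumption.
    """
    missing_ratio = dataframe.isnull().mean()  # fraction of NaN per column
    keep = missing_ratio[missing_ratio <= max_missing_ratio].index
    return dataframe[keep]


# Toy data standing in for the real dataset.
toy = pd.DataFrame({
    "all_missing": [np.nan, np.nan, np.nan, np.nan],
    "half_missing": [1.0, np.nan, 3.0, np.nan],
    "complete": ["a", "b", "a", "c"],
})
reduced = drop_sparse_columns(toy)
print(list(reduced.columns))
```

On the real data the same call would be `drop_sparse_columns(df)`. The remaining NaN values would still need imputation or an explicit "missing" category.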

Handling categorical inputs

In [6]:
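def categorical_column_to_onehot_column(column_name, dataframe):
    # One-hot encode the column; each row becomes a list of indicator values.
    a = pd.get_dummies(dataframe[column_name]).values.tolist()
    # Print the number of indicator columns (length of one encoded row).
    print(len(a[0]))
    return a

# Work on a copy so the original df is left untouched.
_ = df.copy()
_['Var201'] = categorical_column_to_onehot_column('Var201', _)
_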

2


Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Var6,Var7,Var8,Var9,Var10,...,Var221,Var222,Var223,Var224,Var225,Var226,Var227,Var228,Var229,Var230
0,,,,,,1526.0,7.0,,,,...,oslk,fXVEsaq,jySVZNlOJy,,,xb3V,RAYp,F2FyR07IdsN7I,,
1,,,,,,525.0,0.0,,,,...,oslk,2Kb5FSF,LM8l689qOp,,,fKCe,RAYp,F2FyR07IdsN7I,,
2,,,,,,5236.0,7.0,,,,...,Al6ZaUT,NKv4yOc,jySVZNlOJy,,kG3k,Qu4f,02N6s8f,ib5G6X1eUxUn6,am7c,
3,,,,,,,0.0,,,,...,oslk,CE7uk3u,LM8l689qOp,,,FSa2,RAYp,F2FyR07IdsN7I,,
4,,,,,,1029.0,7.0,,,,...,oslk,1J2cvxe,LM8l689qOp,,kG3k,FSa2,RAYp,F2FyR07IdsN7I,mj86,
5,,,,,,658.0,7.0,,,,...,zCkv,QqVuch3,LM8l689qOp,,,Qcbd,02N6s8f,Zy3gnGM,am7c,
6,,,,,,1680.0,7.0,,,,...,oslk,XlgxB9z,LM8l689qOp,,kG3k,FSa2,RAYp,55YFVY9,am7c,
7,,,,,,77.0,0.0,,,,...,oslk,R2LdzOv,,,,FSa2,RAYp,F2FyR07IdsN7I,,
8,,,,,,1176.0,7.0,,,,...,zCkv,K2SqEo9,jySVZNlOJy,,kG3k,PM2D,6fzt,am14IcfM7tWLrUmRT52KtA,am7c,
9,,,,,,1141.0,7.0,,,,...,oslk,EPqQcw6,LM8l689qOp,,kG3k,FSa2,RAYp,55YFVY9,,
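
The cell above stores each one-hot vector as a list inside a single column. A common alternative is to expand the category into separate indicator columns, which most estimators expect. This is a minimal sketch on toy data. The helper `one_hot_encode_column` and the toy column `color` are hypothetical, and `dummy_na=True` is one possible way to keep missing values visible as their own indicator.

```python
import numpy as np
import pandas as pd


def one_hot_encode_column(dataframe, column_name):
    """Return a copy with `column_name` replaced by indicator columns.

    Hypothetical helper, not part of the original notebook.
    """
    dummies = pd.get_dummies(
        dataframe[column_name],
        prefix=column_name,
        dummy_na=True,  # add an explicit indicator column for missing values
    )
    return pd.concat([dataframe.drop(columns=[column_name]), dummies], axis=1)


# Toy data: one numeric column and one categorical column with a NaN.
toy = pd.DataFrame({
    "Var6": [1526.0, 525.0, 5236.0],
    "color": ["red", np.nan, "blue"],
})
encoded = one_hot_encode_column(toy, "color")
print(list(encoded.columns))
```

Each row of the result has exactly one active indicator, and the numeric column is left unchanged.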


In [7]:
df

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Var6,Var7,Var8,Var9,Var10,...,Var221,Var222,Var223,Var224,Var225,Var226,Var227,Var228,Var229,Var230
0,,,,,,1526.0,7.0,,,,...,oslk,fXVEsaq,jySVZNlOJy,,,xb3V,RAYp,F2FyR07IdsN7I,,
1,,,,,,525.0,0.0,,,,...,oslk,2Kb5FSF,LM8l689qOp,,,fKCe,RAYp,F2FyR07IdsN7I,,
2,,,,,,5236.0,7.0,,,,...,Al6ZaUT,NKv4yOc,jySVZNlOJy,,kG3k,Qu4f,02N6s8f,ib5G6X1eUxUn6,am7c,
3,,,,,,,0.0,,,,...,oslk,CE7uk3u,LM8l689qOp,,,FSa2,RAYp,F2FyR07IdsN7I,,
4,,,,,,1029.0,7.0,,,,...,oslk,1J2cvxe,LM8l689qOp,,kG3k,FSa2,RAYp,F2FyR07IdsN7I,mj86,
5,,,,,,658.0,7.0,,,,...,zCkv,QqVuch3,LM8l689qOp,,,Qcbd,02N6s8f,Zy3gnGM,am7c,
6,,,,,,1680.0,7.0,,,,...,oslk,XlgxB9z,LM8l689qOp,,kG3k,FSa2,RAYp,55YFVY9,am7c,
7,,,,,,77.0,0.0,,,,...,oslk,R2LdzOv,,,,FSa2,RAYp,F2FyR07IdsN7I,,
8,,,,,,1176.0,7.0,,,,...,zCkv,K2SqEo9,jySVZNlOJy,,kG3k,PM2D,6fzt,am14IcfM7tWLrUmRT52KtA,am7c,
9,,,,,,1141.0,7.0,,,,...,oslk,EPqQcw6,LM8l689qOp,,kG3k,FSa2,RAYp,55YFVY9,,


In [8]:
_=df.copy()
# for i in range(191, 201):
#     if i != 200:
#         _['Var'+str(i)] = categorical_column_to_onehot_column('Var'+str(i), _)
#         print(i)

In [10]:
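# Summing the raw values fails here (TypeError: unsupported operand type(s)
# for +: 'float' and 'str') because the columns mix NaN floats with strings.
# Count the distinct values of each categorical column instead.
for i in range(191, 221):
    a = _['Var'+str(i)].nunique()  # number of distinct non-missing values
    print(a)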
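
The number of distinct values per categorical column matters because one-hot encoding creates one indicator per value. One possible way to keep that manageable is to retain only the most frequent values and group the rest, sketched here on toy data. The helper `group_rare_categories`, the `top_k` default, and the `"OTHER"` label are assumptions for illustration only.

```python
import pandas as pd


def group_rare_categories(series, top_k=2, other_label="OTHER"):
    """Keep the `top_k` most frequent values and map the rest to `other_label`.

    Hypothetical helper; `top_k` and `other_label` are illustrative assumptions.
    Missing values are left as they are.
    """
    top_values = series.value_counts().nlargest(top_k).index
    keep_mask = series.isin(top_values) | series.isnull()
    return series.where(keep_mask, other_label)


# Toy categorical column with four distinct values and one missing entry.
toy = pd.Series(
    ["oslk", "oslk", "zCkv", "oslk", "zCkv", "Al6ZaUT", "Qu4f", None]
)
grouped = group_rare_categories(toy)
print(toy.nunique(), grouped.nunique())
```

Here the two rare values collapse into a single label, so the column drops from four distinct values to three.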