## Santander Customer Transaction Prediction Resolution

Description:
At Santander our mission is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals.

Our data science team is continually challenging our machine learning algorithms, working with the global data science community to make sure we can more accurately identify new ways to solve our most common challenge, binary classification problems such as: is a customer satisfied? Will a customer buy this product? Can a customer pay this loan?

In this challenge, we invite Kagglers to help us identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data we have available to solve this problem.

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
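
To make the metric concrete, here is a minimal sketch of what the area under the ROC curve measures: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The helper name `pairwise_auc` and the toy scores are illustrative assumptions, not part of the competition material.

```python
from itertools import product


def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): two negatives and two positives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
print(auc)  # 3 of the 4 pairs are ordered correctly
```

Because only the ordering of the scores matters, submitting predicted probabilities generally preserves more ranking information than submitting hard 0/1 labels.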

Submission File

For each Id in the test set, you must make a binary prediction of the target variable. The file should contain a header and have the following format:


Group Members:
- Gabriel
- Ricardo
- Mariana
- Gustavo
- Jérémie

## Importing Libraries and Train DataSet

In [1]:
ls

Desafio 3-GaussianNB.ipynb
Desafio 3-GaussianNB_v2 - CODIGO RICARDO.ipynb
Desafio 3-GaussianNB_v2.ipynb
Desafio 3-GaussianNB_v3 - com 27 colunas de maior correlação.ipynb
Desafio 3-GaussianNB_vSimples.ipynb
Desafio 3.ipynb
data/
img/
submission/


In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [3]:
df = pd.read_csv(r'data/train.csv')

In [4]:
df.head()

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,train_1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,train_3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104


## Exploring and understanding Train DataSet

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Columns: 202 entries, ID_code to var_199
dtypes: float64(200), int64(1), object(1)
memory usage: 308.2+ MB


In [6]:
df.target.value_counts(dropna=False)

0    179902
1     20098
Name: target, dtype: int64
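
The counts above show a strong class imbalance. As a quick worked calculation using only the two numbers printed by `value_counts`, the sketch below computes the positive rate and the accuracy a trivial "always predict 0" model would reach, which is why plain accuracy on the raw data can be misleading.

```python
# Counts copied from the value_counts output above.
counts = {0: 179902, 1: 20098}

total = sum(counts.values())
positive_rate = counts[1] / total
# Accuracy of a model that always predicts the majority class (0).
majority_baseline_accuracy = counts[0] / total

print(total)
print(round(positive_rate, 5))
print(round(majority_baseline_accuracy, 5))
```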

In [7]:
df.corr()

Unnamed: 0,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
target,1.000000,0.052390,0.050343,0.055870,0.011055,0.010915,0.030979,6.673085e-02,-0.003025,0.019584,...,0.055973,0.047114,-0.042858,-0.017709,-0.022838,0.028285,0.023608,-0.035303,-0.053000,0.025434
var_0,0.052390,1.000000,-0.000544,0.006573,0.003801,0.001326,0.003046,6.982549e-03,0.002429,0.004962,...,0.002752,0.000206,-0.005373,0.001616,-0.001514,0.002073,0.004386,-0.000753,-0.005776,0.003850
var_1,0.050343,-0.000544,1.000000,0.003980,0.000010,0.000303,-0.000902,3.257729e-03,0.001511,0.004098,...,0.006627,0.003621,-0.002604,0.001153,-0.002557,-0.000785,-0.000377,-0.004157,-0.004861,0.002287
var_2,0.055870,0.006573,0.003980,1.000000,0.001001,0.000723,0.001569,8.825211e-04,-0.000991,0.002648,...,0.000197,0.001285,-0.003400,0.000549,0.002104,-0.001070,0.003952,0.001078,-0.000877,0.003855
var_3,0.011055,0.003801,0.000010,0.001001,1.000000,-0.000322,0.003253,-7.743892e-04,0.002500,0.003553,...,0.000151,0.002445,-0.001530,-0.001699,-0.001054,0.001206,-0.002800,0.001164,-0.001651,0.000506
var_4,0.010915,0.001326,0.000303,0.000723,-0.000322,1.000000,-0.001368,4.882529e-05,0.004549,0.001194,...,0.001514,0.004357,0.003347,0.000813,-0.000068,0.003706,0.000513,-0.000046,-0.001821,-0.000786
var_5,0.030979,0.003046,-0.000902,0.001569,0.003253,-0.001368,1.000000,2.587780e-03,-0.000995,0.000147,...,0.001466,-0.000022,0.001116,-0.002237,-0.002543,-0.001274,0.002880,-0.000535,-0.000953,0.002767
var_6,0.066731,0.006983,0.003258,0.000883,-0.000774,0.000049,0.002588,1.000000e+00,-0.002548,-0.001188,...,0.000721,0.005604,-0.002563,0.002464,-0.001141,0.001244,0.005378,-0.003565,-0.003025,0.006096
var_7,-0.003025,0.002429,0.001511,-0.000991,0.002500,0.004549,-0.000995,-2.547746e-03,1.000000,0.000814,...,-0.000337,-0.003957,0.001733,0.003219,-0.000270,0.001854,0.001045,0.003466,0.000650,-0.001457
var_8,0.019584,0.004962,0.004098,0.002648,0.003553,0.001194,0.000147,-1.187995e-03,0.000814,1.000000,...,0.002923,-0.001151,-0.000429,0.001414,0.001313,0.001396,-0.003242,-0.004583,0.002950,0.000854
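
The full correlation matrix is hard to read at this size. One way to focus on the relationship with the target is to take only the `target` column and rank features by absolute correlation. The sketch below does this on a small synthetic frame; the column names `var_a`, `var_b`, `var_c` and the data are made up for illustration. On recent pandas versions, calling `corr()` on a frame that still contains a string column such as `ID_code` may require `numeric_only=True`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 1000

# Synthetic stand-in for the training data (assumption: not the real dataset).
demo = pd.DataFrame({
    "var_a": rng.normal(size=n_rows),
    "var_b": rng.normal(size=n_rows),
    "var_c": rng.normal(size=n_rows),
})
# The target depends on var_b only, plus some noise.
demo["target"] = (demo["var_b"] + 0.5 * rng.normal(size=n_rows) > 1.0).astype(int)

# Correlation of every feature with the target, ranked by absolute value.
target_corr = demo.corr()["target"].drop("target")
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```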


## Regression (OLS) to discard variables that don't influence the target

In [8]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

y_lgst = df['target']

function1 = 'y_lgst ~ var_0+ var_1+var_2+var_3+var_4+var_5+var_6+var_7+var_8+var_9+var_10+var_11+var_12+var_13+var_14+var_15+var_16+var_17+var_18+var_19+var_20+var_21+var_22+var_23+var_24+var_25+var_26+var_27+var_28+var_29+var_30+var_31+var_32+var_33+var_34+var_35+var_36+var_37+var_38+var_39+var_40+var_41+var_42+var_43+var_44+var_45+var_46+var_47+var_48+var_49+var_50+var_51+var_52+var_53+var_54+var_55+var_56+var_57+var_58+var_59+var_60+var_61+var_62+var_63+var_64+var_65+var_66+var_67+var_68+var_69+var_70+var_71+var_72+var_73+var_74+var_75+var_76+var_77+var_78+var_79+var_80+var_81+var_82+var_83+var_84+var_85+var_86+var_87+var_88+var_89+var_90+var_91+var_92+var_93+var_94+var_95+var_96+var_97+var_98+var_99+var_100+var_101+var_102+var_103+var_104+var_105+var_106+var_107+var_108+var_109+var_110+var_111+var_112+var_113+var_114+var_115+var_116+var_117+var_118+var_119+var_120+var_121+var_122+var_123+var_124+var_125+var_126+var_127+var_128+var_129+var_130+var_131+var_132+var_133+var_134+var_135+var_136+var_137+var_138+var_139+var_140+var_141+var_142+var_143+var_144+var_145+var_146+var_147+var_148+var_149+var_150+var_151+var_152+var_153+var_154+var_155+var_156+var_157+var_158+var_159+var_160+var_161+var_162+var_163+var_164+var_165+var_166+var_167+var_168+var_169+var_170+var_171+var_172+var_173+var_174+var_175+var_176+var_177+var_178+var_179+var_180+var_181+var_182+var_183+var_184+var_185+var_186+var_187+var_188+var_189+var_190+var_191+var_192+var_193+var_194+var_195+var_196+var_197+var_198+var_199'

model1 = smf.ols(function1, df).fit()
print(model1.summary2())

                  Results: Ordinary least squares
Model:              OLS              Adj. R-squared:     0.181     
Dependent Variable: y_lgst           AIC:                47142.4925
Date:               2019-05-24 20:10 BIC:                49193.9131
No. Observations:   200000           Log-Likelihood:     -23370.   
Df Model:           200              F-statistic:        221.9     
Df Residuals:       199799           Prob (F-statistic): 0.00      
R-squared:          0.182            Scale:              0.074039  
--------------------------------------------------------------------
                Coef.   Std.Err.     t      P>|t|    [0.025   0.975]
--------------------------------------------------------------------
Intercept       4.9416    0.4367   11.3145  0.0000   4.0856   5.7976
var_0           0.0042    0.0002   21.0476  0.0000   0.0038   0.0046
var_1           0.0031    0.0002   20.3820  0.0000   0.0028   0.0034
var_2           0.0052    0.0002   22.5386  0.0000   0.0047 

We identified that some variables are statistically insignificant in this model:
var_7, var_10, var_17, var_27, var_30, var_38, var_39, var_41, var_96, var_98, var_100, var_103, var_117, var_124, var_126, var_136, var_158, var_183, var_185.
Let's drop those.
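
The formula string and the list of insignificant variables can also be produced programmatically. The sketch below builds the same kind of formula with a join and filters a p-value Series by a threshold. The `pvalues` Series here is a hypothetical stand-in with made-up numbers; in the notebook the real values would come from `model1.pvalues`, and the 0.05 cutoff is an assumption, since the text does not state which threshold was used.

```python
import pandas as pd

# Build "y_lgst ~ var_0 + var_1 + ... + var_199" without typing it by hand.
n_features = 200
feature_names = [f"var_{i}" for i in range(n_features)]
formula = "y_lgst ~ " + " + ".join(feature_names)

# Hypothetical p-values standing in for model1.pvalues (made-up numbers).
pvalues = pd.Series({
    "Intercept": 0.0,
    "var_0": 0.0,
    "var_7": 0.41,
    "var_10": 0.63,
    "var_2": 0.0,
})

alpha = 0.05  # assumed significance threshold
insignificant = [
    name for name, p in pvalues.drop("Intercept").items() if p > alpha
]
print(insignificant)
```

Note that `smf.ols` fits a linear model to the 0/1 target; `smf.logit` from the same module would fit an actual logistic regression, which is an alternative rather than what the cell above ran.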

## Building the sample for analysis

In [9]:
n = 200000 # Number of lines of the CSV file
s = int(n * 1) # Sample size to be analyzed

In [10]:
df_sample_0 = pd.read_csv(r'data/train.csv', 
                         # Data rows are file lines 1..n (line 0 is the header)
                         skiprows=sorted(random.sample(range(1, n + 1), k=n - s)),
                         encoding='latin1', 
                         engine='c',
                         low_memory=True)
df_sample_0.head()

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,train_1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,train_3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104
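
The sampling trick in the cell above works by telling `read_csv` which file lines to skip: the header is line 0, the data rows are lines 1 through `n`, and skipping `n - s` randomly chosen data lines leaves `s` rows. A minimal sketch on an in-memory CSV (the ten-row file and the seed are illustrative assumptions):

```python
import io
import random

import pandas as pd

# Tiny in-memory CSV standing in for data/train.csv (made-up values).
csv_text = "ID_code,target,var_0\n" + "".join(
    f"train_{i},{i % 2},{i * 0.5}\n" for i in range(10)
)

n = 10  # number of data rows in the file
s = 4   # number of rows to keep

random.seed(0)  # fixed seed so the sample is reproducible
# Line 0 is the header, so candidate lines to skip are 1..n inclusive.
rows_to_skip = sorted(random.sample(range(1, n + 1), k=n - s))

sample = pd.read_csv(io.StringIO(csv_text), skiprows=rows_to_skip)
print(sample)
```

With `s` equal to `n`, as in the notebook, nothing is skipped and the whole file is loaded.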


In [11]:
df_sample_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Columns: 202 entries, ID_code to var_199
dtypes: float64(200), int64(1), object(1)
memory usage: 308.2+ MB


In [12]:
df_sample_0.target.value_counts(dropna=False)

0    179902
1     20098
Name: target, dtype: int64

In [13]:
df_sample = df_sample_0.drop(['var_7', 'var_10', 'var_17', 'var_27', 'var_30', 'var_38', 'var_39', 'var_41', 'var_96', 'var_98', 'var_100', 'var_103', 'var_117', 'var_124', 'var_126', 'var_136', 'var_158', 'var_183', 'var_185'], axis=1)

In [14]:
df_sample.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 183 columns):
ID_code    200000 non-null object
target     200000 non-null int64
var_0      200000 non-null float64
var_1      200000 non-null float64
var_2      200000 non-null float64
var_3      200000 non-null float64
var_4      200000 non-null float64
var_5      200000 non-null float64
var_6      200000 non-null float64
var_8      200000 non-null float64
var_9      200000 non-null float64
var_11     200000 non-null float64
var_12     200000 non-null float64
var_13     200000 non-null float64
var_14     200000 non-null float64
var_15     200000 non-null float64
var_16     200000 non-null float64
var_18     200000 non-null float64
var_19     200000 non-null float64
var_20     200000 non-null float64
var_21     200000 non-null float64
var_22     200000 non-null float64
var_23     200000 non-null float64
var_24     200000 non-null float64
var_25     200000 non-null float64
var_26     20000

In [15]:
# Independent Variables
X = df_sample.drop(['target','ID_code'], axis=1)

# Dependent Variable
y = df_sample['target']

## Class balancing and re-modeling

In [18]:
df_sample_copy = df_sample.copy(deep=True)

In [19]:
df_obj = df_sample[df_sample['target']==1]
df_nao_obj = df_sample[df_sample['target']==0].sample(df_obj.shape[0])

print(df_obj.shape)
print(df_nao_obj.shape)

(20098, 183)
(20098, 183)


In [20]:
newdf = pd.concat([df_obj, df_nao_obj])
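
The cells above undersample the majority class so both classes have the same number of rows. A hedged variant of the same idea is sketched below on synthetic data: it fixes `random_state` so the sample is reproducible and shuffles the concatenated frame so the positives are not all stacked first. The seed value and the synthetic frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic imbalanced data: roughly 10% positives (assumption).
demo = pd.DataFrame({
    "var_0": rng.normal(size=1000),
    "target": (rng.random(1000) < 0.1).astype(int),
})

positives = demo[demo["target"] == 1]
# Draw as many negatives as there are positives, reproducibly.
negatives = demo[demo["target"] == 0].sample(n=len(positives), random_state=42)

# Concatenate and shuffle so the class order carries no information.
balanced = (
    pd.concat([positives, negatives])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["target"].value_counts())
```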

## Creating DataFrames of the Dependent and Independent Variables

In [21]:
# Independent Variables
Xnew = newdf.drop(['target','ID_code'], axis=1)

# Dependent Variable
ynew = newdf['target']

## Normalizing the variables

In [22]:
from sklearn.preprocessing import StandardScaler
newscaler = StandardScaler()
newscaler.fit(Xnew)
Xnew_norm = newscaler.transform(Xnew)

In [23]:
from sklearn.model_selection import train_test_split
Xnew_norm_train, Xnew_norm_test, ynew_norm_train, ynew_norm_test = train_test_split(Xnew_norm, ynew, test_size = 0.20, random_state=0)
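
In the cells above the scaler is fitted on all of `Xnew` before the split, so the held-out rows contribute to the scaling statistics. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training part only; `stratify` keeps the class ratio equal in both parts. This is an illustrative variation, not what the notebook ran, and all names below are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic balanced data: 200 rows, 3 features (assumption).
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = np.array([0, 1] * 100)

# Split first, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0).round(6), X_test_std.mean(axis=0).round(3))
```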

## Multiple model training (only GaussianNB, the best performer, is shown)

In [24]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [25]:
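# Only GaussianNB is left active, matching the output shown below.
# After the loop, clfNV therefore refers to the fitted GaussianNB model.
classifiers = [
    # KNeighborsClassifier(3),
    GaussianNB(),
    # LogisticRegression(),
    # SVC(),
    # DecisionTreeClassifier(),
    # RandomForestClassifier(),
    # GradientBoostingClassifier(),
]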

for clfNV in classifiers:
    clfNV.fit(Xnew_norm_train, ynew_norm_train)
    name = clfNV.__class__.__name__
    
    print("="*30)
    print(name)
    
    print('****Results****')
    target_NV = clfNV.predict(Xnew_norm_test)
    print("Accuracy:", metrics.accuracy_score(ynew_norm_test, target_NV))
    print("Precision:", metrics.precision_score(ynew_norm_test, target_NV))
    print("Recall:", metrics.recall_score(ynew_norm_test, target_NV))
    print("F1_score:",metrics.f1_score(ynew_norm_test, target_NV))

GaussianNB
****Results****
('Accuracy:', 0.8103233830845771)
('Precision:', 0.8102967283794066)
('Recall:', 0.804380664652568)
('F1_score:', 0.8073278584965257)
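
The metrics above are computed from hard class predictions, while the competition is scored on ROC AUC. A minimal sketch of how the same kind of model could also be scored with `roc_auc_score` on predicted probabilities is shown below, using synthetic data; the data, sizes, and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n_rows = 2000

# Synthetic data: positives are shifted by 0.8 on every feature (assumption).
y_demo = (rng.random(n_rows) < 0.5).astype(int)
X_demo = rng.normal(size=(n_rows, 5)) + 0.8 * y_demo[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

clf = GaussianNB().fit(X_tr, y_tr)
# Column 1 of predict_proba is the estimated probability of class 1.
proba = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
print("ROC AUC:", auc)
```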


In [26]:
# Cross Validation
from sklearn.model_selection import cross_val_score

In [27]:
cross_valNV = cross_val_score(clfNV, Xnew_norm_train, ynew_norm_train, cv=10)
cross_valNV.mean()

0.8082476323669366
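
`cross_val_score` reports accuracy by default for classifiers. The sketch below shows one way to cross-validate on the competition metric instead, by passing `scoring="roc_auc"`, and wraps the scaler and the model in a pipeline so the scaler is re-fitted inside each fold. It runs on synthetic data; the fold count, seed, and names are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_rows = 1000

# Synthetic data with a class-dependent shift (assumption).
y_demo = (rng.random(n_rows) < 0.5).astype(int)
X_demo = rng.normal(size=(n_rows, 4)) + 0.7 * y_demo[:, None]

# Scaling happens inside each fold, so validation rows never influence it.
pipeline = make_pipeline(StandardScaler(), GaussianNB())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```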

## Applying to Test Dataset and submission file creation

In [30]:
df_teste = pd.read_csv(r'data/test.csv')

In [31]:
df_teste.shape

(200000, 201)

In [32]:
X_teste = df_teste.drop(['ID_code','var_7', 'var_10', 'var_17', 'var_27', 'var_30', 'var_38', 'var_39', 'var_41', 'var_96', 'var_98', 'var_100', 'var_103', 'var_117', 'var_124', 'var_126', 'var_136', 'var_158', 'var_183', 'var_185'], axis=1)

In [33]:
X_teste.shape

(200000, 181)

In [34]:
X_teste_norm = pd.DataFrame(newscaler.transform(X_teste))

In [35]:
X_teste_norm.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,171,172,173,174,175,176,177,178,179,180
count,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,...,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0
mean,-0.070417,-0.062439,-0.076433,-0.016633,-0.014229,-0.042256,-0.080068,-0.028327,0.05554,-0.036591,...,-0.077653,-0.054448,0.055749,0.024845,0.028616,-0.02462,-0.032148,0.049876,0.06473,-0.025691
std,0.960142,0.977258,0.964762,1.000825,0.988282,0.984406,0.964969,0.999256,0.983896,0.990232,...,0.969003,0.97175,0.993079,0.991239,0.992887,0.979864,0.986487,0.983179,0.964163,0.997117
min,-3.38081,-3.308065,-3.135764,-3.337212,-3.433111,-2.883961,-3.649813,-3.095946,-2.601036,-3.781163,...,-3.757387,-3.223364,-3.478382,-3.630683,-2.69467,-3.299504,-2.972781,-2.878431,-2.910768,-3.497996
25%,-0.770991,-0.806387,-0.798699,-0.776111,-0.738922,-0.811694,-0.797154,-0.802052,-0.699721,-0.743115,...,-0.777012,-0.790609,-0.644099,-0.653516,-0.718168,-0.728529,-0.80004,-0.646591,-0.583204,-0.781077
50%,-0.116242,-0.054277,-0.130187,1e-05,4.7e-05,-0.015179,-0.106365,0.000111,0.105533,-0.044299,...,-0.083479,-0.07991,0.033375,0.051262,0.022493,-0.044046,-0.011734,0.02892,0.088514,0.024209
75%,0.587506,0.655331,0.578334,0.733909,0.705375,0.707452,0.579068,0.766938,0.866558,0.644622,...,0.592312,0.611386,0.746932,0.732423,0.785696,0.64126,0.733825,0.779566,0.762049,0.758854
max,3.617691,2.600338,2.856307,3.081215,3.018684,2.747886,3.14213,2.786053,2.798443,3.292151,...,3.577845,2.919477,4.134888,3.575055,3.174476,3.182284,2.436593,3.640689,3.483751,2.961631
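
The means above are close to but not exactly zero, which is expected: the scaler's statistics come from the balanced training sample, not from the test set. The transformed frame also loses its column names, since `transform` returns a plain array. A small sketch of keeping one shared list of dropped columns and restoring names and index after scaling follows; the three-column frames and the `dropped` list are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
cols = ["var_0", "var_1", "var_2"]

# Synthetic stand-ins for the train and test feature frames (assumption).
train_X = pd.DataFrame(rng.normal(size=(50, 3)), columns=cols)
test_X = pd.DataFrame(rng.normal(size=(20, 3)), columns=cols)

# One shared list, reused for train and test so both drop the same columns.
dropped = ["var_1"]
feature_cols = [c for c in train_X.columns if c not in dropped]

scaler = StandardScaler().fit(train_X[feature_cols])

# Re-attach column names and index to the scaled test data.
test_norm = pd.DataFrame(
    scaler.transform(test_X[feature_cols]),
    columns=feature_cols,
    index=test_X.index,
)
print(test_norm.head())
```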


In [36]:
y_teste_pred = clfNV.predict_proba(X_teste_norm)[:,1]

In [37]:
y_teste_pred.shape

(200000,)

In [38]:
X_teste_norm.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,171,172,173,174,175,176,177,178,179,180
0,0.058224,2.212072,0.746304,1.271004,0.203609,0.291753,0.404418,0.528221,1.046406,-0.217124,...,-1.215734,1.356101,-2.19614,-0.191288,-1.326288,1.757734,0.343603,1.979681,-0.062486,-0.55053
1,-0.74335,0.633781,0.142332,-0.797905,-1.163027,0.087699,0.594468,-1.434334,-1.218798,0.451354,...,1.503572,0.387753,-0.605633,1.713342,-0.736591,0.394114,-0.716105,1.073696,1.109456,-1.725782
2,-1.706965,-2.174854,-0.284028,0.109979,-0.511654,1.816119,-0.660559,0.34524,0.67437,0.001007,...,-0.916127,1.081137,0.092223,-0.258469,-1.557039,1.525529,-1.734662,-1.925897,1.355022,-1.937065
3,-0.741136,0.010615,0.40507,-0.120637,-1.377991,0.986642,-0.610674,0.900479,-0.033602,-0.335142,...,1.280804,0.465392,-0.124045,0.089169,-0.859601,2.236605,0.269576,0.388389,-0.849335,-0.118173
4,0.260642,0.298314,1.177022,0.452587,-1.220436,-0.484371,1.531777,0.784617,-0.284474,-0.038942,...,0.14295,0.496132,-0.37534,0.038685,0.52092,-0.129032,-1.381933,-1.684188,-0.557976,-0.595109


In [39]:
df_final_teste = pd.DataFrame(df_teste['ID_code'])

In [40]:
df_final_teste.shape

(200000, 1)

In [41]:
df_final_teste['target'] = y_teste_pred

In [42]:
df_final_teste.shape

(200000, 2)

In [43]:
df_final_teste.to_csv(r'submission/submission_23.csv', index=False)
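
Before uploading, it can be worth checking the file against the format described at the top: a header row, one row per `ID_code`, and a `target` column. The sketch below writes a tiny made-up submission to an in-memory buffer and runs a few sanity checks; the three rows and the check names are illustrative assumptions.

```python
import io

import pandas as pd

# Tiny made-up submission standing in for df_final_teste.
submission = pd.DataFrame({
    "ID_code": ["test_0", "test_1", "test_2"],
    "target": [0.12, 0.87, 0.40],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
header_line = buffer.getvalue().splitlines()[0]
buffer.seek(0)
reloaded = pd.read_csv(buffer)

checks = {
    "columns": list(reloaded.columns) == ["ID_code", "target"],
    "no_missing": not reloaded.isna().any().any(),
    "unique_ids": reloaded["ID_code"].is_unique,
    "probabilities_in_range": bool(reloaded["target"].between(0, 1).all()),
}
print(header_line)
print(checks)
```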