# Kaggle : Santander Product Recommendation
- In this competition, you are provided with 1.5 years of customer behavior data from Santander Bank to predict which new products customers will purchase.
- The data starts at 2015-01-28 and has monthly records of products a customer has, such as "credit card", "savings account", etc.
- **You will predict which additional products a customer will get in the last month, 2016-06-28, beyond what they already have at 2016-05-28. These products are the columns named ind_(xyz)_ult1, which are columns #25 - #48 in the training data.**
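
The target described above is a month-over-month difference: a product counts as "added" when it is held in the current month but was not held in the previous one. Here is a minimal sketch of that comparison on made-up toy data. The helper name `added_products` and the two-column frame are illustrative assumptions, not part of the competition files.

```python
import pandas as pd

# Toy data (made up for illustration): two monthly snapshots, two product columns.
snapshots = pd.DataFrame({
    "fecha_dato": ["2016-04-28", "2016-04-28", "2016-05-28", "2016-05-28"],
    "ncodpers": [1, 2, 1, 2],
    "ind_cco_fin_ult1": [1, 0, 1, 1],
    "ind_tjcr_fin_ult1": [0, 0, 1, 0],
})
product_cols = ["ind_cco_fin_ult1", "ind_tjcr_fin_ult1"]


def added_products(df, prev_month, curr_month, product_cols):
    """Return a 0/1 frame marking products held in curr_month but not in prev_month."""
    prev = df[df["fecha_dato"] == prev_month].set_index("ncodpers")[product_cols]
    curr = df[df["fecha_dato"] == curr_month].set_index("ncodpers")[product_cols]
    # Assumption: customers absent from the previous month held nothing.
    prev = prev.reindex(curr.index).fillna(0)
    return ((curr == 1) & (prev == 0)).astype(int)


added = added_products(snapshots, "2016-04-28", "2016-05-28", product_cols)
print(added)
```

In this toy frame, customer 1 adds a credit card and customer 2 adds a current account. A product that was already held does not count as added.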

## Data Dictionary
### Variable
 - #### Property of Customer
    - fecha_dato : The table is partitioned by this column
    - ncodpers : Customer code
    - ind_empleado : Employee index (A active, B ex-employee, F filial, N not employee, P passive)
    - pais_residencia : Customer's Country residence
    - sexo : Customer's sex
    - age : Age
    - fecha_alta : The date on which the customer became the first holder of a contract in the bank
    - antiguedad : Customer seniority(in months)
    - indrel : 1(First/Primary), 99(Primary customer during the month but not at the end of the month)
    - ult_fec_cli_1t : Last date as primary customer (if he isn't at the end of the month)
    - indrel_1mes : Customer type at the beginning of the month: 1 (First/Primary customer), 2 (co-owner), P (Potential), 3 (former primary), 4 (former co-owner)
    - indresi : Residence index (S (Yes) or N (No), whether the residence country is the same as the bank country)
    - indext : Foreigner index (S (Yes) or N (No), whether the customer's birth country is different from the bank country)
    - conyuemp : Spouse index. 1 if the customer is spouse of an employee
    - canal_entrada : channel used by the customer to join
    - indfall : Deceased index. N/S
    - tipodom : Address type. 1, primary address
    - cod_prov : Province code (customer's address)
    - nomprov : Province name
    - ind_actividad_cliente : Activity index (1, active customer; 0, inactive customer)
    - renta : Gross income of the household
    - segmento : Segmentation: 01 - VIP, 02 - Individuals, 03 - College graduates
 - #### Products(ind_aaa_aaa_ult1)
    - ahor_fin : Saving Account
    - aval_fin : Guarantees
    - cco_fin : Current Accounts
    - cder_fin : Derivada Account
    - cno_fin : Payroll Account
    - ctju_fin : Junior Account
    - ctma_fin : Mas particular Account
    - ctop_fin : particular Account
    - ctpp_fin : particular Plus Account
    - deco_fin : Short-term deposits
    - deme_fin : Medium-term deposits
    - dela_fin : Long-term deposits
    - ecue_fin : e-account
    - fond_fin : Funds
    - hip_fin : Mortgage
    - plan_fin : Pensions
    - pres_fin : Loans
    - reca_fin : Taxes
    - tjcr_fin : Credit Card
    - valo_fin : Securities
    - viv_fin : Home Account
    - nomina : Payroll
    - nom_pens : Pensions
    - recibo : Direct Debit
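
Several customer properties (for example `ind_empleado` and `ind_actividad_cliente`) also start with `ind_`, so the prefix alone does not identify a product column. As a small sketch, the product columns can be picked out by requiring the `_ult1` suffix as well. The header list below is a shortened, illustrative subset rather than the full 48 columns.

```python
import pandas as pd

# Illustrative subset of the header; the real file has 48 columns.
columns = pd.Index([
    "fecha_dato", "ncodpers", "ind_empleado", "ind_nuevo",
    "ind_actividad_cliente", "ind_ahor_fin_ult1",
    "ind_nomina_ult1", "ind_recibo_ult1",
])

# Product columns start with "ind_" and end with "_ult1".
product_cols = [c for c in columns if c.startswith("ind_") and c.endswith("_ult1")]
print(product_cols)
```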

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

## 1. Exploratory Data Analysis

In [3]:
train = pd.read_csv('train_ver2.csv')
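
When pandas reads a large CSV with default settings it infers dtypes chunk by chunk, which is one way a column can end up holding both strings and integers. One option, sketched here on a tiny in-memory CSV rather than the real `train_ver2.csv`, is to read the affected columns as strings and convert them explicitly later. The toy values are made up.

```python
import io
import pandas as pd

# Toy CSV standing in for train_ver2.csv (values are made up).
csv_text = (
    "ncodpers,age,antiguedad,indrel_1mes,conyuemp\n"
    "1, 35,  6,1.0,\n"
    "2, NA, NA,P,N\n"
)

# Reading these columns as str avoids per-chunk type inference.
mixed_cols = ["age", "antiguedad", "indrel_1mes", "conyuemp"]
df = pd.read_csv(io.StringIO(csv_text), dtype={col: str for col in mixed_cols})
print(df.dtypes)
```

Passing `low_memory=False` to `pd.read_csv` is another option. It lets pandas infer each column's type from the whole file at the cost of more memory during parsing.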



In [4]:
train_raw = train.copy()

In [5]:
train.tail()

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,fecha_alta,ind_nuevo,antiguedad,indrel,...,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_pres_fin_ult1,ind_reca_fin_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_recibo_ult1
13647304,2016-05-28,1166765,N,ES,V,22,2013-08-14,0.0,33,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
13647305,2016-05-28,1166764,N,ES,V,23,2013-08-14,0.0,33,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
13647306,2016-05-28,1166763,N,ES,H,47,2013-08-14,0.0,33,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
13647307,2016-05-28,1166789,N,ES,H,22,2013-08-14,0.0,33,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0
13647308,2016-05-28,1550586,N,ES,H,37,2016-05-13,1.0,0,1.0,...,0,0,0,0,0,0,0,0.0,0.0,0


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13647309 entries, 0 to 13647308
Data columns (total 48 columns):
fecha_dato               object
ncodpers                 int64
ind_empleado             object
pais_residencia          object
sexo                     object
age                      object
fecha_alta               object
ind_nuevo                float64
antiguedad               object
indrel                   float64
ult_fec_cli_1t           object
indrel_1mes              object
tiprel_1mes              object
indresi                  object
indext                   object
conyuemp                 object
canal_entrada            object
indfall                  object
tipodom                  float64
cod_prov                 float64
nomprov                  object
ind_actividad_cliente    float64
renta                    float64
segmento                 object
ind_ahor_fin_ult1        int64
ind_aval_fin_ult1        int64
ind_cco_fin_ult1         int64
ind_cder_fin_ult1  

In [7]:
train.isnull().sum()

fecha_dato                      0
ncodpers                        0
ind_empleado                27734
pais_residencia             27734
sexo                        27804
age                             0
fecha_alta                  27734
ind_nuevo                   27734
antiguedad                      0
indrel                      27734
ult_fec_cli_1t           13622516
indrel_1mes                149781
tiprel_1mes                149781
indresi                     27734
indext                      27734
conyuemp                 13645501
canal_entrada              186126
indfall                     27734
tipodom                     27735
cod_prov                    93591
nomprov                     93591
ind_actividad_cliente       27734
renta                     2794375
segmento                   189368
ind_ahor_fin_ult1               0
ind_aval_fin_ult1               0
ind_cco_fin_ult1                0
ind_cder_fin_ult1               0
ind_cno_fin_ult1                0
ind_ctju_fin_u
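
Raw counts are hard to compare against 13,647,309 rows. From the counts above, `conyuemp` is missing in about 99.99% of rows, `ult_fec_cli_1t` in about 99.8%, and `renta` in about 20.5%. A short sketch of computing the missing share directly, shown on a made-up toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame (made up) with a few missing values.
toy = pd.DataFrame({
    "ncodpers": [1, 2, 3, 4],
    "sexo": ["V", "H", None, "V"],
    "renta": [np.nan, 101850.0, np.nan, 68710.98],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```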

In [8]:
# numeric dtype columns
number_cols = [col for col in train.columns[:24] if train[col].dtype in ['int64', 'float64']]
train[number_cols].describe()

Unnamed: 0,ncodpers,ind_nuevo,indrel,tipodom,cod_prov,ind_actividad_cliente,renta
count,13647310.0,13619580.0,13619580.0,13619574.0,13553720.0,13619580.0,10852930.0
mean,834904.2,0.05956184,1.178399,1.0,26.57147,0.4578105,134254.3
std,431565.0,0.2366733,4.177469,0.0,12.78402,0.4982169,230620.2
min,15889.0,0.0,1.0,1.0,1.0,0.0,1202.73
25%,452813.0,0.0,1.0,1.0,15.0,0.0,68710.98
50%,931893.0,0.0,1.0,1.0,28.0,0.0,101850.0
75%,1199286.0,0.0,1.0,1.0,35.0,1.0,155956.0
max,1553689.0,1.0,99.0,1.0,52.0,1.0,28894400.0


In [9]:
# categorical dtype columns
cat_cols = [col for col in train.columns[:24] if train[col].dtype in ['object']]
train[cat_cols].describe()

Unnamed: 0,fecha_dato,ind_empleado,pais_residencia,sexo,age,fecha_alta,antiguedad,ult_fec_cli_1t,indrel_1mes,tiprel_1mes,indresi,indext,conyuemp,canal_entrada,indfall,nomprov,segmento
count,13647309,13619575,13619575,13619505,13647309,13619575,13647309,24793,13497528.0,13497528,13619575,13619575,1808,13461183,13619575,13553718,13457941
unique,17,5,118,2,235,6756,507,223,13.0,5,2,2,2,162,2,52,3
top,2016-05-28,N,ES,V,23,2014-07-28,0,2015-12-24,1.0,I,S,N,N,KHE,N,MADRID,02 - PARTICULARES
freq,931453,13610977,13553710,7424252,542682,57389,134335,763,7277607.0,7304875,13553711,12974839,1791,4055270,13584813,4409600,7960220


In [10]:
train_mixed_columns = [train.columns[i] for i in [5, 8, 11, 15]]
print('mixed types columns : ', train_mixed_columns)

mixed types columns :  ['age', 'antiguedad', 'indrel_1mes', 'conyuemp']
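
The column positions above are hard-coded. As an alternative sketch, mixed-type columns can be detected by counting the distinct Python types in each column. The helper name `mixed_type_columns` is hypothetical, and the toy frame is made up.

```python
import pandas as pd

# Toy frame (made up): "age" mixes padded strings and integers.
toy = pd.DataFrame({
    "ncodpers": [1, 2, 3],
    "age": [" 35", " NA", 37],
    "sexo": ["V", "H", "V"],
})


def mixed_type_columns(df):
    """Return the columns whose non-missing values have more than one Python type."""
    # dropna() keeps float NaN from being counted as a second type.
    return [col for col in df.columns if df[col].dropna().map(type).nunique() > 1]


print(mixed_type_columns(toy))
```

On the full training set this scans every value, so it is slower than reading the column positions from the pandas warning.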


## Handle mixed-type columns
- 4 columns in the train data
- 1 column in the test data

### 1. Mixed-type columns in train

In [11]:
print(train_mixed_columns)

['age', 'antiguedad', 'indrel_1mes', 'conyuemp']


In [12]:
# 1. age
train['age'].unique()

array([' 35', ' 23', ' 22', ' 24', ' 65', ' 28', ' 25', ' 26', ' 53',
       ' 27', ' 32', ' 37', ' 31', ' 39', ' 63', ' 33', ' 55', ' 42',
       ' 58', ' 38', ' 50', ' 30', ' 45', ' 44', ' 36', ' 29', ' 60',
       ' 57', ' 67', ' 47', ' NA', ' 34', ' 48', ' 46', ' 54', ' 84',
       ' 15', ' 12', '  8', '  6', ' 83', ' 40', ' 77', ' 69', ' 52',
       ' 59', ' 43', ' 10', '  9', ' 49', ' 41', ' 51', ' 78', ' 16',
       ' 11', ' 73', ' 62', ' 66', ' 17', ' 68', ' 82', ' 95', ' 96',
       ' 56', ' 61', ' 79', ' 72', ' 14', ' 19', ' 13', ' 86', ' 64',
       ' 20', ' 89', ' 71', '  7', ' 70', ' 74', ' 21', ' 18', ' 75',
       '  4', ' 80', ' 81', '  5', ' 76', ' 92', ' 93', ' 85', ' 91',
       ' 87', ' 90', ' 94', ' 99', ' 98', ' 88', ' 97', '100', '101',
       '106', '103', '  3', '  2', '102', '104', '111', '107', '109',
       '105', '112', '115', '110', '116', '108', '113', 37, 81, 43, 30,
       45, 41, 67, 59, 46, 36, 47, 69, 39, 44, 40, 38, 34, 42, 31, 35, 48,
       60, 54

In [16]:
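# replace() returns a new Series, so assign the result back.
train['age'] = train['age'].replace(' NA', 999)
# 999 does not fit in int8 (max 127), so use int16. Without the assignment above,
# astype fails with: ValueError: invalid literal for int() with base 10: ' NA'
train['age'] = train['age'].astype(np.int16)
train['age'].unique()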
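
The age conversion can be generalized into a small helper for columns that mix padded strings, `'NA'` markers, and integers. This is a sketch under assumptions: the helper name `to_clean_int` and the `-1` sentinel are illustrative choices, and the toy Series is made up.

```python
import numpy as np
import pandas as pd


def to_clean_int(series, sentinel=-1):
    """Strip padding, turn unparseable values into `sentinel`, and return int16."""
    stripped = series.astype(str).str.strip()
    # errors="coerce" converts anything non-numeric (such as 'NA') to NaN.
    numeric = pd.to_numeric(stripped, errors="coerce")
    return numeric.fillna(sentinel).astype(np.int16)


# Toy values (made up) mimicking the mix seen in train['age'].
toy_age = pd.Series([" 35", " NA", 37, "116"], dtype=object)
clean = to_clean_int(toy_age)
print(clean.tolist())
```

`int16` is used because `int8` tops out at 127, which leaves little headroom above the largest ages in the data and cannot hold a sentinel such as 999.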

In [None]:
# 2. 

# What I learned
### 1. Reducing pandas memory usage
- https://www.dataquest.io/blog/pandas-big-data/
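
As a rough sketch of the general idea, three common ways to shrink a DataFrame are downcasting integers, storing floats as `float32`, and converting low-cardinality strings to `category`. The toy frame below is made up, and the actual savings on the training set depend on its columns.

```python
import numpy as np
import pandas as pd

# Toy frame (made up) with the default 64-bit and object dtypes.
toy = pd.DataFrame({
    "ind_cco_fin_ult1": np.array([0, 1, 1, 0] * 250, dtype="int64"),
    "renta": np.array([101850.0, 68710.98, 155956.0, 134254.3] * 250, dtype="float64"),
    "sexo": pd.Series(["V", "H", "V", "V"] * 250, dtype=object),
})
before = toy.memory_usage(deep=True).sum()

slim = toy.copy()
# 0/1 flags fit in the smallest unsigned integer type.
slim["ind_cco_fin_ult1"] = pd.to_numeric(slim["ind_cco_fin_ult1"], downcast="unsigned")
# float32 halves the size but accepts some loss of precision.
slim["renta"] = slim["renta"].astype(np.float32)
# Repeated strings are stored once each as category codes.
slim["sexo"] = slim["sexo"].astype("category")
after = slim.memory_usage(deep=True).sum()

print(before, after)
```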