<a href="https://colab.research.google.com/github/pepemaluza/DL2021/blob/main/DL2021_Cleaned_Version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

DL 2021 - Bank Marketing Campaign

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import tensorflow as tf

from sklearn.metrics import confusion_matrix, classification_report

In [3]:
bank = pd.read_csv("drive/MyDrive/DL/bank-additional-full.csv", sep=';')
dfBank = bank.copy()  # work on a copy so the raw data stays untouched

print("Shape: " + str(dfBank.shape))
dfBank.info()  # info() prints its summary directly and returns None

Shape: (41188, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.id

# Dataset

The dataset has 41,188 rows and 21 columns. All values are reported as non-null, but we check anyway.

# Attributes
`Information taken from Kaggle`

**Bank client data:**

* Age : Age of the lead (numeric)
* Job : type of job (Categorical)
* Marital : Marital status (Categorical)
* Education : Educational Qualification of the lead (Categorical)
* Default: Does the lead have any credit in default (unpaid)? (Categorical)
* Housing: Does the lead have a housing loan? (Categorical)
* loan: Does the lead have a personal loan? (Categorical)

**Related with the last contact of the current campaign:**

* Contact: Contact communication type (Categorical)
* Month: last contact month of year (Categorical)
* day_of_week: last contact day of the week (categorical)
* duration: last contact duration, in seconds (numeric).

**Important note:** Duration highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
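The note above can be turned into a small switch between a benchmark feature set and a realistic one. This is only a sketch on a made-up toy frame; the helper name `split_features` and the toy values are assumptions, not part of the notebook.

```python
import pandas as pd

# Tiny illustrative frame; values are invented, only the column names
# match the dataset described above.
toy = pd.DataFrame({
    "age": [56, 37, 44],
    "duration": [261, 0, 442],
    "campaign": [1, 2, 1],
    "y": ["no", "no", "yes"],
})

def split_features(df, realistic=True):
    """Return (X, y); drop `duration` when a realistic model is intended."""
    leaky = ["duration"] if realistic else []
    X = df.drop(columns=["y"] + leaky)
    return X, df["y"]

X_benchmark, _ = split_features(toy, realistic=False)
X_realistic, _ = split_features(toy, realistic=True)
print(list(X_benchmark.columns))
print(list(X_realistic.columns))
```

The rest of this notebook keeps `duration`, so its scores are best read as benchmark figures.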

**Other attributes:**

* campaign: number of contacts performed during this campaign and for this client (numeric)
* pdays: number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
* previous: number of contacts performed before this campaign and for this client (numeric)
* poutcome: outcome of the previous marketing campaign (categorical)
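Because `pdays` uses 999 as a sentinel rather than a real number of days, one option is to split it into an indicator plus a cleaned value. The sketch below runs on a toy frame; the column names `previously_contacted` and `pdays_clean` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"pdays": [999, 3, 999, 6]})

# 1 if the client was contacted in a previous campaign, 0 otherwise.
toy["previously_contacted"] = (toy["pdays"] != 999).astype(int)

# Keep real day counts and turn the 999 sentinel into NaN so it is not
# treated as a very long gap.
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] != 999)

print(toy)
```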

**Social and economic context attributes**

* emp.var.rate: employment variation rate - quarterly indicator (numeric)
* cons.price.idx: consumer price index - monthly indicator (numeric)
* cons.conf.idx: consumer confidence index - monthly indicator (numeric)
* euribor3m: euribor 3 month rate - daily indicator (numeric)
* nr.employed: number of employees - quarterly indicator (numeric)

**Output variable (desired target):**

* y - has the client subscribed to a term deposit? (binary: 'yes','no')

In [None]:
dfBank.head()  # preview the first rows

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [None]:
dfBank.dtypes

age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

In [4]:
dfBank.replace('unknown', np.nan, inplace=True)

In [5]:
# Main Encoding Helper Functions
def encode_onehot(df, columns, prefixes): #categorical
    df = df.copy()
    for column, prefix in zip(columns, prefixes):
        dumdums = pd.get_dummies(df[column], prefix=prefix)
        df = pd.concat([df, dumdums], axis=1)
        df = df.drop(column, axis=1)        
    return df

def encode_ordinal(df, columns, orderings): #ordinal encode
    df = df.copy()
    for column, ordering in zip(columns, orderings):
        df[column] = df[column].apply(lambda x: ordering.index(x))
    return df

def encode_binary(df, columns, positive_values): #binary encode (0,1)
    df = df.copy()
    for column, positive_value in zip(columns, positive_values):
        df[column] = df[column].apply(lambda x: 1 if x == positive_value else x)
        df[column] = df[column].apply(lambda x: 0 if str(x) != 'nan' else x)
    return df

In [6]:
categorical_features = [
    'job',
    'marital',
    'education',
    'day_of_week',
    'poutcome'
]

ordinal_features = [
    'month'
]

binary_features = [
    'default',
    'housing',
    'loan',
    'contact'
]

In [7]:
prefixes = ['J', 'M', 'E', 'D', 'P']  # job, marital, education, day_of_week, poutcome (one-hot columns)

orderings = [
    ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'] #months
]

positive_values = [
    'yes',    #has default
    'yes',    #has housing contract
    'yes',    #has loan
    'cellular'  #cellphone
]


In [8]:
dfBank = encode_onehot(dfBank, categorical_features, prefixes)
dfBank = encode_ordinal(dfBank, ordinal_features, orderings)
dfBank = encode_binary(dfBank, binary_features, positive_values)
dfBank

Unnamed: 0,age,default,housing,loan,contact,month,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,J_admin.,J_blue-collar,J_entrepreneur,J_housemaid,J_management,J_retired,J_self-employed,J_services,J_student,J_technician,J_unemployed,M_divorced,M_married,M_single,E_basic.4y,E_basic.6y,E_basic.9y,E_high.school,E_illiterate,E_professional.course,E_university.degree,D_fri,D_mon,D_thu,D_tue,D_wed,P_failure,P_nonexistent,P_success
0,56,0.0,0.0,0.0,0,4,261,1,999,0,1.1,93.994,-36.4,4.857,5191.0,no,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0
1,57,,0.0,0.0,0,4,149,1,999,0,1.1,93.994,-36.4,4.857,5191.0,no,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0
2,37,0.0,0.0,0.0,0,4,226,1,999,0,1.1,93.994,-36.4,4.857,5191.0,no,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0
3,40,0.0,0.0,0.0,0,4,151,1,999,0,1.1,93.994,-36.4,4.857,5191.0,no,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0
4,56,0.0,0.0,0.0,0,4,307,1,999,0,1.1,93.994,-36.4,4.857,5191.0,no,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,0.0,0.0,0.0,0,10,334,1,999,0,-1.1,94.767,-50.8,1.028,4963.6,yes,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0
41184,46,0.0,0.0,0.0,0,10,383,1,999,0,-1.1,94.767,-50.8,1.028,4963.6,no,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0
41185,56,0.0,0.0,0.0,0,10,189,2,999,0,-1.1,94.767,-50.8,1.028,4963.6,no,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0
41186,44,0.0,0.0,0.0,0,10,442,1,999,0,-1.1,94.767,-50.8,1.028,4963.6,yes,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0
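In the output above, every non-missing value in the binary columns is 0, including row 2, whose `housing` value was `yes` in the raw data. This follows from `encode_binary`: its second `apply` maps everything that is not NaN to 0, which includes the 1s written by the first `apply`. A minimal sketch of a single-pass variant that keeps the 1s and preserves NaN is shown below; the name `encode_binary_single_pass` is hypothetical, and the results reported later in this notebook were not produced with it.

```python
import numpy as np
import pandas as pd

def encode_binary_single_pass(df, columns, positive_values):
    """Map positive_value -> 1.0, other values -> 0.0, and keep NaN as NaN."""
    df = df.copy()
    for column, positive_value in zip(columns, positive_values):
        is_positive = (df[column] == positive_value).astype(float)
        # Restore NaN wherever the original value was missing.
        df[column] = is_positive.where(df[column].notna())
    return df

toy = pd.DataFrame({
    "housing": ["yes", "no", np.nan],
    "contact": ["cellular", "telephone", "cellular"],
})
encoded = encode_binary_single_pass(toy, ["housing", "contact"], ["yes", "cellular"])
print(encoded)
```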


In [9]:
dfBank.y = dfBank.y.apply(lambda x: 1 if x == 'yes' else 0)

In [None]:
print('Remaining Missing Values: ', dfBank.isna().sum().sum())

Remaining Missing Values:  10577


In [10]:
for column in binary_features:
    dfBank[column]=dfBank[column].fillna(dfBank[column].mean())

In [None]:
print('Remaining Missing Values: ', dfBank.isna().sum().sum())

Remaining Missing Values:  0


In [11]:
X = dfBank.drop('y', axis=1)
y = dfBank.y

In [12]:
scaler = StandardScaler()

X = scaler.fit_transform(X)

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.8, random_state=42)
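The cells above fit the `StandardScaler` on the full dataset before splitting, so test-set statistics influence the scaling. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training portion only. The `_demo` names and the `stratify` argument are assumptions added for illustration; the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
y_demo = (rng.random(1000) < 0.11).astype(int)  # imbalanced synthetic target

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42, stratify=y_demo
)

# Fit on the training data only, then reuse the same statistics for the test data.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3), X_te_scaled.mean(axis=0).round(3))
```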

In [15]:
inputs = tf.keras.Input(shape=(X.shape[1],))  # shape must be a tuple, hence the trailing comma

# 2 hidden layers with 64 units each, ReLU activation
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = tf.keras.layers.Dense(64, activation='relu')(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer='Nadam',  # Adam with Nesterov momentum
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        tf.keras.metrics.AUC(name='auc')
    ]
)
# training hyperparameters
batch_size = 64 
epochs = 100 

history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size = batch_size,
    epochs = epochs,
    callbacks = [
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True
        )
    ]
)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100


In [16]:
model.evaluate(X_test, y_test) 



[0.19164294004440308, 0.9094440340995789, 0.9351575970649719]

# Adam

* loss: 0.1911 - accuracy: 0.9110 - auc: 0.9371 (batch size 16, 100 epochs, 2 layers)
* loss: 0.1938 - accuracy: 0.9086 - auc: 0.9335 (batch size 16, 200 epochs, 2 layers)
* loss: 0.1920 - accuracy: 0.9105 - auc: 0.9346 (batch size 16, 100 epochs, 3 layers)
* loss: 0.1968 - accuracy: 0.9063 - auc: 0.9318 (batch size 16, 200 epochs, 3 layers)
* loss: 0.1931 - accuracy: 0.9077 - auc: 0.9353 (batch size 32, 100 epochs, 2 layers)
* loss: 0.1919 - accuracy: 0.9098 - auc: 0.9349 (batch size 32, 200 epochs, 2 layers)
* loss: 0.1905 - accuracy: 0.9096 - auc: 0.9363 (batch size 64, 100 epochs, 2 layers)
* loss: 0.1920 - accuracy: 0.9082 - auc: 0.9348 (batch size 64, 200 epochs, 2 layers)

In [None]:
y_true = np.array(y_test)
# np.int was removed in NumPy 1.24; the built-in int is equivalent here
y_pred = np.squeeze(np.array(model.predict(X_test) >= 0.9, dtype=int))

In [None]:
print("Confusion Matrix: \n ", confusion_matrix(y_true, y_pred))

Confusion Matrix: 
  [[7296    7]
 [ 889   46]]
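The confusion matrix above can be read as `[[TN, FP], [FN, TP]]`, which is the layout scikit-learn uses for binary labels 0 and 1. As a quick check of the arithmetic, the sketch below recomputes precision, recall, and accuracy for class 1 from those four counts.

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm = np.array([[7296, 7],
               [889, 46]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # 46 / 53
recall = tp / (tp + fn)          # 46 / 935
accuracy = (tp + tn) / cm.sum()  # 7342 / 8238

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.87 recall=0.05 accuracy=0.89
```

Only 46 of the 935 positive cases are caught at this cutoff, which is why recall is so low even though accuracy looks high.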


In [None]:
print("Classification Report: \n", classification_report(y_true, y_pred))

Classification Report: 
               precision    recall  f1-score   support

           0       0.89      1.00      0.94      7303
           1       0.87      0.05      0.09       935

    accuracy                           0.89      8238
   macro avg       0.88      0.52      0.52      8238
weighted avg       0.89      0.89      0.85      8238
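The 0.9 cutoff used to build `y_pred` trades almost all recall on class 1 for precision, whereas the accuracy from `model.evaluate` is computed at Keras's default 0.5 threshold. A sketch of how several cutoffs could be compared is shown below. It uses synthetic scores rather than the model's predictions, so the printed numbers say nothing about this model; `y_true_demo` and `scores` are made-up names.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_true_demo = (rng.random(2000) < 0.11).astype(int)

# Synthetic scores: positives tend to score higher than negatives.
scores = np.clip(rng.normal(0.2 + 0.4 * y_true_demo, 0.2), 0.0, 1.0)

results = {}
for threshold in (0.3, 0.5, 0.7, 0.9):
    y_hat = (scores >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true_demo, y_hat, zero_division=0),
        recall_score(y_true_demo, y_hat, zero_division=0),
    )
    p, r = results[threshold]
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases as the cutoff goes up. Running the same loop on `model.predict(X_test)` would show where this model's precision and recall balance out.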

