# Description
   ### The data relates to direct marketing campaigns of a banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').


    Rules and evaluation method
    
    Each group must submit an array with the results ('yes' or 'no'); performance is measured with the macro-averaged F1 score.
    
    The submission file must be in CSV format without an index. It must be a single column with the values 'yes' or 'no' in 5210 rows.
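
The submission format described above can be checked before uploading. The following is a minimal sketch, assuming pandas; the `predictions` list is a hypothetical stand-in for the 5210 model outputs, and writing the file without a header row is an assumption that mirrors the final `to_csv` call in this notebook.

```python
import io

import pandas as pd

# Hypothetical predictions; a real submission would hold 5210 of them.
predictions = ["no", "yes", "no", "no"]

submission = pd.DataFrame({"y": predictions})

# Write without the index, as the rules require. header=False is an
# assumption that mirrors the notebook's final to_csv call.
buffer = io.StringIO()
submission.to_csv(buffer, index=False, header=False)
csv_text = buffer.getvalue()

# Basic checks: one value per row, and only 'yes' / 'no' values.
rows = csv_text.strip().splitlines()
is_valid = len(rows) == len(predictions) and set(rows) <= {"yes", "no"}
print(is_valid)
```

For the real file, the same check would compare the row count against 5210.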


#### Input variables:

# Bank client data:
1 - age (numeric)

2 - job: type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')

3 - marital: marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')

5 - default: has credit in default? (categorical: 'no', 'yes', 'unknown')

6 - housing: has a housing loan? (categorical: 'no', 'yes', 'unknown')

7 - loan: has a personal loan? (categorical: 'no', 'yes', 'unknown')

# Related to the last contact of the current campaign:

8 - contact: contact communication type (categorical: 'cellular', 'telephone')

9 - month: last contact month of the year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')

11 - duration: duration of the last contact, in seconds (numeric). Important note: this attribute strongly affects the output target (for example, if duration = 0, then y = 'no'). However, the duration is not known before a call is made, and once the call ends, y is obviously known. This input should therefore be included only for benchmarking purposes and should be discarded if the intention is to build a realistic predictive model.

# Other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes the last contact)

13 - pdays: number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success')

Output variable (desired target):

21 - y - has the client subscribed to a term deposit? (binary: 'yes', 'no')
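
The `pdays` description relies on a sentinel value for "never contacted". One way to keep a model from reading that sentinel as a real number of days is to split it into an explicit flag. This is only a sketch on toy values: the `SENTINELS` list and the `contacted_before` and `pdays_clean` column names are assumptions introduced for illustration, and the notebook itself does not apply this step.

```python
import pandas as pd

# Toy values; the sentinel for "never contacted" is 999 in the description.
# -1 is included as an assumption in case the file encodes it differently.
toy = pd.DataFrame({"pdays": [-1, 182, 999, 5]})

SENTINELS = [-1, 999]  # assumption: adjust to the values in the actual file

# Explicit flag so a model does not treat the sentinel as a day count.
toy["contacted_before"] = (~toy["pdays"].isin(SENTINELS)).astype(int)

# Once the flag carries the information, replace the sentinel with 0 days.
toy["pdays_clean"] = toy["pdays"].where(toy["contacted_before"] == 1, 0)
print(toy)
```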


# Getting started

In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import curve_fit
from sklearn import preprocessing

import imblearn
from imblearn.over_sampling import RandomOverSampler

In [2]:
df_train = pd.read_csv('Trainset.csv', index_col= 'Unnamed: 0')
df_test = pd.read_csv('TestFeatures.csv')
final = pd.read_csv('submission_example.csv')

In [3]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 0 to 39999
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        40000 non-null  int64 
 1   job        40000 non-null  object
 2   marital    40000 non-null  object
 3   education  40000 non-null  object
 4   default    40000 non-null  object
 5   balance    40000 non-null  int64 
 6   housing    40000 non-null  object
 7   loan       40000 non-null  object
 8   contact    40000 non-null  object
 9   day        40000 non-null  int64 
 10  month      40000 non-null  object
 11  duration   40000 non-null  int64 
 12  campaign   40000 non-null  int64 
 13  pdays      40000 non-null  int64 
 14  previous   40000 non-null  int64 
 15  poutcome   40000 non-null  object
 16  y          40000 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.5+ MB


In [4]:
df_train.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,46,management,single,tertiary,no,593,yes,no,cellular,29,jan,190,3,-1,0,unknown,no
1,42,admin.,married,tertiary,no,1536,no,no,cellular,6,aug,140,1,182,4,failure,no
2,33,blue-collar,married,secondary,no,370,yes,no,cellular,8,apr,249,1,-1,0,unknown,no
3,29,blue-collar,single,secondary,no,1472,no,no,cellular,18,may,246,2,-1,0,unknown,no
4,29,technician,married,secondary,no,767,yes,no,cellular,5,feb,253,1,-1,0,unknown,no


In [5]:
df_train['poutcome'].unique()

array(['unknown', 'failure', 'success', 'other'], dtype=object)

In [6]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5210 entries, 0 to 5209
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        5210 non-null   int64 
 1   job        5210 non-null   object
 2   marital    5210 non-null   object
 3   education  5210 non-null   object
 4   default    5210 non-null   object
 5   balance    5210 non-null   int64 
 6   housing    5210 non-null   object
 7   loan       5210 non-null   object
 8   contact    5210 non-null   object
 9   day        5210 non-null   int64 
 10  month      5210 non-null   object
 11  duration   5210 non-null   int64 
 12  campaign   5210 non-null   int64 
 13  pdays      5210 non-null   int64 
 14  previous   5210 non-null   int64 
 15  poutcome   5210 non-null   object
dtypes: int64(7), object(9)
memory usage: 651.4+ KB


## Dropping columns
`duration` is dropped because it is not known before a call is made.



In [7]:
df_train = df_train.drop(['duration'], axis=1)
df_train.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,campaign,pdays,previous,poutcome,y
0,46,management,single,tertiary,no,593,yes,no,cellular,29,jan,3,-1,0,unknown,no
1,42,admin.,married,tertiary,no,1536,no,no,cellular,6,aug,1,182,4,failure,no
2,33,blue-collar,married,secondary,no,370,yes,no,cellular,8,apr,1,-1,0,unknown,no
3,29,blue-collar,single,secondary,no,1472,no,no,cellular,18,may,2,-1,0,unknown,no
4,29,technician,married,secondary,no,767,yes,no,cellular,5,feb,1,-1,0,unknown,no


## Distribution

In [8]:
# df_train.hist(figsize=(12,12), color = 'darkblue')
# plt.show()

### Most customers have NOT subscribed to the product

In [9]:
df_train.y.value_counts()

no     35347
yes     4653
Name: y, dtype: int64

In [10]:
df_train.y.value_counts()/df_train.shape[0]

no     0.883675
yes    0.116325
Name: y, dtype: float64
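
This imbalance is why the challenge scores macro F1 rather than accuracy. As a sketch using the class counts shown above and scikit-learn's metrics, a baseline that always answers 'no' looks accurate but scores poorly on macro F1, because the 'yes' class contributes an F1 of zero. The 0/1 encoding below is an assumption made only for this example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Class counts taken from the value_counts() output above.
n_no, n_yes = 35347, 4653
y_true = np.array([0] * n_no + [1] * n_yes)  # 0 = 'no', 1 = 'yes'

# A trivial "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 sets the undefined precision of the never-predicted
# class to 0 instead of emitting a warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={accuracy:.4f}  macro F1={macro_f1:.4f}")
```

Accuracy equals the 'no' share of the data, while macro F1 stays below 0.5.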

# Data preprocessing: encoding test and train

### Copies of the data

In [11]:
# Unbalanced data
df_prep = df_train.copy()

# Encoding the train data

## Encoding the unbalanced data

In [12]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
# Label-encode the nominal columns (the encoder is re-fit for each column)
for col in ['job', 'marital', 'education', 'contact', 'poutcome']:
    df_prep[col] = labelencoder.fit_transform(df_prep[col].values)

# Map the binary yes/no columns and the target to 1/0.
# Assigning the result avoids chained in-place replacement on a column.
for col in ['default', 'housing', 'loan', 'y']:
    df_prep[col] = df_prep[col].map({'yes': 1, 'no': 0})

# Map month abbreviations to their calendar number
months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
df_prep['month'] = df_prep['month'].map({m: i for i, m in enumerate(months, start=1)})

# CATEGORICAL ['job','marital','education','contact','month','poutcome']
# NUMERIC ['age','balance','day','campaign','pdays','previous']
# BINARY  ['default','housing','loan']

In [13]:
df_prep.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,campaign,pdays,previous,poutcome,y
0,46,4,2,2,0,593,1,0,0,29,1,3,-1,0,3,0
1,42,0,1,2,0,1536,0,0,0,6,8,1,182,4,0,0
2,33,1,1,1,0,370,1,0,0,8,4,1,-1,0,3,0
3,29,1,2,1,0,1472,0,0,0,18,5,2,-1,0,3,0
4,29,9,1,1,0,767,1,0,0,5,2,1,-1,0,3,0


In [14]:
df_prep['y'].unique(),df_prep['y'].value_counts()

(array([0, 1], dtype=int64),
 0    35347
 1     4653
 Name: y, dtype: int64)

## Train normalization

In [15]:
# Select the columns to normalize
scaled_cols = ['age','balance','campaign','pdays','previous','poutcome']
df_norm = df_prep[scaled_cols]
df_prep2 = df_prep.drop(columns=scaled_cols)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_trans = pd.DataFrame(scaler.fit_transform(df_norm), columns=df_norm.columns)
# The scaled columns come first, followed by the untouched ones
df_train_trans = pd.concat([df_trans, df_prep2], axis=1)

df_train_trans.head()

Unnamed: 0,age,balance,campaign,pdays,previous,poutcome,job,marital,education,default,housing,loan,contact,day,month,y
0,0.363636,0.078187,0.032258,0.0,0.0,1.0,4,2,2,0,1,0,0,29,1,0
1,0.311688,0.086748,0.0,0.209862,0.014545,0.0,0,1,2,0,0,0,0,6,8,0
2,0.194805,0.076163,0.0,0.0,0.0,1.0,1,1,1,0,1,0,0,8,4,0
3,0.142857,0.086167,0.016129,0.0,0.0,1.0,1,2,1,0,0,0,0,18,5,0
4,0.142857,0.079767,0.0,0.0,0.0,1.0,9,1,1,0,1,0,0,5,2,0


In [16]:
# from sklearn.preprocessing import MinMaxScaler
# train_trans  = MinMaxScaler()
# train_trans  = train_trans.fit_transform(df_prep)

# from sklearn.preprocessing import Normalizer
# test_trans  = Normalizer()
# test_trans  = test_trans.fit_transform(df_prep)


# #rebuild the dataset
# train_trans = pd.DataFrame(train_trans)
# train_trans.columns = df_prep.columns
# train_trans.reindex(columns=['age','job','marital','education','default','balance','housing',
#                          'loan','contact','day','month','campaign','pdays','previous','poutcome','y'])


# Preprocessing the test data

## Encoding the unbalanced data

#### This is needed so the model can be tested on the same preprocessed data structure as train.

In [17]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
# Label-encode the nominal columns.
# Note: the encoder is re-fit on the test data here, so the integer codes
# match the train codes only if both sets contain the same categories.
for col in ['job', 'marital', 'education', 'contact', 'poutcome']:
    df_test[col] = labelencoder.fit_transform(df_test[col].values)

# Map the binary yes/no columns to 1/0
for col in ['default', 'housing', 'loan']:
    df_test[col] = df_test[col].map({'yes': 1, 'no': 0})

# Map month abbreviations to their calendar number
months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
df_test['month'] = df_test['month'].map({m: i for i, m in enumerate(months, start=1)})

# CATEGORICAL ['job','marital','education','contact','month','poutcome']
# NUMERIC ['age','balance','day','campaign','pdays','previous']
# BINARY  ['default','housing','loan']
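
Because the encoder is fit separately on train and test, the same category can receive different integer codes in each set whenever the sets of categories differ. A minimal sketch of one alternative, assuming scikit-learn's `OrdinalEncoder` (0.24 or later) and toy frames in place of `df_train` and `df_test`: fit the mapping on train only and reuse it on test, sending unseen categories to -1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames standing in for df_train / df_test (values are illustrative).
train = pd.DataFrame({"job": ["admin.", "technician", "management"],
                      "contact": ["cellular", "unknown", "telephone"]})
test = pd.DataFrame({"job": ["technician", "student"],
                     "contact": ["telephone", "cellular"]})

cols = ["job", "contact"]

# Fit on train only; categories unseen in train are encoded as -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_enc = train.copy()
test_enc = test.copy()
train_enc[cols] = encoder.fit_transform(train[cols])
test_enc[cols] = encoder.transform(test[cols])  # reuse the train mapping
print(test_enc)
```

Here 'technician' gets the same code in both frames, and 'student', which is absent from the toy train frame, becomes -1.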

## Test normalization

In [18]:
scaled_cols_test = ['age','balance','campaign','pdays','previous']
df_norm_test = df_test[scaled_cols_test]
# 'duration' is dropped here as well, matching the train set
df_prep_test = df_test.drop(columns=scaled_cols_test + ['duration'])
from sklearn.preprocessing import MinMaxScaler
# Note: this scaler is re-fit on the test data, and 'poutcome' is left
# unscaled here, unlike in the train set
scaler_test = MinMaxScaler()
df_trans = pd.DataFrame(scaler_test.fit_transform(df_norm_test), columns=df_norm_test.columns)
# The scaled columns come first, followed by the untouched ones
test_trans = pd.concat([df_trans, df_prep_test], axis=1)

test_trans.head()

Unnamed: 0,age,balance,campaign,pdays,previous,job,marital,education,default,housing,loan,contact,day,month,poutcome
0,0.470588,0.073101,0.02381,0.0,0.0,0,2,2,0,0,0,1,28,1,3
1,0.544118,0.072116,0.0,0.0,0.0,5,1,1,0,0,0,0,21,7,3
2,0.455882,0.067485,0.166667,0.0,0.0,2,1,2,0,0,0,0,31,7,3
3,0.514706,0.077355,0.0,0.0,0.0,2,1,1,0,1,0,2,15,5,3
4,0.411765,0.066631,0.02381,0.0,0.0,4,1,2,0,0,0,0,19,8,3
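
Fitting a second `MinMaxScaler` on the test set means that the same raw value can map to different scaled values in train and test. A sketch of the usual alternative, on toy frames with made-up numbers: fit the scaler once on train and only call `transform` on test.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frames standing in for the numeric train / test columns.
train = pd.DataFrame({"age": [20, 40, 60], "balance": [0, 500, 1000]})
test = pd.DataFrame({"age": [30, 70], "balance": [250, 1500]})

scaler = MinMaxScaler()

# Learn min and max from train only: x' = (x - min) / (max - min).
train_scaled = pd.DataFrame(scaler.fit_transform(train),
                            columns=train.columns, index=train.index)

# Apply the train min and max to test without re-fitting.
test_scaled = pd.DataFrame(scaler.transform(test),
                           columns=test.columns, index=test.index)
print(test_scaled)
```

Test values outside the train range fall outside [0, 1] (for example, age 70 here); this is expected when the scaler is not re-fit.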


In [19]:
# from sklearn.preprocessing import MinMaxScaler
# test_trans  = MinMaxScaler()
# test_trans  = test_trans.fit_transform(df_test)

# from sklearn.preprocessing import Normalizer
# test_trans  = Normalizer()
# test_trans  = test_trans.fit_transform(df_test)

In [20]:
# #rebuild the dataset
# test_trans = pd.DataFrame(test_trans)
# test_trans.columns = df_test.columns

# Balancing the data

# Oversampling

In [21]:
# Split the dataset into features (x) and target (y)
x = df_train_trans.drop(['y'], axis=1)
y = df_train_trans['y']

In [22]:
random_over = RandomOverSampler(sampling_strategy='auto',
                               random_state=123)

X_over,y_over = random_over.fit_resample(x,y)
df_over = X_over
df_over['y'] = y_over

print(df_over.y.value_counts()/df_over.shape[0])
print(df_over.y.value_counts())
print(df_over.shape)
df_over.head()

0    0.5
1    0.5
Name: y, dtype: float64
0    35347
1    35347
Name: y, dtype: int64
(70694, 16)


Unnamed: 0,age,balance,campaign,pdays,previous,poutcome,job,marital,education,default,housing,loan,contact,day,month,y
0,0.363636,0.078187,0.032258,0.0,0.0,1.0,4,2,2,0,1,0,0,29,1,0
1,0.311688,0.086748,0.0,0.209862,0.014545,0.0,0,1,2,0,0,0,0,6,8,0
2,0.194805,0.076163,0.0,0.0,0.0,1.0,1,1,1,0,1,0,0,8,4,0
3,0.142857,0.086167,0.016129,0.0,0.0,1.0,1,2,1,0,0,0,0,18,5,0
4,0.142857,0.079767,0.0,0.0,0.0,1.0,9,1,1,0,1,0,0,5,2,0
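
Random oversampling duplicates minority rows, so if the balanced frame is split into train and validation parts afterwards, copies of the same row can land on both sides and inflate the validation score. A sketch of the alternative order, assuming `sklearn.utils.resample` in place of `RandomOverSampler` and a toy frame in place of the real data: split first, then oversample only the training part.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 90 'no' (0) rows and 10 'yes' (1) rows.
toy = pd.DataFrame({"x": range(100), "y": [0] * 90 + [1] * 10})

# Split first, stratifying so both parts keep the original class ratio.
train_part, valid_part = train_test_split(
    toy, test_size=0.2, stratify=toy["y"], random_state=123)

# Oversample the minority class inside the training part only.
majority = train_part[train_part["y"] == 0]
minority = train_part[train_part["y"] == 1]
minority_over = resample(minority, replace=True,
                         n_samples=len(majority), random_state=123)
train_balanced = pd.concat([majority, minority_over])

print(train_balanced["y"].value_counts())
print(valid_part["y"].value_counts())
```

The validation part keeps the original imbalance, which is closer to what the model will see in the test file.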


# Undersampling

In [23]:
from sklearn.utils import resample

# Put the minority class ('yes') in a separate dataframe
df_yes = df_train_trans[df_train['y'] == 'yes']
# Put the majority class ('no') in another dataframe
df_no = df_train_trans[df_train['y'] == 'no']

# Downsample the majority class, without replacement, to the minority size
df_no_downsampled = resample(df_no,
                             random_state=123,
                             n_samples=len(df_yes),
                             replace=False)

# Concatenate the downsampled majority class with the minority class
df_under = pd.concat([df_no_downsampled, df_yes])

print(df_under.y.value_counts()/df_under.shape[0])
print(df_under.y.value_counts())
print(df_under.shape)
df_under.head()

0    0.5
1    0.5
Name: y, dtype: float64
0    4653
1    4653
Name: y, dtype: int64
(9306, 16)


Unnamed: 0,age,balance,campaign,pdays,previous,poutcome,job,marital,education,default,housing,loan,contact,day,month,y
38599,0.519481,0.120722,0.048387,0.0,0.0,1.0,3,1,0,0,1,0,2,4,6,0
11956,0.337662,0.066684,0.0,0.0,0.0,1.0,1,1,1,0,1,0,2,5,5,0
8420,0.142857,0.085968,0.0,0.0,0.0,1.0,0,2,1,0,1,0,2,28,5,0
15096,0.298701,0.078732,0.016129,0.0,0.0,1.0,4,1,2,0,1,0,1,17,11,0
4194,0.428571,0.095555,0.032258,0.0,0.0,1.0,9,1,2,0,0,0,0,30,11,0


#### We now have two datasets, balanced in two different ways

# Running models

In [24]:
#df = df_train_trans
#df = df_over
#df = df_under

In [32]:
df_over = df_over.sample(frac=0.95, random_state=786)

In [25]:
df_over.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70694 entries, 0 to 70693
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        70694 non-null  float64
 1   balance    70694 non-null  float64
 2   campaign   70694 non-null  float64
 3   pdays      70694 non-null  float64
 4   previous   70694 non-null  float64
 5   poutcome   70694 non-null  float64
 6   job        70694 non-null  int32  
 7   marital    70694 non-null  int32  
 8   education  70694 non-null  int32  
 9   default    70694 non-null  int64  
 10  housing    70694 non-null  int64  
 11  loan       70694 non-null  int64  
 12  contact    70694 non-null  int32  
 13  day        70694 non-null  int64  
 14  month      70694 non-null  int64  
 15  y          70694 non-null  int64  
dtypes: float64(6), int32(4), int64(6)
memory usage: 7.6 MB


In [26]:
#df.info()

In [37]:
from pycaret.classification import *
clf1 = setup(data = df_over,target = 'y',train_size=0.2,session_id=123,fold_shuffle=True ,data_split_shuffle=False)

Unnamed: 0,Description,Value
0,session_id,123
1,Target,y
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(67159, 16)"
5,Missing Values,False
6,Numeric Features,11
7,Categorical Features,4
8,Ordinal Features,False
9,High Cardinality Features,False


AttributeError: 'Simple_Imputer' object has no attribute 'fill_value_categorical'

In [None]:
best_model = compare_models(sort = 'F1')

    Precision measures the quality of the predictions: what percentage of the cases we labeled as the positive class actually belong to it?
    
    Recall measures the quantity: what percentage of the positive class were we able to identify?
    
    F1 combines precision and recall into a single measure
    
    The confusion matrix shows which types of errors are being made
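
These four quantities can be computed with scikit-learn. The labels below are a made-up toy example (1 = 'yes', 0 = 'no'), not results from this notebook.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels: 3 real positives, 5 real negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
# Macro F1 averages the per-class F1 scores with equal weight.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Rows are the true classes, columns the predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(precision, recall, f1, macro_f1)
print(cm)
```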

###### Evaluating F1 before continuing with the models

    modelo_1 = 

    modelo_2 = 

    modelo_3 = 0.3472 xgboost
    
    modelo_4 = 0.3478 qda
    
    modelo_5 = 0.3262 qda   
   


In [None]:
print(best_model)

In [None]:
models()

In [None]:
catboost = create_model('catboost')

In [None]:
qda = create_model('qda')

In [None]:
lightgbm = create_model('lightgbm')

In [None]:
xgboost = create_model('xgboost')

# Hyperparameter tuning
### by F1

In [None]:
tuned_rf = tune_model(qda, optimize = 'F1')

In [None]:
print(tuned_rf)
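
For comparison, a similar search can be sketched directly in scikit-learn. This is not what `tune_model` runs internally; it is a stand-alone illustration that assumes `QuadraticDiscriminantAnalysis`, a hypothetical grid over its `reg_param`, synthetic data from `make_classification`, and macro F1 as the selection metric.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (roughly 88% / 12%), for illustration only.
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_redundant=0, weights=[0.88, 0.12],
                           random_state=123)

# Hypothetical grid over QDA's covariance regularization strength.
param_grid = {"reg_param": [0.0, 0.01, 0.1, 0.5]}

# Stratified folds keep the class ratio in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

search = GridSearchCV(QuadraticDiscriminantAnalysis(), param_grid,
                      scoring="f1_macro", cv=cv)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```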

# Ensemble

In [None]:
#boosted_dt = ensemble_model (ctb, method = 'Boosting', n_estimators = 100)

# Prediction

In [None]:
# predict_model(boosted_dt);

In [None]:
rf_final = finalize_model(tuned_rf)
# Final model parameters for deployment to production
# (despite the `rf` name, `tuned_rf` holds the tuned QDA model)
print(rf_final)

In [None]:
predict_model(tuned_rf);

# Testing the results

In [None]:
test_predictions = predict_model(rf_final, data = test_trans)
test_predictions.head()

In [None]:
test_predictions.info()

# Result

In [None]:
from pycaret.utils import check_metric

check_metric(test_predictions, test_predictions['Label'], metric = 'Accuracy')

accura_m1 =

accura_m2 = 0.9232 - score 49.2757

    fix_imbalance = True, 
                 train_size = 0.2, 
                 iterative_imputation_iters = 5,
                 normalize = True,
                 normalize_method = 'minmax',
                 transformation = True,
                 high_cardinality_features = ['job','education','month'],
                 numeric_features = ['age','balance','day','campaign','pdays','previous'] )
             
accura_m3 = 0.9378  - score 
     
     Both the test and train data need to be preprocessed
      
accura_m3 = 0.9798
     
     Applying the qda model



accura_m3 = 0.9384 

    Applying catboost







# Saving the data for upload

In [None]:
# Map the numeric predictions back to the 'yes' / 'no' labels
test_predictions['Label'] = test_predictions['Label'].replace([1,0],['yes','no'])

In [None]:
label = test_predictions[['Label']]

In [None]:
label.to_csv('example_3.csv',index=False , header = False )