# Data Science and MLOps Bootcamp

<img src="https://i.ibb.co/5RM26Cw/LOGO-COLOR2.png" width="500px">

---
# Exercise 🌮🥤

## 📍 Objective
Solve the technical test for the Data Analyst position at the Brazilian startup [ifood](https://www.ifood.com.br/).
<br>This startup runs a food delivery service similar to Pedidos Ya, Rappi and Uber Eats.

## 📍 Context

### The company

Consider a well-established company operating in the retail food sector. It currently has several hundred thousand registered customers and serves almost one million consumers a year.
It sells products in 5 major categories: wines, rare meat products, exotic fruits, specially prepared fish and sweet products. These can be further divided into premium and regular products.

Customers can order and acquire products through 3 sales channels: physical stores, catalogs and the company's website. Globally, the company has had solid revenues and a healthy bottom line in the past 3 years, but the profit growth prospects for the next 3 years are not promising...

**For this reason, several strategic initiatives are being considered to reverse this situation. One is to improve the performance of marketing activities, with a special focus on marketing campaigns.**


### The Marketing Department

The marketing department was pressured to spend its annual budget more wisely. The CMO sees the importance of a more quantitative approach to decision making, so **a small team of data scientists was hired with a clear objective in mind: to build a solution that supports direct marketing initiatives.**
<br>Ideally, the success of these activities will demonstrate the value of the approach and convince the more skeptical people within the company.


### The team's objective

The team's objective is to build an analysis that maximizes the profit of the next marketing campaign, scheduled for next month. The new campaign, the sixth, aims to sell to a new customer database.

**To build the analysis, a pilot campaign involving 2,240 customers was carried out. The customers were selected at random and contacted by phone regarding the acquisition of the gadget. During the following months, the customers who bought the offer were properly labeled.**

The total cost of the sample campaign was 6,720 MU and the revenue generated by the customers who accepted the offer was 3,674 MU. Overall, the campaign had a profit of -3,046 MU. The success rate of the campaign was 15%.
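
As a quick sanity check, the campaign figures can be reproduced with a few lines of arithmetic. The sketch below assumes a revenue of 11 MU per accepted offer, which is the value shown in the `Z_Revenue` column of the dataset loaded later in this notebook; the variable names are ours and purely illustrative.

```python
# Figures quoted in the campaign description, in monetary units (MU)
n_customers = 2240
total_cost = 6720
total_revenue = 3674

# Assumption: 11 MU of revenue per accepted offer (value shown in Z_Revenue)
revenue_per_acceptance = 11

profit = total_revenue - total_cost
cost_per_contact = total_cost / n_customers
n_acceptances = total_revenue / revenue_per_acceptance
success_rate = n_acceptances / n_customers

print(f"profit: {profit} MU")
print(f"cost per contact: {cost_per_contact} MU")
print(f"acceptances: {n_acceptances:.0f}")
print(f"success rate: {success_rate:.1%}")
```

Under that assumption the cost per contact comes out at 3 MU, which matches the value shown in the `Z_CostContact` column, and the success rate rounds to the quoted 15%.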


## 📍 Considerations

- Replicate this notebook to solve the exercise.
- Consider the following stages: 1) Loading the data, 2) Data preparation, 3) Classification, 4) Regression and 5) Saving a model.

**You are free to decide:**
- How to prepare and condition the dataset.
- Which columns to add to or remove from the dataset.
- Which parameters to tune in the classification and regression models.


## 📍 Task

- Create a classification model using Random Forest for the `Response` column.
- Save the Random Forest classification model as `rf.pkl`.
- Create one model with linear regression and another with Random Forest + `GridSearchCV` to predict the `Income` column.
- Upload the project to GitHub / GitLab; use git and git-lfs for the `.csv` and `.pkl` files.
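
The saving step in the task list could look roughly like the sketch below. It uses the standard-library `pickle` module and toy data from `make_classification` in place of the prepared dataset; the temporary folder and the variable names are assumptions made for illustration only.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the prepared features and the `Response` target
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Write the fitted model to rf.pkl (a temporary folder is used here)
model_path = Path(tempfile.mkdtemp()) / "rf.pkl"
with open(model_path, "wb") as fh:
    pickle.dump(rf, fh)

# Load it back and confirm that it predicts the same labels
with open(model_path, "rb") as fh:
    rf_loaded = pickle.load(fh)

same_predictions = bool((rf_loaded.predict(X) == rf.predict(X)).all())
print(same_predictions)
```

A pickled model should be loaded with the same scikit-learn version that produced it, which is worth keeping in mind when the `.pkl` file is shared through git-lfs.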

---

## Data Preparation

In [1]:
import numpy as np
import pandas as pd

import funpymodeling

In [2]:
# Show every column when displaying DataFrames
pd.set_option('display.max_columns', None)

In [3]:
df_data = pd.read_csv(
    "data/marketing_campaign.csv",
    sep=';',
    # parse_dates=['Dt_Customer']
)
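
The commented-out `parse_dates` option is an alternative to converting `Dt_Customer` afterwards with `pd.to_datetime`. A minimal sketch on an in-memory sample follows; the two rows and the name `df_demo` are illustrative, and the real file is assumed to use the same `YYYY-MM-DD` layout that the later conversion cell expects.

```python
import io

import pandas as pd

# Two rows in the same semicolon-separated layout as the real file
csv_text = (
    "ID;Dt_Customer;Income\n"
    "5524;2012-09-04;58138\n"
    "2174;2014-03-08;46344\n"
)

# parse_dates converts the listed columns to datetime while reading
df_demo = pd.read_csv(io.StringIO(csv_text), sep=";", parse_dates=["Dt_Customer"])
print(df_demo.dtypes)
```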

In [4]:
print(df_data.columns)
print(df_data.shape)

Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
       'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
      dtype='object')
(2240, 29)


In [5]:
df_data

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,2014-02-10,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,2014-01-19,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,0,0,0,3,11,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,10870,1967,Graduation,Married,61223.0,0,1,2013-06-13,46,709,43,182,42,118,247,2,9,3,4,5,0,0,0,0,0,0,3,11,0
2236,4001,1946,PhD,Together,64014.0,2,1,2014-06-10,56,406,0,30,0,0,8,7,8,2,5,7,0,0,0,1,0,0,3,11,0
2237,7270,1981,Graduation,Divorced,56981.0,0,0,2014-01-25,91,908,48,217,32,12,24,1,2,3,13,6,0,1,0,0,0,0,3,11,0
2238,8235,1956,Master,Together,69245.0,0,1,2014-01-24,8,428,30,214,80,30,61,2,6,5,10,3,0,0,0,0,0,0,3,11,0


In [6]:
df_data["Dt_Customer"] = pd.to_datetime(df_data["Dt_Customer"], format="%Y-%m-%d")

In [7]:
df_data

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,2014-02-10,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,2014-01-19,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,0,0,0,3,11,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,10870,1967,Graduation,Married,61223.0,0,1,2013-06-13,46,709,43,182,42,118,247,2,9,3,4,5,0,0,0,0,0,0,3,11,0
2236,4001,1946,PhD,Together,64014.0,2,1,2014-06-10,56,406,0,30,0,0,8,7,8,2,5,7,0,0,0,1,0,0,3,11,0
2237,7270,1981,Graduation,Divorced,56981.0,0,0,2014-01-25,91,908,48,217,32,12,24,1,2,3,13,6,0,1,0,0,0,0,3,11,0
2238,8235,1956,Master,Together,69245.0,0,1,2014-01-24,8,428,30,214,80,30,61,2,6,5,10,3,0,0,0,0,0,0,3,11,0


In [8]:
funpymodeling.status(df_data)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,ID,0,0.0,1,0.000446,2240,int64
1,Year_Birth,0,0.0,0,0.0,59,int64
2,Education,0,0.0,0,0.0,5,object
3,Marital_Status,0,0.0,0,0.0,8,object
4,Income,24,0.010714,0,0.0,1974,float64
5,Kidhome,0,0.0,1293,0.577232,3,int64
6,Teenhome,0,0.0,1158,0.516964,3,int64
7,Dt_Customer,0,0.0,0,0.0,663,datetime64[ns]
8,Recency,0,0.0,28,0.0125,100,int64
9,MntWines,0,0.0,13,0.005804,776,int64


We drop the following columns, which are not relevant to the analysis:
* `ID`: customer id
* `Z_CostContact`: cost of contact for the sixth campaign
* `Z_Revenue`: revenue from the new gadget

In [9]:
labels = ['ID', 'Z_CostContact', 'Z_Revenue']
df_data = df_data.drop(labels=labels, axis=1)
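
In the rows displayed above, `Z_CostContact` and `Z_Revenue` each show a single value (3 and 11). Whether they are constant across the whole file is something to verify rather than assume; one possible check, sketched here on toy data with an illustrative `df_demo`, is to look for columns with a single distinct value.

```python
import pandas as pd

# Toy frame mimicking a few columns of the dataset
df_demo = pd.DataFrame({
    "ID": [5524, 2174, 4141],
    "Income": [58138.0, 46344.0, 71613.0],
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
})

# Columns with one distinct value carry no information for a model
constant_cols = [col for col in df_demo.columns if df_demo[col].nunique() == 1]
print(constant_cols)

# Drop the constant columns together with the identifier
df_demo = df_demo.drop(columns=constant_cols + ["ID"])
```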

In [10]:
df_data

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response
0,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,1
1,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,0
2,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,0
3,1984,Graduation,Together,26646.0,1,0,2014-02-10,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,0,0,0,0
4,1981,PhD,Married,58293.0,1,0,2014-01-19,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,1967,Graduation,Married,61223.0,0,1,2013-06-13,46,709,43,182,42,118,247,2,9,3,4,5,0,0,0,0,0,0,0
2236,1946,PhD,Together,64014.0,2,1,2014-06-10,56,406,0,30,0,0,8,7,8,2,5,7,0,0,0,1,0,0,0
2237,1981,Graduation,Divorced,56981.0,0,0,2014-01-25,91,908,48,217,32,12,24,1,2,3,13,6,0,1,0,0,0,0,0
2238,1956,Master,Together,69245.0,0,1,2014-01-24,8,428,30,214,80,30,61,2,6,5,10,3,0,0,0,0,0,0,0


We create a new variable `AcceptedAnyCmp`, which combines the variables `AcceptedCmp1`, `AcceptedCmp2`, `AcceptedCmp3`, `AcceptedCmp4` and `AcceptedCmp5`.

This variable indicates whether the customer accepted any of the 5 previous campaigns.

We also drop the columns we no longer need.

In [11]:
print("### AcceptedCmp1 ###")
display(df_data['AcceptedCmp1'].value_counts(dropna=False))
print()

print("### AcceptedCmp2 ###")
display(df_data['AcceptedCmp2'].value_counts(dropna=False))
print()

print("### AcceptedCmp3 ###")
display(df_data['AcceptedCmp3'].value_counts(dropna=False))
print()

print("### AcceptedCmp4 ###")
display(df_data['AcceptedCmp4'].value_counts(dropna=False))
print()

print("### AcceptedCmp5 ###")
display(df_data['AcceptedCmp5'].value_counts(dropna=False))

### AcceptedCmp1 ###


0    2096
1     144
Name: AcceptedCmp1, dtype: int64


### AcceptedCmp2 ###


0    2210
1      30
Name: AcceptedCmp2, dtype: int64


### AcceptedCmp3 ###


0    2077
1     163
Name: AcceptedCmp3, dtype: int64


### AcceptedCmp4 ###


0    2073
1     167
Name: AcceptedCmp4, dtype: int64


### AcceptedCmp5 ###


0    2077
1     163
Name: AcceptedCmp5, dtype: int64

In [12]:
cols = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5']

df_data["AcceptedAnyCmp"] = (df_data[cols] >= 1).any(axis=1)

In [13]:
print("### AcceptedAnyCmp ###")
display(df_data['AcceptedAnyCmp'].value_counts(dropna=False, normalize=False))
display(df_data['AcceptedAnyCmp'].value_counts(dropna=False, normalize=True))
print()

### AcceptedAnyCmp ###


False    1777
True      463
Name: AcceptedAnyCmp, dtype: int64

False    0.793304
True     0.206696
Name: AcceptedAnyCmp, dtype: float64




In [14]:
df_data['AcceptedAnyCmp'] = df_data['AcceptedAnyCmp'].astype(int)

In [15]:
labels = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5']
df_data = df_data.drop(labels=labels, axis=1)

In [16]:
df_data

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Complain,Response,AcceptedAnyCmp
0,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,1,0
1,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0
2,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0
3,1984,Graduation,Together,26646.0,1,0,2014-02-10,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0
4,1981,PhD,Married,58293.0,1,0,2014-01-19,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,1967,Graduation,Married,61223.0,0,1,2013-06-13,46,709,43,182,42,118,247,2,9,3,4,5,0,0,0
2236,1946,PhD,Together,64014.0,2,1,2014-06-10,56,406,0,30,0,0,8,7,8,2,5,7,0,0,1
2237,1981,Graduation,Divorced,56981.0,0,0,2014-01-25,91,908,48,217,32,12,24,1,2,3,13,6,0,0,1
2238,1956,Master,Together,69245.0,0,1,2014-01-24,8,428,30,214,80,30,61,2,6,5,10,3,0,0,0


We create a new variable `HasChildren`, which combines the variables `Kidhome` and `Teenhome`.

This variable indicates whether the customer has children at home.

We also drop the columns we no longer need.

In [17]:
print("### Kidhome ###")
display(df_data['Kidhome'].value_counts(dropna=False))
print()

print("### Teenhome ###")
display(df_data['Teenhome'].value_counts(dropna=False))
print()

### Kidhome ###


0    1293
1     899
2      48
Name: Kidhome, dtype: int64


### Teenhome ###


0    1158
1    1030
2      52
Name: Teenhome, dtype: int64




In [18]:
cols = ['Kidhome', 'Teenhome']

df_data['HasChildren'] = (df_data[cols] > 0).any(axis=1)

In [19]:
print("### HasChildren ###")
display(df_data['HasChildren'].value_counts(dropna=False, normalize=False))
display(df_data['HasChildren'].value_counts(dropna=False, normalize=True))
print()

### HasChildren ###


True     1602
False     638
Name: HasChildren, dtype: int64

True     0.715179
False    0.284821
Name: HasChildren, dtype: float64




In [20]:
df_data['HasChildren'] = df_data['HasChildren'].astype(int)

In [21]:
labels = ['Kidhome', 'Teenhome']
df_data = df_data.drop(labels=labels, axis=1)

In [22]:
df_data

Unnamed: 0,Year_Birth,Education,Marital_Status,Income,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Complain,Response,AcceptedAnyCmp,HasChildren
0,1957,Graduation,Single,58138.0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,1,0,0
1,1954,Graduation,Single,46344.0,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,1
2,1965,Graduation,Together,71613.0,2013-08-21,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0
3,1984,Graduation,Together,26646.0,2014-02-10,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,1
4,1981,PhD,Married,58293.0,2014-01-19,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,1967,Graduation,Married,61223.0,2013-06-13,46,709,43,182,42,118,247,2,9,3,4,5,0,0,0,1
2236,1946,PhD,Together,64014.0,2014-06-10,56,406,0,30,0,0,8,7,8,2,5,7,0,0,1,1
2237,1981,Graduation,Divorced,56981.0,2014-01-25,91,908,48,217,32,12,24,1,2,3,13,6,0,0,1,0
2238,1956,Master,Together,69245.0,2014-01-24,8,428,30,214,80,30,61,2,6,5,10,3,0,0,0,1


We create a new variable `Generation`.

This variable indicates the customer's generation based on the year of birth (column `Year_Birth`), following the ranges in this [link](https://www.telemadrid.es/noticias/sociedad/Generaciones-segun-ano-de-nacimiento-0-2470252960--20220719111500.html).

In [23]:
def set_generation(year_birth):
    if (year_birth >= 1930) and (year_birth <= 1948):
        return "Postguerra"
    elif (year_birth >= 1949) and (year_birth <= 1968):
        return "BabyBoomer"
    elif (year_birth >= 1969) and (year_birth <= 1980):
        return "X"
    elif (year_birth >= 1981) and (year_birth <= 1993):
        return "Millennials"
    elif (year_birth >= 1994) and (year_birth <= 2010):
        return "Z"
    elif year_birth >= 2011:
        return "Alfa"
    else:
        return "Other"

In [24]:
df_data["Generation"] = df_data['Year_Birth'].apply(set_generation)
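
The same year ranges can also be expressed with `pd.cut`, which assigns the labels in one vectorized step. This is only a sketch of an alternative to `set_generation`, run on a handful of illustrative years; the names `bins`, `labels` and `years` are ours.

```python
import numpy as np
import pandas as pd

# Same year boundaries as set_generation, written as right-inclusive bins:
# (1929, 1948], (1948, 1968], (1968, 1980], (1980, 1993], (1993, 2010], (2010, inf)
bins = [1929, 1948, 1968, 1980, 1993, 2010, np.inf]
labels = ["Postguerra", "BabyBoomer", "X", "Millennials", "Z", "Alfa"]

years = pd.Series([1893, 1948, 1949, 1981, 1996])

generation = (
    pd.cut(years, bins=bins, labels=labels)
    .cat.add_categories("Other")
    .fillna("Other")  # years before 1930 fall outside every bin
    .astype(str)
)
print(generation.tolist())
```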

We create two new variables: `Age`, from the `Year_Birth` column and the current year, and `ClientDays`, from the `Dt_Customer` column and today's date.

We also drop the columns we no longer need.

In [25]:
today = pd.to_datetime('today').normalize()

In [26]:
today

Timestamp('2023-04-25 00:00:00')

In [27]:
df_data["Age"] = today.year - df_data["Year_Birth"]

In [28]:
df_data["ClientDays"] = (today - df_data["Dt_Customer"]).dt.days

In [29]:
labels = ['Year_Birth', 'Dt_Customer']
df_data = df_data.drop(labels=labels, axis=1)
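
Because `today` changes on every run, `Age` and `ClientDays` shift each time the notebook is executed. One way to keep the results reproducible is to fix the reference date. The sketch below assumes the date shown in the output above (2023-04-25) and uses two customers from the table; `reference_date` and `df_demo` are illustrative names.

```python
import pandas as pd

# Assumption: freeze the reference date at the value shown in the notebook output
reference_date = pd.Timestamp("2023-04-25")

df_demo = pd.DataFrame({
    "Year_Birth": [1957, 1954],
    "Dt_Customer": pd.to_datetime(["2012-09-04", "2014-03-08"]),
})

# Age in whole years and tenure in days, both relative to the fixed date
df_demo["Age"] = reference_date.year - df_demo["Year_Birth"]
df_demo["ClientDays"] = (reference_date - df_demo["Dt_Customer"]).dt.days
print(df_demo[["Age", "ClientDays"]])
```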

In [30]:
df_data

Unnamed: 0,Education,Marital_Status,Income,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Complain,Response,AcceptedAnyCmp,HasChildren,Generation,Age,ClientDays
0,Graduation,Single,58138.0,58,635,88,546,172,88,88,3,8,10,4,7,0,1,0,0,BabyBoomer,66,3885
1,Graduation,Single,46344.0,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,1,BabyBoomer,69,3335
2,Graduation,Together,71613.0,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,BabyBoomer,58,3534
3,Graduation,Together,26646.0,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,1,Millennials,39,3361
4,PhD,Married,58293.0,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,1,Millennials,42,3383
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,Graduation,Married,61223.0,46,709,43,182,42,118,247,2,9,3,4,5,0,0,0,1,BabyBoomer,56,3603
2236,PhD,Together,64014.0,56,406,0,30,0,0,8,7,8,2,5,7,0,0,1,1,Postguerra,77,3241
2237,Graduation,Divorced,56981.0,91,908,48,217,32,12,24,1,2,3,13,6,0,0,1,0,Millennials,42,3377
2238,Master,Together,69245.0,8,428,30,214,80,30,61,2,6,5,10,3,0,0,0,1,BabyBoomer,67,3378


We drop the rows that fall above the 99th percentile of the `Age` column. The cells below use 78 as the cutoff.

In [31]:
funpymodeling.profiling_num(df_data)

Unnamed: 0,variable,mean,std_dev,variation_coef,p_0.01,p_0.05,p_0.25,p_0.5,p_0.75,p_0.95,p_0.99
0,Income,52247.251354,25173.076661,0.481807,7579.2,18985.5,35303.0,51381.5,68522.0,84130.0,94458.8
1,Recency,49.109375,28.962453,0.589754,0.0,4.0,24.0,49.0,74.0,94.0,98.0
2,MntWines,303.935714,336.597393,1.107462,1.0,3.0,23.75,173.5,504.25,1000.0,1285.0
3,MntFruits,26.302232,39.773434,1.51217,0.0,0.0,1.0,8.0,33.0,123.0,172.0
4,MntMeatProducts,166.95,225.715373,1.351994,2.0,4.0,16.0,67.0,232.0,687.1,915.0
5,MntFishProducts,37.525446,54.628979,1.455785,0.0,0.0,3.0,12.0,50.0,168.05,226.22
6,MntSweetProducts,27.062946,41.280498,1.525351,0.0,0.0,1.0,8.0,33.0,126.0,177.22
7,MntGoldProds,44.021875,52.167439,1.185034,0.0,1.0,9.0,24.0,56.0,165.05,227.0
8,NumDealsPurchases,2.325,1.932238,0.83107,0.0,1.0,1.0,2.0,3.0,6.0,10.0
9,NumWebPurchases,4.084821,2.778714,0.680254,0.0,1.0,2.0,4.0,6.0,9.0,11.0


In [32]:
df_data[df_data['Age'] > 78]['Age'].value_counts(dropna=False, normalize=False)

80     7
79     7
123    1
130    1
124    1
82     1
83     1
Name: Age, dtype: int64

In [33]:
df_data = df_data[df_data['Age'] <= 78]
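
Instead of hard-coding the cutoff, it can be computed from the data with `Series.quantile`. The sketch below runs on toy ages from 1 to 100, so the numbers are not those of the real dataset; `df_demo`, `cutoff` and `df_trimmed` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({"Age": list(range(1, 101))})  # toy ages 1..100

# 99th percentile, using pandas' default linear interpolation
cutoff = df_demo["Age"].quantile(0.99)

# Keep the rows at or below the cutoff; .copy() gives an independent frame
df_trimmed = df_demo[df_demo["Age"] <= cutoff].copy()
print(cutoff, len(df_trimmed))
```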

In [34]:
df_data

Unnamed: 0,Education,Marital_Status,Income,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Complain,Response,AcceptedAnyCmp,HasChildren,Generation,Age,ClientDays
0,Graduation,Single,58138.0,58,635,88,546,172,88,88,3,8,10,4,7,0,1,0,0,BabyBoomer,66,3885
1,Graduation,Single,46344.0,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,1,BabyBoomer,69,3335
2,Graduation,Together,71613.0,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,BabyBoomer,58,3534
3,Graduation,Together,26646.0,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,1,Millennials,39,3361
4,PhD,Married,58293.0,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,1,Millennials,42,3383
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,Graduation,Married,61223.0,46,709,43,182,42,118,247,2,9,3,4,5,0,0,0,1,BabyBoomer,56,3603
2236,PhD,Together,64014.0,56,406,0,30,0,0,8,7,8,2,5,7,0,0,1,1,Postguerra,77,3241
2237,Graduation,Divorced,56981.0,91,908,48,217,32,12,24,1,2,3,13,6,0,0,1,0,Millennials,42,3377
2238,Master,Together,69245.0,8,428,30,214,80,30,61,2,6,5,10,3,0,0,0,1,BabyBoomer,67,3378


In [35]:
funpymodeling.status(df_data)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,Education,0,0.0,0,0.0,5,object
1,Marital_Status,0,0.0,0,0.0,8,object
2,Income,23,0.010356,0,0.0,1957,float64
3,Recency,0,0.0,28,0.012607,100,int64
4,MntWines,0,0.0,13,0.005853,770,int64
5,MntFruits,0,0.0,397,0.178748,158,int64
6,MntMeatProducts,0,0.0,1,0.00045,551,int64
7,MntFishProducts,0,0.0,383,0.172445,181,int64
8,MntSweetProducts,0,0.0,412,0.185502,177,int64
9,MntGoldProds,0,0.0,60,0.027015,213,int64


We create a new variable `AmountSpend` from the columns `MntWines`, `MntFruits`, `MntMeatProducts`, `MntFishProducts`, `MntSweetProducts` and `MntGoldProds`.

We also drop the columns we no longer need.

In [36]:
df_data['AmountSpend'] = df_data['MntWines'] + \
                         df_data['MntFruits'] + \
                         df_data['MntMeatProducts'] + \
                         df_data['MntFishProducts'] + \
                         df_data['MntSweetProducts'] + \
                         df_data['MntGoldProds']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_data['AmountSpend'] = df_data['MntWines'] + \
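
The warning above appears because `df_data` was produced by a boolean filter in an earlier cell, so pandas cannot tell whether the assignment targets a copy. A common way to avoid it is to call `.copy()` right after filtering. The sketch below shows this on toy rows and also sums the `Mnt*` columns with `sum(axis=1)`; `df_demo`, `df_kept` and `mnt_cols` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Age": [66, 69, 130],
    "MntWines": [635, 11, 10],
    "MntFruits": [88, 1, 0],
    "MntMeatProducts": [546, 6, 5],
    "MntFishProducts": [172, 2, 0],
    "MntSweetProducts": [88, 1, 0],
    "MntGoldProds": [88, 6, 1],
})

# .copy() makes the filtered frame independent of the original one
df_kept = df_demo[df_demo["Age"] <= 78].copy()

# Row-wise total over every spending column
mnt_cols = [col for col in df_kept.columns if col.startswith("Mnt")]
df_kept["AmountSpend"] = df_kept[mnt_cols].sum(axis=1)
print(df_kept["AmountSpend"].tolist())
```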


In [37]:
labels = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
df_data = df_data.drop(labels=labels, axis=1)

In [38]:
df_data

Unnamed: 0,Education,Marital_Status,Income,Recency,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Complain,Response,AcceptedAnyCmp,HasChildren,Generation,Age,ClientDays,AmountSpend
0,Graduation,Single,58138.0,58,3,8,10,4,7,0,1,0,0,BabyBoomer,66,3885,1617
1,Graduation,Single,46344.0,38,2,1,1,2,5,0,0,0,1,BabyBoomer,69,3335,27
2,Graduation,Together,71613.0,26,1,8,2,10,4,0,0,0,0,BabyBoomer,58,3534,776
3,Graduation,Together,26646.0,26,2,2,0,4,6,0,0,0,1,Millennials,39,3361,53
4,PhD,Married,58293.0,94,5,5,3,6,5,0,0,0,1,Millennials,42,3383,422
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,Graduation,Married,61223.0,46,2,9,3,4,5,0,0,0,1,BabyBoomer,56,3603,1341
2236,PhD,Together,64014.0,56,7,8,2,5,7,0,0,1,1,Postguerra,77,3241,444
2237,Graduation,Divorced,56981.0,91,1,2,3,13,6,0,0,1,0,Millennials,42,3377,1241
2238,Master,Together,69245.0,8,2,6,5,10,3,0,0,0,1,BabyBoomer,67,3378,843


We create a new variable `NumPurchases` from the columns `NumDealsPurchases`, `NumWebPurchases`, `NumCatalogPurchases` and `NumStorePurchases`.

We also drop the columns we no longer need.

In [39]:
df_data['NumPurchases'] = df_data['NumDealsPurchases'] + \
                          df_data['NumWebPurchases'] + \
                          df_data['NumCatalogPurchases'] + \
                          df_data['NumStorePurchases']

In [40]:
labels = ['NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']
df_data = df_data.drop(labels=labels, axis=1)

In [41]:
df_data

Unnamed: 0,Education,Marital_Status,Income,Recency,NumWebVisitsMonth,Complain,Response,AcceptedAnyCmp,HasChildren,Generation,Age,ClientDays,AmountSpend,NumPurchases
0,Graduation,Single,58138.0,58,7,0,1,0,0,BabyBoomer,66,3885,1617,25
1,Graduation,Single,46344.0,38,5,0,0,0,1,BabyBoomer,69,3335,27,6
2,Graduation,Together,71613.0,26,4,0,0,0,0,BabyBoomer,58,3534,776,21
3,Graduation,Together,26646.0,26,6,0,0,0,1,Millennials,39,3361,53,8
4,PhD,Married,58293.0,94,5,0,0,0,1,Millennials,42,3383,422,19
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,Graduation,Married,61223.0,46,5,0,0,0,1,BabyBoomer,56,3603,1341,18
2236,PhD,Together,64014.0,56,7,0,0,1,1,Postguerra,77,3241,444,22
2237,Graduation,Divorced,56981.0,91,6,0,0,1,0,Millennials,42,3377,1241,19
2238,Master,Together,69245.0,8,3,0,0,0,1,BabyBoomer,67,3378,843,23


We create a new variable `NewMaritalStatus` from the `Marital_Status` column.

To do so, we group the values `Married` and `Together` of the `Marital_Status` column into the new value `Couple`. We also group the values `Alone`, `Absurd` and `YOLO` of the same column into the new value `Other`.

We also drop the column we no longer need.

In [42]:
funpymodeling.freq_tbl(df_data['Marital_Status'])

Unnamed: 0,Marital_Status,frequency,percentage,cumulative_perc
0,Married,857,0.385862,0.385862
1,Together,579,0.260693,0.646556
2,Single,476,0.214318,0.860873
3,Divorced,228,0.102656,0.96353
4,Widow,74,0.033318,0.996848
5,Alone,3,0.001351,0.998199
6,Absurd,2,0.0009,0.9991
7,YOLO,2,0.0009,1.0


In [43]:
marital_status_mapper = {
    'Married': 'Couple',
    'Together': 'Couple',
    'Alone': 'Other',
    'Absurd': 'Other',
    'YOLO': 'Other',
}

df_data['NewMaritalStatus'] = df_data['Marital_Status'].apply(lambda x: marital_status_mapper.get(x, x))

In [44]:
labels = ['Marital_Status']
df_data = df_data.drop(labels=labels, axis=1)

In [45]:
df_data

Unnamed: 0,Education,Income,Recency,NumWebVisitsMonth,Complain,Response,AcceptedAnyCmp,HasChildren,Generation,Age,ClientDays,AmountSpend,NumPurchases,NewMaritalStatus
0,Graduation,58138.0,58,7,0,1,0,0,BabyBoomer,66,3885,1617,25,Single
1,Graduation,46344.0,38,5,0,0,0,1,BabyBoomer,69,3335,27,6,Single
2,Graduation,71613.0,26,4,0,0,0,0,BabyBoomer,58,3534,776,21,Couple
3,Graduation,26646.0,26,6,0,0,0,1,Millennials,39,3361,53,8,Couple
4,PhD,58293.0,94,5,0,0,0,1,Millennials,42,3383,422,19,Couple
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,Graduation,61223.0,46,5,0,0,0,1,BabyBoomer,56,3603,1341,18,Couple
2236,PhD,64014.0,56,7,0,0,1,1,Postguerra,77,3241,444,22,Couple
2237,Graduation,56981.0,91,6,0,0,1,0,Millennials,42,3377,1241,19,Divorced
2238,Master,69245.0,8,3,0,0,0,1,BabyBoomer,67,3378,843,23,Couple


In [46]:
funpymodeling.freq_tbl(df_data['Education'])

Unnamed: 0,Education,frequency,percentage,cumulative_perc
0,Graduation,1124,0.506078,0.506078
1,PhD,477,0.214768,0.720846
2,Master,365,0.16434,0.885187
3,2n Cycle,201,0.0905,0.975687
4,Basic,54,0.024313,1.0


In [47]:
funpymodeling.status(df_data)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,Education,0,0.0,0,0.0,5,object
1,Income,23,0.010356,0,0.0,1957,float64
2,Recency,0,0.0,28,0.012607,100,int64
3,NumWebVisitsMonth,0,0.0,11,0.004953,16,int64
4,Complain,0,0.0,2202,0.991445,2,int64
5,Response,0,0.0,1891,0.851418,2,int64
6,AcceptedAnyCmp,0,0.0,1768,0.796038,2,int64
7,HasChildren,0,0.0,621,0.279604,2,int64
8,Generation,0,0.0,0,0.0,5,object
9,Age,0,0.0,0,0.0,52,int64


We see that the `Income` column has 23 null values.

Since all of these rows belong to the majority class of the target column `Response` (class 0), we drop them from the dataset.

In [48]:
display(df_data['Response'].value_counts(dropna=False, normalize=False))
display(df_data['Response'].value_counts(dropna=False, normalize=True))

0    1891
1     330
Name: Response, dtype: int64

0    0.851418
1    0.148582
Name: Response, dtype: float64

In [49]:
display(
    df_data[
        pd.isna(df_data['Income'])
    ]['Response'].value_counts(dropna=False, normalize=False)
)

0    23
Name: Response, dtype: int64

In [50]:
df_data.dropna(inplace=True)
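
A call to `dropna()` without arguments removes every row that has a null in any column. When only one column is known to contain nulls, passing `subset` makes the intent explicit. A minimal sketch on toy rows follows; `df_demo` and `df_clean` are illustrative names.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0],
    "Response": [1, 0, 0],
})

# Count the nulls first, then drop only the rows where Income is missing
n_missing = int(df_demo["Income"].isna().sum())
df_clean = df_demo.dropna(subset=["Income"])
print(n_missing, len(df_clean))
```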

In [51]:
df_data

Unnamed: 0,Education,Income,Recency,NumWebVisitsMonth,Complain,Response,AcceptedAnyCmp,HasChildren,Generation,Age,ClientDays,AmountSpend,NumPurchases,NewMaritalStatus
0,Graduation,58138.0,58,7,0,1,0,0,BabyBoomer,66,3885,1617,25,Single
1,Graduation,46344.0,38,5,0,0,0,1,BabyBoomer,69,3335,27,6,Single
2,Graduation,71613.0,26,4,0,0,0,0,BabyBoomer,58,3534,776,21,Couple
3,Graduation,26646.0,26,6,0,0,0,1,Millennials,39,3361,53,8,Couple
4,PhD,58293.0,94,5,0,0,0,1,Millennials,42,3383,422,19,Couple
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,Graduation,61223.0,46,5,0,0,0,1,BabyBoomer,56,3603,1341,18,Couple
2236,PhD,64014.0,56,7,0,0,1,1,Postguerra,77,3241,444,22,Couple
2237,Graduation,56981.0,91,6,0,0,1,0,Millennials,42,3377,1241,19,Divorced
2238,Master,69245.0,8,3,0,0,0,1,BabyBoomer,67,3378,843,23,Couple


In [52]:
funpymodeling.status(df_data)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,Education,0,0.0,0,0.0,5,object
1,Income,0,0.0,0,0.0,1957,float64
2,Recency,0,0.0,28,0.012739,100,int64
3,NumWebVisitsMonth,0,0.0,10,0.00455,16,int64
4,Complain,0,0.0,2179,0.991356,2,int64
5,Response,0,0.0,1868,0.849864,2,int64
6,AcceptedAnyCmp,0,0.0,1748,0.795268,2,int64
7,HasChildren,0,0.0,617,0.28071,2,int64
8,Generation,0,0.0,0,0.0,5,object
9,Age,0,0.0,0,0.0,52,int64


In [53]:
display(df_data['Response'].value_counts(dropna=False, normalize=False))
display(df_data['Response'].value_counts(dropna=False, normalize=True))

0    1868
1     330
Name: Response, dtype: int64

0    0.849864
1    0.150136
Name: Response, dtype: float64

We apply one-hot encoding to the following categorical variables:
* `Education`
* `Generation`
* `NewMaritalStatus`

In [56]:
funpymodeling.status(df_data)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,Education,0,0.0,0,0.0,5,object
1,Income,0,0.0,0,0.0,1957,float64
2,Recency,0,0.0,28,0.012739,100,int64
3,NumWebVisitsMonth,0,0.0,10,0.00455,16,int64
4,Complain,0,0.0,2179,0.991356,2,int64
5,Response,0,0.0,1868,0.849864,2,int64
6,AcceptedAnyCmp,0,0.0,1748,0.795268,2,int64
7,HasChildren,0,0.0,617,0.28071,2,int64
8,Generation,0,0.0,0,0.0,5,object
9,Age,0,0.0,0,0.0,52,int64


In [57]:
columns = ['Education', 'Generation', 'NewMaritalStatus']
df_data = pd.get_dummies(df_data, columns=columns)
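
Keeping every dummy column is fine for tree-based models. For the linear regression on `Income`, the full set of dummies for a variable is perfectly collinear with the intercept, so one option is to drop one level per variable with `drop_first=True`. This is a sketch of that alternative on toy rows, not what the notebook does; `df_demo` and `encoded` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Income": [58138.0, 58293.0, 69245.0],
    "Education": ["Graduation", "PhD", "Master"],
})

# drop_first removes the first level (alphabetically) of each encoded variable
encoded = pd.get_dummies(df_demo, columns=["Education"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```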

In [58]:
df_data

Unnamed: 0,Income,Recency,NumWebVisitsMonth,Complain,Response,AcceptedAnyCmp,HasChildren,Age,ClientDays,AmountSpend,NumPurchases,Education_2n Cycle,Education_Basic,Education_Graduation,Education_Master,Education_PhD,Generation_BabyBoomer,Generation_Millennials,Generation_Postguerra,Generation_X,Generation_Z,NewMaritalStatus_Couple,NewMaritalStatus_Divorced,NewMaritalStatus_Other,NewMaritalStatus_Single,NewMaritalStatus_Widow
0,58138.0,58,7,0,1,0,0,66,3885,1617,25,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0
1,46344.0,38,5,0,0,0,1,69,3335,27,6,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0
2,71613.0,26,4,0,0,0,0,58,3534,776,21,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0
3,26646.0,26,6,0,0,0,1,39,3361,53,8,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0
4,58293.0,94,5,0,0,0,1,42,3383,422,19,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,61223.0,46,5,0,0,0,1,56,3603,1341,18,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0
2236,64014.0,56,7,0,0,1,1,77,3241,444,22,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0
2237,56981.0,91,6,0,0,1,0,42,3377,1241,19,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0
2238,69245.0,8,3,0,0,0,1,67,3378,843,23,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0


In [59]:
funpymodeling.status(df_data)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,Income,0,0.0,0,0.0,1957,float64
1,Recency,0,0.0,28,0.012739,100,int64
2,NumWebVisitsMonth,0,0.0,10,0.00455,16,int64
3,Complain,0,0.0,2179,0.991356,2,int64
4,Response,0,0.0,1868,0.849864,2,int64
5,AcceptedAnyCmp,0,0.0,1748,0.795268,2,int64
6,HasChildren,0,0.0,617,0.28071,2,int64
7,Age,0,0.0,0,0.0,52,int64
8,ClientDays,0,0.0,0,0.0,662,int64
9,AmountSpend,0,0.0,0,0.0,1038,int64


We save the curated dataset.

In [60]:
df_data.to_csv("data/new_marketing_campaign.csv", index=False)

In [None]:
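# Assumption: df_data_1 is a working copy of the curated df_data
# (it is not defined anywhere earlier in this notebook)
df_data_1 = df_data.copy()

income_cat = pd.qcut(df_data_1['Income'], q=10)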

In [None]:
income_cat

In [None]:
df_data_1['Income_Cat'] = pd.qcut(df_data_1['Income'], q=10)
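
`pd.qcut` splits a numeric column into bins that hold roughly the same number of rows, so `q=10` produces deciles. The sketch below illustrates this on toy incomes from 1 to 100; `income` and `counts` are illustrative names.

```python
import pandas as pd

income = pd.Series(range(1, 101), name="Income")  # toy incomes 1..100

# Ten quantile-based bins; each one is an interval category
income_cat = pd.qcut(income, q=10)

# Rows per bin, in bin order
counts = income_cat.value_counts(sort=False)
print(counts)
```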

We drop the original variable.

In [None]:
labels = ['Income']
df_data_1 = df_data_1.drop(labels=labels, axis=1)

In [None]:
df_data_1

We apply one-hot encoding to the categorical variables.

In [None]:
funpymodeling.status(df_data_1)

In [None]:
df_data_1 = pd.get_dummies(df_data_1)

In [None]:
df_data_1

In [None]:
funpymodeling.status(df_data_1)

We prepare the data that will be used for training.

In [None]:
data_x = df_data_1.drop('Response', axis=1)
data_y = df_data_1['Response']

In [None]:
data_x = data_x.values
data_y = data_y.values

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split the dataset into train and test sets
x_train, x_test, y_train, y_test = train_test_split(
    data_x, data_y, test_size=0.3,
)
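
The split above is neither seeded nor stratified, so the metrics can change from run to run. Since `Response` is imbalanced (about 85% zeros according to the status table), one option is sketched below on synthetic data. The arrays `X_demo` and `y_demo` and the seed value are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target (about 15% positives), illustrative only
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1_000, 5))
y_demo = (rng.random(1_000) < 0.15).astype(int)

# stratify keeps the class ratio in both splits; random_state fixes the shuffle
X_tr, X_ts, y_tr, y_ts = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_ts.mean(), 3))
```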

Creating the `RandomForestClassifier` classification model

In [None]:
from sklearn.ensemble import RandomForestClassifier 

In [None]:
# Build a forest of 1000 decision trees
rf = RandomForestClassifier(n_estimators=1000, random_state=99)

In [None]:
%%time

rf.fit(x_train, y_train)

In [None]:
# On training (by default, 0.5 is assumed as the cutoff)
pred_tr = rf.predict(x_train)

In [None]:
# On testing (by default, 0.5 is assumed as the cutoff)
pred_ts = rf.predict(x_test)
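
The 0.5 cutoff mentioned in the comments can be checked directly. The sketch below uses synthetic data (`X_demo`, `y_demo` and `clf` are illustrative names, not the campaign model): `predict` returns the class with the highest averaged probability, which for a binary problem amounts to a 0.5 cutoff on the positive-class probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem, illustrative only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)  # shape (n_samples, 2), columns follow clf.classes_
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

# predict() picks the class with the highest averaged probability
same = np.array_equal(clf.predict(X_demo), labels_from_proba)
print(same)
```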

In [None]:
from pprint import pprint

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
train_metrics = {
    'accuracy': accuracy_score(y_train, pred_tr, normalize=True),
    'precision': precision_score(y_train, pred_tr),
    'recall': recall_score(y_train, pred_tr),
    'f1_score': f1_score(y_train, pred_tr),
}

pprint(train_metrics)

In [None]:
test_metrics = {
    'accuracy': accuracy_score(y_test, pred_ts, normalize=True),
    'precision': precision_score(y_test, pred_ts),
    'recall': recall_score(y_test, pred_ts),
    'f1_score': f1_score(y_test, pred_ts),
}

pprint(test_metrics)
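
To make the metrics above concrete, here is a worked example that derives precision, recall and F1 from the raw counts. The labels and predictions are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels and predictions, for illustration only
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

Accuracy is omitted on purpose: with roughly 85% negatives, a model that always predicts 0 would already reach about 0.85 accuracy.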

In [None]:
import seaborn as sns

from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
sns.set(font_scale=1.5)  # Adjust font size (global setting)

conf_mat1 = pd.crosstab(
    index=y_train,    # rows    = actual value
    columns=pred_tr,  # columns = predicted value
    rownames=['Actual'], 
    colnames=['Pred'], 
    normalize='index'
)

sns.heatmap(conf_mat1, annot=True, cmap='Blues', fmt='g')

In [None]:
sns.set(font_scale=1.5)  # Adjust font size (global setting)

conf_mat2 = pd.crosstab(
    index=y_test,     # rows    = actual value
    columns=pred_ts,  # columns = predicted value
    rownames=['Actual'], 
    colnames=['Pred'], 
    normalize='index'
)

sns.heatmap(conf_mat2, annot=True, cmap='Blues', fmt='g')
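
The row-normalized crosstab used above can be cross-checked against scikit-learn. This sketch uses hypothetical labels and shows that `normalize='index'` in `pd.crosstab` matches `normalize='true'` in `confusion_matrix`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# normalize='index' divides each row by its total, so rows sum to 1
by_crosstab = pd.crosstab(
    index=y_true, columns=y_pred,
    rownames=['Actual'], colnames=['Pred'],
    normalize='index',
)

# normalize='true' in scikit-learn is the same row-wise normalization
by_sklearn = confusion_matrix(y_true, y_pred, normalize='true')

print(by_crosstab)
```

With this normalization the diagonal reads as the true-negative rate (top left) and the recall (bottom right).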

In [None]:
# On training
pred_prob_tr = rf.predict_proba(x_train)

In [None]:
pred_prob_tr

In [None]:
pred_prob_tr = pred_prob_tr[:,1]

In [None]:
pred_prob_tr

In [None]:
pred_prob_tr.mean()

In [None]:
# On testing
pred_prob_ts = rf.predict_proba(x_test)

In [None]:
pred_prob_ts

In [None]:
pred_prob_ts = pred_prob_ts[:,1]

In [None]:
pred_prob_ts

In [None]:
pred_prob_ts.mean()

We set the new cutoff at 0.15.

In [None]:
pred_tr_1 = np.where(pred_prob_tr > 0.15, 1, 0)

In [None]:
pred_ts_1 = np.where(pred_prob_ts > 0.15, 1, 0)

In [None]:
train_metrics = {
    'accuracy': accuracy_score(y_train, pred_tr_1, normalize=True),
    'precision': precision_score(y_train, pred_tr_1),
    'recall': recall_score(y_train, pred_tr_1),
    'f1_score': f1_score(y_train, pred_tr_1),
}

pprint(train_metrics)

In [None]:
test_metrics = {
    'accuracy': accuracy_score(y_test, pred_ts_1, normalize=True),
    'precision': precision_score(y_test, pred_ts_1),
    'recall': recall_score(y_test, pred_ts_1),
    'f1_score': f1_score(y_test, pred_ts_1),
}

pprint(test_metrics)

In [None]:
sns.set(font_scale=1.5)  # Adjust font size (global setting)

conf_mat3 = pd.crosstab(
    index=y_train,      # rows    = actual value
    columns=pred_tr_1,  # columns = predicted value
    rownames=['Actual'], 
    colnames=['Pred'], 
    normalize='index'
)

sns.heatmap(conf_mat3, annot=True, cmap='Blues', fmt='g')

In [None]:
sns.set(font_scale=1.5)  # Adjust font size (global setting)

conf_mat4 = pd.crosstab(
    index=y_test,       # rows    = actual value
    columns=pred_ts_1,  # columns = predicted value
    rownames=['Actual'], 
    colnames=['Pred'], 
    normalize='index'
)

sns.heatmap(conf_mat4, annot=True, cmap='Blues', fmt='g')

Now let's look for the **threshold** using the `yellowbrick` library.

In [None]:
!pip3 install yellowbrick

In [None]:
from yellowbrick.classifier import DiscriminationThreshold

In [None]:
visualizer = DiscriminationThreshold(rf)

In [None]:
visualizer.fit(x_train, y_train)  # Fit the data to the visualizer
visualizer.show()                 # Show the figure
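
A numeric alternative to reading the cutoff off the plot, in the same spirit but not a reimplementation of the yellowbrick visualizer: scan the thresholds returned by `precision_recall_curve` on held-out data and keep the one that maximizes F1. The data, the model settings and the choice of F1 as the criterion are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, illustrative only
X_demo, y_demo = make_classification(
    n_samples=1_000, n_features=10, weights=[0.85], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba_val = clf.predict_proba(X_val)[:, 1]

# precision/recall have one more entry than thresholds; drop the final point
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against 0/0

best_threshold = thresholds[np.argmax(f1)]
print(f"threshold maximizing F1 on held-out data: {best_threshold:.2f}")
```

Choosing the cutoff on data the model was not trained on matters here, because a random forest scores its own training rows almost perfectly.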

We set the new cutoff at 0.25.

In [None]:
pred_tr_2 = np.where(pred_prob_tr >= 0.25, 1, 0)

In [None]:
pred_ts_2 = np.where(pred_prob_ts >= 0.25, 1, 0)

In [None]:
train_metrics = {
    'accuracy': accuracy_score(y_train, pred_tr_2, normalize=True),
    'precision': precision_score(y_train, pred_tr_2),
    'recall': recall_score(y_train, pred_tr_2),
    'f1_score': f1_score(y_train, pred_tr_2),
}

pprint(train_metrics)

In [None]:
test_metrics = {
    'accuracy': accuracy_score(y_test, pred_ts_2, normalize=True),
    'precision': precision_score(y_test, pred_ts_2),
    'recall': recall_score(y_test, pred_ts_2),
    'f1_score': f1_score(y_test, pred_ts_2),
}

pprint(test_metrics)

In [None]:
sns.set(font_scale=1.5)  # Adjust font size (global setting)

conf_mat5 = pd.crosstab(
    index=y_train,      # rows    = actual value
    columns=pred_tr_2,  # columns = predicted value
    rownames=['Actual'], 
    colnames=['Pred'], 
    normalize='index'
)

sns.heatmap(conf_mat5, annot=True, cmap='Blues', fmt='g')

In [None]:
sns.set(font_scale=1.5)  # Adjust font size (global setting)

conf_mat6 = pd.crosstab(
    index=y_test,       # rows    = actual value
    columns=pred_ts_2,  # columns = predicted value
    rownames=['Actual'], 
    colnames=['Pred'], 
    normalize='index'
)

sns.heatmap(conf_mat6, annot=True, cmap='Blues', fmt='g')

We compare the metrics across the different cutoffs.

In [None]:
data_metrics_train = {
    'metric': ['accuracy', 'precision', 'recall', 'f1_score'],
    'threshold_0.5': [0.9928478543563068, 0.995475113122172, 0.9565217391304348, 0.975609756097561],
    'threshold_0.15': [0.9609882964889467, 0.7931034482758621, 1.0, 0.8846153846153846],
    'threshold_0.25': [0.9902470741222367, 0.9387755102040817, 1.0, 0.968421052631579],
}

data_metrics_test = {
    'metric': ['accuracy', 'precision', 'recall', 'f1_score'],
    'threshold_0.5': [0.8772727272727273, 0.7209302325581395, 0.31, 0.43356643356643354],
    'threshold_0.15': [0.7545454545454545, 0.3652173913043478, 0.84, 0.509090909090909],
    'threshold_0.25': [0.8363636363636363, 0.4696969696969697, 0.62, 0.5344827586206896],
}

# Creates train metrics pandas DataFrame.
print("Train Metrics")
display(pd.DataFrame(data_metrics_train))

print()

# Creates test metrics pandas DataFrame.
print("Test Metrics")
display(pd.DataFrame(data_metrics_test))
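
The values in the tables above were pasted in by hand from one particular run. As a sketch of how the same comparison could be computed instead, the hypothetical helper `metrics_by_threshold` below builds one column per cutoff. It applies `>=` for every cutoff, which is an assumption: the cells above use `>` for 0.15 and `>=` for 0.25. The labels and probabilities are made up.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def metrics_by_threshold(y_true, proba, thresholds):
    """Return one column of metrics per cutoff (hypothetical helper)."""
    table = {'metric': ['accuracy', 'precision', 'recall', 'f1_score']}
    for t in thresholds:
        pred = (proba >= t).astype(int)  # same rule for every cutoff
        table[f'threshold_{t}'] = [
            accuracy_score(y_true, pred),
            precision_score(y_true, pred, zero_division=0),
            recall_score(y_true, pred, zero_division=0),
            f1_score(y_true, pred, zero_division=0),
        ]
    return pd.DataFrame(table)

# Hypothetical labels and probabilities, for illustration only
y_demo = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
proba_demo = np.array([0.05, 0.20, 0.90, 0.10, 0.30, 0.40, 0.02, 0.60, 0.16, 0.00])

metrics_table = metrics_by_threshold(y_demo, proba_demo, [0.5, 0.15, 0.25])
print(metrics_table)
```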

In [None]:
import matplotlib.pyplot as plt

In [None]:
from sklearn.metrics import RocCurveDisplay

tr_disp = RocCurveDisplay.from_estimator(rf, x_train, y_train)
ts_disp = RocCurveDisplay.from_estimator(rf, x_test, y_test, ax=tr_disp.ax_)
ts_disp.figure_.suptitle("ROC curve comparison")

plt.show()
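
The ROC comparison can also be summarized as a single number per split. A minimal sketch with `roc_auc_score` on synthetic data follows (the names and data are illustrative, not the campaign model). A large gap between the train and test AUC is one sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem, illustrative only
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=1)
X_tr, X_ts, y_tr, y_ts = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# AUC uses the positive-class probability, not the hard 0/1 prediction
auc_train = roc_auc_score(y_tr, clf.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_ts, clf.predict_proba(X_ts)[:, 1])
print(f"train AUC={auc_train:.3f}  test AUC={auc_test:.3f}")
```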

In [None]:
df_data_2 = df_data.copy()

We drop the `Response` variable.

In [None]:
labels = ['Response']
df_data_2 = df_data_2.drop(labels=labels, axis=1)

In [None]:
funpymodeling.status(df_data_2)

In [None]:
df_data_2['Income'].hist(bins=10)

We apply one-hot encoding to the categorical variables.

In [None]:
df_data_2 = pd.get_dummies(df_data_2)

In [None]:
df_data_2

In [None]:
funpymodeling.status(df_data_2)

We prepare the data that will be used for training.

In [None]:
data_x = df_data_2.drop('Income', axis=1)
data_y = df_data_2['Income']

In [None]:
data_x = data_x.values
data_y = data_y.values

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split the dataset into train and test sets
x_train, x_test, y_train, y_test = train_test_split(
    data_x, data_y, test_size=0.3,
)

Creating the `LinearRegression` linear regression model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Create the model
model = LinearRegression()

In [None]:
model.fit(x_train, y_train)

In [None]:
# On training
pred_tr = model.predict(x_train)

In [None]:
# On testing
pred_ts = model.predict(x_test)

In [None]:
pred_tr

In [None]:
pred_ts

In [None]:
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error

In [None]:
len(y_test)

In [None]:
len(pred_ts)

In [None]:
test_metrics = {
    'MAE': mean_absolute_error(y_test, pred_ts),
    'MAPE': mean_absolute_percentage_error(y_test, pred_ts),
    #'MSE': mean_squared_error(y_test, pred_ts),
    #'RMSE': np.sqrt(mean_squared_error(y_test, pred_ts)),
}

pprint(test_metrics)
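
A worked example of the two regression metrics, computed by hand and checked against scikit-learn. The incomes and predictions are hypothetical. Note that `mean_absolute_percentage_error` returns a fraction, not a percentage.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical incomes and predictions, for illustration only
y_true = np.array([50_000.0, 20_000.0, 80_000.0, 40_000.0])
y_pred = np.array([45_000.0, 25_000.0, 70_000.0, 40_000.0])

abs_err = np.abs(y_true - y_pred)
mae = abs_err.mean()                      # in income units
mape = (abs_err / np.abs(y_true)).mean()  # a fraction: 0.12 means 12%

print(mae, mape)
```

MAE weighs every row equally in income units, while MAPE penalizes the same absolute error more on low incomes.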

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf_model = RandomForestRegressor()

In [None]:
params = {
    'n_estimators': [10, 20, 500, 5000],
    'max_features': [50, 100],
    # 'bootstrap': [False, True],
    # 'max_depth': [50, 500],
    # 'min_samples_leaf': [3, 50],
    # 'min_samples_split': [10, 50],
}

In [None]:
rf_grid = GridSearchCV(
    estimator=rf_model,
    param_grid=params,
    scoring='neg_mean_absolute_error',
    cv=5, 
    verbose=1
)

In [None]:
rf_grid.fit(x_train, y_train)

In [None]:
rf_grid.best_estimator_

In [None]:
# On training
pred_tr = rf_grid.predict(x_train)

print(pred_tr)

In [None]:
# On testing
pred_ts = rf_grid.predict(x_test)

print(pred_ts)

In [None]:
pd.concat(
    [
        pd.DataFrame(rf_grid.cv_results_["params"]),
        pd.DataFrame(rf_grid.cv_results_["mean_test_score"],  columns=["neg_mean_absolute_error"])
    ],
    axis=1
).sort_values('neg_mean_absolute_error', ascending=False)

In [None]:
rf_grid.score(x_train, y_train)

In [None]:
rf_grid.score(x_test, y_test)
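
Because the search was configured with `scoring='neg_mean_absolute_error'`, `score` returns the negated MAE, so values closer to zero are better. The sketch below illustrates this on synthetic data with a deliberately tiny grid. The data and the grid values are assumptions chosen to run quickly, not a tuning recommendation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV

# Synthetic regression data and a deliberately tiny grid, illustrative only
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=10.0, random_state=0
)

grid = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [10, 30], 'max_features': [2, 4]},
    scoring='neg_mean_absolute_error',
    cv=3,
)
grid.fit(X_demo, y_demo)

# score() reuses the scorer passed to GridSearchCV, so it returns minus the MAE
neg_mae = grid.score(X_demo, y_demo)
mae = mean_absolute_error(y_demo, grid.predict(X_demo))
print(grid.best_params_, neg_mae, mae)
```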

In [None]:
train_metrics = {
    'MAE': mean_absolute_error(y_train, pred_tr),
    'MAPE': mean_absolute_percentage_error(y_train, pred_tr),
    #'MSE': mean_squared_error(y_train, pred_tr),
    #'RMSE': np.sqrt(mean_squared_error(y_train, pred_tr)),
}

pprint(train_metrics)

In [None]:
test_metrics = {
    'MAE': mean_absolute_error(y_test, pred_ts),
    'MAPE': mean_absolute_percentage_error(y_test, pred_ts),
    #'MSE': mean_squared_error(y_test, pred_ts),
    #'RMSE': np.sqrt(mean_squared_error(y_test, pred_ts)),
}

pprint(test_metrics)

In [None]:
from yellowbrick.regressor import PredictionError, ResidualsPlot

In [None]:
# Prediction error plot
vis_pred_err = PredictionError(rf_grid)

In [None]:
vis_pred_err.fit(x_train, y_train)  # Fit the data to the visualizer
vis_pred_err.score(x_test, y_test)  # Compute the metrics on test
vis_pred_err.show()                 # Show the figure

In [None]:
vis_res = ResidualsPlot(rf_grid.best_estimator_)

# Copied from the official yellowbrick docs
vis_res.fit(x_train, y_train)  # Fit the data to the visualizer
vis_res.score(x_test, y_test)  # Compute the metrics on test
vis_res.show()                 # Show the figure

In [None]:
vis_res2 = ResidualsPlot(rf_grid)

# Copied from the official docs: https://www.scikit-yb.org/en/latest/quickstart.html
vis_res2.fit(x_train, y_train)  # Fit the data to the visualizer
vis_res2.score(x_test, y_test)  # Compute the metrics on test
vis_res2.show()                 # Show the figure
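
The quantities behind a residuals plot can also be inspected numerically. The sketch below uses synthetic data and a plain `LinearRegression` so it runs quickly. The sign convention (actual minus predicted) is an assumption, and plotting libraries may use the opposite one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data, illustrative only
X_demo, y_demo = make_regression(
    n_samples=300, n_features=4, noise=15.0, random_state=0
)
X_tr, X_ts, y_tr, y_ts = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

reg = LinearRegression().fit(X_tr, y_tr)

# Convention assumed here: residual = actual - predicted
res_train = y_tr - reg.predict(X_tr)
res_test = y_ts - reg.predict(X_ts)

# For least squares with an intercept, training residuals average to ~0
print(f"train mean={res_train.mean():.2e}  "
      f"test mean={res_test.mean():.2f}  test std={res_test.std():.2f}")
```

Residuals that fan out or drift away from zero as the prediction grows suggest the model is missing structure in the data.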