### 1. Exploratory Data Analysis
- Review the structure, data types, and size of the dataset.
- Identify categorical and numerical variables.
- Detect potential problems: currency formatting, duplicates, nulls, and outliers.


#### 1.1 Initial Exploration of the Dataset

In [2]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# 1. Load the dataset
df = pd.read_csv('marketing_raw.csv')

# 2. General overview of the data
print('Rows/columns:', df.shape)
df.head()

Rows/columns: (2205, 39)


Unnamed: 0,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response,Age,Customer_Days,marital_Divorced,marital_Married,marital_Single,marital_Together,marital_Widow,education_2n Cycle,education_Basic,education_Graduation,education_Master,education_PhD,MntTotal,MntRegularProds,AcceptedCmpOverall
0,$58138,0,0,58,$635,$88,$546,$172,$88,$88,3,8,10,4,7,0,0,0,0,0,0,3,11,1,63,2822,0,0,1,0,0,0,0,1,0,0,$1529,1441,0
1,$46344,1,1,38,$11,$1,$6,$2,$1,$6,2,1,1,2,5,0,0,0,0,0,0,3,11,0,66,2272,0,0,1,0,0,0,0,1,0,0,$21,15,0
2,$71613,0,0,26,$426,$49,$127,$111,$21,$42,1,8,2,10,4,0,0,0,0,0,0,3,11,0,55,2471,0,0,0,1,0,0,0,1,0,0,$734,692,0
3,$26646,1,0,26,$11,$4,$20,$10,$3,$5,2,2,0,4,6,0,0,0,0,0,0,3,11,0,36,2298,0,0,0,1,0,0,0,1,0,0,$48,43,0
4,$58293,1,0,94,$173,$43,$118,$46,$27,$15,5,5,3,6,5,0,0,0,0,0,0,3,11,0,39,2320,0,1,0,0,0,0,0,0,0,1,$407,392,0


- We observe that every column that should be monetary has the "$" symbol hardcoded into its values. This is not good practice; ideally these columns should be converted to a numeric type so that calculations can be performed on them.
- We also observe one-hot columns (marital_*, education_*), which could be collapsed into a single column per category. As they stand, they unnecessarily increase the dimensionality of the dataset.
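
The first observation can be checked programmatically instead of by eye. The following is a minimal sketch on a toy frame (the `sample` data and the `find_money_columns` helper are illustrative assumptions, not part of the notebook); it lists the text columns whose values all look like `$` followed by digits.

```python
import pandas as pd

# Toy frame that mimics the raw layout (values are illustrative only).
sample = pd.DataFrame({
    "Income": ["$58138", "$46344"],
    "MntWines": ["$635", "$11"],
    "Kidhome": [0, 1],
    "Marital": ["Single", "Married"],
})


def find_money_columns(frame: pd.DataFrame) -> list:
    """Return the non-numeric columns whose every value looks like '$<digits>'."""
    text_cols = [
        c for c in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[c])
    ]
    return [
        c for c in text_cols
        if frame[c].astype(str).str.fullmatch(r"\$\d+").all()
    ]


money_like = find_money_columns(sample)
print(money_like)  # ['Income', 'MntWines']
```

Running the same helper on the full `df` would also catch any monetary column that is easy to overlook in a wide `head()` output.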

In [3]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2205 entries, 0 to 2204
Data columns (total 39 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Income                2205 non-null   object
 1   Kidhome               2205 non-null   int64 
 2   Teenhome              2205 non-null   int64 
 3   Recency               2205 non-null   int64 
 4   MntWines              2205 non-null   object
 5   MntFruits             2205 non-null   object
 6   MntMeatProducts       2205 non-null   object
 7   MntFishProducts       2205 non-null   object
 8   MntSweetProducts      2205 non-null   object
 9   MntGoldProds          2205 non-null   object
 10  NumDealsPurchases     2205 non-null   int64 
 11  NumWebPurchases       2205 non-null   int64 
 12  NumCatalogPurchases   2205 non-null   int64 
 13  NumStorePurchases     2205 non-null   int64 
 14  NumWebVisitsMonth     2205 non-null   int64 
 15  AcceptedCmp3          2205 non-null   

- With info() we confirm that the monetary columns are indeed of type object, which would make aggregate calculations difficult later on.
- In addition, we have 2205 records, and every column shown has that same number of non-null values. This is a very good sign, since nothing needs to be imputed for now.

In [4]:
# 4. Statistical summary of the numeric variables
print('Statistical summary:')
df.describe()

Statistical summary:


Unnamed: 0,Kidhome,Teenhome,Recency,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response,Age,Customer_Days,marital_Divorced,marital_Married,marital_Single,marital_Together,marital_Widow,education_2n Cycle,education_Basic,education_Graduation,education_Master,education_PhD,MntRegularProds,AcceptedCmpOverall
count,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0,2205.0
mean,0.442177,0.506576,49.00907,2.318367,4.10068,2.645351,5.823583,5.336961,0.073923,0.074376,0.073016,0.064399,0.013605,0.00907,3.0,11.0,0.15102,51.095692,2512.718367,0.104308,0.387302,0.216327,0.257596,0.034467,0.089796,0.02449,0.504762,0.165079,0.215873,518.707483,0.29932
std,0.537132,0.54438,28.932111,1.886107,2.737424,2.798647,3.241796,2.413535,0.261705,0.262442,0.260222,0.245518,0.115872,0.094827,0.0,0.0,0.35815,11.705801,202.563647,0.30573,0.487244,0.411833,0.43741,0.182467,0.285954,0.154599,0.500091,0.371336,0.41152,553.847248,0.68044
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0,24.0,2159.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-283.0,0.0
25%,0.0,0.0,24.0,1.0,2.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0,43.0,2339.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,42.0,0.0
50%,0.0,0.0,49.0,2.0,4.0,2.0,5.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0,50.0,2515.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,288.0,0.0
75%,1.0,1.0,74.0,3.0,6.0,4.0,8.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0,61.0,2688.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,884.0,0.0
max,2.0,2.0,99.0,15.0,27.0,28.0,13.0,20.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,11.0,1.0,80.0,2858.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2458.0,4.0


- This summary gives a first idea of how the data are distributed. In some columns the maximum lies well above the 75th percentile (for example, NumWebPurchases has a 75th percentile of 6 and a maximum of 27), so outliers may exist; we examine this in detail further below.
- We also observe that MntRegularProds contains negative values (its minimum is -283), which should not exist for a spending amount. These values could be imputed with the median later on.
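
To see how many values are affected, rather than only the minimum reported by describe(), the negatives can be counted per column. This is a small sketch on made-up data (the `sample` frame is an assumption for illustration):

```python
import pandas as pd

# Illustrative data: one negative spending amount, as in MntRegularProds.
sample = pd.DataFrame({
    "MntRegularProds": [1441, 15, -283, 43],
    "Recency": [58, 38, 26, 26],
})

# Count negative entries in every numeric column, keeping only columns that have any.
numeric = sample.select_dtypes(include="number")
negative_counts = (numeric < 0).sum()
negative_counts = negative_counts[negative_counts > 0]
print(negative_counts)
```

Knowing the count helps decide between imputing a handful of values and investigating how the column was derived.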

In [5]:
# 5. Show the number of duplicated rows in the dataset
df.duplicated().sum()

np.int64(184)

- We observe that the dataset contains 184 duplicated rows. Ideally we should remove them so that records that should not be present do not distort the calculations.
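
Before dropping anything, it can help to look at the duplicated rows side by side. A minimal sketch on toy data (the `sample` frame is an assumption): `duplicated()` counts only the extra copies, while `duplicated(keep=False)` marks every member of a duplicated group.

```python
import pandas as pd

# Toy data: the first and third rows are identical.
sample = pd.DataFrame({
    "Income": [58138, 46344, 58138, 71613],
    "Age": [63, 66, 63, 55],
})

# Extra copies only (what df.duplicated().sum() reports).
n_extra = int(sample.duplicated().sum())

# Every row that belongs to a duplicated group, sorted so copies sit together.
dup_groups = sample[sample.duplicated(keep=False)].sort_values(list(sample.columns))

# Keep the first occurrence of each group.
deduped = sample.drop_duplicates()

print(n_extra, len(dup_groups), len(deduped))  # 1 2 3
```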

In [6]:
# 6. Show the number of nulls per column
df.isnull().sum()

Income                  0
Kidhome                 0
Teenhome                0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
Age                     0
Customer_Days           0
marital_Divorced        0
marital_Married         0
marital_Single          0
marital_Together        0
marital_Widow           0
education_2n Cycle      0
education_Basic         0
education_Graduation    0
education_Master        0
education_PhD           0
MntTotal                0
MntRegularProds         0
AcceptedCmpO

- We confirm that no column contains nulls.

In [7]:
# 7. Show the number of unique values per column
df.nunique()

Income                  1963
Kidhome                    3
Teenhome                   3
Recency                  100
MntWines                 775
MntFruits                158
MntMeatProducts          551
MntFishProducts          182
MntSweetProducts         176
MntGoldProds             212
NumDealsPurchases         15
NumWebPurchases           15
NumCatalogPurchases       13
NumStorePurchases         14
NumWebVisitsMonth         16
AcceptedCmp3               2
AcceptedCmp4               2
AcceptedCmp5               2
AcceptedCmp1               2
AcceptedCmp2               2
Complain                   2
Z_CostContact              1
Z_Revenue                  1
Response                   2
Age                       56
Customer_Days            662
marital_Divorced           2
marital_Married            2
marital_Single             2
marital_Together           2
marital_Widow              2
education_2n Cycle         2
education_Basic            2
education_Graduation       2
education_Mast

- Some columns have high cardinality, but they will probably be useful in the calculations to come.

In [8]:
# 8. Select the numeric columns for outlier detection (we exclude those with at most 2 unique values: outlier detection is not meaningful for them, and most of them are true/false flags)
num_cols = df.select_dtypes(include=['int64']).columns
num_cols = [c for c in num_cols if df[c].nunique() > 2]
num_cols

['Kidhome',
 'Teenhome',
 'Recency',
 'NumDealsPurchases',
 'NumWebPurchases',
 'NumCatalogPurchases',
 'NumStorePurchases',
 'NumWebVisitsMonth',
 'Age',
 'Customer_Days',
 'MntRegularProds',
 'AcceptedCmpOverall']

In [9]:
Q1 = df[num_cols].quantile(0.25)
Q3 = df[num_cols].quantile(0.75)
IQR = Q3 - Q1
IQR

Kidhome                  1.0
Teenhome                 1.0
Recency                 50.0
NumDealsPurchases        2.0
NumWebPurchases          4.0
NumCatalogPurchases      4.0
NumStorePurchases        5.0
NumWebVisitsMonth        4.0
Age                     18.0
Customer_Days          349.0
MntRegularProds        842.0
AcceptedCmpOverall       0.0
dtype: float64

In [10]:
lower_limit = Q1 - 1.5 * IQR # lower bound
upper_limit = Q3 + 1.5 * IQR # upper bound

print('Lower bounds for outliers:')
print(lower_limit)
print('Upper bounds for outliers:')
print(upper_limit)

Lower bounds for outliers:
Kidhome                  -1.5
Teenhome                 -1.5
Recency                 -51.0
NumDealsPurchases        -2.0
NumWebPurchases          -4.0
NumCatalogPurchases      -6.0
NumStorePurchases        -4.5
NumWebVisitsMonth        -3.0
Age                      16.0
Customer_Days          1815.5
MntRegularProds       -1221.0
AcceptedCmpOverall        0.0
dtype: float64
Upper bounds for outliers:
Kidhome                   2.5
Teenhome                  2.5
Recency                 149.0
NumDealsPurchases         6.0
NumWebPurchases          12.0
NumCatalogPurchases      10.0
NumStorePurchases        15.5
NumWebVisitsMonth        13.0
Age                      88.0
Customer_Days          3211.5
MntRegularProds        2147.0
AcceptedCmpOverall        0.0
dtype: float64


In [11]:
for col in num_cols:
    num_outliers = ((df[col] < lower_limit[col]) | (df[col] > upper_limit[col])).sum()
    pct_outliers = (num_outliers / len(df) * 100).round(2)
    if num_outliers > 0:
        print(f'Number of outliers in column {col}: {num_outliers}, which is {pct_outliers}% of all records.')

Number of outliers in column NumDealsPurchases: 82, which is 3.72% of all records.
Number of outliers in column NumWebPurchases: 3, which is 0.14% of all records.
Number of outliers in column NumCatalogPurchases: 20, which is 0.91% of all records.
Number of outliers in column NumWebVisitsMonth: 8, which is 0.36% of all records.
Number of outliers in column MntRegularProds: 4, which is 0.18% of all records.
Number of outliers in column AcceptedCmpOverall: 458, which is 20.77% of all records.


- In most columns the outliers are a small share of the data (under 4% each), so we could skip imputing them. The exception is AcceptedCmpOverall, flagged in 20.77% of the records. That figure is an artifact of the method: the column's IQR is 0, so both bounds are 0 and every nonzero value is flagged. Its values range from 0 to 4, which suggests that it counts the campaigns each customer accepted rather than being a binary (0 or 1) flag. If a binary indicator were needed, values greater than 1 could be capped at 1, at the cost of losing the count.
- Additionally, columns that are not relevant for our analysis could be dropped (e.g., Recency, Z_CostContact, Z_Revenue).
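
The behavior of the 1.5 × IQR rule on a column dominated by a single value can be reproduced on a toy series. In this sketch the `iqr_bounds` helper and the `accepted` data are illustrative assumptions; the point is that when at least 75% of the values are identical, the IQR is 0 and every other value falls outside the bounds.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) bounds using the k * IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Eight zeros and two nonzero counts: Q1 = Q3 = 0, so IQR = 0.
accepted = pd.Series([0] * 8 + [1, 4])
low, high = iqr_bounds(accepted)
flagged = int(((accepted < low) | (accepted > high)).sum())

print(low, high, flagged)  # 0.0 0.0 2
```

For such columns, a frequency table (`value_counts()`) is usually more informative than an outlier count.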

### 2. Data Preprocessing
- Based on the observations above, preprocess the data to improve its quality.

#### 2.1 Converting Monetary Variables to a Numeric Type

- The monetary variables contain the $ symbol and are stored as object, which prevents aggregate calculations from being performed correctly.

In [12]:
money_cols = [
    'Income', 'MntWines', 'MntFruits', 'MntMeatProducts',
    'MntFishProducts', 'MntSweetProducts', 'MntGoldProds'
]

In [13]:
for col in money_cols:
    df[col] = df[col].str.replace('$', '', regex=False).astype(int)
df[money_cols].dtypes

Income              int64
MntWines            int64
MntFruits           int64
MntMeatProducts     int64
MntFishProducts     int64
MntSweetProducts    int64
MntGoldProds        int64
dtype: object

- We remove the $ symbol and convert the columns to a numeric type (int).
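
The `astype(int)` conversion works here because every value is a clean `$` followed by digits. As a hedged alternative for messier files, the sketch below uses `pd.to_numeric(..., errors="coerce")`, which turns unparseable entries into NaN instead of raising. The `parse_money` helper, the thousands separators, and the `"n/a"` entry are assumptions for illustration; none of them appear in this dataset.

```python
import pandas as pd


def parse_money(column: pd.Series) -> pd.Series:
    """Strip '$', thousands separators and spaces, then convert to numbers."""
    cleaned = (
        column.astype(str)
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .str.strip()
    )
    # Values that still cannot be parsed become NaN instead of raising an error.
    return pd.to_numeric(cleaned, errors="coerce")


raw = pd.Series(["$58138", "$1,529", " $21 ", "n/a"])
parsed = parse_money(raw)
print(parsed)
```

The trade-off is that the result is float64 whenever a NaN is produced, so a strict `astype(int)` remains the better choice when any bad value should stop the pipeline.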

#### 2.2 Removing Duplicate Records

In [14]:

df = df.drop_duplicates()

print(f'Records remaining after removing duplicates: {df.shape[0]}')


Records remaining after removing duplicates: 2021


#### 2.3 Removing Uninformative Columns

In [15]:
cols_to_drop = ['Z_CostContact', 'Z_Revenue']
df = df.drop(columns=cols_to_drop)

- Each of these two columns has a single unique value (3 and 11, respectively), so they contribute no information to the analysis.
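
Instead of naming the columns by hand, constant columns can be found from their number of unique values. A minimal sketch on a toy frame (the `sample` data is an assumption; the column names are borrowed from the dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    "Z_CostContact": [3, 3, 3],
    "Z_Revenue": [11, 11, 11],
    "Recency": [58, 38, 26],
})

# A column with a single distinct value (counting NaN as a value) carries no information.
constant_cols = [c for c in sample.columns if sample[c].nunique(dropna=False) == 1]
reduced = sample.drop(columns=constant_cols)

print(constant_cols)          # ['Z_CostContact', 'Z_Revenue']
print(list(reduced.columns))  # ['Recency']
```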

In [16]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 2021 entries, 0 to 2204
Data columns (total 37 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Income                2021 non-null   int64 
 1   Kidhome               2021 non-null   int64 
 2   Teenhome              2021 non-null   int64 
 3   Recency               2021 non-null   int64 
 4   MntWines              2021 non-null   int64 
 5   MntFruits             2021 non-null   int64 
 6   MntMeatProducts       2021 non-null   int64 
 7   MntFishProducts       2021 non-null   int64 
 8   MntSweetProducts      2021 non-null   int64 
 9   MntGoldProds          2021 non-null   int64 
 10  NumDealsPurchases     2021 non-null   int64 
 11  NumWebPurchases       2021 non-null   int64 
 12  NumCatalogPurchases   2021 non-null   int64 
 13  NumStorePurchases     2021 non-null   int64 
 14  NumWebVisitsMonth     2021 non-null   int64 
 15  AcceptedCmp3          2021 non-null   int64

### 3. Rebuilding Categorical Variables
- Create the Marital_Status column from the marital_* columns.
- Create the Education column from the education_* columns.
- Drop the original columns that were used.


In [17]:
# Rebuild the Marital_Status column
marital_cols = [col for col in df.columns if col.startswith('marital_')]

# idxmax(axis=1) returns the name of the column that holds the 1
# We then strip the marital_ prefix to keep only the status
if marital_cols:
    df['Marital_Status'] = df[marital_cols].idxmax(axis=1).str.replace('marital_', '', regex=False)

# Rebuild the Education column
education_cols = [col for col in df.columns if col.startswith('education_')]
if education_cols:
    df['Education'] = df[education_cols].idxmax(axis=1).str.replace('education_', '', regex=False)

# Drop the original columns
cols_to_drop = marital_cols + education_cols
if cols_to_drop:
    df.drop(columns=cols_to_drop, inplace=True)

df.head()

Unnamed: 0,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response,Age,Customer_Days,MntTotal,MntRegularProds,AcceptedCmpOverall,Marital_Status,Education
0,58138,0,0,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,1,63,2822,$1529,1441,0,Single,Graduation
1,46344,1,1,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,0,66,2272,$21,15,0,Single,Graduation
2,71613,0,0,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,0,55,2471,$734,692,0,Together,Graduation
3,26646,1,0,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,0,0,0,0,36,2298,$48,43,0,Together,Graduation
4,58293,1,0,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,0,0,0,0,39,2320,$407,392,0,Married,PhD
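
One caveat of `idxmax(axis=1)` is that it returns the first column when a row contains only zeros, and it silently picks one column when a row has several ones. The sketch below guards against both cases by keeping a label only where exactly one indicator is set. The `collapse_one_hot` helper and the `sample` frame are illustrative assumptions; whether such rows exist in this dataset is not shown in the notebook.

```python
import pandas as pd

# Toy one-hot block: row 2 has no category set, row 3 has two.
sample = pd.DataFrame({
    "marital_Married": [1, 0, 0, 1],
    "marital_Single": [0, 1, 0, 1],
})


def collapse_one_hot(frame: pd.DataFrame, prefix: str) -> pd.Series:
    """Collapse prefix_* indicator columns into labels; ambiguous rows become NaN."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    active = frame[cols].sum(axis=1)                         # number of indicators set per row
    labels = frame[cols].idxmax(axis=1).str[len(prefix):]    # strip the prefix
    return labels.where(active == 1)                         # keep only unambiguous rows


status = collapse_one_hot(sample, "marital_")
print(status)
```

Rows that come back as NaN can then be inspected or assigned an explicit "Unknown" category.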


### 4. Handling Invalid Values
- Identify ages with the value 99999.
- Propose and justify a treatment strategy.


#### Maximum Age

In [18]:
df['Age'].max()  # First we find the maximum age to see whether invalid values exist

np.int64(80)

- First we find the maximum age to determine whether invalid values such as 99999 exist.

#### Minimum Age

In [19]:
df['Age'].min() 

np.int64(24)

- Likewise, we take the minimum to find the lowest age.

#### Proposed Treatment Strategy

- As we can see, the Age variable contains no invalid values: it ranges from 24 to 80, and the value 99999 does not appear. The following is nevertheless an example of how to handle the problem should it exist.

In [20]:
df.loc[(df["Age"] < 0) | (df["Age"] > 110), "Age"] = np.nan

df["Age"] = df["Age"].fillna(df["Age"].median())

- Impossible ages are replaced with NaN, which is also useful because it lets us treat them together with any null values that might exist.
- Finally, we use fillna to impute those values with the median. In the describe() output above, the median (50) and the mean (about 51.1) of Age are close, so the distribution is probably not very skewed. We still use the median because it is a more robust measure, less affected by extreme values. After this step Age is stored as float64, as the later dtypes output shows.
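
Since the dataset contains no 99999 values, the effect of this strategy is easier to see on a toy series that does contain one. This is a sketch under that assumption (the `ages` values are made up); it applies the same range check and median imputation.

```python
import pandas as pd

# Toy ages with one sentinel value (99999) standing in for "unknown".
ages = pd.Series([63, 66, 99999, 36, 39], dtype="float64")

# Keep only plausible ages; everything else becomes NaN.
valid = ages.between(0, 110)
cleaned = ages.where(valid)

# The median is computed over the valid ages only, so the sentinel cannot distort it.
cleaned = cleaned.fillna(cleaned.median())

print(cleaned.tolist())  # [63.0, 66.0, 51.0, 36.0, 39.0]
```

Note that the mean of the raw series would be dominated by the sentinel, which is the main reason to mask it before computing any statistic used for imputation.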

### 5. Final Validations
- Confirm that the data types are correct.
- Verify the overall coherence of the dataset.


In [21]:
df.dtypes

Income                   int64
Kidhome                  int64
Teenhome                 int64
Recency                  int64
MntWines                 int64
MntFruits                int64
MntMeatProducts          int64
MntFishProducts          int64
MntSweetProducts         int64
MntGoldProds             int64
NumDealsPurchases        int64
NumWebPurchases          int64
NumCatalogPurchases      int64
NumStorePurchases        int64
NumWebVisitsMonth        int64
AcceptedCmp3             int64
AcceptedCmp4             int64
AcceptedCmp5             int64
AcceptedCmp1             int64
AcceptedCmp2             int64
Complain                 int64
Response                 int64
Age                    float64
Customer_Days            int64
MntTotal                object
MntRegularProds          int64
AcceptedCmpOverall       int64
Marital_Status          object
Education               object
dtype: object

In [22]:
df.head()

Unnamed: 0,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response,Age,Customer_Days,MntTotal,MntRegularProds,AcceptedCmpOverall,Marital_Status,Education
0,58138,0,0,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,1,63.0,2822,$1529,1441,0,Single,Graduation
1,46344,1,1,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,0,66.0,2272,$21,15,0,Single,Graduation
2,71613,0,0,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,0,55.0,2471,$734,692,0,Together,Graduation
3,26646,1,0,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,0,0,0,0,36.0,2298,$48,43,0,Together,Graduation
4,58293,1,0,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,0,0,0,0,39.0,2320,$407,392,0,Married,PhD


- We check whether the data types were corrected and whether the structure is coherent. Overall the dataset is well structured, except that MntTotal still needs to be converted to a numeric type.

In [23]:
df['MntTotal'] = (df['MntTotal'].str.replace('$', '', regex=False).astype(int))

- We fix the type of MntTotal.

In [24]:
money_cols = money_cols + ['MntTotal', 'MntRegularProds'] # Add the column that was missing and the one that was already numeric.

for col in money_cols:
    df.loc[df[col] < 0, col] = np.nan # Flag negative (invalid) values as NaN
    df[col] = df[col].fillna(df[col].median()) # Fill the NaN values with the column median

- Now that the invalid values in the monetary columns have been replaced, we verify the result.

In [25]:
df.describe()

Unnamed: 0,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Response,Age,Customer_Days,MntTotal,MntRegularProds,AcceptedCmpOverall
count,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0,2021.0
mean,51687.258783,0.443345,0.509649,48.880752,306.492331,26.364671,166.059871,37.603662,27.268679,43.921821,2.330035,4.115289,2.64473,5.807521,5.340426,0.074715,0.076695,0.072241,0.065809,0.012865,0.009401,0.153884,51.117269,2511.613063,563.789213,520.594013,0.302326
std,20713.046401,0.536196,0.546393,28.950917,337.603877,39.776518,219.869126,54.892196,41.575454,51.678211,1.892778,2.753588,2.799126,3.230434,2.426319,0.262997,0.266172,0.258951,0.248009,0.11272,0.096527,0.360927,11.667616,202.546762,576.775749,554.169794,0.680812
min,1730.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.0,2159.0,4.0,0.0,0.0
25%,35416.0,0.0,0.0,24.0,24.0,2.0,16.0,3.0,1.0,9.0,1.0,2.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,43.0,2337.0,55.0,42.0,0.0
50%,51412.0,0.0,0.0,49.0,178.0,8.0,68.0,12.0,8.0,25.0,2.0,4.0,2.0,5.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,50.0,2511.0,343.0,289.5,0.0
75%,68274.0,1.0,1.0,74.0,507.0,33.0,230.0,50.0,34.0,56.0,3.0,6.0,4.0,8.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,61.0,2688.0,964.0,883.0,0.0
max,113734.0,2.0,2.0,99.0,1493.0,199.0,1725.0,259.0,262.0,321.0,15.0,27.0,28.0,13.0,20.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,80.0,2858.0,2491.0,2458.0,4.0


- The values now look consistent: every monetary column has a non-negative minimum, and no impossible values remain.

In [26]:
df.dtypes

Income                 float64
Kidhome                  int64
Teenhome                 int64
Recency                  int64
MntWines               float64
MntFruits              float64
MntMeatProducts        float64
MntFishProducts        float64
MntSweetProducts       float64
MntGoldProds           float64
NumDealsPurchases        int64
NumWebPurchases          int64
NumCatalogPurchases      int64
NumStorePurchases        int64
NumWebVisitsMonth        int64
AcceptedCmp3             int64
AcceptedCmp4             int64
AcceptedCmp5             int64
AcceptedCmp1             int64
AcceptedCmp2             int64
Complain                 int64
Response                 int64
Age                    float64
Customer_Days            int64
MntTotal               float64
MntRegularProds        float64
AcceptedCmpOverall       int64
Marital_Status          object
Education               object
dtype: object

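- Finally, we confirm that the structure and data types of our dataset are coherent and homogeneous. The monetary columns now appear as float64 rather than int64, a result of the NaN assignment in the imputation step.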
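
Beyond data types, coherence can also be checked across columns. In the five rows displayed by head(), MntTotal equals the sum of the wine, fruit, meat, fish, and sweet amounts, and MntRegularProds equals MntTotal minus MntGoldProds. Whether this holds for the whole dataset is an assumption; the sketch below shows how it could be tested, using the first three displayed rows as sample data.

```python
import pandas as pd

# First three rows as shown by df.head() after cleaning.
sample = pd.DataFrame({
    "MntWines": [635, 11, 426],
    "MntFruits": [88, 1, 49],
    "MntMeatProducts": [546, 6, 127],
    "MntFishProducts": [172, 2, 111],
    "MntSweetProducts": [88, 1, 21],
    "MntGoldProds": [88, 6, 42],
    "MntTotal": [1529, 21, 734],
    "MntRegularProds": [1441, 15, 692],
})

component_cols = [
    "MntWines", "MntFruits", "MntMeatProducts",
    "MntFishProducts", "MntSweetProducts",
]

# Assumed identities, inferred from the displayed rows only.
total_ok = sample[component_cols].sum(axis=1).eq(sample["MntTotal"])
regular_ok = (sample["MntTotal"] - sample["MntGoldProds"]).eq(sample["MntRegularProds"])

print("Rows violating the MntTotal identity:", int((~total_ok).sum()))
print("Rows violating the MntRegularProds identity:", int((~regular_ok).sum()))
```

If the identities do hold in general, rows whose MntRegularProds was imputed with the median would show up as violations, which would be an argument for recomputing that column from MntTotal and MntGoldProds instead.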

In [27]:
df.to_excel("Marketing.xlsx", sheet_name="Customers", index=False)