<div style="text-align: center;">
  <img src="https://github.com/Hack-io-Data/Imagenes/blob/main/01-LogosHackio/logo_naranja@4x.png?raw=true" alt="esquema" />
</div>

# Data Preprocessing for a Classification Model

Our risk team has identified the need to build a classification model to detect potentially fraudulent or anomalous transactions. A successful model will make it possible to identify patterns in the data and reduce financial losses for our clients and the company. Your job today is to prepare the provided data for a classification model that predicts the probability that a transaction is “Normal” or “Anomalous”, based on the characteristics of the transactions. Today's work must include the following tasks:


1. **Exploratory data analysis (EDA):**

   - Visualize the distribution of the main variables.

   - Explore relationships between the features and the target variable (`TransactionStatus`, which corresponds to `is_fraudulent` in the provided data).


2. **Data cleaning:**

   - Identify and handle null values.

   - Remove duplicates, if any.

   - Make sure each variable has the correct data type.

   - etc. 

3. **Variable transformation:**

   - Normalize or scale numeric variables as needed.

   - Encode categorical variables with techniques such as One-Hot Encoding, Target Encoding or Ordinal Encoding, as appropriate.

   - Create useful derived variables, such as:

     - Transaction frequency of each customer.

     - Time difference between consecutive transactions.

     - Ratio between the post-transaction balance and the transaction amount.
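
As a rough illustration of the first two derived variables in the list above, the following sketch uses pandas on a small made-up frame. The toy rows and the new column names `customer_tx_count` and `seconds_since_prev_tx` are assumptions for illustration only. The balance ratio is left out because the provided columns do not include a balance.

```python
import pandas as pd

# Toy data (hypothetical rows; the column names follow the provided dataset)
df = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4, 5],
    "customer_id": [1082, 1015, 1082, 1015, 1082],
    "transaction_time": pd.to_datetime([
        "2023-01-01 00:00:00",
        "2023-01-01 00:00:01",
        "2023-01-01 00:00:10",
        "2023-01-01 00:00:31",
        "2023-01-01 00:01:00",
    ]),
})

# Sort so that "previous transaction" is well defined within each customer
df = df.sort_values(["customer_id", "transaction_time"]).reset_index(drop=True)

# Transaction frequency: number of transactions recorded for each customer
df["customer_tx_count"] = (
    df.groupby("customer_id")["transaction_id"].transform("count")
)

# Seconds elapsed since the same customer's previous transaction
# (NaN for each customer's first transaction)
df["seconds_since_prev_tx"] = (
    df.groupby("customer_id")["transaction_time"].diff().dt.total_seconds()
)

print(df)
```

Both features depend on `transaction_time`, so they would have to be computed before that column is dropped.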

## Provided Data

The data file to be analyzed contains real transactions recorded by a company. The main columns to preprocess are described below:

| Column                | Description                                                                                     |
|-----------------------|-------------------------------------------------------------------------------------------------|
| `transaction_id`      | Unique identifier of the transaction.                                                           |
| `customer_id`         | Unique identifier of the customer who made the transaction.                                     |
| `merchant_id`         | Unique identifier of the merchant where the transaction took place.                             |
| `amount`              | Transaction amount (in the corresponding currency).                                             |
| `transaction_time`    | Exact date and time at which the transaction occurred.                                          |
| `is_fraudulent`       | Flag indicating whether the transaction was fraudulent (1: Yes, 0: No).                         |
| `card_type`           | Type of card used for the transaction (Visa, MasterCard, American Express, Discover).           |
| `location`            | Location (city or region) where the transaction took place.                                     |
| `purchase_category`   | Purchase category (for example, Gas Station, Online Shopping, Retail, etc.).                    |
| `customer_age`        | Age of the customer at the time of the transaction.                                             |
| `transaction_description` | Short description of the transaction, usually including the associated merchant.            |


In [155]:
# Data handling
# -----------------------------------------------------------------------
import numpy as np
import pandas as pd

# Miscellaneous utilities
# -----------------------------------------------------------------------
import math

# Plotting
# -----------------------------------------------------------------------
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.display.float_format = "{:,.2f}".format    # fixed-point display with thousands separators
pd.set_option("display.max_columns", None)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from itertools import product, combinations
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler, RobustScaler
from sklearn.neighbors import LocalOutlierFactor # outlier detection with the LOF method
from sklearn.ensemble import IsolationForest # outlier detection with the Isolation Forest method
import tqdm as tqdm
from scipy.stats import chi2_contingency
import os
import sys 
sys.path.append(os.path.abspath(os.path.join("..", "src")))   # make ../src importable on any OS
import soporte_preprocesamiento_log as f

import pickle

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder #, TargetEncoder # One-Hot, Ordinal, Label and Target encoders


from category_encoders import TargetEncoder                                   # only importable from the default environment, not the project one

# Handling class imbalance
# -----------------------------------------------------------------------
# from imblearn.over_sampling import RandomOverSampler, SMOTE
# from imblearn.under_sampling import RandomUnderSampler
# from imblearn.combine import SMOTETomek


# from sklearn.impute import IterativeImputer

# from sklearn.impute import KNNImputer

In [156]:
df_datos=pd.read_csv("../datos/financial_data.csv")


In [157]:
visualizador=f.Visualizador(df_datos)

In [158]:
df_datos.head(3)

|   | transaction_id | customer_id | merchant_id | amount | transaction_time | is_fraudulent | card_type | location | purchase_category | customer_age | transaction_description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1082 | 2027 | 5758.59 | 2023-01-01 00:00:00 | 0 | MasterCard | City-30 | Gas Station | 43 | Purchase at Merchant-2027 |
| 1 | 2 | 1015 | 2053 | 1901.56 | 2023-01-01 00:00:01 | 1 | Visa | City-47 | Online Shopping | 61 | Purchase at Merchant-2053 |
| 2 | 3 | 1004 | 2035 | 1248.86 | 2023-01-01 00:00:02 | 1 | MasterCard | City-6 | Gas Station | 57 | Purchase at Merchant-2035 |


In [159]:
dicc={ 1:"Yes",
       0 :"No"}
df_datos["is_fraudulent"]=df_datos["is_fraudulent"].map(dicc)

In [160]:
f.exploracion_datos(df_datos)

The number of rows is 10000 and the number of columns is 11

----------

This dataset has 0 duplicated values

----------

The columns with null values and their percentages are: 


Series([], dtype: float64)


----------

The main statistics of the numeric variables are:


|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| transaction_id | 10000.0 | 5000.5 | 2886.9 | 1.0 | 2500.75 | 5000.5 | 7500.25 | 10000.0 |
| customer_id | 10000.0 | 1051.27 | 28.86 | 1001.0 | 1026.0 | 1052.0 | 1076.0 | 1100.0 |
| merchant_id | 10000.0 | 2050.49 | 28.88 | 2001.0 | 2025.0 | 2050.0 | 2076.0 | 2100.0 |
| amount | 10000.0 | 4958.38 | 2899.7 | 10.61 | 2438.18 | 4943.94 | 7499.31 | 9999.75 |
| customer_age | 10000.0 | 44.05 | 15.32 | 18.0 | 31.0 | 44.0 | 57.0 | 70.0 |



----------

The main statistics of the categorical variables are:


|   | count | unique | top | freq |
|---|---|---|---|---|
| transaction_time | 10000 | 10000 | 2023-01-01 02:46:23 | 1 |
| is_fraudulent | 10000 | 2 | Yes | 5068 |
| card_type | 10000 | 4 | Discover | 2633 |
| location | 10000 | 50 | City-7 | 223 |
| purchase_category | 10000 | 6 | Travel | 1694 |
| transaction_description | 10000 | 100 | Purchase at Merchant-2016 | 120 |



----------

The main characteristics of the dataframe are:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   transaction_id           10000 non-null  int64  
 1   customer_id              10000 non-null  int64  
 2   merchant_id              10000 non-null  int64  
 3   amount                   10000 non-null  float64
 4   transaction_time         10000 non-null  object 
 5   is_fraudulent            10000 non-null  object 
 6   card_type                10000 non-null  object 
 7   location                 10000 non-null  object 
 8   purchase_category        10000 non-null  object 
 9   customer_age             10000 non-null  int64  
 10  transaction_description  10000 non-null  object 
dtypes: float64(1), int64(4), object(6)
memory usage: 859.5+ KB


None

## EDA  
  
### DUPLICATES  

There are no duplicates.


### NULLS  

There are no nulls in the dataset.   


### DATA TYPES  

**customer_age** is converted from int64 to categorical, possibly grouped into age bands to avoid having too many categories. 
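
A minimal sketch of the age-band idea, assuming `pd.cut` with hand-picked bin edges. The edges and labels below are illustrative choices, not values taken from the notebook. The ages in the data range from 18 to 70.

```python
import pandas as pd

# A few ages as they appear in the dataset (range 18 to 70)
ages = pd.Series([43, 61, 57, 18, 70], name="customer_age")

# Hypothetical band edges; intervals are right-inclusive: (17, 30], (30, 45], ...
bins = [17, 30, 45, 60, 70]
labels = ["18-30", "31-45", "46-60", "61-70"]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.tolist())
```

This reduces 53 distinct ages to four ordered categories, which keeps the number of encoded columns small if the variable is later one-hot encoded.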
  

### MAIN STATISTICS
- There do not appear to be significant outliers in the dataset, since the mean and the median of each numeric variable are almost identical.   
  
- As for the standard deviations, the most notable one is in amount (2,899.70, roughly 58% of the mean).  
  
Overall the data are not widely dispersed, except for amount, which shows that transaction amounts differ considerably from one another.  
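
The two checks used in these notes (mean versus median, and standard deviation relative to the mean) can be reproduced from the printed `describe()` figures. This is a small sketch using those published values. The dictionary layout and the name `summary` are assumptions for illustration.

```python
# Figures copied from the describe() output for the two non-identifier numeric columns
stats = {
    "amount": {"mean": 4958.38, "std": 2899.70, "median": 4943.94},
    "customer_age": {"mean": 44.05, "std": 15.32, "median": 44.0},
}

summary = {}
for name, s in stats.items():
    summary[name] = {
        # Coefficient of variation: standard deviation as a fraction of the mean
        "cv": s["std"] / s["mean"],
        # Relative gap between mean and median; a small gap suggests little skew
        "mean_median_gap": abs(s["mean"] - s["median"]) / s["mean"],
    }

for name, values in summary.items():
    print(name, round(values["cv"], 2), round(values["mean_median_gap"], 4))
```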
  

### RELEVANT NOTES  

- Since the response variable (**is_fraudulent**) is numeric (1 and 0), I treat it as categorical during data exploration and convert it back to numeric afterwards.
  
- The variables dropped because they are not considered relevant are: **transaction_id**, **transaction_description** and **transaction_time**.


In [161]:
df_datos["customer_id"].unique()

array([1082, 1015, 1004, 1095, 1036, 1032, 1029, 1018, 1014, 1087, 1070,
       1012, 1076, 1055, 1005, 1028, 1030, 1065, 1078, 1072, 1026, 1092,
       1084, 1090, 1054, 1058, 1001, 1098, 1021, 1044, 1020, 1049, 1013,
       1046, 1045, 1034, 1006, 1094, 1059, 1069, 1016, 1011, 1071, 1038,
       1081, 1080, 1047, 1074, 1025, 1091, 1009, 1085, 1099, 1048, 1027,
       1086, 1035, 1088, 1083, 1010, 1022, 1060, 1089, 1042, 1100, 1008,
       1041, 1052, 1073, 1064, 1051, 1019, 1096, 1075, 1066, 1097, 1007,
       1077, 1050, 1068, 1033, 1002, 1093, 1056, 1023, 1039, 1063, 1003,
       1040, 1031, 1017, 1061, 1057, 1067, 1043, 1053, 1024, 1062, 1037,
       1079])

In [162]:
df_datos["merchant_id"].unique()

array([2027, 2053, 2035, 2037, 2083, 2021, 2033, 2088, 2077, 2031, 2052,
       2015, 2020, 2025, 2004, 2032, 2040, 2060, 2063, 2072, 2056, 2055,
       2005, 2026, 2018, 2034, 2036, 2085, 2071, 2023, 2006, 2093, 2003,
       2008, 2011, 2041, 2078, 2039, 2009, 2047, 2086, 2012, 2051, 2074,
       2095, 2065, 2016, 2067, 2075, 2100, 2068, 2064, 2099, 2043, 2050,
       2096, 2028, 2019, 2014, 2057, 2076, 2069, 2048, 2090, 2070, 2061,
       2092, 2007, 2089, 2082, 2002, 2097, 2081, 2013, 2046, 2054, 2094,
       2042, 2059, 2098, 2024, 2001, 2084, 2017, 2058, 2062, 2049, 2029,
       2079, 2030, 2022, 2080, 2073, 2087, 2038, 2091, 2010, 2044, 2045,
       2066])

In [163]:
df_datos["merchant_id"].value_counts()

merchant_id
2016    120
2078    120
2055    118
2019    117
2057    117
       ... 
2033     82
2045     82
2031     80
2070     75
2100     75
Name: count, Length: 100, dtype: int64

In [164]:
df_datos["customer_id"].value_counts()

customer_id
1074    126
1085    120
1059    118
1018    116
1099    116
       ... 
1027     85
1092     84
1066     84
1033     81
1003     81
Name: count, Length: 100, dtype: int64

In [165]:
df_datos["customer_age"].unique()

array([43, 61, 57, 59, 36, 19, 40, 55, 70, 27, 46, 34, 41, 64, 62, 44, 60,
       51, 28, 42, 23, 45, 65, 24, 30, 35, 32, 58, 66, 22, 39, 53, 21, 29,
       38, 56, 48, 37, 33, 47, 18, 67, 63, 31, 68, 49, 50, 52, 26, 54, 20,
       25, 69])

In [166]:

df_datos["customer_age"].value_counts()

customer_age
62    228
50    215
68    212
25    208
45    208
19    204
41    204
55    203
30    203
42    203
36    202
23    201
53    200
22    200
35    198
70    197
69    197
24    196
29    194
21    194
61    194
58    193
51    193
34    192
48    192
65    192
44    190
60    189
32    188
54    186
47    186
46    186
52    185
64    184
37    182
39    181
40    181
49    180
66    179
43    178
20    178
63    174
27    174
18    173
56    173
28    172
38    169
31    169
67    168
26    167
33    165
59    164
57    156
Name: count, dtype: int64

In [167]:
df_datos["customer_age"]=df_datos["customer_age"].astype("category")

In [168]:
df_datos.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| transaction_id | 10000.0 | 5000.5 | 2886.9 | 1.0 | 2500.75 | 5000.5 | 7500.25 | 10000.0 |
| customer_id | 10000.0 | 1051.27 | 28.86 | 1001.0 | 1026.0 | 1052.0 | 1076.0 | 1100.0 |
| merchant_id | 10000.0 | 2050.49 | 28.88 | 2001.0 | 2025.0 | 2050.0 | 2076.0 | 2100.0 |
| amount | 10000.0 | 4958.38 | 2899.7 | 10.61 | 2438.18 | 4943.94 | 7499.31 | 9999.75 |


In [169]:
df_datos.describe(include="O").T

|   | count | unique | top | freq |
|---|---|---|---|---|
| transaction_time | 10000 | 10000 | 2023-01-01 02:46:23 | 1 |
| is_fraudulent | 10000 | 2 | Yes | 5068 |
| card_type | 10000 | 4 | Discover | 2633 |
| location | 10000 | 50 | City-7 | 223 |
| purchase_category | 10000 | 6 | Travel | 1694 |
| transaction_description | 10000 | 100 | Purchase at Merchant-2016 | 120 |


In [170]:
df_datos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype   
---  ------                   --------------  -----   
 0   transaction_id           10000 non-null  int64   
 1   customer_id              10000 non-null  int64   
 2   merchant_id              10000 non-null  int64   
 3   amount                   10000 non-null  float64 
 4   transaction_time         10000 non-null  object  
 5   is_fraudulent            10000 non-null  object  
 6   card_type                10000 non-null  object  
 7   location                 10000 non-null  object  
 8   purchase_category        10000 non-null  object  
 9   customer_age             10000 non-null  category
 10  transaction_description  10000 non-null  object  
dtypes: category(1), float64(1), int64(3), object(6)
memory usage: 793.6+ KB


In [171]:
df_datos.drop(columns=["transaction_description", "transaction_id","transaction_time"], inplace=True)

In [172]:
df_cat=df_datos.select_dtypes("O")
df_num=df_datos.select_dtypes(include=np.number)

In [173]:
lista_cat=df_cat.columns

In [174]:
# df_cat.drop(columns=["location"], inplace=True)

In [175]:

# visualizador.plot_numericas()

In [176]:
# visualizador.plot_categoricas()

In [177]:
df_datos.head()

|   | customer_id | merchant_id | amount | is_fraudulent | card_type | location | purchase_category | customer_age |
|---|---|---|---|---|---|---|---|---|
| 0 | 1082 | 2027 | 5758.59 | No | MasterCard | City-30 | Gas Station | 43 |
| 1 | 1015 | 2053 | 1901.56 | Yes | Visa | City-47 | Online Shopping | 61 |
| 2 | 1004 | 2035 | 1248.86 | Yes | MasterCard | City-6 | Gas Station | 57 |
| 3 | 1095 | 2037 | 7619.05 | Yes | Discover | City-6 | Travel | 59 |
| 4 | 1036 | 2083 | 1890.1 | Yes | MasterCard | City-34 | Retail | 36 |


In [178]:
# visualizador.plot_relacion("is_fraudulent")

There are no univariate outliers.

In [179]:
# visualizador.deteccion_outliers()

In [180]:
# visualizador.correlacion()

Since no pair of variables is highly correlated, we do not drop any of them.

### ENCODING

In [181]:
f.detectar_orden_cat(df_datos,lista_cat,"is_fraudulent")

Evaluating the order of the variable IS_FRAUDULENT


| is_fraudulent | is_fraudulent = No | is_fraudulent = Yes |
|---|---|---|
| No | 4932 | 0 |
| Yes | 0 | 5068 |


The variable is_fraudulent DOES have order
Evaluating the order of the variable CARD_TYPE


| card_type | is_fraudulent = No | is_fraudulent = Yes |
|---|---|---|
| American Express | 1262 | 1232 |
| Discover | 1304 | 1329 |
| MasterCard | 1140 | 1243 |
| Visa | 1226 | 1264 |


The variable card_type does NOT have order
Evaluating the order of the variable LOCATION


| location | is_fraudulent = No | is_fraudulent = Yes |
|---|---|---|
| City-1 | 80 | 108 |
| City-10 | 100 | 94 |
| City-11 | 92 | 107 |
| City-12 | 102 | 107 |
| City-13 | 99 | 110 |
| City-14 | 106 | 112 |
| City-15 | 96 | 88 |
| City-16 | 92 | 94 |
| City-17 | 109 | 96 |
| City-18 | 89 | 105 |


The variable location does NOT have order
Evaluating the order of the variable PURCHASE_CATEGORY


| purchase_category | is_fraudulent = No | is_fraudulent = Yes |
|---|---|---|
| Gas Station | 792 | 874 |
| Groceries | 796 | 896 |
| Online Shopping | 847 | 804 |
| Restaurant | 851 | 785 |
| Retail | 808 | 853 |
| Travel | 838 | 856 |


The variable purchase_category DOES have order


In [182]:
df_datos.columns

Index(['customer_id', 'merchant_id', 'amount', 'is_fraudulent', 'card_type',
       'location', 'purchase_category', 'customer_age'],
      dtype='object')

### Order  
  
**Variables WITH order:**
- is_fraudulent (response variable)
- purchase_category
  
**Variables WITHOUT order:**
- card_type
- location
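
The source of `f.detectar_orden_cat` is not shown in this notebook. One plausible reading, given the `chi2_contingency` import, is that a variable "has order" when a chi-square test of independence against the target is significant. The sketch below assumes that interpretation; the function name `is_associated_with_target` and the 0.05 threshold are hypothetical. It reuses the contingency counts printed in the output above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency counts copied from the printed crosstabs (columns: is_fraudulent No / Yes)
purchase_category = pd.DataFrame(
    {"No": [792, 796, 847, 851, 808, 838], "Yes": [874, 896, 804, 785, 853, 856]},
    index=["Gas Station", "Groceries", "Online Shopping", "Restaurant", "Retail", "Travel"],
)
card_type = pd.DataFrame(
    {"No": [1262, 1304, 1140, 1226], "Yes": [1232, 1329, 1243, 1264]},
    index=["American Express", "Discover", "MasterCard", "Visa"],
)


def is_associated_with_target(table, alpha=0.05):
    """Hypothetical helper: chi-square test of independence on a contingency table."""
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha, p_value, dof


for name, table in [("purchase_category", purchase_category), ("card_type", card_type)]:
    significant, p_value, dof = is_associated_with_target(table)
    print(f"{name}: significant={significant}, dof={dof}")
```

On these counts the test is significant at the 0.05 level for `purchase_category` and not for `card_type`, which agrees with the verdicts printed by the helper.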
 

**ONE-HOT ENCODER**

In [183]:
cols_onehot = ["card_type","location"]
one_hot_encoder = OneHotEncoder(categories='auto', 
                        drop=None, 
                        sparse_output=True, 
                        dtype='float', 
                        handle_unknown='error')

# Fit the encoder to the data and transform it
encoder_trans = one_hot_encoder.fit_transform(df_datos[cols_onehot])

# Convert the sparse matrix returned by fit_transform to a dense array so it can be inspected
encoder_array = encoder_trans.toarray()

# Use '.get_feature_names_out()' to get the names of the generated columns
nombre_columnas = one_hot_encoder.get_feature_names_out()

# Build a DataFrame from the transformed values
encoder_df = pd.DataFrame(encoder_array, columns = nombre_columnas)

# Concatenate the encoded columns with the original DataFrame
df_sinull_copy = pd.concat([df_datos, encoder_df], axis = 1)

# Drop the original categorical columns, now replaced by their one-hot columns
df_sinull_copy.drop(columns=cols_onehot, inplace=True)

In [184]:
# Use a handle name other than `f`, which is the alias of the support module
with open('../pickles_general/one_hot_encoder.pkl', 'wb') as pkl_file:
    pickle.dump(one_hot_encoder, pkl_file)

In [185]:
one_hot_encoder.categories_

[array(['American Express', 'Discover', 'MasterCard', 'Visa'], dtype=object),
 array(['City-1', 'City-10', 'City-11', 'City-12', 'City-13', 'City-14',
        'City-15', 'City-16', 'City-17', 'City-18', 'City-19', 'City-2',
        'City-20', 'City-21', 'City-22', 'City-23', 'City-24', 'City-25',
        'City-26', 'City-27', 'City-28', 'City-29', 'City-3', 'City-30',
        'City-31', 'City-32', 'City-33', 'City-34', 'City-35', 'City-36',
        'City-37', 'City-38', 'City-39', 'City-4', 'City-40', 'City-41',
        'City-42', 'City-43', 'City-44', 'City-45', 'City-46', 'City-47',
        'City-48', 'City-49', 'City-5', 'City-50', 'City-6', 'City-7',
        'City-8', 'City-9'], dtype=object)]

In [186]:
df_sinull_copy.head(2)

Unnamed: 0,customer_id,merchant_id,amount,is_fraudulent,purchase_category,customer_age,card_type_American Express,card_type_Discover,card_type_MasterCard,card_type_Visa,location_City-1,location_City-10,location_City-11,location_City-12,location_City-13,location_City-14,location_City-15,location_City-16,location_City-17,location_City-18,location_City-19,location_City-2,location_City-20,location_City-21,location_City-22,location_City-23,location_City-24,location_City-25,location_City-26,location_City-27,location_City-28,location_City-29,location_City-3,location_City-30,location_City-31,location_City-32,location_City-33,location_City-34,location_City-35,location_City-36,location_City-37,location_City-38,location_City-39,location_City-4,location_City-40,location_City-41,location_City-42,location_City-43,location_City-44,location_City-45,location_City-46,location_City-47,location_City-48,location_City-49,location_City-5,location_City-50,location_City-6,location_City-7,location_City-8,location_City-9
0,1082,2027,5758.59,No,Gas Station,43,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1015,2053,1901.56,Yes,Online Shopping,61,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [187]:
dicc2={"Yes":1,
      "No" : 0
      }

In [188]:
df_sinull_copy["is_fraudulent"]=df_sinull_copy["is_fraudulent"].map(dicc2)

**TARGET ENCODER**   
  
**If the classes were imbalanced, resampling already-encoded data could produce category values that do not exist in the original data, which would then have to be mapped back to their nearest original categories.**  
-  THE BEST SOLUTION would be, instead of SMOTETomek (which oversamples and undersamples in one step), to apply *SMOTENC first* (it differs from plain SMOTE only in that it accepts categorical features) to generate the synthetic rows while the categorical columns are still unencoded, *then do the ENCODING*, and *finally apply Tomek links* (which removes majority-class samples).  
  
- ANOTHER VERY GOOD OPTION is to *IMPUTE the new categories* that were generated: set them to null and then impute them.  

In [189]:
target_encoder = TargetEncoder(cols=["purchase_category"])
encoded = target_encoder.fit_transform(df_sinull_copy, df_sinull_copy[["is_fraudulent"]])


In [190]:
# Use a handle name other than `f`, which is the alias of the support module
with open('../pickles_general/target_encoder.pkl', 'wb') as pkl_file:
    pickle.dump(target_encoder, pkl_file)

In [191]:
encoded.shape

(10000, 60)

In [192]:
encoded.head()

Unnamed: 0,customer_id,merchant_id,amount,is_fraudulent,purchase_category,customer_age,card_type_American Express,card_type_Discover,card_type_MasterCard,card_type_Visa,location_City-1,location_City-10,location_City-11,location_City-12,location_City-13,location_City-14,location_City-15,location_City-16,location_City-17,location_City-18,location_City-19,location_City-2,location_City-20,location_City-21,location_City-22,location_City-23,location_City-24,location_City-25,location_City-26,location_City-27,location_City-28,location_City-29,location_City-3,location_City-30,location_City-31,location_City-32,location_City-33,location_City-34,location_City-35,location_City-36,location_City-37,location_City-38,location_City-39,location_City-4,location_City-40,location_City-41,location_City-42,location_City-43,location_City-44,location_City-45,location_City-46,location_City-47,location_City-48,location_City-49,location_City-5,location_City-50,location_City-6,location_City-7,location_City-8,location_City-9
0,1082,2027,5758.59,0,0.52,43,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1015,2053,1901.56,1,0.49,61,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1004,2035,1248.86,1,0.52,57,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1095,2037,7619.05,1,0.51,59,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,1036,2083,1890.1,1,0.51,36,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [193]:
encoded.columns

Index(['customer_id', 'merchant_id', 'amount', 'is_fraudulent',
       'purchase_category', 'customer_age', 'card_type_American Express',
       'card_type_Discover', 'card_type_MasterCard', 'card_type_Visa',
       'location_City-1', 'location_City-10', 'location_City-11',
       'location_City-12', 'location_City-13', 'location_City-14',
       'location_City-15', 'location_City-16', 'location_City-17',
       'location_City-18', 'location_City-19', 'location_City-2',
       'location_City-20', 'location_City-21', 'location_City-22',
       'location_City-23', 'location_City-24', 'location_City-25',
       'location_City-26', 'location_City-27', 'location_City-28',
       'location_City-29', 'location_City-3', 'location_City-30',
       'location_City-31', 'location_City-32', 'location_City-33',
       'location_City-34', 'location_City-35', 'location_City-36',
       'location_City-37', 'location_City-38', 'location_City-39',
       'location_City-4', 'location_City-40', 'location_C

In [194]:
num=encoded #.select_dtypes(include=np.number)


In [195]:
col_num=num.drop(columns="is_fraudulent").columns

In [196]:
col_num

Index(['customer_id', 'merchant_id', 'amount', 'purchase_category',
       'customer_age', 'card_type_American Express', 'card_type_Discover',
       'card_type_MasterCard', 'card_type_Visa', 'location_City-1',
       'location_City-10', 'location_City-11', 'location_City-12',
       'location_City-13', 'location_City-14', 'location_City-15',
       'location_City-16', 'location_City-17', 'location_City-18',
       'location_City-19', 'location_City-2', 'location_City-20',
       'location_City-21', 'location_City-22', 'location_City-23',
       'location_City-24', 'location_City-25', 'location_City-26',
       'location_City-27', 'location_City-28', 'location_City-29',
       'location_City-3', 'location_City-30', 'location_City-31',
       'location_City-32', 'location_City-33', 'location_City-34',
       'location_City-35', 'location_City-36', 'location_City-37',
       'location_City-38', 'location_City-39', 'location_City-4',
       'location_City-40', 'location_City-41', 'locatio

### SCALING (MIN-MAX)

In [197]:
numerical_columns = col_num

In [198]:
scaler = MinMaxScaler()
encoded[numerical_columns]= scaler.fit_transform(encoded[numerical_columns])
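
As a sanity check on what `MinMaxScaler` does, the sketch below applies the min-max formula, (x - min) / (max - min), by hand to the first row of the data, using the minima and maxima from the `describe()` output. The helper name `min_max` is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def min_max(x, x_min, x_max):
    """Rescale x to [0, 1] given the column minimum and maximum."""
    return (x - x_min) / (x_max - x_min)


# First row of the dataset, with column min/max taken from describe()
amount_scaled = min_max(5758.59, 10.61, 9999.75)
customer_id_scaled = min_max(1082, 1001, 1100)
age_scaled = min_max(43, 18, 70)
print(round(amount_scaled, 2), round(customer_id_scaled, 2), round(age_scaled, 2))

# The same computation through scikit-learn on a tiny column holding min, value, max
demo = MinMaxScaler().fit_transform(np.array([[10.61], [5758.59], [9999.75]]))
print(demo.ravel())
```

The hand-computed values round to 0.58, 0.82 and 0.48, which match the first row of the scaled frame.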

In [199]:
encoded

Unnamed: 0,customer_id,merchant_id,amount,is_fraudulent,purchase_category,customer_age,card_type_American Express,card_type_Discover,card_type_MasterCard,card_type_Visa,location_City-1,location_City-10,location_City-11,location_City-12,location_City-13,location_City-14,location_City-15,location_City-16,location_City-17,location_City-18,location_City-19,location_City-2,location_City-20,location_City-21,location_City-22,location_City-23,location_City-24,location_City-25,location_City-26,location_City-27,location_City-28,location_City-29,location_City-3,location_City-30,location_City-31,location_City-32,location_City-33,location_City-34,location_City-35,location_City-36,location_City-37,location_City-38,location_City-39,location_City-4,location_City-40,location_City-41,location_City-42,location_City-43,location_City-44,location_City-45,location_City-46,location_City-47,location_City-48,location_City-49,location_City-5,location_City-50,location_City-6,location_City-7,location_City-8,location_City-9
0,0.82,0.26,0.58,0,0.90,0.48,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1,0.14,0.53,0.19,1,0.14,0.83,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
2,0.03,0.34,0.12,1,0.90,0.75,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00
3,0.95,0.36,0.76,1,0.51,0.79,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00
4,0.35,0.83,0.19,1,0.68,0.35,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0.56,0.22,0.89,1,0.00,0.35,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
9996,0.53,0.25,0.00,0,0.68,0.46,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
9997,0.40,0.33,0.63,0,0.14,0.35,1.00,0.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
9998,0.08,0.18,0.28,1,0.68,0.75,0.00,0.00,0.00,1.00,0.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00


In [200]:


# Use a handle name other than `f`, which is the alias of the support module
with open('../pickles_general/scaler.pkl', 'wb') as pkl_file:
    pickle.dump(scaler, pkl_file)

In [201]:
encoded.head(2)

Unnamed: 0,customer_id,merchant_id,amount,is_fraudulent,purchase_category,customer_age,card_type_American Express,card_type_Discover,card_type_MasterCard,card_type_Visa,location_City-1,location_City-10,location_City-11,location_City-12,location_City-13,location_City-14,location_City-15,location_City-16,location_City-17,location_City-18,location_City-19,location_City-2,location_City-20,location_City-21,location_City-22,location_City-23,location_City-24,location_City-25,location_City-26,location_City-27,location_City-28,location_City-29,location_City-3,location_City-30,location_City-31,location_City-32,location_City-33,location_City-34,location_City-35,location_City-36,location_City-37,location_City-38,location_City-39,location_City-4,location_City-40,location_City-41,location_City-42,location_City-43,location_City-44,location_City-45,location_City-46,location_City-47,location_City-48,location_City-49,location_City-5,location_City-50,location_City-6,location_City-7,location_City-8,location_City-9
0,0.82,0.26,0.58,0,0.9,0.48,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.14,0.53,0.19,1,0.14,0.83,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [202]:
encoded.head()

Unnamed: 0,customer_id,merchant_id,amount,is_fraudulent,purchase_category,customer_age,card_type_American Express,card_type_Discover,card_type_MasterCard,card_type_Visa,location_City-1,location_City-10,location_City-11,location_City-12,location_City-13,location_City-14,location_City-15,location_City-16,location_City-17,location_City-18,location_City-19,location_City-2,location_City-20,location_City-21,location_City-22,location_City-23,location_City-24,location_City-25,location_City-26,location_City-27,location_City-28,location_City-29,location_City-3,location_City-30,location_City-31,location_City-32,location_City-33,location_City-34,location_City-35,location_City-36,location_City-37,location_City-38,location_City-39,location_City-4,location_City-40,location_City-41,location_City-42,location_City-43,location_City-44,location_City-45,location_City-46,location_City-47,location_City-48,location_City-49,location_City-5,location_City-50,location_City-6,location_City-7,location_City-8,location_City-9
0,0.82,0.26,0.58,0,0.9,0.48,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.14,0.53,0.19,1,0.14,0.83,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.03,0.34,0.12,1,0.9,0.75,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.95,0.36,0.76,1,0.51,0.79,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.35,0.83,0.19,1,0.68,0.35,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [203]:
col_nums=encoded.columns

### OUTLIERS  

  

I use an Isolation Forest (IFO) because I think it is the method that best fits this data.

In [204]:
# dicc={"Yes" : 1,
#       "No" : 0}

In [205]:
# encoded["is_fraudulent"]=encoded["is_fraudulent"].map(dicc)

In [206]:
contaminacion = [0.01, 0.05, 0.1]
estimadores = [100, 400, 1000, 2000]
combinaciones = list(product(contaminacion, estimadores))
for cont, esti in combinaciones:
    # n_estimators is the number of trees; n_jobs=-1 uses all available CPU cores
    ifo = IsolationForest(random_state=42, n_estimators=esti, contamination=cont, n_jobs=-1)

    # col_nums was captured before this loop, so the new outlier columns are not fed back in as features
    encoded[f"outliers_ifo_{cont}_{esti}"] = ifo.fit_predict(encoded[col_nums])
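As a small, self-contained illustration of what `fit_predict` returns, the sketch below fits an Isolation Forest on synthetic data. The array `X` and the injected extreme points are made up for this example and are not taken from the dataset. Inliers are labelled `1` and outliers `-1`, and `contamination` sets roughly what share of rows receives the `-1` label.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration): 195 points near the origin
# plus 5 points placed far away from them.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(195, 2)),
    rng.normal(loc=8.0, scale=0.5, size=(5, 2)),
])

flagged_share = {}
for cont in [0.01, 0.05, 0.1]:
    ifo = IsolationForest(n_estimators=100, contamination=cont, random_state=42)
    labels = ifo.fit_predict(X)  # 1 = inlier, -1 = outlier
    flagged_share[cont] = float((labels == -1).mean())

print(flagged_share)
```

Because the share of flagged rows follows `contamination` so closely, comparing several values (as the loop above does) is a reasonable way to see which rows are flagged consistently.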

In [207]:
encoded.head(3)

Unnamed: 0,customer_id,merchant_id,amount,is_fraudulent,purchase_category,customer_age,card_type_American Express,card_type_Discover,card_type_MasterCard,card_type_Visa,location_City-1,location_City-10,location_City-11,location_City-12,location_City-13,location_City-14,location_City-15,location_City-16,location_City-17,location_City-18,location_City-19,location_City-2,location_City-20,location_City-21,location_City-22,location_City-23,location_City-24,location_City-25,location_City-26,location_City-27,location_City-28,location_City-29,location_City-3,location_City-30,location_City-31,location_City-32,location_City-33,location_City-34,location_City-35,location_City-36,location_City-37,location_City-38,location_City-39,location_City-4,location_City-40,location_City-41,location_City-42,location_City-43,location_City-44,location_City-45,location_City-46,location_City-47,location_City-48,location_City-49,location_City-5,location_City-50,location_City-6,location_City-7,location_City-8,location_City-9,outliers_ifo_0.01_100,outliers_ifo_0.01_400,outliers_ifo_0.01_1000,outliers_ifo_0.01_2000,outliers_ifo_0.05_100,outliers_ifo_0.05_400,outliers_ifo_0.05_1000,outliers_ifo_0.05_2000,outliers_ifo_0.1_100,outliers_ifo_0.1_400,outliers_ifo_0.1_1000,outliers_ifo_0.1_2000
0,0.82,0.26,0.58,0,0.9,0.48,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,1,1
1,0.14,0.53,0.19,1,0.14,0.83,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,1,1
2,0.03,0.34,0.12,1,0.9,0.75,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,-1,1


In [208]:
lista_contaminaciones = [0.01, 0.05, 0.1]
lista_neighbors = [8,20, 50]

combinaciones = list(product(lista_contaminaciones, lista_neighbors))
combinaciones


[(0.01, 8),
 (0.01, 20),
 (0.01, 50),
 (0.05, 8),
 (0.05, 20),
 (0.05, 50),
 (0.1, 8),
 (0.1, 20),
 (0.1, 50)]

In [209]:


# for cont, neighbors in combinaciones:
    
#     lof = LocalOutlierFactor(n_neighbors=neighbors,
#                             contamination=cont,
#                             n_jobs=-1)

#     encoded[f"outliers_lof_{cont}_{neighbors}"] = lof.fit_predict(encoded[numerical_columns])
    
#     y_pred = lof.fit_predict(encoded[numerical_columns])
# encoded.head()





In [210]:
# #visualization
# columnas_hue = encoded.filter(like="outlier").columns

# combinaciones_viz = list(combinations(['customer_id', 'merchant_id', 'amount', 'is_fraudulent',
#        'purchase_category', 'customer_age'], 2))
# combinaciones_viz


# for outlier in columnas_hue:
#     fig, axes = plt.subplots(nrows=1, ncols=15, figsize = (45, 5))
#     axes = axes.flat

#     for indice, tupla in enumerate(combinaciones_viz):
#         sns.scatterplot(x = tupla[0],
#                         y = tupla[1],
#                         ax = axes[indice],
#                         data = encoded,
#                         hue=outlier,
#                         palette="Set1",
#                         style=outlier,
#                         alpha=0.5)
        
#     plt.suptitle(outlier)

### Map the values back to No and Yes so they are easier to interpret

In [211]:
# dicc2={1:"Yes",
#       0:"No"
#       }

In [212]:
# df_datos["is_fraudulent"]=df_datos["is_fraudulent"].map(dicc2)

In [213]:
encoded.head(3)

Unnamed: 0,customer_id,merchant_id,amount,is_fraudulent,purchase_category,customer_age,card_type_American Express,card_type_Discover,card_type_MasterCard,card_type_Visa,location_City-1,location_City-10,location_City-11,location_City-12,location_City-13,location_City-14,location_City-15,location_City-16,location_City-17,location_City-18,location_City-19,location_City-2,location_City-20,location_City-21,location_City-22,location_City-23,location_City-24,location_City-25,location_City-26,location_City-27,location_City-28,location_City-29,location_City-3,location_City-30,location_City-31,location_City-32,location_City-33,location_City-34,location_City-35,location_City-36,location_City-37,location_City-38,location_City-39,location_City-4,location_City-40,location_City-41,location_City-42,location_City-43,location_City-44,location_City-45,location_City-46,location_City-47,location_City-48,location_City-49,location_City-5,location_City-50,location_City-6,location_City-7,location_City-8,location_City-9,outliers_ifo_0.01_100,outliers_ifo_0.01_400,outliers_ifo_0.01_1000,outliers_ifo_0.01_2000,outliers_ifo_0.05_100,outliers_ifo_0.05_400,outliers_ifo_0.05_1000,outliers_ifo_0.05_2000,outliers_ifo_0.1_100,outliers_ifo_0.1_400,outliers_ifo_0.1_1000,outliers_ifo_0.1_2000
0,0.82,0.26,0.58,0,0.9,0.48,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,1,1
1,0.14,0.53,0.19,1,0.14,0.83,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,1,1
2,0.03,0.34,0.12,1,0.9,0.75,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,-1,1


In [214]:
encoded.shape

(10000, 72)

In [215]:
columnasdf=encoded.filter(like="outliers_ifo")
columnas_ifo=columnasdf.columns

In [216]:

filtered_df = encoded[(encoded[columnas_ifo] == -1).all(axis=1)]

In [217]:
# filtered_df

In [218]:
filtered_df.shape

(2, 72)

In [219]:
# sin_outliers = encoded[(encoded[columnas_ifo] != -1).all(axis=1)]

In [220]:
# sin_outliers.shape

<!-- After inspecting the rows flagged as -1 by every configuration and seeing that they are a tiny fraction of the data, we decide to remove them -->

In [221]:
df_result = encoded.drop(index = filtered_df.index)
df_result.reset_index(drop=True, inplace=True)

In [222]:
# df_result.shape

<!-- Now we select the rows flagged as outliers by at least 60% of the configurations -->

In [223]:
proporcion = 0.6 * len(columnas_ifo)
df_outliers_60 = df_result[df_result[columnas_ifo].eq(-1).sum(axis=1) >= proporcion]
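The two filters used in this section (flagged by every configuration, and flagged by at least 60% of them) can be sketched on a toy table. The column names and values below are hypothetical and chosen only to make the vote counts easy to follow.

```python
import pandas as pd

# Hypothetical flags from four detector configurations (1 = inlier, -1 = outlier).
flags = pd.DataFrame({
    "outliers_ifo_a": [1, -1, -1, -1],
    "outliers_ifo_b": [1, -1, -1, 1],
    "outliers_ifo_c": [1, -1, -1, 1],
    "outliers_ifo_d": [1, -1, 1, 1],
})

votes = flags.eq(-1).sum(axis=1)   # number of configurations flagging each row
threshold = 0.6 * flags.shape[1]   # 60% of 4 columns = 2.4 votes

unanimous = flags.index[flags.eq(-1).all(axis=1)].tolist()
majority_60 = flags.index[votes >= threshold].tolist()

print(votes.tolist(), unanimous, majority_60)
```

With the 12 configurations used in this notebook, the threshold is 0.6 * 12 = 7.2, so a row needs at least 8 flags to be selected.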


<!-- We notice that the outliers include very young customers with a very high amount -->

In [224]:
# df_outliers_60.head(6)

In [225]:
# df_outliers_60.shape

In [226]:
df_result.shape

(9998, 72)

<!-- Since they account for very little data, I decide to drop them -->

In [227]:
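# Note: df_outliers_60.index holds labels from df_result (renumbered after the first drop),
# so dropping them from `encoded` may not remove the same rows; df_result.drop(...) would keep the labels aligned.
df_result = encoded.drop(index = df_outliers_60.index)
df_result.reset_index(drop=True, inplace=True)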
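One detail worth checking in the cell above: the labels in `df_outliers_60.index` come from `df_result` after `reset_index`, while the drop is applied to `encoded`, which still has its original labels. The toy example below (hypothetical frames `original` and `first_pass`) shows how that combination can remove a different row than intended.

```python
import pandas as pd

original = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# First pass: drop the row labelled 1 and renumber the result.
first_pass = original.drop(index=[1]).reset_index(drop=True)

# In the renumbered frame, the row with value 40 now carries label 2.
to_drop = first_pass.index[first_pass["value"] == 40]

# Dropping those labels from the ORIGINAL frame removes value 30 instead...
misaligned = original.drop(index=to_drop)
# ...whereas dropping them from the frame they came from removes value 40.
aligned = first_pass.drop(index=to_drop)

print(misaligned["value"].tolist())
print(aligned["value"].tolist())
```

Dropping from the same frame that produced the index, or skipping the intermediate `reset_index`, avoids the mismatch.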

In [228]:
df_result.shape

(9878, 72)

In [229]:
df_result.head()

Unnamed: 0,customer_id,merchant_id,amount,is_fraudulent,purchase_category,customer_age,card_type_American Express,card_type_Discover,card_type_MasterCard,card_type_Visa,location_City-1,location_City-10,location_City-11,location_City-12,location_City-13,location_City-14,location_City-15,location_City-16,location_City-17,location_City-18,location_City-19,location_City-2,location_City-20,location_City-21,location_City-22,location_City-23,location_City-24,location_City-25,location_City-26,location_City-27,location_City-28,location_City-29,location_City-3,location_City-30,location_City-31,location_City-32,location_City-33,location_City-34,location_City-35,location_City-36,location_City-37,location_City-38,location_City-39,location_City-4,location_City-40,location_City-41,location_City-42,location_City-43,location_City-44,location_City-45,location_City-46,location_City-47,location_City-48,location_City-49,location_City-5,location_City-50,location_City-6,location_City-7,location_City-8,location_City-9,outliers_ifo_0.01_100,outliers_ifo_0.01_400,outliers_ifo_0.01_1000,outliers_ifo_0.01_2000,outliers_ifo_0.05_100,outliers_ifo_0.05_400,outliers_ifo_0.05_1000,outliers_ifo_0.05_2000,outliers_ifo_0.1_100,outliers_ifo_0.1_400,outliers_ifo_0.1_1000,outliers_ifo_0.1_2000
0,0.82,0.26,0.58,0,0.9,0.48,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,1,1
1,0.14,0.53,0.19,1,0.14,0.83,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,1,1
2,0.03,0.34,0.12,1,0.9,0.75,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,-1,1
3,0.95,0.36,0.76,1,0.51,0.79,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,1,1
4,0.35,0.83,0.19,1,0.68,0.35,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,1,1


In [230]:
encoded.head()

Unnamed: 0,customer_id,merchant_id,amount,is_fraudulent,purchase_category,customer_age,card_type_American Express,card_type_Discover,card_type_MasterCard,card_type_Visa,location_City-1,location_City-10,location_City-11,location_City-12,location_City-13,location_City-14,location_City-15,location_City-16,location_City-17,location_City-18,location_City-19,location_City-2,location_City-20,location_City-21,location_City-22,location_City-23,location_City-24,location_City-25,location_City-26,location_City-27,location_City-28,location_City-29,location_City-3,location_City-30,location_City-31,location_City-32,location_City-33,location_City-34,location_City-35,location_City-36,location_City-37,location_City-38,location_City-39,location_City-4,location_City-40,location_City-41,location_City-42,location_City-43,location_City-44,location_City-45,location_City-46,location_City-47,location_City-48,location_City-49,location_City-5,location_City-50,location_City-6,location_City-7,location_City-8,location_City-9,outliers_ifo_0.01_100,outliers_ifo_0.01_400,outliers_ifo_0.01_1000,outliers_ifo_0.01_2000,outliers_ifo_0.05_100,outliers_ifo_0.05_400,outliers_ifo_0.05_1000,outliers_ifo_0.05_2000,outliers_ifo_0.1_100,outliers_ifo_0.1_400,outliers_ifo_0.1_1000,outliers_ifo_0.1_2000
0,0.82,0.26,0.58,0,0.9,0.48,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,1,1
1,0.14,0.53,0.19,1,0.14,0.83,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,1,1
2,0.03,0.34,0.12,1,0.9,0.75,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,-1,1
3,0.95,0.36,0.76,1,0.51,0.79,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,1,1
4,0.35,0.83,0.19,1,0.68,0.35,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,1,1,1,1,1,1,1,1,1,1


In [231]:
encoded.drop(columns=encoded.filter(like="outliers_ifo").columns, inplace=True)


In [232]:
with open('../pickles_general/datos_preprocesados.pkl', 'wb') as f:
    pickle.dump(encoded, f)    

## **BY WHAT CRITERION SHOULD THE RESPONSE VARIABLE BE CONVERTED TO NUMERIC (BINARY OR WITH MORE CATEGORIES)?**  
  
- If there are more than 2 categories, I would try FREQUENCY encoding first so that none of them is underestimated, because TARGET encoding would handle them poorly  
  
- If the DEPENDENT variable is BINARY     
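A minimal sketch of both cases, assuming a hypothetical column `category` with three levels and a binary `is_fraudulent` column stored as "Yes"/"No" (the same mapping that appears commented out earlier in the notebook). Frequency encoding replaces each level with its relative frequency; for the binary case a fixed 0/1 mapping is enough.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "category": ["A", "B", "A", "C", "A", "B"],
    "is_fraudulent": ["No", "Yes", "No", "No", "Yes", "No"],
})

# Frequency encoding: replace each category with its relative frequency.
freq = df["category"].value_counts(normalize=True)
df["category_freq"] = df["category"].map(freq)

# Binary variable: map the two labels to 0 and 1.
df["is_fraudulent_num"] = df["is_fraudulent"].map({"Yes": 1, "No": 0})

print(df)
```

Note that two categories with the same frequency receive the same encoded value, which is one known limitation of this encoding.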

## **DATA IMBALANCE**  
### ***(Imbalances of around 65-35 need to be handled)***  
### ***(The usual approach is to fit a first model, look at its metrics, and only then handle the imbalance to see how the results change)***  
- The DEPENDENT variable is resampled so that it is no longer imbalanced (imbalanced means that one category has many more rows than the other)

**How to handle it?**  
- PANDAS:  
    - up-sampling (rows of the minority category are duplicated until it is the same size as the majority)   
    - down-sampling (`sample`, with the size of the minority, is used to draw random rows from the majority category)  
  
- IMBLEARN:
    - up-sampling (duplicates rows of the minority category until it is the same size as the majority)   
    - down-sampling (draws a random sample of the majority category with the size of the minority)  
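The pandas variant of both strategies can be sketched as follows. The data frame is hypothetical (8 rows of class 0, 2 rows of class 1) and `random_state=42` is only there to make the draw reproducible.

```python
import pandas as pd

# Hypothetical imbalanced data: 8 rows of class 0 and 2 rows of class 1.
df = pd.DataFrame({
    "amount": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "is_fraudulent": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority = df[df["is_fraudulent"] == 0]
minority = df[df["is_fraudulent"] == 1]

# Up-sampling: draw minority rows with replacement until both classes match.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
df_up = pd.concat([majority, minority_up]).reset_index(drop=True)

# Down-sampling: draw as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=42)
df_down = pd.concat([majority_down, minority]).reset_index(drop=True)

print(df_up["is_fraudulent"].value_counts().to_dict())
print(df_down["is_fraudulent"].value_counts().to_dict())
```

Resampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.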
***These usually work better:***

- SMOTE:
    - **up-sampling**:  
    Generates new synthetic "children" in the minority category: for each minority row it looks at its most similar minority neighbours and creates new points between them, increasing the number of minority rows.
  
- TOMEKLINK:
    - **down-sampling**:  
    Finds pairs of rows from different categories that are each other's nearest neighbour (Tomek links) and removes the majority row of each pair, reducing the number of majority rows.
  
- SMOTETOMEKLINK:
    - **up-sampling and down-sampling**:  
    First applies SMOTE to grow the minority category, then removes the Tomek links to clean up the rows where the two categories overlap.
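To make these two ideas concrete, here is a rough NumPy sketch of the interpolation step behind SMOTE and of Tomek-link detection. The helpers `smote_like_sample` and `tomek_links` and the toy points are hypothetical, written only for illustration. In practice the `imbalanced-learn` classes `SMOTE`, `TomekLinks` and `SMOTETomek` would be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D points: class 0 is the majority, class 1 the minority.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0],
              [2.1, 0.0], [5.0, 5.0], [5.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])


def smote_like_sample(X_min, k, rng):
    """Create one synthetic point on the segment between a random minority
    sample and one of its k nearest minority neighbours."""
    i = rng.integers(len(X_min))
    dist = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbours = np.argsort(dist)[1:k + 1]  # skip the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                      # position along the segment
    return X_min[i] + gap * (X_min[j] - X_min[i])


def tomek_links(X, y):
    """Return pairs (a, b) of opposite classes that are each other's
    nearest neighbour."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    return [(a, int(b)) for a, b in enumerate(nearest)
            if nearest[b] == a and y[a] != y[b] and a < b]


X_min = X[y == 1]
synthetic = smote_like_sample(X_min, k=2, rng=rng)

links = tomek_links(X, y)
majority_in_links = [i for pair in links for i in pair if y[i] == 0]
keep = np.setdiff1d(np.arange(len(X)), majority_in_links)
X_clean, y_clean = X[keep], y[keep]

print(synthetic, links, y_clean)
```

Here the only Tomek link is the pair formed by the majority point `[2.0, 0.0]` and the minority point `[2.1, 0.0]`, so the down-sampling step drops that single majority row.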