# Practical Assignment 1: Custom metrics for reducing false positives in binary fraud classification

## Feature Engineering

Since the objective to optimize is `Detect fraud in regions with high historical incidence`, we need to find new features that can help improve the model.

In [2]:
import pandas as pd
import numpy as np

In [3]:
dataset = pd.read_csv("data/dataset_feature_engineering.csv")
dataset.head()

Unnamed: 0,cc_num,merchant,category,amt,first,last,gender,street,city,state,...,first_time_at_merchant,dist_between_client_and_merch,trans_month,trans_day,hour,year,times_shopped_at_merchant,times_shopped_at_merchant_year,times_shopped_at_merchant_month,times_shopped_at_merchant_day
0,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,NC,...,True,78.773821,1,1,0,2019,5,4,2,1
1,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,WA,...,True,30.216618,1,1,0,2019,4,4,1,1
2,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,ID,...,True,108.102912,1,1,0,2019,4,3,1,1
3,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,MT,...,True,95.685115,1,1,0,2019,1,1,1,1
4,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,VA,...,True,77.702395,1,1,0,2019,6,1,1,1


### Regional

Features based on the `zip`, `state`, `city`, and `lat/long` columns.

In [4]:
fraud_rate_by_zip = dataset.groupby("zip")["is_fraud"].mean()  # Fraud rate per zip code
dataset["fraud_rate_by_zip"] = dataset["zip"].map(fraud_rate_by_zip)

In [5]:
fraud_rate_by_city = dataset.groupby("city")["is_fraud"].mean()  # Fraud rate per city
dataset["fraud_rate_by_city"] = dataset["city"].map(fraud_rate_by_city)

In [6]:
fraud_rate_by_state = dataset.groupby("state")["is_fraud"].mean()  # Fraud rate per state
dataset["fraud_rate_by_state"] = dataset["state"].map(fraud_rate_by_state)

In [7]:
fraud_count_by_city = dataset.groupby("city")["is_fraud"].sum()  # Total historical number of frauds per city
dataset["city_fraud_count"] = dataset["city"].map(fraud_count_by_city)

In [8]:
dataset.head()

Unnamed: 0,cc_num,merchant,category,amt,first,last,gender,street,city,state,...,hour,year,times_shopped_at_merchant,times_shopped_at_merchant_year,times_shopped_at_merchant_month,times_shopped_at_merchant_day,fraud_rate_by_zip,fraud_rate_by_city,fraud_rate_by_state,city_fraud_count
0,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,NC,...,0,2019,5,4,2,1,0.003758,0.003758,0.004521,11
1,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,WA,...,0,2019,4,4,1,1,0.001605,0.00216,0.00466,11
2,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,ID,...,0,2019,4,3,1,1,0.010884,0.010884,0.004107,8
3,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,MT,...,0,2019,1,1,1,1,0.020188,0.020188,0.004106,15
4,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,VA,...,0,2019,6,1,1,1,0.004449,0.004449,0.006538,13
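The regional rates above average `is_fraud` over every row in a group, so each row's own label is part of its feature value. For small groups this can leak the target into the model. One common mitigation is a leave-one-out rate that excludes the current row. The following is a minimal sketch on made-up toy data; the helper name `leave_one_out_rate` and the output column name are hypothetical, not part of the notebook.

```python
import pandas as pd


def leave_one_out_rate(df: pd.DataFrame, key: str, target: str = "is_fraud") -> pd.Series:
    """Mean of `target` within each `key` group, excluding the current row."""
    grouped = df.groupby(key)[target]
    group_sum = grouped.transform("sum")
    group_count = grouped.transform("count")
    # Remove the row's own label from the numerator and the denominator.
    # Singleton groups get a NaN denominator because no other rows exist.
    denominator = (group_count - 1).where(group_count > 1)
    loo = (group_sum - df[target]) / denominator
    # Assumption of this sketch: fall back to the global rate for singletons.
    return loo.fillna(df[target].mean())


# Toy data (hypothetical values, not taken from the assignment's dataset).
toy = pd.DataFrame(
    {
        "zip": [10001, 10001, 10001, 20002, 20002, 30003],
        "is_fraud": [1, 0, 0, 1, 1, 0],
    }
)
toy["fraud_rate_by_zip_loo"] = leave_one_out_rate(toy, "zip")
print(toy)
```

With a train/test split, the group statistics would normally be computed on the training rows only and then mapped onto the test rows.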


### Temporal and Regional

Features that combine the location columns (`zip`, `state`, `city`, `lat/long`) with time columns such as `trans_month`, `trans_day`, `hour`, and `year`.

In [None]:
monthly_zip_fraud = dataset.groupby(["zip", "trans_month"])["is_fraud"].mean()  # Fraud rate per zip code and month
dataset["monthly_zip_fraud_rate"] = dataset.set_index(["zip", "trans_month"]).index.map(monthly_zip_fraud)
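For broadcasting a group statistic back onto each row, `groupby(...).transform` is a common alternative to building a lookup Series and mapping it through a temporary index. A minimal sketch on toy data (the values are made up for illustration and do not come from the assignment's dataset):

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame(
    {
        "zip": [10001, 10001, 10001, 10001, 20002, 20002],
        "trans_month": [1, 1, 2, 2, 1, 1],
        "is_fraud": [1, 0, 0, 0, 1, 1],
    }
)

# transform("mean") returns one value per original row, aligned on the row index,
# so no intermediate lookup table or set_index call is needed.
toy["monthly_zip_fraud_rate"] = (
    toy.groupby(["zip", "trans_month"])["is_fraud"].transform("mean")
)
print(toy)
```

Because the result is aligned on the original index, the assignment does not depend on row order.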

In [10]:
dataset.sort_values(by=["zip", "unix_time"], inplace=True)
# Rolling fraud rate per zip code over its last 100 transactions (current row included), ordered by time
dataset["rolling_fraud_rate_by_zip"] = (
    dataset.groupby("zip")["is_fraud"].transform(lambda x: x.rolling(window=100, min_periods=1).mean())
)
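Because the rolling window above ends at the current transaction, each row's own `is_fraud` label contributes to its feature value. If the feature is meant to describe only past activity, one option is to shift the labels by one position within each zip before rolling. A minimal sketch on toy data; the column name `past_fraud_rate_by_zip` and all values are hypothetical:

```python
import pandas as pd

WINDOW = 100  # same window length as the notebook cell

# Toy data (hypothetical values).
toy = pd.DataFrame(
    {
        "zip": [1, 1, 1, 1, 2, 2],
        "unix_time": [10, 20, 30, 40, 5, 15],
        "is_fraud": [1, 0, 1, 0, 0, 1],
    }
)
toy = toy.sort_values(["zip", "unix_time"])

# shift(1) drops the current label from the window, so each row only
# sees transactions that happened earlier in the same zip code.
toy["past_fraud_rate_by_zip"] = toy.groupby("zip")["is_fraud"].transform(
    lambda s: s.shift(1).rolling(window=WINDOW, min_periods=1).mean()
)
print(toy)
```

The first transaction of each zip code has no history, so it comes out as `NaN` and needs a fill strategy (for example, a global prior).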

In [None]:
dataset.head()

### Geolocation

In [1]:
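# Transaction density per region, on a 50 x 50 grid of equal-width lat/long bins.
# This cell needs `pd` and `dataset` from In [2] and In [3]; run on a fresh kernel it raises the NameError shown below.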
dataset["lat_bin"] = pd.cut(dataset["lat"], bins=50)
dataset["long_bin"] = pd.cut(dataset["long"], bins=50)
region_density = dataset.groupby(["lat_bin", "long_bin"])["trans_num"].count()
dataset["region_density"] = dataset.set_index(["lat_bin", "long_bin"]).index.map(region_density)

NameError: name 'pd' is not defined

In [None]:
dataset.head()
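To make the region density feature above concrete, the following sketch applies the same binning idea to a handful of made-up coordinates, using two bins per axis instead of 50. It uses `transform("count")` with `observed=True`, which is a choice of this sketch rather than the notebook's code, so that only bin combinations that actually occur are grouped.

```python
import pandas as pd

# Toy coordinates (hypothetical values): three points in one area, two in another.
toy = pd.DataFrame(
    {
        "lat": [10.0, 10.1, 10.2, 50.0, 50.1],
        "long": [-70.0, -70.1, -70.2, -120.0, -120.1],
        "trans_num": ["a", "b", "c", "d", "e"],
    }
)

# Equal-width bins over the observed range of each coordinate.
toy["lat_bin"] = pd.cut(toy["lat"], bins=2)
toy["long_bin"] = pd.cut(toy["long"], bins=2)

# Number of transactions that fall in the same (lat_bin, long_bin) cell as each row.
toy["region_density"] = toy.groupby(["lat_bin", "long_bin"], observed=True)[
    "trans_num"
].transform("count")
print(toy[["lat", "long", "region_density"]])
```

Since `pd.cut` derives the bin edges from the data range, the grid changes if the dataset changes; fixed edges would be needed to reuse the same cells on new data.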