# Introduction

**Objective**: Predict the probability that an online transaction is fraudulent.

According to [this discussion](https://www.kaggle.com/c/ieee-fraud-detection/discussion/111284), it is better to predict fraudulent clients (their credit cards) rather than individual transactions.

We have two datasets: one for the transactions and one for the clients' identity information.

Note: "Not all transactions have corresponding identity information."

Categorical Features - Transaction

* ProductCD
* emaildomain
* card1 - card6
* addr1, addr2
* P_emaildomain
* R_emaildomain
* M1 - M9

Categorical Features - Identity

* DeviceType
* DeviceInfo
* id_12 - id_38
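
The ranges in the two lists above can be expanded into explicit column names, which is convenient when casting them to a categorical dtype. This is a minimal sketch on a toy frame, assuming the column names follow the listed patterns (`card1` to `card6`, `M1` to `M9`, `id_12` to `id_38`) and that the bare `emaildomain` entry is covered by `P_emaildomain` and `R_emaildomain`. The names `trans_cat_cols`, `id_cat_cols` and `cast_categoricals` are hypothetical.

```python
import pandas as pd

# Expand the ranges listed above into explicit column names.
trans_cat_cols = (
    ["ProductCD"]
    + [f"card{i}" for i in range(1, 7)]
    + ["addr1", "addr2", "P_emaildomain", "R_emaildomain"]
    + [f"M{i}" for i in range(1, 10)]
)
id_cat_cols = ["DeviceType", "DeviceInfo"] + [f"id_{i}" for i in range(12, 39)]


def cast_categoricals(df, cat_cols):
    """Return a copy of df with the listed columns (when present) cast to 'category'."""
    present = [c for c in cat_cols if c in df.columns]
    return df.astype({c: "category" for c in present})


# Toy frame standing in for the real transaction table (values are made up).
toy = pd.DataFrame({
    "ProductCD": ["W", "C", "W"],
    "card1": [1000, 2000, 1000],
    "TransactionAmt": [10.0, 25.5, 7.0],
})
toy_cast = cast_categoricals(toy, trans_cat_cols)
print(toy_cast.dtypes)
```

Columns that are not in the lists, such as `TransactionAmt` here, keep their original dtype.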

"The TransactionDT feature is a timedelta from a given reference datetime (not an actual timestamp)."
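
Because `TransactionDT` is an offset rather than a timestamp, deriving calendar-like features requires choosing a reference point. This is a minimal sketch on toy values, assuming the offset is measured in seconds and using a purely hypothetical reference date (`REFERENCE_DATE`). Neither assumption is confirmed by the quoted description.

```python
import pandas as pd

# Hypothetical reference datetime; the real one is not disclosed.
REFERENCE_DATE = pd.Timestamp("2017-12-01")

# Toy offsets standing in for the real TransactionDT column.
toy = pd.DataFrame({"TransactionDT": [86400, 90000, 172800]})

# Assumption: TransactionDT is expressed in seconds.
toy["TransactionDelta"] = pd.to_timedelta(toy["TransactionDT"], unit="s")
toy["PseudoDatetime"] = REFERENCE_DATE + toy["TransactionDelta"]

# Features that do not depend on the reference date (day index),
# and one that does only through the assumed time of day (hour).
toy["DayIndex"] = toy["TransactionDelta"].dt.days
toy["Hour"] = toy["PseudoDatetime"].dt.hour
print(toy)
```

Relative features such as `DayIndex` are unaffected by the choice of reference date, so they are the safer ones to rely on.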

For this exploration, we use [this](https://www.kaggle.com/code/robikscube/ieee-fraud-detection-first-look-and-eda/notebook#Train-vs-Test-are-Time-Series-Split) and [this](https://www.kaggle.com/code/kabure/extensive-eda-and-modeling-xgb-hyperopt/notebook#Target-Distribution) notebook as references.

In [1]:
!pip install gdown

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!gdown https://drive.google.com/uc?id=1H0OEYX33Qj-ggnlic0ILN3tcwIRbGATt

Downloading...
From: https://drive.google.com/uc?id=1H0OEYX33Qj-ggnlic0ILN3tcwIRbGATt
To: /content/ieee-fraud-detection.zip
100% 124M/124M [00:01<00:00, 98.8MB/s]


In [3]:
# -p: do not fail if the directory already exists; -n: never overwrite extracted files
!mkdir -p data
!unzip -n ieee-fraud-detection.zip -d data


In [4]:
# -f: do not fail if the directory is already gone
!rm -rf sample_data/


In [5]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import gc

In [6]:
## Function to reduce the DF size
def reduce_mem_usage(df, verbose=True):
    """
    Downcast each numeric column to the smallest dtype that holds its value range.

    From https://www.kaggle.com/code/kabure/extensive-eda-and-modeling-xgb-hyperopt/notebook
    """
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            # min()/max() skip NaN, so missing values do not affect the chosen dtype
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Caution: float16 keeps only about 3 significant decimal digits,
                # so values are rounded even when they fit in its range.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
  
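
The range check in `reduce_mem_usage` only guarantees that values fit in the target dtype, not that they are preserved exactly. The sketch below, on made-up values, shows the rounding introduced by `float16` and compares it with `pd.to_numeric(..., downcast="float")`, which does not go below `float32`. It is an illustration of the trade-off, not a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Made-up values that all fit comfortably in the float16 range.
amounts = pd.Series([2049.0, 59.95, 0.1], dtype="float64")

# Range-based downcast, as done by reduce_mem_usage for float columns.
as_f16 = amounts.astype(np.float16)

# pandas' built-in float downcast stops at float32.
as_f32 = pd.to_numeric(amounts, downcast="float")

# Absolute rounding error introduced by each representation.
err_f16 = (as_f16.astype("float64") - amounts).abs()
err_f32 = (as_f32.astype("float64") - amounts).abs()

print(as_f16.dtype, err_f16.tolist())
print(as_f32.dtype, err_f32.tolist())
```

If exact amounts matter for later feature engineering, keeping such columns at `float32` is a reasonable compromise between memory and precision.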

In [7]:
def resumetable(df):
    """
    Summarize each column: dtype, % of missing values, unique count and first three values.

    Adapted from:
    https://www.kaggle.com/code/kabure/extensive-eda-and-modeling-xgb-hyperopt/notebook#Competition-Objective-is-to-detect-fraud-in-transactions
    """
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Nombre'] = summary['index']
    summary = summary[['Nombre', 'dtypes']]
    summary['% NaN'] = df.isnull().sum().values / len(df) * 100
    summary['Unique'] = df.nunique().values
    # Positional access, so the summary does not depend on the index labels
    summary['Primer Valor'] = df.iloc[0].values
    summary['Segundo Valor'] = df.iloc[1].values
    summary['Tercer Valor'] = df.iloc[2].values

    return summary

# Data import

In [None]:
df_trans = pd.read_csv("data/train_transaction.csv")
df_trans = reduce_mem_usage(df_trans)
df_trans.head()

In [None]:
df_id = pd.read_csv("data/train_identity.csv")
df_id = reduce_mem_usage(df_id)
df_id.head()

In [None]:
#test_transaction = pd.read_csv('data/test_transaction.csv')
#test_transaction = reduce_mem_usage(test_transaction)
#test_transaction.head()

In [None]:
#test_id = pd.read_csv('data/test_identity.csv')
#test_id = reduce_mem_usage(test_id)
#test_id.head()

# Exploration

## Transaction data

In [None]:
# Match between the transaction table and the identity table.
x = df_trans['TransactionID'].isin(df_id['TransactionID'].unique()).sum()
print(f"# transactions with identity information: {x}")

Only 144233 transactions have at least one associated record in the identity table. "Not all transactions have corresponding identity information."
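
The same coverage check can be done with a left merge, which also keeps the transactions that have no identity record. This is a minimal sketch on toy frames (`trans` and `ident` are made-up stand-ins for the two tables), using the `indicator` option of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables.
trans = pd.DataFrame({"TransactionID": [1, 2, 3, 4], "isFraud": [0, 0, 1, 0]})
ident = pd.DataFrame({"TransactionID": [2, 3], "DeviceType": ["mobile", "desktop"]})

# how="left" keeps every transaction; indicator=True adds a `_merge` column
# that says whether a matching identity row was found.
merged = trans.merge(ident, on="TransactionID", how="left", indicator=True)
has_identity = merged["_merge"].eq("both")

print(f"{has_identity.sum()} of {len(merged)} transactions have identity information")
print(merged)
```

Transactions without a match get `NaN` in the identity columns, so the missing identity information itself can later be used as a feature.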

In [None]:
trans_resume = resumetable(df_trans)

In [None]:
trans_resume.head()

In total, we have 590540 samples and 394 columns.

card1 - card6: payment card information, such as card type, card category,  issue bank, country, etc.

In [None]:
trans_resume.loc[(trans_resume.Nombre.str.contains("card"))]

dist: distance

In [None]:
trans_resume.loc[(trans_resume.Nombre.str.contains("dist"))]



"C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked."

In [None]:
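# Anchor the pattern so that only C1-C14 match (a bare "C" also matches ProductCD)
trans_resume.loc[(trans_resume.Nombre.str.contains(r"^C\d+$", regex = True))]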
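
Substring matching on column names is easy to get subtly wrong, because a short pattern can also match unrelated columns. The sketch below compares a loose and an anchored pattern on a small, hand-picked list of column names; the variables `loose` and `strict` are illustrative only.

```python
import pandas as pd

# A few column names from the transaction table.
names = pd.Series(["TransactionID", "ProductCD", "C1", "C14", "D1", "card1"])

# Loose: any name containing an uppercase "C".
loose = names[names.str.contains("C")].tolist()

# Strict: the whole name must be "C" followed by digits.
strict = names[names.str.contains(r"^C\d+$", regex=True)].tolist()

print("loose: ", loose)
print("strict:", strict)
```

The loose pattern also picks up `ProductCD`, while the anchored one returns only the counting columns.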

D1-D15: timedelta, such as days between previous transaction, etc.


In [None]:
trans_resume.loc[(trans_resume.Nombre.str.contains(r"D\d", regex = True))]

M1-M9: match, such as names on card and address, etc.

In [None]:
trans_resume.loc[(trans_resume.Nombre.str.contains(r"M\d", regex = True))]

Vxxx: Vesta engineered rich features, including ranking, counting, and other  entity relations.

In [None]:
trans_resume.loc[(trans_resume.Nombre.str.contains(r"V\d", regex = True))]

In [None]:
trans_resume.loc[trans_resume.Nombre.str.contains(r"V\d", regex = True), "dtypes"].value_counts()

In [None]:
color_pal = [x['color'] for x in plt.rcParams['axes.prop_cycle']]

In [None]:
v_cols = [c for c in df_trans if c[0] == 'V']
df_trans['v_mean'] = df_trans[v_cols].mean(axis=1)


In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(15, 6))
df_trans.loc[df_trans['isFraud'] == 1]['v_mean'] \
    .apply(np.log) \
    .plot(kind='hist',
          bins=100,
          title='log transformed mean of V columns - Fraud',
          ax=ax1)
df_trans.loc[df_trans['isFraud'] == 0]['v_mean'] \
    .apply(np.log) \
    .plot(kind='hist',
          bins=100,
          title='log transformed mean of V columns - Not Fraud',
          color=color_pal[5],
          ax=ax2)
plt.show()
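
`np.log` returns `-inf` for zeros and `NaN` for negative or missing inputs, and non-finite values can break histogram binning. The sketch below filters to strictly positive values before taking the logarithm. It uses made-up numbers; whether the real `v_mean` column contains zeros or missing values is not established here.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for df_trans['v_mean'].
v_mean = pd.Series([0.0, 0.5, 1.0, np.nan, 20.0])

# Keep only strictly positive values; the comparison is False for NaN,
# so missing values are dropped as well.
positive = v_mean[v_mean > 0]
log_v_mean = np.log(positive)

dropped = len(v_mean) - len(positive)
print(f"dropped {dropped} non-positive or missing values")
print(log_v_mean)
```

Reporting how many rows were dropped keeps the histogram honest about what it does not show.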

In [None]:
print('{:.2f}% of the transactions in the transaction table are fraud.'.format(df_trans['isFraud'].mean() * 100))
sns.countplot(x=df_trans["isFraud"])

We observe a marked imbalance in the target variable.
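
One common way to account for such an imbalance at modeling time is to weight the minority class by the ratio of negatives to positives. This is a minimal sketch on a synthetic target (96 legitimate, 4 fraudulent), not the approach used in this notebook. The names `pos_weight` and `sample_weight` are illustrative; the ratio is the kind of value that, for example, XGBoost accepts as `scale_pos_weight`.

```python
import numpy as np
import pandas as pd

# Synthetic target: 96 legitimate (0) and 4 fraudulent (1) transactions.
is_fraud = pd.Series([0] * 96 + [1] * 4, name="isFraud")

counts = is_fraud.value_counts()
n_neg, n_pos = counts[0], counts[1]

# Ratio of negatives to positives.
pos_weight = n_neg / n_pos

# Per-row weights: each fraud row counts pos_weight times, the rest once,
# so both classes carry the same total weight.
sample_weight = np.where(is_fraud == 1, pos_weight, 1.0)

print(f"fraud rate: {is_fraud.mean():.2%}, positive class weight: {pos_weight:.1f}")
```

With these weights the two classes contribute equally to a weighted loss, which is a starting point rather than a tuned setting.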

## Identity data

In [None]:
# Add the `isFraud` column for analysis
df_id_ = df_id.merge(df_trans[['TransactionID',
                                'TransactionDT',
                                'isFraud']],
                                on=['TransactionID'])


In [None]:
# Select `isFraud` before aggregating so that non-numeric columns are not averaged
df_id_.groupby('DeviceType')['isFraud'] \
    .mean() \
    .sort_values() \
    .plot(kind='barh',
          figsize=(15, 5),
          title='Fraud Rate by Device Type')
plt.show()
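
A fraud rate per group is easier to interpret next to the group size, since a high rate on a handful of rows says little. The sketch below computes both with a named aggregation on a toy frame; the values are made up and the column names `fraud_rate` and `n` are illustrative.

```python
import pandas as pd

# Toy stand-in for the merged identity table.
toy = pd.DataFrame({
    "DeviceType": ["mobile", "mobile", "desktop", "desktop", "desktop", None],
    "isFraud": [1, 0, 0, 0, 1, 0],
})

# Rows with a missing DeviceType are excluded by groupby by default.
summary = (
    toy.groupby("DeviceType")["isFraud"]
    .agg(fraud_rate="mean", n="count")
    .sort_values("fraud_rate")
)
print(summary)
```

The same pattern applies to high-cardinality columns such as `DeviceInfo`, where filtering on `n` before plotting avoids reading too much into rare devices.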

In [None]:
# Count only the TransactionID column instead of every column
df_id_.groupby('DeviceInfo')['TransactionID'] \
    .count() \
    .sort_values(ascending=False) \
    .head(20) \
    .plot(kind='barh', figsize=(15, 5), title='Top 20 Devices in Train')
plt.show()

In [None]:
id_cols = [c for c in df_id.columns if 'id' in c]
# Index by TransactionDT once instead of on every iteration
df_id_by_dt = df_id_.set_index('TransactionDT')
for i in id_cols:
    try:
        df_id_by_dt[i].plot(style='.', title=i, figsize=(15, 3))
        plt.show()
    except TypeError:
        # Non-numeric columns cannot be plotted this way; skip them
        pass

# Supporting discussions

https://www.kaggle.com/code/artgor/eda-and-models/notebook

https://www.kaggle.com/code/kabure/extensive-eda-and-modeling-xgb-hyperopt/notebook

https://www.kaggle.com/code/robikscube/ieee-fraud-detection-first-look-and-eda/notebook

https://www.kaggle.com/code/shahules/tackling-class-imbalance/notebook

https://www.kaggle.com/code/alijs1/ieee-transaction-columns-reference/notebook




