## Data Prep
This notebook does little analysis. It only runs some basic data prep and saves the result in pickle format, so that I don't have to re-process the data every time I want to do analysis or model building.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # plotting functions live in the pyplot submodule
import seaborn as sns
sns.set_style('darkgrid')
import datetime
import missingno as msno
import lightgbm as lgb
import xgboost as xgb
from sklearn import preprocessing
import gc
from sklearn.model_selection import KFold, TimeSeriesSplit
from sklearn.metrics import roc_auc_score
from time import time

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

### The Data
In this competition we are predicting the probability that an online transaction is fraudulent, as denoted by the binary target isFraud.

The data is split into two files, identity and transaction, which are joined on TransactionID.

Note: Not all transactions have corresponding identity information.

Categorical Features - Transaction

- ProductCD: product code, the product for each transaction
- card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
- addr1, addr2
- P_emaildomain: purchaser email domain
- R_emaildomain: recipient email domain
- M1 - M9: match, such as names on card and address, etc.

Categorical Features - Identity
Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions. 
They're collected by Vesta’s fraud protection system and digital security partners.
- DeviceType: 
- DeviceInfo
- id_12 - id_38

Numerical features
- TransactionAmt: transaction payment amount in USD. Non-US transactions appear to have had an exchange rate applied, so their amounts are not round and carry extra decimal places. These transactions may already be marked in ProductCD as C.
- TransactionDT: timedelta from a given reference datetime (not an actual timestamp). The first value is 86400, which is the number of seconds in a day (60 * 60 * 24 = 86400), so I think the unit is seconds. On that assumption the data spans about 6 months, since the maximum value is 15811131, which is roughly 183 days (15811131 / 86400 ≈ 183). It might be good to split the train/validation sets by time, since train and test are split by time.
- dist: distances between (but not limited to) billing address, mailing address, zip code, IP address, phone area, etc.
- C1-C14: 
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.
- D1-D15: timedelta, such as days between previous transaction, etc.

The TransactionDT feature is a timedelta from a given reference datetime (not an actual timestamp, but we can potentially still use this to build time dependent features).
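The reasoning about TransactionDT above can be sketched on a toy frame. This is only an illustration under the assumption that the unit is seconds; the column names `DT_day` and `DT_hour`, the toy values, and the 80/20 cutoff are hypothetical choices, not part of the dataset or of this notebook's pipeline.

```python
import pandas as pd

SECONDS_PER_DAY = 60 * 60 * 24  # 86400, assuming TransactionDT is in seconds

# Toy stand-in for train_full; only TransactionDT is needed here.
toy = pd.DataFrame({"TransactionDT": [86400, 90000, 172800, 1000000, 15811131]})

# Whole days and hour-of-day elapsed since the (unknown) reference datetime.
toy["DT_day"] = toy["TransactionDT"] // SECONDS_PER_DAY
toy["DT_hour"] = (toy["TransactionDT"] // 3600) % 24

# Time-ordered holdout: the latest 20% of rows become the validation set,
# mirroring the fact that train and test are split by time.
toy = toy.sort_values("TransactionDT").reset_index(drop=True)
cutoff = int(len(toy) * 0.8)
train_part, valid_part = toy.iloc[:cutoff], toy.iloc[cutoff:]

print(toy)
```

Because the reference datetime is not known, `DT_hour` is an hour offset relative to that reference rather than a true clock hour.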

#### Definition of Fraud
Below is the definition of fraud by one of the Vesta team organisers:
"The logic of our labeling is define reported chargeback on the card as fraud transaction (isFraud=1) and transactions posterior to it with either user account, email address or billing address directly linked to these attributes as fraud too. If none of above is reported and found beyond 120 days, then we define as legit transaction (isFraud=0).
However, in real world fraudulent activity might not be reported, e.g. cardholder was unaware, or forgot to report in time and beyond the claim period, etc. In such cases, supposed fraud might be labeled as legit, but we never could know of them. Thus, we think they're unusual cases and negligible portion."

In [None]:
identity_train = pd.read_csv("/kaggle/input/ieee-fraud-detection/train_identity.csv")
identity_test = pd.read_csv("/kaggle/input/ieee-fraud-detection/test_identity.csv")
transaction_train = pd.read_csv("/kaggle/input/ieee-fraud-detection/train_transaction.csv")
transaction_test = pd.read_csv("/kaggle/input/ieee-fraud-detection/test_transaction.csv")

In [None]:
print(identity_train.shape)
print(identity_test.shape)
print(transaction_train.shape)
print(transaction_test.shape)

In [None]:
identity_train.head()

In [None]:
transaction_train.head()

TransactionID is a unique key across the datasets. The training set has about 590k transactions, all with unique IDs, but identity data exists for only 144233 of them (~24%).
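A minimal sketch of how the key uniqueness and the identity coverage could be checked, using toy stand-ins for the two tables. The toy values are made up; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for transaction_train and identity_train.
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionAmt": [10.0, 25.5, 7.0, 99.0],
})
identity = pd.DataFrame({"TransactionID": [2], "DeviceType": ["mobile"]})

# validate="one_to_one" raises if TransactionID is duplicated in either table;
# indicator=True adds a _merge column saying where each row was found.
merged = transactions.merge(
    identity, on="TransactionID", how="left",
    validate="one_to_one", indicator=True,
)

# Share of transactions that have a matching identity row.
coverage = (merged["_merge"] == "both").mean()
print(f"{coverage:.1%} of transactions have identity data")
```

On the real data the same ratio should come out at roughly 24%, matching the counts quoted above.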

The 'isFraud' flag is the target variable. As expected, the classes are heavily imbalanced: about 96.5% of transactions are non-fraud.

In [None]:
transaction_train['isFraud'].value_counts(normalize=True).to_frame()

In [None]:
print(transaction_train['TransactionID'].nunique())
print(identity_train['TransactionID'].nunique())

In [None]:
train_full = pd.merge(transaction_train,identity_train, on = 'TransactionID',how='left')
test_full = pd.merge(transaction_test,identity_test, on = 'TransactionID',how='left')

del transaction_test,identity_test, transaction_train, identity_train


In [None]:
gc.collect()

The next step prepares the data for an X, y format: text columns are label-encoded to make processing more efficient. (One-hot encoding would create too many sparse columns for an already wide dataset.)

In [None]:
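# Label Encoding
# Encode every text column as integer codes, fitting on train and test together
# so that both share one mapping.
for f in train_full.columns:
    # Skip columns that exist only in train (e.g. the isFraud target);
    # indexing test_full with them would raise a KeyError.
    if f not in test_full.columns:
        continue
    if train_full[f].dtype == 'object' or test_full[f].dtype == 'object':
        train_full[f] = train_full[f].fillna('unseen_before_label')
        test_full[f] = test_full[f].fillna('unseen_before_label')
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train_full[f].values) + list(test_full[f].values))
        train_full[f] = lbl.transform(list(train_full[f].values))
        test_full[f] = lbl.transform(list(test_full[f].values))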
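
The label-encoding loop can be illustrated on a single toy column. This is a sketch with made-up values, showing why the encoder is fitted on train and test together: a category that appears only in test (here `"H"`) still gets a code, whereas an encoder fitted on train alone would raise a `ValueError` when transforming it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for one categorical column in train_full and test_full.
train_toy = pd.DataFrame({"ProductCD": ["W", "C", None, "W"]})
test_toy = pd.DataFrame({"ProductCD": ["H", "W", None]})

# Missing values become their own category, as in the loop above.
train_vals = train_toy["ProductCD"].fillna("unseen_before_label")
test_vals = test_toy["ProductCD"].fillna("unseen_before_label")

# Fit on the union of both sets so they share one mapping.
encoder = LabelEncoder().fit(list(train_vals) + list(test_vals))
train_codes = encoder.transform(list(train_vals))
test_codes = encoder.transform(list(test_vals))

print(list(encoder.classes_))
print(train_codes, test_codes)
```

The integer codes follow the sorted order of the labels, so they carry no meaningful ordering; tree-based models usually cope with this, but that is a modelling assumption rather than a guarantee.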
        

In [None]:
# Reduce memory usage of the dataset by converting numeric columns to their minimal types
# https://www.kaggle.com/c/champs-scalar-coupling/discussion/96655#558801
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
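                # Every element shares the column dtype, so the precision can be read
                # from the dtype directly instead of via a slow per-element apply().
                c_prec = np.finfo(col_type).precision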
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max and c_prec == np.finfo(np.float16).precision:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max and c_prec == np.finfo(np.float32).precision:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

train_full = reduce_mem_usage(train_full)
test_full = reduce_mem_usage(test_full)
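
One plausible reason for the precision check in `reduce_mem_usage` is that a range check alone does not guarantee a lossless cast. The sketch below, with made-up amounts and a hypothetical helper `roundtrip_error`, measures how much a float column changes when it is cast down and back.

```python
import numpy as np
import pandas as pd

# Made-up amounts; all of them fit inside the float16 range (max about 65504).
amounts = pd.Series([59.95, 117.0, 2049.0, 31937.391], dtype="float64")

def roundtrip_error(s, dtype):
    """Largest absolute change after casting to `dtype` and back to float64."""
    return float((s.astype(dtype).astype("float64") - s).abs().max())

err16 = roundtrip_error(amounts, np.float16)
err32 = roundtrip_error(amounts, np.float32)
print(f"float16 max error: {err16:.4f}, float32 max error: {err32:.6f}")
```

float16 keeps only about three significant decimal digits, so values in the thousands are rounded by whole units even though they pass the min/max test, while float32 stays within a small fraction of a cent for these amounts.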

In [None]:
train_full.to_pickle("train_full.pkl")
test_full.to_pickle("test_full.pkl")
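
A later notebook could load the pickles back and split them into X, y roughly as follows. This is a sketch on a toy frame written to a temporary directory; dropping `TransactionID` from the features is an assumption, not something this notebook does.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for train_full with already reduced dtypes.
toy = pd.DataFrame({
    "TransactionID": pd.Series([1, 2, 3], dtype="int32"),
    "isFraud": pd.Series([0, 1, 0], dtype="int8"),
    "TransactionAmt": pd.Series([10.0, 25.5, 7.0], dtype="float32"),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_full.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Unlike a CSV round trip, pickle keeps the reduced dtypes.
print(restored.dtypes)

# Split into target and features (dropping the ID column is an assumption).
y = restored["isFraud"]
X = restored.drop(columns=["isFraud", "TransactionID"])
```

Only unpickle files you created yourself, since `pd.read_pickle` can execute arbitrary code from an untrusted file.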
