Questions 


### Transaction Table *
- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
- TransactionAmt: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
- addr: address
- dist: distance
- P_emaildomain and R_emaildomain: purchaser and recipient email domain
- C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- D1-D15: timedelta, such as days between previous transaction, etc.
- M1-M9: match, such as names on card and address, etc.
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

Categorical Features:
- ProductCD
- card1 - card6
- addr1, addr2
- P_emaildomain, R_emaildomain
- M1 - M9

### Identity Table *
Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions. 
They're collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked, and a pairwise dictionary will not be provided, for privacy protection and contract agreement.)

Categorical Features:
- DeviceType
- DeviceInfo
- id_12 - id_38

I have decided to exclude all Vxxx features from the analysis as we do not know what they are or how they were created.
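
One way to apply this exclusion, sketched on a toy frame: drop every column whose name is a `V` followed by digits. The helper name `drop_vesta_features` and the regular expression are assumptions for illustration, and the pattern should be checked against the real column names before use.

```python
import pandas as pd


def drop_vesta_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns named like V1, V2, ... (hypothetical helper)."""
    is_v_column = df.columns.str.match(r"^V\d+$")
    return df.loc[:, ~is_v_column].copy()


# Toy frame standing in for the merged data (values are made up).
toy = pd.DataFrame({
    "TransactionAmt": [10.0, 25.5],
    "V1": [1.0, 0.0],
    "V257": [0.0, 3.0],
    "ProductCD": ["W", "C"],
})

kept = drop_vesta_features(toy)
print(kept.columns.tolist())
```

Anchoring the pattern with `^` and `$` keeps any other column that merely starts with a capital V.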

In [1]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from time import time

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

from sklearn.model_selection import train_test_split

from sklearn import metrics

In [None]:
%%time
dataset_path = '../datasets/fraud_datasets/' # Local location
# dataset_path = '../input/ieee-fraud-detection/'

# sample_submission_df = pd.read_csv(f'{dataset_path}sample_submission.csv')

train_transaction_df = pd.read_csv(f'{dataset_path}train_transaction.csv')
test_transaction_df = pd.read_csv(f'{dataset_path}test_transaction.csv')

train_id_df = pd.read_csv(f'{dataset_path}train_identity.csv')
test_id_df = pd.read_csv(f'{dataset_path}test_identity.csv')

After importing the raw datasets, we need to merge the transaction and identity tables on TransactionID.

<b>(Note: throughout this notebook, it will be important to delete any dataframes that will not be used again as they take up space in memory and slow our machine down.)</b>

In [None]:
%%time
train = train_transaction_df.merge(train_id_df, on='TransactionID', how='left')
test = test_transaction_df.merge(test_id_df, on='TransactionID', how='left')

# Renaming columns for better description
# (note: per the data description above, R_emaildomain is the recipient's email domain)
names = {
    'addr1': 'billing zipcode',
    'addr2': 'country codes',
    'P_emaildomain': 'Purchaser_emaildom',
    'R_emaildomain': 'Retailer_email.dom'
}

train.rename(columns=names, inplace=True)
test.rename(columns=names, inplace=True)
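
A left merge keeps every transaction and leaves the identity columns as NaN where no identity row exists. The sketch below, on made-up toy frames, shows how `validate` and `indicator` (both standard `DataFrame.merge` parameters) can confirm that the join did not duplicate rows and count how many transactions lack identity data.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are made up).
transactions = pd.DataFrame({"TransactionID": [1, 2, 3], "TransactionAmt": [50.0, 20.0, 75.0]})
identity = pd.DataFrame({"TransactionID": [1, 3], "DeviceType": ["mobile", "desktop"]})

merged = transactions.merge(
    identity,
    on="TransactionID",
    how="left",
    validate="one_to_one",  # raises MergeError if TransactionID repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# A left merge on a unique key must preserve the number of transactions.
assert len(merged) == len(transactions)

missing_identity = int((merged["_merge"] == "left_only").sum())
print(f"{missing_identity} of {len(merged)} transactions have no identity row")
```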

In [None]:
del train_transaction_df, train_id_df, test_transaction_df, test_id_df
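
To see how much memory a dataframe holds before deciding whether to delete it, `DataFrame.memory_usage(deep=True)` gives a per-column byte count. A minimal sketch on a toy frame; the call to `gc.collect()` is optional and only asks Python to reclaim unreachable objects sooner.

```python
import gc

import numpy as np
import pandas as pd

# Toy frame; the real tables are far larger.
big_df = pd.DataFrame({
    "a": np.arange(100_000, dtype="int64"),
    "b": np.ones(100_000, dtype="float64"),
})

# deep=True also counts the contents of object (string) columns.
bytes_used = int(big_df.memory_usage(deep=True).sum())
print(f"{bytes_used / 1e6:.1f} MB")

# Drop the last reference, then prompt the garbage collector.
del big_df
gc.collect()
```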

### Handling Missing Values

Due to the large number of NaN values our dataframes contain, it is critical that they are replaced with a meaningful placeholder.

In [None]:
y_train = train['isFraud'].copy()
X_train = train.drop(['isFraud', 'TransactionID'], axis=1)
del train

In [12]:
X_train['card2'].isna().any()

True

In [11]:
X_train.isna().any()

TransactionDT         False
TransactionAmt        False
ProductCD             False
card1                 False
card2                  True
card3                  True
card4                  True
card5                  True
card6                  True
billing zipcode        True
country codes          True
dist1                  True
dist2                  True
Purchaser_emaildom     True
Retailer_email.dom     True
C1                    False
C2                    False
C3                    False
C4                    False
C5                    False
C6                    False
C7                    False
C8                    False
C9                    False
C10                   False
C11                   False
C12                   False
C13                   False
C14                   False
D1                     True
                      ...  
id_11                  True
id_12                  True
id_13                  True
id_14                  True
id_15               

To prevent our model from crashing, it is important to replace missing values with some derived value. For categorical features the replacement will be the mode of that feature, and for numerical features it will be the mean.

In [None]:
%%time
for feature in X_train.columns:
    if X_train[feature].isna().any():
        if X_train[feature].dtype == 'object':
            # Fill before casting: astype(str) would turn NaN into the string 'nan'.
            # mode() returns a Series, so take its first entry to get a scalar.
            X_train[feature] = X_train[feature].fillna(X_train[feature].mode()[0]).astype(str)
        else:
            X_train[feature] = X_train[feature].fillna(X_train[feature].mean())
    if test[feature].isna().any():
        if test[feature].dtype == 'object':
            test[feature] = test[feature].fillna(test[feature].mode()[0]).astype(str)
        else:
            test[feature] = test[feature].fillna(test[feature].mean())
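
A variant of the mode/mean imputation described above, sketched on toy data. It differs from the notebook cell in one respect, which is an assumption of this sketch and not necessarily the author's method: the fill values are computed on the training frame only and reused for the test frame, so both frames are imputed consistently. The helper name `impute_from_train` is hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_from_train(train_df, test_df):
    """Fill NaNs with the train mean (numeric columns) or train mode (other columns)."""
    train_df, test_df = train_df.copy(), test_df.copy()
    for col in train_df.columns:
        if is_numeric_dtype(train_df[col]):
            fill_value = train_df[col].mean()
        else:
            # mode() returns a Series; take its first entry to get a scalar.
            fill_value = train_df[col].mode().iloc[0]
        train_df[col] = train_df[col].fillna(fill_value)
        test_df[col] = test_df[col].fillna(fill_value)
    return train_df, test_df


# Made-up values for illustration.
toy_train = pd.DataFrame({
    "card4": ["visa", "visa", np.nan, "mastercard"],
    "dist1": [1.0, np.nan, 3.0, 5.0],
})
toy_test = pd.DataFrame({"card4": [np.nan, "visa"], "dist1": [np.nan, 2.0]})

filled_train, filled_test = impute_from_train(toy_train, toy_test)
print(filled_train)
print(filled_test)
```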

In [38]:
X_train = X_train.fillna(-999)

In [None]:
X_train['Purchaser_emaildom'][15]

In [31]:
X_train['Purchaser_emaildom'].fillna(X_train['Purchaser_emaildom'].mode()).tolist()

['gmail.com',
 'gmail.com',
 'outlook.com',
 'yahoo.com',
 'gmail.com',
 'gmail.com',
 'yahoo.com',
 'mail.com',
 'anonymous.com',
 'yahoo.com',
 'gmail.com',
 'hotmail.com',
 'verizon.net',
 'aol.com',
 'yahoo.com',
 nan,
 'aol.com',
 'yahoo.com',
 'gmail.com',
 'gmail.com',
 'gmail.com',
 'gmail.com',
 'gmail.com',
 'gmail.com',
 'gmail.com',
 'gmail.com',
 'me.com',
 'yahoo.com',
 nan,
 'gmail.com',
 'gmail.com',
 'yahoo.com',
 'yahoo.com',
 'yahoo.com',
 'yahoo.com',
 'yahoo.com',
 'gmail.com',
 'gmail.com',
 nan,
 'yahoo.com',
 'gmail.com',
 'gmail.com',
 'gmail.com',
 'gmail.com',
 'gmail.com',
 'yahoo.com',
 'gmail.com',
 'anonymous.com',
 'outlook.com',
 'anonymous.com',
 'gmail.com',
 'yahoo.com',
 'gmail.com',
 'yahoo.com',
 'gmail.com',
 'gmail.com',
 nan,
 'gmail.com',
 nan,
 'gmail.com',
 'hotmail.com',
 'gmail.com',
 'yahoo.com',
 'yahoo.com',
 'gmail.com',
 nan,
 'anonymous.com',
 nan,
 'hotmail.com',
 nan,
 'anonymous.com',
 'gmail.com',
 'outlook.com',
 'yahoo.com',
 '
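
The `nan` entries in the output above survive because `Series.mode()` returns a Series, and `fillna` given a Series aligns on index labels. Only rows whose index label also appears in the mode result (here just label 0) can be filled. A small sketch on made-up data showing the difference between passing the Series and passing its first element:

```python
import numpy as np
import pandas as pd

emails = pd.Series(["gmail.com", np.nan, "yahoo.com", "gmail.com", np.nan])

# mode() returns a Series; index label 0 holds the most frequent value.
mode_series = emails.mode()

# Passing the Series: fillna aligns on index labels, so rows 1 and 4 stay NaN.
filled_with_series = emails.fillna(mode_series)

# Passing the scalar: every NaN is replaced.
filled_with_scalar = emails.fillna(mode_series.iloc[0])

print(filled_with_series.isna().sum())
print(filled_with_scalar.tolist())
```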

In [32]:
test.isna().any()

TransactionID         False
TransactionDT         False
TransactionAmt        False
ProductCD             False
card1                 False
card2                  True
card3                  True
card4                  True
card5                  True
card6                  True
billing zipcode        True
country codes          True
dist1                  True
dist2                  True
Purchaser_emaildom     True
Retailer_email.dom     True
C1                     True
C2                     True
C3                     True
C4                     True
C5                     True
C6                     True
C7                     True
C8                     True
C9                     True
C10                    True
C11                    True
C12                    True
C13                    True
C14                    True
                      ...  
id_11                  True
id_12                  True
id_13                  True
id_14                  True
id_15               

As our model cannot accept categorical features as object data types, a label transformation must be performed.

In [None]:
test.isna().any()

In [None]:
%%time
for feature in X_train.columns:
    if X_train[feature].dtype == 'object' or test[feature].dtype == 'object':
        le = LabelEncoder()
        le.fit(list(X_train[feature].values) + list(test[feature].values))
        X_train[feature] = le.transform(list(X_train[feature].values))
        test[feature] = le.transform(list(test[feature].values))
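
The cell above fits one encoder per column on the union of train and test values, so a category that appears only in the test set still receives a code. A toy sketch of that idea, with made-up column values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy_train = pd.Series(["visa", "mastercard", "visa"])
toy_test = pd.Series(["discover", "visa"])  # 'discover' never appears in train

le = LabelEncoder()
# Fit on the union of both sets so every category is known to the encoder.
le.fit(toy_train.tolist() + toy_test.tolist())

train_codes = le.transform(toy_train.tolist())
test_codes = le.transform(toy_test.tolist())

# classes_ is sorted, and each class is encoded as its position in that array.
print(list(le.classes_))
print(train_codes, test_codes)
```

Fitting on train alone would make `transform` raise a `ValueError` on the unseen `'discover'` label.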

#### SMOTE or Oversampling

### Feature Selection
As we are performing many different feature selection methods, it is important to store each set of features for testing.

First, we will add the full, unselected feature list to our feature sets dictionary.

In [None]:
# feature_sets = {}
# feature_sets['all'] = X_train.columns.tolist()

#### Random Forest Feature Selection

In [None]:
# # Build classifier
# clf = RandomForestClassifier(n_estimators=250, n_jobs=-1)
# clf.fit(X_train, y_train)

# # Select features > 0
# importances = pd.DataFrame(X_train.columns, columns=['features'])
# importances['importance'] = clf.feature_importances_
# importances.sort_values(by='importance', axis=0, ascending=False, inplace=True)
# importances = importances[importances['importance'] > 0]

# # Plot best features
# plt.figure(figsize=[20,10])
# plt.title("Feature importances")
# plt.bar(range(len(importances)), importances['importance'])
# plt.xticks(range(len(importances)), importances['features'])
# plt.xlim([-1, len(importances)])
# plt.xticks(rotation=90)
# plt.show()

In [None]:
# feature_sets['random_forest'] = importances['features'].tolist()

#### Chi Squared Feature Selection

In [None]:
# # Scale features to avoid negative values crashing chi2
# scaler = MinMaxScaler()
# X_train_scaled = scaler.fit_transform(X_train)

# # Building the model
# bestfeatures = SelectKBest(score_func=chi2, k=2)
# fit = bestfeatures.fit(X_train_scaled, y_train.values)

# # Select features > 50
# feature_scores = pd.DataFrame(X_train.columns, columns=['features'])
# feature_scores['scores'] = fit.scores_
# feature_scores.sort_values(by='scores', axis=0, ascending=False, inplace=True)
# feature_score = feature_scores[feature_scores['scores']>50]

# # Plot best features
# plt.figure(figsize=[20,10])
# plt.title("Feature importances")
# plt.bar(range(len(feature_scores)), feature_scores['scores'])
# plt.xticks(range(len(feature_scores)), feature_scores['features'])
# plt.xlim([-1, len(feature_scores)])
# plt.xticks(rotation=90)
# plt.show()

In [None]:
# feature_sets['chi2'] = feature_score['features'].tolist()
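
Once several feature sets are stored, each can be scored with the same model and metric so the comparison is fair. A minimal sketch on synthetic data; the dictionary contents, the small forest, and the use of `cross_val_score` are illustrative assumptions, not results from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train.
X_arr, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Hypothetical feature sets keyed by the selection method that produced them.
feature_sets = {
    "all": X.columns.tolist(),
    "first_three": ["f0", "f1", "f2"],
}

scores = {}
for name, features in feature_sets.items():
    clf = RandomForestClassifier(n_estimators=25, random_state=0)
    # Mean ROC AUC over 3 folds, using only the columns in this feature set.
    scores[name] = cross_val_score(clf, X[features], y, cv=3, scoring="roc_auc").mean()

print(scores)
```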

#### Train/test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)
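
Fraud labels are typically a small minority, so a plain random split can leave the validation set with a different fraud rate than the training set. Passing `stratify` to `train_test_split` keeps the class ratio the same in both parts. A sketch on made-up labels (the 5% positive rate is an assumption for illustration, not the dataset's actual rate):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 toy rows with 50 positives (5%).
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 50 + [0] * 950)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both parts keep the same share of positives.
print(y_tr.mean(), y_val.mean())
```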

## Random Forest

In [None]:
model = RandomForestClassifier(n_estimators=400, max_features=0.3, # Try running Classifier with max_features=1
    min_samples_leaf=20, n_jobs=-1, verbose=1)

model.fit(X_train, y_train)

# ROC AUC needs ranking scores, so use the fraud-class probability rather than hard 0/1 labels
preds_valid = model.predict_proba(X_test)[:, 1]

metrics.roc_auc_score(y_test, preds_valid)
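
ROC AUC measures how well a model ranks positives above negatives, so it is most informative when given continuous scores such as `predict_proba(...)[:, 1]`. Hard 0/1 predictions collapse the ranking to two levels and usually understate the score. A toy comparison with made-up scores and a 0.5 threshold:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])
# Made-up fraud probabilities: both positives rank above every negative.
proba = np.array([0.05, 0.10, 0.30, 0.40, 0.90])

# Thresholding at 0.5 turns the 0.40 positive into a predicted negative.
hard_labels = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_proba, auc_from_labels)
```

The ranking is perfect in both cases, yet the thresholded version scores lower because the missed positive is now tied with every negative.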

In [None]:
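# test still carries TransactionID, which the model was not trained on,
# so select the training columns; use probabilities to match the validation metric
y_pred_rf = model.predict_proba(test[X_train.columns])[:, 1]

# result_path = 'results/' # Local Notebook
result_path = '' # Kaggle Notebook

finished_df_rf = pd.DataFrame(test['TransactionID'])
finished_df_rf['isFraud'] = y_pred_rf

finished_df_rf.to_csv(result_path + 'predictions_rf.csv', index=False)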