# Fraud Detection 
* Author: Grant Gasser
* Last Edit: 8/20/2019
* Kaggle: "In this competition you are predicting the probability that an online transaction is fraudulent, as denoted by the binary target `isFraud.`"

## Summary (Disclaimer: this kernel has become a bit unorthodox - Enter at your own risk!)
**This is my first serious attempt at a Kaggle competition. As such, I would really appreciate some feedback or tips for improving performance. If you enjoyed this notebook and it helped you, please leave a thumbs up! Though I've written most of the code myself, I have found the other public kernels very helpful and would encourage you to browse through them to look for other good ideas.**

* **Public Leaderboard Results:** (I plan on using some of the previous submission files for ensembling)
* Random Forest filled NaNs with -999: `.872`
* XGBoost filled NaNs with -999: `.938`, submission file: `baseline_xgboost.csv`
* XGBoost impute mean for numerical NaNs and most common cat for categorical NaNs, also normalized numerical vars: `.878`
* XGBoost impute mean for numerical NaNs and most common cat for categorical NaNs, no normalization: `.932`, submission file: `preprocessed_xgboost`
* XGBoost, impute mean for numerical NaNs, do not impute most common category for categorical NaNs, no normalization: `.934`, file: `preprocessed2_xgboost`. **NOTE**: imputing mean for numerical NaNs and most common category for categorical NaNs did not seem to help for XGBoost. 
* Version 21: hyperparameter tuning with XGBoost (Grid Search or Random Search), `.9284`, `xgboost_with_tuning`
* `.9226`, `xgboost_with_tuning2`
* Ensembling: `.9392`

## ENSEMBLING:
* Averaging out my previous predictions using the files listed above ^ 
* data: `https://www.kaggle.com/grantgasser/previous-submissions`

## Libraries

In [None]:
# For data analysis, model building, evaluating
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold
from sklearn.metrics import precision_score, recall_score, confusion_matrix, accuracy_score, roc_auc_score, f1_score, roc_curve, auc, precision_recall_curve
from sklearn import preprocessing
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('display.max_columns', 200) # before I forget

import warnings
warnings.filterwarnings("ignore")

#For loading data
import os

# For plots
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams

## View provided files

In [None]:
print(os.listdir("../input"))

input_path = '../input'

%matplotlib inline

RANDOM_SEED = 42
nan_replace = -999

## Ensemble (Random Forest & XGBoost)
* Load previous submissions

In [None]:
# print(os.listdir('../input/previous-submissions'))

In [None]:
# baseline_random_forest = pd.read_csv('../input/previous-submissions/baseline_random_forest.csv', index_col='TransactionID') # .878
# baseline_xgboost = pd.read_csv('../input/previous-submissions/baseline_xgboost.csv', index_col='TransactionID') # .938
# xgboost_with_tuning = pd.read_csv('../input/previous-submissions/xgboost_with_tuning.csv', index_col='TransactionID') # .928
# preprocessed2_xgboost = pd.read_csv('../input/previous-submissions/preprocessed2_xgboost.csv', index_col='TransactionID') # .934

In [None]:
# assert(baseline_random_forest.shape == baseline_xgboost.shape == xgboost_with_tuning.shape == preprocessed2_xgboost.shape)

### Weighted Avg
* Based on scores (more weight for model outputs that had better scores)
* `.05, .05, .1, .8` -> `.9392`, minor improvement from the best model's score of `.9381`, file: `ensemble5.csv`
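
The weighted average described above can be sketched as follows. This is a minimal illustration, not the exact code behind `ensemble5.csv`: the helper name `weighted_average`, the toy frames, and the weights are all made up for the example, and the helper normalizes the weights so they always sum to 1.

```python
import pandas as pd


def weighted_average(submissions, weights, target="isFraud"):
    """Blend prediction frames that share the same index, using normalized weights."""
    total = float(sum(weights))
    first = submissions[0]
    blended = pd.Series(0.0, index=first.index, name=target)
    for sub, w in zip(submissions, weights):
        # every file must score the same transactions in the same order
        assert sub.index.equals(first.index)
        blended += (w / total) * sub[target]
    return blended.to_frame()


# toy stand-ins for two submission files (values are made up)
idx = pd.Index([1, 2, 3], name="TransactionID")
rf = pd.DataFrame({"isFraud": [0.2, 0.4, 0.6]}, index=idx)
xgb_a = pd.DataFrame({"isFraud": [0.1, 0.5, 0.9]}, index=idx)

ensemble = weighted_average([rf, xgb_a], weights=[0.2, 0.8])
print(ensemble)
```

Normalizing inside the helper guards against a weight typo silently shrinking every prediction, although a uniform rescaling would not change a rank-based score.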

In [None]:
# ensemble = .05*baseline_random_forest + .05*xgboost_with_tuning + .1*preprocessed2_xgboost + .8*baseline_xgboost
# ensemble.head()

In [None]:
# ensemble.to_csv('ensemble5.csv')

# Data Description 
* As provided by VESTA: https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203#latest-586800

#### Transaction Table
* TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
* TransactionAMT: transaction payment amount in USD
* ProductCD: product code, the product for each transaction
* card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
* addr: address
* dist: distances between (not limited) billing address, mailing address, zip code, IP address, phone area, etc.
* P_ and (R__) emaildomain: purchaser and recipient email domain
* C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
* D1-D15: timedelta, such as days between previous transaction, etc.
* M1-M9: match, such as names on card and address, etc.
* Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

**Categorical Features:**
* ProductCD
* card1 - card6
* addr1, addr2
* P_emaildomain, R_emaildomain
* M1 - M9

---

#### Identity Table
Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions. 
They're collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked and pairwise dictionary will not be provided for privacy protection and contract agreement)

**Categorical Features:**
* DeviceType
* DeviceInfo
* id_12 - id_38

## Load and explore data

In [None]:
%%time
train_identity =  pd.read_csv(os.path.join(input_path, 'train_identity.csv'), index_col='TransactionID')
train_transaction = pd.read_csv(os.path.join(input_path, 'train_transaction.csv'), index_col='TransactionID')
test_identity = pd.read_csv(os.path.join(input_path, 'test_identity.csv'), index_col='TransactionID')
test_transaction = pd.read_csv(os.path.join(input_path, 'test_transaction.csv'), index_col='TransactionID')

### View tables

In [None]:
print('train_identity shape:', train_identity.shape)
print('train_transaction shape:', train_transaction.shape)
print('test_identity shape:', test_identity.shape)
print('test_transaction shape:', test_transaction.shape)

In [None]:
train_transaction.head()

In [None]:
train_identity.head()

# Merge identity and transaction tables
* Per Kaggle: "The data is broken into two files `identity` and `transaction`, which are joined by `TransactionID`. Not all transactions have corresponding identity information."
* Merge identity and transaction tables with `TransactionID` as the key
* Since "not all transactions have corresponding identity information," we use a left join with the pandas `merge` function: every transaction is kept, and the identity columns are NaN where a `TransactionID` has no identity row
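
To make the effect of the left join concrete, here is a small sketch on toy tables. The values are invented; only the column names `TransactionAmt` and `DeviceType` and the key `TransactionID` come from the competition data.

```python
import pandas as pd

# three transactions, but identity information for only two of them
transaction = pd.DataFrame(
    {"TransactionAmt": [10.0, 25.5, 7.0]},
    index=pd.Index([100, 101, 102], name="TransactionID"),
)
identity = pd.DataFrame(
    {"DeviceType": ["mobile", "desktop"]},
    index=pd.Index([100, 102], name="TransactionID"),
)

# a left join keeps every transaction; missing identity fields become NaN
merged = pd.merge(transaction, identity, how="left", on="TransactionID")
print(merged)
```

An inner join would instead drop transaction 101, which is why it is not a good fit when most rows lack identity data.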

In [None]:
train = pd.merge(train_transaction, train_identity, how='left', on='TransactionID')
test = pd.merge(test_transaction, test_identity, how='left', on='TransactionID')

# check that transaction and identity variables are now in one train table (should be the same for test)
train.head()

In [None]:
# clear up RAM
del train_transaction, train_identity, test_transaction, test_identity

In [None]:
print('train shape:', train.shape)
print('test shape:', test.shape)

In [None]:
num_train = train.shape[0]
num_test = test.shape[0]
num_features = test.shape[1]

print('Test data is {:.2%}'.format(num_test/(num_train+num_test)), 'of total train/test data')

# Baseline Model Random Forest (.878) and XGBoost (.938)
### No pre-processing other than NaN replacement with -999 and label encode

In [None]:
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

## Replacing missing values
* Use `train_filled` and `test_filled` for random forest and XGBoost (NaNs replaced with -999)
* Use `train` and `test` for the neural network; these will have values imputed for NaNs and be normalized

In [None]:
# store target
y_train = train['isFraud']
train = train.drop('isFraud', axis=1)

# replace NaNs
train_filled = train.fillna(nan_replace)
test_filled = test.fillna(nan_replace)

### Reduce memory usage before fit
* Thanks to https://www.kaggle.com/iasnobmatsu/xgb-model-with-feature-engineering
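
One caveat worth keeping in mind with this kind of downcasting: the `reduce_mem_usage` helper in this section picks a dtype from the column's min and max only, so a column that fits the `float16` range can still lose precision. The snippet below is a small illustration of that effect with a made-up value, not part of the original kernel.

```python
import numpy as np

value = 2049.0

# float16 has an 11-bit significand, so integers above 2048 are no longer exact
as_f16 = float(np.float16(value))
as_f32 = float(np.float32(value))
max16 = float(np.finfo(np.float16).max)

print("float16:", as_f16, "float32:", as_f32, "float16 max:", max16)
```

If exact values matter for a feature (for example IDs stored as floats), it may be safer to stop at `float32` for that column.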

In [None]:
# not going to use until neural net, so compress to use less RAM
print(train.memory_usage().sum() / 1024**3, 'GB')
print(test.memory_usage().sum() / 1024**3, 'GB')

In [None]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
train = reduce_mem_usage(train)
train_filled = reduce_mem_usage(train_filled)
test = reduce_mem_usage(test)
test_filled = reduce_mem_usage(test_filled)

## Label Encoding
* Convert categorical (string) values to integer codes, since the tree models used here need numeric input
* e.g. if the encoding is: `['mastercard', 'discover', 'visa']` based on index, then data like `['visa', 'visa', 'mastercard', 'discover', 'mastercard']` would be encoded as `[2, 2, 0, 1, 0]`
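
Note that scikit-learn's `LabelEncoder` does not use the order in which categories appear; it sorts the classes, so the actual codes can differ from the hypothetical ordering above. A quick check on the same toy data:

```python
from sklearn.preprocessing import LabelEncoder

cards = ["visa", "visa", "mastercard", "discover", "mastercard"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cards)

# classes_ holds the sorted categories; a code is the position in that array
print(list(encoder.classes_))
print(list(codes))
```

The integer codes carry no meaningful order, which tree-based models tolerate reasonably well but which would be a poor fit for linear models or neural nets.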

In [None]:
# Label Encoding (on the -999-filled frames that the tree models are trained on)
for f in train_filled.columns:
    if train_filled[f].dtype=='object' or test_filled[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train_filled[f].values) + list(test_filled[f].values))
        train_filled[f] = lbl.transform(list(train_filled[f].values))
        test_filled[f] = lbl.transform(list(test_filled[f].values))

In [None]:
# -999 and no strings (label encoding)
train_filled.head()

## Train Random Forest and XGBoost

In [None]:
xgb_clf = xgb.XGBClassifier(n_estimators=500,
    max_depth=9,
    subsample=0.9,
    colsample_bytree=0.9,
    missing=nan_replace,
    random_state=RANDOM_SEED,
    tree_method='gpu_hist')

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=9, random_state=RANDOM_SEED, n_jobs=-1)

In [None]:
%%time
xgb_clf.fit(train_filled, y_train)

In [None]:
%%time
rf_clf.fit(train_filled, y_train)

### Evaluate feature importance from Random Forest
* Unfortunately, the easily interpretable features did not seem to be important to the random forest model

In [None]:
for num, feature in sorted(zip(rf_clf.feature_importances_, train_filled.columns), reverse=True):
    print('{:.4f} - {}'.format(num,feature))

In [None]:
xgb_pred = xgb_clf.predict_proba(test_filled)[:,1]
rf_pred = rf_clf.predict_proba(test_filled)[:,1]
del xgb_clf, rf_clf

In [None]:
print('XGBoost predictions:\n', xgb_pred[:5], '\n\nRandom Forest predictions:\n', rf_pred[:5])

# EDA and Pre-processing
### Next model will be Neural Network (3rd model in the ensemble)

### Data Types
* Before diving into EDA, look at data types of current features and see if they need to be changed

**Categorical Features:**
* ProductCD
* card1 - card6
* addr1, addr2
* P_emaildomain, R_emaildomain
* M1 - M9
* DeviceType
* DeviceInfo
* id_12 - id_38

In [None]:
for feature in train.columns[:20]:
    print(feature, '\t', train[feature].dtype)

### Thoughts
* Some of these should not be numerical data (e.g. card1-card6 should be 'object' types, not int64 or float64)
* The next few cells change this

In [None]:
# 'isFraud' is not listed: it was already split off into y_train and never exists in test
cat_features = ['ProductCD', 'addr1', 'addr2', 'P_emaildomain', 'R_emaildomain', 'DeviceType', 'DeviceInfo']

# add card1-card6
for i in range(1, 7):
    cat_features.append('card'+str(i))
    
    
# add M1-M9
for i in range(1, 10):
    cat_features.append('M'+str(i))
    
    
# add id12-38
for i in range(12, 39):
    cat_features.append('id_'+str(i))

In [None]:
# Convert categorical features to data type 'object'
def convert_to_object(df, cat_features):
    """
    Converts features to data type 'object', so that all categorical features in dataframe are of type 'object'
    
    Args:
        df (pd.Dataframe)
        cat_features (list): the categorical features as strings
        
    Returns:
        df (pd.Dataframe): where new df has categorical features as type 'object'
    """
    for feature in cat_features:
        if feature not in df.columns:
            print('ERROR:', feature)
        else:
            df[feature] = df[feature].astype('object')
                        
    return df

In [None]:
train = convert_to_object(train, cat_features)
test = convert_to_object(test, cat_features)

# EDA

## Analyze Categorical Variables

In [None]:
for feature in train.columns[:20]:
    if train[feature].dtype == 'object':
        print(feature, '\t Unique categories:', train[feature].nunique())
        print('-'*40)

### ^Confirming our intuition that one-hot encoding would produce too many dimensions

## Replace NaNs with median or most common category
* Median for numerical features, most common category for categorical features. 
* **NOTE:** with fillna, replace and other pandas functions, make sure you set the variable, because it returns the transformed object
 * e.g. `df[feature] = df[feature].replace()` instead of just `df[feature].replace()`
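
A hedged variant of this imputation is sketched below. It differs from the notebook's `replace_nans` in two ways that are my assumptions, not the author's method: the fill values are computed on the train set only and reused for the test set, and `fillna` is used with a per-column dictionary. The helper name `fit_imputer` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def fit_imputer(train_df):
    """Return {column: fill value}: median for numeric columns, mode otherwise."""
    fill = {}
    for col in train_df.columns:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            fill[col] = train_df[col].median()
        else:
            fill[col] = train_df[col].mode(dropna=True).iloc[0]
    return fill


toy_train = pd.DataFrame({"amt": [1.0, np.nan, 3.0, 10.0],
                          "card": ["visa", "visa", None, "discover"]})
toy_test = pd.DataFrame({"amt": [np.nan, 5.0],
                         "card": [None, "mastercard"]})

fill_values = fit_imputer(toy_train)

# fillna returns a new frame, so the result has to be assigned back
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
print(toy_test)
```

Reusing the train statistics keeps both sets on the same footing, at the cost of ignoring any drift in the test distribution.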

In [None]:
def replace_nans(df, val_to_be_replaced):
    """
    Replaces missing values (NaNs or -999) with the median (if numerical) and with the most
    common category (if categorical)
    
    Args:
        df (pd.DataFrame)
        val_to_be_replaced: the marker for a missing value, e.g. np.nan or -999
        
    Returns:
        df (pd.DataFrame): transformed dataframe
    """
    # NOTE: fillna did not work well here, recommend using replace
    print(val_to_be_replaced, type(val_to_be_replaced))
    
    for feature in df.columns:
        # replace categorical variable with most frequent
        if df[feature].dtype == 'object':
            df[feature] = df[feature].replace(val_to_be_replaced, df[feature].value_counts().index[0]) # most common category
        
        # replace NaN in numerical columns with median
        else:
            df[feature] = df[feature].replace(val_to_be_replaced, df[feature].median()) # median
            
    return df

In [None]:
train = replace_nans(train, np.nan) # train still holds NaNs (not -999), so np.nan is the value to replace
test = replace_nans(test, np.nan)

In [None]:
train.head()

## Explore Labels
* Note the class imbalance
* About 3.5% of train examples are fraudulent

In [None]:
num_fraud = y_train.sum()

print('# of fraudulent transactions:', num_fraud, '\n# of training examples:', num_train)

In [None]:
plt.bar(['Fraud', 'Not fraud'], height=[num_fraud, num_train-num_fraud])
plt.title('Class Imbalance')
plt.show()

## Compare fraud and non-fraud (within training set)
1. Compare the difference in means of numerical features between the fraud and non-fraud transactions. 

2. Compare the difference in distributions of categorical features between the fraud and non-fraud transactions. 
 

### Look at a few fraudulent transactions

In [None]:
train_fraud = train[y_train == 1]
train_not_fraud = train[y_train == 0]

train_fraud.head(10)

In [None]:
# def get_mean_of_feature(df, feature):
#     """
#     Calculates and returns mean value of a numerical feature variable
    
#     Args:
#         df (pd.DataFrame): the dataframe
#         feature (str): the name of the numerical feature/variable as a string
        
#     Returns:
#         mean (float)
#     """
#     return df[feature].mean()

# def get_categorical_distribution_of_feature(df, feature):
#     """
#     Calculates and returns distribution of a categorical feature variable
    
#     Args:
#         df (pd.DataFrame): the dataframe
#         feature (str): the name of the categorical feature/variable as a string
        
#     Returns:
#         categorical dist (pd.Series)
#     """
#     return df[feature].value_counts() / df[feature].value_counts().sum()

In [None]:
# def compare_dataframes(df1, df1_description, df2, df2_description):
#     """
#     Analyze each feature and compare the difference between fraud and not fraud table
    
#     Args:
#         df1 (pd.DataFrame): first table, e.g. the fraudulent transactions
#         df1_description (str): label printed for df1
#         df2 (pd.DataFrame): second table, e.g. the non-fraud transactions
#         df2_description (str): label printed for df2
        
#     Returns:
        
#     """
    
#     # features that look interesting from visual inspection
#     features = ['TransactionDT', 'TransactionAmt', 'ProductCD', 'card1', 'card4', 'card6', 
#                 'P_emaildomain', 'R_emaildomain', 'id_29', 'id_30', 'id_31', 'DeviceType', 'DeviceInfo']
    
#     # Use this if analyzing ALL features of dataframes
#     # make sure have same features in both dataframes
#     #assert(sorted(train_not_fraud.columns) == sorted(train_fraud.columns))
#     #features = train_fraud.columns 
    
#     for feature in features:
#         # numerical feature
#         if pd.api.types.is_numeric_dtype(df1[feature]):  # also matches the downcast int/float dtypes
#             print('\nNumerical feature (' + str(df1_description), ')\tFeature name:', feature, '\nmean:', get_mean_of_feature(df1, feature))
#             print('\nNumerical feature (' + str(df2_description), ')\tFeature name:', feature, '\nmean:', get_mean_of_feature(df2, feature))
#         # categorical feature
#         elif df1[feature].dtype == 'object': # object, a string
#             print('\nCategorical feature(' + str(df1_description), ')\tFeature name:', feature, '\nDistribution:\n', get_categorical_distribution_of_feature(df1,feature)[:10])
#             print('\nCategorical feature(' + str(df2_description), ')\tFeature name:', feature, '\nDistribution:\n', get_categorical_distribution_of_feature(df2,feature)[:10])

In [None]:
# compare_dataframes(train_fraud, 'Train Fraud', train_not_fraud, 'Train Not Fraud')

In [None]:
# # Clear up RAM (10.3GB -> 8.6GB)
# del train_fraud, train_not_fraud

## Compare train and test

In [None]:
# compare_dataframes(train, 'Train set', test, 'Test set')

### Note TransactionDT has no overlap
* As mentioned: https://www.kaggle.com/robikscube/ieee-fraud-detection-first-look-and-eda
* Not sure what to do here. Maybe transform so that each value is relative to its range?

In [None]:
# plt.hist(train['TransactionDT'], label='train')
# plt.hist(test['TransactionDT'], label='test')
# plt.legend()
# plt.title('Distribution of TransactionDT dates')

In [None]:
# # could correct for time difference in later iteration, for now, just drop column
# train = train.drop(['TransactionDT'], axis=1)
# test = test.drop(['TransactionDT'], axis=1)

# print('dropped TransactionDT')

## Takeaways from EDA
### There are lots of missing values
### There is significant class imbalance (Only ~20,000 out of 590,000 are fraudulent, or 3.5 %)
* Thus, a classifier that always predicts not fraud (0) would have 96.5% accuracy (on the training set, the test set is similar)
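
The point about accuracy can be checked on a toy label vector with the same 3.5% fraud rate. This is only an illustration with synthetic labels: the always-zero classifier scores high on accuracy but no better than chance on ROC AUC, the metric used in the tuning code of this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# synthetic labels: 35 fraud cases out of 1000 (3.5%)
y_true = np.array([1] * 35 + [0] * 965)
always_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, always_zero)
auc_score = roc_auc_score(y_true, always_zero.astype(float))

print("accuracy:", acc)
print("ROC AUC:", auc_score)
```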


### TRAIN SET: Comparing means of numerical features among fraud and non-fraud transactions:
* `TransactionDT` - fraudulent transactions 4.5% higher
* `TransactionAmt` - fraudulent transactions 11% more expensive

### TRAIN SET: Comparing distributions of categorical variables among fraud and non-fraud transactions:
* Take a look at the above cell to see the comparison
* Some of these may be spurious, but with 20,000 fraudulent examples, they could imply something
* `ProductCD` - 39% of fraud transactions are 'C', but only 11% of non-fraud transactions are 'C'
* `card1` - looks similar
* `card4` - distribution looks similar
* `card6` - fraud transactions distributed evenly (52/48) between debit and credit whereas non-fraud transactions are mostly debit (76%)
* `P_emaildomain` - 13% of fraud comes from hotmail email vs. 9% non-fraud is hotmail email 
* `R_emaildomain` - 60% of emails on receiving end of fraud are gmail vs. only 40% for non-fraud
* `id_29` - 70% are 'Found' in the fraud examples vs. 52% in the non-fraud
* `id_30` - Mac OS versions appear in the non-fraud top 10 but not in the fraud top 10, implying fraud is less common on Macs
* `DeviceType` - fraud was about evenly distributed (50/50) between mobile and desktop, most non-fraud on desktop (61%)
* `DeviceInfo` - similar to what `id_30` implied, Macs were used for 11% of non-fraud transactions but just 3% of fraud transactions


### Comparing train distribution and test distribution
* Remember, train size is about 590,000 and test size is about 500,000
* Other than `TransactionDT`, the distributions look similar
* Note that since the test set is later in time, there are some features where the distributions are almost certain to be different
* e.g. `id_31` represents the browser used. For the train set, the most common browser was **chrome 63** at 16%. In the test set, the most common was **chrome 70**, 7 versions later, and **chrome 63** did not even show up in the top 10 most common browsers for the test set, unsurprisingly.
* Should I drop `id_31` and other columns affected by time or let the model weight it?
* Also, looking at `DeviceType`, 60% of transactions in the train set were done on desktop vs. 54% on desktop in the test set. Could this represent the increasing usage of mobile? Is there that much of a time difference between the train and test set?

# More pre-processing

### Dropping features with >80% missing values

### Leaving out for now

In [None]:
# drop_cols = [c for c in train.columns if (train[c].isnull().sum() /  num_train) > .80]

# # also dropping V107 (though float values and VESTA did not say it was categorical, it really looks categorical in {0,1})
# # it caused problems in making predictions, after further analysis, it seemed to have weak correlation with target variable
# drop_cols.append('V107')

# print('Dropping', len(drop_cols), 'columns.')
# print('Including...', drop_cols[0:10])

In [None]:
# train = train.drop(drop_cols, axis=1)
# test = test.drop(drop_cols, axis=1)

### Normalize Variables
* For speed of convergence and numerical stability
* Also to ensure variables with larger numbers do not dominate the model (e.g. TransactionAmt)
* Normalize numerical variables: $x_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j}$ where $i$ is the row, $j$ is the column, $\mu_j$ is the mean of the column and $\sigma_j$ is the std of the col
* After the transformation, we will have $\mu_j = 0$ and $\sigma_j = 1$ for each numerical column/feature $j$
* Could also try Min-Max scaling, which gives $x_{i,j} \in [0,1]$ for all $i$.
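
The standardization formula above can be sketched as follows. One design choice here is an assumption on my part rather than something the notebook does: $\mu_j$ and $\sigma_j$ are computed on the train set and reused for the test set. The helper name `standardize` and the toy values are hypothetical.

```python
import pandas as pd


def standardize(train_df, test_df, cols):
    """Z-score the given columns using the train mean and standard deviation."""
    mu = train_df[cols].mean()
    sigma = train_df[cols].std()
    sigma = sigma.replace(0, 1.0)  # leave constant columns unscaled instead of dividing by zero
    train_out, test_out = train_df.copy(), test_df.copy()
    train_out[cols] = (train_df[cols] - mu) / sigma
    test_out[cols] = (test_df[cols] - mu) / sigma
    return train_out, test_out


toy_train = pd.DataFrame({"TransactionAmt": [1.0, 2.0, 3.0, 4.0]})
toy_test = pd.DataFrame({"TransactionAmt": [5.0]})

train_std, test_std = standardize(toy_train, toy_test, ["TransactionAmt"])
print(train_std["TransactionAmt"].mean(), train_std["TransactionAmt"].std())
print(test_std)
```

With this choice the train columns have mean 0 and standard deviation 1 exactly, while the test columns only approximately do.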

In [None]:
# def normalize(df):
#     """
#     Normalize numerical variables
    
#     Args:
#         df (pd.DataFrame): dataframe to be normalized
        
#     Returns:
#         df (pd.Dataframe): dataframe where each column has mean 0
#     """
#     for feature in df.columns:
#         if df[feature].dtype != 'object': # if it is numerical
#             mu = df[feature].mean()
#             sd = df[feature].std()
#             df[feature] = (df[feature] - mu) / sd
            
#             # verify mean is 0
#             mu_after = df[feature].mean()
#             #print(feature, mu_after) # checks out
            
#     return df
            

## Skip normalize to see effect on performance
* Pre-processed XGBoost score `.878` vs. `.938` with no pre-processing
* **Note:** After removing normalization for XGBoost, performance jumped from `.878` to `.932`. Normalization may only be necessary or helpful with neural nets and similar algorithms

In [None]:
# train = normalize(train)
# test = normalize(test)

# Old XGBoost Hyperparameter tuning - skip down to Neural Net

## XGBoost (Old Code)
* https://developer.ibm.com/code/2018/06/20/handle-imbalanced-data-sets-xgboost-scikit-learn-python-ibm-watson-studio/
* XGBoost is an extreme gradient boosting algorithm based on trees that tends to perform very well out of the box compared to other ML algorithms.
* XGBoost is popular with data scientists and is one of the most common ML algorithms used in Kaggle Competitions.
* XGBoost allows you to tune various parameters.
* XGBoost allows parallel processing.

## Fit the XGBoost Classifier Again using Cross Validation (Old Code)
* See how the performance differs after imputing values and normalizing data
* The baseline score was `.938`

In [None]:
# import xgboost as xgb
# from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV, RandomizedSearchCV

### Hyperparameter tuning with GridSearch and RandomizedSearch
* [XGBoost parameters](https://xgboost.readthedocs.io/en/latest/parameter.html)
* [GridSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) - exhaustive search over the parameter grid; expensive, but finds the best combination within the grid. **Takes too much RAM** here
* set `scale_pos_weight` to adjust for class imbalance; it is common to use `sum(neg samples) / sum(pos samples)`, which would be about 28 in this data set (3.5% fraud)
* control overfitting: `max_depth`, `min_child_weight`, `gamma` per xgboost docs
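
The `scale_pos_weight` ratio can be computed directly from the labels. The sketch below uses a synthetic label vector with the same 3.5% fraud rate as the train set; on the real data one would pass `y_train` instead.

```python
import numpy as np

# synthetic labels: 35 fraud cases out of 1000 (3.5%)
y = np.array([1] * 35 + [0] * 965)

n_pos = int((y == 1).sum())
n_neg = int((y == 0).sum())

# ratio of negative to positive samples
scale_pos_weight = n_neg / n_pos
print(round(scale_pos_weight, 2))
```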

## TODO
* GridSearchCV and RandomizedSearchCV take too much RAM
* Will write my own grid search loop and be more efficient with RAM
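
A minimal sketch of such a hand-written loop is shown below. To keep it runnable without a GPU or XGBoost, it uses scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in for `xgb.XGBClassifier`; the grid values are made up. The split is done once outside the loop, and only one fitted model is held in memory at a time.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic, imbalanced stand-in for the real training data
X, y = make_classification(n_samples=400, n_features=10, weights=[0.9], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

grid = {"n_estimators": [20, 40], "max_depth": [2, 3]}
results = []
for n_est, depth in product(grid["n_estimators"], grid["max_depth"]):
    model = GradientBoostingClassifier(n_estimators=n_est, max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    score = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    results.append((score, n_est, depth))
    del model  # drop the fitted model before training the next one

best_score, best_n_est, best_depth = max(results)
print(best_score, best_n_est, best_depth)
```

Storing only the scores and parameters keeps memory flat across iterations; the best configuration can be refit on the full training set afterwards.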

In [None]:
# y = y_train.astype('category')

In [None]:
# # grid of parameters, use GridSearch to find best combination
# n_estimators = [400, 550, 700]
# gamma = [.5, 1, 3]
# max_depth = [6, 8, 10]

In [None]:
# import time

# start = time.time()

# # try all combinations of parameters
# for n_est in n_estimators:
#     for md in max_depth: 
#         for g in gamma:               
#                 # train/test split, hopefully with a large dataset this is sufficient to estimate roc auc
#                 # (y_tr instead of y_train, so the full y_train is not overwritten and deleted below)
#                 X_train, X_test, y_tr, y_valid = train_test_split(train, y, test_size=.3, random_state=RANDOM_SEED, shuffle=True)
                
#                 # fit
#                 clf = xgb.XGBClassifier(n_estimators=n_est,
#                                         gamma=g,
#                                         max_depth=md,
#                                         missing=nan_replace,
#                                         subsample=.8,
#                                         colsample_bytree=.8,
#                                         scale_pos_weight=20, # to correct for class imbalance
#                                         random_state=RANDOM_SEED,
#                                         tree_method='gpu_hist')
                
#                 # fit with these parameters
#                 clf.fit(X_train, y_tr)
#                 del X_train, y_tr
                
#                 # predict on test/ estimate roc_auc, pick model with
#                 y_pred = clf.predict_proba(X_test)
#                 del X_test, clf
                
#                 print(roc_auc_score(y_valid, y_pred[:,1]), 'with parameters n_estimators={}, max_depth={}, gamma={},'.format(n_est, md, g))
#                 del y_valid, y_pred
                
#                 now = time.time()
#                 print('ELAPSED TIME:', now-start, 'seconds')
                
#                 # print(train.memory_usage().sum() / 1024**3, 'GB')
#                 # print(y.memory_usage() / 1024**3, 'GB\n')
                
#                 # train = reduce_mem_usage(train)
                
#                 # give RAM time to clear
#                 time.sleep(10)
                

### ^ Stopped early
* Changing `gamma` does not seem to affect the performance
* Adding more estimators and a larger max depth improves ROC AUC on the held-out validation split, but has not led to improvement on the leaderboard test set

In [None]:
# %%time

# # define xgboost classifier
# clf = xgb.XGBClassifier(n_estimators=400,
#                             gamma=1,
#                             max_depth=6,
#                             missing=nan_replace,
#                             subsample=.8,
#                             colsample_bytree=.8,
#                             scale_pos_weight=20, # to correct for class imbalance
#                             random_state=RANDOM_SEED,
#                             tree_method='gpu_hist')
    
# # fit classifier
# clf.fit(train, y_train)
# del train, y_train

In [None]:
# sample_submission = pd.read_csv(os.path.join(input_path, 'sample_submission.csv'), index_col='TransactionID')

# y_pred = clf.predict_proba(test)
# del clf, test

# sample_submission['isFraud'] = y_pred[:,1]
# del y_pred
# sample_submission.to_csv('xgboost_with_tuning2.csv')

# Neural Network