### 'IEEE_CIS' Dataset is of high memory and we will explore how to handle high memory datasets in this kernel

 # 'IEEE-CIS' Fraud Detection Data
 
 
 
 #### The data is broken into two files identity and transaction, which are joined by TransactionID.
 
 #### Categorical Features - Transaction

1. ProductCD
2. emaildomain
3. card1 - card6
4. addr1, addr2
5. P_emaildomain
6. R_emaildomain
7. M1 - M9

#### Categorical Features - Identity

1. DeviceType
2. DeviceInfo
3. id_12 - id_38
 
 
 
 


In [1]:
import pandas as pd
import numpy as np
import gc
import os

In [2]:
# Kaggle input path
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/ieee-fraud-detection/test_identity.csv
/kaggle/input/ieee-fraud-detection/train_identity.csv
/kaggle/input/ieee-fraud-detection/test_transaction.csv
/kaggle/input/ieee-fraud-detection/sample_submission.csv
/kaggle/input/ieee-fraud-detection/train_transaction.csv


### Read train(transaction and identity) and test(transaction and identity) data

In [3]:
# Read train data
train_trans = pd.read_csv("/kaggle/input/ieee-fraud-detection/train_transaction.csv")
train_identity = pd.read_csv("/kaggle/input/ieee-fraud-detection/train_identity.csv")

# Read test data
test_trans = pd.read_csv("/kaggle/input/ieee-fraud-detection/test_transaction.csv")
test_identity = pd.read_csv("/kaggle/input/ieee-fraud-detection/test_identity.csv")

### Combine 'transaction' and 'identity' data

In [4]:
# Train data (Combine 'train_identity' and 'train_trans')
df_train = train_trans.merge(train_identity, how='left', left_index=True, right_index=True, on='TransactionID')

# Test data (Combine 'test_identity' and 'test_trans')
df_test =  test_trans.merge(test_identity, how='left', left_index=True, right_index=True, on='TransactionID')

### In this kernel, we will delete temporary storage to provide space to RAM or else session crashes due to shortage of RAM

In [5]:
del train_trans, train_identity, test_trans, test_identity; x = gc.collect()

### 'train' and 'test' dataset Memory

In [6]:
print('train data memory in MB:', df_train.memory_usage().sum() / 1024**2) 
print('test data memory in MB:', df_test.memory_usage().sum() / 1024**2) 

train data memory in MB: 1955.3709106445312
test data memory in MB: 1673.8679428100586


### As the memory of train and test dataset is high and our RAM space is low, we will try to reduce the dataset memory

In [7]:
# Most often used function to reduce dataset memory

def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage of properties dataframe is :",start_mem_usg," MB")
    NAlist = [] # Keeps track of columns that have missing values filled in. 
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings
            
            # Print current column type
            #print("******************************")
            #print("Column: ",col)
            #print("dtype before: ",props[col].dtype)
            
            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()
            
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(props[col]).all(): 
                NAlist.append(col)
                props[col].fillna(mn-1,inplace=True)  
                   
            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True

            
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                      props[col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64) 
            # Make float datatypes 32 bit
            else:
                props[col] = props[col].astype(np.float32)
            
            # Print new column type
            #print("dtype after: ",props[col].dtype)
            #print("******************************")
    
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return props

In [8]:
reduce_mem_usage(df_train)
reduce_mem_usage(df_test)

Memory usage of properties dataframe is : 1955.3709106445312  MB
___MEMORY USAGE AFTER COMPLETION:___
Memory usage is:  546.287467956543  MB
This is  27.937792517148328 % of the initial size
Memory usage of properties dataframe is : 1673.8679428100586  MB
___MEMORY USAGE AFTER COMPLETION:___
Memory usage is:  459.05740547180176  MB
This is  27.42494755596697 % of the initial size


Unnamed: 0,TransactionID,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,...,id-31,id-32,id-33,id-34,id-35,id-36,id-37,id-38,DeviceType,DeviceInfo
0,3663549,18403224,31.950001,W,10409,111,150,visa,226,debit,...,chrome 67.0 for android,7,,,F,F,T,F,mobile,MYA-L13 Build/HUAWEIMYA-L13
1,3663550,18403263,49.000000,W,4272,111,150,visa,226,debit,...,chrome 67.0 for android,24,1280x720,match_status:2,T,F,T,T,mobile,LGLS676 Build/MXB48T
2,3663551,18403310,171.000000,W,4476,574,150,visa,226,debit,...,ie 11.0 for tablet,7,,,F,T,T,F,desktop,Trident/7.0
3,3663552,18403310,284.950012,W,10989,360,150,visa,166,debit,...,chrome 67.0 for android,7,,,F,F,T,F,mobile,MYA-L13 Build/HUAWEIMYA-L13
4,3663553,18403317,67.949997,W,18018,452,150,mastercard,117,debit,...,chrome 67.0 for android,7,,,F,F,T,F,mobile,SM-G9650 Build/R16NW
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
506686,4170235,34214279,94.679001,C,13832,375,185,mastercard,224,debit,...,,7,,,,,,,,
506687,4170236,34214287,12.173000,C,3154,408,185,mastercard,224,debit,...,,7,,,,,,,,
506688,4170237,34214326,49.000000,W,16661,490,150,visa,226,debit,...,,7,,,,,,,,
506689,4170238,34214337,202.000000,W,16621,516,150,mastercard,224,debit,...,,7,,,,,,,,


### Conclusion: 
1. Train dataset memory reduced from '1955' MB to '546' MB
2. Test  dataset memory reduced from '1673' MB to '459' MB

## Combine train and test data

In [9]:
df_test['isFraud'] = 'test'
df = pd.concat([df_train, df_test], axis = 0, sort=False)
df = df.reset_index()
df.drop('index', axis=1, inplace = True)

In [10]:
del df_train, df_test; x = gc.collect()

In [11]:
df.head(2)

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id-29,id-30,id-31,id-32,id-33,id-34,id-35,id-36,id-37,id-38
0,2987000,0,86400,68.5,W,13926,99,150,discover,142,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404,150,mastercard,102,...,,,,,,,,,,


### Understanding the columns
1. transaction related columns
2. card related columns
3. addr,dist,domain related columns
4. C columns
5. D columns
6. M columns
7. V columns
8. others ('identity' columns along with 'device' information)

In [12]:
# Transaction columns
df.columns[0:5]

Index(['TransactionID', 'isFraud', 'TransactionDT', 'TransactionAmt',
       'ProductCD'],
      dtype='object')

In [13]:
# Card related columns
df.columns[5:11]

Index(['card1', 'card2', 'card3', 'card4', 'card5', 'card6'], dtype='object')

In [14]:
#  addr, dist, emaildomain related columns
df.columns[11:17]

Index(['addr1', 'addr2', 'dist1', 'dist2', 'P_emaildomain', 'R_emaildomain'], dtype='object')

In [15]:
# C columns
df.columns[17:31]

Index(['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11',
       'C12', 'C13', 'C14'],
      dtype='object')

In [16]:
# D columns
df.columns[31:46]

Index(['D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10', 'D11',
       'D12', 'D13', 'D14', 'D15'],
      dtype='object')

In [17]:
# M columns
df.columns[46:55]

Index(['M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9'], dtype='object')

In [18]:
# V columns
df.columns[55:394]

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       ...
       'V330', 'V331', 'V332', 'V333', 'V334', 'V335', 'V336', 'V337', 'V338',
       'V339'],
      dtype='object', length=339)

In [19]:
# Identity columns
df.columns[394:]

Index(['id_01', 'id_02', 'id_03', 'id_04', 'id_05', 'id_06', 'id_07', 'id_08',
       'id_09', 'id_10', 'id_11', 'id_12', 'id_13', 'id_14', 'id_15', 'id_16',
       'id_17', 'id_18', 'id_19', 'id_20', 'id_21', 'id_22', 'id_23', 'id_24',
       'id_25', 'id_26', 'id_27', 'id_28', 'id_29', 'id_30', 'id_31', 'id_32',
       'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38', 'DeviceType',
       'DeviceInfo', 'id-01', 'id-02', 'id-03', 'id-04', 'id-05', 'id-06',
       'id-07', 'id-08', 'id-09', 'id-10', 'id-11', 'id-12', 'id-13', 'id-14',
       'id-15', 'id-16', 'id-17', 'id-18', 'id-19', 'id-20', 'id-21', 'id-22',
       'id-23', 'id-24', 'id-25', 'id-26', 'id-27', 'id-28', 'id-29', 'id-30',
       'id-31', 'id-32', 'id-33', 'id-34', 'id-35', 'id-36', 'id-37', 'id-38'],
      dtype='object')

### Summary
**It's been observed that, V columns are in large number (around 340). So we can either ignore all V columns or apply PCA for all V columns in order to reduce the columns/memory.**

**In this kernel, we will apply PCA for V columns in order not to lose any information.**

#### Email Mappings

In [20]:
emails = {'gmail': 'google', 'att.net': 'att', 'twc.com': 'spectrum', 
          'scranton.edu': 'other', 'optonline.net': 'other', 'hotmail.co.uk': 'microsoft',
          'comcast.net': 'other', 'yahoo.com.mx': 'yahoo', 'yahoo.fr': 'yahoo',
          'yahoo.es': 'yahoo', 'charter.net': 'spectrum', 'live.com': 'microsoft', 
          'aim.com': 'aol', 'hotmail.de': 'microsoft', 'centurylink.net': 'centurylink',
          'gmail.com': 'google', 'me.com': 'apple', 'earthlink.net': 'other', 'gmx.de': 'other',
          'web.de': 'other', 'cfl.rr.com': 'other', 'hotmail.com': 'microsoft', 
          'protonmail.com': 'other', 'hotmail.fr': 'microsoft', 'windstream.net': 'other', 
          'outlook.es': 'microsoft', 'yahoo.co.jp': 'yahoo', 'yahoo.de': 'yahoo',
          'servicios-ta.com': 'other', 'netzero.net': 'other', 'suddenlink.net': 'other',
          'roadrunner.com': 'other', 'sc.rr.com': 'other', 'live.fr': 'microsoft',
          'verizon.net': 'yahoo', 'msn.com': 'microsoft', 'q.com': 'centurylink', 
          'prodigy.net.mx': 'att', 'frontier.com': 'yahoo', 'anonymous.com': 'other', 
           'rocketmail.com': 'yahoo', 'sbcglobal.net': 'att', 'frontiernet.net': 'yahoo', 
          'ymail.com': 'yahoo', 'outlook.com': 'microsoft', 'mail.com': 'other', 
          'bellsouth.net': 'other', 'embarqmail.com': 'centurylink', 'cableone.net': 'other', 
          'hotmail.es': 'microsoft', 'mac.com': 'apple', 'yahoo.co.uk': 'yahoo', 'netzero.com': 'other', 
          'yahoo.com': 'yahoo', 'live.com.mx': 'microsoft', 'ptd.net': 'other', 'cox.net': 'other',
          'aol.com': 'aol', 'juno.com': 'other', 'icloud.com': 'apple'}

In [21]:
col = ['P_emaildomain', 'R_emaildomain']

for x in col:
    df[x + '_bin'] = df[x].map(emails)

### Label Encoding

In [22]:
from sklearn import preprocessing

In [23]:
for col in df.drop('isFraud', axis = 1).columns:    
    if df[col].dtype == 'object':
        le = preprocessing.LabelEncoder()
        le.fit(list(df[col].values))
        df[col] = le.transform(list(df[col].values))

In [24]:
df.memory_usage().sum() / 1024**2

1524.6064138412476

In [25]:
reduce_mem_usage(df)

Memory usage of properties dataframe is : 1524.6064138412476  MB
___MEMORY USAGE AFTER COMPLETION:___
Memory usage is:  1221.1501169204712  MB
This is  80.09608944539215 % of the initial size


Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id-31,id-32,id-33,id-34,id-35,id-36,id-37,id-38,P_emaildomain_bin,R_emaildomain_bin
0,2987000,0,86400,68.500000,4,13926,99,150,1,142,...,107,6,390,2,2,2,2,2,6,6
1,2987001,0,86401,29.000000,4,2755,404,150,2,102,...,107,6,390,2,2,2,2,2,4,6
2,2987002,0,86469,59.000000,4,4663,490,150,4,166,...,107,6,390,2,2,2,2,2,5,6
3,2987003,0,86499,50.000000,4,18132,567,150,2,117,...,107,6,390,2,2,2,2,2,9,6
4,2987004,0,86506,50.000000,1,4497,514,150,2,102,...,107,6,390,2,2,2,2,2,4,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1097226,4170235,test,34214279,94.679001,0,13832,375,185,2,224,...,107,7,390,2,2,2,2,2,4,4
1097227,4170236,test,34214287,12.173000,0,3154,408,185,2,224,...,107,7,390,2,2,2,2,2,5,5
1097228,4170237,test,34214326,49.000000,4,16661,490,150,4,226,...,107,7,390,2,2,2,2,2,5,6
1097229,4170238,test,34214337,202.000000,4,16621,516,150,2,224,...,107,7,390,2,2,2,2,2,5,6


### I have tried to apply PCA for V columns using the entire dataset 'df' at once, but session crashed due to low RAM.

### Kaggle has RAM usage of 13GB. So i need to split the data and perform PCA

### Split the data back into train and test 

In [26]:
df_train, df_test = df[df['isFraud'] != 'test'], df[df['isFraud'] == 'test'].drop('isFraud', axis=1)

In [27]:
del df; x = gc.collect()

### Applying PCA for V columns

In [28]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import minmax_scale

In [29]:
v_columns = df_train.columns[55:394]
v_columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       ...
       'V330', 'V331', 'V332', 'V333', 'V334', 'V335', 'V336', 'V337', 'V338',
       'V339'],
      dtype='object', length=339)

In [30]:
# fill NaN values and scale the data using scalar function

# for train data
for col in v_columns:
    df_train[col] = df_train[col].fillna((df_train[col].min() - 2))
    df_train[col] = (minmax_scale(df_train[col], feature_range=(0,1)))

# for test data
for col in v_columns:
    df_test[col] = df_test[col].fillna((df_test[col].min() - 2))
    df_test[col] = (minmax_scale(df_test[col], feature_range=(0,1)))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-d

In [31]:
def func_pca(df, v_columns, prefix):
    
    pca = PCA(n_components = 30, random_state = 1)
    pca = pca.fit_transform(df[v_columns])
    pca_df = pd.DataFrame(pca)
    df.drop(v_columns, axis=1, inplace=True)
    pca_df.rename(columns=lambda x: str(prefix)+str(x), inplace=True)
    df = pd.concat([df, pca_df], axis=1)
    
    return df

In [32]:
train = func_pca(df_train, v_columns = v_columns, prefix = 'PCA_V_')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [33]:
del df_train; x= gc.collect()

In [34]:
test = func_pca(df_test, v_columns = v_columns, prefix = 'PCA_V_')

In [35]:
del df_test; x= gc.collect()

In [36]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 590540 entries, 0 to 590539
Columns: 165 entries, TransactionID to PCA_V_29
dtypes: float32(5), float64(30), int16(2), int32(6), int8(18), object(1), uint16(16), uint32(26), uint8(61)
memory usage: 292.3+ MB


In [37]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1013382 entries, 0 to 1097230
Columns: 164 entries, TransactionID to PCA_V_29
dtypes: float32(5), float64(159)
memory usage: 1.2 GB


### Summary

**Still test dataset is having 1.2 GB memory. Function that we have used to reduce the memory is not effective.**

**There is a simple approach to reduce dataset memory. Just convert float64 into float32 and int64 into int32**

In [38]:
for col in test.columns:
    if test[col].dtype=='float64': test[col] = test[col].astype('float32')    

In [39]:
for col in train.columns:
    if train[col].dtype=='float64': train[col] = train[col].astype('float32')
    if train[col].dtype=='int64': train[col] = train[col].astype('int32')      

In [40]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1013382 entries, 0 to 1097230
Columns: 164 entries, TransactionID to PCA_V_29
dtypes: float32(164)
memory usage: 641.7 MB


### Conclusion: Almost 50% of test dataset memory is reduced with simple approach

In [41]:
train.head(2)

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,PCA_V_20,PCA_V_21,PCA_V_22,PCA_V_23,PCA_V_24,PCA_V_25,PCA_V_26,PCA_V_27,PCA_V_28,PCA_V_29
0,2987000,0,86400,68.5,4,13926,99,150,1,142,...,-0.002375,-0.00515,-0.000375,-0.00229,0.00653,0.000311,6.4e-05,-7.4e-05,0.000104,0.001011
1,2987001,0,86401,29.0,4,2755,404,150,2,102,...,-0.000328,-0.003118,-0.000247,-0.002412,0.004743,-0.000136,-6.1e-05,0.000177,-0.000407,0.000631


### Build the model

In [42]:
# spilt the train data for 'training' and 'validation'.

# train index
idxT = train.index[:3*len(train)//4]

# Validation index
idxV = train.index[3*len(train)//4:]

In [43]:
# only X columns
cols = train.columns.difference(['isFraud'])
cols

Index(['C1', 'C10', 'C11', 'C12', 'C13', 'C14', 'C2', 'C3', 'C4', 'C5',
       ...
       'id_29', 'id_30', 'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36',
       'id_37', 'id_38'],
      dtype='object', length=164)

In [44]:
# Model
import xgboost as xgb

In [45]:
# xgb.XGBClassifier?

In [46]:
clf = xgb.XGBClassifier(n_estimators = 300, eval_metric = 'auc')

In [47]:
X_train = train.loc[idxT, cols]
y_train = train['isFraud'][idxT]

X_val = train.loc[idxV, cols]
y_val = train['isFraud'][idxV]

In [48]:
 clf.fit(X_train, y_train,eval_set=[(X_val, y_val)],verbose=50, early_stopping_rounds=100)

[0]	validation_0-auc:0.79745
Will train until validation_0-auc hasn't improved in 100 rounds.
[50]	validation_0-auc:0.89114
[100]	validation_0-auc:0.89577
[150]	validation_0-auc:0.89778
[200]	validation_0-auc:0.89659
[250]	validation_0-auc:0.89703
Stopping. Best iteration:
[157]	validation_0-auc:0.89818



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eval_metric='auc',
              gamma=0, gpu_id=-1, importance_type='gain',
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=300, n_jobs=0,
              num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=1, tree_method='exact',
              validate_parameters=1, verbosity=None)

## Conclusion: 

#### Without using any feature engineering or optimizing the model, we have acheived validation accuracy around 90% (which is not bad). 

#### Now the model is ready and we can predict the 'Y' for test data.

### References for feature engineering, EDA and some advance techniques.

https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600

https://www.kaggle.com/alijs1/ieee-transaction-columns-reference

https://www.kaggle.com/kabure/extensive-eda-and-modeling-xgb-hyperopt/comments

#### Thank you for reading the kernel, hope you find it useful:)
