# Credit Card Transaction Fraud Detection Project
The purpose of this project is to build a fraud prediction model that identifies fraudulent credit card transactions among the type “P” transaction records in a government card transaction data set. The data contains around 100,000 records, each with card information, date, merchant information and location information for the transaction.

We filled in the missing values in the dataset using the most frequent value within specific groups, then created more than 300 variables (amount variables, frequency variables, days-since variables) based on the original dataset.

To reduce the dimensionality of the dataset, we first computed the KS (Kolmogorov-Smirnov) score and the FDR (Fraud Detection Rate) of each variable and dropped half of the variables based on the combined ranking of these two scores. We then used a backward stepwise logistic regression algorithm to select 20 variables for the model fitting process.

For the training and test sets, after setting aside the out-of-time records, we drew 10 random training/test splits to better assess the performance of each model. We then fit logistic regression, boosted tree, random forest and support vector machine models on the training sets and compared their predictions on the test sets. The random forest model yielded the best result.

## Load Data

In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
card= pd.read_excel('Raw data_card transactions.xlsx')
card = card[card['Transtype']=='P'] #We only study the Transtype P here
card.head()

Unnamed: 0,Recnum,Cardnum,Date,Merchnum,Merch description,Merch state,Merch zip,Transtype,Amount,Fraud
0,1,5142190439,2010-01-01,5509006296254,FEDEX SHP 12/23/09 AB#,TN,38118.0,P,3.62,0
1,2,5142183973,2010-01-01,61003026333,SERVICE MERCHANDISE #81,MA,1803.0,P,31.42,0
2,3,5142131721,2010-01-01,4503082993600,OFFICE DEPOT #191,MD,20706.0,P,178.49,0
3,4,5142148452,2010-01-01,5509006296254,FEDEX SHP 12/28/09 AB#,TN,38118.0,P,3.62,0
4,5,5142190439,2010-01-01,5509006296254,FEDEX SHP 12/23/09 AB#,TN,38118.0,P,3.62,0


## Fill in Missing Values

In [2]:
card.info() # There are NAs in Merchnum, Merch state and Merch zip
## We do not fill in Merchnum: it and Merch description both identify the merchant,
## and Merch description has no NAs.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96398 entries, 0 to 96752
Data columns (total 10 columns):
Recnum               96398 non-null int64
Cardnum              96398 non-null int64
Date                 96398 non-null datetime64[ns]
Merchnum             93199 non-null object
Merch description    96398 non-null object
Merch state          95377 non-null object
Merch zip            92097 non-null float64
Transtype            96398 non-null object
Amount               96398 non-null float64
Fraud                96398 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(3), object(4)
memory usage: 8.1+ MB


In [3]:
## Fill missing zips with the most common zip for the same Merch description; the filled field is Merch zip_x
merNum_Des = card.dropna(axis=0).groupby('Merch description').agg({'Merch zip':lambda x:x.value_counts().index[0]})
newcard = card.merge(merNum_Des,right_index=True,left_on='Merch description',how='left')
newcard['Merch zip_x'] = newcard['Merch zip_x'].fillna(newcard['Merch zip_y'])

In [4]:
## Fill the remaining missing zips with the most common zip for the same Cardnum; the filled field is Merch zip_x_x
merNum_Des = newcard.dropna(axis=0).groupby('Cardnum').agg({'Merch zip_x':lambda x:x.value_counts().index[0]})
newcard = newcard.merge(merNum_Des,right_index=True,left_on='Cardnum',how='left')
newcard['Merch zip_x_x'] = newcard['Merch zip_x_x'].fillna(newcard['Merch zip_x_y'])
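The two fill steps above follow the same pattern: find the most frequent value of a field within a group, then use it wherever that field is missing. A minimal sketch of the pattern on made-up data is shown here. The helper name `fill_with_group_mode` and the toy rows are assumptions for illustration, and unlike the cells above the sketch ignores missing values only in the target column rather than dropping every row that has an NA anywhere.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(df, group_col, target_col):
    """Return target_col with NaNs replaced by the most frequent
    non-null value of target_col within each group_col group."""
    def most_common(s):
        counts = s.dropna().value_counts()
        # A group with no known value stays missing
        return counts.index[0] if len(counts) else np.nan

    group_mode = df.groupby(group_col)[target_col].agg(most_common)
    return df[target_col].fillna(df[group_col].map(group_mode))

# Hypothetical toy data: two merchants, each with one missing zip
toy = pd.DataFrame({
    'Merch description': ['FEDEX', 'FEDEX', 'FEDEX', 'OFFICE DEPOT', 'OFFICE DEPOT'],
    'Merch zip': [38118.0, 38118.0, np.nan, 20706.0, np.nan],
})
toy['Merch zip'] = fill_with_group_mode(toy, 'Merch description', 'Merch zip')
print(toy)
```

Mapping the group value back through `map` avoids the suffixed `_x` / `_y` columns that a merge creates, at the cost of handling one target column at a time.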

In [5]:
newcard.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96398 entries, 0 to 96752
Data columns (total 12 columns):
Recnum               96398 non-null int64
Cardnum              96398 non-null int64
Date                 96398 non-null datetime64[ns]
Merchnum             93199 non-null object
Merch description    96398 non-null object
Merch state          95377 non-null object
Merch zip_x_x        96344 non-null float64
Transtype            96398 non-null object
Amount               96398 non-null float64
Fraud                96398 non-null int64
Merch zip_y          93185 non-null float64
Merch zip_x_y        96334 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(3), object(4)
memory usage: 9.6+ MB


In [5]:
## Fill missing states with the most common state for the same zip; the filled field is Merch state_x
a = newcard[newcard['Merch zip_x_x'].notna()] 
b = a[a['Merch state'].notna()] ## b is data without NAs in zip_x_x and state
merNum_Des = b.groupby('Merch zip_x_x').agg({'Merch state':lambda x:x.value_counts().index[0]})
newcard = newcard.merge(merNum_Des,right_index=True,left_on='Merch zip_x_x',how='left')
newcard['Merch state_x'] = newcard['Merch state_x'].fillna(newcard['Merch state_y'])
newcard.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96398 entries, 0 to 96752
Data columns (total 13 columns):
Recnum               96398 non-null int64
Cardnum              96398 non-null int64
Date                 96398 non-null datetime64[ns]
Merchnum             93199 non-null object
Merch description    96398 non-null object
Merch state_x        96290 non-null object
Merch zip_x_x        96344 non-null float64
Transtype            96398 non-null object
Amount               96398 non-null float64
Fraud                96398 non-null int64
Merch zip_y          93185 non-null float64
Merch zip_x_y        96334 non-null float64
Merch state_y        96277 non-null object
dtypes: datetime64[ns](1), float64(4), int64(3), object(5)
memory usage: 10.3+ MB


In [6]:
newcard[newcard['Merch state_x'].isna()].head()

Unnamed: 0,Recnum,Cardnum,Date,Merchnum,Merch description,Merch state_x,Merch zip_x_x,Transtype,Amount,Fraud,Merch zip_y,Merch zip_x_y,Merch state_y
3258,3259,5142153880,2010-01-14,582582822587,DIGITAL TECHNOLOGY CONTRA,,926.0,P,2340.0,0,,20746.0,
3262,3263,5142154098,2010-01-14,582582822587,DIGITAL TECHNOLOGY CONTRA,,926.0,P,2387.0,0,,20639.0,
3540,3541,5142154098,2010-01-17,582582822587,DIGITAL TECHNOLOGY CONTRA,,926.0,P,2300.0,0,,20639.0,
3642,3643,5142153880,2010-01-17,582582822587,DIGITAL TECHNOLOGY CONTRA,,926.0,P,2500.0,0,,20746.0,
4969,4970,5142194136,2010-01-24,597597721468,CRISTALIA ACQUISITION COR,,929.0,P,83.0,0,,90640.0,


In [7]:
## We looked up the zip codes that still had a missing state on Google to find their corresponding state.
## The mapping is named zip_to_state so that it does not shadow the built-in dict.
zip_to_state = {"907.0":"PR", "922.0":"PR", "920.0":"PR", "801.0":"USVI","31040.0":"GA", "41160.0":"KY", "934.0": "PR",
"902.0": "PR", "738.0": "PR", "90805.0": "CA", "76302.0": "TX", "914.0": "PR", "95461.0": "CA", "50823.0": "Other", 
'926.0': "PR", '929.0':"PR", '1400.0':"Other", '65132.0':"Other", '86899.0':"Other", '23080.0':"Other",
'60528.0':"Other", "48700.0": "CA", "680.0": "PR", "681.0": "PR", "623.0": "PR", "726.0": "PR", "936.0": "PR",
"791.0": "PR", "12108.0": "Other", "nan":'Other'}

In [8]:
## We use the previous dictionary to fill in the missing state fields.
## .copy() avoids chained assignment on a slice of newcard.
ab = newcard[newcard['Merch state_x'].isna()].copy()
ab['Merch zip_x_x']=ab['Merch zip_x_x'].astype('str')
ab['Merch state_x']=ab['Merch zip_x_x'].map(zip_to_state)

In [9]:
## We replace the values in newcard dataset with filled data
ac = newcard[newcard['Merch state_x'].notna()][['Merch state_x']]
ad = pd.concat([ab[['Merch state_x']],ac])
newcard['Merch state_x'] = ad

In [10]:
## No NAs in field state now
newcard[newcard['Merch state_x'].isna()]

Unnamed: 0,Recnum,Cardnum,Date,Merchnum,Merch description,Merch state_x,Merch zip_x_x,Transtype,Amount,Fraud,Merch zip_y,Merch zip_x_y,Merch state_y


In [11]:
# Zips that are still unknown are coded as 0
newcard['Merch zip_x_x'] = newcard['Merch zip_x_x'].fillna(0)

In [12]:
# Select the useful columns and change the column name
fdata = newcard[["Recnum","Cardnum","Date","Merch description","Merch state_x","Merch zip_x_x","Transtype","Amount","Fraud"]]
fdata.columns = ["Recnum","Cardnum","Date","Merch description","Merch state","Merch zip","Transtype","Amount","Fraud"]

In [13]:
fdata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 96398 entries, 0 to 96752
Data columns (total 9 columns):
Recnum               96398 non-null int64
Cardnum              96398 non-null int64
Date                 96398 non-null datetime64[ns]
Merch description    96398 non-null object
Merch state          96398 non-null object
Merch zip            96398 non-null float64
Transtype            96398 non-null object
Amount               96398 non-null float64
Fraud                96398 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(3), object(3)
memory usage: 7.4+ MB


## Feature Engineering
The variables were created with the "data.table" package in R, whose code can **generate the 300 rolling-window variables within 30 seconds**. By comparison, an efficient algorithm written in Python without any additional package (shown below) usually takes at least 15 minutes to do the same thing. The figure shows what the 300 variables mean.
<img src="png1.png">
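To make the rolling-window variables concrete, the sketch below builds a 3-day sum, a 3-day count and the actual-to-sum ratio per card on made-up data. The toy rows and the column names derived from them are assumptions for illustration; the window string is written as `'3D'` here, and the approach (sort, index by date, group, roll) is the same one the Python cell uses.

```python
import pandas as pd

# Hypothetical toy transactions for two cards
toy = pd.DataFrame({
    'Cardnum': [1, 1, 1, 2, 2],
    'Date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-05',
                            '2010-01-01', '2010-01-03']),
    'Amount': [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Sort by group then date so the grouped result lines up with the rows
toy = toy.sort_values(['Cardnum', 'Date']).reset_index(drop=True)
rolled = toy.set_index('Date').groupby('Cardnum')['Amount'].rolling('3D')

# Each window covers the 3 days ending at (and including) the current row
toy['sum_Cardnum_3d'] = rolled.sum().values
toy['count_Cardnum_3d'] = rolled.count().values
toy['Actual/sum_Cardnum_3d'] = toy['Amount'] / toy['sum_Cardnum_3d']
print(toy)
```

The window is open on the left, so for card 1 the transaction on 2010-01-02 is no longer counted on 2010-01-05.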

In [15]:

##  Cardnum, Merchnum
for groupbyvar in ['Cardnum', 'Merchnum']:
    data_sorted = data_sorted.sort_values(by = [groupbyvar, 'Date'])
    data_sorted_index = data_sorted.set_index('Date')
    for agg in ['mean', 'max', 'median', 'sum', 'count']: 
        for days in ['1d', '3d', '7d', '14d', '30d']:
            data_sorted[agg + '_' + groupbyvar + "_" + days]=getattr(data_sorted_index.groupby(groupbyvar)['Amount'].rolling(days),agg)().values
            data_sorted['Actual/' + agg + "_" + groupbyvar + "_" + days] = data_sorted['Amount']/data_sorted[agg + '_' + groupbyvar + "_" + days]

##  Cardnum and Merchnum, Cardnum and Zip, Cardnum and state
for groupbyvar in ['Merchnum', 'Merch zip', 'Merch state']:
    data_sorted = data_sorted.sort_values(by = ['Cardnum',groupbyvar, 'Date'])
    data_sorted_index = data_sorted.set_index('Date')
    for agg in ['mean', 'max', 'median', 'sum', 'count']: 
        for days in ['1d', '3d', '7d', '14d', '30d']:
            data_sorted[agg + '_' + "Cardnum_" + groupbyvar + "_" + days] = getattr(data_sorted_index.groupby(['Cardnum',groupbyvar])['Amount'].rolling(days),agg)().values
            data_sorted['Actual/' + agg + "_" + "Cardnum_" + groupbyvar + "_" + days] = data_sorted['Amount']/data_sorted[agg + '_' + "Cardnum_" + groupbyvar + "_" + days]            

##  Create days since variables
for groupbyvar in [['Cardnum'], ['Merchnum'], ['Cardnum', 'Merchnum'], ['Cardnum', 'Merch zip'], ['Cardnum', 'Merch state']]:
    data_sorted1 = data_sorted1.sort_values(by = groupbyvar + ['Date'])
    # Days since the previous transaction in the same group; 365 when there is no earlier one
    days_since = data_sorted1.groupby(groupbyvar)['Date'].diff().dt.days.fillna(365)
    if len(groupbyvar) == 1:
        data_sorted1['Days_since_per_' + groupbyvar[0]] = days_since
    else:
        data_sorted1['Days_since_per_Cardnum_' + groupbyvar[1]] = days_since
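As a small worked example of the days-since variables, the sketch below computes the gap in days between consecutive transactions of the same card on made-up data. The toy rows are an assumption; the default of 365 for a card's first transaction mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy data: card 1 has three transactions, card 2 has one
toy = pd.DataFrame({
    'Cardnum': [1, 1, 1, 2],
    'Date': pd.to_datetime(['2010-01-01', '2010-01-04',
                            '2010-01-04', '2010-02-01']),
}).sort_values(['Cardnum', 'Date'])

# diff() within each card gives the time since the previous transaction;
# the first transaction of a card has no predecessor and gets 365
toy['Days_since_per_Cardnum'] = (
    toy.groupby('Cardnum')['Date'].diff().dt.days.fillna(365)
)
print(toy)
```

A value of 0 means the card was used twice on the same day.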

                                              

## Dimensionality Reduction
Since this is a fraud detection project with 300 candidate variables, we used the **univariate Kolmogorov-Smirnov (KS)** statistic and the **Fraud Detection Rate (FDR) at the top 3%** to filter out unimportant variables. A stepwise regression model was then used for feature selection.
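The two filter scores can be illustrated on a toy variable. In the sketch below, KS measures how far apart the distributions of a variable are for fraud and non-fraud records, and FDR at 3% is the share of all frauds found among the 3% of records with the highest values of the variable. The helper name `fdr_at` and the made-up data are assumptions for illustration, not part of the project code.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def fdr_at(df, column, label='Fraud', frac=0.03):
    """Share of all frauds captured in the top `frac` of rows ranked by `column`."""
    top_n = int(round(frac * len(df)))
    top = df.sort_values(column, ascending=False).head(top_n)
    return top[label].sum() / df[label].sum()

# Hypothetical data: 100 records, the 4 frauds have the 4 largest values of x
toy = pd.DataFrame({'Fraud': [0] * 96 + [1] * 4, 'x': np.arange(100)})

goods = toy[toy['Fraud'] == 0]
bads = toy[toy['Fraud'] == 1]

ks = ks_2samp(goods['x'], bads['x']).statistic  # 1.0: the two groups do not overlap
fdr = fdr_at(toy, 'x')                          # top 3 rows hold 3 of the 4 frauds
print(ks, fdr)
```

A variable that separates the classes perfectly reaches KS = 1, while FDR at 3% is capped by how many frauds fit into the top 3% of records (here 3 of 4).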
### 1. Filters

In [16]:
import pandas as pd
import numpy as np
##import the 300 variables created by R codes along with the original 10 variables contained in the dataset
mydata = pd.read_csv('var310.csv')

In [17]:
goods=mydata[mydata['Fraud']==0]
bads=mydata[mydata['Fraud']==1]

In [18]:
### KS score for each variable
from scipy.stats import ks_2samp
KS=[]
for column in mydata.columns[11:]:
    KS.append([ks_2samp(goods[column],bads[column])[0],column])
KS_df = pd.DataFrame(KS, columns=['KS_score','variables']).sort_values(by='KS_score',ascending=False).set_index('variables')

In [None]:
### FDR score for each variable
fdrdic = {}
for column in mydata.columns[11:]:
    a = mydata[['Fraud',column]]
    fdr=a.sort_values(column,ascending = False)['Fraud'].iloc[:round(0.03*len(mydata)),].sum()/len(bads)
    fdrdic[column] = fdr
fdr_df = pd.DataFrame.from_dict(fdrdic,orient = 'index')
fdr_df.columns=['FDR']

In [None]:
FDR_KS_df = fdr_df.merge(KS_df,left_index=True,right_index=True,how='left')
FDR_KS_df['sum'] = FDR_KS_df['FDR']+FDR_KS_df['KS_score']
top150 = FDR_KS_df.sort_values(by='sum',ascending=False)[:150]

In [None]:
var = list(top150.index)
var.append('Fraud')
top150_df = mydata[var]

### 2. Stepwise - backward logistic regression

In [None]:
X = top150_df.loc[:, top150_df.columns != 'Fraud']
y=top150_df['Fraud']

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
model = LogisticRegression()
rfecv = RFECV(estimator = model, step=1, cv = 3, verbose=3,n_jobs =-1, scoring='roc_auc')
rfecv.fit(X,y)

In [None]:
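import matplotlib.pyplot as plt
# grid_scores_ was replaced by cv_results_ in newer scikit-learn releases
scores = rfecv.cv_results_['mean_test_score'] if hasattr(rfecv, 'cv_results_') else rfecv.grid_scores_
plt.figure()
plt.plot(range(1,len(scores) +1), scores)
plt.xlabel('Number of variables selected')
plt.ylabel('Cross-validated ROC AUC')
plt.show()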

In [None]:
var_selected2 = pd.DataFrame(sorted(zip(map(lambda x: round(x), rfecv.ranking_),X.columns)), columns =['ranking','variables'])
pd.options.display.max_rows = 60
print(var_selected2)
cols_keep=list(var_selected2['variables'][0:60])
cols_keep

## Modeling with linear and non-linear machine learning models
We compared four kinds of machine learning models: random forest, neural network, gradient boosting tree and logistic regression. Each model was run 10 times on different random splits to mitigate the small-sample-size problem. Some visualizations are made to show the performance of the different models.
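The repeated-split evaluation can be sketched as one reusable helper instead of one loop per model. The example below is a minimal sketch under stated assumptions: `fdr_at` and `repeated_fdr` are hypothetical helper names, the synthetic data from `make_classification` stands in for the real variables, the fixed seeds and stratified splits are additions for reproducibility, and the cutoff is simply `round(frac * n)` records.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fdr_at(y_true, scores, frac=0.03):
    """Share of all positives found in the top `frac` of records by score."""
    y_true = np.asarray(y_true)
    top_n = int(round(frac * len(y_true)))
    top_idx = np.argsort(-np.asarray(scores))[:top_n]
    return y_true[top_idx].sum() / y_true.sum()

def repeated_fdr(make_model, X, y, n_runs=10, frac=0.03, test_size=0.25):
    """Fit a fresh model on n_runs random splits; return (train FDR, test FDR) per run."""
    results = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y)
        model = make_model().fit(X_tr, y_tr)
        results.append((
            fdr_at(y_tr, model.predict_proba(X_tr)[:, 1], frac),
            fdr_at(y_te, model.predict_proba(X_te)[:, 1], frac),
        ))
    return np.array(results)

# Synthetic, imbalanced stand-in data (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0)

fdr = repeated_fdr(lambda: LogisticRegression(max_iter=1000), X_demo, y_demo)
print(fdr.mean(axis=0))  # average train and test FDR at 3% over the 10 runs
```

Passing a factory such as `lambda: RandomForestClassifier(n_estimators=100)` would evaluate another model with the same splits, which keeps the comparison between models like for like.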

In [None]:
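X = X[cols_keep]
# Set up an out-of-time validation set: rows from tt_n onward are held out
tt_n = 84280
x_ood = X.iloc[tt_n:,:]
y_ood = y[tt_n:]
# Keep only the in-time rows for the random train/test splits below,
# so the out-of-time records never appear in training
X = X.iloc[:tt_n,:]
y = y[:tt_n]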

In [None]:
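### Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split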
RF_train_total=[]
RF_train_FDT=[]
RF_train_FPT=[]
RF_test_total=[]
RF_test_FDT=[]
RF_test_FPT=[]
RF_OOD_total=[]
RF_OOD_FDT=[]
RF_OOD_FPT=[]
for i in range(10):
    RF_train=[]
    RF_train_FP=[]
    RF_train_FD=[]
    RF_test=[]
    RF_test_FP=[]
    RF_test_FD=[]
    RF_OOD=[]
    RF_OOD_FP=[]
    RF_OOD_FD=[]
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25)
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)  
    
    y_pred1 = clf.predict(X_train)
    RF_FDR1 = pd.DataFrame(y_train)
    RF_FDR1['score'] = clf.predict_proba(X_train)[:,1]
    RF_FDR1['pred_y'] = y_pred1
    RF_FDR1 = RF_FDR1.sort_values(by='score',ascending=False)
    
    
    y_pred2 = clf.predict(X_test)
    RF_FDR2 = pd.DataFrame(y_test)
    RF_FDR2['score'] = clf.predict_proba(X_test)[:,1]
    RF_FDR2['pred_y'] = y_pred2
    RF_FDR2 = RF_FDR2.sort_values(by='score',ascending=False)
    
    y_pred3 = clf.predict(x_ood)
    RF_FDR3 = pd.DataFrame(y_ood)
    RF_FDR3['score'] = clf.predict_proba(x_ood)[:,1]
    RF_FDR3['pred_y'] = y_pred3
    RF_FDR3 = RF_FDR3.sort_values(by='score',ascending=False)
    
    for j in range(1,101):
        RF_OOD.append(np.sum(RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,0])/np.sum(RF_FDR3.iloc[:,0]))
        RF_OOD_FD.append(np.sum(RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,0]))
        RF_OOD_FP.append(np.sum(RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,:]['pred_y']-RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,:]['Fraud']==1))
        RF_train.append(np.sum(RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,0])/np.sum(RF_FDR1.iloc[:,0]))
        RF_train_FD.append(np.sum(RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,0]))
        RF_train_FP.append(np.sum(RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,:]['pred_y']-RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,:]['Fraud']==1))
        RF_test.append(np.sum(RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,0])/np.sum(RF_FDR2.iloc[:,0]))
        RF_test_FD.append(np.sum(RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,0]))
        RF_test_FP.append(np.sum(RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,:]['pred_y']-RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,:]['Fraud']==1))
    RF_train_total.append(RF_train)
    RF_train_FPT.append(RF_train_FP)
    RF_train_FDT.append(RF_train_FD)                      
    RF_test_total.append(RF_test)
    RF_test_FPT.append(RF_test_FP)
    RF_test_FDT.append(RF_test_FD)
    RF_OOD_total.append(RF_OOD)
    RF_OOD_FPT.append(RF_OOD_FP)
    RF_OOD_FDT.append(RF_OOD_FD)

In [None]:
### Neural Network
from sklearn.neural_network import MLPClassifier
RF_train_total=[]
RF_train_FDT=[]
RF_train_FPT=[]
RF_test_total=[]
RF_test_FDT=[]
RF_test_FPT=[]
RF_OOD_total=[]
RF_OOD_FDT=[]
RF_OOD_FPT=[]
for i in range(10):
    RF_train=[]
    RF_train_FP=[]
    RF_train_FD=[]
    RF_test=[]
    RF_test_FP=[]
    RF_test_FD=[]
    RF_OOD=[]
    RF_OOD_FP=[]
    RF_OOD_FD=[]
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25)
    
    clf = MLPClassifier(hidden_layer_sizes=(10,5))
    #clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)  
    
    y_pred1 = clf.predict(X_train)
    RF_FDR1 = pd.DataFrame(y_train)
    RF_FDR1['score'] = clf.predict_proba(X_train)[:,1]
    RF_FDR1['pred_y'] = y_pred1
    RF_FDR1 = RF_FDR1.sort_values(by='score',ascending=False)
    
    
    y_pred2 = clf.predict(X_test)
    RF_FDR2 = pd.DataFrame(y_test)
    RF_FDR2['score'] = clf.predict_proba(X_test)[:,1]
    RF_FDR2['pred_y'] = y_pred2
    RF_FDR2 = RF_FDR2.sort_values(by='score',ascending=False)
    
    y_pred3 = clf.predict(x_ood)
    RF_FDR3 = pd.DataFrame(y_ood)
    RF_FDR3['score'] = clf.predict_proba(x_ood)[:,1]
    RF_FDR3['pred_y'] = y_pred3
    RF_FDR3 = RF_FDR3.sort_values(by='score',ascending=False)
    
    for j in range(1,101):
        RF_OOD.append(np.sum(RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,0])/np.sum(RF_FDR3.iloc[:,0]))
        RF_OOD_FD.append(np.sum(RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,0]))
        RF_OOD_FP.append(np.sum(RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,:]['pred_y']-RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,:]['Fraud']==1))
        RF_train.append(np.sum(RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,0])/np.sum(RF_FDR1.iloc[:,0]))
        RF_train_FD.append(np.sum(RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,0]))
        RF_train_FP.append(np.sum(RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,:]['pred_y']-RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,:]['Fraud']==1))
        RF_test.append(np.sum(RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,0])/np.sum(RF_FDR2.iloc[:,0]))
        RF_test_FD.append(np.sum(RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,0]))
        RF_test_FP.append(np.sum(RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,:]['pred_y']-RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,:]['Fraud']==1))
    RF_train_total.append(RF_train)
    RF_train_FPT.append(RF_train_FP)
    RF_train_FDT.append(RF_train_FD)                      
    RF_test_total.append(RF_test)
    RF_test_FPT.append(RF_test_FP)
    RF_test_FDT.append(RF_test_FD)
    RF_OOD_total.append(RF_OOD)
    RF_OOD_FPT.append(RF_OOD_FP)
    RF_OOD_FDT.append(RF_OOD_FD)

In [None]:
### Gradient Boosting Tree
from sklearn.ensemble import GradientBoostingClassifier
RF_train_total=[]
RF_train_FDT=[]
RF_train_FPT=[]
RF_test_total=[]
RF_test_FDT=[]
RF_test_FPT=[]
RF_OOD_total=[]
RF_OOD_FDT=[]
RF_OOD_FPT=[]
for i in range(10):
    RF_train=[]
    RF_train_FP=[]
    RF_train_FD=[]
    RF_test=[]
    RF_test_FP=[]
    RF_test_FD=[]
    RF_OOD=[]
    RF_OOD_FP=[]
    RF_OOD_FD=[]
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25)
    
    clf = GradientBoostingClassifier()
    clf.fit(X_train, y_train)  
    
    y_pred1 = clf.predict(X_train)
    RF_FDR1 = pd.DataFrame(y_train)
    RF_FDR1['score'] = clf.predict_proba(X_train)[:,1]
    RF_FDR1['pred_y'] = y_pred1
    RF_FDR1 = RF_FDR1.sort_values(by='score',ascending=False)
    
    
    y_pred2 = clf.predict(X_test)
    RF_FDR2 = pd.DataFrame(y_test)
    RF_FDR2['score'] = clf.predict_proba(X_test)[:,1]
    RF_FDR2['pred_y'] = y_pred2
    RF_FDR2 = RF_FDR2.sort_values(by='score',ascending=False)
    
    y_pred3 = clf.predict(x_ood)
    RF_FDR3 = pd.DataFrame(y_ood)
    RF_FDR3['score'] = clf.predict_proba(x_ood)[:,1]
    RF_FDR3['pred_y'] = y_pred3
    RF_FDR3 = RF_FDR3.sort_values(by='score',ascending=False)
    
    for j in range(1,101):
        RF_OOD.append(np.sum(RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,0])/np.sum(RF_FDR3.iloc[:,0]))
        RF_OOD_FD.append(np.sum(RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,0]))
        RF_OOD_FP.append(np.sum(RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,:]['pred_y']-RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,:]['Fraud']==1))
        RF_train.append(np.sum(RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,0])/np.sum(RF_FDR1.iloc[:,0]))
        RF_train_FD.append(np.sum(RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,0]))
        RF_train_FP.append(np.sum(RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,:]['pred_y']-RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,:]['Fraud']==1))
        RF_test.append(np.sum(RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,0])/np.sum(RF_FDR2.iloc[:,0]))
        RF_test_FD.append(np.sum(RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,0]))
        RF_test_FP.append(np.sum(RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,:]['pred_y']-RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,:]['Fraud']==1))
    RF_train_total.append(RF_train)
    RF_train_FPT.append(RF_train_FP)
    RF_train_FDT.append(RF_train_FD)                      
    RF_test_total.append(RF_test)
    RF_test_FPT.append(RF_test_FP)
    RF_test_FDT.append(RF_test_FD)
    RF_OOD_total.append(RF_OOD)
    RF_OOD_FPT.append(RF_OOD_FP)
    RF_OOD_FDT.append(RF_OOD_FD)

In [None]:
### Logistic Regression
from sklearn.linear_model import LogisticRegression
RF_train_total=[]
RF_train_FDT=[]
RF_train_FPT=[]
RF_test_total=[]
RF_test_FDT=[]
RF_test_FPT=[]
RF_OOD_total=[]
RF_OOD_FDT=[]
RF_OOD_FPT=[]
for i in range(10):
    RF_train=[]
    RF_train_FP=[]
    RF_train_FD=[]
    RF_test=[]
    RF_test_FP=[]
    RF_test_FD=[]
    RF_OOD=[]
    RF_OOD_FP=[]
    RF_OOD_FD=[]
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25)
    
    clf = LogisticRegression()
    clf.fit(X_train, y_train)  
    
    y_pred1 = clf.predict(X_train)
    RF_FDR1 = pd.DataFrame(y_train)
    RF_FDR1['score'] = clf.predict_proba(X_train)[:,1]
    RF_FDR1['pred_y'] = y_pred1
    RF_FDR1 = RF_FDR1.sort_values(by='score',ascending=False)
    
    
    y_pred2 = clf.predict(X_test)
    RF_FDR2 = pd.DataFrame(y_test)
    RF_FDR2['score'] = clf.predict_proba(X_test)[:,1]
    RF_FDR2['pred_y'] = y_pred2
    RF_FDR2 = RF_FDR2.sort_values(by='score',ascending=False)
    
    y_pred3 = clf.predict(x_ood)
    RF_FDR3 = pd.DataFrame(y_ood)
    RF_FDR3['score'] = clf.predict_proba(x_ood)[:,1]
    RF_FDR3['pred_y'] = y_pred3
    RF_FDR3 = RF_FDR3.sort_values(by='score',ascending=False)
    
    for j in range(1,101):
        RF_OOD.append(np.sum(RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,0])/np.sum(RF_FDR3.iloc[:,0]))
        RF_OOD_FD.append(np.sum(RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,0]))
        RF_OOD_FP.append(np.sum(RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,:]['pred_y']-RF_FDR3.iloc[:round(len(RF_FDR3)*j/100)+1,:]['Fraud']==1))
        RF_train.append(np.sum(RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,0])/np.sum(RF_FDR1.iloc[:,0]))
        RF_train_FD.append(np.sum(RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,0]))
        RF_train_FP.append(np.sum(RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,:]['pred_y']-RF_FDR1.iloc[:round(len(RF_FDR1)*j/100)+1,:]['Fraud']==1))
        RF_test.append(np.sum(RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,0])/np.sum(RF_FDR2.iloc[:,0]))
        RF_test_FD.append(np.sum(RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,0]))
        RF_test_FP.append(np.sum(RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,:]['pred_y']-RF_FDR2.iloc[:round(len(RF_FDR2)*j/100)+1,:]['Fraud']==1))
    RF_train_total.append(RF_train)
    RF_train_FPT.append(RF_train_FP)
    RF_train_FDT.append(RF_train_FD)                      
    RF_test_total.append(RF_test)
    RF_test_FPT.append(RF_test_FP)
    RF_test_FDT.append(RF_test_FD)
    RF_OOD_total.append(RF_OOD)
    RF_OOD_FPT.append(RF_OOD_FP)
    RF_OOD_FDT.append(RF_OOD_FD)

In [None]:
### data for visualization
#results
RF_train_total_df = pd.DataFrame(RF_train_total)
RF_test_total_df = pd.DataFrame(RF_test_total)
RF_OOD_total_df = pd.DataFrame(RF_OOD_total)
RF_train_FDT_df = pd.DataFrame(RF_train_FDT)
RF_test_FDT_df = pd.DataFrame(RF_test_FDT)
RF_OOD_FDT_df = pd.DataFrame(RF_OOD_FDT)
RF_train_FPT_df = pd.DataFrame(RF_train_FPT)
RF_test_FPT_df = pd.DataFrame(RF_test_FPT)
RF_OOD_FPT_df = pd.DataFrame(RF_OOD_FPT)


train_FDT=RF_train_FDT_df.mean(axis=0)*2000
test_FDT=RF_test_FDT_df.mean(axis=0)*2000
OOD_FDT=RF_OOD_FDT_df.mean(axis=0)*2000
train_FPT=RF_train_FPT_df.mean(axis=0)*50
test_FPT=RF_test_FPT_df.mean(axis=0)*50
OOD_FPT=RF_OOD_FPT_df.mean(axis=0)*50

#for graph
rf_results = pd.DataFrame()
rf_results['train_FDT']=train_FDT
rf_results['test_FDT']=test_FDT
rf_results['OOD_FDT']=OOD_FDT
rf_results['train_FPT']=train_FPT
rf_results['test_FPT']=test_FPT
rf_results['OOD_FPT']=OOD_FPT
print(rf_results)

# for table
rf_table = pd.DataFrame()
rf_table['train']=RF_train_total_df.iloc[:,2]
rf_table['test']=RF_test_total_df.iloc[:,2]
rf_table['OOD']=RF_OOD_total_df.iloc[:,2]
rf_table.loc[10]=rf_table.mean()
print(rf_table)