# IEEE-CIS Fraud Detection

# Can you detect fraud from customer transactions?

IEEE-CIS works across a variety of AI and machine learning areas, including deep neural networks, fuzzy systems, evolutionary computation, and swarm intelligence. Today they’re partnering with the world’s leading payment service company, Vesta Corporation, seeking the best solutions for the fraud prevention industry, and now you are invited to join the challenge.

Steps:
1 - Preprocessing and EDA
1.1 - Imports
1.2 - Checking Types
1.3 - Missing Values
1.4 - Remove unwanted features (highly correlated and outliers)
1.5 - Transformations
1.6 - Prepare for model

2 - Modeling: Fraud detection is best framed here as an anomaly detection problem. A standard classification model may look attractive, but the ratio of fraudulent to legitimate (non-fraud) transactions is heavily skewed towards the legitimate class, so a plain classifier is unlikely to be the right choice.
2.1 - AutoML (H2O)
2.2 - AutoML (TPOT)

3 - Model Evaluation

4 - Submission
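
The imbalance claim in step 2 is easy to check before choosing a modeling approach. A minimal sketch, using a small made-up label series in place of `train_trans['isFraud']` (the toy values and the `class_balance` helper are illustrative assumptions, not part of the dataset):

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.Series:
    """Return the share of each class, largest first."""
    return labels.value_counts(normalize=True)

# Toy stand-in for train_trans['isFraud']: 2 fraud rows out of 50.
toy_labels = pd.Series([1] * 2 + [0] * 48, name="isFraud")
balance = class_balance(toy_labels)
print(balance)
```

On the real data the same call would be `class_balance(train_trans['isFraud'])`.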

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew
from scipy.special import boxcox1p
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.preprocessing import LabelEncoder

In [2]:
train_trans=pd.read_csv("C:/Users/112013/Desktop/ML_H2O/IEEE_fraud/train_transaction.csv")
train_identity=pd.read_csv("C:/Users/112013/Desktop/ML_H2O/IEEE_fraud/train_identity.csv")

In [3]:
test_trans=pd.read_csv("C:/Users/112013/Desktop/ML_H2O/IEEE_fraud/test_transaction.csv")
test_identity=pd.read_csv("C:/Users/112013/Desktop/ML_H2O/IEEE_fraud/test_identity.csv")

In [4]:
len_train=train_trans.shape[0]
len_train_id=train_identity.shape[0]
len_test=test_trans.shape[0]
len_test_id=test_identity.shape[0]
print(train_trans.shape)
print(test_trans.shape)

print(train_identity.shape)
print(test_identity.shape)

(590540, 394)
(506691, 393)
(144233, 41)
(141907, 41)
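
The identity tables have far fewer rows than the transaction tables, so only some transactions carry identity information. One way to combine the two is a left join. The sketch below uses two tiny made-up frames and assumes the shared key column is named `TransactionID`; the column values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for train_trans and train_identity.
toy_trans = pd.DataFrame({"TransactionID": [1, 2, 3, 4],
                          "TransactionAmt": [68.5, 29.0, 59.0, 50.0]})
toy_identity = pd.DataFrame({"TransactionID": [2, 4],
                             "DeviceType": ["mobile", "desktop"]})

# A left join keeps every transaction; rows without identity data get NaN.
merged = toy_trans.merge(toy_identity, on="TransactionID", how="left")
print(merged.shape)
print(merged["DeviceType"].isnull().sum())
```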


In [5]:
# transactions=pd.concat([train_trans,test_trans], sort=False)

In [6]:
train_trans.select_dtypes(include='object').head(20)

Unnamed: 0,ProductCD,card4,card6,P_emaildomain,R_emaildomain,M1,M2,M3,M4,M5,M6,M7,M8,M9
0,W,discover,credit,,,T,T,T,M2,F,T,,,
1,W,mastercard,credit,gmail.com,,,,,M0,T,T,,,
2,W,visa,debit,outlook.com,,T,T,T,M0,F,F,F,F,F
3,W,mastercard,debit,yahoo.com,,,,,M0,T,F,,,
4,H,mastercard,credit,gmail.com,,,,,,,,,,
5,W,visa,debit,gmail.com,,T,T,T,M1,F,T,,,
6,W,visa,debit,yahoo.com,,T,T,T,M0,F,F,T,T,T
7,W,visa,debit,mail.com,,,,,M0,F,F,,,
8,H,visa,debit,anonymous.com,,,,,,,,,,
9,W,mastercard,debit,yahoo.com,,T,T,T,M0,T,T,,,


In [7]:
train_trans.select_dtypes(include='category').head(20)

0
1
2
3
4
5
6
7
8
9
10


Per the dataset description, the fields below are considered categorical. In the step above, however, a few of them were not typed as objects, so let's manually convert them to categorical fields.
ProductCD
card1 - card6
addr1, addr2
P_emaildomain
R_emaildomain
M1 - M9
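
The ranges in the list above (`card1 - card6`, `M1 - M9`) can be expanded programmatically instead of typing every name. A small sketch; the variable name `categorical_cols` is only illustrative:

```python
# Build the categorical column names from the ranges in the dataset description.
categorical_cols = (
    ["ProductCD"]
    + [f"card{i}" for i in range(1, 7)]   # card1 .. card6
    + ["addr1", "addr2", "P_emaildomain", "R_emaildomain"]
    + [f"M{i}" for i in range(1, 10)]     # M1 .. M9
)
print(len(categorical_cols))
```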

In [8]:
for col in ('ProductCD','card1','card2','card3','card4','card5','card6','addr1','addr2','P_emaildomain','R_emaildomain','M1','M2','M3','M4','M5','M6','M7','M8','M9'):
    train_trans[col]=train_trans[col].astype('object')

In [9]:
train_trans.select_dtypes(include='object').isnull().sum()[train_trans.select_dtypes(include='object').isnull().sum()>0]

card2              8933
card3              1565
card4              1577
card5              4259
card6              1571
addr1             65706
addr2             65706
P_emaildomain     94456
R_emaildomain    453249
M1               271100
M2               271100
M3               271100
M4               281444
M5               350482
M6               169360
M7               346265
M8               346252
M9               346252
dtype: int64
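
The expression in cell 9 calls `select_dtypes(...).isnull().sum()` twice. A hedged alternative is to compute the counts once and then filter them; `missing_counts` is a hypothetical helper, shown here on a toy frame:

```python
import numpy as np
import pandas as pd

def missing_counts(df: pd.DataFrame, include) -> pd.Series:
    """Count nulls per column of the given dtypes, keeping only columns with any."""
    counts = df.select_dtypes(include=include).isnull().sum()
    return counts[counts > 0]

# Toy frame; object dtype is set explicitly to mirror the notebook's columns.
toy = pd.DataFrame({
    "card4": pd.Series(["visa", None, "discover"], dtype="object"),
    "card6": pd.Series(["debit", "credit", "credit"], dtype="object"),
    "dist1": [19.0, np.nan, np.nan],
})
obj_missing = missing_counts(toy, "object")
num_missing = missing_counts(toy, ["float", "int"])
print(obj_missing)
print(num_missing)
```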

In [10]:
for col in ('P_emaildomain','R_emaildomain','addr1','addr2','M1','M2','M3','M4','M5','M6','M7','M8','M9'):
    train_trans[col]=train_trans[col].fillna('None')
    test_trans[col]=test_trans[col].fillna('None')

In [11]:
for col in ('card2','card3','card4','card5','card6'):
    train_trans[col]=train_trans[col].fillna(train_trans[col].mode()[0])
    test_trans[col]=test_trans[col].fillna(test_trans[col].mode()[0])
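
Cell 11 fills the test set with the test set's own mode. A common alternative is to learn the fill value on the training data only and reuse it for the test data, so that both sets are transformed identically. A sketch on toy frames (the values are invented):

```python
import pandas as pd

toy_train = pd.DataFrame(
    {"card4": pd.Series(["visa", "visa", None, "mastercard"], dtype="object")})
toy_test = pd.DataFrame(
    {"card4": pd.Series(["mastercard", None, "mastercard"], dtype="object")})

for col in ["card4"]:
    fill_value = toy_train[col].mode()[0]            # learned from train only
    toy_train[col] = toy_train[col].fillna(fill_value)
    toy_test[col] = toy_test[col].fillna(fill_value)  # reused for test

print(toy_test["card4"].tolist())
```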

In [12]:
train_trans.select_dtypes(include=['float', 'int']).head(20)

Unnamed: 0,TransactionAmt,card2,card3,card5,dist1,dist2,C1,C2,C3,C4,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,68.5,321.0,150.0,142.0,19.0,,1.0,1.0,0.0,0.0,...,,,,,,,,,,
1,29.0,404.0,150.0,102.0,,,1.0,1.0,0.0,0.0,...,,,,,,,,,,
2,59.0,490.0,150.0,166.0,287.0,,1.0,1.0,0.0,0.0,...,,,,,,,,,,
3,50.0,567.0,150.0,117.0,,,2.0,5.0,0.0,0.0,...,,,,,,,,,,
4,50.0,514.0,150.0,102.0,,,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,49.0,555.0,150.0,226.0,36.0,,1.0,1.0,0.0,0.0,...,,,,,,,,,,
6,159.0,360.0,150.0,166.0,0.0,,1.0,1.0,0.0,0.0,...,,,,,,,,,,
7,422.5,490.0,150.0,226.0,,,1.0,1.0,0.0,0.0,...,,,,,,,,,,
8,15.0,100.0,150.0,226.0,,,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,117.0,111.0,150.0,224.0,19.0,,2.0,2.0,0.0,0.0,...,,,,,,,,,,


In [13]:
train_trans.select_dtypes(include=['float','int']).isnull().sum()[train_trans.select_dtypes(include=['float','int']).isnull().sum()>0]

dist1    352271
dist2    552913
D1         1269
D2       280797
D3       262878
D4       168922
D5       309841
D6       517353
D7       551623
D8       515614
D9       515614
D10       76022
D11      279287
D12      525823
D13      528588
D14      528353
D15       89113
V1       279287
V2       279287
V3       279287
V4       279287
V5       279287
V6       279287
V7       279287
V8       279287
V9       279287
V10      279287
V11      279287
V12       76073
V13       76073
          ...  
V310         12
V311         12
V312         12
V313       1269
V314       1269
V315       1269
V316         12
V317         12
V318         12
V319         12
V320         12
V321         12
V322     508189
V323     508189
V324     508189
V325     508189
V326     508189
V327     508189
V328     508189
V329     508189
V330     508189
V331     508189
V332     508189
V333     508189
V334     508189
V335     508189
V336     508189
V337     508189
V338     508189
V339     508189
Length: 356, dtype: int64

In [14]:
a = train_trans.select_dtypes(include=['float','int']).isnull().sum()
a = a[a>0]  # keep only the columns that actually have missing values
train_df=pd.DataFrame(a)
train_df.to_csv('C:/Users/112013/Desktop/ML_H2O/IEEE_fraud/int_exceptions_07232019.csv', mode='a')

In [15]:
train_trans1=train_trans.copy()  # .copy() makes an independent backup; plain assignment only aliases the same DataFrame

In [16]:
# train_trans.drop(['dist2','D6','D7','D8','D9','D12','D13','D14','V138','V139','V140','V141','V142','V143','V144','V145','V146','V147','V148','V149','V150','V151','V152','V153','V154','V155','V156','V157','V158','V159','V160','V161','V162','V163','V164','V165','V166','V167','V168','V169','V170','V171','V172','V173','V174','V175','V176','V177','V178','V179','V180','V181','V182','V183','V184','V185','V186','V187','V188','V189','V190','V191','V192','V193','V194','V195','V196','V197','V198','V199','V200','V201','V202','V203','V204','V205','V206','V207','V208','V209','V210','V211','V212','V213','V214','V215','V216','V217','V218','V219','V220','V221','V222','V223','V224','V225','V226','V227','V228','V229','V230','V231','V232','V233','V234','V235','V236','V237','V238','V239','V240','V241','V242','V243','V244','V245','V246','V247','V248','V249','V250','V251','V252','V253','V254','V255','V256','V257','V258','V259','V260','V261','V262','V263','V264','V265','V266','V267','V268','V269','V270','V271','V272','V273','V274','V275','V276','V277','V278','V322','V323','V324','V325','V326','V327','V328','V329','V330','V331','V332','V333','V334','V335','V336','V337','V338','V339'],axis=1, inplace = True)

In [17]:
for col in (
'D4',
'D10',
'D15',
'V12',
'V13',
'V14',
'V15',
'V16',
'V17',
'V18',
'V19',
'V20',
'V21',
'V22',
'V23',
'V24',
'V25',
'V26',
'V27',
'V28',
'V29',
'V30',
'V31',
'V32',
'V33',
'V34',
'V35',
'V36',
'V37',
'V38',
'V39',
'V40',
'V41',
'V42',
'V43',
'V44',
'V45',
'V46',
'V47',
'V48',
'V49',
'V50',
'V51',
'V52',
'V53',
'V54',
'V55',
'V56',
'V57',
'V58',
'V59',
'V60',
'V61',
'V62',
'V63',
'V64',
'V65',
'V66',
'V67',
'V68',
'V69',
'V70',
'V71',
'V72',
'V73',
'V74',
'V75',
'V76',
'V77',
'V78',
'V79',
'V80',
'V81',
'V82',
'V83',
'V84',
'V85',
'V86',
'V87',
'V88',
'V89',
'V90',
'V91',
'V92',
'V93',
'V94'):
    train_trans[col]=train_trans[col].fillna(train_trans[col].mean())

In [18]:
for col in (
'D1',
'V95',
'V96',
'V97',
'V98',
'V99',
'V100',
'V101',
'V102',
'V103',
'V104',
'V105',
'V106',
'V107',
'V108',
'V109',
'V110',
'V111',
'V112',
'V113',
'V114',
'V115',
'V116',
'V117',
'V118',
'V119',
'V120',
'V121',
'V122',
'V123',
'V124',
'V125',
'V126',
'V127',
'V128',
'V129',
'V130',
'V131',
'V132',
'V133',
'V134',
'V135',
'V136',
'V137',
'V279',
'V280',
'V281',
'V282',
'V283',
'V284',
'V285',
'V286',
'V287',
'V288',
'V289',
'V290',
'V291',
'V292',
'V293',
'V294',
'V295',
'V296',
'V297',
'V298',
'V299',
'V300',
'V301',
'V302',
'V303',
'V304',
'V305',
'V306',
'V307',
'V308',
'V309',
'V310',
'V311',
'V312',
'V313',
'V314',
'V315',
'V316',
'V317',
'V318',
'V319',
'V320',
'V321'):
    train_trans[col]=train_trans[col].fillna(train_trans[col].mean())

In [19]:
for col in (
    'dist1',
'dist2',
'D2',
'D3',
'D5',
'D6',
'D7',
'D8',
'D9',
'D11',
'D12',
'D13',
'D14',
'V1',
'V2',
'V3',
'V4',
'V5',
'V6',
'V7',
'V8',
'V9',
'V10',
'V11',
'V138',
'V139',
'V140',
'V141',
'V142',
'V143',
'V144',
'V145',
'V146',
'V147',
'V148',
'V149',
'V150',
'V151',
'V152',
'V153',
'V154',
'V155',
'V156',
'V157',
'V158',
'V159',
'V160',
'V161',
'V162',
'V163',
'V164',
'V165',
'V166',
'V167',
'V168',
'V169',
'V170',
'V171',
'V172',
'V173',
'V174',
'V175',
'V176',
'V177',
'V178',
'V179',
'V180',
'V181',
'V182',
'V183',
'V184',
'V185',
'V186',
'V187',
'V188',
'V189',
'V190',
'V191',
'V192',
'V193',
'V194',
'V195',
'V196',
'V197',
'V198',
'V199',
'V200',
'V201',
'V202',
'V203',
'V204',
'V205',
'V206',
'V207',
'V208',
'V209',
'V210',
'V211',
'V212',
'V213',
'V214',
'V215',
'V216',
'V217',
'V218',
'V219',
'V220',
'V221',
'V222',
'V223',
'V224',
'V225',
'V226',
'V227',
'V228',
'V229',
'V230',
'V231',
'V232',
'V233',
'V234',
'V235',
'V236',
'V237',
'V238',
'V239',
'V240',
'V241',
'V242',
'V243',
'V244',
'V245',
'V246',
'V247',
'V248',
'V249',
'V250',
'V251',
'V252',
'V253',
'V254',
'V255',
'V256',
'V257',
'V258',
'V259',
'V260',
'V261',
'V262',
'V263',
'V264',
'V265',
'V266',
'V267',
'V268',
'V269',
'V270',
'V271',
'V272',
'V273',
'V274',
'V275',
'V276',
'V277',
'V278',
'V322',
'V323',
'V324',
'V325',
'V326',
'V327',
'V328',
'V329',
'V330',
'V331',
'V332',
'V333',
'V334',
'V335',
'V336',
'V337',
'V338',
'V339'):
    train_trans[col]=train_trans[col].fillna(0)
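
Cells 17 to 19 hard-code three long column lists. If the split is driven by how much of each column is missing (an assumption; the notebook does not state the rule), the lists can be derived from the data instead. The 0.4 threshold and the helper name `fill_by_missing_share` below are illustrative choices, not values taken from the notebook:

```python
import numpy as np
import pandas as pd

def fill_by_missing_share(df: pd.DataFrame, threshold: float = 0.4) -> pd.DataFrame:
    """Mean-fill numeric columns with few gaps, zero-fill those with many."""
    out = df.copy()
    numeric = out.select_dtypes(include="number")
    share = numeric.isnull().mean()                  # fraction missing per column
    for col in share[share > 0].index:
        fill = out[col].mean() if share[col] <= threshold else 0
        out[col] = out[col].fillna(fill)
    return out

toy = pd.DataFrame({
    "D1": [1.0, 3.0, np.nan, 2.0, 2.0],              # 20% missing, mean-filled
    "dist2": [np.nan, np.nan, np.nan, 7.0, np.nan],  # 80% missing, zero-filled
})
filled = fill_by_missing_share(toy)
print(filled)
```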


In [20]:
train_trans.select_dtypes(include=['float','int']).isnull().sum()[train_trans.select_dtypes(include=['float','int']).isnull().sum()>0]

Series([], dtype: int64)

In [21]:
train_trans.select_dtypes(include='object').isnull().sum()[train_trans.select_dtypes(include='object').isnull().sum()>0]

Series([], dtype: int64)

In [22]:
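corr1 = train_trans.select_dtypes(include='number').corr()  # correlate numeric columns only; object columns cannot be correlated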
corr1_df=pd.DataFrame(corr1)
corr1_df.to_csv('C:/Users/112013/Desktop/ML_H2O/IEEE_fraud/corr1_07282019.csv', mode='a')
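
Step 1.4 mentions removing highly correlated features. One common way to get candidates from a correlation matrix is to scan its upper triangle. A sketch on toy data; the 0.95 cutoff and the name `highly_correlated` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def highly_correlated(df: pd.DataFrame, cutoff: float = 0.95) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds cutoff."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > cutoff).any()]

toy = pd.DataFrame({
    "V1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "V2": [2.0, 4.0, 6.0, 8.0, 10.0],   # exact multiple of V1
    "V3": [5.0, 1.0, 4.0, 2.0, 3.0],
})
to_drop = highly_correlated(toy)
print(to_drop)
```

Only the later column of each correlated pair is reported, so dropping `to_drop` keeps one representative per pair.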

In [23]:
feature_name = list(train_trans.columns)
# no of maximum features we need to select
num_feats=50

In [24]:
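X = train_trans.drop('isFraud',axis=1)  # keep X as a DataFrame; .values would turn it into a NumPy array
y = train_trans['isFraud'].values

In [26]:
# get_dummies expects a DataFrame or Series; on the NumPy array it raised "Exception: Data must be 1-dimensional"
X = pd.get_dummies(X)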

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
X_norm = MinMaxScaler().fit_transform(X)
chi_selector = SelectKBest(chi2, k=num_feats)
chi_selector.fit(X_norm, y)
chi_support = chi_selector.get_support()
chi_feature = X.loc[:,chi_support].columns.tolist()
print(str(len(chi_feature)), 'selected features')
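
The selection step above can be tried in isolation on toy data. In this sketch the feature names and values are made up; `chi2` requires non-negative inputs, which is why the features are scaled to [0, 1] first:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)
X_toy = pd.DataFrame({
    "signal": y_toy * 5 + rng.random(200),   # strongly tied to the label
    "noise_a": rng.random(200),
    "noise_b": rng.random(200),
})

X_scaled = MinMaxScaler().fit_transform(X_toy)      # chi2 needs values >= 0
selector = SelectKBest(chi2, k=1).fit(X_scaled, y_toy)
selected = X_toy.loc[:, selector.get_support()].columns.tolist()
print(selected)
```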

In [None]:
# xyz = train_trans.test()  # disabled: a pandas DataFrame has no .test() method, so this line raises AttributeError

Per the dataset description, let's convert the fields listed as categorical to the `category` dtype so that we can one-hot encode them.

In [None]:
for col in ('ProductCD','card1','card2','card3','card4','card5','card6','addr1','addr2','P_emaildomain','R_emaildomain','M1','M2','M3','M4','M5','M6','M7','M8','M9'):
    train_trans[col]=train_trans[col].astype('category')

In [None]:
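# Store the result under a new name so the imported skew() function is not overwritten
num_cols=train_trans.select_dtypes(include=['int','float']).drop(columns=['isFraud'], errors='ignore')  # leave the target out so it is not transformed later
skewness=num_cols.apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skew_df=pd.DataFrame({'Skew':skewness})
skewed_df=skew_df[(skew_df['Skew']>0.5)|(skew_df['Skew']<-0.5)]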

In [None]:
skewed_df.index

In [None]:
lam = 0.1
for col in skewed_df.index:
    train_trans[col]=boxcox1p(train_trans[col],lam)
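
To see what the transform above does, the sketch below applies `boxcox1p` with the same `lam = 0.1` to a synthetic right-skewed sample and compares the skewness before and after. The sample is generated for illustration and is not drawn from the dataset.

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

rng = np.random.default_rng(42)
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=5000)   # right-skewed, all positive

lam = 0.1
transformed = boxcox1p(amounts, lam)

# For lam != 0, boxcox1p(x, lam) equals ((1 + x)**lam - 1) / lam.
manual = ((1 + amounts) ** lam - 1) / lam

skew_before = skew(amounts)
skew_after = skew(transformed)
print(round(float(skew_before), 2), round(float(skew_after), 2))
```

Note that `boxcox1p` returns NaN for inputs below -1, so columns containing such values would need separate handling.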

In [None]:
train_trans = pd.get_dummies(train_trans)
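
`pd.get_dummies` expands every object or category column into one indicator column per distinct value and leaves numeric columns as they are. A toy illustration with invented rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "TransactionAmt": [68.5, 29.0, 59.0],
    "card4": pd.Series(["discover", "mastercard", "visa"], dtype="category"),
})
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

Columns with many distinct values can produce a very large number of indicator columns, so it may be worth checking `train_trans[col].nunique()` for each categorical column before encoding the full frame.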

**Let's do EDA now**

In [None]:
# plt.figure(figsize=[60,30])
# sns.heatmap(train_trans.corr(), annot=True)