### Option 05 - Fraud Detection for Finance
For this project, download the BankSim dataset from Kaggle: <br/>
https://www.kaggle.com/ntnu-testimon/banksim1 <br/>
Alternative link: <br/>
https://www.kaggle.com/code/vmeh23/general-data-analysis-for-banksim-data-set/data

* step: maps a unit of time in the real world. The file loaded in this notebook contains 180 distinct steps (0 to 179), which is about six months of simulated activity if one step is read as one day. <br/>
* customer: ID of the customer <br/>
* age: Age category of the customer (coded '0' to '6', or 'U') <br/>
* gender: Gender of the customer <br/>
* zipcodeOri: The zip code of the customer <br/>
* merchant: ID of the merchant <br/>
* zipMerchant: The zip code of the merchant <br/>
* category: Category of the transaction <br/>
* amount: The amount of the transaction <br/>
* fraud: Binary value indicating fraudulent or not fraudulent <br/>

In [2]:
# (1) Import libraries
import pandas as pd

In [3]:
# (2.c) Read transaction data into Pandas DataFrame
fraud_df = pd.read_csv('./bs140513_032310.csv')


# Note: the equivalent method for a PySpark DataFrame is
# df = spark.read.csv()

In [4]:
# (3) Explore data (for example, see what is categorical and numerical)
# We used the info() method to check whether the dataset has null values.
# There are no null values in any column.
fraud_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 594643 entries, 0 to 594642
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   step         594643 non-null  int64  
 1   customer     594643 non-null  object 
 2   age          594643 non-null  object 
 3   gender       594643 non-null  object 
 4   zipcodeOri   594643 non-null  object 
 5   merchant     594643 non-null  object 
 6   zipMerchant  594643 non-null  object 
 7   category     594643 non-null  object 
 8   amount       594643 non-null  float64
 9   fraud        594643 non-null  int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 45.4+ MB


In [5]:

fraud_df.head()

   step       customer  age  gender  zipcodeOri       merchant  zipMerchant             category  amount  fraud
0     0  'C1093826151'  '4'     'M'     '28007'   'M348934600'      '28007'  'es_transportation'    4.55      0
1     0   'C352968107'  '2'     'M'     '28007'   'M348934600'      '28007'  'es_transportation'   39.68      0
2     0  'C2054744914'  '4'     'F'     '28007'  'M1823072687'      '28007'  'es_transportation'   26.89      0
3     0  'C1760612790'  '3'     'M'     '28007'   'M348934600'      '28007'  'es_transportation'   17.25      0
4     0   'C757503768'  '5'     'M'     '28007'   'M348934600'      '28007'  'es_transportation'   35.72      0


In [6]:
fraud_df['zipMerchant'].value_counts()


'28007'    594643
Name: zipMerchant, dtype: int64

In [7]:
fraud_df['zipcodeOri'].value_counts()

'28007'    594643
Name: zipcodeOri, dtype: int64
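
Both zip code columns above hold the single value '28007', so they carry no information for a model. As a minimal sketch of how such constant columns could be found programmatically, the snippet below uses `nunique()` on a tiny made-up frame (the three rows are illustrative, not the real data):

```python
import pandas as pd

# Tiny illustrative frame (not the real data) reusing the notebook's column names
sample = pd.DataFrame({
    "zipcodeOri": ["'28007'", "'28007'", "'28007'"],
    "zipMerchant": ["'28007'", "'28007'", "'28007'"],
    "amount": [4.55, 39.68, 26.89],
})

# nunique() counts distinct values per column; a count of 1 means the column is constant
distinct_counts = sample.nunique()
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()
print(constant_columns)
```

The same two lines could be run on `fraud_df` directly to list candidates for dropping.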

In [8]:
fraud_df['category'].value_counts()

'es_transportation'        505119
'es_food'                   26254
'es_health'                 16133
'es_wellnessandbeauty'      15086
'es_fashion'                 6454
'es_barsandrestaurants'      6373
'es_hyper'                   6098
'es_sportsandtoys'           4002
'es_tech'                    2370
'es_home'                    1986
'es_hotelservices'           1744
'es_otherservices'            912
'es_contents'                 885
'es_travel'                   728
'es_leisure'                  499
Name: category, dtype: int64

In [9]:
fraud_df['fraud'].value_counts()

0    587443
1      7200
Name: fraud, dtype: int64
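
The counts above show a strongly imbalanced label. A short sketch of the arithmetic, with the two counts copied from the output into a hand-built Series (on the real frame, `fraud_df['fraud'].value_counts(normalize=True)` should give the same proportions):

```python
import pandas as pd

# Counts copied from the value_counts() output above
fraud_counts = pd.Series({0: 587443, 1: 7200}, name="fraud")

# Share of transactions labelled as fraud
fraud_rate = fraud_counts[1] / fraud_counts.sum()
print(f"Fraud rate: {fraud_rate:.2%}")
```

Roughly 1.2% of transactions are fraudulent, which matters later when choosing evaluation metrics and splitting the data.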

In [10]:
fraud_df['merchant'].value_counts()

'M1823072687'    299693
'M348934600'     205426
'M85975013'       26254
'M1053599405'      6821
'M151143676'       6373
'M855959430'       6098
'M1946091778'      5343
'M1913465890'      3988
'M209847108'       3814
'M480139044'       3508
'M349281107'       2881
'M1600850729'      2624
'M1535107174'      1868
'M980657600'       1769
'M78078399'        1608
'M1198415165'      1580
'M840466850'       1399
'M1649169323'      1173
'M547558035'        949
'M50039827'         916
'M1888755466'       912
'M692898500'        900
'M1400236507'       776
'M1842530320'       751
'M732195782'        608
'M97925176'         599
'M45060432'         573
'M1741626453'       528
'M1313686961'       527
'M1872033263'       525
'M1352454843'       370
'M677738360'        358
'M2122776122'       341
'M923029380'        323
'M3697346'          308
'M17379832'         282
'M1748431652'       274
'M1873032707'       250
'M2011752106'       244
'M1416436880'       220
'M1294758098'       191
'M1788569036'   

In [11]:
fraud_df['step'].value_counts().sort_index(ascending=False)

179    3709
178    3743
177    3758
176    3721
175    3774
       ... 
4      2532
3      2499
2      2462
1      2424
0      2430
Name: step, Length: 180, dtype: int64

In [12]:
# We analyzed each column in the dataset, identified irrelevant features, and dropped them:
# zipcodeOri and zipMerchant hold a single value ('28007'); customer and merchant are IDs.
columns_to_drop = ['customer', 'zipcodeOri', 'merchant', 'zipMerchant']
fraud_df.drop(columns=columns_to_drop, inplace=True)




In [13]:
# (4) Choose the label and features
label_df = fraud_df['fraud']
features_df = fraud_df.drop(columns='fraud')

In [14]:
fraud_df.head()

   step  age  gender             category  amount  fraud
0     0  '4'     'M'  'es_transportation'    4.55      0
1     0  '2'     'M'  'es_transportation'   39.68      0
2     0  '4'     'F'  'es_transportation'   26.89      0
3     0  '3'     'M'  'es_transportation'   17.25      0
4     0  '5'     'M'  'es_transportation'   35.72      0


In [15]:
fraud_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 594643 entries, 0 to 594642
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   step      594643 non-null  int64  
 1   age       594643 non-null  object 
 2   gender    594643 non-null  object 
 3   category  594643 non-null  object 
 4   amount    594643 non-null  float64
 5   fraud     594643 non-null  int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 27.2+ MB


In [16]:
# We used the Series.unique() method on each column in the dataset
# to check whether the column contains any incorrect data

fraud_df['age'].unique()



array(["'4'", "'2'", "'3'", "'5'", "'1'", "'6'", "'U'", "'0'"],
      dtype=object)

In [17]:
fraud_df['step'].unique()

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179], d

In [19]:
fraud_df['category'].unique()

array(["'es_transportation'", "'es_health'", "'es_otherservices'",
       "'es_food'", "'es_hotelservices'", "'es_barsandrestaurants'",
       "'es_tech'", "'es_sportsandtoys'", "'es_wellnessandbeauty'",
       "'es_hyper'", "'es_fashion'", "'es_home'", "'es_contents'",
       "'es_travel'", "'es_leisure'"], dtype=object)

In [20]:
fraud_df['gender'].unique()

array(["'M'", "'F'", "'E'", "'U'"], dtype=object)

In [21]:
# Series.str.replace() can remove the quotation marks (left commented out here):
# fraud_df['age'] = fraud_df['age'].str.replace("'", '')
# fraud_df['category'] = fraud_df['category'].str.replace("'", '')
# Caution: mapping only 'M' and 'F' would turn the 'E' and 'U' genders into NaN
# fraud_df['gender'] = fraud_df['gender'].map({"'M'": 0, "'F'": 1})
# fraud_df['gender'] = fraud_df['gender'].astype(int)

# Check which age values occur for gender 'E'
e_values = fraud_df.loc[fraud_df['gender'] == "'E'", ['age', 'gender']]
e_values.value_counts()

age  gender
'U'  'E'       1178
dtype: int64
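
The `unique()` outputs show that every text value is wrapped in literal single quotes (for example `"'M'"`). One possible cleanup, sketched here on a small made-up frame rather than on `fraud_df`, is `Series.str.strip("'")`, which removes only leading and trailing quotes:

```python
import pandas as pd

# Illustrative rows that mimic the quoted values seen in the notebook
sample = pd.DataFrame({
    "age": ["'4'", "'2'", "'U'"],
    "gender": ["'M'", "'F'", "'E'"],
    "category": ["'es_transportation'", "'es_health'", "'es_food'"],
})

# Strip the leading and trailing single quotes from every text column
text_columns = ["age", "gender", "category"]
for column in text_columns:
    sample[column] = sample[column].str.strip("'")

print(sample["gender"].unique())
```

Stripping first keeps all four gender codes intact, so the decision about how to treat 'E' and 'U' can be made separately.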

In [22]:

# (5) Engineer features so that the data is
# (a) relevant
# (b) unique
# (c) correct
# (d) not missing

# Drop data that is not (a), (b), (c), or (d)
# Use one-hot encoding for nominal features
# Reduce the dimensions of your features
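
As one way to approach the one-hot encoding mentioned in this step, the sketch below applies `pd.get_dummies` to a small made-up frame. It assumes the quotes have already been removed from the values; the sample rows are illustrative only.

```python
import pandas as pd

# Illustrative rows with the quotes already stripped
sample = pd.DataFrame({
    "gender": ["M", "F", "E", "M"],
    "category": ["es_transportation", "es_food", "es_health", "es_food"],
    "amount": [4.55, 39.68, 26.89, 17.25],
})

# Each nominal column becomes one 0/1 indicator column per distinct value
encoded = pd.get_dummies(sample, columns=["gender", "category"], dtype=int)
print(encoded.columns.tolist())
```

With 15 categories and 4 gender codes in the real data, this would add a modest number of indicator columns, so dimensionality reduction may not be needed for these features.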

In [23]:
# (6) Confirm data is ready with further exploratory analysis

In [24]:
# (7) Training, Testing (and/or Validation) data split 

# for example, 60/20/20
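
A 60/20/20 split can be sketched with two calls to scikit-learn's `train_test_split`. The data here is synthetic, and stratifying on the label is a suggestion (not something the notebook prescribes) so that the rare fraud class appears in every subset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data: 100 rows, 10 of them labelled as fraud
features = pd.DataFrame({"amount": range(100)})
labels = pd.Series([1] * 10 + [0] * 90, name="fraud")

# First keep 60% for training, stratifying so the fraud share is preserved
X_train, X_rest, y_train, y_rest = train_test_split(
    features, labels, test_size=0.4, stratify=labels, random_state=42
)
# Then split the remaining 40% in half: 20% validation, 20% test
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```

Because the transactions are ordered by `step`, a split by time (training on earlier steps, testing on later ones) is another option worth considering.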

In [25]:
# (7.b) If using Deep Learning, build the model

# Add Input Layer
# Add Hidden Layers
# Add Output

In [26]:
# (8) Training the Machine Learning Model (i.e., Fitting the Model)
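
The notebook does not fix a model type, so the following is only a sketch of what fitting could look like, using scikit-learn's `LogisticRegression` as an assumed choice. The training data is synthetic: the amount distributions below are invented for illustration and are not derived from the BankSim file.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in data: fraud amounts are drawn larger than legitimate ones
legit_amounts = rng.normal(loc=30.0, scale=10.0, size=950)
fraud_amounts = rng.normal(loc=500.0, scale=100.0, size=50)
X_train = np.concatenate([legit_amounts, fraud_amounts]).reshape(-1, 1)
y_train = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])

# class_weight="balanced" weights each class inversely to its frequency
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Predict for one small and one large transaction amount
print(model.predict([[25.0], [600.0]]))
```

The `class_weight="balanced"` option is one common way to keep a rare positive class from being ignored during fitting; resampling the training data is an alternative.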

In [27]:
# (9) Evaluate the model metrics for Training (and/or Validation) data

In [28]:
# (10) Evaluate the model metrics for Testing data


# If metrics are poor, optimize either (a) the data or (b) the hyperparameters
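
When judging whether metrics are poor, accuracy alone can mislead on this dataset. The sketch below rebuilds the label distribution from the `fraud` value counts reported earlier in the notebook (587443 legitimate, 7200 fraudulent) and scores a trivial baseline that never flags fraud:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Label counts taken from the fraud value_counts() output earlier in the notebook
y_true = np.concatenate([np.zeros(587443, dtype=int), np.ones(7200, dtype=int)])
# A trivial baseline that predicts "not fraud" for every transaction
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"accuracy={accuracy:.4f}, recall={recall:.4f}")
```

The baseline reaches close to 99% accuracy while catching no fraud at all, so recall and precision on the fraud class are more informative targets for optimization.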

In [29]:
# (11) Use the model for prediction

In [30]:
# (12) Write final predicted data (e.g., to CSV or JSON, etc.)
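
A minimal sketch of the final write step, assuming the predictions have been collected into a DataFrame. The `predicted_fraud` column name, the file name, and the three rows are all hypothetical; a temporary directory is used so the example does not touch the project folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictions; the column names and values are illustrative
predictions = pd.DataFrame({
    "step": [0, 0, 1],
    "amount": [4.55, 39.68, 26.89],
    "predicted_fraud": [0, 0, 1],
})

# Write without the index so the file contains only the data columns
output_path = os.path.join(tempfile.mkdtemp(), "fraud_predictions.csv")
predictions.to_csv(output_path, index=False)

# Read the file back to confirm it round-trips
round_trip = pd.read_csv(output_path)
print(round_trip.equals(predictions))
```

`DataFrame.to_json` could be used in the same way if JSON output is preferred.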