# Transactions Fraud Detection

**Authors:** [Peter Macinec](https://github.com/pmacinec), [Timotej Zatko](https://github.com/timzatko)

## Dataset Processing

In this Jupyter notebook, we take a first look at the data and do the initial processing needed for later stages.

### Setup and reading the data

In [1]:
import pandas as pd

For this problem, we use data from the Kaggle competition [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection/overview). The dataset contains two `csv` files, *identities* and *transactions*.

Let's import both csv files:

In [2]:
df_identities = pd.read_csv('../data/identities.csv')
df_transactions = pd.read_csv('../data/transactions.csv')

What does the data look like?

In [3]:
df_identities.head(3)

Unnamed: 0,TransactionID,id_01,id_02,id_03,id_04,id_05,id_06,id_07,id_08,id_09,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987004,0.0,70787.0,,,,,,,,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M
1,2987008,-5.0,98945.0,,,0.0,-5.0,,,,...,mobile safari 11.0,32.0,1334x750,match_status:1,T,F,F,T,mobile,iOS Device
2,2987010,-5.0,191631.0,0.0,0.0,0.0,0.0,,,0.0,...,chrome 62.0,,,,F,F,T,T,desktop,Windows


In [4]:
df_transactions.head(3)

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,


### Join datasets

The original dataset description on Kaggle includes the following note:
```
The data is broken into two files identity and transaction, which are joined by TransactionID. Not all transactions have corresponding identity information.
```

Because not all transactions have corresponding identity information, a `left` join on `TransactionID` is needed so that every transaction is kept:

In [5]:
df = pd.merge(df_transactions, df_identities, on='TransactionID', how='left')
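To illustrate what the left join does with transactions that lack identity information, here is a minimal sketch on made-up toy data (the `toy_*` frames and their values are assumptions for illustration, not part of the dataset). The `indicator` and `validate` arguments are optional extras of `pd.merge` that were not used in the cell above; they are shown here only as one possible way to inspect the result.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are invented for illustration).
toy_transactions = pd.DataFrame({
    'TransactionID': [1, 2, 3],
    'TransactionAmt': [68.5, 29.0, 59.0],
})
toy_identities = pd.DataFrame({
    'TransactionID': [3],
    'DeviceType': ['mobile'],
})

toy_merged = pd.merge(
    toy_transactions,
    toy_identities,
    on='TransactionID',
    how='left',              # keep every transaction, matched or not
    validate='one_to_one',   # raise if TransactionID repeats on either side
    indicator=True,          # add a `_merge` column: 'both' or 'left_only'
)

print(toy_merged)
print(toy_merged['_merge'].value_counts())
```

Rows marked `left_only` are transactions without an identity record; their identity columns are filled with `NaN`, which matches the empty identity columns seen for most rows of the merged dataset.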

In [6]:
df.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T,mobile,SAMSUNG SM-G892A Build/NRD90M


What are the shapes of the `transactions` and `identities` dataframes?

In [7]:
print(f'Transactions shape: {df_transactions.shape}')
print(f'Identities shape: {df_identities.shape}')

Transactions shape: (590540, 394)
Identities shape: (144233, 41)


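The final dataset should keep all 590540 transactions and have all identity features appended to the transaction ones, i.e. 394 + 41 - 1 = 434 columns (`TransactionID` is shared by both dataframes):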

In [8]:
df.shape

(590540, 434)
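The shape above can be sanity-checked with simple arithmetic. The sketch below uses a hypothetical helper, `expected_left_merge_shape`, which is not part of the notebook; it assumes that the join key is unique in the right-hand dataframe (so no rows are duplicated) and that the two dataframes share no column names other than the key.

```python
def expected_left_merge_shape(left_shape, right_shape, n_key_columns=1):
    """Hypothetical helper: expected shape of a left join.

    Assumes the key is unique in the right dataframe and that the key
    columns are the only column names the two dataframes share.
    """
    left_rows, left_cols = left_shape
    _, right_cols = right_shape
    # Rows come from the left side; key columns are counted only once.
    return (left_rows, left_cols + right_cols - n_key_columns)


# Shapes printed earlier in the notebook.
transactions_shape = (590540, 394)
identities_shape = (144233, 41)

expected = expected_left_merge_shape(transactions_shape, identities_shape)
print(expected)  # (590540, 434)
```

If the real merged shape had more rows than the transactions dataframe, that would suggest duplicated `TransactionID` values in the identities data.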

The dataframes are now merged and can be handled as one in the next phases.

Let's save the final dataset for later use:

In [9]:
df.to_csv('../data/dataset.csv', index=False)
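As a side note on the `index=False` argument, the following sketch (using an in-memory buffer and a made-up two-row dataframe, both assumptions for illustration) shows what happens on a round trip with and without it. By default, `to_csv` writes the row index as an unnamed first column, which `read_csv` then loads back as a column called `Unnamed: 0`.

```python
import io

import pandas as pd

# Tiny made-up dataframe standing in for the merged dataset.
toy = pd.DataFrame({'TransactionID': [2987000, 2987001], 'isFraud': [0, 0]})

# Round trip with the default index=True: the index becomes an extra column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# Round trip with index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'TransactionID', 'isFraud']
print(cols_without_index)  # ['TransactionID', 'isFraud']
```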