# Data wrangling

## Background

The dataset has been downloaded from [Kaggle.com](https://www.kaggle.com/mlg-ulb/creditcardfraud).

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
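
To make the imbalance concrete, here is a minimal sketch that recomputes the fraud rate from the two counts quoted above (roughly 0.17%). It also shows the accuracy that a trivial model would reach by labelling every transaction as legitimate, which is why plain accuracy is a poor metric for this data. The variable names are illustrative only.

```python
# Figures quoted in the dataset description above.
n_transactions = 284_807
n_frauds = 492

fraud_rate = n_frauds / n_transactions

# Accuracy of a trivial model that labels every transaction as legitimate.
baseline_accuracy = 1 - fraud_rate

print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Always-legitimate accuracy: {baseline_accuracy:.3%}")
```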

It contains only numerical input variables, which are the result of a PCA transformation. The original features cannot be provided due to confidentiality issues. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount, which can be used, for instance, for example-dependent cost-sensitive learning. The feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

## Import packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Read in data

In [2]:
credit_df = pd.read_csv('~/documents/Data/Credit Card Fraud Data/credit_card.csv')
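
Right after loading, a quick sanity check on the columns and the label can catch a wrong or truncated file early. The sketch below is only an illustration: it reads a tiny in-memory stand-in rather than the real file, and `sample_csv`, `sample_df` and `expected_columns` are hypothetical names.

```python
import io

import pandas as pd

# Tiny stand-in for the real file, with only a few of its columns.
sample_csv = io.StringIO(
    "Time,V1,V2,Amount,Class\n"
    "0.0,-1.36,-0.07,149.62,0\n"
    "0.0,1.19,0.27,2.69,0\n"
    "1.0,-1.36,-1.34,378.66,1\n"
)
sample_df = pd.read_csv(sample_csv)

# Hypothetical minimum set of columns the rest of the notebook relies on.
expected_columns = {"Time", "Amount", "Class"}
missing = expected_columns - set(sample_df.columns)
assert not missing, f"Missing columns: {missing}"

# The label should contain only 0 and 1.
assert set(sample_df["Class"].unique()) <= {0, 1}

print(sample_df.shape)
```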

## Basic information

In [3]:
credit_df.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


Shape of raw dataframe.

In [4]:
print(credit_df.shape)

(284807, 31)


Number of fraud cases.

In [5]:
print(credit_df['Class'].sum())

492


Column names.

In [6]:
print(credit_df.columns)

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')


Column data types.

In [7]:
print(credit_df.dtypes)

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object


## Drop irrelevant columns

I choose to drop the 'Time' column because I doubt that the seconds elapsed since the first transaction are, on their own, a meaningful predictor of fraud.

In the future, I could do some time series analysis with this column, e.g. to see if fraudulent transactions are more likely to occur at a certain time of day.
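
As a rough idea of what that time-of-day analysis could look like, the sketch below buckets 'Time' into hours and computes the fraud rate per bucket. It runs on made-up toy data, and because the clock time of the first transaction is not given, the "hour" here is only relative to the start of the dataset. The names `toy_df` and `by_hour` are illustrative.

```python
import pandas as pd

# Toy stand-in for the raw data before 'Time' is dropped.
toy_df = pd.DataFrame({
    "Time": [0.0, 1800.0, 3700.0, 7300.0, 86400.0 + 3700.0],
    "Class": [0, 1, 0, 0, 1],
})

# Hours elapsed since the first transaction, wrapped to a 24-hour cycle.
toy_df["Hour"] = (toy_df["Time"] // 3600 % 24).astype(int)

# Fraud rate and transaction count per hour bucket.
by_hour = toy_df.groupby("Hour")["Class"].agg(["mean", "count"])
print(by_hour)
```

With only two days of data, any hourly pattern found this way would rest on very few fraud cases per bucket, so it should be read with caution.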

In [8]:
credit_df = credit_df.drop('Time', axis=1)

In [9]:
credit_df.columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class'],
      dtype='object')

## Rename columns

Rename 'Class' to 'Fraud' so its meaning is clearer.

In [10]:
credit_df = credit_df.rename(columns={'Class': 'Fraud'})

In [11]:
credit_df.columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Fraud'],
      dtype='object')

## Standardise columns

The PCA columns are already centred on zero by construction, although they are not necessarily scaled to unit variance; I leave them as they are. All that remains is to standardise the 'Amount' column. This will result in negative amounts, which is unintuitive, but that does not matter for my purposes.

In [12]:
mean = credit_df['Amount'].mean()
std = credit_df['Amount'].std()
credit_df['Amount'] = (credit_df['Amount'] - mean) / std
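
A quick way to confirm that the standardisation did what was intended is to check that the result has mean close to 0 and standard deviation close to 1. The sketch below does this on a handful of toy amounts (the `amounts` series is illustrative, not the real column). It also shows that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, so the result differs slightly from scaling with the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Toy amounts standing in for the real 'Amount' column.
amounts = pd.Series([149.62, 2.69, 378.66, 123.50, 69.99], name="Amount")

# Same formula as the cell above; pandas uses the sample std (ddof=1).
scaled = (amounts - amounts.mean()) / amounts.std()

assert np.isclose(scaled.mean(), 0.0)
assert np.isclose(scaled.std(), 1.0)

# Amounts below the mean become negative after scaling.
print(scaled)

# Scaling with the population std (ddof=0) gives slightly larger magnitudes.
scaled_pop = (amounts - amounts.mean()) / amounts.std(ddof=0)
print(scaled_pop)
```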

## Duplicate rows

Number of duplicate rows.

In [13]:
print(credit_df.duplicated().sum())

9144


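Drop the duplicate rows: plenty of rows remain afterwards, and there is no clear alternative way to handle them. Note that the count above was taken after dropping 'Time', so transactions that differ only in their timestamp are treated as duplicates.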

In [14]:
credit_df = credit_df.drop_duplicates()
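
Because 'Time' was dropped before this step, rows that differed only in their timestamp now count as duplicates. The toy example below (made-up values, illustrative names) shows how the duplicate count can change depending on whether 'Time' is still present.

```python
import pandas as pd

# Toy rows: the first two differ only in 'Time'.
toy_df = pd.DataFrame({
    "Time": [0.0, 5.0, 9.0],
    "V1": [1.2, 1.2, -0.4],
    "Amount": [10.0, 10.0, 25.0],
    "Class": [0, 0, 1],
})

# Duplicates counted on all columns, including 'Time'.
dupes_with_time = toy_df.duplicated().sum()

# Duplicates counted after removing 'Time', as in this notebook.
dupes_without_time = toy_df.drop(columns="Time").duplicated().sum()

print(dupes_with_time, dupes_without_time)
```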

New shape after removing duplicate rows.

In [15]:
print(credit_df.shape)

(275663, 30)


New number of fraud rows.

In [16]:
print(credit_df['Fraud'].sum())

473


Only 19 of the 492 fraud rows (about 4%) were removed, so it's OK to carry on.
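
The same check can be made before dropping anything, by counting the duplicate rows in each class. The sketch below uses toy data with illustrative names; on the real data it would need to run before `drop_duplicates()`.

```python
import pandas as pd

# Toy data: one repeated legitimate row and one repeated fraud row.
toy_df = pd.DataFrame({
    "V1": [1.2, 1.2, -0.4, -0.4, 0.7],
    "Amount": [10.0, 10.0, 25.0, 25.0, 3.0],
    "Fraud": [0, 0, 1, 1, 0],
})

# Flag every repeat of an earlier row (the first occurrence is not flagged).
is_duplicate = toy_df.duplicated()

# Number of duplicate rows in each class.
dupes_per_class = is_duplicate.groupby(toy_df["Fraud"]).sum()
print(dupes_per_class)
```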

## Check for null or NA values

In [17]:
print(credit_df.isna().sum().sum())

0


There are no null or NA values.
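
A single total is enough here, but a per-column breakdown is more informative when the total is not zero. Note also that `isna()` does not count infinite values by default, so a separate check may be useful. The sketch below shows both on toy data with illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one infinite value.
toy_df = pd.DataFrame({
    "V1": [0.5, np.nan, -1.2],
    "Amount": [1.0, np.inf, 3.0],
})

# Missing values per column; infinities are not counted here.
na_per_column = toy_df.isna().sum()

# Infinite values per column, checked on the numeric columns only.
numeric = toy_df.select_dtypes(include="number")
inf_per_column = np.isinf(numeric).sum()

print(na_per_column)
print(inf_per_column)
```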

## Save cleaned data to file

In [18]:
credit_df.to_csv('~/documents/Data/Credit Card Fraud Data/credit_card_clean.csv', index=False)
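
When the cleaned file is read back later, it helps to know whether the row index was written too: by default `to_csv` writes it as an extra unnamed column. The sketch below compares the default with `index=False` on toy data, using temporary files; the file names and `toy_df` are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a non-contiguous index, as left behind by drop_duplicates().
toy_df = pd.DataFrame({"Amount": [0.1, -0.3], "Fraud": [0, 1]}, index=[0, 2])

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    toy_df.to_csv(with_index)                  # default: the index is written
    toy_df.to_csv(without_index, index=False)  # the index is omitted

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```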