# Data preprocessing

Once you understand your dataset, you'll probably have some idea of how you want to model it. Most machine learning libraries in Python, including scikit-learn, expect numerical input, so any categorical variables in your dataset need to be encoded as numbers first.

# 1. One-Hot Encoding
One-Hot Encoding creates a binary column for each category level. It's useful for nominal data without an intrinsic order.

Pros:

- Easy to understand and implement.
- Does not assume an order among the categories.

Cons:

- Adds one column per category level, so high-cardinality variables inflate the dataset's dimensionality and can lead to the curse of dimensionality.
- Not suitable for variables with many levels.
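As a minimal sketch of the idea, the snippet below one-hot encodes a small made-up column with `pandas.get_dummies`. The toy values and the `type` prefix are illustrative assumptions, not the notebook's dataset.

```python
import pandas as pd

# Toy stand-in for a nominal column; the values are made up for illustration.
toy = pd.DataFrame({"type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"]})

# One binary column per category level, named "<prefix>_<level>".
one_hot = pd.get_dummies(toy["type"], prefix="type", dtype=int)

print(one_hot)
```

Each row has exactly one 1, and the number of columns equals the number of distinct levels, which is why the width grows with cardinality.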

# 2. Label Encoding
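Label Encoding converts each category level into an integer. It suits ordinal data, provided the integers follow the categories' real order. Note that scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order and is designed for target labels, so the resulting codes reflect a meaningful order only by coincidence.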

Pros:

- Keeps the dataset's dimensionality low.
- Suitable for tree-based algorithms.

Cons:

- Introduces a numerical relationship between categories that might not exist.
- Not suitable for models like linear regression, which treat the magnitude of the codes as meaningful.
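To illustrate why the order of the codes matters, here is a small sketch using `pandas.Categorical` on a hypothetical `sizes` column. By default the categories are sorted alphabetically (as `LabelEncoder` also does), while passing `categories=` explicitly lets the codes follow the intended order.

```python
import pandas as pd

# Hypothetical ordinal column, made up for illustration.
sizes = pd.Series(["small", "large", "medium", "small"])

# Default: categories are sorted alphabetically (large, medium, small).
alphabetical_codes = pd.Categorical(sizes).codes.tolist()

# Explicit order: small < medium < large.
ordered = pd.Categorical(sizes, categories=["small", "medium", "large"], ordered=True)
ordered_codes = ordered.codes.tolist()

print(alphabetical_codes)  # [2, 0, 1, 2]
print(ordered_codes)       # [0, 2, 1, 0]
```

Only the second encoding preserves `small < medium < large`; the first one ranks `large` lowest purely because of its spelling.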

# 3. Frequency Encoding
Frequency Encoding replaces each category with how often it occurs in the data, either as an absolute count or as a relative frequency. It is useful when the frequency distribution is informative.

Pros:

- Captures the importance of category levels based on their frequency.
- Keeps dimensionality low.

Cons:

- Different categories can have the same frequency, leading to loss of information.
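The collision problem can be seen on a tiny made-up example. The sketch below uses the same `map(value_counts())` pattern on a hypothetical `colors` column, where two different categories end up with the same encoded value.

```python
import pandas as pd

# Made-up data: "red" and "blue" both occur twice.
colors = pd.Series(["red", "red", "blue", "blue", "green"])

counts = colors.value_counts()      # absolute count per category
freq_encoded = colors.map(counts)   # replace each category with its count

print(freq_encoded.tolist())  # [2, 2, 2, 2, 1]
```

After encoding, `red` and `blue` are indistinguishable. Passing `normalize=True` to `value_counts` would give relative frequencies instead of counts, with the same collision.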

# 4. Binary Encoding
Binary Encoding first converts the categories into ordinal integers, then writes those integers in binary, and finally splits the binary digits into separate columns. A variable with n levels therefore needs only about log2(n) columns instead of n.

Pros:

- Reduces the number of dimensions compared to one-hot encoding.
- Suitable for variables with a high number of categories.

Cons:

- More complex and not as straightforward as one-hot or label encoding.
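The three steps can be sketched by hand in a few lines. The `binary_encode` helper below is hypothetical and written only for illustration; it is not the implementation used by `category_encoders`, and it assumes codes start at 1 in order of first appearance.

```python
import pandas as pd

def binary_encode(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Hand-rolled binary encoding sketch (hypothetical helper, not a library API)."""
    # Step 1: ordinal codes starting at 1, in order of first appearance.
    codes = pd.factorize(series)[0] + 1
    # Step 2: number of bits needed to represent the largest code.
    n_bits = int(codes.max()).bit_length()
    # Step 3: one column per bit, most significant bit first.
    bits = {
        f"{prefix}_{i}": (codes >> (n_bits - 1 - i)) & 1
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)

# Toy column with five levels, made up for illustration.
toy = pd.Series(["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "DEBIT", "CASH_IN"])
encoded = binary_encode(toy, "type")
print(encoded)
```

Five levels fit into three columns here, where one-hot encoding would need five.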

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
import category_encoders as ce

In [2]:
# Load data
file_path = "C:\\Users\\praja\\PycharmProjects\\FraudDetection\\Data\\transactions_train.csv"
train = pd.read_csv(file_path)

# Drop the 'nameOrig' and 'nameDest' columns
train.drop(['nameOrig', 'nameDest'], axis=1, inplace=True)

# Define target and features
X = train.drop("isFraud", axis=1)
y = train["isFraud"].to_numpy()

# Select the 'type' column for encoding
cat_column = 'type'

In [3]:
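# Label Encoding
# Note: LabelEncoder assigns integers in sorted (alphabetical) order of the category names.
label_encoder = LabelEncoder()
X_label_encoded = pd.DataFrame(
    label_encoder.fit_transform(X[cat_column]),
    columns=[cat_column + '_label'],
    index=X.index,  # keep rows aligned with X for the later concat
)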

In [4]:
X_label_encoded

index,type_label
0,3
1,3
2,4
3,1
4,3
...,...
6351188,4
6351189,1
6351190,4
6351191,1


In [5]:
# Frequency Encoding
# Replace each category with its absolute count in the column.
frequency_encoding = X[cat_column].map(X[cat_column].value_counts())

In [6]:
frequency_encoding

0          2147832
1          2147832
2           531817
3          2233369
4          2147832
            ...   
6351188     531817
6351189    2233369
6351190     531817
6351191    2233369
6351192      41310
Name: type, Length: 6351193, dtype: int64

In [7]:
# Binary Encoding
binary_encoder = ce.BinaryEncoder(cols=[cat_column])
X_binary_encoded = binary_encoder.fit_transform(X[[cat_column]])

In [8]:
X_binary_encoded

index,type_0,type_1,type_2
0,0,0,1
1,0,0,1
2,0,1,0
3,0,1,1
4,0,0,1
...,...,...,...
6351188,0,1,0
6351189,0,1,1
6351190,0,1,0
6351191,0,1,1


In [9]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

In [10]:
# Concatenate label encoded 'type' column with the original data
X_full_processed = pd.concat([X, X_label_encoded], axis=1)

In [11]:
X_full_processed

index,step,type,amount,oldbalanceOrig,newbalanceOrig,oldbalanceDest,newbalanceDest,type_label
0,1,PAYMENT,9839.64,170136.00,160296.36,0.00,0.00,3
1,1,PAYMENT,1864.28,21249.00,19384.72,0.00,0.00,3
2,1,TRANSFER,181.00,181.00,0.00,0.00,0.00,4
3,1,CASH_OUT,181.00,181.00,0.00,21182.00,0.00,1
4,1,PAYMENT,11668.14,41554.00,29885.86,0.00,0.00,3
...,...,...,...,...,...,...,...,...
6351188,699,TRANSFER,162326.52,162326.52,0.00,0.00,0.00,4
6351189,699,CASH_OUT,162326.52,162326.52,0.00,0.00,162326.52,1
6351190,699,TRANSFER,2763398.31,2763398.31,0.00,0.00,0.00,4
6351191,699,CASH_OUT,2763398.31,2763398.31,0.00,339515.35,3102913.66,1


In [12]:
# log1p computes log(1 + x), which compresses the heavily skewed amounts and is defined at 0.
X_full_processed['log_transformed_amount'] = np.log1p(X_full_processed['amount'])
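As a quick illustration of what this transform does, the sketch below applies `np.log1p` to a few amounts and inverts it with `np.expm1`. The three non-zero values are taken from the table above; the `0.0` is added as an assumption to show the zero case.

```python
import numpy as np

# Non-zero values come from the 'amount' column shown above; 0.0 is added for illustration.
amounts = np.array([0.0, 181.0, 9839.64, 2763398.31])

logged = np.log1p(amounts)    # log(1 + x): maps 0 to 0 and compresses large values
restored = np.expm1(logged)   # exp(x) - 1: the inverse transform

print(logged)
print(restored)
```

Values spanning several orders of magnitude end up on a scale of roughly 0 to 15, and the original amounts can be recovered when needed.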

In [13]:
# Keep only the numeric features (drops the raw 'type' and 'amount' columns).
X_full_processed = X_full_processed[[
    'step', 'oldbalanceOrig', 'newbalanceOrig', 'oldbalanceDest',
    'newbalanceDest', 'type_label', 'log_transformed_amount',
]]

In [14]:
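# Dimensionality Reduction with PCA
# n_components=0.95 keeps the smallest number of components that explain more than 95% of the variance.
# Note: the features are not standardized here (StandardScaler is imported but unused),
# so the large-magnitude balance columns dominate the components.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_full_processed)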

In [15]:
X_pca

array([[-1762981.6933353 ,  -728136.30484524],
       [-1790911.01715249,  -931078.04294425],
       [-1794809.20062294,  -959404.02649291],
       ...,
       [-1530600.96978408,   964020.19848722],
       [  953363.71152689,   592886.29761781],
       [-1614525.38105568,  -969557.35487434]])
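PCA is sensitive to feature scale, and the component values in the millions above suggest that the unscaled balance columns drive the result. The sketch below uses synthetic data (an assumption, not the notebook's dataset) to show how standardizing with `StandardScaler` before PCA changes the number of components retained at the same 95% threshold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: one large-scale feature and two small-scale ones, all independent.
X_toy = np.column_stack([
    rng.normal(0, 1_000_000, size=500),
    rng.normal(0, 1, size=500),
    rng.normal(0, 1, size=500),
])

# PCA on raw values: the large-scale feature accounts for almost all the variance.
unscaled_pca = PCA(n_components=0.95).fit(X_toy)

# PCA after standardization: every feature contributes on an equal footing.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95)).fit(X_toy)

print(unscaled_pca.n_components_)
print(scaled_pca.named_steps["pca"].n_components_)
```

On this toy data the unscaled fit keeps a single component while the scaled fit keeps all three, so whether to standardize first is a modelling decision worth making explicitly.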