# Preprocessing

In this section, we prepare the dataset for anomaly detection with neural networks in TensorFlow. Because we plan to use autoencoders and other deep learning models, we apply label encoding to the categorical features rather than one-hot encoding, which keeps the input dimensionality low.

Here's a summary of the preprocessing steps:
- **Country Code**: With 229 unique values, this column is **label encoded** so that each country is represented by a single integer. One-hot encoding would add one column per country and significantly inflate the feature space, which is not optimal for neural networks.
- **Device Type**: This has a limited number of categories and will also be **label encoded**.
- **Boolean Features** (`is_login_success`, `is_attack_ip`, `is_account_takeover`): These will be converted to integers, with `False` as `0` and `True` as `1`.
- **Browser Name** and **Operating System Name**: These categorical features will be **label encoded** as well. One-hot encoding is unnecessary here, given our modeling choice.

This encoding strategy keeps the feature matrix compact, so the AutoEncoder and any other anomaly detection models built in TensorFlow can process the input features efficiently.
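
To make the dimensionality argument concrete, here is a minimal sketch comparing the width of a label-encoded column with its one-hot equivalent. The toy `country_code` values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values, not from the real dataset)
toy = pd.DataFrame({"country_code": ["NO", "US", "DE", "NO", "BR", "US"]})

# Label encoding: a single integer column, regardless of cardinality
label_encoded = LabelEncoder().fit_transform(toy["country_code"])

# One-hot encoding: one column per distinct value
one_hot = pd.get_dummies(toy["country_code"])

print(label_encoded.shape)  # one column of integer codes
print(one_hot.shape)        # one column per distinct country
```

With more than two hundred distinct countries, the one-hot version of this single feature would need more than two hundred columns, while label encoding keeps it to one. The trade-off is that integer codes imply an ordering between countries that carries no real meaning.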

In [13]:
# General purpose
import numpy as np
import pandas as pd

# Dask for handling large datasets
import dask.dataframe as dd

# Encoding
from sklearn.preprocessing import LabelEncoder

# Scaling
from sklearn.preprocessing import MinMaxScaler

# TensorFlow for model building
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# System and warnings
import os
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Step 1: Load partitioned CSVs
df = dd.read_csv('../data/processed/*.part')

# Step 2: Reset the index, discarding the old one
df = df.reset_index(drop=True)

# Step 3 (optional): Check for duplicate column names
assert df.columns.duplicated().sum() == 0, "You have duplicate column names!"

# Step 4: Materialize the Dask DataFrame as an in-memory pandas DataFrame
df = df.compute()

In [5]:
df.dtypes

user_id                          int64
country_code           string[pyarrow]
asn                              int64
device_type            string[pyarrow]
is_login_success                  bool
is_attack_ip                      bool
is_account_takeover               bool
login_hours                      int64
login_day                        int64
browser_name           string[pyarrow]
os_name                string[pyarrow]
dtype: object

**Encoding the Columns**:

In [7]:
# Columns to encode
cols_to_encode = ['country_code', 'device_type', 'browser_name', 'os_name']

# Store encoders
encoders = {}

# Encode each column and keep its fitted encoder for later decoding
for col in cols_to_encode:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    encoders[col] = le

# Boolean columns
bool_cols = ['is_login_success', 'is_attack_ip', 'is_account_takeover']
for col in bool_cols:
    df[col] = df[col].astype(int)
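
Keeping the fitted encoders in a dictionary makes it possible to translate integer codes back into the original labels later, for example when inspecting flagged anomalies. The sketch below illustrates this on a toy `device_type` column; the values are hypothetical and only stand in for the real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical values) standing in for the real dataset
toy = pd.DataFrame({"device_type": ["mobile", "desktop", "tablet", "mobile"]})

encoders = {}
le = LabelEncoder()
toy["device_type"] = le.fit_transform(toy["device_type"].astype(str))
encoders["device_type"] = le

# LabelEncoder assigns codes following the sorted order of the labels
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# Recover the original strings from the integer codes
decoded = encoders["device_type"].inverse_transform(toy["device_type"])
print(list(decoded))
```

Note that `inverse_transform` only works on the integer codes, so any decoding of this kind has to happen before (or independently of) the scaling step.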

In [8]:
# Verify Encoding
df.dtypes

user_id                int64
country_code           int32
asn                    int64
device_type            int32
is_login_success       int32
is_attack_ip           int32
is_account_takeover    int32
login_hours            int64
login_day              int64
browser_name           int32
os_name                int32
dtype: object

In [9]:
df.head()

| | user_id | country_code | asn | device_type | is_login_success | is_attack_ip | is_account_takeover | login_hours | login_day | browser_name | os_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4324475583306591935 | 153 | 29695 | 2 | 0 | 0 | 0 | 12 | 0 | 46 | 43 |
| 1 | -4324475583306591935 | 11 | 60117 | 2 | 0 | 0 | 0 | 12 | 0 | 24 | 0 |
| 2 | -3284137479262433373 | 153 | 29695 | 2 | 1 | 0 | 0 | 12 | 0 | 5 | 43 |
| 3 | -4324475583306591935 | 211 | 393398 | 2 | 0 | 0 | 0 | 12 | 0 | 25 | 0 |
| 4 | -4618854071942621186 | 211 | 398986 | 2 | 0 | 1 | 0 | 12 | 0 | 25 | 0 |


**Scale the Columns with `MinMaxScaler`**:

In [14]:
# Columns to scale
cols_to_scale = [
    'country_code', 'device_type',
    'is_login_success', 'is_attack_ip',
    'login_hours', 'login_day',
    'browser_name', 'os_name'
]

# Initialize scaler
scaler = MinMaxScaler()

# Fit and transform
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
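
`MinMaxScaler` rescales each column to the range [0, 1] using `(x - min) / (max - min)`. As a sanity check, the sketch below applies that formula by hand to four values from the first encoded row shown earlier. The column maxima (228, 23, 194, 44) are assumptions inferred from the scaled output, not values printed in the notebook.

```python
import numpy as np

def min_max(x, x_min, x_max):
    """Rescale x from [x_min, x_max] to [0, 1]."""
    return (x - x_min) / (x_max - x_min)

# First row after label encoding:
# country_code, login_hours, browser_name, os_name
encoded = np.array([153, 12, 46, 43], dtype=np.float64)

# Assumed column ranges (minimum 0; maxima inferred from the scaled output)
col_min = np.zeros(4)
col_max = np.array([228, 23, 194, 44], dtype=np.float64)

scaled = min_max(encoded, col_min, col_max)
print(np.round(scaled, 6))
```

The results round to 0.671053, 0.521739, 0.237113, and 0.977273, which agrees with the first row of the scaled DataFrame.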

In [15]:
df.head()

| | user_id | country_code | asn | device_type | is_login_success | is_attack_ip | is_account_takeover | login_hours | login_day | browser_name | os_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -4324475583306591935 | 0.671053 | 29695 | 0.5 | 0.0 | 0.0 | 0 | 0.521739 | 0.0 | 0.237113 | 0.977273 |
| 1 | -4324475583306591935 | 0.048246 | 60117 | 0.5 | 0.0 | 0.0 | 0 | 0.521739 | 0.0 | 0.123711 | 0.0 |
| 2 | -3284137479262433373 | 0.671053 | 29695 | 0.5 | 1.0 | 0.0 | 0 | 0.521739 | 0.0 | 0.025773 | 0.977273 |
| 3 | -4324475583306591935 | 0.925439 | 393398 | 0.5 | 0.0 | 0.0 | 0 | 0.521739 | 0.0 | 0.128866 | 0.0 |
| 4 | -4618854071942621186 | 0.925439 | 398986 | 0.5 | 0.0 | 1.0 | 0 | 0.521739 | 0.0 | 0.128866 | 0.0 |


In [16]:
df.dtypes

user_id                  int64
country_code           float64
asn                      int64
device_type            float64
is_login_success       float64
is_attack_ip           float64
is_account_takeover      int32
login_hours            float64
login_day              float64
browser_name           float64
os_name                float64
dtype: object

**Type Casting**:

In [17]:
columns_to_check = ['country_code', 'device_type', 'login_hours', 'login_day', 'browser_name', 'os_name', 'user_id', 'asn']

for col in columns_to_check:
    min_val = df[col].min()
    max_val = df[col].max()
    print(f"{col}: min = {min_val}, max = {max_val}")

country_code: min = 0.0, max = 1.0
device_type: min = 0.0, max = 1.0
login_hours: min = 0.0, max = 1.0
login_day: min = 0.0, max = 1.0
browser_name: min = 0.0, max = 1.0
os_name: min = 0.0, max = 1.0
user_id: min = -9223371191532286299, max = 9223358976525004362
asn: min = 12, max = 507727


In [18]:
# Type casting for memory efficiency: scaled features fit comfortably in float32
float_cols = [
    'country_code', 'device_type', 'is_login_success', 'is_attack_ip',
    'login_hours', 'login_day', 'browser_name', 'os_name'
]
df = df.astype({col: np.float32 for col in float_cols})

# Cast other relevant columns
df['asn'] = df['asn'].astype(np.uint32)
df['is_account_takeover'] = df['is_account_takeover'].astype(np.uint8)
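
The choice of target dtypes follows from the min/max check above: a narrower integer type is only safe if the whole observed range fits into it. The following sketch, using a hypothetical helper named `fits`, shows that reasoning with NumPy's `iinfo`, along with the memory effect of moving from `float64` to `float32`.

```python
import numpy as np

def fits(min_val, max_val, dtype):
    """Return True if [min_val, max_val] is representable by the integer dtype."""
    info = np.iinfo(dtype)
    return info.min <= min_val and max_val <= info.max

# Ranges taken from the min/max check above
asn_ok_u32 = fits(12, 507727, np.uint32)
asn_ok_u16 = fits(12, 507727, np.uint16)
user_id_ok_i32 = fits(-9223371191532286299, 9223358976525004362, np.int32)
print(asn_ok_u32, asn_ok_u16, user_id_ok_i32)

# Memory effect of float64 -> float32 on one million values
n = 1_000_000
bytes_f64 = np.zeros(n, dtype=np.float64).nbytes
bytes_f32 = np.zeros(n, dtype=np.float32).nbytes
print(bytes_f64, bytes_f32)
```

This is why `asn` can be stored as `uint32` but not `uint16`, and why `user_id` stays `int64`. Casting the scaled features to `float32` halves their memory footprint.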

In [19]:
# Verify
df.dtypes

user_id                  int64
country_code           float32
asn                     uint32
device_type            float32
is_login_success       float32
is_attack_ip           float32
is_account_takeover      uint8
login_hours            float32
login_day              float32
browser_name           float32
os_name                float32
dtype: object

**Save the DataFrame for Later Use**:

In [20]:
df.to_parquet('../data/scaled/scaled_data.parquet', index=False)