# 01_EDA – Exploratory Data Analysis on Login Event Data

This notebook performs exploratory data analysis (EDA) on the RBA login dataset to understand user login behaviors, identify key patterns, and prepare the data for anomaly detection using machine learning.

### Objectives:
- Inspect dataset structure and types
- Analyze login distributions by country, time, and user behavior
- Visualize geo-based and temporal login trends
- Identify potential indicators of malicious activity (e.g., rare locations, odd login hours)
- Guide feature engineering for downstream ML models

**Imports**:

In [2]:
# Standard imports
import dask.dataframe as dd
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

**Load the Dataset**:

In [4]:
df = dd.read_csv('../data/rba-dataset.csv')
df.head()

| | index | Login Timestamp | User ID | Round-Trip Time [ms] | IP Address | Country | Region | City | ASN | User Agent String | Browser Name and Version | OS Name and Version | Device Type | Login Successful | Is Attack IP | Is Account Takeover |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2020-02-03 12:43:30.772 | -4324475583306591935 | | 10.0.65.171 | NO | - | - | 29695 | Mozilla/5.0 (iPhone; CPU iPhone OS 13_4 like ... | Firefox 20.0.0.1618 | iOS 13.4 | mobile | False | False | False |
| 1 | 1 | 2020-02-03 12:43:43.549 | -4324475583306591935 | | 194.87.207.6 | AU | - | - | 60117 | Mozilla/5.0 (Linux; Android 4.1; Galaxy Nexus... | Chrome Mobile 46.0.2490 | Android 4.1 | mobile | False | False | False |
| 2 | 2 | 2020-02-03 12:43:55.873 | -3284137479262433373 | | 81.167.144.58 | NO | Vestland | Urangsvag | 29695 | Mozilla/5.0 (iPad; CPU OS 7_1 like Mac OS X) ... | Android 2.3.3.2672 | iOS 7.1 | mobile | True | False | False |
| 3 | 3 | 2020-02-03 12:43:56.180 | -4324475583306591935 | | 170.39.78.152 | US | - | - | 393398 | Mozilla/5.0 (Linux; Android 4.1; Galaxy Nexus... | Chrome Mobile WebView 85.0.4183 | Android 4.1 | mobile | False | False | False |
| 4 | 4 | 2020-02-03 12:43:59.396 | -4618854071942621186 | | 10.0.0.47 | US | Virginia | Ashburn | 398986 | Mozilla/5.0 (Linux; U; Android 2.2) Build/NMA... | Chrome Mobile WebView 85.0.4183 | Android 2.2 | mobile | False | True | False |


**Basic EDA**:

In [16]:
df.loc[0]["User Agent String"].compute()

0    Mozilla/5.0  (iPhone; CPU iPhone OS 13_4 like ...
0    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6...
0    Mozilla/5.0  (iPhone; CPU iPhone OS 11_2_6 lik...
0    Mozilla/5.0  (Linux; Android 4.1; Galaxy Nexus...
0    Mozilla/5.0  (iPad; CPU OS 7_1 like Mac OS X) ...
                           ...                        
0    Mozilla/5.0  (iPhone; CPU iPhone OS 13_4 like ...
0    Mozilla/5.0 (iPod; CPU iPhone OS 6_1_6 like Ma...
0    Mozilla/5.0  (iPhone; CPU iPhone OS 13_4 like ...
0         AwarioSmartBot/1.0  (en-us) variation/294821
0    Mozilla/5.0  (iPhone; CPU iPhone OS 11_2_6 lik...
Name: User Agent String, Length: 141, dtype: string
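
The lookup above returns 141 rows rather than one. The likely reason is that `dd.read_csv` gives each partition its own index starting at 0, so the label `0` matches the first row of every partition (141 partitions would be consistent with `Length: 141`). The sketch below reproduces the effect in plain pandas, with two made-up chunks standing in for partitions; the values and variable names are illustrative only.

```python
import pandas as pd

# Two toy chunks standing in for Dask partitions (hypothetical values).
# Each chunk gets its own default RangeIndex starting at 0.
chunk_a = pd.DataFrame({"User Agent String": ["agent-a0", "agent-a1"]})
chunk_b = pd.DataFrame({"User Agent String": ["agent-b0", "agent-b1"]})

# Concatenating without resetting the index keeps the duplicate labels.
combined = pd.concat([chunk_a, chunk_b])

# Label-based lookup returns every row labelled 0: one per chunk.
rows_labelled_zero = combined.loc[0, "User Agent String"]
print(rows_labelled_zero.tolist())

# Position-based lookup returns exactly the first row.
first_row = combined.iloc[0]["User Agent String"]
print(first_row)
```

On the Dask frame itself, `df.head(1)` is a simple way to look at a single row.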

In [5]:
df.info()

<class 'dask.dataframe.dask_expr.DataFrame'>
Columns: 16 entries, index to Is Account Takeover
dtypes: bool(3), float64(1), int64(3), string(9)

In [19]:
df.describe().compute()

| | index | User ID | Round-Trip Time [ms] | ASN |
|---|---|---|---|---|
| count | 31269260.0 | 31269260.0 | 1275935.0 | 31269260.0 |
| mean | 15634630.0 | -268956300000.0 | 663.9332 | 162121.5 |
| std | 9026659.0 | 4.514276e+18 | 1116.125 | 171918.5 |
| min | 0.0 | -9.223371e+18 | 8.0 | 12.0 |
| 25% | 7814304.0 | -4.324476e+18 | 474.0 | 29695.0 |
| 50% | 15582110.0 | -4.324476e+18 | 544.0 | 207174.0 |
| 75% | 23443990.0 | 2.293924e+18 | 697.0 | 393398.0 |
| max | 31269260.0 | 9.223359e+18 | 223457.0 | 507727.0 |


In [7]:
df.columns

Index(['index', 'Login Timestamp', 'User ID', 'Round-Trip Time [ms]',
       'IP Address', 'Country', 'Region', 'City', 'ASN', 'User Agent String',
       'Browser Name and Version', 'OS Name and Version', 'Device Type',
       'Login Successful', 'Is Attack IP', 'Is Account Takeover'],
      dtype='object')

In [6]:
df.dtypes

index                                 int64
Login Timestamp             string[pyarrow]
User ID                               int64
Round-Trip Time [ms]                float64
IP Address                  string[pyarrow]
Country                     string[pyarrow]
Region                      string[pyarrow]
City                        string[pyarrow]
ASN                                   int64
User Agent String           string[pyarrow]
Browser Name and Version    string[pyarrow]
OS Name and Version         string[pyarrow]
Device Type                 string[pyarrow]
Login Successful                       bool
Is Attack IP                           bool
Is Account Takeover                    bool
dtype: object

In [12]:
# Check NaN
df.isna().sum().compute()

index                              0
Login Timestamp                    0
User ID                            0
Round-Trip Time [ms]        29993329
IP Address                         0
Country                            0
Region                         47409
City                            8590
ASN                                0
User Agent String                  0
Browser Name and Version           0
OS Name and Version                0
Device Type                     1526
Login Successful                   0
Is Attack IP                       0
Is Account Takeover                0
dtype: int64

In [13]:
# Length of the full dataset
len(df)

31269264
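
Combining the missing-value counts with the row count gives the share of missing values per column, which informs the cleanup decisions that follow. A minimal sketch using only the figures printed above (the variable names are illustrative):

```python
# Figures copied from the outputs above.
total_rows = 31_269_264
missing_counts = {
    "Round-Trip Time [ms]": 29_993_329,
    "Region": 47_409,
    "City": 8_590,
    "Device Type": 1_526,
}

# Share of missing values per column, as a fraction of all rows.
missing_share = {col: n / total_rows for col, n in missing_counts.items()}

for col, share in missing_share.items():
    print(f"{col}: {share:.2%}")
```

On the Dask frame, `df.isna().mean().compute()` should give the same fractions directly.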

In [17]:
# Summarize how many login rows each User ID has
df["User ID"].value_counts().compute().describe()

count    4.304857e+06
mean     7.263717e+00
std      6.760161e+03
min      1.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      4.000000e+00
max      1.402590e+07
Name: count, dtype: float64

In the next steps, I will clean and transform the dataset based on early observations:

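* `User ID` will be kept: users have about 7 login rows on average, although the distribution is heavily skewed (the median is 2, while a single ID accounts for roughly 14 million rows).
* `Round-Trip Time [ms]` is missing in about 96% of rows (29,993,329 of 31,269,264) and will be dropped.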
* `Region` and `City` are partially missing and less useful compared to `Country`, which I’ll retain as the primary geolocation feature.
* `User Agent String` will be dropped since I already have separate columns for browser, OS, and device type.
* I may extract major versions from browser or OS strings later if needed, but for now I’ll keep them as-is.
* `Is Attack IP` will be kept to help label or validate suspicious activity.
* Finally, I’ll drop the `index` column as it doesn’t serve any purpose.
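
The column decisions above could be applied roughly as follows. This is a sketch on a one-row pandas sample built from the first row shown earlier, not the notebook's actual cleaning code. It assumes `Region` and `City` are dropped in favor of `Country`, which the notes imply but do not state outright; `sample`, `columns_to_drop`, and `cleaned` are names chosen here for illustration.

```python
import pandas as pd

# One-row sample with the dataset's columns (values taken from the first row above).
sample = pd.DataFrame({
    "index": [0],
    "Login Timestamp": ["2020-02-03 12:43:30.772"],
    "User ID": [-4324475583306591935],
    "Round-Trip Time [ms]": [float("nan")],
    "IP Address": ["10.0.65.171"],
    "Country": ["NO"],
    "Region": ["-"],
    "City": ["-"],
    "ASN": [29695],
    "User Agent String": ["Mozilla/5.0 (iPhone; CPU iPhone OS 13_4 like ..."],
    "Browser Name and Version": ["Firefox 20.0.0.1618"],
    "OS Name and Version": ["iOS 13.4"],
    "Device Type": ["mobile"],
    "Login Successful": [False],
    "Is Attack IP": [False],
    "Is Account Takeover": [False],
})

# Columns marked for removal in the notes above.
# Assumption: Region and City are dropped, with Country kept as the location feature.
columns_to_drop = ["index", "Round-Trip Time [ms]", "Region", "City", "User Agent String"]

cleaned = sample.drop(columns=columns_to_drop)
print(cleaned.columns.tolist())
```

`DataFrame.drop(columns=...)` is also available on Dask dataframes, so the same call should carry over to the full dataset.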

For time-based analysis, I will extract only the hour from the `Login Timestamp`. Using year, month, or day likely won’t add meaningful patterns, while minute or second would be overly granular and introduce noise. Hour-level granularity should be sufficient to identify suspicious login times.
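
A minimal sketch of the hour extraction, shown in pandas on three timestamps copied from the sample rows. It assumes every timestamp follows the `YYYY-MM-DD HH:MM:SS.fff` format seen there; `Login Hour` is a hypothetical column name, not one from the dataset.

```python
import pandas as pd

# Timestamps copied from the first rows shown earlier.
sample = pd.DataFrame({
    "Login Timestamp": [
        "2020-02-03 12:43:30.772",
        "2020-02-03 12:43:43.549",
        "2020-02-03 12:43:55.873",
    ]
})

# Parse the strings into datetimes, then keep only the hour of day (0-23).
timestamps = pd.to_datetime(sample["Login Timestamp"], format="%Y-%m-%d %H:%M:%S.%f")
sample["Login Hour"] = timestamps.dt.hour
print(sample["Login Hour"].tolist())
```

For the full dataset, Dask provides `dd.to_datetime` with a similar interface, and the `.dt.hour` accessor works on Dask series as well.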