# AuthPulse: LANL Dataset Quick EDA

This notebook does lightweight EDA to support the data-understanding deliverables:

- Volume per day
- Unique users and hosts
- Quick sanity checks on schema

It defaults to the small sample file in `data/sample/auth_sample.csv` and includes an optional section to read from the compressed LANL `.bz2` file if present.

In [13]:
from __future__ import annotations

from pathlib import Path

import pandas as pd

REPO_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
SAMPLE_CSV = REPO_ROOT / "data" / "sample" / "auth_sample.csv"
LANL_BZ2 = REPO_ROOT / "data" / "raw" / "lanl-auth-dataset-1-00.bz2"

print("repo:", REPO_ROOT)
print("sample exists:", SAMPLE_CSV.exists(), SAMPLE_CSV)
print("lanl bz2 exists:", LANL_BZ2.exists(), LANL_BZ2)

repo: c:\Users\bhoop\Desktop\authpulse-aws-streaming-security-analytics
sample exists: True c:\Users\bhoop\Desktop\authpulse-aws-streaming-security-analytics\data\sample\auth_sample.csv
lanl bz2 exists: True c:\Users\bhoop\Desktop\authpulse-aws-streaming-security-analytics\data\raw\lanl-auth-dataset-1-00.bz2


## Load sample subset

The sample uses the same 3-column schema as the LANL auth file: `time,user,computer`.

In [None]:
df = pd.read_csv(SAMPLE_CSV)
df.columns  # confirm the three-column schema

Index(['time', 'user', 'computer'], dtype='str')

In [None]:
df.shape

(8, 3)
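Beyond column names and shape, a few more schema checks can be scripted. The following is a minimal sketch, not part of the original notebook: `check_auth_schema` and `EXPECTED_COLUMNS` are hypothetical names, the demo rows are made up, and the specific checks (no nulls, integer non-negative `time`) are assumptions about what a valid row looks like.

```python
import pandas as pd

EXPECTED_COLUMNS = ["time", "user", "computer"]  # schema stated above


def check_auth_schema(frame: pd.DataFrame) -> list:
    """Return human-readable schema problems; an empty list means none found."""
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        return [f"missing columns: {missing}"]  # later checks need these columns
    problems = []
    if frame[EXPECTED_COLUMNS].isna().any().any():
        problems.append("null values present")
    if not pd.api.types.is_integer_dtype(frame["time"]):
        problems.append("time is not an integer column")
    elif (frame["time"] < 0).any():
        problems.append("negative time values")
    return problems


# Made-up rows for illustration; not taken from the dataset
demo = pd.DataFrame(
    {"time": [1, 1, 2], "user": ["U1", "U2", "U1"], "computer": ["C1", "C2", "C1"]}
)
schema_problems = check_auth_schema(demo)
print(schema_problems or "schema OK")
```

In the notebook itself, the same function could be called as `check_auth_schema(df)` right after loading the sample.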

In [None]:
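# Convert time (seconds) to a timezone-aware UTC datetime.
# LANL times are second offsets from the start of collection, so the resulting
# 1970-01-01 dates are relative day markers, not real calendar dates.
# Keeping UTC aligns with the AuthEvent data contract.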
df["event_time"] = pd.to_datetime(df["time"], unit="s", utc=True)

df["event_date"] = df["event_time"].dt.date

df.dtypes

time                       int64
user                         str
computer                     str
event_time    datetime64[s, UTC]
event_date                object
dtype: object

## Volume per day

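Compute daily event volume, which serves as a baseline for sizing Kinesis throughput and SLAs. All rows in the sample fall on the same relative day, so the result here is a single bucket.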

In [None]:
# Volume per day
volume_per_day = df.groupby("event_date").size()
print(volume_per_day)

event_date
1970-01-01    8
dtype: int64
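Daily totals hide bursts, so for throughput sizing a per-second peak may be more useful. The sketch below is illustrative only: it assumes events would be replayed at their original rate, uses the per-shard write limit of 1,000 records per second for Kinesis Data Streams, and ignores the per-shard byte limit. `peak_events_per_second`, `shards_for_peak`, and the `headroom` multiplier are hypothetical names, and the demo rows are made up.

```python
import math

import pandas as pd

# Assumption: per-shard write limit of 1,000 records/second; the byte-rate
# limit per shard is not considered in this sketch.
RECORDS_PER_SHARD_PER_SEC = 1_000


def peak_events_per_second(frame: pd.DataFrame) -> int:
    """Largest number of events sharing one `time` value (1 s resolution)."""
    return int(frame.groupby("time").size().max())


def shards_for_peak(peak: int, headroom: float = 2.0) -> int:
    """Shards needed to absorb `peak` records/s with a safety multiplier."""
    return max(1, math.ceil(peak * headroom / RECORDS_PER_SHARD_PER_SEC))


# Made-up events for illustration: three at t=1, one at t=2
demo = pd.DataFrame(
    {"time": [1, 1, 1, 2], "user": list("abcd"), "computer": list("wxyz")}
)
peak = peak_events_per_second(demo)
print("peak events/s:", peak, "shards:", shards_for_peak(peak))
```

The `headroom` default of 2.0 is an arbitrary safety margin, not a recommendation; it would need tuning against the real replay rate.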


## Unique users and hosts

Compute total unique users and hosts in the sample.

In [19]:
# Unique users and hosts
n_users = df["user"].nunique()
n_hosts = df["computer"].nunique()
n_users, n_hosts

(4, 4)
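The cell above reports totals only. Daily unique counts can be obtained with a grouped `nunique`; the following is a minimal sketch on made-up rows spanning two relative days, assuming the same `event_date` derivation used earlier in the notebook.

```python
import pandas as pd

# Made-up events spanning two days (86_400 s = 1 day); not taken from the dataset
demo = pd.DataFrame(
    {
        "time": [1, 2, 3, 86_401, 86_402],
        "user": ["U1", "U1", "U2", "U1", "U3"],
        "computer": ["C1", "C2", "C1", "C1", "C1"],
    }
)
demo["event_time"] = pd.to_datetime(demo["time"], unit="s", utc=True)
demo["event_date"] = demo["event_time"].dt.date

# One row per day, with distinct users and distinct hosts seen that day
daily_unique = demo.groupby("event_date").agg(
    unique_users=("user", "nunique"),
    unique_hosts=("computer", "nunique"),
)
print(daily_unique)
```

Applied to the notebook's `df`, the same `groupby(...).agg(...)` call would return a single row, since all sample events share one date.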

## Optional: read from the compressed LANL dataset (`.bz2`)

If you have the full `.bz2` file locally, this reads a limited number of rows for quick sanity-check metrics without unpacking the entire dataset.

Note: the full dataset is large; keep `NROWS` relatively small for interactive work.

In [20]:
if LANL_BZ2.exists():
    NROWS = 200_000  # adjust as needed
    lanl = pd.read_csv(
        LANL_BZ2,
        compression="bz2",
        header=None,
        names=["time", "user", "computer"],
        nrows=NROWS,
    )
    lanl["event_time"] = pd.to_datetime(lanl["time"], unit="s", utc=True)
    lanl["event_date"] = lanl["event_time"].dt.date

    metrics = {
        "rows_loaded": len(lanl),
        "min_event_time": lanl["event_time"].min(),
        "max_event_time": lanl["event_time"].max(),
        "unique_users": int(lanl["user"].nunique()),
        "unique_hosts": int(lanl["computer"].nunique()),
    }
    display(metrics)

    volume_per_day = lanl.groupby("event_date").size()
    display(volume_per_day.head(10))
else:
    print("LANL_BZ2 not found; skipping")

{'rows_loaded': 200000,
 'min_event_time': Timestamp('1970-01-01 00:00:01+0000', tz='UTC'),
 'max_event_time': Timestamp('1970-01-01 02:58:13+0000', tz='UTC'),
 'unique_users': 4093,
 'unique_hosts': 4716}

event_date
1970-01-01    200000
dtype: int64
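With `NROWS = 200_000` the load covers only about three hours of events, so every row lands on one day. To cover more of the file without holding it all in memory, one option is to stream it with the `chunksize` parameter of `pd.read_csv`. This is a sketch under assumptions: `daily_volume_chunked` is a hypothetical helper, days are expressed as integer offsets (`time // 86_400`) rather than dates, and the demo writes a tiny made-up `.bz2` file instead of touching the real dataset.

```python
import bz2
import tempfile
from pathlib import Path

import pandas as pd


def daily_volume_chunked(path: Path, chunksize: int = 1_000_000) -> pd.Series:
    """Count events per relative day by streaming the file in chunks."""
    parts = []
    with pd.read_csv(
        path,
        compression="bz2",
        header=None,
        names=["time", "user", "computer"],
        chunksize=chunksize,
    ) as reader:
        for chunk in reader:
            day = chunk["time"] // 86_400  # day offset from the start of the data
            parts.append(day.value_counts())
    # Sum the partial counts; a day can be split across chunks
    return pd.concat(parts).groupby(level=0).sum()


# Made-up three-row file for illustration: two events on day 0, one on day 1
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.bz2"
    demo_path.write_bytes(bz2.compress(b"1,U1,C1\n2,U2,C1\n86401,U1,C2\n"))
    demo_volume = daily_volume_chunked(demo_path, chunksize=2)

print(demo_volume.to_dict())
```

Only the per-chunk counts are kept in memory, so peak usage is governed by `chunksize` rather than by the size of the file. Decompressing a large `.bz2` stream is still slow, so this suits a batch run better than interactive work.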