# Module 7 – Anomaly Detection Mini Project
## Apply rule-based or statistical detection • Flag suspicious events • Interpret findings

**Goal:** turn raw logs into an **alerts table** you could hand to a SOC analyst.


In [1]:
import pandas as pd
import numpy as np
from datetime import timedelta


## 1) Load + Prep

In [2]:
df = pd.read_csv('anomaly_logs_mini.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.head()

Unnamed: 0,timestamp,user,ip_address,user_agent,event_type,resource,status,bytes_out
0,2025-02-14 23:16:50,bob,192.168.1.10,Chrome,LOGIN_FAILED,/login,FAIL,0
1,2025-02-14 03:43:54,bob,45.33.32.156,Chrome,LOGIN_SUCCESS,/login,SUCCESS,0
2,2025-02-14 18:23:57,alice,185.220.101.1,Firefox,LOGIN_SUCCESS,/login,SUCCESS,0
3,2025-02-14 16:21:18,eva,192.168.1.10,Edge,FILE_ACCESS,/payroll,SUCCESS,91895
4,2025-02-14 12:15:18,bob,192.168.1.20,python-requests,LOGIN_SUCCESS,/login,SUCCESS,0


In [3]:
# Quick checks
df.info()
print('Rows:', len(df))
print('Users:', df['user'].nunique(), 'IPs:', df['ip_address'].nunique())
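# Note: a cell renders only its last expression, so this result is not displayed;
# wrap it in print() to see the event-type counts.
df['event_type'].value_counts().head(10)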
df.describe()

<class 'pandas.DataFrame'>
RangeIndex: 944 entries, 0 to 943
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   timestamp   944 non-null    datetime64[us]
 1   user        944 non-null    str           
 2   ip_address  944 non-null    str           
 3   user_agent  944 non-null    str           
 4   event_type  944 non-null    str           
 5   resource    944 non-null    str           
 6   status      944 non-null    str           
 7   bytes_out   944 non-null    int64         
dtypes: datetime64[us](1), int64(1), str(6)
memory usage: 59.1 KB
Rows: 944
Users: 8 IPs: 9


Unnamed: 0,timestamp,bytes_out
count,944,944.0
mean,2025-02-14 11:43:38.474576,397675.3
min,2025-02-14 00:00:37,0.0
25%,2025-02-14 05:04:34.250000,0.0
50%,2025-02-14 11:24:41.500000,0.0
75%,2025-02-14 18:10:37.250000,0.0
max,2025-02-14 23:59:34,15990390.0
std,,1854579.0


## 2) Feature Engineering (starter)

In [4]:
# Add hour + day
df['hour'] = df['timestamp'].dt.hour
df['date'] = df['timestamp'].dt.date

# Sort for window operations
df = df.sort_values('timestamp').reset_index(drop=True)
df.head()

Unnamed: 0,timestamp,user,ip_address,user_agent,event_type,resource,status,bytes_out,hour,date
0,2025-02-14 00:00:37,charlie,192.168.1.20,Firefox,DATA_EXPORT,/api/v1/token,SUCCESS,10439392,0,2025-02-14
1,2025-02-14 00:01:28,grace,172.16.0.8,PowerShell,DATA_EXPORT,/payroll,SUCCESS,13789588,0,2025-02-14
2,2025-02-14 00:07:43,eva,192.168.1.10,Firefox,LOGIN_SUCCESS,/login,SUCCESS,0,0,2025-02-14
3,2025-02-14 00:09:00,david,172.16.0.8,Chrome,LOGIN_FAILED,/login,FAIL,0,0,2025-02-14
4,2025-02-14 00:12:33,henry,172.16.0.8,Firefox,LOGIN_SUCCESS,/login,SUCCESS,0,0,2025-02-14


## 3) Rule-Based Detection – Build an Alerts Table

In [5]:
alerts = []

def add_alert(alert_type, severity, user=None, ip=None, start=None, end=None, evidence=None):
    alerts.append({
        'alert_type': alert_type,
        'severity': severity,
        'user': user,
        'ip_address': ip,
        'start_time': start,
        'end_time': end,
        'evidence': evidence
    })

# Example Rule A: failed logins per user in 5-minute windows > 10
failed = df[df['event_type'] == 'LOGIN_FAILED'].copy()
failed = failed.set_index('timestamp')
win = '5min'
counts = failed.groupby('user').resample(win).size().reset_index(name='fail_count')
susp = counts[counts['fail_count'] > 10]
for _, r in susp.iterrows():
    st = r['timestamp']
    add_alert(
        alert_type='FAIL_BURST_USER',
        severity='HIGH',
        user=r['user'],
        start=st,
        end=st + pd.Timedelta(win),
        evidence=f"{r['fail_count']} failed logins in {win}"
    )

alerts_df = pd.DataFrame(alerts)
alerts_df.head()

Unnamed: 0,alert_type,severity,user,ip_address,start_time,end_time,evidence
0,FAIL_BURST_USER,HIGH,alice,,2025-02-14 03:05:00,2025-02-14 03:10:00,35 failed logins in 5min
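
A caveat on Rule A: fixed 5-minute bins from `resample` can split one burst across two bins, so neither bin crosses the threshold. The sketch below compares fixed bins with a sliding window on a small synthetic frame. The `demo` data and the helper column `one` are assumptions made for illustration, not values from `anomaly_logs_mini.csv`.

```python
import pandas as pd

# Synthetic failed-login events (hypothetical): 4 failures straddling the 03:05 bin edge
demo = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2025-02-14 03:03:30', '2025-02-14 03:04:40',
        '2025-02-14 03:05:10', '2025-02-14 03:06:20',
        '2025-02-14 09:00:00',
    ]),
    'user': ['alice', 'alice', 'alice', 'alice', 'bob'],
})

# Time-based rolling windows need a sorted DatetimeIndex
demo = demo.sort_values('timestamp').set_index('timestamp')
demo['one'] = 1  # helper column so that sums act as event counts

# Fixed bins: the burst is split into 2 + 2 across 03:00-03:05 and 03:05-03:10
fixed_max = demo.groupby('user')['one'].resample('5min').sum().max()

# Sliding window: each event counts the failures in the 5 minutes ending at that event
rolling_max = demo.groupby('user')['one'].rolling('5min').sum().max()

print('max per fixed bin:', fixed_max)
print('max per sliding window:', rolling_max)
```

With a threshold of "more than 3", the fixed bins would stay silent here while the sliding window would fire. The trade-off is that a sliding window can raise several overlapping alerts for the same burst, which then need to be merged.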


### Add at least 2 more rules
Ideas:
- Success after many failures (same user/ip)
- Privilege escalation events
- Large bytes_out outliers
- Activity at 2–4 AM
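
As one possible starting point for the "Activity at 2–4 AM" idea, here is a minimal sketch on a synthetic frame. The function name `flag_off_hours`, the alert name `OFF_HOURS_ACTIVITY`, the `MEDIUM` severity, and the hour bounds are all assumptions chosen for illustration.

```python
import pandas as pd

def flag_off_hours(events, start_hour=2, end_hour=4):
    """Return one alert dict per user with events whose hour is in [start_hour, end_hour)."""
    hours = events['timestamp'].dt.hour
    night = events[(hours >= start_hour) & (hours < end_hour)]
    found = []
    for user, grp in night.groupby('user'):
        found.append({
            'alert_type': 'OFF_HOURS_ACTIVITY',  # hypothetical alert name
            'severity': 'MEDIUM',                # assumed severity level
            'user': user,
            'ip_address': None,
            'start_time': grp['timestamp'].min(),
            'end_time': grp['timestamp'].max(),
            'evidence': f"{len(grp)} events between {start_hour:02d}:00 and {end_hour:02d}:00",
        })
    return found

# Synthetic events (hypothetical): only alice is active inside the 02:00-04:00 window
demo = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2025-02-14 02:10:00', '2025-02-14 03:45:00',
        '2025-02-14 04:00:00', '2025-02-14 13:30:00',
    ]),
    'user': ['alice', 'alice', 'bob', 'bob'],
})

off_hours_alerts = flag_off_hours(demo)
print(off_hours_alerts)
```

The returned dicts use the same keys as the rows that `add_alert` builds, so they could be appended to `alerts` directly. Off-hours activity alone is a weak signal, which is why the sketch assigns it a lower severity than the burst rules.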


In [6]:
# Rule B: failed logins per IP in 5-minute windows > 10
failed_ip = df[df['event_type'] == 'LOGIN_FAILED'].copy()
failed_ip = failed_ip.set_index('timestamp')

counts_ip = failed_ip.groupby('ip_address').resample(win).size().reset_index(name='fail_count')
susp_ip = counts_ip[counts_ip['fail_count'] > 10]

for _, r in susp_ip.iterrows():
    st = r['timestamp']
    add_alert(
        alert_type='FAIL_BURST_IP',
        severity='HIGH',
        ip=r['ip_address'],
        start=st,
        end=st + pd.Timedelta(win),
        evidence=f"{r['fail_count']} failed logins from IP in {win}"
    )

# Rule C: Successful login after multiple failures (within 10 minutes)
success = df[df['event_type'] == 'LOGIN_SUCCESS']

for _, s in success.iterrows():
    user = s['user']
    success_time = s['timestamp']

    recent_fails = df[
        (df['event_type'] == 'LOGIN_FAILED') &
        (df['user'] == user) &
        (df['timestamp'] >= success_time - pd.Timedelta('10min')) &
        (df['timestamp'] < success_time)
    ]

    if len(recent_fails) >= 5:
        add_alert(
            alert_type='BRUTE_FORCE_SUCCESS',
            severity='CRITICAL',
            user=user,
            ip=s['ip_address'],
            start=recent_fails['timestamp'].min(),
            end=success_time,
            evidence=f"{len(recent_fails)} failed logins before success"
        )

# Rule D: Large bytes_out using Z-score
df['bytes_out'] = pd.to_numeric(df['bytes_out'], errors='coerce')
mean_bytes = df['bytes_out'].mean()
std_bytes = df['bytes_out'].std()

if std_bytes > 0:
    df['z_score'] = (df['bytes_out'] - mean_bytes) / std_bytes
    exfil = df[df['z_score'] > 3]

    for _, r in exfil.iterrows():
        add_alert(
            alert_type='LARGE_BYTES_OUT',
            severity='HIGH',
            user=r['user'],
            ip=r['ip_address'],
            start=r['timestamp'],
            end=r['timestamp'],
            evidence=f"bytes_out={r['bytes_out']} (z={round(r['z_score'],2)}) from {r['resource']}"
        )

alerts_df = pd.DataFrame(alerts)
alerts_df.sort_values('start_time').head(10)

Unnamed: 0,alert_type,severity,user,ip_address,start_time,end_time,evidence
4,LARGE_BYTES_OUT,HIGH,charlie,192.168.1.20,2025-02-14 00:00:37,2025-02-14 00:00:37,bytes_out=10439392 (z=5.41) from /api/v1/token
5,LARGE_BYTES_OUT,HIGH,grace,172.16.0.8,2025-02-14 00:01:28,2025-02-14 00:01:28,bytes_out=13789588 (z=7.22) from /payroll
6,LARGE_BYTES_OUT,HIGH,charlie,192.168.1.10,2025-02-14 00:42:22,2025-02-14 00:42:22,bytes_out=9423840 (z=4.87) from /home
7,LARGE_BYTES_OUT,HIGH,charlie,192.168.1.20,2025-02-14 01:15:51,2025-02-14 01:15:51,bytes_out=15990388 (z=8.41) from /files/reports
8,LARGE_BYTES_OUT,HIGH,david,104.26.3.2,2025-02-14 01:25:30,2025-02-14 01:25:30,bytes_out=9572044 (z=4.95) from /files/db_backup
9,LARGE_BYTES_OUT,HIGH,charlie,185.220.101.1,2025-02-14 02:03:18,2025-02-14 02:03:18,bytes_out=7695354 (z=3.93) from /hr
0,FAIL_BURST_USER,HIGH,alice,,2025-02-14 03:05:00,2025-02-14 03:10:00,35 failed logins in 5min
2,BRUTE_FORCE_SUCCESS,CRITICAL,alice,185.220.101.1,2025-02-14 03:05:00,2025-02-14 03:10:00,35 failed logins before success
1,FAIL_BURST_IP,HIGH,,185.220.101.1,2025-02-14 03:05:00,2025-02-14 03:10:00,36 failed logins from IP in 5min
3,BRUTE_FORCE_SUCCESS,CRITICAL,alice,192.168.1.20,2025-02-14 03:05:00,2025-02-14 03:11:35,35 failed logins before success


## 4) Statistical Detection (choose one)

In [7]:
# Option: z-score outliers for bytes_out (simple)
x = df['bytes_out'].fillna(0)
mu, sigma = x.mean(), x.std(ddof=0)
df['z_bytes'] = (x - mu) / (sigma if sigma else 1)

outliers = df[df['z_bytes'] > 3].copy()
outliers[['timestamp','user','ip_address','event_type','resource','bytes_out','z_bytes']].head(10)

Unnamed: 0,timestamp,user,ip_address,event_type,resource,bytes_out,z_bytes
0,2025-02-14 00:00:37,charlie,192.168.1.20,DATA_EXPORT,/api/v1/token,10439392,5.417423
1,2025-02-14 00:01:28,grace,172.16.0.8,DATA_EXPORT,/payroll,13789588,7.224826
21,2025-02-14 00:42:22,charlie,192.168.1.10,DATA_EXPORT,/home,9423840,4.869541
33,2025-02-14 01:15:51,charlie,192.168.1.20,DATA_EXPORT,/files/reports,15990388,8.412139
42,2025-02-14 01:25:30,david,104.26.3.2,DATA_EXPORT,/files/db_backup,9572044,4.949496
65,2025-02-14 02:03:18,charlie,185.220.101.1,DATA_EXPORT,/hr,7695354,3.937037
147,2025-02-14 03:11:47,grace,185.220.101.1,DATA_EXPORT,/hr,6601204,3.346752
159,2025-02-14 03:19:00,alice,185.220.101.1,DATA_EXPORT,/files/db_backup,12057666,6.290468
190,2025-02-14 04:01:14,henry,10.0.0.5,DATA_EXPORT,/files/db_backup,6924794,3.521326
230,2025-02-14 04:55:50,grace,198.51.100.77,DATA_EXPORT,/api/v1/token,6546644,3.317318


In [8]:
# Statistical outliers already captured as LARGE_BYTES_OUT alerts via z_score in Rule D above.
# Displaying the outlier summary here for reference.
outliers[['timestamp','user','ip_address','bytes_out','z_bytes']].sort_values('z_bytes', ascending=False)

Unnamed: 0,timestamp,user,ip_address,bytes_out,z_bytes
33,2025-02-14 01:15:51,charlie,192.168.1.20,15990388,8.412139
1,2025-02-14 00:01:28,grace,172.16.0.8,13789588,7.224826
234,2025-02-14 05:00:18,alice,185.220.101.1,13652421,7.150825
925,2025-02-14 23:27:34,david,104.26.3.2,13225354,6.920426
363,2025-02-14 08:17:58,charlie,203.0.113.5,12373881,6.461064
159,2025-02-14 03:19:00,alice,185.220.101.1,12057666,6.290468
638,2025-02-14 16:00:37,grace,198.51.100.77,11967759,6.241964
838,2025-02-14 21:25:47,charlie,198.51.100.77,11860714,6.184214
269,2025-02-14 06:06:32,bob,10.0.0.5,11543135,6.012883
581,2025-02-14 14:24:39,david,192.168.1.10,10894625,5.663017
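
The `describe()` output earlier shows that at least 75% of events have `bytes_out` equal to 0, so the mean and standard deviation are driven almost entirely by the exports themselves. A possible alternative is a modified z-score based on the median and the median absolute deviation (MAD), computed only over events that moved data. The sketch below uses synthetic values. The 0.6745 scaling factor and the 3.5 cutoff are the commonly cited defaults for this score, not values taken from this notebook.

```python
import pandas as pd

# Synthetic bytes_out column (hypothetical values): mostly zeros, like the real logs
demo = pd.DataFrame({
    'bytes_out': [0, 0, 0, 0, 0, 0, 0, 1000, 1200, 900, 1100, 1000, 50000],
})

# Score only events that actually moved data; with mostly-zero data the
# median and MAD of the full column would both be 0
moved = demo[demo['bytes_out'] > 0].copy()
median = moved['bytes_out'].median()
mad = (moved['bytes_out'] - median).abs().median()

# Modified z-score: 0.6745 rescales the MAD so it is comparable to a standard
# deviation for normally distributed data. Guard against MAD == 0.
moved['robust_z'] = 0.6745 * (moved['bytes_out'] - median) / (mad if mad else 1)

robust_outliers = moved[moved['robust_z'] > 3.5]
print(robust_outliers)
```

Because the median and MAD are barely affected by a handful of extreme values, this score keeps flagging the largest transfers even when several of them occur on the same day.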


## 5) Export Alerts + Write Your Interpretation

In [9]:
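alerts_df = pd.DataFrame(alerts)
# Rank severities explicitly: alphabetical order only happens to work for CRITICAL/HIGH.
# MEDIUM and LOW are assumed levels for rules added later.
severity_rank = {'CRITICAL': 0, 'HIGH': 1, 'MEDIUM': 2, 'LOW': 3}
rank_key = lambda col: col.map(severity_rank) if col.name == 'severity' else col
alerts_df = alerts_df.sort_values(['severity', 'start_time'], key=rank_key)
alerts_df.to_csv('alerts.csv', index=False)
alerts_df.head(20)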

Unnamed: 0,alert_type,severity,user,ip_address,start_time,end_time,evidence
2,BRUTE_FORCE_SUCCESS,CRITICAL,alice,185.220.101.1,2025-02-14 03:05:00,2025-02-14 03:10:00,35 failed logins before success
3,BRUTE_FORCE_SUCCESS,CRITICAL,alice,192.168.1.20,2025-02-14 03:05:00,2025-02-14 03:11:35,35 failed logins before success
4,LARGE_BYTES_OUT,HIGH,charlie,192.168.1.20,2025-02-14 00:00:37,2025-02-14 00:00:37,bytes_out=10439392 (z=5.41) from /api/v1/token
5,LARGE_BYTES_OUT,HIGH,grace,172.16.0.8,2025-02-14 00:01:28,2025-02-14 00:01:28,bytes_out=13789588 (z=7.22) from /payroll
6,LARGE_BYTES_OUT,HIGH,charlie,192.168.1.10,2025-02-14 00:42:22,2025-02-14 00:42:22,bytes_out=9423840 (z=4.87) from /home
7,LARGE_BYTES_OUT,HIGH,charlie,192.168.1.20,2025-02-14 01:15:51,2025-02-14 01:15:51,bytes_out=15990388 (z=8.41) from /files/reports
8,LARGE_BYTES_OUT,HIGH,david,104.26.3.2,2025-02-14 01:25:30,2025-02-14 01:25:30,bytes_out=9572044 (z=4.95) from /files/db_backup
9,LARGE_BYTES_OUT,HIGH,charlie,185.220.101.1,2025-02-14 02:03:18,2025-02-14 02:03:18,bytes_out=7695354 (z=3.93) from /hr
0,FAIL_BURST_USER,HIGH,alice,,2025-02-14 03:05:00,2025-02-14 03:10:00,35 failed logins in 5min
1,FAIL_BURST_IP,HIGH,,185.220.101.1,2025-02-14 03:05:00,2025-02-14 03:10:00,36 failed logins from IP in 5min


### Interpretation (required)
In 6–10 bullet points, answer:
- What is the **top suspicious pattern** you found?
- What evidence supports it?
- What would you investigate next?
- What could be a false positive?

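- The most concerning pattern is repeated large outbound transfers (bytes_out more than 3 standard deviations above the mean) from sensitive resources such as /payroll, /hr, and /files/db_backup, which may indicate data exfiltration. The largest single transfer was 15,990,388 bytes (z = 8.41) from /files/reports by Charlie.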

- Alice triggered several high-risk detections: a burst of 35 failed logins between 03:05 and 03:10 (FAIL_BURST_USER), two BRUTE_FORCE_SUCCESS alerts immediately afterward, and large bytes_out outliers, including 12,057,666 bytes from /files/db_backup at 03:19:00.

- The sequence of repeated failed logins, a successful login from 185.220.101.1, and a large export from that same IP about nine minutes later strongly suggests that Alice's account was compromised.

- This activity is high severity because it may involve unauthorized access to financial or personnel data, which could represent a significant security and compliance risk.

- Grace also showed suspicious behavior: large outbound transfers from /payroll, /hr, and /api/v1/token, issued from three different IP addresses (172.16.0.8, 185.220.101.1, and 198.51.100.77) within the first five hours of the day.

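- Grace's rapid IP changes may reflect credential sharing, VPN or proxy usage, or attacker access. Because 185.220.101.1 is also the source of the FAIL_BURST_IP alert and of Alice's post-brute-force login, Grace's 03:11:47 export from that address should be investigated first.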

- The outbound spikes just after midnight from Charlie (00:00:37) and Grace (00:01:28) should be checked against scheduled jobs such as backups to determine whether they were legitimate automated processes or unauthorized transfers.

- Large bytes_out events can also come from legitimate business operations, so on their own they may be false positives. Such alerts should be prioritized only when they correlate with other suspicious activity such as failed logins, IP changes, or access to sensitive directories.
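
The correlation idea in the last bullet can be sketched as a join between alert types. The example below pairs each BRUTE_FORCE_SUCCESS alert with LARGE_BYTES_OUT alerts for the same user that start shortly after the login. The `demo_alerts` rows are hypothetical (modeled on the tables above), and the 30-minute window is an assumed value that would need tuning.

```python
import pandas as pd

# Hypothetical alert rows shaped like alerts_df
demo_alerts = pd.DataFrame({
    'alert_type': ['BRUTE_FORCE_SUCCESS', 'LARGE_BYTES_OUT', 'LARGE_BYTES_OUT'],
    'user': ['alice', 'alice', 'charlie'],
    'start_time': pd.to_datetime([
        '2025-02-14 03:05:00', '2025-02-14 03:19:00', '2025-02-14 00:00:37',
    ]),
    'end_time': pd.to_datetime([
        '2025-02-14 03:10:00', '2025-02-14 03:19:00', '2025-02-14 00:00:37',
    ]),
})

logins = demo_alerts[demo_alerts['alert_type'] == 'BRUTE_FORCE_SUCCESS']
transfers = demo_alerts[demo_alerts['alert_type'] == 'LARGE_BYTES_OUT']

# Pair every suspicious login with every large transfer by the same user
pairs = logins.merge(transfers, on='user', suffixes=('_login', '_xfer'))

# Keep transfers that start within 30 minutes after the login succeeded (assumed window)
gap = pairs['start_time_xfer'] - pairs['end_time_login']
correlated = pairs[(gap >= pd.Timedelta(0)) & (gap <= pd.Timedelta('30min'))]

print(correlated[['user', 'end_time_login', 'start_time_xfer']])
```

Only the alice pair survives, because charlie's transfer has no preceding brute-force alert. A correlated pair like this is a stronger signal than either alert alone and could justify a higher severity.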

