# Lab 2 – Basic Anomaly Detection for Cybersecurity Logs

**Student Name:** Nadav Shapira  
**Student ID:** 325363505  
**MITRE ATT&CK Technique:** T1078 – Valid Accounts

This lab demonstrates an end-to-end anomaly detection pipeline using synthetic login session data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_score, recall_score

## Load Dataset

In [None]:
df = pd.read_csv('lab2_dataset_nadav_shapira.csv')
df.head()

## Exploratory Data Analysis (EDA)

The dataset represents normal login behavior with rare anomalous sessions. Most sessions occur during working hours with moderate session durations. Anomalies are expected to appear as unusually long sessions at uncommon hours. Categorical features such as user and country help distinguish legitimate from suspicious activity.
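
The expectations above can be checked numerically by summarizing each feature per label. The following is a minimal sketch on synthetic stand-in data: the column names match the lab dataset, but the values, the group sizes, and the `demo` frame itself are assumptions made for illustration, not the contents of the real CSV.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, arbitrary duration units).
rng = np.random.default_rng(0)
n_normal, n_attack = 200, 8
demo = pd.DataFrame({
    # Normal logins during working hours, attacks in the early morning
    "hour": np.concatenate([rng.integers(8, 18, n_normal),
                            rng.integers(0, 5, n_attack)]),
    # Normal sessions are moderate in length, attack sessions are long
    "session_duration": np.concatenate([rng.normal(30, 8, n_normal),
                                        rng.normal(240, 30, n_attack)]),
    "label_attack": [0] * n_normal + [1] * n_attack,
})

# Per-label summary statistics for the two numeric features
summary = (
    demo.groupby("label_attack")[["hour", "session_duration"]]
        .agg(["mean", "median", "min", "max"])
)
print(summary)
```

Running the same `groupby` on `df` would show whether the real anomalies actually differ in hour and duration as described.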

In [None]:
df['session_duration'].hist(bins=40)
plt.title('Session Duration Distribution')
plt.show()

In [None]:
df['hour'].value_counts().sort_index().plot(kind='bar')
plt.title('Login Hour Distribution')
plt.show()

## Preprocessing and Isolation Forest

In [None]:
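import numpy as np

# Separate features from the ground-truth label (used only for evaluation)
X = df.drop(columns=['label_attack'])
y = df['label_attack']

cat_cols = ['user', 'country']
num_cols = ['hour', 'session_duration']

# One-hot encode categorical features as a dense array.
# scikit-learn >= 1.2 uses `sparse_output`; older releases use `sparse=False`.
encoder = OneHotEncoder(sparse_output=False)
X_cat = encoder.fit_transform(X[cat_cols])

# Standardize numeric features to zero mean and unit variance
scaler = StandardScaler()
X_num = scaler.fit_transform(X[num_cols])

X_processed = np.hstack([X_num, X_cat])

# contamination is the expected fraction of anomalies (3.5% here)
model = IsolationForest(contamination=0.035, random_state=42)
preds = model.fit_predict(X_processed)         # -1 = anomaly, 1 = normal
scores = model.decision_function(X_processed)  # lower = more anomalous

df['anomaly_pred'] = (preds == -1).astype(int)
df['anomaly_score'] = scores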
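
The same preprocessing and model can also be expressed as a single scikit-learn `Pipeline`, which keeps the encoder, scaler, and model together so they can be reapplied to new sessions. This is an alternative sketch, not the approach used in this lab. The `demo` frame, its user names, and its country codes are hypothetical. `sparse_threshold=0.0` is used so that the combined output is dense regardless of the encoder's default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical users, countries, and values)
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "user": rng.choice(["alice", "bob", "carol"], n),
    "country": rng.choice(["IL", "US"], n),
    "hour": rng.integers(8, 18, n),
    "session_duration": rng.normal(30, 8, n),
})

# Scale numeric columns and one-hot encode categorical columns in one step
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["hour", "session_duration"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["user", "country"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", IsolationForest(contamination=0.035, random_state=42)),
])

demo_pred = pipeline.fit_predict(demo)  # -1 = anomaly, 1 = normal
demo_features = pipeline.named_steps["preprocess"].transform(demo)
print("Feature matrix shape:", demo_features.shape)
print("Flagged sessions:", int((demo_pred == -1).sum()))
```

With `handle_unknown="ignore"`, a user or country not seen during fitting is encoded as all zeros instead of raising an error at prediction time.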

In [None]:
print('Detected anomalies:', df['anomaly_pred'].sum())
print('Accuracy:', accuracy_score(y, df['anomaly_pred']))
print('Precision:', precision_score(y, df['anomaly_pred']))
print('Recall:', recall_score(y, df['anomaly_pred']))
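
Because anomalies are rare, accuracy alone says little about detection quality. The worked example below illustrates this with hand-built labels. The 3.5% attack rate mirrors the `contamination` value used above, but the label arrays and the "hypothetical detector" are assumptions for illustration, not results from this lab.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# 1000 hypothetical sessions, 35 of which are attacks (3.5%)
y_true = np.array([0] * 965 + [1] * 35)

# Baseline that never raises an alert
y_all_normal = np.zeros_like(y_true)
baseline_accuracy = accuracy_score(y_true, y_all_normal)  # 0.965
baseline_recall = recall_score(y_true, y_all_normal)      # 0.0

# Hypothetical detector: misses 7 attacks and raises 7 false alarms
y_detector = y_true.copy()
y_detector[965:972] = 0  # 7 missed attacks (false negatives)
y_detector[:7] = 1       # 7 false alarms (false positives)

tn, fp, fn, tp = confusion_matrix(y_true, y_detector).ravel()
detector_accuracy = accuracy_score(y_true, y_detector)    # 0.986
detector_precision = precision_score(y_true, y_detector)  # 28 / 35 = 0.8
detector_recall = recall_score(y_true, y_detector)        # 28 / 35 = 0.8

print(f"Baseline: accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
print(f"Detector: TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Detector: accuracy={detector_accuracy:.3f}, "
      f"precision={detector_precision:.3f}, recall={detector_recall:.3f}")
```

The baseline reaches 96.5% accuracy while detecting nothing, which is why precision and recall are the more informative metrics here.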

In [None]:
plt.hist(df['anomaly_score'], bins=50)
plt.title('Anomaly Score Distribution')
plt.show()
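
To help read the score histogram: `decision_function` returns scores shifted so that zero is the decision threshold, and negative scores correspond to the sessions that `predict` labels `-1`. The sketch below checks this on synthetic two-dimensional data. `X_demo` and its two clusters are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a dense cluster plus a few distant points (hypothetical)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0, 1, (290, 2)),    # bulk of "normal" points
    rng.normal(6, 0.5, (10, 2)),   # small, far-away group
])

forest = IsolationForest(contamination=0.035, random_state=42).fit(X_demo)
demo_scores = forest.decision_function(X_demo)
demo_preds = forest.predict(X_demo)

# Negative decision scores are exactly the points predicted as anomalies
flagged_by_sign = demo_scores < 0
print("Flagged by sign:", int(flagged_by_sign.sum()))
print("Flagged by predict:", int((demo_preds == -1).sum()))

# decision_function is score_samples shifted by the fitted offset_
raw_scores = forest.score_samples(X_demo)
shift_matches = np.allclose(demo_scores, raw_scores - forest.offset_)

# Ranking by score gives the most anomalous points first
most_anomalous = np.argsort(demo_scores)[:5]
print("Indices of the 5 lowest scores:", most_anomalous)
```

Sorting by score, as in the last step, is a simple way to prioritize which flagged sessions to review first.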

## 2D Visualization using PCA

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_processed)

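# Color each session by its predicted label so flagged points stand out
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['anomaly_pred'], cmap='coolwarm', s=10)
plt.colorbar(label='anomaly_pred (1 = flagged)')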
plt.title('PCA Projection of Login Sessions')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

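In the projection, normal sessions form a dense cluster near the center, while sessions flagged as anomalous lie farther out, consistent with their unusual login hours and session durations. Only the first two principal components are shown, so some of the separation present in the full feature space may not be visible here.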
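
How much a 2D projection can be trusted depends on how much variance the two components retain. The sketch below shows how to read this from `explained_variance_ratio_`. It uses a synthetic matrix (`X_demo`, an assumption for illustration); the same attribute is available on the fitted `pca` object above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 6-feature matrix in which two columns are strongly correlated
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 6))
X_demo[:, 1] = 2 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

pca_demo = PCA(n_components=2).fit(X_demo)

# Fraction of total variance captured by each retained component
ratios = pca_demo.explained_variance_ratio_
retained = ratios.sum()
print("Variance ratio per component:", ratios)
print(f"Total variance retained in 2D: {retained:.2%}")
```

A low retained fraction would mean the scatter plot hides much of the structure the model actually used.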

## Conclusion

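The Isolation Forest model flagged rare login sessions consistent with misuse of valid accounts (MITRE ATT&CK technique T1078). The precision and recall reported above measure how closely those flags match the `label_attack` ground truth. Combining temporal (`hour`), numeric (`session_duration`), and categorical (`user`, `country`) features allowed the model to separate normal from suspicious behavior.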