**Meaning of Data Columns**

- SD: Suspicious domain i.e. words associated with hacking
- JD: Job domain i.e. words associated with job portals
- CD: Cloud domain i.e. words associated with cloud drives

- WKE: user accessed during weekends
- OWH: user accessed out of work hours (of work days)
- WH: user accessed during work hours
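
The three access-time flags above can be derived from an event timestamp. The sketch below is one plausible way to do it, not the feature-engineering code that produced this dataset: the 09:00 to 17:00 work window and the helper name `access_bucket` are assumptions made for illustration.

```python
from datetime import datetime

# Assumed work window; the real feature engineering may use different hours.
WORK_START_HOUR = 9
WORK_END_HOUR = 17  # exclusive


def access_bucket(ts: datetime) -> str:
    """Classify a timestamp as WKE, WH or OWH (hypothetical helper)."""
    if ts.weekday() >= 5:  # Saturday = 5, Sunday = 6
        return "WKE"
    if WORK_START_HOUR <= ts.hour < WORK_END_HOUR:
        return "WH"
    return "OWH"


print(access_bucket(datetime(2010, 1, 2, 14, 0)))   # a Saturday afternoon
print(access_bucket(datetime(2010, 1, 4, 10, 30)))  # a Monday morning
print(access_bucket(datetime(2010, 1, 4, 22, 15)))  # a Monday night
```

Checking the weekend first means a Saturday at 10:00 counts as WKE rather than WH, which matches the "(of work days)" qualifier on OWH.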

**Import Feature Engineered Data**
- OCEAN Personality Test
- File
- HTTP

In [None]:
import pandas as pd
data = pd.read_parquet('../data/FEData_For_Modelling.parquet').reset_index(drop = True)
data

Get a list of malicious users from the answers dataset
- Identified in scenario 2

In [None]:
import os
malicious_filenames = os.listdir('../data/answers')
malicious_users = []

for filename in malicious_filenames:
    malicious_users.append(filename.replace('r5.2-2-', "").replace('.csv', ""))

malicious_users
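
The chained `.replace()` calls above remove the prefix and extension wherever they occur in the name. A slightly stricter variant, sketched here with a hypothetical helper `user_from_filename`, only strips the prefix at the start and the extension at the end. The example filename is illustrative and simply combines the prefix from the loop with a user ID.

```python
import os


def user_from_filename(filename: str, prefix: str = "r5.2-2-") -> str:
    """Strip the scenario prefix and the file extension (hypothetical helper)."""
    stem, _ext = os.path.splitext(filename)  # splits at the last dot only
    return stem[len(prefix):] if stem.startswith(prefix) else stem


# Illustrative filename, not read from disk
print(user_from_filename("r5.2-2-TRC1838.csv"))
```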

**Add a 'malicious' column to identify such users**
- 1: malicious
- 0: not malicious

In [None]:
import numpy as np

data['malicious'] = np.where(data['user'].isin(malicious_users), 1, 0)
data

In [None]:
data[data.user == 'TRC1838']

# Decision Tree / Random Forest

In [None]:
from sklearn.model_selection import train_test_split
from collections import Counter

X = data.drop(columns = ['user', 'malicious'])
y = data.malicious

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train Labels before Resampling")
print(Counter(y_train))
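
With very few malicious users, a purely random split can leave the test set with almost no positive examples. One option, shown here on synthetic labels rather than the notebook's data, is to pass `stratify` to `train_test_split` so both splits keep the same class ratio. The arrays `X_demo` and `y_demo` are made up for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 90 benign (0), 10 malicious (1)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)  # placeholder feature

# stratify=y_demo keeps the class ratio the same in both splits
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train:", Counter(y_tr.tolist()))
print("Test: ", Counter(y_te.tolist()))
```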

**SMOTE Oversampling**
- Not included for now

In [None]:
# from imblearn.over_sampling import SMOTE

# # transform the dataset
# oversample = SMOTE(sampling_strategy=0.8) #sampling_strategy=0.8
# resampled_X_train, resampled_y_train = oversample.fit_resample(X_train, y_train)

# print("Train Labels after Resampling")
# print(Counter(resampled_y_train))
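
In imbalanced-learn, a float `sampling_strategy` is the desired ratio of minority to majority samples after resampling. The arithmetic below sketches what `0.8` would imply. The class counts are made up, `smote_target_counts` is a hypothetical helper, and the library's exact rounding and error handling may differ.

```python
def smote_target_counts(n_majority: int, n_minority: int, sampling_strategy: float):
    """Return (minority count after resampling, synthetic samples needed)."""
    target_minority = round(n_majority * sampling_strategy)
    n_synthetic = max(target_minority - n_minority, 0)
    return target_minority, n_synthetic


# Made-up class counts, for illustration only
target, synthetic = smote_target_counts(
    n_majority=1000, n_minority=50, sampling_strategy=0.8
)
print(target, synthetic)
```

If SMOTE is enabled later, it should be fit on the training split only, as the commented-out cell does, so that no synthetic points derived from test users leak into evaluation.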

**Feature Normalisation**

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
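
The scaler is fit on the training set and then reused on the test set, so the test data never influences the mean or standard deviation. The plain-Python sketch below reproduces that fit/transform split for a single made-up feature. `StandardScaler` uses the population standard deviation, which is why `pstdev` is used here.

```python
from statistics import fmean, pstdev

train = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up training values for one feature
test = [6.0]                       # made-up test value

# "fit": learn the mean and population standard deviation from training data only
mu = fmean(train)
sigma = pstdev(train)

# "transform": apply the training statistics to both sets
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print(mu, round(sigma, 4))
print([round(z, 4) for z in test_scaled])
```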

**Recursive Feature Elimination + Cross Validation**

In [None]:
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Feature Selection
dt = DecisionTreeClassifier(random_state=42)  # fixed seed so reruns select the same features
# dt = RandomForestClassifier(random_state=42)
rfe = RFECV(estimator = dt, scoring = 'precision')  # precision penalises false positives
X_train_rfe = rfe.fit_transform(X_train_scaled, y_train)
X_test_rfe = rfe.transform(X_test_scaled)

print('Chosen best features by rfe:', X_train.columns[rfe.support_])
print('Ranking of Feature Importance:', rfe.ranking_)
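
The `ranking_` array printed above is positional: entry *i* belongs to column *i* of the input, rank 1 marks a selected feature, and larger ranks were eliminated earlier. Pairing ranks with names makes the output easier to read. The feature names and ranks below are placeholders standing in for `X_train.columns` and `rfe.ranking_`.

```python
# Hypothetical feature names and ranking, standing in for
# X_train.columns and rfe.ranking_
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]
ranking = [1, 3, 1, 2]

# Rank 1 marks a selected feature; larger ranks were eliminated earlier
ranked = sorted(zip(ranking, feature_names))
selected = [name for rank, name in ranked if rank == 1]

for rank, name in ranked:
    print(rank, name)
print("Selected:", selected)
```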

**Model Evaluation**

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Fit the tree on the RFECV-selected features, then evaluate on the test set
dt.fit(X_train_rfe, y_train)
y_pred = dt.predict(X_test_rfe)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
disp.plot()
plt.show()
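
On a heavily imbalanced dataset, accuracy alone can look excellent while most flagged users are false alarms. The sketch below reads the usual metrics off a made-up 2x2 confusion matrix in scikit-learn's layout. The numbers are illustrative and are not results from this notebook.

```python
# Made-up confusion matrix in scikit-learn's layout: rows are true labels,
# columns are predicted labels, so cm = [[TN, FP], [FN, TP]]
cm_demo = [[390, 4],
           [3, 3]]

(tn, fp), (fn, tp) = cm_demo

acc_demo = (tp + tn) / (tp + tn + fp + fn)
prec_demo = tp / (tp + fp)  # of users flagged malicious, how many really are
rec_demo = tp / (tp + fn)   # of truly malicious users, how many were flagged

print(f"accuracy={acc_demo:.3f} precision={prec_demo:.3f} recall={rec_demo:.3f}")
```

Here accuracy is above 98% even though fewer than half of the flagged users are actually malicious, which is why the classification report is more informative than the accuracy line.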

**Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters to tune
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring = "precision")
grid_search.fit(X_train_rfe, y_train)

# Print the best hyperparameters and corresponding score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Evaluate the best model on the test set
# Note: .score() returns accuracy, not the precision used for tuning
best_dt = grid_search.best_estimator_
test_score = best_dt.score(X_test_rfe, y_test)
print("Test Set Score:", test_score)
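
To get a feel for the cost of the search above, the grid can be enumerated directly. The sketch copies the same parameter values into a separate dictionary, `param_grid_demo`, and counts candidates and cross-validation fits. By default `GridSearchCV` also refits the best candidate once on the full training set.

```python
from itertools import product

param_grid_demo = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5

# Every combination of the listed values is one candidate model
candidates = list(product(*param_grid_demo.values()))
n_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```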


**Model Evaluation after Tuning**

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Evaluate the tuned model
# (GridSearchCV has already refit best_dt on the full training set, so no extra fit is needed)
y_pred = best_dt.predict(X_test_rfe)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
disp.plot()
plt.show()