# **Z3 Baseline Model Template Applied**

## Objectives

* Establish a baseline binary classifier for the cybersecurity intrusion dataset using logistic regression.
* Document a reproducible modelling workflow that follows the team notebook template structure.

## Inputs

* `data/processed/cybersecurity_intrusion_data_eda.csv`: feature-engineered dataset prepared during previous steps of the project.

## Outputs

* Fitted `LogisticRegression` pipeline with preprocessing steps and tuned hyperparameters.
* Evaluation metrics (classification report and confusion matrix) for the held-out test set.
* Ranked logistic regression coefficients to highlight the most influential features.

## Additional Comments

* The workflow maintains template sections and focuses exclusively on the logistic regression baseline requested for Zone 3.

---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
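
Moving up exactly one level assumes the notebook always sits one folder below the project root, and re-running that cell climbs a further level each time. A slightly more defensive variant is sketched below: it walks upwards until it finds a marker folder. The helper name `find_project_root` and the choice of `data` as the marker are assumptions made for illustration; they are not part of the template.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return `start` or its nearest ancestor that contains a folder named `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No folder named {marker!r} found at or above {start}")
```

In the notebook this could be used as `os.chdir(find_project_root(Path.cwd()))`. Because the search starts at the current folder, running it a second time leaves the working directory unchanged.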

# Section 1

## Import libraries and load the modelling dataset

This section collects all external dependencies and loads the processed cybersecurity intrusion data so the remaining steps can focus on modelling tasks.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from feature_engine.encoding import OneHotEncoder

# Visual style for plots
sns.set_style("whitegrid")

In [None]:
# Feature engineered cybersecurity dataset
df_clf = pd.read_csv('data/processed/cybersecurity_intrusion_data_eda.csv')
print(f'Dataset shape: {df_clf.shape}')
df_clf.head()

---
# Section 2

## Prepare features and create train/test splits

The session identifier and the raw columns that were replaced by engineered versions are removed before splitting. A stratified 80/20 split preserves the attack rate across the training and test sets.

In [None]:
columns_to_drop = ['session_id', 'network_packet_size', 'session_duration', 'ip_reputation_score']

df_model = df_clf.drop(columns=columns_to_drop)
X = df_model.drop(columns=['attack_detected'])
y = df_model['attack_detected']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101, stratify=y
)

print('* Train set:', X_train.shape, y_train.shape)
print('* Test set:', X_test.shape, y_test.shape)
y.value_counts(normalize=True).rename('overall_attack_rate')
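
To make the effect of `stratify=y` concrete, the following plain-Python sketch splits each class separately and then compares the positive rate in each part. It is a hand-rolled illustration of the idea, not scikit-learn's implementation, and the toy labels and the helper names `stratified_split` and `positive_rate` are made up for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=101):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def positive_rate(labels, indices):
    """Share of rows in `indices` whose label is 1."""
    indices = list(indices)
    return sum(labels[i] for i in indices) / len(indices)


# Toy labels: 450 attacks out of 1000 sessions (made-up numbers).
y_toy = [1] * 450 + [0] * 550
train_idx, test_idx = stratified_split(y_toy)

print(f"overall: {positive_rate(y_toy, range(len(y_toy))):.3f}")
print(f"train:   {positive_rate(y_toy, train_idx):.3f}")
print(f"test:    {positive_rate(y_toy, test_idx):.3f}")
```

The equivalent check on the real split is to compare `y_train.value_counts(normalize=True)` with `y_test.value_counts(normalize=True)`.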

---
# Section 3

## Build the logistic regression pipeline and tune hyperparameters

The pipeline handles imputation, categorical encoding and feature scaling before fitting a logistic regression model. A `GridSearchCV` explores a compact grid of penalty types (L1, L2 and elastic net) and regularisation strengths (`C`), scored by F1 with 5-fold cross-validation, while keeping the workflow limited to logistic regression as requested.
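
As a reading aid, one common way to write the elastic net objective that `LogisticRegression` minimises is shown below. The notation is an assumption introduced here rather than something taken from the notebook: labels are coded as $y_i \in \{-1, 1\}$, $\rho$ stands for `l1_ratio` and $c$ for the intercept.

$$
\min_{w,\,c} \; \frac{1 - \rho}{2}\, w^\top w + \rho\, \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + c\right)\right)\right)
$$

Read this way, a smaller `C` gives the penalty more weight (stronger regularisation), $\rho = 1$ reduces to the L1 penalty and $\rho = 0$ reduces to the L2 penalty.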

In [None]:
logistic_pipeline = Pipeline([
    ('median', MeanMedianImputer(imputation_method='median', variables=['ip_reputation_score_log'])),
    ('categorical_imputer', CategoricalImputer(imputation_method='frequent', variables=['browser_type'])),
    ('onehot', OneHotEncoder(variables=['browser_type', 'protocol_type', 'encryption_used'], drop_last=True)),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(solver='saga', max_iter=2000, n_jobs=-1, random_state=101))
])

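# Three penalty families, each tried at three regularisation strengths.
# With 'elasticnet', l1_ratio=0.0 is equivalent to 'l2' and l1_ratio=1.0 to 'l1'.
param_grid = [
    {'model__penalty': ['l1'], 'model__C': [0.1, 1.0, 10.0]},
    {'model__penalty': ['l2'], 'model__C': [0.1, 1.0, 10.0]},
    {'model__penalty': ['elasticnet'], 'model__C': [0.1, 1.0, 10.0], 'model__l1_ratio': [0.0, 0.5, 1.0]},
]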

logistic_search = GridSearchCV(
    estimator=logistic_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    refit=True,
    verbose=1
)

logistic_search.fit(X_train, y_train)

print('Best parameters:', logistic_search.best_params_)
print(f"Best mean CV F1-score: {logistic_search.best_score_:.3f}")

cv_results = (
    pd.DataFrame(logistic_search.cv_results_)[
        ['mean_test_score', 'std_test_score', 'param_model__penalty', 'param_model__C', 'param_model__l1_ratio']
    ]
    .sort_values('mean_test_score', ascending=False)
    .reset_index(drop=True)
)
cv_results.head()
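
The size of the search can be worked out by hand. The sketch below expands the same grid with `itertools` (dropping the `model__` prefix for brevity) and counts how many candidates describe a distinct penalty setting. The helpers `expand` and `canonical` are illustrative names, not scikit-learn functions.

```python
from itertools import product

# Same grid as the notebook cell, with the 'model__' prefix dropped for brevity.
param_grid = [
    {"penalty": ["l1"], "C": [0.1, 1.0, 10.0]},
    {"penalty": ["l2"], "C": [0.1, 1.0, 10.0]},
    {"penalty": ["elasticnet"], "C": [0.1, 1.0, 10.0], "l1_ratio": [0.0, 0.5, 1.0]},
]


def expand(grid):
    """Yield one dict per parameter combination in each block of the grid."""
    for block in grid:
        keys = list(block)
        for values in product(*(block[key] for key in keys)):
            yield dict(zip(keys, values))


def canonical(params):
    """Map a candidate to (l1_ratio, C), treating 'l1' as ratio 1.0 and 'l2' as ratio 0.0."""
    ratio = {"l1": 1.0, "l2": 0.0}.get(params["penalty"], params.get("l1_ratio"))
    return (ratio, params["C"])


candidates = list(expand(param_grid))
distinct = {canonical(p) for p in candidates}
cv_folds = 5

print(f"candidates: {len(candidates)}")
print(f"cross-validation fits: {len(candidates) * cv_folds}")
print(f"distinct penalty settings: {len(distinct)}")
```

Under these assumptions the grid has 15 candidates and 75 cross-validation fits, of which only 9 are distinct penalty settings. If the search time matters, restricting `l1_ratio` to interior values such as 0.5 would remove the overlap.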

---
# Section 4

## Evaluate the tuned baseline on the test set

Model diagnostics focus on standard classification metrics and a confusion matrix so stakeholders can compare performance against future experiments.

In [None]:
best_pipeline = logistic_search.best_estimator_

test_predictions = best_pipeline.predict(X_test)
print('Classification report (test set):')
print(classification_report(y_test, test_predictions))

fig, ax = plt.subplots(figsize=(5, 4))
ConfusionMatrixDisplay.from_predictions(
    y_test, test_predictions,
    display_labels=['No Attack', 'Attack'],
    colorbar=False,
    cmap='Blues',
    ax=ax
)
ax.set_title('Confusion matrix (test set)')
plt.tight_layout()
plt.show()
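
The attack-class figures in the classification report follow directly from the four cells of the confusion matrix. A minimal sketch of that arithmetic is given below; the counts are made up for illustration and are not results from this notebook, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up counts, laid out like the plot: rows are true labels, columns are predictions,
# i.e. [[tn, fp], [fn, tp]].
metrics = binary_metrics(tn=90, fp=10, fn=20, tp=80)
print({name: round(value, 3) for name, value in metrics.items()})
```

Recall on the attack class is the share of true attacks that the model flags, so false negatives (bottom-left cell) are what pull it down.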

---
# Section 5

## Interpret logistic regression coefficients

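Coefficients provide directional insight into how each engineered feature influences the log-odds of `attack_detected`. Because the features are standardised before fitting, each coefficient is the change in log-odds for a one-standard-deviation increase in that feature, which makes the magnitudes comparable across features.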

In [None]:
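# Re-use the fitted preprocessing steps (everything before the scaler and the model)
# to recover the column names that line up with the coefficients.
feature_engineering_steps = best_pipeline[:-2]
feature_matrix = feature_engineering_steps.transform(X_train)
feature_names = feature_matrix.columns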

coef_series = pd.Series(best_pipeline.named_steps['model'].coef_[0], index=feature_names, name='coefficient')
coef_df = (
    coef_series.to_frame()
    .assign(absolute_coefficient=lambda df_: df_['coefficient'].abs())
    .sort_values('absolute_coefficient', ascending=False)
)
coef_df.head(10)
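
Two conversions are often useful when reading these numbers: turning a coefficient into an odds ratio, and rescaling it from "per standard deviation" to "per original unit". The sketch below shows the arithmetic with hypothetical values; neither the numbers nor the helper names come from the fitted model.

```python
import math


def odds_ratio_per_sd(coefficient):
    """Multiplicative change in the odds for a one-standard-deviation increase."""
    return math.exp(coefficient)


def coefficient_per_unit(coefficient, feature_std):
    """Rescale a standardised coefficient to log-odds per original unit of the feature."""
    return coefficient / feature_std


# Hypothetical values, not taken from the fitted model.
coef, std = 0.7, 2.0
print(f"odds ratio per SD: {odds_ratio_per_sd(coef):.3f}")
print(f"log-odds per unit: {coefficient_per_unit(coef, std):.3f}")
```

In the notebook, the standard deviations used by the scaler are available as `best_pipeline.named_steps['scaler'].scale_`, in the same column order as `feature_names`. Note that the one-hot columns are standardised too, so their coefficients are also expressed per standard deviation rather than per category switch.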

---
# Conclusions and Next Steps

* Logistic regression with elastic net regularisation offers a solid baseline for the intrusion detection task and achieves competitive recall on the attack class; the exact figures are in the Section 4 classification report.
* Future iterations can compare additional algorithms or extend the feature engineering pipeline, using this notebook as a controlled reference point.