# **03 Baseline Model**


## Objectives

* Build a reproducible baseline machine learning pipeline using logistic regression for the intrusion detection dataset.

## Inputs

* `data/raw/cybersecurity_intrusion_data.csv`

## Outputs

* Baseline performance metrics printed within the notebook.
* A fitted scikit-learn pipeline ready for downstream experimentation.

## Additional Comments

* The notebook adopts the shared template structure to stay consistent with the rest of the project.


---


# Change Working Directory


* These notebooks live in the `jupyter_notebooks` subfolder, so the project root is one level up.
* Capture the current directory so we can move to the repository root before referencing project-relative paths.


In [None]:
import os
current_dir = os.getcwd()
current_dir

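* Set the working directory to the project root so relative paths in the remaining cells resolve correctly. Run these cells once per kernel session: re-running them after the directory has already changed moves one more level up.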


In [None]:
os.chdir(os.path.dirname(current_dir))
print(f"You set a new current directory: {os.getcwd()}")

* Double-check the update before continuing.


In [None]:
current_dir = os.getcwd()
current_dir
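* As a hedged alternative to moving up exactly one level, the root can be located by walking upward until a known marker directory appears. The sketch below is not part of the project template: `find_project_root` and the choice of `data` as the marker are assumptions, the latter based only on the `data/raw/...` path used later in this notebook.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return the first directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory at or above {start} contains {marker!r}")


# Demonstrate on a throwaway layout that mimics <root>/data and <root>/jupyter_notebooks.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "data").mkdir()
    notebooks = root / "jupyter_notebooks"
    notebooks.mkdir()

    # The same root is found whether the search starts in the subfolder or at the root.
    from_notebooks = find_project_root(notebooks)
    from_root = find_project_root(root)
    same_root = from_notebooks == from_root == root

print(f"Same root from both starting points: {same_root}")
```

* In a notebook, `os.chdir(find_project_root(Path.cwd()))` would then give the same result however many times the cell is run.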

---


# Data Loading
* Load the intrusion detection dataset and perform a quick sanity check on its structure.


In [None]:
import pandas as pd

DATA_PATH = 'data/raw/cybersecurity_intrusion_data.csv'
df = pd.read_csv(DATA_PATH)
print(f'Dataset shape: {df.shape}')
df.head()
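* Beyond the shape and the first rows, a sanity check usually covers missing values, duplicate rows, and the class balance of the target. The following is a minimal sketch on a toy CSV, not the real dataset: only `session_id` and `attack_detected` are column names taken from this notebook, while `packet_size` and `protocol` are hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for the real CSV; the fourth row duplicates the third on purpose.
toy_csv = io.StringIO(
    "session_id,packet_size,protocol,attack_detected\n"
    "S1,512,TCP,0\n"
    "S2,,UDP,1\n"
    "S3,128,TCP,0\n"
    "S3,128,TCP,0\n"
)
toy_df = pd.read_csv(toy_csv)

# Count missing values per column.
missing_per_column = toy_df.isna().sum()

# Count rows that exactly repeat an earlier row.
duplicate_rows = int(toy_df.duplicated().sum())

# Share of each class in the target column.
class_share = toy_df["attack_detected"].value_counts(normalize=True)

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
print(class_share)
```

* The same three calls can be applied to `df` directly once the real file is loaded.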

# Feature Preparation
* Remove identifier columns, separate the target, and catalogue numeric versus categorical predictors for preprocessing.


In [None]:
target_col = 'attack_detected'
id_cols = ['session_id']
# Drop identifiers (no predictive signal) and the target (to avoid leakage)
feature_frame = df.drop(columns=id_cols + [target_col])
target = df[target_col]
# 'number' matches every numeric dtype, not only int64 and float64
numeric_features = feature_frame.select_dtypes(include='number').columns.tolist()
categorical_features = feature_frame.select_dtypes(include=['object', 'category']).columns.tolist()
print('Numeric features:', numeric_features)
print('Categorical features:', categorical_features)
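* A `ColumnTransformer` drops any column that is not listed in one of its transformers (its default is `remainder='drop'`), so a column whose dtype falls outside both lists would silently vanish from the model. The sketch below checks for such columns on a toy frame. All column names in it are hypothetical.

```python
import pandas as pd

toy_features = pd.DataFrame(
    {
        "packet_size": pd.Series([512, 128, 256], dtype="int64"),
        "duration": pd.Series([1.5, 0.2, 3.1], dtype="float64"),
        "protocol": pd.Series(["TCP", "UDP", "TCP"], dtype=object),
        # bool is neither a numeric nor an object dtype for select_dtypes
        "is_encrypted": pd.Series([True, False, True], dtype=bool),
    }
)

numeric = toy_features.select_dtypes(include="number").columns.tolist()
categorical = toy_features.select_dtypes(include=["object", "category"]).columns.tolist()

# Any column in neither list would be dropped by the preprocessing step.
assigned = set(numeric) | set(categorical)
unassigned = [col for col in toy_features.columns if col not in assigned]

print("Numeric:", numeric)
print("Categorical:", categorical)
print("Unassigned:", unassigned)
```

* An empty `unassigned` list on the real `feature_frame` would confirm that every predictor reaches the model.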

# Train/Test Split and Pipeline Definition
* Split the data 80/20 with stratification so both splits preserve the class proportions, then define a pipeline that scales numeric features, one-hot encodes categorical features, and fits a logistic regression.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    feature_frame,
    target,
    test_size=0.2,
    random_state=42,
    stratify=target
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

baseline_pipeline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(max_iter=1000))
    ]
)
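* The effect of `stratify` can be verified by comparing the positive-class rate in each split. This is a small sketch on synthetic data with an assumed 20% positive rate, not on the intrusion dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 200 + [0] * 800)  # 20% positive class

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,
)

# With stratification, both rates should stay close to the overall rate.
overall_rate = y_toy.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()

print(f"Overall: {overall_rate:.3f}, train: {train_rate:.3f}, test: {test_rate:.3f}")
```

* Running the same comparison on `y_train` and `y_test` gives a quick check that the real split kept the attack rate intact.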

# Model Training and Evaluation
* Fit the pipeline on the training split, then report accuracy and per-class precision, recall, and F1 on the held-out test split to benchmark future experiments.


In [None]:
from sklearn.metrics import accuracy_score, classification_report

baseline_pipeline.fit(X_train, y_train)
y_pred = baseline_pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Baseline accuracy: {accuracy:.3f}')
print('Classification report:')
print(classification_report(y_test, y_pred))
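* Accuracy alone can hide weak recall on the attack class, so a confusion matrix and ROC AUC are common companions to the report above. The sketch below shows both on synthetic data from `make_classification`. The toy pipeline only scales features and is an assumption for illustration, not the notebook's `baseline_pipeline`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with a 70/30 class split.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

toy_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)
toy_pipeline.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_te, toy_pipeline.predict(X_te))
tn, fp, fn, tp = cm.ravel()

# ROC AUC is computed from the predicted probability of the positive class.
positive_proba = toy_pipeline.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, positive_proba)

print(cm)
print(f"False negatives (missed positives): {fn}")
print(f"ROC AUC: {auc:.3f}")
```

* In the intrusion setting, the false-negative count corresponds to attacks that the model failed to flag.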

---


# Conclusions and Next Steps
* Logistic regression establishes a baseline accuracy that will anchor future model comparisons.
* Consider exploring feature interactions or more expressive models (e.g., tree ensembles) to improve recall for the attack class.
* Persist the fitted pipeline to disk once its performance is validated against additional evaluation metrics.
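* One common way to persist a fitted scikit-learn pipeline is `joblib`, which is installed alongside scikit-learn. The sketch below saves and reloads a toy pipeline and checks that predictions survive the round trip. The file name `baseline_pipeline.joblib` and the temporary directory are assumptions, not project conventions.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)

toy_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)
toy_pipeline.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "baseline_pipeline.joblib"  # hypothetical file name

    # Serialize the whole pipeline, preprocessing included, to one file.
    joblib.dump(toy_pipeline, model_path)

    # Reload it and confirm it predicts exactly as the in-memory pipeline does.
    restored = joblib.load(model_path)
    predictions_match = np.array_equal(
        toy_pipeline.predict(X_toy), restored.predict(X_toy)
    )

print(f"Predictions match after reload: {predictions_match}")
```

* Saved files should be loaded only from trusted sources and, ideally, with the same scikit-learn version that produced them.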
