# 02 – Baseline Model: Logistic Regression

**Project:** Predictive Modeling for Drug Discovery via Virtual Screening  
**Student:** Milica Jeftić (ID: 89211255)  
**Date:** January 2026  
**Dataset:** Kaggle – Drug Discovery Virtual Screening Dataset

---

## Goal of This Notebook

This notebook establishes a **baseline logistic regression model** as a reference point for all subsequent models.
The objectives are:

1. **Load Preprocessed Data** – Import train/validation/test sets from notebook 01
2. **Train Baseline Model** – Fit logistic regression on training data
3. **Validate Performance** – Evaluate on the validation set (accuracy, precision, recall, F1, ROC-AUC)
4. **Test Evaluation** – Final unbiased evaluation on the held-out test set
5. **Baseline Summary** – Document strengths, limitations, and the reference metrics for later models
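
The steps above can be sketched end to end as follows. This is a minimal illustration, not the notebook's actual pipeline: it uses synthetic data from `make_classification` in place of the preprocessed splits from notebook 01, and the 70/15/15 split, the `max_iter` value, and the `evaluate` helper are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data (assumption, not the real dataset)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Illustrative 70/15/15 train/validation/test split, stratified on the label
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

# Fit the baseline on the training split only
baseline = LogisticRegression(max_iter=1000, random_state=42)
baseline.fit(X_train, y_train)


def evaluate(model, X_part, y_part):
    """Hypothetical helper: return F1 and ROC-AUC for one data split."""
    y_pred = model.predict(X_part)
    y_proba = model.predict_proba(X_part)[:, 1]  # probability of the positive class
    return {
        "f1": f1_score(y_part, y_pred),
        "roc_auc": roc_auc_score(y_part, y_proba),
    }


# Validation metrics guide model choices; the test split is scored once at the end
val_metrics = evaluate(baseline, X_val, y_val)
test_metrics = evaluate(baseline, X_test, y_test)
print("validation:", val_metrics)
print("test      :", test_metrics)
```

Keeping the test split out of every decision until the final step is what makes the test score an unbiased estimate.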

---

## Expected Outputs

- Baseline model performance metrics (accuracy, precision, recall, F1, ROC-AUC)
- Confusion matrix and ROC curve visualizations
- Saved baseline model (joblib)
- Baseline report (saved to `results/metrics/`)
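
A minimal sketch of how these outputs might be produced, assuming a fitted scikit-learn classifier. The file names `baseline_logreg.joblib` and `baseline_metrics.json` are hypothetical, the data is synthetic, and the example writes to a temporary directory instead of the real project folders.

```python
import json
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix,
)

# Synthetic data and model standing in for the real baseline (assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

y_pred = model.predict(X)
y_proba = model.predict_proba(X)[:, 1]

# Collect the listed metrics as plain Python types so they serialize to JSON
report = {
    "accuracy": float(accuracy_score(y, y_pred)),
    "precision": float(precision_score(y, y_pred)),
    "recall": float(recall_score(y, y_pred)),
    "f1": float(f1_score(y, y_pred)),
    "roc_auc": float(roc_auc_score(y, y_proba)),
    "confusion_matrix": confusion_matrix(y, y_pred).tolist(),
}

# Temporary directory mirroring the models/ and results/metrics/ layout
out_root = tempfile.mkdtemp()
models_dir = os.path.join(out_root, "models")
metrics_dir = os.path.join(out_root, "results", "metrics")
os.makedirs(models_dir, exist_ok=True)
os.makedirs(metrics_dir, exist_ok=True)

model_path = os.path.join(models_dir, "baseline_logreg.joblib")  # hypothetical name
report_path = os.path.join(metrics_dir, "baseline_metrics.json")  # hypothetical name

joblib.dump(model, model_path)
with open(report_path, "w", encoding="utf-8") as fh:
    json.dump(report, fh, indent=2)

# Reload to confirm the saved model reproduces the same predictions
reloaded = joblib.load(model_path)
print((reloaded.predict(X) == y_pred).all())
```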

---

## 1. Environment Setup

This section imports the required libraries, fixes the random seed, sets display and plotting defaults, and defines the project paths used throughout the notebook, so that results are reproducible across runs.

In [1]:
# ============================
# Environment & Configuration
# ============================

import os
import sys
import warnings

import numpy as np
import pandas as pd
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, roc_curve
)

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams

# ----------------------------
# Reproducibility & Warnings
# ----------------------------
RANDOM_STATE = 42  # shared seed; can also be passed as random_state to estimators
np.random.seed(RANDOM_STATE)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

# ----------------------------
# Pandas display options
# ----------------------------
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
pd.set_option("display.width", 120)
pd.set_option("display.float_format", "{:.4f}".format)

# ----------------------------
# Visualization defaults
# ----------------------------
plt.style.use("seaborn-v0_8-whitegrid")
sns.set_palette("husl")

rcParams["figure.figsize"] = (12, 6)
rcParams["font.size"] = 12

%matplotlib inline

# ----------------------------
# Project paths
# ----------------------------
# Assumes the notebook is run from a folder one level below the project root
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))

DATA_PROCESSED_PATH = os.path.join(PROJECT_ROOT, "data", "processed")
MODELS_PATH = os.path.join(PROJECT_ROOT, "models")
RESULTS_PATH = os.path.join(PROJECT_ROOT, "results")

print("=" * 60)
print("Environment initialized successfully")
print("=" * 60)
print(f"Python       : {sys.version.split()[0]}")
print(f"Numpy        : {np.__version__}")
print(f"Pandas       : {pd.__version__}")
print(f"Scikit-learn : {__import__('sklearn').__version__}")
print("-" * 60)
print(f"Project root : {PROJECT_ROOT}")
print(f"Data dir     : {DATA_PROCESSED_PATH}")
print(f"Models dir   : {MODELS_PATH}")
print("=" * 60)

============================================================
Environment initialized successfully
============================================================
Python       : 3.10.19
Numpy        : 2.2.5
Pandas       : 2.3.3
Scikit-learn : 1.7.2
------------------------------------------------------------
Project root : c:\Users\KORISNIK\Documents\drug-discovery-virtual-screening
Data dir     : c:\Users\KORISNIK\Documents\drug-discovery-virtual-screening\data\processed
Models dir   : c:\Users\KORISNIK\Documents\drug-discovery-virtual-screening\models
============================================================
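
The setup cell defines the output directories but does not create them, so a later save could fail on a fresh checkout. A minimal sketch of guarding against that is shown here; `ensure_dirs` is a hypothetical helper, and the demonstration uses a temporary directory rather than the real `PROJECT_ROOT`.

```python
import os
import tempfile


def ensure_dirs(*paths):
    """Create each directory (and any missing parents); return the paths as a list."""
    for path in paths:
        os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return list(paths)


# Temporary stand-in for the project root (assumption for this example)
project_root = tempfile.mkdtemp()

created = ensure_dirs(
    os.path.join(project_root, "models"),
    os.path.join(project_root, "results", "metrics"),
)
print(all(os.path.isdir(p) for p in created))
```

In the notebook itself the same call could be made with `MODELS_PATH` and a `metrics` subfolder of `RESULTS_PATH`.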
