# 01 – Exploratory Data Analysis and Preprocessing

**Project:** Predictive Modeling for Drug Discovery via Virtual Screening  
**Student:** Milica Jeftić (ID: 89211255)  
**Date:** January 2026  
**Dataset:** Kaggle – Drug Discovery Virtual Screening Dataset

---

## Goal of This Notebook

This notebook performs exploratory data analysis (EDA) and preprocessing on the raw dataset loaded in notebook 00.
The objectives are:

1. **Environment Setup** – Import libraries and load raw data from notebook 00
2. **Exploratory Data Analysis** – Examine feature distributions, correlations, and relationships
3. **Missing Value Handling** – Impute or remove rows with missing values
4. **Feature Scaling/Normalization** – Prepare features for machine learning models
5. **Data Preparation** – Create train/test split and save processed data

---

## Expected Outputs

- EDA visualizations and statistical summaries (saved to `results/figures/`)
- Feature correlation analysis
- Processed dataset (saved to `data/processed/`)
- Preprocessing report and decisions (saved to `results/metrics/`)

---

## 1. Environment Setup and Data Loading

In [1]:
# ============================
# Environment & Configuration
# ============================

import os
import sys
import warnings

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams

# ----------------------------
# Warning configuration
# ----------------------------
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

# ----------------------------
# Pandas display options
# ----------------------------
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
pd.set_option("display.width", 120)
pd.set_option("display.float_format", "{:.4f}".format)

# ----------------------------
# Visualization defaults
# ----------------------------
plt.style.use("seaborn-v0_8-whitegrid")
sns.set_palette("husl")

rcParams["figure.figsize"] = (12, 6)
rcParams["font.size"] = 12

%matplotlib inline

# ----------------------------
# Project paths
# ----------------------------
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))

DATA_RAW_PATH = os.path.join(PROJECT_ROOT, "data", "raw")
DATA_PROCESSED_PATH = os.path.join(PROJECT_ROOT, "data", "processed")
RESULTS_PATH = os.path.join(PROJECT_ROOT, "results")

print("=" * 60)
print("Environment initialized successfully")
print("=" * 60)
print(f"Python       : {sys.version.split()[0]}")
print(f"Numpy        : {np.__version__}")
print(f"Pandas       : {pd.__version__}")
print(f"Scikit-learn : {__import__('sklearn').__version__}")
print("-" * 60)
print(f"Project root : {PROJECT_ROOT}")
print(f"Raw data dir : {DATA_RAW_PATH}")
print("=" * 60)

Environment initialized successfully
Python       : 3.10.19
Numpy        : 2.2.5
Pandas       : 2.3.3
Scikit-learn : 1.7.2
------------------------------------------------------------
Project root : c:\Users\KORISNIK\Documents\drug-discovery-virtual-screening
Raw data dir : c:\Users\KORISNIK\Documents\drug-discovery-virtual-screening\data\raw
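
The Expected Outputs section names three save locations (`data/processed/`, `results/figures/`, `results/metrics/`). A minimal sketch of creating them up front, so that later save calls do not fail on a missing folder, might look like this. `ensure_output_dirs` is a hypothetical helper and not part of the original notebook; only the folder names come from the text above.

```python
import os


def ensure_output_dirs(project_root):
    """Create the output folders named under Expected Outputs, if missing.

    Hypothetical helper: the folder names come from this notebook's
    Expected Outputs list; the function itself is only an illustration.
    """
    dirs = {
        "processed": os.path.join(project_root, "data", "processed"),
        "figures": os.path.join(project_root, "results", "figures"),
        "metrics": os.path.join(project_root, "results", "metrics"),
    }
    for path in dirs.values():
        # exist_ok=True makes the call safe to repeat on every run
        os.makedirs(path, exist_ok=True)
    return dirs
```

In this notebook it would be called as `ensure_output_dirs(PROJECT_ROOT)` right after the path constants are defined.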


## 2. Exploratory Data Analysis (EDA)

We now examine the raw dataset to understand its structure, feature distributions, and relationships.
This includes basic statistics and identifying data quality issues.

### Load raw data from notebook 00

In [2]:
# Load the raw dataset
dataset_path = os.path.join(DATA_RAW_PATH, "drug_discovery_virtual_screening.csv")

print("Loading raw dataset...")
df = pd.read_csv(dataset_path)

print(f"✓ Dataset loaded: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\nDataset shape: {df.shape}")
print(f"\nFirst few rows:")
display(df.head())

print(f"\nData types:")
print(df.dtypes)

Loading raw dataset...
✓ Dataset loaded: 2000 rows × 17 columns

Dataset shape: (2000, 17)

First few rows:


|  | compound_id | protein_id | molecular_weight | logp | h_bond_donors | h_bond_acceptors | rotatable_bonds | polar_surface_area | compound_clogp | protein_length | protein_pi | hydrophobicity | binding_site_size | mw_ratio | logp_pi_interaction | binding_affinity | active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CID_00000 | PID_361 | 499.6714 | 2.4872 | 1 | 7 | 4 | 113.3508 | 4.0507 | 678 | 6.0197 | 0.8125 | 12.5122 | 0.737 | 14.9723 | 5.9967 | 0 |
| 1 | CID_00001 | PID_165 | 436.1736 | 3.2832 | 3 | 4 | 4 | 71.9811 | 3.7044 | 876 | 6.4474 | 0.6514 | 11.5384 | 0.4979 | 21.1683 | 6.4457 | 0 |
| 2 | CID_00002 | PID_168 | 514.7689 | NaN | 2 | 11 | 11 | 83.9363 | 1.8696 | 658 | 3.9258 | 0.6335 | 13.1557 | 0.7823 | 9.0741 | 5.6896 | 0 |
| 3 | CID_00003 | PID_226 | 602.303 | 3.0381 | 0 | 5 | 5 | 79.8681 | 2.4519 | 312 | 7.5971 | 0.513 | 12.0718 | 1.9305 | 23.0803 | 6.0434 | 0 |
| 4 | CID_00004 | PID_224 | 426.5847 | 0.6596 | 2 | 4 | 5 | 88.1987 | 1.7719 | 1418 | 4.2495 | 0.6136 | 15.8504 | 0.3008 | 2.8028 | 4.8451 | 0 |



Data types:
compound_id             object
protein_id              object
molecular_weight       float64
logp                   float64
h_bond_donors            int64
h_bond_acceptors         int64
rotatable_bonds          int64
polar_surface_area     float64
compound_clogp         float64
protein_length           int64
protein_pi             float64
hydrophobicity         float64
binding_site_size      float64
mw_ratio               float64
logp_pi_interaction    float64
binding_affinity       float64
active                   int64
dtype: object
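
With the column types known, a first pass over distributions and relationships could be sketched as follows. This is an illustration under assumptions, not the notebook's actual EDA code: `quick_eda_summary` is a hypothetical helper, and the `demo` frame is synthetic data that only borrows a few column names from the dataset.

```python
import numpy as np
import pandas as pd


def quick_eda_summary(df, target="active"):
    """Return basic numeric statistics, class balance, and target correlations."""
    numeric = df.select_dtypes(include="number")
    stats = numeric.describe().T  # one row per numeric column
    class_balance = df[target].value_counts(normalize=True).sort_index()
    # Pearson correlation of each numeric feature with the target;
    # pandas excludes NaNs pairwise when computing it.
    target_corr = numeric.corr()[target].drop(target).sort_values(ascending=False)
    return stats, class_balance, target_corr


# Synthetic stand-in for the real dataset (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "compound_id": [f"CID_{i:05d}" for i in range(200)],
    "molecular_weight": rng.normal(450, 80, 200),
    "logp": rng.normal(2.5, 1.0, 200),
    "binding_affinity": rng.normal(6.0, 1.0, 200),
})
demo["active"] = (demo["binding_affinity"] > 7.0).astype(int)

stats, class_balance, target_corr = quick_eda_summary(demo)
print(class_balance)
print(target_corr)
```

Identifier columns such as `compound_id` and `protein_id` are of type `object`, so `select_dtypes` leaves them out of the statistics automatically.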


## 3. Missing Value Handling

Notebook 00 identified three columns with missing values: `logp`, `hydrophobicity`, and `polar_surface_area`, each missing 3% of its entries (60 of 2,000).
Because the gaps mostly fall in different rows, dropping every row with at least one missing value removes 174 rows (8.7% of the dataset), not 3%.
We accept this loss and remove those rows so that all features and targets are complete for modeling.

The affected features are related physicochemical descriptors (logP, hydrophobicity, polar surface area),
and the missing values may reflect incomplete molecular property computation rather than random data loss, although this notebook does not test that assumption.
Dropping rows keeps only measured descriptor values and avoids the noise that imputation could introduce.
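
Per-column missing percentages do not directly give the share of rows that `dropna()` removes, because gaps in different columns can sit in different rows. A minimal sketch of checking both quantities before dropping anything is shown here. It uses synthetic data, and `missingness_summary` is a hypothetical helper that does not appear in the original notebook.

```python
import numpy as np
import pandas as pd


def missingness_summary(df):
    """Compare per-column missingness with the share of rows dropna() would remove."""
    per_column_pct = df.isna().mean() * 100
    rows_with_any_missing = int(df.isna().any(axis=1).sum())
    row_loss_pct = 100 * rows_with_any_missing / len(df)
    return per_column_pct, rows_with_any_missing, row_loss_pct


# Synthetic example: three columns, each missing 3% of its values,
# with the gaps placed in non-overlapping rows.
n = 1000
demo = pd.DataFrame({
    "logp": np.ones(n),
    "hydrophobicity": np.ones(n),
    "polar_surface_area": np.ones(n),
})
demo.loc[0:29, "logp"] = np.nan                 # .loc slices are inclusive: 30 rows
demo.loc[30:59, "hydrophobicity"] = np.nan
demo.loc[60:89, "polar_surface_area"] = np.nan

per_column_pct, n_rows, row_loss_pct = missingness_summary(demo)
print(per_column_pct.round(1).to_dict())
print(n_rows, row_loss_pct)
```

In this synthetic case each column is missing 3% of its values, yet 9% of the rows contain at least one gap, which is the same effect that makes the row loss in this dataset larger than the per-column figure.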

In [4]:
print("=" * 60)
print("MISSING VALUE HANDLING")
print("=" * 60)

# Identify missing values
missing_count = df.isna().sum()
missing_pct = (df.isna().mean() * 100)

missing_report = pd.DataFrame({
    "column": missing_count.index,
    "missing_count": missing_count.values,
    "missing_pct": missing_pct.values
}).sort_values("missing_count", ascending=False)

missing_nonzero = missing_report[missing_report["missing_count"] > 0]

print(f"\nColumns with missing values:")
display(missing_nonzero)

print(f"\nStrategy: Drop rows with missing values (only 3% of data affected)")
df_clean = df.dropna()
print(f"Rows before: {len(df)}")
print(f"Rows after:  {len(df_clean)}")
print(f"Rows removed: {len(df) - len(df_clean)}")

# Verify no missing values remain
print(f"\nMissing values after cleaning: {df_clean.isna().sum().sum()}")

MISSING VALUE HANDLING

Columns with missing values:


|  | column | missing_count | missing_pct |
|---|---|---|---|
| 3 | logp | 60 | 3.0 |
| 11 | hydrophobicity | 60 | 3.0 |
| 7 | polar_surface_area | 60 | 3.0 |



Strategy: Drop rows with missing values (only 3% of data affected)
Rows before: 2000
Rows after:  1826
Rows removed: 174

Missing values after cleaning: 0
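
For comparison only, the other option named in the notebook's objectives is imputation. The sketch below shows median imputation as a swapped-in alternative to `dropna()`; it is not what this notebook does. `impute_median` is a hypothetical helper and the `demo` values are made up.

```python
import numpy as np
import pandas as pd


def impute_median(df, columns):
    """Fill NaNs in the given columns with each column's median.

    Alternative to the notebook's dropna() strategy, shown only for
    comparison; the original DataFrame is left unchanged.
    """
    out = df.copy()
    medians = out[columns].median()  # NaNs are skipped by default
    out[columns] = out[columns].fillna(medians)
    return out, medians


demo = pd.DataFrame({
    "logp": [1.0, np.nan, 3.0, 5.0],
    "hydrophobicity": [0.2, 0.4, np.nan, 0.8],
    "active": [0, 1, 0, 1],
})
filled, medians = impute_median(demo, ["logp", "hydrophobicity"])
print(filled)
```

This keeps every row at the cost of replacing unknown descriptor values with estimates. If it were used in a modeling pipeline, the medians should be computed on the training split only and then applied to the test split, so that no test information leaks into preprocessing.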
