# 🌾 Crop Recommendation: Data Preprocessing Pipeline

This notebook preprocesses raw agricultural data for ML model training.

**Pipeline Steps:**
1. Load & validate raw data
2. Handle missing values
3. Detect & treat outliers (IQR method)
4. Encode target labels
5. Split into train/val/test (stratified)
6. Scale features (StandardScaler)
7. Export artifacts

---

## 📦 Setup & Imports

In [5]:
import json
from datetime import datetime
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

print("✓ Imports loaded")

✓ Imports loaded


## ⚙️ Configuration

Adjust these parameters to customize the pipeline:

In [6]:
# Paths
INPUT_PATH = "../data/raw/Crop_recommendation.csv"
OUTPUT_DIR = Path("../data/processed")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Split ratios
VAL_SIZE = 0.15
TEST_SIZE = 0.15
RANDOM_STATE = 42

# Outlier treatment
OUTLIER_METHOD = "clip"  # Options: "clip", "remove", None
IQR_THRESHOLD = 1.5

print(f"Train: {(1-VAL_SIZE-TEST_SIZE)*100:.0f}% | Val: {VAL_SIZE*100:.0f}% | Test: {TEST_SIZE*100:.0f}%")

Train: 70% | Val: 15% | Test: 15%


---
## 1️⃣ Load Raw Data

In [7]:
df = pd.read_csv(INPUT_PATH)
print(f"Loaded: {df.shape[0]} rows × {df.shape[1]} columns")
df.head()

Loaded: 2200 rows × 8 columns


,N,P,K,temperature,humidity,ph,rainfall,label
0,90,42,43,20.879744,82.002744,6.502985,202.935536,rice
1,85,58,41,21.770462,80.319644,7.038096,226.655537,rice
2,60,55,44,23.004459,82.320763,7.840207,263.964248,rice
3,74,35,40,26.491096,80.158363,6.980401,242.864034,rice
4,78,42,42,20.130175,81.604873,7.628473,262.71734,rice


In [8]:
# Identify columns
TARGET_COL = "label"
FEATURE_COLS = [c for c in df.columns if c != TARGET_COL]

print(f"Features: {FEATURE_COLS}")
print(f"Target: {TARGET_COL} ({df[TARGET_COL].nunique()} classes)")

Features: ['N', 'P', 'K', 'temperature', 'humidity', 'ph', 'rainfall']
Target: label (22 classes)
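Step 1 promises validation, but the cells above only load the file. A minimal sketch of what a validation pass could look like follows. The schema and the `validate_raw` helper are assumptions inferred from the columns printed above, not part of the original notebook.

```python
import pandas as pd

# Hypothetical schema, inferred from the columns printed above.
EXPECTED_COLUMNS = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall", "label"]
NUMERIC_COLUMNS = EXPECTED_COLUMNS[:-1]


def validate_raw(frame):
    """Return a list of problems found; an empty list means every check passed."""
    absent = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if absent:
        # The remaining checks need every column, so stop early.
        return [f"missing columns: {absent}"]

    problems = []
    for col in NUMERIC_COLUMNS:
        if not pd.api.types.is_numeric_dtype(frame[col]):
            problems.append(f"{col} is not numeric")
    # pH is defined on a 0-14 scale; NaN also fails this check.
    if pd.api.types.is_numeric_dtype(frame["ph"]) and not frame["ph"].between(0, 14).all():
        problems.append("ph has values outside 0-14")
    if frame.duplicated().any():
        problems.append(f"{int(frame.duplicated().sum())} duplicated rows")
    return problems


# Toy frame with one deliberately invalid pH reading.
sample = pd.DataFrame({
    "N": [90, 85], "P": [42, 58], "K": [43, 41],
    "temperature": [20.88, 21.77], "humidity": [82.0, 80.32],
    "ph": [6.5, 15.2], "rainfall": [202.94, 226.66],
    "label": ["rice", "rice"],
})
problems = validate_raw(sample)
print(problems)  # ['ph has values outside 0-14']
```

Returning a list instead of raising lets the notebook print every problem at once before deciding whether to continue.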


---
## 2️⃣ Missing Values Check

In [9]:
missing = df.isnull().sum()
print("Missing values per column:")
print(missing[missing > 0] if missing.sum() > 0 else "✓ No missing values")

Missing values per column:
✓ No missing values
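This dataset has no gaps, so no treatment is applied. If a future version of the file did contain missing values, one possible treatment is sketched below: drop rows without a label and fill numeric gaps with the column median. The `handle_missing` helper and the toy frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def handle_missing(frame, feature_cols, target_col):
    """Drop rows with a missing target, then fill feature gaps with column medians."""
    cleaned = frame.dropna(subset=[target_col]).copy()
    medians = cleaned[feature_cols].median()
    # fillna with a Series matches on column names.
    cleaned[feature_cols] = cleaned[feature_cols].fillna(medians)
    return cleaned.reset_index(drop=True)


toy = pd.DataFrame({
    "ph": [6.0, np.nan, 7.0, 8.0],
    "rainfall": [100.0, 200.0, np.nan, 300.0],
    "label": ["rice", "rice", None, "maize"],
})
filled = handle_missing(toy, ["ph", "rainfall"], "label")
print(filled)
```

To stay consistent with the leakage rule used for scaling, the medians would ideally be computed on the training split only and then reused for the validation and test sets.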


---
## 3️⃣ Outlier Detection & Treatment

The IQR method flags any value below `Q1 - 1.5×IQR` or above `Q3 + 1.5×IQR` as an outlier, where `IQR = Q3 - Q1`.

In [10]:
def detect_outliers_iqr(df, columns, threshold=1.5):
    """Detect outliers using IQR method."""
    outlier_mask = pd.DataFrame(False, index=df.index, columns=columns)
    bounds = {}
    
    for col in columns:
        Q1, Q3 = df[col].quantile([0.25, 0.75])
        IQR = Q3 - Q1
        lower, upper = Q1 - threshold * IQR, Q3 + threshold * IQR
        outlier_mask[col] = (df[col] < lower) | (df[col] > upper)
        bounds[col] = (lower, upper)
    
    return outlier_mask, bounds

# Detect
numeric_cols = df[FEATURE_COLS].select_dtypes(include=[np.number]).columns.tolist()
outlier_mask, bounds = detect_outliers_iqr(df, numeric_cols, IQR_THRESHOLD)

print("Outliers per column:")
print(outlier_mask.sum())
print(f"\nTotal rows with outliers: {outlier_mask.any(axis=1).sum()}")

Outliers per column:
N                0
P              138
K              200
temperature     86
humidity        30
ph              57
rainfall       100
dtype: int64

Total rows with outliers: 432
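To make the bounds concrete, here is the same arithmetic worked through on a small made-up series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

values = pd.Series([10, 12, 13, 14, 15, 16, 18, 60], name="rainfall_toy")

# pandas interpolates linearly between neighbouring points by default.
q1, q3 = values.quantile([0.25, 0.75])  # 12.75 and 16.5
iqr = q3 - q1                           # 3.75
lower = q1 - 1.5 * iqr                  # 7.125
upper = q3 + 1.5 * iqr                  # 22.125

flagged = values[(values < lower) | (values > upper)]
print(f"bounds: [{lower}, {upper}]")
print(f"outliers: {flagged.tolist()}")  # [60]
```

Only the value 60 falls outside the interval, so it is the single flagged point.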


In [11]:
# Treat outliers
if OUTLIER_METHOD == "clip":
    for col, (lower, upper) in bounds.items():
        df[col] = df[col].clip(lower=lower, upper=upper)
    print(f"✓ Clipped outliers in {len(bounds)} columns")
elif OUTLIER_METHOD == "remove":
    df = df[~outlier_mask.any(axis=1)].reset_index(drop=True)
    print(f"✓ Removed {outlier_mask.any(axis=1).sum()} outlier rows")
else:
    print("⚠ Outlier treatment disabled")

✓ Clipped outliers in 7 columns
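The two `OUTLIER_METHOD` options differ in what they do to the row count. A small sketch on a toy frame (made-up values) shows the trade-off: clipping keeps every row and caps the extreme value, while removal drops the row entirely.

```python
import pandas as pd

toy = pd.DataFrame({"K": [20, 22, 25, 27, 30, 205], "label": list("aabbcc")})

q1, q3 = toy["K"].quantile([0.25, 0.75])  # 22.75 and 29.25
iqr = q3 - q1                             # 6.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 13.0 and 39.0

# "clip": same number of rows, extreme value pulled back to the bound.
clipped = toy.assign(K=toy["K"].clip(lower=lower, upper=upper))

# "remove": rows outside the bounds are dropped.
is_outlier = (toy["K"] < lower) | (toy["K"] > upper)
removed = toy[~is_outlier].reset_index(drop=True)

print(len(clipped), clipped["K"].max())  # 6 39.0
print(len(removed), removed["K"].max())  # 5 30
```

With 22 classes of limited size, clipping avoids shrinking any single class, which may be why it is the default here.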


---
## 4️⃣ Encode Target Labels

In [12]:
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df[TARGET_COL])
X = df[FEATURE_COLS].copy()

print(f"Classes ({len(label_encoder.classes_)}):")
for i, cls in enumerate(label_encoder.classes_):
    print(f"  {i:2d}: {cls}")

Classes (22):
   0: apple
   1: banana
   2: blackgram
   3: chickpea
   4: coconut
   5: coffee
   6: cotton
   7: grapes
   8: jute
   9: kidneybeans
  10: lentil
  11: maize
  12: mango
  13: mothbeans
  14: mungbean
  15: muskmelon
  16: orange
  17: papaya
  18: pigeonpeas
  19: pomegranate
  20: rice
  21: watermelon
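`LabelEncoder` assigns integer codes in sorted order of the class names, which is why `apple` is 0 and `watermelon` is 21. The mapping is reversible, so model predictions can be turned back into crop names. A short illustration on three of the classes:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["rice", "maize", "apple", "rice"])

# Classes are sorted alphabetically: apple=0, maize=1, rice=2.
print(encoder.classes_.tolist())  # ['apple', 'maize', 'rice']
print(codes.tolist())             # [2, 1, 0, 2]

# Decode integer predictions back to crop names.
decoded = encoder.inverse_transform([0, 2])
print(decoded.tolist())           # ['apple', 'rice']
```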


---
## 5️⃣ Train / Val / Test Split

> ⚠️ **Important:** Split BEFORE scaling to prevent data leakage.

In [13]:
# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y
)

# Second split: separate validation from training
val_relative = VAL_SIZE / (1 - TEST_SIZE)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=val_relative, random_state=RANDOM_STATE, stratify=y_temp
)

print(f"X_train: {X_train.shape}  ({len(X_train)/len(X)*100:.1f}%)")
print(f"X_val:   {X_val.shape}  ({len(X_val)/len(X)*100:.1f}%)")
print(f"X_test:  {X_test.shape}  ({len(X_test)/len(X)*100:.1f}%)")

X_train: (1540, 7)  (70.0%)
X_val:   (330, 7)  (15.0%)
X_test:  (330, 7)  (15.0%)
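The second split uses `val_relative` because `VAL_SIZE` is a fraction of the whole dataset, while the second `train_test_split` call only sees the 85% left after the test set is removed: `0.15 / 0.85 ≈ 0.1765`. The sketch below reproduces the split on synthetic data of the same size and checks that stratification keeps the classes balanced. It assumes 100 rows for each of the 22 classes, which the notebook does not state explicitly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

VAL_SIZE, TEST_SIZE = 0.15, 0.15
val_relative = VAL_SIZE / (1 - TEST_SIZE)  # about 0.1765 of the remaining 85%

# Synthetic stand-in: 22 classes x 100 rows (balanced classes are an assumption).
y = np.repeat(np.arange(22), 100)
X = np.arange(len(y)).reshape(-1, 1)

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=val_relative, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # 1540 330 330

# Per-class row counts in each split.
train_counts = np.bincount(y_train)
val_counts = np.bincount(y_val)
test_counts = np.bincount(y_test)
print(train_counts.min(), train_counts.max())
```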


---
## 6️⃣ Feature Scaling

Fit the scaler on the **training data only**, then use it to transform all three sets.

In [14]:
scaler = StandardScaler()

# Fit on train, transform all
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=FEATURE_COLS)
X_val_scaled = pd.DataFrame(scaler.transform(X_val), columns=FEATURE_COLS)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=FEATURE_COLS)

print("✓ Features scaled")
X_train_scaled.describe().round(2)

✓ Features scaled


,N,P,K,temperature,humidity,ph,rainfall
count,1540.0,1540.0,1540.0,1540.0,1540.0,1540.0,1540.0
mean,0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-1.37,-1.52,-1.42,-2.45,-2.5,-2.63,-1.59
25%,-0.8,-0.79,-0.78,-0.6,-0.5,-0.69,-0.73
50%,-0.37,-0.05,-0.27,-0.01,0.41,-0.04,-0.14
75%,0.93,0.49,0.45,0.63,0.83,0.63,0.43
max,2.42,2.4,2.31,2.48,1.28,2.59,2.17
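`StandardScaler` computes `z = (x - mean) / std` per feature, using the mean and population standard deviation (`ddof=0`) learned from the data passed to `fit`. The toy check below confirms that validation rows are scaled with the training statistics, not their own:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])
val = np.array([[2.5], [10.0]])

scaler = StandardScaler().fit(train)  # statistics come from train only

# Manual version of the same formula; np.std defaults to ddof=0.
manual = (val - train.mean(axis=0)) / train.std(axis=0)
scaled = scaler.transform(val)

print(scaler.mean_)                  # [2.5]
print(np.allclose(scaled, manual))   # True
```

This also explains the `std` row in the table above: `describe()` uses the sample standard deviation (`ddof=1`), which for 1,540 rows differs from 1 by a factor of about 1.0003 and rounds to 1.0.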


---
## 7️⃣ Export Artifacts

In [15]:
# Save data splits
X_train_scaled.to_csv(OUTPUT_DIR / "X_train.csv", index=False)
X_val_scaled.to_csv(OUTPUT_DIR / "X_val.csv", index=False)
X_test_scaled.to_csv(OUTPUT_DIR / "X_test.csv", index=False)

pd.Series(y_train, name="label").to_csv(OUTPUT_DIR / "y_train.csv", index=False)
pd.Series(y_val, name="label").to_csv(OUTPUT_DIR / "y_val.csv", index=False)
pd.Series(y_test, name="label").to_csv(OUTPUT_DIR / "y_test.csv", index=False)

# Save transformers
joblib.dump(scaler, OUTPUT_DIR / "scaler.joblib")
joblib.dump(label_encoder, OUTPUT_DIR / "label_encoder.joblib")

print("✓ Artifacts saved to:", OUTPUT_DIR.absolute())

✓ Artifacts saved to: c:\Users\NOBEL\GitHub\AgroSense\notebooks\..\data\processed
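The saved transformers are meant to be reloaded at inference time so that new samples get exactly the same scaling and label mapping. A self-contained sketch of that round trip follows. It uses a temporary directory and toy transformers as stand-ins for `OUTPUT_DIR` and the fitted objects above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)  # stand-in for OUTPUT_DIR

    # Toy transformers; in the notebook these are the fitted scaler and encoder.
    scaler = StandardScaler().fit(np.array([[0.0, 10.0], [2.0, 30.0]]))
    encoder = LabelEncoder().fit(["maize", "rice"])
    joblib.dump(scaler, out_dir / "scaler.joblib")
    joblib.dump(encoder, out_dir / "label_encoder.joblib")

    # Later, for example in a serving script: reload instead of refitting.
    loaded_scaler = joblib.load(out_dir / "scaler.joblib")
    loaded_encoder = joblib.load(out_dir / "label_encoder.joblib")

# A new sample equal to the training mean scales to zeros.
new_scaled = loaded_scaler.transform(np.array([[1.0, 20.0]]))
predicted = loaded_encoder.inverse_transform([1])
print(new_scaled, predicted)
```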


In [16]:
# Save preprocessing report
report = {
    "timestamp": datetime.now().isoformat(),
    "input_shape": list(pd.read_csv(INPUT_PATH).shape),
    "split": {
        "train_samples": len(X_train),
        "val_samples": len(X_val),
        "test_samples": len(X_test),
    },
    "encoding": {
        "n_classes": len(label_encoder.classes_),
        "classes": label_encoder.classes_.tolist(),
    },
    "config": {
        "val_size": VAL_SIZE,
        "test_size": TEST_SIZE,
        "outlier_method": OUTLIER_METHOD,
        "random_state": RANDOM_STATE,
    }
}

with open(OUTPUT_DIR / "preprocessing_report.json", "w") as f:
    json.dump(report, f, indent=2)

print("✓ Report saved")

✓ Report saved
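One use of the report is a quick consistency check: with `outlier_method` set to `"clip"`, no rows are dropped, so the three split sizes should add up to the input row count. The sketch below runs that check on the values printed in this notebook, with a JSON round trip standing in for reading the file back:

```python
import json

# Values copied from the outputs shown in this notebook.
report = {
    "input_shape": [2200, 8],
    "split": {"train_samples": 1540, "val_samples": 330, "test_samples": 330},
    "config": {"outlier_method": "clip"},
}

# Round trip through JSON, as a stand-in for loading preprocessing_report.json.
loaded = json.loads(json.dumps(report))

total = sum(loaded["split"].values())
rows_match = total == loaded["input_shape"][0]
print(total, rows_match)  # 2200 True
```

With `"remove"`, the total would be smaller than the input row count by the number of dropped rows.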


---
## ✅ Summary

In [17]:
print("=" * 50)
print("PREPROCESSING COMPLETE")
print("=" * 50)
print(f"\nTrain: {len(X_train):,} samples")
print(f"Val:   {len(X_val):,} samples")
print(f"Test:  {len(X_test):,} samples")
print(f"\nClasses: {len(label_encoder.classes_)}")
print(f"Features: {len(FEATURE_COLS)}")
print(f"\nOutput: {OUTPUT_DIR.absolute()}")

PREPROCESSING COMPLETE

Train: 1,540 samples
Val:   330 samples
Test:  330 samples

Classes: 22
Features: 7

Output: c:\Users\NOBEL\GitHub\AgroSense\notebooks\..\data\processed
