# Dry Bean Dataset (Multi-Class): 6 ML Models + Required Metrics

This notebook contains the full source code for analyzing the **Dry Bean Dataset** (multi-class, 7 classes) and implements **all 6 required models** on the **same dataset**.

## Required metrics (reported for every model)
- **Accuracy**
- **AUC (multi-class ROC-AUC, OvR macro)**
- **Precision (macro)**
- **Recall (macro)**
- **F1-score (macro)**
- **Matthews Correlation Coefficient (MCC)**

## Models
1. Logistic Regression (multinomial)
2. Decision Tree
3. KNN
4. Naive Bayes (Gaussian)
5. Random Forest
6. XGBoost

This notebook also:
- saves trained models into `model/`
- saves a `label_encoder` for consistent class handling
- generates `app.py` (Streamlit), `requirements.txt`, and `README.md` for deployment via GitHub.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Core
import os
import json
import numpy as np
import pandas as pd

# Modeling
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    matthews_corrcoef
)

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

from xgboost import XGBClassifier

import joblib
import matplotlib.pyplot as plt


## 1) Load dataset
Set `DATA_PATH` to the location of the Dry Bean dataset file. The cell below reads the Excel (`.xlsx`) version from the mounted Google Drive.

- Features: numeric columns (16 features in the UCI dataset)
- Target: class column (often named **Class**)
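
The dataset circulates both as a CSV file and as an Excel workbook, so a small loader that picks the reader from the file extension can be convenient. The sketch below is illustrative only: `load_table` is a hypothetical helper that is not part of this notebook, the demo values are made up, and reading `.xlsx` files assumes the `openpyxl` package is installed.

```python
import os
import tempfile

import pandas as pd


def load_table(path):
    """Hypothetical helper: read a CSV or Excel file based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return pd.read_csv(path)
    if ext in (".xlsx", ".xls"):
        return pd.read_excel(path)  # .xlsx reading relies on openpyxl
    raise ValueError(f"Unsupported file type: {ext!r}")


# Tiny demonstration with a throwaway CSV (made-up values, not real bean data).
demo = pd.DataFrame({"Area": [100, 200], "Class": ["SEKER", "BOMBAY"]})
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    demo.to_csv(demo_path, index=False)
    loaded = load_table(demo_path)

print(loaded.shape)
```

With a helper like this, `DATA_PATH` could point at either file format without changing the rest of the notebook.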


In [None]:
DATA_PATH = "/content/drive/MyDrive/Dry_Bean_Dataset.xlsx"

if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(
        f"Could not find {DATA_PATH}. Set DATA_PATH correctly."
    )

df = pd.read_excel(DATA_PATH)
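print("Shape:", df.shape)
display(df.head())  # head() only renders on its own when it is the cell's last expression

# Save a 300-row sample for trying the Streamlit uploader.
# Note: it is drawn from the full dataset, so it overlaps the training split
# and scores computed on it will be optimistic.
test_df = df.sample(300, random_state=42)
test_df.to_csv("test_upload.csv", index=False)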

## 2) Identify target + encode classes

We label-encode the target (bean type) to integers 0..(K-1).
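
As a small illustration of what the encoding does, the sketch below fits a `LabelEncoder` on a few bean names (a toy list, not the real target column) and maps the integer codes back to names. `LabelEncoder` sorts the class names, so the codes follow alphabetical order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the real target column.
labels = np.array(["SEKER", "BOMBAY", "SIRA", "BOMBAY"])

enc = LabelEncoder()
codes = enc.fit_transform(labels)        # classes_ are stored in sorted order
decoded = enc.inverse_transform(codes)   # integer codes back to class names

print(list(enc.classes_))  # ['BOMBAY', 'SEKER', 'SIRA']
print(codes.tolist())      # [1, 0, 2, 0]
print(list(decoded))
```

Saving the fitted encoder alongside the models is what lets the deployed app apply the same name-to-integer mapping later.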


In [None]:
# Common target column name on Kaggle/UCI: 'Class'

TARGET_COL = "Class" if "Class" in df.columns else df.columns[-1]

X = df.drop(columns=[TARGET_COL]).copy()
y_raw = df[TARGET_COL].copy()

X = X.apply(pd.to_numeric, errors="coerce")
if X.isna().any().any():
    before = len(X)
    mask = ~X.isna().any(axis=1)
    X = X.loc[mask].reset_index(drop=True)
    y_raw = y_raw.loc[mask].reset_index(drop=True)
    print(f"Dropped {before - len(X)} rows due to non-numeric/NaN features.")

le = LabelEncoder()
y = le.fit_transform(y_raw)

print("Target column:", TARGET_COL)
print("Num features:", X.shape[1])
print("Classes:", list(le.classes_))
print("Class distribution (counts):")
print(pd.Series(y).value_counts().sort_index().rename(index=lambda i: le.classes_[i]))


## 3) Train/test split

A stratified split preserves class proportions in both subsets. The same split is used for all models.
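
To see the effect of `stratify`, the sketch below splits a synthetic, imbalanced label vector (80 samples of one class and 20 of another, not the bean data) and counts the classes on each side.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80 of class 0, 20 of class 1.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both subsets keep the original 80/20 class ratio.
print(np.bincount(y_tr))  # [64 16]
print(np.bincount(y_te))  # [16  4]
```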


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
print("Train:", X_train.shape, "Test:", X_test.shape)


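## 4) Defining the 6 models

Scaling is used for:
- Logistic Regression
- KNN

Tree/ensemble models don't require scaling. Gaussian Naive Bayes is also left unscaled, since it estimates a separate mean and variance for each feature within each class.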
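
A rough illustration of why the distance-based model is scaled: with features on very different ranges, Euclidean distance is dominated by the largest-valued column. The numbers below are made up (an area-like column and a ratio-like column) and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
X_demo = np.array([
    [30000.0, 0.70],
    [30500.0, 0.95],
    [45000.0, 0.71],
])


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))


X_scaled = StandardScaler().fit_transform(X_demo)

# Raw distances are driven almost entirely by the first column.
print(dist(X_demo[0], X_demo[1]), dist(X_demo[0], X_demo[2]))
# After standardization both columns contribute on a comparable scale.
print(dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2]))
```

Wrapping the scaler and the estimator in a `Pipeline`, as done in the cell below, also means the scaler is fitted on training data only, including inside each cross-validation fold.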


In [None]:
num_classes = len(le.classes_)

models = {
    # Scaled models: Logistic Regression and KNN
    "Logistic Regression": Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(
            max_iter=4000,
            solver="lbfgs",
            multi_class="multinomial",  # deprecated in recent scikit-learn; lbfgs is already multinomial for multi-class targets
            random_state=42
        ))
    ]),
    "KNN": Pipeline([
        ("scaler", StandardScaler()),
        ("model", KNeighborsClassifier(n_neighbors=9))
    ]),
    # Unscaled models: Decision Tree, Naive Bayes, Random Forest, XGBoost
    "Decision Tree": DecisionTreeClassifier(
        random_state=42
    ),
    "Naive Bayes (Gaussian)": GaussianNB(),
    "Random Forest": RandomForestClassifier(
        n_estimators=150,
        max_depth=15,
        random_state=42,
        n_jobs=-1
    ),
    "XGBoost": XGBClassifier(
        objective="multi:softprob",
        num_class=num_classes,
        n_estimators=600,
        learning_rate=0.05,
        max_depth=5,
        subsample=0.9,
        colsample_bytree=0.9,
        reg_lambda=1.0,
        random_state=42,
        eval_metric="mlogloss",
        n_jobs=-1
    )
}

list(models.keys())


## 5) Evaluation functions (multi-class required metrics)

- **Precision/Recall/F1** are reported as **macro-average**
- **AUC** is computed as **multi-class ROC-AUC** using **OvR** with **macro averaging**
- **MCC** supports multi-class directly
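
The sketch below shows these metrics on a tiny three-class toy example with made-up probabilities (not output from any of the six models). It checks that the macro F1 is the unweighted mean of the per-class F1 scores and that OvR macro AUC takes the full probability matrix rather than hard predictions.

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score

y_true = np.array([0, 0, 1, 1, 2, 2])

# Made-up class probabilities: one column per class, each row sums to 1.
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.5, 0.2, 0.3],
])
y_pred = y_proba.argmax(axis=1)  # hard predictions from the probabilities

per_class_f1 = f1_score(y_true, y_pred, average=None)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # mean of per-class F1
auc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro")
mcc = matthews_corrcoef(y_true, y_pred)  # handles multi-class labels directly

print("Per-class F1:", per_class_f1)
print("Macro F1:", macro_f1)
print("AUC (OvR, macro):", auc)
print("MCC:", mcc)
```

Macro averaging weights every class equally, so a rare bean type counts as much as a common one.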


In [None]:
def get_score_matrix(model, X):
    """Return probability matrix for multi-class ROC-AUC, if available."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)
    return None

def evaluate_model(name, model, X_test, y_test, class_names):
    y_pred = model.predict(X_test)
    y_score = get_score_matrix(model, X_test)

    metrics = {
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision_macro": precision_score(y_test, y_pred, average="macro", zero_division=0),
        "Recall_macro": recall_score(y_test, y_pred, average="macro", zero_division=0),
        "F1_macro": f1_score(y_test, y_pred, average="macro", zero_division=0),
        "MCC": matthews_corrcoef(y_test, y_pred),
    }

    if y_score is not None and y_score.ndim == 2 and y_score.shape[1] == len(class_names):
        metrics["AUC_ovr_macro"] = roc_auc_score(
            y_test, y_score,
            multi_class="ovr",
            average="macro"
        )
    else:
        metrics["AUC_ovr_macro"] = np.nan

    cm = confusion_matrix(y_test, y_pred)
    report = classification_report(y_test, y_pred, target_names=class_names, digits=4, zero_division=0)
    return metrics, cm, report


## 6) Training & evaluating all models

In [None]:
results = []
conf_mats = {}
reports = {}
fitted_models = {}

class_names = list(le.classes_)

for name, model in models.items():
    model.fit(X_train, y_train)
    fitted_models[name] = model

    met, cm, rep = evaluate_model(name, model, X_test, y_test, class_names)
    results.append(met)
    conf_mats[name] = cm
    reports[name] = rep

results_df = pd.DataFrame(results).sort_values(by="F1_macro", ascending=False)
results_df


## 7) Classification report for the best model (by macro F1)

In [None]:
best_model_name = results_df.iloc[0]["Model"]
print("Best model (by F1_macro):", best_model_name)
print("\nClassification Report:")
print(reports[best_model_name])


## 8) Confusion matrices

In [None]:
import itertools

def plot_confusion_matrix(cm, labels, title="Confusion Matrix"):
    plt.figure(figsize=(6,6))
    plt.imshow(cm, interpolation="nearest")
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(labels))
    plt.xticks(tick_marks, labels, rotation=45, ha="right")
    plt.yticks(tick_marks, labels)

    thresh = cm.max() / 2.0
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], "d"),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.tight_layout()

for name, cm in conf_mats.items():
    plot_confusion_matrix(cm, class_names, title=f"{name} - Confusion Matrix")
    plt.show()


## 9) 5-fold Cross-Validation

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_rows = []

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1_macro")
    cv_rows.append({
        "Model": name,
        "CV_F1_macro_Mean": scores.mean(),
        "CV_F1_macro_Std": scores.std()
    })

cv_df = pd.DataFrame(cv_rows).sort_values(by="CV_F1_macro_Mean", ascending=False)
cv_df


## 10) Saving the models (with the label encoder for Streamlit)

In [None]:
os.makedirs("model", exist_ok=True)

joblib.dump(le, "model/label_encoder.joblib")

for name, model in fitted_models.items():
    safe_name = name.lower().replace(" ", "_").replace("(", "").replace(")", "")
    joblib.dump(model, f"model/{safe_name}.joblib")

results_df_out = results_df.copy()
results_df_out.to_csv("model/model_comparison_metrics.csv", index=False)
with open("model/model_comparison_metrics.json", "w") as f:
    json.dump(results_df_out.to_dict(orient="records"), f, indent=2)

print("Saved files in model/:")
print(os.listdir("model"))
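
For reference, a saved model and label encoder can be loaded back and used to turn integer predictions into bean names. The sketch below is a self-contained toy version: it trains a small decision tree on made-up data and writes to a temporary directory rather than to the notebook's `model/` folder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for the notebook's data (made-up values).
X_demo = np.array([[1.0], [2.0], [10.0], [11.0]])
y_names = np.array(["SEKER", "SEKER", "BOMBAY", "BOMBAY"])

enc = LabelEncoder()
clf = DecisionTreeClassifier(random_state=42).fit(X_demo, enc.fit_transform(y_names))

with tempfile.TemporaryDirectory() as tmp:
    # Save both artifacts, then load them back as a deployed app would.
    joblib.dump(enc, os.path.join(tmp, "label_encoder.joblib"))
    joblib.dump(clf, os.path.join(tmp, "decision_tree.joblib"))
    enc_loaded = joblib.load(os.path.join(tmp, "label_encoder.joblib"))
    clf_loaded = joblib.load(os.path.join(tmp, "decision_tree.joblib"))

# Predict integer codes, then decode them to class names.
new_rows = np.array([[1.5], [10.5]])
pred_names = enc_loaded.inverse_transform(clf_loaded.predict(new_rows))
print(list(pred_names))  # ['SEKER', 'BOMBAY']
```

Files written by `joblib` are generally best loaded with the same library versions that produced them, so pinning versions in `requirements.txt` may be worth considering.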


## 11) Generating deployment files for GitHub and Streamlit

In [None]:
APP_CODE = '''
import os
import joblib
import numpy as np
import pandas as pd
import streamlit as st

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    matthews_corrcoef
)

st.set_page_config(page_title="Dry Bean Classifier", layout="wide")
st.title("Dry Bean Classification (6 Models)")

st.markdown("""
Upload a CSV with the same **feature columns** used for training, plus a target column (bean class).
If your target column is named `Class`, the app will use it automatically; otherwise it will use the **last column**.
""")

uploaded = st.file_uploader("Upload CSV", type=["csv"])

MODEL_DIR = "model"
ENCODER_PATH = os.path.join(MODEL_DIR, "label_encoder.joblib")

model_files = {
    "Logistic Regression": "logistic_regression.joblib",
    "Decision Tree": "decision_tree.joblib",
    "KNN": "knn.joblib",
    "Naive Bayes (Gaussian)": "naive_bayes_gaussian.joblib",
    "Random Forest": "random_forest.joblib",
    "XGBoost": "xgboost.joblib"
}

def compute_metrics_multiclass(y_true, y_pred, y_proba, n_classes):
    out = {
        "Accuracy": float(accuracy_score(y_true, y_pred)),
        "Precision_macro": float(precision_score(y_true, y_pred, average="macro", zero_division=0)),
        "Recall_macro": float(recall_score(y_true, y_pred, average="macro", zero_division=0)),
        "F1_macro": float(f1_score(y_true, y_pred, average="macro", zero_division=0)),
        "MCC": float(matthews_corrcoef(y_true, y_pred)),
    }
    if y_proba is not None and y_proba.ndim == 2 and y_proba.shape[1] == n_classes:
        out["AUC_ovr_macro"] = float(roc_auc_score(y_true, y_proba, multi_class="ovr", average="macro"))
    else:
        out["AUC_ovr_macro"] = None
    return out

col1, col2 = st.columns([1, 2])

with col1:
    selected = st.selectbox("Select Model", list(model_files.keys()))

if uploaded is not None:
    df = pd.read_csv(uploaded)

    target_col = "Class" if "Class" in df.columns else df.columns[-1]
    X = df.drop(columns=[target_col]).copy()
    y_raw = df[target_col].copy()

    X = X.apply(pd.to_numeric, errors="coerce")
    if X.isna().any().any():
        st.warning("Some feature values could not be parsed as numeric. Dropping rows with NaNs.")
        mask = ~X.isna().any(axis=1)
        X = X.loc[mask].reset_index(drop=True)
        y_raw = y_raw.loc[mask].reset_index(drop=True)

    if not os.path.exists(ENCODER_PATH):
        st.error("label_encoder.joblib not found in model/. Please include it in your repo.")
        st.stop()

    le = joblib.load(ENCODER_PATH)
    class_names = list(le.classes_)
    n_classes = len(class_names)

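    try:
        y = le.transform(y_raw)
    except Exception:
        # Fall back to targets that are already integer-encoded (0..K-1).
        y_num = pd.to_numeric(y_raw, errors="coerce")
        if y_num.isna().any():
            st.error("Target column contains labels unknown to the saved label encoder.")
            st.stop()
        y = y_num.astype(int).to_numpy()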

    model_path = os.path.join(MODEL_DIR, model_files[selected])
    if not os.path.exists(model_path):
        st.error(f"Model file not found: {model_path}. Ensure model/ folder is in your repo.")
        st.stop()

    model = joblib.load(model_path)

    y_pred = model.predict(X)
    y_proba = model.predict_proba(X) if hasattr(model, "predict_proba") else None

    metrics = compute_metrics_multiclass(y, y_pred, y_proba, n_classes)
    cm = confusion_matrix(y, y_pred)

    with col2:
        st.subheader("Evaluation Metrics")
        st.json(metrics)

        st.subheader("Confusion Matrix")
        st.write(cm)

        st.subheader("Classification Report")
        st.text(classification_report(y, y_pred, target_names=class_names, digits=4, zero_division=0))
else:
    st.info("Upload a CSV to evaluate the selected model.")
'''

REQS = """pandas
numpy
scikit-learn
xgboost
joblib
matplotlib
streamlit
"""

README = """# Dry Bean Classification (6 Models)

## Problem Statement
The objective of this project is to classify dry beans into one of seven categories
based on geometric and morphological features using supervised machine learning techniques.

## Dataset
Dry Bean Dataset (public repository).
- Instances: 13,611
- Features: 16 numerical attributes
- Classes: 7 bean types (BARBUNYA, BOMBAY, CALI, DERMASON, HOROZ, SEKER, SIRA)

## Models Implemented
1. Logistic Regression (Multinomial)
2. Decision Tree
3. K-Nearest Neighbors (KNN)
4. Gaussian Naive Bayes
5. Random Forest
6. XGBoost

## Evaluation Metrics
- Accuracy
- AUC (multi-class ROC-AUC, OvR macro)
- Precision (macro)
- Recall (macro)
- F1-score (macro)
- MCC (Matthews Correlation Coefficient)

## Deployment
The application is deployed using Streamlit Community Cloud and allows users to:
- Upload a CSV dataset
- Select one of six ML models
- View evaluation metrics
- View confusion matrix
- View the per-class classification report

## Run locally
```bash
pip install -r requirements.txt
streamlit run app.py
```

## Observations
XGBoost achieved the highest macro F1-score (0.9374) and accuracy (0.9255), narrowly ahead of Logistic Regression and Random Forest. Gaussian Naive Bayes was the weakest model on every metric except AUC, where the Decision Tree scored lowest.

| ML Model Name | Accuracy | Precision | Recall | F1 | MCC | AUC |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.921410 | 0.935383 | 0.932149 | 0.933538 | 0.905045 | 0.994776 |
| Decision Tree | 0.892031 | 0.907513 | 0.909028 | 0.908061 | 0.869569 | 0.944996 |
| kNN | 0.916269 | 0.931763 | 0.926738 | 0.928868 | 0.898792 | 0.986807 |
| Naive Bayes | 0.763863 | 0.774427 | 0.769417 | 0.767750 | 0.715406 | 0.967193 |
| Random Forest | 0.920308 | 0.934654 | 0.930010 | 0.932210 | 0.903591 | 0.993567 |
| XGBoost | 0.925450 | 0.939923 | 0.935143 | 0.937430 | 0.909807 | 0.995291 |

## Observations by model

| ML Model Name | Observation on Model Performance and Output |
|---|---|
| Logistic Regression | As a linear classifier with multinomial optimization, Logistic Regression achieved strong macro-averaged metrics, indicating effective separation of classes in feature space. However, its performance slightly declined for structurally similar beans, highlighting limitations in modelling nonlinear feature interactions. |
| Decision Tree | The Decision Tree captured nonlinear decision boundaries but exhibited higher variance compared to ensemble methods. While it performed reasonably well, slight inconsistencies across classes suggest sensitivity to training data splits and potential overfitting. |
| kNN | kNN demonstrated competitive macro F1 performance by leveraging distance-based classification. However, its effectiveness depended heavily on feature scaling and class density, and performance degraded slightly in regions with overlapping morphological features. |
| Naive Bayes | Gaussian Naive Bayes produced moderate results due to its strong independence assumption among features. Since geometric attributes in the dataset are correlated, this assumption reduced its ability to model complex class boundaries accurately. |
| Random Forest (Ensemble) | Random Forest improved predictive stability by aggregating multiple decision trees, significantly reducing variance and improving macro-level metrics. It handled nonlinear relationships effectively and showed better class balance compared to single-tree models. |
| XGBoost (Ensemble) | XGBoost achieved the highest macro F1-score and overall accuracy due to its gradient boosting framework, which sequentially corrected previous errors. Its regularization and optimized tree-building strategy allowed superior handling of subtle inter-class differences, especially among morphologically overlapping bean types. |


## Notes
- The app expects a CSV containing the same feature columns plus a target column named `Class` (preferred) or as the last column.
- Trained models and label encoder are stored in the `model/` folder.
"""

with open("app.py", "w", encoding="utf-8") as f:
    f.write(APP_CODE)

with open("requirements.txt", "w", encoding="utf-8") as f:
    f.write(REQS)

with open("README.md", "w", encoding="utf-8") as f:
    f.write(README)

print("Generated: app.py, requirements.txt, README.md")


## 12) FINAL RESULTS

In [None]:
print("=== Model Comparison (Test Set) ===")
display(results_df.reset_index(drop=True))

print("\n=== Cross-Validation Summary (Train Set) ===")
display(cv_df.reset_index(drop=True))
