<a href="https://colab.research.google.com/github/mannat244/ML_Lab/blob/main/lab1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ML Lab 1** *2311201205*

---



This cell imports necessary libraries: `scipy.io` for ARFF file handling, `pandas` for data manipulation, and `numpy` for numerical operations.

In [None]:
from scipy.io import arff
import pandas as pd
import numpy as np

**First, we load the `.dat` files using `scipy.io`**


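The `load_keel_dat` function loads `.dat` files in KEEL format. KEEL files look like ARFF but add `@inputs` and `@outputs` header lines that are not part of standard ARFF, so the function removes those lines before passing the text to `scipy.io.arff.loadarff`. It then takes the last column as the labels (y) and the remaining columns as the features (X), and decodes the labels from bytes to strings. The features are cast to `float`, so the function assumes every feature column is numeric.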

In [None]:
from io import StringIO

def load_keel_dat(filepath):
    # Read the raw KEEL file
    with open(filepath, "r") as f:
        lines = f.readlines()

    # Drop the KEEL-only @input(s)/@output(s) header lines
    cleaned_lines = []
    for line in lines:
        l = line.strip().lower()
        if l.startswith(("@input", "@output")):
            continue
        cleaned_lines.append(line)

    # Parse the remaining text as ARFF
    data, meta = arff.loadarff(StringIO("".join(cleaned_lines)))
    df = pd.DataFrame(data)

    # Last column is the class label; everything before it is a feature
    X = df.iloc[:, :-1].values
    y = df.iloc[:, -1].values

    # Nominal values come back as bytes, so decode them to str
    y = np.array([label.decode("utf-8") for label in y])

    return X.astype(float), y
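
To make the cleaning step concrete, here is a minimal, standard-library-only sketch of the same filtering applied to a made-up KEEL-style header. The sample text and the helper name `strip_keel_directives` are illustrative assumptions and are not part of the lab files.

```python
# Illustrative KEEL-style text (made up for this sketch, not a real lab file)
SAMPLE = """@relation toy
@attribute width real [0.0, 10.0]
@attribute class {a, b}
@inputs width
@outputs class
@data
1.5, a
7.2, b
"""

def strip_keel_directives(text):
    """Drop @input(s)/@output(s) lines, keeping everything else unchanged."""
    kept = []
    for line in text.splitlines(keepends=True):
        if line.strip().lower().startswith(("@input", "@output")):
            continue
        kept.append(line)
    return "".join(kept)

cleaned = strip_keel_directives(SAMPLE)
print(cleaned)
```

Only the two KEEL directive lines disappear; the `@relation`, `@attribute`, and `@data` sections pass through untouched, which is what the ARFF parser expects.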


**Testing the loader function on a small Iris dataset**

This cell tests the `load_keel_dat` function by loading the `iris-5-1tra.dat` dataset and printing the shape of the features (X) and labels (y), as well as the first sample and its corresponding label.

In [None]:
X, y = load_keel_dat("/content/iris-5-1tra.dat")

print("X shape:", X.shape)
print("y shape:", y.shape)
print("First sample:", X[0])
print("First label:", y[0])


X shape: (120, 4)
y shape: (120,)
First sample: [5.1 3.5 1.4 0.2]
First label: Iris-setosa


**We need ML libraries such as scikit-learn (`sklearn`) for this lab**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score


The cell above imports the pieces needed from `sklearn`: `KNeighborsClassifier` for the KNN model, `LabelEncoder` for converting categorical labels to integers, and `accuracy_score` for evaluating model performance. The next cell sets `BASE_PATH`, the directory that holds the dataset files.
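
As a rough illustration of what label encoding does, the sketch below mimics the mapping in plain Python: distinct labels are sorted and numbered from 0. The helper `encode_labels` is hypothetical and only approximates `LabelEncoder`; it is not the library's implementation.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, with classes in sorted order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes

y_raw = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]
y_encoded, classes = encode_labels(y_raw)
print(classes)    # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
print(y_encoded)  # [2, 0, 1, 0]
```

Fitting the encoder on the training labels and only transforming the test labels keeps the same label-to-integer mapping on both splits.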

In [None]:
# Directory holding the KEEL fold files; no trailing slash, since the paths below add "/"
BASE_PATH = "/content"


This cell demonstrates a complete workflow for evaluating a K-Nearest Neighbors (KNN) model on the Iris dataset, specifically for the first fold. It loads training and testing data, encodes the labels, trains a KNN model with `k=3`, makes predictions, and calculates the accuracy.

In [None]:
# Load fold 1 of IRIS
X_train, y_train = load_keel_dat(f"{BASE_PATH}/iris-5-1tra.dat")
X_test, y_test = load_keel_dat(f"{BASE_PATH}/iris-5-1tst.dat")

# Encode labels
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

# Train KNN (k = 3)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Predict & evaluate
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Fold 1 Accuracy (k=3):", acc)


Fold 1 Accuracy (k=3): 1.0
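
The fit, predict, and score steps used above can be illustrated without scikit-learn. The following is a minimal sketch of the k-nearest-neighbours idea (Euclidean distance, majority vote) on toy data invented for this example. The helpers `knn_predict` and `accuracy` are hypothetical, and details such as tie-breaking may differ from `KNeighborsClassifier`.

```python
from collections import Counter
from math import dist

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to x and keep the closest k
    nearest = sorted(range(len(X_train)), key=lambda i: dist(X_train[i], x))[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy 2-D data invented for this sketch
X_train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [(1.1, 1.0), (5.1, 5.0), (3.2, 3.2)]
y_test = [0, 1, 1]

y_pred = [knn_predict(X_train, y_train, x, k=3) for x in X_test]
print(y_pred, accuracy(y_test, y_pred))
```

KNN has no real training phase: "fitting" only stores the training points, and all the distance work happens at prediction time.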


The `evaluate_knn_5fold` function runs 5-fold cross-validation for a given dataset and a given `k`. For each of the five pre-split folds it loads the training and test files, encodes the labels, trains a KNN classifier, predicts on the test split, and prints that fold's accuracy. After the loop it prints the mean accuracy across the folds and returns it.

In [None]:
def evaluate_knn_5fold(dataset_name, k):
    accuracies = []

    for fold in range(1, 6):
        X_train, y_train = load_keel_dat(
            f"{BASE_PATH}/{dataset_name}-5-{fold}tra.dat"
        )
        X_test, y_test = load_keel_dat(
            f"{BASE_PATH}/{dataset_name}-5-{fold}tst.dat"
        )

        le = LabelEncoder()
        y_train = le.fit_transform(y_train)
        y_test = le.transform(y_test)

        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X_train, y_train)

        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        accuracies.append(acc)

        print(f"Fold {fold} Accuracy (k={k}): {acc}")

    mean_acc = np.mean(accuracies)
    print(f"Mean Accuracy (k={k}): {mean_acc}")

    return mean_acc
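
The fold files follow the naming pattern `<dataset>-5-<fold>tra.dat` and `<dataset>-5-<fold>tst.dat`. As a small sketch, the paths could also be built with `pathlib`, which removes the need to track whether the base directory ends in a slash. The helper `fold_paths` is a hypothetical name introduced for illustration.

```python
from pathlib import PurePosixPath

def fold_paths(base_path, dataset_name, fold):
    """Return (train_path, test_path) for one fold of a 5-fold KEEL split."""
    base = PurePosixPath(base_path)
    train = base / f"{dataset_name}-5-{fold}tra.dat"
    test = base / f"{dataset_name}-5-{fold}tst.dat"
    return str(train), str(test)

for fold in range(1, 6):
    print(fold_paths("/content/", "iris", fold))
```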


This cell calls the `evaluate_knn_5fold` function to perform 5-fold cross-validation on the 'iris' dataset with `k=3`, demonstrating the function's usage and printing the results.

In [None]:
evaluate_knn_5fold("iris", k=3)


Fold 1 Accuracy (k=3): 1.0
Fold 2 Accuracy (k=3): 0.9666666666666667
Fold 3 Accuracy (k=3): 0.9666666666666667
Fold 4 Accuracy (k=3): 0.9333333333333333
Fold 5 Accuracy (k=3): 0.9
Mean Accuracy (k=3): 0.9533333333333335


np.float64(0.9533333333333335)
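
The mean alone hides how much the folds disagree. As a quick check, the sketch below recomputes the mean from the per-fold accuracies printed above and adds the population standard deviation (the same quantity `np.std` returns by default).

```python
from statistics import mean, pstdev

# Per-fold accuracies for k=3, copied from the output above
fold_acc = [1.0, 0.9666666666666667, 0.9666666666666667, 0.9333333333333333, 0.9]

print(f"mean = {mean(fold_acc):.4f}")
print(f"std  = {pstdev(fold_acc):.4f}")
```

Every accuracy here is a multiple of 1/30, which is consistent with 30 test samples per fold, so a single misclassified sample moves a fold's accuracy by about 0.033.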

The `tune_knn` function automates hyperparameter tuning for the KNN model. It evaluates the `k` values 1, 3, 5, 7, and 9 with 5-fold cross-validation on the specified dataset, then prints and returns the best `k` together with its mean accuracy. Because the comparison is a strict `>`, a later `k` replaces the current best only if its mean accuracy is strictly higher, so exact ties go to the smaller `k`.

In [None]:
def tune_knn(dataset_name):
    k_values = [1, 3, 5, 7, 9]

    best_k = None
    best_mean_acc = -1

    print(f"\n===== Hyperparameter tuning for {dataset_name.upper()} =====")

    for k in k_values:
        print(f"\nEvaluating k = {k}")
        mean_acc = evaluate_knn_5fold(dataset_name, k)

        if mean_acc > best_mean_acc:
            best_mean_acc = mean_acc
            best_k = k

    print("\n===== RESULT =====")
    print(f"Best k: {best_k}")
    print(f"Best Mean Accuracy: {best_mean_acc}")

    return best_k, best_mean_acc
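
One subtlety of comparing mean accuracies with `>` is floating-point rounding: in the Iris tuning output in this notebook, the means for k=1, 3, and 5 print as `0.9533333333333335` while k=7 prints as `0.9533333333333334`, even though all four are the same fraction in exact arithmetic. The sketch below shows one possible way to treat such values as ties. The helper `pick_best_k` and its `rel_tol` parameter are assumptions made for illustration, not part of the lab code.

```python
import math

def pick_best_k(mean_acc_by_k, rel_tol=1e-9):
    """Return (best_k, best_acc), treating near-equal accuracies as ties.

    On a tie the smallest k wins, because candidates are visited in
    ascending order and only a clearly better score replaces the best.
    """
    best_k, best_acc = None, -1.0
    for k in sorted(mean_acc_by_k):
        acc = mean_acc_by_k[k]
        clearly_better = acc > best_acc and not math.isclose(acc, best_acc, rel_tol=rel_tol)
        if clearly_better:
            best_k, best_acc = k, acc
    return best_k, best_acc

# Mean accuracies for Iris copied from the tuning output (the k=9 line is cut off there)
iris_means = {1: 0.9533333333333335, 3: 0.9533333333333335,
              5: 0.9533333333333335, 7: 0.9533333333333334}
print(pick_best_k(iris_means))
```

Preferring the smallest `k` on ties is only one convention; a larger `k` gives a smoother decision boundary, so the opposite choice is also defensible.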


This cell executes the `tune_knn` function for the 'iris' dataset. It will print the accuracy for each `k` value tested and then output the best `k` and its corresponding mean accuracy for Iris.

In [None]:
best_k_iris, best_acc_iris = tune_knn("iris")



===== Hyperparameter tuning for IRIS =====

Evaluating k = 1
Fold 1 Accuracy (k=1): 1.0
Fold 2 Accuracy (k=1): 0.9666666666666667
Fold 3 Accuracy (k=1): 0.9333333333333333
Fold 4 Accuracy (k=1): 0.9666666666666667
Fold 5 Accuracy (k=1): 0.9
Mean Accuracy (k=1): 0.9533333333333335

Evaluating k = 3
Fold 1 Accuracy (k=3): 1.0
Fold 2 Accuracy (k=3): 0.9666666666666667
Fold 3 Accuracy (k=3): 0.9666666666666667
Fold 4 Accuracy (k=3): 0.9333333333333333
Fold 5 Accuracy (k=3): 0.9
Mean Accuracy (k=3): 0.9533333333333335

Evaluating k = 5
Fold 1 Accuracy (k=5): 1.0
Fold 2 Accuracy (k=5): 1.0
Fold 3 Accuracy (k=5): 0.9333333333333333
Fold 4 Accuracy (k=5): 0.9333333333333333
Fold 5 Accuracy (k=5): 0.9
Mean Accuracy (k=5): 0.9533333333333335

Evaluating k = 7
Fold 1 Accuracy (k=7): 1.0
Fold 2 Accuracy (k=7): 1.0
Fold 3 Accuracy (k=7): 0.9
Fold 4 Accuracy (k=7): 0.9333333333333333
Fold 5 Accuracy (k=7): 0.9333333333333333
Mean Accuracy (k=7): 0.9533333333333334

Evaluating k = 9
Fold 1 Accuracy 

This cell defines a list of datasets (`iris`, `haberman`, `ecoli`, `satimage`, `wisconsin`) and then iterates through them. For each dataset, it calls the `tune_knn` function to find the best `k` value and its mean accuracy, storing these results in a dictionary.

In [None]:
datasets = ["iris", "haberman", "ecoli", "satimage", "wisconsin"]

results = {}

for ds in datasets:
    best_k, best_acc = tune_knn(ds)
    results[ds] = (best_k, best_acc)



===== Hyperparameter tuning for IRIS =====

Evaluating k = 1
Fold 1 Accuracy (k=1): 1.0
Fold 2 Accuracy (k=1): 0.9666666666666667
Fold 3 Accuracy (k=1): 0.9333333333333333
Fold 4 Accuracy (k=1): 0.9666666666666667
Fold 5 Accuracy (k=1): 0.9
Mean Accuracy (k=1): 0.9533333333333335

Evaluating k = 3
Fold 1 Accuracy (k=3): 1.0
Fold 2 Accuracy (k=3): 0.9666666666666667
Fold 3 Accuracy (k=3): 0.9666666666666667
Fold 4 Accuracy (k=3): 0.9333333333333333
Fold 5 Accuracy (k=3): 0.9
Mean Accuracy (k=3): 0.9533333333333335

Evaluating k = 5
Fold 1 Accuracy (k=5): 1.0
Fold 2 Accuracy (k=5): 1.0
Fold 3 Accuracy (k=5): 0.9333333333333333
Fold 4 Accuracy (k=5): 0.9333333333333333
Fold 5 Accuracy (k=5): 0.9
Mean Accuracy (k=5): 0.9533333333333335

Evaluating k = 7
Fold 1 Accuracy (k=7): 1.0
Fold 2 Accuracy (k=7): 1.0
Fold 3 Accuracy (k=7): 0.9
Fold 4 Accuracy (k=7): 0.9333333333333333
Fold 5 Accuracy (k=7): 0.9333333333333333
Mean Accuracy (k=7): 0.9533333333333334

Evaluating k = 9
Fold 1 Accuracy 

This cell prints a final summary of the hyperparameter tuning results: one aligned line per dataset, showing the best `k` and the corresponding mean accuracy rounded to four decimal places.

In [None]:
print("\n===== FINAL SUMMARY =====")
for ds, (k, acc) in results.items():
    print(f"{ds.upper():10s} -> Best k = {k}, Mean Accuracy = {acc:.4f}")



===== FINAL SUMMARY =====
IRIS       -> Best k = 9, Mean Accuracy = 0.9733
HABERMAN   -> Best k = 9, Mean Accuracy = 0.7320
ECOLI      -> Best k = 7, Mean Accuracy = 0.8246
SATIMAGE   -> Best k = 3, Mean Accuracy = 0.9080
WISCONSIN  -> Best k = 5, Mean Accuracy = 0.9751
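
To compare datasets at a glance, the summary can be ranked by mean accuracy. This is a small sketch that re-enters the values from the summary above by hand, so the accuracies are the rounded four-decimal figures rather than the full-precision ones held in the notebook's `results` dictionary.

```python
# (best_k, mean_accuracy) per dataset, copied from the final summary above;
# the accuracies are therefore already rounded to four decimals
results = {
    "iris": (9, 0.9733),
    "haberman": (9, 0.7320),
    "ecoli": (7, 0.8246),
    "satimage": (3, 0.9080),
    "wisconsin": (5, 0.9751),
}

# Rank datasets from highest to lowest mean accuracy
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)

lines = [f"{ds.upper():10s} k = {k}  acc = {acc:.4f}" for ds, (k, acc) in ranked]
print("\n".join(lines))
```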
