# 📌 Machine Learning Assignment 1 - Instructions & Guidelines

### **📝 General Guidelines**
Welcome to Machine Learning Assignment 1! This assignment will test your understanding of **regression and classification models**, including **data preprocessing, hyperparameter tuning, and model evaluation**.

Follow the instructions carefully, and ensure your implementation is **correct, well-structured, and efficient**.

🔹 **Submission Format:**  
- Your submission **must be a single Jupyter Notebook (.ipynb)** file.  
- **File Naming Convention:**  
  - Use **your university email as the filename**, e.g.,  
    ```
    j.doe@innopolis.university.ipynb
    ```
  - **Do NOT modify this format**, or your submission may not be graded.

🔹 **Assignment Breakdown:**
| Task | Description | Points |
|------|------------|--------|
| **Task 1.1** | Linear Regression | 20 |
| **Task 1.2** | Polynomial Regression | 20 |
| **Task 2.1** | Data Preprocessing | 15 |
| **Task 2.2** | Model Comparison | 45 |
| **Total** | - | **100** |

---

### **📂 Dataset & Assumptions**
The dataset files are stored in the `datasets/` folder.  
- **Regression Dataset:** `datasets/task1_data.csv`
- **Classification Dataset:** `datasets/pokemon_modified.csv`

Each dataset is structured as follows:

🔹 **`task1_data.csv` (for regression tasks)**  
- Contains `X_train`, `y_train`, `X_test`, and `y_test`.  
- The goal is to fit **linear and polynomial regression models** and evaluate their performance.  

🔹 **`pokemon_modified.csv` (for classification tasks)**  
- Contains Pokémon attributes, with `is_legendary` as the **binary target variable (0 or 1)**.  
- Some features contain **missing values** and **categorical variables**, requiring preprocessing.

---

### **🚀 How to Approach the Assignment**
1. **Start with Regression (Task 1)**
   - Implement **linear regression** and **polynomial regression**.
   - Use **GridSearchCV** for polynomial regression to find the best degree.
   - Evaluate using **MSE, RMSE, MAE, and R² Score**.

2. **Move to Data Preprocessing (Task 2.1)**
   - Load and clean the Pokémon dataset.
   - Handle **missing values** correctly.
   - Encode categorical variables properly.
   - Ensure **no data leakage** when doing the preprocessing.

3. **Train and Evaluate Classification Models (Task 2.2)**
   - Train **Logistic Regression, KNN, and Naive Bayes**.
   - Use **GridSearchCV** for hyperparameter tuning.
   - Evaluate models using **Accuracy, Precision, Recall, and F1-score**.

---

### **📌 Grading & Evaluation**
- Your notebook will be **autograded**, so ensure:
  - Your function names **exactly match** the given specifications.
  - Your output format matches the expected results.
- Partial credit will be given where applicable.

🔹 **Need Help?**  
- If you have any questions, refer to the **assignment markdown instructions** in each task before asking for clarifications.
- You can post your question on this [Google sheet](https://docs.google.com/spreadsheets/d/1oyrqXDjT2CeGYx12aZhZ-oDKcQQ-PCgT91wHPhTlBCY/edit?usp=sharing).

🚀 **Good luck! Happy coding!** 🎯

### FAQ

**1) Should we include the lines to import the libraries?**

- **Answer:**  
  It doesn't matter if you include extra import lines, as the grader will only call the specified functions.

**2) Is it okay to submit my file with code outside of the functions?**

- **Answer:**  
  Yes, you can include additional code outside of the functions as long as the entire script runs correctly when converted to a `.py` file.

**Important Clarification:**

- The grader will first convert the Jupyter Notebook (.ipynb) into a Python file (.py) and then run it.
- **Note:** Do not include shell commands such as `!pip install numpy`. They may break the conversion, in which case the submission will not be graded.
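
To illustrate why shell commands are a problem, here is a minimal sketch of a notebook-to-script conversion. It is a simplified stand-in built on the standard library only, not the grader's actual code; the helper names `notebook_to_script` and `find_shell_lines` are hypothetical. It relies only on the fact that an `.ipynb` file is JSON with a list of cells.

```python
import json
from typing import List


def notebook_to_script(notebook_json: str) -> str:
    """Concatenate the code cells of a notebook into one Python source string."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown cells carry no executable code
        source = cell.get("source", "")
        # The notebook format stores a cell's source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source)
    return "\n\n".join(chunks)


def find_shell_lines(script: str) -> List[str]:
    """Heuristic: lines starting with '!' or '%' are IPython-only syntax."""
    return [line for line in script.splitlines()
            if line.lstrip().startswith(("!", "%"))]


# A made-up three-cell notebook used only for illustration.
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["!pip install numpy\n", "import math"]},
        {"cell_type": "code", "source": "def f():\n    return 1"},
    ]
})

script = notebook_to_script(example)
print(find_shell_lines(script))
```

A line such as `!pip install numpy` is not valid Python once it leaves the notebook environment, so running a check like this before submitting may help catch it.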

## Task 1: Linear and Polynomial Regression (30 Points)

### Task 1.1 - Linear Regression (15 Points)
#### **Instructions**
1. Load the dataset from **`datasets/task1_data.csv`**.
2. Extract training and testing data from the following columns:
   - `"X_train"`: Training feature values.
   - `"y_train"`: Training target values.
   - `"X_test"`: Testing feature values.
   - `"y_test"`: Testing target values.
3. Train a **linear regression model** on `X_train` and `y_train`.
4. Use the trained model to predict `y_test` values.
5. Compute and return the following **evaluation metrics** as a dictionary:
   - **Mean Squared Error (MSE)**
   - **Root Mean Squared Error (RMSE)**
   - **Mean Absolute Error (MAE)**
   - **R² Score**
6. The function signature should match:
   ```python
   def task1_linear_regression() -> Dict[str, float]:
   ```

Please do not use any other libraries except for the ones imported below.

In [26]:
# Standard Library Imports
import os
import importlib.util
import nbformat
from tempfile import NamedTemporaryFile
from typing import Tuple, Dict

# Third-Party Library Imports
import numpy as np
import pandas as pd

from nbconvert import PythonExporter

# Scikit-Learn Imports
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             mean_squared_error, mean_absolute_error, r2_score)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

In [27]:
def load_dataset(name: str) -> pd.DataFrame:
    return pd.read_csv(f'datasets/{name}.csv', sep=',')

In [28]:
df = pd.read_csv('datasets/task1_data.csv', sep=',')
# df.head()

In [29]:
# print(df['X_test'].std())
# print(df['y_test'].std())

In [30]:
x_train, y_train, x_test, y_test = df['X_train'].values.reshape(-1, 1), df['y_train'], df['X_test'].values.reshape(-1, 1), df['y_test']

linear_model = LinearRegression()
linear_model.fit(x_train, y_train)

y_pred = linear_model.predict(x_test)

# y_pred

In [31]:
# import matplotlib.pyplot as plt
# %matplotlib inline

# plt.scatter(x_train, y_train, color='blue', label='Training Data')
# plt.scatter(x_test, y_test, color='red', label = 'Testing Data')
# plt.plot(x_test, y_pred, color='green', label='Linear Regression')
# plt.xlabel('X')
# plt.ylabel('y')
# plt.legend()

# plt.show()

In [32]:
# metrics = {
#     "MSE": mean_squared_error(y_test, y_pred),
#     "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
#     "MAE": mean_absolute_error(y_test, y_pred),
#     "R2": r2_score(y_test, y_pred)
# }

# metrics

In [33]:
def task1_linear_regression() -> Dict[str, float]:
    """
    Performs linear regression on a predefined dataset and returns performance metrics.

    **Dataset Assumption:**
    - The dataset is located at `"datasets/task1_data.csv"`.
    - It should contain the following columns:
      - `"X_train"`: Training feature values (numerical).
      - `"y_train"`: Training target values.
      - `"X_test"`: Testing feature values (numerical).
      - `"y_test"`: Testing target values.

    **Process:**
    1. Load the dataset from `"datasets/task1_data.csv"`.
    2. Extract training and testing data.
    3. Train a linear regression model on `X_train, y_train`.
    4. Use the trained model to predict `y_test` values.
    5. Compute evaluation metrics: **MSE, RMSE, MAE, R² Score**.

    **Output (Dictionary with Regression Metrics):**
    ```python
    {
        "MSE": <Mean Squared Error>,
        "RMSE": <Root Mean Squared Error>,
        "MAE": <Mean Absolute Error>,
        "R2": <R² Score>
    }
    ```
    """
    df = load_dataset('task1_data')

    x_train, y_train, x_test, y_test = df['X_train'].values.reshape(-1, 1), df['y_train'], df['X_test'].values.reshape(-1, 1), df['y_test']

    linear_model = LinearRegression()
    linear_model.fit(x_train, y_train)

    y_pred = linear_model.predict(x_test)

    # Compute the MSE once and reuse it for the RMSE.
    mse = mean_squared_error(y_test, y_pred)

    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": mean_absolute_error(y_test, y_pred),
        "R2": r2_score(y_test, y_pred)
    }

In [34]:
# task1_linear_regression()
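
As a sanity check on the four metrics returned by `task1_linear_regression`, their definitions can be written out directly. The sketch below uses plain Python and made-up numbers; the helper name `regression_metrics` is hypothetical, and it assumes `y_true` is not constant (otherwise the R² denominator is zero).

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MSE, RMSE, MAE and R2 from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "R2": 1.0 - ss_res / ss_tot,
    }


# Illustrative values only: three perfect predictions and one that is off by 2.
result = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(result)
```

For these values the MSE and RMSE are both 1.0, the MAE is 0.5, and R² is approximately 0.2.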

### Task 1.2 - Polynomial Regression (15 Points)

#### **Instructions**
1. Load the dataset from **`datasets/task1_data.csv`**.
2. Extract training and testing data from the following columns:
   - `"X_train"`: Training feature values.
   - `"y_train"`: Training target values.
   - `"X_test"`: Testing feature values.
   - `"y_test"`: Testing target values.
3. Define a **pipeline** that includes:
   - **Polynomial feature transformation** (degree range: **2 to 10**).
   - **Linear regression model**.
4. Use **GridSearchCV** with **8-fold cross-validation** to determine the best polynomial degree.
5. Train the model with the best polynomial degree and **evaluate it on the test set**.
6. Compute and return the following results as a dictionary:
   - **Best polynomial degree** (`best_degree`)
   - **Mean Squared Error (MSE)**

#### **Function Signature**
```python
def task1_polynomial_regression() -> Dict[str, float]:
```

In [35]:
def task1_polynomial_regression() -> Dict[str, float]:
    """
    Performs polynomial regression using GridSearchCV to find the best polynomial degree.


    **Process:**
    1. Load the dataset and extract `X_train, y_train, X_test, y_test`.
    2. Define a **pipeline** with polynomial feature transformation and linear regression.
    3. Use **GridSearchCV** (with 8-fold cross-validation) to determine the best polynomial degree (range: **2 to 10**).
    4. Train the best polynomial regression model and evaluate its performance.
    5. Compute and return:
       - **Best polynomial degree (`best_degree`)**
       - **Mean Squared Error (MSE)**

    **Expected Output:**
    ```
    {
        "best_degree": <Optimal Polynomial Degree>,
        "MSE": <Mean Squared Error>
    }
    ```
    """

    df = load_dataset('task1_data')

    x_train, y_train, x_test, y_test = df['X_train'].values.reshape(-1, 1), df['y_train'], df['X_test'].values.reshape(-1, 1), df['y_test']

    pipeline = Pipeline([('poly', PolynomialFeatures()), ('linear_model', LinearRegression())])
    
    param_grid = {'poly__degree': range(2, 11)}
    grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=8, scoring='neg_mean_squared_error')

    grid_search.fit(x_train, y_train)

    best_degree = grid_search.best_params_['poly__degree']
    best_model = grid_search.best_estimator_

    y_pred = best_model.predict(x_test)
    MSE = mean_squared_error(y_test, y_pred)

    return {
        'best_degree': best_degree,
        'MSE': MSE
    }

In [36]:
# task1_polynomial_regression()
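
To make the 8-fold cross-validation more concrete, the sketch below shows how the training rows could be partitioned into consecutive, unshuffled folds, which is how scikit-learn's `KFold` behaves by default. It is an illustration in plain Python, not the library's implementation; the helper name `kfold_indices` and the sample count of 20 are assumptions.

```python
from typing import List, Tuple


def kfold_indices(n_samples: int, n_splits: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, validation_indices) pairs for consecutive folds."""
    base, extra = divmod(n_samples, n_splits)
    folds = []
    start = 0
    for i in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + (1 if i < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, val))
        start += size
    return folds


folds = kfold_indices(n_samples=20, n_splits=8)
print([len(val) for _, val in folds])
```

Each candidate degree is fitted on the training part of every fold and scored on the held-out part; the degree with the best mean score is kept. With `scoring='neg_mean_squared_error'` the score is the negated MSE, so a higher score means a lower error.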

## Task 2: Classification with Data Preprocessing (70 Points)

### Task 2.1 - Data Preprocessing (30 Points)

#### **Instructions**
1. Load the dataset from **`datasets/pokemon_modified.csv`**.
2. Inspect the data and study the provided features.
3. Remove the **two redundant features**.
4. Handle **missing values**:
   - Use **mean imputation** for **"height_m"** and **"weight_kg"**.
   - Use **median imputation** for **"percentage_male"**.
5. Perform **one-hot encoding** for the categorical column **"type1"**.
6. Ensure the **target variable** (`"is_legendary"`) is present.
7. **Split the data into training and testing sets** (`80%-20%` split). Is it balanced?
8. **Apply feature scaling** using **StandardScaler** or **MinMaxScaler**.
9. Return the following:
   - `X_train_scaled`: Processed training features.
   - `X_test_scaled`: Processed testing features.
   - `y_train`: Training labels.
   - `y_test`: Testing labels.
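
One way to keep imputation free of data leakage is to compute the fill value from the training rows only and reuse it for the test rows. The sketch below shows the idea in plain Python with made-up values, using `None` for a missing entry; the helper names `fit_mean` and `fill_missing` are hypothetical. With scikit-learn the same pattern is `fit` on the training split followed by `transform` on both splits.

```python
from statistics import mean
from typing import List, Optional


def fit_mean(train_values: List[Optional[float]]) -> float:
    """Learn the fill value from the observed training entries only."""
    observed = [v for v in train_values if v is not None]
    return mean(observed)


def fill_missing(values: List[Optional[float]], fill_value: float) -> List[float]:
    """Replace missing entries with a previously learned fill value."""
    return [fill_value if v is None else v for v in values]


# Illustrative column values, not taken from the Pokémon dataset.
train_height = [1.0, None, 3.0]
test_height = [None, 10.0]

fill = fit_mean(train_height)            # learned from the training rows only
train_filled = fill_missing(train_height, fill)
test_filled = fill_missing(test_height, fill)
print(train_filled, test_filled)
```

The test value 10.0 never influences the fill value, so the missing test entry is filled with the training mean of 2.0.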

#### **Function Signature**
```python
def task2_preprocessing() -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
```

In [37]:
df = load_dataset('pokemon_modified')

# df.columns.get_loc('classification')

# df

In [38]:
def delete_redundant_features(dataset: pd.DataFrame, cols: list) -> pd.DataFrame:
    return dataset.drop(dataset.columns[cols], axis=1)

In [39]:
def handle_missing_values(dataset: pd.DataFrame) -> pd.DataFrame:
    imputer_height_weight = SimpleImputer(strategy='mean')
    dataset[['height_m', 'weight_kg']] = imputer_height_weight.fit_transform(dataset[['height_m', 'weight_kg']])


    imputer_percentage_male = SimpleImputer(strategy='median')
    dataset[['percentage_male']] = imputer_percentage_male.fit_transform(dataset[['percentage_male']])

    return dataset

In [40]:
def ohe_new_features(dataset: pd.DataFrame, feature_name: list) -> pd.DataFrame:
    # One-hot encode the listed columns and replace them with indicator columns.
    encoder = OneHotEncoder(sparse_output=False)
    encoded = encoder.fit_transform(dataset[feature_name])

    # Reuse the original index so the concat aligns rows even if the index is not 0..n-1.
    new_cols = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(feature_name), index=dataset.index)
    dataset = pd.concat([dataset, new_cols], axis=1)

    return dataset.drop(feature_name, axis=1)
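
For readers who want to see what one-hot encoding produces without the library, here is a minimal sketch in plain Python. The helper name `one_hot` and the sample values are assumptions; the `prefix_category` column naming and the sorted category order mirror what `OneHotEncoder.get_feature_names_out` yields for a single column.

```python
from typing import Dict, List


def one_hot(values: List[str], prefix: str) -> Dict[str, List[float]]:
    """Map a categorical column to one indicator column per category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1.0 if v == cat else 0.0 for v in values]
        for cat in categories
    }


# Illustrative values, not read from the dataset.
encoded = one_hot(["fire", "water", "fire", "grass"], prefix="type1")
for column, indicators in encoded.items():
    print(column, indicators)
```

Each row has exactly one 1.0 across the new columns, which is the property the encoded `type1` columns should have.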

In [41]:
def task2_preprocessing() -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
    """
    Preprocesses the Pokémon dataset by handling missing values, encoding categorical data,
    and applying feature scaling, then returns stratified train-test splits.

    **Dataset Assumption:**
    - The dataset is located at `"datasets/pokemon_modified.csv"`.

    **Process:**
    1. Load the dataset and remove redundant columns.
    2. Handle missing values:
       - Mean imputation for **"height_m"** and **"weight_kg"**.
       - Median imputation for **"percentage_male"**.
    3. Perform **one-hot encoding** on `"type1"`.
    4. Ensure **"is_legendary"** is present as the target variable.
    5. Split the dataset into **80% training, 20% testing** using **stratification** to preserve the class proportions in both splits.
    6. Apply feature scaling (**StandardScaler**).
    7. Return the preprocessed train-test splits.
    """

    df = load_dataset('pokemon_modified')
    
    columns = [df.columns.get_loc('classification'), df.columns.get_loc('name')]

    df = delete_redundant_features(df, columns)

    df = handle_missing_values(df)

    df = ohe_new_features(df, ['type1'])

    data_label = df['is_legendary']
    data_feature = df.drop(['is_legendary'], axis=1)

    x_train, x_test, y_train, y_test = train_test_split(data_feature, data_label, test_size=0.2, random_state=42, stratify=data_label)

    scaler = StandardScaler()

    scaler.fit(x_train)

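    # Keep the original row index so the scaled features stay aligned with y_train / y_test.
    x_train = pd.DataFrame(scaler.transform(x_train), columns=x_train.columns, index=x_train.index)
    x_test = pd.DataFrame(scaler.transform(x_test), columns=x_test.columns, index=x_test.index)

    # Return a tuple to match the declared return type.
    return x_train, x_test, y_train, y_test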
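
The scaling step above fits the scaler on the training features and applies the same statistics to the test features. A minimal sketch of that idea in plain Python follows; the helper names and numbers are made up, and it assumes the column is not constant (a zero standard deviation would divide by zero here). Like `StandardScaler`, it uses the population standard deviation.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_standardizer(train_column: List[float]) -> Tuple[float, float]:
    """Learn the mean and population standard deviation from training data."""
    return mean(train_column), pstdev(train_column)


def standardize(column: List[float], mu: float, sigma: float) -> List[float]:
    """Apply z = (x - mu) / sigma with previously learned statistics."""
    return [(x - mu) / sigma for x in column]


# Illustrative values only.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mu, sigma = fit_standardizer(train)        # statistics come from train only
scaled_train = standardize(train, mu, sigma)
scaled_test = standardize(test, mu, sigma)
print(scaled_train, scaled_test)
```

The scaled training column has mean 0 and standard deviation 1, while the test column generally does not, because it is transformed with the training statistics.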

### Task 2.2 - Model Comparison (40 Points)

#### **Instructions**
1. **Train three classification models** on the preprocessed dataset:
   - **Logistic Regression**
   - **K-Nearest Neighbors (KNN)**
   - **Gaussian Naive Bayes (GNB)**
2. Use **GridSearchCV** for **hyperparameter tuning** on:
   - **Logistic Regression**: Regularization strength (`C`) and penalty (`l1`, `l2`).
   - **KNN**: Number of neighbors (`n_neighbors`), weight function, and distance metric.
3. Train each model on the **training set** and evaluate on the **test set**.
4. Compute the following **evaluation metrics**:
   - **Accuracy**
   - **Precision**
   - **Recall**
   - **F1 Score**
5. Return a dictionary containing the evaluation metrics for each model.

#### **Function Signature**
```python
def task2_model_comparison() -> Dict[str, Dict[str, float]]:
```

In [42]:
# def optimize_threshold_gnb(x_train, x_test, y_train, y_test, thresholds):
#         gnb = GaussianNB()
#         gnb.fit(x_train, y_train)
#         best_precision = 0
#         best_recall = 0
#         best_f1 = 0
#         best_accuracy = 0

#         for threshold in thresholds:
#             y_proba = gnb.predict_proba(x_test)[:, 1]
#             y_pred_threshold = (y_proba >= threshold).astype(int)

#             precision = precision_score(y_test, y_pred_threshold)
#             recall = recall_score(y_test, y_pred_threshold)
#             f1 = f1_score(y_test, y_pred_threshold)
#             accuracy = accuracy_score(y_test, y_pred_threshold)

#             if precision > best_precision:
#                 best_precision = precision
#                 best_recall = recall
#                 best_f1 = f1
#                 best_accuracy = accuracy

#         return best_precision, best_recall, best_f1, best_accuracy

In [43]:
def task2_model_comparison() -> Dict[str, Dict[str, float]]:
    """
    Trains and evaluates three classification models using GridSearchCV for hyperparameter tuning.

    **Dataset Assumption:**
    - The preprocessed dataset is obtained from `task2_preprocessing()`, which returns:
      - `X_train`: Training features (scaled)
      - `X_test`: Testing features (scaled)
      - `y_train`: Training labels
      - `y_test`: Testing labels

    **Process:**
    1. Load the preprocessed dataset from `task2_preprocessing()`.
    2. Train the following models:
       - **Logistic Regression** (Hyperparameters: `C`, `penalty`, `solver`).
       - **K-Nearest Neighbors (KNN)** (Hyperparameters: `n_neighbors`, `weights`, `metric`).
       - **Gaussian Naive Bayes** (No hyperparameter tuning required).
    3. Evaluate the models using the following metrics:
       - **Accuracy**
       - **Precision**
       - **Recall**
       - **F1 Score**
    4. Return a dictionary with model names as keys and evaluation metrics as values.

    **Expected Output:**
    ```python
    {
        "Logistic Regression": {"accuracy": <float>, "precision": <float>, "recall": <float>, "f1_score": <float>},
        "KNN": {"accuracy": <float>, "precision": <float>, "recall": <float>, "f1_score": <float>},
        "Naive Bayes": {"accuracy": <float>, "precision": <float>, "recall": <float>, "f1_score": <float>}
    }
    ```
    """
    x_train, x_test, y_train, y_test = task2_preprocessing()

    models = {
        'Logistic Regression': {
            'estimator': LogisticRegression(),
            'params': {
                'C': [0.01, 0.1, 1, 10, 100, 1000],
                'penalty': ['l1', 'l2'],
                'solver': ['liblinear']
            }
        },
        'KNN': {
            'estimator': KNeighborsClassifier(),
            'params': {
                'n_neighbors': list(range(2,20)),
                'weights': ['distance'],
                'metric': ['euclidean', 'manhattan']
            }
        },
        'Naive Bayes': {
            'estimator': GaussianNB(),
            'params': {
                'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1]
            }
        }
    }

    results = {}

    for model_name, model_info in models.items():
        estimator = model_info['estimator']
        params = model_info['params']

        grid_search = GridSearchCV(estimator, params, cv=8, scoring='accuracy')
        grid_search.fit(x_train, y_train)

        best_estimator = grid_search.best_estimator_

        y_pred = best_estimator.predict(x_test)
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)

        # if model_name == 'Naive Bayes':
        #     thresholds = np.arange(0.1, 1, 0.05)
        #     best_precision, best_recall, best_f1, best_accuracy = optimize_threshold_gnb(x_train, x_test, y_train, y_test, thresholds)
        #     results[model_name] = {
        #         'accuracy': best_accuracy,
        #         'precision': best_precision,
        #         'recall': best_recall,
        #         'f1_score': best_f1
        #     }
        # else:
        results[model_name] = {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1
        }


    return results

In [None]:
# task2_model_comparison()

{'Logistic Regression': {'accuracy': 0.9875776397515528,
  'precision': 1.0,
  'recall': 0.8571428571428571,
  'f1_score': 0.9230769230769231},
 'KNN': {'accuracy': 0.9627329192546584,
  'precision': 0.9,
  'recall': 0.6428571428571429,
  'f1_score': 0.75},
 'Naive Bayes': {'accuracy': 0.9565217391304348,
  'precision': 0.8181818181818182,
  'recall': 0.6428571428571429,
  'f1_score': 0.72}}
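
The four metrics above all follow from the confusion-matrix counts. The sketch below writes the definitions out in plain Python; the helper name `classification_metrics` is hypothetical. The counts passed in are inferred from the reported Logistic Regression figures under the assumption of a 161-row test set with 14 positives; they are not read from the data.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Assumed counts: 12 true positives, 0 false positives,
# 2 false negatives, 147 true negatives.
logreg = classification_metrics(tp=12, fp=0, fn=2, tn=147)
print(logreg)
```

Under that assumption the model misses 2 of the 14 legendary Pokémon and raises no false alarms, which is why precision is 1.0 while recall is lower.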