## Prepare Python environment


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

%matplotlib inline

In [None]:
random_state = 5  # use this to control randomness across runs, e.g., dataset partitioning

## Preparing the Glass Dataset (2 points)

We will use the Glass Identification dataset from the UCI Machine Learning Repository. Details about the data can be found [here](https://archive.ics.uci.edu/dataset/42/glass+identification). The objective is to identify the class of glass from the following features (the last entry in the list is the target label):

1. RI: refractive index
2. Na: Sodium
3. Mg: Magnesium
4. Al: Aluminum
5. Si: Silicon
6. K: Potassium
7. Ca: Calcium
8. Ba: Barium
9. Fe: Iron
10. Type of glass (Target label)

The classes of glass are:

1. building_windows_float_processed
2. building_windows_non_float_processed
3. vehicle_windows_float_processed
4. containers
5. tableware
6. headlamps

Identifying the type of glass from its composition can support forensic analysis.


### Load the dataset

In [None]:
# Download and load the dataset
import os
if not os.path.exists('glass.csv'):
    !wget https://raw.githubusercontent.com/JHA-Lab/ece364_2025/master/data/glass.csv
df = pd.read_csv('glass.csv')
# Display the first five instances in the dataset
df.head(5)

In [None]:
# Additional features to be added to the data
df['Ca_Na'] = df.Ca*df.Na
df['Al_Mg'] = df.Al*df.Mg
df['Ca_Mg'] = df.Ca*df.Mg
df['Ca_RI'] = df.Ca*df.RI

# get column names
col_names = df.columns

### Extract target and descriptive features (0.5 points)


#### Separate the target and features from the data.



In [None]:
# Store all the features from the data in X
X = # insert your code here
# Store all the target labels in y
y = # insert your code here

In [None]:
# Convert data to numpy arrays
X = # insert your code here
y = # insert your code here

### Create training and validation datasets (0.5 points)


Split the data into training and validation sets using `train_test_split`. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for details. To get consistent results across runs, set `random_state` to the value defined earlier. Use 80% of the data for training and 20% for validation.
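As a rough illustration of how `train_test_split` behaves, here is a minimal sketch on toy data. The arrays `X_toy` and `y_toy` are hypothetical and unrelated to the glass dataset; the sketch only shows that `test_size=0.2` holds out 20% of the rows and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, 2 features, binary labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=5
)

# Repeating the call with the same random_state yields the same partition
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=5
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```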

In [None]:
X_train, X_val, y_train, y_val = # insert your code here

### Preprocess the dataset (1 point)


Preprocess the data by standardizing each feature to have zero mean and unit standard deviation. This can be done with the `StandardScaler` class. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for more details.
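The usual pattern is to learn the mean and standard deviation from the training rows only and then reuse those statistics for held-out rows. The sketch below shows this on synthetic data; the names `X_tr`, `X_te`, and `scaler_toy` are illustrative assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): two features with mean 10 and std 3
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=10.0, scale=3.0, size=(50, 2))
X_te = rng.normal(loc=10.0, scale=3.0, size=(10, 2))

scaler_toy = StandardScaler()
X_tr_std = scaler_toy.fit_transform(X_tr)  # learn mean/std from training rows
X_te_std = scaler_toy.transform(X_te)      # reuse the training statistics

# Training columns now have (approximately) zero mean and unit std
print(X_tr_std.mean(axis=0), X_tr_std.std(axis=0))

# Held-out columns are shifted and scaled by the *training* statistics,
# so their mean and std are close to, but not exactly, 0 and 1
print(X_te_std.mean(axis=0), X_te_std.std(axis=0))
```

Fitting the scaler on the training split alone keeps information about the validation rows out of the preprocessing step.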



In [None]:
# Define the scaler for scaling the data
scaler = # insert your code here

# Normalize the training data
X_train = # insert your code here

# Standardize the validation data with the scaler defined above, so that it receives the same transformation as the training data.
X_val = # insert your code here

## Training K-nearest neighbor models (18 points)

We will use the `sklearn` library to train a k-nearest neighbors (kNN) classifier. Review Ch. 5 and see [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) for more details.


### Exercise 1: Learning a kNN classifier (14 points)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

#### Exercise 1a: Evaluate the effect of the number of neighbors (4 points)

- Train kNN classifiers with the number of neighbors taken from {1, 5, 25, 100, `len(X_train)`}.

- Keep all other parameters at their default values.  

- Report the model's accuracy on the training and validation sets.
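The loop structure for such a sweep might look like the sketch below. It runs on a synthetic three-class problem from `make_classification` rather than the glass data, and the names `X_toy`, `y_toy`, and `results` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data (hypothetical stand-in for a real dataset)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

results = {}
for k in [1, 5, 25, len(X_tr)]:
    knn = KNeighborsClassifier(n_neighbors=k)  # other parameters at defaults
    knn.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, knn.predict(X_tr))
    test_acc = accuracy_score(y_te, knn.predict(X_te))
    results[k] = (train_acc, test_acc)
    print(f"k={k:>3}  train={train_acc:.3f}  held-out={test_acc:.3f}")
```

Two boundary cases are worth noting. With `k=1` and distinct training points, each training point is its own nearest neighbor. With `k=len(X_tr)`, every query sees the whole training set, so the uniformly weighted vote returns the same class for every input.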


In [None]:
# insert your code here

#### Explain the effect of increasing the number of neighbors on the performance over the training and validation sets.

`Insert your answer.`

#### Exercise 1b: Evaluate the effect of a weighted kNN (5 points)


- Train kNN classifiers with distance weighting, and vary the number of neighbors among {1, 5, 25, 100, `len(X_train)`}.


- Keep all other parameters at their default values.  


- Report the model's accuracy on the training and validation sets.
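In `KNeighborsClassifier`, distance weighting is selected with `weights="distance"`, which weights each neighbor's vote by the inverse of its distance to the query. A minimal sketch on synthetic data (hypothetical, not the glass dataset) compares the two weighting schemes side by side:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data (hypothetical stand-in for a real dataset)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

train_acc = {"uniform": {}, "distance": {}}
for weights in ["uniform", "distance"]:
    for k in [1, 5, 25, len(X_tr)]:
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        knn.fit(X_tr, y_tr)
        train_acc[weights][k] = accuracy_score(y_tr, knn.predict(X_tr))
        held_out = accuracy_score(y_te, knn.predict(X_te))
        print(f"{weights:>8}  k={k:>3}  "
              f"train={train_acc[weights][k]:.3f}  held-out={held_out:.3f}")
```

When a training point is used as a query, its own copy in the training set sits at (essentially) zero distance and dominates the inverse-distance vote, so training accuracy alone says little about a distance-weighted model when the training points are distinct.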


In [None]:
# insert your code here

#### Compare the effect of the number of neighbors on model performance (train and validation) under the distance-weighted kNN against the uniformly weighted kNN. Explain any differences observed.


`Insert your answer.`

#### Exercise 1c: Evaluate the effect of the power parameter in the Minkowski distance metric (5 points)


- Train kNN classifiers with different distance functions by varying the power parameter for the Minkowski distance among {1, 2, 10, 100}.


- Fix the number of neighbors at 25, and use the uniformly weighted kNN. Keep all other parameters at their default values.

- Report the model's accuracy over the validation set.
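For reference, the Minkowski distance of order $p$ between vectors $a$ and $b$ is $\left(\sum_i |a_i - b_i|^p\right)^{1/p}$; in `KNeighborsClassifier` the order is set through the `p` parameter. The small NumPy sketch below (the helper `minkowski` and the two points are illustrative assumptions) shows how the distance changes with $p$: $p=1$ gives the Manhattan distance, $p=2$ the Euclidean distance, and large $p$ approaches the largest single coordinate difference (Chebyshev distance).

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two 1-D arrays."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

# Two hypothetical points whose coordinate differences are 3 and 4
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

for p in [1, 2, 10, 100]:
    print(f"p={p:>3}  distance={minkowski(a, b, p):.4f}")

# Limit as p grows: the largest absolute coordinate difference
chebyshev = np.max(np.abs(a - b))
print("Chebyshev:", chebyshev)
```

As $p$ increases, the feature with the largest difference increasingly determines which points count as neighbors.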


In [None]:
# insert your code here

#### Explain any effect observed on the model performance upon increasing the power parameter.


`Insert your answer.`

### Exercise 2: Feature Importance Analysis (4 points)

In this exercise, you will implement a function that calculates feature importance for a kNN classifier.

In [None]:
def knn_feature_importance(X, y, n_neighbors):
    # Split the data into training and validation sets
    X_train, X_val, y_train, y_val = # insert your code here

    # Initialize KNN classifier and feature importance array
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    feature_importance = np.zeros(X.shape[1])

    # Calculate baseline accuracy
    knn.fit(X_train, y_train)
    baseline_accuracy = accuracy_score(y_val, knn.predict(X_val))

    # Calculate feature importance
    for i in range(X.shape[1]):
        X_train_reduced = np.delete(X_train, i, axis=1)

        # insert your code here

    # Normalization
    return feature_importance / np.sum(feature_importance)


Next, use your function to calculate the feature importance for $n_{neighbors}=5$, then plot or print the values to make them easier to compare.
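The final line of the function divides the raw scores by their sum. The arithmetic sketch below illustrates that step under an assumption the text does not state: that each raw score is the baseline accuracy minus the accuracy obtained without feature $i$. The numbers in `drops` are hypothetical and are not results from the glass dataset.

```python
import numpy as np

# Hypothetical raw scores: baseline accuracy minus accuracy without feature i.
# A negative entry means accuracy went *up* when that feature was removed.
drops = np.array([0.10, -0.02, 0.05, 0.07])

# Same normalization as the last line of knn_feature_importance
normalized = drops / np.sum(drops)

print(normalized)          # entries keep their sign and sum to 1
print(np.sum(normalized))
```

Normalizing by the sum preserves the sign of each score, so negative raw scores stay negative. The division is undefined when the raw scores sum to zero.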

In [None]:
importance = knn_feature_importance(X, y, 5)

plt.figure(figsize=(10, 6))
# insert your code here
plt.show()

#### Discuss the following questions:
1. Identify the two most important features for the kNN classifier. Explain why these features might be particularly important for predicting the glass type.
2. How do you interpret any negative values in the result?
3. How might this information be useful for forensic scientists?

`Insert your answer.`