
# Programming Assignment 3 - KNN Classifier

---
## Student Performance Dataset

You can download the dataset here: https://archive.ics.uci.edu/dataset/320/student+performance

This dataset describes student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and were collected using school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). We will be using the Mathematics portion for this assignment (`student-mat.csv`).

There are a total of 33 features in this dataset - for simplicity, we will only use the following numerical features:
1. Age - student's age (numeric: from 15 to 22)
2. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
3. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
4. Traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
5. Studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
6. Failures - number of past class failures (numeric: n if 1<=n<3, else 4)
7. Famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
8. Freetime - free time after school (numeric: from 1 - very low to 5 - very high)
9. Goout - going out with friends (numeric: from 1 - very low to 5 - very high)
10. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
11. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
12. Health - current health status (numeric: from 1 - very bad to 5 - very good)
13. Absences - number of school absences (numeric: from 0 to 93)

Output target: G3 - final grade (numeric: from 0 to 20)

`student-mat.csv` contains the actual dataset and `student.txt` contains the descriptions of the dataset.

## Objective

You are to implement a KNN algorithm in Python to classify students into one of the 21 possible final grades (0 to 20), using the chosen numerical features. After completing this assignment, you should be familiar with the following:
1. Loading a dataset
2. Standardising a dataset
3. Computing a similarity measure between 2 data points
4. Finding the K nearest neighbours of a given data point
5. Implementing a KNN algorithm
6. Using accuracy to evaluate a machine learning algorithm
7. Why feature scaling matters

### **Total Marks: 25**
---


## Downloading the Dataset and Importing Modules

You can follow the steps below to download the dataset and upload it to a Colab environment.
1. Download the dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/00320/student.zip, which contains the `student-mat.csv` file. 
2. Open the Colab file browser by pressing the small folder icon on the top left of the Colab page.  
3. Drag and drop the `student-mat.csv` file into the Colab folder.

We will be using `csv`, `math` and `numpy` as `np` for the questions. **You do not need to import them when submitting on Coursemology.**

In [None]:
import csv
import math
import numpy as np

# to display float numbers with 2 decimal places and suppress the use of
# scientific notation for small numbers
## np.set_printoptions(precision=2, suppress=True)

---

### Q1 loadStudentData (3 marks)

We first need to load the dataset from `student-mat.csv` and store the relevant data in numpy arrays.

The function `loadStudentData` takes in a csv file `filename` and returns the numpy arrays `X`, containing the 13 numerical features in `X_COLUMN_NAMES`, and `y`, containing the final grade in `Y_COLUMN_NAME`. Please **leave the rows and columns in the order that they appear** in the csv file and **exclude the headers**.

Tips:
1. You can use `csv.reader` to read the `student-mat.csv` file. The delimiter should be set to `;` (see https://docs.python.org/3/library/csv.html#csv-fmt-params).
2. If using `csv.reader`, the data will be in string form. You need to convert it to float for the later questions.
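
The two tips above can be sketched on a tiny in-memory sample. This is only an illustration: the sample text and its column names (`name`, `age`, `score`) are made up and are not the assignment's data or its required solution.

```python
import csv
import io

# Hypothetical semicolon-delimited sample standing in for a real file.
sample = io.StringIO('name;age;score\n"a";15;10\n"b";17;12\n')

reader = csv.reader(sample, delimiter=";")
header = next(reader)              # the first row holds the column names
age_idx = header.index("age")      # look up column positions by name
score_idx = header.index("score")

# csv.reader yields strings, so convert the selected fields to float.
rows = [[float(row[age_idx]), float(row[score_idx])] for row in reader]

print(header)  # ['name', 'age', 'score']
print(rows)    # [[15.0, 10.0], [17.0, 12.0]]
```

Looking columns up by name rather than by hard-coded position keeps the selection readable when only some of the columns are needed.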

In [None]:
# You can use X_COLUMN_NAMES and Y_COLUMN_NAME to extract the relevant information from the csv file
X_COLUMN_NAMES = [
    "age",
    "Medu",
    "Fedu",
    "traveltime",
    "studytime",
    "failures",
    "famrel",
    "freetime",
    "goout",
    "Dalc",
    "Walc",
    "health",
    "absences",
]
Y_COLUMN_NAME = "G3"

# Submit to Coursemology
def loadStudentData(filename):
    """
    filename: string, the path of the student-mat.csv dataset
    RETURN
        X: numpy array, shape = [N, D]
        y: numpy array, shape = [N]
    """
    X, y = None, None
    ## start your code here


    ## end
    return X, y

In [None]:
# Testing

filename = "/content/student-mat.csv"

X, y = loadStudentData(filename)

print(X.shape)
print(X[100][12])
print(y.shape)
print(y[177])

Expected output:

```
(395, 13)
14.0
(395,)
6.0
```

---

### Q2 standardizeDataset (3 marks)

After storing the data, we need to perform feature scaling on the `X` values (why feature scale? See Q8). In this case, we will be using standardization. Each column in `X` needs to be standardized separately. That is, for each value in a column of `X`, we subtract the mean of that column and then divide by the standard deviation of that column.

The function `standardizeDataset` takes in the numpy array `X` and returns the standardized numpy array `Xstd`.
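
As a rough illustration of column-wise standardization, the sketch below uses a made-up 3x2 array named `toy`. It assumes NumPy's default population standard deviation (`ddof=0`); the text above does not say which variant is expected, so treat that choice as an assumption.

```python
import numpy as np

# Hypothetical toy data: 3 rows, 2 columns on very different scales.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

col_mean = toy.mean(axis=0)   # one mean per column
col_std = toy.std(axis=0)     # one std per column (ddof=0, an assumption)

# Broadcasting applies the per-column statistics to every row.
toy_std = (toy - col_mean) / col_std

print(toy_std.mean(axis=0))   # approximately 0 for every column
print(toy_std.std(axis=0))    # approximately 1 for every column
```

After standardization the two columns become identical here, because they differed only by a scale factor.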

In [None]:
# Submit to Coursemology
def standardizeDataset(X):
    """
    X: numpy array, shape = [N,D]
    RETURN
      Xstd: numpy array, shape = [N,D]
    """
    Xstd = np.zeros_like(X)
    ## start your code here
    
    
    ## end
    return Xstd

In [None]:
# Testing

Xstd = standardizeDataset(X)
print(Xstd.shape)
print(Xstd[10, 10])

Expected output:

```
(395, 13)
-0.2263446258965982
```

---

### Q3 euclideanDistance (3 marks)

For KNN, we also need a distance metric to compare rows of data from the `Xstd` array. We will be using Euclidean distance as the metric for this assignment.

The function `euclideanDistance` takes two rows of data `x1` and `x2` and returns the Euclidean distance `dist`.
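
For reference, the Euclidean distance is the square root of the sum of squared coordinate differences. A minimal sketch on two made-up points (`a` and `b` are illustrative names, not part of the assignment):

```python
import numpy as np

# Hypothetical points forming a 3-4-5 right triangle.
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Square the coordinate differences, sum them, take the square root.
d = np.sqrt(np.sum((a - b) ** 2))

print(d)  # 5.0
```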

In [None]:
# Submit to Coursemology
def euclideanDistance(x1, x2):
    """
    x1: numpy array, shape = [D]
    x2: numpy array, shape = [D]
    RETURN
        dist: float value
    """
    dist = 0
    ## start your code here
        
            
    ## end
    return dist

In [None]:
# Testing

print(euclideanDistance(Xstd[1, :], Xstd[1, :]))
print(euclideanDistance(Xstd[87, :], Xstd[179, :]))

Expected output:

```
0.0
2.9451626819659182
```

---

### Q4 kNearestNeighbours (4 marks)

Now we're ready to find the K nearest neighbours. We will use `euclideanDistance` to compare distances, and get the K rows of data with the lowest Euclidean distance to `Xtest`.

The function `kNearestNeighbours` takes the training data `X` and `y`, the testing data `Xtest`, and the number of neighbours `K`. It returns the array `Xng`, which contains the `K` rows of `X` that are nearest to `Xtest`, and the array `yng`, which contains their corresponding class values. **The order of the elements in `Xng` and `yng` does not matter as long as they contain the `K` nearest neighbours**. A correct implementation of `euclideanDistance` has been given to you in Coursemology (i.e. you don't need to code it again).

Tip: you can use `np.argsort` to find the indices of `Xtest`'s most similar rows of data.
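
To see what `np.argsort` returns, here is a small sketch with made-up distances and labels (both arrays are hypothetical and unrelated to the dataset):

```python
import numpy as np

# Hypothetical distances from a query point to 4 training rows,
# and the class label of each of those rows.
distances = np.array([2.5, 0.3, 1.7, 0.9])
labels = np.array([10.0, 11.0, 12.0, 13.0])

order = np.argsort(distances)  # indices that sort distances ascending
nearest = order[:2]            # indices of the 2 smallest distances

print(order)            # [1 3 2 0]
print(labels[nearest])  # [11. 13.]
```

Indexing with the sorted index array picks the matching rows from any array that shares the same row order.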

In [None]:
# Submit to Coursemology
def kNearestNeighbours(X, y, Xtest, K):
    """
    X: numpy array, shape = [N, D]
    y: numpy array, shape = [N]
    Xtest: numpy array, shape = [D]
    K: int value
    RETURN
        Xng: numpy array, shape = [K, D]
        yng: numpy array, shape = [K]
    """
    Xng, yng = None, None
    ## start your code here


    ## end
    return Xng, yng

In [None]:
# Testing

Xng, yng = kNearestNeighbours(Xstd, y, Xstd[100, :], 5)
print(Xng)
print(yng)

Expected output: (can vary based on how you order the rows)

```
[[-0.55  1.14  1.36 -0.64 -1.24 -0.45  0.06  1.77  1.7  3.96  2.11  0.32 1.04]
 [-0.55  1.14  1.36 -0.64 -0.04 -0.45  0.06  0.77  1.7  3.96  2.11  1.04 1.29]
 [-0.55  0.23  0.44 -0.64 -0.04  0.9   0.06  1.77  1.7  2.83  1.33  1.04 -0.21]
 [ 0.24  1.14  1.36 -0.64 -0.04 -0.45  1.18 -0.24  1.7  2.83  2.11 -0.4 0.91]
 [ 0.24  1.14  0.44  0.79 -0.04 -0.45  0.06  0.77  0.8  2.83  1.33  0.32 -0.21]]
[ 5. 11. 12. 13.  9.]
```

---

### Q5 KNNClassifier (5 marks)

With all the previous functions, we can now assemble the KNN algorithm.

The function `KNNClassifier` takes in the training data `X` and `y`, the testing data `Xtest`, and the number of neighbours `K`. It first finds the `K` most similar neighbours, and then returns the most frequent class value among the `K` neighbours as the prediction for the class value of `Xtest`. **In the event of a tie (i.e. several class values have the same highest frequency), returning any class value with the highest frequency will do**. A correct implementation of `kNearestNeighbours` has been given to you in Coursemology (i.e. you don't need to code it again).
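
One possible way to pick the most frequent value in an array is `np.unique` with `return_counts=True`. The sketch below uses a made-up label array and is only one of several reasonable approaches.

```python
import numpy as np

# Hypothetical class values of 6 neighbours.
neighbour_labels = np.array([11.0, 5.0, 11.0, 12.0, 5.0, 11.0])

# values is sorted ascending; counts[i] is how often values[i] occurs.
values, counts = np.unique(neighbour_labels, return_counts=True)

# np.argmax returns the first maximum, so a tie would resolve to the
# smallest tied label under this approach.
winner = values[np.argmax(counts)]

print(values, counts)  # [ 5. 11. 12.] [2 3 1]
print(winner)          # 11.0
```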

In [None]:
# Submit to Coursemology
def KNNClassifier(X, y, Xtest, K):
    """
    X: numpy array, shape = [N, D]
    y: numpy array, shape = [N]
    Xtest: numpy array, shape = [D]
    K: int value
    RETURN
        output_class: float value from 0 to 20
    """
    output_class = None
    ## start your code here
        
        
    ## end
    return output_class

In [None]:
# Testing

# We shall consider 3 data points from the dataset as our test data
Xtest1, ytest1, K = Xstd[100, :], y[100], 3
prediction1 = KNNClassifier(Xstd, y, Xtest1, K)
print("=====Point 1=====")
print("Predicted class for test data by KNN: ", prediction1)
print("Actual class for test data from dataset: ", ytest1)

Xtest2, ytest2, K = Xstd[200, :], y[200], 3
prediction2 = KNNClassifier(Xstd, y, Xtest2, K)
print("=====Point 2=====")
print("Predicted class for test data by KNN: ", prediction2)
print("Actual class for test data from dataset: ", ytest2)

Xtest3, ytest3, K = Xstd[300, :], y[300], 3
prediction3 = KNNClassifier(Xstd, y, Xtest3, K)
print("=====Point 3=====")
print("Predicted class for test data by KNN: ", prediction3)
print("Actual class for test data from dataset: ", ytest3)

Expected output:

```
=====Point 1=====
Predicted class for test data by KNN:  5.0 or 11.0 or 12.0
Actual class for test data from dataset:  5.0
=====Point 2=====
Predicted class for test data by KNN:  6.0 or 10.0 or 16.0
Actual class for test data from dataset:  16.0
=====Point 3=====
Predicted class for test data by KNN:  6.0 or 9.0 or 11.0
Actual class for test data from dataset:  11.0
```

---

### Q6 accuracyPercentage (3 marks)

With any algorithm, we should obtain performance metrics to see how good it is. In this case, because we can compare our predicted values to the actual class values, we can use accuracy as a metric.

The function `accuracyPercentage` takes in the predicted class values from Q5 `predicted_class` and the actual class values of the test data `actual_class`. It returns the percentage `percent` of correctly predicted classes for the test data used in Q5.
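
Accuracy as a percentage is the share of positions where the prediction equals the actual value, times 100. A small sketch on made-up arrays (`pred` and `actual` are illustrative names):

```python
import numpy as np

# Hypothetical predictions and ground truth for 4 test points.
pred = np.array([1.0, 2.0, 3.0, 4.0])
actual = np.array([1.0, 2.0, 0.0, 4.0])

# The element-wise comparison gives booleans; their mean is the
# fraction of matches.
acc = np.mean(pred == actual) * 100

print(acc)  # 75.0
```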

In [None]:
# Submit to Coursemology
def accuracyPercentage(predicted_class, actual_class):
    """
    predicted_class: numpy array, shape = [N]
    actual_class: numpy array, shape = [N]
    RETURN
        percent: float value
    """
    percent = 0
    ## start your code here
        
        
    ## end
    return percent

In [None]:
# Testing

# Check accuracy of the last 20 data points
Xtest, ytest, K = Xstd[-20:], y[-20:], 3
predictions = np.array(list(map(lambda testData: KNNClassifier(Xstd, y, testData, K), Xtest)))
print("Accuracy {}%".format(accuracyPercentage(predictions, ytest)))

Expected output: (can vary based on how your KNNClassifier is implemented)

```
Accuracy 90.0%
```

---

### Q7 Reflection (3 marks)

Please answer the following questions and list your comments in bullet points. **This section is graded**. Note that you won't get a mark when you submit this question, but you will automatically be awarded the full mark when finalising submission (subject to manual marking afterwards)
1. In what ways did hands-on coding deepen your comprehension of machine learning algorithms compared to solely focusing on theoretical study?
2. How was your experience using ChatGPT as a TA compared to human TAs you've interacted with in the past?
3. Did you encounter any specific questions or hurdles that ChatGPT was unable to assist with? Please specify.
4. To what extent did leveraging ChatGPT inspire you to dive deeper or try varied approaches compared to your usual learning methods? Elaborate on your experience.

Please enter your comments here by double-clicking on this text cell:
* Q1
* Q2
* etc.

---

### Q8 KNN with and without standardizing the dataset

In this question, you will see why numerical data is standardized before it is used in machine learning algorithms. Click submit and run KNN for the given test data with and without standardization, and compare their performances. **This section is not graded**.
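
The effect can be previewed on a made-up two-column array in which the second column has a much larger range than the first. The data and the helper `nearest_other_row` are hypothetical and exist only for this sketch. The standardization uses NumPy's default `ddof=0`.

```python
import numpy as np

# Hypothetical rows: [small-range feature, large-range feature].
toy = np.array([[1.0, 50.0],
                [4.0, 52.0],
                [1.0, 60.0],
                [2.0, 0.0],
                [3.0, 90.0]])

def nearest_other_row(data, query_idx):
    """Index of the row closest to data[query_idx], excluding itself."""
    dists = np.sqrt(((data - data[query_idx]) ** 2).sum(axis=1))
    dists[query_idx] = np.inf  # ignore the query row itself
    return int(np.argmin(dists))

toy_std = (toy - toy.mean(axis=0)) / toy.std(axis=0)

raw_nn = nearest_other_row(toy, 0)      # dominated by the second column
std_nn = nearest_other_row(toy_std, 0)  # both columns weigh comparably

print(raw_nn, std_nn)  # 1 2
```

On the raw values, row 1 looks closest to row 0 because the large-range column differs by only 2, even though the small-range column differs by nearly its full range. After standardization, row 2, which matches row 0 exactly on the first column, becomes the nearest.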

In [None]:
# load the original training data
X, y = loadStudentData(filename)

# standardize the data
Xstd = standardizeDataset(X)

# choose a fixed sample of rows from X and Xstd as test data
# X - dataset that is not standardized
# Xstd - standardized dataset
# In both cases, the class value y is the same
random_indx = np.asarray([9, 153, 91, 29, 20, 10, 138, 130, 1, 11, 25, 137, 120])
testX = X[random_indx]
testXstd = Xstd[random_indx]
testy = y[random_indx]

# predictedNoStd has the classes predicted for test data without standardization
# predictedStd has the classes predicted for test data with standardization
K = 3
predictedNoStd = np.empty(len(testy))
predictedStd = np.empty(len(testy))

# call KNN without standardized dataset and test data testX. Record the predicted
# class in predictedNoStd numpy array
count = 0
for test in testX:
    predictedNoStd[count] = KNNClassifier(X, y, test, K)
    count += 1

# call KNN with standardized dataset and test data testXstd. Record the predicted
# class in predictedStd numpy array
count = 0
for test in testXstd:
    predictedStd[count] = KNNClassifier(Xstd, y, test, K)
    count += 1

# print the predicted classes and the actual classes for the test data
print("Predicted class with standardization: ", predictedStd)
print("Predicted class without standardization: ", predictedNoStd)
print("Actual class for test data: ", testy)

# print the accuracy of KNN with and without standardizing dataset
standardizedAccuracy = accuracyPercentage(predictedStd, testy)
print("Accuracy of KNN with standardization: ", standardizedAccuracy)

unstandardizedAccuracy = accuracyPercentage(predictedNoStd, testy)
print("Accuracy of KNN without standardization: ", unstandardizedAccuracy)

Expected output: (can vary based on how your KNNClassifier is implemented)

```
Predicted class with standardization:  [15.  0. 18. 11. 15.  9. 12.  0.  6. 12.  8.  0. 15.]
Predicted class without standardization:  [15.  0. 18. 11. 15.  9. 12.  0. 10. 12.  8.  0. 15.]
Actual class for test data:  [15.  0. 18. 11. 15.  9. 12.  0.  6. 12.  8.  0. 15.]
Accuracy of KNN with standardization:  100.0
Accuracy of KNN without standardization:  92.3076923076923
```

---

# End of Assignment