<h1 align='center'>Homework 3<br>KNN Implementation</h1>

## Project Overview
In this homework, you will implement three custom K-Nearest Neighbors (KNN) classifiers that handle numeric, categorical, and mixed features without preprocessing. Each feature type gets its own distance metric; for mixed data, the metrics are combined with a weight to classify new instances.

## Learning Objectives
You will implement K-Nearest Neighbors (KNN) classifiers tailored to the feature types in three datasets:
- Numeric-only features
- Categorical-only features
- Mixed (categorical and numeric) features

Each part includes a concrete dataset that can be used as-is, with no preprocessing required.

This assignment helps you understand how distance metrics affect classification when feature types vary. You will also see how the number of neighbors $K$ affects the predictions. You will implement KNN yourself and compare its results with scikit-learn’s `KNeighborsClassifier`.

## Part 1: Numeric-only classification (Breast Cancer dataset)

### Overview
- Dataset: Breast Cancer Wisconsin (Diagnostic) dataset.
- Features: 30 numerical measurements of cell nuclei.
- Target: malignant vs. benign.
- Task: Implement KNN using Euclidean distance, evaluate accuracy across different values of $K$, and plot accuracy vs. $K$. Then, compare the results with scikit-learn’s built-in implementation of KNN.

### Step-by-step implementation guide
1. Load the dataset from `sklearn.datasets` using the `load_breast_cancer()` function.
2. Split into training and test sets.
3. Implement Euclidean distance manually.
4. Write your own KNN classifier using that distance.
5. Evaluate accuracy for different values of $K$.
6. Plot accuracy vs. $K$.
7. Compare results with scikit-learn’s `KNeighborsClassifier`.
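
Steps 3 to 5 can be sketched as follows. This is a minimal illustration in plain Python, not a reference solution: the function names (`euclidean_distance`, `knn_predict`, `accuracy`) and the toy data are assumptions made for this example, and the tie-breaking rule (first label encountered wins) is one possible choice among several.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(X_train, y_train, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Pair every training label with the distance from its row to the query.
    distances = [(euclidean_distance(row, query), label)
                 for row, label in zip(X_train, y_train)]
    # Sort by distance only; the sort is stable, so equal distances keep training order.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # On a tied vote, most_common returns the label that was encountered first.
    return Counter(nearest_labels).most_common(1)[0][0]


def accuracy(X_train, y_train, X_test, y_test, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(X_train, y_train, x, k) == y
               for x, y in zip(X_test, y_test))
    return hits / len(y_test)


# Toy data invented for illustration (not taken from the Breast Cancer dataset).
X_train = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.5, 4.5]]
y_train = [0, 0, 1, 1]
print(knn_predict(X_train, y_train, [1.1, 1.0], k=3))  # two of the three nearest are class 0
```

Calling `accuracy` inside a loop over $K$ gives the values needed for the accuracy vs. $K$ plot. On the real dataset, a vectorized NumPy version of the distance computation will be noticeably faster than this pure-Python loop.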

### Sample result

<img src="images/breast-cancer-KNN-manual-vs-builtin.png" width="40%" height="40%"/>

## Part 2: Categorical-only classification (Car Evaluation dataset)

### Overview
- Dataset: UCI Car Evaluation dataset.
- Features: buying, maint, doors, persons, lug_boot, safety (all categorical).
- Target: car acceptability.
- Task: Implement KNN using Hamming distance, evaluate accuracy across different values of $K$, and plot accuracy vs. $K$. Then, compare the results with scikit-learn’s built-in implementation of KNN.

### Step-by-step implementation guide
1. Load the dataset directly from the [UCI website](https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data "Car Evaluation dataset").
2. Split into training and test sets.
3. Implement Hamming distance manually.
4. Write your own KNN classifier using that distance.
5. Evaluate accuracy for different values of $K$.
6. Plot accuracy vs. $K$.
7. Compare results with scikit-learn’s `KNeighborsClassifier` (using `metric="hamming"`).
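
Steps 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the rows and labels are invented for the example (they only reuse category values that occur in the Car Evaluation features).

```python
from collections import Counter


def hamming_distance(a, b):
    """Number of positions at which two equal-length category tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict_categorical(X_train, y_train, query, k):
    """Majority vote among the k training rows with the fewest mismatches."""
    distances = [(hamming_distance(row, query), label)
                 for row, label in zip(X_train, y_train)]
    # Stable sort on the mismatch count; ties keep training order.
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Column order: buying, maint, doors, persons, lug_boot, safety.
# Rows and labels are invented for illustration, not copied from car.data.
X_train = [
    ("vhigh", "vhigh", "2", "2", "small", "low"),
    ("vhigh", "high", "2", "2", "small", "med"),
    ("low", "low", "4", "4", "big", "high"),
    ("med", "low", "4", "more", "big", "high"),
]
y_train = ["unacc", "unacc", "acc", "acc"]

query = ("low", "med", "4", "4", "big", "high")
print([hamming_distance(row, query) for row in X_train])  # mismatch count per training row
print(knn_predict_categorical(X_train, y_train, query, k=3))
```

Two points matter for the comparison in step 7. scikit-learn’s `KNeighborsClassifier` expects numeric arrays, so the string categories have to be encoded as integers before calling it. Its `"hamming"` metric also reports the fraction of mismatching positions rather than the count; with a fixed number of features, this does not change the distance ordering of the neighbors.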

### Sample result

<img src="images/car-evaluation-KNN-manual-vs-builtin.png" width="40%" height="40%"/>

## Part 3: Mixed-feature classification (Restaurant Tips dataset)

### Overview
- Dataset: Seaborn Restaurant Tips dataset.
- Features: Mix of categorical (sex, smoker, day) and numeric (total_bill, tip, size).
- Target: time (Lunch/Dinner).
- Task: Implement a mixed-distance KNN. Use Euclidean distance for numeric features, Hamming distance for categorical features, combine them with a weight $W$, evaluate across different $K$ and $W$, and plot accuracy vs. $K$ for multiple $W$ values.

### Step-by-step implementation guide
1. Load the dataset with `seaborn.load_dataset("tips")`.
2. Map numeric-like features to integers.
3. Split into training and test sets.
4. Implement Euclidean distance for numeric features.
$$d_{Euclidean}(x, y)=\sqrt{\sum_i(x_i-y_i)^2}$$
5. Implement Hamming distance for categorical features.
$$d_{Hamming}(x, y)=\sum_i1[x_i\neq y_i]$$
6. Combine them with a weight $W$: 
$$d_{total}=d_{Euclidean}+W\times d_{Hamming}$$
7. Write your own KNN classifier using this combined distance.
8. Evaluate accuracy for different $K$ and $W$.
> - Loop over $K \in \{1, 2, \ldots, 20\}$
> - Loop over weights $W \in \{0.5, 1.0, 2.0, 3.0\}$ (you may add more).
> - Record accuracy for each pair $(K, W)$
9. Plot accuracy vs. $K$ for multiple weights.
10. Discuss how changing $W$ shifts the influence of categorical vs. numeric features.
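
The combined distance from steps 4 to 6 and the effect of $W$ can be sketched as follows. This is a minimal illustration, not a reference solution: the function names, the `(numeric, categorical, label)` row layout, and the two training rows are assumptions invented for the example.

```python
import math
from collections import Counter


def mixed_distance(num_a, cat_a, num_b, cat_b, w):
    """d_total = d_Euclidean(numeric parts) + w * d_Hamming(categorical parts)."""
    d_euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(num_a, num_b)))
    d_hamming = sum(x != y for x, y in zip(cat_a, cat_b))
    return d_euclidean + w * d_hamming


def knn_predict_mixed(train, query, k, w):
    """`train` holds (numeric, categorical, label) rows; `query` is (numeric, categorical)."""
    q_num, q_cat = query
    scored = sorted(
        ((mixed_distance(num, cat, q_num, q_cat, w), label)
         for num, cat, label in train),
        key=lambda pair: pair[0],  # compare on distance only
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Numeric part: (total_bill, tip, size); categorical part: (sex, smoker, day).
# Both rows are invented for illustration, not taken from the tips dataset.
train = [
    ((15.0, 2.5, 2), ("Female", "No", "Thur"), "Lunch"),
    ((24.0, 4.0, 2), ("Male", "No", "Sat"), "Dinner"),
]
# The query matches the first row on every category but is numerically closer to the second.
query = ((21.0, 3.0, 2), ("Female", "No", "Thur"))

for w in (0.5, 1.0, 2.0, 3.0):
    print(w, knn_predict_mixed(train, query, k=1, w=w))
```

In this toy case the "Dinner" row is closer numerically (distance about 3.16 versus about 6.02) but differs in two categories, so its total distance is about $3.16 + 2W$. For $W = 0.5$ and $W = 1.0$ it remains the nearest neighbor; for $W = 2.0$ and $W = 3.0$ the categorical penalty outweighs the numeric advantage and the prediction switches to "Lunch". Because `total_bill` is on a much larger scale than the mismatch count, small values of $W$ tend to let the numeric features dominate.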

### Sample result

<img src="images/tips-KNN-mixed-data.png" width="40%" height="40%"/>

## Deliverables

- Code: Three separate scripts or notebooks, one per part.
- Plots:
> - For Parts 1 and 2, accuracy vs. $K$ plots comparing the manual and built-in implementations.
> - For Part 3, accuracy vs. $K$ curves for multiple $W$ values.
- Short report:
> - Observations: How distance choice impacts performance.
> - Sensitivity: Effects of varying $K$ and $W$.
> - Trade-offs: When categorical signals dominate vs. numeric signals.