<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
    <img style="float: right; padding-right: 10px; width: 65px" src="https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/images/clemson_paw.png"> </div>

# Week 3 | Homework 3: k-NN Regression

**Clemson University Instructor(s):** Tim Ransom

-----------------------------

## Learning objectives 

- Describe the k-Nearest Neighbors algorithm for regression.
- Implement k-NN regression using scikit-learn.
- Choose an appropriate value for k based on model performance.
- Explain the concept of overfitting in the context of k-NN.
- Apply data normalization techniques to improve k-NN performance.

------------------------------------------------------------------------

## INSTRUCTIONS

-   As much as possible, stick to the hints and to the functions we
    import at the top of the homework, since those are the ideas and
    tools the class supports and aims to teach. If a problem specifies
    a particular library, you are required to use that library, and
    possibly others from the import list.
-   Please use `.head()` when viewing data.

------------------------------------------------------------------------

## About

#### This homework involves: 
- Regression methods for predicting a quantitative variable. Specifically, we will build regression models that can predict the number of taxi pickups in New York City at any given time of the day. 
- These prediction models will be useful, for example, in monitoring traffic in the city.
- The data set for this problem is given in the file `nyc_taxi.csv`. 
- You will have to separate it into training and test sets. 
- The first column contains the time of day in minutes, and the second column contains the number of pickups observed at that time. 
- The data set covers taxi pickups recorded in NYC during Jan 2015.
- We will fit models that use the time of the day (in minutes) as a predictor and predict the average number of taxi pickups at that time.
- The models will be fitted to the training set and evaluated on the test set. 
- The performance of the models will be evaluated using the $R^2$ metric.
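
As a reminder of what the metric measures, $R^2$ is defined as $1 - SS_{res} / SS_{tot}$. The block below is a minimal sketch of that formula in NumPy; the helper name `r_squared` and the toy values are illustrative assumptions and do not come from `nyc_taxi.csv`. In the homework itself you would use `r2_score` from `sklearn.metrics`, which computes the same quantity.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only
y_true = [3, 5, 7, 9]
y_pred = [2.5, 5.5, 7.5, 8.5]
score = r_squared(y_true, y_pred)
print(score)
```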

---------------

In [None]:
""" RUN THIS CELL TO GET THE RIGHT FORMATTING """
import requests
from IPython.display import HTML
css_file = 'https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/css/cpsc6300.css'
styles = requests.get(css_file).text
HTML(styles)

In [None]:
# Import necessary libraries for numerical operations, data manipulation, modeling, and visualization
import numpy as np
import pandas as pd

from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from statsmodels.api import OLS

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# The %matplotlib inline is a magic command used in Jupyter Notebooks to display Matplotlib plots directly within the notebook cells.
%matplotlib inline

<div class="theme"> Question 1 </div>

<div class="exercise"> <b> Exercise 1.1 </b> </div>

- Use pandas to load the dataset from the csv file `nyc_taxi.csv` into a pandas data frame.  
- Use the `train_test_split` function from `sklearn` with `random_state=42` and `test_size=0.2` to split the dataset into training and test sets.  
- Store your train set data frame as `train_data`, your test set data frame as `test_data`, and the whole dataset as `data`.
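
The splitting step can be sketched on a toy data frame as follows. Everything here is an assumption for illustration: the frame `toy`, its random contents, and the names `toy_train` and `toy_test` stand in for the real data. In the homework you would load the file with `pd.read_csv` and use the variable names requested above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real file (random values, for illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "TimeMin": rng.integers(0, 1440, size=100),
    "PickupCount": rng.integers(0, 100, size=100),
})

# 80/20 split; a fixed random_state makes the split reproducible
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)
print(toy_train.shape, toy_test.shape)
print(toy_train.head())
```

Because `train_test_split` receives a data frame, it returns data frames, so both halves keep their column names.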

In [None]:
"""Write your code for exercise-1.1 here:"""

# your code here
raise NotImplementedError

<div class="exercise"> <b> Exercise 1.2 </b>  </div>

- Generate a scatter plot of the training data points to show how the number of taxi pickups depends on the time of day. 
- Extract the first column of the dataset into a variable named `time_of_day` and the second column into a variable named `taxi_pickups`.
- Use these variables to generate the scatter plot.
  1. To create the scatter plot, you need to set up the Figure and Axes objects using `plt.subplots()`. 
        - These two objects (fig and ax) will allow you to control various properties of the figure and plot.
        - `fig` is the Figure object: It serves as the overall container for the plot.
        - `ax` is the Axes object: This is where the actual data points will be plotted, including x and y axes, labels, etc.
        ```python
        # Example code
        fig, ax = plt.subplots(figsize=(10, 6))
        ```
  2. Customize the Plot: 
        - You need to set the title of the plot, the labels for the x-axis and y-axis, and format the ticks on the x-axis for better readability.
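
One possible way to make the x-axis ticks more readable is to relabel minutes as clock times. The sketch below uses made-up toy data, and the 3-hour tick spacing and the `HH:00` label format are assumptions rather than requirements of the exercise. The `matplotlib.use("Agg")` line only makes the block runnable outside a notebook and is not needed here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; unnecessary inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Toy data for illustration only
rng = np.random.default_rng(0)
toy_time = rng.integers(0, 1440, size=200)   # minutes of the day
toy_count = rng.integers(0, 100, size=200)   # made-up pickup counts

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(toy_time, toy_count, alpha=0.5)
ax.set_title("Toy data: pickups vs. time of day")
ax.set_xlabel("Time of day")
ax.set_ylabel("Number of pickups")

# Place a tick every 3 hours and label it as HH:00
tick_minutes = np.arange(0, 1441, 180)
ax.set_xticks(tick_minutes)
ax.set_xticklabels([f"{int(m) // 60:02d}:00" for m in tick_minutes])
```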
 


In [None]:
"""Write your code for exercise-1.2 here:"""

# your code here
raise NotImplementedError

<div class="exercise"> <b> Question 1.3 </b> </div>

- In a few sentences, describe the general pattern of taxi pickups over the course of the day and explain why this is a reasonable result.

Write your answer for question 1.3 in the markdown cell below, marked `YOUR ANSWER HERE`.

YOUR ANSWER HERE

In [None]:
# your code here
raise NotImplementedError

<div class="exercise"> <b> Question 1.4 </b></div>

- You should see a <i>hole</i> in the scatter plot when `TimeMin` is 500-550 minutes and `PickupCount` is roughly 20-30 pickups.
- Briefly surmise why this is the case. 

Write your answer for question 1.4 in the markdown cell below, marked `YOUR ANSWER HERE`.

YOUR ANSWER HERE

In [None]:
# your code here
raise NotImplementedError

<div class="theme"> Question 2 </div>

- In lecture we've seen k-Nearest Neighbors (k-NN) Regression, a non-parametric regression technique. 
- In the following problems, please use built-in functionality from `sklearn` to run k-NN regression.
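
To recall the idea behind the method: for a query point, k-NN regression finds the $k$ training points closest to it and predicts the average of their responses. The block below is a minimal from-scratch sketch for a single feature. The helper `knn_predict` and the toy arrays are hypothetical and serve only as illustration; the exercises themselves should use `KNeighborsRegressor`.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Predict y at x_query as the mean response of the k nearest training points."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    distances = np.abs(x_train - x_query)                # one feature, so distance is |x - x_query|
    nearest = np.argsort(distances, kind="stable")[:k]   # indices of the k closest points
    return y_train[nearest].mean()

# Toy data for illustration only
x_toy = [0, 10, 20, 30, 40]
y_toy = [1.0, 3.0, 2.0, 6.0, 8.0]
pred = knn_predict(x_toy, y_toy, x_query=12, k=3)
print(pred)
```

Here the three nearest neighbors of 12 are the points at 10, 20, and 0, so the prediction is the mean of their responses.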

<div class="exercise"> <b> Exercise 2.1 </b> </div>

- Choose `TimeMin` as your feature variable and `PickupCount` as your response variable. 
- Create a dictionary of `KNeighborsRegressor` objects called `KNNModels`. 
- Let each key of your `KNNModels` dictionary be a value of $k$ and each value be the corresponding `KNeighborsRegressor` object. 
- For $k \in \{1, 10, 75, 250, 500, 750, 1000\}$, fit k-NN regressor models on the training set (`train_data`).
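
A common stumbling block is the shape of the feature: scikit-learn estimators expect a 2-D input of shape `(n_samples, n_features)`, even when there is only one feature. The sketch below shows the pattern on synthetic data. The names `X_toy`, `y_toy`, and `toy_models`, as well as the toy values of $k$, are assumptions for illustration and not the values required by the exercise.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
x_toy = rng.uniform(0, 1440, size=50)
y_toy = np.sin(x_toy / 1440 * 2 * np.pi) + rng.normal(0, 0.1, size=50)

X_toy = x_toy.reshape(-1, 1)   # one feature -> shape (n_samples, 1)

toy_models = {}
for k in [1, 5, 25]:           # toy k values; each must not exceed the number of training rows
    toy_models[k] = KNeighborsRegressor(n_neighbors=k).fit(X_toy, y_toy)

print(toy_models[5].predict([[720.0]]))
```

With a pandas data frame, selecting with double brackets (for example `train_data[["TimeMin"]]`) already yields a 2-D object, so no reshaping is needed.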

In [None]:
"""Write your code for exercise-2.1 here:"""
k_values = [1, 10, 75, 250, 500, 750, 1000]


# your code here
raise NotImplementedError

<div class="exercise"> <b> Exercise 2.2 </b> </div>

- For each $k$, overlay a scatter plot of the actual values of `PickupCount` vs. `TimeMin` in the training set with a scatter plot of **predictions** for `PickupCount` vs `TimeMin`. 
- Do the same for the test set. 
- You should have one figure with 7 x 2 = 14 subplots in total.
- For each $k$ the figure should have two subplots, one subplot for the training set and one for the test set.
  1. To create these subplots, you need to set up the Figure and Axes objects using `plt.subplots()`. 
        - These two objects (fig and ax) will allow you to control various properties of the figure and plot.
        - `fig` is the Figure object: It serves as the overall container for the plot.
        - `ax` is the Axes object: This is where the actual data points will be plotted, including x and y axes, labels, etc.
        ```python
        # Example code
        fig, ax = plt.subplots(1, 2, figsize=(10, 6))
        ```
        - Here `plt.subplots(1, 2)` creates one row and two columns of subplots. This results in two individual Axes that will be side by side.
        - Refer to this [document](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) for more information.
  2. Customize the Plot: 
        - You need to set the title of the plot, the labels for the x-axis and y-axis, and format the ticks on the x-axis for better readability.

**Hints**:

1.  Each subplot should use different color and/or markers to
    distinguish k-NN regression prediction values from the actual data
    values.
2.  Each subplot must have appropriate axis labels, title, and legend, and be stored in a variable named `ax`
3.  The overall figure should have a title.
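
When `plt.subplots` is called with more than one row and more than one column, it returns a 2-D array of Axes that can be indexed by row and column. The following layout sketch uses a toy list of three $k$ values and omits the actual scatter calls. The names `toy_k_values` and `axes`, the figure size, and the titles are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; unnecessary inside a notebook
import matplotlib.pyplot as plt

toy_k_values = [1, 5, 25]   # toy list for illustration

# One row per k, two columns (training set, test set)
fig, axes = plt.subplots(len(toy_k_values), 2, figsize=(12, 4 * len(toy_k_values)))
for row, k in enumerate(toy_k_values):
    for col, split in enumerate(["Training set", "Test set"]):
        ax = axes[row, col]                     # Axes for this k and this split
        ax.set_title(f"k = {k}: {split}")
        ax.set_xlabel("TimeMin")
        ax.set_ylabel("PickupCount")
        # Scatter calls for the actual values and for the predictions would go here

fig.suptitle("k-NN predictions by k (layout sketch)")
fig.tight_layout()
```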

In [None]:
"""Write your code for exercise-2.2 here:"""

# your code here
raise NotImplementedError

<div class="exercise"> <b> Exercise 2.3 </b> </div>

- Report the $R^2$ score for the fitted models on both the training and test sets for each $k$ (reporting the values in tabular form is encouraged).
- Loop over each value of $k$ and calculate the $R^2$ score for both the training and test sets.
- Store the results in a list called `results`, appending one tuple (`k`, `train_r2`, `test_r2`) per value of $k$.

**Hint:** To report the values in tabular form, store the results in a pandas DataFrame.
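
The pattern of collecting tuples and turning them into a table can be sketched as below. The data are synthetic, and the names `toy_results` and `toy_table`, the hand-made split, and the toy $k$ values are all assumptions for illustration, so the printed scores say nothing about the taxi data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=80)
X_train, y_train = X[:60], y[:60]
X_test, y_test = X[60:], y[60:]

toy_results = []
for k in [1, 5, 25]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_r2 = r2_score(y_train, model.predict(X_train))   # score on data the model has seen
    test_r2 = r2_score(y_test, model.predict(X_test))      # score on held-out data
    toy_results.append((k, train_r2, test_r2))

# Each tuple becomes one row of the table
toy_table = pd.DataFrame(toy_results, columns=["k", "train_r2", "test_r2"])
print(toy_table)
```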

In [None]:
"""Write your code for exercise-2.3 here:"""

# your code here
raise NotImplementedError

<div class="exercise"> <b> Exercise 2.4 </b> </div>

- Plot, in a single figure, the $R^2$ values from the model on the training and test set as a function of $k$.


   1. To create this plot, you need to set up the Figure and Axes objects using `plt.subplots()`. 
        - These two objects (fig and ax) will allow you to control various properties of the figure and plot.
        - `fig` is the Figure object: It serves as the overall container for the plot.
        - `ax` is the Axes object: This is where the actual data points will be plotted, including x and y axes, labels, etc.
        ```python
        # Example code
        fig, ax = plt.subplots(figsize=(10, 6))
        ```
        - Refer to this [document](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) for more information.
   2. Customize the Plot: 
        - You need to set the title of the plot, the labels for the x-axis and y-axis, and format the ticks on the x-axis for better readability.

**Hints**:

1.  Differentiate $R^2$ plots on the training and test set by color
    and/or marker.
2.  Make sure the $k$ values are sorted before making your plot.

In [None]:
"""Write your code for exercise-2.4 here:"""

# your code here
raise NotImplementedError

<div class="exercise"> <b> Question 2.5 </b> </div>

**Write your answers for question 2.5 in the code cell below:**

### 1. If $n$ is the number of observations in the training set, what can you say about a k-NN regression model that uses $k = n$?  

1. It will perfectly fit the training data.  
2. It will produce the same prediction for all test points.  
3. It will have the highest possible variance.  
4. It will generalize well to unseen data.  

---

### 2. What does an $R^2$ score of 0 mean?  

1. The model perfectly predicts the target values.  
2. The model performs worse than a simple mean predictor.  
3. The model is no better than predicting the mean of the target variable.  
4. The model has no bias but high variance.  

---

### 3. What would a negative $R^2$ score mean?  

1. The model predicts worse than using the mean of the target variable.  
2. The model predicts perfectly.  
3. The model is highly overfitting.  
4. The model has high variance but low bias.  

---

### 4. Do the training and test $R^2$ plots exhibit different trends?  

1. No, they both always increase with increasing $k$.  
2. Yes, training $R^2$ decreases while test $R^2$ initially increases and then decreases.  
3. No, they both always decrease with increasing $k$.  
4. Yes, test $R^2$ remains constant while training $R^2$ decreases.  

---

### 5. What is the best value of $k$? How do the corresponding training/test set $R^2$ values compare?  

1. The smallest possible $k$, because it minimizes bias.  
2. The largest possible $k$, because it minimizes variance.  
3. A moderate $k$ where test $R^2$ is highest, balancing bias and variance.  
4. Any value of $k$ produces the same result in k-NN regression.  

---

### 6. Use the plots of the predictions to justify why your choice of the best $k$ makes sense (**Hint**: think Goldilocks).  

1. The best $k$ is small because it memorizes training data.  
2. The best $k$ is large because it smooths out noise completely.  
3. The best $k$ is moderate because it avoids underfitting and overfitting.  
4. The best $k$ does not matter as k-NN is always a good predictor.  

In [None]:
answer1 = 0
answer2 = 0
answer3 = 0
answer4 = 0
answer5 = 0
answer6 = 0

# your code here
raise NotImplementedError

# END
