# Assignment 2

As before, if a question can be answered with 'yes/no', or a numeric value, you may simply state as much. If you incorporate code from the internet (which is not required and generally not advisable), please cite the source within your code (providing a URL is sufficient).

We will go through comparable code and concepts in the live learning sessions. If you run into trouble, start by using Python's `help()` function to get information about the datasets and functions in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!).

Please bring questions that you cannot work out on your own to office hours, work periods or share with your peers on Slack. We will work with you through the issue.

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate with whom you have worked in your pull request by tagging their GitHub usernames. Separate submissions are required.

In [30]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Import specific objects
from sklearn.preprocessing import StandardScaler
from ISLP import load_data

### Question 1: Classification using KNN

We'll now use the `Caravan` dataset from the `ISLP` package. (You may use `Caravan.describe()` to review details of the dataset.) In this dataset, the response variable of interest is `Purchase`, which indicates if a given customer purchased a caravan insurance policy. We will simultaneously use all other variables in the dataset to predict the response variable.

In [31]:
# Load the "Caravan" dataset using the "load_data" function from the ISLP package
Caravan = load_data('Caravan')

# Add your code here
print(Caravan.describe())
print(Caravan.head())

           MOSTYPE     MAANTHUI      MGEMOMV     MGEMLEEF     MOSHOOFD  \
count  5822.000000  5822.000000  5822.000000  5822.000000  5822.000000   
mean     24.253349     1.110615     2.678805     2.991240     5.773617   
std      12.846706     0.405842     0.789835     0.814589     2.856760   
min       1.000000     1.000000     1.000000     1.000000     1.000000   
25%      10.000000     1.000000     2.000000     2.000000     3.000000   
50%      30.000000     1.000000     3.000000     3.000000     7.000000   
75%      35.000000     1.000000     3.000000     3.000000     8.000000   
max      41.000000    10.000000     5.000000     6.000000    10.000000   

            MGODRK       MGODPR       MGODOV       MGODGE       MRELGE  ...  \
count  5822.000000  5822.000000  5822.000000  5822.000000  5822.000000  ...   
mean      0.696496     4.626932     1.069907     3.258502     6.183442  ...   
std       1.003234     1.715843     1.017503     1.597647     1.909482  ...   
min       0.00000

Before fitting any model, it is essential to understand our data. Answer the following questions about the `Caravan` dataset (Hint: use `print` and `describe`):  
_(i)_ How many observations (rows) does the dataset contain?    
_(ii)_ How many variables (columns) does the dataset contain?    
_(iii)_ What 'variable' type is the response variable `Purchase` (e.g., 'character', 'factor', 'numeric', etc)? What are the 'levels' of the variable?    
_(iv)_ How many predictor variables do we have (Hint: all variables other than `Purchase`)?  

In [33]:
# Add your code here
# (i) How many observations (rows) does the dataset contain?
num_rows = Caravan.shape[0]
print(f"(i) The dataset contains {num_rows} observations (rows).")

# (ii) How many variables (columns) does the dataset contain?
num_columns = Caravan.shape[1]
print(f"(ii) The dataset contains {num_columns} variables (columns).")

# (iii) What 'variable' type is the response variable `Purchase` (e.g., 'character', 'factor', 'numeric', etc)? What are the 'levels' of the variable?
purchase_type = Caravan['Purchase'].dtype
purchase_levels = Caravan['Purchase'].unique()
print(f"(iii) The response variable 'Purchase' is of type '{purchase_type}'. It has the following levels: {purchase_levels}.")

# (iv) How many predictor variables do we have (Hint: all variables other than `Purchase`)?
num_predictors = num_columns - 1
print(f"(iv) There are {num_predictors} predictor variables.")


(i) The dataset contains 5822 observations (rows).
(ii) The dataset contains 86 variables (columns).
(iii) The response variable 'Purchase' is of type 'object'. It has the following levels: ['No' 'Yes'].
(iv) There are 85 predictor variables.


Next, we must perform 'pre-processing' or 'data munging' to prepare our data for classification/prediction. For KNN, there are three essential steps. A first essential step is to 'standardize' the predictor variables. We can achieve this using the `StandardScaler` class, provided as follows:

In [34]:
# Select predictors (excluding the 86th column)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Display the head of the standardized predictors
print(predictors_standardized.head())

    MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR  \
0  0.680906  -0.27258  0.406697 -1.216964  0.779405 -0.694311  0.217444   
1  0.992297  -0.27258 -0.859500 -1.216964  0.779405  0.302552 -0.365410   
2  0.992297  -0.27258 -0.859500 -1.216964  0.779405 -0.694311 -0.365410   
3 -1.187437  -0.27258  0.406697  0.010755 -0.970980  1.299414 -0.948264   
4  1.225840  -0.27258  1.672893 -1.216964  1.479559  0.302552 -0.365410   

     MGODOV    MGODGE    MRELGE  ...   ALEVEN  APERSONG   AGEZONG  AWAOREG  \
0 -0.068711 -0.161816  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   
1 -0.068711  0.464159 -0.096077  ... -0.20291 -0.073165 -0.081055 -0.05992   
2  0.914172  0.464159 -1.667319  ... -0.20291 -0.073165 -0.081055 -0.05992   
3  0.914172  0.464159 -0.619824  ... -0.20291 -0.073165 -0.081055 -0.05992   
4 -0.068711  0.464159  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   

     ABRAND   AZEILPL  APLEZIER   AFIETS   AINBOED  ABYSTAND  
0  0.764971 -0.02

_(v)_ Why is it important to standardize the predictor variables?  
_(vi)_ Why did we elect not to standardize our response variable `Purchase`?  


# Your answer here

(v) Standardizing ensures that all variables contribute equally to the distance calculations used in KNN. KNN is a distance-based algorithm: it relies on Euclidean distance (or another distance metric) to determine the nearest neighbors. If the predictor variables are on different scales, variables with larger scales will dominate the distance calculation, leading to biased results. Standardizing rescales each predictor to mean 0 and standard deviation 1, which puts them on a common scale.

(vi) The response variable Purchase is a categorical variable that indicates whether a customer purchased a caravan insurance policy ('Yes' or 'No'). Standardizing a categorical variable does not make sense because the values are not on a numerical scale that would benefit from scaling. The KNN algorithm treats the response variable as a label, not a feature, so it does not require or benefit from standardization.
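To make the scale argument concrete, here is a minimal sketch on made-up toy data (the income and age columns are hypothetical and are not part of `Caravan`). It compares Euclidean distances before and after applying `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is income in dollars, column 1 is age in years.
X = np.array([
    [30_000.0, 25.0],   # person 0 (the query point)
    [30_500.0, 60.0],   # person 1: similar income, very different age
    [34_000.0, 26.0],   # person 2: somewhat different income, similar age
    [60_000.0, 40.0],   # person 3: included to give both columns some spread
])


def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# On the raw scale, the income column dominates the distance.
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])

# After standardizing, both columns contribute on a comparable scale.
X_std = StandardScaler().fit_transform(X)
std_d01 = euclidean(X_std[0], X_std[1])
std_d02 = euclidean(X_std[0], X_std[2])

print(f"Raw distances:          0-1 = {raw_d01:.1f}, 0-2 = {raw_d02:.1f}")
print(f"Standardized distances: 0-1 = {std_d01:.3f}, 0-2 = {std_d02:.3f}")
```

On the raw scale, person 1 is the nearest neighbor of person 0 because a 35-year age gap is tiny next to a few thousand dollars. After standardizing, person 2 becomes the nearest neighbor.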



_(vii)_ A second essential step is to set a random seed. Do so below (Hint: use the `np.random.seed` function, since the split in the next step draws from NumPy's random number generator). Why is setting a seed important? Is the particular seed value important? Why or why not?

# Your answer here
(vii) Setting a seed is important because it allows for the reproducibility of results. In machine learning and data science, many algorithms and processes involve some form of randomness, such as splitting data into training and test sets, initializing parameters, or performing random sampling. By setting a seed, you ensure that these random processes produce the same results each time the code is run. This is crucial for debugging, comparing models, and sharing results with others.

The specific value of the seed itself is not important. What is important is that you use a seed consistently. Different seed values will produce different sequences of random numbers, but as long as you use the same seed, you will get the same sequence of random numbers each time you run the code. Therefore, the particular seed value is arbitrary; it can be any integer. The key is to use the same seed if you want to reproduce the same results.
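As a small illustration of the reproducibility point, the sketch below seeds NumPy's global generator before drawing a boolean mask. The helper `draw_split` and the seed values 123 and 456 are arbitrary choices made for this example, not values required by the assignment.

```python
import numpy as np


def draw_split(seed, n=1000, p_train=0.75):
    """Seed NumPy's global RNG, then draw a boolean train/test mask of length n."""
    np.random.seed(seed)
    return np.random.choice([True, False], size=n, replace=True,
                            p=[p_train, 1 - p_train])


mask_a = draw_split(seed=123)
mask_b = draw_split(seed=123)   # same seed as mask_a
mask_c = draw_split(seed=456)   # a different, equally valid seed

# The same seed reproduces the same mask on every run.
print("Same seed gives identical masks:", np.array_equal(mask_a, mask_b))

# A different seed gives a different (but equally usable) partition.
print("Different seed gives identical masks:", np.array_equal(mask_a, mask_c))
```

Calling `np.random.seed(...)` once, just before the line that creates `split`, would have the same effect in the assignment code.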

_(viii)_ A third essential step is to split our standardized data into separate training and testing sets. We will split into 75% training and 25% testing. The provided code randomly partitions our data, and creates linked training sets for the predictors and response variables. Extend the code to create a non-overlapping test set for the predictors and response variables.

In [35]:
# Select predictors (excluding the 86th column)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Display the head of the standardized predictors
print(predictors_standardized.head())

# Create a random vector of True and False values for splitting
split = np.random.choice([True, False], size=len(predictors_standardized), replace=True, p=[0.75, 0.25])

# Define the training set for X (predictors)
training_X = predictors_standardized[split]

# Define the training set for Y (response)
training_Y = Caravan.loc[split, 'Purchase']

# Define the testing set for X (predictors)
testing_X = predictors_standardized[~split]

# Define the testing set for Y (response)
testing_Y = Caravan.loc[~split, 'Purchase']

# Display the shapes of the training and testing sets
print(f"Training set X shape: {training_X.shape}")
print(f"Training set Y shape: {training_Y.shape}")
print(f"Testing set X shape: {testing_X.shape}")
print(f"Testing set Y shape: {testing_Y.shape}")


    MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR  \
0  0.680906  -0.27258  0.406697 -1.216964  0.779405 -0.694311  0.217444   
1  0.992297  -0.27258 -0.859500 -1.216964  0.779405  0.302552 -0.365410   
2  0.992297  -0.27258 -0.859500 -1.216964  0.779405 -0.694311 -0.365410   
3 -1.187437  -0.27258  0.406697  0.010755 -0.970980  1.299414 -0.948264   
4  1.225840  -0.27258  1.672893 -1.216964  1.479559  0.302552 -0.365410   

     MGODOV    MGODGE    MRELGE  ...   ALEVEN  APERSONG   AGEZONG  AWAOREG  \
0 -0.068711 -0.161816  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   
1 -0.068711  0.464159 -0.096077  ... -0.20291 -0.073165 -0.081055 -0.05992   
2  0.914172  0.464159 -1.667319  ... -0.20291 -0.073165 -0.081055 -0.05992   
3  0.914172  0.464159 -0.619824  ... -0.20291 -0.073165 -0.081055 -0.05992   
4 -0.068711  0.464159  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   

     ABRAND   AZEILPL  APLEZIER   AFIETS   AINBOED  ABYSTAND  
0  0.764971 -0.02
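The mask-based split above assigns each row to the training set independently with probability 0.75, so the resulting proportions are only approximately 75/25. An alternative, sketched here on hypothetical stand-in data rather than `Caravan`, is scikit-learn's `train_test_split`, which gives an exact test fraction and can preserve the class balance through `stratify`. The `random_state` value is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 3 predictors, 6% 'Yes' labels.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["x1", "x2", "x3"])
y = pd.Series(["Yes"] * 60 + ["No"] * 940, name="Purchase")

# 75% training / 25% testing, keeping the Yes/No proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(f"Training rows: {len(X_train)}, testing rows: {len(X_test)}")
print(f"Share of 'Yes' in training: {(y_train == 'Yes').mean():.3f}")
print(f"Share of 'Yes' in testing:  {(y_test == 'Yes').mean():.3f}")

# The two sets never share a row.
overlap = set(X_train.index) & set(X_test.index)
print(f"Rows in both sets: {len(overlap)}")
```

Stratifying matters for a rare class such as `Purchase == 'Yes'`, since a purely random split can leave the test set with noticeably more or fewer positives than the full data.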

_(ix)_ We are finally set to fit the KNN model. In Python, we can use the `KNeighborsClassifier()` function. Fit the KNN with k=1. (You may review arguments to knn by typing `help(knn.fit)`). 

In [36]:
# Add your code here
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix


# Fit the KNN model with k=1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(training_X, training_Y)

# Make predictions on the test set
predictions = knn.predict(testing_X)

# Evaluate the model
print(confusion_matrix(testing_Y, predictions))
print(classification_report(testing_Y, predictions))


[[1233  102]
 [  65   11]]
              precision    recall  f1-score   support

          No       0.95      0.92      0.94      1335
         Yes       0.10      0.14      0.12        76

    accuracy                           0.88      1411
   macro avg       0.52      0.53      0.53      1411
weighted avg       0.90      0.88      0.89      1411



Using your fit model, answer the following questions:   
_(x)_ What is the prediction accuracy? (Hint: use the `score` method, and compare your model to `testing_Y`)  
_(xi)_ What is the prediction error? (Hint: compute it from the accuracy)

In [37]:
# prediction accuracy rate
accuracy = knn.score(testing_X, testing_Y)
print(f"Prediction accuracy: {accuracy}")


Prediction accuracy: 0.8816442239546421


In [52]:
# prediction error rate
predictor_error = 1 - accuracy
print(f"Predictor error: {predictor_error}")

Predictor error: 0.11835577604535785


_(xii)_ How does this prediction error/accuracy compare to what could be achieved via random guesses? To answer this, consider the percent of customers in the `Caravan` dataset who actually purchase insurance, computed below:

In [54]:
# Calculate the percentage of customers who purchase insurance
percentage_purchase = (Caravan['Purchase'].eq('Yes').sum() / len(Caravan['Purchase'])) * 100
print(f"Percentage of customers who purchase insurance: {percentage_purchase:.2f}%")

# Calculate the baseline accuracy
baseline_accuracy = Caravan['Purchase'].eq('No').sum() / len(Caravan['Purchase'])
print(f"Baseline accuracy (always guessing 'No'): {baseline_accuracy:.2f}")

# Calculate the baseline predictor error
baseline_error = 1 - baseline_accuracy
print(f"Baseline predictor error: {baseline_error:.2f}")


Percentage of customers who purchase insurance: 5.98%
Baseline accuracy (always guessing 'No'): 0.94
Baseline predictor error: 0.06


The KNN model's accuracy (88%) is lower than the baseline accuracy (94%). This suggests that the KNN model with k=1 is not performing better than simply guessing the most frequent class ('No'). The predictor error for the KNN model (12%) is higher than the baseline predictor error (6%), further indicating that the KNN model's performance is not better than the baseline for this dataset. This could be due to the imbalance in the dataset, where the majority class ('No') dominates the predictions. To improve the model, techniques such as balancing the dataset, tuning hyperparameters, or using more advanced models might be considered.
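The comparison can also be checked by hand from the k=1 confusion matrix printed earlier. The sketch below hard-codes those four counts (rows are the actual classes, columns the predicted classes, in the order 'No', 'Yes') and derives the accuracy, the always-'No' baseline on the same test set, and the precision and recall for 'Yes'.

```python
import numpy as np

# Confusion matrix reported for k=1: rows = actual (No, Yes), columns = predicted (No, Yes).
cm = np.array([[1233, 102],
               [65, 11]])

total = cm.sum()

# Accuracy: correct predictions (the diagonal) over all test rows.
accuracy = np.trace(cm) / total

# Baseline on this test set: always predict 'No', so every actual 'No' is correct.
majority_baseline = cm[0].sum() / total

# Precision and recall for the rare 'Yes' class.
yes_precision = cm[1, 1] / cm[:, 1].sum()
yes_recall = cm[1, 1] / cm[1].sum()

print(f"Accuracy:             {accuracy:.4f}")
print(f"Always-'No' baseline: {majority_baseline:.4f}")
print(f"Precision for 'Yes':  {yes_precision:.4f}")
print(f"Recall for 'Yes':     {yes_recall:.4f}")
```

Note that this baseline is computed on the test rows only, so it can differ slightly from the figure obtained from the full dataset.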

_(xiii)_ Fit a second KNN model, with $K=3$. Does this model perform better (i.e., have higher accuracy, compared to a random guess)?

In [57]:
# Your code here
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Fit the KNN model with k=3
knn_k3 = KNeighborsClassifier(n_neighbors=3)
knn_k3.fit(training_X, training_Y)

# Make predictions on the test set
predictions_k3 = knn_k3.predict(testing_X)

# Evaluate the model with k=3
print("KNN Model with K=3")
print(confusion_matrix(testing_Y, predictions_k3))
print(classification_report(testing_Y, predictions_k3))

# Calculate the prediction accuracy with k=3
accuracy_k3 = knn_k3.score(testing_X, testing_Y)
print(f"Prediction accuracy with K=3: {accuracy_k3:.2f}")

# Calculate the predictor error with k=3
predictor_error_k3 = 1 - accuracy_k3
print(f"Predictor error with K=3: {predictor_error_k3:.2f}")

# Baseline accuracy
baseline_accuracy = Caravan['Purchase'].eq('No').sum() / len(Caravan['Purchase'])
print(f"Baseline accuracy (always guessing 'No'): {baseline_accuracy:.2f}")

# Baseline predictor error
baseline_error = 1 - baseline_accuracy
print(f"Baseline predictor error: {baseline_error:.2f}")


KNN Model with K=3
[[1313   22]
 [  69    7]]
              precision    recall  f1-score   support

          No       0.95      0.98      0.97      1335
         Yes       0.24      0.09      0.13        76

    accuracy                           0.94      1411
   macro avg       0.60      0.54      0.55      1411
weighted avg       0.91      0.94      0.92      1411

Prediction accuracy with K=3: 0.94
Predictor error with K=3: 0.06
Baseline accuracy (always guessing 'No'): 0.94
Baseline predictor error: 0.06


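The KNN model with K=3 has a prediction accuracy of about 94% after rounding, which is essentially the same as the baseline of always guessing 'No'. More precisely, the confusion matrix gives (1313 + 7) / 1411 ≈ 0.936, slightly below the 0.940 baseline, so K=3 improves markedly on K=1 but still does not beat the baseline on accuracy.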
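To compare several values of K in one pass, a loop such as the following could be used. This is a sketch on synthetic imbalanced data from `make_classification`, not on `Caravan`; the sample size, class weights, candidate K values, and `random_state` are all assumptions made for illustration. As in the assignment, the predictors are standardized before splitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 6% positives (class 1).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.94, 0.06], random_state=0)
X = StandardScaler().fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline: accuracy of always predicting the more frequent class in the test set.
baseline = max((y_test == 0).mean(), (y_test == 1).mean())

# Fit one KNN model per candidate K and record its test accuracy.
accuracy_by_k = {}
for k in (1, 3, 5, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracy_by_k[k] = model.score(X_test, y_test)

print(f"Majority-class baseline: {baseline:.3f}")
for k, acc in accuracy_by_k.items():
    print(f"K={k}: accuracy = {acc:.3f}")
```

With a rare positive class, accuracy alone can hide poor performance on 'Yes', so it is worth reading the per-class precision and recall alongside it.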

# Criteria

|Criteria            |Complete           |Incomplete          |
|--------------------|---------------|--------------|
|Classification using KNN|All steps are done correctly and the answers are correct.|At least one step is done incorrectly leading to a wrong answer.|

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-2`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_2.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/applied_statistical_concepts/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
