# Assignment 2

As before, if a question can be answered with 'yes/no', or a numeric value, you may simply state as much. If you incorporate code from the internet (which is not required and generally not advisable), please cite the source within your code (providing a URL is sufficient).

We will go through comparable code and concepts in the live learning sessions. If you run into trouble, start by using Python's `help()` function to get information about the datasets and functions in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!).

Please bring questions that you cannot work out on your own to office hours, work periods or share with your peers on Slack. We will work with you through the issue.

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate whom you have worked with in your pull request by tagging their GitHub username. Separate submissions are required.

In [42]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Import specific objects
from sklearn.preprocessing import StandardScaler
from ISLP import load_data

### Question 1: Classification using KNN

We'll now use the `Caravan` dataset from the `ISLP` package. (You may use `Caravan.describe()` to review details of the dataset.) In this dataset, the response variable of interest is `Purchase`, which indicates if a given customer purchased a caravan insurance policy. We will simultaneously use all other variables in the dataset to predict the response variable.

In [43]:
# Load the "Caravan" dataset using the "load_data" function from the ISLP package
Caravan = load_data('Caravan')

# Add your code here
Caravan.describe()

Unnamed: 0,MOSTYPE,MAANTHUI,MGEMOMV,MGEMLEEF,MOSHOOFD,MGODRK,MGODPR,MGODOV,MGODGE,MRELGE,...,ALEVEN,APERSONG,AGEZONG,AWAOREG,ABRAND,AZEILPL,APLEZIER,AFIETS,AINBOED,ABYSTAND
count,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,...,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0,5822.0
mean,24.253349,1.110615,2.678805,2.99124,5.773617,0.696496,4.626932,1.069907,3.258502,6.183442,...,0.076606,0.005325,0.006527,0.004638,0.570079,0.000515,0.006012,0.031776,0.007901,0.014256
std,12.846706,0.405842,0.789835,0.814589,2.85676,1.003234,1.715843,1.017503,1.597647,1.909482,...,0.377569,0.072782,0.080532,0.077403,0.562058,0.022696,0.081632,0.210986,0.090463,0.119996
min,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,10.0,1.0,2.0,2.0,3.0,0.0,4.0,0.0,2.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,30.0,1.0,3.0,3.0,7.0,0.0,5.0,1.0,3.0,6.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
75%,35.0,1.0,3.0,3.0,8.0,1.0,6.0,2.0,4.0,7.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
max,41.0,10.0,5.0,6.0,10.0,9.0,9.0,5.0,9.0,9.0,...,8.0,1.0,1.0,2.0,7.0,1.0,2.0,3.0,2.0,2.0


Before fitting any model, it is essential to understand our data. Answer the following questions about the `Caravan` dataset (Hint: use `print` and `describe`):  
_(i)_ How many observations (rows) does the dataset contain?    
_(ii)_ How many variables (columns) does the dataset contain?    
_(iii)_ What 'variable' type is the response variable `Purchase` (e.g., 'character', 'factor', 'numeric', etc)? What are the 'levels' of the variable?    
_(iv)_ How many predictor variables do we have (Hint: all variables other than `Purchase`)?  

In [44]:

# Display the first few rows of the dataset to understand its structure
print(Caravan.head())

# Display summary statistics of the dataset
print(Caravan.describe())

# (i) How many observations (rows) does the dataset contain?
num_rows = Caravan.shape[0]
print(f"Number of observations (rows): {num_rows}")

# (ii) How many variables (columns) does the dataset contain?
num_columns = Caravan.shape[1]
print(f"Number of variables (columns): {num_columns}")

# (iii) What 'variable' type is the response variable `Purchase` (e.g., 'character', 'factor', 'numeric', etc)? What are the 'levels' of the variable?
purchase_dtype = Caravan['Purchase'].dtype
purchase_levels = Caravan['Purchase'].unique()
print(f"Variable type of 'Purchase': {purchase_dtype}")
print(f"Levels of 'Purchase': {purchase_levels}")

# (iv) How many predictor variables do we have (Hint: all variables other than `Purchase`)?
num_predictors = num_columns - 1
print(f"Number of predictor variables: {num_predictors}")


   MOSTYPE  MAANTHUI  MGEMOMV  MGEMLEEF  MOSHOOFD  MGODRK  MGODPR  MGODOV  \
0       33         1        3         2         8       0       5       1   
1       37         1        2         2         8       1       4       1   
2       37         1        2         2         8       0       4       2   
3        9         1        3         3         3       2       3       2   
4       40         1        4         2        10       1       4       1   

   MGODGE  MRELGE  ...  APERSONG  AGEZONG  AWAOREG  ABRAND  AZEILPL  APLEZIER  \
0       3       7  ...         0        0        0       1        0         0   
1       4       6  ...         0        0        0       1        0         0   
2       4       3  ...         0        0        0       1        0         0   
3       4       5  ...         0        0        0       1        0         0   
4       4       7  ...         0        0        0       1        0         0   

   AFIETS  AINBOED  ABYSTAND  Purchase  
0       0

Next, we must perform 'pre-processing' or 'data munging' to prepare our data for classification/prediction. For KNN, there are three essential steps. A first essential step is to 'standardize' the predictor variables. We can achieve this using the `StandardScaler` class, as follows:

In [45]:
# Select predictors (excluding the 86th column)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Display the head of the standardized predictors
print(predictors_standardized.head())

    MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR  \
0  0.680906  -0.27258  0.406697 -1.216964  0.779405 -0.694311  0.217444   
1  0.992297  -0.27258 -0.859500 -1.216964  0.779405  0.302552 -0.365410   
2  0.992297  -0.27258 -0.859500 -1.216964  0.779405 -0.694311 -0.365410   
3 -1.187437  -0.27258  0.406697  0.010755 -0.970980  1.299414 -0.948264   
4  1.225840  -0.27258  1.672893 -1.216964  1.479559  0.302552 -0.365410   

     MGODOV    MGODGE    MRELGE  ...   ALEVEN  APERSONG   AGEZONG  AWAOREG  \
0 -0.068711 -0.161816  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   
1 -0.068711  0.464159 -0.096077  ... -0.20291 -0.073165 -0.081055 -0.05992   
2  0.914172  0.464159 -1.667319  ... -0.20291 -0.073165 -0.081055 -0.05992   
3  0.914172  0.464159 -0.619824  ... -0.20291 -0.073165 -0.081055 -0.05992   
4 -0.068711  0.464159  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   

     ABRAND   AZEILPL  APLEZIER   AFIETS   AINBOED  ABYSTAND  
0  0.764971 -0.02

_(v)_ Why is it important to standardize the predictor variables?  

* ***Answer***: Standardizing the predictor variables is important because it ensures that each variable contributes equally to the distance calculations in KNN. KNN is a distance-based algorithm, and without standardization, variables with larger scales can dominate the distance metric, leading to biased results. Standardization transforms the variables to have a mean of 0 and a standard deviation of 1, putting them on a comparable scale.
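
To make the point about scale concrete, here is a minimal sketch on made-up data (the income and age columns are hypothetical and are not part of `Caravan`). It finds the nearest neighbour of the first row before and after standardizing, using the same centre-and-scale idea as `StandardScaler`.

```python
import numpy as np

# Hypothetical toy data: column 0 is income in dollars, column 1 is age in years.
X = np.array([
    [50_000.0, 25.0],   # row 0: the query customer
    [50_500.0, 60.0],   # row 1: similar income, very different age
    [52_000.0, 26.0],   # row 2: similar age, somewhat different income
    [20_000.0, 40.0],   # rows 3-4: extra customers that set the overall spread
    [90_000.0, 50.0],
])

def nearest_to_first(data):
    """Return the index of the row closest to row 0 in Euclidean distance."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

# In raw units, income differences (hundreds of dollars) swamp age differences.
nearest_raw = nearest_to_first(X)

# After standardizing each column to mean 0 and standard deviation 1,
# both variables contribute on a comparable scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
nearest_std = nearest_to_first(X_std)

print(f"Nearest neighbour on raw data: row {nearest_raw}")
print(f"Nearest neighbour on standardized data: row {nearest_std}")
```

On the raw data the nearest neighbour is row 1, even though that customer is 35 years older; after standardizing it is row 2, whose income and age are both close to the query.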

_(vi)_ Why did we elect not to standardize our response variable `Purchase`?  

* ***Answer***: We did not standardize the response variable `Purchase` because it is a categorical variable indicating whether or not a customer purchased a caravan insurance policy. Standardization applies to numeric variables, where it puts them on the same scale; it is not meaningful for categories. The response variable should remain in its original form for classification tasks.



_(vii)_ A second essential step is to set a random seed. Do so below (Hint: use the `random.seed` function). Why is setting a seed important? Is the particular seed value important? Why or why not?

* ***Answer***: Setting a seed is important because it ensures that the random processes in our analysis (such as data splitting and shuffling) yield the same results every time the code is run. This consistency is crucial for reproducibility and lets others verify the results. The particular seed value itself is not important; what matters is that the same seed is used on every run.

In [46]:
# Add your code here
np.random.seed(42) # set a random seed
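
As a small illustration of the reproducibility point, the sketch below reseeds NumPy's global generator before each draw. The helper name `draw_mask` and the seed values are arbitrary choices made for this example.

```python
import numpy as np

def draw_mask(seed, n=1000):
    """Seed NumPy's global generator, then draw a random True/False mask."""
    np.random.seed(seed)
    return np.random.choice([True, False], size=n, p=[0.75, 0.25])

same_a = draw_mask(42)
same_b = draw_mask(42)   # same seed: the draw is repeated exactly
other = draw_mask(7)     # different seed: a different, equally valid draw

print("Same seed gives identical masks:", np.array_equal(same_a, same_b))
print("Positions where seeds 42 and 7 disagree:", int((same_a != other).sum()))
```

Any integer would work as the seed; the point is only that reusing it reproduces the same partition.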


_(viii)_ A third essential step is to split our standardized data into separate training and testing sets. We will split into 75% training and 25% testing. The provided code randomly partitions our data, and creates linked training sets for the predictors and response variables. Extend the code to create a non-overlapping test set for the predictors and response variables.

In [47]:
np.random.seed(42) # set a random seed

# Create a random vector of True and False values
split = np.random.choice([True, False], size=len(predictors_standardized), replace=True, p=[0.75, 0.25])

# Define the training set for X (predictors)
training_X = predictors_standardized[split]

# Define the training set for Y (response)
training_Y = Caravan.loc[split, 'Purchase']

# Define the testing set for X (predictors)
testing_X = predictors_standardized[~split]

# Define the testing set for Y (response)
testing_Y = Caravan.loc[~split, 'Purchase']

# Display the shapes of the training and testing sets to verify the split
print(f"Shape of training_X: {training_X.shape}")
print(f"Shape of training_Y: {training_Y.shape}")
print(f"Shape of testing_X: {testing_X.shape}")
print(f"Shape of testing_Y: {testing_Y.shape}")

Shape of training_X: (4383, 85)
Shape of training_Y: (4383,)
Shape of testing_X: (1439, 85)
Shape of testing_Y: (1439,)
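
The shapes above give a training share of 4383 / 5822, or about 75.3%, rather than exactly 75%, because each row is assigned independently with probability 0.75. The sketch below, which uses a hypothetical stand-in array of row indices rather than `Caravan` itself, shows one way to check that a mask and its negation produce non-overlapping sets that together cover every row.

```python
import numpy as np

np.random.seed(42)

# Hypothetical stand-in for the row positions of a dataset with 1000 rows.
row_ids = np.arange(1000)

# Same masking idea as above: True means "training", False means "testing".
mask = np.random.choice([True, False], size=len(row_ids), p=[0.75, 0.25])
train_ids = row_ids[mask]
test_ids = row_ids[~mask]

# No row should appear in both sets, and every row should appear in one of them.
no_overlap = np.intersect1d(train_ids, test_ids).size == 0
covers_all = len(train_ids) + len(test_ids) == len(row_ids)
train_share = len(train_ids) / len(row_ids)

print(f"No overlap: {no_overlap}, covers all rows: {covers_all}")
print(f"Training share: {train_share:.3f}")
```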


_(ix)_ We are finally set to fit the KNN model. In Python, we can use the `KNeighborsClassifier` class from `sklearn.neighbors`. Fit the KNN with k=1. (Once you have created a model object named `knn`, you may review the arguments to its fit method by typing `help(knn.fit)`.)

In [48]:
# Add your code here

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize the KNN model with k=1
knn = KNeighborsClassifier(n_neighbors=1)

# Fit the KNN model using the training data
knn.fit(training_X, training_Y)

# Predict the response for the testing set
predictions = knn.predict(testing_X)

# Evaluate the model's performance
accuracy = accuracy_score(testing_Y, predictions)
conf_matrix = confusion_matrix(testing_Y, predictions)
class_report = classification_report(testing_Y, predictions)

# Print the results
print(f"Accuracy of KNN model with k=1: {accuracy}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)


Accuracy of KNN model with k=1: 0.8957609451007644
Confusion Matrix:
[[1280   80]
 [  70    9]]
Classification Report:
              precision    recall  f1-score   support

          No       0.95      0.94      0.94      1360
         Yes       0.10      0.11      0.11        79

    accuracy                           0.90      1439
   macro avg       0.52      0.53      0.53      1439
weighted avg       0.90      0.90      0.90      1439
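
To connect the confusion matrix to the report, the sketch below recomputes the key figures by hand from the matrix printed above. It assumes the usual scikit-learn layout (rows are actual classes, columns are predicted classes, in the order 'No', 'Yes').

```python
import numpy as np

# Confusion matrix printed above for k=1.
# Assumed layout: rows = actual (No, Yes), columns = predicted (No, Yes).
cm = np.array([[1280, 80],
               [70, 9]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()    # share of all test customers classified correctly
precision_yes = tp / (tp + fp)     # share of predicted "Yes" that really are "Yes"
recall_yes = tp / (tp + fn)        # share of actual "Yes" that the model finds

print(f"Accuracy:        {accuracy:.4f}")
print(f"Precision (Yes): {precision_yes:.2f}")
print(f"Recall (Yes):    {recall_yes:.2f}")
```

The high overall accuracy comes almost entirely from the 'No' class; only 9 of the 79 actual purchasers in the test set are identified.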



Using your fit model, answer the following questions:   
_(x)_ What is the prediction accuracy? (Hint: use the `score` method, and compare your model to `testing_Y`)  
_(xi)_ What is the prediction error? (Hint: compute it from the accuracy)

In [49]:
# prediction accuracy rate
accuracy = accuracy_score(testing_Y, predictions)

# another way to get accuracy
accuracy = knn.score(testing_X, testing_Y)

print(f"Prediction accuracy of KNN model with k=1: {accuracy:.2f}")



Prediction accuracy of KNN model with k=1: 0.90


In [50]:
# prediction error rate

prediction_error = 1 - accuracy
print(f"Prediction error of KNN model with k=1: {prediction_error:.2f}")

Prediction error of KNN model with k=1: 0.10


_(xii)_ How does this prediction error/accuracy compare to what could be achieved via random guesses? To answer this, consider the percent of customers in the `Caravan` dataset who actually purchase insurance, computed below:

In [53]:
# Calculate the percentage of customers who purchase insurance
percentage_purchase = (Caravan['Purchase'].eq('Yes').sum() / len(Caravan['Purchase']) * 100)

print(f"Percentage of customers who purchase insurance: {percentage_purchase:.2f}%")

# Since the majority class is "No", the baseline accuracy is the percentage of "No" responses
percentage_no_purchase = 100 - percentage_purchase

print(f"Baseline accuracy (always predicting 'No'): {percentage_no_purchase:.2f}%")

'''
Answer: Because most customers do not purchase insurance, always predicting "No" 
gives a baseline accuracy equal to the percentage of "No" responses (94.02%). 
If the KNN model's accuracy is clearly higher than this baseline, 
the model is doing better than always predicting the majority class. 
Conversely, if the KNN model's accuracy is close to or below the baseline, 
the model is not performing well.

In this case, the accuracy with k=1 is about 90%, which is below the 94.02% baseline, 
so judged by accuracy alone the model is not performing well.

'''


Percentage of customers who purchase insurance: 5.98%
Baseline accuracy (always predicting 'No'): 94.02%
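
The 94.02% baseline above is computed on the full dataset. As a rough cross-check, the sketch below computes the same majority-class baseline on the test set only, using the class counts from the "support" column of the k=1 classification report and the accuracy printed for that model.

```python
# Test-set class counts, taken from the "support" column of the k=1 report.
n_no, n_yes = 1360, 79
n_test = n_no + n_yes

# Accuracy of a rule that always predicts "No" on the test set.
baseline_accuracy = n_no / n_test

# Accuracy printed above for the k=1 model.
knn1_accuracy = 0.8957609451007644

beats_baseline = knn1_accuracy > baseline_accuracy

print(f"Baseline accuracy on the test set: {baseline_accuracy:.4f}")  # 0.9451
print(f"KNN (k=1) accuracy:                {knn1_accuracy:.4f}")      # 0.8958
print(f"KNN (k=1) beats the baseline:      {beats_baseline}")         # False
```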



_(xiii)_ Fit a second KNN model, with $K=3$. Does this model perform better (i.e., have higher accuracy, compared to a random guess)?

In [52]:
# Your code here
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Initialize the KNN model with k=3
knn3 = KNeighborsClassifier(n_neighbors=3)

# Fit the KNN model using the training data
knn3.fit(training_X, training_Y)

# Predict the response for the testing set
predictions = knn3.predict(testing_X)

# Evaluate the model's performance
accuracy_knn3 = accuracy_score(testing_Y, predictions)
conf_matrix_knn3 = confusion_matrix(testing_Y, predictions)
class_report_knn3 = classification_report(testing_Y, predictions) 
# Calculate the prediction error
prediction_error_knn3 = 1 - accuracy_knn3


# Print the accuracy and prediction error
print(f"Accuracy of KNN model with k=3: {accuracy_knn3:.2f}")
print(f"Prediction error of KNN model with k=3: {prediction_error_knn3:.2f}")


# Print the confusion matrix and classification report
print("Confusion Matrix:")
print(conf_matrix_knn3)
print("Classification Report:")
print(class_report_knn3)


# The k=3 model has higher accuracy than the k=1 model (93% vs. 90%),
# but it is still slightly below the baseline accuracy of 94.02% from always predicting "No".


Accuracy of KNN model with k=3: 0.93
Prediction error of KNN model with k=3: 0.07
Confusion Matrix:
[[1334   26]
 [  76    3]]
Classification Report:
              precision    recall  f1-score   support

          No       0.95      0.98      0.96      1360
         Yes       0.10      0.04      0.06        79

    accuracy                           0.93      1439
   macro avg       0.52      0.51      0.51      1439
weighted avg       0.90      0.93      0.91      1439
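
Accuracy is only one way to compare these models with guessing. As a hedged side-by-side view, the sketch below takes the two confusion matrices printed above and compares, for each k, the share of predicted purchasers who actually purchased against the share of purchasers in the test set. The name "lift" for that ratio is a label chosen for this example, not something defined in the assignment.

```python
import numpy as np

# Confusion matrices printed above.
# Assumed layout: rows = actual (No, Yes), columns = predicted (No, Yes).
matrices = {
    1: np.array([[1280, 80], [70, 9]]),
    3: np.array([[1334, 26], [76, 3]]),
}

summary = {}
for k, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    base_rate = (fn + tp) / cm.sum()   # share of actual "Yes" in the test set
    precision = tp / (tp + fp)         # share of predicted "Yes" that are correct
    summary[k] = {
        "accuracy": (tn + tp) / cm.sum(),
        "precision_yes": precision,
        "lift": precision / base_rate,  # above 1 means better than picking at random
    }
    print(f"k={k}:", {name: round(float(value), 3) for name, value in summary[k].items()})
```

Both models fall short of the majority-class baseline on accuracy, yet among the customers they flag as 'Yes', roughly 10% actually purchase, compared with about 5.5% of the test set overall.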



# Criteria

|Criteria            |Complete           |Incomplete          |
|--------------------|---------------|--------------|
|Classification using KNN|All steps are done correctly and the answers are correct.|At least one step is done incorrectly leading to a wrong answer.|

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-2`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_2.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/applied_statistical_concepts/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
