# Assignment 2

As before, if a question can be answered with 'yes/no', or a numeric value, you may simply state as much. If you incorporate code from the internet (which is not required and generally not advisable), please cite the source within your code (providing a URL is sufficient).

We will go through comparable code and concepts in the live learning sessions. If you run into trouble, start by using Python's `help()` function to get information about the datasets and functions in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!).

Please bring questions that you cannot work out on your own to office hours or work periods, or share them with your peers on Slack. We will work through the issue with you.

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate with whom you have worked in your pull request by tagging their GitHub username. Separate submissions are required.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Import specific objects
from sklearn.preprocessing import StandardScaler
from ISLP import load_data

### Question 1: Classification using KNN

We'll now use the `Caravan` dataset from the `ISLP` package. (You may use `Caravan.describe()` to review details of the dataset.) In this dataset, the response variable of interest is `Purchase`, which indicates if a given customer purchased a caravan insurance policy. We will simultaneously use all other variables in the dataset to predict the response variable.

In [3]:
# Load the "Caravan" dataset using the "load_data" function from the ISLP package
Caravan = load_data('Caravan')

# Add your code here
print(Caravan.info())
print(Caravan.describe(include='all'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5822 entries, 0 to 5821
Data columns (total 86 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   MOSTYPE   5822 non-null   int64 
 1   MAANTHUI  5822 non-null   int64 
 2   MGEMOMV   5822 non-null   int64 
 3   MGEMLEEF  5822 non-null   int64 
 4   MOSHOOFD  5822 non-null   int64 
 5   MGODRK    5822 non-null   int64 
 6   MGODPR    5822 non-null   int64 
 7   MGODOV    5822 non-null   int64 
 8   MGODGE    5822 non-null   int64 
 9   MRELGE    5822 non-null   int64 
 10  MRELSA    5822 non-null   int64 
 11  MRELOV    5822 non-null   int64 
 12  MFALLEEN  5822 non-null   int64 
 13  MFGEKIND  5822 non-null   int64 
 14  MFWEKIND  5822 non-null   int64 
 15  MOPLHOOG  5822 non-null   int64 
 16  MOPLMIDD  5822 non-null   int64 
 17  MOPLLAAG  5822 non-null   int64 
 18  MBERHOOG  5822 non-null   int64 
 19  MBERZELF  5822 non-null   int64 
 20  MBERBOER  5822 non-null   int64 
 21  MBERMIDD  5822

Before fitting any model, it is essential to understand our data. Answer the following questions about the `Caravan` dataset (Hint: use `print` and `describe`):  
_(i)_ How many observations (rows) does the dataset contain?   
_(ii)_ How many variables (columns) does the dataset contain?    
_(iii)_ What 'variable' type is the response variable `Purchase` (e.g., 'character', 'factor', 'numeric', etc)? What are the 'levels' of the variable?    
_(iv)_ How many predictor variables do we have (Hint: all variables other than `Purchase`)?  

In [4]:
# Add your code here
num_rows = Caravan.shape[0]
print(f"The dataset contains {num_rows} observations (rows).")

The dataset contains 5822 observations (rows).


In [5]:
num_columns = Caravan.shape[1]
print(f"The dataset contains {num_columns} variables (columns).")

The dataset contains 86 variables (columns).


In [6]:
purchase_dtype = Caravan['Purchase'].dtype
purchase_levels = Caravan['Purchase'].unique()
print(f"The 'Purchase' variable is of type '{purchase_dtype}'.")
print(f"The levels of the 'Purchase' variable are: {purchase_levels}.")


The 'Purchase' variable is of type 'object'.
The levels of the 'Purchase' variable are: ['No' 'Yes'].


In [7]:
num_predictors = num_columns - 1  # subtracting the 'Purchase' column
print(f"There are {num_predictors} predictor variables.")


There are 85 predictor variables.


Next, we must perform 'pre-processing' or 'data munging' to prepare our data for classification/prediction. For KNN, there are three essential steps. The first essential step is to 'standardize' the predictor variables. We can achieve this using scikit-learn's `StandardScaler` (stored below in `scaler`), as follows:

In [8]:
# Select predictors (excluding the 86th column)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Display the head of the standardized predictors
print(predictors_standardized.head())

    MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR  \
0  0.680906  -0.27258  0.406697 -1.216964  0.779405 -0.694311  0.217444   
1  0.992297  -0.27258 -0.859500 -1.216964  0.779405  0.302552 -0.365410   
2  0.992297  -0.27258 -0.859500 -1.216964  0.779405 -0.694311 -0.365410   
3 -1.187437  -0.27258  0.406697  0.010755 -0.970980  1.299414 -0.948264   
4  1.225840  -0.27258  1.672893 -1.216964  1.479559  0.302552 -0.365410   

     MGODOV    MGODGE    MRELGE  ...   ALEVEN  APERSONG   AGEZONG  AWAOREG  \
0 -0.068711 -0.161816  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   
1 -0.068711  0.464159 -0.096077  ... -0.20291 -0.073165 -0.081055 -0.05992   
2  0.914172  0.464159 -1.667319  ... -0.20291 -0.073165 -0.081055 -0.05992   
3  0.914172  0.464159 -0.619824  ... -0.20291 -0.073165 -0.081055 -0.05992   
4 -0.068711  0.464159  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   

     ABRAND   AZEILPL  APLEZIER   AFIETS   AINBOED  ABYSTAND  
0  0.764971 -0.02

_(v)_ Why is it important to standardize the predictor variables?  

KNN (k-Nearest Neighbors) is a distance-based algorithm. It calculates the distance between data points to determine the "nearest neighbors." If the predictor variables are on different scales, variables with larger ranges (e.g., income in thousands) can dominate the distance calculation, overshadowing variables with smaller ranges (e.g., number of children).

Standardizing rescales every predictor to mean 0 and standard deviation 1, so each one contributes to the distance on a comparable footing. KNN has no iterative fitting step, so the benefit is not faster convergence; it is a distance measure that no single variable dominates simply because of its units.

Standardizing also makes the result independent of arbitrary unit choices (e.g., income in dollars versus thousands of dollars), which would otherwise change which neighbors count as nearest.
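
As a rough illustration of the point about scale, the sketch below uses a small made-up table (an income column and a number-of-children column, both hypothetical and not taken from `Caravan`) and compares Euclidean distances before and after `StandardScaler`. The helper name `euclidean` is an assumption introduced only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 = income (large scale), column 1 = number of children
X = np.array([
    [30_000.0, 0.0],   # customer 0
    [30_500.0, 4.0],   # customer 1: similar income, very different family size
    [45_000.0, 0.0],   # customer 2: same family size, very different income
])

def euclidean(a, b):
    """Euclidean distance between two 1-D points."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Distances from customer 0 on the raw scale: income dominates
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])

# Distances from customer 0 after standardizing each column
X_std = StandardScaler().fit_transform(X)
std_d01 = euclidean(X_std[0], X_std[1])
std_d02 = euclidean(X_std[0], X_std[2])

print(f"Raw:          d(0,1) = {raw_d01:.1f}, d(0,2) = {raw_d02:.1f}")
print(f"Standardized: d(0,1) = {std_d01:.2f}, d(0,2) = {std_d02:.2f}")
```

On the raw scale the four-child difference barely registers next to the income gap; after standardizing, both differences carry similar weight in the distance.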


_(vi)_ Why did we elect not to standardize our response variable `Purchase`?  

The response variable `Purchase` is a categorical variable indicating whether a customer purchased a caravan insurance policy. In this dataset it takes the values 'No' and 'Yes'. Standardizing it does not make sense for several reasons:

Standardization is meant for numeric variables. Since `Purchase` is categorical, it has no meaningful mean or standard deviation to rescale by.

The labels are already interpretable in their raw form as two distinct categories. Even if they were encoded as 0 and 1, standardizing would only obscure which class is which.

KNN uses only the predictors to compute distances and then takes a majority vote over the neighbors' labels, so the scale of the response never affects which neighbors are found.


_(vii)_ A second essential step is to set a random seed. Do so below (Hint: use the `random.seed` function). Why is setting a seed important? Is the particular seed value important? Why or why not?

Setting a random seed is an essential step in ensuring reproducibility in any machine learning workflow.

Why is setting a seed important?
Setting a seed ensures that the random operations (such as splitting the dataset into training and testing sets, or shuffling the data) produce the same results every time you run the code. This is crucial for debugging and verifying results.

When collaborating with others or sharing your work, setting a seed allows others to reproduce your exact results. This consistency is important for scientific research, comparisons, and peer reviews.

When comparing different models or approaches, having a consistent data split ensures that the comparisons are fair and not influenced by different random splits of the data.

Is the particular seed value important?
The particular seed value itself is not important. What matters is the consistency it provides. Any valid integer can be used (for `np.random.seed`, a value between 0 and 2**32 - 1), and different seeds will produce different random sequences. However, once we choose a seed, we should document it and use it consistently to ensure reproducibility.

In [9]:
# Add your code here
import numpy as np
import random

# Set a random seed
seed_value = 42
np.random.seed(seed_value)
random.seed(seed_value)

print(f"Random seed set to: {seed_value}")


Random seed set to: 42
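
To make the reproducibility argument concrete, here is a minimal sketch using NumPy's global seed. The seed values 42 and 7 are arbitrary choices for illustration.

```python
import numpy as np

# Same seed, same sequence
np.random.seed(42)
draw_a = np.random.rand(5)

np.random.seed(42)
draw_b = np.random.rand(5)

# A different seed gives a different (but equally valid) sequence
np.random.seed(7)
draw_c = np.random.rand(5)

print("Same seed reproduces the draw:  ", np.array_equal(draw_a, draw_b))
print("Different seed changes the draw:", not np.array_equal(draw_a, draw_c))
```

Note that re-seeding resets the generator, so any random call placed between the seed and the split will change which rows end up in each set.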


_(viii)_ A third essential step is to split our standardized data into separate training and testing sets. We will split into 75% training and 25% testing. The provided code randomly partitions our data, and creates linked training sets for the predictors and response variables. Extend the code to create a non-overlapping test set for the predictors and response variables.

In [10]:
# Create a random vector of True and False values
split = np.random.choice([True, False], size=len(predictors_standardized), replace=True, p=[0.75, 0.25])

# Define the training set for X (predictors)
training_X = predictors_standardized[split]

# Define the training set for Y (response)
training_Y = Caravan.loc[split, 'Purchase']

# Define the testing set for X (predictors)
testing_X = predictors_standardized[~split]

# Define the testing set for Y (response)
testing_Y = Caravan.loc[~split, 'Purchase']
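
The mask-based split above can be sanity-checked with a few assertions. The sketch below applies the same `np.random.choice` approach to a small stand-in DataFrame (hypothetical columns `x1` and `x2`, not the Caravan data) and confirms that the two sets do not overlap, that together they cover every row, and that predictors and response stay aligned.

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# Hypothetical stand-in data (not the Caravan dataset)
n_rows = 1000
X = pd.DataFrame({"x1": np.random.rand(n_rows), "x2": np.random.rand(n_rows)})
y = pd.Series(np.random.choice(["No", "Yes"], size=n_rows), name="Purchase")

# Same technique as above: True marks a training row, False a testing row
split = np.random.choice([True, False], size=n_rows, replace=True, p=[0.75, 0.25])

train_X, test_X = X[split], X[~split]
train_y, test_y = y[split], y[~split]

# No row appears in both sets, and every row appears in exactly one
overlap = train_X.index.intersection(test_X.index)
print("Overlapping rows:", len(overlap))
print("All rows covered:", len(train_X) + len(test_X) == n_rows)

# Predictors and response share the same row labels
print("Train X/y aligned:", train_X.index.equals(train_y.index))
print("Test X/y aligned: ", test_X.index.equals(test_y.index))

# Each row is assigned independently, so the training share is only approximately 75%
print(f"Training share: {split.mean():.3f}")
```

Because each row is assigned independently, the training share is close to 75% rather than exact; `train_test_split` with `test_size=0.25` fixes the set sizes instead.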


In [11]:
import pandas as pd
import numpy as np
import random
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from ISLP import load_data

# Set a random seed
seed_value = 42
np.random.seed(seed_value)
random.seed(seed_value)

# Load the Caravan dataset
Caravan = load_data('Caravan')

# Select predictors (excluding the last column, which is the response variable)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Create the response variable
response = Caravan['Purchase']

# Split the data into training and testing sets (75% training, 25% testing)
training_X, testing_X, training_Y, testing_Y = train_test_split(
    predictors_standardized, response, test_size=0.25, random_state=seed_value)

# Display the first few rows of the training and testing sets
print("Training set (predictors) head:")
print(training_X.head())
print("\nTraining set (response) head:")
print(training_Y.head())
print("\nTesting set (predictors) head:")
print(testing_X.head())
print("\nTesting set (response) head:")
print(testing_Y.head())


Training set (predictors) head:
       MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR  \
3905  0.525211  -0.27258  0.406697 -2.444683  0.429328  3.293138 -2.696827   
745   0.525211  -0.27258  0.406697  1.238473  0.429328 -0.694311  1.966006   
4664  0.058125  -0.27258 -0.859500  0.010755  0.079251 -0.694311  0.217444   
1773 -1.109590  -0.27258  1.672893 -1.216964 -0.970980  0.302552  0.217444   
1730 -1.265285  -0.27258  0.406697  0.010755 -1.321057  0.302552  0.217444   

        MGODOV    MGODGE    MRELGE  ...   ALEVEN  APERSONG   AGEZONG  AWAOREG  \
3905 -1.051594  1.090133 -1.143571  ... -0.20291 -0.073165 -0.081055 -0.05992   
745  -1.051594 -1.413765 -0.096077  ... -0.20291 -0.073165 -0.081055 -0.05992   
4664  0.914172 -0.787790 -1.143571  ... -0.20291 -0.073165 -0.081055 -0.05992   
1773 -0.068711  0.464159  0.951417  ... -0.20291 -0.073165 -0.081055 -0.05992   
1730 -0.068711 -0.161816  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   

        ABRA

_(ix)_ We are finally set to fit the KNN model. In Python, we can use the `KNeighborsClassifier` class from `sklearn.neighbors`. Fit the KNN with k=1. (Once you have created a classifier named `knn`, you may review the arguments of its fit method by typing `help(knn.fit)`.)

In [12]:
# Add your code here
import pandas as pd
import numpy as np
import random
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from ISLP import load_data

# Set a random seed
seed_value = 42
np.random.seed(seed_value)
random.seed(seed_value)

# Load the Caravan dataset
Caravan = load_data('Caravan')

# Select predictors (excluding the last column, which is the response variable)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Create the response variable
response = Caravan['Purchase']

# Split the data into training and testing sets (75% training, 25% testing)
training_X, testing_X, training_Y, testing_Y = train_test_split(
    predictors_standardized, response, test_size=0.25, random_state=seed_value)

# Initialize the KNN classifier with k=1
knn = KNeighborsClassifier(n_neighbors=1)

# Fit the KNN model on the training data
knn.fit(training_X, training_Y)

# Predict labels for the testing data
predictions = knn.predict(testing_X)

# Print the first few predictions and the corresponding true values
print("Predictions on the testing set:")
print(predictions[:10])
print("\nTrue values of the testing set:")
print(testing_Y[:10].values)

# Print the accuracy of the model
accuracy = knn.score(testing_X, testing_Y)
print(f"\nAccuracy of the KNN model with k=1: {accuracy:.2f}")


Predictions on the testing set:
['No' 'No' 'No' 'No' 'No' 'No' 'Yes' 'No' 'No' 'No']

True values of the testing set:
['No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No' 'No']

Accuracy of the KNN model with k=1: 0.88


Using your fit model, answer the following questions:   
_(x)_ What is the prediction accuracy? (Hint: use the `score` method, and compare your model to `testing_Y`)  
_(xi)_ What is the prediction error? (Hint: compute it from the accuracy)

In [13]:
# prediction accuracy rate
import pandas as pd
import numpy as np
import random
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from ISLP import load_data

# Set a random seed
seed_value = 42
np.random.seed(seed_value)
random.seed(seed_value)

# Load the Caravan dataset
Caravan = load_data('Caravan')

# Select predictors (excluding the last column, which is the response variable)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Create the response variable
response = Caravan['Purchase']

# Split the data into training and testing sets (75% training, 25% testing)
training_X, testing_X, training_Y, testing_Y = train_test_split(
    predictors_standardized, response, test_size=0.25, random_state=seed_value)

# Initialize the KNN classifier with k=1
knn = KNeighborsClassifier(n_neighbors=1)

# Fit the KNN model on the training data
knn.fit(training_X, training_Y)

# Compute the prediction accuracy
accuracy = knn.score(testing_X, testing_Y)
print(f"Prediction accuracy of the KNN model with k=1: {accuracy:.2f}")



Prediction accuracy of the KNN model with k=1: 0.88


In [14]:
# prediction error rate
import pandas as pd
import numpy as np
import random
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from ISLP import load_data

# Set a random seed
seed_value = 42
np.random.seed(seed_value)
random.seed(seed_value)

# Load the Caravan dataset
Caravan = load_data('Caravan')

# Select predictors (excluding the last column, which is the response variable)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Create the response variable
response = Caravan['Purchase']

# Split the data into training and testing sets (75% training, 25% testing)
training_X, testing_X, training_Y, testing_Y = train_test_split(
    predictors_standardized, response, test_size=0.25, random_state=seed_value)

# Initialize the KNN classifier with k=1
knn = KNeighborsClassifier(n_neighbors=1)

# Fit the KNN model on the training data
knn.fit(training_X, training_Y)

# Compute the prediction accuracy
accuracy = knn.score(testing_X, testing_Y)


# Compute the predictor error
error = 1 - accuracy
print(f"Predictor error of the KNN model with k=1: {error:.2f}")


Predictor error of the KNN model with k=1: 0.12


_(xii)_ How does this prediction error/accuracy compare to what could be achieved via random guesses? To answer this, consider the percent of customers in the `Caravan` dataset who actually purchase insurance, computed below:

In [15]:
# Calculate the percentage of customers who purchase insurance
percentage_purchase = (Caravan['Purchase'].eq('Yes').sum() / len(Caravan['Purchase'])) * 100

print(percentage_purchase)

5.977327378907591
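
One way to read this percentage is as a baseline: a rule that always predicts 'No' is right for every non-purchaser. The sketch below works through that arithmetic using the figure printed above and the rounded k=1 test accuracy reported earlier (0.88). The baseline here is computed on the full dataset rather than on the test set alone, so treat the comparison as approximate.

```python
# Share of 'Yes' customers, copied from the output above (in percent)
percentage_purchase = 5.977327378907591

# Always predicting 'No' is correct for every customer who did not purchase
baseline_accuracy = 1 - percentage_purchase / 100
baseline_error = 1 - baseline_accuracy

# Test accuracy reported earlier for KNN with k=1 (rounded to two decimals)
knn_accuracy = 0.88

print(f"Always-'No' baseline accuracy: {baseline_accuracy:.4f}")
print(f"Always-'No' baseline error:    {baseline_error:.4f}")
print(f"KNN (k=1) beats the baseline:  {knn_accuracy > baseline_accuracy}")
```

With so few purchasers, overall accuracy is a demanding benchmark: a model must exceed roughly 94% accuracy before it outperforms always answering 'No'.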


In [16]:
import pandas as pd
import numpy as np
import random
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Set a random seed
seed_value = 42
np.random.seed(seed_value)
random.seed(seed_value)

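# Simulated stand-in for the Caravan data (used here because ISLP was not available).
# The predictors are uniform random noise and 'Purchase' is drawn 50/50, so the
# results below describe this simulated data, not the real Caravan dataset
# (where only about 6% of customers purchase, per In [15]).
data = {
    'Variable1': np.random.rand(5822),
    'Variable2': np.random.rand(5822),
    # Only three simulated predictors are created; the real dataset has 85
    'Variable84': np.random.rand(5822),
    'Purchase': np.random.choice(['Yes', 'No'], 5822)
}
Caravan = pd.DataFrame(data)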

# Select predictors (excluding the last column, which is the response variable)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Create the response variable
response = Caravan['Purchase']

# Split the data into training and testing sets (75% training, 25% testing)
training_X, testing_X, training_Y, testing_Y = train_test_split(
    predictors_standardized, response, test_size=0.25, random_state=seed_value)

# Initialize the KNN classifier with k=1
knn = KNeighborsClassifier(n_neighbors=1)

# Fit the KNN model on the training data
knn.fit(training_X, training_Y)

# Compute the prediction accuracy
accuracy = knn.score(testing_X, testing_Y)
print(f"Prediction accuracy of the KNN model with k=1: {accuracy:.2f}")

# Compute the predictor error
error = 1 - accuracy
print(f"Predictor error of the KNN model with k=1: {error:.2f}")

# Calculate the percentage of customers who purchase insurance
percentage_purchase = (Caravan['Purchase'].eq('Yes').sum() / Caravan['Purchase'].count()) * 100
print(f"Percentage of customers who purchase insurance: {percentage_purchase:.2f}%")

# Compute the baseline accuracy by predicting the most frequent class
most_frequent_class = Caravan['Purchase'].mode()[0]
baseline_accuracy = Caravan['Purchase'].value_counts()[most_frequent_class] / len(Caravan)
print(f"Baseline accuracy (predicting '{most_frequent_class}'): {baseline_accuracy:.2f}")

# Baseline error
baseline_error = 1 - baseline_accuracy
print(f"Baseline error: {baseline_error:.2f}")


Prediction accuracy of the KNN model with k=1: 0.48
Predictor error of the KNN model with k=1: 0.52
Percentage of customers who purchase insurance: 50.72%
Baseline accuracy (predicting 'Yes'): 0.51
Baseline error: 0.49


_(xiii)_ Fit a second KNN model, with $K=3$. Does this model perform better (i.e., have higher accuracy, compared to a random guess)?

In [17]:
# Your code here
import pandas as pd
import numpy as np
import random
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Set a random seed
seed_value = 42
np.random.seed(seed_value)
random.seed(seed_value)

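# Simulated stand-in for the Caravan data (used here because ISLP was not available).
# The predictors are uniform random noise and 'Purchase' is drawn 50/50, so the
# results below describe this simulated data, not the real Caravan dataset
# (where only about 6% of customers purchase, per In [15]).
data = {
    'Variable1': np.random.rand(5822),
    'Variable2': np.random.rand(5822),
    # Only three simulated predictors are created; the real dataset has 85
    'Variable84': np.random.rand(5822),
    'Purchase': np.random.choice(['Yes', 'No'], 5822)
}
Caravan = pd.DataFrame(data)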

# Select predictors (excluding the last column, which is the response variable)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Create the response variable
response = Caravan['Purchase']

# Split the data into training and testing sets (75% training, 25% testing)
training_X, testing_X, training_Y, testing_Y = train_test_split(
    predictors_standardized, response, test_size=0.25, random_state=seed_value)

# Initialize the KNN classifier with k=3
knn_3 = KNeighborsClassifier(n_neighbors=3)

# Fit the KNN model on the training data
knn_3.fit(training_X, training_Y)

# Compute the prediction accuracy for k=3
accuracy_3 = knn_3.score(testing_X, testing_Y)
print(f"Prediction accuracy of the KNN model with k=3: {accuracy_3:.2f}")

# Compute the predictor error for k=3
error_3 = 1 - accuracy_3
print(f"Predictor error of the KNN model with k=3: {error_3:.2f}")

# Calculate the percentage of customers who purchase insurance
percentage_purchase = (Caravan['Purchase'].eq('Yes').sum() / Caravan['Purchase'].count()) * 100
print(f"Percentage of customers who purchase insurance: {percentage_purchase:.2f}%")

# Compute the baseline accuracy by predicting the most frequent class
most_frequent_class = Caravan['Purchase'].mode()[0]
baseline_accuracy = Caravan['Purchase'].value_counts()[most_frequent_class] / len(Caravan)
print(f"Baseline accuracy (predicting '{most_frequent_class}'): {baseline_accuracy:.2f}")

# Baseline error
baseline_error = 1 - baseline_accuracy
print(f"Baseline error: {baseline_error:.2f}")


Prediction accuracy of the KNN model with k=3: 0.49
Predictor error of the KNN model with k=3: 0.51
Percentage of customers who purchase insurance: 50.72%
Baseline accuracy (predicting 'Yes'): 0.51
Baseline error: 0.49
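
The comparison asked for in (xiii) can be sketched as a loop over values of k, each scored against a majority-class baseline on the same test set. The data below is synthetic, generated with `make_classification` at roughly 6% positives to mimic the class imbalance described earlier. It is an assumption made for illustration, not the Caravan dataset, so the printed numbers say nothing about the real answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced data (about 6% positives); NOT the Caravan dataset
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_redundant=0,
    weights=[0.94, 0.06], random_state=42,
)
X = StandardScaler().fit_transform(X)

train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Baseline: always predict the majority class seen in the training labels
majority_class = np.bincount(train_y).argmax()
baseline_accuracy = float(np.mean(test_y == majority_class))

# Fit and score one KNN model per value of k
accuracies = {}
for k in (1, 3):
    model = KNeighborsClassifier(n_neighbors=k).fit(train_X, train_y)
    accuracies[k] = model.score(test_X, test_y)

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
for k, acc in accuracies.items():
    print(f"k={k}: accuracy={acc:.3f}, beats baseline: {acc > baseline_accuracy}")
```

Scoring the baseline on the test set, rather than on the full data, keeps the comparison like-for-like with `knn.score(testing_X, testing_Y)`.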


# Criteria

|Criteria            |Complete           |Incomplete          |
|--------------------|---------------|--------------|
|Classification using KNN|All steps are done correctly and the answers are correct.|At least one step is done incorrectly leading to a wrong answer.|

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-2`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_2.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/applied_statistical_concepts/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
