# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Matthew De Filippo

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [59]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [60]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam
(X, y) = load_spam()
# TO DO: Print size and type of X and y
print(f'The shape of X is: {X.shape} \n') 
print(f'The type of X is:\n {X.dtypes}\n')

print(f'The shape of y is: {y.shape} \n') 
print(f'The type of y is: {y.dtypes}')

The shape of X is: (4600, 57) 

The type of X is:
 word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64

### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [61]:
# TO DO: Check if there are any missing values and fill them in if necessary
print('Running X.isnull().sum() gives:\n')
print(X.isnull().sum())
print(f'Running y.isnull().sum() gives: {y.isnull().sum()}')

# By running the above methods on X and y, we see that there are 0 null entries in each
# column in the X dataframe and 0 null entries in the y dataseries.

# Therefore, there are no missing values and we can proceed to the next step.

Running X.isnull().sum() gives:

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_fr

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [62]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split
X_scrap, X_small, y_scrap, y_small = train_test_split(X, y, test_size=0.05, random_state=0)
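
Because `train_test_split` returns the test portion second, passing `test_size=0.05` makes `X_small` and `y_small` the 5% slice and leaves the remaining 95% in the scrap variables. The dependency-free sketch below illustrates the same idea on a plain list. `holdout_split` is a hypothetical helper written for illustration, not the sklearn implementation, and its rounding rule is an assumption.

```python
import random

def holdout_split(rows, test_fraction, seed=0):
    """Shuffle a copy of rows and return (large_part, small_part)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Assumption: round to the nearest whole row, keeping at least one.
    n_small = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_small:], shuffled[:n_small]

rows = list(range(4600))  # stand-in for the 4600 spam samples
scrap, small = holdout_split(rows, test_fraction=0.05)
print(len(scrap), len(small))
```

The small part holds 230 rows, which matches the `(230, 57)` shape of `X_small` reported in the results.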

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3
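
For sklearn classifiers, the `score` method returns the mean accuracy: the fraction of predictions that match the true labels. A minimal sketch of that calculation in plain Python, using made-up labels purely for illustration:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels, not taken from the spam dataset.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 4 of 5 predictions are correct
```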

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [63]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

# Import LogisticRegression from sklearn
from sklearn.linear_model import LogisticRegression

# Instantiate model LogisticRegression(max_iter=2000); implement with X and y.
model_all = LogisticRegression(max_iter=2000)
X_train_all, X_val_all, y_train_all, y_val_all = train_test_split(X, y, test_size=0.2, random_state=0)
model_all.fit(X_train_all, y_train_all)

# Instantiate model LogisticRegression(max_iter=2000); implement with first two columns of X and y.
model_2col = LogisticRegression(max_iter=2000)
X_train_2col, X_val_2col, y_train_2col, y_val_2col = train_test_split(X[['word_freq_make', 'word_freq_address']], y, test_size=0.2, random_state=0)
model_2col.fit(X_train_2col, y_train_2col)

# Instantiate model LogisticRegression(max_iter=2000); implement with X_small and y_small.
model_small = LogisticRegression(max_iter=2000)
X_train_small, X_val_small, y_train_small, y_val_small = train_test_split(X_small, y_small, test_size=0.2, random_state=0)
model_small.fit(X_train_small, y_train_small)

# Creating the results DataFrame.
results = pd.DataFrame({'Data Size': [X.shape, X[['word_freq_make', 'word_freq_address']].shape, X_small.shape], 
                        'Training Accuracy': [model_all.score(X_train_all, y_train_all), model_2col.score(X_train_2col, y_train_2col), model_small.score(X_train_small, y_train_small)], 
                       'Validation Accuracy': [model_all.score(X_val_all, y_val_all), model_2col.score(X_val_2col, y_val_2col), model_small.score(X_val_small, y_val_small)]})

# Print results
results

| | Data Size | Training Accuracy | Validation Accuracy |
|---|---|---|---|
| 0 | (4600, 57) | 0.927174 | 0.93587 |
| 1 | (4600, 2) | 0.614946 | 0.593478 |
| 2 | (230, 57) | 0.956522 | 0.804348 |


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

#### Question 1 Response:

The best performance was observed with the full dataset, which gave training and validation accuracy scores of 0.93 and 0.94 respectively.

When we reduced the dataset to only the first two columns, the training and validation scores were 0.61 and 0.59 respectively. This model exhibited much higher bias and performed worse.

When we used only a 5% subset of the dataset, validation performance was also worse. The training and validation scores were approximately 0.96 and 0.80 respectively. This model exhibited higher variance, as the validation score was significantly lower than the training score.
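
One way to make the comparison concrete is to look at the gap between training and validation accuracy for each run. The sketch below simply redoes that arithmetic on the values from the results table; the dictionary labels are my own shorthand.

```python
# Accuracy values copied from the results table above.
scores = {
    "full data (4600, 57)": (0.927174, 0.93587),
    "two columns (4600, 2)": (0.614946, 0.593478),
    "5% subset (230, 57)": (0.956522, 0.804348),
}

# A large positive gap (training well above validation) points to overfitting.
gaps = {name: train - val for name, (train, val) in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: train - validation = {gap:+.3f}")
```

The 5% subset has by far the largest gap (about 0.15), while the two-column model has a small gap but low scores on both sets.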

#### Question 2 Response:

A false positive in this scenario means that an email is predicted to be spam when it is in fact not spam. A false negative means that an email is predicted to not be spam when it is in fact spam. A false positive is worse, as the user could miss important emails. False negatives are also harmful, however, if the user is not able to recognize the scams that spam emails often contain.
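
The two error types can be counted directly from the true and predicted labels. This is a minimal sketch in plain Python; `confusion_counts` is a hypothetical helper, the labels are made up, and it assumes that 1 means spam and 0 means not spam.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the spam label."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1  # legitimate email flagged as spam
        elif t == positive:
            fn += 1  # spam that reached the inbox
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical labels for illustration (assumes 1 = spam, 0 = not spam).
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```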

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

#### Question 1 Response:
Code was based on the provided class examples (particularly Linear Example). Aspects of the overall code structure and layout were modified to suit the specific assignment requirements.

#### Question 2 Response:
I completed the steps in the order recommended by the assignment (1->2->3->4->5).

#### Question 3 Response:
I did not use generative AI for the completion of this section.

#### Question 4 Response:
I did experience some challenges throughout the completion of this section:
   - I was a bit confused as to what to do with the spare variables created when the dataset was originally split to create the 5% sections. I thought perhaps the 95% sections should be used later on for a portion of the training data. Reading the discussion board confirmed that these variables were not required.
   - I struggled recalling the syntax to create the DataFrame to store the results. I searched Google which helped me remember how to create this data structure using a dictionary.


## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [64]:
# TO DO: Import concrete dataset from yellowbrick library
from yellowbrick.datasets import load_concrete
(X, y) = load_concrete()
# TO DO: Print size and type of X and y
print(f'The shape of X is: {X.shape} \n') 
print(f'The type of X is:\n {X.dtypes}\n')

print(f'The shape of y is: {y.shape} \n') 
print(f'The type of y is: {y.dtypes}')

The shape of X is: (1030, 8) 

The type of X is:
 cement    float64
slag      float64
ash       float64
water     float64
splast    float64
coarse    float64
fine      float64
age         int64
dtype: object

The shape of y is: (1030,) 

The type of y is: float64


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [65]:
# TO DO: Check if there are any missing values and fill them in if necessary
print('Running X.isnull().sum() gives:\n')
print(X.isnull().sum())
print(f'Running y.isnull().sum() gives: {y.isnull().sum()}')

# By running the above methods on X and y, we see that there are 0 null entries in each
# column in the X dataframe and 0 null entries in the y dataseries.

# Therefore, there are no missing values and we can proceed to the next step.

Running X.isnull().sum() gives:

cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64
Running y.isnull().sum() gives: 0
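
This dataset needed no imputation, but if missing values had been present, one common option for numeric columns is to fill them with the column median. The sketch below shows the idea in plain Python on a made-up column; the values and the helper name `fill_with_median` are hypothetical.

```python
import math
from statistics import median

def fill_with_median(values):
    """Replace None/NaN entries with the median of the observed values."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in values if not is_missing(v)]
    fill = median(observed)
    return [fill if is_missing(v) else v for v in values]

# Hypothetical column with two gaps (not from the concrete dataset).
water = [162.0, None, 228.0, float("nan"), 192.0]
print(fill_with_median(water))
```

The median is less sensitive to outliers than the mean, which is why it is often preferred for skewed numeric features.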


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [66]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0

# Import LinearRegression from sklearn.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split the data, then instantiate and fit LinearRegression().
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [67]:
# TO DO: ADD YOUR CODE HERE
# Use the model to make predictions using both the training and validation sets.
y_pred_train = model.predict(X_train)
y_pred_val = model.predict(X_val)

# Calculating the training and validation accuracy using mean squared error.
from sklearn.metrics import mean_squared_error
linear_mse_train = mean_squared_error(y_train, y_pred_train)
linear_mse_val = mean_squared_error(y_val, y_pred_val)

# Calculating the training and validation accuracy using R2 score.
linear_r2_train = model.score(X_train, y_train)
linear_r2_val = model.score(X_val, y_val)

# Printing the results to the display.
print(f"Training Accuracy (MSE): {linear_mse_train}")
print(f"Validation Accuracy (MSE): {linear_mse_val}")
print(f"Training Accuracy (R^2): {linear_r2_train}")
print(f"Validation Accuracy (R^2): {linear_r2_val}")


Training Accuracy (MSE): 110.34550122934108
Validation Accuracy (MSE): 95.63533482690428
Training Accuracy (R^2): 0.6090710418548884
Validation Accuracy (R^2): 0.6368981103411242


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [68]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame({'Type': ['MSE Score', 'R2 Score'], 
                        'Training Accuracy': [linear_mse_train, linear_r2_train], 
                       'Validation Accuracy': [linear_mse_val, linear_r2_val]})

# Set the index to 'Type' and print the results to the display.
results.set_index('Type')

| Type | Training Accuracy | Validation Accuracy |
|---|---|---|
| MSE Score | 110.345501 | 95.635335 |
| R2 Score | 0.609071 | 0.636898 |


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

#### Question 1 Response:
Using a linear model did not produce good results for this dataset. 

The MSE values for the training and validation sets were large, at approximately 110 and 96 respectively. This shows that the predicted values were quite far from the actual values for both sets.

The R^2 training and validation scores were 0.61 and 0.64 respectively. Since they are close to each other and well below the maximum value of 1, it is clear that we are underfitting the data; we have low variance and high bias. This suggests that the relationship is potentially non-linear; therefore, a linear model is not suitable.
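
Because MSE is in squared units, taking its square root (RMSE) gives an error size in the same units as the target, which can be easier to judge. A short sketch of that arithmetic, using the MSE values printed in Step 4:

```python
import math

# MSE values copied from the Step 4 output above.
mse_train = 110.34550122934108
mse_val = 95.63533482690428

# RMSE is in the same units as the target, so it reads as a typical error size.
rmse_train = math.sqrt(mse_train)
rmse_val = math.sqrt(mse_val)
print(f"RMSE train: {rmse_train:.2f}, RMSE validation: {rmse_val:.2f}")
```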

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

#### Question 1 Response:
Code was based on the provided class examples (particularly Linear Regression) as well as the Colab notebook for Lab 2. Aspects of the overall code structure and layout were modified to suit the specific assignment requirements.

#### Question 2 Response:
I completed the steps in the order recommended by the assignment (1->2->3->4->5).

#### Question 3 Response:
I did not use generative AI for the completion of this section.

#### Question 4 Response:
I did experience some challenges throughout the completion of this section:
   - At first, I did not recall how to compute the mean squared error. I referenced the Lab 2 Colab notebook to determine how to perform these calculations.
   - I struggled to recall the syntax for setting the index of the results DataFrame to the 'Type' column. A Google search helped me recall the .set_index() method.


## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


#### Part 1: Classification Observations

 - For the spam dataset, we observed that, in general, our model performs better when we use more of the data. When we used the full dataset, we achieved training and validation scores of 0.93 and 0.94 respectively. When we used two columns only, we achieved training and validation scores of 0.61 and 0.59 respectively. When we used only a 5% subset of the data, we achieved training and validation scores of 0.96 and 0.80 respectively.
 - Limiting the number of data columns resulted in a model with much higher bias (i.e. we were underfitting the data).
 - Limiting the number of data rows resulted in a model with much higher variance (i.e. we were overfitting the data).
 
#### Part 2: Regression Observations
 
 - For the concrete dataset, the linear model we used resulted in R^2 training and validation accuracies of 0.61 and 0.64 respectively. Therefore, this model exhibited high bias (i.e. we were underfitting the data).
 - The mean squared errors for the training and validation sets were relatively large, at approximately 110 and 96 respectively.
 - This shows that there are cases where a linear model cannot accurately predict the relationship between variables and other methods need to be considered.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


#### Reflection Response:
I liked that this assignment gave me an opportunity to apply the linear models and validation methods that we have been learning throughout the first few weeks of the course. It gave me a better understanding of the concepts of bias/variance and how they can be impacted by the data and methods of modelling that we employ.

I found it interesting to be able to quantify how reducing the amount of data used from a particular source affects the results. This made me realize how important it is to use as large a dataset as possible when using linear modelling methods.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.
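
For reference, the two methods differ only in the penalty added to the least-squares loss. As documented for scikit-learn's `Ridge` and `Lasso`, with weight vector $w$, $n$ training samples, and regularization strength $\alpha$, the objectives are:

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Setting $\alpha = 0$ recovers ordinary least squares in both cases. Because the Lasso loss term is divided by $2n$ while the Ridge loss term is not, the same numeric value of $\alpha$ does not represent the same amount of regularization in the two methods.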

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
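
"Along the logarithmic scale" here means one value per power of ten. A small sketch of generating such a grid in plain Python rather than typing it by hand (one value per decade is my reading of the instruction, so treat the spacing as an assumption):

```python
import math

# Powers of ten from 10^-3 up to 10^2, i.e. one value per decade.
alphas = [10.0 ** exponent for exponent in range(-3, 3)]
print(alphas)

# Consecutive values differ by a constant factor of 10 (equal steps in log10).
ratios = [b / a for a, b in zip(alphas, alphas[1:])]
assert all(math.isclose(r, 10.0) for r in ratios)
```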

In [82]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Ridge, Lasso

alpha = [0.001, 0.01, 0.1, 1, 10, 100]
ridge_mse_train = []
ridge_mse_val = []
ridge_r2_train = []
ridge_r2_val = []
lasso_mse_train = []
lasso_mse_val = []
lasso_r2_train = []
lasso_r2_val = []

for a in alpha:
    # Initialize and train a Ridge regression model.
    ridge_model = Ridge(alpha=a).fit(X_train, y_train)
    
    # Make predictions.
    y_pred_train_ridge = ridge_model.predict(X_train)
    y_pred_val_ridge = ridge_model.predict(X_val)
    
    # Evaluate the ridge regression model.
    ridge_mse_train.append(mean_squared_error(y_train, y_pred_train_ridge))
    ridge_mse_val.append(mean_squared_error(y_val, y_pred_val_ridge))
    ridge_r2_train.append(ridge_model.score(X_train, y_train))
    ridge_r2_val.append(ridge_model.score(X_val, y_val))
    
    # Initialize and train a lasso regression model.
    lasso_model = Lasso(alpha=a).fit(X_train, y_train)
    
    # Make predictions.
    y_pred_train_lasso = lasso_model.predict(X_train)
    y_pred_val_lasso = lasso_model.predict(X_val)

    # Evaluate the lasso regression model.
    lasso_mse_train.append(mean_squared_error(y_train, y_pred_train_lasso))
    lasso_mse_val.append(mean_squared_error(y_val, y_pred_val_lasso))

    lasso_r2_train.append(lasso_model.score(X_train, y_train))
    lasso_r2_val.append(lasso_model.score(X_val, y_val))

results = pd.DataFrame({'Alpha': alpha, 
                       'Ridge MSE Training Accuracy': ridge_mse_train,
                       'Ridge MSE Validation Accuracy': ridge_mse_val,
                       'Ridge R^2 Training Accuracy': ridge_r2_train,
                       'Ridge R^2 Validation Accuracy': ridge_r2_val,
                       'Lasso MSE Training Accuracy': lasso_mse_train,
                       'Lasso MSE Validation Accuracy': lasso_mse_val,
                       'Lasso R^2 Training Accuracy': lasso_r2_train,
                       'Lasso R^2 Validation Accuracy': lasso_r2_val})
results


| | Alpha | Ridge MSE Training Accuracy | Ridge MSE Validation Accuracy | Ridge R^2 Training Accuracy | Ridge R^2 Validation Accuracy | Lasso MSE Training Accuracy | Lasso MSE Validation Accuracy | Lasso R^2 Training Accuracy | Lasso R^2 Validation Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001 | 110.345501 | 95.635335 | 0.609071 | 0.636898 | 110.345501 | 95.634971 | 0.609071 | 0.636899 |
| 1 | 0.01 | 110.345501 | 95.635334 | 0.609071 | 0.636898 | 110.345507 | 95.631698 | 0.609071 | 0.636912 |
| 2 | 0.1 | 110.345501 | 95.635324 | 0.609071 | 0.636898 | 110.34612 | 95.599545 | 0.609069 | 0.637034 |
| 3 | 1.0 | 110.345501 | 95.635231 | 0.609071 | 0.636899 | 110.40734 | 95.33585 | 0.608852 | 0.638035 |
| 4 | 10.0 | 110.345502 | 95.634301 | 0.609071 | 0.636902 | 112.093055 | 95.114791 | 0.60288 | 0.638874 |
| 5 | 100.0 | 110.345597 | 95.625173 | 0.609071 | 0.636937 | 151.368492 | 126.142568 | 0.463736 | 0.52107 |


*ANSWER HERE*

As per the results table, the alpha values that gave the best results for each method were:
- Alpha = 100, resulting in an R^2 validation accuracy of 0.636937 (for the Ridge method)
- Alpha = 10, resulting in an R^2 validation accuracy of 0.638874 (for the Lasso method)

This is still poor performance as it is very far from the maximum R^2 value of 1.0. We are still underfitting the data (we have low variance and high bias). This likely confirms our suspicion that the data is not linearly related and another modelling technique will need to be used.
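
The selection above can be reproduced programmatically instead of reading it off the table. The sketch below uses the validation R^2 values copied from the results table; `best_alpha` is a hypothetical helper, and the baseline is the plain `LinearRegression` validation score from Part 2.

```python
# Validation R^2 values copied from the results table above.
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
ridge_r2_val = [0.636898, 0.636898, 0.636898, 0.636899, 0.636902, 0.636937]
lasso_r2_val = [0.636899, 0.636912, 0.637034, 0.638035, 0.638874, 0.52107]
baseline_r2_val = 0.636898  # plain LinearRegression from Part 2

def best_alpha(alphas, scores):
    """Return (alpha, score) for the highest validation score."""
    return max(zip(alphas, scores), key=lambda pair: pair[1])

for name, scores in [("Ridge", ridge_r2_val), ("Lasso", lasso_r2_val)]:
    alpha, score = best_alpha(alphas, scores)
    print(f"{name}: alpha={alpha}, R^2={score}, gain={score - baseline_r2_val:+.6f}")
```

The largest gain over the unregularized baseline is about 0.002 in R^2, which supports the conclusion that regularization alone does not fix the underfitting.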