# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Nur-Alhuda Ali

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [55]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [56]:
# TO DO: Import spam dataset from yellowbrick library
# TO DO: Print size and type of X and y

from yellowbrick.datasets import load_spam
X, y = load_spam() # returns features into X and target into y

print("Size of X: ", X.size, "elements")
print("Shape of X: ", X.shape)
print("Type of X:\n", X.dtypes, "\n")

print("Size of y: ", y.size)
print("Type of y: ", y.dtypes, "\n")

Size of X:  262200 elements
Shape of X:  (4600, 57)
Type of X:
 word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp            

### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [42]:
# TO DO: Check if there are any missing values and fill them in if necessary
print("Number of nulls in feature matrix:\n", X.isnull().sum().sort_values(ascending=False), "\n")
print("Number of nulls in target vector: ", y.isnull().sum())

Number of nulls in feature matrix:
 word_freq_make                0
word_freq_labs                0
word_freq_857                 0
word_freq_data                0
word_freq_415                 0
word_freq_85                  0
word_freq_technology          0
word_freq_1999                0
word_freq_parts               0
word_freq_pm                  0
word_freq_direct              0
word_freq_cs                  0
word_freq_meeting             0
word_freq_original            0
word_freq_project             0
word_freq_re                  0
word_freq_edu                 0
word_freq_table               0
word_freq_conference          0
char_freq_;                   0
char_freq_(                   0
char_freq_[                   0
char_freq_!                   0
char_freq_$                   0
char_freq_#                   0
capital_run_length_average    0
capital_run_length_longest    0
word_freq_telnet              0
word_freq_lab                 0
word_freq_address             0
word

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [43]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split

X_big, X_small, y_big, y_small = train_test_split(X, y, test_size=0.05, random_state=5)
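
The 5% split above can be checked on stand-in data. The sketch below is only an illustration: the random array has the same number of rows as the spam data (4600), while the three feature columns and the class ratio are arbitrary assumptions. The `stratify` argument is an optional extra that the notebook does not use; it keeps the class ratio of the small subset close to that of the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 4600 rows like the spam set; 3 columns and a 40% positive
# rate are arbitrary assumptions made for this illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4600, 3))
y_demo = np.array([1] * 1840 + [0] * 2760)

# test_size=0.05 places 5% of the rows in the second and fourth return values.
# stratify=y_demo (optional, not used in the notebook) preserves the class ratio.
X_big, X_small, y_big, y_small = train_test_split(
    X_demo, y_demo, test_size=0.05, random_state=5, stratify=y_demo
)

print("Rows in X_small:", X_small.shape[0])
print("Positive rate (small vs full):", y_small.mean(), y_demo.mean())
```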

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [57]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

from sklearn.linear_model import LogisticRegression
model_logreg = LogisticRegression(max_iter=2000)

# Dataset: X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model1 = model_logreg.fit(X_train, y_train)

size1 = X.size
training_score1 = model1.score(X_train, y_train)
validation_score1 = model1.score(X_test, y_test)
datasets = {'X & y' : [size1, training_score1, validation_score1]}  # add info to dictionary (used to create results table)

print("FOR DATASET: X and y")
print("Training score: {:.2f}".format(training_score1))
print("Validation score: {:.2f}".format(validation_score1))

# Dataset: First 2 Columns of X, and y
X_twocol = X.iloc[:, :2]
X_train, X_test, y_train, y_test = train_test_split(X_twocol, y, random_state=0)
model2 = model_logreg.fit(X_train, y_train)

size2 = X_twocol.size
training_score2 = model2.score(X_train, y_train)
validation_score2 = model2.score(X_test, y_test)
datasets['First 2 col of X & y'] = [size2, training_score2, validation_score2]  # add info to dictionary

print("\nFOR DATASET: First two columns of X and y")
print("Training score: {:.2f}".format(training_score2))
print("Validation score: {:.2f}".format(validation_score2))

# Dataset: X_small and y_small
X_train, X_test, y_train, y_test = train_test_split(X_small, y_small, random_state=0)
model3 = model_logreg.fit(X_train, y_train)

size3 = X_small.size
training_score3 = model3.score(X_train, y_train)
validation_score3 = model3.score(X_test, y_test)
datasets['X_small & y_small'] = [size3, training_score3, validation_score3] # add info to dictionary

print("\nFOR DATASET: X_small and y_small")
print("Training score: {:.2f}".format(training_score3))
print("Validation score: {:.2f}".format(validation_score3))

# Visualization
results = pd.DataFrame.from_dict(datasets, orient='index', columns=['Data Size', 'Training Accuracy', 'Validation Accuracy'])
print("\nRESULTS TABLE")
print("-----------------------------------------------------------------------")
print(results)

FOR DATASET: X and y
Training score: 0.93
Validation score: 0.94

FOR DATASET: First two columns of X and y
Training score: 0.61
Validation score: 0.61

FOR DATASET: X_small and y_small
Training score: 0.97
Validation score: 0.81

RESULTS TABLE
-----------------------------------------------------------------------
                      Data Size  Training Accuracy  Validation Accuracy
X & y                    262200           0.928696             0.937391
First 2 col of X & y       9200           0.608406             0.613043
X_small & y_small         13110           0.970930             0.810345
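
The hint in the code cell suggests a loop. A minimal sketch of that idea follows, assuming synthetic data from `make_classification` in place of the spam dataset (the 1000 rows, 10 features, and column names are made up). It also creates a new `LogisticRegression` for each dataset, because `fit` returns the same estimator object, so reusing one instance means `model1`, `model2`, and `model3` all end up pointing at the last fitted model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: 1000 rows, 10 features).
X_arr, y_arr = make_classification(n_samples=1000, n_features=10, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
y = pd.Series(y_arr)

# 5% subset, built the same way as X_small and y_small in Step 2.
_, X_small, _, y_small = train_test_split(X, y, test_size=0.05, random_state=5)

datasets = {
    "X & y": (X, y),
    "First 2 col of X & y": (X.iloc[:, :2], y),
    "X_small & y_small": (X_small, y_small),
}

rows = {}
for name, (X_i, y_i) in datasets.items():
    X_train, X_test, y_train, y_test = train_test_split(X_i, y_i, random_state=0)
    # A new estimator per dataset, so no earlier fit is overwritten.
    model = LogisticRegression(max_iter=2000).fit(X_train, y_train)
    rows[name] = [X_i.size, model.score(X_train, y_train), model.score(X_test, y_test)]

results = pd.DataFrame.from_dict(
    rows,
    orient="index",
    columns=["Data Size", "Training Accuracy", "Validation Accuracy"],
)
print(results)
```

The scores printed by this sketch describe the synthetic data only and are not comparable to the spam results above.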


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*
1. Using only the first 2 columns (features) of X, with a data size of 9,200 elements, gave low and nearly equal training and validation accuracies (0.61 and 0.61), which means the model had high bias and low variance. The model used only 2 of the 57 available features, so it could not capture the full behaviour of the data.

    Using 5% of the data, with a data size of 13,110 elements, raised the training accuracy to 0.97 and the validation accuracy to 0.81. All 57 features were used, so the model could weigh every feature even though it saw only 5% of the rows. However, the gap between the two scores shows that the model overfit the small sample (high variance).

    Using the entire dataset, with a data size of 262,200 elements, gave the highest validation accuracy of all 3 datasets (0.94). The training accuracy (0.93) is slightly lower than the validation accuracy, most likely because of the particular random train-test split rather than a real effect; a different split or cross-validation would probably change this small gap.

    Based on these observations, we can conclude that increasing the dataset size and the number of features used in the model improves the performance of the model as it better informs the model about the behaviour (both local and general) of the data.

2. A false positive represents an email that is flagged as spam when it actually is not. A false negative is an email that is NOT flagged as spam, when it actually is.

    In this case, a false positive is worse because if good mail is flagged as spam, then the email user could miss important/time-sensitive emails that go to the spam folder.
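
The two error types can be counted with a confusion matrix. The sketch below uses made-up labels (1 = spam, 0 = not spam), not predictions from the notebook's model.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration: 1 = spam, 0 = not spam.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("False positives (good mail flagged as spam):", fp)
print("False negatives (spam that reached the inbox):", fn)
print("Precision:", tp / (tp + fp))
```

Since a false positive is the more costly error here, precision (the share of flagged emails that really are spam) is a natural metric to watch alongside accuracy.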

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*\
\
I used the ENSF 611 Jupyter notebooks as a general guideline for how to use SciKit-Learn, but other than that, I wrote my own code. I completed the steps in order. Before every step, I had a general idea of which libraries and functions I would need, so I studied the SciKit-Learn documentation (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to understand what the functions do, what they return, and what their arguments are. This ensured not only that I was using the correct syntax, but also that I was getting the results I wanted.

I also referred to the ENSF 611 Model Selection and Linear Classification Jupyter notebooks to help me understand and interpret my results. I generally don't like to use generative AI to help me code, as I prefer to research manually and read the documentation for the libraries I am using myself, so I did not use any generative AI for this process.

One thing I found challenging was creating the results DataFrame in Step 5 efficiently, since it had been a few months since I had used Pandas. I already had the idea of using a dictionary to store the results with the dataset description as the key, but I had forgotten how to turn that into a DataFrame. Initially, I tried to use nested for-loops, which led me to think that there must be an easier way to do it. So I read the Pandas documentation and found a quick and easy solution using the function `pd.DataFrame.from_dict()`.
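
As a small illustration of that approach, the sketch below builds the table directly from a dictionary. The numbers are copied from the results table printed in Step 5; with `orient="index"`, each key becomes a row label and each list becomes one row.

```python
import pandas as pd

# Keys become row labels; each list holds one row of values.
# Values copied from the Step 5 results table in this notebook.
datasets = {
    "X & y": [262200, 0.928696, 0.937391],
    "First 2 col of X & y": [9200, 0.608406, 0.613043],
    "X_small & y_small": [13110, 0.970930, 0.810345],
}

results = pd.DataFrame.from_dict(
    datasets,
    orient="index",
    columns=["Data Size", "Training Accuracy", "Validation Accuracy"],
)
print(results)
```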

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [58]:
# TO DO: Import concrete dataset from yellowbrick library
# TO DO: Print size and type of X and y
from yellowbrick.datasets import load_concrete
X, y = load_concrete()

print("Size of X: ", X.size, "elements")
print("Shape of X: ", X.shape)
print("Type of X:\n", X.dtypes, "\n")

print("Size of y: ", y.size)
print("Type of y: ", y.dtypes, "\n")

Size of X:  8240 elements
Shape of X:  (1030, 8)
Type of X:
 cement    float64
slag      float64
ash       float64
water     float64
splast    float64
coarse    float64
fine      float64
age         int64
dtype: object 

Size of y:  1030
Type of y:  float64 



### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [59]:
# TO DO: Check if there are any missing values and fill them in if necessary
print("Number of nulls in X:\n", X.isnull().sum().sort_values(ascending=False), "\n")
print("Number of nulls in y: ", y.isnull().sum())

Number of nulls in X:
 cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64 

Number of nulls in y:  0
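
No values are missing here, so nothing needs to be filled in. For reference, a minimal sketch of one common option, median imputation, is given below. The small DataFrame and its values are made up; only the column names echo the concrete dataset.

```python
import numpy as np
import pandas as pd

# Made-up example with one missing value (not the real concrete data).
df = pd.DataFrame({"cement": [540.0, np.nan, 332.5], "age": [28, 270, 365]})

print("Nulls per column before filling:\n", df.isnull().sum())

# Replace each missing entry with the median of its own column.
filled = df.fillna(df.median())

print("Nulls per column after filling:\n", filled.isnull().sum())
```

The median is less sensitive to outliers than the mean, which is why it is often a reasonable default for skewed numeric features; whether it is appropriate depends on the data.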


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [70]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lin_reg = LinearRegression().fit(X_train, y_train)


### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [71]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import r2_score, mean_squared_error

y_pred_train = lin_reg.predict(X_train)
y_pred_test = lin_reg.predict(X_test)

# Using R2 score:
train_score_R2 = r2_score(y_train, y_pred_train)
test_score_R2 = r2_score(y_test, y_pred_test)

print("USING THE R2 SCORE:")
print("Training score: {:.2f}".format(train_score_R2))
print("Validation score: {:.2f}".format(test_score_R2))

scores = {'R2 Score' : [train_score_R2, test_score_R2]} # add results to dictionary (for visualization in next step)

# Using Mean Squared Error:
train_score_MSE = mean_squared_error(y_train, y_pred_train)
test_score_MSE = mean_squared_error(y_test, y_pred_test)

print("\nUSING MEAN-SQUARED ERROR:")
print("Training score: {:.2f}".format(train_score_MSE))
print("Validation score: {:.2f}".format(test_score_MSE))

scores['MSE'] = [train_score_MSE, test_score_MSE] # add results to dictionary (for visualization in next step)

USING THE R2 SCORE:
Training score: 0.61
Validation score: 0.62

USING MEAN-SQUARED ERROR:
Training score: 111.36
Validation score: 95.90


### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [72]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame.from_dict(scores, orient='index', columns=['Training Accuracy', 'Validation Accuracy'])
print(results)

          Training Accuracy  Validation Accuracy
R2 Score           0.610823             0.623414
MSE              111.358439            95.904136


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

    No, using a linear model did not produce good results. For the R2 score metric, values close to 1 indicate a better fitting model, whereas for the mean squared error metric, that is the case for values close to 0 (the opposite).

    Using the R2 score metric, the training and validation scores were far from 1 and similar in value, 0.61 and 0.62, respectively, which means that the model is high-bias (underfitting) and explains only about 60% of the variance in the data. Using the mean squared error metric, the training and validation errors were high, 111.36 and 95.90, respectively, indicating that the model has high prediction error.

    One reason could be that the model is too simple: ordinary linear regression applies no regularization, so the underfitting suggests that the features need a richer representation (for example, interaction or polynomial terms). Another potential reason could be that the data has a lot of noise that has nothing to do with the features. Finally, it is possible that the data is simply better represented using a non-linear model.
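
To make the two metrics concrete, the sketch below computes them by hand on four made-up values (not the concrete data), following the standard definitions: MSE is the mean of the squared residuals, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up targets and predictions for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Mean squared error: average squared residual (0 is a perfect fit).
mse = np.mean((y_true - y_pred) ** 2)

# R2: 1 - SS_res / SS_tot (1 is a perfect fit).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("RMSE:", np.sqrt(mse))
print("R2:", r2)
```

Taking the square root puts the error back in the units of the target. For the validation MSE of 95.90 reported above, that gives an RMSE of roughly 9.79.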

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*\
\
Like in *Part 1: Classification*, I used the ENSF 611 Jupyter notebooks as a guideline for how to use SciKit-Learn for model selection. Other than that, I relied on the SciKit-Learn documentation (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict) to aid me in writing my code, allowing me to understand the functions that I wanted to use so that I could properly implement them. I completed all the steps in order because every step relies on the one before it. However, after completing all the steps, I did go back to tweak certain parameters to see how that would affect my results in order to better understand what was going on.

I used the same method for creating the results DataFrame as in Part 1: Classification, using the Pandas documentation as reference. I did not use any generative AI for this assignment.

A challenge I had was implementing the r2_score and mean_squared_error metrics to determine the training and validation accuracies since I was more familiar with using model.score() directly. However, I was able to figure out how to use these metrics by consulting the SciKit-Learn documentation (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html).
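
The link between the two approaches is that `LinearRegression.score()` returns the R2 score of the model's own predictions, so it matches `r2_score(y, model.predict(X))`. The sketch below checks this on synthetic data (the coefficients and noise level are arbitrary assumptions).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression data; coefficients and noise are arbitrary assumptions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# Both lines print the same R2 value.
print("model.score():", model.score(X_demo, y_demo))
print("r2_score():   ", r2_score(y_demo, y_hat))

# MSE has no shortcut on the estimator, so it needs the predictions.
print("MSE:          ", mean_squared_error(y_demo, y_hat))
```

Note the argument order for the metric functions: true values first, predictions second.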


## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

In *Part 1: Classification*, I found that increasing the data size and maximizing the number of features used to train the model makes for a better model. This is because increasing the data size reduces the effect of noise, which would otherwise cause overfitting, and allows the model to learn the **general** behaviour and patterns of the data. This can be seen in the difference in results between using the full dataset X, which had training and validation scores of 0.93 and 0.94, respectively, versus only 5% of the data in X_small, which had a training score of 0.97 and a lower validation score of 0.81.

Maximizing the number of features used to train the model allows the model to detect patterns and relationships among features. This can be seen in the difference in results between the full dataset X, which, again, had training and validation scores of 0.93 and 0.94, respectively, versus the dataset with only the first 2 features, resulting in training and validation scores of 0.61, indicating high bias and low variance.

In *Part 2: Regression*, I found that when a model has low training and validation R2 scores, the corresponding mean squared error (MSE) values are high. In other words, the better the model is, the closer its R2 scores are to 1 and the closer its MSE values are to 0. This can be seen in the results for this section's linear regression model, where the R2 scores were 0.61 (training) and 0.62 (validation) and the MSE values were 111.36 (training) and 95.90 (validation). Both metrics indicate that the linear regression model is high-bias for this dataset. Since ordinary linear regression has no regularization to reduce, a more flexible model (for example, one with non-linear terms) is needed.
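
The two metrics are tied together by the identity R2 = 1 - MSE / Var(y), where both are computed on the same data and Var(y) is the mean squared deviation of the targets from their mean. As a rough check, the sketch below applies that identity to the four numbers reported in the Part 2 results table to recover the target variance each split implies.

```python
# Values copied from the Part 2 results table in this notebook.
train_r2, test_r2 = 0.610823, 0.623414
train_mse, test_mse = 111.358439, 95.904136

# Rearranging R2 = 1 - MSE / Var(y) gives Var(y) = MSE / (1 - R2).
train_var = train_mse / (1 - train_r2)
test_var = test_mse / (1 - test_r2)

print("Implied target variance (training):  ", round(train_var, 2))
print("Implied target variance (validation):", round(test_var, 2))
```

Under this identity, the validation split appears to have a smaller target variance than the training split, which would explain how the validation MSE can be noticeably lower while the two R2 scores stay almost equal.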

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*\
\
I liked the coding section of the assignment and applying what I learned in class. I enjoyed interpreting the results in the questions following each modelling section of the assignment. However, I felt like the observations/interpretation part was repetitive, as I felt like I had already explained my results in the questions portion after each coding section and was just re-explaining.

When I was learning these concepts in the lectures and labs, I felt overwhelmed by the loads of information being taught to us all at once. However, being able to apply the concepts we learned allowed me to solidify my understanding of the course material, leaving me feeling very motivated for future material and assignments.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
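
One possible way to set up such a sweep is sketched below. It is an outline under stated assumptions, not a completed answer: the data comes from `make_regression` instead of the concrete dataset, and the six alpha values (one per power of ten) are just one reading of "along the logarithmic scale".

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); swap in the concrete X and y to use it.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

# 0.001, 0.01, 0.1, 1, 10, 100: evenly spaced on a log scale.
alphas = np.logspace(-3, 2, 6)

scores = []
for Model in (Ridge, Lasso):
    for alpha in alphas:
        reg = Model(alpha=alpha, max_iter=10000).fit(X_train, y_train)
        # score() returns the R2 score on the held-out split.
        scores.append((Model.__name__, alpha, reg.score(X_test, y_test)))

# Pick the (method, alpha) pair with the highest validation R2.
best = max(scores, key=lambda row: row[2])
print("Best method, alpha, validation R2:", best)
```

Any score this sketch prints describes the synthetic data only; the question about whether the result is "good enough" has to be answered with the concrete dataset.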

In [None]:
# TO DO: ADD YOUR CODE HERE

*ANSWER HERE*