# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Jubayer Ahmed

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [58]:
import numpy as np
import pandas as pd

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [59]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam

# TO DO: Print size and type of X and y
X,y = load_spam()
print(X.size, type(X))
print(y.size, type(y))
print(X.dtypes)
print(y.dtype)

262200 <class 'pandas.core.frame.DataFrame'>
4600 <class 'pandas.core.series.Series'>
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
wo

### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [60]:
# TO DO: Check if there are any missing values and fill them in if necessary
print(X.isnull().sum().sum())
print(y.isna().sum())

0
0


For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [61]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split
X_small, _, y_small, _ = train_test_split(X, y, train_size = 0.05, stratify=y, random_state=0)

### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [62]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
from sklearn.linear_model import LogisticRegression

#Using X,y
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,random_state=0)
log = LogisticRegression(max_iter=2000).fit(X_train, y_train)
full_t = log.score(X_train, y_train)
full_v = log.score(X_test, y_test)

#Using only first two columns
X_t, X_v, y_t, y_v = train_test_split(X.iloc[:,0:2], y, stratify=y,random_state=0)
l = LogisticRegression(max_iter=2000).fit(X_t, y_t)
two_t = l.score(X_t, y_t)
two_v = l.score(X_v, y_v)

#Using 5% of data

X_small5, X_val, y_small5, y_val = train_test_split(X_small, y_small, stratify=y_small, random_state=0)
lg = LogisticRegression(max_iter=2000).fit(X_small5, y_small5)
small_t = lg.score(X_small5, y_small5)
small_v = lg.score(X_val, y_val)

data = {"Trained Data Size": [X.size, X.iloc[:,0:2].size, X_small.size], "Training Score": [full_t, two_t, small_t], "Validation score": [full_v, two_v, small_v]}
results = pd.DataFrame(data, index=["Using all features", "Using two features", "Using 5% of data"])
print(results)
# Note: for any random state parameters, you can use random_state = 0


# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

                    Trained Data Size  Training Score  Validation score
Using all features             262200        0.933913          0.931304
Using two features               9200        0.619420          0.605217
Using 5% of data                13110        0.941860          0.948276


### Questions (4 marks)
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*
1. We can see that when we are using only the first two features, we get low training and validation scores indicating that our model is not not performing well. This means that our model would benefit from using more features. We have a low variance since the training and validation scores are close but high bias since they are not predicting the unseen data well. This indicates an underfit.

When using all features, we can see that the training and validation are high whether we are using 5% of the data or the entire data. This means that our model is performing well with both low and high sample size when using all the features. Our model is fitting the data well. It is perforiming well on the training set and on unseen data.

2. A false positive would be identifying a non-spam email as spam wheras a false negative would be identifying a spam email as non spam. A false positive is worse than a false negative because the false negative can easily be deleted manually. However in the event of a false positive, you may have an important email that will go to the junk folder and will remain unseen by the person (for example job interview email).

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code? 
2. In what order did you complete the steps?
3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not? 
4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
1. From the note/lab material
2. I did in order they are posted
3. I did not use AI
4.  I had difficulty writing the code. I was able to overcome it by looking at the code snipets from class.

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [63]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_concrete

# TO DO: Print size and type of X and y
X, y = load_concrete()
print(X.size, type(X))
print(y.size, type(y))
print(X.dtypes)
print(y.dtype)

8240 <class 'pandas.core.frame.DataFrame'>
1030 <class 'pandas.core.series.Series'>
cement    float64
slag      float64
ash       float64
water     float64
splast    float64
coarse    float64
fine      float64
age         int64
dtype: object
float64


### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [64]:
# TO DO: Check if there are any missing values and fill them in if necessary
print(X.isnull().sum().sum())
print(y.isna().sum().sum())

0
0


### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [65]:
# TO DO: ADD YOUR CODE HERE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
# Note: for any random state parameters, you can use random_state = 0

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [66]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error, r2_score
y_pred_val = model.predict(X_val)
y_pred_train = model.predict(X_train)
mse_train = mean_squared_error(y_train, y_pred_train)
mse_val = mean_squared_error(y_val, y_pred_val)
r2_train = r2_score(y_train, y_pred_train)
r2_val = r2_score(y_val, y_pred_val)

### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [67]:
# TO DO: ADD YOUR CODE HERE
results = pd.DataFrame({'Train': [mse_train, r2_train], 'Validation' : [mse_val, r2_val]}, index=["MSE", "R2"])
print(results)

          Train  Validation
MSE  111.358439   95.904136
R2     0.610823    0.623414


### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?
    No because the r2 value is a lot lower than 1 for both the training and validation scores. This indicates high bias and underfit (too simple)

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
2. In what order did you complete the steps?In the steps they are posted
3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
4. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*
1. I looked up how to run the mse and r2 commands on chatGPT
2. In the steps they are posted
3. I typed this into chatGPT: sklearn metric r2 score and mean square error 
4. No challenges, it was pretty straight forward

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.

We can see that although sample size is an important metric, the quality of the sample is equaly (if not more) important. It is important for our data to be representative (have sufficient features) of the unseen data. This is why the 5% sample size in question 1 resulted in a better model than when using only the first two columns (features) which lead to a high bias (underfitted model). When evaluating a model, it is important to look at many indicators to arrive to a conclusion. For example, looking at the mean squarred error without the r2 score could be missleading because although the MSE may be low, it might be high when compared to the data. 

It is important to note that it is important to train the model well but not to the extent of overtraining it where the validation score starts to drop and cause high variance. It is also important to train the model enough to avoid high bias which would cause the model to perform badly on unseen data. We want to maintain a balance where the training and validation scores are high and the validation score is approaching the training score.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challangeing, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*
I liked that we were asked to discuss our answers which enforced the learning. It made me realize the true value of the class. I realize that it is not about just running comands. This is also motivates me to keep learning now that I understand the value of the class better.

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logorithmic scale.

In [68]:
# TO DO: ADD YOUR CODE HERE
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

lasso001 = Lasso(alpha=0.01).fit(X_train, y_train)
lasso01 = Lasso(alpha=0.1).fit(X_train, y_train)
lasso1 = Lasso(alpha=1).fit(X_train, y_train)
lasso10 = Lasso(alpha=10).fit(X_train, y_train)
lasso100 = Lasso(alpha=100).fit(X_train, y_train)

ridge001 = Ridge(alpha=0.01).fit(X_train, y_train)
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
ridge1 = Ridge(alpha=1).fit(X_train, y_train)
ridge10 = Ridge(alpha=10).fit(X_train, y_train)
ridge100 = Ridge(alpha=100).fit(X_train, y_train)

data = {"Training Score": [lasso001.score(X_train, y_train), lasso01.score(X_train, y_train), lasso1.score(X_train, y_train), lasso10.score(X_train, y_train),
                           lasso100.score(X_train, y_train)], "Validation Score": [lasso001.score(X_val, y_val), lasso01.score(X_val, y_val),lasso1.score(X_val, y_val),
                                                                                   lasso10.score(X_val, y_val), lasso100.score(X_val, y_val)]}
results = pd.DataFrame(data, index=["alpha=0.01", "alpha=0.1", "alpha=1", "alpha=10","alpha=100"])

data1 = {"Training Score": [ridge001.score(X_train, y_train), ridge01.score(X_train, y_train), ridge1.score(X_train, y_train), ridge10.score(X_train, y_train),
                           ridge100.score(X_train, y_train)], "Validation Score": [ridge001.score(X_val, y_val), ridge01.score(X_val, y_val),ridge1.score(X_val, y_val),
                                                                                   ridge10.score(X_val, y_val), ridge100.score(X_val, y_val)]}
results1 = pd.DataFrame(data, index=["alpha=0.01", "alpha=0.1", "alpha=1", "alpha=10","alpha=100"])

print("Results for lasso: \n", results)
print("\nResults for ridge: \n", results1)

Results for lasso: 
             Training Score  Validation Score
alpha=0.01        0.610823          0.623429
alpha=0.1         0.610821          0.623562
alpha=1           0.610609          0.624669
alpha=10          0.604314          0.626774
alpha=100         0.467576          0.507413

Results for ridge: 
             Training Score  Validation Score
alpha=0.01        0.610823          0.623429
alpha=0.1         0.610821          0.623562
alpha=1           0.610609          0.624669
alpha=10          0.604314          0.626774
alpha=100         0.467576          0.507413


*ANSWER HERE*

We can see that varying alpha (L1 regularization in lasso and L2 in ridge) does not impact the outcome of the training and validation scores as they remain consistently low. This suggests that there is no one feature that is more important in predicting the outcome. Reducing some coefficient (L2) or eliminating some coefficient (L1) does not help us create a good model. This suggests that our data follows a standard linear regression model more closely.