# Linear Models and Validation Metrics




### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification 

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [1]:
import numpy as np
import pandas as pd

### Step 1: Data Input

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [2]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam
X, y = load_spam() # X = feature matrix , y = target 

# TO DO: Print size and type of X and y
print("Size of X is {} and Size of y is {}".format(X.shape, y.shape))
print("\nType of X is {} \nType of y is {}".format(type(X), type(y)))  

Size of X is (4600, 57) and Size of y is (4600,)

Type of X is <class 'pandas.core.frame.DataFrame'> 
Type of y is <class 'pandas.core.series.Series'>


### Step 2: Data Processing 

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [3]:
# TO DO: Check if there are any missing values and fill them in if necessary

#checked data type of each column - no object type
X.dtypes
y.dtypes

#number of null values is 0
X_null = X.isnull().sum().sort_values(ascending=False)
y_null = y.isna().sum()

print("Number of null values in y dataset {} \nNumber of null values in X dataset \n{}".format(y_null, X_null))

Number of null values in y dataset 0 
Number of null values in X dataset 
word_freq_make                0
word_freq_labs                0
word_freq_857                 0
word_freq_data                0
word_freq_415                 0
word_freq_85                  0
word_freq_technology          0
word_freq_1999                0
word_freq_parts               0
word_freq_pm                  0
word_freq_direct              0
word_freq_cs                  0
word_freq_meeting             0
word_freq_original            0
word_freq_project             0
word_freq_re                  0
word_freq_edu                 0
word_freq_table               0
word_freq_conference          0
char_freq_;                   0
char_freq_(                   0
char_freq_[                   0
char_freq_!                   0
char_freq_$                   0
char_freq_#                   0
capital_run_length_average    0
capital_run_length_longest    0
word_freq_telnet              0
word_freq_lab                 
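
Had the check turned up missing values, one common option would be median imputation. The sketch below uses a small made-up DataFrame (the column names and values are illustrative assumptions, not taken from the spam dataset) to show the count-then-fill pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; not the spam dataset.
toy = pd.DataFrame({
    "word_freq_a": [0.0, 0.5, np.nan, 1.2],
    "word_freq_b": [0.1, np.nan, 0.3, 0.4],
})

# Count missing values per column.
missing_before = toy.isnull().sum()

# Fill each column's gaps with that column's median.
toy_filled = toy.fillna(toy.median())
missing_after = toy_filled.isnull().sum()

print(missing_before)
print(missing_after)
```

The median is less sensitive to extreme values than the mean, which is why it is often preferred for skewed features such as word frequencies.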

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [4]:
# TO DO: Create X_small and y_small 

from sklearn.model_selection import train_test_split

X_t, X_small, y_t, y_small = train_test_split(X, y, random_state=0, test_size=0.05)
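
Because `train_test_split` returns the held-out portion second, passing `test_size=0.05` and keeping the "test" half is what yields the 5% subset. The following sketch checks this on synthetic data. The array shapes and the `stratify` argument are assumptions added for illustration: stratifying keeps the class ratio of the subset close to that of the full data, which the cell above does not do.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 4 features, 40% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 400 + [0] * 600)

# The second and fourth return values are the held-out 5%.
_, X_small_demo, _, y_small_demo = train_test_split(
    X_demo, y_demo, test_size=0.05, random_state=0, stratify=y_demo
)

print(X_small_demo.shape)   # 5% of the rows
print(y_small_demo.mean())  # share of positives in the subset
```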


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

### Step 5: Visualize Results 

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [5]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

from sklearn.linear_model import LogisticRegression

results = pd.DataFrame(columns = ['Data size', 'Training Accuracy', 'Validation Accuracy'])
X_two_col = X.iloc[:,:2]

# Use distinct loop names so the original X and y are not overwritten
for X_i, y_i in zip([X, X_two_col, X_small], [y, y, y_small]):
    X_train, X_val, y_train, y_val = train_test_split(X_i, y_i, random_state=0)
    logreg = LogisticRegression(max_iter=2000).fit(X_train, y_train)
    training_score = logreg.score(X_train, y_train)
    validation_score = logreg.score(X_val, y_val)

    results.loc[len(results)] = [X_i.shape, training_score, validation_score]

pd.set_option('display.precision', 3)
results.index = ['X', 'X_two_col', 'X_small']
display(results)


| | Data size | Training Accuracy | Validation Accuracy |
|---|---|---|---|
| X | (4600, 57) | 0.929 | 0.937 |
| X_two_col | (4600, 2) | 0.608 | 0.613 |
| X_small | (230, 57) | 0.965 | 0.793 |
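
For a classifier, `score` reports plain accuracy, the fraction of predictions that match the labels. As a quick check of that reading, the sketch below fits the same kind of model on synthetic data (generated with `make_classification`; the sample and feature counts are arbitrary assumptions) and compares `score` with `accuracy_score` and a manual calculation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, not the spam dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

model = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
pred = model.predict(X_va)

score_value = model.score(X_va, y_va)        # built-in accuracy
metric_value = accuracy_score(y_va, pred)    # same quantity via sklearn.metrics
manual_value = float(np.mean(pred == y_va))  # fraction of correct predictions

print(score_value, metric_value, manual_value)
```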


### Questions 
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

*YOUR ANSWERS HERE*

1.  

For the full dataset (4600 rows, 57 features), the training and validation accuracies are 0.93 and 0.94. Both scores are high and close to each other, which indicates low bias and low variance: the model fits the training data well and generalizes to unseen data.

For the two-column dataset (all 4600 rows, but only the first two features), the training and validation accuracies are both about 0.61, the lowest of the three tests. The scores are almost identical, so the variance is low, but both are far from the maximum accuracy of 1, so the bias is high. With only two features the model cannot capture the complexity of the data, so it underfits.

For the 5% dataset (230 rows, 57 features), the training accuracy is 0.97, which is very close to 1, while the validation accuracy is only 0.79. This is the only test with a large gap between the two scores, which indicates high variance and low bias: the model overfits. More data is needed to reduce the variance and raise the validation score.

With all 57 features, cutting the number of rows from 4600 to 230 raises the training accuracy (0.93 to 0.97) but lowers the validation accuracy (0.94 to 0.79). Cutting the number of features from 57 to 2 lowers both scores to about 0.61. The two-feature model is too simple for the data, and the validation score of the 5% model is too low.
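
One way to make the comparison concrete is to compute the gap between training and validation accuracy for each run. The numbers below are copied from the results table above; the `gap` column name is just a label chosen here.

```python
import pandas as pd

# Accuracies copied from the results table.
scores = pd.DataFrame(
    {
        "train": [0.929, 0.608, 0.965],
        "validation": [0.937, 0.613, 0.793],
    },
    index=["X", "X_two_col", "X_small"],
)

# A large positive gap suggests overfitting;
# low scores with a small gap suggest underfitting.
scores["gap"] = (scores["train"] - scores["validation"]).round(3)
print(scores)
```

Only the 5% subset shows a sizeable gap, which is consistent with reading it as the overfitting case.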

2. 

We are trying to identify spam, so the positive class is spam and the negative class is not spam. A false positive is a legitimate email that is marked as spam. A false negative is a spam email that is marked as not spam.

A false positive is worse because an important email could be marked as spam and never be found or recovered, whereas with a false negative the user can easily identify the spam email in the inbox and delete it.
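
To make the two error types concrete, the sketch below counts them by hand on a handful of made-up labels (1 = spam, 0 = not spam; the label encoding and the values are assumptions for illustration).

```python
import numpy as np

# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])

# False positive: legitimate mail flagged as spam.
false_positives = int(np.sum((y_pred == 1) & (y_true == 0)))
# False negative: spam that reaches the inbox.
false_negatives = int(np.sum((y_pred == 0) & (y_true == 1)))
true_positives = int(np.sum((y_pred == 1) & (y_true == 1)))

# Precision drops with false positives; recall drops with false negatives.
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(false_positives, false_negatives, precision, recall)
```

If false positives are the costlier error, precision is the metric to watch alongside accuracy.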

### Process Description 
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. I used the Jupyter notebook examples provided in labs/lectures and also referred to the textbooks for this class. I also used this website for importing the data: https://scikit-learn.org/.
2. I read each step, reviewed the Jupyter notebook examples, and then wrote the code. I ran the code to make sure it was working. If I had any problems, I referred to the notes, textbook and Jupyter notebooks.
3. I used the generative AI tool ChatGPT to clarify concepts and to look up functions. For example, I searched "What is the difference between .score() and accuracy_score()?" I didn't use any code from AI.
4. The main challenge I faced was figuring out whether the results of my model were correct, interpreting the results, and looking for patterns.

Sources Used 
- https://scikit-learn.org/
- ChatGPT

## Part 2: Regression 

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [16]:
# TO DO: Import concrete dataset from yellowbrick library

from yellowbrick.datasets import load_concrete
X, y  = load_concrete()  # X = feature matrix , y = target 

# TO DO: Print size and type of X and y

print("Size of X is {} and Size of y is {}".format(X.shape, y.shape))
print("\nType of X is {} \nType of y is {}".format(type(X), type(y)))  


Size of X is (1030, 8) and Size of y is (1030,)

Type of X is <class 'pandas.core.frame.DataFrame'> 
Type of y is <class 'pandas.core.series.Series'>


### Step 2: Data Processing 

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [17]:
# TO DO: Check if there are any missing values and fill them in if necessary

y.dtypes
X.dtypes

y_null = y.isnull().sum()
X_null = X.isnull().sum().sort_values(ascending=False)

#no null values found
print("Number of null values in y dataset {} \nNumber of null values in X dataset: \n{}".format(y_null, X_null))


Number of null values in y dataset 0 
Number of null values in X dataset 
cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64


### Step 3: Implement Machine Learning Model 

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [115]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0

from sklearn.linear_model import LinearRegression

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0) 

lr = LinearRegression().fit(X_train,y_train)

train_pred = lr.predict(X_train)
val_pred = lr.predict(X_val) 


### Step 4: Validate Model

Calculate the training and validation accuracy using mean squared error and R2 score.

In [88]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error, r2_score

#Mean squared Error score

mse_t = mean_squared_error(y_train, train_pred)
mse_v = mean_squared_error(y_val, val_pred)

#R2 score

r2score_t=r2_score(y_train, train_pred)
r2score_v=r2_score(y_val, val_pred)




### Step 5: Visualize Results 
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [89]:
# TO DO: ADD YOUR CODE HERE
data = {'Training Accuracy': [mse_t,r2score_t ],'Validation Accuracy': [mse_v , r2score_v] }

results = pd.DataFrame(data, index =['MSE', 'R2'], columns=['Training Accuracy', 'Validation Accuracy'])

pd.set_option('display.precision', 2)
display(results)

| | Training Accuracy | Validation Accuracy |
|---|---|---|
| MSE | 111.36 | 95.9 |
| R2 | 0.61 | 0.62 |
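
For reference, both metrics in the table can be written in a few lines of NumPy. The toy targets and predictions below are made up; the formulas are the standard definitions of mean squared error and the R2 score.

```python
import numpy as np

# Made-up targets and predictions for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# MSE: average squared residual, in squared units of the target.
mse = np.mean((y_true - y_hat) ** 2)

# R2: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, r2)
```

Note that for MSE lower is better, while for R2 higher is better, so the "accuracy" column labels apply only loosely to the MSE row.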


### Questions 
1. Did using a linear model produce good results for this dataset? Why or why not?

Based on the R2 score (the goodness of fit), the training and validation scores are almost the same (0.61 and 0.62), so the variance is low, but both are far from 1, so the bias is high. This model is underfitting. The low training score (0.61) indicates that the linear model did not produce good results and may be too simple for the data. We want both scores to be close to 1 and close to each other.

The MSE is also high for both training and validation (111.4 and 95.9). MSE is the average squared difference between the predicted and actual values, so a high MSE means the predictions are often far from the true compressive strength.

With a low R2 score and a high MSE, this model is a poor predictor for both the training set and unseen data. Therefore, a linear model did not produce good results.
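
Since MSE is in squared units, taking the square root gives an error on the same scale as the target, which is easier to judge. The sketch below applies this to the MSE values reported in the results table; the variable names are chosen here for illustration.

```python
import math

# MSE values copied from the results table.
mse_train = 111.36
mse_val = 95.9

# RMSE is in the same units as the compressive strength target.
rmse_train = math.sqrt(mse_train)
rmse_val = math.sqrt(mse_val)

print(round(rmse_train, 2), round(rmse_val, 2))
```

Both values come out at roughly 10 units of compressive strength, which gives a more direct sense of the typical prediction error than the squared figures.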


### Process Description 
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. I used the Jupyter notebook class examples along with the lab exercises for the code. I also referred to class notes and the two textbooks for this class.
2. I reviewed all my notes. Then, I attempted the steps in the assignment. To verify my code and results, I referred to the examples in the Jupyter notebooks, specifically the Linear Regression one.
3. I didn't use generative AI for this section.
4. No, I didn't have any challenges, because a lot of the steps in this section were similar to the ones in Part 1. Also, I reviewed examples in the notes to help me out.

## Part 3: Observations/Interpretation 

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

For the classification model, the following patterns were seen:
- For the full dataset, both the training (0.93) and validation (0.94) scores are high (close to 1). This indicates that when the dataset is large enough the scores stabilize and the model has learned the patterns in the data. Adding more data may not change the accuracies much.
- For the 5% dataset, the training score is very high (0.97) but the validation score is low (0.79). This indicates that the model has memorized the training data but does not generalize to unseen data to the same degree. The model overfits the data (high variance, low bias).
- For the two-column dataset, the training and validation scores are both low, at about 0.61. The model is trained with only two features, so it is too simple. It does not learn all the patterns and cannot predict well for new data. This model underfits the dataset (low variance, high bias).

Overall, a model needs enough samples and enough informative features to reach training and validation accuracies close to 1. We want a model with a good balance between the two scores, one that learns the patterns in the training set and also generalizes to unseen data, so that it neither underfits nor overfits.

For the regression model, the following patterns were seen:
- The mean squared error was high for both training and validation (111.36 and 95.9), with the validation MSE slightly lower. We want the MSE to be as low as possible, since it measures how much the predicted values deviate from the actual target values. In addition, the R2 score was relatively low for both training (0.61) and validation (0.62), far from the maximum of 1. These scores indicate that the linear model is too simple for this dataset, which needs a more complex model to capture its patterns. The linear model underfits the dataset (low variance, high bias).




## Part 4: Reflection 
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I liked that the questions were separated into steps, which made it easier to work through and write the code. I also enjoyed using Jupyter notebook for the assignment, as each step could be tested separately.
I found it challenging at first to use the different Python functions, since there are multiple ways of doing a lot of the same things and certain functions only work with certain data types. I found the datasets interesting, and it was interesting to see how the models performed based on the dataset size.


## Part 5: Bonus Question 

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
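
A grid like this can be generated with `np.logspace`, which takes the exponents of the end points rather than the end points themselves. The variable name `alphas` is an arbitrary choice.

```python
import numpy as np

# Six values from 10**-3 to 10**2, evenly spaced on a log scale.
alphas = np.logspace(-3, 2, num=6)
print(alphas)
```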

In [111]:
# TO DO: ADD YOUR CODE HERE


from sklearn.linear_model import Lasso, Ridge
results = pd.DataFrame(columns=['Training Accuracy MSE', 'Validation Accuracy MSE',
                                'Training Accuracy R2', 'Validation Accuracy R2' ])

#Lasso
values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
# The split does not depend on alpha, so it is done once before the loop
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
for alpha in values:
    lasso_model = Lasso(alpha=alpha).fit(X_train, y_train)

    train_pred = lasso_model.predict(X_train)
    val_pred = lasso_model.predict(X_val) 

    mse_t = mean_squared_error(y_train, train_pred)
    mse_v = mean_squared_error(y_val, val_pred)

    r2score_t=r2_score(y_train, train_pred)
    r2score_v=r2_score(y_val, val_pred)

    results.loc[len(results)]= [mse_t, mse_v, r2score_t, r2score_v]

results.index =['0.001', '0.01', '0.1', '1.0', '10.0', '100.0'] 
pd.set_option('display.precision', 3)
print('Lasso Linear Model Results')
display(results)

results2 = pd.DataFrame(columns=['Training Accuracy MSE', 'Validation Accuracy MSE',
                                'Training Accuracy R2', 'Validation Accuracy R2' ])
#Ridge
values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
# The split does not depend on alpha, so it is done once before the loop
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
for alpha in values:
    ridge_model = Ridge(alpha=alpha).fit(X_train, y_train)

    train_pred = ridge_model.predict(X_train)
    val_pred = ridge_model.predict(X_val) 

    mse_t = mean_squared_error(y_train, train_pred)
    mse_v = mean_squared_error(y_val, val_pred)

    r2score_t=r2_score(y_train, train_pred)
    r2score_v=r2_score(y_val, val_pred)

    results2.loc[len(results2)]= [mse_t, mse_v, r2score_t, r2score_v]

results2.index =['0.001', '0.01', '0.1', '1.0', '10.0', '100.0'] 
pd.set_option('display.precision', 3)
print('Ridge Linear Model Results')
display(results2)



Lasso Linear Model Results


| alpha | Training Accuracy MSE | Validation Accuracy MSE | Training Accuracy R2 | Validation Accuracy R2 |
|---|---|---|---|---|
| 0.001 | 111.358 | 95.904 | 0.611 | 0.623 |
| 0.01 | 111.358 | 95.9 | 0.611 | 0.623 |
| 0.1 | 111.359 | 95.867 | 0.611 | 0.624 |
| 1.0 | 111.42 | 95.585 | 0.611 | 0.625 |
| 10.0 | 113.221 | 95.049 | 0.604 | 0.627 |
| 100.0 | 152.347 | 125.446 | 0.468 | 0.507 |


Ridge Linear Model Results


| alpha | Training Accuracy MSE | Validation Accuracy MSE | Training Accuracy R2 | Validation Accuracy R2 |
|---|---|---|---|---|
| 0.001 | 111.358 | 95.904 | 0.611 | 0.623 |
| 0.01 | 111.358 | 95.904 | 0.611 | 0.623 |
| 0.1 | 111.358 | 95.904 | 0.611 | 0.623 |
| 1.0 | 111.358 | 95.904 | 0.611 | 0.623 |
| 10.0 | 111.358 | 95.903 | 0.611 | 0.623 |
| 100.0 | 111.359 | 95.894 | 0.611 | 0.623 |


*ANSWER HERE*

In general, we choose the alpha value that results in the highest validation score. 

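The Ridge model gave essentially the same results as plain linear regression for every alpha value tested (validation R2 of 0.623 throughout), so the regularization had almost no effect over this range.
For the Lasso model, the scores were about the same for alpha values from 0.001 to 1. At alpha = 10 the validation R2 reached its highest value (0.627) while the training R2 dropped slightly (0.604). At alpha = 100 both scores fell sharply (0.468 and 0.507), because a strong penalty shrinks the coefficients toward zero and can remove features entirely. Lasso with alpha = 10 therefore gave the best validation R2, but the improvement over linear regression (0.627 versus 0.62) is negligible. This score is not good enough, since both the training and validation scores remain far from 1. The model is underfitting, and regularization addresses overfitting rather than underfitting, so changing alpha makes little difference for either model.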
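
The rule stated at the start of this answer, choosing the alpha with the highest validation score, can be applied directly to the Lasso table. The values below are copied from that table; `idxmax` returns the index label of the largest entry, and the variable names are chosen here for illustration.

```python
import pandas as pd

# Lasso validation R2 by alpha, copied from the results table.
lasso_val_r2 = pd.Series(
    [0.623, 0.623, 0.624, 0.625, 0.627, 0.507],
    index=[0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
    name="validation_r2",
)

best_alpha = lasso_val_r2.idxmax()
best_score = lasso_val_r2.max()
print(best_alpha, best_score)
```

The margin over the other settings is small (0.627 versus 0.623), so this is better read as a near tie than a clear win for any one alpha.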