# Linear Models and Validation Metrics


### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.


## Part 1: Classification

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:


### Step 0: Import Libraries


In [17]:
import numpy as np
import pandas as pd
from yellowbrick.datasets import load_spam
from yellowbrick.datasets import load_concrete
from sklearn.model_selection import train_test_split

### Step 1: Data Input

The data used for this task can be downloaded using the yellowbrick library:
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.


In [18]:
# TO DO: Import spam dataset from yellowbrick library
X, y = load_spam()  # X = data , y = target
# TO DO: Print size and type of X and y
print("The shape of X is: {}, size of X is {}, and X is a {}".format(
    X.shape, X.size, type(X)))
print("The shape of y is: {}, size of y is {}, and y is a {}".format(
    y.shape, y.size, type(y)))

The shape of X is: (4600, 57), size of X is 262200, and X is a <class 'pandas.core.frame.DataFrame'>
The shape of y is: (4600,), size of y is 4600, and y is a <class 'pandas.core.series.Series'>
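Note that `size` counts every cell (rows times columns), which is why `X` reports 262200 (4600 × 57), while `shape[0]` or `len()` counts records. A minimal sketch on a made-up frame, where the column names are placeholders rather than real spam features:

```python
import pandas as pd

# Hypothetical 4-row, 3-column frame standing in for a feature matrix.
toy = pd.DataFrame({
    "feat_a": [0.1, 0.0, 0.3, 0.2],
    "feat_b": [1.0, 2.0, 0.0, 4.0],
    "feat_c": [5, 6, 7, 8],
})

n_rows, n_cols = toy.shape   # (records, features)
n_cells = toy.size           # records * features

print(f"rows={n_rows}, columns={n_cols}, cells={n_cells}")
```

When comparing datasets with different numbers of columns, the row count is usually the more meaningful measure of how much data a model sees.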


### Step 2: Data Processing

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.


In [19]:
# TO DO: Check if there are any missing values and fill them in if necessary
print("There are {} missing values in X.".format(X.isna().sum().sum()))
print("There are {} missing values in y.".format(y.isna().sum().sum()))

There are 0 missing values in X.
There are 0 missing values in y.
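No filling was needed here. Had the check found gaps, one possible approach is median imputation. The sketch below uses a made-up frame (`toy`, `feat_a`, and `feat_b` are placeholders), since the spam data has no missing values to demonstrate on:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; the spam data itself has none.
toy = pd.DataFrame({
    "feat_a": [1.0, np.nan, 3.0, 5.0],
    "feat_b": [10.0, 20.0, np.nan, 40.0],
})

n_missing_before = int(toy.isna().sum().sum())

# Fill each column's gaps with that column's median (NaNs are skipped
# when the median is computed).
filled = toy.fillna(toy.median())

n_missing_after = int(filled.isna().sum().sum())
print(f"missing before: {n_missing_before}, after: {n_missing_after}")
```

In a real workflow the medians would normally be computed on the training split only, so that no information from the validation rows leaks into the model.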


For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.


In [20]:
# TO DO: Create X_small and y_small
X_small, X_test_sm, y_small, y_test_sm = train_test_split(
    X, y, train_size=0.05, random_state=0)

print("Size of X_small is {}, this represents {} % of the original dataset".format(
    X_small.size, X_small.size/X.size*100))

Size of X_small is 13110, this represents 5.0 % of the original dataset
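A small sample is only useful if it resembles the full data. One optional safeguard, not required by the assignment, is the `stratify` argument of `train_test_split`, which keeps the class proportions of the sample equal to those of the full set. A sketch on synthetic data (`X_demo` and `y_demo` are stand-ins, not the spam data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 40% positive class.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 400 + [0] * 600)

# Keep 5% of the rows while preserving the 40/60 class balance.
X_sm, X_rest, y_sm, y_rest = train_test_split(
    X_demo, y_demo, train_size=0.05, stratify=y_demo, random_state=0)

print(f"{len(y_sm)} rows kept, positive share {y_sm.mean():.2f}")
```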


### Step 3: Implement Machine Learning Model

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets:
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`


### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3


### Step 5: Visualize Results

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`


In [21]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT
from sklearn.linear_model import LogisticRegression

# Visualize results
results = pd.DataFrame(
    columns=['Training Set Data Size', 'Training Accuracy', 'Validation Accuracy'])

X_values = [X, X.iloc[:, [0, 1]], X_small]
y_values = [y, y, y_small]

for X_i, y_i in zip(X_values, y_values):  # loop names avoid overwriting X and y
    X_train, X_test, y_train, y_test = train_test_split(
        X_i, y_i, random_state=0)  # split data into training and testing sets

    # instantiate logistic regression model
    logreg = LogisticRegression(max_iter=2000)
    logreg.fit(X_train, y_train)  # fit the model to the training sets

    # add a row: training-set size (rows x columns), training accuracy, validation accuracy
    results.loc[len(results)] = ([X_train.size, logreg.score(
        X_train, y_train), logreg.score(X_test, y_test)])

pd.set_option('display.precision', 2)  # set display precision
results  # print dataframe

|   | Training Set Data Size | Training Accuracy | Validation Accuracy |
|---|---|---|---|
| 0 | 196650.0 | 0.93 | 0.94 |
| 1 | 6900.0 | 0.61 | 0.61 |
| 2 | 9804.0 | 0.94 | 0.93 |


### Questions

1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

**1.** The training and validation accuracy depend on both the number of records and the number of features given to the classification model.

The models built on the full data and on `X_small` score almost the same: 0.93 training and 0.94 validation accuracy for `X`, `y` versus 0.94 training and 0.93 validation accuracy for `X_small`, `y_small`. This suggests:

-   that the 5% sample of records (rows) in `X_small`, `y_small` is representative of the full dataset, since the training and validation accuracies are so similar,
-   that adding training data can increase accuracy but may have diminishing returns once a representative sample of data points is included in the model, and
-   that changing the amount of data may not help a high-bias model (changing the model type may have more impact).

The model trained on only the first two columns of `X` is much less accurate (0.61 for both training and validation) than the models trained on the full or the row-reduced dataset. It uses the same rows as the full dataset, so the drop comes from removing features, and the equal training and validation scores indicate that the model is still high-bias. The poor accuracy of the two-column model indicates that an accurate spam classifier needs more than `word_freq_make` and `word_freq_address` to predict spam.
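The effect of the number of rows on accuracy can also be examined more systematically with scikit-learn's `learning_curve`, which refits a model on increasing fractions of the training data. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption, not the spam dataset), so its scores say nothing about the results above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic binary problem, used only as a stand-in for the spam data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0)

# Fit on 5%, 25% and 100% of each training fold, with 5-fold cross-validation.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X_demo, y_demo,
    train_sizes=[0.05, 0.25, 1.0], cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{int(n):4d} training rows: train acc {tr:.2f}, validation acc {va:.2f}")
```

If the validation score stops improving as rows are added while staying close to the training score, more data is unlikely to help and a different model or feature set is the more promising change.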

**2.** In the spam dataset, a false positive is a legitimate email that the model marks as spam. A false negative is a spam email that the model marks as legitimate. A false negative is worse because it lets spam through the filter, which can lead to harm.
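The two error types can be counted directly with `confusion_matrix`. This sketch uses made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, with 1 meaning spam), not predictions from the models above:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = spam, 0 = legitimate email.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(
    y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of flagged emails that really are spam
recall = tp / (tp + fn)     # share of spam emails that get caught

print(f"false positives={fp}, false negatives={fn}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Accuracy alone hides which kind of mistake the model makes; precision penalizes false positives and recall penalizes false negatives.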


### Process Description

Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:

1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?


1.  Sourced code from:

-   course notes and the course textbook (Introduction to Machine Learning with Python)
-   realpython.com (explaining how to use zip): https://realpython.com/python-zip-function/
-   saturncloud.com (how to add a row to a dataframe): https://saturncloud.io/blog/how-to-add-new-rows-to-a-pandas-dataframe/

2. Completed the steps in the order written as instructed. Steps:

-   data input - load spam dataset
-   data processing - check for null values and split the data into three sets: full data, reduced feature data, and reduced size data
-   ML model - applied a LogisticRegression with a maximum of 2000 iterations to the data.
-   validation - used .score() function to produce training and validation accuracy scores for the three datasets.

3. Did not use generative AI.
4. Challenged by how to get the different datasets inside of a loop. Initially didn't understand that we needed to call train_test_split again within the loop. Reviewing lab jupyter notebooks and viewing other course code helped to see how other similar problems were coded.


## Part 2: Regression

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.


### Step 1: Data Input

The data used for this task can be downloaded using the yellowbrick library:
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.


In [22]:
# TO DO: Import concrete dataset from yellowbrick library
X, y = load_concrete()
# TO DO: Print size and type of X and y
print("The shape of X is: {}, size of X is {}, and X is a {}".format(
    X.shape, X.size, type(X)))
print("The shape of y is: {}, size of y is {}, and y is a {}".format(
    y.shape, y.size, type(y)))

The shape of X is: (1030, 8), size of X is 8240, and X is a <class 'pandas.core.frame.DataFrame'>
The shape of y is: (1030,), size of y is 1030, and y is a <class 'pandas.core.series.Series'>


### Step 2: Data Processing

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill in the missing values.


In [23]:
# TO DO: Check if there are any missing values and fill them in if necessary
print("There are {} missing values in X.".format(X.isna().sum().sum()))
print("There are {} missing values in y.".format(y.isna().sum().sum()))

There are 0 missing values in X.
There are 0 missing values in y.


### Step 3: Implement Machine Learning Model

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`


In [24]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lr = LinearRegression().fit(X_train, y_train)

### Step 4: Validate Model

Calculate the training and validation accuracy using mean squared error and R2 score.


In [25]:
# TO DO: ADD YOUR CODE HERE
from sklearn.metrics import mean_squared_error, r2_score

# training accuracy
y_pred_train = lr.predict(X_train)
mse_train = mean_squared_error(y_train, y_pred_train)
r2_train = r2_score(y_train, y_pred_train)

# validation accuracy
y_pred = lr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

### Step 5: Visualize Results

1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`


In [26]:
# TO DO: ADD YOUR CODE HERE
# create dataframe with columns
results = pd.DataFrame(columns=['Training accuracy', 'Validation accuracy'])
# create index and add accuracy results
results.loc['MSE'] = ([mse_train, mse])
# create index and add accuracy results
results.loc['R2 Score'] = ([r2_train, r2])
pd.set_option('display.precision', 2)  # set display precision
results  # print results

|   | Training accuracy | Validation accuracy |
|---|---|---|
| MSE | 111.36 | 95.9 |
| R2 Score | 0.61 | 0.62 |


### Questions

1. Did using a linear model produce good results for this dataset? Why or why not?

The linear model did not produce good results: the R2 scores are low and the MSE values are high. These indicate that a linear model is likely not appropriate for this data.

The R2 score should be close to 1.0, but the model scored 0.61 on the training set and 0.62 on the validation set, meaning it explains only about 60% of the variance in compressive strength. Because the training and validation scores are nearly equal and both low, the model appears to be underfitting.

The MSE is 111.36 (training) and 95.90 (validation); the ideal MSE is close to 0. MSE is the average squared difference between actual and predicted values, so its square root gives a typical error of roughly 10 in the units of the target, again indicating that the model is underfitting the data.
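To make the two metrics concrete, they can be computed by hand from their definitions. The numbers below are made up for illustration (`y_true_demo` and `y_hat_demo` are not concrete-strength values):

```python
import numpy as np

# Made-up actual and predicted values.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_hat_demo = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true_demo - y_hat_demo

# MSE: mean of the squared residuals; RMSE puts it back in the target's units.
mse_demo = np.mean(residuals ** 2)
rmse_demo = np.sqrt(mse_demo)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(f"MSE={mse_demo:.2f}, RMSE={rmse_demo:.2f}, R2={r2_demo:.2f}")
```

Because MSE depends on the scale of the target while R2 does not, R2 is the easier of the two to compare across datasets.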


### Process Description

Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:

1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?


_Answers_

**1.** Source code:

-   course notes, course textbook (Introduction to Machine Learning with Python).

-   Assistance with R2 and MSE: https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

-   Assistance with understanding what MSE means: https://en.wikipedia.org/wiki/Mean_squared_error

**2.** Completed the steps in the order presented in the jupyter notebook. Steps:

-   data input - load concrete data
-   data processing - check for null values
-   ML model - Applied a LinearRegression to the data.
-   validation - used the r2_score() function to produce training and validation scores that show the goodness of fit of the model. Used the mean_squared_error() function to measure how close the model's predictions are to the data points.

**3.** Did not use generative AI

**4.** Had difficulty understanding the difference between .score(), mean_squared_error(), and r2_score(). Reading the textbook and course notes and searching Google helped to increase my understanding.


## Part 3: Observations/Interpretation

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.

-   Found that validation scores can be a good indication that the model being used is inappropriate for the data. For example, applying a linear regression model to the concrete data gave a low R2 score (0.62) and a high MSE (95.9) on the validation set. This is likely because the compressive strength of concrete does not depend linearly on its ingredients and age, so the results show what happens when an inappropriate model is applied to a dataset.

-   Found that including more features in the model increased the chance of a high accuracy score. When the spam dataset was drastically reduced by features, the accuracy score was much lower (0.94 with all features vs 0.61 with the first two features). There is likely an optimum point beyond which adding more features begins to reduce the accuracy of the model.

-   Found that the number of records/rows doesn't necessarily impact the accuracy as much as the number of features. Using 5% of the whole dataset, then further splitting it into training and testing data, gave very similar accuracy scores (0.94 with whole data vs 0.93 with reduced data).
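The first observation, that a linear model scores poorly on a non-linear relationship, can be reproduced on a toy problem. The sketch below fits a straight line and a degree-2 polynomial to a synthetic curve. The data and the choice of degree are assumptions made for illustration, and nothing here implies that a quadratic is the right model for the concrete data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curve: y = x^2 plus a little noise.
rng = np.random.default_rng(0)
x_demo = np.linspace(-3, 3, 200).reshape(-1, 1)
y_curve = x_demo.ravel() ** 2 + rng.normal(scale=0.1, size=200)

# A straight line cannot follow the curve, so it underfits.
linear = LinearRegression().fit(x_demo, y_curve)

# Adding squared terms lets the same linear solver capture the shape.
quadratic = make_pipeline(
    PolynomialFeatures(degree=2), LinearRegression()).fit(x_demo, y_curve)

r2_linear = linear.score(x_demo, y_curve)
r2_quadratic = quadratic.score(x_demo, y_curve)
print(f"linear R2={r2_linear:.2f}, quadratic R2={r2_quadratic:.2f}")
```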


## Part 4: Bonus Question

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
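The alpha range in the reminder corresponds to `np.logspace(-3, 2, 6)`, and both models can be swept over it in one loop instead of one cell per value. The sketch below runs on synthetic data from `make_regression` (an assumption, used so the example is self-contained), so its scores are unrelated to the concrete results:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 8 features, like the concrete data's column count.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)

alphas = np.logspace(-3, 2, 6)  # 0.001, 0.01, 0.1, 1, 10, 100

rows = []
for name, Model in [("Ridge", Ridge), ("Lasso", Lasso)]:
    for alpha in alphas:
        model = Model(alpha=alpha, max_iter=10000).fit(X_tr, y_tr)
        # .score() returns R2 for regressors.
        rows.append((name, float(alpha),
                     model.score(X_tr, y_tr), model.score(X_va, y_va)))

# Pick the combination with the highest validation R2.
best = max(rows, key=lambda r: r[3])
print(f"best: {best[0]} with alpha={best[1]:g}, validation R2={best[3]:.2f}")
```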


In [27]:
# Load data and split into testing and training sets
X, y = load_concrete()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [28]:
# Ridge alpha=1.0
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0, max_iter=2000).fit(X_train, y_train)
print("Ridge alpha=1.0 training set score %.2f" %
      ridge.score(X_train, y_train))
print("Ridge alpha=1.0 validation score %.2f" % ridge.score(X_test, y_test))

Ridge alpha=1.0 training set score 0.61
Ridge alpha=1.0 validation score 0.62


In [29]:
# Ridge alpha=0.01
ridge01 = Ridge(alpha=0.01, max_iter=2000).fit(X_train, y_train)
print("Ridge alpha=0.01 training set score %.2f" %
      ridge01.score(X_train, y_train))
print("Ridge alpha=0.01 validation score %.2f" % ridge01.score(X_test, y_test))

Ridge alpha=0.01 training set score 0.61
Ridge alpha=0.01 validation score 0.62


In [30]:
# Ridge alpha=100
ridge100 = Ridge(alpha=100, max_iter=2000).fit(X_train, y_train)
print("Ridge alpha=100 training set score %.2f" %
      ridge100.score(X_train, y_train))
print("Ridge alpha=100 validation score %.2f" % ridge100.score(X_test, y_test))

Ridge alpha=100 training set score 0.61
Ridge alpha=100 validation score 0.62


In [31]:
# Lasso alpha=1.0
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1.0, max_iter=2000).fit(X_train, y_train)
print("Training set score %.2f" % lasso.score(X_train, y_train))
print("Validation score %.2f" % lasso.score(X_test, y_test))
print("Number of features used in the model: ", np.sum(lasso.coef_ != 0))

Training set score 0.61
Validation score 0.62
Number of features used in the model:  8


In [32]:
# Lasso alpha=100
lasso100 = Lasso(alpha=100, max_iter=2000).fit(X_train, y_train)
print("Training set score %.2f" % lasso100.score(X_train, y_train))  # R2 score
print("Validation score %.2f" % lasso100.score(X_test, y_test))   # R2 score
print("Number of features used in the model: ", np.sum(lasso100.coef_ != 0))

Training set score 0.47
Validation score 0.51
Number of features used in the model:  5


Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

_Answers_

The ordinary least squares method produced the best R2 score (0.61 training, 0.62 validation); none of the ridge or lasso models tested improved on it.

The model does not have high variance so the model did not respond to regularization.

With ridge regression, the training and validation scores did not change from the ordinary least squares method values.

With lasso regression, the training and validation scores did not respond until the regularization increased and the model became less complex with 5 out of the 8 features being considered in the model. Here, both the training and validation score decreased suggesting the model moved toward underfitting and higher bias.
