<font size="+3"><b>Linear Models and Validation Metrics</b></font>

<font color='Red'>In this project, we write code that uses linear models to perform classification and regression tasks, and we describe how that code was written. More details can be found below. Any websites or AI tools used are cited.</font>

NOTE: You may use the Table of Contents on the left side of this notebook to navigate efficiently within this document/project.

---

|                **Question**                |
|:------------------------------------------:|
|         **Part 1: Classification**         |
|          Step 0: Import Libraries          |
|             Step 1: Data Input             |
|           Step 2: Data Processing          |
| Step 3: Implement Machine Learning Model   |
|           Step 4: Validate Model           |
|          Step 5: Visualize Results         |
|                  Questions                 |
|             Process Description            |
|           **Part 2: Regression**           |
|             Step 1: Data Input             |
|           Step 2: Data Processing          |
| Step 3: Implement Machine Learning Model   |
|           Step 4: Validate Model           |
|          Step 5: Visualize Results         |
|                  Questions                 |
|             Process Description            |
|  **Part 3:   Observations/Interpretation** |
|  **Part 4: Lasso and Ridge regression**    |

# **Part 1: Classification**

|                **Question**                |
|:------------------------------------------:|
|         **Part 1: Classification**         |
|          Step 0: Import Libraries          |
|             Step 1: Data Input             |
|           Step 2: Data Processing          |
| Step 3: Implement Machine Learning   Model |
|           Step 4: Validate Model           |
|          Step 5: Visualize Results         |
|                  Questions                 |
|             Process Description            |

In this part, we develop code that helps a user determine whether an email they have received is spam. Following the machine learning workflow, the code for each step is given below:


## **Step 0:** Import Libraries

In [207]:
import numpy as np
import pandas as pd

## **Step 1:** Data Input

The data used for this task can be downloaded using the yellowbrick library:
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Using the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y` and then printing the size and type of `X` and `y`.

In [208]:
# Import the load_spam function from yellowbrick.datasets
from yellowbrick.datasets import load_spam

# Load the spam dataset
X, y = load_spam()

# Print the size and type of X and y
print("Size of X:", X.shape)
print("Type of X:", type(X))
print("Size of y:", y.shape)
print("Type of y:", type(y))

Size of X: (4600, 57)
Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: (4600,)
Type of y: <class 'pandas.core.series.Series'>


## **Step 2:** Data Processing

Checking whether there are any missing values in the dataset and, if necessary, selecting an appropriate method to fill them in.

In [209]:
# Check for missing values
missing_values_X = X.isnull().sum()
missing_values_y = y.isnull().sum()

# Print the results
print("Values in X:\n\n", missing_values_X, "\n")
print("Values in y:", missing_values_y)


Values in X:

 word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet         
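
The check above reports no missing values for the columns shown, so no filling is needed here. For cases where gaps do exist, the following is a minimal sketch of one common option, median imputation with pandas. The `toy` frame and its values are hypothetical and only reuse two column names from the spam data for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (not the real spam data)
toy = pd.DataFrame({
    "word_freq_make": [0.0, np.nan, 0.2, 0.1],
    "word_freq_all": [0.5, 0.3, np.nan, 0.1],
})

# Count missing entries per column, as in the cell above
missing_before = toy.isnull().sum()

# Fill each column's gaps with that column's median
toy_filled = toy.fillna(toy.median())

# Confirm that no missing values remain
missing_after = toy_filled.isnull().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
```

The median is less sensitive to outliers than the mean, which is one reason it is often preferred for skewed features such as word frequencies; other strategies may suit other data better.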

We will now test if the linear model would still work if we used less data. By using the `train_test_split` function from sklearn, we will create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [210]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets, with 5% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=0)

# Assign the training portion to X_small and y_small
# Note: with test_size=0.05 this is the 95% split (4370 rows, see output), not a 5% subset
X_small = X_train
y_small = y_train

# Print for testing the shapes of X_small, y_small, X_test, and y_test
print("X_small:", X_small.shape)
print("y_small:", y_small.shape)
print("X_test:", X_test.shape)
print("y_test:", y_test.shape)

X_small: (4370, 57)
y_small: (4370,)
X_test: (230, 57)
y_test: (230,)
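
The printed shapes show that `X_small` holds 4370 rows, which is 95% of the data. If the goal is a subset containing 5% of the rows, one possible approach is to pass `train_size=0.05` instead. The sketch below uses synthetic data with the same shape as the spam dataset; the names `X_demo`, `y_demo`, `X_small_demo`, and `X_rest` are assumptions made for illustration, and stratifying on the labels is an optional choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the spam data (4600 x 57)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((4600, 57)))
y_demo = pd.Series(rng.integers(0, 2, size=4600))

# train_size=0.05 puts 5% of the rows in the first returned split;
# stratify keeps the class proportions similar in both splits
X_small_demo, X_rest, y_small_demo, y_rest = train_test_split(
    X_demo, y_demo, train_size=0.05, random_state=0, stratify=y_demo
)

print("X_small_demo:", X_small_demo.shape)
print("X_rest:", X_rest.shape)
```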


## **Step 3:** Implement Machine Learning Model

1. Importing `LogisticRegression` from sklearn
2. Instantiating model `LogisticRegression(max_iter=2000)`.
3. Implementing the machine learning model with three different datasets:
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

## **Step 4:** Validate Model

Calculating the training and validation accuracy for the three different tests implemented in Step 3

## **Step 5:** Visualize Results

1. Creating a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Adding the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Printing `results`

In [211]:
# Importing LogisticRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Dataset 1: Complete dataset
model_1 = LogisticRegression(max_iter=2000)
model_1.fit(X, y)

# Dataset 2: Only first two columns of X
X_first_two = X.iloc[:, :2]
model_2 = LogisticRegression(max_iter=2000)
model_2.fit(X_first_two, y)

# Dataset 3: X_small and y_small
model_3 = LogisticRegression(max_iter=2000)
model_3.fit(X_small, y_small)

# Calculate the training accuracy for each model
train_accuracy_1 = model_1.score(X, y)
train_accuracy_2 = model_2.score(X_first_two, y)
train_accuracy_3 = model_3.score(X_small, y_small)

# Calculate the validation accuracy for each model
val_accuracy_1 = model_1.score(X_test, y_test)
val_accuracy_2 = model_2.score(X_test.iloc[:, :2], y_test)
val_accuracy_3 = model_3.score(X_test, y_test)

# Dataframe for storing results
results = pd.DataFrame(columns=['Data size', 'Training accuracy', 'Validation accuracy'])

# Add the data size, training and validation accuracy for each dataset to the 'results' DataFrame
results.loc[0] = ['Complete dataset', train_accuracy_1, val_accuracy_1]
results.loc[1] = ['First two columns', train_accuracy_2, val_accuracy_2]
results.loc[2] = ['Small dataset', train_accuracy_3, val_accuracy_3]

# Print 'results'
print(results)

           Data size  Training accuracy  Validation accuracy
0   Complete dataset           0.931957             0.939130
1  First two columns           0.616304             0.578261
2      Small dataset           0.929748             0.943478
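
In the cell above, `model_1` and `model_2` are fit on all of `X`, which includes the rows in `X_test`. The following is a sketch of an alternative in which every model is fit only on the training split and scored on rows it has not seen. It uses synthetic data from `make_classification` as a stand-in for the spam data, so the numbers it prints are not comparable to the table above; all `*_demo` names are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the spam data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 20% of the rows before fitting anything
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

rows = []
for label, n_cols in [("All columns", 20), ("First two columns", 2)]:
    clf = LogisticRegression(max_iter=2000)
    clf.fit(X_tr[:, :n_cols], y_tr)  # fit on the training rows only
    rows.append({
        "Data size": label,
        "Training accuracy": clf.score(X_tr[:, :n_cols], y_tr),
        "Validation accuracy": clf.score(X_val[:, :n_cols], y_val),
    })

# Build the results table in one step from the collected rows
results_demo = pd.DataFrame(rows)
print(results_demo)
```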


## **Practice descriptive questions & answers for theoretical understanding**
1. How do the training and validation accuracy change depending on the amount of data used? Explain with values.
2. In this case, what do a false positive and a false negative represent? Which one is worse?

<font color='Green'><b>

1. For the "complete dataset", both training and validation accuracy are high: 0.932 and 0.939, respectively. This suggests that the model learned well and generalizes to the validation data. Note, however, that this model was fit on all of `X`, which includes the rows of `X_test`, so its validation accuracy is not a true held-out estimate.

    However, the dataset with "the first two columns" shows a large drop in accuracy for both training (0.616) and validation (0.578). Validation accuracy is the lower of the two, and both values imply that shrinking the feature space from 57 columns to 2 discards information that is critical for making correct predictions.

    For the "small dataset", training accuracy is slightly lower than for the "complete dataset" (0.930 vs. 0.932), while validation accuracy is slightly higher (0.943 vs. 0.939). This shows that with somewhat less data the model still performs well, because the remaining rows contain sufficient information for it to learn effectively. Note that, as coded, `X_small` holds 4370 of the 4600 rows (95%), so this comparison shows the effect of dropping 5% of the rows rather than the effect of training on only 5% of the data.

    It is important to note that accuracy depends not only on the amount of data but also on the quality of the data and on the model itself. The model may learn well from less data because the data is of high quality and the model is well suited to it.

2. In this spam classification task, a false positive is an email that is not spam but is wrongly labelled as spam, which could cause important emails to be ignored or overlooked. A false negative is a spam email that is wrongly identified as not spam and therefore reaches the inbox, exposing the user to risks such as fraud or phishing attempts. So in this dataset, a false negative is worse than a false positive because it exposes the user to potential harm.

</b></font>
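
The false positive and false negative counts discussed above can be read from a confusion matrix. The sketch below uses a small set of hypothetical labels and assumes the encoding 1 = spam, 0 = not spam; it is meant to show the mechanics, not results from the spam models above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, assuming 1 = spam and 0 = not spam
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("False positives (legitimate email flagged as spam):", fp)
print("False negatives (spam that reached the inbox):", fn)
```

Which error is more costly depends on the application, so it can be useful to report both counts alongside accuracy.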

## **Process Description/How code was written/Sourcing etc**
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?

<font color='Green'><b>

Using the above written practice questions as guidance, the following layout is used to describe the process:
1. The code in this Jupyter Notebook was written by me, drawing on the information in the existing cells where necessary.

2. Following the workflow used in real-world practice, I completed the cells in the following order:
    - Imported the necessary libraries and modules for the code
    - Loaded the dataset using the `load_spam()` function from the yellowbrick library
    - Checked for missing values using `isnull().sum()` as part of data processing
    - Split the data into training and testing sets using the `train_test_split` function from sklearn
    - Implemented machine learning models using `LogisticRegression`
    - Validated the models and calculated training and validation accuracies
    - Visualized the results in a pandas DataFrame/table
    - Answered the practice questions about the results and the process

3. No, I did not use generative AI for this. I manually wrote it based on the requirements.

</b></font>

# **Part 2: Regression**

| **Question**                               |
|--------------------------------------------|
| **Part 2: Regression**                     |
| Step 1: Data Input                         |
| Step 2: Data Processing                    |
| Step 3: Implement Machine Learning   Model |
| Step 4: Validate Model                     |
| Step 5: Visualize Results                  | 
| Questions                                  | 
| Process Description                        | 

For this section, we will predict the compressive strength of different concrete samples based on their age and ingredients. We will be using the `Concrete` dataset from the yellowbrick library.

## **Step 1:** Data Input

The data used for this task can be downloaded using the yellowbrick library:
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Using the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y` and printing the size and type of `X` and `y`.

In [212]:
# Import the dataset from yellowbrick library
from yellowbrick.datasets import load_concrete
X, y = load_concrete()

# Print the size and type of X and y
print("Size of X:", X.shape)
print("Type of X:", type(X))
print("Size of y:", y.shape)
print("Type of y:", type(y))

Size of X: (1030, 8)
Type of X: <class 'pandas.core.frame.DataFrame'>
Size of y: (1030,)
Type of y: <class 'pandas.core.series.Series'>


## **Step 2:** Data Processing

Checking whether there are any missing values in the dataset and, if necessary, selecting an appropriate method to fill them in.

In [213]:
# Check for missing values
missing_values_X = X.isnull().sum()
missing_values_y = y.isnull().sum()

# Print the results
print("Values in X:\n\n", missing_values_X, "\n")
print("Values in y:", missing_values_y)

Values in X:

 cement    0
slag      0
ash       0
water     0
splast    0
coarse    0
fine      0
age       0
dtype: int64 

Values in y: 0


## **Step 3:** Implement Machine Learning Model

1. Importing `LinearRegression` from sklearn
2. Instantiating model `LinearRegression()`.
3. Implementing the machine learning model with `X` and `y`

In [214]:
# Import LinearRegression from sklearn
from sklearn.linear_model import LinearRegression

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) # 20% of the data is used for testing here

# Instantiate the LinearRegression model
model = LinearRegression()

# Train the model on the training set
model.fit(X_train, y_train)

## **Step 4:** Validate Model

Calculating the training and validation accuracy using mean squared error and R2 score.

In [215]:
# Import required metrics from sklearn
from sklearn.metrics import mean_squared_error, r2_score

# Predictions for training and testing sets
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

# Calculate MSE and R2 for both training and testing sets
train_mse = mean_squared_error(y_train, train_predictions)
test_mse = mean_squared_error(y_test, test_predictions)
train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)
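
To make the two metrics concrete, the sketch below computes MSE and R2 by hand with NumPy and compares them with the sklearn functions used above. The four target values and predictions are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions (not from the concrete data)
y_true = np.array([30.0, 45.0, 25.0, 60.0])
y_hat = np.array([28.0, 50.0, 27.0, 55.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: one minus the ratio of residual to total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual, mean_squared_error(y_true, y_hat))
print("R2:", r2_manual, r2_score(y_true, y_hat))
```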

## **Step 5:** Visualize Results
1. Creating a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Adding the accuracy results to the `results` DataFrame
3. Printing `results`

In [216]:
# Create a pandas DataFrame 'results' with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
results = pd.DataFrame(index=['MSE', 'R2 score'], columns=['Training accuracy', 'Validation accuracy'])

# Add the accuracy results to the 'results' DataFrame
results.loc['MSE', 'Training accuracy'] = train_mse
results.loc['MSE', 'Validation accuracy'] = test_mse
results.loc['R2 score', 'Training accuracy'] = train_r2
results.loc['R2 score', 'Validation accuracy'] = test_r2

# Print 'results'
print(results)

         Training accuracy Validation accuracy
MSE             110.345501           95.635335
R2 score          0.609071            0.636898


## **Practice descriptive questions & answers for theoretical understanding**
1. Did using a linear model produce good results for this dataset? Why or why not?

<font color='Green'><b>

Because the model predicts a continuous variable, its performance was evaluated on how well it forecasts the actual numeric values, using the Mean Squared Error (MSE) and the R2 score. The MSE values are 110.346 for training and 95.635 for validation, with R2 scores of 0.609 and 0.637, respectively.

While the R2 scores suggest that the model explains approximately 61-64% of the variance in the target variable, there is room for improvement, since a higher R2 score would indicate a better fit. The model appears to be moderately effective but not highly precise. This suggests that while a linear model captures some of the underlying patterns, the complexity of the concrete strength dataset might require more complex models or feature engineering to improve accuracy.

The similar performance on the training and validation sets suggests that the model is not overfitting, which is a good sign. However, the MSE values are quite high, which indicates that the predictions are not very close to the actual values. Overall, the linear model did not produce good results for this dataset.

</b></font>
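
MSE is expressed in squared units of the target, which makes "quite high" hard to judge directly. Taking the square root gives the RMSE, which is in the same units as the target. The sketch below applies this to the two MSE values reported in the results table; reading the result as MPa assumes the strength column uses that unit, as in the original UCI release of this dataset.

```python
import math

# MSE values copied from the results table above
train_mse = 110.345501
test_mse = 95.635335

# RMSE is the square root of MSE, in the units of the target
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

print(f"Training RMSE: {train_rmse:.2f}")
print(f"Validation RMSE: {test_rmse:.2f}")
```

Under that assumption, a typical prediction error is roughly 10 units of strength, which can then be compared against the range of strengths in `y`.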


## **Process Description/How code was written/Sourcing etc**

1. Where did you source your code?
2. In what order did you complete the steps?
3. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?

<font color='Green'><b>

Using the above written practice questions as guidance, the following layout is used to describe the process:
1. The code in this Jupyter Notebook was written by me, drawing on the information in the existing cells where necessary.

2. Following the workflow used in real-world practice, I completed the cells in the following order:
    - Loaded the data using the `load_concrete()` function from the yellowbrick library
    - Printed the size and type of `X` and `y`
    - Checked for missing values in the dataset and printed the counts as part of data processing
    - Split the dataset into training and testing sets and instantiated a `LinearRegression` model
    - Trained the model on the training set
    - Generated predictions for the training and testing sets and calculated MSE and R2 for both
    - Visualized the results in a pandas DataFrame/table
    - Answered the practice questions about the results and the process

3. No, I did not use generative AI for this. I manually wrote it based on the requirements.

</b></font>

# **Part 3: Observations/Interpretation Practice Question(s)**

Q. Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.

<font color='Green'><b>

The results show a moderate level of accuracy, with R2 values of approximately 0.61 (training) and 0.64 (validation). This indicates that while the linear model captures some of the relationships between the features and the target variable, it does not capture all of the detail or complexity. The modest R2 may also indicate that the data contains non-linear trends. The lecture slides likewise stated that linear models can serve as a reasonable baseline but may not be the best at capturing non-linear interactions without extra feature engineering or more complicated models. This tells me that understanding the assumptions of different models is critical for choosing an appropriate modeling approach. The results also show that the model is not overfitting, since the training and validation scores are close. However, the model is also not very accurate, which suggests that it is underfitting: it is not capturing all the underlying patterns in the data, and a more complex model may be required to improve accuracy.

</b></font>
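
As an illustration of the feature engineering idea mentioned above, the sketch below compares a plain linear model with one that first expands the inputs into degree-2 polynomial features. It uses synthetic data with a deliberately non-linear target, so it only demonstrates the mechanism; whether the same step helps on the concrete data would need to be tested. All `*_demo` names and the choice of degree 2 are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a squared term and an interaction term
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = (
    X_demo[:, 0] ** 2
    + X_demo[:, 0] * X_demo[:, 1]
    + rng.normal(0, 0.5, size=500)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Baseline: linear model on the raw features
linear = LinearRegression().fit(X_tr, y_tr)

# Same linear model, but on degree-2 polynomial features
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_tr, y_tr)

r2_linear = linear.score(X_te, y_te)
r2_poly = poly.score(X_te, y_te)
print(f"R2 linear: {r2_linear:.3f}")
print(f"R2 polynomial: {r2_poly:.3f}")
```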

# **Part 4: Ridge and Lasso regression**

Repeating Part 2 with Ridge and Lasso regression to see if we can improve the accuracy results.

**NOTE**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.
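
A logarithmic grid places the same number of candidates in each decade. As a small illustration, `np.logspace(-3, 2, num=6)` yields one value per power of ten between 0.001 and 100; the variable names here are chosen for illustration only.

```python
import numpy as np

# One value per decade: exponents -3, -2, -1, 0, 1, 2
alphas_coarse = np.logspace(-3, 2, num=6)
print(alphas_coarse)

# A finer grid over the same range has a constant ratio between neighbours
alphas_fine = np.logspace(-3, 2, num=100)
ratios = alphas_fine[1:] / alphas_fine[:-1]
print(alphas_fine[0], alphas_fine[-1], ratios[0])
```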

In [217]:
from sklearn.linear_model import Ridge, Lasso

# Initialize variables to keep track of the best scores and alphas
best_r2_ridge, best_r2_lasso = 0.636898, 0.636898  # Initial values based on previous validation accuracy
best_alpha_ridge, best_alpha_lasso = None, None
alpha_values = np.logspace(-3, 2, num=100)  # Alpha values from 0.001 to 100

# Splitting the dataset outside the loop to avoid redundant splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for alpha_val in alpha_values:
    # Train Ridge model
    ridge_model = Ridge(alpha=alpha_val)
    ridge_model.fit(X_train, y_train)
    r2_ridge = ridge_model.score(X_test, y_test)

    # Train Lasso model
    lasso_model = Lasso(alpha=alpha_val)
    lasso_model.fit(X_train, y_train)
    r2_lasso = lasso_model.score(X_test, y_test)

    # Update best scores and alphas for Ridge
    if r2_ridge > best_r2_ridge:
        best_r2_ridge = r2_ridge
        best_alpha_ridge = alpha_val

    # Update best scores and alphas for Lasso
    if r2_lasso > best_r2_lasso:
        best_r2_lasso = r2_lasso
        best_alpha_lasso = alpha_val

# Print the best R2 scores and corresponding alpha values
print(f'Best R2 score (Ridge): {best_r2_ridge} with alpha val: {best_alpha_ridge}')
print(f'Best R2 score (Lasso): {best_r2_lasso} with alpha val: {best_alpha_lasso}')

# Determine which model performed better
if best_r2_ridge > best_r2_lasso:
    print(f'Ridge model performed better with an R2 score of {best_r2_ridge}.')
else:
    print(f'Lasso model performed better with an R2 score of {best_r2_lasso}.')

Best R2 score (Ridge): 0.6369366906855765 with alpha val: 100.0
Best R2 score (Lasso): 0.638873238146232 with alpha val: 9.770099572992246
Lasso model performed better with an R2 score of 0.638873238146232.


<font color='Green'><b>

Refer to the code and its output above.

</b></font>
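
The loop above picks alpha by comparing R2 on the test split, so the reported best scores are tuned to that split. One possible alternative is to choose alpha by cross-validation on the training data only and keep the test split for a final check. The sketch below does this with `RidgeCV` and `LassoCV` on synthetic data from `make_regression`; the added `StandardScaler` step, the 5-fold setting, and all variable names are assumptions rather than part of the original experiment.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features, like the concrete data
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Same alpha grid as in the loop above
alphas = np.logspace(-3, 2, num=100)

# Scale the features, then pick alpha by 5-fold CV on the training split
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
lasso_cv = make_pipeline(
    StandardScaler(), LassoCV(alphas=alphas, cv=5, max_iter=10000)
)
ridge_cv.fit(X_tr, y_tr)
lasso_cv.fit(X_tr, y_tr)

# The last pipeline step holds the selected alpha
best_alpha_ridge_cv = ridge_cv[-1].alpha_
best_alpha_lasso_cv = lasso_cv[-1].alpha_

# The test split is used only once, for the final score
r2_ridge_cv = ridge_cv.score(X_te, y_te)
r2_lasso_cv = lasso_cv.score(X_te, y_te)
print(f"Ridge: alpha={best_alpha_ridge_cv:.4g}, test R2={r2_ridge_cv:.3f}")
print(f"Lasso: alpha={best_alpha_lasso_cv:.4g}, test R2={r2_lasso_cv:.3f}")
```

Scaling matters here because the Ridge and Lasso penalties act on coefficient size, which depends on the units of each feature.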