In [1]:
# Assignment 2: Linear Models and Validation Metrics (30 marks total)
### Due: October 10 at 11:59pm

### Name: Nick Nikolov

### In this assignment, you will need to write code that uses linear models to perform classification and regression tasks. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Part 1: Classification (14.5 marks total)

You have been asked to develop code that can help the user determine if the email they have received is spam or not. Following the machine learning workflow described in class, write the relevant code in each of the steps below:

### Step 0: Import Libraries

In [2]:
import numpy as np
import pandas as pd


### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

Use the yellowbrick function `load_spam()` to load the spam dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [23]:
# TO DO: Import spam dataset from yellowbrick library
from yellowbrick.datasets import load_spam

X, y = load_spam()

# TO DO: Print size and type of X and y
print(X.shape)
print(X.dtypes)
print(y.shape)
print(y.dtypes)
print(X)
print(y)

(4600, 57)
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp                  float64
word_freq_hpl                 float64
w

### Step 2: Data Processing (1.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [4]:
# TO DO: Check if there are any missing values and fill them in if necessary
nulls = X.isnull().sum().sort_values(ascending=False)
print(nulls)

nulls = y.isnull().sum()
print(nulls)

word_freq_make                0
word_freq_labs                0
word_freq_857                 0
word_freq_data                0
word_freq_415                 0
word_freq_85                  0
word_freq_technology          0
word_freq_1999                0
word_freq_parts               0
word_freq_pm                  0
word_freq_direct              0
word_freq_cs                  0
word_freq_meeting             0
word_freq_original            0
word_freq_project             0
word_freq_re                  0
word_freq_edu                 0
word_freq_table               0
word_freq_conference          0
char_freq_;                   0
char_freq_(                   0
char_freq_[                   0
char_freq_!                   0
char_freq_$                   0
char_freq_#                   0
capital_run_length_average    0
capital_run_length_longest    0
word_freq_telnet              0
word_freq_lab                 0
word_freq_address             0
word_freq_650                 0
word_fre

For this task, we want to test if the linear model would still work if we used less data. Use the `train_test_split` function from sklearn to create a new feature matrix named `X_small` and a new target vector named `y_small` that contain **5%** of the data.

In [41]:
# TO DO: Create X_small and y_small 
from sklearn.model_selection import train_test_split

X_train, X_small, y_train, y_small = train_test_split(X, y, test_size=0.05, random_state=0)
print(X_small.shape)
print(y_small.shape)

(230, 57)
(230,)


### Step 3: Implement Machine Learning Model
1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Implement the machine learning model with three different datasets: 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`

In [42]:
from sklearn.linear_model import LogisticRegression

# Using the full X and y dataset
full_data = LogisticRegression(max_iter=2000)
full_data.fit(X, y)

# Using only 5% of the data
small_data = LogisticRegression(max_iter=2000)
small_data.fit(X_small, y_small)

# Using only the first two columns of X
first2 = LogisticRegression(max_iter=2000)
first2.fit(X[['word_freq_make', 'word_freq_address']], y)

### Step 4: Validate Model

Calculate the training and validation accuracy for the three different tests implemented in Step 3

train_score= 0.931
validation_score= 0.912
train_score= 0.942
validation_score= 0.904
train_score= 0.616
validation_score= 0.610
{'fit_time': array([0.00316906, 0.00250173, 0.0022738 , 0.00269485, 0.00256491]), 'score_time': array([0.00074887, 0.00069809, 0.00068712, 0.00077701, 0.00071692]), 'test_score': array([0.60543478, 0.61195652, 0.60217391, 0.61630435, 0.61304348]), 'train_score': array([0.6201087 , 0.61304348, 0.61929348, 0.60842391, 0.62038043])}


### Step 5: Visualize Results (4 marks)

1. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
2. Add the data size, training and validation accuracy for each dataset to the `results` DataFrame
3. Print `results`

In [56]:
# TO DO: ADD YOUR CODE HERE FOR STEPS 3-5
from sklearn.model_selection import cross_validate

results = pd.DataFrame(columns=['Data Size', 'Training Accuracy', 'Validation Accuracy'])

first2_cols = ['word_freq_make', 'word_freq_address']

# Each entry pairs a model from Step 3 with the data it was fit on
datasets = [
    (full_data, X, y),
    (small_data, X_small, y_small),
    (first2, X[first2_cols], y),
]

for model, X_data, y_data in datasets:
    # 5-fold cross-validation, keeping the training scores as well
    scores = cross_validate(model, X_data, y_data, cv=5, scoring='accuracy', return_train_score=True)
    train_score = scores['train_score'].mean()
    validation_score = scores['test_score'].mean()
    print('train_score= {:.3f}'.format(train_score))
    print('validation_score= {:.3f}'.format(validation_score))
    # Data size is the total number of elements (rows x columns)
    results.loc[len(results.index)] = [X_data.size, train_score, validation_score]

print(results)


# Note: for any random state parameters, you can use random_state = 0
# HINT: USING A LOOP TO STORE THE DATA IN YOUR RESULTS DATAFRAME WILL BE MORE EFFICIENT

train_score= 0.931
validation_score= 0.912
train_score= 0.942
validation_score= 0.904
train_score= 0.616
validation_score= 0.610
   Data Size  Training Accuracy  Validation Accuracy
0   262200.0           0.930870             0.911739
1    13110.0           0.942391             0.904348
2     9200.0           0.616250             0.609783


### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. I used the scikit-learn documentation website to learn about the regression function parameters. I also used the lecture and lab slides created by Dr. Dawson.
2. I completed all steps in numerical order.
3. 

## Part 2: Regression (10.5 marks total)

For this section, we will be evaluating concrete compressive strength of different concrete samples, based on age and ingredients. You will need to repeat the steps 1-4 from Part 1 for this analysis.

### Step 1: Data Input (1 mark)

The data used for this task can be downloaded using the yellowbrick library: 
https://www.scikit-yb.org/en/latest/api/datasets/concrete.html

Use the yellowbrick function `load_concrete()` to load the concrete dataset into the feature matrix `X` and target vector `y`.

Print the size and type of `X` and `y`.

In [None]:
# TO DO: Import concrete dataset from yellowbrick library
# TO DO: Print size and type of X and y

### Step 2: Data Processing (0.5 marks)

Check to see if there are any missing values in the dataset. If necessary, select an appropriate method to fill-in the missing values.

In [None]:
# TO DO: Check if there are any missing values and fill them in if necessary
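
The missing-value check can be sketched on a small stand-in table. This is only an illustration: `X_demo` and its column names are hypothetical and are not the concrete dataset, and filling with the column median is one possible strategy rather than the required one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a feature matrix; NOT the concrete dataset
X_demo = pd.DataFrame({
    "cement": [540.0, np.nan, 332.5, 198.6],
    "age": [28.0, 28.0, np.nan, 360.0],
})

# Count the missing values in each column
missing_per_column = X_demo.isnull().sum()
print(missing_per_column)

# Fill each gap with that column's median (one possible strategy)
X_filled = X_demo.fillna(X_demo.median())

# Total number of missing values left after filling
print(X_filled.isnull().sum().sum())
```

If `isnull().sum()` reports zero for every column, as it did for the spam data in Part 1, no filling step is needed.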

### Step 3: Implement Machine Learning Model (1 mark)

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Implement the machine learning model with `X` and `y`

In [None]:
# TO DO: ADD YOUR CODE HERE
# Note: for any random state parameters, you can use random_state = 0
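
A minimal sketch of fitting a linear regression model, assuming synthetic data in place of the concrete dataset. `X_demo`, `y_demo` and the coefficients used to generate them are made up for illustration. Note that `LinearRegression` solves ordinary least squares directly, so it takes no `max_iter` argument.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3*x0 - 2*x1 + 5, no noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

# Instantiate and fit the model
model = LinearRegression()
model.fit(X_demo, y_demo)

# With noiseless data the fit recovers the generating coefficients
print(model.coef_)
print(model.intercept_)
```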

### Step 4: Validate Model (1 mark)

Calculate the training and validation accuracy using mean squared error and R2 score.

In [None]:
# TO DO: ADD YOUR CODE HERE
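
One way to get both metrics for training and validation is `cross_validate` with two scorers, mirroring the approach used in Part 1. The sketch below assumes synthetic data (`X_demo`, `y_demo` are hypothetical) and 5-fold cross-validation; the fold count is a choice, not a requirement of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression data with some noise (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

scores = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=5,
    scoring=["neg_mean_squared_error", "r2"],
    return_train_score=True,
)

# scikit-learn reports MSE negated so that higher is always better; flip the sign back
train_mse = -scores["train_neg_mean_squared_error"].mean()
val_mse = -scores["test_neg_mean_squared_error"].mean()
train_r2 = scores["train_r2"].mean()
val_r2 = scores["test_r2"].mean()

print("train MSE= {:.3f}, validation MSE= {:.3f}".format(train_mse, val_mse))
print("train R2= {:.3f}, validation R2= {:.3f}".format(train_r2, val_r2))
```

With multiple scorers, the result keys follow the pattern `train_<scorer>` and `test_<scorer>`, where the `test_` entries are the validation scores.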

### Step 5: Visualize Results (1 mark)
1. Create a pandas DataFrame `results` with columns: Training accuracy and Validation accuracy, and index: MSE and R2 score
2. Add the accuracy results to the `results` DataFrame
3. Print `results`

In [None]:
# TO DO: ADD YOUR CODE HERE
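
The layout of the `results` DataFrame described above (metrics as the index, training and validation as columns) can be sketched as follows. The target and prediction values here are made up purely to show the structure; they are not results from the concrete dataset.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_train_true, y_train_pred = [3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.0]
y_val_true, y_val_pred = [4.0, 8.0], [5.0, 7.0]

# One column per data split, one row per metric
results = pd.DataFrame(
    {
        "Training accuracy": [
            mean_squared_error(y_train_true, y_train_pred),
            r2_score(y_train_true, y_train_pred),
        ],
        "Validation accuracy": [
            mean_squared_error(y_val_true, y_val_pred),
            r2_score(y_val_true, y_val_pred),
        ],
    },
    index=["MSE", "R2 score"],
)
print(results)
```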

### Questions (2 marks)
1. Did using a linear model produce good results for this dataset? Why or why not?

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

1. I used the scikit-learn documentation website to learn about the regression function parameters. I also used the lecture and lab slides created by Dr. Dawson.

2. I completed all steps in numerical order.

3. Any time I was stuck on an error, I pasted the error message into ChatGPT. If it provided an easy fix, I modified my code accordingly and retested. My prompt was always identical to the error message I received.

4. Yes, I had some challenges setting up the training and testing datasets. The assignment says to split the dataset into a smaller dataset containing 5% of the original data, but it gives no further guidance on which data to use for training and testing each model. Should we have used 5% of the dataset to train and test all models, or only the model that specified 5%?

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- what you found interesting, confusing, challenging, or motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

## Part 5: Bonus Question (4 marks)

Repeat Part 2 with Ridge and Lasso regression to see if you can improve the accuracy results. Which method and what value of alpha gave you the best R^2 score? Is this score "good enough"? Explain why or why not.

**Remember**: Only test values of alpha from 0.001 to 100 along the logarithmic scale.

In [None]:
# TO DO: ADD YOUR CODE HERE
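
A sketch of the alpha search, assuming synthetic data in place of the concrete dataset (`X_demo` and `y_demo` are hypothetical). `np.logspace(-3, 2, 6)` yields six values from 0.001 to 100, one per power of ten; a finer grid is equally valid. The comparison uses the mean validation R^2 from 5-fold cross-validation, which is one reasonable choice among several.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data with some noise (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Alpha values from 0.001 to 100 on a logarithmic scale
alphas = np.logspace(-3, 2, 6)

# Track the (method, alpha, mean validation R2) triple with the highest score
best = None
for name, Model in [("Ridge", Ridge), ("Lasso", Lasso)]:
    for alpha in alphas:
        r2 = cross_val_score(Model(alpha=alpha), X_demo, y_demo, cv=5, scoring="r2").mean()
        if best is None or r2 > best[2]:
            best = (name, alpha, r2)

print("best: {} alpha={} mean validation R2={:.3f}".format(*best))
```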

*ANSWER HERE*

In [None]:
Citations
1. Cranor, Lorrie Faith, and Brian A. LaMacchia. “Spam!.” Communications of the ACM 41.8 (1998): 74-83.
2. Dr. Dawson - Lecture slides
3. https://scikit-learn.org/stable/
4. https://chat.openai.com