# Lab1 - Scikit-learn
Author: Michael Le

## 1. Introduction

The goal of this lab is to become familiar with the scikit-learn library.

You will practice loading example datasets, performing classification and regression with linear scikit-learn models, and investigating the effects of reducing the number of features (columns in X) and the number of samples (rows in X and y).


In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import yellowbrick

## 2. Classification

Using yellowbrick spam - classification  
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

The goal is to investigate `LogisticRegression(max_iter=2000)` and the effects of reducing the number of features and the number of samples on classification performance.

### 2.1 Implement convenience function

In [19]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def get_classifier_accuracy(model, X, y):
    '''Calculate train and validation accuracy of classifier (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn classifier): Classifier to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training accuracy, validation accuracy
    
    '''
    
    #TODO: IMPLEMENT FUNCTION BODY
    # Split arrays or matrices into random train and test subsets. Use default 75/25 split.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)
    
    # Fit the model according to the given training vector X_train and its relative target vector y_train
    # Predict class labels for samples in X_test - X_test is the data matrix for which we want to get predictions
    # Returns vector containing the class labels for each sample in X_test
    model.fit(X_train, y_train)
    y_test_pred = model.predict(X_test)
    
    # Predict class labels for samples in X_train - X_train is the data matrix for which we want to get predictions
    # Returns vector containing the class labels for each sample in X_train
    y_train_pred = model.predict(X_train)
    
    # Use the true labels and predicted labels (as returned by the classifier) for the validation (testing) set
    # and the training set to compute the accuracy classification score
    validation_accuracy = accuracy_score(y_test, y_test_pred) # Fraction of correctly classified samples for test set
    training_accuracy = accuracy_score(y_train, y_train_pred) # Fraction of correctly classified samples for training set
    
    return((training_accuracy, validation_accuracy))
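
As a rough illustration of what `accuracy_score` reports, the sketch below computes the fraction of matching labels by hand. The helper name `fraction_correct` and the toy labels are assumptions made for this example; they are not part of the lab or of scikit-learn.

```python
def fraction_correct(y_true, y_pred):
    """Return the fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Toy labels (hypothetical): 1 = spam, 0 = not spam
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

toy_accuracy = fraction_correct(toy_true, toy_pred)
print(toy_accuracy)  # 6 of the 8 labels match
```

Calling the same helper once on the training labels and once on the validation labels mirrors the two values returned by `get_classifier_accuracy()`.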

### 2.2 Load data

Using the yellowbrick function `load_spam()`, load the spam data set into feature matrix `X` and target vector `y`.

Print size and type of `X` and `y`.


In [20]:
# TODO: ADD YOUR CODE HERE
from yellowbrick.datasets.loaders import load_spam

# Load spam data set into feature matrix X and target vector y
X, y = load_spam()
# X.head()
# y.head()
print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}\n")
print(f"Type of X Columns: \n{X.dtypes}\n")
print(f"Type of X: \n{X.dtypes.dtype}\n")
print(f"Type of y: \n{y.dtypes}\n")

Shape of X: (4600, 57)
Shape of y: (4600,)

Type of X Columns: 
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_hp            

In [24]:
print(X.isnull().sum()) # Check for null values in columns of X
print(y.isnull().sum()) # Check for null values in y (a single target vector)

word_freq_make                0
word_freq_address             0
word_freq_all                 0
word_freq_3d                  0
word_freq_our                 0
word_freq_over                0
word_freq_remove              0
word_freq_internet            0
word_freq_order               0
word_freq_mail                0
word_freq_receive             0
word_freq_will                0
word_freq_people              0
word_freq_report              0
word_freq_addresses           0
word_freq_free                0
word_freq_business            0
word_freq_email               0
word_freq_you                 0
word_freq_credit              0
word_freq_your                0
word_freq_font                0
word_freq_000                 0
word_freq_money               0
word_freq_hp                  0
word_freq_hpl                 0
word_freq_george              0
word_freq_650                 0
word_freq_lab                 0
word_freq_labs                0
word_freq_telnet              0
word_fre

Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [25]:
# TODO: ADD YOUR CODE HERE
from sklearn.model_selection import train_test_split

# Use train_test_split to prepare X_small and y_small with only 1% of the rows
X_big, X_small, y_big, y_small = train_test_split(X, y, test_size=0.01, random_state=174)

print(f"Shape of X_small: {X_small.shape}")
print(f"Shape of y_small: {y_small.shape}\n")
print(f"Type of X_small Columns: \n{X_small.dtypes}\n")
print(f"Type of X_small: \n{X_small.dtypes.dtype}\n")
print(f"Type of y_small: \n{y_small.dtypes}\n")



Shape of X_small: (46, 57)
Shape of y_small: (46,)

Type of X_small Columns: 
word_freq_make                float64
word_freq_address             float64
word_freq_all                 float64
word_freq_3d                  float64
word_freq_our                 float64
word_freq_over                float64
word_freq_remove              float64
word_freq_internet            float64
word_freq_order               float64
word_freq_mail                float64
word_freq_receive             float64
word_freq_will                float64
word_freq_people              float64
word_freq_report              float64
word_freq_addresses           float64
word_freq_free                float64
word_freq_business            float64
word_freq_email               float64
word_freq_you                 float64
word_freq_credit              float64
word_freq_your                float64
word_freq_font                float64
word_freq_000                 float64
word_freq_money               float64
word_freq_
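
The row counts printed above follow from how a fractional `test_size` is turned into a number of rows. The sketch below assumes the test set receives the ceiling of `test_size * n_samples` and the training set receives the remainder, which is consistent with the shapes printed in this notebook. `split_sizes` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size (assumed ceiling rule)."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


print(split_sizes(4600, test_size=0.01))  # 1% of the spam rows goes to X_small
print(split_sizes(46))                    # default 75/25 split applied to X_small
```

Under this assumption, passing `X_small` to `get_classifier_accuracy()` leaves only a dozen rows for validation, so a single misclassified row shifts the validation accuracy by about 0.083.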

### 2.3 Train and evaluate models

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
4. Call your convenience function `get_classifier_accuracy()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation accuracy for each call to the `results` DataFrame
6. Print `results`

In [26]:
# TODO: ADD YOUR CODE HERE
# Import LogisticRegression from sklearn
from sklearn.linear_model import LogisticRegression

# Instantiate model: Set model to a model object before you can perform ML on data 
model = LogisticRegression(max_iter=2000)

# Create results DataFrame
results = pd.DataFrame(columns=['Data Size', 'Training Accuracy', 'Validation Accuracy'])

# Restrict X to its first two columns; y is a 1-D target vector, so it is used unchanged
X_first_two_cols = X.iloc[:, 0:2]
y_first_two_cols = y.iloc[:]

# Obtain training and validation accuracy scores with all data
first_result = get_classifier_accuracy(model, X, y)
# Obtain training and validation accuracy scores with only the first two features
second_result = get_classifier_accuracy(model, X_first_two_cols, y_first_two_cols)
# Obtain training and validation accuracy scores with only 1% of the rows used
third_result = get_classifier_accuracy(model, X_small, y_small)

# Pair each result with the shape of the data it was computed from and build one row per pair.
# Matching by position avoids mislabeling rows if two results happen to be equal, and building
# the frame once avoids DataFrame.append, which was removed in pandas 2.0.
shapes = [X.shape, X_first_two_cols.shape, X_small.shape]
all_results = [first_result, second_result, third_result]
rows = [
    {'Data Size': shape, 'Training Accuracy': train_acc, 'Validation Accuracy': val_acc}
    for shape, (train_acc, val_acc) in zip(shapes, all_results)
]
results = pd.DataFrame(rows, columns=results.columns)

# Display results DataFrame
results

|   | Data Size | Training Accuracy | Validation Accuracy |
|---|---|---|---|
| 0 | (4600, 57) | 0.934783 | 0.918261 |
| 1 | (4600, 2) | 0.608986 | 0.613043 |
| 2 | (46, 57) | 1.0 | 0.833333 |


### 2.4 Questions
1. What is the validation accuracy using all data? What is the difference between training and validation accuracy?
1. How do the validation accuracy and the difference between training and validation accuracy change when only two columns are used? Provide values.
1. How do the validation accuracy and the difference between training and validation accuracy change when only 1% of the rows are used? Provide values.

Answer for 1: Using all samples and all features, the validation accuracy is 0.918261. This is 0.016522 (about 1.65 percentage points) below the training accuracy of 0.934783.

Answer for 2: Using all samples but only the first two feature columns, the validation accuracy is 0.613043. This is 0.004057 (about 0.41 percentage points) higher than the training accuracy of 0.608986, so the gap is smaller in magnitude than with all features (0.016522) and has the opposite sign. Restricting the features to the first two columns therefore had little effect on the gap between training and validation accuracy, but the model performed significantly worse overall: training accuracy dropped by 0.325797 and validation accuracy by 0.305218 (both more than 30 percentage points) compared to using all samples and all features.

Answer for 3: Using only 1% of the rows with all features, the validation accuracy is 0.833333. This is 0.166667 (about 16.7 percentage points) below the training accuracy of 1.0, compared with a gap of 0.016522 on the full data. The larger discrepancy can be attributed to the small size of the dataset: with only 46 rows, the model fits its training portion perfectly but generalizes less well to the held-out rows.
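
The differences quoted in these answers can be re-derived from the results table. The short sketch below does the subtraction for each experiment; the dictionary keys are descriptive labels chosen for this example, and the numbers are copied from the table above.

```python
# (training accuracy, validation accuracy) as reported in the results table
reported = {
    "all rows, all features": (0.934783, 0.918261),
    "all rows, first two features": (0.608986, 0.613043),
    "1% of rows, all features": (1.0, 0.833333),
}

# Positive gap: training accuracy is higher than validation accuracy
gaps = {name: round(train - val, 6) for name, (train, val) in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: train - validation = {gap:+.6f}")
```

A negative value, as in the two-feature experiment, means the model scored slightly better on the validation rows than on the rows it was trained on.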

## 3. Regression

Using yellowbrick energy - regression  
https://www.scikit-yb.org/en/latest/api/datasets/energy.html

The goal is to investigate `LinearRegression()` and the effects of reducing the number of features and the number of samples on regression performance.

### 3.1 Implement convenience function

In [27]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def get_regressor_mse(model, X, y):
    '''Calculate train and validation mean-squared error (mse) of regressor (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn regressor): Regressor to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training mse, validation mse
    
    '''
   
    #TODO: IMPLEMENT FUNCTION BODY
    # Split arrays or matrices into random train and test subsets. Use default 75/25 split.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)

    # Fit the model to the training features X_train and their target values y_train
    # Predict target values for the samples in X_test
    # Returns a vector containing the predicted value for each sample in X_test
    model.fit(X_train, y_train)
    y_test_pred = model.predict(X_test)

    # Predict target values for the samples in X_train
    # Returns a vector containing the predicted value for each sample in X_train
    y_train_pred = model.predict(X_train)

    # Use the true and predicted target values (as returned by the regressor) for the validation (testing) set
    # and the training set to compute the mean squared error
    validation_mse = mean_squared_error(y_test, y_test_pred) # Mean of squared prediction errors on the validation set
    training_mse = mean_squared_error(y_train, y_train_pred) # Mean of squared prediction errors on the training set
    
    return((training_mse, validation_mse))
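
To make the metric concrete, the sketch below computes a mean squared error by hand as the average of the squared differences between true and predicted values. The helper name `mean_squared_error_by_hand` and the toy numbers are assumptions made for this example, not part of the lab.

```python
def mean_squared_error_by_hand(y_true, y_pred):
    """Average of squared differences between true and predicted target values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Toy targets and predictions (hypothetical values)
toy_true = [10.0, 20.0, 30.0, 40.0]
toy_pred = [10.0, 22.0, 30.0, 36.0]

# Errors are 0, -2, 0 and 4, so the squared errors are 0, 4, 0 and 16
toy_mse = mean_squared_error_by_hand(toy_true, toy_pred)
print(toy_mse)
```

Because the errors are squared, the single prediction that is off by 4 contributes four times as much as the one that is off by 2. Unlike accuracy, lower values are better.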

### 3.2 Load data

Using the yellowbrick function `load_energy()`, load the energy data set into feature matrix `X` and target vector `y`.

Print dimensions and type of `X` and `y`.

In [28]:
# TODO: ADD YOUR CODE HERE
from yellowbrick.datasets.loaders import load_energy

# Load energy data set into feature matrix X and target vector y
X, y = load_energy()
# X.head()
# y.head()
print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}\n")
print(f"Type of X columns: \n{X.dtypes}\n")
print(f"Type of X: \n{X.dtypes.dtype}\n")
print(f"Type of y: \n{y.dtypes}\n")

Shape of X: (768, 8)
Shape of y: (768,)

Type of X columns: 
relative compactness         float64
surface area                 float64
wall area                    float64
roof area                    float64
overall height               float64
orientation                    int64
glazing area                 float64
glazing area distribution      int64
dtype: object

Type of X: 
object

Type of y: 
float64



Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [29]:
# TODO: ADD YOUR CODE HERE
# Split data into training and testing sets before fitting model to data
# Question - Is X_small and y_small supposed to be the test or training set?
from sklearn.model_selection import train_test_split

X_big, X_small, y_big, y_small = train_test_split(X, y, test_size=0.01, random_state=174)

print(f"Shape of X_small: {X_small.shape}")
print(f"Shape of y_small: {y_small.shape}\n")
print(f"Type of X_small columns: \n{X_small.dtypes}\n")
print(f"Type of X_small: \n{X_small.dtypes.dtype}\n")
print(f"Type of y_small: \n{y_small.dtypes}\n")

Shape of X_small: (8, 8)
Shape of y_small: (8,)

Type of X_small columns: 
relative compactness         float64
surface area                 float64
wall area                    float64
roof area                    float64
overall height               float64
orientation                    int64
glazing area                 float64
glazing area distribution      int64
dtype: object

Type of X_small: 
object

Type of y_small: 
float64



### 3.3 Train and evaluate models

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Create a pandas DataFrame `results` with columns: Data size, training MSE, validation MSE
4. Call your convenience function `get_regressor_mse()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation MSE for each call to the `results` DataFrame
6. Print `results`

In [17]:
# TODO: ADD YOUR CODE HERE
# Import LinearRegression from sklearn
from sklearn.linear_model import LinearRegression

# Instantiate model: Set model to a model object before you can perform ML on data 
# fit_intercept=False sets the y-intercept to 0. 
# If fit_intercept=True, the y-intercept will be determined by the line of best fit.
model = LinearRegression(fit_intercept=True)

# Create results DataFrame
results = pd.DataFrame(columns=['Data Size', 'Training MSE', 'Validation MSE'])

# Restrict X to its first two columns; y is a 1-D target vector, so it is used unchanged
X_first_two_cols = X.iloc[:, 0:2]
y_first_two_cols = y.iloc[:]

# Obtain training and validation MSE with all data
first_result = get_regressor_mse(model, X, y)
# Obtain training and validation MSE with only the first two features
second_result = get_regressor_mse(model, X_first_two_cols, y_first_two_cols)
# Obtain training and validation MSE with only 1% of the rows used
third_result = get_regressor_mse(model, X_small, y_small)

# Pair each result with the shape of the data it was computed from and build one row per pair.
# Matching by position avoids mislabeling rows if two results happen to be equal, and building
# the frame once avoids DataFrame.append, which was removed in pandas 2.0.
shapes = [X.shape, X_first_two_cols.shape, X_small.shape]
all_results = [first_result, second_result, third_result]
rows = [
    {'Data Size': shape, 'Training MSE': train_mse, 'Validation MSE': val_mse}
    for shape, (train_mse, val_mse) in zip(shapes, all_results)
]
results = pd.DataFrame(rows, columns=results.columns)

# Display results DataFrame
results

|   | Data Size | Training MSE | Validation MSE |
|---|---|---|---|
| 0 | (768, 8) | 7.972066 | 10.318507 |
| 1 | (768, 2) | 53.60043 | 46.410426 |
| 2 | (8, 8) | 8.126845e-27 | 489.163464 |


### 3.4 Questions
1. What is the validation MSE using all data? What is the difference between training and validation MSE?
1. How do the validation MSE and the difference between training and validation MSE change when only two columns are used? Provide values.
1. How do the validation MSE and the difference between training and validation MSE change when only 1% of the rows are used? Provide values.

Answer for 1: Using all samples and all features, the validation mean squared error (MSE) is 10.318507. This is 2.346441 more than the training MSE of 7.972066.

Answer for 2: Using all samples but only the first two feature columns, the validation MSE is 46.410426. This is 7.190004 less than the training MSE of 53.60043. Restricting the features to the first two columns increased the magnitude of the difference between training and validation MSE (from 2.346441 to 7.190004) and reversed its sign. The model also performed significantly worse: the training MSE rose by 45.628364 and the validation MSE by 36.091919 compared to using all samples and all features.

Answer for 3: Using only 1% of the rows with all features, the validation MSE is 489.163464. This is many orders of magnitude larger than the training MSE of 8.126845e-27, which is zero up to floating-point error, so the difference between the two is essentially the whole validation MSE. These extreme values can be attributed to the small size of the dataset: with only 8 rows and 8 features, the linear model can reproduce its training rows almost exactly, but that fit does not generalize to the held-out rows.
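
A toy analogue of the last result: when a model has as many free parameters as training points, it can fit those points exactly and still predict poorly elsewhere. The sketch below fits a straight line (two parameters) through two training points drawn from a curve. The data, the function names, and the choice of `y = x**2` are all assumptions made for illustration; this is not the energy data or the lab's `LinearRegression` model.

```python
def fit_line_through_two_points(p1, p2):
    """Return (slope, intercept) of the line passing exactly through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


def mse(points, slope, intercept):
    """Mean squared error of the line on a list of (x, y) points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


# Hypothetical data following y = x**2
train_points = [(1.0, 1.0), (2.0, 4.0)]
validation_points = [(3.0, 9.0), (4.0, 16.0)]

slope, intercept = fit_line_through_two_points(*train_points)
train_mse = mse(train_points, slope, intercept)
validation_mse = mse(validation_points, slope, intercept)
print(train_mse, validation_mse)
```

The training error is zero because the line passes through both training points, while the validation error is large because the line does not follow the underlying curve.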

## 4. Observations/Interpretation

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


Answer: With the exception of the accuracy and MSE obtained with only two features, the results came out as expected: the validation accuracy and MSE are worse than the training accuracy and MSE. This is expected because the model fits the data it has already seen better than data it has not yet seen. With all samples and all features, the validation accuracy (0.918261) is at its highest and the validation MSE (10.318507) is at its lowest across the three experiments, and both are close to the training values (0.934783 and 7.972066). With only two features, training and validation performance both degrade (accuracy of about 0.61, MSE between 46.4 and 53.6), which makes sense because the data set is less descriptive. With only 1% of the samples, the training scores look perfect (accuracy 1.0, MSE 8.126845e-27) but the validation scores are worse than with the full data (accuracy 0.833333, MSE 489.163464), which is consistent with overfitting a very small sample.



## 5. Reflection
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


Answer: I liked that this lab gave us the opportunity to apply the five steps of the machine learning workflow that we learned in class. I thought it was interesting how the `train_test_split` function could be used to split the dataset into different sizes. This assignment provided excellent practice in using the scikit-learn library to train a model and use it to predict labels for new data. Learning how to properly load data with the Yellowbrick library was probably the most challenging part of this assignment. It was also interesting that cutting the number of features to two resulted in a model that yielded better accuracy and a smaller MSE for the validation set (61.3% accuracy, 46.4 MSE) than for the training set (60.9% accuracy, 53.6 MSE).

