# Lab1 - Scikit-learn
Author: *Jean-Charl Pretorius*

## 1. Introduction

The goal of this lab is to become familiar with the scikit-learn library.

You will practice loading example datasets, perform classification and regression with linear scikit-learn models, and investigate the effects of reducing the number of features (columns in X) and the number of samples (rows in X and y).


In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Classification

Using yellowbrick spam - classification  
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

The goal is to investigate `LogisticRegression(max_iter=2000)` and effects of reducing the number of features and number of samples on classification performance.

### 2.1 Implement convenience function

In [26]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def get_classifier_accuracy(model, X, y):
    '''Calculate train and validation accuracy of classifier (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn classifier): Classifier to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training accuracy, validation accuracy
    
    '''
    # split feature matrix and target vector
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)
    
    # fit model
    model.fit(X_train, y_train)

    # calculate training accuracy
    y_train_pred = model.predict(X_train)
    training_accuracy = accuracy_score(y_train, y_train_pred)

    # calculate validation accuracy
    y_test_pred = model.predict(X_test)
    validation_accuracy = accuracy_score(y_test, y_test_pred)

    return training_accuracy, validation_accuracy
    

### 2.2 Load data

Using the yellowbrick function `load_spam()`, load the spam data set into feature matrix `X` and target vector `y`.

Print size and type of `X` and `y`.


In [27]:
from yellowbrick.datasets import load_spam
X, y = load_spam()
print("X.size: {}".format(X.size))
print("type(X): {}".format(type(X)))

print("y.size: {}".format(y.size))
print("type(y): {}".format(type(y)))



X.size: 262200
type(X): <class 'pandas.core.frame.DataFrame'>
y.size: 4600
type(y): <class 'pandas.core.series.Series'>


Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [28]:
X_small, X_test, y_small, y_test = train_test_split(X, y, train_size=0.01, random_state=174)
print("size of X_small:", X_small.size)
print("type of X_small:", type(X_small))
print("size of y_small:", y_small.size)
print("type of y_small:", type(y_small))

size of X_small: 2622
type of X_small: <class 'pandas.core.frame.DataFrame'>
size of y_small: 46
type of y_small: <class 'pandas.core.series.Series'>
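
The row count above, and the further split that `get_classifier_accuracy()` applies to whatever it receives, can be checked by hand. The sketch below assumes scikit-learn rounds a fractional training size down and a fractional test size up, and that it holds out 25% of the rows when no size is given. `split_sizes` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, train_size=None, test_size=None):
    """Approximate the row counts train_test_split() would produce.

    Hypothetical helper: assumes fractional train sizes are rounded down,
    fractional test sizes are rounded up, and test_size defaults to 0.25
    when neither size is given.
    """
    if train_size is None and test_size is None:
        test_size = 0.25
    if train_size is not None:
        n_train = math.floor(train_size * n_samples)
        n_test = n_samples - n_train
    else:
        n_test = math.ceil(test_size * n_samples)
        n_train = n_samples - n_test
    return n_train, n_test

n_small, _ = split_sizes(4600, train_size=0.01)  # rows kept in X_small
n_fit, n_val = split_sizes(n_small)              # split inside the convenience function
print(n_small, n_fit, n_val)
```

Under these assumptions the model is later fitted on about three dozen rows and validated on about a dozen, so each misclassified validation row moves validation accuracy by roughly 0.08. Scores on this subset are therefore coarse estimates.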


### 2.3 Train and evaluate models

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
4. Call your convenience function `get_classifier_accuracy()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation accuracy for each call to the `results` DataFrame
6. Print `results`

In [29]:
# 1)
from sklearn.linear_model import LogisticRegression
# 2) 
lr = LogisticRegression(max_iter=2000)
# 3)
results = pd.DataFrame(columns=["Data size", "training accuracy", "validation accuracy", "sample"])

# 4 & 5)
samples = {"X and y":(X, y), "first 2 columns of X and y":(X.iloc[:,:2], y), "X_small and y_small":(X_small, y_small)}

# DataFrame.append() was removed in pandas 2.0, so add each row with .loc instead
for name, (featureMatrix, targetVector) in samples.items():
  training_accuracy, validation_accuracy = get_classifier_accuracy(lr, featureMatrix, targetVector)
  # values are listed in the same order as the columns of `results`
  results.loc[len(results)] = [featureMatrix.size, training_accuracy, validation_accuracy, name]


results.set_index("sample", inplace = True)
results



sample,Data size,training accuracy,validation accuracy
X and y,262200,0.934493,0.918261
first 2 columns of X and y,9200,0.608986,0.613043
X_small and y_small,2622,0.941176,0.75


### 2.4 Questions
1. What is the validation accuracy using all data? What is the difference between training and validation accuracy?
1. How do the validation accuracy and the difference between training and validation accuracy change when only two columns are used? Provide values.
1. How do the validation accuracy and the difference between training and validation accuracy change when only 1% of the rows are used? Provide values.

*YOUR ANSWERS HERE*

1) 

- The validation accuracy using all the data is 0.918 and the training accuracy is 0.934.

- There is a small difference of 0.016 between the validation and training accuracy scores, with the validation score slightly lower than the training score. Both values are close to 1, which indicates good accuracy.

2) 
- When we use only the first two columns, the validation accuracy drops by 0.305 to 0.613, and the training accuracy is 0.609.

- Both accuracy values have dropped considerably. The validation score is also very slightly higher than the training score, by 0.004, so the training and validation accuracies are closer together than when using all of the data.

3)
- When we use only 1% of the rows, the validation accuracy drops by 0.168 to 0.750. The training accuracy is still high at 0.941.

- The difference between training and validation accuracy has increased to 0.191.
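
The gaps quoted in these answers follow directly from the results table. A small sketch that recomputes them, with the accuracies copied from the printed output (the names `scores` and `gaps` are illustrative):

```python
# (training accuracy, validation accuracy) copied from the results table
scores = {
    "X and y": (0.934493, 0.918261),
    "first 2 columns of X and y": (0.608986, 0.613043),
    "X_small and y_small": (0.941176, 0.75),
}

full_validation = scores["X and y"][1]
gaps = {}
for name, (train_acc, val_acc) in scores.items():
    gaps[name] = train_acc - val_acc   # positive: training score is higher
    drop = full_validation - val_acc   # validation loss relative to using all data
    print(f"{name}: gap={gaps[name]:+.3f}, validation drop={drop:.3f}")
```

A negative gap means the validation score exceeded the training score, which here happens only for the two-column case.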



## 3. Regression

Using yellowbrick energy - regression  
https://www.scikit-yb.org/en/latest/api/datasets/energy.html

The goal is to investigate `LinearRegression()` and effects of reducing the number of features and number of samples on regression performance.

### 3.1 Implement convenience function

In [30]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def get_regressor_mse(model, X, y):
    '''Calculate train and validation mean-squared error (mse) of regressor (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn regressor): Regressor to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training mse, validation mse
    
    '''
   
    # split feature matrix and target vector
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)
    
    # fit model
    model.fit(X_train, y_train)

    # calculate training mse
    y_train_pred = model.predict(X_train)
    training_mse = mean_squared_error(y_train, y_train_pred)

    # calculate validation mse
    y_test_pred = model.predict(X_test)
    validation_mse = mean_squared_error(y_test, y_test_pred)
    
    return training_mse, validation_mse
    

### 3.2 Load data

Using the yellowbrick function `load_energy()`, load the energy data set into feature matrix `X` and target vector `y`.

Print dimensions and type of `X` and `y`.

In [31]:
from yellowbrick.datasets import load_energy
X, y = load_energy()
print("X.size: {}".format(X.size))
print("type(X): {}".format(type(X)))

print("y.size: {}".format(y.size))
print("type(y): {}".format(type(y)))


X.size: 6144
type(X): <class 'pandas.core.frame.DataFrame'>
y.size: 768
type(y): <class 'pandas.core.series.Series'>
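
Note that `.size` reports the total number of elements (rows × columns) rather than the dimensions; `.shape` gives the dimensions themselves. A minimal sketch using zero-filled NumPy arrays as stand-ins for the energy data, with the column count inferred from the sizes printed above (6144 / 768 = 8) rather than read from the dataset:

```python
import numpy as np

# Stand-ins for the energy data: 768 rows, 8 columns (inferred, not loaded)
X_demo = np.zeros((768, 8))
y_demo = np.zeros(768)

print("X_demo.size:", X_demo.size)    # total number of elements
print("X_demo.shape:", X_demo.shape)  # (rows, columns)
print("y_demo.shape:", y_demo.shape)  # one-dimensional target
```

pandas `DataFrame` and `Series` objects expose the same two attributes, so `X.shape` and `y.shape` would report the dimensions of the loaded data directly.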


Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [32]:
X_small, X_test, y_small, y_test = train_test_split(X, y, train_size=0.01, random_state=174)
print("size of X_small:", X_small.size)
print("type of X_small:", type(X_small))
print("size of y_small:", y_small.size)
print("type of y_small:", type(y_small))


size of X_small: 56
type of X_small: <class 'pandas.core.frame.DataFrame'>
size of y_small: 7
type of y_small: <class 'pandas.core.series.Series'>


### 3.3 Train and evaluate models

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Create a pandas DataFrame `results` with columns: Data size, training MSE, validation MSE
4. Call your convenience function `get_regressor_mse()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation MSE for each call to the `results` DataFrame
6. Print `results`

In [33]:
# 1)
from sklearn.linear_model import LinearRegression
# 2) 
lr = LinearRegression()
# 3)
results = pd.DataFrame(columns=["Data size", "training MSE", "validation MSE", "sample"])

# 4 & 5)
samples = {"X and y":(X, y), "first 2 columns of X and y":(X.iloc[:,:2], y), "X_small and y_small":(X_small, y_small)}

# DataFrame.append() was removed in pandas 2.0, so add each row with .loc instead
for name, (featureMatrix, targetVector) in samples.items():
  training_mse, validation_mse = get_regressor_mse(lr, featureMatrix, targetVector)
  # values are listed in the same order as the columns of `results`
  results.loc[len(results)] = [featureMatrix.size, training_mse, validation_mse, name]

results.set_index("sample", inplace = True)
results

sample,Data size,training MSE,validation MSE
X and y,6144,8.012691,10.366349
first 2 columns of X and y,1536,53.60043,46.410426
X_small and y_small,56,2.145702e-29,69.977449


### 3.4 Questions
1. What is the validation MSE using all data? What is the difference between training and validation MSE?
1. How do the validation MSE and the difference between training and validation MSE change when only two columns are used? Provide values.
1. How do the validation MSE and the difference between training and validation MSE change when only 1% of the rows are used? Provide values.

*YOUR ANSWERS HERE*

1) 
- The validation MSE using all data is 10.37 and the training MSE is 8.01.

- The difference between training and validation MSE is 2.35.

2) 
- The validation MSE using only the first two columns is 46.41. This is an increase of 36.04 compared to using all the data.

- The difference between training and validation MSE is 7.19, with the validation MSE lower than the training MSE (53.60). In absolute terms, this gap is 4.84 larger than the gap for the full dataset.


3) 
- The validation MSE using 1% of the rows is 69.98. This is an increase of 59.61 compared to using all the data.

- The difference between training and validation MSE is 69.98, because the training MSE is effectively zero. This is an increase of 67.62 compared to the gap for the full dataset.
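
The differences in these answers can be recomputed from the unrounded values in the results table, which avoids compounding rounding errors. A short sketch with the MSE values copied from the printed output (the names `mse` and `summary` are illustrative):

```python
# (training MSE, validation MSE) copied from the results table
mse = {
    "X and y": (8.012691, 10.366349),
    "first 2 columns of X and y": (53.60043, 46.410426),
    "X_small and y_small": (2.145702e-29, 69.977449),
}

full_train, full_val = mse["X and y"]
full_gap = full_val - full_train
summary = {}
for name, (train_mse, val_mse) in mse.items():
    gap = val_mse - train_mse  # positive: validation error is higher
    # (change in validation MSE, gap, change in absolute gap) versus the full data
    summary[name] = (val_mse - full_val, gap, abs(gap) - abs(full_gap))
    print(f"{name}: validation change={summary[name][0]:+.2f}, "
          f"gap={gap:+.2f}, gap change={summary[name][2]:+.2f}")
```

For MSE, lower is better, so a negative gap again means the model did better on the validation split than on the training split.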



## 4. Observations/Interpretation

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.



  - When using logistic regression with only the first two columns, the validation accuracy (0.613) is actually slightly higher than the training accuracy (0.609). This is similar to the linear regression on the first two columns, where the validation MSE (46.41) is lower than the training MSE (53.60). This is notable, because a model usually predicts the training set better than the validation set.


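  - We observe that the model performs poorly when we only use the first two columns of the data: both the training and validation scores are poor (accuracy of about 0.61, MSE between 46 and 54), which suggests that a linear model with only two features underfits. Linear models become more flexible as features are added. In the extreme case, if you have more features than data points, you can perfectly model the training set with a linear model.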
     

  - For both linear regression and logistic regression, performance on the training set decreases and performance on the validation set increases as the training set size increases. This is expected, because for a given model complexity, the larger the training set, the less likely the model is to overfit the data.

    - For linear regression with only 1% of the data, the training MSE is close to zero (2.1e-29), which looks desirable, but the validation MSE is very high at 69.98.
    - For logistic regression with only 1% of the data, the training accuracy is very high at 0.941, but the validation accuracy is low at 0.750.

  When we increase the size of the data set (using all of the data), we see a substantial improvement:
    - For logistic regression, the validation accuracy increases by 0.168, going from 0.750 to 0.918.
    - For linear regression, the validation MSE decreases by 59.61, from 69.98 to 10.37.
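
The near-zero training MSE on the 1% subset is consistent with the point about having more features than data points. The sketch below illustrates the effect on synthetic data rather than the energy dataset, and uses NumPy's least-squares solver as a stand-in for `LinearRegression()`. The 5 training rows and 8 features are assumptions chosen to resemble the 1% energy subset after a default 25% hold-out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_val, n_features = 5, 200, 8

def add_intercept(X):
    """Prepend a column of ones so the fit includes an intercept term."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

# Synthetic linear target with noise; none of this is the real energy data
true_coef = rng.normal(size=n_features)
X_train = rng.normal(size=(n_train, n_features))
X_val = rng.normal(size=(n_val, n_features))
y_train = X_train @ true_coef + rng.normal(scale=0.5, size=n_train)
y_val = X_val @ true_coef + rng.normal(scale=0.5, size=n_val)

# Fewer rows than unknowns: least squares can match every training target
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

train_mse = np.mean((add_intercept(X_train) @ coef - y_train) ** 2)
val_mse = np.mean((add_intercept(X_val) @ coef - y_val) ** 2)
print(f"training MSE: {train_mse:.2e}")
print(f"validation MSE: {val_mse:.2e}")
```

The training error is zero up to floating-point rounding while the validation error is not, which is the same pattern as the `X_small and y_small` row of the regression results.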



## 5. Reflection
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*My Reflection:*
- I found it confusing that the validation score was better than the training score when using only the first 2 columns. I couldn't figure out an explanation for that.
- I liked learning how to actually use sklearn and seeing how the linear models are affected by changing the dataset.
- I disliked that I had to use Google Colab to complete this lab, because there is an error when trying to use Jupyter Notebooks and the anaconda environment set up as described in the course material. Something about Yellowbrick doesn't work.

