# Lab1 - Scikit-learn
Author: Jashraj Dubal

## 1. Introduction

The goal of this lab is to become familiar with the scikit-learn library.

You will practice loading example datasets, performing classification and regression with linear scikit-learn models, and investigating the effects of reducing the number of features (columns in X) and the number of samples (rows in X and y).


In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from yellowbrick.datasets import load_spam
from yellowbrick.datasets import load_energy

## 2. Classification

Using yellowbrick spam - classification  
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

The goal is to investigate `LogisticRegression(max_iter=2000)` and the effects of reducing the number of features and the number of samples on classification performance.

### 2.1 Implement convenience function

In [7]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def get_classifier_accuracy(model, X, y):
    '''Calculate train and validation accuracy of classifier (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn classifier): Classifier to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training accuracy, validation accuracy
    
    '''
    
    #TODO: IMPLEMENT FUNCTION BODY
    
    # Split data into training and validation sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)
    
    # Train model
    model.fit(X_train, y_train)
    
    # Get the training and validation accuracy
    training_accuracy = model.score(X_train, y_train)
    validation_accuracy = model.score(X_test, y_test)
    return training_accuracy, validation_accuracy
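
The function above relies on `model.score()`, which for scikit-learn classifiers returns the mean accuracy, so it should agree with `accuracy_score` (imported above) applied to the model's predictions. A minimal sketch of that equivalence follows. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `X_tr`, `X_va` are assumptions for illustration only, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; this is not the spam dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Same split settings as the convenience function: default 75/25 split, random_state=956
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=956)

model = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)

# Two ways to obtain the validation accuracy
score_acc = model.score(X_va, y_va)
metric_acc = accuracy_score(y_va, model.predict(X_va))
print(score_acc, metric_acc)
```

Both values are computed on the same 50 held-out rows, so either call can be used inside the convenience function.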

### 2.2 Load data

Using the yellowbrick function `load_spam()`, load the spam dataset into feature matrix `X` and target vector `y`.

Print size and type of `X` and `y`.


In [8]:
# TODO: ADD YOUR CODE HERE
df_spam = load_spam(return_dataset=True).to_dataframe()

X_spam, y_spam = df_spam.drop(columns=['is_spam']), df_spam['is_spam']
print(X_spam.shape, y_spam.shape)


(4600, 57) (4600,)


Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [9]:
# TODO: ADD YOUR CODE HERE
X_train, X_small, y_train, y_small = train_test_split(X_spam, y_spam, test_size=0.01, random_state=174)
print(X_small.shape, y_small.shape)

(46, 57) (46,)
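
The row counts can be checked by hand. As a sketch, assuming `train_test_split` rounds a fractional `test_size` up to the next whole sample and gives the remainder to the training set, the arithmetic below reproduces the 46 rows printed above and shows how few rows remain once the convenience function splits the subset again with its default 25% validation share. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# 1% of the 4600 spam rows
n_rest, n_small = split_sizes(4600, 0.01)

# The convenience function splits the 1% subset again (default test_size=0.25)
n_train, n_val = split_sizes(n_small, 0.25)

print(n_rest, n_small)  # rows left out, rows in X_small
print(n_train, n_val)   # rows used for fitting, rows used for validation
```

With this few validation rows, each misclassified sample moves the validation accuracy by a large step, so results on the small subset are noisy.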


### 2.3 Train and evaluate models

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
4. Call your convenience function `get_classifier_accuracy()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation accuracy for each call to the `results` DataFrame
6. Print `results`

In [10]:
# TODO: ADD YOUR CODE HERE
from sklearn.linear_model import LogisticRegression
logistic_regression = LogisticRegression(max_iter=2000)
results = pd.DataFrame(columns=['data size', 'training accuracy', 'validation accuracy'])

# Fit and evaluate on the full dataset (X_spam, y_spam)
results.loc[0] = ['full', *get_classifier_accuracy(logistic_regression, X_spam, y_spam)]

# Fit and evaluate using only the first two feature columns.
# Note: X_train and y_train are the 99% remainder from the split in section 2.2,
# not the full X_spam and y_spam.
X_two_features = X_train.iloc[:, :2]
results.loc[1] = ['two features', *get_classifier_accuracy(logistic_regression, X_two_features, y_train)]

# Fit and evaluate on the 1% subset (X_small, y_small)
results.loc[2] = ['small', *get_classifier_accuracy(logistic_regression, X_small, y_small)]
results

|   | data size | training accuracy | validation accuracy |
|---|---|---|---|
| 0 | full | 0.934783 | 0.918261 |
| 1 | two features | 0.62694 | 0.587357 |
| 2 | small | 1.0 | 0.833333 |


### 2.4 Questions
1. What is the validation accuracy using all data? What is the difference between training and validation accuracy?
2. How does the validation accuracy and difference between training and validation change when only two columns are used? Provide values.
3. How does the validation accuracy and difference between training and validation change when only 1% of the rows are used? Provide values.

*YOUR ANSWERS HERE*
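1. The validation accuracy using all data is **0.918261**. The difference between training and validation accuracy is **0.016522** (0.934783 - 0.918261).
2. When only two columns are used, the validation accuracy drops to **0.587357** and the difference between training and validation accuracy grows to **0.039583** (0.62694 - 0.587357).
3. When only 1% of the rows are used, the validation accuracy is **0.833333** and the difference between training and validation accuracy grows to **0.166667** (1.0 - 0.833333).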

## 3. Regression

Using yellowbrick energy - regression  
https://www.scikit-yb.org/en/latest/api/datasets/energy.html

The goal is to investigate `LinearRegression()` and the effects of reducing the number of features and the number of samples on regression performance.

### 3.1 Implement convenience function

In [11]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def get_regressor_mse(model, X, y):
    '''Calculate train and validation mean-squared error (mse) of regressor (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn regressor): Regressor to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training mse, validation mse
    
    '''
    
    #TODO: IMPLEMENT FUNCTION BODY
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)
    model.fit(X_train, y_train)
    return mean_squared_error(y_train, model.predict(X_train)), mean_squared_error(y_test, model.predict(X_test))
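
Because the energy targets have two columns, it helps to see what a single MSE number means in that case. With its default settings, `mean_squared_error` averages the squared errors over all samples and all target columns. The sketch below computes the same quantity by hand with NumPy; the helper `mse_by_hand` and the toy arrays are assumptions for illustration, not lab data.

```python
import numpy as np

def mse_by_hand(y_true, y_pred):
    """Mean of squared residuals over every sample and every target column."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy two-column targets (hypothetical values)
y_true = np.array([[10.0, 20.0], [30.0, 40.0]])
y_pred = np.array([[12.0, 20.0], [30.0, 36.0]])

# Residuals are 2, 0, 0 and -4, so the squared errors are 4, 0, 0 and 16
mse = mse_by_hand(y_true, y_pred)
print(mse)
```

The reported training and validation MSE values therefore mix the errors of both targets into one number.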

### 3.2 Load data

Using the yellowbrick function `load_energy()`, load the energy dataset into feature matrix `X` and target vector `y`.

Print dimensions and type of `X` and `y`.

In [12]:
# TODO: ADD YOUR CODE HERE
df_energy = load_energy(return_dataset=True).to_dataframe()

# Features are the first 8 columns
# Target is the last two columns
X_energy, y_energy = df_energy.drop(columns=['heating load', 'cooling load']), df_energy[['heating load', 'cooling load']]
print(X_energy.shape, y_energy.shape)

(768, 8) (768, 2)


Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [13]:
# TODO: ADD YOUR CODE HERE
X_energy_train, X_energy_small, y_energy_train, y_energy_small = train_test_split(X_energy, y_energy, test_size=0.01, random_state=174)
print(X_energy_small.shape, y_energy_small.shape)

(8, 8) (8, 2)


### 3.3 Train and evaluate models

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Create a pandas DataFrame `results` with columns: Data size, training MSE, validation MSE
4. Call your convenience function `get_regressor_mse()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation MSE for each call to the `results` DataFrame
6. Print `results`

In [14]:
# TODO: ADD YOUR CODE HERE
from sklearn.linear_model import LinearRegression

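# Use a lowercase instance name so the LinearRegression class is not shadowed
linear_regression = LinearRegression()
results = pd.DataFrame(columns=['data size', 'training mse', 'validation mse'])

results.loc[0] = ['full', *get_regressor_mse(linear_regression, X_energy, y_energy)]
# Note: X_energy_train and y_energy_train are the 99% remainder from the split in section 3.2,
# not the full X_energy and y_energy.
results.loc[1] = ['two features', *get_regressor_mse(linear_regression, X_energy_train.iloc[:, :2], y_energy_train)]
results.loc[2] = ['small', *get_regressor_mse(linear_regression, X_energy_small, y_energy_small)]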

results

|   | data size | training mse | validation mse |
|---|---|---|---|
| 0 | full | 9.048134 | 10.372558 |
| 1 | two features | 49.47659 | 42.228256 |
| 2 | small | 8.616202e-27 | 436.726767 |
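
The near-zero training MSE in the `small` row has a simple explanation. The 8-row subset is split again inside `get_regressor_mse()` (default 75/25), which leaves 6 training rows for 8 features. With fewer rows than coefficients, a linear model can pass through every training point exactly. The sketch below illustrates this with an ordinary least-squares solve on random data. The random arrays and the omission of an intercept are simplifying assumptions; this is not the energy dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# 6 training rows, 8 features, 2 targets: the same shapes as the small energy training split
X_fit = rng.normal(size=(6, 8))
y_fit = rng.normal(size=(6, 2))

# Least-squares solution; with more unknowns than rows it interpolates the training data
coef, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)

train_mse = float(np.mean((X_fit @ coef - y_fit) ** 2))
print(train_mse)  # on the order of floating-point round-off
```

A perfect fit of this kind says nothing about unseen rows, which is consistent with the very large validation MSE on the small subset.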


### 3.4 Questions
1. What is the validation MSE using all data? What is the difference between training and validation MSE?
2. How does the validation MSE and difference between training and validation change when only two columns are used? Provide values.
3. How does the validation MSE and difference between training and validation change when only 1% of the rows are used? Provide values.

*YOUR ANSWERS HERE*
1. The validation MSE using all data is **10.372558**. The difference between training and validation MSE is **-1.324424**.
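2. When only two columns are used, the validation MSE rises to **42.228256**. The difference between training and validation MSE is **7.248334** (49.47659 - 42.228256); here the training MSE is higher than the validation MSE.
3. When only 1% of the rows are used, the validation MSE rises to **436.726767**. The difference between training and validation MSE is **-436.726767**, because the training MSE is effectively zero (8.616202e-27).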


## 4. Observations/Interpretation

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

The results from the Classification and Regression sections show a consistent pattern. With all of the available data, both models generalize well: the difference between training and validation accuracy/MSE is small. With only 1% of the rows, the models fit their few training samples almost perfectly (training accuracy 1.0, training MSE 8.616202e-27) but fail to generalize, which produces a large gap between training and validation accuracy/MSE. In other words, the models overfit when too few samples are available.

The data support this:
- The Logistic Regression model shows a difference of 0.016522 between training and validation accuracy with all data, and 0.1666667 with 1% of data.
- The Linear Regression model displays a difference of -1.324424 between training and validation MSE with all data, and -436.726767 with 1% of data.
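
The differences quoted in this report can be recomputed directly from the two results tables. The sketch below does that arithmetic as training value minus validation value; the dictionary layout and the names `reported` and `gaps` are assumptions for illustration, while the numbers are copied from the tables in sections 2.3 and 3.3.

```python
# (training, validation) pairs copied from the results tables
reported = {
    "accuracy": {
        "full": (0.934783, 0.918261),
        "two features": (0.62694, 0.587357),
        "small": (1.0, 0.833333),
    },
    "mse": {
        "full": (9.048134, 10.372558),
        "two features": (49.47659, 42.228256),
        "small": (8.616202e-27, 436.726767),
    },
}

# Gap = training value minus validation value, rounded to the tables' precision
gaps = {
    metric: {name: round(train - val, 6) for name, (train, val) in rows.items()}
    for metric, rows in reported.items()
}

for metric, rows in gaps.items():
    for name, gap in rows.items():
        print(f"{metric:8s} {name:12s} {gap: .6f}")
```

For accuracy, a positive gap means the model does better on training data; for MSE, a negative gap means the same thing, since lower error is better.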

## 5. Reflection
Include a sentence or two about:
- what you liked or disliked,
- what you found interesting, confusing, challenging, or motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I enjoyed the process of creating my own Linear Regression and Logistic Regression models and using them to predict data. It was interesting to see how different data sizes affect the training and validation accuracy of the models.
