**Exercise 4: Gradient Descent for Linear Regression**

*CPSC 381/581: Machine Learning*

*Yale University*

*Instructor: Alex Wong*

*Student: Hailey Robertson*


**Prerequisites**:

1. Enable Google Colaboratory as an app on your Google Drive account

2. Create a new Google Colab notebook. This will also create a "Colab Notebooks" directory under "MyDrive", i.e.,
```
/content/drive/MyDrive/Colab Notebooks
```

3. Create the following directory structure in your Google Drive
```
/content/drive/MyDrive/Colab Notebooks/CPSC 381-581: Machine Learning/Exercises
```

4. Move 04_exercise_gradient_descent.ipynb into
```
/content/drive/MyDrive/Colab Notebooks/CPSC 381-581: Machine Learning/Exercises
```
so that its absolute path is
```
/content/drive/MyDrive/Colab Notebooks/CPSC 381-581: Machine Learning/Exercises/04_exercise_gradient_descent.ipynb
```

In this exercise, we will fit a linear function for the regression task using gradient descent on the mean squared and half mean squared losses. We will test the resulting models on two datasets: Diabetes and California housing prices.
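
For reference, the two objectives and their gradients can be written as follows. This is a sketch in the notation that the docstrings in the code cells appear to use ($w$ is the $d \times 1$ weight vector, $x_n$ is the $n$-th column of the $d \times N$ feature matrix, $y_n$ is its target, $\alpha$ is the learning rate); the labels $\ell_{\text{MSE}}$ and $\ell_{\text{half}}$ are introduced here for convenience and are not necessarily the course's notation.

$$
\begin{aligned}
\ell_{\text{MSE}}(w) &= \frac{1}{N} \sum_{n=1}^{N} \left( w^\top x_n - y_n \right)^2,
& \nabla_w \ell_{\text{MSE}}(w) &= \frac{2}{N} \sum_{n=1}^{N} \left( w^\top x_n - y_n \right) x_n, \\
\ell_{\text{half}}(w) &= \frac{1}{2N} \sum_{n=1}^{N} \left( w^\top x_n - y_n \right)^2,
& \nabla_w \ell_{\text{half}}(w) &= \frac{1}{N} \sum_{n=1}^{N} \left( w^\top x_n - y_n \right) x_n, \\
w^{(t+1)} &= w^{(t)} - \alpha \, \nabla_w \ell\left( w^{(t)} \right).
\end{aligned}
$$

Because the two gradients differ only by a factor of 2, running the half mean squared loss with twice the learning rate should trace the same sequence of weights as the mean squared loss, given the same initialization and number of steps.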


**Submission**:

1. Implement all TODOs in the code blocks below.

2. Report your training, validation, and testing scores.

```
Report training, validation, and testing scores here.

For full credit, your mean squared error scores for models optimized using mean_squared and half_mean_squared losses on the Diabetes dataset should be no more than 15% worse than the mean squared error scores achieved by scikit-learn's linear regression model across the training, validation and testing splits. Your mean squared error scores on the California housing prices dataset should be no more than 20% worse.

***** Results of scikit-learn linear regression model on Diabetes dataset *****
Training set mean squared error: 25472.6663
Training set r-squared scores: -3.2907
Validation set mean squared error: 30376.7403
Validation set r-squared scores: -5.1176
Testing set mean squared error: 27024.0792
Testing set r-squared scores: -3.2604
***** Results of our linear regression model trained with mean_squared loss, alpha=0.0001 and T=30000 on Diabetes dataset *****
Training set mean squared error: 28034.7946
Training set r-squared scores: -3.7223
Validation set mean squared error: 34538.4871
Validation set r-squared scores: -5.9557
Testing set mean squared error: 29713.4975
Testing set r-squared scores: -3.6844
***** Results of our linear regression model trained with half_mean_squared loss, alpha=0.0002 and T=30000 on Diabetes dataset *****
Training set mean squared error: 28034.7946
Training set r-squared scores: -3.7223
Validation set mean squared error: 34538.4871
Validation set r-squared scores: -5.9557
Testing set mean squared error: 29713.4975
Testing set r-squared scores: -3.6844
***** Results of scikit-learn linear regression model on California housing prices dataset *****
Training set mean squared error: 0.5989
Training set r-squared scores: 0.5463
Validation set mean squared error: 0.5922
Validation set r-squared scores: 0.5721
Testing set mean squared error: 0.6590
Testing set r-squared scores: 0.5185
***** Results of our linear regression model trained with mean_squared loss, alpha=1e-07 and T=2500000 on California housing prices dataset *****
Training set mean squared error: 0.6496
Training set r-squared scores: 0.5080
Validation set mean squared error: 0.6348
Validation set r-squared scores: 0.5413
Testing set mean squared error: 0.7028
Testing set r-squared scores: 0.4865
***** Results of our linear regression model trained with half_mean_squared loss, alpha=2e-07 and T=2500000 on California housing prices dataset *****
Training set mean squared error: 0.6496
Training set r-squared scores: 0.5080
Validation set mean squared error: 0.6348
Validation set r-squared scores: 0.5413
Testing set mean squared error: 0.7028
Testing set r-squared scores: 0.4865
```

3. List any collaborators.

```
Collaborators: Doe, Jane (Please write names in <Last Name, First Name> format)

Collaboration details: Discussed ... implementation details with Jane Doe.
```

Import packages

In [1]:
import numpy as np
import sklearn.datasets as skdata
import sklearn.metrics as skmetrics
from sklearn.linear_model import LinearRegression as LinearRegressionSciKit
import warnings

warnings.filterwarnings(action='ignore')
# Seed NumPy's global random number generator so the dataset shuffle is repeatable
np.random.seed(1)

Implementation of our Gradient Descent optimizer for mean squared and half mean squared loss

In [2]:
class GradientDescentOptimizer(object):

    def __init__(self):
        pass

    def __compute_gradients(self, w, x, y, loss_func):
        '''
        Returns the gradient of mean squared or half mean squared loss

        Arg(s):
            w : numpy[float32]
                d x 1 weight vector
            x : numpy[float32]
                d x N feature vector
            y : numpy[float32]
                1 x N groundtruth vector
            loss_func : str
                loss type either 'mean_squared', or 'half_mean_squared'
        Returns:
            numpy[float32] : d x 1 gradients
        '''

        # DONE: Implements the __compute_gradients function
        if loss_func == 'mean_squared':
            # DONE: Implements gradients for mean squared loss

            '''
            Using for-loop

            gradients = np.zeros(x.shape)

            for n in range((x.shape[1])):
                x_n = x[:, n]
                gradients[:, n] = (np.matmul(w.T, x_n) - y[n]) * x_n
            '''

            # Using matrix multiplication
            gradients = (np.matmul(w.T, x) - y) * x

            # Note: Set keepdims=True to keep the dimension of 1 (otherwise it will get squashed by mean operation)
            return 2.0 * np.mean(gradients, axis=1, keepdims=True)
        elif loss_func == 'half_mean_squared':
            # DONE: Implements gradients for half mean squared loss

            gradients = (np.matmul(w.T, x) - y) * x

            return np.mean(gradients, axis=1, keepdims=True)
        else:
            raise ValueError('Unsupported loss function: {}'.format(loss_func))

    def update(self, w, x, y, alpha, loss_func):
        '''
        Updates the weight vector based on mean squared or half mean squared loss

        Arg(s):
            w : numpy[float32]
                d x 1 weight vector
            x : numpy[float32]
                d x N feature vector
            y : numpy[float32]
                1 x N groundtruth vector
            alpha : float
                learning rate
            loss_func : str
                loss type either 'mean_squared', or 'half_mean_squared'
        Returns:
            numpy[float32] : d x 1 weights
        '''

        # DONE: Implement the optimizer update function

        return w - alpha * self.__compute_gradients(w, x, y, loss_func)
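
One way to sanity-check analytic gradients like the ones above is to compare them against central finite differences. The sketch below is not part of the exercise template: the helper names `loss`, `gradient` and `numerical_gradient` are hypothetical, the data is synthetic, and the formulas are re-implemented as standalone functions (mirroring the optimizer's matrix form) so that the check does not depend on the class's private methods.

```python
import numpy as np


def loss(w, x, y, loss_func):
    # x: d x N features, y: 1 x N targets, w: d x 1 weights
    residuals = np.matmul(w.T, x) - y
    scale = 1.0 if loss_func == 'mean_squared' else 0.5
    return scale * np.mean(residuals ** 2)


def gradient(w, x, y, loss_func):
    # Same matrix form as the optimizer: mean over samples of residual * x
    residuals = np.matmul(w.T, x) - y
    scale = 2.0 if loss_func == 'mean_squared' else 1.0
    return scale * np.mean(residuals * x, axis=1, keepdims=True)


def numerical_gradient(w, x, y, loss_func, eps=1e-6):
    # Central differences: perturb one weight at a time by +/- eps
    grad = np.zeros_like(w)
    for i in range(w.shape[0]):
        step = np.zeros_like(w)
        step[i, 0] = eps
        loss_plus = loss(w + step, x, y, loss_func)
        loss_minus = loss(w - step, x, y, loss_func)
        grad[i, 0] = (loss_plus - loss_minus) / (2.0 * eps)
    return grad


# Small synthetic problem (assumed sizes: d=3 features, N=50 samples)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 50))
y = rng.normal(size=(1, 50))
w = rng.normal(size=(3, 1))

max_errors = {}
for loss_func in ['mean_squared', 'half_mean_squared']:
    analytic = gradient(w, x, y, loss_func)
    numeric = numerical_gradient(w, x, y, loss_func)
    max_errors[loss_func] = float(np.max(np.abs(analytic - numeric)))
    print('{}: max abs difference = {:.2e}'.format(loss_func, max_errors[loss_func]))
```

Since both losses are quadratic in `w`, the central difference should agree with the analytic gradient up to floating-point rounding, so a large difference would point to a bug in the gradient code.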


Implementation of Linear Regression with Gradient Descent optimizer

In [3]:
class LinearRegressionGradientDescent(object):

    def __init__(self):
        # Define private variables
        self.__weights = None
        self.__optimizer = GradientDescentOptimizer()

    def fit(self, x, y, T, alpha, loss_func='mean_squared'):
        '''
        Fits the model to x and y by updating the weight vector
        using gradient descent

        Arg(s):
            x : numpy[float32]
                d x N feature vector
            y : numpy[float32]
                1 x N groundtruth vector
            T : int
                number of iterations to train
            alpha : float
                learning rate
            loss_func : str
                loss function to use
        '''

        # DONE: Implement the fit function
        self.__weights = np.zeros([x.shape[0], 1])

        for t in range(1, T + 1):

            # DONE: Compute loss function
            loss = self.__compute_loss(
                x=x,
                y=y,
                loss_func=loss_func)

            if (t % 10000) == 0:
                print('Step={}  Loss={:.4f}'.format(t, loss))

            # DONE: Update weights
            self.__weights = self.__optimizer.update(
                w=self.__weights,
                x=x,
                y=y,
                alpha=alpha,
                loss_func=loss_func)

    def predict(self, x):
        '''
        Predicts the label for each feature vector x

        Arg(s):
            x : numpy[float32]
                d x N feature vector
        Returns:
            numpy[float32] : 1 x N vector
        '''

        # DONE: Implements the predict function

        return np.matmul(self.__weights.T, x)

    def __compute_loss(self, x, y, loss_func):
        '''
        Returns the mean squared or half mean squared loss

        Arg(s):
            x : numpy[float32]
                d x N feature vector
            y : numpy[float32]
                1 x N groundtruth vector
            loss_func : str
                loss type either 'mean_squared', or 'half_mean_squared'
        Returns:
            float : loss
        '''

        # DONE: Implements the __compute_loss function
        predictions = self.predict(x)

        if loss_func == 'mean_squared':
            # DONE: Implements loss for mean squared loss
            loss = np.mean((predictions - y) ** 2)
        elif loss_func == 'half_mean_squared':
            # DONE: Implements loss for half mean squared loss
            loss = 0.50 * np.mean((predictions - y) ** 2)
        else:
            raise ValueError('Unsupported loss function: {}'.format(loss_func))

        return loss
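
The following sketch illustrates how the two losses relate in practice: the half mean squared gradient is half the mean squared gradient, so doubling `alpha` for the half mean squared loss should produce the same weights. It uses a hypothetical standalone function `gradient_descent` (not the class above), synthetic data, and assumed values for `alpha` and `T` that are unrelated to the ones selected for the real datasets.

```python
import numpy as np


def gradient_descent(x, y, T, alpha, loss_func):
    # x: d x N features, y: 1 x N targets; weights start at zero
    w = np.zeros([x.shape[0], 1])
    scale = 2.0 if loss_func == 'mean_squared' else 1.0
    for _ in range(T):
        residuals = np.matmul(w.T, x) - y
        grad = scale * np.mean(residuals * x, axis=1, keepdims=True)
        w = w - alpha * grad
    return w


# Synthetic regression problem with known weights and a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 200))
w_true = np.array([[1.0], [-2.0], [0.5], [3.0]])
y = np.matmul(w_true.T, x) + 0.01 * rng.normal(size=(1, 200))

# Mean squared loss with alpha, half mean squared loss with 2 * alpha
w_mse = gradient_descent(x, y, T=500, alpha=0.05, loss_func='mean_squared')
w_half = gradient_descent(x, y, T=500, alpha=0.10, loss_func='half_mean_squared')

weights_match = np.allclose(w_mse, w_half)
print('Weights match:', weights_match)
print('Recovered weights:', np.round(w_mse.ravel(), 2))
```

This is consistent with the report above, where the mean_squared runs and the half_mean_squared runs with twice the learning rate show the same scores.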

Implementing training and validation loop for linear regression

In [4]:
# Load Diabetes and California housing prices dataset
datasets = [
    skdata.load_diabetes(),
    skdata.fetch_california_housing()
]
dataset_names = [
    'Diabetes',
    'California housing prices'
]

# Loss functions to minimize
dataset_loss_funcs = [
    ['mean_squared', 'half_mean_squared'],
    ['mean_squared', 'half_mean_squared']
]

# DONE: Select learning rates (alpha) for mean squared and half mean squared loss
dataset_alphas = [
    [1e-4, 2e-4],
    [1e-7, 2e-7]
]

# DONE: Select number of steps (T) to train for mean squared and half mean squared loss
dataset_Ts = [
    [30000, 30000],
    [2500000, 2500000]
]

for dataset_options in zip(datasets, dataset_names, dataset_loss_funcs, dataset_alphas, dataset_Ts):

    dataset, dataset_name, loss_funcs, alphas, Ts = dataset_options

    '''
    Create the training, validation and testing splits
    '''
    x = dataset.data
    y = dataset.target

    # Shuffle the dataset based on sample indices
    shuffled_indices = np.random.permutation(x.shape[0])

    # Choose the first 80% as training set, next 10% as validation and the rest as testing
    train_split_idx = int(0.80 * x.shape[0])
    val_split_idx = int(0.90 * x.shape[0])

    train_indices = shuffled_indices[0:train_split_idx]
    val_indices = shuffled_indices[train_split_idx:val_split_idx]
    test_indices = shuffled_indices[val_split_idx:]

    # Select the examples from x and y to construct our training, validation, testing sets
    x_train, y_train = x[train_indices, :], y[train_indices]
    x_val, y_val = x[val_indices, :], y[val_indices]
    x_test, y_test = x[test_indices, :], y[test_indices]

    '''
    Trains and tests Linear Regression model from scikit-learn
    '''
    # DONE: Initialize scikit-learn linear regression model without bias
    model_scikit = LinearRegressionSciKit(fit_intercept=False)

    # DONE: Trains scikit-learn linear regression model
    model_scikit.fit(x_train, y_train)


    print('***** Results of scikit-learn linear regression model on {} dataset *****'.format(
        dataset_name))

    # DONE: Test model on training set
    predictions_train = model_scikit.predict(x_train)

    score_mse_train = skmetrics.mean_squared_error(y_train, predictions_train)
    print('Training set mean squared error: {:.4f}'.format(score_mse_train))

    score_r2_train = skmetrics.r2_score(y_train, predictions_train)
    print('Training set r-squared scores: {:.4f}'.format(score_r2_train))

    # DONE: Test model on validation set
    predictions_val = model_scikit.predict(x_val)

    score_mse_val = skmetrics.mean_squared_error(y_val, predictions_val)
    print('Validation set mean squared error: {:.4f}'.format(score_mse_val))

    score_r2_val = skmetrics.r2_score(y_val, predictions_val)
    print('Validation set r-squared scores: {:.4f}'.format(score_r2_val))

    # DONE: Test model on testing set
    predictions_test = model_scikit.predict(x_test)

    score_mse_test = skmetrics.mean_squared_error(y_test, predictions_test)
    print('Testing set mean squared error: {:.4f}'.format(score_mse_test))

    score_r2_test = skmetrics.r2_score(y_test, predictions_test)
    print('Testing set r-squared scores: {:.4f}'.format(score_r2_test))

    '''
    Trains and tests our linear regression model using different loss functions
    '''

    # Take the transpose of the dataset to match the dimensions discussed in lecture
    # i.e., (N x d) to (d x N)
    x_train = np.transpose(x_train, axes=(1, 0))
    x_val = np.transpose(x_val, axes=(1, 0))
    x_test = np.transpose(x_test, axes=(1, 0))
    y_train = np.expand_dims(y_train, axis=0)
    y_val = np.expand_dims(y_val, axis=0)
    y_test = np.expand_dims(y_test, axis=0)

    for loss_func, alpha, T in zip(loss_funcs, alphas, Ts):

        # DONE: Initialize our linear regression model
        model_ours = LinearRegressionGradientDescent()

        print('***** Results of our linear regression model trained with {} loss, alpha={} and T={} on {} dataset *****'.format(
            loss_func, alpha, T, dataset_name))

        # DONE: Train model on training set
        model_ours.fit(
            x=x_train,
            y=y_train,
            T=T,
            alpha=alpha,
            loss_func=loss_func)

        # DONE: Test model on training set using mean squared error and r-squared
        predictions_train = model_ours.predict(x_train)

        # Squeeze to remove extra dimensions before passing to r2_score
        y_train_squeezed = np.squeeze(y_train)
        predictions_train_squeezed = np.squeeze(predictions_train)

        score_mse_train = skmetrics.mean_squared_error(y_train_squeezed, predictions_train_squeezed)
        print('Training set mean squared error: {:.4f}'.format(score_mse_train))

        score_r2_train = skmetrics.r2_score(y_train_squeezed, predictions_train_squeezed)
        print('Training set r-squared scores: {:.4f}'.format(score_r2_train))

        # DONE: Test model on validation set using mean squared error and r-squared
        predictions_val = model_ours.predict(x_val)

        # Squeeze to remove extra dimensions
        y_val_squeezed = np.squeeze(y_val)
        predictions_val_squeezed = np.squeeze(predictions_val)

        score_mse_val = skmetrics.mean_squared_error(y_val_squeezed, predictions_val_squeezed)
        print('Validation set mean squared error: {:.4f}'.format(score_mse_val))

        score_r2_val = skmetrics.r2_score(y_val_squeezed, predictions_val_squeezed)
        print('Validation set r-squared scores: {:.4f}'.format(score_r2_val))

        # DONE: Test model on testing set using mean squared error and r-squared
        predictions_test = model_ours.predict(x_test)

        # Squeeze to remove extra dimensions
        y_test_squeezed = np.squeeze(y_test)
        predictions_test_squeezed = np.squeeze(predictions_test)

        score_mse_test = skmetrics.mean_squared_error(y_test_squeezed, predictions_test_squeezed)
        print('Testing set mean squared error: {:.4f}'.format(score_mse_test))

        score_r2_test = skmetrics.r2_score(y_test_squeezed, predictions_test_squeezed)
        print('Testing set r-squared scores: {:.4f}'.format(score_r2_test))


***** Results of scikit-learn linear regression model on Diabetes dataset *****
Training set mean squared error: 25472.6663
Training set r-squared scores: -3.2907
Validation set mean squared error: 30376.7403
Validation set r-squared scores: -5.1176
Testing set mean squared error: 27024.0792
Testing set r-squared scores: -3.2604
***** Results of our linear regression model trained with mean_squared loss, alpha=0.0001 and T=30000 on Diabetes dataset *****
Step=10000  Loss=28180.3086
Step=20000  Loss=28106.3236
Step=30000  Loss=28034.8016
Training set mean squared error: 28034.7946
Training set r-squared scores: -3.7223
Validation set mean squared error: 34538.4871
Validation set r-squared scores: -5.9557
Testing set mean squared error: 29713.4975
Testing set r-squared scores: -3.6844
***** Results of our linear regression model trained with half_mean_squared loss, alpha=0.0002 and T=30000 on Diabetes dataset *****
Step=10000  Loss=14090.1543
Step=20000  Loss=14053.1618
Step=30000  Loss=