# Linear regression with a neural network mindset.

This notebook will demonstrate how a linear regression problem can be thought of as an extremely simple neural network.  It is essentially a perceptron with no hidden layers or activation function.

This exercise is a precursor to building a complete neural network from scratch.  That network would contain hidden layers, activation functions, and multiple nodes.

## Regression

In mathematical terms, the regression problem is expressed as:

$$z = \left(\sum_{i=1}^{n} w^{(i)} x^{(i)}\right) + b$$

$z$ is the weighted sum of the input features, plus the bias term  

$n$ is the number of input features  
$w^{(i)}$ is the weight associated with input feature $x^{(i)}$  
$x^{(i)}$ is the $i$-th input feature  
$b$ is the bias term  

The parameters to be learned are a weight $w$ for each input feature and the bias $b$.  
These will be initialized and then learned through a gradient descent algorithm.
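
As a small illustration of the weighted sum above, the sketch below computes $z$ for two samples at once. The feature values, weights, and bias are made-up toy numbers, not values from the housing data.

```python
import numpy as np

# Hypothetical toy data: 2 samples, 2 features (not from the housing dataset)
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
w = np.array([[0.5],
              [-1.0]])   # one weight per feature, shape (n_features, 1)
b = 2.0                  # single bias term, broadcast to every sample

# Weighted sum of the features plus the bias, one value per sample
z = np.dot(X, w) + b
print(z.shape)
print(z)
```

Each row of `X` produces one entry of `z`, so the same expression works for a single sample or for the whole training set.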

## Data

Use the same housing dataset as in the Housing folder.  
Slightly simplify the features in order to simplify the demonstration.

In [19]:
# Imports for Data section
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [20]:
housing = fetch_california_housing()

# Use the same random state for reproducible results
x_train, x_test, y_train, y_test = train_test_split(housing.data,
                                                    housing.target,
                                                    test_size=0.1,
                                                    random_state=66)

# Remove the latitude and longitude features
x_train = x_train[:,0:6]
x_test = x_test[:,0:6]

# Reshape y to match expected shape
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)

In [21]:
# standardize x values to assist in convergence
# Compute mean and standard deviation
mean = np.mean(x_train, axis=0)
std = np.std(x_train, axis=0)

# Perform standardization
# When standardizing the test set it is important to use the training set mean/std  
#  otherwise information about your test data will bleed into your evaluation.
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std
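
A quick way to see what standardization does is to check the column statistics afterwards. The sketch below uses synthetic data (the array names and scales are assumptions made for illustration) and reuses the training mean and standard deviation for the held-out split, as in the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical raw features with very different scales
train = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(200, 2))
test = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(50, 2))

mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics

# Training columns end up with mean ~0 and std ~1;
# test columns are close to that, but not exactly, which is expected
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```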

## Initialization

A neural network will learn a weight for each input feature plus a single bias term.  
This approximation of a neural network will learn the same thing.  

First we must initialize the parameters to be learned.  
Initialize from a standard normal distribution  
then multiply the values by a small constant to ensure small initial weights and prevent exploding (very large) gradients.

In [22]:
w = np.random.randn(6, 1) * 0.01 # shape is (n_features,1)
b = np.random.randn() * 0.01 # bias term
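
The cell above draws from NumPy's global random state, so the starting values differ between runs. If reproducible initial weights are wanted, one option is a seeded generator, sketched below. The seed value and the names `w_init` and `b_init` are assumptions, and the outputs shown later in this notebook were not produced with this seed.

```python
import numpy as np

n_features = 6
rng = np.random.default_rng(42)   # assumed seed, any fixed integer works

# Same scheme as above: standard normal draws scaled down by 0.01
w_init = rng.standard_normal((n_features, 1)) * 0.01
b_init = rng.standard_normal() * 0.01

print(w_init.shape, b_init)
```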

## Propagation

Forward propagation - in this case, forward propagation is simply computing a predicted value for each input sample in X.  In a larger neural network, this computation would typically be followed by an activation function, at least for nodes in a hidden layer or when the final unit of the network is a sigmoid or softmax function.

Backward propagation - the gradients indicate the direction and magnitude in which each parameter should be adjusted to minimize the loss, i.e. how much the loss changes with respect to (wrt) a small change in the parameter.  
The two gradients computed are:  
dw - gradient of the loss wrt the weights  
db - gradient of the loss wrt the bias  
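
For reference, here is a sketch of where these gradients come from, assuming the mean squared error cost used in the code below. The sample index $j$, the sample count $m$, and the row vector $x_j$ are notation introduced here for the derivation.

$$J = \frac{1}{m} \sum_{j=1}^{m} (z_j - y_j)^2, \qquad z_j = x_j \cdot w + b$$

$$\frac{\partial J}{\partial w} = \frac{2}{m} X^{T} (z - Y), \qquad \frac{\partial J}{\partial b} = \frac{2}{m} \sum_{j=1}^{m} (z_j - y_j)$$

The code below scales both gradients by $\frac{1}{m}$ rather than $\frac{2}{m}$. The direction is unchanged, and the constant factor is effectively absorbed into the learning rate.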

In [23]:
def propagate(w, b, X, Y):
    """
    Compute the MSE cost and its gradients for one forward/backward pass.

    Arguments:
    w -- weights, a numpy array of size (n_features, 1)
    b -- bias, a scalar
    X -- data of shape (n_samples, n_features)
    Y -- true "label" vector shape: (n_samples, 1)

    Return:
    gradients -- dictionary with keys "dw" and "db"
        dw -- gradient of the loss with respect to w, thus same shape as w
        db -- gradient of the loss with respect to b, a scalar like b
    cost -- mean squared error between Y and the predictions
    """

    m = X.shape[0] # Number of samples

    # FORWARD PROPAGATION
    # compute z.  This will output a predicted value for each sample.
    # z will have shape(n_samples, 1)
    z = np.dot(X, w) + b

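    # compute the cost using MSE (Mean Squared Error)
    cost = 1/m * (np.sum(np.square(Y-z))) # Y-z or z-Y won't matter because it is being squared.

    # BACKWARD PROPAGATION (compute gradients wrt weights and biases)
    # Note: the exact derivative of the MSE above carries a factor of 2 (2/m rather than 1/m).
    # It is omitted here; this only rescales the step and is absorbed by the learning rate.
    dZ = z - Y
    dw = 1/m * np.dot(X.T, dZ)
    db = 1/m * np.sum(dZ)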

    gradients = {"dw": dw,
                 "db": db}
        
    return gradients, cost
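
One way to sanity-check gradients like these is a finite-difference comparison. The sketch below re-implements the cost and gradient formulas from `propagate` as two small helpers (`mse_cost` and `notebook_grads` are names made up for this example) and compares them on random toy data. Since the gradients above are scaled by 1/m, the numerical derivative of the MSE should come out at twice their value.

```python
import numpy as np

def mse_cost(w, b, X, Y):
    """MSE cost, matching the cost line in propagate."""
    z = np.dot(X, w) + b
    return np.sum(np.square(Y - z)) / X.shape[0]

def notebook_grads(w, b, X, Y):
    """Gradients with the same 1/m scaling used in propagate."""
    m = X.shape[0]
    dZ = np.dot(X, w) + b - Y
    return np.dot(X.T, dZ) / m, np.sum(dZ) / m

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))      # hypothetical toy data
Y = rng.standard_normal((20, 1))
w = rng.standard_normal((3, 1))
b = 0.5

dw, db = notebook_grads(w, b, X, Y)

# Central finite differences of the cost, one weight at a time
eps = 1e-6
numeric_dw = np.zeros_like(w)
for k in range(w.shape[0]):
    step = np.zeros_like(w)
    step[k, 0] = eps
    numeric_dw[k, 0] = (mse_cost(w + step, b, X, Y)
                        - mse_cost(w - step, b, X, Y)) / (2 * eps)
numeric_db = (mse_cost(w, b + eps, X, Y) - mse_cost(w, b - eps, X, Y)) / (2 * eps)

# The numerical derivative of the MSE is compared against 2 * (1/m-scaled gradient)
print(np.allclose(numeric_dw, 2 * dw, atol=1e-6))
print(np.isclose(numeric_db, 2 * db, atol=1e-6))
```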


## Optimization

Now that we have initialized the parameter values and can compute the cost function and its gradients,  
we update the parameters using gradient descent.

Learn $w$ and $b$ by minimizing the cost function $J$.  
For a parameter $\theta$ the update rule is  
$\theta = \theta - \alpha \, d\theta$  
where $\alpha$ is the learning rate.

In [24]:
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False):
    """
    This function optimizes w and b by running a gradient descent algorithm
    
    Arguments:
    w -- weights, a numpy array of size (n_features, 1)
    b -- bias, a scalar
    X -- data of shape (n_samples, n_features)
    Y -- true "label" vector shape: (n_samples, 1)
    num_iterations -- number of iterations of the optimization loop
    learning_rate -- learning rate of the gradient descent update rule
    print_cost -- True to print the cost every 100 iterations
    
    Returns:
    params -- dictionary containing the learned weights w and bias b
    grads -- dictionary containing the gradients of the weights and bias from the final iteration

    """
    
    for i in range(num_iterations):
        
        grads, cost = propagate(w, b, X, Y)
        
        # Retrieve derivatives from grads
        dw = grads["dw"]
        db = grads["db"]
        
        # update rule (from markdown above)
        w = w-(dw*learning_rate)
        b = b-(db*learning_rate)

        # Print the cost every 100 training iterations
        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
            
    params = {"w": w,
              "b": b}
    
    grads = {"dw": dw,
             "db": db}
    
    return params, grads
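
To plot a learning curve, the cost at each iteration has to be recorded, which `optimize` above does not do. A minimal sketch of a variant that keeps the history is shown below. `optimize_with_history` is a hypothetical helper, and the toy data is synthetic rather than the housing set.

```python
import numpy as np

def optimize_with_history(w, b, X, Y, num_iterations, learning_rate):
    """Gradient descent that also returns the cost at every iteration."""
    m = X.shape[0]
    costs = []
    for _ in range(num_iterations):
        z = np.dot(X, w) + b
        costs.append(np.sum(np.square(Y - z)) / m)   # MSE before this update
        dZ = z - Y
        w = w - learning_rate * np.dot(X.T, dZ) / m  # same 1/m scaling as propagate
        b = b - learning_rate * np.sum(dZ) / m
    return {"w": w, "b": b}, costs

# Toy data, assumed for illustration: Y is an exact linear function of X
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
Y = np.dot(X, np.array([[2.0], [-1.0]])) + 0.5

params, costs = optimize_with_history(np.zeros((2, 1)), 0.0, X, Y, 200, 0.1)
print(costs[0], costs[-1])
```

The `costs` list can then be passed to any plotting library to visualize convergence.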

## Predict

Once we have a learned array of weights and a bias term, the prediction for a given set of inputs is relatively simple.

In [25]:
def predict(w, b, X):
    '''
    Predict the values for a given set of data using learned weights and bias
    
    Arguments:
    w -- weights, a numpy array of size (n_features, 1)
    b -- bias, a scalar
    X -- data of size (n_samples, n_features)
    
    Returns:
    prediction -- a numpy array containing predicted values (n_samples, 1)
    '''
    
    # In this neural network prototype, there is no activation function so the predictions are simply
    #  the dot product of the feature matrix and the weight matrix + the bias term
    prediction = np.dot(X, w) + b
    return prediction

## Model

This code will mimic a neural network model.  

This "model" won't save an artifact.  

It will use the optimize function to learn the optimal values for w and b.  
It will then make predictions on both the train and test sets.  
It will output the r2 scores on the train and test sets.

We can compare the r2 score on the test set against the regression baseline.  It should be comparable, which shows that this extremely simple neural-network-style model can learn from the input features.

In [26]:
num_iterations = 1000
learning_rate = 0.01
print_cost = True

parameters, _ = optimize(w, b, x_train, y_train, num_iterations, learning_rate, print_cost)

# Retrieve parameters w and b from dictionary "parameters"
w = parameters["w"]
b = parameters["b"]

# Predict test/train set examples
y_train_predictions = predict(w, b, x_train)
y_test_predictions = predict(w, b, x_test)

# Compute train/test r2 scores
r2_train = r2_score(y_train, y_train_predictions)
r2_test = r2_score(y_test, y_test_predictions)

print(f"train score: {r2_train}")
print(f"test score: {r2_test}")


Cost after iteration 0: 5.552101
Cost after iteration 100: 1.320006
Cost after iteration 200: 0.747717
Cost after iteration 300: 0.665072
Cost after iteration 400: 0.649508
Cost after iteration 500: 0.643738
Cost after iteration 600: 0.639803
Cost after iteration 700: 0.636521
Cost after iteration 800: 0.633668
Cost after iteration 900: 0.631169
train score: 0.5246059826612859
test score: 0.5423449598458077
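
Another way to check a gradient descent result is to compare it with the closed-form least-squares solution. The sketch below does this on synthetic data with `np.linalg.lstsq`. It is a stand-in illustration, not the regression baseline referred to above, and the data, iteration count, and learning rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))                 # synthetic features, roughly standardized
true_w = np.array([[1.5], [-2.0], [0.5]])
Y = np.dot(X, true_w) + 3.0 + 0.1 * rng.standard_normal((500, 1))

# Gradient descent with the same update rule as optimize
w = np.zeros((3, 1))
b = 0.0
m = X.shape[0]
for _ in range(2000):
    dZ = np.dot(X, w) + b - Y
    w = w - 0.1 * np.dot(X.T, dZ) / m
    b = b - 0.1 * np.sum(dZ) / m

# Closed-form least squares: append a column of ones so the last coefficient is the bias
X_aug = np.hstack([X, np.ones((m, 1))])
coef, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)

print(np.allclose(w, coef[:3], atol=1e-6))
print(np.isclose(b, coef[3, 0], atol=1e-6))
```

On a well-conditioned problem like this one, the two approaches should agree closely once gradient descent has run long enough.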


## Intuition/Explainability

Output the weights for each feature and the bias term.  
The math for predicting an unseen value from the features is relatively simple.
Refer to the regression calculation above.

In [27]:
features_and_weights = dict(zip(housing.feature_names[0:6], w))
features_and_weights

{'MedInc': array([0.87478891]),
 'HouseAge': array([0.22014495]),
 'AveRooms': array([-0.1811009]),
 'AveBedrms': array([0.15264353]),
 'Population': array([0.03281082]),
 'AveOccup': array([-0.04342863])}

In [28]:
print(f'bias: {b}')

bias: 2.0644495015459063
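
As a worked example of that calculation, the sketch below plugs the weights and bias printed above into the regression formula. The two input rows are hypothetical and are already in standardized units: all zeros means every feature sits at its training mean, and a 1.0 means one standard deviation above it.

```python
import numpy as np

# Learned values copied from the outputs above
w = np.array([[0.87478891], [0.22014495], [-0.1811009],
              [0.15264353], [0.03281082], [-0.04342863]])
b = 2.0644495015459063

# Hypothetical standardized inputs, in the order
# MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup
x_average = np.zeros((1, 6))
x_high_income = np.array([[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]])

pred_average = np.dot(x_average, w) + b          # reduces to the bias
pred_high_income = np.dot(x_high_income, w) + b  # bias plus the MedInc weight

print(pred_average, pred_high_income)
```

Because the features were standardized, the bias is the prediction for an average input, and each weight is the change in the prediction per standard deviation of that feature. The target in this dataset is the median house value in units of $100,000.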


## Notes

* This "model" was trained for 1000 iterations over the training data.  The cost decreases throughout training and is logged every 100 iterations.
* The learned weights and bias were used to predict values for both the train and test datasets.  The predicted test values have an r2 score on par with the baseline, so the model is learning from the input features.
* This simplistic example had no need to split the training data into batches.  The matrix multiplications were small enough that each iteration used the entire training set.  As neural networks and the number of inputs grow larger and more complex, processing the data in batches at each training iteration becomes important as a way to fit the computations into the available memory.
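
For comparison with the full-batch loop used in this notebook, a mini-batch version of the same update rule might look like the sketch below. The function name, batch size, epoch count, and synthetic data are all assumptions made for illustration.

```python
import numpy as np

def minibatch_gradient_descent(X, Y, batch_size, num_epochs, learning_rate, seed=0):
    """Mini-batch variant of the update rule (hypothetical helper, not from the notebook)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros((n_features, 1))
    b = 0.0
    for _ in range(num_epochs):
        order = rng.permutation(n_samples)            # reshuffle the samples every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]     # the last batch may be smaller
            X_batch, Y_batch = X[idx], Y[idx]
            dZ = np.dot(X_batch, w) + b - Y_batch
            w = w - learning_rate * np.dot(X_batch.T, dZ) / len(idx)
            b = b - learning_rate * np.sum(dZ) / len(idx)
    return w, b

# Synthetic, noise-free data so the true parameters are known
rng = np.random.default_rng(0)
X = rng.standard_normal((512, 3))
true_w = np.array([[1.0], [-2.0], [0.5]])
Y = np.dot(X, true_w) + 4.0

w, b = minibatch_gradient_descent(X, Y, batch_size=32, num_epochs=50, learning_rate=0.05)
print(w.ravel(), b)
```

Only `batch_size` rows are touched per update, so the memory needed for each step no longer depends on the size of the full training set.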