<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/pytorch/t81_558_class_03_2_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* **What is Linear Regression?**

Linear regression is a supervised learning algorithm that models the relationship between a dependent variable (y) and one or more independent variables (x). With a single independent variable, the model is the linear equation y = mx + b, where m is the slope of the line and b is the y-intercept; with several independent variables, each one gets its own coefficient.

* **Motivation for Linear Regression**

Linear regression is a simple and powerful model that can be used to solve a wide variety of problems. For example, it can be used to predict house prices, sales, or customer behavior.

* **Designing a Learning Algorithm**

The goal of linear regression is to find the values of m and b that minimize the error between the predicted values (ŷ) and the actual values (y), typically measured as the mean squared error. This can be done using a variety of optimization algorithms, such as gradient descent.

* **Linear Regression Algorithm**

The linear regression algorithm works by iteratively adjusting the values of m and b to minimize the error between the predicted values and the actual values. The algorithm starts with an initial guess for the values of m and b, and then it iteratively updates the values based on the gradient of the error function.

* **Gradient Descent**

Gradient descent is an optimization algorithm that can be used to find a minimum of a function. The algorithm works by iteratively stepping in the direction of steepest descent (the negative of the gradient) until it converges to a minimum, which may be a local rather than a global one.
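
As a minimal sketch of the idea, the following applies gradient descent to a one-variable function. The function f(w) = (w - 3)^2, the step size, and the helper name `gradient_descent` are illustrative assumptions, not part of the MPG example.

```python
def gradient_descent(grad, start, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(steps):
        w -= lr * grad(w)  # move in the direction of steepest descent
    return w

# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(f"Estimated minimum at w = {w_min:.6f}")
```

Each step shrinks the distance to the minimum by a constant factor here, because the function is a simple quadratic; less regular functions will not behave so neatly.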

* **Batch Gradient Descent**

Batch gradient descent is the basic version of gradient descent: it uses all of the training data to compute the gradient for each update of the model parameters.
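
A minimal sketch of batch gradient descent for the model y = mx + b, assuming a mean squared error objective. The toy data, learning rate, and function name are assumptions chosen for illustration.

```python
# Toy data generated from the noise-free line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * x + 1 for x in xs]

def batch_gradient_descent(xs, ys, lr=0.05, epochs=2000):
    m, b = 0.0, 0.0  # initial guess
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the MSE, averaged over ALL training examples.
        grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

m_fit, b_fit = batch_gradient_descent(xs, ys)
print(f"m = {m_fit:.4f}, b = {b_fit:.4f}")
```

Every update touches the whole dataset, so the cost of one step grows with the number of training examples.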

* **Stochastic Gradient Descent**

Stochastic gradient descent updates the model parameters using only a single training example at a time. Each update is much cheaper to compute than a full-batch update, but it is also noisier, so the path toward the minimum is less direct.
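
For comparison, here is a sketch of the stochastic variant for y = mx + b, where each update uses one example. The toy data, learning rate, seed, and function name are illustrative assumptions.

```python
import random

# Toy data generated from the noise-free line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * x + 1 for x in xs]

def stochastic_gradient_descent(xs, ys, lr=0.01, epochs=2000, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    m, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order each epoch
        for x, y in data:
            # Gradient of the squared error for this ONE example.
            error = (m * x + b) - y
            m -= lr * 2 * error * x
            b -= lr * 2 * error
    return m, b

m_sgd, b_sgd = stochastic_gradient_descent(xs, ys)
print(f"m = {m_sgd:.4f}, b = {b_sgd:.4f}")
```

With noisy real-world data, the parameters would typically keep jittering around the minimum unless the learning rate is reduced over time.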

* **Evaluating Linear Regression Models**

The performance of a linear regression model can be evaluated using a variety of metrics, such as the mean squared error (MSE) or the root mean squared error (RMSE).

MSE and RMSE are both metrics used to evaluate the performance of a regression model. They measure the difference between the predicted values and the actual values, and they are both calculated by squaring the errors.

The **mean squared error (MSE)** is the average of the squared errors. It is calculated as follows:

```
MSE = (1/n) * Σ(y_pred - y)^2
```

where:

* n is the number of data points
* y_pred is the predicted value for a data point
* y is the actual value for a data point

The **root mean squared error (RMSE)** is the square root of the MSE. It is calculated as follows:

```
RMSE = sqrt(MSE)
```
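
As a small worked example of the two formulas, the following computes MSE and RMSE by hand. The values in `y_true` and `y_pred` are made up for illustration.

```python
import math

def mse(y_pred, y_true):
    """Average of the squared differences between predictions and targets."""
    n = len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n

def rmse(y_pred, y_true):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(y_pred, y_true))

y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Errors are -1, 0, and 2, so the squared errors are 1, 0, and 4.
print(f"MSE:  {mse(y_pred, y_true):.4f}")
print(f"RMSE: {rmse(y_pred, y_true):.4f}")
```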

The RMSE is often preferred over the MSE because it is in the same units as the dependent variable. This makes it easier to interpret the RMSE and to compare it to other models.

For example, if the dependent variable is house prices, then the RMSE would be in dollars. This means that we can say that the model's typical prediction error is roughly $10,000, with larger errors weighted more heavily than smaller ones.

The MSE and RMSE are both measures of the accuracy of a regression model. Because both are built from squared errors, both are sensitive to outliers: a few data points with very large errors can dominate either metric. Since the RMSE is simply the square root of the MSE, the two always rank models in the same order.

In general, a lower MSE or RMSE indicates a better fit for the model. However, it is important to consider the units of the MSE or RMSE when interpreting the results.

Here is a table that summarizes the differences between MSE and RMSE:

| Metric | Formula | Units | Interpretation |
|---|---|---|---|
| Mean squared error (MSE) | (1/n) * Σ(y_pred - y)^2 | Square of the dependent variable's units | Average of the squared errors |
| Root mean squared error (RMSE) | sqrt(MSE) | Same as the dependent variable | Square root of the average of the squared errors |


In [1]:
import torch

# Make use of a GPU or MPS (Apple) if one is available.
device = "mps" if torch.backends.mps.is_available() \
    else "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cpu


## Simple PyTorch Regression: MPG

This example shows how to encode the MPG dataset for regression and predict values. We will see if we can predict the miles per gallon (MPG) for a car based on the car's weight, cylinders, engine size, and other features.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.autograd import Variable
from sklearn import preprocessing

# You will create a network class for every PyTorch neural network you create.
class Net(nn.Module):
    def __init__(self, in_count, out_count):
        super(Net, self).__init__()
        # We must define each of the layers.
        self.fc1 = nn.Linear(in_count, 50)
        self.fc2 = nn.Linear(50, 25)
        self.fc3 = nn.Linear(25, out_count)

    def forward(self, x):
        # In the forward pass, we must calculate all of the layers we
        # previously defined.
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

You define your neural network in the **Net** class above. The class name does not matter; however, it must subclass **nn.Module** and implement both the **__init__** and **forward** methods. The **__init__** method defines the layers of the neural network. In this case, we have a neural network with an input layer equal to the number of inputs you specify from the MPG dataset. The neural network connects these inputs to 50 neurons in the first hidden layer, which are connected to 25 neurons in the second hidden layer, which in turn feed the output layer. The output neuron count for a layer must always match the input count of the next layer.

The **forward** method links these layers together and also defines the transfer (activation) functions. For this book, we will generally use the ReLU activation function for hidden layers. The output layer will use no transfer function for a regression neural network like this MPG example. For classification, we use the logistic (sigmoid) function for binary classification (just two classes) or softmax for more than two classes.

For the neural network to perform correctly, everything must align. In the **__init__** method, each layer's output count must equal the input count of the layer that follows it. The **forward** method must then link all the layers together in the correct order.
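
One way to check this alignment is to pass a dummy batch through the layers. The sketch below uses **nn.Sequential** with the same 7-50-25-1 layout purely for brevity; the batch of zeros and the deliberately mismatched `bad` network are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Same layer sizes as the Net class: 7 inputs -> 50 -> 25 -> 1 output.
layers = nn.Sequential(
    nn.Linear(7, 50), nn.ReLU(),
    nn.Linear(50, 25), nn.ReLU(),
    nn.Linear(25, 1),
)

batch = torch.zeros(4, 7)  # 4 hypothetical rows with 7 features each
out = layers(batch)
print(out.shape)  # one output value per row

# A deliberately misaligned network: 50 outputs feed a layer expecting 40.
bad = nn.Sequential(nn.Linear(7, 50), nn.Linear(40, 1))
mismatch_raised = False
try:
    bad(batch)
except RuntimeError as err:
    mismatch_raised = True
    print("Shape mismatch:", err)
```

PyTorch only reports the mismatch when data flows through the layers, so a quick forward pass like this can catch the problem before training starts.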

We will begin by reading the MPG dataset.

In [3]:
# Read the MPG dataset.
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA', '?'])

cars = df['name']

# Handle missing value
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].median())

# Pandas to Numpy
x = df[['cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin']].values
y = df['mpg'].values # regression

# Numpy to PyTorch
x = torch.tensor(x,device=device,dtype=torch.float32)
y = torch.tensor(y,device=device,dtype=torch.float32)

We use Pandas to load the CSV file, as previously demonstrated. We will save the names of the cars, though the car names do not help predict the MPG. Horsepower does have missing values, so we substitute the median value for any missing values. Next, we select only the fields that we wish to use as predictors and convert from Pandas to NumPy, and from NumPy to PyTorch. As previously discussed, the Net class accepts the number of input columns as a constructor argument, so it creates the appropriate count of input neurons for this data.

With the data loaded, we are ready to create the neural network, loss function, and optimizer.

In [5]:
x.shape

torch.Size([398, 7])

In [6]:
# Define the neural network
model = Net(x.shape[1],1).to(device)

# Define the loss function for regression
loss_fn = nn.MSELoss()

# Define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

We create the neural network with an input count equal to the number of columns in the x-input data. We specify one output neuron, which will predict the MPG. Next, we define MSELoss as the error function, which is a common choice for regression. We will use the Adam optimizer with a learning rate of 0.01 to train the network. Adam is a common choice, and 0.01 is a good start for a learning rate. In practice, learning rates are almost always well below 1.0. Too large a learning rate can cause training to overshoot and fail to converge, and too small a learning rate will take a long time to train. We will see more advanced methods for choosing the learning rate, including schedules that change it throughout training.
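
The effect of the learning rate can be sketched on a toy problem. The example below uses plain gradient descent (not Adam) on f(w) = w^2; the function, the three learning rates, and the helper name `minimize` are assumptions for illustration, and the exact thresholds differ for every model and optimizer.

```python
def minimize(lr, steps=50, start=1.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        w -= lr * 2 * w
    return w

w_good = minimize(lr=0.1)     # converges close to the minimum at 0
w_slow = minimize(lr=0.001)   # moves toward 0, but only slightly
w_large = minimize(lr=1.1)    # every step overshoots, so |w| grows

for name, w in [("lr=0.1", w_good), ("lr=0.001", w_slow),
                ("lr=1.1", w_large)]:
    print(f"{name}: w = {w:.6g}")
```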

With the objects created, we can now train the neural network.

In [7]:
# Train for 1000 epochs.
for epoch in range(1000):
    optimizer.zero_grad()
    out = model(x).flatten()
    loss = loss_fn(out, y)
    loss.backward()
    optimizer.step()

    # Display status every 100 epochs.
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, loss: {loss.item()}")

Epoch 0, loss: 23346.607421875
Epoch 100, loss: 105.33187103271484
Epoch 200, loss: 59.72342300415039
Epoch 300, loss: 36.49307632446289
Epoch 400, loss: 27.242572784423828
Epoch 500, loss: 20.207651138305664
Epoch 600, loss: 15.593573570251465
Epoch 700, loss: 13.378717422485352
Epoch 800, loss: 12.471524238586426
Epoch 900, loss: 12.004287719726562


We now loop over 1,000 epochs and train the neural network; we define an epoch as one complete pass over the training set. At the start of each epoch we zero the gradients, because PyTorch accumulates gradients across calls to **backward**; without this step, gradients from the previous epoch would be added to the current ones. We present the entire training set to the model as one large batch. Later we will see more advanced ways to segment the data. We apply the loss function, use backpropagation to calculate the gradients, and then call the optimizer's **step** method to update the neural network weights.

## Regression Prediction

Next, we will perform actual predictions. The program assigns these predictions to the **pred** variable. These are all MPG predictions from the neural network. Notice that this is a 2D tensor. You can always see the dimensions of what PyTorch returns by printing out **pred.shape**. Neural networks can return multiple values per input, so the result always has one row per input and one column per output. Here the neural network only returns one value per prediction (there are 398 cars, so 398 predictions). However, a 2D tensor is still used because the neural network has the potential of returning more than one value.

In [8]:
pred = model(x)
print(f"Shape: {pred.shape}")
print(pred[0:10])

Shape: torch.Size([398, 1])
tensor([[16.3173],
        [15.6266],
        [16.9481],
        [17.1265],
        [16.4174],
        [11.4317],
        [11.5971],
        [11.5789],
        [11.4165],
        [14.3042]], grad_fn=<SliceBackward0>)
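
The extra dimension matters when predictions are compared with **y**, which is one-dimensional. As a minimal sketch, using small made-up tensors rather than the MPG data, the following shows why the model output is flattened before the error is computed: subtracting a [5] tensor from a [5, 1] tensor broadcasts to [5, 5] instead of pairing values element by element.

```python
import torch

# Hypothetical stand-ins for the model output and the targets.
pred_2d = torch.tensor([[1.0], [2.0], [3.0], [4.0], [5.0]])  # shape [5, 1]
target = torch.tensor([1.0, 2.0, 3.0, 4.0, 6.0])             # shape [5]

# Broadcasting pairs every prediction with every target.
broadcast_diff = pred_2d - target
print(broadcast_diff.shape)

# Flattening aligns prediction i with target i.
aligned_diff = pred_2d.flatten() - target
print(aligned_diff.shape)

# RMSE computed on the correctly aligned differences.
rmse = torch.sqrt(torch.mean(aligned_diff ** 2))
print(rmse.item())
```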


In [9]:
from sklearn import metrics

# Measure RMSE.  RMSE is common for regression.
score = np.sqrt(metrics.mean_squared_error(y.cpu().detach(),
  pred.flatten().cpu().detach()))
print(f"Final score (RMSE): {score}")

Final score (RMSE): 3.4179017543792725


In [10]:
score = torch.sqrt(torch.nn.functional.mse_loss(pred.flatten(),y))
print(f"Final score (RMSE): {score}")

Final score (RMSE): 3.4179017543792725


There are five main assumptions that need to be met for linear regression to be valid:

1. **Linearity:** The relationship between the independent and dependent variables must be linear. This means that the residuals (the difference between the predicted values and the actual values) should be randomly scattered around the line of best fit, and not have any discernible pattern.
2. **Homoscedasticity:** The variance of the residuals should be constant across all values of the independent variable. This means that the residuals should be equally spread out around the line of best fit, and not be clustered together in any particular area.
3. **Normality:** The residuals should be normally distributed. This means that the residuals should be bell-shaped, with most of the values clustered around the mean and fewer values at the extremes.
4. **Independence:** The residuals should be independent of each other. This means that the residuals for one data point should not be correlated with the residuals for any other data point.
5. **No multicollinearity:** The independent variables should not be highly correlated with each other. This means that the independent variables should not be too similar to each other, as this can make the estimated coefficients unstable.

If any of these assumptions are not met, then the results of the linear regression model may be unreliable.

Here are some ways to check for these assumptions:

* **Linearity:** You can plot the residuals against the predicted values to see if there is any discernible pattern. You can also use a statistical test, such as the Ramsey RESET test, to formally test for a misspecified (nonlinear) relationship.
* **Homoscedasticity:** You can plot the residuals against the independent variable to see if the variance of the residuals is constant. You can also use a statistical test, such as the Breusch-Pagan test, to formally test for homoscedasticity.
* **Normality:** You can plot the residuals on a histogram to see if they are normally distributed. You can also use a statistical test, such as the Shapiro-Wilk test, to formally test for normality.
* **Independence:** You can use a statistical test, such as the Durbin-Watson test, to formally test for independence.
* **Multicollinearity:** You can use a statistical measure, such as the variance inflation factor (VIF), to assess the level of multicollinearity in the model.
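
Two of these checks are simple enough to sketch directly from their standard definitions. The helper names `vif` and `durbin_watson` below are written for illustration (they are not calls into a statistics library), and the random data is synthetic.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = samples)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))  # VIF = 1 / (1 - R^2)
    return factors

def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.01 * rng.normal(size=200)  # nearly a copy of column a
vif_values = vif(np.column_stack([a, b, c]))
print("VIF:", [round(v, 2) for v in vif_values])
print("Durbin-Watson:", durbin_watson(rng.normal(size=200)))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity, and Durbin-Watson values near 2 as a sign of little autocorrelation in the residuals.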

If any of these assumptions are not met, then you may need to transform the data, remove outliers, or use a different regression model.