# Linear Regression in PyTorch

## What is linear regression?

Linear = our predictions are a **linear combination** of our inputs

Regression = we will learn the relationship that maps features to continuous-valued labels

$X$ is a matrix of training data. Each row represents a different training example (of which there are $m$). Each column represents a different feature (of which there are $n$). Hence $X$ has dimensions $m \times n$, i.e. $X \in  R^{m \times n}$.

$W$ is our matrix of weights, which controls how much each feature contributes to the hypothesis. If one particular weight equals 5, then changing its associated feature by 1 in the input space will change the output hypothesis by 5. $W \in R^{n \times 1}$.

$h$ is our hypothesis - our prediction of the mapping from input to output. In this example, our model will predict a single scalar output for each of our $m$ inputs - so $h \in R^{m\times 1}$.

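Each feature is multiplied by its weight and the products are summed. For the $i$-th example (the $i$-th row of $X$, with entries $x^{(i)}_1, \dots, x^{(i)}_n$):

## $ h^{(i)} = w_1 x^{(i)}_1 + w_2 x^{(i)}_2 + \dots + w_{n-1} x^{(i)}_{n-1} + w_n x^{(i)}_n, \quad \text{i.e.}\ h = X W$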

This linear combination is a **weighted sum of the input features**. As we vary the value of one feature, our hypothesis will change proportionately and linearly.
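As a minimal sketch of the shapes involved, the NumPy snippet below builds a toy $X$ and $W$ (the numbers are made up for illustration) and checks that raising one feature by 1 changes every prediction by that feature's weight.

```python
import numpy as np

# Hypothetical toy data: m = 3 examples, n = 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # shape (m, n) = (3, 2)
W = np.array([[5.0],
              [-1.0]])              # shape (n, 1) = (2, 1)

h = X @ W                           # shape (m, 1): one prediction per example

# Increase the first feature of every example by 1
X_shifted = X.copy()
X_shifted[:, 0] += 1.0
h_shifted = X_shifted @ W

print(h.shape)                      # (3, 1)
print((h_shifted - h).ravel())      # every prediction changes by w_1 = 5
```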

Imagine that we are trying to predict house price. Consider:
- The weight associated with the number of rooms should be large and positive, because each extra room adds a lot to the price of a house.
- The weight associated with the age of the house may be negative, if the training data shows that older houses tend to be worth less.
- The weight associated with a feature that is the age of the person last living there should be zero, because the house price is independent of this feature. It does not contribute at all to the house price.
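The three cases above can be sketched with hand-picked weights. The weights and houses below are hypothetical values chosen only for illustration, not learned from any data.

```python
import numpy as np

# Hypothetical weights, chosen only for illustration (not learned from data)
# feature order: [number of rooms, age of house, age of previous occupant]
w = np.array([20000.0, -500.0, 0.0])

house_a = np.array([4.0, 10.0, 35.0])
house_b = np.array([4.0, 10.0, 80.0])   # same house, older previous occupant
house_c = np.array([5.0, 10.0, 35.0])   # one more room than house_a

price_a = house_a @ w
price_b = house_b @ w
price_c = house_c @ w

print(price_a == price_b)    # True: a zero weight means the feature is ignored
print(price_c - price_a)     # 20000.0: one extra room adds its weight
```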

## Cost functions

For our algorithms to learn, we need a way to evaluate their current performance, so that we can determine how to improve. We can mathematically define when our algorithm is performing well by evaluating an appropriate objective function. We usually try to minimise a function which indicates the error in our hypothesis. In this case, we will use the mean squared error (MSE) between our predictions and labels as our cost function. Because the hypothesis depends on the weights, the cost is also a function of the weights.

## $ MSE\ Loss,\ J = \frac{1}{2m} \sum_{i=1}^{m} (h^{(i)} - y^{(i)})^2$

The cost function has as many dimensions as we have parameters. Changing these parameters moves us around parameter space, in which the cost varies. Varying different parameters will have varying influence on how the cost changes - as such, some are more important to optimise.
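To make the dependence of the cost on the weights concrete, here is a small sketch that evaluates $J$ as defined above for a few candidate weights on made-up data. The helper name `cost` is our own. Note that `torch.nn.MSELoss`, used later, averages the squared errors without the factor of $\frac{1}{2}$; this rescales the cost but does not move its minimum.

```python
import numpy as np

def cost(X, W, y):
    """Half mean squared error, J = 1/(2m) * sum((h - y)^2)."""
    m = X.shape[0]
    h = X @ W
    return float(np.sum((h - y) ** 2) / (2 * m))

X = np.array([[1.0], [2.0], [3.0]])   # toy data: a single feature
y = np.array([[2.0], [4.0], [6.0]])   # labels generated by y = 2x

# The cost is lowest (zero) at w = 2 and grows as w moves away from it
for w in [0.0, 1.0, 2.0, 3.0]:
    print(w, cost(X, np.array([[w]]), y))
```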

(See cost functions notebook for more detail)

## Optimization
We optimize this model using the gradient descent algorithm: we iteratively calculate the derivative of our cost w.r.t. our parameters and use it to update our weights in a direction which reduces the cost.
- The gradient points in the direction in which the cost increases fastest, so we move each weight in the opposite direction (the negative of the gradient) to reduce the MSE loss.
- The gradient gives the direction of the step, but not how far to move.
- The learning rate $\alpha$ scales the step size. Too large a value can make training diverge; a suitable value lets the model converge to the optimum.
- Gradient descent converges to a local minimum. For linear regression with an MSE cost the cost surface is convex, so this is also the global minimum. A good learning rate is usually found empirically.
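The update rule is $W \leftarrow W - \alpha \nabla_W J$, and for the cost $J$ defined above the gradient is $\nabla_W J = \frac{1}{m} X^T (XW - y)$. The following is a minimal NumPy sketch of plain gradient descent on synthetic data. The data, the learning rate and the number of steps are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3*x1 - 2*x2 + small noise
m, n = 200, 2
X = rng.normal(size=(m, n))
true_W = np.array([[3.0], [-2.0]])
y = X @ true_W + 0.01 * rng.normal(size=(m, 1))

W = np.zeros((n, 1))      # initial weights
alpha = 0.1               # learning rate

costs = []
for step in range(200):
    h = X @ W                                  # hypothesis
    costs.append(float(np.sum((h - y) ** 2) / (2 * m)))
    grad = X.T @ (h - y) / m                   # dJ/dW for J = 1/(2m) * sum((h - y)^2)
    W = W - alpha * grad                       # step against the gradient

print(W.ravel())            # close to [3, -2]
print(costs[0], costs[-1])  # the cost falls as training proceeds
```

Setting `alpha` much larger (for example 3.0) makes the cost grow instead of shrink, which is the divergence the learning rate is meant to prevent.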


## Implementation
Firstly we will import some functionality

In [0]:
# import functionality from these libraries
%matplotlib notebook
import numpy as np      # for efficient numerical computation
import torch            # for building computational graphs
from torch.autograd import Variable     # for automatically computing gradients of our cost with respect to what we want to optimise
import matplotlib.pyplot as plt     # for plotting absolutely anything
from mpl_toolkits.mplot3d import Axes3D # for plotting 3D graphs
import pandas as pd #allows us to easily import any data

In [0]:
# http://pytorch.org/
from os import path
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())

accelerator = 'cu80' if path.exists('/opt/bin/nvidia-smi') else 'cpu'

!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.3.0.post4-{platform}-linux_x86_64.whl torchvision
import torch

We will import our data into a pandas data frame and shuffle it

In [9]:
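df = pd.read_csv('airfoil_self_noise.dat', sep='\t') #import our dataset into a pandas dataframe
df = df.sample(frac=1) #shuffle our dataset
print(df.head())
#dataset from the UCI Machine Learning Repository
#shuffling gives a random mix of the data so that it is no longer in its original order
#only the rows are shuffled; the columns stay in place
#note: the file has no header row, so read_csv uses the first sample as column names (see the output below); passing header=None would avoid this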

        800    0  0.3048  71.3  0.00266337  126.201
135    2500  3.0  0.3048  39.6    0.004957  120.162
1336   2000  3.3  0.1016  31.7    0.002514  131.642
21     3150  0.0  0.3048  55.5    0.002831  123.236
1251  10000  0.0  0.1016  71.3    0.001211  123.305
160     800  4.0  0.3048  71.3    0.004978  131.755
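The first line of the output above shows data values where column names would normally appear, because the file has no header row. The sketch below reproduces this on three rows in the same layout held in an in-memory string (rather than the real file) and shows how `header=None` keeps every row as data. It also confirms that `sample(frac=1)` only reorders rows. The `random_state` argument is added here purely to make the sketch repeatable.

```python
import io
import pandas as pd

# Three tab-separated rows in the same layout as the dataset (no header row)
raw = ("800\t0\t0.3048\t71.3\t0.00266337\t126.201\n"
       "1000\t0\t0.3048\t71.3\t0.00266337\t125.201\n"
       "1250\t0\t0.3048\t71.3\t0.00266337\t125.951\n")

with_default = pd.read_csv(io.StringIO(raw), sep='\t')                  # first row becomes the header
with_no_header = pd.read_csv(io.StringIO(raw), sep='\t', header=None)   # every row is kept as data

print(len(with_default), len(with_no_header))   # 2 3

shuffled = with_no_header.sample(frac=1, random_state=0)   # reorder the rows only
print(sorted(shuffled.index) == [0, 1, 2])      # True: same rows, new order
```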


In [4]:
from google.colab import files 
files.upload()

Saving airfoil_self_noise.dat to airfoil_self_noise.dat



Convert the datapoints into torch tensors and split into training and test sets.

In [0]:
#convert our data into torch tensors
X = torch.Tensor(np.array(df[df.columns[0:-1]])) #features: every column except the last, converted from numpy to torch
Y = torch.Tensor(np.array(df[df.columns[-1]]).squeeze()) #label: df[df.columns[-1]] selects only the final column
#squeeze() removes redundant dimensions, leaving a 1-D vector

m = 1100 #size of training set (out of roughly 1500 rows in the whole dataset)

#split our data into training and test set
#training set
x_train = Variable(X[0:m]) #wrap the tensor in a Variable
y_train = Variable(Y[0:m]).view(-1, 1) #reshape into a single-column output

#test set
x_test = Variable(X[m:]) #from row m to the end, i.e. the rest of the data
y_test = Variable(Y[m:]).view(-1, 1) #wrapped in a Variable like y_train so it can be passed to the cost function
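The features printed earlier span very different scales (values in the thousands next to values around 0.003), which tends to make gradient-based training harder. One common remedy, which this notebook does not apply, is to standardise each feature using statistics from the training set only. The sketch below shows this on a small made-up matrix; the names `X_train_norm` and `X_test_norm` are our own.

```python
import numpy as np

# Hypothetical feature matrix with very different column scales
X = np.array([[ 800.0, 0.0027],
              [2500.0, 0.0050],
              [2000.0, 0.0025],
              [3150.0, 0.0028]])
m = 3                                   # the first m rows form the training set

X_train, X_test = X[:m], X[m:]

mean = X_train.mean(axis=0)             # statistics come from the training set only
std = X_train.std(axis=0)

X_train_norm = (X_train - mean) / std   # zero mean, unit variance per feature
X_test_norm = (X_test - mean) / std     # reuse the training statistics on the test set

print(X_train_norm.mean(axis=0))        # approximately [0, 0]
print(X_train_norm.std(axis=0))         # approximately [1, 1]
```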


In [0]:
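#the dataset is not normalized (the features have a large spread of scales)
no_epochs = 100 #same value as in the hyper-parameter cell below; adjust it until the cost goes down

#evaluate the model on the test set
def test():
  h = mymodel.forward(x_test) #predictions for the test inputs
  cost = criterion(h, y_test) #cost between the predictions and the test labels
  
  return cost.data[0] #index (not call) the data tensor to get the scalar cost

#run this only once mymodel and criterion have been defined in the cells below
test_cost = test()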

Define the model class which we will use to instantiate our model.

In [0]:
#define model class - inherit useful functions and attributes from torch.nn.Module
#inheritance: a class can inherit from another class, so it reuses the parent's
#methods and attributes instead of redefining them
class linearmodel(torch.nn.Module):
    def __init__(self):
        super().__init__() #call parent class initializer --> super() refers to torch.nn.Module
        self.linear = torch.nn.Linear(5, 1) #define linear combination function with 5 inputs and 1 output

    def forward(self, x):
        x = self.linear(x) #linearly combine our inputs to give 1 outputs
        return x
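To see what `torch.nn.Linear(5, 1)` computes, the sketch below compares its output with the same calculation written by hand. It assumes a recent PyTorch release (0.4 or later) rather than the 0.3.0 build installed above, and the input batch is random. Note that the layer also learns a bias term $b$, so the model is $h = XW + b$.

```python
import torch

torch.manual_seed(0)                      # reproducible random initial weights

layer = torch.nn.Linear(5, 1)             # 5 input features, 1 output
x = torch.randn(4, 5)                     # hypothetical batch of 4 examples

with torch.no_grad():                     # no gradients are needed for this check
    out = layer(x)                                  # shape (4, 1)
    manual = x @ layer.weight.t() + layer.bias      # the same computation by hand

print(layer.weight.shape, layer.bias.shape)   # torch.Size([1, 5]) torch.Size([1])
print(torch.allclose(out, manual))            # the two agree up to floating-point error
```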

Define the necessary hyper-parameters, instantiate the model from the class, cost function and optimizer.

In [10]:
#how many epochs we will train the model for
no_epochs = 100
lr = 0.03 #learning rate

#create our model from defined class
mymodel = linearmodel()
criterion = torch.nn.MSELoss() #mean squared error cost function, as this is a regression problem
#we use the MSE implementation provided by torch

#Adam is a gradient-based optimization method that also uses information from past gradients
optimizer = torch.optim.Adam(mymodel.parameters(), lr = lr) #define our optimizer
#torch.nn.Module provides the .parameters() method, which returns the model's weights
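As a rough illustration of how Adam uses past gradients, here is a NumPy sketch of the standard Adam update applied to a one-dimensional toy cost. The function `adam_step` is our own helper, not part of torch. The `beta1`, `beta2` and `eps` values are Adam's usual defaults, and `lr` matches the value chosen above.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.03, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new weights and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of past gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimise the toy cost J(w) = (w - 4)^2, whose gradient is 2 * (w - 4)
w = np.array([0.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 1001):
    grad = 2 * (w - 4.0)
    w, m, v = adam_step(w, grad, m, v, t)

print(w)   # close to 4
```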


Create the axes which we will use to plot our costs each epoch. Define the training loop and train.

In [0]:
#for plotting costs
costs=[]
plt.ion()
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlabel('Epoch')
ax.set_ylabel('Cost')
ax.set_xlim(0, no_epochs-1)
plt.show()
#the cost should decrease with the number of epochs

#training loop - same as last time
#calculate the hypothesis, then the cost, then the derivatives, then update the weights

def train(no_epochs):
    for epoch in range(no_epochs):
        h = mymodel.forward(x_train) #forward propagate - calculate our hypothesis

        #calculate, plot and print cost
        cost = criterion(h, y_train)
        costs.append(cost.data[0])
        ax.plot(costs, 'b')
        fig.canvas.draw()
        print('Epoch ', epoch, ' Cost: ', cost.data[0])

        #cost.data[0] takes the 0th element of the cost's data tensor, i.e. the scalar cost value
        
        #calculate gradients + update weights using gradient descent step with our optimizer
        optimizer.zero_grad() #clear the gradients from previous iterations before running cost.backward()
        cost.backward() #calculate the derivatives (run back through the graph using the chain rule)
        optimizer.step()

train(no_epochs)
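The loop above targets the PyTorch 0.3.0 build installed earlier. In PyTorch 0.4 and later, `Variable` was merged into `Tensor`, and `cost.item()` replaces `cost.data[0]` for reading a scalar cost. The sketch below shows the same training steps in that newer style. It runs on synthetic data standing in for the airfoil dataset, and the number of epochs is an arbitrary choice.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for the dataset (an assumption): 5 features, 1 label
X = torch.randn(200, 5)
true_w = torch.tensor([[1.0], [-2.0], [0.5], [3.0], [0.0]])
y = X @ true_w + 0.01 * torch.randn(200, 1)

model = torch.nn.Linear(5, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.03)

costs = []
for epoch in range(300):
    h = model(X)                  # forward pass (calling the module invokes forward)
    cost = criterion(h, y)
    costs.append(cost.item())     # .item() extracts a Python float from a 0-dim tensor

    optimizer.zero_grad()         # clear gradients left over from the previous step
    cost.backward()               # backpropagate to fill .grad on each parameter
    optimizer.step()              # update the weights

print(costs[0], costs[-1])        # the cost should fall substantially
```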