### In this lab, we will implement linear regression using the least-squares solution. We will use the same example as in class (Slide 18 of the linear regression slides). There are 5 steps. Let's implement them step by step using only numpy.

![linreg_steps](linreg_slide_18.png)

In [1]:
import numpy as np 

We are given the dataset {(0,0), (0,1), (1,0)} of (x, y) pairs and asked to find the \
least-squares solution for the parameters in the regression of \
the function: y = w1 + w2*x^2 (every x here is 0 or 1, so x^2 = x)
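
For reference, the five steps below evaluate the usual closed-form least-squares expression. The symbols X (one row of basis values per data point) and t (the column of targets) are notation chosen here for illustration and may differ from the slide:

$$
\mathbf{w} = (X^{T}X)^{-1}X^{T}\mathbf{t},
\qquad
X = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 1 \end{pmatrix},
\qquad
\mathbf{t} = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}
$$

Step 1 builds X, steps 2 and 3 compute the inverse of X^T X, step 4 multiplies it by X^T to obtain the pseudo-inverse, and step 5 applies the pseudo-inverse to t.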

In [2]:
# creating the input and target in numpy arrays 
inputs = np.array([[0], [0], [1]])
targets = np.array([[0], [1], [0]])
print('inputs shape :',np.shape(inputs))
print('targets shape :',np.shape(targets))

inputs shape : (3, 1)
targets shape : (3, 1)


In [3]:
# now let's do the steps to find the solution  
# Step 1: evaluate the basis on the points
inputs = np.concatenate((np.ones((np.shape(inputs)[0],1)),inputs),axis=1)
print('inputs shape :',np.shape(inputs))
print(inputs)

inputs shape : (3, 2)
[[1. 0.]
 [1. 0.]
 [1. 1.]]


In [4]:
# step 2: compute -> transpose(inputs) * inputs 
q_matrix = np.dot(np.transpose(inputs),inputs)
print('q_matrix shape :',np.shape(q_matrix))
print(q_matrix)

q_matrix shape : (2, 2)
[[3. 1.]
 [1. 1.]]


In [5]:
# step 3: invert q_matrix
q_inverse = np.linalg.inv(q_matrix)
print('q_inverse shape :',np.shape(q_inverse))
print(q_inverse)

q_inverse shape : (2, 2)
[[ 0.5 -0.5]
 [-0.5  1.5]]


In [6]:
# step 4: Compute the pseudo-inverse -> q_inverse * transpose(inputs)
q_pseudo = np.dot(q_inverse,np.transpose(inputs))
print('q_pseudo shape :',np.shape(q_pseudo))
print(q_pseudo.astype(np.float16))

q_pseudo shape : (2, 3)
[[ 0.5  0.5  0. ]
 [-0.5 -0.5  1. ]]


In [7]:
# step 5: compute w = q_pseudo * targets
weights = np.dot(q_pseudo,targets)
print('w shape :',np.shape(weights))
print(weights)

w shape : (2, 1)
[[ 0.5]
 [-0.5]]
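
As a sanity check on the five steps, the same weights can be obtained from NumPy's built-in routines. This is a minimal sketch, not part of the slide: `np.linalg.pinv` and `np.linalg.lstsq` are real NumPy functions, while the variable names `X`, `t`, `w_pinv`, `w_lstsq` and `predictions` are chosen here for illustration.

```python
import numpy as np

# Basis matrix from step 1 (column of ones, then the inputs) and the targets.
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([[0.0], [1.0], [0.0]])

# Steps 2-4 in a single call: the Moore-Penrose pseudo-inverse of X.
w_pinv = np.dot(np.linalg.pinv(X), t)

# Alternative: let NumPy solve the least-squares problem directly.
w_lstsq = np.linalg.lstsq(X, t, rcond=None)[0]

# Fitted values at the three data points.
predictions = np.dot(X, w_pinv)

print(np.allclose(w_pinv, w_lstsq))
print(predictions)
```

Both routes should agree with the hand-built result, and the fitted line passes halfway between the two targets observed at x = 0.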


#### Now, let's implement the same steps on a real dataset. We will work on the auto-mpg dataset. It consists of a number of datapoints about certain cars (weight, horsepower, etc.), and the aim is to predict the fuel efficiency in miles per gallon (mpg) for each car.

In [None]:
"""
You are asked to
    - load the dataset text file (auto-mpg.txt) as numpy array 
    - preprocess the dataset (normalise, split it into train and test sets)
    - find the least-squares solution for the parameters (weights vector)
    - test the found parameters on the test set and calculate the error

The following comments and code are meant to guide you. 
"""

In [None]:
"""
Please note: This dataset has one problem. There are missing values 
in it (labelled with question marks ‘?’). The np.loadtxt() method can’t
parse these, and we don’t have a good way to fill them in anyway, so manually
edit the file and delete every line that contains a ‘?’. The linear
regressor can’t do much with the names of the cars either, but since they 
appear in double quotes (") we will tell np.loadtxt that they are comments.


Below are the attribute Information for the dataset:

    1. mpg:           continuous 
    2. cylinders:     multi-valued discrete
    3. displacement:  continuous
    4. horsepower:    continuous
    5. weight:        continuous
    6. acceleration:  continuous
    7. model year:    multi-valued discrete
    8. origin:        multi-valued discrete
    9. car name:      string (unique for each instance)

Please note: the first column is our target (mpg)
"""
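
The loading advice above can be tried on a few in-memory lines before touching the real file. This is a sketch under stated assumptions: the first two sample rows follow the layout of the dataset, the third row and its car name are hypothetical, and filtering out the `?` lines in Python is an alternative to editing the file by hand, not the method the lab prescribes.

```python
import numpy as np

# Sample lines in the file's layout: whitespace-separated numbers, then a tab
# and the quoted car name. The third row is a hypothetical example of a line
# with a missing value.
raw = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '15.0   8   350.0      165.0      3693.      11.5   70  1\t"buick skylark 320"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"hypothetical car"\n'
)

# Keep only the lines without a '?' (same effect as deleting them manually).
clean_lines = [line for line in raw.splitlines() if '?' not in line]

# np.loadtxt accepts a list of strings. comments='"' makes it ignore everything
# from the first double quote onwards, which drops the car name.
data = np.loadtxt(clean_lines, comments='"')
print(np.shape(data))
```

For the real file, the same list comprehension could run over the lines of an opened file object instead of `raw.splitlines()`.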

In [34]:
# Load the dataset file using np.loadtxt().
# Columns are separated by runs of whitespace (the np.loadtxt default), and
# comments='"' makes np.loadtxt ignore everything from the first double quote
# onwards, which drops the car name at the end of each line, e.g.
#   18.0  8  307.0  130.0  3504.  12.0  70  1  "chevrolet chevelle malibu"
# This assumes the lines containing '?' have already been deleted by hand.
data = np.loadtxt("auto-mpg.txt", comments='"')
print('data shape :', np.shape(data))



In [25]:
# TODO: Normalise the dataset. You can do this easily in numpy 
# by using np.mean and np.var. The only place where care is needed 
# is along which axis the mean and variance are computed: 
# axis=0 sums down the columns and axis=1 sums across the rows.

normalised_data = None
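
One possible way to normalise column by column is sketched below on a small made-up array. The values in `data` are placeholders, and dividing by the standard deviation (the square root of `np.var`) is an assumption made here; dividing by the variance itself is another convention you may meet.

```python
import numpy as np

# Made-up rows standing in for the loaded dataset (one car per row).
data = np.array([[18.0, 8.0, 307.0],
                 [15.0, 8.0, 350.0],
                 [24.0, 4.0, 113.0],
                 [27.0, 4.0, 97.0]])

# axis=0 gives one mean and one variance per column (per attribute).
mean = np.mean(data, axis=0)
std = np.sqrt(np.var(data, axis=0))

# Broadcasting subtracts each column's mean and divides by its spread.
normalised_data = (data - mean) / std

print(np.mean(normalised_data, axis=0))
print(np.var(normalised_data, axis=0))
```

After this step every column should have a mean close to zero and a variance close to one.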

In [None]:
# TODO: Now separate the data into training and testing sets.

training, testing = None, None 

# Then split each set into inputs and targets (hint: slice the array).
trainin, traintgt = None, None
testin, testtgt = None, None
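
The slicing hint can be illustrated on a placeholder array. This is only a sketch: the alternating even/odd row split is one choice among many (a random split would also work), and `data` here is dummy content with the target in column 0, as in the dataset description.

```python
import numpy as np

# Placeholder data: 10 rows, 4 columns, column 0 plays the role of the target.
data = np.arange(40, dtype=float).reshape(10, 4)

# Even-numbered rows go to training, odd-numbered rows to testing.
training = data[::2, :]
testing = data[1::2, :]

# Column 0 is the target; slicing with :1 keeps it as a column vector.
trainin, traintgt = training[:, 1:], training[:, :1]
testin, testtgt = testing[:, 1:], testing[:, :1]

print(np.shape(trainin), np.shape(traintgt))
print(np.shape(testin), np.shape(testtgt))
```

Keeping the targets two-dimensional (shape `(n, 1)`) matches the shapes used in the five-step walkthrough above.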

In [None]:
# TODO: Use the training set to find the weights vector.
# you need to implement the previous 5 steps on the training set 
# and find the weights vector (this is called training).  
# To make it simple, we define a function that takes 
# two args, inputs and targets, and returns the weights vector.

def linreg(inputs,targets):
    # you should implement the 5 steps here
    
    weights = None

    
    return weights
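
One possible way to fill in the function is to copy the five steps from the toy example into its body. The sketch below is an illustration rather than the reference solution, and it checks itself on noise-free synthetic data; `demo_in`, `demo_tgt` and `true_w` are made-up names and values.

```python
import numpy as np

def linreg(inputs, targets):
    # Step 1: add the column of ones (bias term) to the inputs.
    inputs = np.concatenate((np.ones((np.shape(inputs)[0], 1)), inputs), axis=1)
    # Step 2: transpose(inputs) * inputs.
    q_matrix = np.dot(np.transpose(inputs), inputs)
    # Step 3: invert it.
    q_inverse = np.linalg.inv(q_matrix)
    # Step 4: pseudo-inverse.
    q_pseudo = np.dot(q_inverse, np.transpose(inputs))
    # Step 5: weights (bias first, then one weight per input column).
    weights = np.dot(q_pseudo, targets)
    return weights

# Synthetic check: targets generated from known weights with no noise.
rng = np.random.default_rng(0)
demo_in = rng.normal(size=(20, 2))
true_w = np.array([[1.0], [2.0], [-3.0]])
demo_tgt = true_w[0] + np.dot(demo_in, true_w[1:])

weights = linreg(demo_in, demo_tgt)
print(np.allclose(weights, true_w))
```

Because the synthetic targets are an exact linear function of the inputs, the recovered weights should match `true_w` up to rounding error.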



In [None]:
# test your implementation 
weights = linreg(trainin,traintgt) 
weights

In [None]:
# TODO: Test the found weights on the testing set.
# You can do this by:
#     - adding the column of ones to testin (as in step 1)
#     - testout = np.dot(testin, weights)
#     - error = np.sum((testout - testtgt)**2)

testout = None
error = None 
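
The testing step can be sketched on a tiny made-up example. The weights below are the ones found in the toy walkthrough; `testin` and `testtgt` are placeholder values, and `testin_b` is a name introduced here for the test inputs with the bias column added.

```python
import numpy as np

weights = np.array([[0.5], [-0.5]])        # bias first, as in the walkthrough
testin = np.array([[0.0], [1.0], [2.0]])   # made-up test inputs
testtgt = np.array([[0.0], [0.0], [0.0]])  # made-up test targets

# The weights include a bias term, so the test inputs need the same
# column of ones that was added to the training inputs.
testin_b = np.concatenate((np.ones((np.shape(testin)[0], 1)), testin), axis=1)

# Matrix product, not elementwise *, gives one prediction per test row.
testout = np.dot(testin_b, weights)

# Sum-of-squares error over the test set.
error = np.sum((testout - testtgt) ** 2)
print(error)
```

Here the predictions are 0.5, 0.0 and -0.5, so the sum-of-squares error is 0.5.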


In [None]:
"""
    You can try to re-train the model without normalising the data 
    and see if this makes any difference to the error value.
"""