# Module 2.5 Linear regression

## Setup

### Imports

In [147]:
import numpy as np
import pandas as pd

### Load data

In [148]:
# Load training data
df_train = pd.read_csv('02-regression/notebooks/data_train.csv')
print(df_train.shape)
df_train.head()

(7150, 15)


| | make | model | year | engine_fuel_type | engine_hp | engine_cylinders | transmission_type | driven_wheels | number_of_doors | market_category | vehicle_size | vehicle_style | highway_mpg | city_mpg | popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gmc | envoy_xl | 2005 | regular_unleaded | 275.0 | 6.0 | automatic | rear_wheel_drive | 4.0 | | large | 4dr_suv | 18 | 13 | 549 |
| 1 | volkswagen | passat | 2016 | regular_unleaded | 170.0 | 4.0 | automatic | front_wheel_drive | 4.0 | | midsize | sedan | 38 | 25 | 873 |
| 2 | honda | odyssey | 2016 | regular_unleaded | 248.0 | 6.0 | automatic | front_wheel_drive | 4.0 | | large | passenger_minivan | 28 | 19 | 2202 |
| 3 | chevrolet | cruze | 2015 | regular_unleaded | 138.0 | 4.0 | manual | front_wheel_drive | 4.0 | | midsize | sedan | 36 | 25 | 1385 |
| 4 | volvo | 740 | 1991 | regular_unleaded | 162.0 | 4.0 | automatic | rear_wheel_drive | 4.0 | luxury,performance | midsize | sedan | 20 | 17 | 870 |


## Simple linear regression from scratch

### 2.5 Simple linear regression implementation

#### Notes
From https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/02-regression/05-linear-regression-simple.md

Linear regression is a model for solving regression tasks: the objective is to fit a line to the data and use it to make predictions for new values. The input of the model is the **feature matrix** `X`, and the output is a **vector of predictions** `y` that should be as close as possible to the **actual** `y` values. The linear regression formula is the sum of the bias term $w_0$, which is the prediction when no feature information is available, and each feature value multiplied by its corresponding weight: $x_{i1} \cdot w_1 + x_{i2} \cdot w_2 + ... + x_{in} \cdot w_n$.

The linear regression formula is therefore:

$g(x_i) = w_0 + x_{i1} \cdot w_1 + x_{i2} \cdot w_2 + ... + x_{in} \cdot w_n$

This can be written more compactly as:

$g(x_i) = w_0 + \displaystyle\sum_{j=1}^{n} w_j \cdot x_{ij}$

Here is a simple implementation of linear regression in Python:

~~~~python
w0 = 7.17  # bias term
w = [0.01, 0.04, 0.002]  # one weight per feature


def linear_regression(xi):
    n = len(xi)

    pred = w0  # start from the bias term
    for j in range(n):
        pred = pred + w[j] * xi[j]  # add each weighted feature
    return pred
~~~~

The $\displaystyle\sum_{j=1}^{n} w_j \cdot x_{ij}$ part of the equation above is simply a vector-vector multiplication (a dot product) between the features and the weights. Hence, we can rewrite the equation as $g(x_i) = w_0 + x_i^T \cdot w$.

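The predictions are on a transformed scale, so we need to ensure that the final result is shown on the original scale by applying the inverse function. The cells below use `np.expm1()`, which is the inverse of `np.log1p()`.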
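
The sketch below illustrates the round trip between the two scales. It assumes the target was transformed with `np.log1p()`, whose inverse is `np.expm1()` (the function used in the cells that follow). The price values are made up for illustration only.

```python
import numpy as np

# Hypothetical prices, used only to illustrate the transform
prices = np.array([0.0, 1_000.0, 25_000.0, 723_603.32])

log_prices = np.log1p(prices)     # forward transform: log(1 + price)
recovered = np.expm1(log_prices)  # inverse transform: exp(value) - 1

# The round trip returns the original values up to floating-point error
print(np.allclose(recovered, prices))
```

Using the `log1p`/`expm1` pair rather than `log`/`exp` keeps a price of zero well defined, since `log1p(0)` is `0`.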

#### Implementation

In [149]:
# pick an example car (row 10 of the training data)
df_train.iloc[10]

make                            toyota
model                            yaris
year                              2015
engine_fuel_type      regular_unleaded
engine_hp                        106.0
engine_cylinders                   4.0
transmission_type               manual
driven_wheels        front_wheel_drive
number_of_doors                    2.0
market_category              hatchback
vehicle_size                   compact
vehicle_style            2dr_hatchback
highway_mpg                         37
city_mpg                            30
popularity                        2031
Name: 10, dtype: object

In [150]:
xi = [106, 30, 2031]  # example features: engine_hp, city_mpg, popularity

In [151]:
w0 = 7.17  # bias term
w = [0.01, 0.04, 0.002]  # weights

In [152]:
def linear_regression(xi):
    """Predict the target value using linear regression given a single row of features."""
    n = len(xi)

    pred = w0

    for j in range(n):
        pred = pred + w[j] * xi[j]

    return pred

In [153]:
prediction = linear_regression(xi)  # make prediction
print(prediction)

13.492
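
As a check by hand, substituting the bias `w0 = 7.17`, the weights `w = [0.01, 0.04, 0.002]`, and the features `xi = [106, 30, 2031]` into the formula gives the same value:

$g(x_i) = 7.17 + 0.01 \cdot 106 + 0.04 \cdot 30 + 0.002 \cdot 2031 = 7.17 + 1.06 + 1.20 + 4.062 = 13.492$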


In [154]:
price_simple = np.expm1(prediction).round(2)  # convert back to original scale
print(price_simple)

723603.32


### 2.6 Linear regression vector form

#### Notes

From https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/02-regression/06-linear-regression-vector.md

The linear regression formula can be written compactly as the dot product between features and weights. To do so, the feature vector is extended with a constant $x_{i0} = 1$ that multiplies the *bias* term, so that $w_0 = w_0 \cdot x_{i0}$ becomes part of the dot product.

When all records are stacked into a ***feature matrix***, the predictions for every record are obtained at once as the matrix-vector product between the ***feature matrix*** and the ***vector of weights***, which yields the `y` vector of predictions.
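
A minimal NumPy sketch of folding the bias into the dot product for a single record is shown below. It reuses the bias, weights, and feature values from this notebook; the names `xi_ext` and `w_ext` are introduced here only for illustration.

```python
import numpy as np

w0 = 7.17                          # bias term (value used in this notebook)
w = np.array([0.01, 0.04, 0.002])  # weights
xi = np.array([106, 30, 2031])     # engine_hp, city_mpg, popularity

# Explicit bias plus dot product
pred_explicit = w0 + xi.dot(w)

# Fold the bias into the dot product by prepending x_i0 = 1 and w_0
xi_ext = np.concatenate(([1], xi))
w_ext = np.concatenate(([w0], w))
pred_folded = xi_ext.dot(w_ext)

# Both formulations give the same prediction
print(np.isclose(pred_explicit, pred_folded))
```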

#### Implementation

In [155]:
def dot(xi, w):
    """Compute the dot product between two vectors."""
    n = len(xi)

    res = 0.0

    for j in range(n):
        res = res + xi[j] * w[j]

    return res

In [156]:
def linear_regression_vector_form(xi):
    """Predict the target value using linear regression given a single row of features."""
    return w0 + dot(xi, w)

In [157]:
prediction = linear_regression_vector_form(xi)  # make prediction
print(prediction)

13.492


In [158]:
price_vector = np.expm1(prediction).round(2)  # convert back to original scale
print(price_vector)
# verify both methods give the same result
assert np.allclose(price_simple, price_vector)

723603.32


We can simplify further so that the entire calculation is a dot product.

In [159]:
# The following syntax will concatenate lists in Python
w_new = [w0] + w  # add bias term to weights
w_new

[7.17, 0.01, 0.04, 0.002]

In [160]:
def linear_regression_vector_form_dot(xi):
    """Predict the target value using linear regression given a single row of features."""
    xi_new = [1] + xi  # add bias term to features
    return dot(xi_new, w_new)

In [161]:
# Create a few fake cars; the leading 1 in each row multiplies the bias term
x1 = [1, 148, 24, 1385]
x2 = [1, 132, 25, 2031]
x10 = [1, 453, 11, 86]

# Create a matrix
X = [x1, x2, x10]  # list of lists
X = np.array(X)  # convert to numpy array
X

array([[   1,  148,   24, 1385],
       [   1,  132,   25, 2031],
       [   1,  453,   11,   86]])

In [162]:
predictions = X.dot(w_new)  # matrix-vector product
print(predictions)

[12.38  13.552 12.312]


In [163]:
prices = np.expm1(predictions).round(2)  # convert back to original scale
print(prices)

[237992.82 768348.51 222347.22]


In [164]:
def linear_regression(X):
    """Predict the target for every row of X (the first column must be 1 for the bias)."""
    return X.dot(w_new)

In [165]:
# Use the defined function
predictions = linear_regression(X)
print(predictions)
prices = np.expm1(predictions).round(2)  # convert back to original scale
print(prices)

[12.38  13.552 12.312]
[237992.82 768348.51 222347.22]
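
The fake cars above already include the leading 1 for the bias. When the features come without it, one possible approach (a sketch, not necessarily how the course does it) is to prepend a column of ones with `np.ones` and `np.column_stack`. The names `features` and `X_with_bias` are hypothetical and used only here.

```python
import numpy as np

w_new = np.array([7.17, 0.01, 0.04, 0.002])  # bias followed by the weights

# Raw features without the leading 1 (same fake cars as above)
features = np.array([
    [148, 24, 1385],
    [132, 25, 2031],
    [453, 11, 86],
])

# Prepend a column of ones so the bias is part of the matrix product
ones = np.ones(features.shape[0])
X_with_bias = np.column_stack([ones, features])

# One matrix-vector product predicts all rows at once
batch_predictions = X_with_bias.dot(w_new)

# Row-by-row dot products, for comparison
row_predictions = np.array([row.dot(w_new) for row in X_with_bias])

print(np.allclose(batch_predictions, row_predictions))
```

The matrix-vector product and the row-by-row dot products agree, which is why the single-record function and the matrix form can be used interchangeably.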
