In [4]:
import sys
sys.path.append("./git/")

# Multiple regression

Scaling up the complexity from a simple linear regression, now imagine that we have a series of inputs (independent variables) where each $x_i$ is not a single number but a vector of length $k$:

$$x_i,...,x_k$$

The multiple regression linear model assumes that:

$$y_i = \alpha + \beta_1x_i1 + ... + \beta_k x_(ik) + \varepsilon_i$$

In multiple regression the vector of paramters is usually called $\beta$. We'll want thi to include the constant term as well, which we can achieve by adding a column of 1s to our data:

`beta = [alpha,beta_1,...,beta_k]`

and

`x_i = [1,x_i1,...,x_ik]`

Then the model is just:

In [5]:
from scratch.linear_algebra import dot, Vector

def predict(x: Vector, beta: Vector) -> float:
    """assumes that the first element of x is 1"""
    return dot(x,beta)

Each input vector will look like this:

In [6]:
[1, # constant term
 49, # number of friends
 4, # work hours per day
 0] # doesn't have PhD

[1, 49, 4, 0]

# Further assumptions in the least squares model

Extra assumptions make the model work:

- the columns of `x_i` (the input variables) are linearly independent i.e. there is no way to write any one as a weighted sum of some of the others. If this assumption fails, you can't estimate beta.
- all of the columns of `x_i` are uncorrelated with the errors $\varepsilon$. If this isn't true, our $\beta$ estimates will be systematically wrong.

# Fitting the model

In [7]:
from typing import List

def error(x: Vector, 
          y: float,
          beta:Vector) -> float:
    return predict(x,beta) - y

def squared_error(x:Vector,
                  y:float,
                  beta:Vector) -> float:
    return error(x,y,beta) ** 2

x = [1,2,3]
y = 30
beta = [4,4,4]

assert error(x,y,beta) == -6
assert squared_error(x,y,beta) == 36

In [8]:
def sqerror_gradient(x:Vector,
                     y:float,
                     beta:Vector) -> Vector:
    err = error(x,y,beta)
    return [2*err*x_i for x_i in x]

assert sqerror_gradient(x,y,beta) == [-12,-24,-36]

In [9]:
import random
import tqdm
from scratch.linear_algebra import vector_mean
from scratch.gradient_descent import gradient_step

def least_squares_fit(xs:List[Vector],
                      ys:List[float],
                      learning_rate:float = 0.001,
                      num_steps: int = 1000,
                      batch_size: int = 1) -> Vector:
    """
    Find the beta that minimises the sum of squared errors
    assuming the model y = dot(x,beta)
    """
    guess = [random.random() for _ in xs[0]]
    
    for _ in tqdm.trange(num_steps,desc='least squares fit'):
        for start in range(0,len(xs),batch_size):
            batch_xs = xs[start:start+batch_size]
            batch_ys = ys[start:start+batch_size]
            
            gradient = vector_mean([sqerror_gradient(x,y,guess) 
                                    for x,y in zip(batch_xs,batch_ys)])
    return guess

In [10]:
from scratch.statistics import daily_minutes_good
from scratch.gradient_descent import gradient_step

random.seed(0)
learning_rate = 0.001

inputs: List[List[float]] = [[1.,49,4,0],[1,41,9,0],[1,40,8,0],[1,25,6,0],[1,21,1,0],[1,21,0,0],[1,19,3,0],[1,19,0,0],[1,18,9,0],[1,18,8,0],[1,16,4,0],[1,15,3,0],[1,15,0,0],[1,15,2,0],[1,15,7,0],[1,14,0,0],[1,14,1,0],[1,13,1,0],[1,13,7,0],[1,13,4,0],[1,13,2,0],[1,12,5,0],[1,12,0,0],[1,11,9,0],[1,10,9,0],[1,10,1,0],[1,10,1,0],[1,10,7,0],[1,10,9,0],[1,10,1,0],[1,10,6,0],[1,10,6,0],[1,10,8,0],[1,10,10,0],[1,10,6,0],[1,10,0,0],[1,10,5,0],[1,10,3,0],[1,10,4,0],[1,9,9,0],[1,9,9,0],[1,9,0,0],[1,9,0,0],[1,9,6,0],[1,9,10,0],[1,9,8,0],[1,9,5,0],[1,9,2,0],[1,9,9,0],[1,9,10,0],[1,9,7,0],[1,9,2,0],[1,9,0,0],[1,9,4,0],[1,9,6,0],[1,9,4,0],[1,9,7,0],[1,8,3,0],[1,8,2,0],[1,8,4,0],[1,8,9,0],[1,8,2,0],[1,8,3,0],[1,8,5,0],[1,8,8,0],[1,8,0,0],[1,8,9,0],[1,8,10,0],[1,8,5,0],[1,8,5,0],[1,7,5,0],[1,7,5,0],[1,7,0,0],[1,7,2,0],[1,7,8,0],[1,7,10,0],[1,7,5,0],[1,7,3,0],[1,7,3,0],[1,7,6,0],[1,7,7,0],[1,7,7,0],[1,7,9,0],[1,7,3,0],[1,7,8,0],[1,6,4,0],[1,6,6,0],[1,6,4,0],[1,6,9,0],[1,6,0,0],[1,6,1,0],[1,6,4,0],[1,6,1,0],[1,6,0,0],[1,6,7,0],[1,6,0,0],[1,6,8,0],[1,6,4,0],[1,6,2,1],[1,6,1,1],[1,6,3,1],[1,6,6,1],[1,6,4,1],[1,6,4,1],[1,6,1,1],[1,6,3,1],[1,6,4,1],[1,5,1,1],[1,5,9,1],[1,5,4,1],[1,5,6,1],[1,5,4,1],[1,5,4,1],[1,5,10,1],[1,5,5,1],[1,5,2,1],[1,5,4,1],[1,5,4,1],[1,5,9,1],[1,5,3,1],[1,5,10,1],[1,5,2,1],[1,5,2,1],[1,5,9,1],[1,4,8,1],[1,4,6,1],[1,4,0,1],[1,4,10,1],[1,4,5,1],[1,4,10,1],[1,4,9,1],[1,4,1,1],[1,4,4,1],[1,4,4,1],[1,4,0,1],[1,4,3,1],[1,4,1,1],[1,4,3,1],[1,4,2,1],[1,4,4,1],[1,4,4,1],[1,4,8,1],[1,4,2,1],[1,4,4,1],[1,3,2,1],[1,3,6,1],[1,3,4,1],[1,3,7,1],[1,3,4,1],[1,3,1,1],[1,3,10,1],[1,3,3,1],[1,3,4,1],[1,3,7,1],[1,3,5,1],[1,3,6,1],[1,3,1,1],[1,3,6,1],[1,3,10,1],[1,3,2,1],[1,3,4,1],[1,3,2,1],[1,3,1,1],[1,3,5,1],[1,2,4,1],[1,2,2,1],[1,2,8,1],[1,2,3,1],[1,2,1,1],[1,2,9,1],[1,2,10,1],[1,2,9,1],[1,2,4,1],[1,2,5,1],[1,2,0,1],[1,2,9,1],[1,2,9,1],[1,2,0,1],[1,2,1,1],[1,2,1,1],[1,2,4,1],[1,1,0,1],[1,1,2,1],[1,1,2,1],[1,1,5,1],[1,1,3,1],[1,1,10,1],[1,1,6,1],[1,1,0,1],[1,1,8,1],[1,1,6,1],[1,1,4,1],[1,1,9,1],[1,1,9,1],[1,1,4,1],[1,1,2,1],[1,1,9,1],[1,1,0,1],[1,1,8,1],[1,1,6,1],[1,1,1,1],[1,1,1,1],[1,1,5,1]]

beta = least_squares_fit(inputs,
                         daily_minutes_good,
                         learning_rate,
                         5000,
                         25)
print(beta)

# assert 30.50 < beta[0] < 30.70  # constant
# assert  0.96 < beta[1] <  1.00  # num friends
# assert -1.89 < beta[2] < -1.85  # work hours per day
# assert  0.91 < beta[3] <  0.94  # has PhD

least squares fit: 100%|█████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 2059.43it/s]

[0.8444218515250481, 0.7579544029403025, 0.420571580830845, 0.25891675029296335]





# Interpreting the model

The coefficients of the model (the beta values) can be thought of being estimates of the impact of each factor (each feature, each variable) on the output variable **all else being equal**.

All else being equal, each additional friend corresponds to an extra minute spent on the site each day. All else being equal, each additional hour in a user's workday corresponds to about two fewer minutes spent on the site each day. All else being equal, having a PhD is associated with spending an extra minute on the site each day.

What this **doesn't** tell you anything about is the *interaction* between the variables. The hours someone works may influence people with more friends differently than people with fewer friends. Multiple regressions do not capture this. So how do you handle that?

One way is to define a new variable that is the product of friends and work hours. This allows the work hours coefficient to increase or decrease as the number of friends increases.

Alternatively, it could be that the more friends you have the more time you spend on the site *up to a certain point* at which further friends causes you to spend less time on the site. We could try to capture this by introducing another variable that is the  *square* of the number of friends.

Once we start adding variables, we need to be very careful about whether their coefficients actually matter or mean anything. You have carte blanche to define any variables and as many variables as you want.

# Goodness of fit

Again we look to the r-squared value:

In [12]:
from scratch.simple_linear_regression import total_sum_of_squares

def multiple_r_squared(xs: List[Vector],
                       ys: Vector,
                       beta: Vector) -> float:
    sum_of_squared_errors = sum(error(x,y,beta)**2 
                                for x,y in zip(xs,ys))
    return 1.0 - sum_of_squared_errors / total_sum_of_squares(ys)

multiple_r_squared(inputs,daily_minutes_good,beta)

-4.422023502508536