In [None]:
%matplotlib inline
import numpy as np
import math
import matplotlib.pyplot as plt
import scipy.stats as st

In this notebook I practice several concepts related to linear regression and least squares. To do so, I need a function that creates data with a linear trend. The following function generates a randomized dataset from the parameters of a linear regression model (slope and intercept), plus Gaussian noise with a given standard deviation.

In [None]:
SIZE = 100
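def generate_random_linear_sample(slope, intercept, noise_std, x_init, x_end):
    """Return SIZE points (x, y) with y = slope*x + intercept + Gaussian noise."""
    # x is drawn uniformly from [x_init, x_end)
    x = np.random.uniform(x_init, x_end, SIZE)
    # The noise has mean 0 and standard deviation noise_std
    y = x*slope + intercept + np.random.normal(0, noise_std, SIZE)

    return (x, y)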

In [None]:
x1, y1 = generate_random_linear_sample(0.85, 12, 8, 10, 50)
x2, y2 = generate_random_linear_sample(-1.23, 5, 5, 20, 80)
x3, y3 = generate_random_linear_sample(0.02, 10, 2, 100, 120)
fig, ax = plt.subplots(1,3)
ax[0].scatter(x1, y1)
ax[1].scatter(x2, y2)
ax[2].scatter(x3, y3)
fig.set_figwidth(12)

### Example of moderate positive linear relationship

Correlation is a measure of the strength of the linear relationship between two variables. Its value lies in the range between -1 and 1. If the variables have a strong positive linear relationship, the value is close to 1. If they have a strong negative linear relationship, the value is close to -1. If there is no apparent linear relationship between the variables, the value is close to zero.

The correlation coefficient indicates a positive linear relationship for the set of observations (x1, y1):

In [None]:
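# Sample correlation: ddof=1 makes the standard deviations consistent with the 1/(n - 1) factor
R = 1/(len(x1) - 1)*np.sum((x1 - np.mean(x1))*(y1 - np.mean(y1))/(np.std(x1, ddof=1) * np.std(y1, ddof=1)))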
print(R)
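
As a sanity check, a hand-computed correlation coefficient can be compared against NumPy's `np.corrcoef`. The sketch below is only illustrative: it regenerates a sample similar to (x1, y1) with a fixed seed, and the helper name `pearson_r` is an assumption, not part of the notebook.

```python
import numpy as np

# Assumption: a fixed seed and a fresh sample resembling (x1, y1)
rng = np.random.default_rng(0)
x = rng.uniform(10, 50, 100)
y = 0.85 * x + 12 + rng.normal(0, 8, 100)

def pearson_r(x, y):
    """Sample Pearson correlation, using the (n - 1) normalisation throughout."""
    zx = (x - np.mean(x)) / np.std(x, ddof=1)  # standardised x
    zy = (y - np.mean(y)) / np.std(y, ddof=1)  # standardised y
    return np.sum(zx * zy) / (len(x) - 1)

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(r_manual, r_numpy)
```

The two values should agree up to floating-point error, provided the standard deviations and the normalising factor use the same degrees of freedom.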

Least squares is a method to fit a linear model by minimising the least squares criterion, that is, the sum of the squared residuals. The residual of an observation is the difference between the observed value of the response variable and the value predicted by the linear model.
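
Written out, the criterion and its minimiser for a single predictor look as follows. The notation is introduced here for illustration: $b_0$ and $b_1$ are the intercept and slope, $R$ is the correlation, $\bar{x}$ and $\bar{y}$ are the sample means, and $s_x$ and $s_y$ are the standard deviations of the two variables.

$$
\min_{b_0,\, b_1} \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2,
\qquad
b_1 = R\,\frac{s_y}{s_x},
\qquad
b_0 = \bar{y} - b_1\,\bar{x}
$$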

Let's apply least squares to the set of observations (x1, y1):

In [None]:
b1 = R * np.std(y1) / np.std(x1)
b0 = np.mean(y1) - b1 * np.mean(x1)

print('y = {0} * x + {1}'.format(b1, b0))
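
The closed-form estimates can be cross-checked against a library fit. The following is a minimal sketch, assuming a freshly generated sample with a fixed seed rather than the notebook's own (x1, y1); it compares the correlation-based formulas with `np.polyfit` for a degree-1 polynomial.

```python
import numpy as np

# Assumption: a fixed seed and a fresh sample resembling (x1, y1)
rng = np.random.default_rng(1)
x = rng.uniform(10, 50, 100)
y = 0.85 * x + 12 + rng.normal(0, 8, 100)

# Closed-form least squares estimates from the correlation
r = np.corrcoef(x, y)[0, 1]
b1 = r * np.std(y, ddof=1) / np.std(x, ddof=1)
b0 = np.mean(y) - b1 * np.mean(x)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

print('closed form: y = {0} * x + {1}'.format(b1, b0))
print('np.polyfit:  y = {0} * x + {1}'.format(slope, intercept))
```

Note that the ratio of standard deviations is the same whether `ddof=0` or `ddof=1` is used, as long as both use the same value.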

In [None]:
def predict(x, b0, b1):
    return b0 + b1*x

fig, ax = plt.subplots()
ax.scatter(x1, y1)
ax.plot([np.min(x1), np.max(x1)], [predict(np.min(x1), b0, b1), predict(np.max(x1), b0, b1)], 'r')

In order to apply least squares, we have to check the following conditions:

- The data has a linear trend.
- The distribution of the residuals is nearly normal. This may not be the case if there are outliers.
- The variability of the data around the line remains roughly constant.
- The observations are independent. That is not the case, for instance, when we have time series data.

In order to test these conditions we generate a series of plots:

In [None]:
def test_conditions(x, y, b0, b1):
    fig, ax = plt.subplots(2,2)
    
    predictions = x*b1 + b0
    residuals = y - predictions
    
    # Residuals plot
    ax[0][0].scatter(x, residuals)
    ax[0][0].plot([np.min(x)-1, np.max(x)+1], [0, 0], 'k--')
    ax[0][0].set_xlabel('x')
    ax[0][0].set_ylabel('residuals')
    
    # Distribution of the residuals
    weights = np.ones_like(residuals)/float(len(residuals))
    ax[0][1].hist(residuals, bins=10, weights=weights)
    ax[0][1].set_xlabel('residuals')
    
    # Q-q plot of the residuals
    quantiles = np.arange(0.01,0.99,0.01)
    q_theoretical = [st.norm.ppf(i, loc=np.mean(residuals), scale=np.std(residuals)) for i in quantiles]
    q_residuals = [np.percentile(residuals, i*100) for i in quantiles]
    ax[1][0].scatter(q_residuals, q_theoretical, color='blue')
    min_value = min(np.min(q_theoretical), np.min(q_residuals))
    max_value = max(np.max(q_theoretical), np.max(q_residuals))
    ax[1][0].plot([min_value, max_value], [min_value, max_value], 'k--')
    ax[1][0].set_xlabel('residuals')
    ax[1][0].set_ylabel('theoretical')
    
    # Order of data collection
    ax[1][1].scatter(range(len(x)), residuals)
    ax[1][1].plot([0, len(x)], [0, 0], 'k--')
    ax[1][1].set_xlabel('order of data collection')
    ax[1][1].set_ylabel('residuals')
    
    fig.set_figwidth(12)
    fig.set_figheight(8)

In [None]:
test_conditions(x1, y1, b0, b1)

These plots suggest that the conditions hold, so least squares can be applied to the (x1, y1) dataset.
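
The visual checks can be complemented with a few rough numbers. The sketch below is one possible set of diagnostics, not a standard recipe: it assumes a freshly generated sample with a fixed seed, uses the Shapiro-Wilk test (`scipy.stats.shapiro`) for normality, compares the residual spread in the lower and upper half of x, and computes the lag-1 correlation of the residuals in collection order.

```python
import numpy as np
import scipy.stats as st

# Assumption: a fixed seed and a fresh sample resembling (x1, y1)
rng = np.random.default_rng(2)
x = rng.uniform(10, 50, 100)
y = 0.85 * x + 12 + rng.normal(0, 8, 100)

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# Normality: a small p-value would suggest the residuals are not normal
stat, p_normal = st.shapiro(residuals)

# Constant variability: ratio of residual spread above vs. below the median of x
# (a value near 1 is consistent with constant variability)
low = residuals[x <= np.median(x)]
high = residuals[x > np.median(x)]
spread_ratio = np.std(high, ddof=1) / np.std(low, ddof=1)

# Independence: correlation between consecutive residuals in collection order
# (a value near 0 is consistent with independent observations)
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

print('Shapiro-Wilk p-value:', p_normal)
print('spread ratio (high x / low x):', spread_ratio)
print('lag-1 residual correlation:', lag1)
```

These numbers are a supplement to the plots rather than a replacement: with 100 points, moderate departures from the conditions can easily go undetected.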

### Example of strong negative linear relationship

The correlation coefficient indicates a negative linear relationship for the set of observations (x2, y2):

In [None]:
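# Sample correlation: ddof=1 makes the standard deviations consistent with the 1/(n - 1) factor
R = 1/(len(x2) - 1)*np.sum((x2 - np.mean(x2))*(y2 - np.mean(y2))/(np.std(x2, ddof=1) * np.std(y2, ddof=1)))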
print(R)

Let's apply least squares to the set of observations (x2, y2):

In [None]:
b1 = R * np.std(y2) / np.std(x2)
b0 = np.mean(y2) - b1 * np.mean(x2)

print('y = {0} * x + {1}'.format(b1, b0))

In [None]:
fig, ax = plt.subplots()
ax.scatter(x2, y2)
ax.plot([np.min(x2), np.max(x2)], [predict(np.min(x2), b0, b1), predict(np.max(x2), b0, b1)], 'r')

Let's test the least squares conditions:

In [None]:
test_conditions(x2, y2, b0, b1)

These plots suggest that the conditions hold, so least squares can be applied to the (x2, y2) dataset.

### Example of very weak linear relationship

The correlation coefficient indicates that there is at most a very weak linear relationship for the set of observations (x3, y3):

In [None]:
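# Sample correlation: ddof=1 makes the standard deviations consistent with the 1/(n - 1) factor
R = 1/(len(x3) - 1)*np.sum((x3 - np.mean(x3))*(y3 - np.mean(y3))/(np.std(x3, ddof=1) * np.std(y3, ddof=1)))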
print(R)

Let's apply least squares to the set of observations (x3, y3):

In [None]:
b1 = R * np.std(y3) / np.std(x3)
b0 = np.mean(y3) - b1 * np.mean(x3)

print('y = {0} * x + {1}'.format(b1, b0))

In [None]:
fig, ax = plt.subplots()
ax.scatter(x3, y3)
ax.plot([np.min(x3), np.max(x3)], [predict(np.min(x3), b0, b1), predict(np.max(x3), b0, b1)], 'r')

Let's test the least squares conditions:

In [None]:
test_conditions(x3, y3, b0, b1)

These plots suggest that least squares can be applied to the (x3, y3) dataset. The linear trend is practically non-existent, but the rest of the conditions apply.
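
How weak the trend is can also be quantified. As a minimal sketch, assuming a freshly generated sample with a fixed seed that resembles (x3, y3), `scipy.stats.linregress` reports the fitted slope together with the correlation and a p-value for the null hypothesis that the slope is zero.

```python
import numpy as np
import scipy.stats as st

# Assumption: a fixed seed and a fresh sample resembling (x3, y3)
rng = np.random.default_rng(3)
x = rng.uniform(100, 120, 100)
y = 0.02 * x + 10 + rng.normal(0, 2, 100)

result = st.linregress(x, y)

print('slope:', result.slope)
print('intercept:', result.intercept)
print('R squared:', result.rvalue**2)           # share of the variance in y explained by the line
print('p-value for slope = 0:', result.pvalue)  # two-sided test
```

With a true slope of 0.02 and a noise standard deviation of 2, an R squared near zero is expected: the fitted line explains almost none of the variability in y.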