# Simple Linear Regression from scratch

Linear regression assumes a linear (straight-line) relationship between the input variables ($X$) and the single output variable ($y$).

More specifically, that output ($y$) can be calculated from a linear combination of the input variables ($X$). When there is a single input variable, the method is referred to as a simple linear regression.

In simple linear regression we can use statistics on the training data to estimate the coefficients required by the model to make predictions on new data.

The line for a simple linear regression model can be written as:

### $y = b_0 + b_1 * x$

where $b_0$ and $b_1$ are the coefficients we must estimate from the training data.

Once the coefficients are known, we can use this equation to estimate output values for $y$ given new input examples of $X$.
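As a quick illustration of the equation, the sketch below applies it to a few inputs. The coefficient values and the helper name `predict` are made up for this example; they are not estimated from the insurance data.

```python
# Hypothetical coefficients, chosen only to illustrate the equation
example_b0 = 1.0  # intercept
example_b1 = 2.0  # slope


def predict(x, b0, b1):
    """Return the estimated y for a single input x."""
    return b0 + b1 * x


new_inputs = [0, 1, 5]
estimates = [predict(x, example_b0, example_b1) for x in new_inputs]
print(estimates)  # [1.0, 3.0, 11.0]
```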

Estimating the coefficients requires calculating statistical properties of the data: the mean, the variance, and the covariance.

The algebra has already been worked out, so all that remains is some arithmetic to implement.

For this example, we will work with a simple insurance dataset that has two columns, `X` and `Y`, so let's get started.

In [1]:
import pandas as pd
import numpy as np

In [2]:
filename = 'insurance.csv'
dataset = pd.read_csv(filename, delimiter='|')

In [3]:
dataset

Unnamed: 0,X,Y
0,108,392.5
1,19,46.2
2,13,15.7
3,124,422.2
4,40,119.4
...,...,...
58,9,87.4
59,31,209.8
60,14,95.5
61,53,244.6


The first step is to calculate the mean and the variance, so let's define functions for that:

## Step 1: Get the mean and the variance

In [4]:
# Calculate the mean value of a list of numbers
def mean(values):
    return sum(values) / float(len(values))

In [5]:
# Calculate the sum of squared deviations from the mean.
# Note: this is the variance multiplied by n (there is no division by n).
def variance(values, mean):
    return sum([(x-mean)**2 for x in values])

Now that we have our functions defined, let's calculate the above statistics in our dataset example:

In [6]:
# Select the input and output columns and inspect the dataset
x = dataset["X"]
y = dataset["Y"]
dataset.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 2 columns):
X    63 non-null int64
Y    63 non-null float64
dtypes: float64(1), int64(1)
memory usage: 1.1 KB


In [7]:
mean_x, mean_y = mean(x), mean(y)
var_x, var_y = variance(x, mean_x), variance(y, mean_y)
print('x stats: mean=%.3f variance=%.3f' % (mean_x, var_x))
print('y stats: mean=%.3f variance=%.3f' % (mean_y, var_y))

x stats: mean=22.905 variance=33809.429
y stats: mean=98.187 variance=472818.290
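Note that the `variance` function above returns the sum of squared deviations, without dividing by the number of values, which is why the figures printed are so large. A minimal sketch of how this relates to NumPy's population variance, using a small made-up sample rather than the insurance data (the functions are repeated so the snippet runs on its own):

```python
import numpy as np


# Repeated from the cells above so this snippet is self-contained
def mean(values):
    return sum(values) / float(len(values))


def variance(values, mean):
    return sum([(x - mean) ** 2 for x in values])


# Small made-up sample, not the insurance data
sample = [1, 2, 4, 3, 5]
sample_mean = mean(sample)
sum_sq = variance(sample, sample_mean)

# np.var divides by n by default (ddof=0), so np.var * n equals sum_sq
population_var = np.var(sample)
print(sample_mean, sum_sq, population_var * len(sample))
```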


Now that we have our first step completed, let's work on the second step, which is calculating covariance:

## Step 2: Calculate the covariance

The covariance of two groups of numbers describes how those numbers change together.

Covariance is closely related to correlation. Both describe how two groups of numbers vary together, but covariance is expressed in the units of the data, so its magnitude depends on the scale of the variables.

Additionally, covariance can be normalized to produce a correlation value.
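To make the normalization concrete, here is a small sketch, assuming the usual Pearson definition (covariance divided by the product of the two standard deviations). The sample values are made up and are not the insurance data.

```python
import numpy as np

# Made-up sample values, not the insurance data
xs = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
ys = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

# Population covariance: mean of the products of deviations
cov_xy = np.mean((xs - xs.mean()) * (ys - ys.mean()))

# Dividing by both standard deviations gives the Pearson correlation
corr_xy = cov_xy / (xs.std() * ys.std())

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is corr(x, y)
print(corr_xy, np.corrcoef(xs, ys)[0, 1])
```

The two printed values should agree up to floating-point error, and the correlation always lies between -1 and 1.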

We can calculate the covariance between two variables as follows:

### $covariance(x,y) = \dfrac{1}{n}\sum_{i=1}^n ((x_i - \bar{x} ) * (y_i - \bar{y}))$

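So, let's define a function for that and get the results for our dataset. Note that the function below returns the sum of the products without the $\dfrac{1}{n}$ factor, just as the `variance` function above omits it. The factor cancels when the covariance is divided by the variance to estimate $b_1$, so the coefficients are unaffected.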

In [29]:
# Calculate covariance between x and y
def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar

covar = covariance(x, mean_x, y, mean_y)
print('Covariance: %.3f' % (covar))

Covariance: 115419.424
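As a sanity check, the unnormalized sum returned by `covariance` can be compared with NumPy's covariance. This is a sketch on a made-up sample (not the insurance data), with the function repeated so the snippet runs on its own:

```python
import numpy as np


# Same logic as the covariance cell above
def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i] - mean_x) * (y[i] - mean_y)
    return covar


# Made-up sample values
xs = [1, 2, 4, 3, 5]
ys = [1, 3, 3, 2, 5]
n = len(xs)

raw = covariance(xs, sum(xs) / n, ys, sum(ys) / n)

# np.cov with bias=True divides by n; entry [0, 1] is cov(x, y)
averaged = np.cov(xs, ys, bias=True)[0, 1]
print(raw, raw / n, averaged)
```

Dividing the raw sum by `n` should match the NumPy value, which corresponds to the formula with the $\dfrac{1}{n}$ factor.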


We now have all the pieces in place to calculate the coefficients for our model.

## Step 3: Estimate coefficients

The first is $b_1$, the slope of the line, which can be estimated as:

### $b_1 = \dfrac{covariance(x, y)}{variance(x)}$

Next, we need to estimate a value for $b_0$, also called the intercept, because it is the point where the line crosses the y-axis.

### $b_0 = mean(y) - b_1 * mean(x)$
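For context, these two formulas are the standard least-squares solution. A brief sketch of where they come from, where $S$ is a name introduced here for the sum of squared errors:

$$S(b_0, b_1) = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2$$

$$\frac{\partial S}{\partial b_0} = 0 \;\Rightarrow\; b_0 = \bar{y} - b_1 \bar{x}$$

$$\frac{\partial S}{\partial b_1} = 0 \;\Rightarrow\; b_1 = \dfrac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2} = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

The $\frac{1}{n}$ factors cancel, so the ratio of the two unnormalized sums gives the same $b_1$.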

Since we have all of these functions prepared already, we can go ahead and define the function to calculate the coefficients:

In [33]:
# Calculate coefficients
def coefficients(dataset):
    x = dataset["X"]
    y = dataset["Y"]
    x_mean, y_mean = mean(x), mean(y)
    b1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
    b0 = y_mean - b1 * x_mean
    return [b0, b1]

In [34]:
b0, b1 = coefficients(dataset)
print('Coefficients: B0=%.3f, B1=%.3f' % (b0, b1))

Coefficients: B0=19.994, B1=3.414
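One way to gain confidence in the estimates is to compare them with a library fit. The sketch below does this on a small made-up sample (not the insurance data) using `np.polyfit` with degree 1; the `toy_` names are chosen here to avoid overwriting the notebook's `b0` and `b1`.

```python
import numpy as np

# Made-up sample values
xs = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
ys = np.array([1.0, 3.0, 3.0, 2.0, 5.0])

# Same arithmetic as the coefficients function, written with NumPy
dx = xs - xs.mean()
dy = ys - ys.mean()
toy_b1 = np.sum(dx * dy) / np.sum(dx ** 2)
toy_b0 = ys.mean() - toy_b1 * xs.mean()

# np.polyfit with degree 1 returns [slope, intercept] of a least-squares line
slope, intercept = np.polyfit(xs, ys, 1)
print(toy_b0, toy_b1, intercept, slope)
```

Both approaches should give an intercept of about 0.4 and a slope of about 0.8 for this sample.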


Now that we know how to estimate the coefficients, the next step is to use them in our model.

## Step 4: Make Predictions

The equation to make predictions with a simple linear regression model is as follows:

### $y = b_0 + b_1 * x$

In [35]:
def simple_linear_regression(train, test):
    # Estimate the coefficients from the training DataFrame
    b0, b1 = coefficients(train)
    predictions = list()
    # Predict y for every input value in the test set's "X" column
    for x_value in test["X"]:
        yhat = b0 + b1 * x_value
        predictions.append(yhat)
    return predictions

In [None]:
# Evaluate regression algorithm on training dataset
def evaluate_algorithm(dataset, algorithm):
    # Copy the data without the target column so the algorithm cannot see it
    test_set = dataset.drop(columns=["Y"])
    predicted = algorithm(dataset, test_set)
    print(predicted)
    actual = list(dataset["Y"])
    # rmse_metric must be defined before this function is called
    rmse = rmse_metric(actual, predicted)
    return rmse
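`evaluate_algorithm` calls `rmse_metric`, which is not defined in the cells shown here. A minimal sketch of such a helper, assuming the usual root mean squared error definition and that no other definition exists elsewhere in the notebook:

```python
from math import sqrt


# Hypothetical helper: root mean squared error of two equal-length sequences
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for a, p in zip(actual, predicted):
        sum_error += (p - a) ** 2
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)


# Made-up values to show the call
example_rmse = rmse_metric([1, 3, 3, 2, 5], [1.2, 2.0, 3.6, 2.8, 4.4])
print('RMSE: %.3f' % example_rmse)
```

Because the error is reported in the same units as $y$, it can be read as a typical prediction error for the model.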