# Simple Linear Regression for Dummies with Code

### A Scratch Implementation (Without even numpy)

#### [Prashant Brahmbhatt](https://www.github.com/hashbanger)

___

First and foremost, what is 'linear'? An equation is linear if every variable in it has an exponent of at most 1 and no variables are multiplied together.

Linear regression is a linear model: it assumes a linear relationship between the input variables x and the output variable y.

We fit a straight line formed by a linear combination of the input variables. When there is only a single input variable used to predict the output, we call it Simple Linear Regression.

The simple linear model equation is:  
### $$y = b0 + b1 * x$$

Evaluated at different values of x, this equation traces a straight line whose slope (b1) and intercept (b0) are set by the coefficients.  
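
To make the equation concrete, here is a minimal sketch that evaluates the line at a few inputs. The helper name `predict_line` and the coefficient values are assumptions picked only for illustration; they are not learned from any data.

```python
# Hypothetical helper: evaluates y = b0 + b1 * x for a single x.
def predict_line(x, b0, b1):
    return b0 + b1 * x

# Illustrative coefficients (assumed for this sketch).
b0, b1 = 0.5, 2.0

xs = [0, 1, 2, 3]
ys = [predict_line(x, b0, b1) for x in xs]

# At x = 0 the output equals the intercept b0;
# each step of 1 in x raises y by the slope b1.
print(ys)
```

Changing `b1` tilts the line and changing `b0` shifts it up or down, which is exactly what training adjusts.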

Our job is to find the coefficient values that best fit the data we train on. What counts as the best fit depends on the technique we use to train the model.

We can use the popular **Gradient Descent** algorithm, or we can use direct formulas from statistics. This notebook uses the direct formulas.
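
For comparison, here is a rough sketch of what the gradient descent route could look like on a small toy dataset. It is not the method used in the rest of this notebook. The function name `gradient_descent`, the learning rate, and the number of epochs are all assumptions chosen for this example.

```python
# Sketch of batch gradient descent for y = b0 + b1 * x,
# minimizing the mean squared error. Hyperparameters are assumed values.
def gradient_descent(x, y, learning_rate=0.01, epochs=10000):
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        # Residuals of the current line on every point
        errors = [(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
        # Gradients of the mean squared error w.r.t. b0 and b1
        grad_b0 = 2 * sum(errors) / n
        grad_b1 = 2 * sum(e * xi for e, xi in zip(errors, x)) / n
        # Step against the gradient
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]
b0, b1 = gradient_descent(x, y)
print(round(b0, 3), round(b1, 3))
```

On this toy data the iterations should settle close to the same coefficients the direct formulas give, just with far more computation.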

### Calculating the mean

To calculate the mean of a list of numbers, we can use the following formula and snippet:

### $$ mean(x) = sum(x) / count(x)$$

In [199]:
def mean(x):
    return sum(x)/len(x)

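Next comes the variance, which is built from the squared difference of each value from the mean. The function below returns the sum of those squared differences without dividing by the number of values. That is enough for the coefficient formula, because the covariance used there is left unnormalized in the same way and the missing factor cancels out.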

In [200]:
# Sum of squared deviations from the mean
# (this is the population variance multiplied by len(x))
def variance(x, mean_of_x):
    return sum([(x_ - mean_of_x)**2 for x_ in x])

In [201]:
# Let's try it on a sample dataset
data = [[1,1],[2,3],[4,3],[3,2],[5,5]]
x = [int(v[0]) for v in data]
y = [int(v[1]) for v in data]
print("The mean and the variance for x and y:\n")
print("Mean of x is {:.3f}\nVariance is {:.3f}\n".format(mean(x), variance(x, mean(x))))
print("Mean of y is {:.3f}\nVariance is {:.3f}".format(mean(y), variance(y, mean(y))))

The mean and the variance for x and y:

Mean of x is 3.000
Variance is 10.000

Mean of y is 2.800
Variance is 8.800
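
The value 10.000 printed for x is a sum of squared deviations, since `variance()` as written does not divide by the number of values. The sketch below shows how that sum relates to the usual population and sample variances, cross-checked against Python's standard `statistics` module. The variable names are assumptions for this example.

```python
from statistics import pvariance, variance as sample_variance

x = [1, 2, 4, 3, 5]
mean_of_x = sum(x) / len(x)

# Sum of squared deviations: what this notebook's variance() returns
sum_sq = sum((v - mean_of_x) ** 2 for v in x)

population_var = sum_sq / len(x)        # divide by n
sample_var = sum_sq / (len(x) - 1)      # divide by n - 1

print(sum_sq, population_var, sample_var)
# Standard-library versions for comparison
print(pvariance(x), sample_variance(x))
```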


### Calculating Covariance

The covariance is a simple measure of how one variable responds to a change in the other variable.

Covariance and correlation are closely related. Correlation is the covariance rescaled by the standard deviations of the two variables, so it always lies between -1 and 1, whereas the size of the covariance depends on the units of the data.


### $$covariance = sum((x(i) - mean(x)) * (y(i) - mean(y)))$$

In [202]:
def covariance(x, mean_of_x, y , mean_of_y):
    cov = 0.0
    for i in range(len(x)):
        cov += (x[i]-mean_of_x) * (y[i] - mean_of_y)
    return cov

In [203]:
print("Covariance of x and y is {}".format(covariance(x, mean(x), y, mean(y))))

Covariance of x and y is 8.0
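
To connect covariance with correlation, here is a small sketch that computes the Pearson correlation for the same toy data from unnormalized sums like the ones used above. The variable names are assumptions for this example.

```python
from math import sqrt

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]
mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)

# Unnormalized sums, in the same style as covariance() and variance() here
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
ss_x = sum((a - mean_x) ** 2 for a in x)
ss_y = sum((b - mean_y) ** 2 for b in y)

# Pearson correlation: the 1/n factors cancel, so the raw sums can be used
correlation = cov_xy / sqrt(ss_x * ss_y)
print(round(correlation, 3))
```

A value near 1 means the points lie close to an upward-sloping line, which is the situation where a simple linear model fits well.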


### Estimating the Coefficients

We can calculate the B1 coefficient using the formula

### $$B1 = sum((x(i) - mean(x)) * (y(i) - mean(y))) / sum( (x(i) - mean(x))^2 )$$
which is the same as 
### $$B1 = covariance / variance$$

For calculation of the B0: 
### $$B0 = mean(y) - B1 * mean(x)$$

We can create a function for all this

In [204]:
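def coefficients(data):
    # Note: int() truncates fractional values (e.g. 392.5 becomes 392)
    x = [int(v[0]) for v in data]
    y = [int(v[1]) for v in data]
    mean_of_x, mean_of_y = mean(x), mean(y)
    # The covariance takes the mean of y as its last argument
    B1 = covariance(x, mean_of_x, y, mean_of_y) / variance(x, mean_of_x)
    B0 = mean_of_y - B1 * mean_of_x
    return B0, B1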

In [205]:
B0, B1 = coefficients(data)
print("The calculated coefficients are:\nB0 = {}\nB1 = {}".format(B0,B1))

The calculated coefficients are:
B0 = 0.39999999999999947
B1 = 0.8
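
Plugging the toy dataset's numbers into the formulas gives a quick hand check. The sums 8.0 and 10.0 and the means 3.0 and 2.8 are the values printed earlier in this notebook:

### $$B1 = 8.0 / 10.0 = 0.8$$
### $$B0 = 2.8 - 0.8 * 3.0 = 0.4$$

The printed `B0 = 0.39999999999999947` differs from 0.4 only through floating-point rounding.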


### The Prediction Phase

Now that we have trained our model, it is time to predict outputs for new inputs.

In [206]:
def regressionModel(train, test):
    B0, B1 = coefficients(train)
    prediction = []
    # making the predictions
    for row in test:
        # y = b0 + b1 * x
        y_hat = B0 + B1 * row[0]
        prediction.append(y_hat)
    return prediction

We now have the actual results and also our predictions, so we can check how close the predictions are.  

We will use the common RMSE (Root Mean Squared Error) metric for that.

In [207]:
from math import sqrt
def rmse(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        sum_error += ((actual[i] - predicted[i])**2)
    mean_error = sum_error / len(actual)
    return sqrt(mean_error)

In [208]:
def final_algorithm(dataset, algorithm):
    test_set = []
    for row in dataset:      # Here we create a Test dataset like our train data but with Labels as None
        row_copy = list(row) # [[1, None], [2, None], [4, None], [3, None], [5, None]]
        row_copy[-1] = None
        test_set.append(row_copy)
    predicted = algorithm(dataset, test_set)
    print(predicted)
    actual = [row[-1] for row in dataset]
    rmse_error = rmse(actual, predicted)
    return rmse_error

Now let's test everything on our toy dataset.

In [209]:
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
rmse_error = final_algorithm(dataset, regressionModel)
print('RMSE: %.3f' % (rmse_error))

[1.1999999999999995, 1.9999999999999996, 3.5999999999999996, 2.8, 4.3999999999999995]
RMSE: 0.693
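
The reported RMSE can be checked by hand. The sketch below rounds the predictions printed above to one decimal place (a simplification assumed for readability) and walks through the steps: square each error, average, then take the square root.

```python
from math import sqrt

actual = [1, 3, 3, 2, 5]
# Predictions from the toy run above, rounded to one decimal place
predicted = [1.2, 2.0, 3.6, 2.8, 4.4]

# Squared error per observation
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
# Mean of the squared errors, then the square root
mse = sum(squared_errors) / len(squared_errors)
rmse_check = sqrt(mse)

print(round(mse, 2), round(rmse_check, 3))
```

The squared errors sum to about 2.4, so the mean is about 0.48 and its square root is about 0.693, matching the result above.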


_________________________

## On a real Dataset

First we need to create a function to read the CSV file.

In [210]:
from random import seed
from random import randrange
from csv import reader
from math import sqrt

In [211]:
def csv_read(filename):
    dataset = []
    with open(filename, 'r') as file:
        csv_readed = reader(file)
        for row in csv_readed:
            if not row:
                continue
            dataset.append(row)
    return dataset
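
To see why `csv_read()` skips empty rows, here is a small sketch that runs `csv.reader` on in-memory text instead of a file. The sample content is an assumption made up for this example.

```python
from csv import reader
from io import StringIO

# In-memory stand-in for a CSV file with a blank line in the middle
sample = StringIO("108,392.5\n\n19,46.2\n")

rows = list(reader(sample))
print(rows)      # the blank line comes back as an empty list

# Dropping empty rows, as the `if not row: continue` check does
cleaned = [row for row in rows if row]
print(cleaned)   # every remaining value is still a string
```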

In [212]:
readed_data = csv_read('sweden_insurance_data.csv')
readed_data[:10]

[['108', '392.5'],
 ['19', '46.2'],
 ['13', '15.7'],
 ['124', '422.2'],
 ['40', '119.4'],
 ['57', '170.9'],
 ['23', '56.9'],
 ['14', '77.5'],
 ['45', '214'],
 ['10', '65.3']]

Above we can see that the data is in string format, so we need to convert every value in each column to a float.

In [213]:
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

We will also implement a function to split the training set and the test set to give us some flexibility in the Test and Train size ratio.

In [214]:
def train_test_split(dataset, split):
    train = list()
    train_size = split * len(dataset)
    # Copy the dataset and randomly pop observations into the train set;
    # whatever remains in the copy becomes the test set
    dataset_copy = list(dataset)
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy
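
As an alternative way to picture the split, here is a sketch that shuffles a copy of the data and cuts it at the requested ratio. This is not the function above: the name `shuffle_split`, the `seed_value` parameter, and the made-up dataset are assumptions for illustration, and the exact rows selected will differ from `train_test_split()`.

```python
from random import Random

# Hypothetical alternative: shuffle a copy, then slice at the split point.
def shuffle_split(dataset, split, seed_value=101):
    rng = Random(seed_value)       # private generator, so results repeat
    shuffled = list(dataset)       # copy, so the original order is kept
    rng.shuffle(shuffled)
    cut = int(round(split * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Made-up dataset of 10 rows in [x, y] form
data = [[i, i * 2] for i in range(10)]
train, test = shuffle_split(data, 0.6)

print(len(train), len(test))
```

Either way, every row ends up in exactly one of the two sets, and the original dataset is left untouched.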

We also need to modify our previous **final_algorithm()** function to use this splitting functionality.

In [219]:
# Evaluate an algorithm using a train/test split
def final_algorithm(dataset, algorithm, split, *args):
    train, test = train_test_split(dataset, split)
    test_set = list()
    for row in test:
        row_copy = list(row)
        row_copy[-1] = None
        test_set.append(row_copy)
    predicted = algorithm(train, test_set, *args)
    actual = [row[-1] for row in test]
    rmse_value = rmse(actual, predicted)
    return rmse_value

In [216]:
for i in range(2):
    str_column_to_float(readed_data, i)

In [217]:
readed_data[:10]

[[108.0, 392.5],
 [19.0, 46.2],
 [13.0, 15.7],
 [124.0, 422.2],
 [40.0, 119.4],
 [57.0, 170.9],
 [23.0, 56.9],
 [14.0, 77.5],
 [45.0, 214.0],
 [10.0, 65.3]]

Now the data is parsed into floats and ready to use.

#### The final piece of the puzzle.

In [220]:
seed(101)
# load and prepare data
filename = 'sweden_insurance_data.csv'
dataset = csv_read(filename)
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
# evaluate algorithm
split = 0.6
# Use a separate name so the rmse() function is not overwritten
rmse_value = final_algorithm(dataset, regressionModel, split)
print('RMSE: %.3f' % (rmse_value))

RMSE: 44.716


_____

Still in progress...

### de nada!