# Lab 3 Huy Quang Pham Submission
In this lab, you will complete the full process of describing and then implementing an algorithm. You will do so on a topic relevant to data science: _(simple) linear regression_. Please note that this lab is assessed, i.e., you will need to submit the results of your work on QM+. To submit your work, first download/export your Jupyter notebook as PDF. Then upload the PDF file in the submission area on QM+.

## Task 1: Simple Linear Regression - English Description
a. General description of calculation for the values of the estimates $\alpha$ and $\beta$:
> We can find the value of $\beta$ that corresponds to the minimal total of squared residuals by taking the derivative of $$\sum_{i = 1}^n (y_i - \alpha x_i - \beta)^2$$ with respect to $\beta$ and setting the result equal to 0, which gives us: $$-2\sum_{i = 1}^n (y_i - \alpha x_i - \beta) = 0    \Rightarrow    \sum_{i = 1}^n\beta = \sum_{i = 1}^n (y_i -\alpha x_i) \\\Rightarrow \beta = \frac{\sum_{i = 1}^n y_i}{n} - \alpha \frac{\sum_{i = 1}^n  x_i}{n}    \Rightarrow    \beta = \overline{y} -\alpha\overline{x} $$ ($\overline{y}$ and $\overline{x}$ are the mean of y and the mean of x respectively)

> We can do the same with respect to $\alpha$, substituting $\beta = \overline{y} - \alpha\overline{x}$, which gives us:   $$-2\sum_{i = 1}^n (y_i - \alpha x_i - \beta)x_i = 0    \Rightarrow   \sum_{i = 1}^n (y_i - \alpha x_i - \overline{y} + \alpha\overline{x})x_i = 0 \\\Rightarrow \sum_{i = 1}^n (y_i x_i - \alpha x_i^2 - \overline{y} x_i +\alpha\overline{x} x_i) = 0   \Rightarrow   \alpha \sum_{i = 1}^n (x_i - \overline{x}) x_i = \sum_{i = 1}^n (y_i - \overline{y}) x_i \\\Rightarrow \alpha = \frac{\sum_{i = 1}^n (y_i - \overline{y}) x_i}{\sum_{i = 1}^n (x_i - \overline{x}) x_i} $$
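
As a quick numerical sanity check of the derivation, the sketch below evaluates both partial derivatives of the sum of squared residuals at the closed-form estimates and confirms that they vanish up to floating-point error. The data values are illustrative assumptions and are not taken from the lab.

```python
# Illustrative data (an assumption, not part of the lab).
xs = [0.0, 1.0, 2.0, 5.0]
ys = [1.0, 3.0, 2.0, 8.0]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form estimates from the derivation above.
alpha = (sum((y - mean_y) * x for x, y in zip(xs, ys))
         / sum((x - mean_x) * x for x in xs))
beta = mean_y - alpha * mean_x

# Partial derivatives of sum((y_i - alpha*x_i - beta)^2) at (alpha, beta).
grad_alpha = -2 * sum((y - alpha * x - beta) * x for x, y in zip(xs, ys))
grad_beta = -2 * sum(y - alpha * x - beta for x, y in zip(xs, ys))

# Both derivatives should be zero up to rounding error.
print(abs(grad_alpha) < 1e-9, abs(grad_beta) < 1e-9)  # True True
```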

b. Algorithm description:
> Assume that the input dataset will be stored in a list with the following format: [[x_1, y_1], [x_2, y_2], ..., [x_n, y_n]]. Steps to calculate the $\alpha$ and $\beta$ estimates are as follows: 

> 1. Store all the x values and y values into separate lists with the same order. 

> 2. Calculate the mean of the x values and of the y values by dividing the sum of each list by its length.

> 3. Calculate the value of $\alpha$ with the formula stated above: loop through every index of the two lists, add up the terms $(y_i - \overline{y}) x_i$ for the numerator and $(x_i - \overline{x}) x_i$ for the denominator, then divide the numerator by the denominator.

> 4. Calculate the value of $\beta$ with the formula stated above, using the mean values of y and x and the value of $\alpha$.

## Task 2: Simple Example
Given a dataset [[x_1, y_1], [x_2, y_2], [x_3, y_3], [x_4, y_4]] as follows: [[1,6], [2,10], [3,5], [4,15]]
> 1. Split the dataset into 2 separate lists [1, 2, 3, 4] and [6, 10, 5, 15]

> 2. Calculate the mean value for all x and y values: $\overline{x}$ = 2.5, $\overline{y}$ = 9

> 3. Calculate the value of $\alpha$: $$ \alpha = \frac{\sum_{i = 1}^n (y_i - \overline{y}) x_i}{\sum_{i = 1}^n (x_i - \overline{x}) x_i} \\\Rightarrow \alpha = \frac{(6-9)1+(10-9)2+(5-9)3+(15-9)4}{(1-2.5)1 +(2-2.5)2+(3-2.5)3+(4-2.5)4} \\\Rightarrow \alpha = \frac{11}{5} = 2.2$$

> 4. Calculate the value of $\beta$: $$\beta = \overline{y} -\alpha\overline{x} \\\Rightarrow \beta = 9 - 2.2*2.5 \\\Rightarrow \beta = 3.5$$
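
The hand calculation above can be double-checked with exact rational arithmetic. The following is a minimal sketch using Python's standard `fractions` module, so that no floating-point rounding is involved; the variable names are chosen for illustration only.

```python
from fractions import Fraction

data = [(1, 6), (2, 10), (3, 5), (4, 15)]  # dataset from Task 2
n = len(data)

mean_x = Fraction(sum(x for x, _ in data), n)  # 5/2
mean_y = Fraction(sum(y for _, y in data), n)  # 9

numerator = sum((y - mean_y) * x for x, y in data)    # 11
denominator = sum((x - mean_x) * x for x, _ in data)  # 5

alpha = numerator / denominator  # 11/5, i.e. 2.2
beta = mean_y - alpha * mean_x   # 7/2, i.e. 3.5
print(alpha, beta)  # 11/5 7/2
```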


## Task 3: Algorithm Implementation
Set up the input variable and define the calculation function:

In [1]:
dataset = []
# Dataset will be the variable used to input dataset of x and y in the form of [[x_1, y_1], [x_2, y_2], [x_3, y_3]]
def lin_reg(ds):
    # Step 1: get the first element of every sub-list to create a list of x values, 
    # and the second element to create a list of y values
    ds_x = [sds[0] for sds in ds]
    ds_y = [sds[1] for sds in ds]
    
    # Step 2: get mean values of x and y
    mean_x = sum(ds_x)/len(ds_x)
    mean_y = sum(ds_y)/len(ds_y)
    #print for debugging
    print("value of mean x values: "+str(mean_x))
    print("value of mean y values: "+str(mean_y))
    
    # Step 3: Calculate the value of the numerator and denominator of the alpha formula 
    # by using for loops iterate each index of 2 lists:
    alpha_numerator = 0
    alpha_denominator = 0
    for i in range(len(ds)):
        # (y_i - mean_y)*x_i
        alpha_numerator = alpha_numerator + (ds_y[i]-mean_y)*ds_x[i]
        # (x_i - mean_x)*x_i
        alpha_denominator = alpha_denominator + (ds_x[i]-mean_x)*ds_x[i]
    print("value of alpha = "+str(alpha_numerator)+"/"+str(alpha_denominator))
    
    #divide numerator over denominator
    alpha = alpha_numerator / alpha_denominator
    
    #calculate beta based on mean_y, mean_x and alpha
    print("value of beta = "+str(mean_y) + " - " +str(alpha)+"*"+str(mean_x))
    beta = mean_y - alpha*mean_x
    
    #collect alpha and beta in a dict and print it (the function does not return a value)
    estimators = {"alpha": alpha, "beta": beta}
    print(estimators)
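
Since `lin_reg` only prints its result, a variant that returns the estimates can be easier to reuse in later comparisons. The sketch below is one possible way to do this; the name `lin_reg_values` and the input check are assumptions added for illustration and are not part of the submitted function. The check reflects the fact that the denominator of the $\alpha$ formula is zero when all x values are identical.

```python
def lin_reg_values(ds):
    """Hypothetical variant of lin_reg: returns (alpha, beta), prints nothing."""
    xs = [point[0] for point in ds]
    ys = [point[1] for point in ds]

    # With fewer than two distinct x values the slope is undefined
    # (the denominator of the alpha formula would be zero).
    if len(set(xs)) < 2:
        raise ValueError("need at least two distinct x values")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    numerator = sum((y - mean_y) * x for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) * x for x in xs)

    alpha = numerator / denominator
    beta = mean_y - alpha * mean_x
    return alpha, beta


alpha, beta = lin_reg_values([[1, 6], [2, 10], [3, 5], [4, 15]])
print(alpha, beta)  # 2.2 3.5
```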

## Task 4: Test algorithm
Calculate alpha and beta for the dataset from Task 2:

In [2]:
dataset = [[1,6], [2,10], [3,5], [4,15]]
lin_reg(dataset)

value of mean x values: 2.5
value of mean y values: 9.0
value of alpha = 11.0/5.0
value of beta = 9.0 - 2.2*2.5
{'alpha': 2.2, 'beta': 3.5}


This gives the same results as the hand calculation in Task 2: $\alpha$ = 2.2 and $\beta$ = 3.5.

Comparing with the output of the scikit-learn package:

In [4]:
import numpy as np
from sklearn.linear_model import LinearRegression

#create a test model with sklearn and fit the same data
x = np.array([1,2,3,4]).reshape((-1, 1))
y = np.array([6,10,5,15])
test_model = LinearRegression().fit(x, y)

#get estimator values
print('alpha sklearn:', test_model.coef_)
print('beta sklearn:', test_model.intercept_)



alpha sklearn: [2.2]
beta sklearn: 3.500000000000001


Comparing with sklearn on a different dataset, which gives the same results up to floating-point rounding:

In [6]:
dataset2 = [[2,108], [13, 199], [4,50], [5,123], [19,230], [69, 420]]
#test using in-house function
lin_reg(dataset2)

#test using sklearn
dataset2_x = [sds[0] for sds in dataset2]
dataset2_y = [sds[1] for sds in dataset2]
x = np.array(dataset2_x).reshape((-1, 1))
y = np.array(dataset2_y)
test_model2 = LinearRegression().fit(x, y)
#get estimator values
print('alpha sklearn:', test_model2.coef_)
print('beta sklearn:', test_model2.intercept_)

value of mean x values: 18.666666666666668
value of mean y values: 188.33333333333334
value of alpha = 15874.666666666666/3245.333333333333
value of beta = 188.33333333333334 - 4.891536565324569*18.666666666666668
{'alpha': 4.891536565324569, 'beta': 97.02465078060806}
alpha sklearn: [4.89153657]
beta sklearn: 97.02465078060804
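
The two sets of estimates differ only in their last digits, which is expected when the same quantity is computed with floating-point operations in a different order; the sklearn `alpha` is also shown rounded because it is printed as a NumPy array. A minimal sketch of a tolerance-based comparison, using the values copied from the output above and tolerances chosen here as an assumption, could look like this:

```python
import math

# Values copied from the printed output above.
alpha_ours, beta_ours = 4.891536565324569, 97.02465078060806
alpha_sk, beta_sk = 4.89153657, 97.02465078060804  # alpha as displayed (rounded)

# Compare with relative tolerances instead of exact equality.
alpha_match = math.isclose(alpha_ours, alpha_sk, rel_tol=1e-8)
beta_match = math.isclose(beta_ours, beta_sk, rel_tol=1e-12)
print(alpha_match, beta_match)  # True True
```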
