# Learnings for linear regression

In [26]:
import pandas as pd
from sklearn.linear_model import LinearRegression

## Machine Learning with Python - Coursera

In [7]:
example_data = pd.DataFrame({'Enginesize': [2.0,2.4,1.5,3.5,3.5,3.5,3.5,3.7,3.7], 'CO2 Emissions': [196,221,136,255,244,230,232,255,267]})

Mean Squared Error:
Mean of the squared distances between the predicted value $\widehat{y}$ and the true value $y$.

$$\widehat{y}=\theta_0+\theta_1x_1$$

$$MSE=\frac{1}{n} \sum_{i=1}^n \left( y_i - \widehat{y}_i \right)^2$$
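As a small illustration of the formula, the sketch below computes the MSE for one candidate line on the example data. The helper name `mse` and the guessed parameters (`guess_theta_0`, `guess_theta_1`) are assumptions chosen for illustration; they are not fitted values.

```python
# Data copied from example_data above
engine_size = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
co2 = [196, 221, 136, 255, 244, 230, 232, 255, 267]


def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n


# Arbitrary guess for the parameters (assumption, not a fitted result)
guess_theta_0, guess_theta_1 = 100.0, 40.0

# Predictions of the candidate line y_hat = theta_0 + theta_1 * x
predictions = [guess_theta_0 + guess_theta_1 * xi for xi in engine_size]

mse_guess = mse(co2, predictions)
print(f'MSE of the guessed line: {mse_guess}')
```

A lower MSE means the candidate line lies closer to the observed points, so the best parameters are the ones that minimize this value.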

How do we find the best values for $\theta_0$ and $\theta_1$, i.e. the values that minimize the MSE?
Two options are available:  
1) Mathematical approach  
2) Optimization approach
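The optimization approach is not worked out in these notes. One common choice is gradient descent, sketched below under that assumption. The function name `gradient_descent`, the learning rate, and the number of iterations are illustrative choices, not values from the course.

```python
# Data copied from example_data above
engine_size = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
co2 = [196, 221, 136, 255, 244, 230, 232, 255, 267]


def gradient_descent(x, y, learning_rate=0.05, n_iter=20000):
    """Minimize the MSE of y_hat = theta_0 + theta_1 * x by gradient descent."""
    n = len(x)
    theta_0, theta_1 = 0.0, 0.0  # start from zero
    for _ in range(n_iter):
        # Residuals y_i - y_hat_i for the current parameters
        residuals = [yi - (theta_0 + theta_1 * xi) for xi, yi in zip(x, y)]
        # Partial derivatives of the MSE with respect to theta_0 and theta_1
        grad_0 = -2 / n * sum(residuals)
        grad_1 = -2 / n * sum(r * xi for r, xi in zip(residuals, x))
        # Step against the gradient
        theta_0 -= learning_rate * grad_0
        theta_1 -= learning_rate * grad_1
    return theta_0, theta_1


gd_theta_0, gd_theta_1 = gradient_descent(engine_size, co2)
print(f'theta_0 = {gd_theta_0}')
print(f'theta_1 = {gd_theta_1}')
```

With these settings the result should approach the same parameters as the formula-based calculation in this notebook (about 92.80 and 43.98). A learning rate that is too large makes the updates diverge, so it has to be tuned to the scale of the data.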


### Mathematical approach

The formulas below apply to simple linear regression, i.e. a model with only two parameters ($\theta_0$ and $\theta_1$).  
Both parameters can be calculated directly:

$$\theta_1=\frac{\sum_{i=1}^n(x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^n(x_i-\overline{x})^2} $$

$$ \theta_0=\overline{y}-\theta_1\overline{x}$$

In [20]:
x = example_data['Enginesize']
y = example_data['CO2 Emissions']

# Calculate means of x and y
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

In [40]:
# Calculate numerator and denominator of the fraction for theta_1
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
denominator = sum((xi - mean_x) ** 2 for xi in x)

theta_1 = numerator / denominator
print(f'theta_1 = {theta_1}')

theta_1 = 43.98446833930704


In [41]:
# Plug theta_1 into the formula for theta_0
theta_0 = mean_y - (theta_1*mean_x)
print(f'theta_0 = {theta_0}')

theta_0 = 92.80266825965754


$$ \widehat{y} = 92.80 + 43.98x_1$$
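The fitted line can be used to estimate emissions for a given engine size. The sketch below hard-codes the parameters printed above; the helper name `predict_co2` is hypothetical and only serves as an illustration.

```python
# Parameters copied from the outputs above
theta_0 = 92.80266825965754
theta_1 = 43.98446833930704


def predict_co2(engine_size):
    """Evaluate the fitted line y_hat = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * engine_size


# Engine size 2.4 appears in the example data with an observed value of 221
predicted = predict_co2(2.4)
print(f'Predicted CO2 emissions for engine size 2.4: {predicted:.2f}')
# prints: Predicted CO2 emissions for engine size 2.4: 198.37
```

The gap between the prediction (about 198.37) and the observed value (221) is the residual for that data point, which is the quantity the MSE squares and averages.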

### Validation of results with `sklearn.linear_model.LinearRegression`

In [33]:
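# Validation: fit sklearn's LinearRegression on the same data
# (fit expects a 2D feature matrix, hence the DataFrame around x)
lr = LinearRegression()
lr.fit(pd.DataFrame(x), y)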
print(f'Coefficient: {lr.coef_[0]} \nIntercept: {lr.intercept_}')


Coefficient: 43.984468339307035 
Intercept: 92.80266825965757
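The manual and `sklearn` results differ only in the last printed digits, which is consistent with floating-point rounding. A minimal check of that, using the values printed above and an assumed relative tolerance of `1e-12`:

```python
import math

# Values copied from the outputs above
manual = {'theta_1': 43.98446833930704, 'theta_0': 92.80266825965754}
from_sklearn = {'theta_1': 43.984468339307035, 'theta_0': 92.80266825965757}

# Compare each parameter with a relative tolerance (1e-12 is an illustrative choice)
all_close = all(
    math.isclose(manual[name], from_sklearn[name], rel_tol=1e-12)
    for name in manual
)
print(f'Results agree within tolerance: {all_close}')
# prints: Results agree within tolerance: True
```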


## Multiple Linear Regression

In [None]:
example_mlr = pd.DataFrame({'Enginesize': [2.0,2.4,1.5,3.5,3.5,3.5,3.5,3.7,3.7], 'CO2 Emissions': [196,221,136,255,244,230,232,255,267]})
