# Regressions

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression

## Linear vs. non-linear regressions:
 - Inspect visually (input vs. output)
 - Calculate correlation coefficients; if the absolute value is above 0.7 for all parameters, the relationship tends to be linear
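
As a minimal sketch of the correlation check above, the Pearson coefficient can be computed by hand. The data are the engine size and CO2 sample used later in these notes; the helper name `pearson_r` is an assumption made for illustration.

```python
import math

# Engine size vs. CO2 emissions sample used later in these notes
engine_size = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
co2 = [196, 221, 136, 255, 244, 230, 232, 255, 267]

def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long sequences (hypothetical helper)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    spread_a = math.sqrt(sum((ai - mean_a) ** 2 for ai in a))
    spread_b = math.sqrt(sum((bi - mean_b) ** 2 for bi in b))
    return covariance / (spread_a * spread_b)

r = pearson_r(engine_size, co2)
print(f'Correlation coefficient: {r:.3f}')  # roughly 0.92, above the 0.7 rule of thumb
```

With a pandas DataFrame, `DataFrame.corr()` returns the same pairwise coefficients.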


## Linear Regressions
### Simple Linear Regression

In [7]:
example_data = pd.DataFrame({'Enginesize': [2.0,2.4,1.5,3.5,3.5,3.5,3.5,3.7,3.7], 'CO2 Emissions': [196,221,136,255,244,230,232,255,267]})

Mean Squared Error:
Mean of the squared distances between the predicted value $\widehat{y}$ and the true value $y$.

$\widehat{y}=\theta_0+\theta_1x_1$

$MSE=\frac{1}{n} \sum_{i=1}^n \left( y_i - \widehat{y_i} \right)^2$

<br>How to find the best values for $\theta_0$ and $\theta_1$?
Two options are available:  
1) Mathematical approach (see simple linear regression) -> Ordinary Least Squares  
2) Optimization approach (see multiple linear regression) -> Gradient Descent  

The optimization approach is best suited to multiple linear regression on large datasets, as it requires less computation than the matrix operations of the mathematical approach.
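
The optimization approach can be sketched for the simple case with two parameters. This is only an illustration: the function name `gradient_descent`, the learning rate, and the iteration count are assumptions chosen so that the loop converges on this small sample, not tuned recommendations.

```python
# Engine size vs. CO2 emissions sample from these notes
engine_size = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
co2 = [196, 221, 136, 255, 244, 230, 232, 255, 267]

def gradient_descent(x, y, learning_rate=0.05, n_iterations=20000):
    """Minimize the MSE of y_hat = theta_0 + theta_1 * x by batch gradient descent."""
    n = len(x)
    theta_0, theta_1 = 0.0, 0.0  # arbitrary starting point
    for _ in range(n_iterations):
        residuals = [yi - (theta_0 + theta_1 * xi) for xi, yi in zip(x, y)]
        # Partial derivatives of the MSE with respect to theta_0 and theta_1
        grad_0 = -2.0 / n * sum(residuals)
        grad_1 = -2.0 / n * sum(ri * xi for ri, xi in zip(residuals, x))
        theta_0 -= learning_rate * grad_0
        theta_1 -= learning_rate * grad_1
    return theta_0, theta_1

gd_theta_0, gd_theta_1 = gradient_descent(engine_size, co2)
print(f'Intercept (theta_0): {gd_theta_0:.4f}')
print(f'Coefficient (theta_1): {gd_theta_1:.4f}')
```

With enough iterations the result should approach the Ordinary Least Squares solution. A learning rate that is too large makes the updates diverge, so the value has to be adapted to the scale of the data.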


#### Mathematical approach to simple linear regression

A simple linear regression has only two parameters, and both can be calculated directly:

$\theta_1=\frac{\sum_{i=1}^n(x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^n(x_i-\overline{x})^2} $

$ \theta_0=\overline{y}-\theta_1\overline{x}$

In [20]:
x = example_data['Enginesize']
y = example_data['CO2 Emissions']

# Calculate means of x and y
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

In [42]:
# Calculate fraction for theta_1
numerator = sum([(x[i]-mean_x)*(y[i]-mean_y) for i in range(len(x))])
denominator = sum([(x[i]-mean_x)**2 for i in range(len(x))])

theta_1 = numerator / denominator
print(f'Coefficient (theta_1): {theta_1}')

Coefficient (theta_1): 43.98446833930704


In [43]:
# Input theta_1 to calculate theta_0
theta_0 = mean_y - (theta_1*mean_x)
print(f'Intercept (theta_0): {theta_0}')

Intercept (theta_0): 92.80266825965754


$$ \widehat{y} = 92.80 + 43.98x_1$$
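
Once both parameters are known, the fitted line can be used for predictions. The sketch below reuses the values printed above; the function name `predict_co2` is an assumption made for illustration.

```python
# Fitted parameters from the calculation above
theta_0 = 92.80266825965754
theta_1 = 43.98446833930704

def predict_co2(engine_size):
    """Predicted CO2 emissions for a given engine size, using the fitted line."""
    return theta_0 + theta_1 * engine_size

prediction = predict_co2(2.4)
print(f'Predicted CO2 emissions for engine size 2.4: {prediction:.1f}')  # about 198.4
```

The observed value for this engine size in the sample is 221, so the residual is about 22.6.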

## Evaluation Metrics in Regression Models

**Mean Absolute Error:**  
<br>$ MAE=\frac{1}{n}\sum_{j=1}^n|y_j-\widehat{y_j}| $


<br>**Mean Squared Error:**  
MSE is more popular than MAE because squaring the errors penalizes large errors more heavily.  
<br>$ MSE = \frac{1}{n}\sum_{j=1}^n(y_j-\widehat{y_j})^2 $



<br>**Root Mean Squared Error:**  
RMSE is one of the most popular evaluation metrics, as it is expressed in the same units as the response vector $y$.  
<br>$ RMSE = \sqrt{\frac{1}{n}\sum_{j=1}^n(y_j-\widehat{y_j})^2} $



<br>**Relative Absolute Error:**  
<br>$ RAE=\frac{\sum_{j=1}^n|y_j-\widehat{y_j}|}{\sum_{j=1}^n|y_j-\overline{y}|} $


<br>**Relative Squared Error:**  
RSE is widely adopted by the data science community because it is used to calculate $R^2$.  
<br>$ RSE=\frac{\sum_{j=1}^n(y_j-\widehat{y_j})^2}{\sum_{j=1}^n(y_j-\overline{y})^2} $


<br>**$R^2:$**  
<br>$ R^2=1-RSE $
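
The metrics above can be computed directly from their definitions. The following sketch applies them to the simple linear regression fitted earlier; the variable names are assumptions, and all values are measured on the training sample itself.

```python
import math

# Sample data and fitted parameters from the simple linear regression
engine_size = [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]
co2 = [196, 221, 136, 255, 244, 230, 232, 255, 267]
theta_0, theta_1 = 92.80266825965754, 43.98446833930704

predicted = [theta_0 + theta_1 * xi for xi in engine_size]
n = len(co2)
mean_co2 = sum(co2) / n

abs_errors = [abs(yi - pi) for yi, pi in zip(co2, predicted)]
sq_errors = [(yi - pi) ** 2 for yi, pi in zip(co2, predicted)]

mae = sum(abs_errors) / n
mse = sum(sq_errors) / n
rmse = math.sqrt(mse)
# Relative metrics compare the model against always predicting the mean of y
rae = sum(abs_errors) / sum(abs(yi - mean_co2) for yi in co2)
rse = sum(sq_errors) / sum((yi - mean_co2) ** 2 for yi in co2)
r2 = 1 - rse

for name, value in [('MAE', mae), ('MSE', mse), ('RMSE', rmse),
                    ('RAE', rae), ('RSE', rse), ('R^2', r2)]:
    print(f'{name}: {value:.4f}')
```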



### Validation of results with `sklearn.linear_model.LinearRegression`

In [33]:
# Validation
lr = LinearRegression()
lr.fit(pd.DataFrame(x),y)
print(f'Coefficient: {lr.coef_[0]} \nIntercept: {lr.intercept_}')

Coefficient: 43.984468339307035 
Intercept: 92.80266825965757


## Multiple Linear Regression

Multiple Linear Regression is normally used for two different purposes:
 - Understand the impact of the independent variables on the prediction
 - Understand and predict how the dependent variable changes when any independent variable changes

In [None]:
example_mlr = pd.DataFrame({
    'Enginesize': [2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7],
    'Cylinders': [4, 4, 4, 6, 6, 6, 6, 6, 6],
    'Fuelconsumption_Comb': [8.5, 9.6, 5.9, 11.1, 10.6, 10.0, 10.1, 11.1, 11.6],
    'CO2 Emissions': [196, 221, 136, 255, 244, 230, 232, 255, 267]
})


Target function:

$\widehat{y}=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n$

Function can be written as:
$ \widehat{y}=\theta^TX $

$ \theta^T=[\theta_0,\theta_1,\theta_2,...] $ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $ X=\left(\begin{array}{c}1\\x_1\\x_2\\...\\\end{array}\right) $
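
The vector form translates almost literally into NumPy. The sketch below is an illustration under assumptions: NumPy is not imported in the original notes, and `np.linalg.lstsq` is used here as one possible way to obtain a least-squares estimate of $\theta$.

```python
import numpy as np

# Feature columns of example_mlr: engine size, cylinders, combined fuel consumption
features = np.array([
    [2.0, 4, 8.5],
    [2.4, 4, 9.6],
    [1.5, 4, 5.9],
    [3.5, 6, 11.1],
    [3.5, 6, 10.6],
    [3.5, 6, 10.0],
    [3.5, 6, 10.1],
    [3.7, 6, 11.1],
    [3.7, 6, 11.6],
])
co2 = np.array([196, 221, 136, 255, 244, 230, 232, 255, 267], dtype=float)

# Prepend a column of ones so that theta_0 acts as the intercept
X = np.column_stack([np.ones(len(features)), features])

# Least-squares estimate of theta (assumption: solved via np.linalg.lstsq)
theta, *_ = np.linalg.lstsq(X, co2, rcond=None)

# Prediction for a single observation: y_hat = theta^T x
first_row_prediction = theta @ X[0]

# Predictions for all observations at once
all_predictions = X @ theta

print(f'theta: {theta}')
print(f'Prediction for first row: {first_row_prediction:.2f}')
```

Here each row of `X` is one observation with a leading 1, so `X @ theta` evaluates $\theta^TX$ for every row in a single step.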

# Non-Linear Regressions

For a regression to be non-linear, $ \widehat{y} $ must be a non-linear function of the parameters $ \theta $, not necessarily of the features $x$. To cope with non-linear data, use polynomial regression, a non-linear regression model, or a data transformation.

Examples for non-linear functions:

$ \widehat{y}=\theta_0+\theta_2^2x $  

$ \widehat{y}=\theta_0+\theta_1\theta_2^x $  

$ \widehat{y}=\log{(\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3)} $  

$ \widehat{y}=\frac{\theta_0}{1+\theta_1^{(x-\theta_2)}} $  


## Polynomial Regression

Polynomial Regression comprises all regressions in which the relationship between the independent variable $x$ and the dependent variable $y$ is modeled as an $n^{th}$ degree polynomial in $x$. Polynomial Regression thus fits a curved line to the data, e.g.:

$ \widehat{y}=\theta_0+\theta_1x+\theta_2x^2+\theta_3x^3 $

In [None]:
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
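
A polynomial regression can be fitted by expanding $x$ into polynomial features and then running an ordinary linear regression on them. The sketch below assumes degree 2 purely for illustration and reuses the engine size sample; with only nine points, the result says little about which degree is appropriate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Engine size as a single-column feature matrix, CO2 emissions as target
engine_size = np.array([2.0, 2.4, 1.5, 3.5, 3.5, 3.5, 3.5, 3.7, 3.7]).reshape(-1, 1)
co2 = np.array([196, 221, 136, 255, 244, 230, 232, 255, 267])

# Expand x into the columns [x, x^2]; the intercept is left to LinearRegression
poly = PolynomialFeatures(degree=2, include_bias=False)
engine_size_poly = poly.fit_transform(engine_size)

# The model is still linear in theta, so LinearRegression can fit it
poly_model = LinearRegression()
poly_model.fit(engine_size_poly, co2)

print(f'Coefficients (theta_1, theta_2): {poly_model.coef_}')
print(f'Intercept (theta_0): {poly_model.intercept_}')
print(f'R^2 on the training data: {poly_model.score(engine_size_poly, co2):.4f}')
```

Because the straight line is a special case of the degree 2 polynomial, the training $R^2$ cannot be lower than that of the simple linear regression. That alone does not show the polynomial model generalizes better.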

### Non-Linear Regression Models

See the notebook for the Coursera course (ML0101EN-Reg-NoneLinearRegression-py-v1.ipynb).