<a href="https://colab.research.google.com/github/kevinajordan/DS-Training/blob/master/linear_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression

What is regression? What types of learning problems are good for regression?

Short rule: If the target variable is **continuous** (a numeric quantity that can take any value in a range, such as a house price), regression is a good fit for the task.

Linear regression was developed in the field of statistics and is studied as a model for understanding the relationship between input and output numerical variables. It is also one of the foundational models in machine learning.

Two basic forms of Linear Regression:
* Simple Linear Regression (SLR), which deals with just two variables: one input and one output.
* Multiple Linear Regression (MLR), which deals with more than two variables: several inputs and one output.

Dataset: Boston Housing

![alt text](https://image.ibb.co/jRH8E9/Capture.jpg)

$X$ is the set of input values; $h$ is the function that maps the $X$ values to predicted $y$ (called y hat ($\hat y$) or predictor). $h$ is often referred to as the hypothesis function.

The ultimate goal is, given a training set, to learn a function $h: X → Y$ so that $h(x)$ is a "good" predictor for the corresponding value of $y$.

A pair ($x^i, y^i$) is called a training example.

Here's an example dataset to use for the next explanations.

![alt text](https://image.ibb.co/hXTenU/Capture.jpg)

The example dataset contains two features: $x_1^i$ is the living area of the i-th house in the training set, and $x_2^i$ is its number of bedrooms.

To perform regression, you must decide the way you are going to represent h. As an initial choice, let’s say you decide to approximate y as a linear function of x:

$h_θ(x) = θ_0 + θ_1x_1 + θ_2 x_2$

Here, the $θ_i$’s are the parameters (also called weights) parameterizing the space of linear functions mapping from $X$ to $Y$. Each choice of parameter values picks out one particular linear function, so learning amounts to choosing the values that map $X$ to $Y$ most accurately.

The above formula can be rewritten as 

![alt text](https://image.ibb.co/i94zMp/Capture.jpg)
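The image is not reproduced in text here. Assuming it shows the usual compact form $h_θ(x) = \sum_{i=0}^{n} θ_i x_i = θ^T x$ with the convention $x_0 = 1$, a minimal NumPy sketch of the hypothesis could look like this. The parameter values and the two houses are made up for illustration, and the name `X_aug` is just a label chosen here.

```python
import numpy as np

# Hypothetical parameters: theta_0 (intercept), theta_1 (living area), theta_2 (bedrooms)
theta = np.array([50.0, 0.1, 20.0])

# Two made-up houses: [living area, number of bedrooms]
X = np.array([[2000.0, 3.0],
              [1500.0, 2.0]])

# Prepend x_0 = 1 to every row so the intercept becomes part of the dot product
X_aug = np.column_stack([np.ones(len(X)), X])

def h(theta, X_aug):
    """Linear hypothesis: returns one prediction per row of X_aug."""
    return X_aug @ theta

predictions = h(theta, X_aug)
print(predictions)
```

Writing the hypothesis as a single matrix-vector product is what makes the cost and gradient computations later in this notebook short to express.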

One reasonable approach is to make $h(x)$ close to $y$, at least for the training examples you have. To make this precise, let's define a function that measures, for each value of the $θ$’s, how close the $h(x^i)$’s are to the corresponding $y^i$’s.

*Note: The weight parameters are also called model coefficients.*




## Cost function: 

The main question is: how do you pick or learn the parameters $θ$? You cannot change your input instances in order to predict the prices; the $θ$ parameters are the only thing you can tune/adjust.

This brings up one of the most important functions in machine learning and data science. The cost function:

![alt text](https://image.ibb.co/kW8e49/Capture.jpg)


Learning/training a linear regression model essentially means using the data you have to estimate the values of the coefficients/parameters that minimize the cost function, $J(\theta)$.
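As a rough illustration, and assuming the cost function in the image is the common least-squares form $J(\theta) = \frac{1}{2}\sum_{i}(h_\theta(x^i) - y^i)^2$ (some texts also divide by the number of examples), the cost can be computed in a few lines. The data below is made up, and `X_aug` is assumed to carry a leading column of ones for $θ_0$.

```python
import numpy as np

def cost(theta, X_aug, y):
    """Least-squares cost: J(theta) = 1/2 * sum((h(x_i) - y_i)^2)."""
    residuals = X_aug @ theta - y
    return 0.5 * np.sum(residuals ** 2)

# Made-up data generated by y = 1 + 2x; the first column of ones is x_0
X_aug = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

perfect = cost(np.array([1.0, 2.0]), X_aug, y)  # parameters that fit exactly
zeros = cost(np.array([0.0, 0.0]), X_aug, y)    # all-zero parameters
print(perfect, zeros)
```

The closer the parameters are to the ones that generated the data, the smaller the cost, which is the quantity training tries to drive down.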

## Gradient Descent:

![alt text](https://image.ibb.co/jUXWj9/Capture.jpg)

Choose $\theta$ so as to minimize $J(\theta)$. To do so, let’s use a search algorithm that starts with some "initial guess" for $\theta$, and that iteratively changes $\theta$ to make $J(\theta)$ smaller, until hopefully, you converge to a value of $\theta$ that minimizes $J(\theta)$.

Here, $\alpha$ is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of $J$. The term $\alpha$ controls how large a step the algorithm takes in that direction. It can be pictorially expressed as the following:

![alt text](https://ml-cheatsheet.readthedocs.io/en/latest/_images/gradient_descent_demystified.png)

The learning rate $\alpha$ is very important because it determines the size of the step taken on each iteration: if it is too small, convergence is slow; if it is too large, the updates can overshoot the minimum and fail to converge.
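The update described above can be sketched as follows. This is a minimal illustration that assumes the least-squares cost $J(\theta) = \frac{1}{2}\sum_i (h_\theta(x^i) - y^i)^2$, whose gradient is $X^T(X\theta - y)$; the function name, the data, and the default values for `alpha` and `n_iters` are choices made for this example only.

```python
import numpy as np

def batch_gradient_descent(X_aug, y, alpha=0.1, n_iters=500):
    """Minimize J(theta) = 1/2 * sum((X_aug @ theta - y)^2) by gradient descent."""
    theta = np.zeros(X_aug.shape[1])              # initial guess
    for _ in range(n_iters):
        gradient = X_aug.T @ (X_aug @ theta - y)  # gradient over ALL training examples
        theta = theta - alpha * gradient          # step against the gradient
    return theta

# Made-up data generated by y = 1 + 2x; the first column of ones is x_0
X_aug = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta = batch_gradient_descent(X_aug, y)
print(theta)  # close to [1, 2]
```

On this tiny dataset a learning rate of 0.1 converges; raising it far enough makes the updates grow instead of shrink, which is a quick way to see why the step size matters.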

There are two common variants of gradient descent:

* Batch gradient descent looks at every example in the entire training set on every step.
* Stochastic gradient descent (also called incremental gradient descent) repeatedly runs through the training set and, each time it encounters a training example, updates the parameters according to the gradient of the error with respect to that single training example only.
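For contrast with the batch variant, here is a hedged sketch of the stochastic variant under the same least-squares assumption. It updates the parameters after every single example; shuffling the visiting order each epoch and the fixed seed are choices made for this illustration, not something the text prescribes.

```python
import numpy as np

def stochastic_gradient_descent(X_aug, y, alpha=0.1, n_epochs=500, seed=0):
    """Least-squares SGD: one parameter update per training example."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X_aug.shape[1])                # initial guess
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):           # visit examples in random order
            error = X_aug[i] @ theta - y[i]         # error on this single example
            theta = theta - alpha * error * X_aug[i]
    return theta

# Made-up, noise-free data generated by y = 1 + 2x; the first column of ones is x_0
X_aug = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

theta_sgd = stochastic_gradient_descent(X_aug, y)
print(theta_sgd)  # close to [1, 2]
```

Because this data is noise-free, the stochastic updates settle on the exact parameters; with noisy data and a constant learning rate they would typically keep jittering around the minimum.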



## Regularization

Generally, regularization methods work by penalizing large coefficient values, which discourages the model from fitting noise in the training data. This tends to improve the error on unseen data and also reduces the model complexity. It is particularly useful when you are dealing with a dataset that has a large number of features and your baseline model is not able to distinguish between the importance of the features (not all features in a dataset are equally important, right?).


Two variants of regularization for linear regression:

**Lasso Regression** (aka L1 regularization) - adds a penalty term proportional to the sum of the absolute values of the coefficients.

![alt text](https://image.ibb.co/bzZ2cf/Capture.jpg)

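explained:
* λ is the regularization strength: a constant that controls how heavily large coefficients are penalized (it is not the learning rate). With λ = 0 the penalty vanishes and you recover ordinary linear regression.
* the dataset has (M+1) features, so the index $j$ runs from 0 to M. $w_j$ is the weight/coefficient of feature $j$.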


**Ridge Regression** (aka L2 regularization) - adds a penalty term proportional to the sum of the squared coefficients.

![alt text](https://image.ibb.co/hY4oiL/Capture.jpg)
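To see the practical difference between the two penalties, here is a small sketch using scikit-learn's `Lasso` and `Ridge` estimators on synthetic data (not the Boston data). The data-generating coefficients and the strength of 0.5 are arbitrary values picked for illustration. Note that scikit-learn names the regularization strength `alpha`, and its objectives scale the squared-error term slightly differently for the two estimators, so the same numeric value is not directly comparable between them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data: 5 features, but only the first two influence y
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# `alpha` plays the role of lambda in the penalty terms
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso:", lasso.coef_)  # L1 penalty can set coefficients exactly to zero
print("Ridge:", ridge.coef_)  # L2 penalty shrinks coefficients without zeroing them
```

This is why Lasso is often described as performing feature selection, while Ridge keeps every feature but with smaller weights.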


HIGHLY RECOMMENDED FURTHER READING:
https://www.datacamp.com/community/tutorials/towards-preventing-overfitting-regularization

In [0]:
# import dependencies and load dataset
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn import datasets

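# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run as written.
data = datasets.load_boston()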

In [2]:
# let's look at a description of our data
print(data.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [0]:
# convert our data to a pandas dataframe

# set the features
df = pd.DataFrame(data.data, columns=data.feature_names)

# set the target
target = pd.DataFrame(data.target, columns=["MEDV"])

## Easy Way: Sklearn Implementation of Linear Regression

A simple example is shown below, using just one feature.

In [0]:
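# Which scikit-learn module implements linear regression?
from sklearn import _____

# scikit-learn expects a 2-D feature matrix, so select RM as a one-column DataFrame
x = df[["RM"]]
y = target["MEDV"]

# Fill in the estimator class, then fit it
lm = linear_model._____()
model = lm.fit(x, y)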


In [0]:
# Which method makes predictions from your fitted model?
predictions = lm.____(x)
print(predictions[0:5])

In [0]:
# Which method returns the R-squared score of the fitted model? (For regression this is R-squared, not accuracy.)
lm.___(x, y)

In [0]:
# How can you look at your model's coefficients?
lm.coef_

## More Verbose with Statsmodels

Simple linear regression model

In [0]:
X = df["RM"]
y = target["MEDV"]

# Fit the model. No constant is added, so the regression line is forced through the origin.
model = sm.OLS(y, X).fit()
# Make predictions with the fitted model
predictions = model.predict(X)

# Print out the statistics
model.summary()

### Breaking down the output:

* The first observation you should make here is that the model is trained with the OLS (Ordinary Least Squares) method.

"The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals. This means that given a regression line through the data we calculate the distance from each data point to the regression line, square it, and sum all of the squared errors together. This is the quantity that ordinary least squares seeks to minimize." - Jason Brownlee

* There's a value for R-squared. R-squared is the “percent of variance explained” by the model: the fraction by which the variance of the errors is less than the variance of the dependent variable. For a model that includes a constant, it ranges from 0 to 1 on the training data and is commonly stated as a percentage from 0% to 100%. Because this model has no constant term, statsmodels reports the uncentered R-squared, which compares the residuals against the raw sum of squares of $y$ instead of the variation around its mean, so it tends to look much higher and is not comparable with the R-squared of a model that has an intercept. R-squared on its own doesn't tell you whether your chosen model is good or bad, nor whether the data and predictions are biased: you can get a low R-squared for a good model or a high R-squared for a poorly fitted one.

* The coefficient (coef) of 3.634 means that if the RM variable increases by 1, the predicted value of MEDV increases by 3.634. 


* There is a 95% confidence interval for the coefficient of RM, which means that, at a 95% confidence level, the true coefficient of RM is estimated to lie between 3.548 and 3.759.

Those are the most important points to keep in mind for now.
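To make the OLS and R-squared points concrete, here is a minimal NumPy sketch on made-up data (not the Boston data). It fits a line through the origin by least squares, as the model above does, and computes R-squared in both its centered and uncentered forms; the variable names are illustrative.

```python
import numpy as np

# Made-up data: one feature, no intercept column (line forced through the origin)
x = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([12.0, 18.0, 22.0, 27.0, 33.0])
X = x.reshape(-1, 1)

# OLS picks the coefficient that minimizes the sum of squared residuals
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef
ssr = np.sum(residuals ** 2)

# Centered R-squared compares against the variation of y around its mean;
# uncentered R-squared compares against the raw sum of squares of y.
r2_centered = 1 - ssr / np.sum((y - y.mean()) ** 2)
r2_uncentered = 1 - ssr / np.sum(y ** 2)

print(coef[0], r2_centered, r2_uncentered)
```

When the mean of $y$ is far from zero, the uncentered value is noticeably larger than the centered one for the same fit, which is worth keeping in mind when reading the summary of a model without a constant.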

In [0]:
# Let's add a constant (intercept) term to our model
x = sm.add_constant(X)

model = sm.OLS(y, x).fit()
predictions = model.predict(x)

model.summary()

**Breakdown of output:**
* Adding the constant term changes the fitted coefficient. Without the constant term, the model was forced to pass through the origin; now it has a y-intercept of -34.67.
* The slope of the RM predictor has also changed, from 3.634 to 9.1021 (coef of RM).
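The effect of the constant term can be reproduced on made-up data with plain NumPy. In this sketch the data is generated from a line with a negative intercept, loosely echoing the numbers above but not taken from the Boston data, and the column of ones is prepended by hand, which is what `sm.add_constant` does by default.

```python
import numpy as np

# Made-up, noise-free data generated by y = -30 + 9x, with x far from zero
x = np.linspace(4.0, 8.0, 50)
y = -30.0 + 9.0 * x

# Without a constant column the fitted line must pass through the origin
slope_origin = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]

# Prepending a column of ones lets the model learn an intercept
X_const = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X_const, y, rcond=None)[0]

print(slope_origin)      # slope when forced through the origin
print(intercept, slope)  # intercept and slope with a constant term
```

Forcing the line through the origin flattens the slope here because the model has no other way to account for the negative intercept.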

### Multi-Linear Regression Model

In [0]:
# Let's add a second feature to our model (note: no constant term is added here)
x = df[["RM", "LSTAT"]]
y = target["MEDV"]

model = sm.OLS(y, x).fit()
predictions = model.predict(x)

model.summary()

### Breakdown of Output:
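* This model reports a much higher R-squared value of 0.948. Note, however, that the model is fitted without a constant term, so statsmodels reports the uncentered R-squared; it should not be read as the model capturing 94.8% of the variance around the mean of the dependent variable, and it is not directly comparable with the R-squared of a model that includes an intercept. Now, let's try to figure out the relationship between the two variables RM and LSTAT and median house value.
* As RM increases by 1 (with LSTAT held fixed), the predicted MEDV increases by 4.9069, and when LSTAT increases by 1 (with RM held fixed), the predicted MEDV decreases by 0.6557. Whether RM and LSTAT are statistically significant is read from the `P>|t|` column and the confidence intervals in the summary, not from the size of the coefficients alone.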

Interpretation to tell your boss:
* Houses with fewer rooms tend to have lower prices.
* In areas where a larger share of the population is lower status (higher LSTAT), house prices tend to be lower.