# Regression


Regression is a technique for modeling and analyzing the relationships between variables, and often how those variables jointly contribute to a particular outcome.

It is used for forecasting, time series modelling, and finding cause-and-effect relationships between variables.

__1. Linear Regression__

In linear regression the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the regression line is linear.

It establishes a relationship between the dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line).

It is represented by the equation `Y = a + b*X + e`, where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable from the given predictor variable(s).

__How to obtain best fit line (Value of a and b)?__

By the Least Squares Method, the most common method for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line.

We can evaluate model performance using the R-square metric, and Adjusted R-square in the case of multiple linear regression.
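
As a rough sketch of how the two metrics relate, adjusted R-square is commonly computed from R-square, the number of observations n, and the number of predictors p as 1 - (1 - R²)(n - 1)/(n - p - 1). The function name and the sample numbers below are illustrative assumptions, not part of the original notes.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-square: penalizes R-square for the number of predictors used."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical values: the same R-square is penalized more with more predictors.
few_predictors = adjusted_r_squared(0.80, n_obs=50, n_predictors=2)
many_predictors = adjusted_r_squared(0.80, n_obs=50, n_predictors=10)
print(few_predictors, many_predictors)
```

Unlike plain R-square, this value can drop when a predictor that adds little explanatory power is included.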

__Important Points:__
- There must be a linear relationship between the independent and dependent variables.
- Multiple regression can suffer from __multicollinearity, autocorrelation, heteroskedasticity__.
- Linear Regression is very sensitive to outliers. They can strongly distort the regression line and, eventually, the forecasted values.
- Multicollinearity can increase the variance of the coefficient estimates and make the estimates very sensitive to minor changes in the model. The result is that the coefficient estimates are unstable.
- In the case of multiple independent variables, we can use forward selection, backward elimination, or a stepwise approach to select the most significant independent variables.

### When, why, and how you should use linear regression

__What__:  

1. A form of predictive modelling technique.
2. A technique used to model and analyze the relationships between variables.

__Where__: 

1. Forecasting
2. Time series modelling
3. Finding cause-and-effect relationships between variables

__When__:   

1. The relationship between the variables is linear.
2. The data is homoskedastic, meaning the variance in the residuals (the difference in the real and predicted values) is more or less constant.
3. The residuals are independent, meaning the residuals are distributed randomly and not influenced by the residuals in previous observations. If the residuals are not independent of each other, they’re considered to be autocorrelated.
4. The residuals are normally distributed. This assumption means the probability density function of the residual values is normally distributed at each x value. I leave this assumption for last because I don’t consider it to be a hard requirement for the use of linear regression, although if this isn’t true, some manipulations must be made to the model.

__Why__: when you want to predict a continuous dependent variable from a number of independent variables.  
__How__: we fit a curve or line to the data points such that the sum of the squared distances of the data points from the curve or line is minimized.

### Important Points

1. There should be a __linear and additive relationship__ between dependent (response) variable and independent (predictor) variable(s)


       - A linear relationship suggests that a change in response Y due to one unit change in X¹ is constant, regardless of 
         the value of X¹. 
       - An additive relationship suggests that the effect of X¹ on Y is independent of other variables.
       - The linearity assumption can best be tested with scatter plots.
    
 
2. Linear regression analysis is often stated to require all variables to be __multivariate normal__; strictly, it is the residuals that are assumed to be normally distributed.

       - One definition is that a random vector is said to be k-variate normally distributed if every linear combination 
         of its k components has a univariate normal distribution.
         
![image.png](./images/MultivariateNormal.png)
         
       - This assumption can best be checked with a histogram or a Q-Q plot.
       - Normality can be checked with a goodness of fit test, e.g., the Kolmogorov-Smirnov test.  
       - When the data is not normally distributed a non-linear transformation (e.g., log-transformation) might fix this 
         issue.

3. Linear regression assumes that there is little or __no multicollinearity__ in the data. Multicollinearity occurs when the independent variables are highly correlated with each other.

       - Multicollinearity can increase the variance of the coefficient estimates and make the estimates very sensitive 
         to minor changes in the model. The result is that the coefficient estimates are unstable
 
       - Multicollinearity may be tested with four central criteria:

            1) Correlation matrix – when computing the matrix of Pearson’s Bivariate Correlation among all independent 
               variables, the correlation coefficients need to be clearly smaller than 1 in absolute value; values close 
               to ±1 signal multicollinearity.

            2) Tolerance – the tolerance measures how much of one independent variable is not explained by the other 
               independent variables; it is calculated with an initial linear regression of that variable on the others.  
               Tolerance is defined as T = 1 – R² for this first-step regression.  With T < 0.1 there might be 
               multicollinearity in the data and with T < 0.01 there certainly is.

            3) Variance Inflation Factor (VIF) – the variance inflation factor of the linear regression is defined as 
               VIF = 1/T. With VIF > 10 there is an indication that multicollinearity may be present; with VIF > 100 
               there is certainly multicollinearity among the variables.
               
            4) Condition Index – the condition index is calculated from an eigenvalue (factor) analysis of the independent 
               variables.  Values of 10-30 indicate moderate multicollinearity in the linear regression variables, 
               values > 30 indicate strong multicollinearity.

       - If multicollinearity is found in the data, centering the data (that is deducting the mean of the variable from 
         each score) might help to solve the problem.  However, the simplest way to address the problem is to remove 
         independent variables with high VIF values.

4. Linear regression analysis requires that there is little or __no autocorrelation__ in the data.  

       - Autocorrelation occurs when the residuals are not independent from each other. 
       
       - For instance, this typically occurs in stock prices, where the price is not independent from the previous price.

       - While a scatterplot allows you to check for autocorrelations, you can test the linear regression model for 
         autocorrelation with the Durbin-Watson test.  Durbin-Watson’s d tests the null hypothesis that the residuals are 
         not linearly auto-correlated.  While d can assume values between 0 and 4, values around 2 indicate no 
         autocorrelation.  As a rule of thumb values of 1.5 < d < 2.5 show that there is no auto-correlation in the data. 
         However, the Durbin-Watson test only analyses linear autocorrelation and only between direct neighbors, which are 
         first order effects.


5. The last assumption of the linear regression analysis is __homoscedasticity__.  

       - The scatter plot is a good way to check whether the data are homoscedastic (meaning the variance of the 
         residuals is equal across the regression line).  The following scatter plots show examples of data that are 
         not homoscedastic.

       - The Goldfeld-Quandt Test can also be used to test for heteroscedasticity. The test splits the data into two 
         groups and tests to see if the variances of the residuals are similar across the groups.  If heteroscedasticity 
         is present, a non-linear correction might fix the problem.
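
To make the tolerance and VIF criteria from the multicollinearity point concrete, here is a minimal sketch for the special case of two predictors. In that case the R² from regressing one predictor on the other equals their squared Pearson correlation, so no full regression is needed. The function names and the data are made up for illustration; with more than two predictors, each R² would come from a regression on all the other predictors.

```python
from statistics import mean


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def tolerance_and_vif(x1, x2):
    """Two-predictor case only: T = 1 - r(x1, x2)^2 and VIF = 1 / T."""
    r2 = pearson_r(x1, x2) ** 2
    tolerance = 1 - r2  # zero for perfectly collinear predictors (VIF undefined)
    return tolerance, 1 / tolerance


# Made-up data: x2 is almost exactly 2 * x1, so the predictors are highly correlated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]
t, vif = tolerance_and_vif(x1, x2)
print(f"tolerance={t:.4f}, VIF={vif:.1f}")
```

For this data the tolerance falls below 0.1 and the VIF exceeds 10, which the thresholds above would flag as likely multicollinearity.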


___Note: (1):___ 
         - Linear Regression is very sensitive to outliers. They can strongly distort the regression line and, 
           eventually, the forecasted values.

___Note: (2):___ 
         - In the case of multiple independent variables, we can use forward selection, backward elimination, or a 
           stepwise approach to select the most significant independent variables.
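
The Durbin-Watson statistic mentioned under the autocorrelation assumption can be sketched directly from its usual definition: the sum of squared differences between successive residuals, divided by the sum of squared residuals. The residual values below are invented for illustration, and the function name is an assumption.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residuals: a slow drift (positive autocorrelation) vs. alternating signs
# (negative autocorrelation).
drifting = [1.0, 0.8, 0.6, 0.3, -0.2, -0.5, -0.9, -1.1]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(durbin_watson(drifting))     # well below 2
print(durbin_watson(alternating))  # well above 2
```

Both series fall outside the 1.5 < d < 2.5 rule of thumb, in opposite directions.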


### Ordinary Least Squares Method

The aim is to model the equation of a line: $y(pred) = b_0 + b_1x$

Step 1: calculate the mean of the independent variable (x): $\bar{x} = \frac{\sum_{i=1}^n x_i}{n}$  
Step 2: calculate the mean of the dependent variable (y): $\bar{y} = \frac{\sum_{i=1}^n y_i}{n}$  
Step 3: calculate the slope of the line ($b_1$): $b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$  
Step 4: calculate the intercept of the line ($b_0$): $b_0 = \bar{y} - b_1\bar{x}$  
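
The four steps above can be sketched in plain Python as follows. The function name and the sample data are illustrative assumptions; the data lies exactly on y = 1 + 2x so the result is easy to check by hand.

```python
def fit_simple_ols(x, y):
    """Return (b0, b1) for y_pred = b0 + b1 * x, following the four steps."""
    n = len(x)
    x_bar = sum(x) / n  # step 1: mean of x
    y_bar = sum(y) / n  # step 2: mean of y
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx  # step 3: slope
    b0 = y_bar - b1 * x_bar  # step 4: intercept
    return b0, b1


# Made-up data lying exactly on y = 1 + 2x.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
b0, b1 = fit_simple_ols(x, y)
print(b0, b1)
```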

The formulas above look simple and are easy to compute, but they apply only to simple (univariate) regression, with a single independent variable and a single dependent variable. When the dataset has multiple independent variables, the coefficients are instead obtained in matrix form (the normal equations) or iteratively with an optimization algorithm called “Gradient Descent”.
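
For reference, least squares with several independent variables can also be solved directly in matrix form. The sketch below assumes NumPy is available and uses `np.linalg.lstsq`; the data is made up and generated without noise from y = 1 + 2·x1 + 3·x2, so the recovered coefficients should match those values.

```python
import numpy as np

# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta ≈ y.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta)  # intercept, coefficient of x1, coefficient of x2
```

A direct solve like this is convenient for small problems; iterative methods such as gradient descent are an alternative when the dataset is large.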

### Gradient Descent Method

https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a
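
As a minimal sketch of the idea (not taken from the linked article), batch gradient descent for a single predictor repeatedly nudges `b0` and `b1` against the gradient of the mean squared error. The learning rate, iteration count, and data below are assumptions chosen so that this small example converges.

```python
def gradient_descent(x, y, learning_rate=0.05, n_iters=2000):
    """Fit y_pred = b0 + b1 * x by batch gradient descent on mean squared error."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        # Prediction errors with the current parameters.
        errors = [(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
        # Partial derivatives of the mean squared error.
        grad_b0 = (2 / n) * sum(errors)
        grad_b1 = (2 / n) * sum(e * xi for e, xi in zip(errors, x))
        # Step against the gradient.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


# Made-up data lying exactly on y = 1 + 2x.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
gd_b0, gd_b1 = gradient_descent(x, y)
print(gd_b0, gd_b1)
```

A learning rate that is too large makes the updates diverge, and one that is too small needs many more iterations, so in practice it is tuned per dataset.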


### Metrics for model evaluation


#### R-Squared value

This value typically ranges from 0 to 1. A value of 1 indicates that the predictor perfectly accounts for all the variation in y. A value of 0 indicates that the predictor x accounts for no variation in y.

__1. Regression sum of squares (SSR)__

This gives information about how far the estimated regression line is from the horizontal ‘no relationship’ line (the average of the actual output).

$$ SSR = \sum_{i=1}^n (Predicted\:output - Average\:of\:actual\:output)^2$$ 

__2. Sum of Squared error (SSE)__

This tells how much the target value varies around the regression line (the predicted value).

$$ SSE = \sum_{i=1}^n (Actual\:output - Predicted\:output)^2$$ 


__3. Total sum of squares (SSTO)__

This tells how much the data points vary around the mean.

$$ SSTO = \sum_{i=1}^n (Actual\:output - Average\:of\:actual\:output)^2$$ 


$$ R^2 = 1- \frac{SSE}{SSTO} $$
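
The three sums of squares and the resulting R² can be computed as in the sketch below. The data is made up, and the predictions are assumed to come from the least-squares line for that data (y = 2.2 + 0.6x), for which SSR + SSE equals SSTO.

```python
def sums_of_squares(actual, predicted):
    """Return (SSR, SSE, SSTO) for actual and predicted outputs."""
    y_bar = sum(actual) / len(actual)
    ssr = sum((p - y_bar) ** 2 for p in predicted)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ssto = sum((a - y_bar) ** 2 for a in actual)
    return ssr, sse, ssto


# Made-up data for x = 1..5; predictions from the OLS line y = 2.2 + 0.6x.
actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

ssr, sse, ssto = sums_of_squares(actual, predicted)
r_squared = 1 - sse / ssto
print(ssr, sse, ssto, r_squared)
```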


__Is the range of R-Square always between 0 to 1?__

The value of R² can be negative if the regression line is forced to pass through a particular point. For example, forcing the line through the origin (no intercept) can give an error higher than the error produced by the horizontal line at the mean. This happens when the data is far away from the origin.
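
A small sketch of this effect, using made-up data that is roughly flat and far from the origin. The no-intercept slope is taken as Σxy / Σx², which is the least-squares solution when the line is forced through the origin.

```python
# Made-up data: y stays near 5 while x is far from the origin.
x = [10, 11, 12, 13, 14]
y = [5.0, 4.8, 5.1, 4.9, 5.2]

# Least-squares slope for a line forced through the origin (no intercept).
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
predicted = [slope * xi for xi in x]

y_bar = sum(y) / len(y)
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
ssto = sum((yi - y_bar) ** 2 for yi in y)
r_squared_origin = 1 - sse / ssto
print(r_squared_origin)  # negative: worse than the horizontal line at the mean
```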


__Correlation co-efficient (r)__

This is related to the value of ‘r-squared’, as the notation itself suggests. It ranges from -1 to 1.

$$ r = \pm \sqrt{R^2}$$

If the value of ‘b1’ is negative, then ‘r’ is negative, whereas if the value of ‘b1’ is positive, then ‘r’ is positive. It is unitless.




__The Coefficient of Determination__   

The coefficient of determination (denoted by R²) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.

- The coefficient of determination ranges from 0 to 1.
- An R² of 0 means that the dependent variable cannot be predicted from the independent variable.
- An R² of 1 means the dependent variable can be predicted without error from the independent variable.
- An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.10 means that 10 percent of the variance in Y is predictable from X; an R² of 0.20 means that 20 percent is predictable; and so on.

The formula for computing the coefficient of determination for a linear regression model with one independent variable is given below.


$$ R^2 = \left( \frac{\frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} \right)^2 $$

where $N$ is the number of observations used to fit the model, $\sum$ is the summation symbol, $x_i$ is the x value for observation i, $\bar{x}$ is the mean x value, $y_i$ is the y value for observation i, $\bar{y}$ is the mean y value, $\sigma_x$ is the standard deviation of x, and $\sigma_y$ is the standard deviation of y.
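
The formula can be checked numerically with the sketch below. It assumes population standard deviations (dividing by N), which matches the 1/N factor in the formula; the data and function name are made up. Negating y flips the sign of r, as it would flip the sign of the slope, but leaves R² unchanged.

```python
from statistics import mean, pstdev


def correlation(x, y):
    """Pearson r using population standard deviations (divide by N)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    covariance = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    return covariance / (pstdev(x) * pstdev(y))


# Made-up data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_flipped = correlation(x, [-yi for yi in y])
print(r ** 2)        # coefficient of determination
print(r, r_flipped)  # same magnitude, opposite signs
```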

https://machinelearningmastery.com/implement-simple-linear-regression-scratch-python/

## Questions

