# What is Regression Analysis? 

**Supervised learning: a regression problem predicts a real-valued output.**

**Supervised learning: a classification problem predicts a discrete output such as true/false or 0/1.**

- Regression analysis is a predictive modeling technique that investigates the relationship between a dependent (target) variable and one or more independent variables (predictors). It is used for forecasting, **time series modeling**, and finding cause-and-effect relationships between variables. For example, the relationship between rash driving and the number of road accidents a driver has is best studied through regression.

- Regression analysis is an important tool for modeling and analyzing data. Here, we fit a curve or line to the data points in such a way that the distances of the data points from the curve or line are minimized.

# Why do we use Regression Analysis?

As mentioned above, regression analysis estimates the relationship between two or more variables. Let’s understand this with an easy example:

Let’s say you want to estimate the growth in sales of a company based on current economic conditions. You have recent company data which indicates that the growth in sales is around two and a half times the growth in the economy. Using this insight, we can predict future sales of the company based on current and past information.

There are multiple benefits of using regression analysis. They are as follows:

1. It indicates the **significant relationships** between the dependent variable and the independent variables.
2. It indicates the **strength of impact** of multiple independent variables on a dependent variable.

Regression analysis also allows us to compare the effects of variables measured on different scales, such as the effect of price changes and the number of promotional activities. These benefits help market researchers, data analysts, and data scientists evaluate candidate variables, eliminate the weak ones, and select the best set of variables for building predictive models.

# Terminologies related to regression analysis

### 1. Outliers
Suppose an observation in the dataset has a very high or very low value compared to the other observations, i.e. it does not appear to belong to the population. Such an observation is called an outlier. In simple words, it is an extreme value. An outlier is a problem because it often distorts the results we get.
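
As a small illustration of how a single extreme value can distort a result, the sketch below compares the mean and the median of a made-up list of observations with and without its last value. The numbers are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical observations; the last value is far from the rest.
values = [10, 12, 11, 13, 12, 95]
without_outlier = values[:-1]

# The mean is pulled strongly towards the extreme value.
mean_with = statistics.mean(values)
mean_without = statistics.mean(without_outlier)

# The median barely reacts to it.
median_with = statistics.median(values)
median_without = statistics.median(without_outlier)

print("mean:  ", mean_with, "vs", mean_without)
print("median:", median_with, "vs", median_without)
```

Least squares fitting is based on means and squared deviations, so it tends to be affected by outliers in a similar way to the mean here.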

### 2. Multicollinearity
When the independent variables are highly correlated with each other, they are said to be multicollinear. Many regression techniques assume that multicollinearity is not present in the dataset, because it causes problems when ranking variables by importance and makes it difficult to select the most important independent variable (factor).

**Why is Multicollinearity a Potential Problem?**

A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you hold all of the other independent variables constant. That last portion is crucial for our discussion about multicollinearity.

The idea is that you can change the value of one independent variable and not the others. However, when independent variables are correlated, it indicates that changes in one variable are associated with shifts in another variable. The stronger the correlation, the more difficult it is to change one variable without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable independently because the independent variables tend to change in unison.

**There are two basic kinds of multicollinearity:**

- **Structural multicollinearity:** This type occurs when we create a model term using other terms. In other words, it’s a byproduct of the model that we specify rather than being present in the data itself. For example, if you square term X to model curvature, clearly there is a correlation between X and X².
- **Data multicollinearity:** This type of multicollinearity is present in the data itself rather than being an artifact of our model. Observational experiments are more likely to exhibit this kind of multicollinearity.
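
The X and X² example of structural multicollinearity can be checked numerically. The sketch below computes the Pearson correlation between a hypothetical predictor and its square. It also shows the effect of centering the predictor before squaring, which is a commonly used remedy and an addition of this sketch rather than something described above.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical predictor values 1..10 and the squared term built from them.
x = list(range(1, 11))
x_squared = [v ** 2 for v in x]
r_raw = pearson(x, x_squared)

# Centering x (subtracting its mean) before squaring.
mean_x = sum(x) / len(x)
x_centered = [v - mean_x for v in x]
x_centered_squared = [v ** 2 for v in x_centered]
r_centered = pearson(x_centered, x_centered_squared)

print("correlation of X and X^2:          ", round(r_raw, 3))
print("correlation after centering X first:", round(r_centered, 3))
```

For these evenly spaced values the raw correlation is close to 1, while the centered version drops to about zero because the centered values are symmetric around zero.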

**Multicollinearity in detail :** https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/

### 3. Heteroscedasticity
When the variability of the dependent variable is not equal across values of an independent variable, it is called heteroscedasticity. For example, as one’s income increases, the variability of food consumption increases. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.
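
The income and food consumption example can be imitated with simulated data. The sketch below is only an illustration: the trend line, the noise level proportional to income, and the sample size are all assumptions made up for this example.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical model: food spending rises with income, and the noise
# standard deviation is proportional to income (an assumption).
incomes = [20 + i for i in range(200)]
spending = [5 + 0.3 * inc + rng.gauss(0, 0.05 * inc) for inc in incomes]

# Deviations from the known trend line used to generate the data.
deviations = [s - (5 + 0.3 * inc) for inc, s in zip(incomes, spending)]

# Spread of the deviations in the lower-income and higher-income halves.
spread_low = statistics.pstdev(deviations[:100])
spread_high = statistics.pstdev(deviations[100:])

print("spread for lower incomes: ", round(spread_low, 2))
print("spread for higher incomes:", round(spread_high, 2))
```

A clearly larger spread in the higher-income half is the pattern that heteroscedasticity describes.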

**Heteroscedasticity in detail :** https://statisticsbyjim.com/regression/heteroscedasticity-regression/

### 4. Underfitting and Overfitting
Using unnecessary explanatory variables might lead to overfitting. Overfitting means that our algorithm works well on the training set but performs poorly on the test set. It is also known as the problem of **high variance**.

When our algorithm works so poorly that it cannot fit even the training set well, it is said to underfit the data. It is also known as the problem of **high bias**.

In the following diagram we can see that fitting a linear regression (straight line in fig 1) would underfit the data i.e. it will lead to large errors even in the training set. Using a polynomial fit in fig 2 is balanced i.e. such a fit can work on the training and test sets well, while in fig 3 the fit will lead to low errors in training set but it will not work well on the test set.

<img src="11.png" alt="Drawing" align="left" style="width: 700px;"/> <img src="12.png" alt="Drawing" align="left" style="width: 400px;"/> 
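
The three situations in the diagram can be reproduced with a small experiment. The sketch below fits polynomials of degree 1, 2 and 9 to ten training points and compares training and test errors. Everything in it is an assumption for illustration: the underlying curve `4 * x ** 2`, the deterministic ±0.5 disturbance standing in for noise, and the helper `fit_polynomial`, which solves the least squares normal equations with plain Gaussian elimination.

```python
def fit_polynomial(xs, ys, degree):
    """Least squares polynomial coefficients, lowest power first."""
    size = degree + 1
    # Normal equations (A^T A) c = A^T y for the Vandermonde matrix A.
    matrix = [[sum(x ** (r + c) for x in xs) for c in range(size)]
              for r in range(size)]
    rhs = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(matrix[r][col]))
        matrix[col], matrix[pivot] = matrix[pivot], matrix[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, size):
            factor = matrix[row][col] / matrix[col][col]
            for k in range(col, size):
                matrix[row][k] -= factor * matrix[col][k]
            rhs[row] -= factor * rhs[col]
    # Back substitution.
    coeffs = [0.0] * size
    for row in reversed(range(size)):
        tail = sum(matrix[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (rhs[row] - tail) / matrix[row][row]
    return coeffs

def predict(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))

def mse(coeffs, xs, ys):
    return sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def true_curve(x):
    return 4 * x ** 2  # hypothetical underlying relationship

# Ten training points on [-1, 1]; test points lie midway between them.
train_x = [-1 + 2 * i / 9 for i in range(10)]
train_y = [true_curve(x) + 0.5 * (-1) ** i for i, x in enumerate(train_x)]
test_x = [-1 + (2 * i + 1) / 9 for i in range(9)]
test_y = [true_curve(x) for x in test_x]

train_error, test_error = {}, {}
for degree in (1, 2, 9):
    coeffs = fit_polynomial(train_x, train_y, degree)
    train_error[degree] = mse(coeffs, train_x, train_y)
    test_error[degree] = mse(coeffs, test_x, test_y)
    print(degree, train_error[degree], test_error[degree])
```

In this setup the straight line has a large error even on the training points (high bias), the degree 9 polynomial passes through every training point but does badly on the test points (high variance), and the degree 2 fit has the lowest test error of the three.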

#  The most commonly used regressions :

### 1. Linear Regression

Ordinary Least Squares (OLS) regression is more commonly known as linear regression.

It is one of the most widely known modeling techniques. Linear regression is usually among the first topics people pick up when learning predictive modeling. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the regression line is linear.

Linear Regression establishes a relationship between **dependent variable (Y)** and **one or more independent variables (X)** using a best fit straight line (also known as regression line).

It is represented by the equation **Y=a+b*X + e**, where:

- a is the intercept.
- b is the slope of the line.
- e is the error term.

This equation can be used to predict the value of the target variable based on the given predictor variable(s).

The difference between **simple linear regression and multiple linear regression** is that : 
- Multiple linear regression has **(>1)** independent variables.
- Simple linear regression has **only 1** independent variable.

<img src="13.png" alt="Drawing" align="left" style="width: 500px;"/>

### How to obtain the best-fit line (values of a and b)?
This task can be easily accomplished with the least squares method. It is the most common method used for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line. Because the deviations are squared before being added, positive and negative values do not cancel out.

Residual = Yactual - Ypredicted. In the least squares method used for linear regression, we minimise the sum of the squared residuals.
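
For a single predictor, the values of a and b that minimise the sum of squared residuals have a well-known closed form: b is the covariance of X and Y divided by the variance of X, and a is chosen so the line passes through the point of means. The sketch below applies it to a small made-up dataset; the function name `least_squares_line` and the data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line Y = a + b*X that minimises squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 12]

a, b = least_squares_line(xs, ys)

# Residual = actual value minus predicted value.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sum_of_squares = sum(r ** 2 for r in residuals)

print("intercept a:", round(a, 3), "slope b:", round(b, 3))
print("sum of squared residuals:", round(sum_of_squares, 3))
```

Note that the residuals of a least squares line with an intercept sum to zero (up to rounding), which is why the method squares them before adding.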

<img src="14.png" alt="Drawing" align="left" style="width: 500px;"/>

#### Cost function = $\frac{1}{2m}\sum_{i=1}^{m} (Y_{actual}^{(i)} - Y_{predicted}^{(i)})^2$

Here m is the number of observations and i indexes them.

The **goal** in linear regression is to minimise the cost function.
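
The cost function can be evaluated directly for any candidate line. The sketch below is a minimal version for one predictor, with the prediction taken as a + b*X; the function name `cost` and the data are made up for illustration.

```python
def cost(a, b, xs, ys):
    """Half the mean squared residual for the line Y = a + b*X."""
    m = len(xs)
    squared = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return squared / (2 * m)

# Hypothetical data that lies exactly on the line Y = 2*X.
xs = [1, 2, 3]
ys = [2, 4, 6]

perfect = cost(0, 2, xs, ys)  # the true line: every residual is zero
worse = cost(0, 1, xs, ys)    # a line with the wrong slope

print("cost of Y = 2*X:", perfect)
print("cost of Y = 1*X:", worse)
```

The line that matches the data has a cost of zero, and any other choice of a and b gives a larger value.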



## Gradient Descent 

- Gradient descent is used to minimise the cost function.
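
A minimal sketch of gradient descent for the line Y = a + b*X follows. It repeatedly moves a and b a small step against the gradient of the cost function $\frac{1}{2m}\sum (Y_{actual} - Y_{predicted})^2$. The learning rate, the number of iterations, the starting point of zero, and the data are assumptions chosen for this example, not values from the text above.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Estimate a and b for Y = a + b*X by batch gradient descent."""
    m = len(xs)
    a, b = 0.0, 0.0  # arbitrary starting point
    for _ in range(iterations):
        # Prediction error (predicted minus actual) for every observation.
        errors = [(a + b * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to a and b.
        grad_a = sum(errors) / m
        grad_b = sum(e * x for e, x in zip(errors, xs)) / m
        # Step against the gradient.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

# Hypothetical data generated from Y = 1 + 2*X.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

a, b = gradient_descent(xs, ys)
print("intercept a:", round(a, 4), "slope b:", round(b, 4))
```

If the learning rate is too large the updates can diverge, and if it is too small convergence is slow, so in practice it usually needs some tuning.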

### 2. Logistic Regression

### 3. Polynomial Regression

### 4. Stepwise Regression

### 5. Ridge Regression

### 6. Lasso Regression

### 7. ElasticNet Regression