## Linear Regression
* It is easy to understand.
* It is a foundational algorithm; many of its components are reused in other algorithms.


### Regression:
* Regression: Regression in machine learning consists of mathematical methods that allow data scientists to **predict a continuous outcome** (y) based on the value of one or more predictor variables (x). [y is the dependent variable and the x values are the independent variables]
* A regression problem is when the output variable is a real or continuous value, such as “salary” or “weight”.
* Uses of regression:
	* Determining the relationship between the independent variables and the dependent variable.
	* Forecasting an effect, e.g. how much additional sales income each 1000 dollars spent on marketing will bring.
	* Trend forecasting, e.g. estimating the future price of bitcoin from its price over the last 6 months.


## Intro to Linear Regression:

* It is a supervised machine learning algorithm
* Three types of linear regression:
  * **Simple linear regression**: When there is a single input variable (x)
    * It uses the traditional slope-intercept form, where m and b are the parameters our algorithm will try to “learn” to produce the most accurate predictions. x represents our input data and y represents our prediction.  y=mx+b

  * **Multiple linear regression**: When there are multiple input variables. Literature from statistics often refers to this method as multiple linear regression (multivariable regression).
    * A more complex, multi-variable linear equation might look like this, where w represents the regression coefficients, or weights, our model will try to learn. f(x,y,z)=w1x+w2y+w3z
  * **Polynomial linear regression**: When the relationship between x and y is not a straight line. Powers of x (x^2, x^3, ...) are added as extra input variables, so the model is still linear in its coefficients.
  
**We will study simple linear regression first**

* Linear regression is a method used to model the relationship between a dependent variable (y) and an independent variable (x). It is written as: y = mx + b, where y is the dependent variable, m is the coefficient (scale factor), b is the bias coefficient and x is the independent variable.

* Here, the relationship between the independent feature and the dependent feature is assumed to be linear.

* A linear relationship is any equation that gives a straight line when graphed. If the graph is not a straight line, then either it was graphed incorrectly or the equation is not a linear relationship.

* The goal here is to draw the line of best fit between X and Y which estimates the relationship between X and Y.

* In Linear Regression the predicted output is continuous and has a constant slope. It’s used to predict values within a continuous range, (e.g. sales, price) rather than trying to classify them into categories (e.g. cat, dog). 

### Basic example working:
* Let's understand with help of example:

| CGPA | Package |
|------|---------|
| 7.1  | 3.5     |
| 4.7  | 1.2     |
| 8.9  | 4.2     |
| 8.1  | 3.9     |

* Now we want to build a model which takes CGPA as input and predicts package.
* First we can plot the data. With many data points we do not get an exactly linear graph, only a roughly linear one, because many real-world factors also affect each package, such as IQ or a company that does not offer good packages. These factors are called stochastic errors: the model cannot determine them, but they affect the results.
* If the data were exactly linear, we could draw a straight line through it and read off the value of y for each x directly from the line.
* Since the data is not exactly linear, we instead draw the best fit line, like this (the big red line; ignore the small red one):

![linear_reg_graph](images/linear_reg_graph.png)
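
As a rough illustration, the best fit line for the four rows of the table can be computed with the ordinary least squares formulas for a single input (slope = covariance of x and y divided by variance of x, intercept = mean of y minus slope times mean of x). The function names `fit_simple_ols` and `predict` are made up for this sketch, and four points are far too few for a real model.

```python
# Data copied from the CGPA / package table above.
cgpa = [7.1, 4.7, 8.9, 8.1]
package = [3.5, 1.2, 4.2, 3.9]


def fit_simple_ols(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations (numerator) and sum of squared
    # deviations of x (denominator) give the slope.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b


m, b = fit_simple_ols(cgpa, package)


def predict(x):
    """Predict package for a given CGPA using the fitted line."""
    return m * x + b


print(f"m = {m:.3f}, b = {b:.3f}")
print(f"predicted package for CGPA 8.0: {predict(8.0):.2f}")
```

The fitted line always passes through the point (mean of x, mean of y), which is a quick sanity check for any implementation.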

### Regression Coefficient:
* Regression coefficients are estimates of the unknown population parameters and describe the relationship between an input variable and the response.
* In linear regression, coefficients are the values that multiply the input values. Suppose you have the following regression equation: y = 4X + 3. In this equation, +4 is the coefficient, X is the input feature, and +3 is the constant.
* The sign of each coefficient indicates the direction of the relationship between an input variable and the response variable.
	- A positive sign indicates that as the input variable increases, the response variable also increases.
	- A negative sign indicates that as the input variable increases, the response variable decreases.

* The coefficient value represents the mean change in the response given a one unit change in the predictor. For example, if a coefficient is +4, the mean response value increases by 4 for every one unit change in the predictor.
* For e.g: If we have an equation for multivariable regression: y = 0.8 + 1.2X1 + 3X2 + 5X3 + 1X4; the largest coefficient is 5, so a one unit change in X3 changes y the most. This suggests X3 is the most important feature for determining y, provided all features are measured on comparable scales.
* The values of these coefficients are determined when the model is trained, and they tell us which features are most important.
* The coefficient `m` in the equation y = mx + b is also called the slope.
* In the CGPA example, `m` is the weightage of CGPA: if m is small, the package does not depend much on CGPA, and if m is large, the package depends heavily on CGPA. Mathematically, the slope is the change in y per unit change in x, so a larger slope means a small change in x produces a larger change in y.
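
The "one unit change" reading of a coefficient can be checked numerically. The sketch below hard-codes the example equation y = 0.8 + 1.2X1 + 3X2 + 5X3 + 1X4 from above (it does not train anything) and increases one feature at a time by 1; the baseline feature values are arbitrary.

```python
# Coefficients taken from the example equation in the notes.
intercept = 0.8
weights = [1.2, 3.0, 5.0, 1.0]


def predict(features):
    """Evaluate y = intercept + w1*X1 + w2*X2 + w3*X3 + w4*X4."""
    return intercept + sum(w * x for w, x in zip(weights, features))


# Arbitrary baseline point, chosen only for illustration.
base = [1.0, 1.0, 1.0, 1.0]

for i, w in enumerate(weights):
    bumped = list(base)
    bumped[i] += 1.0  # increase a single feature by one unit
    change = predict(bumped) - predict(base)
    print(f"X{i + 1}: +1 unit changes y by {change:.1f} (coefficient {w})")
```

Each printed change equals the corresponding coefficient, whatever baseline is chosen, because the model is linear.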

### Best fit line
* One of our goals is to find the best fit line. Initially we don't know which line that is, since many different lines can be drawn through the data.
* For any given line, we measure the distance between the line and each point (the error). We want the line for which the sum of all these errors is minimum.
* There can be multiple lines that minimize the sum of absolute errors, but there is only one line that minimizes the sum of squared errors (SSE). Using SSE also makes implementation easier.
* So we try to find `m` and `b`; once we find the best `m` and `b` values, we get the best fit line. When we finally use the model for prediction, it predicts the value of y for an input value of x.
* The best regression line is the one that minimizes the sum of squared errors: Σ(actual-predicted)^2
* Some of the algorithms for minimizing the sum of squared errors and calculating the m and b values:
	* ordinary least squares(OLS)
	* gradient descent
* With a large number of features, solving OLS directly becomes computationally expensive, so gradient descent is typically used in higher dimensions.
* SSE has a shortcoming: it grows as the number of data points grows, so SSE values from datasets of different sizes are not directly comparable. A line that fits a large dataset well can still have a larger SSE than a line that fits a small dataset poorly.
* An evaluation metric which does not have this shortcoming is R squared.
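
A small sketch of the two metrics, using the standard definition R squared = 1 - SSE/SST, where SST is the sum of squared differences between each y and the mean of y. The data is the CGPA table from earlier; the line (0.733, -2.077) is the rounded least-squares fit for that table, and (0.5, 0.0) is an arbitrary guess included for comparison.

```python
cgpa = [7.1, 4.7, 8.9, 8.1]
package = [3.5, 1.2, 4.2, 3.9]


def sse(m, b, xs, ys):
    """Sum of squared errors between actual y and predicted m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


def r_squared(m, b, xs, ys):
    """R squared = 1 - SSE / SST."""
    y_mean = sum(ys) / len(ys)
    sst = sum((y - y_mean) ** 2 for y in ys)
    return 1 - sse(m, b, xs, ys) / sst


for name, (m, b) in {"guess": (0.5, 0.0), "least squares": (0.733, -2.077)}.items():
    print(f"{name:>13}: SSE = {sse(m, b, cgpa, package):.4f}, "
          f"R^2 = {r_squared(m, b, cgpa, package):.4f}")

# Repeating every data point doubles SSE but leaves R squared unchanged,
# which is why R squared is easier to compare across dataset sizes.
big_x, big_y = cgpa * 2, package * 2
print("doubled data: SSE =", round(sse(0.733, -2.077, big_x, big_y), 4),
      "R^2 =", round(r_squared(0.733, -2.077, big_x, big_y), 4))
```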


### Bias:
* Suppose we were predicting package based on experience and there was no bias, so y = mx. If x is 0 then y is also 0, so for 0 experience the package would be 0, which is wrong in the real world.
* So the bias here is an offset: if the mx term is 0 then y will be b.
* Note that "bias" also has a separate meaning in statistics: the difference between the expected value of an estimator and the true value being estimated.
* The bias coefficient gives an extra degree of freedom to this model.
* Read more here: https://stats.stackexchange.com/questions/13643/what-intuitively-is-bias
* Bias can also be seen as the y-intercept in the equation y = mx + b (where b is the bias).

![slope and intercept](images/slope_and_intercept.png)


### Linear combination of input
* In linear regression, y can be calculated from a linear combination of the input variables (x).
* Linear combination of input means -> Suppose we have inputs x1, x2 and x3; then it is: a.x1 + b.x2 + c.x3
* Linear combination of vectors means -> A linear combination of vectors is a sum of scalar multiples of those vectors. That is, given a set of M vectors xi of the same type, such as R^N (they must have the same number of elements so they can be added), a linear combination is formed by multiplying each vector by a scalar `αi` (alpha) and summing to produce a new vector y of the same type:   y = α1.x1 + α2.x2 + α3.x3 + ... + αM.xM
* Read more here: https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html#introduction
* Read more here: https://towardsdatascience.com/coding-deep-learning-for-beginners-linear-regression-part-1-initialization-and-prediction-7a84070b01c8
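
A linear combination of vectors can be sketched in a few lines of plain Python. The helper name `linear_combination` and the sample numbers are only illustrative.

```python
def linear_combination(scalars, vectors):
    """Return sum_i scalars[i] * vectors[i], computed element by element."""
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all vectors must have the same number of elements")
    return [
        sum(a * v[i] for a, v in zip(scalars, vectors))
        for i in range(length)
    ]


# Three vectors in R^2 and one scalar (alpha) per vector.
vectors = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
alphas = [2.0, -1.0, 0.5]

y = linear_combination(alphas, vectors)
print(y)
```

A linear regression prediction is the special case where each "vector" is a single input value and the scalars are the learned weights.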




### Cost function(J) of linear regression:
* It is a function that measures the performance of an ML model on given data. The cost function quantifies the error between predicted values and expected values and presents it as a single real number.
* It is very important to update the θ0 and θ1 values (the intercept b and the slope m) to reach the values that minimize the error between the predicted y value (pred) and the true y value (y).
* The purpose of a cost function is to be either:
    * Minimized - the returned value is then usually called cost, loss or error. The goal is to find the values of the model parameters for which the cost function returns as small a number as possible.
    * Maximized - the value it yields is then called a reward. The goal is to find the values of the model parameters for which the returned number is as large as possible.
* Formula:

	![Cost function](images/cost_func.jpg)


(where n is the number of data points and pred_i, also written ŷ, is given by mx+b)
* The points predicted by the model (which lie on the best fit line) are ŷ, and y are the original points.
* We try to minimize the value of this cost function; whichever line gives the minimum error is selected as the best fit line.
* In principle we could generate many candidate lines, compute the cost function for each one and pick the line with the minimum value, but this takes a lot of time and processing power. Evaluating n candidate lines (n can be millions or billions) is not an efficient way to find the best fit, so instead we use methods like gradient descent.

* Read Tailoring Cost functions: https://towardsdatascience.com/coding-deep-learning-for-beginners-linear-regression-part-2-cost-function-49545303d29f
* The cost function (J) of linear regression is usually the Mean Squared Error (MSE), or its square root, the Root Mean Squared Error (RMSE), between the predicted y value (pred) and the true y value (y). Because RMSE is just the square root of MSE, minimizing either one gives the same parameters. RMSE is generally more useful for interpreting and communicating results because it is in the same units as the target variable. Both are based on squared errors, so both are sensitive to extreme outliers.
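
The sketch below assumes the cost is the plain mean of squared errors, (1/n) Σ (pred_i - y_i)^2; some texts divide by 2n instead, which does not change which line wins. It scores a handful of hand-picked candidate lines on the CGPA table, which is the brute-force search described above, and shows that MSE and RMSE select the same line.

```python
import math

cgpa = [7.1, 4.7, 8.9, 8.1]
package = [3.5, 1.2, 4.2, 3.9]


def mse(m, b, xs, ys):
    """Mean squared error of the line y = m*x + b."""
    n = len(xs)
    return sum(((m * x + b) - y) ** 2 for x, y in zip(xs, ys)) / n


def rmse(m, b, xs, ys):
    """Root mean squared error: same units as y."""
    return math.sqrt(mse(m, b, xs, ys))


# Hand-picked candidate (m, b) pairs, purely for illustration.
candidates = [(0.5, 0.0), (0.6, -1.0), (0.733, -2.077), (0.9, -3.0)]

for m, b in candidates:
    print(f"m={m:<6} b={b:<7} MSE={mse(m, b, cgpa, package):.4f} "
          f"RMSE={rmse(m, b, cgpa, package):.4f}")

best_by_mse = min(candidates, key=lambda p: mse(*p, cgpa, package))
best_by_rmse = min(candidates, key=lambda p: rmse(*p, cgpa, package))
print("best by MSE:", best_by_mse, "| best by RMSE:", best_by_rmse)
```

Scoring a fixed list of candidates only finds the best line among those tried, which is why an optimizer such as gradient descent is used in practice.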

### Gradient Descent of Linear Regression:
- Gradient descent is an algorithm that is used to minimize a cost function. Gradient descent is used not only in linear regression; it is a more general algorithm.
- While training, the model calculates the cost function, which measures the squared error (MSE or RMSE) between the predicted value (pred) and the true value (y). The model aims to minimize the cost function.
- To minimize the cost function, the model needs the best values of θ0 and θ1. Initially the model selects θ0 and θ1 randomly and then iteratively updates these values to reduce the cost function until it reaches the minimum. By the time the model reaches the minimum of the cost function, it has the best θ0 and θ1 values. Using these final values of θ0 and θ1 in the hypothesis equation, the model predicts the value of y as well as it can.
- We start with some initial guesses for the values of θ0 and θ1 and then keep changing them according to the update rule:  θj := θj − α * (∂/∂θj) J(θ0,θ1)   for j=0,1
- α is called the learning rate, and it determines how big a step needs to be taken when updating the parameters. The learning rate is always a positive number.
- We want to simultaneously update θ0 and θ1, that is, calculate the right-hand-side of the above equation for both j=0 as well as j=1 and then update the values of the parameters to the newly calculated ones, which means first calculate for both and then assign(as shown in picture below). This process is repeated till convergence is achieved.

![grad_desc_calc](images/grad_desc_calc.png)

- If α is too small, then gradient descent can be slow; if it is too large gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
- Suppose θ1​ is at a local optimum of J(θ1​)(at a minimum position), what will one step of gradient descent θ1:=θ1 − α*(∂/∂θ1).J(θ1​) do? (It will be unchanged, as slope at that point will be 0, hence derivative term will be 0)
- As we approach local minimum, gradient descent will automatically take smaller steps(as derivative term becomes smaller). So there is no need to decrease α over time.
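- Gradient descent is not guaranteed to find the global minimum for an arbitrary function J(θ0,θ1); in general it can get stuck in a local minimum. For linear regression, however, the squared-error cost function is convex, so with a suitable learning rate gradient descent converges to the global minimum.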
- Read this for further info: https://www.hackerearth.com/blog/developers/gradient-descent-algorithm-linear-regression
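
The update rule above can be sketched as batch gradient descent for a single input variable. This is a minimal illustration under some assumptions: the cost is taken as J = (1/2n) Σ (θ0 + θ1·x − y)^2, the toy data is generated from y = 2x + 1, and the parameters start at 0 rather than at random values so the run is reproducible.

```python
# Toy data generated from y = 2x + 1, so the best parameters are known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def cost(theta0, theta1):
    """J(theta0, theta1) = 1/(2n) * sum((theta0 + theta1*x - y)^2)."""
    n = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)


def gradient_descent(alpha, iterations):
    """Run batch gradient descent starting from theta0 = theta1 = 0."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n  # dJ/dtheta1
        # Simultaneous update: both gradients were computed from the old
        # values before either parameter is overwritten.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


theta0, theta1 = gradient_descent(alpha=0.05, iterations=2000)
print(f"alpha=0.05: theta0={theta0:.4f}, theta1={theta1:.4f}, "
      f"J={cost(theta0, theta1):.2e}")

# A learning rate that is too large for this data overshoots the minimum
# on every step, so the cost grows instead of shrinking.
bad0, bad1 = gradient_descent(alpha=0.5, iterations=20)
print(f"alpha=0.5 : J after 20 steps = {cost(bad0, bad1):.2e} "
      f"(J at the start was {cost(0.0, 0.0):.2f})")
```

With the smaller learning rate the parameters approach θ0 = 1 and θ1 = 2; with the larger one the cost ends up far above its starting value, which is the overshooting behaviour described above.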

### Advantages of Linear Regression:
* It performs very well when the relationship between the features and the target is approximately linear.
* Easy to implement and train the model.
* Overfitting can be reduced using dimensionality reduction techniques, cross validation and regularization.

### Disadvantages of Linear Regression:
* Sometimes a lot of feature engineering is required.
* If the independent features are correlated with each other, it may affect performance.
* It is often quite prone to noise (outliers) and overfitting.


### Important Notes:
* Whenever we use gradient descent or another loss function optimizer, feature scaling is recommended, and that applies here too. Without feature scaling, features with large ranges dominate the gradient, and gradient descent takes longer to reach the minimum.
* It is sensitive to missing values, so we have to handle missing values during the feature engineering step.
* As we know linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers since linear regression is sensitive to outlier effects.

* In the image below, the first plot has no outliers, while the second has one outlier. The best fit line shifts towards the outlier in order to reduce the mean squared error (the performance metric), which worsens the fit for the remaining points. Hence outliers impact linear regression.

![Impact of outliers](images/impactOfOutliers.png)
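
The same effect can be reproduced numerically. The sketch below uses made-up data that lies exactly on y = 2x and then appends a single outlier; the helper `fit_simple_ols` is a hypothetical name for the usual single-variable least-squares formulas.

```python
def fit_simple_ols(xs, ys):
    """Return slope m and intercept b of the least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_mean - m * x_mean


# Made-up data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]
m_clean, b_clean = fit_simple_ols(xs, ys)

# Same data plus one outlier at (6, 40); the trend would suggest y = 12.
m_out, b_out = fit_simple_ols(xs + [6.0], ys + [40.0])

print(f"without outlier: m = {m_clean:.2f}, b = {b_clean:.2f}")
print(f"with outlier:    m = {m_out:.2f}, b = {b_out:.2f}")
```

One extra point moves the slope from about 2 to about 6, so the fitted line no longer describes the other five points well.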

* Linear regression can be used in problems like house price predictions and also it can be used in some business to evaluate trends and make estimates or forecasts. For example, if a company's sales have increased steadily every month for the past few years, by conducting a linear analysis on the sales data with monthly sales, the company could forecast sales in future months.
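
For the feature scaling note in this section, one common choice is standardization: subtract the mean of each feature and divide by its standard deviation. The sketch below is one possible way to do it with the standard library; the house-price style feature values are invented for illustration.

```python
import statistics


def standardize(values):
    """Scale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Invented features on very different scales.
area_sqft = [850.0, 1200.0, 1500.0, 2100.0]
bedrooms = [1.0, 2.0, 3.0, 4.0]

area_scaled = standardize(area_sqft)
bedrooms_scaled = standardize(bedrooms)

print("area    :", [round(v, 3) for v in area_scaled])
print("bedrooms:", [round(v, 3) for v in bedrooms_scaled])
```

After scaling, both features vary over a similar range, so neither dominates the gradient simply because of its units.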