#### Identifying Nonlinearity in Data

The linear regression model assumes that there is a linear relationship between the predictors and the response variable. However, if the true relationship is nonlinear, then virtually all of the conclusions that we draw from the fit do not hold much credibility. In addition, the prediction accuracy of the model can be reduced significantly. In the forthcoming video, Anjali explains how we can identify nonlinearity in data.

1. <b> For Simple Linear Regression: </b>
      Plot the independent variable against the dependent variable to check for nonlinear patterns.

2. <b> For Multiple Linear Regression </b>, since there are multiple predictors, we instead plot the residuals versus the predicted values. Ideally, the residual plot will show no observable pattern. If a pattern is observed, it may indicate a problem with some aspect of the linear model. Apart from that:
    1. Residuals should be randomly scattered around 0.
    2. The spread of the residuals should be constant.
    3. There should be no outliers in the data.
    4. The residuals should not show a bimodal distribution.
    
If nonlinearity is present, then we may need to plot each predictor against the residuals to identify which predictor is nonlinear.
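
As a minimal sketch of this residual check, the snippet below fits a straight line to data that is actually quadratic and prints the residuals next to the predicted values. The data and the helper `fit_line` are hypothetical, chosen only for illustration, and plain Python is used instead of a plotting or modelling library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x with a single predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Hypothetical data: the true relationship is quadratic, not linear.
xs = [float(x) for x in range(0, 11)]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

for p, r in zip(predicted, residuals):
    print(f"predicted = {p:7.2f}   residual = {r:7.2f}")
```

The residuals are positive at both ends of the predicted range and negative in the middle. This U-shaped pattern is the kind of systematic structure that suggests the linear fit is missing a curved relationship.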

https://www.mathsisfun.com/sets/functions-common.html

There are three methods to handle nonlinear data:
1. Polynomial regression
2. Data transformation
3. Nonlinear regression


1. <b> Polynomial Regression </b>: 
   The kth-order polynomial model in one variable is given by:
   
   $ y = {\beta}_{0} + {\beta}_{1} x + {\beta}_{2} x^{2} + \dots + {\beta}_{k} x^{k} + {\epsilon} $

The polynomial terms to add depend on the shape of the residuals-versus-predicted-values plot. For the feature that causes the problem, add polynomial terms of that feature as new predictors. For example, if the scatter plot looks like a parabola, create a new feature that is the square of the original one. <b> In polynomial regression, we need to include the lower-degree polynomial terms in the model as well. </b>
Example: suppose the scatter plot is:

![image.png](attachment:image.png)

The predictors to include would be $ x, x^2, x^3 $, and the linear regression equation will be:

  $ y = {\beta}_{0}  + {\beta}_{1} x + {\beta}_{2} x^{2} + {\beta}_{3} x^{3} + {\epsilon} $
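
The cubic model above can be fitted with ordinary least squares once the columns $ x, x^2, x^3 $ are created. The sketch below does this in plain Python by solving the normal equations; the data, `solve`, and `fit_polynomial` are hypothetical names used only for illustration, and in practice a library routine would usually replace the hand-written solver.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_polynomial(xs, ys, degree):
    """Least-squares fit of y = b0 + b1*x + ... + bk*x^k via the normal equations."""
    # Design matrix with columns 1, x, x^2, ..., x^k (lower-degree terms included).
    design = [[x ** j for j in range(degree + 1)] for x in xs]
    cols = range(degree + 1)
    xtx = [[sum(row[i] * row[j] for row in design) for j in cols] for i in cols]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in cols]
    return solve(xtx, xty)

# Hypothetical noise-free cubic data.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [2.0 + 0.5 * x - 1.5 * x ** 2 + 0.25 * x ** 3 for x in xs]

coeffs = fit_polynomial(xs, ys, degree=3)
print([round(c, 6) for c in coeffs])
```

Because the data here contains no noise, the fitted coefficients should come out very close to the generating values 2.0, 0.5, -1.5 and 0.25. The model is still linear in the betas, which is why a linear solver is enough.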
  
2. <b> Data Transformation </b>
If the residual plot indicates the presence of nonlinear relations in the data, then a simple approach is to use nonlinear transformations of the predictors. For instance, for a predictor x, these transformations can be log(x), sqrt(x), exp(x), etc., in the regression model.

We need to remember that although log is the most commonly used function for transformations, it is not the only one that we can use. There are a few other functions that we can use depending upon the shape of the data.

$ \log y = {\beta}_{0}  + {\beta}_{1} x + {\beta}_{2} x^{2} + {\beta}_{3} x^{3} + {\epsilon} $
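
A minimal sketch of a response transformation, assuming hypothetical data that grows exponentially: taking the log of y turns the curve into a straight line, which a simple linear fit can then recover. The helper `fit_line` and the constants 3.0 and 0.8 are illustrative assumptions, not values from these notes.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x with a single predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1

# Hypothetical data following y = 3 * exp(0.8 * x), which is nonlinear in x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0 * math.exp(0.8 * x) for x in xs]

# After the transformation, log(y) = log(3) + 0.8 * x is linear in x.
log_ys = [math.log(y) for y in ys]
log_b0, log_b1 = fit_line(xs, log_ys)

print(f"slope on log scale     = {log_b1:.4f}")
print(f"exp(intercept)         = {math.exp(log_b0):.4f}")
```

The slope on the log scale corresponds to the growth rate, and exponentiating the intercept gives back the multiplicative constant.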

<b> How do we decide when and what to transform? </b>

Essentially, to handle nonlinear data, we may have to try different transformations on the data to determine a model that fits it well. Hence, we may try polynomial models or transformations of the x-variable(s) or the y-variable, or both. These transformations can be square root, logarithmic or reciprocal transformations, although this is not an exhaustive list.

A. When the predictor variables have a nonlinear relationship with the response variable, transform the predictor variables.

B. When the problem is non-normality of the error terms (residuals) and unequal variance, consider transforming the response variable (this can also help with nonlinearity).

C. When the regression function is nonlinear and the error terms are also non-normal with unequal variance, consider transforming both the response variable and the predictor variables.
        
Polynomial regression and data transformation allow us to stay within the linear regression framework. After transforming the predictors, when we fit the model, we still check whether the model follows the assumptions so that we can trust the results that we get from the model.

3. <b> Non-Linear Regression: </b>
All of the models that we have discussed so far have been linear in terms of the parameters (i.e., linear in terms of the betas). However, for models in which the response variable is related nonlinearly to the parameters (the model coefficients), we use nonlinear regression. For example:

$ y_{i} = \frac{{\beta}_{1}}{1 + e^{{\beta}_{2} + {\beta}_{3} x_{i}}} + {\epsilon}_{i} $
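
One way to check whether a model is linear in the parameters, offered here as a short worked illustration: write the model as $ y = f(x, \beta) + \epsilon $ and look at the partial derivatives with respect to each beta. For the cubic polynomial model,

$ \frac{\partial f}{\partial {\beta}_{2}} = x^{2} $

which does not involve any of the coefficients, so the model is linear in the parameters. For a model with numerator $ {\beta}_{1} $ and exponent $ {\beta}_{2} + {\beta}_{3} x $ in the denominator,

$ \frac{\partial f}{\partial {\beta}_{1}} = \frac{1}{1 + e^{{\beta}_{2} + {\beta}_{3} x}} $

which still depends on $ {\beta}_{2} $ and $ {\beta}_{3} $. Such a model is nonlinear in the parameters, and its coefficients are typically estimated with iterative numerical methods rather than the closed-form least-squares solution.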


When we fit a linear regression model to a particular data set, many problems may arise. Most common among these are the following:

1. Non-constant variance

Constant variance of the error terms is one of the assumptions of linear regression. Unfortunately, we often observe error terms whose variance is not constant. As discussed earlier, as we move from left to right on the residual plots, the variance of the error terms may show a steady increase or decrease. This is termed heteroscedasticity.

When faced with this problem, one possible solution is to transform the response Y using a function such as log or the square root of the response value. Such a transformation results in a greater amount of shrinkage of the larger responses, leading to a reduction in heteroscedasticity.
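
The shrinkage effect of the log can be seen on a small set of made-up numbers. In the sketch below, the two hypothetical groups have a spread that grows in proportion to the level of the response; `spread` is an illustrative helper that uses the range as a crude measure of variability.

```python
import math

# Hypothetical responses: the spread grows in proportion to the level of y.
low_group = [9.0, 10.0, 11.0]
high_group = [90.0, 100.0, 110.0]

def spread(values):
    """Range (max - min), used here as a crude measure of variability."""
    return max(values) - min(values)

raw_low, raw_high = spread(low_group), spread(high_group)
log_low = spread([math.log(v) for v in low_group])
log_high = spread([math.log(v) for v in high_group])

print(f"raw spread: {raw_low:.2f} vs {raw_high:.2f}")
print(f"log spread: {log_low:.4f} vs {log_high:.4f}")
```

On the raw scale, the second group is ten times more spread out than the first. On the log scale, the two spreads are essentially equal, because the log compresses the larger responses more strongly.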

 
2. Autocorrelation

This happens when data is collected over time and the model fails to detect any time trends. Due to this, errors in the model are correlated positively over time, such that each error point is more similar to the previous error. This is known as autocorrelation, and it can sometimes be detected by plotting the model residuals versus time. Such correlations frequently occur in the context of time series data, which consists of observations for which measurements are obtained at discrete points in time.
 
To determine whether this is the case for a given data set, we can plot the residuals from our model as a function of time. If the errors are uncorrelated, then there should be no observable pattern. If, on the other hand, consecutive values appear to follow each other closely, then we may want to try an autoregression model.
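
Alongside the plot, a numeric summary can help. The sketch below computes the lag-1 autocorrelation of the residuals, i.e., how strongly each residual is related to the one before it. Both residual series are hypothetical, and `lag1_autocorrelation` is an illustrative helper rather than a library function.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the residual one time step earlier."""
    n = len(residuals)
    mean = sum(residuals) / n
    num = sum((residuals[t] - mean) * (residuals[t - 1] - mean) for t in range(1, n))
    den = sum((r - mean) ** 2 for r in residuals)
    return num / den

# Hypothetical residuals ordered by time.
drifting = [-3.0, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.0]    # consecutive values follow each other
scattered = [1.0, 0.5, -2.0, -1.0, 2.0, 1.5, -1.5, -0.5]   # no clear time pattern

rho_drifting = lag1_autocorrelation(drifting)
rho_scattered = lag1_autocorrelation(scattered)

print(f"drifting residuals : {rho_drifting:.3f}")
print(f"scattered residuals: {rho_scattered:.3f}")
```

A value well above 0 for the drifting series indicates positive autocorrelation, while a value near 0 for the scattered series is what we would hope to see when the errors are uncorrelated.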

 

3. Multicollinearity

If two or more of the predictors are linearly related to each other when building a model, then these variables are considered multicollinear. A simple method to detect collinearity is to look at the correlation matrix of the predictors. In this correlation matrix, if we have a high absolute value for any two variables, then they can be considered highly correlated. A better method to detect multicollinearity is to calculate the variance inflation factor (VIF), which you studied in the Linear Regression module.

When faced with the problem of collinearity, we can try a few different approaches. One is to drop one of the problematic variables from the regression model. The other is to combine the collinear variables together into a single predictor. Regularization (which we will discuss in the next session) helps here as well.
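
For intuition, the VIF of a predictor is $ 1 / (1 - R^2) $, where $ R^2 $ comes from regressing that predictor on the other predictors. With only two predictors, $ R^2 $ is simply their squared correlation, which keeps the sketch below short. The data and the helper names are hypothetical and chosen only for illustration.

```python
def correlation(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    sab = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    saa = sum((x - a_mean) ** 2 for x in a)
    sbb = sum((y - b_mean) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def vif_two_predictors(a, b):
    """VIF = 1 / (1 - R^2); with only two predictors, R^2 is the squared correlation."""
    r = correlation(a, b)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]    # roughly 2 * x1, so nearly collinear
x3 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0]   # almost unrelated to x1

vif_collinear = vif_two_predictors(x1, x2)
vif_unrelated = vif_two_predictors(x1, x3)

print(f"VIF for x1 with x2: {vif_collinear:.1f}")
print(f"VIF for x1 with x3: {vif_unrelated:.3f}")
```

The nearly collinear pair gives a very large VIF, while the almost unrelated pair gives a VIF close to 1, the smallest value it can take.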

4. Overfitting

When a model is too complex, it may lead to overfitting. It means the model may produce good training results but would fail to perform well on the test data. One possible solution for overfitting is to increase the amount and diversity of the training data. Another solution is regularization, which we will cover in the next session. 

5. Extrapolation

Extrapolation occurs when we use a linear regression model to make predictions for predictor values that are not present in the range of data used to build the model. For instance, suppose we have built a model to predict the weight of a child given its height, which ranges from 3 to 5 feet. If we now make predictions for a child with height greater than 5 feet or less than 3 feet, then we may get incorrect predictions. The predictions are valid only within the range of values that are used for building the model. Hence, we should not extrapolate beyond the scope of the model.

#### Regularisation

When a model performs really well on the data that is used to train it, but does not perform well with unseen data, we know we have a problem: overfitting. Such a model will perform very well with training data and, hence, will have very low bias; but since it does not perform well with unseen data, it will show high variance. 

In other words, bias in a model is high when it does not perform well on the training data itself, and variance is high when the model does not perform well on the test data. Please note that a model failing to fit the test data means that the model's results on the test data vary a lot as the training data changes. This may be because the model coefficients do not have high reliability.

There is a trade-off between bias and variance with respect to model complexity. A simple model would usually have high bias and low variance, whereas a complex model would have low bias and high variance. In either case, the total error would be high.

What we need is lowest total error, i.e., low bias and low variance, such that the model identifies all the patterns that it should and is also able to perform well with unseen data.
 

For this, we need to manage model complexity: It should neither be too high, which would lead to overfitting, nor too low, which would lead to a model with high bias (a biased model) that does not even identify necessary patterns in the data.

Regularization helps with managing model complexity by essentially shrinking the model coefficient estimates towards 0. This discourages the model from becoming too complex, thus avoiding the risk of overfitting.

Model Complexity depends on:

1. Number of coefficients
2. Magnitude of coefficients

Cost = RSS + Penalty

When building an OLS model, we want to estimate the coefficients for which the cost/loss, i.e., RSS, is minimum. Optimising this cost function results in model coefficients with the least possible bias, although the model may have overfitted and hence have high variance. 

In case of overfitting, we know that we need to manage the model’s complexity by primarily taking care of the magnitudes of the coefficients. The more extreme the values of the coefficients are (large positive or negative values), the more complex the model is and, hence, the higher the chances of overfitting.

When we use regularization, we add a penalty term to the model’s cost function.

Here, the cost function would be Cost = RSS + Penalty.
 
Adding this penalty term in the cost function helps suppress or shrink the magnitude of the model coefficients towards 0. This discourages the creation of a more complex model, thereby preventing the risk of overfitting.
 
When we add this penalty and try to get the model parameters that optimise this updated cost function (RSS + Penalty), the coefficients that we get given the training data may not be the best (they may be more biased). However, with this minor compromise in terms of bias, the variance of the model may see a marked reduction. Essentially, with regularization, we compromise by allowing a little bias for a significant reduction in variance.

We also need to remember three points about the model coefficients that we obtain from OLS:

1. These coefficients can be highly unstable. This can happen when only a few of the predictors that we have considered to build our model are significantly related to the response variable and the rest are not very helpful, behaving essentially like random variables.

2. There may be a large variability in the model coefficients due to these unrelated random variables, such that even a small change in the training data may lead to a large change in the model coefficients. Such model coefficients are no longer reliable, since we may get different coefficient values each time we retrain the model.

3. Multicollinearity, i.e., the presence of highly correlated predictors, may be another reason for the variability of model coefficients. Regularization helps here as well.

<b>To summarize, we use regularization because we want our models to work well with unseen data, without missing out on identifying underlying patterns in the data. For this, we are willing to make a compromise by allowing a little bias for a significant reduction in variance. We also understood that the more extreme the values of the model coefficients are, the higher the chances of model overfitting. Regularization prevents this by shrinking the coefficients towards 0. In the next two segments, we will discuss the two techniques of regularization, Ridge and Lasso, and understand how the penalty term helps with the shrinkage.</b>

<b> Ridge Regression </b>

Cost function for OLS: $ \sum \limits_{i=1} ^{n} (y_{i} - \hat{y}_{i})^2 $

Cost function for Ridge: $ \sum \limits_{i=1} ^{n} (y_{i} - \hat{y}_{i})^2 + {\lambda} \sum \limits_{j=1} ^{p} {\beta}^2_{j} $

In OLS, we get the best coefficients by minimising the residual sum of squares (RSS). Similarly, with Ridge regression also, we estimate the model coefficients, but by minimising a different cost function. This cost function adds a penalty term to the RSS. The penalty term is lambda multiplied by the sum of squared model coefficients. In the cost function, the penalty term, also called the shrinkage penalty, would be small only if the coefficients are small, i.e., close to 0. Hence, while fitting the Ridge regression model, since we need to find out the model coefficients that minimize the entire cost, i.e., RSS and a penalty, it would have the effect of shrinking the model coefficients, i.e., the betas, towards 0.

Now, what is the role of lambda here? It is a tuning hyperparameter that controls how strongly the model is regularised. If lambda is 0, then the cost function would not contain the penalty term and there will be no shrinkage of the model coefficients. They would be the same as those from OLS. However, as lambda moves towards higher values, the shrinkage penalty increases, pushing the coefficients further towards 0, which may lead to model underfitting. Choosing an appropriate lambda becomes crucial: If it is too small, then we would not be able to solve the problem of overfitting, and with too large a lambda, we may actually end up underfitting.
 

Another point to note is that in OLS, we will get only one set of model coefficients when the RSS is minimised. However, in Ridge regression, for each value of lambda, we will get a different set of model coefficients.
<b>
To summarise:

1. Ridge regression has a particular advantage over OLS when the OLS estimates have high variance, i.e., when they overfit. Regularization can significantly reduce model variance while not increasing bias much. 

2. The tuning parameter lambda helps us determine how much we wish to regularize the model. The higher the value of lambda, the smaller the magnitude of the model coefficients and the stronger the regularization. 

3. Choosing the right lambda is crucial so as to reduce only the variance in the model, without compromising much on identifying the underlying patterns, i.e., the bias.  A large lambda implies a simpler model. Therefore, a simpler model would have higher bias and lower variance. 

4. It is important to standardise the data when working with Ridge regression.
    
5. The model coefficients of ridge regression can shrink very close to 0 but do not become 0 and hence there is no feature selection with ridge regression.

Ridge regression does have one obvious disadvantage. It would include all the predictors in the final model. This may not affect the accuracy of the predictions but can make model interpretation challenging when the number of predictors is very large.
</b>
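
The shrinkage effect of lambda can be seen in the simplest possible case. For a single centred predictor and no intercept, minimising the Ridge cost $ \sum (y_{i} - {\beta} x_{i})^2 + {\lambda} {\beta}^2 $ gives $ {\beta} = \sum x_{i} y_{i} / (\sum x_{i}^2 + {\lambda}) $. The sketch below evaluates this for a few values of lambda; the data and the helper `ridge_slope` are hypothetical and only meant as an illustration.

```python
def ridge_slope(xs, ys, lam):
    """Minimiser of sum((y - b*x)^2) + lam * b^2 for a single centred predictor."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical centred data with a slope of roughly 2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.1, 2.0, 3.9]

lambdas = [0.0, 1.0, 10.0, 100.0]
ridge_slopes = [ridge_slope(xs, ys, lam) for lam in lambdas]

for lam, b in zip(lambdas, ridge_slopes):
    print(f"lambda = {lam:6.1f}   slope = {b:.4f}")
```

With lambda equal to 0, the slope is the OLS estimate. As lambda grows, the slope moves steadily towards 0 but never reaches it exactly, which matches the point that Ridge does not perform feature selection.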


<b> Lasso Regression </b>

Cost function for Lasso: $ \sum \limits_{i=1} ^{n} (y_{i} - \hat{y}_{i})^2 + {\lambda} \sum \limits_{j=1} ^{p}  |{\beta}_{j} | $

The primary difference between Lasso and Ridge regression is their penalty term. The penalty term here is the sum of the absolute values of all the coefficients present in the model. As with Ridge regression, Lasso regression shrinks the coefficient estimates towards 0. However, there is one difference. With Lasso, the penalty pushes some of the coefficient estimates to be exactly 0, provided the tuning parameter, λ, is large enough.

Hence, Lasso performs feature selection. Because of this, it is easier to interpret models generated by Lasso as compared with those generated by Ridge. Choosing an appropriate value of lambda is critical here as well. Also, just like with Ridge regression, standardisation of variables is necessary for Lasso.
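
The way Lasso produces exact zeros can also be seen in the single-predictor case. For a centred predictor and no intercept, minimising $ \sum (y_{i} - {\beta} x_{i})^2 + {\lambda} |{\beta}| $ gives a soft-thresholded estimate: $ |\sum x_{i} y_{i}| $ is reduced by $ {\lambda} / 2 $, and the result is set to 0 if it would become negative. The sketch below applies this; the data and the helper `lasso_slope` are hypothetical and only illustrative.

```python
import math

def lasso_slope(xs, ys, lam):
    """Minimiser of sum((y - b*x)^2) + lam * |b| for a single centred predictor."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)     # soft-thresholding step
    return math.copysign(shrunk, sxy) / sxx

# Hypothetical centred data with a slope of roughly 2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.1, 2.0, 3.9]

lambdas = [0.0, 10.0, 30.0, 50.0]
lasso_slopes = [lasso_slope(xs, ys, lam) for lam in lambdas]

for lam, b in zip(lambdas, lasso_slopes):
    print(f"lambda = {lam:5.1f}   slope = {b:.4f}")
```

Unlike the Ridge penalty, which only scales the coefficient down, the absolute-value penalty subtracts a fixed amount, so a large enough lambda drives the coefficient to exactly 0.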

<b> 

To summarise:

1. The behaviour of Lasso regression is similar to that of Ridge regression.
2. With an increase in the value of lambda, variance reduces with a slight compromise in terms of bias.
3. Lasso also pushes the model coefficients towards 0 in order to handle high variance, just like Ridge regression. But, in addition to this, Lasso also pushes some coefficients to be exactly 0 and thus performs variable selection.
4. This variable selection results in models that are easier to interpret.
    
</b>

Generally, Lasso should perform better in situations where only a few among all the predictors that are used to build our model have a significant influence on the response variable. So, feature selection, which removes the unrelated variables, should help. But Ridge should do better when all the variables have almost the same influence on the response variable. 
