# Heteroscedasticity tests

Regression results depend on statistical assumptions, so the results are only correct if those assumptions hold (at least approximately).

- **Homoscedasticity** - If the residuals have constant variance, i.e., they are spread evenly around the regression line across the whole range of fitted values, then the data is said to be homoscedastic.
- **Heteroscedasticity** - If the variance of the residuals is not constant, i.e., their spread changes along the regression line, then the data is said to be heteroscedastic. In this case, the residuals can form a funnel shape or any other uneven pattern.

The null hypothesis for these tests is that all observations have the same error variance, i.e., the errors are homoscedastic. The tests differ in the type of heteroscedasticity they consider as the alternative hypothesis.

**het_goldfeldquandt**: It is used to test for the presence of heteroscedasticity in the given data.

More documentation: [Regression Diagnostics and Specification Tests](https://www.statsmodels.org/stable/diagnostic.html#heteroscedasticity-tests)

Some heteroscedasticity tests:
    
- **het_breuschpagan**: Lagrange Multiplier Heteroscedasticity Test by Breusch-Pagan
- **het_white**: Lagrange Multiplier Heteroscedasticity Test by White
- **het_goldfeldquandt**: test whether variance is the same in 2 subsamples

```python
# Import the Goldfeld-Quandt test from statsmodels
from statsmodels.stats.diagnostic import het_goldfeldquandt
```

Parameters:

- $ y $: array_like. endogenous variable
- $ x $: array_like. exogenous variable, regressors
- $ idx $: int, default None. column index of variable according to which observations are sorted for the split
- $ split $: {int, float}, default None. If an integer, this is the index at which the sample is split. If a float with $ 0 < split < 1 $, then split is interpreted as the fraction of the observations in the first sample. If None, uses nobs//2.
- $ drop $: {int, float}, default None. If this is not None, then observations are dropped from the middle part of the sorted series. If $ 0 < drop < 1 $, then drop is interpreted as the fraction of the number of observations to be dropped.
- $ alternative $: {“increasing”, “decreasing”, “two-sided”}. The default is increasing. This specifies the alternative for the p-value calculation.
- $ store $: bool, default False. Flag indicating to return the regression results

Returns:

- $ fval $: float. value of the F-statistic
- $ pval $: float. p-value of the hypothesis that the variance in one subsample is larger than in the other subsample
- $ ordering $: str. The ordering used in the alternative.
- $ res_store $: ResultsStore, optional. Storage for the intermediate and final results that are calculated

**We will use the Goldfeld–Quandt test to check homoscedasticity. Hypotheses:**
- Null hypothesis: Residuals are homoscedastic
- Alternate hypothesis: Residuals are heteroscedastic

> If $ pval < 0.05 $, Ho is rejected, meaning that the residuals are heteroscedastic. <br>
> If $ pval > 0.05 $, Ho is not rejected, meaning that there is no evidence against homoscedastic residuals.

----------------

# Evaluations required in Linear Regression Models

* Multicollinearity --> Multicollinearity among the features should be removed by dropping the columns that do not add much value to the model.
    - We can check for multicollinearity with **variance_inflation_factor** (statsmodels). A VIF of around 5 or more indicates moderate multicollinearity; a VIF greater than 10 indicates high multicollinearity.
    - Validate the p-values in the summary of the fitted model, **sm.OLS(...).fit().summary()** (statsmodels). Check the "P>|t|" column.
        - **P >|t|**: It is the p-value.
        - For each independent feature, there is a null hypothesis and an alternate hypothesis.
            - Ho : Independent feature is not significant. 
            - Ha : Independent feature is significant. 
        - A p-value of less than 0.05 is considered statistically significant at a confidence level of 95%. 
        - If the p-value is less than the significance level of 0.05, then we reject the null hypothesis in favor of the alternate hypothesis. In other words, we have enough statistical evidence that there is some relationship between the independent variable and the dependent variable.
    
* Assumptions:
    - (1) Mean of residuals should be 0 --> ols_model.resid.mean()
    - (2) Normality of error terms
    > Error terms/residuals should be normally distributed. <br>
    > To check it, we can plot the histogram of the residuals. <br>
    > If the residuals are not normal, apply a transformation such as **log**, **arcsinh**, **exponential**, etc.
    - (3) Linearity of variables
    > Predictor variables must have a linear relation with the dependent variable. <br>
    > To test, plot the residuals (**ols_model.resid**) against the fitted values (**ols_model.fittedvalues**) to ensure there is no pattern. We can use **sns.residplot**. <br>
    > If a pattern is found, a transformation needs to be applied to the target/dependent variable, e.g., log.
    - (4) No heteroscedasticity
    > Use the Goldfeld–Quandt test (**sms.het_goldfeldquandt** from statsmodels) to check homoscedasticity. <br>
    > Hypotheses:
    > - Null hypothesis: Residuals are homoscedastic
    > - Alternate hypothesis: Residuals are heteroscedastic
    >
    > Interpretation:
    > - If $ pval < 0.05 $, Ho is rejected, meaning that the residuals are heteroscedastic.
    > - If $ pval > 0.05 $, Ho is not rejected, meaning that there is no evidence against homoscedastic residuals.
    > 
    > To test: <br>
    > `import statsmodels.stats.api as sms` <br>
    > `from statsmodels.compat import lzip` <br>
    > `name = ["F statistic", "p-value"]` <br>
    > `test = sms.het_goldfeldquandt(target, features)` <br>
    > `lzip(name, test)` <br>
    > Output example: <br>
    > `[('F statistic', 0.9395156175145154), ('p-value', 0.9790604597916552)]` <br>
    > Check whether the p-value is < 0.05. In this case it is not, so Ho is not rejected and the data can be treated as homoscedastic.

-----

# Ridge Regression

Because ordinary linear or polynomial regression gives unstable coefficient estimates when the independent variables are highly collinear, Ridge regression can be used to tackle such situations.

When we have more parameters than samples, ordinary least squares has no unique solution, whereas Ridge regression still does, so such problems are easier to solve with Ridge regression.

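The case of more parameters than samples can be illustrated with the closed-form ridge solution. This is a minimal NumPy sketch, assuming random data, no intercept, and a hypothetical penalty strength `alpha = 1.0`. It shows that the matrix $ X^T X $ is rank deficient when there are more features than samples, so the ordinary least squares normal equations have no unique solution, while adding `alpha` times the identity matrix makes the system solvable. In practice a library implementation would normally be used, and `alpha` would be tuned.

```python
import numpy as np

# Random data (illustrative only): more features than samples
rng = np.random.default_rng(3)
n_samples, n_features = 20, 50
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# X^T X is 50 x 50 but its rank cannot exceed the number of samples
gram = X.T @ X
rank = np.linalg.matrix_rank(gram)
print("rank of X^T X:", rank, "out of", n_features)

# Ridge adds alpha * I, which makes the matrix invertible for alpha > 0
alpha = 1.0  # hypothetical penalty strength
ridge_coef = np.linalg.solve(gram + alpha * np.eye(n_features), X.T @ y)
print("number of ridge coefficients:", ridge_coef.shape[0])
```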
-----

# Elastic Net Regression

Elastic Net is a regularized regression model that combines $ L_1 $ and $ L_2 $ penalties, i.e., those of lasso and ridge regression. As a result, it can shrink coefficients as ridge does while also setting some of them exactly to zero as lasso does.

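The combination of the two penalties can be written as a single objective. The form below is one common way to write it. The symbols $ \lambda_1 $ and $ \lambda_2 $ are notation introduced here for the penalty strengths, and libraries often parameterize the same objective differently, for example with an overall strength and a mixing ratio. Setting $ \lambda_1 = 0 $ gives ridge regression and setting $ \lambda_2 = 0 $ gives lasso.

```latex
\hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2
  + \lambda_1 \lVert \beta \rVert_1
  + \lambda_2 \lVert \beta \rVert_2^2
```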
-----

# Why should we do feature selection?

- Reduces dimensionality
- Discards deceptive features, which appear to aid learning on the training set but impair generalization
- Speeds training/testing

-----