## Goodness-of-Fit

### Overview
- Goodness-of-fit measures compare the observed values to the values that the model expects.
- Smaller discrepancies between the observed and the expected values represent a better fit.
- These measures include R-squared, predicted R-squared, adjusted R-squared, the standard error of the regression, and the overall F-test of significance.

### Assessing the Goodness-of-Fit
- A residual is the difference between an observed value and its fitted value.
- Linear regression identifies the equation that produces the smallest overall difference between the observed values and their fitted values.
- In other words, linear regression finds the smallest sum of squared residuals that is possible for the dataset.
- The regression model fits the data well when the differences between the observations and predicted values are small and unbiased.
- Unbiased in this context means that the fitted values are not systematically too high or too low anywhere in the observation space.
- However, before assessing numeric measures of goodness-of-fit, like R-squared, we should evaluate the residual plots.
- Residual plots can expose a biased model far more effectively than numeric output by displaying problematic patterns in the residuals.
- If the model is biased, we cannot trust the results. Only if the residual plots look good should we go on to assess R-squared and the other statistics.
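
The least-squares idea described above can be illustrated with a minimal sketch. It assumes a single predictor and uses made-up data; the helper names `fit_line` and `sse_for` are hypothetical and not part of any library.

```python
def fit_line(x, y):
    """Closed-form ordinary least squares for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sse_for(b0, b1, x, y):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up illustrative data (an assumption, not taken from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
best_sse = sse_for(b0, b1, x, y)

print("intercept, slope:", round(b0, 3), round(b1, 3))
print("residuals:", [round(r, 3) for r in residuals])
print("SSE at the least-squares fit:", round(best_sse, 4))
print("SSE with a nudged slope:", round(sse_for(b0, b1 + 0.05, x, y), 4))
```

Nudging either coefficient away from the least-squares solution can only increase the sum of squared residuals. For a model with an intercept, the residuals also sum to zero (up to rounding), which is a numeric counterpart of "not systematically too high or too low" overall. It does not replace a residual plot, because a pattern can still hide in residuals that sum to zero.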

### R-squared
- After fitting a linear regression model, we need to determine how well the model fits the data.
- R-squared evaluates the scatter of the data points around the fitted regression line.
- It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression.
- For the same dataset, higher R-squared values represent smaller differences between the observed data and the fitted values.
- R-squared is the percentage of the variation in the dependent variable that a linear model explains.
$$
R^2 = \frac{\text{Variance explained by the model}}{\text{Total Variance}}
$$
- For a least-squares linear model with an intercept, R-squared is always between 0 and 100%.
    - 0% represents a model that does not explain any of the variation in the response variable around its mean. In that case, the mean of the dependent variable predicts the dependent variable as well as the regression model does.
    - 100% represents a model that explains all of the variation in the response variable around its mean.
- Usually, the larger the $R^2$, the better the regression model fits your observations.  
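
As a rough illustration of the formula above, the sketch below computes R-squared as $1 - SSE/SST$, which equals "variance explained over total variance" for a least-squares fit with an intercept. The function name `r_squared` and the data are assumptions made for this example.

```python
def r_squared(observed, fitted):
    """R-squared as 1 - SSE / SST (share of variation around the mean explained)."""
    mean_y = sum(observed) / len(observed)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(observed, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in observed)
    return 1.0 - sse / sst


# Made-up response values, observed at x = 1, 2, 3, 4, 5.
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Three sets of fitted values to compare.
mean_only = [sum(y) / len(y)] * len(y)   # always predict the mean
line_fit = [2.8, 3.4, 4.0, 4.6, 5.2]     # least-squares line 2.2 + 0.6 * x
perfect = list(y)                        # fitted values equal the data

r2_mean = r_squared(y, mean_only)
r2_line = r_squared(y, line_fit)
r2_perfect = r_squared(y, perfect)

print(r2_mean, round(r2_line, 3), r2_perfect)
```

The three cases correspond to the two endpoints described above (0% and 100%) plus an intermediate fit.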

### Visual Representation of R-squared
- To visually demonstrate how R-squared values represent the scatter around the regression line, we can plot the observations with the fitted line that represents the regression equation.
- Like the correlation coefficient, R-squared measures the strength of the relationship between the set of independent variables and the dependent variable.
- Stronger relationships indicate lower scatter.
- Unlike a correlation coefficient, R-squared does not indicate the direction of the relationship.
- When a regression model accounts for more of the variance, the data points are closer to the regression line. For an $R^2$ of 100%, the fitted values equal the data values and consequently all of the observations fall on the regression line.

### Limitations of R-squared
- R-squared cannot be used to determine whether the coefficient estimates and predictions are biased, which is why residual plots should be assessed first.
- R-squared does not indicate if a regression model provides an adequate fit to your data. A good model can have a low $R^2$ value. On the other hand, a biased model can have a high R-squared value.

### Are low R-squared values always a Problem?
- Regression models with low R-squared values can be perfectly good models for the following reasons.
    - Some fields of study, like human behavior, have an inherently greater amount of unexplainable variation. In such areas the $R^2$ values are bound to be lower (often less than 50% in studies of human behavior). People are just harder to predict than physical processes.
- If you have a low R-squared value but the independent variables are statistically significant, you can still draw important conclusions about relationships between the variables.
- If the goal is to understand the nature of the relationships in the data, a low R-squared is probably not a problem.
- Low R-squared values can cause problems when we need to generate predictions that are relatively precise (narrow prediction intervals).
- A high $R^2$ is necessary for precise predictions, but it is not sufficient by itself.

### Are High R-squared values always Great?
- No. A regression model with a high R-squared value can have a multitude of problems.
- The data in the electron mobility fitted line plot follow a very low-noise relationship, and the R-squared is 98.5%, which looks fantastic.
- However, the regression line consistently under-predicts and over-predicts the data along the curve, which is bias.
- The residuals vs. fits plot emphasizes this unwanted pattern. An unbiased model has residuals that are randomly scattered around zero.
- Non-random residual patterns indicate a bad fit despite a high $R^2$. Always check the residual plots.
- This type of specification bias occurs when your linear model is underspecified, which means it is missing significant independent variables, polynomial terms, or interaction terms.
- To produce random residuals, try adding terms to the model or fitting a nonlinear model.
- A variety of other circumstances can artificially inflate $R^2$. These reasons include overfitting the model and data mining.
- In either case, the model looks like it provides an excellent fit to the data, but the results can be entirely deceptive.
- An overfit model is one where the model fits the random quirks of the sample. Data mining can take advantage of chance correlations. In either case, you can obtain a model with a high $R^2$ even for entirely random data.
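
The combination of a high $R^2$ and a biased fit can be reproduced with a small sketch. The data below are a hypothetical stand-in (an exact quadratic, not the electron mobility data), and a straight line is deliberately the wrong model for them.

```python
# Hypothetical curved data: y = x^2 with no noise at all.
x = [float(i) for i in range(1, 11)]
y = [xi ** 2 for xi in x]

# Fit a straight line by ordinary least squares (closed form).
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(r ** 2 for r in residuals)
sst = sum((yi - mean_y) ** 2 for yi in y)
r2 = 1.0 - sse / sst

print("R-squared:", round(r2, 3))
# The signs of the residuals, in order of x, reveal the pattern.
print("".join("+" if r > 0 else "-" for r in residuals))
```

Here the straight line explains roughly 95% of the variation, yet the residuals are positive at both ends and negative in the middle. That run of signs is the kind of non-random pattern a residuals vs. fits plot would expose.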

### Adjusted R-squared and Predicted R-squared
- R-squared tends to reward you for including too many independent variables in a regression model, and it doesn't provide any incentive to stop adding more.
- Adjusted R-squared and predicted R-squared use different approaches to help you fight the impulse to add too many.
- The protection that adjusted R-squared and predicted R-squared provide is critical because too many terms in a model can produce results that you can't trust.
- These statistics help you include the correct number of independent variables in your regression model.
- Does a fitted line graph show an actual relationship, or is it just an overfit model? The following points show how R-squared values can mislead us.
    - Multiple regression analysis is incredibly tempting: it practically begs you to include additional independent variables in your model.
    - Every time you add a variable, the R-squared increases, which tempts you to add more. Some of the independent variables will be statistically significant.
    - This can be an actual relationship or just a chance correlation. So you just add the variables into the model. Higher order polynomials can curve your regression model any way you want.
    - But are these real relationships that you are fitting or just playing connect the dots? Meanwhile R-squared increases, convincing you to add more variables.

#### Problems with R-squared
- **Problem 1**
    - R-squared increases, or at least does not decrease, every time you add an independent variable to the model.
    - It never decreases, not even when it's just a chance correlation between variables. 
    - A regression model that contains more independent variables than another model can look like it provides a better fit merely because it contains more variables.

- **Problem 2**
    - When a model contains an excessive number of independent variables and polynomial terms, it becomes overly customized to fit the peculiarities and random noise in your sample rather than reflecting the entire population.
    - This is called overfitting the model, and it produces deceptively high R-squared values and a decreased capability for precise predictions.

- Adjusted R-squared and predicted R-squared address both of these problems.
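
Problem 1 can be checked numerically. The sketch below is a minimal illustration, not a production regression routine: it fits ordinary least squares through the normal equations with a small hand-written solver (`solve` and `r_squared` are hypothetical helper names), then adds a predictor that is pure random noise. The data and the random seed are assumptions made for the example.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(columns, y):
    """R-squared of an OLS fit of y on an intercept plus the given predictor columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


rng = random.Random(0)
x1 = [float(i) for i in range(20)]
y = [2.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x1]
noise = [rng.gauss(0.0, 1.0) for _ in x1]  # unrelated to y by construction

r2_one = r_squared([x1], y)
r2_two = r_squared([x1, noise], y)
print("one real predictor:       ", r2_one)
print("plus a pure-noise column: ", r2_two)
```

Because the one-predictor model is nested inside the two-predictor model, the second R-squared cannot be lower than the first, even though the added column carries no information about the response.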

#### Adjusted R-squared
- Adjusted R-squared is used to compare the goodness-of-fit for regression models that contain different numbers of independent variables.
- For example, suppose we are comparing a model with five independent variables to a model with one variable, and the five-variable model has a higher R-squared. Is the model with five variables actually a better model, or does it just have more variables?
- To determine this, just compare the adjusted R-squared values.
- The adjusted R-squared adjusts for the number of terms in the model. 
- Importantly, its value increases only when the new term improves the model fit more than expected by chance alone. The adjusted R-squared value actually decreases when the term doesn't improve the model fit by a sufficient amount.
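
The text above does not state a formula, so the sketch below assumes the usual textbook definition, $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of independent variables. The R-squared values and sample size are hypothetical numbers chosen to mirror the one-variable versus five-variable comparison.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared under the usual textbook formula (an assumption here)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


n_obs = 30  # hypothetical sample size

# Hypothetical fits: the five-variable model has a slightly higher R-squared.
adj_one = adjusted_r_squared(0.80, n_obs, 1)
adj_five = adjusted_r_squared(0.81, n_obs, 5)

print("one variable:   R2 = 0.80, adjusted =", round(adj_one, 3))
print("five variables: R2 = 0.81, adjusted =", round(adj_five, 3))
```

With these numbers the five-variable model wins on plain R-squared but loses on adjusted R-squared, because the small gain in fit does not justify four extra terms.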

#### Predicted R-squared
- Predicted R-squared is used to determine how well a regression model makes predictions.
- This statistic helps to identify cases where the model provides a good fit for the existing data but isn't as good at making predictions.
- Predicted R-squared is calculated using the following procedure:
    - Remove one data point from the dataset.
    - Calculate the regression equation from the remaining points.
    - Evaluate how well the model predicts the removed observation.
    - Repeat this for every data point in the dataset.
    - Predicted R-squared is a summary statistic of how well the model predicted all of the observations when each one was removed from the dataset for an iteration of the above process.

- It helps determine whether you are overfitting a regression model. An overfit model includes an excessive number of terms, and it begins to fit the random noise in your sample.
- By its very definition, it's not possible to predict random noise. Consequently, if your model fits a lot of random noise, the predicted R-squared value must fall.
- A predicted R-squared that is distinctly smaller than the R-squared is a warning sign that you are overfitting the model even if the independent variables are statistically significant. Try reducing the number of terms.
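
The leave-one-out procedure described above can be sketched as follows. This is a minimal illustration for a single predictor, and it assumes the common summary $1 - PRESS/SST$, where PRESS is the sum of squared leave-one-out prediction errors; the text does not name the exact summary statistic. The data and helper names are made up.

```python
def fit_line(x, y):
    """Closed-form ordinary least squares for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def total_sum_of_squares(y):
    mean_y = sum(y) / len(y)
    return sum((yi - mean_y) ** 2 for yi in y)


def r_squared(x, y):
    """Ordinary R-squared: fit on all points, evaluate on the same points."""
    b0, b1 = fit_line(x, y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - sse / total_sum_of_squares(y)


def predicted_r_squared(x, y):
    """1 - PRESS / SST, refitting the model once per held-out point."""
    press = 0.0
    for i in range(len(y)):
        x_rest = x[:i] + x[i + 1:]          # remove one data point
        y_rest = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_rest, y_rest)   # refit without it
        press += (y[i] - (b0 + b1 * x[i])) ** 2  # error on the removed point
    return 1.0 - press / total_sum_of_squares(y)


# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 2.3, 2.9, 4.4, 4.8, 6.3, 6.9, 8.1]

r2 = r_squared(x, y)
pred_r2 = predicted_r_squared(x, y)
print("R-squared:          ", r2)
print("predicted R-squared:", pred_r2)
```

For a least-squares fit the predicted value cannot exceed the ordinary R-squared, since each point is predicted by a model that never saw it. The size of the gap is what signals overfitting.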

### Caution about chasing a High R-squared
- There is a certain amount of variability that we can't explain. If we chase a high R-squared by including an excessive number of variables, then we force the model to explain the unexplainable, which is not good.
- While this approach can obtain higher R-squared values, it comes at the cost of misleading regression coefficients, p-values, and R-squared values, as well as imprecise predictions.
- This problem is known as overfitting and it occurs when your model is too complex and begins to model the random noise.
- Adjusted R-squared and predicted R-squared help you resist the urge to add too many independent variables to your model.
    - Adjusted R-squared compares models with different numbers of variables.
    - Predicted R-squared can guard against models that are too complicated.

- Remember, the great power that comes with multiple regression analysis requires restraint to use it wisely!

### Standard Error of Regression vs. the R-squared
- The standard error of the regression (S) provides an absolute measure of the typical distance that the data points fall from the regression line. S is in the units of the dependent variable.
- R-squared provides a relative measure of the percentage of the dependent variable's variance that the model explains. R-squared can range from 0 to 100%.

#### Standard Error of Regression and R-squared in practice
- S, or the standard error of the regression, tells you how precise the model's predictions are using the units of the dependent variable. This statistic indicates how far the data points are from the regression line on average.
- We want lower values of S, as this signifies that the differences between the data points and the fitted values are smaller. S is also valid for both linear and non-linear regression models.
- This fact is useful when we need to compare the fit between both types of models.
- Higher R-squared values indicate that the data points are closer to the fitted values. While higher R-squared values are good, they don't tell you how far the data points are from the regression line.
- Additionally, R-squared is valid only for linear models. We cannot use R-squared to compare a linear model to a non-linear model.
- S measures the precision of the model's predictions. Consequently, we can use S to obtain a rough estimate of the 95% prediction interval.
- This means about 95% of the data points fall within $\pm 2 \times \text{standard error of the regression}$ of the fitted line.
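
The rough $\pm 2S$ band can be sketched as below. The example assumes a single predictor, so S is taken as $\sqrt{SSE/(n-2)}$ (in general the divisor is $n$ minus the number of estimated coefficients), and the data are made up.

```python
import math

# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Ordinary least-squares line (closed form).
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

fitted = [b0 + b1 * xi for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Standard error of the regression, in the units of y.
# Two coefficients (intercept and slope) were estimated, hence n - 2.
s = math.sqrt(sse / (n - 2))

# Rough 95% band around each fitted value.
band = [(fi - 2 * s, fi + 2 * s) for fi in fitted]
inside = sum(lo <= yi <= hi for yi, (lo, hi) in zip(y, band))

print("S =", round(s, 4))
print(inside, "of", n, "observations fall within +/- 2S of the fitted line")
```

This is only the quick approximation mentioned above. An exact prediction interval also accounts for uncertainty in the estimated coefficients and widens away from the center of the data.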

### The F-test of Overall Significance
- The F-test of overall significance indicates whether your linear regression model provides a better fit to the data than a model that contains no independent variables.
- An F-test is a type of statistical test that is very flexible and can be used in a wide variety of settings.
- F-tests can evaluate multiple model terms simultaneously, which allows them to compare the fits of different linear models. In contrast, t-tests can evaluate only one term at a time.
- The overall F-test is calculated by comparing two models: the model you specify and the model with no independent variables, which is also known as the intercept-only model.
- The F-test has the following two hypotheses:
    - The **null hypothesis** states that the model with no independent variables fits the data as well as your model.
    - The **alternative hypothesis** states that your model fits the data better than the intercept-only model.
- The result for the overall F-test can be found in the ANOVA table.
- Compare the p-value for the F-test to your significance level. If the p-value is less than the significance level, your sample data provides sufficient evidence to conclude that your regression model fits the data better than the model with no independent variables.
- Typically, if none of the independent variables are statistically significant, the overall F-test is also not statistically significant.
- Occasionally, the t-tests for coefficients and the overall F-test can produce conflicting results. This occurs because the F-test assesses all of the coefficients jointly, whereas the t-test for each coefficient examines them individually.
- The F-test sums the predictive power of all the independent variables and determines that it is unlikely that all of the coefficients equal zero. However, it's possible that each variable isn't predictive enough on its own to be statistically significant.
- In other words, your model provides enough evidence to be statistically significant, but not enough to conclude that any individual variable is significant.
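
As a rough sketch of what the overall F-test computes, the example below compares a one-predictor model to the intercept-only model on made-up data. It assumes the standard form $F = \frac{SSR/p}{SSE/(n-p-1)}$ and shows the equivalent expression in terms of R-squared. The p-value is not computed here: it would come from the upper tail of an F distribution with $(p, n-p-1)$ degrees of freedom, which needs a statistics library.

```python
# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

n = len(x)
p = 1  # number of independent variables in the specified model

# Specified model: ordinary least-squares line.
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Intercept-only model: every prediction is the mean, so its SSE is SST.
sst = sum((yi - mean_y) ** 2 for yi in y)

ssr = sst - sse  # reduction in squared error gained by adding the predictors
f_from_ss = (ssr / p) / (sse / (n - p - 1))

# The same statistic written in terms of R-squared.
r2 = 1.0 - sse / sst
f_from_r2 = (r2 / p) / ((1.0 - r2) / (n - p - 1))

print("F from sums of squares:", round(f_from_ss, 2))
print("F from R-squared:      ", round(f_from_r2, 2))
```

Writing F in terms of R-squared makes the link explicit: the statistic is zero exactly when R-squared is zero and grows as the model explains more of the variation per term used.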

#### Additional Ways to Interpret F-test of Overall Significance
- For the model with no independent variables, the intercept-only model, all of the model's predictions equal the mean of the dependent variable. 
- Consequently, if the overall F-test is significant, the model's predictions are an improvement over using the mean.
- If the overall F-test is significant, we can conclude that R-squared does not equal zero, and the correlation between the model and the dependent variable is statistically significant.
- On the other hand, if the overall F-test is not significant, your sample does not provide strong enough evidence for concluding that R-squared is greater than zero in the population.