## Specify Your Model

### Overview
- Model specification is the process of determining which independent variables belong in the model and whether modeling curvature and interaction effects are appropriate.
- This chapter covers statistical methods, difficulties that can arise, and practical suggestions for selecting the model.
- During the specification process, you try different combinations of variables and various forms of the model. For example, you can try different terms that model interactions between variables and curvature in the data.
- We need to reach a Goldilocks balance by including the correct number of independent variables in the regression equation.
    - **Too few** : Underspecified models tend to be biased.
    - **Too many**: Overspecified models tend to be less precise.
    - **Just Right**: Models with the correct terms are not biased and are the most precise.
- If a study wants to test a particular relationship, then the regression equation should contain the independent variables that we are explicitly testing, along with other variables that affect the dependent variable.
- This process allows the regression model to assess the study's main questions while controlling for other variables that can influence the dependent variable.

### The Importance of Graphing Your Data
- When you start working with a dataset to build a regression model, the first thing you should do is graph your data.
- This will help you learn a lot about your data and the relationship between variables.
- For regression analysis, where most data are continuous, scatterplots are crucial. Scatterplots will show you whether there are positive or negative relationships and if they are linear or curvilinear.
- When some relationships are curved, the shape of the curve in the scatterplot might provide ideas about how to model it.
- We can also calculate the correlations between the candidate independent variables and the dependent variable. Significant correlations suggest that we should consider including those variables in the model.
- We can include categorical variables in scatterplots to determine whether those variables play a role. Alternatively, we can graph the dependent variable by groups using boxplots or individual value plots.
- If you are working on a multiple regression model, you can use a scatterplot matrix to display numerous relationships at the same time.
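
As a rough illustration of the correlation screening described above, the sketch below computes the correlation of each candidate predictor with the dependent variable. The data are synthetic and the variable names (`strength`, `temperature`, `pressure`) are made up for this example; it uses NumPy only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data (assumption): strength depends on temperature, not on pressure.
temperature = rng.normal(700, 50, n)
pressure = rng.normal(35, 2, n)
strength = 0.2 * temperature + rng.normal(0, 5, n)

names = ["strength", "temperature", "pressure"]
data = np.column_stack([strength, temperature, pressure])

# Correlation matrix; rowvar=False treats each column as one variable.
corr = np.corrcoef(data, rowvar=False)

# The first row holds each variable's correlation with the dependent variable.
for name, r in zip(names[1:], corr[0, 1:]):
    print(f"corr(strength, {name}) = {r:.2f}")
```

With a pandas DataFrame, `df.corr()` gives the same matrix, and `pandas.plotting.scatter_matrix(df)` draws the corresponding scatterplot matrix.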

### Statistical Methods for Model Specification
- We can use statistical assessments during the model specification process.
- Various metrics and algorithms can help you determine which independent variables to include in the regression equation.

#### Adjusted R-squared and Predicted R-squared
- Typically, you want to select models that have larger adjusted and predicted R-squared values. These statistics help avoid a fundamental problem with R-squared: it never decreases when you add an independent variable, even one that is unrelated to the dependent variable.
- This property tempts us into specifying a model that is too complex, which leads to overfitting and misleading results.
- **Adjusted R-squared** increases only when a new variable improves the model by more than chance. Low-quality variables can cause it to decrease.
- **Predicted R-squared** is a cross-validation method that can also decrease. Cross-validation partitions your data to determine whether the model is generalizable outside of your dataset.
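
The three statistics can be compared side by side. The sketch below is one possible way to compute them with NumPy, on synthetic data with an added column that is unrelated to `y`. The helper name `r_squared_stats` is made up for this example, and predicted R-squared is computed here from the PRESS statistic (leave-one-out residuals), which is an assumption about the cross-validation scheme.

```python
import numpy as np

def r_squared_stats(X, y):
    """Return (R-squared, adjusted R-squared, predicted R-squared) for OLS with an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])            # design matrix with constant
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    # Leverages h_ii give the leave-one-out residuals e_i / (1 - h_ii) without refitting.
    leverage = np.einsum("ij,ji->i", A, np.linalg.pinv(A))
    press = np.sum((resid / (1 - leverage)) ** 2)
    pred_r2 = 1 - press / sst
    return r2, adj_r2, pred_r2

rng = np.random.default_rng(1)
n = 60
x = rng.normal(size=n)
noise = rng.normal(size=n)                          # unrelated to y
y = 2.0 * x + rng.normal(size=n)

base = r_squared_stats(x.reshape(-1, 1), y)
bigger = r_squared_stats(np.column_stack([x, noise]), y)
print("x only    :", np.round(base, 4))
print("x + noise :", np.round(bigger, 4))
```

R-squared for the larger model cannot be lower than for the smaller one, while adjusted and predicted R-squared are free to drop.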

#### Mallows' Cp
- Mallows' Cp helps you choose between multiple regression models by striking a balance between precision and bias.
- We want to include enough independent variables to eliminate bias, but not so many that precision suffers. This balance changes with the number of independent variables in the model.
- Mallows' Cp compares the precision and bias of the full model to models with a subset of predictors.
- Typically, you want Mallows' Cp to be small and close to the number of independent variables in the model plus one (for the constant).
- A Mallows' Cp value that meets these criteria suggests that the coefficient estimates are both relatively precise (small variance) and unbiased.
- Biased models have larger values of Mallows' Cp.
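
A common form of the statistic is $C_p = SSE_p / MSE_{full} - n + 2(p + 1)$, where $p$ is the number of predictors in the candidate model and the extra 1 counts the constant. The sketch below applies that formula with NumPy to synthetic data; the function names are made up for this example.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares for OLS with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return resid @ resid

def mallows_cp(X_full, y, subset):
    """Mallows' Cp for the model that uses the columns of X_full listed in `subset`."""
    n, p_full = X_full.shape
    mse_full = sse(X_full, y) / (n - p_full - 1)
    k = len(subset) + 1                      # parameters = predictors + constant
    return sse(X_full[:, subset], y) / mse_full - n + 2 * k

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 3))
y = 1.5 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)   # column 2 is noise

for subset in ([0], [0, 1], [0, 1, 2]):
    print(subset, f"Cp = {mallows_cp(X, y, subset):.1f}", f"target = {len(subset) + 1}")
```

The model that drops an important column is biased and gets a large Cp. By construction, the full model's Cp always equals its own parameter count, so Cp is only informative for the subset models.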

#### P-values for the independent variables
- In regression, p-values less than the significance level indicate that the term is statistically significant. When a variable is not significant, consider removing it from the model.
- **"Reducing the model"** is the process of including all candidate variables in the model, and then repeatedly removing the single term with the highest non-significant p-value until your model contains only significant terms.

#### Stepwise regression and Best subsets regression
- These two automated model selection procedures are algorithms that pick the variables to include in your regression equation.
- These automated methods are helpful when you have too many independent variables, and you need some help in the investigative stages of the variable selection process.
- These procedures can provide the Mallows' Cp statistic, which helps you balance the tradeoff between precision and bias.

### Real-World Complications
- There are a few complications that can arise from using statistical methods for model selection. They are as follows.
- Your best model is only as good as the data you collect.
    - Specification of the correct model depends on measuring the proper variables.
    - When you omit important variables from the model, the estimates for the variables that you include can be biased.
    - This condition is known as **omitted variable bias**.
- The sample you collect can be unusual, either by luck or methodology.
    - False discoveries and false negatives are inevitable when you work with samples.
- Multicollinearity occurs when independent variables in a regression equation are correlated.
    - When multicollinearity is present, small changes in the equation can produce dramatic changes in coefficients and p-values.
    - It can also reduce statistical significance in variables that are relevant.
    - For these reasons, multicollinearity makes model selection challenging.
- If you fit many models during the selection process, some results will look significant by chance alone.
    - You will find variables that appear to be statistically significant but are correlated with the dependent variable only by chance.
    - This problem occurs because every hypothesis test has a false positive rate.
    - This type of data mining can make even random data appear to have significant relationships.
- P-values, adjusted R-squared, predicted R-squared, and Mallows' Cp can point to different regression equations, leaving no clear answer.
- Stepwise regression and best subsets regression have their own limits.
    - These are automated model selection procedures that can help you in the early stages of model specification.
    - These tools can get close to the right answer but they usually don't specify the correct model.
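
The data-mining complication above is easy to reproduce. The sketch below screens 200 pure-noise predictors against a pure-noise response and counts how many pass a 5% significance cutoff. To stay NumPy-only it compares each correlation with a hardcoded critical value, which is an approximation stated in the code rather than an exact p-value calculation.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_candidates = 100, 200

y = rng.normal(size=n)                       # pure noise response
X = rng.normal(size=(n, n_candidates))       # pure noise predictors

# Pearson correlation of each candidate column with y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Assumption: |r| > 0.197 approximates a two-sided p-value below 0.05 for n = 100.
r_crit = 0.197
false_positives = int(np.sum(np.abs(r) > r_crit))
print(f"{false_positives} of {n_candidates} noise variables look 'significant'")
```

On average about 5% of the noise variables pass the cutoff, even though none has a real relationship with the response.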

### Practical Recommendations (for above problems)
- Model Specification is as much a science as it is an art.
- Statistical methods can help, but ultimately you'll need to place a high weight on theory and other considerations.

#### Theory
- The best practice is to review the literature to develop a theoretical understanding of the relevant independent variables, their relationships with the dependent variable, and the expected coefficient signs and effect magnitudes before you begin collecting data.
- Building your knowledge helps you collect the correct data in the first place and it helps you specify the best regression equation without resorting to data mining.
- Specification should not be based only on statistical measures. In fact, the foundation of your model selection process should depend largely on theoretical concerns.
- Be sure to determine whether the statistical results match theory and, if necessary, make adjustments.
- For example, if theory suggests that an independent variable is important, you might include it in the regression equation even when its p-value is not significant.
- If the coefficient sign is the opposite of what theory predicts, investigate and either modify the model or explain the inconsistency.

#### Simplicity
- People often think that complex problems require complicated regression equations. However, simplification usually produces more precise models.
- When you have several models with similar predictive power, choose the simplest as it is likely to be the best model.
- Start simple and then add complexity only when it is actually needed.
- As you make a model more complex, it becomes more likely that you are tailoring it to fit the quirks in your particular dataset rather than actual relationships in the population. 
- This causes overfitting, which reduces generalizability and can produce results you can't trust.

#### Residual Plots
- During the specification process, check the residual plots. Residual plots are an easy way to avoid biased models and can help you make adjustments.
- For instance, residual plots display patterns when an underspecified regression equation is biased, which can indicate the need to model curvature.
- The simplest model that creates random residuals is a great contender for being reasonably precise and unbiased.
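
To illustrate how residuals reveal unmodeled curvature, the sketch below fits a straight line and then a quadratic to synthetic data whose true relationship is curved. Instead of drawing the plot, it summarizes the leftover pattern as the correlation between the residuals and the squared term. The data and the helper name `residuals` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-3, 3, n)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(0, 1, n)   # true relationship is curved

def residuals(columns, y):
    """OLS residuals for a model with an intercept and the given predictor columns."""
    A = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

resid_linear = residuals([x], y)            # underspecified: no curvature term
resid_quad = residuals([x, x**2], y)        # includes the squared term

# A strong correlation between residuals and x**2 signals leftover curvature.
corr_linear = np.corrcoef(resid_linear, x**2)[0, 1]
corr_quad = np.corrcoef(resid_quad, x**2)[0, 1]
print(f"linear model   : {corr_linear:.3f}")
print(f"quadratic model: {corr_quad:.3f}")
```

Plotting `resid_linear` against `x` (for example with `plt.scatter`) would show the U-shaped pattern that this correlation summarizes.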

#### In Short
- Ultimately, statistical measures can't tell you which regression equation is best. They don't understand the fundamentals of the subject area.
- Subject area expertise is always a vital part of the model specification process.
- We want simple models that you choose based on theory.
- It's tempting to try many combinations of variables to find the best model, but that's not the best approach.

### Omitted Variable Bias
- Omitted variable bias occurs when a regression model leaves out relevant independent variables, which are known as confounding variables.
- This condition forces the model to attribute the effects of omitted variables to the variables that are in the model, which biases the coefficient estimates.
- This problem occurs because your linear regression model is specified incorrectly, either because the confounding variables are unknown or because the data do not exist.
- If this bias affects your model, it is a severe condition because you can't trust your results.

#### Effects of Omitted Variable Bias
- Omitting confounding variables from your regression model can bias coefficient estimates.
- When you're assessing the regression coefficients in the statistical output, this bias can produce the following problems:
    - Overestimate the strength of an effect.
    - Underestimate the strength of an effect.
    - Change the sign of an effect.
    - Mask an effect that actually exists.

#### Synonyms for Confounding Variables and Omitted Variable Bias
- **Omitted variables** that cause bias are also referred to as confounding variables, confounders and lurking variables.
- These are important variables that the statistical model does not include and therefore cannot control for.
- **Omitted variable bias** is also referred to as spurious effects or spurious relationships.

#### Conditions that cause Omitted Variable Bias
- For omitted variable bias to occur, the following two conditions must exist:
    1. The omitted variable must correlate with the dependent variable.
    2. The omitted variable must correlate with at least one independent variable that is in the regression model.
- This correlation structure causes confounding variables that are not in the model to bias the estimates that appear in your regression results.
- The amount of bias depends on the strength of these correlations.
- Strong correlations produce greater bias. If the relationships are weak, then the bias might not be severe.
- If the omitted variable is not correlated with another independent variable at all, excluding it does not produce bias.
- Omitted variable bias tends to occur in observational studies. Randomized studies minimize the effects of confounding variables by equally distributing them across the treatment groups, so the bias is less likely to be a problem.
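
The dependence of the bias on the correlation strength can be checked with a small simulation. In the sketch below, the true effect of `x1` is 1.0, the omitted variable `x2` has an effect of 2.0, and `rho` is the correlation between them. All values are synthetic and chosen only for illustration.

```python
import numpy as np

def x1_coefficient_without_x2(rho, n=20_000, seed=0):
    """Fit y ~ x1 while omitting x2, where corr(x1, x2) = rho; return the x1 estimate."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)    # true x1 effect is 1.0
    A = np.column_stack([np.ones(n), x1])           # x2 deliberately left out
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1]

estimates = {rho: x1_coefficient_without_x2(rho) for rho in (0.0, 0.4, 0.8)}
for rho, b1 in estimates.items():
    print(f"corr(x1, x2) = {rho:.1f} -> estimated x1 effect = {b1:.2f}")
```

In this setup the estimate lands near `1 + 2 * rho`: there is no bias when the omitted variable is uncorrelated with `x1`, and the bias grows as the correlation strengthens.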

#### Correlations, Residuals and OLS Assumptions
- When you satisfy the ordinary least squares (OLS) assumptions, the Gauss-Markov theorem states that your estimates will be unbiased and have the minimum variance among linear unbiased estimators.
- Omitted variable bias occurs because the residuals violate one of the assumptions.
- To see how this works we need to follow a chain of events.
- Suppose you have a regression model with two significant independent variables, X1 and X2. These independent variables correlate with each other and the dependent variable - which are the requirements for omitted variable bias.
- Now if we take X2 out of the model, here's what happens:
    1. The model fits the data less well because we have removed a significant explanatory variable. Consequently, the gaps between the observed values and the fitted values increase. These gaps are the residuals.
    2. The degree to which each residual increases depends on the relationship between X2 and the dependent variable. Consequently, the residuals correlate with X2.
    3. X1 correlates with X2, and X2 correlates with the residuals. Ergo, variable X1 correlates with the residuals.
    4. Hence, this condition violates the ordinary least squares assumption that independent variables in the model do not correlate with the residuals. Violations of this assumption produce biased estimates.

#### Detect Omitted Variable Bias and Identify Confounding Variables
- If you include different combinations of independent variables in the model and you see the coefficients changing, you may be seeing omitted variable bias in action.
- We know that for omitted variable bias to exist, an independent variable must correlate with the residuals.
- Consequently, we can plot the residuals by the variables in our model. If we see a relationship in the plot, rather than random scatter, it both tells us that there is a problem and also points towards its solution.
- The plot also shows which independent variable correlates with the confounding variable (the residuals will have a pattern for that variable). This can help you track down the problem.

#### Obstacles to Correcting Omitted Variable Bias
- The best correction for omitted variable bias is including the variable in the model.
- This allows the regression model to control for the missing variable and prevent the spurious effects that the omitted variable might have caused otherwise.
- Theoretically, you should include all independent variables that have a relationship with the dependent variable, but this approach produces real-world problems.
    1. You might need to collect data on many more characteristics than is feasible. Additionally, some of these characteristics might be very difficult or even impossible to measure (for example, a theoretical concept like ability).
    2. As you include more variables in the model, the number of observations must increase to avoid overfitting the model, which can also produce unreliable results. Measuring more characteristics and gathering a larger sample can be very expensive.
    3. Since the bias occurs when the confounding variables correlate with independent variables, including these variables invariably introduces multicollinearity, which causes its own problems, including unstable coefficient estimates, lower statistical power and less precise estimates.
- Therefore, a tradeoff might occur between precision and bias. As you include the formerly omitted variables, you reduce the bias, but the increased multicollinearity can potentially reduce the precision of the estimates.
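
The tradeoff can be made concrete by comparing standard errors. The sketch below is a NumPy-only illustration on synthetic data in which the confounder `x2` is highly correlated with `x1` (0.95) but has only a small effect. The helper name `fit_with_se` and all numbers are assumptions for this example.

```python
import numpy as np

def fit_with_se(A, y):
    """OLS coefficients and standard errors for design matrix A (constant included)."""
    n, k = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return beta, se

rng = np.random.default_rng(7)
n, rho = 200, 0.95
x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)   # confounder, highly correlated with x1
y = 1.0 * x1 + 0.1 * x2 + rng.normal(size=n)

ones = np.ones(n)
b_without, se_without = fit_with_se(np.column_stack([ones, x1]), y)
b_with, se_with = fit_with_se(np.column_stack([ones, x1, x2]), y)

print(f"x2 omitted : b1 = {b_without[1]:.3f}, SE = {se_without[1]:.3f}")
print(f"x2 included: b1 = {b_with[1]:.3f}, SE = {se_with[1]:.3f}")
```

Including `x2` removes a small bias from the `x1` estimate, but its standard error becomes several times larger. This is the situation in which accepting a little bias can be the better choice.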

#### Recommendations for addressing Confounding Variables and Omitted Variable Bias
- First, before you begin your study, arm yourself with all the background information you can gather.
- Research the study area, review the literature, and consult with experts. This process enables you to identify and measure the crucial variables that you should include in your model and helps avoid the problem in the first place.
- Realizing that a critical variable is missing after you have collected all the data can be very expensive.
- After the analysis, this background information can help you identify potential bias and, if necessary, track down the solution.
- Check the residual plots. Sometimes you might not be sure whether bias exists, but the plots can display the hallmarks of confounding variables.
- Omitted variable bias might not always be a significant problem, because it decreases as the correlations weaken. Understanding the relationships between the variables might help you make this decision.
- A tradeoff between the precision and bias of the estimates might occur.
    - As you add confounding variables to reduce the bias keep an eye on the precision of the estimates.
    - To track the precision, check the confidence interval of the estimates.
    - If the intervals become wider, the estimates are less precise.
    - In the end, you might want to accept a little bias if it significantly improves precision.

- If you cannot include an important variable and it causes omitted variable bias, consider using a **proxy variable**.
    - **Proxy variables** are easy to measure, and are used instead of variables that are impossible or difficult to measure.
    - The proxy variable can be a characteristic that is not of any great importance itself, but has a good correlation with the confounding variable.
    - These variables allow you to include some information in the model that would otherwise be unavailable, thereby reducing omitted variable bias.
- Finally, if you can't correct omitted variable bias using any method, you can at least predict the direction of the bias for your estimates.
- After identifying confounding variable candidates, you can estimate their theoretical correlations with the relevant variables and predict the direction of the bias.
- Always remember that it's easy to get so stuck on determining which candidate variables to include that you forget which variables you might be excluding without even realizing it.
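
The effect of a proxy variable can be sketched with a simulation. Below, `ability` is a confounder that cannot be measured, and `proxy` is a noisy but measurable stand-in for it. The variable names, effect sizes and noise levels are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
n, rho = 20_000, 0.6

x1 = rng.normal(size=n)
ability = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)   # unmeasurable confounder
proxy = ability + rng.normal(0, 0.5, n)                          # noisy but measurable stand-in
y = 1.0 * x1 + 2.0 * ability + rng.normal(size=n)               # true x1 effect is 1.0

def x1_estimate(columns):
    """Coefficient on x1 (the first predictor) from OLS with an intercept."""
    A = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1]

b_omitted = x1_estimate([x1])
b_proxy = x1_estimate([x1, proxy])
print(f"confounder omitted: {b_omitted:.2f}")
print(f"proxy included    : {b_proxy:.2f}")
```

The proxy does not remove the bias completely, because it measures the confounder with error, but it moves the estimate much closer to the true value of 1.0.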

### Automated Variable Selection Procedures
- Automatic variable selection procedures are algorithms that pick the variables to include in your regression model.
- Stepwise regression and best subsets regression are two of the more common variable selection methods.
- These automatic procedures can help when you have many independent variables, and you need assistance in the investigative stages of the variable selection process.
- These procedures are especially useful when theory and experience provide only a vague sense of which variables you should include in the model.
- However, if theory and expertise are strong guides, it's generally better to follow them than to use an automated procedure.
- Additionally, if you use one of these procedures, you should consider it only as the first step of the model selection process.

#### Stepwise Regression
- This procedure begins with a set of candidate independent variables and then adds or removes independent variables one at a time using the variable's statistical significance.
- Stepwise either adds the most significant variable or removes the least significant variable.
- It does not consider all possible models, and it produces a single regression model when the algorithm ends.
- Typically, you can control the specifics of the stepwise procedure. For example, you can specify whether it can only add variables, only remove variables, or both.
- You can also set the significance level for including and excluding the independent variables.

#### Best Subsets Regression
- Best subsets regression is also known as "all possible regressions" and "all possible models".
- This method fits all possible models based on the independent variables that you specify.
- The number of models that this procedure fits multiplies quickly. It fits $2^P$ models, where P is the number of predictors in the dataset.
- So if you have 10 independent variables, it will fit $2^{10} = 1024$ models. If you have 20 variables, it fits $2^{20} = 1,048,576$ models.
- After fitting all of the models, best subsets regression then displays the best fitting models with one independent variable, two variables, and so on.
- It uses either adjusted R-squared or Mallows' Cp as the criterion for picking the best fitting models for this process.
- You need to compare the models to determine which one is the best.
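
A minimal best subsets sketch, assuming NumPy and synthetic data rather than the manufacturing dataset. The function name `best_subsets` and the predictor names `A` to `D` are made up for this example. It skips the constant-only model, so it fits $2^P - 1$ models, and it ranks the models of each size by adjusted R-squared while also reporting Mallows' Cp.

```python
import numpy as np
from itertools import combinations

def best_subsets(X, y, names):
    """Fit every non-empty subset of predictors; keep the best model of each size by adjusted R-squared."""
    n, p_full = X.shape
    sst = np.sum((y - y.mean()) ** 2)

    def sse(cols):
        A = np.column_stack([np.ones(n), X[:, cols]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return resid @ resid

    mse_full = sse(list(range(p_full))) / (n - p_full - 1)
    best, n_fitted = {}, 0
    for size in range(1, p_full + 1):
        for cols in combinations(range(p_full), size):
            n_fitted += 1
            s = sse(list(cols))
            adj_r2 = 1 - (s / (n - size - 1)) / (sst / (n - 1))
            cp = s / mse_full - n + 2 * (size + 1)
            if size not in best or adj_r2 > best[size][1]:
                best[size] = ([names[c] for c in cols], adj_r2, cp)
    return best, n_fitted

rng = np.random.default_rng(5)
n = 200
names = ["A", "B", "C", "D"]
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)   # only A and B matter

best, n_fitted = best_subsets(X, y, names)
print(f"fitted {n_fitted} models")
for size, (cols, adj_r2, cp) in best.items():
    print(size, cols, f"adj R2 = {adj_r2:.3f}", f"Cp = {cp:.1f}")
```

The procedure only produces the shortlist. Choosing among the best models of each size is still left to the analyst.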

#### Stepwise Regression vs Best Subsets Regression
- Both automatic variable selection procedures assess the full set of candidate independent variables that you specify, but the end results can be different.
- Stepwise regression does not fit all the models but instead assesses the statistical significance of the variables and arrives at a single model.
- Best subsets regression fits all possible models and displays some of the best candidates based on adjusted R-squared or Mallows' Cp.
- The single model that stepwise regression produces can be simpler to analyze. However, best subsets regression presents more information that is potentially valuable.

#### Using Stepwise and Best Subsets Regression on the same Dataset
- The following example scenario models a manufacturing process. We will determine whether the production conditions are related to the strength of a product.
- For both variable selection procedures, we'll use the same independent and dependent variables.
- **Dependent Variable** - Strength
- **Independent Variables** - Temperature, Pressure, Rate, Concentration, Time

In [4]:
## import libraries
import pandas as pd
import numpy as np
# import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt

In [5]:
# Read the dataset file
manufacturing_df = pd.read_csv("../../datasets/RegressionAnalysisDatasets/ProductStrength.csv")
manufacturing_df.head()

| Unnamed: 0 | Strength | Temperature | Pressure | Rate | Concentration | Time |
|---|---|---|---|---|---|---|
| 0 | 271.8 | 783.35 | 33.53 | 40.55 | 16.66 | 13.2 |
| 1 | 264.0 | 748.45 | 36.5 | 36.19 | 16.46 | 14.11 |
| 2 | 238.8 | 684.45 | 34.66 | 37.31 | 17.66 | 15.68 |
| 3 | 230.7 | 827.8 | 33.13 | 32.52 | 17.5 | 10.53 |
| 4 | 251.6 | 860.45 | 35.75 | 33.71 | 16.4 | 11.0 |


#### Example of Stepwise Regression
##### Forward Stepwise Regression
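- Steps for forward stepwise regression:
    1. Start with an empty list of included columns.
    2. For each column not yet included, fit the model with the included columns plus that one column, and record the new column's p-value.
    3. Find the candidate with the smallest p-value and check whether it is below the threshold.
    4. If it is, add that column to the list and repeat from step 2; otherwise stop.
- Each column was significant at the threshold when it was added, although its p-value can change as later columns enter the model.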

In [6]:
# Forward stepwise regression
def forward_stepwise_regression(regression_df, y, x, pvalue_threshold_in=0.15):
    included_columns = []
    while True:
        # Candidates are the columns not yet in the model, kept in their original order
        excluded_columns = [column for column in x if column not in included_columns]
        if not excluded_columns:
            break
        # dtype=float avoids the pandas warning raised for an empty Series without a dtype
        pvalue_list = pd.Series(index=excluded_columns, dtype=float)
        for column_name in excluded_columns:
            # Fit the current model plus this one candidate and record the candidate's p-value
            regression_formula = f"{y} ~ {' + '.join(included_columns + [column_name])}"
            model = ols(regression_formula, data=regression_df).fit()
            pvalue_list[column_name] = model.pvalues[column_name]
        min_pvalue = pvalue_list.min()
        if min_pvalue >= pvalue_threshold_in:
            # No remaining candidate is significant enough to add
            break
        best_feature = pvalue_list.idxmin()
        included_columns.append(best_feature)
        print(f"Add {best_feature} with p-value {min_pvalue}")
    return included_columns

In [7]:
y = "Strength"
x = ['Temperature', 'Pressure', 'Rate', 'Concentration', 'Time']
final_feature_list = forward_stepwise_regression(manufacturing_df, y, x)
print(f"Final Feature List -\n{final_feature_list}")

Add Concentration with p-value 5.935009166852511e-09
Add Rate with p-value 3.001026934440323e-05
Add Pressure with p-value 0.09246550922073958
Add Temperature with p-value 0.06676235862897598
Final Feature List -
['Concentration', 'Rate', 'Pressure', 'Temperature']




##### Backward Stepwise Regression
- Backward stepwise regression starts with all candidate columns and repeatedly removes the column with the highest p-value until every remaining p-value is at or below the threshold.

In [8]:
# Backward stepwise regression
def backward_stepwise_regression(regression_df, y, x, pvalue_threshold_out=0.15):
    # Copy the list so the caller's list of columns is not modified
    included_columns = list(x)
    while included_columns:
        regression_formula = f"{y} ~ {' + '.join(included_columns)}"
        model = ols(regression_formula, data=regression_df).fit()
        # Use all p-values except the intercept's
        pvalue_list = model.pvalues.drop("Intercept")
        max_pvalue = pvalue_list.max()
        if max_pvalue <= pvalue_threshold_out:
            # Every remaining term is significant enough to keep
            break
        worst_feature = pvalue_list.idxmax()
        included_columns.remove(worst_feature)
        print(f"Removing feature - {worst_feature} with pvalue {max_pvalue}")
    return included_columns

In [9]:
y = "Strength"
x = ['Temperature', 'Pressure', 'Rate', 'Concentration', 'Time']
final_feature_list = backward_stepwise_regression(manufacturing_df, y, x)
print(f"Final Feature List -\n{final_feature_list}")

Removing feature - Time with pvalue 0.1943335490326087
Final Feature List -
['Temperature', 'Pressure', 'Rate', 'Concentration']


#### Assess your Candidate Regression Models Thoroughly
- If you use stepwise regression or best subsets regression to help pick your model, you need to investigate the candidate models thoroughly.
- This entails fitting the candidate models the normal way and checking the residual plots to be sure the fit is unbiased.
- You also need to assess the signs and values of the regression coefficients to make sure that they make sense.
- These automatic model selection procedures can find chance correlations in the sample data (overfitting) and produce models that don't make sense in the real world.
- Automatic model selection procedures can be helpful tools, particularly in the exploratory stage. However, they have the following problems:
    - These procedures can sift through many different models and find correlations that exist by chance in the sample. Assess the results critically and use your expertise to determine whether they make sense.
    - These procedures cannot take real-world knowledge into account. The model might not be correct in a practical sense.
    - Stepwise regression does not always choose the model with the largest R-squared value.

### Accuracy of Stepwise Regression
- Below are definitions for terms used in the comparison
    - **Authentic variables** are the independent variables that truly have a relationship with the dependent variable.
    - **Noise variables** are independent variables that do not have an actual relationship with the dependent variable.
    - The **correct model** includes all of the authentic variables and excludes all of the noise variables.
- The best-case scenario for stepwise regression is a small number of candidate variables with a large sample size, for example 4 candidate variables with a sample size of 500 observations.
- In this scenario, stepwise regression chooses the correct model 84% of the time, but the scenario is not realistic, and accuracy drops from here.
- When there are more variables to evaluate, it is harder for stepwise regression to identify the correct model.
- Multicollinearity also plays a role in the capability of stepwise regression to choose the correct model.
- When independent variables are correlated, it's harder to isolate the individual effect of each variable. This difficulty occurs regardless of whether a human or a computer algorithm is trying to identify the correct model.
- Stepwise Regression and best subsets regression don't usually select the correct model but they can provide value during the very early, investigative stages of fitting the model.
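
The effect of the number of candidate variables can be explored with a small simulation. The sketch below is a simplified, NumPy-only forward selection that uses a cutoff on the absolute t statistic (`t_in=2.0`, roughly a 5% significance level at this sample size) instead of p-values. It is not the procedure behind the figures quoted above, and every setting in it is an assumption chosen for illustration.

```python
import numpy as np

def forward_select(X, y, t_in=2.0):
    """Forward selection: repeatedly add the candidate with the largest |t|, if it exceeds t_in."""
    n, p = X.shape
    included = []
    while len(included) < p:
        best_col, best_t = None, t_in
        for col in range(p):
            if col in included:
                continue
            A = np.column_stack([np.ones(n), X[:, included + [col]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            sigma2 = resid @ resid / (n - A.shape[1])
            se = np.sqrt(sigma2 * np.linalg.inv(A.T @ A)[-1, -1])
            t = abs(beta[-1] / se)                   # t statistic of the newly added column
            if t > best_t:
                best_col, best_t = col, t
        if best_col is None:
            break
        included.append(best_col)
    return set(included)

def share_correct(n_noise, n_obs=100, n_sims=100, seed=0):
    """Fraction of simulations where forward selection returns exactly the two authentic variables."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        X = rng.normal(size=(n_obs, 2 + n_noise))
        y = X[:, 0] + X[:, 1] + rng.normal(size=n_obs)   # columns 0 and 1 are authentic
        hits += forward_select(X, y) == {0, 1}
    return hits / n_sims

results = {n_noise: share_correct(n_noise) for n_noise in (2, 18)}
for n_noise, share in results.items():
    print(f"{n_noise} noise variables: correct model in {share:.0%} of runs")
```

In this setup the authentic variables are almost always found. What lowers the success rate is that each extra noise variable gets its own chance to slip into the model.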