## Interpreting Main Effects and Significance

### Regression Notation (OLS)
- The following notation for Ordinary Least Squares (OLS) regression applies to regression models for entire populations with k independent variables.
- These are ideal models that you would obtain if you could measure the entire population.

$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \epsilon$

In this notation:
- Y represents the dependent variable
- The betas ($\beta$) represent the true population parameters. $\beta_0$ is the constant, while the other betas are the coefficients of the independent variables.
- $X$'s are the independent variables.
- Epsilon ($\epsilon$) represents the error, which is the left-over random portion of variability that the model can't explain.

- We never work with the entire population. Instead, we use samples to estimate the population parameters. The notation for a regression model based on a sample is the following:

$\hat{y} = \hat{\beta_0} + \hat{\beta_1} X_1 + \dots + \hat{\beta_k} X_k$, where each observed value satisfies $y = \hat{y} + \hat{\epsilon}$

In this notation, the hats represent the sample estimates of the population values:
- $\hat{y}$ - represents the fitted value for the dependent variable. When we enter the values for the independent variables into the regression equation, we obtain the fitted value of the dependent variable.
- $\hat{\beta}$ - the beta hats represent the estimates of population parameters. These estimates are the `regression coefficients` that appear in your output.
- $\hat{\epsilon}$ - the epsilon hat represents the estimate of the error or what we call `residuals`.
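
The relationship between fitted values and residuals can be sketched with simulated data. This is a minimal illustration, assuming made-up population parameters (`beta_0 = 2`, `beta_1 = 3`) and using NumPy's least-squares solver in place of the statsmodels workflow used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "population" model: Y = 2 + 3*X + epsilon
beta_0, beta_1 = 2.0, 3.0
x = rng.uniform(0, 10, size=200)
y = beta_0 + beta_1 * x + rng.normal(0, 1, size=200)

# Design matrix with a column of ones for the constant
X = np.column_stack([np.ones_like(x), x])

# beta_hat holds the sample estimates of beta_0 and beta_1
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat      # fitted values
residuals = y - y_hat     # estimates of the error term

print("estimated coefficients:", beta_hat)
print("mean of residuals:", residuals.mean())
```

The estimates should land near, but not exactly on, the assumed population values, and each observed `y` equals its fitted value plus its residual.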

### Three Types of Effects in Regression Models
- **Main Effects** - The relationship between an independent variable and the dependent variable does not depend on the value of other variables in the model.

- **Curvilinear Effects** - The relationship between the dependent and independent variable changes based on the value of that independent variable itself. Instead of following a straight line on a graph, these relationships follow curves.

- **Interaction Effects** - The relationship between an independent variable and the dependent variable depends on the value of at least one other independent variable in the model.

- For main and interaction effects, the interpretation differs for continuous versus categorical variables. 
- On the other hand, curved relationships can exist only for continuous data.

### Main Effects of Continuous Variables
- Main effects for continuous variables are the most common type of relationship in regression models. 
- For example, suppose we have two independent variables, A and B:
    - Both variables are statistically significant, and the model provides a good fit for the data.
    - In this scenario, we can conclude that the effect of variable A on the dependent variable (output) does not change as the values of variable B change.
    - Also, variable A's effect is consistent throughout the range of values for A.
    - The same applies to the effect of variable B: it does not depend on A, and it remains consistent.
    

- Coefficients and p-values in regression analysis help us understand which relationships in the model are statistically significant and the nature of those relationships.
- The coefficients represent a variable's effect and describe the magnitude and direction of the relationship between each independent variable and the dependent variable.
- Coefficients are the numbers in the regression equation that multiply the values of the variables (the slope, or m).
- The p-values for the coefficients indicate whether these relationships are statistically significant.
- The sign of a regression coefficient tells whether the value of the dependent variable increases or decreases as the independent variable increases. A positive coefficient means that as the value of the independent variable increases, the mean of the dependent variable also increases.
- A negative coefficient means that the mean of the dependent variable decreases as the value of the independent variable increases (an inverse relationship).
- The coefficient value signifies how much the mean of the dependent variable changes with a one-unit increase in the independent variable, while holding all the other variables constant.
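
To make the "holding other variables constant" interpretation concrete, here is a small sketch on simulated data. The variables `a` and `b`, their true effects (4 and -2), and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
a = rng.normal(0, 1, n)
b = rng.normal(0, 1, n)

# Hypothetical main-effects-only model: no interaction, no curvature
y = 10 + 4 * a - 2 * b + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), a, b])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)


def predict(a_val, b_val):
    """Fitted mean of y for given values of A and B."""
    return coef[0] + coef[1] * a_val + coef[2] * b_val


# A one-unit increase in A changes the prediction by coef[1],
# no matter which value B is held at.
for b_val in (-1.0, 0.0, 2.0):
    print(b_val, predict(1.0, b_val) - predict(0.0, b_val))
```

Every printed difference equals the coefficient of A, which is what a main effect means: the effect of A does not depend on B.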

### Confidence Intervals for Regression Parameters
- If we collect a random sample and calculate the mean, the sample mean is the point estimate for population mean.
- We will never know the exact value of the population parameter because we only work with samples (it is not possible to collect data on the whole population).
- The point estimate doesn't indicate how far from the population parameter it is likely to be.
- For this we can calculate the confidence intervals for population parameters.
- A `confidence interval` is derived from a sample and provides a range of values that likely contains the unknown value of the population parameter.
- Different random samples drawn from the population are likely to produce slightly different intervals.
- If we draw many random samples and calculate a confidence interval for each sample, a specific proportion of the ranges contain the population parameter. That percentage is the `confidence level`.
- For example, a 95% confidence level suggests that if you draw 20 random samples from the same population, you'd expect about 19 of the confidence intervals to include the population value.
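
The meaning of the confidence level can be checked by simulation. The sketch below assumes a normal population with a known mean of 50 and uses the normal approximation (z = 1.96) for the interval. These choices are illustrative and are not part of the analysis in this notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed population for the simulation
pop_mean, pop_sd = 50.0, 10.0
n, n_samples = 100, 2000
z = 1.96  # approximate 97.5th percentile of the standard normal

covered = 0
for _ in range(n_samples):
    sample = rng.normal(pop_mean, pop_sd, n)
    half_width = z * sample.std(ddof=1) / np.sqrt(n)
    lower = sample.mean() - half_width
    upper = sample.mean() + half_width
    # Count the intervals that contain the true population mean
    covered += int(lower <= pop_mean <= upper)

coverage = covered / n_samples
print(f"share of intervals containing the population mean: {coverage:.3f}")
```

The share should come out close to 0.95. Any single interval either contains the parameter or it does not; the confidence level describes the long-run behaviour of the procedure.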

- The confidence interval provides meaningful estimates because it produces ranges that usually contain the parameter.
- We can also see how far our point estimate is likely to be from the parameter value.
- In the regression context, we use the sample to calculate the regression coefficients ($\hat{\beta}$s), which are the point estimates of the population parameters ($\beta$s).
- In the example below, the sample estimate of the height coefficient is 106.5. But if we collect multiple samples from the same population, each sample will have its own estimate of the height coefficient.
- From the point estimate alone, we can't tell the real value or how close our estimate is likely to be to it.
- For this, we calculate the confidence interval for the regression coefficients, as shown in the cells below.


In [2]:
## import libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

In [3]:
## 1. Read the file and create a Data Frame
height_weight_df = pd.read_csv("../../datasets/RegressionAnalysisDatasets/HeightWeight.csv")
height_weight_df.head()

Unnamed: 0,Height M,Weight kg
0,1.6002,49.441572
1,1.651,62.595751
2,1.651,75.749931
3,1.53035,48.987979
4,1.45415,43.091278


In [4]:
## 2. Define predictor and response variables

x = height_weight_df["Height M"] ## Independent or predictor Variable
y = height_weight_df["Weight kg"] ## Dependent or response variable

## 3. Add constant to predictor variables
x = sm.add_constant(x)

## 4. Fit Linear Regression Model
ols_model = sm.OLS(y, x).fit()

## 5. View model summary
ols_model.summary()

0,1,2,3
Dep. Variable:,Weight kg,R-squared:,0.497
Model:,OLS,Adj. R-squared:,0.491
Method:,Least Squares,F-statistic:,85.03
Date:,"Sun, 03 Sep 2023",Prob (F-statistic):,1.74e-14
Time:,10:47:17,Log-Likelihood:,-305.7
No. Observations:,88,AIC:,615.4
Df Residuals:,86,BIC:,620.4
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-114.3259,17.443,-6.554,0.000,-149.001,-79.651
Height M,106.5046,11.550,9.221,0.000,83.544,129.465

0,1,2,3
Omnibus:,7.927,Durbin-Watson:,1.176
Prob(Omnibus):,0.019,Jarque-Bera (JB):,7.925
Skew:,0.733,Prob(JB):,0.019
Kurtosis:,3.119,Cond. No.,45.0


In [8]:
## Get the confidence interval of the fitted parameters
# for reference https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.conf_int.html
ols_model.conf_int(alpha=0.05, cols=None)

# NOTE: The confidence interval is based on Student's t distribution

Unnamed: 0,0,1
const,-149.000505,-79.65136
Height M,83.543896,129.465213


- From the above confidence interval, we can be 95% confident that the actual population value for the height coefficient is between 83.5 and 129.5.
- The width of a confidence interval reveals the precision of the estimate.
- Narrower ranges suggest a more precise estimate.
- We can evaluate precision by calculating confidence intervals.
- If we add more variables to the model and these confidence intervals become wider, then there is a problem because the additional variables reduce the model's precision.
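
As a cross-check on `conf_int`, the interval for the height coefficient can be rebuilt by hand from the coefficient, its standard error, and the residual degrees of freedom reported in the summary. This sketch assumes the usual t-based formula, estimate ± t × standard error, and copies the rounded values from the output above.

```python
from scipy import stats

# Values copied from the model summary above
coef = 106.5046     # Height M coefficient
std_err = 11.550    # its standard error
df_resid = 86       # residual degrees of freedom

alpha = 0.05
# Two-sided critical value from Student's t distribution
t_crit = stats.t.ppf(1 - alpha / 2, df_resid)

lower = coef - t_crit * std_err
upper = coef + t_crit * std_err
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

Up to rounding of the inputs, this should reproduce the `Height M` row of the `conf_int` output. A larger standard error would widen the interval directly, which is why the width reflects precision.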

### Interpreting p-values for Continuous Independent Variables
- Regression Analysis is a form of inferential statistics, where we use a sample to draw conclusions about an entire population.
- Sampling error can produce effects in the sample that don't exist in the population.
- p-values and significance levels help determine whether the relationship observed in the sample also exists in the larger population.
- The p-value for each independent variable tests the null hypothesis that there is no relationship with the dependent variable.
- If there is no relationship, then there is no association between the changes in the independent variable and the shifts in the dependent variable.  

- The hypotheses for each independent variable are the following:  
    - **Null Hypothesis** - The coefficient for the independent variable equals zero (there is no relationship). We fail to reject it when the p-value is greater than the significance level.  

    - **Alternative Hypothesis**  - The coefficient for the independent variable does not equal zero (there is a relationship). We favor it when the p-value is less than the significance level.

- If the `p-value` for a variable is less than the significance level, the sample data provides enough evidence to reject the null hypothesis for the entire population.
- This means changes in the independent variable are associated with the changes in response at the population level.
- The variable is statistically significant and can be included in the regression model. A significance level of 0.05 is the most common choice.
- If the p-value is greater than the significance level, there is not enough evidence in the sample to reject the null hypothesis that the coefficient equals zero.
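
The p-value in the `P>|t|` column comes from a t-test on the coefficient. As a rough sketch, assuming the standard two-sided test with t = coefficient / standard error, the calculation for the height coefficient from the earlier summary looks like this:

```python
from scipy import stats

# Values copied from the Height-Weight model summary above
coef = 106.5046
std_err = 11.550
df_resid = 86

# Test statistic for the null hypothesis that the coefficient equals zero
t_stat = coef / std_err

# Two-sided p-value from Student's t distribution
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)

alpha = 0.05
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.3g}, reject null: {reject_null}")
```

The p-value is far below 0.05, which is why the summary table displays it as 0.000.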

#### Note:
 - For a refresher on p-values refer to correlation and regression notebook, also go through below videos and exercises.
  https://www.khanacademy.org/math/ap-statistics/xfb5d8e68:inference-categorical-proportions/idea-significance-tests/v/p-values-and-significance-tests

### Recoding Continuous Independent Variables
- Recoding involves taking original values and mathematically converting them to other values.
- While recoding can cause you to interpret some of the results differently, the p-values and goodness-of-fit measures remain the same when you fit the same model.
- Two common coding methods are standardization and centering.

#### Standardizing the Continuous Variables
- To standardize a variable:
    - Take each observed value for a variable.
    - Subtract the variable's mean.
    - Then divide by the variable's standard deviation.
- When we standardize a variable, the coded value denotes where the observation falls in the distribution of values.
- It indicates the number of standard deviations above or below the variable's mean.
- The sign indicates whether the observation is above or below the mean, and the number indicates the number of standard deviations.
- A standardized value of 0 indicates that the value is precisely equal to the mean.

#### Interpreting Standardized Coefficients
- When we fit a model using standardized independent variables, the coefficients are now standardized coefficients.
- Standardized coefficients signify the mean change of the dependent variable given a one standard deviation increase in an independent variable.
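
The sketch below standardizes two simulated predictors and compares raw and standardized fits. The variable names (`temp_c`, `thickness_cm`), their distributions, and the true effects are assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
temp_c = rng.normal(200, 15, n)          # hypothetical temperature (C)
thickness_cm = rng.normal(2.0, 0.3, n)   # hypothetical thickness (cm)
y = 5 + 0.2 * temp_c + 8 * thickness_cm + rng.normal(0, 2, n)


def standardize(v):
    """Subtract the mean, then divide by the standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)


def fit(columns, y):
    """OLS with a constant; returns coefficients and R-squared."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1 - resid.var() / y.var()
    return coef, r2


raw_coef, raw_r2 = fit([temp_c, thickness_cm], y)
std_coef, std_r2 = fit([standardize(temp_c), standardize(thickness_cm)], y)

print("raw coefficients:         ", raw_coef[1:])
print("standardized coefficients:", std_coef[1:])
print("R-squared (raw, standardized):", raw_r2, std_r2)
```

Each standardized coefficient equals the raw coefficient multiplied by that variable's standard deviation, and R-squared is unchanged, since recoding does not change the fit.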

#### Why obtain Standardized Coefficients?
- Standardization puts all the variables on a consistent scale, which allows us to compare the magnitudes of the standardized coefficients.
- It is used when the independent variables have entirely different units, like temperature (C) and thickness (cm), or have no specified units at all, like a rating scale (from 1 to ..).

### Main Effects of Categorical Variables
- Categorical variables, also known as nominal variables, have values that can be put into a countable number of distinct groups based on a characteristic.
- For categorical variables we have the variable name and then the level of the variables.
- With categorical variables we are dealing with groups of data that we cannot incrementally increase, and therefore they cannot be plotted using a scatterplot.
- The levels of variables represent groups in the data and can be plotted using boxplots or barcharts.
- Regression analysis estimates mean differences between these groups and determines if they are statistically significant.
- Including categorical variables in the regression model allows us to determine whether the differences seen in the graph are statistically significant.

#### Coding Categorical Variables
- To analyze categorical variables, they are converted into indicator variables, also known as dummy variables.
- They are columns of 1's and 0's that indicate the presence or absence of a characteristic.
- 1 indicates the presence while a 0 represents its absence. 
- The number of indicator variables depends on the number of categorical levels (the number of distinct categories).

- The below table shows the indicator variables created for each level of Gender categorical variable.

| Gender | Male | Female |
|--------|------|--------| 
| Male   | 1    | 0      |
| Female | 0    | 1      |
| Female | 0    | 1      |
| Male   | 1    | 0      |

- In the table, the Gender column represents the categorical data that we enter into the worksheet.
- The value depends on the gender of the subject for which the row corresponds.
- The ***Male*** and ***Female*** columns are the ***indicator variables*** based on the gender column.
- The Male column contains 1's for observations that correspond to males and 0s for non-males. The same applies to the Female column.
- These two columns supply completely redundant information, because one column can predict the other one perfectly.
- This is referred to as perfect **multicollinearity**, which will create an error if we enter both the indicator variables in a regression model.
- For a categorical variable we must omit one of the underlying indicator variables from the model, which becomes the **reference level**.

- The below table shows the variable College Major which has 3 levels.

| College Major     | Psychology | Political Science | Statistics |
|-------------------|------------|-------------------|------------|
| Statistics        | 0          | 0                 | 1          |
| Psychology        | 1          | 0                 | 0          |
| Statistics        | 0          | 0                 | 1          |
| Political Science | 0          | 1                 | 0          |
| Psychology        | 1          | 0                 | 0          |

- In this table, **College Major** is the categorical variable and other columns are indicator variables.
- For each row there is a single value of 1, others are 0.
- As with the Gender variable if we include all the indicator variables, we are supplying redundant information which will result in error in the model.
- If we take any two columns, we can always figure out the value of the third column.
- To perform regression analysis we will have to remove one indicator variable, that becomes the reference level. 
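
The redundancy described above can be seen numerically. This sketch rebuilds the College Major table with `pandas.get_dummies` and checks the rank of the design matrix with and without a reference level. Dropping `Statistics` is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

majors = pd.Series(
    ["Statistics", "Psychology", "Statistics", "Political Science", "Psychology"],
    name="College Major",
)

# One indicator column per level (columns come out in alphabetical order)
indicators = pd.get_dummies(majors, dtype=int)
print(indicators)

# Constant plus ALL indicators: the indicator columns add up to the constant
# column, so the design matrix is rank deficient (perfect multicollinearity).
full = np.column_stack([np.ones(len(majors)), indicators.to_numpy()])
print("columns:", full.shape[1], "rank:", np.linalg.matrix_rank(full))

# Drop one level so that it becomes the reference level
reduced_indicators = indicators.drop(columns="Statistics")
reduced = np.column_stack([np.ones(len(majors)), reduced_indicators.to_numpy()])
print("columns:", reduced.shape[1], "rank:", np.linalg.matrix_rank(reduced))
```

With all three indicators the matrix has more columns than its rank, so the coefficients cannot be uniquely estimated. After dropping one level, the number of columns and the rank agree.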

#### Summary for Coding Categorical Variables
- For every categorical variable we must remove one level from the analysis and use it as the reference level.
- The choice of which indicator variable to remove changes what each remaining coefficient is compared against.
- This can change the individual coefficients and their p-values; however, using a different reference level does not change the overall story or the overall statistical significance of the variable.
- If the variable has a natural baseline or a category for comparison, using that level as the reference level will make the interpretation more natural.

#### Interpreting the Results for Categorical Variables
- There are several tests that can be performed on categorical variables.

**F tests**
- Since a categorical variable is represented by multiple indicator variables, we use F tests for that group of indicator variables.
- Unlike t-tests, F tests can evaluate multiple model terms simultaneously, which allows them to compare the fits of different linear models.
- F-tests compare the variability between groups to the variability within groups. This way we can check whether the differences are significant or due to pure chance. (See the video under Further Reading for more details.)
- Here the F-test compares the fit of the model with the set of indicator variables that corresponds to a categorical variable with the model without that set of indicator variables.
- The **Hypothesis for F-Tests** are as follows:
    - ***Null*** : The model with the categorical variable does not improve the fit of the model without the categorical variable.
    - ***Alternative*** : The model with the categorical variable fits the data better than the model without categorical variable.

- If the p-value is less than the significance level, then we can reject the null hypothesis and conclude that including the categorical variable improves the fit of the model.

**t - tests**
- While F tests tell us about the categorical variable as a whole, t-tests allow us to explore the differences between each group mean and the reference level mean.
- The coefficients represent the difference between each level(each category) mean and the reference level mean.
- p-value is used to determine whether the difference is statistically significant.
- The **Hypothesis for t-Tests** are as follows:
    - ***Null*** : The difference between level mean and reference level mean equals zero (not significant)
    - ***Alternative*** : The difference between level mean and reference level mean does not equal zero.

- If the p-value is less than the significance level, then we can reject the null hypothesis and conclude that the level mean is significantly different from the reference level mean.
- As they are main effects, the sizes of the effects do not change based on the values of other variables in the model. 


**Further Reading**
- Hypothesis Test with F-statistic : https://www.khanacademy.org/math/statistics-probability/analysis-of-variance-anova-library/analysis-of-variance-anova/v/anova-1-calculating-sst-total-sum-of-squares

#### Example of a model with Categorical Variable
- In this part we will analyze the data that includes a categorical variable and a continuous variable.
- First we will plot different categories of independent variable and see how they relate to the dependent variable.
- Next we will fit the model using OLS regression method, then do ANOVA (Analysis of Variance) on the model.
- After the analysis we will interpret the results.

**Reference Links for the statsmodels Library**
- https://www.statsmodels.org/dev/example_formulas.html
- https://www.statsmodels.org/stable/generated/statsmodels.stats.anova.anova_lm.html#statsmodels.stats.anova.anova_lm
- https://www.statsmodels.org/stable/anova.html#module-statsmodels.stats.anova
- https://www.statsmodels.org/dev/contrasts.html#user-defined

In [38]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm

In [17]:
# Read the data and display it
major_income_df = pd.read_csv("../../datasets/RegressionAnalysisDatasets/Categorical_Example.csv", index_col=False)
# experience_df = major_income_df["Experience"]
# major_income_df.columns
major_income_df.head(10)

Unnamed: 0,Income,Major,Experience
0,36669,Political Science,4
1,24446,Political Science,1
2,61115,Political Science,5
3,24446,Political Science,1
4,24446,Political Science,2
5,24446,Political Science,2
6,85561,Political Science,7
7,24446,Political Science,4
8,48892,Political Science,5
9,85561,Psychology,5


In [34]:
# One-way ANOVA (Analysis of Variance) of Income and Experience across the Major groups.
# Step 1 - Group the Data by  Major
major_groups = major_income_df.groupby("Major")
# Step 2 - Extract Individual Groups
major_groups_dict = {}
for group_name, group_df in major_groups:
    major_groups_dict[group_name] = group_df[["Income", "Experience"]]

political_science = major_groups_dict["Political Science"]
psychology = major_groups_dict["Psychology"]
statistics = major_groups_dict["Statistics"]

In [35]:
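# Perform the ANOVA using scipy.stats.
# Each group has two columns, so f_oneway runs column-wise:
# the first F and p-value are for Income, the second for Experience.
stats.f_oneway(political_science, psychology, statistics)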

F_onewayResult(statistic=array([2.41589916, 1.97925371]), pvalue=array([0.10833903, 0.15771777]))

In [51]:
## Another way to carry out ANOVA test is to first fit the regression model and then calculate the f and p values.
from statsmodels.formula.api import ols

model = ols("Income ~ Experience + C(Major, Treatment(reference='Statistics'))", data=major_income_df).fit()
model.summary()

0,1,2,3
Dep. Variable:,Income,R-squared:,0.288
Model:,OLS,Adj. R-squared:,0.205
Method:,Least Squares,F-statistic:,3.5
Date:,"Thu, 14 Sep 2023",Prob (F-statistic):,0.0295
Time:,07:35:58,Log-Likelihood:,-339.43
No. Observations:,30,AIC:,686.9
Df Residuals:,26,BIC:,692.5
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,4.906e+04,7469.982,6.568,0.000,3.37e+04,6.44e+04
"C(Major, Treatment(reference='Statistics'))[T.Political Science]",-2.719e+04,9812.758,-2.771,0.010,-4.74e+04,-7024.312
"C(Major, Treatment(reference='Statistics'))[T.Psychology]",-5368.3564,9915.559,-0.541,0.593,-2.58e+04,1.5e+04
Experience,5085.2826,2283.663,2.227,0.035,391.146,9779.419

0,1,2,3
Omnibus:,5.853,Durbin-Watson:,2.559
Prob(Omnibus):,0.054,Jarque-Bera (JB):,4.111
Skew:,0.743,Prob(JB):,0.128
Kurtosis:,4.04,Cond. No.,10.8


**Regression Equation**
- In this example, the constant (49064) is the fitted mean income for the reference level, Statistics, at zero years of experience.
- The coefficients for the other categories give the difference between each group and the reference level, and that difference is the same at every value of Experience.
- The indicator variables will shift the regression line up and down the y-axis for specific groups by the value of the coefficient for the corresponding indicator variable.
- We can obtain separate equations for each categorical level with different constants as shown below.

| Major             | Equation                                             |
|-------------------|------------------------------------------------------|
| Political Science | Income = (49064 - 27195 = 21869) + 5085 * Experience |
| Psychology        | Income = (49064 - 5368 = 43696) + 5085 * Experience  |
| Statistics        | Income = (49064) + 5085 * Experience                 |
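
The three equations in the table can be written as a single function in which the major only shifts the constant. The names below are hypothetical helpers, and the numbers are the rounded coefficients from the table.

```python
# Rounded coefficients taken from the table above
INTERCEPT = 49064            # constant for Statistics, the reference level
EXPERIENCE_SLOPE = 5085      # shared slope for every major
MAJOR_SHIFT = {
    "Statistics": 0,
    "Political Science": -27195,
    "Psychology": -5368,
}


def predicted_income(major, years_experience):
    """Fitted mean income for a major at a given level of experience."""
    return INTERCEPT + MAJOR_SHIFT[major] + EXPERIENCE_SLOPE * years_experience


for major in MAJOR_SHIFT:
    print(major, predicted_income(major, 0), predicted_income(major, 5))
```

Because the slope is shared, the gap between any two majors is the same at 0 years of experience as at 5 years. The lines are parallel and differ only in where they cross the y-axis.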

In [60]:
# ANOVA Result
anova_result = sm.stats.anova_lm(model, typ=2)
anova_result

Unnamed: 0,sum_sq,df,F,PR(>F)
"C(Major, Treatment(reference='Statistics'))",3762712000.0,2.0,4.141929,0.027447
Experience,2252343000.0,1.0,4.958681,0.034833
Residual,11809780000.0,26.0,,


### Interpretation of Example Results
- In the above example we want to determine whether college major relates to Income.
- We included College Major as a categorical variable with 3 levels - Political Science, Statistics and Psychology.
- The analysis will determine whether the mean differences between the groups are statistically significant.
- We are also including years of experience as a continuous variable, by adding this variable we can control for differences in the years of experience that might exist between the groups.
- If the subjects of one group have more years of experience by chance, the mean income of that group will appear to be higher, but this would be due to experience rather than the major itself.
- But by including Experience as a variable, the model controls for that possibility, which means we can learn about income differences by major while keeping experience constant.
- Here we assume that we are studying this from the perspective of Statistics department (how other majors compare to statistics), so we use "Statistics" as a reference level for the model. This is displayed in the formula where we define it explicitly.

#### ANOVA Table
- In the ANOVA table in the above cell, we can find the overall significance of the variables.
- The p-value for the Experience variable is 0.035 which is less than the significance level of 0.05, which means that the variable is statistically significant.
- For the categorical variable Major, it uses 2 degrees of freedom as it has 3 levels and we are using Statistics as reference level, so it is excluded from the model.
- The model includes 2 indicator variables to represent the entire categorical variable of Major.
- If a categorical variable has many levels, it will use many degrees of freedom.
- This is a problem if the sample size is small and can lead to overfitting.
- The F-value in the output (p = 0.027) shows that the variable Major is statistically significant overall at the 0.05 level. It improves the fit of the model.
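
The F-value for Major can be recomputed from the sums of squares and degrees of freedom in the ANOVA table. This is a sketch assuming the standard definition, F = mean square of the term / mean square of the residual, with the rounded values copied from the table.

```python
from scipy import stats

# Values copied from the ANOVA table above
ss_major, df_major = 3762712000.0, 2      # 3 levels minus the reference level
ss_resid, df_resid = 11809780000.0, 26

ms_major = ss_major / df_major   # mean square for Major
ms_resid = ss_resid / df_resid   # mean square error

f_stat = ms_major / ms_resid
# Upper-tail probability of the F distribution
p_value = stats.f.sf(f_stat, df_major, df_resid)
print(f"F = {f_stat:.4f}, p = {p_value:.4f}")
```

This should agree with the `F` and `PR(>F)` columns for Major up to rounding. Note that each extra level of a categorical variable adds one to `df_major` and removes one from `df_resid`.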

#### Regression Coefficients
- In this we explore the differences in mean income by major by assessing the coefficients.
- The model summary table displays the coefficients for Major and Experience.
- The table displays only two levels for the 'Major' variable. Because we used Statistics as the reference level, it is excluded from the output.
- We can choose different reference levels based on the problem statement.
- The coefficients of the other 2 levels, Political Science and Psychology, indicate how the mean incomes of these majors compare to the mean income of the Statistics major.
- Both coefficients are negative, which means that these majors have lower mean incomes than Statistics in this sample.
- We learn the following from the coefficients:
    - The mean income for political science majors is $27,195 LESS than the mean income for statistics majors, holding experience constant.
    - The mean income for psychology majors is $5,368 LESS than the mean income for statistics majors, holding experience constant.

- ***Experience coefficient*** 
    - For each one-year increase in experience, mean income increases by an average of $5,085 while holding major constant.
    - Conversely, the Major coefficients are estimated while holding years of experience constant. This is useful for isolating the effect of each variable.

#### p-values and Significance
- The `P>|t|` column in the above table shows the p-values for t-tests.
- These p-values determine whether the mean differences are statistically significant.
- The Political Science coefficient is statistically significant. So we can reject the null hypothesis that the mean difference is zero.
- This means that we are rejecting the notion that the coefficient can plausibly equal 0, even while accounting for random sampling error.
- We have sufficient evidence to conclude that these two means are different.
- On the other hand, the difference between the mean incomes for Psychology and Statistics is not statistically significant (0.593 > 0.05).
- This means that we have insufficient evidence to conclude that there is a difference between the two means (Statistics and Psychology).
- In other words, the value -$5,368 might represent random error, and if we take another sample and repeat the analysis, the difference might not be there.

#### Summary and Overall Significance
- If we fit the model using a different reference level, the overall significance of the variable Major in the ANOVA table will remain the same as will the goodness-of-fit measures like R-squared.
- The comparisons between specific levels will change as we'd be comparing the majors to a different reference level.
- Use the reference level that makes the most sense for the question.
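
The effect of switching the reference level can be sketched on simulated data. The category labels `A`, `B`, `C` and their assumed shifts are made up for this illustration and are unrelated to the income dataset.

```python
import numpy as np

rng = np.random.default_rng(3)
levels = ["A", "B", "C"]                          # hypothetical category labels
group = rng.choice(levels, size=90)
group_effect = {"A": 0.0, "B": 5.0, "C": -3.0}    # assumed true shifts
y = 20 + np.array([group_effect[g] for g in group]) + rng.normal(0, 2, 90)


def fit_with_reference(reference):
    """OLS with indicator variables for every level except the reference."""
    others = [lv for lv in levels if lv != reference]
    indicator_cols = [(group == lv).astype(float) for lv in others]
    X = np.column_stack([np.ones(len(y))] + indicator_cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1 - resid.var() / y.var()
    return dict(zip(["const"] + others, coef)), r2


coef_a, r2_a = fit_with_reference("A")
coef_c, r2_c = fit_with_reference("C")

print("reference A:", coef_a, "R-squared:", r2_a)
print("reference C:", coef_c, "R-squared:", r2_c)
```

R-squared is identical in both fits. Only the comparisons change: the coefficient for C relative to A is the negative of the coefficient for A relative to C, and the fitted mean of each group stays the same.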

### Interpreting Count and Ordinal Variables
- There are some scenarios where there is some ambiguity as to whether a variable is continuous or categorical.
- In such cases we have the discretion to include a variable as continuous or categorical.
- This tends to occur in 2 broad types of scenarios
    1. **Count or Ordinal Variable**
        - In the first scenario, at least one of your independent variables is a count variable or an ordinal variable.
        - These types of variables are discrete, but they contain information about order, scale or magnitude.
        - They share properties of both continuous and categorical variables, but aren't quite either one.
        - ***Count variables*** are non-negative integers. Examples: number of defects, number of days in a hospital, and number of treatment sessions.
        - ***Ordinal Variables*** have at least 3 categories and the categories have a natural order. The categories are ranked, but the differences between the categories might not be equal.  
        For ex - first, second and third in a race are ordinal data. The difference in time between first and second place might not be the same as the difference between second and third place.

    2. **Continuous variable with discrete values**
        - In the second scenario we have a continuous variable, but it uses only a limited number of discrete chosen values.
        - For example, a study might bake cakes at temperatures of 325, 375 and 425 degrees, or a longitudinal study might take observations at specific intervals of 1 month, 2 months, 3 months, etc.

**Note**:  
Determining how to include these variables in the model depends on both the nature of your data and the purpose of your study.

#### The Case for including it as a Continuous Variable
- When a variable has many levels, it might be best to include it as a continuous variable.
- At a bare minimum, a variable must have at least 3 values to assess a straight-line fit. However, it's hard to determine whether there is a linear trend with only 3 values.
- Fitting a curved relationship requires more values.
- If a study wants to determine how changes in the independent variable relate to changes in the dependent variable, including the variable as a continuous variable lets us estimate that type of relationship.

#### The Case for including it as a Categorical Variable
- If a variable has only a few levels, we can include it as a categorical variable.
- In this case, the procedure estimates a fitted mean for each group and does not consider the order of values.
- But as the number of values increases, it becomes increasingly unwieldy comparing difference between means.
- Additionally coding a categorical variable requires many degrees of freedom when a variable has many levels. This issue is problematic when we have a small sample size.
- If a study wants to assess group means and differences between means, including the variable in question as a categorical variable lets us answer these questions.
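
The trade-off can be sketched with the baking temperatures mentioned earlier. The response values below are simulated under an assumed pattern (a peak at the middle temperature), so this illustrates the two codings rather than analysing real data.

```python
import numpy as np

rng = np.random.default_rng(5)
temps = np.repeat([325.0, 375.0, 425.0], 10)    # 10 hypothetical cakes per temperature
true_mean = {325.0: 6.0, 375.0: 9.0, 425.0: 7.0}  # assumed, not a straight line
score = np.array([true_mean[float(t)] for t in temps]) + rng.normal(0, 0.5, temps.size)

# Continuous coding: one slope, so the fitted means must lie on a straight line
X_cont = np.column_stack([np.ones(temps.size), temps])
coef_cont, *_ = np.linalg.lstsq(X_cont, score, rcond=None)

# Categorical coding: 325 is the reference level, with indicators for 375 and 425
X_cat = np.column_stack(
    [np.ones(temps.size), temps == 375.0, temps == 425.0]
).astype(float)
coef_cat, *_ = np.linalg.lstsq(X_cat, score, rcond=None)

group_means = {t: score[temps == t].mean() for t in (325.0, 375.0, 425.0)}

print("continuous (const, slope):", coef_cont)
print("categorical (const, 375 vs 325, 425 vs 325):", coef_cat)
print("group means:", group_means)
```

The categorical fit uses one more parameter and reproduces each group mean exactly while ignoring the order of the temperatures. The continuous fit uses a single slope and summarises the trend, but it cannot represent the peak in the middle.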

### Constant (Y Intercept)
- The constant term in regression analysis is the value at which the regression line crosses the y-axis. The constant is also known as the y-intercept.
- The constant is often defined as the mean of the dependent variable when you set all of the independent variables in your model to zero.
- In a purely mathematical sense, this definition is correct, but in practice it is often impossible to set all the independent variables to zero, because that combination can be nonsensical.
- The more independent variables you have, the less likely it is that each and every one of them can equal zero simultaneously.
- If the independent variables can't all equal zero, the y-intercept can take an impossible value, such as the negative weight in the Height-Weight example.
- As a general statistical guideline, never make a prediction for a point that is outside the range of observed values that were used to fit the regression model.
- Part of the estimated y-intercept reflects relevant variables that were left out of the regression model.
- When we leave relevant variables out, this can produce bias in the model.
- ***Bias*** exists if the residuals have an overall positive or negative mean. In other words, the model tends to make predictions that are systematically too high or too low.
- The constant term prevents this overall bias by forcing the residual mean to equal zero.
- Imagine moving the regression line up or down until the residual mean equals zero. This is how the constant in a regression model ensures that the residual average equals zero.
- However, this process does not focus on producing a y-intercept that is meaningful for the study area. Thus, the intercept often has no meaningful interpretation in a regression equation.
- The constant ensures that the residuals don't have an overall bias, but that might make it meaningless.

#### Including the constant in a Regression Model
- This is why you should almost always include the constant in your regression model: it forces the residuals to have a zero mean, which is important.
- If you don't include the constant in your regression model, you are actually setting the constant to zero.
- This will force the regression line to go through the origin. In other words, a model that doesn't include the constant requires the fitted value of the dependent variable to equal zero when all of the independent variables equal zero.
- If this isn't correct for the study area, the regression model will exhibit bias without the constant.
- When it comes to using and interpreting the constant in a regression model, you should almost always include the constant in the regression model even though it is almost never worth interpreting.
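
The role of the constant can be illustrated by fitting the same simulated data with and without it. The data below are generated from an assumed relationship loosely modeled on the Height-Weight example and are not the actual dataset.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200
height = rng.uniform(1.4, 1.9, n)
# Assumed relationship with a large negative intercept
weight = -114 + 106.5 * height + rng.normal(0, 5, n)

# Model with a constant
X_const = np.column_stack([np.ones(n), height])
coef_const, *_ = np.linalg.lstsq(X_const, weight, rcond=None)
resid_const = weight - X_const @ coef_const

# Model without a constant: the line is forced through the origin
X_origin = height.reshape(-1, 1)
coef_origin, *_ = np.linalg.lstsq(X_origin, weight, rcond=None)
resid_origin = weight - X_origin @ coef_origin

print("with constant:    coef =", coef_const, "residual mean =", resid_const.mean())
print("without constant: coef =", coef_origin, "residual mean =", resid_origin.mean())
```

With the constant, the residual mean is zero up to floating-point error. Without it, the residuals have a nonzero mean and the slope is distorted, because the line has to pass through a point (zero height, zero weight) that is far outside the observed data.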

**Side Note**  
The key benefit of regression analysis is determining how changes in the independent variables are associated with the shifts in the dependent variable. Don't think about the y-intercept too much.