## Interpreting Main Effects and Significance

### Regression Notation (OLS)
- The following notation for Ordinary Least Squares (OLS) regression applies to regression models for an entire population with k independent variables.
- These are ideal models that you would obtain if you could measure the entire population.

$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \epsilon$

In this notation:
- Y represents the dependent variable
- The betas ($\beta$) represent the true population parameters. $\beta_0$ is the constant, while the other betas are the coefficients for the independent variables.
- $X$'s are the independent variables.
- Epsilon ($\epsilon$) represents the error, which is the left-over random portion of variability that the model can't explain.

- We never work with the entire population. Instead we use samples to estimate the population parameters. The notation for a regression model based on a sample is the following:

$\hat{y} = \hat{\beta_0} + \hat{\beta_1} X_1 + \dots + \hat{\beta_k} X_k$, with residual $\hat{\epsilon} = y - \hat{y}$

In this notation:
- The hats represent the sample estimates of the population values.
- $\hat{y}$ - represents the fitted value for the dependent variable. When we enter the values for the independent variables into the regression equation, we obtain the fitted value of the dependent variable.
- $\hat{\beta}$ - the beta hats represent the estimates of the population parameters. These estimates are the `regression coefficients` that appear in your output.
- $\hat{\epsilon}$ - the epsilon hat represents the estimate of the error, which we call the `residual`: the difference between the observed value and the fitted value.
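
The distinction between population parameters and sample estimates can be sketched with simulated data. This is only an illustration: the "population" values `beta_0 = 2.0` and `beta_1 = 0.5`, the sample size, and the noise level are all assumptions made up for the example, and `np.polyfit` is used here as a simple stand-in for an OLS fit with one independent variable.

```python
import numpy as np

# Hypothetical "population" parameters, chosen only for illustration
beta_0, beta_1 = 2.0, 0.5

rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=50)        # sampled values of the independent variable
epsilon = rng.normal(0, 1, size=50)    # random error term
y = beta_0 + beta_1 * x + epsilon      # observed dependent variable

# Least-squares estimates (np.polyfit returns the slope first, then the intercept)
beta_1_hat, beta_0_hat = np.polyfit(x, y, deg=1)

y_hat = beta_0_hat + beta_1_hat * x    # fitted values
residuals = y - y_hat                  # sample estimates of the error term

print(f"beta_0_hat = {beta_0_hat:.3f}, beta_1_hat = {beta_1_hat:.3f}")
print(f"sum of residuals = {residuals.sum():.2e}")
```

The estimates land near, but not exactly on, the assumed population values, and the residuals of a model with a constant sum to approximately zero.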

### Three Types of Effects in Regression Models
- **Main Effects** - The relationship between an independent variable and the dependent variable does not depend on the value of other variables in the model.

- **Curvilinear Effects** - The relationship between the dependent and independent variable changes based on the value of that independent variable itself. Instead of following a straight line on a graph, these relationships follow curves.

- **Interaction Effects** - The relationship between an independent variable and the dependent variable depends on the value of at least one other independent variable in the model.

- For main and interaction effects, the interpretation differs for continuous versus categorical variables. 
- On the other hand, curved relationships can exist only for continuous data.
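
As an illustration (one common parameterization, not the only one), the three types of effects can appear together in a single model with two continuous variables, assuming the curvature is modeled with a squared term and the interaction with a product term:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_1 X_2 + \epsilon$

Here $\beta_1$ and $\beta_2$ belong to the main effect terms, $\beta_3$ to the curvilinear term for $X_1$, and $\beta_4$ to the interaction between $X_1$ and $X_2$.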

### Main Effects of Continuous Variables
- Main effects for continuous variables are the most common type of relationship in regression models. 
- For example, suppose we have two independent variables, A and B:
    - Both variables are statistically significant, and the model provides a good fit for the data.
    - In this scenario we can conclude that the effect of variable A on the dependent variable (output) does not change as the value of variable B changes.
    - Variable A's effect is also consistent throughout the range of values for A.
    - The same applies to the effect of variable B: it does not depend on A, and it remains consistent.
    

- Coefficients and p-values in regression analysis help us understand which relationships in the model are statistically significant and the nature of those relationships.
- The coefficients represent a variable's effect and describe the magnitude and direction of the relationship between each independent variable and the dependent variable.
- Coefficients are the numbers in the regression equation that multiply the values of the variables (the slope, or m).
- The p-values for the coefficients indicate whether these relationships are statistically significant.
- The sign of a regression coefficient tells us whether the mean of the dependent variable increases or decreases as each independent variable increases. A positive coefficient means that as the value of the independent variable increases, the mean of the dependent variable also increases.
- A negative coefficient means that the mean of the dependent variable decreases as the value of the independent variable increases (an inverse relationship).
- The coefficient value signifies how much the mean of the dependent variable changes with a one-unit increase in the independent variable, while holding all the other variables constant.
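
To make the interpretation of a coefficient concrete, the sketch below plugs in the constant and height coefficient reported in the model summary later in this notebook (-114.3259 and 106.5046). The helper `predicted_weight` and the two example heights are hypothetical, introduced only for illustration.

```python
# Coefficients copied from the OLS summary output later in this notebook
intercept = -114.3259
height_coef = 106.5046   # kg per metre of height

def predicted_weight(height_m):
    """Fitted mean weight (kg) for a given height (m)."""
    return intercept + height_coef * height_m

w_160 = predicted_weight(1.60)
w_170 = predicted_weight(1.70)

# A 0.10 m increase in height changes the fitted mean weight by 0.10 * coefficient
print(round(w_160, 2), round(w_170, 2), round(w_170 - w_160, 2))
```

Because height is measured in metres, a one-unit increase (1 m) is unrealistically large; scaling the coefficient to a 0.10 m step gives a change of roughly 10.65 kg in the fitted mean weight.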

### Confidence Intervals for Regression Parameters
- If we collect a random sample and calculate the mean, the sample mean is the point estimate for population mean.
- We will never know the exact value of the population parameter because we only work with samples (it is not possible to collect data for the entire population).
- The point estimate doesn't indicate how far from the population parameter it is likely to be.
- For this we can calculate the confidence intervals for population parameters.
- A `confidence interval` is derived from a sample and provides a range of values that likely contains the unknown value of the population parameter.
- Different random samples drawn from the population are likely to produce slightly different intervals.
- If we draw many random samples and calculate a confidence interval for each sample, a specific proportion of the intervals contain the population parameter. That percentage is the `confidence level`.
- For example, a 95% confidence level suggests that if you draw 20 random samples from the same population, you'd expect about 19 of the confidence intervals, on average, to include the population value.
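
The meaning of the confidence level can be checked with a small simulation. This is a minimal sketch under stated assumptions: the population mean and standard deviation (70 and 10) are invented, and the population standard deviation is treated as known so that a simple normal-based interval for the mean can be used.

```python
import random
from statistics import NormalDist, mean

random.seed(42)

TRUE_MEAN, TRUE_SD = 70.0, 10.0      # hypothetical population values
SAMPLE_SIZE, N_SAMPLES = 30, 2000

z = NormalDist().inv_cdf(0.975)      # about 1.96 for a 95% interval
margin = z * TRUE_SD / SAMPLE_SIZE ** 0.5

covered = 0
for _ in range(N_SAMPLES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    sample_mean = mean(sample)
    # Count the interval if it contains the true population mean
    if sample_mean - margin <= TRUE_MEAN <= sample_mean + margin:
        covered += 1

coverage = covered / N_SAMPLES
print(f"Proportion of intervals containing the true mean: {coverage:.3f}")
```

The printed proportion should be close to 0.95, which matches the idea that about 19 out of 20 intervals capture the population value.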

- The confidence interval provides meaningful estimates because it produces ranges that usually contain the parameter.
- We can also see how far our point estimate is likely to be from the parameter value.
- In the regression context, we use the sample to calculate the regression coefficients ($\hat{\beta}$s), which are the point estimates of the population parameters ($\beta$s).
- In the example below, the sample estimate of the height coefficient is 106.5. But if we collect multiple samples from the same population, each sample will have its own estimate of the height coefficient.
- From the point estimate alone, we can't know the real value or how close our estimate is likely to be to the actual population value.
- For this we calculate the confidence interval for the regression coefficients, as shown in the cells below.


In [2]:
## import libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

In [3]:
## 1. Read the file and create a Data Frame
height_weight_df = pd.read_csv("../../datasets/RegressionAnalysisDatasets/HeightWeight.csv")
height_weight_df.head()

|   | Height M | Weight kg |
|---|----------|-----------|
| 0 | 1.6002   | 49.441572 |
| 1 | 1.651    | 62.595751 |
| 2 | 1.651    | 75.749931 |
| 3 | 1.53035  | 48.987979 |
| 4 | 1.45415  | 43.091278 |


In [4]:
## 2. Define predictor and response variables

x = height_weight_df["Height M"] ## Independent or predictor Variable
y = height_weight_df["Weight kg"] ## Dependent or response variable

## 3. Add constant (intercept column) to predictor variables
x = sm.add_constant(x)

## 4. Fit Linear Regression Model
ols_model = sm.OLS(y, x).fit()

## 5. View model summary
ols_model.summary()

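| Statistic         | Value            | Statistic           | Value    |
|-------------------|------------------|---------------------|----------|
| Dep. Variable:    | Weight kg        | R-squared:          | 0.497    |
| Model:            | OLS              | Adj. R-squared:     | 0.491    |
| Method:           | Least Squares    | F-statistic:        | 85.03    |
| Date:             | Sun, 03 Sep 2023 | Prob (F-statistic): | 1.74e-14 |
| Time:             | 10:47:17         | Log-Likelihood:     | -305.7   |
| No. Observations: | 88               | AIC:                | 615.4    |
| Df Residuals:     | 86               | BIC:                | 620.4    |
| Df Model:         | 1                |                     |          |
| Covariance Type:  | nonrobust        |                     |          |

|          | coef      | std err | t      | P>\|t\| | [0.025   | 0.975]  |
|----------|-----------|---------|--------|---------|----------|---------|
| const    | -114.3259 | 17.443  | -6.554 | 0.000   | -149.001 | -79.651 |
| Height M | 106.5046  | 11.550  | 9.221  | 0.000   | 83.544   | 129.465 |

| Statistic      | Value | Statistic         | Value |
|----------------|-------|-------------------|-------|
| Omnibus:       | 7.927 | Durbin-Watson:    | 1.176 |
| Prob(Omnibus): | 0.019 | Jarque-Bera (JB): | 7.925 |
| Skew:          | 0.733 | Prob(JB):         | 0.019 |
| Kurtosis:      | 3.119 | Cond. No.         | 45.0  |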


In [8]:
## Get the confidence interval of the fitted parameters
# for reference https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.conf_int.html
ols_model.conf_int(alpha=0.05, cols=None)

# NOTE: The confidence interval is based on Student's t distribution

|          | 0           | 1          |
|----------|-------------|------------|
| const    | -149.000505 | -79.65136  |
| Height M | 83.543896   | 129.465213 |


- From the above confidence interval we can be 95% confident that the actual population value for the height coefficient is between 83.5 and 129.5.
- The width of a confidence interval reveals the precision of the estimate.
- Narrower ranges suggest a more precise estimate.
- We can evaluate precision by calculating confidence intervals.
- If we add more variables to the model and these confidence intervals become wider, that is a warning sign: the additional variables are reducing the precision of the coefficient estimates.
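
As a rough check, the interval for the height coefficient can be reproduced by hand from the summary output, assuming the usual t-based formula (consistent with the note above that the interval is based on Student's t distribution). With 86 residual degrees of freedom, the critical value is approximately 1.988:

$\hat{\beta}_1 \pm t_{0.975,\,86} \times SE(\hat{\beta}_1) \approx 106.5046 \pm 1.988 \times 11.550 \approx (83.54,\ 129.47)$

This agrees, up to rounding, with the bounds 83.544 and 129.465 reported by `conf_int`. A larger standard error would widen the interval, which is why the width reflects the precision of the estimate.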

### Interpreting p-values for Continuous Independent Variables
- Regression Analysis is a form of inferential statistics, where we use a sample to draw conclusions about an entire population.
- Sampling error can produce effects in the sample that don't exist in the population.
- p-values and significance levels help determine whether the relationship observed in the sample also exists in the larger population.
- The p-value for each independent variable tests the null hypothesis that there is no relationship with the dependent variable.
- If there is no relationship, then there is no association between the changes in the independent variable and the shifts in the dependent variable.  

- The hypotheses for each independent variable are the following:  
    - **Null Hypothesis** - The coefficient for the independent variable equals zero (there is no relationship). We fail to reject it when the p-value is greater than the significance level.  

    - **Alternative Hypothesis**  - The coefficient for the independent variable does not equal zero (there is a relationship). We favor it when the p-value is less than the significance level.

- If the `p-value` for a variable is less than the significance level, the sample data provide enough evidence to reject the null hypothesis for the entire population.
- This means changes in the independent variable are associated with changes in the response at the population level.
- The variable is statistically significant and can be included in the regression model. A significance level of 0.05 is the most common choice.
- If the p-value is greater than the significance level, there is not enough evidence in the sample to reject the null hypothesis, so we cannot conclude that the coefficient differs from zero.
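
The decision rule above can be sketched for the height coefficient using the values from the summary output (coefficient 106.5046, standard error 11.550, 86 residual degrees of freedom). This assumes the standard two-sided t-test for a single coefficient and uses `scipy.stats`, which is not imported elsewhere in this notebook.

```python
from scipy import stats

# Values copied from the OLS summary output above
coef, std_err, df_resid = 106.5046, 11.550, 86

# Test statistic for H0: coefficient = 0
t_stat = coef / std_err

# Two-sided p-value from Student's t distribution
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)

alpha = 0.05   # significance level
print(f"t = {t_stat:.3f}, p = {p_value:.2e}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The t statistic matches the 9.221 shown in the summary table, and the p-value is far below 0.05, so the null hypothesis of a zero height coefficient is rejected.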

#### Note:
 - For a refresher on p-values, refer to the correlation and regression notebook, and go through the videos and exercises below.
  https://www.khanacademy.org/math/ap-statistics/xfb5d8e68:inference-categorical-proportions/idea-significance-tests/v/p-values-and-significance-tests

### Recoding Continuous Independent Variables
- Recoding involves taking original values and mathematically converting them to other values.
- While recoding can cause you to interpret some of the results differently, the p-values and goodness-of-fit measures remain the same when you fit the same model.
- Two common coding methods are standardization and centering.

#### Standardizing the Continuous Variables
- To standardize a variable:
    - Take each observed value for a variable.
    - Subtract the variable's mean.
    - Then divide by the variable's standard deviation.
- When we standardize a variable, the coded value denotes where the observation falls in the distribution of values.
- It indicates the number of standard deviations above or below the variable's mean.
- The sign indicates whether the observation is above or below the mean, and the number indicates the number of standard deviations.
- A standardized value of 0 indicates that the value is precisely equal to the mean.
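
The steps above can be sketched with the five height values shown by `head()` earlier in this notebook. Using the sample standard deviation (`ddof=1`) is an assumption; some tools divide by the population standard deviation instead.

```python
import numpy as np

# First five heights (in metres) from the HeightWeight data shown earlier
heights = np.array([1.6002, 1.651, 1.651, 1.53035, 1.45415])

# Subtract the mean, then divide by the standard deviation
# (ddof=1 gives the sample standard deviation; this choice is an assumption)
z_scores = (heights - heights.mean()) / heights.std(ddof=1)

print(z_scores.round(3))
```

Positive values mark heights above the mean of these five observations and negative values mark heights below it; the standardized values themselves have a mean of 0 and a standard deviation of 1.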

#### Interpreting Standardized Coefficients
- When we fit a model using standardized independent variables, the coefficients are now standardized coefficients.
- Standardized coefficients signify the mean change of the dependent variable given a one standard deviation increase in an independent variable.

#### Why obtain Standardized Coefficients?
- Standardization puts all the variables on the same scale so we can compare the magnitude of the results.
- It is used when independent variables have entirely different units, like temperature (°C) and thickness (cm), or have no specified units at all, like a rating scale (from 1 to ..).
- Because all the variables are on a consistent scale, we can compare the standardized coefficients directly.
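
The relationship between a raw and a standardized coefficient can be illustrated with simulated data. Everything here is hypothetical (the variable `temp_c`, its distribution, and the true slope of 2.0), and `np.polyfit` stands in for an OLS fit with a single independent variable.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical data: temperature in degrees C and a response that depends on it
temp_c = rng.normal(25, 5, size=200)
y = 10 + 2.0 * temp_c + rng.normal(0, 3, size=200)

# Fit on the raw variable (np.polyfit returns slope, then intercept)
raw_slope, raw_intercept = np.polyfit(temp_c, y, deg=1)

# Fit on the standardized variable
temp_sd = temp_c.std(ddof=1)
temp_std = (temp_c - temp_c.mean()) / temp_sd
std_slope, std_intercept = np.polyfit(temp_std, y, deg=1)

print(f"raw slope: {raw_slope:.3f} per degree C")
print(f"standardized slope: {std_slope:.3f} per standard deviation")
```

The standardized slope equals the raw slope multiplied by the variable's standard deviation, so it describes the mean change in the response for a one standard deviation increase. The fit itself is unchanged; only the units of the coefficient differ.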

### Main Effects of Categorical Variables
- Categorical variables, also known as nominal variables, have values that can be put into a countable number of distinct groups based on a characteristic.
- For categorical variables we have the variable name and then the level of the variables.
- With categorical variables we are dealing with groups of data that we cannot incrementally increase, and therefore they cannot be plotted using a scatterplot.
- The levels of a variable represent groups in the data and can be plotted using boxplots or bar charts.
- Regression analysis estimates mean differences between these groups and determines if they are statistically significant.
- Including categorical variables in the regression model allows us to determine whether the differences in the graph are statistically significant.

#### Coding Categorical Variables
- To analyze categorical variables, they are converted into indicator variables, also known as dummy variables.
- They are columns of 1's and 0's that indicate the presence or absence of a characteristic.
- 1 indicates the presence while a 0 represents its absence. 
- The number of indicator variables depends on the number of categorical levels (the number of distinct categories).

- The table below shows the indicator variables created for each level of the Gender categorical variable.

| Gender | Male | Female |
|--------|------|--------| 
| Male   | 1    | 0      |
| Female | 0    | 1      |
| Female | 0    | 1      |
| Male   | 1    | 0      |

- In the table, the Gender column represents the categorical data that we enter into the worksheet.
- The value depends on the gender of the subject to which the row corresponds.
- The ***Male*** and ***Female*** columns are the ***indicator variables*** based on the Gender column.
- The Male column contains 1s for observations that correspond to males and 0s for non-males. The same applies to the Female column.
- These two columns supply completely redundant information, because one column can predict the other one perfectly.
- This is referred to as perfect **multicollinearity**, which will create an error if we enter both the indicator variables in a regression model.
- For a categorical variable we must omit one of the underlying indicator variables from the model, which becomes the **reference level**.
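
The redundancy can be seen numerically with the four rows of the Gender table above. This is a minimal sketch that assumes the model also contains a constant column, as in the OLS example earlier in this notebook.

```python
import numpy as np

# Indicator columns from the Gender table above
male = np.array([1, 0, 0, 1])
female = np.array([0, 1, 1, 0])
const = np.ones(4, dtype=int)

# Design matrix with the constant and BOTH indicators (redundant)
X_both = np.column_stack([const, male, female])
# Design matrix with the constant and ONE indicator (Female is the reference level)
X_one = np.column_stack([const, male])

rank_both = np.linalg.matrix_rank(X_both)
rank_one = np.linalg.matrix_rank(X_one)

print(f"columns: {X_both.shape[1]}, rank: {rank_both}")   # fewer independent columns than columns
print(f"columns: {X_one.shape[1]}, rank: {rank_one}")     # every column is independent
```

With both indicators, the Male and Female columns always add up to the constant column, so the three columns carry only two columns' worth of information. Dropping one indicator removes the redundancy.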

- The table below shows the variable College Major, which has 3 levels.

| College Major     | Psychology | Political Science | Statistics |
|-------------------|------------|-------------------|------------|
| Statistics        | 0          | 0                 | 1          |
| Psychology        | 1          | 0                 | 0          |
| Statistics        | 0          | 0                 | 1          |
| Political Science | 0          | 1                 | 0          |
| Psychology        | 1          | 0                 | 0          |

- In this table, **College Major** is the categorical variable and other columns are indicator variables.
- For each row there is a single value of 1, others are 0.
- As with the Gender variable, if we include all the indicator variables, we are supplying redundant information, which will result in an error in the model.
- If we take any two columns, we can always figure out the value of the third column.
- To perform regression analysis we will have to remove one indicator variable, that becomes the reference level. 
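
One way to build the indicator variables and drop a reference level is `pd.get_dummies` with `drop_first=True`. The sketch below uses the five rows of the College Major table above; the DataFrame name `majors` is hypothetical. Note that `drop_first=True` removes the first level in sorted order, which may not be the reference level you would choose by hand.

```python
import pandas as pd

# The five observations from the College Major table above
majors = pd.DataFrame({
    "College Major": [
        "Statistics", "Psychology", "Statistics",
        "Political Science", "Psychology",
    ]
})

# One indicator column per level, minus the first level in sorted order
# ("Political Science"), which becomes the reference level
indicators = pd.get_dummies(majors["College Major"], drop_first=True, dtype=int)

print(indicators)
```

Rows that belong to the reference level have 0 in every remaining indicator column, which is how the model recognizes them.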

#### Summary for Coding Categorical Variables
- For every categorical variable we must remove one level from the analysis and use it as the reference level.
- The choice of which indicator variable to remove changes what the remaining coefficients are compared against.
- This can change the coefficients and p-values; however, using a different reference level does not change the overall story or the overall statistical significance.
- If the variable has a natural baseline or a category for comparison, using that level as the reference level will make the interpretation more natural.
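
The effect of changing the reference level can be sketched with a small made-up dataset. The group labels, the response values, and the helper `fit_with_reference` are all hypothetical, and `np.linalg.lstsq` is used as a plain least-squares fit.

```python
import numpy as np

# Hypothetical response values for three groups (means: A = 11, B = 16, C = 22)
groups = np.array(["A", "A", "B", "B", "C", "C"])
y = np.array([10.0, 12.0, 15.0, 17.0, 20.0, 24.0])

def fit_with_reference(reference):
    """Fit y on a constant plus indicators for every level except `reference`."""
    levels = [g for g in ["A", "B", "C"] if g != reference]
    columns = [np.ones(len(y))] + [(groups == g).astype(float) for g in levels]
    X = np.column_stack(columns)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs, X @ coefs

coefs_a, fitted_a = fit_with_reference("A")   # constant, B vs A, C vs A
coefs_c, fitted_c = fit_with_reference("C")   # constant, A vs C, B vs C

print("reference A:", coefs_a.round(2))
print("reference C:", coefs_c.round(2))
print("same fitted values:", np.allclose(fitted_a, fitted_c))
```

In each fit the constant is the mean of the reference group and every other coefficient is a mean difference from that group. The coefficients differ between the two fits, but the fitted values are identical, which is why the overall story does not change.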

#### Interpreting the Results for Categorical Variables
- 