## Hypothesis Testing of Regression Coefficients

*(Coding along with the Udemy Course [Python for Business and Finance](https://www.udemy.com/course/complete-python-for-business-and-finance-bootcamp/) by Alexander Hagmann, lectures 357-359, __[Hypothesis Testing of Regression Coefficients](https://www.udemy.com/course/complete-python-for-business-and-finance-bootcamp/learn/lecture/17480972#overview)__.)*

Having a reasonable regression model with reasonable regression coefficients and a high coefficient of determination doesn't mean that there has to be a relationship between the independent and the dependent variable in the full population. The relationship that we've calculated for our sample could have arisen simply by chance.

The smaller the sample, the higher the probability that we get a well-fitting regression line simply by chance.
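To make this concrete, here is a small simulation sketch that is not part of the course material. It assumes x and y are drawn independently from a standard normal distribution, so any fit is pure chance, and counts how often R-squared still exceeds 0.5. The function names, the 0.5 threshold, the sample sizes, and the seed are all illustrative choices.

```python
import random


def r_squared(x, y):
    """R-squared of a simple linear regression (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


def share_of_chance_fits(n, trials=2000, threshold=0.5, seed=42):
    """Share of trials in which unrelated x and y still give R-squared > threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]  # drawn independently of x
        if r_squared(x, y) > threshold:
            hits += 1
    return hits / trials


share_small = share_of_chance_fits(n=5)
share_large = share_of_chance_fits(n=100)
print(f"n=5:   {share_small:.3f}")
print(f"n=100: {share_large:.3f}")
```

With unrelated data, the sample of 5 observations should clear the threshold noticeably more often than the sample of 100.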

__To verify whether the regression model or the independent variable is statistically significant we need to perform a hypothesis test (t-test).__

We want to test whether an independent variable explains the dependent variable, that is, whether the slope coefficient is unequal to zero.

The slope coefficient being unequal to zero is the alternative hypothesis, which we cannot prove directly.

What we can do is reject the opposite hypothesis, the null hypothesis (H0), that the slope coefficient b is equal to zero.

***Statistical test of significance (t-test):***<br>
***H<sub>0</sub>:b = 0<br>***
***H<sub>a</sub>:b != 0***

We can also test the __intercept coefficient__. This is less important than testing the slope coefficient, but depending on the specific case the intercept can be a meaningful metric too.

The null hypothesis is that the intercept a is equal to zero. The alternative hypothesis, which we actually want to show, is that the intercept is unequal to zero.

***Statistical test of significance (t-test):***<br>
***H<sub>0</sub>:a = 0<br>***
***H<sub>a</sub>:a != 0***
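For reference, the test statistic behind both tests is usually written as the estimated coefficient divided by its standard error. The notation below is standard textbook notation and not taken from the course: $\hat{a}$ and $\hat{b}$ are the estimated coefficients, $SE(\cdot)$ denotes their standard errors, and $n$ is the number of observations.

$$
t_b = \frac{\hat{b} - 0}{SE(\hat{b})}, \qquad t_a = \frac{\hat{a} - 0}{SE(\hat{a})}
$$

Under the null hypothesis and the usual OLS assumptions, each statistic follows a t-distribution with $n - 2$ degrees of freedom in a regression with one independent variable. The p-value is the probability of observing a value of $|t|$ at least this large if the null hypothesis were true.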

<img src="../assets/images/level_of_significance_p-value.png" width="80%" />

*(Screenshot taken from Alexander Hagmann's Udemy Course [Python for Business and Finance](https://www.udemy.com/course/complete-python-for-business-and-finance-bootcamp/).)*

***The important rule of thumb here is: whenever the p-value for the slope coefficient is below 1%, we should conclude that the independent variable is significant.***

### Performing a t-test with Python

We now perform a t-test with Python to verify whether the independent variable budget significantly explains the dependent variable revenue. We will also test whether the intercept is significantly different from zero.

In [1]:
import pandas as pd
from statsmodels.formula.api import ols
pd.options.display.float_format = '{:.10f}'.format

In [2]:
df = pd.read_csv("../assets/data/bud_vs_rev.csv", parse_dates = ["release_date"], index_col = "release_date")

In [3]:
df = df.loc["2016"]

In [4]:
df

| release_date | title | budget | revenue |
|---|---|---|---|
| 2016-01-01 | Jane Got a Gun | 25.0000000000 | 1.3972840000 |
| 2016-01-07 | Friend Request | 9.9000000000 | 2.4000000000 |
| 2016-01-07 | The Forest | 10.0000000000 | 40.0554390000 |
| 2016-01-07 | Wazir | 5.2000000000 | 9.2000000000 |
| 2016-01-13 | 13 Hours: The Secret Soldiers of Benghazi | 50.0000000000 | 69.4113700000 |
| ... | ... | ... | ... |
| 2016-12-23 | Resident Evil: The Final Chapter | 40.0000000000 | 312.2426260000 |
| 2016-12-23 | Railroad Tigers | 50.0000000000 | 102.2051750000 |
| 2016-12-23 | Dangal | 10.4000000000 | 310.0000000000 |
| 2016-12-25 | Live by Night | 108.0000000000 | 22.6785550000 |


In [5]:
model = ols("revenue ~ budget", data = df) # creating the linear regression model

In [6]:
results = model.fit() # fitting the ols regression model

In [7]:
results.params

Intercept   -9.4492150539
budget       3.3494240988
dtype: float64
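The two parameters define the fitted line revenue = intercept + slope * budget. As a minimal sketch, the coefficients can be copied by hand from the output above and used for a prediction. The helper `predict_revenue` and the example budget of 100 are illustrative assumptions, not part of the course code.

```python
# Coefficients copied from the results.params output above
intercept = -9.4492150539
slope = 3.3494240988


def predict_revenue(budget):
    """Predicted revenue for a given budget (same units as the dataset)."""
    return intercept + slope * budget


print(predict_revenue(100))
```

For a budget of 100, this gives a predicted revenue of about 325.49. In a live session, the fitted statsmodels results object also offers a `predict` method, which avoids copying numbers by hand.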

In [8]:
results.rsquared # the fit of our model

np.float64(0.6402124115463808)

In [9]:
results.tvalues # t-statistics of the intercept and the slope coefficient

Intercept   -0.8789225048
budget      20.3618351223
dtype: float64

In [10]:
results.pvalues # getting the p-values of the test

Intercept   0.3803488009
budget      0.0000000000
dtype: float64
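The rule of thumb stated earlier can be applied mechanically to these p-values. The sketch below copies the values by hand from the output above. The names `ALPHA` and `is_significant` are illustrative and not a statsmodels API.

```python
ALPHA = 0.01  # 1% level of significance, as in the rule of thumb

# p-values copied from the results.pvalues output above.
# The budget p-value is only displayed as 0.0000000000 because of the
# 10-decimal float format; the true value is tiny but not exactly zero.
p_values = {"Intercept": 0.3803488009, "budget": 0.0}


def is_significant(p_value, alpha=ALPHA):
    """True if the null hypothesis (coefficient = 0) is rejected at level alpha."""
    return p_value < alpha


for name, p in p_values.items():
    verdict = "reject H0" if is_significant(p) else "fail to reject H0"
    print(f"{name}: p = {p:.4f} -> {verdict}")
```

The loop reports "fail to reject H0" for the intercept and "reject H0" for budget.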

For the slope coefficient of the variable budget, the p-value is very close to zero (displayed as 0.0000000000), so we reject the null hypothesis that the slope coefficient is equal to zero.

It is very likely that the slope coefficient is unequal to zero, and whenever the slope coefficient is unequal to zero with statistical significance, we should conclude that the independent variable is significant.

This means the independent variable significantly explains the variation in the dependent variable.

So in our case, the budget of a movie significantly explains and influences its revenue.

For the intercept, the p-value of about 0.38 is far above 1%, so we cannot reject the null hypothesis that the intercept is equal to zero.

### Testing with statsmodels – interpreting the Summary Table

In [12]:
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | revenue | R-squared: | 0.64 |
| Model: | OLS | Adj. R-squared: | 0.639 |
| Method: | Least Squares | F-statistic: | 414.6 |
| Date: | Wed, 11 Dec 2024 | Prob (F-statistic): | 1.24e-53 |
| Time: | 12:50:37 | Log-Likelihood: | -1475.3 |
| No. Observations: | 235 | AIC: | 2955.0 |
| Df Residuals: | 233 | BIC: | 2961.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -9.4492 | 10.751 | -0.879 | 0.380 | -30.631 | 11.732 |
| budget | 3.3494 | 0.164 | 20.362 | 0.000 | 3.025 | 3.674 |

| | | | |
|---|---|---|---|
| Omnibus: | 95.272 | Durbin-Watson: | 2.084 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 472.458 |
| Skew: | 1.547 | Prob(JB): | 2.55e-103 |
| Kurtosis: | 9.22 | Cond. No. | 83.2 |
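Several entries of the summary table can be cross-checked against each other with plain arithmetic. The sketch below copies the numbers by hand from the outputs in this notebook and uses the standard formulas for a regression with one independent variable. The variable names are illustrative, and because the standard errors are only shown to three decimals, the recomputed t-values match the table only approximately.

```python
# Values copied from the outputs in this notebook
n_obs = 235
r_squared = 0.6402124115463808                    # results.rsquared
coef = {"Intercept": -9.4492, "budget": 3.3494}   # "coef" column
std_err = {"Intercept": 10.751, "budget": 0.164}  # "std err" column (rounded)

# "t" column: coefficient divided by its standard error
t_values = {name: coef[name] / std_err[name] for name in coef}

# With one independent variable, the F-statistic can be recovered from R-squared
df_resid = n_obs - 2
f_stat = r_squared / (1 - r_squared) * df_resid

# Adjusted R-squared penalises for the number of estimated parameters
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(t_values)
print(round(f_stat, 1), round(adj_r_squared, 3))
```

In a regression with one independent variable, the F-statistic also equals the squared t-value of the slope, so Prob (F-statistic) is the unrounded p-value of budget.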


In [13]:
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                revenue   R-squared:                       0.640
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     414.6
Date:                Wed, 11 Dec 2024   Prob (F-statistic):           1.24e-53
Time:                        12:50:49   Log-Likelihood:                -1475.3
No. Observations:                 235   AIC:                             2955.
Df Residuals:                     233   BIC:                             2961.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -9.4492     10.751     -0.879      0.3