# Diagnostic check of a fitted regression model 

Apart from the $R^2$ statistic, there are other statistics that you need to look at in order to judge how well the model fits the data. We will discuss some commonly used ones – Residual Standard Errors, $p$-values, and $F$-statistics.

### Residual Standard Errors (RSE)

RSE is a common statistic used to assess the accuracy of the values predicted by a model. It is an estimate of the standard deviation of the error term, `res`; its square, $RSE^2$, estimates the variance of the error term. For a simple linear regression model, RSE is defined by:
$$  RSE^2 = \frac{SSE}{n-2} = \frac1{n-2} \sum_{i=1}^n  \Bigl(\text{yact}_i - \text{ypred}_i \Bigr)^2.
$$

In general, 

$$  RSE^2 = \frac{SSE}{n-p-1} = \frac1{n-p-1} \sum_{i=1}^n  \Bigl(\text{yact}_i - \text{ypred}_i \Bigr)^2.
$$
 
where $p$ is the number of predictor variables in the model. Setting $p = 1$ recovers the simple linear regression formula above.

 
A **multiple linear regression** model is a linear regression model with multiple predictors, written as  
$$  Y_e = \alpha +\beta_1 X_1 +\cdots +\beta_p X_p.
$$

As you see, the parameters and predictors are subscripted from 1 up to the number of predictors $p$. 

In multiple regression, the value of RSE generally decreases as we add variables that are more significant predictors of the output variable.
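To make the general formula concrete, here is a minimal sketch of the RSE computation for a model with $p = 2$ predictors. The data, the coefficients, and the names `X_multi`, `y_multi`, `design` and `rse_multi` are made up for illustration and are not part of the dataset used in this course; the fit uses NumPy's least-squares solver rather than the closed-form expressions from the previous step.

```python
import numpy as np

# Hypothetical simulated data: two predictors with known coefficients
rng = np.random.default_rng(0)
n, p = 100, 2
X_multi = rng.normal(size=(n, p))
noise = 0.5 * rng.normal(size=n)
y_multi = 2 + 0.3 * X_multi[:, 0] - 0.7 * X_multi[:, 1] + noise

# Prepend a column of ones so the first coefficient plays the role of alpha
design = np.column_stack([np.ones(n), X_multi])
coef, *_ = np.linalg.lstsq(design, y_multi, rcond=None)

# Sum of squared errors and RSE with n - p - 1 degrees of freedom
y_fit = design @ coef
sse_multi = np.sum((y_multi - y_fit) ** 2)
rse_multi = np.sqrt(sse_multi / (n - p - 1))
print(f'RSE = {rse_multi:.4f}')
```

Because the noise in this sketch was generated with a standard deviation of 0.5, the printed RSE should land near that value.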

Using our simulated data from the previous steps, the following code snippet shows how the RSE for a model can be calculated:

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Generate same data as in previous step
np.random.seed(0)
X = 2.5 * np.random.randn(100) + 1.5
res = 0.5 * np.random.randn(100)
yact = 2 + 0.3 * X + res
df = pd.DataFrame(
    {'X': X,
     'yact': yact}
)

# Calculate beta and alpha as in previous step
xmean = np.mean(X)
ymean = np.mean(yact)
df['xycov'] = (df['X'] - xmean) * (df['yact'] - ymean)
df['xvar'] = (df['X'] - xmean)**2
beta = df['xycov'].sum() / df['xvar'].sum()
alpha = ymean - (beta * xmean)

# Store predictions as in previous step
df['ypred'] = alpha + beta * df['X']

# Show first five rows of dataframe
df.head()

|   | X | yact | xycov | xvar | ypred |
|---|---|---|---|---|---|
| 0 | 5.910131 | 4.714615 | 9.282815 | 18.152805 | 3.911783 |
| 1 | 2.500393 | 2.076238 | -0.391082 | 0.723985 | 2.810643 |
| 2 | 3.946845 | 2.548811 | 0.029747 | 5.277702 | 3.27776 |
| 3 | 7.102233 | 4.615368 | 11.338948 | 29.732079 | 4.29676 |
| 4 | 6.168895 | 3.264107 | 3.291209 | 20.42475 | 3.995348 |


In [2]:
# Calculate SSE
df['SSE'] = (df['yact'] - df['ypred'])**2
SSE = df['SSE'].sum()

# Calculate RSE
n = len(df)                    # n = 100 sample points
RSE = np.sqrt(SSE / (n - 2))   # n - 2 degrees of freedom for simple regression
print(f'RSE = {RSE}')

RSE = 0.5193136792898965


The value of `RSE` comes out to approximately 0.52.

As you might have guessed, the smaller the residual standard error, the better the model fits the data.

The benchmark to compare this to is the mean of the actual values, `yact`. As shown previously, this value is `ymean = 2.54`. In plain English, this means we observe an error of 0.52 relative to a mean of 2.54 – approximately 20.48%.



In [3]:
error = RSE / ymean
print(f'Mean Y = {ymean}.')
print(f'Error = {error}.')

Mean Y = 2.5358624970247825.
Error = 0.20478779109639608.


### p-values

The calculated values of $\alpha$ and $\beta$ are estimates, not exact values. Whether they are significant or not needs to be tested using a **hypothesis test**.

In the equation $Y = \alpha + \beta X$, if we set $\beta=0$, there will be no relation between $Y$ and $X$. Therefore, the hypothesis test checks whether the value of $\beta$ is non-zero.

$$\begin{align*} \text{Null hypothesis}~  H_0~:~  \beta=0, & \quad \text{versus} \\
\text{Alternative hypothesis}~ H_1~:~ \beta\ne 0.&  \end{align*}  $$
 

Whenever a regression task is performed and $\beta$ is calculated, there will be an accompanying **p-value** corresponding to this hypothesis test. We will not go through how this is calculated in this course (you can learn more [here](https://www.dummies.com/education/math/statistics/how-to-determine-a-p-value-when-testing-a-null-hypothesis/)), since it is calculated automatically by ready-made methods in Python. 

If the p-value is less than a chosen **significance level** (e.g. 0.05) then the null hypothesis that $\beta = 0$ is rejected and $\beta$ is said to be **significant and non-zero**.

In the case of multiple linear regression, the p-value associated with each $\beta_k$   can be used to weed out insignificant predictors from the model. The higher the p-value for $\beta_k$, the less significant $X_k$  is to the model.
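As a hedged illustration of one such ready-made method, the sketch below uses `scipy.stats.linregress` on the simulated data from this section. SciPy is an assumption here, since the course code so far only imports pandas and NumPy; other libraries report the same kind of p-value.

```python
import numpy as np
from scipy import stats

# Regenerate the simulated data used in this section
np.random.seed(0)
X = 2.5 * np.random.randn(100) + 1.5
res = 0.5 * np.random.randn(100)
yact = 2 + 0.3 * X + res

# linregress fits yact = intercept + slope * X and reports the two-sided
# p-value for the null hypothesis that the slope (beta) is zero
fit = stats.linregress(X, yact)
print(f'beta = {fit.slope:.4f}, p-value = {fit.pvalue:.3g}')

significance_level = 0.05  # chosen threshold, as in the text
if fit.pvalue < significance_level:
    print('Reject H0: beta is significantly different from zero.')
else:
    print('Fail to reject H0.')
```

Since the data were generated with a true slope of 0.3 and fairly small noise, the p-value here should be far below 0.05.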

### F-statistics

In a multiple regression model, apart from testing the significance of individual variables by checking their p-values, it is also necessary to check whether the predictors are significant as a group. This can be done using the following hypothesis test:

$$\begin{align*} \text{Null hypothesis}~  H_0~:~ & \beta_1=\beta_2=\cdots=\beta_p=0, \quad \text{versus} \\
\text{Alternative hypothesis}~ H_1~:~& \text{at least one of the}~\beta_k~\text{is non-zero}. \end{align*}  $$
 

The statistic that is used to test this hypothesis is called the **F-statistic** and is defined as follows:

$$  F\text{-statistic} = \text{Fisher statistic}=  \frac{ (SST-SSE)/p}{ SSE/(n-p-1)}
$$

where $n$ = number of rows (sample points) in the dataset and $p$ = number of predictor variables in the model.

There is a $p$-value that is associated with this $F$-statistic. If the $p$-value is smaller than the chosen significance level, the null hypothesis can be rejected.
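The formula can be checked on the simulated data from this section, where $p = 1$. This is a minimal sketch: it takes SST to be the total sum of squares $\sum_i (\text{yact}_i - \text{ymean})^2$, and it assumes SciPy is available to turn the statistic into a p-value through the $F$ distribution with $p$ and $n-p-1$ degrees of freedom (`f_pvalue` is a name chosen for illustration).

```python
import numpy as np
from scipy import stats

# Regenerate the simulated data and the fitted simple regression
np.random.seed(0)
X = 2.5 * np.random.randn(100) + 1.5
res = 0.5 * np.random.randn(100)
yact = 2 + 0.3 * X + res

n, p = len(X), 1
beta = np.sum((X - X.mean()) * (yact - yact.mean())) / np.sum((X - X.mean()) ** 2)
alpha = yact.mean() - beta * X.mean()
ypred = alpha + beta * X

# Total and residual sums of squares
SST = np.sum((yact - yact.mean()) ** 2)
SSE = np.sum((yact - ypred) ** 2)

# F-statistic as defined above, and its upper-tail p-value
F = ((SST - SSE) / p) / (SSE / (n - p - 1))
f_pvalue = stats.f.sf(F, p, n - p - 1)
print(f'F = {F:.2f}, p-value = {f_pvalue:.3g}')
```

With a single predictor, the $F$-statistic equals the square of the $t$-statistic for $\beta$, so the two tests agree in this case; the $F$-test only adds information once there are several predictors.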

It is important to look at the F-statistic because:

- p-values are about individual relationships between predictors and the outcome variable. However, one predictor's relationship with the output might be impacted by the presence of other variables.
- When the number of predictors in the model is very large, some of the individual p-values may come out very small purely by chance, even if all the $\beta_i$ are in fact zero, so we might incorrectly conclude that there is a relationship between those predictors and the outcome. The F-statistic accounts for the number of predictors and does not suffer from this problem.
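The second point can be explored with a small simulation. The sketch below is an illustration under made-up assumptions: 50 predictors of pure noise, an outcome that is unrelated to all of them, and hypothetical names such as `X_noise` and `n_small`. The individual p-values come from the usual $t$-test on each coefficient, computed by hand with NumPy and SciPy.

```python
import numpy as np
from scipy import stats

# Hypothetical data: the outcome is unrelated to every predictor
rng = np.random.default_rng(1)
n, p = 200, 50
X_noise = rng.normal(size=(n, p))
y_noise = rng.normal(size=n)

# Least-squares fit with an intercept column
design = np.column_stack([np.ones(n), X_noise])
coef, *_ = np.linalg.lstsq(design, y_noise, rcond=None)
dof = n - p - 1
sse = np.sum((y_noise - design @ coef) ** 2)
sst = np.sum((y_noise - y_noise.mean()) ** 2)

# Individual two-sided p-values from t-statistics (intercept dropped)
cov = (sse / dof) * np.linalg.inv(design.T @ design)
t_stats = coef / np.sqrt(np.diag(cov))
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)[1:]
n_small = int(np.sum(p_values < 0.05))

# Overall F-test for all predictors as a group
F = ((sst - sse) / p) / (sse / dof)
f_pvalue = stats.f.sf(F, p, dof)

print(f'{n_small} of {p} individual p-values are below 0.05')
print(f'F = {F:.2f}, overall p-value = {f_pvalue:.3f}')
```

When the null hypothesis is true for every coefficient, about 5% of the individual p-values are expected to fall below 0.05 by chance, while the overall $F$-test falls below 0.05 in only about 5% of such simulations. Changing the seed shows how much the individual count varies from run to run.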