# Day 04: Multiple Linear Regression
Linear regression using multiple variables - as opposed to just one - is **multiple linear regression**.

But before we get into that, let's look at some ways to evaluate a model's performance.

### Evaluation Methods
In addition to MSE, another way to evaluate a model's performance is the model's RSS (Residual Sum of Squares). It is the sum of squared residuals (errors). We must aim to minimize RSS.

The relationship between MSE and RSS is:\
MSE = RSS / n, where n is the number of observations
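
As a quick check of this relationship, here is a minimal sketch using NumPy. The `y_true` and `y_pred` arrays are made-up values for illustration only; they are not taken from `insurance.csv`.

```python
import numpy as np

# Hypothetical observed and predicted values (not from insurance.csv)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

# Residuals are the differences between observed and predicted values
residuals = y_true - y_pred

# RSS: sum of the squared residuals
rss = np.sum(residuals ** 2)

# MSE: RSS divided by the number of observations
mse = rss / len(y_true)

print(f"RSS = {rss}, MSE = {mse}")
```

Because MSE is just RSS scaled by a constant, the model that minimizes one also minimizes the other on the same data.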

Let's go back to the [statsmodels output](https://github.com/leenoah390/medical-pred/blob/main/Day03-single_linreg_and_MSE.ipynb) from the previous day's notebook.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

In [2]:
# Load data
insurance = pd.read_csv('insurance.csv')

# Create model using BMI
sm_model = smf.ols('charges~bmi', insurance).fit()
sm_model.summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1192.9372 | 1664.802 | 0.717 | 0.474 | -2072.974 | 4458.849 |
| bmi | 393.8730 | 53.251 | 7.397 | 0.000 | 289.409 | 498.337 |


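The above output shows the coefficients, standard errors, t-statistics, p-values, and 95% confidence intervals (bounded by the 2.5th and 97.5th percentiles).
1. **Coefficients** | coef\
As stated in the previous day, a *coefficient* measures the slope: the expected change in the dependent variable for a one-unit change in the independent variable. Here, each additional unit of BMI is associated with about 393.87 more in charges.
2. **Standard Error** | std err\
The standard error is the estimated standard deviation of a coefficient estimate (the square root of its variance). The smaller it is, the more precisely the coefficient has been estimated.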

Something that we haven't gone over yet is hypothesis testing. In determining the relationship between the independent variable X and the dependent variable Y, we create hypotheses to test whether there is truly a relationship between X and Y. We call these:

Null Hypothesis H0\
There is no relationship between X and Y.

Alternative Hypothesis H1\
There is a relationship between X and Y.

3. **t-statistic** | t\
The t-statistic is the coefficient divided by its standard error (for bmi, 393.8730 / 53.251 ≈ 7.397). It measures how many standard errors the estimate is away from zero. I admit this is not a measure that I was ever familiar with, as I instead found more use in the p-value.
4. **P-value** | P>|t|\
The p-value is the probability of seeing a t-statistic at least this extreme if the null hypothesis were true, and it determines whether we reject the null hypothesis or not. We compare the p-value with a pre-determined significance level (usually 0.05). If the reported p-value is less than that level, we reject the null hypothesis (AKA, we have found evidence of a relationship between X and Y). If the p-value is above that level, we fail to reject the null hypothesis (AKA, there is no indication of a relationship between X and Y beyond random chance). Note that the 0.000 reported for bmi is a rounded value, not an exact zero.
5. **95% Confidence Interval** | [0.025, 0.975]\
The 95% confidence interval on the coefficients. Roughly speaking, we can be 95% confident that the actual value of each coefficient is within this range. Notice that the coef is the midpoint of the confidence interval.
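
To make the link between these columns concrete, here is a rough sketch that rebuilds the t-statistic, an approximate p-value, and an approximate 95% confidence interval from the `coef` and `std err` reported for bmi. It assumes a normal approximation (via the standard library's `NormalDist`) instead of the t-distribution that statsmodels uses, so the interval bounds come out close to, but not exactly equal to, the ones in the table.

```python
from statistics import NormalDist

# Values copied from the bmi row of the coefficient table
coef, std_err = 393.8730, 53.251

# t-statistic: how many standard errors the coefficient is from zero
t_stat = coef / std_err

# Assumption: approximate the t-distribution with a standard normal,
# which is reasonable here because there are over a thousand observations
z = NormalDist()

# Two-sided p-value under the null hypothesis (true coefficient is zero)
p_approx = 2 * (1 - z.cdf(abs(t_stat)))

# 95% interval: coef plus or minus the critical value times the standard error
z_crit = z.inv_cdf(0.975)
ci_low = coef - z_crit * std_err
ci_high = coef + z_crit * std_err

# The coefficient sits at the midpoint of the interval
midpoint = (ci_low + ci_high) / 2

print(f"t = {t_stat:.3f}, approx. p = {p_approx:.2e}")
print(f"approx. 95% CI = [{ci_low:.3f}, {ci_high:.3f}], midpoint = {midpoint:.4f}")
```

The approximate p-value is far below 0.05, which is consistent with rejecting the null hypothesis for bmi.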

In [6]:
sm_model.summary().tables[0]

| | | | |
|---|---|---|---|
| Dep. Variable: | charges | R-squared: | 0.039 |
| Model: | OLS | Adj. R-squared: | 0.039 |
| Method: | Least Squares | F-statistic: | 54.71 |
| Date: | Sun, 12 Oct 2025 | Prob (F-statistic): | 2.46e-13 |
| Time: | 16:26:55 | Log-Likelihood: | -14451.0 |
| No. Observations: | 1338 | AIC: | 28910.0 |
| Df Residuals: | 1336 | BIC: | 28920.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |


Two more values that help in evaluating a model are the RSE (Residual Standard Error) and R^2 value.

The RSE is an estimate of the standard deviation of the errors. The smaller the RSE, the better the model fits the data.\
RSE = sqrt(RSS / DF)

where DF (degrees of freedom) = n - 2 for a model with one predictor (this is the Df Residuals value, 1338 - 2 = 1336, in the output above)\
and n = number of observations
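
The RSE formula can be sketched in a few lines of NumPy. The data below is randomly generated for illustration (the slope, intercept, and noise level are arbitrary assumptions, not values from `insurance.csv`), and the line is fit with `np.polyfit` rather than statsmodels to keep the example self-contained.

```python
import numpy as np

# Hypothetical data: one predictor plus random noise (not insurance.csv)
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(15, 45, size=n)
y = 1000 + 400 * x + rng.normal(0, 500, size=n)

# Fit a straight line; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)

# Residual Sum of Squares
residuals = y - (intercept + slope * x)
rss = np.sum(residuals ** 2)

# One predictor plus an intercept uses up two degrees of freedom
df = n - 2

# Residual Standard Error
rse = np.sqrt(rss / df)

# For comparison: MSE divides by n instead of DF
mse = rss / n

print(f"RSS = {rss:.1f}, DF = {df}, RSE = {rse:.1f}, sqrt(MSE) = {np.sqrt(mse):.1f}")
```

With a fitted statsmodels OLS result such as `sm_model`, the same quantities should be available as `sm_model.ssr` (RSS), `sm_model.df_resid` (DF), and `np.sqrt(sm_model.mse_resid)` (RSE).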