# Day 04: Multi-Linear Regression
Linear regression using multiple variables - as opposed to just one - is **multiple linear regression**.

But before we get into that, let's look at some ways to evaluate a model's performance.

### Evaluation Methods
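In addition to MSE, another way to evaluate a model's performance is the model's RSS (Residual Sum of Squares). It is the sum of the squared residuals (errors), where a residual is the difference between an observed value and the model's prediction. The smaller the RSS, the closer the fit, so we aim to minimize it.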

The relationship between MSE and RSS is:\
MSE = RSS / number of observations
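
As a minimal sketch of that relationship, the snippet below computes both quantities with NumPy. The `y_true` and `y_pred` arrays are made-up values for illustration only; they are not taken from the insurance data.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.0])

# A residual is the gap between an observation and its prediction
residuals = y_true - y_pred

# RSS: sum of the squared residuals
rss = float(np.sum(residuals ** 2))

# MSE: RSS divided by the number of observations
mse = rss / len(y_true)

print(f"RSS = {rss}, MSE = {mse}")
```

Because MSE is just RSS scaled by a constant, a model that minimizes one also minimizes the other on the same data.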

Let's go back to the previous day's file's [statsmodels output](https://github.com/leenoah390/medical-pred/blob/main/Day03-single_linreg_and_MSE.ipynb).

In [9]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression

In [2]:
# Load data
insurance = pd.read_csv('insurance.csv')

# Create model using BMI
sm_model = smf.ols('charges~bmi', insurance).fit()
sm_model.summary().tables[1]

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 1192.9372 | 1664.802 | 0.717 | 0.474 | -2072.974 | 4458.849 |
| bmi | 393.8730 | 53.251 | 7.397 | 0.000 | 289.409 | 498.337 |


The above output shows the coefficients, standard errors, t-statistics, p-values, and 95% confidence intervals (bounded by the 2.5th and 97.5th percentiles).\
1. **Coefficients** | coef\
As stated in the previous day, a *coefficient* measures the slope between an independent variable and the dependent variable: the expected change in the response for a one-unit increase in that variable. The larger the coefficient, the more the response changes per unit of that variable.\
`bmi's coefficient of 393.873 means that for every one-unit increase in a patient's BMI, their predicted charge increases by about $393.87.`
2. **Standard Error** | std err\
The standard error is the estimated standard deviation of a coefficient estimate, i.e. how much the estimate would vary from sample to sample. The smaller it is relative to the coefficient, the more precise the estimate.

Something that we haven't gone over yet is hypothesis testing. In determining the relationship between the independent variable X and the dependent variable Y, we create hypotheses to test whether there is truly a relationship between X and Y. We call these:

Null Hypothesis H0\
There is no relationship between X and Y.

Alternative Hypothesis H1\
There is a relationship between X and Y.

3. **t-statistic** | t\
The test statistic for the hypothesis test: the coefficient divided by its standard error (for bmi, 393.8730 / 53.251 ≈ 7.397). The further it is from 0, the stronger the evidence against the null hypothesis. I admit this is not a measure that I was ever familiar with, as I instead found more use in the p-value.
4. **P-value** | P>|t|\
The p-value determines whether we reject the null hypothesis or not. We compare the p-value with a pre-determined significance level (usually 0.05). If the reported p-value is less than that level, then we reject the null hypothesis (AKA we have found evidence of a relationship between X and Y). If the p-value is above that level, then we fail to reject the null hypothesis (AKA there is no indication of a relationship between X and Y beyond random chance).\
`bmi has a p-value that rounds to 0.000, which is strong evidence of a relationship between it and charges.`
5. **95% Confidence Interval** | [0.025, 0.975]\
The 95% confidence interval on the coefficients. Loosely, we can be 95% confident that the actual value of a coefficient is within this range. Notice that the value of coef is the midpoint of the two interval bounds.
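
To make the t-statistic, p-value, and confidence interval less abstract, here is a rough sketch that recomputes them for `bmi` from the reported coefficient and standard error alone. It assumes a normal approximation to the t-distribution, which is reasonable for a large sample like this one but is not what statsmodels does internally, so the results only match the table approximately.

```python
import math

# Values copied from the bmi row of the coefficient table
coef = 393.8730
std_err = 53.251

# t-statistic: the coefficient measured in units of its standard error
t_stat = coef / std_err

# Two-sided p-value under a normal approximation (an assumption;
# statsmodels uses the exact t-distribution instead)
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

# Approximate 95% confidence interval, using 1.96 as the critical value
# (again an approximation of the exact t-distribution quantile)
crit = 1.96
ci_lower = coef - crit * std_err
ci_upper = coef + crit * std_err

print(f"t = {t_stat:.3f}")
print(f"p = {p_value:.2e}")
print(f"95% CI = [{ci_lower:.3f}, {ci_upper:.3f}]")
```

The p-value comes out tiny but not literally zero, which is why the table's `0.000` is best read as "smaller than the table can display".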

In [6]:
sm_model.summary().tables[0]

| | | | |
|---|---|---|---|
| Dep. Variable: | charges | R-squared: | 0.039 |
| Model: | OLS | Adj. R-squared: | 0.039 |
| Method: | Least Squares | F-statistic: | 54.71 |
| Date: | Sun, 12 Oct 2025 | Prob (F-statistic): | 2.46e-13 |
| Time: | 16:26:55 | Log-Likelihood: | -14451.0 |
| No. Observations: | 1338 | AIC: | 28910.0 |
| Df Residuals: | 1336 | BIC: | 28920.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |


Two more values that help in evaluating a model are the RSE (Residual Standard Error) and R^2 value.

**RSE**\
The RSE is the estimate of the standard deviation of the errors. The smaller the RSE, the better the model fits the data.\
RSE = sqrt(RSS / DF)

where DF (degrees of freedom) = n - 2 for a model with a single variable (in general, n - p - 1 for a model with p variables)\
and n = number of observations
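
A small sketch of the RSE calculation for a single-variable model follows. The `x` and `y` arrays are hypothetical values chosen for illustration, and `np.polyfit` stands in for the model fit; it is not the insurance model.

```python
import numpy as np

# Hypothetical data for illustration only (not the insurance data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a straight line y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept

rss = np.sum((y - y_pred) ** 2)  # residual sum of squares
df = len(y) - 2                  # n - 2: slope and intercept are both estimated
rse = np.sqrt(rss / df)          # residual standard error

print(f"RSS = {rss:.4f}, DF = {df}, RSE = {rse:.4f}")
```

The RSE is in the same units as the response, so it can be read as a typical size of the model's prediction error.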

**R^2**\
The R^2 measures how well the model fits the data by way of the proportion of variance in the response that it explains. It acts like a percentage: a value of 1 means that the model fits the data perfectly, while a value of 0 means that the model failed to explain any variance and thus does not fit the data well at all.\
`The R-squared value of 0.039 shows that our current model of using only a patient's BMI explains only about 4% of the variance in charges, so it does not strongly fit the data.` Perhaps including more variables will help?
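
For reference, R^2 can be computed as 1 - RSS / TSS, where TSS (total sum of squares) is the sum of squared differences between each observation and the mean of the response. The helper below is a hypothetical illustration of that formula on made-up numbers, showing the two extreme cases.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return R^2 = 1 - RSS / TSS for the given predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - rss / tss

# Made-up response values for illustration
y = np.array([1.0, 2.0, 3.0, 4.0])

# Predictions that match the data exactly explain all of the variance
perfect = r_squared(y, y)                           # 1.0

# Always predicting the mean explains none of the variance
baseline = r_squared(y, np.full_like(y, y.mean()))  # 0.0

print(perfect, baseline)
```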

## Multi-Linear Regression
Ok, let's look at the data again:

In [7]:
insurance.head()

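| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |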


For simplicity, I will only use the numerical variables: `age`, `bmi`, and `children`. Like last time, I will demonstrate using both sklearn and statsmodels.

In [22]:
# sklearn
# As before, pass the features and target to sklearn as NumPy arrays
X = insurance[['age', 'bmi', 'children']].values
y = insurance['charges'].values

multireg_sk = LinearRegression().fit(X,y)
print(multireg_sk.coef_)
print(multireg_sk.intercept_)

[239.99447429 332.0833645  542.86465225]
-6916.243347787033


Sklearn outputs the coefficients in the order of the given variables, so `age` ≈ 239.99, `bmi` ≈ 332.08, `children` ≈ 542.86.

The formula would be:\
f(X) = -6916.243347787033 + 239.99447429(age of patient) + 332.0833645(patient's BMI) + 542.86465225(number of children)
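
To see the formula in action, here is a short sketch that plugs one patient into it. The intercept and coefficients are copied from the sklearn output; the `predict_charge` helper and the example patient are hypothetical and exist only for illustration.

```python
# Intercept and coefficients copied from the sklearn output
INTERCEPT = -6916.243347787033
COEFS = {"age": 239.99447429, "bmi": 332.0833645, "children": 542.86465225}

def predict_charge(age, bmi, children):
    """Apply the fitted linear formula to a single patient."""
    return (INTERCEPT
            + COEFS["age"] * age
            + COEFS["bmi"] * bmi
            + COEFS["children"] * children)

# Hypothetical patient, for illustration only
estimate = predict_charge(age=40, bmi=30.0, children=2)
print(f"Predicted charge: ${estimate:,.2f}")
```

Holding `age` and `bmi` fixed, each additional child raises the prediction by the `children` coefficient, which is how every coefficient in a multiple regression is read: the change in the response per unit of that variable with the others held constant.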

In [25]:
multireg_sm = smf.ols('charges~age + bmi + children', insurance).fit()
multireg_sm.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | charges | R-squared: | 0.12 |
| Model: | OLS | Adj. R-squared: | 0.118 |
| Method: | Least Squares | F-statistic: | 60.69 |
| Date: | Sun, 12 Oct 2025 | Prob (F-statistic): | 8.8e-37 |
| Time: | 17:20:45 | Log-Likelihood: | -14392.0 |
| No. Observations: | 1338 | AIC: | 28790.0 |
| Df Residuals: | 1334 | BIC: | 28810.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -6916.2433 | 1757.480 | -3.935 | 0.000 | -1.04e+04 | -3468.518 |
| age | 239.9945 | 22.289 | 10.767 | 0.000 | 196.269 | 283.720 |
| bmi | 332.0834 | 51.310 | 6.472 | 0.000 | 231.425 | 432.741 |
| children | 542.8647 | 258.241 | 2.102 | 0.036 | 36.261 | 1049.468 |

| | | | |
|---|---|---|---|
| Omnibus: | 325.395 | Durbin-Watson: | 2.012 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 603.372 |
| Skew: | 1.52 | Prob(JB): | 9.54e-132 |
| Kurtosis: | 4.255 | Cond. No. | 290.0 |
