# Day 05: More Linear Regression
Review of last time:

* Linear regression is about creating a linear model to predict Y from variable(s) X
    * Involves getting estimates for coefficients Bn
    * Model is evaluated by residuals e
* Standard Error (SE) is the standard deviation of the estimate of Bn (the square root of its variance)
* 95% Confidence Interval ≈ Bn ± 2(SE)
* Hypothesis Testing:
    * Assess the evidence against the null hypothesis (Y cannot be explained by X)
    * Reject the null hypothesis if the p-value is below 0.05; otherwise, fail to reject it
* Smaller p-value means stronger evidence against the null hypothesis (not necessarily a stronger relationship); larger p-value means the result is more consistent with random chance
* RSE (Residual Standard Error) measures lack of fit (smaller is better)
* R Squared measures the proportion of the variation in Y that can be explained by the regression
    * 1 = all variation can be explained
    * 0 = no variation can be explained
* R Squared = 1 - RSS/TSS
    * RSS (Residual Sum of Squares) = SUM of squared residuals (squared difference between actual and predicted y)
    * TSS (Total Sum of Squares) = SUM of squared variation (squared difference between actual y and mean of y)
    * We want to minimize RSS
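
The quantities in this review can be worked out by hand on a small example. The sketch below uses made-up toy data (not `insurance.csv`) and plain Python, fitting a one-variable model with the usual least-squares formulas and then computing RSS, TSS, R Squared, RSE, the SE of the slope, and the approximate 95% confidence interval. The variable names are illustrative choices, not part of any library.

```python
# Toy data, invented for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares estimates for the model y = b0 + b1 * x
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_mean - b1 * x_mean

# Residuals: actual y minus predicted y
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

rss = sum(e ** 2 for e in residuals)          # Residual Sum of Squares
tss = sum((yi - y_mean) ** 2 for yi in y)     # Total Sum of Squares
r_squared = 1 - rss / tss                     # share of variation explained
rse = (rss / (n - 2)) ** 0.5                  # Residual Standard Error (2 estimated coefficients)

# Standard error of the slope and the approximate 95% confidence interval
se_b1 = rse / sxx ** 0.5
ci_b1 = (b1 - 2 * se_b1, b1 + 2 * se_b1)

print("b0, b1:", b0, b1)
print("RSS, TSS:", rss, tss)
print("R squared:", r_squared)
print("RSE:", rse)
print("95% CI for b1:", ci_b1)
```

Because the toy data lie almost on a straight line, RSS is tiny compared to TSS and R Squared comes out close to 1.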

## Choosing a subset
It is possible that not all variables are good for predicting. Maybe choosing particular ones (a subset) is better than including all?

In [1]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

In [2]:
insurance = pd.read_csv('insurance.csv')

# All numerical variables
insurance_all = smf.ols('charges ~ age + bmi + children', insurance).fit()
insurance_all.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | charges | R-squared: | 0.12 |
| Model: | OLS | Adj. R-squared: | 0.118 |
| Method: | Least Squares | F-statistic: | 60.69 |
| Date: | Mon, 13 Oct 2025 | Prob (F-statistic): | 8.8e-37 |
| Time: | 22:02:33 | Log-Likelihood: | -14392.0 |
| No. Observations: | 1338 | AIC: | 28790.0 |
| Df Residuals: | 1334 | BIC: | 28810.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -6916.2433 | 1757.480 | -3.935 | 0.000 | -1.04e+04 | -3468.518 |
| age | 239.9945 | 22.289 | 10.767 | 0.000 | 196.269 | 283.720 |
| bmi | 332.0834 | 51.310 | 6.472 | 0.000 | 231.425 | 432.741 |
| children | 542.8647 | 258.241 | 2.102 | 0.036 | 36.261 | 1049.468 |

| | | | |
|---|---|---|---|
| Omnibus: | 325.395 | Durbin-Watson: | 2.012 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 603.372 |
| Skew: | 1.52 | Prob(JB): | 9.54e-132 |
| Kurtosis: | 4.255 | Cond. No. | 290.0 |


There are three numerical variables, so there are 6 other non-empty subsets:
* age only
* bmi only
* children only
* age and bmi
* age and children
* bmi and children
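
The count of six comes from taking every non-empty combination of the three variables (2^3 - 1 = 7) and leaving out the full set. As a rough sketch, the same list can be generated with `itertools.combinations` instead of being written out by hand. The names `predictors`, `response`, and `formulas` are illustrative choices.

```python
from itertools import combinations

predictors = ['age', 'bmi', 'children']
response = 'charges'

# Every subset of size 1 up to (number of predictors - 1),
# i.e. all non-empty subsets except the full set
proper_subsets = [
    combo
    for size in range(1, len(predictors))
    for combo in combinations(predictors, size)
]

# Build one formula string per subset, in the style used with smf.ols
formulas = [f"{response} ~ {' + '.join(combo)}" for combo in proper_subsets]

for formula in formulas:
    print(formula)
```

Each printed string has the same form as the formulas passed to `smf.ols` in this section, so a loop over `formulas` could fit every subset in turn.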

We can compare model performance by way of RSS.

In [3]:
def statsmodelRSS(est):
    # Returns the RSS of a fitted statsmodels OLS result (the same value is available as est.ssr)
    return np.sum(est.resid**2)

In [4]:
ins_age = smf.ols('charges ~ age', insurance).fit()
print(statsmodelRSS(ins_age))
ins_bmi = smf.ols('charges ~ bmi', insurance).fit()
print(statsmodelRSS(ins_bmi))
ins_child = smf.ols('charges ~ children', insurance).fit()
print(statsmodelRSS(ins_child))
ins_age_bmi = smf.ols('charges ~ age + bmi', insurance).fit()
print(statsmodelRSS(ins_age_bmi))
ins_age_child = smf.ols('charges ~ age + children', insurance).fit()
print(statsmodelRSS(ins_age_child))
ins_bmi_child = smf.ols('charges ~ bmi + children', insurance).fit()
print(statsmodelRSS(ins_bmi_child))

178544029385.2155
188360830331.80313
195167621650.2592
173097580364.0642
177943340984.38782
187520317725.7104


The subset with the lowest RSS is `age` and `bmi`, with an RSS of 173097580364.0642.\
Let's compare this subset's performance with the earlier model that uses all numerical variables.

In [5]:
print(statsmodelRSS(insurance_all))

172526061322.4756


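Including the variable `children` lowers the RSS, but only slightly: from about 1.731e11 to 1.725e11, a drop of roughly 0.3%. Adding a predictor can never raise the RSS on the data used for fitting, so a lower RSS alone does not show that the extra variable helps. Here the p-value for `children` in the summary above (0.036) is below 0.05, so including all numerical variables is reasonable.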
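
To judge whether a drop in RSS is more than noise, one standard tool for nested OLS models is the partial F-statistic. The notes themselves do not use it, so the following is a sketch under that assumption, using only the RSS values printed above and the `Df Residuals` entry from the summary of the full model.

```python
# RSS values printed in the cells above
rss_age_bmi = 173097580364.0642   # charges ~ age + bmi
rss_all = 172526061322.4756       # charges ~ age + bmi + children

df_resid_all = 1334    # "Df Residuals" of the full model in the summary
extra_predictors = 1   # only `children` was added

# How much of the smaller model's RSS is removed by adding `children`
relative_drop = (rss_age_bmi - rss_all) / rss_age_bmi

# Partial F-statistic: RSS reduction per added predictor,
# divided by the residual variance estimate of the full model
f_stat = ((rss_age_bmi - rss_all) / extra_predictors) / (rss_all / df_resid_all)

print(f"relative drop in RSS: {relative_drop:.2%}")
print(f"partial F-statistic:  {f_stat:.3f}")
print(f"sqrt(F):              {f_stat ** 0.5:.3f}")
```

When a single predictor is added, the partial F-statistic equals the square of that predictor's t-statistic. The square root here comes out at about 2.10, in line with the t value of 2.102 reported for `children` in the summary.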

**But what about including the other variables? A patient's sex, smoker status, and residential region are also available to analyze...**\
These variables are called *qualitative*, or *categorical*. Qualitative variables are non-numerical, such as unordered categories or booleans. This is in contrast to the quantitative (numerical) variables we were working with earlier.

Qualitative variables cannot be used directly in regression. They have to be converted into a numerical format. This can be done with *dummy variables*: each category gets its own column, which holds a 1 where that category is present and a 0 otherwise.
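
To make the idea concrete before turning to pandas, here is a minimal plain-Python sketch of the encoding. The helper `dummy_encode` is hypothetical and written only for illustration; it is not a pandas function.

```python
def dummy_encode(values):
    """Return one dict per value: 1.0 for the matching category, 0.0 for the others."""
    categories = sorted(set(values))
    return [{cat: float(value == cat) for cat in categories} for value in values]


# First five entries of the `sex` column
sex = ['female', 'male', 'male', 'male', 'male']

for row in dummy_encode(sex):
    print(row)
```

Every row has exactly one 1.0, in the column of the category that row belongs to.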

A quick demonstration on the `sex` variable:

In [6]:
insurance[['sex']]

| | sex |
|---|---|
| 0 | female |
| 1 | male |
| 2 | male |
| 3 | male |
| 4 | male |
| ... | ... |
| 1333 | male |
| 1334 | female |
| 1335 | female |
| 1336 | female |


In [7]:
# True and False
pd.get_dummies(insurance['sex'])

| | female | male |
|---|---|---|
| 0 | True | False |
| 1 | False | True |
| 2 | False | True |
| 3 | False | True |
| 4 | False | True |
| ... | ... | ... |
| 1333 | False | True |
| 1334 | True | False |
| 1335 | True | False |
| 1336 | True | False |


In [8]:
# Convert True and False to numerical values (1.0 and 0.0)
pd.get_dummies(insurance['sex'], dtype=float)

| | female | male |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 |
| ... | ... | ... |
| 1333 | 0.0 | 1.0 |
| 1334 | 1.0 | 0.0 |
| 1335 | 1.0 | 0.0 |
| 1336 | 1.0 | 0.0 |
