# Linear Regression Assumption Workbook

## Author: James Christensen

## Initial Date: November 6, 2025

In [24]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy import stats

In [20]:
X.dtypes

const             float64
person_id           int64
income            float64
bmi               float64
smoke_former         bool
smoker_current       bool
alcohol_weekly       bool
alcohol_daily        bool
suburban             bool
rural                bool
unemployed           bool
retired              bool
dtype: object

### Fitting the initial model

In [23]:
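# Load the cleaned data and cast boolean indicator columns to 0/1 integers
full_data = pd.read_csv('../data/cleaned_data.csv')
full_data = full_data.astype({col: int for col in full_data.select_dtypes('bool').columns})

# Predictors: every column except the response and the ID, plus an intercept
X = full_data.drop(columns=['total_claims_paid', 'person_id'])
X = sm.add_constant(X)
y = full_data['total_claims_paid']

# Fit ordinary least squares on the untransformed response
initial_model = sm.OLS(y, X).fit()
print(initial_model.summary())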


                            OLS Regression Results                            
Dep. Variable:      total_claims_paid   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                  0.011
Method:                 Least Squares   F-statistic:                     77.51
Date:                Thu, 06 Nov 2025   Prob (F-statistic):          3.70e-159
Time:                        10:48:16   Log-Likelihood:            -6.3978e+05
No. Observations:               69917   AIC:                         1.280e+06
Df Residuals:                   69906   BIC:                         1.280e+06
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const            896.1674     50.012     17.

Looking at the summary for this full model, a number of our variables do not appear to contribute to the model. An $R^2$ of only 0.011 is concerning: the predictors explain about 1% of the variance in total_claims_paid. We'll see whether a transformation of Y gives a better fit.
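To make the $R^2$ figure concrete, here is a minimal sketch of how it is computed, $R^2 = 1 - SS_{res}/SS_{tot}$. The data are synthetic and every name (`x_demo`, `y_demo`, `design`) is hypothetical. Only the formula carries over to the claims model.

```python
import numpy as np

# Hypothetical data: a weak linear signal buried in noise (not the claims data)
rng = np.random.default_rng(0)
n_obs = 1000
x_demo = rng.normal(size=n_obs)
y_demo = 0.1 * x_demo + rng.normal(size=n_obs)

# Least-squares fit with an intercept column
design = np.column_stack([np.ones(n_obs), x_demo])
beta, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

# R^2 = 1 - SS_res / SS_tot: the share of variance in y explained by the fit
residuals = y_demo - design @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"R-squared: {r_squared:.3f}")
```

A value near zero means the fitted line predicts barely better than the mean of the response.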

#### Transforming Y using Box-Cox

In [28]:
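# Recover the original response from the fitted model
y = initial_model.model.endog
# Box-Cox requires strictly positive values, so shift zero claims by a small constant
y = y + 1e-6

# With no lambda supplied, boxcox returns the transformed data and the maximum-likelihood lambda
y_boxcox, best_lambda = stats.boxcox(y)

print("Optimal lambda:", best_lambda)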

Optimal lambda: 0.06744053363915475


Since the optimal lambda (about 0.067) is close to 0, where the Box-Cox transformation reduces to the natural log, we will apply a natural log transformation to total_claims_paid.
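For reference, the Box-Cox transform is $(y^\lambda - 1)/\lambda$ for $\lambda \neq 0$ and $\ln(y)$ at $\lambda = 0$. The sketch below implements that definition by hand and measures how far it sits from the plain log as lambda shrinks. The function name `boxcox_transform` and the sample values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def boxcox_transform(values, lam):
    """Box-Cox transform of strictly positive values (hand-written for illustration)."""
    values = np.asarray(values, dtype=float)
    if lam == 0:
        return np.log(values)
    return (values ** lam - 1.0) / lam

# Hypothetical positive values spanning several orders of magnitude
sample = np.array([0.5, 1.0, 10.0, 1000.0])

# Largest absolute difference from the natural log for each lambda
gaps = {}
for lam in (0.5, 0.067, 0.001, 0.0):
    gaps[lam] = np.max(np.abs(boxcox_transform(sample, lam) - np.log(sample)))
    print(f"lambda={lam}: max gap from log = {gaps[lam]:.4f}")
```

The gap shrinks toward zero as lambda approaches 0, but at a lambda such as 0.067 it can still be noticeable for large values, so the log is an approximation of the fitted transform, not an exact match.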

In [31]:
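# Log-transform the shifted response (the shift avoids log(0) for zero claims)
y = np.log(full_data['total_claims_paid'] + 1e-6)
# Refit with the same predictors
y_adjust_model = sm.OLS(y, X).fit()
print(y_adjust_model.summary())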


                            OLS Regression Results                            
Dep. Variable:      total_claims_paid   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     1.807
Date:                Thu, 06 Nov 2025   Prob (F-statistic):             0.0538
Time:                        11:06:00   Log-Likelihood:            -2.6181e+05
No. Observations:               69917   AIC:                         5.236e+05
Df Residuals:                   69906   BIC:                         5.237e+05
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             -1.3008      0.225     -5.

In [32]:
y

0         7.080935
1         6.909993
2         7.106639
3       -13.815511
4         7.248547
           ...    
69912   -13.815511
69913   -13.815511
69914   -13.815511
69915     7.824238
69916     7.007383
Name: total_claims_paid, Length: 69917, dtype: float64

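This model did worse than the original one: $R^2$ fell from 0.011 to 0.000, and the overall F-test is no longer significant at the 0.05 level (p = 0.0538). The transformed values above suggest why. Claims of zero become log(1e-6) ≈ -13.8, far from the positive claims shown, which sit around 7, so the transformed response is dominated by the split between zero and nonzero claims. As a result, we will keep the original model and check whether its assumptions hold.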
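The values of -13.815511 in the transformed response come from the 1e-6 shift applied to claims of zero. A minimal sketch with made-up claim amounts (the `claims` array is hypothetical) reproduces that number. For comparison it also shows `np.log1p`, one commonly used alternative for zero-heavy data. This is an assumption added here for illustration, not something this workbook uses.

```python
import numpy as np

# Hypothetical claim amounts, including zeros (not taken from the data set)
claims = np.array([0.0, 0.0, 1000.0, 1200.0, 2500.0])

# The shift used above: zeros land at log(1e-6), far below the positive claims
shifted_log = np.log(claims + 1e-6)
print("log(y + 1e-6):", np.round(shifted_log, 6))

# Share of observations that are exactly zero
zero_share = np.mean(claims == 0)
print("share of zero claims:", zero_share)

# Alternative (assumption): log(1 + y) maps zero to zero instead of about -13.8
log1p_values = np.log1p(claims)
print("log1p(y):", np.round(log1p_values, 6))
```

The smaller the shift constant, the further the zeros are pushed from the rest of the data, so the choice of constant affects the fitted model.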

### Checking the assumptions

#### Independence