# Logistic regression workbook

## Author: James Christensen

## Date: November 8, 2025

In [3]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy import stats

In [9]:
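# Load the cleaned data and cast boolean indicator columns to 0/1 integers
full_data = pd.read_csv('../data/cleaned_data.csv')
full_data = full_data.astype({col: int for col in full_data.select_dtypes('bool').columns})

# Predictors: every column except the outcome and the ID, plus an intercept term
X = full_data.drop(columns=['total_claims_paid', 'person_id'])
X = sm.add_constant(X)
# Outcome: True when any claim amount was paid
y = full_data['total_claims_paid'] != 0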

y.head()

0     True
1     True
2     True
3    False
4     True
Name: total_claims_paid, dtype: bool

### Fitting the initial model

In [16]:
initial_logistic_model = sm.Logit(y, X).fit()
initial_logistic_model.summary()

Optimization terminated successfully.
         Current function value: 0.661809
         Iterations 4


| | | | |
|---|---|---|---|
| Dep. Variable: | total_claims_paid | No. Observations: | 69917 |
| Model: | Logit | Df Residuals: | 69906 |
| Method: | MLE | Df Model: | 10 |
| Date: | Sat, 08 Nov 2025 | Pseudo R-squ.: | 9.319e-05 |
| Time: | 20:16:49 | Log-Likelihood: | -46272. |
| converged: | True | LL-Null: | -46276. |
| Covariance Type: | nonrobust | LLR p-value: | 0.568 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.4255 | 0.045 | 9.395 | 0.000 | 0.337 | 0.514 |
| income | 9.757e-08 | 1.67e-07 | 0.585 | 0.558 | -2.29e-07 | 4.24e-07 |
| bmi | 0.0019 | 0.002 | 1.214 | 0.225 | -0.001 | 0.005 |
| smoke_former | 0.0008 | 0.021 | 0.037 | 0.970 | -0.039 | 0.041 |
| smoker_current | 0.0018 | 0.024 | 0.076 | 0.940 | -0.046 | 0.049 |
| alcohol_weekly | 0.0260 | 0.018 | 1.477 | 0.140 | -0.009 | 0.061 |
| alcohol_daily | 0.0262 | 0.031 | 0.852 | 0.394 | -0.034 | 0.087 |
| suburban | 0.0222 | 0.019 | 1.198 | 0.231 | -0.014 | 0.059 |
| rural | 0.0268 | 0.023 | 1.186 | 0.236 | -0.018 | 0.071 |


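From this model summary, it appears that lifestyle factors do not reliably predict whether someone will have an insurance claim paid: the pseudo R-squared is essentially zero (9.319e-05), and the likelihood-ratio test against the intercept-only model is not significant (LLR p-value = 0.568). To take a closer look, here are the 95% confidence intervals for each coefficient.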
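To make the LLR p-value less of a black box, here is a minimal sketch that rebuilds it from the numbers printed in the summary. It assumes the reported pseudo R-squared is McFadden's (1 - LL / LL-Null), so the gap between the two log-likelihoods can be recovered from it, and it uses the closed-form chi-squared tail for even degrees of freedom so that only the standard library is needed. The helper `chi2_sf_even_df` is a hypothetical name introduced for illustration; `scipy.stats.chi2.sf` should give the same tail probability, and the fitted results object exposes the same quantities as `initial_logistic_model.llr` and `initial_logistic_model.llr_pvalue`.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!,
    which is valid only when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Values as printed in the summary table (rounded by statsmodels)
ll_null = -46276.0     # LL-Null: log-likelihood of the intercept-only model
pseudo_r2 = 9.319e-05  # Pseudo R-squ., assumed to be McFadden's
df_model = 10          # Df Model: number of predictors tested jointly

# McFadden: pseudo_r2 = 1 - llf / ll_null  =>  llf - ll_null = -pseudo_r2 * ll_null
llr_stat = 2.0 * (-pseudo_r2 * ll_null)
llr_pvalue = chi2_sf_even_df(llr_stat, df_model)

print(f"LLR statistic: {llr_stat:.3f}")
print(f"LLR p-value:   {llr_pvalue:.3f}")
```

The statistic is small relative to its 10 degrees of freedom, which is why the test gives no evidence that the predictors jointly improve on the intercept-only model.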

In [17]:
initial_logistic_model.conf_int()

| | 0 | 1 |
|---|---|---|
| const | 0.3367201 | 0.5142436 |
| income | -2.290917e-07 | 4.24229e-07 |
| bmi | -0.001167306 | 0.004969655 |
| smoke_former | -0.03949865 | 0.04101819 |
| smoker_current | -0.04576815 | 0.04943878 |
| alcohol_weekly | -0.008511199 | 0.06054345 |
| alcohol_daily | -0.03416405 | 0.08665594 |
| suburban | -0.01413454 | 0.05860183 |
| rural | -0.01751598 | 0.07117012 |
| unemployed | -0.01712053 | 0.07612172 |


As can be seen, the intercept (`const`) is the only term whose confidence interval excludes 0. Every other coefficient's interval contains 0, so none of the predictors shows a detectable effect on the log-odds of a claim being paid. As such, it may be appropriate to conclude that lifestyle does not have a statistically significant impact on whether a claim is paid.
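The reading of the intervals above can be checked mechanically. The sketch below copies the bounds from the `conf_int()` output into a plain dictionary (an assumption made so the example runs without the data file), lists the terms whose interval excludes 0, and exponentiates the remaining bounds to express them as odds ratios, where "no effect" corresponds to 1 rather than 0. The names `excludes_zero` and `odds_ratio_ci` are illustrative only.

```python
import math

# 95% interval bounds copied from the conf_int() output (log-odds scale)
conf_int = {
    "const": (0.3367201, 0.5142436),
    "income": (-2.290917e-07, 4.24229e-07),
    "bmi": (-0.001167306, 0.004969655),
    "smoke_former": (-0.03949865, 0.04101819),
    "smoker_current": (-0.04576815, 0.04943878),
    "alcohol_weekly": (-0.008511199, 0.06054345),
    "alcohol_daily": (-0.03416405, 0.08665594),
    "suburban": (-0.01413454, 0.05860183),
    "rural": (-0.01751598, 0.07117012),
    "unemployed": (-0.01712053, 0.07612172),
}

# A term is distinguishable from "no effect" only if its interval excludes 0
excludes_zero = [
    name for name, (lower, upper) in conf_int.items()
    if lower > 0 or upper < 0
]

# Exponentiating the bounds gives intervals for the odds ratios of the predictors
odds_ratio_ci = {
    name: (math.exp(lower), math.exp(upper))
    for name, (lower, upper) in conf_int.items()
    if name != "const"
}

print("Intervals excluding 0:", excludes_zero)
for name, (lower, upper) in odds_ratio_ci.items():
    print(f"{name:>15}: odds ratio between {lower:.3f} and {upper:.3f}")
```

On the odds-ratio scale every predictor's interval straddles 1, which is the same conclusion stated in a form that is often easier to communicate.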