Type of regression: Logistic regression
Purpose of regression: Classification algorithm that predicts a binary outcome (i.e., a categorical dependent variable with two levels)

Steps in analysis:
1. Chi-square test: Identify individual predictors associated with diabetes and determine whether there are gender differences in frequencies.
2. Overall logistic regression: Evaluate the independent effect of each predictor on diabetes risk while controlling for other factors. This shows which predictors are significant once confounding is accounted for.
3. Logistic regression with interaction terms: Using the significant predictors identified in the overall logistic regression, test whether the effect of each one differs by gender. A significant interaction means that the predictor affects men and women differently.
4. Stratified logistic regression by gender: Quantify predictor effects for men and women separately by running two models. 


In [None]:
# Install scipy, if not already installed
%pip install scipy

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Import all relevant libraries to be used in this analysis
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
from scipy.stats import chi2_contingency

In [14]:
# Set random seed

np.random.seed(123)

# Load in dataset
data = pd.read_csv("diabetes_data_encoded.csv")

print(data.head())

   age  gender  polyuria  polydipsia  sudden_weight_loss  weakness  \
0   40       1         0           1                   0         1   
1   58       1         0           0                   0         1   
2   41       1         1           0                   0         1   
3   45       1         0           0                   1         1   
4   60       1         1           1                   1         1   

   polyphagia  genital_thrush  visual_blurring  itching  irritability  \
0           0               0                0        1             0   
1           0               0                1        0             0   
2           1               0                0        1             0   
3           1               1                0        1             0   
4           1               0                1        1             1   

   delayed_healing  partial_paresis  muscle_stiffness  alopecia  obesity  \
0                1                0                 1         1 

In [9]:
# Create contingency table
contingency_table = pd.crosstab(data['gender'], data['class'])
print(contingency_table)

# Run Chi-squared test between gender and diabetes
chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-squared Statistic:", chi2)
print("Degrees of Freedom:", dof)
print("P-value:", p)

class     0    1
gender          
0        19  173
1       181  147
Chi-squared Statistic: 103.03685927972559
Degrees of Freedom: 1
P-value: 3.289703730553294e-24
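
As a cross-check, the statistic above can be reproduced by hand from the printed contingency table. This is a minimal sketch that assumes `chi2_contingency` applied its default Yates continuity correction, which it does for 2x2 tables unless `correction=False` is passed. The helper name `yates_chi2_2x2` is illustrative and not part of SciPy.

```python
import math

def yates_chi2_2x2(a, b, c, d):
    """Chi-square statistic with Yates correction for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Shrink |ad - bc| by n/2 (continuity correction), floored at 0
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * diff ** 2 / (row1 * row2 * col1 * col2)

# Counts copied from the crosstab output: rows = gender 0/1, columns = class 0/1
chi2_stat = yates_chi2_2x2(19, 173, 181, 147)

# For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2_stat / 2))

print(chi2_stat, p_value)
```

Matching the printed statistic this way confirms that the reported value includes the continuity correction, which matters when comparing against software that does not apply it by default.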


In [11]:
# Creating loop for all variables
categorical_vars = ['polyuria', 'polydipsia', 'sudden_weight_loss', 'weakness', 'polyphagia', 'genital_thrush', 'visual_blurring', 'itching', 'irritability', 'delayed_healing', 'partial_paresis', 'muscle_stiffness', 'alopecia', 'obesity']

results = []

for var in categorical_vars: 
    table = pd.crosstab(data[var], data['class'])
    chi2, p, dof, expected = chi2_contingency(table)
    results.append({'Variable': var, 'Chi2': chi2, 'p-value': p})

results_data = pd.DataFrame(results).sort_values('p-value')
print(results_data)

              Variable        Chi2       p-value
0             polyuria  227.865839  1.740912e-51
1           polydipsia  216.171633  6.187010e-49
2   sudden_weight_loss   97.296303  5.969166e-23
10     partial_paresis   95.387627  1.565289e-22
4           polyphagia   59.595254  1.165158e-14
8         irritability   45.208348  1.771483e-11
12            alopecia   36.064143  1.909279e-09
6      visual_blurring   31.808456  1.701504e-08
3             weakness   29.767918  4.869843e-08
11    muscle_stiffness    7.288667  6.939096e-03
5       genital_thrush    5.792149  1.609790e-02
13             obesity    2.327474  1.271080e-01
9      delayed_healing    0.962094  3.266599e-01
7              itching    0.046235  8.297484e-01


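The chi-square tests show that polyuria, polydipsia, sudden weight loss, and partial paresis have the strongest unadjusted associations with diabetes (each symptom tested on its own, without controlling for the others). Obesity, delayed healing, and itching are not significantly associated with diabetes at the 0.05 level.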
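
To put the strength of these associations on a common 0 to 1 scale, the chi-square statistics can be converted to phi coefficients. The sketch below assumes N = 520 (the number of observations reported in the regression summaries) and uses the statistics printed above; because those statistics most likely include the Yates correction, the resulting phi values are slightly conservative.

```python
import math

N = 520  # number of rows in the dataset (from the regression summaries)

# Chi-square statistics copied from the table above (top four and bottom one)
chi2_stats = {
    "polyuria": 227.865839,
    "polydipsia": 216.171633,
    "sudden_weight_loss": 97.296303,
    "partial_paresis": 95.387627,
    "itching": 0.046235,
}

# Phi = sqrt(chi2 / N) rescales the statistic to a 0-1 effect size for 2x2 tables
phi = {name: math.sqrt(stat / N) for name, stat in chi2_stats.items()}

for name, value in phi.items():
    print(f"{name:20s} phi = {value:.3f}")
```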

In [16]:
# Standardize age for comparability because it is numeric while the rest of the variables are binary

scaler = StandardScaler()
data['age_scaled'] = scaler.fit_transform(data[['age']]).ravel()
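
For reference, the scaling step is a z-score transform: subtract the mean and divide by the population standard deviation. The sketch below illustrates this on the five ages shown in `data.head()`; it is only an illustration, since the real scaler is fitted on all rows. One consequence is that the odds ratio for `age_scaled` refers to a one standard deviation increase in age, not a one year increase.

```python
import statistics

ages = [40, 58, 41, 45, 60]  # first five ages from data.head(), for illustration only

mean_age = statistics.fmean(ages)
sd_age = statistics.pstdev(ages)  # population SD (ddof=0), which is what StandardScaler uses

age_scaled = [(a - mean_age) / sd_age for a in ages]
print([round(z, 3) for z in age_scaled])
```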

In [17]:
# Define predictors and outcome
predictors = [
    'age_scaled', 
    'gender', 
    'polyuria', 
    'polydipsia', 
    'sudden_weight_loss', 
    'weakness', 
    'polyphagia', 
    'genital_thrush', 
    'visual_blurring', 
    'itching', 
    'irritability', 
    'delayed_healing', 
    'partial_paresis', 
    'muscle_stiffness', 
    'alopecia', 
    'obesity'
]

X = data[predictors]
y = data['diabetes']

In [18]:
# Add constant and fit logistic regression
X = sm.add_constant(X)
model = sm.Logit(y, X)
result = model.fit()

print(result.summary())

Optimization terminated successfully.
         Current function value: 0.165053
         Iterations 10
                           Logit Regression Results                           
Dep. Variable:               diabetes   No. Observations:                  520
Model:                          Logit   Df Residuals:                      503
Method:                           MLE   Df Model:                           16
Date:                Wed, 12 Nov 2025   Pseudo R-squ.:                  0.7523
Time:                        18:49:48   Log-Likelihood:                -85.827
converged:                       True   LL-Null:                       -346.46
Covariance Type:            nonrobust   LLR p-value:                1.067e-100
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                  0.2890      0.542      0.533      0.594      -0.774       1.352
age_
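
The "Pseudo R-squ." value in the summary is McFadden's pseudo R-squared, which can be recomputed from the two log-likelihoods printed above. A minimal sketch (the function name `mcfadden_r2` is illustrative):

```python
def mcfadden_r2(ll_model, ll_null):
    """McFadden pseudo R-squared: 1 - LL(model) / LL(null)."""
    return 1 - ll_model / ll_null

# Log-likelihoods copied from the summary above
r2 = mcfadden_r2(ll_model=-85.827, ll_null=-346.46)
print(round(r2, 4))  # 0.7523
```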

In [22]:
# Calculating odds ratios
odds_ratios = pd.DataFrame({
    "Variable": result.params.index,
    "Odds_Ratio": np.exp(result.params),
    "p-value": result.pvalues
})

print(odds_ratios.sort_values("p-value"))

                              Variable  Odds_Ratio       p-value
gender                          gender    0.012892  3.491487e-13
polyuria                      polyuria   84.735923  3.077802e-10
polydipsia                  polydipsia  159.244575  9.522435e-10
itching                        itching    0.060632  3.088698e-05
irritability              irritability   10.388796  7.376964e-05
genital_thrush          genital_thrush    6.447231  7.566817e-04
polyphagia                  polyphagia    3.299485  2.525063e-02
partial_paresis        partial_paresis    3.187711  2.717698e-02
age_scaled                  age_scaled    0.537313  4.365732e-02
weakness                      weakness    2.263847  1.279860e-01
visual_blurring        visual_blurring    2.498960  1.596007e-01
muscle_stiffness      muscle_stiffness    0.482507  2.091021e-01
delayed_healing        delayed_healing    0.675951  4.764573e-01
const                            const    1.335134  5.940674e-01
obesity                  

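The logistic regression shows that polydipsia (OR ≈ 159), polyuria (OR ≈ 85), and irritability (OR ≈ 10) have strong positive associations with diabetes once the other symptoms are accounted for. Male gender (coded 1; OR ≈ 0.01) and itching (OR ≈ 0.06) are significantly associated with lower odds of diabetes.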
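
To make the odds ratios easier to read, the sketch below converts a few of the values printed above back to log-odds coefficients and to percent changes in the odds. The numbers are copied from the output; nothing is refitted.

```python
import math

# Odds ratios copied from the table above
odds_ratios = {"polydipsia": 159.244575, "polyuria": 84.735923,
               "irritability": 10.388796, "age_scaled": 0.537313}

for name, or_value in odds_ratios.items():
    coef = math.log(or_value)            # back to the log-odds coefficient
    pct_change = (or_value - 1) * 100    # percent change in the odds
    print(f"{name:13s} coef = {coef:+.3f}  odds change = {pct_change:+.1f}%")
```

An odds ratio below 1, such as the one for `age_scaled`, corresponds to a negative coefficient: each one standard deviation increase in age multiplies the odds of diabetes by about 0.54, holding the other predictors fixed.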

In [20]:
# Logistic regression with gender x polyuria interaction
model_interaction = smf.logit(    
    formula = 'diabetes ~ gender + polyuria + gender:polyuria',
    data = data
).fit()

# Display regression summary
print(model_interaction.summary())

# Convert coefficients to odds ratio
odds_ratios = np.exp(model_interaction.params)
print("\nOdds Ratios:")
print(odds_ratios)

         Current function value: 0.335250
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:               diabetes   No. Observations:                  520
Model:                          Logit   Df Residuals:                      516
Method:                           MLE   Df Model:                            3
Date:                Wed, 12 Nov 2025   Pseudo R-squ.:                  0.4968
Time:                        18:51:40   Log-Likelihood:                -174.33
converged:                      False   LL-Null:                       -346.46
Covariance Type:            nonrobust   LLR p-value:                 2.597e-74
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           0.8398      0.275      3.059      0.002       0.302       1.378
gender             -2.4552      0.334     -7.347  



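The p-value for the gender x polyuria interaction term is not significant, but this model did not converge (converged: False after 35 iterations), so its estimates are unreliable. The reported odds ratio (1.363e-05) is close to 0, not close to 1; an extreme value like this combined with non-convergence usually points to (quasi-)complete separation, i.e., a gender-by-polyuria subgroup in which nearly everyone has the same outcome. This model therefore cannot establish whether the relationship between polyuria and diabetes differs by gender.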
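
One way to see why an interaction model of this form can fail to converge: with two binary predictors and their interaction, the exponentiated interaction coefficient equals the ratio of the two gender-specific odds ratios. If one subgroup has an empty cell, that ratio is 0 or infinite and the optimizer keeps drifting toward it. The sketch below uses HYPOTHETICAL counts (not taken from the dataset) to illustrate this; `odds_ratio` is an illustrative helper.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds ratio from a 2x2 table; returns inf when the denominator cell product is 0."""
    num = exposed_cases * unexposed_noncases
    den = exposed_noncases * unexposed_cases
    if den == 0:
        return float("inf")
    return num / den

# HYPOTHETICAL counts (not taken from the dataset), ordered as
# (symptom & diabetic, symptom & non-diabetic, no symptom & diabetic, no symptom & non-diabetic)
strata = {
    "gender=0": (100, 0, 70, 20),   # empty cell: nobody with the symptom is non-diabetic
    "gender=1": (120, 10, 30, 170),
}

ors = {g: odds_ratio(*cells) for g, cells in strata.items()}
# In a model y ~ gender + x + gender:x, exp(interaction coef) equals this ratio
ratio_of_ors = ors["gender=1"] / ors["gender=0"]

print(ors, ratio_of_ors)
```

On the real data, a three-way table such as `pd.crosstab([data['gender'], data['polyuria']], data['diabetes'])` would show whether any cell is empty or nearly empty.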

In [21]:
# Logistic regression with gender x polydipsia interaction
model_interaction = smf.logit(    
    formula = 'diabetes ~ gender + polydipsia + gender:polydipsia',
    data = data
).fit()

# Display regression summary
print(model_interaction.summary())

# Convert coefficients to odds ratio
odds_ratios = np.exp(model_interaction.params)
print("\nOdds Ratios:")
print(odds_ratios)

         Current function value: 0.351138
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:               diabetes   No. Observations:                  520
Model:                          Logit   Df Residuals:                      516
Method:                           MLE   Df Model:                            3
Date:                Wed, 12 Nov 2025   Pseudo R-squ.:                  0.4730
Time:                        18:59:25   Log-Likelihood:                -182.59
converged:                      False   LL-Null:                       -346.46
Covariance Type:            nonrobust   LLR p-value:                 9.814e-71
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept             0.9268      0.271      3.419      0.001       0.396       1.458
gender               -2.2299      0.317     



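The p-value for the gender x polydipsia interaction term is not significant, but this model also did not converge (converged: False after 35 iterations), so its estimates are unreliable. The reported odds ratio (1.073e-08) is close to 0, not close to 1, which again suggests (quasi-)complete separation in one gender-by-polydipsia subgroup. This model therefore cannot establish whether the relationship between polydipsia and diabetes differs by gender.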

In [31]:
# Logistic regression with gender x irritability interaction
model_interaction = smf.logit(    
    formula = 'diabetes ~ gender + irritability + gender:irritability',
    data = data
).fit()

# Display regression summary
print(model_interaction.summary())

# Convert coefficients to odds ratio
odds_ratios = np.exp(model_interaction.params)
print("\nOdds Ratios:")
print(odds_ratios)

Optimization terminated successfully.
         Current function value: 0.494014
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:               diabetes   No. Observations:                  520
Model:                          Logit   Df Residuals:                      516
Method:                           MLE   Df Model:                            3
Date:                Wed, 12 Nov 2025   Pseudo R-squ.:                  0.2585
Time:                        20:10:24   Log-Likelihood:                -256.89
converged:                       True   LL-Null:                       -346.46
Covariance Type:            nonrobust   LLR p-value:                 1.343e-38
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept               1.9459      0.252      7.723      0.000       1.452       2.440
ge

The p-value for the gender x irritability interaction term is not significant, so there is no evidence that the relationship between irritability and diabetes differs by gender. The odds ratio (1.236) is modestly above 1, indicating at most a small difference in effect between genders.

In [38]:
# Stratify by gender
males = data[data["gender"] == 1]
females = data[data["gender"] == 0]

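# Fit logistic regression for males
# gender is constant (all 1) within males, so exclude it; add_constant supplies the intercept
strat_predictors = [p for p in predictors if p != "gender"]
model_male = sm.Logit(males["diabetes"], sm.add_constant(males[strat_predictors])).fit()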

# Convert to odds ratios
odds_ratios = np.exp(model_male.params)

# 95% confidence intervals
conf = np.exp(model_male.conf_int())
conf.columns = ['2.5%', '97.5%']

# Extract p-values for summary table
p_values = model_male.pvalues

# Summary table
results_male = pd.concat([odds_ratios, conf, p_values], axis=1)
results_male.columns = ['OR', '2.5%', '97.5%', 'p-value']

# Sort by p-value, ascending
results_male_sorted = results_male.sort_values(by = 'p-value', ascending = True)

print(results_male_sorted)

Optimization terminated successfully.
         Current function value: 0.146955
         Iterations 10
                             OR        2.5%          97.5%       p-value
polydipsia          5348.687711  235.516197  121471.307012  7.128050e-08
gender                 0.001310    0.000098       0.017498  5.184581e-07
polyuria             671.906139   52.328354    8627.404237  5.771637e-07
irritability         103.003651   12.240279     866.790079  2.001616e-05
itching                0.017411    0.001922       0.157717  3.149849e-04
genital_thrush        21.747446    3.509482     134.763892  9.363106e-04
alopecia              22.613631    2.660077     192.241159  4.291120e-03
age_scaled             0.113193    0.025043       0.511624  4.644614e-03
partial_paresis       13.888314    1.891028     102.000202  9.703146e-03
obesity                0.163759    0.031516       0.850908  3.139824e-02
polyphagia             6.414089    1.114898      36.900705  3.736118e-02
sudden_weight_loss   

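Among males, polydipsia, polyuria, and irritability are the strongest statistically significant predictors of diabetes. However, their confidence intervals are very wide (e.g., roughly 236 to 121,471 for polydipsia), so the magnitudes of these odds ratios are imprecise and should be interpreted with caution. Because gender is constant within the male subset, the row labeled gender in the output most likely acts as the model intercept rather than as a gender effect.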
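
The width of a confidence interval can be translated back into a standard error on the log-odds scale, which makes the imprecision easier to judge. The sketch below does this for the polydipsia row, assuming the interval is a standard Wald interval (estimate ± 1.96 standard errors on the log scale); the values are copied from the table above.

```python
import math
from statistics import NormalDist

# Values copied from the polydipsia row of the male results table
or_hat, ci_low, ci_high = 5348.687711, 235.516197, 121471.307012

z_crit = NormalDist().inv_cdf(0.975)                           # ~1.96 for a 95% interval
se_log_or = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
z_stat = math.log(or_hat) / se_log_or
p_value = math.erfc(abs(z_stat) / math.sqrt(2))                # two-sided normal p-value

print(round(se_log_or, 3), round(z_stat, 2), p_value)
```

A standard error of about 1.6 on the log scale means the odds ratio is only pinned down to within a factor of roughly 20 in either direction, even though the p-value is very small.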

In [41]:
females[predictors].corr()


Unnamed: 0,age_scaled,gender,polyuria,polydipsia,sudden_weight_loss,weakness,polyphagia,genital_thrush,visual_blurring,itching,irritability,delayed_healing,partial_paresis,muscle_stiffness,alopecia,obesity
age_scaled,1.0,,0.061203,0.069686,0.06715,0.02439553,0.051808,0.347696,0.170015,0.238946,0.04443439,0.090216,-0.090519,-0.021066,0.296657,0.235616
gender,,,,,,,,,,,,,,,,
polyuria,0.061203,,1.0,0.675267,0.401382,0.3059035,0.378749,-0.146051,0.039375,0.018041,0.1985345,0.180847,0.410239,0.072423,-0.22785,0.17136
polydipsia,0.069686,,0.675267,1.0,0.381924,0.3631537,0.315038,-0.058534,0.201326,0.194581,0.2208081,0.246081,0.531686,0.219771,-0.332502,0.072862
sudden_weight_loss,0.06715,,0.401382,0.381924,1.0,0.2780305,0.306155,0.048413,-0.088017,0.002216,0.1533858,0.117165,0.216181,0.327915,-0.280376,0.231943
weakness,0.024396,,0.305904,0.363154,0.278031,1.0,0.0526,-0.283197,0.253986,0.184302,-3.50558e-17,0.191768,0.238716,0.170986,-0.15891,0.146427
polyphagia,0.051808,,0.378749,0.315038,0.306155,0.05260037,1.0,-0.019684,-0.174238,0.129849,0.3251779,0.202296,0.32728,0.134853,-0.188669,-0.077901
genital_thrush,0.347696,,-0.146051,-0.058534,0.048413,-0.2831969,-0.019684,1.0,0.059235,0.169108,0.06744189,0.232376,-0.22898,0.064739,0.386275,-0.026954
visual_blurring,0.170015,,0.039375,0.201326,-0.088017,0.2539861,-0.174238,0.059235,1.0,0.22557,-0.0243975,0.128709,0.21353,0.289669,0.007597,0.189015
itching,0.238946,,0.018041,0.194581,0.002216,0.1843024,0.129849,0.169108,0.22557,1.0,0.144463,0.551139,0.248875,0.079358,0.088097,-0.198623
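
The empty `gender` row and column in the correlation matrix arise because gender is constant within the female subset, and a correlation with a zero-variance variable is undefined. The sketch below shows this with toy values (not taken from the dataset); `pearson` is an illustrative helper. The same all-zero column is also a plausible cause of a singular design matrix when it is included alongside an intercept.

```python
import math

def pearson(x, y):
    """Pearson correlation; NaN when either variable has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return math.nan  # 0 / 0: the correlation is undefined
    return sxy / math.sqrt(sxx * syy)

gender = [0, 0, 0, 0, 0]     # constant within the female subset
polyuria = [0, 1, 1, 0, 1]   # toy values, not taken from the dataset

print(pearson(gender, polyuria))    # nan
print(pearson(polyuria, polyuria))
```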


In [43]:
females["diabetes"].value_counts()


diabetes
1    173
0     19
Name: count, dtype: int64

In [None]:
# Upsample females as current dataset has 328 males and 192 females [double-check if this is best way to go about this]
#sample = females.sample(n = 328, replace = True)

# Bootstrap female samples because it is a much smaller pool than males
bootstrap_females = []

for i in range(2000):
    sample = females.sample(n = 328, replace = True)
    sample = sample.assign(replicate = i)
    bootstrap_females.append(sample)



         Current function value: inf
         Iterations: 35


  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q * linpred)))


LinAlgError: Singular matrix
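
For comparison, a conventional percentile bootstrap resamples at the original sample size rather than upsampling; drawing 328 rows from 192 females adds no new information and would make estimates look more precise than they are. The sketch below is a minimal illustration on the female outcome counts printed earlier (173 diabetic, 19 non-diabetic), bootstrapping the proportion with diabetes. It is an assumption about how the bootstrap could be set up, not the analysis in this notebook.

```python
import random

random.seed(123)

# Outcome vector rebuilt from the female value_counts: 173 diabetic, 19 non-diabetic
female_outcomes = [1] * 173 + [0] * 19
n = len(female_outcomes)

boot_props = []
for _ in range(2000):
    # Resample with replacement at the ORIGINAL sample size (192)
    resample = random.choices(female_outcomes, k=n)
    boot_props.append(sum(resample) / n)

boot_props.sort()
ci_low = boot_props[int(0.025 * len(boot_props))]
ci_high = boot_props[int(0.975 * len(boot_props)) - 1]
print(ci_low, ci_high)
```

With only 19 non-diabetic females, many resamples will contain even fewer, so a 16-predictor logistic model fitted inside each replicate can be expected to hit separation or singular-matrix errors fairly often.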

In [None]:
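# Fit logistic regression for females
# gender is all 0 within females; an all-zero column makes the design matrix singular, so drop it
# (with only 19 non-diabetic females, the fit may still fail to converge)
female_predictors = [p for p in predictors if p != "gender"]
model_female = sm.Logit(females["diabetes"], sm.add_constant(females[female_predictors])).fit()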

# Convert to odds ratios
odds_ratios = np.exp(model_female.params)

# 95% confidence intervals
conf = np.exp(model_female.conf_int())
conf.columns = ['2.5%', '97.5%']

# Extract p-values for summary table
p_values = model_female.pvalues

# Summary table
results_female = pd.concat([odds_ratios, conf, p_values], axis=1)
results_female.columns = ['OR', '2.5%', '97.5%', 'p-value']

# Sort by p-value, ascending
results_female_sorted = results_female.sort_values(by = 'p-value', ascending = True)

print(results_female_sorted)