Type of regression: Logistic regression

Purpose of regression: Classification algorithm that predicts a binary outcome (i.e., dependent categorical variable)

Steps in analysis:
1. Chi-square test: Identify individual predictors associated with diabetes and determine whether there are gender differences in frequencies.
2. Overall logistic regression: Evaluate the independent effect of each predictor on diabetes risk while controlling for other factors. This shows which predictors are significant once confounding is accounted for.
3. Logistic regression with interaction terms: For the significant predictors identified in the overall logistic regression, test whether each predictor's effect differs by gender. A significant interaction means that the predictor affects men and women differently.

Next steps (not included in this analysis notebook):

Stratified logistic regression by gender: Quantify predictor effects for men and women separately by running two models. 
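
The stratified step could look roughly like the sketch below. It is not part of this analysis: the helper name `stratified_odds_ratios` and the synthetic `demo` data are assumptions made for illustration, and only the column names (`gender`, `irritability`, `diabetes`) follow the dataset used in this notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def stratified_odds_ratios(df, predictor, outcome="diabetes", stratum="gender"):
    """Fit one logit model per stratum level and collect the predictor's odds ratio."""
    rows = []
    for level, subset in df.groupby(stratum):
        fit = smf.logit(f"{outcome} ~ {predictor}", data=subset).fit(disp=False)
        # Exponentiate the coefficient and its 95% CI to get the odds-ratio scale
        lower, upper = np.exp(fit.conf_int().loc[predictor])
        rows.append({
            stratum: level,
            "Odds_Ratio": np.exp(fit.params[predictor]),
            "CI_lower": lower,
            "CI_upper": upper,
            "p-value": fit.pvalues[predictor],
        })
    return pd.DataFrame(rows).set_index(stratum)


# Synthetic stand-in data (hypothetical, NOT the diabetes dataset)
rng = np.random.default_rng(123)
n = 400
demo = pd.DataFrame({
    "gender": rng.integers(0, 2, n),
    "irritability": rng.integers(0, 2, n),
})
log_odds = -0.5 + 1.0 * demo["irritability"] - 1.0 * demo["gender"]
demo["diabetes"] = rng.binomial(1, (1 / (1 + np.exp(-log_odds))).to_numpy())

stratified = stratified_odds_ratios(demo, "irritability")
print(stratified)
```

On the real data the same call would be `stratified_odds_ratios(data, "irritability")`, with one row of output per gender code.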


In [73]:
# Import all relevant libraries to be used in this analysis
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
from scipy.stats import chi2_contingency

In [74]:
# Set random seed

np.random.seed(123)

# Load in dataset
data = pd.read_csv("diabetes_data_encoded.csv")

print(data.head())

   age  gender  polyuria  polydipsia  sudden_weight_loss  weakness  \
0   40       1         0           1                   0         1   
1   58       1         0           0                   0         1   
2   41       1         1           0                   0         1   
3   45       1         0           0                   1         1   
4   60       1         1           1                   1         1   

   polyphagia  genital_thrush  visual_blurring  itching  irritability  \
0           0               0                0        1             0   
1           0               0                1        0             0   
2           1               0                0        1             0   
3           1               1                0        1             0   
4           1               0                1        1             1   

   delayed_healing  partial_paresis  muscle_stiffness  alopecia  obesity  \
0                1                0                 1         1 

In [None]:
# Create contingency table
contingency_table = pd.crosstab(data['gender'], data['diabetes'])
print(contingency_table)

# Run Chi-squared test between gender and diabetes
chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-squared Statistic:", chi2)
print("Degrees of Freedom:", dof)
print("P-value:", p)

diabetes    0    1
gender            
0          19  173
1         181  147
Chi-squared Statistic: 103.03685927972559
Degrees of Freedom: 1
P-value: 3.289703730553294e-24


In [None]:
# Run a chi-squared test of each binary symptom variable against diabetes
categorical_vars = [
    'polyuria', 
    'polydipsia', 
    'sudden_weight_loss', 
    'weakness', 
    'polyphagia', 
    'genital_thrush', 
    'visual_blurring', 
    'itching', 
    'irritability', 
    'delayed_healing', 
    'partial_paresis', 
    'muscle_stiffness', 
    'alopecia', 
    'obesity'
    ]

results = []

for var in categorical_vars: 
    table = pd.crosstab(data[var], data['diabetes'])
    chi2, p, dof, expected = chi2_contingency(table)
    results.append({'Variable': var, 'Chi2': chi2, 'p-value': p})

results_data = pd.DataFrame(results).sort_values('p-value')
print(results_data)

              Variable        Chi2       p-value
0             polyuria  227.865839  1.740912e-51
1           polydipsia  216.171633  6.187010e-49
2   sudden_weight_loss   97.296303  5.969166e-23
10     partial_paresis   95.387627  1.565289e-22
4           polyphagia   59.595254  1.165158e-14
8         irritability   45.208348  1.771483e-11
12            alopecia   36.064143  1.909279e-09
6      visual_blurring   31.808456  1.701504e-08
3             weakness   29.767918  4.869843e-08
11    muscle_stiffness    7.288667  6.939096e-03
5       genital_thrush    5.792149  1.609790e-02
13             obesity    2.327474  1.271080e-01
9      delayed_healing    0.962094  3.266599e-01
7              itching    0.046235  8.297484e-01


The chi-square tests show that polyuria, polydipsia, sudden weight loss, and partial paresis have the strongest individual (unadjusted) associations with diabetes. Obesity, delayed healing, and itching are not significantly associated with diabetes at the 0.05 level.
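
The chi-square statistic grows with sample size, so an effect-size measure can make the strength of these associations easier to compare. The sketch below computes Cramér's V (equal to the phi coefficient for a 2x2 table) for the gender x diabetes counts printed above. The helper name `cramers_v` is an assumption, and this step is not part of the original analysis. It passes `correction=False`, whereas the tests above used the `chi2_contingency` default, which applies Yates' continuity correction to 2x2 tables.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v(table):
    """Cramér's V for a contingency table, using the uncorrected chi-squared statistic."""
    table = np.asarray(table)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    min_dim = min(table.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))


# Gender x diabetes counts from the contingency table printed earlier in this notebook
gender_table = [[19, 173],
                [181, 147]]

v_gender = cramers_v(gender_table)
print(f"Cramér's V (gender vs diabetes): {v_gender:.3f}")
```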

In [None]:
# Standardize age so its coefficient is on a scale comparable to the binary variables.
# Note: age_scaled is not in the predictors list below, so the fitted models do not adjust for age.
scaler = StandardScaler()
data['age_scaled'] = scaler.fit_transform(data[['age']])



In [87]:
# Define predictors and outcome
predictors = [
    'gender', 
    'polyuria', 
    'polydipsia', 
    'sudden_weight_loss', 
    'weakness', 
    'polyphagia', 
    'genital_thrush', 
    'visual_blurring', 
    'itching', 
    'irritability', 
    'delayed_healing', 
    'partial_paresis', 
    'muscle_stiffness', 
    'alopecia', 
    'obesity'
]

X = data[predictors]
y = data['diabetes']

In [90]:
# Add constant and fit logistic regression
X = sm.add_constant(X)
model = sm.Logit(y, X)
results = model.fit()

print(results.summary())

Optimization terminated successfully.
         Current function value: 0.169179
         Iterations 10
                           Logit Regression Results                           
Dep. Variable:               diabetes   No. Observations:                  520
Model:                          Logit   Df Residuals:                      504
Method:                           MLE   Df Model:                           15
Date:                Thu, 13 Nov 2025   Pseudo R-squ.:                  0.7461
Time:                        18:48:14   Log-Likelihood:                -87.973
converged:                       True   LL-Null:                       -346.46
Covariance Type:            nonrobust   LLR p-value:                1.440e-100
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                  0.8724      0.476      1.834      0.067      -0.060       1.804
gend

In [101]:
# Calculating odds ratios and adding in confidence intervals
OR = np.exp(results.params)
pvals = results.pvalues

CI = np.exp(results.conf_int())
CI.columns = ["CI_lower", "CI_upper"]

odds_ratios = pd.concat([
    OR.rename("Odds_Ratio"),
    CI,
    pvals.rename("p-value")
    
], axis = 1)

print(odds_ratios.sort_values("p-value"))

                    Odds_Ratio   CI_lower    CI_upper       p-value
gender                0.013528   0.004233    0.043235  3.913145e-13
polyuria             79.887434  19.995261  319.175735  5.695990e-10
polydipsia          143.298840  28.616981  717.565482  1.534874e-09
itching               0.062627   0.017487    0.224291  2.076599e-05
irritability          8.332763   2.694383   25.770252  2.326747e-04
genital_thrush        6.228905   2.077599   18.675043  1.093805e-03
partial_paresis       2.692106   1.030371    7.033815  4.327761e-02
polyphagia            2.474431   0.953608    6.420674  6.255778e-02
const                 2.392603   0.942013    6.076930  6.660100e-02
weakness              2.592047   0.923933    7.271851  7.035080e-02
muscle_stiffness      0.392565   0.134560    1.145271  8.695828e-02
visual_blurring       1.983300   0.599477    6.561516  2.619755e-01
delayed_healing       0.593728   0.197424    1.785558  3.534037e-01
alopecia              0.704838   0.238199    2.0

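The logistic regression shows that polyuria, polydipsia, irritability, and genital thrush remain strongly and positively associated with diabetes (odds ratios above 1, p < 0.01) after adjusting for the other symptoms, while gender and itching have odds ratios well below 1. The confidence intervals for polyuria and polydipsia are very wide, so the size of those effects is imprecisely estimated.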
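
As a check on how the odds-ratio table relates to the regression summary, the sketch below exponentiates the `const` row of the summary (coefficient 0.8724, standard error 0.476) and builds a 95% Wald interval by hand. The variable names are illustrative, and small differences from the table are expected because the summary values are rounded.

```python
import numpy as np

# Values copied from the `const` row of the Logit summary above
coef = 0.8724
std_err = 0.476
z_95 = 1.96  # approximate two-sided 95% critical value of the standard normal

# Odds ratio = exp(coefficient); the CI limits are exponentiated the same way
odds_ratio = np.exp(coef)
ci_lower = np.exp(coef - z_95 * std_err)
ci_upper = np.exp(coef + z_95 * std_err)

print(f"OR = {odds_ratio:.3f}, 95% CI = ({ci_lower:.3f}, {ci_upper:.3f})")
```

The result should agree with the `const` row of the odds-ratio table (2.393, 0.942 to 6.077) up to rounding.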

In [110]:
# Logistic regression with gender x polyuria interaction
model_interaction_1 = smf.logit(    
    formula = 'diabetes ~ gender + polyuria + gender:polyuria',
    data = data
).fit()

# Calculate OR, CI, and p-value
OR = np.exp(model_interaction_1.params)

CI = np.exp(model_interaction_1.conf_int())
CI.columns = ["2.5%", "97.5%"]

pvals = model_interaction_1.pvalues

interaction_1_results = pd.concat([
    OR.rename("Odds_Ratio"),
    CI,
    pvals.rename("p-value")
], axis = 1)

# Sort by p-values
interaction_1_results_sorted = interaction_1_results.sort_values("p-value")

print("\nOdds Ratios with 95% CI for gender x polyuria:")
print(interaction_1_results_sorted)

         Current function value: 0.335250
         Iterations: 35

Odds Ratios with 95% CI for gender x polyuria:
                   Odds_Ratio           2.5%          97.5%       p-value
gender           8.584337e-02   4.459025e-02   1.652622e-01  2.031419e-13
Intercept        2.315790e+00   1.352169e+00   3.966133e+00  2.220575e-03
polyuria         2.805841e+06  2.572650e-185  3.060170e+197  9.472549e-01
gender:polyuria  1.362526e-05  1.248678e-196  1.486753e+186  9.601866e-01




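The gender x polyuria interaction term is not statistically significant (p = 0.96), but this model should not be interpreted. The fit stopped at the 35-iteration limit without reporting successful convergence, and the odds ratios for polyuria (2.806e+06) and for the interaction (1.363e-05) have confidence intervals spanning hundreds of orders of magnitude. An odds ratio of 1.363e-05 is very far from 1, not close to it. This pattern is typical of (quasi-)complete separation, where some gender-by-polyuria group contains only diabetic or only non-diabetic cases, so this model cannot tell us whether the effect of polyuria differs by gender.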
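
Exploding odds ratios and confidence intervals like those in the polyuria model usually point to a gender-by-predictor group with no cases in one outcome category. A minimal sketch of how to look for such empty cells follows. The helper name `rows_with_empty_cells` and the small `demo` table are hypothetical and do not reproduce the diabetes dataset.

```python
import pandas as pd


def rows_with_empty_cells(df, stratum, predictor, outcome):
    """Return stratum/predictor groups where at least one outcome category has zero cases."""
    counts = pd.crosstab([df[stratum], df[predictor]], df[outcome])
    return counts[(counts == 0).any(axis=1)]


# Hypothetical counts: (gender, polyuria, diabetes, number of rows)
cells = [
    (0, 0, 0, 5), (0, 0, 1, 10),
    (0, 1, 1, 12),               # no non-diabetic cases in this group
    (1, 0, 0, 20), (1, 0, 1, 4),
    (1, 1, 0, 6), (1, 1, 1, 9),
]
demo = pd.DataFrame(
    [(g, p, d) for g, p, d, count in cells for _ in range(count)],
    columns=["gender", "polyuria", "diabetes"],
)

sparse_rows = rows_with_empty_cells(demo, "gender", "polyuria", "diabetes")
print(sparse_rows)
```

On the real data the call would be `rows_with_empty_cells(data, "gender", "polyuria", "diabetes")`; any row it returns marks a group for which the interaction coefficient cannot be estimated by ordinary maximum likelihood.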

In [111]:
# Logistic regression with gender x polydipsia interaction
model_interaction_2 = smf.logit(    
    formula = 'diabetes ~ gender + polydipsia + gender:polydipsia',
    data = data
).fit()

# Calculate OR, CI, and p-value
OR = np.exp(model_interaction_2.params)

CI = np.exp(model_interaction_2.conf_int())
CI.columns = ["2.5%", "97.5%"]

pvals = model_interaction_2.pvalues

interaction_2_results = pd.concat([
    OR.rename("Odds_Ratio"),
    CI,
    pvals.rename("p-value")
], axis = 1)

# Sort by p-values
interaction_2_results_sorted = interaction_2_results.sort_values("p-value")

print("\nOdds Ratios with 95% CI for gender x polydipsia:")
print(interaction_2_results_sorted)

         Current function value: 0.351138
         Iterations: 35

Odds Ratios with 95% CI for gender x polydipsia:
                     Odds_Ratio      2.5%     97.5%       p-value
gender             1.075385e-01  0.057768  0.200188  2.017713e-12
Intercept          2.526316e+00  1.485164  4.297351  6.279874e-04
polydipsia         4.287872e+09  0.000000       inf  9.980990e-01
gender:polydipsia  1.073041e-08  0.000000       inf  9.984272e-01




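The gender x polydipsia interaction term is not statistically significant (p = 0.998), but this model has the same problem as the polyuria model. The fit stopped at the 35-iteration limit without reporting successful convergence, and the confidence intervals for polydipsia and for the interaction run from 0 to infinity. The interaction odds ratio (1.073e-08) is very far from 1, not close to it, and is an artifact of (quasi-)complete separation rather than a meaningful estimate. This model cannot tell us whether the effect of polydipsia differs by gender.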

In [112]:
# Logistic regression with gender x irritability interaction
model_interaction_3 = smf.logit(    
    formula = 'diabetes ~ gender + irritability + gender:irritability',
    data = data
).fit()

# Calculate OR, CI, and p-value
OR = np.exp(model_interaction_3.params)

CI = np.exp(model_interaction_3.conf_int())
CI.columns = ["2.5%", "97.5%"]

pvals = model_interaction_3.pvalues

interaction_3_results = pd.concat([
    OR.rename("Odds_Ratio"),
    CI,
    pvals.rename("p-value")
], axis = 1)

# Sort by p-values
interaction_3_results_sorted = interaction_3_results.sort_values("p-value")

print("\nOdds Ratios with 95% CI for gender x irritability:")
print(interaction_3_results_sorted)

Optimization terminated successfully.
         Current function value: 0.494014
         Iterations 8

Odds Ratios with 95% CI for gender x irritability:
                     Odds_Ratio      2.5%      97.5%       p-value
gender                 0.072289  0.041323   0.126461  3.362063e-20
Intercept              7.000000  4.271844  11.470456  1.139880e-14
irritability           6.714286  0.871875  51.706506  6.750133e-02
gender:irritability    1.236170  0.146348  10.441658  8.455912e-01


The gender x irritability interaction term is not statistically significant (p = 0.85), so there is no evidence that the relationship between irritability and diabetes differs by gender. The interaction odds ratio (1.236) is close to 1, but its 95% confidence interval (0.146 to 10.442) is wide, so a moderate difference between genders cannot be ruled out.
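
In a model with a gender x predictor interaction and gender coded 0/1, the predictor's odds ratio for gender = 0 is the main-effect odds ratio, and for gender = 1 it is the main-effect odds ratio multiplied by the interaction odds ratio. The sketch below applies this to the irritability output above. The variable names are illustrative, and the point estimates carry the wide uncertainty already noted.

```python
# Odds ratios copied from the gender x irritability output above
or_irritability = 6.714286   # effect of irritability when gender = 0
or_interaction = 1.236170    # multiplicative change in that effect when gender = 1

# Odds ratio for irritability within each gender group
or_gender0 = or_irritability
or_gender1 = or_irritability * or_interaction

print(f"Irritability OR, gender = 0: {or_gender0:.2f}")
print(f"Irritability OR, gender = 1: {or_gender1:.2f}")
```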

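Polydipsia, polyuria, and irritability are statistically significant predictors of diabetes in the overall model. None of the interaction models provides evidence that their effects differ by gender, although the polyuria and polydipsia interaction models did not converge and are therefore inconclusive.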