In [44]:
import pandas as pd 
import numpy as np
from itertools import combinations

In [45]:
df_simulated = pd.read_csv("../integration/df_final_with_simulated.csv")
df_simulated.head()


Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,...,Masters,Doctorate,Rural,Urban,Total,Urban_Pct,Rural_Pct,State_Type,individual_income,education_score
0,9046,1,67.0,0,1,1,Private,Urban,228.69,36.6,...,345966,49857,2,7,9,77.777778,22.222222,Mixed,93396.14153,9.842136
1,31112,1,80.0,0,1,1,Private,Rural,105.92,32.5,...,391150,71956,51,44,95,46.315789,53.684211,Rural,63871.356988,16.088449
2,60182,0,49.0,0,0,1,Private,Urban,171.23,34.4,...,345957,54737,20,26,46,56.521739,43.478261,Mixed,70591.885381,12.036045
3,1665,0,79.0,1,0,1,Self-employed,Rural,174.12,24.0,...,218915,35332,27,37,64,57.8125,42.1875,Mixed,70646.298564,8.762857
4,56669,1,81.0,0,0,1,Private,Urban,186.21,29.0,...,81198,14956,1,2,3,66.666667,33.333333,Mixed,79832.466253,1.615077


In [46]:
# Stroke Prevalence by Residence Type
stroke_by_residence = df_simulated.groupby("Residence_type")["stroke"].mean()
stroke_by_residence

Residence_type
Rural    0.041339
Urban    0.043775
Name: stroke, dtype: float64

Analysis of Interaction Terms 

We want to see how the numeric predictors (e.g., age, income, BMI) interact with one another, so the outcome label (stroke) is removed before any products are formed.

In [47]:
# Identify numeric columns (excluding the target variable)
numeric_columns = df_simulated.select_dtypes(include="number").drop(columns=["stroke"]).columns
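The preview above shows an id column. It is numeric, so select_dtypes keeps it, and it then takes part in the pairwise products even though a product with a row identifier has no interpretation. A minimal sketch of excluding it, assuming id is only an identifier; the toy frame reuses a few values from the preview, and its stroke values are placeholders:

```python
import pandas as pd

# Toy frame: id, age and bmi values come from the preview; stroke values are placeholders
toy = pd.DataFrame({
    "id": [9046, 31112, 60182],
    "age": [67.0, 80.0, 49.0],
    "bmi": [36.6, 32.5, 34.4],
    "work_type": ["Private", "Private", "Private"],
    "stroke": [1, 1, 1],
})

# Target plus columns assumed to be identifiers rather than features
non_features = ["stroke", "id"]

feature_columns = (
    toy.select_dtypes(include="number")
       .drop(columns=non_features, errors="ignore")
       .columns
)
print(list(feature_columns))  # ['age', 'bmi']
```

Passing errors="ignore" keeps the call from failing if one of the listed columns is absent.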

In [48]:
# Store interaction columns in a dictionary
interaction_data = {}
interaction_results = []

combinations(numeric_columns, 2) yields every unique pair of numeric features. For each pair, the loop multiplies the two columns to form an interaction term, stores the new column in interaction_data, computes the Pearson correlation between that term and stroke, and appends the result to interaction_results. Ranking these correlations gives a first screen for feature pairs whose product may track stroke more closely than the original features do.
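To make the pair enumeration concrete, here is a small standalone illustration using four column names from this dataset. It only shows which pairs are generated and how many; no data is involved.

```python
from itertools import combinations
from math import comb

columns = ["age", "bmi", "avg_glucose_level", "hypertension"]

# Each unordered pair appears once, in the order of the input list
pairs = list(combinations(columns, 2))
for col1, col2 in pairs:
    print(f"{col1}_x_{col2}")

# n columns yield n * (n - 1) / 2 pairs, here comb(4, 2) = 6
print(len(pairs), comb(len(columns), 2))
```

The pair count grows quadratically with the number of columns, which is why the interaction columns are collected in a dictionary and concatenated once rather than added one at a time.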

In [49]:
for col1, col2 in combinations(numeric_columns, 2):
    interaction_name = f"{col1}_x_{col2}"
    # Element-wise product of the two features
    interaction_values = df_simulated[col1] * df_simulated[col2]
    interaction_data[interaction_name] = interaction_values
    # Pearson correlation between the product and the stroke label
    corr = np.corrcoef(interaction_values, df_simulated["stroke"])[0, 1]
    interaction_results.append((interaction_name, corr))
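One caveat about np.corrcoef: it does not skip missing values, so a single NaN in either input makes the result NaN. The top correlations reported below are finite, so this may not matter for this dataset. If a column did contain gaps, pandas' Series.corr is an alternative that drops incomplete pairs. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values; one entry of x is missing
x = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
y = pd.Series([0, 0, 1, 1, 1])

corr_numpy = np.corrcoef(x, y)[0, 1]  # NaN propagates through the computation
corr_pandas = x.corr(y)               # computed on the four complete pairs only

print(np.isnan(corr_numpy), np.isnan(corr_pandas))
```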

In [50]:
# Concatenate new interaction features in one go (efficient!)
interaction_df = pd.DataFrame(interaction_data)
df_simulated = pd.concat([df_simulated, interaction_df], axis=1)

In [51]:
# Top 5 interaction terms by absolute correlation
interaction_results_df = pd.DataFrame(interaction_results, columns=["Interaction_Term", "Correlation_with_Stroke"])
top5 = interaction_results_df.reindex(
    interaction_results_df["Correlation_with_Stroke"].abs().sort_values(ascending=False).index
).head(5)
top5

Unnamed: 0,Interaction_Term,Correlation_with_Stroke
44,age_x_avg_glucose_level,0.246424
47,age_x_Median_Income,0.216782
58,age_x_individual_income,0.206317
45,age_x_bmi,0.204168
43,age_x_ever_married,0.184954


Explore Correlations

In [52]:
health_indicators = ["hypertension", "heart_disease", "bmi", "avg_glucose_level"]
socioeconomic_vars = ["Median_Income", "individual_income", "education_score"]

In [53]:
# Compute correlations
edu_health_corr = df_simulated[["education_score"] + health_indicators].corr().loc["education_score", health_indicators]
income_stroke_corr = df_simulated[["individual_income", "stroke"]].corr().iloc[0, 1]
socio_corr_stroke = df_simulated[socioeconomic_vars + ["stroke"]].corr()["stroke"].drop("stroke")

In [54]:
# Create new interaction terms
df_simulated["income_x_age"] = df_simulated["individual_income"] * df_simulated["age"]
df_simulated["education_x_bmi"] = df_simulated["education_score"] * df_simulated["bmi"]
df_simulated["income_x_glucose"] = df_simulated["individual_income"] * df_simulated["avg_glucose_level"]


In [55]:
# Correlation of interaction terms with stroke
interaction_corr_stroke = df_simulated[[
    "income_x_age", "education_x_bmi", "income_x_glucose", "stroke"
]].corr()["stroke"].drop("stroke")

In [56]:
# Assemble results into labeled DataFrames
edu_health_df = edu_health_corr.reset_index()
edu_health_df.columns = ["Variable", "Education vs Health"]
income_stroke_df = pd.DataFrame({
    "Variable": ["individual_income"],
    "Individual Income vs Stroke": [income_stroke_corr]
})

In [57]:
socio_corr_df = socio_corr_stroke.reset_index()
socio_corr_df.columns = ["Variable", "Socioeconomic vs Stroke"]

interaction_corr_df = interaction_corr_stroke.reset_index()
interaction_corr_df.columns = ["Variable", "Socioeconomic vs Stroke"]


In [59]:
# Combine all socioeconomic correlations
socio_combined_df = pd.concat([socio_corr_df, interaction_corr_df], ignore_index=True)
socio_combined_df


Unnamed: 0,Variable,Socioeconomic vs Stroke
0,Median_Income,0.011619
1,individual_income,0.006451
2,education_score,-0.017142
3,income_x_age,0.206317
4,education_x_bmi,-0.006232
5,income_x_glucose,0.119227


In [61]:
# Merge everything into final summary
final_summary = pd.merge(edu_health_df, income_stroke_df, on="Variable", how="outer")
final_summary = pd.merge(final_summary, socio_combined_df, on="Variable", how="outer")
final_summary


Unnamed: 0,Variable,Education vs Health,Individual Income vs Stroke,Socioeconomic vs Stroke
0,hypertension,0.004612,,
1,heart_disease,-0.016963,,
2,bmi,0.001717,,
3,avg_glucose_level,-0.012927,,
4,individual_income,,0.006451,0.006451
5,Median_Income,,,0.011619
6,education_score,,,-0.017142
7,income_x_age,,,0.206317
8,education_x_bmi,,,-0.006232
9,income_x_glucose,,,0.119227


The relationship between education and the health indicators in this dataset is minimal. Correlations between the education score and hypertension, heart disease, BMI, and glucose level range from about -0.017 to +0.005, indicating virtually no linear association. Education is often linked to long-term health in the broader population, so one likely explanation is that state-level averages do not accurately reflect individual health outcomes. It is also possible that education influences unmeasured behaviors such as diet or preventive care rather than these specific indicators.

The direct correlation between individual income and stroke is also negligible. At just +0.006, it gives no evidence that individual income, simulated here as a noisy extension of state-level medians, has a linear association with stroke likelihood. This is consistent with physiological and demographic risk factors, such as age and medical history, outweighing socioeconomic factors taken in isolation.

However, interaction terms that combine socioeconomic and health variables correlate more strongly with stroke. The product of individual income and age has a notably higher correlation (+0.206), and the product of income and glucose level shows a similar but weaker pattern (+0.119). These values need careful reading. Because the correlations are positive, they cannot by themselves show that lower income raises risk among older individuals or makes high glucose more harmful. Every one of the top five terms in the ranked table above involves age, and income alone correlates with stroke at only +0.006, so much of the signal may come from age or glucose rather than from a genuine interaction. In contrast, the product of education and BMI is essentially uncorrelated with stroke (-0.006), reinforcing that not all socioeconomic-health pairings produce meaningful patterns.
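One way to check whether a product term carries information beyond its components is to compare it with a mean-centered product. The sketch below uses synthetic data, not the stroke dataset: the outcome is generated from age alone, and income is independent noise, so there is no true interaction by construction. All names and parameters here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Synthetic data: the outcome depends on age only; income is unrelated to both
age = rng.uniform(20, 80, n)
income = rng.normal(70_000, 10_000, n)
outcome = (rng.uniform(size=n) < 0.002 * (age - 20)).astype(int)

df = pd.DataFrame({"age": age, "income": income, "outcome": outcome})

# Raw product, as in the notebook, and a product of mean-centered columns
df["raw_product"] = df["age"] * df["income"]
df["centered_product"] = (
    (df["age"] - df["age"].mean()) * (df["income"] - df["income"].mean())
)

correlations = df[["age", "income", "raw_product", "centered_product"]].corrwith(
    df["outcome"]
)
print(correlations.round(3))
```

In this setup the raw product inherits most of age's correlation with the outcome, while the centered product stays near zero. Applying the same comparison to income_x_age and income_x_glucose would indicate how much of their correlation with stroke reflects an interaction rather than a main effect.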

Overall, direct correlations between socioeconomic factors and stroke are weak, while products that include age or glucose correlate more strongly. These findings are consistent with age, metabolic health, and socioeconomic factors jointly shaping stroke risk, though raw product terms alone cannot separate a true interaction from the main effect of age or glucose.