## Investigating how the presence of diabetes impacts CVD risk (Framingham Score)

> The plan here is to combine matching with regression to estimate the impact of having diabetes on continuous CVD risk, as described by the Framingham Risk Score. First, we will run an unadjusted regression to show what the results look like without controlling for other variables. We will then look at how matching with DAME (Dynamic Almost Matching Exactly) changes these results. We will also look at the impact of age and diabetes on CVD risk.
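As a minimal sketch of what the unadjusted regression measures: with a single binary regressor, the OLS slope is exactly the difference in mean outcome between the two groups. The data below are synthetic, and the names `diabetes` and `risk` are placeholders rather than columns from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): a 0/1 group flag and a continuous outcome.
diabetes = rng.integers(0, 2, size=500)
risk = 15.0 + 2.0 * diabetes + rng.normal(0.0, 3.0, size=500)

# Design matrix: an intercept column plus the binary regressor.
X = np.column_stack([np.ones_like(diabetes, dtype=float), diabetes])
coef, *_ = np.linalg.lstsq(X, risk, rcond=None)
intercept, slope = coef

# Difference in group means, computed directly for comparison.
diff_in_means = risk[diabetes == 1].mean() - risk[diabetes == 0].mean()

print(f"OLS slope:           {slope:.4f}")
print(f"Difference in means: {diff_in_means:.4f}")
```

Because the slope is only a raw group difference, any imbalance in other risk factors between the groups is folded into it, which is the motivation for matching before estimating the effect.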

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.formula.api as smf

df = pd.read_csv(
    "/Users/lilahduboff/Documents/Duke_Unifying_DataScience/UDS_Final_Proj/UDS_CVD_Final_Project_Rep/bangladesh_data.csv"
)
df.head()

Unnamed: 0,Sex,Age,Weight (kg),Height (m),BMI,Abdominal Circumference (cm),Blood Pressure (mmHg),Total Cholesterol (mg/dL),HDL (mg/dL),Fasting Blood Sugar (mg/dL),...,Physical Activity Level,Family History of CVD,CVD Risk Level,Height (cm),Waist-to-Height Ratio,Systolic BP,Diastolic BP,Blood Pressure Category,Estimated LDL (mg/dL),CVD Risk Score
0,F,32.0,69.1,1.71,23.6,86.2,125/79,248.0,78.0,111.0,...,Low,N,INTERMEDIARY,171.0,0.504,125.0,79.0,Elevated,140.0,17.93
1,F,55.0,118.7,1.69,41.6,82.5,139/70,162.0,50.0,135.0,...,High,Y,HIGH,169.0,0.488,139.0,70.0,Hypertension Stage 1,82.0,20.51
2,M,,,1.83,26.9,106.7,104/77,103.0,73.0,114.0,...,High,Y,INTERMEDIARY,183.0,0.583,104.0,77.0,Normal,0.0,12.64
3,M,44.0,108.3,1.8,33.4,96.6,140/83,134.0,46.0,91.0,...,High,Y,INTERMEDIARY,,0.537,140.0,83.0,Hypertension Stage 1,58.0,16.36
4,F,32.0,99.5,1.86,28.8,102.7,144/83,146.0,64.0,141.0,...,High,N,INTERMEDIARY,186.0,0.552,144.0,83.0,Hypertension Stage 1,52.0,17.88


> First, we're going to impute missing values in the numerical columns with each column's mean. This seemed like the most feasible decision because...

In [2]:
# Impute missing values in the numeric model columns with each column's mean.
# Assigning the result back avoids the chained `fillna(..., inplace=True)`
# pattern on a column selection, which recent pandas versions warn about.
cols_to_impute = [
    "Age",
    "BMI",
    "Total Cholesterol (mg/dL)",
    "HDL (mg/dL)",
    "Fasting Blood Sugar (mg/dL)",
    "Systolic BP",
    "Diastolic BP",
    "Estimated LDL (mg/dL)",
]
df[cols_to_impute] = df[cols_to_impute].fillna(df[cols_to_impute].mean())

df.isnull().sum()

Sex                              0
Age                              0
Weight (kg)                     81
Height (m)                      67
BMI                              0
Abdominal Circumference (cm)    67
Blood Pressure (mmHg)            0
Total Cholesterol (mg/dL)        0
HDL (mg/dL)                      0
Fasting Blood Sugar (mg/dL)      0
Smoking Status                   0
Diabetes Status                  0
Physical Activity Level          0
Family History of CVD            0
CVD Risk Level                   0
Height (cm)                     74
Waist-to-Height Ratio           79
Systolic BP                      0
Diastolic BP                     0
Blood Pressure Category          0
Estimated LDL (mg/dL)            0
CVD Risk Score                  70
dtype: int64
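
One property of mean imputation worth keeping in mind: it leaves the column mean unchanged but shrinks its spread, since every filled value sits exactly at the center. A small sketch on a toy column (the values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with two missing entries.
age = pd.Series([32.0, 55.0, np.nan, 44.0, 32.0, np.nan, 61.0])

# Fill the gaps with the mean of the observed values.
age_imputed = age.fillna(age.mean())

print("mean before/after:", age.mean(), age_imputed.mean())
print("std  before/after:", age.std(), age_imputed.std())
```

The reduced standard deviation can make downstream standard errors look tighter than the observed data alone would justify.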

> OK, there are still quite a few rows with missing data. Let's drop the columns we don't need (given the CDC's list of risk factors) and see if that changes anything. We'll be dropping: the dataset's CVD Risk Score and CVD Risk Level (the latter is a leveled low/intermediary/high version of the score), because they would skew results; Blood Pressure (mmHg), because we already have systolic and diastolic pressure as separate columns and the combined "125/79" string is not a single number; Waist-to-Height Ratio and Abdominal Circumference (cm), because they're not directly relevant to what we're looking at; and the height and weight variables, because we already have BMI, which takes both into account and gives a more succinct measure of physical stature.
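
As a quick check that BMI already captures height and weight, the sketch below recomputes it as weight (kg) divided by height (m) squared for the first two rows of the `df.head()` output above. The small `sample` frame is built by hand for illustration.

```python
import pandas as pd

# First two rows of the df.head() output: weight, height, and the reported BMI.
sample = pd.DataFrame(
    {
        "Weight (kg)": [69.1, 118.7],
        "Height (m)": [1.71, 1.69],
        "BMI": [23.6, 41.6],
    }
)

# BMI = weight (kg) / height (m)^2
recomputed = sample["Weight (kg)"] / sample["Height (m)"] ** 2

print(pd.DataFrame({"reported": sample["BMI"], "recomputed": recomputed.round(1)}))
```

Both recomputed values round to the reported BMI, so keeping height and weight alongside BMI would add redundant information.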

In [3]:
# we're going to drop BP (mmHg) as well, because it's a combination of systolic and diastolic
df = df.drop(
    columns=[
        "CVD Risk Score",
        "Blood Pressure (mmHg)",
        "Waist-to-Height Ratio",
        "CVD Risk Level",
        "Abdominal Circumference (cm)",
        "Height (m)",
        "Weight (kg)",
        "Height (cm)",
    ],
    errors="ignore",
)

df.head()

Unnamed: 0,Sex,Age,BMI,Total Cholesterol (mg/dL),HDL (mg/dL),Fasting Blood Sugar (mg/dL),Smoking Status,Diabetes Status,Physical Activity Level,Family History of CVD,Systolic BP,Diastolic BP,Blood Pressure Category,Estimated LDL (mg/dL)
0,F,32.0,23.6,248.0,78.0,111.0,N,Y,Low,N,125.0,79.0,Elevated,140.0
1,F,55.0,41.6,162.0,50.0,135.0,Y,Y,High,Y,139.0,70.0,Hypertension Stage 1,82.0
2,M,47.0255,26.9,103.0,73.0,114.0,N,N,High,Y,104.0,77.0,Normal,0.0
3,M,44.0,33.4,134.0,46.0,91.0,N,N,High,Y,140.0,83.0,Hypertension Stage 1,58.0
4,F,32.0,28.8,146.0,64.0,141.0,Y,Y,High,N,144.0,83.0,Hypertension Stage 1,52.0


### Recoding guide

Now we need to recode the categorical variables as numbers to make them easier to model: Yes/No variables become binary, and ordered categories become integer levels.

- Sex: 1 for female, 0 for male
- Smoking Status: 1 for yes, 0 for no
- Diabetes Status: 1 for yes, 0 for no
- Physical Activity Level: 0 for low, 1 for moderate, 2 for high
- CVD Risk Level: 1 for intermediary or high risk, 0 for low to no risk (not applied below, because this column was dropped in the previous step)
- Blood Pressure Category: 0 for normal; 1 for elevated; 2 for hypertension stage one; and 3 for hypertension stage two
- Family History of CVD: 0 for no; 1 for yes

In [5]:
# Recoding Sex
df["Sex"] = df["Sex"].replace({"F": 1, "M": 0})

# Recoding Smoking Status
df["Smoking Status"] = df["Smoking Status"].replace({"Y": 1, "N": 0})

# Recoding Diabetes Status
df["Diabetes Status"] = df["Diabetes Status"].replace({"Y": 1, "N": 0})

# Recoding Physical Activity Level
df["Physical Activity Level"] = df["Physical Activity Level"].replace(
    {"Low": 0, "Moderate": 1, "High": 2}
)

# Recoding Blood Pressure Category
df["Blood Pressure Category"] = df["Blood Pressure Category"].replace(
    {"Normal": 0, "Elevated": 1, "Hypertension Stage 1": 2, "Hypertension Stage 2": 3}
)

# Recoding Family History of CVD
df["Family History of CVD"] = df["Family History of CVD"].replace({"Y": 1, "N": 0})
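
One caveat with `.replace` is that a label missing from the mapping is left unchanged without any warning. As an alternative sketch for the ordered variables, `pd.Categorical` with an explicit category order yields the same 0, 1, 2 codes and marks any unexpected label as -1. The toy values below are hypothetical.

```python
import pandas as pd

# Toy values (hypothetical), including one label that is not in the guide.
activity = pd.Series(["Low", "High", "Moderate", "Unknown"])

# Categories listed in the guide's order, so the codes run 0, 1, 2.
levels = ["Low", "Moderate", "High"]
codes = pd.Categorical(activity, categories=levels, ordered=True).codes

print(list(codes))  # labels outside `levels` get the code -1
```

Checking for -1 after encoding is a cheap way to catch typos or unexpected categories before modeling.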

In [6]:
df.head()

Unnamed: 0,Sex,Age,BMI,Total Cholesterol (mg/dL),HDL (mg/dL),Fasting Blood Sugar (mg/dL),Smoking Status,Diabetes Status,Physical Activity Level,Family History of CVD,Systolic BP,Diastolic BP,Blood Pressure Category,Estimated LDL (mg/dL)
0,1,32.0,23.6,248.0,78.0,111.0,0,1,0,0,125.0,79.0,1,140.0
1,1,55.0,41.6,162.0,50.0,135.0,1,1,2,1,139.0,70.0,2,82.0
2,0,47.0255,26.9,103.0,73.0,114.0,0,0,2,1,104.0,77.0,0,0.0
3,0,44.0,33.4,134.0,46.0,91.0,0,0,2,1,140.0,83.0,2,58.0
4,1,32.0,28.8,146.0,64.0,141.0,1,1,2,0,144.0,83.0,2,52.0


In [13]:
has_diabetes = df[df["Diabetes Status"] == 1]
no_diabetes = df[df["Diabetes Status"] == 0]


def print_means(group, label):
    """Print the mean of every numeric column in `group`."""
    print(f"\nMeans for {label}:")
    for col in group.columns:
        if pd.api.types.is_numeric_dtype(group[col]):
            print(f"{col}: {group[col].mean():.2f}")


print_means(has_diabetes, "People with Diabetes")
print_means(no_diabetes, "People without Diabetes")


Means for People with Diabetes:
Sex: 0.51
Age: 46.79
BMI: 28.87
Total Cholesterol (mg/dL): 199.35
HDL (mg/dL): 55.61
Fasting Blood Sugar (mg/dL): 117.80
Smoking Status: 0.51
Diabetes Status: 1.00
Physical Activity Level: 1.04
Family History of CVD: 0.50
Systolic BP: 125.29
Diastolic BP: 82.97
Blood Pressure Category: 1.95
Estimated LDL (mg/dL): 113.06

Means for People without Diabetes:
Sex: 0.50
Age: 47.27
BMI: 28.05
Total Cholesterol (mg/dL): 197.71
HDL (mg/dL): 56.81
Fasting Blood Sugar (mg/dL): 117.16
Smoking Status: 0.52
Diabetes Status: 0.00
Physical Activity Level: 0.99
Family History of CVD: 0.48
Systolic BP: 125.98
Diastolic BP: 82.86
Blood Pressure Category: 1.96
Estimated LDL (mg/dL): 109.99


> Looking at the numerical columns, most of the variables are nearly equal across the two groups. The few I may check are Estimated LDL (113.06 vs. 109.99) and Physical Activity Level (1.04 vs. 0.99). Age and Total Cholesterol look fine but could be checked regardless.
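
Raw means are hard to compare across variables with different units. One common way to put them on the same scale is the standardized mean difference (SMD): the difference in group means divided by a pooled standard deviation. The helper `standardized_mean_diff` and the toy data below are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def standardized_mean_diff(data, group_col):
    """Return (mean_1 - mean_0) / pooled SD for each numeric column."""
    treated = data[data[group_col] == 1].drop(columns=group_col)
    control = data[data[group_col] == 0].drop(columns=group_col)
    numeric = treated.select_dtypes("number").columns
    pooled_sd = np.sqrt((treated[numeric].var() + control[numeric].var()) / 2)
    return (treated[numeric].mean() - control[numeric].mean()) / pooled_sd


# Toy data (hypothetical values), reusing column names from the recoded frame.
toy = pd.DataFrame(
    {
        "Diabetes Status": [1, 1, 1, 0, 0, 0],
        "Age": [40.0, 50.0, 60.0, 41.0, 50.0, 59.0],
        "Estimated LDL (mg/dL)": [120.0, 130.0, 140.0, 100.0, 110.0, 120.0],
    }
)

smd = standardized_mean_diff(toy, "Diabetes Status")
print(smd.round(2))
```

A commonly cited rule of thumb treats an absolute SMD above roughly 0.1 as a sign of imbalance worth addressing, though the cutoff is a convention rather than a hard rule.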