# MAM02 - Week 2
*Authors*: Robin & Matias

*Status*: Draft

*Date*: 8.10.2024

## Introduction

In this analysis, we aim to investigate the association between 20 risk factors and the occurrence of cardiovascular disease (CVD) events in patients with familial hypercholesterolemia (FH).

In [28]:
import pandas as pd

# Load the given SPSS data file (pd.read_spss requires the pyreadstat package)
data = pd.read_spss("GIRAFH.SAV")

# General overview: first rows, column types, and summary statistics
print(data.head())
print(data.info())
print(data.describe())

      sex  height  weight        bmi alcoholuse smoking  systbp  diasbp  \
0  female   174.0    77.0  25.432686         no    Ever   140.0    95.0   
1    male   179.0    65.0  20.286508        yes    Ever   140.0    95.0   
2    male   183.0    85.0  25.381469        yes    Ever   130.0    85.0   
3  female   169.0    63.0  22.058051        yes    Ever   130.0    75.0   
4    male   176.0    88.0  28.409091        yes   Never   120.0    80.0   

  hypertension  Glucose  ...  diabetes familiarHC     Tc   HDL    Tg    Lpa  \
0          yes      4.7  ...      ever        yes  11.22  1.00  1.40    NaN   
1           no      4.5  ...      ever         no  11.50  1.04  0.81  740.0   
2           no      4.7  ...      ever         no  11.36  1.64  1.67    NaN   
3           no      4.6  ...      ever         no  10.21  1.07  1.18    NaN   
4           no      5.4  ...      ever         no  10.09  1.69  1.01    NaN   

   homocysteine  creatinine        age  event  
0          12.1        75.

In [27]:
# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

sex             0
height          0
weight          0
bmi             0
alcoholuse      0
smoking         0
systbp          0
diasbp          0
hypertension    0
Glucose         0
Hba1c           0
diabetes        0
familiarHC      0
Tc              0
HDL             0
Tg              0
Lpa             0
homocysteine    0
creatinine      0
age             0
event           0
event_binary    0
dtype: int64


## Handling Missing Data

*Discarding Missing Values* can lose essential information and may introduce bias, especially if the data are not missing completely at random.

*Multiple Imputation* preserves the dataset's size by filling in the missing values based on the other available data.

*Chosen Method*: Multiple Imputation

*Rationale*: Given the size of the dataset (2,400 patients) and the potential loss of information, we choose multiple imputation to handle missing values. This method lets us use all available data and reduces the bias that could result from simply discarding incomplete records.
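As a rough illustration of what multiple imputation involves, the sketch below uses scikit-learn's `IterativeImputer` (still marked experimental by the library) with `sample_posterior=True`, so that each random seed yields a different completed dataset. The helper `multiply_impute`, the synthetic columns, and the choice of five imputations are assumptions made for illustration and are not taken from the analysis; the sketch handles numeric columns only.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer


def multiply_impute(df, n_imputations=5):
    """Return a list of completed copies of a numeric DataFrame (hypothetical helper)."""
    completed = []
    for seed in range(n_imputations):
        # sample_posterior=True draws each imputed value from a predictive
        # distribution, so different seeds give different completed datasets.
        imputer = IterativeImputer(sample_posterior=True, random_state=seed, max_iter=10)
        filled = imputer.fit_transform(df)
        completed.append(pd.DataFrame(filled, columns=df.columns, index=df.index))
    return completed


# Synthetic stand-in data (not GIRAFH.SAV): Lpa depends on age and has gaps.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.normal(47, 12, 200),
    "bmi": rng.normal(25, 3, 200),
})
toy["Lpa"] = 300 + 2 * toy["age"] + rng.normal(0, 50, 200)
toy.loc[rng.choice(200, 40, replace=False), "Lpa"] = np.nan

imputed_sets = multiply_impute(toy, n_imputations=5)

# Analyse each completed dataset separately, then pool the estimates
# (here the pooled point estimate is simply the average of the per-set means).
lpa_means = [d["Lpa"].mean() for d in imputed_sets]
pooled_mean = float(np.mean(lpa_means))
print(lpa_means, pooled_mean)
```

The spread between the per-dataset estimates reflects the uncertainty due to the missing values, which a single filled-in dataset cannot show.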

In [14]:
from sklearn.impute import SimpleImputer

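# Note: SimpleImputer performs single imputation (one fill value per column),
# which is a simpler stand-in for the multiple imputation described above.
imputer_mean = SimpleImputer(strategy='mean')           # continuous: column mean
imputer_freq = SimpleImputer(strategy='most_frequent')  # categorical: most frequent value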

continuous_vars = ['height', 'weight', 'bmi', 'systbp', 'diasbp', 'Glucose', 'Hba1c', 'Tc', 'HDL', 'Tg', 'Lpa', 'homocysteine', 'creatinine', 'age']
categorical_vars = ['sex', 'alcoholuse', 'smoking', 'hypertension', 'diabetes', 'familiarHC', 'event']

data[continuous_vars] = imputer_mean.fit_transform(data[continuous_vars])

data[categorical_vars] = imputer_freq.fit_transform(data[categorical_vars])

In [15]:
missing_values = data.isnull().sum()
print(missing_values)

sex             0
height          0
weight          0
bmi             0
alcoholuse      0
smoking         0
systbp          0
diasbp          0
hypertension    0
Glucose         0
Hba1c           0
diabetes        0
familiarHC      0
Tc              0
HDL             0
Tg              0
Lpa             0
homocysteine    0
creatinine      0
age             0
event           0
dtype: int64


## Descriptive Statistics

We compare the means of the continuous variables between the event and no-event groups using independent two-sample t-tests to identify significant differences.

In [19]:
from scipy.stats import ttest_ind


event_data = data[data['event'] == 'yes']
no_event_data = data[data['event'] == 'none']

def create_summary_table(var_list):
    # Initialize an empty list to store rows
    rows = []
    for var in var_list:
        event_mean = event_data[var].mean()
        no_event_mean = no_event_data[var].mean()
        # Two-sample t-test (equal variances assumed by default); NaNs are omitted
        t_stat, p_val = ttest_ind(event_data[var], no_event_data[var], nan_policy='omit')

        row = {'Variable': var,
               'Event Mean': event_mean,
               'No Event Mean': no_event_mean,
               'P-Value': p_val}

        rows.append(row)

    summary_table = pd.DataFrame(rows)
    return summary_table

# Generate summary table for continuous variables
summary_continuous = create_summary_table(continuous_vars)

# Display the table
print(summary_continuous)



        Variable  Event Mean  No Event Mean       P-Value
0         height  172.340862     172.528421  6.191569e-01
1         weight   76.434827      74.207192  6.203391e-05
2            bmi   25.631893      24.859753  7.984809e-08
3         systbp  138.245550     133.436406  9.645333e-09
4         diasbp   83.395754      81.196731  1.241415e-06
5        Glucose    5.321026       4.962398  7.917314e-17
6          Hba1c    5.972410       5.698620  6.852495e-11
7             Tc    9.668917       9.469821  1.589363e-02
8            HDL    1.160368       1.234116  1.141435e-07
9             Tg    2.005692       1.707883  3.584563e-13
10           Lpa  407.589526     304.242450  1.226179e-11
11  homocysteine   13.153659      12.103829  5.508880e-05
12    creatinine   84.471159      79.725495  8.022937e-13
13           age   48.419700      46.413791  1.541438e-04
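The t-tests in the cell above use SciPy's default, which assumes equal variances in both groups. As a possible robustness check, the sketch below contrasts that default with Welch's variant (`equal_var=False`) on synthetic blood-pressure-like data. The group sizes, means, and spreads are invented for illustration and are not taken from GIRAFH.SAV.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups with unequal sizes and unequal variances (assumed values).
rng = np.random.default_rng(42)
event_group = rng.normal(loc=138.0, scale=20.0, size=400)
no_event_group = rng.normal(loc=133.0, scale=15.0, size=1600)

# Student's t-test: pools the variance of both groups (SciPy default).
t_student, p_student = ttest_ind(event_group, no_event_group)

# Welch's t-test: does not assume equal variances.
t_welch, p_welch = ttest_ind(event_group, no_event_group, equal_var=False)

print(f"Student: t={t_student:.3f}, p={p_student:.3g}")
print(f"Welch:   t={t_welch:.3f}, p={p_welch:.3g}")
```

If the two variants lead to the same conclusion for a variable, the equal-variance assumption is unlikely to be driving the result.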


For categorical variables, we use chi-squared tests to compare the proportions between the two groups.
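To illustrate what the chi-squared test operates on, here is a minimal sketch with a hypothetical 2x2 table. The counts are made up and only the column names mirror the dataset. Note that `chi2_contingency` applies Yates' continuity correction by default when the table is 2x2.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: 60 ever-smokers (30 with an event), 40 never-smokers (10 with an event).
toy = pd.DataFrame({
    "smoking": ["Ever"] * 60 + ["Never"] * 40,
    "event": ["yes"] * 30 + ["no"] * 30 + ["yes"] * 10 + ["no"] * 30,
})

# Rows are smoking categories, columns are outcome categories.
table = pd.crosstab(toy["smoking"], toy["event"])

# 'expected' holds the counts we would see if smoking and event were independent.
chi2, p_val, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, dof={dof}, p={p_val:.4f}")
```

The test compares the observed counts in `table` with the `expected` counts; a small p-value suggests the proportions differ between the groups.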

In [22]:
from scipy.stats import chi2_contingency


def create_categorical_summary(var_list):
    rows = []
    for var in var_list:
        event_counts = event_data[var].value_counts(normalize=True)
        no_event_counts = no_event_data[var].value_counts(normalize=True)

        # Cross-tabulate the variable against the outcome and test for independence
        contingency_table = pd.crosstab(data[var], data['event'])
        chi2, p_val, dof, ex = chi2_contingency(contingency_table)

        row = {'Variable': var,
               'Event Proportion': event_counts.to_dict(),
               'No Event Proportion': no_event_counts.to_dict(),
               'P-Value': p_val}
        rows.append(row)

    summary_table = pd.DataFrame(rows)
    return summary_table


# Note: 'event' is the outcome itself, so its row in the table is not informative
summary_categorical = create_categorical_summary(categorical_vars)


print(summary_categorical)


       Variable                                   Event Proportion  \
0           sex  {'male': 0.6227621483375959, 'female': 0.37723...   
1    alcoholuse  {'yes': 0.7864450127877238, 'no': 0.2135549872...   
2       smoking  {'Ever': 0.8414322250639387, 'Never': 0.158567...   
3  hypertension  {'no': 0.8286445012787724, 'yes': 0.1713554987...   
4      diabetes  {'ever': 0.8887468030690537, 'never': 0.111253...   
5    familiarHC  {'no': 0.840153452685422, 'yes': 0.15984654731...   
6         event                                       {'yes': 1.0}   

                                 No Event Proportion       P-Value  
0  {'female': 0.5716934487021014, 'male': 0.42830...  6.255881e-19  
1  {'yes': 0.7966625463535228, 'no': 0.2033374536...  5.991600e-01  
2  {'Ever': 0.7255871446229913, 'Never': 0.274412...  5.406920e-10  
3  {'no': 0.9406674907292955, 'yes': 0.0593325092...  4.554371e-18  
4  {'ever': 0.9684796044499382, 'never': 0.031520...  7.813679e-15  
5  {'no': 0.7663782447466

## Univariate Logistic Regression Analyses

#### Performing Univariate Analyses