# Exercise 10
## Section 1: Logistic Regression
The dataset represents data from the Framingham Heart Study, Levy (1999) National Heart Lung and Blood Institute, Center for Bio-Medical Communication. Researchers are interested in studying risk factors for coronary heart disease (CHD).

#### 1. Answer the following:  
    a. What is the outcome?
    b. What are the predictors researchers are interested in?
    c. What is the hypothesis?
------
a. The outcome is chdfate, which indicates whether the patient developed coronary heart disease (CHD) during the follow-up period.  

b. The predictors (or independent variables) are the factors that might influence the risk of developing CHD. These are:  
* sex - gender of the patient
* sbp - systolic blood pressure
* dbp - diastolic blood pressure
* scl - serum cholesterol
* age - age at baseline
* bmi - body mass index
* month - month of baseline exam  

c. The hypothesis is that certain risk factors are significantly associated with the likelihood of developing CHD. The null hypothesis is that there is no relationship between the predictors and the development of CHD. The alternative hypothesis is that there *is* a significant relationship between one or more predictors and the likelihood of developing CHD.
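
Written out for the model fitted in question 4, the hypotheses could take the following form. This is only a sketch: it assumes the nine predictors used in that model (with winter as the reference season), and the symbols $p$ and $\beta_j$ are notation introduced here, not part of the assignment.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{sex} + \beta_2\,\text{sbp} + \beta_3\,\text{dbp} + \beta_4\,\text{scl} + \beta_5\,\text{age} + \beta_6\,\text{bmi} + \beta_7\,\text{spring} + \beta_8\,\text{summer} + \beta_9\,\text{fall}
$$

$$
H_0:\ \beta_1 = \beta_2 = \dots = \beta_9 = 0 \qquad \text{vs.} \qquad H_A:\ \beta_j \neq 0 \ \text{for at least one } j
$$

Here $p = P(\text{chdfate} = 1)$, so each $\beta_j$ is the change in the log-odds of CHD for a one-unit increase in predictor $j$ with the others held fixed.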

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import streamlit as st

#### 2. Import the data, print out a few rows, and compute summary statistics. Is there missing data or other concerns?

There are missing values for both serum cholesterol (scl) and bmi. The two outcome classes are also imbalanced, with fewer than one third of the patients having a positive outcome.

In [None]:
chd = pd.read_csv('framingham_dataset_mod.csv')
chd.head()

In [None]:
chd.describe()

In [None]:
# find percent of each column that is missing
chd[['scl','bmi']].isna().mean()*100

In [None]:
chd['scl'] = chd['scl'].fillna(chd['scl'].median())
chd['bmi'] = chd['bmi'].fillna(chd['bmi'].median())

In [None]:
chd['chdfate'].value_counts()
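
The class imbalance mentioned in the answer above is easier to read as a proportion than as raw counts. The sketch below shows one way to get it; the `outcome` series is made-up toy data standing in for `chd['chdfate']`, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy outcome column standing in for chd['chdfate'] (values are made up)
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="chdfate")

# normalize=True returns proportions instead of raw counts
class_share = outcome.value_counts(normalize=True)
print(class_share)

# Share of positive outcomes (CHD developed)
positive_share = class_share.loc[1]
print(f"Positive share: {positive_share:.0%}")
```

If the imbalance turns out to matter, `train_test_split` accepts `stratify=y`, which keeps the same class proportions in the training and test splits.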

#### 3. Month of the year at baseline is an unwieldy variable meant to adjust for seasonal effects. Rather than put it in the model as is, create 4 binary variables, one for each season. This link will give examples of how to do this. The categories should be winter, spring, summer, & fall and should be defined as follows based on the month:

    a. Winter: 12, 1, 2  
    b. Spring: 3, 4, 5  
    c. Summer: 6, 7, 8  
    d. Fall: 9, 10, 11  


In [None]:
chd['winter'] = chd['month'].isin([12,1,2]).astype(int)
chd['spring'] = chd['month'].isin([3,4,5]).astype(int)
chd['summer'] = chd['month'].isin([6,7,8]).astype(int)
chd['fall'] = chd['month'].isin([9,10,11]).astype(int)
chd.describe()
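
An alternative to four separate `isin` calls is to map each month to a season label once and let pandas build the indicator columns. The sketch below assumes a toy `demo` frame and a hypothetical `month_to_season` lookup; neither exists in the notebook.

```python
import pandas as pd

# Toy month column standing in for chd['month'] (values are made up)
demo = pd.DataFrame({"month": [12, 1, 4, 7, 10, 2, 9]})

# Hypothetical lookup table: month number -> season label
month_to_season = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}
demo["season"] = demo["month"].map(month_to_season)

# One 0/1 column per season; dtype=int gives integers rather than booleans
season_dummies = pd.get_dummies(demo["season"], dtype=int)
demo = pd.concat([demo, season_dummies], axis=1)

# Sanity check: every row belongs to exactly one season
assert (season_dummies.sum(axis=1) == 1).all()
print(demo)
```

Note that `pd.get_dummies(..., drop_first=True)` would drop the alphabetically first level (`fall`), so to keep winter as the baseline it is simpler to leave the `winter` column out of the model by hand, as the notebook does.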

#### 4. Fit a logistic regression model using all the relevant predictor variables (Note: use season, not month. Also, ID is not a predictor variable. Do not use it).

In [None]:
# separate independent and dependent variables
X = chd[['sex',
        'sbp',
        'dbp',
        'scl',
        'age',
        'bmi',
        'spring',
        'summer',   # winter will be my baseline
        'fall']]
y = chd['chdfate']

# Keep 30% for testing
# Pick any value you want for the random_state, but this keeps the work repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# apply a standard scaler to normalize the values of each input variable;
# fit it on the training split only so no test-set information leaks into the model
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

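# Create dataframe of coefficients (inputs were standardized, so each
# odds ratio is per one-standard-deviation increase, not per original unit)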
coef_df = pd.DataFrame({
    'Variable': X.columns,
    'Coefficient': model.coef_[0],
    'Odds_Ratio': np.exp(model.coef_[0])
})
print("\nCoefficient Summary:")
print(coef_df.sort_values('Coefficient', ascending=False))
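
Because the sklearn model above is fit on standardized inputs, its odds ratios describe a one-standard-deviation change. The sketch below shows how such coefficients can be converted back to the original units by dividing by the scaler's fitted standard deviations. It runs on synthetic data (`X_demo`, `y_demo`, and `true_beta` are invented for illustration), and it sets `C=1e6` because `LogisticRegression` applies L2 regularization by default, which would otherwise shrink the estimates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for two predictors on different scales
n = 2000
age = rng.normal(50, 10, n)
sbp = rng.normal(130, 20, n)
X_demo = np.column_stack([age, sbp])

# Synthetic outcome generated from made-up per-unit log-odds
true_beta = np.array([0.03, 0.01])
log_odds = -3.0 + X_demo @ true_beta
y_demo = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

scaler_demo = StandardScaler()
X_demo_scaled = scaler_demo.fit_transform(X_demo)

model_demo = LogisticRegression(C=1e6)  # very weak regularization
model_demo.fit(X_demo_scaled, y_demo)

# A coefficient on a standardized input is the log-odds change per 1 SD
or_per_sd = np.exp(model_demo.coef_[0])

# Dividing by the fitted standard deviation converts it to per original unit
beta_per_unit = model_demo.coef_[0] / scaler_demo.scale_
or_per_unit = np.exp(beta_per_unit)

print("OR per SD:  ", or_per_sd)
print("OR per unit:", or_per_unit)
```

The per-unit values are the ones comparable to the statsmodels odds ratios reported later in the notebook, which are fit on unscaled data.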

#### 5. Conduct model diagnostics. This reference may be helpful.  
    a. Look at distributions of the main predictor variables (excluding the new season variables). Do any require transformation?  
    b. Check to see if collinearity is present. Explain what you find.  
    c. Check linearity for each of the continuous covariates. Do those covariates each have a linear relationship with the outcome?  
    d. Are there outliers?  
    e. Are there at least 5 outcomes per category of sex?


In [None]:
numeric_vars = chd[['sbp', 'dbp', 'scl', 'age', 'bmi', 'chdfate']]

sns.pairplot(numeric_vars, hue='chdfate', diag_kind='kde')
plt.suptitle("Numeric Predictors by CHD Outcome", y=1.02)
plt.show()

Some of the variables are slightly right-skewed, mostly sbp and scl. These could be transformed to improve symmetry, but the skewness is not severe enough to make it necessary for this model.

Based on the plots, sbp and dbp seem to be very highly correlated, so I will create a correlation matrix and run VIF to confirm.

In [None]:
corr_matrix = numeric_vars.corr()

plt.figure(figsize=(7,5))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation Matrix of Numeric Predictors")
plt.show()

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# select the predictors only; VIF does not use the outcome
X = chd[['sex',
        'sbp',
        'dbp',
        'scl',
        'age',
        'bmi',
        'spring',
        'summer',   # winter will be my baseline
        'fall']]
X = sm.add_constant(X)

vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

Collinearity was assessed using a correlation matrix and Variance Inflation Factor (VIF). SBP and DBP showed the strongest correlation (r = 0.78), which is expected given both measure blood pressure. However, all VIF values were below 3, indicating no serious multicollinearity among predictors. Therefore, all variables were retained for the logistic regression model.
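
It may help to see why a correlation of 0.78 is compatible with a VIF below 3. The VIF for predictor j is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. If SBP were explained by DBP alone, R² would be 0.78² ≈ 0.61 and the VIF would be about 2.55. The sketch below computes VIF by hand on synthetic data to illustrate the formula; `x1`, `x2`, `x3`, and the helper `vif` are invented for this example and are not the notebook's variables.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: x2 is built to correlate with x1 (about 0.8), x3 is independent
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
x3 = rng.normal(size=n)
X_demo = np.column_stack([x1, x2, x3])


def vif(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1 - residuals.var() / target.var()
    return 1 / (1 - r_squared)


vifs = [vif(X_demo, j) for j in range(X_demo.shape[1])]
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.2f}")

# Two-variable shortcut using the correlation reported in the text
print(f"1 / (1 - 0.78**2) = {1 / (1 - 0.78**2):.2f}")
```

The correlated pair gets a noticeably higher VIF than the independent column, but still well under the thresholds (often 5 or 10) usually treated as a concern.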

In [None]:
continuous_vars = ['sbp', 'dbp', 'scl', 'age', 'bmi']

for var in continuous_vars:
    sns.regplot(x=chd[var], y=chd['chdfate'], logistic=True, ci=None)
    plt.title(f"Logistic Relationship Between {var} and CHD Outcome")
    plt.show()

Linearity between each continuous predictor and the log-odds of CHD was evaluated using logistic regression smoothed plots. SBP, DBP, SCL, and age showed clear monotonic and approximately linear relationships with the log-odds of CHD. BMI showed a slightly curved association, indicating weaker adherence to the linearity assumption; however, the deviation was not substantial enough to warrant transformation for the purposes of this analysis.
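
One caveat: `regplot(..., logistic=True)` draws a fitted logistic curve, which is S-shaped by construction, so it cannot reveal much departure from linearity on its own. A complementary check is to bin the predictor and plot the empirical log-odds per bin against the bin mean. The sketch below shows the calculation on synthetic data; the `demo` frame and its coefficients are invented, and with the real data `x` would be a column such as `chd['bmi']` and `y` would be `chd['chdfate']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic stand-in for one continuous predictor and a binary outcome
n = 5000
x = rng.normal(26, 4, n)                    # a BMI-like variable
p = 1 / (1 + np.exp(-(-3.0 + 0.08 * x)))    # linear in the log-odds by design
demo = pd.DataFrame({"x": x, "y": rng.binomial(1, p)})

# Cut the predictor into 10 bins with roughly equal counts
demo["bin"] = pd.qcut(demo["x"], q=10)

grouped = demo.groupby("bin", observed=True).agg(
    x_mean=("x", "mean"), events=("y", "sum"), n=("y", "size")
)

# Empirical log-odds per bin; the 0.5 offsets guard against bins
# that contain no events or only events
grouped["log_odds"] = np.log(
    (grouped["events"] + 0.5) / (grouped["n"] - grouped["events"] + 0.5)
)
print(grouped[["x_mean", "log_odds"]])
```

If the relationship is linear in the log-odds, the points of `log_odds` against `x_mean` fall roughly on a straight line; a visible bend would suggest a transformation or an added polynomial term.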

In [None]:
plt.figure(figsize=(12, 6))
for i, var in enumerate(numeric_vars, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(x=chd[var])
    plt.title(f'Boxplot of {var}')
plt.tight_layout()
plt.show()

There are definitely some outliers in the data, mainly on the high end of SBP, DBP, SCL, and BMI. However, they appear to be realistic values rather than data errors, especially in a medical context. Since the sample size is large and the outliers are not extreme enough to distort the analysis, I chose to keep them in the model.
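
To put a number on how many points the boxplots flag, the usual 1.5 × IQR whisker rule can be applied directly. The sketch below does this on a synthetic right-skewed series; `values` is made-up data standing in for a column such as `chd['sbp']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic right-skewed stand-in for a measurement such as sbp
values = pd.Series(rng.lognormal(mean=4.85, sigma=0.15, size=1000), name="sbp_demo")

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside these fences are what a default boxplot draws as individual dots
is_outlier = (values < lower) | (values > upper)
print(f"Fences: [{lower:.1f}, {upper:.1f}]")
print(f"Flagged: {is_outlier.sum()} of {len(values)} ({is_outlier.mean():.1%})")
```

A count like this only says which points are unusual relative to the bulk of the data; whether they are plausible measurements is still a judgment call, as argued above.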

In [None]:
chd.groupby('sex')['chdfate'].value_counts()

I checked the number of CHD cases separately for males and females, and both groups had well over the minimum of 5 events. This means we have enough outcomes in each sex category to reliably estimate the effect of sex in the logistic regression model. Therefore, sex can be included as a valid predictor.

In [None]:
# Refit using statsmodels (no scaling needed for odds ratios)
X = chd[['sex', 'sbp', 'dbp', 'scl', 'age', 'bmi', 'spring', 'summer', 'fall']]
y = chd['chdfate']

X = sm.add_constant(X)   # add intercept
logit_model = sm.Logit(y, X).fit()
print(logit_model.summary())

# Odds Ratios and 95% CI
params = logit_model.params
conf = logit_model.conf_int()
conf.columns = ['2.5%', '97.5%']
conf['coef'] = params

# Exponentiate the log-odds scale to get odds ratios and their CI bounds
conf['OR'] = np.exp(conf['coef'])
conf['OR_2.5%'] = np.exp(conf['2.5%'])
conf['OR_97.5%'] = np.exp(conf['97.5%'])

print("\nOdds Ratios with 95% CI:")
print(conf[['OR', 'OR_2.5%', 'OR_97.5%']])

Looking at the odds ratios, sex stands out the most. The OR for sex is 0.45, which means females have about 55% lower odds of developing CHD compared to males, holding the other predictors constant. So sex has a clear association with CHD in this model.

For the continuous variables, the per-unit effects are small, but they compound over realistic ranges. Systolic blood pressure (SBP) slightly increases CHD risk, with about 0.8% higher odds for each 1 mmHg increase. Diastolic blood pressure (DBP) also trends upward, but it’s not statistically significant since its confidence interval includes 1. Cholesterol (SCL) goes up too, with about a 0.7% increase in odds for every 1 mg/dL increase. Age is also important, with roughly a 1.8% increase in odds per year, and BMI has the largest per-unit effect at about a 4.7% increase in odds for each 1 unit (1 kg/m²) increase.
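
Per-unit odds ratios can look negligible, so it can be useful to rescale them to a clinically meaningful step. Because log-odds are additive, a k-unit change multiplies the odds by OR to the power k. The sketch below applies this to the rounded odds ratios quoted in the paragraph above; the increments in `increments` are hypothetical choices for illustration, not part of the analysis.

```python
import math

# Per-unit odds ratios as quoted in the text (rounded)
or_per_unit = {"sbp": 1.008, "scl": 1.007, "age": 1.018, "bmi": 1.047}

# Hypothetical increments chosen for illustration
increments = {"sbp": 10, "scl": 10, "age": 10, "bmi": 5}

scaled_or = {}
for name, odds_ratio in or_per_unit.items():
    k = increments[name]
    # Log-odds add up, so a k-unit change multiplies the odds by OR**k
    scaled_or[name] = math.exp(k * math.log(odds_ratio))
    print(f"{name}: OR per {k} units = {scaled_or[name]:.3f}")
```

For example, the 0.8% per mmHg effect of SBP becomes roughly an 8% increase in odds per 10 mmHg, and the 1.8% per year effect of age becomes roughly 20% per decade.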

The seasonal variables don’t really show meaningful differences. Spring is slightly higher and summer/fall are slightly lower compared to winter, but none of the seasonal effects were significant.
