A naive Bayes classifier is used here to predict ten-year coronary heart disease (CHD) outcomes from 15 health attributes.

In [2]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import f1_score

In [3]:
# A previously-cleaned dataset is used here. 
df = pd.read_csv('framingham_cleaned.csv')
df.head(10)

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,"[32, 42.0]",4.0,0,"[0.0, 0.0]",0.0,0,0,0,"[107.0, 206.0]","[83.5, 117.0]","[48.0, 75.0]","(25.41, 28.04]","(75.0, 83.0]","(72.0, 80.0]",no
1,0,"(42.0, 49.0]",2.0,0,"[0.0, 0.0]",0.0,0,0,0,"(234.0, 262.0]","(117.0, 128.0]","(75.0, 82.0]","(28.04, 56.8]","(83.0, 143.0]","(72.0, 80.0]",no
2,1,"(42.0, 49.0]",1.0,1,"(0.0, 20.0]",0.0,0,0,0,"(234.0, 262.0]","(117.0, 128.0]","(75.0, 82.0]","(23.08, 25.41]","(68.0, 75.0]","[40.0, 72.0]",no
3,0,"(56.0, 70]",3.0,1,"(20.0, 70.0]",0.0,0,1,0,"(206.0, 234.0]","(144.0, 295.0]","(89.88, 142.5]","(28.04, 56.8]","[44.0, 68.0]","(85.0, 394.0]",yes
4,0,"(42.0, 49.0]",3.0,1,"(20.0, 70.0]",0.0,0,0,0,"(262.0, 696.0]","(128.0, 144.0]","(82.0, 89.88]","(23.08, 25.41]","(83.0, 143.0]","(80.0, 85.0]",no
5,0,"(42.0, 49.0]",2.0,0,"[0.0, 0.0]",0.0,0,1,0,"(206.0, 234.0]","(144.0, 295.0]","(89.88, 142.5]","(28.04, 56.8]","(75.0, 83.0]","(85.0, 394.0]",no
6,0,"(56.0, 70]",1.0,0,"[0.0, 0.0]",0.0,0,0,0,"[107.0, 206.0]","(128.0, 144.0]","[48.0, 75.0]","(28.04, 56.8]","[44.0, 68.0]","(80.0, 85.0]",yes
7,0,"(42.0, 49.0]",2.0,1,"(0.0, 20.0]",0.0,0,0,0,"(262.0, 696.0]","[83.5, 117.0]","[48.0, 75.0]","[15.54, 23.08]","(75.0, 83.0]","(72.0, 80.0]",no
8,1,"(49.0, 56.0]",1.0,0,"[0.0, 0.0]",0.0,0,1,0,"(234.0, 262.0]","(128.0, 144.0]","(82.0, 89.88]","(25.41, 28.04]","(75.0, 83.0]","(72.0, 80.0]",no
9,1,"(42.0, 49.0]",1.0,1,"(20.0, 70.0]",0.0,0,1,0,"(206.0, 234.0]","(144.0, 295.0]","(89.88, 142.5]","(23.08, 25.41]","(83.0, 143.0]","(85.0, 394.0]",no


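The scikit-learn naive Bayes implementation requires numeric input, so the binned attribute values must be converted to integer category codes. The attributes have already been binned at this stage, so this is straightforward. Note that pandas assigns the codes in sorted order of the labels, so a code does not necessarily reflect a bin's numeric position: the lowest age bin, "[32, 42.0]", receives code 3 in the output below.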

In [60]:
df_c = df.astype('category')

for col in df_c.columns:
    df_c[col] = df_c[col].cat.codes

df_c.head(10)

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,3,3,0,2,0,0,0,0,3,3,3,1,1,0,0
1,0,0,1,0,2,0,0,0,0,1,0,0,2,2,0,0
2,1,0,0,1,0,0,0,0,0,1,0,0,0,0,3,0
3,0,2,2,1,1,0,0,1,0,0,2,2,2,3,2,1
4,0,0,2,1,1,0,0,0,0,2,1,1,0,2,1,0
5,0,0,1,0,2,0,0,1,0,0,2,2,2,1,2,0
6,0,2,0,0,2,0,0,0,0,3,1,3,2,3,1,1
7,0,0,1,1,0,0,0,0,0,2,3,3,3,1,0,0
8,1,1,0,0,2,0,0,1,0,1,1,1,1,1,0,0
9,1,0,0,1,1,0,0,1,0,0,2,2,0,2,2,0
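
Because the order of the integer codes matters to any model that treats them as numbers, it can be worth checking how pandas assigns them. The following is a minimal sketch on a hypothetical toy column (`toy` and `bin_order` are illustrative names, not part of this notebook). It assumes the bin labels are stored as plain strings, as they would be after reading the CSV.

```python
import pandas as pd

# Hypothetical toy column mimicking the binned age labels shown above.
toy = pd.DataFrame(
    {"age": ["[32, 42.0]", "(42.0, 49.0]", "(56.0, 70]", "(49.0, 56.0]"]}
)

# Default encoding: categories are sorted as strings, and "[" sorts after "(",
# so the lowest bin "[32, 42.0]" receives the highest code.
default_codes = toy["age"].astype("category").cat.codes.tolist()
print(default_codes)  # [3, 0, 2, 1]

# Explicit encoding: list the bins from lowest to highest so that the
# codes follow the numeric order of the bins.
bin_order = ["[32, 42.0]", "(42.0, 49.0]", "(49.0, 56.0]", "(56.0, 70]"]
ordered = pd.Categorical(toy["age"], categories=bin_order, ordered=True)
ordered_codes = ordered.codes.tolist()
print(ordered_codes)  # [0, 1, 3, 2]
```

Whether the ordering matters depends on the model: it is irrelevant to a model that treats each code as an unordered label, but not to one that treats the codes as continuous values.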


In [61]:
# Assigning x to the independent attributes and y to the dependent outcome in the final column.
x, y = df_c.iloc[:, :-1].values, df_c.iloc[:, -1].values

In [63]:
# Splitting the data into train (80%) and test (20%) subsets. The printed shapes confirm the expected split.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=10)
print(x_train.shape, y_train.shape)

(3390, 15) (3390,)
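
With a positive rate as low as the one in this dataset, a purely random split can leave the train and test subsets with slightly different class ratios. As a hedged sketch, `train_test_split` accepts a `stratify` argument that preserves the ratio. The arrays `x_demo` and `y_demo` below are hypothetical stand-ins built from the class counts reported further down in this notebook, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in labels using the notebook's class counts:
# 644 CHD-positive and 3594 CHD-negative entries.
y_demo = np.array([1] * 644 + [0] * 3594)
x_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo asks train_test_split to keep the class ratio
# (approximately) equal in the train and test subsets.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

print(x_tr.shape, round(y_tr.mean(), 3), round(y_te.mean(), 3))
```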


In [64]:
# Defining and training the NB model. 
model = GaussianNB()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
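
`GaussianNB` models each feature as a continuous, normally distributed variable, so the integer codes are treated as numbers rather than as category labels. scikit-learn also provides `CategoricalNB`, which models each feature as a discrete distribution over categories. The sketch below is not what this notebook does. It only illustrates, on randomly generated hypothetical data (`x_demo`, `y_demo`), how such a swap might look, assuming every feature takes integer codes 0 to 3.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(10)

# Hypothetical integer-coded data: 200 samples, 5 features with codes 0..3.
x_demo = rng.integers(0, 4, size=(200, 5))
y_demo = rng.integers(0, 2, size=200)

# min_categories=4 reserves a slot for every code, even if a code
# happens to be absent from the training rows.
cat_model = CategoricalNB(min_categories=4)
cat_model.fit(x_demo[:160], y_demo[:160])
demo_pred = cat_model.predict(x_demo[160:])

print(demo_pred[:10])
```

Since the demo labels are random, the predictions carry no meaning. The point is only the API shape, which mirrors the `GaussianNB` cell above.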

In [66]:
# Predicting two sample entries that are known to have CHD-positive outcomes. Both are predicted to be CHD-negative.
predicted1 = model.predict([[0,2,2,1,1,0,0,1,0,0,2,2,2,3,2]])
predicted2 = model.predict([[1,2,0,0,2,0,0,0,0,3,1,3,2,3,1]])
print(predicted1, predicted2)

[0] [0]


In [70]:
accuracy = metrics.accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("accuracy: {} \nF1 score: {}".format(accuracy, f1))

accuracy: 0.8372641509433962 
F1 score: 0.11538461538461539


This looks like a decent accuracy score until the low base rate of CHD in the dataset is considered. A model that simply predicted CHD-negative for every entry would attain an accuracy of roughly 85%, since only ~15% of the entries in the dataset are CHD-positive. The very low F1 score of ~0.12 confirms that the model is poorly suited to identifying positive cases.
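
The baseline argument can be checked directly. The sketch below builds hypothetical label arrays from the class counts of the full dataset (644 positive and 3594 negative, as counted later in this notebook) and scores a predictor that always outputs CHD-negative. The names `y_true` and `y_all_negative` are illustrative, and the exact counts in the test split may differ slightly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels built from the full-dataset class counts.
y_true = np.array([1] * 644 + [0] * 3594)

# A trivial "model" that predicts CHD-negative for every entry.
y_all_negative = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_negative)
# zero_division=0 returns 0 (without a warning) when no positives are predicted.
baseline_f1 = f1_score(y_true, y_all_negative, zero_division=0)

print(round(baseline_accuracy, 3), baseline_f1)  # 0.848 0.0
```

The trivial predictor matches the trained model's accuracy while scoring an F1 of zero, which is why accuracy alone is a weak measure here.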

The dataset will be balanced to address this issue and to see whether any significant improvement can be attained.

In [52]:
# Proportion of positive CHD cases in the dataset.
sum(df_c['TenYearCHD'])/len(df_c)

0.1519584709768759

In [53]:
# Since the positive rate is so low here, consider training and testing on a slice of the dataset that brings this ratio to parity.
num_1 = sum(df_c['TenYearCHD'])
num_0 = len(df_c['TenYearCHD']) - num_1
print(num_1, num_0)

644 3594


In [54]:
# Here, CHD-negative rows are dropped until the rate of CHD is at 50%.

df_cp = df_c.copy()

# The rows are first shuffled to remove any kind of trends in ordering.
df_cp = df_cp.sample(frac = 1, random_state=10)

# Dropping 2950 CHD-negative rows (3594 - 644) brings both sample sizes to 644.
df_cp2 = df_cp.drop(df_cp[df_cp.TenYearCHD==0].index[:2950])

# Proportion of positive CHD cases in the balanced dataset.
sum(df_cp2['TenYearCHD'])/len(df_cp2)

0.5
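
The number of rows to drop (2950) is hard-coded above. As a hedged alternative, the same balancing can be derived from the class counts themselves. The helper `undersample_to_parity` below is a hypothetical name, and the toy frame stands in for `df_c`. It samples each class down to the size of the smallest one.

```python
import pandas as pd

def undersample_to_parity(frame, label_col, random_state=10):
    """Randomly sample every class down to the size of the smallest class."""
    n_minority = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in frame.groupby(label_col)
    ]
    # Shuffle the combined rows so the classes are interleaved.
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Toy stand-in for df_c: 3 positive rows and 7 negative rows.
toy = pd.DataFrame(
    {"feature": range(10), "TenYearCHD": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]}
)
balanced = undersample_to_parity(toy, "TenYearCHD")
print(balanced["TenYearCHD"].value_counts().to_dict())
```

Applied to this notebook's data, the call would be `undersample_to_parity(df_c, 'TenYearCHD')`, though the rows selected would differ from those kept by the drop-based approach above.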

In [55]:
# Reattempt training a model.

x2, y2 = df_cp2.iloc[:, :-1].values, df_cp2.iloc[:, -1].values

x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2, test_size=0.2, random_state=10)
print(x2_train.shape, y2_train.shape)

(1030, 15) (1030,)


In [56]:
model2 = GaussianNB()
model2.fit(x2_train, y2_train)
y2_pred = model2.predict(x2_test)

In [69]:
accuracy2 = metrics.accuracy_score(y2_test, y2_pred)
f1_2 = f1_score(y2_test, y2_pred)

print("accuracy: {} \nF1 score: {}".format(accuracy2, f1_2))

accuracy: 0.5697674418604651 
F1 score: 0.3018867924528302


The naive Bayes model trained on the class-balanced sample reaches a prediction accuracy of ~57%, a stark decline from the previous ~84%, which was inflated by the class imbalance. With the two classes at parity, guessing alone would be expected to score roughly 50%, so the model is only modestly better than chance.

Although the F1 score improved from ~0.12 to ~0.30, it is still quite low and indicates that the model is poorly suited to this task.

This is a rather low prediction accuracy. If CHD is a health outcome that involves a large amount of variance, a naive Bayes model may simply be unable to predict it accurately for individuals from these attributes. Additional tools are explored in other analyses.