## Random Forest
Random forest is an ensemble learning method that constructs multiple decision trees during training and outputs the class that most of the trees vote for. To increase the probability of detecting patients with the disease, we customized the voting mechanism of the random forest.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt  # we only need pyplot
from scipy.stats import mode, ttest_ind
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    KBinsDiscretizer, LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler,
)

sb.set()



In [3]:
df = pd.read_csv('CVD_cleaned.csv')


print(df.head())

  General_Health                  Checkup Exercise Heart_Disease Skin_Cancer  \
0           Poor  Within the past 2 years       No            No          No   
1      Very Good     Within the past year       No           Yes          No   
2      Very Good     Within the past year      Yes            No          No   
3           Poor     Within the past year      Yes           Yes          No   
4           Good     Within the past year       No            No          No   

  Other_Cancer Depression Diabetes Arthritis     Sex Age_Category  \
0           No         No       No       Yes  Female        70-74   
1           No         No      Yes        No  Female        70-74   
2           No         No      Yes        No  Female        60-64   
3           No         No      Yes        No    Male        75-79   
4           No         No       No        No    Male          80+   

   Height_(cm)  Weight_(kg)    BMI Smoking_History  Alcohol_Consumption  \
0        150.0        32.66  

In [4]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Encode each categorical column as integer labels
categorical_columns = [
    'General_Health', 'Checkup', 'Exercise', 'Skin_Cancer', 'Other_Cancer',
    'Depression', 'Diabetes', 'Arthritis', 'Sex', 'Age_Category',
    'Heart_Disease', 'Smoking_History',
]
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column])
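
One thing to keep in mind about this encoding: `LabelEncoder` numbers the classes in sorted order, so for string labels the integer codes follow alphabetical order and not any natural ranking. The sketch below mimics that behaviour in plain Python on the three `General_Health` values visible in the preview above. The helper name `alphabetical_codes` is hypothetical and exists only for illustration.

```python
def alphabetical_codes(values):
    """Map each distinct label to its index in sorted order (mimics LabelEncoder)."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}

# The three General_Health values that appear in df.head() above.
codes = alphabetical_codes(["Poor", "Very Good", "Good"])
print(codes)  # {'Good': 0, 'Poor': 1, 'Very Good': 2}
```

Here 'Poor' lands between 'Good' and 'Very Good'. Tree-based models split on thresholds and can often cope with this, but an explicit ordered mapping may be worth considering for ordinal columns.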

In [6]:
# Separate the features from the target variable
X = df.drop('Heart_Disease', axis=1)
y = df['Heart_Disease']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train the random forest
rf_clf = RandomForestClassifier(n_estimators=20, random_state=42)
rf_clf.fit(X_train, y_train)

# Predict on the test set and evaluate the model
y_pred = rf_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')

# Print per-class precision, recall and F1-score
print(classification_report(y_test, y_pred))

# Print the forest's predictions for the first 20 test samples
print('Voting results for the first 20 samples:')
voting_results = rf_clf.predict(X_test.head(20))
print(voting_results)

Accuracy: 0.9166
              precision    recall  f1-score   support

           0       0.92      0.99      0.96     70930
           1       0.38      0.04      0.07      6284

    accuracy                           0.92     77214
   macro avg       0.65      0.52      0.51     77214
weighted avg       0.88      0.92      0.88     77214

Voting results for the first 20 samples:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
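
The accuracy above is easier to interpret next to a trivial baseline. Using only the support counts from the report (70930 samples of class 0 and 6284 of class 1), the snippet below computes the accuracy of always predicting class 0, and the approximate number of class-1 patients the forest catches at the reported recall of 0.04. The variable names are illustrative.

```python
# Support counts copied from the classification report above.
n_negative = 70930      # class 0 (no heart disease)
n_positive = 6284       # class 1 (heart disease)
recall_positive = 0.04  # reported recall for class 1 (rounded in the report)

# Accuracy of a model that always predicts class 0.
baseline_accuracy = n_negative / (n_negative + n_positive)

# Approximate number of class-1 patients detected at the reported recall.
detected = round(recall_positive * n_positive)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Approx. class-1 patients detected: {detected} of {n_positive}")
```

The baseline comes out at about 0.9186, slightly above the forest's 0.9166, so accuracy alone says little about how well the minority class is detected.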


## Innovation
To deal with the imbalanced data set, we adjust the voting rule of the random forest so that it returns the minority class more readily: a sample is labelled 1 whenever the number of trees voting for class 1 exceeds a threshold, instead of requiring a majority. We sweep the threshold in a for loop and compare the classification report at each value to decide which threshold works best. Our group came up with this idea as an alternative to the default majority vote.

In [7]:
def custom_voting(votes, threshold):
    # Convert votes to integers
    votes = votes.astype(int)
    # Count the votes for each class
    counts = np.bincount(votes)
    # If the count of class 1 is greater than the threshold, return 1, else return 0
    return 1 if len(counts) > 1 and counts[1] > threshold else 0

# Get the predictions for each sample in the test set
predictions = np.array([tree.predict(X_test) for tree in rf_clf.estimators_]).T

# Use the custom voting function to get the final predictions for thresholds from 0 to 20
for threshold in range(21):
    y_pred_custom = np.array([custom_voting(p, threshold) for p in predictions])
    accuracy_custom = accuracy_score(y_test, y_pred_custom)
    print(f'Threshold: {threshold}, Custom Accuracy: {accuracy_custom:.4f}')
    print(classification_report(y_test, y_pred_custom))






Threshold: 0, Custom Accuracy: 0.5196
              precision    recall  f1-score   support

           0       0.98      0.49      0.65     70930
           1       0.13      0.89      0.23      6284

    accuracy                           0.52     77214
   macro avg       0.56      0.69      0.44     77214
weighted avg       0.91      0.52      0.62     77214

Threshold: 1, Custom Accuracy: 0.6771
              precision    recall  f1-score   support

           0       0.97      0.67      0.79     70930
           1       0.17      0.78      0.28      6284

    accuracy                           0.68     77214
   macro avg       0.57      0.72      0.54     77214
weighted avg       0.91      0.68      0.75     77214

Threshold: 2, Custom Accuracy: 0.7634
              precision    recall  f1-score   support

           0       0.96      0.77      0.86     70930
           1       0.20      0.66      0.31      6284

    accuracy                           0.76     77214
   macro avg  



Threshold: 19, Custom Accuracy: 0.9186
              precision    recall  f1-score   support

           0       0.92      1.00      0.96     70930
           1       0.00      0.00      0.00      6284

    accuracy                           0.92     77214
   macro avg       0.46      0.50      0.48     77214
weighted avg       0.84      0.92      0.88     77214





Threshold: 20, Custom Accuracy: 0.9186
              precision    recall  f1-score   support

           0       0.92      1.00      0.96     70930
           1       0.00      0.00      0.00      6284

    accuracy                           0.92     77214
   macro avg       0.46      0.50      0.48     77214
weighted avg       0.84      0.92      0.88     77214
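
The per-sample loop in `custom_voting` can also be expressed as a single array operation, because with 0/1 labels the number of votes for class 1 is just the row sum of the vote matrix. The sketch below shows this on a small made-up vote matrix (4 samples, 5 trees). `vote_matrix` and `predict_with_threshold` are hypothetical names, and the rule assumes binary labels encoded as 0 and 1.

```python
import numpy as np

def predict_with_threshold(vote_matrix, threshold):
    """Return 1 where more than `threshold` trees voted for class 1, else 0.

    vote_matrix: array of shape (n_samples, n_trees) holding 0/1 votes.
    """
    positive_votes = vote_matrix.astype(int).sum(axis=1)
    return (positive_votes > threshold).astype(int)

# Made-up votes for illustration only: 4 samples, 5 trees.
vote_matrix = np.array([
    [0, 0, 0, 0, 0],   # 0 positive votes
    [1, 0, 0, 0, 0],   # 1 positive vote
    [1, 1, 0, 0, 1],   # 3 positive votes
    [1, 1, 1, 1, 1],   # 5 positive votes
])

print(predict_with_threshold(vote_matrix, threshold=0))  # [0 1 1 1]
print(predict_with_threshold(vote_matrix, threshold=2))  # [0 0 1 1]
```

Applied to the notebook's `predictions` array, `predict_with_threshold(predictions, threshold)` should give the same labels as the list comprehension over `custom_voting`, while avoiding a Python-level loop over the test samples.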





## Evaluation
We used a for loop to step the threshold from 0 to 20 and observed how the random forest performed at each value. A few typical values illustrate the trade-off: a threshold of 0 raises the recall for class 1 substantially (0.89), but it sacrifices too much performance on class 0, so the overall accuracy (0.52) is unsatisfying. A voting threshold of 3, by contrast, gives good overall accuracy while still identifying half of the patients.
Compared with Naive Bayes, the random forest with a threshold of 4 has approximately the same recall for class 1, while its overall accuracy is about 2.5% higher, which suggests that the random forest is the better model in this case.
A doctor can set the threshold to suit his or her own needs by referring to the reports for threshold values from 0 to 20, which makes the model more flexible.
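
One way to make the choice of threshold explicit is to fix a minimum recall for class 1 and then take the threshold with the best accuracy among those that meet it. This is a sketch of such a selection rule and not the procedure used above. The function names, the 0.5 recall target, and the toy data are all assumptions for illustration.

```python
import numpy as np

def recall_and_accuracy(y_true, y_pred):
    """Return (recall of class 1, overall accuracy) for binary 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    positives = y_true == 1
    recall = (y_pred[positives] == 1).mean() if positives.any() else 0.0
    accuracy = (y_true == y_pred).mean()
    return float(recall), float(accuracy)

def choose_threshold(vote_matrix, y_true, min_recall):
    """Pick the vote threshold with the highest accuracy whose class-1 recall
    is at least `min_recall`. Returns None if no threshold qualifies."""
    n_trees = vote_matrix.shape[1]
    positive_votes = vote_matrix.sum(axis=1)
    best = None
    for threshold in range(n_trees + 1):
        y_pred = (positive_votes > threshold).astype(int)
        recall, accuracy = recall_and_accuracy(y_true, y_pred)
        if recall >= min_recall and (best is None or accuracy > best[2]):
            best = (threshold, recall, accuracy)
    return best

# Toy data (made up): 6 samples, 4 trees, two true positives.
vote_matrix = np.array([
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
])
y_true = np.array([0, 0, 0, 0, 1, 1])

# Returns a (threshold, recall, accuracy) tuple.
print(choose_threshold(vote_matrix, y_true, min_recall=0.5))
```

On the toy data this selects threshold 2, which keeps recall at 0.5 while giving the highest accuracy among the qualifying thresholds. In practice the recall target would be a clinical decision, and the selection is better made on a validation split than on the test set.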
