# 9. Ensemble Methods

## Ensemble Methods for Diabetes Prediction

Ensemble methods are machine learning techniques that combine multiple individual models to make more accurate and robust predictions. This notebook demonstrates the application of three ensemble methods: Voting Classifier, Bagging Classifier, and Random Forest Classifier, for predicting diabetes using the Pima Indians Diabetes Dataset.

1. **Voting Classifier**: The Voting Classifier combines the predictions of multiple individual classifiers (Logistic Regression, Random Forest, and SVM) to make the final prediction. It works by taking the majority vote of the predictions made by each individual classifier. By combining the strengths of different classifiers, the Voting Classifier aims to improve the overall accuracy and robustness of the predictions.

2. **Bagging Classifier**: The Bagging (Bootstrap Aggregating) Classifier creates multiple instances of a base estimator (in this case, a Decision Tree) on different subsets of the training data, obtained through bootstrap sampling. Each base estimator is trained independently on its respective subset, and their predictions are combined to make the final prediction. Bagging helps to reduce overfitting and improve the stability and accuracy of the predictions.

3. **Random Forest Classifier**: The Random Forest Classifier is an extension of the Bagging method that introduces an additional layer of randomness. It creates an ensemble of Decision Trees, where each tree is trained on a bootstrap sample of the training data and considers only a random subset of the features when choosing each split. The final prediction is made by aggregating the predictions of all the individual trees. Random Forests handle high-dimensional data well, reduce overfitting, and provide feature importance scores.

The notebook compares the performance of these ensemble methods with individual classifiers and analyzes their classification reports to assess their accuracy, precision, recall, and F1-score. By leveraging the power of ensemble methods, the notebook aims to improve the prediction of diabetes compared to using a single model.

Additionally, the notebook demonstrates how to measure feature importance in a Random Forest Classifier, which helps identify the most influential features for diabetes prediction. This information can be valuable for understanding the underlying factors contributing to diabetes and guiding further analysis or interventions.

Overall, the ensemble methods in this notebook showcase how combining multiple models can lead to more accurate and robust predictions in the context of diabetes prediction using the Pima Indians Diabetes Dataset.

First, we import the libraries we need.

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

Then we load the dataset from the URL below using `pd.read_csv()`, separate the features (`X`) from the target variable (`y`) by dropping the `Outcome` column from the feature set, and split the data into training and test sets.

In [20]:
# Load the dataset
file_path = 'https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-Dataset/master/diabetes.csv'
data = pd.read_csv(file_path)

# Separate features and target
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=9)

## 9.1 Voting Classifier
Next, we create the individual classifiers and the voting classifier. We set the voting parameter to "hard", which means the majority vote among the individual classifiers is used for the final prediction.

In [19]:
# Create individual classifiers
log_clf = LogisticRegression(max_iter=1000)  # a larger max_iter so that the solver converges
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

# Create the voting classifier
voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("rf", rnd_clf), ("svm", svm_clf)],
    voting="hard"
)

After that, we train and evaluate each individual classifier.

In [21]:
# Train and evaluate individual classifiers
svm_clf.fit(X_train, y_train)
svm_y_predict = svm_clf.predict(X_test)
print(f"SVM accuracy: {accuracy_score(y_test, svm_y_predict):.2f}")

log_clf.fit(X_train, y_train)
log_y_predict = log_clf.predict(X_test)
print(f"Logistic Regression accuracy: {accuracy_score(y_test, log_y_predict):.2f}")

rnd_clf.fit(X_train, y_train)
rnd_y_predict = rnd_clf.predict(X_test)
print(f"Random Forest accuracy: {accuracy_score(y_test, rnd_y_predict):.2f}")

SVM accuracy: 0.75
Logistic Regression accuracy: 0.75
Random Forest accuracy: 0.73


Finally, we train and evaluate the voting classifier.

In [23]:
# Train and evaluate the voting classifier
voting_clf.fit(X_train, y_train)
y_predict = voting_clf.predict(X_test)
print(f"Voting Classifier accuracy: {accuracy_score(y_test, y_predict):.2f}")

Voting Classifier accuracy: 0.76
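
To make the majority-vote rule concrete, here is a minimal sketch of hard voting done by hand. The prediction values are made up for illustration and are not the outputs of the models trained above; the three rows stand in for any three binary classifiers.

```python
import numpy as np

# Hypothetical 0/1 predictions from three classifiers for five patients
# (made-up values, not taken from the models trained above).
preds = np.array([
    [1, 0, 1, 0, 1],  # classifier A
    [1, 0, 0, 0, 1],  # classifier B
    [0, 1, 1, 0, 1],  # classifier C
])

# Hard voting for binary labels: predict 1 when more than half of the
# classifiers vote 1, otherwise predict 0.
votes_for_one = preds.sum(axis=0)
majority = (votes_for_one > preds.shape[0] / 2).astype(int)
print(majority)
```

Using an odd number of classifiers, as in the notebook, means a binary vote can never tie.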


## 9.2 Bagging Classifier

### What is a Bagging Classifier?

Bagging, short for Bootstrap Aggregating, is an ensemble learning technique used to improve the stability and accuracy of machine learning models. It works by creating multiple instances of a base estimator (e.g., decision trees, neural networks, etc.) on different subsets of the training data, and then combining their predictions to make the final prediction.

The key steps involved in the bagging classifier are:

1. **Bootstrap Sampling**: The original training dataset is sampled multiple times with replacement to create several bootstrap samples. Each bootstrap sample is used to train a separate instance of the base estimator.

2. **Base Estimator Training**: For each bootstrap sample, a base estimator (e.g., a decision tree) is trained. The base estimators are trained independently on different bootstrap samples, which helps to reduce the variance of the individual models.

3. **Aggregation**: After training all the base estimators, their predictions are combined to make the final prediction. In classification problems, this is typically a majority vote over the base estimators' predictions (known as "hard voting"). scikit-learn's `BaggingClassifier` instead averages the predicted class probabilities when the base estimator provides them, as decision trees do.

The main advantage of the bagging classifier is that it helps to reduce the variance and overfitting of the base estimators, especially when the base estimators are unstable or sensitive to small changes in the training data (like decision trees). By combining the predictions from multiple diverse models, the bagging classifier can often achieve better generalization and higher accuracy compared to a single base estimator.

It's important to note that the bagging classifier works best when the base estimators are unstable and have high variance but low bias. If the base estimators have high bias, other ensemble techniques like boosting may be more appropriate.
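
The three steps above can be sketched by hand. This is a minimal illustration under stated assumptions rather than how `BaggingClassifier` is implemented: it uses synthetic data from `make_classification` instead of the diabetes dataset, and the names `X_demo`, `Xd_train`, `trees`, and `bagged_pred` are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sampling, i.e. draw row indices with replacement.
    idx = rng.integers(0, len(Xd_train), size=len(Xd_train))
    # Step 2: train one base estimator on this bootstrap sample.
    tree = DecisionTreeClassifier(max_depth=6, random_state=0)
    trees.append(tree.fit(Xd_train[idx], yd_train[idx]))

# Step 3: aggregation by majority vote over the trees' 0/1 predictions.
all_preds = np.array([t.predict(Xd_test) for t in trees])
bagged_pred = (all_preds.sum(axis=0) > n_trees / 2).astype(int)
print(f"Manual bagging accuracy: {(bagged_pred == yd_test).mean():.2f}")
```

Every tree uses the same hyperparameters; the only source of diversity is the bootstrap sample it was trained on.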

In the code below, the bagging classifier is created using the `BaggingClassifier` class from scikit-learn, and a decision tree is used as the base estimator. The number of estimators (decision trees) in the ensemble is set to 100, and the predictions from all these estimators are combined to make the final prediction.

In [25]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report

Here, we create an instance of the DecisionTreeClassifier with max_depth=6 and random_state=42. This decision tree classifier will be used as the base estimator for the bagging classifier.

In [None]:
# Create a decision tree classifier
tree_clf = DecisionTreeClassifier(max_depth=6, random_state=42)

In this part, we create an instance of the BaggingClassifier. We pass tree_clf (the decision tree classifier) as the base estimator, set n_estimators=100 (the number of estimators in the ensemble), and set random_state=42 for reproducibility. The keyword for the base estimator is `estimator` in scikit-learn 1.2 and later; older releases call it `base_estimator`.

In [None]:
# Create a bagging classifier
bag_clf = BaggingClassifier(
    estimator=tree_clf,  # named `base_estimator` in scikit-learn < 1.2
    n_estimators=100,
    random_state=42
)

Then we train it, make predictions on the test set, and print the classification report.

In [26]:
# Train the bagging classifier
bag_clf.fit(X_train, y_train)

# Make predictions on the test set
bag_y_pred = bag_clf.predict(X_test)

# Print the classification report
print("Bagging Classifier Classification Report")
print(classification_report(y_test, bag_y_pred))

Bagging Classifier Classification Report
              precision    recall  f1-score   support

           0       0.79      0.81      0.80       199
           1       0.63      0.61      0.62       109

    accuracy                           0.74       308
   macro avg       0.71      0.71      0.71       308
weighted avg       0.73      0.74      0.74       308



## 9.3 Random Forest Classifier
### What is a Random Forest Classifier?

A Random Forest Classifier is another type of ensemble learning technique that combines multiple decision trees to improve the overall performance and robustness of the model. It is an extension of the bagging (Bootstrap Aggregating) method, with an additional layer of randomness introduced during the construction of each decision tree.

The key steps involved in the Random Forest Classifier are:

1. **Bootstrap Sampling**: Similar to bagging, the original training dataset is sampled multiple times with replacement to create several bootstrap samples.

2. **Random Feature Selection**: While each tree is grown, only a random subset of the features is considered as split candidates at each node. This adds an extra layer of randomness and helps to reduce the correlation between the individual decision trees in the ensemble.

3. **Decision Tree Training**: One decision tree is trained per bootstrap sample, using the random feature selection described in step 2. By default, the trees are grown to their maximum depth without pruning.

4. **Aggregation**: After training all the decision trees, their predictions are combined to make the final prediction. In classification problems, the classic formulation takes the majority vote of the trees' predictions ("hard voting"); scikit-learn's `RandomForestClassifier` instead averages the trees' predicted class probabilities and picks the class with the highest average.
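
The aggregation step can be checked directly on a fitted forest. The sketch below assumes synthetic data from `make_classification` (not the diabetes dataset) and uses illustrative names (`forest`, `tree_probas`, `mean_proba`). It averages the per-tree class probabilities by hand and compares the result with what the forest itself reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=51, random_state=0).fit(X_demo, y_demo)

# Collect each tree's class-probability estimates: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average over the trees and pick the class with the highest mean probability.
mean_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

print("Probabilities match:", np.allclose(mean_proba, forest.predict_proba(X_demo)))
print("Share of matching predictions:", (manual_pred == forest.predict(X_demo)).mean())
```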

The Random Forest Classifier has several advantages over a single decision tree or a bagging ensemble:

1. **Reduced Overfitting**: By introducing randomness in the feature selection process, the Random Forest Classifier reduces the risk of overfitting to the training data, leading to better generalization performance.

2. **Improved Accuracy**: The ensemble nature of the Random Forest Classifier often results in higher accuracy compared to a single decision tree, as the combined predictions from multiple trees can correct for individual errors.

3. **Robustness to Noise and Outliers**: The Random Forest Classifier is less sensitive to noise and outliers in the training data, as they are less likely to affect the entire ensemble.

4. **Feature Importance Estimation**: Random Forests can provide an estimate of the importance of each feature in the dataset, which can be useful for feature selection and understanding the underlying data.
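
To illustrate the sense in which a random forest extends bagging, the sketch below builds a bagging ensemble whose trees consider a random subset of features at each split and compares it with a `RandomForestClassifier`. It is a rough illustration on synthetic data, not an exact equivalence: the two ensembles draw their random numbers differently, so their predictions will usually agree on most samples but not on all of them. All variable names are introduced for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=0)

# Bagging over trees that look at a random sqrt-sized feature subset per split.
# The base estimator is passed positionally because its keyword name differs
# between scikit-learn releases.
bag_of_random_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=100,
    random_state=0,
).fit(Xd_train, yd_train)

forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=0
).fit(Xd_train, yd_train)

bag_pred = bag_of_random_trees.predict(Xd_test)
forest_pred = forest.predict(Xd_test)
agreement = (bag_pred == forest_pred).mean()
print(f"Share of test samples where both ensembles agree: {agreement:.2f}")
```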

Here is the code for random forest:

In [27]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report

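Here we compare a decision tree, a bagging classifier, and a random forest classifier. Note that the three models use different tree depths: the single tree has max_depth=15, the bagged trees have max_depth=1, and the random forest keeps scikit-learn's default of unlimited depth.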

In [28]:
# Create a decision tree classifier
tree_clf = DecisionTreeClassifier(max_depth=15, random_state=42)
tree_clf.fit(X_train, y_train)
tree_y_pred = tree_clf.predict(X_test)
print("Tree Classification Report")
print(classification_report(y_test, tree_y_pred), "\n")

Tree Classification Report
              precision    recall  f1-score   support

           0       0.76      0.72      0.74       199
           1       0.53      0.59      0.56       109

    accuracy                           0.67       308
   macro avg       0.65      0.65      0.65       308
weighted avg       0.68      0.67      0.68       308
 



In [29]:
# Create a bagging classifier
bag_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=1, random_state=42),  # named `base_estimator` in scikit-learn < 1.2
    n_estimators=500,
    bootstrap=True,
    n_jobs=-1
)
bag_clf.fit(X_train, y_train)
bag_y_pred = bag_clf.predict(X_test)
print("Bagging Classification Report")
print(classification_report(y_test, bag_y_pred), "\n")

Bagging Classification Report
              precision    recall  f1-score   support

           0       0.75      0.86      0.80       199
           1       0.66      0.48      0.55       109

    accuracy                           0.73       308
   macro avg       0.70      0.67      0.68       308
weighted avg       0.72      0.73      0.72       308
 



In [30]:
# Create a random forest classifier
rf_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rf_clf.fit(X_train, y_train)
rf_y_pred = rf_clf.predict(X_test)
print("Random Forest Classification Report")
print(classification_report(y_test, rf_y_pred))

Random Forest Classification Report
              precision    recall  f1-score   support

           0       0.78      0.81      0.80       199
           1       0.63      0.58      0.60       109

    accuracy                           0.73       308
   macro avg       0.70      0.70      0.70       308
weighted avg       0.73      0.73      0.73       308



### Classification Report Analysis

The classification report provides a detailed breakdown of the performance of a classification model across multiple metrics for each class. Let's analyze the reports for the three models: Decision Tree, Bagging Classifier, and Random Forest Classifier.

**Accuracy**:
- The Bagging Classifier and the Random Forest Classifier reach the same accuracy (0.73), which is higher than that of the Decision Tree (0.67).

**Precision for Positive Class (1)**:
- Bagging Classifier: 0.66 (highest)
- Random Forest Classifier: 0.63
- Decision Tree: 0.53 (lowest)

**Recall (Sensitivity) for Positive Class (1)**:
- Decision Tree: 0.59 (highest)
- Random Forest Classifier: 0.58
- Bagging Classifier: 0.48 (lowest)

**F1-score for Positive Class (1)**:
- Random Forest Classifier: 0.60 (highest)
- Decision Tree: 0.56
- Bagging Classifier: 0.55 (lowest)

Based on these metrics, the Random Forest Classifier appears to be the best-performing model overall: it ties with the Bagging Classifier for the highest accuracy (0.73), strikes the best balance between precision and recall (the highest F1-score for the positive class), and performs comparably to the other models on the negative class.

However, the choice of the best model depends on your specific requirements and the trade-offs between different metrics. If you prioritize precision over recall for the positive class, the Bagging Classifier might be a better choice. If you want to maximize recall for the positive class, the Decision Tree or Random Forest Classifier might be preferable.

Additionally, you should consider factors such as the class imbalance in your dataset, the cost of misclassification errors, and the interpretability of the models when making the final decision.
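
As a reminder of how the numbers in a classification report are derived, the sketch below computes precision, recall, and F1 for the positive class from a confusion matrix. The labels are made up for illustration and do not come from the notebook's test set.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for ten patients (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The hand-computed values should match `precision_score`, `recall_score`, and `f1_score` from scikit-learn on the same labels.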

---

## Feature Importance 

One cool thing about random forests is that they make it simple to measure the importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity, averaged across all trees in the forest. For example, consider running the following code cell.


---

In [32]:
# Get the feature names from the dataset
feature_names = X.columns.tolist()

# Get the feature importance scores from the random forest classifier
feature_importances = rf_clf.feature_importances_

# Create a list of (feature_name, feature_importance) tuples
feature_importance_tuples = [(name, score) for name, score in zip(feature_names, feature_importances)]

# Sort the list of tuples based on the feature importance scores in descending order
sorted_feature_importances = sorted(feature_importance_tuples, key=lambda x: x[1], reverse=True)

# Print the sorted feature names and their importance scores
for name, score in sorted_feature_importances:
    print(f"{name}: {score:.3f}")

Glucose: 0.266
BMI: 0.159
Age: 0.142
DiabetesPedigreeFunction: 0.119
BloodPressure: 0.090
Pregnancies: 0.086
SkinThickness: 0.073
Insulin: 0.065


We can see that Glucose is the most important feature for predicting diabetes in this model, followed by BMI and Age.
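
The "averaged across all trees" part can be checked by hand. The sketch below assumes synthetic data from `make_classification` (not the diabetes dataset) and reflects our understanding of scikit-learn's behaviour: the forest's importances are the mean of the per-tree impurity-based importances, rescaled to sum to 1. The names `forest`, `per_tree`, and `manual` are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Each fitted tree exposes its own impurity-based importances.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the trees, then rescale so the scores sum to 1.
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print("Matches forest.feature_importances_:", np.allclose(manual, forest.feature_importances_))
print("Sum of importances:", round(float(forest.feature_importances_.sum()), 6))
```

Because the scores are normalized, they are best read as relative rankings within one model rather than as absolute effect sizes.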