In [None]:
# Ans-1

In [None]:
To build a random forest classifier to predict the risk of heart disease based on the dataset, we first need to preprocess the data. Here are the steps we will take:

Load the dataset and check for any missing values
Encode the categorical variables
Scale the numerical features
Split the dataset into training and testing sets

Let's start with loading the dataset:

In [None]:
import pandas as pd

url = "https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=sharing"
file_id = url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?id=' + file_id
df = pd.read_csv(dwn_url)

In [None]:
Next, let's check for missing values:

In [None]:
df.isna().sum()

In [None]:
If every count in the output is zero, there are no missing values in the dataset. Next, we will encode the categorical variables. Here we treat "cp", which represents the chest pain type, as the categorical variable and one-hot encode it. If the dataset contains other integer-coded categorical columns, they can be added to the columns list in the same way:

In [None]:
encoded_df = pd.get_dummies(df, columns=['cp'])

In [None]:
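Next, we will scale the numerical features using the StandardScaler from the scikit-learn library. Random forests split on feature thresholds and are not sensitive to feature scale, so this step is optional for this model, but it keeps the preprocessing reusable for scale-sensitive models: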

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_df = scaler.fit_transform(encoded_df.drop('target', axis=1))

In [None]:
Note that we have dropped the target variable, which represents the presence or absence of heart disease. We will use this variable to create the target array.

In [None]:
target = encoded_df['target'].values

In [None]:
Finally, we will split the dataset into training and testing sets:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2, random_state=42)
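
In [None]:
Scaling the full dataset before splitting, as above, lets test-set statistics influence the scaler. Below is a minimal sketch of one way to avoid this: split first, then let a Pipeline fit StandardScaler on the training rows only. It assumes synthetic data from make_classification as a stand-in for the heart disease features; X_demo, y_demo, and pipe are illustrative names, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease features (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler and the forest on the training rows only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(X_tr, y_tr)

# The scaler's learned means match the training-set means, not the full-data means
scaler_means = pipe.named_steps["scale"].mean_
print("Scaler fitted on training rows only:",
      np.allclose(scaler_means, X_tr.mean(axis=0)))
print("Test accuracy:", pipe.score(X_te, y_te))
```

Calling pipe.predict on new rows applies the training-set scaling automatically before the forest sees them.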

In [None]:
Now that we have preprocessed the dataset, we can move on to building the random forest classifier. Here is the code to do so:

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

In [None]:
We have used 100 trees in the forest and set the random_state to ensure reproducibility. We can now use the classifier to make predictions on the testing set:

In [None]:
y_pred = rf.predict(X_test)

In [None]:
We can evaluate the performance of the classifier using various metrics such as accuracy, precision, recall, and F1 score. Here is an example using the accuracy metric:

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

In [None]:
Note that there are many other hyperparameters that can be tuned to improve the performance of the classifier, such as max_depth, min_samples_split, and max_features. However, we have used the default values in this example.
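
As a rough illustration of what setting these hyperparameters looks like, the sketch below compares a forest with default settings against one with explicit max_depth, min_samples_split, and max_features values. The data is synthetic (make_classification) and the chosen values are arbitrary assumptions for demonstration, not recommendations for the heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption, not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

default_rf = RandomForestClassifier(n_estimators=100, random_state=42)
constrained_rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,            # cap the depth of every tree
    min_samples_split=10,   # a node needs at least 10 samples to be split
    max_features="sqrt",    # features considered at each split
    random_state=42,
)

# Compare the two settings with 5-fold cross-validated accuracy
results = {}
for name, model in [("default", default_rf), ("constrained", constrained_rf)]:
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    results[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {results[name]:.3f}")
```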

In [None]:
# Ans-2

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
url = "https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=sharing"
file_id = url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?id=' + file_id
df = pd.read_csv(dwn_url)

# Split the dataset into features (X) and target (y)
X = df.drop("target", axis=1)
y = df["target"]

# Split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the shapes of the training and test sets
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_test:", y_test.shape)

In [None]:
This will split the dataset into a training set (70%) and a test set (30%) and print the shapes of the resulting sets. You can adjust the test_size parameter to change the size of the test set. Note that we have set the random_state parameter to ensure reproducibility.
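
When the classes are imbalanced, it can help to pass stratify=y so that both sets keep the same class proportions. The sketch below shows this on made-up data with 30% positive labels; X_demo and y_demo are illustrative names and the data is an assumption, not the heart disease dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 200 rows, 4 features, 30% positive labels (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = np.array([1] * 60 + [0] * 140)

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)
print("Positive rate in train:", y_tr.mean())
print("Positive rate in test:", y_te.mean())
```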

In [None]:
# Ans-3

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create the random forest classifier with 100 trees and max depth of 10
rfc = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Fit the classifier to the training data
rfc.fit(X_train, y_train)

# Predict the test set labels
y_pred = rfc.predict(X_test)

In [None]:
In this code, we create a RandomForestClassifier object and set the number of trees to 100 and the maximum depth of each tree to 10. We then fit the classifier to the training data using the fit method. Finally, we use the trained classifier to predict the labels of the test set using the predict method. Note that we have set the random_state parameter to ensure reproducibility.
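
To see what n_estimators and max_depth control, the fitted forest can be inspected through its estimators_ attribute. The following is a small sketch on synthetic data (an assumption, not the heart disease dataset); forest and depths are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=42)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted tree is a DecisionTreeClassifier; get_depth() reports its depth
depths = [tree.get_depth() for tree in forest.estimators_]
print("Number of trees:", len(forest.estimators_))
print("Deepest tree:", max(depths))
```

No tree can be deeper than max_depth, although individual trees may stop earlier once their leaves are pure.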

In [None]:
# Ans-4

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate the precision
precision = precision_score(y_test, y_pred)
print("Precision:", precision)

# Calculate the recall
recall = recall_score(y_test, y_pred)
print("Recall:", recall)

# Calculate the F1 score
f1 = f1_score(y_test, y_pred)
print("F1 score:", f1)

In [None]:
In this code, we use the accuracy_score, precision_score, recall_score, and f1_score functions from scikit-learn's metrics module to calculate the accuracy, precision, recall, and F1 score of the model, respectively. We pass the true labels of the test set (y_test) and the predicted labels (y_pred) as arguments to these functions. Finally, we print the results.
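
To make the definitions concrete, the sketch below recomputes the four metrics by hand from the confusion matrix on a tiny made-up label vector and checks them against the scikit-learn functions. The labels are invented for illustration only.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Tiny made-up example (not from the heart disease dataset)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat = [1, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)   # of predicted positives, how many are correct
recall_manual = tp / (tp + fn)      # of actual positives, how many were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Manual: ", accuracy_manual, precision_manual, recall_manual, f1_manual)
print("sklearn:", accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```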

In [None]:
# Ans-5

In [None]:
import matplotlib.pyplot as plt

# Get feature importances and their corresponding feature names
importances = rfc.feature_importances_
feature_names = X.columns

# Sort feature importances in descending order
indices = importances.argsort()[::-1]

# Get the top 5 most important features
top_features = [feature_names[i] for i in indices[:5]]
print("Top 5 features:", top_features)

# Visualize feature importances
plt.figure(figsize=(10, 6))
plt.bar(range(len(indices)), importances[indices], color="b")
plt.xticks(range(len(indices)), feature_names[indices], rotation=90)
plt.title("Feature Importances")
plt.show()

In [None]:
In this code, we first get the feature importances and their corresponding feature names from the trained random forest classifier. We then sort the feature importances in descending order and get the top 5 most important features. Finally, we visualize the feature importances using a bar chart. The x-axis of the chart shows the feature names in descending order of importance, and the y-axis shows the corresponding feature importances. The plt.xticks function is used to rotate the x-axis labels by 90 degrees to make them readable.
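
An alternative way to rank features, sketched here on synthetic data with placeholder column names (feature_0, feature_1, and so on are assumptions), is to wrap the importances in a pandas Series. The impurity-based importances of a fitted forest are normalized, so they sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder column names (assumption)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important
ranked = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)
top5 = ranked.head(5)

print(top5)
print("Sum of all importances:", ranked.sum())
```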

In [None]:
# Ans-6

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

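# Define the parameter distributions to sample from
param_dist = {
    "n_estimators": randint(50, 200),  # integers 50..199 (upper bound is exclusive)
    # None plus four candidate depths drawn once, seeded for reproducibility
    "max_depth": [None] + [int(d) for d in randint(5, 30).rvs(4, random_state=42)],
    "min_samples_split": randint(2, 11),  # integers 2..10
    "min_samples_leaf": randint(1, 11),  # integers 1..10
}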

# Create the random search object
rfc_rs = RandomForestClassifier(random_state=42)
rs = RandomizedSearchCV(rfc_rs, param_distributions=param_dist, n_iter=100, cv=5, n_jobs=-1, random_state=42)

# Fit the random search object to the data
rs.fit(X_train, y_train)

# Print the best hyperparameters and corresponding score
print("Best hyperparameters:", rs.best_params_)
print("Best score:", rs.best_score_)

In [None]:
In this code, we first define the parameter distributions for the random search. scipy.stats.randint(low, high) draws integers uniformly from low up to but not including high, so n_estimators is sampled from 50 to 199, min_samples_split from 2 to 10, and min_samples_leaf from 1 to 10. max_depth is handled differently: four integers between 5 and 29 are drawn once when the dictionary is built, and the search then picks uniformly from that list of four values plus None. We then create the random search object with 100 iterations, 5-fold cross-validation, and all available CPU cores (n_jobs=-1). We fit the random search object to the training data and print the best hyperparameters and the corresponding score, which is the mean cross-validated accuracy of the best configuration.
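
The bounds of scipy.stats.randint are easy to get wrong, so here is a quick check by sampling. The seed and sample size are arbitrary choices for the demonstration.

```python
from scipy.stats import randint

# randint(low, high) draws integers from low up to high - 1
samples = randint(2, 11).rvs(size=1000, random_state=0)
print("Smallest and largest sampled values:", samples.min(), samples.max())

# Building a fixed list of candidate depths, as done for max_depth
depth_candidates = [None] + [int(d) for d in randint(5, 30).rvs(4, random_state=0)]
print("max_depth candidates:", depth_candidates)
```

The largest value that randint(2, 11) can return is 10, because the upper bound is excluded.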

Note that you can also use GridSearchCV instead of RandomizedSearchCV to perform a grid search instead of a random search. In that case, you need to define a grid of hyperparameters instead of a parameter distribution.
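
A minimal GridSearchCV sketch, assuming synthetic data and a deliberately small grid so that it runs quickly (the grid values are arbitrary assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Every combination in the grid is evaluated: 2 * 2 * 2 = 8 candidates
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 4],
}

gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=param_grid, cv=3)
gs.fit(X_demo, y_demo)

n_candidates = len(gs.cv_results_["params"])
print("Candidates evaluated:", n_candidates)
print("Best hyperparameters:", gs.best_params_)
print("Best score:", gs.best_score_)
```

Unlike the random search, the number of fits grows multiplicatively with every value added to the grid.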

In [None]:
# Ans-7

In [None]:
from sklearn.metrics import accuracy_score, classification_report, precision_recall_fscore_support

# Use the best hyperparameters to train a random forest classifier
rfc_tuned = RandomForestClassifier(random_state=42, **rs.best_params_)
rfc_tuned.fit(X_train, y_train)

# Evaluate the tuned model on the test set
y_pred_tuned = rfc_tuned.predict(X_test)
acc_tuned = accuracy_score(y_test, y_pred_tuned)
prec_tuned, rec_tuned, f1_tuned, _ = precision_recall_fscore_support(y_test, y_pred_tuned, average="binary")

# Print the performance metrics of the tuned model
print("Tuned Model Performance Metrics:")
print(f"Accuracy: {acc_tuned:.4f}")
print(f"Precision: {prec_tuned:.4f}")
print(f"Recall: {rec_tuned:.4f}")
print(f"F1 Score: {f1_tuned:.4f}")
print(classification_report(y_test, y_pred_tuned))

In [None]:
In this code, we first create a new random forest classifier with the best hyperparameters found by the random search (**rs.best_params_). We then fit this tuned model on the training data and evaluate its performance on the test data using accuracy, precision, recall, and F1 score. We also print a classification report to get a more detailed view of the performance of the tuned model.
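
As a side note, RandomizedSearchCV refits the best configuration on the whole training set by default (refit=True) and exposes it as best_estimator_, so retraining by hand is optional. The sketch below shows this on synthetic data with a small search space; both are assumptions chosen to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(20, 60),
        "min_samples_leaf": randint(1, 6),
    },
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X_tr, y_tr)

# best_estimator_ is already fitted on X_tr with the best hyperparameters
best_model = search.best_estimator_
test_preds = best_model.predict(X_te)
print("Best hyperparameters:", search.best_params_)
print("Test accuracy of best_estimator_:", best_model.score(X_te, y_te))
```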

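To compare the tuned model with the earlier model from Ans-3 (rfc, which uses 100 trees and max_depth=10 with otherwise default settings, and is referred to here as the default model), you can print the performance metrics of that model and compare them with those of the tuned model. Here's an example of how to do this: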

In [None]:
# Evaluate the default model on the test set
y_pred_default = rfc.predict(X_test)
acc_default = accuracy_score(y_test, y_pred_default)
prec_default, rec_default, f1_default, _ = precision_recall_fscore_support(y_test, y_pred_default, average="binary")

# Print the performance metrics of the default model
print("Default Model Performance Metrics:")
print(f"Accuracy: {acc_default:.4f}")
print(f"Precision: {prec_default:.4f}")
print(f"Recall: {rec_default:.4f}")
print(f"F1 Score: {f1_default:.4f}")
print(classification_report(y_test, y_pred_default))

In [None]:
By comparing the performance metrics of the tuned model and the default model, you can see whether tuning the hyperparameters has led to any improvement in the model's performance.
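
One convenient way to make that comparison, sketched here on synthetic data with two illustrative configurations (the names "baseline" and "alternative" and their settings are assumptions), is to collect the metrics into a single pandas DataFrame:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Two illustrative configurations to compare
models = {
    "baseline": RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
    "alternative": RandomForestClassifier(n_estimators=150, min_samples_leaf=3, random_state=42),
}

rows = {}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    prec, rec, f1, _ = precision_recall_fscore_support(y_te, preds, average="binary")
    rows[name] = {
        "accuracy": accuracy_score(y_te, preds),
        "precision": prec,
        "recall": rec,
        "f1": f1,
    }

# One row per model, one column per metric
comparison = pd.DataFrame(rows).T
print(comparison.round(4))
```

Differences on a single small test set can be within noise, so small gaps between the rows should be read with caution.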

In [None]:
# Ans-8

In [None]:
A random forest trained on all features has a decision boundary in a high-dimensional space, so it cannot be drawn directly on a two-dimensional scatter plot. A common workaround is to refit a forest on only two features (for example, the two most important ones) and plot its predictions over a grid of those two features; such a plot describes the two-feature model, not the full one. We can also get some insight into the full model's decision-making process by looking at the feature importances, which we have already visualized in a bar chart earlier.
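
One way to still visualize decision regions is to train a forest on just two features and evaluate it over a mesh grid. The sketch below does this on synthetic two-feature data (an assumption; with the real dataset the two columns would be the two most important features). The names X2, y2, forest2d, and draw_boundary are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-feature data standing in for two selected columns (assumption)
X2, y2 = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=42
)
forest2d = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest2d.fit(X2, y2)

# Build a grid covering the range of both features
x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
xx, yy = np.meshgrid(
    np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200)
)

# Predict the class at every grid point and reshape back to the grid
Z = forest2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)


def draw_boundary(ax):
    """Draw the predicted regions and the training points on a matplotlib Axes."""
    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X2[:, 0], X2[:, 1], c=y2, edgecolor="k", s=20)
    ax.set_xlabel("feature 1")
    ax.set_ylabel("feature 2")
    ax.set_title("Random forest decision regions (two features)")
```

With matplotlib imported as plt, the figure can be produced by creating an Axes with fig, ax = plt.subplots(figsize=(8, 6)), calling draw_boundary(ax), and then plt.show().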

The top 5 most important features in predicting heart disease risk, as identified by the feature importance scores, are:

maximum heart rate achieved
number of major vessels colored by fluoroscopy
chest pain type
ST depression induced by exercise relative to rest
age

This means that the maximum heart rate achieved by a patient is the most important feature in predicting heart disease risk, followed by the number of major vessels colored by fluoroscopy, chest pain type, ST depression induced by exercise relative to rest, and age. These insights align with prior medical knowledge about heart disease risk factors.

However, it's important to note that while the model has achieved relatively high accuracy and F1 scores, there are still some limitations and potential biases in the model's performance. For example, the dataset used to train and evaluate the model may not be representative of the entire population, and there may be other important features that are not included in the dataset. Additionally, the model may perform differently on patients from different demographic groups or with different medical histories. Therefore, the model should be used as a tool to assist medical professionals in their decision-making process, rather than as a definitive diagnosis tool. Further research and validation are needed before the model can be used in a clinical setting.