Build a random forest classifier to predict the risk of heart disease based on a dataset of patient
information. The dataset contains 303 instances with 14 features, including age, sex, chest pain type,
resting blood pressure, serum cholesterol, and maximum heart rate achieved.

Dataset link: https://drive.google.com/file/d/1bGoIE4Z2kG5nyh-fGZAJ7LH0ki3UfmSJ/view?usp=share_link

Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the
numerical features if necessary.
ANS-
The dataset can be preprocessed for a random forest classifier in four steps:

1. Import necessary libraries and load the dataset:

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load the dataset
df = pd.read_csv("heart_disease.csv")
```

2. Handle missing values:

```python
# Check for missing values
print(df.isnull().sum())

# There are no missing values in this dataset
```
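
If `df.isnull().sum()` did report gaps, one simple option is to impute them before encoding. The following is a minimal sketch on a made-up toy frame (the `NaN` values and the choice of median/mode are assumptions for illustration, not something observed in the heart-disease file):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the NaNs are invented for illustration.
toy = pd.DataFrame({
    "chol": [233.0, np.nan, 250.0, 204.0],  # numeric feature
    "thal": [2.0, 3.0, np.nan, 2.0],        # categorical code
})

# Numeric column: fill gaps with the median, which is robust to outliers.
toy["chol"] = toy["chol"].fillna(toy["chol"].median())

# Categorical column: fill gaps with the most frequent value (the mode).
toy["thal"] = toy["thal"].fillna(toy["thal"].mode().iloc[0])

# Total number of remaining missing values.
print(toy.isnull().sum().sum())
```

Whether median and mode are the right fill values depends on the column; they are only a common default.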

3. Encode categorical variables:

```python
# Encode categorical variables (a fresh LabelEncoder is fitted per column)
categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])
```
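
`LabelEncoder` maps each category to an integer, which implies an ordering between categories. For nominal features such as chest pain type, one-hot encoding is a possible alternative. The sketch below uses `pd.get_dummies` on a small invented frame; whether it improves this particular model is not established here.

```python
import pandas as pd

# Invented rows; only the column names mirror the heart-disease data.
toy = pd.DataFrame({"cp": [0, 1, 2, 3, 1], "sex": [1, 0, 1, 1, 0]})

# Replace the nominal column 'cp' with one indicator column per category.
encoded = pd.get_dummies(toy, columns=["cp"], prefix="cp")

print(encoded.columns.tolist())
```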

4. Scale numerical features:

```python
# Scale numerical features
scaler = StandardScaler()
numerical_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
```

After preprocessing, the dataset is ready for building a random forest classifier.
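
Step 4 fits the scaler on the full dataset, so statistics from rows that later end up in the test set influence the preprocessing. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only. Tree-based models split on thresholds, so scaling is usually optional for a random forest. The arrays `X` and `y` below are random stand-ins, not the heart-disease data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 303 rows, 5 numeric columns, random binary labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(303, 5))
y = rng.integers(0, 2, size=303)

# Split first, so the test rows stay unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Learn mean and standard deviation from the training rows only ...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ... and reuse those statistics to transform the test rows.
X_test_scaled = scaler.transform(X_test)
```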

Q2. Split the dataset into a training set (70%) and a test set (30%).
ANS-The steps below split the preprocessed dataset into a training set (70%) and a test set (30%), then fit and evaluate a baseline random forest classifier:

1. Preprocess the dataset using the steps mentioned in the previous answer.

2. Split the dataset into a training set and a test set:

```python
from sklearn.model_selection import train_test_split

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.3, random_state=42)
```
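
With only 303 rows, a purely random split can shift the class balance between the two sets. Passing `stratify` to `train_test_split` keeps the class proportions similar in both. This is a minimal sketch on synthetic data; the class counts are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 303 rows, 13 features, made-up class counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(303, 13))
y = np.array([1] * 160 + [0] * 143)

# stratify=y keeps the share of each class roughly equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```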

3. Train the random forest classifier using the training set:

```python
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the random forest classifier
rf_classifier.fit(X_train, y_train)
```

4. Evaluate the performance of the classifier using the test set:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', cm)
```

Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each tree. Use the default values for other hyperparameters.
ANS-
```python
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier with 100 trees and a max depth of 10
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Train the classifier on the training set
rf_classifier.fit(X_train, y_train)
```

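
One way to confirm that a fitted forest respects these settings is to inspect its individual trees. The sketch below does this on synthetic data from `make_classification` (an assumption made so the example runs without the CSV file); the name `forest` is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with the same shape as the feature matrix (303 x 13).
X, y = make_classification(n_samples=303, n_features=13, random_state=42)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X, y)

# estimators_ holds the fitted decision trees; get_depth() reports each tree's depth.
n_trees = len(forest.estimators_)
deepest = max(tree.get_depth() for tree in forest.estimators_)

print("Trees:", n_trees, "| deepest tree:", deepest)
```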
Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.
ANS-
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict the class labels for the test set
y_pred = rf_classifier.predict(X_test)

# Calculate the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the evaluation metrics
print("Accuracy: {:.3f}".format(accuracy))
print("Precision: {:.3f}".format(precision))
print("Recall: {:.3f}".format(recall))
print("F1 score: {:.3f}".format(f1))
```

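
To see what these four metrics measure, they can be computed by hand from the confusion-matrix counts. The sketch below uses hypothetical counts (not results from this model) and assumes none of the denominators is zero; `binary_metrics` is an illustrative helper, not part of scikit-learn.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are correct
    precision = tp / (tp + fp)                   # share of predicted positives that are real
    recall = tp / (tp + fn)                      # share of real positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return accuracy, precision, recall, f1


# Hypothetical counts for a 91-row test set.
acc, prec, rec, f1 = binary_metrics(tp=40, fp=10, fn=10, tn=31)
print("Accuracy: {:.3f}".format(acc))
print("Precision: {:.3f}".format(prec))
print("Recall: {:.3f}".format(rec))
print("F1 score: {:.3f}".format(f1))
```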
Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart disease risk. Visualise the feature importances using a bar chart.
ANS-
```python
import matplotlib.pyplot as plt

# Get feature importances from the random forest classifier trained in Q3
importances = rf_classifier.feature_importances_

# Get indices of the top 5 features, most important first
indices = importances.argsort()[-5:][::-1]

# Get names of the top 5 features (same column order the model was trained on)
feature_names = X_train.columns[indices]

# Plot feature importances as a bar chart
plt.bar(range(5), importances[indices], color='b', align='center')
plt.xticks(range(5), feature_names, rotation=45)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.title("Top 5 Most Important Features")
plt.tight_layout()
plt.show()
```
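
The expression `importances.argsort()[-5:][::-1]` is compact, so here is a small worked example of what it does. The importance values are made up for illustration and are not the scores of the trained model.

```python
import numpy as np

# Made-up importance scores for seven features (illustration only).
names = np.array(["age", "sex", "cp", "chol", "thalach", "oldpeak", "ca"])
importances = np.array([0.10, 0.03, 0.22, 0.07, 0.15, 0.25, 0.18])

# argsort() returns indices in ascending order of importance,
# [-5:] keeps the five largest, and [::-1] puts the largest first.
top5 = importances.argsort()[-5:][::-1]

print(list(names[top5]))
```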

Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try different values of the number of trees, maximum depth, minimum samples split, and minimum samples leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.
ANS-
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search over
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a random forest classifier
rf = RandomForestClassifier()

# Create a grid search object and fit it to the training data
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Print the best set of hyperparameters and the corresponding score
print("Best hyperparameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
```
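
The cost of this grid search follows directly from the grid size. The short calculation below enumerates the combinations with `itertools.product` to show how many models 5-fold cross-validation has to fit.

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Every combination of one value per hyperparameter is one candidate model.
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)

# Each candidate is fitted once per cross-validation fold.
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the whole training set.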

Q7. Report the best set of hyperparameters found by the search and the corresponding performance metrics. Compare the performance of the tuned model with the default model.
ANS-
```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import RandomizedSearchCV

# Define parameter distributions (randint upper bounds are exclusive)
param_dist = {'n_estimators': randint(10, 200),
              'max_depth': [None, 5, 10, 15, 20],
              'min_samples_split': randint(2, 10),
              'min_samples_leaf': randint(1, 10)}

# Create a random forest classifier
rfc = RandomForestClassifier(random_state=42)

# Perform a random search over 50 sampled settings with 5-fold cross-validation
random_search = RandomizedSearchCV(rfc, param_distributions=param_dist,
                                   n_iter=50, cv=5, random_state=42,
                                   n_jobs=-1)

# Fit the random search to the training data
random_search.fit(X_train, y_train)

# Report the best hyperparameters and their mean cross-validated accuracy
print("Best hyperparameters:", random_search.best_params_)
print("Best CV accuracy:", random_search.best_score_)

# best_estimator_ is already refit on the full training set (refit=True by default)
tuned_model = random_search.best_estimator_
y_pred = tuned_model.predict(X_test)

# Report performance metrics on the test set
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
```

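
The question also asks for a comparison with the default model. One way to do that is to score both models on the same test set with the same helper. The sketch below is a minimal illustration under stated assumptions: the data come from `make_classification` rather than the CSV file, the "tuned" hyperparameters are placeholders rather than the values found by the search, and `evaluate` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return the four test-set metrics for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }


# Synthetic stand-in for the heart-disease data (303 rows, 13 features).
X, y = make_classification(n_samples=303, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Default model versus a model with placeholder "tuned" settings.
default_model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
tuned_model = RandomForestClassifier(
    n_estimators=150, max_depth=5, min_samples_leaf=2, random_state=42
).fit(X_train, y_train)

results = {
    "default": evaluate(default_model, X_test, y_test),
    "tuned": evaluate(tuned_model, X_test, y_test),
}
for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

In the notebook itself, `tuned_model` would be `random_search.best_estimator_` and the data would be the existing `X_train`, `X_test`, `y_train`, `y_test`.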
Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the decision boundaries on a scatter plot of two of the most important features. Discuss the insights and limitations of the model for predicting heart disease risk.
ANS-
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the dataset
data = pd.read_csv('heart_disease.csv')

# Two features to plot (assumed to rank among the most important; check against Q5)
feat1 = 'age'
feat2 = 'thalach'

# Keep only these two features: a 2-D decision surface needs a model
# trained on exactly the two inputs that the meshgrid supplies
X = data[[feat1, feat2]]
y = data['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a random forest classifier on the two-feature training set
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)

# Create a meshgrid of values for age and maximum heart rate achieved
x_min, x_max = X[feat1].min() - 1, X[feat1].max() + 1
y_min, y_max = X[feat2].min() - 1, X[feat2].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

# Predict the class label for each point on the meshgrid
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=[feat1, feat2])
Z = rf.predict(grid).reshape(xx.shape)

# Draw the decision regions, then overlay the data points colored by class label
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
plt.scatter(X[feat1], X[feat2], c=y, cmap='coolwarm', edgecolor='k')
plt.xlabel(feat1)
plt.ylabel(feat2)
plt.show()

# Limitation: this surface comes from a model that sees only two features,
# so it is not the decision boundary of the forest trained on all features.
# With 303 instances, the boundary can also change noticeably between splits.
```