Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the
numerical features if necessary.

To build a Random Forest classifier for predicting the risk of heart disease, follow these steps:

### 1. **Preprocess the Dataset**

#### a. **Load the Dataset**
Assume you have the dataset in a CSV file named `heart_disease.csv`.

```python
import pandas as pd

# Load the dataset
df = pd.read_csv('heart_disease.csv')
```

#### b. **Handle Missing Values**
Check for missing values and handle them appropriately.

```python
# Check for missing values
print(df.isnull().sum())

# Impute or drop missing values as needed
# For simplicity, we will drop rows with missing values
df.dropna(inplace=True)
```
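Dropping rows is the simplest option, but it discards data. As an alternative, the sketch below imputes instead: the median for numeric columns and the most frequent value for categorical ones. The toy DataFrame and its column names (`age`, `cholesterol`) are assumptions standing in for the real file, so adapt them to your dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for heart_disease.csv (column names are assumptions)
df = pd.DataFrame({
    'age': [63, 45, np.nan, 58, 51],
    'cholesterol': [233.0, np.nan, 250.0, 210.0, 198.0],
    'chest_pain_type': ['typical', 'atypical', None, 'typical', 'non_anginal'],
    'target': [1, 0, 1, 0, 0],
})

numeric_cols = ['age', 'cholesterol']
categorical_cols = ['chest_pain_type']

# Fill numeric gaps with each column's median (robust to outliers)
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Fill categorical gaps with each column's most frequent value
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Every count should now be zero, and no rows were dropped
print(df.isnull().sum())
```

Whether imputing or dropping is preferable depends on how much data is missing and why, so treat this as one option rather than a default.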

#### c. **Encode Categorical Variables**
Convert categorical variables to numeric using one-hot encoding or label encoding.

```python
# Convert categorical variables to dummy variables
df = pd.get_dummies(df, columns=['chest_pain_type'], drop_first=True)
```

#### d. **Scale Numerical Features (if necessary)**
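Random Forests split on feature thresholds, so they are largely insensitive to feature scale and this step is optional. It is shown here for completeness and matters mainly if you later compare against scale-sensitive models. Note that fitting the scaler on the full dataset before splitting, as in the code in this step, lets test-set statistics leak into training; in a stricter workflow, fit it on the training set only.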

```python
from sklearn.preprocessing import StandardScaler

# Separate features and target variable
X = df.drop('target', axis=1)
y = df['target']

# Standardize numerical features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
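If scaling is wanted, one way to avoid leakage and to leave one-hot columns untouched is to split first and scale only the numeric columns with a `ColumnTransformer`. This is a minimal sketch on synthetic data; the column names and the generated values are assumptions, not part of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed DataFrame (all names and values are assumptions)
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'age': rng.integers(30, 80, size=n),
    'cholesterol': rng.normal(240, 40, size=n),
    'chest_pain_type_typical': rng.integers(0, 2, size=n),
})
y = pd.Series(rng.integers(0, 2, size=n), name='target')

# Split first so the scaler never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale only the numeric columns; pass the dummy column through unchanged
numeric_cols = ['age', 'cholesterol']
preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), numeric_cols)],
    remainder='passthrough',
)

# Fit on the training set only, then apply the same transform to the test set
X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

The scaled numeric columns come first in the output array, followed by the passed-through columns.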

### 2. **Build the Random Forest Classifier**

#### a. **Import Libraries**

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
```

#### b. **Split the Dataset**

```python
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
```

#### c. **Train the Random Forest Classifier**

```python
# Initialize and train the Random Forest model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
```

#### d. **Make Predictions and Evaluate the Model**

```python
# Make predictions on the test set
y_pred = rf_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)
```

### Summary
1. **Load and preprocess the dataset** by handling missing values, encoding categorical variables, and scaling numerical features.
2. **Build and train a Random Forest classifier** with the processed data.
3. **Evaluate the model’s performance** using metrics like accuracy, confusion matrix, and classification report.

Q2. Split the dataset into a training set (70%) and a test set (30%).

To split the heart disease dataset into a training set (70%) and a test set (30%), you can use the `train_test_split` function from the `sklearn.model_selection` module. Here’s how to do it step-by-step:

### Code Example

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('heart_disease.csv')

# Preprocess the dataset
# Handle missing values
df.dropna(inplace=True)

# Encode categorical variables
df = pd.get_dummies(df, columns=['chest_pain_type'], drop_first=True)

# Separate features and target variable
X = df.drop('target', axis=1)
y = df['target']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the shapes of the resulting datasets
print(f'Training set shape: {X_train.shape}')
print(f'Test set shape: {X_test.shape}')
```

### Explanation

1. **Load the Dataset:** Read the CSV file into a DataFrame.
2. **Preprocess the Dataset:**
   - Handle missing values by dropping rows with missing data.
   - Encode categorical variables using one-hot encoding.
3. **Separate Features and Target:** Separate the features (`X`) from the target variable (`y`).
4. **Split the Dataset:**
   - Use `train_test_split` to divide the data into training and test sets.
   - Set `test_size=0.3` to allocate 30% of the data to the test set and the remaining 70% to the training set.
   - Set `random_state=42` for reproducibility.
5. **Print Shapes:** Output the shapes of the training and test sets to confirm the split.

This code will split your dataset appropriately, ready for model training and evaluation.
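When the classes are imbalanced, a plain random split can leave the test set with a different class ratio than the training set. The sketch below shows the `stratify` argument of `train_test_split`, which preserves class proportions in both parts. The labels here are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 30 positives and 70 negatives
y = np.array([1] * 30 + [0] * 70)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class ratio the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f'Training set size: {len(y_train)}, positive rate: {y_train.mean():.2f}')
print(f'Test set size: {len(y_test)}, positive rate: {y_test.mean():.2f}')
```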

Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each
tree. Use the default values for other hyperparameters.

To train a Random Forest classifier with 100 trees and a maximum depth of 10 for each tree, follow these steps:

### Code Example

```python
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# Train the model on the training set
rf_clf.fit(X_train, y_train)

# Print confirmation
print("Random Forest model trained with 100 trees and a maximum depth of 10.")
```

### Explanation

1. **Initialize the Random Forest Classifier:**
   - `n_estimators=100` specifies the number of trees in the forest.
   - `max_depth=10` sets the maximum depth of each tree.
   - `random_state=42` ensures that the results are reproducible.

2. **Train the Model:**
   - Use the `.fit()` method to train the model on the training data (`X_train` and `y_train`).

3. **Print Confirmation:**
   - Output a message to confirm that the model has been trained.

This code will train a Random Forest classifier with the specified parameters, ready for evaluation and further analysis.
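To confirm that the settings took effect, you can inspect the fitted forest: `estimators_` holds the individual trees, and each tree reports its depth through `get_depth()`. The example below uses synthetic data from `make_classification` as a stand-in for the heart disease training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for X_train and y_train (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Each fitted tree is a DecisionTreeClassifier; collect its actual depth
depths = [tree.get_depth() for tree in rf_demo.estimators_]

print(f'Number of trees: {len(rf_demo.estimators_)}')
print(f'Deepest tree: {max(depths)} (limit is 10)')
```

Trees can be shallower than the limit when they run out of impure nodes to split, so `max_depth` is an upper bound rather than an exact depth.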

Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.

To evaluate the performance of the trained Random Forest model on the test set, you can use accuracy, precision, recall, and F1 score. Here’s how to do it:

### Code Example

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Make predictions on the test set
y_pred = rf_clf.predict(X_test)

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print performance metrics
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')

# Print confusion matrix and classification report
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))
```

### Explanation

1. **Make Predictions:**
   - Use the `.predict()` method to generate predictions for the test set (`X_test`).

2. **Calculate Performance Metrics:**
   - `accuracy_score(y_test, y_pred)`: Measures the proportion of correctly classified instances.
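   - `precision_score(y_test, y_pred)`: Measures the proportion of true positives among the predicted positives.
   - `recall_score(y_test, y_pred)`: Measures the proportion of true positives among the actual positives.
   - `f1_score(y_test, y_pred)`: Provides the harmonic mean of precision and recall, balancing the two metrics.
   - With default arguments, `precision_score`, `recall_score`, and `f1_score` assume a binary target whose positive class is labelled `1`; a multi-class target needs an `average` argument such as `average='macro'`.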

3. **Print Confusion Matrix and Classification Report:**
   - `confusion_matrix(y_test, y_pred)`: Shows the number of true positives, true negatives, false positives, and false negatives.
   - `classification_report(y_test, y_pred)`: Provides a summary of precision, recall, and F1 score for each class.

This code evaluates the performance of the Random Forest model using various metrics and provides a detailed analysis of how well the model performs on the test set.
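To make the metric definitions concrete, here is a small worked example that derives each metric by hand from the confusion matrix and checks the result against scikit-learn. The labels are invented for illustration and assume a binary target with positive class `1`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions for ten patients
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# For a binary problem, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # true positives / predicted positives
recall = tp / (tp + fn)                      # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)

print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')
print(f'Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}')

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(precision, precision_score(y_true, y_hat))
assert np.isclose(recall, recall_score(y_true, y_hat))
assert np.isclose(f1, f1_score(y_true, y_hat))
```

Here there are 4 true positives, 4 true negatives, 1 false positive, and 1 false negative, so all four metrics come out to 0.80.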

Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart
disease risk. Visualise the feature importances using a bar chart.

To identify and visualize the top 5 most important features in predicting heart disease risk using the Random Forest model, follow these steps:

### Code Example

```python
import matplotlib.pyplot as plt
import pandas as pd

# Get feature importance scores from the trained model
importances = rf_clf.feature_importances_

# Create a DataFrame with feature names and their importance scores
features = X.columns
feature_importances = pd.DataFrame({
    'Feature': features,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Select the top 5 most important features
top_features = feature_importances.head(5)

# Plot the feature importances
plt.figure(figsize=(10, 6))
plt.barh(top_features['Feature'], top_features['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.title('Top 5 Most Important Features for Predicting Heart Disease Risk')
plt.gca().invert_yaxis()  # Highest importance on top
plt.show()
```

### Explanation

1. **Extract Feature Importances:**
   - Use `rf_clf.feature_importances_` to get the importance scores of each feature from the trained Random Forest model.

2. **Create a DataFrame:**
   - Create a DataFrame with feature names and their corresponding importance scores.
   - Sort the DataFrame by importance scores in descending order.

3. **Select Top 5 Features:**
   - Use `head(5)` to select the top 5 most important features.

4. **Visualize Using a Bar Chart:**
   - Plot the feature importances using a horizontal bar chart with `plt.barh()`.
   - Add labels, a title, and invert the y-axis to display the highest importance on top.

This code will help you identify and visualize the most important features for predicting heart disease risk, allowing you to understand which factors are most influential in your model.
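The `feature_importances_` attribute is impurity-based: it is computed from the training data and can favour features with many distinct values. As a cross-check, the sketch below compares it with permutation importance measured on held-out data, using `sklearn.inspection.permutation_importance`. The data is synthetic and the feature names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with placeholder feature names (assumptions for illustration)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=42
)
demo_names = [f'feature_{i}' for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_demo.fit(X_tr, y_tr)

# Shuffle one feature at a time on the test set and measure the drop in score
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=42)

importance_comparison = pd.DataFrame({
    'Feature': demo_names,
    'Impurity': rf_demo.feature_importances_,
    'Permutation': perm.importances_mean,
}).sort_values(by='Permutation', ascending=False)

print(importance_comparison.head(5))
```

If the two rankings disagree sharply, that is a hint to treat the impurity-based ranking with some caution.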

Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try
different values of the number of trees, maximum depth, minimum samples split, and minimum samples
leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.

To tune the hyperparameters of the Random Forest classifier using Grid Search or Random Search and evaluate performance with 5-fold cross-validation, you can follow these steps:

### Using Grid Search

1. **Import Necessary Libraries**
2. **Define Parameter Grid**
3. **Perform Grid Search**
4. **Evaluate the Best Model**

### Code Example

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

# Define the parameter grid for Grid Search
param_grid = {
    'n_estimators': [50, 100, 150, 200],         # Number of trees
    'max_depth': [None, 10, 20, 30],             # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],              # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]                 # Minimum samples required to be at a leaf node
}

# Initialize the Random Forest model
rf_clf = RandomForestClassifier(random_state=42)

# Initialize Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit Grid Search to the data
grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)

# Evaluate the best model on the test set
best_rf_clf = grid_search.best_estimator_
y_pred = best_rf_clf.predict(X_test)

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
```
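Before running an exhaustive search, it helps to know how many models it will fit. The sketch below counts the combinations in the grid used here with `ParameterGrid` and multiplies by the number of folds.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the Grid Search example
param_grid = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

n_folds = 5
n_candidates = len(ParameterGrid(param_grid))   # 4 * 4 * 3 * 3 combinations
n_fits = n_candidates * n_folds                 # one fit per combination per fold

print(f'Candidate combinations: {n_candidates}')  # 144
print(f'Total cross-validation fits: {n_fits}')   # 720
```

With the default `refit=True`, `GridSearchCV` fits once more on the whole training set using the best combination, so the real total is one higher.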

### Using Random Search

Alternatively, Random Search samples a fixed number of hyperparameter combinations (`n_iter`) from the given distributions, which is usually cheaper than an exhaustive grid when the search space is large:

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Define the parameter distributions for Random Search
# Note: randint(low, high) samples integers from low up to high - 1 (the upper bound is exclusive)
param_dist = {
    'n_estimators': randint(50, 200),                # Number of trees: 50 to 199
    'max_depth': [None, 10, 20, 30, 40],             # Maximum depth of the tree
    'min_samples_split': randint(2, 10),              # Minimum samples required to split an internal node: 2 to 9
    'min_samples_leaf': randint(1, 5)                 # Minimum samples required to be at a leaf node: 1 to 4
}

# Initialize Random Search with 5-fold cross-validation
random_search = RandomizedSearchCV(estimator=rf_clf, param_distributions=param_dist, n_iter=100, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)

# Fit Random Search to the data
random_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

# Evaluate the best model on the test set
best_rf_clf = random_search.best_estimator_
y_pred = best_rf_clf.predict(X_test)

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
```

### Explanation

1. **Define Parameter Grid/Distribution:**
   - For Grid Search, specify all possible values for each hyperparameter.
   - For Random Search, define distributions from which to sample values.

2. **Initialize Search:**
   - Use `GridSearchCV` or `RandomizedSearchCV` with the defined parameter grid/distribution and 5-fold cross-validation.

3. **Fit Search to Data:**
   - Perform the search on the training data to find the best hyperparameters.

4. **Evaluate the Best Model:**
   - Retrieve the best model from the search and evaluate its performance on the test set.

This process finds the best-scoring hyperparameters among the combinations searched, using 5-fold cross-validation on the training set, and then checks the selected model on the held-out test set.

Q7. Report the best set of hyperparameters found by the search and the corresponding performance
metrics. Compare the performance of the tuned model with the default model.

To report the best set of hyperparameters and corresponding performance metrics, and compare the performance of the tuned model with the default model, follow these steps:

### Code Example

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Random Forest model with default hyperparameters
default_rf_clf = RandomForestClassifier(random_state=42)
default_rf_clf.fit(X_train, y_train)

# Make predictions with the default model
y_pred_default = default_rf_clf.predict(X_test)

# Print performance metrics for the default model
print("Default Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_default):.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_default))

# Define the parameter grid for Grid Search
param_grid = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                           param_grid=param_grid,
                           cv=5,
                           scoring='accuracy',
                           n_jobs=-1)

# Fit Grid Search to the data
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Evaluate the best model on the test set
best_rf_clf = grid_search.best_estimator_
y_pred_best = best_rf_clf.predict(X_test)

# Print the best parameters and performance metrics for the tuned model
print("Tuned Model Performance:")
print(f"Best Parameters: {best_params}")
print(f"Best Cross-Validation Score: {best_score:.2f}")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred_best):.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_best))

# Alternatively, you can use RandomizedSearchCV for hyperparameter tuning
# Define the parameter distributions for Random Search
from scipy.stats import randint

param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 5)
}

# Initialize Random Search with 5-fold cross-validation
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                                    param_distributions=param_dist,
                                    n_iter=100,
                                    cv=5,
                                    scoring='accuracy',
                                    n_jobs=-1,
                                    random_state=42)

# Fit Random Search to the data
random_search.fit(X_train, y_train)

# Get the best parameters and best score from Random Search
best_params_random = random_search.best_params_
best_score_random = random_search.best_score_

# Evaluate the best model from Random Search on the test set
best_rf_clf_random = random_search.best_estimator_
y_pred_best_random = best_rf_clf_random.predict(X_test)

# Print the best parameters and performance metrics for the Random Search tuned model
print("Random Search Tuned Model Performance:")
print(f"Best Parameters: {best_params_random}")
print(f"Best Cross-Validation Score: {best_score_random:.2f}")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred_best_random):.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_best_random))
```

### Explanation

1. **Default Model:**
   - Train a Random Forest classifier with default hyperparameters.
   - Evaluate and report its accuracy and classification metrics on the test set.

2. **Hyperparameter Tuning with Grid Search:**
   - Perform Grid Search to find the best hyperparameters.
   - Print the best parameters and cross-validation score from Grid Search.
   - Evaluate the performance of the tuned model on the test set.

3. **Comparison:**
   - Compare the performance metrics (accuracy, precision, recall, F1 score) of the tuned model with the default model.
   - Optionally, perform a Random Search to compare results and find the best hyperparameters.

This approach shows whether hyperparameter tuning improves performance over the default settings. On a small dataset the difference may be negligible or within the noise of a single train/test split.
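For the comparison itself, it can be easier to read the metrics side by side than to scan two classification reports. The sketch below builds such a table with pandas. It uses synthetic data and a deliberately small grid so that it runs quickly; both are assumptions for illustration, not the settings used in the search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the heart disease data (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Default model
default_model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Tuned model from a small illustrative grid with 5-fold cross-validation
small_grid = {'n_estimators': [25, 50], 'max_depth': [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

def summarize(model):
    """Return the four test-set metrics for a fitted binary classifier."""
    y_hat = model.predict(X_te)
    return {
        'Accuracy': accuracy_score(y_te, y_hat),
        'Precision': precision_score(y_te, y_hat),
        'Recall': recall_score(y_te, y_hat),
        'F1': f1_score(y_te, y_hat),
    }

metric_comparison = pd.DataFrame({
    'Default': summarize(default_model),
    'Tuned': summarize(search.best_estimator_),
})

print('Best parameters:', search.best_params_)
print(metric_comparison.round(3))
```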

Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the
decision boundaries on a scatter plot of two of the most important features. Discuss the insights and
limitations of the model for predicting heart disease risk.

To analyze the decision boundaries of a Random Forest classifier, you can plot them on a scatter plot using two of the most important features. Here's how to interpret the model, plot the decision boundaries, and discuss the insights and limitations:

### Steps to Plot Decision Boundaries

1. **Identify the Two Most Important Features**
2. **Create a Mesh Grid for Decision Boundary Plotting**
3. **Plot the Decision Boundaries and Scatter Plot**

### Code Example

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Get the two most important features
top_features = feature_importances.head(2)
feature_names = top_features['Feature'].values
# Use a plain float array so the mesh-grid predictions match the training input format
X_top_features = X[feature_names].to_numpy(dtype=float)

# Train a Random Forest model on the reduced feature set
# (fitted on the full dataset because it is used only for visualization)
rf_clf_reduced = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf_clf_reduced.fit(X_top_features, y)

# Create a mesh grid for plotting decision boundaries.
# A fixed number of points per axis keeps the grid size independent of the feature scale;
# a 0.01 step over a wide-ranging feature could create too many points to hold in memory.
x_min, x_max = X_top_features[:, 0].min() - 1, X_top_features[:, 0].max() + 1
y_min, y_max = X_top_features[:, 1].min() - 1, X_top_features[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))

# Predict on the mesh grid
Z = rf_clf_reduced.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundaries
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')

# Plot the scatter plot of the two most important features
plt.scatter(X_top_features[:, 0], X_top_features[:, 1], c=y, edgecolors='k', cmap='coolwarm')
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.title('Decision Boundaries and Scatter Plot of Top 2 Features')
plt.show()
```

### Explanation

1. **Identify Features:**
   - Use the feature importance scores to select the top 2 most important features.

2. **Train Reduced Model:**
   - Train a Random Forest classifier using only the two selected features to simplify visualization.

3. **Create Mesh Grid:**
   - Define a mesh grid covering the range of the two features to plot the decision boundaries.

4. **Predict and Plot:**
   - Use the model to predict class labels over the mesh grid and plot the decision boundaries.
   - Overlay the scatter plot of the two features to visualize how the decision boundaries separate the classes.

### Insights and Limitations

**Insights:**
- **Decision Boundaries:**
  - The plot shows how the Random Forest classifier separates different classes based on the two most important features. This visualization helps in understanding the regions where the model predicts different classes.
- **Feature Importance:**
  - By focusing on the top features, you can see which features contribute most to decision-making.

**Limitations:**
- **Limited to Two Features:**
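  - The plotted boundaries come from a separate model trained on only two features, whereas the full model uses all features for prediction. The plot is therefore an approximation, and interactions involving other features are not visible in it.
- **Overfitting Risk:**
  - Random Forests are generally robust, and adding more trees does not by itself cause overfitting. Very deep trees, however, can still fit noise in the training data, which often shows up as small, fragmented regions in the decision boundary plot.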
- **Interpretability:**
  - Decision boundaries provide an intuition but may not fully capture the complexity of interactions in higher-dimensional space.

This approach gives a visual understanding of how the model separates classes based on selected features but should be complemented with other metrics and analysis for a comprehensive evaluation.
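One simple way to probe the overfitting limitation noted above is to compare training and test accuracy at several tree depths: a large gap suggests that the model is memorising the training data. The sketch below does this on synthetic data with some label noise (`flip_y`), which is an assumption made so that the effect has a chance to appear. It is not the heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy data standing in for the real dataset (an assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

depth_results = []
for depth in [2, 5, 10, None]:  # None lets each tree grow until its leaves are pure
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)  # accuracy on data the model has seen
    test_acc = model.score(X_te, y_te)   # accuracy on held-out data
    depth_results.append((depth, train_acc, test_acc))
    print(f'max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}, gap={train_acc - test_acc:.2f}')
```

Applied to the real training and test sets, the same loop gives a rough indication of whether a shallower forest would generalise as well as a deeper one.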