## Theoretical

1. Can we use Bagging for regression problems?

    - Yes. A Bagging Regressor trains each base model on a bootstrap sample and averages their numeric predictions instead of taking a majority vote.

2. What is the difference between multiple model training and single model training?

    - Single model training involves training just one model on the entire dataset and relying on it alone to make predictions.
    - Multiple model training (ensemble learning) involves training several models and combining their predictions to improve accuracy, stability, and generalization. Examples include Bagging, Boosting, and Stacking.


3. Explain the concept of feature randomness in Random Forest

    - Feature randomness in Random Forest refers to the process where, at each split of a decision tree, only a random subset of the total features is considered. This helps reduce overfitting and increases diversity among trees, improving overall model performance.
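The per-split feature subset described above can be sketched in a few lines. This is only an illustration, assuming the square-root rule (scikit-learn's default `max_features="sqrt"` for Random Forest classifiers); the helper name `candidate_features` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 20
k = int(np.sqrt(n_features))  # square-root rule: 4 candidate features per split


def candidate_features(rng, n_features, k):
    """Return k distinct feature indices to consider at one split (hypothetical helper)."""
    return rng.choice(n_features, size=k, replace=False)


# Each split draws its own subset, so different splits (and trees) see different features
split_a = candidate_features(rng, n_features, k)
split_b = candidate_features(rng, n_features, k)
print("Split A considers features:", sorted(split_a.tolist()))
print("Split B considers features:", sorted(split_b.tolist()))
```

Because a strong feature is not always among the candidates, the trees are forced to use different features and end up less correlated with each other.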

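4. What is OOB (Out-of-Bag) Score?

    - Each tree in a bagged ensemble is trained on a bootstrap sample, which leaves out roughly 37% of the training rows. These left-out rows are that tree's out-of-bag (OOB) samples.
    - The OOB score is the model's performance when each training row is predicted using only the trees that did not see it during training (in scikit-learn, accuracy for classifiers and R² for regressors). It acts as a built-in validation estimate that needs no separate hold-out set.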
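To see where out-of-bag rows come from, the following sketch draws a single bootstrap sample with NumPy and counts the rows that were never selected. The sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# One bootstrap sample: draw n_samples row indices with replacement
in_bag = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are out-of-bag for the model trained on this sample
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[in_bag] = False
oob_fraction = oob_mask.mean()

# Probability that a given row is never drawn: (1 - 1/n)^n, which approaches 1/e
expected = (1 - 1 / n_samples) ** n_samples
print(f"Observed OOB fraction: {oob_fraction:.3f}")
print(f"Expected OOB fraction: {expected:.3f}")
```

Both numbers should be close to 0.37, which is why every tree has a sizeable set of unseen rows available for validation.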
    
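5. How can you measure the importance of features in a Random Forest model?

    - Mean Decrease in Impurity (MDI), also called Gini importance: measures how much each feature reduces impurity (e.g., Gini impurity or entropy) across all trees.
    - Permutation Importance: measures the decrease in model performance (e.g., accuracy or RMSE) when the values of a feature are randomly shuffled.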
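Both measures are available in scikit-learn. The sketch below computes them side by side on a small synthetic dataset; the dataset sizes and the choice to compute permutation importance on the held-out split are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# MDI: computed from the training process itself, normalized to sum to 1
mdi = model.feature_importances_

# Permutation importance: drop in test accuracy when one feature column is shuffled
perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)

for i in range(X.shape[1]):
    print(f"Feature {i}: MDI={mdi[i]:.3f}, permutation={perm.importances_mean[i]:.3f}")
```

MDI is computed from the training data only, while permutation importance can be measured on held-out data, so the two rankings may differ.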


6. Explain the working principle of a Bagging Classifier

    - Creating multiple bootstrap samples from the original dataset.
    - Training a separate base classifier (e.g., decision tree) on each bootstrap sample.
    - Aggregating the predictions of all classifiers, typically using majority voting.
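The three steps above can be written out by hand. This is a minimal sketch using hard majority voting on a binary problem, not a reproduction of scikit-learn's `BaggingClassifier` internals; the dataset and the number of models are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a binary vote can never tie
votes = []

for _ in range(n_models):
    # Step 1: bootstrap sample (row indices drawn with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one base classifier on that sample
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    votes.append(tree.predict(X_test))

votes = np.array(votes)  # shape: (n_models, n_test_samples)

# Step 3: majority vote for labels 0/1 (predict 1 when more than half the trees say 1)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (y_pred == y_test).mean()
print(f"Manual bagging accuracy: {accuracy:.4f}")
```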


7. How do you evaluate a Bagging Classifier’s performance?

    - Accuracy, Precision, Recall, and F1-score for classification problems.
    - OOB Score for internal validation.
    - Cross-validation to get a reliable estimate of model performance.
    - ROC-AUC score for imbalanced datasets.

8. How does a Bagging Regressor work?

    - A Bagging Regressor follows the same principle as a Bagging Classifier but for regression:
        - Creates multiple bootstrap samples.
        - Trains separate regression models (e.g., Decision Trees).
        - Aggregates predictions by averaging the outputs of all models.

9. What is the main advantage of ensemble techniques?

    - The main advantage is higher accuracy and better generalization. By combining multiple models, ensemble techniques reduce variance, mitigate overfitting, and improve prediction robustness.

10. What is the main challenge of ensemble methods?

    - The main challenge is computational complexity and interpretability. Training multiple models requires more computational resources, and interpreting ensemble models can be more difficult than understanding a single decision tree or regression model.

11. Explain the key idea behind ensemble techniques
    
    - Ensemble techniques combine multiple models to improve prediction accuracy, reduce overfitting, and increase robustness. The idea is that a group of diverse models can make better decisions than a single model.


12. What is a Random Forest Classifier?

    - A Random Forest Classifier is an ensemble learning method that builds multiple decision trees and combines their predictions using majority voting. It introduces randomness by selecting different data samples (bootstrapping) and considering only a subset of features at each split.


13. What are the main types of ensemble techniques?

    - Bagging (Bootstrap Aggregating): Reduces variance by training multiple models on different bootstrap samples of the data and averaging (or voting on) their predictions.
    - Boosting: Reduces bias by training models sequentially, where each new model corrects the errors of the previous one.
    - Stacking: Trains several different base models and a meta-model that learns how to combine their predictions.

14. What is ensemble learning in machine learning?

    - Ensemble learning is a technique where multiple models (of the same type or of different types) are combined to make better predictions than any single model.

15. When should we avoid using ensemble methods?

    - When computational resources are limited, as ensembles require more training time.
    - When interpretability is crucial, since ensembles are harder to explain than single models.
    - When a single model performs well enough, making an ensemble unnecessary.


16. How does Bagging help in reducing overfitting?

    - Bagging reduces overfitting by training multiple models on different bootstrap samples and averaging their predictions. This helps smooth out the variance in individual models.
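The variance-smoothing effect can be illustrated with a small simulation. It assumes an idealized case in which every model's prediction is the true value plus independent noise; real bagged models are trained on overlapping data and are therefore correlated, so the reduction in practice is smaller than shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 3.0
n_trials, n_models = 5000, 25

# Each row is one trial; each column is one model's noisy prediction of true_value
preds = true_value + rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_models))

single_var = preds[:, 0].var()        # variance of one model's prediction
avg_var = preds.mean(axis=1).var()    # variance of the averaged prediction

print(f"Variance of a single model:      {single_var:.3f}")
print(f"Variance of the {n_models}-model average: {avg_var:.3f}")
```

Under the independence assumption, the variance of the average is roughly the single-model variance divided by the number of models.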

17. Why is Random Forest better than a single Decision Tree?

    - More robust: Reduces overfitting by averaging multiple trees.
    - More stable: Less sensitive to noise in data.
    - More accurate: Averaging many decorrelated trees usually gives lower test error than a single tree, and its feature importance scores are more stable because they are averaged across trees.

18. What is the role of bootstrap sampling in Bagging?

    - Bootstrap sampling draws a random sample of the dataset with replacement (typically the same size as the original) to train each base model. Because every model sees a slightly different dataset, the models are diverse, which leads to better generalization.

19. What are some real-world applications of ensemble techniques?

    - Finance: Fraud detection and credit risk assessment.
    - Healthcare: Disease prediction and medical diagnosis.
    - Image Recognition: Facial recognition and object detection.
    - Natural Language Processing: Sentiment analysis and spam filtering.
    - Recommendation Systems: Movie, product, and content recommendations.

20. What is the difference between Bagging and Boosting?

    - Bagging: Reduces variance by training multiple models independently on different random subsets of data and averaging their predictions (e.g., Random Forest).
    - Boosting: Reduces bias by training models sequentially, where each model learns from the errors of the previous one (e.g., AdaBoost, Gradient Boosting).


## Practical

In [None]:
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, RandomForestRegressor, StackingClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score, confusion_matrix, precision_score, recall_score, f1_score, roc_curve, precision_recall_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVC
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# 21. Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Bagging Classifier using Decision Trees
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=1)
bagging_clf.fit(X_train, y_train)

# Predict and evaluate accuracy
y_pred = bagging_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print(f"Bagging Classifier Accuracy: {accuracy:.2f}")

In [None]:
# 22. Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE)

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Bagging Regressor using Decision Trees
bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50, random_state=1)
bagging_reg.fit(X_train, y_train)

# Predict and evaluate MSE
y_pred = bagging_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Print MSE
print(f"Bagging Regressor Mean Squared Error: {mse:.2f}")

In [None]:
# 23. Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores

from sklearn.datasets import load_breast_cancer


# Load the Breast Cancer dataset
cancer_data = load_breast_cancer()
X, y = cancer_data.data, cancer_data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=1)
rf_clf.fit(X_train, y_train)

# Get feature importance scores
feature_importances = rf_clf.feature_importances_

# Print feature importance scores with feature names
print("Feature Importances:")
for feature, importance in zip(cancer_data.feature_names, feature_importances):
    print(f"{feature}: {importance:.4f}")


In [None]:
# 24. Train a Random Forest Regressor and compare its performance with a single Decision Tree

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

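# Train a Decision Tree Regressor (fixed seed so the comparison is reproducible)
dt_reg = DecisionTreeRegressor(random_state=1)
dt_reg.fit(X_train, y_train)

# Train a Random Forest Regressor (fixed seed so the comparison is reproducible)
rf_reg = RandomForestRegressor(n_estimators=100, random_state=1)
rf_reg.fit(X_train, y_train)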

# Predict and evaluate MSE for both models
dt_mse = mean_squared_error(y_test, dt_reg.predict(X_test))
rf_mse = mean_squared_error(y_test, rf_reg.predict(X_test))

# Print the MSE scores
print(f"Decision Tree Regressor MSE: {dt_mse:.2f}")
print(f"Random Forest Regressor MSE: {rf_mse:.2f}")

In [None]:
# 25. Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier

from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
X, y = cancer_data.data, cancer_data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Random Forest Classifier with OOB score enabled
rf_clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=1)
rf_clf.fit(X_train, y_train)

# Print the Out-of-Bag (OOB) score
print(f"Out-of-Bag (OOB) Score: {rf_clf.oob_score_:.4f}")


In [None]:
# 26. Train a Bagging Classifier using SVM as a base estimator and print accuracy

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Bagging Classifier using SVM as the base estimator
bagging_svm = BaggingClassifier(estimator=SVC(), n_estimators=50, random_state=1)
bagging_svm.fit(X_train, y_train)

# Predict and evaluate accuracy
y_pred = bagging_svm.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print(f"Bagging Classifier with SVM Accuracy: {accuracy:.4f}")

In [None]:
# 27. Train a Random Forest Classifier with different numbers of trees and compare accuracy

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# List of different numbers of trees to test
n_trees_list = [10, 50, 100, 200]

# Train Random Forest classifiers with different numbers of trees and compare accuracy
for n_trees in n_trees_list:
    rf_clf = RandomForestClassifier(n_estimators=n_trees, random_state=1)
    rf_clf.fit(X_train, y_train)
    y_pred = rf_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Random Forest with {n_trees} trees - Accuracy: {accuracy:.4f}")


In [None]:
# 28. Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Bagging Classifier using Logistic Regression as the base estimator
bagging_lr = BaggingClassifier(estimator=LogisticRegression(), n_estimators=50, random_state=1)
bagging_lr.fit(X_train, y_train)

# Predict probabilities for AUC calculation
y_prob = bagging_lr.predict_proba(X_test)[:, 1]  # Get probability of the positive class

# Compute and print AUC score
auc_score = roc_auc_score(y_test, y_prob)
print(f"Bagging Classifier with Logistic Regression - AUC Score: {auc_score:.4f}")

In [None]:
# 29. Train a Random Forest Regressor and analyze feature importance scores

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=1)
rf_reg.fit(X_train, y_train)

# Get feature importance scores
feature_importances = rf_reg.feature_importances_

# Print feature importance scores
print("Feature Importances:")
for i, importance in enumerate(feature_importances):
    print(f"Feature {i + 1}: {importance:.4f}")

In [None]:
# 30. Train an ensemble model using both Bagging and Random Forest and compare accuracy

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Bagging Classifier using Decision Trees
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=1)
bagging_clf.fit(X_train, y_train)
bagging_pred = bagging_clf.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_pred)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=50, random_state=1)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)

# Print accuracy of both models
print(f"Bagging Classifier Accuracy: {bagging_accuracy:.4f}")
print(f"Random Forest Classifier Accuracy: {rf_accuracy:.4f}")

In [None]:
# 31. Train a Random Forest Classifier and tune hyperparameters using GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Define the Random Forest Classifier (fixed seed so the search is reproducible)
rf_clf = RandomForestClassifier(random_state=1)

# Define hyperparameters for tuning
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Perform Grid Search with Cross-Validation
grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get the best model
best_rf = grid_search.best_estimator_

# Predict on test data
y_pred = best_rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print best hyperparameters and accuracy
print(f"Best Hyperparameters: {grid_search.best_params_}")
print(f"Random Forest Classifier Accuracy: {accuracy:.4f}")

In [None]:
# 32. Train a Bagging Regressor with different numbers of base estimators and compare performance

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# List of different numbers of base estimators to test
n_estimators_list = [10, 50, 100, 200]

# Train Bagging Regressors with different numbers of base estimators and compare performance
for n_estimators in n_estimators_list:
    bagging_reg = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=n_estimators, random_state=1)
    bagging_reg.fit(X_train, y_train)
    y_pred = bagging_reg.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Bagging Regressor with {n_estimators} estimators - MSE: {mse:.4f}")

In [None]:
# 33. Train a Random Forest Classifier and analyze misclassified samples

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=1)
rf_clf.fit(X_train, y_train)

# Predict on test data
y_pred = rf_clf.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Classifier Accuracy: {accuracy:.4f}")

# Identify misclassified samples
misclassified_indices = np.where(y_test != y_pred)[0]
misclassified_samples = X_test[misclassified_indices]

# Convert misclassified samples to a DataFrame for better readability
misclassified_df = pd.DataFrame(misclassified_samples, columns=[f"Feature_{i}" for i in range(X.shape[1])])
misclassified_df['Actual Label'] = y_test[misclassified_indices]
misclassified_df['Predicted Label'] = y_pred[misclassified_indices]

# Display the first few misclassified samples
print("\nMisclassified Samples:")
print(misclassified_df.head())


In [None]:
# 34. Train a Bagging Classifier and compare its performance with a single Decision Tree Classifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a single Decision Tree Classifier
dt_clf = DecisionTreeClassifier(random_state=1)
dt_clf.fit(X_train, y_train)
dt_pred = dt_clf.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)

# Train a Bagging Classifier using Decision Tree as the base estimator
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=1)
bagging_clf.fit(X_train, y_train)
bagging_pred = bagging_clf.predict(X_test)
bagging_accuracy = accuracy_score(y_test, bagging_pred)

# Print accuracy of both models
print(f"Decision Tree Classifier Accuracy: {dt_accuracy:.4f}")
print(f"Bagging Classifier Accuracy: {bagging_accuracy:.4f}")

In [None]:
# 35. Train a Random Forest Classifier and visualize the confusion matrix

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=1)
rf_clf.fit(X_train, y_train)

# Predict on test data
y_pred = rf_clf.predict(X_test)

# Compute confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Classifier Accuracy: {accuracy:.4f}")

# Visualize confusion matrix using seaborn
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Class 0', 'Class 1'], yticklabels=['Class 0', 'Class 1'])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

In [None]:
# 36. Train a Stacking Classifier using Decision Trees, SVM, and Logistic Regression, and compare accuracy

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Define base estimators
estimators = [
    ('decision_tree', DecisionTreeClassifier(random_state=1)),
    ('svm', SVC(probability=True, random_state=1))
]

# Define the Stacking Classifier with Logistic Regression as the final estimator
stacking_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=5)
stacking_clf.fit(X_train, y_train)

# Predict and compute accuracy
y_pred = stacking_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print(f"Stacking Classifier Accuracy: {accuracy:.4f}")

In [None]:
# 37. Train a Random Forest Classifier and print the top 5 most important features

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
feature_names = [f"Feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=1)
rf_clf.fit(X_train, y_train)

# Get feature importance scores
feature_importances = rf_clf.feature_importances_

# Create a DataFrame for better readability
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Print the top 5 most important features
print("Top 5 Most Important Features:")
print(importance_df.head(5))

In [None]:
# 38. Train a Bagging Classifier and evaluate performance using Precision, Recall, and F1-score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Bagging Classifier using Decision Tree as the base estimator
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=1)
bagging_clf.fit(X_train, y_train)

# Predict on test data
y_pred = bagging_clf.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print evaluation metrics
print("Bagging Classifier Performance:")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")

In [None]:
# 39. Train a Random Forest Classifier and analyze the effect of max_depth on accuracy

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Define different values for max_depth
max_depth_values = [2, 5, 10, 20, None]  # None means unlimited depth

# Store accuracies for analysis
accuracies = []

# Train and evaluate Random Forest models with different max_depth values
for max_depth in max_depth_values:
    rf_clf = RandomForestClassifier(n_estimators=100, max_depth=max_depth, random_state=1)
    rf_clf.fit(X_train, y_train)
    y_pred = rf_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"Max Depth: {max_depth}, Accuracy: {accuracy:.4f}")

# Plot the effect of max_depth on accuracy
plt.figure(figsize=(8, 5))
plt.plot([str(md) for md in max_depth_values], accuracies, marker='o', linestyle='-', color='b')
plt.xlabel("Max Depth")
plt.ylabel("Accuracy")
plt.title("Effect of max_depth on Random Forest Accuracy")
plt.grid(True)
plt.show()

In [None]:
# 40. Train a Bagging Regressor using different base estimators (DecisionTree and KNeighbors) and compare performance

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Define base estimators
base_estimators = {
    "Decision Tree": DecisionTreeRegressor(),
    "K-Neighbors": KNeighborsRegressor()
}

# Store results
mse_scores = {}

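# Train and evaluate Bagging Regressors with different base estimators
for name, base_estimator in base_estimators.items():
    # Pass the current base estimator; without it every run uses the default decision tree
    bagging_regressor = BaggingRegressor(estimator=base_estimator, n_estimators=50, random_state=1)
    bagging_regressor.fit(X_train, y_train)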
    y_pred = bagging_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores[name] = mse
    print(f"Bagging Regressor ({name}) - Mean Squared Error: {mse:.4f}")

# Plot comparison of MSE
plt.figure(figsize=(6, 4))
plt.bar(mse_scores.keys(), mse_scores.values(), color=['blue', 'green'])
plt.ylabel("Mean Squared Error (MSE)")
plt.title("Comparison of Bagging Regressors with Different Base Estimators")
plt.show()

In [None]:
# 41. Train a Random Forest Classifier and evaluate its performance using ROC-AUC Score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=1)
rf_clf.fit(X_train, y_train)

# Predict probabilities for ROC-AUC
y_probs = rf_clf.predict_proba(X_test)[:, 1]  # Get probability scores for the positive class

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_probs)

# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test, y_probs)

# Print ROC-AUC score
print(f"Random Forest Classifier ROC-AUC Score: {roc_auc:.4f}")

# Plot the ROC Curve
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, color='blue', label=f"ROC Curve (AUC = {roc_auc:.4f})")
plt.plot([0, 1], [0, 1], linestyle='--', color='red', label="Random Guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Random Forest Classifier")
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# 42. Train a Bagging Classifier and evaluate its performance using cross-validation

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Define a Bagging Classifier with Decision Tree as the base estimator
bagging_clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=1)

# Perform cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_scores = cross_val_score(bagging_clf, X, y, cv=cv, scoring='accuracy')

# Print cross-validation results
print(f"Cross-Validation Accuracy Scores: {cv_scores}")
print(f"Mean Accuracy: {cv_scores.mean():.4f}")
print(f"Standard Deviation: {cv_scores.std():.4f}")

In [None]:
# 43. Train a Random Forest Classifier and plot the Precision-Recall curve

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=1)
rf_clf.fit(X_train, y_train)

# Predict probabilities for Precision-Recall Curve
y_probs = rf_clf.predict_proba(X_test)[:, 1]  # Get probability scores for the positive class

# Compute Precision-Recall curve
precision, recall, _ = precision_recall_curve(y_test, y_probs)

# Compute AUC (Area Under Curve) for Precision-Recall
pr_auc = auc(recall, precision)

# Plot the Precision-Recall Curve
plt.figure(figsize=(6, 4))
plt.plot(recall, precision, color='blue', label=f"PR Curve (AUC = {pr_auc:.4f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve for Random Forest Classifier")
plt.legend()
plt.grid(True)
plt.show()

# Print PR AUC Score
print(f"Precision-Recall AUC Score: {pr_auc:.4f}")

In [None]:
# 44. Train a Stacking Classifier with Random Forest and Logistic Regression and compare accuracy

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Define base estimators (an SVM is included alongside the Random Forest so the meta-learner has two models to combine)
base_estimators = [
    ('random_forest', RandomForestClassifier(n_estimators=100, random_state=1)),
    ('svm', SVC(probability=True, random_state=1))
]

# Define the Stacking Classifier with Logistic Regression as the meta-learner
stacking_clf = StackingClassifier(estimators=base_estimators, final_estimator=LogisticRegression(), cv=5)

# Train the Stacking Classifier
stacking_clf.fit(X_train, y_train)

# Make predictions
y_pred = stacking_clf.predict(X_test)

# Compute accuracy
stacking_accuracy = accuracy_score(y_test, y_pred)
print(f"Stacking Classifier Accuracy: {stacking_accuracy:.4f}")

# Train and compare individual models
rf_clf = RandomForestClassifier(n_estimators=100, random_state=1)
rf_clf.fit(X_train, y_train)
rf_accuracy = accuracy_score(y_test, rf_clf.predict(X_test))

lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
lr_accuracy = accuracy_score(y_test, lr_clf.predict(X_test))

# Print individual model accuracies
print(f"Random Forest Accuracy: {rf_accuracy:.4f}")
print(f"Logistic Regression Accuracy: {lr_accuracy:.4f}")

In [None]:
# 45. Train a Bagging Regressor with different levels of bootstrap samples and compare performance

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Define different bootstrap sample sizes
bootstrap_samples = [0.5, 0.7, 1.0]  # Proportion of samples used for training each base model

# Store results
mse_scores = {}

# Train and evaluate Bagging Regressors with different bootstrap sample sizes
for sample_size in bootstrap_samples:
    bagging_regressor = BaggingRegressor(
        estimator=DecisionTreeRegressor(),
        n_estimators=50,
        max_samples=sample_size,  # Define bootstrap sample size
        random_state=1
    )
    bagging_regressor.fit(X_train, y_train)
    y_pred = bagging_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores[sample_size] = mse
    print(f"Bagging Regressor (Bootstrap Samples = {sample_size}) - MSE: {mse:.4f}")

# Plot comparison of MSE
plt.figure(figsize=(6, 4))
plt.bar([str(s) for s in bootstrap_samples], mse_scores.values(), color=['blue', 'green', 'red'])
plt.xlabel("Bootstrap Sample Size")
plt.ylabel("Mean Squared Error (MSE)")
plt.title("Comparison of Bagging Regressors with Different Bootstrap Samples")
plt.show()