# Theorical Questions

### 1. What is Unsupervised Learning in the Context of Machine Learning?

Unsupervised learning is a type of machine learning where the algorithm is given data without labeled responses. The goal is to find hidden patterns or intrinsic structures in the input data. It is commonly used for clustering, dimensionality reduction, and association rule learning.

Examples include:
- Clustering (e.g., K-Means, DBSCAN)
- Dimensionality Reduction (e.g., PCA, t-SNE)

### 2. How Does K-Means Clustering Algorithm Work?

K-Means is a partitioning clustering algorithm that divides the dataset into **K clusters**. It works as follows:

1. Initialize K centroids randomly.
2. Assign each data point to the nearest centroid based on Euclidean distance.
3. Recalculate the centroids as the mean of the points in each cluster.
4. Repeat steps 2 and 3 until the centroids no longer change significantly or a maximum number of iterations is reached.

The goal is to minimize the **inertia**, which is the sum of squared distances between data points and their respective cluster centers.

### 3. Explain the Concept of a Dendrogram in Hierarchical Clustering

A dendrogram is a tree-like diagram used to represent the arrangement of the clusters produced by hierarchical clustering. It shows:
- The order in which clusters are merged or split.
- The distance at which each merge occurs.

By cutting the dendrogram at a specific height, we can select the number of clusters desired. It helps in visualizing the hierarchical relationships between clusters.

### 4. What is the Main Difference Between K-Means and Hierarchical Clustering?

| Feature | K-Means | Hierarchical Clustering |
|--------|---------|--------------------------|
| Type | Partitional | Hierarchical |
| Cluster number | Must be specified before clustering | Not required beforehand |
| Output | Flat clusters | Tree-like structure (dendrogram) |
| Scalability | Efficient for large datasets | Not suitable for very large datasets |
| Reproducibility | Depends on random initialization | Deterministic |

The main difference is that K-Means divides the data into K distinct, non-overlapping clusters, while hierarchical clustering builds a hierarchy of clusters without requiring the number of clusters upfront.

### 5. What Are the Advantages of DBSCAN Over K-Means?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) offers several advantages over K-Means:

- **No need to specify number of clusters (K)**.
- Can **detect arbitrary shaped clusters**, not just spherical.
- Can **identify outliers** as noise.
- **Robust to noise** and outliers.
- Works well with clusters of **different sizes and densities**.

However, it may struggle with very high-dimensional data or when the density varies greatly between clusters.

### 6. Explain the working principle of a Bagging Classifier

Bagging (Bootstrap Aggregating) is an ensemble learning method that builds multiple versions of a predictor and uses these to get an aggregated prediction. For classification tasks, it combines the predictions of base estimators (often decision trees) using majority voting. Each model is trained on a random sample of the training data (with replacement), which helps to reduce variance and avoid overfitting.


### 7. How do you evaluate a Bagging Classifier’s performance?

To evaluate a Bagging Classifier, we use common classification metrics such as:

- **Accuracy**: The ratio of correct predictions to total predictions.
- **Precision, Recall, and F1-Score**: Useful for imbalanced datasets.
- **Confusion Matrix**: To understand how predictions are distributed across classes.
- **ROC-AUC Score**: For binary classification, helps to visualize performance at various thresholds.
- **Cross-validation**: Provides a more robust estimate of model performance.


### 8. How does a Bagging Regressor work?

Bagging Regressor works on the same principle as Bagging Classifier but for regression tasks. It:

- Trains multiple regressors (like decision trees) on different bootstrap samples.
- Aggregates the predictions by taking the **average** of individual model outputs.
- Helps reduce variance and improve prediction stability.


### 9. What is the main advantage of ensemble techniques?

The main advantage of ensemble techniques is improved **accuracy** and **robustness**. By combining multiple models:

- It reduces the likelihood of overfitting.
- It compensates for weaknesses in individual models.
- It generally improves generalization performance on unseen data.

Ensembles often outperform individual models in complex real-world tasks.


### 10. What is the main challenge of ensemble methods?

The main challenges of ensemble methods include:

- **Increased complexity**: Difficult to interpret and debug.
- **Computational cost**: Requires more memory and training time.
- **Tuning**: More parameters and models to tune.
- **Reduced interpretability**: It's harder to explain the final prediction made by an ensemble.

Despite these challenges, the performance gains often justify the added complexity.

### 11. Explain the key idea behind ensemble techniques

The key idea behind ensemble techniques is to **combine the predictions of multiple models** to produce a single, stronger model. By leveraging the diversity and strength of individual models, ensemble methods help to improve **accuracy**, **robustness**, and **generalization**. The combined output is often more accurate and reliable than that of any single model, especially in complex or noisy datasets.


### 12. What is a Random Forest Classifier?

A **Random Forest Classifier** is an ensemble learning method that builds a large number of decision trees during training. Each tree is trained on a random subset of the data and features. When making a prediction, the random forest takes a **majority vote** across all the trees.

It combines the **power of bagging** (bootstrapping + aggregation) with **random feature selection**, which helps reduce **overfitting** and improve **performance** on unseen data.


### 13. What are the main types of ensemble techniques?

The main types of ensemble techniques are:

1. **Bagging (Bootstrap Aggregating)**  
   - Builds multiple models independently using bootstrapped datasets.
   - Example: Random Forest.

2. **Boosting**  
   - Builds models sequentially, each trying to correct the errors of the previous one.
   - Examples: AdaBoost, Gradient Boosting, XGBoost.

3. **Stacking**  
   - Combines multiple models (called base learners) and uses another model (called a meta-learner) to make final predictions.

Each type focuses on improving model performance by addressing different kinds of errors (bias or variance).


### 14. What is ensemble learning in machine learning?

**Ensemble learning** is a machine learning paradigm where multiple models (often called "weak learners") are trained to solve the same problem and their predictions are combined. The goal is to achieve better performance than any single model alone.

It helps to:

- Reduce overfitting (variance)
- Improve accuracy
- Make predictions more robust

Ensemble methods are widely used in many real-world applications and are often top performers in machine learning competitions.


### 15. When should we avoid using ensemble methods?

We should avoid using ensemble methods when:

- **Model interpretability is important**: Ensembles are complex and hard to explain.
- **Data is very small**: Ensemble methods can overfit when there’s not enough data.
- **Real-time performance is critical**: They can be computationally expensive in training and prediction.
- **Simplicity is preferred**: In cases where a single simple model performs sufficiently well, ensembles may not be worth the added complexity.

In such situations, simpler models like logistic regression or single decision trees might be more suitable.

### 16. How does Bagging help in reducing overfitting?

Bagging (Bootstrap Aggregating) helps in reducing overfitting by training multiple models on different subsets of the data and averaging their predictions. Each subset is drawn using **bootstrap sampling** (sampling with replacement), which introduces **diversity** among models. Since overfitting typically arises from a model being too closely fitted to a single dataset, combining predictions from many varied models **reduces variance** and increases robustness.


### 17. Why is Random Forest better than a single Decision Tree?

Random Forest is better than a single Decision Tree because:

- It reduces **overfitting** by averaging multiple trees.
- It improves **generalization** to unseen data.
- It uses **random feature selection**, which introduces additional diversity among trees.
- It’s more **stable** and less sensitive to noise in the data.

While a single decision tree may capture noise and make inconsistent predictions, Random Forest mitigates this by combining multiple, less correlated models.


### 18. What is the role of bootstrap sampling in Bagging?

Bootstrap sampling is the process of creating multiple datasets by **randomly sampling with replacement** from the original dataset. Each model in the bagging ensemble is trained on a different bootstrap sample. This helps:

- Introduce **variation** in the training process.
- Ensure that each model sees a slightly different version of the data.
- **Reduce variance** by combining diverse models.

It is the core mechanism that allows Bagging to be effective.


### 19. What are some real-world applications of ensemble techniques?

Ensemble techniques are used in a wide variety of real-world applications, including:

- **Spam detection**: Combining classifiers to filter spam emails.
- **Fraud detection**: Ensemble models like Random Forests are used to identify suspicious transactions.
- **Medical diagnosis**: Used to improve accuracy in predicting diseases from diagnostic data.
- **Recommendation systems**: Ensembles can combine multiple models for better personalization.
- **Finance and credit scoring**: Predicting loan defaults or credit risks.

They are particularly useful in domains where **accuracy is critical**.

### 20. What is the difference between Bagging and Boosting?

| Feature              | Bagging                             | Boosting                              |
|----------------------|--------------------------------------|----------------------------------------|
| Training Strategy    | Models trained **in parallel**       | Models trained **sequentially**        |
| Data Sampling        | Bootstrap sampling (random subsets)  | Focus on errors made by previous models |
| Goal                 | Reduce **variance**                  | Reduce **bias and variance**           |
| Model Combination    | **Averaging** (regression) or majority voting (classification) | **Weighted** voting or sum             |
| Example Algorithms   | Random Forest                        | AdaBoost, Gradient Boosting, XGBoost   |

In short, **Bagging** reduces overfitting by creating independent models, while **Boosting** builds a strong model by correcting the errors of previous models.
                                                                    

# Practical Questions

In [2]:
# 21. Train a Bagging Classifier using Decision Trees on a sample dataset and print model accuracy

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)
bagging_clf.fit(X_train, y_train)
y_pred = bagging_clf.predict(X_test)
print("Bagging Classifier Accuracy:", accuracy_score(y_test, y_pred))

Bagging Classifier Accuracy: 1.0


In [8]:
# 22. Train a Bagging Regressor using Decision Trees and evaluate using Mean Squared Error (MSE)
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

bagging_reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10, random_state=0)

bagging_reg.fit(X_train, y_train)
y_pred = bagging_reg.predict(X_test)

print("Bagging Regressor MSE:", mean_squared_error(y_test, y_pred))

Bagging Regressor MSE: 0.27826689951710337


In [4]:
# 23. Train a Random Forest Classifier on the Breast Cancer dataset and print feature importance scores

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Display feature importances
feature_importances = pd.Series(rf_clf.feature_importances_, index=data.feature_names)
print("Feature Importances:\n", feature_importances.sort_values(ascending=False))


Feature Importances:
 mean concave points        0.141934
worst concave points       0.127136
worst area                 0.118217
mean concavity             0.080557
worst radius               0.077975
worst perimeter            0.074292
mean perimeter             0.060092
mean area                  0.053810
worst concavity            0.041080
mean radius                0.032312
area error                 0.029538
worst texture              0.018786
worst compactness          0.017539
radius error               0.016435
worst symmetry             0.012929
perimeter error            0.011770
worst smoothness           0.011769
mean texture               0.011064
mean compactness           0.009216
fractal dimension error    0.007135
worst fractal dimension    0.006924
mean smoothness            0.006223
smoothness error           0.005881
concavity error            0.005816
compactness error          0.004596
symmetry error             0.004001
concave points error       0.003382
mean s

In [5]:
# 24. Train a Random Forest Regressor and compare its performance with a single Decision Tree

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Use same regression data from Q22
rf_reg = RandomForestRegressor(n_estimators=100, random_state=0)
tree_reg = DecisionTreeRegressor(random_state=0)

rf_reg.fit(X_train, y_train)
tree_reg.fit(X_train, y_train)

rf_pred = rf_reg.predict(X_test)
tree_pred = tree_reg.predict(X_test)

print("Random Forest Regressor MSE:", mean_squared_error(y_test, rf_pred))
print("Decision Tree Regressor MSE:", mean_squared_error(y_test, tree_pred))

Random Forest Regressor MSE: 0.0345280701754386
Decision Tree Regressor MSE: 0.07602339181286549


In [6]:
# 25. Compute the Out-of-Bag (OOB) Score for a Random Forest Classifier

oob_rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
oob_rf.fit(X_train, y_train)
print("OOB Score:", oob_rf.oob_score_)

OOB Score: 0.9547738693467337


In [1]:
# 26. Train a Bagging Classifier using SVM as a base estimator and print accuracy
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

bagging_svm = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=42)

In [2]:
# 27. Train a Random Forest Classifier with different numbers of trees and compare accuracy
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree_nums = [10, 50, 100, 200]
for n in tree_nums:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    print(f"Random Forest with {n} trees: Accuracy = {accuracy_score(y_test, y_pred):.4f}")

Random Forest with 10 trees: Accuracy = 1.0000
Random Forest with 50 trees: Accuracy = 1.0000
Random Forest with 100 trees: Accuracy = 1.0000
Random Forest with 200 trees: Accuracy = 1.0000


In [None]:
# 28. Train a Bagging Classifier using Logistic Regression as a base estimator and print AUC score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing  # Use California housing dataset

data = fetch_california_housing()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200], 
    'max_depth': [10, 20, 30],     
    'min_samples_split': [2, 5, 10]  
}

grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='neg_mean_squared_error')  # Use negative MSE for regression
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Mean Squared Error:", -grid_search.best_score_)

y_pred = grid_search.best_estimator_.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Test Mean Squared Error: {mse:.4f}")

In [12]:
# 29. Train a Random Forest Regressor and analyze feature importance scores
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train, y_train)
importances = rf_reg.feature_importances_
features = X_train.columns if hasattr(X_train, 'columns') else [f'Feature {i}' for i in range(X_train.shape[1])]
importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
print(importance_df)


     Feature  Importance
0  Feature 0    0.525996
5  Feature 5    0.138238
7  Feature 7    0.086133
6  Feature 6    0.086099
1  Feature 1    0.054663
2  Feature 2    0.047174
4  Feature 4    0.031724
3  Feature 3    0.029973


In [None]:
# 30. Train an ensemble model using both Bagging and Random Forest and compare accuracy
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing  # Alternative dataset

data = fetch_california_housing()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='neg_mean_squared_error')  # Use 'neg_mean_squared_error' for regression
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best MSE:", -grid_search.best_score_)

y_pred = grid_search.best_estimator_.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Test Mean Squared Error: {mse:.4f}")

In [None]:
# 31. Train a Random Forest Classifier and tune hyperparameters using GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing  # Using California housing dataset for regression

data = fetch_california_housing()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='neg_mean_squared_error')  # Use 'neg_mean_squared_error' for regression
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best MSE:", -grid_search.best_score_)

y_pred = grid_search.best_estimator_.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Test Mean Squared Error: {mse:.4f}")

In [16]:
# 32. Train a Bagging Regressor with different numbers of base estimators and compare performance

from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

estimators = [10, 50, 100, 200]
for n in estimators:
    bag_reg = BaggingRegressor(n_estimators=n, random_state=42)
    bag_reg.fit(X_train, y_train)
    y_pred = bag_reg.predict(X_test)
    print(f"Bagging Regressor with {n} estimators: MSE = {mean_squared_error(y_test, y_pred):.4f}")


Bagging Regressor with 10 estimators: MSE = 0.2863
Bagging Regressor with 50 estimators: MSE = 0.2579
Bagging Regressor with 100 estimators: MSE = 0.2569
Bagging Regressor with 200 estimators: MSE = 0.2542


In [None]:
# 33. Train a Random Forest Classifier and analyze misclassified samples
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing 

data = fetch_california_housing()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200], 
    'max_depth': [10, 20, 30],  
    'min_samples_split': [2, 5, 10] 
}

grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='neg_mean_squared_error')  

print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Mean Squared Error:", -grid_search.best_score_)

y_pred = grid_search.best_estimator_.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Test Mean Squared Error: {mse:.4f}")