# **Supervised Classification: Decision Trees, SVM, and Naive Bayes**

## **Question 1: What is Information Gain, and how is it used in Decision Trees?**

**Answer:**

**Information Gain** measures how much uncertainty (impurity) is reduced in the target variable after splitting a dataset based on a particular feature. It is most often calculated using entropy as the impurity measure.

**Formula:**

$$\text{Information Gain} = \text{Entropy}(\text{before split}) - \text{Weighted Entropy}(\text{after split})$$

Where:

- Entropy(before split) = impurity of the parent node
- Weighted Entropy(after split) = sum of the child-node entropies, each weighted by the fraction of the parent's samples that falls into that child

**How it's used in Decision Trees:**
1. Calculate entropy of the current dataset (the parent node).

2. For each feature:

- Split the data based on that feature.

- Calculate the weighted average entropy of the resulting subsets.

- Compute Information Gain for that feature.

3. Choose the feature with the highest Information Gain — it gives the most reduction in impurity.

4. Repeat the process recursively for child nodes.
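
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by any particular library: the helper names `entropy` and `information_gain` and the toy weather data are made up for this example, and features are assumed to be categorical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # sum(p * log2(1/p)) is the same quantity as -sum(p * log2(p))
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(feature, labels):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    weighted_child_entropy = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # mask.mean() is the fraction of samples that fall into this child
        weighted_child_entropy += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted_child_entropy

# Toy data, invented purely for illustration
outlook = np.array(["sunny", "sunny", "rain", "rain", "rain", "sunny"])
windy = np.array(["yes", "no", "yes", "no", "yes", "no"])
play = np.array(["no", "no", "yes", "yes", "yes", "no"])

gains = {
    "outlook": information_gain(outlook, play),
    "windy": information_gain(windy, play),
}
best_feature = max(gains, key=gains.get)
print(gains)
print("Best feature to split on:", best_feature)
```

In this toy data `outlook` separates the two classes perfectly, so its gain equals the full parent entropy of 1 bit and it is chosen over `windy`.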

## **Question 2: What is the difference between Gini Impurity and Entropy?**
Hint: Directly compares the two main impurity measures, highlighting strengths,
weaknesses, and appropriate use cases.

**Answer:**

| **Aspect**             | **Gini Impurity**                                                            | **Entropy**                                                         |
| ---------------------- | ---------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| **Formula**            | $Gini = 1 - \sum_i p_i^2$                                                    | $Entropy = -\sum_i p_i \log_2 p_i$                                  |
| **Meaning**            | Measures the probability of misclassifying a randomly chosen element         | Measures the level of disorder or uncertainty in the dataset        |
| **Value Range**        | 0 (pure) → 0.5 (for 2 classes, maximum impurity)                             | 0 (pure) → 1 (for 2 classes, maximum impurity)                      |
| **When = 0**           | Node is pure (only one class present)                                        | Node is pure (only one class present)                               |
| **Computation Speed**  | Faster (no logarithm calculation)                                            | Slightly slower (involves log₂)                                     |
| **Behavior**           | Tends to isolate the most frequent class in its own branch                   | Tends to produce slightly more balanced trees; grounded in information theory |
| **Preferred By**       | CART algorithm (Classification and Regression Trees); the scikit-learn default | ID3, C4.5, C5.0 algorithms                                      |
| **Splitting Tendency** | Chooses the split with the largest decrease in Gini impurity                 | Chooses the split with the largest information gain (decrease in entropy); in practice the two criteria often select similar splits |

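To make the value ranges in the table concrete, the following sketch evaluates both formulas on a few two-class probability distributions. The helper names `gini` and `entropy` are chosen for this example only, and the inputs are assumed to be valid probability vectors.

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_i^2)."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Shannon entropy in bits, treating 0 * log2(0) as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability classes
    # sum(p * log2(1/p)) is the same quantity as -sum(p * log2(p))
    return float(np.sum(p * np.log2(1.0 / p)))

# Pure node, imbalanced node, and maximally mixed node
for dist in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(dist, "gini =", round(gini(dist), 4), "entropy =", round(entropy(dist), 4))
```

Both measures are 0 for the pure node and reach their two-class maximum (0.5 for Gini, 1 for entropy) at the 50/50 split.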

## **Question 3: What is Pre-Pruning in Decision Trees?**

**Answer:**

**Pre-Pruning (Early Stopping)**

- The tree's growth is **stopped early** based on certain conditions **before** it becomes overly complex.

- Common stopping criteria:

  - Maximum depth (`max_depth`)
  - Minimum samples to split (`min_samples_split`)
  - Minimum samples in a leaf (`min_samples_leaf`)
  - Minimum impurity decrease (`min_impurity_decrease`)

- **Practical advantage:** Saves computation time and reduces the risk of overfitting by keeping the model simpler from the start.
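
As a rough sketch of how these stopping criteria look in scikit-learn, the snippet below grows one unrestricted tree and one pre-pruned tree on the Iris dataset and compares their size. The specific threshold values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No stopping criteria: the tree grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any criterion is hit
# (threshold values below are illustrative assumptions)
pruned_tree = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    min_impurity_decrease=0.01,
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("unrestricted", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.4f}"
    )
```

In practice these thresholds are usually tuned with cross-validation rather than set by hand.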


## **Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).**

Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.

**Answer:**

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the dataset (Iris dataset for example)
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the Decision Tree Classifier using Gini Impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the classifier on the training set
clf.fit(X_train, y_train)

# Evaluate accuracy on the held-out test set
print("Model Accuracy:", clf.score(X_test, y_test))

# Print the importance of each feature (the values sum to 1)
print("\nFeature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")
```


```text
Model Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772
```


## **Question 5: What is a Support Vector Machine (SVM)?**

**Answer:**

**Support Vector Machine (SVM)** is a supervised machine learning algorithm used primarily for **classification**, though it can also handle regression tasks. Its goal is to find the boundary (called a **hyperplane**) that separates the classes with the largest possible margin, that is, the greatest distance to the nearest training points of each class. Those nearest points are the **support vectors**.

Think of it like drawing a line on a 2D plot to separate red dots from blue dots—but it works in much higher dimensions too.
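
The 2D picture can be reproduced with a small sketch. The six points below are made up for illustration and are linearly separable by construction; a large `C` is used as an assumption to approximate a hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2D points (invented for this example)
X = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # class 0
    [5.0, 5.0], [6.0, 5.5], [5.5, 6.5],   # class 1
])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C penalizes margin violations heavily (close to a hard margin)
clf = SVC(kernel="linear", C=1e3).fit(X, y)

# For a linear kernel the hyperplane is w . x + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("Hyperplane normal w:", w)
print("Intercept b:", b)
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

Only the support vectors determine the hyperplane; moving any of the other points (without crossing the margin) would leave the boundary unchanged.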

## **Question 6: What is the Kernel Trick in SVM?**

**Answer:**

The **Kernel Trick** is a mathematical shortcut that lets SVMs handle data that is not linearly separable without explicitly mapping it into a higher-dimensional space.

**Why is this needed?**

Imagine your data points can't be separated by a straight line in 2D, but if you were to “lift” them into 3D space, they would be separable by a flat plane. The problem is, explicitly calculating those new coordinates in high-dimensional space is often computationally expensive or impossible.

The kernel trick **avoids the heavy lifting** by computing the dot product between points as if they were already transformed into that high-dimensional space — **without actually calculating their coordinates in that space.**

**How it works:**
- SVM decision functions rely on dot products between feature vectors.
- A kernel function K(x, y) computes this dot product **in some transformed feature space.**
- By plugging in the kernel function, SVM can operate as if the data were mapped into a higher-dimensional space, **enabling non-linear separation**.
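
A small numeric check makes this concrete. The sketch below uses the degree-2 homogeneous polynomial kernel K(x, y) = (x · y)² on 2D inputs, for which an explicit 3D feature map is known. The function names `phi` and `poly_kernel` are chosen for this example only.

```python
import numpy as np

def phi(v):
    """Explicit feature map into 3D for a 2D vector (x1, x2)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel: K(x, y) = (x . y)^2."""
    return float(np.dot(x, y) ** 2)

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

# Route 1: map both points into 3D, then take the dot product there
explicit = float(np.dot(phi(x), phi(y)))

# Route 2: apply the kernel directly to the original 2D points
implicit = poly_kernel(x, y)

print("Explicit mapping:", explicit)
print("Kernel shortcut: ", implicit)
```

Both routes give the same number, but the kernel never builds the 3D vectors. For kernels such as RBF, whose implicit feature space is infinite-dimensional, the shortcut is the only practical route.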

## **Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.**

Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting
on the same dataset.

**Answer:**

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create two SVM classifiers: one with Linear kernel, one with RBF kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_rbf = SVC(kernel='rbf', random_state=42)

# Train both models
svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)

# Make predictions on test data
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

# Calculate accuracies
acc_linear = accuracy_score(y_test, y_pred_linear)
acc_rbf = accuracy_score(y_test, y_pred_rbf)

# Print results
print("SVM Classifier with Linear Kernel - Accuracy:", round(acc_linear, 4))
print("SVM Classifier with RBF Kernel    - Accuracy:", round(acc_rbf, 4))

# Compare which performed better
if acc_linear > acc_rbf:
    print("\n Linear Kernel performed better.")
elif acc_rbf > acc_linear:
    print("\n RBF Kernel performed better.")
else:
    print("\n Both kernels performed equally well.")
```


```text
SVM Classifier with Linear Kernel - Accuracy: 1.0
SVM Classifier with RBF Kernel    - Accuracy: 0.8056

 Linear Kernel performed better.
```
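
One likely contributor to the gap is feature scale: the RBF kernel depends on Euclidean distances, and the Wine features span very different ranges, so the largest-valued features dominate. The sketch below is an optional variation, not part of the original answer, that standardizes the features inside a pipeline before fitting each kernel. The resulting scores are not quoted here and should be checked by running it.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = {}
for kernel in ("linear", "rbf"):
    # The scaler is fit on the training data only, then applied to the test data
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, random_state=42))
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)

for kernel, score in scores.items():
    print(f"Scaled SVM with {kernel} kernel - Accuracy: {score:.4f}")
```

Putting the scaler in the pipeline keeps test-set statistics out of training, so the comparison between kernels stays fair.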


## **Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?**

**Answer:**

Naïve Bayes is a **probabilistic machine learning algorithm** used mainly for classification tasks. It's based on **Bayes' Theorem**, which helps you update the probability estimate for a hypothesis as more evidence comes in.

In simple terms, it predicts the class of a data point by calculating the probability that it belongs to each class and then picking the class with the highest probability.

**How does it work?**

1. It looks at the features (attributes) of your data.
2. It uses Bayes' theorem to calculate the probability of the data belonging to each class.
3. It assigns the class with the **highest posterior probability**.

**Why is it called "Naïve"?**

The "naïve" part comes from a **strong assumption it makes**:

- It assumes all features are independent of each other, given the class label.
- In reality, features often influence each other (they're correlated), but Naïve Bayes ignores this and treats each feature as if it stands alone.

This assumption is what makes it "naïve" — it simplifies the math drastically, making the algorithm fast and efficient, even if the assumption isn't perfectly true.
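
To show how the independence assumption simplifies the math, here is a hand-rolled sketch for a two-word spam filter. All probabilities are invented for illustration, and the function name `posterior` is hypothetical. Because features are treated as independent given the class, the joint likelihood is just a product of per-word likelihoods.

```python
# Made-up prior P(class) and per-word likelihoods P(word | class)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

def posterior(words):
    """Posterior P(class | words) under the naive independence assumption."""
    unnormalized = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            # Independence: multiply per-word likelihoods instead of
            # estimating one joint probability for the whole word set
            score *= likelihoods[label][word]
        unnormalized[label] = score
    total = sum(unnormalized.values())
    return {label: score / total for label, score in unnormalized.items()}

result = posterior(["free", "meeting"])
prediction = max(result, key=result.get)
print(result)
print("Predicted class:", prediction)
```

Without the assumption, the model would need a separate probability for every combination of words, which grows exponentially with the vocabulary.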

## **Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes**

**Answer:**

**Gaussian Naïve Bayes** ⬇

Gaussian Naïve Bayes is used when the features are **continuous numerical values**, such as height, weight, or temperature.
It assumes that the data for each class follows a **normal (Gaussian) distribution**.
For example, in the Iris dataset, where the features are continuous measurements of flower parts, Gaussian NB works very well.
It calculates the probability of a feature value belonging to a class using the mean and variance of that class’s Gaussian distribution.

**Multinomial Naïve Bayes** ⬇

Multinomial Naïve Bayes is designed for **discrete count data** — that is, features that represent how many times something occurs.
It is commonly used in **text classification problems**, such as spam detection or document categorization, where features are **word counts or term frequencies**.
It assumes that the features are counts drawn from a **Multinomial distribution** (non-negative integers).
For example, if a document contains a word five times, that count is used directly in the model.


**Bernoulli Naïve Bayes** ⬇

Bernoulli Naïve Bayes is used when the features are **binary (either 0 or 1)**, indicating the **presence or absence** of something.
In text classification, instead of counting how many times a word appears, we only record whether a word appears at all.
For instance, if a word occurs at least once, its feature value is 1; otherwise, it is 0.
This model assumes a **Bernoulli distribution** and is particularly useful in tasks like sentiment analysis or binary feature-based classification.
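
The mapping from feature type to variant can be sketched with scikit-learn as follows. The four-document corpus and its labels are made up for this example; `CountVectorizer(binary=True)` is used to turn word counts into presence/absence features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Continuous measurements -> GaussianNB
X_iris, y_iris = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X_iris, y_iris)

# Tiny invented corpus: 1 = spam, 0 = not spam
docs = [
    "free prize free money",
    "win money now",
    "meeting at noon",
    "project meeting notes",
]
labels = np.array([1, 1, 0, 0])

# Word counts (how many times each word occurs) -> MultinomialNB
counts = CountVectorizer().fit_transform(docs)
mnb = MultinomialNB().fit(counts, labels)

# Word presence (1 if the word occurs at all, else 0) -> BernoulliNB
presence = CountVectorizer(binary=True).fit_transform(docs)
bnb = BernoulliNB().fit(presence, labels)

print("GaussianNB training accuracy on Iris:", gnb.score(X_iris, y_iris))
print("MultinomialNB predictions:", mnb.predict(counts))
print("BernoulliNB predictions:  ", bnb.predict(presence))
```

The word "free" appears twice in the first document, so it contributes a count of 2 to the Multinomial features but only a 1 to the Bernoulli features.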

## **Question 10: Breast Cancer Dataset**

**Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.**

Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from
sklearn.datasets.

**Answer:**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on test set
y_pred = gnb.predict(X_test)

# Print classification report (includes overall accuracy)
report = classification_report(y_test, y_pred, target_names=data.target_names)
print(report)
```

```text
              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
```

