# Supervised Classification: Decision Trees, SVM, and Naive Bayes


# Question 1: What is Information Gain, and how is it used in Decision Trees?

Information Gain (IG) is a key concept used in Decision Trees to decide which feature should be chosen to split the data at each node. It measures how much "information" a feature provides about the target variable.

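Definition:

**Information Gain** is based on the concept of **Entropy** (a measure of impurity or disorder in data). It is the entropy of the parent node minus the weighted average entropy of the child nodes created by the split:

$$
\text{Information Gain} = \text{Entropy}(\text{Parent}) - \sum_{k} \frac{n_k}{n}\,\text{Entropy}(\text{Child}_k)
$$

Here $n$ is the number of samples in the parent node and $n_k$ is the number of samples in child node $k$.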


Example:

Suppose we’re classifying whether to Play Tennis based on Weather.

| Weather  | Play |
| -------- | ---- |
| Sunny    | No   |
| Overcast | Yes  |
| Rainy    | Yes  |

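* Entropy before split (1 No, 2 Yes) = −(1/3)·log2(1/3) − (2/3)·log2(2/3) ≈ 0.918
* After splitting on “Weather”, each child node holds a single sample and is pure, so weighted entropy = 0
* Information Gain = 0.918 − 0 = 0.918

This means "Weather" removes all 0.918 bits of uncertainty about the target in this tiny dataset.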
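
The calculation for the table above can be sketched in plain Python. This is a minimal illustration; the helper names `entropy` and `information_gain` are made up for this example and are not part of any library.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Parent entropy minus the sample-weighted entropy of the child nodes."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    n = len(labels)
    weighted_child_entropy = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted_child_entropy

# The three rows of the Play Tennis table
weather = ["Sunny", "Overcast", "Rainy"]
play = ["No", "Yes", "Yes"]

parent_entropy = entropy(play)
gain = information_gain(weather, play)
print(f"Parent entropy: {parent_entropy:.3f}")
print(f"Information gain: {gain:.3f}")
```

Because every Weather value maps to exactly one row, each child node is pure, so the gain equals the parent entropy (about 0.918 bits).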


# Question 2: What is the difference between Gini Impurity and Entropy?

The table below compares Gini Impurity and Entropy:


| Feature         | **Gini Impurity**                                                            | **Entropy**                                                   |
| --------------- | ---------------------------------------------------------------------------- | ------------------------------------------------------------- |
| **Meaning**     | Measures how often a randomly chosen element would be incorrectly classified | Measures the amount of uncertainty or disorder in the dataset |
| **Formula**     | $G = 1 - \sum_i p_i^2$                                                       | $E = -\sum_i p_i \log_2 p_i$                                  |
| **Range**       | 0 → 0.5 (for 2 classes)                                                      | 0 → 1 (for 2 classes)                                         |
| **When it’s 0** | Perfectly pure node (only one class)                                         | Perfectly pure node (only one class)                          |
| **Computation** | Simpler, faster (no log calculation)                                         | Slightly slower (uses logarithm)                              |
| **Behavior**    | Tends to isolate the most frequent class in its own branch                   | Tends to produce slightly more balanced trees                 |
                                




* Gini → faster and often gives similar results.
* Entropy → gives a more “information theory” view of purity.
* Both are used to find the best feature to split a decision tree.
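
To make the ranges in the table concrete, here is a small sketch that evaluates both measures for a two-class node. The function names `gini` and `entropy` are illustrative helpers, and `p` is assumed to be the proportion of one of the two classes.

```python
import math

def gini(p):
    """Gini impurity of a binary node where one class has proportion p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

def entropy(p):
    """Entropy (bits) of a binary node; terms with zero probability contribute 0."""
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

Both measures are 0 for a pure node (`p = 0`) and peak at a 50/50 split, where Gini reaches 0.5 and entropy reaches 1.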



# Question 3: What is Pre-Pruning in Decision Trees?


Pre-pruning means stopping the tree from growing too deep while it’s being built — before it starts overfitting the data.

During tree building, the algorithm checks certain conditions.
If any of these are met, it stops splitting that branch.

Common pre-pruning rules:

* Maximum tree depth is reached
* A split would leave a leaf with fewer than the minimum number of samples
* Information gain (or Gini reduction) from the best split is too small
* The node has fewer samples than the minimum required to split


In short: pre-pruning stops growing the tree early to keep it simple and avoid overfitting.
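
In scikit-learn, each of these rules corresponds to a constructor argument of `DecisionTreeClassifier`. The sketch below uses the Iris dataset, and the threshold values are illustrative assumptions rather than tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each argument is one pre-pruning rule; the values are illustrative, not tuned
pruned = DecisionTreeClassifier(
    max_depth=3,                 # stop at a maximum depth
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_samples_leaf=5,          # every leaf must keep at least 5 samples
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=42,
).fit(X, y)

# Same model with no pre-pruning, for comparison
unpruned = DecisionTreeClassifier(random_state=42).fit(X, y)

print("Pruned   depth:", pruned.get_depth(), "leaves:", pruned.get_n_leaves())
print("Unpruned depth:", unpruned.get_depth(), "leaves:", unpruned.get_n_leaves())
```

The pruned tree should come out shallower and with fewer leaves than the unconstrained one.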



# Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).
#Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.


from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load sample dataset (Iris dataset)
data = load_iris()
X = data.data
y = data.target

# Create and train the Decision Tree Classifier using Gini Impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X, y)

# Print feature names and their importance scores
print("Feature Importances:")
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {importance}")


Feature Importances:
sepal length (cm): 0.013333333333333329
sepal width (cm): 0.0
petal length (cm): 0.5640559581320451
petal width (cm): 0.4226107085346215
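
As a quick sanity check on these scores: scikit-learn normalizes impurity-based importances so that they sum to 1. The sketch below refits the same model and prints the features ranked from most to least important. The variable names `ranked` and `total` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(data.data, data.target)

# Importances are normalized, so they should add up to 1
total = clf.feature_importances_.sum()

# Rank features from most to least important
ranked = sorted(
    zip(data.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
print(f"Sum of importances: {total:.3f}")
```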


# Question 5: What is a Support Vector Machine (SVM)?


A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks — but it’s mostly known for classification.

SVM tries to find the best line (in 2D) or the best plane/hyperplane (in higher dimensions) that separates different classes of data as clearly as possible.


SVM looks for the maximum margin — the widest gap between data points of different classes.
The data points closest to the line/plane are called Support Vectors — they “support” or define the boundary.

In short: SVM finds the best boundary that separates different classes with the widest margin possible.
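
The idea of support vectors and margin can be seen on a tiny, hand-made dataset. This is a sketch under stated assumptions: the four points are invented for illustration, and a large `C` is used so the linear SVM behaves approximately like a hard-margin classifier.

```python
import numpy as np
from sklearn.svm import SVC

# Four toy points on a diagonal: two per class (made-up data)
X = np.array([[0, 0], [1, 1], [3, 3], [4, 4]], dtype=float)
y = np.array([0, 0, 1, 1])

# Large C approximates a hard margin on separable data
svm = SVC(kernel="linear", C=1000).fit(X, y)

# For a linear SVM, the margin width is 2 / ||w||
w = svm.coef_[0]
margin_width = 2 / np.linalg.norm(w)

print("Support vectors:\n", svm.support_vectors_)
print("Margin width:", round(margin_width, 3))
```

Only the two closest points, (1, 1) and (3, 3), should end up as support vectors, and the margin width should be close to the distance between them (about 2.83). The outer points could be removed without changing the boundary.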


# Question 6: What is the Kernel Trick in SVM?


The Kernel Trick is a method used in SVM to handle non-linear data — that is, when the data cannot be separated by a straight line.

* Some datasets can’t be divided by a straight line (linear boundary).
* The kernel trick transforms the data into a higher-dimensional space,
  where it becomes linearly separable.
* SVM can then find a linear boundary in that new space — which corresponds to a non-linear boundary in the original space.

Example:

Imagine data points shaped like two circles — one inside the other.
You can’t separate them with a straight line in 2D.
But if you map them to 3D (using a kernel), they become separable by a plane.

Common kernels used: Linear, Polynomial, RBF (Radial Basis Function), and Sigmoid.


In short: The Kernel Trick allows SVM to separate complex, non-linear data by implicitly mapping it to a higher dimension. It only needs similarity values (inner products) between pairs of points, so the high-dimensional coordinates are never computed directly.
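
The two-circles example can be reproduced with scikit-learn's `make_circles`. The sketch below compares three options: a linear SVM in 2D, a linear SVM after explicitly adding the squared radius as a third coordinate, and an RBF-kernel SVM that does a comparable job implicitly. The dataset parameters and the choice of squared radius as the extra feature are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line in 2D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# 1) Linear SVM in the original 2D space
acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)

# 2) Explicit mapping to 3D: add the squared radius x1^2 + x2^2 as a new feature
X_3d = np.c_[X, (X ** 2).sum(axis=1)]
acc_3d = SVC(kernel="linear").fit(X_3d, y).score(X_3d, y)

# 3) RBF kernel on the original 2D data (implicit mapping)
acc_rbf = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"Linear, 2D:        {acc_2d:.2f}")
print(f"Linear, 3D mapped: {acc_3d:.2f}")
print(f"RBF kernel, 2D:    {acc_rbf:.2f}")
```

These are training accuracies, used here only to show separability rather than generalization.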

# Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.
# Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting on the same dataset.

# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create two SVM classifiers with different kernels
svm_linear = SVC(kernel='linear', random_state=42)
svm_rbf = SVC(kernel='rbf', random_state=42)

# Train both models
svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)

# Predict on the test set
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

# Calculate and print accuracies
acc_linear = accuracy_score(y_test, y_pred_linear)
acc_rbf = accuracy_score(y_test, y_pred_rbf)

print(f"Accuracy with Linear Kernel: {acc_linear}")
print(f"Accuracy with RBF Kernel: {acc_rbf}")


Accuracy with Linear Kernel: 1.0
Accuracy with RBF Kernel: 0.8055555555555556
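
A likely reason for the RBF kernel's lower score is that the Wine features are on very different scales (for example, proline is in the hundreds while hue is near 1), and the RBF kernel depends on distances between points. The sketch below, which is an addition and not part of the original exercise, standardizes the features with `StandardScaler` inside a pipeline and compares the result with the unscaled model.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# RBF SVM on raw features
acc_raw = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

# RBF SVM on standardized features; the scaler is fit on the training split only
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
acc_scaled = scaled_rbf.fit(X_train, y_train).score(X_test, y_test)

print(f"RBF accuracy without scaling: {acc_raw:.3f}")
print(f"RBF accuracy with scaling:    {acc_scaled:.3f}")
```

Putting the scaler in a pipeline keeps test data out of the scaling statistics, which avoids leaking information from the test set.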


# Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?


Naïve Bayes is a simple and fast probabilistic classifier based on Bayes’ Theorem.
It is widely used for text classification, spam detection, and sentiment analysis.


It predicts the class of a sample using probabilities —
i.e., it calculates the chance that a data point belongs to a particular class.

Bayes’ Theorem states:

$$
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
$$

Where:

* P(A|B)  → Probability of class A given the data B
* P(B|A)  → Probability of data B given class A
* P(A)  → Prior probability of class A
* P(B) → Probability of the data


It’s called naïve because it assumes that all features are conditionally independent of each other given the class,
that is, each feature contributes to the outcome individually, without being related to other features.
In reality, this assumption is often not true, but the classifier still works well in many cases.



Example (Spam Detection):

If an email contains the words “win,” “money,” and “prize,”
Naïve Bayes assumes that these words are independent and calculates probabilities for spam vs. non-spam accordingly.
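
The spam example can be worked through by hand. In the sketch below, all priors and per-word likelihoods are hypothetical numbers chosen for illustration, not estimates from real data. Under the independence assumption, the score for each class is the prior multiplied by the likelihood of each word.

```python
# Hypothetical probabilities, invented for this example
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"win": 0.30, "money": 0.25, "prize": 0.20},
    "ham": {"win": 0.02, "money": 0.05, "prize": 0.01},
}
email_words = ["win", "money", "prize"]

# Naive assumption: P(words | class) is the product of per-word probabilities
scores = {}
for label in prior:
    score = prior[label]
    for word in email_words:
        score *= likelihood[label][word]
    scores[label] = score

# Dividing by the total (the evidence, P(B)) turns scores into posteriors
evidence = sum(scores.values())
posterior = {label: score / evidence for label, score in scores.items()}

for label, p in posterior.items():
    print(f"P({label} | email) = {p:.4f}")
```

With these made-up numbers, the email is classified as spam because the spam posterior is far larger than the non-spam one.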






# Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes


Here’s a comparison of the three main types of Naïve Bayes classifiers:


| Type                        | Used For                  | Type of Features                                     | Key Idea                                                | Example Use Case                                                                        |
| --------------------------- | ------------------------- | ---------------------------------------------------- | ------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| **Gaussian Naïve Bayes**    | Continuous (numeric) data | Features are assumed to follow a **normal (Gaussian)** distribution within each class | Calculates probability using the **bell curve** formula | Predicting diseases based on continuous test values (e.g., blood pressure, temperature) |
| **Multinomial Naïve Bayes** | Discrete (count) data     | Features are **counts or frequencies**               | Based on the **multinomial distribution**               | Text classification (e.g., spam detection, word count in emails)                        |
| **Bernoulli Naïve Bayes**   | Binary (yes/no) data      | Features are **0 or 1 (True/False)**                 | Works on **presence or absence** of features            | Sentiment analysis or document classification using word presence (word appears or not) |


In simple words:

* Gaussian NB → when features are continuous numbers (like height, age, weight).
* Multinomial NB → when features are word counts or frequencies.
* Bernoulli NB → when features are binary (word present = 1, absent = 0).
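
The three variants map to three scikit-learn classes. The sketch below fits each one on a tiny, made-up dataset of the matching feature type. The heights, word counts, and labels are invented purely for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: one continuous feature (height in cm), two classes
heights = np.array([[150.0], [155.0], [180.0], [185.0]])
height_labels = np.array([0, 0, 1, 1])
gaussian_pred = GaussianNB().fit(heights, height_labels).predict([[182.0]])[0]

# Word-count matrix: rows are documents, columns are counts of three words
counts = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 1], [0, 3, 0]])
doc_labels = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam
new_doc = np.array([[1, 0, 0]])

# Multinomial NB uses the counts directly
multinomial_pred = MultinomialNB().fit(counts, doc_labels).predict(new_doc)[0]

# Bernoulli NB binarizes the counts: any value above 0 becomes "word present"
bernoulli_pred = BernoulliNB(binarize=0.0).fit(counts, doc_labels).predict(new_doc)[0]

print("GaussianNB prediction:   ", gaussian_pred)
print("MultinomialNB prediction:", multinomial_pred)
print("BernoulliNB prediction:  ", bernoulli_pred)
```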


# Question 10: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.
# Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from sklearn.datasets.

# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Gaussian Naïve Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Gaussian Naïve Bayes Classifier: {accuracy}")






Accuracy of Gaussian Naïve Bayes Classifier: 0.9736842105263158
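
A single train/test split can be optimistic or pessimistic depending on which rows land in the test set. As a complementary check that is not part of the original exercise, the sketch below estimates accuracy with 5-fold cross-validation. The number of folds is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Accuracy on each of 5 folds; every sample is used for testing exactly once
scores = cross_val_score(GaussianNB(), X, y, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()

print("Fold accuracies:", scores.round(3))
print(f"Mean accuracy: {mean_accuracy:.3f}")
```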
