
# Supervised Classification: Decision Trees, SVM, and Naive Bayes | Assignment

Question 1: What is Information Gain, and how is it used in Decision Trees?

Ans: Information Gain measures how much splitting on a feature reduces uncertainty (entropy) in a dataset. In Decision Trees, it is used to decide which feature to split on at each node: the algorithm greedily picks the split that produces the purest child subsets.


What is Information Gain?
- Definition: Information Gain (IG) quantifies the reduction in entropy (randomness or impurity) when a dataset is split based on a particular feature.
- Entropy: A measure of disorder or impurity in the dataset. For classification, entropy is highest when classes are evenly mixed and lowest when all samples belong to one class.

How It Is Used in Decision Trees
- Step 1: Calculate the entropy of the dataset.
  Example: If 50% of the samples are "Yes" and 50% are "No," entropy is at its maximum (1 bit).
- Step 2: Split on a feature.
  Each candidate feature divides the dataset into subsets.
- Step 3: Compute Information Gain.
  IG is the entropy of the parent node minus the weighted average entropy of the subsets produced by the split.
- Step 4: Choose the feature with the highest IG.
  ID3 selects the feature that maximizes IG for the next node; C4.5 uses Gain Ratio, a normalized variant of IG that penalizes features with many distinct values.
- Step 5: Repeat recursively.
  Continue splitting until a stopping criterion is met (e.g., pure nodes, max depth).
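
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the ID3 implementation itself; the helper names `entropy` and `information_gain` and the toy `outlook`/`windy` features are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, feature_values):
    """Parent entropy minus the weighted entropy of the subsets after the split."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted_child_entropy = sum(
        (len(subset) / total) * entropy(subset) for subset in subsets.values()
    )
    return entropy(labels) - weighted_child_entropy


# Hypothetical toy data: 4 "Yes" and 4 "No" labels, so the parent entropy is 1 bit.
labels = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
outlook = ["sunny"] * 4 + ["rainy"] * 4          # separates the classes perfectly
windy = ["t", "f", "t", "f", "t", "f", "t", "f"]  # tells us nothing about the class

print("Parent entropy:", entropy(labels))
print("IG(outlook):", information_gain(labels, outlook))
print("IG(windy):", information_gain(labels, windy))
```

Here `outlook` recovers the full 1 bit of entropy while `windy` recovers none, so a tree built with IG would split on `outlook` first.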

Question 2: What is the difference between Gini Impurity and Entropy?

Ans: Gini Impurity and Entropy are both measures of impurity used in decision trees, but they differ in how they calculate uncertainty.
- Entropy comes from information theory. It measures the average amount of “information” or surprise in the dataset. If the classes are evenly split (say 50% Yes, 50% No), entropy is at its maximum because the uncertainty is highest. If all samples belong to one class, entropy is zero because there’s no uncertainty.
- Gini Impurity measures how often a randomly chosen sample would be mislabeled if it were labeled at random according to the class distribution in the node. Like entropy, Gini is zero when the node is pure (all samples in one class) and reaches its maximum when classes are evenly distributed.

The practical difference is that entropy uses logarithms, which ties it directly to information theory, while Gini relies on squared probabilities, making it slightly cheaper to compute. In practice, both often lead to very similar splits in decision trees.

Intuitively:
- Entropy tells you how much “information” is needed to classify an observation.
- Gini tells you how often you’d be wrong if you guessed based on the distribution.

In usage:
- CART (Classification and Regression Trees) typically uses Gini.
- ID3 and C4.5 use entropy-based criteria (Information Gain and Gain Ratio, respectively).
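
To make the comparison concrete, the sketch below evaluates both measures on a few two-class distributions. The function names `gini` and `entropy` are hypothetical helpers written for this example, using the standard definitions (1 minus the sum of squared probabilities, and the negative sum of p·log2 p).

```python
import math


def gini(probabilities):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p ** 2 for p in probabilities)


def entropy(probabilities):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


# Compare the two measures from an evenly mixed node to a pure node.
for p_yes in (0.5, 0.7, 0.9, 1.0):
    dist = (p_yes, 1.0 - p_yes)
    print(f"p(Yes)={p_yes:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures peak at the 50/50 split and fall to zero for a pure node; they differ in scale (for two classes, Gini tops out at 0.5 and entropy at 1 bit) and in the shape of the curve in between.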


Question 3: What is Pre-Pruning in Decision Trees?

Ans: Pre-pruning in decision trees is a technique used to stop the tree from growing too deep or complex by applying constraints during the construction process, rather than waiting until the tree is fully grown.

What is Pre-Pruning?
- When building a decision tree, the algorithm keeps splitting nodes to reduce impurity.
- If left unchecked, the tree can grow very large, perfectly fitting the training data but performing poorly on unseen data (overfitting).
- Pre-pruning prevents this by setting rules that stop further splitting early.

Common Pre-Pruning Strategies
- Maximum Depth: Limit how many levels the tree can grow.
- Minimum Samples per Split: Require a minimum number of samples before a node can be split.
- Minimum Information Gain / Gini Reduction: Only split if the improvement in purity is above a threshold.
- Maximum Number of Leaf Nodes: Restrict the total number of terminal nodes.

Use of Pre-Pruning:
- Avoid Overfitting: Keeps the tree simpler and more generalizable.
- Reduce Complexity: Smaller trees are easier to interpret.
- Improve Efficiency: Saves computation time by avoiding unnecessary splits.
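
The strategies listed above map onto constructor parameters of scikit-learn's `DecisionTreeClassifier`. The sketch below is only an illustration: the Iris dataset and the specific threshold values (depth 3, 10 samples, 0.01, 8 leaves) are assumptions picked for the example, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# No constraints: the tree grows until every leaf is pure.
unpruned = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned: each argument corresponds to one strategy from the list.
pruned = DecisionTreeClassifier(
    max_depth=3,                 # Maximum Depth
    min_samples_split=10,        # Minimum Samples per Split
    min_impurity_decrease=0.01,  # Minimum impurity reduction required to split
    max_leaf_nodes=8,            # Maximum Number of Leaf Nodes
    random_state=42,
).fit(X_train, y_train)

for name, tree in (("unpruned", unpruned), ("pre-pruned", pruned)):
    print(
        name,
        "| depth:", tree.get_depth(),
        "| leaves:", tree.get_n_leaves(),
        "| test accuracy:", tree.score(X_test, y_test),
    )
```

In practice these values are usually chosen by cross-validation rather than fixed by hand.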




In [1]:
# Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Train a Decision Tree Classifier using Gini Impurity
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X, y)

# Print feature importances
importances = clf.feature_importances_

# Display results in a neat format
importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

print("Feature Importances (using Gini Impurity):")
print(importance_df)


Feature Importances (using Gini Impurity):
             Feature  Importance
2  petal length (cm)    0.564056
3   petal width (cm)    0.422611
0  sepal length (cm)    0.013333
1   sepal width (cm)    0.000000


Question 5: What is a Support Vector Machine (SVM)?

Ans: A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for classification and sometimes regression tasks. Its main idea is to find the best boundary (called a hyperplane) that separates data points of different classes with the maximum margin.

Core Concept
- Imagine you have two classes of points on a graph.
- SVM tries to draw a line (in 2D) or a hyperplane (in higher dimensions) that separates the classes.
- The best hyperplane is the one that maximizes the distance (margin) between itself and the nearest points from each class.
- These nearest points are called support vectors, and they are critical because they define the boundary.

Working:
- Linear SVM: Finds a flat hyperplane that separates the classes when the data is (at least approximately) linearly separable.
- Non-linear SVM: Uses a technique called the kernel trick to implicitly map data into a higher-dimensional space where a linear separation is possible.
- Common kernels: Linear, Polynomial, Radial Basis Function (RBF), Sigmoid.
- Margin Maximization: A wider margin tends to make the classifier more robust and helps it generalize to unseen data.
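
As a small illustration of support vectors and the margin, the sketch below fits a linear SVM to six hand-picked 2D points. The data is hypothetical, and the very large `C` is an assumption used to approximate a hard-margin SVM on this separable toy set.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: two small clusters in 2D.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations (near hard margin).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]       # normal vector of the separating hyperplane
b = clf.intercept_[0]  # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("Hyperplane: w =", w, ", b =", b)
print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
```

Only the points listed in `support_vectors_` determine the boundary; moving any of the other points (without crossing the margin) would leave the hyperplane unchanged.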


Question 6: What is the Kernel Trick in SVM?

Ans: The Kernel Trick in Support Vector Machines (SVM) is a mathematical technique that allows SVMs to handle data that is not linearly separable by implicitly mapping it into a higher-dimensional space, without ever computing that mapping directly.

The Problem
- A simple SVM works well when data can be separated by a straight line (or hyperplane).
- But many real-world datasets are non-linear: you can’t separate them with a straight boundary.

The Kernel Trick
- Instead of explicitly transforming data into a higher dimension (which could be computationally expensive), SVM uses a kernel function.
- A kernel computes the similarity (inner product) between two data points as if they were in the higher-dimensional space, without actually performing the transformation.
- This makes it possible to find complex, non-linear boundaries efficiently.
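
The idea can be checked numerically for one simple case. The sketch below assumes a degree-2 polynomial kernel without a constant term, K(x, z) = (x · z)², for 2D inputs; the explicit feature map `phi` is the standard one for that kernel, and both function names are hypothetical helpers for this example.

```python
import numpy as np


def phi(v):
    """Explicit feature map into 3D that corresponds to the kernel (x . z)^2."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])


def poly2_kernel(x, z):
    """Same inner product, computed directly in the original 2D space."""
    return np.dot(x, z) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # transform first, then take the inner product
implicit = poly2_kernel(x, z)      # kernel trick: never builds phi(x) or phi(z)

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
print("Equal:", np.isclose(explicit, implicit))
```

Both routes give the same number, but the kernel only ever touches the original 2D vectors. For kernels such as RBF the corresponding feature space is infinite-dimensional, so the explicit route is not even available.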

Common Kernel Functions
- Linear Kernel: Works when data is linearly separable.
- Polynomial Kernel: Captures curved boundaries by considering polynomial combinations of features.
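- Radial Basis Function (RBF) Kernel: Popular default choice; similarity decays with the distance between points, which allows flexible, locally curved decision boundaries.
- Sigmoid Kernel: Based on the tanh function, so it resembles a neural-network activation function.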
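
For reference, the four kernels can be written out and compared against scikit-learn's helpers in `sklearn.metrics.pairwise`. The two input points and the values of `gamma`, `coef0`, and `degree` are arbitrary assumptions for this sketch.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel,
    polynomial_kernel,
    rbf_kernel,
    sigmoid_kernel,
)

# Two arbitrary points, each shaped (1, n_features) as the helpers expect.
x = np.array([[1.0, 2.0]])
z = np.array([[3.0, 4.0]])
gamma, coef0, degree = 0.5, 1.0, 3  # illustrative hyperparameters

dot = float(np.dot(x[0], z[0]))

# Kernel formulas written out by hand.
manual = {
    "linear": dot,
    "polynomial": (gamma * dot + coef0) ** degree,
    "rbf": float(np.exp(-gamma * np.sum((x[0] - z[0]) ** 2))),
    "sigmoid": float(np.tanh(gamma * dot + coef0)),
}

# The same kernels computed by scikit-learn.
library = {
    "linear": linear_kernel(x, z)[0, 0],
    "polynomial": polynomial_kernel(x, z, degree=degree, gamma=gamma, coef0=coef0)[0, 0],
    "rbf": rbf_kernel(x, z, gamma=gamma)[0, 0],
    "sigmoid": sigmoid_kernel(x, z, gamma=gamma, coef0=coef0)[0, 0],
}

for name in manual:
    print(f"{name:<10} manual={manual[name]:.6f}  sklearn={library[name]:.6f}")
```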






In [2]:
# Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train SVM with Linear kernel
svm_linear = SVC(kernel="linear", random_state=42)
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
accuracy_linear = accuracy_score(y_test, y_pred_linear)

# Train SVM with RBF kernel
svm_rbf = SVC(kernel="rbf", random_state=42)
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

# Print accuracies
print("Accuracy with Linear Kernel:", accuracy_linear)
print("Accuracy with RBF Kernel:", accuracy_rbf)

Accuracy with Linear Kernel: 0.9444444444444444
Accuracy with RBF Kernel: 0.6666666666666666
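
A likely reason for the lower RBF accuracy above is feature scale: the Wine features span very different ranges, and the RBF kernel depends on distances between samples, so the largest-valued features dominate. The sketch below is an optional follow-up, not part of the original question; it assumes the same split as above and adds a `StandardScaler` step in a pipeline to compare the RBF kernel with and without scaling.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# RBF SVM on the raw features (same setup as the cell above).
raw_rbf = SVC(kernel="rbf").fit(X_train, y_train)

# RBF SVM with standardization; the scaler is fitted on the training data only.
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_train, y_train)

acc_raw = raw_rbf.score(X_test, y_test)
acc_scaled = scaled_rbf.score(X_test, y_test)

print("RBF accuracy without scaling:", acc_raw)
print("RBF accuracy with scaling:   ", acc_scaled)
```

Putting the scaler inside the pipeline keeps test-set statistics out of training, so the comparison stays fair.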


Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?


Ans: The Naïve Bayes classifier is a simple yet powerful probabilistic machine learning algorithm based on Bayes’ Theorem. It is widely used for classification tasks such as spam detection, sentiment analysis, and text categorization.

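What is Naïve Bayes?
- It applies Bayes’ Theorem to calculate the probability that a given data point belongs to each class, P(class | features) ∝ P(class) × P(features | class), and predicts the class with the highest probability.

Why "Naïve"?
- The algorithm makes a naïve assumption: it assumes that all features are independent of each other given the class, so P(features | class) is simply the product of the per-feature probabilities.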
- In reality, features often have correlations (e.g., in text classification, the words “solar” and “panel” are related).
- Despite this unrealistic assumption, Naïve Bayes often performs surprisingly well in practice, especially for high-dimensional data like text.
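
The calculation can be sketched by hand for a two-word spam example. All probabilities below are made-up numbers for illustration, and `posterior` is a hypothetical helper, not a library function.

```python
# Hypothetical prior and per-word likelihoods (made-up numbers).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "win": 0.5},   # P(word appears | spam)
    "ham": {"free": 0.1, "win": 0.05},   # P(word appears | ham)
}
observed_words = ["free", "win"]


def posterior(priors, likelihoods, observed):
    """Bayes' Theorem with the naive assumption: multiply per-word likelihoods."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in observed:
            score *= likelihoods[label][word]  # independence given the class
        scores[label] = score
    evidence = sum(scores.values())  # normalizing constant P(features)
    return {label: score / evidence for label, score in scores.items()}


result = posterior(priors, likelihoods, observed_words)
for label, prob in result.items():
    print(f"P({label} | free, win) = {prob:.4f}")
```

With these numbers the unnormalized scores are 0.4 × 0.8 × 0.5 = 0.16 for spam and 0.6 × 0.1 × 0.05 = 0.003 for ham, so the message is classified as spam.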

Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes?

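Ans: The three variants differ in the kind of feature values they expect and the distribution they assume for each feature.

Gaussian Naïve Bayes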
- Data type: Continuous (real-valued) features.
- Assumption: Each feature follows a normal (Gaussian) distribution within each class.
- Use case: Works well for datasets like sensor readings, exam scores, or continuous measurements (e.g., height, weight, solar irradiance values).
- Example: Classifying whether a patient has a disease based on continuous lab test results.

Multinomial Naïve Bayes
- Data type: Discrete counts or frequency data.
- Assumption: Features represent counts (non-negative integers), often word frequencies in text.
- Use case: Common in text classification (spam detection, sentiment analysis) where features are word counts; fractional weights such as TF-IDF values also tend to work in practice.
- Example: Classifying emails as spam or not spam based on word occurrence counts.

Bernoulli Naïve Bayes
- Data type: Binary features (0 or 1).
- Assumption: Each feature is a yes/no indicator (present or absent).
- Use case: Useful when features represent presence/absence rather than counts.
- Example: Classifying documents based on whether certain keywords appear at least once (not how many times).
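
The pairing of data type and variant can be sketched with scikit-learn's three estimators. The tiny arrays below are hypothetical and exist only to show which kind of input each class is designed for.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # two samples per class

# Continuous measurements -> Gaussian Naive Bayes.
X_continuous = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])

# Word counts per document -> Multinomial Naive Bayes.
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])

# Word present / absent -> Bernoulli Naive Bayes.
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

predictions = {}
for name, (model, X) in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)
    print(name, "predictions on the training samples:", predictions[name])
```

Choosing the variant is mostly a matter of matching it to how the features were encoded; the same documents can feed `MultinomialNB` as counts or `BernoulliNB` as presence indicators.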




In [None]:
# Question 10: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.

# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Initialize Gaussian Naïve Bayes classifier
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Accuracy of Gaussian Naïve Bayes on Breast Cancer dataset:", accuracy)