# Q1 What is Information Gain, and how is it used in Decision Trees?

ANS1 Information Gain (IG) is a metric used in Decision Trees to decide which feature to split on at each node. It measures how much uncertainty (entropy) is reduced after splitting the dataset on a particular feature.

1. Entropy (Measure of Impurity)

Entropy quantifies how mixed the classes are in a dataset.

$$\text{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

Where:

$p_i$ = proportion of class $i$ in dataset $S$

Entropy = 0 → pure node (only one class)

Entropy = 1 → maximum disorder for two equally likely classes (with $n$ classes the maximum is $\log_2 n$, which can exceed 1)
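The entropy formula can be checked on small label lists. The following is a minimal sketch in plain Python; the function name `entropy` and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum of -p_i * log2(p_i) over the classes that actually occur.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

pure_node = ["yes", "yes", "yes", "yes"]   # hypothetical pure node
mixed_node = ["yes", "yes", "no", "no"]    # hypothetical evenly mixed node

print("Pure node entropy:", entropy(pure_node))
print("Mixed node entropy:", entropy(mixed_node))
```

The pure node should give an entropy of 0 and the evenly mixed binary node an entropy of 1, matching the two cases listed above.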


2. Information Gain (Definition)


$$\text{Information Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \times \text{Entropy}(S_v)$$


Where:


$S$ = original dataset


$A$ = attribute (feature)


$S_v$ = subset of data where attribute $A$ has value $v$
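As an illustration of the definition, the sketch below computes Information Gain for one categorical feature in plain Python. The helper names (`entropy`, `information_gain`) and the toy `weather`/`play` data are hypothetical and not part of the original answer.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy(S) minus the size-weighted entropy of each subset S_v."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)
    total = len(labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical toy data: the feature separates the classes perfectly.
weather = ["sunny", "sunny", "rainy", "rainy"]
play = ["no", "no", "yes", "yes"]
print("IG(weather):", information_gain(weather, play))

# Hypothetical feature that carries no information about the class.
coin = ["heads", "tails", "heads", "tails"]
print("IG(coin):", information_gain(coin, play))
```

A feature that splits the data into pure subsets recovers the full parent entropy as gain, while a feature unrelated to the class yields a gain of zero.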


3. Intuition Behind Information Gain

High Information Gain → feature creates pure, well-separated nodes

Low Information Gain → feature does not help much in classification

Trees built with this criterion (e.g., ID3) choose the feature with the maximum Information Gain at each split

4. How It Is Used in Decision Trees

1 Calculate entropy of the full dataset.

2 For each feature:

.Split the dataset by feature values.

.Calculate entropy of each subset.

.Compute Information Gain.

3 Choose the feature with highest Information Gain.

4 Repeat recursively for child nodes until:

.All samples belong to one class, or

.No features remain, or

.Stopping criteria are met.
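The steps above can be sketched as a small ID3-style recursion. This is a minimal sketch for categorical features only, not a full implementation: the function names (`build_tree`, `information_gain`) and the toy data are hypothetical, and the only stopping rules are "one class left" and "no features left".

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Parent entropy minus the weighted entropy after splitting on `feature`."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, features):
    # Stop when all samples share one class or no features remain;
    # the leaf predicts the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Choose the feature with the highest Information Gain.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    # Recurse into one child per observed value of the chosen feature.
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return tree

# Hypothetical toy dataset: "weather" determines the label, "windy" does not.
rows = [
    {"weather": "sunny", "windy": "no"},
    {"weather": "sunny", "windy": "yes"},
    {"weather": "rainy", "windy": "no"},
    {"weather": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
tree = build_tree(rows, labels, ["windy", "weather"])
print(tree)
```

On this toy data the root split should be on `weather`, because it is the only feature with non-zero gain.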

5. Example (Simple)

If:

Entropy before split = 1.0

Weighted average entropy after split on feature Weather = 0.4

$\text{IG} = 1.0 - 0.4 = 0.6$


6. Important Note

ID3 algorithm uses Information Gain

C4.5 improves it using Gain Ratio (to avoid bias toward attributes with many values)

# Q2 What is the difference between Gini Impurity and Entropy?
Hint: Directly compares the two main impurity measures, highlighting strengths,
weaknesses, and appropriate use cases.


ANS2 Gini Impurity and Entropy are the two most commonly used measures to quantify node impurity in Decision Trees. Both aim to evaluate how mixed the classes are in a dataset, but they differ in formulation, behavior, and use cases.

1. Definition & Formula

| Measure           | Formula                                    |
| ----------------- | ------------------------------------------ |
| **Gini Impurity** | $\text{Gini} = 1 - \sum_i p_i^2$           |
| **Entropy**       | $\text{Entropy} = -\sum_i p_i \log_2(p_i)$ |




Where $p_i$ is the probability of class $i$.


2. Interpretation

| Aspect        | Gini Impurity                           | Entropy                            |
| ------------- | --------------------------------------- | ---------------------------------- |
| Meaning       | Probability of incorrect classification | Measure of uncertainty or disorder |
| Value = 0     | Pure node                               | Pure node                          |
| Maximum value | 0.5 (binary classification)             | 1 (binary classification)          |
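To see how the two measures in the table behave, the sketch below evaluates both for a binary node at a few class proportions. It is a plain-Python illustration; the function names and the chosen proportions are assumptions.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# p is the (hypothetical) proportion of the positive class in a binary node.
for p in (0.0, 0.1, 0.3, 0.5):
    dist = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures are zero for a pure node and peak at an even split, where Gini reaches 0.5 and entropy reaches 1.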


3. Computational Efficiency

.Gini Impurity

Faster to compute (no logarithms)

Often preferred for large datasets

.Entropy

Slightly more expensive to compute (uses the log function)

Measures uncertainty in information-theoretic terms (bits)


4. Sensitivity to Class Distribution

| Feature                    | Gini           | Entropy        |
| -------------------------- | -------------- | -------------- |
| Sensitivity to node purity | Less sensitive | More sensitive |
| Penalizes mixed nodes      | Moderately     | Strongly       |


Entropy reacts more sharply to changes near pure nodes, which can lead to slightly different splits.

5. Split Quality & Accuracy

.In practice, both often produce very similar trees

.Entropy may create more balanced trees

.Gini may isolate the most frequent class faster

6. Algorithm Usage

| Algorithm | Measure Used         |
| --------- | -------------------- |
| CART      | Gini Impurity        |
| ID3       | Entropy              |
| C4.5      | Entropy / Gain Ratio |


7. When to Use Which?

| Situation                 | Recommended Measure |
| ------------------------- | ------------------- |
| Large datasets            | Gini (faster)       |
| Theoretical clarity       | Entropy             |
| Highly imbalanced classes | Entropy             |
| Default in practice       | Gini                |




# Q3 What is Pre-Pruning in Decision Trees?


ANS3 Pre-pruning (also called early stopping) in Decision Trees is a technique used to prevent overfitting by stopping the growth of the tree before it perfectly fits the training data.

Instead of growing a full tree and trimming it later, pre-pruning decides in advance when to stop splitting.

# Why Pre-Pruning Is Needed

Decision Trees can:

.Learn noise and outliers

.Become very deep and complex

.Perform poorly on unseen data (overfitting)

Pre-pruning controls this by limiting tree growth early.

# How Pre-Pruning Works

At each node, the algorithm checks stopping conditions. If any condition is met, the node becomes a leaf, even if further splits are possible.

# Common Pre-Pruning Criteria

1.Maximum Depth

Stop if tree reaches a certain depth
(e.g., max_depth = 5)

2.Minimum Samples per Split

Split only if node has at least a minimum number of samples
(e.g., min_samples_split = 10)

3.Minimum Samples per Leaf

Each leaf must contain a minimum number of samples
(e.g., min_samples_leaf = 5)

4.Minimum Impurity Decrease

Split only if impurity reduction exceeds a threshold
(e.g., min_impurity_decrease = 0.01)

5.Maximum Number of Leaf Nodes

Restrict total leaf nodes in the tree
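The criteria above map onto constructor parameters of scikit-learn's `DecisionTreeClassifier`. The sketch below is one possible configuration, assuming the Iris dataset and arbitrarily chosen threshold values; it is meant to show the parameters in use, not to recommend these settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unrestricted tree, grown until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: every limit below is an illustrative value, not a tuned one.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                  # 1. maximum depth
    min_samples_split=10,         # 2. minimum samples needed to split a node
    min_samples_leaf=5,           # 3. minimum samples in each leaf
    min_impurity_decrease=0.01,   # 4. minimum impurity reduction per split
    max_leaf_nodes=8,             # 5. maximum number of leaf nodes
    random_state=42,
).fit(X_train, y_train)

print("Full tree:   depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("Pruned tree: depth", pruned_tree.get_depth(), "leaves", pruned_tree.get_n_leaves())
print("Pruned tree test accuracy:", pruned_tree.score(X_test, y_test))
```

In practice these values are usually chosen by cross-validation rather than fixed by hand.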

# Advantages of Pre-Pruning

✅ Reduces overfitting
✅ Faster training
✅ Simpler, more interpretable trees

# Disadvantages of Pre-Pruning

❌ Risk of underfitting
❌ May stop useful splits too early
❌ Requires careful parameter tuning

# Pre-Pruning vs Post-Pruning

| Aspect       | Pre-Pruning             | Post-Pruning           |
| ------------ | ----------------------- | ---------------------- |
| When applied | Before tree fully grows | After full tree growth |
| Risk         | Underfitting            | More computation       |
| Complexity   | Lower                   | Higher                 |
| Accuracy     | May be lower            | Often higher           |




# Q4 Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).
Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.
(Include your Python code and output in the code box below.)

In [1]:
#ANS4


# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create Decision Tree model using Gini Impurity
model = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
model.fit(X, y)

# Get feature importances
importances = model.feature_importances_

# Display feature importances with feature names
feature_importance_df = pd.DataFrame({
    'Feature': iris.feature_names,
    'Importance': importances
})

print(feature_importance_df)


             Feature  Importance
0  sepal length (cm)    0.013333
1   sepal width (cm)    0.000000
2  petal length (cm)    0.564056
3   petal width (cm)    0.422611


# Q5 What is a Support Vector Machine (SVM)?

ANS5 A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. Its main goal is to find an optimal decision boundary (hyperplane) that best separates data points of different classes.




# Core Idea of SVM

SVM chooses the hyperplane that:

.Maximizes the margin (distance) between the hyperplane and the nearest data points of each class

.These nearest points are called support vectors

A larger margin usually leads to better generalization on unseen data.

# Key Concepts
1. Hyperplane

In 2D: a line

In 3D: a plane

In higher dimensions: a hyperplane

Mathematically:

$w \cdot x + b = 0$


2. Support Vectors

.Data points closest to the hyperplane

.They define the position and orientation of the decision boundary

3. Margin

.The distance between the hyperplane and the closest support vectors

.SVM maximizes this margin
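The hyperplane and distance concepts can be made concrete with a few lines of plain Python. This is a sketch with a hand-picked, hypothetical weight vector `w` and bias `b`; it does not train an SVM, it only evaluates a given hyperplane.

```python
import math

# Hypothetical hyperplane parameters (not learned from data).
w = (3.0, 4.0)
b = -5.0

def decision_value(x):
    """Value of w . x + b; zero means the point lies on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x):
    """Predict +1 on one side of the hyperplane and -1 on the other."""
    return 1 if decision_value(x) >= 0 else -1

def distance_to_hyperplane(x):
    """Geometric distance |w . x + b| / ||w|| from a point to the hyperplane."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(x)) / norm_w

for point in [(3.0, 4.0), (0.0, 0.0)]:
    print(point, "class:", classify(point), "distance:", distance_to_hyperplane(point))
```

Training an SVM amounts to choosing `w` and `b` so that the smallest of these distances over the training points is as large as possible.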

# Types of SVM
1. Linear SVM

.Used when data is linearly separable

.Uses a straight-line hyperplane

2. Non-Linear SVM

.Uses kernel functions to map data into higher dimensions

Common kernels:

.Linear

.Polynomial

.Radial Basis Function (RBF)

.Sigmoid

# Advantages of SVM

✅ Effective in high-dimensional spaces
✅ Works well when number of features > number of samples
✅ Robust to overfitting (with proper kernel and parameters)

# Disadvantages of SVM

❌ Computationally expensive for large datasets
❌ Harder to interpret than Decision Trees
❌ Choice of kernel and parameters is critical

# Common Applications

.Face recognition

.Text classification

.Bioinformatics

.Image classification



# Q6 What is the Kernel Trick in SVM?

ANS6 The Kernel Trick in Support Vector Machines (SVM) is a technique that allows SVMs to handle non-linearly separable data by implicitly mapping data into a higher-dimensional feature space, without explicitly computing that mapping.

# Why the Kernel Trick Is Needed

.Some datasets cannot be separated by a straight line in the original feature space.

.By mapping data to a higher dimension, it may become linearly separable.

.Explicitly computing this mapping can be computationally expensive or even infeasible.

The kernel trick avoids this cost.


# Core Idea

Instead of computing:

$\phi(x)$

(explicit feature transformation)

SVM computes:

$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$

directly, using a kernel function that measures similarity between two points.
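A small numeric check can make this identity tangible. The sketch below uses a degree-2 polynomial kernel on 2-D points, for which an explicit feature map is known; the function names and the two sample points are assumptions chosen for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit degree-2 feature map for a 2-D point: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: (x . y)^2."""
    return dot(x, y) ** 2

# Hypothetical sample points.
x, y = (1.0, 2.0), (3.0, 4.0)

explicit = dot(phi(x), phi(y))   # map to 3-D first, then take the dot product
implicit = poly2_kernel(x, y)    # stay in 2-D and apply the kernel

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both routes should agree up to floating-point rounding, but the kernel route never constructs the higher-dimensional vectors.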


# Common Kernel Functions
1. Linear Kernel

$K(x_i, x_j) = x_i \cdot x_j$

.No transformation

.Used for linearly separable data

2. Polynomial Kernel

$K(x_i, x_j) = (x_i \cdot x_j + c)^d$

.Captures polynomial relationships

3. Radial Basis Function (RBF / Gaussian Kernel)

$K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$

.Most popular

.Handles complex, non-linear boundaries

4. Sigmoid Kernel

$K(x_i, x_j) = \tanh(\alpha \, x_i \cdot x_j + c)$

.Similar to neural networks
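The four kernels can be written directly from their formulas. This is a plain-Python sketch; the default values for `c`, `d`, `gamma`, and `alpha` are arbitrary assumptions, and in scikit-learn the corresponding settings are exposed through the `SVC` parameters `coef0`, `degree`, and `gamma`.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, c=1.0, d=3):
    return (dot(x, y) + c) ** d

def rbf_kernel(x, y, gamma=0.5):
    # Squared Euclidean distance between the two points.
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, alpha=0.1, c=0.0):
    return math.tanh(alpha * dot(x, y) + c)

# Hypothetical sample points.
x, y = (1.0, 2.0), (2.0, 0.0)

kernels = {
    "linear": linear_kernel,
    "polynomial": polynomial_kernel,
    "rbf": rbf_kernel,
    "sigmoid": sigmoid_kernel,
}
for name, kernel in kernels.items():
    print(f"{name:<10} K(x, y) = {kernel(x, y):.4f}")
```

Note that the RBF kernel of a point with itself is always 1, and its value decays toward 0 as two points move apart.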

# Advantages of the Kernel Trick

✅ Enables non-linear classification
✅ No need to compute high-dimensional features explicitly
✅ Efficient and flexible

# Limitations

❌ Kernel choice is problem-dependent
❌ Can be slow for very large datasets
❌ Harder to interpret model behavior




# Q7 Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.
Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting
on the same dataset.
(Include your Python code and output in the code box below.)

In [2]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train SVM with Linear kernel
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
linear_accuracy = accuracy_score(y_test, y_pred_linear)

# Train SVM with RBF kernel
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)
rbf_accuracy = accuracy_score(y_test, y_pred_rbf)

# Print accuracies
print("Linear Kernel SVM Accuracy:", linear_accuracy)
print("RBF Kernel SVM Accuracy:", rbf_accuracy)


Linear Kernel SVM Accuracy: 0.9814814814814815
RBF Kernel SVM Accuracy: 0.7592592592592593
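The gap between the two accuracies above is probably due in part to feature scale: the RBF kernel depends on distances between samples, and the Wine features span very different ranges. The sketch below repeats the comparison with standardized features. Adding `StandardScaler` through `make_pipeline` is a suggestion beyond what the question asks, and no output is quoted here because the exact numbers depend on the scikit-learn version.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scores = {}
for kernel in ("linear", "rbf"):
    # The scaler is fitted on the training split only, then applied to the test split.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(f"{kernel} kernel accuracy with scaling: {scores[kernel]}")
```

With scaled inputs the two kernels would be expected to perform much more similarly on this dataset.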


# Q8 What is the Naïve Bayes classifier, and why is it called "Naïve"?


ANS8 Naïve Bayes is a probabilistic supervised learning classifier based on Bayes’ Theorem. It is widely used for classification tasks, especially in text and spam classification.

# What is Naïve Bayes?

Naïve Bayes predicts the class $C$ of a data point $X$ by computing:

$$P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}$$

The class with the highest posterior probability is chosen.

# Why is it called “Naïve”?

It is called naïve because it makes a strong simplifying assumption:

All features are conditionally independent given the class.

This means each feature contributes independently to the final decision, which is often not true in real-world data.
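Written out for a feature vector $X = (x_1, \dots, x_n)$, the independence assumption turns the likelihood into a product of per-feature terms. The block below is a short derivation sketch; the symbols $x_k$, $n$, and $\hat{C}$ are notation introduced here for illustration. Since $P(X)$ is the same for every class, it can be dropped when picking the most probable class.

```latex
P(X \mid C) = \prod_{k=1}^{n} P(x_k \mid C)
\quad\Longrightarrow\quad
\hat{C} = \arg\max_{C} \; P(C) \prod_{k=1}^{n} P(x_k \mid C)
```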

# Example of the Naïve Assumption

If features are:

.Fever

.Cough

Naïve Bayes assumes:

.Fever and cough are independent, given the disease

In reality, they are usually correlated, hence the assumption is naïve.

# Types of Naïve Bayes Classifiers

1.Gaussian Naïve Bayes

.For continuous data

.Assumes normal distribution

2.Multinomial Naïve Bayes

.For discrete counts (e.g., word frequencies)

3.Bernoulli Naïve Bayes

.For binary features (0/1)

# Advantages

✅ Simple and fast
✅ Works well with high-dimensional data
✅ Effective even with small datasets

# Limitations

❌ Independence assumption often unrealistic
❌ Performs poorly when features are highly correlated



# Q9 Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes

ANS9 Naïve Bayes classifiers differ mainly in the type of data they assume and the probability distribution used to model features.


# 1. Gaussian Naïve Bayes (GNB)
# Assumption

Features are continuous and follow a normal (Gaussian) distribution.

# Probability Model

$$P(x \mid C) = \frac{1}{\sqrt{2\pi\sigma_C^2}} \exp\left(-\frac{(x - \mu_C)^2}{2\sigma_C^2}\right)$$

# Used When

.Data contains real-valued measurements.

# Examples

.Height, weight, temperature

.Medical measurements
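For a measurement such as height, the Gaussian likelihood from the probability model can be evaluated by hand. The sketch below classifies a single feature in plain Python; the class names, priors, means, and variances are made-up values for illustration, whereas a real model such as `GaussianNB` estimates them from training data.

```python
import math

def gaussian_pdf(x, mean, var):
    """Gaussian likelihood P(x | C) for one feature, given the class mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical per-class statistics for one feature (height in cm).
class_stats = {
    "child": {"prior": 0.5, "mean": 120.0, "var": 100.0},
    "adult": {"prior": 0.5, "mean": 170.0, "var": 100.0},
}

def predict(x):
    """Return the class with the largest prior * likelihood (P(x) is shared, so it is skipped)."""
    scores = {
        label: stats["prior"] * gaussian_pdf(x, stats["mean"], stats["var"])
        for label, stats in class_stats.items()
    }
    return max(scores, key=scores.get)

for height in (130.0, 165.0):
    print(height, "->", predict(height))
```

With several features, the per-feature likelihoods would be multiplied together (or their logarithms summed) under the independence assumption.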

# Pros / Cons

.✅ Works well for continuous data

.❌ Assumption of normality may not always hold

# 2. Multinomial Naïve Bayes (MNB)
# Assumption

.Features are discrete counts.

.Data follows a multinomial distribution.

# Probability Model

.Uses feature frequencies.

# Used When

.Text classification with word counts or TF-IDF.

# Examples

.Spam detection

.Sentiment analysis

.Document classification

# Pros / Cons

.✅ Excellent for text data

.❌ Not suitable for continuous values



# 3. Bernoulli Naïve Bayes (BNB)
# Assumption

.Features are binary (0 or 1).

.Data follows a Bernoulli distribution.

# Probability Model

.Models presence or absence of features.

# Used When

.Binary feature vectors.

# Examples

.Word present or not present in a document

.Yes/No attributes

# Pros / Cons

.✅ Simple and effective for binary data

.❌ Ignores frequency information

# Key Differences at a Glance

| Aspect                | Gaussian NB  | Multinomial NB       | Bernoulli NB    |
| --------------------- | ------------ | -------------------- | --------------- |
| Feature type          | Continuous   | Discrete counts      | Binary          |
| Distribution          | Gaussian     | Multinomial          | Bernoulli       |
| Best for              | Numeric data | Text (counts/TF-IDF) | Binary features |
| Uses frequency        | ❌           | ✅                   | ❌              |
| Uses presence/absence | ❌           | ❌                   | ✅              |


# Q10 Breast Cancer Dataset
Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.
Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from
sklearn.datasets.
(Include your Python code and output in the code box below.)

In [3]:
#ANS10

# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create Gaussian Naïve Bayes model
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Gaussian Naïve Bayes Accuracy:", accuracy)


Gaussian Naïve Bayes Accuracy: 0.9415204678362573
