Question 1: What is Information Gain, and how is it used in Decision Trees?

Answer:

Information Gain (IG) measures how much “information” (or reduction in impurity) a feature gives when used to split data in a Decision Tree.
It tells us how well a feature separates the classes.

Formula:
$$IG(D, A) = \mathrm{Entropy}(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$$

$D$ → dataset

$A$ → attribute (feature)

$D_v$ → subset of data where feature $A = v$

How it works:

Calculate the entropy of the dataset (overall impurity).

For each feature, compute the weighted average entropy of the subsets produced by splitting on that feature.

Information Gain = entropy before the split minus the weighted entropy after it.

Choose the feature with maximum Information Gain for the split.
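
The steps above can be sketched in plain Python. This is a minimal illustration, not a full tree builder: the toy rows, the feature names `age_group` and `owns_car`, and the helper names `entropy` and `information_gain` are all assumptions made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """Parent entropy minus the size-weighted entropy of each subset D_v."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy data: each row is (age_group, owns_car); the label is "buys".
feature_names = ["age_group", "owns_car"]
rows = [("young", "no"), ("young", "yes"), ("old", "no"), ("old", "yes")]
labels = ["no", "no", "yes", "yes"]

# Steps 2-3: information gain per feature; step 4: pick the maximum.
gains = {name: information_gain(rows, labels, i) for i, name in enumerate(feature_names)}
best_feature = max(gains, key=gains.get)
print(gains, "-> split on", best_feature)
```

In this toy data `age_group` separates the classes perfectly while `owns_car` leaves each subset as mixed as the parent, so the first feature is the one selected.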

Example:

If splitting on “Age” reduces entropy from 0.9 to a weighted average of 0.3 across the resulting subsets, then $IG = 0.9 - 0.3 = 0.6$.
A reduction this large means “Age” is a strong candidate for the split.

Question 2: What is the difference between Gini Impurity and Entropy?

Answer:

Feature	Gini Impurity	Entropy
Formula	$Gini = 1 - \sum_i p_i^2$	$Entropy = -\sum_i p_i \log_2(p_i)$
Range	0 (pure) to 0.5 (two classes, evenly mixed)	0 (pure) to 1 (two classes, evenly mixed)
Interpretation	Probability of misclassifying a randomly drawn sample if it is labeled at random according to the class distribution	Average information (in bits) needed to identify the class of a sample
Computation Speed	Faster	Slightly slower (uses log)
Used In	CART (Classification and Regression Trees)	ID3, C4.5 algorithms
When to Use	For faster training, with large datasets	When an information-theoretic interpretation is wanted

Summary:
Both measure impurity, and in practice they usually select similar splits. Gini is slightly cheaper to compute because it avoids the logarithm; Entropy has a direct information-theoretic interpretation.
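
To make the comparison concrete, the short sketch below evaluates both formulas on a few two-class distributions. The helper names `gini` and `entropy` and the chosen probabilities are assumptions for illustration only.

```python
import math


def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Entropy in bits: -sum(p_i * log2(p_i)), skipping zero-probability classes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Walk from a pure node (p = 0) to an evenly mixed one (p = 0.5).
for p in (0.0, 0.1, 0.3, 0.5):
    dist = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures are zero for a pure node and peak at the evenly mixed node, which is why they tend to rank candidate splits in a similar order.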

Question 3: What is Pre-Pruning in Decision Trees?

Answer:

Pre-Pruning (also called Early Stopping) halts tree growth before the tree becomes too complex, instead of growing a full tree and trimming it afterwards.

How it works:

You specify stopping criteria such as:

Maximum depth (max_depth)

Minimum samples to split (min_samples_split)

Minimum information gain threshold (min_impurity_decrease in scikit-learn)

If a candidate split would violate any of these criteria, the node is not split and becomes a leaf.

Advantages:

Reduces overfitting by limiting model complexity during training.

Faster training and simpler model.

Example:

DecisionTreeClassifier(max_depth=4)
→ Stops the tree at depth 4 even if more splits are possible.
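
As a rough sketch of the effect, the snippet below fits an unrestricted tree and a pre-pruned tree on the Iris data and compares their size. The specific limits (`max_depth=2`, `min_samples_split=10`, `min_impurity_decrease=0.01`) are arbitrary values picked for the demonstration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unrestricted tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any criterion is hit.
pruned_tree = DecisionTreeClassifier(
    max_depth=2,
    min_samples_split=10,
    min_impurity_decrease=0.01,
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.3f}"
    )
```

Comparing depth, leaf count, and test accuracy side by side shows how much model complexity the stopping criteria remove and what, if anything, that costs in accuracy.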

Question 4: Python Program – Decision Tree Classifier (Gini Impurity)

Answer:

# Decision Tree using Gini Impurity
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree with Gini
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Evaluate
accuracy = clf.score(X_test, y_test)

# Display feature importances
feature_importances = pd.DataFrame({
    'Feature': iris.feature_names,
    'Importance': clf.feature_importances_
})

print("Decision Tree Accuracy:", round(accuracy, 3))
print("\nFeature Importances:\n", feature_importances)


✅ Output (Example):

Decision Tree Accuracy: 0.977

Feature Importances:
           Feature  Importance
0  sepal length (cm)   0.015
1   sepal width (cm)   0.025
2  petal length (cm)   0.555
3   petal width (cm)   0.405

Question 5: What is a Support Vector Machine (SVM)?

Answer:

Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression.

It finds the best separating hyperplane that divides data into classes with maximum margin.

Key Concepts:

Support Vectors: Data points closest to the boundary.

Margin: Distance between the hyperplane and nearest data points.

Goal: Maximize margin for better generalization.

Advantages:

Works well for high-dimensional data.

Effective when classes are separable by a clear margin.
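
The key concepts can be inspected directly on a fitted model. The sketch below uses four made-up 2-D points and a very large `C` to approximate a hard margin; the data and the choice `C=1e6` are assumptions for illustration. For a linear SVM with weight vector w, the distance from the hyperplane to the nearest points is 1 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: class 0 at x=0, class 1 at x=3.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                     # normal vector of the separating hyperplane
margin = 1.0 / np.linalg.norm(w)     # hyperplane-to-nearest-point distance

print("Support vectors:\n", clf.support_vectors_)
print("Margin:", round(margin, 3))
```

Here the separating hyperplane sits halfway between the two columns of points, so the margin comes out at about 1.5.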

Question 6: What is the Kernel Trick in SVM?

Answer:

The Kernel Trick lets an SVM classify non-linear data as if it had been mapped into a higher-dimensional space, without explicitly computing that transformation. A kernel function returns the inner product of the two mapped points directly from the original inputs.

Common Kernels:
Kernel	Function	Use Case
Linear	$x \cdot y$	Linearly separable data
Polynomial	$(x \cdot y + c)^d$	Curved boundaries
RBF (Gaussian)	$\exp(-\gamma \lVert x - y \rVert^2)$	

In short:
Kernels help SVM handle complex patterns that are not linearly separable.
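
A small numeric check shows what "without explicitly computing the transformation" means. The sketch below uses the polynomial kernel with c = 0 and d = 2 on 2-D inputs; the explicit feature map `phi` and the sample vectors are assumptions chosen for the example.

```python
import numpy as np


def phi(x):
    """Explicit degree-2 feature map for a 2-D input (maps into 3-D)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])


def poly_kernel(x, y):
    """Polynomial kernel (x . y + c)^d with c = 0, d = 2."""
    return np.dot(x, y) ** 2


x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(y))   # inner product in the mapped 3-D space
implicit = poly_kernel(x, y)        # same value, computed in the original 2-D space

print(explicit, implicit)
```

Both routes give the same number, but the kernel never builds the higher-dimensional vectors. For kernels such as RBF the implicit feature space is infinite-dimensional, so the kernel is the only practical route.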

Question 7: Python Program – SVM with Linear and RBF Kernels (Wine Dataset)

Answer:

# SVM with Linear and RBF kernels
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Linear Kernel
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train, y_train)
linear_acc = accuracy_score(y_test, svm_linear.predict(X_test))

# RBF Kernel
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
rbf_acc = accuracy_score(y_test, svm_rbf.predict(X_test))

print("SVM Linear Kernel Accuracy:", round(linear_acc, 3))
print("SVM RBF Kernel Accuracy:", round(rbf_acc, 3))


✅ Output (Example):

SVM Linear Kernel Accuracy: 0.981
SVM RBF Kernel Accuracy: 0.963


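Conclusion:
In this example output the linear kernel scores slightly higher, which is consistent with the classes being close to linearly separable. Note that the features are not standardized here: the RBF kernel is sensitive to feature scale, and the Wine features span very different ranges, so its accuracy can change substantially once the features are scaled.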
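
One way to make the kernel comparison fairer is to standardize the features before fitting. The sketch below is a variant of the program above, not a replacement for it: wrapping each model in a pipeline with `StandardScaler` is an assumption about preprocessing that the original program does not include.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scores = {}
for kernel in ("linear", "rbf"):
    # The scaler is fitted on the training split only, then applied to the test split.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)

for kernel, acc in scores.items():
    print(f"SVM {kernel} kernel accuracy (scaled features): {acc:.3f}")
```

Putting the scaler inside the pipeline keeps test-set statistics out of training, so the reported accuracy is not inflated by leakage.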

Question 8: What is the Naïve Bayes Classifier, and why is it called “Naïve”?

Answer:

Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem, assuming the features are conditionally independent given the class.

$$P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}$$

Where:

$P(C \mid X)$: probability of class $C$ given features $X$

$P(X \mid C)$: likelihood

$P(C)$: prior probability

$P(X)$: evidence

Why “Naïve”?

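It assumes all features are independent of one another given the class, so the likelihood factorizes as $P(X \mid C) = \prod_i P(x_i \mid C)$. This is rarely true in real-world data, but the classifier often works well anyway.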

Advantages:

Very fast, works well with large datasets.

Performs well for text classification tasks such as spam filtering.
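
A worked example may help tie Bayes' Theorem to the independence assumption. In the sketch below every number (the priors, the per-word likelihoods, the two classes "spam" and "ham") is hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical priors P(C) and per-word likelihoods P(word | C).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "win": 0.4},
    "ham": {"free": 0.1, "win": 0.05},
}


def posterior(words):
    """Return P(C | words) for each class under the naive assumption."""
    joint = {}
    for cls, prior in priors.items():
        score = prior
        for word in words:
            # Naive step: multiply per-word likelihoods as if they were independent.
            score *= likelihoods[cls][word]
        joint[cls] = score                      # P(X | C) * P(C)
    evidence = sum(joint.values())              # P(X)
    return {cls: score / evidence for cls, score in joint.items()}


result = posterior(["free", "win"])
print(result)
```

With these made-up numbers the numerators are 0.4 × 0.5 × 0.4 = 0.08 for spam and 0.6 × 0.1 × 0.05 = 0.003 for ham, so the posterior for spam is 0.08 / 0.083, roughly 0.964.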

Question 9: Difference Between Gaussian, Multinomial, and Bernoulli Naïve Bayes

Answer:

Type	Used For	Data Type	Example Use Case
GaussianNB	Continuous features	Real-valued (height, weight)	Medical data, sensors
MultinomialNB	Discrete counts	Word frequencies, counts	Text classification
BernoulliNB	Binary features	0/1 features (present/absent)	Spam detection (word present/absent)
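
The mapping from data type to variant can be sketched with three tiny datasets, one per row of the table. All values below are invented for illustration; the point is only which scikit-learn class matches which kind of feature.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous, real-valued features -> GaussianNB
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 4.2], [3.1, 3.9]])
gnb_pred = GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]])

# Non-negative counts (e.g. word frequencies) -> MultinomialNB
X_counts = np.array([[3, 0], [2, 0], [0, 3], [0, 2]])
mnb_pred = MultinomialNB().fit(X_counts, y).predict([[4, 0]])

# Binary present/absent indicators -> BernoulliNB
X_binary = (X_counts > 0).astype(int)
bnb_pred = BernoulliNB().fit(X_binary, y).predict([[1, 0]])

print(gnb_pred, mnb_pred, bnb_pred)
```

Each query point resembles the class 0 training rows, so all three models predict class 0 here.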

Question 10: Python Program – Gaussian Naïve Bayes on Breast Cancer Dataset

Answer:

# Gaussian Naive Bayes on Breast Cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predictions
y_pred = gnb.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Gaussian Naive Bayes Accuracy:", round(accuracy, 3))


✅ Output (Example):

Gaussian Naive Bayes Accuracy: 0.953