Q.1.

Ans

   - What is Information Gain?

It is a metric used in machine learning to determine which feature is most effective for splitting data at a given node in a decision tree.

It is calculated as the difference between the entropy of the dataset before the split and the weighted average of the entropy after the split.

A higher Information Gain means the feature is more effective at reducing uncertainty and creating more pure, homogeneous subsets of data.
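
In symbols, IG(S, A) = H(S) - sum over v of (|S_v| / |S|) * H(S_v), where H is the entropy and S_v is the subset of S taking value v for feature A. The snippet below is a minimal sketch of that calculation on made-up labels; the helper names `entropy` and `information_gain` are illustrative and do not come from any library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy example (made-up labels): 4 "yes" and 4 "no" before the split.
parent = ["yes"] * 4 + ["no"] * 4
pure_split = [["yes"] * 4, ["no"] * 4]            # separates the classes perfectly
mixed_split = [["yes", "yes", "no", "no"]] * 2    # leaves both halves fully mixed

ig_pure = information_gain(parent, pure_split)
ig_mixed = information_gain(parent, mixed_split)
print(ig_pure)   # 1.0
print(ig_mixed)  # 0.0
```

The pure split removes all uncertainty (gain of 1 bit), while the mixed split removes none.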

   - How it's used in Decision Trees

1. Root Node Selection: The algorithm calculates the Information Gain for every feature in the dataset and chooses the one with the highest gain to be the root node.

2. Splitting and Branching: The dataset is split into subsets based on the values of the chosen feature. This process creates child nodes.

3. Recursive Selection: The process is repeated for each new child node. The algorithm evaluates the remaining features to find the one with the highest Information Gain for that specific subset of data and splits it again.

4. Stopping Condition: The algorithm continues to split nodes until a stopping condition is met, such as reaching a maximum depth, having nodes with very low entropy (pure nodes), or when no features remain to split on.

5. Classification/Prediction: Once the tree is built, it can be used for classification. To classify a new data point, you traverse the tree from the root down to a leaf node, following the branches based on the features of the data point. The leaf node then provides the predicted class.
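
The five steps above can be sketched as a small ID3-style builder for categorical features. This is an illustrative sketch on made-up data, not how scikit-learn builds trees; the names `build_tree`, `predict`, `gain` and `split` are hypothetical helpers.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Entropy of the class labels in a list of (features, label) rows."""
    total = len(rows)
    counts = Counter(label for _, label in rows)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

def split(rows, feature):
    """Group rows by the value they take for `feature`."""
    groups = {}
    for features, label in rows:
        groups.setdefault(features[feature], []).append((features, label))
    return groups

def gain(rows, feature):
    """Information Gain of splitting `rows` on `feature`."""
    groups = split(rows, feature).values()
    return entropy(rows) - sum(len(g) / len(rows) * entropy(g) for g in groups)

def build_tree(rows, features):
    labels = [label for _, label in rows]
    # Stopping conditions: pure node, or no features left to split on.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]    # leaf: majority class
    best = max(features, key=lambda f: gain(rows, f))  # highest Information Gain
    remaining = [f for f in features if f != best]
    return {"feature": best,
            "children": {value: build_tree(subset, remaining)
                         for value, subset in split(rows, best).items()}}

def predict(tree, features):
    # Walk from the root to a leaf, following the branch for each feature value.
    while isinstance(tree, dict):
        tree = tree["children"][features[tree["feature"]]]
    return tree

# Made-up toy data: (feature dict, class label).
data = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = build_tree(data, ["outlook", "windy"])
print(predict(tree, {"outlook": "rainy", "windy": "yes"}))  # stay
```

The sketch omits a maximum-depth limit and raises a `KeyError` for feature values never seen during training; a fuller implementation would handle both.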

Q.2.

Ans

Both Gini Impurity and Entropy are measures of impurity used in decision tree algorithms to evaluate how well a feature separates the classes in a dataset. They help in selecting the best attribute for splitting at each node of the tree.


    1. Gini Impurity
Definition:
Gini Impurity measures the degree of impurity or uncertainty in a dataset. It represents the probability that a randomly chosen element would be incorrectly classified if it were randomly labeled according to the class distribution in the node.

   2. Entropy

Definition:
Entropy is a measure of impurity or randomness in the data. It is derived from information theory and indicates the amount of information needed to describe the outcome of a random variable.
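
For class proportions p_i in a node, the two measures are Gini = 1 - sum(p_i^2) and Entropy = -sum(p_i * log2(p_i)). The short sketch below evaluates both on a few hand-picked two-class distributions; the function names `gini` and `entropy` are illustrative.

```python
from math import log2

def gini(probs):
    """Gini impurity for a sequence of class proportions."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Entropy in bits; classes with zero proportion contribute nothing."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Hand-picked distributions: pure, skewed, and evenly mixed.
for probs in [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5)]:
    print(probs, "gini =", round(gini(probs), 3), "entropy =", round(entropy(probs), 3))
```

Both measures are 0 for a pure node and reach their maximum for an even mix (0.5 for Gini and 1.0 for entropy with two classes).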

- Strengths and weaknesses

1. Gini Impurity

    Strengths:

- Faster and more efficient: Its simpler calculation is ideal for larger datasets or real-time applications where speed is a priority.

- Simplicity: The formula is easier to understand and calculate, as it does not involve logarithms.

- Good for balanced classes: It generally works well when the classes in the dataset are relatively balanced.

    Weaknesses:

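- Less sensitive to skewed distributions: It responds less strongly than entropy to small changes in the class probability distribution.
- Weaker penalty on rare classes: Without a logarithmic term, it penalizes small minority-class proportions less heavily than entropy, so it can be less robust when a rare class matters.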


2. Entropy

    Strengths:

- Theoretically richer: As a concept from information theory, it provides a more robust and nuanced measure of information and uncertainty.

- Good for imbalanced classes: Its sensitivity to class distribution makes it a good choice for datasets where class labels are skewed.

- Produces deeper trees: Can lead to more granular and precise splits, potentially resulting in deeper decision trees.

    Weaknesses:

- Computationally slower: The logarithmic function makes it more intensive to calculate, especially on large datasets.

- Similar performance in practice: For most cases, the difference in accuracy between trees built with Gini Impurity and Entropy is minimal, so the additional computation time may not be justified.

    Appropriate use cases

1. Use Gini Impurity if:

- Computational speed is critical, as in real-time systems or with very large datasets.

- The classes in your dataset are well-balanced.

2. Use Entropy if:

- You are working with a highly imbalanced dataset and want to be more sensitive to finer splits.
- You are prioritizing the theoretical foundation of information gain and have a smaller dataset where computation time is not a major concern.

Q.3.

Ans

Pre-Pruning in Decision Trees

Definition:

Pre-pruning, also known as early stopping, is a technique used to prevent overfitting in decision tree algorithms. It involves stopping the growth of the tree early, before it perfectly classifies all the training examples.

Instead of allowing the tree to grow fully and then pruning it afterward, pre-pruning imposes constraints during the tree-building process to decide when to stop splitting a node.

Common Stopping Criteria for Pre-Pruning:

- Maximum Depth:
Stop splitting when the tree reaches a specified maximum depth.

- Minimum Samples per Node:
Stop splitting if a node contains fewer samples than a predefined threshold.

- Minimum Information Gain / Gini Decrease:
Stop splitting if the reduction in impurity (or gain in information) is below a threshold.

- Maximum Number of Leaf Nodes:
Limit the total number of leaf nodes in the tree.

- Chi-Square Test:
Stop splitting if the split is not statistically significant.
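
Most of these criteria map onto constructor parameters of scikit-learn's `DecisionTreeClassifier`. The sketch below sets several of them at once on the Iris data; the specific threshold values are arbitrary choices for illustration, not recommendations. The chi-square criterion is left out because `DecisionTreeClassifier` does not appear to expose a parameter for it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning constraints (values chosen only for illustration).
pruned = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth
    min_samples_split=10,        # minimum samples a node needs to be split
    min_impurity_decrease=0.01,  # minimum impurity reduction for a split
    max_leaf_nodes=8,            # cap on the total number of leaves
    random_state=42,
)
pruned.fit(X_train, y_train)

print("Depth:", pruned.get_depth())
print("Leaves:", pruned.get_n_leaves())
print("Test accuracy:", round(pruned.score(X_test, y_test), 2))
```

Whichever constraint is hit first stops growth at that node, so the fitted tree can end up shallower than `max_depth` allows.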

Q.4.

In [1]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the dataset (Iris dataset)
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create a Decision Tree Classifier using Gini Impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train (fit) the model
clf.fit(X_train, y_train)

# Print the feature importances
print("Feature Importances:")
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")

# Evaluate the model accuracy
accuracy = clf.score(X_test, y_test)
print(f"\nModel Accuracy on Test Data: {accuracy:.2f}")


Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876

Model Accuracy on Test Data: 1.00


Q.5.

Ans


    Support Vector Machine (SVM)

Definition:

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks.
It works by finding the best decision boundary (hyperplane) that separates data points of different classes with the maximum margin.

Concept:

- In SVM, each data item is plotted as a point in an n-dimensional space (where n is the number of features).

- The goal is to find a hyperplane that best separates the classes.

For example:

- In 2D, the hyperplane is a line.

- In 3D, it is a plane.

- In higher dimensions, it is called a hyperplane.

Key Concepts:

    Support Vectors:

The data points that lie closest to the hyperplane and influence its position and orientation.

    Margin:
The distance between the hyperplane and the nearest data points (support vectors).
SVM maximizes this margin for better generalization.

    Kernel Trick:
When data is not linearly separable, SVM uses kernel functions to implicitly map it into a higher-dimensional space where it can become linearly separable.
- Common kernels:

Linear Kernel

Polynomial Kernel

Radial Basis Function (RBF) Kernel

Sigmoid Kernel

- Types of SVM:

1. Linear SVM:
Used when data is linearly separable.

2. Non-Linear SVM:
Used when data cannot be separated by a straight line — uses kernel functions to separate data in higher dimensions.
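
To make the hyperplane, margin and support vectors concrete, the sketch below uses a hand-picked 2D hyperplane w.x + b = 0 and four made-up points. The values of `w` and `b` are assumptions chosen by hand, not learned by an SVM solver.

```python
from math import sqrt

# Hypothetical hyperplane w.x + b = 0 in 2D (chosen by hand, not learned).
w = (1.0, 1.0)
b = -3.0

def decision_value(x):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return w[0] * x[0] + w[1] * x[1] + b

def distance_to_hyperplane(x):
    """Perpendicular distance |w.x + b| / ||w||."""
    return abs(decision_value(x)) / sqrt(w[0] ** 2 + w[1] ** 2)

# Made-up points with class labels -1 and +1.
points = {(1.0, 1.0): -1, (0.0, 1.0): -1, (2.0, 2.0): 1, (3.0, 2.0): 1}

for x, label in points.items():
    predicted = 1 if decision_value(x) >= 0 else -1
    print(x, "label:", label, "predicted:", predicted,
          "distance:", round(distance_to_hyperplane(x), 3))

# The margin is the smallest distance; the points attaining it act as support vectors.
margin = min(distance_to_hyperplane(x) for x in points)
support_vectors = [x for x in points
                   if abs(distance_to_hyperplane(x) - margin) < 1e-9]
print("margin:", round(margin, 3), "support vectors:", support_vectors)
```

Only the two closest points, (1.0, 1.0) and (2.0, 2.0), determine the margin here; moving the other two slightly would not change it.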

Q.6.

Ans

Definition:

The Kernel Trick is a mathematical technique used in Support Vector Machines (SVMs) to handle non-linearly separable data by transforming it into a higher-dimensional space where it becomes linearly separable, without explicitly performing the transformation.

Concept:

- In many real-world problems, data points cannot be separated by a straight line (or hyperplane) in their original feature space.

- The idea of the kernel trick is to map the input data from its original space to a higher-dimensional feature space, where a separating hyperplane can be found.
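
As a small illustration of computing in the higher-dimensional space without visiting it, the sketch below checks that a degree-2 polynomial kernel (without a constant term) on 2D inputs gives the same number as a dot product after an explicit 3D mapping. The function names and the sample vectors are illustrative.

```python
from math import sqrt

def poly_kernel(x, z):
    """Degree-2 polynomial kernel (x.z)^2, computed in the original 2D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit 3D mapping whose dot product equals poly_kernel."""
    return (x[0] ** 2, sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly_kernel(x, z)
explicit_value = dot(feature_map(x), feature_map(z))
print(kernel_value)    # 121.0
print(explicit_value)  # approximately 121.0 (floating-point rounding may apply)
```

The kernel needs one 2D dot product and a square, while the explicit route first builds the 3D vectors; for kernels such as RBF the explicit space is infinite-dimensional, so only the kernel route is practical.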

 Advantages of the Kernel Trick:

- Allows SVMs to efficiently solve non-linear classification problems.

- Avoids the computational cost of explicitly transforming data into higher dimensions.

- Provides flexibility to choose different kernels based on the problem.

Disadvantages:

- Choosing the right kernel and tuning its parameters can be difficult.

- For very large datasets, kernel computations can become computationally expensive.

Q.7.

Ans


In [3]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create two SVM classifiers with different kernels
svm_linear = SVC(kernel='linear', random_state=42)
svm_rbf = SVC(kernel='rbf', random_state=42)

# Train both models
svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)

# Make predictions
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

# Calculate accuracy
accuracy_linear = accuracy_score(y_test, y_pred_linear)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

# Print the accuracies
print("Accuracy using Linear Kernel:", round(accuracy_linear * 100, 2), "%")
print("Accuracy using RBF Kernel:", round(accuracy_rbf * 100, 2), "%")

# Compare which kernel performs better
if accuracy_linear > accuracy_rbf:
    print("\nThe Linear kernel performs better on this dataset.")
elif accuracy_rbf > accuracy_linear:
    print("\nThe RBF kernel performs better on this dataset.")
else:
    print("\nBoth kernels perform equally well.")


Accuracy using Linear Kernel: 98.15 %
Accuracy using RBF Kernel: 75.93 %

The Linear kernel performs better on this dataset.
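
One plausible reason for the gap is feature scale: the RBF kernel depends on Euclidean distances, and the Wine features span very different ranges, so the largest-valued features can dominate. The sketch below is an optional follow-up experiment, not part of the original answer, that standardizes the features before the RBF SVM using a scikit-learn pipeline. No result is quoted here, since it should be checked by running the cell.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Standardize each feature (zero mean, unit variance), then fit the RBF SVM.
# The scaler is fitted on the training split only, inside the pipeline.
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_rbf.fit(X_train, y_train)

accuracy_scaled_rbf = scaled_rbf.score(X_test, y_test)
print("Accuracy using RBF Kernel with scaling:", round(accuracy_scaled_rbf * 100, 2), "%")
```

Comparing this number with the unscaled RBF accuracy above indicates how much of the difference between kernels comes from preprocessing rather than from the kernel itself.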


Q.8.

Ans

    Naïve Bayes Classifier
Definition:

The Naïve Bayes classifier is a supervised machine learning algorithm based on Bayes’ Theorem, used for classification tasks.

It is called “naïve” because it assumes that all features (attributes) are conditionally independent of each other given the class, which is rarely true in real-world data.

Despite this simplifying assumption, Naïve Bayes often performs very well in practice, especially for text classification, spam filtering, and sentiment analysis.
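
Written out, Bayes' Theorem combined with the independence assumption gives the rule the classifier applies. Here C is a class and x_1, ..., x_n are the feature values of one sample; the symbols are introduced only for this illustration.

```latex
P(C \mid x_1, \dots, x_n)
  = \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)}
  \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The denominator is the same for every class, so it can be dropped when picking the most probable class.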


Q.9.

Ans

Naïve Bayes Classifier Variants: Gaussian, Multinomial, and Bernoulli

The Naïve Bayes classifier is a probabilistic supervised learning algorithm based on Bayes’ Theorem. It assumes that all features are conditionally independent, which simplifies probability computations. Depending on the type of feature data, Naïve Bayes has three common variants: Gaussian, Multinomial, and Bernoulli.

    1. Gaussian Naïve Bayes (GNB)

Feature Type: Continuous numeric features.

Assumption: Each feature is normally distributed (Gaussian distribution) within each class.

Probability Calculation:
For a continuous feature $x_i$ given class $C$, the likelihood is computed using the Gaussian probability density function:

$$P(x_i \mid C) = \frac{1}{\sqrt{2\pi\sigma_C^2}} \exp\left(-\frac{(x_i - \mu_C)^2}{2\sigma_C^2}\right)$$
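
A minimal sketch of this likelihood in plain Python follows. The per-class means and variances are made-up numbers for a single feature, and `gaussian_likelihood` is an illustrative helper name.

```python
from math import exp, pi, sqrt

def gaussian_likelihood(x, mean, variance):
    """Gaussian density of x for a class with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2 * variance)) / sqrt(2 * pi * variance)

# Hypothetical per-class statistics for one feature: (mean, variance).
stats = {"class_a": (5.0, 0.25), "class_b": (6.5, 0.36)}

x = 5.4
likelihoods = {name: gaussian_likelihood(x, mean, variance)
               for name, (mean, variance) in stats.items()}
for name, value in likelihoods.items():
    print(name, round(value, 4))
```

The value 5.4 lies closer to the mean of `class_a`, so its likelihood under that class is the larger of the two.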

    2. Multinomial Naïve Bayes (MNB)

Feature Type: Discrete count-based features (non-negative integers).

Assumption: Each feature represents the number of times an event occurs, such as word counts in documents.

Probability Calculation:
Multinomial Naïve Bayes uses the multinomial distribution to calculate likelihoods:

$$P(X \mid C) = \frac{\left(\sum_i x_i\right)!}{\prod_i x_i!} \prod_i P(x_i \mid C)^{x_i}$$

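The multinomial likelihood can be evaluated directly for a tiny example. In the sketch below the word counts and per-class word probabilities are made up, and `multinomial_likelihood` is an illustrative helper name.

```python
from math import factorial, prod

def multinomial_likelihood(counts, probs):
    """Multinomial probability of the observed counts given per-feature probabilities."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)   # exact integer division at every step
    return coefficient * prod(p ** c for p, c in zip(probs, counts))

# Made-up example: a 3-word vocabulary, with word counts for one document.
counts = (2, 1, 0)
probs_given_class = (0.5, 0.3, 0.2)   # hypothetical P(word | class), sums to 1

likelihood = multinomial_likelihood(counts, probs_given_class)
print(likelihood)   # approximately 0.225, i.e. 3 * 0.5**2 * 0.3
```

Practical implementations usually work with log-probabilities and smoothed estimates to avoid underflow and zero probabilities; this sketch skips both for clarity.
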
    3. Bernoulli Naïve Bayes (BNB)

Feature Type: Binary features (0 or 1), indicating presence or absence of a characteristic.

Assumption: Each feature is binary, modeled using a Bernoulli distribution.

Probability Calculation:

$$P(x_i \mid C) = P(x_i = 1 \mid C)^{x_i} \cdot \left(1 - P(x_i = 1 \mid C)\right)^{1 - x_i}$$

where $P(x_i = 1 \mid C)$ is the probability that feature $i$ is present in class $C$.

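The Bernoulli likelihood can be sketched the same way. The presence probabilities below are made up, and the function names are illustrative.

```python
def bernoulli_likelihood(x, p):
    """Likelihood of one binary feature x (0 or 1), where p = P(x = 1 | C)."""
    return p ** x * (1 - p) ** (1 - x)

def sample_likelihood(features, probs):
    """Product of per-feature Bernoulli likelihoods (naive independence assumption)."""
    result = 1.0
    for x, p in zip(features, probs):
        result *= bernoulli_likelihood(x, p)
    return result

# Made-up example: hypothetical P(word present | spam) for three words.
probs_spam = (0.8, 0.1, 0.6)
features = (1, 0, 1)   # words 1 and 3 present, word 2 absent

likelihood = sample_likelihood(features, probs_spam)
print(likelihood)   # approximately 0.432, i.e. 0.8 * 0.9 * 0.6
```

Unlike the multinomial variant, an absent feature (x = 0) still contributes a factor of 1 - p, so missing words count as evidence.
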

Q.10.

Ans



In [1]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data      # Features
y = data.target    # Labels

# Split the dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create a Gaussian Naïve Bayes classifier
gnb = GaussianNB()

# Train the classifier
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Gaussian Naïve Bayes classifier:", round(accuracy * 100, 2), "%")

# Optional: Detailed evaluation
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy of Gaussian Naïve Bayes classifier: 94.15 %

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.90      0.92        63
           1       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171

Confusion Matrix:
 [[ 57   6]
 [  4 104]]
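
The figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below copies the matrix from the output above (rows are true classes, columns are predicted classes) and treats class 1 as the positive class.

```python
# Confusion matrix copied from the printed output.
confusion = [[57, 6],
             [4, 104]]

tn, fp = confusion[0]   # true class 0: predicted 0, predicted 1
fn, tp = confusion[1]   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of samples predicted as class 1, the share that are class 1
recall = tp / (tp + fn)      # of actual class 1 samples, the share that were found

print(round(accuracy, 4))    # 0.9415
print(round(precision, 2))   # 0.95
print(round(recall, 2))      # 0.96
```

These match the 94.15 % accuracy and the class 1 row of the classification report.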
