**Question 1: What is Information Gain, and how is it used in Decision Trees?**

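Information Gain measures how much uncertainty (impurity, usually measured as entropy) in the data is reduced after splitting on a particular feature. It is the entropy of the parent node minus the size-weighted average entropy of the child nodes created by the split.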

Information Gain is a concept used in Decision Tree algorithms to decide which feature should be chosen to split the data at each step.

**How It Works in Decision Trees**

- A decision tree starts with a mixed dataset where different classes are present.

- For each feature, the algorithm checks how well it separates the data into clear groups.

- If a feature divides the data into mostly single-class groups, it has high Information Gain.

- The feature with the highest Information Gain is chosen as the decision node.

- This process continues until the data becomes pure or stopping conditions are met.

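For example, if we are predicting whether people will play a game, a feature like weather may clearly separate “yes” and “no” outcomes, while a feature with little bearing on the outcome, such as whether the day number is odd or even, may leave both outcomes mixed in each group.
Weather would then have the higher Information Gain and be chosen first.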
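
The calculation behind this example can be sketched in a few lines. This is a minimal illustration, not how any particular library implements it: the six-row dataset is hypothetical, the helper names `entropy` and `information_gain` are made up for this sketch, and an odd/even day label stands in for the uninformative feature.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted_child_entropy = sum(
        (len(group) / total) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - weighted_child_entropy

# Hypothetical toy data: six days and whether the game was played.
play = ["yes", "yes", "yes", "no", "no", "no"]
weather = ["sunny", "sunny", "sunny", "rainy", "rainy", "rainy"]
day_parity = ["odd", "even", "odd", "even", "odd", "even"]

gain_weather = information_gain(weather, play)
gain_day = information_gain(day_parity, play)
print(f"weather gain:    {gain_weather:.3f}")
print(f"day parity gain: {gain_day:.3f}")
```

In this toy data, weather splits the rows into two pure groups, so its gain equals the full parent entropy, while the parity feature leaves both groups mixed and gains very little.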

**Information Gain measures how much a feature reduces uncertainty in the data and is used in decision trees to select the best feature for splitting.**

**Question 2: What is the difference between Gini Impurity and Entropy?**

Both Gini Impurity and Entropy are metrics used in Decision Tree algorithms to measure how impure or mixed a dataset is. They help decide the best feature to split the data.

**Meaning**

**Gini Impurity**:

- Measures how often a randomly chosen data point would be incorrectly classified if it were labeled at random according to the class distribution in the node.

- Lower Gini value means purer nodes.

**Entropy**:

- Measures the level of uncertainty or randomness in the data.

- Lower entropy means less disorder.
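
A short sketch makes the two measures concrete. It assumes the standard definitions (Gini as one minus the sum of squared class proportions, entropy in bits), and the three class distributions are hypothetical.

```python
import math

def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits; classes with zero proportion contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in proportions if p > 0)

# Hypothetical two-class nodes, from perfectly pure to evenly mixed.
nodes = [("pure", [1.0, 0.0]), ("skewed", [0.9, 0.1]), ("even", [0.5, 0.5])]
for name, dist in nodes:
    print(f"{name:7s} gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures are zero for a pure node and largest for an evenly mixed one; they differ mainly in scale and in how steeply they rise in between.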

**How They Are Used**

- Both are used to select the best split in a decision tree.

- The split that results in maximum purity (least impurity/uncertainty) is chosen.

| Aspect              | Gini Impurity                                  | Entropy                                   |
| ------------------- | ---------------------------------------------- | ----------------------------------------- |
| Concept             | Misclassification probability                  | Uncertainty or randomness                 |
| Formula             | `1 - sum(p_i^2)`                               | `-sum(p_i * log2(p_i))`                   |
| Range (two classes) | 0 to 0.5                                       | 0 to 1                                    |
| Interpretation      | How impure the node is                         | How unpredictable the node is             |
| Computation         | No logarithm, so slightly cheaper              | Needs a logarithm, so slightly costlier   |
| Used in             | CART algorithm                                 | ID3 and C4.5 algorithms                   |
| Tendency            | Tends to isolate the most frequent class early | Tends toward slightly more balanced splits |


**Which One Is Better?**

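- Gini Impurity is a common default (it is the default criterion in scikit-learn) and is preferred when speed and simplicity matter, since it needs no logarithm.

- Entropy is the natural choice when splits are scored by Information Gain, as in ID3 and C4.5.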

- In practice, both often produce similar trees.

**Gini Impurity measures the chance of misclassification, while Entropy measures the uncertainty in the data; both are used to choose the best split in decision trees.**

**Question 3: What is Pre-Pruning in Decision Trees?**

Pre-pruning is a technique used in decision trees to stop the tree from growing too deep during training. The idea is to prevent overfitting by setting certain rules in advance that decide when the tree should stop splitting.

Instead of allowing the decision tree to grow until it perfectly classifies all training data, pre-pruning halts further splits if they do not significantly improve the model’s performance. This keeps the model simpler, faster, and more likely to generalize to new data.

Common conditions used in pre-pruning include:

- Limiting the maximum depth of the tree

- Requiring a minimum number of samples in a node before splitting

- Stopping splits when improvement is very small
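
The three conditions above map onto `DecisionTreeClassifier` parameters in scikit-learn. The sketch below compares an unrestricted tree with a pre-pruned one; the Iris dataset and the specific limits are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unrestricted tree: keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: the limits below are illustrative, not tuned.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # limit the maximum depth
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_impurity_decrease=0.01,  # skip splits that barely reduce impurity
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name:10s} depth={tree.get_depth()} leaves={tree.get_n_leaves()} "
        f"test accuracy={tree.score(X_test, y_test):.3f}"
    )
```

In practice these limits are usually chosen by cross-validation rather than fixed by hand.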

**Why Pre-Pruning Is Important**

- Reduces overfitting

- Improves model performance on unseen data

- Makes the model faster and easier to interpret

**Pre-pruning is a method that stops a decision tree from growing too large during training to avoid overfitting and improve generalization.**

**Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).
Hint: Use criterion='gini' in DecisionTreeClassifier and access feature_importances_.
(Include your Python code and output in the code box below.)**



```python
# Import required libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import pandas as pd

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create feature names
feature_names = iris.feature_names

# Train Decision Tree Classifier with Gini Impurity
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X, y)

# Get feature importances
importances = model.feature_importances_

# Display feature importances
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})

print(feature_importance_df)
```

Output:

```text
             Feature  Importance
0  sepal length (cm)    0.013333
1   sepal width (cm)    0.000000
2  petal length (cm)    0.564056
3   petal width (cm)    0.422611
```


**Explanation**:

- The model uses Gini Impurity to decide the best splits.

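- Feature importance is the total impurity reduction contributed by splits on that feature, normalized so that all importances sum to 1.

- Higher value = more important feature.

- In this example, petal length and petal width are the most important features, while sepal width is never used for a split (importance 0).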

**This program trains a Decision Tree using Gini Impurity and prints feature importances to identify which features contribute most to predictions.**

**Question 5: What is a Support Vector Machine (SVM)?**

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression, but it is most commonly used for classification problems.

SVM works by finding a decision boundary (called a hyperplane) that best separates data points of different classes. The main goal of SVM is to choose a boundary that maximizes the margin, which is the distance between the boundary and the nearest data points from each class.

The data points that lie closest to the decision boundary are called support vectors. These points are crucial because they directly influence the position of the hyperplane. If these points move, the decision boundary also changes.
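
The roles of the hyperplane, the margin, and the support vectors can be seen on a tiny example. This is a sketch on six hypothetical 2-D points; the large `C` is an assumption used to approximate a hard margin on separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D points in two classes.
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e3).fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("hyperplane: w =", w, "b =", b)
print("margin width:", margin_width)
```

Only the points listed in `support_vectors_` determine the boundary; removing any of the other points and refitting would leave the hyperplane unchanged.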

SVM is powerful because it can handle:

- High-dimensional data

- Non-linearly separable data using kernel functions

- Small datasets with clear class separation

Overall, SVM looks for the boundary with the widest margin, which tends to make it robust and effective for complex classification tasks.

**Support Vector Machine is a supervised learning algorithm that finds the optimal hyperplane which maximizes the margin between different classes using support vectors.**



**Question 6: What is the Kernel Trick in SVM?**

The Kernel Trick is a technique used in Support Vector Machines (SVM) to handle non-linearly separable data.

In many real-world problems, data cannot be separated by a straight line or flat plane. Instead of trying to draw a complex boundary in the original space, the kernel trick maps the data into a higher-dimensional space where a linear separation becomes possible.

The key idea is that SVM does not actually compute the transformation explicitly. Instead, it uses a kernel function to calculate the similarity between data points as if they were in a higher-dimensional space. This makes the process computationally efficient.
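
A small numeric check illustrates the idea. The sketch assumes a homogeneous degree-2 polynomial kernel on 2-D inputs, for which the explicit feature map is known; `phi` and `poly_kernel` are names invented for this example.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2, computed in the original space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = np.dot(phi(a), phi(b))  # map to 3-D first, then take the inner product
implicit = poly_kernel(a, b)       # same value without leaving the 2-D space

print("explicit mapping:", explicit)
print("kernel function: ", implicit)
```

Both routes give the same inner product, but the kernel never builds the higher-dimensional vectors. For kernels such as RBF, the implied feature space is infinite-dimensional, so the implicit route is the only practical one.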

**Why Kernel Trick is Important**

- Helps SVM handle complex and non-linear patterns

- Avoids expensive calculations of high-dimensional transformations

- Makes SVM flexible for different types of data

**Common Types of Kernels**

- **Linear Kernel** – for linearly separable data

- **Polynomial Kernel** – for curved decision boundaries

- **RBF (Gaussian) Kernel** – for complex, non-linear patterns

- **Sigmoid Kernel** – based on the tanh function, resembling a neural-network activation


**The kernel trick allows SVM to solve non-linear classification problems by implicitly mapping data into a higher-dimensional space where it can be linearly separated.**

**Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.
Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting on the same dataset.
(Include your Python code and output in the code box below.)**



```python
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train SVM with Linear Kernel
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
linear_accuracy = accuracy_score(y_test, y_pred_linear)

# Train SVM with RBF Kernel
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)
rbf_accuracy = accuracy_score(y_test, y_pred_rbf)

# Print the accuracies
print("Linear Kernel SVM Accuracy:", linear_accuracy)
print("RBF Kernel SVM Accuracy:", rbf_accuracy)
```

Output:

```text
Linear Kernel SVM Accuracy: 0.9814814814814815
RBF Kernel SVM Accuracy: 0.7592592592592593
```


**Conclusion**

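- The linear SVM reaches about 98% test accuracy, which suggests the Wine classes are close to linearly separable.

- The RBF SVM reaches only about 76% here. The RBF kernel depends on distances between points, and the Wine features are on very different scales, so without feature scaling (for example `StandardScaler`) a few large-valued features dominate.

- On this dataset and split, the linear kernel is therefore the better choice unless the features are standardized first.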
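
Because the RBF kernel is computed from distances between points, it is sensitive to feature scale, and the Wine features span very different ranges. The sketch below repeats the comparison with and without standardization; using `StandardScaler` in a pipeline is an assumption added here, not part of the original exercise.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

accuracies = {}
for kernel in ("linear", "rbf"):
    # Same kernel, trained once on raw features and once on standardized ones.
    raw = SVC(kernel=kernel).fit(X_train, y_train)
    scaled = make_pipeline(StandardScaler(), SVC(kernel=kernel)).fit(X_train, y_train)
    accuracies[kernel] = (raw.score(X_test, y_test), scaled.score(X_test, y_test))
    print(f"{kernel:6s} raw={accuracies[kernel][0]:.3f} scaled={accuracies[kernel][1]:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into the model.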

**Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?**

Naïve Bayes is a probabilistic machine learning classifier based on Bayes’ Theorem. It is mainly used for classification tasks such as spam detection, sentiment analysis, and text classification.

It works by calculating the probability of each class for a given data point and then predicting the class with the highest probability.

It is called “Naïve” because it makes a strong assumption that all features are independent of each other, given the class label.
In real-world data, this assumption is usually not true, but surprisingly, Naïve Bayes still performs very well in many practical applications.
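
The independence assumption is what makes the calculation simple: the class score is the prior multiplied by one likelihood per feature. The sketch below works this out by hand for a tiny spam filter; every probability in it is a made-up illustrative number.

```python
# Hypothetical probabilities for a tiny spam filter (illustrative numbers only).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.03, "meeting": 0.20},
}

def posterior(words):
    """Class posteriors under the naive assumption: multiply per-word likelihoods."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

for message in (["free"], ["free", "meeting"]):
    result = posterior(message)
    prediction = max(result, key=result.get)
    print(message, "->", prediction, result)
```

Real implementations usually add log-probabilities instead of multiplying raw probabilities, which avoids numerical underflow when there are many features.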

**Why Naïve Bayes is Popular**

- Simple and fast to train

- Works well with large datasets

- Performs especially well for text and categorical data

- Requires less training data compared to many other algorithms

**Key Point**:

**Naïve Bayes is called “naïve” because it assumes feature independence, an assumption that simplifies computation but rarely holds in real data.**

**Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes**

Naïve Bayes classifiers differ mainly in the type of data they are designed to handle and the probability distribution they assume for the features.

**Gaussian Naïve Bayes**

Gaussian Naïve Bayes is used when the features are continuous numerical values. It assumes that the data for each feature follows a normal (Gaussian) distribution.

Common use cases:

1. Medical data (e.g., blood pressure, temperature)
2. Sensor readings
3. Any dataset with real-valued features

**Multinomial Naïve Bayes**

Multinomial Naïve Bayes is designed for discrete count-based data.
It is most commonly used in text classification, where features represent word counts or term frequencies.

Common use cases:

1. Spam detection
2. Document classification
3. Bag-of-Words models (TF-IDF features, although fractional, also tend to work in practice)

**Bernoulli Naïve Bayes**

Bernoulli Naïve Bayes works with binary features (0 or 1).
It checks whether a feature is present or absent, rather than how many times it appears.

Common use cases:

1. Binary text features (word present or not)
2. Yes/No type data
3. Simple feature presence detection

**Comparison**

| Type           | Data Type       | Feature Representation | Typical Use Case        |
| -------------- | --------------- | ---------------------- | ----------------------- |
| Gaussian NB    | Continuous      | Real-valued            | Medical, numerical data |
| Multinomial NB | Discrete counts | Frequency-based        | Text classification     |
| Bernoulli NB   | Binary          | Presence/absence       | Binary text features    |


**Key Point**:

**Gaussian NB is used for continuous data, Multinomial NB for count-based text data, and Bernoulli NB for binary features.**
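
The pairing of variant and data type can be sketched with the three scikit-learn classes. The four-row datasets below are hypothetical and exist only to show which kind of input each class expects.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements (e.g. temperature, blood pressure) -> GaussianNB
X_continuous = np.array([[36.6, 118.0], [36.9, 121.0], [38.4, 142.0], [38.9, 150.0]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Word present (1) or absent (0) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

predictions = {}
for name, (model, X) in models.items():
    predictions[name] = model.fit(X, y).predict(X)
    print(f"{name:14s} {predictions[name]}")
```

The binary matrix is derived from the count matrix on purpose: Bernoulli NB sees only whether a word occurs, while Multinomial NB also uses how often it occurs.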

**Question 10: Breast Cancer Dataset
Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.
Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from sklearn.datasets.
(Include your Python code and output in the code box below.)**

```python
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Gaussian Naïve Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy of Gaussian Naïve Bayes:", accuracy)
```

Output:

```text
Accuracy of Gaussian Naïve Bayes: 0.9415204678362573
```


**Conclusion**:

- Gaussian Naïve Bayes performs well on continuous medical data

- It reaches about 94% test accuracy on this split with very fast training

- Suitable for baseline medical classification models