**SVM, and Naive Bayes Assignment**

**Q1. What is Information Gain, and how is it used in Decision Trees?**

ANS. Information Gain is a measure used in the field of information theory and machine learning, particularly in the construction of decision trees. It quantifies the reduction in entropy (or uncertainty) about an outcome, given some additional information. In simpler terms, it tells us how much 'purer' a set of data becomes after splitting it based on a particular attribute.

How is it used in Decision Trees?

Decision Trees work by recursively splitting the dataset into subsets based on features that provide the most 'information' about the target variable. Information Gain is the criterion used to decide which attribute to split on at each node of the tree.

1. Calculate Entropy: First, the entropy of the entire dataset (or a current node) is calculated. Entropy is a measure of impurity or randomness. A high entropy means the data is mixed (e.g., equal number of 'yes' and 'no' classes), while low entropy means the data is relatively pure (e.g., mostly 'yes' or mostly 'no').

2. Calculate Information Gain for Each Attribute: For each potential splitting attribute, the algorithm calculates the entropy of the subsets created by splitting on that attribute. Then, it calculates the 'weighted average' entropy of these subsets. The Information Gain for an attribute is the difference between the entropy before the split and the weighted average entropy after the split.

Information Gain(A) = Entropy(Parent) - [Weighted Average Entropy(Children)]

3. Select the Best Attribute: The attribute with the highest Information Gain is chosen as the splitting criterion for that node. This is because it reduces the most uncertainty and creates the purest possible child nodes.

4. Repeat: This process is repeated recursively for each child node until a stopping condition is met (e.g., all nodes are pure, a maximum depth is reached, or the Information Gain falls below a threshold).

In essence, Information Gain guides the Decision Tree algorithm to make the most effective splits, leading to a tree that can accurately classify or predict outcomes.
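
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full tree-building algorithm: the toy labels and the two candidate splits (`split_a`, `split_b`) are hypothetical, and the helper names `entropy` and `information_gain` are chosen here for clarity.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy node: 5 'yes' and 5 'no' labels (maximally mixed).
parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes fairly well; split B barely changes the class mix.
split_a = [["yes"] * 4 + ["no"] * 1, ["yes"] * 1 + ["no"] * 4]
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"Information Gain for split A: {gain_a:.4f}")
print(f"Information Gain for split B: {gain_b:.4f}")
```

Under these assumptions, split A yields the larger gain, so a tree using this criterion would choose it for the node.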



**Q2. What is the difference between Gini Impurity and Entropy?**

ANS. Both Gini Impurity and Entropy are measures of impurity or disorder used in decision tree algorithms (like CART, C4.5) to decide the best split at each node. While they both aim to quantify how mixed a set of data is, they do so using slightly different mathematical formulations and have some practical differences.

Here's a breakdown of their differences:

1. Definition and Formula:

**Entropy:**

Entropy, borrowed from information theory, measures the unpredictability or randomness of a dataset. Higher entropy means more uncertainty or a more mixed dataset.

Formula: Entropy(S) = - Σ [p(i) * log2(p(i))]

S is the dataset.

p(i) is the proportion of observations belonging to class i in the dataset.

The summation is over all classes.

**Gini Impurity:**

Gini Impurity measures the probability of misclassifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution in the subset. Lower Gini Impurity means less chance of misclassification and a purer subset.

Formula: Gini(S) = 1 - Σ [p(i)^2]

S is the dataset.

p(i) is the proportion of observations belonging to class i in the dataset.

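The summation is over all classes.

2. Practical Differences:

Range: For a two-class problem, Entropy ranges from 0 to 1, while Gini Impurity ranges from 0 to 0.5. Both are 0 for a pure node and largest when the classes are evenly mixed.

Computation: Gini Impurity needs only squares, whereas Entropy needs a logarithm, so Gini is slightly cheaper to compute.

Effect on the tree: In practice, the two criteria usually select similar splits.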
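
To see how the two formulas behave side by side, the following sketch evaluates both on a few two-class distributions. The probabilities are arbitrary illustrative values, and the function names are chosen here rather than taken from any library.

```python
import math

def entropy(probs):
    """Entropy(S) = -sum(p * log2(p)), skipping zero-probability classes."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini(S) = 1 - sum(p^2)."""
    return 1 - sum(p * p for p in probs)

# p is the proportion of the first class; the second class gets 1 - p.
for p in (0.5, 0.7, 0.9, 1.0):
    dist = [p, 1 - p]
    print(f"p = {p:.1f}  entropy = {entropy(dist):.4f}  gini = {gini(dist):.4f}")
```

Both measures fall as the node becomes purer; they differ mainly in scale and in how sharply they penalize mixed nodes.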





**Q3. What is Pre-Pruning in Decision Trees?**

ANS. Pre-pruning, also known as early stopping, is a technique used in decision tree algorithms to prevent overfitting. It stops tree construction early, before the tree has perfectly classified the training data, by applying stopping criteria such as a maximum depth, a minimum number of samples required to split a node or form a leaf, or a minimum impurity decrease.
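
As a rough illustration, scikit-learn's `DecisionTreeClassifier` exposes stopping criteria as constructor parameters. The sketch below compares an unconstrained tree with a pre-pruned one on the Iris dataset; the specific limits (`max_depth=3`, `min_samples_split=10`, `min_samples_leaf=5`) are assumptions chosen for demonstration, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fully grown tree: no stopping criteria beyond the defaults.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any of these limits is reached.
# The limits are illustrative assumptions, not recommended settings.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42,
).fit(X_train, y_train)

print("Full tree:   depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("Pruned tree: depth", pruned_tree.get_depth(), "leaves", pruned_tree.get_n_leaves())
pruned_accuracy = pruned_tree.score(X_test, y_test)
print(f"Pruned tree test accuracy: {pruned_accuracy:.4f}")
```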

**Q4. Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).**


In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load the Iris dataset as an example
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Decision Tree Classifier with Gini Impurity
dtree_gini = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the classifier
dtree_gini.fit(X_train, y_train)

# Get feature importances
importances = dtree_gini.feature_importances_

# Create a pandas Series for better readability
feature_importances = pd.Series(importances, index=feature_names)

print("Feature Importances (Gini Impurity):")
print(feature_importances.sort_values(ascending=False))


Feature Importances (Gini Impurity):


petal length (cm)    0.893264
petal width (cm)     0.087626
sepal width (cm)     0.01911
sepal length (cm)    0.0


**Q5. What is a Support Vector Machine (SVM)?**

ANS. A Support Vector Machine (SVM) is a powerful and versatile machine learning algorithm used for both classification and regression tasks. However, it's primarily known for its effectiveness in classification, particularly for solving two-class (binary) classification problems.

At its core, an SVM works by finding the optimal hyperplane that separates data points of different classes: the one with the largest margin, that is, the greatest distance to the nearest training points of each class (the support vectors).

**Q6. What is the Kernel Trick in SVM?**

ANS. The Kernel Trick is a fundamental concept that allows Support Vector Machines (SVMs) to handle data that is not linearly separable. A kernel function computes the inner product between two points as if they had been mapped into a higher-dimensional space, without ever performing that transformation explicitly, which saves a substantial amount of computation.
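
A small numeric check can make this concrete. The sketch below uses the homogeneous degree-2 polynomial kernel k(a, b) = (a · b)^2 on 2-D points, whose explicit feature map is (x1^2, sqrt(2)·x1·x2, x2^2). The points `x` and `z` and the helper names are illustrative choices; an RBF kernel follows the same principle but corresponds to an infinite-dimensional map, so it cannot be written out this way.

```python
import math

def explicit_map(v):
    """Explicit degree-2 feature map for a 2-D point: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: k(a, b) = (a . b)^2."""
    return dot(a, b) ** 2

x = (1.0, 2.0)
z = (3.0, 0.5)

# Route 1: map both points to 3-D, then take the inner product there.
via_mapping = dot(explicit_map(x), explicit_map(z))
# Route 2: apply the kernel directly to the original 2-D points.
via_kernel = poly_kernel(x, z)

print("Inner product after explicit mapping:", via_mapping)
print("Kernel evaluated in the original space:", via_kernel)
```

The two routes agree up to floating-point rounding, but the kernel route never constructs the higher-dimensional vectors.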

**Q7. Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.**

In [2]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. Train SVM with Linear Kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)

# Make predictions and calculate accuracy for Linear SVM
y_pred_linear = svm_linear.predict(X_test)
accuracy_linear = accuracy_score(y_test, y_pred_linear)
print(f"Accuracy with Linear Kernel: {accuracy_linear:.4f}")

# 2. Train SVM with RBF (Gaussian) Kernel
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)

# Make predictions and calculate accuracy for RBF SVM
y_pred_rbf = svm_rbf.predict(X_test)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)
print(f"Accuracy with RBF Kernel: {accuracy_rbf:.4f}")

# Compare accuracies
if accuracy_linear > accuracy_rbf:
    print("\nLinear Kernel performed better.")
elif accuracy_rbf > accuracy_linear:
    print("\nRBF Kernel performed better.")
else:
    print("\nBoth kernels performed equally.")


Accuracy with Linear Kernel: 0.9815
Accuracy with RBF Kernel: 0.7593

Linear Kernel performed better.
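
The comparison above was run on unscaled features. Because the RBF kernel depends on distances between points, features with large numeric ranges can dominate it, so part of the gap may come from feature scale rather than from the kernels themselves. The following sketch, which is an addition and not part of the original experiment, repeats the comparison with standardized features using a scikit-learn pipeline; no particular accuracy is claimed here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaled_scores = {}
for kernel in ("linear", "rbf"):
    # The scaler is fit on the training split only, then applied to the test split.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, random_state=42))
    model.fit(X_train, y_train)
    scaled_scores[kernel] = model.score(X_test, y_test)
    print(f"Accuracy with {kernel} kernel (standardized features): {scaled_scores[kernel]:.4f}")
```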


**Q8. What is the Naïve Bayes classifier, and why is it called "Naïve"?**

ANS. The Naïve Bayes classifier is a family of algorithms that apply Bayes' Theorem with the "naïve" assumption of conditional independence between every pair of features given the value of the class variable. Despite its simplistic assumption, Naïve Bayes often performs surprisingly well in practice, especially for text classification and spam detection.

Bayes' Theorem states:

P(A|B) = [P(B|A) * P(A)] / P(B)



**Why is it called "Naïve"?**


The "Naïve" part comes from its fundamental assumption: that all features in the input vector are conditionally independent of each other, given the class.

In other words, it assumes that the presence or absence of one feature does not affect the presence or absence of any other feature, given that we know the class. Mathematically, this means:

P(x_1, x_2, ..., x_n | y) = P(x_1 | y) * P(x_2 | y) * ... * P(x_n | y)
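
Combining this factorization with Bayes' Theorem gives a classifier in a few lines. The sketch below uses a hypothetical two-word spam filter; every probability in `prior` and `likelihood` is a made-up illustrative value, not an estimate from real data.

```python
# Hypothetical probabilities for a two-word spam filter (illustrative values only).
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.2, "meeting": 0.5},
}

def posterior(observed_words):
    """Return P(class | words) using the conditional-independence factorization."""
    # Numerator of Bayes' Theorem for each class: P(y) * product of P(x_i | y).
    joint = {}
    for label in prior:
        score = prior[label]
        for word in observed_words:
            score *= likelihood[label][word]
        joint[label] = score
    # Denominator P(x_1, ..., x_n): the sum of the numerators over all classes.
    evidence = sum(joint.values())
    return {label: score / evidence for label, score in joint.items()}

result = posterior(["free", "meeting"])
print(result)
```

The predicted class is the one with the largest posterior. Since the denominator is the same for every class, comparing the numerators alone would give the same decision.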

**Q9. Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.**


ANS. **1. Gaussian Naïve Bayes**

**Best Suited for:** Continuous data. Features are assumed to follow a Gaussian (normal) distribution.

**Likelihood Calculation:** It assumes that the continuous values associated with each feature are distributed according to a Gaussian distribution. When calculating the likelihood P(x_i | y), it estimates the mean (μ_y) and variance (σ²_y) of feature x_i for each class y from the training data. Then, it uses the probability density function (PDF) of the Gaussian distribution to compute the likelihood for a given x_i.

**Example Use Case:** Classifying based on features like height, weight, temperature, or other real-valued measurements.

**Mathematical Form of P(x_i | y):** P(x_i | y) = (1 / sqrt(2 * π * σ²_y)) * exp(-(x_i - μ_y)² / (2 * σ²_y))
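
The formula translates directly into code. In this minimal sketch, the height values are hypothetical, and the per-class mean and variance are estimated with the standard library's `mean` and `pvariance` (population variance); a real library implementation may estimate or smooth the variance differently.

```python
import math
from statistics import mean, pvariance

def gaussian_likelihood(x, mu, var):
    """Gaussian PDF: (1 / sqrt(2*pi*var)) * exp(-(x - mu)^2 / (2*var))."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical heights (cm) for a single feature, grouped by class.
heights = {
    "adult": [165.0, 170.0, 175.0, 180.0],
    "child": [110.0, 120.0, 125.0, 135.0],
}

# Per-class mean and variance estimated from the training values.
params = {label: (mean(vals), pvariance(vals)) for label, vals in heights.items()}

x_new = 172.0
likelihoods = {label: gaussian_likelihood(x_new, mu, var) for label, (mu, var) in params.items()}
for label, value in likelihoods.items():
    print(f"P(x = {x_new} | {label}) = {value:.6g}")
```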

**2. Multinomial Naïve Bayes**

**Best Suited for:** Discrete counts. This classifier is typically used for features that represent the frequencies with which certain events have been generated by a multinomial distribution. It's especially popular for text classification.

**Likelihood Calculation:** For each feature (e.g., a word in a document) and each class, it calculates the probability of observing that feature count given the class. It uses a multinomial distribution where the likelihood is proportional to the count of feature x_i in documents of class y, normalized by the total count of all features in documents of class y.

**Example Use Case:** Text classification (e.g., spam detection, sentiment analysis), where features are typically word counts or term frequencies within a document. For instance, if you're classifying emails as spam or not spam, the features might be the counts of words like "money," "free," or "Viagra."

**Mathematical Form of P(x_i | y):** P(x_i | y) = (count(x_i, y) + α) / (Σ_k count(x_k, y) + α * V)

Where count(x_i, y) is the number of times feature x_i appears in samples of class y, α is a smoothing parameter (for Laplace or Lidstone smoothing to handle zero probabilities), and V is the total number of unique features.
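
The smoothed estimate can be computed directly from aggregated counts. The word counts and the three-word vocabulary below are hypothetical, and `alpha=1.0` corresponds to Laplace smoothing.

```python
from collections import Counter

# Hypothetical word counts aggregated over all training documents of each class.
class_counts = {
    "spam": Counter({"free": 6, "money": 3, "meeting": 0}),
    "ham": Counter({"free": 1, "money": 1, "meeting": 8}),
}
vocabulary = ["free", "money", "meeting"]  # V = 3 unique features

def smoothed_likelihood(word, label, alpha=1.0):
    """P(x_i | y) = (count(x_i, y) + alpha) / (sum_k count(x_k, y) + alpha * V)."""
    counts = class_counts[label]
    total = sum(counts[w] for w in vocabulary)
    return (counts[word] + alpha) / (total + alpha * len(vocabulary))

for label in class_counts:
    probs = {w: round(smoothed_likelihood(w, label), 4) for w in vocabulary}
    print(label, probs)
```

Note that "meeting" never occurs in the spam counts, yet its smoothed probability is still positive, which keeps the product of likelihoods from collapsing to zero.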

**3. Bernoulli Naïve Bayes**

**Best Suited for:** Binary or boolean features. This classifier assumes that features are independent Bernoulli distributed variables. This means each feature can either be present (1) or absent (0).

**Likelihood Calculation:** For each feature and each class, it estimates the probability that a feature x_i is present (P(x_i=1 | y)) and the probability that it is absent (P(x_i=0 | y)).

**Example Use Case:** Document classification where features are not word counts, but rather indicators of whether a specific word is present or absent in a document. For example, a document either contains the word "sale" or it doesn't.

**Mathematical Form of P(x_i | y):**

P(x_i | y) = P(x_i=1 | y) if x_i is 1 (present)

P(x_i | y) = 1 - P(x_i=1 | y) if x_i is 0 (absent)

Where P(x_i=1 | y) is the probability that feature x_i is present in samples of class y.
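
A short sketch of the Bernoulli likelihood for a whole document follows. The presence probabilities in `p_present` are hypothetical illustrative values.

```python
# Hypothetical per-class probabilities that each word is present in a document.
p_present = {
    "spam": {"sale": 0.7, "meeting": 0.1},
    "ham": {"sale": 0.2, "meeting": 0.6},
}

def bernoulli_likelihood(features, label):
    """Product over features of p if x_i == 1 else (1 - p), where p = P(x_i=1 | y)."""
    result = 1.0
    for word, present in features.items():
        p = p_present[label][word]
        result *= p if present == 1 else 1 - p
    return result

# A document that contains "sale" but not "meeting".
doc = {"sale": 1, "meeting": 0}
doc_likelihoods = {label: bernoulli_likelihood(doc, label) for label in p_present}
for label, value in doc_likelihoods.items():
    print(f"P(doc | {label}) = {value:.4f}")
```

Unlike the multinomial variant, an absent word contributes a factor of its own here (1 - p), so the absence of a word counts as evidence.
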
Summary Table:

| Classifier | Data Type | Feature Modeling Assumption | Common Use Case |
|---|---|---|---|
| Gaussian NB | Continuous | Features follow a normal distribution. | Numerical data (e.g., sensor readings, heights) |
| Multinomial NB | Discrete Counts | Features represent event counts (multinomial). | Text classification (word counts) |
| Bernoulli NB | Binary/Boolean | Features are binary (present/absent). | Document classification (word presence) |

The choice of which Naïve Bayes variant to use depends entirely on the nature of your features in the dataset.



**Q10. Breast Cancer Dataset: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.**



In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Gaussian Naïve Bayes classifier
gnb = GaussianNB()

# Train the classifier
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of Gaussian Naïve Bayes on the Breast Cancer dataset: {accuracy:.4f}")


Accuracy of Gaussian Naïve Bayes on the Breast Cancer dataset: 0.9415
