### Question 1: What is Information Gain, and how is it used in Decision Trees?

=>
What is Information Gain?

Information Gain is a measure from information theory, widely used in machine learning, that quantifies the reduction in entropy (uncertainty) after a dataset is split on an attribute. In simpler terms, it tells us how much 'useful information' a particular feature provides about the target variable. Formally, for a dataset S split by attribute A into subsets S_v: Information Gain(S, A) = Entropy(S) - Σ (|S_v| / |S|) * Entropy(S_v). The higher the Information Gain, the more effective the attribute is at separating the classes.

How is it used in Decision Trees?

Decision Trees are built by recursively splitting the data into subsets based on features. At each node of the tree, the algorithm needs to decide which feature to split on. This is where Information Gain comes into play:

1. Splitting Criterion: In entropy-based trees (such as ID3), Information Gain is the criterion used to select the attribute on which to split a node. The goal is to choose the attribute that yields the highest Information Gain.
2. Reducing Entropy: The core idea is to reduce the impurity or randomness of the data subsets created after a split. An attribute with high Information Gain separates the data such that the resulting subsets are more homogeneous (i.e., each is dominated by a single class).
3. Recursive Process: The process starts at the root node, where the Information Gain is calculated for all available features. The feature with the highest Information Gain is chosen as the split for the root. This process is then repeated recursively for each child node until a stopping condition is met (e.g., all data points in a node belong to the same class, there are no more features to split on, or a maximum depth is reached).
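
The calculation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it; the helper names `entropy` and `information_gain` and the toy labels are assumptions made up for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy node: 5 "yes" and 5 "no" labels (made-up data).
parent = ["yes"] * 5 + ["no"] * 5

# Hypothetical split A separates the classes fairly well.
split_a = [["yes"] * 4 + ["no"] * 1, ["yes"] * 1 + ["no"] * 4]
# Hypothetical split B leaves both children as mixed as the parent.
split_b = [["yes"] * 3 + ["no"] * 3, ["yes"] * 2 + ["no"] * 2]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"Gain A: {gain_a:.4f}, Gain B: {gain_b:.4f}")
```

Split A yields the larger gain, so a tree builder using this criterion would prefer it over split B, which leaves the class mix unchanged.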

### Question 2: What is the difference between Gini Impurity and Entropy?

=>
Entropy

- Definition: Entropy is a measure of the randomness or unpredictability in a dataset. In the context of classification, it quantifies the impurity of a node based on the distribution of classes within it. Higher entropy means higher impurity (more mixed classes), and lower entropy means lower impurity (more homogeneous classes).

- Formula: For a node with 'c' classes, and 'p_i' being the proportion of samples belonging to class 'i', Entropy is calculated as: Entropy = - Σ (p_i * log2(p_i))

- Characteristics:

1. Logarithmic: The use of log2 makes entropy calculations slightly more computationally intensive than Gini Impurity.
2. Information Gain: When used as a splitting criterion, the goal is to maximize Information Gain, which is the reduction in entropy after a split.
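3. Stronger Penalization: Entropy spans a wider range than Gini Impurity (for a binary node its maximum is 1, versus 0.5 for Gini), so it reacts somewhat more strongly to mixed nodes, especially when several classes have roughly equal proportions. In practice the two criteria often select the same or very similar splits.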

Gini Impurity

- Definition: Gini Impurity (or Gini Index) measures the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution in the dataset. A Gini Impurity of 0 means all elements belong to a single class (perfect purity). The maximum is 1 - 1/c for 'c' equally likely classes, so for binary classification the maximal impurity is 0.5 (a 50/50 split).

- Formula: For a node with 'c' classes, and 'p_i' being the proportion of samples belonging to class 'i', Gini Impurity is calculated as: Gini = 1 - Σ (p_i)^2

- Characteristics:

1. Quadratic: The formula involves squaring probabilities, making it computationally faster than entropy in some cases.
2. Bias: Gini Impurity tends to isolate the most frequent class in its own branch, which can sometimes lead to slightly different tree structures compared to entropy.
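
To make the two formulas concrete, here is a small sketch that evaluates both measures on a few class distributions. The function names and the example distributions are assumptions chosen for illustration.

```python
import math


def entropy(proportions):
    """Entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)


# Pure node, mostly pure node, 50/50 binary node, uniform 3-class node.
for dist in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [1 / 3, 1 / 3, 1 / 3]):
    print(dist, f"gini={gini(dist):.4f}", f"entropy={entropy(dist):.4f}")
```

Both measures are 0 for a pure node and peak at a uniform class distribution. For a 50/50 binary node, Gini reaches 0.5 while entropy reaches 1.0.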

Strengths and Weaknesses:

- Entropy Strengths: Can be more sensitive to changes in class distribution, potentially leading to more balanced splits in some scenarios.

- Entropy Weaknesses: Slower to compute than Gini Impurity due to the logarithmic function.
- Gini Impurity Strengths: Faster to compute, often preferred for large datasets. It's the default in popular libraries like scikit-learn for DecisionTreeClassifier.
- Gini Impurity Weaknesses: Can sometimes favor larger partitions, and might not be as sensitive to very slight differences in class distribution as entropy.

Key Differences and Use Cases:

| Aspect | Entropy | Gini Impurity |
| --- | --- | --- |
| Computation | More computationally intensive (logarithm) | Faster (squaring) |
| Penalization | Penalizes impurity more heavily | Less sensitive to impurity in multi-class scenarios |
| Bias | No inherent bias towards any class | Tends to isolate the most frequent class |
| Decision Trees | Used in algorithms like ID3, C4.5, C5.0 | Used in algorithms like CART |
| Splitting | Maximizes Information Gain (reduction in entropy) | Minimizes Gini Impurity |


### Question 3: What is Pre-Pruning in Decision Trees?

=>
Pre-pruning, also known as early stopping, is a technique used in decision tree algorithms to prevent overfitting. Overfitting occurs when a decision tree is grown too deep and learns the training data too well, capturing noise and specific patterns that do not generalize to new, unseen data.

How Pre-Pruning Works:

Instead of growing a full decision tree and then pruning it back (which is post-pruning), pre-pruning stops the tree growth early based on certain criteria. The tree building process is halted when adding new branches or nodes no longer significantly improves the model's performance on the validation set, or when certain predefined thresholds are met.

Common Pre-Pruning Criteria:

Here are some common criteria used to decide when to stop growing a tree:

1. Maximum Depth: The tree stops growing once it reaches a specified maximum depth. For example, with max_depth=5, no path from the root to a leaf contains more than 5 splits (edges).
2. Minimum Samples Split: A node will not be split if the number of samples it contains is less than a predefined minimum. This prevents creating splits on very small groups of data that might be noise.
3. Minimum Samples Leaf: A split is only performed if both child nodes resulting from the split would contain at least a specified minimum number of samples. This ensures that leaf nodes are not too small.
4. Maximum Number of Leaf Nodes: The tree growth is stopped when the number of leaf nodes reaches a specified maximum. This controls the complexity of the tree directly.
5. Impurity Decrease Threshold: A node is split only if the split results in an impurity decrease (e.g., Gini Impurity or Entropy) that is greater than or equal to a certain threshold. If the improvement is too small, the split is considered insignificant and not performed.
6. No Improvement on Validation Set: The tree stops growing if adding new splits does not improve the model's performance (e.g., accuracy or F1-score) on a separate validation set.
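
Several of these criteria map onto constructor parameters of scikit-learn's `DecisionTreeClassifier`. The sketch below shows one way to set them together; the threshold values are arbitrary assumptions for illustration, not tuned settings, and the Iris dataset is used only because it is small. The validation-set criterion (item 6) is not a constructor parameter and would need a separate loop over candidate settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Threshold values below are illustrative assumptions, not tuned settings.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # criterion 1: maximum depth
    min_samples_split=10,        # criterion 2: minimum samples to split a node
    min_samples_leaf=5,          # criterion 3: minimum samples in each leaf
    max_leaf_nodes=6,            # criterion 4: maximum number of leaves
    min_impurity_decrease=0.01,  # criterion 5: impurity decrease threshold
    random_state=42,
)
pruned_tree.fit(X_train, y_train)

print("Depth:", pruned_tree.get_depth())
print("Leaves:", pruned_tree.get_n_leaves())
print("Test accuracy:", pruned_tree.score(X_test, y_test))
```

Whichever limit is hit first stops the growth, so the fitted tree may be smaller than any single threshold suggests.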

### Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).
Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.
(Include your Python code and output in the code box below.)

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Decision Tree Classifier with Gini Impurity
dtc_gini = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
dtc_gini.fit(X_train, y_train)

# Get feature importances
feature_importances = dtc_gini.feature_importances_

# Create a pandas Series for better visualization
importance_df = pd.Series(feature_importances, index=feature_names)

print("Feature Importances (Gini Impurity):")
print(importance_df.sort_values(ascending=False))
```

```
Feature Importances (Gini Impurity):
petal length (cm)    0.893264
petal width (cm)     0.087626
sepal width (cm)     0.019110
sepal length (cm)    0.000000
dtype: float64
```


### Question 5: What is a Support Vector Machine (SVM)?

=>
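A Support Vector Machine (SVM) is a powerful and versatile machine learning algorithm used for classification, regression, and outlier detection tasks. For classification, it looks for the hyperplane that separates the classes with the largest possible margin; the training points closest to that hyperplane are called support vectors, and they alone determine its position. SVMs are particularly effective in high-dimensional spaces, including cases where the number of dimensions is greater than the number of samples.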

### Question 6: What is the Kernel Trick in SVM?

=>
The Kernel Trick is what allows Support Vector Machines (SVMs) to perform non-linear classification and regression. The core idea is to map the original input data into a higher-dimensional feature space in which it may become linearly separable. The 'trick' is that we never need to compute the coordinates of the data points in that space: the SVM optimization depends only on dot products between pairs of transformed vectors. A kernel function (or kernel) computes that dot product directly from the original inputs, without explicitly performing the transformation. Common choices include the linear, polynomial, and RBF (Gaussian) kernels.
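
As a small numerical check of this idea, the sketch below compares an explicit degree-2 feature map with the corresponding polynomial kernel on two 2-D points. The names `phi` and `poly2_kernel` and the sample points are assumptions for illustration; the kernel used is the homogeneous polynomial kernel (x . z)^2.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit degree-2 feature map for a 2-D input (maps to 3-D)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return dot(x, z) ** 2


x = (1.0, 2.0)
z = (3.0, 0.5)

explicit = dot(phi(x), phi(z))   # transform first, then take the dot product
via_kernel = poly2_kernel(x, z)  # stay in the original 2-D space
print(explicit, via_kernel)
```

Both routes give the same value up to floating-point rounding, but the kernel never builds the 3-D vectors. For kernels such as RBF, the implicit feature space is infinite-dimensional, so the shortcut is the only practical option.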

### Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.
Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting on the same dataset.
(Include your Python code and output in the code box below.)

```python
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X_wine = wine.data
y_wine = wine.target

# Split the dataset into training and testing sets
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(X_wine, y_wine, test_size=0.3, random_state=42)

# 1. Train SVM with Linear Kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train_wine, y_train_wine)
y_pred_linear = svm_linear.predict(X_test_wine)
accuracy_linear = accuracy_score(y_test_wine, y_pred_linear)

print(f"Accuracy of SVM with Linear Kernel: {accuracy_linear:.4f}")

# 2. Train SVM with RBF Kernel
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train_wine, y_train_wine)
y_pred_rbf = svm_rbf.predict(X_test_wine)
accuracy_rbf = accuracy_score(y_test_wine, y_pred_rbf)

print(f"Accuracy of SVM with RBF Kernel: {accuracy_rbf:.4f}")
```

```
Accuracy of SVM with Linear Kernel: 0.9815
Accuracy of SVM with RBF Kernel: 0.7593
```
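
One plausible reason for the gap between the two kernels is that the Wine features are on very different scales, and the RBF kernel depends on distances between points. The sketch below repeats the RBF experiment with standardized features as a possible follow-up; the use of `StandardScaler` inside a pipeline is an assumption added here and is not part of the original exercise.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training split only,
# then applies the same scaling before predicting on the test split.
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_rbf.fit(X_train, y_train)

accuracy_scaled_rbf = accuracy_score(y_test, scaled_rbf.predict(X_test))
print(f"Accuracy of SVM with RBF Kernel (standardized): {accuracy_scaled_rbf:.4f}")
```

If scaling is the main cause, the standardized RBF model should score much closer to the linear one. The comparison is fair only when both kernels see the same preprocessing.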


### Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

=>
The Naïve Bayes classifier is a probabilistic machine learning algorithm used for classification tasks. It's based on Bayes' Theorem with an assumption of conditional independence among predictors: given the class, the value of one feature is assumed to be unrelated to the value of any other feature.

What is Naïve Bayes Classifier?

It's not a single algorithm but a family of classification algorithms based on Bayes' Theorem. They all share a common principle: every pair of features is assumed to be conditionally independent given the class.

For a given class variable 'y' and a dependent feature vector x1, x2, ..., xn, Bayes' theorem states the following relationship:

P(y|x1, ..., xn) = P(y) * P(x1, ..., xn | y) / P(x1, ..., xn)

Where:

- P(y|x1, ..., xn) is the posterior probability of class y given the features x1, ..., xn.
- P(y) is the prior probability of class y.
- P(x1, ..., xn | y) is the likelihood, which is the probability of the features x1, ..., xn given class y.
- P(x1, ..., xn) is the evidence, which is the marginal probability of the features x1, ..., xn. It is the same for every class, so it only normalizes the result.

Why is it called "Naïve"?

The "Naïve" part of the name comes from the fundamental (and often unrealistic) assumption that the features (predictors) are conditionally independent of each other given the class variable. In simpler terms, once the class is known, the value of one feature tells us nothing further about the value of any other feature. Under this assumption the likelihood factorizes as P(x1, ..., xn | y) = P(x1 | y) * P(x2 | y) * ... * P(xn | y), which is what makes the model so cheap to train and evaluate.

For example, if you're trying to classify whether an email is spam based on words like "money" and "discount", a Naïve Bayes classifier would assume that the presence of "money" is independent of the presence of "discount" given that the email is spam. In reality, these words often appear together in spam emails, meaning they are not truly independent.

Despite this "naïve" assumption, Naïve Bayes classifiers often perform surprisingly well in many real-world applications, especially with text classification (like spam filtering) and medical diagnosis. Their simplicity, speed, and efficiency make them a popular choice when dealing with large datasets and high-dimensional features.
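
The spam example can be worked through numerically. The sketch below applies Bayes' theorem with the naive factorization to an email containing both "money" and "discount". Every probability in it is a made-up number for illustration, and the names `prior`, `likelihood`, and `posterior` are hypothetical.

```python
# All probabilities below are made-up numbers for illustration.
prior = {"spam": 0.4, "ham": 0.6}

# P(word appears | class), estimated per word and per class.
likelihood = {
    "spam": {"money": 0.5, "discount": 0.4},
    "ham": {"money": 0.1, "discount": 0.05},
}


def posterior(words):
    """Posterior over classes for an email containing all the given words."""
    scores = {}
    for label in prior:
        score = prior[label]
        # Naive step: multiply per-word likelihoods as if they were independent.
        for word in words:
            score *= likelihood[label][word]
        scores[label] = score
    evidence = sum(scores.values())  # P(x1, ..., xn), the normalizer
    return {label: score / evidence for label, score in scores.items()}


result = posterior(["money", "discount"])
print(result)
```

With these numbers, the spam posterior dominates because both words are far more likely under the spam class. The evidence term only rescales the scores so that they sum to 1.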

### Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.

=>

1. Gaussian Naïve Bayes

- Assumes: Features follow a Gaussian (normal) distribution. This means the likelihood of the features is assumed to be a Gaussian probability density function.

- Data Type: Best suited for continuous numerical data. It calculates the mean and standard deviation of each feature for each class during training.

- Use Cases: Often used when features are continuous and can be assumed to have a normal distribution (e.g., measurements like height, weight, or sensor readings).

2. Multinomial Naïve Bayes

- Assumes: Features represent counts or frequencies of events (e.g., word counts in a document). It's based on a multinomial distribution.
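- Data Type: Ideal for discrete counts or frequency-based data, especially in text classification where features are typically word counts. The multinomial model strictly assumes integer counts, although fractional values such as TF-IDF are also commonly used in practice.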
- Use Cases: Widely used in Natural Language Processing (NLP) tasks like spam detection, document classification, and sentiment analysis.

3. Bernoulli Naïve Bayes

- Assumes: Features are binary (Boolean) variables, meaning they can only take two values (e.g., presence or absence of a feature). It models each feature's presence/absence with a Bernoulli distribution.
- Data Type: Suited for binary or boolean features. It evaluates if a specific feature is present or absent, rather than counting how many times it appears.
- Use Cases: Often used in text classification where features are indicators of whether a word appears in a document, rather than how many times it appears. It can also be used for other problems where the features are naturally binary (e.g., customer churn prediction with 'yes' or 'no' flags for certain behaviors).
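
The practical difference between the three variants is the form of the per-class likelihood. The sketch below writes out one likelihood function per variant in plain Python; the function names and the sample inputs are assumptions for illustration, and smoothing is omitted to keep the formulas visible.

```python
import math


def gaussian_likelihood(x, mean, std):
    """Gaussian NB: density of a continuous value x under N(mean, std^2)."""
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))


def multinomial_log_likelihood(counts, probs):
    """Multinomial NB: sum of count * log P(feature | class).

    The multinomial coefficient is omitted because it is the same for
    every class and does not change which class scores highest.
    """
    return sum(c * math.log(p) for c, p in zip(counts, probs))


def bernoulli_likelihood(present, probs):
    """Bernoulli NB: p if the feature is present, (1 - p) if it is absent."""
    result = 1.0
    for flag, p in zip(present, probs):
        result *= p if flag else (1.0 - p)
    return result


# Hypothetical per-class parameters for three features.
probs = [0.5, 0.3, 0.2]

print(gaussian_likelihood(0.0, mean=0.0, std=1.0))
print(multinomial_log_likelihood([2, 0, 1], probs))
print(bernoulli_likelihood([True, False, True], probs))
```

Note that the Bernoulli variant explicitly penalizes the absence of a feature through the (1 - p) factor, whereas the multinomial variant simply ignores features with a count of zero.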

### Question 10: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.
Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from sklearn.datasets.
(Include your Python code and output in the code box below.)

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()
X_bc = breast_cancer.data
y_bc = breast_cancer.target

# Split the dataset into training and testing sets
X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(X_bc, y_bc, test_size=0.3, random_state=42)

# Initialize the Gaussian Naïve Bayes classifier
gnb = GaussianNB()

# Train the model
gnb.fit(X_train_bc, y_train_bc)

# Make predictions on the test set
y_pred_gnb = gnb.predict(X_test_bc)

# Calculate and print the accuracy
accuracy_gnb = accuracy_score(y_test_bc, y_pred_gnb)
print(f"Accuracy of Gaussian Naïve Bayes classifier on Breast Cancer dataset: {accuracy_gnb:.4f}")
```

```
Accuracy of Gaussian Naïve Bayes classifier on Breast Cancer dataset: 0.9415
```
