                                      Assignment ML

 Question 1: What is Information Gain, and how is it used in Decision Trees?

Ans.
Information Gain measures how much a feature reduces uncertainty (entropy) about the target variable. Decision trees use it to select the best feature to split on at each node. The feature with the highest Information Gain is chosen because it produces the purest (most homogeneous) child nodes, which generally leads to more accurate classification. [1, 2, 3]

What is Information Gain?

• Definition: Information Gain is a metric used to select the most informative feature for splitting a dataset in a decision tree. [1, 4]  
• How it works: It quantifies the reduction in entropy (a measure of impurity or uncertainty) of the target variable when the data is partitioned by a particular feature. [1, 2, 3]  
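• Formula: $Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$, where $Entropy(S) = -\sum_{i} p_i \log_2 p_i$ and $p_i$ is the proportion of samples in $S$ that belong to class $i$. [5]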
	• $S$ is the dataset. 
	• $A$ is an attribute (feature). 
	• $v$ is a value of the attribute $A$. 
	• $Entropy(S)$ is the entropy of the original dataset. 
	• $Entropy(S_v)$ is the entropy of the subset of data where attribute $A$ has value $v$. 
	• $\frac{|S_v|}{|S|}$ is the proportion of the data that has value $v$ for attribute $A$. [5]  
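
As a rough illustration of the formula above, the sketch below computes entropy and Information Gain for a single categorical feature. The 14-sample `wind`/`play` data is a hypothetical toy example and the function names are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Gain(S, A): Entropy(S) minus the weighted entropy of each subset S_v."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy data: 8 "weak" samples (6 yes, 2 no), 6 "strong" samples (3 yes, 3 no).
wind = ["weak"] * 8 + ["strong"] * 6
play = ["yes"] * 6 + ["no"] * 2 + ["yes"] * 3 + ["no"] * 3

gain = information_gain(wind, play)
print(f"Entropy(S)    = {entropy(play):.3f}")  # 0.940
print(f"Gain(S, wind) = {gain:.3f}")           # 0.048
```

A gain this close to zero suggests that, in this toy data, `wind` on its own says little about the target.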

How it's used in decision trees 

1. Initial Split: At the root node, Information Gain is calculated for every feature. 
2. Feature Selection: The feature with the highest Information Gain is chosen as the first split, as it provides the most information to separate the data classes. 
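3. Recursive Splitting: This process is repeated recursively for each child node. At each new node, the algorithm selects the feature with the highest Information Gain on that node's subset of the data. In ID3-style trees with categorical features, a feature already used on the path from the root is not considered again; trees that split numeric features on thresholds (as in CART) may reuse a feature with a different threshold.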
4. Termination: The process continues until a stopping criterion is met, such as a node containing only data points from a single class, or a maximum tree depth is reached. [1, 3, 4, 6, 7, 8, 9, 10, 11, 12]  
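
The four steps above can be sketched as a small ID3-style builder. This is a simplified illustration for categorical features only, not how any particular library implements it; the toy rows, the `max_depth` default, and the nested-dict tree layout are all assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    return entropy(labels) - sum(len(s) / total * entropy(s) for s in subsets.values())


def build_tree(rows, labels, features, depth=0, max_depth=3):
    # Step 4 (termination): pure node, no features left, or depth limit reached.
    if len(set(labels)) == 1 or not features or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]  # leaf = majority class
    # Steps 1-2: score every remaining feature and keep the best one.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    tree = {"feature": best, "children": {}}
    # Step 3: recurse into one child per observed value of the chosen feature.
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            remaining, depth + 1, max_depth,
        )
    return tree


# Hypothetical toy dataset.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "play", "stay", "stay"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

On this data `outlook` has the higher gain at the root, so it is chosen first; `windy` is then used only inside the `rainy` branch, where the labels are still mixed.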




Question 2: What is the difference between Gini Impurity and Entropy?

Ans.
Gini Impurity and Entropy are both metrics used in decision trees to measure the impurity or disorder of a node, but they differ in their formulas and ranges. For binary classification, Gini Impurity ranges from 0 to 0.5, while Entropy, which uses a base-2 logarithm, ranges from 0 to 1. Gini Impurity is slightly cheaper to compute because it avoids logarithms, which can matter on large datasets. In practice the two criteria usually produce very similar trees. [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  

| Feature | Gini Impurity | Entropy  |
| --- | --- | --- |
| Formula | $Gini = 1 - \sum_{i} p_i^2$. Measures the probability of misclassifying a randomly picked element if it were labeled at random according to the distribution of labels in the node. | $Entropy = -\sum_{i} p_i \log_2 p_i$. Measures the disorder or uncertainty of the class distribution in the node.  |
| Range | $0$ to $0.5$ for binary classification ($0$ to $1 - 1/C$ for $C$ classes). | $0$ to $1$ for binary classification ($0$ to $\log_2 C$ for $C$ classes).  |
| Computational Cost | Less expensive to compute because it does not involve logarithms. | More expensive to compute due to the use of logarithmic functions.  |
| Interpretation | A Gini impurity of $0$ means a pure node (all data points belong to one class), while a value of $0.5$ indicates the maximum impurity for a binary classification. | An entropy of $0$ means a pure node, while a value of $1$ (or the maximum, $\log_2 C$ for $C$ classes) indicates maximum impurity.  |
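
To make the ranges in the table concrete, here is a minimal sketch that evaluates both measures on a few class distributions. The distributions are arbitrary examples chosen for illustration.

```python
from math import log2


def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping zero terms."""
    return sum(-p * log2(p) for p in probs if p > 0)


distributions = [
    [1.0, 0.0],          # pure node
    [0.9, 0.1],          # mostly one class
    [0.5, 0.5],          # maximally mixed, two classes
    [1 / 3, 1 / 3, 1 / 3],  # maximally mixed, three classes
]

for probs in distributions:
    print(f"gini={gini(probs):.3f}  entropy={entropy(probs):.3f}")
```

Both measures are 0 for the pure node and peak for the uniform distributions; only the scale differs (0.5 versus 1 for two classes).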


Question 3: What is Pre-Pruning in Decision Trees?

Ans.
Pre-pruning, or early stopping, halts the growth of a decision tree during construction to prevent overfitting. The tree-building process stops before the tree becomes too complex, based on conditions such as a maximum depth, a minimum number of samples per node, or a minimum impurity decrease per split.

How it works

• Stopping the growth: Pre-pruning stops the tree from growing beyond a certain point, rather than building the full tree and trimming it afterwards (post-pruning).
• Setting conditions: Stopping conditions are fixed before training. Common ones include:
	• Maximum depth: Limits the number of levels in the tree.
	• Minimum samples per leaf: A split is rejected if it would leave any child node with fewer than this number of samples.
	• Minimum samples per split: A node is split only if it contains at least this number of samples.
	• Minimum impurity decrease: A split is made only if it reduces impurity (for example, Gini impurity or entropy) by at least this amount.
• Preventing overfitting: Stopping early keeps the model from becoming too complex and memorizing noise in the training data, which improves generalization to unseen data.
• Risk of underfitting: Stopping too early can leave the model too simple to capture the underlying patterns in the data.
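
In scikit-learn, these conditions correspond to constructor parameters of `DecisionTreeClassifier`. The sketch below compares an unconstrained tree with a pre-pruned one on the Iris data; the specific threshold values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unconstrained tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any condition is violated.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth
    min_samples_split=10,        # minimum samples required to split a node
    min_samples_leaf=5,          # minimum samples required in each leaf
    min_impurity_decrease=0.01,  # minimum impurity reduction per split
    random_state=42,
).fit(X_train, y_train)

for name, model in [("unpruned", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
        f"test accuracy={model.score(X_test, y_test):.3f}"
    )
```

Comparing depth, leaf count, and test accuracy side by side is one simple way to check whether the chosen limits are too loose or too strict.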

Question 4: Write a Python program to train a Decision Tree Classifier using Gini
Impurity as the criterion and print the feature importances (practical).
Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load a sample dataset (Iris dataset)
data = load_iris()
X = data.data          # Features
y = data.target        # Target labels

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the Decision Tree Classifier using Gini impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Print the feature importances
print("Feature Importances:")
for feature, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")

# Evaluate accuracy on test data
accuracy = clf.score(X_test, y_test)
print(f"\nModel Accuracy: {accuracy:.2f}")


Output:
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876

Model Accuracy: 1.00


Question 5: What is a Support Vector Machine (SVM)?


Ans.
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks, but it is mainly used for classification.

🔹 Concept:

SVM works by finding the best decision boundary (called a hyperplane) that separates data points of different classes with the maximum margin.

The margin is the distance between the hyperplane and the nearest data points from each class.

These nearest points are called Support Vectors — they are the most important data points that define the boundary.

🔹 Key Idea:

SVM aims to maximize the margin between classes while minimizing classification error.

This makes the classifier more robust and generalizable to new data.

🔹 Mathematical Form:

For a binary classification problem, the hyperplane can be written as:

$w^T x + b = 0$

Where:

$w$ → weight vector (defines orientation of the hyperplane)

$b$ → bias term (defines offset from origin)

$x$ → input features
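
As a small illustration of the hyperplane equation, the sketch below classifies points by the sign of $w^T x + b$ and measures their distance to the hyperplane. The values of `w` and `b` are made up for the example; a trained SVM would learn them from data.

```python
from math import sqrt


def decision_value(w, b, x):
    """Compute w^T x + b for a single point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Assign class +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def distance_to_hyperplane(w, b, x):
    """Geometric distance |w^T x + b| / ||w|| from x to the hyperplane."""
    norm = sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0.
w, b = [3.0, 4.0], -5.0

for point in ([3.0, 4.0], [0.0, 0.0], [1.0, 0.5]):
    print(point, predict(w, b, point), distance_to_hyperplane(w, b, point))
```

The point `[1.0, 0.5]` lies exactly on the hyperplane (distance 0); the training points closest to the hyperplane are the ones that act as support vectors.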

🔹 Types of SVM:

Linear SVM:
Used when data is linearly separable.
→ Example: separating two classes with a straight line.

Non-Linear SVM:
Used when data is not linearly separable.
→ Applies the Kernel Trick to map data into higher dimensions.

🔹 Advantages:

Works well on high-dimensional data.

Effective even with small datasets.

Can handle non-linear relationships using kernel functions.

🔹 Disadvantages:

Not ideal for large datasets (training is slow).

Performance depends on choice of kernel and parameter tuning.

Hard to interpret compared to simpler models like Logistic Regression.


Question 6: What is the Kernel Trick in SVM?

Ans.
The Kernel Trick in Support Vector Machines (SVM) is a technique that allows the algorithm to classify data that is not linearly separable by transforming it into a higher-dimensional space — without explicitly computing the transformation.

🔹 Explanation:

Some datasets cannot be separated by a straight line (in 2D) or a hyperplane (in higher dimensions).

The kernel function implicitly maps the input data into a higher-dimensional feature space where it becomes linearly separable.

Instead of computing this transformation directly (which is computationally expensive), the kernel trick computes the dot product of the transformed features using a kernel function.

| Kernel Type | Mathematical Form | Description |
| --- | --- | --- |
| **Linear Kernel** | $K(x, y) = x^T y$ | Used when data is linearly separable. |
| **Polynomial Kernel** | $K(x, y) = (x^T y + c)^d$ | Allows curved decision boundaries. |
| **RBF (Radial Basis Function) / Gaussian Kernel** | $K(x, y) = e^{-\gamma \lVert x - y \rVert^2}$ | Corresponds to an infinite-dimensional feature space; very powerful for non-linear problems. |
| **Sigmoid Kernel** | $K(x, y) = \tanh(\alpha x^T y + c)$ | Similar to a neural network activation. |
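
The idea of computing the dot product without the explicit transformation can be checked numerically. The sketch below uses the degree-2 polynomial kernel with $c = 0$ on 2-D inputs, for which one valid explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The input vectors are arbitrary.

```python
from math import isclose, sqrt


def poly_kernel(x, y):
    """Degree-2 polynomial kernel with c = 0: K(x, y) = (x^T y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2


def feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 4.0)

kernel_value = poly_kernel(x, y)  # works in the original 2-D space
explicit_value = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))

print(kernel_value, explicit_value)
print("match:", isclose(kernel_value, explicit_value))
```

Both routes give the same number (121, up to floating-point rounding), but the kernel never constructs the higher-dimensional vectors. For the RBF kernel the implicit space is infinite-dimensional, so the kernel route is the only practical one.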


Question 7: Write a Python program to train two SVM classifiers with Linear and RBF
kernels on the Wine dataset, then compare their accuracies.
Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting
on the same dataset.


# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X = data.data          # Features
y = data.target        # Labels

# Split the dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create two SVM classifiers: one with Linear kernel and one with RBF kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_rbf = SVC(kernel='rbf', random_state=42)

# Train both models
svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)

# Make predictions
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

# Calculate accuracy
acc_linear = accuracy_score(y_test, y_pred_linear)
acc_rbf = accuracy_score(y_test, y_pred_rbf)

# Print comparison results
print("SVM Classifier Accuracies on Wine Dataset:\n")
print(f"Linear Kernel Accuracy: {acc_linear:.4f}")
print(f"RBF Kernel Accuracy:    {acc_rbf:.4f}")

# Determine which performed better
if acc_linear > acc_rbf:
    print("\n➡ Linear Kernel performed better.")
elif acc_linear < acc_rbf:
    print("\n➡ RBF Kernel performed better.")
else:
    print("\n➡ Both kernels performed equally well.")


Output:
SVM Classifier Accuracies on Wine Dataset:

Linear Kernel Accuracy: 0.9815
RBF Kernel Accuracy:    0.7593

➡ Linear Kernel performed better.
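
A likely reason for the gap is that the Wine features are on very different scales, and the RBF kernel depends on Euclidean distances between samples, so large-valued features can dominate. The sketch below repeats the comparison with standardized features; wrapping the model in a `StandardScaler` pipeline is an addition for illustration, not part of the original question.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scores = {}
for kernel in ("linear", "rbf"):
    # The scaler is fitted on the training split only, then applied to the test split.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, random_state=42))
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(f"{kernel} kernel (scaled features): {scores[kernel]:.4f}")
```

If the RBF score rises substantially after scaling, the earlier result reflects the preprocessing rather than a weakness of the kernel itself.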


Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?


Ans.
A Naïve Bayes classifier is a supervised machine learning algorithm based on Bayes’ Theorem, used mainly for classification problems, especially text classification tasks such as spam detection and sentiment analysis.

🔹 Bayes’ Theorem:
$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$

Where:
• $P(A \mid B)$: Probability of class $A$ given data $B$ (posterior probability)
• $P(B \mid A)$: Probability of data $B$ given class $A$ (likelihood)
• $P(A)$: Prior probability of class $A$
• $P(B)$: Probability of data $B$

🔹 How it works:

The algorithm calculates the probability of each class given the input features and then predicts the class with the highest probability.

It assumes that all features are independent of each other given the class label — this is where the term "Naïve" comes from.

🔹 Why it is called “Naïve”:

It is called Naïve because it assumes that all input features are conditionally independent given the class: once the class is known, the value of one feature tells us nothing about the value of another.

In reality, this assumption is often false, yet the algorithm still performs well in many real-world tasks.

The main variants are:

| Type                        | Description                                                             |
| --------------------------- | ----------------------------------------------------------------------- |
| **Gaussian Naïve Bayes**    | Assumes features follow a normal (Gaussian) distribution.               |
| **Multinomial Naïve Bayes** | Used for discrete features like word counts in text classification.     |
| **Bernoulli Naïve Bayes**   | Used for binary/boolean features (e.g., presence or absence of a word). |
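
To show how Bayes' Theorem and the independence assumption combine, here is a minimal hand-computed sketch. All priors and word likelihoods are invented numbers for illustration, not estimates from any real dataset.

```python
# Hypothetical prior class probabilities P(class).
priors = {"spam": 0.4, "ham": 0.6}

# Hypothetical per-class word likelihoods P(word | class).
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.05},
    "ham": {"free": 0.05, "meeting": 0.20},
}


def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for word in words:
            # Independence: multiply the per-word likelihoods together.
            score *= likelihoods[cls][word]
        scores[cls] = score
    evidence = sum(scores.values())  # P(words), the normalizing constant
    return {cls: score / evidence for cls, score in scores.items()}


result = posterior(["free"])
prediction = max(result, key=result.get)
print(result, "->", prediction)
```

For the single word "free", the unnormalized scores are 0.4 × 0.30 = 0.12 for spam and 0.6 × 0.05 = 0.03 for ham, so the posterior for spam is 0.12 / 0.15 = 0.8 and the predicted class is spam.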


Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve
Bayes, and Bernoulli Naïve Bayes

Ans. Naïve Bayes algorithms come in several variants, depending on the type of data and the assumptions about how the features are distributed.
The three main types are Gaussian, Multinomial, and Bernoulli Naïve Bayes.

🔹 1. Gaussian Naïve Bayes

Used for: Continuous (numeric) data

Assumption:
Each feature follows a Normal (Gaussian) distribution.

Example:
Predicting whether a person has a disease based on continuous features like age, blood pressure, or cholesterol level.

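Formula:
$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$

where $\mu_y$ and $\sigma_y$ are the mean and standard deviation of feature $x_i$ among the training samples of class $y$.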
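
The Gaussian likelihood can be evaluated directly once a per-class mean and standard deviation are estimated. The sketch below does this for one feature; the ages and class names are hypothetical, and using the population standard deviation (`pstdev`) is one reasonable estimator chosen for the example.

```python
from math import exp, pi, sqrt
from statistics import mean, pstdev


def gaussian_likelihood(x, mu, sigma):
    """Normal density of x for a class with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / sqrt(2 * pi * sigma**2)


# Hypothetical training values of one feature (age) for each class.
ages = {"disease": [60, 65, 70], "healthy": [30, 35, 40]}

# Estimate the per-class parameters mu_y and sigma_y.
params = {cls: (mean(values), pstdev(values)) for cls, values in ages.items()}

new_age = 62
likelihood = {
    cls: gaussian_likelihood(new_age, mu, sigma) for cls, (mu, sigma) in params.items()
}
print(likelihood)
```

An age of 62 is far more likely under the "disease" class here; a full classifier would multiply such likelihoods across all features and by the class prior.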


Library Implementation:
from sklearn.naive_bayes import GaussianNB

🔹 2. Multinomial Naïve Bayes

Used for: Discrete (count-based) data, especially text classification

Assumption:
Features represent counts or frequencies of events (e.g., number of times a word appears in a document).

Example:
Email spam detection, document classification, sentiment analysis.
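Formula:
$P(x_i \mid y) = \frac{N_{yi} + \alpha}{N_y + \alpha n}$

where $N_{yi}$ is the number of times feature $i$ appears in training samples of class $y$, $N_y$ is the total count of all features in class $y$, $n$ is the number of features, and $\alpha$ is the smoothing parameter ($\alpha = 1$ gives Laplace smoothing).

Library Implementation:
from sklearn.naive_bayes import MultinomialNB
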
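
The smoothed estimate is simple arithmetic, as the sketch below shows for a three-word vocabulary. The word counts are hypothetical; note how smoothing keeps an unseen word from getting probability zero.

```python
def smoothed_probability(count_in_class, total_in_class, n_features, alpha=1.0):
    """(N_yi + alpha) / (N_y + alpha * n) for one feature within one class."""
    return (count_in_class + alpha) / (total_in_class + alpha * n_features)


# Hypothetical word counts observed in the "spam" class.
spam_counts = {"free": 3, "win": 2, "meeting": 0}
total = sum(spam_counts.values())  # N_y = 5
n_features = len(spam_counts)      # n = 3

probs = {
    word: smoothed_probability(count, total, n_features)
    for word, count in spam_counts.items()
}
print(probs)  # {'free': 0.5, 'win': 0.375, 'meeting': 0.125}
```

Without smoothing, "meeting" would get probability 0 and wipe out the whole product for any message containing it.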
🔹 3. Bernoulli Naïve Bayes

Used for: Binary/boolean features (presence or absence)

Assumption:
Each feature is binary (1 = feature present, 0 = absent).

Example:
Whether a document contains certain keywords or not.

Use Case:
Good for datasets where only the presence of a feature matters, not its frequency.

Library Implementation:
from sklearn.naive_bayes import BernoulliNB


Question 10: Breast Cancer Dataset
Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.
Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from
sklearn.datasets.


# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data          # Features
y = data.target        # Labels

# Split the dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the Gaussian Naïve Bayes classifier
gnb = GaussianNB()

# Train (fit) the model
gnb.fit(X_train, y_train)

# Predict on the test data
y_pred = gnb.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Display results
print("✅ Gaussian Naïve Bayes Classifier on Breast Cancer Dataset\n")
print(f"Model Accuracy: {accuracy:.4f}\n")

# Detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Output:
✅ Gaussian Naïve Bayes Classifier on Breast Cancer Dataset

Model Accuracy: 0.9415

Classification Report:
              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171

