Question 1 : What is Information Gain, and how is it used in Decision Trees?

= Information Gain measures how much uncertainty in the target variable is reduced after splitting the data based on a feature. It is used in Decision Trees to select the best attribute for splitting at each node. The formula is:

$$IG(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$$

> Where:
- S : the current dataset (before the split)
- A : the attribute we are testing
- Sv : subset of S where attribute A has value v
- ∣Sv∣/∣S∣ : proportion of the samples in S that fall into subset Sv
- Entropy(S) : impurity (or disorder) of the dataset 𝑆
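
The formula above can be checked by hand on a small example. The sketch below is a minimal pure-Python version, not a library implementation; the helper names `entropy` and `information_gain` and the toy `labels`/`wind` data are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, feature_values):
    """IG(S, A): entropy of S minus the size-weighted entropy of each subset Sv."""
    total = len(labels)
    # Group the labels by the value the attribute takes for each sample.
    subsets = {}
    for label, value in zip(labels, feature_values):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy data: 14 samples, 9 "yes" and 5 "no".
labels = ["yes"] * 9 + ["no"] * 5
# Hypothetical attribute: "weak" covers 6 yes + 2 no, "strong" covers 3 yes + 3 no.
wind = ["weak"] * 6 + ["strong"] * 3 + ["weak"] * 2 + ["strong"] * 3

ig_wind = information_gain(labels, wind)
print(f"Entropy(S)  = {entropy(labels):.3f}")
print(f"IG(S, wind) = {ig_wind:.3f}")
```

A decision tree using this criterion would compute the gain for every candidate attribute and split on the one with the largest value.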


Question 2: What is the difference between Gini Impurity and Entropy?

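= Gini Impurity and Entropy both measure how mixed the class labels at a Decision Tree node are.
- Gini Impurity is the probability that a randomly chosen sample would be mislabeled if it were labeled at random according to the node's class distribution: Gini = 1 - Σ p_i^2.
- Entropy measures the uncertainty of that class distribution in bits: Entropy = -Σ p_i log2(p_i).
- Gini is slightly cheaper to compute because it needs no logarithm, and it is the criterion used in CART (Classification and Regression Trees).
- Entropy gives more weight to less probable classes and is the basis of Information Gain in the ID3/C4.5 algorithms.
- In practice, both usually produce similar trees; Gini is a common default because of its lower computational cost.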
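
To make the comparison concrete, the sketch below evaluates both measures on a few class distributions. The function names and the example distributions are illustrative assumptions, not part of any library.

```python
from math import log2


def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Hypothetical two-class distributions, from pure to maximally mixed.
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(f"{probs}: gini={gini(probs):.3f}  entropy={entropy(probs):.3f}")
```

Both measures are zero for a pure node and largest for an evenly mixed one; they differ in scale (for two classes, Gini peaks at 0.5 and entropy at 1 bit).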


Question 3: What is Pre-Pruning in Decision Trees?

= Pre-pruning (also called early stopping) is a technique that stops a Decision Tree from growing too deep and overfitting the data. It sets conditions that every split must satisfy, such as a maximum depth, a minimum number of samples per node, or a minimum information gain. If a candidate split would violate one of these conditions, the node is not split and becomes a leaf. This yields a simpler, faster model that usually generalizes better.
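
In scikit-learn, these stopping conditions are set through constructor parameters of `DecisionTreeClassifier`. The sketch below compares an unrestricted tree with a pre-pruned one on the Iris dataset; the specific threshold values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unrestricted tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned tree: each parameter below is an early-stopping condition.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # never grow deeper than 3 levels
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_samples_leaf=5,          # every leaf must keep at least 5 samples
    min_impurity_decrease=0.01,  # a split must reduce impurity by at least this much
    random_state=42,
).fit(X, y)

print("Unrestricted: depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("Pre-pruned:   depth", pruned_tree.get_depth(), "leaves", pruned_tree.get_n_leaves())
```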


In [1]:
# Question 4: Write a Python program to train a Decision Tree Classifier using Gini
# Impurity as the criterion and print the feature importances (practical).


from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

data = load_iris()
X = data.data
y = data.target

clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X, y)

feature_importances = pd.DataFrame({
    'Feature': data.feature_names,
    'Importance': clf.feature_importances_
})

print(feature_importances.sort_values(by='Importance', ascending=False))



             Feature  Importance
2  petal length (cm)    0.564056
3   petal width (cm)    0.422611
0  sepal length (cm)    0.013333
1   sepal width (cm)    0.000000


Question 5: What is a Support Vector Machine (SVM)?

= A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the boundary (hyperplane) that separates the classes with the maximum margin. The data points closest to this boundary are called support vectors, and they alone determine where the boundary lies. SVM can also handle non-linear data using kernel functions (like RBF or polynomial). It is often effective on high-dimensional datasets.
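
The role of support vectors can be seen on a tiny example. The sketch below fits a linear SVM to four hand-picked 2-D points (hypothetical data chosen so the classes are clearly separable) and prints which points the model keeps as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two points per class along a diagonal.
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_toy = np.array([0, 0, 1, 1])

svm = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

# The support vectors are the training points that define the margin.
print("Support vectors:\n", svm.support_vectors_)
# For a linear kernel, the hyperplane is w . x + b = 0.
print("w:", svm.coef_[0], "b:", svm.intercept_[0])
print("Predictions:", svm.predict([[0.5, 0.5], [3.5, 3.5]]))
```

The points nearest the gap between the classes, (1, 1) and (3, 3), are the ones that constrain the margin; moving the outer points farther away would not change the boundary.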


Question 6: What is the Kernel Trick in SVM?

= The Kernel Trick lets an SVM handle non-linear data by working in a higher-dimensional feature space without explicitly computing the transformation. A kernel function returns the inner product of two points in that space directly from their original coordinates, which is all the SVM needs. The SVM then finds a linear separator in the higher-dimensional space, which corresponds to a non-linear boundary in the original space. Common kernels include linear, polynomial, and RBF (Radial Basis Function). This makes SVMs powerful for complex, non-linear classification problems.
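
A small numeric check illustrates the idea. For 2-D inputs, the degree-2 polynomial kernel k(x, z) = (x · z)^2 equals the inner product of an explicit 3-D feature map φ(x) = (x1^2, √2·x1·x2, x2^2). The sketch below compares the two; the function names `phi` and `poly_kernel` and the sample points are illustrative assumptions.

```python
import numpy as np


def phi(v):
    """Explicit degree-2 feature map for a 2-D point."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return np.dot(x, z) ** 2


# Hypothetical sample points.
x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))  # transform first, then take the inner product
implicit = poly_kernel(x, z)       # kernel trick: no transformation needed

print("Explicit feature-space inner product:", explicit)
print("Kernel value:", implicit)
```

Both routes give the same number, but the kernel never builds the 3-D vectors. For kernels such as RBF the implied feature space is infinite-dimensional, so the kernel is the only practical route.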



In [2]:
#Question 7: Write a Python program to train two SVM classifiers with Linear and RBF
#kernels on the Wine dataset, then compare their accuracies.


from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

data = load_wine()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train, y_train)
linear_pred = svm_linear.predict(X_test)

svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
rbf_pred = svm_rbf.predict(X_test)

linear_acc = accuracy_score(y_test, linear_pred)
rbf_acc = accuracy_score(y_test, rbf_pred)

print("Accuracy with Linear Kernel:", linear_acc)
print("Accuracy with RBF Kernel:", rbf_acc)


Accuracy with Linear Kernel: 0.9814814814814815
Accuracy with RBF Kernel: 0.7592592592592593
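
The gap between the two kernels is likely an effect of feature scale rather than of the kernel itself: the Wine features span very different ranges, and the RBF kernel depends on Euclidean distances, so large-valued features dominate. The sketch below is one possible follow-up, assuming standardization with `StandardScaler` inside a pipeline; it reuses the same split and is not part of the original question.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# RBF SVM on raw features, as in the cell above.
unscaled_rbf = SVC(kernel="rbf").fit(X_train, y_train)

# Same model, but each feature is standardized to zero mean and unit variance.
# The scaler is fit on the training data only, inside the pipeline.
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_train, y_train)

unscaled_acc = unscaled_rbf.score(X_test, y_test)
scaled_acc = scaled_rbf.score(X_test, y_test)

print("RBF accuracy without scaling:", unscaled_acc)
print("RBF accuracy with scaling:   ", scaled_acc)
```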


Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

= The Naïve Bayes classifier is a probabilistic machine learning algorithm based on Bayes’ Theorem, used mainly for classification tasks. It calculates the probability of each class given the input features and chooses the class with the highest probability. It is called “Naïve” because it assumes that all features are conditionally independent of each other given the class. This assumption is rarely true in real-world data, yet the classifier often works well in practice, especially for text classification and spam detection.
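
The independence assumption is what makes the computation simple: the score for each class is the prior multiplied by one likelihood per feature. The sketch below works through this by hand for a two-class spam example; all probabilities and names are made-up values for illustration.

```python
# Hypothetical class priors P(class).
priors = {"spam": 0.4, "ham": 0.6}

# Hypothetical per-word likelihoods P(word | class).
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}


def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for word in words:
            # Independence: multiply the per-word likelihoods together.
            score *= likelihoods[cls][word]
        scores[cls] = score
    total = sum(scores.values())  # normalizing constant P(words)
    return {cls: score / total for cls, score in scores.items()}


print(posterior(["free"]))
print(posterior(["free", "meeting"]))
```

With only "free" observed, spam is the more probable class; adding "meeting" shifts the balance toward ham because that word is far more likely under the ham class.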


Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes

- Gaussian Naïve Bayes: Used for continuous features; it models each feature within a class as normally (Gaussian) distributed. Example: predicting from height, weight, or age.
- Multinomial Naïve Bayes: Used for discrete count data, like word frequencies in text classification (e.g., spam detection).
- Bernoulli Naïve Bayes: Used for binary/boolean features (0 or 1), indicating the presence or absence of a feature.
- Key Difference: The type of data each model handles: continuous (Gaussian), counts (Multinomial), or binary (Bernoulli).
- Common Use Case: Gaussian for numeric data, Multinomial/Bernoulli for text data.
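
The three variants share the same scikit-learn interface and differ only in the feature type they expect. The sketch below fits each one to a tiny hand-made dataset of the matching type; the arrays are hypothetical and far too small for real use.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y_toy = np.array([0, 0, 1, 1])

# Continuous measurements -> Gaussian Naive Bayes.
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 4.1], [3.2, 3.9]])
gauss_pred = GaussianNB().fit(X_cont, y_toy).predict([[1.1, 2.0]])

# Word counts per document -> Multinomial Naive Bayes.
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
multi_pred = MultinomialNB().fit(X_counts, y_toy).predict([[2, 0, 1]])

# Word presence/absence (0 or 1) -> Bernoulli Naive Bayes.
X_bin = (X_counts > 0).astype(int)
bern_pred = BernoulliNB().fit(X_bin, y_toy).predict([[1, 0, 1]])

print("GaussianNB prediction:   ", gauss_pred)
print("MultinomialNB prediction:", multi_pred)
print("BernoulliNB prediction:  ", bern_pred)
```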

In [3]:
#Question 10: Breast Cancer Dataset
#Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
#dataset and evaluate accuracy

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
data = load_breast_cancer()

X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of Gaussian Naïve Bayes:", accuracy)

Accuracy of Gaussian Naïve Bayes: 0.9415204678362573
