# Supervised Classification: Decision Trees, SVM, and Naive Bayes

# 1. What is Information Gain, and how is it used in Decision Trees?
- Information Gain is a measure used in Decision Trees to decide which feature to split on at each node. It is based on **entropy**, which measures the impurity (how mixed the classes are) in a dataset. Information Gain is the entropy of the parent node minus the size-weighted average entropy of the child nodes produced by a split, so it quantifies how much uncertainty about the target variable the split removes. The feature with the **highest Information Gain** is chosen for the split. For example, suppose we are building a decision tree to predict whether a student will pass an exam based on features like *study hours* and *attendance*. If splitting the data using *study hours* results in groups where most students clearly pass or fail (low impurity), while *attendance* gives mixed results, then *study hours* will have higher Information Gain and will be selected as the root node. Repeating this choice at every node lets the tree separate the classes step by step.
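
The student example can be made concrete with a small calculation. The sketch below is only illustrative: the eight student labels and both candidate splits are hypothetical toy data, and `entropy` and `information_gain` are helper names chosen for this example rather than library functions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy data: 4 students pass, 4 fail.
parent = ["pass"] * 4 + ["fail"] * 4

# Splitting on study hours separates the classes cleanly.
study_hours_split = [["pass"] * 4, ["fail"] * 4]
# Splitting on attendance leaves both groups evenly mixed.
attendance_split = [["pass", "pass", "fail", "fail"], ["pass", "pass", "fail", "fail"]]

gain_study = information_gain(parent, study_hours_split)
gain_attendance = information_gain(parent, attendance_split)
print(f"Gain (study hours): {gain_study:.3f}")
print(f"Gain (attendance):  {gain_attendance:.3f}")
```

With these toy groups, the clean split recovers the full 1 bit of parent entropy, while the mixed split gains nothing, so *study hours* would be chosen.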


# 2. What is the difference between Gini Impurity and Entropy?
- Gini Impurity and Entropy are both measures used in Decision Trees to quantify how “pure” or “impure” a node is after a split; they differ in formula and in the algorithms that traditionally use them. **Entropy** comes from information theory and measures the uncertainty in the class distribution as $-\sum_i p_i \log_2 p_i$, where $p_i$ is the proportion of class $i$ in the node; it underlies Information Gain in the ID3 and C4.5 algorithms. Because it needs a logarithm, it is slightly more expensive to compute. **Gini Impurity**, used by the CART algorithm, is $1 - \sum_i p_i^2$ and measures the probability of misclassifying a randomly chosen data point if it were labeled at random according to the node's class distribution. For two classes, entropy ranges from 0 to 1 and Gini from 0 to 0.5. In practice both usually give very similar splits and accuracy, so neither is systematically more accurate; **Gini is a common default because it is cheaper to compute**, while **Entropy is a natural choice when an information-theoretic interpretation is wanted**.
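
To see how closely the two measures track each other, here is a minimal sketch that evaluates both on a few binary class distributions. The function names `gini` and `entropy` and the chosen probabilities are assumptions made for illustration.

```python
import math


def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Binary node where a fraction p of samples belongs to the first class.
for p in (0.1, 0.3, 0.5):
    dist = (p, 1 - p)
    print(f"p={p:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures are zero for a pure node and peak at an even 50/50 mix (0.5 for Gini, 1.0 for entropy), which is why they tend to rank candidate splits similarly.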


# 3. What is Pre-Pruning in Decision Trees?
 - Pre-pruning in Decision Trees is a technique used to **control the growth of the tree during the training process itself** in order to prevent **overfitting**, reduce model complexity, and improve generalization on unseen data. Instead of allowing the tree to grow fully and then trimming it later, pre-pruning **stops the splitting process early based on certain predefined conditions**. These conditions may include setting a **maximum depth of the tree**, defining a **minimum number of samples required to split a node**, specifying the **minimum number of samples required in a leaf node**, or stopping further splits when the **information gain or impurity reduction becomes very small**. For example, if the maximum depth is set to 4, the decision tree will stop growing after four levels even if additional splits could improve training accuracy. While pre-pruning helps in making the model **simpler, faster, and less prone to overfitting**, a drawback is that it may sometimes **stop the tree too early**, leading to **underfitting** where the model fails to capture important patterns in the data. Therefore, pre-pruning is especially useful when working with **large datasets**, noisy data, or when computational efficiency is important.
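
In scikit-learn, the stopping conditions described above correspond to constructor arguments of `DecisionTreeClassifier`. The following is a rough sketch on the Iris dataset; the specific threshold values are arbitrary assumptions picked for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: the tree grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any condition is hit.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth of the tree
    min_samples_split=10,        # minimum samples needed to split a node
    min_samples_leaf=5,          # minimum samples required in each leaf
    min_impurity_decrease=0.01,  # skip splits with a tiny impurity reduction
    random_state=42,
).fit(X_train, y_train)

print("Unrestricted: depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("Pre-pruned:   depth", pruned_tree.get_depth(), "leaves", pruned_tree.get_n_leaves())
print("Unrestricted test accuracy:", full_tree.score(X_test, y_test))
print("Pre-pruned test accuracy:  ", pruned_tree.score(X_test, y_test))
```

Comparing depth, leaf count, and test accuracy side by side is a simple way to check whether a given set of limits simplifies the tree without causing underfitting.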


# 4. Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).
- A Decision Tree Classifier using **Gini Impurity** works by selecting the feature at each node that best reduces the impurity of the dataset after splitting. Gini Impurity measures the probability that a randomly selected data point would be incorrectly classified, and its value ranges from 0 (perfectly pure) to 0.5 (maximum impurity for binary classification). During training, the algorithm evaluates every feature and all possible split points, calculates the Gini Impurity for each split, and chooses the split that results in the **lowest weighted Gini value**. After the tree is trained, **feature importance** is calculated based on how much each feature contributes to reducing Gini Impurity across all splits in the tree. Features that are used more frequently and cause larger impurity reductions receive higher importance scores. These feature importance values help us understand which input variables are most influential in predicting the output, making the model more interpretable and useful for feature selection.


```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load dataset
data = load_iris()
X = data.data        # features
y = data.target      # labels

# Split data into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create Decision Tree using Gini Impurity
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(data.feature_names, model.feature_importances_):
    print(f"{feature}: {importance:.4f}")
```

Output:

```text
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772
```


# 5. What is a Support Vector Machine (SVM)?
- A **Support Vector Machine (SVM)** is a supervised machine learning algorithm used for both **classification and regression**, though it is best known for classification. The core idea is to find the **optimal hyperplane** that separates the classes such that the **margin**, the distance between the hyperplane and the closest data points of each class, is maximized. Those closest points are called **support vectors**; they alone determine the position and orientation of the hyperplane. SVM handles both **linearly separable and non-linearly separable data**: for non-linear problems it uses the **kernel trick** (for example with polynomial or radial basis function kernels) to work implicitly in a higher-dimensional space where a linear separation becomes possible. SVM tends to **work well with high-dimensional data and small to medium-sized datasets**, and margin maximization together with the regularization parameter `C` helps limit overfitting. However, training can be computationally expensive for very large datasets, and results depend on careful selection of the kernel and hyperparameters. SVM has been widely used in applications such as **image classification, text categorization, bioinformatics, and face recognition**.
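
The role of the support vectors can be inspected directly on a fitted model. Below is a minimal sketch using scikit-learn's `SVC` with a linear kernel; the six 2-D points are hypothetical toy data invented for this illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D toy data: two well-separated clusters.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]       # normal vector of the separating hyperplane
b = clf.intercept_[0]  # offset of the hyperplane

print("Support vectors:\n", clf.support_vectors_)
print("Hyperplane: w =", w, ", b =", b)
print("Margin width:", 2 / np.linalg.norm(w))
```

Only the points listed in `support_vectors_` influence the fitted hyperplane; moving any of the other points slightly, without crossing the margin, would leave the solution unchanged.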


# 6. What is the Kernel Trick in SVM?
 - The **Kernel Trick in Support Vector Machines (SVM)** is a powerful mathematical technique that allows SVM to efficiently solve **non-linearly separable classification and regression problems** without explicitly transforming the data into a higher-dimensional space. In many real-world datasets, the classes cannot be separated by a straight line (or flat hyperplane), so SVM uses a kernel function to **implicitly map the original input features into a higher-dimensional feature space** where the data becomes linearly separable. The key advantage of the kernel trick is that this complex transformation is done **without actually computing the new coordinates**, which saves a huge amount of computation and memory. Common kernel functions include the **Linear Kernel** (used when data is already linearly separable), the **Polynomial Kernel** (for curved decision boundaries), the **Radial Basis Function (RBF) Kernel** (highly flexible and widely used), and the **Sigmoid Kernel** (similar to neural networks). By choosing an appropriate kernel, SVM can model very complex decision boundaries while still keeping the optimization problem mathematically efficient. This makes the kernel trick one of the most important reasons why SVM performs extremely well in areas such as **image processing, text classification, handwriting recognition, and bioinformatics**, where data is often non-linear and high-dimensional.
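
A small numeric check can make the "implicit mapping" idea tangible. The sketch below uses a homogeneous degree-2 polynomial kernel on 2-D inputs as an assumed example; `phi` is the explicit feature map that this particular kernel corresponds to, and both function names are chosen only for illustration.

```python
import numpy as np


def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (maps into 3-D)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])


def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b) ** 2."""
    return np.dot(a, b) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # inner product computed in the mapped 3-D space
implicit = poly_kernel(x, z)       # same quantity, computed in the original 2-D space

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both values agree up to floating-point rounding (about 121 here), yet the kernel version never builds the higher-dimensional vectors. For kernels such as RBF, the corresponding feature space is infinite-dimensional, so the kernel is the only practical way to compute these inner products.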


# 7. Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.
- In this experiment, two **Support Vector Machine (SVM)** classifiers are trained on the **Wine dataset** using two different kernels, **Linear** and **Radial Basis Function (RBF)**, in order to compare their classification performance. The Wine dataset contains 13 chemical measurements for wines from three classes, and the goal is to predict the wine class from these features. The **Linear kernel** fits a flat hyperplane and works well when the classes are (close to) linearly separable. The **RBF kernel** is non-linear and can create curved decision boundaries by implicitly mapping the input data into a higher-dimensional space using the kernel trick. Both models are trained on the same training split and evaluated on the same test split, and their **accuracy scores** are compared. A non-linear kernel is not automatically better: in the run shown here the linear kernel reaches 1.0 while the RBF kernel reaches about 0.81. A likely reason is that the features are used unscaled; the RBF kernel depends on distances between samples, so features with large numeric ranges (such as proline) dominate it. The comparison shows that **kernel choice and preprocessing both affect model performance** and should be validated on the data at hand rather than assumed.


```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train SVM with Linear Kernel
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)

# Train SVM with RBF Kernel
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)

# Calculate accuracy
linear_accuracy = accuracy_score(y_test, y_pred_linear)
rbf_accuracy = accuracy_score(y_test, y_pred_rbf)

# Print results
print("Linear Kernel Accuracy:", linear_accuracy)
print("RBF Kernel Accuracy:", rbf_accuracy)
```

Output:

```text
Linear Kernel Accuracy: 1.0
RBF Kernel Accuracy: 0.8055555555555556
```
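
The lower RBF score in the run above may be an effect of unscaled features rather than of the kernel itself, since the RBF kernel is distance-based and the Wine features have very different numeric ranges. One way to test this, sketched here as a variation on the program above and not part of the original experiment, is to standardize the features inside a pipeline before fitting each classifier.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = {}
for kernel in ("linear", "rbf"):
    # The scaler is fit on the training split only, then applied to the test split.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(f"{kernel} kernel accuracy with scaling: {scores[kernel]:.4f}")
```

Putting the scaler in the pipeline keeps test-set statistics out of training, so the scaled and unscaled results can be compared fairly.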


# 8. What is the Naïve Bayes classifier, and why is it called "Naïve"?
 - The **Naïve Bayes classifier** is a simple yet powerful **probabilistic machine learning algorithm** based on **Bayes’ Theorem**, and it is mainly used for **classification tasks** such as text classification, spam detection, sentiment analysis, and medical diagnosis. It works by calculating the **posterior probability of a class given the input features** and then selecting the class with the highest probability as the prediction. The key assumption behind Naïve Bayes is that **all features are conditionally independent of each other given the class label**, which means the presence or absence of one feature does not affect the presence of another. This assumption is called “naïve” because in real-world data, features are usually correlated and not truly independent. Despite this unrealistic assumption, Naïve Bayes performs **surprisingly well in many applications**, especially when working with **high-dimensional data such as text**, because it requires very little training data, is computationally fast, and is resistant to overfitting. It is called “Naïve” not because it is weak, but because of its **simplifying independence assumption**, which makes the algorithm easy to implement, efficient, and widely used in practical machine learning systems.
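
The posterior calculation can be traced by hand on a tiny example. The sketch below assumes a toy spam filter; every probability in it is a made-up illustrative number, not an estimate from real data, and `posterior` is a helper written for this example.

```python
# Hypothetical probabilities for a toy spam filter (not estimated from real data).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.03, "meeting": 0.20},
}


def posterior(words):
    """Posterior class probabilities under the naive independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            # Independence assumption: per-word likelihoods are simply multiplied.
            score *= likelihoods[label][word]
        scores[label] = score
    total = sum(scores.values())  # normalizing constant (the evidence)
    return {label: score / total for label, score in scores.items()}


result = posterior(["free"])
print(result)
print("Predicted class:", max(result, key=result.get))
```

The multiplication inside the loop is exactly where the "naïve" assumption enters: each word contributes its own factor, regardless of which other words appear in the message.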


# 9. Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes
 - The three main variants of the **Naïve Bayes classifier—Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes—differ primarily in the type of features they support and the probability distributions they assume for modeling the data**. **Gaussian Naïve Bayes** is used when the input features are **continuous numerical values** and are assumed to follow a **normal (Gaussian) distribution**; for each feature and each class, the model estimates the mean and standard deviation, and this makes it suitable for problems such as medical diagnosis using patient measurements, sensor-based data, and other real-valued datasets. **Multinomial Naïve Bayes** is designed for **discrete, count-based features**, most commonly used in **text classification tasks** like spam detection, sentiment analysis, and document categorization, where features represent the **frequency of words** in documents; it works especially well with large vocabularies and sparse data. **Bernoulli Naïve Bayes**, in contrast, is used for **binary-valued features**, where each feature indicates the **presence or absence of an attribute or word** rather than its frequency; this makes it particularly useful in tasks such as email spam filtering when only binary term occurrence is considered. In summary, Gaussian NB is best suited for **continuous numerical data**, Multinomial NB is ideal for **word-frequency and count-based text data**, and Bernoulli NB is appropriate for **binary feature representations**, with the choice of model depending entirely on the nature of the dataset and the problem being solved.
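
The pairing of each variant with its feature type can be sketched with scikit-learn's three estimators. The tiny arrays below are hypothetical toy data chosen only to show which kind of input each class expects.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.2, 3.4], [1.0, 3.1], [4.8, 0.9], [5.1, 1.2]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Word presence/absence -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

gnb = GaussianNB().fit(X_continuous, y)
mnb = MultinomialNB().fit(X_counts, y)
bnb = BernoulliNB().fit(X_binary, y)

pred_gaussian = gnb.predict([[1.1, 3.0]])[0]
pred_multinomial = mnb.predict([[2, 0, 1]])[0]
pred_bernoulli = bnb.predict([[1, 0, 0]])[0]

print("GaussianNB:   ", pred_gaussian)
print("MultinomialNB:", pred_multinomial)
print("BernoulliNB:  ", pred_bernoulli)
```

Each query point resembles the class-0 training rows, so all three models assign it to class 0. The same count matrix feeds both the multinomial and Bernoulli models, but the latter sees only whether each word occurs, not how often.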


# 10. Breast Cancer Dataset: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.
- In this experiment, a **Gaussian Naïve Bayes classifier** is trained on the **Breast Cancer dataset** to predict whether a tumor is **malignant or benign** based on medical features such as mean radius, texture, perimeter, and smoothness. Gaussian Naïve Bayes is a reasonable fit for this dataset because all input features are **continuous numerical values**, which the model treats as following a **normal (Gaussian) distribution** within each class. The model applies **Bayes’ Theorem** to compute the probability of each class for a given input and assigns the class with the highest probability as the prediction. The dataset is first divided into **training and testing sets**; the model learns from the training data and is evaluated on the unseen test data. The **accuracy score**, the fraction of test samples classified correctly, is used as the evaluation metric. A high test accuracy (about 0.97 in the run shown here) indicates that Gaussian Naïve Bayes is a **fast and effective baseline for this dataset**, even though the normality and independence assumptions hold only approximately.


```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create Gaussian Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print("Accuracy of Gaussian Naive Bayes Classifier:", accuracy)
```

Output:

```text
Accuracy of Gaussian Naive Bayes Classifier: 0.9736842105263158
```
