# 1. What is Information Gain, and how is it used in Decision Trees?

-> Information Gain (IG) is a metric used in decision tree algorithms to select the best feature to split the data at a node. It measures how much "information" a feature provides by calculating the reduction in entropy (or uncertainty) after the dataset is split on that feature.

It is calculated as:
$\text{Information Gain} = \text{Entropy}(\text{Parent}) - \sum_{k} \frac{N_k}{N} \, \text{Entropy}(\text{Child}_k)$

Where:

· $\text{Entropy}(\text{Parent})$ is the impurity of the node before the split.
· $\sum_{k} \frac{N_k}{N} \, \text{Entropy}(\text{Child}_k)$ is the weighted average entropy of the child nodes: the entropy of each child $k$, weighted by the proportion $N_k / N$ of the parent's samples that reach that child.

In decision trees, the algorithm evaluates the candidate splits on every feature at a node and chooses the one with the highest Information Gain, that is, the split that most cleanly separates the classes of the target variable. This process is repeated recursively on each child node to build the tree.
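
The calculation above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the helper names `entropy` and `information_gain` and the toy "yes"/"no" labels are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the sample-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 5 "yes" and 5 "no" samples, split into two children.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

ig = information_gain(parent, [left, right])
print(f"Parent entropy:   {entropy(parent):.4f}")
print(f"Information gain: {ig:.4f}")
```

Here the parent is perfectly mixed (entropy of 1 bit) and each child is 80% one class, so the split removes roughly 0.28 bits of uncertainty.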

# 2. What is the difference between Gini Impurity and Entropy?
-> Gini Impurity and Entropy are both functions used to measure the impurity or disorder of a node in a decision tree. The goal of a split is to reduce this impurity.

| Feature | Gini Impurity | Entropy |
| --- | --- | --- |
| Concept | Measures the probability of a random sample being misclassified if it were randomly labeled according to the class distribution in the node. | Measures the average amount of "information" or "surprise" inherent in the node's possible outcomes. A pure node has zero entropy. |
| Calculation | $Gini = 1 - \sum_{i=1}^{C} p_i^2$ | $Entropy = - \sum_{i=1}^{C} p_i \log_2(p_i)$ |
| Range | 0 to 0.5 (for binary classification). 0 indicates a perfectly pure node. | 0 to 1 (for binary classification). 0 indicates a perfectly pure node. |
| Performance | Generally faster to compute as it doesn't involve logarithmic calculations. | Slightly slower due to the log computation, but the difference is often negligible. |
| Resulting Tree | Tends to create slightly more unbalanced trees by isolating the most frequent class in a branch. | Tends to create more balanced trees by producing splits that are more even. |
| Use Cases | The default in many libraries (like scikit-learn) and a good general-purpose choice. | Also a very common and effective choice. In practice, the choice between Gini and Entropy often leads to very similar trees and performance. |

Strengths & Weaknesses: There is no universally superior option. Gini is computationally more efficient, while Entropy might be more sensitive to changes in node probabilities. For most practical applications, the difference in the final model's performance is minimal.
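
To make the two formulas and their ranges concrete, the following sketch evaluates both measures on a few binary class distributions. The function names `gini` and `entropy` are chosen for this illustration only.

```python
import math


def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Binary node where p is the share of the positive class.
for p in (0.0, 0.1, 0.3, 0.5):
    dist = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini(dist):.4f}  entropy={entropy(dist):.4f}")
```

Both measures are zero for a pure node and peak at an even 50/50 mix, where Gini reaches 0.5 and Entropy reaches 1.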

# 3. What is Pre-Pruning in Decision Trees?
-> Pre-pruning, also known as early stopping, is a technique used to prevent overfitting in decision trees by restricting the growth of the tree during the building phase. Instead of allowing the tree to grow until all leaves are pure (or no further split is possible), pre-pruning sets constraints that halt splitting once certain conditions are met.

Common pre-pruning parameters include:

· max_depth: The maximum allowed depth of the tree.
· min_samples_split: The minimum number of samples required to split an internal node.
· min_samples_leaf: The minimum number of samples required to be at a leaf node.
· max_leaf_nodes: The maximum number of leaf nodes the tree can have.
· min_impurity_decrease: A split will only occur if it reduces the impurity by at least this value.

The main advantage of pre-pruning is that it is computationally cheaper than post-pruning (which builds the full tree first and then prunes it back), as the tree is never fully expanded. Its weakness is that it can lead to underfitting if the stopping conditions are too strict. Because the decision to stop is made greedily, a split that looks unpromising on its own may be rejected even though it would have been followed by a very informative split, which is then never discovered.
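
As a rough illustration of these parameters, the sketch below trains an unconstrained tree and a pre-pruned tree with scikit-learn and compares their size. The Wine dataset and the specific values `max_depth=3` and `min_samples_leaf=5` are arbitrary choices for this example, not recommended settings.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unconstrained tree: grows until every leaf is pure or cannot be split.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops early (illustrative constraint values).
pruned_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

for name, tree in [("unconstrained", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.4f}"
    )
```

In practice these constraints are usually tuned with cross-validation rather than fixed by hand.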

# 4. Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).

In [1]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Feature matrix
y = iris.target  # Target vector

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Decision Tree Classifier with Gini Impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model on the training data
clf.fit(X_train, y_train)

# Print the feature importances
print("Feature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"  {feature_name}: {importance:.4f}")

Feature Importances:
  sepal length (cm): 0.0000
  sepal width (cm): 0.0191
  petal length (cm): 0.8933
  petal width (cm): 0.0876


# 5. What is Support Vector Machine (SVM)?
-> A Support Vector Machine (SVM) is a powerful and versatile supervised machine learning algorithm used for both classification and regression tasks, though it is most commonly associated with classification.

The core idea of an SVM classifier is to find the optimal hyperplane in an N-dimensional space (where N is the number of features) that best separates the data points of different classes. This "optimal hyperplane" is the one with the maximum margin, where the margin is the distance between the hyperplane and the nearest data points from any class. These closest data points are called support vectors.

Key characteristics:

· Maximum Margin: Focuses on the points that are hardest to classify (the support vectors), which often leads to good generalization on unseen data.
· Kernel Trick: Can efficiently perform non-linear classification by implicitly mapping inputs into high-dimensional feature spaces using kernel functions.
· Effectiveness in High Dimensions: Works well in high-dimensional spaces, even when the number of dimensions exceeds the number of samples.
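
The idea of support vectors and the margin can be seen on a tiny example. The sketch below fits a linear `SVC` to hypothetical, linearly separable 2-D points; the data and the large value of `C` (used here to approximate a hard margin) are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data with two classes.
X = np.array(
    [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]]
)
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e3).fit(X, y)

w = clf.coef_[0]  # normal vector of the separating hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("Support vectors:")
print(clf.support_vectors_)
print(f"Margin width: {margin_width:.4f}")
```

Only the points closest to the boundary are reported as support vectors; moving any of the other points (without crossing the margin) would leave the hyperplane unchanged.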

# 6. What is the Kernel Trick in SVM?
-> The Kernel Trick is a mathematical technique that allows Support Vector Machines (SVMs) to create a non-linear decision boundary without explicitly transforming the input data into a higher-dimensional space.

In many cases, data is not linearly separable in its original feature space. The theoretical solution is to map the data to a much higher-dimensional space where a linear separation becomes possible. However, this mapping can be computationally very expensive.

The Kernel Trick solves this by using special functions called kernel functions. These functions compute the dot product of the transformed vectors in the high-dimensional space, without ever having to compute the coordinates of the data in that space. They operate directly on the original input vectors.

Common kernel functions include:

· Linear: $K(x, x') = x \cdot x'$ (for linear separation)
· Polynomial: $K(x, x') = (x \cdot x' + r)^d$
· Radial Basis Function (RBF): $K(x, x') = \exp(-\gamma \|x - x'\|^2)$ (the most popular for non-linear problems)

By using the Kernel Trick, SVMs can efficiently learn complex, non-linear models, making them highly effective for a wide range of problems.
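
The claim that a kernel computes a dot product in the transformed space can be checked numerically. The sketch below uses the polynomial kernel with $r = 0$ and $d = 2$ on 2-D inputs; the explicit feature map `phi` is one standard choice written out for this illustration.

```python
import numpy as np


def phi(v):
    """Explicit degree-2 feature map for a 2-D input (illustrative)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])


def poly_kernel(a, b):
    """Polynomial kernel with r = 0 and d = 2: (a . b)^2."""
    return np.dot(a, b) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # dot product after mapping to 3-D
implicit = poly_kernel(x, z)       # same value, computed in the original 2-D space

print(f"Explicit mapping: {explicit:.1f}")
print(f"Kernel function:  {implicit:.1f}")
```

Both lines print 121.0. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the kernel route is feasible.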

# 7. Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.

In [3]:
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features (Important for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create SVM classifiers with Linear and RBF kernels
svm_linear = SVC(kernel='linear', random_state=42)
svm_rbf = SVC(kernel='rbf', random_state=42)

# Train the models
svm_linear.fit(X_train_scaled, y_train)
svm_rbf.fit(X_train_scaled, y_train)

# Make predictions
y_pred_linear = svm_linear.predict(X_test_scaled)
y_pred_rbf = svm_rbf.predict(X_test_scaled)

# Calculate and print accuracies
accuracy_linear = accuracy_score(y_test, y_pred_linear)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

print(f"Linear SVM Accuracy: {accuracy_linear:.4f}")
print(f"RBF SVM Accuracy: {accuracy_rbf:.4f}")

Linear SVM Accuracy: 0.9815
RBF SVM Accuracy: 0.9815


# 8. What is the Naïve Bayes classifier, and why is it called "Naïve"?
-> Naïve Bayes is a family of simple probabilistic classifiers based on applying Bayes' Theorem with a strong (naïve) independence assumption between the features.

Bayes' Theorem is stated as:
$P(Y|X) = \frac{P(X|Y) \, P(Y)}{P(X)}$
where:

· $P(Y|X)$ is the posterior probability of class $Y$ given the features $X$.
· $P(X|Y)$ is the likelihood, the probability of the features given class $Y$.
· $P(Y)$ is the prior probability of class $Y$.
· $P(X)$ is the evidence, the probability of the features.

The classifier is called "Naïve" because it makes a fundamental assumption: every feature is conditionally independent of every other feature, given the class label. In other words, once the class is known, the value of one feature tells us nothing further about the value of any other feature, so the likelihood factorizes as $P(X|Y) = \prod_{i} P(x_i|Y)$.

For example, if we are classifying a fruit based on its color, shape, and diameter, a Naïve Bayes classifier would assume that the color being "red" does not influence the shape being "round" once the fruit is known to be an apple.

Despite this assumption being rarely true in real-world data, Naïve Bayes classifiers often perform surprisingly well. They are very fast, work well with high-dimensional data, and can be effective even with small training datasets.
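
A hand-worked version of the fruit example shows how the independence assumption simplifies the computation: the posterior for each class is proportional to the prior multiplied by one likelihood per feature. All probabilities below are made-up numbers for illustration, not estimates from real data.

```python
# Hypothetical priors P(Y) and per-feature likelihoods P(x_i | Y).
priors = {"apple": 0.6, "orange": 0.4}
likelihoods = {
    "apple": {"color=red": 0.7, "shape=round": 0.9},
    "orange": {"color=red": 0.1, "shape=round": 0.95},
}
observed = ["color=red", "shape=round"]

# Naive assumption: multiply the prior by each feature's likelihood.
scores = {}
for label, prior in priors.items():
    score = prior
    for feature in observed:
        score *= likelihoods[label][feature]
    scores[label] = score

# Dividing by the evidence P(X) turns the scores into posteriors.
evidence = sum(scores.values())
posteriors = {label: score / evidence for label, score in scores.items()}

for label, posterior in posteriors.items():
    print(f"P({label} | red, round) = {posterior:.4f}")
```

Because the evidence is the same for every class, it can be skipped entirely when only the most probable class is needed.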

# 9. Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes
-> The different variants of Naïve Bayes classifiers are primarily distinguished by the type of data they are designed to handle and the distribution they assume for the features.

| Variant | Data Type | Distribution Assumption | Use Case |
| --- | --- | --- | --- |
| Gaussian Naïve Bayes | Continuous, real-valued data. | Assumes that the continuous values associated with each class are distributed according to a Gaussian (Normal) distribution. | Classification with features like measurements (e.g., height, weight, pixel intensity). |
| Multinomial Naïve Bayes | Discrete data, especially counts. | Assumes features are generated from a multinomial distribution. This is the standard version for text classification. | Text classification (e.g., spam detection, sentiment analysis) where features are word counts or term frequencies. Fractional values such as TF-IDF weights also tend to work in practice. |
| Bernoulli Naïve Bayes | Binary / Boolean features. | Assumes that all features are independent binary (Bernoulli) variables. It explicitly penalizes the non-occurrence of a feature that is indicative of a class. | Text classification with binary term occurrence (e.g., 1 if a word is present in a document, 0 otherwise). Also useful for other datasets with binary features. |
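
The pairing of each variant with its data type can be sketched with scikit-learn on synthetic data. The generated features (normal draws, Poisson counts, and counts thresholded into 0/1 flags) are assumptions made for this illustration, and the scores are training accuracies only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: each class drawn around a different mean.
X_cont = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(50, 2)),
])

# Count features: each class favours a different "word".
X_counts = np.vstack([
    rng.poisson(lam=[5, 1, 1], size=(50, 3)),
    rng.poisson(lam=[1, 1, 5], size=(50, 3)),
])

# Binary features: 1 if the count exceeds an arbitrary threshold, else 0.
X_binary = (X_counts > 2).astype(int)

scores = {
    "GaussianNB (continuous)": GaussianNB().fit(X_cont, y).score(X_cont, y),
    "MultinomialNB (counts)": MultinomialNB().fit(X_counts, y).score(X_counts, y),
    "BernoulliNB (binary)": BernoulliNB().fit(X_binary, y).score(X_binary, y),
}

for name, score in scores.items():
    print(f"{name}: training accuracy = {score:.4f}")
```

Note that `MultinomialNB` requires non-negative feature values, which is why it is paired with the count matrix rather than the continuous one.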

# 10. 