## Theoretical

1. What is a Decision Tree, and how does it work?

    * A Decision Tree is a supervised machine learning algorithm used for classification and regression. It is a tree-like structure in which internal nodes test a feature, branches represent the outcomes of that test, and leaf nodes hold the predicted outcome. The tree is built by recursively splitting the dataset on the feature (and threshold) that gives the best split according to a chosen criterion, such as Gini impurity or entropy.
    

2. What are impurity measures in Decision Trees?
     * Impurity measures quantify the disorder in a dataset. The lower the impurity, the more homogeneous the data. Common impurity measures include:  
        i. Gini Impurity: Measures how often a randomly chosen element would be incorrectly classified.   
        ii. Entropy: Measures the uncertainty or randomness in the dataset.

3. What is the mathematical formula for Gini Impurity?

The Gini Impurity for a node is given by:

\[
Gini = 1 - \sum_{i=1}^{C} p_i^2
\]

where:  
- \( C \) is the total number of classes,  
- \( p_i \) is the proportion of samples in the node belonging to class \( i \).  
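
As a small illustration, the Gini formula can be evaluated directly from a list of class labels. The helper name `gini_impurity` and the toy labels are assumptions made for this sketch; they are not part of any library.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) for a sequence of class labels (illustrative helper)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure = gini_impurity(["a", "a", "a", "a"])      # one class only
balanced = gini_impurity(["a", "a", "b", "b"])  # two classes, 50/50
print(pure, balanced)
```

A node containing a single class has impurity 0, while an evenly mixed two-class node reaches the two-class maximum of 0.5.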

4. What is the mathematical formula for Entropy?

Entropy measures the randomness or impurity in a dataset. It is calculated using the formula:

\[
Entropy = -\sum_{i=1}^{C} p_i \log_2 p_i
\]

where:  
- \( C \) is the total number of classes,  
- \( p_i \) is the probability of an instance belonging to class \( i \).  

A lower entropy value indicates a purer node, while a higher entropy value suggests greater disorder.
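
The entropy formula can be checked numerically with a few lines of Python. This is a minimal sketch; the helper name `entropy` and the toy label lists are assumptions chosen for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Return -sum(p_i * log2(p_i)) for a sequence of class labels (illustrative helper)."""
    n = len(labels)
    # Classes with zero count never appear in the Counter, so log2(0) is never evaluated.
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


pure = entropy(["a", "a", "a", "a"])      # one class only
balanced = entropy(["a", "a", "b", "b"])  # two classes, 50/50
print(pure, balanced)
```

With two classes the entropy ranges from 0 (pure node) to 1 bit (evenly mixed node).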

5. What is Information Gain, and how is it used in Decision Trees?

Information Gain (IG) measures how much a split reduces impurity in a Decision Tree. It is calculated as:

\[
IG = Entropy_{parent} - \sum_{j=1}^{k} \frac{N_j}{N} Entropy_{child_j}
\]

where:  
- \( Entropy_{parent} \) is the entropy before splitting,  
- \( Entropy_{child_j} \) is the entropy of child node \( j \),  
- \( N_j \) is the number of samples in child node \( j \),  
- \( N \) is the total number of samples in the parent node,  
- \( k \) is the number of child nodes.  

A higher Information Gain means a feature provides a better split, leading to a more effective Decision Tree. The feature with the highest Information Gain is chosen at each step during tree construction.
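
To make the Information Gain formula concrete, the sketch below compares two candidate splits of the same parent node. The helper names `entropy` and `information_gain` and the toy labels are assumptions for this example only.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a sequence of class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted_child_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_child_entropy


parent = ["a"] * 4 + ["b"] * 4

# Split 1 separates the classes completely.
perfect = information_gain(parent, [["a"] * 4, ["b"] * 4])

# Split 2 leaves each child with a 3:1 class mix.
partial = information_gain(parent, [["a", "a", "a", "b"], ["a", "b", "b", "b"]])

print(f"perfect split: {perfect:.4f}, partial split: {partial:.4f}")
```

The split that separates the classes completely recovers the full parent entropy, so it would be preferred over the partial split.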

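6. What is the difference between Gini Impurity and Entropy?

    * Gini Impurity is slightly cheaper to compute because it needs no logarithm. Its maximum for two classes is 0.5.
    * Entropy involves logarithms, making it slightly more expensive to compute. Its maximum for two classes is 1.
    * In practice the two criteria usually select very similar splits, so the choice rarely changes accuracy much.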

7. What is the mathematical explanation behind Decision Trees?
    * Select the best feature to split using an impurity measure (e.g., Gini or Entropy).
    * Split the dataset at the chosen feature value.
    * Recursively apply the process to child nodes until a stopping criterion is met (e.g., minimum samples per leaf).
    * Assign the most frequent class (classification) or mean value (regression) to leaf nodes.
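
The first two steps above (choosing and applying the best split) can be sketched for a single numeric feature. This is a simplified illustration, not how any particular library implements it: the function names `gini` and `best_threshold` and the toy data are assumptions, and candidate thresholds are taken as midpoints between consecutive sorted values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature.

    The threshold is None when no split improves on the parent impurity.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # identical values cannot be separated by a threshold
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = ((low + high) / 2, weighted)
    return best


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_threshold(values, labels)
print(threshold, impurity)
```

A full tree builder would run this search over every feature, split on the winner, and recurse on the two resulting subsets until a stopping criterion is met.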

8. What is Pre-Pruning in Decision Trees?
    * Pre-pruning stops the tree from growing beyond a certain depth or condition (e.g., minimum samples per split). This helps prevent overfitting by limiting the tree's complexity.
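
A quick way to see the effect of pre-pruning is to compare the size of a constrained tree with an unconstrained one. The sketch below assumes scikit-learn and the Iris dataset; the specific values `max_depth=2` and `min_samples_split=10` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No growth constraints: the tree keeps splitting until its leaves are pure.
unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned: growth stops at depth 2, and nodes with fewer than 10 samples are not split.
pre_pruned = DecisionTreeClassifier(
    max_depth=2, min_samples_split=10, random_state=0
).fit(X, y)

print("unrestricted:", unrestricted.get_depth(), "levels,", unrestricted.get_n_leaves(), "leaves")
print("pre-pruned:  ", pre_pruned.get_depth(), "levels,", pre_pruned.get_n_leaves(), "leaves")
```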

9. What is the process of Post-Pruning in Decision Trees?

Post-pruning is a technique used to reduce overfitting in Decision Trees by trimming unnecessary branches after the tree has been fully grown. The goal is to improve generalization by simplifying the model.

1. Grow the Full Tree 
   - The Decision Tree is grown without depth limits until its leaves are pure or cannot be split further, so it fits the training data closely, including its noise.

2. Evaluate Performance on Validation Data
   - A separate validation set (or cross-validation) is used to assess the performance of the tree.

3. Prune the Least Important Nodes
   - Nodes or branches that do not significantly improve the model's performance are removed.  
   - This is done by comparing the accuracy before and after pruning.  
   - Two common pruning strategies:
     - Cost Complexity Pruning (CCP): Introduces a penalty for tree complexity.
     - Reduced Error Pruning: Directly removes nodes and checks if accuracy improves.

4. Stop When Performance Decreases  
   - Pruning continues until removing further branches reduces validation accuracy.   

By eliminating redundant splits, post-pruning leads to a more interpretable and efficient Decision Tree that generalizes better to unseen data.
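
The four steps can be sketched with scikit-learn's cost complexity pruning. This is one possible way to do it, under stated assumptions: a single held-out validation split (30% of Iris) is used instead of cross-validation, and ties in validation accuracy are resolved in favor of the larger alpha, i.e. the smaller tree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 1: grow the full tree and compute the candidate pruning strengths.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Steps 2-4: refit with each alpha and keep the best validation accuracy.
best_alpha, best_score = 0.0, -1.0
for alpha in alphas:  # alphas are sorted in increasing order
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:  # ">=" lets a larger alpha win ties
        best_alpha, best_score = alpha, score

pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(f"best ccp_alpha: {best_alpha:.4f}, validation accuracy: {best_score:.2f}")
print(f"leaves: {full_tree.get_n_leaves()} (full) vs {pruned_tree.get_n_leaves()} (pruned)")
```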


10. What is the difference between Pre-Pruning and Post-Pruning?

    * Pre-Pruning: Stops the tree from growing too deep based on predefined constraints.
    * Post-Pruning: Allows the tree to grow fully and then removes branches that do not contribute significantly to accuracy.

11. What is a Decision Tree Regressor?

    * A Decision Tree Regressor is a type of Decision Tree used for regression tasks. Instead of predicting categorical classes, it predicts continuous values by splitting the dataset based on minimizing variance at each node.
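
The variance-based splitting can be illustrated with a few numbers. In this sketch the helper name `variance_reduction` and the toy target values are assumptions; the function computes the parent variance minus the size-weighted variance of the two children, and each leaf predicts the mean of its samples.

```python
import numpy as np


def variance_reduction(y_parent, y_left, y_right):
    """Parent variance minus the size-weighted variance of the two child nodes."""
    n = len(y_parent)
    weighted_child_var = (
        len(y_left) * np.var(y_left) + len(y_right) * np.var(y_right)
    ) / n
    return np.var(y_parent) - weighted_child_var


# Toy targets, already ordered by the feature being split on.
y = np.array([1.0, 1.2, 0.9, 5.0, 5.1, 4.8])

good = variance_reduction(y, y[:3], y[3:])  # split between the two groups
poor = variance_reduction(y, y[:1], y[1:])  # split after the first sample only

# Each leaf of the better split predicts the mean of its targets.
left_prediction, right_prediction = y[:3].mean(), y[3:].mean()

print(f"good split: {good:.3f}, poor split: {poor:.3f}")
print(f"leaf predictions: {left_prediction:.2f}, {right_prediction:.2f}")
```

The split that separates the low targets from the high ones removes far more variance, so the regressor would choose it.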

12. What are the advantages and disadvantages of Decision Trees?
    * Advantages:
    
        * Easy to interpret and visualize.
        * Handles both categorical and numerical data.
        * No need for feature scaling.
        * Works well with small to medium-sized datasets.

    * Disadvantages:

        * Prone to overfitting.
        * Unstable (small changes in data can lead to different trees).
        * Greedy splitting can lead to suboptimal trees.

14. How does a Decision Tree handle categorical features?

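    * Categorical features are usually converted into numerical values using one-hot encoding or ordinal (label) encoding. One-hot encoding avoids implying an order between categories; ordinal encoding makes the tree treat the codes as ordered numbers.
    * Some Decision Tree implementations can also split directly on categorical values by grouping categories together during the split. scikit-learn's tree estimators do not; they require numerical input, so categorical columns have to be encoded first.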
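
As an example of the encoding route, the sketch below one-hot encodes a categorical column with pandas before fitting a scikit-learn tree. The DataFrame, its column names (`color`, `size_cm`, `label`), and its values are hypothetical data made up for this illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with one categorical and one numerical feature.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "blue"],
    "size_cm": [1.0, 2.5, 3.0, 2.0, 1.5, 3.5],
    "label": [0, 1, 1, 1, 0, 1],
})

# One-hot encode the categorical column; the numerical column is left unchanged.
X = pd.get_dummies(df[["color", "size_cm"]], columns=["color"])
y = df["label"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(sorted(X.columns))
print("training accuracy:", clf.score(X, y))
```

On real data the same encoding has to be applied to the test set, for example by fitting scikit-learn's `OneHotEncoder` on the training data and reusing it.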

15. What are some real-world applications of Decision Trees?

    * Healthcare: Diagnosing diseases based on symptoms.
    * Finance: Credit risk assessment.
    * E-commerce: Customer segmentation and recommendation systems.
    * Manufacturing: Quality control and defect detection.
    * Marketing: Predicting customer churn.

## Practical 

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import graphviz
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.datasets import fetch_california_housing

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the Iris dataset and split it into training (80%) and testing (20%) sets
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
# 16. Write a Python program to train a Decision Tree Classifier on the Iris dataset and print the model accuracy

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")


In [None]:
# 17. Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances

feature_names = iris.feature_names

clf = DecisionTreeClassifier(criterion="gini", random_state=1)
clf.fit(X_train, y_train)

print("Feature Importances:")
for feature, importance in zip(feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


In [None]:
# 18. Write a Python program to train a Decision Tree Classifier using Entropy as the splitting criterion and print the model accuracy

clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

In [None]:
# 19. Write a Python program to train a Decision Tree Regressor on a housing dataset and evaluate using Mean Squared Error (MSE)

housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train the Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Make predictions
y_pred = regressor.predict(X_test)

# Compute and print Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

In [None]:
# 20. Write a Python program to train a Decision Tree Classifier and visualize the tree using graphviz

from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=1)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')

feature_names = iris.feature_names
class_names = iris.target_names
dot_data = export_graphviz(
    clf, out_file=None, feature_names=feature_names, class_names=class_names,
    filled=True, rounded=True, special_characters=True
)

graph = graphviz.Source(dot_data)
graph.render("decision_tree") 

graph.view()  

In [None]:
# 21. Write a Python program to train a Decision Tree Classifier with a maximum depth of 3 and compare its accuracy with a fully grown tree

clf_limited = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=1)
clf_limited.fit(X_train, y_train)

y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)
print(f'Model Accuracy (max_depth=3): {accuracy_limited:.2f}')

clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)

y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)
print(f'Model Accuracy (fully grown): {accuracy_full:.2f}')

In [None]:
# 22. Write a Python program to train a Decision Tree Classifier using min_samples_split=5 and compare its accuracy with a default tree

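clf_min_samples = DecisionTreeClassifier(criterion='gini', min_samples_split=5, random_state=42)
clf_min_samples.fit(X_train, y_train)

y_pred_min_samples = clf_min_samples.predict(X_test)
accuracy_min_samples = accuracy_score(y_test, y_pred_min_samples)
print(f'Model Accuracy (min_samples_split=5): {accuracy_min_samples:.2f}')

# Default tree (min_samples_split=2) for comparison
clf_default = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_default.fit(X_train, y_train)
accuracy_default = accuracy_score(y_test, clf_default.predict(X_test))
print(f'Model Accuracy (default tree): {accuracy_default:.2f}')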

In [None]:
# 23. Write a Python program to apply feature scaling before training a Decision Tree Classifier and compare its accuracy with unscaled data


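from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf_scaled = DecisionTreeClassifier(criterion='gini', random_state=1)
clf_scaled.fit(X_train_scaled, y_train)

y_pred_scaled = clf_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f'Model Accuracy (with feature scaling): {accuracy_scaled:.2f}')

# Baseline on the unscaled data for comparison
clf_unscaled = DecisionTreeClassifier(criterion='gini', random_state=1)
clf_unscaled.fit(X_train, y_train)

accuracy_unscaled = accuracy_score(y_test, clf_unscaled.predict(X_test))
print(f'Model Accuracy (without feature scaling): {accuracy_unscaled:.2f}')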

# Visualize the depth-limited tree (clf_limited) trained in question 21
feature_names = iris.feature_names
class_names = iris.target_names
dot_data = export_graphviz(
    clf_limited, out_file=None, feature_names=feature_names, class_names=class_names,
    filled=True, rounded=True, special_characters=True
)

graph = graphviz.Source(dot_data)
graph.render("decision_tree")  

graph.view()  


In [None]:
# 24. Write a Python program to train a Decision Tree Classifier using One-vs-Rest (OvR) strategy for multiclass classification

from sklearn.multiclass import OneVsRestClassifier

ovr_classifier = OneVsRestClassifier(DecisionTreeClassifier(criterion='gini', random_state=1))
ovr_classifier.fit(X_train, y_train)

# Make predictions
y_pred = ovr_classifier.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy (One-vs-Rest Decision Tree): {accuracy:.2f}')

In [None]:
# 25. Write a Python program to train a Decision Tree Classifier and display the feature importance scores

clf = DecisionTreeClassifier(criterion='gini', random_state=1)
clf.fit(X_train, y_train)

# Get feature importance scores
feature_importances = clf.feature_importances_

# Display feature importance scores
print("Feature Importance Scores:")
for feature, importance in zip(feature_names, feature_importances):
    print(f"{feature}: {importance:.4f}")

# Plot feature importance
plt.figure(figsize=(8, 5))
plt.barh(feature_names, feature_importances, color='skyblue')
plt.xlabel("Feature Importance Score")
plt.ylabel("Features")
plt.title("Feature Importance in Decision Tree Classifier")
plt.show()

In [None]:
# 26. Write a Python program to train a Decision Tree Regressor with max_depth=5 and compare its performance with an unrestricted tree

X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

regressor_limited = DecisionTreeRegressor(max_depth=5, random_state=42)
regressor_limited.fit(X_train, y_train)

regressor_full = DecisionTreeRegressor(random_state=42)
regressor_full.fit(X_train, y_train)

y_pred_limited = regressor_limited.predict(X_test)
y_pred_full = regressor_full.predict(X_test)

mse_limited = mean_squared_error(y_test, y_pred_limited)
mse_full = mean_squared_error(y_test, y_pred_full)

# Display results
print(f'Mean Squared Error (max_depth=5): {mse_limited:.4f}')
print(f'Mean Squared Error (fully grown tree): {mse_full:.4f}')

# Plot actual vs predicted values
plt.figure(figsize=(10, 5))
plt.scatter(y_test, y_pred_limited, color='blue', label='max_depth=5', alpha=0.6)
plt.scatter(y_test, y_pred_full, color='red', label='Fully grown tree', alpha=0.4)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], linestyle="--", color="black")
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Decision Tree Regression: Actual vs Predicted")
plt.legend()
plt.show()

In [None]:
# 27. Write a Python program to train a Decision Tree Classifier, apply Cost Complexity Pruning (CCP), and visualize its effect on accuracy

iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train a fully grown Decision Tree to determine ccp_alphas
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)

# Get the Cost Complexity Pruning Path
path = clf_full.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]  # Exclude the largest alpha, which prunes the tree down to a single node

# Train Decision Trees using different ccp_alpha values
train_accuracies = []
test_accuracies = []

for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    clf.fit(X_train, y_train)

    # Compute accuracies
    train_acc = accuracy_score(y_train, clf.predict(X_train))
    test_acc = accuracy_score(y_test, clf.predict(X_test))

    train_accuracies.append(train_acc)
    test_accuracies.append(test_acc)

# Plot Accuracy vs CCP Alpha
plt.figure(figsize=(8, 5))
plt.plot(ccp_alphas, train_accuracies, marker='o', label="Train Accuracy", color='blue')
plt.plot(ccp_alphas, test_accuracies, marker='o', label="Test Accuracy", color='red')
plt.xlabel("CCP Alpha")
plt.ylabel("Accuracy")
plt.title("Effect of Cost Complexity Pruning on Accuracy")
plt.legend()
plt.grid()
plt.show()

In [None]:
# 28. Write a Python program to train a Decision Tree Classifier and evaluate its performance using Precision, Recall, and F1-Score

from sklearn.metrics import classification_report

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Decision Tree Classifier
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model using Precision, Recall, and F1-Score
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:\n", report)


In [None]:
# 29. Write a Python program to train a Decision Tree Classifier and visualize the confusion matrix using seaborn

from sklearn.metrics import confusion_matrix

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Decision Tree Classifier
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# 30. Write a Python program to train a Decision Tree Classifier and use GridSearchCV to find the optimal values for max_depth and min_samples_split.


from sklearn.model_selection import GridSearchCV

clf = DecisionTreeClassifier(random_state=42)

# Define the hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}

# Perform GridSearchCV
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get the best model and parameters
best_clf = grid_search.best_estimator_
best_params = grid_search.best_params_
print(f"Best Parameters: {best_params}")

# Evaluate on the test set
y_pred = best_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy:.2f}")
