1. What is a Decision Tree, and how does it work?
- A Decision Tree is a flowchart-like model used for decision-making and predicting outcomes. It splits data into branches based on feature values, ending in leaves that represent results (like a class or value).

- How it works:

1. Starts at the root (top).
2. Chooses the best feature to split data (using Gini, Entropy, etc.).
3. Splits data into branches.
4. Repeats this for each branch until a stopping condition is met (like pure nodes or max depth).

- It's used in classification and regression tasks.


2. What are impurity measures in Decision Trees?
- Impurity measures in Decision Trees show how mixed the data is at a node.

- Common types:

1. Gini Impurity - Measures how often a randomly chosen element would be incorrectly labeled.
2. Entropy - Measures the level of disorder or uncertainty.
3. Classification Error - Measures the fraction of misclassified examples.

- Lower impurity in the child nodes means a better split.
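
- The three measures above can be compared on a small example. This is a minimal sketch using only the standard library; the helper names and the two toy nodes are made up for illustration.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the proportion of each class among the labels at a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    return sum(-p * log2(p) for p in class_proportions(labels))


def classification_error(labels):
    """Fraction of samples that do not belong to the majority class."""
    return 1.0 - max(class_proportions(labels))


# Hypothetical nodes: one pure, one evenly mixed between two classes.
pure_node = ["A"] * 10
mixed_node = ["A"] * 5 + ["B"] * 5

for name, node in [("pure", pure_node), ("mixed", mixed_node)]:
    print(name, gini(node), entropy(node), classification_error(node))
```

- All three measures are zero for the pure node and reach their two-class maximum for the evenly mixed node.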


3. What is the mathematical formula for Gini Impurity?
- The mathematical formula for Gini Impurity is:
- $\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$
- Where:
- $p_i$ = proportion of class $i$ in the node
- $n$ = number of classes
- It measures how often a randomly chosen element would be misclassified if it were labeled at random according to the class distribution in the node.
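- Worked example, assuming a hypothetical node with 6 samples of class A and 4 of class B, so that the class proportions are 0.6 and 0.4:

$$\text{Gini} = 1 - \left(0.6^2 + 0.4^2\right) = 1 - (0.36 + 0.16) = 0.48$$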


4. What is the mathematical formula for Entropy?
- The mathematical formula for Entropy is:
- $\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2(p_i)$
- Where:
- $p_i$ = proportion of class $i$ in the node
- $n$ = number of classes
- It measures the amount of uncertainty or disorder in the data.
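- Worked example, assuming a hypothetical node with 6 samples of class A and 4 of class B, so that the class proportions are 0.6 and 0.4:

$$\text{Entropy} = -\left(0.6 \log_2 0.6 + 0.4 \log_2 0.4\right) \approx 0.442 + 0.529 = 0.971$$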


5. What is Information Gain, and how is it used in Decision Trees?
- Information Gain measures the reduction in entropy after a dataset is split on a feature: the entropy of the parent node minus the weighted average entropy of the child nodes.
- Use in Decision Trees: it helps select the best feature to split on, namely the one with the highest Information Gain.
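
- A minimal sketch of the calculation, using only the standard library. The function names and the toy "yes"/"no" labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of the class labels at a node."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical parent node with 5 "yes" and 5 "no" samples.
parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]

# A split whose children are just as mixed as the parent.
useless_split = [
    ["yes", "yes", "no", "no"],
    ["yes", "yes", "yes", "no", "no", "no"],
]

print("perfect split gain:", information_gain(parent, perfect_split))
print("useless split gain:", information_gain(parent, useless_split))
```

- The perfect split recovers the full 1 bit of parent entropy, while the useless split gains nothing, so a tree would prefer the first.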

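6. What is the difference between Gini Impurity and Entropy?
- Gini is slightly cheaper to compute because it avoids the logarithm; Entropy uses a logarithm and has an information-theoretic interpretation (uncertainty measured in bits).
- For a two-class node, Gini ranges from 0 to 0.5 and Entropy from 0 to 1.
- Both help pick the best feature to split a decision tree, and in practice they usually lead to similar trees.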

7. What is the mathematical explanation behind Decision Trees?
- Decision Tree math in short:

1. Calculate impurity (Gini or Entropy) at the root.
2. Try all splits on each feature.
3. Compute gain = impurity before split - weighted average impurity of the child nodes after split.
4. Pick the split with highest gain.
5. Repeat this process for each branch.

- Goal: keep splitting to reduce impurity and build the best tree.
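
- The steps above can be sketched for a single numeric feature as follows. This is an illustrative sketch, not how any particular library implements it; the function names and the toy petal-length values are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of the class labels at a node."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, gain) for the best split of the form value <= threshold."""
    parent_impurity = gini(labels)
    n = len(labels)
    best_threshold, best_gain = None, 0.0
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Impurity after the split is the size-weighted average of the children.
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        gain = parent_impurity - weighted
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Hypothetical toy data: one feature, two classes.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

threshold, gain = best_split(petal_length, species)
print(f"best threshold: {threshold}, gain: {gain}")
```

- A full tree builder would run this search over every feature, keep the best one, and then recurse into the left and right subsets.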


8. What is Pre-Pruning in Decision Trees?
- Pre-Pruning is stopping the tree growth early during training to prevent overfitting.

- It sets conditions like:

1. Maximum tree depth
2. Minimum samples per node
3. Minimum impurity decrease

- If these conditions are met, splitting stops, making the tree simpler and more general.


9. What is Post-Pruning in Decision Trees?
- Post-Pruning means growing a full tree first, then **cutting back** (removing) branches that don't improve performance.

- It helps reduce overfitting by simplifying the tree after it's fully built.


10. What is the difference between Pre-Pruning and Post-Pruning?
- The difference:

- Pre-Pruning stops the tree from growing too much during training by setting limits like maximum depth or minimum samples per node. It prevents complexity early but might stop too soon and underfit.

- Post-Pruning lets the tree grow fully first, then cuts back branches that don't improve performance. It usually gives better accuracy but takes more time since pruning happens after training.

- So, pre-pruning controls tree size before it grows, while post-pruning simplifies the tree after it's fully built.


11. What is a Decision Tree Regressor?
- A Decision Tree Regressor is a decision tree used for predicting continuous values (like prices or temperatures) instead of categories.

- It works by splitting the data into branches based on feature values, and at each leaf, it predicts the average value of the target in that subset.
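
- The idea of predicting the leaf average can be sketched with a one-split tree (a "stump"). This is a simplified illustration that assumes a single numeric feature and picks the threshold minimizing the total squared error; the function names and the toy room/price data are made up.

```python
from statistics import mean


def sse(targets):
    """Sum of squared errors of the targets around their mean."""
    m = mean(targets)
    return sum((t - m) ** 2 for t in targets)


def fit_stump(values, targets):
    """Find the threshold with the lowest squared error; store the mean of each side."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct feature values to split")
    best = None
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [t for x, t in zip(values, targets) if x <= threshold]
        right = [t for x, t in zip(values, targets) if x > threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict(stump, x):
    """Predict the average target of the leaf that x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Hypothetical data: number of rooms vs. price.
rooms = [2, 3, 4, 7, 8, 9]
price = [100.0, 110.0, 120.0, 300.0, 310.0, 320.0]

stump = fit_stump(rooms, price)
print("threshold:", stump[0])
print("prediction for 3 rooms:", predict(stump, 3))
print("prediction for 8 rooms:", predict(stump, 8))
```

- Each prediction is simply the mean price of the training samples on the same side of the threshold.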


12. What are the advantages and disadvantages of Decision Trees?
- Advantages of Decision Trees:

1. Easy to understand and interpret
2. Handles both numerical and categorical data
3. Requires little data preprocessing
4. Can model nonlinear relationships
5. Useful for both classification and regression

- Disadvantages of Decision Trees:

1. Prone to overfitting, especially with deep trees
2. Can be unstable (small changes in data can change the tree)
3. May be biased towards features with more levels
4. Less accurate compared to some other models like Random Forests or Boosting methods


13. How does a Decision Tree handle missing values?
- A Decision Tree can handle missing values in a few ways:

1. Ignore missing values during splitting by using only available data.
2. Use surrogate splits: find another feature that closely mimics the primary split to decide where to send missing data.
3. Assign missing values to the most common branch or distribute them proportionally based on training data.

- These methods help the tree still make good decisions despite missing data.
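
- As an illustration of the third strategy, the sketch below sends samples with a missing value to the branch that received the most known samples. It is a simplified example, not the behavior of any specific library; the function name, the tie-breaking rule, and the toy ages are assumptions.

```python
def split_with_missing(values, threshold):
    """Split sample indices on value <= threshold; None marks a missing value.

    Missing values are routed to whichever branch holds more known samples
    (ties go left, an arbitrary choice for this sketch).
    """
    left = [i for i, v in enumerate(values) if v is not None and v <= threshold]
    right = [i for i, v in enumerate(values) if v is not None and v > threshold]
    missing = [i for i, v in enumerate(values) if v is None]
    if len(left) >= len(right):
        left = left + missing
    else:
        right = right + missing
    return sorted(left), sorted(right)


# Hypothetical feature column with two missing entries.
ages = [25, None, 40, 52, None, 61]

left, right = split_with_missing(ages, threshold=30)
print("left branch indices:", left)
print("right branch indices:", right)
```

- Here three known ages fall to the right of the threshold and only one to the left, so both samples with a missing age follow the right branch.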


14. How does a Decision Tree handle categorical features?
- A Decision Tree handles categorical features by splitting the data based on the categories:

1. It can split by checking if the feature equals a specific category (e.g., “Color = Red”).
2. Or it can group categories into subsets (e.g., “Color in {Red, Blue} vs. others”).
3. The tree finds the best way to split categories to reduce impurity.
- In principle no special encoding is needed, because the algorithm works naturally with categorical data. Some implementations, scikit-learn's tree estimators among them, still require categorical features to be encoded as numbers first.
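
- A minimal sketch of the first kind of split, testing each "feature = category" candidate and keeping the one with the lowest weighted Gini impurity. The function names and the toy color data are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of the class labels at a node."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def best_category_split(categories, labels):
    """Return (category, weighted_impurity) for the best 'feature == category' split."""
    n = len(labels)
    best_category, best_impurity = None, float("inf")
    for category in sorted(set(categories)):
        match = [y for c, y in zip(categories, labels) if c == category]
        rest = [y for c, y in zip(categories, labels) if c != category]
        if not rest:
            # Only one category present, so there is nothing to split.
            continue
        weighted = len(match) / n * gini(match) + len(rest) / n * gini(rest)
        if weighted < best_impurity:
            best_category, best_impurity = category, weighted
    return best_category, best_impurity


# Hypothetical data: product color vs. whether the customer bought it.
colors = ["Red", "Red", "Blue", "Green", "Blue", "Green"]
bought = ["yes", "yes", "no", "no", "no", "no"]

category, impurity = best_category_split(colors, bought)
print(f"best split: Color = {category} (weighted Gini {impurity})")
```

- In this toy data "Color = Red" separates the two classes completely, so it is chosen over the Blue and Green candidates.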


15. What are some real-world applications of Decision Trees?
- Some real-world applications of Decision Trees include:

1. Customer churn prediction: Identifying which customers might leave a service.
2. Medical diagnosis: Classifying diseases based on symptoms.
3. Credit scoring: Assessing loan risk based on applicant data.
4. Fraud detection: Spotting suspicious transactions.
5. Marketing: Targeting customers for campaigns based on behavior.
6. Manufacturing: Predicting machine failures or quality issues.
- They're popular because they're easy to interpret and implement.


In [1]:
#16. Write a Python program to train a Decision Tree Classifier on the Iris dataset and print the model accuracy.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.2f}")


Model Accuracy: 1.00


In [None]:
#17. Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Initialize Decision Tree with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X, y)

# Print feature importances
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")



In [None]:
#18. Write a Python program to train a Decision Tree Classifier using Entropy as the splitting criterion and print the model accuracy.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Decision Tree with Entropy criterion
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.2f}")



In [3]:
#19. Write a Python program to train a Decision Tree Regressor on a housing dataset and evaluate using Mean Squared Error (MSE).
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)

# Train the model
regressor.fit(X_train, y_train)

# Predict on test set
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")



Mean Squared Error: 0.4952


In [None]:
#20. Write a Python program to train a Decision Tree Classifier and visualize the tree using graphviz.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X, y)

# Export the tree to DOT format
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
    rounded=True,
    special_characters=True
)

# Visualize the tree using graphviz
graph = graphviz.Source(dot_data)
graph.render("decision_tree_iris")  # Saves the tree as PDF file
graph.view()  # Opens the visualization


In [4]:
#21. Write a Python program to train a Decision Tree Classifier with a maximum depth of 3 and compare its accuracy with a fully grown tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree with max depth = 3
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Fully grown Decision Tree (no max_depth limit)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy with max depth 3: {accuracy_limited:.2f}")
print(f"Accuracy with fully grown tree: {accuracy_full:.2f}")


Accuracy with max depth 3: 1.00
Accuracy with fully grown tree: 1.00


In [8]:
#22. Write a Python program to train a Decision Tree Classifier using min_samples_split=5 and compare its accuracy with a default tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree with min_samples_split=5
clf_min_split = DecisionTreeClassifier(min_samples_split=5, random_state=42)
clf_min_split.fit(X_train, y_train)
y_pred_min_split = clf_min_split.predict(X_test)
accuracy_min_split = accuracy_score(y_test, y_pred_min_split)

# Default Decision Tree
clf_default = DecisionTreeClassifier(random_state=42)
clf_default.fit(X_train, y_train)
y_pred_default = clf_default.predict(X_test)
accuracy_default = accuracy_score(y_test, y_pred_default)

print(f"Accuracy with min_samples_split=5: {accuracy_min_split:.2f}")
print(f"Accuracy with default parameters: {accuracy_default:.2f}")


In [None]:
#23. Write a Python program to apply feature scaling before training a Decision Tree Classifier and compare its accuracy with unscaled data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without scaling
clf_unscaled = DecisionTreeClassifier(random_state=42)
clf_unscaled.fit(X_train, y_train)
y_pred_unscaled = clf_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)

# With feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf_scaled = DecisionTreeClassifier(random_state=42)
clf_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = clf_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy without scaling: {accuracy_unscaled:.2f}")
print(f"Accuracy with scaling: {accuracy_scaled:.2f}")



In [9]:
#24. Write a Python program to train a Decision Tree Classifier using One-vs-Rest (OvR) strategy for multiclass classification.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Decision Tree classifier
dt = DecisionTreeClassifier(random_state=42)

# Wrap it with OneVsRestClassifier
ovr = OneVsRestClassifier(dt)

# Train the OvR model
ovr.fit(X_train, y_train)

# Predict on test data
y_pred = ovr.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy (OvR): {accuracy:.2f}")


Model Accuracy (OvR): 1.00


In [None]:
#25. Write a Python program to train a Decision Tree Classifier and display the feature importance scores.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X, y)

# Display feature importances
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


In [10]:
#26. Write a Python program to train a Decision Tree Regressor with max_depth=5 and compare its performance with an unrestricted tree.
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree Regressor with max_depth=5
regressor_limited = DecisionTreeRegressor(max_depth=5, random_state=42)
regressor_limited.fit(X_train, y_train)
y_pred_limited = regressor_limited.predict(X_test)
mse_limited = mean_squared_error(y_test, y_pred_limited)

# Unrestricted Decision Tree Regressor
regressor_full = DecisionTreeRegressor(random_state=42)
regressor_full.fit(X_train, y_train)
y_pred_full = regressor_full.predict(X_test)
mse_full = mean_squared_error(y_test, y_pred_full)

print(f"MSE with max_depth=5: {mse_limited:.4f}")
print(f"MSE with unrestricted tree: {mse_full:.4f}")


MSE with max_depth=5: 0.5245
MSE with unrestricted tree: 0.4952


In [None]:
#27. Write a Python program to train a Decision Tree Classifier, apply Cost Complexity Pruning (CCP), and visualize its effect on accuracy.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train initial tree to get effective alphas for pruning
clf = DecisionTreeClassifier(random_state=42)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

train_scores = []
test_scores = []

# Train trees with different ccp_alpha values and record accuracy
for ccp_alpha in ccp_alphas:
    clf_pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    clf_pruned.fit(X_train, y_train)
    train_pred = clf_pruned.predict(X_train)
    test_pred = clf_pruned.predict(X_test)
    train_scores.append(accuracy_score(y_train, train_pred))
    test_scores.append(accuracy_score(y_test, test_pred))

# Plot accuracy vs ccp_alpha
plt.figure(figsize=(8, 5))
plt.plot(ccp_alphas, train_scores, marker='o', label='Train Accuracy')
plt.plot(ccp_alphas, test_scores, marker='o', label='Test Accuracy')
plt.xlabel('ccp_alpha (Pruning parameter)')
plt.ylabel('Accuracy')
plt.title('Effect of Cost Complexity Pruning on Accuracy')
plt.legend()
plt.grid(True)
plt.show()




In [11]:
#28. Write a Python program to train a Decision Tree Classifier and evaluate its performance using Precision, Recall, and F1-Score.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)

# Evaluate with precision, recall, and f1-score
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(report)



              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



In [None]:
#29. Write a Python program to train a Decision Tree Classifier and visualize the confusion matrix using seaborn.
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix using seaborn
plt.figure(figsize=(7,5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()


In [12]:
#30. Write a Python program to train a Decision Tree Classifier and use GridSearchCV to find the optimal values for max_depth and min_samples_split.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Define parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 4, 6, 8]
}

# Setup GridSearchCV
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters and best score
print("Best parameters:", grid_search.best_params_)
print(f"Best cross-validation accuracy: {grid_search.best_score_:.2f}")

# Evaluate on test data using the best estimator
best_clf = grid_search.best_estimator_
test_accuracy = best_clf.score(X_test, y_test)
print(f"Test set accuracy: {test_accuracy:.2f}")



Best parameters: {'max_depth': 4, 'min_samples_split': 2}
Best cross-validation accuracy: 0.94
Test set accuracy: 1.00
