# Decision Tree
## Assignment Questions

### 1 What is a Decision Tree, and how does it work in the context of classification?

Answer:
A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks. In classification, it splits the dataset into smaller subsets based on feature values, forming a tree-like structure with decision nodes and leaf nodes. Each internal node represents a feature test, each branch represents an outcome, and each leaf node represents a class label. The model works by recursively choosing the feature that best separates the data according to a certain criterion (e.g., Gini Impurity or Entropy) until it reaches pure leaf nodes or satisfies stopping conditions.
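
To make the node/branch/leaf structure concrete, here is a minimal sketch that fits a deliberately shallow tree on the Iris data and prints its rules as text. The `max_depth=2` setting and the `random_state=0` seed are illustrative assumptions, not part of the answer above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (max_depth=2 is an illustrative choice) keeps the printout short
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is either a feature test (internal node) or a class (leaf)
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)

# Classifying a sample means following the tests from the root down to a leaf
sample = iris.data[:1]
predicted = tree.predict(sample)[0]
print("Predicted class:", iris.target_names[predicted])
```

Reading the printed rules from top to bottom reproduces the path a sample takes through the tree.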


### 2 Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

Answer:
Gini Impurity: Measures the probability of incorrectly classifying a randomly chosen element if it were labeled according to the distribution of labels in the node. Formula:

$$\text{Gini} = 1 - \sum_{i=1}^{n} p_i^2$$

where $p_i$ is the proportion of class $i$ in the node and $n$ is the number of classes. Lower Gini means higher purity.

Entropy: Measures the amount of uncertainty or disorder in the data. Formula:

$$\text{Entropy} = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

Lower entropy means higher purity.
Impact on splits:
Both measures are used to decide the best split. The feature and threshold that lead to the greatest reduction in impurity (highest Information Gain for entropy or highest Gini reduction) are chosen for splitting.
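
As a rough illustration of how the two measures behave, the sketch below computes both for a perfectly mixed node and a pure node. The helper names `gini` and `entropy` and the toy label lists are hypothetical, chosen only for this example.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits, written as sum of p * log2(1/p)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

mixed = ["a"] * 5 + ["b"] * 5   # 50/50 node: maximally impure for two classes
pure = ["a"] * 10               # single-class node

print("mixed:", gini(mixed), entropy(mixed))  # 0.5 and 1.0
print("pure: ", gini(pure), entropy(pure))    # 0.0 and 0.0
```

Both measures are zero for a pure node and peak when the classes are evenly mixed, which is why they usually select similar splits.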


### 3 What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Answer:
Pre-Pruning: Stops the growth of the tree early once a stopping condition is met (e.g., max depth reached or too few samples to split).
Advantage: Reduces overfitting and shortens training time, because the tree is never fully grown.
Post-Pruning: Allows the tree to grow fully and then removes branches that do not contribute significantly to accuracy, typically judged on validation data or with a complexity penalty.
Advantage: Produces a simpler model while maintaining accuracy, since useful deeper splits are not cut off prematurely.
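
The two strategies can be sketched in scikit-learn as follows. Note the assumptions: scikit-learn's built-in post-pruning is minimal cost-complexity pruning (`ccp_alpha`), which is one specific form of post-pruning; choosing the alpha on a held-out validation split, the breast cancer dataset, and the particular hyperparameter values are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Pre-pruning: limit growth up front with stopping conditions
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then get candidate pruning strengths
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alphas = np.clip(path.ccp_alphas, 0.0, None)  # guard against tiny negative values

# Refit one tree per alpha and keep the one that scores best on validation data
candidates = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
    for a in alphas
]
post = max(candidates, key=lambda t: t.score(X_val, y_val))

print("full tree leaves:  ", full.get_n_leaves())
print("pre-pruned leaves: ", pre.get_n_leaves())
print("post-pruned leaves:", post.get_n_leaves())
```

Comparing leaf counts alongside validation accuracy shows how much of the full tree each strategy keeps.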


### 4 What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Answer:
Information Gain measures the reduction in entropy (uncertainty) after splitting a dataset based on a feature. It is calculated as:
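
$$IG = \text{Entropy}_{\text{parent}} - \sum_{i} \frac{|D_i|}{|D|} \times \text{Entropy}(D_i)$$

where $D$ is the data in the parent node and $D_i$ is the subset that falls into child node $i$.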
A higher Information Gain means the split produces purer child nodes. It is important because it helps select the feature that provides the most useful separation of the classes at each step of building the tree.
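
A small worked example may help here. The sketch below computes Information Gain for two candidate splits of the same eight-sample parent node; the function names and the toy labels are hypothetical and exist only for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 4 vs 4, entropy = 1 bit

perfect_split = [parent[:4], parent[4:]]       # children are pure
useless_split = [parent[::2], parent[1::2]]    # children stay 2 vs 2

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

The tree-building procedure would prefer the first split, since it removes all of the parent's uncertainty.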


### 5 What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Answer:
Applications:
Medical diagnosis (predicting diseases)
Credit scoring and loan approval
Customer segmentation in marketing
Predicting equipment failure in manufacturing
Advantages:
Easy to understand and interpret
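Handles both numerical and categorical data (some libraries, such as scikit-learn, still require categorical features to be encoded as numbers)
No need for feature scaling or normalization
Limitations:
Prone to overfitting on training data, especially when grown to full depth
Can be unstable: small changes in the data may produce a very different tree
Split selection is biased towards features with many distinct values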

In [1]:
# 6

# Import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Feature Importances
feature_importances = pd.DataFrame({
    'Feature': iris.feature_names,
    'Importance': clf.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
print(feature_importances)


Model Accuracy: 1.00

Feature Importances:
             Feature  Importance
2  petal length (cm)    0.906143
3   petal width (cm)    0.077186
1   sepal width (cm)    0.016670
0  sepal length (cm)    0.000000


In [2]:
# 7

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fully grown tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_tree.predict(X_test))

# Tree with max_depth = 3
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
limited_acc = accuracy_score(y_test, limited_tree.predict(X_test))

print(f"Fully-grown tree accuracy: {full_acc:.2f}")
print(f"Max depth=3 tree accuracy: {limited_acc:.2f}")


Fully-grown tree accuracy: 1.00
Max depth=3 tree accuracy: 1.00


In [7]:
# 8

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import pandas as pd

# Load California housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predictions & MSE
y_pred = reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Feature importances
feature_importances = pd.DataFrame({
    'Feature': housing.feature_names,
    'Importance': reg.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
print(feature_importances)


Mean Squared Error: 0.50

Feature Importances:
      Feature  Importance
0      MedInc    0.528509
5    AveOccup    0.130838
6    Latitude    0.093717
7   Longitude    0.082902
2    AveRooms    0.052975
1    HouseAge    0.051884
4  Population    0.030516
3   AveBedrms    0.028660


In [6]:
# 9

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

# Grid search
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5
)
grid.fit(X_train, y_train)

# Best parameters & accuracy
best_params = grid.best_params_
best_score = grid.best_score_

print("Best Parameters:", best_params)
print(f"Cross-validated Accuracy: {best_score:.2f}")


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Cross-validated Accuracy: 0.94
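
The cross-validated score above is computed only on the training folds, so the held-out test set is still unused. A possible follow-up, sketched here as a self-contained cell that repeats the same grid search, is to score the refitted best estimator on `X_test`. The variable name `test_acc` is an addition for this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is retrained on all of X_train
test_acc = accuracy_score(y_test, grid.best_estimator_.predict(X_test))
print(f"Held-out test accuracy: {test_acc:.2f}")
```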


### 10

Answer:

Step-by-step process:
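1) Handle the missing values:
For numerical features: Replace missing values with the mean or median.
For categorical features: Replace missing values with the mode or create a separate category like "Unknown".
Compute these fill values from the training data only, so that no information from the test set leaks into the model.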

2) Encode the categorical features:
Use One-Hot Encoding for nominal categories (e.g., gender, blood type).
Use Ordinal Encoding if the categories have an inherent order (e.g., disease severity levels). In scikit-learn, Label Encoding is intended for the target variable rather than for input features.

3) Train a Decision Tree model:
Split the dataset into training and testing sets (e.g., 80% train, 20% test).
Initialize and train a DecisionTreeClassifier on the training data.

4) Tune its hyperparameters:
Adjust parameters like max_depth, min_samples_split, and criterion (Gini or Entropy) using GridSearchCV or RandomizedSearchCV to improve performance.

5) Evaluate its performance:
Use metrics such as accuracy, precision, recall, F1-score, and confusion matrix to assess prediction quality.
Perform cross-validation to ensure model robustness.

6) Business value in a real-world setting:
Early prediction of diseases allows timely intervention, reducing treatment costs and improving patient outcomes.
Helps healthcare providers prioritize high-risk patients for preventive care.
Supports data-driven decisions in hospital resource allocation and personalized treatment plans.
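
Steps 1 to 5 above could be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the dataset is synthetic, the column names (`age`, `blood_pressure`, `blood_type`) are hypothetical, and the parameter grid and `f1` scoring are illustrative choices. Putting the imputers and encoder inside the pipeline means they are fitted on training folds only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data (hypothetical columns, not real records)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "blood_type": rng.choice(["A", "B", "AB", "O"], n),
})
y = (df["age"] + rng.normal(0, 10, n) > 55).astype(int)  # toy disease label

# Inject missing values so the imputers have something to do
df.loc[rng.choice(n, 30, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "blood_type"] = np.nan

# Steps 1-2: impute and encode, separately for numeric and categorical columns
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", categorical, ["blood_type"]),
])

# Step 3: preprocessing and the tree in one estimator
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with cross-validation
param_grid = {
    "tree__max_depth": [2, 3, 4, None],
    "tree__min_samples_split": [2, 10],
    "tree__criterion": ["gini", "entropy"],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = grid.predict(X_test)
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, y_pred))
```

In a medical setting, recall on the positive class is often the metric to watch, since a missed case is usually costlier than a false alarm.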

