**Question 1: What is a Decision Tree, and how does it work in the context of classification?**

*Answer*

A Decision Tree is a machine learning algorithm that makes decisions by splitting data into smaller groups based on feature values. It works like a flowchart with branches.

In classification:

- The tree starts at the root node (the whole dataset).

- It selects the best feature to split the data (using criteria like Gini impurity or entropy).

- The data is divided into branches based on the selected feature.

- This process continues until the tree reaches leaf nodes, where a final class label is assigned.

In simple terms, a Decision Tree classifies new data by following the path of decisions from the root to a leaf, based on feature values.
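
The root-to-leaf walk described above can be sketched in plain Python. This is only an illustration: the tree, its thresholds, and the feature names (`petal_length`, `petal_width`) are hand-written assumptions, not something learned from data.

```python
# A hand-written toy tree; thresholds and labels are illustrative assumptions.
# Internal nodes test one feature against a threshold; leaves carry a class label.
toy_tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def classify(node, sample):
    """Follow the decisions from the root down to a leaf and return its label."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(classify(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(classify(toy_tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```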


**Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

*Answer*

**Gini Impurity**

Gini Impurity measures how often a randomly chosen sample from a node would be misclassified if it were labeled at random according to the class distribution in that node.

* Formula:
  $$\text{Gini} = 1 - \sum_{i} p_i^2$$

  (where $p_i$ is the proportion of samples in the node that belong to class $i$)

* **Lower Gini = purer node**
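
As a small worked example, the Gini formula can be computed directly from a list of class labels. The helper name `gini_impurity` and the sample labels are assumptions made for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2), where p_i is the share of class i in the node."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini_impurity(["a", "a", "a", "a"]))  # pure node -> 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # evenly mixed two-class node -> 0.5
```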

**Entropy**

Entropy measures the level of uncertainty or disorder in the data.

* Formula:
  $$\text{Entropy} = -\sum_{i} p_i \log_2(p_i)$$

* **Lower entropy = purer node**
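
The same kind of sketch works for entropy. The helper name `entropy` and the labels are illustrative assumptions; the code uses the equivalent form p · log2(1/p) so that a pure node prints as `0.0`.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)), written here as sum(p_i * log2(1 / p_i))."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

print(entropy(["a", "a", "a", "a"]))  # pure node -> 0.0
print(entropy(["a", "a", "b", "b"]))  # evenly mixed two-class node -> 1.0
```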


**How They Impact Splits in a Decision Tree**

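* During training, the decision tree algorithm evaluates candidate splits (each feature combined with each candidate threshold).
* For each candidate split, it calculates the **Gini** or **Entropy** of the resulting child nodes.
* It chooses the split with the **lowest weighted impurity**, where each child's impurity is weighted by the share of samples it receives (i.e., the purest child nodes).
* This means the chosen split should separate the classes as well as possible.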

In simple terms, **both impurity measures help the tree pick the best feature and threshold to create clean, well-separated groups**.
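
A simplified version of this search for a single numeric feature might look like the sketch below. It tries midpoints between neighboring values and keeps the threshold with the lowest weighted Gini. The function names and the sample values are assumptions, and real libraries use more optimized implementations.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, weighted Gini)."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each child's impurity by the share of samples it receives.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

petal_lengths = [1.4, 1.5, 1.3, 4.7, 4.5, 5.1]  # made-up measurements
species = ["setosa", "setosa", "setosa", "other", "other", "other"]
print(best_threshold(petal_lengths, species))  # (3.0, 0.0)
```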


**Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

*Answer*

**Pre-Pruning (Early Stopping)**

Pre-pruning stops the tree from growing too deep during training.
It sets limits such as:

* maximum depth
* minimum samples required to split
* minimum samples per leaf

**Advantage:**
Faster training, useful when working with large datasets (e.g., customer data with millions of rows).
It avoids unnecessary splits and saves computational cost.
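
In scikit-learn, these limits correspond to constructor parameters of `DecisionTreeClassifier`. The sketch below uses the Iris data, and the specific values (3, 10, 5) are arbitrary assumptions rather than recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: growth stops as soon as any of these limits would be violated.
pre_pruned = DecisionTreeClassifier(
    max_depth=3,            # maximum depth
    min_samples_split=10,   # minimum samples required to split a node
    min_samples_leaf=5,     # minimum samples allowed in a leaf
    random_state=42,
)
pre_pruned.fit(X, y)

print("Depth:", pre_pruned.get_depth())
print("Leaves:", pre_pruned.get_n_leaves())
```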

---

**Post-Pruning (Pruning After Training)**

Post-pruning allows the tree to grow fully first and then trims unnecessary or weak branches.

**Advantage:**
Often produces a simpler model that generalizes better, which is especially helpful when the initial tree overfits (e.g., noisy medical diagnosis datasets).
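
One form of post-pruning available in scikit-learn is minimal cost-complexity pruning, controlled by `ccp_alpha`. The sketch below grows a full tree on Iris and then refits it at each candidate alpha. Printing test accuracy here is only for illustration; in practice the alpha would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Grow the full tree first, then ask for the candidate pruning strengths.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    print(
        f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
        f"test accuracy={pruned.score(X_test, y_test):.3f}"
    )
```

Larger alphas remove more branches, so the number of leaves shrinks as alpha grows.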

**Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

*Answer*

Information Gain is a metric used in Decision Trees to measure how much a feature improves the purity of the dataset after a split.

It is calculated as:

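**Information Gain = Entropy(before split) − Entropy(after split)**

Here, Entropy(after split) is the average entropy of the child nodes, weighted by the fraction of samples that falls into each child.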

Why It Is Important:

- Information Gain helps the tree choose the best feature to split the data.

- A split with high Information Gain means:

  - It reduces uncertainty the most
  - It creates child nodes that are more pure
  - It separates the classes better

In simple words, Information Gain tells the decision tree which feature gives the most useful separation, helping build a better and more accurate model.
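
A minimal sketch of the calculation, assuming the child entropies are weighted by the share of samples in each child. The helper names and the toy labels are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a non-empty list of labels, in bits."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted_child_entropy = sum(len(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - weighted_child_entropy

parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

The perfect split removes all uncertainty, so it gains the full 1 bit; the useless split leaves both children as mixed as the parent, so it gains nothing.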


**Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

*Answer*

**Real-World Applications of Decision Trees**

1. **Loan approval (Banking)**
   Used to decide whether a person qualifies for a loan based on income, credit score, etc.

2. **Medical diagnosis (Healthcare)**
   Helps classify whether a patient has a disease based on symptoms and test results.

3. **Fraud detection (Finance & E-commerce)**
   Identifies unusual or suspicious transactions.

4. **Customer segmentation (Marketing)**
   Groups customers based on age, spending habits, or behavior for targeted marketing.

5. **Predicting churn (Telecom)**
   Determines if a customer is likely to leave the service.

**Advantages**

1. **Easy to understand and interpret**
   Looks like a flowchart, so even non-technical people can understand the model.

2. **Works with both numerical and categorical data**
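   Needs little preprocessing in principle, although scikit-learn's tree estimators expect numeric input, so categorical features must be encoded first.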

3. **Handles non-linear relationships**
   Can capture complex decision boundaries.

4. **Requires less data preparation**
   No need for scaling or normalization.

**Limitations**

1. **Prone to overfitting**
   Trees can become too deep and memorize the data instead of learning patterns.

2. **Unstable**
   Small changes in data can create a completely different tree.

3. **Biased toward features with many categories**
   Features with more levels may dominate the splits.

4. **Lower accuracy compared to ensemble methods**
   Models like Random Forest or XGBoost often achieve higher accuracy than a single tree.






In [None]:
'''
Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV).
'''

In [3]:
'''
Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model's accuracy and feature importances

'''

# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data       # features
y = iris.target     # labels

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Classifier using Gini criterion
model = DecisionTreeClassifier(criterion="gini", random_state=42)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Print model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print feature importances
print("Feature Importances:")
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.4f}")



Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [4]:
'''
Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
a fully-grown tree.
'''

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ------------------------------
# Model 1: Decision Tree with max_depth=3
# ------------------------------
dt_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_limited.fit(X_train, y_train)
pred_limited = dt_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, pred_limited)

# ------------------------------
# Model 2: Fully-grown Decision Tree (no depth limit)
# ------------------------------
dt_full = DecisionTreeClassifier(random_state=42)
dt_full.fit(X_train, y_train)
pred_full = dt_full.predict(X_test)
accuracy_full = accuracy_score(y_test, pred_full)

# Print results
print("Accuracy (max_depth=3):", accuracy_limited)
print("Accuracy (Fully-grown tree):", accuracy_full)


Accuracy (max_depth=3): 1.0
Accuracy (Fully-grown tree): 1.0


In [6]:
'''
Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
'''
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

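# Load the California Housing dataset as a stand-in for Boston Housing:
# load_boston() was removed in scikit-learn 1.2, so it is no longer available.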
housing = fetch_california_housing()

# Convert to DataFrame
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df["target"] = housing.target

# Splitting data
X = df.drop("target", axis=1)
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predictions
y_pred = reg.predict(X_test)

# Calculate MSE
mse = mean_squared_error(y_test, y_pred)

# Print output
print("Mean Squared Error (MSE):", mse)
print("\nFeature Importances:")
for feature, importance in zip(X.columns, reg.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 0.495235205629094

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


In [8]:
'''
Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree's max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy
'''
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the model
dt = DecisionTreeClassifier(random_state=42)

# Hyperparameter grid
param_grid = {
    "max_depth": [1, 2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 10]
}

# GridSearchCV
grid = GridSearchCV(dt, param_grid, cv=5)
grid.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid.best_params_)

# Best model evaluation
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy: 1.0


**Question 10: Imagine you're working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:**

- Handle the missing values

- Encode the categorical features

- Train a Decision Tree model

- Tune its hyperparameters

- Evaluate its performance

**And describe what business value this model could provide in the real-world
setting.**

*Answer*

**1. Handling missing values:**
I would start by identifying the amount and pattern of missing data.

* For **numerical features**, I'd use **median imputation**.
* For **categorical features**, I'd replace missing values with a **'Missing'** category.
* If missingness itself is meaningful, I'd also add **missing-indicator columns**.

All imputations would be done inside a **pipeline** to avoid data leakage.

**2. Encoding categorical features:**

* For low-cardinality categories → **One-Hot Encoding**
* For high-cardinality categories → **Frequency or Target Encoding**

Encoding is also applied inside the pipeline so it fits **only on training data**.

**3. Training a Decision Tree model:**
I would build a pipeline that combines preprocessing with `DecisionTreeClassifier`, split the data using a **stratified train-test split**, apply **class weights** if the dataset is imbalanced, and then fit the pipeline on the training data.
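
Steps 1 to 3 could be wired together roughly as in the sketch below. The patient table, its column names, and every value in it are made up for illustration, and the tree settings are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up patient table; columns and values are purely illustrative.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 47, 62, 29, 58, np.nan, 45, 70, 38, 66],
    "blood_pressure": [120, 145, 130, np.nan, 160, 115, 150, 135, 128, 170, 118, 155],
    "smoker": ["no", "yes", np.nan, "no", "yes", "no", "yes", "no", np.nan, "yes", "no", "yes"],
    "has_disease": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1],
})
numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker"]

preprocess = ColumnTransformer([
    # Median imputation plus missing-indicator columns for numeric features.
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric_cols),
    # A 'Missing' category followed by one-hot encoding for categorical features.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=42)),
])

X = df.drop(columns="has_disease")
y = df["has_disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Imputers and the encoder are fitted on the training rows only.
model.fit(X_train, y_train)
print(model.predict(X_test))
```

Because preprocessing lives inside the pipeline, the same object can later be passed to a cross-validated search without leaking statistics from the validation folds.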

**4. Hyperparameter tuning:**
I would use **GridSearchCV** or **RandomizedSearchCV** to tune key parameters like:

* `max_depth`
* `min_samples_split`
* `min_samples_leaf`
* `criterion`

The scoring metric would depend on the problem; here I would favor **Recall or F1**, because false negatives are costly in healthcare.

**5. Evaluating performance:**

I would evaluate using:

* **Accuracy**, **Precision**, **Recall**, **F1-score**
* **ROC-AUC** (or **PR-AUC** for imbalance)
* A **confusion matrix** to understand false negatives and false positives

If required, I'd calibrate the model's probabilities.
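
A short sketch of how these metrics can be read off with `sklearn.metrics`. The labels below are invented for illustration and do not come from any real model.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up ground truth and predictions (1 = has disease, 0 = healthy).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=4 FP=1 FN=1 TP=4

print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 0.80
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 0.80
print(f"F1-score:  {f1_score(y_true, y_pred):.2f}")         # 0.80
```

The single false negative here is the missed patient, which is the kind of error Recall is meant to expose.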

**6. Business value in real-world healthcare:**
This model can help **early disease detection**, prioritize high-risk patients, reduce unnecessary tests, support doctors in decision-making, and ultimately **improve patient outcomes while reducing healthcare costs**.

