Q1. What is a Decision Tree, and how does it work in the context of classification?
Ans:-

A Decision Tree is a supervised machine learning algorithm used for classification and regression. It makes decisions by splitting data into smaller subsets based on feature values.

How it Works in Classification

* All training samples start at the root node.

* The algorithm selects the feature and threshold that best separate the classes.

* The data is divided into branches according to that condition.

* The process repeats recursively on each branch until a node is pure or a stopping rule is met.

* The final nodes are called leaf nodes; each one predicts a class label.

Example:

In the Iris dataset, the model may split data based on:

* Petal length

* Petal width

If petal length < certain value → Class A
Else → further split → Class B or C
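
The splitting rules can be inspected directly. The following is a minimal sketch that uses scikit-learn's `export_text` to print the learned if/else rules. The depth limit of 2 and the variable names are choices made for this illustration, and the exact thresholds depend on the data and the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (depth 2 is an arbitrary choice, kept small for readability)
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Print the if/else rules the tree learned, one line per node
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```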
___
Q2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

Ans:-

Decision Trees use impurity measures to decide the best split.

**Gini Impurity**

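* Measures how often a randomly chosen sample would be misclassified if it were labeled at random according to the class distribution in the node.

* A lower Gini value means a purer node; 0 means all samples belong to one class.

* Slightly faster to compute than entropy because it needs no logarithm.

* Default criterion in sklearn's DecisionTreeClassifier.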

**Entropy**

* Measures randomness or disorder in data.

* Higher entropy means more mixed classes.

* Based on information theory.

**Impact on Splits**

* The algorithm selects the split (feature and threshold) that reduces the weighted impurity of the child nodes the most.

* Lower impurity after split → Better split.

* Both methods usually give similar results.
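
Both measures can be computed by hand from the class proportions in a node. The sketch below assumes the standard definitions, Gini = 1 - Σ p_k² and entropy = Σ p_k · log2(1 / p_k) in bits; the helper names `gini` and `entropy` are made up for this example and are not scikit-learn functions. A pure node scores 0 on both measures, while an evenly mixed two-class node scores 0.5 (Gini) and 1 bit (entropy).

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

pure = [0, 0, 0, 0]    # only one class present
mixed = [0, 0, 1, 1]   # two classes, evenly mixed

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```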
---

Q3. Difference Between Pre-Pruning and Post-Pruning

Ans:-

Pre-Pruning:-

* Stops tree growth early.

* Uses conditions like max_depth or min_samples_split.

* Prevents overfitting during training.

Advantage:
Reduces computation time and prevents very complex trees.

Post-Pruning:-

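* Builds the full tree first.

* Then removes branches that add little predictive value.

* Chooses how much to prune from validation performance (in scikit-learn this is cost-complexity pruning, controlled by ccp_alpha).

Advantage:
Often generalizes better than early stopping, because a branch is removed only after its effect has been measured.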
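
A minimal sketch of both approaches on the Iris data. Pre-pruning uses `max_depth` and `min_samples_split`; post-pruning uses scikit-learn's cost-complexity pruning (`cost_complexity_pruning_path` and `ccp_alpha`). The specific parameter values and the choice of the middle alpha are illustrative assumptions; in practice alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: restrict growth while the tree is being built
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it with a cost-complexity alpha
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative pick: the middle candidate alpha (clipped at 0 to guard
# against tiny negative values caused by floating-point error)
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name:<12} leaves={tree.get_n_leaves():<3} test acc={tree.score(X_test, y_test):.3f}")
```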
___
Q4. What is Information Gain and why is it important?

Ans:-

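Information Gain measures how much uncertainty is reduced by a split: it is the entropy of the parent node minus the size-weighted average entropy of the child nodes.

Importance:-

* Helps select the best feature for splitting.

* Higher Information Gain → Better feature.

* Guides the greedy choice at each node; it does not guarantee a globally optimal tree.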
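
The definition can be checked on a toy example. This sketch assumes entropy measured in bits and a binary split; `entropy` and `information_gain` are hypothetical helper names for this illustration. A split that separates the classes perfectly recovers the full parent entropy (1 bit here), while a split whose children are as mixed as the parent gains nothing.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1])

# Split A: classes separated perfectly, so both children are pure
gain_perfect = information_gain(parent, parent[:3], parent[3:])

# Split B: both children keep the 50/50 class mix of the parent
gain_useless = information_gain(parent, parent[[0, 3]], parent[[1, 2, 4, 5]])

print("perfect split gain:", gain_perfect)
print("useless split gain:", gain_useless)
```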
____

Q5. Applications, Advantages, and Limitations
Applications:-

* Medical diagnosis

* Credit risk analysis

* Fraud detection

* Customer segmentation

* Loan approval systems

Advantages:-

* Easy to understand and interpret

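* Works with numerical and categorical data (scikit-learn's implementation needs categorical features encoded as numbers first)

* Requires little data preprocessing (no feature scaling or normalization)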

* Handles nonlinear relationships

Limitations:-

* Can overfit easily

* Sensitive to small data changes

* Can create biased trees if data is imbalanced
____

Q6. Iris Dataset – Gini Criterion
Ans:-






In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model using Gini
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Feature Importances
print("Feature Importances:", model.feature_importances_)

Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]
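
The importance array is easier to read when each value is paired with its feature name. A small sketch, retraining the same model so the block runs on its own; the `ranked` variable is introduced only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
model = DecisionTreeClassifier(criterion="gini", random_state=42)
model.fit(X_train, y_train)

# Pair each importance with its feature name, largest first
ranked = sorted(
    zip(iris.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction in the tree.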


Q7. Compare max_depth=3 vs Fully Grown Tree.

In [2]:
# Fully grown tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_tree.predict(X_test))

# Tree with max_depth=3
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
limited_acc = accuracy_score(y_test, limited_tree.predict(X_test))

print("Fully Grown Tree Accuracy:", full_acc)
print("Max Depth=3 Accuracy:", limited_acc)

Fully Grown Tree Accuracy: 1.0
Max Depth=3 Accuracy: 1.0
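
Test accuracy alone does not separate the two trees here, so it can help to compare their size and training accuracy as well. A sketch using `get_depth()` and `get_n_leaves()`, with the same split and seeds as above; the tree objects are refit so the block is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Report structure and accuracy side by side
for name, tree in [("fully grown", full_tree), ("max_depth=3", limited_tree)]:
    print(
        f"{name:<12} depth={tree.get_depth()} leaves={tree.get_n_leaves()} "
        f"train acc={tree.score(X_train, y_train):.3f} "
        f"test acc={tree.score(X_test, y_test):.3f}"
    )
```

On a small, clean dataset like Iris the shallower tree can match the full tree on the test set while being simpler to interpret.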


Q8. Boston Housing – Decision Tree Regressor.

Note: load_boston was removed from scikit-learn in version 1.2 because of ethical concerns about the dataset, so importing it raises an ImportError. The cell below uses the California housing dataset instead, one of the alternatives recommended in the scikit-learn error message.

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load dataset (replacement for the removed load_boston)
housing = fetch_california_housing()
X_reg = housing.data
y_reg = housing.target

# Split (separate variable names so the Iris split used in Q9 is not overwritten)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42)

# Train model
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train_reg, y_train_reg)

# Predict
y_pred_reg = reg.predict(X_test_reg)

# MSE and feature importances
print("Mean Squared Error:", mean_squared_error(y_test_reg, y_pred_reg))
print("Feature Importances:", reg.feature_importances_)



Q9. GridSearchCV Hyperparameter Tuning.

In [5]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid,
                    cv=5)

grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
# best_score_ is the mean cross-validated accuracy on the training data, not test accuracy
print("Best Accuracy:", grid.best_score_)

Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Best Accuracy: 0.9428571428571428
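
The score reported by `best_score_` is a cross-validation average over the training data, so a final check on the held-out test set is a natural follow-up. A self-contained sketch with the same grid and split; it relies on the default `refit=True`, which retrains the best configuration on the full training set so that the grid object can score and predict directly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# Mean accuracy across the 5 validation folds for the best configuration
print("CV accuracy:  ", grid.best_score_)

# Accuracy of the refitted best estimator on data it has never seen
test_acc = grid.score(X_test, y_test)
print("Test accuracy:", test_acc)
```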


Q10. Healthcare Disease Prediction – Step-by-Step Process

Ans:-
1. Handle Missing Values

* Use mean/median for numerical features.

* If too many missing values → remove column.

2. Encode Categorical Features

* Use One-Hot Encoding for nominal data.

* Use ordinal encoding (scikit-learn's OrdinalEncoder) for ordinal data; LabelEncoder is intended for the target column.

3. Train Decision Tree Model

* Split data into train and test.

* Use DecisionTreeClassifier.

* Fit model on training data.

4. Tune Hyperparameters

* Use GridSearchCV.

* Tune max_depth, min_samples_split, min_samples_leaf.

* Use cross-validation.

5. Evaluate Performance

* Accuracy

* Precision

* Recall

* F1-score

* Confusion Matrix

* ROC-AUC score
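
The five steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the patient table is synthetic, and the column names (`age`, `blood_pressure`, `region`, `pain_level`) and the rule that generates the target are invented for illustration. Putting imputation and encoding inside the pipeline means they are refit on each cross-validation fold, which avoids leaking test information into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (all columns are hypothetical)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "region": rng.choice(["north", "south", "east"], n),     # nominal
    "pain_level": rng.choice(["low", "medium", "high"], n),  # ordinal
})
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan  # inject missing values
y = ((df["age"] > 55) | (df["pain_level"] == "high")).astype(int)    # made-up target rule

# Steps 1-2: impute numeric columns, encode categorical columns
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ("ordinal", OrdinalEncoder(categories=[["low", "medium", "high"]]), ["pain_level"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 3: train/test split, stratified to keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune tree hyperparameters with cross-validation
param_grid = {"tree__max_depth": [2, 3, 5], "tree__min_samples_leaf": [1, 5, 10]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
y_pred = grid.predict(X_test)
y_prob = grid.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("ROC-AUC:", auc)
```

For a medical screening task, recall on the positive class usually deserves more weight than raw accuracy, since a missed case tends to be costlier than a false alarm.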

Business Value:-

* Early disease detection

* Faster diagnosis support

* Reduced medical costs

* Improved patient outcomes

* Better hospital resource allocation
____

**End**