1. What is a Decision Tree, and how does it work in the context of classification?
    - A Decision Tree is a supervised learning algorithm that recursively splits data into branches based on feature values to predict a target class.
It works like a flowchart: each internal node tests a condition on a feature, each branch is an outcome of that test, and each leaf assigns a class label.
During training, the tree picks at each node the split that most reduces impurity; unseen data is then classified by following the learned rules from the root to a leaf.
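
To make the flowchart picture concrete, here is a minimal sketch that fits a shallow tree on Iris and prints its learned rules as text. The choice of `max_depth=2` and `random_state=0` is only an assumption to keep the printout short; it is not part of the answer above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each "|---" line with "<=" or ">" is an internal-node condition;
# each "class: ..." line is a leaf
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output top to bottom follows one path of the flowchart: a feature test, the branch taken, and finally the predicted class.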

2.  Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
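    - Gini Impurity: The probability that a randomly chosen sample in a node would be misclassified if it were labeled at random according to the node's class distribution: 1 - sum(p_i^2).

        Entropy: Measures the disorder (uncertainty) of the class distribution in a node: -sum(p_i * log2(p_i)).

        Both are 0 for a pure node and largest when the classes are evenly mixed. Lower impurity = better split; trees choose the split that minimizes the size-weighted impurity of the child nodes.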
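
The two measures can be sketched in a few lines of plain Python. The helper names (`class_probabilities`, `gini_impurity`, `entropy`) and the toy label lists are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Fraction of samples belonging to each class in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini_impurity(labels):
    """1 - sum(p_i^2): 0 for a pure node, larger when classes are mixed."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    """sum(p_i * log2(1 / p_i)), which equals -sum(p_i * log2(p_i))."""
    return sum(p * math.log2(1 / p) for p in class_probabilities(labels))

pure = ["a", "a", "a", "a"]    # one class only
mixed = ["a", "a", "b", "b"]   # two classes, evenly mixed

print(gini_impurity(pure), entropy(pure))    # 0.0 0.0
print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0
```

For two classes, Gini peaks at 0.5 and entropy at 1.0, both at the 50/50 mix, which is why either one can be used to rank candidate splits.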

3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
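    - Pre-pruning: Stops tree growth early using constraints such as max_depth, min_samples_split, or min_samples_leaf. Advantage: reduces overfitting and saves computation time, because the full tree is never built.

        Post-pruning: Grows the full tree first, then prunes the branches that contribute least (for example, with cost-complexity pruning). Advantage: often improves generalization and accuracy on unseen data, because each branch is judged only after the whole tree has been grown.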
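
A minimal sketch of both approaches in scikit-learn, assuming the Iris data and an arbitrarily chosen pruning strength. In practice `ccp_alpha` would be selected by cross-validation; picking the middle value of the pruning path here is only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth stops as soon as depth 3 is reached
pre_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute the cost-complexity path
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take the middle alpha from the path
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

print("Nodes (full, pre-pruned, post-pruned):",
      full_tree.tree_.node_count,
      pre_pruned.tree_.node_count,
      post_pruned.tree_.node_count)
```

Comparing node counts shows how much each strategy simplifies the tree relative to the fully grown one.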

4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?
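    - Information Gain = Reduction in impurity (typically entropy) after a split: the parent's entropy minus the size-weighted average entropy of the child nodes.

        It measures how much “information” a feature provides about the target.
        
        Higher gain = better split, so at each node the tree evaluates the candidate splits and chooses the one with the highest gain.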
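
As a worked example, the sketch below computes information gain for two candidate splits of the same parent node. The function names and the toy "yes"/"no" labels are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c)
               for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # children are pure
useless_split = [["yes", "no"], ["yes", "no"]]   # children as mixed as parent

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

The split that separates the classes completely recovers the full 1 bit of parent entropy, while the split that leaves both children evenly mixed gains nothing.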

5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
    - Applications: Loan approval, disease diagnosis, churn prediction, fraud detection.

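        Advantages: Easy to interpret and visualize, needs no feature scaling, and handles both numerical and categorical data.
    
        Limitations: Prone to overfitting when grown deep, and unstable: small changes in the training data can produce a very different tree.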

6. Write a Python program to:
● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

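# Load the data and hold out 20% of it for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a tree that splits on Gini impurity (scikit-learn's default criterion)
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Test-set accuracy and per-feature importances (the importances sum to 1)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)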


Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]


7. Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Same split for both models so the comparison is fair
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# full_tree has no depth limit; limited_tree is pre-pruned at depth 3
full_tree = DecisionTreeClassifier(random_state=42)
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)

full_tree.fit(X_train, y_train)
limited_tree.fit(X_train, y_train)

# Compare accuracy on the held-out test set
print("Full Tree Accuracy:", accuracy_score(y_test, full_tree.predict(X_test)))
print("Depth=3 Accuracy:", accuracy_score(y_test, limited_tree.predict(X_test)))


Full Tree Accuracy: 1.0
Depth=3 Accuracy: 1.0


8. Write a Python program to:

● Load the California Housing dataset from sklearn

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances

In [4]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

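# Load the data (downloaded on first use) and hold out 20% for testing
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Fully grown regression tree (no depth limit)
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# MSE is in squared target units; importances follow the order of data.feature_names
print("MSE:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", reg.feature_importances_)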


MSE: 0.495235205629094
Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
 0.09371656 0.08290203]


9. Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV

● Print the best parameters and the resulting model accuracy

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

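# Search 4 x 3 = 12 parameter combinations with 3-fold cross-validation on the training set
params = {'max_depth': [2, 3, 4, 5], 'min_samples_split': [2, 3, 4]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=3)
grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)
# Accuracy of the refit best model on the held-out test set (grid.best_score_ holds the mean CV score)
print("Best Accuracy:", accuracy_score(y_test, grid.best_estimator_.predict(X_test)))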


Best Params: {'max_depth': 3, 'min_samples_split': 2}
Best Accuracy: 1.0


10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

Steps:

Handle Missing Values: Use SimpleImputer to fill missing numeric data (mean/median) and categorical data (most frequent). Fit the imputer on the training split only, so no information leaks from the test data.

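Encode Categoricals: Use OneHotEncoder for nominal features or OrdinalEncoder for ordered ones. LabelEncoder is intended for the target column, not for input features.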

Train Model: Fit a DecisionTreeClassifier.

Tune Hyperparameters: Apply GridSearchCV with cross-validation to optimize max_depth, min_samples_split, min_samples_leaf, etc.

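Evaluate: Use accuracy_score or roc_auc_score on a held-out test set. For disease prediction, also check recall and precision, since the classes are often imbalanced and a missed case is costly.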

Business Value: Helps doctors flag at-risk patients early, which can improve treatment and reduce costs. The tree's rules are also easy to explain to clinicians.
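
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the column names (`age`, `blood_pressure`, `smoker`), the synthetic labels, and the small parameter grid are hypothetical stand-ins for a real patient dataset, and one-hot encoding is used as one reasonable option for the categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical patient data with mixed types
rng = np.random.default_rng(42)
n = 400
age = rng.integers(20, 80, n).astype(float)
smoker = rng.choice(["yes", "no"], n).astype(object)
y = ((age > 50) & (smoker == "yes")).astype(int)  # made-up disease label

# Knock out some values to simulate missing data
age[rng.choice(n, 40, replace=False)] = np.nan
smoker[rng.choice(n, 40, replace=False)] = np.nan
df = pd.DataFrame({
    "age": age,
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": smoker,
})

# Steps 1 and 2: impute, then encode the categorical column
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: the model, chained after preprocessing so that imputers and
# encoders are fit on training folds only
pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with 5-fold cross-validation
grid = GridSearchCV(
    pipe,
    {"tree__max_depth": [2, 3, 5], "tree__min_samples_leaf": [1, 5, 20]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
recall = recall_score(y_test, grid.predict(X_test))
print("Best params:", grid.best_params_)
print("Test ROC AUC:", auc)
print("Test recall:", recall)
```

Wrapping preprocessing and the model in one `Pipeline` means `GridSearchCV` refits the imputers and encoder inside every fold, which avoids leaking test-fold statistics into training.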