Question 1: What is a Decision Tree, and how does it work in the context of classification?
   - Definition:
      - A Decision Tree is a type of supervised machine learning algorithm used for classification and regression tasks. In the context of classification, a decision tree is a model that predicts the class label of an instance by learning decision rules inferred from the data features.
   
   - Structure of a Decision Tree:

       - Root Node: Represents the entire dataset; the starting point of the tree.
       - Internal Nodes: Represent decisions based on features (e.g., "Is age < 30?").
       - Leaf Nodes: Represent the output class (e.g., "Yes" or "No", or "Class A", "Class B").
       - Each internal node splits the data based on the value of one feature, and this process continues recursively.

  - How it Works (for Classification):
    1. Start at the Root Node:
      - Choose the best feature to split the data.
      - "Best" is determined using a metric like:
         - Gini Impurity
         - Entropy/Information Gain
         - Gain Ratio
  
    2. Split the Dataset:
      - Based on the chosen feature, divide the data into subsets.
    
     
    3. Repeat:
      - Continue splitting recursively for each subset until a stopping condition is met (e.g., max depth, no improvement, pure node).

    4. Predict:
      - To classify a new instance, traverse the tree from root to leaf, following decisions based on the feature values of the instance.
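
The prediction step can be sketched with a small hand-built tree. The tree below, its feature names (`age`, `income`), and its thresholds are hypothetical and chosen only for illustration; they are not learned from any data.

```python
# A hypothetical, hand-built tree: internal nodes test one feature against a
# threshold; leaf nodes hold a class label.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"label": "No"},                  # taken when age < 30
    "right": {                                # taken when age >= 30
        "feature": "income", "threshold": 50000,
        "left": {"label": "No"},              # income < 50000
        "right": {"label": "Yes"},            # income >= 50000
    },
}

def predict(node, instance):
    """Walk from the root to a leaf, following the test at each internal node."""
    while "label" not in node:
        go_left = instance[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(predict(tree, {"age": 25, "income": 80000}))  # No
print(predict(tree, {"age": 45, "income": 80000}))  # Yes
```

Each prediction touches only one root-to-leaf path, which is why trained trees are cheap to evaluate.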

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

  1. Gini Impurity
      - Gini Impurity measures the probability of misclassifying a randomly chosen element if it were randomly labeled according to the class distribution in a node.

     - Formula (for a node with k classes):
       
      

$$
\text{Gini} = 1 - \sum_{i=1}^{k} p_i^2
$$

     - $p_i$ is the proportion of samples belonging to class $i$ in the node.

     - Interpretation:

         - Gini = 0 → Pure node (all samples belong to one class).
         - Higher Gini → More mixed node (more impurity).

   -  Example:
        - If a node has 50% class A and 50% class B:
            - Gini = 1-(0.5^2 + 0.5^2) = 0.5
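
As a quick check of the formula, here is a minimal sketch that computes Gini impurity from a list of class labels. The helper name `gini` is an assumption made for this example, not a library function.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["A", "B"]))       # 0.5  (50% class A, 50% class B)
print(gini(["A", "A", "A"]))  # 0.0  (pure node)
```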
   
   2. Entropy (Information Gain)
      - Entropy measures the level of disorder or randomness in a node. It's borrowed from information theory.

      - Formula:
$$
\text{Entropy} = - \sum_{i=1}^{k} p_i \log_2 p_i
$$

      - Interpretation:
        - Entropy = 0 → Pure node.
        - Maximum entropy occurs when all classes are equally likely (maximum disorder); for k classes the maximum is log2(k).

     - Example:
         - Using the same 50%-50% split:
             - Entropy = -(0.5 × log2(0.5) + 0.5 × log2(0.5)) = 1
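
The same calculation can be sketched in a few lines. The helper name `entropy` is an assumption for this example; it uses the equivalent form sum(p × log2(1/p)), which avoids printing a negative zero for pure nodes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits; equivalent to -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

print(entropy(["A", "B"]))       # 1.0  (50%-50% split)
print(entropy(["A", "A", "A"]))  # 0.0  (pure node)
```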


  - How They Impact Splits in Decision Trees:
      1. For each candidate feature:
         - Calculate the impurity (Gini or Entropy) of each split.
         - Compute the weighted average impurity of the child nodes.
      2. Select the feature and split point that minimizes the impurity (or maximizes Information Gain, in the case of entropy).
      3. Split the node using that feature.    
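
The split-selection steps above can be sketched for a single numeric feature. This is a simplified illustration, not scikit-learn's actual implementation: the helper names and the toy data are assumptions, and candidate thresholds are taken as midpoints between consecutive distinct values.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Try each midpoint between distinct values; keep the lowest weighted impurity."""
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: customer ages and whether they bought a product.
ages = [22, 25, 28, 35, 40, 52]
bought = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_threshold(ages, bought))  # (31.5, 0.0)
```

A full tree builder would repeat this search over every feature and then recurse into the two child subsets.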

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

  - Decision Trees can easily overfit the training data if they grow too deep. To avoid this, we use pruning — a technique to limit tree complexity and improve generalization.

  1. Pre-Pruning (Early Stopping):
      - In Pre-Pruning, we stop the tree from growing further during the building phase, based on some criteria.

      - Common Pre-Pruning Conditions:
          - Maximum depth of the tree (max_depth)
          - Minimum number of samples required to split a node (min_samples_split)
          - Minimum impurity decrease (information gain) required to make a split (min_impurity_decrease)

      - Practical Advantage:
         - Faster Training Time
            - Because the tree doesn't grow unnecessarily deep, it trains faster and uses less memory.
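
In scikit-learn, pre-pruning conditions are passed as constructor arguments. The sketch below uses arbitrary values chosen for illustration; `min_impurity_decrease` acts as the minimum-gain threshold, measured with whichever criterion is selected (Gini by default).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each argument is an early-stopping rule applied while the tree is grown.
clf = DecisionTreeClassifier(
    max_depth=3,                 # never grow deeper than 3 levels
    min_samples_split=10,        # a node needs at least 10 samples to be split
    min_impurity_decrease=0.01,  # a split must reduce weighted impurity by at least this much
    random_state=42,
)
clf.fit(X, y)
print("Depth:", clf.get_depth(), "Leaves:", clf.get_n_leaves())
```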

  2. Post-Pruning (e.g., Reduced Error Pruning):
      - In Post-Pruning, we first grow the full tree, and then remove branches that do not improve accuracy on a validation set.
      
      - How it works:
         - Fully grow the tree
         - Evaluate subtrees on validation data
         - Remove branches that don't help (i.e., they cause overfitting)

      - Practical Advantage:
        - Better Generalization  
              - Since pruning is done based on actual performance on unseen data (validation set), the model is less likely to overfit.
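
scikit-learn does not provide reduced error pruning; what it offers is minimal cost-complexity pruning through the `ccp_alpha` parameter. The sketch below swaps in that technique and picks the pruning strength that scores best on a held-out validation split, which is similar in spirit to the procedure described above. The split sizes and the tie-breaking rule are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Compute the candidate pruning strengths (alphas) for the fully grown tree.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha and keep the one that scores best on validation data.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42).fit(X_train, y_train)
print(f"alpha={best_alpha:.4f}, leaves={pruned.get_n_leaves()}, val accuracy={best_score:.4f}")
```

Because the validation set is used to choose `ccp_alpha`, a separate test set would be needed for an unbiased final estimate.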

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

 - Information Gain (IG) is a metric used in decision trees to measure the effectiveness of an attribute (feature) in classifying the training data.
 - It quantifies the reduction in entropy (uncertainty or impurity) after splitting a dataset based on a feature.

 - Formula:

Let:

   - $S$: the original dataset (node)
   - $A$: the feature to split on
   - $S_v$: the subset of $S$ where feature $A$ has value $v$

Then:

  $$
\text{Information Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \cdot \text{Entropy}(S_v)
$$

  - Why is it Important?
     - During tree construction:
        - The algorithm evaluates each feature and its possible split points.
        - It chooses the feature/split that maximizes Information Gain.
        - This leads to purer child nodes, helping the tree classify better.
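
The Information Gain formula can be checked on a small example. The helper names and the toy labels below are assumptions made for illustration; the two candidate splits show the extremes of a perfectly informative feature and an uninformative one.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy(S) minus the size-weighted entropy of the subsets S_v."""
    n = len(parent)
    remainder = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - remainder

# Hypothetical node with 4 "Yes" and 4 "No" labels.
parent = ["Yes"] * 4 + ["No"] * 4
perfect = [["Yes"] * 4, ["No"] * 4]       # the feature separates the classes
useless = [["Yes", "Yes", "No", "No"],    # children are as mixed as the parent
           ["Yes", "Yes", "No", "No"]]

print(information_gain(parent, perfect))  # 1.0
print(information_gain(parent, useless))  # 0.0
```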

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
   - Real-World Applications of Decision Trees:
   1. Healthcare
      - Use: Diagnosing diseases based on symptoms.
      - Example: Predicting whether a patient has diabetes or not using test results.
      - Why?: Easy to interpret, which is critical in medical decisions.

   2. Finance
       - Use: Credit scoring, loan approval, fraud detection.
       - Example: Will the customer default on a loan?
       - Why?: Rules are transparent, useful for compliance.

   3. Marketing
      - Use: Customer segmentation, churn prediction.
      - Example: Will a customer buy a product based on their behavior?
      - Why?: Helps create targeted marketing strategies.

  4. Retail & E-Commerce
      - Use: Recommender systems, inventory prediction.
      - Example: What products to show to a user?
      - Why?: Fast prediction, good for real-time systems.

  5. Manufacturing & Quality Control
     - Use: Defect detection, process optimization.
     - Example: Will a product pass quality standards based on measurements?

  - Limitations of Decision Trees

| Limitation | Description |
|------------|-------------|
| Overfitting | Deep trees can memorize training data and perform poorly on new data. |
| Instability | Small changes in the data can lead to completely different trees. |
| Greedy algorithm | Locally optimal splits don't guarantee a globally optimal tree. |
| Poor with imbalanced data | Splits can be biased toward the majority class. |
| Weaker than ensembles on complex patterns | A single tree may underperform ensemble methods like Random Forests or Gradient Boosted Trees. |
  

In [None]:
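# Dataset Info:
# Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
# Boston Housing Dataset for regression tasks
# (sklearn.datasets.load_boston() or provided CSV).
# Note: load_boston() was removed in scikit-learn 1.2, so Question 8 uses
# fetch_california_housing() instead.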


In [1]:
#Question 6: Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier using the Gini criterion
#● Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load Iris Dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split data into train and test sets (optional but recommended)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Initialize Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# 4. Train the model
clf.fit(X_train, y_train)

# 5. Predict on test set
y_pred = clf.predict(X_test)

# 6. Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy on Test Set: {accuracy:.4f}")

# 7. Print feature importances
feature_names = iris.feature_names
importances = clf.feature_importances_

print("\nFeature Importances:")
for name, importance in zip(feature_names, importances):
    print(f"{name}: {importance:.4f}")



Model Accuracy on Test Set: 1.0000

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


In [2]:
#Question 7: Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris Dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree with max_depth=3
clf_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)
y_pred_pruned = clf_pruned.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

# Train fully grown Decision Tree (no max_depth)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print accuracies
print(f"Accuracy with max_depth=3: {accuracy_pruned:.4f}")
print(f"Accuracy of fully grown tree: {accuracy_full:.4f}")


Accuracy with max_depth=3: 1.0000
Accuracy of fully grown tree: 1.0000


In [3]:
#Question 8: Write a Python program to:
#● Load the California Housing dataset from sklearn
#● Train a Decision Tree Regressor
#● Print the Mean Squared Error (MSE) and feature importances

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)

# Train the model
regressor.fit(X_train, y_train)

# Predict on test set
y_pred = regressor.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error on Test Set: {mse:.4f}")

# Print feature importances
print("\nFeature Importances:")
for name, importance in zip(feature_names, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")



Mean Squared Error on Test Set: 0.5280

Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.0250
Population: 0.0322
AveOccup: 0.1390
Latitude: 0.0900
Longitude: 0.0888


In [4]:
#Question 9: Write a Python program to:
#● Load the Iris Dataset
#● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
#● Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Define parameter grid for GridSearchCV
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Predict on test set using the best estimator
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Best Model on Test Set: {accuracy:.4f}")


Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Accuracy of Best Model on Test Set: 1.0000


Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.

Explain the step-by-step process you would follow to:

   - Handle the missing values
   - Encode the categorical features
   - Train a Decision Tree model
   - Tune its hyperparameters
   - Evaluate its performance

Also describe what business value this model could provide in a real-world setting.

  Answer -

  1. Handle Missing Values
    - Identify missing data: Explore the dataset to find where values are missing.
    - Imputation:
       - For numerical features, fill missing values with the mean or median, or use more advanced methods like KNN imputation.
       - For categorical features, fill missing values with the mode or create a special category like "Missing".
    - Consider dropping columns or rows if too much data is missing or it is not salvageable.
    - Use tools like sklearn.impute.SimpleImputer, KNNImputer, or IterativeImputer (the last one is experimental and must be enabled with: from sklearn.experimental import enable_iterative_imputer).
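
A minimal sketch of the imputation step, assuming a tiny hypothetical patient table with one numeric and one categorical column. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records with a gap in each column.
df = pd.DataFrame({
    "age": [34.0, np.nan, 58.0, 45.0],
    "smoker": pd.Series(["yes", "no", np.nan, "no"], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="constant", fill_value="Missing")

ages = num_imputer.fit_transform(df[["age"]])        # missing age becomes the median, 45.0
smokers = cat_imputer.fit_transform(df[["smoker"]])  # missing value becomes "Missing"

df_clean = pd.DataFrame({"age": ages[:, 0], "smoker": smokers[:, 0]})
print(df_clean)
```

In a real project the imputers would be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.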

2. Encode Categorical Features
    - Convert categorical variables to numeric formats usable by the model:
      - Use One-Hot Encoding for nominal categorical variables (e.g., gender, blood type).
      - Use Ordinal Encoding for ordinal categories (e.g., disease severity levels).
    - Handle high-cardinality features carefully to avoid dimensionality explosion.
    - Libraries: pandas.get_dummies(), sklearn.preprocessing.OneHotEncoder, OrdinalEncoder.

  3. Train a Decision Tree Model
     - Split the data into training and testing (or validation) sets.
     - Initialize the Decision Tree Classifier with default or basic parameters.
     - Fit the model on the training data.
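     - Decision trees can in principle split on both numerical and categorical features, but scikit-learn's implementation requires numeric input, so the encoding step above must come first.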

 4. Tune Hyperparameters
     - Use techniques like GridSearchCV or RandomizedSearchCV to tune:
        - max_depth: controls tree depth to prevent overfitting.
        - min_samples_split: minimum samples required to split a node.
        - min_samples_leaf: minimum samples per leaf node.
        - criterion: 'gini' or 'entropy'.
    - Use cross-validation to ensure robust performance estimation.

5. Evaluate Model Performance
   - Use appropriate metrics for classification:
      - Accuracy: overall correctness.
      - Precision, Recall, F1-Score: especially important in healthcare to balance false positives and negatives.
      - ROC-AUC: for evaluating model’s ability to discriminate between classes.
  - Analyze confusion matrix to understand types of errors.
  - Perform error analysis and check for bias or data leakage.
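
The evaluation metrics listed above can be computed as follows. No real patient data is available here, so this sketch uses a synthetic, imbalanced dataset from `make_classification` as a stand-in; the tree settings (`max_depth=4`, `class_weight="balanced"`) are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a disease dataset with roughly 10% positive cases.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

cm = confusion_matrix(y_test, y_pred)      # rows: true class, columns: predicted class
auc = roc_auc_score(y_test, y_score)

print(cm)
print(classification_report(y_test, y_pred, digits=3))
print(f"ROC-AUC: {auc:.3f}")
```

With a rare positive class, recall on the positive row of the report is usually more informative than overall accuracy.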

6. Business Value of the Model
  - Early and accurate disease prediction enables proactive treatment and better patient outcomes.
  - Helps in resource allocation by identifying high-risk patients.
  - Reduces healthcare costs by preventing advanced disease progression.
  - Improves decision support for healthcare professionals.
  - Enables personalized care plans based on predicted risks.
  - Builds trust with transparent and interpretable models (Decision Trees are easy to explain).
