1. What is a Decision Tree, and how does it work in the context of
classification?
 - A Decision Tree is a supervised machine learning model used for both classification and regression tasks. In classification, it works by repeatedly splitting the data into subsets based on feature values, forming a tree-like structure whose leaves assign class labels.

     How it works in classification:
     1. Root Node (Starting Point):

        The tree starts at the root, which represents the entire dataset. At this point, the algorithm looks for the "best feature" to split the data on.
     2. Splitting Criteria:
         * The algorithm chooses a feature and a threshold that best separates the data into classes.
         * Common measures for this selection include:
           * Gini Impurity
           * Entropy (Information Gain)
           * Classification Error
     3. Internal Nodes (Decision Points):
        
        Each node in the tree represents a feature test (e.g., "Is Age > 30?"). Based on the answer (Yes/No), the data flows down different branches.
     4. Leaf Nodes (Outcomes):
        
        Eventually, the data reaches a leaf node, which assigns a class label (or probability distribution over classes). For example, "Spam" vs. "Not Spam."
     5. Prediction Process:
        
        To classify a new sample, the model starts at the root node and moves through the tree according to the sample's feature values until it reaches a leaf node. The class of that leaf node is the prediction.

        Example:
        
        Suppose we want to classify whether someone will buy a product:
        * Root Node: "Is Age > 30?"
        * If Yes → Check "Income level?"
        * If No → Check "Student?"
        * Leaf Nodes: "Buys Product" or "Does Not Buy Product."

        Advantages:
        * Easy to interpret and visualize.
        * Handles both numerical and categorical data (scikit-learn's trees still expect categorical features to be encoded as numbers first).
        * Requires little preprocessing (no need for feature scaling).

        Limitations:
        * Can easily overfit if not pruned.
        * Small changes in data can produce very different trees (high variance).
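
The example tree above can be written out as nested conditions, which is essentially what prediction does: start at the root, apply one feature test per node, and stop at a leaf. The sketch below is purely illustrative. The dictionary keys, the "high" income category, and the label assigned to each leaf are assumptions, since the example does not specify them.

```python
# Hand-written version of the example tree. The field names, the "high"
# income category, and the leaf labels are assumed for illustration only.
def predict_buys(sample: dict) -> str:
    if sample["age"] > 30:                  # root node: "Is Age > 30?"
        if sample["income"] == "high":      # internal node: "Income level?"
            return "Buys Product"           # leaf
        return "Does Not Buy Product"       # leaf
    if sample["student"]:                   # internal node: "Student?"
        return "Buys Product"               # leaf
    return "Does Not Buy Product"           # leaf


print(predict_buys({"age": 45, "income": "high", "student": False}))
print(predict_buys({"age": 22, "income": "low", "student": False}))
```

A trained tree differs only in that the tests and leaf labels are learned from data rather than written by hand.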

2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
 - Decision Trees split data based on impurity measures. Two of the most common are Gini Impurity and Entropy (Information Gain). Let’s break them down:

 1. Gini Impurity
     
     Definition: Gini measures how often a randomly chosen sample would be incorrectly classified if it were labeled at random according to the class distribution in the node.

     Formula:
     
     For a node with classes $C = \{1, 2, \dots, k\}$:

     $$Gini = 1 - \sum_{i=1}^{k} p_i^2$$

     where $p_i$ is the proportion of samples belonging to class $i$ at that node.

  * Intuition:
     * If all samples in a node belong to the same class → Gini = 0 (pure).
     * Higher Gini means more mixed classes (impure).
 2. Entropy (Information Gain)
  * Definition: Entropy measures the "disorder" or "uncertainty" in a dataset. It comes from information theory.
  * Formula:

     $$Entropy = -\sum_{i=1}^{k} p_i \log_2(p_i)$$

     where $p_i$ is the proportion of samples belonging to class $i$.

  * Intuition:
     * Entropy = 0 → completely pure (all samples same class).
     * Entropy = 1 (for binary classification with 50/50 split) → maximum uncertainty.
  * Information Gain:
     
     Decision Trees using Entropy typically maximize Information Gain, which is the reduction in entropy after a split:

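     $$IG = Entropy(parent) - \sum_{j} \frac{n_j}{n} \cdot Entropy(child_j)$$

     where $n_j$ is the number of samples in child $j$ and $n$ is the number of samples in the parent.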

 3. How They Impact Splits in a Decision Tree
  * At each node, the algorithm evaluates possible splits on features and chooses the split that results in the greatest reduction in impurity.
     * Using Gini: Chooses the split with the lowest weighted Gini across the child nodes.
     * Using Entropy: Chooses the split with the highest Information Gain.
  * Both generally give similar results:
     * Gini is slightly faster to compute (no logarithms).
     * Entropy is sometimes reported to be more sensitive to class imbalance, although in practice the two criteria usually pick similar splits.
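
To make the two formulas concrete, here is a minimal sketch that evaluates both measures for a pure node, an 80/20 node, and a 50/50 node. The function names `gini` and `entropy` are chosen here for illustration and are not part of any library.

```python
import math


def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)


def entropy(proportions):
    """Entropy in bits; zero proportions are skipped (0 * log2(0) is taken as 0)."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


# Pure node, mildly mixed node, maximally mixed binary node.
for dist in ([1.0, 0.0], [0.8, 0.2], [0.5, 0.5]):
    print(dist, "gini:", round(gini(dist), 3), "entropy:", round(entropy(dist), 3))
```

Both measures are zero for the pure node and largest for the 50/50 node; they differ in scale (binary Gini peaks at 0.5, binary entropy at 1.0) rather than in which nodes they rank as purer.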



3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
 - **Pre-Pruning (Early Stopping)**
    * Definition: Stop growing the tree early before it becomes too complex.
    * How it works: The algorithm imposes constraints during tree construction, such as:
      * Maximum depth of the tree (max_depth)
      * Minimum samples required to split a node (min_samples_split)
      * Minimum samples in a leaf (min_samples_leaf)
      * Minimum impurity decrease required for a split
    * Advantage:
      
      Faster training & simpler models — avoids wasting time on splits that don’t add much value.
    * Example:
      
      In customer churn prediction, you might restrict tree depth to prevent it from memorizing rare cases.

  * **Post-Pruning (Cost Complexity Pruning / Reduced Error Pruning)**
    * Definition: First grow a fully grown tree (possibly overfitted), then prune back by removing branches that don’t improve generalization.
    * How it works:
      * Start with a large tree.
      * Evaluate subtrees on a validation set or using cost-complexity pruning (penalizing tree size).
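      * Remove branches whose removal does not hurt validation accuracy (or whose impurity reduction does not justify the added tree size).
    * Advantage:
      
      Often better generalization: because the full tree has already explored deeper patterns, pruning can keep a useful split that early stopping would have blocked, while discarding branches that only fit noise.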
    * Example:
      
      In medical diagnosis, post-pruning ensures the tree doesn’t overfit to rare, noisy patient cases while still capturing important patterns.

  * **Main Difference:**
    * Pre-pruning: Prevents the tree from growing too large in the first place (early stopping).
    * Post-pruning: Grows the tree fully, then trims it back (after training).
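
The contrast can be sketched with scikit-learn on the Iris dataset. This is an illustration under stated assumptions, not a recommended configuration: the pre-pruning limits (`max_depth=2`, `min_samples_leaf=5`) and the choice of a mid-range `ccp_alpha` are arbitrary, and in practice the alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Pre-pruning: constrain growth while the tree is being built.
pre = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune with cost-complexity pruning.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
# Mid-range alpha picked only for illustration (assumption, not a tuned value).
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, model in [("pre-pruned", pre), ("full", full), ("post-pruned", post)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves(),
          "test accuracy:", round(model.score(X_test, y_test), 2))
```

Larger values of `ccp_alpha` prune more aggressively, so the post-pruned tree never has more leaves than the fully grown one.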




4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
 -  Definition:
     
     Information Gain measures the reduction in uncertainty (entropy) about the target variable after splitting a dataset on a given feature.
   * Formula:

     $$IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \cdot Entropy(S_v)$$

     Where:
      * $S$ = dataset at the current node
      * $A$ = feature being considered for the split
      * $S_v$ = subset of $S$ where feature $A$ has value $v$
      * Entropy(S) = impurity before split
      * Weighted average entropy of children = impurity after split
  * Why is it Important?
     * A Decision Tree can split on many possible features.
     * Information Gain tells us how much a split improves purity:
       * Higher IG = Better split (more reduction in uncertainty).
       * The tree chooses the feature with the highest Information Gain at each step.
  * Example:

     Suppose we want to predict if students will "Pass" or "Fail" based on "Hours Studied":
     * Parent Node: 50% Pass, 50% Fail → Entropy = 1.0
     * After splitting on "Hours Studied > 5" (two equal-sized children, as the 50/50 parent implies):
       * Left child: 80% Pass, 20% Fail → Entropy ≈ 0.72
       * Right child: 20% Pass, 80% Fail → Entropy ≈ 0.72
     * Weighted avg child entropy = 0.5 × 0.72 + 0.5 × 0.72 = 0.72
     * Information Gain = 1.0 − 0.72 = 0.28
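
The arithmetic in this example can be checked with a few lines of Python. The helper names below are illustrative, and the child weights of 0.5 each are an assumption that follows from the 50/50 parent (an 80/20 child and a 20/80 child only average to 50/50 when they are the same size).

```python
import math


def entropy(proportions):
    """Entropy in bits; zero proportions are skipped."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


def information_gain(parent, children):
    """parent: class proportions; children: list of (weight, class proportions)."""
    weighted_child_entropy = sum(w * entropy(props) for w, props in children)
    return entropy(parent) - weighted_child_entropy


ig = information_gain(
    parent=[0.5, 0.5],
    children=[(0.5, [0.8, 0.2]), (0.5, [0.2, 0.8])],
)
print(f"Information Gain: {ig:.2f}")
```

Rounded to two decimals this reproduces the 0.28 computed by hand above.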
     

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
 - **Real-World Applications of Decision Trees**
   1. Finance & Banking
      * Credit scoring → Predict if a customer is a “good” or “bad” loan risk.
      * Fraud detection → Classify whether a transaction is suspicious.
   2. Healthcare
      * Disease diagnosis → Predict whether a patient has a particular condition (e.g., diabetes, cancer).
      * Treatment recommendation → Decide on suitable treatment plans based on patient data.
   3. Marketing & E-commerce
      * Customer segmentation → Identify high-value customers.
      * Churn prediction → Predict if a customer will stop using a service.
      * Product recommendation → Suggest products based on user behavior.
   4. Manufacturing & Operations
      * Quality control → Detect defective vs. non-defective products.
      * Predictive maintenance → Decide when a machine is likely to fail.
   5. Telecommunications / IT
      * Network intrusion detection → Identify abnormal usage patterns.
      * Spam filtering → Classify emails as spam or not spam.
 * **Main Advantages**
   1. Easy to interpret & visualize
      * Trees look like flowcharts, so even non-technical stakeholders can understand.
   2. Handles both numerical and categorical data
      * Works well without much preprocessing (no need for scaling/normalization), although scikit-learn's trees expect categorical features to be encoded as numbers first.
   3. Captures non-linear relationships
      * Can model complex decision boundaries.
   4. Feature selection built-in
      * Automatically chooses the most informative features during splitting.
 * **Main Limitations**
   1. Overfitting
      * Trees can grow too deep, memorizing noise in the training data.
   2. High variance
      * Small changes in data can produce very different trees.
   3. Bias toward features with more levels
      * Features with many unique values (like IDs, zip codes) can dominate splits.
   4. Not always the best predictive performance
      * Standalone Decision Trees may be weaker than ensemble methods like Random Forests or Gradient Boosted Trees.

   Decision Trees are powerful because they’re simple, interpretable, and versatile — used in finance, healthcare, marketing, and beyond. But they need pruning or ensembling to overcome overfitting and instability.

6. Write a Python program to:
     
     ● Load the Iris Dataset
     
     ● Train a Decision Tree Classifier using the Gini criterion
     
     ● Print the model’s accuracy and feature importances

In [1]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into train and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Train Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# 4. Make predictions
y_pred = clf.predict(X_test)

# 5. Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# 6. Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")

Model Accuracy: 0.93

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0286
petal length (cm): 0.5412
petal width (cm): 0.4303


What this program does
* Loads the Iris dataset from sklearn.datasets.
* Splits into training (70%) and test (30%) sets.
* Trains a Decision Tree Classifier with the Gini index.
* Evaluates model accuracy on the test set.
* Prints feature importances to show which features the tree found most useful.

7. Write a Python program to:
     
     ● Load the Iris Dataset
     
     ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
      a fully-grown tree.


In [2]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Train a Decision Tree with max_depth=3 (Pre-Pruned Tree)
clf_shallow = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_shallow.fit(X_train, y_train)

# 4. Train a fully grown Decision Tree
clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)

# 5. Predictions
y_pred_shallow = clf_shallow.predict(X_test)
y_pred_full = clf_full.predict(X_test)

# 6. Accuracy
acc_shallow = accuracy_score(y_test, y_pred_shallow)
acc_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy with max_depth=3: {acc_shallow:.2f}")
print(f"Accuracy with fully grown tree: {acc_full:.2f}")

Accuracy with max_depth=3: 0.98
Accuracy with fully grown tree: 0.93


What this does
* Loads the Iris dataset.
* Splits into train/test sets.
* Trains:
  * A pre-pruned tree (max_depth=3).
  * A fully grown tree (no depth restriction).
* Compares their accuracies on the test set.
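
A single 70/30 split of Iris leaves only 45 test samples, so a gap of a few points between the two trees can come from the split itself. As a hedged cross-check (not part of the original exercise), 5-fold cross-validation gives a more stable comparison; the fold count is an arbitrary choice here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "fully grown": DecisionTreeClassifier(random_state=42),
}

# Mean cross-validated accuracy per model, kept for later comparison.
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

If the two means are within a standard deviation of each other, the difference seen on one test split should not be read as strong evidence that either tree is better.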

8. Write a Python program to:
    
     ● Load the California Housing dataset from sklearn
     
     ● Train a Decision Tree Regressor
     
     ● Print the Mean Squared Error (MSE) and feature importances

In [3]:
# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# 2. Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# 4. Make predictions
y_pred = regressor.predict(X_test)

# 5. Evaluate using Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# 6. Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(housing.feature_names, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")

Mean Squared Error (MSE): 0.4952

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


What this program does
* Loads the California Housing dataset.
* Splits into training (80%) and test (20%) sets.
* Trains a Decision Tree Regressor.
* Evaluates the model using Mean Squared Error (MSE).
* Prints feature importances to show which housing features influence price predictions most.

9. Write a Python program to:
     
     ● Load the Iris Dataset
     
     ● Tune the Decision Tree’s max_depth and min_samples_split using
      GridSearchCV
     
     ● Print the best parameters and the resulting model accuracy

In [4]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Define parameter grid for tuning
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 3, 4, 5, 10]
}

# 4. Initialize Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# 5. Perform GridSearchCV (5-fold cross-validation)
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid,
                           cv=5, scoring="accuracy", n_jobs=-1)
grid_search.fit(X_train, y_train)

# 6. Best parameters
print("Best Parameters:", grid_search.best_params_)

# 7. Evaluate best model on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy:.2f}")

Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Test Set Accuracy: 0.98


What this does
* Loads the Iris dataset.
* Splits into training (70%) and test (30%).
* Defines a grid of hyperparameters (max_depth, min_samples_split).
* Uses GridSearchCV with 5-fold cross-validation to find the best combination.
* Prints the best parameters and evaluates the accuracy on the test set.

10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:

     ● Handle the missing values
     
     ● Encode the categorical features
     
     ● Train a Decision Tree model
     
     ● Tune its hyperparameters
     
     ● Evaluate its performance
       And describe what business value this model could provide in the real-world setting.


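 - Step-by-step process for building a disease-prediction Decision Tree (practical, healthcare focus)
  1. Understand the data & split correctly: inspect feature types and class balance, then hold out a stratified test set before any fitting so that no preprocessing step sees test data.
  2. Handling missing values: impute numeric features with the median and give categorical features an explicit "MISSING" category, fitting the imputers on the training data only.
  3. Encoding categorical features: one-hot encode them, ignoring categories that were not seen during training.
  4. Feature engineering & selection (brief): trees need no feature scaling, so this is mostly limited to dropping identifier-like columns, which can dominate splits.
  5. Training a Decision Tree model (practically): wrap preprocessing and the classifier in a single pipeline so the same transformations are applied at training and prediction time.
  6. Hyperparameter tuning: grid-search the criterion, max_depth, min_samples_split, min_samples_leaf, and class_weight with stratified 5-fold cross-validation.
  7. Evaluation (metrics & validation): report precision, recall, ROC AUC, and the confusion matrix on the held-out test set rather than accuracy alone, since a missed diagnosis and a false alarm carry different costs.
  8. Deployment & monitoring (short): track performance over time and retrain when the patient population or data collection changes.
  9. Business value (real-world benefits & caveats): earlier identification of at-risk patients for follow-up testing, with decision rules that clinicians can inspect; the model should support clinical judgment, not replace it.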

In [None]:
# Example pipeline: numeric impute (median) + categorical one-hot, DecisionTree, GridSearchCV
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

# X, y = your features and binary label (1 = disease, 0 = no disease)
# X is assumed to be a pandas DataFrame, since select_dtypes is used below
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# identify column lists
num_cols = X.select_dtypes(include="number").columns.tolist()  # all numeric dtypes, not only int64/float64
cat_cols = X.select_dtypes(include=["object","category"]).columns.tolist()

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    # trees don't need scaling, so we skip scaler
])

cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="MISSING")),
    # `sparse=False` was removed in scikit-learn 1.4; the default sparse output works with trees
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])

preproc = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols)
])

pipe = Pipeline([
    ("preproc", preproc),
    ("clf", DecisionTreeClassifier(random_state=42))
])

param_grid = {
    "clf__criterion": ["gini", "entropy"],
    "clf__max_depth": [3, 5, 7, None],
    "clf__min_samples_split": [2, 5, 10],
    "clf__min_samples_leaf": [1, 2, 5],
    "clf__class_weight": [None, "balanced"]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="roc_auc", n_jobs=-1)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
best = grid.best_estimator_

y_pred = best.predict(X_test)
y_proba = best.predict_proba(X_test)[:,1]
print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))