# ASSIGNMENT

1. What is a Decision Tree, and how does it work in the context of classification?
  - A Decision Tree is a supervised machine learning algorithm that creates a tree-like model of decisions based on feature values. It works by recursively splitting the dataset into subsets based on the most significant feature at each node, creating a hierarchical structure of decision rules.

  - In classification, the tree starts at the root node with the entire dataset and selects the best feature to split on (using metrics like Gini Impurity or Entropy). This process continues recursively until reaching leaf nodes that represent class labels. To classify a new instance, it traverses the tree from root to leaf following the decision rules, ultimately assigning it to a class.
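
The root-to-leaf traversal described above can be sketched with a small hand-written tree. This is only an illustration: the nested-dict layout, the `classify` helper, and the feature names and thresholds are assumptions made for the example, not how any library stores its trees.

```python
# A hand-written toy tree; the structure, feature names and thresholds are hypothetical.
toy_tree = {
    "feature": "petal_length",
    "threshold": 2.5,
    "left": {"label": "setosa"},            # taken when petal_length <= 2.5
    "right": {
        "feature": "petal_width",
        "threshold": 1.7,
        "left": {"label": "versicolor"},    # taken when petal_width <= 1.7
        "right": {"label": "virginica"},    # taken when petal_width > 1.7
    },
}


def classify(node, sample):
    """Walk from the root to a leaf, applying one decision rule per internal node."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


sample = {"petal_length": 4.8, "petal_width": 1.5}
predicted = classify(toy_tree, sample)
print(predicted)  # versicolor
```

Each internal node tests a single feature, so a prediction costs at most one comparison per level of the tree.
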
2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
  - Gini Impurity measures the probability of misclassifying a randomly chosen element if it were labeled at random according to the class distribution in the node. Formula: Gini = 1 - Σ(pi²), where pi is the proportion of class i in the node. It ranges from 0 (pure node) to 0.5 (maximum impurity for binary classification).

  - Entropy measures the disorder or uncertainty in the data. Formula: Entropy = -Σ(pi × log₂(pi)). It ranges from 0 (pure node) to 1 (maximum uncertainty for binary classification).

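  - Impact on splits: Both measures help determine the best feature and threshold to split on. The algorithm calculates the impurity reduction for each candidate split (called Information Gain when the measure is Entropy) and chooses the split that maximizes this reduction. Gini is slightly cheaper to compute because it avoids the logarithm. Entropy is sometimes said to produce slightly more balanced trees, though in practice the two criteria usually choose very similar splits.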
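
As a quick check of the two formulas, here is a minimal sketch that computes both measures from a list of class labels. The helper names (`class_probabilities`, `gini_impurity`, `entropy`) and the sample labels are assumptions made for this example.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Proportion of each class among the labels in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))


def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)); classes with zero count never appear."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))


balanced = ["a", "a", "b", "b"]   # 50/50 binary node: maximum impurity
pure = ["a", "a", "a", "a"]       # single-class node: zero impurity

print(gini_impurity(balanced), entropy(balanced))  # 0.5 1.0
print(gini_impurity(pure), entropy(pure))          # 0.0 0.0
```
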
3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
  - Pre-Pruning (Early Stopping): Stops tree growth during construction by setting constraints like max_depth, min_samples_split, or min_samples_leaf. The tree building stops when these conditions are met.

  - Post-Pruning: Allows the tree to grow fully, then removes branches that provide little predictive power by working backward from leaf nodes.

  - Advantages:

    - Pre-Pruning: Computationally efficient as it prevents unnecessary tree growth, saving time and memory during training.
    - Post-Pruning: Often generalizes better because it examines the full tree before deciding what to remove, so it does not stop before a useful later split is found.
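
The two strategies can be compared side by side in scikit-learn. The sketch below uses cost-complexity pruning (`ccp_alpha`) as one form of post-pruning. The constraint values and the choice of a mid-range alpha are arbitrary, made only for illustration; in practice alpha would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth stops as soon as a constraint would be violated.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune it back.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary mid-range alpha, for illustration only.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: leaves={model.get_n_leaves()}, test accuracy={model.score(X_test, y_test):.4f}")
```
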
4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?
  - Information Gain measures the reduction in entropy (or impurity) achieved by splitting the dataset on a particular feature. Formula: IG = Entropy(parent) - Weighted Average of Entropy(children).

  - It's important because it quantifies how much a feature reduces uncertainty about the class labels. The algorithm selects the split with the highest Information Gain at each node, so each split is the best one available locally. This greedy approach does not guarantee a globally optimal tree, but it tends to yield compact trees that reach good classification accuracy in few splits.
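
A small worked example makes the formula concrete. The labels and the two candidate splits below are made up for illustration, and the helper names are assumptions. The first split separates the classes perfectly and the second leaves each child as mixed as the parent.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """IG = Entropy(parent) - weighted average of Entropy(children)."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

ig_perfect = information_gain(parent, perfect_split)
ig_useless = information_gain(parent, useless_split)
print(ig_perfect, ig_useless)  # 1.0 0.0
```

A tree builder would evaluate every candidate split this way and keep the one with the largest gain.
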
5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
  - Applications:

    - Medical diagnosis (disease prediction)
    - Credit risk assessment in banking
    - Customer churn prediction
    - Fraud detection
    - Marketing campaign targeting
  - Advantages:

    - Easy to understand and interpret (white-box model)
    - Requires minimal data preprocessing
    - Handles both numerical and categorical data
    - Non-parametric (no assumptions about data distribution)
    - Feature importance ranking
  - Limitations:
    - Prone to overfitting, especially with deep trees
    - Unstable - small data changes can create very different trees
    - Biased toward features with many levels
    - Can create overly complex trees that don't generalize well
    - Not ideal for capturing linear relationships, which axis-aligned splits can only approximate in steps.

In [None]:
# 6. Python program - Iris Dataset with Gini criterion
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


iris = load_iris()
X = iris.data
y = iris.target


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)


y_pred = clf.predict(X_test)


accuracy = accuracy_score(y_test, y_pred)


print("Decision Tree Classifier (Gini Criterion)")
print("=" * 50)
print(f"Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print("\nFeature Importances:")
for i, importance in enumerate(clf.feature_importances_):
    print(f"{iris.feature_names[i]}: {importance:.4f}")

In [None]:
# 7. Python program - Compare max_depth=3 vs fully-grown tree
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


iris = load_iris()
X = iris.data
y = iris.target


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)


clf_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)
y_pred_pruned = clf_pruned.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)


print("Decision Tree Comparison")
print("=" * 50)
print("Fully-grown Tree:")
print(f"  - Depth: {clf_full.get_depth()}")
print(f"  - Number of leaves: {clf_full.get_n_leaves()}")
print(f"  - Accuracy: {accuracy_full:.4f} ({accuracy_full*100:.2f}%)")
print()
print("Pruned Tree (max_depth=3):")
print(f"  - Depth: {clf_pruned.get_depth()}")
print(f"  - Number of leaves: {clf_pruned.get_n_leaves()}")
print(f"  - Accuracy: {accuracy_pruned:.4f} ({accuracy_pruned*100:.2f}%)")
print()
print(f"Accuracy difference: {abs(accuracy_full - accuracy_pruned):.4f}")

In [None]:
# 8. Python program - Decision Tree Regressor on the California Housing dataset
# (used in place of Boston Housing, which was removed from scikit-learn in version 1.2)
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np


housing = fetch_california_housing()
X = housing.data
y = housing.target


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)


y_pred = regressor.predict(X_test)


mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)


print("Decision Tree Regressor - California Housing Dataset")
print("=" * 50)
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print("\nFeature Importances:")
for i, importance in enumerate(regressor.feature_importances_):
    print(f"{housing.feature_names[i]}: {importance:.4f}")

In [None]:
# 9. Python program - GridSearchCV for hyperparameter tuning
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score


iris = load_iris()
X = iris.data
y = iris.target


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


param_grid = {
    'max_depth': [2, 3, 4, 5, 6, 7, 8, None],
    'min_samples_split': [2, 5, 10, 15, 20],
    'criterion': ['gini', 'entropy']
}


clf = DecisionTreeClassifier(random_state=42)


grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)


print("Performing Grid Search...")
grid_search.fit(X_train, y_train)


best_params = grid_search.best_params_
best_score = grid_search.best_score_


best_clf = grid_search.best_estimator_
y_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)


print("\nGrid Search Results")
print("=" * 50)
print("Best Parameters:")
for param, value in best_params.items():
    print(f"  - {param}: {value}")
print(f"\nBest Cross-Validation Accuracy: {best_score:.4f} ({best_score*100:.2f}%)")
print(f"Test Set Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")

10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
  - Step-by-Step Process:
  - 1. Handle Missing Values:

    - Analysis: Identify missing data patterns using df.isnull().sum() and visualizations
    - Strategies:
      - Numerical features: Impute with the mean or median (median for skewed data), or use KNN imputation
      - Categorical features: Impute with the mode or create an "Unknown" category
      - Consider MICE (Multiple Imputation by Chained Equations) for complex patterns
      - If a feature is mostly missing (more than about 40% is a common rule of thumb), consider dropping it
    - Document all imputation decisions
  - 2. Encode Categorical Features:

    - Ordinal (Label) Encoding: For ordinal variables, mapping categories to integers in their natural order (e.g., severity: low < medium < high)
    - One-Hot Encoding: For nominal variables with few categories (e.g., gender, blood type)
    - Target Encoding: For high-cardinality features (e.g., zip codes); fit it on the training folds only to avoid target leakage
    - Handle rare categories by grouping them into "Other"
    - Decision Trees split on thresholds, so encoded features need no scaling
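
Steps 1 and 2 can be combined into one preprocessing object. The sketch below is a minimal example on a made-up four-row table: the column names, values, and the choice of imputation strategy per column are all assumptions for illustration, not a recommendation for real patient data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy patient records (invented columns and values).
df = pd.DataFrame({
    "age": [54, 61, np.nan, 47],
    "blood_pressure": [130.0, np.nan, 118.0, 142.0],
    "blood_type": ["A", "O", np.nan, "B"],
    "severity": ["low", "high", "medium", np.nan],
})

# Step 1 analysis: count missing values per column.
print(df.isnull().sum())

preprocess = ColumnTransformer(
    [
        # Numerical: median imputation.
        ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
        # Nominal: fill with an explicit "Unknown" category, then one-hot encode.
        ("nom", Pipeline([
            ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["blood_type"]),
        # Ordinal: fill with the mode, then map to integers in a stated order.
        ("ord", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OrdinalEncoder(categories=[["low", "medium", "high"]])),
        ]), ["severity"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_prepared = preprocess.fit_transform(df)
print(X_prepared.shape)
```

On real data the transformer would be fitted on the training split only and then applied to the test split, so that no statistics leak from the test set.
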

In [None]:
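# 3. Train Decision Tree Model:
# X and y are assumed to be the preprocessed feature matrix and disease labels from steps 1-2.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split


# Stratify so that both splits keep the same disease prevalence.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


# class_weight='balanced' reweights classes inversely to their frequency.
clf = DecisionTreeClassifier(
    criterion='gini',
    random_state=42,
    class_weight='balanced'
)
clf.fit(X_train, y_train)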

  - 4. Tune Hyperparameters:
    - Use GridSearchCV or RandomizedSearchCV
    - Key parameters to tune:
      - max_depth: Control tree depth (e.g., 3-10)
      - min_samples_split: Minimum samples to split (e.g., 2-20)
      - min_samples_leaf: Minimum samples in leaf (e.g., 1-10)
      - max_features: Features to consider for splits
      - class_weight: Handle imbalanced classes
    - Use cross-validation (5-10 folds) for robust evaluation
  - 5. Evaluate Performance:
    - Metrics for healthcare (where false negatives are costly):
      - Accuracy: Overall correctness; can be misleading when the disease is rare
      - Recall/Sensitivity: Critical for disease detection (minimize false negatives)
      - Precision: Minimize false positives
      - F1-Score: Balance between precision and recall
      - ROC-AUC: Overall discriminative ability
      - Confusion Matrix: Detailed error analysis
    - Perform cross-validation for stability assessment
    - Test on holdout set for final evaluation
    - Analyze feature importances for clinical insights
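
To show how the metrics in step 5 relate to the confusion matrix, here is a minimal sketch that derives them by hand. The label vectors are invented for illustration (1 = disease present, 0 = disease absent). In a real project the equivalent functions in `sklearn.metrics` would normally be used instead.

```python
# Hypothetical ground truth and predictions for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # sick, flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # sick, missed (the costly error)
fp = sum(t == 0 and p == 1 for t, p in pairs)  # healthy, flagged
tn = sum(t == 0 and p == 0 for t, p in pairs)  # healthy, cleared

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)        # share of sick patients that were caught
precision = tp / (tp + fp)     # share of flagged patients that are sick
f1 = 2 * precision * recall / (precision + recall)

print(f"Confusion matrix: TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print(f"Accuracy={accuracy:.2f}, Recall={recall:.2f}, Precision={precision:.2f}, F1={f1:.2f}")
```

Here one of the four sick patients is missed, which recall (0.75) shows directly.
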