# 1. What is a Decision Tree, and how does it work in the context of classification?
A Decision Tree is a supervised machine learning model that predicts a target by applying a sequence of if/else tests on feature values.
In classification, training recursively partitions the dataset: at each node the algorithm picks the feature and threshold that best separate the classes, and it stops when a node is pure or a stopping rule (such as a maximum depth) is met.
Each internal node holds a decision rule, each branch is an outcome of that rule, and each leaf assigns a class label (typically the majority class of the training samples that reach it).
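
To make the node/branch/leaf vocabulary concrete, here is a minimal sketch of a hand-written tree and the recursive lookup that classifies a sample. The feature names and thresholds are illustrative assumptions, not values learned from data.

```python
# A toy tree written by hand: internal nodes hold a rule, leaves hold a label.
# Feature names and thresholds are illustrative assumptions, not learned values.
toy_tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, sample):
    """Follow decision rules from the given node until a leaf is reached."""
    if "label" in node:  # leaf node: return its class label
        return node["label"]
    # Internal node: take the left branch if the rule holds, else the right.
    branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
    return predict(node[branch], sample)

print(predict(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(toy_tree, {"petal_length": 4.5, "petal_width": 1.4}))  # versicolor
print(predict(toy_tree, {"petal_length": 5.8, "petal_width": 2.2}))  # virginica
```

A trained model has the same shape; the difference is that the rules are chosen by the learning algorithm rather than written by hand.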

# 2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
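- **Gini Impurity**: The probability that a randomly chosen element of a node would be misclassified if it were labeled at random according to the node's class distribution.
  Formula: Gini = 1 - Σ(pᵢ²), where pᵢ is the proportion of class i in the node.
- **Entropy**: The average amount of information (in bits) needed to identify the class of an element in the node; it measures how mixed the classes are.
  Formula: Entropy = -Σ(pᵢ * log₂(pᵢ))
Both measures are 0 for a pure node and largest when the classes are evenly mixed.
Impact: At each node the algorithm selects the split that minimizes the weighted impurity of the child nodes, which is the same as maximizing Information Gain. In practice the two measures usually choose similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.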
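
The two formulas can be checked by hand with a short sketch. The helper names `class_proportions`, `gini`, and `entropy` are made up for this example and are not part of scikit-learn.

```python
from collections import Counter
from math import log2

def class_proportions(labels):
    """Return the proportion p_i of each class present in the node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini(labels):
    # Gini = 1 - sum(p_i^2)
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    # sum(p_i * log2(1 / p_i)) is the same quantity as -sum(p_i * log2(p_i)).
    return sum(p * log2(1 / p) for p in class_proportions(labels))

mixed = ["a", "a", "b", "b"]   # evenly mixed node: maximum impurity for 2 classes
pure = ["a", "a", "a", "a"]    # pure node: zero impurity

print(gini(mixed), entropy(mixed))  # 0.5 1.0
print(gini(pure), entropy(pure))    # 0.0 0.0
```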

# 3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
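- Pre-Pruning: Stops tree growth early by setting constraints such as max_depth, min_samples_split, or min_samples_leaf.
  Advantage: Limits overfitting and reduces training time, because the full tree is never built.
- Post-Pruning: Grows a full tree first, then removes branches that do not improve performance on held-out data (for example, cost-complexity pruning, which scikit-learn exposes through ccp_alpha).
  Advantage: Often generalizes better, because a split is judged only after the splits beneath it have been seen.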
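
Both strategies can be sketched with scikit-learn on the Iris data. Cost-complexity pruning stands in for post-pruning here, and the choice of the middle alpha on the pruning path is an arbitrary assumption for illustration; in practice alpha would be selected by cross-validation. The `_p` suffixed variable names are specific to this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_p, y_p = load_iris(return_X_y=True)
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(
    X_p, y_p, test_size=0.2, random_state=42
)

# Pre-pruning: constrain growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=5, random_state=42
).fit(X_train_p, y_train_p)

# Post-pruning: grow the full tree, then prune it back with a cost-complexity
# penalty (ccp_alpha). The pruning path lists the alphas at which nodes collapse.
unpruned = DecisionTreeClassifier(random_state=42).fit(X_train_p, y_train_p)
path = unpruned.cost_complexity_pruning_path(X_train_p, y_train_p)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary pick for the demo
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train_p, y_train_p)

for name, model in [("unpruned", unpruned), ("pre-pruned", pre_pruned),
                    ("post-pruned", post_pruned)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", model.score(X_test_p, y_test_p))
```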

# 4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?
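Information Gain measures how much a split reduces impurity.
Formula: IG = Impurity(parent) - Σ (n_child / n_parent) * Impurity(child), i.e. the parent's impurity minus the size-weighted average impurity of the children.
Importance: A higher IG means the split creates more homogeneous child nodes. The tree is built greedily: at every node it evaluates the candidate splits and keeps the one with the highest IG.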
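
A small worked example, using entropy as the impurity measure, shows how the formula ranks two candidate splits of the same parent node. The helper names and the toy labels are assumptions made for this sketch.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits; log2(total / count) equals -log2(p) for p = count / total."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent impurity minus the size-weighted average impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5                           # entropy = 1 bit
mixed_split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]  # children still mixed
perfect_split = [["yes"] * 5, ["no"] * 5]                   # children are pure

print(round(information_gain(parent, mixed_split), 3))    # 0.278
print(round(information_gain(parent, perfect_split), 3))  # 1.0
```

The perfect split removes all impurity, so it has the larger gain and would be the one selected.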

# 5. What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
Applications:
- Medical diagnosis
- Credit risk scoring
- Fraud detection
- Customer segmentation
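Advantages:
- Easy to interpret and visualize
- Handles both numerical and categorical data (scikit-learn's implementation requires categorical features to be encoded as numbers first)
Limitations:
- Prone to overfitting when grown without depth limits or pruning
- Unstable: small changes in the training data can produce a very different tree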

# 6. Python program: Load Iris Dataset, train Decision Tree Classifier using Gini criterion, print accuracy and feature importances.
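from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the data and hold out 20% of it for testing.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# random_state fixes tie-breaking between equally good splits, so results are reproducible.
clf_gini = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_gini.fit(X_train, y_train)

print("Accuracy:", clf_gini.score(X_test, y_test))
# Pair each importance with its feature name so the output is readable.
print("Feature Importances:")
for name, importance in zip(iris.feature_names, clf_gini.feature_importances_):
    print(f"  {name}: {importance:.3f}")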

# 7. Python program: Train Decision Tree Classifier with max_depth=3 and compare accuracy to fully-grown tree.
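# Reuses the Iris train/test split from program 6.
# Depth-limited (pre-pruned) tree versus a tree grown until its leaves are pure.
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)

clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)

print("Accuracy (max_depth=3):", clf_depth3.score(X_test, y_test))
print("Accuracy (full tree):", clf_full.score(X_test, y_test))
# The depth of the full tree shows how much the constraint restricted growth.
print("Depth (full tree):", clf_full.get_depth())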

# 8. Python program: Load California Housing dataset, train Decision Tree Regressor, print MSE and feature importances.
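from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# The dataset is downloaded on first use and cached locally afterwards.
housing = fetch_california_housing()
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)

# random_state makes the fitted tree reproducible.
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train_h, y_train_h)
y_pred_h = reg.predict(X_test_h)

print("MSE:", mean_squared_error(y_test_h, y_pred_h))
# Pair each importance with its feature name so the output is readable.
print("Feature Importances:")
for name, importance in zip(housing.feature_names, reg.feature_importances_):
    print(f"  {name}: {importance:.3f}")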

# 9. Python program: Tune Decision Tree max_depth and min_samples_split using GridSearchCV, print best parameters and accuracy.
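from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [2, 3, 4, 5, None], 'min_samples_split': [2, 5, 10]}
# 5-fold cross-validation on the Iris training split from program 6.
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
# best_score_ is the mean cross-validated accuracy, not the test-set accuracy.
print("Best CV Accuracy:", grid.best_score_)
print("Test Accuracy:", grid.score(X_test, y_test))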

# 10. Imagine you’re working as a data scientist for a healthcare company...
Step-by-step approach:
1) Handle Missing Values:
   - For numerical: Fill with the median.
   - For categorical: Fill with the mode.
   - Compute the fill values on the training split only, so no information leaks from the test data.
2) Encode Categorical Features:
   - Use one-hot encoding for nominal variables.
3) Train a Decision Tree Model:
   - Use DecisionTreeClassifier with default parameters as a baseline.
4) Hyperparameter Tuning:
   - Use GridSearchCV to tune max_depth, min_samples_split, and criterion.
5) Evaluate Performance:
   - Use Accuracy, Precision, Recall, F1-score, and ROC-AUC on a held-out test set.
   - Pay particular attention to Recall, because a missed at-risk patient (false negative) is usually the costlier error.
Business Value:
- The model can quickly identify at-risk patients.
- Enables targeted treatment, reduces healthcare costs, and improves patient outcomes.
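
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data is an assumption: the table is synthetic, the column names (`age`, `blood_pressure`, `smoker`) and the rule that generates the label are hypothetical stand-ins for real patient records, and tuning on F1 is one possible choice rather than a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; column names and the label rule are hypothetical.
rng = np.random.default_rng(0)
n = 400
patients = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": pd.Series(rng.choice(["yes", "no"], n), dtype=object),
})
has_disease = ((patients["age"] > 60) | (patients["smoker"] == "yes")).astype(int)
# Inject missing values so the imputation steps have something to do.
patients.loc[rng.choice(n, 40, replace=False), "age"] = np.nan
patients.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

# Steps 1-2: median/mode imputation, then one-hot encoding of categoricals.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: the tree sits at the end of the pipeline, so imputation values are
# learned from training folds only.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

patients_train, patients_test, label_train, label_test = train_test_split(
    patients, has_disease, test_size=0.2, stratify=has_disease, random_state=42
)

# Step 4: grid search over the three hyperparameters named in the text.
param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_split": [2, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(patients_train, label_train)

# Step 5: evaluate on the held-out test set.
label_pred = search.predict(patients_test)
label_prob = search.predict_proba(patients_test)[:, 1]
test_auc = roc_auc_score(label_test, label_prob)
print("Best Parameters:", search.best_params_)
print(classification_report(label_test, label_pred))
print("ROC-AUC:", test_auc)
```

Keeping preprocessing inside the pipeline means each cross-validation fold recomputes the median and mode from its own training portion, which is what prevents leakage during tuning.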


