# Decision Tree | Assignment
Assignment Code: DA-AG-012

1. What is a Decision Tree, and how does it work in the context of classification?
   - A Decision Tree is a supervised machine learning model that makes predictions by repeatedly splitting data on feature values, forming a hierarchical tree. Each internal node tests a feature, each branch represents an outcome of that test, and each leaf node holds a final class label. In classification, the algorithm recursively selects the split that most reduces an impurity measure such as Gini or Entropy, so that the resulting subsets contain mostly one class. To classify a sample, the model starts at the root node and follows the branches that match the sample's feature values until it reaches a leaf, whose label is the predicted class. Decision Trees are intuitive and easy to visualize, and they can handle both numerical and categorical data (some implementations, including scikit-learn, require categorical features to be encoded as numbers first), which makes them suitable for many classification tasks.
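
The root-to-leaf structure described above can be inspected directly. The following is a minimal sketch, assuming scikit-learn is installed; the Iris dataset and `max_depth=2` are choices made here only to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Each "|---" line with a threshold is an internal-node test;
# lines containing "class:" are leaves.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Predicting one sample follows a single path from the root to a leaf.
sample = iris.data[:1]
predicted_class = iris.target_names[clf.predict(sample)[0]]
print("Predicted class:", predicted_class)
```

Reading the printed rules from top to bottom for a given sample reproduces the same path that `predict` takes internally.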

2. Explain Gini Impurity and Entropy as impurity measures and how they impact splits in a Decision Tree.
   - Gini Impurity and Entropy are impurity metrics used in Decision Trees to evaluate how well a split separates the dataset into homogeneous groups. Gini Impurity, 1 - sum(p_i^2), is the probability of misclassifying a random sample from a node if it is labeled at random according to the node's class distribution. Entropy, -sum(p_i * log2(p_i)), comes from information theory and measures the uncertainty of the class labels within the node. Here p_i is the fraction of samples in the node that belong to class i. Both metrics are 0 for a pure node and largest when the classes are evenly mixed, so both favor splits that create child nodes with purer class distributions. The algorithm computes the chosen metric for each candidate split and selects the one with the largest impurity reduction, weighted by the number of samples in each child. In practice the two measures usually select similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.
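
As a small worked example of the two measures, the sketch below computes them with NumPy for a pure node and an evenly mixed node. The helper names `gini` and `entropy` are illustrative and not part of any library.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the classes present in the node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: sum(p_i * log2(1 / p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

pure_node = [1, 1, 1, 1]    # one class only
mixed_node = [0, 0, 1, 1]   # two classes, evenly mixed

print("Pure node  -> Gini:", gini(pure_node), "Entropy:", entropy(pure_node))
print("Mixed node -> Gini:", gini(mixed_node), "Entropy:", entropy(mixed_node))
```

For two classes, the evenly mixed node is the worst case: Gini reaches 0.5 and Entropy reaches 1 bit, while the pure node scores 0 on both.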

3. What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of each.
   - Pre-Pruning stops tree growth early, before the tree becomes too complex, by applying constraints such as a maximum depth or a minimum number of samples per split. Its main practical advantage is reduced training time, since the deeper branches are never built, and it also limits overfitting. Post-Pruning, on the other hand, allows the tree to grow fully and then removes branches whose removal does not hurt validation performance (or whose contribution does not justify their complexity). Its practical advantage is that pruning decisions are made with the full tree in view, so a split that only pays off after a further split is not discarded too early. The result is a simpler final model without over-specialized nodes, which improves stability and interpretability.
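
The two strategies can be compared side by side. The sketch below assumes scikit-learn, where post-pruning is available as minimal cost-complexity pruning through the `ccp_alpha` parameter; this is one specific post-pruning technique, and the Iris data, split, and constraint values are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: constraints stop growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree, then list candidate pruning strengths.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit once per alpha and keep the one with the best validation accuracy.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    candidate = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    score = candidate.fit(X_train, y_train).score(X_val, y_val)
    if score >= best_score:  # on ties, prefer the larger alpha (smaller tree)
        best_alpha, best_score = alpha, score

post_pruned = DecisionTreeClassifier(
    ccp_alpha=best_alpha, random_state=42
).fit(X_train, y_train)

print("Full tree leaves:  ", full_tree.get_n_leaves())
print("Pre-pruned leaves: ", pre_pruned.get_n_leaves())
print("Post-pruned leaves:", post_pruned.get_n_leaves())
```

In a real project the validation split used to pick `ccp_alpha` should be separate from the final test set, or the choice should be made with cross-validation.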

4. What is Information Gain in Decision Trees, and why is it important for choosing the best split?
   - Information Gain measures how much a split reduces impurity in a Decision Tree. It is the difference between the entropy of the parent node and the weighted average entropy of the child nodes after the split, where each child is weighted by its share of the parent's samples. In other words, it quantifies how much uncertainty about the class labels is removed when a feature is used for splitting. A higher Information Gain indicates that the feature separates the classes more cleanly. During tree construction, the algorithm calculates Information Gain for all candidate splits at a node and selects the one with the highest value, so that each node makes the most informative decision available. This greedy choice tends to produce a compact, discriminative tree, although it does not guarantee the globally optimal tree.
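
The calculation above can be sketched in a few lines of NumPy. The function names and the toy label arrays are illustrative only.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of the class labels in one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, left, right):
    """Parent entropy minus the sample-weighted entropy of the two children."""
    n = len(parent)
    weighted_child_entropy = (
        len(left) / n * entropy(left) + len(right) / n * entropy(right)
    )
    return entropy(parent) - weighted_child_entropy

parent = np.array([0, 0, 0, 1, 1, 1])

# A split that separates the classes completely.
perfect_gain = information_gain(parent, parent[:3], parent[3:])

# A split that leaves both children mixed.
poor_gain = information_gain(parent, np.array([0, 0, 1]), np.array([0, 1, 1]))

print("Gain of perfect split:", perfect_gain)
print("Gain of poor split:   ", poor_gain)
```

The perfect split removes all of the parent's uncertainty (1 bit), while the poor split removes only a small fraction of it, so the algorithm would choose the first.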

5. Real-world applications of Decision Trees and their main advantages and limitations.
   - Decision Trees are widely used in real-world applications such as medical diagnosis, financial risk assessment, credit scoring, fraud detection, customer churn prediction, and recommendation systems because they mimic human decision-making and work with both numerical and categorical data. Their main advantages include high interpretability, implicit feature selection (uninformative features are simply never chosen for a split), no need for feature scaling or normalization, and the ability to capture non-linear relationships and feature interactions. However, they also have limitations: a tendency to overfit when grown without constraints, high sensitivity to small changes in the training data, and piecewise-constant predictions that approximate smooth relationships poorly and can fit noise unless the tree is pruned. Despite these drawbacks, their transparency and ease of deployment make them a common choice in business environments where decisions must be explained.

In [None]:
#6. Python program: Train Decision Tree Classifier using Gini and print accuracy & feature importances.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

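# Load the Iris dataset and hold out 30% of it for testing.
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fix random_state so ties between equally good splits are broken reproducibly.
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))

# Pair each importance with its feature name so the output is readable.
print("Feature Importances:")
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"  {name}: {importance:.3f}")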


In [None]:
#7 Python program: Decision Tree with max_depth=3 vs fully-grown tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

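# Fully-grown tree: no depth limit, so it splits until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
full_acc = accuracy_score(y_test, full_tree.predict(X_test))

# Depth-limited tree: pre-pruned to at most 3 levels of splits.
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)
limited_acc = accuracy_score(y_test, limited_tree.predict(X_test))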

print("Fully-grown Tree Accuracy:", full_acc)
print("Depth-limited Tree Accuracy:", limited_acc)


In [None]:
#8 Python program: Decision Tree Regressor on Boston Housing dataset with MSE.

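import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# load_boston was removed in scikit-learn 1.2. This workaround (the one given
# in scikit-learn's deprecation notice) reads the original file instead;
# it needs network access, and each record spans two lines in that file.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fix random_state so the fitted tree is reproducible.
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("Feature Importances:", model.feature_importances_)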


In [None]:
#9 Python program: GridSearchCV tuning of max_depth & min_samples_split.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

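# Try every combination of depth and split size with 5-fold cross-validation.
params = {'max_depth': [2, 3, 4, 5, 6], 'min_samples_split': [2, 3, 4, 5]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=5)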
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
best_model = grid.best_estimator_
print("Best Model Accuracy:", accuracy_score(y_test, best_model.predict(X_test)))


10. Step-by-step process for disease prediction using Decision Tree and business value.
    - To build a Decision Tree model for disease prediction, I would first split the dataset into training and testing sets, so that every later step is fitted on the training data only and the evaluation stays unbiased. Next, I would handle missing values, using mean/median imputation for numerical features and mode imputation or a dedicated "missing" category for categorical features, so that no records are discarded. I would then encode categorical variables with One-Hot Encoding (or Label/Ordinal Encoding where the categories have a natural order) so that the model can use non-numeric information. After that, I would train a Decision Tree classifier and tune hyperparameters such as max_depth, min_samples_split, and criterion using GridSearchCV to reduce overfitting and maximize predictive performance. Finally, I would evaluate the model on the test set using accuracy, precision, recall, and the confusion matrix, paying particular attention to recall because a missed diagnosis is usually the costliest error. In real-world healthcare settings, such a model can provide business value by supporting faster early diagnosis, helping prioritize resources, reducing treatment delays, and thereby improving patient outcomes while lowering overall medical costs.
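
The steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not a clinical model: the patient table is synthetic, and the column names (`age`, `blood_pressure`, `smoker`) and the rule that generates the labels are hypothetical, invented here only so the example runs.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: columns and label rule are made up for illustration.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)

# Simulate missing values in one numeric and one categorical column.
df.loc[rng.choice(n, 30, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

# Split first so imputation and encoding are learned from training data only.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

numeric = Pipeline([("impute", SimpleImputer(strategy="median"))])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "blood_pressure"]),
    ("cat", categorical, ["smoker"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Tune the tree inside the pipeline; recall is scored because missed
# diagnoses are assumed to be the costliest error in this scenario.
param_grid = {
    "tree__max_depth": [2, 3, 4, 5],
    "tree__min_samples_split": [2, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="recall")
grid.fit(X_train, y_train)

pred = grid.predict(X_test)
print("Best parameters:", grid.best_params_)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
```

Keeping imputation and encoding inside the pipeline means GridSearchCV refits them on each training fold, which avoids leaking information from the validation folds into preprocessing.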