In [None]:
# Question 1:  What is a Decision Tree, and how does it work in the context of classification?

"""
A Decision Tree is a supervised machine learning algorithm that can be used for both classification and regression tasks. It works by recursively partitioning the data into smaller subsets based on the values of input features. The goal is to create a tree-like model of decisions and their possible consequences.

In the context of classification, a Decision Tree works as follows:

Splitting the Data: The algorithm starts with the entire dataset at the root node. It then evaluates different features and their possible split points to find the one that best separates the data into distinct classes. The 'best' split is typically determined by metrics like Gini impurity or entropy, which measure the homogeneity of the classes within each subset.

Creating Nodes: Based on the chosen split, the data is divided into two or more subsets, and child nodes are created for each subset. This process is repeated recursively for each child node, essentially building the tree downwards.

Decision Rules: Each internal node in the tree represents a test on an attribute (feature), and each branch represents the outcome of that test. For example, a node might test if 'Age > 30'.

Leaf Nodes: The recursion stops when a node contains data that is sufficiently 'pure' (i.e., mostly belongs to a single class), or when other stopping criteria are met (e.g., maximum depth of the tree, minimum number of samples per leaf). These final nodes are called leaf nodes, and each leaf node represents a class label or a probability distribution over the classes.

Classification: To classify a new data point, you start at the root of the tree and traverse down the branches by answering the questions at each internal node based on the data point's feature values. Eventually, you reach a leaf node, and the class associated with that leaf node is the predicted class for the new data point.

Key Characteristics:

Interpretability: Decision trees are easy to understand and interpret, as their structure mirrors human decision-making.
Non-parametric: They don't make assumptions about the underlying distribution of the data.
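Handles both numerical and categorical data: The algorithm itself can split on either type, although scikit-learn's implementation expects categorical features to be numerically encoded first.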
Prone to overfitting: Without proper pruning or limiting the tree's depth, decision trees can easily overfit the training data, leading to poor generalization on unseen data.
Sensitivity to data variations: Small changes in the data can lead to a completely different tree structure.

"""

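The classification step described in Question 1 (start at the root, answer each node's test, stop at a leaf) can be sketched in plain Python. The tree below is hand-built for illustration: the feature names `age` and `income`, the thresholds, and the labels are all assumptions, not something learned from data.

```python
# Hypothetical hand-built tree; features, thresholds and labels are illustrative only.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"label": "no"},             # taken when age <= 30
    "right": {                            # taken when age > 30
        "feature": "income", "threshold": 50000,
        "left": {"label": "no"},         # income <= 50000
        "right": {"label": "yes"},       # income > 50000
    },
}


def classify(node, sample):
    """Walk from the root to a leaf and return that leaf's class label."""
    while "label" not in node:  # internal nodes carry a test, leaves carry a label
        goes_right = sample[node["feature"]] > node["threshold"]
        node = node["right"] if goes_right else node["left"]
    return node["label"]


prediction = classify(tree, {"age": 42, "income": 65000})
print(prediction)
```

A sample with age 42 and income 65000 passes both tests and ends in the "yes" leaf; any sample with age 30 or below stops at the first "no" leaf regardless of income.
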
In [None]:
# Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

"""

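Gini Impurity and Entropy are two common metrics used in Decision Tree algorithms to measure the 'impurity' or 'disorder' of a set of samples. For a node whose samples fall into classes with proportions p_1, ..., p_k:

Gini Impurity = 1 - sum(p_i^2). It is the probability of mislabeling a randomly chosen sample if it were labeled at random according to the node's class distribution. It is 0 for a pure node and 0.5 for a 50/50 two-class node.

Entropy = -sum(p_i * log2(p_i)). It measures the uncertainty of the class distribution in bits. It is 0 for a pure node and 1 for a 50/50 two-class node.

Impact on splits: for every candidate split, the algorithm computes the size-weighted average impurity of the resulting child nodes and chooses the split that reduces impurity the most, leading to more homogeneous child nodes. The two measures usually select similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.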


"""

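The two impurity measures from Question 2, and the way they rank candidate splits, can be sketched in a few lines of plain Python. The helper names (`proportions`, `weighted_impurity`) and the toy label lists are assumptions made for this illustration.

```python
from collections import Counter
from math import log2


def proportions(labels):
    """Class proportions p_i of a list of labels."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    return sum(-p * log2(p) for p in proportions(labels))


def weighted_impurity(children, impurity):
    """Size-weighted average impurity of the child nodes of a split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * impurity(child) for child in children)


# Two hypothetical candidate splits of the same eight samples.
split_a = [["a", "a", "a", "b"], ["a", "b", "b", "b"]]  # children still mixed
split_b = [["a", "a", "a", "a"], ["b", "b", "b", "b"]]  # children pure

for name, split in [("split_a", split_a), ("split_b", split_b)]:
    print(name,
          round(weighted_impurity(split, gini), 4),
          round(weighted_impurity(split, entropy), 4))
```

Both measures score `split_b` at zero impurity and `split_a` above zero, so a tree using either criterion would prefer `split_b`.
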
In [None]:
# Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

"""
Pre-Pruning and Post-Pruning both reduce the complexity of a Decision Tree to prevent it from overfitting the training data. They differ in when the pruning happens:

Pre-Pruning (Early Stopping): Stops the tree's growth during construction, using criteria such as maximum depth, minimum samples per split, or minimum impurity reduction to decide whether a node may be split further.
Practical advantage: lower training time and computational cost, because branches that would be discarded are never grown.

Post-Pruning (Backward Pruning): Grows a complete (or nearly complete) tree first, then removes branches or subtrees whose removal improves performance on a validation set or simplifies the model without hurting it.
Practical advantage: it often generalizes better to unseen data, because every split is explored before those that do not generalize are removed.


"""

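Both pruning styles from Question 3 can be tried in scikit-learn. This is a minimal sketch under a few assumptions: the pre-pruning limits (`max_depth=3`, `min_samples_split=10`) are arbitrary, and post-pruning uses scikit-learn's minimal cost-complexity pruning (`ccp_alpha`), which is one post-pruning method rather than the validation-set procedure described above. The alpha is simply taken from the middle of the pruning path; in practice it would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: a fully grown tree with no pruning.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: growth is stopped early by the constructor limits.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute the pruning path of the full tree, then refit with a
# chosen alpha (the middle of the path here, purely for illustration).
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(
    X_train, y_train
)

for name, model in [("full", full), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: nodes={model.tree_.node_count}, "
          f"test accuracy={model.score(X_test, y_test):.4f}")
```

Comparing the node counts shows how much each strategy simplifies the tree relative to the fully grown baseline.
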
In [None]:
# Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

"""
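Information Gain measures how much the uncertainty (entropy) of the target variable decreases when the data is split on a particular feature:

Information Gain = Entropy(parent) - weighted average of Entropy(children), where each child is weighted by its share of the parent's samples.

It matters because the tree is built greedily: at each node the algorithm evaluates the candidate splits and chooses the one with the highest Information Gain, i.e. the split that produces the purest child nodes.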

"""

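Information Gain from Question 4 is parent entropy minus the size-weighted entropy of the children, which can be worked through on a toy example. The label lists below are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with four samples of each class, and two ways to split it.
parent = ["yes"] * 4 + ["no"] * 4
weak_split = [["yes", "yes", "yes", "no"], ["yes", "no", "no", "no"]]
perfect_split = [["yes"] * 4, ["no"] * 4]

print(f"weak split gain:    {information_gain(parent, weak_split):.4f}")
print(f"perfect split gain: {information_gain(parent, perfect_split):.4f}")
```

The weak split gains roughly 0.19 bits while the perfect split gains the full 1 bit, so a tree using the entropy criterion would choose the perfect split.
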
In [None]:
# Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?


"""

Common Real-World Applications: Decision Trees are used in various fields like medical diagnosis, credit scoring and fraud detection, customer relationship management (CRM), manufacturing quality control, and bioinformatics.

Main Advantages: They are highly interpretable and explainable, can handle both numerical and categorical data without extensive pre-processing, do not require data scaling, are non-parametric, and are relatively robust to outliers.

Main Limitations: They are prone to overfitting without proper pruning, can be unstable and exhibit high variance, may show bias towards dominant classes in imbalanced datasets, and their greedy approach to splitting may not lead to globally optimal trees.


"""

In [None]:
"""
Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV).



Question 6:   Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances
(Include your Python code and output in the code box below.)


"""

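For Question 6 as stated (Gini criterion, accuracy, feature importances), a minimal sketch could look like the following. It assumes the same 70/30 split and `random_state=42` used elsewhere in this notebook. No output is shown because this sketch was not part of the original run.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris Dataset and hold out 30% of it for testing.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Train a Decision Tree Classifier with the Gini criterion
# (also scikit-learn's default; set explicitly for clarity).
gini_tree = DecisionTreeClassifier(criterion="gini", random_state=42)
gini_tree.fit(X_train, y_train)

# Accuracy on the held-out test set.
accuracy = accuracy_score(y_test, gini_tree.predict(X_test))
print(f"Accuracy: {accuracy:.4f}")

# Feature importances: each feature's share of the total impurity reduction.
print("\nFeature Importances:")
for name, importance in zip(iris.feature_names, gini_tree.feature_importances_):
    print(f"{name}: {importance:.4f}")
```
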

In [1]:


from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. Train a Decision Tree Classifier with max_depth=3
dtc_max_depth_3 = DecisionTreeClassifier(max_depth=3, random_state=42)
dtc_max_depth_3.fit(X_train, y_train)

# Make predictions and calculate accuracy for max_depth=3
y_pred_3 = dtc_max_depth_3.predict(X_test)
accuracy_3 = accuracy_score(y_test, y_pred_3)
print(f"Accuracy for Decision Tree with max_depth=3: {accuracy_3:.4f}")

# 2. Train a fully-grown Decision Tree Classifier (default max_depth=None)
dtc_fully_grown = DecisionTreeClassifier(random_state=42)
dtc_fully_grown.fit(X_train, y_train)

# Make predictions and calculate accuracy for fully-grown tree
y_pred_full = dtc_fully_grown.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)
print(f"Accuracy for fully-grown Decision Tree: {accuracy_full:.4f}")

print("\nComparison:")
if accuracy_3 > accuracy_full:
    print("The Decision Tree with max_depth=3 performed better.")
elif accuracy_full > accuracy_3:
    print("The fully-grown Decision Tree performed better.")
else:
    print("Both Decision Trees performed equally well.")

Accuracy for Decision Tree with max_depth=3: 1.0000
Accuracy for fully-grown Decision Tree: 1.0000

Comparison:
Both Decision Trees performed equally well.


In [None]:
"""

Question 7:   Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3
● Compare its accuracy to that of a fully-grown tree
(Include your Python code and output in the code box below.)

The program in the code cell above (In [1]) loads the Iris Dataset, trains two Decision Tree Classifiers (one with max_depth=3 and one fully grown), and compares their accuracies.

Both trees achieved an accuracy of 1.0000 on the test set, i.e. perfect classification for this particular 70/30 split of the Iris dataset, so the comparison reports "Both Decision Trees performed equally well."
With only 45 test samples, this result does not show that tree depth is irrelevant in general.

"""

In [None]:
"""
Question 8: Write a Python program to:
● Load the Boston Housing Dataset
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
(Include your Python code and output in the code box below.)



"""

In [3]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# load_boston was removed in scikit-learn 1.2, so the California Housing Dataset is used as an alternative
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Convert to DataFrame for better feature name handling
X_df = pd.DataFrame(X, columns=housing.feature_names)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Regressor
dtr = DecisionTreeRegressor(random_state=42)
dtr.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dtr.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(X_df.columns, dtr.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 0.5280

Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.0250
Population: 0.0322
AveOccup: 0.1390
Latitude: 0.0900
Longitude: 0.0888


In [4]:
"""
Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy
(Include your Python code and output in the code box below.)

"""


from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid for GridSearchCV
param_grid = {
    'max_depth': [2, 3, 4, 5, None],  # None means full depth
    'min_samples_split': [2, 5, 10]
}

# Create a Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters found
print(f"Best parameters found: {grid_search.best_params_}")

# Get the best model
best_dt_model = grid_search.best_estimator_

# Make predictions on the test set with the best model
y_pred_best = best_dt_model.predict(X_test)

# Calculate and print the accuracy of the best model
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Accuracy with best parameters: {accuracy_best:.4f}")


Best parameters found: {'max_depth': 4, 'min_samples_split': 10}
Accuracy with best parameters: 1.0000


In [None]:
"""
Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.


Handling Missing Values: Identify where data is missing and why, then choose a strategy guided by domain knowledge: drop rows/columns with too many gaps, or impute using the mean/median/mode, regression imputation, or KNN imputation.

Encoding Categorical Features: Convert non-numerical categorical data into a numerical format that the Decision Tree can process. Common methods include One-Hot Encoding for nominal features and Ordinal Encoding for ordered features.

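Training a Decision Tree Model: Split the data into training and testing sets, fit the imputation and encoding steps on the training set only (so no test-set information leaks into preprocessing), then initialize and fit a DecisionTreeClassifier on the training data.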

Tuning its Hyperparameters: Crucial for preventing overfitting. This involves using techniques like GridSearchCV or RandomizedSearchCV with cross-validation to find the optimal values for hyperparameters such as max_depth, min_samples_split, and min_samples_leaf.

Evaluating its Performance: Assess the best-tuned model's performance on the unseen test set using metrics appropriate for classification, such as accuracy, precision, recall, F1-score, ROC AUC, and analyzing the confusion matrix.

From a business value perspective, this model could provide:

Early Detection and Intervention: Leading to better patient outcomes and potentially reduced treatment costs.
Resource Optimization: Efficient allocation of healthcare resources by identifying high-risk patients.
Cost Reduction: Preventing disease progression to more expensive stages.
Personalized Medicine: Informing tailored prevention and treatment plans.
Improved Patient Satisfaction: Through more accurate diagnoses and timely care.
Research and Development: Providing insights into disease mechanisms and risk factors.
Proactive Healthcare: Shifting towards a preventive approach rather than a reactive one.

"""

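The steps outlined for Question 10 can be chained in a single scikit-learn pipeline. The sketch below runs on synthetic data: the column names (`age`, `blood_pressure`, `smoker`), the rule that generates `has_disease`, the missing-value rate, and the parameter grid are all assumptions made for illustration. Recall is used as the tuning metric on the assumption that missed diagnoses are the costlier error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a patient dataset (all columns are hypothetical).
rng = np.random.default_rng(42)
n = 400
age = rng.integers(20, 80, size=n).astype(float)
smoker = rng.choice(["yes", "no"], size=n)
has_disease = ((age > 55) | (smoker == "yes")).astype(int)  # made-up labeling rule

df = pd.DataFrame({
    "age": age,
    "blood_pressure": rng.normal(120, 15, size=n),
    "smoker": pd.Series(smoker, dtype=object),
})
# Knock out roughly 10% of two columns to simulate missing values.
df.loc[rng.random(n) < 0.1, "blood_pressure"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

# Impute numeric columns with the median; impute and one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Split first, so preprocessing is fitted on training data only.
X_train, X_test, y_train, y_test = train_test_split(
    df, has_disease, test_size=0.3, stratify=has_disease, random_state=42
)

# Tune the tree inside the pipeline with 5-fold cross-validation.
search = GridSearchCV(
    model,
    {"tree__max_depth": [2, 3, 5, None], "tree__min_samples_leaf": [1, 5, 10]},
    cv=5,
    scoring="recall",
)
search.fit(X_train, y_train)

# Evaluate the tuned model on the held-out test set.
y_pred = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
```

Keeping imputation and encoding inside the pipeline means each cross-validation fold refits them on its own training portion, so the tuning scores are not inflated by leakage.
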