
# Theoretical

1. A Decision Tree is a flowchart-like structure used for decision-making and predictive modeling.

It consists of internal nodes (each testing a feature), branches (representing the outcomes of that test, i.e. the decision rules), and leaves (representing predicted outcomes).
The tree is built by repeatedly splitting the dataset into subsets based on feature values, aiming to make each subset as homogeneous as possible.

2. Impurity measures quantify how mixed the classes are in a node.
Common measures include Gini Impurity and Entropy.

3. Gini Impurity = 1 - Σ(p_i^2)

Where p_i is the probability of class i in the node.
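
As a quick numeric check of the formula, the sketch below computes Gini Impurity from a list of class labels. The helper name `gini_impurity` and the toy label lists are illustrative assumptions, not part of any library.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_i^2), where p_i is the share of class i in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = gini_impurity(["a", "a", "a", "a"])   # node with a single class
mixed = gini_impurity(["a", "a", "b", "b"])  # node split evenly between two classes
print(pure, mixed)
```

A node with a single class scores 0, and an evenly split two-class node scores 0.5, which is the maximum for two classes.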

4. Entropy = -Σ(p_i * log2(p_i))

Where p_i is the probability of class i in the node.
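
The same kind of check works for Entropy. This is a minimal sketch; the helper name `entropy` and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Return -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Classes with zero count never appear in the Counter, so log2(0) is avoided.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = entropy(["a", "a", "a", "a"])   # no uncertainty
mixed = entropy(["a", "a", "b", "b"])  # maximum uncertainty for two classes
print(pure, mixed)
```

With two classes, Entropy ranges from 0 (pure node) to 1 bit (even split).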

5. Information Gain measures the reduction in entropy or impurity after a dataset is split on a feature.

It is used to determine the best feature to split the data at each node.
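
Written out, the gain of a split is the parent's entropy minus the size-weighted entropy of its children. The sketch below illustrates this with two hypothetical splits of the same parent node; the function names and toy data are assumptions for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of the class distribution in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]
perfect = information_gain(parent, [[0, 0, 0, 0], [1, 1, 1, 1]])  # children are pure
useless = information_gain(parent, [[0, 0, 1, 1], [0, 0, 1, 1]])  # children as mixed as parent
print(perfect, useless)
```

The split that separates the classes completely recovers the full 1 bit of parent entropy, while the split that leaves both children as mixed as the parent gains nothing.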

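6.
*   Gini Impurity measures the probability of misclassifying a randomly chosen sample if it were labelled according to the node's class distribution, while Entropy measures the uncertainty (in bits) of that distribution.
*   Gini is generally slightly faster to compute because it avoids the logarithm; in practice the two criteria usually produce very similar trees.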
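
To see how the two measures relate, the sketch below evaluates both for a two-class node at a few values of p, the share of the positive class. The function names are illustrative assumptions.

```python
import math

def gini_binary(p):
    """Gini Impurity of a two-class node with positive-class share p."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def entropy_binary(p):
    """Entropy (bits) of a two-class node with positive-class share p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

for p in (0.1, 0.3, 0.5):
    print(f"p={p:.1f}  gini={gini_binary(p):.3f}  entropy={entropy_binary(p):.3f}")
```

Both measures grow as the node becomes more mixed and peak at p = 0.5; they differ mainly in scale, which is one reason they tend to rank candidate splits similarly.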

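7. Decision Trees use recursive partitioning: at each node, the algorithm evaluates candidate splits, picks the one with the largest impurity reduction (Information Gain), and repeats the process on each resulting subset until a stopping criterion is met.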
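
A single step of that search can be sketched for one numeric feature. This is a simplified illustration, not how any particular library implements it: the helper `best_threshold` is hypothetical, it scores splits with Gini Impurity, and a real tree would repeat the search over every feature and then recurse into each child.

```python
from collections import Counter

def gini(labels):
    """Gini Impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between distinct sorted values; return (threshold, weighted_gini)."""
    distinct = sorted(set(values))
    best = (None, gini(labels))  # start from the unsplit node
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

x = [1, 2, 3, 10, 11, 12]
y = [0, 0, 0, 1, 1, 1]
threshold, score = best_threshold(x, y)
print(threshold, score)
```

On this toy data the search settles on the midpoint 6.5, which separates the two classes completely and leaves a weighted impurity of 0.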

8. Pre-Pruning involves stopping the tree growth early based on certain criteria (e.g., maximum depth, minimum samples per leaf) to prevent overfitting.

9. Post-Pruning involves removing branches from a fully grown tree to reduce complexity and improve generalization.

10. Pre-Pruning occurs during the tree construction, while Post-Pruning occurs after the tree is fully grown.
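
In scikit-learn, the two approaches map onto different constructor arguments. The sketch below is a hedged illustration: the specific values `max_depth=2`, `min_samples_leaf=5`, and `ccp_alpha=0.02` are arbitrary assumptions chosen only to make the effect visible.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No pruning: grow until leaves are pure
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruning: growth stops early when a limit is hit
pre_pruned = DecisionTreeClassifier(
    max_depth=2, min_samples_leaf=5, random_state=0
).fit(X, y)

# Post-pruning: the tree is grown, then cut back by cost-complexity pruning
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

for name, tree in [("full", full), ("pre", pre_pruned), ("post", post_pruned)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```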

11. A Decision Tree Regressor is a type of Decision Tree used for regression tasks, predicting continuous values instead of class labels.
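
One consequence worth seeing in code is that a regression tree predicts a constant value per leaf, so its output is piecewise constant. The sketch below uses synthetic data (a noisy sine curve, an assumption made purely for illustration) and a shallow tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem: noisy sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=80)

# A depth-3 tree has at most 2**3 = 8 leaves
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
preds = reg.predict(X)

# Each leaf predicts the mean target of its training samples
n_distinct = len(np.unique(preds))
print("leaves:", reg.get_n_leaves(), "distinct predictions:", n_distinct)
```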

12.
*   Advantages:
    *   Easy to interpret and visualize.
    *   Handles both numerical and categorical data.
    *   Requires little data preprocessing.
*   Disadvantages:
    *   Prone to overfitting.
    *   Sensitive to noisy data.
    *   Can be biased towards features with more levels.

13. Decision Trees can handle missing values by using surrogate splits or by assigning the most common value for that feature.
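
The most-common-value strategy can be sketched in scikit-learn by placing a `SimpleImputer` in front of the tree. The tiny array below is made-up data for illustration, and this is only one possible way to set it up.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data with two missing entries (np.nan)
X = np.array([
    [1.0, 0.0],
    [1.0, np.nan],
    [2.0, 1.0],
    [np.nan, 1.0],
    [2.0, 1.0],
    [1.0, 0.0],
])
y = np.array([0, 0, 1, 1, 1, 0])

# Replace each missing value with the most frequent value of its column,
# then fit the tree on the completed data
model = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    DecisionTreeClassifier(random_state=0),
)
model.fit(X, y)

imputed = model.named_steps["simpleimputer"].transform(X)
print(imputed)
```

Keeping the imputer inside the pipeline means the fill values are learned from the training data only and reused at prediction time.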

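14. In principle, Decision Trees can handle categorical features directly by creating a branch for each category (or by splitting the categories into groups). Note that scikit-learn's tree estimators expect numeric input, so categorical features must be encoded first (e.g., with one-hot or ordinal encoding).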
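
When working with scikit-learn, whose trees expect numeric arrays, a common workaround is to one-hot encode the categories before fitting. The sketch below uses a made-up `colors` feature as an illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy categorical feature: label is 1 only for "red"
colors = np.array([["red"], ["green"], ["blue"], ["green"], ["red"], ["blue"]])
y = np.array([1, 0, 0, 0, 1, 0])

# One column per category; unseen categories at predict time become all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(colors).toarray()

clf = DecisionTreeClassifier(random_state=0).fit(X_encoded, y)
pred = clf.predict(encoder.transform([["red"], ["green"]]).toarray())
print(pred)
```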

15. Real-world applications of Decision Trees include medical diagnosis, credit scoring, customer segmentation, and fraud detection.

# Practical

16.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy}')

17.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train model with Gini Impurity
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X, y)

# Print feature importances
print(f'Feature Importances: {clf.feature_importances_}')

18.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model with Entropy
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy}')

19.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load housing dataset
data = pd.read_csv('housing.csv')  # Replace with your dataset path
X = data.drop('target', axis=1)  # Replace 'target' with your target column name
y = data['target']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

# Predict and evaluate
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

20.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train model
clf = DecisionTreeClassifier()
clf.fit(X, y)

# Visualize tree
dot_data = export_graphviz(clf, out_file=None,
                           feature_names=iris.feature_names,
                           class_names=iris.target_names,
                           filled=True, rounded=True,
                           special_characters=True)
graph = graphviz.Source(dot_data)
graph.render("iris_tree")  # Save as iris_tree.pdf

21.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train fully grown tree
clf_full = DecisionTreeClassifier()
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Train tree with max depth of 3
clf_depth = DecisionTreeClassifier(max_depth=3)
clf_depth.fit(X_train, y_train)
y_pred_depth = clf_depth.predict(X_test)
accuracy_depth = accuracy_score(y_test, y_pred_depth)

print(f'Fully Grown Tree Accuracy: {accuracy_full}')
print(f'Max Depth 3 Tree Accuracy: {accuracy_depth}')

22.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train default tree
clf_default = DecisionTreeClassifier()
clf_default.fit(X_train, y_train)
y_pred_default = clf_default.predict(X_test)
accuracy_default = accuracy_score(y_test, y_pred_default)

# Train tree with min_samples_split=5
clf_split = DecisionTreeClassifier(min_samples_split=5)
clf_split.fit(X_train, y_train)
y_pred_split = clf_split.predict(X_test)
accuracy_split = accuracy_score(y_test, y_pred_split)

print(f'Default Tree Accuracy: {accuracy_default}')
print(f'Min Samples Split Tree Accuracy: {accuracy_split}')

23.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model without scaling
clf_default = DecisionTreeClassifier()
clf_default.fit(X_train, y_train)
y_pred_default = clf_default.predict(X_test)
accuracy_default = accuracy_score(y_test, y_pred_default)

# Apply feature scaling (tree splits are threshold-based, so scaling is not expected to change the result)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model with scaling
clf_scaled = DecisionTreeClassifier()
clf_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = clf_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print(f'Default Tree Accuracy: {accuracy_default}')
print(f'Scaled Tree Accuracy: {accuracy_scaled}')

24.

In [None]:
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model using One-vs-Rest strategy
clf_ovr = OneVsRestClassifier(DecisionTreeClassifier())
clf_ovr.fit(X_train, y_train)

# Predict and evaluate
y_pred_ovr = clf_ovr.predict(X_test)
accuracy_ovr = accuracy_score(y_test, y_pred_ovr)
print(f'One-vs-Rest Tree Accuracy: {accuracy_ovr}')

25.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train model
clf = DecisionTreeClassifier()
clf.fit(X, y)

# Display feature importance scores
importance_scores = clf.feature_importances_
for i, score in enumerate(importance_scores):
    print(f'Feature {i}: Importance Score = {score}')

26.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load housing dataset
data = pd.read_csv('housing.csv')  # Replace with your dataset path
X = data.drop('target', axis=1)  # Replace 'target' with your target column name
y = data['target']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train unrestricted tree
regressor_full = DecisionTreeRegressor()
regressor_full.fit(X_train, y_train)
y_pred_full = regressor_full.predict(X_test)
mse_full = mean_squared_error(y_test, y_pred_full)

# Train tree with max_depth=5
regressor_depth = DecisionTreeRegressor(max_depth=5)
regressor_depth.fit(X_train, y_train)
y_pred_depth = regressor_depth.predict(X_test)
mse_depth = mean_squared_error(y_test, y_pred_depth)

print(f'Unrestricted Tree MSE: {mse_full}')
print(f'Max Depth 5 Tree MSE: {mse_depth}')

27.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_before_pruning = accuracy_score(y_test, y_pred)

# Apply Cost Complexity Pruning
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
clfs = []

for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)

# Evaluate accuracy for each pruned tree
accuracies = [accuracy_score(y_test, clf.predict(X_test)) for clf in clfs]

# Visualize effect on accuracy
import matplotlib.pyplot as plt

plt.plot(ccp_alphas, accuracies, marker='o')
plt.title('Effect of Cost Complexity Pruning on Accuracy')
plt.xlabel('CCP Alpha')
plt.ylabel('Accuracy')
plt.grid()
plt.show()

print(f'Accuracy before pruning: {accuracy_before_pruning}')
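
The plot above scores each alpha on the test set, which is fine for visualization but would leak test information if used to pick the final alpha. One possible alternative, sketched here under the assumption of 5-fold cross-validation on the training split, is to choose alpha by cross-validated accuracy and only then touch the test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate alphas from the pruning path of a tree grown on the training data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Guard against tiny negative values caused by floating-point error
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)

# Mean cross-validated accuracy for each candidate alpha
cv_means = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=0), X_train, y_train, cv=5
    ).mean()
    for a in ccp_alphas
]
best_alpha = ccp_alphas[int(np.argmax(cv_means))]

# Refit with the chosen alpha and evaluate once on the held-out test set
final = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0)
final.fit(X_train, y_train)
test_acc = final.score(X_test, y_test)
print(f'Best alpha: {best_alpha}, test accuracy: {test_acc}')
```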

28.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Calculate Precision, Recall, and F1-Score
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1-Score: {f1}')
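
The weighted averages above hide per-class differences. As a complementary sketch, `classification_report` prints precision, recall, and F1-Score for each class in one table.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# One row per class, plus macro and weighted averages
report = classification_report(
    y_test, y_pred, target_names=list(iris.target_names)
)
print(report)
```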

29.

In [None]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()

30.

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Set up the parameter grid
param_grid = {
    'max_depth': [None, 2, 3, 4, 5],
    'min_samples_split': [2, 5, 10]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters and score
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best Cross-Validation Score: {grid_search.best_score_}')

Best Parameters: {'max_depth': None, 'min_samples_split': 2}
Best Cross-Validation Score: 0.95
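
The cross-validation score is computed on the training split only. A natural follow-up, sketched below with the same grid, is to evaluate the refitted best estimator on the held-out test set. `random_state=0` is an added assumption to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'max_depth': [None, 2, 3, 4, 5],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training split with the best parameters
best_tree = grid_search.best_estimator_
test_accuracy = best_tree.score(X_test, y_test)
print(f'Test Accuracy of Best Model: {test_accuracy}')
```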
