# Decision Tree

Decision trees are a popular supervised machine learning algorithm used for both classification and regression tasks.

### Structure of a Decision Tree

1. A decision tree consists of nodes. Each non-terminal node represents a test on an attribute (feature), and each branch leaving it represents one outcome of that test.
2. The top node is called the root node. It holds the first test, applied to the whole dataset, on the feature that best separates the data.
3. The nodes below the root node are called internal nodes (or decision nodes), which represent intermediate decisions based on features.
4. The terminal nodes are called leaf nodes, which provide the final outcome or prediction.
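
To make the root, internal, and leaf roles concrete, here is a minimal sketch that walks the nodes of a small fitted tree through scikit-learn's `tree_` attribute. The iris dataset and the name `structure_clf` are assumptions chosen for illustration; they are not part of this notebook's data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data only: the iris dataset bundled with scikit-learn.
X_iris, y_iris = load_iris(return_X_y=True)
structure_clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_iris, y_iris)

t = structure_clf.tree_
n_leaves = 0
for node_id in range(t.node_count):
    # scikit-learn marks a missing child with -1, so a node without a left child is a leaf.
    if t.children_left[node_id] == -1:
        n_leaves += 1
        print(f"node {node_id}: leaf")
    else:
        kind = "root" if node_id == 0 else "internal"
        print(f"node {node_id}: {kind}, tests feature {t.feature[node_id]} "
              f"<= {t.threshold[node_id]:.2f}")
```

Node 0 is always the root. Every non-leaf node stores the index of the feature it tests and the threshold it compares against.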


### How Decision Trees Work

1. Decision trees are built by recursively splitting the dataset: at each node, the algorithm picks the feature (and split point) that best separates the data reaching that node.

2. The goal is to split the data into subsets in a way that minimizes impurity or maximizes information gain.

3. For classification tasks, impurity can be measured using metrics like Gini impurity or entropy, while for regression tasks, it's typically measured using mean squared error (MSE).
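
The impurity measures above can be sketched in a few lines of plain Python. The helper names (`gini`, `entropy`, `information_gain`) and the toy label lists are hypothetical and only meant to show the arithmetic; a library such as scikit-learn computes these internally.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Toy node with 5 'yes' and 5 'No' labels, split two different ways.
parent = ["yes"] * 5 + ["No"] * 5
gain_pure = information_gain(parent, ["yes"] * 5, ["No"] * 5)
gain_mixed = information_gain(parent, ["yes"] * 4 + ["No"], ["yes"] + ["No"] * 4)

print(gini(parent), entropy(parent))
print(gain_pure, gain_mixed)
```

A split that separates the classes perfectly recovers all of the parent's entropy, while a split that leaves the children mixed gains less. The tree-building algorithm prefers the split with the larger gain.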

### Advantages of Decision Trees

1. **Interpretability**: Decision trees are easy to understand and interpret, making them a valuable tool for explaining the reasoning behind predictions.

2. **Handling Non-linearity**: They can model non-linear relationships in the data without the need for complex transformations.

3. **Feature Importance**: Decision trees can rank features by their importance in the decision-making process.

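4. **Robustness**: Splits depend on the ordering of feature values rather than their magnitude, so trees are fairly robust to outliers in the features. Some implementations can also handle missing values directly.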

5. **Versatility**: Decision trees can be used for classification and regression tasks and are a building block for more advanced ensemble methods like Random Forests.
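
As a rough illustration of the feature-importance point, a fitted scikit-learn tree exposes a `feature_importances_` array. The sketch below assumes the bundled iris dataset and a hypothetical variable name `importance_clf`; it does not use this notebook's data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative data only: the iris dataset bundled with scikit-learn.
iris = load_iris()
importance_clf = DecisionTreeClassifier(max_depth=3, random_state=0)
importance_clf.fit(iris.data, iris.target)

# Pair each feature name with its importance and sort from most to least important.
ranked = sorted(
    zip(iris.feature_names, importance_clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```

The importances are normalized to sum to 1, and a feature the tree never splits on gets an importance of 0.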

### Limitations of Decision Trees

1. **Overfitting**: Decision trees tend to overfit the training data if not pruned properly. This can lead to poor generalization on unseen data.

2. **Instability**: Small changes in the data can lead to different tree structures, making them somewhat unstable.

3. **Bias**: With imbalanced classes, decision trees tend to favor the majority class unless the data is rebalanced or class weights are used.

4. **Limited Expressiveness**: A single tree splits on one feature at a time and predicts a constant value per leaf, so it may not capture smooth or additive relationships as effectively as more sophisticated models.
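
The overfitting risk can be seen by comparing an unrestricted tree with a depth-limited one. This is a minimal sketch on synthetic data from `make_classification`. The dataset, its noise level (`flip_y=0.1`), and the variable names are assumptions for illustration, and the exact scores will depend on them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumption: roughly 10% of labels are flipped at random).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

# One tree grown until its leaves are pure, one limited to depth 3.
full = DecisionTreeClassifier(random_state=0).fit(Xs_train, ys_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xs_train, ys_train)

for name, model in [("unrestricted", full), ("max_depth=3", shallow)]:
    print(
        f"{name:13s} depth={model.get_depth():2d} "
        f"train={model.score(Xs_train, ys_train):.3f} "
        f"test={model.score(Xs_test, ys_test):.3f}"
    )
```

The unrestricted tree memorizes the training set, noise included, so its training accuracy says little about unseen data. Comparing the train and test columns for both trees shows how much of that fit carries over.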

### Import Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

### Load data

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Path to the dataset file in Google Drive
path = "/content/drive/MyDrive/Dataset/Decision_Tree_ Dataset.csv"
# Load dataset
data = pd.read_csv(path)

In [4]:
data.head()

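|   | 1 | 2 | 3 | 4 | sum | Unnamed: 5 |
|---|---|---|---|---|---|---|
| 0 | 201 | 10018 | 250 | 3046 | 13515 | yes |
| 1 | 205 | 10016 | 395 | 3044 | 13660 | yes |
| 2 | 257 | 10129 | 109 | 3251 | 13746 | yes |
| 3 | 246 | 10064 | 324 | 3137 | 13771 | yes |
| 4 | 117 | 10115 | 496 | 3094 | 13822 | yes |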


In [9]:
# column names
data.columns

Index(['1', '2', '3', '4', 'sum', 'Unnamed: 5'], dtype='object')

In [8]:
# number of rows in the data
len(data)

1000

In [10]:
# shape of the data
data.shape

(1000, 6)

### Splitting the data

In [14]:
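# Features: the four numeric columns plus the 'sum' column
X = data[['1', '2', '3', '4', 'sum']]
# Target: single brackets give a 1-D Series, which is the shape fit() expects for y
y = data['Unnamed: 5']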

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
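
To check what a 70/30 split produces, here is a small sketch on a synthetic frame of the same length (1000 rows). The random values, the `label` column, and the `stratify` argument are assumptions added for illustration. The call above does not stratify.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, four numeric columns, random yes/No labels.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 100, size=(1000, 4)), columns=["1", "2", "3", "4"])
toy["label"] = rng.choice(["yes", "No"], size=1000)

X_toy = toy[["1", "2", "3", "4"]]
y_toy = toy["label"]

# stratify=y_toy keeps the yes/No proportions nearly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print((y_tr == "yes").mean(), (y_te == "yes").mean())
```

With 1000 rows and `test_size=0.3`, 700 rows go to training and 300 to testing. Fixing `random_state` makes the split reproducible across runs.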

### Model building and training

Define the decision tree classifier with the parameters shown below, then train it.

The clf.fit(X_train, y_train) call trains the decision tree classifier (clf) on the training data (X_train, y_train). After this step, the classifier is ready to make predictions.

In [21]:
clf = DecisionTreeClassifier(
    criterion='entropy',  # Splitting criterion using entropy
    random_state=1,       # Random seed for reproducibility
    max_depth=3,          # Maximum depth of the decision tree
    min_samples_leaf=5    # Minimum number of samples required to be a leaf node
)

# Training the decision tree classifier on the training data
clf.fit(X_train, y_train)
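
Once a tree is fitted, its learned rules can be printed as text with `export_text` from `sklearn.tree`. The sketch below reuses the same hyperparameters but fits on the bundled iris dataset, since the notebook's CSV is not available here. `rules_clf` and `rules` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data only: the iris dataset bundled with scikit-learn.
iris = load_iris()
rules_clf = DecisionTreeClassifier(
    criterion="entropy", random_state=1, max_depth=3, min_samples_leaf=5
).fit(iris.data, iris.target)

# Render the fitted tree as indented if/else style rules.
rules = export_text(rules_clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one test on a feature, and each `class:` line is a leaf. With `max_depth=3`, no rule chains more than three tests.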

Make predictions on the test data (X_test) using the trained decision tree classifier (clf).

In [18]:
y_pred  = clf.predict(X_test)
y_pred

array(['No', 'No', 'yes', 'yes', 'yes', 'No', 'yes', 'No', 'yes', 'yes',
       'yes', 'yes', 'No', 'yes', 'No', 'yes', 'yes', 'yes', 'No', 'No',
       'No', 'yes', 'No', 'No', 'yes', 'yes', 'yes', 'No', 'No', 'yes',
       'No', 'No', 'No', 'No', 'yes', 'yes', 'yes', 'No', 'No', 'No',
       'No', 'No', 'yes', 'yes', 'yes', 'No', 'No', 'yes', 'yes', 'No',
       'No', 'yes', 'No', 'yes', 'yes', 'yes', 'No', 'No', 'No', 'No',
       'yes', 'yes', 'yes', 'No', 'No', 'No', 'yes', 'No', 'No', 'yes',
       'No', 'yes', 'yes', 'No', 'yes', 'yes', 'yes', 'No', 'yes', 'No',
       'yes', 'No', 'No', 'No', 'No', 'No', 'yes', 'No', 'No', 'No', 'No',
       'yes', 'No', 'No', 'yes', 'No', 'No', 'No', 'yes', 'yes', 'yes',
       'yes', 'No', 'yes', 'yes', 'No', 'yes', 'No', 'yes', 'No', 'No',
       'No', 'No', 'yes', 'No', 'yes', 'No', 'yes', 'yes', 'No', 'yes',
       'yes', 'yes', 'yes', 'yes', 'No', 'No', 'No', 'No', 'No', 'No',
       'yes', 'No', 'yes', 'No', 'No', 'No', 'No', 'yes', 'No'

### Evaluation

accuracy_score calculates the accuracy of the model's predictions: the fraction of predicted values (y_pred) that match the actual true values (y_test).

In [19]:
# scikit-learn's signature is accuracy_score(y_true, y_pred)
accuracy_score(y_test, y_pred)

1.0

Accuracy ranges from 0 to 1, and higher scores indicate better model performance. A score of 1.0 means every prediction on the test set matched its true label.
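
As a sanity check on what accuracy measures, the sketch below computes it by hand and compares it with `accuracy_score`, then shows a confusion matrix for the same labels. The two short label lists are made up for illustration and are not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, not taken from the notebook's test set.
y_true_demo = ["yes", "No", "yes", "yes", "No", "No"]
y_pred_demo = ["yes", "No", "No", "yes", "No", "yes"]

# Accuracy by hand: number of matching positions divided by the total.
manual = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["No", "yes"])

print(manual, acc)
print(cm)
```

Four of the six predictions match, so both computations give 4/6. The confusion matrix additionally shows which class each error falls in, which a single accuracy number hides.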