🎯 What is a Decision Tree?

A Decision Tree is a supervised machine learning algorithm used for:

- Classification: Predicting categories (e.g., spam or not spam).

- Regression: Predicting continuous values (e.g., house prices).

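It recursively splits the data into smaller subsets by asking a series of yes/no questions about feature values (for example, whether a feature is below some threshold) until it reaches a leaf, which gives the final prediction.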
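As a rough illustration of the "series of questions" idea, a tiny tree can be written by hand as nested conditions. The function name, feature names, and thresholds below are hypothetical and not learned from any data; a real tree learns its questions during training.

```python
def classify_email(num_links: int, has_suspicious_words: bool) -> str:
    """Toy, hand-written decision tree (hypothetical rules, not a trained model)."""
    # Root node: the first yes/no question
    if num_links > 3:
        # Decision node: a second question on another feature
        if has_suspicious_words:
            return "spam"      # leaf
        return "not spam"      # leaf
    return "not spam"          # leaf


print(classify_email(num_links=5, has_suspicious_words=True))   # spam
print(classify_email(num_links=1, has_suspicious_words=True))   # not spam
```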



📚 How Does It Work?
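1. Start at the Root: The tree starts at the root node, which holds all the training data and asks the first question about a feature.

2. Split the Data: Based on the answer (the feature's value), the data is split into branches.

3. Decision Nodes: Intermediate nodes that ask further questions and keep splitting the data.

4. Leaf Nodes: Terminal nodes that output the prediction (a class for classification, a value for regression).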
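The structure described in these steps can be inspected on a fitted scikit-learn tree. The sketch below uses a made-up four-row dataset (an assumption, chosen so the tree is trivially small) and prints the learned rules with `export_text`. With data this simple the tree has only a root and two leaves, so there are no intermediate decision nodes.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: one feature, two classes, cleanly separable
X = [[1], [2], [3], [4]]
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Text view of the tree: the root question, its branches, and the leaves
print(export_text(clf, feature_names=["x"]))

print("nodes:", clf.tree_.node_count)   # root + 2 leaves = 3
print("leaves:", clf.get_n_leaves())    # 2
print("depth:", clf.get_depth())        # 1
```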

🧠 How Does It Split Data?

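- Gini Index: Measures the impurity of a node (0 means all samples belong to one class). Splits that produce lower impurity are better.

- Entropy: Also measures impurity. Information gain is the reduction in entropy achieved by a split. Higher gain = better split.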
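Both criteria can be computed by hand for a small example. The sketch below uses the standard definitions (Gini = 1 - Σ p², entropy = -Σ p·log₂ p). The helper names and the label lists are made up for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical split of 8 samples (4 per class) into two children
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print(f"Gini(parent)    = {gini(parent):.3f}")     # 0.500
print(f"Gini(left)      = {gini(left):.3f}")       # 0.375
print(f"Entropy(parent) = {entropy(parent):.3f}")  # 1.000
print(f"Information gain = {information_gain(parent, left, right):.3f}")  # 0.189
```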

In [137]:
import pandas as pd

In [138]:
df = pd.read_csv('500hits.csv', encoding = 'latin-1')

In [139]:
df.head()

   PLAYER        YRS     G     AB     R     H   2B   3B   HR   RBI    BB    SO   SB   CS     BA  HOF
0  Ty Cobb        24  3035  11434  2246  4189  724  295  117   726  1249   357  892  178  0.366    1
1  Stan Musial    22  3026  10972  1949  3630  725  177  475  1951  1599   696   78   31  0.331    1
2  Tris Speaker   22  2789  10195  1882  3514  792  222  117   724  1381   220  432  129  0.345    1
3  Derek Jeter    20  2747  11195  1923  3465  544   66  260  1311  1082  1840  358   97  0.310    1
4  Honus Wagner   21  2792  10430  1736  3430  640  252  101     0   963   327  722   15  0.329    1


In [140]:
df = df.drop(columns = ['PLAYER', 'CS'])

In [141]:
X = df.iloc[:, 0:13]

In [142]:
y = df.iloc[:, 13]

In [143]:
from sklearn.model_selection import train_test_split

In [144]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 17)

In [145]:
X_train.shape

(372, 13)

In [146]:
X_test.shape

(93, 13)

In [147]:
y_train.shape

(372,)

In [148]:
y_test.shape

(93,)

In [149]:
from sklearn.tree import DecisionTreeClassifier

In [150]:
dtc = DecisionTreeClassifier()
dtc

In [151]:
dtc.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': None,
 'splitter': 'best'}

In [152]:
dtc.fit(X_train, y_train)

In [153]:
y_pred = dtc.predict(X_test)

In [154]:
from sklearn.metrics import confusion_matrix

In [155]:
print(confusion_matrix(y_test, y_pred))

[[50 11]
 [ 9 23]]


In [157]:
from sklearn.metrics import classification_report

In [158]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.82      0.83        61
           1       0.68      0.72      0.70        32

    accuracy                           0.78        93
   macro avg       0.76      0.77      0.77        93
weighted avg       0.79      0.78      0.79        93
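The numbers in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes the class-1 (HOF) metrics from the matrix printed above, using scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[50, 11],
      [9, 23]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tn + fp + fn + tp)   # 73 / 93
precision = tp / (tp + fp)                   # 23 / 34
recall = tp / (tp + fn)                      # 23 / 32
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.2f}")    # 0.78
print(f"precision: {precision:.2f}")   # 0.68
print(f"recall:    {recall:.2f}")      # 0.72
print(f"f1-score:  {f1:.2f}")          # 0.70
```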



In [159]:
dtc.feature_importances_

array([0.00919003, 0.0154869 , 0.02972098, 0.05542191, 0.39759474,
       0.0491026 , 0.00953164, 0.04585115, 0.05434852, 0.13298196,
       0.04098071, 0.04594388, 0.113845  ])

In [160]:
X.columns

Index(['YRS', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SB',
       'BA'],
      dtype='object')

In [161]:
features = pd.DataFrame(dtc.feature_importances_, index = X.columns)

In [163]:
features.head(10)

            0
YRS  0.009190
G    0.015487
AB   0.029721
R    0.055422
H    0.397595
2B   0.049103
3B   0.009532
HR   0.045851
RBI  0.054349
BB   0.132982
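The importances are easier to read when labelled and sorted. The sketch below is one way to do that with a pandas Series. The values are copied from the `dtc.feature_importances_` output above so that the block runs on its own; in the notebook you would pass `dtc.feature_importances_` and `X.columns` directly.

```python
import pandas as pd

# Values copied from the dtc.feature_importances_ output above
importances = [0.00919003, 0.0154869, 0.02972098, 0.05542191, 0.39759474,
               0.0491026, 0.00953164, 0.04585115, 0.05434852, 0.13298196,
               0.04098071, 0.04594388, 0.113845]
columns = ['YRS', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SB', 'BA']

# A named Series sorted from most to least important
ranked = pd.Series(importances, index=columns, name='importance').sort_values(ascending=False)

print(ranked.head(3))                    # H, BB, BA are the top three
print(f"total: {ranked.sum():.2f}")      # importances sum to 1.00
```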


In [164]:
dtc2 = DecisionTreeClassifier(criterion = 'entropy', ccp_alpha = 0.04)

In [165]:
dtc2.fit(X_train, y_train)
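`ccp_alpha` controls cost-complexity pruning: larger values remove more of the tree. The sketch below shows the effect on a synthetic dataset (an assumption, used only so that the block runs on its own; it is not the 500hits data) by comparing leaf counts for a few alpha values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the 500hits dataset
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     n_informative=4, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

leaf_counts = []
for alpha in [0.0, 0.01, 0.04]:
    tree = DecisionTreeClassifier(criterion='entropy', ccp_alpha=alpha, random_state=0)
    tree.fit(Xd_train, yd_train)
    leaf_counts.append(tree.get_n_leaves())
    print(f"ccp_alpha={alpha:<5} leaves={tree.get_n_leaves():<4} "
          f"test accuracy={tree.score(Xd_test, yd_test):.2f}")
```

A heavily pruned tree uses only a few features, which would be consistent with most importances being 0 for `dtc2`. Rather than guessing alpha values, `DecisionTreeClassifier.cost_complexity_pruning_path(X_train, y_train)` returns the candidate alphas for a given training set.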

In [168]:
y_pred2 = dtc2.predict(X_test)
y_pred2

array([0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0])

In [169]:
print(confusion_matrix(y_test, y_pred2))  # true labels first, then predictions

[[50 11]
 [ 9 23]]


In [170]:
print(classification_report(y_test, y_pred2))

              precision    recall  f1-score   support

           0       0.85      0.82      0.83        61
           1       0.68      0.72      0.70        32

    accuracy                           0.78        93
   macro avg       0.76      0.77      0.77        93
weighted avg       0.79      0.78      0.79        93



In [171]:
features2 = pd.DataFrame(dtc2.feature_importances_, index = X.columns)

In [173]:
features2.head(15)

            0
YRS  0.000000
G    0.000000
AB   0.000000
R    0.000000
H    0.837977
2B   0.000000
3B   0.000000
HR   0.000000
RBI  0.000000
BB   0.000000


Another Example: Iris Dataset

In [174]:
# ============================
# Import Required Libraries
# ============================
import pandas as pd  # For data manipulation
from sklearn.datasets import load_iris  # Load sample Iris dataset
from sklearn.tree import DecisionTreeClassifier  # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split  # Split dataset
from sklearn.metrics import accuracy_score  # Check model accuracy

# ============================
# Load and Prepare Dataset
# ============================
# Load iris dataset
data = load_iris()

# Create a DataFrame from the dataset
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add target variable to DataFrame
df['target'] = data.target

# Define features (X) and target (y)
X = df.drop('target', axis=1)  # All columns except 'target'
y = df['target']  # Target column

# ============================
# Split Dataset into Train and Test Sets
# ============================
# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ============================
# Initialize and Train Decision Tree
# ============================
# Create Decision Tree Classifier model
model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)

# Fit the model on training data
model.fit(X_train, y_train)

# ============================
# Make Predictions
# ============================
# Predict on test data
y_pred = model.predict(X_test)

# ============================
# Evaluate Model
# ============================
# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

Model Accuracy: 1.00
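An accuracy of 1.00 comes from a single split with only 30 test samples, so it may be optimistic. One possible way to get a more stable estimate is k-fold cross-validation. The sketch below reuses the same model settings with `cross_val_score`; the choice of 5 folds is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Same dataset and hyperparameters as the example above
X_iris, y_iris = load_iris(return_X_y=True)
cv_model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)

# 5-fold cross-validation: each fold serves once as the test set
scores = cross_val_score(cv_model, X_iris, y_iris, cv=5)

print("Fold accuracies:", scores.round(2))
print(f"Mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```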
