## Decision Tree

- A supervised learning algorithm used for classification and regression tasks.
- It works by splitting the data into subsets based on the values of input features, creating a tree-like structure of decisions.
  
  1. Tree Structure:
     - Root Node: The starting point, which represents the entire dataset.
     - Decision Nodes: Intermediate nodes where the data is split based on a feature.
     - Leaf Nodes: Final nodes that represent the output or decision (e.g., a class label or a numerical value).
  2. Splitting Criteria: At each node, a decision tree splits the data on the feature and threshold that most reduce impurity (uncertainty), measured for example by Gini impurity or entropy.
  3. Flow of Decisions: Each data point is evaluated by traversing the tree from the root node to a leaf node, following the splitting rule at each node.
 
<div style = "text-align: center;">
    <img src = "https://viso.ai/wp-content/smush-webp/2024/04/Visual-Representation-1-1060x596.png.webp" alt = "d_tree" width = 450/>
</div>
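
To make the splitting criterion more concrete, here is a minimal sketch of Gini impurity and of the size-weighted impurity of a candidate split. The helper names `gini` and `split_impurity` and the toy labels are assumptions made for illustration; they are not part of the notebook or of scikit-learn's internals.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two children of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical toy labels (1 = HOF, 0 = not HOF), not taken from the dataset
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

print("parent impurity:", gini(parent))
print("impurity after split:", split_impurity(left, right))
```

A split is attractive when the weighted impurity of the children is clearly lower than the impurity of the parent, which is the case for this toy split.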

In [18]:
# Importing libraries
import pandas as pd

df = pd.read_csv("500hits.csv", encoding = "latin-1")
df.head()

Unnamed: 0,PLAYER,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,BA,HOF
0,Ty Cobb,24,3035,11434,2246,4189,724,295,117,726,1249,357,892,178,0.366,1
1,Stan Musial,22,3026,10972,1949,3630,725,177,475,1951,1599,696,78,31,0.331,1
2,Tris Speaker,22,2789,10195,1882,3514,792,222,117,724,1381,220,432,129,0.345,1
3,Derek Jeter,20,2747,11195,1923,3465,544,66,260,1311,1082,1840,358,97,0.31,1
4,Honus Wagner,21,2792,10430,1736,3430,640,252,101,0,963,327,722,15,0.329,1


In [19]:
# Cleaning the data: HOF should be binary, but it contains an unexpected value (2),
# so we drop the row that holds it

df["HOF"].unique()

# locate the row where HOF == 2
df[df["HOF"] == 2]

df.drop(index = 160, inplace = True)

# re-check that the value 2 is gone
df["HOF"].unique()

array([1, 0], dtype=int64)

In [20]:
# Step 1: Dropping the unwanted columns

df = df.drop(columns = ["PLAYER", "CS"])
df.shape

(464, 14)

In [15]:
# Step 2: Splitting the data into features (X) and target (y, the HOF column)
X = df.iloc[:, 0:13]
print(X)
y = df.iloc[:, 13]
print(y)

     YRS     G     AB     R     H   2B   3B   HR   RBI    BB    SO   SB     BA
0     24  3035  11434  2246  4189  724  295  117   726  1249   357  892  0.366
1     22  3026  10972  1949  3630  725  177  475  1951  1599   696   78  0.331
2     22  2789  10195  1882  3514  792  222  117   724  1381   220  432  0.345
3     20  2747  11195  1923  3465  544   66  260  1311  1082  1840  358  0.310
4     21  2792  10430  1736  3430  640  252  101     0   963   327  722  0.329
..   ...   ...    ...   ...   ...  ...  ...  ...   ...   ...   ...  ...    ...
460   15  1920   6653  1105  1665  285   39  291   964  1224  1427  225  0.250
461   17  1829   6092   900  1664  379   10  275  1065   936  1453   20  0.273
462   15  1834   6499  1062  1661  338   67  210   761   960  1190  315  0.256
463   16  1822   6309   714  1660  254   25   54   593   396   489   74  0.263
464   15  1468   5629   785  1660  247   71   61   499   266   471  267  0.295

[464 rows x 13 columns]
0      1
1      1
2      1


In [22]:
# Step 3: Apply the train_test_split method to split the data into train and test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 12, test_size = 0.2)

# checking the shape of each set
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(371, 13)
(93, 13)
(371,)
(93,)


In [35]:
# Step 4: training the model

from sklearn.tree import DecisionTreeClassifier

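# random_state is not set, so the fitted tree (and the numbers below) can vary slightly between runs
dtc = DecisionTreeClassifier()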

# fit the data into the model
dtc.fit(X_train, y_train)

# Get the prediction from the Decision Tree
y_pred = dtc.predict(X_test)
print("The y prediction of column HOF:", "\n", y_pred)

# Confusion Matrix - to check how good the model is
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print("The confusion matrix:", "\n", cm)

The y prediction of column HOF: 
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0
 0 0 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 1 1 1 0 1 0 1
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
The confusion matrix: 
 [[53  8]
 [17 15]]


### Confusion Matrix:

| **Actual/Predicted**              | **Predicted Not HOF (0)** | **Predicted HOF (1)** |
|-----------------------------------|---------------------------|-----------------------|
| **Actual Not HOF (0)**            | True Negatives (TN) = 53  | False Positives (FP) = 8 |
| **Actual HOF (1)**                | False Negatives (FN) = 17 | True Positives (TP) = 15 |

- **True Negatives (TN)** = 53: 53 instances of class 0 were correctly predicted as class 0.
- **False Positives (FP)** = 8: 8 instances of class 0 were incorrectly predicted as class 1.
- **False Negatives (FN)** = 17: 17 instances of class 1 were incorrectly predicted as class 0.
- **True Positives (TP)** = 15: 15 instances of class 1 were correctly predicted as class 1.
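
The four counts can be condensed into a few summary metrics. The sketch below uses the matrix printed by the cell above and plain Python arithmetic; the variable names are chosen here for illustration.

```python
# Confusion matrix as printed above: rows = actual class, columns = predicted class
cm = [[53, 8],
      [17, 15]]

(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # of the players predicted as HOF, how many really are
recall = tp / (tp + fn)        # of the actual HOF players, how many were found

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

Recall is the weakest of the three here: the model misses about half of the actual Hall of Fame players. In the notebook itself, `accuracy_score`, `precision_score` and `recall_score` from `sklearn.metrics` should give the same figures when called on `y_test` and `y_pred`.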

In [28]:
# to check the list of parameters available for DecisionTreeClassifier
dtc.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': None,
 'splitter': 'best'}

In [39]:
# To check what has the biggest impact on our model
dtc.feature_importances_

features_df = pd.DataFrame(dtc.feature_importances_, index = X.columns)
features_df

# note: among the features shown below, HR has the highest importance (about 0.18), followed by H (about 0.11)

Unnamed: 0,0
YRS,0.0
G,0.037569
AB,0.04229
R,0.048241
H,0.111199
2B,0.057558
3B,0.045867
HR,0.175999
RBI,0.077892
BB,0.019686
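
Sorting the importances makes the ranking easier to check. The sketch below copies the values from the truncated output above into a plain dictionary (SO, SB and BA are not shown there, so they are left out); the name `importances` is an assumption for illustration.

```python
# Importances copied from the (truncated) output above; SO, SB and BA are not shown there
importances = {
    "YRS": 0.0, "G": 0.037569, "AB": 0.04229, "R": 0.048241, "H": 0.111199,
    "2B": 0.057558, "3B": 0.045867, "HR": 0.175999, "RBI": 0.077892, "BB": 0.019686,
}

# Rank features from most to least important
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

for name, value in ranked[:3]:
    print(f"{name:>3}: {value:.3f}")
```

In the notebook, calling `features_df.sort_values(0, ascending=False)` should produce the full ranking, including the three features that the output above cuts off.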


In [42]:
# Trying different parameters: entropy as the split criterion, plus cost-complexity pruning (ccp_alpha)

dtc2 = DecisionTreeClassifier(criterion = "entropy", ccp_alpha = 0.04)

dtc2.fit(X_train, y_train)

y_pred_2 = dtc2.predict(X_test)
print("The y prediction by using diff parameters of column HOF:", "\n", y_pred_2)

cm2 = confusion_matrix(y_test, y_pred_2)
print("The confusion matrix:", "\n", cm2)



The y prediction by using diff parameters of column HOF: 
 [1 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0
 0 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 1 1 1 0 1 0 1
 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0]
The confusion matrix: 
 [[55  6]
 [ 7 25]]
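
To compare the two models side by side, the sketch below computes accuracy and recall from the two confusion matrices printed in this notebook. The helper `summarize` is a hypothetical name introduced here for illustration.

```python
def summarize(cm):
    """Return accuracy and recall for a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "recall": tp / (tp + fn),
    }

default_tree = summarize([[53, 8], [17, 15]])  # first model (default parameters)
pruned_tree = summarize([[55, 6], [7, 25]])    # criterion="entropy", ccp_alpha=0.04

for name, scores in [("default", default_tree), ("entropy + ccp_alpha", pruned_tree)]:
    print(f"{name:>20}: accuracy={scores['accuracy']:.3f}, recall={scores['recall']:.3f}")
```

On this particular train/test split the second model is better on both measures, with the largest gain in recall. A single split is limited evidence, so repeating the comparison with cross-validation would likely give a steadier picture.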
