
**DECISION TREE CLASSIFIER**

This notebook trains a decision tree classifier on career batting statistics for 465 baseball players (the 500hits dataset) and predicts the `HOF` (Hall of Fame) label. On the 93-row test set, the default tree reaches 77% accuracy, and a pruned tree (entropy criterion, `ccp_alpha = 0.04`) reaches 78%. For the pruned tree, precision and recall are 0.85 and 0.82 for class 0 and 0.68 and 0.72 for class 1, which gives a macro-average F1-score of 0.77.

# Dataset link - https://www.kaggle.com/datasets/krupadharamshi/500hits

# 1. Data Collection
## Importing libraries and reading datasets

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("/kaggle/input/500hits/500hits.csv",encoding = "latin-1")

# 2. Exploratory Data Analysis 
## In EDA we perform tasks such as cleaning, describing, and modifying the data.

In [3]:
df.head()  # show the first 5 rows of the data

Unnamed: 0,PLAYER,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,BA,HOF
0,Ty Cobb,24,3035,11434,2246,4189,724,295,117,726,1249,357,892,178,0.366,1
1,Stan Musial,22,3026,10972,1949,3630,725,177,475,1951,1599,696,78,31,0.331,1
2,Tris Speaker,22,2789,10195,1882,3514,792,222,117,724,1381,220,432,129,0.345,1
3,Derek Jeter,20,2747,11195,1923,3465,544,66,260,1311,1082,1840,358,97,0.31,1
4,Honus Wagner,21,2792,10430,1736,3430,640,252,101,0,963,327,722,15,0.329,1


In [4]:
df.tail()  # show the last 5 rows of the data

Unnamed: 0,PLAYER,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,BA,HOF
460,Jim Wynn,15,1920,6653,1105,1665,285,39,291,964,1224,1427,225,101,0.25,0
461,Jorge Posada,17,1829,6092,900,1664,379,10,275,1065,936,1453,20,21,0.273,0
462,Brady Anderson,15,1834,6499,1062,1661,338,67,210,761,960,1190,315,100,0.256,0
463,Cookie Rojas,16,1822,6309,714,1660,254,25,54,593,396,489,74,68,0.263,0
464,Mickey Rivers,15,1468,5629,785,1660,247,71,61,499,266,471,267,90,0.295,0


In [5]:
df.info()  # column names, non-null counts, and dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 465 entries, 0 to 464
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   PLAYER  465 non-null    object 
 1   YRS     465 non-null    int64  
 2   G       465 non-null    int64  
 3   AB      465 non-null    int64  
 4   R       465 non-null    int64  
 5   H       465 non-null    int64  
 6   2B      465 non-null    int64  
 7   3B      465 non-null    int64  
 8   HR      465 non-null    int64  
 9   RBI     465 non-null    int64  
 10  BB      465 non-null    int64  
 11  SO      465 non-null    int64  
 12  SB      465 non-null    int64  
 13  CS      465 non-null    int64  
 14  BA      465 non-null    float64
 15  HOF     465 non-null    int64  
dtypes: float64(1), int64(14), object(1)
memory usage: 58.2+ KB


In [6]:
df.describe()  # summary statistics for the numeric columns

Unnamed: 0,YRS,G,AB,R,H,2B,3B,HR,RBI,BB,SO,SB,CS,BA,HOF
count,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0,465.0
mean,17.049462,2048.698925,7511.455914,1150.313978,2170.247312,380.952688,78.554839,201.049462,894.260215,783.56129,847.470968,195.905376,58.083871,0.288712,0.329032
std,2.765186,354.391805,1294.065992,289.635071,424.190773,96.48346,49.36303,143.622664,486.193456,327.43195,489.224289,181.845543,48.027509,0.021208,0.474928
min,11.0,1331.0,4981.0,601.0,1660.0,177.0,3.0,9.0,0.0,239.0,0.0,7.0,0.0,0.246,0.0
25%,15.0,1802.0,6523.0,936.0,1838.0,312.0,41.0,79.0,640.0,535.0,436.0,63.0,22.0,0.273,0.0
50%,17.0,1993.0,7241.0,1104.0,2076.0,366.0,67.0,178.0,968.0,736.0,825.0,137.0,52.0,0.287,0.0
75%,19.0,2247.0,8180.0,1296.0,2375.0,436.0,107.0,292.0,1206.0,955.0,1226.0,285.0,84.0,0.3,1.0
max,26.0,3308.0,12364.0,2295.0,4189.0,792.0,309.0,755.0,2297.0,2190.0,2597.0,1406.0,335.0,0.366,2.0
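
The summary above reports a maximum of 2.0 for `HOF`, which is unexpected if the column is meant to be a 0/1 flag. A minimal sketch of how the distinct labels might be counted, using a made-up list in place of the real column (the values below are hypothetical; in the notebook, `df["HOF"].value_counts()` would give the same kind of tally):

```python
from collections import Counter

# Hypothetical label column standing in for df["HOF"]; not the real data.
hof_labels = [1, 1, 0, 0, 2, 0, 1]

# Count how often each distinct label occurs.
label_counts = Counter(hof_labels)

# Anything outside {0, 1} is unexpected for a binary target.
unexpected = {label: n for label, n in label_counts.items() if label not in (0, 1)}

print(label_counts)
print(unexpected)
```

If unexpected labels turn up, they could be corrected or dropped before the train/test split, depending on what the value is supposed to mean.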


In [7]:
df = df.drop(columns=["PLAYER", "CS"])  # drop the PLAYER and CS columns

In [8]:
X = df.iloc[:, 0:13]  # features: the first 13 columns (YRS through BA)

In [9]:
y = df.iloc[:, 13]  # target: the last column (HOF)

# 3. Model Training
## In model training we split the data into train (80%) and test (20%) sets and fit the model on the training set.


In [10]:
from sklearn.model_selection import train_test_split #importing train_test_split from sklearn

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17, test_size=0.2)
# random_state seeds the shuffle that happens before the split, so the same split is reproduced on every run
# test_size=0.2 holds out 20% of the rows for testing; the other 80% are used for training
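
To illustrate what a fixed seed buys, here is a rough sketch of a seeded shuffle-and-split written with Python's `random` module. It is not scikit-learn's implementation, the helper name `seeded_split` is made up for this example, and the row order it produces will differ from `train_test_split`. It only shows that the same seed yields the same split.

```python
import random

def seeded_split(n_rows, test_size=0.2, seed=17):
    """Return (train_idx, test_idx) after a seeded shuffle of the row indices."""
    indices = list(range(n_rows))
    rng = random.Random(seed)   # private generator: the seed fully determines the order
    rng.shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_a, test_a = seeded_split(465)
train_b, test_b = seeded_split(465)   # same seed, so the split is identical

print(len(train_a), len(test_a))
print(test_a == test_b)
```

With 465 rows and `test_size=0.2`, the sketch holds out 93 rows and keeps 372, matching the shapes shown in the cells that follow.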

In [12]:
X_train.shape  #shape of X_train

(372, 13)

In [13]:
y_train.shape  # shape of y_train

(372,)

In [14]:
y_test.shape  # shape of y_test

(93,)

# 4. Decision Tree Classifier
## A decision tree classifier is a supervised machine learning algorithm used for classification tasks. It works by partitioning the feature space into regions and assigning a class label to each region based on the majority class of the training examples within that region.
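
The partitioning is driven by an impurity measure: a candidate split is better when the child nodes are purer than the parent. A minimal sketch of the two measures used in this notebook (`gini` for the first model, `entropy` for the second), with hypothetical labels and helper names chosen for illustration:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits of the class proportions."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def weighted_impurity(left, right, impurity=gini):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

# Hypothetical node with four samples of each class, and one candidate split.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print(gini(parent), entropy(parent))
print(weighted_impurity(left, right))
```

Here the split lowers the Gini impurity from 0.5 to 0.375, so a tree builder comparing candidate splits would prefer it over one that leaves the children as mixed as the parent.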

In [15]:
from sklearn.tree import DecisionTreeClassifier #importing decision tree classifier model

In [16]:
kd = DecisionTreeClassifier()  # instantiate the model with default hyperparameters (random_state is unset, so results can vary between runs)

In [17]:
kd.get_params()  # list the model's hyperparameters

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': None,
 'splitter': 'best'}

In [18]:
kd.fit(X_train, y_train)  # fit the model on the training data

In [19]:
y_pred = kd.predict(X_test)  # predict labels for X_test

In [20]:
from sklearn.metrics import confusion_matrix  #importing confusion matrix

In [21]:
print(confusion_matrix(y_test, y_pred))  # rows are true classes, columns are predicted classes

[[51 10]
 [11 21]]


In [22]:
from sklearn.metrics import classification_report #importing classification_report

In [23]:
print(classification_report(y_test,y_pred)) #applying classification_report

              precision    recall  f1-score   support

           0       0.82      0.84      0.83        61
           1       0.68      0.66      0.67        32

    accuracy                           0.77        93
   macro avg       0.75      0.75      0.75        93
weighted avg       0.77      0.77      0.77        93
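
The per-class figures in the report can be reproduced by hand from the confusion matrix printed earlier. A small sketch in plain Python (the helper name `per_class_metrics` is made up for this example):

```python
# Confusion matrix of the first model: rows are true classes, columns are predicted classes.
cm = [[51, 10],
      [11, 21]]

def per_class_metrics(cm, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)   # column sum: everything predicted as cls
    actual = sum(cm[cls])                     # row sum: everything that truly is cls
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy is the diagonal (correct predictions) over the total count.
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for cls in range(len(cm)):
    p, r, f = per_class_metrics(cm, cls)
    print(cls, round(p, 2), round(r, 2), round(f, 2))
print("accuracy", round(accuracy, 2))
```

Rounded to two decimals, this gives 0.82, 0.84, 0.83 for class 0, 0.68, 0.66, 0.67 for class 1, and an accuracy of 0.77, in line with the report.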



In [24]:
kd.feature_importances_   #finding importance for each column

array([0.01050289, 0.04804585, 0.04194517, 0.0272322 , 0.3873836 ,
       0.06580402, 0.00953164, 0.03972447, 0.06455966, 0.13090403,
       0.04649472, 0.02522985, 0.10264192])

In [25]:
X.columns #displaying column names

Index(['YRS', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SB',
       'BA'],
      dtype='object')

In [26]:
feature = pd.DataFrame(kd.feature_importances_, index=X.columns)  # pair each importance score with its column name in a DataFrame

In [27]:
feature.head(15)  # show up to 15 rows, enough to cover all 13 features

Unnamed: 0,0
YRS,0.010503
G,0.048046
AB,0.041945
R,0.027232
H,0.387384
2B,0.065804
3B,0.009532
HR,0.039724
RBI,0.06456
BB,0.130904
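
The importances are easier to read once sorted. A short sketch that ranks the scores printed by `kd.feature_importances_` above, copying the values and column names into plain lists (in the notebook, something like `feature.sort_values(by=0, ascending=False)` should give the same ordering):

```python
# Column names and importance scores copied from the notebook output above.
columns = ['YRS', 'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'BB', 'SO', 'SB', 'BA']
importances = [0.01050289, 0.04804585, 0.04194517, 0.0272322, 0.3873836,
               0.06580402, 0.00953164, 0.03972447, 0.06455966, 0.13090403,
               0.04649472, 0.02522985, 0.10264192]

# Pair each column with its score and sort from most to least important.
ranked = sorted(zip(columns, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranked[:5]:
    print(f"{name:>3}  {score:.3f}")
```

For this run, hits (`H`) dominate, followed by walks (`BB`) and batting average (`BA`). Because the tree was built without a fixed `random_state`, the exact scores may differ on a rerun.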


In [28]:
kd2 = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.04)  # entropy criterion with cost-complexity pruning (ccp_alpha = 0.04)
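
The `ccp_alpha` argument turns on minimal cost-complexity pruning. As described in the scikit-learn documentation, pruning looks for the subtree $T$ that minimises a penalised impurity. The notation below is an assumption made for this note, not something defined in the notebook:

$$
R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|
$$

Here $R(T)$ is the total sample-weighted impurity of the leaves of $T$, $|\widetilde{T}|$ is the number of leaves, and $\alpha \ge 0$ is the value passed as `ccp_alpha`. With $\alpha = 0$ (the default used for `kd`) nothing is pruned. Larger values favour smaller trees, which would be consistent with the pruned model's importances further down concentrating on very few features.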

In [29]:
kd2.fit(X_train, y_train)  # fit the second model on the same training data

In [30]:
y_pred2 = kd2.predict(X_test)  # predict with the second model

In [31]:
print(confusion_matrix(y_test, y_pred2))  # confusion matrix for the second model

[[50 11]
 [ 9 23]]


In [32]:
print(classification_report(y_test, y_pred2))  # classification report for the second model

              precision    recall  f1-score   support

           0       0.85      0.82      0.83        61
           1       0.68      0.72      0.70        32

    accuracy                           0.78        93
   macro avg       0.76      0.77      0.77        93
weighted avg       0.79      0.78      0.79        93
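
The two averaged rows differ only in how the per-class scores are combined: the macro average treats both classes equally, while the weighted average scales each class by its support. A brief sketch that recomputes both F1 averages from the second model's confusion matrix (the helper name `f1_for_class` is made up for this example):

```python
# Confusion matrix of the second model: rows are true classes, columns are predicted classes.
cm2 = [[50, 11],
       [9, 23]]

def f1_for_class(cm, cls):
    """F1 for one class, using F1 = 2*TP / (2*TP + FP + FN)."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)   # TP + FP
    actual = sum(cm[cls])                     # TP + FN
    return 2 * tp / (predicted + actual)

supports = [sum(row) for row in cm2]          # number of true samples per class
f1_scores = [f1_for_class(cm2, c) for c in range(len(cm2))]

macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(round(macro_f1, 2), round(weighted_f1, 2))
```

The weighted figure comes out higher than the macro figure because class 0, which the model handles better, makes up 61 of the 93 test rows.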



In [33]:
feature2 = pd.DataFrame(kd2.feature_importances_, index=X.columns)  # pair each importance score with its column name in a DataFrame

In [34]:
feature2.head(15)  # show up to 15 rows, enough to cover all 13 features

Unnamed: 0,0
YRS,0.0
G,0.0
AB,0.0
R,0.0
H,0.837977
2B,0.0
3B,0.0
HR,0.0
RBI,0.0
BB,0.0
