<a href="https://colab.research.google.com/github/primriq/ML-Apex-Univ/blob/main/decision_tree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2 style='color:blue' align='center'>Decision Tree Classification</h2>

We will build a decision tree classifier to predict whether a person's salary is more than 100K based on their company, job role, and degree.

We'll go step by step: load the data, preprocess it, train the model, and then evaluate it using accuracy, confusion matrix, and other metrics.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score,
    accuracy_score,
    classification_report
)

First, we load the data from `salaries.csv` and take a quick look at the first few rows to understand the columns.

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/primriq/ML-Apex-Univ/refs/heads/main/Datasets/salaries.csv")  # Read the data from the GitHub repo
df.head()

,company,job,degree,salary_more_then_100k
0,google,sales executive,bachelors,0
1,google,sales executive,masters,0
2,google,business manager,bachelors,1
3,google,business manager,masters,1
4,google,computer programmer,bachelors,0


Next, we separate the input features (company, job, degree) from the target label `salary_more_then_100k`, which indicates whether the salary is above 100K.

In [3]:
inputs = df.drop('salary_more_then_100k', axis='columns')
target = df['salary_more_then_100k']

The feature columns are currently text (categorical). We need to convert them into numbers so that the decision tree model can work with them. We will use `LabelEncoder` for this.
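To make the encoding concrete, here is a minimal sketch of what `LabelEncoder` does with the three job titles visible in the preview above. The `jobs` list and the `mapping` dictionary are names introduced only for this illustration; the encoder assigns codes by sorting the unique labels.

```python
from sklearn.preprocessing import LabelEncoder

# Job titles as they appear in the data preview above
jobs = ["sales executive", "business manager", "computer programmer"]

le = LabelEncoder()
codes = le.fit_transform(jobs)

# classes_ holds the sorted unique labels; a label's code is its index there
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)
# {'business manager': 0, 'computer programmer': 1, 'sales executive': 2}

# inverse_transform recovers the original labels from the integer codes
print(list(le.inverse_transform(codes)))
```

Note that the integer codes imply an ordering that the categories do not really have. A decision tree can usually work around this with extra splits, but scikit-learn documents `LabelEncoder` as a tool for target labels, so `OrdinalEncoder` or `OneHotEncoder` may be a better fit for feature columns in larger projects.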

In [4]:
le_company = LabelEncoder()
le_job = LabelEncoder()
le_degree = LabelEncoder()

inputs['company_n'] = le_company.fit_transform(inputs['company'])
inputs['job_n'] = le_job.fit_transform(inputs['job'])
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])
inputs

,company,job,degree,company_n,job_n,degree_n
0,google,sales executive,bachelors,2,2,0
1,google,sales executive,masters,2,2,1
2,google,business manager,bachelors,2,0,0
3,google,business manager,masters,2,0,1
4,google,computer programmer,bachelors,2,1,0
5,google,computer programmer,masters,2,1,1
6,abc pharma,sales executive,masters,0,2,1
7,abc pharma,computer programmer,bachelors,0,1,0
8,abc pharma,business manager,bachelors,0,0,0
9,abc pharma,business manager,masters,0,0,1


Now that we have numeric versions of the categorical features, we can drop the original text columns and keep only the numeric ones for training.

In [5]:
inputs_n = inputs.drop(['company', 'job', 'degree'], axis='columns')
inputs_n.head()

,company_n,job_n,degree_n
0,2,2,0
1,2,2,1
2,2,0,0
3,2,0,1
4,2,1,0


Next, we create a decision tree classifier and train it using our numeric features and the target labels.

In [6]:
model = tree.DecisionTreeClassifier()
model.fit(inputs_n, target)
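
One way to see what a fitted tree has learned is to print its rules with `tree.export_text`. The sketch below is self-contained and uses a handful of illustrative encoded rows with made-up labels (not the full `salaries.csv`), so the names `X`, `y`, `clf`, and `rules` are assumptions for this example only.

```python
import pandas as pd
from sklearn import tree

# Illustrative encoded rows with hypothetical labels, not the full dataset
X = pd.DataFrame({
    "company_n": [2, 2, 2, 2, 0, 0],
    "job_n":     [2, 2, 0, 0, 2, 0],
    "degree_n":  [0, 1, 0, 1, 1, 0],
})
y = [0, 0, 1, 1, 0, 1]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Render the learned splits as indented text rules
rules = tree.export_text(clf, feature_names=list(X.columns))
print(rules)
```

In the notebook itself, the same call should work on the trained model as `tree.export_text(model, feature_names=list(inputs_n.columns))`.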

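Let's first check the training accuracy of the model, that is, the fraction of training examples the model classifies correctly. Because this score is computed on the same data the model was fit on, it does not tell us how well the model generalizes to unseen data.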

In [7]:
training_accuracy = model.score(inputs_n, target)
print('Training Accuracy:', training_accuracy)

Training Accuracy: 1.0
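
A perfect training score is expected from an unconstrained decision tree whenever no two identical feature rows carry different labels, because the tree can keep splitting until every training row is classified correctly. A more informative check holds out part of the data for testing. The sketch below assumes a small set of hypothetical encoded rows standing in for `inputs_n` and `target`; the split size and `random_state` are arbitrary choices.

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Hypothetical encoded rows standing in for inputs_n / target
X = pd.DataFrame({
    "company_n": [2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 1],
    "job_n":     [2, 2, 0, 0, 1, 1, 2, 1, 0, 0, 2, 0],
    "degree_n":  [0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1],
})
y = pd.Series([0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1])

# Hold out 25% of the rows, keeping the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

With a dataset this small, a single held-out score is noisy, so it is best read as a rough sanity check rather than a reliable estimate.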


To understand the model's performance in more detail, we generate predictions on the training data, compute the confusion matrix, and then calculate precision, recall, F1 score, and accuracy using scikit-learn metrics.

In [8]:
# Generate predictions
predictions = model.predict(inputs_n)

# Confusion matrix
cm = confusion_matrix(target, predictions)
print('Confusion Matrix:\n', cm)

# Other evaluation metrics
print('Accuracy:', accuracy_score(target, predictions))
print('Precision:', precision_score(target, predictions))
print('Recall:', recall_score(target, predictions))
print('F1 Score:', f1_score(target, predictions))

print('\nClassification Report:\n', classification_report(target, predictions))

Confusion Matrix:
 [[ 6  0]
 [ 0 10]]
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         6
           1       1.00      1.00      1.00        10

    accuracy                           1.00        16
   macro avg       1.00      1.00      1.00        16
weighted avg       1.00      1.00      1.00        16
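
The reported metrics follow directly from the four counts in the confusion matrix. As a minimal worked sketch in plain Python, using the matrix printed above (rows are true classes, columns are predicted classes, and class 1 is treated as the positive class):

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[6, 0],
      [0, 10]]

tn, fp = cm[0]  # true class 0: predicted 0 (TN), predicted 1 (FP)
fn, tp = cm[1]  # true class 1: predicted 0 (FN), predicted 1 (TP)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 1.0 1.0 1.0 1.0
```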



Finally, we can use the trained model to make predictions for specific inputs.

For example, suppose we want to know whether the salary of a person working at **Google** as a **Computer Programmer** with a **Bachelor's degree** is more than 100K. We first need to know the encoded values for these categories from the label encoders.
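
Rather than reading the codes off a table, they can be looked up with each encoder's `transform` method. The sketch below refits small encoders on the job and degree categories visible in the tables above so that it runs on its own; in the notebook, the already fitted `le_job` and `le_degree` (and `le_company` in the same way) would be used directly. The variables `job_code` and `degree_code` are names introduced for this example.

```python
from sklearn.preprocessing import LabelEncoder

# Refit on the categories shown in the tables above so the snippet is standalone
le_job = LabelEncoder().fit(
    ["sales executive", "business manager", "computer programmer"]
)
le_degree = LabelEncoder().fit(["bachelors", "masters"])

# transform expects a list of labels and returns an array of codes
job_code = int(le_job.transform(["computer programmer"])[0])
degree_code = int(le_degree.transform(["bachelors"])[0])
print(job_code, degree_code)  # 1 0

# A label the encoder has never seen raises ValueError
try:
    le_job.transform(["computer engineer"])
except ValueError as err:
    print("Unknown category:", err)
```

Looking codes up this way avoids hard-coded numbers that silently go stale if the dataset gains a new category.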

In [9]:
# Example 1: Google, Computer Programmer, Bachelor's degree
# Encoded values taken from le_company, le_job, and le_degree:
# google -> 2, computer programmer -> 1, bachelors -> 0
example_1 = [[2, 1, 0]]  # (company_n, job_n, degree_n)
model.predict(example_1)



array([0])

Similarly, we can check for **Google, Computer Programmer, Master's degree**.

In [10]:
# Example 2: Google, Computer Programmer, Master's degree
example_2 = [[2, 1, 1]]  # (company_n, job_n, degree_n)
model.predict(example_2)



array([1])