# **Data Mining for Confidence Level Detection**

Lydia Lonzarich and Katie Park

CPSC 322-01, Fall 2025

# Import Libraries

In [None]:
import importlib

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyRandomForestClassifier, MyDecisionTreeClassifier, MyNaiveBayesClassifier, MyDecisionTreeSolo

import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable

import matplotlib.pyplot as plt

# **Introduction**

## Dataset Description

The dataset used for this project is called "Confidence Detection Dataset," from Kaggle: https://www.kaggle.com/datasets/muhammadkhubaibahmad/confidence-detection-dataset

</br>

The dataset contains features extracted from human body landmarks and postures. These features are used to classify a person's confidence level, which is also the classification task for this project.


# Findings

3 different classifiers were used on the dataset:
1. Random Forest Classifier
1. Decision Tree Classifier
1. Naive Bayes Classifier

After using all 3 classifiers on the dataset, the classifier that performed the best was the Naive Bayes Classifier.

# **Data Analysis**

# Dataset Information

The dataset includes 19 attributes and 1 target (as defined by the dataset author), with 5,950 rows in total.

The class predicted from the attributes is "confidence_label," which takes one of the following class labels:
1. confident
2. neutral
3. low

</br>

Although the dataset provides 19 attributes, only 9 were used in the different classifier approaches. These 9 were chosen because they are the only attributes that can plausibly vary with a person's confidence level. An attribute such as "eye_distance," which represents the distance between the eyes, does not change with confidence, so it was excluded. The attributes used are:
1. "body_lean_x"
    * A float value representing the horizontal body lean ratio
1. "shoulder_center_x"
    * A float value representing the X-coordinate of the shoulder center
1. "hip_center_x"
    * A float value representing the X-coordinate of the hip-center
1. "spine_angle"
    * A float value representing the spine inclination angle in degrees
1. "head_tilt_angle"
    * A float value representing the head tilt angle in degrees
1. "shoulder_slope"
    * A float value representing the slope of the shoulder line
1. "head_direction"
    * A categorical value representing the head orientation (can be "Looking Straight," "Center," "Looking Right," or "Looking Left")
1. "arm_position"
    * A categorical value representing the arm position (can be "Partially Open," "Closed Arms," or "Open Arms")
1. "posture"
    * A categorical value representing the general body posture (can be "Upright," "Stiff," or "Slouched").


# Dataset Loading

Below, the dataset is loaded into a MyPyTable object. A deep copy is then created, because all float attribute values are normalized and discretized so that every classifier can work with them. Working on the copy leaves the original dataset's values unchanged.


The copy of the dataset is first normalized so that every value lies in [0, 1]. The normalized values are then discretized: each interval of width 0.1 is mapped to an integer from 1 to 10. This limits the number of attribute values that algorithms such as Decision Trees must handle and prevents every unique float value from creating its own branch in the tree.
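
As a rough illustration of the normalization step, the sketch below applies min-max scaling to a single column. It assumes min-max scaling is the normalization intended here. The helper name `min_max_normalize` is hypothetical and is not part of `mysklearn`.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] with min-max scaling (hypothetical helper)."""
    lowest = min(values)
    highest = max(values)
    spread = highest - lowest
    if spread == 0:
        # a constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(value - lowest) / spread for value in values]

print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

The smallest value in the column always maps to 0 and the largest to 1, so each column is scaled independently of the others.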

In [None]:
confidence_raw_data = MyPyTable().load_from_file("confidence_features.csv") # dataset before processing
confidence = confidence_raw_data.new_deep_copy() # the dataset we will normalize. Original dataset retains its true values

all_attributes = ["body_lean_x", "shoulder_center_x", "hip_center_x", "spine_angle", "head_tilt_angle", "shoulder_slope", "head_direction", "arm_position", "posture"]
continuous_attributes = all_attributes[:6]
categorical_attributes = all_attributes[6:]

# Relevant Summary Statistics


In Figure 1 below, the distribution of class labels in the dataset is visualized. Over half of the instances are labeled "Confident," and the smallest class, "Low," makes up only 19.4%.

In [None]:
# Pie chart of the percentages of each confidence label in the dataset

# reset figure
plt.figure(figsize = (4, 4))

# get x and y values (frequency of each confidence label)
freq = myutils.get_frequency(confidence_raw_data.get_column("confidence_label"))

xs = []
ys = []

for key in freq:
    xs.append(key)
    ys.append(freq[key])

# create the chart (with number of decimals)
plt.pie(ys, labels = xs, autopct = "%1.1f%%")

# add border to current figure
fig = plt.gcf()

# add title to pie chart, and change formatting
fig.suptitle("Figure 1: Confidence Label Percentages")
fig.patch.set_edgecolor("black")
fig.patch.set_linewidth(1)

plt.show()

Below, Figures 2, 3, and 4 show the life cycle of our data preprocessing. Figure 2 is a histogram of the raw, unchanged body_lean_x values from the original dataset.

In [None]:
# Histogram of the body_lean_x attribute before preprocessing

# reset figure
plt.figure()

# get x values
x = confidence_raw_data.get_column("body_lean_x")

plt.hist(x, bins = 20, alpha = 0.75)
plt.grid(True)

# add titles
plt.title("Figure 2: body_lean_x (before preprocessing)")
plt.xlabel("Horizontal body lean ratio")
plt.ylabel("Count")

plt.show()

The data then undergoes normalization. Figure 3 is a histogram of the same attribute after its values are normalized to the range [0, 1].

In [None]:
# Histogram of the body_lean_x attribute after normalization

# reset figure
plt.figure()

# get x values after normalization
x = confidence.get_column("body_lean_x")

plt.hist(x, bins = 20, alpha = 0.75)
plt.grid(True)

# add titles
plt.title("Figure 3: Body Lean X (after normalization)")
plt.xlabel("Horizontal body lean ratio")
plt.ylabel("Count")

plt.show()

After being normalized, the values are discretized: each interval of width 0.1 is mapped to an integer from 1 to 10, meaning values in [0, 0.1] are set to 1, values in (0.1, 0.2] are set to 2, etc.
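
The binning rule can be sketched as follows. This is only an illustration of the interval mapping described here, not the actual `myutils.my_discretizer`, whose implementation is not shown. The name `discretize_unit_interval` is hypothetical.

```python
def discretize_unit_interval(value, bins=10):
    """Map a normalized value in [0, 1] to an integer bin in 1..bins (hypothetical helper)."""
    for bin_number in range(1, bins + 1):
        # bin k covers ((k - 1) / bins, k / bins]; bin 1 also includes 0
        if value <= bin_number / bins:
            return bin_number
    return bins  # anything slightly above 1.0 falls into the last bin

print([discretize_unit_interval(v) for v in [0.0, 0.1, 0.15, 0.95, 1.0]])  # [1, 1, 2, 10, 10]
```

Comparing against explicit upper cutoffs keeps boundary values such as 0.1 in the lower bin, which matches the closed upper ends of the intervals.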

In [None]:
# find the column indices of the attributes we're using
att_indices = [confidence.column_names.index(att) for att in all_attributes]

# discretize values (to make continuous attribute vals --> categorical attribute vals)
# categorical attributes are stored as strings, so they are skipped
for row in confidence.data:
    for val_index in att_indices:
        if not isinstance(row[val_index], str):
            row[val_index] = myutils.my_discretizer(row[val_index])

The distribution of the normalized and discretized values is shown in Figure 4 for the body_lean_x attribute.

The same normalization and discretization are applied to every float-valued attribute, which keeps all numeric values consistent for classification.

In [None]:
# Histogram of the body_lean_x attribute after discretization

# reset figure
plt.figure()

# get x values after normalization AND discretization
x = confidence.get_column("body_lean_x")

plt.hist(x, bins = 20, alpha = 0.75)
plt.grid(True)

# add titles
plt.title("Figure 4: Body Lean X (after discretization)")
plt.xlabel("Horizontal body lean ratio")
plt.ylabel("Count")

plt.show()

# **Classification Results**

The 3 different classification approaches that were developed were: Decision Tree classifier, Random Forest classifier, and Naive Bayes classifier.

# Implementing Decision Tree for Classification

Below, the estimated predictive accuracy of the Decision Tree classifier is found using K-fold cross validation. Then, the Decision Tree classifier is trained and used to classify unseen instances. Various performance metrics are used to verify the performance of the classifier.


</br>


**K-Fold Cross Validation**

K-fold cross validation is used to give a robust estimate of the predictive performance of the Decision Tree model and interpret model stability by training and evaluating the model on different folds, or subsets, of the same dataset. During cross validation, the entire dataset is partitioned into k (approximately) equal folds, or subsets. Training is completed k times, wherein each fold becomes the test set exactly once.
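
The partitioning step can be sketched as below. This is a minimal illustration that assumes no shuffling or stratification. It is not the implementation behind `myutils.cross_val_predict`, and the name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, n_splits):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # the first (n_samples % n_splits) folds get one extra instance
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

folds = k_fold_indices(10, 3)
print([len(test) for _, test in folds])  # [4, 3, 3]
```

Every index lands in exactly one test fold, which is what makes each instance a test instance exactly once.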


</br>


**Fitting the Decision Tree Classifier**

After the k pairs of training and testing sets are created, a decision tree is fitted on each training set. Each decision tree is built with the Top Down Induction of Decision Trees (TDIDT) algorithm. At each node, the algorithm splits the data on the attribute whose partitions have the lowest entropy (meaning the least uncertainty). Each partition is then recursively partitioned until all class values in a partition are the same, there are no more attributes to split on, or there are no more instances to partition. In each of these cases, a leaf node is created.
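
The entropy-based attribute selection can be sketched as follows. This is a simplified illustration, not the code inside `MyDecisionTreeClassifier`. The helper names `entropy` and `select_attribute` and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def select_attribute(X, y, attribute_indices):
    """Return the attribute index whose partitions have the lowest weighted entropy."""
    best_index, best_entropy = None, float("inf")
    for index in attribute_indices:
        # group the class labels by the value each row has for this attribute
        partitions = {}
        for row, label in zip(X, y):
            partitions.setdefault(row[index], []).append(label)
        weighted = sum(len(part) / len(y) * entropy(part) for part in partitions.values())
        if weighted < best_entropy:
            best_index, best_entropy = index, weighted
    return best_index

X = [[1, "Upright"], [1, "Slouched"], [2, "Upright"], [2, "Slouched"]]
y = ["Confident", "Low", "Confident", "Low"]
print(select_attribute(X, y, [0, 1]))  # 1
```

In this toy data, splitting on attribute 1 produces two pure partitions (entropy 0), while attribute 0 leaves both partitions evenly mixed (entropy 1), so attribute 1 is selected.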


</br>


**Predicting Unseen Instances**

After a decision tree is created for a fold, the testing set is then used against the tree to predict each instance's class label. This is done by traversing down the tree, based on the attribute value at each node.
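
The traversal can be sketched with a toy tree. The nested-dictionary layout below is purely hypothetical and is not the representation used by `MyDecisionTreeClassifier`. It only shows how the attribute value at each node picks the next branch.

```python
def predict_with_tree(tree, instance):
    """Walk a nested-dict tree until a leaf (a plain class label) is reached."""
    node = tree
    while isinstance(node, dict):
        # look up the instance's value for the attribute tested at this node
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node

toy_tree = {
    "attribute": 1,
    "branches": {
        "Upright": "Confident",
        "Slouched": {"attribute": 0, "branches": {1: "Low", 2: "Neutral"}},
    },
}
print(predict_with_tree(toy_tree, [2, "Slouched"]))  # Neutral
```

This sketch raises a `KeyError` for an attribute value that has no branch. A full classifier needs a fallback for that case, such as predicting the majority class at the node.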

</br>


Below is the k-fold cross validation process.

In [None]:
# define X and y data
X = confidence.get_column_rows(all_attributes)
y = confidence.get_column("confidence_label")

# get all unique class labels.
labels = list(set(y))

# compute k fold cross validation with k=10 folds to evaluate model performance across different train and test subsets of data.
acc, err_rate, precision, recall, f1, y_trues, y_preds = myutils.cross_val_predict(X, y, 10, MyDecisionTreeClassifier, True)

**Decision Tree Performance Metrics for k-Fold Cross Validation**

For each fold in the dataset, after the Decision Tree classifier is fitted and evaluated against the test set, the performance metrics for that fold are found. After all k folds are fitted and evaluated, the average of each performance metric is found. The performance metrics include:
- Accuracy
- Error rate
- Precision
- Recall
- F1 Score
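
The metrics listed above can be sketched as below. The sketch assumes macro averaging (an unweighted mean over class labels) for precision, recall, and F1. The averaging used by `myevaluation` is not stated here, so this is an assumption, and `macro_metrics` is a hypothetical name.

```python
def macro_metrics(y_true, y_pred, labels):
    """Return (accuracy, error_rate, precision, recall, f1) using macro averaging."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # guard against division by zero when a label is never predicted or never present
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return accuracy, 1 - accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

acc_demo, err_demo, *_ = macro_metrics(["A", "A", "B", "B"], ["A", "B", "B", "B"], ["A", "B"])
print(acc_demo, err_demo)  # 0.75 0.25
```

Macro averaging gives every class equal weight, so a weak score on a small class such as "Low" lowers the average as much as a weak score on a large class.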

The results for each performance metric are output below.

In [None]:
print("K-FOLD CROSS VALIDATION (k=10) PERFORMANCE METRICS")
print(f"(Average) Accuracy Score: {round(acc, 2)}")
print(f"(Average) Error Rate: {round(err_rate, 2)}")
print(f"(Average) Precision Score: {round(precision, 2)}")
print(f"(Average) Recall Score: {round(recall, 2)}")
print(f"(Average) F1 Score: {round(f1, 2)}")

**Fitting the Final Classifier**

We chose to train our model with the same training set that was generated internally by the Random Forest class in order to ensure fair comparison of our three different algorithms on the same test instances.



</br>


**Generating Final Predictions**

We chose to generate predictions using the same testing set that was generated internally by the Random Forest class in order to ensure fair comparison of our three different algorithms on the same test instances.

In [None]:
# create a decision tree instance.
myTree = MyDecisionTreeClassifier()

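# train the decision tree classifier (use the same train set that we generated in the random forest class for fair classifier comparison).
# NOTE: myForest is created in the Random Forest section below, so that cell must be run before this one.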
myTree.fit(myForest.X_train, myForest.y_train)

# generate predictions (use the same test set that we generated in the random forest class for fair classifier comparison).
y_pred = myTree.predict(myForest.X_test)

**Decision Tree Performance Metrics for Final Model**

After the Decision Tree classifier has been evaluated against the test set, performance metrics are calculated and output below.

We found that our Decision Tree classifier resulted in 55.30% accuracy.

In [None]:
print("DECISION TREE CLASSIFIER RESULTS")
acc = myevaluation.accuracy_score(myForest.y_test, y_pred)
err_rate = 1 - acc  # error rate of the final model (not the cross-validation average)
precision = myevaluation.multiclass_precision_score(myForest.y_test, y_pred, labels=labels)
recall = myevaluation.multiclass_recall_score(myForest.y_test, y_pred, labels=labels)
f1 = myevaluation.multiclass_f1_score(myForest.y_test, y_pred, labels=labels)
print(f"Accuracy Score: {round(acc, 2)}")
print(f"Error Rate: {round(err_rate, 2)}")
print(f"Precision Score: {round(precision, 2)}")
print(f"Recall Score: {round(recall, 2)}")
print(f"F1 Score: {round(f1, 2)}")

# Implementing Random Forest for Classification

Below, the Random Forest classifier is fitted, and then used to classify unseen instances. After, various performance metrics are used to verify the performance of the Random Forest classifier.

</br>

**Generating Test and Training Sets**

For the Random Forest classifier, a pre-processing step occurs to the given dataset, wherein both training and testing sets are generated internally. This is done by first generating a random stratified test set, consisting of one third of the original dataset, with the test set having the same class distribution as the original dataset. After the test set is created, any remaining rows in the dataset that were not selected make up the training set which is then used to fit the classifier.
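
A stratified split of this kind can be sketched as follows. This is a minimal illustration and not the code inside `MyRandomForestClassifier`. The name `stratified_split`, the `seed` parameter, and the per-class rounding rule are assumptions.

```python
import random

def stratified_split(X, y, test_fraction=1 / 3, seed=0):
    """Split X and y so the test set keeps roughly the same class mix as y."""
    rng = random.Random(seed)
    # group row indices by class label
    indices_by_class = {}
    for index, label in enumerate(y):
        indices_by_class.setdefault(label, []).append(index)
    # draw the requested fraction of each class for the test set
    test_indices = set()
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_indices.update(indices[:n_test])
    # every row that was not drawn goes to the training set
    train_indices = [i for i in range(len(y)) if i not in test_indices]
    test_order = sorted(test_indices)
    X_train = [X[i] for i in train_indices]
    y_train = [y[i] for i in train_indices]
    X_test = [X[i] for i in test_order]
    y_test = [y[i] for i in test_order]
    return X_train, X_test, y_train, y_test

X_demo = [[i] for i in range(9)]
y_demo = ["Confident"] * 6 + ["Low"] * 3
_, X_test_demo, _, y_test_demo = stratified_split(X_demo, y_demo)
print(len(X_test_demo), y_test_demo.count("Confident"), y_test_demo.count("Low"))  # 3 2 1
```

Sampling each class separately is what keeps the class proportions of the test set close to those of the full dataset.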

</br>

**Fitting the Classifier**

Fitting the classifier is done by creating N "random" decision trees using bootstrapping over the remainder set. Bootstrapping is a technique where a training set is created by randomly selecting rows with replacement, and a per-tree test set is made up of the rows that were never selected. While the Top Down Induction of Decision Trees algorithm builds each tree in the forest, F randomly chosen attributes are considered as candidates for the split at each node. Entropy is still used to choose the best of these candidates, as in the Decision Tree classifier. After all N trees have been created, the M most accurate trees are kept for the random forest classifier. Note: each tree's accuracy is measured on the training and testing sets produced by bootstrapping.
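
The bootstrapping step can be sketched as below. This is an illustration only. The name `bootstrap_sample` and its `seed` parameter are hypothetical, and the forest's own sampling code may differ.

```python
import random

def bootstrap_sample(X, y, seed=None):
    """Sample len(X) rows with replacement; rows never drawn form the out-of-bag set."""
    rng = random.Random(seed)
    n = len(X)
    chosen = [rng.randrange(n) for _ in range(n)]
    chosen_set = set(chosen)
    out_of_bag = [i for i in range(n) if i not in chosen_set]
    X_sample = [X[i] for i in chosen]
    y_sample = [y[i] for i in chosen]
    X_oob = [X[i] for i in out_of_bag]
    y_oob = [y[i] for i in out_of_bag]
    return X_sample, y_sample, X_oob, y_oob

X_demo = [[i] for i in range(20)]
y_demo = [i % 3 for i in range(20)]
X_sample, y_sample, X_oob, y_oob = bootstrap_sample(X_demo, y_demo, seed=1)
print(len(X_sample))  # 20
```

Because rows are drawn with replacement, some rows appear more than once in the sample and others not at all. The rows left out give each tree its own held-out set for measuring accuracy.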

For this dataset, after testing multiple N, F, and M values, we found that they did not change the performance of the forest by much, so the Random Forest for the confidence level dataset uses:
1. N = 20
1. M = 5
1. F = 4

In [None]:
# define X and y data
X = confidence.get_column_rows(all_attributes)
y = confidence.get_column("confidence_label")

# create a random forest classifier instance using the best N, M, and F parameters found.
myForest = MyRandomForestClassifier(N=20, M=5, F=4)

# train the random forest classifier on our train data. (class does internal split into train and test set, so here we just use internal train set).
myForest.fit(X, y)

**Predicting Unseen Instances**

After the Random Forest classifier is fitted, the test set generated earlier is run against the classifier. Each unseen instance is passed through all M trees in the forest. The class label predicted most often by the M trees is taken as the predicted class for that unseen instance.
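
Majority voting can be sketched as follows. The name `majority_vote` is hypothetical. Ties go to the label that was counted first, which is an assumption and may not match how the forest breaks ties.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """tree_predictions[t][i] is tree t's label for instance i; returns one label per instance."""
    n_instances = len(tree_predictions[0])
    voted = []
    for i in range(n_instances):
        # count how many trees predicted each label for instance i
        votes = Counter(predictions[i] for predictions in tree_predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted

per_tree = [["Low", "Confident"], ["Low", "Neutral"], ["Confident", "Neutral"]]
print(majority_vote(per_tree))  # ['Low', 'Neutral']
```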

In [None]:
y_preds = myForest.predict()

**Random Forest Performance**

After the class predictions are found against the unseen instances, various performance metrics are used against the classifier to verify its performance.


The results for each performance metric are output below.

In [None]:
print("----- PERFORMANCE METRICS FOR RANDOM FOREST -----")
# define the labels outside the f-strings: nesting double quotes inside a double-quoted f-string is a syntax error before Python 3.12
rf_labels = ["Confident", "Neutral", "Low"]
print(f"Accuracy Score: {round(myevaluation.accuracy_score(myForest.y_test, y_preds), 2)}")
print(f"Error Rate: {round(1 - myevaluation.accuracy_score(myForest.y_test, y_preds), 2)}")
print(f"Precision Score: {round(myevaluation.multiclass_precision_score(myForest.y_test, y_preds, rf_labels), 2)}")
print(f"Recall Score: {round(myevaluation.multiclass_recall_score(myForest.y_test, y_preds, rf_labels), 2)}")
print(f"F1 Score: {round(myevaluation.multiclass_f1_score(myForest.y_test, y_preds, rf_labels), 2)}")

# Implementing Naive Bayes for Classification

Below, the estimated predictive accuracy of the Naive Bayes classifier is found using K-fold cross validation. Then, the Naive Bayes classifier is trained and used to classify unseen instances. Various performance metrics are used to verify the performance of the classifier.


</br>


**K-Fold Cross Validation**

K-fold cross validation is used to give a robust estimate of the predictive performance of the Naive Bayes model and interpret model stability by training and evaluating the model on different folds, or subsets, of the same dataset. During cross validation, the entire dataset is partitioned into k (approximately) equal folds, or subsets. Training is completed k times, wherein each fold becomes the test set exactly once.


</br>


**Fitting the Naive Bayes Classifier**

Fitting in Naive Bayes involves using the training data to estimate all probabilities the model needs to generate predictions.

For each fold during training, the class priors and class conditional probabilities are computed.

- Class priors are the probabilities of each class label occurring in the training data. Each is computed by dividing the number of training instances with that label by the total number of training instances.
- Class conditional probabilities are the probabilities of each attribute value occurring given a specific class label. For a given class, each conditional probability is found by dividing the number of training instances of that class that have the attribute value by the total number of training instances of that class.

Together, the class priors and conditional probabilities serve as the "learned" parameters of the Naive Bayes classifier.
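
The two sets of learned parameters can be sketched as below for categorical attributes. This is an illustrative sketch, not the `MyNaiveBayesClassifier` implementation. The name `fit_naive_bayes` and the nested-dictionary layout are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(X, y):
    """Return (priors, conditionals) estimated from categorical training data."""
    n = len(y)
    class_counts = Counter(y)
    # prior: share of training instances that carry each label
    priors = {label: count / n for label, count in class_counts.items()}
    # value_counts[label][attribute_index][value] = number of matching instances
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(X, y):
        for index, value in enumerate(row):
            value_counts[label][index][value] += 1
    # conditional: P(attribute value | label), i.e. count divided by the class size
    conditionals = {
        label: {index: {value: count / class_counts[label] for value, count in counts.items()}
                for index, counts in per_attribute.items()}
        for label, per_attribute in value_counts.items()
    }
    return priors, conditionals

X_demo = [["Upright"], ["Upright"], ["Slouched"], ["Slouched"]]
y_demo = ["Confident", "Confident", "Confident", "Low"]
priors_demo, conditionals_demo = fit_naive_bayes(X_demo, y_demo)
print(priors_demo)  # {'Confident': 0.75, 'Low': 0.25}
```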


</br>


**Predicting Unseen Instances**

After a Naive Bayes classifier is fitted, the testing set is then used to predict each instance's class label. For each unique class label in the dataset, the classifier computes a posterior probability, which is the probability of that label occurring given the attribute values in the test instance. The posterior for each class is calculated (up to a normalizing constant shared by all classes) by multiplying the class prior by the product of the conditional probabilities of the instance's attribute values for that class. Once the posteriors are found for every class, the test instance is assigned the class with the largest posterior probability.
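
The prediction rule can be sketched as follows. The priors and conditional probabilities are toy numbers made up for illustration, and `predict_naive_bayes` is a hypothetical helper rather than the project's classifier. An attribute value never seen with a class is given probability 0, which is an assumption (no smoothing).

```python
def predict_naive_bayes(priors, conditionals, instance):
    """Return the label with the largest prior * product of conditional probabilities."""
    best_label, best_score = None, -1.0
    for label, prior in priors.items():
        score = prior
        for index, value in enumerate(instance):
            # a value never seen with this class contributes probability 0
            score *= conditionals[label][index].get(value, 0.0)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

priors = {"Confident": 0.75, "Low": 0.25}
conditionals = {
    "Confident": {0: {"Upright": 2 / 3, "Slouched": 1 / 3}},
    "Low": {0: {"Slouched": 1.0}},
}
print(predict_naive_bayes(priors, conditionals, ["Upright"]))  # Confident
```

The scores are not divided by the probability of the evidence. That term is the same for every class, so it does not change which class has the largest score.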


</br>


Below is the k-fold cross validation process.

In [None]:
# define X and y data
X = [[row[idx] for idx in att_indices] for row in confidence.data]
y = confidence.get_column("confidence_label")

# get all unique class labels.
labels = list(set(y))

# compute the avg acc and error rate, avg precision, avg recall, and avg F1 over each train/test split of the data.
acc, err_rate, precision, recall, f1, y_trues, y_preds = myutils.cross_val_predict(X, y, 10, MyNaiveBayesClassifier, True)

**Naive Bayes Performance Metrics for k-Fold Cross Validation**

For each fold in the dataset, after the Naive Bayes Classifier is fitted and evaluated against the test set, the performance metrics for that fold are found. After all k folds are fitted and evaluated, the average of each performance metric is found.

The performance metrics include:
1. Accuracy
1. Error rate
1. Precision
1. Recall
1. F1 Score

The results for each performance metric are output below.

In [None]:
print("K-FOLD CROSS VALIDATION (k=10) PERFORMANCE METRICS")
print(f"(Average) Accuracy Score: {round(acc, 2)}")
print(f"(Average) Error Rate: {round(err_rate, 2)}")
print(f"(Average) Precision Score: {round(precision, 2)}")
print(f"(Average) Recall Score: {round(recall, 2)}")
print(f"(Average) F1 Score: {round(f1, 2)}")

**Fitting the Final Classifier**

We chose to train our model with the same training set that was generated internally by the Random Forest class in order to ensure fair comparison of our three different algorithms on the same test instances.



</br>


**Generating Final Predictions**

We chose to generate predictions using the same testing set that was generated internally by the Random Forest class in order to ensure fair comparison of our three different algorithms on the same test instances.

In [None]:
# create a naive bayes classifier instance.
my_nb = MyNaiveBayesClassifier()

# train our naive bayes classifier (use the same train set that we generated in the random forest class for fair classifier comparison).
my_nb.fit(myForest.X_train, myForest.y_train)

# generate confidence label predictions (use the same test set that we generated in the random forest class for fair classifier comparison).
y_pred = my_nb.predict(myForest.X_test)

**Naive Bayes Performance Metrics for Final Model**

After the Naive Bayes Classifier has been evaluated against the test set, performance metrics are calculated and output below.

We found that our Naive Bayes Classifier resulted in 56.88% accuracy.

In [None]:
print("NAIVE BAYES CLASSIFIER RESULTS")
acc = myevaluation.accuracy_score(myForest.y_test, y_pred)
precision = myevaluation.multiclass_precision_score(myForest.y_test, y_pred, labels=labels)
recall = myevaluation.multiclass_recall_score(myForest.y_test, y_pred, labels=labels)
f1 = myevaluation.multiclass_f1_score(myForest.y_test, y_pred, labels=labels)
print("Accuracy: ", acc)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1-Score: ", f1)