# Predicting User Knowledge Levels Through Study Habits and Exam Performance Analysis


#### Junseo Park, Michaela Ahkong, Teresa Yao, Lia Sayers

## Introduction

We plan to analyze the “User Knowledge Modeling” data set, which originates from the Ph.D. thesis of Dr. Kahraman and was retrieved through the UC Irvine Machine Learning Repository. This data set describes the knowledge class of students concerning Electrical DC Machines. Analyzing it will help us explore the following predictive question: Can we use a user’s degree of study time for goal object materials (STG) and a user’s exam performance for goal objects (PEG) to predict whether the user will have a very low, low, middle, or high knowledge level concerning Electrical DC Machines?

## Preliminary exploratory data analysis

In [5]:
import random  # Import the random module to generate random numbers.

import altair as alt  # Import Altair for declarative statistical visualization.
import pandas as pd  # Import pandas for data manipulation and analysis.
import numpy as np  # Import NumPy for numerical computing.

from sklearn import set_config  # Import set_config from sklearn to set global configuration.
from sklearn.compose import make_column_transformer  # Import make_column_transformer for creating a column transformer.
from sklearn.metrics.pairwise import euclidean_distances  # Import euclidean_distances to compute pairwise distances.
from sklearn.neighbors import KNeighborsClassifier  # Import KNeighborsClassifier for k-nearest neighbors classification.
from sklearn.pipeline import make_pipeline  # Import make_pipeline to construct a pipeline.
# OneHotEncoder encodes categorical features as a one-hot numeric array;
# StandardScaler standardizes features by removing the mean and scaling to unit variance.
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [6]:
# Disable maximum rows limit for Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

# Read the CSV file into a DataFrame
data = pd.read_csv("data/User_Knowledge_Modeling.csv")

# Select the columns of interest
# (the knowledge-level column is referenced as " UNS", with a leading space, throughout this notebook)
filtered_data = data[["STG", "SCG", "STR", "LPR", "PEG", " UNS"]]

# Display the filtered DataFrame
filtered_data

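|     | STG  | SCG  | STR  | LPR  | PEG  | UNS      |
|-----|------|------|------|------|------|----------|
| 0   | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | very_low |
| 1   | 0.08 | 0.08 | 0.10 | 0.24 | 0.90 | High     |
| 2   | 0.06 | 0.06 | 0.05 | 0.25 | 0.33 | Low      |
| 3   | 0.10 | 0.10 | 0.15 | 0.65 | 0.30 | Middle   |
| 4   | 0.08 | 0.08 | 0.08 | 0.98 | 0.24 | Low      |
| ... | ...  | ...  | ...  | ...  | ...  | ...      |
| 253 | 0.61 | 0.78 | 0.69 | 0.92 | 0.58 | High     |
| 254 | 0.78 | 0.61 | 0.71 | 0.19 | 0.60 | Middle   |
| 255 | 0.54 | 0.82 | 0.71 | 0.29 | 0.77 | High     |
| 256 | 0.50 | 0.75 | 0.81 | 0.61 | 0.26 | Middle   |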
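
The code in this notebook refers to the label column as `" UNS"`, with a leading space. If that space comes from the CSV header (an assumption; the file itself is not shown here), one option is to strip whitespace from all column names right after loading so later cells can use plain `"UNS"`. The sketch below uses a small in-memory CSV as a stand-in for the real file; `raw_csv`, `sample`, and `cleaned` are illustrative names only.

```python
import io

import pandas as pd

# Stand-in for the real file: two rows copied from the table above,
# with a header that has a stray space before "UNS" (assumed).
raw_csv = (
    "STG,SCG,STR,LPR,PEG, UNS\n"
    "0.00,0.00,0.00,0.00,0.00,very_low\n"
    "0.08,0.08,0.10,0.24,0.90,High\n"
)

# By default, read_csv keeps the leading space in the column name.
sample = pd.read_csv(io.StringIO(raw_csv))

# Strip surrounding whitespace from every column name.
cleaned = sample.rename(columns=str.strip)

print(list(sample.columns))
print(list(cleaned.columns))
```

If this cleanup were adopted, every later reference to `' UNS'` in the plotting cells would need to change to `'UNS'` as well.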


In [7]:
# Create a scatter plot using Altair
scatter_plot_stg = alt.Chart(filtered_data).mark_point().encode(
    x=alt.X("STG").title("The degree of study time for goal object materials"),  # X-axis representing the degree of study time
    y=alt.Y("PEG").title("The user performance in exam").scale(zero=False),  # Y-axis representing the user performance in exam
    color=alt.Color(' UNS',  # Color encoding by knowledge level (very_low, Low, Middle, High)
        scale=alt.Scale(
            domain=['very_low', 'Low', 'Middle', 'High'],  # Domain of UNS values
            range=['red', 'yellow', 'green', 'blue']  # Corresponding color range
        )
    )
).properties(
    title='STG vs PEG'  # Title of the scatter plot
)

# Show the scatter plot
scatter_plot_stg

In [8]:
# Additional scatter plot for SCG vs PEG
scatter_plot_scg = alt.Chart(filtered_data).mark_point().encode(
    x=alt.X("SCG").title("The degree of repetition number of user for goal object materials"),
    y=alt.Y("PEG").title("The user performance in exam").scale(zero=False),
    color=alt.Color(' UNS',  # Column name is written with a leading space, matching the earlier cells
        scale=alt.Scale(
            domain=['very_low', 'Low', 'Middle', 'High'],
            range=['red', 'yellow', 'green', 'blue']
        )
    )
).properties(
    title='SCG vs PEG',
)

# Combining the charts side by side
combined_charts = alt.hconcat(scatter_plot_stg, scatter_plot_scg).resolve_scale(color='independent')

# Display the combined charts
combined_charts

## Methods

To examine how exam performance for goal objects and study time relate to users’ knowledge levels, we will first organize the dataset into a tidy format containing only the columns we will use. By examining the PEG (exam performance of users for goal objects) and STG (degree of study time for goal object materials) columns, we can check whether the two are related, in particular whether more study time is associated with better exam performance. We will also use the knowledge level of the user (UNS) to determine whether higher or lower knowledge levels are associated with more study time, better exam performance, or both.

We used a scatter plot to visualize the distribution of STG and PEG across the UNS categories. The x-axis shows the degree of study time for goal object materials (STG) and the y-axis shows the exam performance of users for goal objects (PEG). The points are color-coded by the knowledge level of the user (UNS). This visualization lets us quickly identify patterns among the knowledge levels and see how study time and exam performance vary with users’ understanding of the goal object materials.
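
The imports at the top of the notebook include `KNeighborsClassifier`, `StandardScaler`, and `make_pipeline`, which suggests a k-nearest neighbors classifier on standardized predictors as one way to address the predictive question. The following is a minimal sketch of that idea, not the analysis itself. It runs on synthetic stand-in data with a made-up labelling rule, and the choice of `n_neighbors=5`, the 75/25 split, and the use of `train_test_split` are all assumptions that do not appear in the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real dataset): STG and PEG are drawn
# uniformly from [0, 1), and the label is assigned from PEG alone using
# arbitrary cut points chosen only for illustration.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"STG": rng.uniform(0, 1, 200), "PEG": rng.uniform(0, 1, 200)})
toy["UNS"] = pd.cut(
    toy["PEG"],
    bins=[-0.01, 0.25, 0.5, 0.75, 1.0],
    labels=["very_low", "Low", "Middle", "High"],
).astype(str)

# Hold out a quarter of the rows for evaluation, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    toy[["STG", "PEG"]],
    toy["UNS"],
    test_size=0.25,
    random_state=0,
    stratify=toy["UNS"],
)

# Standardize both predictors, then classify with k-nearest neighbors.
# n_neighbors=5 is a placeholder, not a tuned value.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_train, y_train)

accuracy = knn_pipeline.score(X_test, y_test)
print(f"Held-out accuracy on the synthetic data: {accuracy:.2f}")
```

On the real data, the number of neighbors would normally be chosen by cross-validation on the training set rather than fixed in advance.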

## Expected Outcomes and Significance

Based on the scatter plot, users who display high exam performance for goal objects tend to have a high knowledge level, while users who display poor exam performance tend to have a low or very low knowledge level. Exam performance varies more among users with a middle knowledge level, though most of these users still perform better than users with a low or very low knowledge level. Users with a low or very low knowledge level also appear to have a lower degree of study time than other users. However, across all data points, a user’s degree of study time does not appear to be strongly related to their exam performance. Thus, the predicted knowledge level for a user may depend primarily on their exam performance for goal objects. Future studies should investigate why study time varies so greatly within every knowledge level and explore other factors that influence exam performance.
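
The visual impressions above could be checked numerically with per-level means and a correlation coefficient. The sketch below illustrates the calls on the nine rows visible in the data preview earlier in this notebook. Nine rows are far too few to support any conclusion, so this only shows the mechanics; the variable names are illustrative.

```python
import pandas as pd

# The nine rows shown in the data preview (STG, PEG, and knowledge level only).
rows = pd.DataFrame({
    "STG": [0.00, 0.08, 0.06, 0.10, 0.08, 0.61, 0.78, 0.54, 0.50],
    "PEG": [0.00, 0.90, 0.33, 0.30, 0.24, 0.58, 0.60, 0.77, 0.26],
    "UNS": ["very_low", "High", "Low", "Middle", "Low",
            "High", "Middle", "High", "Middle"],
})

# Mean STG and PEG per knowledge level, ordered from lowest to highest level.
level_order = ["very_low", "Low", "Middle", "High"]
level_means = rows.groupby("UNS")[["STG", "PEG"]].mean().reindex(level_order)

# Pearson correlation between study time and exam performance.
stg_peg_corr = rows["STG"].corr(rows["PEG"])

print(level_means)
print(f"Pearson correlation between STG and PEG: {stg_peg_corr:.2f}")
```

Applied to the full `filtered_data` frame, the same two calls would give a first quantitative check on whether exam performance separates the knowledge levels more clearly than study time does.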