## User Knowledge Classification Model Proposal
##### Group 41: Daeun Lee, Jessie Megan, Mia Ling, Renee Chan

### 1. Introduction

Various factors can play a significant role in a student’s knowledge level. Understanding the attributes that contribute to knowledge levels can help improve a student’s academic success. In this project, the following question will be answered: what is the predicted knowledge level of a user given their degree of study time for goal object materials (STG) and the exam performance of the user for goal objects (PEG)?

This project uses the User Knowledge Modeling Dataset to predict the knowledge level of a user (UNS) by building a classification model with the k-nearest neighbors algorithm. The dataset records students' knowledge of the subject of Electrical DC Machines. The knowledge level of a user is classified as “Very Low”, “Low”, “Middle”, or “High”.

### 2. Preliminary exploratory data analysis

In [3]:
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(tidymodels)



Attribute Information:
1) STG (The degree of study time for goal object materials)
2) SCG (The degree of repetition number of user for goal object materials)
3) STR (The degree of study time of user for related objects with goal object)
4) LPR (The exam performance of user for related objects with goal object)
5) PEG (The exam performance of user for goal objects)
6) UNS (The knowledge level of user)

In [None]:
training_data <- read_csv("data/User_Modeling_Training_data.csv") |>
                    select(STG:UNS)
training_data

testing_data <- read_csv("data/User_Modeling_Testing_data.csv") |>
                    select(STG:UNS)
testing_data

Below are two summary tables. The first shows the number and percentage of observations in each class of UNS (the knowledge level of the user).
The second shows the mean of each predictor variable that can be used to predict the class of UNS.

In [None]:
Training_table_1 <- training_data |>
                    group_by(UNS) |>
                    summarize(count = n(), percentage = n() / nrow(training_data) * 100)
Training_table_1

Training_table_2 <- training_data |> 
                    summarize(across(STG:PEG,mean))
Training_table_2                                         

The two histograms below show the distributions of STG and PEG for each knowledge level.

In [None]:
# Histogram of STG, faceted by knowledge level. The training data is used
# so that the test set stays unseen during exploration.
STG_PLOT <- ggplot(training_data, aes(x = STG, fill = as_factor(UNS))) +
            geom_histogram() +
            facet_grid(rows = vars(UNS)) +
            labs(x = "The degree of study time for goal object materials",
                 fill = "The knowledge level of user") +
            ggtitle("The distribution of STG")
STG_PLOT

# Histogram of PEG, faceted by knowledge level. Assigned to PEG_PLOT only,
# so that STG_PLOT is not overwritten.
PEG_PLOT <- ggplot(training_data, aes(x = PEG, fill = as_factor(UNS))) +
            geom_histogram() +
            facet_grid(rows = vars(UNS)) +
            labs(x = "The exam performance of user for goal objects",
                 fill = "The knowledge level of user") +
            ggtitle("The distribution of PEG")
PEG_PLOT

### 3. Methods

We will conduct our data analysis by creating a classification model using the k-nearest neighbors algorithm, following the steps below:
1) Load the dataset
2) Create a scatterplot of STG against PEG
3) Use the K-nearest neighbors classification algorithm to predict the knowledge level of the user (UNS)

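The scatterplot step in the list above could look roughly like the sketch below. It is only an illustration: `toy_data` is a hypothetical stand-in with made-up values, not the User Knowledge Modeling Dataset, and the axis labels and title are our own wording.

```r
library(tidyverse)

# Hypothetical toy data standing in for the training set (values are made up)
toy_data <- tibble(
  STG = c(0.10, 0.25, 0.40, 0.55, 0.70, 0.85),
  PEG = c(0.05, 0.20, 0.30, 0.50, 0.75, 0.90),
  UNS = c("Very Low", "Low", "Low", "Middle", "High", "High")
)

# STG on the x-axis, PEG on the y-axis, points coloured by knowledge level
stg_peg_scatter <- ggplot(toy_data, aes(x = STG, y = PEG, colour = as_factor(UNS))) +
  geom_point() +
  labs(x = "Degree of study time for goal object materials (STG)",
       y = "Exam performance of user for goal objects (PEG)",
       colour = "Knowledge level of user (UNS)") +
  ggtitle("PEG versus STG by knowledge level")
stg_peg_scatter
```

Replacing `toy_data` with the real training data would show whether the knowledge levels separate along STG, PEG, or both.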

The variables from the dataset that we include in this analysis are the degree of study time for goal object materials (STG) and the exam performance of the user for goal objects (PEG) as predictors, and the knowledge level of the user (UNS) as the class to predict. We will visualize the data by creating a scatterplot with STG on the x-axis and PEG on the y-axis, which lets us see the relationship between STG and PEG. Then we will use the K-nearest neighbors classification algorithm to predict the knowledge level of the user (UNS). For a new observation, this classifier finds the K “nearest” observations in the training data set and assigns the class that is most common among them.
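
A minimal sketch of such a classifier with tidymodels is shown below. Several things in it are assumptions rather than decisions made in this proposal: the toy data is made up, `neighbors = 3` is an arbitrary choice of K, the predictors are standardized before fitting, and the `kknn` engine must be installed.

```r
library(tidymodels)

# Hypothetical toy training data (values are made up for illustration)
toy_train <- tibble(
  STG = c(0.05, 0.10, 0.15, 0.30, 0.35, 0.40, 0.60, 0.65, 0.70),
  PEG = c(0.10, 0.05, 0.15, 0.30, 0.35, 0.25, 0.80, 0.85, 0.90),
  UNS = factor(rep(c("Low", "Middle", "High"), each = 3),
               levels = c("Low", "Middle", "High"))
)

# Standardize both predictors so neither dominates the distance calculation
knn_recipe <- recipe(UNS ~ STG + PEG, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-nearest neighbors with unweighted votes; K = 3 is an arbitrary assumption
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

# Combine preprocessing and model, then fit on the training data
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

# Predict the knowledge level of a new, hypothetical user
new_user <- tibble(STG = 0.62, PEG = 0.82)
new_prediction <- predict(knn_fit, new_user)
new_prediction
```

The three training points closest to `new_user` all belong to the "High" class, so the majority vote gives "High".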

### 4. Expected outcomes and significance

#### a. What do we expect to find?
Using the k-nearest neighbors algorithm, we expect to build a classification model that accurately predicts students' knowledge level of the subject of Electrical DC Machines. From the dataset, we selected the degree of study time for goal object materials (STG) and the exam performance of the user for goal objects (PEG) as our predictors. This will let us assess how relevant those two variables are in predicting students' knowledge level.
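
One way to gauge how well STG and PEG predict UNS is to compare the model's predictions with the true labels of held-out data. The sketch below assumes accuracy and a confusion matrix as the measures, which this proposal does not specify. The data is made up, K = 3 is arbitrary, and the `kknn` engine must be installed.

```r
library(tidymodels)

knowledge_levels <- c("Low", "Middle", "High")

# Hypothetical toy training and testing data (values are made up)
toy_train <- tibble(
  STG = c(0.05, 0.10, 0.15, 0.30, 0.35, 0.40, 0.60, 0.65, 0.70),
  PEG = c(0.10, 0.05, 0.15, 0.30, 0.35, 0.25, 0.80, 0.85, 0.90),
  UNS = factor(rep(knowledge_levels, each = 3), levels = knowledge_levels)
)
toy_test <- tibble(
  STG = c(0.08, 0.33, 0.68),
  PEG = c(0.12, 0.28, 0.88),
  UNS = factor(knowledge_levels, levels = knowledge_levels)
)

# Fit a K-nearest neighbors workflow on the training data only
knn_fit <- workflow() |>
  add_recipe(recipe(UNS ~ STG + PEG, data = toy_train) |>
               step_scale(all_predictors()) |>
               step_center(all_predictors())) |>
  add_model(nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
              set_engine("kknn") |>
              set_mode("classification")) |>
  fit(data = toy_train)

# Attach predictions to the test rows
toy_predictions <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test)

# Proportion of test observations classified correctly
toy_accuracy <- toy_predictions |>
  metrics(truth = UNS, estimate = .pred_class) |>
  filter(.metric == "accuracy")
toy_accuracy

# Counts of predicted versus true classes
toy_conf_mat <- toy_predictions |>
  conf_mat(truth = UNS, estimate = .pred_class)
toy_conf_mat
```

On the real data, a low accuracy or a confusion matrix with many off-diagonal counts would suggest that STG and PEG alone are not enough to separate the knowledge levels.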

#### b. What impact could such findings have?
This classification model would be able to predict students' knowledge level of the subject of Electrical DC Machines. Such predictions could help identify the factors that influence students' knowledge level. Furthermore, the model could show how to improve their knowledge level and consequently guide them to perform better in class.

#### c. What future questions could this lead to?
Some questions that this classification model may lead to are as follows:
1) Are there any other variables that are relevant in predicting students' knowledge level about the subject of Electrical DC Machines?
2) How can we improve students’ knowledge level on a certain subject?