## GROUP 135 PROJECT PROPOSAL

Group Members: Jannah Mohamed, Kate Saychaleun, Julia Toito, Elaine Chen


In [1]:
library(tidyverse)
library(readxl)
library(ggplot2)
library(tidymodels)


In [None]:
# Download the Excel workbook; mode = "wb" keeps the binary .xls file intact on Windows
data_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00257/Data_User_Modeling_Dataset_Hamdi%20Tolga%20KAHRAMAN.xls"
data_file <- "Data_User_Modeling_Dataset_Hamdi Tolga KAHRAMAN.xls"
download.file(url = data_url, destfile = data_file, mode = "wb")
# Read the training and test sheets, keeping only the columns STG through UNS
user_training_data <- read_excel(data_file, sheet = "Training_Data") %>%
    select(STG:UNS)
user_test_data <- read_excel(data_file, sheet = "Test_Data") %>%
    select(STG:UNS)

In [None]:
# Run this cell before continuing.
# Wrangle and clean the data from its original format to the format necessary for the planned analysis
user_training_data %>%
    select(STG:UNS) %>%
    head()
    
user_test_data %>%
    select(STG:UNS) %>%
    head()

In [None]:
# tally the number of observations in each category of UNS,
# ordered from lowest to highest knowledge level
tally <- user_training_data %>%
    group_by(UNS) %>%
    summarize(amount = n()) %>%
    arrange(factor(UNS, levels = c("very_low", "Low", "Middle", "High")))

# mean of each predictor variable
means <- user_training_data %>%
    select(-UNS) %>%
    map_df(mean)

In [None]:
pvl1 <- user_training_data %>%
    ggplot(aes(x = STG, y = UNS, fill = UNS)) +
    geom_boxplot() +
    ylab("UNS (The Knowledge Level Of User)") +
    xlab("STG (The Degree of Study Time for Goal Object Materials)") +
    scale_y_discrete(limits = c("very_low", "Low", "Middle", "High")) +
    ggtitle("The Knowledge Level Of a User \nDepending on their Degree of Study Time for Goal Object Materials") +
    theme(text = element_text(size = 14))

pvl2 <- user_training_data %>%
    ggplot(aes(x = SCG, y = UNS, fill = UNS)) +
    geom_boxplot() +
    ylab("UNS (The Knowledge Level Of User)") +
    xlab("SCG (The Degree of Repetition Number of User for Goal Object Materials)") +
    scale_y_discrete(limits = c("very_low", "Low", "Middle", "High")) +
    ggtitle("The Knowledge Level Of a User \nDepending on their Degree of Repetition Number of User for Goal Object Materials") +
    theme(text = element_text(size = 14))

pvl3 <- user_training_data %>%
    ggplot(aes(x = STR, y = UNS, fill = UNS)) +
    geom_boxplot() +
    ylab("UNS (The Knowledge Level Of User)") +
    xlab("STR (The Degree of Study Time of User for Related Objects with Goal Object)") +
    scale_y_discrete(limits = c("very_low", "Low", "Middle", "High")) +
    ggtitle("The Knowledge Level Of a User \nDepending on their Degree of Study Time of User for Related Objects with Goal Object")+
    theme(text = element_text(size = 14))

pvl4 <- user_training_data %>%
    ggplot(aes(x = LPR, y = UNS, fill = UNS)) +
    geom_boxplot() +
    ylab("UNS (The Knowledge Level Of User)") +
    xlab("LPR (The Exam Performance of User for Related Objects with Goal Object)") +
    scale_y_discrete(limits = c("very_low", "Low", "Middle", "High")) +
    ggtitle("The Knowledge Level Of a User \nDepending on their Exam Performance of User for Related Objects with Goal Object") +
    theme(text = element_text(size = 14))

pvl5 <- user_training_data %>%
    ggplot(aes(x = PEG, y = UNS, fill = UNS)) +
    geom_boxplot() +
    ylab("UNS (The Knowledge Level Of User)") +
    xlab("PEG (The Exam Performance of User for Goal Objects)") +
    scale_y_discrete(limits = c("very_low", "Low", "Middle", "High")) +
    ggtitle("The Knowledge Level Of a User \nDepending on their Exam Performance of User for Goal Objects") +
    theme(text = element_text(size = 14))

In [None]:
# performing forward selection
names <- colnames(user_training_data |> select(-UNS))
accuracies <- tibble(size = integer(), 
                     model_string = character(), 
                     accuracy = numeric())

knn_spec <- nearest_neighbor(weight_func = "rectangular", 
                             neighbors = tune()) |>
     set_engine("kknn") |>
     set_mode("classification")

# 5-fold cross-validation splits of the training data, stratified by UNS
user_vfold <- vfold_cv(user_training_data, v = 5, strata = UNS)

n_total <- length(names)

selected <- c()

for (i in 1:n_total) {
    accs <- list()
    models <- list()
    for (j in 1:length(names)) {
        preds_new <- c(selected, names[[j]])
        model_string <- paste("UNS", "~", paste(preds_new, collapse="+"))

        # create a recipe from the model string
        user_recipe <- recipe(as.formula(model_string), 
                              data = user_training_data) |>
                          step_scale(all_predictors()) |>
                          step_center(all_predictors())

        # tune the KNN classifier with these predictors, 
        # and collect the accuracy for the best K
        acc <- workflow() |>
          add_recipe(user_recipe) |>
          add_model(knn_spec) |>
          tune_grid(resamples = user_vfold, grid = 10) |>
          collect_metrics() |>
          filter(.metric == "accuracy") |>
          summarize(mx = max(mean))
        acc <- acc$mx |> unlist()

        # add this result to the dataframe
        accs[[j]] <- acc
        models[[j]] <- model_string
    }
    jstar <- which.max(unlist(accs))
    accuracies <- accuracies |> 
      add_row(size = i, 
              model_string = models[[jstar]], 
              accuracy = accs[[jstar]])
    selected <- c(selected, names[[jstar]])
    names <- names[-jstar]
}

In [None]:
point <- user_training_data %>%
    ggplot(aes(x = PEG, y = LPR, color = as_factor(UNS))) +
    geom_point() +
    labs(x = "PEG (The Exam Performance of User for Goal Objects)", y = "LPR (The Exam Performance of User for Related Objects with Goal Object)", color = "UNS (The Knowledge Level Of User)")+
    ggtitle("The Exam Performance of User for Goal Objects \n Depending on The Exam Performance of User for Related Objects with Goal Object") +
    theme(text = element_text(size = 14))

## Introduction
#### **Which set of predictors most accurately predicts a user’s knowledge level (UNS)?**
For our group project, we will explore the development of intuitive knowledge classifiers and the modeling of domain-dependent data. The dataset was collected to classify a user's level of knowledge of Electrical DC Machines using an intuitive algorithmic classifier.
The knowledge classifier proposed with the dataset first searches for the best weight values for the students’ knowledge classes. It then calculates the distance between students from their data and these weight values, and uses those distances to assign each student a knowledge class (very_low, Low, Middle, High). The model classifies students by weighting their domain-dependent data, combining user modeling algorithms with a population-based approach. This approach lets us optimize the accuracy score and visualize the relationship between the categories.

## Preliminary Exploratory Analysis

After downloading the dataset and reading it from the Excel (.xls) file, we will use the “UNS (The knowledge level of user)” column as the target variable.

## Graphs and Tables

This table shows how many observations in user_training_data fall into each category of UNS (very_low, Low, Middle, High). This gives us an idea of how balanced the categories are. If the training data are skewed towards one category, we can adjust the weighting to accommodate the other categories accordingly.

In [None]:
tally
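
To put the counts in context, here is a minimal sketch of turning category counts into proportions and a majority-class baseline accuracy. The label vector `uns_example` is made up for illustration and does not reflect the real counts; in this notebook one would presumably use `user_training_data$UNS` instead.

```r
# Hypothetical labels for illustration only (not the real data).
uns_example <- c("very_low", "Low", "Low", "Middle",
                 "Middle", "Middle", "High", "High")

# Order the categories from lowest to highest knowledge level.
uns_levels <- c("very_low", "Low", "Middle", "High")
uns_factor <- factor(uns_example, levels = uns_levels)

# Proportion of observations in each category.
class_props <- prop.table(table(uns_factor))
print(round(class_props, 3))

# Accuracy of always predicting the most common category: a simple
# baseline for judging a classifier's estimated accuracy.
baseline_accuracy <- max(class_props)
print(baseline_accuracy)
```

A classifier is only informative if its estimated accuracy is clearly above this baseline.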

This table shows the mean of each one of the potential predictor variables (STG, SCG, STR, LPR, PEG).

In [None]:
means

**Pvl1**: There is a mostly positive relationship between study time (STG) and UNS: students with higher knowledge levels typically spent more time studying the Goal Object Materials. However, there are some outliers. For example, one student's degree of study time is high, yet their UNS is still “very low”.

In [None]:
options(repr.plot.width = 7, repr.plot.height = 7)
pvl1

**Pvl2**: There is a positive trend between repetition (SCG) and UNS, suggesting that students who repeat the Goal Object Materials more often tend to reach higher knowledge levels.

In [None]:
pvl2

**Pvl3**: There is no real relationship between UNS and STR.

In [None]:
pvl3

**Pvl4**: There is no real relationship between UNS and LPR.

In [None]:
pvl4

**Pvl5**: There is a clear positive relationship between exam performance for Goal Objects (PEG) and UNS. In other words, the higher a student scores on the exam, the higher their UNS level tends to be. There are some outliers: students who scored lower than is typical for their UNS level, so their knowledge level is higher than their exam score alone would suggest.

In [None]:
pvl5

**accuracies:** Using forward selection, we add predictors one at a time, at each step keeping the variable that gives the highest estimated cross-validation accuracy, to find which ones are most suitable to use as predictors. As we can see in this table, the combination of PEG + LPR gives the highest estimated accuracy for predicting UNS; therefore it would be ideal to use PEG and LPR as predictors.

In [None]:
accuracies
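
The greedy search behind this table can be sketched in base R as follows. This is an illustrative sketch, not the notebook's implementation: `forward_select` and `toy_score` are hypothetical names, and the gains in `toy_score` are made-up numbers standing in for the cross-validated KNN accuracy.

```r
# Greedy forward selection over candidate predictor names.
# `score_fn` is a hypothetical stand-in: it takes a character vector of
# predictor names and returns an estimated accuracy for that model.
forward_select <- function(candidates, score_fn) {
  selected <- character()
  history <- data.frame(size = integer(), model = character(),
                        accuracy = numeric(), stringsAsFactors = FALSE)
  while (length(candidates) > 0) {
    # Score every model formed by adding one remaining candidate.
    scores <- vapply(candidates,
                     function(v) score_fn(c(selected, v)),
                     numeric(1))
    best <- which.max(scores)
    selected <- c(selected, candidates[[best]])
    history <- rbind(history,
                     data.frame(size = length(selected),
                                model = paste(selected, collapse = " + "),
                                accuracy = unname(scores[[best]]),
                                stringsAsFactors = FALSE))
    candidates <- candidates[-best]
  }
  history
}

# Toy scoring function with made-up gains (assumption, not real results).
toy_score <- function(vars) {
  gain <- c(PEG = 0.50, LPR = 0.10, STG = -0.01, SCG = -0.02, STR = -0.03)
  0.35 + sum(gain[vars])
}

result <- forward_select(c("STG", "SCG", "STR", "LPR", "PEG"), toy_score)
print(result)
```

With five candidates, this search scores 5 + 4 + 3 + 2 + 1 = 15 models, rather than the 2^5 - 1 = 31 non-empty subsets an exhaustive search would need. The trade-off is that it may miss a combination that is only good jointly.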

**Point:** The categories of UNS are clearly separated into different sections of the graph. We can see that as PEG gets larger, the UNS of the observations increases from very_low to Low to Middle to High.

In [None]:
options(repr.plot.width = 12, repr.plot.height = 7)
point

## Method
Because the variable we are trying to predict (UNS) is categorical rather than continuous, a classification model is appropriate. We will use K-nearest neighbors classification to predict UNS, with PEG and LPR as the predictor variables. We chose these because forward selection, which we used to compare candidate sets of predictors, gave PEG + LPR the highest estimated accuracy, at 0.96. We also graphed PEG against LPR and coloured the observations by UNS to check whether the observations separate into clear categories, which they do.
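
As an illustration of the idea behind K-nearest neighbors classification, here is a minimal base R sketch. It is not the tidymodels workflow used in this notebook: `knn_predict` is a hypothetical helper, and the PEG and LPR values below are made up. It standardizes both predictors, measures Euclidean distance, and takes a majority vote among the `k` closest training points.

```r
# Classify one new observation by majority vote among its k nearest
# training points, after centring and scaling the predictors.
knn_predict <- function(train_x, train_y, new_x, k) {
  # Standardize using the training means and standard deviations only.
  mu <- colMeans(train_x)
  sigma <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = mu, scale = sigma)
  new_z <- (new_x - mu) / sigma

  # Euclidean distance from the new point to every training point.
  dists <- sqrt(rowSums(sweep(train_z, 2, new_z)^2))

  # Majority vote among the k closest labels (ties go to the label
  # that comes first in table order).
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Made-up PEG and LPR values for illustration (not the real data).
train_x <- cbind(PEG = c(0.10, 0.15, 0.20, 0.80, 0.85, 0.90),
                 LPR = c(0.30, 0.60, 0.20, 0.40, 0.70, 0.30))
train_y <- c("Low", "Low", "Low", "High", "High", "High")

prediction <- knn_predict(train_x, train_y,
                          new_x = c(PEG = 0.75, LPR = 0.50), k = 3)
print(prediction)
```

Standardizing first matters because the distance would otherwise be dominated by whichever predictor has the larger spread.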

## Conclusion
We expect to find that higher levels of PEG and LPR are associated with a better understanding of DC machines, i.e., a higher UNS. This finding could help professors better gauge their students’ understanding of a given topic. Further questions we can ask include: What do we need to improve to better teach students about Electrical DC Machines? What are the best methods students can use to improve their understanding?

## References
Kahraman, H. T., Sagiroglu, S., & Colak, I. (2013). Developing intuitive knowledge classifier and modeling of users' domain dependent data in web. Knowledge Based Systems, 37, 283-295.

Kahraman, H. T. (2009). Designing and Application of Web-Based Adaptive Intelligent Education System. Gazi University Ph.D. Thesis, Turkey, 1-156.
