# **Group 8 Project - User Knowledge Modeling Data Set**
Authors: Minting Fu, Zeti Batrisha Jamiluddin Amini, Liz Ji, Caroline Zhang

## Introduction

In this project, we study a dataset describing the knowledge status of 403 real users in the subject of Electrical DC Machines, to understand how a user's knowledge level is determined. Both students and educational institutions pursue good academic performance. To find more efficient ways to study and earn better academic scores, we explore and analyze the User Knowledge Modeling dataset, which can be found <a href = "https://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling" target = "_blank">here</a>.

We would like to investigate this question through our project:

* Is there a relationship between STG, SCG, STR, LPR, PEG and UNS? (i.e., which of STG, SCG, STR, LPR and PEG are contributing factors to UNS?)


We will be looking at 5 different variables to predict the knowledge level of users (UNS). These variables are:

* STG : the degree of study time for goal object materials.
* SCG : the degree of repetition number of user for goal object materials.
* STR : the degree of study time of user for related objects with goal object.
* LPR : the exam performance of user for related objects with goal object.
* PEG : the exam performance of user for goal objects.

## Citation
Dua, D. and Graff, C. (2019). <a href="http://archive.ics.uci.edu/ml" target="_blank">UCI Machine Learning Repository</a>. Irvine, CA: University of California, School of Information and Computer Science.

Kahraman, H. T., Sagiroglu, S., and Colak, I. (2013). Developing intuitive knowledge classifier and modeling of users' domain dependent data in web. Knowledge Based Systems, vol. 37, pp. 283-295.

## Data Analysis

In [2]:
# Import the packages we need to analyze the data
library(readxl)
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

In [3]:
#Using the set.seed function to make sure our code is reproducible
set.seed(7)

In [4]:
# download the file from the website (create the data folder first if it does not exist)
url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00257/Data_User_Modeling_Dataset_Hamdi%20Tolga%20KAHRAMAN.xls'
dir.create('data', showWarnings = FALSE)
download.file(url, destfile='data/user_knowledge_data.xls', mode='wb')

# read the training data and make sure that the target variable, UNS, is a factor
training_data <- read_excel('data/user_knowledge_data.xls', sheet=2, range='A1:F259') %>% 
                mutate(UNS = as_factor(UNS)) 
print("Training Data")
training_data


# read the testing data and make sure that the target variable, UNS, is a factor
test_data <- read_excel('data/user_knowledge_data.xls', sheet=3, range='A1:F146') %>% 
                mutate(UNS = as_factor(UNS))

print("Test Data")
test_data

[1] "Training Data"


STG,SCG,STR,LPR,PEG,UNS
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
0.00,0.00,0.00,0.00,0.00,very_low
0.08,0.08,0.10,0.24,0.90,High
0.06,0.06,0.05,0.25,0.33,Low
⋮,⋮,⋮,⋮,⋮,⋮
0.54,0.82,0.71,0.29,0.77,High
0.50,0.75,0.81,0.61,0.26,Middle
0.66,0.90,0.76,0.87,0.74,High


[1] "Test Data"


STG,SCG,STR,LPR,PEG,UNS
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
0.00,0.10,0.50,0.26,0.05,Very Low
0.05,0.05,0.55,0.60,0.14,Low
0.08,0.18,0.63,0.60,0.85,High
⋮,⋮,⋮,⋮,⋮,⋮
0.56,0.60,0.77,0.13,0.32,Low
0.66,0.68,0.81,0.57,0.57,Middle
0.68,0.64,0.79,0.97,0.24,Middle
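
One detail visible in the two tables above is that the training sheet labels the lowest class `very_low`, while the test sheet uses `Very Low`. If both sets are later used with the same classifier, the factor levels probably need to agree. The following is a minimal sketch of one way to harmonize them; the vectors `train_uns` and `test_uns` are hypothetical stand-ins for the `UNS` columns, and the level order in `uns_levels` is our assumption, not something stated in the dataset.

```r
library(forcats)

# Toy stand-ins for the UNS columns (labels copied from the tables above)
train_uns <- as_factor(c("very_low", "High", "Low", "Middle"))
test_uns  <- as_factor(c("Very Low", "Low", "High", "Middle"))

# Rename the training label so that both sets use "Very Low"
train_uns <- fct_recode(train_uns, "Very Low" = "very_low")

# Give both factors the same levels in the same (assumed) order
uns_levels <- c("Very Low", "Low", "Middle", "High")
train_uns <- factor(train_uns, levels = uns_levels)
test_uns  <- factor(test_uns, levels = uns_levels)

# TRUE when the two factors share identical levels
identical(levels(train_uns), levels(test_uns))
```

On the real data, the same two steps could be applied inside `mutate()` on `training_data` and `test_data`.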


**We can see from each table above:**

1. each row is a single observation
2. each column is a single variable
3. each value is in a single cell

**Therefore, the training set and testing set are clean and tidy.**

**We also notice that the training data make up about 64% of all observations (258 of 403 rows), and the test data about 36% (145 of 403 rows).**
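
The split proportions can be checked with a short calculation. This sketch hard-codes the row counts implied by the ranges read above (`A1:F259` gives 258 data rows, `A1:F146` gives 145); with the real objects, `nrow(training_data)` and `nrow(test_data)` could be used instead.

```r
# Row counts implied by the Excel ranges used when reading the two sheets
n_train <- 258
n_test  <- 145

# Share of all observations in each set, rounded to two decimals
split_share <- round(c(train = n_train, test = n_test) / (n_train + n_test), 2)
split_share
```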

In [5]:
# summarise the maximum value of each predictor
training_data_max <- training_data %>%
                     select(-UNS) %>%
                     map_df(max, na.rm = TRUE)
print("The maximum value of each predictor")
training_data_max

# summarise the minimum value of each predictor
training_data_min <- training_data %>%
                     select(-UNS) %>%
                     map_df(min, na.rm = TRUE)
print("The minimum value of each predictor")
training_data_min

# summarise the average value of each predictor
training_data_avg <- training_data %>%
                     select(-UNS) %>%
                     map_df(mean, na.rm = TRUE)
print("The average value of each predictor")
training_data_avg

[1] "The maximum value of each predictor"


STG,SCG,STR,LPR,PEG
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.99,0.9,0.95,0.99,0.93


[1] "The minimum value of each predictor"


STG,SCG,STR,LPR,PEG
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,0,0,0,0


[1] "The average value of each predictor"


STG,SCG,STR,LPR,PEG
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.3711473,0.3556744,0.4680039,0.4327132,0.4585388


**According to the tables above, the range of each variable in the training data is:**
* The range of STG, the degree of study time for goal object materials, is [0, 0.99].
* The range of SCG, the degree of repetition number of user for goal object materials, is [0, 0.9].
* The range of STR, the degree of study time of user for related objects with goal object, is [0, 0.95].
* The range of LPR, the exam performance of user for related objects with goal object, is [0, 0.99].
* The range of PEG, the exam performance of user for goal objects, is [0, 0.93].
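
The minimum, maximum and mean could also be collected in a single table with one row per predictor. This is a sketch of that alternative rather than the code used above; `toy` is a hypothetical three-row stand-in built from the first training rows shown earlier, so its numbers will not match the full-data ranges.

```r
library(dplyr)
library(tidyr)

# Toy stand-in: the first three training rows shown earlier (predictors only)
toy <- tibble(
  STG = c(0.00, 0.08, 0.06),
  SCG = c(0.00, 0.08, 0.06),
  STR = c(0.00, 0.10, 0.05),
  LPR = c(0.00, 0.24, 0.25),
  PEG = c(0.00, 0.90, 0.33)
)

# Reshape to long form, then compute all three summaries per variable
summary_tbl <- toy %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    min  = min(value, na.rm = TRUE),
    max  = max(value, na.rm = TRUE),
    mean = mean(value, na.rm = TRUE),
    .groups = "drop"
  )
summary_tbl
```

With the real data, `toy` would be replaced by `training_data %>% select(-UNS)`.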

In [6]:
# Determine how many rows have missing data for each variable
stg_miss <- sum(is.na(training_data$STG))
sprintf("The number of missing data in STG: %s", stg_miss)

scg_miss <- sum(is.na(training_data$SCG))
sprintf("The number of missing data in SCG: %s", scg_miss)

str_miss <- sum(is.na(training_data$STR))
sprintf("The number of missing data in STR: %s", str_miss)

lpr_miss <- sum(is.na(training_data$LPR))
sprintf("The number of missing data in LPR: %s", lpr_miss)

peg_miss <- sum(is.na(training_data$PEG))
sprintf("The number of missing data in PEG: %s", peg_miss)
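
The per-column counts can also be obtained in one step with `colSums(is.na(...))`, which avoids repeating the same line for every variable. The sketch below uses a hypothetical `toy` data frame with one deliberately missing value to show the shape of the result; it says nothing about how many values are missing in the real data.

```r
# Toy stand-in with one missing value in PEG, for illustration only
toy <- data.frame(
  STG = c(0.00, 0.08, 0.06),
  PEG = c(0.00, NA, 0.33)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
missing_counts <- colSums(is.na(toy))
missing_counts
```

On the real data the call would be `colSums(is.na(training_data))`.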

**Now we have a basic picture of our dataset. In the next step, we want to know which factor(s) are related to our target variable (UNS), in other words, which factors can serve as explanatory variables. To do this, we visualize the data to see whether there is a relationship between STG, SCG, STR, LPR, PEG and UNS.**

## Data Visualization
To visualize the data, we will use the `ggpairs` function, which returns a matrix of plots for a given dataset. Since we have 5 potential explanatory variables and 1 target variable, it is better to compare their distributions and evaluate their associations as a whole. The `ggpairs` function provides an efficient way to explore the distributions of, and correlations between, quantitative and categorical variables.

The `columns` argument selects which columns to include in the plot. In this case, we pass `1:6` because we want all 6 variables. 

We also need to change the font size of the correlation values so that they fit in the panels and remain readable. To change the font size, we include `upper = list(continuous = wrap('cor', size = ...))` in our `ggpairs` call.

In [7]:
relationship_plot <- ggpairs(
    training_data,
    mapping = ggplot2::aes(color = UNS),
    columns = 1:6,
    upper = list(continuous = wrap('cor', size = 5)),
    title = "Scatterplot matrix of `user knowledge dataset`"
)
relationship_plot

ERROR: Error in set_to_blank_list_if_blank(obj, combo = !isDiag & !isDuo, blank = blankVal, : object 'UNS' not found


**By default, the upper panels show the `correlation` between quantitative variables. The diagonal shows the `density plot` of each quantitative variable. The lower panels show the `scatterplot` and `histogram` of quantitative variables. The right-hand side shows the `side-by-side boxplot` between quantitative and categorical variables and the `bar chart` of the categorical variable.**

**Recall that we have 6 variables in this user knowledge dataset:**

* STG : the degree of study time for goal object materials. (Quantitative variable)
* SCG : the degree of repetition number of user for goal object materials. (Quantitative variable)
* STR : the degree of study time of user for related objects with goal object. (Quantitative variable)
* LPR : the exam performance of user for related objects with goal object. (Quantitative variable)
* PEG : the exam performance of user for goal objects. (Quantitative variable)
* UNS : the knowledge level of users. (Categorical variable)


**In the graph above, we first look at the density plot and histograms for each quantitative variable, as well as the bar chart for the categorical variable. The density plots show the distribution of each quantitative variable. The bar chart shows the number of students at each knowledge level. The histograms and side-by-side boxplots show how the distributions of STG, SCG, STR, LPR and PEG differ across the 4 distinct user knowledge levels. Together, these 4 kinds of plot give an overview of the distribution of the 6 variables.**

**Next, we look at the correlation coefficients between pairs of quantitative variables. All of them are close to 0, indicating that the linear relationships between the quantitative variables are weak. The largest coefficient is 0.206, between PEG and STG, which suggests that the linear relationship between PEG and STG is stronger than the rest, although still weak in absolute terms. We then look at the scatterplots between pairs of quantitative variables. In most of them the points are widely dispersed and show no clear trend; the exception is the plot of PEG against STG, which shows a comparatively clear positive linear relationship. This is consistent with our observations from the correlation coefficients.**
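
The strongest pairwise correlation can also be located programmatically instead of being read off the plot. The sketch below illustrates the idea on a hypothetical `toy` data frame made of the six training rows printed earlier, so its result is not expected to reproduce the 0.206 reported for the full training set.

```r
# Toy stand-in: the six training rows printed earlier (predictors only)
toy <- data.frame(
  STG = c(0.00, 0.08, 0.06, 0.54, 0.50, 0.66),
  SCG = c(0.00, 0.08, 0.06, 0.82, 0.75, 0.90),
  STR = c(0.00, 0.10, 0.05, 0.71, 0.81, 0.76),
  LPR = c(0.00, 0.24, 0.25, 0.29, 0.61, 0.87),
  PEG = c(0.00, 0.90, 0.33, 0.77, 0.26, 0.74)
)

# Pearson correlation matrix of the predictors
cor_mat <- cor(toy)

# Blank out the diagonal and upper triangle so each pair appears once
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# Position of the largest absolute correlation (NA entries are ignored)
pos <- arrayInd(which.max(abs(cor_mat)), dim(cor_mat))
strongest_pair  <- c(rownames(cor_mat)[pos[1, 1]], colnames(cor_mat)[pos[1, 2]])
strongest_value <- cor_mat[pos[1, 1], pos[1, 2]]

strongest_pair
strongest_value
```

On the real data, `toy` would be replaced by the five predictor columns of `training_data`.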

**In addition, STG represents the degree of study time for goal object materials, and PEG represents the exam performance of the user for goal objects. Common sense suggests that the more time a student spends studying, the better they will perform in an exam. However, this can vary between individuals: a gifted student can get good grades without studying for long. It might therefore not be appropriate to use STG (the degree of study time for goal object materials) to predict UNS (the knowledge level of users).**

**Therefore, we reason that PEG (the exam performance of the user for goal objects) may be a suitable predictor of UNS (the knowledge level of users).**
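
One quick way to check this reasoning would be to compare PEG across the knowledge levels. The following sketch does this on a hypothetical `toy` tibble built from the six training rows printed earlier; it illustrates the approach and is not evidence about the full dataset.

```r
library(dplyr)

# Toy stand-in: PEG and UNS from the six training rows printed earlier
toy <- tibble(
  PEG = c(0.00, 0.90, 0.33, 0.77, 0.26, 0.74),
  UNS = c("very_low", "High", "Low", "High", "Middle", "High")
)

# Mean PEG and number of observations per knowledge level
peg_by_level <- toy %>%
  group_by(UNS) %>%
  summarise(mean_PEG = mean(PEG), n = n(), .groups = "drop") %>%
  arrange(mean_PEG)
peg_by_level
```

If PEG is a useful predictor, the group means on the real training data should be clearly separated across the levels of UNS.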

## Data Analysis