## Prediction of User Knowledge

**Introduction**

The data set records standardized measurements of study time, repetition, and exam performance for a goal subject and for related, non-goal subjects. The target variable is the user's knowledge level, classified as very low, low, middle, or high.

Research question: How well can the degree of study time for the goal object (STG) and the exam performance on the goal object (PEG) predict a user's knowledge level?

The data set we will be using comes from the UCI Machine Learning Repository, where it is titled “User Knowledge Modeling Data Set”. The training data set contains 258 observations of 6 variables: five inputs and one target. The inputs are STG (the degree of study time for goal object materials), SCG (the degree of repetition number of user for goal object materials), STR (the degree of study time of user for related objects with goal object), LPR (the exam performance of user for related objects with goal object), and PEG (the exam performance of user for goal object). The target is UNS (the knowledge level of the user). Each of the input variables is standardized (Kahraman et al.). Existing research has looked at how general knowledge can be used to predict test scores (Hartwig et al.); in contrast, this analysis uses hard data such as test scores and study time as the variables to predict something more abstract, namely knowledge.

In [None]:
#Required libraries 
import altair as alt
import pandas as pd 
import numpy as np
import sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)

## Methods

First, we will create scatterplots with PEG on the y-axis and each of the other four input variables on the x-axis, with shape and color assigned by UNS. Based on these plots, we will use STG and PEG as predictors. Because both variables focus on the goal object, they give the analysis a sense of coherence. We will then apply classification to the training set to predict which of the four knowledge-level categories a new observation belongs to.

We will plot the data set as a scatterplot with STG on the x-axis and PEG on the y-axis, using color and shape to distinguish the categories of UNS.
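
To make the classification step concrete: K-nearest neighbours gives a new observation the most common label among its k closest training points. The following is a bare-bones sketch in NumPy, not the code used in this analysis (which relies on scikit-learn's `KNeighborsClassifier`). The function `knn_predict`, the points in `X_demo`, and the labels in `y_demo` are all made up for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k):
    """Return the majority label among the k nearest training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up, already-scaled points: three "low" users and three "high" users.
X_demo = np.array(
    [[0.10, 0.10], [0.20, 0.15], [0.15, 0.20], [0.80, 0.90], [0.90, 0.80], [0.85, 0.85]]
)
y_demo = np.array(["low", "low", "low", "high", "high", "high"])

pred_a = knn_predict(X_demo, y_demo, np.array([0.2, 0.2]), k=3)
pred_b = knn_predict(X_demo, y_demo, np.array([0.7, 0.8]), k=3)
print(pred_a, pred_b)
```

Because the vote depends only on distances, predictors measured on very different scales would let one variable dominate, which is why the pipeline later scales the predictors first.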

### Load the data from the original source on the web and clean it

In [None]:
#importing the data set
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00257/Data_User_Modeling_Dataset_Hamdi%20Tolga%20KAHRAMAN.xls"

user_knowledge_data = pd.read_excel(url, sheet_name = 1).drop(columns = ["Unnamed: 6", "Unnamed: 7", "Attribute Information:"])
user_knowledge_data

### Summary of the data

In [None]:
user_knowledge_data.agg(["min", "max"])

In [None]:
# count the number of observations in each knowledge-level class
data = user_knowledge_data.groupby(" UNS").size().reset_index(name="count")
data

### Visualization of the dataset

In [None]:
user_chart = (
    alt.Chart(user_knowledge_data)
    .mark_point()
    .encode(
        x = alt.X("STG", title = "Study time for goal object (Standardized)"),
        y = alt.Y("PEG", title = "Exam performance for goal object (standardized)"),
        color = " UNS",
        shape = " UNS"
)).properties(width=400, height=400)
user_chart        

In [None]:
user_chart2 = (
    alt.Chart(user_knowledge_data)
    .mark_point()
    .encode(
        x = alt.X("SCG", title = "The degree of repetition number of user for goal object materials (Standardized)"),
        y = alt.Y("PEG", title = "Exam performance for goal object (standardized)"),
        color = " UNS",
        shape = " UNS"
)).properties(width=400, height=400)
user_chart2  

In [None]:
user_chart3 = (
    alt.Chart(user_knowledge_data)
    .mark_point()
    .encode(
        x = alt.X("STR", title = "The degree of study time of user for related objects with goal object (Standardized)"),
        y = alt.Y("PEG", title = "Exam performance for goal object (standardized)"),
        color = " UNS",
        shape = " UNS"
)).properties(width=400, height=400)
user_chart3

In [None]:
user_chart4 = (
    alt.Chart(user_knowledge_data)
    .mark_point()
    .encode(
        x = alt.X("LPR", title = "The exam performance of user for related objects with goal object (Standardized)"),
        y = alt.Y("PEG", title = "Exam performance for goal object (standardized)"),
        color = " UNS",
        shape = " UNS"
)).properties(width=400, height=400)
user_chart4  

### Data Analysis
#### Split into training and testing sets

In [None]:
usr_training, usr_test = train_test_split(user_knowledge_data,test_size=0.25,random_state=123)
print(usr_training.head())
print(usr_test.head())
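
The split above is purely random, so with four classes of unequal size the class proportions in the two parts can drift apart. As a hedged aside, the sketch below shows how the `stratify` argument of `train_test_split` keeps those proportions aligned. The data frame `toy`, its class sizes, and its label strings are made up; only the column names mirror the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 200 rows, four unbalanced classes.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "STG": rng.random(200),
        "PEG": rng.random(200),
        " UNS": np.repeat(["very low", "low", "middle", "high"], [20, 60, 70, 50]),
    }
)

# stratify= asks for (approximately) the same class shares in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=123, stratify=toy[" UNS"]
)

train_share = toy_train[" UNS"].value_counts(normalize=True).sort_index()
test_share = toy_test[" UNS"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Whether stratification changes the results for the real data depends on how unbalanced the classes are, which the class counts in the summary section indicate.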

#### Assign predictors and targets

In [None]:
X_train = pd.DataFrame(usr_training.loc[:,["STG","PEG"]])
y_train = usr_training[" UNS"]
X_test = pd.DataFrame(usr_test.loc[:,["STG","PEG"]])
y_test = usr_test[" UNS"]


#### Pick K-value using GridSearch

In [None]:
'''
Create preprocessor, pipeline, and knn
'''
usr_data_prepocessor = make_column_transformer((StandardScaler(),["STG","PEG"]),)
knn = KNeighborsClassifier()
usr_data_pipe = make_pipeline(usr_data_prepocessor,knn)


param_grid = {"kneighborsclassifier__n_neighbors": range(2, 50, 1),}
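
For readers unfamiliar with the scaling step, here is a small sketch of what `StandardScaler` computes inside the pipeline: each column is shifted to mean 0 and divided by its standard deviation. The values in `X_small` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for two predictor columns.
X_small = np.array([[0.1, 0.2], [0.4, 0.6], [0.7, 0.9]])

scaled = StandardScaler().fit_transform(X_small)

# The same transformation by hand; np.std uses ddof=0 by default,
# which matches the population standard deviation used by StandardScaler.
manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

print(np.allclose(scaled, manual))
```

Placing the scaler inside the pipeline means it is re-fitted on the training portion of each cross-validation fold, so the held-out fold does not influence the scaling.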

In [None]:
'''
Create grid search
'''
usr_data_tune_grid = GridSearchCV(
    estimator=usr_data_pipe, param_grid=param_grid, cv=5
)
usr_data_tune_grid

In [None]:
'''
Fit the model to the data
'''
usr_data_model_grid = usr_data_tune_grid.fit(X_train,y_train)
accruacies_grid = pd.DataFrame(usr_data_model_grid.cv_results_)
accruacies_grid.head()

In [None]:
'''
Plot the accuracy against k to find the ideal k value
'''
accuracy_versus_k_grid = (
    alt.Chart(accruacies_grid, title="Grid Search")
    .mark_line(point=True)
    .encode(
        x=alt.X(
            "param_kneighborsclassifier__n_neighbors",
            title="Neighbors",
            scale=alt.Scale(zero=False),
        ),
        y=alt.Y(
            "mean_test_score", 
            title="Mean Test Score", 
            scale=alt.Scale(zero=False)
        ),
    )
    .configure_axis(labelFontSize=10, titleFontSize=15)
    .properties(width=800, height=600)
)
accuracy_versus_k_grid

The plot above shows that the mean test score is highest at $k=10$ and $k=13$.

(We also ran the grid search over the range $2$ to $100$; the mean test score continues to drop after $k=50$, so we restrict the range to $2$ to $50$ to make the plot easier to read.)
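
Reading the best k off a plot can be ambiguous when several values are nearly tied. As a sketch, a fitted `GridSearchCV` object exposes the same information through its `best_params_`, `best_score_`, and `cv_results_` attributes. The example runs on synthetic data from `make_blobs`, which is an assumption for illustration and not the real data set; the names `toy_pipe` and `toy_grid` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic four-class data standing in for the real predictors and labels.
X_toy, y_toy = make_blobs(n_samples=200, centers=4, n_features=2, random_state=0)

toy_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
toy_grid = GridSearchCV(
    toy_pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(2, 30)},
    cv=5,
)
toy_grid.fit(X_toy, y_toy)

# Best k according to mean cross-validation accuracy.
print(toy_grid.best_params_)
print(round(toy_grid.best_score_, 3))

# The top-ranked candidates, with the spread across folds for context.
results = pd.DataFrame(toy_grid.cv_results_)
top = results.sort_values("rank_test_score")[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score", "std_test_score"]
].head(3)
print(top)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the difference between two candidate values of k is larger than the fold-to-fold noise.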

#### Set the model for K=10

In [None]:
'''
create knn, pipeline
'''
knn_spec = KNeighborsClassifier(n_neighbors=10)
usr_data_fit = make_pipeline(usr_data_prepocessor,knn_spec).fit(X_train,y_train)
usr_data_fit

In [None]:
'''
Use the model to predict the test data and calculate the accuracy of the model
'''
usr_data_test_predictions = usr_test.assign(predictions=usr_data_fit.predict(X_test))

X_test_pred = usr_data_test_predictions[["STG","PEG"]]
y_test_pred = usr_data_test_predictions[" UNS"]
usr_data_pred_accuracy = usr_data_fit.score(X_test_pred,y_test_pred)
usr_data_pred_accuracy


#### Set model for K=13

In [None]:
knn_spec_13 = KNeighborsClassifier(n_neighbors=13)
usr_data_fit_13 = make_pipeline(usr_data_prepocessor,knn_spec_13).fit(X_train,y_train)
usr_data_fit_13

In [None]:
usr_data_test_predictions_13 = usr_test.assign(predictions=usr_data_fit_13.predict(X_test))

X_test_pred_13 = usr_data_test_predictions_13[["STG","PEG"]]
y_test_pred_13 = usr_data_test_predictions_13[" UNS"]
usr_data_pred_accuracy_13 = usr_data_fit_13.score(X_test_pred_13,y_test_pred_13)
usr_data_pred_accuracy_13

Since the accuracy score for $K=10$ is slightly higher than for $K=13$, we will use $K=10$.

#### Cross-Validation

In [None]:
'''
Check if the model is overfit
'''
np.random.seed(2020)
usr_data_vfold_score = cross_validate(estimator=usr_data_fit,X=X_train,y=y_train,return_train_score=True,)
pd.DataFrame(usr_data_vfold_score)
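
One way to read a table like the one above is to compare the mean training accuracy with the mean validation accuracy: a large gap suggests overfitting. The sketch below illustrates this on synthetic, overlapping clusters from `make_blobs` (an assumption, not the real data); the helper `mean_cv_scores` is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, overlapping four-class data (a stand-in, not the real data set).
X_toy, y_toy = make_blobs(n_samples=200, centers=4, cluster_std=2.0, random_state=1)


def mean_cv_scores(k):
    """Return (mean training accuracy, mean validation accuracy) for a given k."""
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = pd.DataFrame(
        cross_validate(pipe, X_toy, y_toy, cv=5, return_train_score=True)
    )
    return scores["train_score"].mean(), scores["test_score"].mean()


for k in (1, 10):
    train_acc, valid_acc = mean_cv_scores(k)
    gap = train_acc - valid_acc
    print(f"k={k}: train={train_acc:.3f}, validation={valid_acc:.3f}, gap={gap:.3f}")
```

With $k=1$ every training point is its own nearest neighbour, so the training accuracy is perfect regardless of how well the model generalizes; that makes it a useful reference point for what an overfit model looks like in this table.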

#### Visualization of the model

In [None]:
usr_data_mat = confusion_matrix(
    usr_data_test_predictions[" UNS"],  # true labels
    usr_data_test_predictions["predictions"],  # predicted labels
    labels=usr_data_fit.classes_, # specify the label for each class
)

usr_data_mat

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay(
    confusion_matrix=usr_data_mat, display_labels=usr_data_fit.classes_
)
disp.plot()
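
Overall accuracy can hide classes that are predicted poorly. As a worked example, the sketch below derives per-class recall and precision from a confusion matrix. The label vectors `y_true_demo` and `y_pred_demo` are made up and are not the model's output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for eight users.
labels = ["high", "low", "middle", "very low"]
y_true_demo = ["high", "high", "low", "low", "low", "middle", "middle", "very low"]
y_pred_demo = ["high", "middle", "low", "low", "very low", "middle", "middle", "very low"]

# Rows are true labels, columns are predicted labels.
mat = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

recall = np.diag(mat) / mat.sum(axis=1)     # correct / all truly in the class
precision = np.diag(mat) / mat.sum(axis=0)  # correct / all predicted as the class
accuracy = np.trace(mat) / mat.sum()

for name, r, p in zip(labels, recall, precision):
    print(f"{name}: recall={r:.2f}, precision={p:.2f}")
print(f"accuracy={accuracy:.2f}")
```

In this toy case the accuracy is 0.75, yet only half of the truly "high" users are recovered, which is the kind of detail the confusion matrix makes visible.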

In [None]:
import numpy as np

# create the grid of STG/PEG values, and arrange in a data frame
stg_grid = np.linspace(
    user_knowledge_data["STG"].min(), user_knowledge_data["STG"].max(), 50
)
peg_grid = np.linspace(
    user_knowledge_data["PEG"].min(), user_knowledge_data["PEG"].max(), 50
)
asgrid = np.array(np.meshgrid(stg_grid, peg_grid)).reshape(2, -1).T
asgrid = pd.DataFrame(asgrid, columns=["STG", "PEG"])

# use the fit workflow to make predictions at the grid points
knnPredGrid = usr_data_fit.predict(asgrid)

# bind the predictions as a new column with the grid points
prediction_table = asgrid.copy()
prediction_table[" UNS"] = knnPredGrid

# plot:
# 1. the colored scatter of the original data
unscaled_plot = (
    alt.Chart(
        user_knowledge_data,
    )
    .mark_point(opacity=0.6, filled=True, size=40)
    .encode(
        x=alt.X(
            "STG",
            title="STG",
            scale=alt.Scale(
                domain=(user_knowledge_data["STG"].min(), user_knowledge_data["STG"].max())
            ),
        ),
        y=alt.Y(
            "PEG",
            title="PEG",
            scale=alt.Scale(
                domain=(
                    user_knowledge_data["PEG"].min(),
                    user_knowledge_data["PEG"].max(),
                )
            ),
        ),
        color=alt.Color(" UNS", title="UNS"),
    )
)

# 2. the faded colored scatter for the grid points
prediction_plot = (
    alt.Chart(prediction_table)
    .mark_point(opacity=0.05, filled=True, size=300)
    .encode(
        x=alt.X("STG"),
        y=alt.Y("PEG"),
        color=alt.Color(" UNS", title="UNS"),
    )
)
unscaled_plot + prediction_plot
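
The grid construction in the cell above may be easier to follow on a tiny example. This sketch uses three x-values and two y-values (arbitrary choices for illustration) to show that the `meshgrid`, `reshape`, and transpose steps produce one row per (STG, PEG) combination.

```python
import numpy as np
import pandas as pd

x_vals = np.linspace(0.0, 1.0, 3)  # 0.0, 0.5, 1.0
y_vals = np.linspace(0.0, 1.0, 2)  # 0.0, 1.0

# meshgrid returns two 2x3 arrays; stacking and reshaping gives a 2x6 array
# whose columns are the (x, y) pairs, and .T turns those into rows.
pairs = np.array(np.meshgrid(x_vals, y_vals)).reshape(2, -1).T
tiny_grid = pd.DataFrame(pairs, columns=["STG", "PEG"])
print(tiny_grid)
```

The real grid uses 50 values per axis, so the model predicts a class at 2500 points, and the faded markers at those points trace out the decision regions.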

**Expected Outcomes and Significance**


We expect that the degree of study time for the goal object (STG) and the exam performance on the goal object (PEG) will allow us to predict the knowledge level of the user (UNS). Because the scatter plot shows a clear separation between the levels of knowledge, we believe K-nearest neighbours will provide an accurate prediction of the user's knowledge level.

Our findings could help assess common methods for classifying students in settings where knowledge is judged largely by test scores, such as high school and university. We can examine whether variables such as test scores and study time on the goal object accurately predict knowledge, and what that implies about the reliability and viability of the systems that schools currently use.

One straightforward question is “Are there any variables that could improve the accuracy of the prediction?”, since we currently use only two of the five input variables. Further down this path, a potential future question could be “Are the degree of study time (STG) and the exam performance (PEG) the two most influential of the five input variables for deciding a user’s knowledge level?” To find the most accurate prediction, one could build models from other combinations of variables and compare them with this one.

