In [None]:
#example with Python
#reading data from a file

In [None]:
#this line is only necessary for Jupyter Binder
#if you're trying this example on your local environment, you can skip this step!
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install scikit-learn
!{sys.executable} -m pip install statsmodels

In [None]:
#import all the python libraries that we will need for our analysis


import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import statsmodels.api as sm


In [None]:
# Read data from file 'geometryITS.csv' 
# (in the same directory that your python process is based)
 
data = pd.read_csv("geometryITS.csv") 


# Preview the first 5 lines of the loaded data 
data.head()

In [None]:
#view all variables that are recorded in the dataset
list(data)

## What are we doing today?

In this example, we will implement the Additive Factors Model (AFM) proposed by Cen et al.
[Cen, H., Koedinger, K., & Junker, B. (2008, June). Comparing two IRT models for conjunctive skills. In International Conference on Intelligent Tutoring Systems (pp. 796-798). Springer, Berlin, Heidelberg.]

The AFM student model predicts a student's performance on every step based on the student's prior practice (the number of practice opportunities) and other factors, such as step difficulty, the student's proficiency, and the learning rate.

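AFM itself is a logistic regression model. In this example we approximate it with a linear mixed-effects model (statsmodels `MixedLM`) with a random intercept per student, and we will train and test it on the Geometry Tutor dataset.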

## Student model 
The AFM equation (see Cen's paper) states that to predict a student's performance we need two inputs: the knowledge component (KC) involved in each step and the number of practice opportunities, that is, how many times the student has tried this step before. Also, since the model predicts individual learning, we have to take the student's ID into account.
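
To make the idea of a "practice opportunity" concrete, here is a minimal sketch of how such a counter could be derived from an attempt log. The toy data is hypothetical, and two things are assumptions on my side: that opportunities are counted per student and KC, and that counting starts at 1 (check how the `Opportunity` column in your own dataset is defined).

```python
import pandas as pd

# Hypothetical toy log: one row per student attempt, in chronological order.
toy = pd.DataFrame({
    "Anon Student Id": ["s1", "s1", "s1", "s2", "s2", "s1"],
    "KC": ["area", "area", "angle", "area", "area", "area"],
})

# cumcount() numbers the rows within each (student, KC) group starting at 0,
# so adding 1 gives "this is the n-th time this student practices this KC".
toy["Opportunity"] = toy.groupby(["Anon Student Id", "KC"]).cumcount() + 1

print(toy)
```

Note that the counter only makes sense if the rows are sorted by time within each student.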

The result of each student's attempt is recorded in the column "First Attempt" (the name of the column is misleading, I know!). We use it to create the "outcome" variable, that is, the variable we want to predict. Here, 0 signifies an incorrect answer and 1 signifies a correct answer.

Simply put, the equation that we want to implement should look something like this:

Outcome ~ Opportunity + KC + (1| Anon Student ID)   [equation (1)]

which roughly translates to: the outcome is predicted from the knowledge component (skill) being tested and the number of prior opportunities the student had, while a random effect accounts for each student's individual characteristics.
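
For reference, this is how the full AFM is commonly written. The symbol names are my own choice, so please check them against the paper before relying on them:

```latex
\ln\frac{p_{ij}}{1-p_{ij}}
  = \theta_i
  + \sum_{k} q_{jk}\,\beta_k
  + \sum_{k} q_{jk}\,\gamma_k\,T_{ik}
```

Here $p_{ij}$ is the probability that student $i$ gets step $j$ right, $\theta_i$ is the student's proficiency, $\beta_k$ is the easiness of KC $k$, $\gamma_k$ is the learning rate of KC $k$, $T_{ik}$ is the number of practice opportunities student $i$ has had on KC $k$, and $q_{jk}$ is 1 if step $j$ involves KC $k$ (0 otherwise). In equation (1), the student term corresponds to `(1| Anon Student ID)`, the $\beta_k$ terms to `KC`, and the practice term to `Opportunity`. Since equation (1) has no interaction between `Opportunity` and `KC`, it uses one learning rate shared by all KCs instead of one $\gamma_k$ per KC.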

Now let's go and fit the model!

In [None]:
#create a new column that encodes the result of each step: steps whose first attempt was incorrect get "0", all other steps get "1"
data["outcome"] = 1

#use a single .loc call on the DataFrame; chained indexing (data["outcome"].loc[...]) may fail to update the original
data.loc[data["First Attempt"] == "incorrect", "outcome"] = 0


## Fitting the AFM model
Now we will fit our model. We will use part of the dataset to train the model (let's say 80% of the data) and hold out the rest to test the model's performance (the remaining 20% of the data).

The datasets used for training have the suffix "_train" while the datasets saved for testing have the suffix "_test". 

In [None]:
data_train, data_test = train_test_split(data, test_size=0.2, random_state=0)

In [None]:
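#fit a mixed-effects model with a random intercept per student
#note: MixedLM is a linear mixed model, so the outcome is modelled on a linear scale rather than with a logit link
model = sm.MixedLM.from_formula("outcome ~ Opportunity + KC", data_train, groups=data_train["Anon Student Id"])
result = model.fit()
result.summary()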

Now let's use the newly fitted model to predict the outcomes of the test set. To keep it simple, we threshold the predictions at 0.5 and calculate the Mean Absolute Error as a performance metric (for 0/1 labels this is simply the share of wrong predictions). Of course there are other, more sophisticated evaluation methods, but these are beyond the scope of this course!
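
If you want to convince yourself that the Mean Absolute Error of 0/1 labels is the same as the error rate, here is a tiny sketch with made-up numbers (plain Python lists, nothing from the real dataset):

```python
# Hypothetical predicted values and true 0/1 outcomes for six steps.
y_prob = [0.9, 0.4, 0.7, 0.2, 0.55, 0.1]
y_true = [1, 1, 1, 0, 0, 0]

# Threshold at 0.5: values below 0.5 become 0, everything else becomes 1.
y_hat = [0 if p < 0.5 else 1 for p in y_prob]

# Mean absolute error between predicted and true labels.
mae = sum(abs(h - t) for h, t in zip(y_hat, y_true)) / len(y_true)

# Fraction of steps where the predicted label differs from the true one.
error_rate = sum(h != t for h, t in zip(y_hat, y_true)) / len(y_true)

print(mae, error_rate)
```

Both numbers come out identical because every absolute difference between two 0/1 labels is either 0 (a hit) or 1 (a miss).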

In [None]:
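#predict on the test set
#work on a copy so that adding columns does not trigger pandas' SettingWithCopyWarning
data_test = data_test.copy()
data_test["y_pred"] = result.predict(data_test)

#turn the predictions into 0/1 labels: values below 0.5 become 0, everything else stays 1
data_test["outcome_pred"] = 1
data_test.loc[data_test["y_pred"] < 0.5, "outcome_pred"] = 0

#mean absolute error between the predicted and the actual outcome
error = (data_test["outcome_pred"] - data_test["outcome"]).abs()
error.mean()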

So, our model is able to predict student performance with a 26% error.
Not bad for such a simple approach, right?

## Assignment 3

Pavlik et al. [Pavlik Jr, P. I., Cen, H., & Koedinger, K. R. (2009). Performance Factors Analysis--A New Alternative to Knowledge Tracing. Online Submission.] took the AFM model further: instead of prior opportunities, they proposed using the number of prior correct and prior incorrect answers for each step. They named this new model the Performance Factors Analysis Model (PFM).
Following the simplified form of equation (1), the PFM model can be written as:

Outcome ~ Prior Correct + Prior Incorrect + KC + (1| Anon Student ID)   [equation (2)]

For the third assignment, you have to do the following:
1. Modify the AFM model provided in this example in order to implement the PFM model. You already have information about the correct and incorrect answers in the dataset. 
2. Use the PFM model to predict student performance and compare the results with the AFM model. Discuss your findings.
3. Propose additional factors that you could take into account in order to improve the student model. You can be as creative as you like, as long as you keep your feet on the ground. Think of improvement vs. feasibility.

I'm looking forward to reading your ideas!

