# Instructional Factors Analysis Model (IFM) Implementation
This notebook provides a Python implementation of the IFM student model.
The IFM model calculates the probability that a student carries out a step correctly, based on the student's prior correct and incorrect attempts (opportunities) as well as the tells (hints) that the student received from the intelligent tutor.

Please edit and execute the implementation steps taking into account what we learned about the rule space student models and the Q-matrix.

## Initializing the environment
First, we import the required libraries for handling data and training the machine learning model. We deactivate potential warnings for readability purposes.

In [None]:
import pandas as pd
import numpy as np
import warnings

from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, f1_score, log_loss, precision_score, recall_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

warnings.filterwarnings('ignore')

## Data preparation
We define the function **feature_engineering()**, which recodes the first-attempt outcome as 0/1, drops rows with missing values, and renames the columns used by the model.

In [None]:
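def feature_engineering(df):
    # Work on a copy so the caller's dataframe is left untouched
    df = df.copy()

    # Recode the first attempt: incorrect answers and hints count as 0, correct answers as 1
    df.loc[df['First Attempt'] == 'incorrect', 'First Attempt'] = 0
    df.loc[df['First Attempt'] == 'hint', 'First Attempt'] = 0
    df.loc[df['First Attempt'] == 'correct', 'First Attempt'] = 1
    # Keep only the rows with a recognised outcome
    df = df[(df['First Attempt'] == 0) | (df['First Attempt'] == 1)]

    # Drop rows with missing values and copy the outcome into its own column
    df = df.dropna()
    df.insert(loc=len(df.columns), column='Outcome', value=df['First Attempt'])

    # Rename the columns to the names used in the model formula
    df = df.rename(columns={'KC (Default)': 'KCModel', 'Opportunity': 'OpportunityModel'})
    df = df.rename(columns={'Corrects': 'CorrectModel', 'Incorrects': 'IncorrectModel'})
    df = df.rename(columns={'Hints': 'TellsModel'})
    return df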
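
As a quick illustration of the recoding described above, here is a minimal sketch on a toy dataframe. The rows and values are made up for the example; only the column names follow the notebook. The sketch uses a compact `map`-based equivalent of the recoding step rather than the function itself, so it can be run on its own.

```python
import pandas as pd

# Toy rows with made-up values; only the column names follow the notebook
toy = pd.DataFrame({
    "First Attempt": ["correct", "hint", "incorrect", "correct"],
    "KC (Default)": ["add", "add", "sub", "sub"],
    "Corrects": [0, 1, 0, 0],
    "Incorrects": [0, 0, 0, 1],
    "Hints": [0, 0, 1, 0],
})

# Compact equivalent of the recoding step: correct -> 1, hint/incorrect -> 0
outcome_map = {"correct": 1, "hint": 0, "incorrect": 0}
prepared = (
    toy.assign(Outcome=toy["First Attempt"].map(outcome_map))
       .dropna()
       .rename(columns={
           "KC (Default)": "KCModel",
           "Corrects": "CorrectModel",
           "Incorrects": "IncorrectModel",
           "Hints": "TellsModel",
       })
)
print(prepared[["KCModel", "CorrectModel", "IncorrectModel", "TellsModel", "Outcome"]])
```

Note that a first attempt recorded as a hint is treated as an incorrect step (Outcome 0).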

## Model Training and Testing
For the model's implementation, we will use Logistic Regression and the Python library scikit-learn.
The function **trainModel()** splits the dataset into a training and a test dataset following the 80/20 (Pareto) principle.
Then, we use the "train" subset to train the model and the "test" subset to test the model.

The predicted labels (0 or 1) are stored in the variable *"y_pred"* while the actual values are stored in the variable *"y_test"*.
By comparing the variables *"y_pred"* and *"y_test"*, we can assess the performance of the predictive model.
To do so, we use the following measures: RMSE, F1, precision, and recall, since our model practically works as a binary classifier that predicts correct or incorrect student steps.
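
To make these measures concrete, the following sketch computes them on two small hand-written label vectors. The values are hypothetical and do not come from the dataset. Because both vectors contain only 0/1 labels, each squared error is either 0 or 1, so the RMSE here equals the square root of the misclassification rate.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for six test steps (1 = correct, 0 = incorrect)
y_test = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1])

# Same formulas as in trainModel(), applied to hard 0/1 predictions
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
f1 = f1_score(y_test, y_pred, average="macro")
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")

# Share of misclassified steps, for comparison with the RMSE
error_rate = np.mean(y_test != y_pred)

print(rmse, f1, precision, recall, error_rate)
```

With `average="macro"`, each score is computed per class and then averaged with equal weight, so the "incorrect" class counts as much as the "correct" class even if it is rarer.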



In [None]:
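def trainModel(df, modeltype, X):
    # modeltype is only a label for the caller; it does not affect the training
    # The Outcome column holds the 0/1 labels
    y = df['Outcome'].astype('int')

    # Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # L2-regularised logistic regression (the scikit-learn default penalty)
    TrainTestSplitModel = LogisticRegression(max_iter=1000, penalty='l2')
    TrainTestSplitModel.fit(X_train, y_train)

    # predict() returns hard 0/1 labels, so all measures below are computed on class labels
    y_pred = TrainTestSplitModel.predict(X_test)
    RMSE = np.sqrt(np.mean((y_test - y_pred) ** 2))
    f1 = f1_score(y_test, y_pred, average="macro")
    precision = precision_score(y_test, y_pred, average="macro")
    recall = recall_score(y_test, y_pred, average="macro")

    return (RMSE, f1, precision, recall)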

## Read data
Now, let's import an example dataset which we will use for training and testing the model.
First, we read the sheet "Example" from the Excel file and save it as a pandas dataframe.

Then, we call the function **feature_engineering()** to pre-process and prepare the data.

In [None]:
#read data
datalink = "https://drive.google.com/uc?export=download&id=1kyb9X7rcN5hObMkp9cuQ1E0TNb6nXz-U"
xl = pd.ExcelFile(datalink)
df = xl.parse("Example")
df.head()

#transform data
data = feature_engineering(df)
data.head()

## Define the model's function

Your task is to define the logistic regression function for the IFM model. Remember, the IFM model calculates the probability of correctness based on the student's prior correct and incorrect responses as well as the tells (hints) that a student receives from the tutor in the respective Knowledge Components (KCs).
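
The description above can be written as an equation. The following is a sketch of one common logistic formulation, not necessarily the exact form used in the lecture. All symbols are notation introduced here for illustration: $\theta_i$ is the proficiency of student $i$, $q_{jk}$ is the Q-matrix entry linking step $j$ to KC $k$, $\beta_k$ is the easiness of KC $k$, and $S_{ik}$, $F_{ik}$, $T_{ik}$ are the counts of prior correct responses, prior incorrect responses, and tells of student $i$ on KC $k$, with per-KC coefficients $\mu_k$, $\rho_k$, $\nu_k$.

```latex
\ln\frac{p_{ij}}{1 - p_{ij}}
  = \theta_i
  + \sum_{k} q_{jk}\,\bigl(\beta_k + \mu_k S_{ik} + \rho_k F_{ik} + \nu_k T_{ik}\bigr)
```

Here $p_{ij}$ is the probability that student $i$ carries out step $j$ correctly. Setting $\nu_k = 0$ removes the tells and leaves a PFM-style model under the same notation.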

The function **dmatrices()** takes a formula string and a dataframe and builds the X (input) and y (output) matrices that we will use for training and testing the model.
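
To show how **dmatrices()** turns a formula into design matrices, here is a minimal sketch on synthetic data. It is one possible way to write an IFM-style formula, not the reference solution. The random data and the `Student` column name are assumptions made for this example; the other column names follow the renaming done in **feature_engineering()**.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

# Synthetic data; "Student" is a hypothetical column name for the student ID
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Student": rng.choice(["s1", "s2", "s3", "s4"], size=n),
    "KCModel": rng.choice(["add", "sub"], size=n),
    "CorrectModel": rng.integers(0, 5, size=n),
    "IncorrectModel": rng.integers(0, 5, size=n),
    "TellsModel": rng.integers(0, 3, size=n),
    "Outcome": rng.integers(0, 2, size=n),
})

# C(...) marks a categorical variable; "a:b" is an interaction term, which
# gives each KC its own slope for the corresponding count
formula = (
    "Outcome ~ C(Student) + C(KCModel)"
    " + C(KCModel):CorrectModel"
    " + C(KCModel):IncorrectModel"
    " + C(KCModel):TellsModel"
)
y, X = dmatrices(formula, toy, return_type="dataframe")

# Inspect the generated design-matrix columns
print(X.columns.tolist())
print(X.shape, y.shape)
```

With `return_type="dataframe"`, the columns of `X` carry readable names, which makes it easy to check that each KC received its own coefficient for corrects, incorrects, and tells.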

In [None]:
#specify the model type
modeltype="IFM"
#specify the model function. Here you should complete the function for the IFM model as previously done for the AFM and the PFM
y, X = dmatrices()


In [None]:
(RMSE, f1, precision, recall)=trainModel(data,modeltype,X)
print(RMSE, f1, precision, recall)

**QUESTION 1:**
How do the results (RMSE, f1, precision, recall) change if you train the model on a new extended dataset?
The dataset can be retrieved at: https://drive.google.com/uc?export=download&id=12Xu21Tbp-1O-4fevB_fXAj_JWvoJU5Jl

Please change the code below so that you use the new dataset for the training and testing of the model and don't forget to also define the IFM function.

In [None]:
#read data
datalink = ""
xl = pd.ExcelFile(datalink)
df = xl.parse("Example")
df.head()

#transform data
data = feature_engineering(df)
data.head()

#specify the model type
modeltype="IFM"
#specify the model function. Here you should complete the function for the IFM model as previously done for the AFM and the PFM
y, X = dmatrices()

(RMSE, f1, precision, recall)=trainModel(data,modeltype,X)
print(RMSE, f1, precision, recall)

**QUESTION 2:** How do the results compare to the results you got from the AFM and the PFM models earlier?