#### Final Proposal EDF 6938: Natural Language Processing

### The Performance of Machine Learning Methods in Automated Item Difficulty Prediction 
> #### Author: Jing Huang
> #### Date: Dec. 6, 2022
> #### Email: jing.huang@ufl.edu


#### 1. Introduction 

Item difficulty prediction (also known as item calibration) is “of crucial importance” in the educational field [1], especially for test development and assessment that need accurate item difficulty predictions to assign items to the targeted population. Examples include the development of a computerized adaptive test (CAT) [2], and the creation/maintenance of an item bank in large-scale performance assessment [3].

The traditional methods for estimating item difficulty have several limitations, including biased predictions, intensive labor, high economic costs, and content leakage [4-6]. To overcome these limitations, recent studies increasingly employ Natural Language Processing (NLP) and Machine Learning (ML) techniques to predict item difficulty automatically by analyzing the textual content of items [1]. Accordingly, many methods and procedures have been developed or employed but not yet well tested, especially for difficulty prediction in reading comprehension questions (the most recent trend over the past five years [1]). Against this background, this study evaluates the performance of three popular ML methods, namely Naive Bayes (NB), Neural Network (NN), and Support Vector Machines (SVM), in automated item difficulty prediction. The findings can provide evidence to validate the use of these methods in the item difficulty prediction field.

#### 2. Related Work 

2.1	Overview. 
Item difficulty can be estimated by non-pretesting or pretesting calibration [1]. The most frequently used non-pretesting method is expert rating [7], in which content experts assess and label the difficulty of a new item based on their experience. An alternative is one-to-many comparison [2,8]: raters first compare the target item with a series of items of known, ordered difficulty and then label the item’s difficulty level accordingly. These approaches are subjective, prone to bias, and labor intensive [6].
Compared with non-pretesting methods, pretesting methods, which are based on field tests, are more accurate [4]. Commonly used pretesting calibrations include proportion correct and Item Response Theory-based (IRT-based) calibration [2]. In the adaptive learning framework, learner feedback has also been used to rate items’ difficulty levels [2]. These methods administer a set of new items to a group of examinees, either as a standalone form or by adding them to existing forms [5]; their drawbacks therefore include high economic costs, substantial time requirements, and the risk of content leakage [5]. In addition, the prediction accuracy of pretesting methods relies heavily on a heterogeneous sample of examinees [3], and a sample that is “small, unrepresentative, or unmotivated” is problematic [3]. These methods are therefore not suitable for small-scale testing [9], especially IRT-based calibration, which normally needs a large sample size (ranging from 50 to 1,000) to obtain reliable parameter estimates [3,10,11].
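
As a rough illustration of the proportion-correct calibration mentioned above, the index for an item is simply the share of examinees who answered it correctly. The sketch below uses a small made-up response matrix (rows are examinees, columns are items); the data and the function name `proportion_correct` are assumptions for illustration and do not come from any of the cited studies.

```python
def proportion_correct(responses):
    """Return the share of correct (1) responses for each item (column)."""
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[item] for row in responses) / n_examinees
        for item in range(n_items)
    ]


# Hypothetical scored responses: 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]

p_values = proportion_correct(responses)
print(p_values)  # higher values indicate easier items
```

With only four examinees the estimates are clearly unstable, which is one way to see why pretesting depends on a sufficiently large and representative sample.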

2.2	Automated Item Difficulty Prediction. 
Given the limitations of the traditional methods described above, recent studies tend to employ NLP to estimate item difficulty automatically. A review by Benedetto et al. (2022) [1] comprehensively summarizes and explains recent studies on automated item difficulty prediction. Some takeaways are: (1) item difficulty can be defined in the IRT framework, in the classical test theory framework, or by manually selecting discrete values; (2) item difficulty prediction was treated as a supervised problem in almost all studies; (3) studies focused on a single educational domain (i.e., Language Assessment (LA) or Content Knowledge Assessment (CKA)).
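
For reference, one common way to define difficulty in the IRT framework is the two-parameter logistic (2PL) model. It is shown here in its standard textbook form, not as the specific model used in any of the reviewed studies. The symbols are notation introduced for this illustration: $\theta_j$ is the ability of examinee $j$, $b_i$ is the difficulty of item $i$, and $a_i$ is the discrimination of item $i$.

$$
P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\left[-a_i(\theta_j - b_i)\right]}
$$

Under this form, an examinee whose ability equals the item difficulty ($\theta_j = b_i$) has a 0.5 probability of answering correctly, and larger values of $b_i$ correspond to harder items.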
Specifically, among the item types in the LA domain (i.e., vocabulary questions, sentence knowledge questions, and reading and listening comprehension questions), the interest in comprehension questions is “very recent”, possibly because recent advances in deep learning (e.g., neural networks) have enabled new ways to analyze text [1]. When estimating the difficulty of reading comprehension items, the text source can be the reading passages alone or the combination of reading passages, question text, and answer choices (the correct answer and distractors) [12-16]. The most recent studies on reading comprehension [14,15] followed a procedure consisting of word embedding (e.g., word2vec), modeling of the sequence of word embeddings (e.g., LSTMs, CNN), and the final prediction (e.g., FCNN).
In the CKA domain, items are normally developed to minimize the effects of other dimensions (e.g., language), no background knowledge is available, and the question text can be much shorter than in LA items. Therefore, some previous studies included additional information (e.g., an additional knowledge corpus) for model pretraining [4,20,21]. However, other studies trained their models on the question text alone [17-19] and still obtained reliable item difficulty predictions. For example, Benedetto et al. (2020(1), 2020(2)) [17-18] proposed the R2DE (Regressor for Difficulty and Discrimination Prediction) model, which uses TF-IDF vectorized features, readability indices, and linguistic features as inputs to a random forest model to estimate item difficulty, and the model performed well in real-world practice.
Among the studies mentioned above, TF-IDF and word2vec embeddings are the most frequently used vectorization methods, and few studies have tested the performance of the most basic bag-of-words (BOW) procedure in estimating item difficulty. The ML methods used for the final difficulty prediction include NN, SVM, and Random Forests, whereas a performance evaluation of the NB method is still missing.
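
To make the difference between the two count-based vectorizations concrete, the sketch below builds a BOW matrix and a TF-IDF matrix from the same items with scikit-learn. The three item texts are made up for illustration, and default vectorizer settings are assumed; the reviewed studies may have used different settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical item texts, made up for illustration.
items = [
    "which planet is closest to the sun",
    "which gas do plants absorb from the air",
    "what is the main idea of the passage",
]

# Bag of words: each cell is the raw count of a vocabulary term in an item.
bow = CountVectorizer()
X_bow = bow.fit_transform(items)

# TF-IDF: same vocabulary, but counts are reweighted so that terms
# appearing in many items (e.g., "the") contribute less.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(items)

print(X_bow.shape, X_tfidf.shape)
print(sorted(bow.vocabulary_))  # the shared vocabulary, in column order
```

Both matrices have one row per item and one column per vocabulary term, so either can be passed to the same downstream classifier.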

2.3	Research Purposes.
As mentioned above, automated item difficulty prediction using NLP and ML is a new solution for item calibration in educational measurement. Existing studies were mainly conducted in single-domain and single-method settings, and no systematic study has investigated how the performance of different ML methods varies across disciplines. Therefore, this study assesses the performance of three ML methods in estimating the difficulty of items from one LA test dataset (i.e., reading comprehension questions) and one CKA test dataset (i.e., science test questions), and it is guided by the following research questions:
(1) For each dataset, to what extent do NB, NN, and SVM perform differently?
(2) How does each method perform across the LA and CKA datasets?

#### 3. Methods 

3.1	Data.
We selected two publicly available datasets with ground-truth item difficulty so that we could conduct supervised prediction. One is the reading comprehension dataset RACE (ReAding Comprehension dataset from Examinations) [22], and the other is the science test dataset ARC (AI2 Reasoning Challenge) [23]. 
RACE was collected from English exams designed for Chinese middle and high school students between the ages of 12 and 18, and it contains 27,933 passages with 97,687 questions written by human experts [22]. To label the ground-truth item difficulty in the RACE dataset, we followed Song et al. (2021) [16] and labeled items from middle school exams as “easy” and items from high school exams as “difficult”. The partitions contain 7,139 passages for middle school (easy set) and 20,794 passages for high school (difficult set).
The other dataset, ARC, consists of 7,787 text-only science questions across school grades (grade 3 to grade 9) designed for real tests, and it also includes a large text corpus of “unordered, science-related sentences including knowledge relevant to ARC” [23]. The number of questions in each grade is 270 (3rd grade), 824 (4th grade), 1,607 (5th grade), 263 (6th grade), 929 (7th grade), 3,211 (8th grade), and 683 (9th grade), respectively. ARC was partitioned into easy and challenge sets by its creators [23]: if an item cannot be correctly answered by “both a retrieval-based algorithm and a word co-occurrence algorithm”, it is assigned to the challenge set; otherwise, it is assigned to the easy set. However, the difficulty of ARC items can also be labeled by school grade, with items from the 3rd grade representing the easiest items and items from the 9th grade representing the most difficult items.
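
The grade counts above also determine how balanced the grade-based labels are. The following sketch is plain arithmetic on the reported counts; the variable names are introduced here for illustration. It computes the class sizes under the binary split used in Section 3.2 (grades 3-5 versus grades 6-9) and the share of the largest class under each grade-based labeling.

```python
# Number of ARC questions per school grade, as reported above.
grade_counts = {3: 270, 4: 824, 5: 1607, 6: 263, 7: 929, 8: 3211, 9: 683}

total = sum(grade_counts.values())

# Binary grade-based label: grades 3-5 are "easy", grades 6-9 are "difficult".
easy = sum(n for grade, n in grade_counts.items() if grade <= 5)
difficult = total - easy

# Share of the largest class under each labeling, i.e. the accuracy of a
# trivial classifier that always predicts the most frequent label.
binary_majority = max(easy, difficult) / total
categorical_majority = max(grade_counts.values()) / total

print(total, easy, difficult)
print(round(binary_majority, 2), round(categorical_majority, 2))
```

These majority-class shares are a useful point of comparison when reading accuracy values for imbalanced labels.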

3.2	Research Design.
This study evaluates the performance of three ML methods on two datasets under 12 conditions.

| No. | Condition | Dataset | Method | Difficulty label |
|---|---|---|---|---|
| 1 | RACE-NB | RACE | NB | Binary: Easy (middle school item), Difficult (high school item) |
| 2 | RACE-NN | RACE | NN | Binary: Easy (middle school item), Difficult (high school item) |
| 3 | RACE-SVM | RACE | SVM | Binary: Easy (middle school item), Difficult (high school item) |
| 4 | ARC-NB1 | ARC | NB | Binary: Easy and Challenge defined by the creators* |
| 5 | ARC-NN1 | ARC | NN | Binary: Easy and Challenge defined by the creators* |
| 6 | ARC-SVM1 | ARC | SVM | Binary: Easy and Challenge defined by the creators* |
| 7 | ARC-NB2 | ARC | NB | Binary: Easy (school grade 3-5), Difficult (school grade 6-9) |
| 8 | ARC-NN2 | ARC | NN | Binary: Easy (school grade 3-5), Difficult (school grade 6-9) |
| 9 | ARC-SVM2 | ARC | SVM | Binary: Easy (school grade 3-5), Difficult (school grade 6-9) |
| 10 | ARC-NB3 | ARC | NB | Categorical: Difficulty levels 3-9 represent school grade 3-9 |
| 11 | ARC-NN3 | ARC | NN | Categorical: Difficulty levels 3-9 represent school grade 3-9 |
| 12 | ARC-SVM3 | ARC | SVM | Categorical: Difficulty levels 3-9 represent school grade 3-9 |

Note. * If an item cannot be correctly answered by “both a retrieval-based algorithm and a word co-occurrence algorithm”, it is assigned to the challenge set; otherwise, it is assigned to the easy set.
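
The 12 conditions above form a small grid: one labeling scheme for RACE and three for ARC, each crossed with the three methods. A minimal sketch of how the condition names could be enumerated is shown below; the list names are assumptions for illustration and are not part of the analysis code.

```python
from itertools import product

methods = ["NB", "NN", "SVM"]

# RACE has a single binary label; ARC has three labeling schemes (1, 2, 3).
conditions = [f"RACE-{m}" for m in methods]
conditions += [f"ARC-{m}{label}" for label, m in product([1, 2, 3], methods)]

print(len(conditions))
print(conditions)
```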

We do not separate the question text from the answer choices in any condition, because studies show that “using the possible answer choices (both the correct answer and the distractors) improves the accuracy of item difficulty prediction” [4,17].

#### 4. Analysis Demonstration 
The analysis was implemented in Python in three steps. The first step was data preprocessing, including lowercasing, data cleaning, punctuation removal, word tokenization, stemming, and lemmatization. The second step vectorized the text using BOW and split each dataset into a training set (80% of cases) and a testing set (20% of cases). The third step ran each ML method for the final item difficulty prediction.
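
The three steps can be sketched end to end on toy data as follows. This is a minimal illustration under stated assumptions, not the analysis code itself: the ten item texts and their labels are made up, preprocessing is reduced to the lowercasing that `CountVectorizer` applies by default, and `random_state=1` and `stratify` are added only to make the toy split reproducible.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy items and labels (0 = easy, 1 = difficult).
texts = [
    "the cat sat on the mat",
    "a dog ran in the park",
    "the sun is hot",
    "birds can fly in the sky",
    "fish swim in water",
    "photosynthesis converts light energy into chemical energy",
    "the mitochondria regulate cellular respiration processes",
    "tectonic plates shift along convergent boundaries",
    "electrons occupy discrete quantized energy levels",
    "enzymes catalyze biochemical reactions in organisms",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# 80/20 split; stratify keeps both classes in each partition.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=1, stratify=labels
)

# Fitting the vectorizer inside the pipeline means the vocabulary is
# learned from the training items only.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

print(model.predict(X_test))
print(model.score(X_test, y_test))
```

One design difference from the code in Section 4.2 is that the vectorizer here is fitted on the training items only, whereas the analysis code vectorizes the full dataset before splitting.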

![](https://drive.google.com/uc?export=view&id=1X_tl3bQdp1UezW7rAEmJNDFPwZe_rpLS)



##### 4.1. Dependencies 

In [None]:
# Import all the libraries needed for the analysis
import os
import re
import pandas as pd
import nltk
nltk.download(['punkt', 'wordnet', 'omw-1.4'])
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
#############################################################

##### 4.2. Code

In [None]:
#### Code for the analysis

### Part A: RACE dataset
# import datasets
os.chdir('/Users/jing/Dropbox (UFL)/Courses_REM/!2022_fall/NLP/final/data_process/step1_processed_data')
RACE_all = pd.read_csv('RACE_all.csv')

# Step 1: pre-processing
def cleaning(text):
    """Lowercase, expand contractions, strip punctuation, and mask numbers."""
    import re
    import string

    text = text.lower()  # step 1. lowercase

    # step 2. expand contracted forms
    text = text.replace("can't", 'cannot')
    text = text.replace("n't", ' not')
    text = text.replace("'ll", ' will')
    text = text.replace("'m", ' am')
    text = text.replace("he's", "he is")
    text = text.replace("it's", 'it is')

    # step 3. remove all ASCII punctuation (the 32 characters in string.punctuation)
    text = text.translate(str.maketrans('', '', string.punctuation))

    # step 4. replace every number with the placeholder <DIGIT>
    text = re.sub(r"\d+", "<DIGIT>", text)

    return text

RACE_all['arti_ques_opt1'] = RACE_all['arti_ques_opt'].apply(cleaning)

# 2. word tokenization for the cleaned column "arti_ques_opt1"
RACE_all['word_token'] = RACE_all.arti_ques_opt1.apply(nltk.word_tokenize)

# 3. stemming
def stemming(token_lst):
    stemmer = nltk.stem.PorterStemmer()
    return [stemmer.stem(token) for token in token_lst]

RACE_all['stems'] = RACE_all['word_token'].apply(stemming)

# 4. lemmatizing
def lemmatize(token_lst):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in token_lst]

RACE_all['lemma'] = RACE_all['stems'].apply(lemmatize)


# Step 2: vectorization (BOW), train/test split, and classification
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Build the BOW matrix `X` from the 'lemma' column. CountVectorizer expects
# strings, so each token list is joined back into one string first.
bow = CountVectorizer()
X = bow.fit_transform(RACE_all['lemma'].apply(' '.join))

# Use the 'difficulty_code' column as the target `y`
y = RACE_all['difficulty_code']

# Randomly shuffle, then use 80% of the data for training and the rest for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#############################################################
# Fit and evaluate the three classifiers on the same split
classifiers = {
    'Naive Bayes': MultinomialNB(),
    'Neural Network': MLPClassifier(random_state=1, max_iter=300),
    'Support Vector Machine': SVC(random_state=1, max_iter=3000),
}

for name, clf in classifiers.items():
    print("=========", name, "+ BOW =========")
    clf.fit(X_train, y_train)    # fit the model
    y_hat = clf.predict(X_test)  # predict y_hat

    print("Sklearn's score on testing data :", clf.score(X_test, y_test))
    print("Classification report for testing data : ")
    print(classification_report(y_test, y_hat))

    # 5-fold cross-validation (note: run on the held-out 20% split only)
    scores = cross_val_score(clf, X_test, y_test, cv=5)
    print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))



### Part B: ARC dataset
# import datasets
os.chdir('/Users/jing/Dropbox (UFL)/Courses_REM/!2022_fall/NLP/final/data_process/step1_processed_data')
ARC_all = pd.read_csv('ARC_all.csv')

# Step 1: pre-processing (reuses cleaning(), stemming(), and lemmatize() defined in Part A)
ARC_all['arti_ques_opt1'] = ARC_all['arti_ques_opt'].apply(cleaning)

# 2. word tokenization for the cleaned column "arti_ques_opt1"
ARC_all['word_token'] = ARC_all.arti_ques_opt1.apply(nltk.word_tokenize)

# 3. stemming
ARC_all['stems'] = ARC_all['word_token'].apply(stemming)

# 4. lemmatizing
ARC_all['lemma'] = ARC_all['stems'].apply(lemmatize)


# Step 2, condition 1: use "difficulty_code" as the difficulty label
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Build the BOW matrix `X` from the 'lemma' column. CountVectorizer expects
# strings, so each token list is joined back into one string first.
bow = CountVectorizer()
X = bow.fit_transform(ARC_all['lemma'].apply(' '.join))

# Use the 'difficulty_code' column as the target `y`
y = ARC_all['difficulty_code']

# Randomly shuffle, then use 80% of the data for training and the rest for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#############################################################
# Fit and evaluate the three classifiers on the same split
classifiers = {
    'Naive Bayes': MultinomialNB(),
    'Neural Network': MLPClassifier(random_state=1, max_iter=300),
    'Support Vector Machine': SVC(random_state=1, max_iter=3000),
}

for name, clf in classifiers.items():
    print("=========", name, "+ BOW =========")
    clf.fit(X_train, y_train)    # fit the model
    y_hat = clf.predict(X_test)  # predict y_hat

    print("Sklearn's score on testing data :", clf.score(X_test, y_test))
    print("Classification report for testing data : ")
    print(classification_report(y_test, y_hat))

    # 5-fold cross-validation (note: run on the held-out 20% split only)
    scores = cross_val_score(clf, X_test, y_test, cv=5)
    print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))




# Step 2, condition 2: use "schoolGrade" (categorical) as the difficulty label
# Note: this appears to correspond to ARC-NB3, ARC-NN3, and ARC-SVM3 in Section 3.2
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Build the BOW matrix `X` from the 'lemma' column. CountVectorizer expects
# strings, so each token list is joined back into one string first.
bow = CountVectorizer()
X = bow.fit_transform(ARC_all['lemma'].apply(' '.join))

# Use the 'schoolGrade' column as the target `y`
y = ARC_all['schoolGrade']

# Randomly shuffle, then use 80% of the data for training and the rest for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#############################################################
# Fit and evaluate the three classifiers on the same split
classifiers = {
    'Naive Bayes': MultinomialNB(),
    'Neural Network': MLPClassifier(random_state=1, max_iter=300),
    'Support Vector Machine': SVC(random_state=1, max_iter=3000),
}

for name, clf in classifiers.items():
    print("=========", name, "+ BOW =========")
    clf.fit(X_train, y_train)    # fit the model
    y_hat = clf.predict(X_test)  # predict y_hat

    print("Sklearn's score on testing data :", clf.score(X_test, y_test))
    print("Classification report for testing data : ")
    print(classification_report(y_test, y_hat))

    # 5-fold cross-validation (note: run on the held-out 20% split only)
    scores = cross_val_score(clf, X_test, y_test, cv=5)
    print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))


# Step 2, condition 3: use "schoolGradeBinary" as the difficulty label
# Note: this appears to correspond to ARC-NB2, ARC-NN2, and ARC-SVM2 in Section 3.2
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Build the BOW matrix `X` from the 'lemma' column. CountVectorizer expects
# strings, so each token list is joined back into one string first.
bow = CountVectorizer()
X = bow.fit_transform(ARC_all['lemma'].apply(' '.join))

# Use the 'schoolGradeBinary' column as the target `y`
y = ARC_all['schoolGradeBinary']

# Randomly shuffle, then use 80% of the data for training and the rest for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#############################################################
# Fit and evaluate the three classifiers on the same split
classifiers = {
    'Naive Bayes': MultinomialNB(),
    'Neural Network': MLPClassifier(random_state=1, max_iter=300),
    'Support Vector Machine': SVC(random_state=1, max_iter=3000),
}

for name, clf in classifiers.items():
    print("=========", name, "+ BOW =========")
    clf.fit(X_train, y_train)    # fit the model
    y_hat = clf.predict(X_test)  # predict y_hat

    print("Sklearn's score on testing data :", clf.score(X_test, y_test))
    print("Classification report for testing data : ")
    print(classification_report(y_test, y_hat))

    # 5-fold cross-validation (note: run on the held-out 20% split only)
    scores = cross_val_score(clf, X_test, y_test, cv=5)
    print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))


####################################################
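
The classification reports printed by the code above summarize each model with accuracy, precision, recall, and F1. As a reminder of how these indices relate to one another, the sketch below computes them from scratch for one target class. The function name `binary_metrics` and the label vectors are hypothetical and chosen only to keep the arithmetic easy to follow; the analysis itself relies on scikit-learn's `classification_report`.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical labels: 1 = difficult, 0 = easy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

print(binary_metrics(y_true, y_pred))
```

Precision and recall are computed per class, which is why the report can show higher values for one difficulty level than for another even when overall accuracy is the same.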

#### 5. Results 

This study compares model performance on the accuracy, precision, recall, and F1 score indices. Accuracy was estimated using k-fold cross-validation and is reported with its standard deviation. As shown in Table 2 and Table 3 below, the models perform differently across conditions, and the ML methods generally achieve higher accuracy on the RACE dataset than on the ARC dataset. 

For the RACE dataset, the three ML methods perform similarly, with no substantial difference among them. The accuracy (standard deviation) for NB, NN, and SVM is .82 (.00), .86 (.01), and .86 (.01), respectively. The precision, recall, and F1 scores for the RACE dataset are all higher for difficult items than for easy items. In other words, the difficulty level of high school reading comprehension items is predicted more accurately than that of middle school reading comprehension items.

![](https://drive.google.com/uc?export=view&id=11DxScw8nc0wth6yLDWIQVDUAoFoV9cl9)

For the ARC dataset, the performance of the ML models varies considerably across conditions. Accuracy is highest in condition 2, which uses the binary difficulty label based on school grade (.68-.74), followed by condition 1, which uses the binary difficulty label defined by the dataset creators (.61-.67), and lowest in condition 3, which uses the categorical difficulty label based on school grade (.41-.45).

In condition 1, most precision, recall, and F1 values are higher for easy items than for difficult items. In condition 2, all precision, recall, and F1 values are higher for difficult items than for easy items.
In condition 3, the precision, recall, and F1 values are generally higher for grades 4, 5, and 8, a pattern that largely tracks the number of items in each school grade.

In the first two conditions, NB and NN perform similarly and both outperform SVM, whereas in the last condition, NN is the best model for predicting the categorical difficulty level.


![](https://drive.google.com/uc?export=view&id=1gIN6bosIGD-sLQCexgtnQRSMbDmOp4KM)


#### 6. Conclusion and Discussion

Based on the findings above, we conclude that the ML methods perform better on RACE than on ARC. For the RACE dataset, there is no substantial difference among the three ML methods. For the ARC dataset, the three methods perform differently: NB and NN perform similarly and acceptably on the binary difficulty labels, while NN performs best on the categorical difficulty label. NN has the most stable performance across the RACE and ARC datasets.

This project can be improved in the following respects. First, for the ARC dataset, we may add the additional knowledge corpus when training the models, because the short item length might limit model performance. Second, we may try other vectorization methods (e.g., word2vec). Third, we can explore more advanced models to handle categorical item difficulties.