#### Final Proposal EDF 6938: Natural Language Processing

### Automatic Grading of Short Answer Science Questions
> #### Author: Wallace Nascimento Pinto Junior
> #### Date: Dec 05 2022
> #### Email: wallace.pintojun@ufl.edu


#### 1. Introduction 

> Educational testing remains a fundamental practice for different stakeholders who need to make evidence-based decisions, whether for formative, summative, selection, or certification purposes. However, the larger the number of individuals to be assessed, the more complex, time-consuming, and expensive are the processes for obtaining evidence about the knowledge and learning developed by these individuals (Liu et al., 2016; Zhang, Shah & Chi, 2016). In this regard, some technological advances have allowed educational testing to become faster, less expensive, and more effective, such as automatic short answer grading (ASAG), the focus of this paper.

> Short answer questions are a particular case of constructed response questions, which have the advantage of assessing complex thinking and reasoning when compared to multiple choice questions. Although constructed response questions have traditionally been used sparingly in both classroom and large-scale assessments due to cost and time implications (Liu et al., 2016; Ramesh & Sanampudi, 2021), ASAG has proven to be a viable (and crucial) solution that allows e-learning and intelligent tutoring systems to make intensive use of this type of question. Although the advances in the use of ASAG are numerous and significant (Hanh et al., 2021), there are still few studies that have investigated, for example, the quality of scores when automatically grading Elementary or Middle School Science questions. Moreover, there are special concerns about students developing knowledge and skills in STEM areas, which include Scientific Literacy (OECD, 2016). Therefore, my proposal seeks to contribute methodologically to the area of ASAG, as well as conceptually to the area of science learning.

#### 2. Related Work 

> Hanh et al. (2021) reviewed the state of research on the effects of automatic scoring and/or feedback by analyzing 125 studies published in journals and proceedings between 2016 and 2020. Some of the trends they observed were: 1) Most of the work reviewed was at the bachelor's or equivalent level, with small numbers at the early education (2% of papers) and secondary education (6% of papers) levels. 2) 47% of the papers fell into the categories of the sciences, including areas like geology, mathematics, computer science, and computer networking. 3) Regarding the types of automatic scores and feedback, 51% of the papers used structured inputs (mathematics, code, or controlled environments such as simulations), only 2% used short answers, 14% used essays, and 33% used others (e.g., virtual reality, handwriting).

> Liu et al. (2014) investigated the accuracy of c-rater, a concept-based automatic scoring system developed by Educational Testing Service (ETS), for scoring science inquiry questions that had complex rubrics. For c-rater scoring, one or more model responses were identified, and their linguistic features were analyzed using Natural Language Processing (NLP) techniques. Such linguistic features were then applied to evaluate students' responses to determine the presence or absence of key concepts. The researchers scored four questions using the system and obtained moderate to good agreement between automated and human scores. They also identified a few challenges in using the concept-based scoring method: identifying the key concepts took a long time, identifying them exhaustively was almost impossible, and the relationship between analytic and scoring rubrics was not fully specified.

> Next, Liu et al. (2016) studied the accuracy of c-rater-ML, an enhanced version of the c-rater that used support vector regression to model the relationship between students’ responses and scores. The system was effective in scoring science items with five-point scoring rubrics, showing reasonable agreement with human scoring for all test takers as well as specific subgroups.

> The research to date reveals the potential of ASAG in Science. However, among the few studies found, most focused on ETS products, which are not open-source, and neither the content of the questions nor the graded answers are public. If they were, this might help to dispel the lack of "faith" that some stakeholders have in ASAG (Burrows et al., 2015).

> Additionally, there are new methods in the field of NLP and Machine Learning (ML) that have not been tested yet. Therefore, we understand that further research with other databases, exploring traditional or new NLP/ML methods may help to disseminate the potentialities of ASAG and further substantiate the field.

#### 3. Purpose of the Study and Research Questions

> We focus on exploring three classical classifiers – Naïve Bayes, Neural Network, and Support Vector – to predict scores on the Science questions, using accuracy to evaluate their performance. Our research questions are:

> 1) How accurate are Naive Bayes Classifier + Bag of Words (BOW), Neural Network Classifier + BOW and Support Vector Classifier + BOW in predicting the scores of two short answer Science questions with multicategory scoring rubrics?

> 2) Are there any patterns in the responses for which the models failed to predict the true scores?

#### 3. Methods 

#### 3.1 Definition of Short Answer Questions and Overview of the Science Questions Investigated 

> In conducting the literature review, we noticed the importance of distinguishing between short and long answer questions (or essays), since the latter type is mostly used in the area of Language to assess aspects of writing, such as coherence, syntax, style, etc., which impacts the goal of the scoring process. Therefore, using the same definition as Burrows, Gurevych, & Stein (2015), in this paper short answer questions are those: 
> 1) in the form of natural language; 
> 2) requiring students to recall external knowledge that is not provided by the question; 
> 3) of which the length ranges between one phrase and one paragraph; 
> 4) focusing on the correctness of the content rather than the style; and 
> 5) closed, which means that the answers have to match the specific facts corresponding to the questions.

> The two science questions used in our study meet this definition. These questions are part of a bank of questions used in 6th grade Comprehensive Science exams administered by an online public school that provides K-12 instruction to students in one of the US states. The exams were mostly composed of short constructed response questions, and their data range from 2017 to 2022. Due to confidentiality reasons, we cannot present the exact content of the Science questions used in this study, but only outline the skills they measure. Using the Next Generation Science Standards (NGSS) in science as a reference, question 1 is related to the core idea of "Forces and Interactions" and measures the standard "MS-PS2-4. Construct and present arguments using evidence to support the claim that gravitational interactions are attractive and depend on the masses of interacting objects", while question 2 is related to the core ideas of "Weather and Climate" and partially measures "MS-ESS2-4. Develop a model to describe the cycling of water through Earth’s systems driven by energy from the sun and the force of gravity" because it requires only a description/explanation rather than the development of a model.

#### 3.2 Data

> Our training and testing corpus consists of textual responses to the two Science questions. Question 1 had a total sample size of 203 responses with an average length of 27 words/response, while Question 2 had a total sample size of 118 responses with an average length of 37 words/response. Originally, each response was manually scored by the online public school teachers. However, due to a mismatch among the columns of the dataset, it was not possible to associate the scores with the students' responses. Therefore, based on other data available in the database (teachers' feedback on student responses), we developed new rubrics for the two Science questions and used three codes to categorize the responses: 0 for incorrect, 1 for partially correct, and 2 for completely correct. These three categories served as labels for our supervised classifiers.

> Regarding the distribution of responses per category, we initially had a higher percentage of completely correct responses. Since this could bias the classifiers, we randomly dropped some correct responses until they represented 50% of each sample. In the end, we obtained approximately the same distribution of responses for both questions: 17% incorrect, 33% partially correct, and 50% completely correct.
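
> The random dropping described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the helper `downsample_label`, its parameters, and the toy class counts are all hypothetical.

```python
import pandas as pd

def downsample_label(df, label_col, label, target_share, seed=0):
    """Randomly drop rows carrying `label` until they make up `target_share` of the data."""
    is_label = df[label_col] == label
    others = df[~is_label]
    # Solve keep / (keep + len(others)) = target_share for keep (target_share must be < 1).
    keep = int(round(target_share * len(others) / (1 - target_share)))
    keep = min(keep, int(is_label.sum()))  # never ask for more rows than exist
    kept = df[is_label].sample(n=keep, random_state=seed)
    return pd.concat([others, kept]).sort_index()

# Toy data, not the study's responses: 20 incorrect, 30 partial, 150 correct.
toy = pd.DataFrame({"Score": [0] * 20 + [1] * 30 + [2] * 150})
balanced = downsample_label(toy, "Score", label=2, target_share=0.5)
shares = balanced["Score"].value_counts(normalize=True).sort_index()
print(shares)
```

> Fixing `seed` makes the dropped subset reproducible, which matters if the labeled sample is to be shared or re-analyzed later.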

#### 3.3 Analysis Framework

> First, we conducted some of the standard NLP pre-processing steps: lowercasing and replacing contractions with their complete forms. We neither removed punctuation nor tokenized the sentences into words because the vectorizer that we used (CountVectorizer() from the sklearn package) performs these steps by default. We also did not lemmatize or stem the sentences, but in a future study we will test those conditions and analyze their impact on the classifiers' accuracy.

> After the pre-processing, our text data was vectorized (i.e., converted into numerical data) using Bag-of-Words (BOW). Next, similar to prior research on ASAG (Burrows et al., 2015), we tested three classical classifiers – Naïve Bayes, Neural Network, and Support Vector – splitting each of our two samples randomly into 80% for training and 20% for testing, and running 5 rounds for each sample to obtain an average accuracy for each classifier.
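
> The five-round procedure could be organized along the following lines. This is a sketch under assumptions: the helper `mean_accuracy` is hypothetical, the toy corpus stands in for the confidential responses, and using the round number as the random seed is a choice made here for reproducibility, not something stated in the study.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def mean_accuracy(clf, X, y, rounds=5, test_size=0.2):
    """Average test accuracy over several random 80/20 train/test splits."""
    scores = []
    for seed in range(rounds):  # one fixed seed per round keeps the splits reproducible
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        clf.fit(X_tr, y_tr)  # refitting discards the previous round's model
        scores.append(clf.score(X_te, y_te))
    return float(np.mean(scores)), scores

# Toy corpus (not the study's data): 10 full-credit and 10 no-credit answers
toy_texts = ["less gravity on the moon"] * 10 + ["i do not know"] * 10
toy_scores = [2] * 10 + [0] * 10
X_toy = CountVectorizer().fit_transform(toy_texts)

mean_acc, round_scores = mean_accuracy(MultinomialNB(), X_toy, toy_scores)
print(f"mean accuracy over {len(round_scores)} rounds: {mean_acc:.2f}")
```

> The same helper can be called with an `MLPClassifier` or `SVC` instance, so all three classifiers are evaluated on identical splits.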

## Dataset
#### Science Questions
---
`Prompt`:
> Q1. In your own words, describe why an astronaut's weight is different on the moon than it is on Earth.

---
#### Data Description

| Type of response            | Source dependent response |
|-----------------------------|---------------------------|
| Grade level                 | `6`                       |
| Subject                     | `Science`                 |
| Core Idea                   | `Forces and Interactions` |
| Total sample size           | `203 (17% 0; 33% 1; 49% 2)`|
| Average length of responses | `27 words`                |
| Score range                 | `0-2`                     |

`0 = No credit`
`1 = Partial credit`
`2 = Full credit`

`Prompt`:
> Q2. Explain how precipitation happens. Use complete sentences and give at least two supporting details.

---
#### Data Description

| Type of response            | Source dependent response |
|-----------------------------|---------------------------|
| Grade level                 | `6`                       |
| Subject                     | `Science`                 |
| Core Idea                   | `Weather and Climate`     |
| Total sample size           | `118 (16% 0; 33% 1; 50% 2)`|
| Average length of responses | `37 words`                |
| Score range                 | `0-2`                     |


#### 4. Analysis Demonstration 

##### 4.1. Dependencies 

In [136]:
# Import all the libraries necessary for the analysis
import numpy as np
import pandas as pd 
import nltk
import re
#############################################################

##### 4.2. Code

##### Pre-processing Question 1 data

In [137]:
#### Load the Question 1 (gravity) responses
df = pd.read_excel('./gravity_responses.xlsx')
df
####################################################

Unnamed: 0,Response,Score
0,A person's weight will change because the moo...,2
1,"Because the astronaut is on the moon, the his...",1
2,On the earth you weigh more because there is ...,2
3,His weight is more heaver on earth but when h...,0
4,It is diffrent in moon because there is no gra...,1
...,...,...
198,An astronaut weighs more on the Earth than th...,2
199,Weight is a measure of force and the object ne...,2
200,"Because there is no mass on the moon, however ...",0
201,"The moon has less gravity than Earth, so an a...",2


In [138]:
def cleaning(string):
    """Lowercase a response and expand common English contractions."""
    string = string.lower()  # lowercase

    # Expand contractions; irregular forms must come before the generic "n't" rule
    string = string.replace("can't", "cannot")
    string = string.replace("won't", "will not")
    string = string.replace("n't", " not")
    string = string.replace("'ll", " will")
    string = string.replace("'m", " am")
    string = string.replace("he's", "he is")  # this also replaces "she's"
    string = string.replace("it's", "it is")
    string = string.replace("'re", " are")

    # CountVectorizer() strips punctuation and tokenizes by default,
    # as can be seen by running bow.get_feature_names_out()

    return string

In [139]:
df['cleaned_response'] = df['Response'].apply(cleaning)
df

Unnamed: 0,Response,Score,cleaned_response
0,A person's weight will change because the moo...,2,a person's weight will change because the moo...
1,"Because the astronaut is on the moon, the his...",1,"because the astronaut is on the moon, the his..."
2,On the earth you weigh more because there is ...,2,on the earth you weigh more because there is ...
3,His weight is more heaver on earth but when h...,0,his weight is more heaver on earth but when h...
4,It is diffrent in moon because there is no gra...,1,it is diffrent in moon because there is no gra...
...,...,...,...
198,An astronaut weighs more on the Earth than th...,2,an astronaut weighs more on the earth than th...
199,Weight is a measure of force and the object ne...,2,weight is a measure of force and the object ne...
200,"Because there is no mass on the moon, however ...",0,"because there is no mass on the moon, however ..."
201,"The moon has less gravity than Earth, so an a...",2,"the moon has less gravity than earth, so an a..."


##### 4.2.1. Naive Bayes Classifier + BOW for Question 1

In [140]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

### Let's create a BOW matrix from the 'cleaned_response' column as the input and call it `X`

from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()
X = bow.fit_transform(df['cleaned_response'])  # document-term count matrix

### Let's set the 'Score' column as the target and call it `y`
y = df['Score']

# IMPORTANT
### Let's randomly shuffle and use 80% of the data for training and the rest for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# The model never sees y_test during training. We predict y_hat for the test set and compare it with the held-out 20% of the labels.
                      
#############################################################

# Naive Bayes classifier
clf_nb = MultinomialNB()  # multinomial Naive Bayes for word counts
clf_nb.fit(X_train, y_train)  # fit the model
NB_y_hat = clf_nb.predict(X_test)  # predict y_hat

sklearn_score_test= clf_nb.score(X_test,y_test)
print("Sklearn's score on testing data :",sklearn_score_test)

print("Classification report for testing data : ")
print(classification_report(y_test, NB_y_hat))

Sklearn's score on testing data : 0.6097560975609756
Classification report for testing data : 
              precision    recall  f1-score   support

           0       0.75      0.30      0.43        10
           1       0.40      0.36      0.38        11
           2       0.67      0.90      0.77        20

    accuracy                           0.61        41
   macro avg       0.61      0.52      0.53        41
weighted avg       0.62      0.61      0.58        41



In [141]:
new_df = pd.DataFrame(y_test)
new_df['NB_y_hat'] = NB_y_hat
new_df['cleaned_response'] = df['cleaned_response']
new_df

Unnamed: 0,Score,NB_y_hat,cleaned_response
110,0,2,b
7,2,2,when on earth your gravitational force is dif...
21,0,0,i dont know
30,0,2,?
2,2,1,on the earth you weigh more because there is ...
77,1,1,the astronaut's weight changed because there ...
143,2,2,earth is far larger that the moon. the bigger...
86,1,2,astronauts weigh less because there is less gr...
131,2,2,the reason astronaut's weight is different on ...
151,2,2,the moon and the earth have two very different...


##### 4.2.2. Neural Network Classifier + BOW for Question 1

In [142]:
from sklearn.neural_network import MLPClassifier

clf_nn = MLPClassifier(random_state=1, max_iter=300)  # multilayer perceptron classifier
clf_nn.fit(X_train, y_train)  # fit the model
NN_y_hat = clf_nn.predict(X_test)  # predict y_hat

sklearn_score_test= clf_nn.score(X_test,y_test)
print("Sklearn's score on testing data :",sklearn_score_test)

print("Classification report for testing data : ")
print(classification_report(y_test, NN_y_hat))

Sklearn's score on testing data : 0.6829268292682927
Classification report for testing data : 
              precision    recall  f1-score   support

           0       0.71      0.50      0.59        10
           1       0.50      0.55      0.52        11
           2       0.77      0.85      0.81        20

    accuracy                           0.68        41
   macro avg       0.66      0.63      0.64        41
weighted avg       0.69      0.68      0.68        41



In [143]:
new_df.insert(loc = 2, column = 'NN_y_hat', value = NN_y_hat)
new_df

Unnamed: 0,Score,NB_y_hat,NN_y_hat,cleaned_response
110,0,2,0,b
7,2,2,0,when on earth your gravitational force is dif...
21,0,0,0,i dont know
30,0,2,0,?
2,2,1,1,on the earth you weigh more because there is ...
77,1,1,1,the astronaut's weight changed because there ...
143,2,2,1,earth is far larger that the moon. the bigger...
86,1,2,2,astronauts weigh less because there is less gr...
131,2,2,2,the reason astronaut's weight is different on ...
151,2,2,2,the moon and the earth have two very different...


##### 4.2.3. Support Vector Classifier + BOW for Question 1 

In [144]:
from sklearn.svm import SVC

clf_svm = SVC(random_state=1, max_iter=3000)  # support vector classifier
clf_svm.fit(X_train, y_train)  # fit the model
SVC_y_hat = clf_svm.predict(X_test)  # predict y_hat

sklearn_score_test= clf_svm.score(X_test,y_test)
print("Sklearn's score on testing data :",sklearn_score_test)

print("Classification report for testing data : ")
print(classification_report(y_test, SVC_y_hat))

Sklearn's score on testing data : 0.5609756097560976
Classification report for testing data : 
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        10
           1       0.29      0.36      0.32        11
           2       0.70      0.95      0.81        20

    accuracy                           0.56        41
   macro avg       0.33      0.44      0.38        41
weighted avg       0.42      0.56      0.48        41





In [145]:
new_df.insert(loc = 3, column = 'SVC_y_hat', value = SVC_y_hat)
new_df

Unnamed: 0,Score,NB_y_hat,NN_y_hat,SVC_y_hat,cleaned_response
110,0,2,0,1,b
7,2,2,0,2,when on earth your gravitational force is dif...
21,0,0,0,1,i dont know
30,0,2,0,1,?
2,2,1,1,1,on the earth you weigh more because there is ...
77,1,1,1,1,the astronaut's weight changed because there ...
143,2,2,1,2,earth is far larger that the moon. the bigger...
86,1,2,2,2,astronauts weigh less because there is less gr...
131,2,2,2,2,the reason astronaut's weight is different on ...
151,2,2,2,2,the moon and the earth have two very different...
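
> To look for patterns in the responses the models got wrong (research question 2), one option is to count, per response, how many classifiers missed the true score. The sketch below is hypothetical: it rebuilds four rows of the table above by hand (with the responses truncated as displayed) instead of using `new_df` itself, and the column `n_wrong` is introduced only for illustration.

```python
import pandas as pd

# Hand-copied miniature of the comparison table above (responses stay truncated)
preds = pd.DataFrame(
    {
        "Score": [0, 2, 1, 2],
        "NB_y_hat": [2, 2, 1, 2],
        "NN_y_hat": [0, 0, 1, 2],
        "SVC_y_hat": [1, 2, 1, 2],
        "cleaned_response": [
            "b",
            "when on earth your gravitational force is dif...",
            "the astronaut's weight changed because there ...",
            "the moon and the earth have two very different...",
        ],
    },
    index=[110, 7, 77, 151],
)

model_cols = ["NB_y_hat", "NN_y_hat", "SVC_y_hat"]
# Compare each prediction column with the true score, row by row
preds["n_wrong"] = preds[model_cols].ne(preds["Score"], axis=0).sum(axis=1)

# Responses missed by at least one classifier, hardest first
hard_cases = preds[preds["n_wrong"] > 0].sort_values("n_wrong", ascending=False)
print(hard_cases[["Score", "n_wrong", "cleaned_response"]])
```

> Reading the hardest cases first makes it easier to spot recurring causes, such as very short responses with no usable tokens.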


##### Pre-processing Question 2 data

In [146]:
df2 = pd.read_excel('./precipitation_responses.xlsx')
df2['cleaned_response'] = df2['Response'].apply(cleaning)
df2

Unnamed: 0,Response,Score,cleaned_response
0,Precipitation is liquid that forms to clouds a...,2,precipitation is liquid that forms to clouds a...
1,Precipitation happens when evaporation happens...,1,precipitation happens when evaporation happens...
2,Precipitation happens when the clouds become ...,2,precipitation happens when the clouds become ...
3,Precipitation happens when a part of the atmo...,1,precipitation happens when a part of the atmo...
4,How precipitation happens is by when either gr...,2,how precipitation happens is by when either gr...
...,...,...,...
113,The cloud full uo so full of water that they l...,0,the cloud full uo so full of water that they l...
114,in the atmosphere as the air gets warmer it r...,0,in the atmosphere as the air gets warmer it r...
115,Precipitation is the same thing as rain,0,precipitation is the same thing as rain
116,"First the water in a river evaporates, and the...",1,"first the water in a river evaporates, and the..."


##### 4.2.4. Naive Bayes Classifier + BOW for Question 2

In [147]:
bow2 = CountVectorizer()
X2 = bow2.fit_transform(df2['cleaned_response'])  # use the Question 2 vectorizer, not `bow`

### Let's set the 'Score' column as the target and call it `y2`
y2 = df2['Score']

#IMPORTANT
### Let's randomly shuffle and use 80% of the data for training and the rest for testing
from sklearn.model_selection import train_test_split
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2)

# The model never sees y2_test during training. We predict y_hat for the test set and compare it with the held-out 20% of the labels.
                      
#############################################################

# Naive Bayes classifier
clf_nb = MultinomialNB()  # multinomial Naive Bayes for word counts
clf_nb.fit(X2_train, y2_train)  # fit the model
NB_tf_y_hat = clf_nb.predict(X2_test)  # predict y_hat

sklearn_score_test= clf_nb.score(X2_test,y2_test)
print("Sklearn's score on testing data :",sklearn_score_test)

print("Classification report for testing data : ")
print(classification_report(y2_test, NB_tf_y_hat))

Sklearn's score on testing data : 0.625
Classification report for testing data : 
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.50      0.38      0.43         8
           2       0.67      0.86      0.75        14

    accuracy                           0.62        24
   macro avg       0.39      0.41      0.39        24
weighted avg       0.56      0.62      0.58        24





In [148]:
new_df_2 = pd.DataFrame(y2_test)
new_df_2['NB_tf_y_hat'] = NB_tf_y_hat
new_df_2['cleaned_response'] = df2['cleaned_response']
new_df_2

Unnamed: 0,Score,NB_tf_y_hat,cleaned_response
104,1,1,precipitation happens when the water vapors c...
47,1,2,"if water from a lake evaporates into the air,..."
80,2,1,precipitation occurs when a portion of the at...
12,0,2,"it is rain, snow,sleet, or hail that fall to ..."
37,2,2,precipation happens when clouds form water dr...
65,2,2,precipitation is water falling from the cloud...
19,1,2,precipitation happens when water condenses an...
5,1,1,precipitation happens when condensation is the...
25,2,2,precipitation happens when there is too much ...
17,1,2,precipitation happens when condensation is th...


##### 4.2.5. Neural Network Classifier + BOW for Question 2

In [149]:
from sklearn.neural_network import MLPClassifier

clf_nn = MLPClassifier(random_state=1, max_iter=300)  # multilayer perceptron classifier
clf_nn.fit(X2_train, y2_train)  # fit the model
NN_tf_y_hat = clf_nn.predict(X2_test)  # predict y_hat

sklearn_score_test= clf_nn.score(X2_test,y2_test)
print("Sklearn's score on testing data :",sklearn_score_test)

print("Classification report for testing data : ")
print(classification_report(y2_test, NN_tf_y_hat))

Sklearn's score on testing data : 0.75
Classification report for testing data : 
              precision    recall  f1-score   support

           0       0.50      0.50      0.50         2
           1       0.60      0.75      0.67         8
           2       0.92      0.79      0.85        14

    accuracy                           0.75        24
   macro avg       0.67      0.68      0.67        24
weighted avg       0.78      0.75      0.76        24



In [150]:
new_df_2.insert(loc = 2, column = 'NN_tf_y_hat', value = NN_tf_y_hat)
new_df_2

Unnamed: 0,Score,NB_tf_y_hat,NN_tf_y_hat,cleaned_response
104,1,1,1,precipitation happens when the water vapors c...
47,1,2,1,"if water from a lake evaporates into the air,..."
80,2,1,1,precipitation occurs when a portion of the at...
12,0,2,0,"it is rain, snow,sleet, or hail that fall to ..."
37,2,2,1,precipation happens when clouds form water dr...
65,2,2,2,precipitation is water falling from the cloud...
19,1,2,1,precipitation happens when water condenses an...
5,1,1,1,precipitation happens when condensation is the...
25,2,2,2,precipitation happens when there is too much ...
17,1,2,1,precipitation happens when condensation is th...


##### 4.2.6. Support Vector Classifier + BOW for Question 2

In [151]:
from sklearn.svm import SVC

clf_svm = SVC(random_state=1, max_iter=3000)  # support vector classifier
clf_svm.fit(X2_train, y2_train)  # fit the model
SVC_tf_y_hat = clf_svm.predict(X2_test)  # predict y_hat

sklearn_score_test= clf_svm.score(X2_test,y2_test)
print("Sklearn's score on testing data :",sklearn_score_test)

print("Classification report for testing data : ")
print(classification_report(y2_test, SVC_tf_y_hat))

Sklearn's score on testing data : 0.7083333333333334
Classification report for testing data : 
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.56      0.62      0.59         8
           2       0.80      0.86      0.83        14

    accuracy                           0.71        24
   macro avg       0.45      0.49      0.47        24
weighted avg       0.65      0.71      0.68        24





In [152]:
new_df_2.insert(loc = 3, column = 'SVC_tf_y_hat', value = SVC_tf_y_hat)
new_df_2

Unnamed: 0,Score,NB_tf_y_hat,NN_tf_y_hat,SVC_tf_y_hat,cleaned_response
104,1,1,1,1,precipitation happens when the water vapors c...
47,1,2,1,1,"if water from a lake evaporates into the air,..."
80,2,1,1,1,precipitation occurs when a portion of the at...
12,0,2,0,1,"it is rain, snow,sleet, or hail that fall to ..."
37,2,2,1,2,precipation happens when clouds form water dr...
65,2,2,2,2,precipitation is water falling from the cloud...
19,1,2,1,2,precipitation happens when water condenses an...
5,1,1,1,2,precipitation happens when condensation is the...
25,2,2,2,2,precipitation happens when there is too much ...
17,1,2,1,1,precipitation happens when condensation is th...


#### 5. Results 

> After testing each classifier 5 times with each sample, we obtained the following average accuracies:

> Table 1 - Average Accuracies for the three classifiers

| Question                    | Naïve Bayes | Neural Network | Support Vector |
|-----------------------------|-------------|----------------|----------------|
| Q1 Gravity                  | `0.69`      |     `0.72`     | `0.74`         |
| Q2 Precipitation            | `0.62`      |     `0.62`     | `0.67`         |

> As we can see, the accuracies of the three classifiers were higher for Q1 than for Q2, probably because Q1 had more responses (203 versus 118). A larger corpus gives the classifiers more examples of each word in the BOW vocabulary, which generally supports better prediction. For both questions, SVC had the highest average accuracy. Although the accuracies for Q1 were relatively high (about 0.70 or above), their interpretation needs to take into account that the responses to both questions were, in general, similar to one another and that BOW does not capture word order or the context of the sentences. For example, the two responses below may both have been classified as correct (code 2) because they contain the same keywords expected in a correct response, but only the second one is correct.

> 1) “An astronauts weight is less on the moon than it is on the Earth because Earth’s gravitational pull is much less than the moon’s.”

> 2) “An astronauts weight is less on the moon than it is on the Earth because moon’s gravitational pull is much less than the Earth’s.”
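
> The point about keywords can be checked directly: with the default `CountVectorizer`, the two responses above produce exactly the same count vector, so no classifier fed with these features can tell them apart. The second half of the sketch adds word bigrams through `ngram_range=(1, 2)`; this is only one possible option for recovering some word order and was not evaluated in this study. Variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

wrong = ("An astronauts weight is less on the moon than it is on the Earth "
         "because Earth's gravitational pull is much less than the moon's.")
right = ("An astronauts weight is less on the moon than it is on the Earth "
         "because moon's gravitational pull is much less than the Earth's.")

# Unigram BOW: word order is lost, so both responses get identical counts
unigram = CountVectorizer().fit_transform([wrong, right]).toarray()
same_unigrams = bool((unigram[0] == unigram[1]).all())

# Unigrams + bigrams: pairs such as "because earth" vs "because moon" now differ
bigram = CountVectorizer(ngram_range=(1, 2)).fit_transform([wrong, right]).toarray()
same_bigrams = bool((bigram[0] == bigram[1]).all())

print("identical with unigrams:", same_unigrams)
print("identical with unigrams + bigrams:", same_bigrams)
```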

> Additionally, since the distribution of the responses was skewed for both questions, accuracy may be a biased measure of performance. Other metrics should be analyzed, and the differences between classifiers should be tested for statistical significance, which will be done in the next versions of the study.
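
> As an illustration of what such additional metrics might look like, the sketch below computes balanced accuracy, macro-averaged F1, and quadratic weighted kappa with scikit-learn. The choice of these three metrics is an assumption made here, not a decision reported in the study, and the labels are hypothetical toy values rather than the study's predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    cohen_kappa_score,
    f1_score,
)

# Hypothetical true scores and predictions on the 0-2 rubric
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [2, 0, 1, 2, 1, 2, 2, 2, 2, 1]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    # Mean of per-class recall, so small classes count as much as large ones
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    # Unweighted mean of per-class F1
    "macro_f1": f1_score(y_true, y_pred, average="macro"),
    # Agreement corrected for chance; larger score gaps are penalized more
    "quadratic_kappa": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

> Quadratic weighted kappa treats the scores as ordered, so predicting 2 for a response scored 0 costs more than predicting 1, which fits a partial-credit rubric.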

#### 6. Conclusion and Discussion

> The results obtained in this study can add more evidence to the body of studies that have been conducted on ASAG, especially regarding its use with Science questions. Despite the small sample size, an accuracy of about 70% is not negligible, although it still needs to be analyzed in light of other metrics. Since the other Science questions used in the exams of this online school probably do not have as much variability in their responses, part of the scoring that is currently conducted manually by the teachers could be done automatically, perhaps on an experimental basis.
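
> One simple reference point for reading an accuracy of about 70% is the majority-class baseline: a model that always predicts the most frequent score. The sketch below is an assumption-laden illustration, not part of the study: it uses toy labels that mimic the reported 17% / 33% / 50% distribution and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels mimicking the reported class distribution (100 hypothetical responses)
y_toy = np.array([0] * 17 + [1] * 33 + [2] * 50)
X_dummy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_toy)
baseline_acc = baseline.score(X_dummy, y_toy)
print(f"majority-class accuracy: {baseline_acc:.2f}")
```

> Under this distribution, always answering "completely correct" already reaches 50% accuracy, so the gain of a classifier is better judged against that figure than against zero.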

> We intend to continue the research on ASAG, deepening our knowledge of vectorizers that can capture the context of sentences (e.g., Word2Vec) and studying ways to use them with the classifiers.