#### Final Proposal EDF 6938: Natural Language Processing

### Can a Human Scoring System Be Replaced by an Automated Scoring System?: A New Approach to Teacher Evaluation

> #### Author: Hyojong Sohn
> #### Date: December 6th, 2022
> #### Email: hsohn@ufl.edu


#### 1. Introduction 

> #### Problem Statement
> Teachers are considered the most crucial school factor for supporting students’ needs and improving their learning, and this is especially true for students scoring below grade level on state and national assessments, including those with disabilities (Myers et al., 2022). Researchers have shown that students with disabilities are consistently the lowest-performing subgroup on the National Assessment of Educational Progress and the most likely to have poor post-school outcomes; thus, efforts to improve teacher quality for these students are essential to their success (Brownell et al., 2020; Jones et al., 2022). The Institute of Education Sciences (IES) has worked to improve teaching quality by funding professional development (PD) research that promotes changes in teachers’ use of evidence-based practices (EBPs; National Academies of Sciences, Engineering, and Medicine, 2022). Classroom observations play an essential role in evaluating and improving teacher practice in PD research. However, observational research has consistently revealed several challenges due, in part, to the current human-based scoring system. First, human-rater scoring is costly because it relies heavily on people manually assessing teaching practices. PD researchers may spend valuable time and resources on rater recruitment and training, regular calibration, and reassessment for quality control (Connor et al., 2014; Demszky et al., 2021). These efforts are often necessary to reduce threats, such as rater bias and severity, to the temporal stability and fairness of scores (Park et al., 2015; Wind & Guo, 2021). The cost of human raters decreases the accessibility of observational measures. Second, current human-mediated feedback systems, based on video-recorded observations in research, impede providing teachers with timely performance feedback. If researchers fail to provide teachers with immediate feedback, teachers have more time to repeat errors and less time to implement target practices (Taylor et al., 2022). Third, live observations are more common in typical school practice; thus, differences between the conditions of observations conducted in research and in practice may diminish the generalizability of research results (Liu & Cohen, 2021).




#### 2. Related Work 

> To address challenges associated with human-mediated observation systems, recent advances in machine learning techniques provide data-driven solutions. In recent observational research, a convolutional neural network model classified classroom dialogue into seven categories (e.g., prior-known knowledge, agreement) with an acceptable level of accuracy (overall precision: 0.688; Song et al., 2021). Liu and Cohen (2021) also found that a natural language processing algorithm embedded in text-as-data methods successfully detected three latent traits (i.e., classroom management, interactive instruction, teacher-centered instruction) aligned with existing observation instruments, such as the Framework for Teaching (Danielson, 2013) and the Protocol for Language Arts Teaching Observations (Grossman et al., 2014). In this study, we will examine the feasibility of automatically classifying transcribed lessons in general and special education settings. Our analytic framework aims to provide teachers with timely performance feedback through automated coding. The following research questions guide this study:

> 1.	Does the classification system achieve comparable accuracy to human-expert judgment?
> 2.	Is the linguistic evidence that the classification algorithm identifies as deterministic features associated with teachers’ latent abilities in effective special education instruction (e.g., explicit instruction, reading evidence-based practice)?
> 3.	What implications do the classification models have for effective special education practice?

> #### Theoretical Framework
> We adopted the computational psychometrics framework developed by von Davier and colleagues (2021; see Figure 1) to compare performance between human-coded and computer-coded labels. Their framework blends both top-down and bottom-up approaches to integrate traditional psychometric methods (e.g., classical test theory, factor analyses) and computational methods (e.g., machine learning algorithms) for observational data. First, in the top-down approach, we designed an observation protocol based on the principles of effective special education instruction (see Table 1) and generated the micro- and macro-level evidence using the protocol. In other words, human raters analyzed teacher cues at the micro level (e.g., “Let me show you how to decode this word,” “Find the main idea with your partner”) and labeled them into ten categories at the macro level (e.g., modeling, practice opportunities, feedback, decoding, prefix, suffix). Once human-coded measures are collected, the bottom-up approach is applied to generate computer-coded labels using supervised machine learning methods. Lastly, in the psychometric models, kappa scores will be used to compare the prediction accuracy of computer-coded labels and human-coded ones. Factor analyses will also be conducted to examine whether the constructs of computer-coded labels are associated with teachers’ latent instructional factors in effective special education instruction. 
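
The kappa comparison between human-coded and computer-coded labels described above can be illustrated with a small worked example. This is only a sketch: the ten labels below are invented for illustration and are not project data. It computes Cohen's kappa by hand from the observed agreement and the agreement expected by chance, which is the same statistic that `cohen_kappa_score` from scikit-learn reports in the analysis demonstration.

```python
# Hypothetical labels for ten teacher cues (1 = category present, 0 = absent).
# These values are invented for illustration; they are not project data.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
computer_labels = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

n = len(human_labels)

# Observed agreement: share of cues where both sources assigned the same label.
p_o = sum(h == c for h, c in zip(human_labels, computer_labels)) / n

# Chance agreement, from each source's marginal proportion of 1s and 0s.
p_h1 = sum(human_labels) / n
p_c1 = sum(computer_labels) / n
p_e = p_h1 * p_c1 + (1 - p_h1) * (1 - p_c1)

# Cohen's kappa: agreement beyond chance, scaled by the maximum possible.
kappa = (p_o - p_e) / (1 - p_e)
print(f"observed = {p_o:.2f}, chance = {p_e:.2f}, kappa = {kappa:.2f}")
```

Here the two sources agree on 8 of 10 cues, but half of that agreement would be expected by chance, so kappa is lower than raw agreement. This is why kappa, rather than accuracy alone, is a useful check when a category is rare.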

#### 3. Methods 

> #### Overview of Research Design and Methodology
> In Project Coordinate (PC), a Goal II professional development innovation funded by the IES, we developed the Project Coordinate Observation Protocol (PC-OP) to capture teachers’ use of evidence-based instruction in tiered instruction. This instrument has been used to assess well-researched principles of instruction that support the effective implementation of EBPs: explicit instruction (e.g., modeling, explanation, practice opportunities; Johnson et al., 2019) and responsiveness to student learning (e.g., response affirmation, constructive feedback; Connor et al., 2011; Doabler et al., 2015). It also assesses the amount of time teachers spend implementing evidence-based word study, morphological awareness, and summarization strategies (Katz & Carlisle, 2009). Human-coded measures on the PC-OP were previously validated within Kane’s (2006) measurement framework (Pua et al., 2020; Sohn et al., 2021). In this study, we will evaluate the psychometric properties of computer-coded measures generated by automated classification models within the computational psychometrics framework.

> #### Participants
> Participants were 4th-grade teachers from 15 schools in the southwestern United States who took part in Project Coordinate during the 2019-2020 school year. Eight treatment schools included 14 general education teachers (GETs) and 8 special education teachers (SETs), and seven control schools included 14 GETs and 8 SETs.


> #### Data 
> We collected 176 video-recorded pre-intervention lessons and 90 video-recorded post-intervention lessons. Of these, 12 videos will be randomly selected and transcribed. Because one transcript (video) includes 100 to 300 cues, the 12 transcripts should yield a substantial data set (at least 1,200 cues) for the machine learning analyses.

> #### Data Analysis 
> Because supervised machine learning learns from labeled training data, it is considered more powerful than unsupervised machine learning, which deals with unstructured data (von Davier et al., 2021). Thus, we will adopt supervised machine learning methods, specifically classification models, to analyze our structured data with human-coded measures (i.e., the target outcomes). Before applying the machine learning methods, we will perform several preprocessing steps, such as tokenization, part-of-speech tagging, and stop-word filtering, to improve classification performance. Then, we will conduct the following four stages of a basic machine learning analysis framework: data processing, model development, model learning, and model evaluation. First, in the data processing stage, we will examine two vectorization approaches (i.e., count-based and contextual vectorization) to capture the semantic representation of the teacher discourse. Second, in the model development and learning stages, we will use four classification models (i.e., logistic regression, Naïve Bayes, support vector machines, multi-layer perceptron [MLP]) with a one-vs-rest (i.e., one-vs-all) approach to handle a multi-class classification problem with a finite number (n = 10) of label categories (Glare, 2020). Lastly, in the model evaluation stage, we will evaluate the four models with commonly adopted classification performance measures (e.g., accuracy, F1 score, the area under the receiver operating characteristic curve [AUC-ROC], kappa scores) using 5-fold cross-validation. This system will enable a computer to classify the teacher discourse by learning the implicit and explicit decisions made by human raters and automatically applying the learned rules to unseen data (e.g., teacher discourse data that are not coded by human raters).
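
The stages described above can be sketched in a few lines of scikit-learn. This is a minimal illustration under stated assumptions, not the project's implementation: the cues and the three category labels are invented, count-based vectorization stands in for both vectorization approaches, and logistic regression stands in for the four models. Wrapping the vectorizer and classifier in one pipeline is a design choice made here so that the vocabulary is learned only from the training folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical teacher cues and labels, invented for illustration only.
cues = [
    "let me show you how to decode this word",
    "watch me sound out this word",
    "i will show you how i read it",
    "listen while i blend these sounds",
    "first i look at the prefix",
    "find the main idea with your partner",
    "read the sentence with your partner",
    "try the next word on your own",
    "now you sound out this word",
    "work on the summary with your group",
    "good job that is correct",
    "yes that is right nice work",
    "not quite try that sound again",
    "great reading that was correct",
    "almost check the suffix again",
]
labels = ["modeling"] * 5 + ["practice"] * 5 + ["feedback"] * 5

# Data processing + model development: count-based features feed a
# one-vs-rest wrapper that trains one binary classifier per category.
model = make_pipeline(
    CountVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)

# Model learning + evaluation: stratified 5-fold cross-validation keeps
# the category proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, cues, labels, cv=cv, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

With so few toy cues the accuracies themselves are not meaningful; the point is only the shape of the workflow.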


#### 4. Analysis Demonstration 

##### 4.1. Dependencies 

In [1]:
# Import the libraries needed for the analysis
import numpy as np
import pandas as pd

import nltk
# Uncomment on first run to download the NLTK resources used below
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('stopwords')

from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer

#Classification Models
from sklearn.model_selection import StratifiedKFold
from sklearn import model_selection, naive_bayes, svm
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

#Classification performance measures
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score #for f1 score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import cohen_kappa_score

#Generate the confusion matrix
from sklearn.metrics import confusion_matrix

#Create a heat map
import seaborn as sns
#############################################################

##### 4.2. Code

In [None]:
input_dir = './'
filename = 'PC06-03-22.xlsx'

dat = pd.read_excel(input_dir+filename)

In [None]:
def import_data(dir_='./', filename='PC06-03-22.xlsx'):
    # Read the coded transcript file and fill empty cells with 0
    dat = pd.read_excel(dir_ + filename)
    dat = dat.fillna(0)

    # Recode the "x" marks in the label columns as 1
    label_cols = ['ESI1', 'ESI2', 'ESI3', 'ESI4', 'ESI5', 'ESI6',
                  'EBP4', 'EBP5', 'EBP6', 'EBP10']
    dat[label_cols] = dat[label_cols].replace({"x": 1})

    # Copy the first 399 rows so later assignments do not touch a view of dat
    short_dat = dat.iloc[:399].copy()

    # Lowercase and tokenize each cue
    short_dat['Dialogue'] = [entry.lower() for entry in short_dat['Dialogue'].astype(str)]
    short_dat['Dialogue'] = [word_tokenize(entry) for entry in short_dat['Dialogue']]

    # Map the first letter of each POS tag to a WordNet POS (default: noun)
    tag_map = defaultdict(lambda: wn.NOUN)
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB
    tag_map['R'] = wn.ADV

    # Build the stop-word set and the lemmatizer once, not once per word
    stop_words = set(stopwords.words('english'))
    word_Lemmatized = WordNetLemmatizer()
    for index, entry in enumerate(short_dat['Dialogue']):
        # Words that survive stop-word filtering, after lemmatization
        Final_words = []
        # pos_tag provides the tag, i.e., whether the word is a noun (N), verb (V), or something else
        for word, tag in pos_tag(entry):
            if word not in stop_words:  # add "and word.isalpha()" to keep alphabetic tokens only
                word_Final = word_Lemmatized.lemmatize(word, tag_map[tag[0]])
                Final_words.append(word_Final)
        # Store the processed words for this cue in 'text_final'
        short_dat.loc[index, 'text_final'] = str(Final_words)
    return short_dat, dat.iloc[:399].copy()
    

In [None]:
short_dat, og_dat = import_data()
# Drop the first data row (the second row in the Excel file) from both frames
# so that the positional indices from the CV splits also line up with og_dat
short_dat = short_dat.iloc[1:]
og_dat = og_dat.iloc[1:]

Tfidf_vect = TfidfVectorizer(max_features=500)
Tfidf_vect.fit(short_dat.text_final)

skf = StratifiedKFold(n_splits=5)
X = short_dat.text_final.values
y = short_dat['ESI3'].values  #item that I would like to analyze
y = list(y)

lr_acc = []
nb_acc = []
svm_acc = []
nn_acc = []

lr_f1 = []
nb_f1 = []
svm_f1 = []
nn_f1 =[]

lr_auc = []
nb_auc = []
svm_auc = []
nn_auc =[] 

lr_kappa = []
nb_kappa = []
svm_kappa = []
nn_kappa =[] 

result_df =[]

for train, test in skf.split(X, y):
    y = np.array(y)
    Train_X, Test_X = X[train], X[test]
    Train_Y, Test_Y = y[train], y[test]
    
    Train_X_Tfidf = Tfidf_vect.transform(Train_X)
    Test_X_Tfidf = Tfidf_vect.transform(Test_X)

    # fit the training dataset on the logistic regression classifier
    lr = LogisticRegression()
    lr.fit(Train_X_Tfidf,Train_Y)
    # predict the labels on validation dataset
    predictions_lr = lr.predict(Test_X_Tfidf)
    # Score the fold; sklearn metrics take arguments in (y_true, y_pred) order
    print("Logistic Regression Accuracy Score -> ",
          accuracy_score(Test_Y, predictions_lr))
    print("LR F1 score -> ", f1_score(Test_Y, predictions_lr))
    print("LR AUC-ROC score -> ", roc_auc_score(Test_Y, predictions_lr))
    print("LR Kappa score -> ", cohen_kappa_score(Test_Y, predictions_lr))
        
    lr_acc.append(accuracy_score(Test_Y, predictions_lr))
    lr_f1.append(f1_score(Test_Y, predictions_lr))
    lr_auc.append(roc_auc_score(Test_Y, predictions_lr))
    lr_kappa.append(cohen_kappa_score(Test_Y, predictions_lr))
    
    # fit the training dataset on the NB classifier
    Naive = naive_bayes.MultinomialNB()
    Naive.fit(Train_X_Tfidf,Train_Y)
    # predict the labels on validation dataset
    predictions_NB = Naive.predict(Test_X_Tfidf)
    # Score the fold; sklearn metrics take arguments in (y_true, y_pred) order
    print("Naive Bayes Accuracy Score -> ",
          accuracy_score(Test_Y, predictions_NB))
    print("NB F1 score -> ", f1_score(Test_Y, predictions_NB))
    print("NB AUC-ROC score -> ", roc_auc_score(Test_Y, predictions_NB))
    print("NB Kappa score -> ", cohen_kappa_score(Test_Y, predictions_NB))
    
    nb_acc.append(accuracy_score(Test_Y, predictions_NB))
    nb_f1.append(f1_score(Test_Y, predictions_NB))
    nb_auc.append(roc_auc_score(Test_Y, predictions_NB))
    nb_kappa.append(cohen_kappa_score(Test_Y, predictions_NB))
    
    # Classifier - Algorithm - SVM
    # fit the training dataset on the classifier
    # (degree and gamma are ignored by a linear kernel, so they are omitted)
    SVM = svm.SVC(C=1.0, kernel='linear')
    SVM.fit(Train_X_Tfidf,Train_Y)
    # predict the labels on validation dataset
    predictions_SVM = SVM.predict(Test_X_Tfidf)
    # Score the fold; sklearn metrics take arguments in (y_true, y_pred) order
    print("SVM Accuracy Score -> ",
          accuracy_score(Test_Y, predictions_SVM))
    print("SVM F1 score -> ", f1_score(Test_Y, predictions_SVM))
    print("SVM AUC-ROC score -> ", roc_auc_score(Test_Y, predictions_SVM))
    print("SVM Kappa score -> ", cohen_kappa_score(Test_Y, predictions_SVM))
    
    svm_acc.append(accuracy_score(Test_Y, predictions_SVM))
    svm_f1.append(f1_score(Test_Y, predictions_SVM))
    svm_auc.append(roc_auc_score(Test_Y, predictions_SVM))
    svm_kappa.append(cohen_kappa_score(Test_Y, predictions_SVM))
    
    # Classifier - Algorithm - multi-layer perceptron (MLP) neural network classifier
    # (MLPClassifier is already imported in the Dependencies cell)
    nn = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(3, 2), random_state=1)
    nn.fit(Train_X_Tfidf,Train_Y)
    # predict the labels on validation dataset
    predictions_nn = nn.predict(Test_X_Tfidf)
    # Score the fold; sklearn metrics take arguments in (y_true, y_pred) order
    print("NN Accuracy Score -> ",
          accuracy_score(Test_Y, predictions_nn))
    print("NN F1 score -> ", f1_score(Test_Y, predictions_nn))
    print("NN AUC-ROC score -> ", roc_auc_score(Test_Y, predictions_nn))
    print("NN Kappa score -> ", cohen_kappa_score(Test_Y, predictions_nn))
    
    nn_acc.append(accuracy_score(Test_Y, predictions_nn))
    nn_f1.append(f1_score(Test_Y, predictions_nn))
    nn_auc.append(roc_auc_score(Test_Y, predictions_nn))
    nn_kappa.append(cohen_kappa_score(Test_Y, predictions_nn))
    

    # Copy the test rows so adding a column does not modify a view of og_dat
    compare = og_dat.iloc[test].copy()
    compare['prediction_nn'] = predictions_nn # you can change this to check other models 
    result_df.append(compare)
    
# Average each metric over the five folds
print('\n Average LR accuracy: ', np.average(lr_acc))
print('\n Average LR F1 score: ', np.average(lr_f1))
print('\n Average LR AUC-ROC score: ', np.average(lr_auc))
print('\n Average LR Kappa score: ', np.average(lr_kappa))
    
print('\n Average NB accuracy: ', np.average(nb_acc))
print('\n Average NB F1 score: ', np.average(nb_f1))
print('\n Average NB AUC-ROC score: ', np.average(nb_auc))
print('\n Average NB Kappa score: ', np.average(nb_kappa))

print('\n Average SVM accuracy: ', np.average(svm_acc))
print('\n Average SVM F1 score: ', np.average(svm_f1))
print('\n Average SVM AUC-ROC score: ', np.average(svm_auc))
print('\n Average SVM Kappa score: ', np.average(svm_kappa))

print('\n Average NN accuracy: ', np.average(nn_acc))
print('\n Average NN F1 score: ', np.average(nn_f1))
print('\n Average NN AUC-ROC score: ', np.average(nn_auc))
print('\n Average NN Kappa score: ', np.average(nn_kappa))
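
One caveat about the AUC-ROC values computed in the loop: `roc_auc_score` receives the hard 0/1 output of `predict`, which reduces the ROC curve to a single operating point. The sketch below uses invented toy numbers, not project data, to show how the value can differ when continuous scores are supplied instead.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical human-coded labels and model scores for six cues
# (invented values, for illustration only).
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.10, 0.40, 0.60, 0.45, 0.70, 0.90])  # estimated P(label = 1)

# Hard predictions at the usual 0.5 threshold, as predict() would return.
y_pred = (y_score >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, y_score)  # uses the full ranking
auc_from_labels = roc_auc_score(y_true, y_pred)   # uses one threshold only

print("AUC from continuous scores:", round(auc_from_scores, 3))
print("AUC from hard predictions: ", round(auc_from_labels, 3))
```

In the loop above, continuous scores would be available from `predict_proba(Test_X_Tfidf)[:, 1]` for the logistic regression, Naive Bayes, and MLP models, and from `decision_function(Test_X_Tfidf)` for `svm.SVC`.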

In [None]:
#Generate the confusion matrix for the last fold
#(y_true first, so rows are human-coded values and columns are predicted values)
cf_matrix = confusion_matrix(Test_Y, predictions_nn)

print(cf_matrix)

In [None]:
compare = og_dat.iloc[test].copy()  # copy so the new column is not written to a view
compare['prediction_nn'] = predictions_nn # Change this to check other models 

In [None]:
# Rows where the prediction disagrees with the human code.
# ESI3 is the item the models were trained on above; change it if y changes.
compare[(compare['ESI3'] != compare.prediction_nn)]

In [None]:
#Exporting Data Out of Python
compare[(compare['ESI3'] != compare.prediction_nn)].to_excel('./Excel.xlsx')

In [None]:
import matplotlib.pyplot as plt  # needed for plt.show() below
import seaborn as sns

ax = sns.heatmap(cf_matrix, annot=True, cmap='Blues')

ax.set_title('ESI3 Modeling\n\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Human-coded Values ')

## Tick labels - List must be in alphabetical order
ax.xaxis.set_ticklabels(['0','1'])
ax.yaxis.set_ticklabels(['0','1'])

## Display the visualization of the Confusion Matrix.
plt.show()

#### 5. Results 

> Our preliminary analyses, based on three transcribed lessons, demonstrated the capacity of the classification models. The MLP model showed relatively better classification performance than the other three models in predicting the decoding strategy category. The values for accuracy, F1 score, and AUC-ROC in the MLP model were 89.95, 73.13, and 82.08, respectively. 


#### 6. Conclusion and Discussion

> Our findings support the feasibility of using supervised classification algorithms to automatically score teacher discourse. However, if the human-labeled training data include any bias, an automated scoring system based on supervised learning could also be biased. To address this limitation, follow-up studies that evaluate fairness evidence are needed.  
