#### Final Proposal EDF 6938: Natural Language Processing

### Do data ‘suggest’ that data can suggest?
> #### Author: Eric Wright
> #### Date: 12/6/2022
> #### Email: eric@ericwright.org


#### 1. Introduction 

This is a study of using natural language processing (NLP) to identify the prevalence of a language pattern in quantitative studies in education research journals. This pattern is one that, from my perspective, communicates the idea that data can speak for themselves, an idea that is counter to both new and old theoretical perspectives that supposedly underlie the practices of quantitative research. My goal is to train an algorithm to successfully identify this pattern while generating as few false positives as possible, thus using quantitative methods to examine the writing/communication aspect of quantitative methodology. This pattern is ‘[these data/results/findings/etc.] _suggest_/_imply_/_indicate_/etc. [a conclusion made by the researchers based on the data],’ and up to now, it has been entirely unaddressed in the literature. From certain theoretical perspectives, the answer to the title is simple: data do not suggest that data can suggest simply because data cannot suggest. The real question is whether the language used in published articles can be inferred to mean that data can suggest.




#### 2. Related Work 
##### 2.1 Theoretical underpinning
Recent critical-theory–based approaches to quantitative research have highlighted the idea that data cannot speak for themselves. These approaches include critical data studies [1], QuantCrit [2], and data feminism [3]. While some refer to ‘numbers’ rather than ‘data,’ the common point is that data, findings, numbers, results, and so on do not have any inherent meaning and that it is the authors or researchers who make meaning from those. In other words, a conclusion does not emerge directly from a result. Instead, a conclusion is proposed by a researcher who interprets the result, or the result is evidence that supports a theory.

This position is also supported by the quantitative-associated theoretical perspective of _postpositivism_. This is the theoretical perspective on which modern quantitative practices have supposedly been based following a rejection of positivism [4]. And while the idea that data cannot speak for themselves is not always stated in such clear terms in the postpositivist literature, it follows easily from the standpoint that theories are underdetermined by data, that a set of data can never point to a single theory that explains it [5]. Instead, there are many possible theories that could explain any set of observations. So while newer theoretical perspectives have called specific attention to this, the idea itself is not new and is, in theory, very mainstream.

This brings me to a question: why then do we write our quantitative articles as if data do speak for themselves? This question was born from my own observations, particularly of the ways researchers have a habit of discussing results, while reading education research articles and reviewing conference proposals as a quantitative methodologist. The most common phrase I have noticed is of the form ‘[these data/results/findings/etc.] _suggest_ [a conclusion made by the researchers based on the data],’ as if the conclusion sprang forth from the data without any thought or extrapolation from the researchers. Almost always, this type of phrase included only one potential conclusion out of the many that different researchers could draw given the same results.

The use of the word ‘suggest’ may seem like a small detail, but even if every quantitative researcher simply used this as shorthand with the best intentions, enough repetition may communicate and instill the belief in readers that data can speak for themselves and suggest conclusions, especially for anyone who is not intimately familiar with quantitative methods and the underlying theoretical perspectives. This is highlighted, in a way, by QuantCrit [2]. Gillborn et al. [2] recommend researchers move away from referring to ‘race’ when talking about results related to a categorization of race. This is because a racial designation is not something that is physically inherent to the individual, but calling it ‘race’ and saying that results are due to race or explained by race gives the impression that any negative quantitative effects related to that categorization are based on some individual quality rather than a likely-systemic force, especially in the United States. Therefore, they recommend at least replacing ‘race’ with ‘race/racism.’ Lipton and Steinhardt [6] also call out the extent of anthropomorphization in machine learning as potentially creating a false impression that algorithms do more and are more human-like than in actuality.

##### 2.2 Methodologically related work
Based on a search of the literature, I was unable to identify any similar NLP studies. However, there is one line of research that addresses a very similar topic: unsupported causal inferences in quantitative research studies. These are statements that imply causation from results in studies where the research designs do not support causal inferences from those results. This has been studied in psychology [7], teaching and learning [8], and professional counseling [9], to name a few, and all of these studies found large percentages of unsupported causal inferences. In fact, I was on the research team for the professional counseling study [9]. That study’s focus on the linguistic and communication aspect of methodology, my long experience of hand-coding articles for it, and my informally-observed prevalence of ‘suggests’ in those articles all contribute to the inspiration for this study and the interest in NLP for a potential solution.

##### 2.3 Purpose and research questions
No one currently knows the extent of these language patterns. Therefore, I have two primary research questions:

(1) In quantitative educational research articles, how prevalent is the use of language that implies data speak for themselves, for this study specifically focusing on sentences that include a form of the word ‘suggest?’

(2) Can the naïve Bayes or neural network NLP algorithms successfully classify these sentences as cases of data speaking for themselves, with perfect or near-perfect precision and at least 50% recall?

As a secondary purpose, I hope readers will be encouraged by this study to consider theoretical perspectives in quantitative work and to think about how their choice of words may be supporting undesired meanings.

#### 3. Methods 
All processing and NLP algorithms were done in Python 3 [10]. All web scraping and conversion of HTML to plaintext were done in R and Perl, respectively [11], [12], repurposing code I originally wrote for [9].

##### 3.1 Data collection
For this study, I identified all research articles published in volume 58 (published in 2021) of the American Educational Research Journal (AERJ) that were empirical and quantitative-only, excluding meta-analyses, measurement studies, and methodological studies. AERJ is a well-regarded journal that accepts articles from all educational research subfields, which is advantageous because it means the trained algorithm will be less likely to be specific to a subfield.

Volume 58 contained 72 articles (excluding one published correction to an article). I examined the abstracts of those articles for inclusion, and if there were any question at all about inclusion based on the abstract, I looked at the article body. This resulted in 22 quantitative-only articles as my final sample. Article abstracts and bodies for these 22 articles were then converted to plaintext, excluding any notes, references, appendices, and mathematical markup.

Both NLP and the manual coding of sentences required some preprocessing beforehand.

Phase one of preprocessing, prior to coding, involved removing instances of ‘et al.’ that were not at the end of a sentence because those could confuse the sentence tokenizer, removing content in parentheses or any other type of bracket (this simplifies the entire process, and any loss of ‘suggests’ within brackets is acceptable), as well as some miscellaneous cleanup important for NLP that would not interfere with manual coding. After cleanup, sentences were tokenized using NLTK’s Punkt tokenizer and sentences that did not contain a form of the word suggest were discarded. Full details are presented in the code in section 4.2.1.
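
As a rough illustration of the final filtering step, the sketch below applies the same word-boundary pattern that the notebook uses in section 4.2.1 to a few made-up sentences. The helper name `contains_suggest` and the example sentences are hypothetical; only the pattern itself comes from the notebook.

```python
import re

# Pattern from section 4.2.1: a word boundary followed by "suggest".
SUGGEST = re.compile(r'\bsuggest')

def contains_suggest(sentence):
    """Return True if the sentence contains a word starting with 'suggest'."""
    return SUGGEST.search(sentence) is not None

# Hypothetical sentences, not taken from the corpus.
examples = [
    "These results suggest that tutoring helps.",
    "The findings suggested a small effect.",
    "We offer one suggestion for practice.",
    "Suggesting otherwise would be premature.",
    "The data indicate a decline.",
]

# Keep only the sentences that match the pattern.
kept = [s for s in examples if contains_suggest(s)]
```

In this sketch the first three sentences are kept. The pattern is case-sensitive and matches any word beginning with 'suggest', so 'suggestion' passes the filter while a sentence-initial 'Suggesting' does not. The sketch cannot show how often either situation arises in the actual corpus.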

Next, I manually coded the resulting 307 sentences for whether ‘suggest’ is used in a way that may imply data speak for themselves. For this study, I operationalized language that implies data speak for themselves as phrases where any type of data, findings, results, studies, etc. ‘suggest’ a conclusion that could not be supported by data. I erred on the side of the data not speaking for themselves for phrases such as ‘this suggests’ that have no clear indication from the isolated sentence of what is doing the suggesting. I also considered studies, research, literature, theory, author citations, etc. as not referring to data. However, in practice, I expect many of these examples are likely cases of data speaking for themselves.

Phase two of preprocessing was to do some further cleaning such as removal of punctuation (see section 4.2.3 for full details), stem the words using the NLTK Porter stemmer, and remove stop words. The stop words came from NLTK’s English list with several added and removed on a theoretical basis. One corpus of sentences was saved with stop words left in and one corpus saved with stop words excluded.
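
The customization of the stop list can be sketched in a few lines. This is a minimal illustration only: `BASE_STOP` is a tiny hypothetical stand-in for NLTK's English list, and the function name and example tokens are made up. The kept and added words mirror those listed in section 4.2.3.

```python
# Hypothetical stand-in for a standard English stop list (the study used NLTK's).
BASE_STOP = {'i', 'we', 'this', 'these', 'those', 'they',
             'the', 'a', 'of', 'that', 'is'}

# Words kept because they can be the grammatical subject of 'suggest'.
KEEP = {'i', 'we', 'this', 'these', 'those', 'they'}

# Extra word treated as a stop word based on observation.
ADD = {'also'}

stop_words = (BASE_STOP - KEEP) | ADD

def remove_stop_words(tokens):
    """Drop tokens found in the customized stop list (exact, case-sensitive match)."""
    return [t for t in tokens if t not in stop_words]

tokens = ['these', 'results', 'also', 'suggest', 'that', 'the', 'effect', 'is', 'small']
filtered = remove_stop_words(tokens)
```

Here `filtered` retains 'these' as a potential subject of 'suggest' while dropping 'also', 'that', 'the', and 'is'.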

The multinomial naïve Bayes and MLP neural network algorithms in the sklearn package were then run using bag of words for the feature representation while varying several factors. After some informal testing to help select appropriate settings to keep and factors to vary, the final conditions used here were:

(1) sentences with stop words or without stop words;

(2) n-gram sizes of 1–3 or 1–5 (n-grams of some size greater than one are important theoretically, but a smaller maximum would help reduce dimensionality);

(3) the minimum number of sentences in which a feature needs to appear to be included, considering 1–4 as cutoffs (higher numbers reduce dimensionality but may lead to important but rare features being removed);

(4) the maximum proportion of sentences in which a feature may appear before it is excluded, comparing effectively no maximum (0.999) to cutoffs of 0.5 and 0.1 (again, this relates to dimensionality);

(5) the activation function for the neural network of ReLU or logistic (ReLU is newer and generally preferred, but logistic is worth trying because it fits with the binary form of my desired output [13]);

(6) six configurations of hidden layers for the neural network.

The first two hidden layer configurations were simply no hidden layer and a single hidden layer of 100 nodes. The remaining four went from one to four hidden layers with a total number of hidden layer nodes equal to two-thirds the number of input features, up to a maximum of 500. The number of nodes is partially based on a common rule of thumb plus the recognition that a small training set may require fewer nodes and a mind for training time [14]. Maintaining the same total number of nodes, I divided the nodes between layers such that each layer has roughly half as many nodes as the previous layer. Altogether, this resulted in 624 conditions (48 for naïve Bayes and 576 for the neural network).
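
The total of 624 conditions can be checked by enumerating the factor levels. The sketch below copies the levels from the setup cell in section 4.2.4; the variable names are my own and the six hidden layer configurations are represented only by their count.

```python
from itertools import product

# Factor levels as listed in the NLP setup cell (section 4.2.4).
corpora = ['Withstop', 'Nostop']
ngram_ranges = [(1, 3), (1, 5)]
min_dfs = [1, 2, 3, 4]
max_dfs = [0.999, 0.5, 0.1]
activations = ['relu', 'logistic']
n_layer_configs = 6  # only the count matters for this check

# Every bag-of-words setting is run once with naive Bayes...
bow_settings = list(product(corpora, ngram_ranges, min_dfs, max_dfs))
n_nb = len(bow_settings)

# ...and once per activation function and hidden layer configuration for the MLP.
n_nn = len(bow_settings) * len(activations) * n_layer_configs

total = n_nb + n_nn
```

This gives 48 naïve Bayes runs and 576 neural network runs, which together make up the 624 conditions.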

For training and testing, I aimed for 75% of articles in the training set with 25% in the testing set. The dataset was randomly split by article rather than sentence because sentences within an article are likely to have some similarity. The actual split by article was 16 in training and 6 in testing (73%/27%). At the sentence level, the testing set held about one-third as many sentences as the training set (a testing-to-training ratio of 0.329), which works out to a split of roughly 75%/25% and is close to what I intended. Furthermore, the proportion of positive cases in each sample was similar (65.4% in training, 60.5% in testing).
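
A split by article rather than by sentence can be sketched with the standard library alone. This is an illustration under assumptions, not the notebook's procedure: the notebook shuffles IDs with `sklearn.utils.shuffle` and its own seed, whereas the function name, the seed, and the placeholder IDs below are hypothetical.

```python
import math
import random

def split_by_article(article_ids, test_share=0.25, seed=0):
    """Shuffle unique article IDs and reserve ceil(test_share * n) of them for testing."""
    ids = sorted(set(article_ids))
    random.Random(seed).shuffle(ids)
    n_test = math.ceil(test_share * len(ids))
    return ids[n_test:], ids[:n_test]  # (training IDs, testing IDs)

# Hypothetical IDs 1..22 standing in for the 22 sampled articles.
train_ids, test_ids = split_by_article(range(1, 23))
```

With 22 articles, rounding up 25% reserves 6 articles for testing and leaves 16 for training, which is why the article-level split is 73%/27% rather than exactly 75%/25%. Every sentence then follows its article into one set or the other.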

##### 3.2 Analysis
Analysis of results is descriptive. For RQ1, I use the proportion of articles that contain at least one instance, the proportion of sentences coded as a positive case, the mean number of instances per article, and the distribution of numbers of articles by number of instances. For RQ2, I examine the precision and recall with respect to positive cases at both the sentence and article level. I also take what I consider to be the best NLP result and qualitatively examine the misclassifications.
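
To make the two levels of evaluation concrete, the sketch below computes precision and recall for the positive class and then aggregates sentences to articles, treating an article as positive if any of its sentences is positive. The function names and the toy data are hypothetical, and the zero-division guard on recall is an addition that the notebook does not include.

```python
def precision_recall(actual, predicted):
    """Precision and recall for the positive class (0 when the denominator is 0)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    n_pred = sum(1 for p in predicted if p)
    n_actual = sum(1 for a in actual if a)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_actual if n_actual else 0.0
    return precision, recall

def to_article_level(article_ids, labels):
    """An article counts as positive if any of its sentences is positive."""
    flags = {}
    for aid, lab in zip(article_ids, labels):
        flags[aid] = flags.get(aid, False) or bool(lab)
    return [flags[aid] for aid in sorted(flags)]

# Hypothetical toy data: six sentences from three articles.
ids       = ['A', 'A', 'B', 'B', 'C', 'C']
actual    = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 1, 0]

s_precision, s_recall = precision_recall(actual, predicted)
a_precision, a_recall = precision_recall(to_article_level(ids, actual),
                                         to_article_level(ids, predicted))
```

In this toy example every article contains at least one actual positive sentence, so no article can be a false positive. Article precision is therefore 1 even though sentence precision is only two-thirds.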

#### 4. Analysis Demonstration 

##### 4.1. Dependencies 

In [None]:
#### Import all necessary libraries
# General
import numpy as np
import pandas as pd
import re
import math
import statistics

# NLP
import nltk
nltk.download(['punkt', 'wordnet', 'omw-1.4', 'stopwords'])
import sklearn
#############################################################

##### 4.2. Code
###### 4.2.1 Cleaning and data management prior to manual coding

In [2]:
#### Load and clean full documents and tokenize into sentences

# Load data
articles = pd.read_csv('articles_td.csv', sep='\t', header=0)


# Function doc_clean(): Clean articles prior to coding
def doc_clean(text):
    text = re.sub(r'[^\.]+https://orcid\.org/.*?$', '', text) #remove any ORCID links at end
    text = re.sub(r' et al\.(?:\'s)?( ?[^\sA-Z])', r'\1', text) #remove "et al.('s)" (unless at end of sentence, which is unlikely anyway)
    text = re.sub(r'\[MATH\]', '<MATH>', text) #math ML replaced with [MATH] during conversion from html; change to <MATH>
    text = re.sub(r'\s?\(.*?\)', r'', text) #remove content in parentheses
    text = re.sub(r'\s?\[.*?\]', r'', text) #remove content in brackets
    text = re.sub(r'\s?\{.*?\}', r'', text) #remove content in squigglies
    text = re.sub(r"n't", r' not', text) #n't to not (not likely)
    text = re.sub(r"'ll", r' will', text) #'ll to will (not likely)
    text = re.sub(r"'m", r' am', text) #'m to am (not likely)
    text = re.sub(r"(s?he|it)'s", r'\1 is', text) #expand contractions involving 's to is (not likely)
    text = re.sub(r"'re", r' are', text) #'re to are (not likely)
    text = re.sub(r"'ve", r' have', text) #'ve to have (not likely)
    text = re.sub(r'(\s?)[\-–−$+#]*\.?\d[\d\w\.,:%$\(\)*^#]*?(\.?[\s\-–−])', r'\1<NUMBER>\2', text) #numbers to <NUMBER>, with logic to account for decimals, periods, signs, and retaining ending punctuation
    text = re.sub(r'\s+', r' ', text) #collapse all instances of more than one whitespace character into a single space
    text = re.sub(r'^\s+|\s+$', '', text) #remove any spaces at beginning and end
    return text

# Clean the abstracts and bodies
articles['c_abs'] = [doc_clean(text) for text in articles['Abstract']]
articles['c_body'] = [doc_clean(text) for text in articles['Body']]


# Tokenize sentences for a given article, with abstracts and bodies converted to long format
def my_st(article):
    sents_a = nltk.sent_tokenize(article['c_abs'])
    sents_b = nltk.sent_tokenize(article['c_body'])
    return pd.DataFrame({'ID':np.repeat(article['ID'], len(sents_a)+len(sents_b)), #article ID repeated for all sentences
                         'Cat':np.concatenate([np.repeat('a', len(sents_a)), np.repeat('b', len(sents_b))]), #abstract and body codings, repeated the necessary number of times each
                         'Sent':sents_a+sents_b}) #the cleaned, tokenized sentences

all_sent = pd.concat(articles.apply(my_st, axis=1, result_type='reduce').tolist())

In [4]:
#### Send sentences containing forms of 'suggest' to Excel for coding (uncomment to store)

#all_sent.loc[all_sent['Sent'].str.contains(r'\bsuggest')].to_excel('sentences_to_code.xlsx', freeze_panes=(1,0))

###### 4.2.3 Pre-NLP sentence/word-level cleaning, normalization, etc.

In [83]:
#### Load manually-coded sentences for recoding, further cleaning, normalization, etc.

### Load manually-coded sentences
coded_sent = pd.read_excel('sentences_coded.xlsx')


### Recode n/? to 0 and y/~ to 1
coded_sent['Recode'] = coded_sent['Code'].str.contains(r'y|~')*1


### Cleaning and word tokenization
nopunc = [re.sub(r"'", r'', sent) for sent in coded_sent['Sent']] #remove apostrophes and replace with nothing (targets possessives)
nopunc = [re.sub(r'\s*[^\w\d<>]\s*', r' ', sent) for sent in nopunc] #remove all other punctuation, replacing with spaces
words = [nltk.word_tokenize(i) for i in nopunc]


### Normalize (stem) including stopwords
stemmer = nltk.stem.PorterStemmer()
stemmed = [[stemmer.stem(token) for token in sent] for sent in words]

# Store as a sentence again (mostly for troubleshooting)
coded_sent['Withstop'] = [nltk.tokenize.treebank.TreebankWordDetokenizer().detokenize(sent) for sent in stemmed] #store as a sentence again (mostly for troubleshooting)


### Normalize (stem) excluding stopwords
stop_words = set(nltk.corpus.stopwords.words('english'))

# Exclude words from list that could be important subjects for 'suggest'
stop_words.discard('i')
stop_words.discard('we')
stop_words.discard('this')
stop_words.discard('these')
stop_words.discard('those')
stop_words.discard('they')

# Include other stop words in list based on observations
stop_words.add('also')

# Remove stopwords
nostop = [[word for word in sent if word not in stop_words] for sent in words]

# Stem
stemmer = nltk.stem.PorterStemmer()
stemmed = [[stemmer.stem(token) for token in sent] for sent in nostop]

# Store as a sentence again (mostly for troubleshooting)
coded_sent['Nostop'] = [nltk.tokenize.treebank.TreebankWordDetokenizer().detokenize(sent) for sent in stemmed]

###### 4.2.4 NLP testing setup

In [84]:
#### NLP setup, including data splitting

### Some specific imports for ease of use
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

### Sim conditions
corpuses = [coded_sent['Withstop'], coded_sent['Nostop']]
ng_ranges = [(1,3), (1,5)] #n-grams
min_dfs = [1, 2, 3, 4] #minimum number of occurrences required
max_dfs = [0.999, 0.5, 0.1] #no maximum occurrences cut (0.999) vs. cutting some of the maximum occurrences (0.5 and 0.1)
act_fs = ['relu', 'logistic'] #activation function


### Get data splits by ID for testing/training (have to do it manually because train_test_split doesn't like clusters with only one sentence)
ids = coded_sent['ID'].unique()
testn = math.ceil(0.25*len(ids)) #number in testing
ids = sklearn.utils.shuffle(ids, random_state=32607) #randomize ids
testids = ids[0:testn] #get testing ids
trainids = ids[testn:] #get training ids

## Get coded_sent indices for each group
testind = coded_sent.index[coded_sent['ID'].isin(testids)]
trainind = coded_sent.index[coded_sent['ID'].isin(trainids)]

### Check group balance
print("Training/Testing Balance\n")
print("Training proportion of cases:", round(coded_sent['Recode'][trainind].sum()/len(coded_sent['Recode'][trainind]), 3))
print("Testing proportion of cases:", round(coded_sent['Recode'][testind].sum()/len(coded_sent['Recode'][testind]), 3))
print("Sentence ratio, testing to training:", round(len(coded_sent['Recode'][testind]) / len(coded_sent['Recode'][trainind]), 3))
print("Article percentage in testing:", round(testn / len(ids), 3))

Training/Testing Balance

Training proportion of cases: 0.654
Testing proportion of cases: 0.605
Sentence ratio, testing to training: 0.329
Article percentage in testing: 0.273


###### 4.2.5 Run NLP algorithms

In [None]:
#### NLP algorithm testing

# Set up DataFrame for results
results = pd.DataFrame(columns=['alg', 'settings', 's_precision', 's_recall', 'a_precision', 'a_recall', 'n_features'])

### Big loop
for sentence_set in corpuses:
    for ng in ng_ranges:
        for mindf in min_dfs:
            for maxdf in max_dfs:
                
                ## BOW with training/testing split
                bow = CountVectorizer(ngram_range=ng, min_df=mindf, max_df=maxdf)
                X = bow.fit_transform(sentence_set)

                x_train = X[trainind]
                x_test = X[testind]
                y_train = coded_sent['Recode'][trainind]
                y_test = coded_sent['Recode'][testind]


                ## Naive Bayes
                clf_nb = MultinomialNB()
                clf_nb.fit(x_train, y_train)
                y_hat = clf_nb.predict(x_test)

                # Record NB results
                ay_test = y_test.groupby(coded_sent['ID'][testind]).sum().ne(0)
                ay_hat = pd.Series(y_hat).groupby(coded_sent['ID'][testind].reset_index()['ID']).sum().ne(0)
                outcomes = [round(sum(y_test & y_hat) / sum(y_hat), 3) if sum(y_hat) != 0 else 0, #sentence precision
                            round(sum(y_test & y_hat) / sum(y_test), 3), #sentence recall
                            round(sum(ay_test & ay_hat) / sum(ay_hat), 3) if sum(ay_hat) != 0 else 0, #article precision
                            round(sum(ay_test & ay_hat) / sum(ay_test), 3)] #article recall
                
                results.loc[len(results)] = ['NB', sentence_set.name+' '+str(ng)+' '+str(mindf)+' '+str(maxdf)] + outcomes + [x_train.shape[1]]
                #print(results.loc[len(results)-1].to_list())
                
                                
                ## Neural Network
                
                # Set up layers, mostly based on dimensionality
                n_nodes = min(math.ceil(x_train.shape[1]*2/3), 500) #same as 2/3 input layer or max of 500
                layers = [(),                             #no hidden layers
                          (100,),                         #sklearn default (one layer)
                          (n_nodes,),                     #one hidden layer
                          (math.floor(n_nodes*2/3),       #two hidden layers
                           n_nodes-math.floor(n_nodes*2/3)),
                          (math.floor(n_nodes*4/7),       #three hidden layers
                           math.floor(n_nodes*2/7),
                           n_nodes-math.floor(n_nodes*4/7)-math.floor(n_nodes*2/7)),
                          (math.floor(n_nodes*8/15),      #four hidden layers
                           math.floor(n_nodes*4/15),
                           math.floor(n_nodes*2/15),
                           n_nodes-math.floor(n_nodes*8/15)-math.floor(n_nodes*4/15)-math.floor(n_nodes*2/15))]
                          
                # Loop it
                for act in act_fs:
                    for lay in layers:
                        # Neural Network
                        clf_nn = MLPClassifier(hidden_layer_sizes=lay, random_state=352, max_iter=500, activation=act, solver='lbfgs')
                        clf_nn.fit(x_train, y_train)
                        y_hat = clf_nn.predict(x_test)

                        # Record NN results
                        ay_test = y_test.groupby(coded_sent['ID'][testind]).sum().ne(0)
                        ay_hat = pd.Series(y_hat).groupby(coded_sent['ID'][testind].reset_index()['ID']).sum().ne(0)
                        outcomes = [round(sum(y_test & y_hat) / sum(y_hat), 3) if sum(y_hat) != 0 else 0, #sentence precision
                                    round(sum(y_test & y_hat) / sum(y_test), 3), #sentence recall
                                    round(sum(ay_test & ay_hat) / sum(ay_hat), 3) if sum(ay_hat) != 0 else 0, #article precision
                                    round(sum(ay_test & ay_hat) / sum(ay_test), 3)] #article recall
                        
                        results.loc[len(results)] = ['NN', sentence_set.name+' '+str(ng)+' '+str(mindf)+' '+str(maxdf)+' '+act+' '+str(lay)] + outcomes + [x_train.shape[1]]
                        #print(results.loc[len(results)-1].to_list())

### Output results to Excel
results.to_excel('results.xlsx')
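
The one- to four-layer configurations in the loop above all follow the same halving pattern, which can be written as a single helper. This is a sketch only: the name `halving_layers` is hypothetical, and it simply generalizes the explicit tuples used in the notebook.

```python
import math

def halving_layers(n_nodes, n_layers):
    """Split n_nodes across n_layers so each layer is about half the previous one.

    Layer k gets floor(n_nodes * 2**(n_layers - 1 - k) / (2**n_layers - 1));
    the last layer takes whatever remains so the sizes sum to n_nodes.
    """
    denom = 2 ** n_layers - 1
    sizes = [math.floor(n_nodes * 2 ** (n_layers - 1 - k) / denom)
             for k in range(n_layers - 1)]
    sizes.append(n_nodes - sum(sizes))
    return tuple(sizes)

# With the 500-node cap, four layers reproduce the best-performing configuration.
four_layers = halving_layers(500, 4)
```

For 500 nodes and four layers this returns (266, 133, 66, 35), the configuration reported as the best result in section 4.2.7. Because the last layer absorbs the rounding remainder, it is not always exactly half of the layer before it.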

###### 4.2.6 Partial RQ1 analysis

In [7]:
#### Some results from manual-coding

# Proportions, mean, and sd
counts_per_article = coded_sent['Recode'].groupby(coded_sent['ID']).sum()
print("Proportion of articles with a case:", statistics.mean(counts_per_article.ne(0)))
print("Proportion of sentences that are cases:", round(statistics.mean(coded_sent['Recode']), 3))
print("Mean sentence-cases per article:", round(statistics.mean(counts_per_article), 2))
print(" Standard deviation:", round(statistics.stdev(counts_per_article), 2))

Proportion of articles with a case: 1
Proportion of sentences that are cases: 0.642
Mean sentence-cases per article: 8.95
 Standard deviation: 6.1


###### 4.2.7 Partial RQ2 analysis

In [175]:
#### Get misclassification data for the best NLP result for qualitative analysis
#### Best result was Nostop (1, 3) 1 0.1 relu (266, 133, 66, 35)

# BOW with training/testing split
bow = CountVectorizer(ngram_range=(1,3), min_df=1, max_df=0.1)
X = bow.fit_transform(coded_sent['Nostop'])

x_train = X[trainind]
x_test = X[testind]
y_train = coded_sent['Recode'][trainind]
y_test = coded_sent['Recode'][testind]


# Neural Network
clf_nn = MLPClassifier(hidden_layer_sizes=(266, 133, 66, 35), random_state=352, max_iter=500, activation='relu', solver='lbfgs')
clf_nn.fit(x_train, y_train)
y_hat = clf_nn.predict(x_test)


# Record false positives (FP) and false negatives (FN)
errors = coded_sent.copy()

FP = ((pd.Series(y_hat).eq(1) & y_test.reset_index(drop=True).eq(0))*1)
FP.index = y_test.index
errors['FP'] = [FP[i] if (i in testind) else 0 for i in range(len(errors))]

FN = ((pd.Series(y_hat).eq(0) & y_test.reset_index(drop=True).eq(1))*1)
FN.index = y_test.index
errors['FN'] = [FN[i] if (i in testind) else 0 for i in range(len(errors))]

errors.to_excel('errors.xlsx')

#### 4. Results 

For RQ1, 100% of the manually-coded articles had at least one positive case, and 64.2% of the sentences were a positive case. The average number per article was 8.95 (_sd_ = 6.10). The distribution of articles by numbers of cases is displayed below:

In [8]:
#### Figure: Histogram
frequency, bins = np.histogram(coded_sent['Recode'].groupby(coded_sent['ID']).sum(), bins=12, range=[1, 25])
print('\nHistogram (# of articles per case count)')
for b, f in zip(bins[0:], frequency):
    print(round(b, 1), ' '.join(np.repeat('*', f)))


Histogram (# of articles per case count)
1.0 * * *
3.0 * *
5.0 * * * *
7.0 * * *
9.0 * * *
11.0 * * *
13.0 *
15.0 
17.0 *
19.0 
21.0 *
23.0 *


The distribution lies mainly from 1 to 12 cases per article with numbers higher than that being rarer.

For RQ2, precision matters the most, so I examined the top 20 results sorted by sentence-level precision and then recall, displayed below:

In [11]:
#### Table: Top 20 NLP results according to sentence-level precision and then recall
pd.read_excel('results.xlsx').sort_values(['s_precision', 's_recall'], ascending=False).head(20).reset_index(drop=True)

| Unnamed: 0 | alg | settings | s_precision | s_recall | a_precision | a_recall | n_features |
|---|---|---|---|---|---|---|---|
| 0 | NN | Nostop (1, 3) 1 0.1 relu (266, 133, 66, 35) | 0.885 | 0.5 | 1 | 0.833 | 10997 |
| 1 | NB | Nostop (1, 3) 1 0.999 | 0.88 | 0.478 | 1 | 1.0 | 11017 |
| 2 | NB | Nostop (1, 3) 1 0.5 | 0.88 | 0.478 | 1 | 1.0 | 11017 |
| 3 | NB | Nostop (1, 5) 1 0.999 | 0.867 | 0.283 | 1 | 0.667 | 21114 |
| 4 | NB | Nostop (1, 5) 1 0.5 | 0.867 | 0.283 | 1 | 0.667 | 21114 |
| 5 | NB | Nostop (1, 3) 2 0.999 | 0.864 | 0.413 | 1 | 1.0 | 1488 |
| 6 | NB | Nostop (1, 3) 2 0.5 | 0.864 | 0.413 | 1 | 1.0 | 1488 |
| 7 | NB | Withstop (1, 5) 1 0.1 | 0.857 | 0.13 | 1 | 0.5 | 30908 |
| 8 | NB | Nostop (1, 5) 1 0.1 | 0.857 | 0.13 | 1 | 0.5 | 21094 |
| 9 | NN | Nostop (1, 5) 1 0.1 logistic (100,) | 0.852 | 0.5 | 1 | 0.833 | 21094 |


In the above, "s_" indicates sentences and "a_" indicates articles. The settings column contains a string representing the specific settings used in the order of stop words, n-grams, minimum occurrences, maximum occurrences (0.999 is no maximum while lower numbers place greater restrictions), activation function (NN-only), and hidden layer configuration (NN-only). The highest sentence precision is .885 for the setting "Nostop (1, 3) 1 0.1 relu (266, 133, 66, 35)." This is lower than the ideal of 1.00, but the recall for that setting just hits the desired threshold at .500. Article precision will always be 1.00 because every article in the testing set was an actual positive case, so no article could be a false positive. Article recall for the top setting is .83, meaning one article was not correctly classified as a positive case. The number of features included in each analysis is displayed for additional context.
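
The top row of the table can be reproduced from counts. The counts below are back-calculated from the reported proportions and from the three false positives discussed next, so treat them as an assumption rather than as notebook output.

```python
# Sentence level (counts inferred from the reported figures, not printed by the notebook).
true_pos = 23    # positive testing sentences predicted positive
false_pos = 3    # negative testing sentences predicted positive
actual_pos = 46  # positive sentences in the testing set

s_precision = round(true_pos / (true_pos + false_pos), 3)
s_recall = round(true_pos / actual_pos, 3)

# Article level: all six testing articles are actual positives and five were flagged.
flagged_articles = 5
testing_articles = 6
a_precision = round(flagged_articles / flagged_articles, 3)
a_recall = round(flagged_articles / testing_articles, 3)
```

These values match the table's 0.885, 0.5, 1, and 0.833 for the top-ranked setting.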

I care more about false positives than false negatives, so for further investigation, below are the cleaned and normalized versions of the three negative cases that were predicted to be positive:

In [200]:
#### Table: Get negative cases that were predicted to be positive
with pd.option_context('display.max_colwidth', 0):
    display(pd.read_excel('errors.xlsx')[['Sent','Nostop']].loc[errors['FP'].eq(1)].style.set_properties(**{'text-align': 'left'}))

| Unnamed: 0 | Sent | Nostop |
|---|---|---|
| 185 | Academic capitalism suggests that race and ethnicity are also related to the ways in which student markets are segmented. | academ capit suggest race ethnic relat way student market segment |
| 208 | After examining potential shifts in enrollment, we did find some evidence to suggest that Native American, White, and Asian students who would otherwise have attended two nearby -year universities instead decided to enroll in TCC due to Tulsa Achieves. | after examin potenti shift enrol we find evid suggest nativ american white asian student would otherwis attend two nearbi year univers instead decid enrol tcc due tulsa achiev |
| 231 | Building on the existing studies on AP and DE enrollment as well as the broader literature on racial disparities in educational choices and outcomes, we focus on six sets of factors that theories and existing literature suggest may be correlated with racial disparities in AP and DE participation: student academic preparation prior to high school, family socioeconomic background, racial composition in a district, between-school income segregation and racial segregation, average characteristics of high schools in a district, and state-level AP and DE policies. | build exist studi ap de enrol well broader literatur racial dispar educ choic outcom we focu six set factor theori exist literatur suggest may correl racial dispar ap de particip student academ prepar prior high school famili socioeconom background racial composit district school incom segreg racial segreg averag characterist high school district state level ap de polici |


A few things to note about the above: The first sentence (#185) contains the subject "academic capitalism," which was not in the training set at all. The second sentence (#208) was an uncertain case for me during coding, so I coded it as not an example. However, it likely should be one. The third sentence (#231) includes "theories" and "literature," which were always coded as definite negative cases.

#### 5. Conclusion and Discussion

The results from hand-coding are promising to me as a confirmation of data speaking for themselves because the results show what I consider to be practically-significant percentages of articles (100%) and sentences (64.2% or 197 sentences in 22 articles) as positive cases. And although the distribution of cases does tend towards lower numbers per article, this comes from looking only at sentences containing 'suggest' and from coding 'this suggests' as negative. The true number of cases is therefore likely larger.

None of the NLP results met my standards, but the top result did reach a precision of .885 and a recall of .500. I would want both of those to be much higher before using an algorithm in practice. Naïve Bayes dominated the top results by sentence precision but typically had much worse recall than the neural networks at the top. Based on this, neural networks seem more promising to me, especially because there is still room to try other hidden layer structures. However, the performance of naïve Bayes could always change with alternative feature representations, other methods for reducing dimensionality, and more data.

The good news is that precision and recall may have been underestimated here because one of the false positives likely should have been coded as a positive, but even that would not be enough to bring the results up to a level I would be satisfied with. The other two false positives possibly show something interesting. In one, there is a sentence subject that is unlikely to be used in many papers except from people using an academic capitalism framework. In the other, I cannot identify any words or phrases in particular that may have caused the misprediction, but it _is_ a particularly long sentence with many words that are theoretically irrelevant but potentially misleading for these algorithms, as run.

The largest limitation for this study was time. Lack of time led to focusing only on a single year's articles and only on 'suggest,' limiting both the training and testing data in a population where each document frequently has its own subject-specific language. It also meant I was the only coder and was coding while also building my operationalization of data speaking for themselves. Flawed input easily leads to flawed performance. The next steps before running more NLP algorithms _should_ be to take the time to properly operationalize data speaking for themselves, to collect and code more data for training and testing, and ideally to include two other coders who are well-versed in quantitative methods. Involving others should also increase confidence in the manually coded results, making for a stronger argument even just using those. A related open question is how to handle coreferences like 'this suggests.' There are NLP algorithms that attempt to resolve these, which could lead to identifying even more cases. However, those algorithms would need detailed evaluation to make sure they do not introduce false positives by incorrectly resolving coreferences. This is especially important because of my focus on high precision (few false positives) rather than on optimizing recall.

Based on the number of features in the top-performing NLP version just from 307 sentences, I am also still concerned about dimensionality. For this, I may need to try alternative, denser feature representations or try other methods for reducing the dimensions of the feature set. And because my corpus has clusters (sentences within articles), it may be worth comparing frequency within an article to frequency among all articles when discarding features, or even identifying article-specific terminology that can be replaced with placeholders or removed entirely. Different rules for different n-gram sizes could also be considered, because larger n-grams are less likely to occur frequently and may contain important context. In the same vein, rules could differ depending on whether an n-gram contains a word of interest (e.g., suggests, results, data).
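One way the within-article versus across-article comparison might look is sketched below. This is only an illustration under stated assumptions: the function names (`ngrams`, `article_specific_ngrams`), the thresholds (`min_count`, `max_articles`), the whitespace tokenizer, and the toy corpus are all hypothetical and are not part of the pipeline used in this study.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def article_specific_ngrams(sentences, n=1, min_count=2, max_articles=1):
    """Find n-grams that recur but appear in at most `max_articles` articles.

    `sentences` is an iterable of (article_id, text) pairs.
    """
    total = Counter()             # occurrences across the whole corpus
    articles = defaultdict(set)   # n-gram -> ids of articles containing it
    for article_id, text in sentences:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        for gram in ngrams(tokens, n):
            total[gram] += 1
            articles[gram].add(article_id)
    return {
        gram for gram, count in total.items()
        if count >= min_count and len(articles[gram]) <= max_articles
    }

# Toy corpus, invented for illustration only
corpus = [
    ("a1", "these results suggest that capitalism shapes policy"),
    ("a1", "capitalism findings suggest a trend"),
    ("a2", "these results suggest that tutoring helps"),
    ("a2", "the data suggest that tutoring matters"),
]

specific = article_specific_ngrams(corpus, n=1)
print(sorted(specific))
```

On this toy corpus, only "capitalism" and "tutoring" are flagged, because each recurs within a single article, while words like "suggest" and "results" are spread across articles and are kept. Flagged terms could then be replaced with a placeholder or dropped, and the thresholds could be set differently for each n-gram size.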

Finally, this was not an exhaustive array of the potential options, whether in terms of algorithms or even within basic MLP neural networks and n-grams. Other network configurations with different numbers of nodes could certainly work better, and the possibilities there are seemingly endless, with only discrepant rules of thumb as a priori guidance [14]. Resolving this requires systematic evaluation and tuning beyond the current scope.

While the results are not quite what I would have liked, whether because of a rushed process or lower-than-desired precision, I am satisfied. The process of this study, which was essentially a pilot study, will inform future, more rigorous work. And based on these results, I am now more convinced that data speaking for themselves is endemic in our communication of educational research. Identification is the first necessary step in creating a solution.

#### References

[1] Dalton, C., & Thatcher, J. (2014). What does a critical data studies look like, and why do we care? Seven points for a critical approach to ‘big data’. Society and Space, 29.

[2] Gillborn, D., Warmington, P., & Demack, S. (2018). QuantCrit: education, policy, ‘Big Data’ and principles for a critical race theory of statistics. Race Ethnicity and Education, 21(2), 158-179.

[3] D'Ignazio, C., & Klein, L. F. (2020). Data feminism. MIT Press.

[4] Phillips, D. C. (2004). Two decades after: “After the wake: Postpositivistic educational thought”. Science & Education, 13(1), 67-84.

[5] Phillips, D. C., & Burbules, N. C. (2000). Postpositivism and educational research. Rowman & Littlefield.

[6] Lipton, Z. C., & Steinhardt, J. (2019). Research for practice: troubling trends in machine-learning scholarship. Communications of the ACM, 62(6), 45-53.

[7] Bleske-Rechek, A., Gunseor, M. M., & Maly, J. R. (2018). Does the language fit the evidence? Unwarranted causal language in psychological scientists’ scholarly work. The Behavior Therapist.

[8] Robinson, D. H., Levin, J. R., Thomas, G. D., Pituch, K. A., & Vaughn, S. (2007). The incidence of “causal” statements in teaching-and-learning research journals. American Educational Research Journal, 44(2), 400-413.

[9] Huggins‐Manley, A. C., Wright, E. A., DePue, M. K., & Oberheim, S. T. (2021). Unsupported causal inferences in the professional counseling literature base. Journal of Counseling & Development, 99(3), 243-251.

[10] Van Rossum, G., & Drake, F. L. (2009). Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.

[11] R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Vienna, Austria. https://www.R-project.org/.

[12] Wall, L., Christiansen, T., & Orwant, J. (2000). Programming Perl. O’Reilly Media, Inc.

[13] Goldberg, Y. (2016). A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57, 345-420.
