Calculate predictors of adjective order and test them in large dependency treebanks.
In this study, we extracted data from the following external data sources, not included here:
- A parsed Common Crawl corpus in CoNLLU format, introduced in Futrell et al. (2019).
- Universal Dependencies 2.4, in particular the English Web Treebank
- GloVe embeddings from glove.42B.300d.zip (only needed if performing your own clustering)
English adjective and noun wordforms from CELEX are provided in data/english_adjectives.txt and data/english_nouns.txt.
The files data/subjectivity* are from Scontras et al. (2017): these are subjectivity ratings collected in previous experiments.
The subjectivity ratings collected for this study are at experiments/1-UD-subjectivity/results/adjective-subjectivity.csv.
The tarball data/clust_pairs.tar.gz contains pre-clustered pairs of adjectives and nouns from the Common Crawl corpus.
The directory corpus_extraction has scripts for pulling relevant data out of CoNLL-U-formatted dependency treebank files. Supposing you have a set of such files at location $CORPORA, run the following in the directory corpus_extraction to get all the adjective--adjective--noun triples:
cat $CORPORA | python extract_conllu.py aan > aan.csv
sh csvcount.sh aan.csv > aan_counts.csv
and run the following to just get all the adjectives:
cat $CORPORA | python extract_conllu.py a > a.csv
sh csvcount.sh a.csv > a_counts.csv
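The extraction step above can be sketched in plain Python. This is a hypothetical stand-in for extract_conllu.py (the function and example below are illustrations, not the repo's actual code), assuming standard ten-column CoNLL-U input: it finds nouns with exactly two `amod` adjective dependents and emits the adjectives in surface order.

```python
# Minimal CoNLL-U sketch (hypothetical; not the repo's extract_conllu.py):
# find nouns with two "amod" adjective dependents and emit
# (adj1, adj2, noun) triples in surface order.
def extract_aan(conllu_text):
    triples = []
    for block in conllu_text.strip().split("\n\n"):  # one block per sentence
        rows = [line.split("\t") for line in block.splitlines()
                if line and not line.startswith("#")]
        # CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, ...
        words = {r[0]: r for r in rows if r[0].isdigit()}
        for wid, r in words.items():
            if r[3] != "NOUN":
                continue
            amods = [w for w in words.values()
                     if w[6] == wid and w[7] == "amod" and w[3] == "ADJ"]
            if len(amods) == 2:
                a1, a2 = sorted(amods, key=lambda w: int(w[0]))
                triples.append((a1[1].lower(), a2[1].lower(), r[1].lower()))
    return triples

example = ("1\tbig\tbig\tADJ\tJJ\t_\t3\tamod\t_\t_\n"
           "2\tred\tred\tADJ\tJJ\t_\t3\tamod\t_\t_\n"
           "3\tdog\tdog\tNOUN\tNN\t_\t0\troot\t_\t_\n")
print(extract_aan(example))  # [('big', 'red', 'dog')]
```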
Adjectives and nouns are clustered with measures/cluster.py with the following arguments:
- -v $GLOVE -- a file containing space-delimited wordforms and their vectors
- -p $PAIRS -- a file containing comma-delimited count,adj,noun rows
- -k ($ADJ_K,$NOUN_K) -- [optional] a tuple listing what k to use for adjectives and nouns; default is (300,1000)
- -c $PCA -- [optional] the amount of information to preserve when running PCA; default is 1.0 (no reduction)
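A toy illustration of the clustering step (a hypothetical stand-in for measures/cluster.py, which presumably uses an optimized implementation and the defaults above): plain k-means over word vectors, with the tiny two-dimensional vectors below standing in for GloVe embeddings.

```python
import random

# Toy k-means sketch (an assumption about the clustering step, not the
# repo's cluster.py): assign each wordform's vector to one of k centroids.
def kmeans(vectors, k, iters=20, seed=0):
    rng = random.Random(seed)
    words = sorted(vectors)
    centroids = [vectors[w] for w in rng.sample(words, k)]
    assign = {}
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        for w in words:
            assign[w] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(vectors[w], centroids[c])))
        # update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [vectors[w] for w in words if assign[w] == c]
            if members:
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return assign

# hypothetical mini "embeddings": two size adjectives, two color adjectives
vecs = {"big": [1.0, 0.1], "large": [0.9, 0.0],
        "red": [0.0, 1.0], "green": [0.1, 0.9]}
clusters = kmeans(vecs, k=2)
print(clusters["big"] == clusters["large"])  # True
print(clusters["big"] == clusters["red"])    # False
```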
Output is a comma-delimited clust_pairs.csv with the following columns:
- count -- the count of this pair in $PAIRS
- awf -- adjective wordform
- nwf -- noun wordform
- acl -- adjective cluster ID
- ncl -- noun cluster ID
Predictors are calculated using measures/score_adj_pairs.py with the following arguments:
- -t $TRIPLES -- a comma-delimited file with at least the columns [count,adj1_word,adj2_word,noun_word]
- -s $SUBJ -- a comma-delimited file containing at least the columns [predicate,response]
Output is a comma-delimited scores.csv with the following columns:
- id -- the ID of a triple in $TRIPLES
- idx -- 0 or 1, depending on the position of this adjective in $TRIPLES
- count -- the count of this triple in $TRIPLES
- awf -- adjective wordform
- nwf -- noun wordform
- acl -- adjective cluster ID
- ncl -- noun cluster ID
- various predictors, named according to the following scheme:
- ic_ -- integration cost
- ig_ -- information gain
- p_ -- log probability
- pmi_ -- pointwise mutual information
- subj_ -- subjectivity rating
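The count-based predictors in this scheme can be illustrated with a short sketch. This is an assumption about how p_ (log probability) and pmi_ (pointwise mutual information) could be estimated from count,adj,noun data; the function names are hypothetical and the real computation lives in measures/score_adj_pairs.py.

```python
import math
from collections import Counter

# Hedged sketch of two count-based predictors (hypothetical helpers):
# p_   = log P(adj)
# pmi_ = log P(adj, noun) - log P(adj) - log P(noun)
# both estimated from (count, adj, noun) rows.
def predictors(pairs):
    total = sum(c for c, _, _ in pairs)
    adj, noun, joint = Counter(), Counter(), Counter()
    for c, a, n in pairs:
        adj[a] += c
        noun[n] += c
        joint[(a, n)] += c
    def log_p(a):
        return math.log(adj[a] / total)
    def pmi(a, n):
        return math.log(joint[(a, n)] * total / (adj[a] * noun[n]))
    return log_p, pmi

pairs = [(8, "big", "dog"), (2, "big", "idea"), (2, "red", "dog")]
log_p, pmi = predictors(pairs)
print(log_p("big"))      # log(10/12)
print(pmi("red", "dog")) # log(2*12 / (2*10)) = log(1.2)
```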
Predictions from the predictors calculated and reported in scores.csv can be generated with python measures/predict.py scores.csv. Output is deltas.csv, a comma-delimited file with the following columns:
- id -- the ID of a triple in $TRIPLES
- predictor -- the predictor being run
- delta -- the absolute difference between the predictor scores for the two adjectives
- result -- whether the adjective with the smaller predictor score comes first (0) or second (1)
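The delta/result columns above can be sketched as follows (the helper name and tie-breaking choice are assumptions, not the repo's actual logic):

```python
# Hypothetical sketch of one deltas.csv row, from the two per-adjective
# scores of a triple: delta is the absolute score difference, and result
# records whether the smaller-scored adjective is first (0) or second (1).
def delta_row(score_first, score_second):
    if score_first is None or score_second is None:
        return None  # predictors with None values are dropped
    delta = abs(score_first - score_second)
    result = 0 if score_first <= score_second else 1
    return delta, result

print(delta_row(1.5, 3.0))   # (1.5, 0): smaller score comes first
print(delta_row(3.0, 1.5))   # (1.5, 1): smaller score comes second
print(delta_row(None, 2.0))  # None: dropped from deltas.csv
```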
Note that predictors with None values in scores.csv will not be included in deltas.csv. This can happen due to out-of-vocabulary words, adjectives not rated for subjectivity, and so on.
Plots can be generated by running plots/plot_logistic.py deltas.csv. A single image (predictors.png) will be generated, with one plot per predictor showing predictive accuracy and area under the curve (AUC) for a logistic regression giving the predicted probability (y-axis) as a function of the difference between the two adjectives' scores (x-axis). Note that if accuracy is below 0.5 for a given predictor, the polarity of the predictions -- and of the resulting logistic regression -- is flipped.
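The AUC part of this evaluation can be sketched without fitting a regression at all, using the fact that AUC for a score against binary labels equals the Mann-Whitney rank statistic. This is a simplified illustration, not what plots/plot_logistic.py does (which fits an actual logistic regression):

```python
# Simplified AUC sketch: the fraction of (positive, negative) pairs
# where the positive example gets the higher score, counting ties as 0.5.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))  # 1.0: perfect separation
print(auc([1.0, 2.0, 3.0, 4.0], [1, 1, 0, 0]))  # 0.0: fully reversed
```

A reversed predictor (AUC below 0.5) corresponds to the polarity flip described above.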
If you are here for the code used in Futrell (2019), check out the previous version of this repo at #464e24d.