The annotation folder contains four subdirectories for the various corpora used in this study:
- British National Corpus (BNC)
- ACPROSE (Academic section of the BNC: all texts marked with `<acprose>` in the XML)
- PHILO (Philosophy of perception corpus)
- Stanford Encyclopedia of Philosophy (SEP)
The study focuses specifically on the lexical entries *see* and *aware*. Usage samples from the BNC, PHILO and SEP have been manually annotated to distinguish perceptual from non-perceptual usages, and the directory contains the normalised annotations for those three corpora. For comparison, the same folder also contains 1500 random sentences from ACPROSE, which were automatically annotated (see below).
In addition to the annotations, the directory contains the contextualised vectors for *see* and *aware*, obtained from BERT Base for each annotated sentence. The images below show the class distribution for each annotated corpus, after reducing the BERT vectors to 2D with PCA.
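For reference, the 2D reduction can be sketched with a plain SVD-based PCA (a minimal sketch assuming the vectors are stacked as rows of a NumPy array; `pca_2d` is a hypothetical helper, not part of this repository):

```python
import numpy as np

def pca_2d(vectors: np.ndarray) -> np.ndarray:
    """Project row vectors onto their first two principal components."""
    centred = vectors - vectors.mean(axis=0)           # PCA requires centring
    # SVD of the centred matrix; the right singular vectors are the principal axes
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T                          # shape: (n_samples, 2)

# Toy stand-in for 768-dimensional BERT Base vectors
rng = np.random.default_rng(0)
points = pca_2d(rng.normal(size=(10, 768)))
print(points.shape)  # (10, 2)
```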
Below, we also show the class distribution for the automatically annotated ACPROSE sentences, using the model trained on BNC as background (using SEP instead makes very little difference).
The training directory contains code to train a perceptual vs non-perceptual classifier, taking as input the BERT vectors extracted from the data. The classifier can only be trained on corpora for which we have annotations, i.e. BNC, PHILO and SEP. It is a simple MLP with two hidden layers and ReLU activations, with a softmax on the output layer.
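To make the architecture concrete, the forward pass of such a network can be sketched in plain NumPy (sizes and initialisation are illustrative only; the actual classifier lives in the training directory):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stabilised
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, params):
    """Two hidden ReLU layers, softmax over the two classes."""
    h1 = relu(x @ params["w1"] + params["b1"])
    h2 = relu(h1 @ params["w2"] + params["b2"])
    return softmax(h2 @ params["w3"] + params["b3"])

rng = np.random.default_rng(0)
dims = [768, 323, 323, 2]   # 768-d BERT input, hidden size as in --hidden=323, 2 classes
params = {}
for i, (m, n) in enumerate(zip(dims, dims[1:]), start=1):
    params[f"w{i}"] = rng.normal(scale=0.01, size=(m, n))
    params[f"b{i}"] = np.zeros(n)

probs = mlp_forward(rng.normal(size=(5, 768)), params)
print(probs.shape)  # one probability distribution per input row
```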
The training regime is as follows. For a given dataset, we first set aside 200 instances for hyperparameter optimisation. Tuning is performed with optimise.py, which relies on Bayesian Optimisation (BayesOpt). BayesOpt runs for 200 iterations before returning the best set of hyperparameters. For example:
```
python3 optimise.py BNC see
```
The best hyperparameters can be printed in a user-friendly way by running the following command on the JSON file generated in the relevant directory:
```
python3 read_json.py <path to json file>
```
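The held-out split for tuning can be sketched as follows (a hypothetical helper for illustration; the repository's scripts handle this internally):

```python
import random

def tuning_split(instances, n_tune=200, seed=42):
    """Hold out n_tune instances for hyperparameter optimisation."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)   # deterministic shuffle
    return shuffled[:n_tune], shuffled[n_tune:]

# With the 1500 ACPROSE-sized example: 200 for tuning, the rest for cross-validation
tune, rest = tuning_split(range(1500))
print(len(tune), len(rest))  # 200 1300
```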
Once the system is tuned on a dataset, we perform 5-fold cross-validation on the rest of the data. For example:
```
python3 classify.py --file=BNC/BNC_see_kfold_features.txt --lr=0.01 --batch=46 --epochs=50 --hidden=323 --wdecay=0.01
```
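The 5-fold splitting itself can be sketched without any ML library (illustrative only, not the repository's exact code):

```python
import numpy as np

def kfold_indices(n, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs; the test sets partition all n instances."""
    idx = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(idx, k):
        yield np.setdiff1d(idx, fold), fold   # train = everything outside the fold

folds = list(kfold_indices(1300))
print(len(folds))  # 5
```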
There is CUDA support for running on GPU.
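Device selection presumably follows the usual PyTorch pattern (a hedged sketch; it falls back to CPU when torch or a GPU is unavailable):

```python
try:
    import torch
    # Use the GPU when CUDA is available, otherwise fall back to CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"   # torch not installed: CPU semantics only

print(device)
```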
Results for *see* are as follows (accuracy averaged over the 5 folds):

| BNC | SEP | PHILO |
|-----|-----|-------|
| 90% | 98% | 96% |
Results for *aware* are as follows (accuracy averaged over the 5 folds):

| BNC | SEP | PHILO |
|-----|-----|-------|
| 90% | 98% | 92% |
Next, we check how a model trained on one corpus fares on the other corpora. Results for *see* (rows: training corpus; columns: test corpus):

| Trained on | BNC | SEP | PHILO |
|------------|-----|-----|-------|
| Baseline   | 71% | 59% | 60%   |
| BNC        | -   | 96% | 83%   |
| SEP        | 87% | -   | 94%   |
| PHILO      | 81% | 97% | -     |
Results for *aware* are as follows:

| Trained on | BNC | SEP | PHILO |
|------------|-----|-----|-------|
| Baseline   | 79% | 60% | 91%   |
| BNC        | -   | 85% | 88%   |
| SEP        | 89% | -   | 92%   |
| PHILO      | 75% | 80% | -     |
We also inspect the most frequent n-grams around each target word in the four corpora (with n=3); the number after each n-gram is its frequency.
**see (before)**

- BNC: i saw (11), she saw (8), he saw (8), i can see (4), i do n't see (4), i ca n't see (4), you want to see (4), to come and see (3), i 've just seen (3), i 've never seen (3)
- ACPROSE: as we have seen (40), as we shall see (24), we have seen (13), we have already seen (11), it can be seen (9), as can be seen (7), can also be seen (7), , we can see (7), is difficult to see (5), remains to be seen (5)
- SEP: as we have seen (43), is hard to see (19), to see (19), as we shall see (14), we have seen (10), as we will see (8), we see (8), as we saw (8), in order to see (7), we have already seen (7)
- PHILO: as we have seen (78), as we shall see (40), we have already seen (33), we have seen (26), i do not see (25), we saw (21), as i can see (21), is difficult to see (20), that i am seeing (20), , as we saw (20)

**see (after)**

- BNC: see . (13), see me . (7), see you . (7), see it . (4), see ? (3), see him . (3), see it ? (3), see it as a (3), see them . (3), see (3)
- ACPROSE: see (19), see , for example (8), see below (7), see chapter 5 (5), seen as (4), see above , p. (4), see , however , (3), seen to be a (3), see above (3), see chapter 9 (3)
- SEP: see the entry on (35), see , e . (15), see e . (11), see , for example (11), see section 2 . (9), see section 3 . (6), see other internet resources (5), see the entries on (5), see section 5 . (5), see below ) , (5)
- PHILO: seen (28), see (23), see the same flash (14), sees that a is (12), see that it is (11), seeing (10), seen , it is (8), seen in virtue of (7), see it (6), seen that it is (6)

**aware (before)**

- BNC: need to be aware (14), she was aware (12), he was aware (8), i am aware (8), we are not aware (6), should be made aware (6), you should be aware (5), be aware (4), to be fully aware (4), being aware (4)
- ACPROSE: need to be aware (19), important to be aware (12), we are not aware (8), needs to be aware (7), may not be aware (6), we are aware (6), to be more aware (5), necessary to be aware (5), he was aware (5), being aware (5)
- SEP: one is directly aware (23), we are directly aware (17), we are not aware (15), that we are aware (12), we are immediately aware (11), that he was aware (6), that i am aware (6), can be directly aware (6), i am directly aware (5), which we are aware (5)
- PHILO: we are directly aware (13), that we are aware (11), hallucinating subject is aware (9), what we are aware (9), we are immediately aware (8), are not directly aware (6), , we are aware (6), that i am aware (5), i am immediately aware (5), we are not aware (4)

**aware (after)**

- BNC: aware of it. (19), aware of the need (16), aware of the fact (12), aware of that. (12), aware of the dangers (10), aware of it , (10), aware of the problem (9), aware of this and (8), aware of this. (8), aware of the problems (8)
- ACPROSE: aware of the need (23), aware of (22), aware of the fact (18), aware of the (16), aware of the nature (9), aware of the problem (9), aware of the importance (8), aware of the presence (7), aware (7), aware of the limitations (7)
- SEP: aware of the fact (30), aware of . (20), aware of it . (13), aware of it , (13), aware of , and (11), aware of them , (10), aware of one 's (9), aware of this , (8), aware of the limitations (8), aware of the problem (8)
- PHILO: aware of an object (11), aware of (9), aware of the same (7), aware of material things (6), aware of them (5), aware of something that (5), aware of something (5), aware of them , (5), aware of it (5), aware of a non-normal (5)
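Counts like those above can be reproduced with a simple window count around the target token (a sketch; the tokenisation and window conventions here are assumptions, not necessarily the exact procedure used for the tables):

```python
from collections import Counter

def context_ngrams(sentences, target, n=3, before=True):
    """Count the n tokens immediately before (or after) each occurrence of target,
    target token included, over whitespace-tokenised sentences."""
    counts = Counter()
    for sent in sentences:
        toks = sent.lower().split()
        for i, tok in enumerate(toks):
            if tok == target:
                window = toks[max(0, i - n):i + 1] if before else toks[i:i + n + 1]
                counts[" ".join(window)] += 1
    return counts

# Toy example with pre-tokenised sentences
sents = ["As we have seen , this holds .", "We have seen it ."]
print(context_ngrams(sents, "seen").most_common(2))
```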