This repository contains the code for our solution to identifying propaganda techniques in text.
The source code was prepared within the study "Where Does It End? Long Named Entity Recognition for Propaganda Detection and Beyond", presented at the Workshop on NLP applied to Misinformation, organised at the SEPLN 2023 Conference held in Jaén, Spain, on the 26th of September, 2023. Please consult the paper for details of the work.
The research was done within the HOMADOS project at the Institute of Computer Science, Polish Academy of Sciences.
Propaganda detection is usually defined and solved as a Named Entity Recognition (NER) task. However, the instances of propaganda techniques (text spans) are usually much longer than typical NER entities (e.g. person names) and can include dozens of words. In this work, we investigate how the extensive span lengths affect the recognition of propaganda, showing that the task difficulty indeed increases with the span length. We systematically evaluate several common approaches to the task, measuring how well they recover the length distribution of true spans. We also propose a new solution, specifically aimed to perform NER for such long entities.
The code requires Python 3 and the following packages:
fastprogress==1.0.2
keras==2.6.0
scikit-learn==1.0.1
spacy==3.0.1
scipy==1.8.0
tensorflow==2.6.0
tensorflow-addons==0.15.0
tensorflow-hub==0.12.0
tensorflow-text==2.6.0
Run the following command to install the dependencies:
pip3 install -r requirements.txt
Data processing scripts are invoked from within the model running scripts. Please unzip PICO_dataset.zip for the ebmnlp scripts or PTC_dataset.zip for the other scripts.
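If you prefer to unpack the archives programmatically, a minimal sketch using Python's standard zipfile module is shown below; the archive names follow the repository layout, while the target directory is an assumption:

```python
# Minimal sketch: extract a dataset archive with the standard library.
# The target directory ("." here) is an assumption -- adjust to match
# the paths expected by the model running scripts.
import zipfile

def unzip(archive_path, target_dir="."):
    """Extract all files from archive_path into target_dir."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target_dir)

# e.g. unzip("PTC_dataset.zip") before running the PTC scripts,
# or unzip("PICO_dataset.zip") before the ebmnlp scripts.
```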
Download and unzip the BERT base uncased model from here. You can also try different sizes of BERT available on TensorFlow Hub.
Run any of these scripts inside your favourite Python IDE. Remember to replace the file paths with your own:
- bert_raw_tags.py (BERT with a linear classification layer on top)
- bert_bio_tags.py (as above, with labels encoded according to the Begin-Inside-Outside scheme)
- bert_iobes_tags.py (as above, with labels encoded according to the Inside-Outside-Begin-End-Single scheme)
- bert_crf_raw_tags.py (BERT with a CRF layer on top)
- bert_crf_bio_tags.py (as above, with labels encoded according to the Begin-Inside-Outside scheme)
- bert_crf_iobes_tags.py (as above, with labels encoded according to the Inside-Outside-Begin-End-Single scheme)
- bert_LoNER_systematic_45.py (BERT with an adaptive convolutional layer)
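The tagging schemes the scripts differ in can be illustrated with a short sketch. The converter functions and the example technique label below are illustrative, not taken from the repository's own preprocessing code:

```python
# Hypothetical illustration of the label encodings used by the scripts above.
# Input: raw per-token tags (a technique name, or "O" for outside any span).

def to_bio(raw_tags):
    """Encode raw tags with the Begin-Inside-Outside (BIO) scheme."""
    bio, prev = [], "O"
    for tag in raw_tags:
        if tag == "O":
            bio.append("O")
        elif tag != prev:
            bio.append("B-" + tag)  # first token of a span
        else:
            bio.append("I-" + tag)  # continuation of a span
        prev = tag
    return bio

def to_iobes(raw_tags):
    """Encode raw tags with the Inside-Outside-Begin-End-Single (IOBES) scheme."""
    iobes, n = [], len(raw_tags)
    for i, tag in enumerate(raw_tags):
        if tag == "O":
            iobes.append("O")
            continue
        starts = i == 0 or raw_tags[i - 1] != tag
        ends = i == n - 1 or raw_tags[i + 1] != tag
        if starts and ends:
            iobes.append("S-" + tag)  # single-token span
        elif starts:
            iobes.append("B-" + tag)
        elif ends:
            iobes.append("E-" + tag)  # explicit span end, absent from BIO
        else:
            iobes.append("I-" + tag)
    return iobes

tags = ["O", "Loaded_Language", "Loaded_Language", "Loaded_Language", "O"]
print(to_bio(tags))    # ['O', 'B-Loaded_Language', 'I-Loaded_Language', 'I-Loaded_Language', 'O']
print(to_iobes(tags))  # ['O', 'B-Loaded_Language', 'I-Loaded_Language', 'E-Loaded_Language', 'O']
```

The explicit End and Single labels give IOBES a direct signal for where a span stops, which matters for the long spans this work focuses on.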
LoNER code is released under the GNU GPL 3.0 licence.