The HAREM collections are popular Portuguese datasets commonly used for the Named Entity Recognition (NER) task. In their original XML format, some phrases can have multiple entity identification solutions, and entities can be assigned more than one class (<ALT> tags and | characters indicate the multiple solutions).
This annotation scheme is well suited to representing vagueness and indeterminacy. However, it complicates modeling NER as a sequence tagging problem, especially during evaluation, because a single true answer is required.
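For illustration, here is a simplified fragment in the spirit of the original annotation (the collections mark entities with <EM> elements; this exact sentence is invented):

<ALT><EM CATEG="PESSOA">José de Alencar</EM>|<EM CATEG="PESSOA">José</EM> de <EM CATEG="LOCAL">Alencar</EM></ALT>
... <EM CATEG="LOCAL|ORGANIZACAO">Porto</EM> ...

The <ALT> tag offers two segmentations of the same span, while the | inside CATEG marks an entity that is vague between two classes.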
The script xml_to_json.py converts the XML file to JSON format and selects a single solution for all <ALT> tags and vague entities (see the sketch after the list below):
- For each Entity with multiple classes, it selects the first valid class.
- For each <ALT> tag, it selects the solution with the highest number of entities.
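A minimal sketch of these two rules, assuming each entity carries a CATEG-style string and each <ALT> alternative has already been parsed into a list of entities (names are illustrative, not the actual internals of xml_to_json.py):

def select_class(categ, scenario_classes):
    # "LOCAL|ORGANIZACAO" -> first class that is valid in the chosen scenario
    for candidate in categ.split("|"):
        if candidate in scenario_classes:
            return candidate
    return None

def select_alt(alternatives):
    # each alternative is the list of entities of one <ALT> solution;
    # keep the alternative that yields the most entities
    return max(alternatives, key=len)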
The script has been tested with the following XML files:
- FirstHAREM: CDPrimeiroHAREMprimeiroevento.xml
- MiniHAREM: CDPrimeiroHAREMMiniHAREM.xml
Recent works often train and report performance for two scenarios: Total and Selective. The Total scenario corresponds to the full dataset with 10 entity classes:
- PESSOA (Person)
- ORGANIZACAO (Organization)
- LOCAL (Location)
- TEMPO (Time)
- VALOR (Value)
- ABSTRACCAO (Abstraction)
- ACONTECIMENTO (Event)
- COISA (Thing)
- OBRA (Title)
- OUTRO (Other)
The Selective scenario considers only the first 5 classes in the list above.
The script is compatible with both scenarios and selects entities according to the chosen scenario.
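In code, the relationship between the two scenarios can be pictured as follows (constant names are illustrative, not the script's actual identifiers):

TOTAL_CLASSES = ["PESSOA", "ORGANIZACAO", "LOCAL", "TEMPO", "VALOR",
                 "ABSTRACCAO", "ACONTECIMENTO", "COISA", "OBRA", "OUTRO"]
SELECTIVE_CLASSES = TOTAL_CLASSES[:5]  # the first 5 classes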
The script xml_to_json.py can be used to convert the First, [Mini](https://www.linguateca.pt/aval_conjunta/HAREM/CDPrimeiroHAREMMiniHAREM.xml), and Second versions of the HAREM dataset.
The Second version of the HAREM dataset changed its XML structure relative to the previous two versions: it now has <P> tags that separate the sentences within each document. Thus, the converted Second HAREM data generated by the script also has a different structure.
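Schematically, a Second HAREM document looks like this (fragment reconstructed from the converted example below; attribute names may differ in the actual file):

<DOC DOCID="H2-dftre765">
<P>Fatores Demográficos e Econômicos Subjacentes</P>
...
<P>A Reforma Religiosa e Política na <EM ID="H2-dftre765-28" CATEG="LOCAL">Inglaterra</EM></P>
...
</DOC>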
The structure of the converted data from the First and Mini HAREM is the following:
[{
  "doc_id": "HAREM-871-07800",
  "doc_text": "Abraço Página Principal ...",
  "entities": [
    {
      "entity_id": "0",
      "text": "Abraço",
      "label": "ORGANIZACAO",
      "start_offset": 1,
      "end_offset": 7
    }, ...]
}, ...]
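A short usage sketch for this layout (the file name assumes the default output naming described below; whether the offsets are 0- or 1-based should be checked against the converted data):

import json

with open("CDPrimeiroHAREMprimeiroevento-selective.json", encoding="utf-8") as f:
    docs = json.load(f)

for doc in docs:
    for ent in doc["entities"]:
        # recover the surface form from the character offsets into doc_text
        print(ent["label"], doc["doc_text"][ent["start_offset"]:ent["end_offset"]])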
As mentioned before, the Second HAREM now has all text inside sentences; the converted structure is as follows:
[...
{
  "doc_id": "H2-dftre765",
  "doc_ps": [
    {
      "p_id": 0,
      "p_text": "Fatores Demográficos e Econômicos Subjacentes",
      "entities": []
    }, ...
    {
      "p_id": 2,
      "p_text": "A Reforma Religiosa e Política na Inglaterra",
      "entities": [
        {
          "entity_id": "H2-dftre765-28",
          "text": "Inglaterra",
          "label": "LOCAL",
          "start_offset": 34,
          "end_offset": 44
        }
      ]
    }, ...]
}, ...]
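The equivalent sketch for the Second HAREM layout, where entities live under each paragraph and the offsets index into p_text (e.g., p_text[34:44] == "Inglaterra" in the fragment above):

import json

with open("CDSegundoHAREM-selective.json", encoding="utf-8") as f:  # assumes default output naming
    docs = json.load(f)

for doc in docs:
    for p in doc["doc_ps"]:
        for ent in p["entities"]:
            print(ent["label"], p["p_text"][ent["start_offset"]:ent["end_offset"]])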
The scripts are tested with Python 3.6.
Install the requirements:
$ pip install -r requirements.txt
Run the script:
$ xml_to_json.py path_to_xml_file.xml --scenario [total|selective]
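For example, to convert the First HAREM file in the Selective scenario:
$ xml_to_json.py CDPrimeiroHAREMprimeiroevento.xml --scenario selective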
As mentioned in the support section, the script can be used to convert all three current versions of the HAREM dataset. By default, it converts the First and Mini HAREM format; to convert the Second HAREM, specify the --version flag with the value "second", as follows:
$ xml_to_json.py path_to_xml_file.xml --scenario [total|selective] --version [first_and_mini|second]
By default, the converted file is saved with the same name plus the suffix -{scenario}.json (e.g., CDSegundoHAREM-selective.json). You can also save each document in a separate file by using the --saving_strategy flag with the "doc_files" value, as follows:
$ xml_to_json.py path_to_xml_file.xml --scenario [total|selective] --saving_strategy [one_file|doc_files]
Using the "doc_files" strategy writes each document to a file named HAREMdoc_{doc_id}.json in the same folder as the input file.
To run the tests, first install the test requirements and then run them:
$ pip install -r requirements_test.txt
$ HAREM_DATA_DIR=test_files/ python tests.py
Note that we use uv as our dependency management tool. To install the necessary libraries, simply run:
$ uv sync
We start by downloading the Second version of the HAREM dataset:
$ curl https://www.linguateca.pt/aval_conjunta/HAREM/CDSegundoHAREM.xml -o CDSegundoHAREM.xml
Then we use this fork's CLI to output the Selective and Total versions in JSON format:
$ uv run xml_to_json.py './CDSegundoHAREM.xml' --scenario 'selective' --version second
$ uv run xml_to_json.py './CDSegundoHAREM.xml' --scenario 'total' --version second
At this point, we only need to run:
$ uv run main.py
You may now check the dataset at: marquesafonso/SegundoHAREM
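If the dataset is published on the Hugging Face Hub under that id, it can be loaded with the datasets library (a sketch under that assumption):

from datasets import load_dataset

# assumes the dataset is hosted on the Hugging Face Hub under this repo id
ds = load_dataset("marquesafonso/SegundoHAREM")
print(ds)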