The HAREM collections are popular Portuguese datasets commonly used for the Named Entity Recognition (NER) task. In their original XML format, some phrases can have multiple entity identification solutions, and entities can be assigned more than one class (<ALT> tags and | characters indicate the multiple solutions).
This annotation scheme is well suited to representing vagueness and indeterminacy. However, it complicates modeling NER as a sequence tagging problem, especially during evaluation, because a single true answer is required.
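For illustration, here is a simplified fragment in the spirit of the original annotation (the collections mark entities with <EM> elements; this exact sentence is invented):

<ALT><EM CATEG="PESSOA">José de Alencar</EM>|<EM CATEG="PESSOA">José</EM> de <EM CATEG="LOCAL">Alencar</EM></ALT>
... <EM CATEG="LOCAL|ORGANIZACAO">Porto</EM> ...

The <ALT> tag offers two segmentations of the same span, while the | inside CATEG marks an entity that is vague between two classes.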
The script xml_to_json.py converts the XML file to JSON format and selects a single solution for all <ALT> tags and vague entities (see the sketch after the list below):
- For each Entity with multiple classes, it selects the first valid class.
- For each <ALT> tag, it selects the solution with the highest number of entities.
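A minimal sketch of these two rules, assuming each entity carries a CATEG-style string and each <ALT> alternative has already been parsed into a list of entities (names are illustrative, not the actual internals of xml_to_json.py):

def select_class(categ, scenario_classes):
    # "LOCAL|ORGANIZACAO" -> first class that is valid in the chosen scenario
    for candidate in categ.split("|"):
        if candidate in scenario_classes:
            return candidate
    return None

def select_alt(alternatives):
    # each alternative is the list of entities of one <ALT> solution;
    # keep the alternative that yields the most entities
    return max(alternatives, key=len)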
The script has been tested with the following XML files:
- FirstHAREM: CDPrimeiroHAREMprimeiroevento.xml
- MiniHAREM: CDPrimeiroHAREMMiniHAREM.xml
Recent works often train and report performance for two scenarios: Total and Selective. The Total scenario corresponds to the full dataset with 10 entity classes:
- PESSOA (Person)
- ORGANIZACAO (Organization)
- LOCAL (Location)
- TEMPO (Time)
- VALOR (Value)
- ABSTRACCAO (Abstraction)
- ACONTECIMENTO (Event)
- COISA (Thing)
- OBRA (Title)
- OUTRO (Other)
The Selective scenario considers only the first 5 classes in the list above.
The script is compatible with both scenarios and selects entities according to the chosen scenario.
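In code, the relationship between the two scenarios can be pictured as follows (constant names are illustrative, not the script's actual identifiers):

TOTAL_CLASSES = ["PESSOA", "ORGANIZACAO", "LOCAL", "TEMPO", "VALOR",
                 "ABSTRACCAO", "ACONTECIMENTO", "COISA", "OBRA", "OUTRO"]
SELECTIVE_CLASSES = TOTAL_CLASSES[:5]  # the first 5 classes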
The script xml_to_json.py can be used to convert the First, [Mini](https://www.linguateca.pt/aval_conjunta/HAREM/CDPrimeiroHAREMMiniHAREM.xml), and Second versions of the HAREM dataset.
The Second version of the HAREM dataset changed its XML structure relative to the previous two versions: it now has <P> tags that separate the sentences within each document. Thus, the converted Second HAREM data generated by the script also has a different structure.
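Schematically, a Second HAREM document looks like this (fragment reconstructed from the converted example below; attribute names may differ in the actual file):

<DOC DOCID="H2-dftre765">
<P>Fatores Demográficos e Econômicos Subjacentes</P>
...
<P>A Reforma Religiosa e Política na <EM ID="H2-dftre765-28" CATEG="LOCAL">Inglaterra</EM></P>
...
</DOC>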
The structure of the converted data from the First and Mini HAREM is the following:
[{
  "doc_id": "HAREM-871-07800",
  "doc_text": "Abraço Página Principal ...",
  "entities": [
    {
      "entity_id": "0",
      "text": "Abraço",
      "label": "ORGANIZACAO",
      "start_offset": 1,
      "end_offset": 7
    }, ...]
}, ...]
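A short usage sketch for this layout (the file name assumes the default output naming described below; whether the offsets are 0- or 1-based should be checked against the converted data):

import json

with open("CDPrimeiroHAREMprimeiroevento-selective.json", encoding="utf-8") as f:
    docs = json.load(f)

for doc in docs:
    for ent in doc["entities"]:
        # recover the surface form from the character offsets into doc_text
        print(ent["label"], doc["doc_text"][ent["start_offset"]:ent["end_offset"]])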
As mentioned before, the Second HAREM now has all text inside sentences; the converted structure is as follows:
[...
{
  "doc_id": "H2-dftre765",
  "doc_ps": [
    {
      "p_id": 0,
      "p_text": "Fatores Demográficos e Econômicos Subjacentes",
      "entities": []
    }, ...
    {
      "p_id": 2,
      "p_text": "A Reforma Religiosa e Política na Inglaterra",
      "entities": [
        {
          "entity_id": "H2-dftre765-28",
          "text": "Inglaterra",
          "label": "LOCAL",
          "start_offset": 34,
          "end_offset": 44
        }
      ]
    }, ...]
}, ...]
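The equivalent sketch for the Second HAREM layout, where entities live under each paragraph and the offsets index into p_text (e.g., p_text[34:44] == "Inglaterra" in the fragment above):

import json

with open("CDSegundoHAREM-selective.json", encoding="utf-8") as f:  # assumes default output naming
    docs = json.load(f)

for doc in docs:
    for p in doc["doc_ps"]:
        for ent in p["entities"]:
            print(ent["label"], p["p_text"][ent["start_offset"]:ent["end_offset"]])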
The scripts are tested with Python 3.6.
Install the requirements:
$ pip install -r requirements.txt
Run the script:
$ xml_to_json.py path_to_xml_file.xml --scenario [total|selective]
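For example, to convert the First HAREM file in the Selective scenario:
$ xml_to_json.py CDPrimeiroHAREMprimeiroevento.xml --scenario selective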
As mentioned in the support section, the script can be used to convert all three current versions of the HAREM dataset. By default, it converts the First and Mini HAREM format; to convert the Second HAREM, specify the --version flag with the value "second", as follows:
$ xml_to_json.py path_to_xml_file.xml --scenario [total|selective] --version [first_and_mini|second]
By default, the converted file is saved with the same name plus the suffix -{scenario}.json (e.g., CDSegundoHAREM-selective.json). You can also save each document in a separate file by using the --saving_strategy flag with the "doc_files" value, as follows:
$ xml_to_json.py path_to_xml_file.xml --scenario [total|selective] --saving_strategy [one_file|doc_files]
Using the "doc_files" strategy writes each document to a file named HAREMdoc_{doc_id}.json in the same folder as the input file.
To run the tests, first install the test requirements and then run them:
$ pip install -r requirements_test.txt
$ HAREM_DATA_DIR=test_files/ python tests.py
Note that we use uv as our dependency management tool. To install the necessary libraries, simply run:
$ uv sync
We start by downloading the Second version of the HAREM dataset:
$ curl https://www.linguateca.pt/aval_conjunta/HAREM/CDSegundoHAREM.xml -o CDSegundoHAREM.xml
Then we use this fork's CLI to output the Selective and Total versions in JSON format:
$ uv run xml_to_json.py './CDSegundoHAREM.xml' --scenario 'selective' --version second
$ uv run xml_to_json.py './CDSegundoHAREM.xml' --scenario 'total' --version second
At this point, we only need to run:
$ uv run main.py
You may now check the dataset at: marquesafonso/SegundoHAREM
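If the dataset is published on the Hugging Face Hub under that id, it can be loaded with the datasets library (a sketch under that assumption):

from datasets import load_dataset

# assumes the dataset is hosted on the Hugging Face Hub under this repo id
ds = load_dataset("marquesafonso/SegundoHAREM")
print(ds)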