semeval22_structured_sentiment/data at master · jerbarnes/semeval22_structured_sentiment

Requirements

lxml==4.3.2
tqdm==4.56.0
stanza==1.1.1

Step 1:

Go to the MPQA 2.0 website, agree to the license and download the corpus. Put the zipped archive in /mpqa. Finally, run the extraction script.

bash process_mpqa.sh

Go to the Darmstadt Service Review Corpus website, agree to the license and download the corpus. Put the zipped archive in /darmstadt_unis and finally, run the extraction script.

bash process_darmstadt.sh

Subtask 1: Monolingual structured sentiment

In this track, models are trained and tested on the same language. It uses the following datasets:

norec (Norwegian professional reviews in multiple domains)
multibooked_ca (Catalan hotel reviews)
multibooked_eu (Basque hotel reviews)
opener_en (English hotel reviews)
opener_es (Spanish hotel reviews)
darmstadt_unis (English online university reviews)
MPQA

Subtask 2: Cross-lingual structured sentiment

In this track, models are instead trained only on a high-resource language (English) and tested on several other languages.

For training, you can use any of the other datasets, as well as any other resource that does not contain sentiment annotations in the target language.

Test:

opener_es
multibooked_ca
multibooked_eu

That means that the cross-lingual models should be able to adapt quickly to new languages.
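
The dataset split between the two subtasks can be written down as a small configuration. This is only a sketch: the directory names come from the lists above, the `data/<name>/train.json` layout is an assumption generalized from the `norec` path used in the loading example, and `train_file` and `crosslingual_train_candidates` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path

# Datasets listed for Subtask 1 (directory names under data/).
MONOLINGUAL = [
    "norec",
    "multibooked_ca",
    "multibooked_eu",
    "opener_en",
    "opener_es",
    "darmstadt_unis",
    "mpqa",
]

# Test datasets listed for Subtask 2.
CROSSLINGUAL_TEST = ["opener_es", "multibooked_ca", "multibooked_eu"]


def train_file(dataset, root="data"):
    """Build the assumed path to a dataset's training file."""
    return Path(root) / dataset / "train.json"


def crosslingual_train_candidates():
    """Datasets that are not cross-lingual test sets ("any of the other datasets")."""
    return [name for name in MONOLINGUAL if name not in CROSSLINGUAL_TEST]
```

Read literally, "any of the other datasets" leaves `norec`, `opener_en`, `darmstadt_unis` and `mpqa` as training candidates. External resources are also allowed as long as they carry no sentiment annotations in the target language.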

Data and formatting

We provide the data in json lines format.

Each line is an annotated sentence, represented as a dictionary with the following keys and values:

'sent_id': unique sentence identifiers
'text': raw text
'opinions': list of all opinions (dictionaries) in the sentence

Additionally, each opinion in a sentence is a dictionary with the following keys and values:

'Source': a list of text and character offsets for the opinion holder
'Target': a list of text and character offsets for the opinion target
'Polar_expression': a list of text and character offsets for the opinion expression
'Polarity': sentiment label ('negative', 'positive', 'neutral')
'Intensity': sentiment intensity ('average', 'strong', 'weak')

{
    "sent_id": "../opener/en/kaf/hotel/english00164_c6d60bf75b0de8d72b7e1c575e04e314-6",
    "text": "Even though the price is decent for Paris , I would not recommend this hotel .",
    "opinions": [
        {
            "Source": [["I"], ["44:45"]],
            "Target": [["this hotel"], ["66:76"]],
            "Polar_expression": [["would not recommend"], ["46:65"]],
            "Polarity": "negative",
            "Intensity": "average"
        },
        {
            "Source": [[], []],
            "Target": [["the price"], ["12:21"]],
            "Polar_expression": [["decent"], ["25:31"]],
            "Polarity": "positive",
            "Intensity": "average"
        }
    ]
}
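
In the example above, each offset string appears to follow a `"start:end"` convention with an exclusive end, so that `text[start:end]` reproduces the annotated string. That reading is inferred from this one example rather than stated by the format description. Under that assumption, a minimal consistency check could look like the sketch below, where `parse_offset` and `check_spans` are hypothetical helper names.

```python
from typing import Dict, List, Tuple


def parse_offset(offset: str) -> Tuple[int, int]:
    """Turn a "start:end" string such as "44:45" into a pair of ints."""
    start, end = offset.split(":")
    return int(start), int(end)


def check_spans(sentence: Dict) -> List[str]:
    """Return one message per span whose offsets do not match the text."""
    problems = []
    text = sentence["text"]
    for opinion in sentence["opinions"]:
        for key in ("Source", "Target", "Polar_expression"):
            # Each element is [list of strings, list of offset strings].
            strings, offsets = opinion[key]
            for string, offset in zip(strings, offsets):
                start, end = parse_offset(offset)
                # Assumption: offsets are end-exclusive character indices.
                if text[start:end] != string:
                    problems.append(
                        f"{key} {offset}: expected {string!r}, found {text[start:end]!r}"
                    )
    return problems


example = {
    "sent_id": "../opener/en/kaf/hotel/english00164_c6d60bf75b0de8d72b7e1c575e04e314-6",
    "text": "Even though the price is decent for Paris , I would not recommend this hotel .",
    "opinions": [
        {
            "Source": [["I"], ["44:45"]],
            "Target": [["this hotel"], ["66:76"]],
            "Polar_expression": [["would not recommend"], ["46:65"]],
            "Polarity": "negative",
            "Intensity": "average",
        },
        {
            "Source": [[], []],
            "Target": [["the price"], ["12:21"]],
            "Polar_expression": [["decent"], ["25:31"]],
            "Polarity": "positive",
            "Intensity": "average",
        },
    ],
}

print(check_spans(example))  # prints []
```

An empty `Source` (`[[], []]`) means no opinion holder is annotated, and the check simply skips it.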

You can import the data using the json library in Python:

>>> import json
>>> with open("data/norec/train.json") as infile:
...     norec_train = json.load(infile)
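
Note that `json.load` expects the whole file to be a single JSON document, whereas a JSON Lines file holds one object per line. Since the description above mentions JSON Lines, a loader that accepts either layout may be convenient. The following is a sketch under that assumption. `load_sentences` and `polarity_counts` are hypothetical helpers, not part of the repository.

```python
import json
from collections import Counter
from pathlib import Path


def load_sentences(path):
    """Load annotated sentences from a JSON array file or a JSON Lines file."""
    raw = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Not a single JSON document: fall back to one JSON object per line.
        return [json.loads(line) for line in raw.splitlines() if line.strip()]
    # A single document is either a list of sentences or one sentence dict.
    return data if isinstance(data, list) else [data]


def polarity_counts(sentences):
    """Count opinions per 'Polarity' label across all sentences."""
    return Counter(
        opinion["Polarity"]
        for sentence in sentences
        for opinion in sentence["opinions"]
    )
```

For example, `polarity_counts(load_sentences("data/norec/train.json"))` would give the label distribution of the training split, assuming that file exists locally.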
