NoReC_fine

This dataset is based largely on the original data described in the paper "A Fine-Grained Sentiment Dataset for Norwegian" by L. Øvrelid, P. Mæhlum, J. Barnes, and E. Velldal, accepted at LREC 2020. Since then, we have added annotations for another 3476 sentences, increasing the overall size and scope of the dataset.

Overview

While the previously released dataset NoReC_eval labeled sentences as to whether they are evaluative or sentiment-bearing, NoReC_fine expands on these annotations by labeling polar expressions, opinion holders and opinion targets. This data comprises roughly 11,000 sentences across more than 400 reviews and 10 different thematic categories (literature, products, restaurants, etc.), taken from a subset of the Norwegian Review Corpus (NoReC; Velldal et al. 2018). The data comes with a predefined train/dev/test split (inherited from NoReC), and some key statistics are summarized in the table below, including frequency counts and average token lengths.

Type	Train	Dev	Test	Total
Sentences	8634	1531	1272	11437
--- subjective	4555	821	674	6050
--- multiple polarities	660	120	91	871
--- avg. len	16.7	16.9	17.2	16.8
Holders	898	120	110	1128
--- unique	585	88	73	746
--- avg. len	1.1	1.0	1.0	1.1
--- avg. per subj sent	0.1	0.1	0.1	0.1
Targets	6778	1152	993	8923
--- unique	5000	871	729	6699
--- avg. len	1.9	2.0	2.0	2.0
--- avg. per subj sent	1.1	1.1	1.1	1.1
--- discontinuous	39	5	6	50
--- Not On Topic	971	226	148	1345
Polar Expressions	8448	1432	1235	11115
--- unique	8071	1390	1190	10651
--- avg. len	4.9	5.1	4.9	4.9
--- avg. per subj sent	1.8	1.7	1.8	1.8
--- discontinuous	783	131	125	1039

Each opinion is annotated for polarity (positive, negative) and intensity (slight, standard, strong). The distribution is shown in the figure below:

Annotation guidelines

The full annotation guidelines are distributed with this repo in the annotation_guidelines directory. A summary can also be found in the paper.

Terms of use

NoReC_fine inherits the license of the underlying NoReC corpus, copied here for convenience:

The data is distributed under a Creative Commons Attribution-NonCommercial licence (CC BY-NC 4.0). The full license text is available here: https://creativecommons.org/licenses/by-nc/4.0/

The licence is motivated by the need to prevent third parties from redistributing the original reviews for commercial purposes. Note that machine-learned models, extracted lexicons, embeddings, and similar resources created on the basis of NoReC are not considered to contain the original data, and so can be freely used for commercial purposes as well, despite the non-commercial condition.

JSON format

Each sentence has a dictionary with the following keys and values:

'sent_id': unique NoReC identifier for document + paragraph + sentence which lines up with the identifiers from the document and sentence-level NoReC data
'text': raw text
'opinions': list of all opinions (dictionaries) in the sentence

Additionally, each opinion in a sentence is a dictionary with the following keys and values:

'Source': a list of text and character offsets for the opinion holder
'Target': a list of text and character offsets for the opinion target
'Polar_expression': a list of text and character offsets for the opinion expression
'Polarity': sentiment label ('Negative', 'Positive')
'Intensity': sentiment intensity ('Standard', 'Strong', 'Slight')
'NOT': whether the target is 'Not on Topic' (True, False)
'Source_is_author': (True, False)
'Target_is_general': (True, False)
'Type': whether the polar expression is Evaluative (E) or Evaluative Fact Implied (EFINP)

{
    'sent_id': '202263-20-01',
    'text': 'Touchbetjeningen brukes også til å besvare innkomne mobilanrop , og Sennheiser skryter av å ha doble mikrofoner i øreklokkene for å kutte ned på støyen .',
    'opinions': [
                    {
                     'Source': [['Sennheiser'], ['68:78']],
                     'Target': [['øreklokkene'], ['114:125']],
                     'Polar_expression': [['skryter av å ha doble mikrofoner i øreklokkene for å kutte ned på støyen'], ['79:151']],
                     'Polarity': 'Positive',
                     'Intensity': 'Standard',
                     'NOT': False,
                     'Source_is_author': False,
                     'Target_is_general': True,
                     'Type': 'E'
                     }
                 ]
}

Note that a single sentence may contain several annotated opinions. At the same time, it is common for a given instance to lack one or more elements of an opinion, e.g. the holder (source). In this case, the value for that element is [[],[]].
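
The example above suggests that the character offsets can be read as Python slice bounds into 'text' ('68:78' selects 'Sennheiser'). The following is a minimal sketch under that assumption, and under the further assumption that a discontinuous span simply contributes several text/offset pairs to the two lists. The helper name iter_spans is hypothetical and not part of this repository.

```python
def iter_spans(element):
    """Yield (start, end, text) for one opinion element such as 'Source'.

    Assumes the [[texts...], [offsets...]] layout shown above, with each
    offset written as 'start:end'. An empty element ([[], []]) yields nothing.
    """
    texts, offsets = element
    for text, offset in zip(texts, offsets):
        start, end = (int(part) for part in offset.split(":"))
        yield start, end, text


# The example sentence from this README (only the span elements are kept).
sentence = {
    "sent_id": "202263-20-01",
    "text": "Touchbetjeningen brukes også til å besvare innkomne mobilanrop , og Sennheiser skryter av å ha doble mikrofoner i øreklokkene for å kutte ned på støyen .",
    "opinions": [
        {
            "Source": [["Sennheiser"], ["68:78"]],
            "Target": [["øreklokkene"], ["114:125"]],
        }
    ],
}

# Recover each span by slicing the raw text with the stored offsets.
recovered = {}
for opinion in sentence["opinions"]:
    for key in ("Source", "Target"):
        recovered[key] = [
            sentence["text"][start:end]
            for start, end, _ in iter_spans(opinion[key])
        ]

print(recovered["Source"])  # ['Sennheiser']
```

Comparing the sliced text with the stored text is a cheap sanity check after any preprocessing that might shift character positions.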

Importing the data

We include train.json, dev.json, and test.json in this directory.

You can import them using the json library in Python:

>>> import json
>>> data = {}
>>> for name in ["train", "dev", "test"]:
...     with open("{0}.json".format(name), encoding="utf-8") as infile:
...         data[name] = json.load(infile)
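
Once loaded, the polarity and intensity labels can be tallied per split. The sketch below assumes that each file holds a list of sentence dictionaries in the format described under "JSON format"; the function name label_distribution and the two-sentence sample are made up for illustration and are not corpus data.

```python
from collections import Counter


def label_distribution(sentences):
    """Count (Polarity, Intensity) pairs over all opinions in a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        for opinion in sentence["opinions"]:
            counts[(opinion["Polarity"], opinion["Intensity"])] += 1
    return counts


# Tiny hand-made sample in the documented format (not real corpus data).
# With the real files, pass data["train"], data["dev"] or data["test"] instead.
sample = [
    {
        "sent_id": "a",
        "text": "...",
        "opinions": [
            {"Polarity": "Positive", "Intensity": "Standard"},
            {"Polarity": "Negative", "Intensity": "Strong"},
        ],
    },
    {"sent_id": "b", "text": "...", "opinions": []},
]

for (polarity, intensity), n in sorted(label_distribution(sample).items()):
    print(polarity, intensity, n)
```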

Cite

If you use this dataset, please cite the following paper:

@InProceedings{OvrMaeBar20,
  author = {Lilja {\O}vrelid and Petter M{\ae}hlum and Jeremy Barnes and Erik Velldal},
  title = {A Fine-grained Sentiment Dataset for {N}orwegian},
  booktitle = {{Proceedings of the 12th Edition of the Language Resources and Evaluation Conference}},
  year = 2020,
  address = "Marseille, France, 2020"
}

URL: https://www.aclweb.org/anthology/2020.lrec-1.618/
