Skip to content

jerbarnes/finegrained_data

Repository files navigation

Fine-grained sentiment datasets

This repo contains code to download and preprocess the following sentiment datasets

  1. Fine-grained
    1. MPQA: An English corpus of NewsWire annotated for structured sentiment
    2. DarmStadt Service Review Corpus: An English corpus of online service reviews
    3. NoReC Fine-grained
    4. MultiBooked EU and CA: Hotel reviews in Basque and Catalan
    5. OpeNER: Hotel reviews in English and Spanish
  2. Targeted
    1. SemEval 2014 Task 4: Restaurant and Laptop reviews in English.
    2. Open Domain Targeted Sentiment: twitter dataset with various targets
    3. TDParse: twitter election tweets with multiple targets per tweet.
    4. MAMS: Restaurant reviews with multiple targets in each sentence.

Dependencies

  • Python3
  • lxml
  • numpy
  • nltk

Downloading and preprocessing the data

Running the following scripts will download and process all the data. The only exception is MPQA. You will need to get the MPQA 2.0 dataset, agree to the license, download the dataset directly, and then place it in this repo before running 'get_finegrained_data.sh'

./get_finegrained_data.sh
./get_targeted_data.sh
./process_finegrained.sh
./process_targeted.sh

This will create a directory called 'processed', and in each of the subdirectories, you will find three json files (train.json, dev.json, test.json). Each json file contains a list of sentences, where each sentence is a dictionary with the following

Each sentence has a dictionary with the following keys and values:

  • 'sent_id': unique NoReC identifier for document + paragraph + sentence which lines up with the identifiers from the document and sentence-level NoReC data

  • 'text': raw text

  • 'opinions': list of all opinions (dictionaries) in the sentence

Additionally, each opinion in a sentence is a dictionary with the following keys and values:

  • 'Source': a list of text and character offsets for the opinion holder

  • 'Target': a list of text and character offsets for the opinion target

  • 'Polar_expression': a list of text and character offsets for the opinion expression

  • 'Polarity': sentiment label ('Negative', 'Positive')

  • 'Intensity': sentiment intensity ('Standard', 'Strong', 'Slight')

{
    'sent_id': '202263-20-01',
    'text': 'Touchbetjeningen brukes også til å besvare innkomne mobilanrop , og Sennheiser skryter av å ha doble mikrofoner i øreklokkene for å kutte ned på støyen .',
    'opinions': [
                    {
                     'Source': [['Sennheiser'], ['68:78']],
                     'Target': [['øreklokkene'], ['114:125']],
                     'Polar_expression': [['skryter av å ha doble mikrofoner i øreklokkene for å kutte ned på støyen'], ['79:151']],
                     'Polarity': 'Positive',
                     'Intensity': 'Standard'
                     }
                 ]
}

Note that a single sentence may contain several annotated opinions. At the same time, it is common for a given instance to lack one or more elements of an opinion, e.g. the holder (source) or target. In this case, the value for that element is [[],[]].

Importing the data

We include train.json, dev.json, and test.json in this directory.

You can import them by using the json library in python:

>>> import json
>>> data = {}
>>> for name in ["train", "dev", "test"]:
        with open("{0}.json".format(name)) as infile:
            data[name] = json.load(infile)

Cite

If you use this script, please cite the following paper, as well as the corresponding citations from the appropriate datasets:

@inproceedings{barnes-etal-2021-youve,
    title = "If you{'}ve got it, flaunt it: Making the most of fine-grained sentiment annotations",
    author = "Barnes, Jeremy  and
      {\O}vrelid, Lilja  and
      Velldal, Erik",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-main.5",
    pages = "49--62"
}

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published