# Processing [MPQA 2.0](http://mpqa.cs.pitt.edu/corpora/mpqa_corpus/) with Python

To understand how the MPQA corpus is annotated, see the figure below, taken from Chapter 7, page 122 of Theresa Ann Wilson's thesis (2008): Fine-grained Subjectivity and Sentiment Analysis: Recognizing the Intensity, Polarity, and Attitudes of Private States. University of Pittsburgh.

This repository uses the MPQA 2.0 corpus to train and test a neural **token-level** sentence tagger (labeler) in Python. The tagger labels the holders and targets of **direct-subjective** expressions.

This repository ignores expressive-subjectivity annotations and objective-speech-event annotations. 

![MPQA schema](mpqa_schema.png)

## Pre-process 

Split the documents into sentences and tokenize them with the [python wrapper](https://github.com/brendano/stanford_corenlp_pywrapper) for Stanford CoreNLP. Add the path to the wrapper on line 49 and call:

In [1]:
from generate_mpqa_jsons import preprocess
preprocess()

SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)? (process_mpqa2_new.py, line 346)

## Produce corpus json 

To demonstrate what the produced corpus JSON looks like, let's consider only two documents:

In [None]:
docs = ['non_fbis/06.12.31-26764', '20020320/11.52.35-10118']

Let's check MPQA annotation files of these two documents.

In [None]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 20;


In [None]:
for doc in docs:
    lre = 'database.mpqa.2.0/man_anns/' + doc + '/gateman.mpqa.lre.2.0'
    # Print the raw annotation file line by line; the file is closed automatically
    with open(lre, 'r') as lre_file:
        for line in lre_file:
            print(line)
    print('='*110 + '\n')

# MPQA annotation file

#  Created by /afs/cs.pitt.edu/projects/wiebe/opin/bin/mkman_anns.pl

#  from gate_anns/non_fbis/06.12.31-26764/06.12.31-26764.jlw.attitudes.xml

#  on Sun Dec 10 07:48:07 EST 2006

#  by twilson

125	1393,1393	string	GATE_direct-subjective	 attitude-link="a16, a17, a18" intensity="high" nested-source="w" implicit="true"

121	1515,1532	string	GATE_expressive-subjectivity	 nested-source="w" intensity="high" polarity="negative"

117	1,12	string	GATE_objective-speech-event	 nested-source="w,implicit"

113	2617,2621	string	GATE_agent	 nested-source="w,ƒrice"

109	1630,1770	string	GATE_inside	 nested-source="w"

105	1611,1625	string	GATE_expressive-subjectivity	 nested-source="w" intensity="high" polarity="negative"

101	290,471	string	GATE_inside	 nested-source="w"

97	1261,1269	string	GATE_objective-speech-event	 objective-uncertain="very-uncertain" nested-source="w,one" insubstantial="c3"

93	913,928	string	GATE_expressive-subjectivity	 nested-source="w" intensity

Clearly, these annotations have to be reformatted if you would like to train a neural **token-level** sentence labeler on them.
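
To make the raw format concrete, here is a minimal sketch of reading one annotation line. It assumes the fields are tab-separated (id, character span, data type, annotation type, attributes), as the listing above suggests. `parse_annotation_line` and `ATTR_RE` are hypothetical names for illustration and are not part of `data_helpers.process_mpqa2_new`.

```python
import re

# Hypothetical helper, not the repository's parser.
# Matches attribute pairs such as intensity="high".
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')

def parse_annotation_line(line):
    """Return a dict for one annotation line, or None for comments and blank lines."""
    line = line.rstrip('\n')
    if not line.strip() or line.startswith('#'):
        return None
    fields = line.split('\t')  # assumption: fields are tab-separated
    if len(fields) < 4:
        return None
    ann_id, span, _data_type, ann_type = fields[:4]
    start, end = (int(x) for x in span.split(','))
    attrs = dict(ATTR_RE.findall(fields[4])) if len(fields) > 4 else {}
    return {'id': ann_id, 'start': start, 'end': end,
            'type': ann_type, 'attributes': attrs}

example = ('125\t1393,1393\tstring\tGATE_direct-subjective\t '
           'attitude-link="a16, a17, a18" intensity="high" '
           'nested-source="w" implicit="true"')
parsed = parse_annotation_line(example)
print(parsed['type'], parsed['start'], parsed['end'])
print(parsed['attributes']['attitude-link'].split(', '))
```

The character offsets still have to be aligned with the tokenized sentences, which is the part handled by `get_annotations` below.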

In [None]:
from data_helpers.process_mpqa2_new import get_annotations
from operator import add
import json

In [None]:
annos_path = 'database.mpqa.2.0/man_anns/'
docsraw_path = 'database.mpqa.2.0/docs/'

data_dict = {'documents_num': len(docs)}
stats_corpus = [0]*14

for d, doc in enumerate(docs):
    lre = annos_path + doc + '/gateman.mpqa.lre.2.0'
    sent = annos_path + doc + '/gatesentences.mpqa.2.0'
    docraw = docsraw_path + doc

    doc_corenlp_json = docraw + ".json"
    with open(doc_corenlp_json) as data_file:
        doc_corenlp = json.load(data_file)
    doc_token_sentences = doc_corenlp['sentences']

    argv = [lre, docraw, sent, doc_token_sentences]
    doc_dict, stats_doc = get_annotations(argv)
    # list(...) keeps stats_corpus indexable under Python 3, where map returns an iterator
    stats_corpus = list(map(add, stats_corpus, stats_doc))
    
    doc_name = 'document'+str(d)
    data_dict.update({doc_name: doc_dict})

```data_dict``` is the corpus dictionary. Let's print it nicely using the first answer to [this](https://stackoverflow.com/questions/3229419/how-to-pretty-print-nested-dictionaries) Stack Overflow question.

In [None]:
def pretty(d, indent=0):
   for key, value in d.items():
      print('\t' * indent + str(key))
      if isinstance(value, dict):
         pretty(value, indent+1)
      else:
         print('\t' * (indent+1) + str(value))

The following code just sorts the dictionary keys.

In [None]:
import collections

for doc in data_dict:
    if doc != 'documents_num':
        doc_dict_sort = collections.OrderedDict(sorted(data_dict[doc].items()))
        for sid in range(doc_dict_sort['sentences_num']):
            sent_dict = doc_dict_sort['sentence'+str(sid)]
            sent_dict_sort = collections.OrderedDict(sorted(sent_dict.items(), reverse=True))
            doc_dict_sort['sentence'+str(sid)] = sent_dict_sort            
        data_dict[doc] = doc_dict_sort

pretty(data_dict)

document0
	document_path
		database.mpqa.2.0/docs/non_fbis/06.12.31-26764
	sentence0
		sentence_tokenized
			[u'Elaborating', u'on', u'the', u'`', u'axis', u'of', u'evil', u"'", u'PRESIDENT', u'GEORGE', u'Bush', u"'s", u'National', u'Security', u'Adviser', u'Condoleezza', u'Rice', u'has', u'recently', u'``', u'defined', u"''", u'the', u'context', u'and', u'scope', u'of', u'the', u'term', u'``', u'axis', u'of', u'evil', u"''", u'that', u'Bush', u'used', u'in', u'his', u'State', u'of', u'the', u'Union', u'address', u'a', u'few', u'weeks', u'ago', u'to', u'describe', u'Iraq', u',', u'Iran', u'and', u'North', u'Korea', u'.']
		dss_ids
			[3, 5]
		ds5
			ds_implicit
				False
			ds_intensity
				neutral
			ds_annotation_uncertain
				none
			ds_indices
				[48, 49]
			holders_uncertain
				['no']
			holder_ds_overlap
				[False]
			holders_indices
				[[35]]
			ds_polarity
				none
			att_num
				1
			holders_tokenized
				[[u'Bush']]
			att0
				attitudes_uncertain
					no
				attitudes_sa

However, not everything could be retrieved from the given annotation files, mostly because of sentence splitting and tokenization. Some problems originate in the annotation files themselves. We fix some holder annotations (starting at line 536 in ```data_helpers.process_mpqa2_new```) as follows:

1. nhs -> w,nhs
2. w,ip -> ip
3. w,mug,mug -> w,mug
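
The three rewrites listed above can be pictured as a lookup table. The sketch below is illustrative only: `HOLDER_FIXES` and `normalize_nested_source` are hypothetical names, and whether the repository applies these fixes globally or only to specific annotations is not stated here.

```python
# Illustrative only: the actual fixes live in data_helpers.process_mpqa2_new.
HOLDER_FIXES = {
    'nhs': 'w,nhs',
    'w,ip': 'ip',
    'w,mug,mug': 'w,mug',
}

def normalize_nested_source(nested_source):
    """Strip spaces around commas, then apply a known fix if one exists."""
    key = ','.join(part.strip() for part in nested_source.split(','))
    return HOLDER_FIXES.get(key, key)

print(normalize_nested_source('w, mug, mug'))
print(normalize_nested_source('w,mccoy'))
```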

In [None]:
print("# DSs for which the corresponding tokenized sentence was not retrieved: %d" % stats_corpus[5])
print("# DSs for which two corresponding tokenized sentences were retrieved: %d" % stats_corpus[4])
print("# holders that are not in the same sentence as the corresponding DS: %d" % stats_corpus[6])
print("# targets that are not in the same sentence as the corresponding DS: %d" % stats_corpus[7])
print("# holders that are not retrieved from the given annotation: %d" % stats_corpus[9])
print("# targets that are not retrieved from the given annotation: %d" % stats_corpus[8])

# DSs for which the corresponding tokenized sentence was not retrieved: 0
# DSs for which two corresponding tokenized sentences were retrieved: 0
# holders that are not in the same sentence as the corresponding DS: 1
# targets that are not in the same sentence as the corresponding DS: 1
# holders that are not retrieved from the given annotation: 0
# targets that are not retrieved from the given annotation: 0


Although annotators should have used the ```implicit``` attribute to mark implicit direct-subjectives, such cases are not always marked properly. Some of them appear as direct-subjectives that are only one character long. We discard them.

In [None]:
print("# direct subjectives (DSs) with length smaller or equal to one character: %d" % stats_corpus[2])
print("# number of implicit DSs: %d" % stats_corpus[0])

implicit_one_char =  stats_corpus[12]/float(stats_corpus[0]) if stats_corpus[0] > 0 else 0
print("percentage of implicit DSs that are one character long: %d" % implicit_one_char)

print("# DSs longer than one character and with one corresponding sentence: %d" % stats_corpus[3])

# direct subjectives (DSs) with length smaller or equal to one character: 23
# number of implicit DSs: 2
percentage of implicit DSs that are one character long: 0
# DSs longer than one character and with one corresponding sentence: 30
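
Note that the `%d` format truncates any ratio below one to 0, and that the value computed above is a fraction rather than a percentage. A minimal sketch of one way to report it, using a hypothetical `percentage` helper and made-up counts rather than the corpus statistics:

```python
def percentage(part, whole):
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole > 0 else 0.0

# Made-up counts, purely to show the formatting
print("%.1f%%" % percentage(1, 4))
```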


Finally, save the dictionary to a JSON file so that it can be reused later.

In [None]:
with open('example_jsons/two_docs.json', 'w') as fp:
    json.dump(data_dict, fp, sort_keys=True, indent=2)

**The dictionary contains all the information about direct-subjectives available in the annotation files, but not all of it is always necessary (e.g., the subjectivity-certainty attribute of a direct-subjective). The sections below show how to keep only the information needed to label the opinion roles (i.e., holder and target) of direct-subjectives.**
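
As a rough illustration of what "only the information needed" can mean, the sketch below keeps just the token indices of a direct-subjective, its holders, and the targets of each attitude. `extract_roles` and `toy_ds` are hypothetical; the key names follow the JSON printed above, and this is not the filtering done by `transform_orl_data`.

```python
def extract_roles(ds):
    """Keep only the index lists needed for holder/target labeling."""
    roles = {
        'ds_indices': ds['ds_indices'],
        'holders_indices': ds['holders_indices'],
        'targets_indices': [],
    }
    # One entry per attitude; an attitude without targets contributes an empty list
    for att_id in range(ds['att_num']):
        att = ds.get('att' + str(att_id), {})
        roles['targets_indices'].append(att.get('targets_indices', []))
    return roles

# Toy direct-subjective entry (values invented for the example)
toy_ds = {
    'ds_indices': [1],
    'holders_indices': [[0]],
    'att_num': 1,
    'ds_polarity': 'positive',
    'att0': {'targets_indices': [[2, 3, 4]], 'attitudes_types': 'sentiment-pos'},
}
print(extract_roles(toy_ds))
```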

### Produce train, dev, test jsons for experiments 

The folder ```datasplit``` contains two sub-folders: ```new``` and ```prior```. The sub-folder ```prior``` contains the document splits for 10-fold CV provided by the authors of the [prior work](http://www.aclweb.org/anthology/P16-1087), and ```new``` contains document splits for 4-fold CV chosen so that the dev and test sets are large enough. Call ```main(4, 'new')``` or ```main(10, 'prior')``` from ```generate_mpqa_jsons``` to produce the train, dev, and test JSONs and save them in the ```jsons``` folder.

## Filter and vectorize train, dev, test jsons 

Build a vocabulary from training data.

In [None]:
json_name = 'jsons/prior/train_fold_0.json'
with open(json_name) as data_file:
    orl_train_corpus = json.load(data_file)

In [None]:
from data_utils import get_emb_vocab

train_sentences_orl = []
for doc_num in range(orl_train_corpus['documents_num']):
    document_name = 'document' + str(doc_num)
    doc = orl_train_corpus[document_name]

    for sent_num in range(doc['sentences_num']):
        sentence_name = 'sentence' + str(sent_num)
        # A list (not a lazy map object) so the sentence can be traversed more than once
        sentence_lower = [x.lower() for x in doc[sentence_name]['sentence_tokenized']]
        train_sentences_orl.append(sentence_lower)
        
emb_type = 'glove'
emb_size = 100
vocab_freq = 1
embeddings, vocabulary, _ = get_emb_vocab(train_sentences_orl, emb_type, emb_size, vocab_freq)
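
The internals of `get_emb_vocab` are not shown here. As a rough picture of what building a vocabulary with a frequency threshold can look like, here is a minimal sketch. It assumes that `vocab_freq` is a minimum token count and that two special tokens are reserved; both are assumptions, and the sketch does not load GloVe vectors. `build_vocabulary` is a hypothetical name.

```python
from collections import Counter

def build_vocabulary(sentences, min_freq=1, specials=('<pad>', '<unk>')):
    """Map tokens seen at least min_freq times to integer ids (assumed scheme)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    vocab = {token: idx for idx, token in enumerate(specials)}
    # Most frequent tokens first; ties broken alphabetically for determinism
    for token, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab

toy_sentences = [['the', 'cat'], ['the', 'dog']]
print(build_vocabulary(toy_sentences, min_freq=2))
```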

Vectorize the JSON produced above for a neural model.

In [None]:
from data_utils import transform_orl_data

mode = 'dev'
att_link_obligatory = 'false'
window_size = 1
exp_setup_id = 'tmp'
orl_train, _, _ = transform_orl_data(orl_train_corpus, vocabulary, window_size, mode, exp_setup_id, att_link_obligatory)

In [None]:
#print(orl_train)

From this data structure you can build an iterator for mini-batch gradient descent.

In [None]:
from data_utils import eval_data_iter

batch_size = 32
orl_train_iter = eval_data_iter(orl_train, batch_size, vocabulary, None)
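
For readers unfamiliar with the idea, a mini-batch iterator can be as simple as the generator below. This is a generic sketch, not the implementation of `eval_data_iter`: it does no padding or shuffling and ignores the vocabulary. `batch_iter` is a hypothetical name.

```python
def batch_iter(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

batches = list(batch_iter(list(range(70)), 32))
print([len(batch) for batch in batches])  # [32, 32, 6]
```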

Let's go through the important steps of the ```transform_orl_data``` method. Consider only one direct-subjective for demonstration.

In [None]:
doc = orl_train_corpus['document0']
sentence = doc['sentence12']
ds = doc['sentence12']['ds2']
print(pretty(ds))

ds_implicit
	False
ds_intensity
	medium
ds_annotation_uncertain
	none
ds_indices
	[1]
holders_uncertain
	[u'no']
att0
	target_ds_overlap
		[False]
	attitudes_uncertain
		no
	attitudes_contrast
		no
	targets_uncertain
		[u'no']
	attitudes_types
		sentiment-pos
	attitudes_repetition
		no
	targets_tokenized
		[[u'the', u'support', u'of', u'The', u'__', u'Company', u'Foundation', u'in', u'the', u'amount', u'of', u'$', u'10,000.00']]
	targets_indices
		[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]]
	attitudes_sarcastic
		no
	attitudes_inferred
		no
holders_indices
	[[0]]
ds_polarity
	positive
att_num
	1
holder_ds_overlap
	[False]
holders_tokenized
	[[u'McCoy']]
ds_tokenized
	[u'invites']
attitudes
	[u'sentiment-pos']
ds_expression_intensity
	medium
ds_subjective_uncertain
	none
attitude_link_exists
	True
ds_insubstantial
	none
None
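
Index lists such as `holders_indices` and `targets_indices` map naturally onto token-level labels. The sketch below uses a BIO-style scheme with `H` for holder and `T` for target; the label names are an assumption made for illustration, and `transform_orl_data` may encode roles differently. `role_labels` is a hypothetical helper that expects each span to be a sorted list of token indices.

```python
def role_labels(num_tokens, holders_indices, targets_indices):
    """Turn holder/target index spans into one BIO-style label per token."""
    labels = ['O'] * num_tokens
    for role, spans in (('H', holders_indices), ('T', targets_indices)):
        for span in spans:
            for position, token_index in enumerate(span):
                prefix = 'B-' if position == 0 else 'I-'
                labels[token_index] = prefix + role
    return labels

# Toy sentence of six tokens: holder at 0, target covering 2..4
print(role_labels(6, [[0]], [[2, 3, 4]]))
# ['B-H', 'O', 'B-T', 'I-T', 'I-T', 'O']
```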


We do not allow **implicit** direct-subjectives.

For example, in *But there can not be any real [talk]<sub>opinion</sub> of success until the broad strategy against terrorism begins to bear fruit.*, the annotated attitude covers the whole sentence (no target) and reflects the attitude of the author of the document (no explicit holder).

Whether a direct-subjective is implicit can easily be retrieved from our JSON format.

In [None]:
print(ds['ds_implicit'])

False


We do not allow **inferred** direct-subjectives. 

The task we are tackling is labelling of opinion roles of *explicit* opinion expressions. 

Opinion roles of inferred opinion expressions cannot be recovered with the same method.

In [None]:
inferred = True
for atid in range(ds['att_num']):
    if ds['att'+str(atid)]['attitudes_inferred'] != 'yes':
        inferred = False
print(inferred)

False


The annotation of a direct-subjective may lack the attitude-link attribute, which makes it impossible to trace its target.
                
Example: 
2053	3765,3769	string	GATE_direct-subjective	 nested-source="w,mccoy" intensity=""

We keep such direct-subjectives.

In [None]:
if not ds['attitude_link_exists']:
    # att_link_obligatory is the flag defined earlier ('true' or 'false')
    if att_link_obligatory == 'true':
        print('No attitude link.')

A direct-subjective can be marked with the **insubstantial attribute** if it is:
1. **not significant** in the discourse; for example, from *It completely supports the [U.S]<sub>holder</sub> [stance]<sub>opinion</sub>*, we cannot tell what the U.S.'s stance is,

2. **not real** within the discourse; for example, from *Antonio Martino, meanwhile, said [...] that his country would not support an attack on Iraq without 'proven proof' that [Baghdad]<sub>holder</sub> is [supporting]<sub>opinion</sub> [al Qaeda]<sub>target</sub>*, we have no proof that Baghdad is supporting al Qaeda.
                
We keep insubstantial DSEs and label their opinion roles.

In [None]:
print(ds['ds_insubstantial'])

none


A direct-subjective can carry the attributes 'ds_subjective_uncertain' and 'annotation-uncertain'.

We did not discard such direct-subjectives, on the assumption that the corpus creators would have removed them had they really been incorrect.

The same holds for targets and holders. 

In [None]:
print(ds['ds_annotation_uncertain'] )

none


We do not allow a holder or a target to overlap with the corresponding direct-subjective. 

For example, in *Mugabe said [Zimbabwe]<sub>target</sub> needed their continued support against what he called [hostile [international]<sub>holder</sub> attention]<sub>opinion</sub>.*, holder and direct subjective overlap.

In [None]:
print(ds['holder_ds_overlap'])

[False]
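
How the repository computes `holder_ds_overlap` is not shown here. One simple way to picture the check is at the token level, as in this sketch; `spans_overlap` and `holder_overlap_flags` are hypothetical names, and the real check may work on character offsets instead.

```python
def spans_overlap(indices_a, indices_b):
    """True if the two token index lists share at least one token."""
    return bool(set(indices_a) & set(indices_b))

def holder_overlap_flags(ds_indices, holders_indices):
    """One flag per holder span, mirroring the shape of holder_ds_overlap."""
    return [spans_overlap(ds_indices, holder) for holder in holders_indices]

print(holder_overlap_flags([1], [[0]]))        # [False]
print(holder_overlap_flags([4, 5, 6], [[5]]))  # [True]
```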


A direct-subjective can have multiple attitudes and each attitude can point to different targets. 

We have to pick one attitude and non-overlapping targets.

We chose attitudes according to the following priorities: sentiment, intention, agreement, arguing, other-attitude, speculation.
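
The priority order above can be sketched as a ranking over attitude types. The sketch assumes that type strings start with the category name (as in `sentiment-pos`) and that agreement attitudes start with `agree`; both are assumptions. `PRIORITY`, `attitude_rank`, and `pick_attitude` are hypothetical names, and the selection of non-overlapping targets is not covered.

```python
# Prefixes in priority order; 'agree' is assumed to cover the agreement types
PRIORITY = ['sentiment', 'intention', 'agree', 'arguing', 'other-attitude', 'speculation']

def attitude_rank(attitude_type):
    """Lower rank means higher priority; unknown types rank last."""
    for rank, prefix in enumerate(PRIORITY):
        if attitude_type.startswith(prefix):
            return rank
    return len(PRIORITY)

def pick_attitude(ds):
    """Return the id of the highest-priority attitude, or None if there is none."""
    if ds['att_num'] == 0:
        return None
    return min(range(ds['att_num']),
               key=lambda i: attitude_rank(ds['att' + str(i)]['attitudes_types']))

# Toy entry with two attitudes (values invented for the example)
toy = {
    'att_num': 2,
    'att0': {'attitudes_types': 'arguing-neg'},
    'att1': {'attitudes_types': 'sentiment-pos'},
}
print(pick_attitude(toy))  # 1
```

When two attitudes share the same rank, `min` returns the first one, so ties fall back to annotation order.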