## Script to process the SemEval 2016 Task 5 - Aspect Based Sentiment Analysis (ABSA) corpus using the Stanford CoreNLP dependency parser

Task/Slot being evaluated:

    SemEval 2016 Task 5 - Aspect Based Sentiment Analysis (ABSA) (http://alt.qcri.org/semeval2016/task5/)
    Subtask 1: Sentence-Level ABSA
    Slot 2: Opinion Target Expression (OTE) - (Restaurants domain)

Corpus reference:
 
    Pontiki, M., Galanis, D., Papageorgiou, H., Manandhar, S., & Androutsopoulos, I. To Appear. 
    Semeval-2016 task 05: Aspect based sentiment analysis. 
    In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016),
    Association for Computational Linguistics.
    http://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools


Stanford CoreNLP:

    Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014, June).
    The Stanford CoreNLP Natural Language Processing Toolkit. In ACL (System Demonstrations) (pp. 55-60).
    http://nlp.stanford.edu/pubs/StanfordCoreNlp2014.pdf
    http://stanfordnlp.github.io/CoreNLP/


### Start the Stanford CoreNLP server

* Download Stanford CoreNLP - http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip
* Unzip the archive and move the extracted folder to /opt/stanford-corenlp
* Start the CoreNLP server

```bash
$ java -mx6g -cp "/opt/stanford-corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
```
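
Before sending any text, it can help to confirm that something is listening on the server port. The following is a minimal sketch, assuming the server runs on `localhost` port `9000` as in the requests further down; `port_is_open` is a hypothetical helper name, and an open port only shows that some process accepts TCP connections there, not that it is CoreNLP.

```python
import socket


def port_is_open(host='localhost', port=9000, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing accepts connections on the given address
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print('Port 9000 reachable:', port_is_open())
```

If this prints `False`, the server is most likely still starting up or was launched on a different port.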


In [36]:
# Let's test the Stanford CoreNLP server
import json
import requests

try:
    # python 3
    from urllib.parse import urlencode
except ImportError:
    # python 2
    from urllib import urlencode
    
properties = {'ssplit.isOneSentence':'true',
              'outputFormat': 'conll'}

In [37]:
# use json.dumps to serialize the properties dictionary to a JSON string
properties_val = json.dumps(properties)
params = {'properties': properties_val}

In [38]:
# we encode the URL params using urllib because we need a URL with GET parameters
# even though we are making a POST request. The POST data is the text itself.
encoded_params = urlencode(params)
url = '{url}:{port}/?{params}'.format(url='http://localhost',
                                      port='9000', 
                                      params=encoded_params)
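
The URL construction above is repeated later for every sentence, so it may be convenient to wrap it in a function. This is a sketch only: `corenlp_url` is a hypothetical helper name, the defaults mirror the host and port used in this notebook, and the imports are Python 3 only. The round trip at the end decodes the query string again to show what the server receives.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs


def corenlp_url(properties, host='http://localhost', port=9000):
    """Build the request URL with the properties passed as a GET parameter."""
    # The properties dictionary travels as one JSON string under 'properties'
    params = urlencode({'properties': json.dumps(properties)})
    return '{host}:{port}/?{params}'.format(host=host, port=port, params=params)


url = corenlp_url({'ssplit.isOneSentence': 'true', 'outputFormat': 'conll'})

# Round trip: decode the query string back into the original dictionary
decoded = json.loads(parse_qs(urlsplit(url).query)['properties'][0])
print(decoded)
```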

In [39]:
text = 'Saul is the best restaurant on Smith Street and in Brooklyn.'
headers = {'Content-Type': 'text/plain;charset=utf-8'}
response = requests.post(url, text.encode('utf-8'), headers=headers)
print(response.content.decode('utf-8'))

1	Saul	Saul	NNP	PERSON	5	nsubj
2	is	be	VBZ	O	5	cop
3	the	the	DT	O	5	det
4	best	best	JJS	O	5	amod
5	restaurant	restaurant	NN	O	0	ROOT
6	on	on	IN	O	8	case
7	Smith	Smith	NNP	ORGANIZATION	8	compound
8	Street	Street	NNP	ORGANIZATION	5	nmod
9	and	and	CC	O	8	cc
10	in	in	IN	O	11	case
11	Brooklyn	Brooklyn	NNP	LOCATION	8	conj
12	.	.	.	O	5	punct
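
The tab-separated output above can be turned into Python dictionaries with a few lines of code. The sketch below assumes the seven columns are, in order, token index, word, lemma, POS tag, NER tag, head index, and dependency relation, as they appear in the output; `CONLL_FIELDS` and `parse_conll` are hypothetical names, and the sample string repeats only the first rows of the output.

```python
CONLL_FIELDS = ('idx', 'word', 'lemma', 'pos', 'ner', 'head', 'deprel')


def parse_conll(text):
    """Parse tab-separated CoNLL-style lines into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between or after sentences
        row = dict(zip(CONLL_FIELDS, line.split('\t')))
        # token and head indices are more useful as integers
        row['idx'] = int(row['idx'])
        row['head'] = int(row['head'])
        rows.append(row)
    return rows


SAMPLE = ('1\tSaul\tSaul\tNNP\tPERSON\t5\tnsubj\n'
          '2\tis\tbe\tVBZ\tO\t5\tcop\n'
          '3\tthe\tthe\tDT\tO\t5\tdet\n'
          '4\tbest\tbest\tJJS\tO\t5\tamod\n'
          '5\trestaurant\trestaurant\tNN\tO\t0\tROOT\n')

rows = parse_conll(SAMPLE)
# The root token is the one whose head index is 0
root = [row for row in rows if row['head'] == 0][0]
print(root['word'])
```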




In [19]:
from lxml import etree

In [20]:
CORPUS_FILES = [('../corpus/SemEval_ABSA2016/SemEvalABSA2016EnglishRestaurants_trial+train.xml',
                 '../corpus/SemEvalABSA2016EnglishRestaurants_train.xml'),
                ('../corpus/SemEval_ABSA2016/SemEvalABSA2016EnglishRestaurants_test.xml',
                 '../corpus/SemEvalABSA2016EnglishRestaurants_test.xml')]

In [21]:
parser = etree.XMLParser(remove_blank_text=True)

In [25]:
for input_filename, output_filename in CORPUS_FILES:    
    xmldoc = etree.parse(input_filename, parser)
    print('Processing file: "{}"'.format(input_filename))
    for sentence_node in xmldoc.xpath('.//sentence'):
                    
        sentence_string = sentence_node.xpath('string(./text/text())')
                
        # process the sentence in Stanford CoreNLP
        properties = {'ssplit.isOneSentence':'true','outputFormat': 'xml'}
        properties_val = json.dumps(properties)
        params = {'properties': properties_val}
        encoded_params = urlencode(params)
        url = '{url}:{port}/?{params}'.format(url='http://localhost',
                                              port='9000', 
                                              params=encoded_params)
        response = requests.post(url, sentence_string.encode('utf-8'), headers=headers)

        # parse the XML response; passing the raw bytes lets lxml handle the
        # XML declaration itself instead of slicing off a fixed-length header
        parsed_sentence = etree.fromstring(response.content, parser)
        
        # compile list of opinions
        opinions = list()
        for opinion in sentence_node.xpath('.//Opinion'):
            opinions.append(dict(opinion.items()))
        
        # compile a dict of dependencies
        dependencies = dict()
        for dep in parsed_sentence.xpath('.//dependencies[@type="basic-dependencies"]/dep'):
            idx= dep.xpath('string(dependent/@idx)')
            dep_type = dep.get('type')
            head = dep.xpath('string(governor/@idx)')
            dependencies[idx] = (head, dep_type)
                    
        # build the tokens node
        tokens_node = etree.SubElement(sentence_node, 'tokens')
        
        
        for token_node in parsed_sentence.xpath('.//token'):
            word_node = etree.SubElement(tokens_node, 'word')
                    
            word_node.set('id', token_node.get('id'))
            word_node.set('form', token_node.xpath('string(./word)'))
            word_node.set('from', token_node.xpath('string(./CharacterOffsetBegin)'))
            word_node.set('to', token_node.xpath('string(./CharacterOffsetEnd)'))
            word_node.set('base', token_node.xpath('string(./lemma)'))
            word_node.set('postag', token_node.xpath('string(./POS)'))
            word_node.set('ner', token_node.xpath('string(./NER)'))
            word_node.set('opinion', 'false')
            
            if token_node.get('id') in dependencies:
                word_node.set('head', dependencies[token_node.get('id')][0])
                word_node.set('deprel', dependencies[token_node.get('id')][1])
            else:
                word_node.set('head', '_')
                word_node.set('deprel', '_')
                                    
            
            # check if the token is part of an opinion
            for opinion in opinions:                                 
                if (int(opinion.get('from'))  <= int(word_node.get('from')) and 
                    int(opinion.get('to'))    >= int(word_node.get('to'))):
                    word_node.set('opinion', 'true')                    
                    word_node.set('polarity', opinion.get('polarity'))                
                      
        # tokens_node was already attached to sentence_node by etree.SubElement above,
        # so no explicit append is needed
                    
    # save the corpus with new annotation
    etree.ElementTree(xmldoc.getroot()).write(output_filename, encoding='utf8', xml_declaration=True, pretty_print=True)

Processing file: "../corpus/SemEval_ABSA2016/SemEvalABSA2016EnglishRestaurants_trial+train.xml"
Processing file: "../corpus/SemEval_ABSA2016/SemEvalABSA2016EnglishRestaurants_test.xml"
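
The per-token logic of the loop can be tried without a running server. The sketch below is an illustration under stated assumptions: `SAMPLE_XML` is a hand-written fragment shaped after the XPath expressions used in the loop (it is not real server output), the opinion span is made up, `annotate` is a hypothetical helper, and the standard-library `xml.etree.ElementTree` stands in for `lxml`. Each token receives its head and dependency relation, and is marked as part of an opinion when the opinion span fully contains the token span.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only (assumed shape, not real output)
SAMPLE_XML = """<root><document><sentences><sentence id="1">
  <tokens>
    <token id="1"><word>Saul</word>
      <CharacterOffsetBegin>0</CharacterOffsetBegin>
      <CharacterOffsetEnd>4</CharacterOffsetEnd></token>
    <token id="2"><word>is</word>
      <CharacterOffsetBegin>5</CharacterOffsetBegin>
      <CharacterOffsetEnd>7</CharacterOffsetEnd></token>
    <token id="3"><word>great</word>
      <CharacterOffsetBegin>8</CharacterOffsetBegin>
      <CharacterOffsetEnd>13</CharacterOffsetEnd></token>
  </tokens>
  <dependencies type="basic-dependencies">
    <dep type="root"><governor idx="0">ROOT</governor><dependent idx="3">great</dependent></dep>
    <dep type="nsubj"><governor idx="3">great</governor><dependent idx="1">Saul</dependent></dep>
    <dep type="cop"><governor idx="3">great</governor><dependent idx="2">is</dependent></dep>
  </dependencies>
</sentence></sentences></document></root>"""


def annotate(parsed_sentence, opinions):
    """Return one dictionary per token with head, deprel and opinion flags."""
    # Map each dependent token index to its (head index, relation type)
    dependencies = {}
    path = ".//dependencies[@type='basic-dependencies']/dep"
    for dep in parsed_sentence.findall(path):
        idx = dep.find('dependent').get('idx')
        dependencies[idx] = (dep.find('governor').get('idx'), dep.get('type'))

    words = []
    for token in parsed_sentence.iter('token'):
        start = int(token.findtext('CharacterOffsetBegin'))
        end = int(token.findtext('CharacterOffsetEnd'))
        # Tokens without a dependency entry get '_' placeholders
        head, deprel = dependencies.get(token.get('id'), ('_', '_'))
        word = {'form': token.findtext('word'), 'head': head,
                'deprel': deprel, 'opinion': 'false'}
        # A token belongs to an opinion when the opinion span contains it
        for opinion in opinions:
            if int(opinion['from']) <= start and int(opinion['to']) >= end:
                word['opinion'] = 'true'
                word['polarity'] = opinion['polarity']
        words.append(word)
    return words


# Made-up opinion covering characters 0 to 4, i.e. the token "Saul"
opinions = [{'from': '0', 'to': '4', 'polarity': 'positive'}]
words = annotate(ET.fromstring(SAMPLE_XML), opinions)
for word in words:
    print(word)
```

An opinion whose `from` and `to` are both `0` contains no token under this test, so every token keeps `opinion="false"` for such entries.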
