## Script to process the ReLi Corpus with the Stanford CoreNLP dependency parser trained on the Google Universal Dependencies Treebank

Corpus reference:
 
          Freitas, C., Motta, E., Milidiú, R., & Cesar, J. (2012).
          Vampiro que brilha... rá! Desafios na anotação de opinião em um corpus
          de resenhas de livros. Proceedings do XI Encontro de Linguística de Corpus (XI ELC). São Carlos - SP.
          http://www.linguateca.pt/Repositorio/ReLi/
          
Universal Dependencies:

        McDonald, R. T., Nivre, J., Quirmbach-Brundage, Y., Goldberg, Y., Das, D., Ganchev, K., ... & Bedini, C. (2013, August). 
        Universal Dependency Annotation for Multilingual Parsing. In ACL (2) (pp. 92-97).
        http://www.ryanmcd.com/papers/treebanksACL2013.pdf        
        http://universaldependencies.org/index.html#language-pt
        
Stanford CoreNLP:

        Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014, June).
        The Stanford CoreNLP Natural Language Processing Toolkit. In ACL (System Demonstrations) (pp. 55-60).
        http://nlp.stanford.edu/pubs/StanfordCoreNlp2014.pdf
        http://stanfordnlp.github.io/CoreNLP/

### Final XML format style
We write `postag`, `head` and `deprel` attributes to each `word` element, taken from the output of Stanford CoreNLP.
```xml
<Reviews>
    <Review rid="">
        <sentences>
            <sentence id="">
                <text>....</text>
                <Opinions>
                    <Opinion target="" category="" polarity="" from="" to=""/>
                </Opinions>
                <tokens>
                    <word id="" form="" from="" to="" base="" morf="" postag="" obj="" opinion="" head="" deprel=""/>
                </tokens>
            </sentence>
        </sentences>
    </Review>
</Reviews>
```

### Start the Stanford CoreNLP server

* Download Stanford CoreNLP - http://nlp.stanford.edu/software/stanford-corenlp-full-2015-12-09.zip
* Move the unzipped folder to /opt/stanford-corenlp
* Start the corenlp server

```bash
$ cd models/stanford-corenlp_UTB-PT
$ java -mx6g -cp "/opt/stanford-corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
```

* The model used was trained on the Google Universal Treebank with Portuguese word embeddings
* The process used to generate the model is described at:
    http://linguaecomputador.blogspot.com.br/2016/03/treinando-modelos-do-parser-de-stanford.html

## 1. Let's test the Stanford CoreNLP server

In [1]:
import json
import requests
from pprint import pprint

try:
    # python 3
    from urllib.parse import urlencode
except ImportError:
    # python 2
    from urllib import urlencode

In [2]:
# parameters
properties = {'tokenize.whitespace': 'true',
              'annotators': 'tokenize,ssplit,pos,depparse',
              'depparse.model': 'dep-parser',
              'pos.model': 'pos-tagger.dat',
              'outputFormat': 'conll'}

# use json dumps function to convert the nested dictionary to a string
properties_val = json.dumps(properties)
params = {'properties': properties_val}

In [3]:
# we encode the URL params using urllib because we need a URL with GET parameters
# even though we are making a POST request. The POST data is the text itself.
encoded_params = urlencode(params)
url = '{url}:{port}/?{params}'.format(url='http://localhost',
                                      port='9000', 
                                      params=encoded_params)
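
The parameter handling in the cells above can be folded into one helper. The sketch below is a minimal version of that idea: `build_corenlp_url` is a hypothetical name that does not appear in the notebook, and the default host and port simply repeat the values used above. The round-trip at the end only checks that the properties survive URL encoding; it does not contact the server.

```python
import json

try:
    # Python 3
    from urllib.parse import urlencode, urlparse, parse_qs
except ImportError:
    # Python 2
    from urllib import urlencode
    from urlparse import urlparse, parse_qs


def build_corenlp_url(properties, host='http://localhost', port=9000):
    """Return the server URL with the properties JSON-encoded in the query string."""
    query = urlencode({'properties': json.dumps(properties)})
    return '{host}:{port}/?{query}'.format(host=host, port=port, query=query)


example_url = build_corenlp_url({'annotators': 'tokenize,ssplit,pos,depparse',
                                 'outputFormat': 'conll'})

# Round-trip check: decoding the query string gives back the same properties
decoded = json.loads(parse_qs(urlparse(example_url).query)['properties'][0])
print(decoded['outputFormat'])
```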

In [4]:
text = 'Esta é uma sentença de teste .'
headers = {'Content-Type': 'text/plain;charset=utf-8'}
response = requests.post(url, text.encode('utf-8'), headers=headers)
print(response.content.decode('utf8'))

1	Esta	_	PRON	_	2	nsubj
2	é	_	VERB	_	0	ROOT
3	uma	_	DET	_	4	det
4	sentença	_	NOUN	_	2	attr
5	de	_	ADP	_	4	adpmod
6	teste	_	NOUN	_	5	adpobj
7	.  	_	.	_	2	p
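
Each output line is tab-separated; the part-of-speech tag, head index and dependency relation sit in the fourth, sixth and seventh columns. As a minimal sketch, the helper below turns such text into one dictionary per token. `parse_conll` and the dictionary keys are hypothetical names chosen for illustration, and the sample is the first three lines of the output above.

```python
def parse_conll(output):
    """Split CoNLL text into one dict per token, keeping the columns used here."""
    tokens = []
    for line in output.strip().split('\n'):
        if not line.strip():
            continue  # skip blank lines between sentences
        columns = [c.strip() for c in line.split('\t')]
        tokens.append({'index': int(columns[0]),
                       'form': columns[1],
                       'postag': columns[3],
                       'head': int(columns[5]),
                       'deprel': columns[6]})
    return tokens


sample = ('1\tEsta\t_\tPRON\t_\t2\tnsubj\n'
          '2\té\t_\tVERB\t_\t0\tROOT\n'
          '3\tuma\t_\tDET\t_\t4\tdet\n')

tokens = parse_conll(sample)

# A head index of 0 marks the root of the sentence
root = [t['form'] for t in tokens if t['head'] == 0]
print(root)
```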




## 2. Parse the ReLi Corpus and include the Stanford Core NLP output

In [5]:
RELI_CORPUS = '../corpus/ReLi.xml'
OUTPUT_CORPUS = '../corpus/ReLiUniversalDependencies.xml'

In [6]:
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
xmldoc = etree.parse(RELI_CORPUS, parser)

In [7]:
# the number of reviews does not change, so compute it once
total_reviews = len(xmldoc.getroot())

for count, review_node in enumerate(xmldoc.iter('Review'), start=1):
    print('Processing review {}/{}'.format(count, total_reviews))

    for sentence_node in review_node.iter('sentence'):
        sentence_string = ' '.join([w.get('form') for w in sentence_node.iter('word')])
        # keeps the error message below valid if the request itself fails
        output = ''

        try:
            # process the sentence string calling Stanford CoreNLP WebService
            response = requests.post(url, sentence_string.encode('utf-8'), headers=headers)
            output = response.content.decode('utf8')

            # parse output in conll format
            parsed_sentence = [s.split('\t') for s in output.strip().split('\n') if len(s.strip()) != 0]

            # add the annotation to each node
            for i, word_node in enumerate(sentence_node.iter('word')):
                word_node.set('postag', parsed_sentence[i][3])
                word_node.set('head', parsed_sentence[i][5])
                word_node.set('deprel', parsed_sentence[i][6])
        except Exception as exp:
            print('Error Processing sentence "{0}", got exception: "{1}" for parser output:\n {2}'.format(
                    sentence_string,
                    exp,
                    output))

Processing review 1/1601
Processing review 2/1601
Processing review 3/1601
Processing review 4/1601
Processing review 5/1601
Processing review 6/1601
Processing review 7/1601
Processing review 8/1601
Processing review 9/1601
Processing review 10/1601
Processing review 11/1601
Processing review 12/1601
Processing review 13/1601
Processing review 14/1601
Processing review 15/1601
Processing review 16/1601
Processing review 17/1601
Processing review 18/1601
Processing review 19/1601
Processing review 20/1601
Processing review 21/1601
Processing review 22/1601
Processing review 23/1601
Processing review 24/1601
Processing review 25/1601
Processing review 26/1601
Processing review 27/1601
Processing review 28/1601
Processing review 29/1601
Processing review 30/1601
Processing review 31/1601
Processing review 32/1601
Processing review 33/1601
Processing review 34/1601
Processing review 35/1601
Processing review 36/1601
Processing review 37/1601
Processing review 38/1601
Processing review 39/
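
The loop above pairs parser rows with `word` nodes by position, so a sentence for which the parser returns fewer rows than there are words is annotated only in part before the exception is raised. One possible safeguard, sketched below, compares the two counts before writing anything. `annotate_sentence` is a hypothetical helper, and the standard-library `xml.etree.ElementTree` is used only to keep the example dependency-free; lxml elements provide the same `iter` and `set` calls.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml in this sketch


def annotate_sentence(sentence_node, parsed_sentence):
    """Copy postag/head/deprel onto each <word>; return False if the token counts differ."""
    word_nodes = list(sentence_node.iter('word'))
    if len(word_nodes) != len(parsed_sentence):
        return False  # leave the sentence untouched rather than misalign it
    for word_node, columns in zip(word_nodes, parsed_sentence):
        word_node.set('postag', columns[3])
        word_node.set('head', columns[5])
        word_node.set('deprel', columns[6])
    return True


sentence = ET.fromstring(
    '<sentence><tokens><word form="Esta"/><word form="teste"/></tokens></sentence>')
rows = [['1', 'Esta', '_', 'PRON', '_', '2', 'nsubj'],
        ['2', 'teste', '_', 'NOUN', '_', '0', 'ROOT']]

ok = annotate_sentence(sentence, rows)              # counts match, attributes are set
too_short = annotate_sentence(sentence, rows[:1])   # counts differ, nothing is written
print(ok, too_short)
```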

In [8]:
# save the corpus with new annotation
etree.ElementTree(xmldoc.getroot()).write(OUTPUT_CORPUS, 
                                          encoding='utf8', 
                                          xml_declaration=True, 
                                          pretty_print=True)

In [9]:
# Let's check the output corpus
print(etree.tounicode(xmldoc, pretty_print=True)[:1000])

<Reviews>
  <Review rid="0" book_title="O-Apanhador-no-Campo-de-Centeio" score="4.0">
    <sentences>
      <sentence id="0:0" place="title" polarity="neutral">
        <text>Está provado: Pode existir um livro bom sem uma história boa</text>
        <tokens>
          <word id="0" form="Está" postag="VERB" obj="O" opinion="O" base="estar" morf="Y2s" from="0" to="3" head="0" deprel="ROOT"/>
          <word id="1" form="provado" postag="ADJ" obj="O" opinion="O" base="provar" morf="Kms" from="5" to="11" head="1" deprel="acomp"/>
          <word id="2" form=":" postag="." obj="O" opinion="O" base=":" morf="" from="12" to="12" head="1" deprel="p"/>
          <word id="3" form="Pode" postag="AUX" obj="O" opinion="O" base="poder" morf="Y2s" from="14" to="17" head="5" deprel="aux"/>
          <word id="4" form="existir" postag="VERB" obj="O" opinion="O" base="existir" morf="W3s" from="19" to="25" head="1" deprel="parataxis"/>
          <word id="5" form="um" postag="DET" obj="O" opinion="O" b
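
In the output above, `head` keeps the parser's 1-based token index (with `0` for the root), while the `id` attribute in the sample counts from 0. The sketch below shows one way to map a `head` value back to a word `id`, assuming every sentence numbers its words from 0 as this sample does. `head_word_id` is a hypothetical helper, and the two words are copied from the sample with only the attributes needed here.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml in this sketch


def head_word_id(word_node):
    """Return the 0-based id of the head word, or None when the word is the root."""
    head = int(word_node.get('head'))
    return None if head == 0 else head - 1


sentence = ET.fromstring(
    '<tokens>'
    '<word id="0" form="Está" head="0" deprel="ROOT"/>'
    '<word id="1" form="provado" head="1" deprel="acomp"/>'
    '</tokens>')

# Look up each word's head by id and print the dependency as "word <- head"
forms = {int(w.get('id')): w.get('form') for w in sentence.iter('word')}
for w in sentence.iter('word'):
    head_id = head_word_id(w)
    print(w.get('form'), '<-', 'ROOT' if head_id is None else forms[head_id])
```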