### Script to produce the ReLi corpus in XML format with UNITEX lemmas and morphological features
   
This format is similar to the SemEval 2015 ABSA Slot 2 dataset, with the addition of tokenization and word-level annotations.

***Corpus reference:***

Freitas, C., Motta, E., Milidiú, R., & Cesar, J. (2012).
Vampiro que brilha... rá! Desafios na anotação de opinião em um corpus
de resenhas de livros. Proceedings do XI Encontro de Linguística de Corpus (XI ELC). São Carlos - SP.

http://www.linguateca.pt/Repositorio/ReLi/
     
***SemEval 2015 ABSA Slot 2 dataset format reference:***

Pontiki, M., Galanis, D., Papageorgiou, H., Manandhar, S., & Androutsopoulos, I. (2015, June). Semeval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Association for Computational Linguistics, Denver, Colorado (pp. 486-495).
   
http://alt.qcri.org/semeval2015/task12/


### SemEval XML format style

```xml
<Reviews>
    <Review rid="">
        <sentences>
            <sentence id="">
                <text>....</text>
                <Opinions>
                    <Opinion target="" category="" polarity="" from="" to=""/>
                </Opinions>
            </sentence>
        </sentences>
    </Review>
</Reviews>
    
```
### Final XML format style
We add a new 'tokens' tag that keeps the corpus tokens in a format similar to MaltParser output.

```xml
<Reviews>
    <Review rid="">
        <sentences>
            <sentence id="">
                <text>....</text>
                <Opinions>
                    <Opinion target="" category="" polarity="" from="" to=""/>
                </Opinions>
                <tokens>
                    <word id="" form="" from="" to="" base="" morf="" postag="" obj="" opinion=""/>
                </tokens>
            </sentence>
        </sentences>
    </Review>
</Reviews>
```
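
As a rough illustration of the target layout, the sketch below builds one sentence with a single token. It uses Python's standard-library `xml.etree.ElementTree` only to stay dependency-free (an assumption of this example; the notebook itself uses `lxml`), and every attribute value is a placeholder rather than corpus data.

```python
import xml.etree.ElementTree as ET

# Build the nesting Reviews > Review > sentences > sentence.
reviews = ET.Element('Reviews')
review = ET.SubElement(reviews, 'Review', rid='0')
sentences = ET.SubElement(review, 'sentences')
sentence = ET.SubElement(sentences, 'sentence', id='0:0')

# Sentence text, then the new <tokens> container with one <word> child.
ET.SubElement(sentence, 'text').text = 'livro bom'
tokens = ET.SubElement(sentence, 'tokens')
# 'from' is a Python keyword, so the attributes are passed as a dict.
ET.SubElement(tokens, 'word', {
    'id': '0', 'form': 'livro', 'from': '0', 'to': '4',
    'base': 'livro', 'morf': 'ms', 'postag': 'N', 'obj': 'O', 'opinion': 'O',
})

xml_text = ET.tostring(reviews, encoding='unicode')
print(xml_text)
```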

## 1. Convert the corpus to XML format

In [1]:
# Parameters
reli_folder = '../corpus/ReLi/'
output_xml = '../corpus/ReLi.xml'

# some imports needed
import enchant # pip install -U pyenchant
import unicodedata
spell_checker = enchant.Dict('pt_BR') # apt install aspell-pt-br

In [2]:
# build the <tokens> node from a grid of values in a CoNLL-like format (the format ReLi is distributed in)
def make_tokens_node(grid):               
    tokens_node = etree.Element('tokens')
    for token_id, (word, pos, obj, opinion, polarity) in enumerate(grid):
        
        # build word node        
        word_node = etree.SubElement(tokens_node, 'word')
        word_node.set('id', str(token_id))        
        word_node.set('form', word)
        word_node.set('postag', pos)
        word_node.set('obj', obj)
        word_node.set('opinion', opinion)       
            
    return tokens_node

In [3]:
def make_sentence_node(grid, review_id, sent_id, sent_place):
    sent_node = etree.Element('sentence')
    sent_node.set('id', str(review_id) + ':' + str(sent_id))                        
    sent_node.set('place', sent_place)
    polarity = grid[0][-1]
    if polarity == '-':
        sent_node.set('polarity', 'negative')
    elif polarity == '+':
        sent_node.set('polarity', 'positive')
    else:
        sent_node.set('polarity', 'neutral')
    tokens_node = make_tokens_node(grid)
    sent_node.append(tokens_node)
    return sent_node    

In [4]:
import os
import codecs

# should install lxml: pip install -U lxml
from lxml import etree

# construct the xml
xmldoc = etree.Element('Reviews')

# Read files from reli_folder to build the xml
for filename in os.listdir(reli_folder):
    if filename.endswith('.txt'):
        filepath = os.path.join(reli_folder, filename)
        with codecs.open(filepath, encoding='utf8') as fp:
            book_name = ''
            sent_place = ''
            sent_id = 0
            grid = []
            
            for line_num, line in enumerate(fp):
                line = line.strip()

                # check if line contains #Livro
                if line.startswith('#Livro_'):                    
                    book_name = line[7:]

                # check if line contains #Resenha
                elif line.startswith('#Resenha_'):
                    review_node = etree.SubElement(xmldoc, 'Review')
                    review_id = line[9:]                    
                    review_node.set('rid', review_id)
                    review_node.set('book_title', book_name)
                    sents_node = etree.SubElement(review_node, 'sentences')
                    sent_id = 0

                # check if line contains #nota
                elif line.startswith('#Nota_'):
                    score = line[6:]
                    review_node.set('score', score)

                # check if line contains #Título
                elif line.startswith('#Título'):                    
                    sent_place = 'title'

                # check if line contains #Corpo
                elif line.startswith('#Corpo'):                    
                    sent_place = 'body'

                # check if line contains 6 elements, this is the corpus data
                elif len(line.split('\t')) == 6:                    
                    word, pos, obj, opinion, polarity, _ = line.split('\t')
                    grid.append((word, pos, obj, opinion, polarity))

                # break lines indicate sentence end
                elif len(line) == 0:
                    if grid:
                        sent_node = make_sentence_node(grid, review_id, sent_id, sent_place)                        
                        sents_node.append(sent_node)
                        grid = []
                        sent_id += 1
                        
                # ignore initial line
                elif line.startswith('[features'):
                    continue
                
                # any line that does not match the branches above is reported as an error
                else:
                    print('Problem parsing the line {} from filename {}, found: {}'.format(line_num, filename, line))
            if grid:
                sent_node = make_sentence_node(grid, review_id, sent_id, sent_place)                        
                sents_node.append(sent_node)                   

Problem parsing the line 11545 from filename ReLi-Sheldon.txt, found: N	O	O	O	O
Problem parsing the line 11546 from filename ReLi-Sheldon.txt, found: ADJ	O	O	O	O


The two lines presenting errors have no word: they start directly with the PoS tag.
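
The column check used by the loop can be sketched in isolation as follows. The helper name `parse_reli_line` and the sample lines are hypothetical; the second sample mimics the reported case, where the word column is missing and only five fields remain.

```python
def parse_reli_line(line):
    """Return (word, pos, obj, opinion, polarity), or None if the line is malformed."""
    fields = line.strip().split('\t')
    # A well-formed token line has exactly 6 tab-separated columns;
    # the last column is ignored, as in the conversion loop.
    if len(fields) != 6 or not fields[0]:
        return None
    return tuple(fields[:5])

good = parse_reli_line('livro\tN\tO\tO\tO\tO')
bad = parse_reli_line('N\tO\tO\tO\tO')  # word column missing: only 5 fields
print(good)
print(bad)
```

Returning `None` instead of printing lets the caller decide whether to skip, log, or repair a malformed line.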

In [5]:
# Save to xml file
etree.ElementTree(xmldoc).write(output_xml, encoding='utf8', xml_declaration=True, pretty_print=True)

# Check the output
print(etree.tounicode(xmldoc, pretty_print=True)[:10000])

<Reviews>
  <Review rid="0" book_title="O-Apanhador-no-Campo-de-Centeio" score="4.0">
    <sentences>
      <sentence id="0:0" place="title" polarity="neutral">
        <tokens>
          <word id="0" form="Está" postag="V" obj="O" opinion="O"/>
          <word id="1" form="provado" postag="PCP" obj="O" opinion="O"/>
          <word id="2" form=":" postag=":" obj="O" opinion="O"/>
          <word id="3" form="Pode" postag="VAUX" obj="O" opinion="O"/>
          <word id="4" form="existir" postag="V" obj="O" opinion="O"/>
          <word id="5" form="um" postag="ART" obj="O" opinion="O"/>
          <word id="6" form="livro" postag="N" obj="O" opinion="O"/>
          <word id="7" form="bom" postag="ADJ" obj="O" opinion="O"/>
          <word id="8" form="sem" postag="PREP" obj="O" opinion="O"/>
          <word id="9" form="uma" postag="ART" obj="O" opinion="O"/>
          <word id="10" form="história" postag="N" obj="O" opinion="O"/>
          <word id="11" form="boa" postag="ADJ" obj="O" 

## 1. Add lemma and morphological features using Unitex PB

Unitex DELAF is a dictionary containing the word, lemma, and morphological features for the Brazilian Portuguese vocabulary.

Unitex Delaf PB can be downloaded at 

http://www.nilc.icmc.usp.br/nilc/projects/unitex-pb/web/files/DELAF_PB_v2.zip

The Unitex PB project can be visited at:

http://www.nilc.icmc.usp.br/nilc/projects/unitex-pb/web/index.html
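
The reading cell below treats each dictionary line as `word,lemma.POS[:morf]`. A minimal sketch of that split is shown here; the helper name `parse_delaf_entry` and the two sample entries are illustrative assumptions, not lines copied from the dictionary.

```python
def parse_delaf_entry(line):
    """Split one 'word,lemma.POS[+extra][:morf]' entry into its parts."""
    word, info = line.strip().split(',', 1)
    lemma, info = info.split('.', 1)
    parts = info.split(':')
    # keep only the first tag when several are joined with '+'
    postag = parts[0].split('+')[0]
    # keep the morphological code only when exactly one is present
    morf = parts[1] if len(parts) == 2 else ''
    return word, lemma, postag, morf

print(parse_delaf_entry('livros,livro.N:mp'))
print(parse_delaf_entry('sem,sem.PREP'))
```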

In [6]:
# Reading unitex file
import gzip

unitex = dict()
unitex['all'] = dict()

with gzip.open('Delaf2015v04.dic.gz') as fp:
    for line in fp:
        line = line.decode('utf8')
        word, info = line.split(',')
        lemma, info = info.split('.')
        postag = info.split(':')[0].strip()
        if len(info.split(':')) == 2:
            morf = info.split(':')[1].strip()
        else:
            morf = ''
        
        # get the first pos tag in case of multiple
        postag = postag.split('+')[0]
        
        if postag not in unitex:
            unitex[postag] = dict()
            
        # no disambiguation, get the last value present in the dict
        # keep the words organized by postag
        unitex[postag][word] = (lemma, morf)
        
        # keep the 'all' lexicon, despite the tag
        unitex['all'][word] = (lemma, morf)

In [7]:
# Check for unitex PoS Tags
print ('Unitex PoS tags: \n' + '\n'.join(sorted(unitex.keys() - ['all'])))

Unitex PoS tags: 
A
ADV
CONJ
DET
INTERJ
N
PFX
PREP
PREPXADV
PREPXDET
PREPXPREP
PREPXPRO
PRO
PROXPRO
V


In [8]:
# Check for ReLi PoS Tags
reli_postags = set([word_node.get('postag') for word_node in xmldoc.iter('word')])

print ('ReLi PoS tags: \n' + '\n'.join(sorted(reli_postags)))

ReLi PoS tags: 
!
"
$
'
(
)
,
-
.
..
...
/
//
:
;
=
?
ADJ
ADV
ADV-KS
ADV-KS-REL
ART
CUR
IN
KC
KS
N
NPROP
NUM
O
PCP
PDEN
PREP
PRO-KS
PRO-KS-REL
PROADJ
PROPESS
PROSUB
V
VAUX
[


In [9]:
# Unitex tags adjectives as A while ReLi uses ADJ; alias it so lookups by ReLi tag work.
unitex['ADJ'] = unitex['A']

#### Look up each word and PoS tag in Unitex, correcting the spelling when needed

In [10]:
for word_node in xmldoc.iter('word'):
    # lower case word for normalization
    word = word_node.get('form').lower()
    postag = word_node.get('postag')
    base = word
    morf = ''
    
    
    # check if word is in unitex and if not try to correct the spelling
    if word not in unitex['all']:
        # only correct a missing accent, which is common in social media text
        suggestions = spell_checker.suggest(word)
            
        if len(suggestions) >= 1:
            suggestion = suggestions[0]

            # strip accents from both forms, keeping them as str (not bytes)
            nfkd_form = unicodedata.normalize('NFKD', word)
            plain_word = nfkd_form.encode('ASCII', 'ignore').decode('ASCII')
            
            nfkd_form = unicodedata.normalize('NFKD', suggestion)
            plain_suggestion = nfkd_form.encode('ASCII', 'ignore').decode('ASCII')

            # accept the suggestion only if it differs from the word by accents alone
            if plain_suggestion == plain_word:
                word = suggestion
    
    # check if word is under unitex postag
    if postag in unitex and word in unitex[postag]:
        base, morf = unitex[postag][word]
    
    # check if word is in under other unitex postag
    elif word in unitex['all']:
        base, morf = unitex['all'][word]
        
    word_node.set('base', base)
    word_node.set('morf', morf)
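
The accent-only correction rule can be tried on its own with the sketch below. The helper names `strip_accents` and `accent_only_fix` are hypothetical, and the suggestion is passed in directly instead of coming from the spell checker.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining accent marks removed (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ASCII', 'ignore').decode('ASCII')

def accent_only_fix(word, suggestion):
    """Return suggestion if it differs from word only by accents, else word."""
    if strip_accents(word) == strip_accents(suggestion):
        return suggestion
    return word

print(accent_only_fix('historia', 'história'))  # accent restored
print(accent_only_fix('livro', 'livre'))        # different word, left unchanged
```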

In [11]:
# Save to xml file
etree.ElementTree(xmldoc).write(output_xml, encoding='utf8', xml_declaration=True, pretty_print=True)

# Check the output
print(etree.tounicode(xmldoc, pretty_print=True)[:1000])

<Reviews>
  <Review rid="0" book_title="O-Apanhador-no-Campo-de-Centeio" score="4.0">
    <sentences>
      <sentence id="0:0" place="title" polarity="neutral">
        <tokens>
          <word id="0" form="Está" postag="V" obj="O" opinion="O" base="estar" morf="Y2s"/>
          <word id="1" form="provado" postag="PCP" obj="O" opinion="O" base="provar" morf="Kms"/>
          <word id="2" form=":" postag=":" obj="O" opinion="O" base=":" morf=""/>
          <word id="3" form="Pode" postag="VAUX" obj="O" opinion="O" base="poder" morf="Y2s"/>
          <word id="4" form="existir" postag="V" obj="O" opinion="O" base="existir" morf="W3s"/>
          <word id="5" form="um" postag="ART" obj="O" opinion="O" base="um" morf="ms"/>
          <word id="6" form="livro" postag="N" obj="O" opinion="O" base="livro" morf="ms"/>
          <word id="7" form="bom" postag="ADJ" obj="O" opinion="O" base="bom" morf="ms"/>
          <word id="8" form="sem" postag="PREP" obj="O" opinion="O" base="sem" morf=""/>

## 2. Add the untokenized sentence and the start, end offsets

In [12]:
# basic detokenizer function
import re
def detokenizer(sentence):
    # punctuation to keep close to the left word: . , .. ... : ? ! ;
    sentence = re.sub(r' (\.|,|\.\.|\.\.\.|:|\?|!|;)',
                      r'\1', sentence, flags=re.U)

    # punctuation to keep attached to both neighbouring words: -se -me
    sentence = re.sub(r'(\w) (-) (a|as|o|os|se|me|te|vos|lhe|lha|lhes|lhas|na|nas|no|nos|la|las|lo|los)( |$)',
                      r'\1\2\3\4', sentence, flags=re.U)

    # punctuation to keep in context (x) "x" 'x'
    sentence = re.sub(r'\( ?(.*?) ?\)', r'(\1)', sentence, flags=re.U)
    sentence = re.sub(r'" ?(.*?) ?"', r'"\1"', sentence, flags=re.U)
    sentence = re.sub(r"' ?(.*?) ?'", r"'\1'", sentence, flags=re.U)

    # correct emoticons!
    sentence = sentence.replace(': )', ' :)')
    sentence = sentence.replace(': (', ' :(')

    return sentence
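
To see what the first rule does in isolation, the sketch below applies only the punctuation pattern from `detokenizer`. The helper name `attach_punctuation` and the sample sentence are assumptions for illustration.

```python
import re

def attach_punctuation(sentence):
    """Remove the space before . , .. ... : ? ! ; (first detokenizer rule only)."""
    return re.sub(r' (\.|,|\.\.|\.\.\.|:|\?|!|;)', r'\1', sentence, flags=re.U)

result = attach_punctuation('Está provado : Pode existir um livro bom .')
print(result)
```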

In [13]:
for sent_node in xmldoc.iter('sentence'):
    
    # Rebuild an approximation of the original sentence by detokenizing the word forms.
    sentence = ' '.join(word_node.get('form') for word_node in sent_node.iter('word'))
    sentence = detokenizer(sentence)
    
    # insert node text in the first position, before tokens    
    text_node = etree.Element('text')
    text_node.text = sentence
    sent_node.insert(0, text_node)
    
    index = 0
    for word_node in sent_node.iter('word'):
        word = word_node.get('form')
        if sentence[index] == word[0]:
            start = index
            index += len(word) - 1
        elif sentence[index + 1] == word[0]:
            start = index + 1
            index += len(word) - 1 + 1        
        else:
            print('Problem computing start offset for word {} in sentence: {}'.format(word, sentence))
            continue
                    
        if sentence[index] == word[-1]:
            end = index
            index += 1
        else:
            print('Problem computing end offset for word {} in  sentence: {}'.format(word, sentence))
            continue
            
        word_node.set('from', str(start))
        word_node.set('to', str(end))
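
An alternative way to obtain the same inclusive `from`/`to` offsets is to search for each token from a moving cursor with `str.find`. This is a sketch of a different technique, not the loop used above, and the helper name `token_offsets` is hypothetical.

```python
def token_offsets(sentence, words):
    """Return inclusive (start, end) offsets per word, or None when a word is not found."""
    offsets = []
    cursor = 0
    for word in words:
        start = sentence.find(word, cursor)
        if start == -1:
            # the detokenizer changed the text around this token
            offsets.append(None)
            continue
        end = start + len(word) - 1  # inclusive end, matching the 'to' attribute
        offsets.append((start, end))
        cursor = end + 1
    return offsets

sentence = 'Está provado: Pode existir'
words = ['Está', 'provado', ':', 'Pode', 'existir']
print(token_offsets(sentence, words))
```

Searching from the cursor keeps repeated words in order, since each match must start after the previous token ends.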

In [14]:
# Save to xml file
etree.ElementTree(xmldoc).write(output_xml, encoding='utf8', xml_declaration=True, pretty_print=True)

# Check the output
print(etree.tounicode(xmldoc, pretty_print=True)[:1000])

<Reviews>
  <Review rid="0" book_title="O-Apanhador-no-Campo-de-Centeio" score="4.0">
    <sentences>
      <sentence id="0:0" place="title" polarity="neutral">
        <text>Está provado: Pode existir um livro bom sem uma história boa</text>
        <tokens>
          <word id="0" form="Está" postag="V" obj="O" opinion="O" base="estar" morf="Y2s" from="0" to="3"/>
          <word id="1" form="provado" postag="PCP" obj="O" opinion="O" base="provar" morf="Kms" from="5" to="11"/>
          <word id="2" form=":" postag=":" obj="O" opinion="O" base=":" morf="" from="12" to="12"/>
          <word id="3" form="Pode" postag="VAUX" obj="O" opinion="O" base="poder" morf="Y2s" from="14" to="17"/>
          <word id="4" form="existir" postag="V" obj="O" opinion="O" base="existir" morf="W3s" from="19" to="25"/>
          <word id="5" form="um" postag="ART" obj="O" opinion="O" base="um" morf="ms" from="27" to="28"/>
          <word id="6" form="livro" postag="N" obj="O" opinion="O" base="livro" mor

## 3. Add the opinions node with the aspect target ('obj')

In [15]:
def make_opinion_node(opinion_chunk, opinions):
    # opinion_chunk: list of (word, start, end, obj) tuples for one aspect target
    chunk = detokenizer(' '.join([w for w,b,e,o in opinion_chunk]))

    start = opinion_chunk[0][1]
    end = opinion_chunk[-1][2]
    obj = opinion_chunk[0][-1]
    # extract the OBJ id: OBJ00 -> 00
    obj = obj[-2:]

    if obj in opinions and opinions[obj] == '-':
        polarity = 'negative'
    elif obj in opinions and opinions[obj] == '+':
        polarity = 'positive'
    else:
        polarity = 'unknown'
        # note: sent_node is the loop variable of the calling cell
        print('OBJ polarity not defined for opinion chunk: {} in sentence id {}'.format(
                chunk, 
                sent_node.get('id')))

    # build a detached node; the caller appends it to its opinions node
    opinion_node = etree.Element('opinion')
    opinion_node.set('target', chunk)
    opinion_node.set('polarity', polarity)
    opinion_node.set('from', str(start))
    opinion_node.set('to', str(end))                
    
    return opinion_node

In [16]:
for sent_node in xmldoc.iter('sentence'):    
        
    opinions_node = etree.Element('opinions')
    
    # op00+, op01+, op01-, op02+
    opinions = {word_node.get('opinion')[2:4]:word_node.get('opinion')[4:5] 
                    for word_node in sent_node.iter('word') 
                        if word_node.get('opinion')!= 'O'}    
    
    opinion_chunk = []
    last_obj = 'O'
    for word_node in sent_node.iter('word'):
        word = word_node.get('form')
        start = word_node.get('from')
        end = word_node.get('to')
        obj = word_node.get('obj')
        
        if obj != 'O':
            if len(opinion_chunk) == 0:
                opinion_chunk.append((word, start, end, obj))
            elif obj == last_obj:
                opinion_chunk.append((word, start, end, obj))
            else:
                opinion_node = make_opinion_node(opinion_chunk, opinions)
                opinions_node.append(opinion_node)
                opinion_chunk = []
                opinion_chunk.append((word, start, end, obj))
            last_obj = obj
            continue
                        
        if len(opinion_chunk) >= 1:            
            opinion_node = make_opinion_node(opinion_chunk, opinions)
            opinions_node.append(opinion_node)
            opinion_chunk = []
                    
        last_obj = obj
        
    if len(opinion_chunk) >= 1:
        opinion_node = make_opinion_node(opinion_chunk, opinions)
        opinions_node.append(opinion_node)

    # len() of an element is its number of children
    if len(opinions_node) >= 1:
        # insert opinions_node in the second position, after text and before tokens
        sent_node.insert(1, opinions_node)

OBJ polarity not defined for opinion chunk: personagem que idealize situações de a mesma maneira revoltosa que tu as idealiza in sentence id 9:7
OBJ polarity not defined for opinion chunk: frases in sentence id 57:3
OBJ polarity not defined for opinion chunk: considerações in sentence id 57:3
OBJ polarity not defined for opinion chunk: Malu in sentence id 5:1
OBJ polarity not defined for opinion chunk: maluquices de a mãe de ela in sentence id 45:1
OBJ polarity not defined for opinion chunk: livro in sentence id 2:3
OBJ polarity not defined for opinion chunk: livro in sentence id 5:4
OBJ polarity not defined for opinion chunk: personagens in sentence id 8:8
OBJ polarity not defined for opinion chunk: lugares in sentence id 8:8
OBJ polarity not defined for opinion chunk: final in sentence id 26:0
OBJ polarity not defined for opinion chunk: trama in sentence id 10:11
OBJ polarity not defined for opinion chunk: David in sentence id 13:3
OBJ polarity not defined for opinion chunk: Tony in 

Some aspect targets (OBJ) have no polarity defined in the sentence by a predicate (opinion). In these cases, the polarity is set to 'unknown'.
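
The lookup behind this fallback can be sketched as below, assuming the tag shapes mentioned in the code comments (`op01+` for opinions, `OBJ01` for aspect targets). The helper names and the sample tag list are hypothetical.

```python
def opinion_polarities(opinion_tags):
    """Map opinion ids to '+' or '-' from tags such as 'op01+'; 'O' means no opinion."""
    return {tag[2:4]: tag[4:5] for tag in opinion_tags if tag != 'O'}

def polarity_for(obj_tag, polarities):
    """Resolve an aspect tag such as 'OBJ01' to positive, negative, or unknown."""
    sign = polarities.get(obj_tag[-2:])
    return {'+': 'positive', '-': 'negative'}.get(sign, 'unknown')

polarities = opinion_polarities(['O', 'op01+', 'op01+', 'O'])
print(polarity_for('OBJ01', polarities))  # id 01 has an opinion in this sentence
print(polarity_for('OBJ02', polarities))  # id 02 has none, so it falls back
```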

In [17]:
# Save to xml file
etree.ElementTree(xmldoc).write(output_xml, encoding='utf8', xml_declaration=True, pretty_print=True)

# Check the output
print(etree.tounicode(xmldoc, pretty_print=True)[:100000])

<Reviews>
  <Review rid="0" book_title="O-Apanhador-no-Campo-de-Centeio" score="4.0">
    <sentences>
      <sentence id="0:0" place="title" polarity="neutral">
        <text>Está provado: Pode existir um livro bom sem uma história boa</text>
        <tokens>
          <word id="0" form="Está" postag="V" obj="O" opinion="O" base="estar" morf="Y2s" from="0" to="3"/>
          <word id="1" form="provado" postag="PCP" obj="O" opinion="O" base="provar" morf="Kms" from="5" to="11"/>
          <word id="2" form=":" postag=":" obj="O" opinion="O" base=":" morf="" from="12" to="12"/>
          <word id="3" form="Pode" postag="VAUX" obj="O" opinion="O" base="poder" morf="Y2s" from="14" to="17"/>
          <word id="4" form="existir" postag="V" obj="O" opinion="O" base="existir" morf="W3s" from="19" to="25"/>
          <word id="5" form="um" postag="ART" obj="O" opinion="O" base="um" morf="ms" from="27" to="28"/>
          <word id="6" form="livro" postag="N" obj="O" opinion="O" base="livro" mor