# Split ReLi Corpus into Train and Test Parts

* The SemEval ABSA 2015 corpus has 255 reviews (72%) in its trainset and 97 reviews (28%) in its testset
* The SemEval ABSA 2016 corpus has 351 reviews (79%) in its trainset and 91 reviews (21%) in its testset
* The ReLi corpus has 1601 reviews in total
* To make evaluation on ReLi comparable with SemEval, we split the ReLi corpus into a trainset and a testset
* The ReLi trainset therefore gets 1200 reviews (75%) and the ReLi testset gets 401 reviews (25%)
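
The percentages in the list above can be re-derived from the review counts. The following is a minimal sketch of that arithmetic; the `splits` dictionary and the `train_share` helper are names introduced here for illustration only and are not part of the original notebook.

```python
# Review counts quoted in the list above: (train, test) per corpus.
splits = {
    "SemEval ABSA 2015": (255, 97),
    "SemEval ABSA 2016": (351, 91),
    "ReLi (planned)": (1200, 401),
}


def train_share(train, test):
    """Return the trainset share of a corpus as a rounded percentage."""
    return round(100 * train / (train + test))


for name, (train, test) in splits.items():
    share = train_share(train, test)
    print(f"{name}: {train + test} reviews, {share}% train / {100 - share}% test")
```

A 3:1 split of ReLi lands between the two SemEval ratios, which is why 75% / 25% is a reasonable target here.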

## Splitting ReLi.xml

In [1]:
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
reviews = etree.parse('../corpus/ReLi.xml', parser)

In [2]:
# Count the reviews in the corpus
len(list(reviews.iter('Review')))

1601

In [3]:
# Split into train and test as follows (index is 0-based):
# the 1st review (index 0) goes to the testset, the 2nd, 3rd and 4th to the trainset,
# the 5th (index 4) to the testset, the 6th, 7th and 8th to the trainset, and so on
train = etree.Element('Reviews')
test = etree.Element('Reviews')
for index, review in enumerate(reviews.iter('Review')):
    if index % 4 != 0:
        train.append(review)
    else:
        test.append(review)
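
To see which positions the `index % 4` rule in the cell above actually selects, it can help to run it on plain integers first. This is only an illustrative sketch: `split_indices` is a hypothetical helper, not something the notebook defines.

```python
def split_indices(n, period=4):
    """Split range(n) with the same rule as the loop above:
    index % period == 0 goes to test, everything else to train."""
    train = [i for i in range(n) if i % period != 0]
    test = [i for i in range(n) if i % period == 0]
    return train, test


# Small case: indices 0 and 4 (the 1st and 5th items) land in test.
train_idx, test_idx = split_indices(8)
print(train_idx)  # [1, 2, 3, 5, 6, 7]
print(test_idx)   # [0, 4]

# Full corpus size: 1601 reviews give 1200 train and 401 test.
train_idx, test_idx = split_indices(1601)
print(len(train_idx), len(test_idx))  # 1200 401
```

Because index 0 is sent to the testset, the testset gets the extra review when the corpus size is not a multiple of 4, which is how 1601 reviews yield 401 test reviews rather than 400.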

In [4]:
len(list(train.iter('Review')))

1200

In [5]:
len(list(test.iter('Review')))

401

In [6]:
# Save to train and test files
etree.ElementTree(train).write('../corpus/ReLi_train.xml', encoding='utf8', xml_declaration=True, pretty_print=True)
etree.ElementTree(test).write('../corpus/ReLi_test.xml', encoding='utf8', xml_declaration=True, pretty_print=True)

## Splitting ReLiPalavras.xml

In [2]:
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
reviews = etree.parse('../corpus/ReLiPalavras.xml', parser)

In [3]:
# Split into train and test as follows (index is 0-based):
# the 1st review (index 0) goes to the testset, the 2nd, 3rd and 4th to the trainset,
# the 5th (index 4) to the testset, the 6th, 7th and 8th to the trainset, and so on
train = etree.Element('Reviews')
test = etree.Element('Reviews')
for index, review in enumerate(reviews.iter('Review')):
    if index % 4 != 0:
        train.append(review)
    else:
        test.append(review)

In [4]:
len(list(train.iter('Review')))

1200

In [5]:
len(list(test.iter('Review')))

401

In [6]:
# Save to train and test files
etree.ElementTree(train).write('../corpus/ReLiPalavras_train.xml', encoding='utf8', xml_declaration=True, pretty_print=True)
etree.ElementTree(test).write('../corpus/ReLiPalavras_test.xml', encoding='utf8', xml_declaration=True, pretty_print=True)

## Splitting ReLiUniversalDependencies.xml

In [12]:
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
reviews = etree.parse('../corpus/ReLiUniversalDependencies.xml', parser)

In [13]:
# Split into train and test as follows (index is 0-based):
# the 1st review (index 0) goes to the testset, the 2nd, 3rd and 4th to the trainset,
# the 5th (index 4) to the testset, the 6th, 7th and 8th to the trainset, and so on
train = etree.Element('Reviews')
test = etree.Element('Reviews')
for index, review in enumerate(reviews.iter('Review')):
    if index % 4 != 0:
        train.append(review)
    else:
        test.append(review)

In [14]:
len(list(train.iter('Review')))

1200

In [15]:
len(list(test.iter('Review')))

401

In [16]:
# Save to train and test files
etree.ElementTree(train).write('../corpus/ReLiUniversalDependencies_train.xml', 
                               encoding='utf8', xml_declaration=True, pretty_print=True)
etree.ElementTree(test).write('../corpus/ReLiUniversalDependencies_test.xml', 
                              encoding='utf8', xml_declaration=True, pretty_print=True)
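
The three sections above repeat the same cells with a different file name each time. One possible way to factor the split into a single function is sketched below. It is an assumption-laden illustration rather than the notebook's own code: `split_reviews` is a hypothetical helper, the toy corpus and its `id` attribute are made up, and the sketch uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra dependencies. Reviews are copied rather than moved, so the source tree is left intact.

```python
import copy
import xml.etree.ElementTree as ET


def split_reviews(root, tag="Review", period=4):
    """Return (train, test) roots. Every `period`-th review, starting
    with the first one (index 0), goes to test; the rest go to train."""
    train = ET.Element("Reviews")
    test = ET.Element("Reviews")
    for index, review in enumerate(root.iter(tag)):
        target = test if index % period == 0 else train
        # Copy the review so the source tree stays untouched.
        target.append(copy.deepcopy(review))
    return train, test


# Toy corpus with 9 reviews; the `id` attribute is invented for this demo.
toy = ET.fromstring(
    "<Reviews>" + "".join(f'<Review id="{i}"/>' for i in range(9)) + "</Reviews>"
)
train, test = split_reviews(toy)
print([r.get("id") for r in train.iter("Review")])  # ['1', '2', '3', '5', '6', '7']
print([r.get("id") for r in test.iter("Review")])   # ['0', '4', '8']
```

Since the rule depends only on a review's position, applying the same function to ReLi.xml, ReLiPalavras.xml and ReLiUniversalDependencies.xml should assign corresponding reviews to the same side of the split, provided the three files list the reviews in the same order.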