# Get data for ABSA
---

**Data list**


\< FULL \>
1. SemEval-2016 task5 Subtask1 domain: restaurant, laptop (here)
2. SemEval-2016 task5 Subtask2 domain: restaurant, laptop (another notebook)




\< TRIAL \>
1. SemEval-2016 task5 Subtask1 domain: restaurant (trial)
2. SemEval-2016 task5 Subtask2 domain: restaurant (trial)
3. SemEval-2016 task5 Subtask1 domain: laptop (trial)
4. SemEval-2016 task5 Subtask2 domain: laptop (trial)




---
**Link**

[FULL: howardhsu's github](https://github.com/howardhsu/ABSA_preprocessing)  
[TRIAL: SemEval-2016 task5 data and tools](https://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools)  


---
**Task description**



---
**Study points**

1. Handling XML data with xml.etree.ElementTree
2. Parsing XML-structured data with BeautifulSoup (trial datasets)

TODO
- Find the SemEval-16 Subtask2 dataset
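
The first study point can be illustrated with a small, self-contained sketch. The XML string below is hand-written to mimic the layout that the code in this notebook assumes (`Reviews > Review > sentences > sentence`, each sentence holding `text` and `Opinions > Opinion`); it is not copied from the dataset files, and the helper name `iter_opinions` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that imitates the assumed SemEval-16 layout (not real file content).
SAMPLE_XML = """
<Reviews>
  <Review rid="1004293">
    <sentences>
      <sentence id="1004293:5">
        <text>Avoid this place!</text>
        <Opinions>
          <Opinion target="place" category="RESTAURANT#GENERAL" polarity="negative" from="11" to="16"/>
        </Opinions>
      </sentence>
    </sentences>
  </Review>
</Reviews>
"""

def iter_opinions(root):
    """Yield one dict per (sentence, opinion) pair, looking tags up by name."""
    for sentence in root.iter('sentence'):
        text = sentence.findtext('text')
        for opinion in sentence.iter('Opinion'):
            row = {'idx': sentence.get('id'), 'sentence': text}
            row.update(opinion.attrib)  # category, polarity, target, from, to
            yield row

root = ET.fromstring(SAMPLE_XML)
rows = list(iter_opinions(root))
print(rows[0]['category'], rows[0]['polarity'])
```

Looking elements up by tag name, rather than by child position, is one possible design choice; it keeps working if a sentence has no `Opinions` child, in which case the inner loop simply yields nothing.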

## SemEval full data from github
[howardhsu link](https://github.com/howardhsu/ABSA_preprocessing)

1. SemEval-16 Subtask1
2. SemEval-14

In [1]:
!git clone https://github.com/howardhsu/ABSA_preprocessing.git # only when not cloned.

Cloning into 'ABSA_preprocessing'...
remote: Enumerating objects: 117, done.
remote: Counting objects: 100% (117/117), done.
remote: Compressing objects: 100% (77/77), done.
remote: Total 117 (delta 55), reused 87 (delta 32), pack-reused 0
Receiving objects: 100% (117/117), 788.69 KiB | 1.47 MiB/s, done.
Resolving deltas: 100% (55/55), done.


In [7]:
import os
import xml.etree.ElementTree as ET

# Root of the SemEval data inside the cloned repository
DIR = 'ABSA_preprocessing/dataset/SemEval'

# SemEval-16 Subtask1 training files (restaurant and laptop domains)
path_16_rest = os.path.join(DIR, '16/rest/ABSA16_Restaurants_Train_SB1_v2.xml')
path_16_laptop = os.path.join(DIR, '16/laptop/ABSA16_Laptops_Train_SB1_v2.xml')
root_16_rest = ET.parse(path_16_rest).getroot()
root_16_laptop = ET.parse(path_16_laptop).getroot()

### SemEval-16 Subtask1
- Restaurant: sentence, category, polarity, target, span (from, to)
- Laptop: sentence, category, polarity
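
Judging from rows shown in this notebook's outputs, the `from`/`to` values of the restaurant data appear to be character offsets into the sentence, with `to` exclusive. The sketch below checks that reading on three rows copied from those outputs (one from the full data, two from the trial data); the helper name `span_matches` is hypothetical.

```python
# (sentence, target, from, to) rows copied from tables shown in this notebook
examples = [
    ("Avoid this place!", "place", 11, 16),
    ("Pizza here is consistently good.", "Pizza", 0, 5),
    ("You should pass on the calamari.", "calamari", 23, 31),
]

def span_matches(sentence, target, start, end):
    """Return True when sentence[start:end] reproduces the target string."""
    return sentence[start:end] == target

for sentence, target, start, end in examples:
    print(target, span_matches(sentence, target, start, end))
```

Rows without an explicit target carry `from` = `to` = 0 in the outputs, so a check like this would need to skip them.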

In [26]:
import pandas as pd

class semeval16_sub1:
    
    def __init__(self, path, data_type='restaurant'):
        self.root = ET.parse(path).getroot()
        self.data_type = data_type
        print('Element Tree parsed')
        print('Process:\n1)sentences_list\n2)make_dict\n3)export_csv')
        
    def sentences_list(self):
        """Collect every <sentence> element across all reviews."""
        self.sentences = list()
        for r in self.root:
            self.sentences += r[0].findall('sentence')
        print('sentences list has been made, length: {:,}'.format(len(self.sentences)))
        
    def make_dict(self):
        """Return a dict of column lists, one entry per sentence-opinion pair."""
        
        self.re_idx, self.idx, self.sent, self.category, self.polarity, self.target, self.fr, self.to =\
            list(), list(), list(), list(), list(), list(), list(), list()
        self.dict_ = dict()
        
        for index, contents in enumerate(self.sentences):
            
            if len(contents) != 2: # skip sentences without an <Opinions> child
                continue
                
            raw_idx = contents.attrib['id']
            text = contents[0].text
            opinions = contents[1].findall('Opinion')
            num_opinions = len(opinions)
            
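            # re_idx: position of the sentence in self.sentences
            self.re_idx += [index] * num_opinions
            # idx: original sentence id from the XML
            self.idx += [raw_idx] * num_opinions
            # sentence: text repeated once per opinion
            self.sent += [text] * num_opinions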
            
            if self.data_type == 'restaurant':
                for o in opinions:
                    self.category.append(o.attrib['category'])
                    self.polarity.append(o.attrib['polarity'])
                    self.target.append(o.attrib['target'])
                    self.fr.append(o.attrib['from'])
                    self.to.append(o.attrib['to'])
            else:
                for o in opinions:
                    self.category.append(o.attrib['category'])
                    self.polarity.append(o.attrib['polarity'])
        
        print('{:,} sentence-opinion pairs extracted.'.format(len(self.re_idx)))
        
        if self.data_type == 'restaurant':
            self.dict_['re_idx'], self.dict_['idx'], self.dict_['sentence'], self.dict_['category'],\
            self.dict_['polarity'], self.dict_['target'], self.dict_['from'], self.dict_['to'] = \
                self.re_idx, self.idx, self.sent, self.category, self.polarity, self.target, self.fr, self.to
        else:
            self.dict_['re_idx'], self.dict_['idx'], self.dict_['sentence'], self.dict_['category'], \
            self.dict_['polarity'] = \
                self.re_idx, self.idx, self.sent, self.category, self.polarity
        
        return self.dict_
    
    def export_csv(self, export_path):
        df = pd.DataFrame(self.dict_)
        df.to_csv(export_path, index=False)
        print('dataset has been exported')

#### Restaurant 16

In [27]:
rest16 = semeval16_sub1(path=path_16_rest, data_type='restaurant')

Element Tree parsed
Process:
1)sentences_list
2)make_dict
3)export_csv


In [28]:
rest16.sentences_list()
dict_rest16 = rest16.make_dict()
rest16.export_csv('semeval16_sub1_restaurant.csv')

sentences list has been made, length: 2,000
2,507 sentence-opinion pairs extracted.
dataset has been exported


In [30]:
df_rest16 = pd.read_csv('semeval16_sub1_restaurant.csv')
print(df_rest16.shape)
df_rest16.head(10)

(2507, 8)


Unnamed: 0,re_idx,idx,sentence,category,polarity,target,from,to
0,0,1004293:0,Judging from previous posts this used to be a ...,RESTAURANT#GENERAL,negative,place,51,56
1,1,1004293:1,"We, there were four of us, arrived at noon - t...",SERVICE#GENERAL,negative,staff,75,80
2,2,1004293:2,"They never brought us complimentary noodles, i...",SERVICE#GENERAL,negative,,0,0
3,3,1004293:3,The food was lousy - too sweet or too salty an...,FOOD#QUALITY,negative,food,4,8
4,3,1004293:3,The food was lousy - too sweet or too salty an...,FOOD#STYLE_OPTIONS,negative,portions,52,60
5,4,1004293:4,"After all that, they complained to me about th...",SERVICE#GENERAL,negative,,0,0
6,5,1004293:5,Avoid this place!,RESTAURANT#GENERAL,negative,place,11,16
7,6,1014458:0,"I have eaten at Saul, many times, the food is ...",FOOD#QUALITY,positive,food,38,42
8,7,1014458:1,Saul is the best restaurant on Smith Street an...,RESTAURANT#GENERAL,positive,Saul,0,4
9,8,1014458:2,The duck confit is always amazing and the foie...,FOOD#QUALITY,positive,foie gras terrine with figs,42,69


In [41]:
cat_rest = df_rest16['category'].value_counts()
print('the number of categories: {}'.format(len(cat_rest)))
cat_rest.head(5)

the number of categories: 12


FOOD#QUALITY          849
SERVICE#GENERAL       449
RESTAURANT#GENERAL    422
AMBIENCE#GENERAL      255
FOOD#STYLE_OPTIONS    137
Name: category, dtype: int64

#### Laptop 16

In [14]:
laptop16 = semeval16_sub1(path=path_16_laptop, data_type='laptop')

Element Tree parsed
Process:
1)sentences_list
2)make_dict
3)export_csv


In [15]:
laptop16.sentences_list()
dict_laptop16 = laptop16.make_dict()
laptop16.export_csv('semeval16_sub1_laptop.csv')

sentences list has been made, length: 2,500
2,909 sentence-opinion pairs extracted.
dataset has been exported


In [31]:
df_laptop16 = pd.read_csv('semeval16_sub1_laptop.csv')
print(df_laptop16.shape)
df_laptop16.head(10)

(2909, 5)


Unnamed: 0,re_idx,idx,sentence,category,polarity
0,1,79:1,This computer is absolutely AMAZING!!!,LAPTOP#GENERAL,positive
1,2,79:2,10 plus hours of battery...,BATTERY#OPERATION_PERFORMANCE,positive
2,3,79:3,super fast processor and really nice graphics ...,CPU#OPERATION_PERFORMANCE,positive
3,3,79:3,super fast processor and really nice graphics ...,GRAPHICS#GENERAL,positive
4,4,79:4,and plenty of storage with 250 gb(though I wil...,HARD_DISC#DESIGN_FEATURES,positive
5,5,79:5,This computer is really fast and I'm shocked a...,LAPTOP#OPERATION_PERFORMANCE,positive
6,5,79:5,This computer is really fast and I'm shocked a...,LAPTOP#USABILITY,positive
7,6,79:6,I've only had mine a day but I'm already used ...,LAPTOP#USABILITY,positive
8,8,79:8,GET THIS COMPUTER FOR PORTABILITY AND FAST PRO...,LAPTOP#PORTABILITY,positive
9,8,79:8,GET THIS COMPUTER FOR PORTABILITY AND FAST PRO...,CPU#OPERATION_PERFORMANCE,positive
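
In the table above, `re_idx` starts at 1 and jumps from 6 to 8. That is consistent with `make_dict` enumerating every sentence but skipping those without opinions, so skipped sentences leave gaps in the index. A toy illustration with made-up data (the list `sentences` and its contents are hypothetical, not taken from the dataset):

```python
# Each toy "sentence" is (text, list_of_opinion_categories); the data is made up.
sentences = [
    ("Bought it last week.", []),                      # no opinion -> skipped
    ("Nice graphics.", ["GRAPHICS#GENERAL"]),
    ("Fast and light.", ["LAPTOP#OPERATION_PERFORMANCE", "LAPTOP#PORTABILITY"]),
]

re_idx = []
for index, (text, opinions) in enumerate(sentences):
    if not opinions:  # plays the role of the `len(contents) != 2` check in make_dict
        continue
    re_idx += [index] * len(opinions)

print(re_idx)  # [1, 2, 2]
```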


In [40]:
cat_laptop = df_laptop16['category'].value_counts()
print('the number of categories: {}'.format(len(cat_laptop)))
cat_laptop.head(5)

the number of categories: 81


LAPTOP#GENERAL                  634
LAPTOP#OPERATION_PERFORMANCE    278
LAPTOP#DESIGN_FEATURES          253
LAPTOP#QUALITY                  224
LAPTOP#MISCELLANEOUS            142
Name: category, dtype: int64

-----
## Trial Datasets

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [3]:
url1 = 'https://alt.qcri.org/semeval2016/task5/data/uploads/trial-data/english-trial/restaurants_trial_english_sl.xml'
url2 = 'https://alt.qcri.org/semeval2016/task5/data/uploads/trial-data/english-trial/restaurants_trial_english_tl.xml'
url3 = 'https://alt.qcri.org/semeval2016/task5/data/uploads/trial-data/english-trial/laptops_trial_english_sl.xml'
url4 = 'https://alt.qcri.org/semeval2016/task5/data/uploads/trial-data/english-trial/laptops_trial_english_tl.xml'

In [4]:
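# Download each trial file and parse it.
# Note: the HTML-oriented 'lxml' parser lowercases tag names,
# so tags are looked up in lowercase in the cells below.
text = list()
urls = [url1, url2, url3, url4]

for url in urls:
    source = BeautifulSoup(requests.get(url).text, 'lxml')
    text.append(source)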

### Subtask1: sentence-level analysis
- **slot1**: Aspect Category detection: detection (classification) of predefined E#A pairs  
    * entity: \[food, service, .. \]  
    * attributes: \[taste, price, quality\]
- **slot2**: Opinion Target Expression (OTE): finding the expression that refers to the E#A pair (opinion span)
- **slot3**: Sentiment polarity: the sentiment toward the E#A pair (pos., neg., neu.)
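
Since each category label has the form `ENTITY#ATTRIBUTE`, the two parts of slot1 can be separated with a plain string split. A minimal sketch, using labels that appear in this notebook's outputs; the helper name `split_category` is hypothetical.

```python
from collections import Counter

# Category labels taken from the outputs in this notebook
categories = ["FOOD#QUALITY", "FOOD#PRICES", "SERVICE#GENERAL",
              "AMBIENCE#GENERAL", "DRINKS#QUALITY"]

def split_category(label):
    """Split an 'ENTITY#ATTRIBUTE' label into its two parts."""
    entity, attribute = label.split('#', 1)
    return entity, attribute

# Count how often each entity occurs in the list above
entity_counts = Counter(split_category(c)[0] for c in categories)
print(split_category("FOOD#QUALITY"))  # ('FOOD', 'QUALITY')
print(entity_counts["FOOD"])           # 2
```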

In [6]:
task1_sentences = text[0].find_all('sentence')
task1_sentences[2]

<sentence id="1090587:2">
<text>Add to that great service and great food at a reasonable price and you have yourself the beginning of a great evening.</text>
<opinions>
<opinion category="SERVICE#GENERAL" from="18" polarity="positive" target="service" to="25"></opinion>
<opinion category="FOOD#QUALITY" from="36" polarity="positive" target="food" to="40"></opinion>
<opinion category="FOOD#PRICES" from="36" polarity="positive" target="food" to="40"></opinion>
</opinions>
</sentence>

In [7]:
re_idx, idx, sent, category, polarity, target, fr, to = \
    list(), list(), list(), list(), list(), list(), list(), list()

for index, content in enumerate(task1_sentences):
    sentences = content.find('text')
    opinions = content.find_all('opinion')
    raw_id = content['id']
    num_opinions = len(opinions)

    # Repeat the sentence-level fields once per opinion
    re_idx += [index] * num_opinions               # running sentence position
    idx += [raw_id] * num_opinions                 # original sentence id
    sent += [sentences.get_text()] * num_opinions  # sentence text

    # Opinion-level fields
    for o in opinions:
        category.append(o['category'])
        polarity.append(o['polarity'])
        target.append(o['target'])
        fr.append(int(o['from']))  # span start (character offset)
        to.append(int(o['to']))    # span end (character offset)

In [8]:
dict_1 = dict()
dict_1['re_idx'] = re_idx
dict_1['idx'] = idx
dict_1['sentence'] = sent
dict_1['category'] = category
dict_1['polarity'] = polarity
dict_1['target'] = target
dict_1['from'] = fr
dict_1['to'] = to

In [9]:
subtask1 = pd.DataFrame(dict_1)
print(subtask1['category'].value_counts())
print(' ')
print(subtask1['polarity'].value_counts())
print(' ')
print(subtask1['target'].value_counts())
subtask1.sample(3)

FOOD#QUALITY                30
SERVICE#GENERAL             12
RESTAURANT#GENERAL           9
AMBIENCE#GENERAL             8
FOOD#PRICES                  2
LOCATION#GENERAL             1
RESTAURANT#PRICES            1
DRINKS#QUALITY               1
RESTAURANT#MISCELLANEOUS     1
DRINKS#STYLE_OPTIONS         1
Name: category, dtype: int64
 
positive    39
negative    24
neutral      3
Name: polarity, dtype: int64
 
NULL                                 11
food                                  8
service                               5
Service                               3
place                                 3
trattoria                             2
meal                                  2
pork shu mai                          1
Food                                  1
staff                                 1
calamari                              1
Guacamole+shrimp appetizer            1
regular menu-fare                     1
lamb glazed with balsamic vinegar     1
Decor                  

Unnamed: 0,re_idx,idx,sentence,category,polarity,target,from,to
4,3,1090587:3,The lava cake dessert was incredible and I rec...,FOOD#QUALITY,positive,lava cake dessert,4,21
22,18,1357554:4,It's a nice place to relax and have conversation.,AMBIENCE#GENERAL,positive,place,12,17
30,26,1500453:1,This quaint and romantic trattoria is at the t...,AMBIENCE#GENERAL,positive,trattoria,25,34


### Subtask2: ABSA at the whole-review level
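
Sentence ids in the outputs look like `<review id>:<sentence number>` (for example `1661043:0` or `Z#3:3`), so sentences can be grouped back into reviews by dropping the trailing counter. The sketch below assumes that id format holds throughout; the helper name `review_id` is hypothetical, and the sample pairs are copied from this notebook's outputs.

```python
from collections import defaultdict

# (sentence id, text) pairs copied from outputs shown in this notebook
sentences = [
    ("1661043:0", "Pizza here is consistently good."),
    ("1661043:1", "Salads are a delicious way to begin the meal."),
    ("Z#3:3", "Great sake!"),
]

def review_id(sentence_id):
    """Drop the trailing ':<n>' sentence counter to obtain the review id."""
    return sentence_id.rsplit(':', 1)[0]

grouped = defaultdict(list)
for sid, text in sentences:
    grouped[review_id(sid)].append(text)

# Join the sentences of each review with single spaces
reviews = {rid: ' '.join(texts) for rid, texts in grouped.items()}
print(reviews["1661043"])
```

`rsplit(':', 1)` splits on the last colon only, which keeps the id intact even if a review id were to contain a colon itself.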

In [10]:
task2_reviews = text[1].find_all('review')
task2_reviews

[<review rid="1090587">
 <sentences>
 <sentence id="1090587:0">
 <text>Just went here for my girlfriends 23rd bday.</text>
 </sentence>
 <sentence id="1090587:1">
 <text>If you've ever been along the river in Weehawken you have an idea of the top of view the chart house has to offer.</text>
 </sentence>
 <sentence id="1090587:2">
 <text>Add to that great service and great food at a reasonable price and you have yourself the beginning of a great evening.</text>
 </sentence>
 <sentence id="1090587:3">
 <text>The lava cake dessert was incredible and I recommend it.</text>
 </sentence>
 </sentences>
 <opinions>
 <opinion category="LOCATION#GENERAL" polarity="positive"></opinion>
 <opinion category="SERVICE#GENERAL" polarity="positive"></opinion>
 <opinion category="FOOD#QUALITY" polarity="positive"></opinion>
 <opinion category="FOOD#PRICES" polarity="positive"></opinion>
 <opinion category="RESTAURANT#GENERAL" polarity="positive"></opinion>
 </opinions>
 </review>,
 <review rid="1661043">

In [11]:
task2_reviews[8]

<review rid="Z#3">
<sentences>
<sentence id="Z#3:0">
<text>Excellent food, although the interior could use some help.</text>
</sentence>
<sentence id="Z#3:1">
<text>The space kind of feels like an Alice in Wonderland setting, without it trying to be that.</text>
</sentence>
<sentence id="Z#3:2">
<text>I paid just about $60 for a good meal, though :)</text>
</sentence>
<sentence id="Z#3:3">
<text>Great sake!</text>
</sentence>
</sentences>
<opinions>
<opinion category="FOOD#QUALITY" polarity="positive"></opinion>
<opinion category="AMBIENCE#GENERAL" polarity="negative"></opinion>
<opinion category="FOOD#PRICES" polarity="positive"></opinion>
<opinion category="DRINKS#QUALITY" polarity="positive"></opinion>
<opinion category="RESTAURANT#GENERAL" polarity="positive"></opinion>
</opinions>
</review>

In [12]:
re_idx, idx, content, cat, polarity = list(), list(), list(), list(), list()

for index, review in enumerate(task2_reviews):
    sentence = review.find_all('text')
    opinion = review.find_all('opinion')
    rid = review['rid']
    num_opinions = len(opinion)

    # Repeat the review-level fields once per opinion
    re_idx += [index] * num_opinions  # running review position
    idx += [rid] * num_opinions       # original review id

    # content: all sentences of the review joined by single spaces
    review_text = ' '.join(s.get_text() for s in sentence)
    content += [review_text] * num_opinions

    # category and polarity of each review-level opinion
    for o in opinion:
        cat.append(o['category'])
        polarity.append(o['polarity'])

In [13]:
dict_ = dict()
dict_['idx'] = re_idx
dict_['content'] = content
dict_['category'] = cat
dict_['polarity'] = polarity
dict_['raw_idx'] = idx

In [14]:
subtask2 = pd.DataFrame(dict_)
print(subtask2['category'].value_counts())
print(' ')
print(subtask2['polarity'].value_counts())
subtask2.sample(3)

FOOD#QUALITY                10
RESTAURANT#GENERAL          10
SERVICE#GENERAL              8
AMBIENCE#GENERAL             5
FOOD#PRICES                  2
LOCATION#GENERAL             1
RESTAURANT#PRICES            1
DRINKS#QUALITY               1
RESTAURANT#MISCELLANEOUS     1
DRINKS#STYLE_OPTIONS         1
Name: category, dtype: int64
 
positive    25
negative    10
conflict     4
neutral      1
Name: polarity, dtype: int64


Unnamed: 0,idx,content,category,polarity,raw_idx
28,7,I was very disappointed with this restaurant. ...,RESTAURANT#GENERAL,negative,1016296
22,5,Molto bene! This quaint and romantic trattoria...,FOOD#QUALITY,positive,1500453
2,0,Just went here for my girlfriends 23rd bday. I...,FOOD#QUALITY,positive,1090587


In [15]:
subtask1.iloc[:10]

Unnamed: 0,re_idx,idx,sentence,category,polarity,target,from,to
0,1,1090587:1,If you've ever been along the river in Weehawk...,LOCATION#GENERAL,positive,view,80,84
1,2,1090587:2,Add to that great service and great food at a ...,SERVICE#GENERAL,positive,service,18,25
2,2,1090587:2,Add to that great service and great food at a ...,FOOD#QUALITY,positive,food,36,40
3,2,1090587:2,Add to that great service and great food at a ...,FOOD#PRICES,positive,food,36,40
4,3,1090587:3,The lava cake dessert was incredible and I rec...,FOOD#QUALITY,positive,lava cake dessert,4,21
5,4,1661043:0,Pizza here is consistently good.,FOOD#QUALITY,positive,Pizza,0,5
6,5,1661043:1,Salads are a delicious way to begin the meal.,FOOD#QUALITY,positive,Salads,0,6
7,6,1661043:2,You should pass on the calamari.,FOOD#QUALITY,negative,calamari,23,31
8,7,1661043:3,It is thick and slightly soggy.,FOOD#QUALITY,negative,,0,0
9,8,1661043:4,Decor is charming.,AMBIENCE#GENERAL,positive,Decor,0,5


In [16]:
subtask1['sentence'][1]

'Add to that great service and great food at a reasonable price and you have yourself the beginning of a great evening.'

In [17]:
subtask2.iloc[:10]

Unnamed: 0,idx,content,category,polarity,raw_idx
0,0,Just went here for my girlfriends 23rd bday. I...,LOCATION#GENERAL,positive,1090587
1,0,Just went here for my girlfriends 23rd bday. I...,SERVICE#GENERAL,positive,1090587
2,0,Just went here for my girlfriends 23rd bday. I...,FOOD#QUALITY,positive,1090587
3,0,Just went here for my girlfriends 23rd bday. I...,FOOD#PRICES,positive,1090587
4,0,Just went here for my girlfriends 23rd bday. I...,RESTAURANT#GENERAL,positive,1090587
5,1,Pizza here is consistently good. Salads are a ...,FOOD#QUALITY,conflict,1661043
6,1,Pizza here is consistently good. Salads are a ...,AMBIENCE#GENERAL,positive,1661043
7,1,Pizza here is consistently good. Salads are a ...,SERVICE#GENERAL,neutral,1661043
8,1,Pizza here is consistently good. Salads are a ...,RESTAURANT#GENERAL,positive,1661043
9,2,sometimes i get good food and ok service. some...,FOOD#QUALITY,conflict,1349391


In [18]:
subtask2['content'][5]

'Pizza here is consistently good. Salads are a delicious way to begin the meal. You should pass on the calamari. It is thick and slightly soggy. Decor is charming. Service is average.'
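
Comparing the two levels for review 1661043: the sentence-level FOOD#QUALITY opinions in `subtask1` are positive, positive, negative, negative, while the review-level label in `subtask2` is `conflict`. The sketch below is a naive aggregation that happens to agree with that one example; it is an assumption for illustration, not the official annotation rule, and `aggregate_polarity` is a hypothetical helper.

```python
def aggregate_polarity(labels):
    """Naive heuristic: one distinct label -> keep it; mixed labels -> 'conflict'."""
    distinct = set(labels)
    if not distinct:
        raise ValueError("need at least one sentence-level label")
    if len(distinct) == 1:
        return distinct.pop()
    return "conflict"

# Sentence-level FOOD#QUALITY polarities of review 1661043, as listed in subtask1
food_quality = ["positive", "positive", "negative", "negative"]
print(aggregate_polarity(food_quality))  # conflict
print(aggregate_polarity(["positive"]))  # positive
```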