<a href="https://colab.research.google.com/github/kylehiroyasu/opinion-lab-group-1.3/blob/master/notebooks/Load_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup Notebook

In [1]:
import os
from pathlib import Path
import sys
colab = False
import warnings
warnings.filterwarnings('ignore')

In [2]:
if colab:
    from getpass import getpass
    import urllib.parse
    from google.colab import output

    user = input('User name: ')
    password = getpass('Password: ')
    password = urllib.parse.quote(password)  # URL-encode the password so special characters survive in the clone URL
    repo_name = "kylehiroyasu/opinion-lab-group-1.3"

    cmd_string = 'git clone https://{0}:{1}@github.com/{2}.git'.format(user, password, repo_name)

    os.system(cmd_string)
    # Clear the credentials from the variables
    cmd_string, password = "", ""

    # Clear the output of this cell (removes authentication information)
    output.clear()

Change into the repository directory and pull the latest changes, if any. This is only needed on Google Colab.

In [3]:
if colab:
    %cd opinion-lab-group-1.3/
    ! git pull
    ! ls

Only **execute** the next cell if you are running **locally** and your working directory is the `notebooks` directory. This is not needed on Google Colab.

In [4]:
%cd ..
! ls

C:\Users\ibes222\Documents\Master\SS20\NLPLab\GitHub


The command "ls" is either misspelled or
could not be found.


In [5]:
if colab:
    %pip install -r requirements.txt
    output.clear()

## Constants

In [6]:
ROOT = Path(os.getcwd())
DATA = ROOT/'data'
SRC =  ROOT/'src'
RAW_DATA = DATA/'raw'
RAW_FILES = [
    'ABSA16_Laptops_Train_SB1.xml',
    'ABSA16_Laptops_Test_SB1_GOLD.xml',
    'ABSA16_Restaurants_Train_SB1.xml',
    'ABSA16_Restaurants_Test_SB1_GOLD.xml'
]
print(ROOT)

C:\Users\ibes222\Documents\Master\SS20\NLPLab\GitHub


In [7]:
sys.path.append(str(SRC))

## Imports

In [8]:
import numpy as np
import preprocess

## Data Import and Preprocessing

All the data is stored in `data/raw` as XML files. The format is hierarchical: information is stored in nested tags and in tag attributes.

To make the data easier to work with, we've created functionality to flatten (denormalize) the datasets into data frames.
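As a rough illustration of what flattening such a file involves, the sketch below walks a small XML string and emits one flat row per annotated opinion. It is not the code behind `preprocess.load_data_as_df`. The function name `flatten_reviews`, the sample string, and the assumed tag layout (`Review`, `sentence`, `Opinions`, `Opinion` with a `category` of the form `ENTITY#ATTRIBUTE`) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature input; the tag and attribute names are assumed,
# not read from the files in data/raw.
SAMPLE_XML = """
<Reviews>
  <Review rid="1004293">
    <sentences>
      <sentence id="1004293:3">
        <text>The food was lousy.</text>
        <Opinions>
          <Opinion category="FOOD#QUALITY" polarity="negative"/>
          <Opinion category="FOOD#STYLE_OPTIONS" polarity="negative"/>
        </Opinions>
      </sentence>
      <sentence id="1004293:4" OutOfScope="TRUE">
        <text>Not a review sentence.</text>
      </sentence>
    </sentences>
  </Review>
</Reviews>
"""


def flatten_reviews(xml_string):
    """Return one dict per opinion; sentences without opinions yield one row."""
    rows = []
    root = ET.fromstring(xml_string.strip())
    for review in root.iter("Review"):
        for sentence in review.iter("sentence"):
            # Fields shared by every opinion of this sentence.
            base = {
                "rid": review.get("rid"),
                "id": sentence.get("id"),
                "text": sentence.findtext("text"),
                "outofscope": sentence.get("OutOfScope"),
            }
            opinions = sentence.findall("./Opinions/Opinion")
            if not opinions:
                rows.append({**base, "entity": None, "attribute": None, "polarity": None})
            for opinion in opinions:
                # Split "ENTITY#ATTRIBUTE" into its two parts.
                entity, _, attribute = opinion.get("category", "").partition("#")
                rows.append({**base, "entity": entity, "attribute": attribute,
                             "polarity": opinion.get("polarity")})
    return rows


rows = flatten_reviews(SAMPLE_XML)
for row in rows:
    print(row["id"], row["entity"], row["attribute"], row["polarity"])
```

A list of dicts like this can be passed directly to `pandas.DataFrame(rows)` to obtain a table with one row per opinion.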

In [9]:
laptops_train = preprocess.load_data_as_df(RAW_DATA/RAW_FILES[0])
laptops_test = preprocess.load_data_as_df(RAW_DATA/RAW_FILES[1])

restaurants_train = preprocess.load_data_as_df(RAW_DATA/RAW_FILES[2])
restaurants_test = preprocess.load_data_as_df(RAW_DATA/RAW_FILES[3])

### Sample

In [10]:
restaurants_train.head()

| | rid | entity | attribute | polarity | id | text | outofscope |
|---|---|---|---|---|---|---|---|
| 0 | 1004293 | RESTAURANT | GENERAL | negative | 1004293:0 | Judging from previous posts this used to be a ... | |
| 1 | 1004293 | SERVICE | GENERAL | negative | 1004293:1 | We, there were four of us, arrived at noon - t... | |
| 2 | 1004293 | SERVICE | GENERAL | negative | 1004293:2 | They never brought us complimentary noodles, i... | |
| 3 | 1004293 | FOOD | QUALITY | negative | 1004293:3 | The food was lousy - too sweet or too salty an... | |
| 4 | 1004293 | FOOD | STYLE_OPTIONS | negative | 1004293:3 | The food was lousy - too sweet or too salty an... | |


# Model Training



In [11]:
import time
import math

import torch as t
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, BertEmbeddings

from Dataset import dfToDataset, dfToBinarySamplingDatasets
from Trainer import Trainer

In [12]:
binary_sampling = True
train_attributes = True
train_restaurant = True
if binary_sampling:
    target_class = "GENERAL"

In [13]:
laptop_entities = {"BATTERY": 0, "COMPANY": 1, "CPU": 2, "DISPLAY": 3, "FANS_COOLING": 4, "GRAPHICS": 5, "HARDWARE": 6, "HARD_DISC": 7, "KEYBOARD": 8, "LAPTOP": 9, "MEMORY": 10, "MOTHERBOARD": 11, "MOUSE": 12, "MULTIMEDIA_DEVICES": 13, "OPTICAL_DRIVES": 14, "OS": 15, "PORTS": 16, "POWER_SUPPLY": 17, "SHIPPING": 18, "SOFTWARE": 19, "SUPPORT": 20, "WARRANTY": 21, "NaN": 22}
laptop_attributes = {"CONNECTIVITY": 0, "DESIGN_FEATURES": 1, "GENERAL": 2, "MISCELLANEOUS": 3, "OPERATION_PERFORMANCE": 4,"PORTABILITY": 5, "PRICE": 6, "QUALITY": 7, "USABILITY": 8, "NaN": 9}
restaurant_entities = {"AMBIENCE": 0, "DRINKS": 1, "FOOD": 2, "LOCATION": 3, "RESTAURANT": 4, "SERVICE": 5, "NaN": 6}
restaurant_attributes = {"GENERAL": 0, "MISCELLANEOUS": 1, "PRICES": 2, "QUALITY": 3, "STYLE_OPTIONS": 4, "NaN": 5}

if train_restaurant:
    train_set = restaurants_train
    test_set = restaurants_test
    entities = restaurant_entities
    attributes = restaurant_attributes
else:
    train_set = laptops_train
    test_set = laptops_test
    entities = laptop_entities
    attributes = laptop_attributes
    
embeddings = WordEmbeddings('glove')
hidden_dim = 100
# This is the dimension of the output of the ABAE model, the classification model gets this as input
# It does not need to be related to the number of classes etc.
output_dim = len(attributes if train_attributes else entities)

We create datasets based on whether we want a direct binary output (which can be interpreted as a class assignment) or one output per class.
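To make the binary setup more concrete, here is a minimal sketch of one way to split samples into a target-class group and an "other" group. This is only an assumption about what the two return values of `dfToBinarySamplingDatasets` represent; the helper `split_by_target` and the row layout are hypothetical and do not come from `Dataset.py`.

```python
def split_by_target(rows, label_key, target_class):
    """Partition rows into (rows with the target label, all remaining rows)."""
    target, other = [], []
    for row in rows:
        if row[label_key] == target_class:
            target.append(row)
        else:
            other.append(row)
    return target, other


# Attribute labels taken from the sample rows shown earlier in this notebook.
sample_rows = [
    {"id": "1004293:0", "attribute": "GENERAL"},
    {"id": "1004293:1", "attribute": "GENERAL"},
    {"id": "1004293:2", "attribute": "GENERAL"},
    {"id": "1004293:3", "attribute": "QUALITY"},
    {"id": "1004293:3", "attribute": "STYLE_OPTIONS"},
]

target_rows, other_rows = split_by_target(sample_rows, "attribute", "GENERAL")
print(len(target_rows), len(other_rows))
```

With a split like this, a batch can mix samples of the target class with samples drawn from the remaining classes, so that a single output can be read as "belongs to the target class or not".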

In [14]:
if not binary_sampling:
    train_dataset = dfToDataset(train_set, entities, attributes, embeddings)
    test_dataset = dfToDataset(test_set, entities, attributes, embeddings)
else:
    train_dataset, other_train_dataset = dfToBinarySamplingDatasets(train_set, train_attributes, 
                                                                    target_class, embeddings)
    test_dataset, other_test_dataset = dfToBinarySamplingDatasets(test_set, train_attributes, 
                                                                    target_class, embeddings)

The next cell trains the model with the given parameters. Note that this training step is purely unsupervised, so no classification scores are available unless you set the `with_supervised` parameter.

Parameters:
- embedding_dim {int} -- the size of the input embeddings to the model
- output_dim {int} -- the output size of the ABAE model -> this can be varied
- classification_dim {int} -- the output size of the classification model trained afterwards. It receives output_dim as input and produces the classification (binary or all classes)
- epochs {int} -- number of training epochs
- lr {float} -- learning rate used
- batch_size {int} -- number of samples in a batch
- use_padding {bool} -- whether to use padding in the model; otherwise each sentence is processed one after the other
- validation_percentage {float in [0, 1]} -- fraction of train_dataset that is held out for validation
- binary_sampling_percentage {float in [0, 1]} -- how large the batch of samples from the other classes should be relative to the batch_size of same-class samples (only used with binary_sampling)
- cuda {bool} -- whether to use the GPU
- use_kcl {bool} -- whether to use the KCL objective function instead of MCL
- with_supervised {bool} -- whether to use an additional supervised objective while training ABAE
- use_micro_average {bool} -- whether to use micro averaging in the metric calculation; otherwise macro averaging is used
- train_entities {bool} -- whether to train on the entities (otherwise on the attributes)
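The difference that `use_micro_average` makes can be illustrated with a small, framework-free sketch. It computes micro- and macro-averaged F1 for single-label predictions; the function `f1_scores` and the toy labels are made up for this example and are not how `Trainer` computes its metrics.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one score.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute one score per class, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy example: a model that always predicts class 1.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1]
micro_f1, macro_f1 = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro_f1, macro_f1, accuracy)
```

When every sample has exactly one label and all classes are counted, each misclassification is one false positive and one false negative, so micro-averaged precision, recall, and F1 all equal the accuracy. Macro averaging weights every class equally and therefore penalizes a model that ignores a minority class.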

In [15]:
# params:
# embedding_dim {int} -- the size of the embeddings
param = {
    "embedding_dim": hidden_dim,
    "output_dim": output_dim,
    "classification_dim": len(attributes if train_attributes else entities) if not binary_sampling else 1,
    "epochs": 40,
    "lr": 0.001,
    "batch_size": 52,
    "use_padding": False,
    "validation_percentage": 0.1,
    "binary_sampling_percentage": 0.5,
    "cuda": False,
    "use_kcl": False,
    "with_supervised": False,
    "use_micro_average": True,
    "train_entities": not train_attributes
}

if binary_sampling:
    trainer = Trainer(train_dataset, param, other_train_dataset)
else:
    trainer = Trainer(train_dataset, param)
model = trainer.train()

('Using CPU',)
('Epoch:', 0)
('Train loss:', 20.320053100585938)
('Eval Loss:', 0.9791110157966614)
('Epoch:', 1)
('Train loss:', 18.40960693359375)
('Eval Loss:', 0.8752169013023376)
('Epoch:', 2)
('Train loss:', 17.263689041137695)
('Eval Loss:', 0.8596185445785522)
('Epoch:', 3)
('Train loss:', 16.16720199584961)
('Eval Loss:', 0.7803658246994019)
('Epoch:', 4)
('Train loss:', 14.978226661682129)
('Eval Loss:', 0.7452200055122375)
('Epoch:', 5)
('Train loss:', 14.31507682800293)
('Eval Loss:', 0.70160311460495)
('Epoch:', 6)
('Train loss:', 14.113266944885254)
('Eval Loss:', 0.7010901570320129)
('Epoch:', 7)
('Train loss:', 13.999372482299805)
('Eval Loss:', 0.7129247784614563)
('Epoch:', 8)
('Train loss:', 13.908411979675293)
('Eval Loss:', 0.703326940536499)
('Epoch:', 9)
('Train loss:', 13.861610412597656)
('Eval Loss:', 0.7094868421554565)
('Epoch:', 10)
('Train loss:', 13.796316146850586)
('Eval Loss:', 0.7100728750228882)
('Epoch:', 11)
('Train loss:', 13.776556968688965)
('Ev

You can now add a linear layer followed by a softmax/sigmoid to map the model output to classes. Calling `trainer.train_classifier` appends these layers to the end of the previously trained network. The parameters of that network can be frozen (`freeze=True`), and the training parameters can be changed by assigning new values in the parameter dict and passing it to the function.
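The mapping itself can be sketched without any framework. The example below applies a linear layer to a feature vector and then a sigmoid (one output, binary case) or a softmax (one output per class). The weights are fixed by hand purely for illustration. In the notebook they would be learned, and none of these helper functions are part of `Trainer`.

```python
import math


def linear(x, weights, bias):
    """Compute W @ x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Stand-in for the output of the frozen model (here of size 3).
features = [0.2, -1.0, 0.5]

# Binary case: a single output unit followed by a sigmoid.
binary_prob = sigmoid(linear(features, [[1.0, 0.5, -2.0]], [0.1])[0])

# Multi-class case: one output unit per class followed by a softmax.
class_probs = softmax(linear(features,
                             [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                             [0.0, 0.0]))

print(binary_prob, class_probs)
```

The sigmoid output can be thresholded (for example at 0.5) to decide membership in the target class, while the softmax output is a probability distribution over all classes.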

In [16]:
param["lr"] = 0.01
param["epochs"] = 40
model = trainer.train_classifier(freeze=True, new_param=param)

('Using CPU',)
('Epoch:', 0)
('Train loss:', 13.930377960205078)
{'precision': 0.5979381443298969, 'recall': 0.5979381443298969, 'f1': 0.5979381443298969}
('Eval Loss:', 0.7251308560371399)
('Epoch:', 1)
('Train loss:', 12.987274169921875)
{'precision': 0.5979381443298969, 'recall': 0.5979381443298969, 'f1': 0.5979381443298969}
('Eval Loss:', 0.7934264540672302)
('Epoch:', 2)
('Train loss:', 12.645646095275879)
{'precision': 0.5979381443298969, 'recall': 0.5979381443298969, 'f1': 0.5979381443298969}
('Eval Loss:', 0.841391384601593)
('Epoch:', 3)
('Train loss:', 12.51669692993164)
{'precision': 0.5979381443298969, 'recall': 0.5979381443298969, 'f1': 0.5979381443298969}
('Eval Loss:', 0.8508476614952087)
('Epoch:', 4)
('Train loss:', 12.375102996826172)
{'precision': 0.5979381443298969, 'recall': 0.5979381443298969, 'f1': 0.5979381443298969}
('Eval Loss:', 0.8268314599990845)
('Epoch:', 5)
('Train loss:', 12.330718994140625)
{'precision': 0.5979381443298969, 'recall': 0.5979381443298969