<a href="https://colab.research.google.com/github/kylehiroyasu/opinion-lab-group-1.3/blob/master/notebooks/Load_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup Notebook

In [1]:
import os
from pathlib import Path
import sys
colab = False
import warnings
warnings.filterwarnings('ignore')

In [2]:
if colab:
    from getpass import getpass
    import urllib.parse  # import the submodule explicitly; a bare `import urllib` does not guarantee it is loaded
    from google.colab import output

    user = input('User name: ')
    password = getpass('Password: ')
    # Percent-encode the password for use in a URL; safe='' also escapes '/',
    # which quote() leaves untouched by default.
    password = urllib.parse.quote(password, safe='')
    repo_name = "kylehiroyasu/opinion-lab-group-1.3"

    cmd_string = 'git clone https://{0}:{1}@github.com/{2}.git'.format(user, password, repo_name)

    os.system(cmd_string)
    # Removing the password from the variable
    cmd_string, password = "", "" 

    # Remove the output of this cell (removes authentication information)
    output.clear()
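
The cell above percent-encodes the password before embedding it in the clone URL. As a minimal sketch of what that encoding does (the password below is a made-up example, not a real credential), note that `urllib.parse.quote` leaves `/` unescaped unless `safe=''` is passed:

```python
from urllib.parse import quote, unquote

# Hypothetical password containing characters that are special in a URL.
example_password = "p@ss/w:rd"

# Default behaviour: '/' is treated as safe and is NOT escaped.
encoded_default = quote(example_password)

# With safe='' every reserved character is escaped, including '/'.
encoded_strict = quote(example_password, safe="")

print(encoded_default)
print(encoded_strict)

# Decoding restores the original string in both cases.
decoded = unquote(encoded_strict)
```

For a password placed in the `user:password@host` part of a URL, the stricter form is the safer choice, since an unescaped `/` would end the host part early.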

Change into the repository directory and pull the latest changes (if any). This is only needed on Google Colab.

In [3]:
if colab:
    %cd opinion-lab-group-1.3/
    ! git pull
    ! ls

**Execute** the next cell only if you are running **locally** and are in the `notebooks` directory. It is not needed on Google Colab.

In [4]:
%cd ..
! ls

/home/ibes222/Documents/Master/NLPLab/GitHub
data  notebooks  opinion  README.md  requirements.txt  src


In [5]:
if colab:
    %pip install -r requirements.txt
    output.clear()

## Constants

In [None]:
ROOT = Path(os.getcwd())
DATA = ROOT/'data'
SRC =  ROOT/'src'
RAW_DATA = DATA/'raw'
RAW_FILES = [
    'ABSA16_Laptops_Train_SB1.xml',
    'ABSA16_Laptops_Test_SB1_GOLD.xml',
    'ABSA16_Restaurants_Train_SB1.xml',
    'ABSA16_Restaurants_Test_SB1_GOLD.xml'
]
print(ROOT)

In [7]:
sys.path.append(str(SRC))

## Imports

In [8]:
import numpy as np
import preprocess

## Data Import and Preprocessing

All the data is stored in `data/raw` as XML files. The format is hierarchical, with information stored in tags and tag attributes.

To make the data easier to work with, we've created functionality to denormalize the datasets into flat tables with one row per opinion.
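
As a rough illustration of what denormalizing means here, the sketch below flattens a small XML snippet into one row per opinion. It is not the implementation of `preprocess.load_data_as_df`; the sample data is invented, and the tag and attribute names (`Review`, `sentence`, `Opinion`, `category`, `OutOfScope`) are assumptions about the file layout chosen so that the rows match the columns shown in the sample further down.

```python
import xml.etree.ElementTree as ET

# Invented sample; the tag/attribute names are assumptions about the layout.
SAMPLE_XML = """<Reviews>
  <Review rid="r1">
    <sentences>
      <sentence id="r1:0">
        <text>The food was great but pricey.</text>
        <Opinions>
          <Opinion category="FOOD#QUALITY" polarity="positive"/>
          <Opinion category="FOOD#PRICES" polarity="negative"/>
        </Opinions>
      </sentence>
      <sentence id="r1:1" OutOfScope="TRUE">
        <text>Anyway.</text>
      </sentence>
    </sentences>
  </Review>
</Reviews>"""


def flatten_reviews(xml_string):
    """Return one flat dict per opinion; a sentence without opinions yields one row."""
    rows = []
    root = ET.fromstring(xml_string)
    for review in root.iter("Review"):
        for sentence in review.iter("sentence"):
            # Fields shared by every opinion of this sentence.
            base = {
                "rid": review.get("rid"),
                "id": sentence.get("id"),
                "text": sentence.findtext("text"),
                "outofscope": sentence.get("OutOfScope"),
            }
            opinions = sentence.findall("Opinions/Opinion")
            if not opinions:
                rows.append({**base, "entity": None, "attribute": None, "polarity": None})
            for opinion in opinions:
                # "FOOD#QUALITY" -> entity "FOOD", attribute "QUALITY"
                entity, _, attribute = opinion.get("category", "").partition("#")
                rows.append({**base, "entity": entity, "attribute": attribute,
                             "polarity": opinion.get("polarity")})
    return rows


rows = flatten_reviews(SAMPLE_XML)
for row in rows:
    print(row)
```

A sentence with several opinions is repeated once per opinion, which is why the same `id` and `text` can appear in consecutive rows of the resulting table.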

In [9]:
laptops_train = preprocess.load_data_as_df(RAW_DATA/RAW_FILES[0])
laptops_test = preprocess.load_data_as_df(RAW_DATA/RAW_FILES[1])

restaurants_train = preprocess.load_data_as_df(RAW_DATA/RAW_FILES[2])
restaurants_test = preprocess.load_data_as_df(RAW_DATA/RAW_FILES[3])

### Sample

In [10]:
restaurants_train.head()

|   | rid | entity | attribute | polarity | id | text | outofscope |
|---|-----|--------|-----------|----------|----|------|------------|
| 0 | 1004293 | RESTAURANT | GENERAL | negative | 1004293:0 | Judging from previous posts this used to be a ... | |
| 1 | 1004293 | SERVICE | GENERAL | negative | 1004293:1 | We, there were four of us, arrived at noon - t... | |
| 2 | 1004293 | SERVICE | GENERAL | negative | 1004293:2 | They never brought us complimentary noodles, i... | |
| 3 | 1004293 | FOOD | QUALITY | negative | 1004293:3 | The food was lousy - too sweet or too salty an... | |
| 4 | 1004293 | FOOD | STYLE_OPTIONS | negative | 1004293:3 | The food was lousy - too sweet or too salty an... | |


# Model Training



In [11]:
import time
import math

import torch as t
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, BertEmbeddings

from Dataset import dfToDataset, dfToBinarySamplingDatasets
from Trainer import Trainer

In [12]:
binary_sampling = False
train_attributes = False
if binary_sampling:
    target_class = "AMBIENCE"

In [13]:
laptop_entities = {"BATTERY": 0, "COMPANY": 1, "CPU": 2, "DISPLAY": 3, "FANS_COOLING": 4, "GRAPHICS": 5, "HARDWARE": 6, "HARD_DISC": 7, "KEYBOARD": 8, "LAPTOP": 9, "MEMORY": 10, "MOTHERBOARD": 11, "MOUSE": 12, "MULTIMEDIA_DEVICES": 13, "OPTICAL_DRIVES": 14, "OS": 15, "PORTS": 16, "POWER_SUPPLY": 17, "SHIPPING": 18, "SOFTWARE": 19, "SUPPORT": 20, "WARRANTY": 21, "NaN": 22}
laptop_attributes = {"CONNECTIVITY": 0, "DESIGN_FEATURES": 1, "GENERAL": 2, "MISCELLANEOUS": 3, "OPERATION_PERFORMANCE": 4,"PORTABILITY": 5, "PRICE": 6, "QUALITY": 7, "USABILITY": 8, "NaN": 9}
restaurant_entities = {"AMBIENCE": 0, "DRINKS": 1, "FOOD": 2, "LOCATION": 3, "RESTAURANT": 4, "SERVICE": 5, "NaN": 6}
restaurant_attributes = {"GENERAL": 0, "MISCELLANEOUS": 1, "PRICES": 2, "QUALITY": 3, "STYLE_OPTIONS": 4, "NaN": 5}

embeddings = WordEmbeddings('glove')
hidden_dim = 100
output_dim = len(restaurant_entities) if not binary_sampling else 1

train_set = restaurants_train
test_set = restaurants_test
entities = restaurant_entities
attributes = restaurant_attributes

if not binary_sampling:
    train_dataset = dfToDataset(train_set, entities, attributes, embeddings)
    test_dataset = dfToDataset(test_set, entities, attributes, embeddings)
else:
    train_dataset, other_train_dataset = dfToBinarySamplingDatasets(train_set, train_attributes, 
                                                                    target_class, embeddings)
    test_dataset, other_test_dataset = dfToBinarySamplingDatasets(test_set, train_attributes, 
                                                                    target_class, embeddings)

The next cell trains the model with the given parameters. Training in this step is purely unsupervised unless the `with_supervised` parameter is set to `True`, so without it no classification scores are available here.

In [14]:
param = {
    "embedding_dim": hidden_dim,
    "output_dim": output_dim,
    "epochs": 2,
    "lr": 0.005,
    "batch_size": 256,
    "use_padding": False,
    "validation_percentage": 0.1,
    "binary_sampling_percentage": 0.5,
    "cuda": False,
    "use_kcl": True,
    "with_supervised": False,
    "use_micro_average": True,
    "train_entities": not train_attributes
}

if binary_sampling:
    trainer = Trainer(train_dataset, param, other_train_dataset)
else:
    trainer = Trainer(train_dataset, param)
model = trainer.train()

('Using CPU',)
('Epoch:', 0)


AssertionError: Input dimension must be 2

To map the learned representation to class labels, you can now add a linear layer followed by a softmax or sigmoid. Calling `trainer.train_classifier` does this automatically by appending those layers to the end of the previously trained network. Pass `freeze=True` to freeze the parameters of that network. To change the training parameters, assign new values in the parameter dict and pass it as `new_param`.

In [None]:
param["lr"] = 0.01
param["epochs"] = 2
model = trainer.train_classifier(freeze=True, new_param=param)
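
The general idea behind this step can be sketched in plain PyTorch. This is not the code of `Trainer.train_classifier`: the `base` network below is a hypothetical stand-in for the model from the unsupervised step, and the layer sizes are chosen only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the network trained in the unsupervised step.
base = nn.Sequential(nn.Linear(100, 7), nn.Tanh())

# Freeze the base network so that only the new head is updated.
for p in base.parameters():
    p.requires_grad = False

# Append a linear layer and a softmax that map the base output to class probabilities.
num_classes = 7
classifier = nn.Sequential(base, nn.Linear(7, num_classes), nn.Softmax(dim=1))

# Hand only the still-trainable parameters (the new head) to the optimizer.
trainable = [p for p in classifier.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=0.01)

# Forward pass on a random batch of 4 inputs with 100 features each.
x = torch.randn(4, 100)
probs = classifier(x)
print(probs.shape)
```

Each row of `probs` sums to one, and only the weights and bias of the appended linear layer receive gradient updates.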