<a href="https://colab.research.google.com/github/kylehiroyasu/opinion-lab-group-1.3/blob/master/notebooks/Load_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing


The idea of our project is to evaluate text classifiers trained using only pairwise labels to identify entities and attributes.

The basic approach is as follows: each sentence is labeled with the entity and attribute it refers to. However, instead of training a classifier directly on these explicit labels, we will show the model two examples and tell it whether they contain the same entity/attribute or not.

This means that during preprocessing the most important fields to keep are:

- the sentence
- what entity it contains
- what attribute it contains
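
To make the pairwise setup concrete, here is a minimal sketch of how explicit labels could be turned into same/different pairs. The function name `make_pairs` and the tuple layout are assumptions for illustration, and the third sentence is invented; this is not the project's actual training code.

```python
from itertools import combinations

# Hypothetical toy rows: (sentence, entity, attribute).
# The third sentence is made up for this example.
examples = [
    ("This computer is absolutely AMAZING ! ! !", "LAPTOP", "GENERAL"),
    ("10 plus hours of battery ...", "BATTERY", "OPERATION_PERFORMANCE"),
    ("The laptop is great value", "LAPTOP", "PRICE"),
]

def make_pairs(rows, field):
    """Yield (sentence_a, sentence_b, same_label) for every unordered pair."""
    index = {"entity": 1, "attribute": 2}[field]
    for a, b in combinations(rows, 2):
        yield a[0], b[0], a[index] == b[index]

entity_pairs = list(make_pairs(examples, "entity"))
for sentence_a, sentence_b, same in entity_pairs:
    print(same, "|", sentence_a, "|", sentence_b)
```

The model would only ever see the boolean, never the entity or attribute name itself.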

In general, this means we will concentrate on the `train` and `GOLD` versions of the `SB1` datasets, because they contain the sentence-level information.

However, the following complicates the data model:

- a single text element can have multiple entities/attributes

Ideas for data model:

Since our data is small and we want to work on it in a pairwise sentence fashion, it would be easiest to completely denormalize and flatten the data down to the sentence level. I will start with one Python dictionary per sentence, and then we can transform the list of dictionaries into a pandas DataFrame.

Questions would be:

- How do we treat sentences with multiple entities and attributes?
- In the data annotation scheme there is a `general` attribute value. Should we use this?
- What about the out-of-scope data?
- In theory we can use sentences with no entity/attribute label as evidence that they match no given entity/attribute pair. Should this also be done? Does that mean that two sentences with no entity/attribute match each other?
- In terms of tokenization, should we use the same tokenization for both embeddings, or match how each model was trained?
- You initially suggested we start with token embeddings. For GloVe this is obviously the only choice, but more modern architectures also offer document embeddings as an alternative to compressing token embeddings.


For data preprocessing, we should probably consider the following high-level steps:

- General preprocessing
    - replacing '&amp;' with '&' for example
    - shortening words with repeated letters: 'the computer is slowwwww' -> 'the computer is slow'
    
    
- Embedding specific preprocessing:
    - GloVe:
        - tokenize and lowercase using Stanford Tokenizer
    - BERT:
        - ?
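
The two general preprocessing steps listed above can be sketched with the standard library alone. The function name `normalize` and the three-letter threshold are assumptions for illustration, not the behavior of the project's `preprocess` module.

```python
import html
import re

# Assumption: a run of three or more identical letters is treated as
# elongation and collapsed to a single letter.
_REPEATED_LETTERS = re.compile(r"([A-Za-z])\1{2,}")

def normalize(text):
    """Apply the two general clean-up steps to one sentence."""
    text = html.unescape(text)  # turns '&amp;' into '&'
    return _REPEATED_LETTERS.sub(r"\1", text)

print(normalize("the computer is slowwwww"))
print(normalize("fast &amp; quiet"))
```

This heuristic is lossy: an elongated word whose base form has a double letter, such as 'coool', is reduced to 'col' rather than 'cool', so a dictionary lookup may be needed if that matters.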
        
After more discussion, to keep things simple, the following will be done in this notebook:
- denormalize data down to the entity/attribute - sentence level
- generate statistics for attributes and entities
- start with standard tokenization


## Setup Notebook

This is the basic Notebook used for all our experiments. Before we can access our data, code and models, we first have to import them from GitHub.
See possibilities for authentication: https://stackoverflow.com/questions/48350226/methods-for-using-git-with-google-colab

In [1]:
import os
from pathlib import Path
import sys

In [0]:

from getpass import getpass
import urllib.parse
from google.colab import output

user = input('User name: ')
password = getpass('Password: ')
password = urllib.parse.quote(password) # your password is converted into url format
repo_name = "kylehiroyasu/opinion-lab-group-1.3"

cmd_string = 'git clone https://{0}:{1}@github.com/{2}.git'.format(user, password, repo_name)

os.system(cmd_string)
# Removing the password from the variable
cmd_string, password = "", "" 

# Remove the output of this cell (removes authentication information)
output.clear()

Change into the repository directory and pull the latest changes (if any). Display the directory contents and set the basic data paths.

In [6]:
%cd opinion-lab-group-1.3/
! git pull
! ls

[Errno 2] No such file or directory: 'opinion-lab-group-1.3/'
/Users/d440323/Personal/opinion-lab/github/opinion-lab-group-1.3/notebooks
Already up to date.
Load_Data.ipynb            Model_Training.ipynb
Model_Training-Copy1.ipynb Setup.ipynb


In [None]:
%pip install -r ../requirements.txt

In [None]:
!python -m spacy download en_core_web_sm

## Constants

In [2]:
ROOT = Path(os.getcwd())/'..'
DATA = ROOT/'data'
SRC =  ROOT/'src'
RAW_DATA = DATA/'raw'
RAW_FILES = [
    'ABSA16_Laptops_Train_SB1.xml',
    'ABSA16_Laptops_Test_SB1_GOLD.xml',
    'ABSA16_Restaurants_Train_SB1.xml',
    'ABSA16_Restaurants_Test_SB1_GOLD.xml'
]

INTERIM_DATA = DATA/'interim'

In [3]:
sys.path.append(str(SRC))

## Imports

In [4]:
import numpy as np
import preprocess

## Data Import and Preprocessing

All the data is stored in `data/raw` as `xml` files. The format is hierarchical, with information stored in tags and tag attributes.

To make the data easier to work with we've created functionality to denormalize the datasets.
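
As a rough illustration of what denormalizing means here, the sketch below flattens a small XML snippet into one dictionary per sentence/opinion pair. The tag and attribute names (`Review`, `sentence`, `Opinion`, `category`, `OutOfScope`) follow the SemEval-2016 SB1 layout as we understand it and should be treated as assumptions; the function `denormalize` is hypothetical and is not the implementation behind `preprocess.load_data_as_df`.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet in the assumed SemEval-2016 SB1 layout.
SAMPLE_XML = """
<Reviews>
  <Review rid="79">
    <sentences>
      <sentence id="79:0">
        <text>Being a PC user my whole life ....</text>
      </sentence>
      <sentence id="79:3">
        <text>super fast processor and really nice graphics ...</text>
        <Opinions>
          <Opinion category="CPU#OPERATION_PERFORMANCE" polarity="positive"/>
          <Opinion category="GRAPHICS#GENERAL" polarity="positive"/>
        </Opinions>
      </sentence>
    </sentences>
  </Review>
</Reviews>
"""

def denormalize(xml_string):
    """Return one dict per (sentence, opinion); unlabeled sentences get one row."""
    rows = []
    root = ET.fromstring(xml_string)
    for review in root.iter("Review"):
        for sentence in review.iter("sentence"):
            base = {
                "rid": review.get("rid"),
                "id": sentence.get("id"),
                "text": sentence.findtext("text"),
                "outofscope": sentence.get("OutOfScope"),
            }
            opinions = sentence.findall("./Opinions/Opinion")
            if not opinions:
                # Keep unlabeled sentences with empty label fields.
                rows.append({**base, "entity": None, "attribute": None, "polarity": None})
            for opinion in opinions:
                # The category is assumed to be encoded as ENTITY#ATTRIBUTE.
                entity, attribute = opinion.get("category").split("#")
                rows.append({**base, "entity": entity, "attribute": attribute,
                             "polarity": opinion.get("polarity")})
    return rows

rows = denormalize(SAMPLE_XML)
for row in rows:
    print(row["id"], row["entity"], row["attribute"])
```

A sentence with two opinions yields two rows with the same `id` and `text`, and the resulting list of dictionaries can be passed directly to `pandas.DataFrame(rows)`.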

In [5]:
laptops_train = preprocess.load_data_as_df(RAW_DATA/RAW_FILES[0], remove_stopwords=False, lemmatize=False)
laptops_test = preprocess.load_data_as_df(RAW_DATA/RAW_FILES[1])

restaurants_train = preprocess.load_data_as_df(RAW_DATA/RAW_FILES[2])
restaurants_test = preprocess.load_data_as_df(RAW_DATA/RAW_FILES[3])

### Sample

In [6]:
laptops_train.head()

Unnamed: 0,rid,id,text,entity,attribute,polarity,outofscope
0,79,79:0,Being a PC user my whole life ....,,,,
1,79,79:1,This computer is absolutely AMAZING ! ! !,LAPTOP,GENERAL,positive,
2,79,79:2,10 plus hours of battery ...,BATTERY,OPERATION_PERFORMANCE,positive,
3,79,79:3,super fast processor and really nice graphics ...,CPU,OPERATION_PERFORMANCE,positive,
4,79,79:3,super fast processor and really nice graphics ...,GRAPHICS,GENERAL,positive,


## Laptop Corpus Statistics

In [7]:
training_examples = laptops_train.shape[0]
testing_examples = laptops_test.shape[0]

'There are {} training examples, and {} testing examples'.format(training_examples, testing_examples)

'There are 3370 training examples, and 1036 testing examples'

In [8]:
laptops_test.pivot?

In [9]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [10]:
laptops_train[['entity', 'attribute']].pivot_table(index=['entity'], 
                                                   columns=['attribute'], 
                                                   aggfunc=np.size, 
                                                   fill_value=0)

attribute,CONNECTIVITY,DESIGN_FEATURES,GENERAL,MISCELLANEOUS,OPERATION_PERFORMANCE,PORTABILITY,PRICE,QUALITY,USABILITY
entity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
BATTERY,0,0,0,2,172,0,0,18,0
COMPANY,0,0,180,0,0,0,0,0,0
CPU,0,2,0,2,34,0,0,4,0
DISPLAY,0,56,48,0,28,0,0,106,12
FANS_COOLING,0,2,0,0,2,0,0,8,0
GRAPHICS,0,4,32,4,0,0,0,12,0
HARDWARE,0,0,4,0,2,0,0,8,0
HARD_DISC,0,28,0,0,0,0,0,22,0
KEYBOARD,0,58,26,0,18,0,0,36,42
LAPTOP,110,506,1268,284,556,102,272,448,282


In [11]:
tmp = laptops_train[['entity', 'attribute']].pivot_table(index=['entity'], 
                                                   columns=['attribute'], 
                                                   aggfunc=np.size, 
                                                   fill_value=0)

In [12]:
laptops_test[['entity', 'attribute']].pivot_table(index='entity', 
                                                  columns=['attribute'], 
                                                  aggfunc=np.size, 
                                                  fill_value=0)

attribute,CONNECTIVITY,DESIGN_FEATURES,GENERAL,MISCELLANEOUS,OPERATION_PERFORMANCE,PORTABILITY,PRICE,QUALITY,USABILITY
entity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
BATTERY,0,2,0,0,38,0,0,10,0
COMPANY,0,0,76,0,0,0,0,0,0
CPU,0,6,2,2,4,0,0,0,0
DISPLAY,0,26,8,0,16,0,0,40,10
FANS_COOLING,0,0,0,0,4,0,0,0,0
GRAPHICS,0,4,0,2,0,0,0,0,0
HARDWARE,0,0,0,0,0,0,0,8,0
HARD_DISC,0,26,4,0,10,0,0,2,0
KEYBOARD,0,14,4,0,18,0,0,14,12
LAPTOP,26,146,316,68,140,12,50,94,92


## Restaurant Corpus Statistics

In [13]:
training_examples = restaurants_train.shape[0]
testing_examples = restaurants_test.shape[0]

'There are {} training examples, and {} testing examples'.format(training_examples, testing_examples)

'There are 2799 training examples, and 948 testing examples'

In [14]:
restaurants_train[['entity', 'attribute']].pivot_table(index='entity', 
                                                       columns='attribute', 
                                                       aggfunc=np.size, 
                                                       fill_value=0)

attribute,GENERAL,MISCELLANEOUS,PRICES,QUALITY,STYLE_OPTIONS
entity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AMBIENCE,510,0,0,0,0
DRINKS,0,0,40,94,64
FOOD,0,0,180,1698,274
LOCATION,56,0,0,0,0
RESTAURANT,844,196,160,0,0
SERVICE,898,0,0,0,0
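
A note on reading the pivot tables in this notebook: every printed count is even, and the restaurant training cells sum to 5014 although the frame has only 2799 rows. That pattern would be explained by `np.size` counting both selected columns in each group, which doubles every cell. This is an inference from the printed numbers, not something verified against the pandas version used here. A minimal sketch of a count that is one per row, using `pd.crosstab` on a hypothetical toy frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with the same label columns as the denormalized data.
df = pd.DataFrame({
    "entity":    ["FOOD", "FOOD", "SERVICE", None],
    "attribute": ["QUALITY", "QUALITY", "GENERAL", None],
})

# Drop unlabeled rows explicitly so the total is easy to check.
labeled = df.dropna(subset=["entity", "attribute"])
counts = pd.crosstab(labeled["entity"], labeled["attribute"])
print(counts)
```

With this form the cell total equals the number of labeled rows, which gives a quick sanity check against `shape[0]`.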


In [15]:
restaurants_test[['entity', 'attribute']].pivot_table(index='entity', 
                                                      columns='attribute', 
                                                      aggfunc=np.size, 
                                                      fill_value=0)


attribute,GENERAL,MISCELLANEOUS,PRICES,QUALITY,STYLE_OPTIONS
entity,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AMBIENCE,132,0,0,0,0
DRINKS,0,0,8,44,24
FOOD,0,0,46,626,110
LOCATION,26,0,0,0,0
RESTAURANT,284,66,42,0,0
SERVICE,310,0,0,0,0


## Additional Corpus Statistics

In [29]:
import pandas as pd

In [30]:
def get_training_counts(df, column):
    """Count unique sentences with (positive) and without (negative) each value of `column`."""
    values = df[column].unique().tolist()
    counts = []
    for value in values:
        # Unique sentences that carry this label at least once
        target_df = df[df[column].isin([value])]
        target_df = target_df.drop_duplicates("text")
        target_text = target_df.text
        # Unique sentences that never carry this label
        other_df = df[~df.text.isin(target_text.tolist())]
        other_df = other_df.drop_duplicates("text")
        counts.append({'target':value,'positive_examples':target_df.shape[0], 'negative_examples':other_df.shape[0]})
    return pd.DataFrame(counts)

In [31]:
get_training_counts(restaurants_train, 'entity')

Unnamed: 0,target,positive_examples,negative_examples
0,RESTAURANT,560,1435
1,SERVICE,419,1576
2,FOOD,757,1238
3,DRINKS,79,1916
4,AMBIENCE,226,1769
5,,292,1703
6,LOCATION,28,1967


In [32]:
get_training_counts(restaurants_train, 'attribute')

Unnamed: 0,target,positive_examples,negative_examples
0,GENERAL,994,1001
1,QUALITY,716,1279
2,STYLE_OPTIONS,156,1839
3,PRICES,177,1818
4,MISCELLANEOUS,97,1898
5,,292,1703


## Save Changes to Env

In [16]:
%pip freeze > ../requirements.txt

Note: you may need to restart the kernel to use updated packages.
