<a href="https://colab.research.google.com/github/oaarnikoivu/dissertation/blob/master/BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Detecting Emotions from Tweets with BERT on TF Hub

In [0]:
from sklearn.model_selection import train_test_split
from datetime import datetime

import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub 

print(tf.__version__)

We need to install BERT's Python package.

In [0]:
!pip install bert-tensorflow

In [3]:
import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization




# Data

Let's import the SemEval-2018 dataset and format it so that it can be fed into BERT.

In [5]:
from google.colab import drive 
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
def load_dataset(filename):
  dataset = pd.read_csv(filename, sep='\t')
  return dataset

In [0]:
file_path = '/content/drive/My Drive/datasets/'

train_df = load_dataset(file_path + '2018-E-c-En-train.txt')
validation_df = load_dataset(file_path + '2018-E-c-En-dev.txt')
test_df = load_dataset(file_path + '2018-E-c-En-test-gold.txt')

In [8]:
train_df.columns 

Index(['ID', 'Tweet', 'anger', 'anticipation', 'disgust', 'fear', 'joy',
       'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
      dtype='object')

Our input data is the `Tweet` column, and our labels are the eleven emotion columns.

In [0]:
ID = 'ID'  # column name as it appears in train_df.columns
DATA_COLUMN = 'Tweet'
LABEL_COLUMNS = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 
                 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust']
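
To make the text/label split concrete, here is a minimal sketch of how one row maps to a text string and a multi-hot label vector. The row is made up for illustration, and the sketch assumes each emotion column holds 0 or 1; `split_row` is a hypothetical helper, not part of the notebook.

```python
LABEL_COLUMNS = ['anger', 'anticipation', 'disgust', 'fear', 'joy',
                 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust']

# Hypothetical row laid out like the dataset columns (values are invented).
row = {'ID': 'example-1', 'Tweet': 'So happy the sun is finally out!'}
row.update({label: 0 for label in LABEL_COLUMNS})
row['joy'] = 1
row['optimism'] = 1


def split_row(row, text_column='Tweet', label_columns=LABEL_COLUMNS):
    """Return (text, multi-hot label list) for one row."""
    text = row[text_column]
    labels = [int(row[name]) for name in label_columns]
    return text, labels


text, labels = split_row(row)

# Decode the multi-hot vector back into emotion names.
active = [name for name, flag in zip(LABEL_COLUMNS, labels) if flag == 1]

print(text)
print(labels)
print(active)
```

Because the function only indexes the row by column name, the same idea should carry over to a pandas row, although that is not exercised here.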

# Data Preprocessing

Here we transform the data into a format that BERT understands. This involves two steps. First, we modify the *InputExample* class to allow for multiple labels.

- `text_a` is the text we want to classify, which in this case is the `Tweet` field in our DataFrame.
- `text_b` is used if we're training a model to understand the relationship between sentences (e.g. is `text_b` a translation of `text_a`? Is `text_b` an answer to the question asked by `text_a`?). This doesn't apply to our task, so we can leave `text_b` blank.
- `labels` holds the emotion labels for our example (anger, disgust, fear, joy, etc.).

In [0]:
class InputExample(object):
  """A single training/test example for multi-label sequence classification."""
  def __init__(self, guid, text_a, text_b=None, labels=None):
    """Constructs an InputExample.

    Args:
      guid: Unique id for the example.
      text_a: string. The untokenized text of the first sequence. For single
        sequence tasks, only this sequence must be specified.
      text_b: (Optional) string. The untokenized text of the second sequence.
        Only must be specified for sequence pair tasks.
      labels: (Optional) The labels of the example, one value per emotion
        category. This should be specified for train and dev examples, but
        not for test examples.
    """
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.labels = labels

In [0]:
train_InputExamples = train_df.apply(
    lambda x: InputExample(guid=None,
                           text_a=x[DATA_COLUMN],
                           text_b=None,
                           labels=x[LABEL_COLUMNS]),
    axis=1)

test_InputExamples = test_df.apply(
    lambda x: InputExample(guid=None,
                           text_a=x[DATA_COLUMN],
                           text_b=None,
                           labels=x[LABEL_COLUMNS]),
    axis=1)

validation_InputExamples = validation_df.apply(
    lambda x: InputExample(guid=None,
                           text_a=x[DATA_COLUMN],
                           text_b=None,
                           labels=x[LABEL_COLUMNS]),
    axis=1)
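
The three conversions above share the same shape, so they could be factored into one helper. The following is only a sketch: `to_input_examples` is a hypothetical name, the rows are invented, and the label set is shortened to three emotions to keep the example small. It also stores the labels as a plain list of ints, which is an assumption about what later steps expect.

```python
class InputExample(object):
    """Stand-in with the same fields as the multi-label class defined above."""
    def __init__(self, guid, text_a, text_b=None, labels=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.labels = labels


def to_input_examples(rows, text_column, label_columns, id_column=None):
    """Hypothetical helper: turn an iterable of rows into InputExample objects."""
    examples = []
    for row in rows:
        guid = row[id_column] if id_column is not None else None
        labels = [int(row[name]) for name in label_columns]
        examples.append(InputExample(guid=guid, text_a=row[text_column],
                                     text_b=None, labels=labels))
    return examples


# Made-up rows with a shortened label set, for illustration only.
label_columns = ['anger', 'joy', 'sadness']
rows = [
    {'ID': 'a1', 'Tweet': 'Best day ever!', 'anger': 0, 'joy': 1, 'sadness': 0},
    {'ID': 'a2', 'Tweet': 'Missed my train again.', 'anger': 1, 'joy': 0, 'sadness': 1},
]

examples = to_input_examples(rows, 'Tweet', label_columns, id_column='ID')
for example in examples:
    print(example.guid, example.text_a, example.labels)
```

With a DataFrame, the rows could presumably be supplied as `(row for _, row in train_df.iterrows())`, since each pandas row supports the same name-based indexing.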

Next, we preprocess our data so that it matches the format of the data BERT was trained on.

- This step follows the BERT TF Hub example notebook: https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb#scrollTo=IhJSe0QHNG7U
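
As a rough picture of what "matching BERT's format" means for a single sentence, the sketch below wraps tokens in `[CLS]` and `[SEP]`, maps them to ids, and pads to a fixed length while building the attention mask and segment ids. It is a simplified illustration rather than the library's own code; the vocabulary, its ids, and the function name `convert_tokens_to_features` are all made up.

```python
def convert_tokens_to_features(tokens, vocab, max_seq_length):
    """Build (input_ids, input_mask, segment_ids) for one single-sentence input."""
    # Reserve two positions for the [CLS] and [SEP] markers.
    tokens = tokens[:max_seq_length - 2]
    tokens = ['[CLS]'] + tokens + ['[SEP]']

    # Unknown tokens fall back to the [UNK] id.
    input_ids = [vocab.get(token, vocab['[UNK]']) for token in tokens]
    input_mask = [1] * len(input_ids)   # 1 marks real tokens, 0 marks padding
    segment_ids = [0] * len(input_ids)  # only one segment, since text_b is unused

    # Zero-pad all three lists up to max_seq_length.
    padding = [0] * (max_seq_length - len(input_ids))
    input_ids += padding
    input_mask += padding
    segment_ids += padding
    return input_ids, input_mask, segment_ids


# Toy vocabulary; real ids come from the vocab file shipped with the model.
toy_vocab = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3,
             'so': 4, 'happy': 5, 'today': 6}

input_ids, input_mask, segment_ids = convert_tokens_to_features(
    ['so', 'happy', 'today'], toy_vocab, max_seq_length=8)

print(input_ids)
print(input_mask)
print(segment_ids)
```

Every example ends up with lists of exactly `max_seq_length` entries, which is what lets examples be batched together.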


In [16]:
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

def create_tokenizer_from_hub_module():
  """Get the vocab file and casing info from the Hub module."""
  with tf.Graph().as_default():
    bert_module = hub.Module(BERT_MODEL_HUB)
    tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
    with tf.Session() as sess:
      vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                            tokenization_info["do_lower_case"]])
      
  return bert.tokenization.FullTokenizer(
      vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer_from_hub_module()

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [17]:
tokenizer.tokenize("Hello, here's an example of using the BERT tokenizer.")

['hello',
 ',',
 'here',
 "'",
 's',
 'an',
 'example',
 'of',
 'using',
 'the',
 'bert',
 'token',
 '##izer',
 '.']
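
The split of "tokenizer" into `token` and `##izer` comes from WordPiece, which greedily matches the longest vocabulary entry from the left and marks continuation pieces with `##`. Below is a minimal sketch of that matching step for a single, already lowercased word. The tiny vocabulary and the function name `wordpiece_tokenize` are assumptions for illustration, and the real tokenizer also performs lowercasing and punctuation splitting before this step.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Shrink the candidate substring from the right until it is in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                current = piece
                break
            end -= 1
        if current is None:
            # No piece matched at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(current)
        start = end
    return pieces


# Toy vocabulary; the real one has roughly thirty thousand entries.
toy_vocab = {'token', '##izer', 'bert', 'hello'}

print(wordpiece_tokenize('tokenizer', toy_vocab))
print(wordpiece_tokenize('bert', toy_vocab))
print(wordpiece_tokenize('xyz', toy_vocab))
```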