<a href="https://colab.research.google.com/github/kumar-abhishek/handson-ml2/blob/master/ECPE_with_pretrained_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Algorithm

1. Split the document into clauses.
2. Find embeddings for the clauses.
3. Feed the clause embeddings into a word-level Bi-LSTM layer, followed by an attention layer.
4. The output of that layer is copied into two components.
5. The first component performs emotion extraction and is a clause-level Bi-LSTM layer.
6. The second component performs cause extraction and is also a clause-level Bi-LSTM layer.
7. The loss Lp of the whole model is the weighted sum of the two component losses, Le (emotion) and Lc (cause):
   `Lp = n * Le + (1 - n) * Lc`
   where n is a hyperparameter.
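
The weighted loss in step 7 can be sketched in a few lines. This is only an illustration: the use of binary cross-entropy for each branch, the helper names `binary_cross_entropy` and `joint_loss`, and the per-clause labels and probabilities are all assumptions made for the example, not part of the notebook.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over clauses (assumed per-branch loss)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

def joint_loss(loss_emotion, loss_cause, n=0.5):
    """Lp = n * Le + (1 - n) * Lc, with the weight n in [0, 1]."""
    if not 0.0 <= n <= 1.0:
        raise ValueError("n must lie in [0, 1]")
    return n * loss_emotion + (1.0 - n) * loss_cause

# Hypothetical labels and predicted probabilities for a 4-clause document:
# 1 marks the clause that carries the emotion (or the cause).
emotion_true = np.array([1, 0, 0, 0])
emotion_prob = np.array([0.9, 0.2, 0.1, 0.1])
cause_true = np.array([0, 1, 0, 0])
cause_prob = np.array([0.3, 0.6, 0.2, 0.1])

Le = binary_cross_entropy(emotion_true, emotion_prob)
Lc = binary_cross_entropy(cause_true, cause_prob)
Lp = joint_loss(Le, Lc, n=0.5)
print(Le, Lc, Lp)
```

With n = 1 only the emotion branch is trained, with n = 0 only the cause branch; values in between trade the two off.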






In [16]:
!pip install tensorflow==2.0.0


Collecting tensorflow==2.0.0
  Downloading https://files.pythonhosted.org/packages/46/0f/7bd55361168bb32796b360ad15a25de6966c9c1beb58a8e30c01c8279862/tensorflow-2.0.0-cp36-cp36m-manylinux2010_x86_64.whl (86.3MB)
     |████████████████████████████████| 86.3MB 36kB/s 
Collecting gast==0.2.2
  Downloading https://files.pythonhosted.org/packages/4e/35/11749bf99b2d4e3cceb4d55ca22590b0d7c2c62b9de38ac4a4a7f4687421/gast-0.2.2.tar.gz
Collecting tensorboard<2.1.0,>=2.0.0
  Downloading https://files.pythonhosted.org/packages/76/54/99b9d5d52d5cb732f099baaaf7740403e83fe6b0cedde940fabd2b13d75a/tensorboard-2.0.2-py3-none-any.whl (3.8MB)
     |████████████████████████████████| 3.8MB 46.7MB/s 
Collecting tensorflow-estimator<2.1.0,>=2.0.0
  Downloading https://files.pythonhosted.org/packages/fc/08/8b927337b7019c374719145d1dceba21a8bb909b93b1ad6f8fb7d22c1ca1/tensorflow_estimator-2.0.1-py2.py3-none-any.whl (449kB)
     |████████████████████████████████| 450kB 60.7MB/s 
Building

In [91]:
%tensorflow_version 2.x
%tensorflow_version


Currently selected TF version: 2.x
Available versions:
* 1.x
* 2.x


In [0]:

import unicodedata
import re
import numpy as np
import os
import io
import time
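# use the Keras bundled with TensorFlow so the notebook does not mix two Keras implementations
from tensorflow import keras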
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [0]:
emotion_seeds = set(["ashamed", "delighted", "pleased", "concerned", "delight", "happy", "embarrassed", "furious", "nervous", 
                     "miffed", "angry", "mad", "anger", "excitement", "horror", "resentful", "astonished", "revulsion", 
                     "frightened", "cross", "sad", "down", "astonishment", "miserable", "worried", "sorrow", "overjoyed",
                     "dismay", "grief", "annoyance", "alarmed", "astounded", "anguish", "despair", "infuriated", 
                     "embarrassment", "peeved", "amused", "disgruntled", "indignant", "thrilled", "anxious", "excited",
                     "exasperation", "petrified", "heartbroken", "saddened", "depressed", "dismayed", "frustrated", "fedup", "livid",
                     "revulsion", "bewildered", "flabbergasted", "happier", "ecstatic", "elation", "exhilarated", "exhilaration",
                     "glee", "gleeful", "crestfallen", "sadness", "amusement", "dejected", "desolate", "despondency", "horrors",
                     "agitated", "disquiet", "horrified", "exasperated", "irked", "disgruntlement", "sickened", "revolted",
                     "devastated", "heartbreak", "inconsolable", "bewilderment", "nonplussed", "puzzlement", "disquieted",
                     "glum", "downcast", "griefstricken", "startled", "disgusted"])

In [0]:
# input is sentence, output is emotion cause pairs
# Determine clauses by splitting on punctuation.

# preprocess

# Strips accents: decompose to NFD, then drop the combining marks (category Mn)
def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
      if unicodedata.category(c) != 'Mn')

def remove_nonascii(w):
  w = unicode_to_ascii(w.lower().strip())

  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
  w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)

  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
  w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

  w = w.strip()
  return w


def preprocess_sentence(w):
  w = remove_nonascii(w)  
  # add a start and an end token to the sentence
  # so that the model knows when to start and stop predicting.
  w = '<start> ' + w + ' <end>'
  return w
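
To make the effect of these preprocessing steps concrete, here is a condensed, self-contained restatement of the same idea. The function name `normalize` and the sample strings are made up for the example; it folds the separate whitespace cleanup into the final substitution, which yields the same result because a space is not in the allowed character set.

```python
import re
import unicodedata

def normalize(text):
    """Lowercase, strip accents, pad punctuation with spaces, drop other symbols."""
    text = text.lower().strip()
    # NFD splits "é" into "e" plus a combining accent (category Mn), which is dropped
    text = ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')
    text = re.sub(r"([?.!,¿])", r" \1 ", text)    # "vu!" -> "vu ! "
    text = re.sub(r"[^a-zA-Z?.!,¿]+", " ", text)  # every other run becomes one space
    return text.strip()

print(normalize("Café, déjà vu!"))                          # cafe , deja vu !
print('<start> ' + normalize("He is a boy.") + ' <end>')    # <start> he is a boy . <end>
```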

In [0]:
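def extract_cause(text):
  # Returns (cause, text). The cause span is wrapped as <cause> ... <\cause>,
  # with a literal backslash in the closing tag, hence the raw-string patterns.
  match = re.search(r'<cause>(.*?)<\\cause>', text)
  if match is None:
    # no complete cause span: return the text unchanged
    return ('', text)
  # remove both tags from the line
  text = re.sub(r'<\\?cause>', '', text)
  return (match.group(1), text)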
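
A quick way to see what the cause extraction does is to run it on one tagged sentence. The sketch below is standalone; the helper name `split_cause` and the sample sentence are invented for the example, and the tag format (`<cause>` opening, `<\cause>` closing) is taken from the patterns used in this notebook.

```python
import re

# A cause span is assumed to look like: <cause> ... <\cause>
CAUSE_SPAN = re.compile(r'<cause>(.*?)<\\cause>')
CAUSE_TAG = re.compile(r'<\\?cause>')

def split_cause(text):
    """Return (cause, text_without_tags); cause is '' if no full span exists."""
    match = CAUSE_SPAN.search(text)
    if match is None:
        return '', text
    return match.group(1), CAUSE_TAG.sub('', text)

sample = r"I suppose I am happy, <cause>being so tiny<\cause> it means I can surprise people."
cause_text, untagged = split_cause(sample)
print(repr(cause_text))   # 'being so tiny'
print(untagged)
```

Note that a clause containing only one half of the tag pair (for example when the cause itself contains a comma and is split across clauses) yields an empty cause.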

In [0]:
def clean_filter_clauses(all_clauses):
  cause = ''
  clauses=[]
  for clause in all_clauses:
    e_cause, e_text = extract_cause(clause)
    if e_cause!='':
      cause = remove_nonascii(e_cause)
    clauses.append(remove_nonascii(e_text))
  return cause, clauses



In [0]:

path_to_file = "data.txt"  
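# For each line of the file:
# 1. Strip the emotion and cause tags
# 2. Split the text into clauses and clean them (lowercase, remove accents, pad punctuation)
# 3. Collect the results as [documents, emotions, causes, clause lists]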
document=[]
emotion=[]
cause=[]
clause=[]
def create_dataset(path, num_examples):
  document.clear()
  emotion.clear()
  cause.clear()
  clause.clear()
  # read the file inside a context manager so the handle is closed
  with io.open(path, encoding='UTF-8') as f:
    lines = f.read().strip().split('\n')
  for line in lines[:num_examples]:
    # the first <...> tag on the line names the emotion
    cur_emotion = re.findall('<(.*?)>', line)[0]
    # drop the opening emotion tag (len + 2 chars) and the closing one (len + 3 chars)
    text_without_emotion=line[2+len(cur_emotion):len(line)-len(cur_emotion)-3]
    emotion.append(cur_emotion)

    # Determine clauses by splitting on punctuation.
    all_clauses = re.split("[.,!;:\"]+", text_without_emotion)
    filter_cause, filter_clauses = clean_filter_clauses(all_clauses)
    cause.append(filter_cause)
    clause.append([filter_clauses])
    doc = extract_cause(text_without_emotion)[1]
    # clean up document
    clean_doc = preprocess_sentence(doc)
    document.append(clean_doc)

    # clean up clauses
    # TODO
  return [document, emotion, cause, clause]
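
The slicing arithmetic and the clause split are easier to follow on a single line of input. The following standalone sketch uses a made-up input line and hypothetical helper names (`strip_emotion_tags`, `split_clauses`); it assumes the closing emotion tag is one character longer than the opening one, as the `- 3` in the slice implies. Unlike the notebook code, it drops empty clauses.

```python
import re

def strip_emotion_tags(line):
    """Return (emotion, body) for a line wrapped in an emotion tag pair."""
    emotion = re.findall(r'<(.*?)>', line)[0]
    start = len(emotion) + 2               # skip '<' + emotion + '>'
    end = len(line) - (len(emotion) + 3)   # closing tag has one extra character
    return emotion, line[start:end]

def split_clauses(body):
    """Split on runs of . , ! ; : " and drop empty pieces."""
    return [c.strip() for c in re.split(r'[.,!;:"]+', body) if c.strip()]

line = r"<happy>He was a musician now, still sensitive and happy, doing something he loved.<\happy>"
emotion, body = strip_emotion_tags(line)
clauses = split_clauses(body)
print(emotion)
print(clauses)
```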

In [142]:
document, emotion, cause, clause_list = create_dataset(path_to_file, 500)
print(len(document))
for i in range(5):
  print(document[i])
  print(emotion[i])
  print(cause[i])
  print(clause_list[i])
  print('\n--------\n')

500
<start> i suppose i am happy , being so tiny it means i am able to surprise people with what is generally seen as my confident and outgoing personality . <end>
happy
being so tiny
[['i suppose i am happy', 'being so tiny', 'it means i am able to surprise people with what is generally seen as my confident and outgoing personality', '']]

--------

<start> lennox has always truly wanted to fight for the world title and was happy , because he was taking the tough route . <end>
happy
because he was taking the tough route
[['lennox has always truly wanted to fight for the world title and was happy', 'because he was taking the tough route', '']]

--------

<start> he was a professional musician now , still sensitive and happy , doing something he loved . <end>
happy
doing something he loved
[['he was a professional musician now', 'still sensitive and happy', 'doing something he loved', '']]

--------

<start> holmes is happy , because , he has the freedom of the house when we are out . <

In [143]:
#X=document
X=[tf.constant(sentence) for sentence in document]
print(X[0], X[0].shape, (X[0].numpy()))
y=emotion
print(y)

tf.Tensor(b'<start> i suppose i am happy , being so tiny it means i am able to surprise people with what is generally seen as my confident and outgoing personality . <end>', shape=(), dtype=string) () b'<start> i suppose i am happy , being so tiny it means i am able to surprise people with what is generally seen as my confident and outgoing personality . <end>'
['happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happ

In [0]:
tokenizer_emotion=keras.preprocessing.text.Tokenizer(num_words=10000, oov_token="xxxxxxx")
tokenizer_emotion.fit_on_texts(emotion_seeds)


In [145]:
"""
tokenizer=keras.preprocessing.text.Tokenizer(num_words=10000, oov_token="xxxxxxx")
tokenizer_emotion=keras.preprocessing.text.Tokenizer(num_words=10000, oov_token="xxxxxxx")
def use_tokenizer():
  tokenizer.fit_on_texts(X)
  tokenizer_emotion.fit_on_texts(emotion_seeds)
  X_dict=tokenizer.word_index

  X_seq=tokenizer.texts_to_sequences(X)
  X_padded_seq=pad_sequences(X_seq,padding='post',maxlen=40) 
  print(X_padded_seq[:3], X_padded_seq.shape, type(X_padded_seq))
  print(X_padded_seq.shape)
  
  return X_padded_seq
"""


In [146]:
"""
X_padded_seq = use_tokenizer()
"""


In [0]:
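from tensorflow.keras.utils import to_categorical  # not used yet; kept for one-hot labels

# map every emotion word to its integer id (one id per example for single-word emotions)
y=tokenizer_emotion.texts_to_sequences(y)
y=np.array(y)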
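
The Keras `Tokenizer` assigns 1-based ids and gives id 1 to the OOV token, and because `emotion_seeds` is a set, the order in which equally frequent words receive their ids may differ between sessions. The sketch below is a plain-Python stand-in, not the Tokenizer itself: `build_label_index` is a hypothetical helper that reproduces the 1-based scheme with a sorted, repeatable order, and `seeds` is a tiny placeholder for `emotion_seeds`.

```python
def build_label_index(labels, oov_token="xxxxxxx"):
    """Map each label to a 1-based id, reserving id 1 for the OOV token."""
    index = {oov_token: 1}
    for word in sorted(set(labels)):   # sorted() makes the ids reproducible
        index[word] = len(index) + 1
    return index

seeds = {"happy", "sad", "angry"}      # small stand-in for emotion_seeds
label_index = build_label_index(seeds)
print(label_index)   # {'xxxxxxx': 1, 'angry': 2, 'happy': 3, 'sad': 4}

# ids run from 1 to len(label_index) and id 0 is never used, so a softmax
# indexed by these ids needs max id + 1 output units
num_classes = max(label_index.values()) + 1
print(num_classes)   # 5
```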

In [0]:
import tensorflow_hub as hub
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional, Flatten

TFHUB_CACHE_DIR = os.path.join(os.curdir, "my_tfhub_cache")
os.environ["TFHUB_CACHE_DIR"] = TFHUB_CACHE_DIR


# Unused helper: elmo_model is not defined in this notebook, so calling this raises a NameError.
def UseEmbedding(x):
    return elmo_model(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]

#text_model = tf.keras.Sequential([tf.keras.layers.Embedding(input_length=40,input_dim=10000,output_dim=50, input_shape=[None]),
text_model = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1", dtype=tf.string, input_shape=[], output_shape=[50]),
    #hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",trainable=True, dtype=tf.string),
    #Bidirectional(LSTM(128), input_shape=[None, 50]),
    #Flatten(),
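    Dense(64, activation="relu"),
    # NOTE: Tokenizer ids are 1-based, so the largest label id equals len(word_index);
    # use len(word_index) + 1 units if the label with that id can occur in the data.
    Dense(len(tokenizer_emotion.word_index), activation='softmax')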
])

In [255]:
text_model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

text_model.summary()


Model: "sequential_47"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer_49 (KerasLayer)  (None, 50)                48190600  
_________________________________________________________________
dense_86 (Dense)             (None, 64)                3264      
_________________________________________________________________
dense_87 (Dense)             (None, 89)                5785      
Total params: 48,199,649
Trainable params: 9,049
Non-trainable params: 48,190,600
_________________________________________________________________
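
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output. A small sketch of that arithmetic, where `dense_params` is a helper written for this check and the sizes are read off the summary above:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

embedding_dim = 50        # output width of the nnlm-en-dim50 hub layer
hidden_units = 64
num_outputs = 89          # len(tokenizer_emotion.word_index) in this run
frozen_embedding = 48190600   # hub layer parameters, not trainable here

hidden = dense_params(embedding_dim, hidden_units)
output = dense_params(hidden_units, num_outputs)
print(hidden, output, hidden + output)        # 3264 5785 9049
print(frozen_embedding + hidden + output)     # 48199649
```

Only the two dense layers are trained; the embedding stays frozen because the hub layer is created without `trainable=True`.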


In [257]:
"""print(X_padded_seq.shape, y.shape)
text_model.fit(X_padded_seq, y, epochs=5)
"""
print(type(X), X)
#print(type(document), np.asarray(document).shape, type(y), y.shape)
# the hub layer consumes raw strings, so the documents are passed as a string array
text_model.fit(np.asarray(document), np.asarray(y), epochs=10)

<class 'list'> [<tf.Tensor: id=16945, shape=(), dtype=string, numpy=b'<start> i suppose i am happy , being so tiny it means i am able to surprise people with what is generally seen as my confident and outgoing personality . <end>'>, <tf.Tensor: id=16946, shape=(), dtype=string, numpy=b'<start> lennox has always truly wanted to fight for the world title and was happy , because he was taking the tough route . <end>'>, <tf.Tensor: id=16947, shape=(), dtype=string, numpy=b'<start> he was a professional musician now , still sensitive and happy , doing something he loved . <end>'>, <tf.Tensor: id=16948, shape=(), dtype=string, numpy=b'<start> holmes is happy , because , he has the freedom of the house when we are out . <end>'>, <tf.Tensor: id=16949, shape=(), dtype=string, numpy=b'<start> i had problems with tutors trying to encourage me to diversify my work and experiment with other styles , but i was quite happy , with the direction my work was heading so i stubbornly stuck to it . <end>'>

<tensorflow.python.keras.callbacks.History at 0x7f6efb632128>

In [258]:
test_input=['he was tortured']
"""
test_X_seq=tokenizer.texts_to_sequences(test_input)
print(test_X_seq)

test_X_padded_seq=pad_sequences(test_X_seq,padding='post',maxlen=40)
print(test_X_padded_seq)
"""
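# predict_classes exists in TF 2.0 but was removed in later releases;
# taking the argmax of predict gives the same class ids
output = np.argmax(text_model.predict(np.asarray(test_input)), axis=-1)
print(output)
# look up the emotion word that belongs to the predicted id
for word, index in tokenizer_emotion.word_index.items():
  if index==output[0]:
    print(word, index, output)
    break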

[38]
anger 38 [38]


Questions:
1. Do we need to seed the model?
2. HW:
 a. Use pretrained embeddings
 **b. Use the functional API**
 c. Limit the output to a fixed set of emotions (say 10)
 d. Look at other sources of data (emotion-cause or emotion-entailment)
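
One possible reading of item c, sketched under assumptions: keep the k most frequent emotion labels and fold everything else into a catch-all class. The helper `limit_labels`, the `"other"` class name and the sample labels are all hypothetical; a hand-picked list of 10 emotions would be an equally valid approach.

```python
from collections import Counter

def limit_labels(labels, k=10, other="other"):
    """Keep the k most frequent labels and map the rest to a catch-all."""
    keep = {label for label, _ in Counter(labels).most_common(k)}
    return [label if label in keep else other for label in labels]

labels = ["happy", "happy", "sad", "angry", "sad", "happy", "glum"]
limited = limit_labels(labels, k=2)
print(limited)   # ['happy', 'happy', 'sad', 'other', 'sad', 'happy', 'other']
```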


