
Algorithm

1. Take the document and split it into clauses.
2. Compute embeddings for the clauses.
3. Feed the clause embeddings into a word-level Bi-LSTM layer, followed by an attention layer.
4. Copy the output of the previous layer into two components.
5. The first component handles emotion extraction and is a clause-level Bi-LSTM layer.
6. The second component handles cause extraction and is a clause-level Bi-LSTM layer.
7. The loss $L_p$ of the whole model is the weighted sum of the two component losses:
$$L_p = n \cdot L_e + (1 - n) \cdot L_c$$
where $n$ is a hyperparameter, $L_e$ is the emotion-extraction loss, and $L_c$ is the cause-extraction loss.
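
The weighted loss in step 7 can be sketched in plain Python. This is a minimal illustration, not the notebook's training code: the function name `combined_loss` and the example loss values are assumptions chosen for clarity. In Keras the same weighting is usually expressed through `loss_weights` in `model.compile`.

```python
def combined_loss(loss_emotion, loss_cause, n=0.5):
    """Weighted sum Lp = n * Le + (1 - n) * Lc for a weight n in [0, 1]."""
    if not 0.0 <= n <= 1.0:
        raise ValueError("n must lie in [0, 1]")
    return n * loss_emotion + (1.0 - n) * loss_cause


# Illustrative loss values only; they are not results from this notebook.
print(combined_loss(2.0, 1.0, n=0.5))   # equal weight on both tasks
print(combined_loss(2.0, 1.0, n=1.0))   # only the emotion loss counts
```

Setting `n` closer to 1 makes the optimizer prioritize emotion extraction, and closer to 0 prioritizes cause extraction.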






In [0]:
!pip install tensorflow==2.0.0



In [0]:
!pip install tensorflow-probability==0.8.0


In [1]:
%tensorflow_version 2.x
%tensorflow_version


Currently selected TF version: 2.x
Available versions:
* 1.x
* 2.x


In [0]:

import unicodedata
import re
import numpy as np
import os
import io
import time
from tensorflow import keras
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [0]:
emotion_seeds = set(["ashamed", "delighted", "pleased", "concerned", "delight", "happy", "embarrassed", "furious", "nervous", 
                     "miffed", "angry", "mad", "anger", "excitement", "horror", "resentful", "astonished", "revulsion", 
                     "frightened", "cross", "sad", "down", "astonishment", "miserable", "worried", "sorrow", "overjoyed",
                     "dismay", "grief", "annoyance", "alarmed", "astounded", "anguish", "despair", "infuriated", 
                     "embarrassment", "peeved", "amused", "disgruntled", "indignant", "thrilled", "anxious", "excited",
                     "exasperation", "petrified", "heartbroken", "saddened", "depressed", "dismayed", "frustrated", "fedup", "livid",
                     "revulsion", "bewildered", "flabbergasted", "happier", "ecstatic", "elation", "exhilarated", "exhilaration",
                     "glee", "gleeful", "crestfallen", "sadness", "amusement", "dejected", "desolate", "despondency", "horrors",
                     "agitated", "disquiet", "horrified", "exasperated", "irked", "disgruntlement", "sickened", "revolted",
                     "devastated", "heartbreak", "inconsolable", "bewilderment", "nonplussed", "puzzlement", "disquieted",
                     "glum", "downcast", "griefstricken", "startled", "disgusted"])

In [0]:
# input is sentence, output is emotion cause pairs
# Determine clauses by splitting on punctuation.

# preprocess

# Converts the unicode file to ascii
def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
      if unicodedata.category(c) != 'Mn')

def remove_nonascii(w):
  w = unicode_to_ascii(w.lower().strip())

  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  # Reference:- https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
  w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)

  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
  w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

  w = w.strip()
  return w


def preprocess_sentence(w):
  w = remove_nonascii(w)  
  # adding a start and an end token to the sentence
  # so that the model knows when to start and stop predicting.
  w = '<start> ' + w + ' <end>'
  return w

In [0]:
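def extract_cause(text):
  # Tags in the data close with a backslash, e.g. <cause>...<\cause>.
  # Raw strings avoid the invalid "\c" escape in the pattern.
  cur_cause=''
  match = re.search(r'<cause>(.*?)<\\cause>', text)
  if match:
    cur_cause = match.group(1)
    # Remove tags from line
    text=re.sub(r'<cause>', '', text)
    text=re.sub(r'<\\cause>', '', text)
  # If no complete <cause>...<\cause> pair is found, text is returned unchanged
  return (cur_cause, text)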

In [0]:
def clean_filter_clauses(all_clauses):
  cause = ''
  clauses=[]
  for clause in all_clauses:
    e_cause, e_text = extract_cause(clause)
    if e_cause!='':
      cause = remove_nonascii(e_cause)
    clauses.append(remove_nonascii(e_text))
  return cause, clauses



In [0]:

path_to_file = "data.txt"  
# 1. Remove any accents
# 2. Clean the sentences
# 3. Return parallel lists in the format: [document, emotion, cause, clauses list]
document=[]
emotion=[]
cause=[]
clause=[]
def create_dataset(path, num_examples):
  document.clear()
  emotion.clear()
  cause.clear()
  clause.clear()
  lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
  for i, line in enumerate(lines[:num_examples]):
    cur_emotion = re.findall('<(.*?)>', line)[0]
    # removing emotion tag in document
    text_without_emotion=line[2+len(cur_emotion):len(line)-len(cur_emotion)-3]
    #document.append(text_without_emotion)
    emotion.append(cur_emotion)

    # Determine clauses by splitting on punctuation.
    all_clauses = re.split("[.,!;:\"]+", text_without_emotion)
    filter_cause, filter_clauses = clean_filter_clauses(all_clauses)
    cause.append(filter_cause)
    clause.append([filter_clauses])
    doc = extract_cause(text_without_emotion)[1]
    # clean up document
    clean_doc = preprocess_sentence(doc)
    document.append(clean_doc)

    # clean up clauses
    # TODO
  return [document, emotion, cause, clause]
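
The parsing steps above can be sketched on a single line as follows. This is a simplified, self-contained illustration under stated assumptions: the input format (`<emotion>...<\emotion>` wrapping the text, with `<cause>...<\cause>` inside it) is inferred from the regular expressions in this notebook, and both the helper name `parse_tagged_line` and the sample line are hypothetical. Unlike `create_dataset`, the sketch strips the cause tags before splitting and drops empty clauses.

```python
import re


def parse_tagged_line(line):
    """Split one tagged line into (emotion, cause, clauses)."""
    # The first tag names the emotion, e.g. <happy>.
    emotion = re.search(r'<(.*?)>', line).group(1)
    # Drop the opening <emotion> tag and the closing <\emotion> tag.
    body = line[len(emotion) + 2:len(line) - len(emotion) - 3]
    # The cause sits between <cause> and <\cause>.
    match = re.search(r'<cause>(.*?)<\\cause>', body)
    cause = match.group(1).strip() if match else ''
    # Remove both cause tags, then split into clauses on punctuation.
    body = re.sub(r'<\\?cause>', '', body)
    clauses = [c.strip() for c in re.split(r'[.,!;:"]+', body) if c.strip()]
    return emotion, cause, clauses


# Hypothetical line in the assumed format (not taken from data.txt).
sample = r"<happy>I am happy, <cause>being so tiny<\cause>, it surprises people.<\happy>"
print(parse_tagged_line(sample))
```

Stripping the tags before splitting avoids the case where a cause that contains punctuation is split across clauses with its opening and closing tags in different pieces.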

In [8]:
document, emotion, cause, clause_list = create_dataset(path_to_file, 500)
print(len(document))
for i in range(5):
  print(document[i])
  print(emotion[i])
  print(cause[i])
  print(clause_list[i])
  print('\n--------\n')

500
<start> i suppose i am happy , being so tiny it means i am able to surprise people with what is generally seen as my confident and outgoing personality . <end>
happy
being so tiny
[['i suppose i am happy', 'being so tiny', 'it means i am able to surprise people with what is generally seen as my confident and outgoing personality', '']]

--------

<start> lennox has always truly wanted to fight for the world title and was happy , because he was taking the tough route . <end>
happy
because he was taking the tough route
[['lennox has always truly wanted to fight for the world title and was happy', 'because he was taking the tough route', '']]

--------

<start> he was a professional musician now , still sensitive and happy , doing something he loved . <end>
happy
doing something he loved
[['he was a professional musician now', 'still sensitive and happy', 'doing something he loved', '']]

--------

<start> holmes is happy , because , he has the freedom of the house when we are out . <

In [9]:
X=document
#X=[tf.constant(sentence) for sentence in document]
#print(X[0], X[0].shape, (X[0].numpy()))
y=emotion
print(y)

['happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 'happy', 

In [0]:
tokenizer_emotion=keras.preprocessing.text.Tokenizer(num_words=10000, oov_token="xxxxxxx")
tokenizer_emotion.fit_on_texts(emotion_seeds)


In [0]:
tokenizer=keras.preprocessing.text.Tokenizer(num_words=10000, oov_token="xxxxxxx")
tokenizer_emotion=keras.preprocessing.text.Tokenizer(num_words=10000, oov_token="xxxxxxx")
def use_tokenizer():
  tokenizer.fit_on_texts(X)
  tokenizer_emotion.fit_on_texts(emotion_seeds)
  X_dict=tokenizer.word_index

  X_seq=tokenizer.texts_to_sequences(X)
  X_padded_seq=pad_sequences(X_seq,padding='post',maxlen=40) 
  print(X_padded_seq[:3], X_padded_seq.shape, type(X_padded_seq))
  print(X_padded_seq.shape)
  
  return X_padded_seq

In [12]:
X_padded_seq = use_tokenizer()

[[  3  10 450  10  56  46  40  30 825  21 451  10  56 246   5 452 113  15
   57  28 453 308  31  43 826   8 827 828   2   0   0   0   0   0   0   0
    0   0   0   0]
 [  3 829  52 211 830 212   5 831  19   4 114 832   8   9  46 115  11   9
  309   4 833 834   2   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0]
 [  3  11   9  12 454 835  73  96 836   8  46 310 837  11 311   2   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0]] (500, 40) <class 'numpy.ndarray'>
(500, 40)


In [0]:
y=tokenizer_emotion.texts_to_sequences(y)
y=np.array(y)

In [14]:
print(y.shape)

(500, 1)
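
The labels in `y` are tokenizer indices, which has a consequence for the size of the softmax layer. The sketch below mimics the indexing scheme in plain Python: the Keras `Tokenizer` assigns indices starting at 1 and places the `oov_token` at index 1. The helper `build_word_index` and the three-word label list are assumptions for illustration, and the real tokenizer additionally orders words by frequency, which this sketch ignores.

```python
def build_word_index(words, oov_token="xxxxxxx"):
    """Mimic a 1-based word index with the OOV token at index 1."""
    index = {oov_token: 1}
    for word in words:
        if word not in index:
            index[word] = len(index) + 1
    return index


# Hypothetical label vocabulary, much smaller than emotion_seeds.
labels = ["happy", "sad", "angry"]
word_index = build_word_index(labels)

# Sparse categorical labels must lie in [0, num_classes), and the largest
# index equals len(word_index), so one extra unit is needed to cover it.
num_classes = max(word_index.values()) + 1
print(word_index, num_classes)
```

With this scheme a softmax layer with only `len(word_index)` units cannot represent the word that received the highest index.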


In [27]:
# Functional-API model: a shared encoder with attention, followed by two heads.
clauses = keras.layers.Input(shape=X_padded_seq.shape[1:], name='clause_input')
clauses_embeddings = keras.layers.Embedding(input_length=40,input_dim=10000,output_dim=50) (clauses)
output1 = keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True))(clauses_embeddings)

# keras.layers.Attention expects [query, value, key]; passing output1 for all three makes this self-attention.
attention_output = keras.layers.Attention()([output1, output1, output1])

# Emotion head.
# NOTE: Tokenizer indices start at 1, so the largest label equals len(word_index);
# Dense(len(word_index) + 1) would cover every index.
output_emotion = keras.layers.Bidirectional(keras.layers.LSTM(64))(attention_output)
output_emotion_dense1 = keras.layers.Dense(128, activation="relu") (output_emotion)
output_emotion_dense2 = keras.layers.Dense(len(tokenizer_emotion.word_index), activation='softmax', name='emotion_output') (output_emotion_dense1)

# Cause head. It is trained on the emotion labels y below as a placeholder; cause labels are not wired in yet.
output_cause = keras.layers.Bidirectional(keras.layers.LSTM(64))(attention_output)
output_cause_dense1 = keras.layers.Dense(128, activation="relu") (output_cause)
output_cause_dense2 = keras.layers.Dense(len(tokenizer_emotion.word_index), activation='softmax', name='clause_output') (output_cause_dense1)

model = keras.Model(inputs=clauses, outputs=[output_emotion_dense2, output_cause_dense2])
# loss_weights=[0.5, 0.5] corresponds to n = 0.5 in the weighted loss of the algorithm.
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", loss_weights=[0.5, 0.5], metrics=['accuracy'])
model.fit({'clause_input': X_padded_seq}, {'emotion_output': y, 'clause_output': y}, epochs=20)



Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fc0bbd6bef0>
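
The model above passes the same tensor to the attention layer as query, value, and key. A minimal sketch of what that means, written in plain Python for one sequence: each position scores every position by dot product, the scores are normalized with a softmax, and the output is the weighted average of the value vectors. The function names are assumptions for illustration, and the sketch leaves out batching and masking.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Dot-product self-attention: query, key, and value are all x.

    x is a list of timesteps, each a list of floats of equal length.
    """
    out = []
    for q in x:
        # Score this timestep against every timestep (including itself).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in x]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[d] for w, v in zip(weights, x))
                    for d in range(len(q))])
    return out


# Two toy timesteps with two features each.
print(self_attention([[1.0, 0.0], [0.0, 1.0]]))
```

The output has the same shape as the input, which is why a Bi-LSTM with `return_sequences=True` can feed the attention layer and another Bi-LSTM can consume its result.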

In [28]:
model.evaluate(X_padded_seq, y, verbose=0)


[0.6723743081092834,
 0.6715744137763977,
 0.6731741428375244,
 0.6359999775886536,
 0.6359999775886536]

In [29]:
model.predict({'clause_input': X_padded_seq})

[array([[2.24426984e-08, 2.94211437e-04, 4.46459154e-08, ...,
         2.34549113e-09, 3.98510691e-09, 4.38057768e-08],
        [2.06071569e-08, 2.61568348e-04, 4.07083789e-08, ...,
         2.07825179e-09, 3.53869312e-09, 4.01161806e-08],
        [2.18395524e-08, 2.82198773e-04, 4.32992202e-08, ...,
         2.24915397e-09, 3.81770260e-09, 4.25323385e-08],
        ...,
        [5.87996419e-06, 3.27096820e-01, 7.36006359e-06, ...,
         5.77178116e-06, 1.19048336e-05, 1.22625552e-05],
        [5.86108354e-06, 3.26503754e-01, 7.33830575e-06, ...,
         5.74995738e-06, 1.18494972e-05, 1.22302345e-05],
        [5.87799695e-06, 3.25428605e-01, 7.37655682e-06, ...,
         5.76614229e-06, 1.18329581e-05, 1.22892952e-05]], dtype=float32),
 array([[1.6303733e-08, 4.3151993e-04, 1.1161738e-08, ..., 9.2581601e-08,
         1.3225321e-08, 4.5378727e-08],
        [1.4596586e-08, 3.8640076e-04, 9.8934372e-09, ..., 8.3660829e-08,
         1.1561262e-08, 4.0990972e-08],
        [1.5907480e-08

In [143]:
# test code to test the model prediction
test_input=['he was mad']
test_X_seq=tokenizer.texts_to_sequences(test_input)
print(test_X_seq)

test_X_padded_seq=pad_sequences(test_X_seq,padding='post',maxlen=40)
print(test_X_padded_seq)

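# predict_classes is only defined on Sequential models; this functional model
# returns one probability array per output, so take the argmax of the emotion head.
emotion_probs, cause_probs = model.predict(test_X_padded_seq)
output = np.argmax(emotion_probs, axis=-1)
print(output)
for word, index in tokenizer_emotion.word_index.items():
  if index==output[0]:
    print(word, index, output)
    break

[[11, 9, 85]]
[[11  9 85  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]]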

In [134]:
# Baseline: this model takes the document as input and predicts only the document's emotion
text_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_length=40,input_dim=10000,output_dim=50),
    #tf.keras.layers.Flatten(),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    #tf.keras.layers.Attention(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(len(tokenizer_emotion.word_index), activation='softmax')
])

text_model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

text_model.summary()

print(X_padded_seq.shape, y.shape)
text_model.fit(X_padded_seq, y, epochs=5)

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_36 (Embedding)     (None, 40, 50)            500000    
_________________________________________________________________
bidirectional_127 (Bidirecti (None, 128)               58880     
_________________________________________________________________
dense_69 (Dense)             (None, 128)               16512     
_________________________________________________________________
dense_70 (Dense)             (None, 89)                11481     
Total params: 586,873
Trainable params: 586,873
Non-trainable params: 0
_________________________________________________________________
(500, 40) (500, 1)
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fad09986438>

In [135]:
# test code to test the model prediction
test_input=['he was mad']
test_X_seq=tokenizer.texts_to_sequences(test_input)
print(test_X_seq)

test_X_padded_seq=pad_sequences(test_X_seq,padding='post',maxlen=40)
print(test_X_padded_seq)

output = text_model.predict_classes(test_X_padded_seq)
print(output)
for word, index in tokenizer_emotion.word_index.items():
  if index==output:
    print(word, index, output)
    break

[[11, 9, 85]]
[[11  9 85  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]]
[68]
happy 68 [68]


Questions:
1. Do we need to seed the model?
2. HW:
 a. Use pretrained embeddings
 **b. Use the functional API**
 c. Limit the output to a fixed set of emotions (say 10)
 d. Look at other sources of data (emotion-cause or emotion-entailment datasets)


