<a href="https://colab.research.google.com/github/gwohlgen/colab/blob/master/Spam_Classification_with_ELMo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam Classification with ELMo 
### (based on: http://hunterheidenreich.com/blog/elmo-word-vectors-in-keras/)

### Step 1: Load the spam data file and set things up

In [1]:
## download spam.csv (remove any stale copy first; -f avoids an error if none exists)
!rm -f spam*
!wget https://raw.githubusercontent.com/gwohlgen/misc/master/spam.csv
!ls

#!head spam.csv

--2019-05-20 09:07:41--  https://raw.githubusercontent.com/gwohlgen/misc/master/spam.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 503663 (492K) [text/plain]
Saving to: ‘spam.csv’


2019-05-20 09:07:42 (18.7 MB/s) - ‘spam.csv’ saved [503663/503663]

 data.csv      sample_data
 data_lm.pkl  'sentiment labelled sentences'
 __MACOSX     'sentiment labelled sentences.zip'
 models        spam.csv


In [2]:
import tensorflow as tf
import tensorflow_hub as hub
import pandas as pd
from sklearn import preprocessing
import keras
import numpy as np

url = "https://tfhub.dev/google/elmo/2"
embed = hub.Module(url)

data = pd.read_csv('spam.csv', encoding='latin-1')



W0520 09:07:48.522742 139829444708224 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14
Using TensorFlow backend.


W0520 09:07:56.136458 139829444708224 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/control_flow_ops.py:3632: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.


### Step 2: Encode the labels (y) into a one-hot encoding





In [3]:
y = list(data['v1'])
x = list(data['v2'])

le = preprocessing.LabelEncoder()
le.fit(y)

def encode(le, labels):
    enc = le.transform(labels)
    return keras.utils.to_categorical(enc)

def decode(le, one_hot):
    dec = np.argmax(one_hot, axis=1)
    return le.inverse_transform(dec)
  
## test label encoding
test = encode(le, ['ham', 'spam', 'ham', 'ham'])
untest = decode(le, test)

print(test)
print()
print(untest)

x_enc = x
y_enc = encode(le, y)



[[1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]]

['ham' 'spam' 'ham' 'ham']
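
To make the encoding step less of a black box, here is a minimal sketch of the same round trip in plain Python, so it runs without scikit-learn or Keras. The helper names `one_hot` and `argmax_rows` are made up for this illustration; the notebook itself uses `LabelEncoder`, `to_categorical` and `np.argmax`.

```python
def one_hot(class_ids, num_classes):
    """Turn integer class ids into one-hot rows (what to_categorical produces)."""
    return [[1.0 if j == c else 0.0 for j in range(num_classes)] for c in class_ids]


def argmax_rows(rows):
    """Index of the largest entry in each row (what np.argmax(..., axis=1) does)."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# LabelEncoder sorts the class names, so 'ham' -> 0 and 'spam' -> 1
classes = sorted({"ham", "spam"})
to_id = {name: i for i, name in enumerate(classes)}

labels = ["ham", "spam", "ham", "ham"]
encoded = one_hot([to_id[label] for label in labels], len(classes))
decoded = [classes[i] for i in argmax_rows(encoded)]

print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
print(decoded)  # ['ham', 'spam', 'ham', 'ham']
```

Because decoding takes the arg-max of each row, the same `decode` function also works on the softmax probabilities the model outputs later.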


### Step 3: Split into train and test sets

In [4]:
x_train = np.asarray(x_enc[:5000])
y_train = np.asarray(y_enc[:5000])

x_test = np.asarray(x_enc[5000:])
y_test = np.asarray(y_enc[5000:])

print(len(x_train), len(x_test))
print('First training sentence:', len(x_train[0]), ' -- ', x_train[0])

5000 572
First training sentence: 111  --  Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
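
The split above simply takes the first 5000 rows in file order. If the file happened to be ordered in some way, a shuffled split would be safer. The following is a rough sketch of that idea in plain Python; `shuffled_split` and the toy data are hypothetical and not part of the original notebook.

```python
import random


def shuffled_split(texts, labels, n_train, seed=0):
    """Shuffle indices with a fixed seed, then cut into train and test parts."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    train_idx, test_idx = indices[:n_train], indices[n_train:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(texts, train_idx), pick(labels, train_idx),
            pick(texts, test_idx), pick(labels, test_idx))


# Toy stand-in for the 5572 messages in spam.csv
toy_x = ["msg %d" % i for i in range(10)]
toy_y = ["spam" if i % 5 == 0 else "ham" for i in range(10)]

x_tr, y_tr, x_te, y_te = shuffled_split(toy_x, toy_y, n_train=8)
print(len(x_tr), len(x_te))  # 8 2
```

In the notebook this would be called with `x_enc`, `y_enc` and `n_train=5000` before converting the results with `np.asarray`.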


### Step 4: Define the embedding function

In [0]:
from keras.layers import Input, Lambda, Dense
from keras.models import Model
import keras.backend as K


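def ELMoEmbedding(x):
    # The Input layer yields shape (batch, 1); squeeze only axis 1 so that a
    # batch of size 1 is not collapsed to a scalar. The "default" output is the
    # mean-pooled sentence embedding with shape (batch, 1024).
    return embed(tf.squeeze(tf.cast(x, tf.string), axis=1), signature="default", as_dict=True)["default"]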
  
 

### Excursus: See the embeddings
based on: [https://tfhub.dev/google/elmo/2](https://tfhub.dev/google/elmo/2)



In [6]:
## Inspect the per-token outputs: "elmo" is the weighted sum of the three layers (1024-dim per token)
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
embeddings = elmo(
    ["the cat is on the mat", "dogs are in the fog"],
    signature="default",
    as_dict=True)["elmo"]

print(embeddings)

## "word_emb" is the character-based word representation (512-dim per token)
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
embeddings = elmo(
    ["the cat is on the mat", "dogs are in the fog"],
    signature="default",
    as_dict=True)["word_emb"]

print(embeddings)



I0520 09:33:37.492229 139829444708224 saver.py:1483] Saver not created because there are no variables in the graph to restore


Tensor("module_1_apply_default/aggregation/mul_3:0", shape=(2, 6, 1024), dtype=float32)
I0520 09:33:38.961277 139829444708224 saver.py:1483] Saver not created because there are no variables in the graph to restore


Tensor("module_2_apply_default/bilm/Reshape_1:0", shape=(2, 6, 512), dtype=float32)
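
The shapes printed above read as (sentences, tokens, dimensions). With the `default` signature the module splits each sentence on spaces and pads the batch to its longest sentence, which is where the 6 comes from. The sketch below reproduces that bookkeeping in plain Python and also shows a masked mean over tokens, which is roughly how a per-token output turns into one vector per sentence. The helpers `pad_tokens` and `masked_mean` and the 2-dim toy vectors are invented for this illustration, and the assumption that padded positions are excluded from the mean is mine.

```python
def pad_tokens(sentences):
    """Whitespace-tokenize and pad every sentence to the longest one in the batch."""
    tokenized = [s.split() for s in sentences]
    lengths = [len(tokens) for tokens in tokenized]
    max_len = max(lengths)
    padded = [tokens + [""] * (max_len - len(tokens)) for tokens in tokenized]
    return padded, lengths


def masked_mean(vectors, length):
    """Average only the first `length` token vectors, ignoring padding."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors[:length]) / length for d in range(dim)]


padded, lengths = pad_tokens(["the cat is on the mat", "dogs are in the fog"])
print(len(padded), len(padded[0]), lengths)  # 2 6 [6, 5]

# Toy 2-dim "embeddings": two real tokens followed by one padding position
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
print(masked_mean(toy_vectors, length=2))  # [2.0, 3.0]
```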


### Step 5: Create the model

In [7]:
input_text = Input(shape=(1,), dtype=tf.string)
embedding = Lambda(ELMoEmbedding, output_shape=(1024, ))(input_text)  # ELMo yields one 1024-dim vector per message
dense = Dense(256, activation='relu')(embedding)
dense2 = Dense(64, activation='relu')(dense)  # extra hidden layer, added as an experiment
pred = Dense(2, activation='softmax')(dense2)
model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.layers)

I0520 09:34:28.329293 139829444708224 saver.py:1483] Saver not created because there are no variables in the graph to restore


[<keras.engine.input_layer.InputLayer object at 0x7f2c23a65320>, <keras.layers.core.Lambda object at 0x7f2c23a651d0>, <keras.layers.core.Dense object at 0x7f2c77cc8dd8>, <keras.layers.core.Dense object at 0x7f2c22f54f28>, <keras.layers.core.Dense object at 0x7f2c23a65048>]
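
As a quick sanity check on the model size, the trainable parameters of the three Dense layers can be counted by hand. This is a small sketch under the assumption that the Lambda layer contributes no weights that Keras tracks (the ELMo variables live in the TF Hub module, not in the Keras layer); `dense_params` is a helper made up for this example.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# 1024-dim ELMo vector -> Dense(256) -> Dense(64) -> Dense(2)
layer_sizes = [1024, 256, 64, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [262400, 16448, 130]
print(sum(per_layer))  # 278978
```

Under that assumption, these are also the only weights that `model.save_weights` would write to disk.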


### Step 6: Train the model and save its weights

In [8]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())  
    session.run(tf.tables_initializer())
    history = model.fit(x_train, y_train, epochs=2, batch_size=32)
    model.save_weights('./elmo-model.h5')



W0520 09:34:40.156120 139829444708224 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.


Epoch 1/2
Epoch 2/2
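
For orientation, the amount of training is easy to work out from the settings in the cell above: 5000 training messages, batches of 32, two epochs. A minimal sketch of that arithmetic:

```python
import math

n_train, batch_size, epochs = 5000, 32, 2

steps_per_epoch = math.ceil(n_train / batch_size)              # batches per epoch
last_batch = n_train - (steps_per_epoch - 1) * batch_size      # size of the final, partial batch
total_updates = steps_per_epoch * epochs                       # gradient updates overall

print(steps_per_epoch, last_batch, total_updates)  # 157 8 314
```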


### Step 7: Load the model weights and make predictions

In [0]:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model.load_weights('./elmo-model.h5')  
    predicts = model.predict(x_test, batch_size=32)

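# Decode one-hot rows back to 'ham'/'spam' strings.
# Note: this overwrites y_test, so re-running this cell requires re-running Step 3 first.
y_test = decode(le, y_test)
y_preds = decode(le, predicts)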

In [10]:
from sklearn import metrics

print(metrics.confusion_matrix(y_test, y_preds))

print(metrics.classification_report(y_test, y_preds))

[[496   2]
 [  5  69]]
              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       498
        spam       0.97      0.93      0.95        74

   micro avg       0.99      0.99      0.99       572
   macro avg       0.98      0.96      0.97       572
weighted avg       0.99      0.99      0.99       572
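
The numbers in the report follow directly from the confusion matrix, where rows are the true classes and columns the predicted ones, both in the order ham, spam. The sketch below recomputes them in plain Python; `prf` is a helper written for this illustration, and the majority-class baseline is an extra comparison that is not in the original notebook.

```python
cm = [[496, 2],
      [5, 69]]  # rows: true ham/spam, columns: predicted ham/spam
names = ["ham", "spam"]


def prf(cm, k):
    """Precision, recall and F1 for class index k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum
    actual_k = sum(cm[k])                    # row sum (the "support")
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for k, name in enumerate(names):
    p, r, f1 = prf(cm, k)
    print(name, round(p, 2), round(r, 2), round(f1, 2))
# ham 0.99 1.0 0.99
# spam 0.97 0.93 0.95

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total
baseline = max(sum(row) for row in cm) / total  # always predicting the majority class
print(round(accuracy, 3), round(baseline, 3))   # 0.988 0.871
```

Since about 87% of the test messages are ham, accuracy alone looks flattering; the spam recall of 0.93 (5 of 74 spam messages missed) is the more telling number.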

