# ELMo



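## Advantages of ELMo
(1) ELMo captures complex characteristics of word use, such as syntax and semantics.
(2) ELMo captures how word use varies across linguistic contexts, that is, polysemy.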

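ELMo word vectors are learned from the internal states of a deep bidirectional language model (biLM) trained on a large text corpus. These vectors are easy to add to tasks such as question answering, textual entailment, and text classification.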

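A language model assigns a probability to a sequence of text. Given a sequence of $N$ tokens $(t_1,t_2,...,t_N)$, a forward language model predicts each token $t_k$ from the preceding tokens $(t_1,t_2,...,t_{k-1})$, so the probability of the sequence factorizes as:
$p(t_1,t_2,...,t_N)=\prod_{k=1}^{N}{p(t_k|t_1,t_2,...,t_{k-1})}$
A backward language model runs in the opposite direction and predicts each token from the tokens that follow it:
$p(t_1,t_2,...,t_N)=\prod_{k=1}^{N}{p(t_k|t_{k+1},t_{k+2},...,t_N)}$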

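A bidirectional language model combines the forward and backward models and maximizes their joint log-likelihood:
$\sum_{k=1}^{N}{(\log p(t_k|t_1,...,t_{k-1};\theta,\overrightarrow{\theta_{LSTM}},\theta_s)+\log p(t_k|t_{k+1},t_{k+2},...,t_N;\theta,\overleftarrow{\theta_{LSTM}},\theta_{s}))}$
Here the token-representation parameters $\theta$ and the softmax-layer parameters $\theta_s$ are shared between the two directions, while each direction keeps its own LSTM parameters.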
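The forward and backward factorizations and their sum can be checked on a toy example. The sketch below is not an LSTM language model: `uniform_cond_prob` and `VOCAB` are hypothetical stand-ins, used only to show how the two log-likelihoods are accumulated and added.

```python
import math

VOCAB = ["the", "cat", "sat"]  # hypothetical toy vocabulary


def uniform_cond_prob(token, context):
    """Hypothetical stand-in for p(token | context): uniform over VOCAB, ignores context."""
    return 1.0 / len(VOCAB)


def forward_log_likelihood(tokens, cond_prob):
    # sum_k log p(t_k | t_1, ..., t_{k-1})
    return sum(math.log(cond_prob(tokens[k], tokens[:k])) for k in range(len(tokens)))


def backward_log_likelihood(tokens, cond_prob):
    # sum_k log p(t_k | t_{k+1}, ..., t_N)
    return sum(math.log(cond_prob(tokens[k], tokens[k + 1:])) for k in range(len(tokens)))


def bilm_objective(tokens, fwd_prob, bwd_prob):
    # The biLM objective is the sum of the two directions' log-likelihoods.
    return forward_log_likelihood(tokens, fwd_prob) + backward_log_likelihood(tokens, bwd_prob)


sentence = ["the", "cat", "sat"]
joint = bilm_objective(sentence, uniform_cond_prob, uniform_cond_prob)
```

In a real biLM the two conditional distributions come from separate forward and backward LSTMs; only the summation structure carries over from this sketch.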

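ELMo is a combination of the layer representations of the biLM. For each token $t_k$, an $L$-layer biLM computes a set of $2L+1$ representations:
$R_k=\{x_k^{LM},\overrightarrow{h_k^{LM,j}},\overleftarrow{h_k^{LM,j}}|j=1,...,L\}=\{h_k^{LM,j}|j=0,...,L\}$
where $h_k^{LM,0}$ is the token layer $x_k^{LM}$ and ${h_k^{LM,j}}=[\overrightarrow{h_k^{LM,j}};\overleftarrow{h_k^{LM,j}}]$ for each biLSTM layer $j=1,...,L$.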
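As a rough illustration of how the $2L+1$ raw vectors collapse into $L+1$ layer vectors, the sketch below builds the per-token layer list from plain Python lists. The function name and the toy values are hypothetical, and duplicating the token vector so that layer 0 matches the width of the concatenated LSTM states is an assumption of this sketch.

```python
def build_layer_representations(token_vec, forward_states, backward_states):
    """Return the L+1 layer vectors for one token.

    token_vec: context-independent token representation (layer 0)
    forward_states, backward_states: L vectors each, one per biLSTM layer
    """
    if len(forward_states) != len(backward_states):
        raise ValueError("need one forward and one backward state per layer")
    # Assumption: layer 0 duplicates the token vector so that every layer
    # has the same width as a concatenated [forward; backward] state.
    layers = [list(token_vec) + list(token_vec)]
    for fwd, bwd in zip(forward_states, backward_states):
        layers.append(list(fwd) + list(bwd))  # list concatenation plays the role of [fwd; bwd]
    return layers


# Toy values for a 2-layer biLM with 2-dimensional states.
x_k = [0.1, 0.2]
forward = [[1.0, 2.0], [3.0, 4.0]]
backward = [[5.0, 6.0], [7.0, 8.0]]
R_k = build_layer_representations(x_k, forward, backward)
```

With $L=2$ there are $2L+1=5$ raw vectors going in and $L+1=3$ layer vectors coming out.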

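ELMo collapses all layers in $R_k$ into a single vector, $ELMo_k=E(R_k;\theta_e)$. In the simplest case ELMo uses only the top layer, $E(R_k)=h_k^{LM,L}$, as in TagLM and CoVe. The best ELMo models instead take a weighted sum of all biLM layers, with learned weights normalized by a softmax:
$s=Softmax(w)$
$E(R_k;w,\gamma)=\gamma\sum_{j=0}^{L}{s_jh_k^{LM,j}}$
where $\gamma$ is a learned scaling factor applied to the whole ELMo vector. Because the activations of each biLM layer can have different distributions, in some cases it also helps to apply layer normalization to each layer before weighting.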
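The weighted combination can be sketched in a few lines of plain Python. This is a minimal illustration of the formula only, with hypothetical toy values; in a real model $w$ and $\gamma$ are trained together with the downstream task.

```python
import math


def softmax(weights):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_vector(layers, w, gamma=1.0):
    """Compute gamma * sum_j s_j * h_j with s = softmax(w)."""
    if len(layers) != len(w):
        raise ValueError("need exactly one weight per layer")
    s = softmax(w)
    dim = len(layers[0])
    return [gamma * sum(s_j * layer[i] for s_j, layer in zip(s, layers))
            for i in range(dim)]


# Three toy layer vectors (j = 0, 1, 2); equal raw weights give s_j = 1/3.
toy_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = elmo_vector(toy_layers, w=[0.0, 0.0, 0.0], gamma=2.0)
```

Setting all entries of `w` equal reduces the mix to a plain average scaled by `gamma`, which is a convenient sanity check.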

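## How to use ELMo word vectors
(1) Concatenate the ELMo vector $ELMo_k$ with the ordinary word vector $x_k$ at the input of the task model: $[x_k;ELMo_k]$.
(2) Concatenate the ELMo vector $ELMo_k$ with the hidden-layer output $h_k$ of the task model: $[h_k;ELMo_k]$. This brings improvements on both SNLI and SQuAD.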
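Both options are the same per-token concatenation, applied at different points of the task model. A minimal sketch with hypothetical toy vectors (the helper name is not from any library):

```python
def concat_with_elmo(base_vectors, elmo_vectors):
    """Concatenate each base vector (x_k or h_k) with its ELMo vector, token by token."""
    if len(base_vectors) != len(elmo_vectors):
        raise ValueError("both sequences must cover the same tokens")
    return [list(base) + list(elmo) for base, elmo in zip(base_vectors, elmo_vectors)]


# Two tokens, 2-dimensional base vectors, 3-dimensional ELMo vectors (toy sizes).
x = [[0.5, -0.5], [0.1, 0.9]]
elmo = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
combined = concat_with_elmo(x, elmo)
```

The downstream layer that consumes `combined` must accept the enlarged dimension, here 2 + 3 = 5.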



In [1]:
import os
import re

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub

In [None]:
# Reference: https://github.com/strongio/keras-elmo/blob/master/Elmo%20Keras.ipynb

In [2]:
def load_directory_data(directory):
    data = {}
    data["sentence"] = []
    data["sentiment"] = []
    for file_path in os.listdir(directory):
        with tf.gfile.GFile(os.path.join(directory, file_path), "r") as f:
            data["sentence"].append(f.read())
            # Raw string avoids invalid-escape warnings; the captured group is the rating in the file name.
            data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
    return pd.DataFrame.from_dict(data)
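The regular expression in `load_directory_data` relies on the aclImdb file naming scheme `<id>_<rating>.txt`. The sketch below isolates that step; `rating_from_filename` is a hypothetical helper that raises a clear error instead of failing on `None` when a file name does not match.

```python
import re

FILENAME_PATTERN = re.compile(r"\d+_(\d+)\.txt")


def rating_from_filename(file_name):
    """Extract the rating part of an aclImdb file name such as '0_9.txt'."""
    match = FILENAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError("unexpected file name: {!r}".format(file_name))
    return match.group(1)  # returned as a string, as in load_directory_data


rating = rating_from_filename("0_9.txt")
```

Note that the rating stays a string here, so the `sentiment` column would need an explicit `int(...)` conversion before any numeric use.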

In [3]:
# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
    pos_df = load_directory_data(os.path.join(directory, "pos"))
    neg_df = load_directory_data(os.path.join(directory, "neg"))
    pos_df["polarity"] = 1
    neg_df["polarity"] = 0
    return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

In [4]:
# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
    dataset = tf.keras.utils.get_file(
      fname="aclImdb.tar.gz", 
      origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
      extract=True)

    train_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                       "aclImdb", "train"))
    test_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                      "aclImdb", "test"))

    return train_df, test_df

In [None]:
train_df, test_df = download_and_load_datasets()
train_df.head()

In [None]:
from keras import backend as K
from keras.layers import Layer


class ElmoEmbeddingLayer(Layer):
    """Keras layer that wraps the TF Hub ELMo module (TF1-style hub.Module API)."""

    def __init__(self, **kwargs):
        self.dimensions = 1024  # size of the ELMo sentence vector
        self.trainable = True
        super(ElmoEmbeddingLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.elmo = hub.Module('https://tfhub.dev/google/elmo/2', trainable=self.trainable,
                               name="{}_module".format(self.name))

        # Register the module's trainable variables with this layer.
        self.trainable_weights += tf.trainable_variables(scope="^{}_module/.*".format(self.name))
        super(ElmoEmbeddingLayer, self).build(input_shape)

    def call(self, x, mask=None):
        # Input has shape (batch, 1); squeeze to (batch,) raw strings.
        # The 'default' output is a fixed-size vector per input sentence.
        result = self.elmo(K.squeeze(K.cast(x, tf.string), axis=1),
                           as_dict=True,
                           signature='default',
                           )['default']
        return result

    def compute_mask(self, inputs, mask=None):
        return K.not_equal(inputs, '--PAD--')

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.dimensions)

In [None]:
from keras import layers
from keras.models import Model


def build_model():
    # Each example is a single raw string, hence shape=(1,).
    input_text = layers.Input(shape=(1,), dtype="string")
    embedding = ElmoEmbeddingLayer()(input_text)
    dense = layers.Dense(256, activation='relu')(embedding)
    pred = layers.Dense(1, activation='sigmoid')(dense)

    model = Model(inputs=[input_text], outputs=pred)

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()

    return model

In [None]:
# Create datasets (Only take up to 150 words for memory)
train_text = train_df['sentence'].tolist()
train_text = [' '.join(t.split()[0:150]) for t in train_text]
train_text = np.array(train_text, dtype=object)[:, np.newaxis]
train_label = train_df['polarity'].tolist()

test_text = test_df['sentence'].tolist()
test_text = [' '.join(t.split()[0:150]) for t in test_text]
test_text = np.array(test_text, dtype=object)[:, np.newaxis]
test_label = test_df['polarity'].tolist()
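The truncation step is repeated for the training and test text. A small helper, sketched here under the hypothetical name `truncate_words`, keeps the 150-word limit in one place:

```python
def truncate_words(text, max_words=150):
    """Keep at most max_words whitespace-separated tokens of text."""
    return " ".join(text.split()[:max_words])


# A synthetic 200-word review, used only to exercise the helper.
long_review = " ".join("word{}".format(i) for i in range(200))
short_review = truncate_words(long_review)
```

Because `str.split()` with no arguments splits on any run of whitespace, newlines and repeated spaces in the original reviews are normalized to single spaces as a side effect.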

In [None]:
model = build_model()
model.fit(train_text, 
          train_label,
          validation_data=(test_text, test_label),
          epochs=1,
          batch_size=32)

In [None]:
pre_save_preds = model.predict(test_text[0:100]) # predictions before we clear and reload model

# Save the weights first; without this step load_weights below has no file to read
model.save_weights('ElmoModel.h5')

# Clear and load model
model = None
model = build_model()
model.load_weights('ElmoModel.h5')

post_save_preds = model.predict(test_text[0:100]) # predictions after we clear and reload model
all(pre_save_preds == post_save_preds) # Are they the same?
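Exact equality on floating-point predictions can fail even when the reloaded model is effectively the same. A tolerance-based comparison is one alternative; the sketch below works on flat lists of floats, with a hypothetical helper name and an arbitrarily chosen tolerance.

```python
import math


def predictions_match(before, after, abs_tol=1e-6):
    """Return True if both prediction lists agree element-wise within abs_tol."""
    if len(before) != len(after):
        return False
    return all(math.isclose(b, a, rel_tol=0.0, abs_tol=abs_tol)
               for b, a in zip(before, after))


# Toy predictions standing in for flattened model outputs.
same = predictions_match([0.91, 0.12], [0.91 + 1e-9, 0.12])
different = predictions_match([0.91, 0.12], [0.90, 0.12])
```

With NumPy arrays such as the output of `model.predict`, flattening first (for example with `.ravel().tolist()`) would produce inputs of the expected shape.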