This notebook investigates different ways of including word embeddings as features.

## ELMo

[ELMo](https://allennlp.org/elmo) is a state-of-the-art deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics) and (2) how these uses vary across linguistic contexts. We use [ELMoForManyLangs](https://github.com/HIT-SCIR/ELMoForManyLangs) to train our own embeddings.

In [129]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

In [71]:
from elmoformanylangs import Embedder
from sklearn.model_selection import train_test_split

from config import train_data_path
from fgclassifier.utils import read_data


X, y = read_data(train_data_path, seg_words=True, sample_n=None)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

2018-11-20 14:07:27,468 INFO: Read cache data/train/sentiment_analysis_trainingset.csv.segged_sample_None.tsv..


In [2]:
embedder = Embedder('data/elmo-zhs-100k-mc4-lr0.001/')

2018-11-20 10:05:12,659 INFO: char embedding size: 8844
2018-11-20 10:05:13,141 INFO: word embedding size: 69598
2018-11-20 10:05:18,929 INFO: Model(
  (token_embedder): ConvTokenEmbedder(
    (word_emb_layer): EmbeddingLayer(
      (embedding): Embedding(69598, 100, padding_idx=3)
    )
    (char_emb_layer): EmbeddingLayer(
      (embedding): Embedding(8844, 50, padding_idx=8841)
    )
    (convolutions): ModuleList(
      (0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
      (1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
      (2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
      (3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
      (4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
      (5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
      (6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
    )
    (highways): Highway(
      (_layers): ModuleList(
        (0): Linear(in_features=2048, out_features=4096, bias=True)
        (1): Linear(in_features=2048, out_feat

In [65]:
def article_to_sentences(articles):
    sentences, aids = [], []
    for aid, article in enumerate(articles):
        for s in article.split('。'):
            s = s.strip()
            if not s:
                continue
            sentences.append(s.split())
            aids.append(aid)
    return sentences, aids


sentences, aids = article_to_sentences(X_train[:2])
X_train_elmo = embedder.sents2elmo(sentences)
print(X_train_elmo[0].shape, len(X_train_elmo))

2018-11-20 11:20:28,216 INFO: 1 batches, avg len: 130.0


(54, 1024) 3


In [70]:
print('\n\n'.join(' '.join(x) for x in sentences))

很少 吃 兔子 肉 ！ BBLANKK 在 那 附近 本来 想 去 吃 泰国菜 的 ， 最后 还是 选择 了 这家 ， 果然 没 让 失望 ， 进去 后 的 座位 上 都 有 只 流氓兔 的 抱 枕 ， 点 了 一个 爆辣 的 兔头 ， 姐妹 吃 的 一直 赞不绝口 ！ BBLANKK

手 撕 烤 兔 ， 本 以为 是 自己 撕 的 ， 结果 是 服务员 到 你 面前 撕 ！ BBLANKK 然后 拌 ！ BBLANKK 那 味道 吃 的 真的 太 过瘾 了 ！ BBLANKK 姐妹 一直 在 赞 ！ BBLANKK 这 主要 还是 针对 能 吃 辣 ， 口味重 的 人 ！ BBLANKK 自我 觉得 也 超级 够味 BBLANKK ！ 还点 了 串串 ， 上来 以后 吃 的 发现 不是 自己 点 的 ， 服务员 去 核对 后 ， 又 重新 上 了 ！ BBLANKK 还 不停 的 抱歉 ！ BBLANKK 这 服务 也 不错 ， 如果 每个 服务员 脸上 在 挂些 微笑 则 更好 ！ BBLANKK 总之 吃 的 很 过瘾 ！ BBLANKK 价格 也 真的 很 便宜 ！ BBLANKK 撑 到 爆 ， 大众 买单 ， 两 人才 98 ！ 划算

最近 微博 比较 火 的 一家 店 ， 突然 心血来潮 就 去 了 ！ 路上 堵车 ， 一个多 小时 才 到 ， 而且 到 地方 已经 快 7 点 了 吧 ， 还 剩 最后 一个 紫米 的 ， 赶紧 买买 买 ！ 还 买 了 绿野仙踪 ， 一个 香蕉 的 忘记 名字 了 ， 还 买 了 一盒 奶冻 一样 的 东西 ， 比较 粗线条 ， 名字 不 记得 了 也 没 拍 ！ 奶冻 还行 ， 不过 味道 没有 85 的 浓 ！ 面包 都 比较 健康 的 口感 ， 都 不怎么 甜 ， 紫米 的 刚 开始 接受 不了 ， 感觉 不怎么样 ， 里面 有 超多 紫米 ， 核桃 ， 葡萄干 ， 桂圆 ， 后来 越 吃 越 好吃 ， 香蕉 的 那个 香蕉 烤熟 了 ， 吃 起来 不 好吃 ， 绿野仙踪 挺好吃 的 ， 绿豆 抹 茶 跟 乳酪 ！ 略 咸 的 口味 ， 去 的 较晚 ， 没买 到 托斯卡 尼 不 开心 ， 下次 接着 去 ， 还要 尝试 巧克力 什么 的 那款 ！ 总的来说 不错 ， 价格 略贵 但是 用料 真的 太足 了

Looking at the shapes of the embedding output, each review is split into
several sentences, and each word (token) in a sentence is represented
by a vector of length 1024. Would 1024 dimensions be able to represent all
the latent meanings in the world? Maybe not, but a combination of them
could capture the important information in this specific corpus decently well.

We need to find a way to use these embeddings in our downstream classification
task. There are many ways to do it. [This article](https://arxiv.org/abs/1806.06259)
experimented with many of them. Basically, it says that simple averages work well enough
for most tasks.

Another thing we noticed is that, since we are splitting sentences only on 「。」,
we fail to separate sentences that end with "！" or "？". As "！" and "？"
carry more meaning than a period, keeping them inside a sentence might not be a bad thing.

Anyway, let's improve our sentence splitting function a little bit:

In [79]:
import re
RE_EXCL = re.compile('！+')
RE_QUES = re.compile('？+')

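def split_by(s, regexp, char):
    """Split `s` on `regexp`, re-appending `char` to every piece but the last.

    Returns None when `s` contains no `char` other than at its ends.
    """
    if char in s.strip(char):
        tmp = regexp.split(s)
        last = tmp.pop()
        ret = [x + char for x in tmp]
        ret.append(last)  # add last sentence back
        return ret
    return None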

def article_to_sentences(articles):
    sentences, aids, slens = [], [], []
    for aid, article in enumerate(articles):
        ss = article.split('。')
        while ss:
            s = ss.pop(0).strip()
            if not s:
                continue
            tmp = split_by(s, RE_EXCL, '！')
            if tmp:
                ss = tmp + ss
                continue
                
            tmp = split_by(s, RE_QUES, '？')
            if tmp:
                ss = tmp + ss
                continue
                
            tokens = s.split()
            sentences.append(tokens)
            # keep a record of article ids and sentence length
            # so that we know which sentence/word belongs to
            # which article
            aids.append(aid)
            slens.append(len(tokens))
    return sentences, aids, slens


sentences, aids, slens = article_to_sentences(X_train[:2])
print('\n'.join(' '.join(x) for x in sentences))

print()
X_train_elmo = embedder.sents2elmo(sentences)
print(X_train_elmo[0].shape, len(X_train_elmo), len(slens))

2018-11-20 14:32:04,661 INFO: 1 batches, avg len: 18.0


很少 吃 兔子 肉 ！
BBLANKK 在 那 附近 本来 想 去 吃 泰国菜 的 ， 最后 还是 选择 了 这家 ， 果然 没 让 失望 ， 进去 后 的 座位 上 都 有 只 流氓兔 的 抱 枕 ， 点 了 一个 爆辣 的 兔头 ， 姐妹 吃 的 一直 赞不绝口 ！
BBLANKK
手 撕 烤 兔 ， 本 以为 是 自己 撕 的 ， 结果 是 服务员 到 你 面前 撕 ！
BBLANKK 然后 拌 ！
BBLANKK 那 味道 吃 的 真的 太 过瘾 了 ！
BBLANKK 姐妹 一直 在 赞 ！
BBLANKK 这 主要 还是 针对 能 吃 辣 ， 口味重 的 人 ！
BBLANKK 自我 觉得 也 超级 够味 BBLANKK ！
还点 了 串串 ， 上来 以后 吃 的 发现 不是 自己 点 的 ， 服务员 去 核对 后 ， 又 重新 上 了 ！
BBLANKK 还 不停 的 抱歉 ！
BBLANKK 这 服务 也 不错 ， 如果 每个 服务员 脸上 在 挂些 微笑 则 更好 ！
BBLANKK 总之 吃 的 很 过瘾 ！
BBLANKK 价格 也 真的 很 便宜 ！
BBLANKK 撑 到 爆 ， 大众 买单 ， 两 人才 98 ！
划算
最近 微博 比较 火 的 一家 店 ， 突然 心血来潮 就 去 了 ！
路上 堵车 ， 一个多 小时 才 到 ， 而且 到 地方 已经 快 7 点 了 吧 ， 还 剩 最后 一个 紫米 的 ， 赶紧 买买 买 ！
还 买 了 绿野仙踪 ， 一个 香蕉 的 忘记 名字 了 ， 还 买 了 一盒 奶冻 一样 的 东西 ， 比较 粗线条 ， 名字 不 记得 了 也 没 拍 ！
奶冻 还行 ， 不过 味道 没有 85 的 浓 ！
面包 都 比较 健康 的 口感 ， 都 不怎么 甜 ， 紫米 的 刚 开始 接受 不了 ， 感觉 不怎么样 ， 里面 有 超多 紫米 ， 核桃 ， 葡萄干 ， 桂圆 ， 后来 越 吃 越 好吃 ， 香蕉 的 那个 香蕉 烤熟 了 ， 吃 起来 不 好吃 ， 绿野仙踪 挺好吃 的 ， 绿豆 抹 茶 跟 乳酪 ！
略 咸 的 口味 ， 去 的 较晚 ， 没买 到 托斯卡 尼 不 开心 ， 下次 接着 去 ， 还要 尝试 巧克力 什么 的 那款 ！
总的来说 不错 ， 价格 略贵 但是 用料 真的 太足 了 ！
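
As a possible alternative to chaining `split_by` calls, the same kind of split can be sketched with a single regular expression. This is only a sketch under assumptions: `SENT_RE` and `split_sentences` are hypothetical names, the input is assumed to be whitespace-segmented text as above, and the behavior differs slightly from the function above (a terminator at the very start of an article is dropped rather than kept as its own sentence).

```python
import re

# Hypothetical helper, not part of fgclassifier.
# A sentence is a run of non-terminator characters, optionally followed
# by "！" / "？" marks, which are kept. "。" is never matched, so it is dropped.
SENT_RE = re.compile(r'[^。！？]+[！？]*')


def split_sentences(article):
    """Return a list of token lists, one per sentence."""
    sentences = []
    for piece in SENT_RE.findall(article):
        tokens = piece.split()
        if tokens:  # skip whitespace-only pieces
            sentences.append(tokens)
    return sentences


example = '很少 吃 兔子 肉 ！ BBLANKK 在 那 。 好 吃 吗 ？ 划算'
for tokens in split_sentences(example):
    print(' '.join(tokens))
```

The article ids and sentence lengths could be collected in the same loop, as `article_to_sentences` does.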

Let's try to build the averages. We can average word by word over the whole
article, or calculate the average within each sentence first and then average
the sentence vectors for an article (review). The latter might give
shorter sentences more weight than they deserve.
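
To make the weighting difference concrete, here is a toy sketch with made-up 2-dimensional vectors (not real ELMo output): one article with a one-word sentence and a three-word sentence.

```python
import numpy as np

# Toy data: one article, two sentences, 2-dimensional "embeddings"
# standing in for the 1024-dimensional ELMo vectors.
embs = [
    np.array([[4.0, 0.0]]),                          # 1-word sentence
    np.array([[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]),  # 3-word sentence
]

# Strategy 1: average over all words in the article.
# Every word counts equally, so the long sentence dominates.
word_avg = np.vstack(embs).mean(axis=0)   # values: [1.0, 1.5]

# Strategy 2: average within each sentence, then across sentences.
# Every sentence counts equally, so the single word in the short
# sentence weighs as much as the three words of the long one.
sent_avg = np.vstack([e.mean(axis=0) for e in embs]).mean(axis=0)  # values: [2.0, 1.0]

print(word_avg, sent_avg)
```

The word-level average equals a sentence-level average weighted by sentence length, which is why the two strategies only agree when all sentences are equally long.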

In [81]:
sentences, aids, slens = article_to_sentences(X_test[:10])
embs = embedder.sents2elmo(sentences)

2018-11-20 14:33:28,180 INFO: 6 batches, avg len: 14.7


In [84]:
import numpy as np
import pandas as pd

# Take sentence average on words
# (np.vstack needs a sequence, not a generator, in recent NumPy versions)
sents_avg = np.vstack([np.mean(x, axis=0) for x in embs])
df = pd.DataFrame(np.hstack([np.array([aids]).T, sents_avg]))
# Then take article averages based on sentences;
# the first column holds the article ids
df[0] = df[0].astype('int')
df.groupby(0).mean()

0,1,2,3,4,5,6,7,8,9,10,...,1015,1016,1017,1018,1019,1020,1021,1022,1023,1024
0,0.068202,0.293435,-0.946647,0.373405,0.019808,-0.310937,-0.316767,-0.079124,-0.098896,-0.287463,...,0.508182,-0.198488,0.178393,-0.403227,-0.50139,0.386936,0.076814,-0.177638,0.216966,-0.211923
1,-0.028542,0.093681,-0.924516,0.674874,-0.074027,-0.198827,-0.504699,0.062472,0.138705,-0.276513,...,0.588356,-0.263873,0.328776,-0.450251,-0.257557,0.023268,-0.341576,-0.014883,0.103347,-0.061155
2,0.248675,0.021631,-1.012321,0.531638,0.065087,-0.145901,-0.556056,-0.039598,0.485349,0.047028,...,0.317467,-0.099202,0.078911,-0.304988,0.038534,0.122042,-0.422591,-0.23444,0.100334,-0.086534
3,0.121712,0.19193,-1.231352,0.40495,0.14316,-0.060729,0.188212,-0.08565,0.169458,-0.272459,...,0.143824,0.124883,0.50567,-0.55885,-0.620959,0.625774,0.275083,-0.248307,0.324095,-0.462545
4,0.251268,0.366207,-1.047636,0.246586,-0.294999,0.240501,-0.355867,0.07159,0.111405,-0.22696,...,0.328104,-0.07221,0.58455,-0.452596,-0.19337,0.233115,0.113667,-0.072564,0.15479,-0.488066
5,-0.120469,0.011523,-1.135528,0.837119,0.166698,-0.525736,-0.585312,0.046266,0.663572,0.158238,...,0.30573,-0.132625,-0.224237,-0.227819,0.195378,-0.319985,-0.537222,-0.356351,-0.121881,-0.036729
6,0.158894,0.032087,-1.188803,0.605884,-0.230254,0.110983,-0.513036,0.277207,0.419474,-0.207109,...,0.077929,0.107817,0.205626,-0.079184,-0.238272,0.354426,-0.160015,-0.219914,-0.018861,-0.255687
7,0.001826,0.304018,-0.801234,0.362024,-0.059323,-0.169137,-0.137594,0.014599,0.292199,-0.334517,...,0.750622,-0.212787,0.395819,-0.112058,-0.742296,0.169252,-0.220114,-0.057841,0.079855,-0.347297
8,0.1193,0.060534,-0.907626,0.326221,-0.129374,0.134551,-0.322363,0.180659,0.279821,-0.278353,...,0.574352,-0.240929,0.182954,-0.081244,-0.27395,0.306851,-0.160054,0.057242,0.103492,-0.242089
9,0.045324,0.399205,-1.057199,0.490464,0.064823,-0.298407,-0.111564,-0.027645,0.063582,-0.403712,...,0.404983,-0.216591,0.170507,-0.260297,-0.485226,0.482495,-0.006189,-0.139399,0.120037,-0.254346


Here's how to take averages word by word:

In [85]:
word_aids = np.repeat(aids, slens)  # article id of each word
words = np.vstack(embs)
df = pd.DataFrame(np.hstack([np.array([word_aids]).T, words]))
# Then take article averages over all words;
# the first column holds the article ids
df[0] = df[0].astype('int')
df.groupby(0).mean()

0,1,2,3,4,5,6,7,8,9,10,...,1015,1016,1017,1018,1019,1020,1021,1022,1023,1024
0,0.101779,0.283301,-0.900231,0.345071,0.022835,-0.299208,-0.232614,-0.100858,-0.100119,-0.260453,...,0.555478,-0.215898,0.132628,-0.422329,-0.462794,0.295195,0.064091,-0.166212,0.169991,-0.169296
1,0.070025,0.063162,-0.878004,0.743448,-0.069338,-0.124913,-0.556164,-0.026335,0.243542,-0.211147,...,0.687062,-0.255119,0.312055,-0.457865,-0.279171,0.006943,-0.28155,0.003033,0.132788,-0.011245
2,0.194546,-0.069376,-0.94338,0.561626,0.01715,0.002253,-0.408099,-0.051316,0.345255,-0.165207,...,0.46193,-0.164422,0.247815,-0.450748,-0.010738,0.114379,-0.215294,-0.125112,0.170125,-0.207496
3,0.281636,0.127619,-0.966139,0.287935,0.171048,-0.079147,-0.015877,0.095158,0.081275,-0.30231,...,0.443954,-0.04262,0.261498,-0.44975,-0.626974,0.35635,0.132579,-0.321036,0.315697,-0.345978
4,0.144019,0.258029,-1.091062,0.266817,-0.081337,0.245391,-0.244364,0.084588,0.071181,-0.24254,...,0.344382,-0.072969,0.427867,-0.324701,-0.040438,0.153624,-0.159518,0.031568,0.105475,-0.480427
5,0.047805,0.099159,-0.922499,0.560363,0.175776,-0.301265,-0.403926,-0.062262,0.414132,-0.095919,...,0.588578,-0.120086,0.178956,-0.372808,0.113399,-0.096326,-0.386146,-0.082621,-0.064487,-0.211222
6,0.150593,0.045343,-1.10508,0.51456,-0.154937,0.297473,-0.400714,0.175799,0.418541,-0.212346,...,0.262562,0.139245,0.398199,-0.147461,-0.227501,0.350058,-0.038029,-0.111012,0.050968,-0.209893
7,-0.054791,0.211176,-0.81011,0.25132,-0.028393,-0.247055,-0.20022,-0.002998,0.196519,-0.311758,...,0.742207,-0.215268,0.291482,-0.181914,-0.634936,0.153558,-0.252822,-0.046519,0.042969,-0.187608
8,0.133544,0.08285,-0.978896,0.402337,-0.032145,0.132255,-0.392965,0.290704,0.30242,-0.203162,...,0.533943,-0.190472,0.145426,-0.108401,-0.172996,0.393919,-0.295246,0.145372,0.017513,-0.150516
9,-0.046585,0.279045,-0.99854,0.436874,0.025244,-0.248909,0.048599,0.054734,0.12404,-0.409424,...,0.442144,-0.300073,0.139323,-0.363365,-0.491794,0.431296,-0.010705,-0.172927,0.135138,-0.279213


Putting it all together:

In [122]:
from elmoformanylangs import Embedder as ElmoEmbedder

embedder = ElmoEmbedder('data/elmo-zhs-100k-mc4-lr0.001/')

2018-11-20 15:47:17,811 INFO: char embedding size: 8844
2018-11-20 15:47:18,191 INFO: word embedding size: 69598
2018-11-20 15:47:23,175 INFO: Model(
  (token_embedder): ConvTokenEmbedder(
    (word_emb_layer): EmbeddingLayer(
      (embedding): Embedding(69598, 100, padding_idx=3)
    )
    (char_emb_layer): EmbeddingLayer(
      (embedding): Embedding(8844, 50, padding_idx=8841)
    )
    (convolutions): ModuleList(
      (0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
      (1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
      (2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
      (3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
      (4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
      (5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
      (6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
    )
    (highways): Highway(
      (_layers): ModuleList(
        (0): Linear(in_features=2048, out_features=4096, bias=True)
        (1): Linear(in_features=2048, out_feat

In [121]:
embedder2 = ElmoEmbedder('../data/zhs.model/')

2018-11-20 15:47:10,826 INFO: char embedding size: 6169
2018-11-20 15:47:11,218 INFO: word embedding size: 71222
2018-11-20 15:47:16,098 INFO: Model(
  (token_embedder): ConvTokenEmbedder(
    (word_emb_layer): EmbeddingLayer(
      (embedding): Embedding(71222, 100, padding_idx=3)
    )
    (char_emb_layer): EmbeddingLayer(
      (embedding): Embedding(6169, 50, padding_idx=6166)
    )
    (convolutions): ModuleList(
      (0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
      (1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
      (2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
      (3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
      (4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
      (5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
      (6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
    )
    (highways): Highway(
      (_layers): ModuleList(
        (0): Linear(in_features=2048, out_features=4096, bias=True)
        (1): Linear(in_features=2048, out_feat

In [197]:
import pandas as pd
import numpy as np

from collections import defaultdict
from sklearn.base import BaseEstimator
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from elmoformanylangs import Embedder as ElmoEmbedder
from fgclassifier.features import FeaturePipeline, logger, article_to_sentences


class ElmoRawVectorizer(BaseEstimator):
    """
    ELMo word embeddings from pre-trained models
    
    Parameters
    ----------
        embedder:  an Embedder instance, or the path to a saved
                   ELMoForManyLangs model
    """
    
    def __init__(self, embedder, batch_size=64, split_sentence=True):
        if isinstance(embedder, str):
            embedder = ElmoEmbedder(embedder, batch_size=batch_size)
        else:
            embedder.batch_size = batch_size
        self.embedder = embedder
        # Whether to split reviews into sentences before passing to embedder
        # otherwise a review will be considered as one sentence.
        self.split_sentence = split_sentence
        
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
    
    def fit(self, X, y=None):
        # No fitting needed, because we are using pre-trained
        # models, which is already loaded while initializing
        return self
    
    def transform(self, X):
        sentences, aids, slens = article_to_sentences(X, self.split_sentence)
        # Use this instance's embedder, not the notebook-level global
        embs = self.embedder.sents2elmo(sentences)
        return (aids, slens, embs)
    

# Inherit fit/fit_transform from ElmoRawVectorizer
# (no class named ElmoVectorizer is defined in this notebook)
class ElmoTransformer(ElmoRawVectorizer):
    """Transform ELMo embeddings to 1D features
    by averaging word vectors for all sentences
    
    Parameters
    ----------
        strategy:  How to calculate the vector. Choose from
                   - word:      word by word average
                   - sentence:  take in-sentence average by words first,
                                then take sentence averages for articles
    """
    
    def __init__(self, strategy='word'):
        self.strategy = strategy
        
    def transform(self, X):
        aids, slens, embs = X  # X must be a tuple
        if self.strategy == 'word':
            word_aids = np.repeat(aids, slens) # word article ids
            words = np.vstack(embs)
            df = pd.DataFrame(np.hstack([np.array([word_aids]).T, words]))
        else:
            sents_avg = np.vstack([np.mean(x, axis=0) for x in embs])
            df = pd.DataFrame(np.hstack([np.array([aids]).T, sents_avg]))
        df[0] = df[0].astype('int')
        df = df.groupby(0).mean()
        return df.values
    
fm_spec = {
    'elmo_raw': ElmoRawVectorizer(embedder, batch_size=32),
    # embedder2 uses the official pre-trained model downloaded from
    #   https://github.com/HIT-SCIR/ELMoForManyLangs
    'elmo_raw2': ElmoRawVectorizer(embedder2, batch_size=32),
    'elmo_raw3': ElmoRawVectorizer(embedder, batch_size=32, split_sentence=False),
    'elmo_raw4': ElmoRawVectorizer(embedder2, batch_size=32, split_sentence=False),
    'elmo': ['elmo_raw', ElmoTransformer()],
    'elmo2': ['elmo_raw2', ElmoTransformer()],
    'elmo3': ['elmo_raw3', ElmoTransformer()],
    'elmo4': ['elmo_raw4', ElmoTransformer()],
    'elmo_sent_avg': ['elmo_raw', ElmoTransformer(strategy='sentence')],
    'elmo_sent_avg2': ['elmo_raw2', ElmoTransformer(strategy='sentence')],
    'elmo_svd_500': ['elmo', TruncatedSVD(n_components=500)],
    'elmo_sent_avg_svd_500': ['elmo_sent_avg', TruncatedSVD(n_components=500)],
    'count': [CountVectorizer(ngram_range=(1, 4), min_df=0.005, max_df=0.95)],
    'tfidf': ['count', TfidfTransformer()],
    'lsa_100': ['tfidf', TruncatedSVD(n_components=100)],
    'lsa_500': ['tfidf', TruncatedSVD(n_components=500)],
}
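
One thing to watch with `groupby`-based pooling: if an article yields no sentences (for example, an empty review), it gets no row in the result, so the feature matrix can end up with fewer rows than there are labels. A minimal pandas-free sketch that always returns one row per article is shown below. `pool_by_article` and its `n_articles` parameter are hypothetical names introduced for illustration, not part of `fgclassifier`, and using a zero vector for empty articles is an assumption rather than something the notebook does.

```python
import numpy as np


def pool_by_article(aids, vectors, n_articles):
    """Average `vectors` per article id, keeping one row per article.

    aids:       article id for each row of `vectors`
    vectors:    array of shape (n_rows, dim), word or sentence vectors
    n_articles: total number of articles, including empty ones
    """
    aids = np.asarray(aids, dtype=int)
    vectors = np.asarray(vectors, dtype=float)
    sums = np.zeros((n_articles, vectors.shape[1]))
    # Unbuffered add, so repeated article ids accumulate correctly
    np.add.at(sums, aids, vectors)
    counts = np.bincount(aids, minlength=n_articles)
    # Articles without any rows keep an all-zero vector
    return sums / np.maximum(counts, 1)[:, None]


# Article 1 has no rows, yet still gets a (zero) feature vector
features = pool_by_article([0, 0, 2], [[1, 1], [3, 3], [5, 5]], n_articles=3)
print(features.shape)
```

For the word strategy, `aids` would be `np.repeat(aids, slens)` and `vectors` would be `np.vstack(embs)`, as in `ElmoTransformer.transform`.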

In [None]:
fm = defaultdict(dict)

In [198]:
# del fm['count']
# del fm['tfidf']
fm.keys()

dict_keys(['elmo_raw', 'elmo', 'elmo_sent_avg', 'elmo_svd_500', 'elmo_sent_avg_svd_500', 'lsa_100', 'elmo_sent_avg_svg_500', 'elmo_raw2', 'elmo2', 'elmo_sent_avg2', 'elmo_raw3', 'elmo_raw4', 'elmo3', 'elmo4'])

In [199]:
for name in fm_spec.keys():
    logger.info(f'Building features for {name}...')
    model = FeaturePipeline(name, spec=fm_spec, cache=fm)
    model.fit_transform(X_train[:100])
    model.transform(X_test[:10])

2018-11-20 17:28:31,232 INFO: Building features for elmo_raw...
2018-11-20 17:28:31,235 INFO:   elmo_raw: fit_transform use cache.
2018-11-20 17:28:31,237 INFO:   elmo_raw: transform use cache.
2018-11-20 17:28:31,239 INFO: Building features for elmo_raw2...
2018-11-20 17:28:31,241 INFO:   elmo_raw2: fit_transform use cache.
2018-11-20 17:28:31,243 INFO:   elmo_raw2: transform use cache.
2018-11-20 17:28:31,245 INFO: Building features for elmo_raw3...
2018-11-20 17:28:31,249 INFO:   elmo_raw3: fit_transform use cache.
2018-11-20 17:28:31,252 INFO:   elmo_raw3: transform use cache.
2018-11-20 17:28:31,253 INFO: Building features for elmo_raw4...
2018-11-20 17:28:31,255 INFO:   elmo_raw4: fit_transform use cache.
2018-11-20 17:28:31,258 INFO:   elmo_raw4: transform use cache.
2018-11-20 17:28:31,262 INFO: Building features for elmo...
2018-11-20 17:28:31,264 INFO:   elmo: fit_transform use cache.
2018-11-20 17:28:31,266 INFO:   elmo: transform use cache.
2018-11-20 17:28:31,268 INFO: Bui

## Performance

In [104]:
all_avg_scores, all_scores = defaultdict(dict), defaultdict(dict)

In [145]:
from fgclassifier.train import fm_cross_check

conf = {
    'fm_cache': fm,
    'y_train': y_train[:100],
    'y_test': y_test[:10],
    'results': {
        'avg': all_avg_scores,
        'all': all_scores
    }
}

# fm_cross_check(
#     ['tfidf'],
#     ['ComplementNB', 'DummyStratified'], **conf)

# We'd only need to run the dummy models on one feature model,
# as they do not care about the features
results = fm_cross_check(
    ['elmo', 'elmo2',
     'elmo3', 'elmo4',
     'elmo_sent_avg', 'elmo_sent_avg2'
    ],
    ['LinearSVC', 'Ridge'], **conf)

2018-11-20 16:51:37,504 INFO: Train for elmo -> LinearSVC...
2018-11-20 16:51:41,951 INFO: [Validate]: F1 Scores
2018-11-20 16:51:41,952 INFO:   location_traffic_convenience            	0.6875
  'precision', 'predicted', average, warn_for)
2018-11-20 16:51:41,954 INFO:   location_distance_from_business_district	0.4737
2018-11-20 16:51:41,956 INFO:   location_easy_to_find                   	0.5139
  'recall', 'true', average, warn_for)
2018-11-20 16:51:41,958 INFO:   service_wait_time                       	0.2963
2018-11-20 16:51:41,960 INFO:   service_waiters_attitude                	0.1667
2018-11-20 16:51:41,962 INFO:   service_parking_convenience             	1.0000
2018-11-20 16:51:41,964 INFO:   service_serving_speed                   	0.5139
2018-11-20 16:51:41,967 INFO:   price_level                             	0.1818
2018-11-20 16:51:41,969 INFO:   price_cost_effective                    	0.4778
2018-11-20 16:51:41,971 INFO:   price_discount                          	0.4329
2

2018-11-20 16:51:46,848 INFO:   service_parking_convenience             	0.4737
2018-11-20 16:51:46,850 INFO:   service_serving_speed                   	0.4444
2018-11-20 16:51:46,852 INFO:   price_level                             	0.3485
2018-11-20 16:51:46,854 INFO:   price_cost_effective                    	0.4118
2018-11-20 16:51:46,857 INFO:   price_discount                          	0.5524
2018-11-20 16:51:46,859 INFO:   environment_decoration                  	0.7619
2018-11-20 16:51:46,861 INFO:   environment_noise                       	0.8039
2018-11-20 16:51:46,863 INFO:   environment_space                       	0.1667
2018-11-20 16:51:46,866 INFO:   environment_cleaness                    	0.2929
2018-11-20 16:51:46,868 INFO:   dish_portion                            	0.2792
2018-11-20 16:51:46,870 INFO:   dish_taste                              	0.3173
2018-11-20 16:51:46,872 INFO:   dish_look                               	0.2059
2018-11-20 16:51:46,874 INFO:   dish_rec

2018-11-20 16:51:56,144 INFO:   dish_taste                              	0.2652
2018-11-20 16:51:56,146 INFO:   dish_look                               	0.2059
2018-11-20 16:51:56,148 INFO:   dish_recommendation                     	0.2500
2018-11-20 16:51:56,150 INFO:   others_overall_experience               	0.5231
2018-11-20 16:51:56,152 INFO:   others_willing_to_consume_again         	0.6703
2018-11-20 16:51:56,153 INFO: ---------------------------------------------------
2018-11-20 16:51:56,154 INFO: 【elmo4 -> LinearSVC】: 0.4447
2018-11-20 16:51:56,155 INFO: ---------------------------------------------------
2018-11-20 16:51:56,156 INFO: Train for elmo4 -> Ridge...
2018-11-20 16:51:56,297 INFO: [Validate]: F1 Scores
2018-11-20 16:51:56,299 INFO:   location_traffic_convenience            	0.4118
  'precision', 'predicted', average, warn_for)
2018-11-20 16:51:56,301 INFO:   location_distance_from_business_district	0.4737
2018-11-20 16:51:56,303 INFO:   location_easy_to_find       

2018-11-20 16:52:04,307 INFO:   location_traffic_convenience            	0.6000
  'precision', 'predicted', average, warn_for)
2018-11-20 16:52:04,309 INFO:   location_distance_from_business_district	0.4737
2018-11-20 16:52:04,311 INFO:   location_easy_to_find                   	0.2500
  'recall', 'true', average, warn_for)
2018-11-20 16:52:04,314 INFO:   service_wait_time                       	0.4737
2018-11-20 16:52:04,316 INFO:   service_waiters_attitude                	0.2333
2018-11-20 16:52:04,318 INFO:   service_parking_convenience             	0.4737
2018-11-20 16:52:04,320 INFO:   service_serving_speed                   	0.2059
2018-11-20 16:52:04,324 INFO:   price_level                             	0.3167
2018-11-20 16:52:04,326 INFO:   price_cost_effective                    	0.8667
2018-11-20 16:52:04,328 INFO:   price_discount                          	0.3162
2018-11-20 16:52:04,330 INFO:   environment_decoration                  	0.3485
2018-11-20 16:52:04,332 INFO:   en

In [200]:
len(fm['count']['model'].named_steps['count'].vocabulary_)

37981

In [204]:
from sklearn.neural_network import MLPClassifier
from fgclassifier.baseline import Baseline

model = Baseline(
    classifier=MLPClassifier(
        hidden_layer_sizes=(300, 100),
        early_stopping=True,
        learning_rate='adaptive',
        learning_rate_init=0.004,
        max_iter=300
    )
)
for fmn in ['tfidf', 'elmo', 'elmo2']:
    model.fit(fm[fmn]['train'], conf['y_train'])
    print(f'【{fmn} -> MLP】', model.score(fm[fmn]['test'], conf['y_test']))

2018-11-20 17:32:06,584 INFO: [Validate]: F1 Scores
  'precision', 'predicted', average, warn_for)
2018-11-20 17:32:06,586 INFO:   location_traffic_convenience            	0.4737
2018-11-20 17:32:06,588 INFO:   location_distance_from_business_district	0.4737
2018-11-20 17:32:06,590 INFO:   location_easy_to_find                   	0.2963
2018-11-20 17:32:06,592 INFO:   service_wait_time                       	1.0000
2018-11-20 17:32:06,595 INFO:   service_waiters_attitude                	0.2500
2018-11-20 17:32:06,597 INFO:   service_parking_convenience             	0.4737
2018-11-20 17:32:06,599 INFO:   service_serving_speed                   	0.4737
2018-11-20 17:32:06,601 INFO:   price_level                             	0.2059
2018-11-20 17:32:06,604 INFO:   price_cost_effective                    	0.4118
2018-11-20 17:32:06,606 INFO:   price_discount                          	0.2500
2018-11-20 17:32:06,609 INFO:   environment_decoration                  	0.6000
2018-11-20 17:32:06,6

【tfidf -> MLP】 0.37606387291758814


2018-11-20 17:32:10,691 INFO: [Validate]: F1 Scores
  'precision', 'predicted', average, warn_for)
2018-11-20 17:32:10,692 INFO:   location_traffic_convenience            	0.4737
2018-11-20 17:32:10,694 INFO:   location_distance_from_business_district	0.4737
2018-11-20 17:32:10,696 INFO:   location_easy_to_find                   	0.2963
2018-11-20 17:32:10,698 INFO:   service_wait_time                       	1.0000
2018-11-20 17:32:10,700 INFO:   service_waiters_attitude                	0.2500
2018-11-20 17:32:10,703 INFO:   service_parking_convenience             	0.4737
2018-11-20 17:32:10,705 INFO:   service_serving_speed                   	0.4737
2018-11-20 17:32:10,707 INFO:   price_level                             	0.0455
2018-11-20 17:32:10,709 INFO:   price_cost_effective                    	0.4118
2018-11-20 17:32:10,711 INFO:   price_discount                          	0.2500
2018-11-20 17:32:10,713 INFO:   environment_decoration                  	0.2308
2018-11-20 17:32:10,7

【elmo -> MLP】 0.3619754246011987


2018-11-20 17:32:13,489 INFO: [Validate]: F1 Scores
  'precision', 'predicted', average, warn_for)
2018-11-20 17:32:13,492 INFO:   location_traffic_convenience            	0.4737
2018-11-20 17:32:13,493 INFO:   location_distance_from_business_district	0.4737
2018-11-20 17:32:13,496 INFO:   location_easy_to_find                   	0.2963
2018-11-20 17:32:13,498 INFO:   service_wait_time                       	1.0000
2018-11-20 17:32:13,500 INFO:   service_waiters_attitude                	0.2381
2018-11-20 17:32:13,502 INFO:   service_parking_convenience             	0.4737
2018-11-20 17:32:13,504 INFO:   service_serving_speed                   	0.4737
2018-11-20 17:32:13,506 INFO:   price_level                             	0.3333
2018-11-20 17:32:13,509 INFO:   price_cost_effective                    	0.4118
2018-11-20 17:32:13,511 INFO:   price_discount                          	0.2500
2018-11-20 17:32:13,513 INFO:   environment_decoration                  	0.6970
2018-11-20 17:32:13,5

【elmo2 -> MLP】 0.40503625946350413


A simple multi-layer perceptron was not particularly helpful. But bear in mind that we only used a very small sample.