# Word Embedding from Scratch

2021/01/07
- I put together a notebook on "Speeding up word embedding computation with mixed precision."
 - [Word Embedding with mixed precision](https://colab.research.google.com/drive/1LkZ7pjxbrUh2AIcsR7Zki2tv4V3Du2I3?usp=sharing)

## Setup

In [1]:
%tensorflow_version 2.x

In [2]:
!pip install -U gensim

Collecting gensim
  Downloading gensim-4.0.1-cp37-cp37m-manylinux1_x86_64.whl (23.9 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.0.1


### Hyper-parameters

In [3]:
negative_samples = 1
num_words = 10000
window_size = 1
emb_dim = 50

### Imports

In [4]:
from pprint import pprint

import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import skipgrams, make_sampling_table
from tensorflow.keras.layers import Input, Dot, Flatten, Embedding, Dense
from tensorflow.keras.models import Model, load_model

## The dataset

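This time we use [ja.text8](https://github.com/Hironsan/ja.text8) as the dataset. It is a corpus built by preprocessing Wikipedia and extracting 100 MB of text. Because the text is already segmented into words, it is easy to start training word embeddings on it.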


In [None]:
# !mkdir data
# !wget https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/ja.text8.zip -P data/
# !unzip data/ja.text8.zip -d data/

--2020-11-01 08:49:08--  https://s3-ap-northeast-1.amazonaws.com/dev.tech-sketch.jp/chakki/public/ja.text8.zip
Resolving s3-ap-northeast-1.amazonaws.com (s3-ap-northeast-1.amazonaws.com)... 52.219.12.18
Connecting to s3-ap-northeast-1.amazonaws.com (s3-ap-northeast-1.amazonaws.com)|52.219.12.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33905114 (32M) [application/zip]
Saving to: ‘data/ja.text8.zip’


2020-11-01 08:49:11 (12.5 MB/s) - ‘data/ja.text8.zip’ saved [33905114/33905114]

Archive:  data/ja.text8.zip
  inflating: data/ja.text8           


In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Load ja.text8

In [7]:
def load_data(filepath, encoding='utf-8'):
    with open(filepath, encoding=encoding) as f:
        return f.read()


text = load_data(filepath='/content/drive/MyDrive/colab/自然言語処理入門/ja.text8')

### Preprocess the dataset

In [8]:
def build_vocabulary(text, num_words=None):
    tokenizer = Tokenizer(num_words=num_words, oov_token='<UNK>')
    tokenizer.fit_on_texts([text])
    return tokenizer


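def create_dataset(text, vocab, num_words, window_size, negative_samples):
    # Convert the whole corpus into a single sequence of word indices.
    data = vocab.texts_to_sequences([text]).pop()
    # Sampling probabilities used to subsample very frequent words.
    sampling_table = make_sampling_table(num_words)
    # Build (target, context) pairs: label 1 for real pairs, 0 for negative samples.
    couples, labels = skipgrams(data, num_words,
                                window_size=window_size,
                                negative_samples=negative_samples,
                                sampling_table=sampling_table)
    word_target, word_context = zip(*couples)
    # Reshape to (n_pairs, 1) to match the two model inputs.
    word_target = np.reshape(word_target, (-1, 1))
    word_context = np.reshape(word_context, (-1, 1))
    labels = np.asarray(labels)
    return [word_target, word_context], labels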


vocab = build_vocabulary(text, num_words)
x, y = create_dataset(text, vocab, num_words, window_size, negative_samples)
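To make the output of `create_dataset` concrete, here is a minimal pure-Python sketch of skip-gram pair generation with negative sampling on a toy sequence. The helper `toy_skipgrams` is hypothetical and deliberately simplified: it leaves out the frequency-based subsampling and the shuffling that the Keras `skipgrams` function applies, and only illustrates the shape of the data.

```python
import random


def toy_skipgrams(sequence, vocabulary_size, window_size=1, negative_samples=1.0, seed=0):
    """Hypothetical, simplified stand-in for Keras `skipgrams` (illustration only)."""
    rng = random.Random(seed)
    couples, labels = [], []
    # Positive pairs: every (target, context) within the window gets label 1.
    for i, target in enumerate(sequence):
        if target == 0:  # index 0 is reserved for padding, so skip it
            continue
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i and sequence[j] != 0:
                couples.append([target, sequence[j]])
                labels.append(1)
    # Negative pairs: reuse real targets, pair them with random word ids, label 0.
    num_negative = int(len(labels) * negative_samples)
    targets = [c[0] for c in couples]
    for _ in range(num_negative):
        couples.append([rng.choice(targets), rng.randint(1, vocabulary_size - 1)])
        labels.append(0)
    return couples, labels


toy_couples, toy_labels = toy_skipgrams([1, 2, 3, 4], vocabulary_size=5)
for couple, label in zip(toy_couples, toy_labels):
    print(couple, label)
```

With `window_size=1`, each word is paired only with its immediate neighbours, and `negative_samples=1` adds one random pair per real pair, so the labels are balanced.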

## The model

### Build the model

In [9]:
class EmbeddingModel:

    def __init__(self, vocab_size, emb_dim=100):
        self.word_input = Input(shape=(1,), name='word_input')
        self.word_embed = Embedding(input_dim=vocab_size,
                                    output_dim=emb_dim,
                                    input_length=1,
                                    name='word_embedding')

        self.context_input = Input(shape=(1,), name='context_input')
        self.context_embed = Embedding(input_dim=vocab_size,
                                       output_dim=emb_dim,
                                       input_length=1,
                                       name='context_embedding')

        self.dot = Dot(axes=2)
        self.flatten = Flatten()
        self.output = Dense(1, activation='sigmoid')

    def build(self):
        word_embed = self.word_embed(self.word_input)
        context_embed = self.context_embed(self.context_input)
        dot = self.dot([word_embed, context_embed])
        flatten = self.flatten(dot)
        output = self.output(flatten)
        model = Model(inputs=[self.word_input, self.context_input],
                      outputs=output)
        return model


model = EmbeddingModel(num_words, emb_dim)
model = model.build()
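The network scores a (word, context) pair by taking the dot product of the two embedding vectors and passing it through a single sigmoid unit. The following NumPy sketch reproduces that forward pass with random, untrained weights. The names `toy_word_emb`, `toy_context_emb`, `dense_w`, `dense_b`, and `predict_pair` are hypothetical and exist only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_emb_dim = 6, 4

# Two separate lookup tables, mirroring 'word_embedding' and 'context_embedding'.
toy_word_emb = rng.normal(size=(toy_vocab_size, toy_emb_dim))
toy_context_emb = rng.normal(size=(toy_vocab_size, toy_emb_dim))
# With a scalar input, the Dense(1) layer reduces to one weight and one bias.
dense_w, dense_b = 1.0, 0.0


def predict_pair(word_ids, context_ids):
    """Return the probability that each (word, context) pair is a real pair."""
    w = toy_word_emb[word_ids]        # shape: (batch, emb_dim)
    c = toy_context_emb[context_ids]  # shape: (batch, emb_dim)
    dot = np.sum(w * c, axis=1)       # per-pair dot product, shape: (batch,)
    logits = dense_w * dot + dense_b
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid


probs = predict_pair(np.array([1, 2, 3]), np.array([2, 3, 4]))
print(probs.shape)
```

Training with binary cross-entropy pushes this probability toward 1 for real pairs and toward 0 for negative samples, which is what shapes the word embedding table.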

In [10]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
word_input (InputLayer)         [(None, 1)]          0                                            
__________________________________________________________________________________________________
context_input (InputLayer)      [(None, 1)]          0                                            
__________________________________________________________________________________________________
word_embedding (Embedding)      (None, 1, 50)        500000      word_input[0][0]                 
__________________________________________________________________________________________________
context_embedding (Embedding)   (None, 1, 50)        500000      context_input[0][0]              
______________________________________________________________________________________________

### Train the model

In [None]:
epochs = 100
batch_size = 128
save_path = '/tmp/model'
log_dir = 'logs'

model.compile(
    optimizer='adam',
    loss='binary_crossentropy'
)

history = model.fit(
    x, y,
    validation_split=0.2,
    epochs=epochs,
    batch_size=batch_size,
    callbacks=[
               EarlyStopping(monitor='val_loss', patience=3),
               ModelCheckpoint(
                   filepath=save_path,
                   monitor='val_loss',
                   save_best_only=True,
                   mode='min'
               ),
               TensorBoard(log_dir=log_dir)
    ]
)

Epoch 1/100
INFO:tensorflow:Assets written to: /tmp/model/assets
Epoch 2/100
INFO:tensorflow:Assets written to: /tmp/model/assets
Epoch 3/100
INFO:tensorflow:Assets written to: /tmp/model/assets
Epoch 4/100
Epoch 5/100
Epoch 6/100

### Load the trained model

In [None]:
model = load_model(save_path)

### Predict similarities

In [None]:
class InferenceAPI:
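    """A lookup API for word similarities based on the trained embeddings.

    Attributes:
        vocab: fitted Tokenizer that maps words to indices and back.
        weights: embedding matrix taken from the 'word_embedding' layer.
    """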

    def __init__(self, model, vocab):
        self.vocab = vocab
        self.weights = model.get_layer('word_embedding').get_weights()[0]

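    def most_similar(self, word, topn=10):
        # Unknown words fall back to index 1, the '<UNK>' token.
        word_index = self.vocab.word_index.get(word, 1)
        sim = self._cosine_similarity(word_index)
        # Rank every row by similarity, highest first.
        pairs = [(s, i) for i, s in enumerate(sim)]
        pairs.sort(reverse=True)
        # Skip the query itself and index 0, which the Tokenizer reserves for padding.
        pairs = [(s, i) for s, i in pairs if i not in (0, word_index)][:topn]
        res = [(self.vocab.index_word[i], s) for s, i in pairs]
        return res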

    def similarity(self, word1, word2):
        word_index1 = self.vocab.word_index.get(word1, 1)
        word_index2 = self.vocab.word_index.get(word2, 1)
        weight1 = self.weights[word_index1]
        weight2 = self.weights[word_index2]
        # scipy's `cosine` returns a distance, so convert it to a similarity.
        return 1 - cosine(weight1, weight2)

    def _cosine_similarity(self, target_idx):
        target_weight = self.weights[target_idx]
        similarity = cosine_similarity(self.weights, [target_weight])
        return similarity.flatten()
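`most_similar` ranks every row of the embedding matrix by its cosine similarity to the query vector. The sketch below shows the same idea on a tiny hand-written matrix so the ranking can be checked by eye. `toy_weights`, `cosine_sim`, and `top_k` are hypothetical names, not part of the notebook's API.

```python
import numpy as np

# Hypothetical 4-word embedding matrix; row 0 plays the role of the padding index.
toy_weights = np.array([
    [0.0, 1.0],
    [1.0, 0.0],
    [2.0, 0.1],
    [-1.0, 0.0],
])


def cosine_sim(matrix, vector):
    """Cosine similarity between each row of `matrix` and `vector`."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    return matrix @ vector / norms


def top_k(query_idx, k=2):
    sim = cosine_sim(toy_weights, toy_weights[query_idx])
    order = np.argsort(-sim)  # indices sorted by similarity, highest first
    # Drop the query itself and the padding row before truncating.
    order = [int(i) for i in order if i not in (0, query_idx)]
    return [(i, float(sim[i])) for i in order[:k]]


ranked = top_k(1)
print(ranked)
```

Row 2 points in almost the same direction as the query and ranks first, while row 3 points the opposite way and gets a similarity of -1. Cosine distance is simply one minus this similarity.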

In [None]:
api = InferenceAPI(model, vocab)
api.most_similar(word='日本')

In [None]:
# Unknown word (falls back to the '<UNK>' token)
api.most_similar(word='hogefuga')

### Visualize loss

In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir logs

## gensim version

In [None]:
import logging
from gensim.models.word2vec import Word2Vec, Text8Corpus

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = Text8Corpus('data/ja.text8')
# gensim 4.x renamed the `size` argument to `vector_size`.
model = Word2Vec(sentences, vector_size=100, window=5, sg=1)

In [None]:
model.wv.most_similar('日本', topn=10)

In [None]:
model.wv.most_similar('猫', topn=10)

In [None]:
model.wv.similarity('猫', '犬')

In [None]:
model.wv.similarity('猫', '車')

In [None]:
model.wv.similarity('セダン', '車')

## Pretrained word embedding

In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ja.300.vec.gz -P data/

--2020-11-01 11:06:40--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ja.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 104.22.75.142, 172.67.9.4, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1279641604 (1.2G) [binary/octet-stream]
Saving to: ‘data/cc.ja.300.vec.gz’


2020-11-01 11:07:32 (23.5 MB/s) - ‘data/cc.ja.300.vec.gz’ saved [1279641604/1279641604]



In [None]:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('data/cc.ja.300.vec.gz', binary=False)

2020-11-01 11:11:55,385 : INFO : loading projection weights from data/cc.ja.300.vec.gz
2020-11-01 11:21:12,585 : INFO : loaded (2000000, 300) matrix from data/cc.ja.300.vec.gz


In [None]:
model.most_similar('猫')

[('ネコ', 0.8059155941009521),
 ('ねこ', 0.7272598147392273),
 ('子猫', 0.720253586769104),
 ('仔猫', 0.7062687873840332),
 ('ニャンコ', 0.7058036923408508),
 ('野良猫', 0.7030349969863892),
 ('犬', 0.6505385041236877),
 ('ミケ', 0.6356303691864014),
 ('野良ねこ', 0.6340526342391968),
 ('飼猫', 0.6265145540237427)]

## Load as FastText format

- [models.fasttext – FastText model](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors)


In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ja.300.bin.gz -P data/

--2020-11-01 11:34:56--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ja.300.bin.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.74.142, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4491428494 (4.2G) [application/octet-stream]
Saving to: ‘data/cc.ja.300.bin.gz’


2020-11-01 11:44:41 (7.34 MB/s) - ‘data/cc.ja.300.bin.gz’ saved [4491428494/4491428494]



In [None]:
!gunzip data/cc.ja.300.bin.gz

In [None]:
from gensim.models.fasttext import load_facebook_vectors
model = load_facebook_vectors('data/cc.ja.300.bin')  # This call ran out of memory in the Colab runtime.
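When the full binary model does not fit in memory, one possible workaround is to fall back to the text-format vectors and read only the first entries. The sketch below is a hypothetical pure-Python reader, `load_vec_head`, for the word2vec text format, in which the first line holds the vocabulary size and dimension and each following line holds a word and its vector. It is not the gensim loader, and it assumes the file lists the most useful words first. The demo file is a tiny stand-in so the code runs without any download.

```python
import gzip
import os
import tempfile

import numpy as np


def load_vec_head(path, limit):
    """Read at most `limit` vectors from a word2vec text file (.vec or .vec.gz)."""
    opener = gzip.open if path.endswith('.gz') else open
    words, vectors = [], []
    with opener(path, 'rt', encoding='utf-8') as f:
        count, dim = map(int, f.readline().split())  # header line: "<count> <dim>"
        for _ in range(min(limit, count)):
            parts = f.readline().rstrip().split(' ')
            words.append(parts[0])
            vectors.append([float(x) for x in parts[1:dim + 1]])
    return words, np.asarray(vectors, dtype=np.float32)


# Tiny stand-in file (hypothetical contents) in the same text format.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.vec')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('3 2\n猫 0.1 0.2\n犬 0.3 0.4\n車 0.5 0.6\n')

head_words, head_vectors = load_vec_head(demo_path, limit=2)
print(head_words, head_vectors.shape)
```

gensim's `KeyedVectors.load_word2vec_format` accepts a `limit` argument that serves the same purpose. Either way, the trade-off is that text-format vectors carry no subword information, so out-of-vocabulary words cannot be handled the way the FastText binary would handle them.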