# Text to Word Embedding

This notebook uses the [Tencent AI Lab Embedding](https://ai.tencent.com/ailab/nlp/embedding.html) (8,824,330 words, 200 dimensions) to build word-level and char-level features.

## 1 Download & Decompress

In [0]:
!wget -c https://ai.tencent.com/ailab/nlp/data/Tencent_AILab_ChineseEmbedding.tar.gz

In [0]:
!tar -xzvf Tencent_AILab_ChineseEmbedding.tar.gz

In [0]:
!ls -lh

total 22G
-rw-r--r-- 1 root root 1.8K Oct 19 10:41 README.txt
-rw-r--r-- 1 root root 6.4G Oct 19 11:29 Tencent_AILab_ChineseEmbedding.tar.gz
-rw-r--r-- 1 root root  16G Oct 19 10:50 Tencent_AILab_ChineseEmbedding.txt


In [0]:
!head Tencent_AILab_ChineseEmbedding.txt
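
The decompressed file is plain text: a header line with the vocabulary size and the dimension, then one token per line followed by its space-separated vector. The snippet below is a minimal sketch of reading that layout. It runs on a tiny in-memory sample with 3 dimensions instead of 200, and `parse_header` and `parse_line` are hypothetical helper names, not part of any library.

```python
import io
import numpy as np

# Toy stand-in for the real file: header line, then "token v1 v2 ... vD".
sample = io.StringIO("2 3\n的 0.1 0.2 0.3\n我们 -0.4 0.5 0.6\n")

def parse_header(line):
    """Return (vocabulary size, dimension) from the first line."""
    count, dim = line.split()
    return int(count), int(dim)

def parse_line(line):
    """Return (token, vector) from one embedding line."""
    parts = line.rstrip("\n").split(" ")
    return parts[0], np.array(parts[1:], dtype=float)

count, dim = parse_header(next(sample))
rows = [parse_line(line) for line in sample]
for token, vector in rows:
    print(token, vector.shape)
```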

## 2 Extract embeddings needed

To save disk space and memory, we extract only the words and chars that appear in the dataset.

This approach only works when the text to be predicted is already available at extraction time; any token that shows up later is treated as out of vocabulary.

In [0]:
from google.colab import drive
drive.mount('/gdrive')

!pip install jieba tqdm > /dev/null

import numpy as np
import pandas as pd
import pickle
import jieba
from tqdm import tqdm

jieba.setLogLevel(20)

### 2.1 Read the text and split it into words and chars

In [0]:
path = '/gdrive/My Drive/Colab Notebooks/labelled.txt'
labelled = pd.read_csv(path, sep='\t', header=None)
y = labelled[0].tolist()
contents = labelled[1].tolist()

In [0]:
contents_word, contents_char, contents_woch = [], [], []
wordset, charset = set(), set()
for content in tqdm(contents):
    # Word-level tokens from jieba's search-engine mode.
    words = list(jieba.cut_for_search(content))
    # Char-level tokens: every single character of the text.
    chars = list(content)
    contents_word.append(words)
    contents_char.append(chars)
    contents_woch.append(words + chars)
    wordset.update(words)
    charset.update(chars)
# Combined vocabulary of words and chars.
wochset = wordset.union(charset)

100%|██████████| 754843/754843 [03:24<00:00, 3686.85it/s]


In [0]:
[len(s) for s in [wordset, charset, wochset]]

[385237, 8760, 386813]

### 2.2 Select the words and chars that appear in the dataset

In [0]:
word_index, embedding_matrix_word = {}, []
char_index, embedding_matrix_char = {}, []
woch_index, embedding_matrix_woch = {}, []
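with open('Tencent_AILab_ChineseEmbedding.txt', encoding='utf-8') as f:
    next(f)  # Skip the header line (vocabulary size and dimension).
    i = j = k = 0
    for line in tqdm(f, total=8824330):
        # Each line holds a token followed by its space-separated vector.
        e = line.rstrip('\n').split(' ')
        w, v = e[0], np.array(e[1:], dtype=float)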
        if w in wordset:
            word_index[w] = i
            i += 1
            embedding_matrix_word.append(v)
        if w in charset:
            char_index[w] = j
            j += 1
            embedding_matrix_char.append(v)
        if w in wochset:
            woch_index[w] = k
            k += 1
            embedding_matrix_woch.append(v)

100%|██████████| 8824330/8824330 [12:19<00:00, 11932.58it/s]


In [0]:
embeddings_word = [word_index, np.array(embedding_matrix_word)]
embeddings_char = [char_index, np.array(embedding_matrix_char)]
embeddings_woch = [woch_index, np.array(embedding_matrix_woch)]

In [0]:
[len(i) for i in [word_index, char_index, woch_index]]

[273139, 8656, 274713]

In [0]:
[e[1].shape for e in [embeddings_word, embeddings_char, embeddings_woch]]

[(273139, 200), (8656, 200), (274713, 200)]
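
Comparing these counts with the vocabulary sizes from section 2.1 shows how much of the dataset the pretrained embedding covers: about 71% of the words and about 99% of the chars. A short sketch of that arithmetic, using only the numbers printed above (the dictionary names are chosen for illustration):

```python
# Vocabulary sizes in the dataset (section 2.1) and the number of entries
# found in the pretrained embedding file (section 2.2).
vocab_sizes = {"word": 385237, "char": 8760, "woch": 386813}
matched = {"word": 273139, "char": 8656, "woch": 274713}

coverage = {name: matched[name] / vocab_sizes[name] for name in vocab_sizes}
missing = {name: vocab_sizes[name] - matched[name] for name in vocab_sizes}

for name in vocab_sizes:
    print(f"{name}: {coverage[name]:.1%} covered, {missing[name]} not found")
```

The tokens that are not found are the ones section 3 maps to a shared `<UNK>` vector.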

## 3 Text to embedding indexes

### 3.1 Generate an embedding for out-of-vocabulary words

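Similar to [Kim (2014)](https://www.aclweb.org/anthology/D14-1181), out-of-vocabulary tokens get a random vector scaled by the per-dimension standard deviation of the pretrained embeddings. A single vector is drawn from the combined word-and-char matrix and shared by all three embedding matrices.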

In [0]:
std = np.std(embeddings_woch[1], axis=0)
unk = np.random.uniform(-1, 1, embeddings_woch[1].shape[1]) * std
unk = unk.reshape(1, -1)

In [0]:
['<UNK>' in e[0]
 for e in [embeddings_word, embeddings_char, embeddings_woch]]

[False, False, False]
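
Note that `np.random.uniform(-1, 1, ...) * std` yields a vector whose per-dimension variance is `std**2 / 3`, since U[-1, 1] has variance 1/3. As far as I recall, Kim (2014) samples from U[-a, a] with `a` chosen so that the random vectors have the same variance as the pretrained ones. The sketch below shows that variance-matched variant as an optional alternative, not as what this notebook does. The `pretrained` array is a synthetic stand-in for `embeddings_woch[1]`, and the `samples` array exists only to check the variance empirically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for embeddings_woch[1] (assumption, not real data).
pretrained = rng.normal(0.0, 0.3, size=(1000, 200))
std = pretrained.std(axis=0)

# U[-a, a] has variance a**2 / 3, so a = sqrt(3) * std matches the
# per-dimension variance of the pretrained vectors.
a = np.sqrt(3.0) * std
unk_matched = rng.uniform(-a, a).reshape(1, -1)

# Empirical check: many draws should reproduce the pretrained std.
samples = rng.uniform(-a, a, size=(5000, a.shape[0]))
print(unk_matched.shape, np.allclose(samples.std(axis=0), std, rtol=0.1))
```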

### 3.2 Text to indexes

In [0]:
def text_to_indexes(embeddings, contents):
    index, e_mat = embeddings
    # Copy the dict so the caller's index is not mutated, then add the
    # unknown word token as the last row of the embedding matrix.
    index = dict(index)
    unk_id = len(index)
    index['<UNK>'] = unk_id
    e_mat = np.concatenate([e_mat, unk], axis=0)
    # Map each token to its row; unseen tokens fall back to <UNK>.
    indexes = [[index.get(element, unk_id) for element in content]
               for content in tqdm(contents)]
    return (index, e_mat), indexes

In [0]:
embeddings_word, X_word = text_to_indexes(embeddings_word, contents_word)
embeddings_char, X_char = text_to_indexes(embeddings_char, contents_char)
embeddings_woch, X_woch = text_to_indexes(embeddings_woch, contents_woch)

100%|██████████| 754843/754843 [00:10<00:00, 71807.84it/s] 
100%|██████████| 754843/754843 [00:10<00:00, 72703.26it/s] 
100%|██████████| 754843/754843 [00:16<00:00, 45877.72it/s]
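
To make the mapping concrete, here is a self-contained toy version of the same lookup, assuming a 2-token vocabulary with 3-dimensional vectors. All values are made up for illustration; the third token is deliberately absent from the index so that it falls back to `<UNK>`.

```python
import numpy as np

# Toy index and embedding matrix (illustrative values only).
index = {"我们": 0, "喜欢": 1}
e_mat = np.arange(6, dtype=float).reshape(2, 3)
unk = np.zeros((1, 3))

# Append <UNK> as the last row, mirroring text_to_indexes.
unk_id = len(index)
index_with_unk = {**index, "<UNK>": unk_id}
e_mat_with_unk = np.concatenate([e_mat, unk], axis=0)

tokens = ["我们", "喜欢", "火星语"]  # the last token is out of vocabulary
ids = [index_with_unk.get(token, unk_id) for token in tokens]
vectors = e_mat_with_unk[ids]  # rows of the matrix, one per token

print(ids, vectors.shape)
```

The index lists stored in `X_word`, `X_char` and `X_woch` can be turned into vectors the same way, or fed to an embedding layer initialised with the matching matrix.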


## 4 Pickle and save

In [0]:
# Use context managers so that every file is flushed and closed.
to_save = {
    'embeddings_word': embeddings_word, 'embeddings_char': embeddings_char,
    'embeddings_woch': embeddings_woch,
    'X_word': X_word, 'X_char': X_char, 'X_woch': X_woch, 'y': y,
}
for name, obj in to_save.items():
    with open(name + '.p', 'wb') as f:
        pickle.dump(obj, f)

In [0]:
!rm c*

In [0]:
!ls -lh *.p

-rw------- 1 root root  14M Dec  1 07:55 embeddings_char.p
-rw------- 1 root root 425M Dec  1 07:55 embeddings_woch.p
-rw------- 1 root root 423M Dec  1 07:55 embeddings_word.p
-rw-r--r-- 1 root root  58M Dec  1 07:55 X_char.p
-rw-r--r-- 1 root root  99M Dec  1 07:55 X_woch.p
-rw-r--r-- 1 root root  42M Dec  1 07:55 X_word.p
-rw------- 1 root root 1.5M Dec  1 07:55 y.p


In [0]:
!cp *.p /gdrive/My\ Drive/Colab\ Notebooks
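
Each `embeddings_*.p` file holds an `(index, matrix)` pair, so a later notebook can unpack it directly after loading. The sketch below shows the round trip on a small demo file written to a temporary directory; the file name `embeddings_demo.p` and its contents are placeholders, not the real artifacts.

```python
import os
import pickle
import tempfile

import numpy as np

# Write a small demo file with the same (index, matrix) layout.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "embeddings_demo.p")
demo = ({"的": 0, "<UNK>": 1}, np.zeros((2, 200)))
with open(path, "wb") as f:
    pickle.dump(demo, f)

# Load it back and unpack the pair.
with open(path, "rb") as f:
    index, e_mat = pickle.load(f)

# Sanity check: one matrix row per index entry.
print(len(index), e_mat.shape)
```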