# Text data
## 1. Vectorizing text

There are three ways to break text into units before representing it as numbers.

- Text -> Words -> Vector
- Text -> Character -> Vector
- Text -> n-gram -> Vector

**An n-gram is a sequence of n consecutive words (or characters).** n-grams are mostly used with shallow algorithms.

These units (words, characters and n-grams) are called **tokens**. So the whole process is text -> token(ize) -> vector.
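The three kinds of token can be illustrated with a minimal pure-Python sketch. The helper names `word_tokens`, `char_tokens` and `ngram_tokens` are made up for this note, and the regular expression is only one simple way to split words; real tokenizers handle many more cases.

```python
import re

def word_tokens(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def char_tokens(text):
    """Treat every character (spaces included) as one token."""
    return list(text)

def ngram_tokens(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = "The cat sat on the mat."
words = word_tokens(sample)
chars = char_tokens(sample)
bigrams = ngram_tokens(words, 2)

print(words)      # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(chars[:7])  # ['T', 'h', 'e', ' ', 'c', 'a', 't']
print(bigrams[:2])  # [('the', 'cat'), ('cat', 'sat')]
```

Each token type then needs its own index (or hash) before it can become a vector.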

There are two ways to turn a token into a vector: one-hot encoding or word embedding.

### 1.1 One-hot encoding

Keras has a built-in utility for this encoding. It is much more convenient than encoding by hand and offers many extra features.

In [1]:
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

tokenizer = Tokenizer(num_words=1000) # Keep only the 1,000 most frequent words
tokenizer.fit_on_texts(samples) # Create index

sequences = tokenizer.texts_to_sequences(samples) 

one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

word_index = tokenizer.word_index

print(len(word_index))

Using TensorFlow backend.


9


In [2]:
sequences

[[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]

In [3]:
one_hot_results.shape

(2, 1000)

In [4]:
one_hot_results

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

In [5]:
one_hot_results[:,10]

array([0., 0.])

In [6]:
word_index

{'the': 1,
 'cat': 2,
 'sat': 3,
 'on': 4,
 'mat': 5,
 'dog': 6,
 'ate': 7,
 'my': 8,
 'homework': 9}

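##### One-hot hashing, a variation of one-hot encoding

It is used when the vocabulary is too large to keep an explicit word index. Instead of looking a word up in a dictionary, each word is hashed to a position in a fixed-size vector.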
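A minimal sketch of the hashing idea, in plain Python. The function names and the choice of `hashlib.md5` are assumptions made for this note: Python's built-in `hash` is salted per process for strings, so a stable digest is used to keep the result reproducible.

```python
import hashlib

def hashed_index(word, dimensionality):
    """Map a word to a stable position in [0, dimensionality)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % dimensionality

def one_hot_hash(text, dimensionality=1000):
    """Binary vector with a 1 at the hashed position of every word."""
    vector = [0.0] * dimensionality
    for word in text.lower().split():
        word = word.strip(".,!?")  # crude punctuation removal
        vector[hashed_index(word, dimensionality)] = 1.0
    return vector

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
hashed = [one_hot_hash(s) for s in samples]

print(len(hashed), len(hashed[0]))  # 2 1000
```

No word index has to be stored, but two different words can land on the same position (a hash collision). Collisions become less likely as `dimensionality` grows well beyond the number of distinct words.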

### 1.2 Word embedding

Word embedding is the most powerful and popular way to vectorize text. One-hot encoding produces a matrix that is mostly zeros, that is, sparse and high-dimensional. Word embeddings are dense and low-dimensional, so they handle the data more efficiently. (Dense <-> Sparse)


- Word embeddings in a project usually have 256, 512 or 1024 dimensions. With one-hot encoding, the vector dimension equals the dictionary size, which would be more than 20,000.

- Embedding vectors are learned from the data, unlike one-hot vectors, which are fixed by the encoding itself.

#### How to use word embedding 

##### Train your own word embedding or use a pre-trained one

There are two ways to obtain a word embedding.

- Train the word embedding together with the problem we are trying to solve, like the weights of a neural network

- Use a pretrained word embedding that was built for another project

#### 1.2.1 Train word embedding 

An embedding maps words into a space where the relationships between words show up as directions. Using these relationships, like King + Female = Queen or King + Plural = Kings, a word embedding can represent a large number of words.

No technique so far, and no human either, can map all words accurately, because their meaning differs between countries and cultures.

Hence, it is quite reasonable to train a new word embedding for a new project.
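The King + Female = Queen relationship can be made concrete with a toy example. The three-dimensional vectors below are hand-made for illustration (axes chosen as royalty, femaleness, plurality); they are an assumption of this note, not the output of a trained embedding.

```python
# Hand-made toy vectors: [royalty, femaleness, plurality]. Not learned from data.
toy_embedding = {
    "king":  [1.0, 0.0, 0.0],
    "queen": [1.0, 1.0, 0.0],
    "kings": [1.0, 0.0, 1.0],
    "man":   [0.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def nearest(vector):
    """Return the word whose vector has the smallest squared Euclidean distance."""
    def distance(word):
        return sum((x - y) ** 2 for x, y in zip(toy_embedding[word], vector))
    return min(toy_embedding, key=distance)

female = sub(toy_embedding["woman"], toy_embedding["man"])   # "female" direction
plural = sub(toy_embedding["kings"], toy_embedding["king"])  # "plural" direction

print(nearest(add(toy_embedding["king"], female)))  # queen
print(nearest(add(toy_embedding["king"], plural)))  # kings
```

In a trained embedding the directions are not this clean, and the match is only approximate.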

In [7]:
from keras.layers import Embedding

embedding_layer = Embedding(1000, 64) # 1000 possible tokens (max word index 999 + 1), embedding dimension 64

The Embedding layer takes a 2D integer tensor of shape (samples, sequence_length) as input.
All sequences in a batch must have the same length, so samples shorter than sequence_length are padded with 0 and longer ones are truncated.

The Embedding layer returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality).

The weights of the Embedding layer are initialized randomly and adjusted by backpropagation during training.
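Conceptually, the layer is a lookup table from word index to vector. The sketch below imitates that in plain Python, without Keras, to show the padding step and the resulting shapes. The names `table`, `pad` and `embed` are made up for this note, and `pad` pads and truncates at the front, which mirrors the default behaviour of Keras `pad_sequences` as far as I know.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 1000, 4

# Randomly initialised lookup table: one row of embedding_dim floats per word index.
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

def pad(seq, length):
    """Keep the last `length` items, then left-pad with 0 up to `length`."""
    seq = seq[-length:]
    return [0] * (length - len(seq)) + seq

def embed(batch):
    """Replace every integer index by its row in the table."""
    return [[table[index] for index in seq] for seq in batch]

sequences = [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9], [7, 8]]
batch = [pad(seq, 5) for seq in sequences]   # shape (samples, sequence_length)
output = embed(batch)                        # shape (samples, sequence_length, embedding_dim)

print(batch)  # [[2, 3, 4, 1, 5], [1, 6, 7, 8, 9], [0, 0, 0, 7, 8]]
print(len(output), len(output[0]), len(output[0][0]))  # 3 5 4
```

Training would then adjust the rows of `table` by backpropagation; this sketch only covers the forward lookup.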

In [None]:
from keras.datasets import imdb
from keras import preprocessing

max_features = 10000 # Number of words to use as features (the most frequent ones)
maxlen = 20 # Number of words kept per review

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)


# Transform the lists of integers into a 2D tensor of shape (samples, maxlen)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

In [13]:
x_train.shape

(25000, 20)

In [14]:
x_train[1,:]

array([  23,    4, 1690,   15,   16,    4, 1355,    5,   28,    6,   52,
        154,  462,   33,   89,   78,  285,   16,  145,   95])

In [9]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, Embedding

model = Sequential()
model.add(Embedding(10000, 8, input_length=maxlen)) # Output shape is (samples, maxlen, 8)

model.add(Flatten()) # Unroll 3D embedding tensor to 2D tensor (samples, maxlen * 8)

model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
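The parameter counts in the summary can be checked by hand. The short calculation below only redoes that arithmetic from the layer sizes used above; the variable names are chosen for this note.

```python
vocab_size, embedding_dim, maxlen = 10000, 8, 20

embedding_params = vocab_size * embedding_dim  # one 8-dim vector per word index
flatten_units = maxlen * embedding_dim         # 20 words * 8 dims, no weights
dense_params = flatten_units * 1 + 1           # one weight per input plus one bias
total_params = embedding_params + dense_params

print(embedding_params, flatten_units, dense_params, total_params)
# 80000 160 161 80161
```

Almost all of the weights sit in the embedding table, so the vocabulary size and embedding dimension dominate the model size.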


In [10]:
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


The model reaches 75% validation accuracy. Given that it only sees 20 words of each sample, that is quite a good result.

But this model cannot take into account the relationships between words or the structure of the sentence. To handle that, we need to add a 1D convolutional layer or an RNN layer on top of the embedding layer. (We will talk about this later.)

#### 1.2.2 Pretrained word embedding

Like a pretrained CNN, a pretrained word embedding is useful when we have too little data for the problem.

Word2Vec and GloVe are well-known pretrained embeddings. Let's see how they work in the next notebook.
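As a small preview, GloVe vectors are distributed as plain text, one word per line followed by its vector components. The sketch below parses that format and builds a matrix aligned with a word index. The file contents, the three-dimensional vectors and the small `word_index` are invented for illustration; a real file has far more words and dimensions.

```python
import io

# Stand-in for a GloVe text file (hypothetical values, 3 dimensions).
fake_glove_file = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "cat 0.4 0.5 0.6\n"
    "dog 0.7 0.8 0.9\n"
)

# Parse every line into word -> list of floats.
embeddings_index = {}
for line in fake_glove_file:
    word, *values = line.split()
    embeddings_index[word] = [float(v) for v in values]

word_index = {'the': 1, 'cat': 2, 'sat': 3}  # e.g. from a fitted tokenizer
embedding_dim = 3

# Row i holds the vector of the word with index i; row 0 is reserved for padding.
embedding_matrix = [[0.0] * embedding_dim for _ in range(len(word_index) + 1)]
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:  # words missing from the file stay all-zero
        embedding_matrix[i] = vector

print(embedding_matrix[2])  # [0.4, 0.5, 0.6]
print(embedding_matrix[3])  # [0.0, 0.0, 0.0]
```

A matrix like this can then be loaded into an embedding layer as its initial weights.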