<a href="https://colab.research.google.com/github/lateefurrahman/AI-Q1-learning-resources/blob/master/Word_level_one_hot_encoding_(toy_example).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Word-level one-hot encoding (toy example)


In [0]:
import numpy as np


In [0]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']


In [0]:
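# Build an index of all tokens in the data.
token_index = {}
for sample in samples:
    # Tokenize by whitespace; punctuation stays attached to words (e.g. 'mat.').
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each word; index 0 is left unused.
            token_index[word] = len(token_index) + 1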

In [0]:
# Only the first max_length words of each sample are encoded.
max_length = 10
results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))

In [0]:
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
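
To check the encoding, it helps to look at the tensor's shape and to decode one sample back into words. The sketch below rebuilds the same vocabulary so that it runs on its own; the `reverse_index` and `decoded` names are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build the vocabulary; index 0 is left unused, so all-zero rows mean "no word".
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

max_length = 10
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        results[i, j, token_index[word]] = 1.

# Hypothetical helper mapping: integer index -> word.
reverse_index = {index: word for word, index in token_index.items()}

# Rows that contain a 1 are real words; all-zero rows are padding.
decoded = [reverse_index[int(row.argmax())] for row in results[0] if row.any()]

print(results.shape)  # (samples, max_length, vocabulary size + 1)
print(decoded)
```

Because tokens are split on whitespace only, 'The' and 'the' get different indices and 'mat.' keeps its trailing period.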

Character-level one-hot encoding (toy example)


In [0]:
import string
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable  # all printable ASCII characters
# Map each character to an integer index, starting at 1.
token_index = dict(zip(characters, range(1, len(characters) + 1)))

In [0]:
max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    # Truncate to max_length so long samples do not overflow the second axis.
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
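
A round-trip check is a quick way to confirm that a character-level encoding is wired correctly: encode a string, decode it, and compare. The following is a minimal sketch; `char_index`, `index_char`, `encode`, and `decode` are hypothetical names chosen for this example, and it assumes every input character is in `string.printable`.

```python
import string
import numpy as np

characters = string.printable
# Map each character to an integer index starting at 1 (0 is reserved for padding).
char_index = {ch: i for i, ch in enumerate(characters, start=1)}
index_char = {i: ch for ch, i in char_index.items()}

def encode(text, max_length=50):
    """One-hot encode up to max_length characters of text."""
    encoded = np.zeros((max_length, len(characters) + 1))
    for j, ch in enumerate(text[:max_length]):
        encoded[j, char_index[ch]] = 1.
    return encoded

def decode(encoded):
    """Recover the text, skipping all-zero padding rows."""
    return ''.join(index_char[int(row.argmax())] for row in encoded if row.any())

sample = 'The cat sat on the mat.'
encoded = encode(sample)
print(encoded.shape)
print(decode(encoded) == sample)
```

If the dictionary were built the other way around (index to character), the lookup by character would fail, which is exactly the kind of mistake this check catches.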


Using Keras for word-level one-hot encoding

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer


In [0]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']


In [0]:
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)


In [0]:
sequences = tokenizer.texts_to_sequences(samples)


In [0]:
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')


In [13]:
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 9 unique tokens.
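
The count of 9 (rather than the 10 tokens a plain whitespace split produces) comes from the tokenizer's default preprocessing, which lowercases text and strips punctuation before building a frequency-ranked index. The sketch below approximates that behavior in plain Python and NumPy. It is not the Keras implementation: the punctuation filter and the `tokenize` helper are assumptions made for illustration.

```python
import string
from collections import Counter

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Assumed filter: replace punctuation (except the apostrophe) with spaces.
strip_punct = str.maketrans({ch: ' ' for ch in string.punctuation if ch != "'"})

def tokenize(text):
    """Lowercase, drop punctuation, split on whitespace."""
    return text.lower().translate(strip_punct).split()

# Rank words by frequency; ties keep first-seen order because the sort is stable.
counts = Counter(word for sample in samples for word in tokenize(sample))
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
word_index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

# Integer sequences, analogous to texts_to_sequences.
sequences = [[word_index[w] for w in tokenize(s)] for s in samples]

# Binary bag-of-words matrix, analogous to texts_to_matrix(mode='binary').
num_words = 1000
matrix = np.zeros((len(samples), num_words))
for i, seq in enumerate(sequences):
    matrix[i, seq] = 1.

print(len(word_index))
print(sequences)
```

Note that the binary matrix has one row per sample and discards word order, unlike the 3D tensors built in the toy examples.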


Word-level one-hot encoding with hashing trick (toy example)

In [0]:
samples = ['The cat sat on the mat.', 'The dog ate my homework.']
dimensionality = 1000
max_length = 10
results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
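        # Hash the word into an index in [0, dimensionality). Python salts str
        # hashes per process, so these indices can differ between runs.
        index = abs(hash(word)) % dimensionality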
        results[i, j, index] = 1.
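
Python's built-in `hash()` is salted per process for strings (unless `PYTHONHASHSEED` is fixed), so the indices above are not reproducible across sessions. One possible workaround, shown here as a sketch rather than as part of the original notebook, is to derive the index from a deterministic digest. The `stable_index` helper is a hypothetical name, and MD5 is used only as a convenient stable hash, not for security.

```python
import hashlib

import numpy as np

def stable_index(word, dimensionality=1000):
    """Map a word to a bucket with MD5 so the result is the same in every run."""
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % dimensionality

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
dimensionality = 1000
max_length = 10
results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, stable_index(word, dimensionality)] = 1.

print(results.shape)
```

As with any hashing scheme, two different words can land in the same bucket; collisions become less likely as `dimensionality` grows relative to the number of unique tokens.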


Instantiating an Embedding layer


In [0]:
from tensorflow.keras.layers import Embedding
embedding_layer = Embedding(1000, 64)

Word index → Embedding layer → Corresponding word vector
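
Conceptually, the lookup is just row indexing into a weight matrix with one row per word index. The NumPy sketch below illustrates the idea under that assumption; the random `embedding_matrix` stands in for weights that the real layer would learn during training, and the index values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 64

# Stand-in for the layer's weight matrix: one 64-d row per word index.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 sequences, each 5 word indices long (illustrative values).
word_indices = np.array([[1, 2, 3, 4, 1],
                         [1, 6, 7, 8, 9]])

# The lookup: pick the row for each index.
vectors = embedding_matrix[word_indices]
print(vectors.shape)  # (batch, sequence_length, embedding_dim)

# Equivalent to multiplying one-hot vectors by the matrix, but far cheaper.
one_hot = np.eye(vocab_size)[word_indices]
same = np.allclose(one_hot @ embedding_matrix, vectors)
print(same)
```

This is why an embedding can be seen as a dense, learned replacement for one-hot vectors: the 2D integer input of shape `(samples, sequence_length)` becomes a 3D float tensor of shape `(samples, sequence_length, embedding_dim)`.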

Loading the IMDB data for use with an Embedding layer

In [0]:
from tensorflow.keras.datasets import imdb
from tensorflow.keras import preprocessing
max_features = 10000  # keep only the 10,000 most frequent words
maxlen = 20           # keep only 20 words per review
(x_train, y_train), (x_test, y_test) = imdb.load_data(
    num_words=max_features)


In [22]:
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_train

array([[  65,   16,   38, ...,   19,  178,   32],
       [  23,    4, 1690, ...,   16,  145,   95],
       [1352,   13,  191, ...,    7,  129,  113],
       ...,
       [  11, 1818, 7561, ...,    4, 3586,    2],
       [  92,  401,  728, ...,   12,    9,   23],
       [ 764,   40,    4, ...,  204,  131,    9]], dtype=int32)
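
With its defaults (`padding='pre'`, `truncating='pre'`), `pad_sequences` keeps the last `maxlen` tokens of a long sequence and left-pads short ones with zeros. The following NumPy sketch mimics that default behavior only; `pad_pre` is a hypothetical helper written for this example, not a Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad and left-truncate integer sequences to a fixed length."""
    padded = np.full((len(sequences), maxlen), value, dtype='int32')
    for i, seq in enumerate(sequences):
        kept = seq[-maxlen:]  # keep only the last maxlen tokens
        if kept:
            padded[i, -len(kept):] = kept  # right-align, zeros stay on the left
    return padded

print(pad_pre([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=4))
```

For the IMDB data this means each review is reduced to its final 20 word indices, giving an integer tensor of shape `(samples, maxlen)`.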

In [23]:
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
x_test

array([[ 286,  170,    8, ...,   14,    6,  717],
       [  10,   10,  472, ...,  125,    4, 3077],
       [  34,    2,   45, ...,    9,   57,  975],
       ...,
       [ 226,   20,  272, ...,   21,  846, 5518],
       [  55,  117,  212, ..., 2302,    7,  470],
       [  19,   14,   20, ...,   34, 2005, 2643]], dtype=int32)

Using an Embedding layer and classifier on the IMDB data

In [0]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

In [0]:
model = Sequential()
# input_length is specified so that the Flatten layer knows its input shape.
model.add(Embedding(max_features, 8, input_length=maxlen))

In [0]:
model.add(Flatten())


In [0]:
model.add(Dense(1, activation='sigmoid'))

In [28]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten (Flatten)            (None, 160)               0         
_________________________________________________________________
dense (Dense)                (None, 1)                 161       
=================================================================
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
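
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer shapes. The arithmetic below uses the values from the cells above; the variable names are chosen for this example.

```python
max_features = 10000  # vocabulary size passed to the Embedding layer
embedding_dim = 8
maxlen = 20

# One 8-dimensional vector per word index.
embedding_params = max_features * embedding_dim

# Flatten turns the (20, 8) output into a single vector; it has no weights.
flatten_units = maxlen * embedding_dim

# One weight per flattened input, plus one bias, for the single output unit.
dense_params = flatten_units * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flatten_units, dense_params, total_params)
```

Almost all of the model's parameters sit in the embedding table; the classifier on top is a single logistic unit over the flattened word vectors.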


In [29]:
history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
