<p style="font-family:Roboto; font-size: 28px; color: magenta"> Python for NLP: Word Embeddings for Deep Learning in Keras</p>

In [None]:
'''The N-Gram model is basically a way to convert text data into numeric form 
so that it can be used by statistical algorithms'''
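To make the idea concrete, here is a minimal sketch of N-Gram extraction in plain Python. The helper name `ngrams` is our own and is not part of any library used in this notebook.

```python
def ngrams(text, n):
    """Return the list of n-word tuples found in `text`, in order."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Bigrams (n = 2) of one short review
bigrams = ngrams("This is an excellent movie", 2)
print(bigrams)
```

Counting how often each tuple occurs in a document then gives the numeric features that a statistical algorithm can consume.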

<p style="font-family:Roboto; font-size: 28px; color: cyan"> Python for NLP: Problems with One-Hot Encoded Feature Vector Approaches</p>

In [None]:
'''
A potential drawback of one-hot encoded (sparse) feature vector approaches such as N-Grams, bag of words and TF-IDF 
is that the feature vector for each document has one entry per vocabulary term, 
so it can be huge and consists mostly of zeros
'''
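The following sketch shows why these vectors grow with the vocabulary. It uses a made-up three-sentence corpus and a hand-written `bag_of_words` helper (both are assumptions for illustration, not the data or code used later in this notebook).

```python
# Toy corpus (hypothetical, not the reviews used later in this notebook).
docs = [
    "the movie was great",
    "the acting was poor",
    "great acting and a great story",
]

# Vocabulary: one vector position per distinct word, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocabulary):
    """Return one count per vocabulary word; most entries stay 0."""
    words = doc.split()
    return [words.count(term) for term in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(len(vocabulary))   # the vector length equals the vocabulary size
print(vectors[0])
```

Every new distinct word in the corpus adds one more position to every document vector, which is what makes this representation impractical for large vocabularies.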

<p style="font-family:Roboto; font-size: 28px; color: magenta"> Word Embeddings</p>

In [None]:
'''
In word embeddings, every word is represented as an n-dimensional dense vector. 
The words that are similar will have similar vectors. 
Word embedding techniques such as GloVe and Word2Vec are widely used for converting words 
into corresponding dense vectors
'''
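"Similar vectors" is usually measured with cosine similarity. The sketch below uses three hand-picked 3-dimensional vectors; the numbers are invented for illustration and do not come from GloVe or Word2Vec.

```python
import math

# Hypothetical 3-dimensional vectors chosen by hand for illustration;
# real GloVe or Word2Vec vectors typically have 50 to 300 dimensions.
toy_vectors = {
    "fantastic": [0.9, 0.8, 0.1],
    "brilliant": [0.8, 0.9, 0.2],
    "boring": [-0.7, -0.6, 0.3],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similar = cosine_similarity(toy_vectors["fantastic"], toy_vectors["brilliant"])
dissimilar = cosine_similarity(toy_vectors["fantastic"], toy_vectors["boring"])
print(similar > dissimilar)
```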

In [None]:
'''
To implement word embeddings, the Keras library contains a layer called Embedding(). 
The embedding layer is implemented in the form of a class in Keras 
and is normally used as a first layer in the sequential model for NLP tasks.
'''

'''
The embedding layer can be used to perform three tasks in Keras:

1. It can be used to learn word embeddings and save the resulting model
2. It can be used to learn the word embeddings in addition to performing NLP tasks such as text classification, sentiment analysis, etc.
3. It can be used to load pre-trained word embeddings and use them in a new model
'''
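Conceptually, an embedding layer is a lookup table with one row of weights per word index. The sketch below imitates that lookup in plain Python; the sizes, the seed and the `embed` helper are assumptions for illustration and this is not how Keras implements the layer internally.

```python
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

vocab_size = 5      # hypothetical vocabulary size
embedding_dim = 3   # hypothetical vector size

# One row of weights per word index; a real layer would update these during training.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(sequence):
    """Replace every integer word index with its row of the table."""
    return [embedding_table[index] for index in sequence]

sentence = [2, 4, 4, 0]   # integer-encoded sentence, with 0 used as padding
embedded = embed(sentence)
print(len(embedded), len(embedded[0]))
```

A sentence of 4 word indices becomes a 4 x 3 block of numbers, and the same index always maps to the same row.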

<p style="font-family:consolas; font-size: 22px; color: magenta"> Custom Word Embeddings</p>

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Load libraries</p>

In [1]:
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Load dataset</p>

In [3]:
corpus = [
    # Positive Reviews

    'This is an excellent movie',
    'The move was fantastic I like it',
    'You should watch it is brilliant',
    'Exceptionally good',
    'Wonderfully directed and executed I like it',
    'It\'s a fantastic series',
    'Never watched such a brilliant movie',
    'It is a Wonderful movie',

    # Negative Reviews

    "horrible acting",
    'waste of money',
    'pathetic picture',
    'It was very boring',
    'I did not like the movie',
    'The movie was horrible',
    'I will not recommend',
    'The acting is pathetic'
]

In [4]:
'''Our corpus has 8 positive reviews and 8 negative reviews. The next step is to create a label set for our data.'''
sentiments = array([1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0])

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Data Preprocessing</p>

In [6]:
'''Let's first find the total number of words in our corpus:'''
from nltk.tokenize import word_tokenize

all_words = []
for sent in corpus:
    tokenize_word = word_tokenize(sent)
    for word in tokenize_word:
        all_words.append(word)

print(len(all_words))

72


In [8]:
'''However, we do not want the duplicate words. 
We can retrieve all the unique words from a list by passing the list into the set function'''
unique_words = set(all_words)
print(len(unique_words))
'''The corpus has 44 unique words, so we will add a buffer of 6 to our vocabulary size and set vocab_length to 50.'''

44


In [11]:
'''The Embedding layer expects the words to be in numeric form. Therefore, we need to convert the sentences 
in our corpus to numbers. 
One way to convert text to numbers is the one_hot function from the tensorflow.keras.preprocessing.text module. 
Despite its name, one_hot hashes each word to an integer between 1 and vocab_length - 1, 
so two different words can receive the same number'''
vocab_length = 50
embedded_sentences = [one_hot(sent, vocab_length) for sent in corpus]
print(embedded_sentences)

[[3, 37, 26, 34, 37], [32, 14, 42, 38, 35, 32, 31], [26, 15, 1, 31, 37, 49], [19, 13], [19, 47, 2, 8, 35, 32, 31], [39, 28, 38, 18], [26, 2, 30, 28, 49, 37], [31, 37, 28, 7, 37], [34, 48], [44, 42, 15], [44, 27], [31, 42, 34, 1], [35, 32, 43, 32, 32, 37], [32, 37, 42, 34], [35, 4, 43, 20], [32, 48, 37, 44]]
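The repeated numbers in the output above come from hashing: each word is mapped to a slot, and distinct words can land in the same slot. The sketch below reproduces the idea with a stable MD5 hash. The helper names are our own, and the choice of MD5 is an assumption made so that results are repeatable; as far as we know, `one_hot` uses Python's built-in `hash` by default, so its exact numbers can differ between sessions.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1] with a stable MD5 hash."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def encode(sentence, n):
    """Hash every whitespace-separated word of the sentence."""
    return [hashed_index(word, n) for word in sentence.split()]

encoded = encode("The movie was horrible", 50)
print(encoded)

# 60 distinct words cannot fit into 49 slots without sharing some of them.
many_words = [f"word{i}" for i in range(60)]
distinct_slots = {hashed_index(word, 50) for word in many_words}
print(len(distinct_slots) < len(many_words))
```

Choosing a vocabulary size well above the number of unique words lowers the chance of collisions but does not remove it.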


In [13]:
'''The embedding layer expects sentences to be of equal size. However, our encoded sentences are of different sizes. 
One way to make all the sentences uniform is to pad every sentence 
to the length of the longest sentence'''
word_count = lambda sentence: len(word_tokenize(sentence))
longest_sentence = max(corpus, key=word_count)
length_long_sentence = len(word_tokenize(longest_sentence))

In [14]:
'''
Next to make all the sentences of equal size, we will add zeros to the empty indexes that will be created 
as a result of increasing the sentence length. 
To append the zeros at the end of the sentences, we can use the pad_sequences method
'''
padded_sentences = pad_sequences(embedded_sentences, length_long_sentence, padding='post')
print(padded_sentences)

[[ 3 37 26 34 37  0  0]
 [32 14 42 38 35 32 31]
 [26 15  1 31 37 49  0]
 [19 13  0  0  0  0  0]
 [19 47  2  8 35 32 31]
 [39 28 38 18  0  0  0]
 [26  2 30 28 49 37  0]
 [31 37 28  7 37  0  0]
 [34 48  0  0  0  0  0]
 [44 42 15  0  0  0  0]
 [44 27  0  0  0  0  0]
 [31 42 34  1  0  0  0]
 [35 32 43 32 32 37  0]
 [32 37 42 34  0  0  0]
 [35  4 43 20  0  0  0]
 [32 48 37 44  0  0  0]]
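Post-padding itself is simple enough to write by hand. The following `pad_post` helper is a hypothetical stand-in for illustration, not the Keras implementation; it is applied to three of the encoded sentences from above.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` until it has `maxlen` items.

    Sequences longer than maxlen are cut at the end here. This is a
    simplification: pad_sequences by default drops items from the start
    of over-long sequences (truncating='pre').
    """
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

encoded = [[19, 13], [39, 28, 38, 18], [32, 14, 42, 38, 35, 32, 31]]
for row in pad_post(encoded, 7):
    print(row)
```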


<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Create a very simple text classification model</p>

In [19]:
model = Sequential()
# The dimension of each word vector will be 20 
model.add(Embedding(vocab_length, 20, input_length=length_long_sentence))
model.add(Flatten())
# Since it is a binary classification problem, we use the sigmoid function as the activation function at the dense layer.
model.add(Dense(1, activation='sigmoid'))
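The sigmoid activation squashes the dense layer's raw score into a probability between 0 and 1, and binary cross-entropy then measures how far that probability is from the 0/1 label. Here is a minimal plain-Python sketch of both formulas; the scores and labels are made up for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

scores = [2.0, -1.0, 0.0]   # hypothetical raw outputs of the dense layer
labels = [1, 0, 1]
probs = [sigmoid(z) for z in scores]
print(probs)
print(binary_crossentropy(labels, probs))
```

A model that always predicts 0.5 has a loss of ln 2, roughly 0.693, which is about where an untrained binary classifier starts.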

In [21]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summary() prints the table itself and returns None, so wrapping it in print() only adds a stray "None"
model.summary()


<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Train text classification model</p>

In [22]:
model.fit(padded_sentences, sentiments, epochs=100, verbose=1)

Epoch 1/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 4s 4s/step - acc: 0.6250 - loss: 0.6906
Epoch 2/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 110ms/step - acc: 0.7500 - loss: 0.6864
Epoch 3/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 95ms/step - acc: 0.8125 - loss: 0.6822
Epoch 4/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 96ms/step - acc: 0.8125 - loss: 0.6780
Epoch 5/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 95ms/step - acc: 0.8125 - loss: 0.6738
Epoch 6/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 97ms/step - acc: 0.8125 - loss: 0.6696
Epoch 7/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 155ms/step - acc: 0.9375 - loss: 0.6655
Epoch 8/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 99ms/step - acc: 0.9375 - loss: 0.6613
Epoch 9/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 110ms/step - acc: 0.93

<keras.src.callbacks.history.History at 0x2864b8b7aa0>

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Evaluate model</p>

In [23]:
loss, accuracy = model.evaluate(padded_sentences, sentiments, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


<p style="font-family:consolas; font-size: 22px; color: magenta"> Loading Pre Trained Word Embeddings</p>

In [None]:
'''The smallest of the pre-trained GloVe archives is named "glove.6B.zip". The size of the file is 822 MB. 
The archive contains 50, 100, 200, and 300 dimensional word vectors for 400k words. 
We will be using the 100 dimensional vectors.'''

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Load libraries</p>

In [29]:
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer 

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Create dataset</p>

In [30]:
corpus = [
    # Positive Reviews

    'This is an excellent movie',
    'The move was fantastic I like it',
    'You should watch it is brilliant',
    'Exceptionally good',
    'Wonderfully directed and executed I like it',
    'It\'s a fantastic series',
    'Never watched such a brilliant movie',
    'It is a Wonderful movie',

    # Negative Reviews

    "horrible acting",
    'waste of money',
    'pathetic picture',
    'It was very boring',
    'I did not like the movie',
    'The movie was horrible',
    'I will not recommend',
    'The acting is pathetic'
]

In [31]:
sentiments = array([1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0])

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Data Preprocessing</p>

In [32]:
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts(corpus)

In [43]:
'''To get the number of unique words in the text, you can simply count the length of the word_index dictionary 
of the word_tokenizer object. Remember to add 1 to the vocabulary size. 
The Tokenizer numbers words starting from 1, and index 0 is reserved for padding, 
so the embedding matrix needs one extra row'''
vocab_length = len(word_tokenizer.word_index) + 1
vocab_length

43
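The way the word index is built can be approximated in a few lines: count the words, rank them by frequency, and number them from 1. The `build_word_index` helper below is a simplified, hypothetical stand-in (it only lowercases and splits on whitespace, while the Keras Tokenizer also filters punctuation), applied to three sentences of the corpus.

```python
from collections import Counter

def build_word_index(texts):
    """Lowercase, split on whitespace, and rank words by frequency (1-based)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so equally frequent words keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

texts = [
    "The movie was horrible",
    "I did not like the movie",
    "It is a wonderful movie",
]
word_index = build_word_index(texts)
vocab_size = len(word_index) + 1   # +1 because index 0 is kept for padding

print(word_index)
print(vocab_size)
```

The most frequent word gets index 1, no word ever gets index 0, and the vocabulary size is therefore one more than the number of distinct words.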

In [35]:
'''Finally, to convert sentences to their numeric counterpart, call the texts_to_sequences function 
and pass it the whole corpus'''
embedded_sentences = word_tokenizer.texts_to_sequences(corpus)
print(embedded_sentences)

[[15, 3, 16, 17, 1], [4, 18, 6, 9, 5, 7, 2], [19, 20, 21, 2, 3, 10], [22, 23], [24, 25, 26, 27, 5, 7, 2], [28, 8, 9, 29], [30, 31, 32, 8, 10, 1], [2, 3, 8, 33, 1], [11, 12], [34, 35, 36], [13, 37], [2, 6, 38, 39], [5, 40, 14, 7, 4, 1], [4, 1, 6, 11], [5, 41, 14, 42], [4, 12, 3, 13]]


In [36]:
'''Find the number of words in the longest sentence and then apply padding 
to the sentences that are shorter than the longest sentence'''
from nltk.tokenize import word_tokenize

word_count = lambda sentence: len(word_tokenize(sentence))
longest_sentence = max(corpus, key=word_count)
length_long_sentence = len(word_tokenize(longest_sentence))

padded_sentences = pad_sequences(embedded_sentences, length_long_sentence, padding='post')

print(padded_sentences)

[[15  3 16 17  1  0  0]
 [ 4 18  6  9  5  7  2]
 [19 20 21  2  3 10  0]
 [22 23  0  0  0  0  0]
 [24 25 26 27  5  7  2]
 [28  8  9 29  0  0  0]
 [30 31 32  8 10  1  0]
 [ 2  3  8 33  1  0  0]
 [11 12  0  0  0  0  0]
 [34 35 36  0  0  0  0]
 [13 37  0  0  0  0  0]
 [ 2  6 38 39  0  0  0]
 [ 5 40 14  7  4  1  0]
 [ 4  1  6 11  0  0  0]
 [ 5 41 14 42  0  0  0]
 [ 4 12  3 13  0  0  0]]


In [37]:
'''The next step is to load the GloVe word embeddings and then create our embedding matrix 
that contains the words in our corpus and their corresponding values from GloVe embeddings'''
from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()
glove_file = open('./data/glove.6B.100d.txt', encoding="utf8")

In [38]:
'''We will create a dictionary that will contain words as keys 
and the corresponding 100 dimensional vectors as values, in the form of an array'''
for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions

glove_file.close()
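The same parsing logic can be tried without downloading the 822 MB archive. The sketch below reads two made-up lines in the GloVe text layout from an in-memory file and uses a `with` block so the handle is closed automatically; the sample values and the `load_vectors` name are assumptions for illustration.

```python
import io

# Two made-up lines in the GloVe text layout: a word followed by its
# space-separated vector components (the 100d file has 100 numbers per line).
sample_file = io.StringIO(
    "movie 0.1 -0.2 0.3\n"
    "horrible -0.5 0.4 0.0\n"
)

def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dictionary."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# The with block closes the handle even if parsing raises an error.
with sample_file as handle:
    embeddings = load_vectors(handle)

print(embeddings["movie"])
```

With the real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf8")` inside the same `with` statement.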

In [None]:
'''The dictionary embeddings_dictionary now contains the GloVe embedding for every word in the GloVe file.'''

In [39]:
print(vocab_length)

43


In [40]:
'''
We want the word embeddings for only those words that are present in our corpus. 
We will create a two dimensional NumPy array of 43 (vocab_length) rows and 100 columns. 
The array will initially contain zeros and will be named embedding_matrix. 
Rows for words that have no GloVe vector stay all zeros
'''
embedding_matrix = zeros((vocab_length, 100))
for word, index in word_tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
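To see what happens to words without a pre-trained vector, here is a small plain-Python version of the same loop. The word index, the vectors and the misspelled word "moive" are all invented for illustration.

```python
# Hypothetical inputs: a tiny word index (1-based, as produced by a tokenizer)
# and a pre-trained lookup that has no entry for the misspelled word "moive".
word_index = {"movie": 1, "horrible": 2, "moive": 3}
pretrained = {"movie": [0.1, -0.2, 0.3], "horrible": [-0.5, 0.4, 0.0]}
dim = 3

# Row 0 is reserved for padding, hence len(word_index) + 1 rows.
matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
missing = []
for word, index in word_index.items():
    vector = pretrained.get(word)
    if vector is None:
        missing.append(word)      # this row stays all zeros
    else:
        matrix[index] = list(vector)

print(missing)
for row in matrix:
    print(row)
```

Collecting the missing words in a list is a cheap way to check how much of the vocabulary the pre-trained vectors actually cover.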

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Create our sequential model</p>

In [None]:
model = Sequential()
# Since we are using pre-trained word embeddings that contain 100 dimensional vectors, we set the vector dimension to 100
embedding_layer = Embedding(vocab_length, 100, weights=[embedding_matrix], input_length=length_long_sentence, trainable=False)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))



In [42]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summary() prints the table itself and returns None, so wrapping it in print() only adds a stray "None"
model.summary()


In [None]:
'''You can see that since we have 43 words in our vocabulary and each word will be represented 
as a 100 dimensional vector, the number of parameters for the embedding layer will be 43 x 100 = 4300'''
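The same arithmetic extends to the whole Embedding -> Flatten -> Dense(1) stack. The helper below is our own sketch of the standard parameter-count formulas, not output copied from Keras; it uses the vocabulary size of 43, the vector size of 100 and the sentence length of 7 from this section.

```python
def count_parameters(vocab_size, embedding_dim, sequence_length):
    """Parameter counts for an Embedding -> Flatten -> Dense(1) model."""
    embedding = vocab_size * embedding_dim          # one vector per word index
    flattened = sequence_length * embedding_dim     # Flatten itself has no weights
    dense = flattened * 1 + 1                       # one weight per input plus one bias
    return {"embedding": embedding, "dense": dense, "total": embedding + dense}

counts = count_parameters(vocab_size=43, embedding_dim=100, sequence_length=7)
print(counts)
# {'embedding': 4300, 'dense': 701, 'total': 5001}
```

Because the embedding layer is created with trainable=False, only the 701 dense-layer parameters are updated during training.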

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Train our model</p>

In [44]:
model.fit(padded_sentences, sentiments, epochs=100, verbose=1)

Epoch 1/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step - acc: 0.5000 - loss: 0.7623
Epoch 2/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 141ms/step - acc: 0.5000 - loss: 0.7310
Epoch 3/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 132ms/step - acc: 0.5625 - loss: 0.7012
Epoch 4/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 207ms/step - acc: 0.5625 - loss: 0.6729
Epoch 5/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 171ms/step - acc: 0.5625 - loss: 0.6461
Epoch 6/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 252ms/step - acc: 0.5625 - loss: 0.6209
Epoch 7/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 338ms/step - acc: 0.5625 - loss: 0.5971
Epoch 8/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 390ms/step - acc: 0.6875 - loss: 0.5747
Epoch 9/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 507ms/step - acc:

<keras.src.callbacks.history.History at 0x2865e8166c0>

<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Evaluate our model</p>

In [45]:
loss, accuracy = model.evaluate(padded_sentences, sentiments, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


<p style="font-family:consolas; font-size: 22px; color: magenta"> Word Embeddings with Keras Functional API</p>

In [None]:
'''
The rest of the script remains the same as in the last section. 
The only change is in how the deep learning model is built. 
Let's implement the same model as in the last section with the Keras Functional API.
'''
from keras.models import Model
from keras.layers import Input
# In the Keras Functional API, you have to define the input layer separately before the embedding layer.
deep_inputs = Input(shape=(length_long_sentence,))
embedding = Embedding(vocab_length, 100, weights=[embedding_matrix], input_length=length_long_sentence, trainable=False)(deep_inputs) # line A
flatten = Flatten()(embedding)
hidden = Dense(1, activation='sigmoid')(flatten)
model = Model(inputs=deep_inputs, outputs=hidden)



<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Compile the model </p>

In [47]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summary() prints the table itself and returns None, so wrapping it in print() only adds a stray "None"
model.summary()


<p style="font-family:Roboto; font-size: 22px; color: orange; text-decoration-line: overline; "> Part: _Train and evaluate our model</p>

In [48]:
model.fit(padded_sentences, sentiments, epochs=100, verbose=1)
loss, accuracy = model.evaluate(padded_sentences, sentiments, verbose=0)

print('Accuracy: %f' % (accuracy*100))

Epoch 1/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 2s 2s/step - acc: 0.4375 - loss: 0.7305
Epoch 2/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 95ms/step - acc: 0.5000 - loss: 0.6993
Epoch 3/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 168ms/step - acc: 0.6250 - loss: 0.6709
Epoch 4/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 473ms/step - acc: 0.6250 - loss: 0.6451
Epoch 5/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 123ms/step - acc: 0.6250 - loss: 0.6216
Epoch 6/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 206ms/step - acc: 0.6875 - loss: 0.5999
Epoch 7/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 128ms/step - acc: 0.7500 - loss: 0.5798
Epoch 8/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 118ms/step - acc: 0.7500 - loss: 0.5609
Epoch 9/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 117ms/step - acc: 