
Creating your own embedding using gensim

GloVe vectors trained on various large corpora (number of tokens ranging from 6 billion to 840 billion, vocabulary size from 400 thousand to 2.2 million) and of various dimensions (50, 100, 200, 300) are available from the GloVe project download page (https://nlp.stanford.edu/projects/glove/). They can be downloaded directly from the site, or by using the gensim or spaCy data downloaders.

In [None]:
!mkdir data

In [None]:
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load("text8")
model = Word2Vec(dataset)
model.save("data/text8-word2vec.bin")



gensim is an open-source Python library designed to extract semantic meaning from text documents. One of its features is an excellent implementation of the Word2Vec algorithm, with an easy-to-use API that allows you to train and query your own Word2Vec model.

Exploring the embedding space with gensim


Let us reload the Word2Vec model we just built and explore it using the gensim API. The actual word vectors can be accessed as a custom gensim class from the model's wv attribute:

In [None]:
from gensim.models import Word2Vec
# the file was written by Word2Vec.save(), so load it with the same class
model = Word2Vec.load("data/text8-word2vec.bin")
word_vectors = model.wv

In [None]:
# gensim 3.x API; in gensim 4.x the vocabulary is word_vectors.key_to_index
words = word_vectors.vocab.keys()
print([x for i, x in enumerate(words) if i < 10])
assert("king" in words)

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']


We can look for similar words to a given word ("king"), shown as follows:



In [None]:
def print_most_similar(word_conf_pairs, k):
   for i, (word, conf) in enumerate(word_conf_pairs):
       print("{:.3f} {:s}".format(conf, word))
       if i >= k-1:
           break
   if k < len(word_conf_pairs):
       print("...")
print_most_similar(word_vectors.most_similar("king"), 5)

0.769 prince
0.711 queen
0.700 kings
0.698 emperor
0.689 throne
...


The most_similar() method with a single parameter produces the following output. Here the floating point score is a measure of the similarity, higher values being better than lower values. As you can see, the similar words seem to be mostly accurate:

In [None]:
word_vectors.most_similar("king")

[('prince', 0.7693274021148682),
 ('queen', 0.7108398675918579),
 ('kings', 0.7000777721405029),
 ('emperor', 0.6982017755508423),
 ('throne', 0.6885628700256348),
 ('sultan', 0.6810340881347656),
 ('constantine', 0.6770751476287842),
 ('pharaoh', 0.6768440008163452),
 ('darius', 0.6639723777770996),
 ('vii', 0.6617316603660583)]

You can also do vector arithmetic similar to the country-capital example we described earlier. Our objective is to see if the relation Paris : France :: Berlin : Germany holds true. This is equivalent to saying that the offset in embedding space between Paris and France should be approximately the same as that between Berlin and Germany. In other words, France - Paris + Berlin should give us a vector close to Germany. In code, this translates to:

In [None]:
print_most_similar(word_vectors.most_similar(
   positive=["france", "berlin"], negative=["paris"]), 1
)

0.798 germany
...
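
To make the arithmetic concrete, here is a minimal sketch that uses made-up 2-D vectors rather than the trained model. The contents of toy_vectors and the helper analogy() are hypothetical and exist only for illustration; gensim additionally unit-normalizes each word vector before combining them, which this sketch skips.

```python
import math

# Made-up 2-D "embeddings" (illustration only, not from a trained model).
toy_vectors = {
    "paris": [1.0, 0.0],
    "france": [1.0, 1.0],
    "berlin": [3.0, 0.0],
    "germany": [3.0, 1.0],
    "tokyo": [5.0, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(positive, negative, vectors):
    """Return the word whose vector is closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Exclude the query words themselves, as most_similar() does.
    excluded = set(positive) | set(negative)
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in excluded}
    return max(scores, key=scores.get)


best = analogy(positive=["france", "berlin"], negative=["paris"], vectors=toy_vectors)
print(best)
```

In this toy space, france - paris + berlin lands exactly on the vector for germany, so it is ranked first.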


The preceding similarity value is the cosine similarity. A better measure for analogy queries was proposed by Levy and Goldberg [9], and it is also implemented in the gensim API:

In [None]:
print_most_similar(word_vectors.most_similar_cosmul(
   positive=["france", "berlin"], negative=["paris"]), 1
)

0.966 germany
...
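
As a rough sketch of the difference between the two measures: the additive objective (often called 3CosAdd) adds and subtracts cosine terms, while the multiplicative objective of Levy and Goldberg (3CosMul) multiplies and divides them after shifting each cosine into the range [0, 1]. The toy vectors and function names below are hypothetical, and the small eps is only there to avoid division by zero.

```python
import math

# Hypothetical 4-D vectors: [city, country, german, french] features.
toy = {
    "paris": [1.0, 0.0, 0.0, 1.0],
    "france": [0.0, 1.0, 0.0, 1.0],
    "berlin": [1.0, 0.0, 1.0, 0.0],
    "germany": [0.0, 1.0, 1.0, 0.0],
    "italy": [0.0, 1.0, 0.0, 0.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def shifted(u, v):
    """Cosine similarity mapped from [-1, 1] to [0, 1]."""
    return (1.0 + cosine(u, v)) / 2.0


def cos_add(c, a, a_star, b):
    """Additive score for 'a is to a_star as b is to c'."""
    return cosine(c, a_star) + cosine(c, b) - cosine(c, a)


def cos_mul(c, a, a_star, b, eps=1e-6):
    """Multiplicative score for the same analogy."""
    return shifted(c, a_star) * shifted(c, b) / (shifted(c, a) + eps)


candidates = ["germany", "italy"]
a, a_star, b = toy["paris"], toy["france"], toy["berlin"]
best_add = max(candidates, key=lambda w: cos_add(toy[w], a, a_star, b))
best_mul = max(candidates, key=lambda w: cos_mul(toy[w], a, a_star, b))
print(best_add, best_mul)
```

Because the multiplicative form divides by the similarity to the negative word, its scores are not bounded by 1 in general, so they should not be compared directly with cosine scores.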


gensim also provides a doesnt_match() function, which can be used to detect the odd one out of a list of words:

In [None]:
print(word_vectors.doesnt_match(["hindus", "parsis", "singapore", "christians"]))


singapore



This gives us singapore as expected, since it is the only country among a set of words identifying religions.
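
Roughly, doesnt_match() averages the unit-normalized vectors of the input words and returns the word least similar to that mean. The sketch below reproduces the idea with hypothetical 2-D vectors; odd_one_out() is an illustrative helper and not a gensim function.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean vector."""
    normed = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(normed.values())))
    mean = unit([sum(normed[w][i] for w in words) / len(words) for i in range(dim)])
    # Dot product of unit vectors equals their cosine similarity.
    sims = {w: sum(a * b for a, b in zip(normed[w], mean)) for w in words}
    return min(sims, key=sims.get)


# Made-up vectors: three words point one way, one points another.
toy_vectors = {
    "hindus": [0.9, 0.1],
    "parsis": [0.8, 0.2],
    "christians": [0.85, 0.15],
    "singapore": [0.1, 0.9],
}
odd = odd_one_out(["hindus", "parsis", "singapore", "christians"], toy_vectors)
print(odd)
```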

We can also calculate the similarity between two words. Here we demonstrate that the similarity between related words is higher than that between unrelated words:

In [None]:
for word in ["woman", "dog", "whale", "tree"]:
   print("similarity({:s}, {:s}) = {:.3f}".format(
       "man", word,
       word_vectors.similarity("man", word)
   ))

similarity(man, woman) = 0.736
similarity(man, dog) = 0.456
similarity(man, whale) = 0.266
similarity(man, tree) = 0.316


The similar_by_word() function returns the same ranked list as most_similar() called with a single word. There is also a related similar_by_vector() function, which allows you to find similar words by specifying a vector as input. Here we try to find words that are similar to "singapore":

In [None]:
print_most_similar(
   word_vectors.similar_by_word("singapore"), 5)

0.886 malaysia
0.852 indonesia
0.844 bahamas
0.823 brunei
0.813 taiwan
...


We can also compute the distance between two words in the embedding space using the distance() function. This is really just 1 - similarity():

In [None]:
print("distance(singapore, malaysia) = {:.3f}".format(
   word_vectors.distance("singapore", "malaysia")
))

distance(singapore, malaysia) = 0.114
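
The relationship between the two quantities can be sketched in a few lines of plain Python. The function names and sample vectors here are illustrative only. Note that the outputs above are consistent with it: 1 - 0.886 = 0.114.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


same_direction = cosine_distance([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_distance([1.0, 2.0, 0.0], [0.0, 0.0, 3.0])
print(round(same_direction, 6), round(orthogonal, 6))
```

Vectors pointing in the same direction have a distance close to 0 regardless of their length, and orthogonal vectors have a distance of 1.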


We can also look up vectors for a vocabulary word either directly from the word_vectors object, or by using the word_vec() wrapper, shown as follows:

In [None]:
vec_song = word_vectors["song"]
vec_song_2 = word_vectors.word_vec("song", use_norm=True)

In [None]:
vec_song,vec_song_2

(array([-0.3156664 ,  0.10720551, -3.144506  ,  1.0351552 , -0.5526515 ,
         1.9685628 ,  0.7325977 ,  1.2272649 ,  0.02926658, -0.03596711,
        -0.39086467, -1.527401  ,  2.4827774 , -0.13089536, -1.0787141 ,
         1.3572806 ,  0.9153923 , -1.8670173 ,  0.2830431 , -2.4028175 ,
        -0.12260415,  1.5906199 ,  2.0122545 , -0.27040014, -2.3553162 ,
        -2.8955455 ,  3.6636055 ,  0.36103353,  0.9389921 ,  1.7202243 ,
        -1.7266402 ,  0.18956617,  1.1643379 , -0.91071165,  1.4218795 ,
        -0.7663593 , -2.126566  ,  2.51259   ,  2.4552925 ,  0.36232647,
        -0.21931393,  0.5776023 , -1.430306  , -2.007075  ,  0.541459  ,
        -0.91987723,  0.46336558,  2.3473291 ,  2.8090487 ,  2.637745  ,
         0.10578953, -3.1830933 ,  0.6218483 , -2.4736285 ,  1.4344423 ,
        -1.2269671 , -0.48875085,  0.21356896,  0.8835547 , -1.2243378 ,
         1.6037246 ,  0.5887441 , -0.17149971,  2.3121934 ,  1.1105942 ,
         2.9615018 ,  0.23544699, -1.2602105 , -0.1

**Our example is a spam detector that will classify Short Message Service (SMS) or text messages as either "ham" or "spam."** 

In [None]:
import argparse
import gensim.downloader as api
import numpy as np
import os
import shutil
import tensorflow as tf
from sklearn.metrics import accuracy_score, confusion_matrix

Getting the data


The data for our model is available publicly and comes from the SMS spam collection dataset from the UCI Machine Learning Repository [11]. The following code will download the file and parse it to produce a list of SMS messages and their corresponding labels:

In [None]:
def download_and_read(url):
   local_file = url.split('/')[-1]
   p = tf.keras.utils.get_file(local_file, url,
       extract=True, cache_dir=".")
   labels, texts = [], []
   local_file = os.path.join("datasets", "SMSSpamCollection")
   with open(local_file, "r", encoding="utf-8") as fin:
       for line in fin:
           label, text = line.strip().split('\t')
           labels.append(1 if label == "spam" else 0)
           texts.append(text)
   return texts, labels
DATASET_URL =  "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
texts, labels = download_and_read(DATASET_URL)


Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip


Making the data ready for use

The next step is to process the data so it can be consumed by the network. The SMS text needs to be fed into the network as a sequence of integers, where each word is represented by its corresponding ID in the vocabulary. We will use the Keras tokenizer to convert each SMS text into a sequence of words, and then create the vocabulary using the fit_on_texts() method on the tokenizer.

We then convert the SMS messages to sequences of integers using texts_to_sequences(). Finally, since the network can only work with fixed-length sequences of integers, we call the pad_sequences() function to pad the shorter SMS messages with zeros.

The longest SMS message in our dataset has 189 tokens (words). In many applications where there may be a few outlier sequences that are very long, we would restrict the length to a smaller number by setting the maxlen parameter of pad_sequences(). In that case, sentences longer than maxlen tokens would be truncated, and sentences shorter than maxlen tokens would be padded:

In [None]:
# tokenize and pad text
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(texts)
text_sequences = tokenizer.texts_to_sequences(texts)
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(
    text_sequences)
num_records = len(text_sequences)
max_seqlen = len(text_sequences[0])
print("{:d} sentences, max length: {:d}".format(
    num_records, max_seqlen))

5574 sentences, max length: 189
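
As a minimal pure-Python sketch of the padding and truncation behavior described above: pad_to_length() is a hypothetical helper that imitates pad_sequences() under its default settings, which pad and truncate at the start of each sequence. It assumes maxlen is at least 1 when given.

```python
def pad_to_length(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        # No limit given: use the longest sequence, so nothing is truncated.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        trimmed = list(seq[-maxlen:])  # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded


toy_sequences = [[1, 2, 3], [4]]
full = pad_to_length(toy_sequences)             # pad to the longest sequence
short = pad_to_length(toy_sequences, maxlen=2)  # truncate and pad to 2 tokens
print(full)
print(short)
```

With maxlen=2 the first toy sequence loses its first token, while the second is still padded with a leading zero.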


We will also convert our labels to categorical or one-hot encoding format, because the loss function we would like to choose (categorical cross-entropy) expects to see the labels in that format:

In [None]:
# labels
NUM_CLASSES = 2
cat_labels = tf.keras.utils.to_categorical(
    labels, num_classes=NUM_CLASSES)
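
For intuition, one-hot encoding of integer labels can be sketched in plain Python as follows. one_hot() is a hypothetical helper used only for illustration, with 1 standing for spam and 0 for ham as in the parsing code above.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows of floats."""
    return [
        [1.0 if index == label else 0.0 for index in range(num_classes)]
        for label in labels
    ]


encoded = one_hot([0, 1, 1], num_classes=2)
print(encoded)
```

An alternative design, not used here, is to keep the integer labels and use a sparse variant of the loss such as Keras's sparse_categorical_crossentropy.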

In [None]:
# vocabulary
word2idx = tokenizer.word_index
idx2word = {v:k for k, v in word2idx.items()}
word2idx["PAD"] = 0
idx2word[0] = "PAD"
vocab_size = len(word2idx)
print("vocab size: {:d}".format(vocab_size))

vocab size: 9010


In [None]:
# dataset
dataset = tf.data.Dataset.from_tensor_slices(
    (text_sequences, cat_labels))
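# shuffle once and keep that order, so the train/val/test splits
# below do not overlap when the dataset is re-iterated
dataset = dataset.shuffle(10000, reshuffle_each_iteration=False)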
test_size = num_records // 4
val_size = (num_records - test_size) // 10
test_dataset = dataset.take(test_size)
val_dataset = dataset.skip(test_size).take(val_size)
train_dataset = dataset.skip(test_size + val_size)
BATCH_SIZE = 128
test_dataset = test_dataset.batch(BATCH_SIZE, drop_remainder=True)
val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)
train_dataset = train_dataset.batch(BATCH_SIZE, drop_remainder=True)
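
The split sizes implied by this code can be checked with simple integer arithmetic. The sketch below assumes the 5574 records and batch size of 128 reported above, and shows how many records each split keeps once drop_remainder=True discards the final partial batch.

```python
num_records = 5574
batch_size = 128

# Same integer arithmetic as the dataset code above.
test_size = num_records // 4
val_size = (num_records - test_size) // 10
train_size = num_records - test_size - val_size

splits = {}
for name, size in [("test", test_size), ("val", val_size), ("train", train_size)]:
    full_batches = size // batch_size  # batches kept with drop_remainder=True
    dropped = size % batch_size        # records in the discarded partial batch
    splits[name] = (size, full_batches, dropped)
    print(name, size, full_batches, dropped)
```

Since a partial batch is dropped from every split, the evaluation below runs on slightly fewer than test_size records.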

Building the embedding matrix


In [None]:
# import gensim.downloader as api
# api.info("models").keys()

In [None]:
def build_embedding_matrix(sequences, word2idx, embedding_dim,
       embedding_file):
   if os.path.exists(embedding_file):
       E = np.load(embedding_file)
   else:
       vocab_size = len(word2idx)
       E = np.zeros((vocab_size, embedding_dim))
       word_vectors = api.load(EMBEDDING_MODEL)
       for word, idx in word2idx.items():
           try:
               E[idx] = word_vectors.word_vec(word)
           except KeyError:   # word not in embedding
               pass
       np.save(embedding_file, E)
   return E
EMBEDDING_DIM = 300
DATA_DIR = "data"
EMBEDDING_NUMPY_FILE = os.path.join(DATA_DIR, "E.npy")
EMBEDDING_MODEL = "glove-wiki-gigaword-300"
E = build_embedding_matrix(text_sequences, word2idx, 
   EMBEDDING_DIM,
   EMBEDDING_NUMPY_FILE)
print("Embedding matrix:", E.shape)
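
The lookup logic in build_embedding_matrix() can be illustrated without downloading GloVe. In this sketch, toy_pretrained stands in for the pretrained vectors and build_toy_matrix() is a hypothetical helper; words missing from the pretrained vocabulary keep an all-zero row, just as in the function above.

```python
def build_toy_matrix(word2idx, pretrained, dim):
    """Copy pretrained vectors into rows indexed by word ID; unknown words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word2idx))]
    missing = []
    for word, idx in word2idx.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
        else:
            missing.append(word)  # out-of-vocabulary word: row stays all zeros
    return matrix, missing


# Made-up vocabulary and 3-D "pretrained" vectors, for illustration only.
toy_word2idx = {"PAD": 0, "free": 1, "prize": 2, "lol": 3}
toy_pretrained = {"free": [0.1, 0.2, 0.3], "prize": [0.4, 0.5, 0.6]}

matrix, missing = build_toy_matrix(toy_word2idx, toy_pretrained, dim=3)
print(matrix)
print("not found:", missing)
```

Counting the missing words on the real vocabulary is a cheap way to see how much of the SMS text the pretrained embedding actually covers.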




Define the spam classifier


The input is a sequence of integers. The first layer is an Embedding layer, which converts each input integer to a vector of size (embedding_dim). Depending on the run mode, that is, whether we will learn the embeddings from scratch, do transfer learning, or do fine-tuning, the Embedding layer in the network would be slightly different. When the network starts with randomly initialized embedding weights (run_mode == "scratch"), and learns the weights during the training, we set the trainable parameter to True. In the transfer learning case (run_mode == "vectorizer"), we set the weights from our embedding matrix E but set the trainable parameter to False, so it doesn't train. In the fine-tuning case (run_mode == "finetuning"), we set the embedding weights from our external matrix E, as well as set the layer to trainable.

The output of the embedding layer is fed into a convolutional layer. Here fixed-size, 3-token-wide 1D windows (kernel_size=3) are convolved against 256 randomly initialized filters (num_filters=256) to produce a vector of size 256 for each time step. Thus, the output shape is (batch_size, time_steps, num_filters).

The output of the convolutional layer is sent to a 1D spatial dropout layer, which randomly drops entire feature maps output by the convolutional layer. This is a regularization technique to prevent overfitting. The result is then sent through a global max pooling layer, which takes the maximum value across all time steps for each filter, resulting in a vector of shape (batch_size, num_filters). Finally, a dense layer with softmax activation converts this vector into a probability distribution over the two classes, with shape (batch_size, num_classes).
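
The shapes involved can be sketched in plain Python. This assumes the Conv1D defaults of stride 1 and no padding; both helper functions are hypothetical and only illustrate the shape arithmetic and the pooling step, not the learned convolution itself.

```python
def conv_output_steps(seq_len, kernel_size):
    """Number of window positions for stride 1 and no padding."""
    return seq_len - kernel_size + 1


def global_max_pool(feature_maps):
    """Max over time steps for each filter.

    feature_maps is a list of time steps, each a list with one value per filter.
    """
    return [max(per_filter) for per_filter in zip(*feature_maps)]


steps = conv_output_steps(seq_len=189, kernel_size=3)

# Three time steps, two filters (made-up activations).
toy_maps = [
    [0.1, 0.9],
    [0.7, 0.2],
    [0.3, 0.4],
]
pooled = global_max_pool(toy_maps)
print(steps, pooled)
```

Under these assumptions, a 189-token input yields 187 time steps, and pooling reduces them to a single value per filter.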

In [None]:
class SpamClassifierModel(tf.keras.Model):
   def __init__(self, vocab_sz, embed_sz, input_length,
           num_filters, kernel_sz, output_sz,
           run_mode, embedding_weights,
           **kwargs):
       super(SpamClassifierModel, self).__init__(**kwargs)
       if run_mode == "scratch":
           self.embedding = tf.keras.layers.Embedding(vocab_sz,
               embed_sz,
               input_length=input_length,
               trainable=True)
       elif run_mode == "vectorizer":
           self.embedding = tf.keras.layers.Embedding(vocab_sz,
               embed_sz,
               input_length=input_length,
               weights=[embedding_weights],
               trainable=False)
       else:
           self.embedding = tf.keras.layers.Embedding(vocab_sz,
               embed_sz,
               input_length=input_length,
               weights=[embedding_weights],
               trainable=True)
       self.conv = tf.keras.layers.Conv1D(filters=num_filters,
           kernel_size=kernel_sz,
           activation="relu")
       self.dropout = tf.keras.layers.SpatialDropout1D(0.2)
       self.pool = tf.keras.layers.GlobalMaxPooling1D()
       self.dense = tf.keras.layers.Dense(output_sz,
           activation="softmax")
   def call(self, x):
       x = self.embedding(x)
       x = self.conv(x)
       x = self.dropout(x)
       x = self.pool(x)
       x = self.dense(x)
       return x
# model definition
# run mode: one of "scratch", "vectorizer", or "finetuning"
# (the value chosen here is an assumption)
run_mode = "vectorizer"
conv_num_filters = 256
conv_kernel_size = 3
model = SpamClassifierModel(
   vocab_size, EMBEDDING_DIM, max_seqlen,
   conv_num_filters, conv_kernel_size, NUM_CLASSES,
   run_mode, E)
model.build(input_shape=(None, max_seqlen))

In [None]:
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])


Train and evaluate the model


One thing to notice is that the dataset is somewhat imbalanced: there are only 747 instances of spam, compared to 4827 instances of ham. The network could achieve close to 87% accuracy simply by always predicting the majority class. To alleviate this problem, we set class weights to indicate that an error on a spam SMS is 8 times as expensive as an error on a ham SMS. This is indicated by the CLASS_WEIGHTS variable, which is passed into the model.fit() call as an additional parameter.

After training for 3 epochs, we evaluate the model against the test set, and report the accuracy and confusion matrix of the model against the test set:

In [None]:
NUM_EPOCHS = 3
# data distribution is 4827 ham and 747 spam (total 5574), which
# works out to approx 87% ham and 13% spam; taking reciprocals of
# these fractions makes each spam (1) message approximately
# 8 times as important as each ham (0) message.
CLASS_WEIGHTS = { 0: 1, 1: 8 }
# train model
model.fit(train_dataset, epochs=NUM_EPOCHS,
   validation_data=val_dataset,
   class_weight=CLASS_WEIGHTS)
# evaluate against test set
labels, predictions = [], []
for Xtest, Ytest in test_dataset:
   Ytest_ = model.predict_on_batch(Xtest)
   ytest = np.argmax(Ytest, axis=1)
   ytest_ = np.argmax(Ytest_, axis=1)
   labels.extend(ytest.tolist())
   predictions.extend(ytest_.tolist())
print("test accuracy: {:.3f}".format(accuracy_score(labels, predictions)))
print("confusion matrix")
print(confusion_matrix(labels, predictions))
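
The numbers behind the class weights can be reproduced with a short calculation. This is a sketch based on the counts quoted above; the variable names are illustrative. The exact reciprocal for spam comes out near 7.5, which the text rounds to 8 after first rounding the spam share to 13%.

```python
# Class counts from the text: label 0 is ham, label 1 is spam.
counts = {0: 4827, 1: 747}
total = sum(counts.values())

# Accuracy of a classifier that always predicts the majority class (ham).
majority_baseline = max(counts.values()) / total

# Reciprocal of each class frequency; rarer classes get larger weights.
inverse_freq = {label: total / n for label, n in counts.items()}

print("majority baseline: {:.3f}".format(majority_baseline))
print("inverse frequencies:", {k: round(v, 2) for k, v in inverse_freq.items()})
```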

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
module_url = "https://tfhub.dev/google/elmo/2"
tf.compat.v1.disable_eager_execution()
elmo = hub.Module(module_url, trainable=False)
embeddings = elmo([
       "i like green eggs and ham",
       "would you eat them in a box"
   ],
   signature="default",
   as_dict=True
)["elmo"]
print(embeddings.shape)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore



(2, 7, 1024)


The output shape is (2, 7, 1024). The first dimension tells us that our input contained 2 sentences. The second dimension is the maximum number of words across all sentences, in this case 7; the model automatically pads the output to the length of the longest sentence. The third dimension gives us the size of the contextual word embedding created by ELMo: each word is converted to a vector of size 1024.
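
The first two dimensions can be checked by hand. The sketch below assumes the sentences are tokenized by splitting on whitespace, and takes the embedding size of 1024 from the output above.

```python
sentences = [
    "i like green eggs and ham",
    "would you eat them in a box",
]

# Assumed tokenization: split each sentence on whitespace.
token_counts = [len(sentence.split()) for sentence in sentences]

ELMO_DIM = 1024  # embedding size reported in the output above
expected_shape = (len(sentences), max(token_counts), ELMO_DIM)
print(token_counts, expected_shape)
```

The first sentence has 6 tokens, so one padded position is added to match the 7 tokens of the second.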

In [None]:
module_url = "https://tfhub.dev/google/tf2-preview/elmo/2"
embed = hub.KerasLayer(module_url)
embeddings = embed([
   "i like green eggs and ham",
   "would you eat them in a box"
])["elmo"]
print(embeddings.shape)

OSError: ignored

The BERT model comes in two major flavors: BERT-base and BERT-large. BERT-base has 12 encoder layers, 768 hidden units, and 12 attention heads, with 110 million parameters in all. BERT-large has 24 encoder layers, 1024 hidden units, and 16 attention heads, with 340 million parameters. More details can be found in the BERT GitHub repository [34].

BERT pretraining is a very expensive process and is typically run on specialized accelerators such as Tensor Processing Units (TPUs), which are available from Google via Colab [32] or Google Cloud Platform [33]. However, fine-tuning BERT-base with custom datasets is usually achievable on GPU instances.

Once the BERT model is fine-tuned for your domain, the embeddings from the last four hidden layers usually produce good results for downstream tasks. Which embedding or combination of embeddings (via summing, averaging, max-pooling, or concatenating) to use is usually based on the type of task.
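
The combination strategies mentioned above can be sketched for a single token as follows. The hidden states here are made-up placeholders: 13 vectors of 3 dimensions, assuming one embedding output plus 12 encoder layers, rather than real BERT activations. combine_layers() is a hypothetical helper.

```python
def combine_layers(layers, how):
    """Combine per-layer vectors for one token into a single feature vector."""
    if how == "sum":
        return [sum(values) for values in zip(*layers)]
    if how == "mean":
        return [sum(values) / len(layers) for values in zip(*layers)]
    if how == "max":
        return [max(values) for values in zip(*layers)]
    if how == "concat":
        return [value for layer in layers for value in layer]
    raise ValueError("unknown combination: {}".format(how))


# Placeholder hidden states: layer i is the vector [i, i, i].
hidden_states = [[float(layer)] * 3 for layer in range(13)]
last_four = hidden_states[-4:]  # layers 9, 10, 11, and 12

summed = combine_layers(last_four, "sum")
averaged = combine_layers(last_four, "mean")
pooled = combine_layers(last_four, "max")
concatenated = combine_layers(last_four, "concat")
print(summed, averaged, pooled, len(concatenated))
```

Summing, averaging, and max-pooling keep the original vector size, while concatenation multiplies it by the number of layers used, which matters for the size of the downstream model.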

Using BERT as a feature extractor
