### What Are Word Embeddings?

A word embedding is a learned representation for text where words that have the same meaning have a similar representation.

It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.

    "One of the benefits of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. … The main benefit of the dense representations is generalization power: if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities."
Ref: https://machinelearningmastery.com/what-are-word-embeddings/

A famous example illustrates the idea: if you compute King - Man + Woman, the resulting vector is very close to the vector for the word 'Queen'. So we can say: King - Man + Woman ≈ Queen. Similarly, Madrid - Spain + France ≈ Paris.
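The arithmetic behind this kind of analogy can be sketched with a few toy vectors. The 2D values below are hypothetical and hand-picked for illustration (real embeddings are learned and have many more dimensions), and the helper names `cosine_similarity` and `closest_word` are made up for this sketch:

```python
import numpy as np

# Hypothetical 2D embeddings, hand-picked for illustration only:
# axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
toy_vectors = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.8, 0.0]),
}

def cosine_similarity(a, b):
    # 1.0 means the two vectors point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest_word(vector, exclude=()):
    # Return the word whose toy vector is most similar to `vector`.
    candidates = {w: v for w, v in toy_vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))

target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
analogy_result = closest_word(target, exclude=("king", "man", "woman"))
print(analogy_result)
```

The three input words are excluded from the search so that the nearest remaining word is reported, which is a common convention when evaluating analogies.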

However, word embeddings sometimes capture our worst biases. For example, although they correctly learn that Man is to King as Woman is to Queen, they also seem to learn that Man is to Doctor as Woman is to Nurse.

In [2]:
import tensorflow as tf
import numpy as np

In [4]:
tf.random.set_seed(42)

embedding_layer = tf.keras.layers.Embedding(input_dim=5, output_dim=2)
embedding_layer(np.array([2,4,2]))

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[-0.02580125, -0.04535739],
       [-0.04483942,  0.01235889],
       [-0.02580125, -0.04535739]], dtype=float32)>

Notice that category 2 appears twice in the input and is encoded as the same 2D vector both times, [-0.02580125, -0.04535739], while category 4 is encoded as [-0.04483942, 0.01235889]. Since the layer has not been trained yet, these encodings are just random.
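Conceptually, an embedding layer behaves like a lookup table with one row per category. The NumPy sketch below illustrates that idea only and is not how Keras implements the layer; the random range used to fill the table is an assumption, and the name `embedding_matrix` is chosen for this example:

```python
import numpy as np

rng = np.random.default_rng(42)

# One row per category: 5 categories, 2 dimensions each,
# mirroring input_dim=5 and output_dim=2 above.
# The [-0.05, 0.05) range is an assumption for this sketch.
embedding_matrix = rng.uniform(-0.05, 0.05, size=(5, 2))

category_ids = np.array([2, 4, 2])
encoded = embedding_matrix[category_ids]  # plain row lookup

print(encoded.shape)
```

Because the lookup is just row indexing, the first and third rows of `encoded` are identical, and during training only the rows that were actually looked up would receive gradient updates.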

For a categorical text attribute, you can simply chain a StringLookup layer and an Embedding layer like this:

In [9]:
tf.random.set_seed(42)
ocean_prox = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]

str_lookup_layer = tf.keras.layers.StringLookup()
str_lookup_layer.adapt(ocean_prox)

lookup_and_embed = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=[], dtype=tf.string),
    str_lookup_layer,
    tf.keras.layers.Embedding(input_dim=str_lookup_layer.vocabulary_size(), output_dim=2)
])

lookup_and_embed(np.array(["<1H OCEAN", "ISLAND", "<1H OCEAN"]))

<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[ 0.03129688, -0.04329207],
       [ 0.02822297,  0.03865187],
       [ 0.03129688, -0.04329207]], dtype=float32)>
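The chain above can be pictured as two lookups in a row: string to integer ID, then integer ID to vector. The sketch below mimics this in plain Python and NumPy. It assumes index 0 is reserved for unknown strings, and the ordering of the known categories, the `"DESERT"` input, and all names are hypothetical choices for this example:

```python
import numpy as np

categories = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]

# Index 0 is reserved for unknown strings; known categories start at 1.
# The ordering of the known categories here is arbitrary.
string_to_index = {cat: i for i, cat in enumerate(categories, start=1)}

def lookup(strings):
    # Map each string to its ID, falling back to 0 for unknown strings.
    return np.array([string_to_index.get(s, 0) for s in strings])

rng = np.random.default_rng(42)
# One extra row for the unknown-string ID, hence len(categories) + 1.
toy_embeddings = rng.uniform(-0.05, 0.05, size=(len(categories) + 1, 2))

ids = lookup(["<1H OCEAN", "ISLAND", "<1H OCEAN", "DESERT"])
vectors = toy_embeddings[ids]

print(ids)
print(vectors.shape)
```

The extra row is the reason the embedding table must be sized from the lookup's vocabulary size rather than from the number of known categories.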

Putting Everything Together

In [14]:
tf.random.set_seed(42)
np.random.seed(42)

X_train_num = np.random.rand(10_000, 8)
X_train_cat = np.random.choice(ocean_prox, size=10_000)
y_train = np.random.rand(10_000, 1)
X_valid_num = np.random.rand(2_000, 8)
X_valid_cat = np.random.choice(ocean_prox, size=2_000)
y_valid = np.random.rand(2_000, 1)

num_input = tf.keras.layers.Input(shape=[8], name="num")
cat_input = tf.keras.layers.Input(shape=[], dtype=tf.string, name="cat")
cat_embeddings = lookup_and_embed(cat_input) 
encoded_inputs = tf.keras.layers.concatenate([num_input, cat_embeddings])
outputs = tf.keras.layers.Dense(1)(encoded_inputs)

model = tf.keras.models.Model(inputs=[num_input, cat_input], outputs=[outputs])
model.compile(loss="mse", optimizer="sgd")
history = model.fit((X_train_num, X_train_cat), y_train, epochs=5, validation_data=((X_valid_num, X_valid_cat), y_valid))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Text Preprocessing

In [15]:
train_data = ["to be", "!(to be)", "That's the question", "Be be, be."]

text_vec_layer = tf.keras.layers.TextVectorization()
text_vec_layer.adapt(train_data)

text_vec_layer(['Be good!', 'Question: be or be?'])

<tf.Tensor: shape=(2, 4), dtype=int64, numpy=
array([[2, 1, 0, 0],
       [6, 2, 1, 2]], dtype=int64)>

TextVectorization's adapt() method first converted the training sentences to lowercase and removed punctuation, which is why "Be", "be" and "be?" are all encoded as "be"=2. Next, the sentences were split on whitespace, and the resulting words were sorted by descending frequency, producing the final vocabulary. When encoding sentences, unknown words get encoded as 1s. Lastly, since the first sentence is shorter than the second, it was padded with 0s.
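The steps just described can be re-created in plain Python as a rough sketch. This is not the Keras implementation: the punctuation set comes from Python's `string.punctuation`, and the tie-breaking rule for equally frequent words is an assumption chosen only so the toy result matches the output shown above. All function and variable names are made up for this example:

```python
import string
from collections import Counter

train_sentences = ["to be", "!(to be)", "That's the question", "Be be, be."]

def standardize_and_split(sentence):
    # Lowercase, drop ASCII punctuation, then split on whitespace.
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

word_counts = Counter(w for s in train_sentences for w in standardize_and_split(s))

# Most frequent first. Ties are broken in reverse alphabetical order, which is
# an assumption made only so this toy matches the output shown above.
by_frequency = sorted(sorted(word_counts, reverse=True), key=lambda w: -word_counts[w])
toy_vocabulary = ["", "[UNK]"] + by_frequency  # ID 0 = padding, ID 1 = unknown
word_to_id = {w: i for i, w in enumerate(toy_vocabulary)}

def encode(sentences):
    # Map words to IDs (1 for unknown words), then pad with 0s to the longest row.
    rows = [[word_to_id.get(w, 1) for w in standardize_and_split(s)] for s in sentences]
    width = max(len(r) for r in rows)
    return [r + [0] * (width - len(r)) for r in rows]

encoded_sentences = encode(["Be good!", "Question: be or be?"])
print(toy_vocabulary)
print(encoded_sentences)
```

Here "good" and "or" never appeared in the training sentences, so both map to ID 1, while "be" maps to ID 2 because it is the most frequent training word.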

In [19]:
text_vec_layer = tf.keras.layers.TextVectorization(standardize=None, split=None, ragged=True)
text_vec_layer.adapt(train_data)

text_vec_layer(['Be good!', 'Question: be or be?'])

<tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 1], dtype=int64)>

The TextVectorization class has many options. For example, you can preserve case and punctuation by setting standardize=None, and you can prevent splitting by setting split=None. You can also pass your own function to either of these arguments.

You can set the output_sequence_length argument to ensure that all output sequences get cropped or padded to the desired length. Set ragged=True to get a ragged tensor instead of a regular tensor.
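The effect of a fixed output length can be illustrated with a small helper. This is only a sketch of the idea, assuming that cropping removes tokens from the end and that padding uses ID 0; `pad_or_crop` is a hypothetical name, not a Keras function:

```python
def pad_or_crop(token_ids, length, pad_id=0):
    """Return exactly `length` IDs: crop extra tokens, pad the tail with pad_id."""
    cropped = token_ids[:length]
    return cropped + [pad_id] * (length - len(cropped))

fixed_short = pad_or_crop([2, 1], length=3)       # too short: padded
fixed_long = pad_or_crop([6, 2, 1, 2], length=3)  # too long: cropped

print(fixed_short, fixed_long)
```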

In [20]:
text_vec_layer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
text_vec_layer.adapt(train_data)

text_vec_layer(['Be good!', 'Question: be or be?'])

<tf.Tensor: shape=(2, 6), dtype=float32, numpy=
array([[0.96725637, 0.6931472 , 0.        , 0.        , 0.        ,
        0.        ],
       [0.96725637, 1.3862944 , 0.        , 0.        , 0.        ,
        1.0986123 ]], dtype=float32)>

Alternatively, you can set the output_mode argument to "multi_hot" or "count" to get the corresponding encodings. However, simply counting words is not ideal: words like "to" and "the" are so frequent that they hardly matter at all, whereas rarer words such as "basketball" are much more informative. So it is usually preferable to set output_mode to "tf_idf" instead of "multi_hot" or "count". "tf_idf" stands for term-frequency × inverse-document-frequency (TF-IDF).
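To make the weighting concrete, here is a rough pure-Python sketch of a TF-IDF computation on the same training sentences. There are several TF-IDF variants; the smoothed formula used below, log(1 + N / (1 + d)), is an assumption chosen because it reproduces the "be" and "question" weights in the output above. Unknown words are simply skipped in this sketch, and all names are made up for the example:

```python
import math
from collections import Counter

# Training sentences from above, already lowercased, stripped and split.
tokenized_docs = [
    ["to", "be"],
    ["to", "be"],
    ["thats", "the", "question"],
    ["be", "be", "be"],
]

def inverse_document_frequency(word):
    # d = number of training sentences containing the word at least once.
    d = sum(word in doc for doc in tokenized_docs)
    return math.log(1 + len(tokenized_docs) / (1 + d))

def tf_idf_weights(tokens):
    # Term count multiplied by IDF, for known words only.
    known = {w for doc in tokenized_docs for w in doc}
    counts = Counter(w for w in tokens if w in known)
    return {w: c * inverse_document_frequency(w) for w, c in counts.items()}

weights = tf_idf_weights(["question", "be", "or", "be"])
print(weights)
```

"be" appears in three of the four training sentences, so each occurrence is worth only log(2) ≈ 0.693, while "question" appears in a single sentence and is worth log(3) ≈ 1.099.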

### Using Pretrained Language Model Components

In [22]:
import tensorflow_hub as hub

hub_layer = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim50/2")
sentence_embeddings = hub_layer(tf.constant(["To be", "Not to be"]))
sentence_embeddings.numpy().round(2)

array([[-0.25,  0.28,  0.01,  0.1 ,  0.14,  0.16,  0.25,  0.02,  0.07,
         0.13, -0.19,  0.06, -0.04, -0.07,  0.  , -0.08, -0.14, -0.16,
         0.02, -0.24,  0.16, -0.16, -0.03,  0.03, -0.14,  0.03, -0.09,
        -0.04, -0.14, -0.19,  0.07,  0.15,  0.18, -0.23, -0.07, -0.08,
         0.01, -0.01,  0.09,  0.14, -0.03,  0.03,  0.08,  0.1 , -0.01,
        -0.03, -0.07, -0.1 ,  0.05,  0.31],
       [-0.2 ,  0.2 , -0.08,  0.02,  0.19,  0.05,  0.22, -0.09,  0.02,
         0.19, -0.02, -0.14, -0.2 , -0.04,  0.01, -0.07, -0.22, -0.1 ,
         0.16, -0.44,  0.31, -0.1 ,  0.23,  0.15, -0.05,  0.15, -0.13,
        -0.04, -0.08, -0.16, -0.1 ,  0.13,  0.13, -0.18, -0.04,  0.03,
        -0.1 , -0.07,  0.07,  0.03, -0.08,  0.02,  0.05,  0.07, -0.14,
        -0.1 , -0.18, -0.13, -0.04,  0.15]], dtype=float32)

The hub.KerasLayer layer downloads the module from the given URL. It takes strings as input and encodes each one as a single vector (50 dimensions, in this case).