# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Week 4: From MLP over RNN to Transformer</font>

# <font color="#003660">Notebook 2: MLP with Embeddings</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... are able to train a neural network with word embeddings as features.
    </font>
</div>
</center>
</p>

# Using Word Embeddings as Features

Instead of using word counts as features, we can represent a sentence as a sequence of word embeddings. This yields a 2D data structure per sentence (sequence_length × embedding_dim). Because a standard feed-forward network (MLP) expects one flat feature vector per example, we have to find a way to reduce the dimensionality of this structure.
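
To make the shapes concrete, here is a minimal NumPy sketch. The sizes and the random "embeddings" are made up for illustration; it only shows how a (sequence_length, embedding_dim) array can be collapsed into a single vector, either by averaging over tokens or by concatenating all token vectors.

```python
import numpy as np

sequence_length, embedding_dim = 5, 4   # toy sizes, chosen for illustration
rng = np.random.default_rng(seed=0)

# One sentence: one row per token, one column per embedding dimension
sentence = rng.normal(size=(sequence_length, embedding_dim))

# Option A: average over the token axis -> one vector of length embedding_dim
averaged = sentence.mean(axis=0)

# Option B: concatenate all token vectors -> one long vector
flattened = sentence.reshape(-1)

print(sentence.shape, averaged.shape, flattened.shape)
```

Averaging discards word order but keeps the feature vector small; concatenation keeps the order but grows with the sequence length.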

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `numpy` is a library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- `SQLAlchemy`, together with `pymysql`, allows us to communicate with SQL databases.
- `getpass` provides a function to enter passwords without echoing them to the screen.
- `sklearn` is a free software machine learning library for the Python programming language.
- `tensorflow` is an end-to-end open source platform for machine learning, especially deep learning.
- `matplotlib` is a plotting library for the Python programming language and its numerical mathematics extension NumPy.


In [None]:
# Install packages
!pip install pymysql

In [None]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import getpass
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from sklearn import metrics
import matplotlib.pyplot as plt

Check if we are running on GPU.

In [None]:
tf.config.experimental.list_physical_devices('GPU')

# Load documents

We load our data from a MySQL database. For security reasons, we don't store the database credentials here; please have a look at Panda to get them.

In [None]:
# Get credentials
user = input("Username: ")
passwd = getpass.getpass("Password: ")
server = input("Server: ")
db = input("Database: ")

# Create an engine instance (SQLAlchemy)
engine = create_engine("mysql+pymysql://{}:{}@{}/{}".format(user, passwd, server, db))

# Define SQL query
sql_query = "SELECT * FROM WineDataset"

# Query dataset (pandas)
corpus = pd.read_sql(sql=sql_query, con=engine)

Display `shape` of the data.

In [None]:
corpus.shape

Split data into three sets: training, validation, and test.

In [None]:
train_corpus = corpus[corpus["testset"] == 0]
val_corpus = train_corpus.iloc[80000:100000,]
train_corpus = train_corpus.iloc[0:80000,].sample(10000)
test_corpus = corpus[corpus["testset"] == 1]

For each dataset, store features and targets in separate variables.

In [None]:
train_corpus_features = train_corpus[["description"]]
train_corpus_target = train_corpus[["points"]]
val_corpus_features = val_corpus[["description"]]
val_corpus_target = val_corpus[["points"]]
test_corpus_features = test_corpus[["description"]]
test_corpus_target = test_corpus[["points"]]

Create [TensorFlow `Datasets`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) from the pandas DataFrames. The use of TensorFlow Datasets follows a common pattern:

1. Create a dataset from raw data (e.g., a pandas DataFrame, a CSV file, multiple text files).
2. Apply transformations to preprocess the data in the dataset (e.g., vectorize text data).
3. Iterate over the dataset and process its elements. Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.
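
The three steps can be pictured with plain Python generators. This is only an analogy for the streaming idea, not the `tf.data` API; the function names and the two example reviews are made up.

```python
def make_dataset(records):
    """Step 1: wrap raw data in a lazily evaluated stream."""
    for text, points in records:
        yield text, points

def vectorize(stream):
    """Step 2: transform each element as it passes through."""
    for text, points in stream:
        yield text.lower().split(), points

# Made-up examples standing in for wine reviews and their scores
raw = [("Ripe and fruity", 87), ("Tart with firm tannins", 90)]

# Step 3: iterate; elements are produced one at a time, on demand
results = list(vectorize(make_dataset(raw)))
for tokens, points in results:
    print(tokens, points)
```

Nothing is computed until the final loop pulls elements through the pipeline, which is why the full dataset never has to sit in memory at once.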

Here, we use the `from_tensor_slices` constructor to create datasets from dataframes.

In [None]:
train_ds = tf.data.Dataset.from_tensor_slices((tf.cast(train_corpus_features.values, tf.string),
                                               tf.cast(train_corpus_target.values, tf.int32)))

val_ds = tf.data.Dataset.from_tensor_slices((tf.cast(val_corpus_features.values, tf.string),
                                             tf.cast(val_corpus_target.values, tf.int32)))

test_ds = tf.data.Dataset.from_tensor_slices((tf.cast(test_corpus_features.values, tf.string),
                                              tf.cast(test_corpus_target.values, tf.int32)))

Display some stats and examples from the created datasets.

In [None]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("===")
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

# Vectorize documents

We will now use [TensorFlow's `TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer to transform raw texts into numerical vectors. Instead of counting word occurrences (as in the BOW model), this time we simply map each word to an integer index (`output_mode = 'int'`) and pad or truncate every text to `output_sequence_length` tokens.

In [None]:
max_tokens = 10000
max_length = 100

text_vectorization = TextVectorization(
    max_tokens = max_tokens,
    output_mode = "int",
    output_sequence_length = max_length
)

Some aspects of the `TextVectorization` layer (e.g., the size and contents of the vocabulary) have to be fit on the training data, which is done with the `adapt` method.

In [None]:
train_ds_features_only = train_ds.map(lambda x, y: x)
text_vectorization.adapt(train_ds_features_only)
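
To see what fitting and applying an integer vectorizer amounts to, here is a simplified pure-Python sketch. It is not the Keras implementation (for example, it does not strip punctuation); the helper names `build_vocabulary` and `encode` are hypothetical. It follows the same conventions as `TextVectorization` in `int` mode: index 0 is reserved for padding, index 1 for unknown words, and the remaining words are ordered by frequency.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Roughly what adapt() does: keep the most frequent words."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Two slots are reserved: "" (padding) and "[UNK]" (unknown word)
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocabulary, max_length):
    """Map words to integer indices, then truncate or pad to max_length."""
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in text.lower().split()]
    ids = ids[:max_length]
    return ids + [0] * (max_length - len(ids))

train_texts = ["dry and crisp", "crisp and fruity and fresh"]  # toy corpus
vocab = build_vocabulary(train_texts, max_tokens=5)
encoded = encode("crisp and oaky", vocab, max_length=6)

print(vocab)    # ['', '[UNK]', 'and', 'crisp', 'dry']
print(encoded)  # [3, 2, 1, 0, 0, 0]
```

The word "oaky" never occurred in the toy training texts, so it is mapped to index 1, and the sequence is padded with zeros up to the requested length.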

Show the vocabulary that our vectorizer knows after being fit to the training data.

In [None]:
text_vectorization.get_vocabulary()[0:10]

Next, we apply our `text_vectorization` layer to all three datasets. This corresponds to step 2 mentioned above.

In [None]:
vectorized_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls = 4)

vectorized_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls = 4)

vectorized_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls = 4)

Show results.

In [None]:
for inputs, targets in vectorized_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("===")
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

# Train model

We are now ready to specify a neural network and feed it with the vectorized datasets. For convenience, we define a custom function `get_model`, which defines the network architecture, creates a model from it, and compiles this model (by defining, e.g., an optimizer and a loss function). Note that we have to somehow reduce the dimensionality of the output of the embedding layer (sequence_length × embedding_dim). Here, we simply perform global average pooling (i.e., we average each dimension of the embedding vectors over all tokens).

*Question*: Can you think of another way to reduce the dimensionality of the output of the embedding layer?

In [None]:
def get_model(max_tokens=10000, hidden_dim=16):
    inputs = keras.Input(shape=(max_length,), dtype="int64")
    embedded = layers.Embedding(input_dim=max_tokens, output_dim=200, mask_zero=True)(inputs)
    # Average each embedding dimension over all tokens (as described above)
    hidden1 = layers.GlobalAveragePooling1D()(embedded)
    #hidden1 = layers.Flatten()(embedded)
    hidden2 = layers.Dense(hidden_dim, activation = "relu")(hidden1)
    outputs = layers.Dense(1, activation = "linear")(hidden2)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer = tf.optimizers.Adam(),
                  loss = "mean_absolute_error",
                  metrics = ["mean_absolute_error"])
    return model

Instantiate the model and show its architecture.

In [None]:
model = get_model(max_tokens)
model.summary()
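
The parameter counts reported by `model.summary()` can be checked by hand. The sketch below uses the sizes from this notebook (vocabulary of 10,000, 200-dimensional embeddings, sequences of 100 tokens, 16 hidden units) and compares the two ways of reducing the embedding output that appear in `get_model`: averaging over tokens versus flattening. The helper `dense_params` is hypothetical.

```python
max_tokens, embedding_dim = 10000, 200
max_length, hidden_dim = 100, 16

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit of a fully connected layer."""
    return n_inputs * n_units + n_units

# One embedding vector per vocabulary entry
embedding_params = max_tokens * embedding_dim

# Hidden layer after global average pooling: input size is embedding_dim
pooled_hidden = dense_params(embedding_dim, hidden_dim)

# Hidden layer after flattening: input size is max_length * embedding_dim
flattened_hidden = dense_params(max_length * embedding_dim, hidden_dim)

# Output layer with a single unit
output_params = dense_params(hidden_dim, 1)

print("embedding:", embedding_params)
print("total with pooling:", embedding_params + pooled_hidden + output_params)
print("total with flatten:", embedding_params + flattened_hidden + output_params)
```

In both cases the embedding layer dominates, but flattening makes the first dense layer about 100 times larger than pooling does.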

Fit model on training data and save best model to disk.

In [None]:
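callbacks = [keras.callbacks.ModelCheckpoint("embed.keras", save_best_only=True)]

# `batch_size` must not be passed to fit() when the input is a tf.data.Dataset,
# because the dataset itself defines the batches. We therefore batch explicitly.
# Each dataset element has a leading dimension of 1, which unbatch() removes first.
batch_size = 64

history = model.fit(vectorized_train_ds.unbatch().batch(batch_size).cache(),
          validation_data = vectorized_val_ds.unbatch().batch(batch_size).cache(),
          epochs = 3,
          callbacks = callbacks)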

Plot the learning process.

In [None]:
plt.plot(history.history['mean_absolute_error'])
plt.plot(history.history['val_mean_absolute_error'])
plt.title('Training and validation error')
plt.ylabel('Mean Absolute Error')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

# Make predictions

Load best model from training phase.

In [None]:
model = keras.models.load_model("embed.keras")

Make predictions on test set.

In [None]:
preds = model.predict(vectorized_test_ds)

Calculate the error on the test set (mean absolute error).

In [None]:
print(metrics.mean_absolute_error(test_corpus_target, preds))
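
For reference, the mean absolute error is simply the average of the absolute differences between true and predicted values. A minimal NumPy sketch with made-up numbers, shaped like the column vectors used above:

```python
import numpy as np

# Made-up scores and predictions, one row per wine
y_true = np.array([[88], [92], [85]])
y_pred = np.array([[87.5], [90.0], [86.0]])

# Average absolute deviation between prediction and truth
mae = np.mean(np.abs(y_true - y_pred))
print(mae)
```

Because the error is measured in the units of the target, a value of 2 means the predictions are off by two points on average.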

# Use pre-trained word embeddings

In the above model, we learn word embeddings on the fly as a by-product of the regression task. This results in word embeddings that are tuned for this regression task. Yet, these word embeddings have been learned from a relatively small dataset (here: 10,000 short online reviews).

*Question*: Can we improve our model by using pre-trained word embeddings?

Let's first create a dictionary to retrieve the integer index of a word in our vocabulary.

In [None]:
voc = text_vectorization.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [None]:
word_index["and"]

Next, we need a function to load the embeddings from a file and transform the contents into a matrix with as many rows as we have words in the vocabulary and as many columns as we have dimensions in the embeddings.

In [None]:
def create_embedding_matrix(filepath, word_index, embedding_dim):
    # create a matrix with the right dimensions and fill it with zeros
    vocab_size = len(word_index)
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    # open embeddings file, read it line by line, and
    # for every word in the vocabulary that is in the embeddings file
    # fill the matrix with the pre-trained embedding values
    # YOUR CODE GOES HERE

    # return embedding_matrix
    return embedding_matrix
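
One possible way to fill in the missing part is sketched below. It assumes that the embeddings file is plain text with one word per line, followed by its space-separated vector values, optionally preceded by a header line; the actual file format is not specified here, so treat this as an assumption. The function name `create_embedding_matrix_sketch` and the toy file are hypothetical.

```python
import os
import tempfile
import numpy as np

def create_embedding_matrix_sketch(filepath, word_index, embedding_dim):
    # Rows of words that are missing from the file stay all-zero
    embedding_matrix = np.zeros((len(word_index), embedding_dim))
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            # Skip header lines or malformed lines (assumed file format)
            if len(values) != embedding_dim:
                continue
            if word in word_index:
                embedding_matrix[word_index[word]] = [float(v) for v in values]
    return embedding_matrix

# Toy demonstration with a temporary file and a tiny vocabulary
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_vectors.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("2 3\nand 0.1 0.2 0.3\noak 0.4 0.5 0.6\n")
    toy_index = {"": 0, "[UNK]": 1, "and": 2, "wine": 3}
    toy_matrix = create_embedding_matrix_sketch(path, toy_index, embedding_dim=3)

print(toy_matrix)
```

In the toy example, only "and" appears both in the vocabulary and in the file, so only its row is filled; "oak" is ignored because it is not part of the vocabulary.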

Now we can load embeddings from a file.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
embedding_dim = 300
embedding_matrix = create_embedding_matrix(
    "/content/drive/MyDrive/colab_notebooks/AML4TA2022/Session_03/data/wine_300dim_10minwords_4context",
    word_index, embedding_dim)

In [None]:
embedding_matrix[word_index["and"]]

What share of the words in our vocabulary (i.e., rows of the matrix) received a non-zero pre-trained embedding?

In [None]:
np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1)) / len(word_index)
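
The nested `count_nonzero` call is easier to follow on a tiny example. The inner call counts the non-zero values in each row, and the outer call counts how many rows have at least one. The toy matrix below is made up.

```python
import numpy as np

toy = np.array([[0.0, 0.0, 0.0],    # word without a pre-trained vector
                [0.1, 0.0, 0.3],
                [0.0, 0.0, 0.0],    # word without a pre-trained vector
                [0.4, 0.5, 0.6]])

nonzero_per_row = np.count_nonzero(toy, axis=1)        # non-zero values per word
rows_with_vector = np.count_nonzero(nonzero_per_row)   # words with any non-zero value
coverage = rows_with_vector / toy.shape[0]             # share of covered words

# For comparison: the share of individual non-zero entries is a different number
entry_fraction = np.count_nonzero(toy) / toy.size

print(nonzero_per_row, rows_with_vector, coverage, entry_fraction)
```

The result is therefore the share of vocabulary words for which a pre-trained vector was found, not the share of individual matrix entries.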

Go back to the model definition above and initialize the embedding layer with `embedding_matrix` (note that the layer's `output_dim` must then equal `embedding_dim`)...