# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Week 4: From MLP over RNN to LSTM</font>

# <font color="#003660">Notebook 1: MLP with BOW</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... understand the idea behind simple neural networks, and <br>
        ... are able to transform text data so that it can be processed by a neural network.
    </font>
</div>
</center>
</p>

# What is a Neural Network?

A neural network takes an input vector of *m* variables *X* = (X1, X2, ..., Xm) and learns a nonlinear function *f(X)* to predict the response *Y*. The figure below shows a single neuron with its inputs, weights, aggregation function, activation function, and output.

<br><img width=512 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/perceptron.png"/><br>
<center>Source: Raschka (2021)</center>
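
The computation inside a single neuron can be sketched in a few lines of plain Python. This is only an illustration: the input values, weights, and the choice of a sigmoid activation are made up for the example and are not taken from the figure.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Aggregate the inputs as a weighted sum, then apply the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical example values
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0

output = neuron(x, w, b, sigmoid)
print(output)  # weighted sum is 0.0, so the sigmoid returns 0.5
```

With a sigmoid activation, this is the same computation a logistic regression performs when it makes a prediction.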

A neural network with just a single neuron is no more powerful than an ordinary linear or logistic regression. The high predictive power of modern neural networks comes from stacking multiple layers of neurons so that the outputs of one layer become the inputs of the next.

<br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/nn_w_layers_a_labels.png"/><br>
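
Stacking layers simply means feeding the outputs of one layer into the next. The following sketch passes one input vector through a hidden layer with two ReLU neurons and a linear output neuron. All weights are hypothetical values chosen so that the arithmetic is easy to follow by hand.

```python
def relu(z):
    """Rectified linear unit: negative values are set to zero."""
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One fully connected layer; each row of `weights` belongs to one neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

x = [1.0, 2.0]

# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output neuron
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, 0.0]
out_w = [[2.0, 1.0]]
out_b = [0.5]

hidden = dense(x, hidden_w, hidden_b, relu)        # [0.0, 1.5]
y_hat = dense(hidden, out_w, out_b, lambda z: z)   # [2.0]
print(hidden, y_hat)
```

The first hidden neuron computes 1 - 2 = -1, which ReLU clips to 0. This clipping is what makes the overall function nonlinear.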

The figure below illustrates how the weights of a neural network with multiple layers can be learned. Initially, all weights are random numbers. Predictions are computed by applying the transformations described above node by node, layer by layer. This is called the *forward pass*. Once a prediction is computed, the loss function compares the prediction to the true value of the target and calculates a loss score. The optimizer uses the loss score as a feedback signal to update the weights of the layers in a direction that lowers the loss score for the current example. The gradients needed for this update are computed with the *backpropagation* algorithm.

<br><img width=512 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/big_picture.png"/><br>
<center>Source: Chollet (2021)</center>
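
The loop of forward pass, loss score, and weight update can be sketched with a model that has only one weight. This is a simplified illustration under several assumptions: it uses a squared-error loss and plain gradient descent (not the MAE loss and Adam optimizer used later in this notebook), the toy data are made up, and the gradient is written out by hand instead of being derived by backpropagation.

```python
# Toy data generated by y = 3 * x, so the ideal weight is 3.0
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0               # initial weight (fixed instead of random, for reproducibility)
learning_rate = 0.05

losses = []
for epoch in range(50):
    epoch_loss = 0.0
    for x, y in data:
        y_hat = w * x                     # forward pass
        loss = (y_hat - y) ** 2           # loss score for this example
        grad = 2 * (y_hat - y) * x        # d(loss)/d(w), via the chain rule
        w -= learning_rate * grad         # step against the gradient
        epoch_loss += loss
    losses.append(epoch_loss / len(data))

print(round(w, 4), losses[0], losses[-1])
```

In a real network, backpropagation applies the same chain rule automatically to every weight in every layer.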

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `numpy` is a library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- `SQLAlchemy`, together with `pymysql`, lets us communicate with SQL databases.
- `getpass` provides a function to enter passwords without echoing them on the screen.
- `sklearn` is a free software machine learning library for the Python programming language.
- `tensorflow` is an end-to-end open source platform for machine learning, especially deep learning.
- `matplotlib` is a plotting library for the Python programming language and its numerical mathematics extension NumPy.



In [None]:
# Install packages
!pip install pymysql

In [None]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine, text
import getpass
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from sklearn import metrics
import matplotlib.pyplot as plt

Check whether TensorFlow can see a GPU. An empty list means that the code will run on the CPU.

In [None]:
tf.config.list_physical_devices('GPU')

# Load documents

As always, we load our data from a MySQL database. For security reasons, we don't store the database credentials here; please have a look at Panda to get them.

In [None]:
# Get credentials
user = input("Username: ")
passwd = getpass.getpass("Password: ")
server = input("Server: ")
db = input("Database: ")

# Create an engine instance (SQLAlchemy)
engine = create_engine("mysql+pymysql://{}:{}@{}/{}".format(user, passwd, server, db))

# Define SQL query
sql_query = "SELECT * FROM WineDataset"

# Query dataset (pandas); the context manager closes the connection afterwards
with engine.connect() as connection:
    corpus = pd.read_sql(text(sql_query), connection)

Display `shape` of the data.

In [None]:
corpus.shape

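Split the data into three sets: training, validation, and test. Of the documents not flagged as test set, the first 80,000 form the training pool and rows 80,000 to 99,999 form the validation set. To speed up the training process, we draw a random sample of 10,000 documents from the training pool.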

In [None]:
train_corpus = corpus[corpus["testset"] == 0]
val_corpus = train_corpus.iloc[80000:100000,]
train_corpus = train_corpus.iloc[0:80000,].sample(10000)
test_corpus = corpus[corpus["testset"] == 1]

For each dataset, store features and targets in separate variables.

In [None]:
train_corpus_features = train_corpus[["description"]]
train_corpus_target = train_corpus[["points"]]
val_corpus_features = val_corpus[["description"]]
val_corpus_target = val_corpus[["points"]]
test_corpus_features = test_corpus[["description"]]
test_corpus_target = test_corpus[["points"]]

Create [TensorFlow `Datasets`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) from the pandas DataFrames. The use of TensorFlow Datasets follows a common pattern:

1. Create a dataset from raw data (e.g., a pandas DataFrame, a CSV file, multiple text files).
2. Apply transformations to preprocess the data in the dataset (e.g., vectorize text data).
3. Iterate over the dataset and process its elements (e.g., for training or making predictions). Iteration happens in a streaming fashion, so the full dataset does not need to fit into memory.

Here, we use the `from_tensor_slices` constructor to create datasets from dataframes.
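
To get a feeling for the three-step pattern, it can be mimicked with plain Python generators. This is only an analogy for the streaming behaviour and not the `tf.data` API itself: the helper names `from_slices` and `map_dataset` as well as the example texts and scores are hypothetical.

```python
def from_slices(features, targets):
    """Step 1: create a 'dataset' that yields one (feature, target) pair at a time."""
    for x, y in zip(features, targets):
        yield x, y

def map_dataset(dataset, fn):
    """Step 2: lazily apply a transformation to every element."""
    for x, y in dataset:
        yield fn(x, y)

# Hypothetical raw data
texts = ["Ripe cherry aromas", "Crisp and dry"]
points = [90, 87]

ds = from_slices(texts, points)
ds = map_dataset(ds, lambda x, y: (x.lower().split(), y))

# Step 3: iterate; each element is produced only when the loop asks for it
results = []
for tokens, target in ds:
    results.append((tokens, target))

print(results)
```

Nothing is computed when `ds` is defined. The transformation runs element by element during iteration, which is why such a pipeline does not need the full dataset in memory.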

In [None]:
train_ds = tf.data.Dataset.from_tensor_slices((tf.cast(train_corpus_features.values, tf.string),
                                               tf.cast(train_corpus_target.values, tf.int32)))

val_ds = tf.data.Dataset.from_tensor_slices((tf.cast(val_corpus_features.values, tf.string),
                                             tf.cast(val_corpus_target.values, tf.int32)))

test_ds = tf.data.Dataset.from_tensor_slices((tf.cast(test_corpus_features.values, tf.string),
                                              tf.cast(test_corpus_target.values, tf.int32)))

Display some stats and examples from the created datasets. Because `train_ds` is usually processed in a streaming fashion, we need to use a loop to access its contents.

In [None]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("===")
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

# Vectorize documents

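We will now use [TensorFlow's `TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer to transform raw texts into numerical vectors (e.g., frequency counts or tf-idf weights). With `output_mode = "count"`, each document becomes a vector of word counts over the vocabulary, i.e., a bag-of-words representation.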

In [None]:
max_tokens = 10000
text_vectorization = TextVectorization(
    max_tokens = max_tokens,
    output_mode = "count"
)

Some aspects of the `TextVectorization` layer (e.g., the size and contents of the vocabulary) have to be fit on the training data. This is done with the `adapt` method, which only takes the features (x) of the dataset, so we first drop the targets.

In [None]:
train_ds_features_only = train_ds.map(lambda x, y: x)
text_vectorization.adapt(train_ds_features_only)

Show some of the vocabulary that our vectorizer knows after being fit to the training data. When we reuse this vectorizer on new data (e.g., test set), only the words in this vocabulary will be considered.

In [None]:
text_vectorization.get_vocabulary()[0:20]
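
What fitting a vocabulary and producing count vectors means can be sketched in plain Python. This is a simplified stand-in, not the actual implementation of `TextVectorization`: it assumes lowercasing, punctuation stripping, whitespace splitting, and an `[UNK]` token at index 0 for out-of-vocabulary words. The helper names and example texts are hypothetical.

```python
from collections import Counter
import string

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def fit_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens; index 0 is reserved for unknown words."""
    counts = Counter(token for text in texts for token in tokenize(text))
    most_common = [token for token, _ in counts.most_common(max_tokens - 1)]
    return ["[UNK]"] + most_common

def count_vector(text, vocabulary):
    """Count how often each vocabulary entry occurs in the text."""
    index = {token: i for i, token in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        vector[index.get(token, 0)] += 1   # unknown words fall into slot 0
    return vector

train_texts = ["Dry red wine.", "Sweet white wine, very sweet wine."]
vocab = fit_vocabulary(train_texts, max_tokens=3)
vec = count_vector("Sweet, sweet sparkling rose wine", vocab)

print(vocab)  # ['[UNK]', 'wine', 'sweet']
print(vec)    # [2, 1, 2]
```

The words "sparkling" and "rose" never occur in the training texts, so both are counted in the `[UNK]` slot. Word order is lost entirely, which is why this representation is called a bag of words.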

Next, we apply our `text_vectorization` layer to all three datasets. This corresponds to step 2 mentioned above.

In [None]:
vectorized_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls = 4)

vectorized_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls = 4)

vectorized_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls = 4)

Show results.

In [None]:
for inputs, targets in vectorized_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("===")
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

# Train model

We are now ready to specify a neural network and feed it the vectorized datasets. For convenience, we define a custom function `get_model`, which defines the network architecture, creates a model from it, and compiles this model (by defining, e.g., an optimizer and a loss function).

In [None]:
def get_model(max_tokens=10000, hidden_dim=32):
    inputs = keras.Input(shape = (max_tokens,))
    hidden = layers.Dense(hidden_dim, activation = "relu")(inputs)
    outputs = layers.Dense(1, activation = "linear")(hidden)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer = tf.optimizers.Adam(),
                  loss = "mean_absolute_error",
                  metrics = ["mean_absolute_error"])
    return model

Instantiate the model and show its architecture.

In [None]:
model = get_model(max_tokens)
model.summary()
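
The parameter counts reported by `model.summary()` can be checked by hand: a dense layer has one weight per input per neuron plus one bias per neuron. The helper `dense_params` below is a hypothetical name introduced only for this calculation.

```python
def dense_params(n_inputs, n_units):
    """Weights (one per input per unit) plus one bias per unit."""
    return n_inputs * n_units + n_units

max_tokens = 10000   # size of the bag-of-words input vector
hidden_dim = 32      # number of neurons in the hidden layer

hidden_params = dense_params(max_tokens, hidden_dim)   # 10000 * 32 + 32
output_params = dense_params(hidden_dim, 1)            # 32 * 1 + 1
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 320032 33 320065
```

Almost all parameters sit in the first layer, because every vocabulary entry is connected to every hidden neuron.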

Fit model on training data and save best model to disk.

In [None]:
callbacks = [keras.callbacks.ModelCheckpoint("bow.tf", save_best_only = True)]

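# Each dataset element holds a single document, so we re-batch into batches of 64.
# (`batch_size` must not be passed to `fit` when the input is a `tf.data.Dataset`.)
batch_size = 64
batched_train_ds = vectorized_train_ds.unbatch().batch(batch_size).cache()
batched_val_ds = vectorized_val_ds.unbatch().batch(batch_size).cache()

history = model.fit(batched_train_ds,
                    validation_data = batched_val_ds,
                    epochs = 3,
                    callbacks = callbacks)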
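
The idea behind `save_best_only = True` can be sketched as follows: after every epoch, the monitored validation loss is compared with the best value seen so far, and the model is only saved when it improves. The validation losses below are made-up numbers, and the list `saved_epochs` merely stands in for writing the model to disk.

```python
# Hypothetical validation losses for three epochs
val_losses = [2.1, 1.8, 1.9]

best = float("inf")
saved_epochs = []

for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best:
        best = val_loss
        saved_epochs.append(epoch)   # stands in for saving the model

print(saved_epochs, best)  # [1, 2] 1.8
```

In this example the third epoch is worse than the second, so the checkpoint on disk would still hold the model from epoch 2.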

Plot the learning process.

In [None]:
plt.plot(history.history['mean_absolute_error'])
plt.plot(history.history['val_mean_absolute_error'])
plt.title('Training and validation error')
plt.ylabel('Mean Absolute Error')
plt.xlabel('Epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

# Make predictions

Load best model from training phase.

In [None]:
model = keras.models.load_model("bow.tf")

Make predictions on test set.

In [None]:
preds = model.predict(vectorized_test_ds)

Calculate the mean absolute error on the test set.

In [None]:
print(metrics.mean_absolute_error(test_corpus_target, preds))
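
For reference, the mean absolute error is simply the average absolute difference between true and predicted values. The sketch below recomputes it by hand on three made-up wine scores; it is meant to illustrate the formula, not to replace `sklearn.metrics.mean_absolute_error`.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all examples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical true scores and predictions
y_true = [90, 87, 92]
y_pred = [88.0, 88.0, 95.0]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # (2 + 1 + 3) / 3 = 2.0
```

An MAE of 2.0 would mean that the predictions are off by two points on average.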