# 17 Deep networks for natural language processing

Part of ["Introduction to Data Science" course](https://github.com/kupav/data-sc-intro) by Pavel Kuptsov, [kupav@mail.ru](mailto:kupav@mail.ru)

Recommended reading for this section:

1. A. Kedia and M. Rasu (2020) Hands-On Python Natural Language Processing. Packt Publishing

The following Python modules will be required. Make sure that you have them installed.
- `matplotlib`
- `requests`
- `tensorflow`
- `re`
- `sklearn`
- `json`
- `zipfile`
- `nltk`
- `spacy`
- `gensim`
- `numpy`

This lecture will closely follow the book \[1\].

## Lesson 1

### Required initialization and helper functions

Before we begin, some initialization is required.

In [None]:
import tensorflow as tf

# This initialization code is required due to an error 
# "NotFoundError: No algorithm worked"
# when using Conv2D
# Probably due to problems with CUDA 11.
# Remove this when fixed
# https://github.com/tensorflow/tensorflow/issues/43174
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv1D, GlobalMaxPooling1D, Dropout, LSTM, Embedding
from tensorflow.keras.losses import CategoricalCrossentropy, BinaryCrossentropy

The function `plot_hist` below plots learning curves: loss and accuracy vs. epoch of training.

It accepts a list of training histories `hist_list` computed for different networks, together with their names `hist_names`, and plots one row of images per network.

In [None]:
import matplotlib.pyplot as plt

def plot_hist_one(hist, name, axs):
    """Plot one loss and accuracy"""
    epochs = len(hist.history['loss'])
    xs = list(range(epochs))
    
    ax = axs[0]
    ax.plot(xs, hist.history['loss'], label='loss')
    ax.plot(xs, hist.history['val_loss'], label='val_loss')
    ax.set_ylabel('loss')
    ax.set_yscale('log')

    ax = axs[1]
    ax.plot(xs, hist.history['accuracy'], label='accuracy');
    ax.plot(xs, hist.history['val_accuracy'], label='val_accuracy');
    ax.set_ylabel('accuracy')

    for ax in axs:
        ax.grid()
        ax.set_xlabel('epoch')
        ax.legend(title=name)    

def plot_hist(hist_list, hist_names):
    """Plot loss and accuracy for many networks"""
    N = len(hist_list)
    fig, axs = plt.subplots(nrows=N, ncols=2, figsize=(10, 3*N))
    if N == 1:
        axs = [axs]
    for hist, name, ax in zip(hist_list, hist_names, axs):
        plot_hist_one(hist, name, ax)
    plt.tight_layout()

### Dense network for document classification

In this section we consider a supervised learning model that classifies documents.

We will train it to classify questions.

The training dataset is a corpus of questions, each labeled with one of six categories.

The goal is a model that accepts a question as its input and predicts its category at the output.

With other datasets the same approach can be used to train a spam filter or to create a model for sentiment analysis.

We need a function for downloading a dataset from the course repository.

In [None]:
import requests

def load_txt_dataset(file_name):
    """Downloads a text dataset from the repo and returns it as a list of lines."""
    base_url = "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"
    web_data = requests.get(base_url + file_name)
    assert web_data.status_code == 200
    return web_data.text.splitlines()

The dataset we use is taken from https://cogcomp.seas.upenn.edu/Data/QA/QC/ and copied to the course repository for convenience.

Let us download the dataset and see what it contains.

In [None]:
raw_corpus = load_txt_dataset("illinois_univ_lab_quest.txt")
for doc in raw_corpus[:10]:
    print(doc)
print(f"\nTotal: {len(raw_corpus)} records")

We have a list of records, each of which contains a question and its category.

The list of the categories is
- ABBREVIATION
- ENTITY
- DESCRIPTION
- HUMAN
- LOCATION
- NUMERIC

Moreover, each category is supplied with a finer subcategory after a colon, e.g. "DESC:manner" or "ABBR:exp".

We will omit them and preserve only the coarse classes written in capital letters.

We need to parse each line to separate the questions from their categories.

It can be done with a regular expression with the pattern `([A-Z]+):[a-z]+\s(.+)`

Before processing the whole corpus let us see how it works.

It has two blocks wrapped in parentheses: `([A-Z]+)` and `(.+)`. They are called groups and mark what we want to capture:

- `([A-Z]+)`: the first group matches one or more capital letters at the very beginning of a record. It captures the category name.
- `:[a-z]+\s`: a colon followed by one or more lowercase letters followed by a whitespace character. It matches the fine category description that has to be omitted.
- `(.+)`: the second group captures any characters up to the end of the line, i.e. the question itself.
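
As a quick check, the pattern can be tried on a single record. The record below is written by hand in the format of the dataset and is only an illustration, not a line quoted from the corpus.

```python
import re

# Same pattern as in the text: coarse category, fine category, question.
pattern = re.compile(r"([A-Z]+):[a-z]+\s(.+)")

# Hypothetical record written by hand in the dataset's format.
record = "LOC:city What is the capital of France ?"

match = pattern.fullmatch(record)
category, question = match.groups()  # the two captured groups
print(category)   # LOC
print(question)   # What is the capital of France ?
```

The fine category `city` is consumed by the middle part of the pattern and does not appear in the result.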

Here is how it works on the first records of the corpus.

In [None]:
import re

rge = re.compile(r"([A-Z]+):[a-z]+\s(.+)")

for doc in raw_corpus[:10]:
    print(rge.findall(doc))

Now we are ready to extract lists of questions and category names from our corpus.

In [None]:
import re

# Coarse category, skipped fine category, then the question itself
rge = re.compile(r"([A-Z]+):[a-z]+\s(.+)")

quests = []
categs = []
for doc in raw_corpus:
    result = rge.findall(doc)
    categs.append(result[0][0])
    quests.append(result[0][1])

for c, q in zip(categs[:10], quests[:10]):    
    print(c, q)

Let us look at the questions and their categories.

For convenience we define a dictionary that maps the short category names to their full forms, and print several questions from each category.

In [None]:
full_categ_names = {"ABBR": "ABBREVIATION", "ENTY": "ENTITY", "DESC": "DESCRIPTION", 
                    "HUM": "HUMAN",  "LOC": "LOCATION", "NUM": "NUMERIC"}

unique_categs = list(set(categs))
N = 10
for extracted_categ in unique_categs:
    tmp = [q for q, c in zip(quests, categs) if c==extracted_categ]
    print(full_categ_names[extracted_categ])
    for q in tmp[:N]:
        print(f"\t{q}")

As we discussed in previous lectures, predicting categories requires converting them into one-hot form.

There are many ways to do it. Both tensorflow and numpy provide the means for it.
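
For comparison with the sklearn approach, here is a minimal NumPy-only sketch. The label list `sample_categs` is made up for the illustration; the trick is that row `i` of an identity matrix is exactly the one-hot vector of class `i`.

```python
import numpy as np

# Hypothetical category labels for five samples.
sample_categs = ["LOC", "NUM", "HUM", "LOC", "NUM"]

# np.unique returns the sorted unique labels and, with return_inverse=True,
# the index of each sample's label in that sorted array.
names, indices = np.unique(sample_categs, return_inverse=True)

# Row i of the identity matrix is the one-hot vector of class i.
one_hot = np.eye(len(names))[indices]
print(names)
print(one_hot)
```

Unlike `OneHotEncoder`, this sketch does not remember the categories for later use on new data, so the array `names` has to be kept alongside the model.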

Let us see how it can be done using sklearn.

This library offers a class `OneHotEncoder` that converts a series of categorical features into one-hot vectors.

Here is an illustration:

In [None]:
from sklearn.preprocessing import OneHotEncoder

test = [['Male', 'Young'], ['Female', 'Young'], ['Female', 'Adult']]

enc = OneHotEncoder()
enc.fit(test)
print(enc.categories_)
print(enc.transform(test).toarray())

If there is only one categorical feature, each value still must be wrapped into a list like this:

In [None]:
from sklearn.preprocessing import OneHotEncoder

test = [['Baby'], ['Young'], ['Adult']]  # not just ['Baby', 'Young', 'Adult']

enc = OneHotEncoder()
enc.fit(test)
print(enc.categories_)
print(enc.transform(test).toarray())

We are ready to create one-hot vectors for our categories:

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
categs1 = [[c] for c in categs]
encoder.fit(categs1)
print(encoder.categories_)
num_categs = encoder.categories_[0].shape[0]
print(num_categs)

labels = encoder.transform(categs1).toarray()
print(labels)

Now the questions have to be prepared: cleaning, tokenization and normalization have to be done.

We take a copy of the function for cleaning and tokenization from the previous lecture.

In [None]:
import nltk

def tokenize_and_clean(sentence):
    """Tokenize sentence and clean it.
    """
    raw_tokens = nltk.word_tokenize(sentence)
    tokens = []
    for tok in raw_tokens:
        t1 = tok[1:] if tok[0] == "'" else tok  # we do not want to remove tokens like "'ve" 
        if t1.isalpha():
            tokens.append(tok.lower())
    return tokens

Now we create a TF-IDF representation of the questions.

We copy the code from a previous lecture with small modifications: cleaning has been added and the nltk tokenizer is used.

When removing stopwords we have to preserve the "wh-" words: we are dealing with questions, and these words are meaningful.
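
The effect of preserving the "wh-" words can be sketched with plain Python sets. The stopword list `toy_stop` below is a tiny made-up stand-in for the nltk list, used only to keep the example self-contained.

```python
# Hypothetical, tiny stopword list; the real one comes from nltk.
toy_stop = {"the", "is", "a", "of", "who", "what", "where"}
wh_words = {"who", "what", "when", "why", "how", "which", "where", "whom"}

# Set difference removes the "wh-" words from the stopword list.
stop_kept = toy_stop - wh_words

tokens = ["who", "is", "the", "author", "of", "hamlet"]
filtered = [tok for tok in tokens if tok not in stop_kept]
print(filtered)  # ['who', 'author', 'hamlet']
```

Without the set difference the token "who" would be dropped and the question would lose the word that points to the HUMAN category.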

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')

stemmer = nltk.stem.snowball.SnowballStemmer(language = 'english')

wh_words = set(['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom'])
stop = set(nltk.corpus.stopwords.words('english')) - wh_words

def stemmed_tokenizer(doc):
    raw_tokens = tokenize_and_clean(doc)
    tokens = [tok for tok in raw_tokens if tok not in stop]
    stem_tokens = [stemmer.stem(tok) for tok in tokens]
    return stem_tokens

vectorizer = TfidfVectorizer(tokenizer=stemmed_tokenizer)
tfidf_quests = vectorizer.fit_transform(quests);

The result of the vectorization is a matrix of TF-IDF vectors, one row for one question. 

In [None]:
print(tfidf_quests.toarray().shape)
print(labels.shape)

Split the whole dataset into testing and training parts.

In [None]:
from sklearn.model_selection import train_test_split

p_test = 0.1
n_test = round(p_test * labels.shape[0])

X_train, X_test, y_train, y_test = train_test_split(tfidf_quests.toarray(), labels, random_state=0, 
                                                    test_size=n_test, shuffle=True)

print(f"train size {len(y_train)}")
print(f" test size {len(y_test)}")

Now everything is ready to create and train a model.

It will be a simple two-layer model.

However, preliminary tests have shown that it is prone to strong overfitting.

We fight it using two tools.

The first is dropout, which we already know. Recall that it temporarily switches off randomly chosen neurons during training and thus forces the model to generalize from the data instead of merely memorizing it.
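
The idea of dropout can be sketched in NumPy as follows. This is an illustration of the principle, not the Keras implementation; the helper name `dropout` and the toy input are made up. The surviving activations are rescaled by `1 / (1 - rate)` so that their expected value stays the same, which is also what the Keras `Dropout` layer does during training.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of activations and rescale the rest."""
    keep_prob = 1.0 - rate
    # True where the neuron output is kept, False where it is dropped.
    mask = rng.random(activations.shape) < keep_prob
    # Rescaling keeps the expected value of each activation unchanged.
    return activations * mask / keep_prob

acts = np.ones((4, 5))            # hypothetical layer outputs
dropped = dropout(acts, rate=0.5, rng=rng)
print(dropped)
```

Each entry of the result is either 0 (dropped) or 2 (kept and rescaled). At prediction time no neurons are dropped and the activations pass through unchanged.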

Another remedy is regularization of neuron weights: observe the parameter `kernel_regularizer=tf.keras.regularizers.l2(...)` of the first dense layer.

It adds the sum of squared neuron weights, multiplied by the coefficient `l2=0.01`, directly to the loss function.

This coefficient controls the penalty applied to weights that are too large in magnitude.

As a result, during training the optimization algorithm seeks a minimum of the loss function while keeping the neuron weights as small as possible.

This reduces the information capacity of the network and thus reduces overfitting.
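
A small NumPy sketch shows how the penalty behaves. The weight matrices and the value `data_loss` are made up for the illustration; only the formula, the coefficient times the sum of squared weights, follows the description above.

```python
import numpy as np

def l2_penalty(weights, l2=0.01):
    """Penalty added to the loss: l2 * sum of squared weights."""
    return l2 * np.sum(np.square(weights))

small = np.array([[0.1, -0.2], [0.05, 0.0]])  # hypothetical small weights
large = 10 * small                            # same weights, ten times larger

data_loss = 0.5  # hypothetical value of the cross-entropy term
print(data_loss + l2_penalty(small))
print(data_loss + l2_penalty(large))
```

Making the weights ten times larger increases the penalty a hundredfold, so the optimizer is pushed toward solutions with small weights unless large ones reduce the data term substantially.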

In [None]:
input_shape = X_train.shape[1:]

model = Sequential([
    Dense(256, input_shape=input_shape, activation='relu', 
          kernel_regularizer=tf.keras.regularizers.l2(l2=0.01)),
    Dropout(0.8),
    Dense(num_categs)
])

model.compile(optimizer='adam',  
             loss=CategoricalCrossentropy(from_logits=True),
             metrics=['accuracy'])

model.summary()

In [None]:
hist = model.fit(X_train, y_train, epochs=50, validation_split=0.2, verbose=2)

In [None]:
plot_hist([hist], ["questions classification"])

In [None]:
# evaluate() returns the loss first, then the metrics
loss, acc = model.evaluate(X_test, y_test)
print(f"acc={acc}, loss={loss}")

We have not trained our model for long, so its accuracy is at the level of 80\%. More epochs of training are required to attain higher performance.

Let us check how the model works.

Here are the steps required to ask the model to predict a question category:

In [None]:
q1 = "Who Framed Roger Rabbit?"

vec = vectorizer.transform([q1]).toarray()
y_pred = model.predict(vec)
print(y_pred)
cat_pred = tf.argmax(y_pred, axis=1).numpy()  # index of the largest logit
print(cat_pred)
cat_pred = cat_pred[0]
encoder.categories_[0][cat_pred]

We can wrap all this into a function and test it with various questions.

In [None]:
def classify(quest, model, vectorizer, encoder):
    vec = vectorizer.transform([quest]).toarray()  # vectorize the question passed in
    y_pred = model.predict(vec)
    cat_pred = tf.argmax(y_pred, axis=1).numpy()[0]
    return encoder.categories_[0][cat_pred]

In [None]:
q1 = "Who Framed Roger Rabbit?"
print(classify(q1, model, vectorizer, encoder))

In [None]:
q1 = "How far is the sun?"
print(classify(q1, model, vectorizer, encoder))

In [None]:
q1 = "MNIST, is this an abbreviation or what?"
print(classify(q1, model, vectorizer, encoder))

In [None]:
q1 = "I wonder where I put my glasses"
print(classify(q1, model, vectorizer, encoder))

Our model is very simple. What can be done to improve it?

First of all more accurate cleaning is required. Probably dots must be preserved in tokens representing domain names like ".com". 

Since abbreviations are essential in this corpus, it is better to avoid their conversion to lowercase.

Replacing stemming with lemmatization will probably also improve the performance.

The TF-IDF vector model is not the best one known today. Embedding models such as `Doc2Vec` or `Sent2Vec` are expected to work better.

Finally, one can also experiment with the network structure: add more layers, change layer sizes, increase or decrease the regularization coefficient and the dropout rate.

### Convolutional network for document classification

In the previous lecture we considered convolutional networks for image classification.

Similarly, convolution can be applied to texts.

Each sentence (or document) of a corpus is represented as a sequence of word vectors stacked one after another.

The convolution kernel has a width equal to the length of a word vector and a height that is typically an odd number, so that it captures the current vector together with two or more of its neighbors.

After the convolution the new vector size equals the number of filters (kernels) applied. The sentence length is unchanged if zero padding is used.

This is called one-dimensional convolution.

One-dimensional pooling is also applied after the convolution.

Pooling is applied along the sentence, separately for each vector element.

Pooling reduces the length of the sentence. The reduction factor depends on the pooling window size, which is typically 2.

As usual, there are average and max pooling.

Global pooling has a window size equal to the whole sentence length. Its result is a single vector.

![convol_1d.svg](attachment:convol_1d.svg)
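
The shapes involved can be checked with a minimal NumPy sketch of one-dimensional convolution followed by global max pooling. It is an illustration only: the helper `conv1d_same` is made up, it omits the bias and the activation function, and the random "sentence" stands in for real word vectors.

```python
import numpy as np

def conv1d_same(x, kernels):
    """1D convolution with zero ('same') padding.

    x:       (length, vector_size) stacked word vectors
    kernels: (kernel_size, vector_size, filters), kernel_size odd
    returns: (length, filters)
    """
    kernel_size = kernels.shape[0]
    pad = kernel_size // 2
    # Add zero vectors before and after the sentence.
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], kernels.shape[2]))
    for i in range(x.shape[0]):
        window = padded[i:i + kernel_size]  # (kernel_size, vector_size)
        # Sum over the window and the vector elements, one value per filter.
        out[i] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return out

rng = np.random.default_rng(seed=0)
sentence = rng.normal(size=(7, 4))      # 7 words, 4-dimensional vectors
kernels = rng.normal(size=(3, 4, 8))    # kernel height 3, 8 filters

features = conv1d_same(sentence, kernels)
pooled = features.max(axis=0)           # global max pooling over the sentence
print(features.shape)  # (7, 8)
print(pooled.shape)    # (8,)
```

The sentence length 7 is preserved by the padding, the vector size changes from 4 to the number of filters 8, and global pooling collapses the whole sentence into a single vector of size 8.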

In this section we consider a text classification problem.

Notice that the same problem was considered in the previous section.

The difference is that now we will see how convolutions can be used for it.

We will build a sarcasm detector.

The dataset is taken from Kaggle at https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection
and copied to the course repository for convenience.

This dataset is collected from two news websites: TheOnion, which aims at producing sarcastic versions of current events, and HuffPost, which publishes real (and non-sarcastic) news.

Only the headlines of the news are gathered.

Let us first download the zipped file and open it as a Python list.

The file is in JSON format. This is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects.

To read and parse it we use the Python library `json`. Each line of the file holds one record, and the outcome is a Python dictionary for each data record.
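
Parsing of such a file can be sketched without any download. The two records below are invented for the illustration; they only imitate the layout of the dataset, with one JSON object per line.

```python
import json

# Hypothetical records imitating the layout of the dataset file:
# one JSON object per line.
text = (
    '{"is_sarcastic": 1, "headline": "example headline one", '
    '"article_link": "https://example.com/1"}\n'
    '{"is_sarcastic": 0, "headline": "example headline two", '
    '"article_link": "https://example.com/2"}\n'
)

# Each line is parsed separately into a Python dictionary.
records = [json.loads(line) for line in text.splitlines()]
print(records[0]["headline"])
print(type(records[0]))
```

Parsing line by line is needed here because the file as a whole is a sequence of JSON objects rather than a single JSON document.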

In [None]:
import json
import requests
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile

def load_ziptxt_dataset(file_name):
    """Downloads zipped dataset from repo and returns it as a list of records."""
    base_url = "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"
    web_data = requests.get(base_url + file_name)
    assert web_data.status_code == 200

    # unzip the content
    zf = ZipFile(BytesIO(web_data.content))
    
    # zipped file name
    zipped_name = zf.namelist()[0]
    print(f"Download {file_name}, unzip {zipped_name}")
    
    # Open unpacked file
    with zf.open(zipped_name, 'r') as file:
        # TextIOWrapper(file) converts byte strings to plain strings
        data = []
        for record in TextIOWrapper(file):
            data.append(json.loads(record))
    return data

In [None]:
raw_data = load_ziptxt_dataset("Sarcasm_Headlines_Dataset_v2.json.zip")
print(f"Number of records {len(raw_data)}")

print()
for record in raw_data[:5]:
    print(record)

We see that there are three fields: "is_sarcastic", "headline" and "article_link".

We will use only the first two, omitting the article links.

In [None]:
alllabels = []
documents = []

for record in raw_data:
    alllabels.append(int(record['is_sarcastic']))
    documents.append(record['headline'])

Check the balance: whether we have the same number of sarcastic and non-sarcastic news.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(alllabels, bins=2);

Now we need to preprocess the documents.

Since we will use Word2Vec vectorization, stemming is not appropriate: Word2Vec models are built for full words, not for their stems.

So we perform lemmatization with the help of `spacy`.

After lemmatization, cleaning is performed using the pattern `[a-z]+`.

This is a very aggressive cleaning that removes every token that does not consist of lowercase letters only.

Notice that contracted verbs with an apostrophe like "'ve" are also removed.

This is acceptable because at the next step we remove stopwords according to the nltk list, where such contracted forms are present.

In [None]:
import re
import spacy
import nltk
nltk.download('stopwords')

nlp = spacy.load('en_core_web_sm')
stop = set(nltk.corpus.stopwords.words('english'))
rge = re.compile("[a-z]+")

def preproc(sentence, nlp, stop, rge):
    nlp_tokens = nlp(sentence)
    tokens = [token.lemma_ for token in nlp_tokens]
    tokens = [tok for tok in tokens if rge.fullmatch(tok) is not None]
    tokens = [tok for tok in tokens if tok not in stop]
    return tokens

This code applies the preprocessor to the documents. Sometimes it removes the whole sentence, sometimes only one or two tokens are left.

Such short documents are ignored, since sarcasm is a sufficiently complicated construct that requires many words to be expressed.

In [None]:
# It takes some time
min_length = 3
tokens, labels = [], []
for lab, doc in zip(alllabels, documents):
    toks = preproc(doc, nlp, stop, rge)
    if len(toks) >= min_length:
        tokens.append(toks)
        labels.append(lab)
    
print(tokens[:10])

Vectorization will be done using a pretrained word-embedding model provided by `gensim`.

Below there are two versions.

The call that loads the better and larger model "word2vec-google-news-300" is commented out to avoid heavy traffic, so the smaller model "glove-wiki-gigaword-50" is used.

In [None]:
import gensim.downloader as api

vectorizer = api.load("glove-wiki-gigaword-50")  # 66 MB
# vectorizer = api.load("word2vec-google-news-300")  # 1.7 GB !

The function below computes the vector representation of sentences.

Since sentences have different lengths, we take some value `max_length` as the largest sentence length taken into account.

Longer sentences are truncated and shorter ones are padded with zero vectors.

Some words may not be found in the vocabulary of the embedding model. In such cases we put a stub instead: a random vector that indicates that there is some word at this position.

In [None]:
import numpy as np
rng = np.random.default_rng(seed=0)
stub = rng.uniform(-1,1, size=vectorizer.vector_size)

def vectorize_data(data, vectorizer, max_length, stub):
    vector_size = vectorizer.vector_size
    vectors = []
    padding_vector = [0.0] * vector_size
    
    for i, data_point in enumerate(data):
        data_point_vectors = []
        count = 0
        
        for token in data_point:
            if count >= max_length:
                break
            if vectorizer.get_index(token, default=-1) >= 0:
                data_point_vectors.append(vectorizer[token])
            else:
                data_point_vectors.append(stub)
            count = count + 1
        
        if len(data_point_vectors) < max_length:
            to_fill = max_length - len(data_point_vectors)
            for _ in range(to_fill):
                data_point_vectors.append(padding_vector)
        
        vectors.append(data_point_vectors)
        
    return np.array(vectors)

The maximum sentence length that we take into account is computed as the 90th percentile of the distribution of sentence lengths.

It means that 90\% of the sentences are no longer than this.

The reason we do not take the longest sentence is that there are a few very long sentences and we do not want to enlarge the model because of them.
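
The effect of a few outliers is easy to see on a toy example. The lengths in `toy_sizes` are made up, and `np.percentile` is used here instead of indexing a sorted list; it interpolates between neighboring values, so the result may differ slightly from the index-based computation.

```python
import numpy as np

# Hypothetical sentence lengths with two very long outliers.
toy_sizes = [3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 40, 60]

longest = max(toy_sizes)
p90 = int(np.percentile(toy_sizes, 90))          # 90th percentile, rounded down
covered = np.mean(np.array(toy_sizes) <= p90)    # fraction of sentences that fit
print(longest, p90, covered)
```

Using the percentile instead of the maximum makes the input of the model several times shorter at the cost of truncating only the two longest sentences.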

In [None]:
sizes = [len(sent) for sent in tokens]
print(max(sizes))
max_length = sorted(sizes)[9*len(tokens)//10]
print(max_length)

This is the vectorization:

In [None]:
tokvecs = vectorize_data(tokens, vectorizer, max_length, stub)
print(tokvecs.shape)

Now we split the dataset into training and testing parts.

In [None]:
from sklearn.model_selection import train_test_split

p_test = 0.3
n_test = round(p_test * len(labels))

X_train, X_test, y_train, y_test = train_test_split(tokvecs, np.array(labels), random_state=0, 
                                                    test_size=n_test, shuffle=True)

print(f"train shape {X_train.shape}")
print(f"test shape {X_test.shape}")

And finally the model.

We use here one convolution layer followed by global pooling, two hidden dense layers and an output layer.

Overfitting is reduced via dropout and weight regularization.

Since there are two classes to predict, the loss function is binary cross-entropy.
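
When the loss is computed from logits, the network outputs a raw number rather than a probability. The following NumPy sketch shows how such outputs relate to probabilities and class labels. The helper names and the sample logits are made up; this illustrates the formula and is not the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    """Convert a logit into a probability of class 1."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logits(y_true, logits):
    """Mean binary cross-entropy computed from raw logits."""
    p = sigmoid(logits)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
logits = np.array([2.0, -1.5, -0.5, 0.3])  # hypothetical model outputs

probs = sigmoid(logits)
pred = (logits > 0).astype(int)   # same as probs > 0.5
print(probs)
print(pred)
print(bce_from_logits(y_true, logits))
```

A positive logit corresponds to a probability above 0.5 and hence to class 1, which in this dataset is the label of sarcastic headlines.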

In [None]:
input_shape = X_train.shape[1:]
l2_penalty = 0.01
dropout_rate = 0.5

model = Sequential([
    Conv1D(input_shape=input_shape,
           filters=8, kernel_size=3, padding='same', activation='relu',
           kernel_regularizer=tf.keras.regularizers.l2(l2_penalty)),
    GlobalMaxPooling1D(),
    Dense(10, activation='relu',
          kernel_regularizer=tf.keras.regularizers.l2(l2_penalty)),
    Dropout(dropout_rate),
    Dense(5, activation='relu',
          kernel_regularizer=tf.keras.regularizers.l2(l2_penalty)),
    Dropout(dropout_rate),
    Dense(1)
])

model.compile(optimizer='adam',  
             loss=BinaryCrossentropy(from_logits=True),
             metrics=['accuracy'])

model.summary()

In [None]:
hist = model.fit(X_train, y_train, epochs=50, validation_split=0.2, verbose=2)

In [None]:
plot_hist([hist], ["sarcasm"])

In [None]:
# evaluate() returns the loss first, then the metrics
loss, acc = model.evaluate(X_test, y_test)
print(f"acc={acc}, loss={loss}")

Let us now apply the model to arbitrary expressions.

First consider the steps, printing intermediate results.

In [None]:
ss = "You look so good, really"

tok = preproc(ss, nlp, stop, rge)
print(tok)
vec = vectorize_data([tok], vectorizer, max_length, stub)
print(vec)
y_pred = model.predict(vec)
print(y_pred)
# The model outputs a logit: positive values correspond to the label 1 (sarcastic)
print("sarcastic" if y_pred[0][0] > 0 else "non sarcastic")

The steps can be wrapped into a function.

In [None]:
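def classify(ss):
    tok = preproc(ss, nlp, stop, rge)
    # Check the tokens of this sentence, not the leftover global `toks`
    if len(tok) < min_length:
        raise ValueError("too few tokens left after preprocessing")
    vec = vectorize_data([tok], vectorizer, max_length, stub)
    y_pred = model.predict(vec)
    # Positive logit corresponds to the label 1 (sarcastic)
    return "sarcastic" if y_pred[0][0] > 0 else "non sarcastic"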

Now the examples

In [None]:
ss = "Marry had a little lamb, little lamb, little lamb"
print(classify(ss))

In [None]:
ss = "That's just what I needed today!"
print(classify(ss))

In [None]:
ss = "Well, what a surprise."
print(classify(ss))

In [None]:
ss = "Great! I hope I'm a waitress at the Cheesecake Factory for my whole life!"
print(classify(ss))

In [None]:
ss = "NaNs are treated as missing values: disregarded in fit, and maintained in transform."
print(classify(ss))

In [None]:
ss = "Fender releases new hybrid gas-electric guitar"
print(classify(ss))

In [None]:
ss = "You can now message the president on facebook"
print(classify(ss))

We see that the predictions often do not correspond to our intuition.

The model obviously has to be improved.

Detecting sarcasm probably requires all the words in a sentence, so we may try to make the cleaning and stopword removal less aggressive.

It is reasonable to try keeping the stopwords altogether.

Lemmatization can also be a problem.

Recall that stemming and lemmatization are needed to reduce the vocabulary size and thus the vector size in Bag of Words and TF-IDF models.

When embedding models are used, this problem is absent.

Moreover, embedding models are built on raw texts without lemmatization.

Using sentences in their original form can probably help to make the detector more adequate.

Of course, the model architecture can also be adjusted: more convolution layers can be added, and the sizes of their kernels as well as the sizes of the dense layers can be tuned.

### Exercises

1\. Following the recommendations provided at the end of the section "Dense network for document classification", try to improve the model from that section.

2\. Following the recommendations provided at the end of the section "Convolutional network for document classification", try to improve the model from that section.
