# Deeplearning - Nasir Hussain - 2021/08/28

# 11 Deep learning for text

## 11.1 Natural language processing: The bird’s eye view

- NLP uses machine learning and large datasets to give computers the ability, not to understand language, but to ingest a piece of language as input and return something useful, such as:
  - text classification
    - What’s the topic of this text?
  - content filtering
    - Does this text contain abuse?
  - sentiment analysis
    - Does this text sound positive or negative?
  - language modeling
    - What should be the next word in this incomplete sentence?
  - translation
    - How would you say this in German?
  - summarization
    - How would you summarize this article in one paragraph?

## 11.2 Preparing text data

- Deep learning models only process numeric tensors
- Vectorizing text is the process of transforming text into numeric tensors.
  - First, you standardize the text to make it easier to process, such as by converting it to lowercase or removing punctuation.
  - You split the text into units (called tokens), such as characters, words, or groups of words. This is called tokenization.
  - You index all tokens present in the data.
  - You convert each token into a numerical vector.

### 11.2.1 Text standardization

- Text standardization is a basic form of feature engineering that aims to erase encoding differences that you don’t want your model to have to deal with. Typical steps:
  - convert to lowercase and remove punctuation characters
  - convert special characters to a standard form
  - convert variations of a term into a single shared representation
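
The first two steps above can be sketched in plain Python. This is only an illustration: the function name `standardize` and the choice to strip accents with Unicode NFKD decomposition are assumptions, not something the notes prescribe.

```python
import string
import unicodedata

def standardize(text):
    """Lowercase, strip accents, and remove ASCII punctuation."""
    text = text.lower()
    # Decompose accented characters into base letter + combining mark,
    # then drop the combining marks so that "é" becomes "e".
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove ASCII punctuation characters.
    return "".join(ch for ch in text if ch not in string.punctuation)

print(standardize("Hello, World!"))   # hello world
print(standardize("Café déjà vu!"))   # cafe deja vu
```

Mapping variations of a term to one representation (stemming) is harder and is left out of this sketch.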

### 11.2.2 Text splitting (tokenization)

- Tokenization breaks text into units (tokens) to be vectorized.
- Methods of tokenization
  - Word-level tokenization
    - Where tokens (units) are space-separated (or punctuation-separated) substrings.
    - A variant of this is to further split words into subwords when applicable
      - for instance, treating “staring” as “star+ing” or “called” as “call+ed.”
  - N-gram tokenization
    - Where tokens are groups of N consecutive words.
    - a way to artificially inject a small amount of local word order information into the model 
      - For instance, “the cat” or “he was” would be 2-gram tokens (also called bigrams).
  - Character-level tokenization
    - Where each character is its own token.
      - used in specialized contexts, like text generation or speech recognition.
- Two kinds of text-processing models
  - sequence models
    - those that care about word order
    - use word-level tokenization
  - bag-of-words models
    - those that treat input words as a set, discarding their original order
    - use N-gram tokenization
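
As a rough illustration of the difference between word-level and character-level tokenization, here is a minimal sketch. The helper names `word_tokenize` and `char_tokenize` are hypothetical, and the input is assumed to be already standardized.

```python
def word_tokenize(text):
    # Word-level: tokens are whitespace-separated substrings.
    return text.split()

def char_tokenize(text):
    # Character-level: every character (spaces included) is its own token.
    return list(text)

print(word_tokenize("the cat sat"))  # ['the', 'cat', 'sat']
print(char_tokenize("cat"))          # ['c', 'a', 't']
```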

#### Understanding N-grams and bag-of-words


- N-grams are groups of N (or fewer) consecutive words that you can extract from a sentence

  ```
  the cat sat on the mat.

  bag of 2-grams
  {"the", "the cat", "cat", "cat sat", "sat",
  "sat on", "on", "on the", "the mat", "mat"}
  
  bag of 3-grams
  {"the", "the cat", "cat", "cat sat", "the cat sat",
  "sat", "sat on", "on", "cat sat on", "on the",
  "sat on the", "the mat", "mat", "on the mat"}
  ```
- One-dimensional convnets, recurrent neural networks, and Transformers are capable of learning representations for groups of words and characters without being explicitly told about the existence of such groups, by looking at continuous word or character sequences.
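
The bags shown above can be reproduced with a few lines of Python. This is a sketch under simple assumptions: the helper name `ngram_bag` is hypothetical, and standardization is reduced to lowercasing and dropping periods.

```python
def ngram_bag(text, n):
    """Return the set of all word groups of size 1..n found in the text."""
    tokens = text.lower().replace(".", "").split()
    bag = set()
    for size in range(1, n + 1):
        # Slide a window of `size` consecutive words over the sentence.
        for start in range(len(tokens) - size + 1):
            bag.add(" ".join(tokens[start:start + size]))
    return bag

sentence = "the cat sat on the mat."
print(sorted(ngram_bag(sentence, 2)))
print(len(ngram_bag(sentence, 2)))  # 10
print(len(ngram_bag(sentence, 3)))  # 14
```

Because the result is a set, the repeated word "the" appears only once and the original word order is lost, which is exactly the bag-of-words trade-off.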

### 11.2.3 Vocabulary indexing

- Encode each token into a numerical representation. There are two ways to do this:
  - in a stateless way
    - for example, by hashing each token into a fixed binary vector
  - by building an index of all terms found in the training data (the vocabulary)
    - assign a unique integer to each entry in the vocabulary.

  ```python
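  # Build the vocabulary: map each distinct token to a unique integer index.
  # In practice, the vocabulary is often restricted to the 20,000 or 30,000
  # most common words; this simple loop keeps every token it sees.
  # `dataset`, `standardize`, and `tokenize` are assumed to be defined elsewhere.
  vocabulary = {}
  for text in dataset:
    text = standardize(text)
    tokens = tokenize(text)
    for token in tokens:
      if token not in vocabulary:
        vocabulary[token] = len(vocabulary)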
  ```
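
Restricting the vocabulary to the most common words could look like the sketch below. The helper name `build_vocabulary` and the toy corpus are assumptions made for illustration; special tokens are deliberately left out here.

```python
from collections import Counter

def build_vocabulary(token_lists, max_tokens):
    """Index only the `max_tokens` most frequent tokens."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    most_common = counts.most_common(max_tokens)
    return {token: index for index, (token, _) in enumerate(most_common)}

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog"]]
vocabulary = build_vocabulary(corpus, max_tokens=2)
print(vocabulary)  # {'the': 0, 'cat': 1}
```

Rare tokens such as "sat" or "dog" are simply not indexed, so they carry no information for the model.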

- vector encoding

  ```python
  import numpy as np

  def one_hot_encode_token(token):
    # `vocabulary` is the token-to-index dict built above.
    vector = np.zeros((len(vocabulary),))
    token_index = vocabulary[token]
    vector[token_index] = 1
    return vector
  ```
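
For comparison, the stateless alternative (hashing each token into a fixed-size binary vector) needs no vocabulary at all. The sketch below is one possible way to do it; the function name `hash_encode_token`, the use of MD5, and the bucket count of 16 are all assumptions chosen for illustration.

```python
import hashlib

def hash_encode_token(token, num_buckets=16):
    """Map a token to a one-hot vector of fixed size without a vocabulary."""
    # hashlib gives the same digest on every run, unlike the built-in hash(),
    # which is randomized per process for strings.
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    index = int(digest, 16) % num_buckets
    vector = [0] * num_buckets
    vector[index] = 1
    return vector

print(hash_encode_token("cat"))
```

The price of statelessness is collisions: two different tokens can land in the same bucket, and the model cannot tell them apart.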

- When doing so, always reserve an “out of vocabulary” index (abbreviated as OOV index)
  - a catch-all for any token that wasn’t in the index.
- When decoding a sequence of integers back into words, you’ll replace index 1 with something like “[UNK]”.
- Special tokens
  - OOV token (index 1)
  - mask token (index 0), used to pad sequences to the same length

  ```
  [5, 7, 124, 4, 89] and [8, 34, 21]

          ||
          ||
          \/

  [[5, 7, 124, 4, 89],
   [8, 34, 21, 0, 0]]
  ```
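
The padding step shown above can be sketched as follows. The helper name `pad_batch` is hypothetical; it pads every sequence with the mask index 0 up to the length of the longest one.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad integer sequences so they all share the same length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[5, 7, 124, 4, 89], [8, 34, 21]])
print(batch)  # [[5, 7, 124, 4, 89], [8, 34, 21, 0, 0]]
```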

### 11.2.4 Using the TextVectorization layer

- Python way

In [None]:
# Pure Python implementation of the steps above: standardize, tokenize, index, encode, decode

import string
 
class Vectorizer:
  
  def standardize(self, text):
    text = text.lower()
    return "".join(char for char in text if char not in string.punctuation)
 
  def tokenize(self, text):
    text = self.standardize(text)
    return text.split()

  def make_vocabulary(self, dataset):
    self.vocabulary = {"": 0, "[UNK]": 1}
    for text in dataset:
      text = self.standardize(text)
      tokens = self.tokenize(text)
      for token in tokens:
        if token not in self.vocabulary:
          self.vocabulary[token] = len(self.vocabulary)
    self.inverse_vocabulary = dict((v, k) for k, v in self.vocabulary.items())

  def encode(self, text):
    text = self.standardize(text)
    tokens = self.tokenize(text)
    return [self.vocabulary.get(token, 1) for token in tokens]

  def decode(self, int_sequence):
    return " ".join(self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

vectorizer = Vectorizer()

dataset = [ 
  "I write, erase, rewrite",
  "Erase again, and then",
  "A poppy blooms.",
]

vectorizer.make_vocabulary(dataset)

In [None]:
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)

[2, 3, 5, 7, 1, 5, 6]


In [None]:
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again


- TensorFlow way
  - By default, `TextVectorization` uses the following settings, which can be changed:
    - “convert to lowercase and remove punctuation” for text standardization
    - “split on whitespace” for tokenization.

In [None]:
# vectorization with tf using the default settings
from tensorflow.keras.layers import TextVectorization
text_vectorization = TextVectorization(
    output_mode="int",
)

In [None]:
# vectorization with tf using custom settings
import re 
import string 
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

def custom_standardization_fn(string_tensor):
  lowercase_string = tf.strings.lower(string_tensor)
  return tf.strings.regex_replace(lowercase_string, f"[{re.escape(string.punctuation)}]", "")
 
def custom_split_fn(string_tensor):
  return tf.strings.split(string_tensor)

text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn,
)

In [None]:
# To index the vocabulary of a text corpus, call the layer's adapt() method
# with either a Dataset object that yields strings
# or just a list of Python strings:
dataset = [
  "I write, erase, rewrite",
  "Erase again, and then",
  "A poppy blooms.",
]

text_vectorization.adapt(dataset)

In [None]:
# Listing 11.1 Displaying the vocabulary
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']

In [None]:
# encode and then decode an example sentence
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print("Actual Sentence")
print(test_sentence)
print("Encoded Sentence")
print(encoded_sentence)
print("Decoded Sentence")
inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

Actual Sentence
I write, rewrite, and still rewrite again
Encoded Sentence
tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)
Decoded Sentence
i write rewrite and [UNK] rewrite again


#### Using the TextVectorization layer in a tf.data pipeline or as part of a model


- two ways to use TextVectorization layer 
  - put it in the tf.data pipeline

    ```python
    int_sequence_dataset = string_dataset.map(
      text_vectorization,
      num_parallel_calls=4) 
    ```

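    - The vectorization then happens asynchronously with the rest of the model: it runs on the CPU as part of the data pipeline, in parallel with training.
  - make it part of the model

    ```python
    text_input = keras.Input(shape=(), dtype="string")
    vectorized_text = text_vectorization(text_input)
    embedded_input = keras.layers.Embedding(...)(vectorized_text)
    output = ...
    model = keras.Model(text_input, output) 
    ```
    - The vectorization then happens synchronously with the rest of the model: each training step has to wait for the layer's output.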

- TextVectorization layer enables you to include text preprocessing right into your model, making it easier to deploy

## 11.3 Two approaches for representing groups of words: Sets and sequences

- sequence models
  - RNNs and Transformers
  - those that care about word order
  - use word-level tokenization
- bag-of-words models
  - those that treat input words as a set, discarding their original order
  - use N-gram tokenization

### 11.3.1 Preparing the IMDB movie reviews data

In [None]:
# download and unpack the data
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz



- 25,000 text files for training and another 25,000 for testing
  - 12,500 positive reviews in each set
  - 12,500 negative reviews in each set

In [None]:
# remove the unneeded folder of unlabeled reviews
!rm -r aclImdb/train/unsup

In [None]:
# inspect data
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

In [None]:
# prepare a validation set 
  # by setting apart 20% of the training text files in a new directory, aclImdb/val
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

In [None]:
# create text datasets
train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [None]:
# Listing 11.2 Displaying the shapes and dtypes of the first batch
for inputs, targets in train_ds:
  print("inputs.shape:", inputs.shape)
  print("inputs.dtype:", inputs.dtype)
  print("targets.shape:", targets.shape)
  print("targets.dtype:", targets.dtype)
  print("inputs[0]:", inputs[0])
  print("targets[0]:", targets[0])
  break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b"A movie about dealing with the problems with growing up and being true to yourself, Blue Juice is mind candy for those who like surfing and Cornwall. Sean Pertwee is the real star of this film, while the more famous Catherine Zeta Jones plays his girlfriend and Ewan Mcgregor plays his drug addicted pal.<br /><br />For those who don't like surfing or Cornwall in the slightest, you'll find that it takes a long time before the movie even hints at being interesting. The beginning is slow and spends too much time on long shots of only slightly interesting landscapes. Plus too many main characters leads to most of them being one dimensional. The plot is an interesting idea but because of the shallow characters you have no idea why they act in the situations they're put in.<br /><br />Only Ewan, Sean and Catherine's characters make this a film worth being on videotap

### 11.3.2 Processing words as a set: The bag-of-words approach

#### SINGLE WORDS (UNIGRAMS) WITH BINARY ENCODING

In [None]:
# Listing 11.3 Preprocessing our datasets with a TextVectorization layer
from tensorflow.keras.layers import TextVectorization

text_vectorization = TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot",
)

text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)

binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=1)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=1)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=1)


In [None]:
# Listing 11.4 Inspecting the output of our binary unigram dataset
for inputs, targets in binary_1gram_train_ds:
  print("inputs.shape:", inputs.shape)
  print("inputs.dtype:", inputs.dtype)
  print("targets.shape:", targets.shape)
  print("targets.dtype:", targets.dtype)
  print("inputs[0]:", inputs[0])
  print("targets[0]:", targets[0])
  break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)
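
To make the multi-hot output above more concrete, here is a plain-Python sketch of what such an encoding does. The helper name `multi_hot_encode`, the toy vocabulary, and the choice of index 0 for unknown tokens are assumptions made for illustration, not a description of the layer's internals.

```python
def multi_hot_encode(tokens, vocabulary, oov_index=0):
    """Return a vector with 1.0 at the index of every token that is present."""
    vector = [0.0] * len(vocabulary)
    for token in tokens:
        # Tokens missing from the vocabulary share the OOV slot.
        vector[vocabulary.get(token, oov_index)] = 1.0
    return vector

vocabulary = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, "mat": 4}
print(multi_hot_encode(["the", "cat", "sat", "the", "dog"], vocabulary))
# [1.0, 1.0, 1.0, 1.0, 0.0]
```

The vector records only presence: "the" occurs twice but still yields a single 1.0, and word order is gone.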


In [None]:
# Listing 11.5 Our model-building utility
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

In [None]:
# Listing 11.6 Training and testing the binary unigram model
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.884
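
The parameter counts in the summary above can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A small sketch (the helper name `dense_params` is hypothetical):

```python
def dense_params(input_dim, units):
    # weights (input_dim * units) plus biases (units)
    return input_dim * units + units

hidden = dense_params(20000, 16)
output = dense_params(16, 1)
print(hidden, output, hidden + output)  # 320016 17 320033
```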


#### BIGRAMS WITH BINARY ENCODING

In [None]:
# Listing 11.7 Configuring the TextVectorization layer to return bigrams
text_vectorization = TextVectorization(
  ngrams=2,
  max_tokens=20000,
  output_mode="multi_hot",
)

In [None]:
# Listing 11.8 Training and testing the binary bigram model
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=1)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=1)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=1)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.899


#### BIGRAMS WITH TF-IDF ENCODING

In [None]:
# Listing 11.9 Configuring the TextVectorization layer to return token counts
text_vectorization = TextVectorization(
 ngrams=2,
 max_tokens=20000,
 output_mode="count"
)

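# Raw counts have a problem: words such as "the", "a", and "is" occur often
# in every text and can drown out more informative features.
#   - Normalization mitigates this issue.
#   - The normalization should be divide-only, because subtracting a mean
#     would destroy the sparsity of the count vectors.
#   - TF-IDF ("term frequency, inverse document frequency") is a common choice.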

##### Understanding TF-IDF normalization

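- “term frequency”: how many times the term appears in the current document.
- “document frequency”: an estimate of how often the term comes up across the whole dataset.
- TF-IDF weights a term by dividing its term frequency by a measure of its document frequency, so terms that are common everywhere count for less.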

In [None]:
# TF-IDF: term frequency divided by the log of the term's total count across the dataset (plus 1)
import math
def tfidf(term, document, dataset):
  term_freq = document.count(term)
  doc_freq = math.log(sum(doc.count(term) for doc in dataset) + 1)
  return term_freq / doc_freq
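
A small worked example may help show the effect of this weighting. The sketch below repeats the function so that it runs on its own; representing documents as lists of tokens and the toy dataset itself are assumptions made for illustration.

```python
import math

def tfidf(term, document, dataset):
    term_freq = document.count(term)
    doc_freq = math.log(sum(doc.count(term) for doc in dataset) + 1)
    return term_freq / doc_freq

# Toy dataset: each document is a list of tokens.
dataset = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "thin"],
    ["the", "acting", "saved", "the", "movie"],
]
document = dataset[0]

# "the" appears 4 times overall, "great" only once.
print(f'the:   {tfidf("the", document, dataset):.3f}')    # the:   0.621
print(f'great: {tfidf("great", document, dataset):.3f}')  # great: 1.443
```

Both words occur once in the first document, but the rarer word "great" gets the higher weight. A term that appears nowhere in the dataset would make the denominator log(1) = 0, so real code would need to guard against that case.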

In [None]:
# Listing 11.10 Configuring TextVectorization to return TF-IDF-weighted outputs
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
)

In [None]:
# Listing 11.11 Training and testing the TF-IDF bigram model
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=1)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=1)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=1)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]
model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.876


##### Exporting a model that processes raw strings

In [None]:
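# One input sample is one string
inputs = keras.Input(shape=(1,), dtype="string")
# Apply the text preprocessing
processed_inputs = text_vectorization(inputs)
# Apply the previously trained model
outputs = model(processed_inputs)
# Instantiate the end-to-end model
inference_model = keras.Model(inputs, outputs)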

In [None]:
import tensorflow as tf
raw_text_data = tf.convert_to_tensor([
 ["That was an excellent movie, I loved it."],
])
predictions = inference_model(raw_text_data) 
print(f"{float(predictions[0] * 100):.2f} percent positive")

87.69 percent positive
