# Deeplearning - Anees Ahmad - 2020/05/03

# 11 Deep learning for text

## 11.1 Natural language processing: The bird’s eye view

- NLP is about using machine learning and large datasets to give computers the ability not to understand language, but to ingest a piece of language as input and return something useful. Typical tasks include:
  - text classification
    - What’s the topic of this text?
  - content filtering
    - Does this text contain abuse?
  - sentiment analysis
    - Does this text sound positive or negative?
  - language modeling
    - What should be the next word in this incomplete sentence?
  - translation
    - How would you say this in German?
  - summarization
    - How would you summarize this article in one paragraph?

## 11.2 Preparing text data

- Deep learning models only process numeric tensors
- Vectorizing text is the process of transforming text into numeric tensors.
  - First, you standardize the text to make it easier to process, such as by converting it to lowercase or removing punctuation.
  - You split the text into units (called tokens), such as characters, words, or groups of words. This is called tokenization.
  - You index all tokens present in the data.
  - You convert each token into a numerical vector.

### 11.2.1 Text standardization

- Text standardization is a basic form of feature engineering that aims to erase encoding differences that you don’t want your model to have to deal with. Common steps:
  - convert to lowercase and remove punctuation characters
  - convert special characters to a standard form (for example, “é” to “e”)
  - convert variations of a term into a single shared representation (stemming)
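
The standardization steps above can be sketched in plain Python. This is only a minimal illustration, not the method of any particular library: the function name `standardize_text` and the tiny `VARIANTS` lookup table are assumptions made up for this example, and a real pipeline would use a proper stemmer instead of a hand-written table.

```python
import string
import unicodedata

# Hypothetical lookup table standing in for a real stemmer.
VARIANTS = {"staring": "stare", "stared": "stare"}

def standardize_text(text):
    # 1. Convert to lowercase.
    text = text.lower()
    # 2. Convert special characters to a standard form: decompose accented
    #    characters (NFKD) and drop the combining accent marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # 3. Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 4. Map known variations of a term to one shared representation.
    words = [VARIANTS.get(word, word) for word in text.split()]
    return " ".join(words)

print(standardize_text("Sunset came; I was STARING at the M\u00e9xico sky."))
```

Note that `string.punctuation` only covers ASCII punctuation, so typographic quotes or dashes would need extra handling.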

### 11.2.2 Text splitting (tokenization)

- break text into units to be vectorized (tokens)
- Methods of tokenization
  - Word-level tokenization
    - Where tokens (units) are space-separated (or punctuation-separated) substrings.
    - A variant of this is to further split words into subwords when applicable
      - for instance, treating “staring” as “star+ing” or “called” as “call+ed.”
  - N-gram tokenization
    - Where tokens are groups of N consecutive words.
    - a way to artificially inject a small amount of local word order information into the model 
      - For instance, “the cat” or “he was” would be 2-gram tokens (also called bigrams).
  - Character-level tokenization
    - Where each character is its own token.
      - used in specialized contexts, like text generation or speech recognition.
- text-processing models
  - sequence models
    - those that care about word order
    - use word-level tokenization
  - bag-of-words models
    - those that treat input words as a set, discarding their original order
    - use N-gram tokenization
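
As a rough sketch of the difference between word-level and character-level tokenization, the snippet below uses only the standard library. The helper names `word_tokenize` and `char_tokenize` are made up for this example, and the regular expression is one simple assumption about what counts as a word (it drops punctuation and would split "isn't" into two tokens).

```python
import re

def word_tokenize(text):
    # Word-level: tokens are runs of letters/digits, separated by
    # whitespace or punctuation.
    return re.findall(r"\w+", text)

def char_tokenize(text):
    # Character-level: every character is its own token.
    return list(text)

sentence = "the cat sat on the mat."
print(word_tokenize(sentence))
print(char_tokenize("cat"))
```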

#### Understanding N-grams and bag-of-words


- N-grams are groups of N (or fewer) consecutive words that you can extract from a sentence

  ```
  the cat sat on the mat.

  bag of 2-grams
  {"the", "the cat", "cat", "cat sat", "sat",
  "sat on", "on", "on the", "the mat", "mat"}

  bag of 3-grams
  {"the", "the cat", "cat", "cat sat", "the cat sat",
  "sat", "sat on", "on", "cat sat on", "on the",
  "sat on the", "the mat", "mat", "on the mat"}
  ```
- One-dimensional convnets, recurrent neural networks, and Transformers are capable of learning representations for groups of words and characters without being explicitly told about the existence of such groups, by looking at continuous word or character sequences.
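
The bags shown above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `bag_of_ngrams` is an assumption for this example, and the tokenizer is deliberately naive (lowercase, strip the final period, split on whitespace).

```python
def bag_of_ngrams(tokens, n):
    # Collect every group of 1..n consecutive tokens into a set,
    # which discards both duplicates and ordering.
    bag = set()
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            bag.add(" ".join(tokens[start:start + size]))
    return bag

tokens = "the cat sat on the mat.".lower().replace(".", "").split()
two_grams = bag_of_ngrams(tokens, 2)
three_grams = bag_of_ngrams(tokens, 3)
print(sorted(two_grams))
print(len(two_grams), len(three_grams))
```

Because the result is a set, the repeated word "the" appears only once, which is exactly the information a bag-of-words model throws away.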

### 11.2.3 Vocabulary indexing

- Encode each token into a numerical representation. There are two ways to do this:
  - in a stateless way
    - for example, by hashing each token into a fixed binary vector
  - by building an index of all terms found in the training data (the vocabulary)
    - assign a unique integer to each entry in the vocabulary.

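  ```python
  # Note: it's common to restrict the vocabulary to only the
  # top 20,000 or 30,000 most common words.
  # Assumes `dataset`, `standardize`, and `tokenize` are already defined.
  vocabulary = {}
  for text in dataset:
    text = standardize(text)
    tokens = tokenize(text)
    for token in tokens:
      if token not in vocabulary:
        # Give each new token the next free integer.
        vocabulary[token] = len(vocabulary)
  ```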
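
For comparison, the stateless alternative (hashing) might look like the sketch below. It is an illustration under stated assumptions, not a prescribed method: the function name `hash_encode_token`, the choice of MD5, and the default of 16 bins are all made up for this example. MD5 is used only because it gives the same result on every run, whereas Python's built-in `hash()` for strings is salted per process.

```python
import hashlib

def hash_encode_token(token, num_bins=16):
    # Hash the token to a stable integer, then fold it into a fixed range.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    index = int.from_bytes(digest, "big") % num_bins
    # Binary vector with a single 1 at the hashed position.
    vector = [0] * num_bins
    vector[index] = 1
    return vector

print(hash_encode_token("cat"))
```

No vocabulary has to be stored, but two different tokens can land in the same bin (a hash collision), and the encoding cannot be decoded back into words.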

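- Vector encoding: use the integer index to build a one-hot vector for each token

  ```python
  import numpy as np

  def one_hot_encode_token(token):
    # One slot per vocabulary entry; only the token's own slot is set to 1.
    vector = np.zeros((len(vocabulary),))
    token_index = vocabulary[token]
    vector[token_index] = 1
    return vector
  ```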

- When building the index, always reserve an “out of vocabulary” index (abbreviated as OOV index)
  - a catch-all for any token that wasn’t in the index; it is usually index 1.
- When decoding a sequence of integers back into words, you’ll replace 1 with something like “[UNK]” (the OOV token).
- Special tokens
  - OOV token (index 1)
  - mask token (index 0), used to pad shorter sequences so that all sequences in a batch have the same length

  ```
  [5, 7, 124, 4, 89] and [8, 34, 21]

          ||
          ||
          \/

  [[5, 7, 124, 4, 89],
   [8, 34, 21, 0, 0]]
  ```
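
The pieces above (a size-capped vocabulary, the OOV index, and mask padding) can be combined in a small sketch. All names here (`build_vocabulary`, `encode`, `pad`, `max_tokens`) are assumptions for this example; in this sketch `max_tokens` counts the two reserved entries as part of the vocabulary size.

```python
from collections import Counter

MASK_INDEX = 0  # reserved for padding
OOV_INDEX = 1   # reserved for out-of-vocabulary tokens

def build_vocabulary(token_lists, max_tokens):
    # Keep only the most frequent tokens, after the two reserved entries.
    counts = Counter(token for tokens in token_lists for token in tokens)
    vocabulary = {"": MASK_INDEX, "[UNK]": OOV_INDEX}
    for token, _ in counts.most_common(max_tokens - len(vocabulary)):
        vocabulary[token] = len(vocabulary)
    return vocabulary

def encode(tokens, vocabulary):
    # Unknown tokens fall back to the OOV index.
    return [vocabulary.get(token, OOV_INDEX) for token in tokens]

def pad(sequences):
    # Right-pad every sequence with the mask index up to the longest length.
    max_length = max(len(seq) for seq in sequences)
    return [seq + [MASK_INDEX] * (max_length - len(seq)) for seq in sequences]

corpus = [["the", "cat", "sat"], ["the", "cat"], ["the", "dog"]]
vocabulary = build_vocabulary(corpus, max_tokens=4)
batch = pad([
    encode(["the", "cat", "sat"], vocabulary),
    encode(["a", "cat"], vocabulary),
])
print(vocabulary)
print(batch)
```

With `max_tokens=4`, only "the" and "cat" make it into the vocabulary, so "sat" and "a" are both encoded as the OOV index.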

### 11.2.4 Using the TextVectorization layer

- Python way

```python
# Python way to perform all of the above tasks

import string

class Vectorizer:

  def standardize(self, text):
    text = text.lower()
    return "".join(char for char in text if char not in string.punctuation)

  def tokenize(self, text):
    text = self.standardize(text)
    return text.split()

  def make_vocabulary(self, dataset):
    self.vocabulary = {"": 0, "[UNK]": 1}
    for text in dataset:
      text = self.standardize(text)
      tokens = self.tokenize(text)
      for token in tokens:
        if token not in self.vocabulary:
          self.vocabulary[token] = len(self.vocabulary)
    self.inverse_vocabulary = dict((v, k) for k, v in self.vocabulary.items())

  def encode(self, text):
    text = self.standardize(text)
    tokens = self.tokenize(text)
    return [self.vocabulary.get(token, 1) for token in tokens]

  def decode(self, int_sequence):
    return " ".join(self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

vectorizer = Vectorizer()

dataset = [
  "I write, erase, rewrite",
  "Erase again, and then",
  "A poppy blooms.",
]

vectorizer.make_vocabulary(dataset)
```

```python
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)
```

```
[2, 3, 5, 7, 1, 5, 6]
```

```python
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)
```

```
i write rewrite and [UNK] rewrite again
```


- tf way
  - By default, `TextVectorization` uses the following settings, which can be overridden:
    - “convert to lowercase and remove punctuation” for text standardization
    - “split on whitespace” for tokenization

```python
# Vectorization with tf, using the default settings
from tensorflow.keras.layers import TextVectorization
text_vectorization = TextVectorization(
    output_mode="int",
)
```

```python
# Vectorization with tf, using custom settings
import re
import string
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

def custom_standardization_fn(string_tensor):
  # Lowercase, then strip every punctuation character.
  lowercase_string = tf.strings.lower(string_tensor)
  return tf.strings.regex_replace(lowercase_string, f"[{re.escape(string.punctuation)}]", "")

def custom_split_fn(string_tensor):
  # Split on whitespace.
  return tf.strings.split(string_tensor)

text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn,
)
```

```python
# To index the vocabulary of a text corpus, call the adapt() method of the layer
# with either a Dataset object that yields strings
# or just a list of Python strings:
dataset = [
  "I write, erase, rewrite",
  "Erase again, and then",
  "A poppy blooms.",
]

text_vectorization.adapt(dataset)
```

```python
# Listing 11.1 Displaying the vocabulary
text_vectorization.get_vocabulary()
```

```
['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']
```

```python
# Encode and then decode an example sentence
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print("Actual Sentence")
print(test_sentence)
print("Encoded Sentence")
print(encoded_sentence)
print("Decoded Sentence")
inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)
```

```
Actual Sentence
I write, rewrite, and still rewrite again
Encoded Sentence
tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)
Decoded Sentence
i write rewrite and [UNK] rewrite again
```


#### Using the TextVectorization layer in a tf.data pipeline or as part of a model


- two ways to use TextVectorization layer 
  - put it in the tf.data pipeline

    ```python
    int_sequence_dataset = string_dataset.map(
      text_vectorization,
      num_parallel_calls=4) 
    ```

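    - preprocessing happens asynchronously: the CPU vectorizes the next batch while the model (for example, on a GPU) runs on the current one
  - make it part of the model

    ```python
    from tensorflow import keras

    text_input = keras.Input(shape=(), dtype="string")
    vectorized_text = text_vectorization(text_input)
    embedded_input = keras.layers.Embedding(...)(vectorized_text)
    output = ...
    model = keras.Model(text_input, output)
    ```
    - vectorization happens synchronously with the rest of the model: at each step, the rest of the model has to wait for the output of the `TextVectorization` layer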

- TextVectorization layer enables you to include text preprocessing right into your model, making it easier to deploy