# Deep learning for text

Chapter contents:
* Preprocessing text data for machine learning applications
* Bag-of-words approaches and sequence-modeling approaches for text processing
* The Transformer architecture
* Sequence-to-sequence learning

## Natural-language processing: The bird's eye view

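NLP tasks:

* "What is the topic of this text?" (**text classification**)
* "Does this text contain profanity?" (**content filtering**)
* "Does this text sound positive or negative?" (**sentiment analysis**)
* "What should be the next word in this incomplete sentence?" (**language modeling**)
* "How would you say this in German?" (**translation**)
* "How would you summarize this article in one paragraph?" (**summarization**)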


Short history:
* 1990-2010: decision trees, logistic regression, and similar models (feature engineering)
* 2015-2017: LSTM (the first easy-to-use open source implementation appeared in Keras)
* 2017-present: Transformer

## Preparing text data

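Deep learning models are built from differentiable functions, so they can only process numeric tensors. They cannot take raw text as input: the text must first be converted into numeric tensors.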

Text $\rightarrow$ numeric tensor: Vectorization

**Vectorization**

* Standardization: normalize the text (convert to lowercase, remove punctuation, etc.)
* Tokenization: split the text into units (tokens), such as characters, words, or groups of words
* Indexing: tokens $\rightarrow$ numeric vectors


<img src="https://drek4537l1klr.cloudfront.net/chollet2/Figures/11-01.png" width="400"><p style="text-align:center">Figure 11.1 From raw text to vectors.</p>

### Text standardization


* “sunset came. i was staring at the Mexico sky. Isnt nature splendid??”
* “Sunset came; I stared at the México sky. Isn’t nature splendid?”

becomes

* “sunset came i was staring at the mexico sky isnt nature splendid”
* “sunset came i stared at the méxico sky isnt nature splendid”
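
The standardization illustrated above can be sketched in a few lines of plain Python. This is only a minimal sketch, assuming that "punctuation" means any character in a Unicode punctuation category (so the curly apostrophe in “Isn’t” is dropped as well). The function name `standardize` is illustrative, and accented characters such as "é" are deliberately left untouched.

```python
import unicodedata

def standardize(text):
    """Lowercase the text and drop punctuation characters."""
    text = text.lower()
    # Unicode categories starting with "P" cover ASCII punctuation
    # as well as typographic marks such as the curly apostrophe.
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # Collapse any repeated whitespace left behind by removed characters.
    return " ".join(kept.split())

print(standardize("Sunset came; I stared at the México sky. Isn’t nature splendid?"))
# sunset came i stared at the méxico sky isnt nature splendid
```

Mapping "méxico" to "mexico" would need an extra accent-stripping step, which is a separate design decision.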

### Text splitting (tokenization)

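* Word-level tokenization: tokens are substrings separated by spaces or punctuation
* N-gram tokenization: tokens are groups of N consecutive words (for example, "the cat" or "he was" are 2-gram tokens, also called bigrams)
* Character-level tokenization: each character is its own token (rarely used)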
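
The three tokenization schemes can be compared side by side with a small sketch. The helper names below are hypothetical, the input is assumed to be already standardized, and the N-gram helper keeps the shorter groups too (unigrams plus bigrams for `n=2`), which matches the bigram example given later in this chapter.

```python
def word_tokens(text):
    """Word-level tokenization: split on whitespace."""
    return text.split()

def ngram_tokens(text, n=2):
    """All groups of 1 to n consecutive words."""
    words = text.split()
    tokens = []
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            tokens.append(" ".join(words[start:start + size]))
    return tokens

def char_tokens(text):
    """Character-level tokenization: every character is a token."""
    return list(text)

sentence = "the cat sat"
print(word_tokens(sentence))   # ['the', 'cat', 'sat']
print(ngram_tokens(sentence))  # ['the', 'cat', 'sat', 'the cat', 'cat sat']
print(char_tokens("cat"))      # ['c', 'a', 't']
```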

Text-processing models

* Sequence models (word-level tokenization)
 
* Bag-of-words models (N-gram tokenization)

### Vocabulary indexing

Encode each token into a numerical representation:

* Build an index of all terms found in the training data (the "vocabulary").
* Assign a unique integer to each entry in the vocabulary.


In [None]:
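# Illustrative sketch (not meant to be run as-is):
# `dataset`, `standardize`, and `tokenize` are placeholders.
import numpy as np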

vocabulary = {} 
for text in dataset:
    text = standardize(text)
    tokens = tokenize(text)
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)
            
def one_hot_encode_token(token):
    vector = np.zeros((len(vocabulary),))
    token_index = vocabulary[token]
    vector[token_index] = 1 
    return vector

Two special tokens are commonly used: the OOV token (index 1) and the mask token (index 0).

* The "out of vocabulary" index (abbreviated as OOV index): a catch-all for any token that is not in the index.
    - ```token_index = vocabulary.get(token, 1)```


* The mask token (index 0), used to pad sequences so that sequences of different lengths can be batched together:
   - Before padding: [5,7,124,4,89], [8,34,21]
   - After padding: [5,7,124,4,89], [8,34,21,0,0]
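
Both special indices can be seen at work in a short sketch. The tiny vocabulary and the helper names `encode` and `pad_batch` are made up for illustration. Only the conventions (index 0 for the mask token, index 1 for OOV) come from the text above.

```python
PAD_INDEX = 0  # mask token, used for padding
OOV_INDEX = 1  # out-of-vocabulary token

# Hypothetical toy vocabulary.
vocabulary = {"": PAD_INDEX, "[UNK]": OOV_INDEX, "i": 2, "write": 3}

def encode(tokens):
    """Map tokens to integers, falling back to the OOV index."""
    return [vocabulary.get(token, OOV_INDEX) for token in tokens]

def pad_batch(sequences):
    """Right-pad every sequence with the mask index to the longest length."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [PAD_INDEX] * (longest - len(seq)) for seq in sequences]

print(encode(["i", "write", "poems"]))  # [2, 3, 1]
print(pad_batch([[5, 7, 124, 4, 89], [8, 34, 21]]))
# [[5, 7, 124, 4, 89], [8, 34, 21, 0, 0]]
```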

### Using the TextVectorization layer

In [5]:
import string

class Vectorizer:
    def standardize(self, text):
        text = text.lower()
        return "".join(char for char in text if char not in string.punctuation)

    def tokenize(self, text):
        text = self.standardize(text)
        return text.split()

    def make_vocabulary(self, dataset):
        self.vocabulary = {"": 0, "[UNK]": 1}
        for text in dataset:
            text = self.standardize(text)
            tokens = self.tokenize(text)
            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        self.inverse_vocabulary = dict(
            (v, k) for k, v in self.vocabulary.items())

    def encode(self, text):
        text = self.standardize(text)
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token, 1) for token in tokens]

    def decode(self, int_sequence):
        return " ".join(
            self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)

vectorizer = Vectorizer()
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
vectorizer.make_vocabulary(dataset)

In [6]:
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)

[2, 3, 5, 7, 1, 5, 6]


In [7]:
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again


The implementation above is not very performant, so in practice it is better to use the Keras TextVectorization layer.

In [8]:
from tensorflow.keras.layers import TextVectorization
text_vectorization = TextVectorization(
    output_mode="int",
)

Default settings:

* Text standardization: convert to lowercase and remove punctuation
* Tokenization: split on whitespace

You can also provide custom functions.

In [9]:
import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor):
    lowercase_string = tf.strings.lower(string_tensor)
    return tf.strings.regex_replace(
        lowercase_string, f"[{re.escape(string.punctuation)}]", "")

def custom_split_fn(string_tensor):
    return tf.strings.split(string_tensor)

text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn,
)

To index the vocabulary of a text corpus, call the ```adapt()``` method.

In [10]:
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
text_vectorization.adapt(dataset)

**Displaying the vocabulary**

In [11]:
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']

In [12]:
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)


In [13]:
inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again


## Two approaches for representing groups of words: Sets and sequences

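* The simple approach: text = an unordered set of words (bag-of-words models)
* Order matters: text = a sequence of words, processed step by step like a time series (recurrent models)
* Hybrid: order-agnostic in itself, but with word-position information injected (Transformer)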

### Preparing the IMDB movie reviews data

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

In [None]:
!rm -r aclImdb/train/unsup

In [14]:
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy


In [None]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

In [15]:
from tensorflow import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


**Displaying the shapes and dtypes of the first batch**

In [16]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'This film deals with the atrocity in Derry 30 years ago which is commonly known as Bloody Sunday.<br /><br />The film is well researched, acted and directed. It is as close to the truth as we will get until the outcome of the Saville enquiry. The film puts the atrocity into context of the time. It also shows the savagery of the soldiers on the day of the atrocity. The disgraceful white-wash that was the Widgery Tribunal is also dealt with.<br /><br />Overall, this is an excellent drama which is moving and shocking. When the Saville report comes out, watch this film again to see how close to the truth it is.', shape=(), dtype=string)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


### Processing words as a set: The bag-of-words approach

#### Single words (unigrams) with binary encoding

"the cat sat on the mat" $\rightarrow$ {"cat","mat","on","sat","the"}

Encode the text as a multi-hot vector:

* A vector with as many dimensions as there are words in the vocabulary
* Binary: 0 marks a word as absent, 1 marks it as present in the text
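
Multi-hot encoding can be written out by hand as follows. This is a simplified sketch with a made-up six-word vocabulary. Unlike a real vectorization layer, it silently ignores tokens that are not in the vocabulary instead of mapping them to an OOV slot.

```python
def multi_hot(tokens, vocabulary):
    """Return a vector with 1.0 at the index of every token that occurs."""
    vector = [0.0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:
            vector[index] = 1.0  # presence only: repeats do not add up
    return vector

# Hypothetical vocabulary, indexed alphabetically.
vocabulary = {"cat": 0, "dog": 1, "mat": 2, "on": 3, "sat": 4, "the": 5}
print(multi_hot("the cat sat on the mat".split(), vocabulary))
# [1.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```

Note that "the" occurs twice in the sentence but still contributes a single 1.0, and that word order is lost entirely.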

**Preprocessing our datasets with a `TextVectorization` layer**

In [17]:
text_vectorization = TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot",
)
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)

binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

**Inspecting the output of our binary unigram dataset**

In [18]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


**Our model-building utility**

In [19]:
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

**Training and testing the binary unigram model**

In [20]:
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.887


#### Bigrams with binary encoding

"the cat sat on the mat" $\rightarrow$ {"the", "the cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the mat", "mat"}

**Configuring the `TextVectorization` layer to return bigrams**

In [21]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot",
)

**Training and testing the binary bigram model**

In [22]:
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.893


* binary_1gram: 0.887
* binary_2gram: 0.893

#### Bigrams with TF-IDF encoding

"the cat sat on the mat" $\rightarrow$  {"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1, "sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}

**Configuring the `TextVectorization` layer to return token counts**

In [23]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
)

**Configuring `TextVectorization` to return TF-IDF-weighted outputs**

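* Term frequency: the more often a given term appears in a document, the more important that term is for understanding what the document is about.

* Inverse document frequency: at the same time, how often the term appears across all documents in the dataset also matters. Terms that appear in almost every document (such as "the" or "a") are not particularly informative, whereas terms that appear in only a small subset of all texts are very distinctive, and therefore important.

* TF-IDF (term frequency-inverse document frequency) is a metric that fuses these two ideas.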
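
To make the two ideas concrete, here is a sketch of one common textbook variant, tf × log(N / df), on a made-up three-document corpus. It is an assumption for illustration only: the Keras `TextVectorization` layer may use a differently smoothed formula, so this is not meant to reproduce its outputs.

```python
import math

def tf_idf(term, document, corpus):
    """Term frequency times log inverse document frequency."""
    term_frequency = document.count(term)
    document_frequency = sum(1 for doc in corpus if term in doc)
    if document_frequency == 0:
        return 0.0  # the term never occurs in the corpus
    inverse_document_frequency = math.log(len(corpus) / document_frequency)
    return term_frequency * inverse_document_frequency

# Hypothetical toy corpus of tokenized documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog ate the bone".split(),
    "the bird sang".split(),
]
print(tf_idf("the", corpus[0], corpus))            # 0.0
print(round(tf_idf("cat", corpus[0], corpus), 3))  # 1.099
```

"the" appears twice in the first document but in every document of the corpus, so its weight collapses to zero, while the rarer "cat" gets a positive weight.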

In [24]:
# not for execution
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
)

**Training and testing the TF-IDF bigram model**

In [25]:
with tf.device('/CPU:0'):    # run adapt() on the CPU in case it fails on the GPU
    text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]
model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.894


* binary_1gram: 0.887
* binary_2gram: 0.893
* tfidf_2gram: 0.894

#### Exporting A Model That Processes Raw Strings

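So far, the preprocessing (standardization, splitting, and indexing) has been part of the tf.data pipeline.

That is not a standalone solution: in the inference environment (for example, a different language, OS, or hardware), the preprocessing would have to be reimplemented.

Solution: include the preprocessing in the model itself (raw data -> model -> output).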

In [26]:
inputs = keras.Input(shape=(1,), dtype="string")
processed_inputs = text_vectorization(inputs)
outputs = model(processed_inputs)
inference_model = keras.Model(inputs, outputs)

In [27]:
import tensorflow as tf
raw_text_data = tf.convert_to_tensor([
    ["That was an excellent movie, I loved it."],
])
predictions = inference_model(raw_text_data)
print(f"{float(predictions[0] * 100):.2f} percent positive")

89.21 percent positive


### Processing words as a sequence: The sequence model approach

#### A first practical example

**Preparing integer sequence datasets**

In [28]:
from tensorflow.keras import layers

max_length = 300     # original 600 is too big to handle on a notebook GPU
max_tokens = 10000   # original 20000 is too big to handle on a notebook GPU
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [29]:
for inputs, targets in int_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 300)
inputs.dtype: <dtype: 'int64'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(
[  11   14    4  890   18    9  626   46  179   50    4 1124  480 1402
  498 3699 1213    5  655    3 5903   15  255 7260  117   25 5078   13
  360  252   25  186  786  131   10   96  357   10   14    8   16    4
  145  929  934   18  328  113   10  369   86  264    7    9    6 1069
    4 1128    3  507   39    4 1051  128  439   96   22  101    2  338
  137    5  255 7260   72   75    2  611 1071    3    2   18    1   13
   36   11  218   21    9    7   83  113  194 8761    2 2441 9221    5
  536    3  860  227  338 4430   13   12  288   10  497    2   18    4
  694   80    8 2708    5   30 1585    9  117 4339    4  423    5    1
   12   10   42   96   22  178 4428 4710   13   10   81   22  362   11
   18   19   10   26    6  920   10   26  294  441    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0   

**A sequence model built on one-hot encoded vector sequences**

In [30]:
batch_size = 16
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = tf.one_hot(inputs, depth=max_tokens)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, None)]            0         
                                                                 
 tf.one_hot (TFOpLambda)     (None, None, 10000)       0         
                                                                 
 bidirectional (Bidirectiona  (None, 64)               2568448   
 l)                                                              
                                                                 
 dropout_3 (Dropout)         (None, 64)                0         
                                                                 
 dense_6 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2,568,513
Trainable params: 2,568,513
Non-trainable params: 0
_________________________________________________

**Training a first basic sequence model**

In [31]:
callbacks = [
    keras.callbacks.ModelCheckpoint("one_hot_bidir_lstm.keras",
                                    save_best_only=True)
]
# int_train_ds is already batched (batch size 32), so no batch_size is passed to fit()
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("one_hot_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.873


* binary_1gram: 0.887
* binary_2gram: 0.893
* tfidf_2gram: 0.894
* one_hot_bidir_lstm: 0.873

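The model trains very slowly:
* The inputs are large: with the settings above, each review becomes a (300, 10,000) matrix of one-hot vectors, i.e. 3,000,000 floats (12,000,000 with the original 600 × 20,000 settings).

It also performs worse than the very light and fast binary unigram model.

One-hot encoding is not a good fit here. A better option: word embeddings.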
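
The size of the one-hot inputs is simple arithmetic, sketched below for both the original settings (600 tokens, 20,000-word vocabulary) and the reduced settings used in this notebook (300 tokens, 10,000-word vocabulary). The helper names are illustrative, and the memory figure assumes 4-byte `float32` values.

```python
def one_hot_floats(sequence_length, vocabulary_size):
    """Number of values in one one-hot encoded review."""
    return sequence_length * vocabulary_size

def megabytes(num_floats, bytes_per_float=4):
    """Memory footprint assuming float32 storage."""
    return num_floats * bytes_per_float / 1e6

for seq_len, vocab in [(600, 20000), (300, 10000)]:
    n = one_hot_floats(seq_len, vocab)
    print(f"({seq_len}, {vocab}): {n:,} floats, "
          f"{megabytes(n):.0f} MB per review")
# (600, 20000): 12,000,000 floats, 48 MB per review
# (300, 10000): 3,000,000 floats, 12 MB per review
```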

#### Understanding word embeddings

How words are encoded is a feature-engineering decision.

It injects assumptions about the structure of the feature space into the model.

The one-hot assumption: the different tokens are all independent of each other (one-hot vectors are all orthogonal to one another).

Example: "film" and "movie" are nearly interchangeable, so their vectors should not be orthogonal; they should be close to each other.


geometric relationship <--> semantic relationship

geometric distance (e.g. L2 distance)  <--> semantic distance

Word embeddings: map human language into a structured geometric space.
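
The contrast between the two encodings can be checked numerically. In the sketch below, the 2-D "embedding" vectors are hand-picked for illustration and were not learned from any data. The one-hot vectors are what a three-word vocabulary would produce.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

one_hot = {
    "film":   [1, 0, 0],
    "movie":  [0, 1, 0],
    "banana": [0, 0, 1],
}
# Hypothetical toy embeddings: synonyms are placed close together.
embedding = {
    "film":   [0.9, 0.1],
    "movie":  [0.8, 0.2],
    "banana": [-0.7, 0.9],
}

# One-hot: every pair is orthogonal and equally far apart.
print(dot(one_hot["film"], one_hot["movie"]))                      # 0
print(round(l2_distance(one_hot["film"], one_hot["movie"]), 3))    # 1.414
print(round(l2_distance(one_hot["film"], one_hot["banana"]), 3))   # 1.414

# Embeddings: distance reflects semantic similarity.
print(round(l2_distance(embedding["film"], embedding["movie"]), 3))   # 0.141
print(round(l2_distance(embedding["film"], embedding["banana"]), 3))  # 1.789
```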

<img src="https://drek4537l1klr.cloudfront.net/chollet2/Figures/11-03.png" width="150"><p style="text-align:center">Figure 11.2 A toy example 
of a word-embedding space.</p>

One-hot encoding:
* binary
* sparse
* very high-dimensional (20,000)

Word embeddings:
* floating-point
* low-dimensional (256~1024)

<img src="https://drek4537l1klr.cloudfront.net/chollet2/Figures/11-02.png" width="300"><p style="text-align:center">Figure 11.3 Word representations 
obtained from one-hot encoding or hashing are sparse, high-dimensional, 
and hardcoded. Word embeddings are dense, relatively low-dimensional, and 
learned from data.</p>

Obtain word embeddings:
* learn them jointly with the main task (start with random)
* use pretrained word embeddings

#### Learning word embeddings with the Embedding layer


**Instantiating an `Embedding` layer**

Word index $\rightarrow$ Embedding layer $\rightarrow$ Corresponding word vector

Rank-2 (batch_size, sequence_length) $\rightarrow$  Rank-3 (batch_size, sequence_length, embedding_dimensionality)
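
Conceptually, an embedding layer is a lookup table from integer indices to vectors. The following plain-Python sketch (hypothetical names, a tiny table with random values) only illustrates the shape change from rank 2 to rank 3. In Keras the table is a trainable weight matrix rather than a list of lists.

```python
import random

random.seed(0)
vocabulary_size = 10
embedding_dim = 4

# One vector per word index, randomly initialized.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocabulary_size)
]

def embed(batch):
    """Replace every integer index by its row in the table."""
    return [[embedding_table[index] for index in sequence]
            for sequence in batch]

batch = [[4, 3, 2], [5, 1, 0]]  # shape (batch_size=2, sequence_length=3)
embedded = embed(batch)
# Resulting shape: (batch_size, sequence_length, embedding_dim)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 3 4
```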

In [32]:
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)

**Model that uses an `Embedding` layer trained from scratch**

In [33]:
from tensorflow import keras
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_1 (Embedding)     (None, None, 256)         2560000   
                                                                 
 bidirectional_1 (Bidirectio  (None, 64)               55680     
 nal)                                                            
                                                                 
 dropout_4 (Dropout)         (None, 64)                0         
                                                                 
 dense_7 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2,615,745
Trainable params: 2,615,745
Non-trainable params: 0
_________________________________________________

* binary_1gram: 0.887
* binary_2gram: 0.893
* tfidf_2gram: 0.894
* one_hot_bidir_lstm: 0.873
* embeddings_bidir_lstm: 0.868

#### Understanding padding and masking

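A bidirectional RNN runs two RNNs in parallel: one looks at the tokens in their natural order, the other in reverse order.

For short reviews, the RNN that reads in natural order spends its last iterations seeing only vectors that encode padding (possibly for several hundred iterations), and the information stored in its internal state gradually fades out.

Masking: tell the RNN to skip the padded positions, using a boolean mask (True = real token, False = padding).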

Example
```
>>> embedding_layer = Embedding(input_dim=10, output_dim=256, mask_zero=True)
>>> some_input = [
... [4, 3, 2, 1, 0, 0, 0],
... [5, 4, 3, 2, 1, 0, 0],
... [2, 1, 0, 0, 0, 0, 0]]
>>> mask = embedding_layer.compute_mask(some_input)
<tf.Tensor: shape=(3, 7), dtype=bool, numpy=
array([[ True, True, True, True, False, False, False],
 [ True, True, True, True, True, False, False],
 [ True, True, False, False, False, False, False]])>
```
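
The mask itself is easy to compute by hand. The sketch below rebuilds it for the same input in plain Python, and also finds the last real timestep of each sequence, which is where a masked RNN effectively stops. The helper names are hypothetical, and padding is assumed to use index 0.

```python
def compute_mask(batch, pad_index=0):
    """True where a real token is present, False at padded positions."""
    return [[token != pad_index for token in sequence] for sequence in batch]

def last_real_step(mask_row):
    """Index of the last non-padded position, or -1 if all padding."""
    for position in range(len(mask_row) - 1, -1, -1):
        if mask_row[position]:
            return position
    return -1

some_input = [
    [4, 3, 2, 1, 0, 0, 0],
    [5, 4, 3, 2, 1, 0, 0],
    [2, 1, 0, 0, 0, 0, 0],
]
mask = compute_mask(some_input)
print(mask[2])  # [True, True, False, False, False, False, False]
print([last_real_step(row) for row in mask])  # [3, 4, 1]
```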

**Using an `Embedding` layer with masking enabled**

In [34]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(
    input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru_with_masking.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru_with_masking.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_7 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_2 (Embedding)     (None, None, 256)         2560000   
                                                                 
 bidirectional_2 (Bidirectio  (None, 64)               73984     
 nal)                                                            
                                                                 
 dropout_5 (Dropout)         (None, 64)                0         
                                                                 
 dense_8 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2,634,049
Trainable params: 2,634,049
Non-trainable params: 0
_________________________________________________

* binary_1gram: 0.887
* binary_2gram: 0.893
* tfidf_2gram: 0.894
* one_hot_bidir_lstm: 0.873
* embeddings_bidir_lstm: 0.868
* embeddings_bidir_lstm_with_masking: 0.872

#### Using pretrained word embeddings

When too little training data is available, you can use precomputed embeddings instead.

This is the same idea as using pretrained convnets for image classification.

One example: Global Vectors for Word Representation (GloVe), developed by Stanford researchers in 2014.

In [36]:
# !wget http://nlp.stanford.edu/data/glove.6B.zip # use browser instead.
!unzip -q glove.6B.zip

**Parsing the GloVe word-embeddings file**

In [41]:
import numpy as np
path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding='utf-8') as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(f"Found {len(embeddings_index)} word vectors.")

Found 400000 word vectors.
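
The file format parsed above is one word per line followed by its space-separated coefficients. The sketch below runs the same parsing logic on a made-up two-line stand-in (3 dimensions, invented values) and then checks which vocabulary entries have no pretrained vector. Such entries would keep an all-zero row in the embedding matrix.

```python
import io

# Made-up stand-in for a GloVe file; the values are invented.
fake_glove_file = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "movie 0.5 0.4 -0.1\n"
)

embeddings_index = {}
for line in fake_glove_file:
    # Split off the word, keep the rest of the line as coefficients.
    word, coefs = line.split(maxsplit=1)
    embeddings_index[word] = [float(value) for value in coefs.split()]

# Hypothetical vocabulary, including the two special tokens.
vocabulary = ["", "[UNK]", "the", "movie", "zzyzx"]
missing = [word for word in vocabulary if word not in embeddings_index]
print(embeddings_index["movie"])  # [0.5, 0.4, -0.1]
print(missing)                    # ['', '[UNK]', 'zzyzx']
```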


**Preparing the GloVe word-embeddings matrix**

In [42]:
embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary()
word_index = dict(zip(vocabulary, range(len(vocabulary))))

embedding_matrix = np.zeros((max_tokens, embedding_dim))
for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)
        # Words without a GloVe vector keep an all-zero row.
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

In [43]:
embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True,
)

**Model that uses a pretrained Embedding layer**

In [44]:
inputs = keras.Input(shape=(None,), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("glove_embeddings_sequence_model.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=10, callbacks=callbacks)
model = keras.models.load_model("glove_embeddings_sequence_model.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_8 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_3 (Embedding)     (None, None, 100)         1000000   
                                                                 
 bidirectional_3 (Bidirectio  (None, 64)               34048     
 nal)                                                            
                                                                 
 dropout_6 (Dropout)         (None, 64)                0         
                                                                 
 dense_9 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1,034,113
Trainable params: 34,113
Non-trainable params: 1,000,000
____________________________________________

* binary_1gram: 0.887
* binary_2gram: 0.893
* tfidf_2gram: 0.894
* one_hot_bidir_lstm: 0.873
* embeddings_bidir_lstm: 0.868
* embeddings_bidir_lstm_with_masking: 0.872
* glove_embeddings_sequence_model: 0.868

On this task, pretrained embeddings were not very helpful.

The dataset contains enough samples to learn a specialized embedding space from scratch.

Pretrained embeddings tend to help more on smaller datasets.

#### Summary

* Two kinds of NLP models:
 - bag-of-words models (no order) with Dense layers
 - sequence models that process word order with RNN, a 1D convnet, or a Transformer
 
* Word embeddings: vector spaces where semantic relationships between words are modeled as distance relationships between vectors that represent those words.