## December 11

# Text Processing

- Text processing
    - NLP - Natural Language Processing
- Different types of NLP
    - Classification
        - What is the topic of the text?
    - Censorship/Filtering
        - Does this text contain abuse?
    - Sentiment analysis
        - Is the text positive or negative?
    - LLM
        - What should the next word in the sentence be? (generative)
    - Translation
        - How do you say "you are beautiful" in Urdu?
    - Summarization
        - Summarize the text into a short paragraph, please.

### How do we process text for neural networks?

- Standardization
    - .lower()
    - remove punctuation (but carefully, since it can carry meaning)
- Tokenization and tensorization
    - Split the text into tokens; each word or group of words is turned into a vector, and together the vectors form a tensor
    - Build an index (a vocabulary) over the tokens, and work with the index values from then on.
- **Example**
    - "The cat sat on the mat!"
    - the cat sat on the mat
    - "cat", "sat", "on", "mat"
    - [2, 34, 53, 8]
    - one-hot encoding or embedding
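
The example above can be sketched in plain Python. This is a minimal illustration, not how Keras implements it: the helper names (`standardize`, `build_vocabulary`, `encode`) and the one-sentence vocabulary are assumptions made for the example, so the indices differ from the `[2, 34, 53, 8]` in the notes, and unlike the notes the sketch keeps "the" as a token.

```python
import string

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(corpus):
    """Map every distinct token to an integer index.

    Index 0 is reserved for padding and index 1 for unknown words
    (an assumption for this sketch).
    """
    vocab = {"": 0, "[UNK]": 1}
    for document in corpus:
        for token in standardize(document).split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Turn a text into the list of its token indices."""
    return [vocab.get(token, vocab["[UNK]"]) for token in standardize(text).split()]

corpus = ["The cat sat on the mat!"]
vocab = build_vocabulary(corpus)

print(standardize(corpus[0]))      # the cat sat on the mat
print(encode(corpus[0], vocab))    # [2, 3, 4, 5, 2, 6]
print(encode("The dog sat.", vocab))  # [2, 1, 4]  ("dog" is unknown)
```

The integer lists are what later gets one-hot encoded or fed to an embedding.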

- Three ways of handling tokens (there are more):
    - **Word-level tokenization**
        - Whitespace-separated substrings, i.e. roughly words.
        - There are variants that split words further, which works well for languages with compound words (Swedish "bildörr" -> "bil" + "dörr", i.e. "car door").
            - or when you want to split off suffixes, as in German "Mädchen"
            - So you need to know the language you are processing!
    - **N-grams**
        - Tokens are groups of *N* consecutive words.
        - Examples: "the cat", "he was", "over there"
        - These are 2-grams or bigrams (named with numeric prefixes, compare "hexagram")
            - "bag of words" (the order between tokens is discarded)
    - **Character-level tokenization**
        - Each character is tokenized on its own.
        - Useful for e.g. Cyrillic/Chinese/Japanese text or for phonetic language analysis
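
The three tokenization styles can be compared side by side. The function names below are made up for this sketch, and the word splitter is the naive whitespace variant, so it does not handle compound words or suffixes.

```python
def word_tokens(text):
    """Word-level: whitespace-separated substrings."""
    return text.lower().split()

def ngram_tokens(text, n=2):
    """N-grams: every run of n consecutive words becomes one token."""
    words = word_tokens(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_tokens(text):
    """Character-level: every character (spaces included) is a token."""
    return list(text.lower())

sentence = "the cat sat on the mat"
print(word_tokens(sentence))      # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ngram_tokens(sentence, 2))  # ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
print(char_tokens("cat sat"))     # ['c', 'a', 't', ' ', 's', 'a', 't']
```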

Today's dataset
https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

In [1]:
import os, pathlib, shutil, random
base_dir = pathlib.Path("../data/aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
# Move 20% of the training files of each class into a validation directory.
for category in ("neg", "pos"):
    try:
        os.makedirs(val_dir / category)  # raises FileExistsError if the split was already made
        files = os.listdir(train_dir / category)
        random.Random(1337).shuffle(files)  # seeded shuffle
        num_val_samples = int(0.2 * len(files))
        val_files = files[:num_val_samples]
        for fname in val_files:
            shutil.move(train_dir / category / fname, val_dir / category / fname)
    except FileExistsError:
        pass  # validation split already exists, leave it untouched

In [2]:
import keras
batch_size = 32
train_ds = keras.utils.text_dataset_from_directory(
    train_dir, batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    val_dir, batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    base_dir / "test", batch_size=batch_size
)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [3]:
for inputs, targets in train_ds:
    print("inputs:", inputs[:10])
    print("targets:", targets[:10])
    print("inputs shape:", inputs.shape, inputs.dtype)
    print("targets shape:", targets.shape, targets.dtype)
    break

inputs: tf.Tensor(
[b"What can I say, it's a damn good movie. See it if you still haven't. Great camera works and lighting techniques. Awesome, just awesome. Orson Welles is incredible 'The Lady From Shanghai' can certainly take the place of 'Citizen Kane'."
 b'Fot the most part, this movie feels like a "made-for-TV" effort. The direction is ham-fisted, the acting (with the exception of Fred Gwynne) is overwrought and soapy. Denise Crosby, particularly, delivers her lines like she\'s cold reading them off a cue card. Only one thing makes this film worth watching, and that is once Gage comes back from the "Semetary." There is something disturbing about watching a small child murder someone, and this movie might be more than some can handle just for that reason. It is absolutely bone-chilling. This film only does one thing right, but it knocks that one thing right out of the park. Worth seeing just for the last 10 minutes or so.'
 b'I was not really a big fan of Star Trek until past 2-3 

In [4]:
text_vectorization = keras.layers.TextVectorization(
    max_tokens=20_000, output_mode="multi_hot"
)

text_only_train_ds = train_ds.map(lambda x, _: x)
text_vectorization.adapt(text_only_train_ds)

binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
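
As a rough picture of what `output_mode="multi_hot"` produces: each review becomes one vector as long as the vocabulary, with a 1 wherever a vocabulary word occurs at least once. The sketch below is plain Python with a hypothetical six-entry vocabulary (index 0 reserved for unknown words); it illustrates the idea only and is not the Keras implementation.

```python
def multi_hot(text, vocab):
    """Return a 0/1 vector with a 1 at the index of every token present in text."""
    vector = [0] * len(vocab)
    for token in text.lower().split():
        vector[vocab.get(token, vocab["[UNK]"])] = 1
    return vector

# Hypothetical vocabulary; index 0 collects all unknown words.
vocab = {"[UNK]": 0, "the": 1, "movie": 2, "was": 3, "great": 4, "boring": 5}

print(multi_hot("the movie was great great great", vocab))  # [0, 1, 1, 1, 1, 0]
print(multi_hot("the plot was boring", vocab))               # [1, 1, 0, 1, 0, 1]
```

Word order and repetition are both lost: "great great great" and "great" give the same vector, which is what makes this a bag-of-words representation.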

In [5]:
def get_model(max_tokens=20000, hidden_dims=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = keras.layers.Dense(hidden_dims, activation="relu")(inputs)
    x = keras.layers.Dropout(0.5)(x)
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model

In [6]:
model = get_model()
model.summary()
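
The summary output is not reproduced in these notes, but the parameter count of this model can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout adds no parameters. The helper `dense_params` is a name made up for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a dense layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

max_tokens, hidden_dims = 20_000, 16  # defaults of get_model()

hidden = dense_params(max_tokens, hidden_dims)  # 20000 * 16 + 16
output = dense_params(hidden_dims, 1)           # 16 * 1 + 1
total = hidden + output

print(hidden, output, total)  # 320016 17 320033
```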

In [7]:
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True)
]

model.fit(binary_1gram_train_ds.cache(), validation_data=binary_1gram_val_ds.cache(), epochs=10, callbacks=callbacks)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 11s 17ms/step - accuracy: 0.7400 - loss: 0.5280 - val_accuracy: 0.8808 - val_loss: 0.3236
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8835 - loss: 0.3159 - val_accuracy: 0.8870 - val_loss: 0.3058
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9027 - loss: 0.2679 - val_accuracy: 0.8878 - val_loss: 0.3053
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9123 - loss: 0.2430 - val_accuracy: 0.8878 - val_loss: 0.3207
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9162 - loss: 0.2351 - val_accuracy: 0.8850 - val_loss: 0.3348
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9209 - loss: 0.2245 - val_accuracy: 0.8860 - val_loss: 0.3591
Epoch 7/10
625/625

<keras.src.callbacks.history.History at 0x20e900632f0>

In [8]:
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 31s 39ms/step - accuracy: 0.8812 - loss: 0.3101
Test acc: 0.882


## Term frequency–inverse document frequency (TF-IDF)

In [14]:
text_vectorization = keras.layers.TextVectorization(
    ngrams=2, max_tokens=20_000, output_mode="tf_idf"
)

text_vectorization.adapt(text_only_train_ds)

tfidf_2grams_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
tfidf_2grams_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
tfidf_2grams_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
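
To make the TF-IDF weighting concrete, here is a small worked example in plain Python. It uses the textbook form, term count times `log(N / document frequency)`, on single words; library implementations usually smooth the IDF term, so exact values may differ from what the layer above computes. The three sentences and the function name `tf_idf` are invented for this sketch.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {token: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token occur?
    df = Counter(token for doc in tokenized for token in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)  # term frequency within this document
        weights.append({tok: count * math.log(n_docs / df[tok]) for tok, count in tf.items()})
    return weights

docs = [
    "the movie was great",
    "the movie was boring",
    "the cast was great fun",
]
weights = tf_idf(docs)

print({tok: round(w, 3) for tok, w in weights[0].items()})
# {'the': 0.0, 'movie': 0.405, 'was': 0.0, 'great': 0.405}
print({tok: round(w, 3) for tok, w in weights[1].items()})
# {'the': 0.0, 'movie': 0.405, 'was': 0.0, 'boring': 1.099}
```

Words that occur in every document ("the", "was") get weight 0, while a word unique to one document ("boring") gets the highest weight.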

In [15]:
model = get_model()
model.summary()

In [17]:
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2grams.keras", save_best_only=True)
]

In [18]:
model.fit(tfidf_2grams_train_ds.cache(), validation_data=tfidf_2grams_val_ds.cache(), epochs=10, callbacks=callbacks)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 24s 37ms/step - accuracy: 0.6930 - loss: 0.6772 - val_accuracy: 0.8686 - val_loss: 0.3222
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.8502 - loss: 0.3584 - val_accuracy: 0.8832 - val_loss: 0.3110
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8733 - loss: 0.3067 - val_accuracy: 0.8884 - val_loss: 0.3152
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8787 - loss: 0.2836 - val_accuracy: 0.8734 - val_loss: 0.3426
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8879 - loss: 0.2640 - val_accuracy: 0.8794 - val_loss: 0.3526
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8994 - loss: 0.2456 - val_accuracy: 0.8634 - val_loss: 0.3766
Epoch 7/10
625/625

<keras.src.callbacks.history.History at 0x20edbc20710>

In [19]:
model = keras.models.load_model("tfidf_2grams.keras")
print(f"Test acc: {model.evaluate(tfidf_2grams_test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 33s 42ms/step - accuracy: 0.8813 - loss: 0.3034
Test acc: 0.879


## Sequencing the text, embedding

In [25]:
max_length = 600
max_tokens = 20_000
text_vectorization = keras.layers.TextVectorization(
    max_tokens=max_tokens, output_mode="int", output_sequence_length=max_length
)
text_vectorization.adapt(text_only_train_ds)

In [26]:
int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
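
With `output_mode="int"` and a fixed `output_sequence_length`, every review becomes a list of token indices of the same length: long reviews are cut off and short ones are padded. The plain-Python sketch below mimics that behaviour with a hypothetical vocabulary in which index 0 is padding and index 1 is the unknown-word token; the function name `to_int_sequence` is made up for the example.

```python
def to_int_sequence(text, vocab, length):
    """Encode text as exactly `length` token indices (truncate or zero-pad)."""
    ids = [vocab.get(token, vocab["[UNK]"]) for token in text.lower().split()]
    ids = ids[:length]                              # truncate long texts
    return ids + [vocab[""]] * (length - len(ids))  # pad short texts with 0

# Hypothetical vocabulary: 0 = padding, 1 = unknown word.
vocab = {"": 0, "[UNK]": 1, "the": 2, "movie": 3, "was": 4, "great": 5}

print(to_int_sequence("the movie was great", vocab, 6))
# [2, 3, 4, 5, 0, 0]
print(to_int_sequence("the plot was great but the movie was long", vocab, 6))
# [2, 1, 4, 5, 1, 2]
```

Unlike the multi-hot and TF-IDF vectors, this representation keeps word order, which is what a sequence model needs.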

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("int_sequence.keras", save_best_only=True)
]

In [None]:
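# tf.one_hot(inputs, depth=max_tokens) fails here with the ValueError shown below:
# `inputs` is a symbolic KerasTensor, which plain TensorFlow functions cannot consume.
# keras.ops.one_hot is a Keras operation and accepts it.
inputs = keras.Input(shape=(None,), dtype="int64")
# One-hot encode every token index: (batch, time) -> (batch, time, max_tokens)
embedded = keras.ops.one_hot(inputs, num_classes=max_tokens)
x = keras.layers.Bidirectional(keras.layers.LSTM(32))(embedded)
x = keras.layers.Dropout(0.5)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])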

ValueError: A KerasTensor cannot be used as input to a TensorFlow function. A KerasTensor is a symbolic placeholder for a shape and dtype, used when constructing Keras Functional models or Keras Functions. You can only use it as input to a Keras layer or a Keras operation (from the namespaces `keras.layers` and `keras.operations`). You are likely doing something like:

```
x = Input(...)
...
tf_fn(x)  # Invalid.
```

What you should do instead is wrap `tf_fn` in a layer:

```
class MyLayer(Layer):
    def call(self, x):
        return tf_fn(x)

x = MyLayer()(x)
```
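
Apart from the error, one-hot encoding integer sequences is expensive, which is one reason to prefer an embedding. A back-of-the-envelope calculation with the `max_length`, `max_tokens` and `batch_size` used in these notes; the 4 bytes per value (float32) and the embedding dimension of 256 are assumptions made for this sketch.

```python
max_length = 600     # tokens per review, as set in the notes
max_tokens = 20_000  # vocabulary size, as set in the notes
batch_size = 32      # as set in the notes
bytes_per_value = 4  # assumption: float32
embedding_dim = 256  # assumption: hypothetical embedding size

# One-hot: every token becomes a vector of length max_tokens.
one_hot_bytes_per_review = max_length * max_tokens * bytes_per_value
one_hot_bytes_per_batch = one_hot_bytes_per_review * batch_size

# Embedding: every token becomes a dense vector of length embedding_dim.
embedded_bytes_per_review = max_length * embedding_dim * bytes_per_value

print(one_hot_bytes_per_review / 1e6)   # 48.0   (MB per review)
print(one_hot_bytes_per_batch / 1e9)    # 1.536  (GB per batch)
print(embedded_bytes_per_review / 1e6)  # 0.6144 (MB per review)
```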


In [None]:
model.fit(int_train_ds.cache(), validation_data=int_val_ds.cache(), epochs=10, callbacks=callbacks)