<a href="https://colab.research.google.com/github/luigiselmi/dl_tensorflow/blob/main/deep_learning_for_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep learning for text
In this notebook we use deep learning models to perform tasks on text. Common tasks include text classification, content filtering, sentiment analysis, translation, text summarization, and language modeling. Since deep learning models are differentiable functions that can only process tensors of numbers, we have to transform text into numerical tensors. The steps to transform text into numbers are:

1. Text standardization
2. Tokenization
3. Convert the tokens into numerical arrays

In the first step we apply the same kind of transformations used to build a search engine: converting to lower case, removing punctuation, and word stemming. In the second step we split the standardized text into tokens, the elements of a "clean" vocabulary. Besides single words, we can also build tokens made of two consecutive words, called bigrams, or of N consecutive words, called N-grams. The size of the vocabulary defines the dimensionality of a space in which each token represents a dimension and a text can be represented as a vector. We can define a metric in such a space so that we can measure the distance between two texts. Depending on the task at hand, we might need to process the tokens in the order in which they appear in the text. In this case we build a sequence model. If the task doesn't need the order of the tokens, we build a bag-of-words model.
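The first two steps can be sketched in plain Python. This is only an illustration under simple assumptions (ASCII punctuation, whitespace splitting, no stemming); the helper names `standardize` and `tokenize` are hypothetical and are not part of Keras.

```python
import re
import string

def standardize(text):
    """Lower-case the text and strip ASCII punctuation (no stemming here)."""
    text = text.lower()
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text)

def tokenize(text, ngrams=1):
    """Split on whitespace and return all n-grams from 1 up to `ngrams` words."""
    words = standardize(text).split()
    tokens = []
    for n in range(1, ngrams + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens

print(standardize("Hello, World!"))           # hello world
print(tokenize("The cat sat.", ngrams=2))     # ['the', 'cat', 'sat', 'the cat', 'cat sat']
```

With `ngrams=2` the token list keeps the unigrams and adds the bigrams, so part of the local word order is preserved even in a bag-of-words representation.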

## Bag-of-words models
We will use the [IMDB dataset](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) to train our bag-of-words model for sentiment analysis, which is a kind of text classification. The dataset contains 50k movie reviews. We download the dataset and extract the files into a folder. The dataset contains two subfolders, train/ and test/, each containing 25k reviews split into two subfolders, pos/ and neg/, with 12500 txt files each. Each file contains a short text, the content of the review. The name of the file is built from the review's unique identifier and the score given to the movie. A score equal to or higher than 7 is positive, a score equal to or lower than 4 is negative. This same example is available for PyTorch in [another repository](https://github.com/luigiselmi/machine_learning_notes/blob/main/pml3/sentiment_analysis.ipynb).
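As a small illustration of the naming convention, the sketch below splits a file name of the form `<id>_<score>.txt` into its two parts and maps the score to a label using the thresholds given above. The function names are hypothetical helpers written for this example; the notebook itself relies on the pos/ and neg/ folders rather than on the file names.

```python
from pathlib import Path

def parse_review_filename(path):
    """Return (review_id, score) from a name of the form '<id>_<score>.txt'."""
    review_id, score = Path(path).stem.split("_")
    return int(review_id), int(score)

def label_from_score(score):
    """Return 1 for a positive review (score >= 7) and 0 for a negative one (score <= 4)."""
    if score >= 7:
        return 1
    if score <= 4:
        return 0
    raise ValueError(f"score {score} is neither positive nor negative")

review_id, score = parse_review_filename("data/aclImdb/train/pos/4077_10.txt")
print(review_id, score, label_from_score(score))  # 4077 10 1
```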

In [1]:
!wget -nv 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz' -P './data'

2024-10-27 20:33:32 URL:https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz [84125825/84125825] -> "./data/aclImdb_v1.tar.gz" [1]


In [2]:
!tar -xf data/aclImdb_v1.tar.gz -C data/

In [3]:
# Remove the unlabeled reviews, which would otherwise be read as a third class
!rm -r data/aclImdb/train/unsup

We print the content of one review

In [4]:
!cat data/aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

We use 20% of the training data for validation

In [5]:
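import os, pathlib, shutil, random

base_dir = pathlib.Path("data/aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    # Raises FileExistsError if the split was already made, which prevents moving files twice
    os.makedirs(val_dir / category)
    # Sort first: os.listdir() returns names in arbitrary order, so the seeded shuffle alone is not reproducible
    files = sorted(os.listdir(train_dir / category))
    random.Random(1337).shuffle(files)
    # Move the last 20% of the shuffled files to the validation folder
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname, val_dir / category / fname)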

We create three datasets for training, validation and test using the utility function text_dataset_from_directory(), which returns a [Dataset](https://keras.io/api/data_loading/text/) object that asynchronously fetches the data from a source folder with a structure like

main_directory/  
...class_a/  
......a_text_1.txt  
......a_text_2.txt  
...class_b/  
......b_text_1.txt  
......b_text_2.txt  

Each dataset yields batches of 32 reviews together with their numerical labels; the last batch may be smaller. Labels are assigned to the class subfolders in alphanumeric order, so 0 stands for the negative reviews (neg/) and 1 for the positive reviews (pos/).
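The label assignment can be mimicked in a couple of lines. This is a sketch of the alphanumeric-order rule only, not the library's actual implementation; `infer_labels` is a hypothetical name.

```python
def infer_labels(class_dirs):
    """Map class folder names to integer labels in alphanumeric order."""
    return {name: index for index, name in enumerate(sorted(class_dirs))}

labels = infer_labels(["pos", "neg"])
print(labels)  # {'neg': 0, 'pos': 1}
```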

In [6]:
from tensorflow import keras
batch_size = 32
train_ds = keras.utils.text_dataset_from_directory('data/aclImdb/train', batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory('data/aclImdb/val', batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory('data/aclImdb/test', batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [7]:
print('Number of batches for training: {:d}'.format(len(train_ds)))
print('Number of batches for validation: {:d}'.format(len(val_ds)))
print('Number of batches for test: {:d}'.format(len(test_ds)))

Number of batches for training: 625
Number of batches for validation: 157
Number of batches for test: 782
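The batch counts follow from a ceiling division of the number of files by the batch size, since the last batch holds whatever is left over. A quick check, with `num_batches` and `last_batch_size` as hypothetical helpers:

```python
import math

def num_batches(num_samples, batch_size):
    """Number of batches when the last, possibly smaller, batch is kept."""
    return math.ceil(num_samples / batch_size)

def last_batch_size(num_samples, batch_size):
    """Size of the final batch."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size

for name, n in (("training", 20000), ("validation", 5000), ("test", 25000)):
    print(name, num_batches(n, 32), last_batch_size(n, 32))
# training 625 32
# validation 157 8
# test 782 8
```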


In [8]:
for inputs, targets in train_ds:
  print("inputs.shape:", inputs.shape)
  print("inputs.dtype:", inputs.dtype)
  print("targets.shape:", targets.shape)
  print("targets.dtype:", targets.dtype)
  print("inputs[0]:", inputs[0])
  print("targets[0]:", targets[0])
  break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'Excellent episode movie ala Pulp Fiction. 7 days - 7 suicides. It doesnt get more depressing than this. Movie rating: 8/10 Music rating: 10/10', shape=(), dtype=string)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


We use the 20000 most common single-word tokens, aka unigrams, as the vocabulary. Each review is then represented as a multi-hot vector in a space of the same dimensionality: a component is 1 if the corresponding token occurs in the review and 0 otherwise.

In [9]:
from tensorflow.keras.layers import TextVectorization

# Keep the 20000 most frequent tokens and encode each review as a multi-hot vector
text_vectorization = TextVectorization(max_tokens=20000, output_mode="multi_hot")
# Build the vocabulary from the training texts only (labels dropped)
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)
# Vectorize the reviews of each dataset, keeping the labels
binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
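To make the encoding concrete, here is a minimal pure-Python sketch of a capped vocabulary and a multi-hot vector. It assumes, as an approximation of what the layer does in this mode, that index 0 is reserved for out-of-vocabulary tokens and that the cap includes that slot. The functions `build_vocabulary` and `multi_hot` are hypothetical and use a plain lower-case, whitespace tokenizer.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens; index 0 is reserved for unknown tokens."""
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 1)]
    return {token: index for index, token in enumerate(["[UNK]"] + most_common)}

def multi_hot(text, vocabulary):
    """Return a 0/1 vector with a 1 for each vocabulary token present in the text."""
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        vector[vocabulary.get(token, 0)] = 1  # unknown tokens fall into slot 0
    return vector

corpus = ["good movie good plot", "bad movie", "good acting"]
vocab = build_vocabulary(corpus, max_tokens=3)
print(vocab)                                 # {'[UNK]': 0, 'good': 1, 'movie': 2}
print(multi_hot("good good movie", vocab))   # [0, 1, 1]
print(multi_hot("bad movie", vocab))         # [1, 0, 1]
```

Repeated tokens do not change the vector, and word order is lost, which is what makes this a bag-of-words representation.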

In [10]:
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
  # One hidden dense layer on top of the multi-hot input, with a sigmoid output for binary classification
  inputs = keras.Input(shape=(max_tokens,))
  x = layers.Dense(hidden_dim, activation="relu")(inputs)
  x = layers.Dropout(0.5)(x)
  outputs = layers.Dense(1, activation="sigmoid")(x)
  model = keras.Model(inputs, outputs)
  model.compile(optimizer="rmsprop",
                loss="binary_crossentropy",
                metrics=["accuracy"])
  return model

In [11]:
model = get_model()
model.summary()
# Save the model to file whenever the validation loss improves
callbacks = [
  keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                  save_best_only=True)
]
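The number of trainable parameters that the summary should report can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout adds no parameters. The helper `dense_params` below is a hypothetical name used only for this arithmetic check, with the default sizes from get_model().

```python
def dense_params(num_inputs, num_units):
    """Weights plus biases of a fully connected layer."""
    return num_inputs * num_units + num_units

hidden = dense_params(20000, 16)   # hidden layer on the multi-hot input
output = dense_params(16, 1)       # sigmoid output unit
print(hidden, output, hidden + output)  # 320016 17 320033
```

Almost all parameters sit in the first layer, so the size of the model is driven by the size of the vocabulary.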

In [13]:
#model.fit(binary_1gram_train_ds.cache(),
#          validation_data=binary_1gram_val_ds.cache(),
#          epochs=10,
#          callbacks=callbacks)