In [12]:
import tensorflow as tf
import tensorflow.keras as keras

In [13]:
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()
X_train[0][:10]  # first 10 word IDs of the first review

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


  x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
  x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])


[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

**(?)** How to get rid of the `VisibleDeprecationWarning`?
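**(R)** One possible approach is Python's standard `warnings` module, which can ignore a single warning category inside a `with` block. The sketch below is only an illustration: `ToyDeprecationWarning` and `noisy_load` are hypothetical stand-ins for NumPy's warning class and for `load_data()`. In the real case one would pass NumPy's `VisibleDeprecationWarning` class as the category (where it lives depends on the NumPy version).

```python
import warnings


class ToyDeprecationWarning(UserWarning):
    """Hypothetical stand-in for NumPy's VisibleDeprecationWarning."""


def noisy_load():
    """Hypothetical stand-in for a loader that emits a warning."""
    warnings.warn("ragged nested sequences are deprecated", ToyDeprecationWarning)
    return [1, 14, 22]


# Ignore only this warning category, and only inside the `with` block.
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ToyDeprecationWarning)
    data = noisy_load()

print(data)
```

Outside the `with` block the previous warning filters are restored, so other warnings stay visible.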

In [15]:
imdb_data = keras.datasets.imdb.load_data()
type(imdb_data)

tuple

In [16]:
len(imdb_data)

2

In [19]:
for i in range(2):
    print(i, type(imdb_data[i]), len(imdb_data[i]))

0 <class 'tuple'> 2
1 <class 'tuple'> 2


In [21]:
for i in range(2):
    print(i, type(imdb_data[0][i]), type(imdb_data[1][i]))

0 <class 'numpy.ndarray'> <class 'numpy.ndarray'>
1 <class 'numpy.ndarray'> <class 'numpy.ndarray'>


In [22]:
for i in range(2):
    print(i, imdb_data[0][i].shape, imdb_data[1][i].shape)

0 (25000,) (25000,)
1 (25000,) (25000,)


## Decode a Review

In [23]:
word_index = keras.datasets.imdb.get_word_index()
type(word_index)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


dict

In [24]:
len(word_index)

88584

In [26]:
word_index["movie"], word_index["montage"] 

(17, 4223)

`word_index` is a dictionary whose keys are words (i.e. strings) and whose values are the encoded indices.

In [32]:
# id_to_word is almost the inverse of word_index: the IDs are shifted by 3 to make room for the special tokens
id_to_word = {index + 3: word for word, index in word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token

**(?)** Why `index + 3`?<br>
**(R)** Note the difference between the two quantities:

01. `index`
  - `index` is the value that `keras.datasets.imdb.get_word_index()` gives us for a word.
02. `id_`
  - `id_` is `index` shifted up by 3, which leaves IDs 0, 1 and 2 free for the three special tokens `"<pad>"`, `"<sos>"` and `"<unk>"`. These shifted IDs are what `X_train` actually contains.
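The shift can be seen on a miniature example. This is only a sketch: `toy_word_index` is a made-up three-word dictionary, not the real one returned by `get_word_index()`.

```python
# Hypothetical miniature of what get_word_index() returns.
toy_word_index = {"the": 1, "film": 2, "brilliant": 3}

# Shift every index by 3 so that IDs 0, 1 and 2 stay free for the special tokens.
toy_id_to_word = {index + 3: word for word, index in toy_word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    toy_id_to_word[id_] = token

# An encoded toy review: start token, three words, then padding.
encoded = [1, 4, 5, 6, 0, 0]
decoded = " ".join(toy_id_to_word[id_] for id_ in encoded)
print(decoded)  # <sos> the film brilliant <pad> <pad>
```

Note that ID 3 is not assigned to anything in this toy dictionary: the smallest word index is 1, which becomes ID 4.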

In [33]:
example_review = " ".join([id_to_word[id_] for id_ in X_train[0][:10]])
example_review

'<sos> this film was just brilliant casting location scenery story'

```python
[word_index[word]+3 for word in example_review.split(" ")]
```
<br>

```
KeyError: '<sos>'
```

In [39]:
word_to_id = {word: id_ for id_, word in id_to_word.items()}
print([word_to_id[word] for word in example_review.split(" ")])
print(X_train[0][:10])
print([word_index[word]+3 for word in
"this film was just brilliant casting location scenery story".split(" ")])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]
[14, 22, 16, 43, 530, 973, 1622, 1385, 65]
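Looking up a word that is in neither dictionary would still raise a `KeyError`. A minimal sketch of an encoder that falls back to the `"<unk>"` ID instead is given below. The helper `encode` and the small dictionary `toy_word_to_id` are hypothetical names used for illustration; only the IDs of `this` and `film` are taken from the output above.

```python
def encode(text, word_to_id, unk_token="<unk>"):
    """Map whitespace-separated words to IDs; unknown words get the <unk> ID."""
    unk_id = word_to_id[unk_token]
    return [word_to_id.get(word, unk_id) for word in text.split()]


# Toy excerpt of a word_to_id dictionary (assumed, not the full one).
toy_word_to_id = {"<pad>": 0, "<sos>": 1, "<unk>": 2, "this": 14, "film": 22}

ids = encode("<sos> this film rocks", toy_word_to_id)
print(ids)  # [1, 14, 22, 2]
```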


Let's handle the preprocessing exclusively in TensorFlow, so that the entire pipeline lives inside the model and can therefore be deployed outside Python, e.g. to mobile devices and web browsers.

In [40]:
import tensorflow_datasets as tfds

In [41]:
datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
type(datasets), type(info)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /home/phunc20/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /home/phunc20/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


(dict, tensorflow_datasets.core.dataset_info.DatasetInfo)

In [42]:
len(datasets)

3

In [44]:
datasets

{'test': <PrefetchDataset shapes: ((), ()), types: (tf.string, tf.int64)>,
 'train': <PrefetchDataset shapes: ((), ()), types: (tf.string, tf.int64)>,
 'unsupervised': <PrefetchDataset shapes: ((), ()), types: (tf.string, tf.int64)>}

In [45]:
train_size = info.splits["train"].num_examples
train_size

25000

In [46]:
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, b"<br\\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

**(?)** Dig deeper into why these particular functions (`tf.strings.substr`, etc.) were chosen. In particular, what form does `X_batch` take when it enters `preprocess`?

In [48]:
type(datasets["train"].batch(32))

tensorflow.python.data.ops.dataset_ops.BatchDataset
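**(R)** After `.batch(32)`, `X_batch` is a batch of raw review strings (byte strings). To see what each step of `preprocess` does to such a batch, here is a rough pure-Python analog working on a list of `bytes`. It is only a sketch: `preprocess_py` is a hypothetical name, and the real function operates on tensors, not lists.

```python
import re


def preprocess_py(reviews, max_bytes=300, pad=b"<pad>"):
    """Pure-Python analog of preprocess() for a list of byte strings."""
    tokenized = []
    for review in reviews:
        review = review[:max_bytes]                     # like tf.strings.substr(X, 0, 300)
        review = re.sub(rb"<br\s*/?>", b" ", review)    # drop HTML line breaks
        review = re.sub(rb"[^a-zA-Z']", b" ", review)   # keep letters and apostrophes only
        tokenized.append(review.split())                # split on whitespace
    # Pad every review to the length of the longest one, like to_tensor().
    longest = max((len(tokens) for tokens in tokenized), default=0)
    return [tokens + [pad] * (longest - len(tokens)) for tokens in tokenized]


batch = [b"Great movie!<br />Loved it.", b"Meh."]
padded = preprocess_py(batch)
print(padded)
```

The short review ends up with three `b"<pad>"` tokens, because the longest review in the batch has four words.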

**(?)** We do not modify `y_batch`. Why not take `X_batch` as the only argument and return `X_batch` alone?<br>
**(R)** Later on, there will be a line of code
```python
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
```
This line shows how we intend to use `preprocess`. Because the dataset was loaded with `as_supervised=True`, each element is a `(text, label)` pair, and `map` passes both components to the function. So `preprocess` must accept `y_batch` as well, and it must return it, otherwise the labels would be lost.
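The same point can be illustrated without TensorFlow. In the sketch below, `toy_map` is a hypothetical stand-in that mimics how `map` unpacks each `(X, y)` element into the function's arguments; it is not the real `tf.data` implementation.

```python
def toy_map(fn, dataset):
    """Toy analog of Dataset.map: each (X, y) element is unpacked into fn's arguments."""
    return [fn(*element) for element in dataset]


def lowercase_keep_label(X, y):
    return X.lower(), y  # the label is passed through untouched


def lowercase_only(X, y):
    return X.lower()  # the label is dropped here


toy_dataset = [("GOOD film", 1), ("BAD film", 0)]
with_labels = toy_map(lowercase_keep_label, toy_dataset)
without_labels = toy_map(lowercase_only, toy_dataset)
print(with_labels)
print(without_labels)
```

Only the first variant still yields pairs that can be unpacked as `for X_batch, y_batch in ...`.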

### Construct the vocabulary
01. Go through the whole training set.
02. Apply our `preprocess()` function.
03. Use a `Counter` (from the `collections` module) to count the number of occurrences of each word.

In [49]:
from collections import Counter
vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

**(?)** Try to understand and explain the convoluted line `vocabulary.update(list(review.numpy()))`.
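**(R)** `review` is a 1-D string tensor, `review.numpy()` turns it into a NumPy array of `bytes` objects, `list(...)` turns that into a plain Python list, and `Counter.update` adds one count per element. A small sketch with a hand-written list standing in for `list(review.numpy())` (the tokens are made up):

```python
from collections import Counter

# Stand-in for list(review.numpy()): one padded review as a list of byte strings.
review_tokens = [b"the", b"film", b"the", b"<pad>", b"<pad>"]

toy_vocabulary = Counter()
toy_vocabulary.update(review_tokens)     # one count per token in this review
toy_vocabulary.update([b"the", b"end"])  # later reviews accumulate into the same Counter

print(toy_vocabulary.most_common(2))  # [(b'the', 3), (b'<pad>', 2)]
```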

In [50]:
vocabulary.most_common()[:3]

[(b'<pad>', 214309), (b'the', 61137), (b'a', 38564)]

As expected, `"<pad>"` is by far the most common token, because `to_tensor` pads every review in a batch to the length of the longest one.