# Text classification with RNNs
## Preamble: installing and importing packages

In [1]:
try:
    import datasets
except ModuleNotFoundError:
    !pip install datasets
    import datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected packag

In [2]:
try:
    from unidecode import unidecode
except ModuleNotFoundError:
    !pip install unidecode
    from unidecode import unidecode

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting unidecode
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.3.6


In [3]:
import os
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds



## Load training dataset

We are going to work with [GoEmotions, a dataset of 58k carefully curated Reddit comments labeled with 27 emotion categories or Neutral](https://www.tensorflow.org/datasets/catalog/goemotions).
It can be retrieved from the Hugging Face Hub using the [`datasets` library](https://huggingface.co/docs/datasets/index).

The next cells load some information on the dataset:

In [4]:
SEED = 34

In [5]:
DATA_HANDLE = "go_emotions"

In [6]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder(DATA_HANDLE)





In [7]:
ds_builder.info.description

'The GoEmotions dataset contains 58k carefully curated Reddit comments labeled for 27 emotion categories or Neutral.\nThe emotion categories are admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire,\ndisappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness,\noptimism, pride, realization, relief, remorse, sadness, surprise.\n'

Each element in the dataset has three features: the comment text itself, the list of emotion labels associated with it (a comment can carry more than one), and an identifier:

In [8]:
ds_builder.info.features

{'text': Value(dtype='string', id=None),
 'labels': Sequence(feature=ClassLabel(names=['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral'], id=None), length=-1, id=None),
 'id': Value(dtype='string', id=None)}
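Because `labels` is a variable-length sequence of class indices, it helps to see how such a list maps back to emotion names and to a fixed-length target vector. The following is a minimal sketch in plain Python, not part of the original notebook: the label names are copied from the feature listing above, while `decode_labels`, `to_multi_hot`, and the example indices are hypothetical helpers introduced only for illustration.

```python
# Label names copied from the `labels` feature shown above (list index = class id).
LABEL_NAMES = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]


def decode_labels(label_ids):
    """Map a list of integer class ids to their emotion names."""
    return [LABEL_NAMES[i] for i in label_ids]


def to_multi_hot(label_ids, num_classes=len(LABEL_NAMES)):
    """Encode a variable-length list of class ids as a fixed-length 0/1 vector."""
    vector = [0] * num_classes
    for i in label_ids:
        vector[i] = 1
    return vector


# Hypothetical label list, chosen only to illustrate a multi-label sample.
example_ids = [0, 15]
print(decode_labels(example_ids))
print(to_multi_hot(example_ids))
```

A multi-hot vector like this is one common way to feed multi-label targets to a classifier; whether it suits the model built later depends on the loss that is chosen.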

Now we are going to load the training data:

In [10]:
from datasets import load_dataset

train_ds = load_dataset(DATA_HANDLE, split="train")



Downloading and preparing dataset go_emotions/simplified to /root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/2637cfdd4e64d30249c3ed2150fa2b9d279766bfcd6a809b9f085c61a90d776d...



Generating train split:   0%|          | 0/43410 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5426 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5427 [00:00<?, ? examples/s]

Dataset go_emotions downloaded and prepared to /root/.cache/huggingface/datasets/go_emotions/simplified/0.0.0/2637cfdd4e64d30249c3ed2150fa2b9d279766bfcd6a809b9f085c61a90d776d. Subsequent calls will reuse this data.
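The log above reports the size of each split. As a quick sanity check, the short sketch below (not part of the original notebook) computes the total number of examples and the share of each split, using only the counts printed in the "Generating ... split" lines; the variable names are illustrative.

```python
# Split sizes as reported in the "Generating ... split" log lines above.
split_sizes = {"train": 43410, "validation": 5426, "test": 5427}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

print(f"total examples: {total}")
for name, frac in fractions.items():
    print(f"{name:>10}: {split_sizes[name]:>6} examples ({frac:.1%})")
```

The three splits add up to 54,263 examples, divided roughly 80/10/10 between training, validation, and test.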


As seen in `ds_builder.info.features`, each data sample has three fields: the `text` string, the `labels` list of class indices, and the `id` of the comment. Here is the text of one particular sample:

In [12]:
train_ds[10]['text']

'Demographics? I don’t know anybody under 35 who has cable tv.'

### Normalizing characters
Some of the tools we'll be using later cannot flawlessly handle all Unicode characters. To avoid problems, we will normalize all characters to their closest ASCII equivalent using the function `unidecode` (imported from the [`unidecode` package](https://pypi.org/project/Unidecode/)).

The function replaces characters bearing [diacritic signs](https://en.wikipedia.org/wiki/Diacritic) with the corresponding plain character, and replaces other symbols with a close ASCII equivalent where one exists. The result is a text with no accents, no cedillas, no € symbol, no typographic apostrophes, etc.
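To make the idea concrete, here is a rough standard-library approximation of this kind of normalization. It is only a sketch and is not how `unidecode` works internally: it strips combining marks after Unicode decomposition and handles a few quotation marks through a small hand-written table. The names `strip_diacritics`, `rough_ascii`, and `PUNCTUATION_MAP` are hypothetical, and the table covers far fewer characters than the real package.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (accents, cedillas) after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical, deliberately tiny table for symbols that NFKD does not decompose.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (typographic apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def rough_ascii(text):
    """Approximate ASCII folding: drop diacritics, then map a few quote symbols."""
    return strip_diacritics(text).translate(PUNCTUATION_MAP)


print(rough_ascii("I don\u2019t know; caf\u00e9, gar\u00e7on"))
```

In the notebook itself `unidecode` remains the better choice, since it covers many more scripts and symbols than this sketch.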

In [13]:
unidecode(train_ds[10]['text'])

"Demographics? I don't know anybody under 35 who has cable tv."

In [15]:
# Normalize each comment to ASCII; the label list is also exposed under a new `label` column.
train_ds = train_ds.map(lambda sample: {'text': unidecode(sample['text']), 'label': sample['labels'], 'id': sample['id']})

  0%|          | 0/43410 [00:00<?, ?ex/s]