# Representing Text Lab

### Introduction

In this lesson, we'll work with the torchtext library and practice downloading a dataset, tokenizing it, and numericalizing it.

### Loading our Data

To begin, import the `datasets` and `data` modules from torchtext.

In [1]:
from torchtext import datasets, data
import torchtext
import torch

Next, let's define our `Field` and `LabelField` objects.  The text field should be set up to use the spaCy tokenizer.

In [93]:
TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)



In [94]:
train_data, test_data = datasets.TREC.splits(TEXT, LABEL, fine_grained=False)



Let's start by checking how many examples the training data contains.

In [95]:
len(train_data)

5452

If we access the `fields` attribute, we can see a dictionary that points to the fields we defined earlier.

In [96]:
train_data.fields

{'text': <torchtext.data.field.Field at 0x1612a9a10>,
 'label': <torchtext.data.field.LabelField at 0x1717f9a10>}

So our training data consists of 5452 observations, each with a `text` field and a `label` field.  Let's select the first example to see what kind of data we have.

In [97]:
first_example = train_data.examples[0]

first_example

<torchtext.data.example.Example at 0x164d54e90>

The `text` attribute of an example holds the tokens of one question, and the `label` attribute holds that question's category.

In [98]:
first_example.text

['How',
 'did',
 'serfdom',
 'develop',
 'in',
 'and',
 'then',
 'leave',
 'Russia',
 '?']

In [99]:
first_example.label

'DESC'
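To get a feel for what tokenizing does to a question like the one above, here is a minimal sketch that uses a regular expression as a stand-in for the spaCy tokenizer. The function name `simple_tokenize` is hypothetical, and this crude rule will not match spaCy on harder cases such as contractions; it only illustrates the idea of splitting words and punctuation into separate tokens.

```python
import re

def simple_tokenize(sentence):
    # Each run of word characters is one token;
    # each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = simple_tokenize("How did serfdom develop in and then leave Russia?")
print(tokens)
```

On this sentence the sketch produces the same ten tokens that appear in `first_example.text`.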

Ok, next we want to numericalize our text and our labels.  Let's first build the vocabulary for our text.  Here, we do not need to set a `max_size` as we have a fairly small dataset.

In [100]:
TEXT.build_vocab(train_data)

Let's also build the vocab for the LABEL.

In [115]:
LABEL.build_vocab(train_data)

In [148]:
LABEL.vocab.stoi

defaultdict(None,
            {'ENTY': 0, 'HUM': 1, 'DESC': 2, 'NUM': 3, 'LOC': 4, 'ABBR': 5})
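As a rough picture of what building a vocabulary involves, the sketch below counts tokens and assigns each one an integer index in plain Python. It is not torchtext's implementation: the helper name `build_vocab_sketch`, the toy documents, and the tie-breaking rule are assumptions made for illustration. It does follow the convention seen in this lab, where the text vocabulary reserves index 0 for `<unk>` and index 1 for `<pad>`, while the label vocabulary has no special tokens (which is why `'ENTY'` sits at index 0).

```python
from collections import Counter

def build_vocab_sketch(tokenized_docs, specials=("<unk>", "<pad>")):
    # Count how often each token occurs across all documents.
    freqs = Counter(token for doc in tokenized_docs for token in doc)
    # Most frequent tokens first; ties broken alphabetically (an assumption).
    ordered = sorted(freqs.items(), key=lambda pair: (-pair[1], pair[0]))
    # Special tokens take the lowest indices, then the counted tokens.
    itos = list(specials) + [token for token, _ in ordered]
    stoi = {token: index for index, token in enumerate(itos)}
    return freqs, itos, stoi

# Toy documents, made up for this example.
docs = [["What", "is", "a", "serf", "?"], ["What", "is", "Russia", "?"]]
freqs, itos, stoi = build_vocab_sketch(docs)

# Numericalize a new question; unseen tokens fall back to <unk>.
encoded = [stoi.get(token, stoi["<unk>"]) for token in ["What", "is", "peacock", "?"]]
print(itos)
print(encoded)
```

Here `itos` plays the role of `TEXT.vocab.itos` (index to string) and `stoi` the role of `TEXT.vocab.stoi` (string to index).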

Let's take a look at the 10 most frequent words in our dataset.

In [113]:
TEXT.vocab.freqs.most_common(10)

[('?', 5352),
 ('the', 3614),
 ('What', 3246),
 ('is', 1679),
 ('of', 1547),
 ('in', 1138),
 ('a', 1014),
 ('`', 835),
 ('How', 764),
 ("'s", 721)]

The most frequent tokens (`?`, `What`, `How`) confirm that we are working with a list of questions.  Next, let's use the `BucketIterator` to batch both our training and test data, with a batch size of 50.

In [122]:
train_batches, test_batches = data.BucketIterator.splits(
    (train_data, test_data), 
    batch_size = 50)

In [137]:
len(train_batches)

110
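As a quick sanity check on that count, the number of batches should equal the number of examples divided by the batch size, rounded up. The sketch below is plain arithmetic on the numbers reported in this lab and does not call torchtext.

```python
import math

num_examples = 5452  # len(train_data), as reported earlier
batch_size = 50

# Rounding up accounts for the final, partially filled batch.
num_batches = math.ceil(num_examples / batch_size)
# Examples left over after filling every other batch completely.
leftover = num_examples - (num_batches - 1) * batch_size

print(num_batches)  # 110
print(leftover)     # 2
```

So 109 batches are full, and one batch holds only the 2 leftover examples.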

The 110 batches make sense: 5452 training observations split into batches of 50 gives 109 full batches plus one partial batch.  Now let's explore what we get when we iterate through `train_batches`, breaking after the first iteration.  If done correctly, `first_text_batch` should be assigned the first batch of documents, and `first_label_batch` should hold the corresponding labels.

In [134]:
for text, label in train_batches:
    first_text_batch, first_label_batch = text, label
    break

In [138]:
first_text_batch.shape

torch.Size([25, 50])

In [139]:
first_label_batch.shape

torch.Size([50])
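The text batch has shape `[25, 50]` because, by default, the first dimension is the sequence length (the longest document in the batch) and the second is the batch size. Shorter documents are filled out with the padding index. The sketch below reproduces that layout with plain Python lists. The helper name `pad_and_transpose` is hypothetical, and apart from the first sequence the token indices are made up.

```python
def pad_and_transpose(sequences, pad_index=1):
    # Pad every sequence to the length of the longest one.
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # Transpose so that row t holds token t of every sequence,
    # giving a (sequence length, batch size) layout.
    return [list(step) for step in zip(*padded)]

# Three documents of different lengths (indices are illustrative).
batch = [[4, 23, 8153, 7851, 43, 2], [4, 9, 2], [7, 2]]
grid = pad_and_transpose(batch)

print(len(grid), len(grid[0]))     # 6 3
print([row[2] for row in grid])    # third document, padded with 1s
```

Reading down a column recovers one document, which is why a single observation is selected with `[:, i]` rather than `[i]`.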

Take a look at the first observation in the batch.  Each column of the text batch is one document, so we select column 0.

In [143]:
first_observation = first_text_batch[:, 0]

first_observation

tensor([   4,   23, 8153, 7851,   43,    2,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,    1,
           1])

Write a function that takes in a vector of token indices and translates it to a list of words using the vocab object.

In [144]:
def vec_to_words(vector):
    return [TEXT.vocab.itos[i] for i in vector]

In [146]:
vec_to_words(first_observation)[:10]

['What',
 'do',
 'peacocks',
 'mate',
 'with',
 '?',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>']
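Since the trailing `<pad>` tokens carry no information, one possible variation drops them while decoding. The sketch below is self-contained: `ids_to_words` is a hypothetical helper, and `toy_itos` is a made-up index-to-string list standing in for `TEXT.vocab.itos`, so the indices differ from the real vocabulary.

```python
def ids_to_words(ids, itos, pad_token="<pad>"):
    # int() lets this accept plain integers as well as 0-dim tensors.
    words = [itos[int(i)] for i in ids]
    # Drop padding so only the original tokens remain.
    return [word for word in words if word != pad_token]

# Toy vocabulary, following the <unk>=0, <pad>=1 convention.
toy_itos = ["<unk>", "<pad>", "?", "What", "do", "peacocks", "mate", "with"]
ids = [3, 4, 5, 6, 7, 2, 1, 1, 1]

decoded = ids_to_words(ids, toy_itos)
print(decoded)
```

With the real vocabulary, the same function could be called as `ids_to_words(first_observation, TEXT.vocab.itos)`.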

### Summary

In this lesson, we practiced using the `torchtext` library.  We saw how to download and tokenize our data, how to numericalize it by building a vocabulary and batching with the `BucketIterator`, and finally how to convert our numericalized documents back to text.