# Working with PyTorch Text

### Introduction

In this lesson, we'll begin using the PyTorch library to work with NLP problems.  We'll see how we can use PyTorch to download datasets, tokenize our data, and numericalize our data.  Let's get started.

### Loading our Data

PyTorch has the `torchtext` library for downloading and preprocessing textual data.

Let's see if `torchtext` is already installed.

In [15]:
# !pip install torchtext

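Ok, once torchtext is installed, we'll use torchtext's IMDB dataset to begin exploring the library.  In the IMDB dataset, each observation consists of the text of a movie review, and a label indicating whether the review was positive or negative.

> The `data.Field` API used below belongs to older torchtext releases.  It was moved to `torchtext.legacy` in version 0.9 and removed in version 0.12, so this lesson assumes an older install.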

Let's load up the data and take a look.

In [18]:
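import torch
from torchtext import datasets, data

# Tokenize the review text with spaCy, and store the labels as floats
TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)

# Download IMDB and split it into a training set and a test set
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)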


Above we use the `datasets.IMDB.splits` method to download the IMDB dataset, split into a training dataset and a test dataset.  Notice that we are passing through instances of `data.Field` and `data.LabelField`.  These provide some initial processing of the data for us.

Remember that we are downloading a set of text documents.  So passing through `data.Field(tokenize = 'spacy')` says to use spaCy to tokenize our documents.  This splits our documents into words, and also separates out certain punctuation, like apostrophes.

> If we do not specify a tokenizer, torchtext will simply tokenize based on spaces.

The `LabelField(dtype = torch.float)` means that our labels are stored as floats, so a positive or negative review can be represented as a 1 or a 0.
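
To make the tokenizer choice more concrete, here is a minimal sketch comparing a plain whitespace split with a punctuation-aware split.  The `simple_tokenize` function and its regular expression are our own illustration, not spaCy's actual rules, which handle many more cases.

```python
import re

def whitespace_tokenize(text):
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.split()

def simple_tokenize(text):
    """Rough stand-in for a real tokenizer (NOT spaCy's implementation)."""
    # Alternatives are tried in order at each position:
    #   1. a word stem that is followed by "n't"  ("was" in "wasn't")
    #   2. the contraction "n't" itself
    #   3. an apostrophe suffix such as "'s"
    #   4. a plain run of word characters
    #   5. any single punctuation character
    pattern = r"\w+?(?=n't)|n't|'\w+|\w+|[^\w\s]"
    return re.findall(pattern, text)

sentence = "The movie wasn't bad, it's great!"
space_tokens = whitespace_tokenize(sentence)
regex_tokens = simple_tokenize(sentence)

print(space_tokens)
print(regex_tokens)
```

With the whitespace split, `bad,` and `great!` remain single tokens, while the punctuation-aware version separates the comma, the exclamation point, and the contractions into their own tokens.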

Ok, now let's take a look at our training data.

In [4]:
print(vars(train_data.examples[0]))

{'text': ['For', 'a', 'movie', 'that', 'gets', 'no', 'respect', 'there', 'sure', 'are', 'a', 'lot', 'of', 'memorable', 'quotes', 'listed', 'for', 'this', 'gem', '.', 'Imagine', 'a', 'movie', 'where', 'Joe', 'Piscopo', 'is', 'actually', 'funny', '!', 'Maureen', 'Stapleton', 'is', 'a', 'scene', 'stealer', '.', 'The', 'Moroni', 'character', 'is', 'an', 'absolute', 'scream', '.', 'Watch', 'for', 'Alan', '"', 'The', 'Skipper', '"', 'Hale', 'jr', '.', 'as', 'a', 'police', 'Sgt', '.'], 'label': 'pos'}


Our data consists of a list of examples, where each example represents a different document.  We can see that an example has two attributes: `text`, which holds our tokenized text, and the related `label`.

> If we want, we can also split our training data into a training and validation set, passing through a seed so that the split is reproducible.

In [6]:
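import random

SEED = 1234  # any fixed integer works
random.seed(SEED)  # random.seed returns None, so seed first, then pass the state
train_data, valid_data = train_data.split(random_state = random.getstate())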
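
If it helps to see what a seeded split is doing, below is a plain Python sketch of the idea.  The `split_examples` helper, the 0.7 ratio, and the seed value are assumptions for illustration, and this is not torchtext's own code.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle with a private, seeded generator, then cut the list in two."""
    rng = random.Random(seed)      # same seed -> same shuffle every time
    shuffled = list(examples)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

docs = [f"review_{i}" for i in range(10)]
train_part, valid_part = split_examples(docs)
train_again, valid_again = split_examples(docs)

print(len(train_part), len(valid_part))
print(train_part == train_again)
```

Because the generator is seeded, calling the function twice gives the same training and validation sets, which is the point of passing a seed.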

### Numericalizing our Data

Ok, so at this point we have downloaded our dataset, and tokenized the documents, which we can access through `train_data` and `test_data`.  The next step is for us to numericalize this data.  In this step, each unique word in our dataset will be assigned a number, and instead of representing a document as a list of words, we'll represent it as a list of corresponding numbers.  

This is called building a *vocabulary*.

Ok let's do it.  As we can see below, we specify a `max_size` so that we will only numericalize the 25,000 words that appear most often.  When torchtext encounters a word in our corpus that is not in the top 25,000 words, it will replace it with the `<unk>` token, which is also assigned a number.

In [None]:
# TEXT = data.Field(tokenize = 'spacy')
# LABEL = data.LabelField(dtype = torch.float)

TEXT.build_vocab(train_data, max_size = 25000)
LABEL.build_vocab(train_data)

So we built our vocabulary by passing through our train data and specifying a max vocab size.  If we look at the length of the vocab, we see that the vocab size is 25002.

In [None]:
len(TEXT.vocab.itos)

The extra two entries are the `<unk>` token and the `<pad>` token.  We'll discuss the `<pad>` token a little later.
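
To see where the extra two entries come from, here is a small sketch of building a capped vocabulary by hand.  The `build_vocab` function below is our own simplified version, assuming the special tokens are placed first and the remaining words are ordered by frequency.

```python
from collections import Counter

def build_vocab(tokenized_docs, max_size, specials=("<unk>", "<pad>")):
    """Keep the max_size most frequent tokens, after the special tokens."""
    freqs = Counter(tok for doc in tokenized_docs for tok in doc)
    # Most frequent first; ties are broken alphabetically to stay deterministic
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered[:max_size]]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi, freqs

toy_docs = [
    ["a", "great", "movie"],
    ["a", "bad", "movie"],
    ["a", "great", "cast"],
]
itos, stoi, freqs = build_vocab(toy_docs, max_size=3)

print(itos)
print(len(itos))
```

With `max_size=3` the vocabulary has five entries: the three most frequent words plus the two special tokens, just as 25,000 words become 25,002 entries above.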

### Using our Vocabulary

Now that we've built our vocabulary, let's take a look at how we can use it.  We've already seen one of the main properties we can use, `vocab.itos`.  The `itos` property stands for integer to string, and as we can see, it's a list of our most popular words where each word is represented by its index.

In [None]:
TEXT.vocab.itos

We can also use `stoi`, which stands for string to integer, and returns a dictionary.

In [10]:
list(TEXT.vocab.stoi.items())[:40]

[('<unk>', 0), ('<pad>', 1), ('the', 2), (',', 3), ('.', 4), ('a', 5), ('and', 6), ('of', 7), ('to', 8), ('is', 9), ('in', 10), ('I', 11), ('it', 12), ('that', 13), ('"', 14), ("'s", 15), ('this', 16), ('-', 17), ('/><br', 18), ('was', 19), ('as', 20), ('with', 21), ('for', 22), ('movie', 23), ('film', 24), ('The', 25), ('but', 26), ('on', 27), ("n't", 28), ('(', 29), (')', 30), ('you', 31), ('are', 32), ('not', 33), ('have', 34), ('his', 35), ('be', 36), ('he', 37), ('one', 38), ('!', 39)]


Finally, we can also take a look at how often each token appears in our dataset.

In [18]:
TEXT.vocab.freqs.most_common(20)

[('the', 202478), (',', 192130), ('.', 165491), ('a', 109230), ('and', 109174), ('of', 101087), ('to', 93504), ('is', 76398), ('in', 61293), ('I', 54008), ('it', 53329), ('that', 48904), ('"', 44043), ("'s", 43247), ('this', 42371), ('-', 37003), ('/><br', 35684), ('was', 34978), ('as', 30125), ('with', 29740)]
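
Putting the two mappings together, the sketch below shows how a tokenized document could be turned into numbers and back.  The tiny vocabulary and the `numericalize` and `denumericalize` helpers are made up for illustration, assuming `<unk>` sits at index 0 as in the output above.

```python
UNK_IDX = 0  # assumed position of the <unk> token

# A toy vocabulary standing in for TEXT.vocab.itos and TEXT.vocab.stoi
itos = ["<unk>", "<pad>", "the", "movie", "was", "great"]
stoi = {tok: idx for idx, tok in enumerate(itos)}

def numericalize(tokens, stoi, unk_idx=UNK_IDX):
    """Map each token to its index, falling back to <unk> for unseen words."""
    return [stoi.get(tok, unk_idx) for tok in tokens]

def denumericalize(ids, itos):
    """Map each index back to its token."""
    return [itos[i] for i in ids]

ids = numericalize(["the", "movie", "was", "superb"], stoi)
round_trip = denumericalize(ids, itos)

print(ids)
print(round_trip)
```

Notice that the round trip is lossy: `superb` is not in the vocabulary, so it comes back as `<unk>`.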


We can also check the labels, ensuring 0 is for negative and 1 is for positive.

In [12]:
print(LABEL.vocab.stoi)

defaultdict(None, {'neg': 0, 'pos': 1})
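
As a quick sketch of what this mapping means for our targets, the snippet below converts string labels into floats.  The `encode_labels` helper is hypothetical, and the label ordering is simply copied from the output above.

```python
# Ordering assumed to match the LABEL.vocab.stoi output shown above
label_itos = ["neg", "pos"]
label_stoi = {label: idx for idx, label in enumerate(label_itos)}

def encode_labels(labels, stoi):
    """Turn string labels into floats, e.g. for a binary loss function."""
    return [float(stoi[label]) for label in labels]

encoded = encode_labels(["pos", "neg", "pos"], label_stoi)
print(encoded)
```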


### Creating a BucketIterator

We can finish up our introduction to torchtext by using a bucket iterator.  The bucket iterator is used to batch our data.  We can create batches of both our training and test data by passing through our parsed training and test data, along with the `batch_size` and `device`.

In [19]:
BATCH_SIZE = 100

# Fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, test_iter = data.BucketIterator.splits(
    (train_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)
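
The "bucket" in the name refers to grouping documents of similar length into the same batch, which reduces how much padding each batch needs.  Below is a rough sketch of that idea in plain Python.  The helper names are our own, and the real iterator does more than this, such as shuffling.

```python
def bucket_batches(tokenized_docs, batch_size):
    """Sort documents by length, then slice them into consecutive batches."""
    ordered = sorted(tokenized_docs, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_needed(batch):
    """Count the pad tokens required to bring every doc up to the longest one."""
    longest = max(len(doc) for doc in batch)
    return sum(longest - len(doc) for doc in batch)

# Six toy documents with very different lengths
docs = [["w"] * n for n in (2, 9, 3, 10, 2, 8)]

# Batching in the original order versus batching by length
naive = [docs[i:i + 2] for i in range(0, len(docs), 2)]
bucketed = bucket_batches(docs, batch_size=2)

naive_pad = sum(padding_needed(batch) for batch in naive)
bucketed_pad = sum(padding_needed(batch) for batch in bucketed)
print(naive_pad, bucketed_pad)
```

In this toy example, batching in the original order needs 20 pad tokens, while batching by length needs only 6.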



The iterators now allow us to iterate through batches of training data.

In [27]:
# Grab just the first batch
for text, label in train_iter:
    text_batch, label_batch = text, label
    break

> Notice that each column represents a different instance, so if we wish to select the first document we do so with the following.

In [47]:
first_doc = text_batch[:, 0]

And if we wish to convert these indices back to the original text, we can do so.

In [50]:
translated = [TEXT.vocab.itos[i] for i in first_doc]
translated[:10]

['This',
 'is',
 'an',
 'unusual',
 'Laurel',
 '&',
 'Hardy',
 'comedy',
 'with',
 'something']
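
This is also where the `<pad>` token comes in.  Documents in a batch have different lengths, so the shorter ones are filled out with `<pad>` until they match the longest one.  The sketch below uses plain lists rather than tensors, with helper names of our own, to show the padding and the column layout.

```python
PAD = "<pad>"

def pad_batch(tokenized_docs, pad_token=PAD):
    """Right-pad every document to the length of the longest one."""
    longest = max(len(doc) for doc in tokenized_docs)
    return [doc + [pad_token] * (longest - len(doc)) for doc in tokenized_docs]

def to_columns(padded_docs):
    """Transpose so that result[t][b] is token t of document b."""
    return [list(row) for row in zip(*padded_docs)]

batch = [["a", "fine", "film"], ["awful"]]
padded = pad_batch(batch)
columns = to_columns(padded)

# Selecting a column gives back one document, like text_batch[:, 0]
first_doc = [row[0] for row in columns]
second_doc = [row[1] for row in columns]

print(columns)
print(second_doc)
```

Each inner list of `columns` holds one token position across the whole batch, so reading down a column recovers a single document, padding included.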

### Summary

In this lesson, we began working with the `torchtext` module.  As we saw, we can use the module to download datasets and process our data.  We began by downloading and tokenizing our IMDB data by specifying the processing for both our TEXT and LABEL fields.

In [None]:
# from torchtext import datasets, data

# TEXT = data.Field(tokenize = 'spacy')
# LABEL = data.LabelField(dtype = torch.float)

# train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

We saw that our `train_data` consists of a list of example objects, where each example represents a different document, and has both `text` and `label` properties, with the text already tokenized.  With our text downloaded and tokenized, the next step was to numericalize our data with a call to `TEXT.build_vocab(train_data, max_size = 25000)`.

From here, we could access a `TEXT.vocab` object that contains the mapping of each index to the related word through the integer to string (`itos`) property.

In [None]:
TEXT.vocab.itos

Finally, we used the `BucketIterator` to batch both our training and testing data.

In [None]:
train_iter, test_iter = data.BucketIterator.splits(
    (train_data, test_data), 
    batch_size = 100,
    device = device)