# Torchtext: is it useful?
In this notebook, we look for tasks and datasets that we can use for data augmentation experiments.

We will need:
1. A lot of text for Variational Auto-Encoder / Language Model training
2. Labeled datasets for testing performance after augmentation.

Let's explore the torchtext package:

https://torchtext.readthedocs.io/en/latest/datasets.html

Interesting tasks: named entity recognition (NER), part-of-speech tagging (POS tagging).

## Preprocessing with data.Field 

In [7]:
import torchtext.data as data
import torchtext.datasets as datasets

In [9]:
# set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False)

# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)

# build the vocabulary
TEXT.build_vocab(train)
LABEL.build_vocab(train)

# make iterator for splits
train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=3, device=0)



In [14]:
x = next(iter(train_iter))
x


[torchtext.data.batch.Batch of size 3 from IMDB]
	[.text]:('[torch.LongTensor of size 3x125]', '[torch.LongTensor of size 3]')
	[.label]:[torch.LongTensor of size 3]

In [20]:
x.text[1]

tensor([125, 108, 101])

In [21]:
x.label

tensor([1, 2, 2])

In [26]:
x.text[0][0].shape, x.text[0][1].shape, x.text[0][2].shape

(torch.Size([125]), torch.Size([125]), torch.Size([125]))

In [31]:
x.text[0][0]

tensor([ 25408,   4874,      8,      2,  13260,      3,    160,      7,   3808,
            31,      3,   1451,   2082,     11,   6792,  79830,   1145,     60,
          8262,      5,     83,   9377,  18060,      6,      2,    979,     27,
            41,     71,   3164,     31,      3,   2563,      4,    204,    217,
         11217,      2,   4312,   3390,     17,    130,  15529,    204,    241,
           887,    207,   1981,    375,     84,      6,   3708,    353,      5,
            10,   2768,  36150,    179,      2,    379,    345, 198000,   1845,
             6,      2,    333,  22012,  23511,     13, 121381,  92943,     34,
             2,    225,  18809,    222,   8613,      5,      2,   5932,   3973,
          1899,    965,   8016,     12,    186,     26,   5234,      8,     83,
           225,   5001,  51282,    775,  20573,   2563,   5611,      7,   6442,
             4,  85711,  33392,   1939,  44336, 151253,      4,   1567,   4063,
         68723,  63424,    102,     73, 

In [32]:
x.text[0][1]

tensor([     3,    400,    358,    649,     24,     34,  15806,  48330,     15,
           300,      4,   2705,     15,     21,   1336,    222,   1150,  14227,
        175395,    695,    479,      8,      3,  13982,    109,      4,     23,
            39,     16,    500,      4,   1508,  21276,   4399,    358,   3913,
             8,   1416,    389,   1991,    150,     34,      3,    394,     29,
             2,   1991,      7,   2245,   4399,     31,   4763,  19506,     16,
         14507,    128,   2114,  54814,     28,    119,     31,     35, 146211,
             4,      2,    333,     34,  14988,      7,    549,     15,     56,
            15,   1295,   5554,    216,     23,      3,   6356,      5,  13983,
            18,     78,    333,   3397,    190,    119,     22,   4572,  24337,
             4,      2,  21276,   3154,     19,   4796,      7,     30,      9,
          1796,    125,    705,     73,     12,    133,     44,    126,   2222,
             1,      1,      1,      1, 

`x.text[1]` is a tensor of sequence lengths.
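To make the relationship between the padded matrix `x.text[0]` and the lengths in `x.text[1]` concrete, here is a minimal pure-Python sketch of what `include_lengths=True` together with `batch_first=True` appears to produce. The helper `pad_batch` is hypothetical and not part of torchtext, the token ids are toy values, and the padding index `1` is an assumption based on the trailing `1`s in the output above.

```python
from typing import List, Tuple

PAD_IDX = 1  # assumption: padding index, as the trailing 1s in the output above suggest


def pad_batch(sequences: List[List[int]],
              pad_idx: int = PAD_IDX) -> Tuple[List[List[int]], List[int]]:
    """Pad every sequence to the longest one and return (padded, lengths)."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# Toy token ids standing in for a batch of three reviews
batch = [[25408, 4874, 8, 2, 13260], [3, 400, 358], [7, 9, 11, 13]]
padded, lengths = pad_batch(batch)

print(padded[1])  # [3, 400, 358, 1, 1]
print(lengths)    # [5, 3, 4]
```

With this layout, the real tokens of row `i` can be recovered as `padded[i][:lengths[i]]`.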

It seems that torchtext isn't as useful as I thought. All I need is raw text with labels; torchtext also provides preprocessing tools, but I don't need them. In my experiments I will probably also use BERT fine-tuning, which has its own tokenizer and takes raw text as input.

But it has a nice collection of datasets, which can be quite useful. Let's see what we can do.

## Raw Field

In [51]:
# set up fields
TEXT = data.RawField()
LABEL = data.Field(sequential=False)

# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)

LABEL.build_vocab(train)

# make iterator for splits
train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=3, device=0)



In [52]:
next(iter(train))

<torchtext.data.example.Example at 0x7f4e26b343c8>

Well, wtf? 

How can I access this example?

In [53]:
example = next(iter(train))
example

<torchtext.data.example.Example at 0x7f4e26b343c8>

In [55]:
example.text

"I thought that Baseketball was one of the most funniest films i have ever seen! It's witty humour made me giggle all the way through, and the fact that Trey and Matt are so over the top, boosts the film's comedy. <br /><br />I have just bought Baseketball on DVD and its just one of those movies where you would never get tired of watching it. I have a very short attention span and i think this film has so any funny bits that it keeps me entertained throughout. The humorous quotes are memorable, and can make me laugh for hours if i remember them later..<br /><br />So overall i think that Baseketball is brilliant movie which everyone should go see, especially if you're younger like me as it will keep you laughing for a long time afterwards. <br /><br />P.s Does anybody think its weird for me to like them both? hehe"

In [56]:
example.label

'pos'

In [57]:
len(train)

25000

In [58]:
len(example.text.split())

153
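Since only raw text and labels are needed, the examples can be flattened into plain `(text, label)` pairs. The sketch below is only one way this might look: `clean_review`, `to_pairs` and `LABEL_TO_ID` are hypothetical names, and a `namedtuple` stands in for torchtext's `Example` so that the snippet runs without downloading IMDB. Only the `'pos'` label is confirmed by the output above; the `'neg'` label is an assumption.

```python
import re
from collections import namedtuple

# Stand-in for torchtext's Example: an object with .text and .label attributes
Example = namedtuple("Example", ["text", "label"])

# Assumption: the dataset uses the string labels 'neg' and 'pos'
LABEL_TO_ID = {"neg": 0, "pos": 1}

BR_TAG = re.compile(r"<br\s*/?>")


def clean_review(text):
    """Replace HTML line breaks with spaces and collapse repeated whitespace."""
    return " ".join(BR_TAG.sub(" ", text).split())


def to_pairs(examples):
    """Turn an iterable of examples into a list of (clean_text, label_id) pairs."""
    return [(clean_review(ex.text), LABEL_TO_ID[ex.label]) for ex in examples]


toy = [
    Example("Brilliant movie. <br /><br />P.s. watch it!", "pos"),
    Example("Boring.<br />Avoid.", "neg"),
]
pairs = to_pairs(toy)
print(pairs[0])  # ('Brilliant movie. P.s. watch it!', 1)
```

Because `train` was iterated with `next(iter(train))` above, something like `to_pairs(train)` should work on the real dataset as well, although that is untested here.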

For my purposes torchtext is not that useful, so I won't use it.