# TorchText

The torchtext package consists of **data processing utilities** and **popular datasets for natural language**.

## package reference
- https://pytorch.org/text/stable/index.html


- torchtext
- torchtext.nn.modules.multiheadattention
    - MultiheadAttentionContainer
    - InProjContainer
    - ScaledDotProduct
- torchtext.data.functional
    - generate_sp_model
    - load_sp_model
    - sentencepiece_numericalizer
    - sentencepiece_tokenizer
    - custom_replace
    - simple_space_split
    - numericalize_tokens_from_iterator
- torchtext.data.metrics
    - bleu_score
- torchtext.data.utils
    - get_tokenizer
    - ngrams_iterator
- torchtext.datasets
    - Text Classification
    - Language Modeling
    - Machine Translation
    - Sequence Tagging
    - Question Answer
- torchtext.vocab
    - Vocab
    - SubwordVocab
    - Vectors
    - GloVe
    - FastText
    - CharNGram
    - build_vocab_from_iterator
- torchtext.utils
    - reporthook
    - download_from_url
    - unicode_csv_reader
    - extract_archive

### Packages of interest

- torchtext.nn.modules.multiheadattention

    - url : https://pytorch.org/text/stable/nn_modules.html
    - MultiheadAttentionContainer
    - InProjContainer
    - ScaledDotProduct

- torchtext.datasets

    - url : https://pytorch.org/text/stable/datasets.html
    - Provides datasets for text classification, language modeling, machine translation, sequence tagging, and question answering
   
- torchtext.vocab

    - url : https://pytorch.org/text/stable/vocab.html
    - Build a Vocab object either from your own training data or from pre-trained word embeddings

# Tutorial example
- IMDB Sentiment analysis
- Comparison
    - **Legacy API (the default API before torchtext 0.9.0)**
        - available as torchtext.legacy in 0.9.0
    - **New API (torchtext == 0.9.0)**
        - torchtext
        
    

## What can we do with 'torchtext' ?


- preprocess the text input and prepare the data to train/validate a model


**1. Train/validate/test split**: generate train/validate/test data set if they are available


**2. Tokenization**: break a raw text string sentence into a list of words


**3. Vocab**: define a "contract" from tokens to indexes


**4. Numericalize**: convert a list of tokens to the corresponding indexes


**5. Batch**: generate batches of data samples and add padding if necessary
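
The steps above can be sketched without torchtext at all. The following is a minimal pure-Python illustration; the toy corpus, the whitespace tokenizer, and the helper names (`tokenize`, `numericalize`, `pad_batch`) are assumptions made for this example and are not part of the torchtext API.

```python
from collections import Counter

# 1. Split: a toy training split of (label, text) pairs stands in for a real dataset.
train_data = [("pos", "a great movie"), ("neg", "a dull movie overall")]

# 2. Tokenization: lowercase and split on whitespace (a stand-in tokenizer).
def tokenize(text):
    return text.lower().split()

# 3. Vocab: map every token to an index, reserving 0 and 1 for special tokens.
counter = Counter(tok for _, text in train_data for tok in tokenize(text))
itos = ["<unk>", "<pad>"] + sorted(counter)
stoi = {tok: idx for idx, tok in enumerate(itos)}

# 4. Numericalize: tokens -> indexes; unknown tokens fall back to <unk>.
def numericalize(text):
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokenize(text)]

# 5. Batch: pad every sequence to the longest one in the batch.
def pad_batch(sequences):
    max_len = max(len(seq) for seq in sequences)
    return [seq + [stoi["<pad>"]] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([numericalize(text) for _, text in train_data])
print(batch)  # [[2, 4, 5, 1], [2, 3, 5, 6]]
```

The rest of this notebook performs the same steps with torchtext, first through the legacy API and then through the new one.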



## Step 1: Create a dataset object

```python
# ===== torchtext.legacy =====

from torchtext.legacy import data
from torchtext.legacy import datasets

TEXT = data.Field()
LABEL = data.LabelField()

legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)

```




```python
# ===== torchtext =====

from torchtext.datasets import IMDB

train_iter, test_iter = IMDB(split=('train', 'test'))

```

### Legacy

```
In the legacy API, a Field class must be declared before any data can be preprocessed.
The Field class bundles the tokenizer and the numericalization step.

So,

we start by declaring one Field object for the text and one for the label.

```

In [2]:
import torch
import torchtext
from torchtext.legacy import data
from torchtext.legacy import datasets

#### TEXT, LABEL

In [3]:
TEXT = data.Field()
LABEL = data.LabelField()

In [14]:
print([i for i in dir(TEXT) if i[0] != '_'])

['batch_first', 'build_vocab', 'dtype', 'dtypes', 'eos_token', 'fix_length', 'ignore', 'include_lengths', 'init_token', 'is_target', 'lower', 'numericalize', 'pad', 'pad_first', 'pad_token', 'postprocessing', 'preprocess', 'preprocessing', 'process', 'sequential', 'stop_words', 'tokenize', 'tokenizer_args', 'truncate_first', 'unk_token', 'use_vocab', 'vocab_cls']


In [15]:
print([i for i in dir(LABEL) if i[0] != '_'])

['batch_first', 'build_vocab', 'dtype', 'dtypes', 'eos_token', 'fix_length', 'ignore', 'include_lengths', 'init_token', 'is_target', 'lower', 'numericalize', 'pad', 'pad_first', 'pad_token', 'postprocessing', 'preprocess', 'preprocessing', 'process', 'sequential', 'stop_words', 'tokenize', 'tokenizer_args', 'truncate_first', 'unk_token', 'use_vocab', 'vocab_cls']


#### train / test data split

- datasets.IMDB.splits(text_field, label_field)

In [19]:
legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)  # datasets here refers to torchtext.legacy.datasets

In [20]:
legacy_examples = legacy_train.examples
print(f"TEXT : \n {legacy_examples[0].text}, \
    \n\nLABEL : \n{legacy_examples[0].label}")

TEXT : 
 ['I', 'happened', 'to', 'see', 'this', 'movie', 'twice', 'or', 'more', 'and', 'found', 'it', 'well', 'made!', 'WWII', 'had', 'freshly', 'ended', 'and', 'the', 'so-called', '"Cold', 'War"', 'was', 'about', 'to', 'begin.', 'This', 'movie', 'could,', 'therefore,', 'be', 'defined', 'as', 'one', 'of', 'the', 'best', '"propaganda",', 'patriotic', 'movies', 'preparing', 'Americans', 'and,', 'secondly,', 'people', 'from', 'the', 'still', 'to', 'be', 'formed', '"Western', 'NATO', 'block"', 'of', 'countries', 'to', 'face', 'the', 'next', 'coming', 'menace.', 'The', 'movie', 'celebrates', 'the', 'might', 'of', 'the', 'US,', 'through', 'the', 'centuries,', 'while', 'projecting', 'itself', 'onwards', 'to', 'the', 'then', 'present', 'war,', 'which', 'had', 'just', 'ended.', 'Nice', 'and', 'funny', 'is', 'the', 'way', 'of', 'describing', 'the', 'discovering', 'of', 'the', 'American', 'Continent', 'by', 'Columbus', 'and', 'pretty', 'the', '"espisode"', 'of', 'New', 'Amsterdam', 'and', 'the', 

In [21]:
print([i for i in dir(legacy_train) if i[0] != '_'])

['dirname', 'download', 'examples', 'fields', 'filter_examples', 'iters', 'name', 'sort_key', 'split', 'splits', 'urls']


### New

```

The new API has no Field class.
Instead, it creates an iterator directly over the IMDB dataset bundled with torchtext, yielding one sample at a time.

train_iter = (LABEL line 1, TEXT line 1)

Each raw text sample arrives as a tuple like the one above.

```
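
Because the new API hands back an iterator rather than an indexable dataset, each sample can be consumed only once. Here is a small sketch of that behaviour; `toy_imdb` is a hypothetical stand-in for `IMDB(split=...)` and its two-sample data is made up for illustration.

```python
def toy_imdb(split):
    """Hypothetical stand-in for IMDB(split=...): yields (label, text) tuples."""
    samples = {
        "train": [("neg", "pointless film"), ("pos", "well made movie")],
        "test": [("pos", "nice and funny")],
    }
    for label, text in samples[split]:
        yield label, text

train_iter = toy_imdb("train")
first = next(train_iter)       # consumes the first sample
remaining = list(train_iter)   # consumes everything that is left
exhausted = list(train_iter)   # a second pass yields nothing

print(first)                           # ('neg', 'pointless film')
print(len(remaining), len(exhausted))  # 1 0
```

This is presumably why the cells in this notebook call `IMDB(split=...)` again before every new pass over the data.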

In [32]:
from torchtext.datasets import IMDB

In [33]:
train_iter, test_iter = IMDB(split=('train', 'test'))

In [29]:
example = next(train_iter)

In [34]:
example

('neg',
 "If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />")

In [31]:
print(example[0])

neg


In [32]:
print(example[1])

If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />


## Step 2: Build the data processing pipeline


```python
# ===== torchtext.legacy =====

import torch
from torchtext.legacy import data
from torchtext.legacy import datasets

TEXT = data.Field(tokenize=data.get_tokenizer('basic_english'),
                  init_token='<SOS>', eos_token='<EOS>', lower=True)
LABEL = data.LabelField(dtype = torch.long)
legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)

TEXT.build_vocab(legacy_train)
LABEL.build_vocab(legacy_train)

legacy_vocab = TEXT.vocab
legacy_stoi = legacy_vocab.stoi
legacy_itos = legacy_vocab.itos
TEXT.build_vocab(legacy_train, min_freq=100)
legacy_vocab2 = TEXT.vocab
```




```python
# ===== torchtext =====

from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')

from collections import Counter
from torchtext.vocab import Vocab

train_iter, test_iter = IMDB(split=('train', 'test'))
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = Vocab(counter, min_freq=10, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))

# text_transform: numericalization (used instead of vocab.stoi)
text_transform = lambda x: [vocab['<BOS>']] + [vocab[token] for token in tokenizer(x)] + [vocab['<EOS>']]

# label_transform: numericalization of the label ('pos' -> 1, anything else -> 0)
label_transform = lambda x: 1 if x == 'pos' else 0

```

### Legacy

```
In the legacy API, a default tokenizer is built into the 'Field' class. It is reportedly implemented with Python's split() function.
So if you want a different tokenizer,
you have to obtain it with data.get_tokenizer and pass it in when creating the 'Field' object.

For sequence models, BOS, EOS, and other special tokens have to be handled during preprocessing,
so they are also specified when creating the 'Field' object.
```
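
To make the difference between the two kinds of tokenizer concrete, here is a rough sketch. `simple_normalize_tokenize` only approximates a "normalize first, then split" tokenizer; it is an assumption for illustration and not torchtext's actual `basic_english` implementation.

```python
import re

def whitespace_tokenize(text):
    # Mirrors the default behaviour described above: plain str.split().
    return text.split()

def simple_normalize_tokenize(text):
    # Rough approximation of a "normalize, then split" tokenizer:
    # lowercase, drop double quotes, put spaces around punctuation.
    text = text.lower().replace('"', "")
    text = re.sub(r"([.,!?()])", r" \1 ", text)
    return text.split()

sentence = 'Found it well made! The so-called "Cold War" was about to begin.'
print(whitespace_tokenize(sentence))
print(simple_normalize_tokenize(sentence))
```

With plain splitting, punctuation stays glued to words (`'made!'`, `'begin.'`), so the same word can end up as several distinct vocabulary entries; normalizing first avoids that.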

In [4]:
TEXT = data.Field(tokenize=data.get_tokenizer('basic_english'),
                  init_token='<SOS>', eos_token='<EOS>', lower=True)
LABEL = data.LabelField(dtype = torch.long)
legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)

In [11]:
print(data.get_tokenizer.__doc__)


    Generate tokenizer function for a string sentence.

    Args:
        tokenizer: the name of tokenizer function. If None, it returns split()
            function, which splits the string sentence by space.
            If basic_english, it returns _basic_english_normalize() function,
            which normalize the string first and split by space. If a callable
            function, it will return the function. If a tokenizer library
            (e.g. spacy, moses, toktok, revtok, subword), it returns the
            corresponding library.
        language: Default en

    Examples:
        >>> import torchtext
        >>> from torchtext.data import get_tokenizer
        >>> tokenizer = get_tokenizer("basic_english")
        >>> tokens = tokenizer("You can now install TorchText using pip!")
        >>> tokens
        >>> ['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']

    


In [16]:
# This is the result of tokenizing with the basic_english tokenizer.
# Compared with the default whitespace tokenizer, the text is lowercased and punctuation becomes separate tokens.

print(legacy_train.examples[0].text)

['i', 'happened', 'to', 'see', 'this', 'movie', 'twice', 'or', 'more', 'and', 'found', 'it', 'well', 'made', '!', 'wwii', 'had', 'freshly', 'ended', 'and', 'the', 'so-called', 'cold', 'war', 'was', 'about', 'to', 'begin', '.', 'this', 'movie', 'could', ',', 'therefore', ',', 'be', 'defined', 'as', 'one', 'of', 'the', 'best', 'propaganda', ',', 'patriotic', 'movies', 'preparing', 'americans', 'and', ',', 'secondly', ',', 'people', 'from', 'the', 'still', 'to', 'be', 'formed', 'western', 'nato', 'block', 'of', 'countries', 'to', 'face', 'the', 'next', 'coming', 'menace', '.', 'the', 'movie', 'celebrates', 'the', 'might', 'of', 'the', 'us', ',', 'through', 'the', 'centuries', ',', 'while', 'projecting', 'itself', 'onwards', 'to', 'the', 'then', 'present', 'war', ',', 'which', 'had', 'just', 'ended', '.', 'nice', 'and', 'funny', 'is', 'the', 'way', 'of', 'describing', 'the', 'discovering', 'of', 'the', 'american', 'continent', 'by', 'columbus', 'and', 'pretty', 'the', 'espisode', 'of', 'ne

```
In the legacy API, the vocabulary is built
from the text data associated with the TEXT field. Recall that the data was loaded through TEXT (a Field object) earlier. The legacy API relies on Field objects throughout.
```

In [17]:
TEXT.build_vocab(legacy_train)
LABEL.build_vocab(legacy_train)

```
In the legacy API, a vocabulary supports the following three things.

1. Check the total size of the vocabulary

2. String2Index (stoi) and Index2String (itos)

3. Build a vocabulary that keeps only words appearing at least n times
```
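
The three operations listed above can be illustrated with a tiny hand-rolled vocabulary. `build_vocab` is a hypothetical helper written for this sketch, not the legacy `Vocab` class, and its ordering rule (most frequent first, ties alphabetical) is an assumption.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, specials=("<unk>", "<pad>")):
    """Hypothetical helper: returns (stoi, itos) for tokens seen >= min_freq times."""
    counter = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first; ties broken alphabetically for a stable order.
    kept = sorted((tok for tok, freq in counter.items() if freq >= min_freq),
                  key=lambda tok: (-counter[tok], tok))
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

corpus = [["the", "movie", "was", "funny"],
          ["the", "movie", "was", "dull"],
          ["the", "end"]]

stoi, itos = build_vocab(corpus)
print(len(itos), stoi["the"], itos[2])  # 8 2 the

stoi2, itos2 = build_vocab(corpus, min_freq=2)
print(len(itos2), itos2)  # 5 ['<unk>', '<pad>', 'the', 'movie', 'was']
```

Raising `min_freq` shrinks the vocabulary, and every dropped word is then represented by the `<unk>` index at numericalization time.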

In [29]:
legacy_vocab = TEXT.vocab
print("The length of the legacy vocab is", len(legacy_vocab))
legacy_stoi = legacy_vocab.stoi
print("The index of 'i' is", legacy_stoi['i'])
print("The index of 'funny' is", legacy_stoi['funny'])
legacy_itos = legacy_vocab.itos
print("The token at index 686 is", legacy_itos[686])
print('\n')
# Set the min_freq value when building the vocab
TEXT.build_vocab(legacy_train, min_freq=100)
legacy_vocab2 = TEXT.vocab
print("The length of the legacy vocab (min_freq=100) is", len(legacy_vocab2))

The length of the legacy vocab is 4264
The index of 'i' is 15
The index of 'funny' is 171
The token at index 686 is knew


The length of the legacy vocab (min_freq=100) is 4264


### New

```

The new API has no Field class.
So the tokenizer and the special tokens are specified differently from the way Field handled them.

The tokenizer is created directly with get_tokenizer(), outside of any Field.
The vocabulary is built with the separate Vocab class.

Freed from Field, the API feels more flexible.

```

In [30]:
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')

In [34]:
from collections import Counter
from torchtext.vocab import Vocab

train_iter, test_iter = IMDB(split=('train', 'test'))
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = Vocab(counter, min_freq=10, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))

```python
# Each iteration of `for (label, line) in train_iter:` yields one pair, for example:
#
# label:
#   'neg'
#
# line:
#   "If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />"

```

In [76]:
# text_transform: numericalization (used instead of vocab.stoi)
text_transform = lambda x: [vocab['<BOS>']] + [vocab[token] for token in tokenizer(x)] + [vocab['<EOS>']]

# label_transform: numericalization of the label ('pos' -> 1, anything else -> 0)
label_transform = lambda x: 1 if x == 'pos' else 0

# Print out the output of text_transform
print("input to the text_transform:", "here is an example")
print("output of the text_transform:", text_transform("here is an example"))

input to the text_transform: here is an example
output of the text_transform: [1, 134, 12, 43, 467, 2]
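
What happens to a token that is not in the vocabulary? The `vocab.stoi` printed further down is a `defaultdict` whose default is the unknown-token index, so such tokens should map to `<unk>` (index 0). A minimal sketch of that fallback, using only the indexes visible in this notebook's outputs; `toy_stoi` and `toy_text_transform` are hypothetical names, and whitespace splitting stands in for the real tokenizer.

```python
from collections import defaultdict

# Indexes copied from the outputs shown in this notebook; every other token
# is treated as out-of-vocabulary in this toy mapping.
toy_stoi = defaultdict(lambda: 0, {  # 0 is the index of '<unk>'
    "<unk>": 0, "<BOS>": 1, "<EOS>": 2, "<PAD>": 3,
    "here": 134, "is": 12, "an": 43, "example": 467,
})

def toy_text_transform(text):
    tokens = text.lower().split()  # stand-in for the basic_english tokenizer
    return [toy_stoi["<BOS>"]] + [toy_stoi[tok] for tok in tokens] + [toy_stoi["<EOS>"]]

print(toy_text_transform("here is an example"))  # [1, 134, 12, 43, 467, 2]
print(toy_text_transform("here is an xyzzy"))    # [1, 134, 12, 43, 0, 2]
```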


In [77]:
print(vocab['<BOS>'])
print(vocab['<EOS>'])

1
2


In [94]:
vocab.stoi

defaultdict(<bound method Vocab._default_unk_index of <torchtext.vocab.Vocab object at 0x7f6460c530a0>>,
            {'<unk>': 0,
             '<BOS>': 1,
             '<EOS>': 2,
             '<PAD>': 3,
             'the': 4,
             '.': 5,
             ',': 6,
             'and': 7,
             'a': 8,
             'of': 9,
             'to': 10,
             "'": 11,
             'is': 12,
             'it': 13,
             'in': 14,
             'i': 15,
             'this': 16,
             'that': 17,
             's': 18,
             'was': 19,
             'as': 20,
             'for': 21,
             'with': 22,
             'movie': 23,
             'but': 24,
             'film': 25,
             ')': 26,
             '(': 27,
             'you': 28,
             't': 29,
             'on': 30,
             'not': 31,
             'he': 32,
             'are': 33,
             'his': 34,
             'have': 35,
             'be': 36,
             'one': 37,
     

**Wait, what is Counter?**

- Reference : https://excelsior-cjh.tistory.com/94

In [56]:
from collections import Counter
colors = ['blue', 'blue', 'blue', 'red', 'red']
strings = 'adfklajlka;jva;krjl;vafnjkagn;jl;bj'
counter = Counter(colors)

In [42]:
counter

Counter({'blue': 3, 'red': 2})

In [46]:
print([i for i in dir(counter) if i[0] != '_'])

['clear', 'copy', 'elements', 'fromkeys', 'get', 'items', 'keys', 'most_common', 'pop', 'popitem', 'setdefault', 'subtract', 'update', 'values']


In [57]:
counter_str = Counter(strings)

In [59]:
print(counter_str)

Counter({'a': 6, 'j': 6, ';': 5, 'k': 4, 'l': 4, 'f': 2, 'v': 2, 'n': 2, 'd': 1, 'r': 1, 'g': 1, 'b': 1})


In [61]:
print(counter_str.most_common())

[('a', 6), ('j', 6), (';', 5), ('k', 4), ('l', 4), ('f', 2), ('v', 2), ('n', 2), ('d', 1), ('r', 1), ('g', 1), ('b', 1)]


In [62]:
print(counter_str.most_common()[0])

('a', 6)


In [63]:
counter_str.update({'happy' : 100})

In [65]:
print(counter_str)

Counter({'happy': 100, 'a': 6, 'j': 6, ';': 5, 'k': 4, 'l': 4, 'f': 2, 'v': 2, 'n': 2, 'd': 1, 'r': 1, 'g': 1, 'b': 1})
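
The vocabulary loop in Step 2 relies on one more property of `update()`: called with an iterable, it adds one count per element and accumulates across calls. A short sketch with made-up review strings:

```python
from collections import Counter

reviews = ["a good movie", "a bad movie", "good good fun"]

counter = Counter()
for line in reviews:
    # update() with an iterable adds one count per element.
    counter.update(line.split())

print(counter.most_common(1))  # [('good', 3)]
print(counter["missing"])      # 0, missing keys count as zero
```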


## Step 3: Generate batch iterator

- Build an iterator that generates batches of data


```python
# ===== torchtext.legacy =====

import torch
from torchtext.legacy import data, datasets

# Put batches on the GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)

legacy_train_bucketiterator, legacy_test_bucketiterator = data.BucketIterator.splits(
    (legacy_train, legacy_test),
    sort_key=lambda x: len(x.text),
    batch_size=8,
    device=device)
```




```python
# ===== torchtext =====

import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    
    label_list, text_list = [], []
    
    for (_label, _text) in batch:
        
        label_list.append(label_transform(_label))
        
        processed_text = torch.tensor(text_transform(_text))
        text_list.append(processed_text)
        
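    # Pad with the index of '<PAD>' (3 in the vocab built above); the text tensor has shape (max_len, batch).
    return torch.tensor(label_list), pad_sequence(text_list, padding_value=vocab['<PAD>'])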


import random

train_iter = IMDB(split='train')
train_list = list(train_iter)
batch_size = 8  # A batch size of 8

def batch_sampler():
    
    # indices = [(0, 317), (1, 254), (2, 101), (3, 148), ...]
    indices = [(i, len(tokenizer(s[1]))) for i, s in enumerate(train_list)]
    random.shuffle(indices)
    pooled_indices = []
    # create pool of indices with similar lengths 
    for i in range(0, len(indices), batch_size * 100):
        pooled_indices.extend(sorted(indices[i:i + batch_size * 100], key=lambda x: x[1]))

    pooled_indices = [x[0] for x in pooled_indices]

    # yield indices for current batch
    for i in range(0, len(pooled_indices), batch_size):
        yield pooled_indices[i:i + batch_size]

bucket_dataloader = DataLoader(train_list, batch_sampler=batch_sampler(),
                               collate_fn=collate_batch)

```

### Legacy

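```
The legacy API uses the Iterator class.

What is worth noting is the effect of BucketIterator:
it batches examples of similar length together, which reduces the amount of padding.

```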
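
The benefit of bucketing can be estimated with a few lines of plain Python by counting how many pad tokens a batching scheme needs. This is only a sketch: `padding_tokens` is a hypothetical helper, the lengths are randomly generated rather than real IMDB lengths, and fully sorting the data is a simplification of what `BucketIterator` does.

```python
import random

def padding_tokens(lengths, batch_size):
    """Count pad tokens needed when consecutive items are batched together."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

random.seed(0)
# Hypothetical review lengths in tokens.
lengths = [random.randint(50, 1000) for _ in range(800)]

unsorted_padding = padding_tokens(lengths, batch_size=8)
bucketed_padding = padding_tokens(sorted(lengths), batch_size=8)
print(unsorted_padding, bucketed_padding)
```

Batching in arbitrary order pads every sample up to the longest review in its batch, while grouping similar lengths keeps that gap small.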

References:
- 1) https://gmihaila.medium.com/better-batches-with-pytorchtext-bucketiterator-12804a545e2a
- 2) https://colab.research.google.com/github/gmihaila/ml_things/blob/master/notebooks/pytorch/pytorchtext_bucketiterator.ipynb

In [68]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)

legacy_train_iterator, legacy_test_iterator = data.Iterator.splits(
    (legacy_train, legacy_test), 
    batch_size=8, 
    device = device)

In [69]:
from torchtext.legacy.data import BucketIterator

legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)

legacy_train_bucketiterator, legacy_test_bucketiterator = data.BucketIterator.splits(
    (legacy_train, legacy_test),
    sort_key=lambda x: len(x.text),
    batch_size=8, 
    device = device)

### New

```
The new API uses DataLoader as is.

Instead, two extra DataLoader arguments are specified:


collate_fn
batch_sampler


1) collate_fn

Data preprocessing (numericalization) + padding (pad_sequence)

2) batch_sampler

A generator that yields batches of indices for which the corresponding samples are of similar length

```
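
The padding half of `collate_fn` can be sketched in plain Python. `pad_time_major` is a hypothetical helper that mimics the layout `pad_sequence` produces with its default `batch_first=False`, where each row is a time step and each column a sample; the two input sequences are made up.

```python
PAD_IDX = 3  # index of '<PAD>' in the vocab built earlier in this notebook

def pad_time_major(sequences, pad_idx=PAD_IDX):
    """Pad to the longest sequence and return rows indexed by time step.

    result[t][b] is token t of sample b.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    return [list(step) for step in zip(*padded)]

batch = [[1, 134, 12, 2], [1, 43, 2]]  # two already-numericalized samples
rows = pad_time_major(batch)
for row in rows:
    print(row)
# [1, 1]
# [134, 43]
# [12, 2]
# [2, 3]
```

This layout explains why the first row of the tensor printed by `bucket_dataloader` consists entirely of 1s: it holds the `<BOS>` index of every sample in the batch.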

In [78]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    
    label_list, text_list = [], []
    
    for (_label, _text) in batch:
        
        label_list.append(label_transform(_label))
        
        processed_text = torch.tensor(text_transform(_text))
        text_list.append(processed_text)
        
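    # Pad with the index of '<PAD>' (3 in the vocab built above); the text tensor has shape (max_len, batch).
    return torch.tensor(label_list), pad_sequence(text_list, padding_value=vocab['<PAD>'])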

In [72]:
import random

train_iter = IMDB(split='train')
train_list = list(train_iter)
batch_size = 8  # A batch size of 8

def batch_sampler():
    
    # indices = [(0, 317), (1, 254), (2, 101), (3, 148), ...]
    indices = [(i, len(tokenizer(s[1]))) for i, s in enumerate(train_list)]
    random.shuffle(indices)
    pooled_indices = []
    # create pool of indices with similar lengths 
    for i in range(0, len(indices), batch_size * 100):
        pooled_indices.extend(sorted(indices[i:i + batch_size * 100], key=lambda x: x[1]))

    pooled_indices = [x[0] for x in pooled_indices]

    # yield indices for current batch
    for i in range(0, len(pooled_indices), batch_size):
        yield pooled_indices[i:i + batch_size]

bucket_dataloader = DataLoader(train_list, batch_sampler=batch_sampler(),
                               collate_fn=collate_batch)

print(next(iter(bucket_dataloader)))

(tensor([0, 0, 1, 0, 1, 0, 1, 1]), tensor([[    1,     1,     1,     1,     1,     1,     1,     1],
        [ 1450,    16,     0,   100,    15,     8,    15,    16],
        [14942,    23,   358,   403,   423,    64,   222,    23],
        [10785,   918,   123,     4,     0,  1515,    16,    12],
        [   14,  1251,    24,   255,    64,     7,    23,   391],
        [  862,     0,     8,  2368,    83,  2200,   290,     9],
        [   57,  1118,     0,   133,     5,   186,    30,  2058],
        [   56,    14,   257,     0,    13,    12,   540,     5],
        [   19,     8,   208,    68,    12,   421,   322,    47],
        [   14,  3525,    23,     5,     8,    14,    29,  1174],
        [   17,  1311,    21,    67,    64,    16,     5,  2315],
        [ 3130,     5,   353,   586,     0,  2435,  2038,  1484],
        [   57,   711,    23,     7,    73,   359,     5,     6],
        [    8,    17,  1063,     4,     9,  7277,     7,     4],
        [  392,     6,    74, 14818,    4

In [97]:
train_iter = IMDB(split='train')
train_list = list(train_iter)
batch_size = 8  # A batch size of 8

[(i, len(tokenizer(s[1]))) for i, s in enumerate(train_list)]

[(0, 317),
 (1, 254),
 (2, 101),
 (3, 148),
 (4, 380),
 (5, 135),
 (6, 130),
 (7, 340),
 (8, 586),
 (9, 275),
 (10, 296),
 (11, 165),
 (12, 153),
 (13, 167),
 (14, 409),
 (15, 247),
 (16, 92),
 (17, 1043),
 (18, 99),
 (19, 174),
 (20, 221),
 (21, 186),
 (22, 198),
 (23, 374),
 (24, 121),
 (25, 211),
 (26, 281),
 (27, 123),
 (28, 260),
 (29, 402),
 (30, 183),
 (31, 183),
 (32, 172),
 (33, 218),
 (34, 457),
 (35, 147),
 (36, 177),
 (37, 327),
 (38, 259),
 (39, 143),
 (40, 159),
 (41, 155),
 (42, 974),
 (43, 253),
 (44, 551),
 (45, 161),
 (46, 147),
 (47, 130),
 (48, 165),
 (49, 161),
 (50, 228),
 (51, 218),
 (52, 545),
 (53, 279),
 (54, 325),
 (55, 511),
 (56, 307),
 (57, 159),
 (58, 135),
 (59, 57),
 (60, 232),
 (61, 222),
 (62, 201),
 (63, 389),
 (64, 150),
 (65, 393),
 (66, 307),
 (67, 164),
 (68, 86),
 (69, 843),
 (70, 233),
 (71, 64),
 (72, 163),
 (73, 133),
 (74, 457),
 (75, 133),
 (76, 169),
 (77, 219),
 (78, 136),
 (79, 390),
 (80, 747),
 (81, 147),
 (82, 411),
 (83, 184),
 (84, 

In [99]:
train_list[0]

('neg',
 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

## Step 4: Iterate batch to train a model

```python
# ===== torchtext.legacy =====

for item in legacy_train_bucketiterator:
    model(item)

# Or

next(iter(legacy_train_bucketiterator))
```




```python
# ===== torchtext =====

for idx, (label, text) in enumerate(bucket_dataloader):
    model(text)

# Or

next(iter(bucket_dataloader))
```
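
One caveat when training for more than one epoch: `batch_sampler()` returns a generator, and a generator can be consumed only once, so a second loop over `bucket_dataloader` should yield no batches. A possible workaround is a re-iterable sampler. The class below is a hypothetical sketch (`PooledBatchSampler` and its parameters are not torchtext or PyTorch API) that follows the same pooling idea as `batch_sampler()` in plain Python.

```python
import random

class PooledBatchSampler:
    """Hypothetical re-iterable version of the batch_sampler() generator."""

    def __init__(self, lengths, batch_size, pool_factor=100, seed=None):
        self.lengths = lengths
        self.batch_size = batch_size
        self.pool_size = batch_size * pool_factor
        self.rng = random.Random(seed)

    def __iter__(self):
        # Reshuffle on every pass, i.e. once per epoch.
        indices = list(range(len(self.lengths)))
        self.rng.shuffle(indices)
        for start in range(0, len(indices), self.pool_size):
            # Sort each pool by length so neighbouring samples pad well together.
            pool = sorted(indices[start:start + self.pool_size],
                          key=lambda i: self.lengths[i])
            for b in range(0, len(pool), self.batch_size):
                yield pool[b:b + self.batch_size]

    def __len__(self):
        # Number of batches per epoch (ceiling division).
        return -(-len(self.lengths) // self.batch_size)

# Token counts of the first ten training reviews, as listed earlier in this notebook.
lengths = [317, 254, 101, 148, 380, 135, 130, 340, 586, 275]
sampler = PooledBatchSampler(lengths, batch_size=4, pool_factor=1, seed=0)
epoch1 = list(sampler)
epoch2 = list(sampler)
print(len(epoch1), len(epoch2))  # 3 3
```

Since `DataLoader` accepts any iterable of index lists as `batch_sampler`, an object like this could be passed in place of `batch_sampler()`, with `lengths` computed as `[len(tokenizer(s[1])) for s in train_list]`.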