In [30]:
from model.data_loader import *

### building kaggle dataset

Initially we have only `ner_dataset.csv` in `data/kaggle`. Running `python3 build_kaggle_dataset.py` produces the file structure shown below: the script simply unrolls the sentences and their labels and writes them to separate files, one sentence per line.

In [4]:
!tree .

[01;34m.[00m
├── ner_dataset.csv
├── test
│   ├── labels.txt
│   └── sentences.txt
├── train
│   ├── labels.txt
│   └── sentences.txt
└── val
    ├── labels.txt
    └── sentences.txt

3 directories, 7 files


In [8]:
!head -30 ner_dataset.csv

Sentence #,Word,POS,Tag
Sentence: 1,Thousands,NNS,O
,of,IN,O
,demonstrators,NNS,O
,have,VBP,O
,marched,VBN,O
,through,IN,O
,London,NNP,B-geo
,to,TO,O
,protest,VB,O
,the,DT,O
,war,NN,O
,in,IN,O
,Iraq,NNP,B-geo
,and,CC,O
,demand,VB,O
,the,DT,O
,withdrawal,NN,O
,of,IN,O
,British,JJ,B-gpe
,troops,NNS,O
,from,IN,O
,that,DT,O
,country,NN,O
,.,.,O
Sentence: 2,Families,NNS,O
,of,IN,O
,soldiers,NNS,O
,killed,VBN,O
,in,IN,O


In [11]:
!head -1 train/sentences.txt

Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .


In [12]:
!head -1 train/labels.txt

O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O


In [14]:
! cat train/sentences.txt | wc -l

   33570


In [15]:
! cat train/labels.txt | wc -l

   33570
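
The unrolling step described at the start of this section can be sketched as follows. This is only a minimal illustration, not the actual `build_kaggle_dataset.py`: the function name `unroll_sentences` and the in-memory sample are assumptions, and the sketch does not cover the train/val/test split. It relies only on the CSV layout shown above, where a non-empty `Sentence #` column marks the start of a new sentence.

```python
import csv
import io

def unroll_sentences(csv_file):
    """Group (word, tag) rows into space-separated sentence and label lines.

    A new sentence starts on every row whose first column is non-empty
    (e.g. "Sentence: 2"); rows with an empty first column continue it.
    """
    sentences, labels = [], []
    words, tags = [], []
    for row in csv.DictReader(csv_file):
        if row["Sentence #"] and words:
            # A new sentence begins: flush the one collected so far
            sentences.append(" ".join(words))
            labels.append(" ".join(tags))
            words, tags = [], []
        words.append(row["Word"])
        tags.append(row["Tag"])
    if words:  # flush the last sentence
        sentences.append(" ".join(words))
        labels.append(" ".join(tags))
    return sentences, labels

# A tiny in-memory sample in the same layout as ner_dataset.csv (made-up rows)
sample = io.StringIO(
    "Sentence #,Word,POS,Tag\n"
    "Sentence: 1,London,NNP,B-geo\n"
    ",calls,VBZ,O\n"
    "Sentence: 2,Families,NNS,O\n"
    ",wait,VBP,O\n"
)
sentences, labels = unroll_sentences(sample)
print(sentences)  # ['London calls', 'Families wait']
print(labels)     # ['B-geo O', 'O O']
```

Each entry of `sentences` lines up with the entry of `labels` at the same position, which is why the two files above have the same number of lines.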


### building vocabulary

We now run `python3 build_vocab.py --data_dir data/kaggle` to extract the list of tags and the list of words. The script also stores the dataset parameters in `dataset_params.json`.

In [17]:
!ls

dataset_params.json tags.txt            train               words.txt
ner_dataset.csv     test                val


In [18]:
!head tags.txt

O
B-geo
B-gpe
B-per
I-geo
B-org
I-org
B-tim
B-art
I-art


In [19]:
!head words.txt

Thousands
of
demonstrators
have
marched
through
London
to
protest
the


In [20]:
!cat dataset_params.json

{
    "train_size": 33570,
    "dev_size": 7194,
    "test_size": 7194,
    "vocab_size": 35180,
    "number_of_tags": 17,
    "pad_word": "<pad>",
    "pad_tag": "O",
    "unk_word": "UNK"
}

In [21]:
!cat words.txt | wc -l

   35180


In [22]:
!cat tags.txt | wc -l

      17
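
A rough sketch of how such word and tag lists could be collected is given below. It is not the real `build_vocab.py`: the helper name `build_vocab` is hypothetical, and the sketch assumes that entries are kept in order of first appearance with `<pad>` and `UNK` appended last, which is consistent with the files shown here. The actual script may differ, for example by filtering rare words.

```python
def build_vocab(sentences, labels, pad_word="<pad>", unk_word="UNK"):
    """Collect words and tags in order of first appearance."""
    words, tags = {}, {}
    for sentence, label_line in zip(sentences, labels):
        for word in sentence.split(" "):
            words.setdefault(word, len(words))
        for tag in label_line.split(" "):
            tags.setdefault(tag, len(tags))
    # Special tokens go last (assumption based on the tail of words.txt)
    for special in (pad_word, unk_word):
        words.setdefault(special, len(words))
    return list(words), list(tags)

# Made-up toy input in the format of sentences.txt / labels.txt
words, tags = build_vocab(
    ["London calls", "Families wait in London"],
    ["B-geo O", "O O O B-geo"],
)
print(words)  # ['London', 'calls', 'Families', 'wait', 'in', '<pad>', 'UNK']
print(tags)   # ['B-geo', 'O']
```

Writing one entry per line then gives files whose line counts equal `vocab_size` and `number_of_tags`.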


### data loader

#### constructor

We build standard vocabularies that map words and tags to indices. We use them to vectorize sentences.

In [29]:
data_dir = 'data/kaggle'

In [39]:
!cat data/kaggle/dataset_params.json

{
    "train_size": 33570,
    "dev_size": 7194,
    "test_size": 7194,
    "vocab_size": 35180,
    "number_of_tags": 17,
    "pad_word": "<pad>",
    "pad_tag": "O",
    "unk_word": "UNK"
}

In [32]:
# load json file into dict
json_path = os.path.join(data_dir, 'dataset_params.json')
assert os.path.isfile(json_path), "No json file found at {}, run build_vocab.py".format(json_path)
dataset_params = utils.Params(json_path) 

In [35]:
dataset_params.dict

{'train_size': 33570,
 'dev_size': 7194,
 'test_size': 7194,
 'vocab_size': 35180,
 'number_of_tags': 17,
 'pad_word': '<pad>',
 'pad_tag': 'O',
 'unk_word': 'UNK'}

In [40]:
# loading vocab
vocab_path = os.path.join(data_dir, 'words.txt')
vocab = {}
with open(vocab_path) as f:
    for i, l in enumerate(f.read().splitlines()):
        vocab[l] = i

In [41]:
len(vocab)

35180

In [47]:
list(vocab.items())[:10]

[('Thousands', 0),
 ('of', 1),
 ('demonstrators', 2),
 ('have', 3),
 ('marched', 4),
 ('through', 5),
 ('London', 6),
 ('to', 7),
 ('protest', 8),
 ('the', 9)]

In [48]:
# our vocabulary contains <pad> and UNK
!tail -2 data/kaggle/words.txt

<pad>
UNK


In [49]:
vocab['<pad>'], vocab['UNK']

(35178, 35179)

In [51]:
# setting the indices for UNKnown words and PADding symbols
unk_ind = vocab[dataset_params.unk_word]
pad_ind = vocab[dataset_params.pad_word]
unk_ind, pad_ind

(35179, 35178)

In [52]:
# loading tags (we require this to map tags to their indices)
tags_path = os.path.join(data_dir, 'tags.txt')
tag_map = {}
with open(tags_path) as f:
    for i, t in enumerate(f.read().splitlines()):
        tag_map[t] = i
tag_map

{'O': 0,
 'B-geo': 1,
 'B-gpe': 2,
 'B-per': 3,
 'I-geo': 4,
 'B-org': 5,
 'I-org': 6,
 'B-tim': 7,
 'B-art': 8,
 'I-art': 9,
 'I-per': 10,
 'I-gpe': 11,
 'I-tim': 12,
 'B-nat': 13,
 'B-eve': 14,
 'I-eve': 15,
 'I-nat': 16}

#### `load_sentences_labels()`

This function performs the vectorization required by the `Embedding` layer: it produces a list of lists, where each inner list contains the indices of a sentence's words in the vocabulary. Here is the structure of `data` returned by `load_data()`:
```python
data = {'train': {'data': sentences, # list of lists
                  'labels': labels,
                  'size': len(sentences)}
        ...
       }
```
The sentences still have different lengths, though. We pad them later, batch by batch.

In [58]:
sentences_file, labels_file = 'data/kaggle/train/sentences.txt', \
                              'data/kaggle/train/labels.txt'

In [59]:
sentences = []
labels = []
with open(sentences_file) as f:
    for sentence in f.read().splitlines():
        # replace each token by its index if it is in vocab
        # else use index of UNK_WORD
        s = [vocab[token] if token in vocab 
             else unk_ind
             for token in sentence.split(' ')]
        sentences.append(s)

with open(labels_file) as f:
    for sentence in f.read().splitlines():
        # replace each label by its index
        l = [tag_map[label] for label in sentence.split(' ')]
        labels.append(l)    

In [62]:
len(sentences), dataset_params.dict['train_size']

(33570, 33570)

In [64]:
# we encode a sentence as a list of indices in the vocabulary
sentences[0][:10]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [78]:
[sentences[i][:10] for i in range(5)]

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [22, 1, 23, 24, 11, 9, 25, 26, 9, 27],
 [42, 4, 18, 9, 43, 1, 44, 7, 45, 46],
 [49, 50, 9, 51, 1, 52, 53, 54, 55, 56],
 [61, 8, 62, 63, 9, 64, 1, 9, 65, 66]]

In [80]:
# sentences are not of the same length, 
# so we have to pad them
[len(sentences[i]) for i in range(5)]

[24, 30, 14, 15, 25]

In [66]:
inv_vocab = {v: k for k, v in vocab.items()}
[inv_vocab[i] for i in sentences[0][:10]]

['Thousands',
 'of',
 'demonstrators',
 'have',
 'marched',
 'through',
 'London',
 'to',
 'protest',
 'the']
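
The encoding and decoding steps above can be condensed into two small helpers. This is a sketch on a made-up toy vocabulary; the names `encode`, `decode`, `toy_vocab` and `toy_unk` are hypothetical and not part of the project code. It uses `dict.get` with a default, which behaves the same as the `vocab[token] if token in vocab else unk_ind` expression used above.

```python
# Toy vocabulary; the real one is loaded from words.txt
toy_vocab = {"London": 0, "calls": 1, "<pad>": 2, "UNK": 3}
toy_unk = toy_vocab["UNK"]

def encode(sentence, vocab, unk_index):
    """Map tokens to indices, falling back to unk_index for unseen tokens."""
    return [vocab.get(token, unk_index) for token in sentence.split(" ")]

def decode(indices, vocab):
    """Map indices back to tokens using the inverted vocabulary."""
    inverse = {index: token for token, index in vocab.items()}
    return [inverse[index] for index in indices]

encoded = encode("London calls Paris", toy_vocab, toy_unk)
decoded = decode(encoded, toy_vocab)
print(encoded)  # [0, 1, 3]
print(decoded)  # ['London', 'calls', 'UNK']
```

Note that decoding is lossy for out-of-vocabulary tokens: `Paris` comes back as `UNK`.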

In [68]:
# finally we load vectorized data for train/val/test
def load_sentences_labels(sentences_file, labels_file, d):
    sentences = []
    labels = []
    with open(sentences_file) as f:
        for sentence in f.read().splitlines():
            # replace each token by its index if it is in vocab
            # else use index of UNK_WORD
            s = [vocab[token] if token in vocab 
                 else unk_ind
                 for token in sentence.split(' ')]
            sentences.append(s)

    with open(labels_file) as f:
        for sentence in f.read().splitlines():
            # replace each label by its index
            l = [tag_map[label] for label in sentence.split(' ')]
            labels.append(l)
            
    d['data'] = sentences
    d['labels'] = labels
    d['size'] = len(sentences)


data = {}
types = ['train', 'val']
for split in ['train', 'val', 'test']:
    if split in types:
        sentences_file = os.path.join(data_dir, split, "sentences.txt")
        labels_file = os.path.join(data_dir, split, "labels.txt")
        data[split] = {}
        load_sentences_labels(sentences_file, labels_file, data[split])

In [69]:
len(data)

2

In [70]:
data.keys()

dict_keys(['train', 'val'])

In [71]:
data['train'].keys()

dict_keys(['data', 'labels', 'size'])

In [72]:
data['train']['data'][0][:10]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

#### `data_iterator()`

This function produces batches separately for `train/dev/test`, using `batch_size` from the model's parameters file. It also pads each batch to the following length:
```python
batch_max_len = max([len(s) for s in batch_sentences])
```
It returns a generator; on each iteration we get `batch_data, batch_labels`. Their shapes are shown below: as usual, `(batch_size, batch_max_len)`. These are our sentences, padded to the length of the longest sentence in the batch.

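First let's run `DataLoader`, and then we'll recreate this last function. Note an inconsistency in the padding symbol for tags: `dataset_params.json` declares `pad_tag` as `O`, while the labels in a batch are padded with `-1`. It would be better to store this value in `tag_map` or somewhere similar than to mention it only in the comments of this function.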

In [99]:
data_dir = 'data/kaggle'
json_path = os.path.join(data_dir, 'dataset_params_ex.json')
params = utils.Params(json_path)
dl = DataLoader(data_dir, params)

In [101]:
len(dl.vocab), len(dl.tag_map)

(35180, 17)

In [102]:
list(dl.vocab.items())[:10]

[('Thousands', 0),
 ('of', 1),
 ('demonstrators', 2),
 ('have', 3),
 ('marched', 4),
 ('through', 5),
 ('London', 6),
 ('to', 7),
 ('protest', 8),
 ('the', 9)]

In [103]:
data = dl.load_data(types=['train', 'val', 'test'], data_dir=data_dir)

In [104]:
data.keys()

dict_keys(['train', 'val', 'test'])

In [105]:
data['train'].keys()

dict_keys(['data', 'labels', 'size'])

In [106]:
data['train']['data'][0][:10]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [112]:
# finally let's create an iterator
json_path = os.path.join('experiments/base_model', 'params.json')
params = utils.Params(json_path)
it = dl.data_iterator(data['train'], params)

In [113]:
batch_data, batch_labels = next(it)

In [116]:
batch_data.shape, batch_labels.shape

(torch.Size([5, 30]), torch.Size([5, 30]))

In [118]:
# that's our first sentence padded with <pad>
batch_data[0, :]

tensor([    0,     1,     2,     3,     4,     5,     6,     7,     8,     9,
           10,    11,    12,    13,    14,     9,    15,     1,    16,    17,
           18,    19,    20,    21, 35178, 35178, 35178, 35178, 35178, 35178])

In [119]:
batch_labels[0, :]

tensor([ 0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,
         2,  0,  0,  0,  0,  0, -1, -1, -1, -1, -1, -1])
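
The `-1` entries mark the padded positions of the labels. The notebook does not show how they are consumed, but a value outside the valid tag range makes it easy to skip padding when scoring predictions. The sketch below illustrates that idea on plain lists; `masked_accuracy`, `PAD_LABEL` and the example values are hypothetical and are not taken from the project code.

```python
PAD_LABEL = -1  # value seen in the padded positions of batch_labels

def masked_accuracy(predicted, gold, pad_label=PAD_LABEL):
    """Accuracy over real tokens only; padded positions are skipped."""
    pairs = [(p, g) for p, g in zip(predicted, gold) if g != pad_label]
    if not pairs:
        return 0.0
    return sum(p == g for p, g in pairs) / len(pairs)

gold = [0, 1, 0, 2, -1, -1]     # last two positions are padding
predicted = [0, 1, 0, 0, 0, 0]  # made-up model output
accuracy = masked_accuracy(predicted, gold)
print(accuracy)  # 0.75
```

Had the labels been padded with `O` (index 0) instead, the two padded positions would have counted as correct predictions and inflated the score.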

Now let's try to recreate `data_iterator()`.

In [120]:
data.keys()

dict_keys(['train', 'val', 'test'])

In [123]:
data = data['train']

In [124]:
data.keys()

dict_keys(['data', 'labels', 'size'])

In [121]:
params.dict

{'learning_rate': 0.001,
 'batch_size': 5,
 'num_epochs': 10,
 'lstm_hidden_dim': 50,
 'embedding_dim': 50,
 'save_summary_steps': 100,
 'cuda': 0}

In [122]:
pad_ind

35178

In [125]:
# how many batches do we have
(data['size']+1)//params.batch_size

6714

In [126]:
# recall that the training data contains 33570 sentences
# and batch_size = 5 (see above)
data['size'], data['size'] / 5 

(33570, 6714.0)

In [127]:
33570 / 5

6714.0
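
Here the training set size is an exact multiple of the batch size, so every way of counting batches agrees. As a hedged illustration of what happens otherwise, the sketch below compares the `(size + 1) // batch_size` expression used above with plain floor and ceiling division; the helper name `batch_counts` and the second example size are made up.

```python
import math

def batch_counts(size, batch_size):
    """Return (notebook formula, full batches, batches incl. a partial one)."""
    return (
        (size + 1) // batch_size,      # expression used in the cell above
        size // batch_size,            # only complete batches
        math.ceil(size / batch_size),  # also counts a final partial batch
    )

exact = batch_counts(33570, 5)
uneven = batch_counts(33572, 5)
print(exact)   # (6714, 6714, 6714)
print(uneven)  # (6714, 6714, 6715)
```

With 33572 sentences, the first two counts leave the last two sentences out of the epoch, whereas the ceiling count would include them as a smaller final batch.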

Let's create one batch of data. It should have shape `(5, batch_max_len)`, so we first compute `batch_max_len`. To pad, we create `numpy` arrays filled with the padding symbol and copy our data into them. Finally, we convert the `numpy` arrays into `pytorch` tensors.

In [128]:
# let's take first five sentences
order = list(range(data['size']))
i = 0
batch_sentences = [data['data'][idx] for idx in 
                   order[i*params.batch_size:(i+1)*params.batch_size]]
batch_tags = [data['labels'][idx] for idx in 
              order[i*params.batch_size:(i+1)*params.batch_size]]
len(batch_sentences)

5

In [129]:
data['data'][0][:10]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [130]:
batch_sentences[0][:10]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [131]:
[len(s) for s in batch_sentences]

[24, 30, 14, 15, 25]

In [132]:
batch_max_len = max([len(s) for s in batch_sentences])
batch_max_len

30

In [144]:
batch_data = pad_ind*np.ones((len(batch_sentences), batch_max_len))

In [145]:
batch_data.shape

(5, 30)

In [146]:
# it contains padding symbol at all positions
batch_data[0].astype(int)

array([35178, 35178, 35178, 35178, 35178, 35178, 35178, 35178, 35178,
       35178, 35178, 35178, 35178, 35178, 35178, 35178, 35178, 35178,
       35178, 35178, 35178, 35178, 35178, 35178, 35178, 35178, 35178,
       35178, 35178, 35178])

In [139]:
# first sentence is indeed 24 words long
# so we expect padding of 6
cur_len = len(batch_sentences[0])
cur_len

24

In [147]:
batch_data[0][:cur_len] = batch_sentences[0]

In [148]:
batch_data[0].astype(int)

array([    0,     1,     2,     3,     4,     5,     6,     7,     8,
           9,    10,    11,    12,    13,    14,     9,    15,     1,
          16,    17,    18,    19,    20,    21, 35178, 35178, 35178,
       35178, 35178, 35178])
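
The same padding can be applied to every sentence and tag list in the batch. Below is a dependency-free sketch that uses plain lists instead of the `numpy` fill-and-copy approach shown above. The function name `pad_batch` and the toy values are hypothetical, and the `-1` label padding is taken from the `batch_labels` output seen earlier.

```python
def pad_batch(batch_sentences, batch_tags, pad_index, pad_label=-1):
    """Right-pad every sentence and tag list to the batch maximum length."""
    batch_max_len = max(len(s) for s in batch_sentences)
    padded_data = [s + [pad_index] * (batch_max_len - len(s))
                   for s in batch_sentences]
    padded_labels = [t + [pad_label] * (batch_max_len - len(t))
                     for t in batch_tags]
    return padded_data, padded_labels

# Made-up toy batch: two sentences of length 3 and 2
toy_sentences = [[0, 1, 2], [3, 4]]
toy_tags = [[0, 1, 0], [2, 0]]
padded_data, padded_labels = pad_batch(toy_sentences, toy_tags, pad_index=99)
print(padded_data)    # [[0, 1, 2], [3, 4, 99]]
print(padded_labels)  # [[0, 1, 0], [2, 0, -1]]
```

The resulting nested lists are rectangular, so they could then be converted to tensors, for example with `torch.tensor(padded_data)`.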