# Build vocabulary and data iterator

In this notebook we are going to create the vocabulary object that will be responsible for:
- Creating the dataset's vocabulary.
- Filtering the dataset by removing rare words and trimming long sentences.
- Mapping words to their numerical representation (word2index) and back (index2word).
- Enabling the use of pre-trained word vectors.
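
As a rough illustration of the word2index / index2word idea, here is a minimal sketch on a toy corpus. The sentences, the `encode` helper and the choice to order words by frequency are assumptions made for this example, not the notebook's implementation.

```python
from collections import Counter

# Toy corpus; the sentences and the special-token names are illustrative assumptions
corpus = ["good movie good plot", "bad movie"]
special_tokens = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]

# Count word occurrences across the corpus
word_count = Counter(word for line in corpus for word in line.split())

# Special tokens take the first indices, then words ordered by frequency
word2index = {token: i for i, token in enumerate(special_tokens)}
for word, _ in word_count.most_common():
    word2index[word] = len(word2index)

# Reverse mapping from index back to word
index2word = {index: word for word, index in word2index.items()}


def encode(sentence):
    """Map a sentence to indices, falling back to <UNK> and appending <EOS>."""
    unk = word2index["<UNK>"]
    return [word2index.get(w, unk) for w in sentence.split()] + [word2index["<EOS>"]]


encoded = encode("good unseen movie")
decoded = [index2word[i] for i in encoded]
print(encoded)
print(decoded)
```

A word that never appeared in the corpus ("unseen" here) is encoded with the `<UNK>` index, so decoding does not recover it.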


The second object to create is a data iterator whose tasks will be:
- Sorting dataset examples.
- Generating batches.
- Padding sequences.
- Allowing a BatchIterator instance to iterate through all batches.

Let's begin by importing all the necessary libraries.

In [1]:
import pandas as pd
import numpy as np
import re
import torch
from collections import defaultdict, Counter
from pprint import pprint
import warnings
warnings.filterwarnings('ignore')

Next, we build the vocabulary class that implements all the features listed at the beginning of this notebook. The class should also support pre-trained word vectors and construct the corresponding weights matrix. To do that, we have to supply it with a file of pre-trained vectors.

GloVe vectors can be downloaded from:
https://nlp.stanford.edu/projects/glove/
<br>
fastText word vectors can be found at:
https://fasttext.cc/docs/en/english-vectors.html
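
To make the idea of a weights matrix concrete, the sketch below builds one from a tiny in-memory stand-in for a GloVe-style text file (one token per line followed by its vector). The words, vector values, `word2index` mapping and the random seed are all made up for illustration; only the overall approach (copy a vector when the word is found, otherwise draw a random one) mirrors what the class below does.

```python
import io
import numpy as np

# Tiny stand-in for a GloVe text file; words and values are illustrative only
fake_glove_file = io.StringIO(
    "movie 0.1 0.2 0.3\n"
    "film 0.4 0.5 0.6\n"
)

# Parse each line into a word -> vector entry
word2vector = {}
for line in fake_glove_file:
    word, *values = line.rstrip().split(" ")
    word2vector[word] = np.array([float(v) for v in values], dtype=np.float32)

# Hypothetical vocabulary mapping
word2index = {"<PAD>": 0, "<UNK>": 1, "movie": 2, "film": 3, "plot": 4}
emb_dim = 3

rng = np.random.default_rng(0)
weights_matrix = np.zeros((len(word2index), emb_dim), dtype=np.float32)
words_found = 0

for word, index in word2index.items():
    if word in word2vector:
        # Row `index` holds the pre-trained vector of the word
        weights_matrix[index] = word2vector[word]
        words_found += 1
    else:
        # Words without a pre-trained vector get a random initialization
        weights_matrix[index] = rng.normal(scale=0.6, size=emb_dim)

print('{}/{} vectors found'.format(words_found, len(word2index)))
```

A matrix like this can later be passed to an embedding layer, for example through `torch.nn.Embedding.from_pretrained` after converting it to a tensor.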

In [2]:
class Vocab:
    
    """The Vocab class is responsible for:
    Creating dataset's vocabulary.
    Filtering dataset in terms of the rare words occurrence and sentences lengths.
    Mapping words to their numerical representation (word2index) and reverse (index2word).
    Enabling the use of pre-trained word vectors.


    Parameters
    ----------
    dataset : pandas.DataFrame or numpy.ndarray
        Pandas or numpy dataset whose first column contains the input strings to process and whose last
        column contains the non-string target variable.
    target_col: int, optional (default=None)
        Index of the column containing target strings to process.
    word2index: dict, optional (default=None)
        Specify the word2index mapping.
    sos_token: str, optional (default='<SOS>')
        Start of sentence token.
    eos_token: str, optional (default='<EOS>')
        End of sentence token.
    unk_token: str, optional (default='<UNK>')
        Token that represents unknown words.
    pad_token: str, optional (default='<PAD>')
        Token that represents padding.
    min_word_count: float, optional (default=5)
        If min_word_count > 1, keep only words that occur at least min_word_count times. If min_word_count <= 1,
        keep all words whose count is greater than or equal to the quantile=min_word_count of the count
        distribution.
    max_vocab_size: int, optional (default=None)
        Maximum size of the vocabulary.
    max_seq_len: float, optional (default=0.8)
        Specify the maximum length of the sequence in the dataset, if max_seq_len > 1. If max_seq_len <= 1 then set
        the maximum length to the value corresponding to quantile=max_seq_len of the lengths distribution. Trim all
        sequences whose lengths are greater than max_seq_len.
    use_pretrained_vectors: boolean, optional (default=False)
        Whether to use pre-trained Glove vectors.
    glove_path: str, optional (default='Glove/')
        Path to the directory that contains files with the Glove word vectors.
    glove_name: str, optional (default='glove.6B.100d.txt')
        Name of the Glove word vectors file. Available pretrained vectors:
        glove.6B.50d.txt
        glove.6B.100d.txt
        glove.6B.200d.txt
        glove.6B.300d.txt
        glove.twitter.27B.50d.txt
        To use different word vectors, load their file to the vectors directory (Glove/).
    weights_file_name: str, optional (default='Glove/weights.npy')
        The path and the name of the numpy file to which save weights vectors.

    Raises
    ------
    ValueError('Use min_word_count or max_vocab_size, not both!')
        If both min_word_count and max_vocab_size are provided.
    FileNotFoundError
        If the GloVe file doesn't exist in the given directory.

    """
    
    
    def __init__(self, dataset, target_col=None, word2index=None, sos_token='<SOS>', eos_token='<EOS>', unk_token='<UNK>',
             pad_token='<PAD>', min_word_count=5, max_vocab_size=None, max_seq_len=0.8,
             use_pretrained_vectors=False, glove_path='Glove/', glove_name='glove.6B.100d.txt',
             weights_file_name='Glove/weights.npy'):
        
        # Convert pandas dataframe to numpy.ndarray
        if isinstance(dataset, pd.DataFrame):
            dataset = dataset.to_numpy()
        
        self.dataset = dataset
        self.target_col = target_col
        
        if self.target_col:
            self.y_lengths = []
            
        self.x_lengths = []
        self.word2idx_mapping = word2index
        
        # Define word2idx and idx2word as empty dictionaries
        if self.word2idx_mapping:
            self.word2index = self.word2idx_mapping
        else:
            self.word2index = defaultdict(dict)
            self.index2word = defaultdict(dict)            
        
        # Instantiate special tokens
        self.sos_token = sos_token
        self.eos_token = eos_token
        self.unk_token = unk_token
        self.pad_token = pad_token
        
        # Instantiate min_word_count, max_vocab_size and max_seq_len
        self.min_word_count = min_word_count
        self.max_vocab_size = max_vocab_size
        self.max_seq_len = max_seq_len
        
        self.use_pretrained_vectors = use_pretrained_vectors
        
        if self.use_pretrained_vectors: 
            self.glove_path = glove_path
            self.glove_name = glove_name
            self.weights_file_name = weights_file_name
        
        self.build_vocab()
        
        
    def build_vocab(self):
        """Build the vocabulary, filter dataset sequences and create the weights matrix if specified.
        
        """
        # Create a dictionary that maps words to their count
        self.word_count = self.word2count()

        # Trim the vocabulary
        # Get rid of out-of-vocabulary words from the dataset
        if self.min_word_count or self.max_vocab_size:
            self.trimVocab()
            self.trimDatasetVocab()

        # Trim sequences in terms of length
        if self.max_seq_len:
            if self.x_lengths:
                self.trimSeqLen()

            else:
                # Calculate sequences lengths
                self.x_lengths = [len(seq.split()) for seq in self.dataset[:, 0]]
                
                if self.target_col:
                    self.y_lengths = [len(seq.split()) for seq in self.dataset[:, self.target_col]]
                    
                self.trimSeqLen()                

                
        # Map each token to an index
        if not self.word2idx_mapping:
            self.mapWord2index()
               
        # Create index2word mapping
        self.index2word = {index: word for word, index in self.word2index.items()}
        
        # Map dataset tokens to indices
        self.mapWords2indices()
        
        # Create weights matrix based on Glove vectors
        if self.use_pretrained_vectors:
            self.glove_vectors()       
        
            
    def word2count(self):
        """Count the number of words occurrences.
        
        """
        # Instantiate the Counter object
        word_count = Counter()

        # Iterate through the dataset and count tokens
        for line in self.dataset[:, 0]:
            word_count.update(line.split())

        # Include strings from the target column (counted once, outside the input loop)
        if self.target_col:
            for line in self.dataset[:, self.target_col]:
                word_count.update(line.split())

        return word_count
    

    def trimVocab(self):
        """Trim the vocabulary in terms of the minimum word count or the vocabulary maximum size.
        
        """
        # Trim the vocabulary in terms of the minimum word count
        if self.min_word_count and not self.max_vocab_size:
            # If min_word_count <= 1, use the quantile approach
            if self.min_word_count <= 1:
                # Create the list of words count
                word_stat = [count for count in self.word_count.values()]
                # Calculate the quantile of words count
                quantile = int(np.quantile(word_stat, self.min_word_count))
                print('Trimmed vocabulary using as minimum count threshold: quantile({:3.2f}) = {}'.\
                      format(self.min_word_count, quantile))
                # Filter words using quantile threshold
                self.trimmed_word_count = {word: count for word, count in self.word_count.items() if count >= quantile}
            # If min_word_count > 1 use standard approach
            else:
                # Filter words using count threshold
                self.trimmed_word_count = {word: count for word, count in self.word_count.items()\
                                   if count >= self.min_word_count}
                print('Trimmed vocabulary using as minimum count threshold: count = {:3.2f}'.format(self.min_word_count))
                     
        # Trim the vocabulary in terms of its maximum size
        elif self.max_vocab_size and not self.min_word_count:
            self.trimmed_word_count = {word: count for word, count in self.word_count.most_common(self.max_vocab_size)}
            print('Trimmed vocabulary using maximum size of: {}'.format(self.max_vocab_size))
        else:
            raise ValueError('Use min_word_count or max_vocab_size, not both!')
            
        print('{}/{} tokens has been retained'.format(len(self.trimmed_word_count.keys()),
                                                     len(self.word_count.keys())))

    
    def trimDatasetVocab(self):
        """Get rid of rare words from the dataset sequences.
        
        """
        for row in range(self.dataset.shape[0]):
            trimmed_x = [word for word in self.dataset[row, 0].split() if word in self.trimmed_word_count.keys()]
            self.x_lengths.append(len(trimmed_x))
            self.dataset[row, 0] = ' '.join(trimmed_x)
        print('Trimmed input strings vocabulary')
                            
        if self.target_col:
            for row in range(self.dataset.shape[0]):
                trimmed_y = [word for word in self.dataset[row, self.target_col].split()\
                             if word in self.trimmed_word_count.keys()]
                self.y_lengths.append(len(trimmed_y))
                self.dataset[row, self.target_col] = ' '.join(trimmed_y)
            print('Trimmed target strings vocabulary')
            
                
    def trimSeqLen(self):
        """Trim dataset sequences in terms of the length.
        
        """
        if self.max_seq_len <= 1:
            x_threshold = int(np.quantile(self.x_lengths, self.max_seq_len)) 
            if self.target_col:
                y_threshold = int(np.quantile(self.y_lengths, self.max_seq_len)) 
        else:
            x_threshold = self.max_seq_len
            if self.target_col:
                y_threshold =  self.max_seq_len
        
        if self.target_col:      
            for row in range(self.dataset.shape[0]):
                x_truncated = ' '.join(self.dataset[row, 0].split()[:x_threshold])\
                if self.x_lengths[row] > x_threshold else self.dataset[row, 0]
                
                # Add 1 if the EOS token is going to be added to the sequence
                self.x_lengths[row] = len(x_truncated.split()) if not self.eos_token else \
                                      len(x_truncated.split()) + 1
                
                self.dataset[row, 0] = x_truncated
                
                y_truncated = ' '.join(self.dataset[row, self.target_col].split()[:y_threshold])\
                if self.y_lengths[row] > y_threshold else self.dataset[row, self.target_col]
                
                # Add 1 or 2 to the length to include special tokens
                y_length = len(y_truncated.split())
                if self.sos_token and not self.eos_token:
                    y_length = len(y_truncated.split()) + 1
                elif self.eos_token and not self.sos_token:
                    y_length = len(y_truncated.split()) + 1
                elif self.sos_token and self.eos_token:
                    y_length = len(y_truncated.split()) + 2
                    
                self.y_lengths[row] = y_length
                
                self.dataset[row, self.target_col] = y_truncated
                
            print('Trimmed input sequences lengths to the length of: {}'.format(x_threshold))
            print('Trimmed target sequences lengths to the length of: {}'.format(y_threshold))
            
        else:
            for row in range(self.dataset.shape[0]):

                x_truncated = ' '.join(self.dataset[row, 0].split()[:x_threshold])\
                if self.x_lengths[row] > x_threshold else self.dataset[row, 0]
                
                # Add 1 if the EOS token is going to be added to the sequence
                self.x_lengths[row] = len(x_truncated.split()) if not self.eos_token else \
                                      len(x_truncated.split()) + 1
                
                self.dataset[row, 0] = x_truncated
                
            print('Trimmed input sequences lengths to the length of: {}'.format(x_threshold))
                
        
    def mapWord2index(self):
        """Populate vocabulary word2index dictionary.
        
        """
        # Add special tokens as first elements in word2index dictionary
        token_count = 0
        for token in [self.pad_token, self.sos_token, self.eos_token, self.unk_token]:
            if token:
                self.word2index[token] = token_count
                token_count += 1
        
        # If vocabulary is trimmed, use trimmed_word_count
        if self.min_word_count or self.max_vocab_size:
            for key in self.trimmed_word_count.keys():
                self.word2index[key] = token_count
                token_count += 1
            
        # If vocabulary is not trimmed, iterate through dataset    
        else:
            # self.dataset is a numpy.ndarray at this point, so index it directly (no .iloc)
            for line in self.dataset[:, 0]:
                for word in line.split():
                    if word not in self.word2index.keys():
                        self.word2index[word] = token_count
                        token_count += 1
            # Include strings from target column
            if self.target_col:
                for line in self.dataset[:, self.target_col]:
                    for word in line.split():
                        if word not in self.word2index.keys():
                            self.word2index[word] = token_count
                            token_count += 1
                            
        self.word2index.default_factory = lambda: self.word2index[self.unk_token]
                            
        
    def mapWords2indices(self):
        """Iterate through the dataset to map each word to its corresponding index.
        Use special tokens if specified.
        
        """
        for row in range(self.dataset.shape[0]):
            words2indices = []
            for word in self.dataset[row, 0].split():
                words2indices.append(self.word2index[word])
                    
            # Append the end of the sentence token
            if self.eos_token:
                words2indices.append(self.word2index[self.eos_token])
                
            self.dataset[row, 0] = np.array(words2indices)
                
        # Map strings from target column
        if self.target_col:
            for row in range(self.dataset.shape[0]):
                words2indices = []
                
                # Insert the start of the sentence token
                if self.sos_token:
                    words2indices.append(self.word2index[self.sos_token])
                    
                for word in self.dataset[row, self.target_col].split():
                    words2indices.append(self.word2index[word])

                        
                # Append the end of the sentence token
                if self.eos_token:
                    words2indices.append(self.word2index[self.eos_token])
                    
                self.dataset[row, self.target_col] = np.array(words2indices)
           
        print('Mapped words to indices')

    
    def glove_vectors(self):
        """ Read glove vectors from a file, create the matrix of weights mapping vocabulary tokens to vectors.
        Save the weights matrix to the numpy file.
        
        """
        # Load Glove word vectors to the pandas dataframe
        try:
            gloves = pd.read_csv(self.glove_path + self.glove_name, sep=" ", quoting=3, header=None, index_col=0)
        except FileNotFoundError:
            print('File: {} not found in: {} directory'.format(self.glove_name, self.glove_path))
            # Re-raise so that execution stops here instead of failing later on an undefined variable
            raise
            
        # Map Glove words to vectors
        print('Start creating glove_word2vector dictionary')
        self.glove_word2vector = gloves.T.to_dict(orient='list')
        
        # Extract embedding dimension
        emb_dim = int(re.findall(r'\d+', self.glove_name)[-1])
        # Length of the vocabulary
        matrix_len = len(self.word2index)
        # Initialize the weights matrix
        weights_matrix = np.zeros((matrix_len, emb_dim))
        words_found = 0

        # Populate the weights matrix
        for word, index in self.word2index.items():
            try: 
                weights_matrix[index] = np.array(self.glove_word2vector[word])
                words_found += 1
            except KeyError:
                # If vector wasn't found in Glove, initialize random vector
                weights_matrix[index] = np.random.normal(scale=0.6, size=(emb_dim, ))
         
        # Save the weights matrix into numpy file
        np.save(self.weights_file_name, weights_matrix, allow_pickle=False)
        
        # Delete glove_word2vector variable to free the memory
        del self.glove_word2vector
                        
        print('Extracted {}/{} of pre-trained word vectors.'.format(words_found, matrix_len))
        print('{} vectors initialized to random numbers'.format(matrix_len - words_found))
        print('Weights vectors saved into {}'.format(self.weights_file_name))
                
                

Now that the Vocab class is ready, we can test it. First, we load the dataset that will be processed and used to build the vocabulary.

In [3]:
# Load the training set
train_dataset = pd.read_csv('dataset/datasets_feat_clean/train_feat_clean.csv', 
                      usecols=['clean_review', 'polarity', 'word_count', 'label'],
                      dtype={'clean_review': str, 'label': np.int16})

In [4]:
# Change the columns order
train_dataset = train_dataset[['clean_review', 'polarity', 'word_count', 'label']]

In [5]:
# Display the first 5 rows from the dataset
train_dataset.head()

| | clean_review | polarity | word_count | label |
|---|---|---|---|---|
| 0 | amaze good wonderful film early ninety franchi... | 0.2482 | 391 | 1 |
| 1 | wrong end see tell chick go crazy eat old woma... | 0.1763 | 145 | 0 |
| 2 | guess emperor clothe see list pbs night hopefu... | 0.1145 | 165 | 0 |
| 3 | earth well movie funny sweet good plot unique ... | 0.381 | 55 | 1 |
| 4 | doe eye high school student kathleen beller fi... | 0.2095 | 688 | 1 |


Below we instantiate the Vocab class, which immediately starts processing the dataset. Once it has finished, we can inspect the vocab attributes to check that everything was created properly.

In [6]:
train_vocab = Vocab(train_dataset, target_col=None, word2index=None, sos_token='<SOS>', eos_token='<EOS>',
                    unk_token='<UNK>', pad_token='<PAD>', min_word_count=None, max_vocab_size=5000, max_seq_len=0.8,
                    use_pretrained_vectors=False, glove_path='glove/', glove_name='glove.6B.100d.txt',
                    weights_file_name='glove/weights.npy')

Trimmed vocabulary using maximum size of: 5000
5000/130416 tokens has been retained
Trimmed input strings vocabulary
Trimmed input sequences lengths to the length of: 121
Mapped words to indices
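
The messages above report two trimming steps: keeping the most frequent words and cutting sequences at a length quantile. The sketch below reproduces both steps on made-up reviews. The data, `k` and the variable names are assumptions for illustration, and the code is a simplified stand-in for the class methods rather than a copy of them.

```python
from collections import Counter
import numpy as np

# Toy reviews, made up for illustration
reviews = [
    "good film good plot",
    "bad film",
    "good film bad act long boring scene good film bad film",
    "film",
    "good bad film",
]

# Keep only the k most frequent words (k plays the role of max_vocab_size)
k = 3
word_count = Counter(word for line in reviews for word in line.split())
kept = {word for word, _ in word_count.most_common(k)}

# Drop out-of-vocabulary words from each review and measure the new lengths
trimmed = [[w for w in line.split() if w in kept] for line in reviews]
lengths = [len(seq) for seq in trimmed]

# A max_seq_len of 0.8 is interpreted as a quantile of the length distribution
threshold = int(np.quantile(lengths, 0.8))

# Cut every sequence that is longer than the threshold
truncated = [seq[:threshold] for seq in trimmed]

print('kept words:', sorted(kept))
print('lengths before truncation:', lengths)
print('length threshold:', threshold)
```

With a quantile-based threshold, only the longest sequences are cut, while most examples keep their full length.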


In [7]:
# Display word2index dictionary
train_vocab.word2index

defaultdict(<function __main__.Vocab.mapWord2index.<locals>.<lambda>()>,
            {'<PAD>': 0,
             '<SOS>': 1,
             '<EOS>': 2,
             '<UNK>': 3,
             'movie': 4,
             'film': 5,
             'like': 6,
             'time': 7,
             'good': 8,
             'character': 9,
             'watch': 10,
             'see': 11,
             'story': 12,
             'think': 13,
             'well': 14,
             'scene': 15,
             'great': 16,
             'look': 17,
             'know': 18,
             'end': 19,
             'people': 20,
             'go': 21,
             'bad': 22,
             'get': 23,
             'love': 24,
             'act': 25,
             'play': 26,
             'way': 27,
             'come': 28,
             'thing': 29,
             'find': 30,
             'man': 31,
             'make': 32,
             'plot': 33,
             'work': 34,
             'actor': 35,
             'want': 36,
  

In [8]:
# Depict the first dataset sequence
train_vocab.dataset[0][0]

array([ 281,    8,  250,    5,  154, 2455,  373,  509, 1076, 1197,  796,
        872,  328,   84,  261, 2533, 2533,   64, 1650,  499,  203,    5,
         94,   11, 2281,  979,  796,  381, 1125,  788,  904, 4835,  979,
         71,  392,  450, 1167,  450,  174,  979, 1497, 2958,  513,  186,
        552, 2174, 1275,  979,  875, 1232, 4316, 2778,  617, 3143, 1233,
        151,   72,  618,   20,  518,  712,   32, 1180,  825,  137,  206,
       1566,   13, 2027, 1815,   57,  133,   40,  620, 1795, 1854,  253,
       2369, 4904, 2233, 1954, 4836,  170,  361, 2174,  133,  299, 3461,
          9,  373, 1228,  170, 2424,  273,  111,   89, 1592, 3248, 2554,
       2778,  253, 1541, 1012,    9, 4948,   76,  337,  846,  979,  875,
       1181,  873,  180,  316, 1012,    9,  562, 2586, 1264,  659, 1529,
          2])

In [9]:
# Load the validation set
val_dataset = pd.read_csv('dataset/datasets_feat_clean/val_feat_clean.csv', 
                      usecols=['clean_review', 'polarity', 'word_count','label'],
                      dtype={'clean_review': str, 'label': np.int16})

In [10]:
# Change the columns order
val_dataset = val_dataset[['clean_review', 'polarity', 'word_count', 'label']]

In [11]:
# Display the first 5 rows from the dataset
val_dataset.head()

| | clean_review | polarity | word_count | label |
|---|---|---|---|---|
| 0 | go movie twice week sum word normally use ligh... | 0.272 | 155 | 1 |
| 1 | year big fan park work old boy time favorite.w... | -0.05667 | 119 | 0 |
| 2 | movie potential handle differently need differ... | -0.012695 | 152 | 0 |
| 3 | movie difficult review give away plot suffice ... | 0.1484 | 173 | 1 |
| 4 | plot worth discussion hint corruption murder p... | 0.2153 | 105 | 0 |


In [12]:
val_vocab = Vocab(val_dataset, target_col=None, word2index=train_vocab.word2index, sos_token='<SOS>', eos_token='<EOS>',
                  unk_token='<UNK>', pad_token='<PAD>', min_word_count=None, max_vocab_size=5000, max_seq_len=0.8,
                  use_pretrained_vectors=False, glove_path='Glove/', glove_name='glove.6B.100d.txt',
                  weights_file_name='Glove/weights.npy')

Trimmed vocabulary using maximum size of: 5000
5000/59089 tokens has been retained
Trimmed input strings vocabulary
Trimmed input sequences lengths to the length of: 119
Mapped words to indices


In [13]:
# Display word2index dictionary
val_vocab.word2index

defaultdict(<function __main__.Vocab.mapWord2index.<locals>.<lambda>()>,
            {'<PAD>': 0,
             '<SOS>': 1,
             '<EOS>': 2,
             '<UNK>': 3,
             'movie': 4,
             'film': 5,
             'like': 6,
             'time': 7,
             'good': 8,
             'character': 9,
             'watch': 10,
             'see': 11,
             'story': 12,
             'think': 13,
             'well': 14,
             'scene': 15,
             'great': 16,
             'look': 17,
             'know': 18,
             'end': 19,
             'people': 20,
             'go': 21,
             'bad': 22,
             'get': 23,
             'love': 24,
             'act': 25,
             'play': 26,
             'way': 27,
             'come': 28,
             'thing': 29,
             'find': 30,
             'man': 31,
             'make': 32,
             'plot': 33,
             'work': 34,
             'actor': 35,
             'want': 36,
  

In [14]:
# Display the dataset sequence at index 10
val_vocab.dataset[10][0]

array([ 387,  458,   38,   11,    5, 1011,   64, 1949,    5, 1447, 1641,
          4,    3, 2101, 2861,   23,  585,  794,    5,   26,  292, 2768,
         49,   62,  165,  663,   44,   31,   36,  226,   26, 1279,  809,
          5, 3140,   62,  165,  119,   74,  132,   19, 2768, 4237, 1272,
         50,  730,  473,    6,    5,  795, 1641,   33, 1554,   12,  685,
        423,  222,   57,   28,  151,   23,  333,   47,    5,   77,   22,
        809,  237, 1391,  208,  680,    7, 1094,   22,    5,   23,  605,
         36,  423,    2])
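
The value 3 in the array above is the index of `<UNK>`: a word that survived the validation-set trimming but is missing from the training mapping falls back to it. The sketch below shows how a `defaultdict` with a custom `default_factory` produces this fallback, using a small made-up mapping. It also shows a side effect worth knowing: looking up a missing key with `[]` stores that key in the dictionary, whereas `dict.get` does not.

```python
from collections import defaultdict

# Small mapping standing in for a training vocabulary (values are illustrative)
word2index = defaultdict(dict)
word2index.update({"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3, "movie": 4, "film": 5})

# Any missing key now resolves to the index of <UNK>
word2index.default_factory = lambda: word2index["<UNK>"]

size_before = len(word2index)
indices = [word2index[word] for word in "movie zzz film".split()]
size_after = len(word2index)

# dict.get does not trigger default_factory, so it leaves the mapping untouched
safe_index = word2index.get("qqq", word2index["<UNK>"])

print(indices)
print(size_before, size_after)
```

Because missing keys are inserted on lookup, `len(word2index)` can grow after mapping a dataset that contains unknown words, which matters if the dictionary size is later used as the vocabulary size.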

The next task is to create the BatchIterator class, which sorts the dataset examples, generates batches of input and output variables, applies padding where required and lets us iterate through all created batches. To keep the amount of padding within each batch small, we sort the examples across the entire dataset by sequence length, so that each batch contains sequences of similar lengths and fewer padding tokens are needed.
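
To see why sorting reduces padding, consider the minimal sketch below. The sequences, `pad_index`, `batch_size` and the `pad_batch` helper are hypothetical and only illustrate the idea; they are not the BatchIterator implementation.

```python
import numpy as np

# Toy index sequences of different lengths; 0 is assumed to be the <PAD> index
sequences = [np.array([5, 8, 2]), np.array([4, 2]), np.array([9, 7, 6, 5, 2]),
             np.array([3, 3, 3, 2]), np.array([6, 2]), np.array([7, 7, 2])]
pad_index = 0
batch_size = 2


def pad_batch(batch, pad_index):
    """Right-pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    padded = np.full((len(batch), max_len), pad_index, dtype=np.int64)
    for row, seq in enumerate(batch):
        padded[row, :len(seq)] = seq
    return padded


# Sort longest first, so neighbours in the sorted order have similar lengths
lengths = np.array([len(seq) for seq in sequences])
order = np.argsort(lengths)[::-1]

batches = []
for start in range(0, len(order), batch_size):
    batch = [sequences[i] for i in order[start:start + batch_size]]
    batches.append(pad_batch(batch, pad_index))

# Compare the number of padding tokens with and without sorting
sorted_padding = sum(int((b == pad_index).sum()) for b in batches)
unsorted_batches = [pad_batch(sequences[i:i + batch_size], pad_index)
                    for i in range(0, len(sequences), batch_size)]
unsorted_padding = sum(int((b == pad_index).sum()) for b in unsorted_batches)

print('padding tokens with sorting:', sorted_padding)
print('padding tokens without sorting:', unsorted_padding)
```

On this toy data, batching in the original order needs three padding tokens, while sorting by length first needs only one.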

In [15]:
class BatchIterator:
    
    """The BatchIterator class is responsible for:
    Sorting dataset examples.
    Generating batches.
    Sequence padding.
    Enabling BatchIterator instance to iterate through all batches.

    Parameters
    ----------
    dataset : pandas.DataFrame or numpy.ndarray
        If vocab_created is False, pass a Pandas or numpy dataset whose first column contains the input strings
        to process and whose last column contains the non-string target variable. Otherwise pass the
        vocab.dataset object.
    batch_size: int, optional (default=None)
        The size of the batch. By default use batch_size equal to the dataset length.
    vocab_created: boolean, optional (default=False)
        Whether the vocab object is already created.
    vocab: Vocab object, optional (default=None)
        Use if vocab_created = True, pass the vocab object.
    target_col: int, optional (default=None)
        Index of the column containing target strings to process.
    word2index: dict, optional (default=None)
        Specify the word2index mapping.
    sos_token: str, optional (default='<SOS>')
        Use if vocab_created = False. Start of sentence token.
    eos_token: str, optional (default='<EOS>')
        Use if vocab_created = False. End of sentence token.
    unk_token: str, optional (default='<UNK>')
        Use if vocab_created = False. Token that represents unknown words.
    pad_token: str, optional (default='<PAD>')
        Use if vocab_created = False. Token that represents padding.
    min_word_count: float, optional (default=5)
        Use if vocab_created = False. If min_word_count > 1, keep only words that occur at least
        min_word_count times. If min_word_count <= 1, keep all words whose count is greater than or equal to
        the quantile=min_word_count of the count distribution.
    max_vocab_size: int, optional (default=None)
        Use if vocab_created = False. Maximum size of the vocabulary.
    max_seq_len: float, optional (default=0.8)
        Use if vocab_created = False. Specify the maximum length of the sequence in the dataset, if
        max_seq_len > 1. If max_seq_len <= 1 then set the maximum length to the value corresponding to
        quantile=max_seq_len of the lengths distribution. Trim all sequences whose lengths are greater
        than max_seq_len.
    use_pretrained_vectors: boolean, optional (default=False)
        Use if vocab_created = False. Whether to use pre-trained Glove vectors.
    glove_path: str, optional (default='Glove/')
        Use if vocab_created = False. Path to the directory that contains files with the Glove word vectors.
    glove_name: str, optional (default='glove.6B.100d.txt')
        Use if vocab_created = False. Name of the Glove word vectors file. Available pretrained vectors:
        glove.6B.50d.txt
        glove.6B.100d.txt
        glove.6B.200d.txt
        glove.6B.300d.txt
        glove.twitter.27B.50d.txt
        To use different word vectors, load their file to the vectors directory (Glove/).
    weights_file_name: str, optional (default='Glove/weights.npy')
        Use if vocab_created = False. The path and the name of the numpy file to which save weights vectors.

    Raises
    ------
    ValueError('Use min_word_count or max_vocab_size, not both!')
        If both min_word_count and max_vocab_size are provided.
    FileNotFoundError
        If the GloVe file doesn't exist in the given directory.
    TypeError('Cannot convert to Tensor. Data type not recognized')
        If the data type of the sequence cannot be converted to the Tensor.

    Yields
    ------
    dict
        Dictionary that contains variables batches.

    """
        
        
    def __init__(self, dataset, batch_size=None, vocab_created=False, vocab=None, target_col=None, word2index=None,
             sos_token='<SOS>', eos_token='<EOS>', unk_token='<UNK>', pad_token='<PAD>', min_word_count=5,
             max_vocab_size=None, max_seq_len=0.8, use_pretrained_vectors=False, glove_path='Glove/',
             glove_name='glove.6B.100d.txt', weights_file_name='Glove/weights.npy'):    
    
        # Create vocabulary object
        if not vocab_created:
            self.vocab = Vocab(dataset, target_col=target_col, word2index=word2index, sos_token=sos_token, eos_token=eos_token,
                               unk_token=unk_token, pad_token=pad_token, min_word_count=min_word_count,
                               max_vocab_size=max_vocab_size, max_seq_len=max_seq_len,
                               use_pretrained_vectors=use_pretrained_vectors, glove_path=glove_path,
                               glove_name=glove_name, weights_file_name=weights_file_name)
            
            # Use created vocab.dataset object
            self.dataset = self.vocab.dataset      
        
        else:
            # If vocab was created then dataset should be the vocab.dataset object
            self.dataset = dataset
            self.vocab = vocab
            
        self.target_col = target_col 
        
        self.word2index = self.vocab.word2index
            
        # Define the batch_size
        if batch_size:
            self.batch_size = batch_size
        else:
            # Use the length of dataset as batch_size
            self.batch_size = len(self.dataset)
                
        self.x_lengths = np.array(self.vocab.x_lengths)
        
        if self.target_col:
            self.y_lengths = np.array(self.vocab.y_lengths)
            
        self.pad_token = self.vocab.word2index[pad_token]
            
        self.sort_and_batch()

        
    def sort_and_batch(self):
        """ Sort examples within entire dataset, then perform batching and shuffle all batches.

        """
        # Extract row indices sorted according to lengths
        if not self.target_col:
            sorted_indices = np.argsort(self.x_lengths)
        else:
            sorted_indices = np.lexsort((self.y_lengths, self.x_lengths))
        
        # Sort all sets
        self.sorted_dataset = self.dataset[sorted_indices[::-1]]
        self.sorted_x_lengths = np.flip(self.x_lengths[sorted_indices])
        
        if self.target_col:
            self.sorted_target = self.sorted_dataset[:, self.target_col]
            self.sorted_y_lengths = np.flip(self.y_lengths[sorted_indices])
        else:
            self.sorted_target = self.sorted_dataset[:, -1]
        
        # Initialize input, target and lengths batches
        self.input_batches = [[] for _ in range(self.sorted_dataset.shape[1]-1)]
        
        self.target_batches, self.x_len_batches = [], []

        self.y_len_batches = [] if self.target_col else None
        
        # Create batches
        for i in range(self.sorted_dataset.shape[1]-1):
            # The first column always contains the sequences that should be padded.
            if i == 0:
                self.create_batches(self.sorted_dataset[:, i], self.input_batches[i], pad_token=self.pad_token)
            else:
                self.create_batches(self.sorted_dataset[:, i], self.input_batches[i])
                
        if self.target_col:
            self.create_batches(self.sorted_target, self.target_batches, pad_token=self.pad_token)
            self.create_batches(self.sorted_y_lengths, self.y_len_batches)
        else:
            self.create_batches(self.sorted_target, self.target_batches)
        
        self.create_batches(self.sorted_x_lengths, self.x_len_batches)
        
        # Shuffle batches
        self.indices = np.arange(len(self.input_batches[0]))
        np.random.shuffle(self.indices)
        
        for j in range(self.sorted_dataset.shape[1]-1):
            self.input_batches[j] = [self.input_batches[j][i] for i in self.indices]
        
        self.target_batches = [self.target_batches[i] for i in self.indices]
        self.x_len_batches = [self.x_len_batches[i] for i in self.indices]
        
        if self.target_col:
            self.y_len_batches = [self.y_len_batches[i] for i in self.indices]
        
        print('Batches created')
        
        
    def create_batches(self, sorted_dataset, batches, pad_token=-1):
        """ Convert each sequence to pytorch Tensor, create batches and pad them if required.
        
        """
        # Split the sorted data into consecutive chunks of batch_size examples.
        # Stepping by batch_size keeps the shorter remainder as the last batch
        # and never produces an empty batch when the dataset size is a multiple of batch_size.
        list_of_batches = [sorted_dataset[i:i+self.batch_size].copy()
                           for i in range(0, len(sorted_dataset), self.batch_size)]

        # Convert each sequence to pytorch Tensor
        for batch in list_of_batches:
            tensor_batch = []
            tensor_type = None
            for seq in batch:
                # Check seq data type and convert to Tensor
                if isinstance(seq, np.ndarray):
                    tensor = torch.LongTensor(seq)
                    tensor_type = 'int'
                elif isinstance(seq, np.integer):
                    tensor = torch.LongTensor([seq])
                    tensor_type = 'int'
                elif isinstance(seq, np.floating):
                    tensor = torch.FloatTensor([seq])
                    tensor_type = 'float'
                elif isinstance(seq, int):
                    tensor = torch.LongTensor([seq])
                    tensor_type = 'int'
                elif isinstance(seq, float):
                    tensor = torch.FloatTensor([seq])
                    tensor_type = 'float'
                else:
                    raise TypeError('Cannot convert to Tensor. Data type not recognized')

                tensor_batch.append(tensor)
            if pad_token != -1:
                # Pad required sequences
                pad_batch = torch.nn.utils.rnn.pad_sequence(tensor_batch, batch_first=True,
                                                            padding_value=pad_token)
                batches.append(pad_batch)
            else:
                if tensor_type == 'int':
                    batches.append(torch.LongTensor(tensor_batch))
                else:
                    batches.append(torch.FloatTensor(tensor_batch))

                
    def __iter__(self):
        """ Iterate through batches.
        
        """
        # Iterate through batches
        for i in range(len(self.input_batches[0])):
            # Build a fresh dictionary for every batch so that previously yielded
            # batches are not overwritten when the iterator is collected into a list
            to_yield = {}

            for j in range(len(self.input_batches)):
                to_yield['input_{}'.format(j)] = self.input_batches[j][i]
                
            to_yield['target'] = self.target_batches[i]
            to_yield['x_lengths'] = self.x_len_batches[i]
            
            if self.target_col:
                to_yield['y_length'] = self.y_len_batches[i]

            yield to_yield
            
            
    def __len__(self):
        """ Return iterator length.
        
        """
        return len(self.input_batches[0])
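
The core idea behind the iterator (sort by length, cut into batches, pad each batch, then shuffle the batch order) can be illustrated without PyTorch. The sketch below is only a simplified illustration, not the class above: `length_sorted_batches`, the toy sequences and the `seed` argument are hypothetical names introduced for this example, and plain Python lists stand in for tensors.

```python
import random


def length_sorted_batches(sequences, batch_size, pad_id=0, seed=None):
    """Sort by length (longest first), chunk, pad per batch, shuffle batch order."""
    # Longest sequences first, mirroring the descending sort of the iterator
    ordered = sorted(sequences, key=len, reverse=True)

    batches = []
    # Stepping by batch_size keeps the shorter remainder as the last batch
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        lengths = [len(seq) for seq in chunk]
        width = lengths[0]  # the chunk is sorted, so its first entry is the longest
        padded = [seq + [pad_id] * (width - len(seq)) for seq in chunk]
        batches.append((padded, lengths))

    # Shuffle whole batches, so rows inside one batch keep similar lengths
    random.Random(seed).shuffle(batches)
    return batches


# Toy index sequences (made up for illustration); 2 plays the role of <EOS>
toy = [[5, 6, 2], [7, 2], [8, 9, 10, 11, 2], [4, 2], [3, 3, 3, 2]]
for padded, lengths in length_sorted_batches(toy, batch_size=2, seed=0):
    print(lengths, padded)
```

Because neighbouring sequences have similar lengths after sorting, each batch needs only a small amount of padding, while shuffling at the batch level still varies the order in which batches are seen.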
        

Now we instantiate the BatchIterator class and check that sorting, batching and padding work as expected.

In [16]:
train_iterator = BatchIterator(train_dataset, batch_size=32, vocab_created=False, vocab=None, target_col=None,
                               word2index=None, sos_token='<SOS>', eos_token='<EOS>', unk_token='<UNK>',
                               pad_token='<PAD>', min_word_count=5, max_vocab_size=None, max_seq_len=0.8,
                               use_pretrained_vectors=False, glove_path='glove/', glove_name='glove.6B.100d.txt',
                               weights_file_name='glove/weights.npy')

Trimmed vocabulary using as minimum count threashold: count = 5.00
26346/130416 tokens has been retained
Trimmed input strings vocabulary
Trimmed input sequences lengths to the length of: 138
Mapped words to indices
Batches created


In [17]:
# Print the size of first input batch
len(train_iterator.input_batches[0][0])

32
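
Only the last batch can be smaller than `batch_size`. As a rough sanity check, the number of batches and their sizes can be computed by hand. This is a minimal sketch; `batch_sizes` is a hypothetical helper and the example counts are made up, not taken from the dataset.

```python
import math


def batch_sizes(n_examples, batch_size):
    """Return the size of every batch when n_examples are split in order."""
    n_batches = math.ceil(n_examples / batch_size)
    return [min(batch_size, n_examples - i * batch_size) for i in range(n_batches)]


print(batch_sizes(100, 32))  # [32, 32, 32, 4]
print(batch_sizes(64, 32))   # [32, 32]
```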

In [18]:
# Run the BatchIterator and print the first set of batches
for batches in train_iterator:
    pprint(batches)
    break

{'input_0': tensor([[  173,   692,  2273,  ...,  5350,   135,     2],
        [ 4692,   165,  2234,  ...,  2234,  1449,     2],
        [ 1051,   173,    28,  ...,  3564,   485,     2],
        ...,
        [   74,    28,     7,  ...,  1276,  1524,     2],
        [  463, 14307,     5,  ...,   565,   416,     2],
        [ 4889, 14402,   968,  ...,   633,  2163,     2]]),
 'input_1': tensor([-0.1105, -0.0653, -0.0124, -0.0297,  0.0717,  0.0883, -0.0433,  0.2168,
         0.1687,  0.1799,  0.0545, -0.1493,  0.4133,  0.0135,  0.0040,  0.2640,
         0.0687, -0.0337,  0.0461, -0.1031,  0.0701,  0.2874,  0.1074, -0.0500,
         0.2964,  0.1423,  0.0289,  0.2441,  0.2040,  0.0717,  0.1062,  0.1110]),
 'input_2': tensor([243, 201, 276, 243, 247, 236, 278, 196, 211, 223, 242, 234, 245, 259,
        271, 222, 227, 235, 287, 214, 249, 237, 234, 209, 237, 249, 317, 221,
        287, 262, 273, 215]),
 'target': tensor([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
  

In [19]:
val_iterator = BatchIterator(val_dataset, batch_size=32, vocab_created=False, vocab=None, target_col=None,
                             word2index=train_iterator.word2index, sos_token='<SOS>', eos_token='<EOS>',
                             unk_token='<UNK>', pad_token='<PAD>', min_word_count=5, max_vocab_size=None,
                             max_seq_len=0.8, use_pretrained_vectors=False, glove_path='glove/',
                             glove_name='glove.6B.100d.txt', weights_file_name='glove/weights.npy')

Trimmed vocabulary using as minimum count threashold: count = 5.00
14097/59089 tokens has been retained
Trimmed input strings vocabulary
Trimmed input sequences lengths to the length of: 132
Mapped words to indices
Batches created


In [20]:
val_iterator.word2index

defaultdict(<function __main__.Vocab.mapWord2index.<locals>.<lambda>()>,
            {'<PAD>': 0,
             '<SOS>': 1,
             '<EOS>': 2,
             '<UNK>': 3,
             'amaze': 4,
             'good': 5,
             'wonderful': 6,
             'film': 7,
             'early': 8,
             'ninety': 9,
             'franchise': 10,
             'grow': 11,
             'stargate': 12,
             'sg1': 13,
             'doubt': 14,
             'worthy': 15,
             'addition': 16,
             'science': 17,
             'fiction': 18,
             'genre': 19,
             'right': 20,
             'stand': 21,
             'shoulder': 22,
             'star': 23,
             'trek': 24,
             'king': 25,
             'feature': 26,
             'series': 27,
             'see': 28,
             'command': 29,
             'military': 30,
             'organisation': 31,
             'figure': 32,
             'system': 33,
             'travel': 
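
Passing `train_iterator.word2index` to the validation iterator means both sets share the same indices. The sketch below shows how such a mapping might be used to encode tokens and decode them back. It copies only the first few entries printed above, uses a plain dictionary, and assumes that unseen words fall back to the `<UNK>` index; `encode` and `decode` are hypothetical helpers, not methods of the Vocab class.

```python
# A few entries copied from the printed mapping; the rest is omitted here
word2index = {'<PAD>': 0, '<SOS>': 1, '<EOS>': 2, '<UNK>': 3,
              'amaze': 4, 'good': 5, 'wonderful': 6, 'film': 7}

# Reverse mapping (index2word) for decoding
index2word = {index: word for word, index in word2index.items()}


def encode(tokens):
    """Map tokens to indices and append the <EOS> index."""
    unk = word2index['<UNK>']  # assumption: unseen words map to <UNK>
    return [word2index.get(token, unk) for token in tokens] + [word2index['<EOS>']]


def decode(indices):
    """Map indices back to words, dropping padding positions."""
    pad = word2index['<PAD>']
    return [index2word[index] for index in indices if index != pad]


encoded = encode(['wonderful', 'film', 'zzz'])
print(encoded)                   # [6, 7, 3, 2]
print(decode(encoded + [0, 0]))  # ['wonderful', 'film', '<UNK>', '<EOS>']
```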

In [21]:
# Run the BatchIterator and print the first set of batches
for batches in val_iterator:
    pprint(batches)
    break

{'input_0': tensor([[  135,   173,  2685,  ...,     6,   335,     2],
        [   28,  1600,     7,  ...,   703,   323,     2],
        [ 1276,   163,   173,  ...,   181,    80,     2],
        ...,
        [   28, 19217,     7,  ...,   292,  1233,     2],
        [ 5551,  1867,   889,  ...,    23,     2,     0],
        [ 8425,  5020,  1000,  ...,   755,     2,     0]]),
 'input_1': tensor([ 0.3280,  0.0333,  0.0146, -0.1302, -0.0811,  0.0029, -0.2673,  0.5180,
         0.2483,  0.0739,  0.0571,  0.0545,  0.1115,  0.4592,  0.2960,  0.4360,
        -0.0438,  0.1438, -0.0969,  0.3300, -0.0528,  0.2960, -0.4077,  0.3357,
         0.0596, -0.2222, -0.0294, -0.1250,  0.0578, -0.2844, -0.0988,  0.2455]),
 'input_2': tensor([ 93, 113, 113,  84,  93, 122,  96,  97, 118, 111, 111, 103, 104, 108,
        105, 113,  97,  61,  99,  89, 126, 122,  93, 113, 151,  83, 108,  86,
        101, 101, 110,  72]),
 'target': tensor([1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
  

In the next notebook we are going to create the neural network model.