
## What are PyTorch Datasets?

Loading the Dataset

PyTorch provides two data primitives, `torch.utils.data.DataLoader` and `torch.utils.data.Dataset`, that let you use preloaded datasets as well as your own data. `Dataset` stores the samples and their corresponding labels, and `DataLoader` wraps an iterable around the `Dataset` to enable easy access to the samples.
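
As a minimal sketch of how the two primitives fit together, the snippet below wraps a few made-up tensors in the built-in `TensorDataset` and iterates over them with a `DataLoader`. The toy values, the names `toy_dataset` and `toy_loader`, and the batch size are assumptions chosen only for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (illustrative values only): 6 samples with 2 features each, plus labels
features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
labels = torch.tensor([0, 1, 0, 1, 0, 1])

# Dataset: stores the samples and their labels
toy_dataset = TensorDataset(features, labels)

# DataLoader: wraps an iterable around the Dataset and yields batches
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

batch_shapes = []
for x_batch, y_batch in toy_loader:
    batch_shapes.append((tuple(x_batch.shape), tuple(y_batch.shape)))
    print(x_batch.shape, y_batch.shape)
```

With six samples and a batch size of four, the last batch is smaller than the others because `drop_last` defaults to `False`.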

What are PyTorch Datasets?

A PyTorch Dataset is any object that has a length and can be indexed, so that `len(dataset)` works and `dataset[index]` returns one sample, typically a tuple `(x, y)`.

PyTorch datasets implement two "dunder" (magic) methods: `__getitem__`, which backs `dataset[index]`, and `__len__`, which backs `len(dataset)`.
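
To make the two magic methods concrete, here is a small sketch of a map-style dataset held entirely in memory. The class name `PairDataset` and its sample strings are hypothetical and are not part of the QnA data used below.

```python
from torch.utils.data import Dataset

class PairDataset(Dataset):
    """Hypothetical in-memory dataset of (x, y) pairs, for illustration only."""

    def __init__(self, xs, ys):
        assert len(xs) == len(ys), "inputs and labels must have the same length"
        self.xs = xs
        self.ys = ys

    def __len__(self):
        # Called by len(dataset)
        return len(self.xs)

    def __getitem__(self, index):
        # Called by dataset[index]
        return self.xs[index], self.ys[index]

pairs = PairDataset(["a question", "another question"], ["yes", "no"])
print(len(pairs))   # 2
print(pairs[1])     # ('another question', 'no')
```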

In [41]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

In [112]:
class CustomTextDataset(Dataset):
    """Custom Dataset class to load QnA data."""

    def __init__(self, src, tgt, path):
        # Read the tab-separated file
        df = pd.read_csv(path, sep='\t', encoding='iso-8859-1')
        # Keep only the requested columns and drop rows with missing values
        self.df = df[[src, tgt]].dropna().reset_index(drop=True)

        self.src = self.df[src]   # source column (questions)
        self.tgt = self.df[tgt]   # target column (answers)

    def __len__(self):
        return len(self.tgt)   # number of samples in the dataset

    def __getitem__(self, idx):
        # Return one sample as a dictionary holding a question and its answer
        return {"SRC": self.src[idx], "TGT": self.tgt[idx]}

data = CustomTextDataset('Question', 'Answer', '/content/full_question_answer_data.txt')
next(iter(data))

{'SRC': 'Was Abraham Lincoln the sixteenth President of the United States?',
 'TGT': 'yes'}

In [113]:
list(DataLoader(data))[:11]

[{'SRC': ['Was Abraham Lincoln the sixteenth President of the United States?'],
  'TGT': ['yes']},
 {'SRC': ['Was Abraham Lincoln the sixteenth President of the United States?'],
  'TGT': ['Yes.']},
 {'SRC': ['Did Lincoln sign the National Banking Act of 1863?'],
  'TGT': ['yes']},
 {'SRC': ['Did Lincoln sign the National Banking Act of 1863?'],
  'TGT': ['Yes.']},
 {'SRC': ['Did his mother die of pneumonia?'], 'TGT': ['no']},
 {'SRC': ['Did his mother die of pneumonia?'], 'TGT': ['No.']},
 {'SRC': ["How many long was Lincoln's formal education?"],
  'TGT': ['18 months']},
 {'SRC': ["How many long was Lincoln's formal education?"],
  'TGT': ['18 months.']},
 {'SRC': ['When did Lincoln begin his political career?'], 'TGT': ['1832']},
 {'SRC': ['When did Lincoln begin his political career?'], 'TGT': ['1832.']},
 {'SRC': ['What did The Legal Tender Act of 1862 establish?'],
  'TGT': ['the United States Note, the first paper currency in United States history']}]

In [127]:
for idx,sample in enumerate(DataLoader(data)):
  print(idx)
  print(sample)
  print('-'*100)

  if idx == 5:
    break

0
{'SRC': ['Was Abraham Lincoln the sixteenth President of the United States?'], 'TGT': ['yes']}
----------------------------------------------------------------------------------------------------
1
{'SRC': ['Was Abraham Lincoln the sixteenth President of the United States?'], 'TGT': ['Yes.']}
----------------------------------------------------------------------------------------------------
2
{'SRC': ['Did Lincoln sign the National Banking Act of 1863?'], 'TGT': ['yes']}
----------------------------------------------------------------------------------------------------
3
{'SRC': ['Did Lincoln sign the National Banking Act of 1863?'], 'TGT': ['Yes.']}
----------------------------------------------------------------------------------------------------
4
{'SRC': ['Did his mother die of pneumonia?'], 'TGT': ['no']}
----------------------------------------------------------------------------------------------------
5
{'SRC': ['Did his mother die of pneumonia?'], 'TGT': ['No.']}
--------

In [115]:
bat_size = 2
DL_DS = DataLoader(data, batch_size=bat_size)

# Loop through each batch in the DataLoader object
for idx, batch in enumerate(DL_DS):

    # Print the source (question) data of the batch
    print(idx, 'SRC data: ', batch['SRC'], '\n')

    # Print the target (answer) data of the batch
    print(idx, 'TGT data: ', batch['TGT'], '\n')

    if idx == 5:
        break

0 SRC data:  ['Was Abraham Lincoln the sixteenth President of the United States?', 'Was Abraham Lincoln the sixteenth President of the United States?'] 

0 TGT data:  ['yes', 'Yes.'] 

1 SRC data:  ['Did Lincoln sign the National Banking Act of 1863?', 'Did Lincoln sign the National Banking Act of 1863?'] 

1 TGT data:  ['yes', 'Yes.'] 

2 SRC data:  ['Did his mother die of pneumonia?', 'Did his mother die of pneumonia?'] 

2 TGT data:  ['no', 'No.'] 

3 SRC data:  ["How many long was Lincoln's formal education?", "How many long was Lincoln's formal education?"] 

3 TGT data:  ['18 months', '18 months.'] 

4 SRC data:  ['When did Lincoln begin his political career?', 'When did Lincoln begin his political career?'] 

4 TGT data:  ['1832', '1832.'] 

5 SRC data:  ['What did The Legal Tender Act of 1862 establish?', 'What did The Legal Tender Act of 1862 establish?'] 

5 TGT data:  ['the United States Note, the first paper currency in United States history', 'The United States Note, the f

In machine learning and deep learning, text needs to be cleaned and turned into vectors before training. `DataLoader` has a handy parameter called `collate_fn`. It takes a function that receives the list of samples drawn for a batch and returns the batch, so any processing done inside that function is applied to the data before the `DataLoader` outputs it.

Collate functions are covered in more detail in the sections below.
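
As a rough sketch of the `collate_fn` idea, the snippet below cleans each sample while the batch is being assembled. The samples mimic the `{"SRC": ..., "TGT": ...}` dictionaries shown earlier, but the list `toy_samples`, the helper `clean_text`, and the cleaning rule itself are assumptions made for illustration.

```python
from torch.utils.data import DataLoader

# Hypothetical in-memory samples shaped like the {"SRC": ..., "TGT": ...} dicts above
toy_samples = [
    {"SRC": "Did his mother die of pneumonia?", "TGT": "No."},
    {"SRC": "When did Lincoln begin his political career?", "TGT": "1832."},
]

def clean_text(text):
    # Illustrative cleaning only: lowercase and drop a trailing period
    return text.lower().rstrip(".")

def cleaning_collate(batch):
    # `batch` is a list of samples, one per index drawn by the DataLoader
    return {
        "SRC": [clean_text(sample["SRC"]) for sample in batch],
        "TGT": [clean_text(sample["TGT"]) for sample in batch],
    }

# A plain list works as a map-style dataset: it has __len__ and __getitem__
toy_loader = DataLoader(toy_samples, batch_size=2, collate_fn=cleaning_collate)
first_batch = next(iter(toy_loader))
print(first_batch["TGT"])  # ['no', '1832']
```

Doing the cleaning in the collate function keeps the dataset itself a thin wrapper around the raw data, so the same dataset can be reused with different processing.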

In [96]:
%%bash
python -m spacy download en
python -m spacy download de

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
Collecting de_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz (14.9MB)
Building wheels for collected packages: de-core-news-sm
  Building wheel for de-core-news-sm (setup.py): started
  Building wheel for de-core-news-sm (setup.py): finished with status 'done'
  Created wheel for de-core-news-sm: filename=de_core_news_sm-2.2.5-cp37-none-any.whl size=14907055 sha256=801c86f29c6a4f1e61b7b9a1b042a9143182571eada9ae9a4c0cb166a1be2d21
  Stored in directory: /tmp/pip-ephem-wheel-cache-62q4ej5w/wheels/ba/3f/ed/d4aa8e45e7191b7f32db4bfad565e7da1edbf05c916ca7a1ca
Successfully built de-core-news-sm
Inst

In [97]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from typing import Iterable, List

In [98]:
import torchtext
from torchtext.data import get_tokenizer
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("You can now install TorchText using pip!")
tokens

['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']

In [99]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import AG_NEWS

tokenizer = get_tokenizer('basic_english')
train_iter = AG_NEWS(split='train')

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

vocab(['hello','we','are','languagecorpus'])


[12544, 507, 42, 0]

In [101]:
from torchtext.datasets import Multi30k
train_iter = Multi30k(split='train')
next(iter(train_iter))

('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.\n',
 'Two young, White males are outside near many bushes.\n')

In [102]:
SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'
token_transform = {}
vocab_transform = {}

In [103]:
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy',language='de')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy',language='en')

def yield_tokens(data_iter:Iterable,language:str):
  language_index = {SRC_LANGUAGE:0,TGT_LANGUAGE:1}
  for data_sample in data_iter:
    yield token_transform[language](data_sample[language_index[language]])

In [104]:
# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

In [106]:
for ln in [SRC_LANGUAGE,TGT_LANGUAGE]:
  train_iter = Multi30k(split='train')
  vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter,ln),
                                                  min_freq=1,
                                                  specials=special_symbols,
                                                  special_first = True)

In [107]:
# Set UNK_IDX as the default index. This index is returned when the token is not found. 
# If not set, it throws RuntimeError when the queried token is not found in the Vocabulary. 
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
  vocab_transform[ln].set_default_index(UNK_IDX)

In [109]:
vocab_transform[TGT_LANGUAGE](['hello','i','am','vocab','transform'])

[5466, 2590, 3427, 0, 0]
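
The two zeros in this output are the tokens that fell back to `UNK_IDX`. The pure-Python sketch below illustrates the idea of a vocabulary with special symbols placed first and a default index for unknown tokens. It is a simplified stand-in, not torchtext's implementation, and the helpers `build_toy_vocab` and `lookup` as well as the toy corpus are hypothetical.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials, min_freq=1):
    """Map tokens to integer ids, placing the special symbols first.

    A simplified stand-in for illustration; not torchtext's implementation.
    """
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Most frequent first, ties broken alphabetically
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    itos = list(specials) + [tok for tok, freq in ordered
                             if freq >= min_freq and tok not in specials]
    return {token: index for index, token in enumerate(itos)}

def lookup(stoi, tokens, default_index):
    # Unknown tokens fall back to default_index, mirroring set_default_index
    return [stoi.get(token, default_index) for token in tokens]

toy_corpus = [["two", "young", "men"], ["two", "dogs"]]
toy_stoi = build_toy_vocab(toy_corpus, specials=["<unk>", "<pad>", "<bos>", "<eos>"])
toy_ids = lookup(toy_stoi, ["two", "dogs", "zebra"], default_index=0)
print(toy_ids)  # [4, 5, 0]
```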

In [None]:
from torch.utils.data import random_split

def load_data(path, test_split, batch_size):
    """Loads the QnA data and splits it into train and test DataLoaders"""
    dataset = CustomTextDataset('Question', 'Answer', path)
    # Compute the sizes of the two splits
    dataset_size = len(dataset)
    test_size = int(test_split * dataset_size)
    train_size = dataset_size - test_size

    train_dataset, test_dataset = random_split(dataset,
                                               [train_size, test_size])

    # Pass the Subset objects themselves: `.dataset` would hand back the full, unsplit dataset
    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True)
    test_loader = DataLoader(
        test_dataset,
        batch_size=batch_size,
        shuffle=True)

    return train_loader, test_loader

## Understanding the Collate function

In [128]:
# pytorch's default collate function taken from https://github.com/pytorch/pytorch/blob/master/torch/utils/data/_utils/collate.py

import torch
import re
import collections
string_classes = (str, bytes)  # stands in for torch._six.string_classes, which recent PyTorch releases no longer provide

np_str_obj_array_pattern = re.compile(r'[SaUO]')


def default_convert(data):
    r"""Converts each NumPy array data field into a tensor"""
    elem_type = type(data)
    if isinstance(data, torch.Tensor):
        return data
    elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
            and elem_type.__name__ != 'string_':
        # array of string classes and object
        if elem_type.__name__ == 'ndarray' \
                and np_str_obj_array_pattern.search(data.dtype.str) is not None:
            return data
        return torch.as_tensor(data)
    elif isinstance(data, collections.abc.Mapping):
        return {key: default_convert(data[key]) for key in data}
    elif isinstance(data, tuple) and hasattr(data, '_fields'):  # namedtuple
        return elem_type(*(default_convert(d) for d in data))
    elif isinstance(data, collections.abc.Sequence) and not isinstance(data, string_classes):
        return [default_convert(d) for d in data]
    else:
        return data


default_collate_err_msg_format = (
    "default_collate: batch must contain tensors, numpy arrays, numbers, "
    "dicts or lists; found {}")


def default_collate(batch):
    r"""Puts each data field into a tensor with outer dimension batch size"""

    elem = batch[0]
    elem_type = type(elem)
    if isinstance(elem, torch.Tensor):
        out = None
        if torch.utils.data.get_worker_info() is not None:
            # If we're in a background process, concatenate directly into a
            # shared memory tensor to avoid an extra copy
            numel = sum([x.numel() for x in batch])
            storage = elem.storage()._new_shared(numel)
            out = elem.new(storage)
        return torch.stack(batch, 0, out=out)
    elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
            and elem_type.__name__ != 'string_':
        if elem_type.__name__ == 'ndarray' or elem_type.__name__ == 'memmap':
            # array of string classes and object
            if np_str_obj_array_pattern.search(elem.dtype.str) is not None:
                raise TypeError(default_collate_err_msg_format.format(elem.dtype))

            return default_collate([torch.as_tensor(b) for b in batch])
        elif elem.shape == ():  # scalars
            return torch.as_tensor(batch)
    elif isinstance(elem, float):
        return torch.tensor(batch, dtype=torch.float64)
    elif isinstance(elem, int):
        return torch.tensor(batch)
    elif isinstance(elem, string_classes):
        return batch
    elif isinstance(elem, collections.abc.Mapping):
        return {key: default_collate([d[key] for d in batch]) for key in elem}
    elif isinstance(elem, tuple) and hasattr(elem, '_fields'):  # namedtuple
        return elem_type(*(default_collate(samples) for samples in zip(*batch)))
    elif isinstance(elem, collections.abc.Sequence):
        # check to make sure that the elements in batch have consistent size
        it = iter(batch)
        elem_size = len(next(it))
        if not all(len(elem) == elem_size for elem in it):
            raise RuntimeError('each element in list of batch should be of equal size')
        transposed = zip(*batch)
        return [default_collate(samples) for samples in transposed]

    raise TypeError(default_collate_err_msg_format.format(elem_type))

In [129]:
item_list = [1,2,3,4,5]
default_collate(item_list)

tensor([1, 2, 3, 4, 5])

In [130]:
item_list = ([1,2,3,4,5],[6,7,8,9,10])
default_collate(item_list)

[tensor([1, 6]),
 tensor([2, 7]),
 tensor([3, 8]),
 tensor([4, 9]),
 tensor([ 5, 10])]

In [132]:
item_list = [(1,2),(3,4),(5,6),(7,8)]
default_collate(item_list)

[tensor([1, 3, 5, 7]), tensor([2, 4, 6, 8])]

In [142]:
item_list = [[1,2,3],[3,4,5],[5,6,7],[7,8,9]]
default_collate(item_list)

[tensor([1, 3, 5, 7]), tensor([2, 4, 6, 8]), tensor([3, 5, 7, 9])]
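
The sequence branch of `default_collate` above explains this output: it transposes the batch with `zip(*batch)` and then collates each resulting column. A minimal sketch of that single step, leaving out the type checks and recursion of the real function:

```python
import torch

batch = [[1, 2, 3], [3, 4, 5], [5, 6, 7], [7, 8, 9]]

# zip(*batch) groups the i-th field of every sample together
columns = list(zip(*batch))
print(columns)  # [(1, 3, 5, 7), (2, 4, 6, 8), (3, 5, 7, 9)]

# Each group of ints then becomes one tensor
collated = [torch.tensor(column) for column in columns]
print(collated)
```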

In [141]:
import torch
def our_own_collate(data):
    # From [[x1, x2, y], ...] to (tensor([[x1, x2], ...]), tensor([y, ...]))
    xs = [[data_item[0], data_item[1]] for data_item in data]
    y = [data_item[2] for data_item in data]

    return torch.tensor(xs), torch.tensor(y)

item_list = [[1,2,3],[3,4,5],[5,6,7],[7,8,9]]
our_own_collate(item_list)

(tensor([[1, 2],
         [3, 4],
         [5, 6],
         [7, 8]]), tensor([3, 5, 7, 9]))
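
A custom collate function like this is normally not called by hand but passed to a `DataLoader`. The sketch below shows one way that could look. The function is redefined under the hypothetical name `split_features_and_label` so the snippet runs on its own, and a plain list stands in for a dataset.

```python
import torch
from torch.utils.data import DataLoader

def split_features_and_label(batch):
    # Hypothetical collate function: first two fields are features, third is the label
    xs = [[item[0], item[1]] for item in batch]
    ys = [item[2] for item in batch]
    return torch.tensor(xs), torch.tensor(ys)

rows = [[1, 2, 3], [3, 4, 5], [5, 6, 7], [7, 8, 9]]

# A plain list serves as a map-style dataset here
loader = DataLoader(rows, batch_size=2, collate_fn=split_features_and_label)
batches = [(x.tolist(), y.tolist()) for x, y in loader]
print(batches)
# [([[1, 2], [3, 4]], [3, 5]), ([[5, 6], [7, 8]], [7, 9])]
```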

## Custom collate function for loading a sequential dataset

In [None]:
######################################################################
# Collation
# ---------
#
# The Multi30k iterator above yields pairs of raw strings (German, English).
# We need to convert these string pairs into batched tensors that a
# sequence-to-sequence model can process. Below we define a collate function
# that converts a batch of raw strings into batch tensors that can be fed
# directly into such a model.
#


from torch.nn.utils.rnn import pad_sequence

# helper function to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# function to add BOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]), 
                      torch.tensor(token_ids), 
                      torch.tensor([EOS_IDX])))

# src and tgt language text transforms to convert raw strings into tensors of indices
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln], #Tokenization
                                               vocab_transform[ln], #Numericalization
                                               tensor_transform) # Add BOS/EOS and create tensor


# function to collate data samples into batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
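
To see what the padding step produces, here is a small standalone sketch with made-up token ids. The constants `PAD`, `BOS`, and `EOS` and the helper `add_bos_eos` are redefined locally as assumptions so the snippet does not depend on the spaCy tokenizers or the Multi30k vocabularies.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD, BOS, EOS = 1, 2, 3  # same special indices as above, redefined for a standalone sketch

def add_bos_eos(token_ids):
    # Wrap a list of token ids with the begin and end markers
    return torch.cat((torch.tensor([BOS]), torch.tensor(token_ids), torch.tensor([EOS])))

# Made-up token ids for two sentences of different length
toy_sequences = [add_bos_eos([10, 11, 12]), add_bos_eos([20])]

padded = pad_sequence(toy_sequences, padding_value=PAD)
print(padded.shape)  # torch.Size([5, 2])
print(padded)
```

Because `batch_first` defaults to `False`, the result has shape `(max_length, batch_size)`: each column is one sentence, and the shorter one is filled with `PAD` at the end.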