## Deep Learning with Sequence Data and Text

1. Preprocessing Text: Tokenization, Vectorization and n-gram representation

**Key Points**

* Text is a type of sequential data.
* Deep learning sequence models (RNNs, LSTMs) can find patterns in sequential data and text, which supports tasks such as:
    * Natural Language Understanding
    * Document Classification
    * Sentiment Classification

## Pre-processing text

1. Convert text to tokens (words or characters). (**Tokenization**)
    * Example: **Input text**: "I am studying PyTorch". **Tokens**: "I", "am", "studying", "PyTorch"
2. Map each token to a vector. (**Vectorization**) There are two techniques for vectorization:
    * One-hot encoding
    * Word embedding
    
Let's do this in PyTorch.

In [2]:
# Necessary Imports
import torch, torchvision
import time
import numpy as np
import matplotlib.pyplot as plt
from torch.autograd import Variable 
import torchvision.transforms as transforms
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchvision import models

## Tokenization

I'll tokenize a quote from the late Indian President Dr. APJ Abdul Kalam, one of the country's most famous scientists and politicians.

"Never stop fighting until you arrive at your destined place – that is, the unique you. Have an aim in life, continuously acquire knowledge, work hard, and have the perseverance to realize the great life."

Libraries for tokenization: spaCy. The cells below use plain Python string operations instead.

In [3]:
text = "Never stop fighting until you arrive at your destined place – that is, the unique you. Have an aim in life, continuously acquire knowledge, work hard, and have the perseverance to realize the great life."

In [10]:
tokens_words = text.split(" ") # use a space as delimiter

In [11]:
# starting 4 tokens - converting text to words
tokens_words[:4]

['Never', 'stop', 'fighting', 'until']

In [12]:
# converting text to characters
tokens_char = list(text)

In [13]:
# starting 4 tokens - characters
tokens_char[:4]

['N', 'e', 'v', 'e']
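
Splitting on spaces leaves punctuation attached to words (for example `"is,"` and `"you."`), so `"you"` and `"you."` end up as different tokens. A minimal sketch of one way to separate punctuation with the standard `re` module is shown here; the `simple_tokenize` helper and its pattern are illustrative assumptions, not a replacement for a full tokenizer such as spaCy.

```python
import re

def simple_tokenize(sentence):
    """Split a sentence into word and punctuation tokens (hypothetical helper)."""
    # \w+(?:'\w+)* keeps contractions such as "can't" together;
    # [^\w\s] emits each punctuation mark as its own token.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

sample = "Never stop fighting until you arrive at your destined place – that is, the unique you."
tokens = simple_tokenize(sample)
print(tokens[-6:])  # ['is', ',', 'the', 'unique', 'you', '.']
```

With this split, both occurrences of "you" map to the same token, which keeps the vocabulary smaller.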

## N-gram Representations

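N-grams group consecutive tokens (for example, pairs of words). Challenge: treating text as a collection of n-grams loses the overall sequential nature of the text.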

In [14]:
from nltk import ngrams

In [18]:
two_grams = list(ngrams(text.split(), 2)) # 2 = number of words per group

In [21]:
print(two_grams[:4], end='')

[('Never', 'stop'), ('stop', 'fighting'), ('fighting', 'until'), ('until', 'you')]
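
To see what `ngrams` is doing, the same sliding window can be written in plain Python. This is only a sketch: `make_ngrams` is a hypothetical helper, not part of NLTK, and it uses a shortened version of the quote.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    # zip over n shifted copies of the token list; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

words = "Never stop fighting until you arrive".split()
print(make_ngrams(words, 2)[:2])  # [('Never', 'stop'), ('stop', 'fighting')]
print(make_ngrams(words, 3)[:2])  # [('Never', 'stop', 'fighting'), ('stop', 'fighting', 'until')]
```

A sequence of 6 words yields 5 two-grams and 4 three-grams, so larger `n` gives fewer but more specific groups.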

## Vectorization

1. One Hot Encoding
2. Word Embedding

### One Hot Encoding

In [22]:
# Each token is represented by a vector of length N
# N = vocabulary size
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []
        self.length = 0
    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            # Indices start at 0 so every index is a valid position in a length-N vector
            self.word2idx[word] = self.length
            self.length += 1
        return self.word2idx[word]
    def __len__(self):
        return len(self.word2idx)
    def onehot_encoded(self, word):
        vec = np.zeros(self.length)
        vec[self.word2idx[word]] = 1
        return vec

In [23]:
dic = Dictionary()
for token in text.split(' '):
    dic.add_word(token)

In [29]:
print(dic.word2idx)

{'Never': 0, 'stop': 1, 'fighting': 2, 'until': 3, 'you': 4, 'arrive': 5, 'at': 6, 'your': 7, 'destined': 8, 'place': 9, '–': 10, 'that': 11, 'is,': 12, 'the': 13, 'unique': 14, 'you.': 15, 'Have': 16, 'an': 17, 'aim': 18, 'in': 19, 'life,': 20, 'continuously': 21, 'acquire': 22, 'knowledge,': 23, 'work': 24, 'hard,': 25, 'and': 26, 'have': 27, 'perseverance': 28, 'to': 29, 'realize': 30, 'great': 31, 'life.': 32}


In [30]:
dic.onehot_encoded('Never')

array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [32]:
dic.onehot_encoded('fighting')

array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

**Disadvantages**:

1. Too sparse: each vector holds a single 1 and zeros everywhere else.
2. The vector length equals the vocabulary size, so vectors become very large as the text grows. (One-hot encoding is therefore not generally used in DL.)
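
The sparsity is easy to quantify. The sketch below uses a small made-up vocabulary (an assumption for illustration) and plain Python lists: each one-hot vector has exactly one non-zero entry, its length grows with the vocabulary, and any two different words have a dot product of zero, so the encoding says nothing about how similar two words are.

```python
vocab = ["never", "stop", "fighting", "until", "you"]  # toy vocabulary (assumption)
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot list with 1.0 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

stop_vec, you_vec = one_hot("stop"), one_hot("you")
print(len(stop_vec))                        # 5, one slot per vocabulary word
print(stop_vec.count(0.0) / len(stop_vec))  # 0.8, the fraction of zeros
print(dot(stop_vec, you_vec))               # 0.0, different words never overlap
```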

### Word Embedding

In [33]:
# Semantically similar words have similar vector representations
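
A small sketch of what "similar representation" means in practice: with dense vectors, closeness can be measured with cosine similarity. The three-dimensional vectors below are hand-picked toy values (assumptions), not learned embeddings; in a real model they would be learned during training, for example by an `nn.Embedding` layer.

```python
import math

# Hand-picked toy vectors (assumption): real embeddings are learned from data.
toy_embeddings = {
    "good":  [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "table": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["table"])
print(sim_close > sim_far)  # True
```

Unlike one-hot vectors, these dense vectors are short and can express that "good" is closer to "great" than to "table".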

In [35]:
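# Note: `Field` belongs to torchtext's legacy API. From torchtext 0.9 it lived under
# `torchtext.legacy.data`, and it was removed in 0.12.
import torchtext
from torchtext import data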

**Class:** `data.Field`

Attributes:

1. `sequential`:
```python
if sequential:
    pass  # apply tokenization
else:
    pass  # don't apply tokenization
```
2. `use_vocab`:
```python
if use_vocab:
    pass  # use a Vocab object
else:
    pass  # the data should already be numerical
```
The remaining attributes are described in the docs below.

In [41]:
help(data.Field)  # help() prints the documentation itself and returns None

Help on class Field in module torchtext.data.field:

class Field(RawField)
 |  Defines a datatype together with instructions for converting to Tensor.
 |  
 |  Field class models common text processing datatypes that can be represented
 |  by tensors.  It holds a Vocab object that defines the set of possible values
 |  for elements of the field and their corresponding numerical representations.
 |  The Field object also holds other parameters relating to how a datatype
 |  should be numericalized, such as a tokenization method and the kind of
 |  Tensor that should be produced.
 |  
 |  If a Field is shared between two columns in a dataset (e.g., question and
 |  answer in a QA dataset), then they will have a shared vocabulary.
 |  
 |  Attributes:
 |      sequential: Whether the datatype represents sequential data. If False,
 |          no tokenization is applied. Default: True.
 |      use_vocab: Whether to use a Vocab object. If False, the data in this
 |          field should alrea

In [37]:
TEXT = data.Field(lower=True, batch_first=True, fix_length=20)

In [38]:
LABEL = data.Field(sequential=False)
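
As a rough illustration of what the `TEXT` settings ask for, the sketch below mimics them in plain Python: lowercasing (`lower=True`), whitespace tokenization, and padding or truncating every example to a fixed length (`fix_length`). The `preprocess` helper is hypothetical, and a length of 6 is used instead of 20 to keep the output short. The `"<pad>"` string matches the default pad token of `Field`, but this is not torchtext's actual implementation.

```python
def preprocess(sentence, fix_length=6, pad_token="<pad>"):
    """Lowercase, split on whitespace, then pad or truncate to fix_length."""
    tokens = sentence.lower().split()
    tokens = tokens[:fix_length]                              # truncate long examples
    return tokens + [pad_token] * (fix_length - len(tokens))  # pad short ones

print(preprocess("Great movie"))
# ['great', 'movie', '<pad>', '<pad>', '<pad>', '<pad>']
print(preprocess("One of the greatest films ever made"))
# ['one', 'of', 'the', 'greatest', 'films', 'ever']
```

Because every example ends up with the same length, examples can be stacked into a single tensor; `batch_first=True` asks for that tensor to have the batch as its first dimension.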

In [42]:
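# Download the IMDB dataset and split it into train and test sets
# (named `test`, so the earlier `text` string is not overwritten)
train, test = torchtext.datasets.IMDB.splits(TEXT, LABEL)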

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:23<00:00, 3.59MB/s]


In [46]:
print("Datatype of train.fields: ", type(train.fields))

Datatype of train.fields:  <class 'dict'>


In [49]:
print(vars(train[0]))

{'text': ['elvira', 'mistress', 'of', 'the', 'dark', 'is', 'one', 'of', 'my', 'fav', 'movies,', 'it', 'has', 'every', 'thing', 'you', 'would', 'want', 'in', 'a', 'film,', 'like', 'great', 'one', 'liners,', 'sexy', 'star', 'and', 'a', 'outrageous', 'story!', 'if', 'you', 'have', 'not', 'seen', 'it,', 'you', 'are', 'missing', 'out', 'on', 'one', 'of', 'the', 'greatest', 'films', 'made.', 'i', "can't", 'wait', 'till', 'her', 'new', 'movie', 'comes', 'out!'], 'label': 'pos'}
