### Tokenization

In a natural language processing task, we start by breaking a text corpus into smaller units called tokens. Here we split sentences into individual words; each word carries meaning on its own and, together with the words around it, conveys the intent of the sentence.


In [1]:
# Using lambda function to create a tokenizer

tokenizer = lambda x : x.split()

In [2]:
tokenizer('This is a test string')

['This', 'is', 'a', 'test', 'string']

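We tokenized the sentence by splitting it on whitespace. Note that this simple approach leaves any punctuation attached to the neighbouring word.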
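As a small illustration of that limitation, the sketch below compares a whitespace split with a regular-expression tokenizer that separates punctuation. The names `whitespace_tokens` and `regex_tokens` and the pattern itself are only illustrative assumptions, not part of any library used in this notebook.

```python
import re

sentence = "This is a, test string"

# Whitespace splitting keeps the comma glued to the preceding word.
whitespace_tokens = sentence.split()

# A simple pattern: runs of word characters, or any single character
# that is neither a word character nor whitespace (i.e. punctuation).
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)  # ['This', 'is', 'a,', 'test', 'string']
print(regex_tokens)       # ['This', 'is', 'a', ',', 'test', 'string']
```

For real text, a dedicated tokenizer usually handles contractions and abbreviations better than a pattern this simple.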

In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
# Alternative Approach: using nltk

from nltk.tokenize import word_tokenize

word_tokenize("This is a, test string")

['This', 'is', 'a', ',', 'test', 'string']

### Creating Fields

Fields define the datatype of each column and specify the operations (such as tokenization and lowercasing) that are applied to turn textual data into tensors.

In [5]:
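# Note: in torchtext 0.9 to 0.11 this class lives in torchtext.legacy.data,
# and it was removed in torchtext 0.12.
from torchtext.data import Field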

# Fields for Reviews
Review = Field(sequential= True, tokenize= tokenizer , lower= True)

# Fields for Labels
Label = Field(sequential= False, use_vocab= False)

# Adding tokens at the start and end of the input string

SequenceField = Field(tokenize= tokenizer, init_token= '<sos>', eos_token= '<eos>', lower= True)

# Setting a field with a fixed length

SequenceField = Field(tokenize= tokenizer, init_token= '<sos>', eos_token= '<eos>', lower= True, fix_length= 50)

# Setting an unknown token

SequenceField = Field(tokenize= tokenizer, init_token= '<sos>', eos_token= '<eos>', lower= True, unk_token= '<unk>')
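To make the effect of these options more concrete, here is a minimal pure-Python sketch of roughly what such a field does to a single string: lowercase, tokenize, add start and end tokens, then truncate or pad to a fixed length. The function `preprocess` and its parameters are hypothetical names chosen for illustration; the real `Field` applies padding per batch and also converts tokens to integer ids.

```python
def preprocess(text, fix_length=8, init_token="<sos>",
               eos_token="<eos>", pad_token="<pad>"):
    """Illustrative approximation of a sequential field's text pipeline."""
    tokens = text.lower().split()          # lower=True, whitespace tokenizer
    max_len = fix_length - 2               # reserve room for <sos> and <eos>
    tokens = [init_token] + tokens[:max_len] + [eos_token]
    # Pad on the right until the sequence reaches fix_length.
    tokens += [pad_token] * (fix_length - len(tokens))
    return tokens


padded = preprocess("This is a test string")
truncated = preprocess("a b c d e f g h i j")

print(padded)
print(truncated)
```

Both results have exactly `fix_length` entries: the short input is padded, while the long one is cut so that the start and end tokens still fit.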

### Developing the Dataset

TorchText can read data from text files, CSV/TSV files, JSON files, and directories, and convert them into datasets. A dataset is a preprocessed block of data that is read into memory and can be used by other data structures.

In [6]:
from torchtext.data import TabularDataset

# Selecting the training columns
train_datafields = [("id", None), ("content", Review), ("Business", Label), ("SciTech", Label), ("Sports", Label), ("World", Label)]

# Selecting the testing columns
test_datafields = [('id',None), ('content',Review)]

# Reading the training and validation file
train, valid = TabularDataset.splits(path= '/content/NewsClassificaiton' ,train='train.csv', validation='valid.csv', format='csv', skip_header=True, fields=train_datafields)

# Reading the test file

test = TabularDataset(path= '/content/NewsClassificaiton/test.csv', format = 'csv', skip_header= True, fields= test_datafields)
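The way a list of `(column name, field)` pairs maps onto the columns of a CSV file can be sketched with the standard library alone. Everything below is an assumption made for illustration: the two sample rows are invented, and plain callables stand in for `Field` objects. As in the cell above, a column paired with `None` is skipped.

```python
import csv
import io

# Hypothetical sample data with the same column layout as the training file.
sample_csv = io.StringIO(
    "id,content,Business,SciTech,Sports,World\n"
    "1,Stocks rally on earnings,1,0,0,0\n"
    "2,Home team wins the final,0,0,1,0\n"
)

def tokenize(text):
    return text.lower().split()

# (column name, processing function); None means "ignore this column".
fields = [("id", None), ("content", tokenize), ("Business", int),
          ("SciTech", int), ("Sports", int), ("World", int)]

reader = csv.reader(sample_csv)
next(reader)  # equivalent of skip_header=True

examples = []
for row in reader:
    # Pair each column with its processor by position, dropping ignored ones.
    example = {name: process(value)
               for (name, process), value in zip(fields, row)
               if process is not None}
    examples.append(example)

print(examples[0])
```

Columns are matched to fields by position, so the order of the list has to follow the order of the columns in the file.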

In [8]:
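# Building the vocabulary
# Reuse the Review field that the datasets were created with: build_vocab only
# collects tokens from dataset columns bound to this same Field object, so
# re-creating the field here would leave the vocabulary without any words.
Review.build_vocab(train, min_freq=2)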

We used the TabularDataset class in torchtext to read the CSV files; it can also read TSV and JSON files. The `fields` argument, given as a list of tuples or as a dictionary, defines the columns of the dataset.
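The `min_freq=2` argument passed to `build_vocab` drops words that occur fewer than two times in the training data. A rough sketch of that idea, using `collections.Counter` on a tiny invented corpus, might look like this. The names `itos`, `stoi`, and `numericalize` are illustrative, and the ordering and special tokens only approximate what torchtext does.

```python
from collections import Counter

# Hypothetical tokenized corpus, purely for illustration.
corpus = [
    ["the", "market", "rose"],
    ["the", "team", "won"],
    ["the", "market", "fell"],
]

MIN_FREQ = 2
counts = Counter(token for doc in corpus for token in doc)

# Special tokens come first, then frequent words by descending count.
specials = ["<unk>", "<pad>"]
frequent = sorted((t for t, c in counts.items() if c >= MIN_FREQ),
                  key=lambda t: (-counts[t], t))
itos = specials + frequent                      # index -> token
stoi = {token: i for i, token in enumerate(itos)}  # token -> index

def numericalize(tokens):
    # Words below the frequency threshold fall back to the <unk> index.
    return [stoi.get(t, stoi["<unk>"]) for t in tokens]

print(itos)
print(numericalize(["the", "team", "won"]))
```

Rare words such as "team" and "won" all share the unknown-token index, which keeps the vocabulary small.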

### Developing Iterators

Iterators load batches of data from a dataset and provide methods that make it easier to move that data to the appropriate device. We iterate over these objects when running through the epochs.

In [9]:
from torchtext.data import BucketIterator
import torch

In [10]:
# Defining the Batch Size

BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, valid_iter, test_iter = BucketIterator.splits((train, valid, test),
                                     batch_size=BATCH_SIZE,
                                     device=device,
                                     # Sort by the length of the "content" column defined in the datafields
                                     sort_key=lambda x: len(x.content),
                                     sort_within_batch=False)
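The idea behind bucketing is to batch together examples of similar length, so that little padding is needed within each batch. The sketch below shows a simplified version of that idea in plain Python; `bucket_batches` is a hypothetical helper, and the real `BucketIterator` works on larger shuffled chunks of the data rather than sorting the whole dataset at once.

```python
import random

def bucket_batches(examples, batch_size, sort_key, seed=0):
    """Group examples of similar length into batches, then shuffle the batches."""
    ordered = sorted(examples, key=sort_key)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    # Shuffle the order of batches so that training does not always
    # see short sequences first.
    random.Random(seed).shuffle(batches)
    return batches


# Hypothetical tokenized documents with lengths 5, 1, 9, 2, 8 and 3.
docs = [["tok"] * n for n in [5, 1, 9, 2, 8, 3]]
batches = bucket_batches(docs, batch_size=2, sort_key=len)

for batch in batches:
    print([len(doc) for doc in batch])
```

Each batch ends up holding documents of neighbouring lengths, so padding every batch to its longest member wastes few positions.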