<a href="https://colab.research.google.com/github/mkaramib/trax/blob/main/QuestionClassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Question Classification
In this notebook, I will implement a question classifier using the Trax deep learning framework.

In [1]:
import numpy as np_base  # regular ol' numpy
import os
import random as rnd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from unicodedata import normalize
import re


# Initialize
Some of the libraries need resources to be downloaded or initialized, such as the NLTK tokenizer and stop-words. The following lines perform these steps.

In [2]:
# initialize
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Trax:
In this section, we install [trax](https://github.com/google/trax) (or upgrade it if it is already installed) and import the modules used below.

In [3]:
!pip install -q -U trax
import trax
from trax import layers as tl  # core building block
from trax import shapes  # data signatures: dimensionality and type
from trax import fastmath  # uses jax, offers numpy on steroids
from trax.supervised import training




Check which version of Trax has been installed.



In [4]:
!pip list | grep trax

trax                          1.3.6                


## Trax Numpy
The key mathematical benefit of Trax is that it uses JAX to implement its own version of numpy. The following line imports Trax's numpy.

In [5]:
# import trax.fastmath.numpy
import trax.fastmath.numpy as np

# Data 
In this section, all the training and testing questions are read.
*   Train Data: each training file contains 1000, 2000, 3000, 4000, or 5500 questions.
*   Validation Data: we use one of the training files as validation data.
*   Test Data: contains close to 500 questions to evaluate the trained model.

In each file (train and test), every line holds one question in the following format:
*   QuestionCategory: Question content.
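
To make the line format concrete, here is a minimal sketch that splits each line at its first colon and counts the categories. The sample lines and the helper name `split_label` are illustrative assumptions; they are not taken from the data files.

```python
from collections import Counter

# Hypothetical sample lines in the "QuestionCategory: Question content" format.
sample_lines = [
    "DESC: How did serfdom develop in Russia ?",
    "HUM: Who wrote Hamlet ?",
    "DESC: What is an atom ?",
]

def split_label(line):
    """Split a line at the first colon into (category, content)."""
    category, _, content = line.partition(":")
    return category.strip(), content.strip()

pairs = [split_label(line) for line in sample_lines]
category_counts = Counter(category for category, _ in pairs)
print(pairs[0])
print(category_counts)
```

Counting the categories this way is also a quick check of how balanced the classes are before training.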

In [22]:
train_file = open("./questions/train_2000.label", mode='r',encoding="ISO-8859-1")
val_file = open("./questions/train_1000.label", mode='r',encoding="ISO-8859-1")
test_file = open("./questions/TREC_10.label", mode='r', encoding="ISO-8859-1")

## Tokenization
One of the key steps in preprocessing is to tokenize the questions. In this experiment, we use NLTK tokenization.


In [23]:
def tokenize(question):
    """
    separate the question type as well as question tokens
    :param question: given question
    :return: question_category, question_terms
    """
    colon = question.find(':')            # index of first colon to separate the question category
    q_cat = question[0:colon]             # get question type
    content_normalized = normalize('NFKC', question[colon + 1:])  # normalize the content after the colon
    content_normalized = re.sub("[^a-zA-Z. ]", "", content_normalized)  # keep only letters, periods, and spaces
    terms_all = word_tokenize(content_normalized)             # tokenize the content

    # remove the stop words (case-sensitive: the NLTK stop-word list is lowercase)
    terms = [w for w in terms_all if w not in stop_words]
    #terms = terms_all
    return q_cat, terms
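
The stop-word filter in `tokenize` is case-sensitive. NLTK's English stop-word list is lowercase, so a capitalized token such as "What" passes through unless tokens are lowercased before the check. The sketch below illustrates the difference with a small stand-in stop-word set and `str.split` in place of `word_tokenize`; both are simplifying assumptions so that the example runs without NLTK data, and `toy_tokenize` is a hypothetical helper.

```python
import re

# Stand-in for the NLTK stop-word set (assumption: a tiny lowercase subset).
toy_stop_words = {"what", "is", "the", "of"}

def toy_tokenize(content, lowercase=False):
    """Keep letters, periods and spaces, split on whitespace, drop stop words."""
    cleaned = re.sub("[^a-zA-Z. ]", "", content)
    tokens = cleaned.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in toy_stop_words]

content = "What is the capital of France ?"
case_sensitive = toy_tokenize(content)
lowercased = toy_tokenize(content, lowercase=True)
print(case_sensitive)
print(lowercased)
```

Whether to lowercase is a design choice: it shrinks the vocabulary and removes more stop words, but it also discards capitalization, which can be a useful signal for named entities.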

## Data Preprocess
In this step, all the questions are read, tokenized, and stored in a list of tuples: *(category, terms)*

In [24]:
def load_data(file):
  """
  read the lines from the given file and prepare the list of tuples of questions.
  :param file: given file
  :return: list of tuples(category, terms)
  """
  questions = []
  lines = file.readlines()
  for line in lines:
    cat, terms = tokenize(line)
    questions.append((cat, terms))
  return questions

# Vocabulary and Targets
We need the vocabulary and the targets for training, and this step prepares them. Later, each question is converted to a tensor, which is a list of numbers, so every term needs a unique ID. The `vocab` dictionary stores this mapping from term to ID.

In [25]:
def build_vocabulary(questions):
  """
  Generate the vocabulary from the questions. 
  :param questions: given list of tuples(cat, terms)
  :return: list of unique categories, vocabulary(dictionary of term:Id)
  """
  cats = [cat for (cat, _) in questions]
  vocab = {'__PAD__': 0, '__UNK__': 1}
  for (_, terms) in questions:
    for term in terms:
      if term not in vocab:
        vocab[term] = len(vocab)
  # sort the categories so that the target ids are stable across runs
  return sorted(set(cats)), vocab
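
As a quick illustration of how IDs are assigned, the sketch below builds a vocabulary from three toy questions. `toy_build_vocabulary` follows the same logic as `build_vocabulary` and sorts the categories so that the target IDs are reproducible; the toy data is invented for the example.

```python
# Toy (category, terms) tuples; the data is invented for illustration.
toy_questions = [
    ("LOC", ["capital", "France"]),
    ("HUM", ["wrote", "Hamlet"]),
    ("LOC", ["capital", "Peru"]),
]

def toy_build_vocabulary(questions):
    """Assign the next free integer ID to every term not seen before."""
    vocab = {"__PAD__": 0, "__UNK__": 1}
    for _, terms in questions:
        for term in terms:
            if term not in vocab:
                vocab[term] = len(vocab)
    # Sorting makes the category order, and so the target IDs, reproducible.
    cats = sorted({cat for cat, _ in questions})
    return cats, vocab

toy_cats, toy_vocab = toy_build_vocabulary(toy_questions)
print(toy_cats)
print(toy_vocab)
```

The repeated term "capital" keeps the ID it received on first sight, and IDs 0 and 1 stay reserved for padding and unknown terms.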

# Build Tensor
One of the first steps in training any neural network is to convert the input to a tensor.

In [None]:
def question_to_tensor(question, vocab, unk_token="__UNK__"):
  """
  convert the given question into tensor
  :param question: list of terms of question, [t1, t2, ...]
  :param vocab: dictionary of vocabulary
  :param unk_token: token to be used for the terms that are not in the vocabs.
  :return: tensor = [1, 4, 2, ...]
  """
  tensor = []
  for term in question:
    # get the id for the term
    word_id = vocab[term] if term in vocab else vocab[unk_token]
    tensor.append(word_id)

  return tensor
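
For example, with a small made-up vocabulary, a question containing an out-of-vocabulary term maps that term to the `__UNK__` ID. The sketch uses `dict.get` as a compact equivalent of the lookup above; `toy_vocab` and `to_ids` are illustrative names.

```python
# Made-up vocabulary for illustration only.
toy_vocab = {"__PAD__": 0, "__UNK__": 1, "capital": 2, "France": 3}

def to_ids(terms, vocab, unk_token="__UNK__"):
    """Map each term to its ID, falling back to the unknown-token ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(term, unk_id) for term in terms]

ids = to_ids(["capital", "Germany"], toy_vocab)
print(ids)
```

This fallback matters at validation and test time, where questions can contain terms that never appeared in the training data.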

# Batch Data Generator
In most deep neural network models, the inputs are fed in batches. A batch generator is implemented to generate batches of data samples for *train*, *validation*, and *test*.

In [None]:
def data_generator(data, vocab, cats, batch_size, loop, shuffle=False):
  '''
  Generate a batch of samples from the given data.
  :param data: list of tuples of questions:(cat, terms).
  :param vocab: vocabulary dictionary {term:id, ...}
  :param cats: list of categories of questions.
  :param batch_size: size of batch.
  :param loop: True/False to loop back at the end of data.
  :param shuffle: True/False, shuffle the data or not.
  :Yield: inputs: subset of data samples, target: corresponding targets of selected inputs.
  '''

  # build a list of indexes for data samples
  data_l = len(data)
  data_indexes = list(range(data_l))

  # get the max length of questions for padding.
  max_l = max((len(q) for (_, q) in data), default=0)

  # shuffle the indexes if it is True
  if shuffle:
    rnd.shuffle(data_indexes)

  stop = False
  index = 0

  while not stop:
    batch = []
    targets = []
    
    for i in range(batch_size):

        # if at the end of data.
        if index >= len(data_indexes):
          if not loop:
            stop = True
            break
          
          # start index from 0
          index = 0
          
          # shuffle the data indexes if required
          if shuffle:
            rnd.shuffle(data_indexes)
          
        # get the question, convert to tensor, and append the data and target
        q = data[data_indexes[index]]
        q_tensor = question_to_tensor(q[1], vocab)
        
        # pad the tensor to the length of the longest question in the data.
        q_tensor_pad = q_tensor + [vocab["__PAD__"]]*(max_l - len(q_tensor))

        # append the padded tensor so that every row in the batch has the same length
        batch.append(q_tensor_pad)
        targets.append(cats.index(q[0]))

        # increase index
        index += 1

    # if stop
    if stop:
      break

    # yield the batch and targets
    yield np.array(batch), np.array(targets)
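
Padding is what makes a batch rectangular: every question is extended with the `__PAD__` ID up to a common length, so the batch can be turned into a 2-D array. A minimal sketch with plain lists follows. `pad_batch` is a hypothetical helper, and it pads to the longest question in the batch, whereas the generator above uses the longest question in the whole data set as the target length.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every ID list with pad_id to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Three questions of different lengths, already converted to IDs.
batch = [[2, 3], [4, 5, 6, 7], [8]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

Padding per batch keeps the arrays smaller, while padding to a global maximum keeps every batch the same shape; which is preferable depends on the model and the backend.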

We need to build data generators for training, validation, and testing processes. 

In [None]:
# NOTE: vocab, cats, and batch_size are globals defined in the main cell.
def train_generator(train_questions, shuffle=False):
  return data_generator(train_questions, vocab, cats, batch_size, True, shuffle=shuffle)

def eval_generator(eval_questions, shuffle=False):
  return data_generator(eval_questions, vocab, cats, batch_size, True, shuffle=shuffle)

def test_generator(test_questions, shuffle=False):
  return data_generator(test_questions, vocab, cats, batch_size, True, shuffle=shuffle)

# Classifier
`classifier` is a function that builds the neural network: an embedding layer, a mean over the sequence axis, a dense output layer, and a log-softmax.

In [None]:
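def classifier(vocab_size=None, embedding_dim=256, output_dim=2, mode='train'):
    # default to the size of the global vocab, resolved when the function is called
    if vocab_size is None:
        vocab_size = len(vocab)

    # create embedding layer
    embed_layer = tl.Embedding(
        vocab_size = vocab_size,    # Size of the vocabulary
        d_feature = embedding_dim)  # Embedding dimension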
    
    # Create a mean layer, to create an "average" word embedding
    mean_layer = tl.Mean(axis=1)
    
    # Create a dense layer, one unit for each output
    dense_output_layer = tl.Dense(n_units = output_dim)
    
    # Create the log softmax layer.
    log_softmax_layer = tl.LogSoftmax()
    
    # Create a model using tl.Serial to combine all layers
    model = tl.Serial(
      embed_layer,        # embedding layer
      mean_layer,         # mean layer
      dense_output_layer, # dense output layer 
      log_softmax_layer   # log softmax layer
    )

    # return the model (a tl.Serial)
    return model
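
To show what the stacked layers compute, here is a framework-free sketch of the forward pass for a single question: look up one embedding per token ID, average the embeddings, apply a dense layer, then take the log-softmax. The weights are small hand-picked numbers (in Trax they are learned), and all helper names are hypothetical. Note that a plain mean also averages over `__PAD__` positions.

```python
import math

# Hand-picked toy parameters; in the real model these are learned.
embeddings = {0: [0.0, 0.0], 2: [1.0, 2.0], 3: [3.0, 4.0]}  # id -> 2-d vector
dense_w = [[1.0, 0.0, -1.0],   # shape (embedding_dim=2, n_classes=3)
           [0.0, 1.0, 1.0]]
dense_b = [0.0, 0.0, 0.0]

def mean_embedding(token_ids):
    """Average the embedding vectors of all token IDs (padding included)."""
    vectors = [embeddings[i] for i in token_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def dense(x):
    """Compute x @ W + b for one input vector."""
    return [sum(x[d] * dense_w[d][k] for d in range(len(x))) + dense_b[k]
            for k in range(len(dense_b))]

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

avg = mean_embedding([2, 3])            # mean of [1, 2] and [3, 4]
log_probs = log_softmax(dense(avg))
predicted_class = log_probs.index(max(log_probs))
print(avg, predicted_class)
```

Because the mean discards word order, this architecture behaves like a bag-of-words model over learned embeddings.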

# Train
Training a model includes the following steps:
1.  Define the *train_task* and the *eval_task*.
2.  Define the training loop using these tasks.



In [None]:
def train_model(classifier, batch_size, n_steps, output_dir):
    '''
    Input: 
        classifier: the model to train
        batch_size: the batch size for data-generation
        n_steps: the number of training steps to run
        output_dir: folder to save your files
    Output:
        training_loop: trax training loop, which holds the trained model
    '''
    # 1- Define the train_task
    # (training_data and validation_data are the globals loaded in the main cell;
    #  the generator helpers read the global batch_size)
    train_task = training.TrainTask(
        labeled_data=train_generator(training_data, shuffle=True),
        loss_layer=tl.CrossEntropyLoss(),
        optimizer=trax.optimizers.Adam(0.01),
        n_steps_per_checkpoint=10,
    )

    # 2- Define the eval_task
    eval_task = training.EvalTask(
        labeled_data=eval_generator(validation_data, shuffle=True),
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
    )

    # 3- training loop
    training_loop = training.Loop(
                                classifier, # The learning model
                                train_task, # The training task
                                eval_task = eval_task, # The evaluation task
                                output_dir = output_dir) # The output directory

    # 4- run the training loop
    training_loop.run(n_steps = n_steps)

    # 5- Return the training_loop, since it has the model.
    return training_loop
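
The evaluation task reports cross-entropy and accuracy. As a rough illustration of what an accuracy metric measures, the sketch below takes per-class log-probabilities for a batch, picks the highest-scoring class in each row, and compares it with the target. It is a plain-Python approximation for intuition, not Trax's implementation; `batch_accuracy` is a hypothetical helper and the numbers are made up.

```python
def batch_accuracy(log_probs_batch, targets):
    """Fraction of rows whose highest-scoring class equals the target."""
    correct = 0
    for log_probs, target in zip(log_probs_batch, targets):
        predicted = max(range(len(log_probs)), key=lambda k: log_probs[k])
        correct += int(predicted == target)
    return correct / len(targets)

# Made-up log-probabilities for 4 questions and 3 classes.
log_probs_batch = [
    [-0.1, -2.5, -3.0],
    [-1.9, -0.3, -2.2],
    [-0.2, -2.0, -2.9],
    [-2.7, -1.8, -0.3],
]
targets = [0, 1, 2, 2]
accuracy = batch_accuracy(log_probs_batch, targets)
print(accuracy)
```

Here the third question is misclassified (class 0 is predicted, the target is 2), so three of four predictions are correct.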

# Main Function
The main process includes the following steps:
1.   Initialization
  *   Batch size
  *   Random seed
2.   Load data-sets
  *   define files
  *   load data samples
3.   Build vocabulary (terms and question categories)
4.   Define data-generators
  *   test data-generator (other generators are defined in the corresponding tasks)
5.   Define the model (classifier) and run the training loop

In [None]:
# initialize batch_size, random seed
batch_size = 16
rnd.seed(271)
n_steps = 100
output_dir="./output_dir/"

# load training questions
training_data = load_data(train_file)

# load validation questions
validation_data = load_data(val_file)

# load test questions
test_data = load_data(test_file)

# build vocabulary
cats, vocab = build_vocabulary(training_data)

# define data-generators
test_data_gen = test_generator(test_data, shuffle=False)

# define the model(classifier)
#model = classifier(vocab_size=len(vocab), output_dim=len(cats))

# instantiate training 
#training_loop = train_model(model, batch_size=batch_size, n_steps=n_steps, output_dir=output_dir)


