<a href="https://colab.research.google.com/github/mkaramib/trax/blob/main/QuestionClassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Question Classification
In this notebook, I implement a question classifier using the Trax deep learning framework.

In [None]:
import numpy as np_base  # regular ol' numpy
import os
import random as rnd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from unicodedata import normalize
import re

# Initialize
Some libraries rely on resources that must be downloaded or initialized first, such as the NLTK tokenizer and the stop-word list. The following lines take care of this.

In [None]:
# initialize
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Trax
In this section, we install [trax](https://github.com/google/trax) if it is not already installed.

In [None]:
!pip install -q -U trax
import trax
from trax import layers as tl  # core building block
from trax import shapes  # data signatures: dimensionality and type
from trax import fastmath  # uses jax, offers numpy on steroids
from trax.supervised import training

Check which version of Trax has been installed.



In [None]:
!pip list | grep trax

## Trax Numpy
The key mathematical benefit of Trax is that its numpy implementation is backed by JAX. The following line imports Trax's numpy.

In [None]:
# import trax.fastmath.numpy
import trax.fastmath.numpy as np

# Data Processing
In this section, all the training and testing questions are read.
*   Train Data: contains 1000, 2000, 3000, 4000, or 5500 questions in each file.
*   Validation Data: we use one of the training files as validation data.
*   Test Data: contains close to 500 questions to evaluate the trained model.

In each file (train and test), each line holds one question in the following format:
*   QuestionCategory: Question content.
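
As a minimal sketch of this line format, the snippet below splits one line at the first colon. The sample line and the helper name `split_label` are hypothetical and only serve as an illustration; they are not taken from the data files.

```python
def split_label(line):
    """Split a 'Category: content' line at the first colon (hypothetical helper)."""
    category, _, content = line.partition(':')
    return category.strip(), content.strip()

# Hypothetical sample line in the "QuestionCategory: Question content." format
sample = "LOC: What is the capital of France ?"
category, content = split_label(sample)
print(category)
print(content)
```

Only the first colon acts as the separator, so any colon inside the question content is left untouched.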

In [None]:
train_file = open("./questions/train_2000.label", mode='r',encoding="ISO-8859-1")
val_file = open("./questions/train_1000.label", mode='r',encoding="ISO-8859-1")
test_file = open("./questions/TREC_10.label", mode='r', encoding="ISO-8859-1")

## Tokenization
One of the key preprocessing steps is to tokenize the questions. In this experiment, we use the NLTK tokenizer.


In [None]:
def tokenize(question):
    """
    separate the question type as well as question tokens
    :param question: given question
    :return: question_category, question_terms
    """
    colon = question.find(':')            # index of first colon to separate the question category
    q_cat = question[0:colon]             # get question type
    content_normalized = normalize('NFKC', question[colon:])  # normalize the content
    content_normalized = re.sub("[^a-zA-Z. ]", "", content_normalized)  # remove non-alphabetic parts of question
    terms_all = word_tokenize(content_normalized)             # tokenize the content

    # remove the stop words
    terms = [w for w in terms_all if w not in stop_words]
    #terms = terms_all
    return q_cat, terms

## Data Preprocess
In this step, all the questions are read, tokenized, and stored in a list of tuples: *(category, terms)*

In [None]:
def load_data(file):
  """
  read the lines from the given file and prepare the list of tuples of questions.
  :param file: given file
  :return: list of tuples(category, terms)
  """
  questions = []
  file.seek(0)  # rewind so the same file object can be loaded more than once
  lines = file.readlines()
  for line in lines:
    cat, terms = tokenize(line)
    questions.append((cat, terms))
  return questions

In [None]:
# test load_data
temp_train_qs = load_data(train_file)
print(f'number of questions for training = {len(temp_train_qs)}')
print(f'as an example = {temp_train_qs[0]}')

## Vocabulary and Targets
Training requires a vocabulary and a list of targets, which this step prepares. Later on, each question has to be converted into a tensor, that is, a list of numbers, so every term needs a unique id. The vocab dictionary stores this mapping.

In [None]:
def build_vocabulary(questions):
  """
  Generate the vocabulary from the questions. 
  :param questions: given list of tuples(cat, terms)
  :return: list of unique categories, vocabulary(dictionary of term:Id)
  """
  cats = [cat for (cat, _) in questions]
  vocab = {'__PAD__': 0, '__UNK__': 1}
  for (_, terms) in questions:
    for term in terms:
      if term not in vocab:
        vocab[term] = len(vocab)
  # sort the categories so that the class ids stay the same across runs
  return sorted(set(cats)), vocab

In [None]:
# test the vocabulary builder
temp_cats, temp_vocabs = build_vocabulary(temp_train_qs)
print(f'Categories = {temp_cats}')
print(f'number of terms in vocab = {len(temp_vocabs)}')

## Build Tensor
One of the first steps in training any neural network is to convert the inputs to tensors.

In [None]:
def question_to_tensor(question, vocab, unk_token="__UNK__"):
  """
  convert the given question into tensor
  :param question: list of terms of question, [t1, t2, ...]
  :param vocab: dictionary of vocabulary
  :param unk_token: token to be used for the terms that are not in the vocabs.
  :return: tensor = [1, 4, 2, ...]
  """
  tensor = []
  for term in question:
    # get the id for the term
    word_id = vocab[term] if term in vocab else vocab[unk_token]
    tensor.append(word_id)

  return tensor

In [None]:
# test question_to_tensor builder
temp_tensor = question_to_tensor(temp_train_qs[1][1], temp_vocabs)
print(f'question_terms = {temp_train_qs[1][1]}, tensor = {temp_tensor}, length = {len(temp_tensor)}')

## Data Loader
Here we load the *train*, *validation*, and *test* data and collect the vocabulary and question categories from the training data.

In [None]:
# load training questions
training_data = load_data(train_file)

# load validation questions
validation_data = load_data(val_file)

# load test questions
test_data = load_data(test_file)

# build vocabulary
cats, vocab = build_vocabulary(training_data)

## Data Generator(Batch)
Most deep neural network models consume their inputs in batches. A batch generator is implemented to produce batches of data samples for *train*, *validation*, and *test*.
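
Since questions differ in length, every row of a batch has to be padded to a common length before it can be stacked into one array. The following is a minimal sketch of that padding step in plain Python; the helper name `pad_batch` and the toy id lists are hypothetical, and the padding id 0 is assumed to match `'__PAD__': 0` in the vocabulary.

```python
PAD_ID = 0  # assumed id of the padding token ('__PAD__': 0 in the vocabulary)

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every id list to the length of the longest one (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Toy id sequences of different lengths
toy_batch = pad_batch([[5, 8, 2], [7], [3, 4]])
for row in toy_batch:
    print(row)
```

This sketch pads to the longest row of the batch, whereas the generator in this notebook pads to the longest question in the whole data set, so all of its batches share the same width.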

In [None]:
def data_generator(data, vocab, cats, batch_size, loop=False, shuffle=False):
  '''
  Generate a batch of samples from the given data.
  :param data: list of tuples of questions:(cat, terms).
  :param vocab: vocabulary dictionary {term:id, ...}
  :param cats: list of categories of questions.
  :param batch_size: size of batch.
  :param loop: True/False to loop back at the end of data.
  :param shuffle: True/False, shuffle the data or not.
  :yield: inputs: subset of data samples, targets: corresponding targets of the selected inputs.
  '''

  # build a list of indexes for data samples
  data_l = len(data)
  data_indexes = list(range(data_l))

  # get the max length of questions for padding. 
  max_l = 0
  for (_,q) in data:
    max_l = max(max_l, len(q))

  # shuffle the indexes if it is True
  if shuffle:
    rnd.shuffle(data_indexes)

  stop = False
  index = 0

  while not stop:
    batch = []
    targets = []
    
    for i in range(batch_size):

        # if at the end of data.
        if index >= len(data_indexes):
          if not loop:
            stop = True
            break
          
          # start index from 0
          index = 0
          
          # shuffle the data indexes if required
          if shuffle:
            rnd.shuffle(data_indexes)
          
        # get the question, convert to tensor, and append the data and target
        q = data[data_indexes[index]]
        q_tensor = question_to_tensor(q[1], vocab)
        
        # pad the batched tensors to the longest question in the data.
        q_tensor_pad = q_tensor + [vocab["__PAD__"]]*(max_l - len(q_tensor)) 
        
        batch.append(q_tensor_pad)
        targets.append(cats.index(q[0]))

        # increase index
        index += 1

    # yield the batch and targets, skipping an empty batch at the end of the data
    if batch:
      yield np.array(batch), np.array(targets)

    # leave the loop once the data is exhausted
    if stop:
      break

In [None]:
# test the data_generator
temp_data_generator = data_generator(temp_train_qs, temp_vocabs, temp_cats, batch_size=8,loop=False)
temp_next = next(temp_data_generator)
print(f'batch shape = {temp_next[0].shape}, targets = {temp_next[1]}')

We need to build data generators for training, validation, and testing processes. 

In [None]:
def train_generator(batch_size, shuffle=False):
  return data_generator(training_data, vocab, cats, batch_size, loop=True, shuffle=shuffle)

def eval_generator(batch_size, shuffle=False):
  return data_generator(validation_data, vocab, cats, batch_size, loop=True, shuffle=shuffle)

def test_generator(batch_size, shuffle=False):
  # a single pass over the test data, so the generator does not loop
  return data_generator(test_data, vocab, cats, batch_size, loop=False, shuffle=shuffle)

# Training
Training consists of the following steps:
1.   Define the NN model
2.   Define the training model
3.   Instantiate the model and training loop



## NN Model
`classifier` is a function that builds the neural network.
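
To make the data flow through the layers concrete, here is a NumPy-only sketch of the four steps the model performs: embedding lookup, mean over the token axis, dense projection, and log-softmax. This is not Trax code. The weights are random and the sizes are toy values chosen for illustration, so only the shapes carry over to the real model.

```python
import numpy as np_base

rng = np_base.random.default_rng(0)
vocab_size, embedding_dim, n_classes = 10, 4, 3  # toy sizes (assumptions)

embedding = rng.normal(size=(vocab_size, embedding_dim))  # one row per token id
dense_w = rng.normal(size=(embedding_dim, n_classes))
dense_b = np_base.zeros(n_classes)

def forward(token_ids):
    """token_ids: int array (batch, seq_len) -> log-probabilities (batch, n_classes)."""
    embedded = embedding[token_ids]          # (batch, seq_len, embedding_dim)
    averaged = embedded.mean(axis=1)         # (batch, embedding_dim)
    logits = averaged @ dense_w + dense_b    # (batch, n_classes)
    # log-softmax, shifted by the row maximum for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np_base.log(np_base.exp(shifted).sum(axis=-1, keepdims=True))

toy_inputs = np_base.array([[2, 5, 0, 0], [7, 1, 3, 9]])  # 0 plays the role of padding
log_probs = forward(toy_inputs)
print(log_probs.shape)
```

Note that the mean runs over every position of a row, so padding ids contribute to the averaged embedding as well.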

In [None]:
def classifier(vocab_size=len(vocab), embedding_dim=256, output_dim=2, mode='train'):
    # create embedding layer
    embed_layer = tl.Embedding(
        vocab_size = vocab_size,    # Size of the vocabulary
        d_feature = embedding_dim)  # Embedding dimension
    
    # Create a mean layer, to create an "average" word embedding
    mean_layer = tl.Mean(axis=1)
    
    # Create a dense layer, one unit for each output
    dense_output_layer = tl.Dense(n_units = output_dim)
    
    # Create the log softmax layer.
    log_softmax_layer = tl.LogSoftmax()
    
    # Create a model using tl.Serial to combine all layers
    model = tl.Serial(
      embed_layer,        # embedding layer
      mean_layer,         # mean layer
      dense_output_layer, # dense output layer 
      log_softmax_layer   # log softmax layer
    )

    # return the model (a tl.Serial combinator)
    return model

## Training Model
Training a model involves the following steps:
1.  Define the *train_task* and the *eval_task*.
2.  Define the training loop using these tasks.



In [None]:
def train_model(classifier, batch_size, n_steps, output_dir):
    '''
    Input: 
        classifier: the model you are building
        batch_size: the batch size for data-generation
        n_steps: the number of training steps to run
        output_dir: folder to save your files
    Output:
        training_loop: trax training loop, which holds the trained model
    '''
    # 1- Define the train_task
    train_task = training.TrainTask(
        labeled_data=train_generator(batch_size=batch_size, shuffle=True),
        loss_layer=tl.CrossEntropyLoss(),
        optimizer=trax.optimizers.Adam(0.01),
        n_steps_per_checkpoint=10,
    )

    # 2- Define the eval_task
    eval_task = training.EvalTask(
        labeled_data=eval_generator(batch_size=batch_size, shuffle=True),
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
    )

    # 3- training loop
    training_loop = training.Loop(
                                classifier, # The learning model
                                train_task, # The training task
                                eval_task = eval_task, # The evaluation task
                                output_dir = output_dir) # The output directory

    # 4- run the training loop
    training_loop.run(n_steps = n_steps)

    # 5- Return the training_loop, since it has the model.
    return training_loop

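## Model and Training Instantiation
Instantiating the model and the training process includes the following steps:
1.   Initialize the *batch size*, *random seed*, number of steps, and output directory
2.   Build the model and run the training loop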

In [None]:
# initialize batch_size, random seed
batch_size = 16
rnd.seed(271)
n_steps = 100
output_dir="./output_dir/"

# define the model(classifier) with len(cats) output.
model = classifier(output_dim=len(cats))

# instantiate training 
training_loop = train_model(model, batch_size=batch_size, n_steps=n_steps, output_dir=output_dir)

# Evaluation
This section describes and implements required steps to evaluate a trained model.

## Prediction
Predicting the category of a given input consists of the following steps:
1.   Get an input sample using data generator.
2.   Pass the input to the prediction and get the output.

In [None]:
# Step 1: get an input sample and check the shapes

# Create a generator object
tmp_train_generator = train_generator(16)

# get one batch
tmp_batch = next(tmp_train_generator)

# 0: inputs, 1: targets (the actual labels)
tmp_inputs, tmp_targets = tmp_batch

# print out the shape of inputs
print(f'The shape of input = {tmp_inputs.shape} & the shape of targets = {tmp_targets.shape}')

The following code performs the second step of prediction.

In [None]:
# Step 2: feed the question tensors into the model to get a prediction
tmp_pred = training_loop.eval_model(tmp_inputs)

# print out the shape of prediction.
print(f"The prediction shape is {tmp_pred.shape}, number of question tensors as rows")

## Predicted Class
Each column represents a question category. Since the model ends with a log-softmax layer, each value is the log-probability that the question belongs to that category. The column with the highest value is selected as the predicted output.

In [None]:
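# get the max between the columns
tmp_class_ids = np.argmax(tmp_pred, axis=1)

# get the index and name of question category
tmp_class_names = [cats[int(i)] for i in tmp_class_ids]
print(f'predicted ids = {tmp_class_ids}, predicted categories = {tmp_class_names}')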


## Evaluation of a batch
The following method evaluates the performance of a model on a batch. It returns the *accuracy*, the *number of correct predictions*, and the *total number* of samples.
This method will be used later to evaluate the model on the complete *test data*.
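
A possible shape for such a method is sketched below with plain NumPy. The function name `evaluate_batch` and the toy scores are hypothetical; in the notebook the scores would come from the trained model, and this is one way to compute the three values rather than the definitive implementation.

```python
import numpy as np_base

def evaluate_batch(log_probs, targets):
    """Return (accuracy, number correct, batch size) for one batch.

    log_probs: array of shape (batch, n_classes) with per-class scores
    targets:   array of shape (batch,) with the true class ids
    """
    predicted = np_base.argmax(log_probs, axis=1)  # highest-scoring column per row
    n_correct = int(np_base.sum(predicted == targets))
    total = int(len(targets))
    accuracy = n_correct / total if total else 0.0
    return accuracy, n_correct, total

# Toy scores for 3 questions and 2 categories (made-up numbers)
toy_scores = np_base.array([[-0.1, -2.3], [-1.6, -0.2], [-0.4, -1.1]])
toy_targets = np_base.array([0, 1, 1])
print(evaluate_batch(toy_scores, toy_targets))  # 2 of the 3 predictions are correct
```

Returning the raw counts alongside the accuracy makes it easy to add up results over several batches later.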


In [None]:
# Evaluate the performance of model on a given batch.

This section uses the *test* data to evaluate the performance of the trained model. Therefore, the following steps are required:

Define data generators

*   Define the test data generator

In [None]:
# define data-generators
test_data_gen = test_generator(batch_size=batch_size, shuffle=False)
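
Once a test generator exists, the evaluation over the complete test data could be sketched as follows. The names `evaluate_batches` and `always_class_one` and the toy batches are hypothetical stand-ins; `predict_fn` represents whatever callable produces the model's scores, which in this notebook would be the trained model.

```python
import numpy as np_base

def evaluate_batches(batches, predict_fn):
    """Accumulate accuracy over an iterable of (inputs, targets) batches.

    predict_fn: any callable mapping inputs to scores of shape (batch, n_classes).
    """
    n_correct, total = 0, 0
    for inputs, targets in batches:
        predicted = np_base.argmax(predict_fn(inputs), axis=1)
        n_correct += int(np_base.sum(predicted == np_base.asarray(targets)))
        total += len(targets)
    accuracy = n_correct / total if total else 0.0
    return accuracy, n_correct, total

# Stand-in data and model, for illustration only
toy_batches = [
    (np_base.array([[1, 2], [3, 4]]), np_base.array([0, 1])),
    (np_base.array([[5, 6]]), np_base.array([1])),
]

def always_class_one(inputs):
    """Dummy predictor that always scores class 1 highest."""
    scores = np_base.zeros((len(inputs), 2))
    scores[:, 1] = 1.0
    return scores

print(evaluate_batches(toy_batches, always_class_one))
```

This loop only terminates if the batch iterable is finite, so a generator created with `loop=True` would need an explicit stop condition.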
