<a href="https://colab.research.google.com/github/mkaramib/trax/blob/main/QuestionClassifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Question Classification
In this notebook, I implement a question classifier using the Trax deep learning framework.

In [None]:
import numpy as np_base  # regular ol' numpy
import os
import random as rnd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from unicodedata import normalize
import re


# Initialize
Some libraries, such as the NLTK tokenizer and stop-word list, need to be downloaded or initialized before use. The following lines take care of that.

In [None]:
# initialize
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Trax:
In this section, we install [trax](https://github.com/google/trax) if it is not already installed.

In [None]:
!pip install -q -U trax
import trax
from trax import layers as tl  # core building block
from trax import shapes  # data signatures: dimensionality and type
from trax import fastmath  # uses jax, offers numpy on steroids

Check which version of Trax has been installed.



In [None]:
!pip list | grep trax

# Trax Numpy
The key mathematical benefit of Trax is that its numpy version is implemented on top of JAX. The following line imports Trax's numpy.

In [None]:
# import trax.fastmath.numpy
import trax.fastmath.numpy as np

## Data 
In this section, all the training and test questions are read.
*   Train Data: several files, containing 1000, 2000, 3000, 4000, or 5500 questions respectively.
*   Test Data: contains close to 500 questions to evaluate the trained model.

In each file (train and test), each line holds one question in the following format:
*   QuestionCategory: Question content.
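
As a minimal illustration of this line format, the sketch below splits one line into its category and its question text at the first colon. The sample line is made up for illustration and is not taken from the data files, and `split_label_line` is a hypothetical helper, not part of the notebook.

```python
def split_label_line(line):
    """Split a 'QuestionCategory: Question content.' line at the first colon."""
    category, _, content = line.partition(':')
    return category.strip(), content.strip()


# Hypothetical sample line in the format described above.
sample = "LOC: Where is the Eiffel Tower ?"
category, content = split_label_line(sample)
print(category)  # LOC
print(content)   # Where is the Eiffel Tower ?
```

Splitting at the first colon only means that any later colon stays part of the question text.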

In [None]:
train_f = open("./questions/train_1000.label", mode='r',encoding="ISO-8859-1")
test_f = open("./questions/TREC_10.label", mode='r', encoding="ISO-8859-1")

# Tokenization
A key preprocessing step is tokenizing the questions. In this experiment, we use the NLTK tokenizer.


In [None]:
def tokenize(question):
    """
    separate the question type as well as question tokens
    :param question: given question
    :return: question_category, question_terms
    """
    colon = question.find(':')            # index of first colon to separate the question category
    q_cat = question[0:colon]             # get question type
    content_normalized = normalize('NFKC', question[colon + 1:])  # normalize the content after the colon
    content_normalized = re.sub("[^a-zA-Z. ]", "", content_normalized)  # remove non-alphabetic parts of question
    terms_all = word_tokenize(content_normalized)             # tokenize the content

    # remove the stop words
    terms = [w for w in terms_all if w not in stop_words]
    #terms = terms_all
    return q_cat, terms
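
To make the cleaning steps easier to follow, here is a rough, dependency-free sketch of the same pipeline: split off the category, normalize, strip non-alphabetic characters, tokenize, and drop stop words. It is an approximation under stated assumptions: `str.split()` stands in for NLTK's `word_tokenize`, `DEMO_STOP_WORDS` is a tiny hand-picked set standing in for NLTK's English stop-word list, and the sample question is invented.

```python
import re
from unicodedata import normalize

# Assumption: a tiny hand-picked set standing in for NLTK's English stop words.
DEMO_STOP_WORDS = {"is", "the", "of", "a"}


def tokenize_sketch(question, stop_words=DEMO_STOP_WORDS):
    """Return (category, terms); uses str.split() instead of word_tokenize."""
    category, _, content = question.partition(':')
    content = normalize('NFKC', content)
    content = re.sub("[^a-zA-Z. ]", "", content)  # keep letters, periods, spaces
    terms = [w for w in content.split() if w not in stop_words]
    return category, terms


# Hypothetical sample question.
print(tokenize_sketch("LOC: What is the capital of France ?"))
# ('LOC', ['What', 'capital', 'France'])
```

The stop-word comparison is case-sensitive, so a capitalized word such as "What" survives even when its lowercase form is in the stop-word set. Lowercasing the terms before filtering would change that, if desired.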

# Data Preparation
In this step, all the questions are read, tokenized, and stored in a list of tuples: *(category, terms)*

In [None]:
def prepare_data(file):
  """
  read the lines from the given file and prepare the list of tuples of questions.
  :param file: given file
  :return: list of tuples(category, terms)
  """
  questions = []
  lines = file.readlines()
  for line in lines:
    cat, terms = tokenize(line)
    questions.append((cat, terms))
  return questions

# Vocabulary and Targets
Training requires a vocabulary and a set of targets, which we prepare in this step. Later, each question is converted to a tensor, which is a list of numbers, so every term needs a unique id. The vocab dictionary stores this term-to-id mapping.

In [None]:
def build_vocabulary(questions):
  """
  Generate the vocabulary from the questions. 
  :param questions: given list of tuples(cat, terms)
  :return: list of unique categories, vocabulary(dictionary of term:Id)
  """
  cats = [cat for (cat, _) in questions]
  vocab = {'__PAD__': 0, '__UNK__': 1}
  for (_, terms) in questions:
    for term in terms:
       if term not in vocab:
         vocab[term] = len(vocab) 
  return list(set(cats)), vocab
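
The sketch below runs the same vocabulary-building idea on invented toy data to show how ids are assigned in order of first appearance. It differs from the function above in one labeled way: the hypothetical variant `build_vocabulary_sorted` returns the categories sorted, because the iteration order of a set of strings can differ between Python sessions, which would change the target index of each category from run to run.

```python
def build_vocabulary_sorted(questions):
    """Variant sketch: same vocabulary logic, but categories come back sorted."""
    cats = sorted({cat for (cat, _) in questions})
    vocab = {'__PAD__': 0, '__UNK__': 1}
    for (_, terms) in questions:
        for term in terms:
            if term not in vocab:
                vocab[term] = len(vocab)  # next free id
    return cats, vocab


# Invented toy data in the (category, terms) format.
toy_questions = [("LOC", ["Where", "Paris"]), ("HUM", ["Who", "wrote", "Paris"])]
cats, vocab = build_vocabulary_sorted(toy_questions)
print(cats)   # ['HUM', 'LOC']
print(vocab)  # {'__PAD__': 0, '__UNK__': 1, 'Where': 2, 'Paris': 3, 'Who': 4, 'wrote': 5}
```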

# Build Tensor
One of the first steps in training any neural network is to convert the input to a tensor.

In [None]:
def question_to_tensor(question, vocab, unk_token="__UNK__"):
  """
  convert the given question into tensor
  :param question: list of terms of question, [t1, t2, ...]
  :param vocab: dictionary of vocabulary
  :param unk_token: token to be used for the terms that are not in the vocabs.
  :return: tensor = [1, 4, 2, ...]
  """
  tensor = []
  for term in question:
    # get the id for the term
    word_id = vocab[term] if term in vocab else vocab[unk_token]
    tensor.append(word_id)

  return tensor
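
As a small worked example, the sketch below maps a list of terms to ids with a toy vocabulary and shows how an unseen term falls back to the `__UNK__` id. The name `question_to_ids` and the toy vocabulary are illustrative assumptions; the function is a compact equivalent of the lookup above that uses `dict.get`.

```python
def question_to_ids(terms, vocab, unk_token="__UNK__"):
    """Map each term to its id, using the unknown-token id for unseen terms."""
    unk_id = vocab[unk_token]
    return [vocab.get(term, unk_id) for term in terms]


# Toy vocabulary invented for this example.
toy_vocab = {'__PAD__': 0, '__UNK__': 1, 'Where': 2, 'Paris': 3}
print(question_to_ids(["Where", "Berlin", "Paris"], toy_vocab))  # [2, 1, 3]
```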

# Batch Data Generator
Most deep neural network models consume their inputs in batches. A batch generator is implemented to generate batches of data samples for *train*, *validation*, and *test*.

In [None]:
def data_generator(data, vocab, cats, batch_size, loop, shuffle=False):
  '''
  Generate a batch of samples from the given data.
  :param data: list of tuples of questions:(cat, terms).
  :param vocab: vocabulary dictionary {term:id, ...}
  :param cats: list of categories of questions.
  :param batch_size: size of batch.
  :param loop: True/False to loop back at the end of data.
  :param shuffle: True/False, shuffle the data or not.
  :yield: inputs: subset of data samples, targets: corresponding targets of the selected inputs.
  '''

  # build a list of indexes for data samples
  data_l = len(data)
  data_indexes = list(range(data_l))

  # get the max length of questions for padding. 
  max_l = max((len(q) for (_, q) in data), default=0)

  # shuffle the indexes if it is True
  if shuffle:
    rnd.shuffle(data_indexes)

  stop = False
  index = 0

  while not stop:
    batch = []
    targets = []
    
    for i in range(batch_size):

        # if at the end of data.
        if index >= len(data_indexes):
          if not loop:
            stop = True
            break
          
          # start index from 0
          index = 0
          
          # shuffle the data indexes if required
          if shuffle:
            rnd.shuffle(data_indexes)
          
        # get the question, convert to tensor, and append the data and target
        q = data[data_indexes[index]]
        q_tensor = question_to_tensor(q[1], vocab)
        
        # pad the batched tensors to the longest question in the data.
        q_tensor_pad = q_tensor + [vocab["__PAD__"]]*(max_l - len(q_tensor)) 
        
        batch.append(q_tensor_pad)
        targets.append(cats.index(q[0]))

        # increase index
        index += 1

    # if stop
    if stop:
      break

    # yield the batch and targets
    yield np.array(batch), np.array(targets)
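
Padding matters because a batch can only become a rectangular array when every row has the same length. The sketch below illustrates the padding step on invented id lists, using plain Python lists in place of `trax.fastmath.numpy` arrays; `pad_to_length` is a hypothetical helper and the pad id of 0 mirrors the `__PAD__` entry in the vocabulary.

```python
def pad_to_length(ids, max_len, pad_id=0):
    """Right-pad a list of ids with pad_id until it has max_len entries."""
    return ids + [pad_id] * (max_len - len(ids))


# Invented id lists of different lengths.
toy_ids = [[2, 3], [4, 5, 3, 6], [7]]

# Pad everything to the longest question in the data, as the generator intends.
max_len = max((len(ids) for ids in toy_ids), default=0)
padded = [pad_to_length(ids, max_len) for ids in toy_ids]
for row in padded:
    print(row)
# [2, 3, 0, 0]
# [4, 5, 3, 6]
# [7, 0, 0, 0]
```

Padding to the longest question in the whole data set keeps every batch the same width. Padding to the longest question in each batch would be an alternative that wastes fewer pad tokens, at the cost of batch shapes that vary.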

We need to build data generators for training, validation, and testing processes. 

In [None]:
def train_generator(train_f, )