# TinyBertInference API
Idrak uses a TinyBERT model for text classification. TinyBERT outperforms the classical ML models *(Random Forest, SGD, LightGBM, XGBoost, Linear Regression, and SVM)* as well as other BERT variants *(MobileBERT, DistilBERT, and BERT-Base-Uncased)*. TinyBERT also occupies 50 MB on disk, while the other BERT models take more than 400 MB.
We initially have 4 datasets and 5 classifiers for each dataset.
The datasets are as follows:
1.   Actual Text (Human Written Data)
2.   Data from Current Transcriptor
3.   Data from Transcriptor with Decoder (Beam)
4.   Data from Transcriptor without Decoder (Greedy)


The classifiers are as follows:

1.  Hello (Answering Machine, DNC) `NUM_CLASSES=2`
2.  Intro (Answering Machine, Busy, DNC, Greetings, Sorry Greetings, Greet Back, Spanish, Other) `NUM_CLASSES=8`
3.  Pitch (Busy, DNC, Spanish, Other, Not Interested, Positive, Negative, BOT) `NUM_CLASSES=8`
4.  Yes No without Age Sheet (Positive, Negative, DNC, Other, Not Interested) `NUM_CLASSES=5`
5.  Yes No with Age Sheet (Positive, Negative, DNC, Other, Not Interested) `NUM_CLASSES=5`

This API takes a text string and returns a dictionary containing the class probabilities `prob`, the numeric class `class`, and the human-readable label `class_label`.
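
As a rough illustration of the returned dictionary, the sketch below turns a list of raw model scores (logits) into the three fields using only the standard library. The function name `build_result` and the example logits are hypothetical; the label map is the one defined for classifiers 4 and 5 in the inference class of this notebook.

```python
import math

# Label map for the Yes/No classifiers, as defined in the inference class.
LABELS = {0: 'dnc', 1: 'negative', 2: 'not_intrested', 3: 'other', 4: 'positive'}

def build_result(logits, labels):
    """Convert raw scores into the prob/class/class_label dictionary."""
    # Numerically stable softmax: subtract the maximum before exponentiating.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    prob = [e / total for e in exps]
    # The predicted class is the index of the largest probability.
    y_pred = max(range(len(prob)), key=prob.__getitem__)
    return {'prob': prob, 'class': y_pred, 'class_label': labels[y_pred]}

# Made-up logits, used only to show the shape of the output.
example = build_result([0.2, -0.8, 0.0, 1.6, -1.6], LABELS)
print(example['class'], example['class_label'])
```

The real API computes the same quantities from the model output; this sketch only mirrors the structure of the result.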

Mount Google Drive first. You can omit this step if you are running in an environment other than Google Colab.

**Dataset:** https://drive.google.com/drive/folders/1YDvc7E_QYwlhaxMGxk0E7YI1tNeixgiF?usp=share_link 

==> data/datasetX/classifierX_train.csv 

==> data/datasetX/classifierX_train_aug.csv 

==> data/datasetX/classifierX_test.csv

**Models:** https://drive.google.com/drive/folders/1RbchTgviRCxcQ1wjniX79mM9VtxWSJ_o?usp=share_link 

**Checkpoint Path:**

==> model/tinybert_report/best_datasetX_classifier_X.ckpt

**Model History**

==> model/tinybert_report/datasetX_classifier_X_history.csv

**Wrong Predictions Record**

==> model/tinybert_report/datasetX_classifier_X_invalid_predictions.csv

**Classification Report**

==> model/tinybert_report/datasetX_classifier_X_report.json
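
The file layout above can be expressed as a small path helper. This is a sketch under two assumptions: the first `X` in each pattern is the dataset number and the second is the classifier number, and `root` is whatever folder holds the shared `data` and `model` directories. The function name `artifact_paths` is hypothetical.

```python
from pathlib import PurePosixPath

def artifact_paths(dataset, classifier, root=''):
    """Return the expected file locations for one dataset/classifier pair."""
    data_dir = PurePosixPath(root) / 'data' / f'dataset{dataset}'
    report_dir = PurePosixPath(root) / 'model' / 'tinybert_report'
    stem = f'dataset{dataset}_classifier_{classifier}'
    paths = {
        'train': data_dir / f'classifier{classifier}_train.csv',
        'train_aug': data_dir / f'classifier{classifier}_train_aug.csv',
        'test': data_dir / f'classifier{classifier}_test.csv',
        'checkpoint': report_dir / f'best_{stem}.ckpt',
        'history': report_dir / f'{stem}_history.csv',
        'invalid_predictions': report_dir / f'{stem}_invalid_predictions.csv',
        'report': report_dir / f'{stem}_report.json',
    }
    # Return plain strings so the values can be passed to pandas or torch directly.
    return {key: str(value) for key, value in paths.items()}

for name, path in artifact_paths(dataset=2, classifier=5).items():
    print(f'{name}: {path}')
```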


## Installation

Install the required packages.

In [None]:
!pip install transformers
!pip install pytorch_lightning

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.2 transformers-4.24.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytorch_lightning
  Downloading pytorch_lightning-1.8.1-py3-none-any.whl (798 kB)

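Because the code in this notebook depends on the library versions that were installed (the log shows transformers 4.24.0 and pytorch_lightning 1.8.1), it can help to print what is present in the current environment. The sketch below is only a convenience check; the helper name `installed_versions` is hypothetical, and `importlib.metadata` requires Python 3.8 or newer, so on an older interpreter the `importlib_metadata` backport would be needed instead.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions

print(installed_versions(['transformers', 'pytorch-lightning', 'torchmetrics']))
```
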
## Mounting Drive

In [None]:
from google.colab import drive
drive.mount('gdrive/')

Mounted at gdrive/
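
Since the mount step only applies to Colab, one option is to guard it so the notebook also runs elsewhere. The following is a minimal sketch, assuming that the presence of the `google.colab` package is a good enough signal; the helper names `running_in_colab` and `mount_drive_if_colab` are hypothetical.

```python
def running_in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
    except ImportError:
        return False
    return True

def mount_drive_if_colab(mount_point='gdrive/'):
    """Mount Google Drive inside Colab; do nothing in other environments."""
    if not running_in_colab():
        return False
    from google.colab import drive
    drive.mount(mount_point)
    return True

print('Running in Colab:', running_in_colab())
```

Outside Colab, `mount_drive_if_colab()` returns `False` and `drive_folder` should point to a local copy of the model directory.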


Importing the Modules

In [35]:
import logging
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer,BertModel
from transformers import AdamW
import pytorch_lightning as pl
from torchmetrics import F1Score
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping,ProgressBarBase
from pytorch_lightning.loggers import TensorBoardLogger
import warnings
from sklearn.utils.extmath import softmax
from torchmetrics import Accuracy
import re
%matplotlib inline
%config InlineBackend.figure_format='retina'


## Model Classes Initialization

A class that disables the progress bar, because rendering progress output adds execution time.

In [None]:
class LitProgressBar(ProgressBarBase):

    def __init__(self):
        super().__init__()  # don't forget this :)
        self.enable = False

    def disable(self):
        self.enable = False

    def on_train_batch_end(self, trainer, pl_module, outputs, batch_idx):
        super().on_train_batch_end(trainer, pl_module, outputs, batch_idx)  # don't forget this :)
        percent = (self.train_batch_idx / self.total_train_batches) * 100
        # sys.stdout.flush()
        # sys.stdout.write(f'{percent:.01f} percent complete \r')


In [None]:
# RANDOM_SEED = 321

# sns.set(style='whitegrid', palette='muted', font_scale=1.2)
# HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]
# sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))
# rcParams['figure.figsize'] = 12, 8

# pl.seed_everything(RANDOM_SEED)

In [None]:
#Disable Warnings
from IPython.core.interactiveshell import InteractiveShell  #required by the last line of this cell
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# model_path='prajjwal1/bert-tiny'
# tokenizer = AutoTokenizer.from_pretrained(model_path)

PyTorch Lightning requires a Dataset built from a DataFrame for prediction,
so we create the dataset and the dataloader.

In [None]:
class CallCenterDataset(Dataset):
  '''
  Dataset for Bert Processing
  '''
  def __init__(
    self, 
    data: pd.DataFrame, 
    tokenizer: AutoTokenizer, 
    max_token_len: int = 40
  ):
    self.tokenizer = tokenizer
    self.data = data
    self.max_token_len = max_token_len
    
  def __len__(self):
    return len(self.data)

  def __getitem__(self, index: int):
    data_row = self.data.iloc[index]
    comment_text = data_row.cleaned_text
    labels = data_row['class']
    encoding = self.tokenizer.encode_plus(
      comment_text,
      add_special_tokens=True,
      max_length=self.max_token_len,
      return_token_type_ids=False,
      padding="max_length",
      truncation=True,
      return_attention_mask=True,
      return_tensors='pt',
    )
    return dict(
      comment_text=comment_text,
      input_ids=encoding["input_ids"].flatten(),
      attention_mask=encoding["attention_mask"].flatten(),
      labels=labels
    )
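
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a simplified pure-Python sketch of what happens to a single sequence. It is not the Hugging Face implementation; the helper name `pad_and_truncate` is hypothetical, the special-token IDs are assumed to be those of the standard BERT uncased vocabulary, and the input token IDs are arbitrary placeholders.

```python
# Assumed special-token IDs of the standard BERT uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_and_truncate(token_ids, max_len):
    """Build fixed-length input_ids and attention_mask for one sequence."""
    if max_len < 2:
        raise ValueError('max_len must leave room for [CLS] and [SEP]')
    # Reserve two positions for [CLS] and [SEP], then cut off the rest.
    body = list(token_ids)[:max_len - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get mask 1, padding positions get mask 0.
    attention_mask = [1] * len(input_ids)
    pad = max_len - len(input_ids)
    return {
        'input_ids': input_ids + [PAD_ID] * pad,
        'attention_mask': attention_mask + [0] * pad,
    }

short = pad_and_truncate([7592, 2088], max_len=6)         # padded
long = pad_and_truncate(list(range(1000, 1010)), max_len=6)  # truncated
print(short)
print(long)
```

Every item therefore has exactly `max_token_len` positions, which is what allows the dataloader to stack items into a batch.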

In [None]:
import logging

logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger(__name__)

**CallCenterTagger** is the model class. It implements the forward pass, the prediction step, and the other methods that a PyTorch Lightning model requires. It also defines the loss function, the metrics, and the optimizer.

In [None]:
class CallCenterTagger(pl.LightningModule):
  '''
  PyTorch Lightning model class with the forward pass and the other methods required for training, validation, testing, and prediction.
  Arguments:
  n_classes(int): Number of classes to predict
  model_path(str): Path or name of the pretrained model
  '''
  def __init__(self, n_classes: int,model_path=None,n_training_steps=None, n_warmup_steps=None,learning_rate=0.02):
    super().__init__()

    # return_dict=True
    self.bert = BertModel.from_pretrained(model_path, return_dict=True)
    self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)
       
    self.softmax = nn.Softmax(dim=1)
    self.n_training_steps = n_training_steps
    self.n_warmup_steps = n_warmup_steps
     
    self.criterion = nn.CrossEntropyLoss()
    self.train_f1 = F1Score(num_classes=n_classes,average="micro")
    self.train_acc=Accuracy()
    self.val_f1=F1Score(num_classes=n_classes,average="micro")
    self.val_acc=Accuracy()
    self.learning_rate=learning_rate
  def forward(self, input_ids, attention_mask, labels=None):
    
    output = self.bert(input_ids, attention_mask=attention_mask)
    last_state_output=output.last_hidden_state[:,0,:]
   
    output = self.classifier(last_state_output)

    loss = 0
    if labels is not None:
        loss = self.criterion(output, labels)
    return loss, output

  def training_step(self, batch, batch_idx):
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    labels = batch["labels"]
    loss, outputs = self(input_ids, attention_mask, labels)
    y_pred=outputs
    acc = self.train_acc(y_pred, labels)
    f1 = self.train_f1(y_pred, labels)

    self.log("train_accuracy", acc)
    self.log("train_f1", f1)
    self.log("train_loss", loss, prog_bar=True, logger=True)
    return {"loss": loss, "predictions": outputs, "labels": labels}

  def validation_step(self, batch, batch_idx):
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    labels = batch["labels"]
    
    loss, outputs = self(input_ids, attention_mask, labels)
    y_pred = outputs
    acc=self.val_acc(y_pred,labels)
    f1=self.val_f1(y_pred, labels)

    self.log("valid_accuracy", acc)
    self.log("valid_f1", f1)
    self.log("val_loss", loss, prog_bar=True, logger=True)
    return loss

  def test_step(self, batch, batch_idx):
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    labels = batch["labels"]
    
    loss, outputs = self(input_ids, attention_mask, labels)
    y_pred = outputs
    acc=self.val_acc(y_pred,labels)
    f1=self.val_f1(y_pred, labels)

    self.log("test_accuracy", acc)
    self.log("test_f1", f1)
    self.log("test_loss", loss, prog_bar=True, logger=True)
    return loss
  def predict_step(self,batch,batch_idx):
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    labels = batch["labels"]
    
    loss, outputs = self(input_ids, attention_mask, labels)
    y_pred = outputs
    acc=self.val_acc(y_pred,labels)
    f1=self.val_f1(y_pred, labels)

    return y_pred

  def training_epoch_end(self, outputs):
    
    labels = []
    predictions = []
    for output in outputs:
      for out_labels in output["labels"].detach().cpu():
        labels.append(out_labels)
      for out_predictions in output["predictions"].detach().cpu():
        predictions.append(out_predictions)

    labels = torch.stack(labels).int()
    predictions = torch.stack(predictions)
    train_accuracy = self.train_acc.compute()
    train_f1 = self.train_f1.compute()
    print('Train Accuracy: ',train_accuracy)
    print('Train F1: ',train_f1)
    self.log("epoch_train_accuracy", train_accuracy)
    self.log("epoch_train_f1", train_f1)
    self.train_acc.reset()
    self.train_f1.reset()


  def validation_epoch_end(self, outputs):
    val_accuracy = self.val_acc.compute()
    val_f1 = self.val_f1.compute()
    print('Valid Accuracy: ',val_accuracy)
    print('Valid F1: ',val_f1)
    # log metrics
    self.log("epoch_val_accuracy", val_accuracy)
    self.log("epoch_val_f1", val_f1)
    self.val_acc.reset()
    self.val_f1.reset()
  
  def configure_optimizers(self):
    LEARNING_RATE=self.learning_rate
    param_optimizer = list(self.named_parameters())
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.05},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.01}
        ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=LEARNING_RATE, correct_bias=False)

    # scheduler = get_linear_schedule_with_warmup(
    #   optimizer,
    #   num_warmup_steps=self.n_warmup_steps,
    #  num_training_steps = -1
    # )

    return dict(
      optimizer=optimizer,
    )
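
The parameter grouping in `configure_optimizers` relies on substring matching against parameter names. The sketch below reproduces only that matching rule in plain Python so the resulting groups are easy to inspect; the helper name `split_by_decay` and the example parameter names are hypothetical, chosen to resemble typical BERT parameter names.

```python
NO_DECAY = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']

def split_by_decay(param_names, no_decay=NO_DECAY):
    """Split names into the two optimizer groups used by the model class."""
    first_group, second_group = [], []
    for name in param_names:
        # A name joins the second group if any no_decay pattern occurs in it.
        if any(pattern in name for pattern in no_decay):
            second_group.append(name)   # weight_decay=0.01 in the model class
        else:
            first_group.append(name)    # weight_decay=0.05 in the model class
    return first_group, second_group

names = [
    'bert.encoder.layer.0.attention.self.query.weight',
    'bert.encoder.layer.0.attention.self.query.bias',
    'bert.embeddings.LayerNorm.weight',
    'classifier.weight',
    'classifier.bias',
]
first, second = split_by_decay(names)
print('weight_decay=0.05:', first)
print('weight_decay=0.01:', second)
```

Note that the plain `'bias'` pattern already matches `'LayerNorm.bias'`, so the explicit LayerNorm bias entry is redundant but harmless.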


Class for loading the data


In [None]:

class InferenceDataModue(pl.LightningDataModule):
  '''
  Module for data loading. It binds the text data to the tokenizer.
  test_df(dataframe): DataFrame with the columns cleaned_text and class
  batch_size(int): 1 for prediction
  max_token_len(int): Maximum token length for the tokenizer
  '''
  def __init__(self,test_df, tokenizer, batch_size=1, max_token_len=64):
    super().__init__()
    self.batch_size = batch_size
    self.test_df = test_df
    self.tokenizer = tokenizer
    self.max_token_len = max_token_len


    self.test_dataset = CallCenterDataset(
      self.test_df,
      self.tokenizer,
      self.max_token_len)
    self.predict_dataset = CallCenterDataset(
      self.test_df,
      self.tokenizer,
      self.max_token_len)
  def test_dataloader(self):
    return DataLoader(
      self.test_dataset,
      batch_size=self.batch_size,
      num_workers=0
    )
  def predict_dataloader(self):
    return DataLoader(
      self.predict_dataset,
      batch_size=self.batch_size,
      num_workers=0
    )

**IdrakTinyBertInference** is the main class. It loads the **trained model from a checkpoint** and runs prediction on **text**.



In [None]:
class IdrakTinyBertInference:
  def __init__(self,datapath='',model_path='prajjwal1/bert-tiny',batch_size=1,classifier=1,dataset=1,model_name='',drive_folder='/content/gdrive/MyDrive/idraak/model/tinybert_report',checkpoint_name='best_checkpoint'):
    '''
    model_path(str): path or name of the pretrained TinyBERT model
    batch_size(int): 1 for prediction
    classifier(int): classifier number (1 to 5)
    dataset(int): dataset number (1 to 4)
    drive_folder(str): folder that contains the trained checkpoints
    '''
    self.df_test=pd.DataFrame()
    self.text=''
    self.classifier=classifier
    self.dataset=dataset
    self.drive_folder=drive_folder
    #Class Maps for Class value to Labels
    self.classifier_1_labelmaps={0:'answering_machine',1:'dnc'}
    self.classifier_2_labelmaps={0:'answering_machine',1:'busy',2:'dnc',3:'greetback',4:'greeting',5:'other',6:'sorry_greeting',7:'spanish'}
    self.classifier_3_labelmaps={0:'bot',1:'busy',2:'dnc',3:'negative',4:'not_intrested',5:'other',6:'positive',7:'spanish'}
    self.classifier_4_labelmaps={0:'dnc',1:'negative',2:'not_intrested',3:'other',4:'positive'}
    self.classifier_5_labelmaps={0:'dnc',1:'negative',2:'not_intrested',3:'other',4:'positive'}
    self.classifiers_meta={1:{'labels':self.classifier_1_labelmaps,'NUM_CLASSES':2},
                  2:{'labels':self.classifier_2_labelmaps,'NUM_CLASSES':8},
                  3:{'labels':self.classifier_3_labelmaps,'NUM_CLASSES':8},
                  4:{'labels':self.classifier_4_labelmaps,'NUM_CLASSES':5},
                  5:{'labels':self.classifier_5_labelmaps,'NUM_CLASSES':5}
                  }
    #Build the checkpoint path from the classifier and dataset values passed
    self.checkpoint_path='{}/best_dataset{}_classifier_{}.ckpt'.format(self.drive_folder,self.dataset,self.classifier)
    self.checkpoint_name='best_'+checkpoint_name
    self.LABEL_COLUMNS=self.classifiers_meta[classifier]['NUM_CLASSES']
    self.log_dir = "lightning_logs/IDRAK/version_0"
    self.drive_folder=drive_folder
    self.model_name=model_name
    self.model_path=model_path
    self.MAX_TOKEN_COUNT=71
    self.BATCH_SIZE=batch_size
    #Defining Bert Tokenizer
    self.tokenizer = AutoTokenizer.from_pretrained(model_path)
    self.model = CallCenterTagger(n_classes=self.LABEL_COLUMNS,model_path=self.model_path,n_warmup_steps=0,n_training_steps=-1,learning_rate=0)
    self.model=self.model.load_from_checkpoint(self.checkpoint_path,model_path=self.model_path,n_classes=self.LABEL_COLUMNS)
    self.save_model_name=''
    self.checkpoint_callback = ModelCheckpoint(dirpath="checkpoints",filename=self.checkpoint_name,save_top_k=1,verbose=True,monitor="val_loss",mode="min")
    self.logger = TensorBoardLogger("lightning_logs", name="IDRAK")
    self.bar = LitProgressBar()
    self.early_stopping_callback = EarlyStopping(monitor='val_loss', patience=1)
    self.trainer = pl.Trainer(logger=self.logger,callbacks=[self.early_stopping_callback,self.checkpoint_callback,self.bar],max_epochs=0)
    self.dm=InferenceDataModue(test_df=self.df_test,tokenizer=self.tokenizer,batch_size=1,max_token_len=self.MAX_TOKEN_COUNT)
    
  def predict(self,text):
    '''
    Prediction function. It takes a text string and returns the predicted
    class, the probabilities, and the class label using the trained PyTorch model.
    '''
    self.text=text
    prediction=self.prep_data()
    prediction=prediction[0].cpu().detach().numpy()
    y_pred=prediction[0].argmax()
    prob=softmax(prediction)[0]
    class_label=self.classifiers_meta[self.classifier]['labels'][y_pred]
    result={'prob':prob,'class':y_pred,'class_label':class_label}
    return result
  def prep_data(self):
    #function to make dataframe
    self.text=self.cleanify()
    self.df_test['cleaned_text']=[self.text]
    self.df_test['class']=[1] #dummy label
    self.df_test['class_labels']=['xx']
    # print(self.df_test)
    self.dm=InferenceDataModue(test_df=self.df_test,tokenizer=self.tokenizer,batch_size=1,max_token_len=self.MAX_TOKEN_COUNT)
    p=self.trainer.predict(self.model,datamodule=self.dm)
    return p
  def eval_model(self):
    pass
  def cleanify(self):
    #function to clean text
    '''
    Internal helper. It lowercases the text, replaces unwanted symbols with
    spaces using regular expressions, and keeps only digits, lowercase
    letters, spaces, and question marks.
    '''
    REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]') #symbols that are replaced with a space
    BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z ?]') #everything except the characters we keep
    text=str(self.text)
    text = text.lower() #convert the text to lower case
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  #apply the first, then the second regular expression
    text = BAD_SYMBOLS_RE.sub(' ', text)
    return text
  def __repr__(self):
    return 'IdrakTinyBertInference(num_class={},trained_model_path={})'.format(self.LABEL_COLUMNS, self.checkpoint_path)    
  def __str__(self):
    return 'IdrakTinyBertInference Trained over {}'.format(self.checkpoint_path)
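
The cleaning step can be tried in isolation, without loading a model. The sketch below lifts the two regular expressions from `cleanify` into a standalone function; the name `clean_text` and the sample sentence are hypothetical.

```python
import re

# Same patterns as in cleanify, written as raw strings.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z ?]')

def clean_text(text):
    """Lowercase the text and replace unwanted characters with spaces."""
    text = str(text).lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    return BAD_SYMBOLS_RE.sub(' ', text)

cleaned = clean_text('Hello, World! (Call me @ 5pm?)')
# Split on whitespace to show the surviving tokens without the extra spaces.
print(cleaned.split())
```

Removed characters are replaced by spaces rather than deleted, so the cleaned string can contain runs of consecutive spaces.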


In [None]:
classifier=3

In [None]:
dataset=3

In [None]:
model_dirpath='gdrive/MyDrive/idraak/model/tinybert_report'

The constructor of IdrakTinyBertInference takes 3 main parameters.

1: `classifier` expects a number from 1 to 5. This parameter selects the classifier (Hello, Intro, etc.).

2: `dataset` expects a number from 1 to 4.
This parameter selects the dataset on which the model was trained.

3: `drive_folder` is the path of the Drive folder where the trained models are stored.
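
Because an out-of-range value would only surface later as a missing checkpoint file or a `KeyError`, the arguments can be checked up front. This is a hypothetical helper, not part of the class; it assumes the ranges stated above (classifiers 1 to 5, datasets 1 to 4).

```python
VALID_CLASSIFIERS = range(1, 6)  # 1, 2, 3, 4, 5
VALID_DATASETS = range(1, 5)     # 1, 2, 3, 4

def validate_selection(classifier, dataset):
    """Raise ValueError if the classifier or dataset number is out of range."""
    if classifier not in VALID_CLASSIFIERS:
        raise ValueError(f'classifier must be between 1 and 5, got {classifier!r}')
    if dataset not in VALID_DATASETS:
        raise ValueError(f'dataset must be between 1 and 4, got {dataset!r}')
    return classifier, dataset

print(validate_selection(classifier=5, dataset=2))
```

Calling it before constructing `IdrakTinyBertInference` gives a clearer error message than a failed checkpoint load.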

In [None]:
itbf=IdrakTinyBertInference(classifier=5,dataset=2,drive_folder=model_dirpath)

Some weights of the model checkpoint at prajjwal1/bert-tiny were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

For prediction, pass a text string to the `predict` method of the IdrakTinyBertInference object:
```
result=itbf.predict(text)
```

In [None]:
text='random Text'

In [None]:
result=itbf.predict(text)



I am called


**Expected Output:**

`prob` : probabilities of each class

`class`: numeric class label

`class_label`: class label in human readable format

In [None]:
result

{'prob': array([0.15899259, 0.05700221, 0.12287708, 0.63530654, 0.02582158],
       dtype=float32), 'class': 3, 'class_label': 'other'}
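
A caller will often want the runner-up classes as well as the winner. The sketch below ranks the labels by probability using the output shown above; the helper name `top_k` is hypothetical, and the label map is the one defined for classifier 5 in the inference class.

```python
# Label map for classifier 5, as defined in the inference class.
LABELS = {0: 'dnc', 1: 'negative', 2: 'not_intrested', 3: 'other', 4: 'positive'}

def top_k(result, labels, k=2):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(enumerate(result['prob']), key=lambda pair: pair[1], reverse=True)
    return [(labels[index], float(prob)) for index, prob in ranked[:k]]

# Values copied from the output above (a plain list stands in for the array).
sample_result = {
    'prob': [0.15899259, 0.05700221, 0.12287708, 0.63530654, 0.02582158],
    'class': 3,
    'class_label': 'other',
}
for label, prob in top_k(sample_result, LABELS):
    print(f'{label}: {prob:.3f}')
# prints:
# other: 0.635
# dnc: 0.159
```

The same function works when `prob` is a NumPy array, since it only iterates over the values.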

Thanks