<a href="https://colab.research.google.com/github/nyp-sit/it3103/blob/main/week15/intent_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intent Recognition

In this practical, we will learn how to apply the HuggingFace Transformers library to our own Intent Recognition task for our chatbot.

#### **NOTE: Be sure to set your runtime to a GPU instance!**

## Install the Hugging Face Transformers Library

Run the following cell to install the transformers library.

In [None]:
!pip install transformers

## Getting and Preparing the Data

In [1]:
import pandas as pd

data_url = 'https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/datasets/airchat_intents.csv'
df = pd.read_csv(data_url)

df.head()

Unnamed: 0,Label,Text
0,atis_abbreviation,what is fare code h
1,atis_abbreviation,what is booking class c
2,atis_abbreviation,what does fare code q mean
3,atis_abbreviation,what is fare code qw
4,atis_abbreviation,what does the fare code f mean


Apart from the index, the dataset has two columns: 'Label' and 'Text'. Let's examine which labels are present and how many samples we have for each label.

In [2]:
df['Label'].value_counts()

atis_flight                                 3666
atis_airfare                                 423
atis_ground_service                          255
atis_airline                                 157
atis_abbreviation                            147
atis_yes                                      82
atis_aircraft                                 81
atis_no                                       67
atis_flight_time                              54
atis_greeting                                 53
atis_quantity                                 51
atis_flight#atis_airfare                      21
atis_distance                                 20
atis_airport                                  20
atis_city                                     19
atis_ground_fare                              18
atis_capacity                                 16
atis_flight_no                                12
atis_meal                                      6
atis_restriction                               6
atis_airline#atis_fl

We can see that some labels have very few samples, such as 'atis_meal', 'atis_airline#atis_flight_no', 'atis_cheapest', and so on. With so few samples, our model will have difficulty learning any meaningful pattern for them. We will group these rare labels into a new label called 'others'.
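One way to decide which labels to fold into 'others' is a minimum sample count. The sketch below uses the counts printed above (the truncated tail of the output is left out). The cutoff of 50 is an assumption, since the notebook does not state one; it was picked because it reproduces the label list defined in the next section.

```python
# Counts copied from the value_counts() output above.
label_counts = {
    "atis_flight": 3666,
    "atis_airfare": 423,
    "atis_ground_service": 255,
    "atis_airline": 157,
    "atis_abbreviation": 147,
    "atis_yes": 82,
    "atis_aircraft": 81,
    "atis_no": 67,
    "atis_flight_time": 54,
    "atis_greeting": 53,
    "atis_quantity": 51,
    "atis_flight#atis_airfare": 21,
    "atis_distance": 20,
    "atis_airport": 20,
    "atis_city": 19,
    "atis_ground_fare": 18,
    "atis_capacity": 16,
    "atis_flight_no": 12,
    "atis_meal": 6,
    "atis_restriction": 6,
}

MIN_SAMPLES = 50  # assumed cutoff, not stated in the notebook

# Labels with enough samples keep their own class; the rest become "others".
kept = sorted(label for label, n in label_counts.items() if n >= MIN_SAMPLES)
rare = sorted(label for label, n in label_counts.items() if n < MIN_SAMPLES)

print(len(kept), "labels kept;", len(rare), "labels folded into 'others'")
```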

---



### Re-define our Classification Labels

Here we define the labels we want to classify, taken from the original labels, and add a new catch-all label called 'others'.
 

In [3]:
# Create a list of unique labels that we will recognize.
#
sentence_labels = [
              "others",
              "atis_abbreviation",
              "atis_aircraft",
              "atis_airfare",
              "atis_airline",
              "atis_flight",
              "atis_flight_time",
              "atis_greeting",
              "atis_ground_service",
              "atis_quantity",
              "atis_yes",
              "atis_no"]

# This creates a reverse mapping dictionary of "label" -> index.
# 
sentence_labels_id_by_label = dict((t, i) for i, t in enumerate(sentence_labels))

Now we will map the original labels to the ones we specified in the cell above, and at the same time convert the text labels into numeric labels (e.g. others->0, atis_abbreviation->1, etc). We can use the `map()` method of the DataFrame column to do that, passing it a lambda function that does the mapping. Any label that is not in our list is mapped to 0 ('others').

In [4]:
df['Label'] = df['Label'].map(lambda label: 
                              sentence_labels_id_by_label[label] 
                              if label in sentence_labels_id_by_label 
                              else 0)
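The same fallback logic can be written with `dict.get`, which returns a default when the key is missing. The following is a small plain-Python sketch on a shortened, illustrative label list; the names `labels`, `label_to_id` and `encode` are made up for this example.

```python
# A shortened label list for illustration; index 0 is the catch-all class.
labels = ["others", "atis_abbreviation", "atis_aircraft", "atis_airfare"]
label_to_id = {label: i for i, label in enumerate(labels)}

def encode(label):
    # Labels that are not in the list fall back to index 0 ("others").
    return label_to_id.get(label, 0)

raw = ["atis_airfare", "atis_meal", "atis_abbreviation", "atis_cheapest"]
encoded = [encode(label) for label in raw]
print(encoded)  # [3, 0, 1, 0]
```

With pandas, a function like `encode` could be passed to `df['Label'].map(...)` in place of the lambda.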

In [5]:
# examine a few random samples 
df.sample(10)

Unnamed: 0,Label,Text
1808,5,what flights are available on wednesday from ...
2351,5,flights from pittsburgh to baltimore arriving...
1316,5,all flights from boston to philadelphia which...
3330,5,are there any flights from new york to los an...
157,2,what kind of airplane is flight ua 281 from b...
4059,5,what is the earliest flight from tampa to mil...
1887,5,i would like a flight between denver and san ...
3626,5,newark to cleveland
4510,5,please list all flights between boston and at...
4947,8,is there ground transportation in st. louis


### Split Our Data

We will now separate the texts and labels into `all_texts` and `all_labels`, and then split the dataset into a training set and a validation set. We do a stratified split so that each label makes up roughly the same proportion of both the training and the validation set.

In [6]:
all_texts = df['Text']
all_labels = df['Label']

In [7]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(all_texts, 
                                                                    all_labels, 
                                                                    test_size=0.2, 
                                                                    stratify=all_labels)
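To make the idea of stratification concrete, here is a plain-Python sketch that splits each label group separately. It is an illustration only and not how scikit-learn implements `stratify`; the function name `stratified_split` and the toy labels are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_indices, val_indices), splitting each label group separately."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy data: 70% label 5, 20% label 3, 10% label 8.
toy_labels = [5] * 70 + [3] * 20 + [8] * 10
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))  # 80 20
```

In the notebook's own call, passing `random_state` to `train_test_split` would similarly make the split repeatable between runs.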

In [8]:
train_labels.value_counts()/len(train_labels)

5     0.707770
3     0.081564
8     0.049228
4     0.030405
1     0.028475
0     0.027751
2     0.015685
10    0.015685
11    0.013031
6     0.010376
7     0.010135
9     0.009894
Name: Label, dtype: float64

In [9]:
val_labels.value_counts()/len(val_labels)

5     0.707529
3     0.082046
8     0.049228
4     0.029923
1     0.027992
0     0.027992
10    0.016409
2     0.015444
11    0.012548
6     0.010618
7     0.010618
9     0.009653
Name: Label, dtype: float64

### Tokenize the text 

Before we can use the text for classification, we need to tokenize it. We will use the tokenizer of the pretrained model 'distilbert-base-uncased', since that is the model we will be fine-tuning.


In [10]:
len(sentence_labels)

12

In [11]:
# Before we can feed the texts to the tokenizer, we need to convert them from
# pandas Series into plain Python lists. We can do this with to_list().

train_texts = train_texts.to_list()
train_labels = train_labels.to_list()
val_texts = val_texts.to_list()
val_labels = val_labels.to_list()

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

In [13]:
train_encodings = tokenizer(train_texts, padding=True, truncation=True)
val_encodings = tokenizer(val_texts, padding=True, truncation=True)
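With `padding=True`, the tokenizer pads every sequence to the length of the longest one it was given and returns an attention mask marking real tokens versus padding. The sketch below mimics that behaviour in plain Python. The token ids are made up for illustration, and `pad_batch` is a hypothetical helper, not part of the Transformers library.

```python
PAD_ID = 0  # assumed id of the padding token

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 2054, 102], [101, 2054, 2003, 13258, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```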

Once we have the encodings, we create a TensorFlow dataset that is ready to be used to train our model. Since the HuggingFace pretrained model (the TensorFlow version) is a Keras model, it can consume a `tf.data` dataset directly.

In [14]:
import tensorflow as tf

batch_size = 16

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(batch_size)

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(batch_size)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

2022-01-05 13:06:24.494115: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-01-05 13:06:24.494263: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (xps-markk): /proc/driver/nvidia/version does not exist
2022-01-05 13:06:24.497016: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Train Your Sentence Classification Model

Run the following cell to download the pretrained "distilbert-base-uncased" model. We will then fine-tune it on the dataset we prepared above.

In [15]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",num_labels=len(sentence_labels))

2022-01-05 13:06:32.058524: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['activation_13', 'vocab_projector', 'vocab_transform', 'vocab_layer_norm']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint 

As in the previous lab, we start with a small learning rate of 5e-5 (0.00005) and gradually reduce it to 0 over the course of training.

In [16]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay

num_epochs = 2

# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Since our dataset is already batched, we can simply take the len.
num_train_steps = len(train_dataset) * num_epochs

lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)
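The schedule above can be reproduced by hand. With its default `power` of 1, polynomial decay is a straight line from the initial to the end learning rate. The sketch below is a plain-Python version of that formula; the sample count of 4144 is an assumption inferred from the 80/20 split (the validation set has 1036 samples), and `linear_decay` is a name made up for this example.

```python
import math

def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=518, power=1.0):
    """Learning rate at a given step under polynomial decay (linear when power is 1)."""
    step = min(step, decay_steps)  # the rate stays at end_lr after decay_steps
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr

# Assumed sizes: about 4144 training samples, batches of 16, 2 epochs.
num_samples, batch_size, num_epochs = 4144, 16, 2
steps_per_epoch = math.ceil(num_samples / batch_size)
num_train_steps = steps_per_epoch * num_epochs
print(num_train_steps)  # 518

for step in (0, num_train_steps // 2, num_train_steps):
    print(step, linear_decay(step, decay_steps=num_train_steps))
```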

In [17]:
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

opt = Adam(learning_rate=lr_scheduler)

loss = SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])

model.fit(train_dataset, validation_data=val_dataset, epochs=num_epochs)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f1a70404940>

### Evaluating the Model

Run the following code to evaluate our model on the entire validation set.

We also print the classification report to see how the model performs for each label. Note that labels with fewer samples typically have a lower F1-score.


In [18]:
output = model.predict(val_dataset)  # the dataset is already batched
pred_probs = tf.nn.softmax(output.logits, axis=-1)
preds = tf.argmax(pred_probs, axis=-1)
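Softmax turns the raw logits into probabilities that sum to 1, and argmax picks the index of the largest one. Because softmax preserves the ordering of its inputs, the predicted class is the same whether argmax is taken over the logits or over the probabilities. A minimal plain-Python sketch with made-up logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [0.3, -1.2, 4.1, 0.9]  # made-up logits for a 4-class example
probs = softmax(logits)
print(argmax(logits), argmax(probs))  # 2 2
```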

In [19]:
val_labels = []
for _, labels in val_dataset.as_numpy_iterator():
    val_labels.extend(labels)

In [20]:
from sklearn.metrics import classification_report

print(classification_report(val_labels, preds))

              precision    recall  f1-score   support

           0       0.88      0.79      0.84        29
           1       0.93      0.90      0.91        29
           2       0.89      1.00      0.94        16
           3       0.97      1.00      0.98        85
           4       0.91      0.97      0.94        31
           5       1.00      0.99      0.99       733
           6       0.91      0.91      0.91        11
           7       1.00      1.00      1.00        11
           8       0.96      0.98      0.97        51
           9       0.90      0.90      0.90        10
          10       0.89      1.00      0.94        17
          11       1.00      0.77      0.87        13

    accuracy                           0.98      1036
   macro avg       0.94      0.93      0.93      1036
weighted avg       0.98      0.98      0.98      1036
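The per-label numbers in the report follow from simple counts of true positives, false positives and false negatives. The sketch below computes them for one label in plain Python. The toy predictions are made up, and `per_class_scores` is a hypothetical helper rather than a scikit-learn function.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for a single label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: one of the two samples of label 11 is misclassified as 0.
y_true = [5, 5, 5, 5, 11, 11, 0, 0]
y_pred = [5, 5, 5, 5, 11, 0, 0, 5]

precision, recall, f1 = per_class_scores(y_true, y_pred, label=11)
print(precision, recall, round(f1, 2))  # 1.0 0.5 0.67
```

With only a handful of samples for a label, a single mistake moves recall a long way, which is one reason the smaller classes tend to show lower F1-scores.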



### Saving the Model

When your training has completed, run the following cell to save your model.

Remember to download the model from Google Colab if you want to use it later.

In [21]:
# Save the model

model.save_pretrained("intent_model")

## Putting Our Model to the Test

Run the following cell to define the function and label list needed to load our model and perform inference.


In [22]:
# Import the necessary libraries
#
from transformers import (
    AutoTokenizer,
    TFAutoModelForSequenceClassification
)

# Create the DistilBERT tokenizer
#
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Define a function to perform inference on a single input text.
# 
def infer_intent(model, text):
    # Pass the text to the tokenizer.
    # (Named `inputs` to avoid shadowing the built-in input() function.)
    inputs = tokenizer(text, truncation=True, padding=True, return_tensors="tf")
    
    # Send the tokenizer output to our classification model
    #
    output = model(inputs)

    # Extract the output logits and convert to softmax 
    # Find the classification index with the highest value.
    #  
    pred_label = tf.argmax(tf.nn.softmax(output.logits, axis=-1), axis=-1)

    return pred_label

# Create a list of unique labels that we will recognize.
# Obviously this has to match what we trained our model with
# earlier.
#
sentence_labels = [
              "others",
              "atis_abbreviation",
              "atis_aircraft",
              "atis_airfare",
              "atis_airline",
              "atis_flight",
              "atis_flight_time",
              "atis_greeting",
              "atis_ground_service",
              "atis_quantity",
              "atis_yes",
              "atis_no"]

# Load the saved model file
#
intent_model = TFAutoModelForSequenceClassification.from_pretrained("intent_model")



Some layers from the model checkpoint at intent_model were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at intent_model and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
text = input()

print(sentence_labels[int(infer_intent(intent_model, text)[0])])

 hello this is me


atis_greeting


In [24]:
!zip -r intent_model.zip intent_model

  adding: intent_model/ (stored 0%)
  adding: intent_model/config.json (deflated 57%)
  adding: intent_model/tf_model.h5 (deflated 8%)
