### A Simplified Interface to Text Classification With Hugging Face Transformers in TensorFlow Using [ktrain](https://github.com/amaiya/ktrain)

*ktrain* requires TensorFlow 2.

In [1]:
!pip3 install -q tensorflow_gpu==2.1.0

ERROR: tensorflow 2.3.0 has requirement gast==0.3.3, but you'll have gast 0.2.2 which is incompatible.
ERROR: tensorflow 2.3.0 has requirement tensorboard<3,>=2.3.0, but you'll have tensorboard 2.1.1 which is incompatible.
ERROR: tensorflow 2.3.0 has requirement tensorflow-estimator<2.4.0,>=2.3.0, but you'll have tensorflow-estimator 2.1.0 which is incompatible.
ERROR: tensorflow-probability 0.11.0 has requirement gast>=0.3.2, but you'll have gast 0.2.2 which is incompatible.


In [2]:
import tensorflow as tf
print(tf.__version__)

2.1.0


Next, we install the *ktrain* library using pip.

In [3]:
!pip3 install -q ktrain


### Load a Dataset Into Arrays

In [4]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
train_b = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
test_b = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

#train_b = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
#test_b = fetch_20newsgroups(subset='test', shuffle=True, random_state=42)

print('size of training set: %s' % (len(train_b['data'])))
print('size of validation set: %s' % (len(test_b['data'])))
print('classes: %s' % (train_b.target_names))

x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


size of training set: 2257
size of validation set: 1502
classes: ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
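Before training, it can help to check how evenly the classes are represented. The following is a minimal sketch using only the standard library; `class_counts` and the toy labels are hypothetical and stand in for `train_b.target` and `train_b.target_names`.

```python
from collections import Counter

def class_counts(targets, target_names):
    """Return {class_name: count} for integer-encoded labels."""
    counts = Counter(targets)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}

# Toy labels standing in for train_b.target (hypothetical values).
toy_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
toy_targets = [0, 1, 1, 2, 3, 3, 3]
print(class_counts(toy_targets, toy_names))
```

With the real data, something like `class_counts(y_train.tolist(), train_b.target_names)` should give the per-category sizes of the training set.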


## STEP 1:  Preprocess Data and Create a Transformer Model

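We will use [BERT](https://arxiv.org/abs/1810.04805) (`bert-base-uncased`). A smaller model such as [DistilBERT](https://arxiv.org/abs/1910.01108) (`distilbert-base-uncased`) can be substituted by changing `MODEL_NAME`.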

In [5]:
import ktrain
from ktrain import text
MODEL_NAME = 'bert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=500, classes=train_b.target_names)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)



preprocessing train...
language: en
train sequence lengths:
	mean : 308
	95percentile : 837
	99percentile : 1938


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 343
	95percentile : 979
	99percentile : 2562



In [9]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  3076      
Total params: 109,485,316
Trainable params: 109,485,316
Non-trainable params: 0
_________________________________________________________________
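The parameter counts in the summary can be checked by hand. The sketch below assumes a hidden size of 768 for `bert-base-uncased` (this value is not printed in the summary above) and a dense classification head with one weight per input-output pair plus one bias per class.

```python
HIDDEN_SIZE = 768        # assumed hidden width of bert-base models
NUM_CLASSES = 4          # the four newsgroup categories
BERT_PARAMS = 109482240  # taken from the summary above

def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

classifier_params = dense_params(HIDDEN_SIZE, NUM_CLASSES)
total_params = BERT_PARAMS + classifier_params
print(classifier_params, total_params)  # 3076 109485316
```

Both numbers agree with the `classifier (Dense)` row and the `Total params` line in the summary.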


## STEP 2:  Train the Model

In [None]:
learner.fit_onecycle(5e-5, 4)



begin training using onecycle policy with max lr of 5e-05...
Train for 377 steps, validate for 47 steps
Epoch 1/4
  5/377 [..............................] - ETA: 4:52 - loss: 1.5473 - accuracy: 0.1000
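The step counts in the log follow from the dataset sizes, and the one-cycle policy can be pictured as a learning rate that rises to the maximum and then falls again. The sketch below is a simplified triangular schedule, not ktrain's actual implementation; `triangular_lr` and its `min_lr` parameter are hypothetical, and the validation batch size of 32 is an assumption that happens to be consistent with the 47 validation steps reported.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches needed to cover the dataset once."""
    return math.ceil(n_examples / batch_size)

def triangular_lr(step, total_steps, max_lr, min_lr=0.0):
    """Ramp linearly from min_lr to max_lr over the first half, then back down."""
    half = total_steps / 2
    frac = step / half if step <= half else (total_steps - step) / half
    return min_lr + (max_lr - min_lr) * frac

train_steps = steps_per_epoch(2257, 6)   # 377, as in the log
val_steps = steps_per_epoch(1502, 32)    # 47, assuming a validation batch size of 32
total_steps = train_steps * 4            # 4 epochs

for step in (0, total_steps // 2, total_steps):
    print(step, triangular_lr(step, total_steps, 5e-5))
```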

## STEP 3: Evaluate and Inspect the Model

In [None]:
learner.validate(class_names=t.get_classes())
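Validation of a classifier is usually summarized with a confusion matrix and per-class metrics. As a rough illustration of what such a report is built from, here is a pure-Python sketch; the helper names and toy labels are hypothetical and this is not how ktrain computes its report internally.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

# Toy labels (hypothetical), using the same 4-class encoding as above.
toy_true = [0, 0, 1, 1, 2, 3]
toy_pred = [0, 1, 1, 1, 2, 3]
cm = confusion_matrix(toy_true, toy_pred, 4)
print(cm)
print(per_class_recall(cm))
```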

Let's examine the validation example with the highest loss, that is, the one the model got most wrong.

In [None]:
learner.view_top_losses(n=1, preproc=t)

In [None]:
print(x_test[371])

This post talks more about computing than about `alt.atheism` (its true category), so our model placed it in the only computing category available to it: `comp.graphics`.
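The idea behind ranking examples by loss can be sketched in a few lines: compute the cross-entropy of each prediction against its true label and sort in descending order. The functions and toy probabilities below are hypothetical and only illustrate the concept, not ktrain's internals.

```python
import math

def cross_entropy(probs, true_idx, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(probs[true_idx], eps))

def top_losses(prob_rows, y_true, n=1):
    """Return the n (index, loss) pairs with the highest loss."""
    scored = [(cross_entropy(p, t), i)
              for i, (p, t) in enumerate(zip(prob_rows, y_true))]
    scored.sort(reverse=True)
    return [(i, loss) for loss, i in scored[:n]]

# Toy predicted probabilities for three examples whose true class is 0.
toy_probs = [
    [0.90, 0.05, 0.03, 0.02],
    [0.05, 0.90, 0.03, 0.02],  # confidently wrong, so the loss is highest
    [0.70, 0.10, 0.10, 0.10],
]
toy_labels = [0, 0, 0]
print(top_losses(toy_probs, toy_labels, n=1))
```

A confidently wrong prediction, like the second toy example, produces the largest loss, which is why such examples surface first.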

## STEP 4: Making Predictions on New Data in Deployment

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc=t)

In [None]:
predictor.predict('Jesus Christ is the central figure of Christianity.')

In [None]:
# predicted probability scores for each category
predictor.predict_proba('Jesus Christ is the central figure of Christianity.')

In [None]:
predictor.get_classes()

As expected, `soc.religion.christian` is assigned the highest probability.
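The relationship between the probability vector and the predicted label is a simple argmax over the class list. A minimal sketch follows, with hypothetical probability values standing in for the output of `predict_proba` and the class order taken from the dataset output earlier.

```python
def label_from_probs(probs, class_names):
    """Return the class name with the highest probability and that probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_names[best], probs[best]

class_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
example_probs = [0.02, 0.01, 0.01, 0.96]  # hypothetical values, not real model output
print(label_from_probs(example_probs, class_names))
```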

Let's invoke the `explain` method to see which words contribute most to the classification.

We will need a forked version of the **eli5** library that supports TensorFlow Keras, so let's install it first.

In [None]:
!pip3 install -q git+https://github.com/amaiya/eli5@tfkeras_0_10_1


In [None]:
predictor.explain('Jesus Christ is the central figure in Christianity.')

The words in the darkest shade of green contribute most to the classification and agree with what you would expect for this example.

We can save and reload our predictor for later deployment.

In [None]:
predictor.save('/tmp/my_distilbert_predictor')

In [None]:
reloaded_predictor = ktrain.load_predictor('/tmp/my_distilbert_predictor')

In [None]:
reloaded_predictor.predict('My computer monitor is really blurry.')