## Classifying text using BERT

This week we learn how to classify text into a specific category using Google's Bidirectional Encoder Representations from Transformers (BERT) language model. Besides text classification, BERT is also used in Google's search engine to better understand searches in English. 

This week we also install a new package, "ktrain", which builds on Keras and lets you create and train models with very few lines of code.

In [1]:
!pip install ktrain

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting ktrain
  Downloading ktrain-0.31.2-py3-none-any.whl (25.3 MB)
Collecting keras-bert>=0.86.0
  Downloading keras-bert-0.89.0.tar.gz (25 kB)
Collecting syntok==1.3.3
  Downloading syntok-1.3.3-py3-none-any.whl (22 kB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Collecting transformers==4.10.3
  Downloading transformers-4.10.3-py3-none-any.whl (2.8 MB)
Collecting cchardet
  Downloading cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

As usual, we import the packages after installing them.

In [2]:
import ktrain
from ktrain import text
import pandas as pd
from pandas import DataFrame
import numpy as np

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Importing and preparing the data

We upload the dataset for training and validation: this time, a CSV file with texts and their labels.

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [5]:
train_data = pd.read_csv('Master_data_3.csv', header=0)

# Edit this list to keep the columns for your variables of interest; always keep "text".
train_data = train_data[["text", "diversity", "moral", "market", "innovation",
                         "ethnic_cultural", "gender", "sexual_ori",
                         "age", "ability", "religion"]]

#train_data=train_data.dropna().reset_index(drop=True)

In [None]:
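# Balance the two 'innovation' classes by randomly dropping texts from the larger one.
# Collect the unique texts belonging to each class.
labels = train_data.groupby('innovation').text.unique()
# Sort the over-represented class to the head.
labels = labels[labels.apply(len).sort_values(ascending=False).index]
# Number of texts to drop so the largest class matches the second largest.
excess = len(labels.iloc[0]) - len(labels.iloc[1])
remove = np.random.choice(labels.iloc[0], excess, replace=False)
train_data = train_data[~train_data.text.isin(remove)]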

In [6]:
print(train_data)

                                                   text diversity moral  \
0     The growing maturity of #fintech has led to an...        No   NaN   
1     More investors are actively looking to invest ...       Yes    No   
2     Today 50+ global CEOs signed a pledge to #Embr...       Yes    No   
3     Steel is a truly diverse product, in both its ...        No   NaN   
4     Who is talking about #diversity at #WEF19? Fol...       Yes    No   
...                                                 ...       ...   ...   
1181  Walking into your place of work with your head...        No    No   
1182  We're proud to celebrate our 747 diverse new p...        No   Yes   
1183  Congratulations to UvA students Priscilla Mari...        No    No   
1184  We're very proud to announce our sponsorship &...        No   Yes   
1185  We at Tata Steel understand the importance of ...        No    No   

     market innovation ethnic_cultural gender sexual_ori  age ability religion  
0       NaN       

*ktrain* is handy because it does a lot of things with a few lines of code. This part reads the data, converts each text into the numeric token IDs that BERT expects (truncated or padded to the maximum sequence length, `maxlen`), and separates the training and validation datasets.

In [7]:
# Change label_columns to the variables of interest.
# (x_test, y_test) is the validation split held out by val_pct.
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(train_data,
                      label_columns = ["diversity"],
                      text_column = "text",
                      preprocess_mode='bert',
                      ngram_range=1,
                      val_pct=0.1,
                      maxlen=128)

['No', 'Yes']
       No  Yes
1080  1.0  0.0
737   1.0  0.0
677   1.0  0.0
816   0.0  1.0
1075  0.0  1.0
['No', 'Yes']
      No  Yes
72   1.0  0.0
408  1.0  0.0
198  0.0  1.0
499  1.0  0.0
74   1.0  0.0
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en
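
To make the roles of `maxlen` and `val_pct` more concrete, here is a minimal sketch in plain Python. It is not how ktrain implements these steps: the helper names `pad_or_truncate` and `holdout_split`, the padding id 0, and the toy token ids are all assumptions made for illustration.

```python
import random

def pad_or_truncate(token_ids, maxlen, pad_id=0):
    """Return a list of exactly `maxlen` ids: cut long inputs, pad short ones."""
    clipped = token_ids[:maxlen]
    return clipped + [pad_id] * (maxlen - len(clipped))

def holdout_split(items, val_pct, seed=42):
    """Shuffle a copy of `items` and hold out a `val_pct` fraction for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_pct))
    return shuffled[n_val:], shuffled[:n_val]

short_text = [101, 2023, 102]   # hypothetical token ids for a short text
long_text = list(range(200))    # hypothetical text longer than maxlen

fixed_short = pad_or_truncate(short_text, maxlen=8)  # padded with zeros
fixed_long = pad_or_truncate(long_text, maxlen=8)    # cut to the first 8 ids

# Hold out 10% of 100 hypothetical rows, mirroring val_pct=0.1.
train_rows, val_rows = holdout_split(range(100), val_pct=0.1)
print(fixed_short, fixed_long, len(train_rows), len(val_rows))
```

Every text ends up with the same length, which is what lets the model process texts in batches. A larger `maxlen` keeps more of each long text but makes training slower.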


## Defining and training the model

Here we create a BERT-based text classifier. The main parameter to change here is the batch size (how many texts the algorithm reads at once).

In [8]:
learner = ktrain.get_learner(text.text_classifier('bert', (x_train, y_train), preproc=preproc, metrics = ['accuracy']),
                             train_data=(x_train, y_train),
                             val_data=(x_test, y_test),
                             batch_size=12)

Is Multi-Label? False
maxlen is 128
done.
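
One practical consequence of the batch size is the number of weight updates per epoch. The sketch below only illustrates that arithmetic; the helper `steps_per_epoch` and the figure of 1000 training texts are assumptions, not values taken from this notebook.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be smaller)."""
    return math.ceil(n_examples / batch_size)

n_examples = 1000  # hypothetical number of training texts
for batch_size in (6, 12, 32):
    print(batch_size, steps_per_epoch(n_examples, batch_size))
```

Smaller batches mean more updates per epoch and less memory per step; larger batches need more GPU memory.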


This step trains the model. You already know what epochs mean, but what about the other parameters? This documentation may be handy: https://amaiya.github.io/ktrain/core.html#ktrain.core.Learner.autofit


In [9]:
learner.autofit(lr = 2e-5, early_stopping=10, reduce_on_plateau=5, epochs=40)



begin training using triangular learning rate policy with max lr of 2e-05...
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 00007: Reducing Max LR on Plateau: new max lr will be 1e-05 (if not early_stopping).
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 00012: Reducing Max LR on Plateau: new max lr will be 5e-06 (if not early_stopping).
Restoring model weights from the end of the best epoch: 2.
Epoch 12: early stopping
Weights from best epoch have been loaded into model.


<keras.callbacks.History at 0x7f5d671b3510>
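
The log above can be read as two patience counters running side by side. The sketch below replays that logic in plain Python. It is a simplification, not ktrain's or Keras's actual code: the function `simulate_schedule` is hypothetical, the validation losses are invented so that the best epoch is 2, and the halving factor is an assumption based on the learning rates printed in the log.

```python
def simulate_schedule(val_losses, lr, early_stopping, reduce_on_plateau, factor=2.0):
    """Replay patience-based LR reduction and early stopping over a loss history."""
    best = float("inf")
    best_epoch = 0
    since_best = 0     # epochs since the last improvement (early stopping)
    since_reduce = 0   # epochs without improvement since the last LR cut
    events = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch
            since_best = since_reduce = 0
            continue
        since_best += 1
        since_reduce += 1
        if since_reduce >= reduce_on_plateau:
            lr = lr / factor
            since_reduce = 0
            events.append(("reduce_lr", epoch, lr))
        if since_best >= early_stopping:
            events.append(("stop", epoch, best_epoch))
            break
    return events

# Invented losses: improvement until epoch 2, then a plateau.
val_losses = [0.50, 0.40] + [0.45] * 38
events = simulate_schedule(val_losses, lr=2e-5, early_stopping=10, reduce_on_plateau=5)
for event in events:
    print(event)
```

With these made-up losses, the learning rate is cut at epochs 7 and 12 and training stops at epoch 12 with epoch 2 as the best one, the same pattern as in the log.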

This prints a per-class report and returns the confusion matrix, which gives you some additional information on where the classifier performed well (or not).

In [11]:
learner.validate(val_data=(x_test, y_test))

              precision    recall  f1-score   support

           0       0.91      0.88      0.89        82
           1       0.75      0.81      0.78        37

    accuracy                           0.86       119
   macro avg       0.83      0.84      0.84       119
weighted avg       0.86      0.86      0.86       119



array([[72, 10],
       [ 7, 30]])
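
The numbers in the report can be recomputed by hand from the confusion matrix. The sketch below does this in plain Python, assuming the usual layout in which rows are the true classes and columns are the predicted classes (the row sums 82 and 37 match the `support` column, which is consistent with that reading).

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = [[72, 10],
      [7, 30]]

n_classes = len(cm)
report = {}
for k in range(n_classes):
    true_pos = cm[k][k]
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(cm[k])                                  # row sum (support)
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    report[k] = (round(precision, 2), round(recall, 2), round(f1, 2))

accuracy = sum(cm[k][k] for k in range(n_classes)) / sum(map(sum, cm))

for k, scores in report.items():
    print(k, scores)
print("accuracy", round(accuracy, 2))
```

For class 1, for example, 30 of the 40 texts predicted as class 1 were correct (precision 0.75), and 30 of the 37 texts that truly belong to class 1 were found (recall 0.81).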

## Making predictions

The final blocks of code save the trained model, load it again, and allow you to input text and see what the predicted label is.

In [12]:
learner.model.save('/content/drive/MyDrive/diversity_classifier/diversity_classifier')



In [13]:
learner.model.save_weights('/content/drive/MyDrive/diversity_classifier/diversity_classifier')

In [None]:
!zip -r /content/diversity_classifier.zip /content/drive/MyDrive/diversity_classifier

Loading model:

In [None]:
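# Restore the weights saved above with save_weights (Keras models have no .load method).
learner.model.load_weights('/content/drive/MyDrive/diversity_classifier/diversity_classifier')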

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [None]:
db = pd.read_csv('insertname', header=0)

In [None]:
predictions = predictor.predict(db.text.tolist())

In [None]:
predictions_list = DataFrame(predictions, columns=['label'])

In [None]:
predictions_list.to_csv('insertnameforsaving') 

In [None]:
prediction = predictor.predict("P")
print(prediction) 