In [1]:
!pip install ktrain

Collecting ktrain
  Downloading https://files.pythonhosted.org/packages/13/63/261305e5393181acc4ddbbed982f578d54d820c6b79d51a26eda2c0377c4/ktrain-0.25.1.tar.gz (25.3MB)
Collecting langdetect
  Downloading https://files.pythonhosted.org/packages/56/a3/8407c1e62d5980188b4acc45ef3d94b933d14a2ebc9ef3505f22cf772570/langdetect-1.0.8.tar.gz (981kB)
Collecting cchardet
  Downloading https://files.pythonhosted.org/packages/a0/e5/a0b9edd8664ea3b0d3270c451ebbf86655ed9fc4c3e4c45b9afae9c2e382/cchardet-2.1.7-cp36-cp36m-manylinux2010_x86_64.whl (263kB)
Collecting syntok
  Downloading https://files.pythonhosted.org/packages/8c/76/a49e73a04b3e3a14ce232e8e28a1587f8108baa665644fe8c40e307e792e/syntok-1.3.1.tar.gz
Collecting seqeval==0.0.19
  Downloading https://files.pythonhosted.org/packages/93/e5/b7705156a

ktrain uses TensorFlow 2. To support sequence tagging, ktrain also currently uses the CRF module from keras_contrib, which is not yet fully compatible with TensorFlow 2. To use the BiLSTM-CRF model (which currently requires keras_contrib) for sequence tagging in ktrain, you must disable V2 behavior in TensorFlow 2 by setting the following environment variable at the top of your notebook or script, before importing ktrain:

In [2]:
import os
os.environ['DISABLE_V2_BEHAVIOR'] = '1'

In [3]:
import ktrain
from ktrain import text

Instructions for updating:
non-resource variables are not supported in the long term
Using DISABLE_V2_BEHAVIOR with TensorFlow


In [4]:
import pandas as pd

In [8]:
data = pd.read_csv("/content/ner_dataset.csv",encoding = "ISO-8859-1")

In [9]:
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
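
The layout above stores one word per row and fills in the `Sentence #` column only on the first word of each sentence. As a rough illustration of how such rows can be grouped back into sentences, here is a minimal sketch using only the standard library. The sample text is a toy stand-in (the second sentence is invented, and the unnamed index column is omitted), and `group_sentences` is a hypothetical helper, not part of ktrain.

```python
import csv
import io

# Toy sample in the same layout as ner_dataset.csv (assumption: the second
# sentence is made up for illustration and the index column is left out).
SAMPLE = """Sentence #,Word,POS,Tag
Sentence: 1,Thousands,NNS,O
,of,IN,O
,demonstrators,NNS,O
Sentence: 2,They,PRP,O
,marched,VBD,O
"""


def group_sentences(csv_text):
    """Group rows into sentences; a non-empty 'Sentence #' starts a new one."""
    sentences = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Start a new sentence on a labeled row (or on the very first row).
        if row["Sentence #"] or not sentences:
            sentences.append([])
        sentences[-1].append((row["Word"], row["Tag"]))
    return sentences


sentences = group_sentences(SAMPLE)
for sentence in sentences:
    print(sentence)
```

Each sentence comes out as a list of (word, tag) pairs, which is the shape a sequence tagger is trained on.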


Sequence tagging (or sequence labeling) involves classifying words or sequences of words as representing some category or concept of interest. One example of sequence tagging is Named Entity Recognition (NER), where we classify words or sequences of words that identify some entity such as a person, organization, or location. In this tutorial, we will show how to use ktrain to perform sequence tagging in three simple steps.

**We will use the file ner_dataset.csv (which conforms to the format above) and load and preprocess it with the entities_from_txt function. When loading the dataset, we specify use_char=True to instruct ktrain to extract the character vocabulary, which is used in a character embedding layer of the model.**

In [10]:
DATAFILE = '/content/ner_dataset.csv'
(trn, val, preproc) = text.entities_from_txt(DATAFILE,
                                             sentence_column='Sentence #',
                                             word_column='Word',
                                             tag_column='Tag', 
                                             data_format='gmb',
                                             use_char=True)

detected encoding: WINDOWS-1252 (if wrong, set manually)
Number of sentences:  47959
Number of words in the dataset:  35178
Tags: ['B-per', 'I-gpe', 'I-geo', 'I-per', 'I-art', 'B-geo', 'I-eve', 'I-org', 'B-tim', 'I-nat', 'B-art', 'B-org', 'O', 'B-gpe', 'I-tim', 'B-eve', 'B-nat']
Number of Labels:  17
Longest sentence: 104 words
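
To make the idea of a character vocabulary more concrete, the sketch below builds one from a handful of words and encodes a word as a list of character indices, which is the kind of input a character embedding layer consumes. This is only an illustration under assumed conventions: `build_char_vocab`, `encode_word`, and the `<pad>`/`<unk>` entries are hypothetical and do not reflect how ktrain implements `use_char=True` internally.

```python
def build_char_vocab(words):
    """Map each distinct character to an integer index.

    Indices 0 and 1 are reserved for padding and unknown characters
    (an assumed convention for this sketch).
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for ch in sorted({ch for word in words for ch in word}):
        vocab[ch] = len(vocab)
    return vocab


def encode_word(word, vocab):
    """Turn a word into character indices, using <unk> for unseen characters."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in word]


vocab = build_char_vocab(["of", "have"])
print(vocab)
print(encode_word("of", vocab))
print(encode_word("oz", vocab))  # 'z' is not in the vocabulary
```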


In the cell below, notice that we supplied the wv_path_or_url argument. This directs ktrain to initialize word embeddings with one of the pretrained fastText (word2vec) word vector sets from Facebook's fastText site. When supplied with a valid URL to a .vec.gz file, the word vectors are automatically downloaded, extracted, and loaded (the download location is <home_directory>/ktrain_data).

In [11]:
WV_URL = 'https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz'
model = text.sequence_tagger('bilstm-crf', preproc, wv_path_or_url=WV_URL)

Embedding schemes employed (combined with concatenation):
	word embeddings initialized with fasttext word vectors (cc.en.300.vec.gz)
	character embeddings

pretrained word embeddings will be loaded from:
	https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
downloading pretrained word vectors to /root/ktrain_data ...
[██████████████████████████████████████████████████]
extracting pretrained word vectors...
done.

cleanup downloaded zip...
done.

loading pretrained word vectors...this may take a few moments...
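
For readers curious about what is inside the extracted file: a fastText .vec file is plain text, with a header line giving the number of words and the vector dimension, followed by one line per word containing the word and its space-separated values. The sketch below parses a tiny in-memory sample in that layout. `read_vec` and the sample values are made up for illustration, and this is not the loader ktrain uses.

```python
import io

# Tiny stand-in for a .vec file: header "<num_words> <dim>", then one word
# per line followed by its vector components (values are invented).
SAMPLE_VEC = """2 3
the 0.1 0.2 0.3
of -0.4 0.5 0.6
"""


def read_vec(handle):
    """Parse .vec-style text into (num_words, dim, {word: [floats]})."""
    header = handle.readline().split()
    n_words, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        # Skip malformed lines whose length does not match the header.
        if len(values) != dim:
            continue
        vectors[word] = values
    return n_words, dim, vectors


n_words, dim, vectors = read_vec(io.StringIO(SAMPLE_VEC))
print(n_words, dim, vectors["of"])
```

A real file such as cc.en.300.vec has 300 values per line, so it would normally be read lazily and restricted to the words in the training vocabulary.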


In [12]:
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)

In [13]:
learner.fit(1e-2, 1, cycle_len=1)

preparing training data ...done.
preparing validation data ...done.


<tensorflow.python.keras.callbacks.History at 0x7fab63367390>

In [14]:
learner.validate()

   F1:  84.11
              precision    recall  f1-score   support

         art       0.00      0.00      0.00        52
         eve       0.27      0.16      0.20        19
         geo       0.86      0.91      0.88      3837
         gpe       0.97      0.93      0.95      1509
         nat       0.29      0.07      0.11        30
         org       0.75      0.69      0.72      2019
         per       0.78      0.77      0.78      1714
         tim       0.90      0.86      0.88      1987

   micro avg       0.85      0.83      0.84     11167
   macro avg       0.60      0.55      0.56     11167
weighted avg       0.84      0.83      0.84     11167



0.8411350081330201
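
The summary rows of the report can be partly reproduced from the per-class rows. The sketch below recomputes the macro and weighted F1 averages from the rounded per-class F1 scores and supports copied from the table above. Because the inputs are already rounded to two decimals, the results only approximate the reported values (the macro average comes out near 0.565 rather than the printed 0.56). The micro average is computed from pooled counts across all classes, so it cannot be recovered from these rows alone; the headline F1 of 84.11 appears to correspond to that micro-averaged score.

```python
# (f1-score, support) per entity type, copied from the report above.
REPORT = {
    "art": (0.00, 52),
    "eve": (0.20, 19),
    "geo": (0.88, 3837),
    "gpe": (0.95, 1509),
    "nat": (0.11, 30),
    "org": (0.72, 2019),
    "per": (0.78, 1714),
    "tim": (0.88, 1987),
}

# Macro average: every class counts equally, so rare classes such as
# 'art' and 'nat' pull the score down sharply.
macro_f1 = sum(f1 for f1, _ in REPORT.values()) / len(REPORT)

# Weighted average: each class is weighted by its number of true entities.
total_support = sum(support for _, support in REPORT.values())
weighted_f1 = sum(f1 * support for f1, support in REPORT.values()) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.3f}")
print(f"weighted F1:   {weighted_f1:.3f}")
```

The gap between the macro and weighted figures reflects how poorly the model does on the rare entity types after a single epoch.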

In [15]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [16]:
predictor.predict('As of 2019, Donald Trump is still the President of the United States.')

[('As', 'O'),
 ('of', 'O'),
 ('2019', 'B-tim'),
 (',', 'O'),
 ('Donald', 'B-per'),
 ('Trump', 'I-per'),
 ('is', 'O'),
 ('still', 'O'),
 ('the', 'O'),
 ('President', 'B-per'),
 ('of', 'O'),
 ('the', 'O'),
 ('United', 'B-geo'),
 ('States', 'I-geo'),
 ('.', 'O')]
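
The predictor returns one (token, tag) pair per token, where a `B-` prefix marks the beginning of an entity and `I-` marks its continuation. Downstream code usually wants whole entities instead. The following is a minimal sketch of merging the pairs into entity spans; `bio_to_spans` is a hypothetical helper written for this example, not a ktrain function.

```python
def bio_to_spans(tagged):
    """Merge (token, BIO-tag) pairs into (entity_text, entity_type) spans."""
    spans = []
    tokens, etype = [], None
    for token, tag in tagged:
        prefix, _, tag_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and tag_type != etype):
            # A new entity starts (a stray I- tag is treated like B-).
            if tokens:
                spans.append((" ".join(tokens), etype))
            tokens, etype = [token], tag_type
        elif prefix == "I":
            # Continuation of the current entity.
            tokens.append(token)
        else:
            # 'O' tag: close any open entity.
            if tokens:
                spans.append((" ".join(tokens), etype))
            tokens, etype = [], None
    if tokens:
        spans.append((" ".join(tokens), etype))
    return spans


# The prediction shown above, copied verbatim.
prediction = [
    ("As", "O"), ("of", "O"), ("2019", "B-tim"), (",", "O"),
    ("Donald", "B-per"), ("Trump", "I-per"), ("is", "O"), ("still", "O"),
    ("the", "O"), ("President", "B-per"), ("of", "O"), ("the", "O"),
    ("United", "B-geo"), ("States", "I-geo"), (".", "O"),
]
spans = bio_to_spans(prediction)
print(spans)
```

Joining tokens with a single space is a simplification; it does not restore the original spacing around punctuation.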

In [17]:
predictor.save('/tmp/mypred')

In [18]:
reloaded_predictor = ktrain.load_predictor('/tmp/mypred')

In [20]:
reloaded_predictor.predict('Steve Carrel is my favorite American actor.')

[('Steve', 'B-per'),
 ('Carrel', 'I-per'),
 ('is', 'O'),
 ('my', 'O'),
 ('favorite', 'O'),
 ('American', 'B-gpe'),
 ('actor', 'O'),
 ('.', 'O')]