<a href="https://colab.research.google.com/github/jloutz/Resume_NER/blob/master/flair_nlp_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Resume NER Part 4: Working with Flair NLP

---

In this part we will use Flair NLP to train a model on our data and evaluate the results. Make sure you have set up your Google account and uploaded your files to Google Drive. This notebook is meant to run on Google Colab.

Let's mount Google Drive, change the working directory to the folder that holds our training data, and install Flair.

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
import os
os.chdir("/content/gdrive/My Drive/SAKI_2019/dataset") 
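
Before going further, it can help to confirm that the data files are where the later cells expect them. The following is a minimal sketch, assuming the folder and file names used further down in this notebook; `missing_files` is a hypothetical helper, not part of Colab or Flair.

```python
import os

def missing_files(folder, names):
    """Return the entries of `names` that are not regular files inside `folder`."""
    return [n for n in names if not os.path.isfile(os.path.join(folder, n))]

# Paths used by the cells in this notebook; adjust them to your own Drive layout.
data_folder = "/content/gdrive/My Drive/SAKI_2019/dataset/flair"
expected = ["bilou_training.csv", "bilou_test.csv"]

missing = missing_files(data_folder, expected)
if missing:
    print("Missing in", data_folder, "->", missing)
else:
    print("All data files found.")
```

If anything is reported as missing, check the Drive mount and the folder name before starting a long training run.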

In [4]:
# install the flair library
! pip install flair

Collecting flair
  Downloading https://files.pythonhosted.org/packages/4e/3a/2e777f65a71c1eaa259df44c44e39d7071ba8c7780a1564316a38bf86449/flair-0.4.2-py3-none-any.whl (136kB)
Collecting sqlitedict>=1.6.0 (from flair)
  Downloading https://files.pythonhosted.org/packages/0f/1c/c757b93147a219cf1e25cef7e1ad9b595b7f802159493c45ce116521caff/sqlitedict-1.6.0.tar.gz
Collecting regex (from flair)
  Downloading https://files.pythonhosted.org/packages/6f/4e/1b178c38c9a1a184288f72065a65ca01f3154df43c6ad898624149b8b4e0/regex-2019.06.08.tar.gz (651kB)
Collecting mpld3==0.3 (from flair)
  Downloading https://files.pythonhosted.org/packages/91/95/a52d3a83d0a29ba0d6898f6727e9858fe7a43f6c2ce81a5fe7e05f0f4912/mpld3-0.3.tar.gz (788kB)
Collecting pytorch-pretrained-bert>=0.6.1 (from flair)
  Downloading https

In the next section, we will train an NER model with Flair. The code is taken from section 7 of the Flair tutorials, "Training a Model":
https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md



In [5]:
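# imports
# Corpus is defined in flair.data
from flair.data import Corpus
# note: NLPTaskDataFetcher is deprecated in newer flair releases in favour of flair.datasets.ColumnCorpus
from flair.data_fetcher import NLPTaskDataFetcher

## column index -> annotation name; make sure this describes your file structure
columns = {0: 'text', 2: 'ner'}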

# folder where training and test data are
data_folder = '/content/gdrive/My Drive/SAKI_2019/dataset/flair'

# 1.0 is full data, try a much smaller number like 0.1 to test run the code
downsample = 1.0 

## your train file name
train_file = 'bilou_training.csv'

## your test file name
test_file = 'bilou_test.csv'

# 1. get the corpus
corpus: Corpus = NLPTaskDataFetcher.load_column_corpus(
    data_folder, columns,
    train_file=train_file,
    test_file=test_file,
    dev_file=None,
).downsample(downsample)
print(corpus)
print(len(corpus.train))

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')
print(tag_dictionary.idx2item)


2019-06-17 10:48:11,728 Reading data from /content/gdrive/My Drive/SAKI_2019/dataset/flair
2019-06-17 10:48:11,729 Train: /content/gdrive/My Drive/SAKI_2019/dataset/flair/bilou_training.csv
2019-06-17 10:48:11,734 Dev: None
2019-06-17 10:48:11,736 Test: /content/gdrive/My Drive/SAKI_2019/dataset/flair/bilou_test.csv


  train_file, column_format
  test_file, column_format


Corpus: 493 train + 55 dev + 137 test sentences
493
[b'<unk>', b'O', b'U-Location', b'', b'U-Skills', b'U-Degree', b'B-Skills', b'I-Skills', b'L-Skills', b'B-Degree', b'I-Degree', b'L-Degree', b'-', b'B-Location', b'L-Location', b'I-Location', b'<START>', b'<STOP>']
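
The `columns` dictionary tells the loader which column of each row holds the token text and which holds the NER tag. In the column format, every line carries one token with its annotations separated by whitespace, and an empty line ends a sentence. The tag dictionary above also contains `b''` and `b'-'`, which do not look like BILOU tags; one possible cause (an assumption, not verified against the data) is rows with a missing or malformed tag column. The sketch below is plain Python and does not call Flair: `read_column_lines` and `suspicious_tags` are hypothetical helpers, and the sample rows are made up for illustration.

```python
import re
from collections import Counter

# BILOU tags look like "O" or "<B|I|L|U>-<Label>", e.g. "U-Skills".
BILOU_TAG = re.compile(r"^(O|[BILU]-\S+)$")

def read_column_lines(lines, columns):
    """Group token rows into sentences; `columns` maps column index -> field name."""
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:  # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # In this sketch a missing column becomes an empty string instead of raising.
        current.append({name: fields[i] if i < len(fields) else ""
                        for i, name in columns.items()})
    if current:
        sentences.append(current)
    return sentences

def suspicious_tags(sentences, tag_field="ner"):
    """Count tag values that do not match the BILOU pattern."""
    return Counter(tok[tag_field] for sent in sentences for tok in sent
                   if not BILOU_TAG.match(tok[tag_field]))

# Made-up sample rows: token, an unused middle column, NER tag.
sample = [
    "Python NNP U-Skills",
    "developer NN O",
    "",
    "Berlin NNP",   # tag column missing
    "- : -",        # tag is a bare hyphen
]
columns = {0: "text", 2: "ner"}
sentences = read_column_lines(sample, columns)
print(len(sentences), "sentences")
print(suspicious_tags(sentences))
```

Running a check like this over the lines of `bilou_training.csv` and `bilou_test.csv` would show how many rows carry tags outside the BILOU scheme, so they can be fixed in the data before training.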


In [6]:

# 4. initialize embeddings. Experiment with different embedding types to see which gives the best results
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings
from typing import List

embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('glove'),
    # uncomment this line to use character embeddings
    # (also add CharacterEmbeddings to the flair.embeddings import above)
    # CharacterEmbeddings(),

    # flair embeddings (these need a LONG time to train :-); comment them out for a faster run
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type='ner',
                                        use_crf=True)

2019-06-17 10:48:19,307 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim.vectors.npy not found in cache, downloading to /tmp/tmpywwxpos0


100%|██████████| 160000128/160000128 [00:08<00:00, 19665524.70B/s]

2019-06-17 10:48:28,014 copying /tmp/tmpywwxpos0 to cache at /root/.flair/embeddings/glove.gensim.vectors.npy





2019-06-17 10:48:28,284 removing temp file /tmp/tmpywwxpos0
2019-06-17 10:48:28,777 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/glove.gensim not found in cache, downloading to /tmp/tmpnhczj7kf


100%|██████████| 21494764/21494764 [00:01<00:00, 11945486.72B/s]

2019-06-17 10:48:31,110 copying /tmp/tmpnhczj7kf to cache at /root/.flair/embeddings/glove.gensim
2019-06-17 10:48:31,136 removing temp file /tmp/tmpnhczj7kf



  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


2019-06-17 10:48:33,172 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/big-news-forward--h2048-l1-d0.05-lr30-0.25-20/news-forward-0.4.1.pt not found in cache, downloading to /tmp/tmpruv2pxu9


100%|██████████| 73034624/73034624 [00:04<00:00, 16110576.39B/s]

2019-06-17 10:48:38,246 copying /tmp/tmpruv2pxu9 to cache at /root/.flair/embeddings/news-forward-0.4.1.pt





2019-06-17 10:48:38,366 removing temp file /tmp/tmpruv2pxu9
2019-06-17 10:48:46,665 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/big-news-backward--h2048-l1-d0.05-lr30-0.25-20/news-backward-0.4.1.pt not found in cache, downloading to /tmp/tmptc94jmoq


100%|██████████| 73034575/73034575 [00:04<00:00, 16635714.23B/s]

2019-06-17 10:48:51,644 copying /tmp/tmptc94jmoq to cache at /root/.flair/embeddings/news-backward-0.4.1.pt





2019-06-17 10:48:51,741 removing temp file /tmp/tmptc94jmoq
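
`StackedEmbeddings` combines the listed embeddings by concatenating their vectors for each token, so the tagger sees one long vector per token. The sketch below illustrates only that idea in plain Python. The sizes are assumptions: 100 dimensions for GloVe (the download logged above is 160000128 bytes, which would fit 400,000 vectors of 100 float32 values plus a small header, the vocabulary size being an assumption too) and 2048 per Flair language model (suggested by `h2048` in the model file names). `stack` is a hypothetical helper, not a Flair function.

```python
# Assumed vector sizes per token (see the hedges above); not read from flair itself.
ASSUMED_SIZES = {"glove": 100, "news-forward": 2048, "news-backward": 2048}

def stack(vectors):
    """Concatenate per-token vectors from several embeddings into one vector."""
    stacked = []
    for v in vectors:
        stacked.extend(v)
    return stacked

# One dummy vector per embedding type, filled with zeros.
token_vectors = [[0.0] * size for size in ASSUMED_SIZES.values()]
stacked = stack(token_vectors)
print(len(stacked))  # sum of the assumed sizes: 4196
```

In the notebook itself, `embeddings.embedding_length` should report the actual size of the stacked vector, which is the number to trust over these assumed values.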


In [0]:
# 6. initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

## give your model a name and a folder of your choice; the model is saved there so it can be loaded later.
## you can run this notebook many times with different embeddings/params and save each model under a different name
model_name = 'resources/taggers/resume-ner-4'

# 7. start training - try a different batch size if you get memory errors.
# How many epochs does it take before the model stops improving? Start with a large number like 150;
# you can stop the code cell at any time, because the framework persists the best model even if you interrupt training.
trainer.train(model_name,
              learning_rate=0.1,
              mini_batch_size=5,
              #anneal_with_restarts=True,
              max_epochs=150)
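
To get a feeling for what `mini_batch_size` and `max_epochs` mean in practice: 493 training sentences in batches of 5 give 99 batches per epoch, which matches the `iter 0/99` counter in the training log. The sketch below reproduces that arithmetic and derives a rough time estimate from two timestamps of epoch 1 in that log. The estimate is an assumption-laden extrapolation: it ignores evaluation time, and the first epoch may not be representative of later ones.

```python
import math
from datetime import datetime

train_sentences = 493   # from the corpus summary printed earlier
mini_batch_size = 5     # value passed to trainer.train
max_epochs = 150

batches_per_epoch = math.ceil(train_sentences / mini_batch_size)
print(batches_per_epoch)  # 99

# Timestamps of "iter 0" and "iter 90" of epoch 1, copied from the training log.
fmt = "%Y-%m-%d %H:%M:%S,%f"
start = datetime.strptime("2019-06-17 10:49:06,685", fmt)
later = datetime.strptime("2019-06-17 11:03:04,002", fmt)
seconds_per_batch = (later - start).total_seconds() / 90

# Extrapolate to a full epoch and to the full run.
epoch_minutes = seconds_per_batch * batches_per_epoch / 60
total_hours = epoch_minutes * max_epochs / 60
print(f"~{epoch_minutes:.0f} min per epoch, ~{total_hours:.0f} h for {max_epochs} epochs")
```

With numbers in this range, running all 150 epochs is rarely practical in a single Colab session, which is why watching the scores and interrupting the cell once they stop improving is a reasonable strategy.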




2019-06-17 10:48:53,618 ----------------------------------------------------------------------------------------------------
2019-06-17 10:48:53,623 Evaluation method: MICRO_F1_SCORE
2019-06-17 10:48:54,864 ----------------------------------------------------------------------------------------------------
2019-06-17 10:49:06,685 epoch 1 - iter 0/99 - loss 2706.55908203
2019-06-17 10:50:32,402 epoch 1 - iter 9/99 - loss 693.45689697
2019-06-17 10:51:40,253 epoch 1 - iter 18/99 - loss 488.24582190
2019-06-17 10:53:12,777 epoch 1 - iter 27/99 - loss 415.98351479
2019-06-17 10:54:19,619 epoch 1 - iter 36/99 - loss 376.00188962
2019-06-17 10:55:48,450 epoch 1 - iter 45/99 - loss 355.31663314
2019-06-17 10:57:24,910 epoch 1 - iter 54/99 - loss 332.68958893
2019-06-17 10:58:45,664 epoch 1 - iter 63/99 - loss 314.93679237
2019-06-17 11:00:08,052 epoch 1 - iter 72/99 - loss 298.24884660
2019-06-17 11:01:38,558 epoch 1 - iter 81/99 - loss 283.27220749
2019-06-17 11:03:04,002 epoch 1 - iter 90/9