# Neural NER with flair
In this hands-on session, we use https://github.com/zalandoresearch/flair, a state-of-the-art framework for NLP sequence labeling tasks.
The project site contains several tutorials that show how to train your own models with your own data.
However, training requires GPUs and several hours of compute, so it is not feasible on this machine.


## Using a standard flair model for English
You can play with a few sentences. The model is trained on CoNLL-2003 data. For details, see the [documentation](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_2_TAGGING.md#list-of-pre-trained-sequence-tagger-models).

In [1]:
from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence('I love Utrecht.', use_tokenizer=True)

# load the NER tagger
tagger = SequenceTagger.load('ner')

2019-07-08 14:58:51,593 loading file /home/jovyan/.flair/models/en-ner-conll03-v0.4.pt


In [2]:
# run NER over sentence
tagger.predict(sentence)

# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

ORG-span [3]: "Utrecht"


Show detailed information

In [3]:
print(sentence.to_dict(tag_type='ner'))

{'text': 'I love Utrecht.', 'labels': [], 'entities': [{'text': 'Utrecht', 'start_pos': 7, 'end_pos': 14, 'type': 'ORG', 'confidence': 0.9657689332962036}]}


### Contextualized embeddings at work...
Playing with ambiguous words: the same surface form can receive different labels depending on its context.

In [4]:
sentence = Sentence('Washington went to Washington .')
tagger.predict(sentence)
print(sentence.to_dict(tag_type='ner'))

{'text': 'Washington went to Washington .', 'labels': [], 'entities': [{'text': 'Washington', 'start_pos': 0, 'end_pos': 10, 'type': 'PER', 'confidence': 0.982752799987793}, {'text': 'Washington', 'start_pos': 19, 'end_pos': 29, 'type': 'LOC', 'confidence': 0.9996899366378784}]}
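
The dictionary printed above can also be inspected programmatically. The following is a minimal sketch that assumes the structure shown in the output (copied here as a literal so the snippet runs without flair); the helper name `types_by_surface` is hypothetical and not part of flair.

```python
from collections import defaultdict

# Output of sentence.to_dict(tag_type='ner'), copied from the cell above
tagged = {
    'text': 'Washington went to Washington .',
    'labels': [],
    'entities': [
        {'text': 'Washington', 'start_pos': 0, 'end_pos': 10,
         'type': 'PER', 'confidence': 0.982752799987793},
        {'text': 'Washington', 'start_pos': 19, 'end_pos': 29,
         'type': 'LOC', 'confidence': 0.9996899366378784},
    ],
}

def types_by_surface(tagged_sentence):
    """Map each entity surface form to the types it received (hypothetical helper)."""
    result = defaultdict(list)
    for entity in tagged_sentence['entities']:
        # start_pos/end_pos are character offsets into the sentence text
        surface = tagged_sentence['text'][entity['start_pos']:entity['end_pos']]
        result[surface].append(entity['type'])
    return dict(result)

print(types_by_surface(tagged))
```

A surface form that maps to more than one type, as `Washington` does here, is a case where the context decided the label.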


## Applying a purely character-based NER model trained on the French QUAERO corpus
We trained a purely character-based NER model using the QUAERO corpus.
The underlying character language model was trained on Swiss newspaper texts from the 19th century and on French Wikipedia.
Download the model (250 MB) and a few scripts for running and testing it.

In [5]:
! git clone https://gitlab.ifi.uzh.ch/siclemat/dh2019-ner-tutorial-flair-quaero-material.git ~/flair-quaero

fatal: destination path '/home/jovyan/flair-quaero' already exists and is not an empty directory.


In [6]:
%cd ~/flair-quaero

/home/jovyan/flair-quaero


Let's look at some real newspaper data from the 19th century.

In [7]:
!head -n 20 data.d/test_short.txt

1	PARIS	PROPN	B-loc.adm.town
2	,	PUNCT	O
3	i8	ADJ	B-time.date.abs
4	DÉCEMBRE	NOUN	I-time.date.abs
5	BULLETIN	VERB	O
6	DU	ADP	O
7	JOUR	NOUN	O

1	Il	PRON	O
2	n'	ADV	O
3	était	AUX	O
4	pas	ADV	O
5	Piémontais	PROPN	O
6	,	PUNCT	O
7	mais	CCONJ	O
8	bon	ADJ	O
9	Méridional	NOUN	O
10	,	PUNCT	O
11	déCosènzà	ADP	B-loc.adm.town
12	enCalabre	NOUN	B-loc.adm.reg


Tagging a verticalized file with reference annotations for evaluation. 
Output format:
 1. Token
 2. Gold NER IOB tag
 3. Computed NER IOB tag
 4. Probability/confidence of computed IOB tag
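
The four-column format described above could be read back into Python roughly as follows. This is a sketch that assumes the columns are separated by single spaces and that sentences are separated by blank lines; the function name `parse_tagger_output` is hypothetical, and the sample lines are taken from the tagger output in this notebook.

```python
def parse_tagger_output(lines):
    """Group 'token gold predicted confidence' lines into sentences.

    A blank line is treated as a sentence boundary (assumption).
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, gold, predicted, confidence = line.split(" ")
        current.append((token, gold, predicted, float(confidence)))
    if current:
        sentences.append(current)
    return sentences

sample = """PARIS B-loc.adm.town B-loc.adm.town 0.9988251328468323
, O O 0.969717264175415

Il O O 0.9986006617546082
Piémontais O B-pers.ind 0.5966131091117859
""".splitlines()

for sent in parse_tagger_output(sample):
    # list the tokens where the prediction disagrees with the gold tag
    print([(tok, gold, pred) for tok, gold, pred, _ in sent if gold != pred])
```

Listing only the disagreements makes it easy to spot where the tagger and the reference annotation diverge.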

In [8]:
! python lib/flair_ner_tagger.py \
  --model resources.d/taggers/ner/pressfr-wikifr/raw-stringemb-crf/best-model.pt \
  data.d/test_short.txt

PARIS B-loc.adm.town B-loc.adm.town 0.9988251328468323
, O O 0.969717264175415
i8 B-time.date.abs B-time.date.abs 0.531688928604126
DÉCEMBRE I-time.date.abs I-time.date.abs 0.9934332370758057
BULLETIN O O 0.9681801199913025
DU O O 0.9992507100105286
JOUR O O 0.99275803565979

Il O O 0.9986006617546082
n' O O 0.9999463558197021
était O O 0.999908447265625
pas O O 0.9942612648010254
Piémontais O B-pers.ind 0.5966131091117859
, O O 0.9904968738555908
mais O O 0.9991064667701721
bon O O 0.9567022323608398
Méridional O B-pers.ind 0.494706928730011
, O O 0.9902147054672241
déCosènzà B-loc.adm.town O 0.6310709118843079
enCalabre B-loc.adm.reg O 0.24175947904586792
. O O 0.999897837638855

Temps B-prod.media B-prod.media 0.9141724109649658
Berlin B-loc.adm.town B-loc.adm.town 0.9977395534515381
, O O 0.9906547665596008
25 B-time.date.abs B-time.date.abs 0.8832522630691528
décembre I-time.date.abs I-time.date.abs 0.998478353023529
, O O 0.9989118576049805
8 B-time.hour.abs B-time.hour.abs 0.453

Save the relevant columns for evaluation in a file.

In [9]:
! python lib/flair_ner_tagger.py \
  --model resources.d/taggers/ner/pressfr-wikifr/raw-stringemb-crf/best-model.pt \
  data.d/test_short.txt |cut -d " " -f 1,2,3 > test_short_ner_tagged.txt

In [10]:
! head -n 100 test_short_ner_tagged.txt

PARIS B-loc.adm.town B-loc.adm.town
, O O
i8 B-time.date.abs B-time.date.abs
DÉCEMBRE I-time.date.abs I-time.date.abs
BULLETIN O O
DU O O
JOUR O O

Il O O
n' O O
était O O
pas O O
Piémontais O B-pers.ind
, O O
mais O O
bon O O
Méridional O B-pers.ind
, O O
déCosènzà B-loc.adm.town O
enCalabre B-loc.adm.reg O
. O O

Temps B-prod.media B-prod.media
Berlin B-loc.adm.town B-loc.adm.town
, O O
25 B-time.date.abs B-time.date.abs
décembre I-time.date.abs I-time.date.abs
, O O
8 B-time.hour.abs B-time.hour.abs
hwires I-time.hour.abs I-time.hour.abs
. O O

L' O O
un O O
, O O
dit O O
-il O O
d' O O
une O O
voix O O
navrée O O
, O O
c' O O
est O O
notre O O
capitaine B-func.ind B-func.ind
, O O
notre O O
frère O O
, O O
o°an B-pers.ind B-pers.ind
Chêne I-pers.ind I-pers.ind
. O O

Lo O O
Volksblalt B-prod.media B-prod.media
publie O O
un O O
appel O O
aux O O
socialistes B-pers.coll B-pers.coll
allemands I-pers.coll I-pers.coll
, O O

In [11]:
! perl lib/conlleval.pl < test_short_ner_tagged.txt

processed 169 tokens with 22 phrases; found: 23 phrases; correct: 16.
accuracy:  87.57%; precision:  69.57%; recall:  72.73%; FB1:  71.11
        func.coll: precision:   0.00%; recall:   0.00%; FB1:   0.00  1
         func.ind: precision:  50.00%; recall: 100.00%; FB1:  66.67  2
      loc.adm.reg: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
     loc.adm.town: precision:  60.00%; recall:  60.00%; FB1:  60.00  5
          org.adm: precision: 100.00%; recall: 100.00%; FB1: 100.00  2
        pers.coll: precision: 100.00%; recall: 100.00%; FB1: 100.00  1
         pers.ind: precision:  25.00%; recall:  33.33%; FB1:  28.57  4
       prod.media: precision: 100.00%; recall: 100.00%; FB1: 100.00  4
    time.date.abs: precision: 100.00%; recall: 100.00%; FB1: 100.00  3
    time.hour.abs: precision: 100.00%; recall: 100.00%; FB1: 100.00  1
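
The overall scores in the second line of the report follow directly from the three counts in the first line. As a worked example (plain Python, only the counts are taken from the output above):

```python
# Counts taken from the first line of the conlleval output above
gold_phrases = 22   # phrases in the reference annotation
found_phrases = 23  # phrases predicted by the tagger
correct = 16        # predicted phrases that exactly match a reference phrase

precision = correct / found_phrases
recall = correct / gold_phrases
# FB1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.2%}; recall: {recall:.2%}; FB1: {100 * f1:.2f}")
```

Note that the scores are computed over whole phrases: a predicted entity only counts as correct if both its boundaries and its type match the reference.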


### Possible hands-on
Modify the input file `data.d/test_short.txt` using the Jupyter text editor (e.g. add or remove OCR noise) and observe the effect on the tagger output.
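
Instead of editing by hand, noise could also be injected automatically. The sketch below assumes the input file is tab-separated with the token in the second column, as in the `head` output earlier; the confusion table and the names `add_ocr_noise` and `noisy_lines` are hypothetical illustrations, not part of the tutorial material.

```python
import random

# Hypothetical OCR confusion pairs; real OCR errors depend on font and scan quality
CONFUSIONS = {"l": "1", "e": "c", "u": "n", "1": "i", "S": "5"}

def add_ocr_noise(token, rate, rng):
    """Replace each confusable character with probability `rate`."""
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
        for ch in token
    )

def noisy_lines(lines, rate=0.2, seed=0):
    """Corrupt the token column (2nd tab-separated field); keep everything else."""
    rng = random.Random(seed)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:  # blank lines separate sentences and pass through
            fields[1] = add_ocr_noise(fields[1], rate, rng)
        yield "\t".join(fields)

sample = ["1\tIl\tPRON\tO", "2\tn'\tADV\tO", "", "1\tBerlin\tPROPN\tB-loc.adm.town"]
for line in noisy_lines(sample, rate=0.5):
    print(line)
```

Because the gold tags are left untouched, the noisy file can be tagged and evaluated with the same commands as before, which shows how the scores degrade as `rate` increases.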


## Testing the French Quaero model on historical Swiss newspaper texts
Let's test the model trained on French newspapers on some historical Swiss newspapers.

In [12]:
%cd ~/flair-quaero
french_tagger = SequenceTagger.load('resources.d/taggers/ner/pressfr-wikifr/raw-stringemb-crf/best-model.pt')

/home/jovyan/flair-quaero
2019-07-08 15:00:16,484 loading file resources.d/taggers/ner/pressfr-wikifr/raw-stringemb-crf/best-model.pt


In [13]:
! cat ~/datasets/impresso/raw/GDL-1848-07-11-a-i0001.txt

CONFÉDÉRATION SUISSE. DIÈTE FÉDÉRALE .. Séance du 6 juillet. Kt, (i, t de situation du personnel et du matériel de l'armée Le conseil fédéral de la guerre fait connaître, dans un rapport en date du 10 mai, ce qui manque soit 'dans le personnel, soit dans le matériel de l'année fédérale. Chaque Etat donne des explications sur les objets qui lui manquent. — Genève se trouve embarrassé de sa compagnie de cavalerie, car on n'aime pas monter à cheval dans le canton de Genève. Le député voudrait inviter l'assemblée à échanger celte compagnie de cavalerie contre une compagnie de carabiniers et deux pièces du contingent. Il est décidé que les Etats en relard seront invités à se procurer, dans le plus bref délai possible, les objets qui leur manquent pour compléter leurs conlingens respectifs. Ensuite de l'arrêté de la Diète, le conseil fédéral de la guerre a ordonné une nouvelle inspection du contingent d'' Appenzell-inlérieur, mission qui a été confiée à M. le colonel fédéral Egloff. Quant au

In [14]:
sentence = Sentence('CONFÉDÉRATION SUISSE. DIÈTE FÉDÉRALE .. Séance du 6 juillet.', use_tokenizer=True)

In [15]:
french_tagger.predict(sentence)

[Sentence: "CONFÉDÉRATION SUISSE . DIÈTE FÉDÉRALE . . Séance du 6 juillet ." - 12 Tokens]

In [16]:
print(sentence.to_dict(tag_type='ner'))

{'text': 'CONFÉDÉRATION SUISSE. DIÈTE FÉDÉRALE .. Séance du 6 juillet.', 'labels': [], 'entities': [{'text': '6 juillet', 'start_pos': 50, 'end_pos': 59, 'type': 'time.date.abs', 'confidence': 0.7775298357009888}]}


## Applying the French Quaero model to impresso data

In [24]:
import os
import json
import codecs
import spacy
from tqdm import tqdm

In [18]:
nlp_fr = spacy.load("fr_core_news_sm")

In [19]:
base_dir = os.path.expanduser('~/datasets/impresso/raw/')
flair_output_dir = os.path.expanduser('~/datasets/impresso/flair_output/')

In [21]:
flair_entities = []

# for now let's process only French documents
# from the impresso corpus
french_files = [
    file
    for file in os.listdir(base_dir)
    if "NZZ" not in file
]

for file in french_files[:1]:
    
    path = os.path.join(base_dir, file)
    print(path)
    
    with codecs.open(path, 'r', 'utf-8') as infile:
        text = infile.read()
    
    # we let spacy do some pre-processing,
    # like splitting the text into sentences
    spacy_doc = nlp_fr(text)
    stop_at = 10
    
    # iterate all sentences in the document
    sentences = list(spacy_doc.sents)
    for spacy_sentence in tqdm(sentences[:stop_at]):
        
        # if the sentence is very short (< 5 tokens)
        # we skip it as it may be a wrongly split one
        if len(spacy_sentence.text.split()) < 5:
            continue

        sentence = Sentence(spacy_sentence.text, use_tokenizer=True)
        french_tagger.predict(sentence)
        tagged_sentence = sentence.to_dict(tag_type='ner')
        
        # if the results contain any entity mentions,
        # we store them in a list
        if tagged_sentence['entities']:
            flair_entities.append(tagged_sentence)
            
    # let's write the output to a JSON file named after the input file
    out_file_path = os.path.join(
        flair_output_dir, file.replace('.txt', '.json')
    )

    with codecs.open(out_file_path, 'w', 'utf-8') as outfile:
        json.dump(flair_entities, outfile)
        print(f'Flair output written to {out_file_path}')

/home/jovyan/datasets/impresso/raw/GDL-1898-08-02-a-i0008.txt


 10%|▉         | 13/132 [00:45<07:01,  3.54s/it]
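
Once the JSON files are written, they can be summarized without loading the tagger again. A minimal sketch, assuming each file holds a list of `to_dict(tag_type='ner')` results as produced by the loop above; the helper `count_entity_types` is hypothetical, and the single record used here is the tagged sentence shown earlier in this notebook, standing in for the result of `json.load` on one of the output files.

```python
from collections import Counter

def count_entity_types(tagged_sentences):
    """Count entity types over a list of to_dict(tag_type='ner') results."""
    return Counter(
        entity['type']
        for tagged in tagged_sentences
        for entity in tagged['entities']
    )

# Stand-in for: records = json.load(open(out_file_path, encoding='utf-8'))
records = [
    {'text': 'CONFÉDÉRATION SUISSE. DIÈTE FÉDÉRALE .. Séance du 6 juillet.',
     'labels': [],
     'entities': [{'text': '6 juillet', 'start_pos': 50, 'end_pos': 59,
                   'type': 'time.date.abs', 'confidence': 0.7775298357009888}]},
]

print(count_entity_types(records).most_common())
```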

## Next steps
Work through more of the tutorial at https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_1_BASICS.md