In [1]:
from flair.data import Corpus
from flair.datasets import ColumnCorpus

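# map column indices in the files to annotation layers
# (column 1 is not mapped here, so it is not read as a label)
columns = {0: 'text', 2: 'pos', 3: 'ner'}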

# this is the folder in which train, test and dev files reside
data_folder = 'ftb/'

# init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='ftb6_train.conll',
                              test_file='ftb6_test.conll',
                              dev_file='ftb6_dev.conll')

2021-12-01 21:47:36,653 Reading data from ftb
2021-12-01 21:47:36,654 Train: ftb/ftb6_train.conll
2021-12-01 21:47:36,654 Dev: ftb/ftb6_dev.conll
2021-12-01 21:47:36,654 Test: ftb/ftb6_test.conll
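
To make the role of the `columns` mapping concrete, here is a minimal sketch of how a column-formatted file can be turned into per-token annotations. It is not Flair's implementation: `read_column_sentences` is a hypothetical helper, the sample lines are made up rather than taken from the FTB files, and whitespace-separated fields with blank lines between sentences are assumptions about the layout.

```python
def read_column_sentences(lines, columns):
    """Group token lines into sentences; each token becomes a dict of named columns."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence (assumed convention).
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # Keep only the columns named in the mapping; others are dropped.
        current.append({name: fields[idx] for idx, name in columns.items()})
    if current:
        sentences.append(current)
    return sentences


columns = {0: 'text', 2: 'pos', 3: 'ner'}

# Made-up sample, NOT taken from the FTB files; column 1 is a placeholder.
sample = [
    "Paris _ NPP B-Location",
    "est _ V O",
    "",
    "Oui _ I O",
]

sentences = read_column_sentences(sample, columns)
print(sentences[0][0])
```

Each token keeps only the fields listed in `columns`, which is why the unmapped column 1 never shows up in the result.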


In [2]:
len(corpus.train)

9881

In [3]:
print(corpus.train[0].to_tagged_string('ner'))

Certes , rien ne dit qu' une seconde motion de censure sur son projet de loi , reprenant l' accord du 10 avril , n' aurait pas été la bonne mais cette probabilité , reconnaissent les socialistes , n' était pas la plus plausible .


In [4]:
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# 1. the corpus was already loaded in cell [1]
# 2. what label do we want to predict?
label_type = 'ner'

# 3. make the label dictionary from the corpus
label_dict = corpus.make_label_dictionary(label_type=label_type)
print(label_dict)

# 4. initialize fine-tuneable transformer embeddings WITH document context
embeddings = TransformerWordEmbeddings(model='camembert-base',
                                       layers="-1",
                                       subtoken_pooling="mean",
                                       fine_tune=True,
                                       use_context=True,
                                       local_files_only=True,
                                       )

# 5. initialize bare-bones sequence tagger (no CRF, no RNN, no reprojection)
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=label_dict,
                        tag_type=label_type,
                        use_crf=False,
                        use_rnn=False,
                        reproject_embeddings=False,
                        )

# 6. initialize trainer
trainer = ModelTrainer(tagger, corpus)

# 7. run fine-tuning
trainer.fine_tune('resources/taggers/ner-transformer-ftb',
                  learning_rate=5.0e-6,
                  mini_batch_size=4,
                  mini_batch_chunk_size=1,  # remove this parameter to speed up computation if you have a big GPU
                  )

2021-12-01 21:47:49,432 Computing label dictionary. Progress:


100%|██████████| 9881/9881 [00:00<00:00, 19757.20it/s]

2021-12-01 21:47:50,168 Corpus contains the labels: pos (#278083), ner (#278083)
2021-12-01 21:47:50,168 Created (for label 'ner') Dictionary with 16 tags: <unk>, O, B-Organization, I-Organization, B-Person, I-Person, B-Location, B-Company, I-Company, B-FictionCharacter, B-Product, I-Location, B-POI, I-POI, I-Product, I-FictionCharacter

Dictionary with 16 tags: <unk>, O, B-Organization, I-Organization, B-Person, I-Person, B-Location, B-Company, I-Company, B-FictionCharacter, B-Product, I-Location, B-POI, I-POI, I-Product, I-FictionCharacter


HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /camembert-base/resolve/main/config.json (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x1531fae521f0>: Failed to establish a new connection: [Errno 101] Network is unreachable')))
HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /camembert-base/resolve/main/pytorch_model.bin (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x1531fab4fe50>: Failed to establish a new connection: [Errno 101] Network is unreachable')))


OSError: Can't load weights for 'camembert-base'. Make sure that:

- 'camembert-base' is a correct model identifier listed on 'https://huggingface.co/models'
  (make sure 'camembert-base' is not a path to a local directory with something else, in that case)

- or 'camembert-base' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt
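
The traceback suggests that this machine cannot reach huggingface.co, and the error text itself notes that the model argument may instead be a path to a local directory holding the weights. Below is a minimal sketch of a check to run before passing such a directory as `model=...` to `TransformerWordEmbeddings`. The directory `models/camembert-base` and the helper `looks_like_local_model_dir` are hypothetical; the file names come from the error message and the failing URLs above. Tokenizer files are also needed and are not covered by this check.

```python
from pathlib import Path

# File names taken from the error message and the failing URLs above.
CONFIG_FILE = "config.json"
WEIGHT_FILES = ("pytorch_model.bin", "tf_model.h5", "model.ckpt")


def looks_like_local_model_dir(path):
    """Return True if `path` is a directory with a config file and one weight file."""
    folder = Path(path)
    if not folder.is_dir():
        return False
    has_config = (folder / CONFIG_FILE).is_file()
    has_weights = any((folder / name).is_file() for name in WEIGHT_FILES)
    return has_config and has_weights


# Hypothetical location; adjust to wherever the model files were copied.
local_model = "models/camembert-base"
if looks_like_local_model_dir(local_model):
    print(f"{local_model} can be tried as model=... instead of 'camembert-base'")
else:
    print(f"{local_model} is missing {CONFIG_FILE} or a weight file")
```

If the check passes, replacing `model='camembert-base'` with the local path in cell [4] is the change to try; whether it loads still depends on the installed flair and transformers versions.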

