##  GGTWEAK - German Gene Tagging with Weak Supervision

Obtaining a sufficient amount of high-quality training data is one of the crucial points and most formidable challenges in deep learning-based natural language processing. In this work, we present GGTWEAK (German Gene Tagging with Weak Supervision). In conventional settings all data must be labelled manually while complexity often does not allow for the involvement of non-experts for this laborious and thus costly task. This is especially true for molecular data, which is hard to discriminate from common abbreviations syntactically. Therefore, GGTWEAK provides a baseline bridging the gap between English gene taggers and models usable for German. GGTWEAK saves human resources compared to its English counterparts and potentially can be trained for free after development on available data. 
We design labelling functions based on the structure of gene naming conventions and databases from both the medical and general domain. Following that, we train a hidden Markov model for label aggregration. Based on our weakly labelled data, we finally train a German BERT model for named entity recognition. This weak supervision approach for gene labelling in the German language leverages the skweak framework achieving an entity-level F1 score of 60.4% on our test set, while dealing with a highly unbalanced data from the German Guideline Program in Oncology NLP Corpus. The NER model trained on the same development dataset with quantitatively less strong labels achieved 53.9%.

In [1]:
import re
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm import tqdm

import nltk
from nltk.corpus import stopwords

import spacy
from spacy.tokens import Span, DocBin
from spacy_transformers import Transformer
from spacy_transformers.pipeline_component import DEFAULT_CONFIG

from skweak import heuristics, gazetteers, generative, utils, base
from skweak.base import SpanAnnotator
from skweak.heuristics import SpanEditorAnnotator, VicinityAnnotator, SpanConstraintAnnotator
from skweak.analysis import LFAnalysis

import sklearn.metrics

from evaluation import evaluate, get_results, compute_raw_numbers, _get_probs, show_errors


  from .autonotebook import tqdm as notebook_tqdm
2022-12-12 11:54:33.131727: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2022-12-12 11:54:33.131781: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...


In [2]:
stops = set(stopwords.words('german'))
random_seed = 42

# Load Data


Here we customize the spaCy tokenizer without statistical model and loading the standard German spaCy model.

In [3]:
nlp = spacy.load('de_core_news_md')
infixes = nlp.Defaults.infixes + [r'([-])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x7f1b80642640>

In the following, we construct a dataframe from all sentences in GGPONC and concatenate them. 

In [4]:
GGPONC_PATH = 'data/ggponc/plain_text/sentences/all_files_sentences/'

In [5]:
sentences = list(Path(GGPONC_PATH).glob('*.txt'))
len(sentences)

10193

In [6]:
def get_df(file):
    df = pd.read_csv(file, delimiter='\t', names =['text'] )
    df['file'] = file.stem
    df['sentence_id'] = df.index
    return df

In [7]:
dfs = [get_df(file) for file in sentences]

In [8]:
sentence_df = pd.concat(dfs)
len(sentence_df)

85996

# Labeling Functions

The Clinical Interpretation of Variants in Cancer (CIViC) database is an opensource, open-access knowledgebase curated by experts on therapeutic, prognostic, diagnostic and predisposing relevance of inherited and somatic variants of every type. Both, genes and variants, are used in this project.

In [9]:
df = pd.read_csv('data/molecular/nightly-GeneSummaries.tsv', sep='\t')
CIVIC_genes = df['name'].tolist()
CIVIC_genes_lower = [c.lower() for c in CIVIC_genes]

In [10]:
df = pd.read_csv('data/molecular/nightly-VariantSummaries.tsv', sep='\t', error_bad_lines=False )
CIVIC_variants = df['variant'].tolist()
CIVIC_variants_lower = [c.lower() for c in CIVIC_variants]

b'Skipping line 13: expected 29 fields, saw 33\nSkipping line 17: expected 29 fields, saw 30\nSkipping line 31: expected 29 fields, saw 30\nSkipping line 441: expected 29 fields, saw 30\nSkipping line 502: expected 29 fields, saw 30\nSkipping line 553: expected 29 fields, saw 31\n'


"cue_civic" is based on the CIViC database. If a token contains a gene which is listed in the database, this and the next token are labelled a gene. This way, we do not restrict the function to a 100 percent match but leave it some leeway.

In [11]:
def civic(doc):
    for tok in doc:
        for cue in CIVIC_genes:
            if tok.text.find(cue) == -1:
                continue
            else:
                yield tok.i, tok.i+1, "Gene or Protein"
cue_civic = heuristics.FunctionAnnotator("cue_civic", civic)

The Online Mendelian Inheritance in Man (OMIM) database is the encyclopedic collection of the human medical branch of genetics.

In [12]:
omim_list = pd.read_csv("data/molecular/mim2gene.csv")
omim_list = omim_list['name'].tolist()
omim_list_lower = [o.lower() for o in omim_list]
short_genes = []
for u in omim_list:
    if len(u)<3:
        short_genes.append(u)
less_short_genes = []
for u in omim_list:
    if len(u)<5 and len(u)>2:
        less_short_genes.append(u)
print(len(omim_list))

16767


"omim" is based on the OMIM database and checks whether tokens are present in its list of 16,767 approved gene symbols in lowercase as the diversity of genes often shows in volatile capitalization. To increase precision, genes with a length shorter than three characters are matched only correctly cased.

In [13]:
def omim(doc):
    for tok in doc:
        if tok.text.lower() in omim_list_lower and tok.text.lower() not in stops and len(tok.text.lower())>=3:
            yield tok.i, tok.i+1, "Gene or Protein"
omim = heuristics.FunctionAnnotator("omim", omim) 

The Catalogue of Somatic Mutations in Cancer (COSMIC) database harbors somatic cell mutations and additional information associated with cancer in humans.

In [14]:
cosmic_census = pd.read_csv("data/molecular/cancer_gene_census.csv")
cosmic_census = cosmic_census['Gene Symbol'].tolist()
cosmic_census_lower = [c.lower() for c in cosmic_census]

"cue_cosmic_census" is based on the COSMIC database. If a token contains a gene symbol which is listed here, this token and its successor are annotated as a gene.

In [15]:
def cosmic(doc):
    for tok in doc:
        for cue in cosmic_census:
            if tok.text.find(cue) == -1:
                continue
            else:
                yield tok.i, tok.i+1, "Gene or Protein"
cue_cosmic_census = heuristics.FunctionAnnotator("cue_cosmic_census", cosmic)  

"construct" is based on the Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC) naming conventions for genes and leverages regular expressions to let the annotator abide by them. Those expressions comprise various combinations of letters and numbers and certain fixed terms for shorter terms to avoid underfitting. In addition, the CIViC database for variants has also been included for a better recall.

In [16]:
def structure(doc):
    for tok in doc:
        if re.search(r"[a-zA-Z]{4}\d{2}", tok.text) or re.search(r"[a-zA-Z]{5}\d{1}", tok.text)\
        or re.search(r"[a-zA-Z]{4}\d{1}", tok.text) or re.search(r"[A-Z]{5}\d{1}", tok.text)\
        or re.search(r"[A-Z]{5}\d{2}", tok.text) or re.search(r"[A-Z]{3}\d{2}", tok.text)\
        or re.search(r"[a-zA-Z]{2}\d{3}[a-zA-Z]{2}", tok.text) or re.search(r"[a-zA-Z]{1}\d{3}[a-zA-Z]{1}", tok.text)\
        or re.search(r"[A-Z]{3}\d{2}", tok.text) or re.search(r"[A-Z]{6}\d{1}", tok.text)\
        or re.search(r"[A-Z]{3}\d{3}", tok.text) or re.search(r"[p]\d{2}", tok.text)\
        or re.search(r"CYP[a-zA-Z0-9]{3}", tok.text) or re.search(r"CYP[a-zA-Z0-9]{2}", tok.text)\
        or re.search(r"[A-Z]{3}\d{1}", tok.text) or re.search(r"[A-Z]{2}\d{2}", tok.text)\
        or re.search(r"^CK.", tok.text) or re.search(r"^PD-..", tok.text) or re.search(r"^PS[MA|A]", tok.text) or tok.text.lower in CIVIC_variants_lower:
            yield tok.i, tok.i+1, "Gene or Protein"
construct = heuristics.FunctionAnnotator("construct", structure)

In [17]:
# Calculate data stats (number of genes, sentences, etc...) -> Table for Materials
# Intersections of the gazetteers are calculated below.
intersection_db = list(set(cosmic_census_lower) & set(CIVIC_genes_lower) & set(omim_list_lower))
union_db = set(set(cosmic_census_lower).union(set(CIVIC_genes_lower)).union(set(omim_list_lower)))
print(len(intersection_db))
print(len(union_db))

249
16784


We define a list of labelling functions which we want to deploy onto our textual input. Every row in our dataframe is subject to examination. Those listed here was the best performing allocation from ca. 30 labelling functions.

In [32]:
lfs = [construct, cue_civic, omim, cue_cosmic_census]

#For Quick Run with Random Sentences!
#random_files = files_df.sample(n = 10000)
all_docs = []

for sentence_idx, doc in zip(tqdm(list(sentence_df.reset_index().iterrows())), nlp.pipe(sentence_df.text, batch_size=32, disable=["ner"])):
    i, row = sentence_idx
    for lf in lfs:
        doc = lf(doc)
    all_docs.append(doc)

100%|██████████| 85996/85996 [17:29<00:00, 81.92it/s] 


Remove files which have been manually annotated from the training dataset

## Training Set Evaluation

In [33]:
from skweak.analysis import LFAnalysis

lfa = LFAnalysis(all_docs, ['Gene or Protein'])
cov = lfa.lf_coverages().rename(index={'Gene or Protein' : 'Coverage'})
overlap = lfa.lf_overlaps().rename(index={'Gene or Protein' : 'Overlaps'})
pd.concat([cov, overlap])

Unnamed: 0,construct,cue_civic,omim,cue_cosmic_census
Coverage,0.492558,0.336109,0.501949,0.345145
Overlaps,0.274101,0.957828,0.534769,0.941478


In [43]:
# TODO: Turn into new LF
for t in nlp("Wir behandeln die Mutation des mit-Gens mit Chemotherapie."):
    print(t, t.pos_)

Wir PRON
behandeln VERB
die DET
Mutation NOUN
des DET
mit ADP
- PUNCT
Gens NOUN
mit ADP
Chemotherapie NOUN
. PUNCT


In [44]:
with open('data/molecular/annotated_sentences.txt') as fh:
    annotated_sentences = [l.strip() for l in fh.readlines()]
len(annotated_sentences)

343

In [45]:
docs, s_index = zip(*[(d, si) for d, si in zip(all_docs, sentence_df.reset_index().index) if not d.text in annotated_sentences])
docs = list(docs)
filtered_sentence_df = sentence_df.reset_index().loc[list(s_index)]
len(docs), len(filtered_sentence_df)

(85996, 85996)

# Label Aggregation

In [46]:
# Training of HMM and Majority Voter
#voter = skweak.voting.SequentialMajorityVoter("maj_voter", labels=["Gen"])
#voter.fit(docs)
hmm = generative.HMM("hmm", ["Gene or Protein"])
hmm.fit(docs)

Starting iteration 1
Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Finished E-step with 4316 documents
Starting iteration 2


         1      -27084.5979             +nan


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Finished E-step with 4316 documents
Starting iteration 3


         2      -26489.4559        +595.1420


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Finished E-step with 4316 documents
Starting iteration 4


         3      -26481.5972          +7.8588


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Finished E-step with 4316 documents


         4      -26480.7070          +0.8902


In [47]:
for d in docs:
    d = hmm(d)
    d.ents = d.spans["hmm"]

Consider subset of files where at least on LF has matched

In [48]:
lf_match = []
for d in docs:
    lf_match.append(any([len(v) > 0 for s, v in d.spans.items() if s != 'hmm']))

filtered_sentence_df['lf_match'] = lf_match
by_file = filtered_sentence_df.groupby('file')['lf_match'].max()
match_files = by_file[by_file.values].index

filtered_docs = [d for d, match in zip(docs, filtered_sentence_df.file.isin(match_files)) if match]
gene_docs = [d for d, match in zip(docs, lf_match) if match]

In [49]:
len(docs), len(filtered_docs), len(gene_docs)

(85996, 33733, 4316)

In [50]:
utils.docbin_writer(docs, f"output/weak_training_lg.spacy")
utils.docbin_writer(filtered_docs, f"output/weak_training_md.spacy")

Write to ../weaksupervision/data/weak_training_lg.spacy...done
Write to ../weaksupervision/data/weak_training_md.spacy...done


# Labeling Function Analysis

In [51]:
gold_docs_dev = list(DocBin().from_disk('data/molecular/gold_dev.spacy').get_docs(nlp.vocab))

Our labeling functions must also be deployed onto the gold standard data to evaluate strong supervision against weak supervision.

In [53]:
def apply_hmm(gold_docs):
    for g in tqdm(gold_docs):
        if 'Gene or Protein' in g.spans:
            del g.spans['Gene or Protein']
        for lf in lfs:
            g = lf(g)
        g = hmm(g)

In [54]:
apply_hmm(gold_docs_dev)

100%|██████████| 1000/1000 [00:13<00:00, 76.25it/s]


In [56]:
# HMM / LFs vs. Gold-Standard
#evaluate(gold_docs, ['Gen'], ['lf15', 'hmm'])
evaluate(gold_docs_dev, ['Gene or Protein'], [l.name for l in lfs] + ['hmm'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
label,proportion,model,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Gene or Protein,100.0 %,construct,0.879,0.244,0.382,,,,0.833,0.305,0.446
Gene or Protein,100.0 %,cue_civic,0.975,0.372,0.538,,,,0.941,0.474,0.63
Gene or Protein,100.0 %,cue_cosmic_census,0.96,0.342,0.504,,,,0.928,0.436,0.594
Gene or Protein,100.0 %,hmm,0.919,0.487,0.636,,,,0.873,0.611,0.718
Gene or Protein,100.0 %,omim,0.955,0.411,0.574,,,,0.926,0.524,0.67
macro,,construct,0.879,0.244,0.382,,,,0.833,0.305,0.446
macro,,cue_civic,0.975,0.372,0.538,,,,0.941,0.474,0.63
macro,,cue_cosmic_census,0.96,0.342,0.504,,,,0.928,0.436,0.594
macro,,hmm,0.919,0.487,0.636,,,,0.873,0.611,0.718
macro,,omim,0.955,0.411,0.574,,,,0.926,0.524,0.67


# Training of Transformer-based NER Models

Train test splits for strong supervision

In [57]:
from sklearn.model_selection import train_test_split
gold_docs_strong = list(DocBin().from_disk('data/molecular/gold_dev.spacy').get_docs(nlp.vocab))

docs_strong_train, docs_strong_dev = train_test_split(gold_docs_strong, test_size=0.2, random_state=random_seed)
utils.docbin_writer(docs_strong_train, 'output/strong_train.spacy')
utils.docbin_writer(docs_strong_dev, 'output/strong_dev.spacy')

Write to ../weaksupervision/data/strong_train.spacy...done
Write to ../weaksupervision/data/strong_dev.spacy...done


In [None]:
# Train NER model on weak labels with spaCy
!spacy train config.cfg --paths.train output/weak_training_lg.spacy  --paths.dev data/molecular/gold_dev.spacy --output output/weak_ner_lg --gpu-id 0 --code training.py

2022-12-11 11:14:11.356343: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2022-12-11 11:14:11.356389: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[38;5;4mℹ Saving to output directory: ../weaksupervision/data/weak_ner_lg[0m
[38;5;4mℹ Using GPU: 0[0m
[1m
[2022-12-11 11:14:16,695] [INFO] Set up nlp object from config
[2022-12-11 11:14:16,703] [INFO] Pipeline: ['transformer', 'ner']
[2022-12-11 11:14:16,706] [INFO] Created vocabulary
[2022-12-11 11:14:16,707] [INFO] Finished initializing nlp object
Some weights of the model

In [34]:
# Train NER model on smaller set of weak labels with spaCy
#!spacy train config.cfg --paths.train output/weak_training_md.spacy  --paths.dev data/molecular/gold_dev.spacy --output output/weak_ner_md --gpu-id 0 --code training.py

In [35]:
# Baseline: train NER model on strong labels with spaCy
!spacy train config.cfg --paths.train output/strong_train.spacy --paths.dev output/strong_dev.spacy --output output/strong_ner --gpu-id 0 --code training.py


2022-12-11 00:06:47.034362: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2022-12-11 00:06:47.034410: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[38;5;4mℹ Saving to output directory: ../weaksupervision/data/strong_ner[0m
[38;5;4mℹ Using GPU: 0[0m
[1m
[2022-12-11 00:06:52,974] [INFO] Set up nlp object from config
[2022-12-11 00:06:52,983] [INFO] Pipeline: ['transformer', 'ner']
[2022-12-11 00:06:52,986] [INFO] Created vocabulary
[2022-12-11 00:06:52,987] [INFO] Finished initializing nlp object
Some weights of the model 

# Evaluation

In [58]:
ner_model_weak = spacy.load('output/weak_ner_lg/model-best/')
ner_model_strong = spacy.load('output/strong_ner/model-best/')

In [59]:
from IPython.display import display, Markdown

def print_metrics(gold_docs, is_test : bool):
    display(Markdown('__Labeling Function / HMM performance__'))
    gold_ents = [d.ents for d in gold_docs]
    
    apply_hmm(gold_docs)
    
    display(evaluate(gold_docs, ['Gene or Protein'], [l.name for l in lfs] + ['hmm']).loc['Gene or Protein'])
        
    display(Markdown('__Weak NER Performance__'))
    for d, gold_ent in zip(gold_docs, tqdm(gold_ents)):
        d.set_ents([])
        d = ner_model_weak(d)
        d.spans['ner_model'] = d.ents
        d.set_ents(gold_ent)
    display(evaluate(gold_docs, ['Gene or Protein'], ['ner_model']).loc['Gene or Protein'])
    
    if is_test:
        display(Markdown('__Strong NER Performance__'))
        for d, gold_ent in zip(gold_docs, tqdm(gold_ents)):
            d.set_ents([])
            d = ner_model_strong(d)
            d.spans['ner_model'] = d.ents
            d.set_ents(gold_ent)
        display(evaluate(gold_docs, ['Gene or Protein'], ['ner_model']).loc['Gene or Protein'])

## Dev Set Evaluation

In [60]:
gold_docs_dev_eval = list(DocBin().from_disk('data/molecular/gold_dev.spacy').get_docs(nlp.vocab))

print_metrics(gold_docs_dev_eval, is_test=False)

__Labeling Function / HMM performance__

100%|██████████| 1000/1000 [00:12<00:00, 77.05it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,construct,0.879,0.244,0.382,,,,0.833,0.305,0.446
100.0 %,cue_civic,0.975,0.372,0.538,,,,0.941,0.474,0.63
100.0 %,cue_cosmic_census,0.96,0.342,0.504,,,,0.928,0.436,0.594
100.0 %,hmm,0.919,0.487,0.636,,,,0.873,0.611,0.718
100.0 %,omim,0.955,0.411,0.574,,,,0.926,0.524,0.67


__Weak NER Performance__

100%|█████████▉| 999/1000 [00:51<00:00, 19.33it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,ner_model,0.921,0.5,0.648,,,,0.874,0.625,0.728


## Test Set Evaluation

In [61]:
gold_docs_test_eval = list(DocBin().from_disk('data/molecular/gold_test.spacy').get_docs(nlp.vocab))

print_metrics(gold_docs_test_eval, is_test=True)

__Labeling Function / HMM performance__

100%|██████████| 1000/1000 [00:11<00:00, 83.62it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,construct,0.853,0.19,0.31,,,,0.836,0.28,0.42
100.0 %,cue_civic,0.934,0.354,0.514,,,,0.843,0.478,0.61
100.0 %,cue_cosmic_census,0.927,0.342,0.5,,,,0.854,0.473,0.608
100.0 %,hmm,0.856,0.4,0.546,,,,0.782,0.548,0.644
100.0 %,omim,0.904,0.363,0.518,,,,0.818,0.493,0.616


__Weak NER Performance__

100%|█████████▉| 999/1000 [00:48<00:00, 20.75it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,ner_model,0.857,0.427,0.57,,,,0.768,0.573,0.656


__Strong NER Performance__

100%|█████████▉| 999/1000 [00:47<00:00, 20.99it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,ner_model,0.537,0.66,0.592,,,,0.546,0.738,0.628


## Datasets Analysis

In [64]:
def get_genes(doc_list):
   return [e.text for d in doc_list for e in d.ents if e.label_ == 'Gene or Protein']

In [65]:
len(get_genes(gold_docs_dev))

475

In [66]:
len(get_genes(gold_docs_test_eval))

347

In [67]:
len(get_genes(gold_docs_dev_eval))

475

In [68]:
len(get_genes(docs))

5305