# Development of Labelling Functions and Model Training

In this notebook, we present the workflow of training a weakly supervised gene tagger for German medical text using the skweak framework. In particular, we:

1. implement different labelling functions for detecting gene names
2. aggregate their predictions using an Hidden Markov Model (Label Model)
3. train a spaCy NER model on this data
4. evaluate the results on a small set of gold-standard labels

In [1]:
import re
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm import tqdm

import nltk
from nltk.corpus import stopwords

import spacy
from spacy.tokens import Span, DocBin
from spacy_transformers import Transformer
from spacy_transformers.pipeline_component import DEFAULT_CONFIG

from skweak import heuristics, gazetteers, generative, utils, base
from skweak.base import SpanAnnotator
from skweak.heuristics import SpanEditorAnnotator, VicinityAnnotator, SpanConstraintAnnotator
from skweak.analysis import LFAnalysis

import sklearn.metrics

from evaluation import evaluate, get_results, compute_raw_numbers, _get_probs, show_errors


In [2]:
stops = set(stopwords.words('german'))
random_seed = 42

# Load Data


Here we customize the spaCy tokenizer without statistical model and loading the standard German spaCy model.

In [3]:
nlp = spacy.load('de_core_news_md')
infixes = nlp.Defaults.infixes + [r'([-])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x7f8f3693c900>

In the following, we construct a dataframe from all sentences in GGPONC and concatenate them. 

In [4]:
GGPONC_PATH = 'data/ggponc/plain_text/sentences/all_files_sentences/'

In [5]:
sentences = list(Path(GGPONC_PATH).glob('*.txt'))
len(sentences)

10193

In [6]:
def get_df(file):
    df = pd.read_csv(file, delimiter='\t', names =['text'] )
    df['file'] = file.stem
    df['sentence_id'] = df.index
    return df

In [7]:
dfs = [get_df(file) for file in sentences]

In [8]:
sentence_df = pd.concat(dfs)
len(sentence_df)

85996

# Labeling Functions

## Gazetteer-based LFs

The Clinical Interpretation of Variants in Cancer (CIViC) database is an opensource, open-access knowledgebase curated by experts on therapeutic, prognostic, diagnostic and predisposing relevance of inherited and somatic variants of every type. Both, genes and variants, are used in this project.

In [9]:
civic_genes_df = pd.read_csv('data/molecular/nightly-GeneSummaries.tsv', sep='\t')
CIVIC_genes = civic_genes_df['name'].tolist()
CIVIC_genes_lower = [c.lower() for c in CIVIC_genes]

In [10]:
civic_variant_df = pd.read_csv('data/molecular/nightly-VariantSummaries.tsv', sep='\t', error_bad_lines=False )
CIVIC_variants = civic_variant_df['variant'].tolist()
CIVIC_variants_lower = [c.lower() for c in CIVIC_variants]

Skipping line 13: expected 29 fields, saw 33
Skipping line 17: expected 29 fields, saw 30
Skipping line 31: expected 29 fields, saw 30
Skipping line 441: expected 29 fields, saw 30
Skipping line 502: expected 29 fields, saw 30
Skipping line 553: expected 29 fields, saw 31



"cue_civic" is based on the CIViC database. If a token contains a gene which is listed in the database, this and the next token are labelled a gene. This way, we do not restrict the function to a 100 percent match but leave it some leeway.

In [11]:
def civic_fn(doc):
    for tok in doc:
        for cue in CIVIC_genes:
            if tok.text.find(cue) == -1:
                continue
            else:
                yield tok.i, tok.i+1, "Gene or Protein"
lf_civic = heuristics.FunctionAnnotator("CIViC", civic_fn)

Get all synonyms in Entrez for CIViC genes, remove short ones and German stopwords

In [12]:
entrez_df = pd.read_csv('data/Homo_sapiens.gene_info', sep='\t')

In [13]:
symbols = set()
for _, r in entrez_df.set_index('GeneID').loc[civic_variant_df.entrez_id].iterrows():
    symbols.add(r.Symbol)
    for s in r.Synonyms.split('|'):
        if not s in ['R1', 'R2', 'eN', 'HNPCC'] and len(s) > 1 and not s.lower() in stops:
            symbols.add(s.lower())

In [14]:
from spacy.matcher import Matcher

entrez_matcher = Matcher(nlp.vocab)
pattern = []
for s in nlp.pipe(tqdm(symbols), disable=["ner", "tok2vec"]):
    for pos in ['NOUN', 'PROPN', 'X']: # Consider only if first POS is one of these
        p = [{'LOWER' : spl.text.lower() } for spl in s]
        p[0]['POS'] = pos
        pattern.append(p)
        p2 = p + [{'LOWER' : '-'}, {'LOWER' : 'gen'}] #also consider if followed by -Gen
        pattern.append(p2)
entrez_matcher.add("entrez", pattern)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2028/2028 [00:00<00:00, 12862.40it/s]


In [15]:
def entrez_fn(doc):
    matches = entrez_matcher(doc)
    if matches:
        # Keep longest matches only
        spans = [doc[start:end] for _, start, end in matches]
        spans = spacy.util.filter_spans(spans)
        for s in spans:
            yield s.start, s.end, 'Gene or Protein'
lf_entrez = heuristics.FunctionAnnotator("Entrez", entrez_fn)  

The Online Mendelian Inheritance in Man (OMIM) database is the encyclopedic collection of the human medical branch of genetics. "omim" is based on the OMIM database and checks whether tokens are present in its list of 16,767 approved gene symbols in lowercase as the diversity of genes often shows in volatile capitalization. To increase precision, genes with a length shorter than three characters are matched only correctly cased.

In [16]:
omim_list = pd.read_csv("data/molecular/mim2gene.csv")
omim_list = omim_list['name'].tolist()
omim_list_lower = [o.lower() for o in omim_list]
short_genes = []
for u in omim_list:
    if len(u)<3:
        short_genes.append(u)
less_short_genes = []
for u in omim_list:
    if len(u)<5 and len(u)>2:
        less_short_genes.append(u)
print(len(omim_list))

16767


In [17]:
def omim_fn(doc):
    for tok in doc:
        if tok.text.lower() in omim_list_lower and tok.text.lower() not in stops and len(tok.text.lower())>=3:
            yield tok.i, tok.i+1, "Gene or Protein"
lf_omim = heuristics.FunctionAnnotator("OMIM", omim_fn) 

The Catalogue of Somatic Mutations in Cancer (COSMIC) database harbors somatic cell mutations and additional information associated with cancer in humans.

In [18]:
cosmic_census = pd.read_csv("data/molecular/cancer_gene_census.csv")
cosmic_census = cosmic_census['Gene Symbol'].tolist()
cosmic_census_lower = [c.lower() for c in cosmic_census]

In [19]:
def cosmic_fn(doc):
    for tok in doc:
        for cue in cosmic_census:
            if tok.text.find(cue) == -1:
                continue
            else:
                yield tok.i, tok.i+1, "Gene or Protein"
lf_cosmic = heuristics.FunctionAnnotator("COSMIC", cosmic_fn)  

Gazetteer based on common Protein names, sourced from Wikipedia and refined using the training part of the dataset

In [20]:
from skweak.gazetteers import Trie, GazetteerAnnotator

terms = [t.strip() for t in open('proteins.txt', 'r').readlines()]

trie = Trie()
for term in terms:
    trie.add([t.text for t in nlp(term)])

lf_protein_gazetteer = GazetteerAnnotator('Proteins', tries = {'Gene or Protein' : trie })

doc = nlp("PD-L1")
lf_protein_gazetteer(doc)
doc.spans

{'Proteins': [PD-L1]}

## Rule-based LFs

"hgnc" is based on the Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC) naming conventions for genes and leverages regular expressions to let the annotator abide by them. Those expressions comprise various combinations of letters and numbers and certain fixed terms for shorter terms to avoid underfitting. In addition, the CIViC database for variants has also been included for a better recall.

In [21]:
def hgnc_fn(doc):
    for tok in doc:
        if re.search(r"[a-zA-Z]{4}\d{2}", tok.text) or re.search(r"[a-zA-Z]{5}\d{1}", tok.text)\
        or re.search(r"[a-zA-Z]{4}\d{1}", tok.text) or re.search(r"[A-Z]{5}\d{1}", tok.text)\
        or re.search(r"[A-Z]{5}\d{2}", tok.text) or re.search(r"[A-Z]{3}\d{2}", tok.text)\
        or re.search(r"[a-zA-Z]{2}\d{3}[a-zA-Z]{2}", tok.text) or re.search(r"[a-zA-Z]{1}\d{3}[a-zA-Z]{1}", tok.text)\
        or re.search(r"[A-Z]{3}\d{2}", tok.text) or re.search(r"[A-Z]{6}\d{1}", tok.text)\
        or re.search(r"[A-Z]{3}\d{3}", tok.text) or re.search(r"[p]\d{2}", tok.text)\
        or re.search(r"CYP[a-zA-Z0-9]{3}", tok.text) or re.search(r"CYP[a-zA-Z0-9]{2}", tok.text)\
        or re.search(r"[A-Z]{3}\d{1}", tok.text) or re.search(r"[A-Z]{2}\d{2}", tok.text)\
        or re.search(r"^CK.", tok.text) or re.search(r"^PD-..", tok.text) or re.search(r"^PS[MA|A]", tok.text) or tok.text.lower in CIVIC_variants_lower:
            yield tok.i, tok.i+1, "Gene or Protein"
lf_hgnc = heuristics.FunctionAnnotator("HGNC", hgnc_fn)

Rule-based matcher based on protein families

In [22]:
protein_matcher = Matcher(nlp.vocab)
patterns = []

for suffix in ['[A-Z]*[Kk]inase[n]?$', '[A-Z]+[rR]ezeptor(en|s)?$', '^(RAS|ras)$']:
    p = [{'TEXT' : { 'REGEX' : suffix}}]
    patterns.append(p)
    for _ in range(0, 3): # Consider also combinations like Rezepter-Tyrosinkinasen
        p = [{'IS_ALPHA' : True}, {'lower' : '-'}] + p
        patterns.append(p)
protein_matcher.add('protein', patterns[-1::-1])

def protein_families_fn(doc):
    matches = protein_matcher(doc)
    if matches:
        # Keep longest matches only
        spans = [doc[start:end] for _, start, end in matches]
        spans = spacy.util.filter_spans(spans)
        for s in spans:
            yield s.start, s.end, 'Gene or Protein'

lf_protein_families = heuristics.FunctionAnnotator("Protein Families", protein_families_fn) 

In [23]:
list(protein_families_fn(nlp("RAS k-RAS krass")))

[(0, 1, 'Gene or Protein'), (1, 4, 'Gene or Protein')]

In [24]:
# Calculate data stats (number of genes, sentences, etc...) -> Table for Materials
# Intersections of the gazetteers are calculated below.
intersection_db = list(set(cosmic_census_lower) & set(CIVIC_genes_lower) & set(omim_list_lower))
union_db = set(set(cosmic_census_lower).union(set(CIVIC_genes_lower)).union(set(omim_list_lower)))
print(len(intersection_db))
print(len(union_db))

218
16782


We define a list of labelling functions which we want to deploy onto our textual input. Every row in our dataframe is subject to examination. Those listed here was the best performing allocation from ca. 30 labelling functions.

In [25]:
lfs = [lf_civic, lf_entrez, lf_omim, lf_cosmic, lf_protein_gazetteer, lf_hgnc, lf_protein_families]

#For Quick Run with Random Sentences!
#random_files = files_df.sample(n = 10000)
all_docs = []

for sentence_idx, doc in zip(tqdm(list(sentence_df.reset_index().iterrows())), nlp.pipe(sentence_df.text, disable=["ner"])):
    i, row = sentence_idx
    for lf in lfs:
        doc = lf(doc)
    all_docs.append(doc)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 85996/85996 [20:11<00:00, 70.96it/s]


Remove files which have been manually annotated from the training dataset

In [26]:
with open('data/molecular/annotated_sentences.txt') as fh:
    annotated_sentences = [l.strip() for l in fh.readlines()]
len(annotated_sentences)

2000

In [27]:
docs, s_index = zip(*[(d, si) for d, si in zip(all_docs, sentence_df.reset_index().index) if not d.text in annotated_sentences])
docs = list(docs)
filtered_sentence_df = sentence_df.reset_index().loc[list(s_index)]
len(docs), len(filtered_sentence_df)

(83624, 83624)

## Training Set Evaluation

In [30]:
from skweak.analysis import LFAnalysis

lfa = LFAnalysis(docs, ['Gene or Protein'])
cov = lfa.lf_coverages().rename(index={'Gene or Protein' : 'Coverage'})
overlap = lfa.lf_overlaps().rename(index={'Gene or Protein' : 'Overlaps'})
pd.concat([cov, overlap])[[lf.name for lf in lfs]]

Unnamed: 0,CIViC,Entrez,OMIM,COSMIC,Proteins,HGNC,Protein Families
Coverage,0.209893,0.405695,0.344048,0.223103,0.136651,0.365478,0.017907
Overlaps,0.98042,0.685962,0.499573,0.929605,0.699248,0.381526,0.368852


# Label Aggregation

In [31]:
# Training of HMM and Majority Voter
#voter = skweak.voting.SequentialMajorityVoter("maj_voter", labels=["Gen"])
#voter.fit(docs)
hmm = generative.HMM("hmm", ["Gene or Protein"])
hmm.fit(docs)

Starting iteration 1
Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Finished E-step with 4624 documents
Starting iteration 2


         1      -33073.4892             +nan


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Finished E-step with 4624 documents
Starting iteration 3


         2      -32400.1132        +673.3760


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Finished E-step with 4624 documents
Starting iteration 4


         3      -32386.9774         +13.1358


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Number of processed documents: 4000
Finished E-step with 4624 documents


         4      -32383.9291          +3.0482


In [32]:
for d in docs:
    d = hmm(d)
    d.ents = d.spans["hmm"]

Consider subset of files where at least on LF has matched

In [33]:
lf_match = []
for d in docs:
    lf_match.append(any([len(v) > 0 for s, v in d.spans.items() if s != 'hmm']))

filtered_sentence_df['lf_match'] = lf_match
by_file = filtered_sentence_df.groupby('file')['lf_match'].max()
match_files = by_file[by_file.values].index

filtered_docs = [d for d, match in zip(docs, filtered_sentence_df.file.isin(match_files)) if match]
gene_docs = [d for d, match in zip(docs, lf_match) if match]

In [34]:
len(docs), len(filtered_docs), len(gene_docs)

(83624, 35501, 4624)

# Labeling Function Analysis

In [35]:
gold_docs_dev = list(DocBin().from_disk('data/molecular/gold_dev.spacy').get_docs(nlp.vocab))

Our labeling functions must also be deployed onto the gold standard data to evaluate strong supervision against weak supervision.

In [36]:
def apply_hmm(gold_docs):
    for g in tqdm(gold_docs):
        if 'Gene or Protein' in g.spans:
            del g.spans['Gene or Protein']
        for lf in lfs:
            g = lf(g)
        g = hmm(g)

In [37]:
apply_hmm(gold_docs_dev)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:15<00:00, 64.38it/s]


In [41]:
# HMM / LFs vs. Gold-Standard
#evaluate(gold_docs, ['Gen'], ['lf15', 'hmm'])
evaluate(gold_docs_dev, ['Gene or Protein'], [l.name for l in lfs] + ['hmm'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
label,proportion,model,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Gene or Protein,100.0 %,CIViC,0.979,0.366,0.532,,,,0.944,0.465,0.624
Gene or Protein,100.0 %,COSMIC,0.96,0.342,0.504,,,,0.928,0.436,0.594
Gene or Protein,100.0 %,Entrez,0.937,0.45,0.608,,,,0.902,0.503,0.646
Gene or Protein,100.0 %,HGNC,0.879,0.244,0.382,,,,0.833,0.305,0.446
Gene or Protein,100.0 %,OMIM,0.955,0.411,0.574,,,,0.926,0.524,0.67
Gene or Protein,100.0 %,Protein Families,1.0,0.067,0.126,,,,1.0,0.076,0.142
Gene or Protein,100.0 %,Proteins,1.0,0.117,0.21,,,,0.934,0.12,0.212
Gene or Protein,100.0 %,hmm,0.899,0.596,0.716,,,,0.841,0.68,0.752
macro,,CIViC,0.979,0.366,0.532,,,,0.944,0.465,0.624
macro,,COSMIC,0.96,0.342,0.504,,,,0.928,0.436,0.594


# Training of Transformer-based NER Models

In [42]:
utils.docbin_writer(docs, f"output/weak_training_lg.spacy")

Write to output/weak_training_lg.spacy...done


Train test splits for strong supervision

In [43]:
from sklearn.model_selection import train_test_split
gold_docs_strong = list(DocBin().from_disk('data/molecular/gold_dev.spacy').get_docs(nlp.vocab))

docs_strong_train, docs_strong_dev = train_test_split(gold_docs_strong, test_size=0.2, random_state=random_seed)
utils.docbin_writer(docs_strong_train, 'output/strong_train.spacy')
utils.docbin_writer(docs_strong_dev, 'output/strong_dev.spacy')

Write to output/strong_train.spacy...done
Write to output/strong_dev.spacy...done


In [None]:
# Train NER model on weak labels with spaCy
!spacy train config.cfg --paths.train output/weak_training_lg.spacy  --paths.dev data/molecular/gold_dev.spacy --output output/weak_ner_lg --gpu-id 0 --code training.py

In [None]:
# Baseline: train NER model on strong labels with spaCy
!spacy train config.cfg --paths.train output/strong_train.spacy --paths.dev output/strong_dev.spacy --output output/strong_ner --gpu-id 0 --code training.py


# Evaluation

In [58]:
ner_model_weak = spacy.load('output/weak_ner_lg/model-best/')
ner_model_strong = spacy.load('output/strong_ner/model-best/')

In [59]:
from IPython.display import display, Markdown

def print_metrics(gold_docs, is_test : bool):
    display(Markdown('__Labeling Function / HMM performance__'))
    gold_ents = [d.ents for d in gold_docs]
    
    apply_hmm(gold_docs)
    
    display(evaluate(gold_docs, ['Gene or Protein'], [l.name for l in lfs] + ['hmm']).loc['Gene or Protein'])
        
    display(Markdown('__Weak NER Performance__'))
    for d, gold_ent in zip(gold_docs, tqdm(gold_ents)):
        d.set_ents([])
        d = ner_model_weak(d)
        d.spans['ner_model'] = d.ents
        d.set_ents(gold_ent)
    display(evaluate(gold_docs, ['Gene or Protein'], ['ner_model']).loc['Gene or Protein'])
    
    if is_test:
        display(Markdown('__Strong NER Performance__'))
        for d, gold_ent in zip(gold_docs, tqdm(gold_ents)):
            d.set_ents([])
            d = ner_model_strong(d)
            d.spans['ner_model'] = d.ents
            d.set_ents(gold_ent)
        display(evaluate(gold_docs, ['Gene or Protein'], ['ner_model']).loc['Gene or Protein'])

## Dev Set Evaluation

In [60]:
gold_docs_dev_eval = list(DocBin().from_disk('data/molecular/gold_dev.spacy').get_docs(nlp.vocab))

print_metrics(gold_docs_dev_eval, is_test=False)

__Labeling Function / HMM performance__

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:15<00:00, 64.79it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,CIViC,0.979,0.366,0.532,,,,0.944,0.465,0.624
100.0 %,COSMIC,0.96,0.342,0.504,,,,0.928,0.436,0.594
100.0 %,Entrez,0.937,0.45,0.608,,,,0.902,0.503,0.646
100.0 %,HGNC,0.879,0.244,0.382,,,,0.833,0.305,0.446
100.0 %,OMIM,0.955,0.411,0.574,,,,0.926,0.524,0.67
100.0 %,Protein Families,1.0,0.067,0.126,,,,1.0,0.076,0.142
100.0 %,Proteins,1.0,0.117,0.21,,,,0.934,0.12,0.212
100.0 %,hmm,0.899,0.596,0.716,,,,0.841,0.68,0.752


__Weak NER Performance__

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 999/1000 [00:39<00:00, 24.99it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,ner_model,0.902,0.617,0.732,,,,0.855,0.72,0.782


## Test Set Evaluation

In [61]:
gold_docs_test_eval = list(DocBin().from_disk('data/molecular/gold_test.spacy').get_docs(nlp.vocab))

print_metrics(gold_docs_test_eval, is_test=True)

__Labeling Function / HMM performance__

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:13<00:00, 73.50it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,CIViC,0.933,0.35,0.51,,,,0.841,0.473,0.606
100.0 %,COSMIC,0.927,0.342,0.5,,,,0.854,0.473,0.608
100.0 %,Entrez,0.951,0.525,0.676,,,,0.89,0.608,0.722
100.0 %,HGNC,0.853,0.19,0.31,,,,0.836,0.28,0.42
100.0 %,OMIM,0.904,0.363,0.518,,,,0.818,0.493,0.616
100.0 %,Protein Families,0.538,0.027,0.052,,,,0.25,0.012,0.022
100.0 %,Proteins,1.0,0.131,0.232,,,,0.975,0.112,0.2
100.0 %,hmm,0.864,0.596,0.706,,,,0.789,0.689,0.736


__Weak NER Performance__

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 999/1000 [00:40<00:00, 24.53it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,ner_model,0.901,0.613,0.73,,,,0.819,0.718,0.766


__Strong NER Performance__

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 999/1000 [00:38<00:00, 25.86it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,ner_model,0.544,0.694,0.61,,,,0.558,0.758,0.642


## Descriptive Statistics

In [62]:
def get_genes(doc_list):
    return [e.text for d in doc_list for e in d.ents if e.label_ == 'Gene or Protein']

In [63]:
len(docs), len(get_genes(docs))

(83624, 5617)

In [64]:
len(gold_docs_dev_eval), len(get_genes(gold_docs_dev_eval))

(1000, 475)

In [65]:
len(gold_docs_test_eval), len(get_genes(gold_docs_test_eval))

(1000, 347)