##  GGTWEAK - German Gene Tagging with Weak Supervision

Obtaining a sufficient amount of high-quality training data is one of the crucial points and most formidable challenges in deep learning-based natural language processing. In this work, we present GGTWEAK (German Gene Tagging with Weak Supervision). In conventional settings all data must be labelled manually while complexity often does not allow for the involvement of non-experts for this laborious and thus costly task. This is especially true for molecular data, which is hard to discriminate from common abbreviations syntactically. Therefore, GGTWEAK provides a baseline bridging the gap between English gene taggers and models usable for German. GGTWEAK saves human resources compared to its English counterparts and potentially can be trained for free after development on available data. 
We design labelling functions based on the structure of gene naming conventions and databases from both the medical and general domain. Following that, we train a hidden Markov model for label aggregration. Based on our weakly labelled data, we finally train a German BERT model for named entity recognition. This weak supervision approach for gene labelling in the German language leverages the skweak framework achieving an entity-level F1 score of 60.4% on our test set, while dealing with a highly unbalanced data from the German Guideline Program in Oncology NLP Corpus. The NER model trained on the same development dataset with quantitatively less strong labels achieved 53.9%.

In [1]:
import re
import pandas as pd
import numpy as np
from pathlib import Path
from tqdm import tqdm

import nltk
from nltk.corpus import stopwords

import spacy
from spacy.tokens import Span, DocBin
from spacy_transformers import Transformer
from spacy_transformers.pipeline_component import DEFAULT_CONFIG

from skweak import heuristics, gazetteers, generative, utils, base
from skweak.base import SpanAnnotator
from skweak.heuristics import SpanEditorAnnotator, VicinityAnnotator, SpanConstraintAnnotator
from skweak.analysis import LFAnalysis

import sklearn.metrics

from evaluation import evaluate, get_results, compute_raw_numbers, _get_probs, show_errors

In [2]:
stops = set(stopwords.words('german'))

# Load Data


Here we customize the spaCy tokenizer without statistical model and loading the standard German spaCy model.

In [3]:
nlp = spacy.load('de_core_news_md')
infixes = nlp.Defaults.infixes + [r'([-])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x7fc51599f800>

In the following, we construct a dataframe from all sentences in GGPONC and concatenate them. 

In [23]:
GGPONC_PATH = 'data/ggponc/plain_text/sentences/all_files_sentences/'

In [5]:
sentences = list(Path(GGPONC_PATH).glob('*.txt'))
len(sentences)

10193

In [6]:
def get_df(file):
    df = pd.read_csv(file, delimiter='\t', names =['text'] )
    df['file'] = file.stem
    df['sentence_id'] = df.index
    return df

In [7]:
dfs = [get_df(file) for file in sentences]

In [8]:
sentence_df = pd.concat(dfs)
len(sentence_df)

85996

Remove files which have been manually annotated from the training dataset

In [9]:
with open('data/molecular/annotated_files.txt') as fh:
    annotated_files = [f.replace('.txt', '').strip() for f in fh.readlines()]
len(annotated_files)

138

In [10]:
sentence_df = sentence_df[~sentence_df.file.isin(annotated_files)]
len(sentence_df)

83290

# Labeling Functions

The Clinical Interpretation of Variants in Cancer (CIViC) database is an opensource, open-access knowledgebase curated by experts on therapeutic, prognostic, diagnostic and predisposing relevance of inherited and somatic variants of every type. Both, genes and variants, are used in this project.

In [12]:
df = pd.read_csv('data/molecular/nightly-GeneSummaries.tsv', sep='\t')
CIVIC_genes = df['name'].tolist()
CIVIC_genes_lower = [c.lower() for c in CIVIC_genes]

In [13]:
df = pd.read_csv('data/molecular/nightly-VariantSummaries.tsv', sep='\t', error_bad_lines=False )
CIVIC_variants = df['variant'].tolist()
CIVIC_variants_lower = [c.lower() for c in CIVIC_variants]

Skipping line 13: expected 29 fields, saw 33
Skipping line 17: expected 29 fields, saw 30
Skipping line 31: expected 29 fields, saw 30
Skipping line 441: expected 29 fields, saw 30
Skipping line 502: expected 29 fields, saw 30
Skipping line 553: expected 29 fields, saw 31



"cue_civic" is based on the CIViC database. If a token contains a gene which is listed in the database, this and the next token are labelled a gene. This way, we do not restrict the function to a 100 percent match but leave it some leeway.

In [14]:
def civic(doc):
    for tok in doc:
        for cue in CIVIC_genes:
            if tok.text.find(cue) == -1:
                continue
            else:
                yield tok.i, tok.i+1, "Gene or Protein"
cue_civic = heuristics.FunctionAnnotator("cue_civic", civic)

The Online Mendelian Inheritance in Man (OMIM) database is the encyclopedic collection of the human medical branch of genetics.

In [15]:
omim_list = pd.read_csv("data/molecular/mim2gene.csv")
omim_list = omim_list['name'].tolist()
omim_list_lower = [o.lower() for o in omim_list]
short_genes = []
for u in omim_list:
    if len(u)<3:
        short_genes.append(u)
less_short_genes = []
for u in omim_list:
    if len(u)<5 and len(u)>2:
        less_short_genes.append(u)
print(len(omim_list))

16767


"omim" is based on the OMIM database and checks whether tokens are present in its list of 16,767 approved gene symbols in lowercase as the diversity of genes often shows in volatile capitalization. To increase precision, genes with a length shorter than three characters are matched only correctly cased.

In [16]:
def omim(doc):
    for tok in doc:
        if tok.text.lower() in omim_list_lower and tok.text.lower() not in stops and len(tok.text.lower())>=3:
            yield tok.i, tok.i+1, "Gene or Protein"
omim = heuristics.FunctionAnnotator("omim", omim) 

The Catalogue of Somatic Mutations in Cancer (COSMIC) database harbors somatic cell mutations and additional information associated with cancer in humans.

In [17]:
cosmic_census = pd.read_csv("data/molecular/cancer_gene_census.csv")
cosmic_census = cosmic_census['Gene Symbol'].tolist()
cosmic_census_lower = [c.lower() for c in cosmic_census]

"cue_cosmic_census" is based on the COSMIC database. If a token contains a gene symbol which is listed here, this token and its successor are annotated as a gene.

In [18]:
def cosmic(doc):
    for tok in doc:
        for cue in cosmic_census:
            if tok.text.find(cue) == -1:
                continue
            else:
                yield tok.i, tok.i+1, "Gene or Protein"
cue_cosmic_census = heuristics.FunctionAnnotator("cue_cosmic_census", cosmic)  

"construct" is based on the Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC) naming conventions for genes and leverages regular expressions to let the annotator abide by them. Those expressions comprise various combinations of letters and numbers and certain fixed terms for shorter terms to avoid underfitting. In addition, the CIViC database for variants has also been included for a better recall.

In [19]:
def structure(doc):
    for tok in doc:
        if bool(re.search(r"[a-zA-Z]{4}\d{2}", tok.text))==True or bool(re.search(r"[a-zA-Z]{5}\d{1}", tok.text))==True\
        or bool(re.search(r"[a-zA-Z]{4}\d{1}", tok.text))==True or bool(re.search(r"[A-Z]{5}\d{1}", tok.text))==True\
        or bool(re.search(r"[A-Z]{5}\d{2}", tok.text))==True or bool(re.search(r"[A-Z]{3}\d{2}", tok.text))==True\
        or bool(re.search(r"[a-zA-Z]{2}\d{3}[a-zA-Z]{2}", tok.text))==True or bool(re.search(r"[a-zA-Z]{1}\d{3}[a-zA-Z]{1}", tok.text))==True\
        or bool(re.search(r"[A-Z]{3}\d{2}", tok.text))==True or bool(re.search(r"[A-Z]{6}\d{1}", tok.text))==True\
        or bool(re.search(r"[A-Z]{3}\d{3}", tok.text))==True or bool(re.search(r"[p]\d{2}", tok.text))==True\
        or bool(re.search(r"CYP[a-zA-Z0-9]{3}", tok.text))==True or bool(re.search(r"CYP[a-zA-Z0-9]{2}", tok.text))==True\
        or bool(re.search(r"[A-Z]{3}\d{1}", tok.text))==True or bool(re.search(r"[A-Z]{2}\d{2}", tok.text))==True\
        or bool(re.search(r"^CK.", tok.text))==True or bool(re.search(r"^PD-..", tok.text))==True or bool(re.search(r"^PS[MA|A]", tok.text))==True or tok.text.lower in CIVIC_variants_lower:
            yield tok.i, tok.i+1, "Gene or Protein"
construct = heuristics.FunctionAnnotator("construct", structure)

In [20]:
# Calculate data stats (number of genes, sentences, etc...) -> Table for Materials
# Intersections of the gazetteers are calculated below.
intersection_db = list(set(cosmic_census_lower) & set(CIVIC_genes_lower) & set(omim_list_lower))
union_db = set(set(cosmic_census_lower).union(set(CIVIC_genes_lower)).union(set(omim_list_lower)))
print(len(intersection_db))
print(len(union_db))

249
16784


We define a list of labelling functions which we want to deploy onto our textual input. Every row in our dataframe is subject to examination. Those listed here was the best performing allocation from ca. 30 labelling functions.

In [22]:
lfs = [construct, cue_civic, omim, cue_cosmic_census]

#For Quick Run with Random Sentences!
#random_files = files_df.sample(n = 10000)
docs = []

for sentence_idx, doc in zip(sentence_df.reset_index().iterrows(), nlp.pipe(sentence_df.text, batch_size=32,disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer",\
                                                                                             'morphologizer', 'ner'])):
    i, row = sentence_idx
    for lf in lfs:
        doc = lf(doc)
    docs.append(doc)
    if i % 1000 == 0:
        print(f'{i}/{len(sentence_df)}')

0/83290
1000/83290
2000/83290
3000/83290
4000/83290
5000/83290
6000/83290
7000/83290
8000/83290
9000/83290
10000/83290
11000/83290
12000/83290
13000/83290
14000/83290
15000/83290
16000/83290
17000/83290
18000/83290
19000/83290
20000/83290
21000/83290
22000/83290
23000/83290
24000/83290
25000/83290
26000/83290
27000/83290
28000/83290
29000/83290
30000/83290
31000/83290
32000/83290
33000/83290
34000/83290
35000/83290
36000/83290
37000/83290
38000/83290
39000/83290
40000/83290
41000/83290
42000/83290
43000/83290
44000/83290
45000/83290
46000/83290
47000/83290
48000/83290
49000/83290
50000/83290
51000/83290
52000/83290
53000/83290
54000/83290
55000/83290
56000/83290
57000/83290
58000/83290
59000/83290
60000/83290
61000/83290
62000/83290
63000/83290
64000/83290
65000/83290
66000/83290
67000/83290
68000/83290
69000/83290
70000/83290
71000/83290
72000/83290
73000/83290
74000/83290
75000/83290
76000/83290
77000/83290
78000/83290
79000/83290
80000/83290
81000/83290
82000/83290
83000/83290


# Label Aggregation

In [24]:
# Training of HMM and Majority Voter
#voter = skweak.voting.SequentialMajorityVoter("maj_voter", labels=["Gen"])
#voter.fit(docs)
hmm = generative.HMM("hmm", ["Gene or Protein"])
hmm.fit(docs)

Starting iteration 1
Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Finished E-step with 3861 documents
Starting iteration 2


         1      -23574.4098             +nan


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Finished E-step with 3861 documents
Starting iteration 3


         2      -23012.0902        +562.3196


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Finished E-step with 3861 documents
Starting iteration 4


         3      -23002.5389          +9.5513


Number of processed documents: 1000
Number of processed documents: 2000
Number of processed documents: 3000
Finished E-step with 3861 documents


         4      -23000.8700          +1.6689


# TODO: try to filter files with no LF annotations to make distribution closer to dev / test

In [25]:
for d in docs:
    d = hmm(d)
    d.ents = d.spans["hmm"]
    
utils.docbin_writer(docs, "output/weak_training.spacy")

Write to output/training.spacy...done


# Labeling Function Analysis

In [27]:
gold_docs_dev = list(DocBin().from_disk('data/molecular/gold_dev.spacy').get_docs(nlp.vocab))

Our labeling functions must also be deployed onto the gold standard data to evaluate strong supervision against weak supervision.

In [28]:
def apply_hmm(gold_docs):
    for g in tqdm(gold_docs):
        if 'Gene or Protein' in g.spans:
            del g.spans['Gene or Protein']
        for lf in lfs:
            g = lf(g)
        g = hmm(g)

In [29]:
apply_hmm(gold_docs_dev)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 1095/1095 [00:11<00:00, 93.72it/s]


In [30]:
# HMM / LFs vs. Gold-Standard
#evaluate(gold_docs, ['Gen'], ['lf15', 'hmm'])
evaluate(gold_docs_dev, ['Gene or Protein'], [l.name for l in lfs] + ['hmm'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
label,proportion,model,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
Gene or Protein,100.0 %,construct,0.887,0.303,0.452,,,,0.831,0.435,0.572
Gene or Protein,100.0 %,cue_civic,0.957,0.284,0.438,,,,0.892,0.405,0.558
Gene or Protein,100.0 %,cue_cosmic_census,0.971,0.27,0.422,,,,0.913,0.388,0.544
Gene or Protein,100.0 %,hmm,0.875,0.404,0.552,,,,0.823,0.582,0.682
Gene or Protein,100.0 %,omim,0.943,0.321,0.478,,,,0.887,0.462,0.608
macro,,construct,0.887,0.303,0.452,,,,0.831,0.435,0.572
macro,,cue_civic,0.957,0.284,0.438,,,,0.892,0.405,0.558
macro,,cue_cosmic_census,0.971,0.27,0.422,,,,0.913,0.388,0.544
macro,,hmm,0.875,0.404,0.552,,,,0.823,0.582,0.682
macro,,omim,0.943,0.321,0.478,,,,0.887,0.462,0.608


# Training of Transformer-based NER Models

In [None]:
# Train NER model on weak labels with spaCy
!spacy train config.cfg --paths.train output/weak_training.spacy  --paths.dev data/molecular/gold_dev.spacy --output output/weak_ner --gpu-id 0 --code training.py

In [31]:
# Baseline: train NER model on strong labels with spaCy
!spacy train config.cfg --paths.train data/molecular/gold_dev.spacy --paths.dev data/molecular/gold_test.spacy --output output/strong_ner --gpu-id 0 --code training.py

[38;5;2m✔ Created output directory: output/strong_ner[0m
[38;5;4mℹ Saving to output directory: output/strong_ner[0m
[38;5;4mℹ Using GPU: 0[0m
[1m
[2022-11-28 17:43:06,711] [INFO] Set up nlp object from config
[2022-11-28 17:43:06,718] [INFO] Pipeline: ['transformer', 'ner']
[2022-11-28 17:43:06,721] [INFO] Created vocabulary
[2022-11-28 17:43:06,721] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at bert-base-german-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a Be

# Evaluation

In [33]:
ner_model_weak = spacy.load('output/weak_ner/model-best/')
ner_model_strong = spacy.load('output/strong_ner/model-best/')

In [34]:
from IPython.display import display, Markdown

def print_metrics(gold_docs, is_test : bool):
    display(Markdown('__Labeling Function / HMM performance__'))
    gold_ents = [d.ents for d in gold_docs]
    
    apply_hmm(gold_docs)
    
    display(evaluate(gold_docs, ['Gene or Protein'], [l.name for l in lfs] + ['hmm']).loc['Gene or Protein'])
        
    display(Markdown('__Weak NER Performance__'))
    for d, gold_ent in zip(gold_docs, tqdm(gold_ents)):
        d.set_ents([])
        d = ner_model_weak(d)
        d.spans['ner_model'] = d.ents
        d.set_ents(gold_ent)
    display(evaluate(gold_docs, ['Gene or Protein'], ['ner_model']).loc['Gene or Protein'])
    
    if is_test:
        display(Markdown('__Strong NER Performance__'))
        for d, gold_ent in zip(gold_docs, tqdm(gold_ents)):
            d.set_ents([])
            d = ner_model_strong(d)
            d.spans['ner_model'] = d.ents
            d.set_ents(gold_ent)
        display(evaluate(gold_docs, ['Gene or Protein'], ['ner_model']).loc['Gene or Protein'])

## Dev Set Evaluation

In [35]:
gold_docs_dev_eval = list(DocBin().from_disk('data/molecular/gold_dev.spacy').get_docs(nlp.vocab))

print_metrics(gold_docs_dev_eval, is_test=False)

__Labeling Function / HMM performance__

100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 1095/1095 [00:11<00:00, 93.20it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,construct,0.887,0.303,0.452,,,,0.831,0.435,0.572
100.0 %,cue_civic,0.957,0.284,0.438,,,,0.892,0.405,0.558
100.0 %,cue_cosmic_census,0.971,0.27,0.422,,,,0.913,0.388,0.544
100.0 %,hmm,0.875,0.404,0.552,,,,0.823,0.582,0.682
100.0 %,omim,0.943,0.321,0.478,,,,0.887,0.462,0.608


__Weak NER Performance__

100%|██████████████████████████████████████████████████████████████████████████████████████████████████▉| 1094/1095 [00:41<00:00, 26.07it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,ner_model,0.876,0.441,0.586,,,,0.809,0.624,0.704


## Test Set Evaluation

In [36]:
gold_docs_test_eval = list(DocBin().from_disk('data/molecular/gold_test.spacy').get_docs(nlp.vocab))

print_metrics(gold_docs_test_eval, is_test=True)

__Labeling Function / HMM performance__

100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1611/1611 [00:13<00:00, 117.69it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,construct,0.83,0.216,0.342,,,,0.812,0.286,0.424
100.0 %,cue_civic,0.928,0.384,0.544,,,,0.867,0.487,0.624
100.0 %,cue_cosmic_census,0.914,0.369,0.526,,,,0.863,0.472,0.61
100.0 %,hmm,0.865,0.465,0.604,,,,0.818,0.596,0.69
100.0 %,omim,0.919,0.394,0.552,,,,0.868,0.504,0.638


__Weak NER Performance__

100%|██████████████████████████████████████████████████████████████████████████████████████████████████▉| 1610/1611 [00:59<00:00, 27.22it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,ner_model,0.87,0.476,0.616,,,,0.807,0.598,0.686


__Strong NER Performance__

100%|██████████████████████████████████████████████████████████████████████████████████████████████████▉| 1610/1611 [00:55<00:00, 29.07it/s]


Unnamed: 0_level_0,Unnamed: 1_level_0,tok_precision,tok_recall,tok_f1,tok_cee,tok_acc,coverage,ent_precision,ent_recall,ent_f1
proportion,model,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
100.0 %,ner_model,0.868,0.78,0.822,,,,0.818,0.788,0.802
