Obtaining a sufficient amount of high-quality training data is one of the most formidable challenges in deep learning-based natural language processing. In conventional settings, all data must be labelled manually, and the complexity of the task often rules out the involvement of non-experts, which makes annotation laborious and costly. This is especially true for molecular data, where gene names are syntactically hard to distinguish from common abbreviations. In this work, we present GGTWEAK (German Gene Tagging with Weak Supervision), a baseline that bridges the gap between English gene taggers and models usable for German. GGTWEAK requires less human annotation effort than its English counterparts and, once developed, can potentially be trained on available data at no additional annotation cost.
We design labelling functions based on the structure of gene naming conventions and on databases from both the medical and the general domain. We then train a hidden Markov model for label aggregation. Finally, we train a German BERT model for named entity recognition on the weakly labelled data. This weak supervision approach to gene labelling in German builds on the skweak framework and achieves an entity-level F1 score of 60.4% on our test set, despite the highly unbalanced data of the German Guideline Program in Oncology NLP Corpus. An NER model trained on the same development dataset with strong labels, which are available in smaller quantity, achieved 53.9%.
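
Entity-level F1 scores whole spans instead of single tokens. A minimal sketch of the metric, assuming exact-match scoring and spans represented as (start, end, label) tuples; the helper `entity_f1` is hypothetical and is not the `evaluate` function imported later in this notebook:

```python
def entity_f1(gold, pred):
    """Exact-match entity-level precision, recall and F1.

    gold, pred: iterables of (start, end, label) tuples.
    A predicted span counts as correct only if boundaries and label both match.
    """
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy spans (assumed data): one of two predictions matches one of three gold spans.
gold_spans = [(0, 1, "Gen"), (5, 7, "Gen"), (10, 11, "Gen")]
pred_spans = [(0, 1, "Gen"), (5, 6, "Gen")]
precision, recall, f1 = entity_f1(gold_spans, pred_spans)
print(precision, recall, f1)
```

Under exact matching, the prediction (5, 6) earns no credit although it overlaps the gold span (5, 7), which is why entity-level scores are typically lower than token-level ones.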

In [1]:
import spacy, re
from spacy.tokens import Span, DocBin
from spacy_transformers import Transformer
from spacy_transformers.pipeline_component import DEFAULT_CONFIG
from skweak import heuristics, gazetteers, generative, utils, base
from skweak.base import SpanAnnotator
from skweak.heuristics import SpanEditorAnnotator, VicinityAnnotator, SpanConstraintAnnotator
from skweak.analysis import LFAnalysis
import pandas as pd
import numpy as np
import ipynb
import string
import sklearn.metrics


from ipynb.fs.full.evaluation import evaluate, get_results, compute_raw_numbers, _get_probs, show_errors

# Load Data


In [None]:
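spacy.require_gpu()  # call before loading the pipeline
nlp = spacy.load('de_core_news_md')
# Treat hyphens as infixes so that hyphenated compounds are split into separate tokens
infixes = list(nlp.Defaults.infixes) + [r'([-])']
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
nlp.add_pipe('sentencizer')
doc_bin = DocBin()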

In [None]:
from pathlib import Path
import pandas as pd
sentences = list(Path('../ggponc_data/sentences/all_files_sentences').glob('*.txt'))

In [None]:
def get_df(file):
    df = pd.read_csv(file, delimiter='\t', names =['text'] )
    df['file'] = file.stem
    df['sentence_id'] = df.index
    return df

In [None]:
files = [get_df(file) for file in sentences]

In [None]:
files_df = pd.concat(files)

In [14]:
# Calculate data stats (number of genes, sentences, etc...) -> Table for Materials

intersection_db = list(set(cosmic_census_lower) & set(CIVIC_genes_lower) & set(omim_list_lower))
union_db = set(cosmic_census_lower) | set(CIVIC_genes_lower) | set(omim_list_lower)
print(len(intersection_db))
print(len(union_db))

# Labeling Functions

In [3]:
# Load gazetteers

df = pd.read_csv('nightly-GeneSummaries.tsv', sep='\t')
CIVIC_genes = df['name'].tolist()
CIVIC_genes_lower = [c.lower() for c in CIVIC_genes]

In [None]:
# on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
df = pd.read_csv('nightly-VariantSummaries.tsv', sep='\t', on_bad_lines='warn')
CIVIC_variants = df['variant'].tolist()
CIVIC_variants_lower = [c.lower() for c in CIVIC_variants]

In [None]:
omim_list = pd.read_csv("mim2gene.csv")
omim_list = omim_list['name'].tolist()
omim_list_lower = [o.lower() for o in omim_list]
short_genes = [u for u in omim_list if len(u) < 3]
less_short_genes = [u for u in omim_list if 2 < len(u) < 5]
print(len(omim_list))

In [None]:
cosmic_census = pd.read_csv("cancer_gene_census.csv")
cosmic_census = cosmic_census['Gene Symbol'].tolist()
cosmic_census_lower = [c.lower() for c in cosmic_census]

"omim" is based on the Online Mendelian Inheritance in Man (OMIM) database and checks whether tokens are present in its list of 16,767 approved gene symbols in lowercase as the diversity of genes often shows in volatile capitalization. To increase precision, genes with a length shorter than three characters are matched only correctly cased.

In [4]:
def omim(doc):
    # `stops` (stop words to exclude) must be defined before this cell is run
    for tok in doc:
        text = tok.text.lower()
        if len(text) >= 3 and text in omim_list_lower and text not in stops:
            yield tok.i, tok.i+1, "Gen"
omim = heuristics.FunctionAnnotator("omim", omim)
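
The description of "omim" says that symbols shorter than three characters are matched only when correctly cased, while longer symbols are matched case-insensitively. A minimal sketch of that rule on plain strings, assuming toy data; `match_symbol`, `symbols` and `stop_words` are hypothetical names and do not come from the OMIM list:

```python
def match_symbol(token, symbols, stop_words, min_len=3):
    """Return True if `token` should be labelled as a gene symbol."""
    if token.lower() in stop_words:
        return False
    if len(token) >= min_len:
        # Longer symbols: case-insensitive lookup.
        return token.lower() in {s.lower() for s in symbols}
    # Short symbols: require the exact casing to limit false positives.
    return token in symbols


symbols = {"TP53", "KRAS", "AR", "C3"}  # toy examples only
stop_words = {"das", "met"}
results = {
    tok: match_symbol(tok, symbols, stop_words)
    for tok in ["tp53", "Kras", "AR", "ar", "das"]
}
print(results)
```

Here "AR" matches but "ar" does not, because two-letter strings occur far too often as ordinary abbreviations or word fragments.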

"cue_cosmic_census" is based on the Catalogue of Somatic Mutations in Cancer (COSMIC) database 
If a token contains a gene symbol which is listed here, this token and its successor are annotated as a gene.

In [None]:
def cosmic(doc):
    for tok in doc:
        for cue in cosmic_census:
            # Case-sensitive substring match
            if cue in tok.text:
                yield tok.i, tok.i+1, "Gen"
cue_cosmic_census = heuristics.FunctionAnnotator("cue_cosmic_census", cosmic)

"cue_civic" is based on the Clinical Interpretation of Variants in Cancer (CIViC) database. If a token contains a gene which is listed in the database, this and the next token are labelled a gene. This way, we do not restrict the function to a 100 percent match but leave it some leeway.

In [None]:
def civic(doc):
    for tok in doc:
        for cue in CIVIC_genes:
            # Case-sensitive substring match
            if cue in tok.text:
                yield tok.i, tok.i+1, "Gen"
cue_civic = heuristics.FunctionAnnotator("cue_civic", civic)
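
Substring matching is what gives the cue functions their leeway: a token does not have to equal a symbol, it only has to contain one. A small sketch on plain strings, with a hypothetical toy cue list instead of the CIViC or COSMIC data, shows both the gain and the cost:

```python
def contains_cue(token, cues):
    """Return all cues that occur as case-sensitive substrings of `token`."""
    return [cue for cue in cues if cue in token]


cues = ["BRAF", "KIT", "MET"]  # toy symbols, not the real database content

hits = {
    tok: contains_cue(tok, cues)
    for tok in ["BRAFV600E", "c-KIT", "METASTASE", "Tumor"]
}
print(hits)
```

"BRAFV600E" and "c-KIT" are caught although neither equals a symbol, but an upper-cased word such as "METASTASE" is caught as well. False positives of this kind are one reason the outputs of several labelling functions are aggregated instead of being used directly.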

"construct" is based on the Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC) naming conventions for genes and leverages regular expressions to let the annotator abide by them. Those expressions comprise various combinations of letters and numbers and certain fixed terms for shorter terms to avoid underfitting. In addition, the CIViC database for variants has also been included for a better recall.

In [None]:
# Token-shape patterns following the HGNC naming conventions (searched anywhere in the token)
STRUCTURE_PATTERNS = [re.compile(p) for p in (
    r"[a-zA-Z]{4}\d{2}", r"[a-zA-Z]{5}\d{1}", r"[a-zA-Z]{4}\d{1}",
    r"[A-Z]{5}\d{1}", r"[A-Z]{5}\d{2}", r"[A-Z]{3}\d{2}",
    r"[a-zA-Z]{2}\d{3}[a-zA-Z]{2}", r"[a-zA-Z]{1}\d{3}[a-zA-Z]{1}",
    r"[A-Z]{6}\d{1}", r"[A-Z]{3}\d{3}", r"[p]\d{2}",
    r"CYP[a-zA-Z0-9]{3}", r"CYP[a-zA-Z0-9]{2}",
    r"[A-Z]{3}\d{1}", r"[A-Z]{2}\d{2}",
    r"^CK.", r"^PD-..",
    # Note: [MA|A] is a character class (M, A or |); ^PS(?:MA|A) may have been intended
    r"^PS[MA|A]",
)]

def structure(doc):
    for tok in doc:
        # tok.text.lower() must be called; the bare method object is never in the list
        if any(p.search(tok.text) for p in STRUCTURE_PATTERNS) or tok.text.lower() in CIVIC_variants_lower:
            yield tok.i, tok.i+1, "Gen"
construct = heuristics.FunctionAnnotator("construct", structure)
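
To see what such shape patterns accept, the following sketch applies three of the expressions from the cell above to a few toy tokens. Because `re.search` is unanchored, a pattern fires if it matches anywhere inside the token; the helper `looks_like_gene` and the token list are hypothetical:

```python
import re

patterns = [re.compile(p) for p in (r"[A-Z]{3}\d{1}", r"[p]\d{2}", r"^CK.")]


def looks_like_gene(token):
    """True if any pattern matches somewhere in the token."""
    return any(p.search(token) for p in patterns)


tokens = ["BRCA1", "p53", "CK7", "Tumor", "pT1", "Stop16"]
matches = {tok: looks_like_gene(tok) for tok in tokens}
print(matches)
```

"BRCA1", "p53" and "CK7" match as intended, "Tumor" and the staging code "pT1" do not, and the contrived token "Stop16" matches because it contains "p16". Anchoring a pattern with `^` or using `re.fullmatch` would trade some recall for precision.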

In [5]:
# Apply Labeling Functions to data

lfs = [construct, cue_civic, omim, cue_cosmic_census]


#For Quick Run with Random Sentences!
#random_files = files_df.sample(n = 10000)
docs = []

disabled = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "morphologizer", "ner"]
for i, doc in enumerate(nlp.pipe(files_df.text, batch_size=32, disable=disabled)):
    for lf in lfs:
        doc = lf(doc)
    docs.append(doc)
    if i % 1000 == 0:
        print(f'{i}/{len(files_df)}')

# Labeling Function Analysis

In [7]:
# calculate agreement, overlap, create heatmap

def convert_gold_labels(file_name):
    db = DocBin().from_disk(file_name)
    gold_docs = list(db.get_docs(nlp.vocab))

    for g in gold_docs:
        spans = [Span(g, span.start, span.end, 'Gen') for span in g.spans['Gene or Protein']]
        ents = []
        for s in spans:
            overlap = False
            for i, e in enumerate(ents): # Check for overlap
                if s.start <= e.start and s.end >= e.end:
                    ents[i] = s # Replace with larger span
                    overlap = True
                    break
            if not overlap:
                ents.append(s)
        g.set_ents(ents)
        g.spans["Gene or Protein"] = []
    return gold_docs
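
The loop above replaces an already collected span when a later span fully contains it. A more general variant, sketched here on plain (start, end) tuples, keeps the longest span in any group of overlapping spans, including partial overlaps; spaCy ships a comparable helper, `spacy.util.filter_spans`. The function name `keep_longest` is hypothetical and this is not the code used for the reported results:

```python
def keep_longest(spans):
    """Greedy overlap resolution on (start, end) token spans, end exclusive."""
    # Longest spans first; ties are broken in favour of the earlier start.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, used = [], set()
    for start, end in ordered:
        # Keep a span only if none of its tokens is already covered.
        if not any(i in used for i in range(start, end)):
            kept.append((start, end))
            used.update(range(start, end))
    return sorted(kept)


spans = [(2, 3), (2, 5), (4, 6), (8, 9)]
resolved = keep_longest(spans)
print(resolved)
```

Non-overlapping output matters because `Doc.set_ents` rejects overlapping entity spans.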


In [None]:
gold_docs = convert_gold_labels('gold_labels.spacy')
gold_docs_test = convert_gold_labels('gold_test.spacy')


In [None]:
utils.docbin_writer(gold_docs, "aml4dh-skweak/goldig.spacy")
utils.docbin_writer(gold_docs_test, "aml4dh-skweak/goldig_test.spacy")

In [None]:
# Apply LFs to gold documents
for g in gold_docs:
    for lf in lfs:
        g = lf(g)
    # Why does this still have a "Variant" key?
    if 'Variant' in g.spans:
        del g.spans['Variant']
    g = hmm(g)

In [None]:
# HMM / LFs vs. Gold-Standard
#evaluate(gold_docs, ['Gen'], ['lf15', 'hmm'])
evaluate(gold_docs, ['Gen'], [l.name for l in lfs[0:14]] + ['hmm'])

# Label Aggregation

In [17]:
# Train HMM, Majority Voter, etc...

#voter = skweak.voting.SequentialMajorityVoter("maj_voter", labels=["Gen"])
#voter.fit(docs)
hmm = generative.HMM("hmm", ["Gen"])
hmm.fit(docs)
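
As a rough illustration of what label aggregation does, the sketch below combines per-token votes from several labelling functions with a simple vote threshold. This is neither skweak's HMM nor its majority voter; the function `threshold_vote` and the vote matrix are hypothetical:

```python
def threshold_vote(votes_per_token, min_votes=2, label="Gen", outside="O"):
    """Label a token if at least `min_votes` labelling functions agree on `label`."""
    return [
        label if votes.count(label) >= min_votes else outside
        for votes in votes_per_token
    ]


# One row per token, one column per labelling function (assumed toy data).
votes = [
    ["Gen", "Gen", "O", "Gen"],
    ["O", "O", "O", "O"],
    ["O", "Gen", "O", "O"],
]
labels = threshold_vote(votes)
print(labels)
```

A threshold treats every labelling function as equally trustworthy. The HMM instead estimates how reliable each function is and models the label sequence, so a single precise function can outweigh several noisy ones.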

# Load Gold-Standard Training Data

In [12]:
# Load dev / test data
# ?


# Named Entity Recognition

In [11]:
# Train NER model with spaCy

!spacy train config.cfg --paths.train aml4dh-skweak/goldig.spacy  --paths.dev aml4dh-skweak/goldig_test.spacy --output aml4dh-skweak/goldig --gpu-id 0 --code training.py


#!spacy train config.cfg --paths.train aml4dh-skweak/training.spacy  --paths.dev aml4dh-skweak/goldig_test.spacy --output aml4dh-skweak/goldig --gpu-id 0 --code training.py

In [None]:
# Load best performing NER model
nlp = spacy.load("aml4dh-skweak/goldig/model-best")

# Evaluation

In [16]:
# Evaluate HMM, LFs, NER model on dev / test data -> final performance table

for d_gold, d_pred in zip(gold_docs, gold_docs_for_lfa):
    d_pred.set_ents([])
    d_pred = nlp_pred(d_pred)
    d_gold.spans['ner_model'] = d_pred.ents
    
evaluate(gold_docs, ['Gen'], ['ner_model'])