# BioBERT for Relation Extraction

## Step 1: Processing the corpus data
We parse the corpus XML files with `lxml.etree`.

See: [https://lxml.de/tutorial.html](https://lxml.de/tutorial.html)

Ultimately we are interested in the pairs, each of which is uniquely identified by an ID (document ID, sentence ID, pair ID). For every pair we need the following fields:

* `id`
* `label` (i.e. `interaction`)
* `sentence`
* `e1_span`
* `e2_span`

Example excerpt from the BioInfer corpus:

```xml
<corpus id="BioInfer">
  ...
  <document id="BioInfer.d1" origId="8001585">
    ...
    <sentence id="BioInfer.d1.s1" origId="235" text="Birch profilin increased the critical concentration required for muscle and brain actin polymerization in a concentration-dependent manner, supporting the notion of the formation of a heterologous complex between the plant protein and animal actin.">
      <entity charOffset="76-86" id="BioInfer.d1.s1.e0" origId="e.235.4" text="brain actin" type="Individual_protein" />
      <entity charOffset="6-13" id="BioInfer.d1.s1.e1" origId="e.235.5" text="profilin" type="Individual_protein" />
      <entity charOffset="65-70,82-86" id="BioInfer.d1.s1.e2" origId="e.235.6" text="muscle actin" type="Individual_protein" />
      <entity charOffset="242-246" id="BioInfer.d1.s1.e3" origId="e.235.7" text="actin" type="Individual_protein" />
      <pair e1="BioInfer.d1.s1.e0" e2="BioInfer.d1.s1.e1" id="BioInfer.d1.s1.p0" interaction="True" />
      <pair e1="BioInfer.d1.s1.e0" e2="BioInfer.d1.s1.e2" id="BioInfer.d1.s1.p1" interaction="False" />
      <pair e1="BioInfer.d1.s1.e0" e2="BioInfer.d1.s1.e3" id="BioInfer.d1.s1.p2" interaction="False" />
      <pair e1="BioInfer.d1.s1.e1" e2="BioInfer.d1.s1.e2" id="BioInfer.d1.s1.p3" interaction="True" />
      <pair e1="BioInfer.d1.s1.e1" e2="BioInfer.d1.s1.e3" id="BioInfer.d1.s1.p4" interaction="True" />
      <pair e1="BioInfer.d1.s1.e2" e2="BioInfer.d1.s1.e3" id="BioInfer.d1.s1.p5" interaction="False" />
    </sentence>
    ...
  </document>
  ...
</corpus>
```
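
The fields listed above can be read from such an excerpt roughly as follows. This is only a minimal sketch: it uses the standard library's `xml.etree.ElementTree` instead of `lxml.etree` so that it runs without extra dependencies, and the `PairRecord` class, the `extract_pairs` helper, and the shortened `Demo` XML string are assumptions made for illustration, not part of the corpus or of this notebook.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical, shortened sentence in the same format as the corpus files.
XML = """
<corpus id="Demo">
  <document id="Demo.d0">
    <sentence id="Demo.d0.s0" text="Profilin binds actin.">
      <entity charOffset="0-7" id="Demo.d0.s0.e0" text="Profilin" />
      <entity charOffset="15-19" id="Demo.d0.s0.e1" text="actin" />
      <pair e1="Demo.d0.s0.e0" e2="Demo.d0.s0.e1" id="Demo.d0.s0.p0" interaction="True" />
    </sentence>
  </document>
</corpus>
"""


@dataclass
class PairRecord:
    id: str
    label: int
    sentence: str
    e1_span: str
    e2_span: str


def extract_pairs(xml_text):
    """Collect one PairRecord per <pair> element."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for sent in root.iter("sentence"):
        # Map entity id -> charOffset once per sentence.
        offsets = {e.get("id"): e.get("charOffset") for e in sent.findall("entity")}
        for pair in sent.findall("pair"):
            records.append(PairRecord(
                id=pair.get("id"),
                label=1 if pair.get("interaction") == "True" else 0,
                sentence=sent.get("text"),
                e1_span=offsets[pair.get("e1")],
                e2_span=offsets[pair.get("e2")],
            ))
    return records


records = extract_pairs(XML)
print(records[0])
```

Building the id-to-offset dictionary once per sentence avoids scanning the entity list for every pair, which may matter for sentences with many entities.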




The overlapping `charOffset`s are currently a problem:

```xml
<entity charOffset="76-86" id="BioInfer.d1.s1.e0" origId="e.235.4" text="brain actin" type="Individual_protein" />
<entity charOffset="65-70,82-86" id="BioInfer.d1.s1.e2" origId="e.235.6" text="muscle actin" type="Individual_protein" />
```
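
To find out how often this happens, the offsets of two entities can be compared fragment by fragment. The sketch below assumes the offset format seen in the excerpt (comma-separated `start-end` fragments with inclusive end positions); the helper names `parse_char_offset` and `spans_overlap` are made up for this example.

```python
def parse_char_offset(char_offset):
    """Turn '65-70,82-86' into [(65, 70), (82, 86)] (end positions inclusive)."""
    spans = []
    for part in char_offset.split(","):
        start, end = part.split("-")
        spans.append((int(start), int(end)))
    return spans


def spans_overlap(offset_a, offset_b):
    """True if any fragment of offset_a shares a character with offset_b."""
    for a_start, a_end in parse_char_offset(offset_a):
        for b_start, b_end in parse_char_offset(offset_b):
            if a_start <= b_end and b_start <= a_end:
                return True
    return False


print(spans_overlap("76-86", "65-70,82-86"))  # brain actin vs. muscle actin -> True
print(spans_overlap("6-13", "242-246"))       # profilin vs. actin -> False
```

Applied to every pair of a corpus, a check like this would show whether the case is a one-off or occurs regularly.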

Possible solutions:

*   For `muscle actin` we take only the `muscle` part and either replace it with `@PROTEIN$` or mark it with `$` or `#`. Our original implementation can handle this; alternatively, we add a special case to our new implementation.
*   We ignore this special case, hoping that it was a one-off occurrence, and simply run our current implementation on it.
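
The first option could look roughly like the following sketch, which drops every fragment of one entity that collides with the other entity. It is an illustration under the same offset-format assumption as above, not the implementation used in this notebook; `drop_overlapping_fragments` is a hypothetical helper, and it returns an empty string if no fragment survives, which the caller would have to handle.

```python
def parse_char_offset(char_offset):
    """Turn '65-70,82-86' into [(65, 70), (82, 86)] (end positions inclusive)."""
    return [tuple(int(x) for x in part.split("-")) for part in char_offset.split(",")]


def drop_overlapping_fragments(offset, other_offset):
    """Keep only the fragments of `offset` that do not touch `other_offset`."""
    other = parse_char_offset(other_offset)
    kept = [
        (start, end)
        for start, end in parse_char_offset(offset)
        if not any(start <= o_end and o_start <= end for o_start, o_end in other)
    ]
    return ",".join(f"{start}-{end}" for start, end in kept)


sentence = ("Birch profilin increased the critical concentration required "
            "for muscle and brain actin polymerization")
trimmed = drop_overlapping_fragments("65-70,82-86", "76-86")
start, end = parse_char_offset(trimmed)[0]
print(trimmed, "->", sentence[start:end + 1])  # 65-70 -> muscle
```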



To work with our files, we can either let Colaboratory access our Google Drive or upload the corpus files ourselves.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
import os
import pandas as pd
from lxml import etree

# Paths to our corpus files
aimedPath = '/content/drive/My Drive/TransferLearning/Trainingsdaten/AIMed-train.xml'
#aimedPath = '/content/AIMed-train.xml'
bioinferPath = '/content/drive/My Drive/TransferLearning/Trainingsdaten/BioInfer-train.xml'  
#bioinferPath = '/content/BioInfer-train.xml'

In [0]:
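def get_entity_charOffset(entities, entity_id):
  '''Return the charOffset of the entity with the given id (None if absent).'''
  for e in entities:
    if e.attrib['id'] == entity_id:
      return e.attrib['charOffset']
  return None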

def process_corpus(xmlFile):
  '''Parse a corpus XML file into a DataFrame with one row per entity pair.'''
  d_id = []     # document id
  s_id = []     # sentence id for grouping and train-dev-test split later on
  p_id = []     # pair id
  label = []    # interaction
  sentence = [] # sentence text
  e1_span = []  # e1 charOffset
  e2_span = []  # e2 charOffset

  for _, doc in etree.iterparse(xmlFile, events=("end",), tag=('document')):
    for sent in doc:
      entities = sent.findall('entity')
      for pair in sent.findall('pair'):
        d_id.append(doc.attrib['id'])
        s_id.append(sent.attrib['id'])
        attribs = pair.attrib
        p_id.append(attribs['id'])
        if attribs['interaction'] == 'True':
          label.append(1)
        else:
          label.append(0)
        sentence.append(sent.attrib['text'])
        e1_span.append(get_entity_charOffset(entities, attribs['e1']))
        e2_span.append(get_entity_charOffset(entities, attribs['e2']))  
      sent.clear()
    doc.clear()
  
  d = {'d_id': d_id, 's_id': s_id, 'p_id': p_id, 'sentence': sentence, 'label': label, 
      'e1_span': e1_span, 'e2_span': e2_span}
  df = pd.DataFrame(data=d)
  return df     

With this we can convert our corpora into pandas DataFrames:

In [0]:
aimed_corpus = process_corpus(aimedPath)

bioinfer_corpus = process_corpus(bioinferPath)

A quick look at our data structure for the AIMed corpus:

In [19]:
# Don't truncate text fields in the display
pd.set_option("display.max_colwidth", 0)

aimed_corpus.head()

Unnamed: 0,d_id,s_id,p_id,sentence,label,e1_span,e2_span
0,AIMed.d0,AIMed.d0.s5,AIMed.d0.s5.p0,"Cytokines measurements during IFN-alpha treatment showed a trend to decreasing levels of IL-4 at 4, 12, and 24 weeks.",0,30-38,89-92
1,AIMed.d0,AIMed.d0.s6,AIMed.d0.s6.p0,Levels of IFN-gamma were slightly increased following IFN-alpha treatment (P = 0.09).,0,10-18,54-62
2,AIMed.d0,AIMed.d0.s7,AIMed.d0.s7.p0,"In patients with a complete response to IFN-alpha, the levels of IFN-gamma were higher at 24 weeks following IFN-alpha treatment than that of pre-treatment (P = 0.04), and the levels of IL-4 decreased markedly at 12 and 24 weeks (P = 0.02, 0.03, respectively). mRNA expression positively correlated with the level of Th1/Th2 type cytokines in the supernatant.",0,40-48,65-73
3,AIMed.d0,AIMed.d0.s7,AIMed.d0.s7.p1,"In patients with a complete response to IFN-alpha, the levels of IFN-gamma were higher at 24 weeks following IFN-alpha treatment than that of pre-treatment (P = 0.04), and the levels of IL-4 decreased markedly at 12 and 24 weeks (P = 0.02, 0.03, respectively). mRNA expression positively correlated with the level of Th1/Th2 type cytokines in the supernatant.",0,40-48,109-117
4,AIMed.d0,AIMed.d0.s7,AIMed.d0.s7.p2,"In patients with a complete response to IFN-alpha, the levels of IFN-gamma were higher at 24 weeks following IFN-alpha treatment than that of pre-treatment (P = 0.04), and the levels of IL-4 decreased markedly at 12 and 24 weeks (P = 0.02, 0.03, respectively). mRNA expression positively correlated with the level of Th1/Th2 type cytokines in the supernatant.",0,40-48,186-189
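
A quick way to sanity-check the span columns is to slice the sentence with them. The sketch below assumes, as the rows above suggest, that the end offset is inclusive; `entity_text` is a hypothetical helper, and the sentence and offsets are copied from row 0.

```python
def entity_text(sentence, char_offset):
    """Join the text of all fragments of a 'start-end[,start-end]' offset."""
    fragments = []
    for part in char_offset.split(","):
        start, end = (int(x) for x in part.split("-"))
        fragments.append(sentence[start:end + 1])  # end offset is inclusive
    return " ".join(fragments)


sentence = ("Cytokines measurements during IFN-alpha treatment showed a trend "
            "to decreasing levels of IL-4 at 4, 12, and 24 weeks.")
print(entity_text(sentence, "30-38"))  # IFN-alpha
print(entity_text(sentence, "89-92"))  # IL-4
```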


## Step 2: Preparing the data for BERT

1. Our main paper: **"A BERT-based Universal Model for Both Within- and Cross-sentence Clinical Temporal Relation Extraction"** marks the positions of the entities with special non-XML tags.
We could, for example, use `ps` (protein start) and `pe` (protein end).
2. Alibaba: **"Enriching Pre-trained Language Model with Entity Information for Relation Classification"** marks the position of the first entity with `$` characters and the position of the second entity with `#` characters.
3. BioBERT: **"BioBERT: a pre-trained biomedical language representation model for biomedical text mining"** anonymizes the entities. In our case we would simply replace the entities with `@PROTEIN$`. Perhaps the entities need to be distinguished, e.g. by a trailing number?

To use these special tokens, `ps` (protein start) and `pe` (protein end), we have to adapt our BioBERT vocabulary.

**HOW?**
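
One option that is often used with BERT-style WordPiece vocabularies is to overwrite reserved `[unusedN]` entries in `vocab.txt` with the new marker tokens, which leaves the vocabulary size and therefore the embedding matrix unchanged. Whether our BioBERT vocabulary actually contains such entries, and whether `ps` or `pe` already exist as ordinary tokens, is an assumption that has to be checked first. The sketch below works on a made-up toy vocabulary; `add_marker_tokens` is a hypothetical helper, not a library function.

```python
def add_marker_tokens(vocab, new_tokens):
    """Return a copy of `vocab` in which reserved [unusedN] slots hold `new_tokens`.

    Tokens that are already part of the vocabulary are left where they are.
    """
    vocab = list(vocab)
    free_slots = [i for i, tok in enumerate(vocab)
                  if tok.startswith("[unused") and tok.endswith("]")]
    missing = [tok for tok in new_tokens if tok not in vocab]
    if len(missing) > len(free_slots):
        raise ValueError("not enough [unusedN] slots for the new tokens")
    for slot, tok in zip(free_slots, missing):
        vocab[slot] = tok
    return vocab


# Toy vocabulary for illustration only; a real vocab.txt has one token per line.
toy_vocab = ["[PAD]", "[unused1]", "[unused2]", "[UNK]", "[CLS]", "[SEP]", "protein"]
new_vocab = add_marker_tokens(toy_vocab, ["ps", "pe"])
print(new_vocab)
```

Because the list keeps its length, the indices of all other tokens stay the same, so pre-trained embeddings would still line up with the vocabulary.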

In any case, we have to adapt our pandas DataFrames to each of these approaches. We then have to split each of them into train, dev, and test sets and save them as .tsv files so that we can work with HuggingFace later on.

The following columns should be sufficient for this:

*   `id`
*   `sentence`
*   `label`
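
For the train/dev/test split it may be worth keeping all pairs of one document (or sentence) in the same split, since pairs from the same sentence share almost all of their text. The following is a minimal sketch of such a grouped split on plain dictionaries; the function name, the 10% dev and test fractions, and the seed are assumptions for illustration, and the same idea would carry over to grouping a DataFrame by `d_id` or `s_id`.

```python
import random


def split_by_group(rows, group_key, dev_frac=0.1, test_frac=0.1, seed=42):
    """Assign whole groups (e.g. documents) to train, dev, or test."""
    groups = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(groups)  # reproducible shuffle of the group ids
    n_test = max(1, round(len(groups) * test_frac))
    n_dev = max(1, round(len(groups) * dev_frac))
    test_groups = set(groups[:n_test])
    dev_groups = set(groups[n_test:n_test + n_dev])

    splits = {"train": [], "dev": [], "test": []}
    for row in rows:
        if row[group_key] in test_groups:
            splits["test"].append(row)
        elif row[group_key] in dev_groups:
            splits["dev"].append(row)
        else:
            splits["train"].append(row)
    return splits


# Made-up rows: 10 documents with 3 pairs each.
rows = [{"d_id": f"Demo.d{i // 3}", "p_id": f"Demo.d{i // 3}.p{i % 3}", "label": i % 2}
        for i in range(30)]
splits = split_by_group(rows, "d_id")
print({name: len(part) for name, part in splits.items()})
```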


In [0]:
# Util functions
def get_span(entity_no, spans):
  '''
  Parses a charOffset string such as '65-70,82-86' into a list of
  (entity_no, start, end) triples. The corpus offsets are inclusive,
  so end is shifted by one to allow Python slicing.
  '''
  span_tuples = []
  for span in spans.split(','):
    limits = span.split('-')
    start = int(limits[0])
    end = int(limits[1])+1
    span_tuples.append((entity_no, start, end))
  return span_tuples

def split_sentence(span_list, sentence, include_entities=False):
  '''
  Returns sentence blocks that are not to be replaced.
  span_list must be sorted by start offset. With include_entities=True
  the entity texts are returned as well, alternating with the other blocks.
  '''
  sentence_array = []
  start_idx = 0
  for triple in span_list:
    sentence_array.append(sentence[start_idx:triple[1]])
    if include_entities:
      sentence_array.append(sentence[triple[1]:triple[2]])
    start_idx = triple[2]
  sentence_array.append(sentence[start_idx:])
  return sentence_array

def export_tsv(df, out):
  '''
  Deletes span columns
  Then exports to out path
  '''
  print(df.head())
  data = df.copy()[['d_id', 's_id','p_id','sentence','label']]
  data.to_csv(out, sep='\t', index=False, header=False)
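
Since the files are written without a header row, whoever reads them back has to know the column order. The sketch below shows one way to do that with the standard `csv` module; the `COLUMNS` list mirrors the selection in `export_tsv`, while `read_pairs_tsv` and the in-memory demo row are assumptions for illustration.

```python
import csv
import io

# Column order as selected in export_tsv (the files carry no header row).
COLUMNS = ["d_id", "s_id", "p_id", "sentence", "label"]


def read_pairs_tsv(handle):
    """Read a header-less TSV into a list of dicts keyed by COLUMNS."""
    reader = csv.reader(handle, delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader]


# In-memory stand-in for an exported file, with one made-up row.
demo = io.StringIO()
writer = csv.writer(demo, delimiter="\t", lineterminator="\n")
writer.writerow(["Demo.d0", "Demo.d0.s0", "Demo.d0.s0.p0",
                 "ps Profilin pe binds ps actin pe.", 1])
demo.seek(0)

rows = read_pairs_tsv(demo)
print(rows[0]["sentence"], rows[0]["label"])
```

Note that `csv.reader` returns every field as a string, so the label comes back as `'1'` and has to be converted before training.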

### 2.1 Approaches 1 & 2: Marking the entity positions

In [0]:
def new_markers(pair, e1_start, e1_end, e2_start, e2_end):
  '''
  pair: DataFrame row containing a pair
  e1_start, e1_end: special tags/symbols marking the position of entity e1
  e2_start, e2_end: special tags/symbols marking the position of entity e2
  '''
  entity_spans = get_span(1, pair['e1_span'])
  entity_spans.extend(get_span(2, pair['e2_span']))
  
  # Idea is to generate span triples and then replace them
  entity_spans.sort(key = lambda trip: trip[1])

  sentence_parts = split_sentence(entity_spans, pair['sentence'], 
                                  include_entities=True)

  idx = 1
  for entity_no, _, _ in entity_spans:
    # TODO Special case in BioInfer with overlapping spans
    sentence_parts.insert(idx, e1_start if entity_no == 1 else e2_start)
    idx += 2  # skip the start marker and the entity text
    sentence_parts.insert(idx, e1_end if entity_no == 1 else e2_end) 
    idx += 2  # skip the end marker and the following text block

  pair['sentence'] = ''.join(sentence_parts)

  return pair

def add_markers(pair, e1_start, e1_end, e2_start, e2_end):
  '''
  pair: DataFrame row containing a pair
  e1_start, e1_end: special tags/symbols marking the position of entity e1
  e2_start, e2_end: special tags/symbols marking the position of entity e2
  Only the first fragment of each entity span is used.
  '''
  # get_span returns a list of (entity_no, start, end) triples
  _, start1, end1 = get_span(1, pair['e1_span'])[0]
  _, start2, end2 = get_span(2, pair['e2_span'])[0]
  # If necessary swap to assure right order
  if start1 > start2:
    tmp1, tmp2 = start1, end1
    start1, end1 = start2, end2
    start2, end2 = tmp1, tmp2

  mod_sent = pair['sentence'][:start1]
  if mod_sent != '' and pair['sentence'][start1-1] != ' ':
    mod_sent += ' '
  mod_sent += f'{e1_start} ' + pair['sentence'][start1:end1] + f' {e1_end}'  
  if pair['sentence'][end1] != ' ':
    mod_sent += ' '
  mod_sent += pair['sentence'][end1:start2]
  if pair['sentence'][start2-1] != ' ':
    mod_sent += ' ' 
  mod_sent += f'{e2_start} ' + pair['sentence'][start2:end2] + f' {e2_end}'
  if pair['sentence'][end2:] != '':
    mod_sent += ' ' + pair['sentence'][end2:]   

  pair['sentence'] = mod_sent
  return pair

def prepare_data(df, e1_start, e1_end, e2_start, e2_end):
  return df.apply(new_markers, args=(e1_start, e1_end, e2_start, e2_end), axis=1)
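
As a cross-check for the marker logic, the same result can be produced by inserting the markers from right to left, so that the offsets of the spans further left stay valid. This is an alternative sketch, not the implementation used above; `parse_spans` and `mark_entities` are hypothetical names, the marker strings are the ones passed to `prepare_data` for the `$`/`#` variant, and overlapping spans are still not resolved here.

```python
def parse_spans(entity_no, char_offset):
    """'10-18' -> [(entity_no, 10, 19)]; the end is made exclusive for slicing."""
    spans = []
    for part in char_offset.split(","):
        start, end = (int(x) for x in part.split("-"))
        spans.append((entity_no, start, end + 1))
    return spans


def mark_entities(sentence, e1_span, e2_span,
                  e1_marks=("$ ", " $"), e2_marks=("# ", " #")):
    """Wrap every fragment of e1 and e2 in its start/end markers."""
    spans = parse_spans(1, e1_span) + parse_spans(2, e2_span)
    # Work from right to left so earlier offsets stay valid after each insertion.
    for entity_no, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        open_mark, close_mark = e1_marks if entity_no == 1 else e2_marks
        sentence = (sentence[:start] + open_mark + sentence[start:end]
                    + close_mark + sentence[end:])
    return sentence


sentence = ("Levels of IFN-gamma were slightly increased following "
            "IFN-alpha treatment (P = 0.09).")
print(mark_entities(sentence, "10-18", "54-62"))
```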

#### 2.1.1: Our main paper: "_A BERT-based Universal Model for Both Within- and Cross-sentence Clinical Temporal Relation Extraction_"
Here we mark the entities with `ps` (protein start) and `pe` (protein end).
We would have to look at the vocabulary and possibly adapt it.

In [72]:
lin_aimed_corpus = prepare_data(aimed_corpus.copy(), 'ps ', ' pe', 'ps ', ' pe')
lin_bioinfer_corpus = prepare_data(bioinfer_corpus.copy(), 'ps ', ' pe', 'ps ', ' pe')
lin_bioinfer_corpus.head()

Unnamed: 0,d_id,s_id,p_id,sentence,label,e1_span,e2_span
0,BioInfer.d0,BioInfer.d0.s0,BioInfer.d0.s0.p0,ps alpha-catenin pe inhibits beta-catenin signaling by preventing formation of a beta-catenin*ps T-cell factor pe*DNA complex.,1,88-100,0-12
1,BioInfer.d0,BioInfer.d0.s0,BioInfer.d0.s0.p1,alpha-catenin inhibits ps beta-catenin pe signaling by preventing formation of a beta-catenin*ps T-cell factor pe*DNA complex.,1,88-100,23-34
2,BioInfer.d0,BioInfer.d0.s0,BioInfer.d0.s0.p2,alpha-catenin inhibits beta-catenin signaling by preventing formation of a ps beta-catenin pe*ps T-cell factor pe*DNA complex.,1,88-100,75-86
3,BioInfer.d0,BioInfer.d0.s0,BioInfer.d0.s0.p3,ps alpha-catenin pe inhibits ps beta-catenin pe signaling by preventing formation of a beta-catenin*T-cell factor*DNA complex.,1,0-12,23-34
4,BioInfer.d0,BioInfer.d0.s0,BioInfer.d0.s0.p4,ps alpha-catenin pe inhibits beta-catenin signaling by preventing formation of a ps beta-catenin pe*T-cell factor*DNA complex.,1,0-12,75-86


Export:

In [74]:
export_tsv(lin_aimed_corpus, '/content/drive/My Drive/TransferLearning/Trainingsdaten/lin_aimed_train.tsv')
export_tsv(lin_bioinfer_corpus, '/content/drive/My Drive/TransferLearning/Trainingsdaten/lin_bioinfer_train.tsv')

       d_id         s_id            p_id  ... label  e1_span  e2_span
0  AIMed.d0  AIMed.d0.s5  AIMed.d0.s5.p0  ...  0     30-38    89-92  
1  AIMed.d0  AIMed.d0.s6  AIMed.d0.s6.p0  ...  0     10-18    54-62  
2  AIMed.d0  AIMed.d0.s7  AIMed.d0.s7.p0  ...  0     40-48    65-73  
3  AIMed.d0  AIMed.d0.s7  AIMed.d0.s7.p1  ...  0     40-48    109-117
4  AIMed.d0  AIMed.d0.s7  AIMed.d0.s7.p2  ...  0     40-48    186-189

[5 rows x 7 columns]
          d_id            s_id               p_id  ... label  e1_span e2_span
0  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p0  ...  1     88-100   0-12  
1  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p1  ...  1     88-100   23-34 
2  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p2  ...  1     88-100   75-86 
3  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p3  ...  1     0-12     23-34 
4  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p4  ...  1     0-12     75-86 

[5 rows x 7 columns]


#### 2.1.2: Alibaba: "_Enriching Pre-trained Language Model with Entity Information for Relation Classification_"
They mark the position of the first entity with `$` characters and the position of the second entity with `#` characters.


In [77]:
ali_aimed_corpus = prepare_data(aimed_corpus.copy(), '$ ', ' $', '# ', ' #')
ali_bioinfer_corpus = prepare_data(bioinfer_corpus.copy(), '$ ', ' $', '# ', ' #')
ali_aimed_corpus.head()

Unnamed: 0,d_id,s_id,p_id,sentence,label,e1_span,e2_span
0,AIMed.d0,AIMed.d0.s5,AIMed.d0.s5.p0,"Cytokines measurements during $ IFN-alpha $ treatment showed a trend to decreasing levels of # IL-4 # at 4, 12, and 24 weeks.",0,30-38,89-92
1,AIMed.d0,AIMed.d0.s6,AIMed.d0.s6.p0,Levels of $ IFN-gamma $ were slightly increased following # IFN-alpha # treatment (P = 0.09).,0,10-18,54-62
2,AIMed.d0,AIMed.d0.s7,AIMed.d0.s7.p0,"In patients with a complete response to $ IFN-alpha $, the levels of # IFN-gamma # were higher at 24 weeks following IFN-alpha treatment than that of pre-treatment (P = 0.04), and the levels of IL-4 decreased markedly at 12 and 24 weeks (P = 0.02, 0.03, respectively). mRNA expression positively correlated with the level of Th1/Th2 type cytokines in the supernatant.",0,40-48,65-73
3,AIMed.d0,AIMed.d0.s7,AIMed.d0.s7.p1,"In patients with a complete response to $ IFN-alpha $, the levels of IFN-gamma were higher at 24 weeks following # IFN-alpha # treatment than that of pre-treatment (P = 0.04), and the levels of IL-4 decreased markedly at 12 and 24 weeks (P = 0.02, 0.03, respectively). mRNA expression positively correlated with the level of Th1/Th2 type cytokines in the supernatant.",0,40-48,109-117
4,AIMed.d0,AIMed.d0.s7,AIMed.d0.s7.p2,"In patients with a complete response to $ IFN-alpha $, the levels of IFN-gamma were higher at 24 weeks following IFN-alpha treatment than that of pre-treatment (P = 0.04), and the levels of # IL-4 # decreased markedly at 12 and 24 weeks (P = 0.02, 0.03, respectively). mRNA expression positively correlated with the level of Th1/Th2 type cytokines in the supernatant.",0,40-48,186-189


Export:

In [79]:
export_tsv(ali_aimed_corpus, '/content/drive/My Drive/TransferLearning/Trainingsdaten/ali_aimed_train.tsv')
export_tsv(ali_bioinfer_corpus, '/content/drive/My Drive/TransferLearning/Trainingsdaten/ali_bioinfer_train.tsv')

       d_id         s_id            p_id  ... label  e1_span  e2_span
0  AIMed.d0  AIMed.d0.s5  AIMed.d0.s5.p0  ...  0     30-38    89-92  
1  AIMed.d0  AIMed.d0.s6  AIMed.d0.s6.p0  ...  0     10-18    54-62  
2  AIMed.d0  AIMed.d0.s7  AIMed.d0.s7.p0  ...  0     40-48    65-73  
3  AIMed.d0  AIMed.d0.s7  AIMed.d0.s7.p1  ...  0     40-48    109-117
4  AIMed.d0  AIMed.d0.s7  AIMed.d0.s7.p2  ...  0     40-48    186-189

[5 rows x 7 columns]
          d_id            s_id               p_id  ... label  e1_span e2_span
0  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p0  ...  1     88-100   0-12  
1  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p1  ...  1     88-100   23-34 
2  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p2  ...  1     88-100   75-86 
3  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p3  ...  1     0-12     23-34 
4  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p4  ...  1     0-12     75-86 

[5 rows x 7 columns]


### 2.2: BioBERT: "_BioBERT: a pre-trained biomedical language representation model for biomedical text mining_"
They anonymize the entities. In our case we would simply replace the entities with `@PROTEIN$`. Alternatively, we could simply use Mario's "blind" files.

In [0]:
def new_anonymize(pair, anon1, anon2):
  '''
  pair: DataFrame row containing a pair
  anon1: to anonymize entity 1
  anon2: to anonymize entity 2
  '''
  entity_spans = get_span(1, pair['e1_span'])
  entity_spans.extend(get_span(2, pair['e2_span']))
  
  # Idea is to generate span triples and then replace them
  entity_spans.sort(key = lambda trip: trip[1])

  sentence_parts = split_sentence(entity_spans, pair['sentence'])

  idx = 1
  for entity_no, _, _ in entity_spans:
    # TODO Special case in BioInfer with overlapping spans
    sentence_parts.insert(idx, anon1 if entity_no == 1 else anon2) 
    idx += 2  # skip the placeholder and the following text block

  pair['sentence'] = ''.join(sentence_parts)
  return pair

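def anonymize_entities(pair, anon):
  '''
  pair: DataFrame row containing a pair
  anon: string that replaces the entity 
  Only the first fragment of each entity span is used.
  '''
  # get_span returns a list of (entity_no, start, end) triples
  _, start1, end1 = get_span(1, pair['e1_span'])[0]
  _, start2, end2 = get_span(2, pair['e2_span'])[0]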
  # If necessary swap to assure right order
  if start1 > start2:
    tmp1, tmp2 = start1, end1
    start1, end1 = start2, end2
    start2, end2 = tmp1, tmp2

  mod_sent = pair['sentence'][:start1] + anon + pair['sentence'][end1:start2] + anon
  if pair['sentence'][end2:] != '':
    mod_sent += pair['sentence'][end2:]
  pair['sentence'] = mod_sent
  return pair

def anonymize_data(df, anon1, anon2):
  return df.apply(new_anonymize, anon1=anon1, anon2=anon2, axis=1)

In [82]:
lee_aimed_corpus = anonymize_data(aimed_corpus.copy(), '@PROTEIN1$', '@PROTEIN2$')
lee_bioinfer_corpus = anonymize_data(bioinfer_corpus.copy(), '@PROTEIN1$', '@PROTEIN2$')
lee_aimed_corpus.head()

Unnamed: 0,d_id,s_id,p_id,sentence,label,e1_span,e2_span
0,AIMed.d0,AIMed.d0.s5,AIMed.d0.s5.p0,"Cytokines measurements during @PROTEIN1$ treatment showed a trend to decreasing levels of @PROTEIN2$ at 4, 12, and 24 weeks.",0,30-38,89-92
1,AIMed.d0,AIMed.d0.s6,AIMed.d0.s6.p0,Levels of @PROTEIN1$ were slightly increased following @PROTEIN2$ treatment (P = 0.09).,0,10-18,54-62
2,AIMed.d0,AIMed.d0.s7,AIMed.d0.s7.p0,"In patients with a complete response to @PROTEIN1$, the levels of @PROTEIN2$ were higher at 24 weeks following IFN-alpha treatment than that of pre-treatment (P = 0.04), and the levels of IL-4 decreased markedly at 12 and 24 weeks (P = 0.02, 0.03, respectively). mRNA expression positively correlated with the level of Th1/Th2 type cytokines in the supernatant.",0,40-48,65-73
3,AIMed.d0,AIMed.d0.s7,AIMed.d0.s7.p1,"In patients with a complete response to @PROTEIN1$, the levels of IFN-gamma were higher at 24 weeks following @PROTEIN2$ treatment than that of pre-treatment (P = 0.04), and the levels of IL-4 decreased markedly at 12 and 24 weeks (P = 0.02, 0.03, respectively). mRNA expression positively correlated with the level of Th1/Th2 type cytokines in the supernatant.",0,40-48,109-117
4,AIMed.d0,AIMed.d0.s7,AIMed.d0.s7.p2,"In patients with a complete response to @PROTEIN1$, the levels of IFN-gamma were higher at 24 weeks following IFN-alpha treatment than that of pre-treatment (P = 0.04), and the levels of @PROTEIN2$ decreased markedly at 12 and 24 weeks (P = 0.02, 0.03, respectively). mRNA expression positively correlated with the level of Th1/Th2 type cytokines in the supernatant.",0,40-48,186-189
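
In the rows above each placeholder appears exactly once. Entities with several fragments, or with overlapping spans as in the BioInfer special case, may break that, so a simple check over the anonymized sentences can help to locate such rows. The sketch below is one possible check; `placeholder_problems` is a hypothetical helper and the second example sentence is made up.

```python
def placeholder_problems(sentence, placeholders=("@PROTEIN1$", "@PROTEIN2$")):
    """Return a list of human-readable problems found in an anonymized sentence."""
    problems = []
    for placeholder in placeholders:
        count = sentence.count(placeholder)
        if count != 1:
            problems.append(f"{placeholder} occurs {count} times")
    return problems


ok = "Levels of @PROTEIN1$ were slightly increased following @PROTEIN2$ treatment (P = 0.09)."
suspicious = "required for @PROTEIN2$ and @PROTEIN1$ @PROTEIN2$ polymerization"

print(placeholder_problems(ok))          # []
print(placeholder_problems(suspicious))  # ['@PROTEIN2$ occurs 2 times']
```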


Export:

In [83]:
export_tsv(lee_aimed_corpus, '/content/drive/My Drive/TransferLearning/Trainingsdaten/lee_aimed_train.tsv')
export_tsv(lee_bioinfer_corpus, '/content/drive/My Drive/TransferLearning/Trainingsdaten/lee_bioinfer_train.tsv')

       d_id         s_id            p_id  ... label  e1_span  e2_span
0  AIMed.d0  AIMed.d0.s5  AIMed.d0.s5.p0  ...  0     30-38    89-92  
1  AIMed.d0  AIMed.d0.s6  AIMed.d0.s6.p0  ...  0     10-18    54-62  
2  AIMed.d0  AIMed.d0.s7  AIMed.d0.s7.p0  ...  0     40-48    65-73  
3  AIMed.d0  AIMed.d0.s7  AIMed.d0.s7.p1  ...  0     40-48    109-117
4  AIMed.d0  AIMed.d0.s7  AIMed.d0.s7.p2  ...  0     40-48    186-189

[5 rows x 7 columns]
          d_id            s_id               p_id  ... label  e1_span e2_span
0  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p0  ...  1     88-100   0-12  
1  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p1  ...  1     88-100   23-34 
2  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p2  ...  1     88-100   75-86 
3  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p3  ...  1     0-12     23-34 
4  BioInfer.d0  BioInfer.d0.s0  BioInfer.d0.s0.p4  ...  1     0-12     75-86 

[5 rows x 7 columns]
