# Importing the data

## Raw data

In [1]:
path = 'data/aimed'

# There are ten provided train/test splits
train_split_file = 'train-203-1'
test_split_file = 'test-203-1'

In [2]:
def get_file_names(path, split_file):
    files = []
    with open('%s/%s' % (path + '/splits', split_file)) as f:
        for line in f:
            files.append(line.rstrip())
    return files

In [3]:
train_files = get_file_names(path, train_split_file)
test_files = get_file_names(path, test_split_file)

In [4]:
print("%d abstracts in training set and %d abstracts in test set." % (len(train_files), len(test_files)))

203 abstracts in training set and 22 abstracts in test set.


In [5]:
def get_texts(path, files):
    texts = []
    for file in files:
        with open('%s/%s' % (path, file)) as f:
            texts.append(f.read().replace('\n', ''))
    return texts

In [6]:
train_texts = get_texts(path, train_files)
test_texts = get_texts(path, test_files)

Let's have a look at one of the files.

In [19]:
train_texts[2]

"TI - Analysis of the oligomerization of <prot>  myogenin </prot>  and <prot>  E2A </prot>  products in vivo using a two - hybrid assay system .PG - 17498 - 501 AB - Members of the helix - loop - helix ( HLH ) family of proteins bind DNA and activate transcription as homo - and heterodimers . <p1  pair=1 >  <p1  pair=2 >  <prot> Myogenin </prot>  </p1>  </p1>  is a muscle - specific HLH protein that binds DNA in vitro as a heterodimer with several widely expressed HLH proteins , such as the <prot>  E2A </prot>  gene products <p2  pair=1 >  <prot>  E12 </prot>  </p2>  and <p2  pair=2 >  <prot>  E47 </prot>  </p2>  .We describe a method for detection of protein - protein interactions among HLH proteins in vivo in which dimerization through the HLH motif reconstructs a hybrid transcription factor containing the DNA - binding domain of yeast GAL4 linked to one HLH motif and the activation domain of <prot>  VP - 16 </prot>  linked to another .We have used this assay to investiagate whether 

The texts are annotated with XML-like tags: proteins and genes are wrapped in *prot* tags, and proteins that take part in a relationship are additionally wrapped in *p1*/*p2* tags that carry a relationship id in the `pair` attribute.

I would like to convert the *prot* tags to spaCy entities, and extract the relationships into a separate gold standard object.

## Extract Gold Standard Pairs
Proteins involved in a relationship are enclosed in `<p1  pair=x>` or `<p2  pair=x>` tags.
One protein can be involved in multiple relationships, both as p1 and as p2.
A tricky example is `<p1  pair=1 >  <p1  pair=2 >  <prot> Myogenin </prot>  </p1>  </p1>`, which cannot be parsed reliably with a regular expression because it contains nested occurrences of p1.
I will parse the texts as XML instead.
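As a minimal sketch of why an XML parser copes with the nesting, the snippet below parses a hand-written fragment modelled on the Myogenin example. The fragment is an assumption made for illustration: its `pair` values are already quoted and it is wrapped in a `body` root so that it is well-formed XML.

```python
from xml.etree import ElementTree as ET

# Hypothetical fragment modelled on the nested example above.
# The pair ids are quoted and a root element is added so it parses.
snippet = (
    "<body><p1 pair='1'> <p1 pair='2'> "
    "<prot> Myogenin </prot> </p1> </p1></body>"
)

root = ET.fromstring(snippet)

# Walk all elements in document order and record tag and pair id.
nesting = [(elem.tag, elem.attrib.get('pair')) for elem in root.iter()]

# The protein text sits in the innermost prot element.
protein = root.find('.//prot').text.strip()

print(nesting)
print(protein)
```

Both `p1` elements are visible to the parser as separate nodes, each with its own `pair` id, which is what the extraction step needs.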

In [9]:
import re

In [34]:
from xml.etree import ElementTree as ET

Attribute values must be quoted for xml.etree to accept them, and each document needs a single root element. Let's fix that.

In [69]:
def xmlify(text):
    # Quote the pair ids (raw strings keep \g<n> intact) and add a root element
    quoted = re.sub(pattern=r'<p([12])  pair=(\d+)', repl=r"<p\g<1>  pair='\g<2>'", string=text)
    return "<body>" + quoted + "</body>"
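To illustrate the effect of the quoting step, here is a small self-contained sketch. The function `xmlify_sketch` is a hypothetical stand-in that performs the same substitution as `xmlify`, and the input string is a shortened fragment written for this example rather than a line from the corpus.

```python
import re
from xml.etree import ElementTree as ET

def xmlify_sketch(text):
    # Hypothetical stand-in for xmlify: quote pair ids, add a root element.
    quoted = re.sub(r'<p([12])  pair=(\d+)', r"<p\g<1>  pair='\g<2>'", text)
    return "<body>" + quoted + "</body>"

raw = "<p1  pair=1 >  <prot> Myogenin </prot>  </p1>"

# Without quoting, the unquoted attribute value makes the parser fail.
try:
    ET.fromstring("<body>" + raw + "</body>")
    parsed_unquoted = True
except ET.ParseError:
    parsed_unquoted = False

# After quoting, the same fragment parses and exposes the pair id.
tree = ET.fromstring(xmlify_sketch(raw))
print(parsed_unquoted, tree[0].tag, tree[0].attrib)
```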

In [70]:
train_texts = [xmlify(text) for text in train_texts]
test_texts = [xmlify(text) for text in test_texts]

In [89]:
train_trees = []
test_trees = []

In [90]:
for text in train_texts:
    train_trees.append(ET.fromstring(text))

for text in test_texts:
    test_trees.append(ET.fromstring(text))

In [139]:
def extract_edges(tree):
    """
    Args:
        tree - Document tree from which to extract gold standard relationships
    Returns:
        List of (p1_text, p2_text) tuples of related proteins
    """
    
    pairs = []
    
    # Generate list of (p1/p2, pair_id, protein_text) tuples
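    # Iterate over the element directly; Element.getchildren() was removed in Python 3.9
    for child in tree:
        if child.tag in ('p1', 'p2'):
            pairs += match_children(child, [(child.tag, child.attrib['pair'])])

    # Generate tuples of protein texts.
    # After sorting, all p1 entries come before all p2 entries and both halves are
    # ordered by pair id, so zipping the halves matches each p1 with its p2.
    # This assumes every pair id occurs exactly once as p1 and once as p2.
    pairs = sorted(pairs)
    return [(p1[2], p2[2]) for p1, p2 in zip(pairs[:len(pairs)//2], pairs[len(pairs)//2:])]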

def match_children(pair, pairs):
    # A pair tag only has one child, which is either another pair tag or a protein.
    # In some rare cases it has no children, just text. In this case consider the text a protein.
    try:
        child = pair[0]
    except IndexError:
        return [(*p, pair.text.strip()) for p in pairs]
    
    if child.tag == 'prot':
        return [(*p, child.text.strip()) for p in pairs]
    else:
        return match_children(child, pairs + [(child.tag, child.attrib['pair'])])
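The extraction above relies on sort order to line up each p1 with its p2. As a point of comparison, here is a sketch of a different technique that groups mentions by pair id in a dictionary. It is not the method used in this notebook: `extract_edges_by_id` is a hypothetical name, the sample document is hand-written, and the sketch assumes that the text inside a pair tag is exactly the protein name.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

def extract_edges_by_id(tree):
    """Return (p1_text, p2_text) tuples, matched through the pair id."""
    members = defaultdict(dict)
    for elem in tree.iter():
        if elem.tag in ('p1', 'p2'):
            # itertext() also collects text from nested pair and prot tags
            text = ''.join(elem.itertext()).strip()
            members[elem.attrib['pair']][elem.tag] = text
    # Keep only the ids for which both sides were found
    return [(m['p1'], m['p2']) for m in members.values()
            if 'p1' in m and 'p2' in m]

# Hand-written sample modelled on the abstract shown earlier
sample = (
    "<body>"
    "<p1  pair='1' >  <p1  pair='2' >  <prot> Myogenin </prot>  </p1>  </p1> binds "
    "<p2  pair='1' >  <prot>  E12 </prot>  </p2>  and "
    "<p2  pair='2' >  <prot>  E47 </prot>  </p2>"
    "</body>"
)

edges = extract_edges_by_id(ET.fromstring(sample))
print(edges)
```

Grouping by id avoids depending on every id having exactly one p1 and one p2, at the cost of silently dropping incomplete pairs, so the two approaches may disagree on malformed documents.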

In [140]:
protein_pairs = []

In [142]:
for tree in train_trees:
    protein_pairs += extract_edges(tree)

In [143]:
len(protein_pairs), len(set(protein_pairs))

(935, 671)

Cool! 935 mentioned pairs, and 671 unique pairs.

## Counting the proteins and relations

In [145]:
protein_pairs = set(protein_pairs)

In [144]:
proteins = set()

In [146]:
for p1, p2 in protein_pairs:
    proteins.add(p1)
    proteins.add(p2)

In [147]:
len(proteins)

536

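536 unique protein names, and 671 relations between them.
That is about 2.5 relation endpoints per protein on average (2 × 671 / 536), so most proteins can only be part of a single relationship if a small number of proteins account for many relations.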
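The totals alone do not show how relations are spread over proteins. A minimal sketch of how this could be checked is to count, for each protein, the number of pairs it appears in. The `toy_pairs` set below is made-up placeholder data standing in for `protein_pairs`, so the numbers it produces say nothing about the corpus.

```python
from collections import Counter

# Made-up placeholder pairs; in the notebook this would be protein_pairs
toy_pairs = {('A', 'B'), ('A', 'C'), ('B', 'C'), ('D', 'E')}

# Count how many pairs each protein takes part in
degree = Counter()
for p1, p2 in toy_pairs:
    degree[p1] += 1
    degree[p2] += 1

# Number of proteins that appear in exactly one pair
single = sum(1 for n in degree.values() if n == 1)

print("%d of %d proteins appear in exactly one pair" % (single, len(degree)))
```

Note that a pair whose two sides carry the same text would count twice for that protein in this sketch.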