# Antibody Specification

### Create Training dataset

**Step 1**  
Get a list containing PMCID and PMID from ```pmcids-pmids.txt```

In [1]:
with open('resources/pmcids-pmids.txt', 'r') as file:
    lines = file.readlines()

Each line holds a PMID and a PMCID separated by a tab.

In [2]:
list_of_pmids_and_pmcids = []

In [3]:
for line in lines:
    sep_line = line.split('\t')
    pmid = sep_line[0]
    pmcid = sep_line[1].replace('\n', '')
    
    list_of_pmids_and_pmcids.append({ 'pmid': pmid, 'pmcid': pmcid })
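The loop above assumes every line has exactly two tab-separated fields. As a minimal sketch, assuming the file may also contain blank or malformed lines, the same parsing can be made more defensive. The helper name `parse_id_pairs` and the sample IDs are hypothetical and used only for illustration.

```python
import io


def parse_id_pairs(handle):
    """Return a list of {'pmid', 'pmcid'} dicts from tab-separated lines."""
    pairs = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split('\t')
        if len(fields) < 2:
            continue  # skip lines without both identifiers
        pairs.append({'pmid': fields[0], 'pmcid': fields[1]})
    return pairs


# Hypothetical sample standing in for resources/pmcids-pmids.txt
sample = io.StringIO("111\tPMC1\n\n222\tPMC2\nbroken-line\n")
pairs = parse_id_pairs(sample)
print(pairs)
```

Passing an open file handle instead of the `StringIO` object works the same way, since both are iterated line by line.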

**Step 2**  
Find the snippets in each nxml file

In [5]:
from xml.etree import ElementTree
from tqdm import trange
import pprint
import re
from nltk.tokenize import sent_tokenize
import nltk

In [6]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ploy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Look only at the ```<p>``` elements. Any other tags nested inside a ```<p>``` are converted back to a string and the XML tags are stripped out, leaving plain text.

From that text, extract the sentences that match one of these regex patterns:
- (S|s)pecific
- (B|b)ackground staining
- (C|c)ross( |-)reactiv
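To get a feel for what the three patterns listed above catch, here is a small sketch that runs them against a few made-up sentences. The sentences and the helper `matched_patterns` are hypothetical. Note that `(S|s)pecific` is a substring match, so it also fires on words such as "nonspecific" or "specificity".

```python
import re

# The three patterns from the list above
patterns = ['(S|s)pecific', '(B|b)ackground staining', '(C|c)ross( |-)reactiv']


def matched_patterns(sentence):
    """Return the patterns that occur somewhere in the sentence."""
    return [p for p in patterns if re.search(p, sentence)]


# Hypothetical example sentences
examples = [
    'The antibody showed nonspecific binding.',
    'Background staining was minimal.',
    'No cross-reactivity was observed.',
    'Cells were washed twice.',
]
for sentence in examples:
    print(sentence, '->', matched_patterns(sentence))
```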

In [7]:
def remove_xml_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)
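One thing to be aware of with the regex approach: `ElementTree.tostring` escapes characters such as `&`, so the stripped text still contains entities like `&amp;`. A possible alternative, shown here only as a sketch, is `Element.itertext()`, which yields the unescaped text of the element and its children. The sample paragraph is made up.

```python
import re
from xml.etree import ElementTree


def remove_xml_tags(text):
    # Same regex as in the cell above
    return re.sub(r'<.*?>', '', text)


# Hypothetical paragraph with a nested tag and an escaped ampersand
node = ElementTree.fromstring(
    '<p>Anti-<italic>GFP</italic> &amp; anti-RFP were used.</p>'
)

via_regex = remove_xml_tags(ElementTree.tostring(node).decode('utf-8'))
via_itertext = ''.join(node.itertext())

print(via_regex)
print(via_itertext)
```

Whether the escaped entities matter depends on how the snippets are used later; for keyword matching on the three patterns they make no difference.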

In [8]:
def extract_snippets(text):
    """
    extract snippets from each paragraph
    """
    snippets = []
    define_words = ['(S|s)pecific', '((B|b)ackground staining)', '(C|c)ross( |-)reactiv']
    # split sentences from text
    split_texts = sent_tokenize(text)
    for word in define_words:
        snippet = []
        # find snippet which contains define_words
        for s_index in range(len(split_texts)):
            word_contain = re.findall(r"([^.]*?%s[^.]*\.)" % word, split_texts[s_index])
            if len(word_contain) != 0:
                snip = ''
                if s_index - 1 >= 0:
                    snip = snip + split_texts[s_index-1] + '\n'
                snip = snip + split_texts[s_index] + '\n'
                if s_index + 1 < len(split_texts):
                    snip = snip + split_texts[s_index+1] + '\n'
                    
                snippet.append(snip)
        if len(snippet) != 0:
            snippets.append(set(snippet))
    if len(snippets) != 0:
        return snippets
    return None
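The core idea in `extract_snippets` is a context window: each matching sentence is kept together with the sentence before and after it. The sketch below isolates that idea on an already split list of sentences. It is a simplification, not the function above: it uses `re.search` instead of the period-anchored `re.findall`, joins with spaces rather than newlines, and the name `sentence_windows` and the sample sentences are hypothetical.

```python
import re


def sentence_windows(sentences, pattern):
    """Return each matching sentence joined with its immediate neighbours."""
    windows = []
    for i, sentence in enumerate(sentences):
        if re.search(pattern, sentence):
            start = max(i - 1, 0)               # previous sentence, if any
            end = min(i + 2, len(sentences))    # next sentence, if any
            windows.append(' '.join(sentences[start:end]))
    return windows


# Hypothetical, already tokenized paragraph
sentences = [
    'Sections were blocked.',
    'The antibody is specific for the target.',
    'Signal was quantified.',
    'Slides were imaged.',
]
print(sentence_windows(sentences, '(S|s)pecific'))
```

Slicing with clamped bounds replaces the two explicit index checks, so the first and last sentences of a paragraph are handled without special cases.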

In [9]:
snippet_list = []

def find_paragraph(node):
    """
    find snippets in <p>
    """
    global snippet_list
    if node.tag == 'p':
        # convert all contents in <p> to string
        xml_str = ElementTree.tostring(node).decode('utf-8')
        text = remove_xml_tags(xml_str)

        if node.text is not None:
            snippets = extract_snippets(text)
            if snippets is not None:
                snippet_list.append(snippets)
    for child in node:
        find_paragraph(child)
    
    return snippet_list

In [10]:
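def get_snippets(tree):
    """
    get snippets from each file
    """
    global snippet_list
    snippets = []
    node = tree.find('./body')
    # a file without <body> has nothing to extract
    if node is None:
        return None

    for elem in node:
        snippet = find_paragraph(elem)
        snippets.extend(snippet)
        # reset the shared accumulator before the next top-level element
        snippet_list = []
        
    # check the accumulated list, not only the last element's result
    if len(snippets) != 0:
        return snippets
    return None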
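The recursive walk with a global accumulator can also be expressed with `Element.iter('p')`, which visits every `<p>` below a node in document order. The following is a sketch of that alternative, not a drop-in replacement: it returns paragraph texts rather than snippets, it does not skip paragraphs whose `node.text` is `None`, and the name `paragraph_texts` and the sample XML are hypothetical.

```python
from xml.etree import ElementTree


def paragraph_texts(tree):
    """Return the plain text of every <p> under <body>, in document order."""
    body = tree.find('./body')
    if body is None:
        return []
    return [''.join(p.itertext()) for p in body.iter('p')]


# Hypothetical miniature article
xml = (
    '<article><front><p>Not in body.</p></front>'
    '<body><sec><p>One <b>two</b>.</p></sec><p>Three.</p></body></article>'
)
tree = ElementTree.ElementTree(ElementTree.fromstring(xml))
texts = paragraph_texts(tree)
print(texts)
```

Because no module-level state is involved, the function can be called on many files in a row without resetting anything in between.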

Path to the paper resources

In [11]:
resources_path = 'resources/papers_4chunnan/'

In [12]:
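def clean_snippet(snip):
    """
    put the snippet on one line and drop the trailing space
    """
    snip = snip.replace('\n', ' ')
    # every snippet ends with '\n', which is now a trailing space
    return snip[:-1]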

```outputs``` will contain the dicts that we save to the ```.tsv``` file later.

In [13]:
outputs = []

To parse a file, pass an open file handle to ```ElementTree.parse()```.  
It reads the data, parses the XML, and returns an ElementTree object.
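As a small illustration of the call described above, the sketch below parses XML from a file-like object and returns `None` when the content is not well-formed. The wrapper `load_tree` and the sample documents are hypothetical; `ElementTree.ParseError` is the exception the standard library raises for malformed XML.

```python
import io
from xml.etree import ElementTree


def load_tree(handle):
    """Parse XML from an open handle; return None if it is malformed."""
    try:
        return ElementTree.parse(handle)
    except ElementTree.ParseError:
        return None


# Hypothetical in-memory stand-ins for .nxml files
good = load_tree(io.StringIO('<article><body><p>Hello.</p></body></article>'))
bad = load_tree(io.StringIO('<article><body>'))

print(good.getroot().tag)
print(bad)
```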

In [15]:
for index in trange(len(list_of_pmids_and_pmcids), desc='reading and finding snippets in file'):
    with open(resources_path + list_of_pmids_and_pmcids[index]['pmcid'] + '.nxml', 'rt') as file:
        tree = ElementTree.parse(file)
        snippets = get_snippets(tree)
        if snippets is not None and len(snippets) != 0:
            # snippets: one list per paragraph, holding one set per matched pattern
            for paragraph_snips in snippets:
                for pattern_snips in paragraph_snips:
                    for snippet_text in pattern_snips:
                        outputs.append(
                            { 
                              'pmid': list_of_pmids_and_pmcids[index]['pmid'], 
                              'pmcid': list_of_pmids_and_pmcids[index]['pmcid'], 
                              'snippet': clean_snippet(snippet_text)
                            }
                        )

reading and finding snippets in file: 100%|██████████| 2223/2223 [13:44<00:00,  2.70it/s]


In [17]:
len(outputs)

11342

**Step 3**  
Write the outputs to a ```.tsv``` file.  
The row pattern is ```PMID\tPMCID\tSnippet\tAntibody related?\tSpecificity?\n```    
The ```Antibody related?``` and ```Specificity?``` columns are left empty.

In [18]:
file = open('train_ex_antibody.tsv', 'a')

In [19]:
for article_index in trange(len(outputs), desc='writing to file '):
    # two empty trailing columns: Antibody related? and Specificity?
    file.write('%s\t%s\t%s\t\t\n' % (outputs[article_index]['pmid'], 
                                     outputs[article_index]['pmcid'], 
                                     outputs[article_index]['snippet']))

writing to file : 100%|██████████| 11342/11342 [00:00<00:00, 327936.61it/s]


In [20]:
file.close()
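Formatting rows by hand works as long as no snippet contains a tab character. As a hedged alternative, the standard `csv` module can write the five-column row format described in Step 3 and quote any field that would otherwise break the layout. This is a sketch: the helper `write_rows` and the sample rows are hypothetical, and an in-memory buffer stands in for the real output file.

```python
import csv
import io


def write_rows(handle, rows):
    """Write pmid, pmcid, snippet and two empty label columns as TSV."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    for row in rows:
        writer.writerow([row['pmid'], row['pmcid'], row['snippet'], '', ''])


# Hypothetical rows; the second snippet contains a tab on purpose
rows = [
    {'pmid': '111', 'pmcid': 'PMC1', 'snippet': 'A specific band.'},
    {'pmid': '222', 'pmcid': 'PMC2', 'snippet': 'Cross-reactivity\twas seen.'},
]
buffer = io.StringIO()
write_rows(buffer, rows)
print(buffer.getvalue())
```

With a real file, opening it in a `with open(..., 'w', newline='')` block also guarantees it is closed, and `'w'` avoids appending duplicate rows when the notebook is run more than once.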