# PubTator Central: Automated Concept Annotation for Biomedical Full Text Articles

## 1. About

- PubTator Central (PTC) is a web-based system that automatically annotates **biomedical concepts**, such as *genes* and *mutations*, in **PubMed abstracts** and **PMC full-text articles**.

- `website`: https://www.ncbi.nlm.nih.gov/research/pubtator/
- `api`: https://www.ncbi.nlm.nih.gov/research/pubtator/api.html
- `paper`: https://academic.oup.com/nar/article/47/W1/W587/5494727

- The PTC interface displays the abstract or full text of a publication together with related tools.

<img width="500" height="800" align="left" src="https://oup.silverchair-cdn.com/oup/backfile/Content_public/Journal/nar/47/W1/10.1093_nar_gkz389/1/m_gkz389fig2.jpeg?Expires=1639156093&Signature=lJen0TlEgPqi-Fdy3o0y8dCi7Wdr5yWEgt8ubR71-Ua44hS08zyqfDGFyLdVga6kQemH7GuBNp7G9VbBLBZqsJ1wy9jKfUp6FJEpK3vR8Wtgz9waWG~l9MwnM5epsChmhBZmKxMo7j0JyJ-0T2fjVkoPPFDeIOXT7h09B1x9lecRLg9d0NQ0sVHOn-Uvgj-5X5UWWA3e~TDOwnG8fVlT24BKKLvkS2cOBpYCSSMHG-68TiR-g1Xd~1abcGQO8A58mV05YENXannF86NHOQAQrQCNrcYj5fkmfCSyyeJin~BRSA1Nr8UAdP9i30ujVEH0dQuvnXSl1l2tHoa~C-lkeQ__&Key-Pair-Id=APKAIE5G5CRDK6RD3PGA">

## 2. Usage

In [1]:
import json
import requests
import pandas as pd
from IPython.display import JSON
from nltk.tokenize.punkt import PunktSentenceTokenizer

In [2]:
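# proxy: both values set to None means a direct connection
PROXIES = {'http': None, 'https': None}
# To route requests through a proxy, uncomment the line below and adjust the address
# PROXIES = {'http': 'http://165.225.96.34:10015', 'https': 'https://165.225.96.34:10015'}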

# url
url = 'https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/'

# concepts
concepts = ['gene', 'disease', 'chemical', 'species', 'mutation', 'cellline']

# PMID or PMCID
json_PMID = {'concepts': concepts, 'pmids':[28483577,28483578]}
json_PMCID = {'concepts': concepts, 'pmcids':['PMC6207735']}

# export format, appended to url: pubtator, biocjson, or biocxml
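
The settings above can be wrapped in a small helper that builds the request URL and JSON payload and checks the format before anything is sent. This is a minimal sketch: `build_export_request`, `BASE_URL`, and `FORMATS` are hypothetical names, not part of the PubTator API, and the rule that PMCIDs are rejected for the `pubtator` format simply follows the note in the PMC full-text cell of this notebook.

```python
BASE_URL = 'https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/'
FORMATS = ('pubtator', 'biocjson', 'biocxml')


def build_export_request(fmt, pmids=None, pmcids=None, concepts=None):
    """Return (url, payload) for an export request (hypothetical helper)."""
    if fmt not in FORMATS:
        raise ValueError(f'unknown format: {fmt!r}')
    # Exactly one kind of identifier should be supplied
    if bool(pmids) == bool(pmcids):
        raise ValueError('pass exactly one of pmids or pmcids')
    # Assumption taken from this notebook: PMC full text is not served as pubtator
    if pmcids and fmt == 'pubtator':
        raise ValueError('use biocjson for PMCIDs')
    payload = {'pmids': list(pmids)} if pmids else {'pmcids': list(pmcids)}
    if concepts:
        payload['concepts'] = list(concepts)
    return BASE_URL + fmt, payload


request_url, payload = build_export_request(
    'biocjson', pmcids=['PMC6207735'], concepts=['gene', 'disease'])
print(request_url)
print(payload)
```

The returned pair can be passed straight to `requests.post(request_url, json=payload, proxies=PROXIES)`.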

**PubMed abstracts**

In [3]:
res = requests.post(url = url + 'pubtator', json=json_PMID, proxies=PROXIES)
print(res.text)

28483577|t|Formoterol and fluticasone propionate combination improves histone deacetylation and anti-inflammatory activities in bronchial epithelial cells exposed to cigarette smoke.
28483577|a|BACKGROUND: The addition of long-acting beta2-agonists (LABAs) to corticosteroids improves asthma control. Cigarette smoke exposure, increasing oxidative stress, may negatively affect corticosteroid responses. The anti-inflammatory effects of formoterol (FO) and fluticasone propionate (FP) in human bronchial epithelial cells exposed to cigarette smoke extracts (CSE) are unknown. AIMS: This study explored whether FP, alone and in combination with FO, in human bronchial epithelial cellline (16-HBE) and primary bronchial epithelial cells (NHBE), counteracted some CSE-mediated effects and in particular some of the molecular mechanisms of corticosteroid resistance. METHODS: 16-HBE and NHBE were stimulated with CSE, FP and FO alone or combined. HDAC3 and HDAC2 activity, nuclear translocation of GR and
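
The `pubtator` text format shown above puts the title on a `PMID|t|` line and the abstract on a `PMID|a|` line; annotations follow as tab-separated lines (PMID, start, end, mention, type, identifier). A minimal parsing sketch is given below. `parse_pubtator` is a hypothetical helper, and the sample string is hand-written for illustration with placeholder identifiers; it is not real API output.

```python
import re

HEADER = re.compile(r'^(\d+)\|([ta])\|(.*)$')


def parse_pubtator(text):
    """Parse pubtator-format text into {pmid: {title, abstract, annotations}}."""
    docs = {}

    def get_doc(pmid):
        return docs.setdefault(pmid, {'title': '', 'abstract': '', 'annotations': []})

    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        header = HEADER.match(line)
        if header:
            pmid, kind, content = header.groups()
            get_doc(pmid)['title' if kind == 't' else 'abstract'] = content
            continue
        fields = line.split('\t')
        if len(fields) >= 5:
            pmid, start, end, mention, ann_type = fields[:5]
            get_doc(pmid)['annotations'].append({
                'start': int(start),
                'end': int(end),
                'text': mention,
                'type': ann_type,
                # The identifier column may be missing for unnormalized mentions
                'identifier': fields[5] if len(fields) > 5 else '',
            })
    return docs


# Hand-written sample; GENE_ID and MESH_ID are placeholders, not real identifiers
sample = (
    '1|t|HDAC2 in asthma.\n'
    '1|a|A short made-up abstract.\n'
    '1\t0\t5\tHDAC2\tGene\tGENE_ID\n'
    '1\t9\t15\tasthma\tDisease\tMESH_ID\n'
)
parsed = parse_pubtator(sample)
for ann in parsed['1']['annotations']:
    print(ann['start'], ann['end'], ann['text'], ann['type'])
```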

**PMC full-text**

In [4]:
# The pubtator format cannot be used for PMC full text; use biocjson instead.
res = requests.post(url = url + 'biocjson', json=json_PMCID, proxies=PROXIES)
JSON(json.loads(res.text))

<IPython.core.display.JSON object>
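
Because the JSON widget only renders inside a notebook, a quick way to see what the response contains is to walk its passages and count annotations per section type. The sketch below assumes the same keys that the conversion function in this notebook reads (`passages`, `infons`, `section_type`, `annotations`). The `sample` dict is hand-built from the first rows of the result table at the end of this notebook, with truncated passage text, and `count_annotations` is a hypothetical helper.

```python
from collections import Counter

# Hand-built stand-in for json.loads(res.text); passage text is truncated
sample = {
    'pmid': 30375428,
    'pmcid': 'PMC6207735',
    'passages': [
        {
            'offset': 0,
            'text': 'A novel somatic mutation of SIN3A',
            'infons': {'section': 'Title', 'section_type': 'TITLE'},
            'annotations': [
                {'text': 'SIN3A',
                 'infons': {'type': 'Gene', 'identifier': '25942'},
                 'locations': [{'offset': 28, 'length': 5}]},
            ],
        },
        {
            'offset': 141,
            'text': 'Breast cancer is the most frequent tumor',
            'infons': {'section': 'Abstract', 'section_type': 'ABSTRACT'},
            'annotations': [
                {'text': 'Breast cancer',
                 'infons': {'type': 'Disease', 'identifier': 'MESH:D001943'},
                 'locations': [{'offset': 141, 'length': 13}]},
                {'text': 'tumor',
                 'infons': {'type': 'Disease', 'identifier': 'MESH:D009369'},
                 'locations': [{'offset': 176, 'length': 5}]},
            ],
        },
    ],
}


def count_annotations(doc):
    """Count annotations per (section_type, annotation type) pair."""
    counts = Counter()
    for passage in doc.get('passages', []):
        section_type = passage.get('infons', {}).get('section_type')
        for ann in passage.get('annotations', []):
            counts[(section_type, ann['infons']['type'])] += 1
    return counts


counts = count_annotations(sample)
for (section_type, ann_type), n in sorted(counts.items()):
    print(section_type, ann_type, n)
```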

**Convert biocjson to a list**

In [5]:
def biocjson2list(biocjson: dict) -> list:
    """Parse a biocjson document and return its annotations as a list of rows."""
    # Get pmid, pmcid, and passages
    pmid = biocjson.get('pmid')
    pmcid = biocjson.get('pmcid')
    passages = biocjson.get('passages')
    # Parse the biocjson annotations
    results = list()
    for passage in passages:
        # Text of the current passage
        curr_text = passage.get('text').strip()
        # Split the passage into sentences, keeping each sentence's start and end position
        sentences = [[start, end, curr_text[start:end]] for start, end in PunktSentenceTokenizer().span_tokenize(curr_text)]
        # Offset of the current passage within the full text
        text_offset = passage.get('offset')
        # Section and section type of the current passage
        section = passage.get('infons').get('section')
        section_type = passage.get('infons').get('section_type')
        # Annotations of the current passage
        annotations = passage.get('annotations')
        for annotation in annotations:
            ann_text = annotation['text'] # annotated mention
            ann_type = annotation['infons']['type'] # concept type of the mention
            ann_iden = annotation['infons']['identifier'] # normalized identifier of the mention
            ann_length = len(annotation['text'].strip())  # length of the mention
            # ann_length = annotation['locations'][0]['length'] # length of the mention
            ann_full_text_start = annotation['locations'][0]['offset'] # start position in the full text
            ann_full_text_end = annotation['locations'][0]['offset'] + ann_length # end position in the full text
            ann_curr_text_start = ann_full_text_start - text_offset # start position in the current passage
            ann_curr_text_end = ann_full_text_end - text_offset # end position in the current passage
            # Skip the annotation if it falls outside the current passage
            if ann_curr_text_start<0 or ann_curr_text_end>len(curr_text):
                continue
            # Start and end positions, within the passage, of the sentence that contains the mention
            anchor_sentence_start = [start for start, end, sentence in sentences if start <= ann_curr_text_start][-1]
            anchor_sentence_end = [end for start, end, sentence in sentences if end >= ann_curr_text_end][0]
            # The containing sentence
            anchor_sentence = curr_text[anchor_sentence_start:anchor_sentence_end]
            # Start position of the mention within the containing sentence
            anchor_ann_start = ann_curr_text_start - anchor_sentence_start
            # End position of the mention within the containing sentence
            anchor_ann_end =  ann_curr_text_end - anchor_sentence_start
            anchor_ann_text = anchor_sentence[anchor_ann_start: anchor_ann_end]
            results.append([pmid, ann_full_text_start, ann_full_text_end, ann_text, ann_type, ann_iden, pmcid, section, section_type, anchor_ann_start, anchor_ann_end, anchor_ann_text, anchor_sentence])
    # Sort by start position in the full text (index 1 of each row)
    results = sorted(results, key=lambda x:x[1])
    return results
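
The offset arithmetic in the function above (full-text offset, then passage offset, then sentence offset) can be tried in isolation. The sketch below is not the notebook's implementation: it swaps NLTK's Punkt tokenizer for a naive regex splitter so that it runs with the standard library only, and `sentence_spans`, `anchor`, and the sample text are all made up for illustration.

```python
import re


def sentence_spans(text):
    """Naive stand-in for Punkt: a sentence ends at '.', '!' or '?' plus whitespace."""
    spans, start = [], 0
    for match in re.finditer(r'[.!?]\s+', text):
        spans.append((start, match.start() + 1))  # keep the punctuation mark
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def anchor(text, passage_offset, ann_offset, ann_length):
    """Return (sentence, start, end) with the mention located inside its sentence."""
    # Full-text offset -> offset within the passage
    local_start = ann_offset - passage_offset
    local_end = local_start + ann_length
    for start, end in sentence_spans(text):
        if start <= local_start and local_end <= end:
            # Passage offset -> offset within the sentence
            return text[start:end], local_start - start, local_end - start
    return None  # mention lies outside the passage or straddles two sentences


# Made-up passage that starts at full-text offset 141
passage_text = 'Breast cancer is common. SIN3A is a gene.'
sentence, s, e = anchor(passage_text, passage_offset=141, ann_offset=166, ann_length=5)
print(sentence)
print(s, e, sentence[s:e])
```

Returning `None` for mentions that do not fit inside a single sentence is one possible design choice; the function above instead stitches the span from the nearest sentence start to the nearest sentence end.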

In [6]:
res = requests.post(url = url + 'biocjson', json=json_PMCID, proxies=PROXIES)
bioclist = biocjson2list(json.loads(res.text))
col_names = ['pmid','fulltext_start','fulltext_end','text','type','identifier',
             'pmcid','section','section_type','sentence_start','sentence_end','sentence_text','sentence']
df = pd.DataFrame(bioclist, columns=col_names)
print(df.shape)
df.head()

(844, 13)


| | pmid | fulltext_start | fulltext_end | text | type | identifier | pmcid | section | section_type | sentence_start | sentence_end | sentence_text | sentence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30375428 | 28 | 33 | SIN3A | Gene | 25942 | PMC6207735 | Title | TITLE | 28 | 33 | SIN3A | A novel somatic mutation of SIN3A detected in ... |
| 1 | 30375428 | 46 | 59 | breast cancer | Disease | MESH:D001943 | PMC6207735 | Title | TITLE | 46 | 59 | breast cancer | A novel somatic mutation of SIN3A detected in ... |
| 2 | 30375428 | 122 | 129 | ERalpha | Gene | 2099 | PMC6207735 | Title | TITLE | 122 | 129 | ERalpha | A novel somatic mutation of SIN3A detected in ... |
| 3 | 30375428 | 141 | 154 | Breast cancer | Disease | MESH:D001943 | PMC6207735 | Abstract | ABSTRACT | 0 | 13 | Breast cancer | Breast cancer is the most frequent tumor in wo... |
| 4 | 30375428 | 176 | 181 | tumor | Disease | MESH:D009369 | PMC6207735 | Abstract | ABSTRACT | 35 | 40 | tumor | Breast cancer is the most frequent tumor in wo... |
