# Extract data from PubMed

This notebook shows how to extract information from PubMed records. We will extract:

1. Publication title
2. Publication year
3. Full names of authors
4. Affiliation of each author

By applying minor code modifications, additional information can be extracted, such as keywords or even the full abstract.

The starting point is a small **example list** containing 3 PubMed IDs.

The end result is a network in which **research institutes** and **publications** are the nodes, while the edges connecting them hold the details about **authors** and their **affiliations**.

-----------------------------------------------------------------------------------

### 1. Import packages, define functions and variables

In [None]:
import xml.etree.ElementTree as ET
from urllib.request import urlopen
import pandas as pd

counter = 0     # Next free key in the 'edges' dictionary; incremented in section 2 each time a new edge is added
edges = dict()

example_list = ['26046436', '32963239', '28793255']

In [None]:
# Get the normalized institute name from affiliation
def getInstitute(affiliation):
    if 'Burnham' in affiliation:
        return 'Sanford Burnham Prebys'
    elif 'Scripps' in affiliation:
        return 'The Scripps Research Institute'
    elif 'Salk' in affiliation:
        return 'Salk Institute'
    elif 'University of California' in affiliation:
        return 'UC San Diego'
    elif 'La Jolla Institute for Allergy and Immunology' in affiliation:
        return 'LJIAI'
    else:
        return 'Other'
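
The keyword matching above can also be written as a lookup table, which makes the precedence of the rules easier to see and extend. The following is a minimal sketch, not part of the original notebook: `INSTITUTE_RULES` and `get_institute_from_rules` are hypothetical names, and the sample affiliation strings are made up for illustration.

```python
# Hypothetical, table-driven equivalent of getInstitute. The rule order
# mirrors the if/elif chain, so the first matching keyword wins.
INSTITUTE_RULES = [
    ('Burnham', 'Sanford Burnham Prebys'),
    ('Scripps', 'The Scripps Research Institute'),
    ('Salk', 'Salk Institute'),
    ('University of California', 'UC San Diego'),
    ('La Jolla Institute for Allergy and Immunology', 'LJIAI'),
]


def get_institute_from_rules(affiliation, rules=INSTITUTE_RULES):
    """Return the normalized name for the first keyword found in affiliation."""
    for keyword, institute in rules:
        if keyword in affiliation:
            return institute
    return 'Other'


# Made-up affiliation strings, for illustration only
samples = [
    'Department of Medicine, University of California, San Diego, La Jolla, CA',
    'Salk Institute for Biological Studies, La Jolla, CA',
    'Some Other Company, San Diego, CA',
]
results = [get_institute_from_rules(s) for s in samples]
print(results)
```

Because the match is a plain substring test, rule order matters: an affiliation that mentions both "Scripps" and "University of California" (for example, an oceanography department at UC San Diego) would be assigned to 'The Scripps Research Institute', since that rule is checked first.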

### 2. Extract information from PubMed element tree

In the last part of this code block, we merge all edges in the **edges** dictionary that share the same source and target nodes, and store the number of merged authors as the 'contribution score'. This reduces the number of edges between a given institute and a given publication to 1.
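
The merge step can be illustrated in isolation, without any network access. This is a simplified sketch under stated assumptions: the records are placeholder values rather than real PubMed data, and the dictionary is keyed directly by a `(pmid, institute)` tuple instead of the `edge_idx` lookup used in the notebook.

```python
# Made-up (pmid, institute, author) records; placeholders, not real PubMed data
records = [
    ('1001', 'Salk Institute', 'A Smith'),
    ('1001', 'Salk Institute', 'B Jones'),
    ('1001', 'UC San Diego', 'C Lee'),
    ('1002', 'Salk Institute', 'A Smith'),
]

merged = {}  # (pmid, institute) -> [author string, contribution score]
for pmid, institute, author in records:
    key = (pmid, institute)
    if key in merged:
        # Same publication and institute: append the author and bump the score
        merged[key][0] += ', ' + author
        merged[key][1] += 1
    else:
        # First author seen for this publication/institute pair
        merged[key] = [author, 1]

for (pmid, institute), (authors, score) in merged.items():
    print(pmid, institute, authors, score)
```

Four author records collapse into three edges here, and the pair ('1001', 'Salk Institute') ends up with a contribution score of 2.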

In [None]:
edge_idx = {}

for item in example_list:
    efetch = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=%s" % item
    # Close the HTTP connection as soon as the response has been read
    with urlopen(efetch) as handle:
        data = handle.read()
    root = ET.fromstring(data)
    
    for article in root.findall("PubmedArticle"):
        pmid = article.find("MedlineCitation/PMID").text
        # Fall back to 'NA' when the record has no PubDate/Year element
        year = article.findtext("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year", default='NA')
        aulist = article.findall("MedlineCitation/Article/AuthorList/Author")
        title = article.find("MedlineCitation/Article/ArticleTitle")
        
        for author in aulist:
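            # Look fields up by tag name; positional access (author[0], author[1])
            # breaks for collective authors and for records that omit ForeName.
            last_name = author.findtext('LastName')
            fore_name = author.findtext('ForeName')
            affiliation = author.findtext('AffiliationInfo/Affiliation')
            # Skip authors that lack a name or a listed affiliation
            if last_name and fore_name and affiliation: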
                if "San Diego" in affiliation or 'La Jolla' in affiliation:
                        institute = getInstitute(affiliation)
                        
                        # Merge edges and compute the 'contribution score'
                        lookupKey = pmid + '_' + institute
                        if lookupKey in edge_idx:
                            oldRec = edges[edge_idx[lookupKey]]
                            newRec = (oldRec[0],oldRec[1],oldRec[2] + ', ' + fore_name[0] + ' ' + last_name,oldRec[3],oldRec[4],oldRec[5], oldRec[6]+1)
                            edges[edge_idx[lookupKey]] = newRec
                        else:
                            edges[counter] = (pmid, title.text, fore_name[0] + ' ' + last_name, affiliation, year, institute, 1)
                            edge_idx[lookupKey] = counter
                            counter += 1
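
The introduction notes that keywords or the full abstract can be extracted with minor modifications. The sketch below shows one way this might look, run against a small hand-written XML string so that it works offline. The sample record and all its values are placeholders; the element paths follow the PubMed XML layout used above, with `Abstract/AbstractText` and `KeywordList/Keyword` as the additional paths.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an efetch response; every value is a placeholder
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
        <AuthorList>
          <Author>
            <LastName>Doe</LastName>
            <ForeName>Jane</ForeName>
            <AffiliationInfo>
              <Affiliation>Example Institute, La Jolla, CA</Affiliation>
            </AffiliationInfo>
          </Author>
          <Author>
            <CollectiveName>Example Consortium</CollectiveName>
          </Author>
        </AuthorList>
      </Article>
      <KeywordList>
        <Keyword>networks</Keyword>
        <Keyword>bibliometrics</Keyword>
      </KeywordList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(SAMPLE_XML)
article = root.find("PubmedArticle")

# Structured abstracts are split into several AbstractText elements; join them.
# itertext() also collects text nested inside inline markup such as <i>.
abstract = " ".join(
    "".join(part.itertext()).strip()
    for part in article.findall("MedlineCitation/Article/Abstract/AbstractText")
)

keywords = [kw.text for kw in article.findall("MedlineCitation/KeywordList/Keyword")]

# Keep only authors that have a personal name and a listed affiliation
authors = []
for author in article.findall("MedlineCitation/Article/AuthorList/Author"):
    last_name = author.findtext("LastName")
    fore_name = author.findtext("ForeName")
    affiliation = author.findtext("AffiliationInfo/Affiliation")
    if last_name and fore_name and affiliation:
        authors.append((fore_name[0] + " " + last_name, affiliation))

print(abstract)
print(keywords)
print(authors)
```

The collective author in the sample has no `LastName`, so it is skipped rather than raising an error. Not every PubMed record carries an abstract or a keyword list; in that case `abstract` is an empty string and `keywords` an empty list.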
                        

### 3. Create Pandas Dataframe

In [None]:
df = pd.DataFrame.from_dict(data = edges,
                            orient='index',
                            columns = ['pmid', 'title', 'author', 'affiliation',
                                       'year', 'institute', 'contribution score'])
df

### 4. Create CX2 network from Pandas

Here we create a CX2 network using the Pandas dataframe from the previous step.
When creating the network, we specify which Pandas columns to use as source and target nodes, which columns become source and target node attributes, and which become edge attributes. We also define a default edge interaction.

The last three lines of the first code cell print the network attributes and the node and edge counts as a quick sanity check.
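
To make the column-to-network mapping concrete, here is a plain-Python illustration of what the source, target, and attribute settings mean. It is only a conceptual sketch: it does not use ndex2 and does not claim to reproduce how the factory works internally, and the rows are made-up placeholder values.

```python
# Made-up rows shaped like the dataframe from step 3; placeholder values only
rows = [
    {'pmid': '1001', 'title': 'Example title A', 'year': '2020',
     'institute': 'Salk Institute', 'author': 'A Smith, B Jones',
     'contribution score': 2},
    {'pmid': '1001', 'title': 'Example title A', 'year': '2020',
     'institute': 'UC San Diego', 'author': 'C Lee',
     'contribution score': 1},
    {'pmid': '1002', 'title': 'Example title B', 'year': '2021',
     'institute': 'Salk Institute', 'author': 'A Smith',
     'contribution score': 1},
]

nodes = {}   # node name -> node attributes
edges = []   # one edge per dataframe row

for row in rows:
    # Source node: the institute, with no extra attributes
    nodes.setdefault(row['institute'], {})
    # Target node: the publication, carrying 'title' and 'year'
    nodes.setdefault(row['pmid'], {}).update(title=row['title'], year=row['year'])
    # Edge: institute -> publication, carrying the author details
    edges.append({
        'source': row['institute'],
        'target': row['pmid'],
        'interaction': 'contributed to',
        'author': row['author'],
        'contribution score': row['contribution score'],
    })

print('nodes:', len(nodes))
print('edges:', len(edges))
```

Each distinct institute or PMID becomes one node no matter how many rows mention it, while every row becomes exactly one edge. This is why the edges were merged per institute and publication in step 2.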


In [None]:
import ndex2
from ndex2.cx2 import PandasDataFrameToCX2NetworkFactory

pd_factory = PandasDataFrameToCX2NetworkFactory()
cx2_from_df = pd_factory.get_cx2network(
    df,
    source_field='institute',
    target_field='pmid',
    source_node_attr=[],
    target_node_attr=['title', 'year'],
    edge_attr=['author', 'affiliation', 'contribution score'],
    edge_interaction='contributed to',
)

print('network_attributes:', cx2_from_df.get_network_attributes())
print('nodes:', len(cx2_from_df.get_nodes()))
print('edges:', len(cx2_from_df.get_edges()))

In [None]:
# Set type attribute on "institute" nodes
institutes_names = ['UC San Diego', 'The Scripps Research Institute', 'Salk Institute', 'Sanford Burnham Prebys', 'LJIAI', 'Other']

for node_id, node in cx2_from_df.get_nodes().items():
    # Tag the node once if its name contains any of the institute names
    if any(name in node['v']['name'] for name in institutes_names):
        cx2_from_df.add_node_attribute(node_id, 'type', 'research institute')
        
print('network_attributes:', cx2_from_df.get_network_attributes())
print('nodes:', len(cx2_from_df.get_nodes()))
print('edges:', len(cx2_from_df.get_edges()))

In [None]:
# Set @context
cx2_from_df.add_network_attribute('context', {'pubmed': 'https://www.ncbi.nlm.nih.gov/pubmed/'})

In [None]:
# Add a 'reference' attribute with the 'pubmed:' prefix to publication nodes so that it works with the context
for node_id, node in cx2_from_df.get_nodes().items():
    if node['v'].get('type') != 'research institute':
        cx2_from_df.add_node_attribute(node_id, 'reference', 'pubmed:' + node['v']['name'])

print('network_attributes:', cx2_from_df.get_network_attributes())
print('nodes:', len(cx2_from_df.get_nodes()))
print('edges:', len(cx2_from_df.get_edges()))
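
The point of pairing the context with prefixed `reference` values is that a consumer of the network can expand `pubmed:<id>` into a full URL. How NDEx itself resolves these values is not shown in this notebook; the sketch below only illustrates the idea, and `expand_reference` is a hypothetical helper.

```python
# Same prefix-to-URL mapping as the 'context' network attribute set above
context = {'pubmed': 'https://www.ncbi.nlm.nih.gov/pubmed/'}


def expand_reference(reference, context):
    """Expand 'prefix:id' into a URL when the prefix is defined in context."""
    prefix, sep, local_id = reference.partition(':')
    if sep and prefix in context:
        return context[prefix] + local_id
    # Unknown prefix or no prefix at all: return the value unchanged
    return reference


print(expand_reference('pubmed:26046436', context))
```

For the first ID in the example list, this prints `https://www.ncbi.nlm.nih.gov/pubmed/26046436`.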

### 5. Upload to NDEx

This last step uploads the network to your NDEx account. You need to provide your NDEx account credentials (**user** and **password**) in order to upload the network.
The code will also generate a clickable URL that you can use to open a browser tab and view your network.

In [None]:
# Set credentials to access your NDEx account
import getpass
user = getpass.getpass('NDEx username: ')

In [None]:
password = getpass.getpass('NDEx password: ')

In [None]:
server = 'www.ndexbio.org'

# Upload the network
client = ndex2.client.Ndex2(host=server, username=user, password=password)
res = client.save_new_cx2_network(cx2_from_df.to_cx2(), visibility='PRIVATE')

# Generate a clickable link to view your network in the browser directly from the notebook.
# Please note that the browser might ask you to log in to your NDEx account in order to view the network.
base_url = 'https://www.ndexbio.org/viewer/networks/'
print(f"View your network: {base_url}{res.split('/')[-1]}")


### 6. Next steps

Your network is now saved in your NDEx account and its visibility is set to PRIVATE, so you are the only one who can see it. You can perform additional operations on the network directly in NDEx; these include:

- Adding/editing network attributes (title, description, version, etc.)
- Changing the network visibility
- Importing it into Cytoscape for visual styling or further analysis
- Requesting a DOI
- Querying to extract sub-networks of interest

-----------------------------------------------------------------------------------

###### Questions/comments:   rpillich@ucsd.edu