# PyGraphistry Tutorial: Visualize Protein Interactions From BioGrid

The BioGrid dataset used here contains over 600,000 interactions across 50,000 proteins!

##### Notes

This notebook automatically downloads about 200 MB of [BioGrid](http://thebiogrid.org) data. If you plan to run this notebook more than once, we recommend manually downloading the data and saving it to disk. To do so, unzip the two files and place their contents in `pygraphistry/demos/data`.
- Protein Interactions: [BIOGRID-ALL-3.3.123.tab2.zip](http://thebiogrid.org/downloads/archives/Release%20Archive/BIOGRID-3.3.123/BIOGRID-ALL-3.3.123.tab2.zip)
- Protein Identifiers: [BIOGRID-IDENTIFIERS-3.3.123.tab.zip](http://thebiogrid.org/downloads/archives/Release%20Archive/BIOGRID-3.3.123/BIOGRID-IDENTIFIERS-3.3.123.tab.zip)



In [None]:
import pandas
import graphistry
graphistry.register(key='Email pygraphistry@graphistry.com to get your key!')

## Load Protein Interactions
Select columns of interest and drop empty rows.

In [None]:
url1 = 'https://s3-us-west-1.amazonaws.com/graphistry.demo.data/BIOGRID-ALL-3.3.123.tab2.txt.gz'
rawdata = pandas.read_table(url1, na_values=['-'], engine='c', compression='gzip')

# If using local data, comment out the two lines above and uncomment the line below
# rawdata = pandas.read_table('./data/BIOGRID-ALL-3.3.123.tab2.txt', na_values=['-'], engine='c')

cols = ['BioGRID ID Interactor A', 'BioGRID ID Interactor B', 'Official Symbol Interactor A', 
        'Official Symbol Interactor B', 'Pubmed ID', 'Author', 'Throughput']
interactions = rawdata[cols].dropna()
interactions[:3]
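
To see what `na_values=['-']` and `dropna()` do together, here is a minimal sketch on a made-up three-row table. The sample rows are hypothetical and only mimic the tab-separated layout; the sketch uses `pandas.read_csv` with `sep='\t'` to parse an in-memory string.

```python
import io
import pandas

# Hypothetical three-row sample in tab-separated form; '-' marks a missing value.
sample = (
    "BioGRID ID Interactor A\tBioGRID ID Interactor B\tAuthor\n"
    "1\t2\tSmith\n"
    "3\t-\tJones\n"
    "4\t5\t-\n"
)

# na_values=['-'] converts every '-' cell to NaN while parsing.
toy = pandas.read_csv(io.StringIO(sample), sep='\t', na_values=['-'])

# dropna() keeps only the rows with no missing value in any column.
complete = toy.dropna()
print(len(toy), len(complete))
```

Only the first row survives, because each of the other two has a `-` in one of its columns.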

### Let's have a quick peek at the data
Bind the columns storing the source/destination of each edge. This is the bare minimum to create a visualization. 

In [None]:
# This will upload ~8MB of data, be patient!
plotter = graphistry.bind(source="BioGRID ID Interactor A", destination="BioGRID ID Interactor B")
plotter.plot(interactions)

## A Fancier Visualization With Custom Labels and Colors
Let's look up the name and organism of each protein in the BioGrid identification DB.

In [None]:
# This downloads 170 MB, it might take some time.
url2 = 'https://s3-us-west-1.amazonaws.com/graphistry.demo.data/BIOGRID-IDENTIFIERS-3.3.123.tab.txt.gz'
raw_proteins = pandas.read_table(url2, na_values=['-'], engine='c', compression='gzip')

# If using local data, comment out the two lines above and uncomment the line below
# raw_proteins = pandas.read_table('./data/BIOGRID-IDENTIFIERS-3.3.123.tab.txt', na_values=['-'], engine='c')


protein_ids = raw_proteins[['BIOGRID_ID', 'ORGANISM_OFFICIAL_NAME']].drop_duplicates() \
                          .rename(columns={'ORGANISM_OFFICIAL_NAME': 'ORGANISM'})
protein_ids[:3]

We extract the proteins referenced as either sources or targets of interactions.

In [None]:
source_proteins = interactions[["BioGRID ID Interactor A", "Official Symbol Interactor A"]].copy() \
                              .rename(columns={'BioGRID ID Interactor A': 'BIOGRID_ID', 
                                               'Official Symbol Interactor A': 'SYMBOL'})

target_proteins = interactions[["BioGRID ID Interactor B", "Official Symbol Interactor B"]].copy() \
                              .rename(columns={'BioGRID ID Interactor B': 'BIOGRID_ID', 
                                               'Official Symbol Interactor B': 'SYMBOL'}) 

all_proteins = pandas.concat([source_proteins, target_proteins], ignore_index=True).drop_duplicates()
all_proteins[:3]
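
The same pattern can be tried on a tiny edge list. This is only a sketch: the column names `src`, `dst`, `id` and `symbol` are made up for illustration and are not BioGrid columns.

```python
import pandas

# Hypothetical edge list: proteins 1 and 2 appear both as a source and as a target.
edges = pandas.DataFrame({
    'src': [1, 2, 2],
    'src_symbol': ['A', 'B', 'B'],
    'dst': [2, 3, 1],
    'dst_symbol': ['B', 'C', 'A'],
})

# Give both endpoint tables the same column names so they can be stacked.
sources = edges[['src', 'src_symbol']].rename(columns={'src': 'id', 'src_symbol': 'symbol'})
targets = edges[['dst', 'dst_symbol']].rename(columns={'dst': 'id', 'dst_symbol': 'symbol'})

# Stack the two tables, then keep one row per distinct (id, symbol) pair.
nodes = pandas.concat([sources, targets], ignore_index=True).drop_duplicates()
print(nodes)
```

Renaming before concatenating matters: `drop_duplicates()` compares whole rows, so the two tables need identical column names for a protein seen on both sides to collapse into a single node.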

We join on the identification DB to get the organism to which each protein belongs.

In [None]:
protein_labels = pandas.merge(all_proteins, protein_ids, how='left', left_on='BIOGRID_ID', right_on='BIOGRID_ID')
protein_labels[:3]
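
A left join keeps every protein even when the identification table has no matching row; the organism is then missing (NaN). The sketch below, on made-up rows, uses the `indicator=True` option of `pandas.merge` to list such unmatched proteins. Whether the real data contains any is not something this notebook checks.

```python
import pandas

# Hypothetical node table and a lookup table that lacks an entry for ID 3.
proteins = pandas.DataFrame({'BIOGRID_ID': [1, 2, 3], 'SYMBOL': ['A', 'B', 'C']})
organisms = pandas.DataFrame({'BIOGRID_ID': [1, 2],
                              'ORGANISM': ['Homo sapiens', 'Mus musculus']})

# indicator=True adds a '_merge' column telling where each row came from.
labels = pandas.merge(proteins, organisms, how='left', on='BIOGRID_ID', indicator=True)

# Rows marked 'left_only' had no match in the lookup table.
unmatched = labels[labels['_merge'] == 'left_only']
print(unmatched)
```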

We assign colors to proteins based on their organism.

In [None]:
colors = protein_labels.ORGANISM.unique().tolist()
protein_labels['Color'] = protein_labels.ORGANISM.map(lambda x: colors.index(x))
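
As an alternative sketch (not what this notebook runs), `pandas.factorize` produces the same kind of integer code per distinct value in a single pass. The organism values below are made up, and how missing values should be colored is left as an assumption for the reader to decide.

```python
import pandas

# Hypothetical organism column with one missing value.
organisms = pandas.Series(['Homo sapiens', 'Mus musculus', 'Homo sapiens', None])

# factorize returns one integer code per row plus the distinct values, in order of appearance.
codes, uniques = pandas.factorize(organisms)
print(codes, list(uniques))
```

Missing values receive the code `-1`, so they may need to be remapped to a valid color index before plotting.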

For convenience, let's add links to PubMed and RCSB.

In [None]:
def makeRcsbLink(id):
    if isinstance(id, str):
        url = 'http://www.rcsb.org/pdb/gene/' + id.upper()
        return '<a target="_blank" href="%s">%s</a>' % (url, id.upper())
    else:
        return 'n/a'
    
protein_labels.SYMBOL = protein_labels.SYMBOL.map(makeRcsbLink)
protein_labels[:3]

In [None]:
def makePubmedLink(id):
    url = 'http://www.ncbi.nlm.nih.gov/pubmed/?term=%s' % id
    return '<a target="_blank" href="%s">%s</a>' % (url, id)

interactions['Pubmed ID'] = interactions['Pubmed ID'].map(makePubmedLink)
interactions[:3]
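
Both link helpers insert the raw value straight into HTML. If a value could contain characters such as `<`, `&` or spaces, a more defensive variant might escape it first. The helper name `make_link` below is hypothetical, and this is a sketch using only the standard library rather than part of the original notebook.

```python
from html import escape
from urllib.parse import quote

def make_link(base_url, value):
    """Build an HTML link, percent-encoding the URL part and escaping the label."""
    text = str(value)
    # Percent-encode the value so it is safe inside a URL.
    url = base_url + quote(text, safe='')
    # Escape both the attribute value and the visible label for HTML.
    return '<a target="_blank" href="%s">%s</a>' % (escape(url, quote=True), escape(text))

print(make_link('http://www.ncbi.nlm.nih.gov/pubmed/?term=', 12345))
```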

## Plotting
We bind columns to labels and colors and we are good to go. 

In [None]:
# This will upload ~10MB of data, be patient!
fancy_plotter = plotter.bind(node='BIOGRID_ID', edge_title='Author', point_title='SYMBOL', point_color='Color')
fancy_plotter.plot(interactions, protein_labels)