# Load an edge table of gene similarities into NDEx

This tutorial shows how to convert an edge table to an adjacency matrix and upload that matrix to NDEx.

The edge table is assumed to be a tab-separated file in this format:

```Bash
SOURCE TARGET WEIGHT
GENE1  GENE2  0.123
GENE1  GENE3  0.144
.
.
```

**NOTE:** This notebook **assumes** the input edge table contains only the upper triangle of the matrix (for example, there is an entry for GENE1 -> GENE2 -> WEIGHT, but **NO** entry for GENE2 -> GENE1 -> WEIGHT). To build the full symmetric matrix, the code below adds the transpose of the matrix to the final result.

**WARNING:** Large tables take a lot of RAM. For example, running this workflow on an edge table with 19,000 genes (a matrix of about 360 million elements) will consume **10-20** gigabytes of RAM and take 10-20 minutes to run.
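As a rough check on these figures, the size of the dense matrix alone can be estimated from the gene count. This is a back-of-the-envelope sketch that assumes 8-byte `float64` values; the variable names are illustrative. It does not account for any temporary copies made while the workflow runs, which presumably explains why observed peak usage is several times higher.

```python
import numpy as np

# Number of genes in the example from the warning above
n_genes = 19_000

# A full gene-by-gene matrix has n_genes squared elements
n_elements = n_genes ** 2

# Assumption: each weight is stored as a 64-bit float (8 bytes)
bytes_per_value = np.dtype(np.float64).itemsize

dense_bytes = n_elements * bytes_per_value
dense_gib = dense_bytes / 1024 ** 3

print('elements:', n_elements)
print('dense matrix size: {:.2f} GiB'.format(dense_gib))
```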

This tutorial requires the following packages. The easiest way to set them up is to first install Anaconda or Miniconda with Python 3 (https://docs.conda.io/en/latest/miniconda.html):

* python-igraph (Best to install via Conda ie: `conda install -c conda-forge python-igraph`)
* ddot (https://github.com/idekerlab/ddot)
* ndex2 (`pip install ndex2`)
* simplejson (`pip install simplejson`)
* pandas (`pip install pandas`)
* numpy (`pip install numpy`)
* scipy (`pip install scipy`)


# Import needed modules

In [None]:
import os
import sys
import getpass
import csv

import ddot
import ndex2
import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix 


# Get list of genes from edge table

To minimize memory usage, the following code fragment will read the edge table and build a list of genes as well as a dictionary mapping gene names to integers. The dictionary will be used to relabel the genes to integers which is needed by the `coo_matrix` scipy function below.
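The mapping idea can be tried on a handful of in-memory rows before running it on a real file. The sketch below is illustrative only: `build_gene_map` and the sample edges are hypothetical names and data, not part of the notebook. It assigns each gene the next free integer the first time it is seen, then orders the label list by that integer so that position `i` in the list names the gene with index `i`.

```python
def build_gene_map(edges):
    """Map each gene name to an integer in order of first appearance."""
    gene_map = {}
    for source, target, _weight in edges:
        for gene in (source, target):
            if gene not in gene_map:
                gene_map[gene] = len(gene_map)
    return gene_map


# Hypothetical sample rows (SOURCE, TARGET, WEIGHT)
edges = [
    ('GENE2', 'GENE1', 0.123),
    ('GENE2', 'GENE3', 0.144),
    ('GENE1', 'GENE3', 0.2),
]

gene_map = build_gene_map(edges)

# Order the labels by integer index so labels[i] matches matrix row/column i
labels = sorted(gene_map, key=gene_map.get)

print(gene_map)
print(labels)
```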


Enter path to edge table file. It is assumed this file has the following header line: `SOURCE TARGET WEIGHT`

Example: `/tmp/foo.tsv`

In [None]:
# sys.version_info gets the version of python, needed to use correct call to get user input
#
# This fragment of code prompts you for a path to the edge file. Be sure
# to enter a file in the text field and hit enter to set the value
#
sys.stdout.write('Enter path to edge table file:\n')
if sys.version_info[0] >= 3:
   edgetable = os.path.abspath(input())
else:
   edgetable = os.path.abspath(raw_input())


In [None]:
#
# Print the value of edgetable to make sure it was set
#
sys.stdout.write('Edge table file set to: ' + edgetable + '\n')

In [None]:
# Reads edge table building a list of genes and a dictionary of gene names to integers (gene_map)
header_line = 0
gene_map = {}
counter = 0
with open(edgetable, 'r') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for row in reader:
        if header_line == 0:
            header_line = 1
            continue
        if row[0] not in gene_map:
            gene_map[row[0]] = counter
            counter += 1
        if row[1] not in gene_map:
            gene_map[row[1]] = counter
            counter += 1


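# order gene names by their integer index so that genes[i] labels row/column i of the matrix
# (sorting alphabetically would mismatch labels whenever genes do not first appear in alphabetical order)
genes = sorted(gene_map, key=gene_map.get)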


# output number of genes found
len(genes)

# Generate and save new edge table with gene names replaced with integers

Using the `gene_map` dictionary generated earlier, the code below re-reads the edge table and writes a new edge table named **output.tsv** to the current working directory, with gene names replaced by integers (needed by the `coo_matrix` function below).

**WARNING:** **output.tsv** will be written to the current working directory, and it will consume roughly as much disk space as the input **Edge Table** file (set in the `edgetable` variable).
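To see what the relabeled table looks like without touching the disk, the same row-by-row rewrite can be run against in-memory text. This is a minimal sketch under stated assumptions: `io.StringIO` stands in for the real input and output files, the sample genes and the small `gene_map` are hypothetical, and `csv.writer` is used here in place of manual string concatenation.

```python
import csv
import io

# Hypothetical mapping and input; in the notebook these come from the real edge table
gene_map = {'GENE1': 0, 'GENE2': 1, 'GENE3': 2}
infile = io.StringIO(
    'SOURCE\tTARGET\tWEIGHT\n'
    'GENE1\tGENE2\t0.123\n'
    'GENE1\tGENE3\t0.144\n'
)
outfile = io.StringIO()

reader = csv.reader(infile, delimiter='\t')
writer = csv.writer(outfile, delimiter='\t', lineterminator='\n')

# Copy the header line unchanged
writer.writerow(next(reader))

# Replace gene names with their integer indexes, keep the weight as-is
for source, target, weight in reader:
    writer.writerow([gene_map[source], gene_map[target], weight])

relabeled = outfile.getvalue()
print(relabeled)
```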

In [None]:

counter = 0
with open('output.tsv', 'w') as outf:
    with open(edgetable, 'r') as infile:
        reader = csv.reader(infile, delimiter='\t')
        for row in reader:
            if counter == 0:
                outf.write(str(row[0]) + '\t' + str(row[1]) + '\t' + str(row[2]) + '\n')
                counter = 1
                continue
            outf.write(str(gene_map[row[0]])+'\t'+ str(gene_map[row[1]])+'\t'+ str(row[2]) + '\n')
    

# Load the edge table with gene names replaced by integers

**NOTE:** This step uses more RAM. For a 180 million row edge table, about 3-4 gigabytes of RAM will be used.

In [None]:
df = pd.read_csv('output.tsv', sep='\t', dtype={'SOURCE': np.int32, 'TARGET': np.int32, 
                                             'WEIGHT': np.float64})

df.head()

# Create adjacency matrix 

Creates the adjacency matrix using the `coo_matrix` function. For a 180 million row edge table, this uses about 12 gigabytes of RAM at peak and then drops back down to about 8 gigabytes.
    
**NOTE:** This notebook **assumes** the input edge table contains only the upper triangle of the matrix (for example, there is an entry for GENE1 -> GENE2 -> WEIGHT, but **NO** entry for GENE2 -> GENE1 -> WEIGHT).

To create the full matrix, the code below adds the transpose of the matrix to the final result via the `full_mat = sparse_mat + sparse_mat.T` command.
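The effect of adding the transpose is easiest to see on a tiny matrix. The following sketch uses three hypothetical genes (indexes 0 to 2) with made-up weights; only the upper-triangle entries are supplied, and the sum with the transpose fills in the mirrored lower-triangle entries.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Upper-triangle edges only: (0,1), (0,2), (1,2); weights are made up
rows = np.array([0, 0, 1])
cols = np.array([1, 2, 2])
weights = np.array([0.123, 0.144, 0.2])

upper = coo_matrix((weights, (rows, cols)), shape=(3, 3))

# Adding the transpose mirrors every entry across the diagonal
full = upper + upper.T

dense = full.toarray()
print(dense)
```

Note that this only gives the intended result when each pair appears once and there are no diagonal entries. If the table already lists both directions, or contains self-edges, adding the transpose would double those values.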


In [None]:
sparse_mat = coo_matrix((df.iloc[:,2], (df.iloc[:,0],df.iloc[:,1])), shape=(len(genes), len(genes)))

# delete the dataframe since we don't need it anymore
del df

# add the transpose of the matrix since it is assumed the input edge table only contains the upper triangle
# if this is NOT the case, replace the line below with: full_mat = sparse_mat
full_mat = sparse_mat + sparse_mat.T

full_mat

# Create NDEx NiceCXNetwork object

NDEx uses the CX format to store data. The next command converts the matrix data into [NDEx CX format](http://www.home.ndexbio.org/data-model/).

The `create_edgeMatrix` function stores the matrix data in four "opaque" aspects as described below:

* `matrix` - This aspect contains the matrix data as a serialized numpy array encoded in base64 and chunked into 100 megabyte blocks that are stored as elements in a JSON list
* `matrix_cols` - This aspect contains a list of the gene column names as strings
* `matrix_rows` - This aspect contains a list of the gene row names as strings
* `matrix_dtype` - This aspect contains the numpy data type for the elements in the matrix (e.g. numpy.float64)
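To make the idea behind the `matrix` aspect concrete, here is a minimal sketch of serializing a numpy array, base64-encoding it, and splitting the result into fixed-size chunks. It is not the ddot implementation: the helper names `matrix_to_chunks` and `chunks_to_matrix`, the use of `np.save` as the serialization format, and the tiny chunk size are all assumptions chosen for illustration.

```python
import base64
import io

import numpy as np


def matrix_to_chunks(matrix, chunk_size):
    """Serialize a matrix, base64-encode it, and split it into string chunks."""
    buffer = io.BytesIO()
    np.save(buffer, matrix)  # assumed serialization format, for illustration
    encoded = base64.b64encode(buffer.getvalue()).decode('ascii')
    return [encoded[i:i + chunk_size] for i in range(0, len(encoded), chunk_size)]


def chunks_to_matrix(chunks):
    """Reverse matrix_to_chunks: join, decode, and load the array."""
    raw = base64.b64decode(''.join(chunks))
    return np.load(io.BytesIO(raw))


matrix = np.array([[0.0, 0.123], [0.123, 0.0]])

# A tiny chunk size so that even this small matrix spans several chunks
chunks = matrix_to_chunks(matrix, chunk_size=64)
restored = chunks_to_matrix(chunks)

print(len(chunks), 'chunks')
print(restored)
```

Chunking keeps each element of the JSON list to a bounded size, which is presumably why the aspect stores a list of blocks rather than one very large string.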


In [None]:
network = ddot.utils.create_edgeMatrix(full_mat.todense(), genes, genes, verbose=True, ndex2=True)

# sets the name of the network
network.set_name('test similarity network')
network

# Get NDEx credentials

In [None]:
sys.stdout.write('Enter NDEx username:\n')
if sys.version_info[0] >= 3:
    user = input()
else:
    user = raw_input()

sys.stdout.write('Enter NDEx password:\n')
password = getpass.getpass()


# Upload to NDEx

In [None]:
# NDEx server to use; for production use public.ndexbio.org
server_url = 'test.ndexbio.org'

res = network.upload_to(server_url, user, password)

sys.stdout.write('If successful, the value below will be the low-level URL of the uploaded network\n')
sys.stdout.write('The network will be private by default, so\n')
sys.stdout.write('to see the network visit http://' + server_url + ' and log in with the user account entered earlier\n')

res
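If the returned URL is needed later, for example to record which network was created, the network identifier can be pulled from it. The sketch below assumes the URL ends with the network's identifier, which is an assumption about the server response rather than something this notebook confirms. The helper `network_id_from_url` and the example URL are hypothetical.

```python
def network_id_from_url(url):
    """Return the last path segment of a URL (assumed to be the network id)."""
    return url.rstrip('/').rsplit('/', 1)[-1]


# Hypothetical URL with a placeholder identifier, standing in for the value of `res`
example_url = 'http://test.ndexbio.org/v2/network/00000000-0000-0000-0000-000000000000'

network_id = network_id_from_url(example_url)
print(network_id)
```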

In [None]:
print('Tutorial complete. Have a nice day.')