# Download Donor Data with the ICGC API

This example shows how to use a PQL query to select the donors of interest and then download the chosen data types for those donors as tabular data (TSV).

We start by importing the ICGC Python API (http://icgc-python.readthedocs.io/):

In [None]:
import icgc

Next, we define which donors we are interested in through a PQL query. In our case, we are interested in donors whose primary cancer site is Brain.

In [None]:
pql = 'eq(donor.primarySite,"Brain")'

We want to see what data types are available for us to download as well as how large the downloads will be. We can see this with the following code.

In [None]:
from pprint import pprint

sizes = icgc.download_size(pql)
print("Sizes are:")
pprint(sizes)

We can see the various data types with their sizes reported in bytes. For this demo we want to download approximately 10 MB of data. This results in downloading the donor, mirna_seq, stsm, and pexp data for the donors defined by our query.
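Raw byte counts can be hard to compare at a glance. The following is a minimal sketch of a formatting helper, assuming `sizes` maps each data type name to an integer number of bytes (as the loop in the next cell suggests). The name `format_size` and the sample values are hypothetical and are not part of the ICGC API.

```python
def format_size(num_bytes):
    """Return a byte count as a human-readable string using binary units."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return "{:.2f} {}".format(size, unit)
        size /= 1024
    return "{:.2f} TB".format(size)


# Hypothetical byte counts for illustration only; real values would come
# from icgc.download_size(pql).
example_sizes = {"donor": 150000, "stsm": 5242880}
for name, num_bytes in sorted(example_sizes.items()):
    print("{:<10} {}".format(name, format_size(num_bytes)))
```

In the notebook, the same loop could be run over the real `sizes` dictionary instead of `example_sizes`.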

We will also download this data into the `mydata` directory.

In [None]:
import os
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

KB = 1024
MB = 1024 * KB
max_size = 10 * MB
current_size = 0

includes = []
for k in sizes:
    item_size = sizes[k]
    if current_size + item_size < max_size:
        includes.append(k)
        current_size += item_size

print("Including items {}".format(includes))
print("Approximate download size={:.2f} MB".format(current_size / MB))

# Make sure the target directory exists before downloading into it
os.makedirs("mydata", exist_ok=True)

# Download the data and save the results in the file "mydata/test.tar"
print("Starting Download...")
icgc.download(pql, includes, "mydata/test")
print("Finished Download!")
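The selection loop above takes data types in whatever order the dictionary yields them, so the set that fits under the limit depends on that order. As an alternative sketch, the function below sorts by size first so that the smallest data types are considered first. The name `select_within_budget` and the sample sizes are hypothetical and serve only to illustrate the idea.

```python
def select_within_budget(sizes, max_size):
    """Greedily pick data types, smallest first, while the total stays under max_size bytes."""
    selected = []
    total = 0
    for name, item_size in sorted(sizes.items(), key=lambda kv: kv[1]):
        if total + item_size < max_size:
            selected.append(name)
            total += item_size
    return selected, total


# Hypothetical byte counts for illustration only.
example_sizes = {"ssm": 50000000, "donor": 200000, "pexp": 3000000, "stsm": 4000000}
chosen, total = select_within_budget(example_sizes, 10 * 1024 * 1024)
print("Including items {}".format(chosen))
print("Total bytes: {}".format(total))
```

Sorting smallest first tends to include more data types for the same budget, at the cost of possibly skipping a larger type that would also have fit on its own.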

Because the data was downloaded as a tar file, we need to extract it. This can be done either in the bash shell or in Python code. This example shows how to do it with Python.

In [None]:
import tarfile

# The context manager closes the archive even if extraction fails
with tarfile.open("mydata/test.tar") as tar:
    tar.extractall("mydata")
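It can be useful to check what an archive contains before or while extracting it. The sketch below lists the member names with `tarfile`'s `getnames()` and then extracts them. To stay self-contained it builds a tiny stand-in archive in a temporary directory; the helper name `list_and_extract` and the file `donor.tsv` are hypothetical and do not reflect the actual contents of the ICGC download.

```python
import io
import os
import tarfile
import tempfile


def list_and_extract(tar_path, dest_dir):
    """Print the archive's member names, extract them into dest_dir, and return the names."""
    with tarfile.open(tar_path) as tar:
        names = tar.getnames()
        for name in names:
            print(name)
        tar.extractall(dest_dir)
    return names


# Build a small stand-in archive so the example runs without a real download.
workdir = tempfile.mkdtemp()
demo_tar = os.path.join(workdir, "demo.tar")
payload = b"id\tvalue\n"
with tarfile.open(demo_tar, "w") as tar:
    info = tarfile.TarInfo(name="donor.tsv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

extracted = list_and_extract(demo_tar, workdir)
```

In the notebook, the same helper could be called as `list_and_extract("mydata/test.tar", "mydata")`.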

Now that the files have been extracted, let us take a look. We will first list the directory by running a shell command prefixed with `!`.

In [None]:
!ls -l mydata
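The `!` prefix only works inside IPython or Jupyter. When running the same steps as a plain Python script, a rough equivalent can be written with the standard library. The sketch below uses a temporary directory and a hypothetical file name so that it runs on its own; `list_files` is an illustrative helper, not part of any library.

```python
import os
import tempfile


def list_files(directory):
    """Return (name, size_in_bytes) pairs for the regular files in directory, sorted by name."""
    entries = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            entries.append((name, os.path.getsize(path)))
    return entries


# Create a stand-in directory with one small file for demonstration.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "example.tsv"), "wb") as handle:
    handle.write(b"a\tb\n")

listing = list_files(demo_dir)
for name, size in listing:
    print("{:>10}  {}".format(size, name))
```

In the notebook, `list_files("mydata")` would list the extracted files.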

After seeing what files are available, let us load one into a data frame using pandas (https://pandas.pydata.org/), a popular data analysis library.

For our example we will take a look at the protein expression data. 

In [None]:
import pandas
df = pandas.read_table("mydata/protein_expression.tsv.gz", compression='gzip', sep='\t')
df
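Before loading a large file into a data frame, it can help to peek at the column names and the first few rows. The following is a minimal sketch using only the standard library's `gzip` and `csv` modules. It writes a small stand-in file first so that it runs on its own; the helper name `peek_tsv_gz` and the column names are hypothetical and do not reflect the real protein expression schema.

```python
import csv
import gzip
import os
import tempfile


def peek_tsv_gz(path, n_rows=3):
    """Return the header and up to n_rows data rows from a gzip-compressed TSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# Write a small stand-in file with made-up columns for demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    handle.write("col_a\tcol_b\n1\t2\n3\t4\n5\t6\n")

header, rows = peek_tsv_gz(demo_path, n_rows=2)
print(header)
print(rows)
```

In the notebook, the same helper could be pointed at `"mydata/protein_expression.tsv.gz"`.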