# Data retrieval exercise 1

In this exercise, we download data from the NCBI GEO database programmatically, using the GEOparse Python library. The exercise is based on the example from https://geoparse.readthedocs.io/en/latest/usage.html#examples.

## Data set

The data set is from the study "A circadian gene expression atlas in mammals assayed by microarray" by Zhang et al. (http://www.pnas.org/content/111/45/16219.long). The data is available in the GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54650, accession ID GSE54650).
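
As a rough illustration of how an accession ID relates to the locations used in this notebook, the sketch below builds the GEO web page URL and the SOFT file FTP URL for a series. The FTP pattern is inferred from the download log lines that appear later in this notebook, and the helper name `geo_series_urls` is hypothetical; GEOparse resolves these locations itself, so this is only for orientation.

```python
def geo_series_urls(accession):
    """Return (web page URL, SOFT file FTP URL) for a GEO series accession.

    Hypothetical helper; the FTP layout is inferred from the download logs
    shown in this notebook, not taken from official documentation.
    """
    digits = accession[3:]
    if not (accession.startswith("GSE") and digits.isdigit()):
        raise ValueError("expected an accession such as 'GSE54650'")

    # Series are grouped in directories named after the accession with its
    # last three digits replaced by 'nnn', e.g. GSE54650 -> GSE54nnn.
    group = "GSE" + digits[:-3] + "nnn"

    web = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=" + accession
    ftp = (
        "ftp://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{group}/{accession}/soft/{accession}_family.soft.gz"
    )
    return web, ftp


web_url, ftp_url = geo_series_urls("GSE54650")
print(web_url)
print(ftp_url)
```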

## Importing the modules

The first step is to install and import the required Python libraries. 
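
Before installing, it can be handy to check whether a library is already importable in the current kernel. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the two libraries used in this exercise.
for name in ("GEOparse", "pandas"):
    print(name, "is installed" if is_installed(name) else "is missing")
```

If a library is reported as missing, the `pip install` cell below installs it into the environment of the running kernel.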

In [1]:
import sys
!{sys.executable} -m pip install GEOparse


Collecting GEOparse
  Downloading https://files.pythonhosted.org/packages/67/f6/9206e1acda1858fa9a117ae91d9541e011735e672d58be58a5ee0947ef13/GEOparse-1.1.0.tar.gz (189kB)
    100% |████████████████████████████████| 194kB 961kB/s eta 0:00:01
Collecting wgetter>=0.6 (from GEOparse)
  Downloading https://files.pythonhosted.org/packages/8e/ce/7f160ed9f0e16a5365bcbac1dbc6bad1631e9fc91610a444fbdebede3e8b/wgetter-0.7.tar.gz
Collecting biopython>=1.71 (from GEOparse)
  Downloading https://files.pythonhosted.org/packages/28/15/8ac646ff24cfa2588b4d5e5ea51e8d13f3d35806bd9498fbf40ef79026fd/biopython-1.73-cp36-cp36m-manylinux1_x86_64.whl (2.2MB)
    100% |████████████████████████████████| 2.2MB 256kB/s eta 0:00:01
Building wheels for collected packages: GEOparse, wgetter
  Running setup.py bdist_wheel for GEOparse ... done
  Stored in directory: /home/nbuser/.cache/pip/wheels/f3/aa/77/45a2f1517e7545aaabce83d4ad371e4f58aa818e4ee38691cd
  Running setup.py bdist_wheel for w

In [2]:
import GEOparse
import pandas as pd

## Downloading the data

First, we download the example data set used in the GEOparse documentation, GSE1563 (https://geoparse.readthedocs.io/en/latest/usage.html#examples), and then the circadian data set GSE54650.
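
The download cells in this section print one DEBUG line per parsed sample, which becomes long for large series. A possible way to reduce that output is sketched below with the standard `logging` module. It assumes the library logs through a logger named `"GEOparse"`; that name is an assumption based on the log lines in this notebook and has not been checked against the library source.

```python
import logging

# Assumption: GEOparse emits its messages through a logger called "GEOparse".
geoparse_logger = logging.getLogger("GEOparse")

# Only let warnings and errors through; DEBUG and INFO lines are dropped.
geoparse_logger.setLevel(logging.WARNING)
```

Run this before calling `GEOparse.get_GEO` if the per-sample messages are not needed.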

In [3]:
pwd

'/home/nbuser/library'

In [4]:
gse = GEOparse.get_GEO(geo="GSE1563", destdir="./")
# A GSM (or Sample) contains information about the conditions and preparation of a sample

print()
print("GSM example:")
for gsm_name, gsm in gse.gsms.items():
    print("Name: ", gsm_name)
    print("Metadata:")
    # Each metadata value is a list of strings, so join it for display
    for key, value in gsm.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gsm.table.head())
    break  # Only show the first sample


18-Jan-2019 07:54:43 DEBUG utils - Directory ./ already exists. Skipping.
18-Jan-2019 07:54:44 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1563/soft/GSE1563_family.soft.gz to ./GSE1563_family.soft.gz
18-Jan-2019 07:54:44 INFO utils - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1563/soft/GSE1563_family.soft.gz to ./GSE1563_family.soft.gz


D: 100% -  9.6MiB  /  9.6MiB  eta 0:00:00


18-Jan-2019 07:55:58 INFO GEOparse - Parsing ./GSE1563_family.soft.gz: 
18-Jan-2019 07:55:58 DEBUG GEOparse - DATABASE: GeoMiame
18-Jan-2019 07:55:58 DEBUG GEOparse - SERIES: GSE1563
18-Jan-2019 07:55:58 DEBUG GEOparse - PLATFORM: GPL8300
18-Jan-2019 07:56:26 DEBUG GEOparse - SAMPLE: GSM26805
18-Jan-2019 07:56:27 DEBUG GEOparse - SAMPLE: GSM26806
18-Jan-2019 07:56:27 DEBUG GEOparse - SAMPLE: GSM26807
18-Jan-2019 07:56:28 DEBUG GEOparse - SAMPLE: GSM26808
18-Jan-2019 07:56:28 DEBUG GEOparse - SAMPLE: GSM26809
18-Jan-2019 07:56:29 DEBUG GEOparse - SAMPLE: GSM26810
18-Jan-2019 07:56:29 DEBUG GEOparse - SAMPLE: GSM26811
18-Jan-2019 07:56:30 DEBUG GEOparse - SAMPLE: GSM26812
18-Jan-2019 07:56:31 DEBUG GEOparse - SAMPLE: GSM26813
18-Jan-2019 07:56:31 DEBUG GEOparse - SAMPLE: GSM26814
18-Jan-2019 07:56:32 DEBUG GEOparse - SAMPLE: GSM26815
18-Jan-2019 07:56:32 DEBUG GEOparse - SAMPLE: GSM26816
18-Jan-2019 07:56:33 DEBUG GEOparse - SAMPLE: GSM26817
18-Jan-2019 07:56:33 DEBUG GEOparse - SAMPLE: 


GSM example:
Name:  GSM26805
Metadata:
 - title : C1PBL
 - geo_accession : GSM26805
 - status : Public on Jul 14 2004
 - submission_date : Jul 14 2004
 - last_update_date : Mar 16 2009
 - type : RNA
 - channel_count : 1
 - source_name_ch1 : PBL
 - organism_ch1 : Homo sapiens
 - taxid_ch1 : 9606
 - molecule_ch1 : total RNA
 - description : Clinical status: control healthy blood donor, Age: unknown, Sex: unknown, Immunosupression: none, Histopathology: none, Donor type: NA, Scr (mg/dL): unknown, Days post transplant: NA, Abbreviations used in sample description: Abreviations used to describe patient samples include the following: BX - Biopsy; PBL- Peripheral Blood Lymphocytes; CsA -Cyclosporine; MMF - Mycophenolate Mofetil; P - Prednisone; FK - Tacrolimus;  SRL - Sirolimus; CAD -Cadaveric;  LD - Live Donor; Scr - Serum Creatinine; ATN - Acute Tubular Necrosis CNI - Calcineurin Inhibitor; FSGS - Focal Segmental Glomerulosclerosis, Keywords = DNA microarrays, gene expression, kidney, reje
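
The loop above joins each metadata value with `", "` because every field is stored as a list of strings. As a small sketch of how such a record could be flattened into one string per field, the example below uses a hand-copied subset of the GSM26805 fields printed above in place of a real `gsm.metadata` object; the helper name `flatten_metadata` is hypothetical.

```python
# Hand-copied subset of the GSM26805 metadata shown in the output above.
# A real gsm.metadata is assumed to have the same dict-of-lists shape.
sample_metadata = {
    "title": ["C1PBL"],
    "geo_accession": ["GSM26805"],
    "source_name_ch1": ["PBL"],
    "organism_ch1": ["Homo sapiens"],
    "molecule_ch1": ["total RNA"],
}


def flatten_metadata(metadata, sep=", "):
    """Join the list of strings stored under each key into a single string."""
    return {key: sep.join(values) for key, values in metadata.items()}


flat = flatten_metadata(sample_metadata)
for key, value in flat.items():
    print(" - %s : %s" % (key, value))
```

A list of such flattened dictionaries, one per sample, could then be passed to `pd.DataFrame` to get one row per sample.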

In [5]:
# A GPL (or Platform) contains a tab-delimited table with the array definition, e.g. mappings from probe IDs to RefSeq IDs

print()
print("GPL example:")
for gpl_name, gpl in gse.gpls.items():
    print("Name: ", gpl_name)
    print("Metadata:")
    for key, value in gpl.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gpl.table.head())
    break  # Only show the first platform


GPL example:
Name:  GPL8300
Metadata:
 - title : [HG_U95Av2] Affymetrix Human Genome U95 Version 2 Array
 - geo_accession : GPL8300
 - status : Public on Mar 16 2009
 - submission_date : Mar 13 2009
 - last_update_date : Dec 13 2018
 - technology : in situ oligonucleotide
 - distribution : commercial
 - organism : Homo sapiens
 - taxid : 9606
 - manufacturer : Affymetrix
 - manufacture_protocol : see manufacturer's web site, , Based on this UniGene build and associated annotations, the HG-U95Av2 array represents approximately 10,000 full-length genes., 
 - description : Affymetrix submissions are typically submitted to GEO using the GEOarchive method described at http://www.ncbi.nlm.nih.gov/projects/geo/info/geo_affy.html, , June 03, 2009: annotation table updated with netaffx build 28, June 08, 2012: annotation table updated with netaffx build 32, June 27, 2016: annotation table updated with netaffx build 35
 - web_link : http://www.affymetrix.com/support/technical/byproduct.affx?pro
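
To make the idea of a platform table more concrete, here is a minimal sketch of turning a tab-delimited table into a probe-to-annotation lookup. The column names (`ID`, `REFSEQ`) and all identifiers in the table are made-up placeholders, not values from GPL8300; the real column names should be read from `gpl.table.columns`.

```python
import csv
import io

# Made-up placeholder table; real platform tables have more columns and
# their own column names.
platform_text = (
    "ID\tREFSEQ\n"
    "probe_a\trefseq_1\n"
    "probe_b\trefseq_2\n"
)

# Read the tab-delimited text and map each probe ID to its annotation.
reader = csv.DictReader(io.StringIO(platform_text), delimiter="\t")
probe_to_refseq = {row["ID"]: row["REFSEQ"] for row in reader}

print(probe_to_refseq)
```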

In [6]:
# Download the data set from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54650

gse = GEOparse.get_GEO(geo="GSE54650", destdir="./")

print()
print("GSM example:")
for gsm_name, gsm in gse.gsms.items():
    print("Name: ", gsm_name)
    print("Metadata:")
    for key, value in gsm.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gsm.table.head())
    break  # Only show the first sample

18-Jan-2019 07:59:57 DEBUG utils - Directory ./ already exists. Skipping.
18-Jan-2019 07:59:58 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE54nnn/GSE54650/soft/GSE54650_family.soft.gz to ./GSE54650_family.soft.gz
18-Jan-2019 07:59:58 INFO utils - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE54nnn/GSE54650/soft/GSE54650_family.soft.gz to ./GSE54650_family.soft.gz


D: 100% - 65.9MiB  / 65.9MiB  eta 0:00:01


18-Jan-2019 08:08:20 INFO GEOparse - Parsing ./GSE54650_family.soft.gz: 
18-Jan-2019 08:08:20 DEBUG GEOparse - DATABASE: GeoMiame
18-Jan-2019 08:08:20 DEBUG GEOparse - SERIES: GSE54650
18-Jan-2019 08:08:20 DEBUG GEOparse - PLATFORM: GPL6246
18-Jan-2019 08:09:42 DEBUG GEOparse - SAMPLE: GSM1320990
18-Jan-2019 08:09:43 DEBUG GEOparse - SAMPLE: GSM1320991
18-Jan-2019 08:09:44 DEBUG GEOparse - SAMPLE: GSM1320992
18-Jan-2019 08:09:46 DEBUG GEOparse - SAMPLE: GSM1320993
18-Jan-2019 08:09:47 DEBUG GEOparse - SAMPLE: GSM1320994
18-Jan-2019 08:09:48 DEBUG GEOparse - SAMPLE: GSM1320995
18-Jan-2019 08:09:49 DEBUG GEOparse - SAMPLE: GSM1320996
18-Jan-2019 08:09:51 DEBUG GEOparse - SAMPLE: GSM1320997
18-Jan-2019 08:09:52 DEBUG GEOparse - SAMPLE: GSM1320998
18-Jan-2019 08:09:53 DEBUG GEOparse - SAMPLE: GSM1320999
18-Jan-2019 08:09:54 DEBUG GEOparse - SAMPLE: GSM1321000
18-Jan-2019 08:09:55 DEBUG GEOparse - SAMPLE: GSM1321001
18-Jan-2019 08:09:57 DEBUG GEOparse - SAMPLE: GSM1321002
18-Jan-2019 08:09:

18-Jan-2019 08:12:45 DEBUG GEOparse - SAMPLE: GSM1321130
18-Jan-2019 08:12:47 DEBUG GEOparse - SAMPLE: GSM1321131
18-Jan-2019 08:12:48 DEBUG GEOparse - SAMPLE: GSM1321132
18-Jan-2019 08:12:49 DEBUG GEOparse - SAMPLE: GSM1321133
18-Jan-2019 08:12:51 DEBUG GEOparse - SAMPLE: GSM1321134
18-Jan-2019 08:12:52 DEBUG GEOparse - SAMPLE: GSM1321135
18-Jan-2019 08:12:54 DEBUG GEOparse - SAMPLE: GSM1321136
18-Jan-2019 08:12:55 DEBUG GEOparse - SAMPLE: GSM1321137
18-Jan-2019 08:12:56 DEBUG GEOparse - SAMPLE: GSM1321138
18-Jan-2019 08:12:58 DEBUG GEOparse - SAMPLE: GSM1321139
18-Jan-2019 08:12:59 DEBUG GEOparse - SAMPLE: GSM1321140
18-Jan-2019 08:13:01 DEBUG GEOparse - SAMPLE: GSM1321141
18-Jan-2019 08:13:02 DEBUG GEOparse - SAMPLE: GSM1321142
18-Jan-2019 08:13:03 DEBUG GEOparse - SAMPLE: GSM1321143
18-Jan-2019 08:13:05 DEBUG GEOparse - SAMPLE: GSM1321144
18-Jan-2019 08:13:06 DEBUG GEOparse - SAMPLE: GSM1321145
18-Jan-2019 08:13:07 DEBUG GEOparse - SAMPLE: GSM1321146
18-Jan-2019 08:13:09 DEBUG GEOp

18-Jan-2019 08:16:06 DEBUG GEOparse - SAMPLE: GSM1321274
18-Jan-2019 08:16:07 DEBUG GEOparse - SAMPLE: GSM1321275
18-Jan-2019 08:16:09 DEBUG GEOparse - SAMPLE: GSM1321276
18-Jan-2019 08:16:10 DEBUG GEOparse - SAMPLE: GSM1321277



GSM example:
Name:  GSM1320990
Metadata:
 - title : Adr_CT18
 - geo_accession : GSM1320990
 - status : Public on Sep 30 2014
 - submission_date : Feb 04 2014
 - last_update_date : Sep 30 2014
 - type : RNA
 - channel_count : 1
 - source_name_ch1 : adrenal gland
 - organism_ch1 : Mus musculus
 - taxid_ch1 : 10090
 - characteristics_ch1 : strain: C57/BL6, tissue: adrenal gland
 - growth_protocol_ch1 : C57/BL6 mice were entrained to a 12h light : 12h dark schedule for one week. Mice were then transferred to complete darkness prior to collection. Collection began at CT18. Mice were housed in light-tight boxes for the duration of the experiment.
 - molecule_ch1 : total RNA
 - extract_protocol_ch1 : Organ samples were homogenized in Trizol reagent (Invitrogen) using a Tissuelyser (Qiagen). RNA was extracted using RNeasy columns (Qiagen) as per manufacturer’s protocol, then pooled from three mice for each organ and time point.
 - label_ch1 : biotin
 - label_protocol_ch1 : not provided
 - hyb
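
The sample metadata above carries the tissue and time point in two places: the `characteristics_ch1` field and the sample title (`Adr_CT18`). The sketch below shows one way these could be parsed. It assumes that `characteristics_ch1` is a list of `"key: value"` strings and that titles follow a `<tissue>_CT<hour>` pattern; both assumptions are inferred from the single sample printed above and may not hold for every sample. The function names are hypothetical.

```python
import re


def parse_characteristics(items):
    """Turn ['strain: C57/BL6', ...] into {'strain': 'C57/BL6', ...}."""
    parsed = {}
    for item in items:
        # Split on the first colon only, so values may contain colons.
        key, _, value = item.partition(":")
        parsed[key.strip()] = value.strip()
    return parsed


def parse_title(title):
    """Split a title such as 'Adr_CT18' into ('Adr', 18).

    Returns None if the title does not match the assumed pattern.
    """
    match = re.fullmatch(r"(?P<tissue>[A-Za-z]+)_CT(?P<hour>\d+)", title)
    if match is None:
        return None
    return match.group("tissue"), int(match.group("hour"))


# Values copied from the GSM1320990 output above.
characteristics = parse_characteristics(["strain: C57/BL6", "tissue: adrenal gland"])
tissue_code, circadian_hour = parse_title("Adr_CT18")

print(characteristics)
print(tissue_code, circadian_hour)
```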