# Data retrieval from GEO

In this exercise, we download data from the NCBI GEO database programmatically. The exercise is based on the examples at https://geoparse.readthedocs.io/en/latest/usage.html#examples.

GEOparse is a Python library for accessing the Gene Expression Omnibus (GEO) database. GEOparse.get_GEO() looks up the specified accession ID in GEO, downloads the corresponding file to the specified directory, and loads the result into a GEOparse object (a GEOparse.GSE object for the series accessions used in this exercise). See the documentation at https://geoparse.readthedocs.io/en/latest/introduction.html#features.


## Installation of libraries

The first step is to install and import the required Python libraries. 


In [1]:
# pip is the package installer for Python; see https://pypi.org/project/pip/ for details

import sys
!{sys.executable} -m pip install GEOparse


Collecting GEOparse
  Downloading https://files.pythonhosted.org/packages/67/f6/9206e1acda1858fa9a117ae91d9541e011735e672d58be58a5ee0947ef13/GEOparse-1.1.0.tar.gz (189kB)
     |████████████████████████████████| 194kB 3.9MB/s eta 0:00:01
Collecting wgetter>=0.6 (from GEOparse)
  Downloading https://files.pythonhosted.org/packages/8e/ce/7f160ed9f0e16a5365bcbac1dbc6bad1631e9fc91610a444fbdebede3e8b/wgetter-0.7.tar.gz
Collecting biopython>=1.71 (from GEOparse)
  Downloading https://files.pythonhosted.org/packages/28/15/8ac646ff24cfa2588b4d5e5ea51e8d13f3d35806bd9498fbf40ef79026fd/biopython-1.73-cp36-cp36m-manylinux1_x86_64.whl (2.2MB)
     |████████████████████████████████| 2.2MB 20.5MB/s eta 0:00:01
Building wheels for collected packages: GEOparse, wgetter
  Building wheel for GEOparse (setup.py) ... done
  Stored in directory: /home/nbuser/.cache/pip/wheels/f3/aa/77/45a2f1517e7545aaabce83d4ad371e4f58aa818e4ee38691cd
  Building wheel for wgetter (setup.py) ... 

In [2]:
import GEOparse

# To read, write and process tabular data:

import pandas as pd
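
Before moving on, it can help to confirm that the installation succeeded without triggering a full import. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of GEOparse or pip.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the two libraries this notebook relies on.
for name in ("GEOparse", "pandas"):
    status = "available" if is_installed(name) else "missing (install it with pip)"
    print(name, "->", status)
```

If a library is reported as missing, rerunning the pip cell above and restarting the kernel is usually the first thing to try.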

## Exercise 1

Let's download an example data set from the study "Kidney Transplant Rejection and Tissue Injury by Gene Profiling of Biopsies and Peripheral Blood Lymphocytes" by Flechner et al., 2007 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2041877/).

In [3]:
# Check your current working folder if necessary:
import os
os.getcwd()

'/home/nbuser/library'
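
The downloads in this notebook go to the current working folder (`destdir="./"`). If you would rather keep them in a dedicated folder, a sketch along the following lines could be used. The folder name `geo_data` and the helper `prepare_download_dir` are assumptions made for illustration only.

```python
from pathlib import Path


def prepare_download_dir(base=".", name="geo_data"):
    """Create the folder if it does not exist yet and return its path."""
    target = Path(base) / name
    target.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return target


data_dir = prepare_download_dir()
print(data_dir.resolve())
```

The resulting path could then be passed as `destdir=str(data_dir)` instead of `"./"` when calling `GEOparse.get_GEO()`.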

In [4]:
# Download the data set using GEOparse (the data is available in the GEO database under the accession ID GSE1563)

kidney_data = GEOparse.get_GEO(geo="GSE1563", destdir="./")

# A GSM (or Sample) contains information about the conditions and preparation of the sample

print()
print("GSM example:")
for gsm_name, gsm in kidney_data.gsms.items():
    print("Name: ", gsm_name)
    print("Metadata:")
    # Each metadata value is a list of strings, so join it for display
    for key, value in gsm.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gsm.table.head())
    break  # show only the first sample


02-May-2019 11:39:26 DEBUG utils - Directory ./ already exists. Skipping.
02-May-2019 11:39:26 INFO GEOparse - File already exist: using local version.
02-May-2019 11:39:26 INFO GEOparse - Parsing ./GSE1563_family.soft.gz: 
02-May-2019 11:39:27 DEBUG GEOparse - DATABASE: GeoMiame
02-May-2019 11:39:27 DEBUG GEOparse - SERIES: GSE1563
02-May-2019 11:39:27 DEBUG GEOparse - PLATFORM: GPL8300
02-May-2019 11:39:34 DEBUG GEOparse - SAMPLE: GSM26805
02-May-2019 11:39:34 DEBUG GEOparse - SAMPLE: GSM26806
02-May-2019 11:39:35 DEBUG GEOparse - SAMPLE: GSM26807
02-May-2019 11:39:35 DEBUG GEOparse - SAMPLE: GSM26808
02-May-2019 11:39:35 DEBUG GEOparse - SAMPLE: GSM26809
02-May-2019 11:39:35 DEBUG GEOparse - SAMPLE: GSM26810
02-May-2019 11:39:35 DEBUG GEOparse - SAMPLE: GSM26811
02-May-2019 11:39:35 DEBUG GEOparse - SAMPLE: GSM26812
02-May-2019 11:39:35 DEBUG GEOparse - SAMPLE: GSM26813
02-May-2019 11:39:35 DEBUG GEOparse - SAMPLE: GSM26814
02-May-2019 11:39:35 DEBUG GEOparse - SAMPLE: GSM26815
02-M


GSM example:
Name:  GSM26805
Metadata:
 - title : C1PBL
 - geo_accession : GSM26805
 - status : Public on Jul 14 2004
 - submission_date : Jul 14 2004
 - last_update_date : Mar 16 2009
 - type : RNA
 - channel_count : 1
 - source_name_ch1 : PBL
 - organism_ch1 : Homo sapiens
 - taxid_ch1 : 9606
 - molecule_ch1 : total RNA
 - description : Clinical status: control healthy blood donor, Age: unknown, Sex: unknown, Immunosupression: none, Histopathology: none, Donor type: NA, Scr (mg/dL): unknown, Days post transplant: NA, Abbreviations used in sample description: Abreviations used to describe patient samples include the following: BX - Biopsy; PBL- Peripheral Blood Lymphocytes; CsA -Cyclosporine; MMF - Mycophenolate Mofetil; P - Prednisone; FK - Tacrolimus;  SRL - Sirolimus; CAD -Cadaveric;  LD - Live Donor; Scr - Serum Creatinine; ATN - Acute Tubular Necrosis CNI - Calcineurin Inhibitor; FSGS - Focal Segmental Glomerulosclerosis, Keywords = DNA microarrays, gene expression, kidney, reje
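
Most of the log output above consists of DEBUG lines, one per parsed sample. Python's standard `logging` module can raise the threshold so that only warnings and errors are shown. This sketch assumes that the library logs through a logger named "GEOparse"; if the logger has a different name in your version, the call has no effect on the output.

```python
import logging

# Assumption: GEOparse emits its messages through a logger named "GEOparse".
geo_logger = logging.getLogger("GEOparse")

# Hide DEBUG and INFO messages; warnings and errors are still shown.
geo_logger.setLevel(logging.WARNING)
```

Run this before calling `GEOparse.get_GEO()`; setting the level back to `logging.DEBUG` restores the detailed output.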

In [6]:
# A GPL (or Platform) contains a tab-delimited table with the array definition, e.g. mappings from probe IDs to RefSeq IDs

print()
print("GPL example:")
for gpl_name, gpl in kidney_data.gpls.items():
    print("Name: ", gpl_name)
    print("Metadata:")
    for key, value in gpl.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gpl.table.head())
    break  # show only the first platform


GPL example:
Name:  GPL8300
Metadata:
 - title : [HG_U95Av2] Affymetrix Human Genome U95 Version 2 Array
 - geo_accession : GPL8300
 - status : Public on Mar 16 2009
 - submission_date : Mar 13 2009
 - last_update_date : Dec 13 2018
 - technology : in situ oligonucleotide
 - distribution : commercial
 - organism : Homo sapiens
 - taxid : 9606
 - manufacturer : Affymetrix
 - manufacture_protocol : see manufacturer's web site, , Based on this UniGene build and associated annotations, the HG-U95Av2 array represents approximately 10,000 full-length genes., 
 - description : Affymetrix submissions are typically submitted to GEO using the GEOarchive method described at http://www.ncbi.nlm.nih.gov/projects/geo/info/geo_affy.html, , June 03, 2009: annotation table updated with netaffx build 28, June 08, 2012: annotation table updated with netaffx build 32, June 27, 2016: annotation table updated with netaffx build 35
 - web_link : http://www.affymetrix.com/support/technical/byproduct.affx?pro
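
The GSM and GPL cells above repeat the same metadata-printing loop. One way to avoid the duplication is a small helper like the sketch below. The function name `format_metadata` is hypothetical, and the toy dictionary only imitates the shape of a `.metadata` attribute (keys mapped to lists of strings), with values copied from the GSM output above.

```python
def format_metadata(metadata):
    """Turn a {key: [values]} mapping into ' - key : v1, v2' lines."""
    return [" - %s : %s" % (key, ", ".join(values)) for key, values in metadata.items()]


# Toy stand-in for gsm.metadata, using values shown in the GSM output above.
example = {
    "title": ["C1PBL"],
    "organism_ch1": ["Homo sapiens"],
    "taxid_ch1": ["9606"],
}
for line in format_metadata(example):
    print(line)
```

With a loaded data set, the same helper could be applied to `gsm.metadata` or `gpl.metadata` inside the loops.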

## Exercise 2

Now your task is to load the data set from the study "A circadian gene expression atlas in mammals assayed by microarray" by Zhang et al. (http://www.pnas.org/content/111/45/16219.long). The data is available in the GEO database under the accession ID GSE54650 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54650).
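
The calls in this notebook pass the series accession with its `GSE` prefix. When an accession is copied by hand as a bare number or in lower case, a small check can catch the mistake before any download starts. The helper `normalize_gse` below is a hypothetical sketch and not part of GEOparse.

```python
import re


def normalize_gse(accession):
    """Return a series accession in the 'GSE<digits>' form, or raise ValueError."""
    text = str(accession).strip().upper()
    match = re.fullmatch(r"(?:GSE)?(\d+)", text)
    if match is None:
        raise ValueError("not a GEO series accession: %r" % (accession,))
    return "GSE" + match.group(1)


print(normalize_gse("54650"))     # GSE54650
print(normalize_gse("gse54650"))  # GSE54650
```

Sample (GSM) or platform (GPL) accessions are rejected on purpose, since this helper only targets series accessions.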

In [3]:
# Download the data set using GEOparse:

circadian_expression = GEOparse.get_GEO(geo="GSE54650", destdir="./")

# Use the GSM and GPL example code above to print information about the data:

print()
print("GSM example:")
for gsm_name, gsm in circadian_expression.gsms.items():
    print("Name: ", gsm_name)
    print("Metadata:")
    for key, value in gsm.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gsm.table.head())
    break  # show only the first sample

07-May-2019 09:10:43 DEBUG utils - Directory ./ already exists. Skipping.
07-May-2019 09:10:43 INFO GEOparse - File already exist: using local version.
07-May-2019 09:10:43 INFO GEOparse - Parsing ./GSE54650_family.soft.gz: 
07-May-2019 09:10:43 DEBUG GEOparse - DATABASE: GeoMiame
07-May-2019 09:10:43 DEBUG GEOparse - SERIES: GSE54650
07-May-2019 09:10:43 DEBUG GEOparse - PLATFORM: GPL6246
07-May-2019 09:11:16 DEBUG GEOparse - SAMPLE: GSM1320990
07-May-2019 09:11:16 DEBUG GEOparse - SAMPLE: GSM1320991
07-May-2019 09:11:16 DEBUG GEOparse - SAMPLE: GSM1320992
07-May-2019 09:11:16 DEBUG GEOparse - SAMPLE: GSM1320993
07-May-2019 09:11:17 DEBUG GEOparse - SAMPLE: GSM1320994
07-May-2019 09:11:17 DEBUG GEOparse - SAMPLE: GSM1320995
07-May-2019 09:11:17 DEBUG GEOparse - SAMPLE: GSM1320996
07-May-2019 09:11:17 DEBUG GEOparse - SAMPLE: GSM1320997
07-May-2019 09:11:17 DEBUG GEOparse - SAMPLE: GSM1320998
07-May-2019 09:11:17 DEBUG GEOparse - SAMPLE: GSM1320999
07-May-2019 09:11:17 DEBUG GEOparse -

07-May-2019 09:11:33 DEBUG GEOparse - SAMPLE: GSM1321127
07-May-2019 09:11:33 DEBUG GEOparse - SAMPLE: GSM1321128
07-May-2019 09:11:33 DEBUG GEOparse - SAMPLE: GSM1321129
07-May-2019 09:11:33 DEBUG GEOparse - SAMPLE: GSM1321130
07-May-2019 09:11:33 DEBUG GEOparse - SAMPLE: GSM1321131
07-May-2019 09:11:34 DEBUG GEOparse - SAMPLE: GSM1321132
07-May-2019 09:11:34 DEBUG GEOparse - SAMPLE: GSM1321133
07-May-2019 09:11:34 DEBUG GEOparse - SAMPLE: GSM1321134
07-May-2019 09:11:34 DEBUG GEOparse - SAMPLE: GSM1321135
07-May-2019 09:11:34 DEBUG GEOparse - SAMPLE: GSM1321136
07-May-2019 09:11:34 DEBUG GEOparse - SAMPLE: GSM1321137
07-May-2019 09:11:34 DEBUG GEOparse - SAMPLE: GSM1321138
07-May-2019 09:11:34 DEBUG GEOparse - SAMPLE: GSM1321139
07-May-2019 09:11:34 DEBUG GEOparse - SAMPLE: GSM1321140
07-May-2019 09:11:35 DEBUG GEOparse - SAMPLE: GSM1321141
07-May-2019 09:11:35 DEBUG GEOparse - SAMPLE: GSM1321142
07-May-2019 09:11:35 DEBUG GEOparse - SAMPLE: GSM1321143
07-May-2019 09:11:35 DEBUG GEOp

07-May-2019 09:11:51 DEBUG GEOparse - SAMPLE: GSM1321271
07-May-2019 09:11:51 DEBUG GEOparse - SAMPLE: GSM1321272
07-May-2019 09:11:51 DEBUG GEOparse - SAMPLE: GSM1321273
07-May-2019 09:11:51 DEBUG GEOparse - SAMPLE: GSM1321274
07-May-2019 09:11:51 DEBUG GEOparse - SAMPLE: GSM1321275
07-May-2019 09:11:51 DEBUG GEOparse - SAMPLE: GSM1321276
07-May-2019 09:11:51 DEBUG GEOparse - SAMPLE: GSM1321277



GSM example:
Name:  GSM1320990
Metadata:
 - title : Adr_CT18
 - geo_accession : GSM1320990
 - status : Public on Sep 30 2014
 - submission_date : Feb 04 2014
 - last_update_date : Sep 30 2014
 - type : RNA
 - channel_count : 1
 - source_name_ch1 : adrenal gland
 - organism_ch1 : Mus musculus
 - taxid_ch1 : 10090
 - characteristics_ch1 : strain: C57/BL6, tissue: adrenal gland
 - growth_protocol_ch1 : C57/BL6 mice were entrained to a 12h light : 12h dark schedule for one week. Mice were then transferred to complete darkness prior to collection. Collection began at CT18. Mice were housed in light-tight boxes for the duration of the experiment.
 - molecule_ch1 : total RNA
 - extract_protocol_ch1 : Organ samples were homogenized in Trizol reagent (Invitrogen) using a Tissuelyser (Qiagen). RNA was extracted using RNeasy columns (Qiagen) as per manufacturer’s protocol, then pooled from three mice for each organ and time point.
 - label_ch1 : biotin
 - label_protocol_ch1 : not provided
 - hyb
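
The `characteristics_ch1` field above holds several "key: value" pairs, which are easier to work with as a dictionary. The sketch below assumes that the field is a list of strings such as `"strain: C57/BL6"`, as the joined output suggests. The helper name `parse_characteristics` is hypothetical.

```python
def parse_characteristics(entries):
    """Split 'key: value' strings into a dictionary; other entries are skipped."""
    parsed = {}
    for entry in entries:
        key, separator, value = entry.partition(": ")
        if separator:  # only keep entries that contain the ': ' separator
            parsed[key.strip()] = value.strip()
    return parsed


# Values taken from the GSM1320990 output above.
characteristics = ["strain: C57/BL6", "tissue: adrenal gland"]
print(parse_characteristics(characteristics))
```

With a loaded sample, the list would come from something like `gsm.metadata["characteristics_ch1"]`, provided that key is present for the sample.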

In [8]:
# GPL example:
print()
print("GPL example:")
for gpl_name, gpl in circadian_expression.gpls.items():
    print("Name: ", gpl_name)
    print("Metadata:")
    for key, value in gpl.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gpl.table.head())
    break  # show only the first platform


GPL example:
Name:  GPL6246
Metadata:
 - title : [MoGene-1_0-st] Affymetrix Mouse Gene 1.0 ST Array [transcript (gene) version]
 - geo_accession : GPL6246
 - status : Public on Dec 05 2007
 - submission_date : Dec 05 2007
 - last_update_date : Dec 19 2018
 - technology : in situ oligonucleotide
 - distribution : commercial
 - organism : Mus musculus
 - taxid : 10090
 - manufacturer : Affymetrix
 - manufacture_protocol : See manufacturer's web site, 
 - description : Affymetrix submissions are typically submitted to GEO using the GEOarchive method described at http://www.ncbi.nlm.nih.gov/projects/geo/info/geo_affy.html, , June 03, 2009: annotation table updated with netaffx build 28, June 07, 2012: annotation table updated with netaffx build 32, July 01, 2016: annotation table updated with netaffx build 35
 - web_link : http://www.affymetrix.com/support/technical/byproduct.affx?product=mogene-1_0-st-v1, http://www.affymetrix.com/support/technical/libraryfilesmain.affx
 - contact_name :