# Data retrieval from GEO

In this exercise, we download data from the NCBI GEO database programmatically. The exercise is based on the example at https://geoparse.readthedocs.io/en/latest/usage.html#examples.

GEOparse is a Python library for accessing the Gene Expression Omnibus (GEO) database. GEOparse.get_GEO() looks up a specified accession ID in GEO and downloads the record to a specified directory. For a series accession (GSE...), the result is loaded into a GEOparse.GSE object. See the documentation at https://geoparse.readthedocs.io/en/latest/introduction.html#features.


## Installation of libraries

The first step is to install and import the required Python libraries. 


In [None]:
# pip is the package installer for Python; see https://pypi.org/project/pip/ for details

import sys
!{sys.executable} -m pip install GEOparse


In [None]:
import GEOparse

# To read, write and process tabular data:

import pandas as pd

## Exercise 1

Let's download an example data set from the study "Kidney Transplant Rejection and Tissue Injury by Gene Profiling of Biopsies and Peripheral Blood Lymphocytes" by Flechner et al., 2004 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2041877/).

In [None]:
# Check your current working folder if necessary:
#import os
#os.getcwd()

In [None]:
# Download the data set using GEOparse (the data is available in the GEO database under the accession ID GSE1563)

kidney_data = GEOparse.get_GEO(geo="GSE1563", destdir="./")

# A GSM (or Sample) contains information about the conditions and preparation of the sample

print()
print("GSM example:")
for gsm_name, gsm in kidney_data.gsms.items():
    print("Name: ", gsm_name)
    print("Metadata:")
    for key, value in gsm.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gsm.table.head())
    break  # show only the first sample
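
The loop above prints the metadata of a single sample. To compare several samples side by side, the same metadata dictionaries can be collected into one row per sample. The following is a minimal sketch, not part of GEOparse: `summarize_metadata` is a hypothetical helper, the field names `title` and `source_name_ch1` are assumed to be present (check `gsm.metadata.keys()` for your data set), and the toy objects only mimic the `.metadata` attribute of a GSM so that the example runs without a download.

```python
from types import SimpleNamespace


def summarize_metadata(samples, keys=("title", "source_name_ch1")):
    """Return one dict per sample containing the selected metadata fields.

    `samples` is a mapping from sample name to an object with a `.metadata`
    dict, such as the `gsms` attribute of a GSE object.
    """
    rows = []
    for name, sample in samples.items():
        row = {"sample": name}
        for key in keys:
            # Each metadata value is a list of strings; missing keys give ""
            row[key] = ", ".join(sample.metadata.get(key, []))
        rows.append(row)
    return rows


# Toy stand-ins for GSM objects (hypothetical names and values)
toy_samples = {
    "GSM_A": SimpleNamespace(
        metadata={"title": ["sample A"], "source_name_ch1": ["kidney biopsy"]}
    ),
    "GSM_B": SimpleNamespace(metadata={"title": ["sample B"]}),
}

rows = summarize_metadata(toy_samples)
for row in rows:
    print(row)
```

With the real data, `summarize_metadata(kidney_data.gsms)` should produce one row per sample, and `pd.DataFrame(rows)` turns the result into a table.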


In [None]:
# A GPL (or Platform) contains a tab-delimited table with the array definition, e.g. mappings from probe IDs to RefSeq IDs

print()
print("GPL example:")
for gpl_name, gpl in kidney_data.gpls.items():
    print("Name: ", gpl_name)
    print("Metadata:")
    for key, value in gpl.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gpl.table.head())
    break  # show only the first platform
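
The platform table is what links probe IDs in a sample table to gene-level identifiers. The sketch below illustrates that link with a pandas left join on two small toy tables. The column names `ID_REF`, `VALUE`, `ID` and `GB_ACC` are assumptions about how the tables are laid out, and the identifiers are made up; inspect `gsm.table.columns` and `gpl.table.columns` before adapting it to real data.

```python
import pandas as pd

# Toy stand-in for a sample table (gsm.table): probe ID and measured value
sample_table = pd.DataFrame(
    {"ID_REF": ["probe_1", "probe_2", "probe_3"], "VALUE": [10.5, 7.2, 3.3]}
)

# Toy stand-in for a platform table (gpl.table): probe ID and an accession.
# The accession strings are placeholders, not real database identifiers.
platform_table = pd.DataFrame(
    {"ID": ["probe_1", "probe_2", "probe_3"], "GB_ACC": ["ACC_1", "ACC_2", None]}
)

# A left join keeps every measured probe, even those without an annotation
annotated = sample_table.merge(
    platform_table, left_on="ID_REF", right_on="ID", how="left"
)

print(annotated[["ID_REF", "GB_ACC", "VALUE"]])
```

A left join is used here so that probes lacking an annotation remain in the result with a missing accession instead of being dropped.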

## Exercise 2

Now your task is to load the data set from the study "A circadian gene expression atlas in mammals assayed by microarray" by Zhang et al. (http://www.pnas.org/content/111/45/16219.long). The data is available in the GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54650, accession ID GSE54650).

In [None]:
# Download the data set using GEOparse:



In [None]:
# Use the GSM example code above to print information about the data:


In [None]:
# Use the GPL example code above to print information about the data:
