# Data retrieval from GEO

In this exercise, we download data from the NCBI GEO database programmatically. The exercise is based on the example at https://geoparse.readthedocs.io/en/latest/usage.html#examples.

GEOparse is a Python library for accessing the Gene Expression Omnibus (GEO) database. `GEOparse.get_GEO()` looks up a specified accession ID in GEO and downloads the corresponding file to a specified directory. For a series accession, the result is loaded into a `GEOparse.GEOTypes.GSE` object. See the documentation at https://geoparse.readthedocs.io/en/latest/introduction.html#features.


Along the way, we will practice exploring an unfamiliar data structure.

## Installation of libraries

The first step is to install (if necessary) and import the required Python libraries.


In [3]:
#pip is the package installer for Python, see https://pypi.org/project/pip/ for details
#
#import sys
#!{sys.executable} -m pip install GEOparse

In [4]:
import GEOparse
# To read, write and process tabular data:
import pandas as pd
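
Before running the imports, it can help to check whether the packages are present in the current environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is introduced here for illustration and is not part of GEOparse.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the two libraries used in this notebook.
for name in ("GEOparse", "pandas"):
    status = "available" if is_installed(name) else "missing (install it with pip)"
    print(f"{name}: {status}")
```

If a package is reported as missing, uncomment the `pip install` line in the cell above and run it once.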

## Exercise 1

Let's download an example data set from the study "Kidney Transplant Rejection and Tissue Injury by Gene Profiling of Biopsies and Peripheral Blood Lymphocytes" by Flechner et al, 2007 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2041877/).

In [5]:
# Check your current working folder if necessary:
import os
os.getcwd()

'C:\\Users\\Peder\\Documents\\repos\\CBM-101\\CBM101\\C_Data_resources'

In [6]:
# Download the data set using GEOparse (the data is available in the GEO database under the accession ID GSE1563)

kidney_data = GEOparse.get_GEO(geo="GSE1563", destdir="./")



28-Oct-2020 13:28:20 DEBUG utils - Directory ./ already exists. Skipping.
28-Oct-2020 13:28:20 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1563/soft/GSE1563_family.soft.gz to ./GSE1563_family.soft.gz
100%|█████████████████████████████████████████████████████████████████████████████| 9.62M/9.62M [00:05<00:00, 1.74MB/s]
28-Oct-2020 13:28:28 DEBUG downloader - Size validation passed
28-Oct-2020 13:28:28 DEBUG downloader - Moving C:\Users\Peder\AppData\Local\Temp\tmpb1p_l7ge to C:\Users\Peder\Documents\repos\CBM-101\CBM101\C_Data_resources\GSE1563_family.soft.gz
28-Oct-2020 13:28:28 DEBUG downloader - Successfully downloaded ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1563/soft/GSE1563_family.soft.gz
28-Oct-2020 13:28:28 INFO GEOparse - Parsing ./GSE1563_family.soft.gz: 
28-Oct-2020 13:28:28 DEBUG GEOparse - DATABASE: GeoMiame
28-Oct-2020 13:28:28 DEBUG GEOparse - SERIES: GSE1563
28-Oct-2020 13:28:28 DEBUG GEOparse - PLATFORM: GPL8300
28-Oct-2020 13:28
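
The log shows where the file comes from: the series directory name replaces the last three digits of the accession number with `nnn` (`GSE1563` is stored under `GSE1nnn`). As a rough sketch of that naming pattern, inferred only from the log lines in this notebook, the URL could be rebuilt as follows. The function name `geo_series_soft_url` is hypothetical and is not a GEOparse function.

```python
import re


def geo_series_soft_url(accession):
    """Rebuild the SOFT download URL for a series accession such as 'GSE1563'.

    Assumption: the pattern is inferred from the GEOparse log output above.
    """
    match = re.fullmatch(r"GSE(\d+)", accession)
    if match is None:
        raise ValueError(f"not a GEO series accession: {accession!r}")
    digits = match.group(1)
    # The directory stub keeps all but the last three digits, then adds 'nnn'.
    stub = f"GSE{digits[:-3]}nnn"
    return (
        "ftp://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{stub}/{accession}/soft/{accession}_family.soft.gz"
    )


print(geo_series_soft_url("GSE1563"))
```

In practice `GEOparse.get_GEO()` handles this for you; the sketch is only meant to make the log easier to read.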


#### Exercise 1. Inspect your downloaded data. a) what data type is it?

In [13]:
# %load solutions/ex1_1a.py
print(kidney_data)
type(kidney_data)

<SERIES: GSE1563 - 62 SAMPLES, 1 d(s)>


GEOparse.GEOTypes.GSE

#### b) what does it contain? Try to play around to access these different contents.
Hint: use `dir` or write `kidney_data.` and press Tab

In [27]:
# %load solutions/ex1_1b.py
dir(kidney_data)

# for example: 
print(kidney_data.name, '\n')
print(kidney_data.get_type(), '\n')
kidney_data.show_metadata()  # prints the metadata itself, so no print() is needed

GSE1563 

Expression profiling by array 

!Series_title = Kidney Transplant Rejection and Tissue Injury by Gene Profiling of Biopsies and Peripheral Blood Lymphocytes
!Series_geo_accession = GSE1563
!Series_status = Public on Jul 14 2004
!Series_submission_date = Jul 14 2004
!Series_last_update_date = Dec 13 2018
!Series_pubmed_id = 15307835
!Series_summary = We used DNA microarrays (HG-U95Av2 GeneChips) to determine gene expression profiles for kidney biopsies and peripheral blood lymphocytes (PBLs) in transplant patients. Sample classes include kidney biopsies and PBLs from patients with 1) healthy normal donor kidneys, 2) well-functioning transplants with no clinical evidence of rejection, 3) kidneys undergoing acute rejection, and 4) transplants with renal dysfunction without rejection. Nomenclature for samples is as follows: 1) all sample names include either BX or PBL to indicate that they were derived from biopsies or PBLs respectively, 2) C indicates samples from healthy normal
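
The metadata is printed in SOFT style, one `!Key = value` pair per line. To illustrate how such lines map onto a dictionary, here is a small sketch that parses a few of the lines shown above. The function `parse_soft_metadata` is written for this illustration and is an assumption about the layout, not GEOparse's own parser.

```python
def parse_soft_metadata(lines):
    """Collect '!Key = value' lines into a dict mapping each key to a list of values."""
    metadata = {}
    for line in lines:
        if not line.startswith("!") or " = " not in line:
            continue  # skip anything that is not a metadata line
        key, value = line[1:].split(" = ", 1)
        # A key can occur several times, so values are gathered in a list.
        metadata.setdefault(key.strip(), []).append(value.strip())
    return metadata


# Lines copied from the output above.
sample_lines = [
    "!Series_geo_accession = GSE1563",
    "!Series_status = Public on Jul 14 2004",
    "!Series_pubmed_id = 15307835",
]
parsed = parse_soft_metadata(sample_lines)
print(parsed["Series_geo_accession"])
```

Storing each value in a list mirrors the shape of the `metadata` dictionaries that GEOparse exposes on its objects.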

#### c) look into the GSMs of `kidney_data`. 
Hint: you can also use the Tab trick multiple times to go deeper e.g. `kidney_data.gsms.` and press Tab

In [51]:
kidney_data.gsms

{'GSM26805': <SAMPLE: GSM26805>,
 'GSM26806': <SAMPLE: GSM26806>,
 'GSM26807': <SAMPLE: GSM26807>,
 'GSM26808': <SAMPLE: GSM26808>,
 'GSM26809': <SAMPLE: GSM26809>,
 'GSM26810': <SAMPLE: GSM26810>,
 'GSM26811': <SAMPLE: GSM26811>,
 'GSM26812': <SAMPLE: GSM26812>,
 'GSM26813': <SAMPLE: GSM26813>,
 'GSM26814': <SAMPLE: GSM26814>,
 'GSM26815': <SAMPLE: GSM26815>,
 'GSM26816': <SAMPLE: GSM26816>,
 'GSM26817': <SAMPLE: GSM26817>,
 'GSM26818': <SAMPLE: GSM26818>,
 'GSM26819': <SAMPLE: GSM26819>,
 'GSM26820': <SAMPLE: GSM26820>,
 'GSM26821': <SAMPLE: GSM26821>,
 'GSM26822': <SAMPLE: GSM26822>,
 'GSM26823': <SAMPLE: GSM26823>,
 'GSM26824': <SAMPLE: GSM26824>,
 'GSM26825': <SAMPLE: GSM26825>,
 'GSM26826': <SAMPLE: GSM26826>,
 'GSM26827': <SAMPLE: GSM26827>,
 'GSM26828': <SAMPLE: GSM26828>,
 'GSM26829': <SAMPLE: GSM26829>,
 'GSM26830': <SAMPLE: GSM26830>,
 'GSM26831': <SAMPLE: GSM26831>,
 'GSM26832': <SAMPLE: GSM26832>,
 'GSM26833': <SAMPLE: GSM26833>,
 'GSM26834': <SAMPLE: GSM26834>,
 'GSM26835

In [83]:
# %load solutions/ex1_1c.py

# By using the above trick we see that kidney_data.gsms is a dictionary (you can validate this)

type(kidney_data.gsms)

# a dict is a container of key,value pairs. The values in this case are of the class GEOparse.GEOTypes.GSM 
# e.g.
print(type(list(kidney_data.gsms.values())[0]))

## to get the available methods:
fst = list(kidney_data.gsms.values())[0]
dir(fst)

# for instance look at metadata:

fst.metadata

<class 'GEOparse.GEOTypes.GSM'>


{'title': ['C1PBL'],
 'geo_accession': ['GSM26805'],
 'status': ['Public on Jul 14 2004'],
 'submission_date': ['Jul 14 2004'],
 'last_update_date': ['Mar 16 2009'],
 'type': ['RNA'],
 'channel_count': ['1'],
 'source_name_ch1': ['PBL'],
 'organism_ch1': ['Homo sapiens'],
 'taxid_ch1': ['9606'],
 'molecule_ch1': ['total RNA'],
 'description': ['Clinical status: control healthy blood donor',
  'Age: unknown',
  'Sex: unknown',
  'Immunosupression: none',
  'Histopathology: none',
  'Donor type: NA',
  'Scr (mg/dL): unknown',
  'Days post transplant: NA',
  'Abbreviations used in sample description: Abreviations used to describe patient samples include the following: BX - Biopsy; PBL- Peripheral Blood Lymphocytes; CsA -Cyclosporine; MMF - Mycophenolate Mofetil; P - Prednisone; FK - Tacrolimus;  SRL - Sirolimus; CAD -Cadaveric;  LD - Live Donor; Scr - Serum Creatinine; ATN - Acute Tubular Necrosis CNI - Calcineurin Inhibitor; FSGS - Focal Segmental Glomerulosclerosis',
  'Keywords = DNA m
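
Most entries in the `description` list follow a `Key: value` pattern, so they can be turned into a dictionary of sample attributes. The sketch below works on a hand-copied subset of the list above rather than on the live object; `parse_description` is a hypothetical helper, and entries that do not contain `": "` are simply skipped.

```python
def parse_description(entries):
    """Split 'Key: value' strings into a dict; entries without ': ' are ignored."""
    fields = {}
    for entry in entries:
        key, separator, value = entry.partition(": ")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


# A subset of fst.metadata['description'], copied from the output above.
description = [
    "Clinical status: control healthy blood donor",
    "Age: unknown",
    "Sex: unknown",
    "Immunosupression: none",
    "Donor type: NA",
]
attributes = parse_description(description)
print(attributes["Clinical status"])
```

With the real data, the same call would be `parse_description(fst.metadata['description'])`. Note that the key spellings come straight from the GEO record.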

### Printing a summary
We could then do something like this:

In [84]:
# A GSM (or Sample) contains information about the conditions and preparation of the sample

print("GSM example:\n-------------")
for gsm_name, gsm in kidney_data.gsms.items():
    print("Name: ", gsm_name)
    print("Metadata:")
    for key, value in gsm.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print()
    print(gsm.table.head())
    break  # stop after the first sample
    

GSM example:
-------------
Name:  GSM26805
Metadata:
 - title : C1PBL
 - geo_accession : GSM26805
 - status : Public on Jul 14 2004
 - submission_date : Jul 14 2004
 - last_update_date : Mar 16 2009
 - type : RNA
 - channel_count : 1
 - source_name_ch1 : PBL
 - organism_ch1 : Homo sapiens
 - taxid_ch1 : 9606
 - molecule_ch1 : total RNA
 - description : Clinical status: control healthy blood donor, Age: unknown, Sex: unknown, Immunosupression: none, Histopathology: none, Donor type: NA, Scr (mg/dL): unknown, Days post transplant: NA, Abbreviations used in sample description: Abreviations used to describe patient samples include the following: BX - Biopsy; PBL- Peripheral Blood Lymphocytes; CsA -Cyclosporine; MMF - Mycophenolate Mofetil; P - Prednisone; FK - Tacrolimus;  SRL - Sirolimus; CAD -Cadaveric;  LD - Live Donor; Scr - Serum Creatinine; ATN - Acute Tubular Necrosis CNI - Calcineurin Inhibitor; FSGS - Focal Segmental Glomerulosclerosis, Keywords = DNA microarrays, gene expression,

or this:

In [85]:
# A GPL (or Platform) contains a tab-delimited table with the array definition, e.g. mappings from probe IDs to RefSeq IDs

print()
print("GPL example:\n-------------")
for gpl_name, gpl in kidney_data.gpls.items():
    print("Name: ", gpl_name)
    print("Metadata:")
    for key, value in gpl.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gpl.table.head())
    break  # stop after the first platform


GPL example:
-------------
Name:  GPL8300
Metadata:
 - title : [HG_U95Av2] Affymetrix Human Genome U95 Version 2 Array
 - geo_accession : GPL8300
 - status : Public on Mar 16 2009
 - submission_date : Mar 13 2009
 - last_update_date : Dec 13 2018
 - technology : in situ oligonucleotide
 - distribution : commercial
 - organism : Homo sapiens
 - taxid : 9606
 - manufacturer : Affymetrix
 - manufacture_protocol : see manufacturer's web site, , Based on this UniGene build and associated annotations, the HG-U95Av2 array represents approximately 10,000 full-length genes., 
 - description : Affymetrix submissions are typically submitted to GEO using the GEOarchive method described at http://www.ncbi.nlm.nih.gov/projects/geo/info/geo_affy.html, , June 03, 2009: annotation table updated with netaffx build 28, June 08, 2012: annotation table updated with netaffx build 32, June 27, 2016: annotation table updated with netaffx build 35
 - web_link : http://www.affymetrix.com/support/technical/bypr
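
The loops above stop after the first record. To get an overview across all samples, one option is to collect a few metadata fields per sample into rows. The sketch below assumes only what the outputs above show, namely that each sample has a `metadata` dictionary whose values are lists of strings. It uses a stand-in object instead of a downloaded series, and `summarize_samples` is a name introduced here for illustration.

```python
from types import SimpleNamespace


def summarize_samples(gsms, fields=("title", "source_name_ch1", "organism_ch1")):
    """Return one dict per sample with the chosen metadata fields joined into strings."""
    rows = []
    for name, gsm in gsms.items():
        row = {"sample": name}
        for field in fields:
            # Missing fields become empty strings instead of raising KeyError.
            row[field] = ", ".join(gsm.metadata.get(field, []))
        rows.append(row)
    return rows


# Stand-in for kidney_data.gsms, holding the one sample printed above.
demo_gsms = {
    "GSM26805": SimpleNamespace(
        metadata={
            "title": ["C1PBL"],
            "source_name_ch1": ["PBL"],
            "organism_ch1": ["Homo sapiens"],
        }
    ),
}
rows = summarize_samples(demo_gsms)
print(rows)
```

With the real data, `pd.DataFrame(summarize_samples(kidney_data.gsms))` would give a table with one row per sample.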

#### Exercise 2.a.
Now your task is to load the data set from the study "A circadian gene expression atlas in mammals assayed by microarray" by Zhang et al. (http://www.pnas.org/content/111/45/16219.long). The data is available in the GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54650, accession ID GSE54650).
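
Note that `get_GEO` is called with the full accession string, including the `GSE` prefix. If accession IDs arrive in mixed forms (bare numbers, lower case, stray spaces), a small normalizing step can catch mistakes before any download starts. This is a sketch under that assumption; `normalize_gse_accession` is a hypothetical helper, not a GEOparse function.

```python
import re


def normalize_gse_accession(value):
    """Return a series accession in the form 'GSE<digits>' or raise ValueError."""
    text = str(value).strip().upper()
    if text.isdigit():
        text = "GSE" + text  # a bare number is assumed to be a series ID
    if re.fullmatch(r"GSE\d+", text) is None:
        raise ValueError(f"not a GEO series accession: {value!r}")
    return text


print(normalize_gse_accession("54650"))
```

The result can then be passed on, for example `GEOparse.get_GEO(geo=normalize_gse_accession("54650"), destdir="./")`.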

In [70]:
# %load solutions/ex1_2a.py
# download the data set using GEOparse:
circadian_expression = GEOparse.get_GEO(geo="GSE54650", destdir="./")


28-Oct-2020 14:05:26 DEBUG utils - Directory ./ already exists. Skipping.
28-Oct-2020 14:05:26 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE54nnn/GSE54650/soft/GSE54650_family.soft.gz to ./GSE54650_family.soft.gz
100%|█████████████████████████████████████████████████████████████████████████████| 65.9M/65.9M [00:46<00:00, 1.50MB/s]
28-Oct-2020 14:06:14 DEBUG downloader - Size validation passed
28-Oct-2020 14:06:14 DEBUG downloader - Moving C:\Users\Peder\AppData\Local\Temp\tmp5yyeo5ak to C:\Users\Peder\Documents\repos\CBM-101\CBM101\C_Data_resources\GSE54650_family.soft.gz
28-Oct-2020 14:06:14 DEBUG downloader - Successfully downloaded ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE54nnn/GSE54650/soft/GSE54650_family.soft.gz
28-Oct-2020 14:06:14 INFO GEOparse - Parsing ./GSE54650_family.soft.gz: 
28-Oct-2020 14:06:14 DEBUG GEOparse - DATABASE: GeoMiame
28-Oct-2020 14:06:14 DEBUG GEOparse - SERIES: GSE54650
28-Oct-2020 14:06:14 DEBUG GEOparse - PLATFORM: GPL6246
28-Oct-

28-Oct-2020 14:06:48 DEBUG GEOparse - SAMPLE: GSM1321117
28-Oct-2020 14:06:49 DEBUG GEOparse - SAMPLE: GSM1321118
28-Oct-2020 14:06:49 DEBUG GEOparse - SAMPLE: GSM1321119
28-Oct-2020 14:06:49 DEBUG GEOparse - SAMPLE: GSM1321120
28-Oct-2020 14:06:49 DEBUG GEOparse - SAMPLE: GSM1321121
28-Oct-2020 14:06:49 DEBUG GEOparse - SAMPLE: GSM1321122
28-Oct-2020 14:06:50 DEBUG GEOparse - SAMPLE: GSM1321123
28-Oct-2020 14:06:50 DEBUG GEOparse - SAMPLE: GSM1321124
28-Oct-2020 14:06:50 DEBUG GEOparse - SAMPLE: GSM1321125
28-Oct-2020 14:06:50 DEBUG GEOparse - SAMPLE: GSM1321126
28-Oct-2020 14:06:50 DEBUG GEOparse - SAMPLE: GSM1321127
28-Oct-2020 14:06:51 DEBUG GEOparse - SAMPLE: GSM1321128
28-Oct-2020 14:06:51 DEBUG GEOparse - SAMPLE: GSM1321129
28-Oct-2020 14:06:51 DEBUG GEOparse - SAMPLE: GSM1321130
28-Oct-2020 14:06:51 DEBUG GEOparse - SAMPLE: GSM1321131
28-Oct-2020 14:06:52 DEBUG GEOparse - SAMPLE: GSM1321132
28-Oct-2020 14:06:52 DEBUG GEOparse - SAMPLE: GSM1321133
28-Oct-2020 14:06:52 DEBUG GEOp

28-Oct-2020 14:07:20 DEBUG GEOparse - SAMPLE: GSM1321261
28-Oct-2020 14:07:20 DEBUG GEOparse - SAMPLE: GSM1321262
28-Oct-2020 14:07:20 DEBUG GEOparse - SAMPLE: GSM1321263
28-Oct-2020 14:07:20 DEBUG GEOparse - SAMPLE: GSM1321264
28-Oct-2020 14:07:20 DEBUG GEOparse - SAMPLE: GSM1321265
28-Oct-2020 14:07:21 DEBUG GEOparse - SAMPLE: GSM1321266
28-Oct-2020 14:07:21 DEBUG GEOparse - SAMPLE: GSM1321267
28-Oct-2020 14:07:21 DEBUG GEOparse - SAMPLE: GSM1321268
28-Oct-2020 14:07:21 DEBUG GEOparse - SAMPLE: GSM1321269
28-Oct-2020 14:07:22 DEBUG GEOparse - SAMPLE: GSM1321270
28-Oct-2020 14:07:22 DEBUG GEOparse - SAMPLE: GSM1321271
28-Oct-2020 14:07:22 DEBUG GEOparse - SAMPLE: GSM1321272
28-Oct-2020 14:07:22 DEBUG GEOparse - SAMPLE: GSM1321273
28-Oct-2020 14:07:22 DEBUG GEOparse - SAMPLE: GSM1321274
28-Oct-2020 14:07:23 DEBUG GEOparse - SAMPLE: GSM1321275
28-Oct-2020 14:07:23 DEBUG GEOparse - SAMPLE: GSM1321276
28-Oct-2020 14:07:23 DEBUG GEOparse - SAMPLE: GSM1321277
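
The parser logs one `SAMPLE:` line per sample, so the log itself can serve as a quick sanity check on how many samples were read. The following sketch extracts the sample IDs from a short excerpt of the log above; the regular expression is based only on the line format visible in this notebook.

```python
import re

# Matches lines such as '... DEBUG GEOparse - SAMPLE: GSM1321117'.
SAMPLE_LINE = re.compile(r"DEBUG GEOparse - SAMPLE: (GSM\d+)\s*$")


def samples_in_log(log_text):
    """Return the sample accessions mentioned in GEOparse debug output, in order."""
    found = []
    for line in log_text.splitlines():
        match = SAMPLE_LINE.search(line)
        if match:
            found.append(match.group(1))
    return found


log_excerpt = """\
28-Oct-2020 14:06:14 DEBUG GEOparse - PLATFORM: GPL6246
28-Oct-2020 14:06:48 DEBUG GEOparse - SAMPLE: GSM1321117
28-Oct-2020 14:06:49 DEBUG GEOparse - SAMPLE: GSM1321118
"""
print(samples_in_log(log_excerpt))
```

On the full log, the length of this list should agree with `len(circadian_expression.gsms)`.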


#### b) Use the GSM and GPL example code above to print information about the data

In [76]:
# %load solutions/ex1_2b.py

# we only have to change the name of the data variable

print()
print("GSM example:\n-------------")
for gsm_name, gsm in circadian_expression.gsms.items():
    print("Name: ", gsm_name)
    print("Metadata:")
    for key, value in gsm.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gsm.table.head())
    break  # stop after the first sample


print('\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@')


# GPL example:
print()
print("GPL example:\n-------------")
for gpl_name, gpl in circadian_expression.gpls.items():
    print("Name: ", gpl_name)
    print("Metadata:")
    for key, value in gpl.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gpl.table.head())
    break  # stop after the first platform


GSM example:
-------------
Name:  GSM1320990
Metadata:
 - title : Adr_CT18
 - geo_accession : GSM1320990
 - status : Public on Sep 30 2014
 - submission_date : Feb 04 2014
 - last_update_date : Sep 30 2014
 - type : RNA
 - channel_count : 1
 - source_name_ch1 : adrenal gland
 - organism_ch1 : Mus musculus
 - taxid_ch1 : 10090
 - characteristics_ch1 : strain: C57/BL6, tissue: adrenal gland
 - growth_protocol_ch1 : C57/BL6 mice were entrained to a 12h light : 12h dark schedule for one week. Mice were then transferred to complete darkness prior to collection. Collection began at CT18. Mice were housed in light-tight boxes for the duration of the experiment.
 - molecule_ch1 : total RNA
 - extract_protocol_ch1 : Organ samples were homogenized in Trizol reagent (Invitrogen) using a Tissuelyser (Qiagen). RNA was extracted using RNeasy columns (Qiagen) as per manufacturer's protocol, then pooled from three mice for each organ and time point.
 - label_ch1 : biotin
 - label_protocol_ch1 : not
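
Since the GSM and GPL loops differ only in the mapping they iterate over, the shared part could be folded into one helper. The sketch below is one possible way to do that, assuming each record exposes a `metadata` dictionary of string lists as in the outputs above. It formats the metadata only and leaves out the table preview. The name `format_first_record` and the stand-in record are illustrative.

```python
from types import SimpleNamespace


def format_first_record(records, label):
    """Format the name and metadata of the first record in a mapping as text."""
    lines = [f"{label} example:", "-------------"]
    for name, record in records.items():
        lines.append(f"Name:  {name}")
        lines.append("Metadata:")
        for key, values in record.metadata.items():
            lines.append(f" - {key} : {', '.join(values)}")
        break  # only the first record is formatted
    return "\n".join(lines)


# Stand-in for circadian_expression.gsms, using values from the output above.
demo_records = {
    "GSM1320990": SimpleNamespace(
        metadata={"title": ["Adr_CT18"], "organism_ch1": ["Mus musculus"]}
    ),
}
summary = format_first_record(demo_records, "GSM")
print(summary)
```

With the downloaded data, the calls would be `format_first_record(circadian_expression.gsms, "GSM")` and `format_first_record(circadian_expression.gpls, "GPL")`.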