<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-retrieval-from-GEO" data-toc-modified-id="Data-retrieval-from-GEO-1">Data retrieval from GEO</a></span><ul class="toc-item"><li><span><a href="#Installation-of-libraries" data-toc-modified-id="Installation-of-libraries-1.1">Installation of libraries</a></span></li><li><span><a href="#Exercise-1" data-toc-modified-id="Exercise-1-1.2">Exercise 1</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Exercise-1.-Inspect-your-downloaded-data.-a)-what-data-type-is-it?" data-toc-modified-id="Exercise-1.-Inspect-your-downloaded-data.-a)-what-data-type-is-it?-1.2.0.1">Exercise 1. Inspect your downloaded data. a) what data type is it?</a></span></li><li><span><a href="#b)-what-does-it-contain?-Try-to-play-around-to-access-these-different-contents." data-toc-modified-id="b)-what-does-it-contain?-Try-to-play-around-to-access-these-different-contents.-1.2.0.2">b) what does it contain? Try to play around to access these different contents.</a></span></li><li><span><a href="#c)-look-into-the-GSMs-of-kidney_data." data-toc-modified-id="c)-look-into-the-GSMs-of-kidney_data.-1.2.0.3">c) look into the GSMs of <code>kidney_data</code>.</a></span></li></ul></li><li><span><a href="#Printing-a-summary" data-toc-modified-id="Printing-a-summary-1.2.1">Printing a summary</a></span><ul class="toc-item"><li><span><a href="#Exercise-2.a." data-toc-modified-id="Exercise-2.a.-1.2.1.1">Exercise 2.a.</a></span></li><li><span><a href="#b)-use-the-GSM-example-and-GPL-example-codes-above-to-print-information-of-the-data" data-toc-modified-id="b)-use-the-GSM-example-and-GPL-example-codes-above-to-print-information-of-the-data-1.2.1.2">b) use the GSM example and GPL example codes above to print information of the data</a></span></li></ul></li></ul></li></ul></li></ul></div>

# Data retrieval from GEO

In this exercise, we download data from the NCBI GEO database programmatically. The exercise is based on the example at https://geoparse.readthedocs.io/en/latest/usage.html#examples.

GEOparse is a Python library for accessing the Gene Expression Omnibus (GEO) database. `GEOparse.get_GEO()` looks up a specified accession ID in GEO and downloads the corresponding file to a specified directory. For a series accession such as the ones used here, the result is loaded into a `GEOparse.GSE` object. See the documentation at https://geoparse.readthedocs.io/en/latest/introduction.html#features.


We will get familiar with exploring unfamiliar data.

## Installation of libraries

The first step is to install (if needed) and import the required Python libraries.


In [1]:
# pip is the package installer for Python; see https://pypi.org/project/pip/ for details.
# Uncomment the two lines below if GEOparse is not installed yet:
#import sys
#!{sys.executable} -m pip install GEOparse

In [2]:
import GEOparse
# To read, write and process tabular data:
import pandas as pd

## Exercise 1

Let's download an example data set from the study "Kidney Transplant Rejection and Tissue Injury by Gene Profiling of Biopsies and Peripheral Blood Lymphocytes" by Flechner et al, 2007 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2041877/).

In [3]:
# Check your current working folder if necessary:
import os
os.getcwd()

'F:\\CBM101\\C_Data_resources'

In [5]:
# Download the data set using GEOparse (the data is available in the GEO database under the accession ID GSE1563)

kidney_data = GEOparse.get_GEO(geo="GSE1563", destdir="./")



16-Jun-2022 11:28:15 DEBUG utils - Directory ./ already exists. Skipping.
16-Jun-2022 11:28:15 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1563/soft/GSE1563_family.soft.gz to ./GSE1563_family.soft.gz
18.9MB [00:41, 478kB/s]                                                                                                


OSError: Download failed due to 'Downloaded size do not match the expected size for ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1563/soft/GSE1563_family.soft.gz'. ID could be incorrect or the data might not be public yet.
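
The error above reports that the downloaded size did not match the expected size, which can be a transient network problem rather than a wrong accession ID. One way to cope is to retry the call a few times. The helper below is a minimal sketch and not part of GEOparse: `retry_call`, `attempts`, and `wait_seconds` are names chosen here for illustration, and the flaky function is a stand-in so the example runs without network access. It assumes the failure surfaces as an `OSError`, as in the output above.

```python
import time


def retry_call(func, attempts=3, wait_seconds=0.0):
    """Call func() up to `attempts` times, retrying when it raises OSError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError as error:
            last_error = error
            print(f"Attempt {attempt} of {attempts} failed: {error}")
            if attempt < attempts:
                time.sleep(wait_seconds)
    # Every attempt failed: re-raise the last error so the caller still sees it.
    raise last_error


# Hypothetical stand-in for a flaky download: fails twice, then succeeds.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("Downloaded size does not match the expected size")
    return "GSE1563_family.soft.gz"


result = retry_call(flaky_download, attempts=3)
print(result)
```

In the notebook, the same helper could wrap the real call, for example `kidney_data = retry_call(lambda: GEOparse.get_GEO(geo="GSE1563", destdir="./"))`.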

#### Exercise 1. Inspect your downloaded data. a) what data type is it?

In [None]:
# %load solutions/ex1_1a.py
print(kidney_data)
type(kidney_data)

#### b) what does it contain? Try to play around to access these different contents.
Hint: use `dir` or write `kidney_data.` and press Tab

In [None]:
# %load solutions/ex1_1b.py
print(dir(kidney_data))

# for example:
print(kidney_data.name, '\n')
print(kidney_data.get_type(), '\n')
# show_metadata() prints the metadata itself, so it does not need to be wrapped in print()
kidney_data.show_metadata()
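
`dir` returns every attribute name, including the many underscore-prefixed names that Python adds automatically, so the interesting ones are easy to miss. A small sketch of filtering them out follows; `public_names` and the `Demo` class are hypothetical and exist only so the example runs without downloading anything.

```python
def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class Demo:
    """Hypothetical stand-in for a downloaded GEO object."""

    def __init__(self):
        self.name = "demo"
        self.metadata = {}
        self._cache = None  # private attribute, filtered out below

    def get_type(self):
        return "DEMO"


names = public_names(Demo())
print(names)  # ['get_type', 'metadata', 'name']
```

With the real data, `public_names(kidney_data)` would give a shorter list to explore than `dir(kidney_data)`.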

#### c) look into the GSMs of `kidney_data`. 
Hint: you can also use the Tab trick multiple times to go deeper e.g. `kidney_data.gsms.` and press Tab

In [None]:
kidney_data.gsms

In [None]:
# %load solutions/ex1_1c.py

# By using the above trick we see that kidney_data.gsms is a dictionary (you can validate this):

print(type(kidney_data.gsms))

# A dict is a container of key-value pairs. The values in this case are of the class GEOparse.GEOTypes.GSM,
# e.g.
print(type(list(kidney_data.gsms.values())[0]))

# To get the available attributes and methods:
fst = list(kidney_data.gsms.values())[0]
print(dir(fst))

# For instance, look at the metadata:

fst.metadata
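
Because `kidney_data.gsms` is a dictionary, the usual dictionary tools apply. The sketch below uses a plain dictionary with made-up sample names and contents (not taken from GSE1563) to show how to count the entries, list the keys, and fetch the first value without copying all values into a list first.

```python
# Made-up stand-in for kidney_data.gsms: sample name -> sample contents.
samples = {
    "GSM_A": {"title": ["first sample"]},
    "GSM_B": {"title": ["second sample"]},
    "GSM_C": {"title": ["third sample"]},
}

print(len(samples))   # number of samples
print(list(samples))  # the keys, in insertion order

# next(iter(...)) takes the first item without building a full list.
first_name = next(iter(samples))
first_sample = next(iter(samples.values()))
print(first_name, first_sample)

# A specific sample is retrieved by its key.
print(samples["GSM_B"]["title"])
```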

### Printing a summary
We could then do something like this:

In [None]:
# A GSM (or a Sample) contains information about the conditions and preparation of the sample

print("GSM example:\n-------------")
for gsm_name, gsm in kidney_data.gsms.items():
    print("Name: ", gsm_name)
    print("Metadata:")
    for key, value in gsm.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print()
    print(gsm.table.head())
    break  # so we stop after the first

or this:

In [None]:
# A GPL (or a Platform) contains a tab-delimited table with the array definition, e.g. mappings from probe IDs to RefSeq IDs

print()
print("GPL example:\n-------------")
for gpl_name, gpl in kidney_data.gpls.items():
    print("Name: ", gpl_name)
    print("Metadata:")
    for key, value in gpl.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gpl.table.head())
    break  # so we stop after the first
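
To make the idea of a probe-to-RefSeq mapping concrete, the sketch below parses a tiny tab-delimited table with the standard library and builds a lookup dictionary. The column names `ID` and `RefSeq` and all values are invented for illustration; the real column names depend on the platform and can be checked with `gpl.table.columns`.

```python
import csv
import io

# Invented platform table: tab-delimited, one probe per row.
platform_text = (
    "ID\tRefSeq\n"
    "probe_1\tNM_EXAMPLE_1\n"
    "probe_2\tNM_EXAMPLE_2\n"
    "probe_3\t\n"  # a probe without an annotation
)

reader = csv.DictReader(io.StringIO(platform_text), delimiter="\t")

# Map each probe ID to its RefSeq ID, skipping probes with no annotation.
probe_to_refseq = {row["ID"]: row["RefSeq"] for row in reader if row["RefSeq"]}

print(probe_to_refseq)
```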

#### Exercise 2.a.
Now your task is to load the data set from the study "A circadian gene expression atlas in mammals assayed by microarray" by Zhang et al (http://www.pnas.org/content/111/45/16219.long). The data is available in the GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54650, accession ID GSE54650).

In [None]:
# %load solutions/ex1_2a.py
# download the data set using GEOparse:
circadian_expression = GEOparse.get_GEO(geo="GSE54650", destdir="./")
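
The log output for GSE1563 earlier in this notebook shows the file being fetched from a directory named `GSE1nnn`, which suggests that series are grouped by replacing the last three digits of the accession with `nnn`. The sketch below reproduces that pattern for any series accession. `series_soft_url` is a hypothetical helper written for this notebook, not a GEOparse function, and the URL layout is inferred from that log line rather than from GEOparse's source. It also rejects a bare number such as `54650`, since the accession needs its `GSE` prefix.

```python
import re


def series_soft_url(accession):
    """Build the SOFT family file URL for a GEO series accession (GSE...)."""
    accession = accession.strip().upper()
    if not re.fullmatch(r"GSE\d+", accession):
        raise ValueError(f"Not a GEO series accession: {accession!r}")
    digits = accession[3:]
    # Replace the last three digits with 'nnn' to get the range directory.
    range_dir = "GSE" + digits[:-3] + "nnn"
    return (
        "ftp://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{range_dir}/{accession}/soft/{accession}_family.soft.gz"
    )


url_kidney = series_soft_url("GSE1563")
url_circadian = series_soft_url("GSE54650")
print(url_kidney)
print(url_circadian)
```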


#### b) use the GSM example and GPL example codes above to print information of the data

In [None]:
# %load solutions/ex1_2b.py

# We only have to change the name of the data variable

print()
print("GSM example:\n-------------")
for gsm_name, gsm in circadian_expression.gsms.items():
    print("Name: ", gsm_name)
    print("Metadata:")
    for key, value in gsm.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gsm.table.head())
    break  # so we stop after the first


print('\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@')


# GPL example:
print()
print("GPL example:\n-------------")
for gpl_name, gpl in circadian_expression.gpls.items():
    print("Name: ", gpl_name)
    print("Metadata:")
    for key, value in gpl.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gpl.table.head())
    break  # so we stop after the first
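
Since the GSM and GPL loops differ only in the dictionary they iterate over, they can be folded into one function. The following is a sketch of that refactoring: `describe_first` is a hypothetical name, and the stand-in objects built with `SimpleNamespace` only mimic the `metadata` attribute so that the example runs without a download. The table preview is left out because the stand-ins have no `table`.

```python
from types import SimpleNamespace


def describe_first(entries, label):
    """Return a text summary of the first entry in a name -> object dict."""
    lines = [f"{label} example:", "-------------"]
    for name, entry in entries.items():
        lines.append(f"Name: {name}")
        lines.append("Metadata:")
        for key, value in entry.metadata.items():
            # Each metadata value is a list of strings, so join it for display.
            lines.append(f" - {key} : {', '.join(value)}")
        break  # only the first entry
    return "\n".join(lines)


# Made-up stand-ins for circadian_expression.gsms and circadian_expression.gpls.
fake_gsms = {
    "GSM_X": SimpleNamespace(metadata={"title": ["sample X"], "organism": ["a", "b"]})
}
fake_gpls = {"GPL_Y": SimpleNamespace(metadata={"title": ["platform Y"]})}

summary_gsm = describe_first(fake_gsms, "GSM")
summary_gpl = describe_first(fake_gpls, "GPL")
print(summary_gsm)
print()
print(summary_gpl)
```

With the downloaded data, the calls would be `print(describe_first(circadian_expression.gsms, "GSM"))` and `print(describe_first(circadian_expression.gpls, "GPL"))`.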