<!--associate:
# The comment must be at the very beginning of a cell, by itself, starting with 'associate:'. 
# Since it is not meant to appear in the output when run, the assumption is that it can 
# require a cell to itself. This refers to  directories, relative to the notebook.
TAI.png
-->

# Module 1: Is the hourglass model for gene expression really supported by the data?
### Paper to be examined: 
“A phylogenetically based transcriptome age index mirrors ontogenetic divergence patterns”, Nature 468(7325):815-818 (2010) [1]
### Key claim of the paper: 
"Gene expression follows the so-called hourglass pattern observed for morphological features of development, which are most similar to each other in the phylotypic stage in mid-development."

### Schedule:
* H1: General introduction to the paper/motivation
* H2-3: Write code to import the data and start computing transcriptome age index (TAI)
* H4-6: Aim to reproduce figure 1 of the paper – help/scripts will be given if needed.
* H7: Discussion: “Are you convinced of this result? What might have gone wrong?”
* H8: Redo analysis using log-transformed data
* H9: Summarize results (e.g. on this wiki)


### Key bioinformatics concept of this module: 
"Data normalization is important and can impact the results of subsequent analyses!"


# Installation and Setup

* Install the Anaconda distribution of Python 3.x.


# Libraries
We will use [**GEOparse**](https://geoparse.readthedocs.io/en/latest/usage.html#working-with-geo-accession) to fetch gene expression data and [**pandas**](https://pandas.pydata.org/pandas-docs/stable/10min.html) for data manipulation and preprocessing.

In [2]:
########### import necessary packages
import pandas as pd
import numpy as np
import GEOparse
import matplotlib.pyplot as plt

# Read gene expression data 


In [3]:
############# Download the data
file_name = 'GSE24616'
# write your code here 
gse = GEOparse.get_GEO(geo=file_name, destdir="./")

12-Nov-2018 15:10:53 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE24nnn/GSE24616/soft/GSE24616_family.soft.gz to ./GSE24616_family.soft.gz
12-Nov-2018 15:10:53 INFO utils - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE24nnn/GSE24616/soft/GSE24616_family.soft.gz to ./GSE24616_family.soft.gz



D: 100% - 50.6MiB  / 50.6MiB  eta 0:00:00


12-Nov-2018 15:11:30 INFO GEOparse - Parsing ./GSE24616_family.soft.gz: 
12-Nov-2018 15:11:30 DEBUG GEOparse - DATABASE: GeoMiame
12-Nov-2018 15:11:30 DEBUG GEOparse - SERIES: GSE24616
12-Nov-2018 15:11:30 DEBUG GEOparse - PLATFORM: GPL6457
12-Nov-2018 15:11:31 DEBUG GEOparse - SAMPLE: GSM606866
12-Nov-2018 15:11:31 DEBUG GEOparse - SAMPLE: GSM606867
12-Nov-2018 15:11:32 DEBUG GEOparse - SAMPLE: GSM606868
12-Nov-2018 15:11:32 DEBUG GEOparse - SAMPLE: GSM606869
12-Nov-2018 15:11:32 DEBUG GEOparse - SAMPLE: GSM606870
12-Nov-2018 15:11:33 DEBUG GEOparse - SAMPLE: GSM606871
12-Nov-2018 15:11:33 DEBUG GEOparse - SAMPLE: GSM606872
12-Nov-2018 15:11:33 DEBUG GEOparse - SAMPLE: GSM606873
12-Nov-2018 15:11:33 DEBUG GEOparse - SAMPLE: GSM606874
12-Nov-2018 15:11:33 DEBUG GEOparse - SAMPLE: GSM606875
12-Nov-2018 15:11:34 DEBUG GEOparse - SAMPLE: GSM606876
12-Nov-2018 15:11:34 DEBUG GEOparse - SAMPLE: GSM606877
12-Nov-2018 15:11

12-Nov-2018 15:11:52 DEBUG GEOparse - SAMPLE: GSM607002
12-Nov-2018 15:11:52 DEBUG GEOparse - SAMPLE: GSM607003
12-Nov-2018 15:11:52 DEBUG GEOparse - SAMPLE: GSM607004
12-Nov-2018 15:11:52 DEBUG GEOparse - SAMPLE: GSM607005
12-Nov-2018 15:11:52 DEBUG GEOparse - SAMPLE: GSM607006
12-Nov-2018 15:11:52 DEBUG GEOparse - SAMPLE: GSM607007
12-Nov-2018 15:11:53 DEBUG GEOparse - SAMPLE: GSM607008
12-Nov-2018 15:11:53 DEBUG GEOparse - SAMPLE: GSM607009
12-Nov-2018 15:11:53 DEBUG GEOparse - SAMPLE: GSM607010
12-Nov-2018 15:11:53 DEBUG GEOparse - SAMPLE: GSM607011
12-Nov-2018 15:11:53 DEBUG GEOparse - SAMPLE: GSM607012


## GSE data structure:
Let's take a look at the structure of the gene expression data and how to access it.

**Data Structure:**
    - gse.gsms (dict: sample name -> GSM object)
        - gsm.metadata
        - gsm.name
        - gsm.table
    - gse.gpls (dict: platform name -> GPL object)
        - gpl.metadata
        - gpl.name
        - gpl.table
        
**GSE file name:** GSE24616

In [None]:
########### Explore an example of GSE content
print("GSM example:")
for gsm_name, gsm in gse.gsms.items():
    print("Name: ", gsm_name)
    print('*' * 100)
    print("Metadata:")
    for key, value in gsm.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print('*' * 100)
    print("Table data:")
    print(gsm.table.head())
    break  # only show the first sample

In [None]:
print("GPL example:")
for gpl_name, gpl in gse.gpls.items():
    print("Name: ", gpl_name)
    print('*' * 100)
    print("Metadata:")
    for key, value in gpl.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print('*' * 100)
    print("Table data:")
    print(gpl.table.head())
    break  # only show the first platform

# Read age index data file

In [3]:
########### Read in age index data
# write your code here 
########### Set ProbeID as the index of dataframe
# write your code here 

# Preprocessing gene expression data:
The gene expression data needs to be extracted from the GSE data structure. The preprocessing steps are:
1. Extract the metadata
2. Extract the gene expression data
3. Add age index data to the gene expression data
4. Get average for the genes with multiple probesets
5. Select mixed and female samples 
6. Get the average gene expression for similar time points

## 1) Extract metadata of samples in gene expression data
Complementary information about each sample is stored in **gsm.metadata**, including sex, developmental stage and the sample name. The metadata of one sample looks like:

"characteristics_ch1 : strain: wild type, developmental stage: adult, developmental timing: 1y9m, gender: mixed, number of individuals per sample: 2"

All of this information is stored as "key: value" text, so we need to extract the individual fields with string operations.
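
The string handling described above can be sketched as follows. This is only one possible approach, not the reference solution: the helper name `parse_characteristics` is hypothetical, and the sketch assumes the metadata value is a list of strings holding comma-separated "key: value" pairs, as in the printed example. A value that itself contains ", " would be split incorrectly.

```python
def parse_characteristics(entries):
    """Turn 'key: value' fragments into a dict.

    `entries` is assumed to be a list of strings, each holding one or
    several comma-separated 'key: value' pairs.
    """
    joined = ", ".join(entries)
    parsed = {}
    for fragment in joined.split(", "):
        key, sep, value = fragment.partition(": ")
        if sep:  # skip fragments that do not look like 'key: value'
            parsed[key.strip()] = value.strip()
    return parsed


# The sample metadata line quoted above, as a one-element list.
example = ["strain: wild type, developmental stage: adult, "
           "developmental timing: 1y9m, gender: mixed, "
           "number of individuals per sample: 2"]
info = parse_characteristics(example)
print(info["developmental stage"], info["developmental timing"], info["gender"])
```

The resulting fields can then be appended to the `stage`, `time` and `sex` lists of the `characteristics` dictionary, one entry per sample.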

In [1]:
############### Extract GSE metadata
characteristics = {"stage":[],"time":[],"sex":[],"sample_name":[]}
# write your code here
char_df.head()

## 2) Extract the gene expression data

In [None]:
############### Extract the gene expression data
# write your code here 
############# Add ProbeID as the index of gene expression dataframe
# write your code here 
############ Let's look at the gene expression data
# write your code here 
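
As a hedged illustration of this step, the sketch below combines per-sample tables into one expression matrix. The two small tables are made-up stand-ins for `gsm.table`, and the column names `ID_REF` and `VALUE` are assumptions; check `gsm.table.head()` for the actual names in this dataset.

```python
import pandas as pd

# Stand-ins for gsm.table of two samples (hypothetical values).
sample_tables = {
    "GSM_A": pd.DataFrame({"ID_REF": ["p1", "p2", "p3"],
                           "VALUE": [10.0, 200.0, 35.0]}),
    "GSM_B": pd.DataFrame({"ID_REF": ["p1", "p2", "p3"],
                           "VALUE": [12.0, 180.0, 40.0]}),
}

# One column per sample, rows aligned on the probe ID.
expression = pd.concat(
    {name: table.set_index("ID_REF")["VALUE"]
     for name, table in sample_tables.items()},
    axis=1,
)
expression.index.name = "ProbeID"
print(expression.head())
```

With the real data, the dictionary would be built by looping over `gse.gsms.items()`.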

## 3) Add age index data to the gene expression data

In [2]:
# write your code here 

In [1]:
############ Sort by GeneID
# write your code here 
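
One way to attach the age index is an index-based join, sketched here on toy data. The column names `GeneID` and `age` and all values are assumptions for illustration; the real age index file may use different names.

```python
import pandas as pd

# Toy expression matrix indexed by ProbeID (hypothetical values).
expression = pd.DataFrame(
    {"GSM_A": [10.0, 200.0, 35.0], "GSM_B": [12.0, 180.0, 40.0]},
    index=pd.Index(["p1", "p2", "p3"], name="ProbeID"),
)

# Toy age index table, also indexed by ProbeID.
age_index = pd.DataFrame(
    {"GeneID": ["g1", "g1", "g2", "g3"], "age": [1, 1, 7, 3]},
    index=pd.Index(["p1", "p2", "p3", "p4"], name="ProbeID"),
)

# The inner join keeps only probes present in both tables (p4 is dropped).
annotated = age_index.join(expression, how="inner").sort_values("GeneID")
print(annotated)
```

An inner join is a design choice here: probes without an age assignment cannot contribute to the TAI, so they are discarded at this point.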


## 4) Get average for the genes with multiple probesets

In [4]:
########### Average out multiple transcripts
# write your code here 
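
A minimal sketch of the averaging step, assuming the annotated table has a `GeneID` column, an `age` column and one numeric column per sample (toy values, hypothetical names):

```python
import pandas as pd

annotated = pd.DataFrame(
    {
        "GeneID": ["g1", "g1", "g2"],
        "age": [1, 1, 7],
        "GSM_A": [10.0, 200.0, 35.0],
        "GSM_B": [12.0, 180.0, 40.0],
    },
    index=pd.Index(["p1", "p2", "p3"], name="ProbeID"),
)

# One row per gene: probes that map to the same GeneID are averaged.
# 'age' is identical within a gene here, so its mean leaves it unchanged.
per_gene = annotated.groupby("GeneID").mean()
print(per_gene)
```

After this step the table is indexed by `GeneID` instead of `ProbeID`.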

## 5) Select mixed and female samples

In [74]:
# write your code here 
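
The sample selection can be sketched with a boolean mask on the metadata table. The table below is a made-up stand-in for `char_df`; it reuses the keys of the `characteristics` dictionary, but the stage and time labels are hypothetical.

```python
import pandas as pd

char_df = pd.DataFrame({
    "sample_name": ["GSM_A", "GSM_B", "GSM_C"],
    "stage": ["zygote", "adult", "adult"],   # hypothetical labels
    "time": ["0.25h", "1y9m", "1y9m"],       # hypothetical labels
    "sex": ["mixed", "female", "male"],
})

# Keep only samples annotated as mixed or female.
keep = char_df[char_df["sex"].isin(["mixed", "female"])]
selected = keep["sample_name"].tolist()
print(selected)
```

The list `selected` can then be used to subset the columns of the expression table.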

## 6) Get the average gene expression for similar time points

In [5]:
########## find samples of the same time points

# write your code here 
########### average the samples for similar time points
# write your code here 
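
As a sketch of this step under stated assumptions, replicate samples can be averaged by grouping the sample columns on their time point. The names `per_gene` and `sample_time` and all values are hypothetical.

```python
import pandas as pd

# Toy per-gene expression table, one column per sample.
per_gene = pd.DataFrame(
    {"GSM_A": [10.0, 35.0], "GSM_B": [14.0, 45.0], "GSM_C": [100.0, 20.0]},
    index=pd.Index(["g1", "g2"], name="GeneID"),
)

# Mapping from sample name to its time point label.
sample_time = pd.Series({"GSM_A": "t1", "GSM_B": "t1", "GSM_C": "t2"})

# Transpose so samples are rows, group them by time point, transpose back.
by_time = per_gene.T.groupby(sample_time).mean().T
print(by_time)
```

Note that `groupby` sorts the group labels alphabetically, which is generally not the developmental order, so the columns may need to be reordered before plotting.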

# TAI Calculation

In [6]:
########### Calculating TAI
# write your code here 
#### define color map

# write your code here 
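
For orientation, the TAI of a stage $s$ is the expression-weighted mean of the gene ages, $TAI_s = \sum_i ps_i \, e_{is} / \sum_i e_{is}$, where $ps_i$ is the age (phylostratum) of gene $i$ and $e_{is}$ its expression at stage $s$. The sketch below applies this to toy data; the function and variable names are hypothetical and the numbers are made up.

```python
import pandas as pd

# Toy gene ages and expression per time point (rows: genes).
ages = pd.Series([1, 1, 7], index=["g1", "g2", "g3"])
by_time = pd.DataFrame(
    {"t1": [10.0, 30.0, 10.0], "t2": [10.0, 10.0, 30.0]},
    index=["g1", "g2", "g3"],
)


def transcriptome_age_index(expr, ages):
    """Expression-weighted mean gene age for each column of `expr`."""
    weighted = expr.mul(ages, axis=0).sum(axis=0)
    return weighted / expr.sum(axis=0)


tai = transcriptome_age_index(by_time, ages)
print(tai)
```

In the toy data the young gene `g3` is more highly expressed at `t2`, so the TAI is higher there than at `t1`.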

# Save the pre-processed expression data to a file

In [None]:
########### add age index to expression and write to file
# write your code here 

## Plot histogram of gene expression data vs log values

In [7]:
# write your code here 
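
A possible layout for this comparison is two histograms side by side. The sketch uses synthetic, right-skewed values as a stand-in for the real intensities, so the shapes are only indicative; the log base (here 2) is also an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic right-skewed "intensities" (not the real data).
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=6.0, sigma=1.5, size=5000)
logged = np.log2(raw)

# Left: raw values, right: log2-transformed values.
fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(8, 3))
ax_raw.hist(raw, bins=50)
ax_raw.set_title("raw intensities")
ax_log.hist(logged, bins=50)
ax_log.set_title("log2 intensities")
fig.tight_layout()
```

On the raw scale a few very large values stretch the axis, while the log-transformed values are spread far more evenly.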


## TAI calculation with log normalization of gene expression

In [9]:
########### Calculating TAI
# write your code here 
#### define color map

# write your code here 
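
To illustrate why the transformation matters, the sketch below computes the TAI on raw and on log2-transformed toy values. All names and numbers are hypothetical, and the sketch assumes strictly positive expression values so that the logarithm is defined.

```python
import numpy as np
import pandas as pd

# Toy data: the young gene g3 is very highly expressed at t2.
ages = pd.Series([1, 1, 12], index=["g1", "g2", "g3"])
by_time = pd.DataFrame(
    {"t1": [100.0, 100.0, 100.0], "t2": [100.0, 100.0, 10000.0]},
    index=["g1", "g2", "g3"],
)


def tai(expr, ages):
    """Expression-weighted mean gene age for each column of `expr`."""
    return expr.mul(ages, axis=0).sum(axis=0) / expr.sum(axis=0)


tai_raw = tai(by_time, ages)
tai_log = tai(np.log2(by_time), ages)
print(pd.DataFrame({"raw": tai_raw, "log2": tai_log}))
```

On the raw scale the single highly expressed gene pulls the TAI at `t2` close to its own age, while on the log scale its influence is much smaller. This is the kind of effect to look for when comparing the two versions of the analysis.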

## TAI calculation with absolute log normalization of gene expression (Optional analysis)

In [10]:
# write your code here 

#### define color map

# write your code here 