# answers 01: _the case of the dead sand mouse_

__Sean's version__

We don't need any imports. We're starting simple.

I assume that you have the data files [Moriarty_SuppTable1](http://mcb112.org/w01/Moriarty_SuppTable1) and [Adler_SuppTable2](http://mcb112.org/w01/Adler_SuppTable2) downloaded and in the current working directory where you're running Jupyter.

In [1]:
# Good practice: define params, data file names, etc at top of page or script.
#   then it's easy to make changes and re-run
moriartyfile = "Moriarty_SuppTable1"
adlerfile    = "Adler_SuppTable2"

## 1. check that the gene names match

> *Output the names that appear in Moriarty_SuppTable1 but not Adler_SuppTable2, if any*

One of many possible strategies: read each file line by line. Split each line into fields, on whitespace. Field 0 (the first one) is the name. Make a list of the names in each file. Then convert the lists to sets, because Python lets you subtract one set from another: 


In [2]:
with open(moriartyfile) as f1:
    names1 = []                          # initialize an empty list
    for line in f1:
        if line[0] == '#': continue      # skip comment lines
        line   = line.rstrip('\n')       # remove the trailing newline
        fields = line.split()            # split into fields on whitespace
        names1.append(fields[0])         # Add fields[0], the gene name, to the list. 'append' is a method in the list object.

with open(adlerfile) as f2:
    names2 = []
    for line in f2:
        if line[0] == '#': continue      
        line   = line.rstrip('\n')       
        fields = line.split()            
        names2.append(fields[0])   
    
# A fast pythonic way to get the different elements between two lists: set comparison.
for gene in set(names1) - set(names2):
    print(gene)
    
# Alternatively, the more obvious but slower way:
#  for gene in names1:
#     if gene not in names2: print(gene)

9-Sep
2-Sep
14-Sep
10-Mar
5-Mar
4-Sep
4-Mar
2-Mar
6-Sep
12-Sep
1-Dec
10-Sep
3-Sep
5-Sep
11-Sep
8-Mar
15-Sep
7-Sep
8-Sep
11-Mar
1-Sep
3-Mar
7-Mar
1-Mar
9-Mar
6-Mar


In [3]:
# May as well do it the other way to see what's going on...
# here's the names in Adler that aren't in Moriarty.
for gene in set(names2) - set(names1):
    print(gene)

SEPT11
SEPT8
SEPT9
MARCH10
MARCH5
MARCH6
DEC1
SEPT12
SEP15
SEPT2
MARCH9
MARCH1
MARC2
SEPT6
SEPT7
MARC1
SEPT3
SEPT14
SEPT4
MARCH3
MARCH7
SEPT1
MARCH2
SEPT10
MARCH4
SEPT5
MARCH8
MARCH11


> *if there's a difference - why?*

What's happened is that Moriarty et al. imported their data into Microsoft Excel, and by default -- unless you click options that tell it not to -- Excel import will 'helpfully' convert things that look like dates to a standard date format. So the gene "MARCH8" gets converted to "8-Mar". Excel also converts things that look like scientific-notation numbers to a standard number format, so a locus named "2310009E13" would get converted to "2.31E+13". The names in the Adler file are uncorrupted.
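To make the corruption concrete, here's a minimal sketch that mimics the renaming pattern we see in the two lists above. The function name `excel_like_mangle` and the prefix table are made up for illustration; they cover only the prefixes that show up in this data set and are not a description of Excel's actual date parser.

```python
import re

# Gene-name prefixes seen in the Adler list, and the month abbreviation each
# one ends up as in the Moriarty list. (Assumption: only these prefixes matter.)
PREFIX_TO_MONTH = {
    "MARCH": "Mar", "MARC": "Mar",
    "SEPT": "Sep", "SEP": "Sep",
    "DEC": "Dec",
}

def excel_like_mangle(name):
    """Return the date-like string a gene name would become, or the name unchanged."""
    m = re.fullmatch(r"([A-Z]+)(\d+)", name)
    if m and m.group(1) in PREFIX_TO_MONTH:
        return "{}-{}".format(int(m.group(2)), PREFIX_TO_MONTH[m.group(1)])
    return name

for name in ["MARCH8", "SEPT11", "DEC1", "MARC1", "MARCH1", "GIN1"]:
    print("{0:8s} -> {1}".format(name, excel_like_mangle(name)))
```

Note that under this sketch `MARC1` and `MARCH1` both collapse to `1-Mar`, so the corruption can't be undone from the corrupted name alone. That would be consistent with 28 Adler names corresponding to only 26 distinct corrupted names in Moriarty.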

## 2. explore the data

> _output the five genes with the highest mRNA synthesis rate_

We're going to use the same trick in all three subanswers here. We'll read the rate/tpm numbers into a __dict__, indexed by gene name. We can't sort a dict directly (a dict has no `sort` method; since Python 3.7 it does remember insertion order, but that's just the order we read the file in, not the order we want). What we need is to sort a separate list of the dict's keys into the order we want, then access the data in the dict in that sorted order. We have a couple of options. 

One way would be to write a custom function and pass it as the `key` to python's `sorted` function (or to a list's `sort` method), so that it sorts by rate or tpm or whatever using the data in the dict. 

Here we'll use a pythonic shortcut version of that: the `sorted` function can take a dict (among other things) as an argument and return a sorted list of its keys. If you give `sorted` a function as an optional `key` argument, it'll sort on that value instead. So giving it `<dict>.get` -- the dict's own method for retrieving its value -- is sufficient here! A tricksy and useful idiom. `reverse=True` tells `sorted` to sort in descending, rather than the default ascending order.
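Here's the idiom on its own, as a small sketch with a toy dict (three values borrowed from the Adler synthesis rates; the variable names are just for illustration):

```python
# Toy dict: gene name -> synthesis rate
rates = {"TMEM2": 87.5, "SCT": 174.5, "MAPK10": 62.8}

# Sort the KEYS, using each key's VALUE (rates.get(key)) as the sort criterion.
by_rate_desc = sorted(rates, key=rates.get, reverse=True)
print(by_rate_desc)        # ['SCT', 'TMEM2', 'MAPK10']

# The same thing spelled out with an explicit function, if .get feels too terse.
same_order = sorted(rates, key=lambda gene: rates[gene], reverse=True)

# Without a key argument, sorted() just sorts the gene names alphabetically.
print(sorted(rates))       # ['MAPK10', 'SCT', 'TMEM2']
```

Either spelling leaves the dict itself untouched; we only get back a new list of keys.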

The point of the exercise is to get used to lists (accessed by index, 0..n-1) and dicts (accessed by key), and how to sort them.

In [4]:
with open(adlerfile) as f2:        
    synth_rate  = {}     # initialize empty dicts
    halflife    = {}    
    for line in f2:
        if line[0] == '#': continue      
        line   = line.rstrip('\n')       
        fields = line.split()            

        synth_rate[fields[0]] = float(fields[1])  # fields[0] is the gene name; our key. fields[1] is synth rate as a STRING.
        halflife[fields[0]]   = float(fields[2])  #   ... we have to convert strings to numbers explicitly: hence float() 


sorted_byrate = sorted(synth_rate, key=synth_rate.get, reverse=True)    # Get a list of gene names (from synth_rate), sorting them on synth_rate's values 
for i in range(5):
    gene = sorted_byrate[i]
    print("{0:15s} {1}".format(gene, synth_rate[gene]))


SCT             174.5
TMEM2           87.5
CFAP100         83.1
RNASE7          78.3
MAPK10          62.8


> *output the five genes with the longest mRNA halflife*

Same thing, with the halflife data from Adler.

In [5]:
sorted_bylife = sorted(halflife, key=halflife.get, reverse=True)    
for i in range(5):
    gene = sorted_bylife[i]
    print("{0:15s} {1}".format(gene, halflife[gene]))


GIN1            66.3
SELL            62.5
ERF             59.2
HNF1A           56.8
ECE2            56.8


> *output the five genes with the highest ratio of expression at t=96 hours post-mortem vs. t=0*

Similar to the above, but now we pull in the Moriarty data, and make a dict of the t=96/t=0 ratio, indexed by gene name.

In [6]:
with open(moriartyfile) as f1:
    ratio       = {}     
    for line in f1:
        if line[0] == '#': continue      
        line   = line.rstrip('\n')       
        fields = line.split()            # field [1] = 0hr. [5] = 96hr.

        ratio[fields[0]] = float(fields[5]) / float(fields[1])

sorted_byratio = sorted(ratio, key=ratio.get, reverse=True)    
for i in range(5):
    gene = sorted_byratio[i]
    print("{0:15s} {1:.1f}".format(gene, ratio[gene]))

GIN1            26.4
SELL            25.5
EIF4A1          22.6
HNF1A           19.8
ERF             19.0


The suspicious clue: the most differentially upregulated genes at t=96 are the mRNAs with the longest halflives (the slowest mRNA decay rates).

## 3. figure out what happened

> _write a script that merges the two data files, line by line, merging them on gene name_

We've already got `synth_rate` and `halflife` data from Adler, so let's just read through the Moriarty file again to merge its lines with the Adler numbers.

To write to a file, we open it for writing (`open(outfile, 'w')`), which gives us back a file handle. (Normally we'd have to remember to close it; opening it in a `with` block closes it for us when the block ends.) `print` takes a `file=<filehandle>` option so we can write to any open-for-writing filehandle.

Sometimes you want to print individual fields for a growing line, one at a time, so you don't want `print` to add the automatic newline; the `end=''` argument changes the default end-of-print character from `\n` to nothing (empty string).
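A minimal sketch of building one output line field by field. To keep it from touching the disk, `io.StringIO` stands in here for a file opened with `open(outfile, 'w')`; that substitution, and the example row, are assumptions for illustration only.

```python
import io

# io.StringIO behaves like an open-for-writing file handle, but lives in memory.
outf = io.StringIO()

name   = "GIN1"
values = [1.8, 3.0, 7.1, 26.4]

print('{:15s} '.format(name), end='', file=outf)    # no newline yet
for x in values:
    print('{:6.1f} '.format(x), end='', file=outf)  # keep growing the same line
print('', file=outf)                                # now finish the line with '\n'

line = outf.getvalue()
print(repr(line))
```

The final `print('', file=outf)` prints an empty string followed by the default newline, which is what ends the line.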

Remember the gene name corruption issue. The question said we can ignore genes for which we don't have corresponding lines in both files.



In [7]:
outfile = 'foo.tbl' 
with open(outfile, 'w') as outf:
    # There's some name corruption that we're just going to avoid.  Create
    # a set of the gene names in the ratefile, that we can efficiently
    # check gene names in TPM file against.
    #
    good_genenames = set(synth_rate.keys())


    # Now read the TPM data file (again)
    # Convert its TPM fields to ratios relative to t=0, and append the
    # synthesis rate and halflife fields.
    #
    with open(moriartyfile) as f1: 
        for line in f1:
            if line[0] == '#': continue     
            line   = line.rstrip('\n')      
            fields = line.split()           
            if fields[0] not in good_genenames: continue   # Just skip genes with corrupted names,
                                                           # that we can't merge easily to rates.

            tpm   = [float(s) for s in fields[1:]]               # tpm[] is now an array of floats
            ratio = [tpm[i] / tpm[0] for i in range(len(tpm))]   # ratio[] are ratios rel to t=0
    
            print('{:15s} '.format(fields[0]), end='', file=outf)
            for x in ratio[1:]: print('{:6.1f} '.format(x), end='', file=outf)

            print('{:6.1f} '.format(synth_rate[fields[0]]), end='', file=outf) # append synth rate and halflife from Adler
            print('{:6.1f} '.format(halflife[fields[0]]),   end='', file=outf)
            print('', file=outf)

Now you have a new file `foo.tbl` in your local directory. The easiest way to explore it is not in python but simply at the command line. 

The key exploration step is to see how "upregulation" systematically correlates with mRNA halflife.

Turns out we can do shell commands in Jupyter Notebook, prefixing them with `!`. 

Do `man sort` and you'll see the manual page for the `sort` command. `-n` means sort numerically. `-r` means reverse, sort from big to small (default is ascending order). `-k5` means sort on the 5th field (whitespace-delimited) of each line, numbering from 1.

In [8]:
! sort -n -r -k5 foo.tbl | head -20

GIN1               1.8    3.0    7.1   26.4    8.4   66.3 
SELL               2.0    3.4    7.7   25.5    5.9   62.5 
EIF4A1             2.0    3.2    6.9   22.6    1.7   56.8 
HNF1A              1.9    2.8    6.5   19.8    0.4   56.8 
ERF                2.0    3.1    7.5   19.0    4.4   59.2 
DRAM1              1.8    3.0    7.0   18.7    0.4   51.3 
TUFM               1.9    3.2    6.2   18.5    2.2   52.5 
RTN4               1.9    3.1    6.3   18.5    0.9   51.2 
ECE2               1.7    2.9    6.3   18.5   32.5   56.8 
TCF7               1.8    3.1    5.9   18.4    1.1   51.7 
NUDT10             1.8    2.8    6.1   16.7    2.3   50.3 
NXT1               1.8    3.0    6.1   16.6    1.1   48.0 
NEUROD1            2.0    3.5    6.9   16.4   29.5   47.0 
GPR160             2.0    3.0    6.1   16.4    7.1   52.5 
PRM2               1.9    2.8    6.2   16.2    1.2   48.8 
rhubarb            1.8    3.3    6.5   16.1   12.7   49.9 
LCE1A              1.8    2.7    5.9   1

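(Ignore the write error, if you get one: `head` stops reading after 20 lines while `sort` is still writing, and when the command is run from Jupyter that shows up as an error message instead of passing silently.)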

That's a list of the top 20 most highly upregulated genes at t=96, and now we can see really clearly that there's a strong correlation between the halflife (last column) and the t=96 expression ratio.
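If you'd rather stay in Python, the same ranking can be sketched with `sorted` and a `key` function. The rows below are copied from the output above (deliberately shuffled); to run it on the real file you'd read the lines from `foo.tbl` instead.

```python
# A few rows in foo.tbl's format. For the real file, something like:
#   with open('foo.tbl') as f: rows = f.read().splitlines()
rows = """\
ERF                2.0    3.1    7.5   19.0    4.4   59.2
GIN1               1.8    3.0    7.1   26.4    8.4   66.3
HNF1A              1.9    2.8    6.5   19.8    0.4   56.8
SELL               2.0    3.4    7.7   25.5    5.9   62.5
""".splitlines()

# sort's field 5 (numbering from 1) is index 4 in Python (numbering from 0).
# float() makes the comparison numeric, like -n; reverse=True is like -r.
ranked = sorted(rows, key=lambda line: float(line.split()[4]), reverse=True)

for line in ranked[:20]:        # slicing plays the role of `head -20`
    print(line)

top_genes = [line.split()[0] for line in ranked]
```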

### so what happened is...

when Moriarty killed the sand mouse, new mRNA synthesis stopped, and the existing mRNAs decayed away, each at its own rate. The least stable mRNAs go away fastest; the most stable mRNAs go away slowest. RNA-seq measures the _relative_ abundance of each transcript, not the _absolute_ abundance, so it looks like the "expression" of the more stable mRNAs is going up, just because they're going away slower.

Somewhat more subtle is the fact that the effect isn't necessarily monotonic. An mRNA with a moderate halflife will go up (as the least stable mRNAs disappear quickly from the population), then go down (as the moderate-halflife mRNA decays faster than more stable mRNAs).

If Moriarty et al. had reported their yields from their RNA preps, this could've been a big clue: they should've noticed that the RNA yield was going down relatively quickly with time, because the mRNA population was decaying, which is not really what you'd expect to see if your interpretation is that the sand mouse is running some sort of post-death gene expression program.
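We can check this reasoning with a toy simulation. This is only a sketch: the three halflives, the time points, and the equal starting abundances are made-up illustration values, not numbers from either data file. Each mRNA decays exponentially with no new synthesis, and we look at its relative abundance (what RNA-seq reports) as a ratio to t=0.

```python
# Toy model: three mRNAs, equal abundance at t=0, pure decay after death.
# Halflives and time points (in hours) are hypothetical.
halflife_hr = {"short": 2.0, "mid": 20.0, "long": 60.0}
times_hr    = [0, 12, 24, 48, 96]

def amounts(t):
    """Absolute amount of each mRNA remaining after t hours (starts at 1.0 each)."""
    return {gene: 0.5 ** (t / h) for gene, h in halflife_hr.items()}

total  = []                                    # total RNA left at each time point
ratios = {gene: [] for gene in halflife_hr}    # relative abundance, as ratio to t=0
frac0  = 1.0 / len(halflife_hr)                # each mRNA's fraction at t=0

for t in times_hr:
    a = amounts(t)
    s = sum(a.values())
    total.append(s)
    for gene in a:
        ratios[gene].append((a[gene] / s) / frac0)

print("{:6s} ".format("total") + " ".join("{:6.2f}".format(x) for x in total))
for gene in halflife_hr:
    print("{:6s} ".format(gene) + " ".join("{:6.2f}".format(x) for x in ratios[gene]))
```

In this toy run every absolute amount only ever goes down, and so does the total. Yet the "long" ratio climbs steadily, and the "mid" ratio first rises above 1 and then falls below it: the non-monotonic effect described above.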