## Homework 01: The Case of the Dead Sand Mouse
###### By Kevin Liu

We first read the two supplementary tables from Moriarty et al. and Adler et al. line by line, skipping any comment lines. For each remaining line, we strip the trailing newline character, split the line on whitespace, and store the gene name as a dictionary key whose value is a list of the remaining fields converted to type float.

In [1]:
supptbl1 = {} # create empty dictionary to store data from Moriarty et al. 
# supplementary table 1.
for line in open('Moriarty_SuppTable1.txt'): # open file and read line-by-line 
    # in a for loop.
    
    if line[0] == '#': continue # if the line is a comment line or header that 
    # starts with the character '#', skip it.
    
    line = line.rstrip('\n') # strip the line of the new line character.
    
    fields = line.split() # split the line by whitespace, creating a list 
    # consisting of each of the values in this line (i.e., the fields).
    
    supptbl1[fields[0]] = [float(s) for s in fields[1:]] # set the first element
    # of the list fields (i.e., the gene names) as the dictionary key and set 
    # the remaining 5 elements as a list of dictionary values.

# read in data from Adler et al. supplementary table 2 in a similar way.
supptbl2 = {}

for line in open('Adler_SuppTable2.txt'):
    if line[0] == '#': continue
    line = line.rstrip('\n')
    fields = line.split()
    supptbl2[fields[0]] = [float(s) for s in fields[1:]]
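
Since both tables are parsed in the same way, the two loops above could also be folded into one function. The following is only a minimal sketch: `read_table` is a hypothetical helper name that does not appear in the original notebook, and skipping blank lines is an added assumption about the input files.

```python
def read_table(path):
    """Parse a whitespace-delimited table into {gene_name: [float, ...]}."""
    table = {}
    with open(path) as handle:  # the file is closed automatically on exit.
        for line in handle:
            # skip comment/header lines and (by assumption) blank lines.
            if line.startswith('#') or not line.strip():
                continue
            fields = line.split()  # split() also discards the trailing newline.
            table[fields[0]] = [float(s) for s in fields[1:]]
    return table
```

With this helper, `supptbl1 = read_table('Moriarty_SuppTable1.txt')` and `supptbl2 = read_table('Adler_SuppTable2.txt')` would be expected to build the same dictionaries as the loops above.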

### 1. Check that the gene names match

Since we know that Adler et al. determined the synthesis rates and half-lives of *all* genes in the prefrontal cortex of the sand mouse, we would expect that the genes found in the dataset by Moriarty et al. are also found in the dataset by Adler et al. Therefore, we check if the gene names match between the two supplementary tables.

In [2]:
notInAlder = [] # create an empty list to hold the gene names that are in 
# supptbl1 but not supptbl2.

# for each gene name (i.e., key of supptbl1) store it in the empty list if it is
# not found in supptbl2 keys.
for gene in supptbl1.keys():
    if gene not in supptbl2.keys():
        notInAlder.append(gene)

print('The gene names in Supplementary Table 1 by Moriarty et al. that are not \
found in Supplementary Table 2 by Adler et al. are: ' + ', '.join(notInAlder) + 
'.') # print out those gene names.

The gene names in Supplementary Table 1 by Moriarty et al. that are not found in Supplementary Table 2 by Adler et al. are: 15-Sep, 2-Mar, 1-Mar, 10-Sep, 7-Mar, 4-Mar, 2-Sep, 11-Sep, 6-Mar, 11-Mar, 3-Mar, 8-Sep, 7-Sep, 14-Sep, 6-Sep, 1-Dec, 8-Mar, 5-Mar, 9-Mar, 12-Sep, 1-Sep, 4-Sep, 10-Mar, 9-Sep, 5-Sep, 3-Sep.


In [3]:
# we can also examine the gene names found in supptbl2 but not present in 
# supptbl1.
notInMoriarty = [] # create an empty list to hold the gene names that are in 
# supptbl2 but not supptbl1.

# for each gene name (i.e., key of supptbl2), store it in the empty list if it 
# is not found in supptbl1 keys.
for gene in supptbl2.keys():
    if gene not in supptbl1.keys():
        notInMoriarty.append(gene)

print('The gene names in Supplementary Table 2 by Adler et al. that are not \
found in Supplementary Table 1 by Moriarty et al. are: ' + 
', '.join(notInMoriarty) + '.') # print out those gene names.

The gene names in Supplementary Table 2 by Adler et al. that are not found in Supplementary Table 1 by Moriarty et al. are: MARC1, SEPT11, MARCH8, SEPT3, MARCH6, SEPT2, SEPT5, MARCH7, MARC2, MARCH3, SEPT4, MARCH5, SEPT7, SEPT10, SEPT14, MARCH10, SEP15, MARCH1, SEPT12, SEPT8, MARCH9, SEPT6, SEPT9, DEC1, MARCH4, MARCH2, MARCH11, SEPT1.


Comparing the two outputs shows that the mismatch comes from MS Excel, which interpreted gene names that resemble dates (for example, DEC1) as type date and stored them in 'D-MMM' format (for example, 1-Dec) in the dataset by Moriarty et al. The converted names were then exported as dates and read in as if they were gene names, which caused the inconsistency between the two datasets.
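
One way to flag such names automatically is a regular expression for the 'D-MMM' pattern. This is a sketch only: `DATE_LIKE` and `find_date_like` are hypothetical names, and the pattern assumes English three-letter month abbreviations as seen in the output above.

```python
import re

# matches names such as '1-Mar' or '15-Sep': one or two digits, a hyphen,
# and a three-letter English month abbreviation.
DATE_LIKE = re.compile(
    r'^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$')

def find_date_like(names):
    """Return the names that look like Excel 'D-MMM' dates, in input order."""
    return [name for name in names if DATE_LIKE.match(name)]

sample = ['TFRC', '1-Mar', '15-Sep', '1-Dec', 'DIRC1', 'SEPT1']
print(find_date_like(sample))  # prints ['1-Mar', '15-Sep', '1-Dec']
```

Detection is easier than repair, because the conversion loses information. The first output lists 26 date-like names while the second lists 28 gene names, which appears consistent with MARC1 and MARCH1 both becoming 1-Mar, and MARC2 and MARCH2 both becoming 2-Mar, so a date-like name cannot always be mapped back to a single gene.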

### 2. Explore the data

To examine the consistency of the data from the two supplementary tables, we identify the 5 genes with the highest synthesis rates, the 5 genes with the longest half-lives, and the 5 genes with the highest ratios of 96h/0h TPM expression levels. We would expect genes with longer half-lives to decay more slowly, which should correspond to higher 96h/0h expression ratios, as less of their mRNA would have decayed by 96h. Furthermore, if Moriarty's theory of post-mortem cortical gene expression is correct, we would also expect genes with high 96h/0h expression ratios to have high synthesis rates.

In [4]:
sortedSR = sorted(supptbl2.items(), key = lambda x: x[1][0], reverse = True) #
# sort the dictionary items by the synth_rate in descending order.
getMaxSR = map(lambda x: x[0], sortedSR[0:5]) # get the gene names (i.e., keys) 
# of the top 5 genes by synth_rate.
print('The genes with the top 5 highest synthesis rates are: ' + 
', '.join(getMaxSR) + '.')

The genes with the top 5 highest synthesis rates are: CCDC169-SOHLH2, DDX60L, LRRK1, SLC25A45, FARP1.


In [5]:
# do the same as above but using half-life.
sortedHL = sorted(supptbl2.items(), key = lambda x: x[1][1], reverse = True)
getMaxHL = map(lambda x: x[0], sortedHL[0:5])
print('The genes with the top 5 longest half-lives are: ' + ', '.join(getMaxHL) 
+ '.')

The genes with the top 5 longest half-lives are: TFRC, SPINK8, DIRC1, PLA1A, SAMSN1.


In [6]:
RE = {} # create an empty dictionary to store the ratio of 96h/0h TPM expression 
# levels.

# for each gene in supptbl1, store the gene name as the key in the empty 
# dictionary and store the ratio of 96h/0h TPM expression levels as the 
# corresponding value.
for gene in supptbl1.keys():
    RE[gene] = supptbl1[gene][4]/supptbl1[gene][0]
sortedRE = sorted(RE.items(), key = lambda x: x[1], reverse = True) # sort the 
# resulting dictionary by the values (i.e., expression ratios) in descending 
# order.
getMaxRE = map(lambda x: x[0], sortedRE[0:5]) # get the gene names (i.e., keys) 
# of the top 5 genes by expression ratio.
print('The genes with the top 5 highest ratio of 96h/0h TPM expression levels \
are: ' + ', '.join(getMaxRE) + '.')

The genes with the top 5 highest ratio of 96h/0h TPM expression levels are: TFRC, SPINK8, DIRC1, PLA1A, RSPRY1.


As expected, four of the 5 genes with the longest half-lives identified by Adler et al. (TFRC, SPINK8, DIRC1, and PLA1A) are also among the 5 genes with the highest 96h/0h TPM expression ratios identified by Moriarty et al. However, none of the 5 genes with the highest synthesis rates appear among the genes with the longest half-lives or the highest 96h/0h expression ratios.
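
The overlap between the three lists can be checked with set intersections. In this sketch the gene names are copied by hand from the printed outputs above, because the `map` objects in the earlier cells are already consumed by `join`; the variable names are hypothetical.

```python
# gene names copied from the printed outputs of cells 4, 5, and 6.
top_synth = {'CCDC169-SOHLH2', 'DDX60L', 'LRRK1', 'SLC25A45', 'FARP1'}
top_halflife = {'TFRC', 'SPINK8', 'DIRC1', 'PLA1A', 'SAMSN1'}
top_ratio = {'TFRC', 'SPINK8', 'DIRC1', 'PLA1A', 'RSPRY1'}

# set intersection (&) keeps only the genes present in both lists.
shared_hl_ratio = sorted(top_halflife & top_ratio)
shared_sr_ratio = sorted(top_synth & top_ratio)
shared_sr_hl = sorted(top_synth & top_halflife)

print(shared_hl_ratio)  # prints ['DIRC1', 'PLA1A', 'SPINK8', 'TFRC']
print(shared_sr_ratio)  # prints []
print(shared_sr_hl)     # prints []
```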

### 3. Figure out what happened

To determine whether there are any artifacts or errors in the experiment or in the interpretation of its results, we calculate the expression ratio at each timepoint relative to t=0h for every gene that is found in both supplementary tables, and merge these ratios with the corresponding synthesis rates and half-lives.

In [7]:
merged = {} # create an empty dictionary to store the expression ratios relative
# to t=0h and the corresponding synthesis rates and half lives for that gene.

# iterate over each key:value pair in supptbl1, calculate each of the expression 
# ratios relative to t=0h and append to the empty dictionary one gene at a time 
# if the gene is present in both supptbl1 and supptbl2.
for key in supptbl1.keys():
    # only genes present in both tables are merged; the date-formatted names 
    # in supptbl1 have no match in supptbl2 and are skipped.
    if key in supptbl2.keys():
        t0 = supptbl1[key][0] # expression level at t=0h.
        merged[key] = [supptbl1[key][i]/t0 for i in range(1, 5)] # ratios of
        # the 12h, 24h, 48h, and 96h expression levels relative to t=0h.
        
        # append the synthesis rate and half-life for that gene to the same 
        # list of values as the expression ratios.
        merged[key].append(supptbl2[key][0])
        merged[key].append(supptbl2[key][1])

The merged data can then be written out as a text file for further inspection.

In [8]:
# convert dictionary into a list of lists for easy write out.
mergedList = [] # create an empty list to hold lists of data for each gene.

# for each key:value pair in merged, append them as a list within mergedList.
for key, value in merged.items():
    mergedList.append([key] + value)

# write out the merged data as a whitespace-delimited, column-justified text
# file named 'LiuKevin_01_3.txt' line-by-line.
with open('LiuKevin_01_3.txt', 'w') as f:
    headerList = ['# gene_name', 'r12h_0h', 'r24h_0h', 'r48h_0h', 'r96h_0h', 
                  'synth_rate', 'halflife'] # create a list of column names.
    headerFormat = '{0:20s} {1:10s} {2:10s} {3:10s} {4:10s} {5:10s} {6:10s}'
    f.write(headerFormat.format(*headerList) + '\n') # write out the headers 
    # as the first line of the output file, separated by whitespace.
    
    # for each list of data in mergedList, write them out into the file with 
    # each field padded by a number of whitespaces.
    for i in mergedList:
        dataFormat = '{0:20s} {1:<10.3f} {2:<10.3f} {3:<10.3f} {4:<10.3f} ' \
        '{5:<10.3f} {6:<10.3f}'
        f.write(dataFormat.format(*i) + '\n')

After close inspection of the results we've seen so far, we can identify three main concerns about Moriarty's dataset.

First, when the Supplementary Table 1 data by Moriarty et al. was stored in MS Excel, the automatic cell formatting must have recognized certain gene names as type date and converted them into 'D-MMM' format. The data exported from the MS Excel file kept this incorrect 'D-MMM' format, resulting in the gene name inconsistency between the two papers' supplementary tables.

Second, the measured expression levels of genes with longer half-lives are most likely inflated relative to those with shorter half-lives at the later timepoints. TPM is a relative measure (the values in each sample are scaled to sum to one million), so as mRNAs with shorter half-lives are degraded, the longer-lived mRNAs that remain make up a larger share of the total even if their absolute amounts have not increased. This is consistent with the top 5 lists above, where four of the 5 genes with the longest half-lives are also among the 5 genes with the highest 96h/0h expression ratios.

Third, if Moriarty's findings on post-mortem cortical gene expression were correct, the genes with high 96h/0h expression ratios should have high synthesis rates. However, we did not see any overlap between the 5 genes with the highest synthesis rates and the 5 genes with the highest expression ratios. This suggests that the apparent increase in expression reflects differences in mRNA half-life as decay proceeds over time, rather than new synthesis.
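
The second concern can be illustrated with a toy calculation. This is a sketch under stated assumptions, not a model fitted to either dataset: it assumes that synthesis stops entirely at t=0h, that each mRNA decays exponentially according to its half-life, and that TPM is obtained by rescaling abundances to sum to one million. The two genes, their starting abundances, and their half-lives are hypothetical.

```python
def tpm(counts):
    """Rescale absolute abundances so that they sum to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}

def decay(counts, halflives, hours):
    """Exponential decay with no new synthesis: N(t) = N0 * 0.5**(t/t_half)."""
    return {gene: n * 0.5 ** (hours / halflives[gene])
            for gene, n in counts.items()}

# hypothetical starting abundances and half-lives (in hours); these values are
# made up for illustration and do not come from either paper.
start = {'stable': 100.0, 'unstable': 900.0}
halflives = {'stable': 48.0, 'unstable': 6.0}

tpm_0h = tpm(start)
tpm_96h = tpm(decay(start, halflives, 96))

# 96h/0h TPM ratio for each gene, analogous to the r96h_0h column.
ratio = {gene: tpm_96h[gene] / tpm_0h[gene] for gene in start}
print(ratio)
```

In this toy example the absolute amount of the stable mRNA falls to a quarter of its starting value after 96h, yet its 96h/0h TPM ratio is close to 10, because nearly all of the unstable mRNA has decayed and the stable mRNA now accounts for almost the entire sample.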