## Homework 02: The Adventure of the Ten Arcs
##### By Kevin Liu

Before we begin, we will import the necessary libraries and define the parameters for the Arc locus.

In [1]:
import numpy as np

experiment = 0 # if 0, run simulation using original parameters; if 1, run using experiment 1 parameters; if 2, run using experiment 2 parameters.

# Set up the Arc locus 
S         = 10           # Number of segments in the Arc locus (A..J)
T         = S            # Number of different transcripts (the same, one starting on each segment, 1..10)
N         = 100000       # total number of observed reads we generate
len_S     = 1000         # length of each segment (nucleotides)
len_Arc   = len_S * S    # total length of the Arc locus (nucleotides)
len_R     = 75           # read length

if experiment == 0:
    segment_map = {'Arc1': ['A', 'B', 'C', 'D'],
                   'Arc2': ['B', 'C'],
                   'Arc3': ['C', 'D', 'E'],
                   'Arc4': ['D', 'E', 'F', 'G'],
                   'Arc5': ['E', 'F', 'G', 'H'],
                   'Arc6': ['F', 'G', 'H'],
                   'Arc7': ['G', 'H'],
                   'Arc8': ['H', 'I'],
                   'Arc9': ['I', 'J', 'A'],
                   'Arc10': ['J', 'A', 'B']}
    arc_nu = {'Arc1': 0.008, 
              'Arc2': 0.039,
              'Arc3': 0.291,
              'Arc4': 0.112, 
              'Arc5': 0.127,
              'Arc6': 0.008,
              'Arc7': 0.059,
              'Arc8': 0.060,
              'Arc9': 0.022,
              'Arc10': 0.273}
elif experiment == 1:
    segment_map = {'Arc1': ['A', 'B', 'C', 'D'], 
                   'Arc2': ['B', 'C'],
                   'Arc3': ['C', 'D', 'E'],
                   'Arc4': ['D', 'E', 'F', 'G'], 
                   'Arc5': ['E', 'F', 'G', 'H'],
                   'Arc6': ['F', 'G', 'H'],
                   'Arc7': ['G', 'H'],
                   'Arc8': ['H', 'I']}
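    # draw a single abundance vector from a flat Dirichlet so that the nu_i sum to 1
    # (drawing a fresh vector for every transcript would not guarantee this).
    nu_draw = np.random.dirichlet(np.ones(len(segment_map)))
    arc_nu = {arc_n: float(nu) for arc_n, nu in zip(segment_map, nu_draw)}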
else:
    segment_map = {'Arc1': ['A', 'B'], 
                   'Arc2': ['B', 'C'],
                   'Arc3': ['C', 'D'],
                   'Arc4': ['D', 'E'], 
                   'Arc5': ['E', 'F'],
                   'Arc6': ['F', 'G'],
                   'Arc7': ['G', 'H'],
                   'Arc8': ['H', 'I'], 
                   'Arc9': ['I', 'J'],
                   'Arc10': ['J', 'A']}
    arc_nu = {'Arc1': 0.008, 
              'Arc2': 0.039,
              'Arc3': 0.291,
              'Arc4': 0.112, 
              'Arc5': 0.127,
              'Arc6': 0.008,
              'Arc7': 0.059,
              'Arc8': 0.060,
              'Arc9': 0.022,
              'Arc10': 0.273}

#### 1. Use kallisto and reproduce Moriarty's result

To replicate the result produced by Moriarty using Kallisto, we will first build a Kallisto index file using our Arc locus transcripts fasta file and then quantify the read abundances using our reads fastq file and the produced index file.

In [2]:
! kallisto index -i arc.idx arc.fasta.gz # generate hash table.


[build] loading fasta file arc.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 19 contigs and contains 10000 k-mers 



In [3]:
! kallisto quant -i arc.idx -o arc_quant_out --single -l 150 -s 20 arc.fastq.gz # run quantification algorithm.


[quant] fragment length distribution is truncated gaussian with mean = 150, sd = 20
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,000
[index] number of equivalence classes: 26
[quant] running in single-end mode
[quant] will process file 1: arc.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 99,981 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 81 rounds



Next, we will examine the read abundance results produced by Kallisto.

In [4]:
# read in the abundance file generated by Kallisto.
arc_abund = {}
with open('arc_quant_out/abundance.tsv') as f:
    next(f)
    for line in f:
        line = line.rstrip('\n')
        fields = line.split('\t')
        arc_abund[fields[0]] = [float(s) for s in fields[1:]]

# print the names of the columns.
strFormat = '{:<10} {:<10} {:<10} {:<10} {:<10}'
print(strFormat.format('target_id', 'length', 'eff_length', 'est_counts', 'tpm'))

# print each field from Kallisto output.
for arc_n, prop in arc_abund.items():
    tempList = [arc_n]
    for i in range(len(prop)):
        tempList.append(prop[i])
    print(strFormat.format(*tempList))

target_id  length     eff_length est_counts tpm       
Arc1       4000.0     3851.0     2781.96    20318.9   
Arc2       2000.0     1851.0     3585.12    54477.8   
Arc3       3000.0     2851.0     28613.5    282290.0  
Arc4       4000.0     3851.0     10412.4    76050.3   
Arc5       4000.0     3851.0     13042.9    95263.0   
Arc6       3000.0     2851.0     1195.26    11792.0   
Arc7       2000.0     1851.0     5864.45    89113.5   
Arc8       2000.0     1851.0     5717.54    86881.2   
Arc9       3000.0     2851.0     2909.53    28704.4   
Arc10      3000.0     2851.0     25858.3    255109.0  


The TPM values produced by Kallisto closely match those from Moriarty's attempt, so we have successfully reproduced their results.

#### 2a. Simulate an Arc transcriptome and RNA-seq reads

Due to the discrepancies between the result produced by Kallisto and our own calculations, we will begin by generating simulated data according to the known Arc locus structure and its transcripts.

We will first generate the Arc locus DNA sequence. Given the known length of the Arc locus, the length of each segment, and the total number of segments, we can construct a simulated Arc locus DNA with uniform base composition using randomly selected nucleotides.

In [5]:
# generate Arc locus DNA sequence.
n_segments = [chr(ord('A') + i) for i in range(S)] # segment names: the first S uppercase letters ('A', 'B', ...)

# make dictionary keyed by segment name, each value a list of len_S random nucleotides.
arc_locus = {}
for i in range(S):
    arc_locus[n_segments[i]] = [np.random.choice(['A', 'T', 'G', 'C']) for _ in range(len_S)]

# check if total length is = len_Arc.
# check_len = {}
# for i in range(S):
#     check_len[n_segments[i]] = ''.join(arc_locus[n_segments[i]])
# print(len(''.join(check_len.values())) == len_Arc)

# concatenate the segments in order to obtain the full locus as a string and as a list.
arc_segment_list = [''.join(arc_locus[i]) for i in n_segments]
arc_locus_flat = ''.join(arc_segment_list)
arc_locus_list = list(arc_locus_flat)

Since we know the segment composition for each of the Arc mRNA transcripts, we can then use the previously generated Arc locus DNA and the map of segments to each of the transcripts to construct each of the 10 Arc transcripts.

In [6]:
# generate Arc1...Arc10 mRNA transcripts.
transcripts = {}
for arc_n, segment_list in segment_map.items():
    tempList = []
    for segment in segment_list:
        tempList.append(''.join(arc_locus[segment]))
    transcripts[arc_n] = ''.join(tempList)

These produced transcripts can then be written out as a fasta file that is compatible with Kallisto.

In [7]:
# cut sequences into 60nt per line according to fasta format.
transcripts_n60 = {}
for arc_n, seq in transcripts.items():
    transcripts_n60[arc_n] = [transcripts[arc_n][i:i+60] for i in range(0, len(transcripts[arc_n]), 60)]

# write out transcripts as fasta file.
with open('LiuKevin_02_2a.fasta', 'w') as f:
    for arc_n, seq_lines in transcripts_n60.items():
        f.write('>' + arc_n + '\n')
        for line in seq_lines:
            f.write(line + '\n')

Lastly, we can generate simulated reads from each of the Arc mRNA transcripts, with the number of reads per transcript determined by the abundances calculated in our own experiment. These generated reads can then be written out as a fastq file that is compatible with Kallisto.

In [8]:
# generate reads.
# make dictionary for each transcript of n out of N total number of reads each with length len_R.
n_reads = {}
for arc_n, prop in arc_nu.items():
    n_reads[arc_n] = int(N * prop)

reads = {}
for arc_n, n in n_reads.items():
    read_seq = []
    for i in range(n):
        start_pos = np.random.randint(0, len(transcripts[arc_n]) - len_R + 1) # upper bound is exclusive, so +1 keeps the last valid start position
        end_pos = start_pos + len_R
        read_seq.append(transcripts[arc_n][start_pos:end_pos])
    reads[arc_n] = read_seq

# write out reads as fastq file.
readsList = []
for arc_n, readList in reads.items():
    for n_read in readList:
        readsList.append(n_read)

with open('LiuKevin_02_2a.fastq', 'w') as f:
    for read_i in range(len(readsList)):
        headerLine = '@read' + str(read_i) + '\n'
        readLine = readsList[read_i] + '\n'
        plusLine = '+' + '\n'
        qLine = 'I' * len_R + '\n'
        f.write(headerLine + readLine + plusLine + qLine)
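
The cell above truncates `N * nu_i` with `int()`, and the listed abundances sum to 0.999, which is presumably why Kallisto later reports 99,899 reads rather than 100,000. As a hedged alternative, the sketch below renormalizes the abundances and draws the per-transcript read counts from a multinomial distribution so that they always add up to the requested total. The helper name `sample_read_counts` and the three-transcript example are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def sample_read_counts(nu, n_reads, seed=None):
    """Draw per-transcript read counts that sum exactly to n_reads.

    nu      : dict mapping transcript name -> nucleotide abundance (need not sum to 1)
    n_reads : total number of reads to distribute
    seed    : optional seed for reproducibility
    """
    names = list(nu)
    p = np.array([nu[name] for name in names], dtype=float)
    p = p / p.sum()  # renormalize so the probabilities sum to 1
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(n_reads, p)
    return {name: int(c) for name, c in zip(names, counts)}

# hypothetical example using three of the abundances from the parameter cell
example_nu = {'Arc1': 0.008, 'Arc2': 0.039, 'Arc3': 0.291}
example_counts = sample_read_counts(example_nu, 1000, seed=0)
print(example_counts, sum(example_counts.values()))
```

Unlike the deterministic `int(N * prop)` approach, the counts here fluctuate from run to run, which is closer to how a real sequencing experiment samples reads.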

#### 3. Test kallisto

Since we have simulated our Arc locus transcripts and reads based on the abundances we calculated ourselves, let's run the simulated data through Kallisto to see whether its results match our own TPM calculations.

In [9]:
! kallisto index -i synth_arc.idx LiuKevin_02_2a.fasta


[build] loading fasta file LiuKevin_02_2a.fasta
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 19 contigs and contains 10000 k-mers 



In [10]:
! kallisto quant -i synth_arc.idx -o synth_arc_quant_out --single -l 75 -s 10 LiuKevin_02_2a.fastq


[quant] fragment length distribution is truncated gaussian with mean = 75, sd = 10
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,000
[index] number of equivalence classes: 26
[quant] running in single-end mode
[quant] will process file 1: LiuKevin_02_2a.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 99,899 reads, 99,899 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 55 rounds



Since the simulated reads were generated from known nucleotide abundances (nu_i) for each transcript, we can convert them into transcript abundances: divide each nu_i by the effective length that Kallisto reports for that transcript, normalize the resulting values so that they sum to one, and scale by one million to obtain TPM. We can then compare these calculated TPM values to the ones Kallisto produced.
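
The conversion described above can be sketched as a small standalone function. This is a minimal illustration, assuming the effective lengths are supplied as a dictionary; the name `nu_to_tpm` is hypothetical, and the example values are the known nu_i together with the effective lengths from the Kallisto output for the simulated data.

```python
def nu_to_tpm(nu, eff_len):
    """Convert nucleotide abundances to TPM.

    nu      : dict mapping transcript name -> nucleotide abundance nu_i
    eff_len : dict mapping transcript name -> effective length
    """
    # unnormalized transcript abundance: nu_i / effective length
    tau = {name: nu[name] / eff_len[name] for name in nu}
    total = sum(tau.values())
    # normalize to sum to one, then scale to transcripts per million
    return {name: 1e6 * t / total for name, t in tau.items()}

example_nu = {'Arc1': 0.008, 'Arc2': 0.039, 'Arc3': 0.291, 'Arc4': 0.112,
              'Arc5': 0.127, 'Arc6': 0.008, 'Arc7': 0.059, 'Arc8': 0.060,
              'Arc9': 0.022, 'Arc10': 0.273}
example_eff_len = {'Arc1': 3926, 'Arc2': 1926, 'Arc3': 2926, 'Arc4': 3926,
                   'Arc5': 3926, 'Arc6': 2926, 'Arc7': 1926, 'Arc8': 1926,
                   'Arc9': 2926, 'Arc10': 2926}
example_tpm = nu_to_tpm(example_nu, example_eff_len)
print(round(example_tpm['Arc1'], 2))
```

Because the result is normalized, it does not matter that the listed nu_i sum to 0.999 rather than exactly 1.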

In [11]:
synth_arc_abund = {}
with open('synth_arc_quant_out/abundance.tsv') as f:
    next(f)
    for line in f:
        line = line.rstrip('\n')
        fields = line.split('\t')
        synth_arc_abund[fields[0]] = [float(s) for s in fields[1:]]

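# calculate tpm using known nu_i.
# append the known nucleotide abundance nu_i (index 4).
for arc_n, props in synth_arc_abund.items():
    props.append(arc_nu[arc_n])

# append nu_i / eff_length, the unnormalized transcript abundance (index 5).
for arc_n, prop in synth_arc_abund.items():
    prop.append(prop[4] / prop[1])

# normalize so that the transcript abundances sum to one.
norm_vals = []
for arc_n, props in synth_arc_abund.items():
    norm_vals.append(props[5])

norm_constant = 1/sum(norm_vals)

for arc_n, prop in synth_arc_abund.items():
    prop[5] = prop[5]*norm_constant*(10**6) # scale to transcripts per million
    # symmetric relative difference between calculated and Kallisto tpm (index 6);
    # this is a fraction, so 1.22 means 122%.
    prop.append(np.abs(prop[5]-prop[3])/((prop[5]+prop[3])/2))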

# Print the names of the columns.
strFormat = '{:<10} {:<10} {:<10} {:<10} {:<10} {:<10} {:<10} {:<10}'
numFormat = '{:<10} {:<10} {:<10} {:<10} {:<10} {:<10.3f} {:<10.2f} {:<10.2f}'
print(strFormat.format('target_id', 'length', 'eff_length', 'est_counts', 'tpm', 'nu_i', 'calc_tpm', 'percent_diff'))

# print each data item.
for arc_n, prop in synth_arc_abund.items():
    tempList = [arc_n]
    for i in range(len(prop)):
        tempList.append(prop[i])
    print(numFormat.format(*tempList))

target_id  length     eff_length est_counts tpm        nu_i       calc_tpm   percent_diff
Arc1       4000.0     3926.0     3254.54    24096.5    0.008      5856.18    1.22      
Arc2       2000.0     1926.0     3179.01    47978.8    0.039      58194.65   0.19      
Arc3       3000.0     2926.0     28697.0    285086.0   0.291      285820.53  0.00      
Arc4       4000.0     3926.0     10122.5    74946.3    0.112      81986.53   0.09      
Arc5       4000.0     3926.0     12589.5    93212.2    0.127      92966.86   0.00      
Arc6       3000.0     2926.0     2219.1     22045.3    0.008      7857.61    0.95      
Arc7       2000.0     1926.0     5520.61    83319.2    0.059      88038.06   0.06      
Arc8       2000.0     1926.0     5506.19    83101.4    0.060      89530.23   0.07      
Arc9       3000.0     2926.0     2957.45    29380.3    0.022      21608.43   0.30      
Arc10      3000.0     2926.0     25853.1    256834.0   0.273      268140.91  0.04      


The relative differences between the calculated transcript abundances and those produced by Kallisto are large for Arc1 and Arc6 (1.22 and 0.95; the percent_diff column reports fractions, so 1.22 corresponds to 122%), even though the reads were simulated from known abundances. The discrepancy between Kallisto's results and our own therefore appears to be related to how Kallisto handles this particular data.
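
For reference, the percent_diff column is computed in the cell above as a symmetric relative difference: the absolute difference divided by the mean of the two values, reported as a fraction rather than a percentage. A minimal sketch, with the hypothetical helper name `symmetric_rel_diff` and the Arc1 values taken from the table above:

```python
def symmetric_rel_diff(a, b):
    """Absolute difference between a and b divided by their mean (a fraction, not a percent)."""
    mean = (a + b) / 2
    if mean == 0:
        raise ValueError("relative difference is undefined when the mean is zero")
    return abs(a - b) / mean

# Arc1: calculated TPM vs. Kallisto TPM from the simulated data
arc1_diff = symmetric_rel_diff(5856.18, 24096.5)
print(f"{arc1_diff:.2f}")
```

This measure is bounded above by 2 for non-negative inputs, so a value of 1.22 already indicates that the two estimates differ by roughly a factor of four.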

#### 4. "Debug" kallisto

##### Experiment 1

We first hypothesize that Kallisto only works correctly on reads from a linear locus, whereas the Arc locus is circular. If so, this would explain why the transcript abundances from our original approach and from Moriarty's Kallisto approach do not correspond.

To test this first hypothesis, we can perform an experiment that treats the Arc locus as linear rather than circular. To do this, we will discard the Arc9 and Arc10 transcripts, along with their associated reads, from our simulated data because they span the junction that closes the circle (segment J back to segment A). Furthermore, we will randomly generate nucleotide abundances for the remaining Arc transcripts and use them to calculate the transcript abundances for comparison with what Kallisto generates.
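
As a rough sketch of how the transcripts that cross the J-to-A junction could be identified programmatically, the function below flags any transcript whose segment order decreases at some point. It assumes the segments are named in locus order (A through J); the helper name `wraps_origin` and the reduced example map are hypothetical and not part of the original notebook.

```python
def wraps_origin(segments, order):
    """Return True if a transcript's segments cross the end of the linear segment order."""
    idx = [order.index(s) for s in segments]
    # in a linear locus, segment positions must strictly increase along the transcript
    return any(b < a for a, b in zip(idx, idx[1:]))

segment_order = list('ABCDEFGHIJ')
example_map = {'Arc1': ['A', 'B', 'C', 'D'],
               'Arc8': ['H', 'I'],
               'Arc9': ['I', 'J', 'A'],
               'Arc10': ['J', 'A', 'B']}

# keep only the transcripts that fit on a linear locus
linear_map = {name: segs for name, segs in example_map.items()
              if not wraps_origin(segs, segment_order)}
print(sorted(linear_map))
```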

**Please set `experiment = 1` in the code chunk that defines all parameters for the Arc locus (top of the notebook) and re-run the whole notebook to produce results that test our hypothesis.**

After carrying out our first experiment using the newly simulated data, we find that the percent differences between the transcript abundances produced by Kallisto and our own calculations remain fairly large for Arc6. This does not match our expectations as it suggests that the circular locus structure of Arc can be resolved by Kallisto and that the discrepancy arises from some other source.

##### Experiment 2

Since the previous experiment yielded results that contradict our first hypothesis, we will explore the nature of the Arc transcripts further. We notice that the sequence for transcript Arc2 is contained within the sequence for Arc1, and that the sequences for Arc6 and Arc7 are contained within the sequence for Arc5. We therefore hypothesize that Kallisto might have erroneously assigned the reads from transcripts that are contained within other transcripts. To test this hypothesis, we will conduct a second experiment using transcript definitions in which no transcript is contained within another: each transcript spans exactly two adjacent segments, so neighbors still share a segment but none is nested.
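
The containment relationships noted above can be checked directly from the segment map. The following is a minimal sketch, assuming a transcript counts as nested when its segment list appears as a contiguous run inside a longer transcript's segment list; the helper names `is_contiguous_sublist` and `find_nested` are hypothetical.

```python
def is_contiguous_sublist(short, long):
    """Return True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))

def find_nested(segment_map):
    """Map each nested transcript to the list of longer transcripts that contain it."""
    nested = {}
    for name, segs in segment_map.items():
        containers = [other for other, other_segs in segment_map.items()
                      if other != name
                      and len(other_segs) > len(segs)
                      and is_contiguous_sublist(segs, other_segs)]
        if containers:
            nested[name] = containers
    return nested

# original Arc transcript definitions (experiment = 0)
original_map = {'Arc1': ['A', 'B', 'C', 'D'], 'Arc2': ['B', 'C'],
                'Arc3': ['C', 'D', 'E'], 'Arc4': ['D', 'E', 'F', 'G'],
                'Arc5': ['E', 'F', 'G', 'H'], 'Arc6': ['F', 'G', 'H'],
                'Arc7': ['G', 'H'], 'Arc8': ['H', 'I'],
                'Arc9': ['I', 'J', 'A'], 'Arc10': ['J', 'A', 'B']}
nested = find_nested(original_map)
print(nested)
```

Applied to the original definitions, this also shows that Arc7 is contained in Arc6 as well as in Arc5. Applied to the two-segment definitions used in the second experiment, it finds no nested transcripts.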

**Please set `experiment = 2` in the code chunk that defines all parameters for the Arc locus (top of the notebook) and re-run the whole notebook to produce results that test our hypothesis.**

Following our second experiment, we notice that the calculated percent differences are much lower than in both the original simulation and our first experiment. This supports our hypothesis that the discrepancies between our original results and Moriarty's results are likely due to Kallisto misassigning reads among transcripts that are contained within other transcripts.