## 1. Use kallisto and reproduce Moriarty's result

In [28]:
# Building a kallisto index of the Arc transcriptome, which we'll call transcripts.idx
!kallisto index -i transcripts.idx arc.fasta.gz


[build] loading fasta file arc.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 19 contigs and contains 10000 k-mers 



In [29]:
# Quantification of single end reads passing in the mean and standard deviation of fragment length
!kallisto quant -i transcripts.idx -o replicate_moriarty --single -l 150 -s 20 arc.fastq.gz  


[quant] fragment length distribution is truncated gaussian with mean = 150, sd = 20
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,000
[index] number of equivalence classes: 26
[quant] running in single-end mode
[quant] will process file 1: arc.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 99,981 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 81 rounds



#### Table of results from replication of Moriarty's results
| Transcript | Moriarty's  TPM | TPM from replicating Moriarty |
| --- | --- | --- |
| Arc 1 | 20000 |20318.9 |
| Arc 2 | 54000 |54477.8 |
| Arc 3 | 280000  | 282290 |
| Arc 4 | 76000   | 76050.3|
| Arc 5 | 95000   | 95263 |
| Arc 6 | 12000   | 11792 |
| Arc 7 | 89000   | 89113.5|
| Arc 8 | 88000   | 86881.2|
| Arc 9 | 29000   | 28704.4|
| Arc 10 | 260000   | 255109|

### From the above table showing Moriarty's TPM versus the results from my replication, we see that the TPMs are very similar. Thus, I successfully replicated Moriarty's results. To reconcile the first researcher's argument that Arc1 and Arc6 are the rarest transcripts and Moriarty's argument that the first researcher underestimated the abundance of Arc1 by about 3-fold and Arc6 by about 2-fold, I proceed to simulate an Arc transcriptome and RNA-seq reads.

## 2a. Simulate an Arc transcriptome and RNA-seq reads

In [16]:
import numpy as np
import gzip

def arc_transcriptome_read_simulator(    
    S = 10,
    N = 100000,
    len_S = 1000,
    len_R = 75,
    circular = True  
  
    ):

     
    T = S
    len_ARC = len_S * S

    """
        circular argument is passed into signify whether to generate the full circular transcriptome or 
        shortened linear transcriptome
    """
    np.random.seed(42)
    # To generate an Arc Locus DNA sequence:
    # Creating a 2D array where each nested array is the sequence of a segment (i.e. Segment A - Segment J)
    segments = np.reshape(np.random.choice(list('ACGT'), (len_ARC)),(-1,len_S))
    print('Shape of the genome 2D array is ' + str(np.shape(segments)))
    # To generate the Arc1..Arc10 mRNA transcripts:
    # Index the Arcs 1 - 10 according to their segment coordinates
    Arc1 = segments[0:4].flatten() # Covers ABCD
    Arc2 = segments[1:3].flatten() # Covers BC
    Arc3 = segments[2:5].flatten() # Covers CDE
    Arc4 = segments[3:7].flatten() # Covers DEFG
    Arc5 = segments[4:8].flatten() # Covers EFGH
    Arc6 = segments[5:8].flatten() # Covers FGH
    Arc7 = segments[6:8].flatten() # Covers GH
    Arc8 = segments[7:9].flatten() # Covers HI

    # if circular equals true, include arc9 and arc10
    if circular == True:
        # np.append() appends the part of the mRNA that is at end of the circle to the part of the mRNA that is at the start of the circle.
        Arc9 = np.append(segments[8:10].flatten(),segments[0].flatten(),axis = 0) # Covers IJA. 
        Arc10 = np.append(segments[9].flatten(),segments[0:2].flatten(),axis = 0) # Covers JAB
           # Create a list of the mRNA names 
        mRNA_names = ['Arc1','Arc2','Arc3','Arc4','Arc5','Arc6','Arc7','Arc8','Arc9','Arc10']
    # Put the actual sequence of Arc1 - Arc10 in a list
        mRNA_transcripts = [Arc1,Arc2,Arc3,Arc4,Arc5,Arc6,Arc7,Arc8,Arc9,Arc10]
    # if circular is false, then only inlcue arc1 - arc8
    else:
        mRNA_names = ['Arc1','Arc2','Arc3','Arc4','Arc5','Arc6','Arc7','Arc8']
        mRNA_transcripts = [Arc1,Arc2,Arc3,Arc4,Arc5,Arc6,Arc7,Arc8]


 


    output = gzip.open("simulated.fasta.gz", "wt")

    # Creating a loop that loops over the list of mRNA names and mRNA transcripts and writes them to a .fasta file
    for i in range(len(mRNA_names)):

        # Create a variable that contains a single mRNA transcript sequence as a single string
        single_transcript_as_list = list(mRNA_transcripts[i])
        # Write to the output file a line that has > plus the mRNA name
        output.write(">" + mRNA_names[i] + "\n")
        # Write to the output file the single mRNA transcript 70 characters per line
        one_list_per_line = [single_transcript_as_list[i:i+70] for i in range(0,len(single_transcript_as_list), 70)]
        for j in one_list_per_line:
            output.write("".join(j) + "\n")
    output.close()   

    # Calcultating the transcript counts based on assgined nucleotide abundances from figure 1 if circular is true
    if circular == True:
        C_Arc1 = 0.008 * N
        C_Arc2 = 0.039 * N
        C_Arc3 = 0.291 * N
        C_Arc4 = 0.112 * N
        C_Arc5 = 0.127 * N
        C_Arc6 = 0.008 * N
        C_Arc7 = 0.059 * N
        C_Arc8 = 0.060 * N
        C_Arc9 = 0.022 * N
        C_Arc10 = 0.273 * N
        # putting all transcript counts into a list
        all_arc_counts = [C_Arc1,C_Arc2,C_Arc3,C_Arc4,C_Arc5,C_Arc6,C_Arc7,C_Arc8,C_Arc9,C_Arc10]
    
    # Calculating the transcript counts based on self-assigned nucletide abundances for transcripts Arc1 - Arc8
    else: 
        
        C_Arc1 = 0.0449 * N
        C_Arc2 = 0.0759 * N
        C_Arc3 = 0.3279 * N
        C_Arc4 = 0.1489 * N
        C_Arc5 = 0.1639 * N
        C_Arc6 = 0.0457 * N
        C_Arc7 = 0.0959 * N
        C_Arc8 = 0.0969 * N
    
        all_arc_counts = [C_Arc1,C_Arc2,C_Arc3,C_Arc4,C_Arc5,C_Arc6,C_Arc7,C_Arc8]
    
    list_of_reads = []


    for k, count in enumerate(all_arc_counts):
        #all_reads_for_one_arc = []
        for j in range(int(count)):

            start_pos = np.random.randint(0, len(mRNA_transcripts[k]) - 75) 
            single_read = mRNA_transcripts[k][start_pos : start_pos + 75]
            assert len(single_read) == 75, 'Read length not 75'
 
            list_of_reads.append(single_read)
            


    print('Number of total reads generated is ' + str(len(list_of_reads)))


    

    quality_values = ''.join(np.random.choice(list('I'), 75))
    output_2 = gzip.open("simulated.fastq.gz", "wt")
    
    
    for i in range(int(sum(all_arc_counts)) - 1):

    # Create a variable that contains a single read sequence as a single string
        single_reads_string = "".join(list(list_of_reads[i]))
        # Write to the output file a line that has > plus the mRNA name
        output_2.write("@read" + str(i+1) + "\n")
        output_2.write(single_reads_string + "\n")
        output_2.write("+" + "\n")
        output_2.write(quality_values + "\n")


    output_2.close()  

    

        






    











In [30]:
arc_transcriptome_read_simulator(circular=True)

Shape of the genome 2D array is (10, 1000)
Number of total reads generated is 99899


### 3. Test Kallisto

In [31]:
# Building a kallisto index of the simulated Arc transcriptome, which we'll call transcripts_2.idx
!kallisto index -i transcripts_2.idx simulated.fasta.gz


[build] loading fasta file simulated.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 19 contigs and contains 10000 k-mers 



In [32]:
# Quantification of simulated single ends reads passing in the mean and standard deviation of fragment length
!kallisto quant -i transcripts_2.idx -o simulated_arc --single -l 75 -s 10 simulated.fastq.gz  


[quant] fragment length distribution is truncated gaussian with mean = 75, sd = 10
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,000
[index] number of equivalence classes: 26
[quant] running in single-end mode
[quant] will process file 1: simulated.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 99,899 reads, 99,899 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 69 rounds



#### Table of results from my simulation of Arc transcriptome and RNA-seq reads:
| Transcript | TPM from Figure 1 | TPM from simulated reads |segments_covered
| --- | --- | --- | --- |
| Arc 1 |6000   |17275 (Approx 3 x 6000) | ABCD|
| Arc 2 |58000   |47400| BC |
| Arc 3 |290000   |287757| CDE |
| Arc 4 |83000  |75902 | DEFG |
| Arc 5 |94000  |92989 | EFGH |
| Arc 6 |7800   |16819 (Approx 2 x 7800) | FGH |
| Arc 7 |87000   |87543| GH |
| Arc 8 |88000   |85776| HI |
| Arc 9 |22000   |24461 | IJA |
| Arc 10 |270000   |264077| JAB |


### From the above table, we see how Moriarty got his claim that the abundance of Arc1 was under estimated by about 3-fold and Arc6 by about 2-fold. I formed a hypothesis that Moriarty got errornous results becauses Kallisto does not quantificate circular transcriptome robustly. I then designed an experiment where I simulated a linear transcriptome by only keeping Arc1 through Arc8 and removing Arc9 and Arc10 because these two transcripts start at the tail of the circle and end at the head of the circle (if segment A is defined as the head and segment I as the tail). I did this by assigning Arc1 through Arc8 new nucleotide abundances that is similar to the original abundances and have a sum of 1, while keeping the original lengths. I then calculated TPM from the new nucleotide abundances. This caluclated TPM will serve as the baseline for the experiment.

## 4. "debug" kallisto

In [20]:
# Turning the ciruclar argument to False to start the aforementioned experiment
arc_transcriptome_read_simulator(circular=False)

Shape of the genome 2D array is (10, 1000)
Number of total reads generated is 99999


In [21]:
# Building a kallisto index of the simulated experiment, which we'll call transcripts_3.idx
!kallisto index -i transcripts_3.idx simulated.fasta.gz


[build] loading fasta file simulated.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 13 contigs and contains 8970 k-mers 



In [22]:
!kallisto quant -i transcripts_3.idx -o experiment --single -l 75 -s 10 simulated.fastq.gz  


[quant] fragment length distribution is truncated gaussian with mean = 75, sd = 10
[index] k-mer length: 31
[index] number of targets: 8
[index] number of k-mers: 8,970
[index] number of equivalence classes: 19
[quant] running in single-end mode
[quant] will process file 1: simulated.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 99,999 reads, 99,999 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds



In [23]:
# calculate nucleotide abundances following the formula: TPM_i = (V_i / L_i) * sum(V_i/L_j)^-1 * 1000000
new_experiment_base_tpm = ((np.array([0.0449,0.0759,0.3279,0.1489,0.1639,0.0457,0.0959,0.0969]) / np.array([4000,2000,3000,4000,4000,3000,2000,2000])) /
 np.sum(np.array([0.0449,0.0759,0.3279,0.1489,0.1639,0.0449,0.0959,0.0969]) / np.array([4000,2000,3000,4000,4000,3000,2000,2000]))) * 1000000
new_experiment_base_tpm = np.ndarray.round(new_experiment_base_tpm,decimals=2)
# Getting the TPM difference in percentages between the calculated baseline TPM and the output from Kalisto
TPM_differnece_in_percentage =  np.abs(np.array([32733.1,107026,313448,111978,103443,54679.9,135347,141346]) /new_experiment_base_tpm - 1) * 100

In [12]:
print(new_experiment_base_tpm)
print(TPM_differnece_in_percentage)

[ 32251.89 109038.67 314042.86 106955.58 117730.16  43768.71 137770.86
 139207.47]
[ 1.49203659  1.84583139  0.18942     4.69579988 12.13551396 24.92920171
  1.75934156  1.53621785]


#### Table of results from my experiment trying to 'correct' Kallisto:
| Transcript | Nucleotide Abundance | Length| Baseline TPM | Kallisto TPM Output | Segments Covered | Percent Difference (Baseline TPM vs Kallisto TPM output)
| --- | --- | --- | --- | --- | --- | --- | 
| Arc 1 |0.0449   |4000 | 32251.89| 32733.1   | ABCD | 1.49% |
| Arc 2 |0.0759   |2000 | 109038.67| 107026  |  BC   | 1.84% |
| Arc 3 |0.3279   |3000 | 314042.86| 313448  |  CDE  | 0.19% |
| Arc 4 |0.1489   |4000 | 106955.58|111978  |   DEFG | 4.70% |
| Arc 5 |0.1639   |4000 | 117730.16| 103443  |  EFGH | 12.14% |
| Arc 6 |0.0457   |3000 | 43768.71| 54679.7  |  FGH  | 24.93%|
| Arc 7 |0.0959   |2000 | 137770.86| 135347 |   GH   | 1.76%|
| Arc 8 |0.0969   |2000 | 139207.47| 141346 |   HI   | 1.54% |

### From the above table of the experiment I performed based on the linear transcriptome I generated, we can see that the baseline TPM is approximately the same as the TPM output from Kallisto. More sepcifically, the TPM similarity has improved for both Arc1 and Arc6. Thus, my hypothesis that Kallisto does not quantificate circular sequence well is generally supported by this result. The difference of 2x that was previously observed in Arc 6 is now reduced to approximately 1.25x after the experiment, but this is still a significant difference. I suspect that this difference is due to segment overlap: all segments (FGH) of Arc 6 is in Arc 5 also all part of Arc 7 (GH) is in Arc 6. However, this new hypothesis needs to be proved by further anaylsis.

### Overall, I conclude that the discrepancy we are seeing with Kallisto's quantification is partially due to the transcriptome being circular. In addition, further experiments should explore the impact of significant segment overlap on Kallisto's quantificatoin accuarcy.