# pset02: the adventure of the ten Arcs


## 2) reproduce Moriarty's result
I downloaded the files and ran kallisto from the command line inside this notebook.

In [21]:
#import packages to use
import random
import gzip
import numpy as np #import numpy

In [2]:
#preview file
! gunzip -c arc.fasta.gz  | head -10

>Arc1
TAGCCTTCATCCTGTGTGGGTGTGGGCTCCCACTCGGTTCTAGGTCAGTACGAGCCTGCA
CCTTCCTGTGGAGCAAGTCCGTCTCCTTCCTGCGCTCATACCTAATGAGTGAGGCGCTAA
CTGCCCCTATGGGCGGATGGACCCAACTAGCCCATGAGTCGACCACCAGAGAACCTTGAT
CCGTCCTTGCCAGCATTAATGAGCATTCTCTTAGTTTTGACAGCGGGGCGATTCATGAGA
AACATATGCTTCCCCTTGTTCGAGCCGGATCACTTGAGTCGATACGTCTCCGGGGGTCTC
CGGGGAAGCCTCAGGGACCTAGTCCGATAACAGACACCTATATGCTAGTTGCTGGTGGAT
TGTGTTTCAATCTTCTTCCAAGAAGTGCACGTAAACATGGGGGTGTCGGTTATGGAAAGG
ATACCTATCTCCAGAATCAGTAACAAGTCAATGTAACGGGACGCACGGGACTCACCATCT
CTAGTATGCACTCTGCCGATGGGAACTTCGAATGCGCGATGCCTCTATTTCCAGTTGTAG


In [3]:
#run index to create transcriptome file
! kallisto index -i transcripts.idx arc.fasta.gz
#analyze reads and transcriptome file
! kallisto quant -i transcripts.idx -o abundances.tsv --single -l 150 -s 20 arc.fastq.gz


[build] loading fasta file arc.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 19 contigs and contains 10000 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 150, sd = 20
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,000
[index] number of equivalence classes: 26
[quant] running in single-end mode
[quant] will process file 1: arc.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 99,981 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds



In [4]:
#print kallisto results and compare to Moriarty's
with open("abundances.tsv/abundance.tsv") as infile:  # Open file for reading
    for line in infile:             # Iterate through each line in the file
        values = line.split()       # creates list of values
        print('%-17s %-17s %-17s %-17s %-17s'
              % (values[0], values[1], values[2], values[3], values[4]))


target_id         length            eff_length        est_counts        tpm              
Arc1              4000              3851              3348.79           24514.5          
Arc2              2000              1851              3574.56           54440.7          
Arc3              3000              2851              28065.2           277510           
Arc4              4000              3851              10597.7           77579.1          
Arc5              4000              3851              12526.7           91700.1          
Arc6              3000              2851              1953.39           19315.2          
Arc7              2000              1851              5579.78           84980.4          
Arc8              2000              1851              5700.81           86823.7          
Arc9              3000              2851              3052.55           30183.8          
Arc10             3000              2851              25581.6           252953           


The TPM values in abundances.tsv/abundance.tsv were very similar to Moriarty's TPM results.
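
As a sanity check on how the columns of the table relate, the sketch below recomputes TPM from est_counts and eff_length, assuming TPM is the per-base rate (est_counts / eff_length) rescaled to sum to one million. The numbers are copied from the table above; the names `kallisto_rows` and `tpm_from_counts` are made up for this illustration. In this output eff_length is always length minus 149, which matches length - 150 + 1 for the `-l 150` setting used here.

```python
# Values copied from the abundance.tsv table above:
# name -> (length, eff_length, est_counts, reported tpm)
kallisto_rows = {
    "Arc1":  (4000, 3851, 3348.79, 24514.5),
    "Arc2":  (2000, 1851, 3574.56, 54440.7),
    "Arc3":  (3000, 2851, 28065.2, 277510.0),
    "Arc4":  (4000, 3851, 10597.7, 77579.1),
    "Arc5":  (4000, 3851, 12526.7, 91700.1),
    "Arc6":  (3000, 2851, 1953.39, 19315.2),
    "Arc7":  (2000, 1851, 5579.78, 84980.4),
    "Arc8":  (2000, 1851, 5700.81, 86823.7),
    "Arc9":  (3000, 2851, 3052.55, 30183.8),
    "Arc10": (3000, 2851, 25581.6, 252953.0),
}

def tpm_from_counts(rows):
    """Recompute TPM as est_counts / eff_length, rescaled to sum to 1e6."""
    rate = {name: r[2] / r[1] for name, r in rows.items()}
    total = sum(rate.values())
    return {name: 1e6 * x / total for name, x in rate.items()}

recomputed = tpm_from_counts(kallisto_rows)
for name, r in kallisto_rows.items():
    print('%-8s reported %-10.1f recomputed %-10.1f' % (name, r[3], recomputed[name]))
```

The recomputed values should agree with the reported tpm column up to rounding of the printed est_counts.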

## 3) simulate an Arc transcriptome and RNA-seq reads
The Python code below simulates an Arc locus, its transcriptome, and 100,000 reads from an RNA-seq experiment on that transcriptome. Each step is handled by its own function.

In [5]:
# Set up the Arc locus (modified from pset)
num_segments= 10           # Number of segments in the Arc locus (A..J)
num_transcripts = num_segments  # Number of different transcripts (the same, one starting on each segment, 1..10)
reads_to_gen= 100000       # total number of observed reads we generate
len_seg     = 1000         # length of each segment (nucleotides)
len_Arc     = len_seg * num_segments    # total length of the Arc locus (nucleotides)
len_read    = 75           # read length

In [6]:
#creates random sequences for each segment
def create_segments(num, length):
    '''
    Given: number of segments and length of each segment in bp
    Return: list of segments
    '''
    segment_list = [] #empty list to store segments
    #Generate DNA sequences
    dna = ["A","G","C","T"]
    
    #choose a random nucleotide until have a full length segment
    for i in range(0, num):
        segment_random ='' #empty string
        for j in range(0,length):
            segment_random+=random.choice(dna)
        segment_list.append(segment_random)
                       
    return segment_list


In [7]:
#creates transcripts based on the segment sequences
def create_transcripts(seg_sequences, seg_for_transcript, file_name):
    '''
    Given: a list of segments and a list that details which segments are in each transcript
            for example, mRNA transcript 1 has segments ABCD, which correspond to 0,1,2,3
            given file name, write transcripts to a fasta file
    Return: list of transcripts, and list of transcripts lengths
    '''
    #mRNA transcripts
    transcripts = [] # list to hold all transcript sequences (Arc1 through Arc10)
    length = []
    
    #for each transcript concatenate the sequence denoted by the list of segments
    for segs in seg_for_transcript:
        tx = '' #empty string for transcript
        for x in segs:
            tx += seg_sequences[x]
        transcripts.append(tx)
        length.append(len(tx))
    
    with gzip.open(file_name, "wt") as outfile:
        #write for each transcript
        for i in range(0, len(transcripts)):
            outfile.write('>Arc%s\n' % str(i+1)) #write name of gene (Arc#)
            outfile.write('%s\n' % transcripts[i]) #transcript sequence

    return transcripts, length
 

In [8]:
#generates reads based on parameters
def generate_reads(v_i, reads_to_gen, len_read, transcripts, file_name):
    '''
    Given: list of nucleotide abundances, number of reads to generate, the length of each read,
    and a list of transcripts
    Simulates reads and saves them as a gzipped fastq file (with file_name as the name)
    Returns: normalized v_i list
    '''
    #normalize v_i to sum to 1
    total = float(sum(v_i))
    v_i = [float(i)/total for i in v_i]
    num = list(range(len(transcripts))) #transcript indices, for convenience of naming later on

    with gzip.open(file_name, "wt") as outfile:

        #generate reads and write to outfile
        for i in range(0,reads_to_gen):
            #sample transcript i based on vi
            t_num = np.random.choice(num, p = v_i)
            t = transcripts[t_num] #entire transcript

            #choose random start position in transcript between 0 and L-len_read (inclusive)
            #np.random.randint excludes the upper bound, hence the +1
            start_pos = np.random.randint(0,len(t)-len_read+1)
            read = t[start_pos:start_pos+len_read] #take substring of length len_read

            #header: read number, source transcript, fragment start, fragment length
            #no trailing spaces, so the sequence and quality lines stay clean
            outfile.write('@%s:Arc%s:%s:%s\n'
                          % (str(i+1), str(t_num+1), str(start_pos), str(len_read)))
            outfile.write('%s\n' % read)
            outfile.write('+\n')
            outfile.write('%s\n' % ('I' * len_read))

    return v_i
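
Because each read header records its source transcript, the simulated fastq file can be checked against the intended abundances before kallisto is run. The sketch below is one way to do that; `count_read_sources` and `observed_fractions` are hypothetical helper names, and the parser assumes the `@number:ArcN:start:length` header layout written by generate_reads.

```python
import gzip
from collections import Counter

def count_read_sources(fastq_path):
    """Count reads per source transcript from headers like '@12:Arc3:481:75'."""
    counts = Counter()
    with gzip.open(fastq_path, "rt") as infile:
        for line_number, line in enumerate(infile):
            # a fastq record is 4 lines long; the header is the first of the four
            if line_number % 4 == 0:
                fields = line.strip().lstrip('@').split(':')
                counts[fields[1]] += 1   # second field is the transcript name
    return counts

def observed_fractions(fastq_path):
    """Fraction of reads drawn from each transcript."""
    counts = count_read_sources(fastq_path)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}
```

After a call to generate_reads, `observed_fractions("reads.fastq.gz")` should come out close to the normalized v_i list, since each read's transcript is sampled with those probabilities.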

## 4) test kallisto
To generate simulated data, I use the same transcript lengths and nucleotide abundances from the first table.

In [9]:
#initialize relevant values

#each transcript contains various segments (0 denotes segment A, 1 = B, etc)
seg_covered_original=   [[0,1,2,3],
                        [1,2],
                        [2,3,4],
                        [3,4,5,6],
                        [4,5,6,7],
                        [5,6,7],
                        [6,7],
                        [7,8],
                        [8,9,0],
                        [9,0,1]]

#original nucleotide abundance
v_i = [0.008,
       0.039,
       0.291,
       0.112,
       0.127,
       0.008,
       0.059,
       0.060,
       0.022,
       0.273]

In [10]:
#run kallisto

#generate segments
segment_list = create_segments(num_segments, len_seg)
#generate transcriptome and write to fasta file
transcripts, transcript_length = create_transcripts(segment_list, seg_covered_original, "transcripts.fasta.gz")
#generate reads and write to fastq file
norm_vi = generate_reads(v_i, reads_to_gen, len_read, transcripts, "reads.fastq.gz")
#run kallisto on positive control
! kallisto index -i arc_transcripts.idx transcripts.fasta.gz
! kallisto quant -i arc_transcripts.idx -o abundances2.tsv --single -l 75 -s 10 reads.fastq.gz


[build] loading fasta file transcripts.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 19 contigs and contains 10000 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 75, sd = 10
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,000
[index] number of equivalence classes: 26
[quant] running in single-end mode
[quant] will process file 1: reads.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 100,000 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 57 rounds



Now, I write a function to compare the output of Kallisto against the original data. To compare against the original nucleotide abundances, I first normalize Kallisto's est_counts for each transcript so that they sum to 1.
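
The quantities being compared can be written out explicitly. This is a sketch of the conventions used in this notebook, with $c_i$ for Kallisto's est_counts, $\nu_i$ for the nucleotide abundance, $L_i$ for the transcript length, and $\tau_i$ for the transcript abundance. The symbols are labels chosen here for illustration, and the original-data TPM in this notebook divides by the true transcript length rather than Kallisto's eff_length.

$$
\nu_i = \frac{c_i}{\sum_j c_j}, \qquad
\tau_i = \frac{\nu_i / L_i}{\sum_j \nu_j / L_j}, \qquad
\mathrm{TPM}_i = 10^6 \, \tau_i
$$

For example, Arc1 has $\nu_1 = 0.008$ and $L_1 = 4000$, so $\nu_1 / L_1 = 2 \times 10^{-6}$. Summing $\nu_j / L_j$ over all ten transcripts with the abundances and lengths from part 4 gives $3.3875 \times 10^{-4}$, so $\mathrm{TPM}_1 = 10^6 \times \frac{2 \times 10^{-6}}{3.3875 \times 10^{-4}} \approx 5904$.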

In [11]:
def check_kallisto(path="abundances2.tsv/abundance.tsv"):
    '''
    Given: file path for the abundance.tsv (defaults to the output of the kallisto run above)
    Return: list of all kallisto results, a list of normalized nucleotide abundances, and a list of tpm values
    Print out normalized nucleotide abundances and tpm values
    '''
    all_values = [] #stores all kallisto results in table form
    with open(path) as infile:  # Open file for reading 
        for line in infile:              # Iterate through each line in the file
            values = line.split()       # creates list of values
            all_values.append(values)
    del all_values[0]
    
    #finds the sum of all the estimated counts 
    total_count = 0
    for x in all_values:
        total_count += float(x[3])
    
    #Print results from Kallisto
    print("Kallisto Results")
    print('%-17s %-17s %-17s' % ("Transcript", "Vi (norm)", "TPM values"))
    
    for val in all_values:
        norm_vi = float(val[3])/float(total_count) # normalize vi values to sum to 1
        val.append(norm_vi)
            
        print('%-17s %-17.6f %-17s' % (val[0], val[5], val[4]))
    print("\n")
    
    
    #create list of normalized vi values and tpm to return
    kallisto_norm_vi = []
    kallisto_tpm = []
    for x in all_values:
        kallisto_norm_vi.append(x[5])
        kallisto_tpm.append(x[4])
    
    return all_values, kallisto_norm_vi, kallisto_tpm


In [12]:
def print_original_info(vi_list, transcript_length):
    '''
    Given: list of vi and list of transcript lengths
    Print vi and tpm in column format with corresponding transcript
    Return: list of tpm for original data
    '''
    
    #Calculate TPM given lengths and v_i for each transcript
    tpm_actual = [] #empty list   
    
    #create arrays
    vi = np.array(vi_list)
    len_array = np.array(transcript_length)
    #normalize vi by length of each transcript
    ratio_v_l = np.divide(vi, len_array)
    total_ti = float(sum(ratio_v_l))
    
    for x in ratio_v_l:
        tpm_i = 1000000*float(x)/total_ti
        tpm_actual.append(tpm_i)
    print("Original Data")
    print('%-17s %-17s %-17s'
      % ('Transcript','Original Vi', "TPM values"))
    for i in range(0,len(vi)):
        print('%-17s %-17.6f %-17.1f'% ('Arc'+str(i+1), vi_list[i], tpm_actual[i]))
    print("\n")
    
    return tpm_actual

In [13]:
def percent_difference(results, reference):
    '''
    Given: 2 lists
    
    Prints percent difference of results list relative to reference list
    '''
    
    #calculate the percent difference of the results (e.g. Kallisto's TPM) relative to the original data set 
    print('%-17s %-17s'
          % ('Transcript','% Dif (compared to original)'))
    for i in range(0, len(results)):
        x=float(results[i])
        y=float(reference[i])
        print('%-17s %-17.2f'
                % ('Arc'+str(i+1), 100*(x-y)/y))

    print('\n')

In [14]:
#print original nucleotide info 
og_tpm = print_original_info(norm_vi, transcript_length)
#print Kallisto Results
kallisto_results, kallisto_norm_vi, kallisto_tpm = check_kallisto()
#print difference in vi and tpm values

print("Differences in Vi")
percent_difference(kallisto_norm_vi, norm_vi)
print("Differences in TPM")
percent_difference(kallisto_tpm, og_tpm)



Original Data
Transcript        Original Vi       TPM values       
Arc1              0.008008          5904.1           
Arc2              0.039039          57564.6          
Arc3              0.291291          286346.9         
Arc4              0.112112          82656.8          
Arc5              0.127127          93726.9          
Arc6              0.008008          7872.1           
Arc7              0.059059          87084.9          
Arc8              0.060060          88560.9          
Arc9              0.022022          21648.2          
Arc10             0.273273          268634.7         


Kallisto Results
Transcript        Vi (norm)         TPM values       
Arc1              0.031395          23143.6          
Arc2              0.033709          50652.3          
Arc3              0.286175          283054           
Arc4              0.101033          74477.4          
Arc5              0.125146          92252.7          
Arc6              0.018243          18044.3      

Kallisto did not recover the TPM values of the original data set. Analysis of both nucleotide abundance and TPM (which are related) shows that Kallisto overestimates Arc1 by roughly 4-fold (23143.6 vs 5904.1 TPM) and Arc6 by roughly 2-fold (18044.3 vs 7872.1 TPM), which is consistent with Moriarty's analysis using Kallisto.

## 5) "debug" kallisto
Kallisto could be messing up the Arc analysis because the Arc transcripts overlap heavily. For example, Arc2 (BC) is a subset of Arc1 (ABCD). The other notable example is that Arc7 (GH) is a subset of Arc6 (FGH).
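
The containment relationships can also be listed mechanically from the segment layout. The sketch below treats each transcript as a plain list of segment indices and reports every transcript whose segments form a consecutive run inside a longer one. The function names `contains_run` and `contained_transcripts` are hypothetical, and the layout is copied from part 4.

```python
def contains_run(big, small):
    """True if the list `small` appears as a consecutive run inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))

def contained_transcripts(layout):
    """Return (inner, outer) name pairs where inner is a consecutive run of outer."""
    pairs = []
    for i, inner in enumerate(layout):
        for j, outer in enumerate(layout):
            if i != j and len(inner) < len(outer) and contains_run(outer, inner):
                pairs.append(('Arc%d' % (i + 1), 'Arc%d' % (j + 1)))
    return pairs

# segment layout of the ten Arc transcripts (0 = segment A, 1 = B, ...)
seg_covered_original = [[0, 1, 2, 3], [1, 2], [2, 3, 4], [3, 4, 5, 6], [4, 5, 6, 7],
                        [5, 6, 7], [6, 7], [7, 8], [8, 9, 0], [9, 0, 1]]

for inner, outer in contained_transcripts(seg_covered_original):
    print('%s is contained in %s' % (inner, outer))
```

On this layout the check reports Arc2 inside Arc1 and Arc7 inside Arc6, and it also flags Arc6 and Arc7 as contained in Arc5 (EFGH).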

I suspect that this overlap in segments can explain why the nucleotide abundances for Arc1 and Arc6 are overestimated. Let's start with Arc1. The sequence of Arc2 is contained entirely inside Arc1, so a read from Arc2 is indistinguishable from a read from the BC part of Arc1, and kallisto erroneously assigns some of these reads to Arc1 in proportion to the two nucleotide abundances. About $\frac{0.008}{0.039+0.008} = 0.17$ of Arc2's reads are counted as Arc1 reads, so the TPM of Arc1 is artificially inflated: $6000 + 0.17 \times 58000 = 15860$. This value is much closer than the true TPM of about 6000 to Moriarty's result of about 25000 and my simulation result of about 23000 (which changes slightly on each rerun). The observed TPM values are larger than 15860 because my crude calculation only takes into account the overlap between Arc1 and Arc2. In reality, more of Arc1 overlaps with other transcripts, since its segments A, B, C, and D also appear in other transcripts in the same order (CD appears in Arc3 and AB in Arc10).

A similar calculation can be done for Arc6 and Arc7. The fraction of Arc7 reads that kallisto will assign to Arc6 is $\frac{0.008}{0.008+0.059}=0.119$, so a rough estimate of the TPM for Arc6 is $7800 + 0.119 \times 87000 = 18153$, which is very close to the TPM kallisto calculated in Moriarty's analysis (about 19000) and in my simulation (about 18000).
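
The two back-of-the-envelope estimates above can be reproduced with a few lines of Python. This is only a sketch of the crude model described in the text (a fixed fraction of the contained transcript's reads is handed to the low-abundance transcript); `naive_inflated_tpm` is a hypothetical helper, and because it keeps the fraction unrounded the results differ slightly from the hand calculation.

```python
def naive_inflated_tpm(tpm_low, tpm_high, v_low, v_high):
    """Crude estimate: the low-abundance transcript gains a share of the other's TPM.

    The share is v_low / (v_low + v_high), the ratio of nucleotide abundances.
    Returns (share, inflated TPM estimate).
    """
    share = v_low / (v_low + v_high)
    return share, tpm_low + share * tpm_high

# Arc1 (true TPM ~6000, v = 0.008) gaining reads from Arc2 (TPM ~58000, v = 0.039)
share_arc1, est_arc1 = naive_inflated_tpm(6000, 58000, 0.008, 0.039)
# Arc6 (true TPM ~7800, v = 0.008) gaining reads from Arc7 (TPM ~87000, v = 0.059)
share_arc6, est_arc6 = naive_inflated_tpm(7800, 87000, 0.008, 0.059)

print('Arc1: share %.3f, estimated TPM %.0f' % (share_arc1, est_arc1))
print('Arc6: share %.3f, estimated TPM %.0f' % (share_arc6, est_arc6))
```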

### Hypothesis
To test the hypothesis that a large amount of overlapping sequences between transcripts causes kallisto to overcount transcripts (especially those with low nucleotide abundance), I created a debug function to run kallisto with different segments in each transcript and nucleotide abundances.

In [15]:
#run kallisto with different segments within each transcript and also different nucleotide abundances
#reuses the helper functions defined in the previous parts
def debug_kallisto(segment_order_per_transcript, v_i):
    '''
    Given: a list detailing which segments are present in each transcript, and all nucleotide abundances
    Simulate reads, run kallisto, and print kallisto's results next to the original data
    '''
    #generate transcriptome and write to fasta file
    transcripts, transcript_length = create_transcripts(segment_list, segment_order_per_transcript, "transcripts.fasta.gz")
    #generate reads and write to fastq file
    norm_vi = generate_reads(v_i, reads_to_gen, len_read, transcripts, "reads.fastq.gz")
    #run kallisto on the simulated data
    ! kallisto index -i arc_transcripts.idx transcripts.fasta.gz
    ! kallisto quant -i arc_transcripts.idx -o abundances2.tsv --single -l 75 -s 10 reads.fastq.gz

    #print original nucleotide info and calculate the true TPM
    tpm_actual = print_original_info(norm_vi, transcript_length)
    #print Kallisto results (read from abundances2.tsv/abundance.tsv)
    kallisto_results, kallisto_norm_vi, kallisto_tpm = check_kallisto()

    #print difference in vi and tpm values
    print("Differences in Vi")
    percent_difference(kallisto_norm_vi, norm_vi)
    print("Differences in TPM")
    percent_difference(kallisto_tpm, tpm_actual)

### Different Scenarios
I created three alternate versions of how the transcripts were spliced. Again, the numbers correspond to which segment is to be included in the corresponding transcript (so 0 is segment A).

In [16]:
seg1 = [[0],
        [1],
        [2],
        [3],
        [4],
        [5],
        [6],
        [7],
        [8],
        [9]]

seg2 = [[0,1,2,3],
        [1,2,3,4],
        [2,3,4,5],
        [3,4,5,6],
        [4,5,6,7],
        [5,6,7,8],
        [6,7,8,9],
        [7,8,9,0],
        [8,9,0,1],
        [9,0,1,2]]

seg3 = [[0,1,2,3],
        [1,3,5],
        [2,4,6],
        [3,1],
        [4],
        [5,8,0],
        [6,7,8,9],
        [7,9,1],
        [8,3],
        [9]]
        
v_equal = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]    
#create equal nucleotide abundances


### Test 1: No Overlap
The first version (denoted by seg1) is when each transcript is only one segment long (Arc1 has segment A, Arc2 has segment B, etc.). The results show that Kallisto is very accurate (almost 0% difference between calculated and actual nucleotide abundances and TPM).

In [17]:
print("Each Transcript contains only one segment (NO OVERLAP)")
debug_kallisto(seg1,v_i)

Each Transcript contains only one segment (NO OVERLAP)

[build] loading fasta file transcripts.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 10 contigs and contains 9700 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 75, sd = 10
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 9,700
[index] number of equivalence classes: 10
[quant] running in single-end mode
[quant] will process file 1: reads.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 100,000 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds

Original Data
Transcript        Original Vi       TPM values       
Arc1              0.008008          8008.0           
Arc2              0.03903

### Test 2a: Excess Overlap, original (uneven) nucleotide abundances
The second version (denoted by seg2) is when each transcript overlaps with others (Arc1 has segments A,B,C,D, Arc2 has segments B,C,D,E, etc.). The results show that Kallisto is not accurate. The largest differences emerge for transcripts with low nucleotide abundances, likely because Kallisto erroneously assigns reads from overlapping transcripts to Arc1 or Arc6. Because they have low vi to begin with, any additional reads artificially elevate their TPM and vi drastically. This is what happened in Moriarty's analysis (I provided a more in-depth explanation and calculation at the start of part 5).

Indeed, the results below show that Kallisto is inaccurate when large amounts of overlapping sequence are present (>100% difference).

In [18]:
print("Each Transcript contains overlap with multiple transcripts")
debug_kallisto(seg2, v_i)

Each Transcript contains overlap with multiple transcripts

[build] loading fasta file transcripts.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 21 contigs and contains 10000 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 75, sd = 10
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,000
[index] number of equivalence classes: 30
[quant] running in single-end mode
[quant] will process file 1: reads.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 100,000 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 83 rounds

Original Data
Transcript        Original Vi       TPM values       
Arc1              0.008008          8008.0           
Arc2              0

### Test 2b: Excess Overlap, equal nucleotide abundances
This variation on the second version of the transcriptome (denoted by seg2) still contains much overlap between transcripts, but I was interested to see if having transcripts with equal abundance would decrease Kallisto's errors. Interestingly, the results show that Kallisto is very close to the original data (although there are still up to 10% differences). As reads are evenly distributed across possible overlapping transcripts, each transcript ends up with roughly even reads despite the large amount of overlap.

In [19]:
print("Each Transcript contains overlap with multiple transcripts BUT have equal abundance")
debug_kallisto(seg2, v_equal)

Each Transcript contains overlap with multiple transcripts BUT have equal abundance

[build] loading fasta file transcripts.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 21 contigs and contains 10000 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 75, sd = 10
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,000
[index] number of equivalence classes: 30
[quant] running in single-end mode
[quant] will process file 1: reads.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 100,000 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds

Original Data
Transcript        Original Vi       TPM values       
Arc1              0.100000          100000.0    

### Test 3: No consecutive overlap between transcripts
The last version of the transcriptome I wanted to test is not possible in Arc and could explain why Kallisto does not work on Arc but works well on other transcriptomes. Arc is transcribed in a circle and contains consecutive segments of sequence, resulting in many transcripts being a subset of other transcripts. Kallisto resolves each read to the set of transcripts it is compatible with, and if a large majority of one transcript is a subset of another, few reads can tell the two apart, which leads to large errors in determining the vi and TPM of each transcript.

To simulate a more realistic splicing mechanism (one where segments that are not adjacent or consecutive can be spliced together), I created a list of segments (seg3) where no two transcripts share a run of two or more consecutive segments. For example, transcript 1 (ABCD) and transcript 2 (BDF) both contain B and D, but the junctions BC and BD are different sequences, so the two transcripts never share more than one segment in a row. This is different from Arc1 (ABCD) intersecting with Arc2 (BC) over two consecutive segments (BC).

The results show that Kallisto is more accurate in this test than when the transcripts shared multiple consecutive segments (test 2a). All percent differences were less than 15%.
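
The design constraint on seg3 (no two transcripts share a junction between consecutive segments) can be checked directly. The sketch below is a minimal version of such a check; `adjacent_pairs` and `transcripts_sharing_junctions` are hypothetical names, and the layouts are copied from the cell that defines seg2 and seg3.

```python
from itertools import combinations

def adjacent_pairs(segs):
    """Set of ordered (segment, next segment) pairs within one transcript."""
    return set(zip(segs, segs[1:]))

def transcripts_sharing_junctions(layout):
    """Return name pairs of transcripts that share at least one ordered adjacent pair."""
    shared = []
    for (i, a), (j, b) in combinations(enumerate(layout), 2):
        if adjacent_pairs(a) & adjacent_pairs(b):
            shared.append(('Arc%d' % (i + 1), 'Arc%d' % (j + 1)))
    return shared

seg2 = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7],
        [5, 6, 7, 8], [6, 7, 8, 9], [7, 8, 9, 0], [8, 9, 0, 1], [9, 0, 1, 2]]
seg3 = [[0, 1, 2, 3], [1, 3, 5], [2, 4, 6], [3, 1], [4],
        [5, 8, 0], [6, 7, 8, 9], [7, 9, 1], [8, 3], [9]]

print('seg2 pairs sharing a junction:', len(transcripts_sharing_junctions(seg2)))
print('seg3 pairs sharing a junction:', len(transcripts_sharing_junctions(seg3)))
```

An empty result for seg3 confirms that any overlap between two of its transcripts is limited to isolated single segments.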

In [20]:
print("Each Transcript contains overlapping segments, but the order of segments vary")
debug_kallisto(seg3, v_i)

Each Transcript contains overlapping segments, but the order of segments vary

[build] loading fasta file transcripts.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 32 contigs and contains 10169 k-mers 


[quant] fragment length distribution is truncated gaussian with mean = 75, sd = 10
[index] k-mer length: 31
[index] number of targets: 10
[index] number of k-mers: 10,169
[index] number of equivalence classes: 25
[quant] running in single-end mode
[quant] will process file 1: reads.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 100,000 reads, 100,000 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds

Original Data
Transcript        Original Vi       TPM values       
Arc1              0.008008          3244.6           


## Summary 
In this pset, I simulated data to test how accurate Kallisto is and what assumptions it might be making when analyzing transcriptomes and reads. My analysis shows that large amounts of overlapping sequence (especially consecutive segments) are difficult for Kallisto to handle accurately, and that errors in the TPM and vi calculations are exacerbated when transcripts have very unequal abundances.
