# k-Mer Composition
**Given:** A DNA string *s* in FASTA format (having length at most 100 kbp).

**Return:** The 4-mer composition of *s*.
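
To make the definition concrete: the k-mer composition lists, for every possible k-mer over the alphabet in lexicographic order, how many times it occurs in *s*, counting overlapping occurrences. Below is a minimal sketch for k = 2 on a short, made-up string. The function name `toy_composition` and the example string are illustrative assumptions, not part of the problem statement.

```python
import itertools

def toy_composition(s, k, alphabet="ACGT"):
    # Enumerate all k-mers in lexicographic order (the alphabet is sorted first).
    kmers = ["".join(p) for p in itertools.product(sorted(alphabet), repeat=k)]
    # Slide a window of length k over s to collect every overlapping k-mer.
    windows = [s[i:i + k] for i in range(len(s) - k + 1)]
    # Map each possible k-mer to its number of occurrences (0 if absent).
    return {kmer: windows.count(kmer) for kmer in kmers}

composition = toy_composition("ACGTACGA", 2)
for kmer, count in composition.items():
    if count:
        print(kmer, count)
```

Only the 2-mers that actually occur are printed; the full composition still has one entry for each of the 16 possible 2-mers.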


# Sample Dataset

In [1]:
%%file Sample_Dataset.txt
>Rosalind_6431
CTTCGAAAGTTTGGGCCGAGTCTTACAGTCGGTCTTGAAGCAAAGTAACGAACTCCACGG
CCCTGACTACCGAACCAGTTGTGAGTACTCAACTGGGTGAGAGTGCAGTCCCTATTGAGT
TTCCGAGACTCACCGGGATTTTCGATCCAGCCTCAGTCCAGTCTTGTGGCCAACTCACCA
AATGACGTTGGAATATCCCTGTCTAGCTCACGCAGTACTTAGTAAGAGGTCGCTGCAGCG
GGGCAAGGAGATCGGAAAATGTGCTCTATATGCGACTAAAGCTCCTAACTTACACGTAGA
CTTGCCCGTGTTAAAAACTCGGCTCACATGCTGTCTGCGGCTGGCTGTATACAGTATCTA
CCTAATACCCTTCAGTTCGCCGCACAAAAGCTGGGAGTTACCGCGGAAATCACAG



Overwriting Sample_Dataset.txt


# Sample Output

In [2]:
%%file Sample_Output.txt
4 1 4 3 0 1 1 5 1 3 1 2 2 1 2 0 1 1 3 1 2 1 3 1 1 1 1 2 2 5 1 3 0 2 2 1 1 1 1 3 1 0 0 1 5 5 1 5 0 2 0 2 1 2 1 1 1 2 0 1 0 0 1 1 3 2 1 0 3 2 3 0 0 2 0 8 0 0 1 0 2 1 3 0 0 0 1 4 3 2 1 1 3 1 2 1 3 1 2 1 2 1 1 1 2 3 2 1 1 0 1 1 3 2 1 2 6 2 1 1 1 2 3 3 3 2 3 0 3 2 1 1 0 0 1 4 3 0 1 5 0 2 0 1 2 1 3 0 1 2 2 1 1 0 3 0 0 4 5 0 3 0 2 1 1 3 0 3 2 2 1 1 0 2 1 0 2 2 1 2 0 2 2 5 2 2 1 1 2 1 2 2 2 2 1 1 3 4 0 2 1 1 0 1 2 2 1 1 1 5 2 0 3 2 1 1 2 2 3 0 3 0 1 3 1 2 3 0 2 1 2 2 1 2 3 0 1 2 3 1 1 3 1 0 1 1 3 0 2 1 2 2 0 2 1 1



Overwriting Sample_Output.txt


# Solution

In [3]:
import collections
import itertools

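def parseFastaFile(fasta_file_path):
    "Parse a FASTA file into an OrderedDict mapping each record header (without '>') to its sequence."
    
    fasta_records = collections.OrderedDict()
    
    # The with-block closes the file even if parsing raises.
    with open(fasta_file_path,'r') as fasta_file:
        for line in fasta_file:
            if line.startswith(">"):
                # Header line: start a new record, keyed by the text after '>'.
                fasta_records[line[1:].rstrip()] = ""
            elif fasta_records:
                # Sequence line: append to the most recently added record.
                # Lines before the first header are ignored.
                fasta_records[next(reversed(fasta_records))] += line.rstrip()
    
    return fasta_records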
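
The parsing logic can also be tried without touching the file system. The sketch below assumes a hypothetical helper, `parse_fasta_lines`, that accepts any iterable of lines; it is not part of the notebook's code, and the record name `Rosalind_demo` is made up.

```python
import collections
import io

def parse_fasta_lines(lines):
    # Hypothetical helper: map each FASTA header (without '>') to its joined sequence.
    records = collections.OrderedDict()
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: start a new record.
            current = line[1:]
            records[current] = ""
        elif current is not None:
            # Sequence line: append to the record opened by the last header.
            records[current] += line
    return records

example = io.StringIO(">Rosalind_demo\nACGT\nTTGA\n\n")
records = parse_fasta_lines(example)
print(records)
```

Because the helper takes an iterable, an open file handle can be passed to it just as well as a `StringIO` object.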


In [4]:
def getKmerComposition(sequence, n = 4, symbols = "ACGT"):
    "Return the n-mer composition of sequence: the count of every possible n-mer over symbols, in lexicographic order."
    
    # All possible n-mers over the alphabet, in lexicographic order.
    kmers = sorted("".join(kmer) for kmer in itertools.product(symbols, repeat = n))
    
    # For each offset 0..n-1, cut the sequence into consecutive non-overlapping n-mers.
    # Together the n offsets cover every overlapping n-mer exactly once.
    kmer_lists = [["".join(kmer) for kmer in zip(*[iter(sequence[start:])]*n)] for start in range(0,n)]
    kmers_in_sequence = []
    for kmer_list in kmer_lists:
        kmers_in_sequence += kmer_list
        
    # Count how often each possible n-mer occurs in the sequence.
    kmer_composition = [kmers_in_sequence.count(kmer) for kmer in kmers]
    
    return kmer_composition
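
An alternative sketch of the same computation slides a window of length n over the sequence and tallies the windows with `collections.Counter`, which avoids rescanning the list of k-mers once per possible k-mer. The name `kmer_composition_counter` is hypothetical; it is offered as a variant, not as the method used in this notebook.

```python
import collections
import itertools

def kmer_composition_counter(sequence, n=4, symbols="ACGT"):
    # Tally every overlapping window of length n in a single pass.
    counts = collections.Counter(
        sequence[i:i + n] for i in range(len(sequence) - n + 1)
    )
    # Report counts in lexicographic order of all possible n-mers (0 if absent).
    kmers = ("".join(p) for p in itertools.product(sorted(symbols), repeat=n))
    return [counts[kmer] for kmer in kmers]

print(kmer_composition_counter("ACGTACGA", n=2))
```

A `Counter` returns 0 for missing keys, so k-mers that never occur need no special handling.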


In [5]:
def getKmerCompositionFromFileToFile(input_file_path, output_file_path, n = 4, symbols = "ACGT"):
    "Wraps getKmerComposition to read from input_file_path and write to output_file_path."
    
    fasta_records = parseFastaFile(input_file_path)
    # Only the first record in the FASTA file is used.
    sequence = list(fasta_records.values())[0]
    kmer_composition = getKmerComposition(sequence, n, symbols)
    
    # Space-separated counts on a single line, as in the sample output.
    kmer_composition_as_string = " ".join(map(str,kmer_composition))
    
    # The with-block closes the file even if writing raises.
    with open(output_file_path, 'w') as output_file:
        output_file.write("%s\n" % kmer_composition_as_string)


# Test Solution

In [6]:
getKmerCompositionFromFileToFile("Sample_Dataset.txt", "Test_Output.txt")

In [7]:
%%bash
echo Sample_Output.txt
md5sum Sample_Output.txt
cat Sample_Output.txt

Sample_Output.txt
0ff5cb601971a3442a3a93793bf5872b  Sample_Output.txt
4 1 4 3 0 1 1 5 1 3 1 2 2 1 2 0 1 1 3 1 2 1 3 1 1 1 1 2 2 5 1 3 0 2 2 1 1 1 1 3 1 0 0 1 5 5 1 5 0 2 0 2 1 2 1 1 1 2 0 1 0 0 1 1 3 2 1 0 3 2 3 0 0 2 0 8 0 0 1 0 2 1 3 0 0 0 1 4 3 2 1 1 3 1 2 1 3 1 2 1 2 1 1 1 2 3 2 1 1 0 1 1 3 2 1 2 6 2 1 1 1 2 3 3 3 2 3 0 3 2 1 1 0 0 1 4 3 0 1 5 0 2 0 1 2 1 3 0 1 2 2 1 1 0 3 0 0 4 5 0 3 0 2 1 1 3 0 3 2 2 1 1 0 2 1 0 2 2 1 2 0 2 2 5 2 2 1 1 2 1 2 2 2 2 1 1 3 4 0 2 1 1 0 1 2 2 1 1 1 5 2 0 3 2 1 1 2 2 3 0 3 0 1 3 1 2 3 0 2 1 2 2 1 2 3 0 1 2 3 1 1 3 1 0 1 1 3 0 2 1 2 2 0 2 1 1


In [8]:
%%bash
echo Test_Output.txt
md5sum Test_Output.txt
cat Test_Output.txt

Test_Output.txt
0ff5cb601971a3442a3a93793bf5872b  Test_Output.txt
4 1 4 3 0 1 1 5 1 3 1 2 2 1 2 0 1 1 3 1 2 1 3 1 1 1 1 2 2 5 1 3 0 2 2 1 1 1 1 3 1 0 0 1 5 5 1 5 0 2 0 2 1 2 1 1 1 2 0 1 0 0 1 1 3 2 1 0 3 2 3 0 0 2 0 8 0 0 1 0 2 1 3 0 0 0 1 4 3 2 1 1 3 1 2 1 3 1 2 1 2 1 1 1 2 3 2 1 1 0 1 1 3 2 1 2 6 2 1 1 1 2 3 3 3 2 3 0 3 2 1 1 0 0 1 4 3 0 1 5 0 2 0 1 2 1 3 0 1 2 2 1 1 0 3 0 0 4 5 0 3 0 2 1 1 3 0 3 2 2 1 1 0 2 1 0 2 2 1 2 0 2 2 5 2 2 1 1 2 1 2 2 2 2 1 1 3 4 0 2 1 1 0 1 2 2 1 1 1 5 2 0 3 2 1 1 2 2 3 0 3 0 1 3 1 2 3 0 2 1 2 2 1 2 3 0 1 2 3 1 1 3 1 0 1 1 3 0 2 1 2 2 0 2 1 1


In [9]:
%%bash
if [ "$(md5sum Sample_Output.txt|cut -f1 -d' ')" = "$(md5sum Test_Output.txt|cut -f1 -d' ')" ]
then
    echo Sample output matches test output.
else
    echo Sample output does not match test output.
fi

Sample output matches test output.
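
The same comparison can be sketched in Python with the standard `hashlib` module, which may be convenient where `md5sum` is not available. The helper names `md5_of_file` and `outputs_match` are hypothetical.

```python
import hashlib

def md5_of_file(path):
    # Hash the raw bytes of the file, as md5sum does.
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()

def outputs_match(path_a, path_b):
    # Two files match when their MD5 digests are identical.
    return md5_of_file(path_a) == md5_of_file(path_b)
```

With the files from the cells above, the check would be `outputs_match("Sample_Output.txt", "Test_Output.txt")`.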


# Downloaded Dataset

In [10]:
%%bash
cp ~/Downloads/rosalind_kmer.txt ./
cat rosalind_kmer.txt

>Rosalind_3178
TCCCCTGTAACCAGCGACTTGGAACCGCAAAACGTTCGTTTTAATCAATTCTAACTCATG
CTAACGTACCGGCTCTTTCGTGTGCACGCACAAACAACTTTGAAGCAATGCCCCGTGTCG
GCAGTGCGCGTCGGATGTGGTTTGACTACCGGTCTAGGAGTACCTGATTTGTGCTGGGGG
TACCGAGATATTACTGGCGCCATATTAAGGCAGGAGCAGAGTTCTGGGTCTTGACAGCAT
CAACATTGCTTTAGTGTCGTCCTATAAAAATGGGAGAACTAACTCCGGCGCATTGTATGC
AGGGGGGTTTACAACAGTGATCTGTGCAATTATGCATGGAAGGTCAGTCCGCCTTTTGTC
CCATTACATTGATCTGGGAGCATGTCCCAGCGCAGGTGTGGTCATACACGGAGCCGTGCA
GATCTCTATGAGGGCGATTCTCGGCATAGTATGAGTGGCCAAGGGGCTGGAAGATAGGCT
CCGGATGAAAAGCGTCCGGTAGGTGGGGGGGAATCCCCAAATCGCTTAGCCACTGGACGC
TCTCGGGGTGTCGATGAGCCACTGCTGAAGTCCTAAGTTACACACTATCTCGGGGCAGTT
CCTACCGGCGCCCAAGTAGCCCCAATGGGTATTTTATCACTTATCACAAGCATAGTCCCA
TGGTAGGAGCAATGAAATAGTTGATCTCTGTCTTGCGGTCTACACACTAACAAATAGGGC
AGCCATAGTTTAATAAGACAGATATTTCATGTAAGAGAGCCACCCTATTGCCCTTCGGAG
TTCCTTCCACTTCATTTGCAGTACCTACGCGGCGACCACGGGACCTGGAGTGAGGCTGCG
CCACGTACTATTGGCGCTTTTCGGTCTGAGCGACAGTTAAACTAAAATCCGATAGCCGCC
CAGACCTTCACCACCAATGGCCGCTAACATACTCACCCACTCCGCCAGAGCATGGGATGT
TGCTCAAGT

# Solution to Downloaded Dataset

In [11]:
getKmerCompositionFromFileToFile("rosalind_kmer.txt", "Solution_Output.txt")

In [12]:
%%bash
cat Solution_Output.txt

369 380 350 360 376 341 368 347 357 352 353 340 344 370 342 365 363 370 346 358 363 362 340 358 337 365 378 373 365 350 369 354 342 368 353 347 324 368 383 342 337 335 366 344 353 321 344 353 402 349 344 317 339 379 354 343 325 312 328 380 342 395 353 352 376 337 371 366 342 343 349 391 361 342 331 338 384 352 335 353 393 343 343 366 351 320 337 395 383 317 351 356 355 343 376 383 346 373 342 362 330 369 348 325 364 381 371 330 356 404 363 391 370 351 318 373 371 344 373 357 356 364 359 353 361 356 352 352 331 360 335 311 364 372 367 376 350 343 353 348 344 355 333 376 326 354 336 342 361 362 362 358 349 341 343 376 337 367 336 324 334 373 336 350 344 376 369 336 351 370 336 375 343 344 323 365 325 337 380 351 354 343 385 370 356 342 344 332 381 389 356 351 383 355 346 384 355 367 369 324 342 380 345 345 340 338 335 348 368 358 347 358 370 359 368 346 354 349 374 410 355 385 351 360 315 365 363 349 360 330 309 361 341 339 359 326 341 383 344 368 371 378 370 320 367 376 375 381 355 342 
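
Two quick sanity checks apply to any k-mer composition: it has 4^k entries, and, when the sequence contains only A, C, G and T, its entries sum to the number of windows, len(s) - k + 1. The sketch below uses a hypothetical helper, `check_composition`, on a toy 2-mer composition; it is an illustrative check, not part of the original solution.

```python
def check_composition(composition, sequence_length, k=4, alphabet_size=4):
    # Hypothetical helper: verify the two invariants of a k-mer composition.
    # 1) One entry per possible k-mer.
    has_all_kmers = len(composition) == alphabet_size ** k
    # 2) Every window of length k is counted exactly once.
    covers_all_windows = sum(composition) == max(sequence_length - k + 1, 0)
    return has_all_kmers and covers_all_windows

# Toy example: the 2-mer composition of "ACGTACGA" (8 bases, 7 windows).
toy = [0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 1, 1, 0, 0, 0]
print(check_composition(toy, sequence_length=8, k=2))
```

For a real dataset, `sequence_length` would be the length of the parsed sequence and `composition` the list returned by `getKmerComposition`.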