# DNA PROBLEM
The high-level solution is presented in three simple steps:

1) Read the FASTA input data and create an RDD[String], where each RDD element is one line of a FASTA record (either a comment line that starts with ">" or a line of actual DNA sequence)

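2) Define a mapper function: for every DNA letter in a FASTA record, emit a pair of (dna_letter, 1), where dna_letter is in {A, T, C, G} and 1 is a frequency (similar to a word count solution). The mapper used here lowercases each letter, so any other symbol in the data (such as N) gets its own key, and it emits ("z", 1) for every comment line so that the number of sequences is counted as well.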

3) Sum up frequencies for all DNA letters (this is a reduction step). For each unique dna_letter, group and add all frequencies.
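The three steps above can be checked without Spark. The following is a minimal plain-Python sketch of the same logic; the `lines` sample and the `to_pairs` name are made up for illustration and are not part of the notebook.

```python
from collections import defaultdict

# Hypothetical sample lines, not the notebook's input file
lines = [">seq1", "ACGT", "aacN"]

def to_pairs(line):
    """Step 2: emit ("z", 1) for a comment line, else (letter, 1) per letter."""
    if line.startswith(">"):
        return [("z", 1)]
    return [(c, 1) for c in line.lower()]

# Step 2 applied to every line and flattened into one list (like flatMap)
pairs = [pair for line in lines for pair in to_pairs(line)]

# Step 3: sum the frequencies per key (like reduceByKey)
frequencies = defaultdict(int)
for key, value in pairs:
    frequencies[key] += value

print(dict(frequencies))
```

The Spark version below follows the same shape, with `flatMap()` doing the flattening and `reduceByKey()` doing the summation across partitions.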

# Step 1: Create an RDD of String from Input

In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("dna-base-rdd") \
    .master("local") \
    .getOrCreate()


input_path = "C:/Users/malam/development/data/spark/sample.fasta"
records_rdd = spark.read \
    .text(input_path) \
    .rdd.map(lambda r: r[0])

In [5]:
records_rdd.collect()

['>seq1',
 'cGTAaccaataaaaaaacaagcttaacctaattc',
 '>seq2',
 'agcttagTTTGGatctggccgggg',
 '>seq3',
 'gcggatttactcCCCCCAAAAANNaggggagagcccagataaatggagtctgtgcgtccaca',
 'gaattcgcacca',
 'AATAAAACCTCACCCAT',
 'agagcccagaatttactcCCC',
 '>seq4',
 'gcggatttactcaggggagagcccagGGataaatggagtctgtgcgtccaca',
 'gaattcgcacca']

# Step 2: Define a Mapper Function

In [6]:
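def process_FASTA_record(fasta_record):
    """Return a list of (key, 1) pairs for one line of FASTA input."""
    key_value_list = []
    if fasta_record.startswith(">"):
        # Comment line: "z" counts the number of FASTA sequences
        key_value_list.append(("z", 1))
    else:
        # Sequence line: lowercase so that 'A' and 'a' share one key
        chars = fasta_record.lower()
        for c in chars:
            key_value_list.append((c, 1))
    # Debug output only; it runs in the worker process for every record
    print(key_value_list)
    return key_value_list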

In [7]:
pairs_rdd = records_rdd.flatMap(process_FASTA_record)

# Step 3: Find Frequencies of DNA Letters

In [8]:
frequencies_rdd = pairs_rdd.reduceByKey(lambda x, y: x+y)

In [9]:
frequencies_rdd.collect()

[('z', 4), ('c', 61), ('g', 53), ('t', 45), ('a', 73), ('n', 2)]

In [10]:
frequencies_rdd.collectAsMap()

{'z': 4, 'c': 61, 'g': 53, 't': 45, 'a': 73, 'n': 2}

In [11]:
grouped_rdd = pairs_rdd.groupByKey()  # alternative to reduceByKey(): group first, then sum
frequencies_rdd = grouped_rdd.mapValues(sum)
frequencies_rdd.collect()

[('z', 4), ('c', 61), ('g', 53), ('t', 45), ('a', 73), ('n', 2)]

Pros:

The provided solution works and is simple: it uses a minimal amount of code to get the job done, relying on Spark's simple flatMap() and reduceByKey() transformations.

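There is no scalability issue in the reduction step, since we use reduceByKey() to reduce all (key, value) pairs. It automatically performs the combine() optimization (local aggregation) on every worker node, so values for the same key are summed within each partition before they are shuffled.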
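To make the effect of local aggregation concrete, here is a small plain-Python model of it. This is only an illustration, not Spark's internal implementation; the `partitions` data and the `combine_locally` name are assumptions.

```python
from collections import Counter

# Hypothetical (key, 1) pairs already split across two partitions
partitions = [
    [("a", 1), ("c", 1), ("a", 1), ("g", 1)],
    [("a", 1), ("t", 1), ("t", 1)],
]

def combine_locally(partition):
    """Sum values per key inside one partition (local aggregation)."""
    local = Counter()
    for key, value in partition:
        local[key] += value
    return list(local.items())

# Without local combining, every pair would have to be shuffled
pairs_without_combine = sum(len(p) for p in partitions)

# With local combining, each partition sends one pair per distinct key
combined = [combine_locally(p) for p in partitions]
pairs_with_combine = sum(len(p) for p in combined)

# Final merge of the partial sums, as the reduce side would do
totals = Counter()
for partition in combined:
    for key, value in partition:
        totals[key] += value

print(pairs_without_combine, pairs_with_combine, dict(totals))
```

In this model the shuffled volume is bounded by the number of distinct keys per partition rather than by the number of letters. The groupByKey() variant in cell In [11] does not combine locally, so it corresponds to the "without" case.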

Cons:

This solution emits too many (key, value) pairs, where the key is a DNA letter and the value is a frequency of 1. Emitting that many pairs can cause memory problems. If you get errors for this reason, you can adjust the RDD's StorageLevel: by default a persisted RDD uses MEMORY_ONLY, but you can choose a level that combines memory and disk, such as MEMORY_AND_DISK, for that RDD.

Performance is not optimal: creating one (key, value) pair per DNA letter costs CPU and memory, and every pair that is not combined locally adds to network and shuffle time. Because the mapper defined in Step 2 emits so many single-frequency tuples, this overhead can become a bottleneck when scaling the solution.
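One possible way to reduce the number of emitted pairs is to aggregate inside the mapper, so that each record yields one pair per distinct letter instead of one pair per letter. The sketch below is a hypothetical variant of the Step 2 mapper (the name `process_FASTA_record_aggregated` is not from the notebook), shown in plain Python so that it can be tried without a Spark session.

```python
from collections import Counter

def process_FASTA_record_aggregated(fasta_record):
    """Hypothetical mapper variant: one (letter, count) pair per distinct letter."""
    if fasta_record.startswith(">"):
        # Comment line: "z" still counts the number of FASTA sequences
        return [("z", 1)]
    # Counter tallies the letters of this record before anything is emitted
    return list(Counter(fasta_record.lower()).items())

# The second sequence line from the sample data: 24 letters
pairs = process_FASTA_record_aggregated("agcttagTTTGGatctggccgggg")
print(len(pairs), sum(count for _, count in pairs))

# Assumed usage in the notebook (requires the Spark session from Step 1):
# pairs_rdd = records_rdd.flatMap(process_FASTA_record_aggregated)
# frequencies_rdd = pairs_rdd.reduceByKey(lambda x, y: x + y)
```

Because the emitted values are already partial counts, the reduceByKey() call from Step 3 should work unchanged; only the number of tuples created per record drops.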