# DNA PROBLEM
The high-level solution is presented in three simple steps:

1) Read FASTA input data and create an RDD[String], where each RDD element is a FASTA record. This step is the same as Version 1.

2) Mapper step: for every FASTA record, create a HashMap[Key, Value] (a dictionary or hash table) where Key is the DNA letter and Value is the aggregated frequency of that DNA letter. Then flatten the hash map into a list of (Key, Value) pairs (using Spark's flatMap()). This step is different from Version 1: because frequencies are aggregated per record before anything is emitted, it enables us to emit fewer (key, value) pairs.

3)  Find frequencies for all DNA letters by aggregating frequencies of the same DNA letter (this is a reduction step). For each dna_letter, group and add all frequencies. This step is the same as Version 1.
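The three steps can be traced without a cluster. The following is a minimal pure-Python sketch, not the Spark code itself: the `records` list is a made-up stand-in for the FASTA input, `map_record` is a hypothetical name for the mapper, and the list comprehension and the summing loop only imitate what flatMap() and reduceByKey() do in Spark.

```python
from collections import Counter, defaultdict

# Step 1 stand-in: hypothetical FASTA lines held in memory (not the real input file).
records = [">seq1", "cGTAacc", ">seq2", "agcTT"]

def map_record(record):
    """Step 2: emit one (letter, count) pair per distinct letter in a record."""
    if record.startswith(">"):
        return [("z", 1)]  # header lines are counted under the key "z"
    return list(Counter(record.lower()).items())

# Flatten the per-record lists into one list of pairs, as flatMap() would.
pairs = [pair for record in records for pair in map_record(record)]

# Step 3: sum the counts for each key, as reduceByKey() would.
frequencies = defaultdict(int)
for letter, count in pairs:
    frequencies[letter] += count
frequencies = dict(frequencies)

print(frequencies)  # {'z': 2, 'c': 4, 'g': 2, 't': 3, 'a': 3}
```

Each sequence line contributes at most one pair per distinct letter, so the number of pairs entering the reduction does not grow with the length of the sequence.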

# Step 1: Create an RDD of String from Input

In [5]:
from pyspark.sql import SparkSession

spark= SparkSession.builder \
    .appName("dna-base-rdd")    \
    .master("local")    \
    .getOrCreate()


input_path = "C:/Users/malam/development/data/spark/sample.fasta"
records_rdd =  spark.sparkContext.textFile(input_path)

In [6]:
records_rdd.collect()   

['>seq1',
 'cGTAaccaataaaaaaacaagcttaacctaattc',
 '>seq2',
 'agcttagTTTGGatctggccgggg',
 '>seq3',
 'gcggatttactcCCCCCAAAAANNaggggagagcccagataaatggagtctgtgcgtccaca',
 'gaattcgcacca',
 'AATAAAACCTCACCCAT',
 'agagcccagaatttactcCCC',
 '>seq4',
 'gcggatttactcaggggagagcccagGGataaatggagtctgtgcgtccaca',
 'gaattcgcacca']

# Step 2: Define a Mapper Function

In [11]:
from collections import defaultdict

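def process_FASTA_as_hashmap(fasta_record):
    """Return a list of (dna_letter, frequency) pairs for one FASTA record."""
    # Header lines start with ">"; count them under the special key "z".
    if fasta_record.startswith(">"):
        return [("z", 1)]

    # Count each letter once per record, ignoring case.
    hashmap = defaultdict(int)
    chars = fasta_record.lower()
    for c in chars:
        hashmap[c] += 1
    print("hashmap=", hashmap)  # debug output

    # Flatten the dictionary into a list of (key, value) pairs.
    key_value_list = list(hashmap.items())
    print("key_value_list=", key_value_list)  # debug output
    return key_value_list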

In [12]:
pairs_rdd = records_rdd.flatMap(lambda rec: process_FASTA_as_hashmap(rec))

# Step 3: Find Frequencies of DNA Letters

In [13]:
frequencies_rdd = pairs_rdd.reduceByKey(lambda x, y: x+y)
frequencies_rdd.collect()

[('z', 4), ('c', 61), ('g', 53), ('t', 45), ('a', 73), ('n', 2)]

In [15]:
frequencies_rdd.collectAsMap()

{'z': 4, 'c': 61, 'g': 53, 't': 45, 'a': 73, 'n': 2}

Pros:

The provided solution works, is simple, and is semi-efficient. It improves on Version 1 by emitting far fewer (key, value) pairs: we create a dictionary per input record and then flatten it into a list of (key, value) pairs, where the key is a DNA letter and the value is the aggregated frequency of that letter.

Network traffic is reduced because far fewer (key, value) pairs are emitted.

There is no scalability issue, since we use reduceByKey() to reduce all (key, value) pairs; it combines values within each partition before shuffling them across the network.
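The reduction in emitted pairs can be checked on the first two records of the sample data. This is a rough sketch under one assumption: Version 1 is not shown in this section, so `version1_pairs` assumes it emits one (letter, 1) pair per character. Both function names are hypothetical, and the counting is done in plain Python rather than in Spark.

```python
from collections import Counter

# The first two records from the sample output shown in Step 1.
records = [
    ">seq1",
    "cGTAaccaataaaaaaacaagcttaacctaattc",
    ">seq2",
    "agcttagTTTGGatctggccgggg",
]

def version1_pairs(record):
    # Assumed Version 1 behavior: one (letter, 1) pair per character.
    if record.startswith(">"):
        return [("z", 1)]
    return [(c, 1) for c in record.lower()]

def version2_pairs(record):
    # Version 2 behavior: one (letter, count) pair per distinct letter.
    if record.startswith(">"):
        return [("z", 1)]
    return list(Counter(record.lower()).items())

# Total number of pairs each version would hand to the reduction step.
v1_count = sum(len(version1_pairs(r)) for r in records)
v2_count = sum(len(version2_pairs(r)) for r in records)
print("Version 1 pairs:", v1_count)
print("Version 2 pairs:", v2_count)
```

Under that assumption, the Version 1 count grows with the length of each sequence, while the Version 2 count is bounded by the number of distinct letters per record.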

Cons:

For each DNA sequence, this solution emits up to 6 (key, value) pairs, where the key is a DNA letter and the value is the sum of its frequencies. This is a big improvement over Version 1, but it is still one set of pairs per input record.

Performance is not optimal, since we are still emitting about 6 (key, value) pairs per DNA string.

This solution may use more memory than necessary, because it creates a dictionary for every DNA sequence.