# map vs. flatMap

- The `map` transformation applies a function to each element of an RDD (or each row of a Dataset) and returns a new dataset with exactly one output element per input element.
1 => 1
- The `flatMap` transformation applies a function that returns a collection for each element, then flattens those collections into a single new dataset.
The result can therefore hold more elements than the input, or fewer if the function returns an empty collection for some elements. It is also referred to as a one-to-many transformation.
1 => Many
- A common use case for `flatMap()` is flattening a column that contains arrays, lists, or any other nested collection.
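
The difference in output shape can be sketched in plain Python, without a Spark session. The helpers `local_map` and `local_flat_map` and the function `split_names` below are hypothetical stand-ins written for this illustration; they are not Spark APIs, and they only mimic the semantics on a local list.

```python
from itertools import chain

# Hypothetical local stand-ins for RDD.map and RDD.flatMap (not Spark APIs).
def local_map(func, items):
    # One output element per input element.
    return [func(item) for item in items]

def local_flat_map(func, items):
    # func returns an iterable per element; the iterables are concatenated.
    return list(chain.from_iterable(func(item) for item in items))

rows = ["joe,sarah,tom", "hyundai", ""]

def split_names(row):
    # Return an empty list for an empty row so it contributes no output.
    return row.split(",") if row else []

mapped = local_map(split_names, rows)
flattened = local_flat_map(split_names, rows)

print(mapped)     # [['joe', 'sarah', 'tom'], ['hyundai'], []]
print(flattened)  # ['joe', 'sarah', 'tom', 'hyundai']
```

The mapped result always has as many elements as the input (three here), while the flattened result has four: the empty row contributes nothing and the first row contributes three.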

In [8]:
import pyspark

sc = pyspark.SparkContext.getOrCreate()
rdd = sc.parallelize([("name", "joe,sarah,tom"), ("car", "hyundai"), ("animal", "cat,dog")])
result = rdd.map(lambda x: x[1].split(","))
print("result of map : ", result.collect())
# [['joe', 'sarah', 'tom'], ['hyundai'], ['cat', 'dog']]

rdd = sc.parallelize([("name", "joe,sarah,tom"), ("car", "hyundai"), ("animal", "cat,dog")])
result = rdd.flatMap(lambda x: x[1].split(","))
print("result of flatMap : ", result.collect())
# ['joe', 'sarah', 'tom', 'hyundai', 'cat', 'dog']

result of map :  [['joe', 'sarah', 'tom'], ['hyundai'], ['cat', 'dog']]
result of flatMap :  ['joe', 'sarah', 'tom', 'hyundai', 'cat', 'dog']


In [14]:
test_file = "file:///home/jovyan/work/sample/lorem_ipsum.txt"
lines = sc.textFile(test_file)
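# flatMap: each line yields many words, flattened into one RDD of words
words = lines.flatMap(lambda x: x.split())
# countByValue is an action: it returns a dict-like {word: count} to the driver
word_count = words.countByValue()
# print(word_count)
# for word, count in word_count.items():
#     print(f"{word}: {count}")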
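
As a rough local analogue of the cell above, the same split-then-count steps can be sketched with the standard library. The sample `lines` list is made up for illustration and stands in for the text file; `collections.Counter` plays the role of `countByValue()` here, under the assumption that the data fits in memory on one machine.

```python
from collections import Counter

# Sample lines standing in for the text file (made up for illustration).
lines = ["lorem ipsum dolor", "ipsum dolor", "dolor"]

# Analogue of lines.flatMap(lambda x: x.split()): one flat list of words.
words = [word for line in lines for word in line.split()]

# Analogue of words.countByValue(): a mapping from word to occurrences.
word_count = Counter(words)

for word, count in sorted(word_count.items()):
    print(f"{word}: {count}")
```

Like `countByValue()`, this produces the whole mapping in local memory, which is only reasonable when the number of distinct values is small.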

In [15]:
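# How about sorting by count? reduceByKey sums the 1s emitted for each word,
# then each pair is swapped to (count, word) so that sortByKey orders by count.
word_count = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
sorted_word_count = word_count.map(lambda x: (x[1], x[0])).sortByKey()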
for count, word in sorted_word_count.collect():
    print(f"{count}: {word}")

1: eiusmod
1: incididunt
1: dolore
1: aliqua.
1: Vehicula
1: nascetur.
1: montes.
1: rutrum.
1: Ridiculus
1: Sociis
1: nascetur
1: ridiculus.
1: Vivamus
1: Hac
1: amet,
1: elit,
1: do
1: labore
1: Ligula
1: Nascetur
1: ridiculus
1: Penatibus
1: cum.
1: curabitur.
1: Magnis
1: Rutrum
1: Eros
1: dapibus.
1: Praesent
2: ante.
2: vehicula.
2: potenti.
2: Phasellus
2: magnis.
2: praesent.
2: senectus.
2: suscipit.
2: montes
2: magna.
2: luctus.
2: Accumsan
2: Senectus
2: sociis.
2: penatibus.
2: Platea
2: mus
2: Lacinia
2: Mus
2: parturient.
2: Hendrerit
2: justo.
2: Ante
2: Eleifend
2: hac.
2: Natoque
2: Libero
3: Habitasse
3: lacinia.
3: interdum.
3: fermentum.
3: habitant.
3: Maecenas
3: Aenean
3: Suscipit
3: eros.
3: Dapibus
3: habitasse.
3: platea.
3: quisque.
3: dolor.
3: sollicitudin.
3: fames.
3: Dictumst
3: Mollis
3: Etiam
3: Potenti
3: porta.
3: parturient
3: phasellus.
3: Sodales
4: Metus
4: condimentum.
4: mollis.
4: Sapien
4: felis.
4: sapien.
4: netus.
4: Porta
4: tempus.
4: F