# Word Count Problem: Putting It All Together

So far we have built up our understanding of the MapReduce model in PySpark one operation at a time.

Now we combine those operations, `flatMap`, `map`, and `reduceByKey`, into a single word-count pipeline.

In [1]:
import pyspark
from pyspark.sql import SparkSession

ss = SparkSession.builder.master("local[4]").appName("FlatMap-ReduceByKey").getOrCreate()
sc = ss.sparkContext

lines = [
    "word count from Wikipedia the free encyclopedia",
    "the word count is the number of words in a document or passage of text Word counting may be needed when a text",
    "is required to stay within certain numbers of words This may particularly be the case in academia legal",
    "proceedings journalism and advertising Word count is commonly used by translators to determine the price for"
]

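# Create an RDD from the list of lines
rdd = sc.parallelize(lines)
# Split each line into words and flatten the results into a single RDD
rddFlatMap = rdd.flatMap(lambda line: line.split(" "))
# Pair every word with an initial count of 1
rddMap = rddFlatMap.map(lambda word: (word, 1))
# Sum the counts for each distinct word
rddReduced = rddMap.reduceByKey(lambda x, y: x + y)
rddReduced.collect()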





[('word', 2),
 ('encyclopedia', 1),
 ('of', 3),
 ('words', 2),
 ('Word', 2),
 ('counting', 1),
 ('to', 2),
 ('stay', 1),
 ('certain', 1),
 ('particularly', 1),
 ('academia', 1),
 ('proceedings', 1),
 ('and', 1),
 ('used', 1),
 ('by', 1),
 ('Wikipedia', 1),
 ('the', 5),
 ('free', 1),
 ('is', 3),
 ('in', 2),
 ('document', 1),
 ('or', 1),
 ('This', 1),
 ('legal', 1),
 ('from', 1),
 ('number', 1),
 ('text', 2),
 ('needed', 1),
 ('numbers', 1),
 ('journalism', 1),
 ('for', 1),
 ('count', 3),
 ('a', 2),
 ('passage', 1),
 ('may', 2),
 ('be', 2),
 ('when', 1),
 ('required', 1),
 ('within', 1),
 ('case', 1),
 ('advertising', 1),
 ('commonly', 1),
 ('translators', 1),
 ('determine', 1),
 ('price', 1)]
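
To see what each stage contributes, the same three steps can be sketched in plain Python without Spark. This is only an illustration of the logic, not how Spark executes it: the names `words`, `pairs`, and `counts` are chosen here for the example, and everything runs in a single process.

```python
from collections import defaultdict
from itertools import chain

lines = [
    "word count from Wikipedia the free encyclopedia",
    "the word count is the number of words in a document or passage of text Word counting may be needed when a text",
    "is required to stay within certain numbers of words This may particularly be the case in academia legal",
    "proceedings journalism and advertising Word count is commonly used by translators to determine the price for",
]

# flatMap step: split every line, then flatten the per-line lists into one list
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map step: pair every word with an initial count of 1
pairs = [(word, 1) for word in words]

# reduceByKey step: add up the counts that share the same key
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(len(counts))                                      # 45 distinct keys
print(counts["the"], counts["word"], counts["Word"])    # 5 2 2
```

The keys are case-sensitive, which is why `'word'` and `'Word'` appear as separate entries in the output above.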

## Exercise 1

Read the input from the `data.txt` file and count the number of words.

In [13]:
import pyspark
from pyspark.sql import SparkSession

ses = SparkSession.builder.master("local[2]").appName("mapreduce").getOrCreate()

print("File content:")
rdd = ses.sparkContext.textFile("data.txt")
print(rdd.collect())

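print("Applying flat map:")
# split() with no argument splits on any whitespace and drops empty tokens
flatrdd = rdd.flatMap(lambda x: x.split())
print(flatrdd.collect())

# count() returns the number of elements without collecting them to the driver
wc = flatrdd.count()
print("Word count: " + str(wc))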


File content:
['PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities, using PySpark we can run applications parallelly on the distributed cluster (multiple nodes). ', 'In other words, PySpark is a Python API for Apache Spark. Apache Spark is an analytical processing engine for large scale powerful distributed data processing and machine learning applications.', 'Spark basically written in Scala and later on due to its industry adaptation it’s API PySpark released for Python using Py4J. Py4J is a Java library that is integrated within PySpark and allows python to dynamically interface with JVM objects, hence to run PySpark you also need Java to be installed along with Python, and Apache Spark.']
Applying flat map:
['PySpark', 'is', 'a', 'Spark', 'library', 'written', 'in', 'Python', 'to', 'run', 'Python', 'applications', 'using', 'Apache', 'Spark', 'capabilities,', 'using', 'PySpark', 'we', 'can', 'run', 'applications', 'parallelly', 'o
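
One detail affects the final count: the first line printed under "File content" ends with a trailing space. The plain-Python sketch below compares `str.split(" ")` with `str.split()` on a line of that shape. The `sample` string is a shortened, hypothetical stand-in for that line, not the actual file content.

```python
# Hypothetical sample: a shortened line that ends with a trailing space
sample = "on the distributed cluster (multiple nodes). "

# Splitting on a single space keeps an empty string after the trailing space
tokens_single_space = sample.split(" ")

# Splitting on any whitespace discards leading/trailing whitespace and empty tokens
tokens_whitespace = sample.split()

print(tokens_single_space)  # ['on', 'the', 'distributed', 'cluster', '(multiple', 'nodes).', '']
print(tokens_whitespace)    # ['on', 'the', 'distributed', 'cluster', '(multiple', 'nodes).']
```

When words are split on a single space, each such empty string is counted as a word, so the total comes out one higher per trailing space.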

## Exercise 2

Load `testData.txt` into an RDD and add all the values together.

In [16]:
rdd = ses.sparkContext.textFile("testData.txt")
print(rdd.collect())
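# Split the line into individual number strings
ordd = rdd.flatMap(lambda x: x.split(" "))
print(ordd.collect())
# Convert each token to int before reducing, so the result is an int even for a single value
total = ordd.map(int).reduce(lambda x, y: x + y)

print(total)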

['12 34 56 77 89 23 12 34 56 77 89 23 12 34 56 77 89 23 12 34 56 77 89 23 12 34 56 77 89 23 12 34 56 77 89 23']
['12', '34', '56', '77', '89', '23', '12', '34', '56', '77', '89', '23', '12', '34', '56', '77', '89', '23', '12', '34', '56', '77', '89', '23', '12', '34', '56', '77', '89', '23', '12', '34', '56', '77', '89', '23']
1746
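
The result can be checked by hand: the six values 12, 34, 56, 77, 89, 23 sum to 291, and the sequence repeats six times, giving 291 × 6 = 1746. The plain-Python sketch below reproduces that check with `functools.reduce`. It also shows why a reducer of the form `lambda x, y: int(x) + int(y)` is fragile: with a single input the reducer is never called, so the value comes back as a string. This is a Python-only analogue, and it assumes `RDD.reduce` treats a single-element RDD the same way.

```python
from functools import reduce
from operator import add

# Rebuild the line shown in the output: the six values repeated six times
line = " ".join(["12 34 56 77 89 23"] * 6)
tokens = line.split(" ")

# Convert first, then reduce: the result is always an int
total = reduce(add, [int(t) for t in tokens])
print(total)  # 1746

# Converting inside the reducer: with one element the lambda never runs
fragile = reduce(lambda x, y: int(x) + int(y), ["12"])
print(repr(fragile))  # '12' (still a string)

# Converting before reducing avoids the problem
safe = reduce(add, [int(t) for t in ["12"]])
print(repr(safe))  # 12
```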
