# Example 06: Parallel Processing in Spark
## Computing the Value of Pi: Serial and Parallel

This example demonstrates parallel processing on the LASID cluster by estimating the value of Pi with the Monte Carlo method. It is not the best way to compute Pi, but it demands a lot of processing, which makes it a convenient workload.
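
As a brief sketch of why the estimate works, assume each point $(x, y)$ is drawn uniformly from the square $[-1, 1] \times [-1, 1]$, as the sampling function in this notebook does. The symbols $n$ (number of samples) and $\text{count}$ (number of points inside the unit circle) match the variables used in the cells that follow.

$$
P\left(x^2 + y^2 \le 1\right) = \frac{\text{area of unit circle}}{\text{area of square}} = \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{\text{count}}{n}
$$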

The program sets the number of partitions (`partitions`), spreads the samples across tasks with the `map` function, and totals the values they return with the `reduce` function.
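
The same map and reduce pattern can be sketched in plain Python, without Spark, to show how the per-sample results are totaled. This is only an illustration: the names `inside_unit_circle` and `estimate_pi`, the fixed seed, and the small sample size are assumptions made for this sketch and are not part of the notebook.

```python
from functools import reduce
from operator import add
from random import Random


def inside_unit_circle(rng):
    """Return 1 if a random point in [-1, 1) x [-1, 1) lies in the unit circle, else 0."""
    x = rng.random() * 2 - 1
    y = rng.random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0


def estimate_pi(n, seed=0):
    """Estimate Pi from n samples using the built-in map and functools.reduce."""
    rng = Random(seed)  # hypothetical fixed seed, used only to make the sketch repeatable
    hits = map(lambda _: inside_unit_circle(rng), range(n))  # map: one 0/1 result per sample
    count = reduce(add, hits, 0)  # reduce: total the results
    return 4.0 * count / n


pi_estimate = estimate_pi(100_000)
print("Pi is roughly %f" % pi_estimate)
```

In the Spark version, `parallelize` splits the range into partitions so that the `map` step can run on several executors at once, while `reduce` combines the partial totals.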

The first stage runs the computation on a single host, the second runs it in parallel on several hosts using JVM virtualization, and the third runs it in parallel on several hosts using Docker virtualization.

This example measures only processing (arithmetic operations), not file reading.

In [1]:
# Start Spark environment
import findspark
findspark.init()

In [2]:
# Load Python modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from random import random
from operator import add
import time

In [3]:
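# Pi calculation using the Monte Carlo method
def f(_):
    # Draw a random point in the square [-1, 1) x [-1, 1)
    x = random() * 2 - 1
    y = random() * 2 - 1
    # Return 1 if the point falls inside the unit circle, else 0
    return 1 if x ** 2 + y ** 2 <= 1 else 0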

## Spark Pi Serial

In [4]:
# Starting Pi Serial (single host; local[*] uses all local cores)
start_time = time.time()

spark = SparkSession.builder\
         .master("local[*]")\
         .appName("JupyterPiSerial")\
         .getOrCreate()

In [5]:
partitions = 2000
n = 100000 * partitions

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)

print("### Pi is roughly %f ###" % (4.0 * count / n))
print("--- Serial Execution time: %s seconds ---" % (time.time() - start_time))

### Pi is roughly 3.140332 ###
--- Serial Execution time: 27.552704572677612 seconds ---


In [6]:
# Stop Spark session
spark.stop()

## Spark Pi Parallel JVM

In [7]:
# Starting Pi Parallel (Spark Cluster with JVM)
start_time = time.time()

spark = SparkSession.builder\
        .appName("JupyterPiParallel_JVM")\
        .master("mesos://zk://10.129.64.20:2181,10.129.64.10:2181,10.129.64.30:2181/mesos") \
        .getOrCreate()

In [8]:
partitions = 2000
n = 100000 * partitions

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)

print("### Pi is roughly %f ###" % (4.0 * count / n))
print("--- Parallel JVM Execution time: %s seconds ---" % (time.time() - start_time))

### Pi is roughly 3.142929 ###
--- Parallel JVM Execution time: 10.027198553085327 seconds ---


In [9]:
# Stop Spark session
spark.stop()

## Spark Pi Parallel Docker

In [10]:
# Starting Pi Parallel (Spark Cluster with Docker Image)
start_time = time.time()

spark = SparkSession.builder\
        .appName("JupyterPiParallel_Docker")\
        .master("mesos://zk://10.129.64.20:2181,10.129.64.10:2181,10.129.64.30:2181/mesos") \
        .config("spark.mesos.executor.docker.image","lasid/spark-worker:3.0.1_bionic") \
        .config("spark.mesos.containerizer","docker") \
        .getOrCreate()

In [11]:
partitions = 2000
n = 100000 * partitions

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)

print("### Pi is roughly %f ###" % (4.0 * count / n))
print("--- Parallel Docker Execution time: %s seconds ---" % (time.time() - start_time))

### Pi is roughly 3.141541 ###
--- Parallel Docker Execution time: 16.91118288040161 seconds ---


In [12]:
# Stop Spark session
spark.stop()
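
The three timings printed above can be compared as speedups relative to the single-host run. The sketch below simply divides the recorded values; the helper name `speedup` is hypothetical. Note that each timer starts before the Spark session is created, so the figures include session startup, and each configuration was run only once, so the ratios are indicative rather than a rigorous benchmark.

```python
# Execution times (in seconds) copied from the outputs recorded in this notebook
serial_s = 27.552704572677612
jvm_s = 10.027198553085327
docker_s = 16.91118288040161


def speedup(baseline_s, parallel_s):
    """Return how many times faster the parallel run was than the baseline."""
    return baseline_s / parallel_s


print("Parallel JVM speedup: %.2fx" % speedup(serial_s, jvm_s))
print("Parallel Docker speedup: %.2fx" % speedup(serial_s, docker_s))
```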