# Example 06: Computing the Value of Pi, Serial & Parallel

This example demonstrates parallel processing on the LASID cluster by estimating the value of Pi with the Monte Carlo method. This is not the best way to compute Pi, but it requires a lot of processing.
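
As a brief sketch of why the method works: the notebook's function `f` draws a point $(x, y)$ uniformly from the square $[-1, 1]^2$ and returns 1 when the point falls inside the unit circle. The symbols $\hat{\pi}$, $p$, and $\sigma$ below are notation introduced here for illustration; `count` and `n` are the notebook's own variables.

$$
P(x^2 + y^2 \le 1) = \frac{\text{area of unit disk}}{\text{area of square}} = \frac{\pi}{4}
\quad\Longrightarrow\quad
\hat{\pi} = 4 \cdot \frac{\text{count}}{n}
$$

Assuming the samples are independent, the standard error of the estimate shrinks only with the square root of the sample size, which is why the method needs so much processing:

$$
\sigma_{\hat{\pi}} = 4 \sqrt{\frac{p\,(1 - p)}{n}}, \qquad p = \frac{\pi}{4}
$$

For $n = 2 \times 10^{7}$ (200 partitions of 100000 samples each) this gives $\sigma_{\hat{\pi}} \approx 3.7 \times 10^{-4}$.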

The program sets the size of the counter through the number of partitions (*partitions*) and creates many tasks in a Map step; each task returns a value, and the Reduce function adds these values into the total.
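
The Map and Reduce steps can be sketched without Spark, using Python's built-in `map` and `functools.reduce`. This is only an illustration of the pattern, not the notebook's cluster code: the helper names `make_sampler` and `estimate_pi`, the fixed seed, and the sample size are assumptions made for this sketch.

```python
from functools import reduce
from operator import add
from random import Random


def make_sampler(seed):
    """Return a function that draws one point and reports whether it is inside the unit circle."""
    rng = Random(seed)  # private generator so the sketch is reproducible (assumption)

    def inside_unit_circle(_):
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    return inside_unit_circle


def estimate_pi(n, seed=0):
    """Map: one 0/1 result per sample. Reduce: add the results into a single count."""
    sample = make_sampler(seed)
    count = reduce(add, map(sample, range(n)))
    return 4.0 * count / n


print("Pi is roughly %f" % estimate_pi(100_000))
```

In Spark the same two steps appear as `.map(f).reduce(add)`, with the difference that the input range is split into partitions that can be processed by different executors.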

The first stage runs the computation on a single host, and the second stage runs it across several hosts.

This example measures only computation (arithmetic operations), not file reading.

In [1]:
# Start Spark environment
import findspark
findspark.init()

In [2]:
# Load Python modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from random import random
from operator import add
import time

In [3]:
# Pi calculation using Monte Carlo Method
def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

## Spark Pi Serial

In [4]:
# Starting Pi Serial (one processor)
start_time = time.time()

spark = SparkSession.builder\
         .master("local[1]")\
         .appName("JupyterPiSerial")\
         .getOrCreate()

In [5]:
partitions = 200
n = 100000 * partitions

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)

print("Pi is roughly %f" % (4.0 * count / n))
print("--- Serial Execution time: %s seconds ---" % (time.time() - start_time))

Pi is roughly 3.149600
--- Serial Execution time: 41.0227324962616 seconds ---


In [6]:
# Stop Spark session
spark.stop()

## Spark Pi Parallel

In [7]:
# Starting Pi Parallel (Spark Cluster with Docker Image)
start_time = time.time()

# Settings are passed to the builder so they are in place before the SparkContext starts
spark = (
    SparkSession.builder
    .appName("JupyterPiParallel")
    .master("mesos://zk://10.129.64.20:2181,10.129.64.10:2181,10.129.64.30:2181/mesos")
    .config("spark.submit.deployMode", "client")
    .config("spark.driver.supervise", "true")
    .config("spark.executor.memory", "4g")
    .config("spark.driver.host", "10.129.64.20")
    .config("spark.mesos.containerizer", "docker")
    .config("spark.mesos.executor.docker.image", "lasid/spark-worker:latest")
    .config("spark.mesos.executor.docker.forcePullImage", "true")
    .getOrCreate()
)

In [8]:
partitions = 200
n = 100000 * partitions

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)

print("Pi is roughly %f" % (4.0 * count / n))
print("--- Parallel Execution time: %s seconds ---" % (time.time() - start_time))

Pi is roughly 3.138272
--- Parallel Execution time: 5.433460235595703 seconds ---
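
To compare the two runs, the speedup can be computed as the ratio of the reported times. The sketch below simply copies the two timings printed above; the function name `speedup` is introduced here for illustration. Note that each timer starts before the `SparkSession` is built, so both figures include session startup, and each comes from a single run.

```python
# Wall-clock times copied from the two outputs in this notebook
serial_seconds = 41.0227324962616
parallel_seconds = 5.433460235595703


def speedup(serial, parallel):
    """Ratio of serial to parallel wall-clock time."""
    return serial / parallel


print("Speedup: %.2fx" % speedup(serial_seconds, parallel_seconds))
# prints: Speedup: 7.55x
```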


In [9]:
# Stop Spark session
spark.stop()