## SparkContext Configuration

The cluster consists of __1 Mesos master node__ and __4 Mesos worker nodes__. The breakdown of the configuration is shown below:
- master node
    - 2 Cores
    - 4GB RAM
- worker node
    - 2 cores per node
    - 2.7GB RAM per node
    
Two scenarios are of interest for studying the performance of Spark:
- One executor per core:
    - instances = (5 nodes x 2 cores); cores = 1 core; memory = (2.9GB/5)
- One executor per node:
    - instances = 5 nodes; cores = 2 cores; memory = 2.9GB/1
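
The arithmetic behind the two scenarios can be sketched as a small helper. This is only an illustration: the function name `executor_layout` is hypothetical, the figures are the worker values from the cluster breakdown (4 nodes, 2 cores and 2.7GB each) rather than the ones in the scenario list, and it assumes that a node's memory is split evenly between the executors on it, with nothing reserved for the OS or for executor overhead.

```python
def executor_layout(nodes, cores_per_node, mem_gb_per_node, one_per_core):
    """Return the executor settings for one of the two scenarios.

    Assumption: a node's memory is shared evenly by the executors placed
    on it, and no memory is reserved for the OS or executor overhead.
    """
    executors_per_node = cores_per_node if one_per_core else 1
    return {
        'spark.executor.instances': nodes * executors_per_node,
        'spark.executor.cores': cores_per_node // executors_per_node,
        'memory_gb_per_executor': mem_gb_per_node / float(executors_per_node),
    }

# Worker figures from the cluster breakdown: 4 nodes, 2 cores and 2.7GB each.
per_core = executor_layout(4, 2, 2.7, one_per_core=True)
per_node = executor_layout(4, 2, 2.7, one_per_core=False)
print(per_core)
print(per_node)
```

In practice the memory value passed to __spark.executor.memory__ would need to be somewhat lower than the result, to leave room for overhead.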
    
NOTE: By default every Spark application is limited to 4 cores, a cap imposed by __SPARK_MASTER_OPTS__ as predefined in spark-env.sh. To give a Spark application more cores than this default, set __spark.cores.max__ in the SparkContext configuration before setting:
- __spark.executor.instances__
- __spark.executor.cores__
- __spark.executor.memory__
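
A quick way to see why __spark.cores.max__ matters is to compare the cores requested by the executor settings with the cap. The sketch below is a plain-Python check, not a Spark API: the helper names `requested_cores` and `fits_core_budget` are hypothetical, and the default cap of 4 is taken from the note above.

```python
def requested_cores(conf):
    """Total cores the executor settings ask for."""
    return int(conf['spark.executor.instances']) * int(conf['spark.executor.cores'])

def fits_core_budget(conf, default_cap=4):
    """True if the request fits under spark.cores.max (or the default cap)."""
    cap = int(conf.get('spark.cores.max', default_cap))
    return requested_cores(conf) <= cap

# Values used for the session created in this notebook.
conf = {
    'spark.cores.max': '8',
    'spark.executor.instances': '4',
    'spark.executor.cores': '2',
}
# Same settings without raising the cap (falls back to the assumed default of 4).
without_cap = {k: v for k, v in conf.items() if k != 'spark.cores.max'}

print(fits_core_budget(conf))
print(fits_core_budget(without_cap))
```

With the cap left at its default, 4 executors x 2 cores would ask for more cores than the application is allowed to use.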

NOTE: The default parallelism can be set with:
- .config('spark.default.parallelism','6')

In [1]:
import pyspark.sql
session = pyspark.sql.SparkSession.builder \
    .master('mesos://10.64.22.90:5050') \
    .appName('Cache-test') \
    .config('spark.jars.packages','org.diana-hep:spark-root_2.11:0.1.16,org.diana-hep:histogrammar-sparksql_2.11:1.0.4') \
    .config('spark.driver.extraClassPath','/opt/hadoop/share/hadoop/common/lib/EOSfs.jar') \
    .config('spark.executor.extraClassPath','/opt/hadoop/share/hadoop/common/lib/EOSfs.jar') \
    .config("spark.cores.max", "8") \
    .config('spark.executor.instances','4') \
    .config('spark.executor.cores','2') \
    .config('spark.executor.memory','2g') \
    .config('spark.serializer','org.apache.spark.serializer.KryoSerializer') \
    .getOrCreate()

# Keep the sqlContext name as an alias for the SparkSession
sqlContext = session
print sqlContext.version
print 'SparkSQL session created'

2.3.0
SparkSQL session created


## Data Ingestion

The NanoAOD datasets served from CERN EOS are ingested into Spark via the XRootD connector (__in the future this will be done via grid certificate__).
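
The loading cell below imports a `samples` module and treats `samples` as a mapping from sample name to a record with a `filename` key. The module itself is not shown, so the following is only a guess at its shape: the file names are placeholders, and `EOS_PREFIX` and `dataset_path` are hypothetical names introduced for illustration.

```python
# Hypothetical contents of samples.py: the real file names are not shown in
# this notebook, so the values below are placeholders.
samples = {
    'TT': {'filename': 'TT_placeholder.root'},
    'WW': {'filename': 'WW_placeholder.root'},
    'SingleMuon': {'filename': 'SingleMuon_placeholder.root'},
}

# Prefix used by the loading cell to reach the files on EOS.
EOS_PREFIX = 'root://eospublic.cern.ch//eos/opstest/cmspd-bigdata/'

def dataset_path(sample_name):
    """Build the XRootD URL for one sample."""
    return EOS_PREFIX + samples[sample_name]['filename']

for name in sorted(samples):
    print('{0}: {1}'.format(name, dataset_path(name)))
```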

In [2]:
from pyspark.sql.functions import lit
from samples import *

DFList = [] 

for s in samples:
    print 'Loading {0} sample from EOS file'.format(s) 
    dsPath = "root://eospublic.cern.ch//eos/opstest/cmspd-bigdata/"+samples[s]['filename']    
    tempDF = sqlContext.read \
                .format("org.dianahep.sparkroot") \
                .option("tree", "Events") \
                .load(dsPath)\
                .withColumn("sample", lit(s))
    DFList.append(tempDF)

Loading TT sample from EOS file
Loading WW sample from EOS file
Loading SingleMuon sample from EOS file


AnalysisException: u'Found duplicate column(s) in the data schema: `flag_csctighthalofilter`, `flag_trkpog_logerrortoomanyclusters`, `flag_hcallasereventfilter`, `flag_trkpogfilters`, `flag_eebadscfilter`, `flag_muonbadtrackfilter`, `flag_goodvertices`, `flag_ecaldeadcelltriggerprimitivefilter`, `flag_trkpog_manystripclus53x`, `flag_trkpog_toomanystripclus53x`, `flag_hcalstriphalofilter`, `flag_metfilters`, `flag_hbhenoiseisofilter`, `flag_globalsupertighthalo2016filter`, `flag_hbhenoisefilter`, `flag_csctighthalotrkmuunvetofilter`, `flag_csctighthalo2015filter`, `flag_ecallasercorrfilter`, `flag_globaltighthalo2016filter`, `flag_chargedhadrontrackresolutionfilter`, `flag_ecaldeadcellboundaryenergyfilter`;'
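
The column names in the message are all lower-case, which suggests (as an assumption, not verified against the file) that the tree contains branches whose names differ only by case. Spark compares column names case-insensitively unless `spark.sql.caseSensitive` is set to `true`, so such branches collide. A minimal plain-Python sketch for spotting collisions of this kind in a list of names follows; the helper name and the example branch names are hypothetical.

```python
from collections import defaultdict

def case_insensitive_duplicates(column_names):
    """Group column names that collide once lower-cased."""
    groups = defaultdict(list)
    for name in column_names:
        groups[name.lower()].append(name)
    # Keep only the lower-cased keys shared by more than one original name.
    return {key: names for key, names in groups.items() if len(names) > 1}

# Hypothetical branch names, chosen only to illustrate a case collision.
columns = ['Flag_goodVertices', 'Flag_GoodVertices', 'Muon_pt']
duplicates = case_insensitive_duplicates(columns)
print(duplicates)
```

If the collisions turn out to be of this kind, setting `spark.sql.caseSensitive` to `true` on the session is one option worth trying.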

In [None]:
from pyspark.sql.types import *

# Placeholder fields for an explicit schema (names still to be filled in)
field = [
    StructField('FIELDNAME_1', StringType(), True),
    StructField('FIELDNAME_2', StringType(), True),
    StructField('FIELDNAME_3', StringType(), True)
]
schema = StructType(field)

# Load a single file on its own to inspect the schema that Spark infers.
# Note: s is left over from the loading loop in the previous cell.
test = sqlContext.read \
                .format("org.dianahep.sparkroot") \
                .option("tree", "Events") \
                .load('root://eospublic.cern.ch//eos/opstest/cmspd-bigdata/SingleElectronRun2016C-03Feb2017-v1.root')\
                .withColumn("sample", lit(s))
test.printSchema()

# Build an empty DataFrame from the explicit schema
df = sqlContext.createDataFrame(sqlContext.sparkContext.emptyRDD(), schema)
df.printSchema()

In [None]:
#DFList[0].printSchema()
print sqlContext._jsc.sc().getExecutorMemoryStatus().size()
print sqlContext._jsc.sc().getExecutorMemoryStatus()

## Transforming Dataframe

- By default, the number of __available partitions (64MB each)__ in the cluster (the level of parallelism) is __equal to the total number of cores across all executors__.

- By default, partitioning is done with __one ingested dataset (DataFrame) per partition__.

No Spark job is created by the cell below, because select and union are lazy transformations. A Spark job is only created once an action is invoked on the DataFrame.
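
The difference between transformations and actions can be illustrated with a toy model in plain Python. This is not Spark code: `LazyFrame` is a hypothetical stand-in that records transformations and only processes rows when an action is called, and the two rows are made-up values.

```python
class LazyFrame(object):
    """Toy stand-in for a DataFrame; not part of Spark."""
    jobs_run = 0  # counts how many times data was actually processed

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded transformations, not yet applied

    def select(self, key):
        # Transformation: only extends the recorded plan.
        step = lambda row: {key: row[key]}
        return LazyFrame(self._rows, self._plan + (step,))

    def count(self):
        # Action: runs the recorded plan over the rows.
        LazyFrame.jobs_run += 1
        rows = self._rows
        for step in self._plan:
            rows = [step(row) for row in rows]
        return len(rows)

# Made-up rows, used only to exercise the toy class.
frame = LazyFrame([{'Muon_pt': 31.0, 'nMuon': 1}, {'Muon_pt': 45.2, 'nMuon': 2}])
selected = frame.select('Muon_pt')
jobs_after_select = LazyFrame.jobs_run
n_rows = selected.count()
jobs_after_count = LazyFrame.jobs_run
print('{0} {1} {2}'.format(jobs_after_select, n_rows, jobs_after_count))
```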

In [None]:
columns = [
    ### MUON
    'Muon_pt',
    'Muon_eta',
    'Muon_phi',
    'Muon_mass',
    'Muon_charge',
    'Muon_mediumId',
    'Muon_softId',
    'Muon_tightId',
    'nMuon',
    ### SAMPLE
    'sample',
]

# Select columns from dataframe
DF = DFList[0].select(columns)
#DF.printSchema()

# Merge all dataset into a single dataframe
for df_ in DFList[1:]:
    DF = DF.union(df_.select(columns))
    
print 'Partitions allocated for DataFrame:', DF.rdd.getNumPartitions()
# getExecutorMemoryStatus() has one entry per registered block manager
# (the executors, and typically the driver as well), so it is not a partition count
print 'Block managers reported by the JVM:', sqlContext._jsc.sc().getExecutorMemoryStatus().size()
print 'Default number of partitions (defaultParallelism) =', sqlContext._jsc.sc().defaultParallelism()

## Action1: count

Compute the number of entries in the DataFrame without caching.

In [None]:
import timeit
start_time = timeit.default_timer()
print 'total number of rows in the DataFrame  = ', DF.count()
elapsed = timeit.default_timer() - start_time
print 'time elapsed = ',elapsed,' s'
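
The start/stop timing pattern used in this cell is repeated for every action in the notebook, so it could be wrapped in a small helper. The sketch below is one possible way to do that; the name `timed` is hypothetical and the workload is a stand-in so that the example runs without a cluster.

```python
import timeit

def timed(label, action):
    """Run a zero-argument callable, print its elapsed wall time, return both."""
    start_time = timeit.default_timer()
    result = action()
    elapsed = timeit.default_timer() - start_time
    print('{0}: result = {1}, time elapsed = {2:.3f} s'.format(label, result, elapsed))
    return result, elapsed

# Stand-in workload so the sketch runs without a cluster.
result, elapsed = timed('sum', lambda: sum(range(1000)))
```

In the notebook the same helper would be called as `timed('count', DF.count)`.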

## Action2: Cache + count

Caching is lazy: persist() only marks the DataFrame for caching, and the data is materialized in the cache by the first action that runs on it.

In [None]:
start_time = timeit.default_timer()
##DF.cache()
DF.persist(pyspark.StorageLevel.MEMORY_ONLY)
print 'total number of events in the DataFrame  = ', DF.count()
elapsed = timeit.default_timer() - start_time
print 'time elapsed = ',elapsed,' s'

## Action3: Count on the cached DataFrame
This count should be fast, since the DataFrame is now read from the cache.

In [None]:
start_time = timeit.default_timer()
print 'total number of row in the DataFrame  = ', DF.count()
elapsed = timeit.default_timer() - start_time
print 'Storage level = ',DF.rdd.getStorageLevel()
print 'time elapsed = ',elapsed,' s'

In [None]:
# Release the cached data so that the next cells start from an uncached DataFrame
DF.unpersist()

In [None]:
start_time = timeit.default_timer()
print 'total number of rows in the DataFrame  = ', DF.count()
elapsed = timeit.default_timer() - start_time
print 'time elapsed = ',elapsed,' s'

## Action4: count + cache with serialization

In [None]:
start_time = timeit.default_timer()
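# Note: PySpark 2.x documents MEMORY_ONLY_SER as a deprecated alias of MEMORY_ONLY,
# so this cell is expected to use the same storage level as Action2
DF.persist(pyspark.StorageLevel.MEMORY_ONLY_SER)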
print 'total number of events in the DataFrame  = ', DF.count()
elapsed = timeit.default_timer() - start_time
print 'time elapsed = ',elapsed,' s'

## Action5: Count on the serialized cached DataFrame

In [None]:
start_time = timeit.default_timer()
print 'total number of row in the DataFrame  = ', DF.count()
elapsed = timeit.default_timer() - start_time
print 'Storage level = ',DF.rdd.getStorageLevel()
print 'time elapsed = ',elapsed,' s'