In [1]:
%%configure -f
{"name": "arik-manga-query", "executorMemory": "12G", "numExecutors": 16, "executorCores": 4,
 "conf": {"spark.yarn.appMasterEnv.PYSPARK_PYTHON":"python3"}}

In [2]:
import time
hdfs_dir = 'hdfs:///manga/arik-test/dr15/v2_4_3/logrss'

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
171,application_1580142637008_0177,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


### Table setup
Each query below reads the dataset directly from the Parquet files, because it is too large to cache on our cluster. The OS page cache will likely retain as much as possible of what we read, but it is certainly smaller than the data scanned by the largest queries, so some disk reads are unavoidable. In this case we benefit from large blocks/files and from data locality.
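To make the "too large to cache" judgement concrete, here is a rough sketch of how much memory Spark could devote to execution and caching in this session. The executor settings come from the `%%configure` cell. The 0.6 fraction and the 300 MiB reserve are Spark's documented defaults and are assumed not to be overridden here. The helper names are hypothetical, and the in-memory size of the dataset is not stated in this notebook, so it is left as a parameter.

```python
# Rough estimate of Spark's unified (execution + storage) memory pool.
# Assumptions: default spark.memory.fraction (0.6) and the default
# 300 MiB reserved per executor; neither is confirmed by this notebook.

RESERVED_MIB = 300       # memory Spark sets aside per executor
MEMORY_FRACTION = 0.6    # default spark.memory.fraction


def unified_memory_gib(executor_memory_gib, num_executors,
                       memory_fraction=MEMORY_FRACTION):
    """Return the approximate cluster-wide unified memory pool in GiB."""
    per_executor_mib = (executor_memory_gib * 1024 - RESERVED_MIB) * memory_fraction
    return per_executor_mib * num_executors / 1024


def fits_in_cache(dataset_gib, executor_memory_gib, num_executors):
    """True if the (deserialized) dataset is smaller than the unified pool."""
    return dataset_gib < unified_memory_gib(executor_memory_gib, num_executors)


# Values from the %%configure cell: 16 executors with 12G each.
pool = unified_memory_gib(executor_memory_gib=12, num_executors=16)
print(f"approximate unified memory: {pool:.1f} GiB")
```

The pool is shared with execution (shuffles, aggregations), and data deserialized in memory is usually larger than the Parquet files on disk, so the real caching budget is smaller than this figure.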

In [3]:
t = time.time()
# .cache() is deliberately omitted: the data is too large for the cluster to cache
manga = spark.read.parquet(hdfs_dir)
manga.createOrReplaceTempView('manga')  # expose the DataFrame to Spark SQL as 'manga'
print(time.time()-t)

5.636242151260376

### Simple count
This is the first query, so it pays for lazy allocations, and the OS-level page cache most likely holds unrelated data and is only populated by this run. Subsequent runs of the same query should execute faster.
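The cold-versus-warm comparison can be made less repetitive with a small timing helper. This is a minimal sketch: `timed` and `run_count` are hypothetical names, and `run_count` uses a cheap stand-in so the example runs without a Spark session. In the notebook its body would be the `spark.sql(...).show()` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block with a monotonic clock and print the duration."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the value for later comparison
        print(f"{label}: {elapsed:.3f} s")


def run_count():
    # In the notebook this would be:
    #     spark.sql('SELECT count(*) FROM manga').show()
    # A cheap stand-in keeps the sketch runnable without Spark.
    return sum(range(1000))


timings = {}
with timed("cold run", timings):
    run_count()
with timed("warm run", timings):
    run_count()
```

`time.perf_counter()` is used instead of `time.time()` because it is monotonic and therefore unaffected by system clock adjustments during a measurement.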

In [4]:
t = time.time()
spark.sql('SELECT count(*) FROM manga').show()
print(time.time()-t)

+--------+
|count(1)|
+--------+
| 3757380|
+--------+

16.033992290496826

In [8]:
t = time.time()
spark.sql('SELECT count(*) FROM manga').show()
print(time.time()-t)

+--------+
|count(1)|
+--------+
| 3757380|
+--------+

6.466812610626221

### Counting fiber exposures per IFU
The previous query benefits greatly from Parquet row-group statistics: the Spark task shows that only about 0.2% of the dataset is read. This query still benefits strongly from the Parquet format thanks to run-length encoding (RLE), but the grouping adds a larger computational burden.
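To illustrate why RLE helps so much on the grouping columns, here is a minimal pure-Python sketch of run-length encoding. It is not Parquet's actual implementation, which uses a hybrid of RLE and bit-packing over dictionary indices. The run lengths are borrowed from the first rows of the query output below, and the assumption that rows of one plate are stored contiguously is ours, not something the notebook confirms.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    return [[value, sum(1 for _ in run)] for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand [value, run_length] pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical PLATEID column, assuming each plate's rows are contiguous.
plateid = [7964] * 1905 + [9507] * 819 + [7443] * 1905

encoded = rle_encode(plateid)
print(len(plateid), "values stored as", len(encoded), "runs")
```

A column with long runs collapses to a handful of pairs, whereas a column of distinct floating-point values (such as a flux array) would produce roughly one run per value and gain nothing.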

In [5]:
t = time.time()
spark.sql('SELECT PLATEID,IFUDESIGN,count(*) AS fexps FROM manga GROUP BY PLATEID,IFUDESIGN').show()
print(time.time()-t)

+-------+---------+-----+
|PLATEID|IFUDESIGN|fexps|
+-------+---------+-----+
|   7964|    12705| 1905|
|   9507|     9101|  819|
|   7443|    12701| 1905|
|   8721|    12705| 1143|
|   8085|    12703| 1524|
|   8262|    12701| 1143|
|   9881|     3702|  333|
|   8084|    12704| 1905|
|   9195|     1902|  228|
|   8315|     3701|  333|
|   9507|    12703| 1143|
|   8261|     3701|  333|
|   8993|     3702|  333|
|   8323|     3702|  333|
|   8486|     3702|  333|
|   8440|     6104|  732|
|   9675|     6102|  366|
|   9675|     6103|  366|
|   8728|    12701| 1524|
|   9049|     3703|  444|
+-------+---------+-----+
only showing top 20 rows

14.67196774482727

### Band average flux
Here we force the system to scan columns that gain little or no benefit from RLE, which forces a significant read. However, because we touch only 5 columns, 3 of which are largely compressed away by RLE, we end up reading about 23% of the whole dataset and use the CPU for grouping and aggregation.
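The logic of the query can be sketched in plain Python: pair each `WAVE` sample with the `FLUX` sample at the same index, keep the pairs inside the band, and average the flux per `(PLATEID, IFUDESIGN, MANGAID)` group. The function name and the toy rows are hypothetical and only illustrate the semantics. The sketch also assumes `WAVE` and `FLUX` have equal length in each row (`arrays_zip` would pad the shorter array with nulls, while Python's `zip` truncates).

```python
from collections import defaultdict


def band_average_flux(rows, lo=3700, hi=4000):
    """Average FLUX per (PLATEID, IFUDESIGN, MANGAID) for lo <= WAVE <= hi."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = (row["PLATEID"], row["IFUDESIGN"], row["MANGAID"])
        # zip pairs sample i of WAVE with sample i of FLUX,
        # mirroring explode_outer(arrays_zip(WAVE, FLUX)).
        for wave, flux in zip(row["WAVE"], row["FLUX"]):
            if wave is None or flux is None:
                continue  # NULL wave fails BETWEEN; avg() skips NULL flux
            if lo <= wave <= hi:  # BETWEEN is inclusive on both ends
                sums[key] += flux
                counts[key] += 1
    return {key: sums[key] / counts[key] for key in counts}


# Toy data with made-up values, not taken from the real dataset.
rows = [
    {"PLATEID": 1, "IFUDESIGN": 10, "MANGAID": "a",
     "WAVE": [3600.0, 3700.0, 3900.0, 4100.0], "FLUX": [9.0, 1.0, 3.0, 9.0]},
    {"PLATEID": 1, "IFUDESIGN": 10, "MANGAID": "a",
     "WAVE": [3800.0, 4000.0], "FLUX": [2.0, 6.0]},
    {"PLATEID": 2, "IFUDESIGN": 20, "MANGAID": "b",
     "WAVE": [5000.0], "FLUX": [7.0]},
]

averages = band_average_flux(rows)
print(averages)
```

Groups with no samples inside the band are absent from the result, just as they are filtered out by the `WHERE` clause before the `GROUP BY` in the SQL version.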

In [7]:
t = time.time()
spark.sql(
'''
SELECT PLATEID, IFUDESIGN, MANGAID, avg(SPEC.FLUX)
FROM (
    SELECT explode_outer(arrays_zip(WAVE, FLUX)) AS SPEC,
    PLATEID, IFUDESIGN, MANGAID
    FROM manga
)
WHERE SPEC.WAVE BETWEEN 3700 AND 4000
GROUP BY PLATEID, IFUDESIGN, MANGAID
'''
).show()
print(time.time()-t)

+-------+---------+--------+------------------------+
|PLATEID|IFUDESIGN| MANGAID|avg(SPEC.FLUX AS `FLUX`)|
+-------+---------+--------+------------------------+
|   8320|     3701|1-519738|      2.7885750267017317|
|   9185|     6103|1-546965|     0.31238629822468966|
|   8933|     6103|1-456768|      1.2990703250866296|
|   8452|    12702|1-147685|      0.1602110807781861|
|   8552|     6102|1-321961|      0.4957944524832962|
|   9051|    12703|    50-4|    0.043823784664009144|
|   8450|     1902|1-491220|      1.6152788146299577|
|   8942|    12703|1-218519|        0.22637839739722|
|   8716|     3704|1-352081|     0.24095725655764183|
|   8945|    12702|1-615167|      1.3982744608175408|
|   8595|    12701|1-197805|      0.1740635495507865|
|   8548|     6102| 1-93236|      0.8367524575518326|
|   8333|     9101|1-265919|      0.5213299778290923|
|   8726|     9101|1-604826|      0.9012554880994621|
|   8458|    12704|1-166908|     0.36139197538068496|
|   8952|    12702|1-627000|

Clean up after ourselves

In [13]:
%%cleanup -f