### Convert Apogee-Fire simulation data to parquet on HDFS

Hadoop has no native reader for HDF5 data, so there is currently no efficient way to process large HDF5 files with it (for example, HDF5 chunks cannot be mapped onto HDFS blocks and distributed). In the Apogee-Fire case, some of the files are close to 300 GB each, so reading them with `binaryFiles`, which loads each file as a single record, is not feasible.

The approach taken here is to make the files accessible on the local filesystem, where POSIX semantics apply, and to take advantage of the chunk-based reads built into the HDF5 library. For the data to be visible from the executors, the mounts must exist on every Spark node. This is done by adding the following mounts to the hadoop containers in the storage-0 statefulset:

```yaml
        - mountPath: /data/apogee-fire/m12f
          name: apogee-fire-1
        - mountPath: /data/apogee-fire/m12i
          name: apogee-fire-2
        - mountPath: /data/apogee-fire/m12m
          name: apogee-fire-3

```

And volume definitions:

```yaml
      - name: apogee-fire-1
        nfs:
          path: /srv/zpool01/sdss_casload_backups/data-park/apogee-fire_sim/v1_0_1/m12f
          server: sciserver-fs1
      - name: apogee-fire-2
        nfs:
          path: /srv/zpool01/sdss_casload_backups/data-park/apogee-fire_sim/v1_0_1/m12i
          server: sciserver-fs1
      - name: apogee-fire-3
        nfs:
          path: /srv/zpool01/sdss_casload_backups/data-park/apogee-fire_sim/v1_0_1/m12m
          server: sciserver-fs1

```
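Before starting the conversion, it can help to confirm that the mounts are actually visible. The following is a minimal sketch, assuming the three mount paths listed above; `missing_mounts` is a hypothetical helper and not part of the original notebook.

```python
import os

# Mount points taken from the statefulset snippet above.
MOUNTS = [
    "/data/apogee-fire/m12f",
    "/data/apogee-fire/m12i",
    "/data/apogee-fire/m12m",
]

def missing_mounts(paths):
    """Return the subset of `paths` that is not a readable directory on this host."""
    return [p for p in paths if not (os.path.isdir(p) and os.access(p, os.R_OK))]

# In a Spark session the same check could be pushed to the executors, e.g.:
#   sc.parallelize(range(100), 100).map(lambda _: missing_mounts(MOUNTS)).collect()
# The slice count (100) is arbitrary, and Spark does not guarantee that every
# executor receives a task, so treat the result as a spot check only.
```

An empty list from every task suggests the NFS volumes are mounted where the executors expect them.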

Instead of assigning each chunk read to its own task, we distribute the chunks over a large but not overwhelming number of tasks. Multiple chunks therefore land on a single partition, so the memory required per task exceeds the size of one chunk, but the output files also end up a more reasonable size. The executor memory and core settings are a tradeoff: sharing 8 GB between 2 tasks did not prove workable, so each executor below runs a single core with 8 GB.
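To illustrate why several chunks end up in one partition, here is a rough sketch of hash partitioning in plain Python. The file names, chunk counts, and partition count are hypothetical, and `zlib.crc32` is only a deterministic stand-in for Spark's own hash function.

```python
import zlib
from collections import Counter

def chunks_per_partition(keys, num_partitions):
    """Count how many keys land in each partition under hash partitioning.

    zlib.crc32 stands in for Spark's hash; it is used here only because it
    gives the same result in every Python process.
    """
    return Counter(zlib.crc32(k.encode()) % num_partitions for k in keys)

# Hypothetical workload: 10 files with 500 chunks each, over 1000 partitions.
keys = [f"file{f}.hdf5-{c * 4096}" for f in range(10) for c in range(500)]
counts = chunks_per_partition(keys, 1000)

print("non-empty partitions:", len(counts))
print("most chunks in one partition:", max(counts.values()))
```

The busiest partition determines how much data a single task must hold at once, which is why per-task memory has to be sized for several chunks rather than one.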

In [1]:
%%configure -f
{
    "numExecutors": 60, "executorCores": 1, "executorMemory": "8gb", "driverMemory": "2g", 
    "conf": { 
        "spark.pyspark.virtualenv.enabled": "true", 
        "spark.pyspark.virtualenv.python_version": "3.7",
        "spark.yarn.appMasterEnv.PIP_CACHE_DIR": "/tmp"
    }
}

In [2]:
sc.install_packages(['h5py', 'numpy', 'pandas'])

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
301,application_1635030912678_0029,pyspark,idle,Link,Link,,✔



SparkSession available as 'spark'.



Collecting h5py
  Using cached h5py-3.5.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (4.1 MB)
Collecting numpy
  Using cached numpy-1.21.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Collecting pandas
  Using cached pandas-1.3.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
Collecting cached-property
  Using cached cached_property-1.5.2-py2.py3-none-any.whl (7.6 kB)
Collecting python-dateutil>=2.7.3
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pytz>=2017.3
  Using cached pytz-2021.3-py2.py3-none-any.whl (503 kB)
Collecting six>=1.5
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: six, pytz, python-dateutil, numpy, cached-property, pandas, h5py
Successfully installed cached-property-1.5.2 h5py-3.5.0 numpy-1.21.3 pandas-1.3.4 python-dateutil-2.8.2 pytz-2021.3 six-1.16.0
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying iss

In [3]:
import h5py
from io import BytesIO
import numpy as np
import glob
from pyspark.sql import Row


In [4]:
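def getFileListRdd(path):
    # One RDD element (and one partition) per HDF5 file under `path`.
    files = glob.glob(f'{path}/*.hdf5')
    return sc.parallelize(files, len(files))

def readH5(path):
    # Open read-only; the path must be on a mount visible to the executor.
    return h5py.File(path, 'r')

def getMaxChunkSize(h5):
    # Largest first-axis chunk length over all datasets.
    # Assumes every dataset is chunked (`chunks` is None for contiguous data).
    chunk_sizes = [h5[i].chunks[0] for i in h5.keys()]
    return max(chunk_sizes)

def getDataLength(h5):
    # Assumes all datasets share the length of '2MASS_magH'.
    return h5['2MASS_magH'].shape[0]

def getChunkList(h5):
    # (start, stop) row ranges; the last stop may exceed the data length,
    # in which case the final slice is simply shorter.
    l = getDataLength(h5)
    cs = getMaxChunkSize(h5)
    return zip(np.arange(0, l, cs), np.arange(0, l, cs) + cs)

def chunkToRow(h5, chunk):
    # Read one row range from every dataset into a single Row of arrays.
    row_dict = {}
    chunk_slice = slice(chunk[0], chunk[1])
    for ds in h5:
        row_dict[ds] = np.array(h5[ds][chunk_slice]).tolist()
    return Row(**row_dict)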
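The ranges produced by `getChunkList` can be checked on a toy example. This is a plain-Python sketch of the same arithmetic; `chunk_bounds` is a hypothetical stand-in that uses `range` instead of `np.arange`, and the length and chunk size are made up.

```python
def chunk_bounds(length, chunk_size):
    """Return (start, stop) pairs covering range(length) in steps of chunk_size."""
    return [(start, start + chunk_size) for start in range(0, length, chunk_size)]

bounds = chunk_bounds(10, 4)
print(bounds)  # [(0, 4), (4, 8), (8, 12)]

# The last stop (12) overshoots the data length (10). Python slicing clamps an
# out-of-range stop, so the final chunk is just shorter; the notebook code
# relies on the same behaviour when slicing datasets.
data = list(range(10))
print([len(data[a:b]) for a, b in bounds])  # [4, 4, 2]
```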


In [5]:
a = getFileListRdd('/data/apogee-fire/m12f/lsr_0')


In [6]:
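ds = a.map(
    # (path, open file); the handle is only used within this stage
    lambda x: (x, readH5(x))
).flatMapValues(
    # one (path, (start, stop)) record per chunk
    getChunkList
).map(
    # key each chunk by path and start offset so chunks spread over partitions
    lambda x: (f'{x[0]}-{x[1][0]}', (x[0], x[1]))
).partitionBy(
    25000
).map(
    # reopen the file on the executor and read the chunk into a Row
    lambda x: chunkToRow(readH5(x[1][0]), x[1][1])
)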


In [7]:
df = spark.createDataFrame(ds)
df.createOrReplaceTempView('arrRows')
df.printSchema()


root
 |-- 2MASS_magH: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- 2MASS_magH_error: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- 2MASS_magH_int: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- 2MASS_magH_true: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- 2MASS_magJ: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- 2MASS_magJ_error: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- 2MASS_magJ_int: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- 2MASS_magJ_true: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- 2MASS_magKs: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- 2MASS_magKs_error: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- 2MASS_magKs_int: array (nullable = true)
 |    |-- element: do

In [8]:
out_df = spark.sql('''
SELECT a.* FROM (
  SELECT explode(arrays_zip(*)) a FROM  arrRows
)
''')
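The query above turns each row of arrays (one row per chunk) into one row per array element. A plain-Python analogue of that reshaping, as a sketch only: `explode_arrays_zip` and the sample values are hypothetical, and it assumes all arrays in a row have equal length (Spark's `arrays_zip` pads shorter arrays with nulls, whereas Python's `zip` truncates).

```python
def explode_arrays_zip(row):
    """Turn one row of equal-length arrays into one dict per array element."""
    names = list(row)
    return [dict(zip(names, values)) for values in zip(*(row[n] for n in names))]

# Hypothetical chunk row with two columns of three elements each.
chunk_row = {"age": [8.9, 8.9, 9.1], "random_index": [5215, 61669, 71726]}
for record in explode_arrays_zip(chunk_row):
    print(record)
```

Each printed record corresponds to one output row of `out_df`, with one field per dataset.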


In [9]:
out_df.select(['2MASS_magH', 'age', 'MgFe-APOGEE', 'random_index']).show()


+------------------+----------------+-------------------+------------+
|        2MASS_magH|             age|        MgFe-APOGEE|random_index|
+------------------+----------------+-------------------+------------+
| 17.17505645751953|8.89330005645752|                NaN|        5215|
|15.786520957946777|8.89330005645752|                NaN|       61669|
|17.847721099853516|8.89330005645752|                NaN|       71726|
|16.963459014892578|8.89330005645752|0.16909177601337433|      146484|
|14.978506088256836|8.89330005645752|                NaN|       35597|
| 16.20737648010254|8.89330005645752|                NaN|       60027|
|15.714632987976074|8.89330005645752|                NaN|       50464|
|17.672130584716797|8.89330005645752|0.18444456160068512|      149193|
|17.447877883911133|8.89330005645752|0.17968931794166565|      106078|
| 17.78213119506836|8.89330005645752|                NaN|       78737|
|16.371044158935547|8.89330005645752|                NaN|      148960|
|15.92

In [None]:
out_df.write.mode('overwrite').parquet('hdfs:///user/arik/apogee_fire_test/m12f/lsr_0/')
