
### Convert Apogee-Fire simulation data to parquet on HDFS

Hadoop has no native reader for HDF5 data, so there is currently no efficient way to handle large HDF5 files (for example, we cannot split a file into chunks and distribute them across HDFS blocks). In the Apogee-Fire case some of the files are close to 300 GB each, so a `binaryFiles` read is not feasible.

Instead, we make the files accessible on the local filesystem, where POSIX semantics apply, and take advantage of the chunk-based reads built into the HDF5 library. For the data to be visible from the executors, the mounts must exist on every Spark node. This is done by adding the following mounts to the hadoop containers in the storage-0 statefulset:

NOTE: the mountPath was changed from /data to /data-external so that it does not conflict with the HDFS path.

```yaml
        - mountPath: /data-external/apogee-fire/m12f
          name: apogee-fire-1
        - mountPath: /data-external/apogee-fire/m12i
          name: apogee-fire-2
        - mountPath: /data-external/apogee-fire/m12m
          name: apogee-fire-3

```

And volume definitions:

```yaml
      - name: apogee-fire-1
        nfs:
          path: /srv/zpool01/sdss_casload_backups/data-park/apogee-fire_sim/v1_0_1/m12f
          server: sciserver-fs1
      - name: apogee-fire-2
        nfs:
          path: /srv/zpool01/sdss_casload_backups/data-park/apogee-fire_sim/v1_0_1/m12i
          server: sciserver-fs1
      - name: apogee-fire-3
        nfs:
          path: /srv/zpool01/sdss_casload_backups/data-park/apogee-fire_sim/v1_0_1/m12m
          server: sciserver-fs1

```

Instead of forcing each chunk read into its own task, we distribute the chunks over a large-but-not-overwhelming number of tasks. Multiple chunks therefore land on a single partition, so the memory required per task exceeds the chunk size (but the output files also end up a more reasonable size). Executor memory and cores are a tradeoff here: sharing 8 GB between 2 tasks did not prove workable.

https://apps.sciserver.org/sparkgw/dracula/sparkhistory/history/application_1637717202829_0015/1/jobs/
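
The fan-out described above can be illustrated without Spark. The sketch below is a minimal stand-in, not the notebook's code: `chunk_ranges`, `assign_partitions`, and the CRC32-based hash are hypothetical helpers chosen for illustration (Spark's `partitionBy` uses its own hash function), and the row count, chunk size, and file name are made-up values. It shows that once there are more chunks than partitions, some partitions necessarily hold several chunks, which is why per-task memory has to exceed a single chunk.

```python
import zlib
from collections import defaultdict


def chunk_ranges(length, chunk_size):
    """Return (start, stop) row ranges that cover `length` rows."""
    return [(start, min(start + chunk_size, length))
            for start in range(0, length, chunk_size)]


def assign_partitions(file_name, ranges, num_partitions):
    """Group chunk ranges by a deterministic hash of the '<file>-<start>' key.

    CRC32 is an illustrative stand-in for Spark's partitioner hash.
    """
    partitions = defaultdict(list)
    for start, stop in ranges:
        key = f"{file_name}-{start}"
        partitions[zlib.crc32(key.encode()) % num_partitions].append((start, stop))
    return dict(partitions)


# Made-up sizes: 1,000,000 rows read in chunks of 4,096 rows, spread over 50 partitions.
ranges = chunk_ranges(1_000_000, 4_096)
partitions = assign_partitions("example.hdf5", ranges, 50)
largest = max(len(chunks) for chunks in partitions.values())
print(len(ranges), "chunks ->", len(partitions), "non-empty partitions; largest holds", largest)
```

With 245 chunks and 50 partitions, at least one partition must hold five or more chunks, so a task's working set is a multiple of the chunk size.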

The `spark.kryoserializer.buffer.max.mb` value can be tuned; 512 was a rough guess. (Recent Spark releases name this setting `spark.kryoserializer.buffer.max` and take a size such as `512m`.)

In [30]:
%%configure -f
{
    "numExecutors": 60, "executorCores": 1, "executorMemory": "8gb", "driverMemory": "2g",
    "conf": {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryoserializer.buffer.max.mb": "512"
    }
}

In [31]:
import h5py
from io import BytesIO
import numpy as np
import glob
from pyspark.sql import Row
import healpy
import HMpTy

import astropy.units as u
import astropy.coordinates as coord


Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
9,application_1648742864561_0010,pyspark,idle,Link,Link,sue100,✔


SparkSession available as 'spark'.


In [None]:
def getFileListRdd(path):
    files = glob.glob(f'{path}/*.hdf5')
    return sc.parallelize(files, len(files))

def getFileList(path):
    return glob.glob(f'{path}/*.hdf5')

def readH5(path):
    # open read-only; the source files are never modified
    return h5py.File(path, 'r')

def getMaxChunkSize(h5):
    chunk_sizes = [h5[i].chunks[0] for i in h5.keys()]
    return max(chunk_sizes)

def getDataLength(h5):
    return h5['2MASS_magH'].shape[0]

def getChunkList(h5):
    l = getDataLength(h5)
    cs = getMaxChunkSize(h5)
    return zip(np.arange(0, l, cs), np.arange(0, l, cs) + cs)





###################################################
#  TODO: break the obs and true stuff out
#  into functions to remove redundant code
####################################################

def chunkToRow(h5, chunk, vctr_rot, gcen):
    row_dict = {}
    chunk_slice = slice(chunk[0], chunk[1])
    for ds in h5:
        row_dict[ds] = np.array(h5[ds][chunk_slice]).tolist()
    
 
    # add healpix    
    l = np.array(h5['l'][chunk_slice])
    b = np.array(h5['b'][chunk_slice])
    
    NSIDE=1048576 
    heal20id = healpy.ang2pix(NSIDE,  l, b, nest=True, lonlat=True)
    #theta = np.pi/2 - np.radians(dec)
    #phi = np.radians(ra)
    #hp8 = healpy.ang2pix(8, theta, phi, nest=True)
    row_dict['heal20id'] = heal20id.tolist()
    
    
    # add htmid 
    
    htm20 = HMpTy.HTM(depth=20)
    ra = np.array(h5['ra'][chunk_slice])
    dec = np.array(h5['dec'][chunk_slice])
    
    htm20id = htm20.lookup_id(ra,dec)
    row_dict['htm20id'] = htm20id.tolist()
    
    # add cx,cy,cz
    cx = np.cos(np.radians(ra)) * np.cos(np.radians(dec))
    cy = np.sin(np.radians(ra)) * np.cos(np.radians(dec))
    cz = np.sin(np.radians(dec))
    
    row_dict['cx'] = cx.tolist()
    row_dict['cy'] = cy.tolist()
    row_dict['cz'] = cz.tolist()
    
    #get coords
    #define the coordinates in ra/dec from the catalog
    
    parallax = np.array(h5['parallax'][chunk_slice])
    dist = 1./parallax
    
        
    parallax_true = np.array(h5['parallax_true'][chunk_slice])
    dist_true = 1./parallax_true
    
    pmra  = np.array(h5['pmra'][chunk_slice])      
    pmdec = np.array(h5['pmdec'][chunk_slice])
    vr    = np.array(h5['radial_velocity'][chunk_slice])  #this is where the nans live!
    
    
    ##########################################################
    #  no NaN's in the true values!
    
    pmra_true = np.array(h5['pmra_true'][chunk_slice])
    pmdec_true = np.array(h5['pmdec_true'][chunk_slice])
    vr_true = np.array(h5['radial_velocity_true'][chunk_slice])
    
    ra_true = np.array(h5['ra_true'][chunk_slice])
    dec_true = np.array(h5['dec_true'][chunk_slice])
    

    
    c = coord.ICRS(ra=ra*u.degree,
                   dec=dec*u.degree,
                   distance=coord.Distance(dist*u.kpc, allow_negative=True),
                   pm_ra_cosdec=pmra*u.mas/u.yr,
                   pm_dec=pmdec*u.mas/u.yr,
                   radial_velocity=vr*u.km/u.s)

    c_true = coord.ICRS(ra=ra_true*u.degree,
                        dec=dec_true*u.degree,
                        distance=coord.Distance(dist_true*u.kpc, allow_negative=True),
                        pm_ra_cosdec=pmra_true*u.mas/u.yr,
                        pm_dec=pmdec_true*u.mas/u.yr,
                        radial_velocity=vr_true*u.km/u.s)
    
    gc = c.transform_to(gcen)
    
    gc_true = c_true.transform_to(gcen)
    
    x, y, z = gc.cartesian.xyz.value
    x_true, y_true, z_true = gc_true.cartesian.xyz.value
    
    vx, vy, vz = gc.cartesian.differentials['s'].d_xyz.value
    vx_true, vy_true, vz_true = gc_true.cartesian.differentials['s'].d_xyz.value
    
    row_dict['px_gal_obs'] = x.tolist()
    row_dict['px_gal_true'] = x_true.tolist()
    
    row_dict['py_gal_obs'] = y.tolist()
    row_dict['py_gal_true'] = y_true.tolist()
    
    row_dict['pz_gal_obs'] = z.tolist()
    row_dict['pz_gal_true'] = z_true.tolist()
    
    row_dict['vx_gal_obs'] = vx.tolist()  
    row_dict['vx_gal_true'] = vx_true.tolist()
    
    row_dict['vy_gal_obs'] = vy.tolist() 
    row_dict['vy_gal_true'] = vy_true.tolist() 
    
    row_dict['vz_gal_obs'] = vz.tolist()  
    row_dict['vz_gal_true'] = vz_true.tolist()  
    
    # get dgal_true
    dhel_true = np.array(h5['dhel_true'][chunk_slice])
    l_true = np.array(h5['l_true'][chunk_slice])
    b_true = np.array(h5['b_true'][chunk_slice])
    
    # galactocentric distance, with the Sun 8.2 kpc from the galactic center
    dgal_true = np.sqrt(
        np.square(dhel_true*np.cos(np.radians(l_true))*np.cos(np.radians(b_true)) - 8.2)
        + np.square(dhel_true*np.sin(np.radians(l_true))*np.cos(np.radians(b_true)))
        + np.square(dhel_true*np.sin(np.radians(b_true)))
    )
    
    
    
    row_dict['dgal_true'] = dgal_true.tolist()
    
    row_dict['dhel_obs'] = dist.tolist()
    
    # transform to cylindrical coords
    gc.representation_type = 'cylindrical'
    
    rho, phi, zcyl = gc.rho.to(u.kpc).value, gc.phi.degree, gc.z.to(u.kpc).value
    
    VRho = gc.d_rho.to(u.km/u.s).value
    VPhi = (gc.d_phi*gc.rho).to(u.km/u.s, equivalencies=u.dimensionless_angles()).value
    VZcyl = gc.d_z.to(u.km/u.s).value
   

    ##############################################################################
    # GET TRUE VERSIONS
    ##############################################################################

    gc_true.representation_type = 'cylindrical'
    
    rho_true, phi_true, zcyl_true = gc_true.rho.to(u.kpc).value, gc_true.phi.degree, gc_true.z.to(u.kpc).value
    
    VRho_true = gc_true.d_rho.to(u.km/u.s).value
    VPhi_true = (gc_true.d_phi*gc_true.rho).to(u.km/u.s, equivalencies=u.dimensionless_angles()).value
    VZcyl_true = gc_true.d_z.to(u.km/u.s).value
    
    
    row_dict['rho_cyl_obs'] = rho.tolist()
    row_dict['rho_cyl_true'] = rho_true.tolist()
    
    row_dict['phi_cyl_obs'] = phi.tolist()
    row_dict['phi_cyl_true'] = phi_true.tolist()
    
    row_dict['z_cyl_obs'] = zcyl.tolist()
    row_dict['z_cyl_true'] = zcyl_true.tolist()
    
    row_dict['vrho_cyl_obs'] = VRho.tolist()  
    row_dict['vrho_cyl_true'] = VRho_true.tolist() 
    
    row_dict['vphi_cyl_obs'] = VPhi.tolist()  
    row_dict['vphi_cyl_true'] = VPhi_true.tolist() 
   
    row_dict['vz_cyl_obs'] = VZcyl.tolist()  
    row_dict['vz_cyl_true'] = VZcyl_true.tolist() 
    
    

    
    return Row(**row_dict)
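
Two of the derived columns above are plain trigonometry and can be checked by hand. The sketch below restates them with only the standard library; `unit_vector` and `galactocentric_distance` are hypothetical names introduced for illustration, and `R_SUN_KPC = 8.2` mirrors the constant hard-coded in the `dgal_true` expression.

```python
import math

R_SUN_KPC = 8.2  # Sun-to-galactic-center distance used in the dgal_true expression


def unit_vector(ra_deg, dec_deg):
    """Cartesian unit vector (cx, cy, cz) for a sky position given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(ra) * math.cos(dec),
            math.sin(ra) * math.cos(dec),
            math.sin(dec))


def galactocentric_distance(dhel_kpc, l_deg, b_deg, r_sun=R_SUN_KPC):
    """Distance from the galactic center for a star at heliocentric (d, l, b)."""
    l, b = math.radians(l_deg), math.radians(b_deg)
    x = dhel_kpc * math.cos(l) * math.cos(b) - r_sun  # shift the origin to the galactic center
    y = dhel_kpc * math.sin(l) * math.cos(b)
    z = dhel_kpc * math.sin(b)
    return math.sqrt(x * x + y * y + z * z)


# A star 8.2 kpc away toward l=0, b=0 sits at the galactic center.
print(galactocentric_distance(8.2, 0.0, 0.0))
# A star 1 kpc away in the opposite direction is 9.2 kpc from the center.
print(galactocentric_distance(1.0, 180.0, 0.0))
print(unit_vector(90.0, 0.0))
```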

In [12]:
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, DoubleType, LongType

schema = StructType([ \
    StructField("2mass_magh",FloatType(), True), \
    StructField("2mass_magh_error",FloatType(), True), \
    StructField("2mass_magh_int",FloatType(), True), \
    StructField("2mass_magh_true",FloatType(), True), \
    StructField("2mass_magj",FloatType(), True), \
    StructField("2mass_magj_error",FloatType(), True), \
    StructField("2mass_magj_int",FloatType(), True), \
    StructField("2mass_magj_true",FloatType(), True), \
    StructField("2mass_magks",FloatType(), True), \
    StructField("2mass_magks_error",FloatType(), True), \
    StructField("2mass_magks_int",FloatType(), True), \
    StructField("2mass_magks_true",FloatType(), True), \
    StructField("a0",FloatType(), True), \
    StructField("cfe_apogee",FloatType(), True), \
    StructField("cafe_apogee",FloatType(), True), \
    StructField("feh_apogee",FloatType(), True), \
    StructField("mgfe_apogee",FloatType(), True), \
    StructField("nfe_apogee",FloatType(), True), \
    StructField("ofe_apogee",FloatType(), True), \
    StructField("sfe_apogee",FloatType(), True), \
    StructField("sife_apogee",FloatType(), True), \
    StructField("a_g_bp_val",FloatType(), True), \
    StructField("a_g_rp_val",FloatType(), True), \
    StructField("a_g_val",FloatType(), True), \
    StructField("age",FloatType(), True), \
    StructField("alpha",FloatType(), True), \
    StructField("b",DoubleType(), True), \
    StructField("b_true",DoubleType(), True), \
    StructField("bp_g",FloatType(), True), \
    StructField("bp_g_int",FloatType(), True), \
    StructField("bp_g_true",FloatType(), True), \
    StructField("bp_rp",FloatType(), True), \
    StructField("bp_rp_int",FloatType(), True), \
    StructField("bp_rp_true",FloatType(), True), \
    StructField("calcium",FloatType(), True), \
    StructField("carbon",FloatType(), True), \
    StructField("dec",DoubleType(), True), \
    StructField("dec_error",DoubleType(), True), \
    StructField("dec_true",DoubleType(), True), \
    StructField("dhel_true",DoubleType(), True), \
    StructField("dmod_true",DoubleType(), True), \
    StructField("e_bp_min_rp_val",FloatType(), True), \
    StructField("ebv",FloatType(), True), \
    StructField("feh",FloatType(), True), \
    StructField("g_rp",FloatType(), True), \
    StructField("g_rp_int",FloatType(), True), \
    StructField("g_rp_true",FloatType(), True), \
    StructField("helium",FloatType(), True), \
    StructField("l",FloatType(), True), \
    StructField("l_true",FloatType(), True), \
    StructField("logg",FloatType(), True), \
    StructField("lognh",FloatType(), True), \
    StructField("lum_val",FloatType(), True), \
    StructField("mact",FloatType(), True), \
    StructField("magnesium",FloatType(), True), \
    StructField("mini",FloatType(), True), \
    StructField("mtip",FloatType(), True), \
    StructField("neon",FloatType(), True), \
    StructField("nitrogen",FloatType(), True), \
    StructField("oxygen",FloatType(), True), \
    StructField("parallax",DoubleType(), True), \
    StructField("parallax_error",DoubleType(), True), \
    StructField("parallax_over_error",FloatType(), True), \
    StructField("parallax_true",DoubleType(), True), \
    StructField("parentid",LongType(), True), \
    StructField("partid",LongType(), True), \
    StructField("phot_bp_mean_mag",FloatType(), True), \
    StructField("phot_bp_mean_mag_error",FloatType(), True), \
    StructField("phot_bp_mean_mag_int",FloatType(), True), \
    StructField("phot_bp_mean_mag_true",FloatType(), True), \
    StructField("phot_g_mean_mag",FloatType(), True), \
    StructField("phot_g_mean_mag_error",FloatType(), True), \
    StructField("phot_g_mean_mag_int",FloatType(), True), \
    StructField("phot_g_mean_mag_true",FloatType(), True), \
    StructField("phot_rp_mean_mag",FloatType(), True), \
    StructField("phot_rp_mean_mag_error",FloatType(), True), \
    StructField("phot_rp_mean_mag_int",FloatType(), True), \
    StructField("phot_rp_mean_mag_true",FloatType(), True), \
    StructField("pmb_true",DoubleType(), True), \
    StructField("pmdec",DoubleType(), True), \
    StructField("pmdec_error",DoubleType(), True), \
    StructField("pmdec_true",DoubleType(), True), \
    StructField("pml_true",DoubleType(), True), \
    StructField("pmra",DoubleType(), True), \
    StructField("pmra_error",DoubleType(), True), \
    StructField("pmra_true",DoubleType(), True), \
    StructField("px_true",DoubleType(), True), \
    StructField("py_true",DoubleType(), True), \
    StructField("pz_true",DoubleType(), True), \
    StructField("ra",DoubleType(), True), \
    StructField("ra_error",DoubleType(), True), \
    StructField("ra_true",DoubleType(), True), \
    StructField("radial_velocity",DoubleType(), True), \
    StructField("radial_velocity_error",DoubleType(), True), \
    StructField("radial_velocity_true",DoubleType(), True), \
    StructField("random_index",LongType(), True), \
    StructField("silicon",FloatType(), True), \
    StructField("source_id",LongType(), True), \
    StructField("sulphur",FloatType(), True), \
    StructField("teff_val",FloatType(), True), \
    StructField("vx_true",FloatType(), True), \
    StructField("vy_true",FloatType(), True), \
    StructField("vz_true",FloatType(), True), \
    StructField("heal20id",LongType(), True), \
    StructField("htm20id",LongType(), True), \
    StructField("cx",DoubleType(), True), \
    StructField("cy",DoubleType(), True), \
    StructField("cz",DoubleType(), True), \
    #StructField("slice",IntegerType(), True), \
    StructField("px_gal_obs",DoubleType(), True), \
    StructField("px_gal_true",DoubleType(), True), \
    StructField("py_gal_obs",DoubleType(), True), \
    StructField("py_gal_true",DoubleType(), True), \
    StructField("pz_gal_obs",DoubleType(), True), \
    StructField("pz_gal_true",DoubleType(), True), \
    StructField("vx_gal_obs",DoubleType(), True), \
    StructField("vx_gal_true",DoubleType(), True), \
    StructField("vy_gal_obs",DoubleType(), True), \
    StructField("vy_gal_true",DoubleType(), True), \
    StructField("vz_gal_obs",DoubleType(), True), \
    StructField("vz_gal_true",DoubleType(), True), \
    StructField("dgal_true",DoubleType(), True), \
    StructField("dhel_obs",DoubleType(), True), \
    StructField("rho_cyl_obs",DoubleType(), True), \
    StructField("rho_cyl_true",DoubleType(), True), \
    StructField("phi_cyl_obs",DoubleType(), True), \
    StructField("phi_cyl_true",DoubleType(), True), \
    StructField("z_cyl_obs",DoubleType(), True), \
    StructField("z_cyl_true",DoubleType(), True), \
    StructField("vrho_cyl_obs",DoubleType(), True), \
    StructField("vrho_cyl_true",DoubleType(), True), \
    StructField("vphi_cyl_obs",DoubleType(), True), \
    StructField("vphi_cyl_true",DoubleType(), True), \
    StructField("vz_cyl_obs",DoubleType(), True), \
    StructField("vz_cyl_true",DoubleType(), True) 
])

In [26]:

# pick the simulation and LSR to convert
sim = 'm12f'
lsr = 'lsr_0'  # 'lsr_n' or 'test'; the test files are based on lsr_0
nlsr = 'lsr_0' if lsr == 'test' else lsr
label = f'{sim}-{nlsr}'.replace('_', '')
print(label)


path = f'/data-external/apogee-fire/{sim}/{lsr}'
hdfs_path = f'hdfs:///data/apogee_fire/{sim}/{lsr}'



print(path)
print(hdfs_path)

csv_path = 'hdfs:///data/apogee_fire/sandersont4.csv'
csv_df = spark.read.option("header", True).option("inferSchema", True).csv(csv_path)
csv_df.printSchema()
csv_df.show()

#path = '/data-external/apogee-fire/m12f/lsr_0'
#hdfs_path = 'hdfs:///data/apogee_fire/m12f/lsr_0'

fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

m12f-lsr0
/data-external/apogee-fire/m12f/lsr_0
hdfs:///data/apogee_fire/m12f/lsr_0
root
 |-- Label: string (nullable = true)
 |-- px: double (nullable = true)
 |-- py: double (nullable = true)
 |-- pz: integer (nullable = true)
 |-- vx: double (nullable = true)
 |-- vy: double (nullable = true)
 |-- vz: double (nullable = true)
 |-- v_R_LSR: double (nullable = true)
 |-- v_Z_LSR: double (nullable = true)
 |-- v_phi_LSR: double (nullable = true)

+---------+-------+----+---+---------+---------+-------+-------+-------+---------+
|    Label|     px|  py| pz|       vx|       vy|     vz|v_R_LSR|v_Z_LSR|v_phi_LSR|
+---------+-------+----+---+---------+---------+-------+-------+-------+---------+
|m12i-lsr0|    0.0| 8.2|  0| 224.7092| -20.3801| 3.8954|  -17.8|   -3.9|    224.4|
|m12i-lsr1|-7.1014|-4.1|  0| -80.4269|  191.724| 1.5039|  -24.4|   -1.5|    210.9|
|m12i-lsr2| 7.1014|-4.1|  0| -87.2735|-186.8567|-9.4608|   22.1|    9.5|    206.5|
|m12f-lsr0|    0.0| 8.2|  0| 226.1849|  14.3773|-4.

In [14]:
# JUST FOR THIS TEST
#hdfs_path = 'hdfs:///data/apogee_fire/m12f/test_true'

In [27]:

# none of this depends on the catalog data, so compute it once per survey here
# (it could also be serialized to disk and reused)


xvsun = np.array(csv_df.select("px","py","pz","vx","vy","vz").filter(csv_df.Label == label).collect())
xvsun[0]  # collect() returns a list of Rows (one per match), so the array is 2D; row 0 is the single match

# tried to do it all at once; it didn't work
#xvsun = np.array(df.columns[4:]).collect()



phi = np.pi + np.arctan2(xvsun[0][1], xvsun[0][0])
phi

rot = np.array([
    [np.cos(phi), np.sin(phi), 0.0],
    [-np.sin(phi), np.cos(phi), 0.0],
    [0.0, 0.0, 1.0]
    ])

vctr_rot = np.dot(rot, xvsun[0][3:])
vctr_rot


gcen = coord.Galactocentric(galcen_distance=8.2*u.kpc, z_sun=0.0*u.kpc,galcen_v_sun=coord.CartesianDifferential(vctr_rot*u.km/u.s))
gcen




<Galactocentric Frame (galcen_coord=<ICRS Coordinate: (ra, dec) in deg
    (266.4051, -28.936175)>, galcen_distance=8.2 kpc, galcen_v_sun=(-14.3773, 226.1849, -4.8906) km / s, z_sun=0.0 pc, roll=0.0 deg)>
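
As a sanity check on the rotation, the `galcen_v_sun` shown in the output above can be reproduced by hand. The sketch below is not the notebook's code: it uses only the standard library, `rotate_about_z` is a hypothetical helper, and the inputs are the m12f-lsr0 values from the CSV listing, with `vz` taken from the printed frame because that CSV row is cut off in the listing.

```python
import math


def rotate_about_z(vec, phi):
    """Apply the matrix [[cos, sin, 0], [-sin, cos, 0], [0, 0, 1]] to a 3-vector."""
    x, y, z = vec
    c, s = math.cos(phi), math.sin(phi)
    return (c * x + s * y, -s * x + c * y, z)


# m12f-lsr0 solar position (kpc) and velocity (km/s)
px, py = 0.0, 8.2
v_sun = (226.1849, 14.3773, -4.8906)

phi = math.pi + math.atan2(py, px)  # same angle as in the cell above
v_rot = rotate_about_z(v_sun, phi)
print([round(v, 4) for v in v_rot])
```

For this LSR the Sun sits on the +y axis, so the rotation swaps the x and y velocity components and flips one sign, giving the (-14.3773, 226.1849, -4.8906) km/s reported in the frame.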

In [None]:

write = True

files = getFileList(path)
files.sort()
for index, file in enumerate(files):
    out_path = f'{hdfs_path}/slice={index}'
    print(out_path)
    # to skip slices that were already written, use this condition instead of `True`:
    #if (fs.exists(sc._jvm.org.apache.hadoop.fs.Path(f'{out_path}/_SUCCESS'))) == False:
    #if (index == 0):
    if True:
    
        ds = sc.parallelize([file]).map(
            lambda x: (x, readH5(x))
        ).flatMapValues(
            getChunkList
        ).map(
            lambda x: (f'{x[0]}-{x[1][0]}', (x[0], x[1]))
        ).partitionBy(
            2500
        ).map(
            lambda x: chunkToRow(readH5(x[1][0]), x[1][1], vctr_rot, gcen)
        )
    
        df = spark.createDataFrame(ds)
        df.createOrReplaceTempView('arrRows')
        #df.printSchema()
        
        out_df = spark.sql('''
            SELECT a.* FROM (
                SELECT explode(arrays_zip(*)) a FROM  arrRows
            )
            ''')
        
        # re-apply the schema defined above so floats and doubles keep distinct types
        out_df = spark.createDataFrame(out_df.rdd, schema)
        #out_df.printSchema()
        
        cols = [i.name for i in out_df.schema]
        cols_renamed = [i.lower().replace('-', '_') for i in cols]
        #cols_renamed
        
        for pair in zip(cols, cols_renamed):
            out_df = out_df.withColumnRenamed(*pair)
            
        if write:
        
            sc.setJobDescription(f'write to parquet {out_path}')
        
            #out_df.show(10)

            out_df.write.mode('overwrite').parquet(out_path)
        else:
            out_df.show(10)
    else:
        print(f'{out_path} exists, skipping...')
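
Each `Row` produced by `chunkToRow` holds one list per column, covering a whole chunk, and the `explode(arrays_zip(*))` query turns that into one record per star. The plain-Python sketch below mimics that reshaping on a toy chunk; `explode_chunk` and `rename` are hypothetical helpers and the column values are made up. It also applies the same `lower().replace('-', '_')` renaming rule used in the loop.

```python
def explode_chunk(chunk_row):
    """Turn {column: [v0, v1, ...]} into a list of per-record dicts."""
    columns = list(chunk_row)
    # zip(*) walks the column lists in lockstep, like arrays_zip followed by explode
    return [dict(zip(columns, values)) for values in zip(*chunk_row.values())]


def rename(column):
    """Lower-case a column name and replace dashes with underscores."""
    return column.lower().replace('-', '_')


# Toy chunk with three stars; values are invented for illustration.
chunk_row = {
    '2MASS_magH': [12.1, 13.4, 11.9],
    'ra': [10.0, 10.5, 11.0],
    'dec': [-5.0, -5.5, -6.0],
}
records = [{rename(k): v for k, v in rec.items()} for rec in explode_chunk(chunk_row)]
for rec in records:
    print(rec)
```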

        
 

In [None]:
out_df.printSchema()
out_df.createOrReplaceTempView('outdf')

In [22]:
mydf = spark.sql('''select px_gal_obs, py_gal_obs, pz_gal_obs, 
                 vx_gal_obs, vy_gal_obs, vz_gal_obs, 
                 dgal_true, dhel_true, 
                 rho_cyl_obs, phi_cyl_obs, z_cyl_obs, 
                 vrho_cyl_obs, vphi_cyl_obs, vz_cyl_obs
                 from outdf''')

mydf.show(10)

+-------------------+-------------------+--------------------+----------+----------+----------+------------------+------------------+------------------+-------------------+--------------------+------------+------------+----------+
|         px_gal_obs|         py_gal_obs|          pz_gal_obs|vx_gal_obs|vy_gal_obs|vz_gal_obs|         dgal_true|         dhel_true|       rho_cyl_obs|        phi_cyl_obs|           z_cyl_obs|vrho_cyl_obs|vphi_cyl_obs|vz_cyl_obs|
+-------------------+-------------------+--------------------+----------+----------+----------+------------------+------------------+------------------+-------------------+--------------------+------------+------------+----------+
| -7.833710941746277| 0.5096609785983679|  0.1053370875582952|       NaN|       NaN|       NaN|10.381756353937098|12.645137661316038| 7.850272697934843|    176.27758461593|  0.1053370875582952|         NaN|         NaN|       NaN|
|  13.87824558613573| 32.416141173246785|   4.074440887625832|       NaN|   

Now we have out_df, which is ready to be written to parquet. Should we do the extra steps before or after writing to parquet?
 - change column names
 - get rid of NaNs (can be done at ingest time)
 - add healpix (maybe as a separate step)
 

## ingest stuff down here

This section should move to a different notebook.



In [33]:
hdfs_path = 'hdfs:///data/apogee_fire/m12f/lsr_0/'
testDF = spark.read.parquet(hdfs_path)
testDF.createOrReplaceTempView('testDF')


In [36]:
len(testDF.columns)

135

In [23]:


mydf = spark.sql('''select px_gal_obs, py_gal_obs, pz_gal_obs, 
                 vx_gal_obs, vy_gal_obs, vz_gal_obs, 
                 dgal_true, dhel_true, 
                 rho_cyl_obs, phi_cyl_obs, z_cyl_obs, 
                 vrho_cyl_obs, vphi_cyl_obs, vz_cyl_obs
                 from testDF''')



my_df = spark.sql('''select radial_velocity, radial_velocity_true from testDF''')
my_df.show(10)





+---------------+--------------------+
|radial_velocity|radial_velocity_true|
+---------------+--------------------+
|            NaN|   -128.853320708274|
|            NaN| -45.082068589629046|
|            NaN|  -42.79229241661227|
|            NaN|   4.565146946236344|
|            NaN| -4.1852082420782954|
|            NaN|  -6.307093176029351|
|            NaN|    1.78165005474733|
|            NaN| -0.8437668269285681|
|            NaN| -14.762100855508505|
|            NaN|   8.546889179502031|
+---------------+--------------------+
only showing top 10 rows

In [34]:
import pyspark.sql.functions as F

# replace NaN with the sentinel -9999 in every column
columns = testDF.columns
for column in columns:
    testDF = testDF.withColumn(column,F.when(F.isnan(F.col(column)),F.lit(-9999)).otherwise(F.col(column)))
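
The per-value rule applied by the loop above is simple enough to state in plain Python. This is only an illustration of the rule, not a replacement for the Spark expression; `nan_to_sentinel` is a hypothetical helper and the sample values are taken from the `radial_velocity` output shown earlier.

```python
import math

SENTINEL = -9999  # same placeholder value as F.lit(-9999) in the loop above


def nan_to_sentinel(value):
    """Return SENTINEL for a float NaN and the value unchanged otherwise."""
    if isinstance(value, float) and math.isnan(value):
        return SENTINEL
    return value


row = {'radial_velocity': float('nan'), 'radial_velocity_true': -128.853320708274}
cleaned = {name: nan_to_sentinel(value) for name, value in row.items()}
print(cleaned)
```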

In [37]:
server_name = "jdbc:sqlserver://sdss4c:1433"
#database_name= "fire_m12f_lsr2"
database_name = "apogee_fire_test"
url = server_name + ";" + "databaseName=" + database_name + ";"

table_name = "m12f_lsr0"
username = "connector_user"
password = "password123!#" # Please specify password here

try:
  testDF.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .option("truncate","true") \
    .option("reliabilityLevel","BEST_EFFORT") \
    .option("tableLock","false") \
    .option("mssqlIsolationLevel", "READ_UNCOMMITTED") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("batchsize", "1048576") \
    .save()
except ValueError as error :
    print("Connector write failed", error)

In [None]:
%%cleanup -f