# Spark DataFrame -> TensorFlow Dataset

This notebook serves as a playground for testing `oarphpy.spark.spark_df_to_tf_dataset()`. See also the unit tests for this utility.

In [1]:
# Common imports and setup
from oarphpy.spark import NBSpark
from oarphpy.spark import spark_df_to_tf_dataset
from oarphpy import util

import os
import random
import sys

import numpy as np
import tensorflow as tf
from pyspark.sql import Row

spark = NBSpark.getOrCreate()

  __import__('pkg_resources').declare_namespace(__name__)
2019-12-27 21:01:03,560	oarph 336 : Trying to auto-resolve path to src root ...
2019-12-27 21:01:03,561	oarph 336 : Using source root /opt/oarphpy 
2019-12-27 21:01:03,589	oarph 336 : Generating egg to /tmp/op_spark_eggs_e2392756-5287-4e0e-bdb3-3bc52ee6cde4 ...
2019-12-27 21:01:03,641	oarph 336 : ... done.  Egg at /tmp/op_spark_eggs_e2392756-5287-4e0e-bdb3-3bc52ee6cde4/oarphpy-0.0.0-py3.6.egg


## Test on a "large" 2GB random dataset

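Create the dataset. Each record carries a vector `y` of 2^15 random floats, and the Parquet output is partitioned by the `part` column. The write is skipped if `DATASET_PATH` already contains data.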

In [2]:
NUM_RECORDS = 1000

DATASET_PATH = '/tmp/spark_df_to_tf_dataset_test_large'

def gen_data(n):
  # Import locally to keep the function self-contained when it runs on Spark workers
  import numpy as np
  # Each record carries 2 ** 15 random floats
  y = np.random.rand(2 ** 15).tolist()
  return Row(part=n % 100, id=str(n), x=1, y=y)

rdd = spark.sparkContext.parallelize(range(NUM_RECORDS))
df = spark.createDataFrame(rdd.map(gen_data))

# Only write the dataset if the target path is missing or empty
if util.missing_or_empty(DATASET_PATH):
  df.write.parquet(DATASET_PATH, partitionBy=['part'], mode='overwrite')

In [3]:
%%bash -s "$DATASET_PATH"
du -sh $1

2.7M	/tmp/spark_df_to_tf_dataset_test_large
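
As a rough sanity check on the size reported above, the raw size of the `y` column can be estimated from the cell that builds the dataset. The sketch below assumes 8 bytes per value (float64) and ignores Parquet encoding, compression, and the other columns, so on-disk sizes will differ. The helper names `estimate_raw_bytes` and `records_for_bytes` are made up for this illustration.

```python
# Back-of-the-envelope size estimate for the `y` column.
# Assumption: each value takes 8 bytes (float64); Parquet overhead and
# compression are ignored, so real on-disk sizes will differ.
VALUES_PER_RECORD = 2 ** 15
BYTES_PER_VALUE = 8

def estimate_raw_bytes(num_records):
  """Hypothetical helper: raw bytes of `y` across `num_records` rows."""
  return num_records * VALUES_PER_RECORD * BYTES_PER_VALUE

def records_for_bytes(target_bytes):
  """Hypothetical helper: rows needed to reach `target_bytes`, rounded up."""
  bytes_per_record = VALUES_PER_RECORD * BYTES_PER_VALUE
  return -(-target_bytes // bytes_per_record)  # ceiling division

for n in (10, 1000):
  print("%d records: %.1f MiB" % (n, estimate_raw_bytes(n) / 2 ** 20))
print("Records for 2 GiB: %d" % records_for_bytes(2 * 2 ** 30))
```

Under these assumptions, 10 records come to about 2.5 MiB, which is in line with the 2.7M that `du` reports, while 1000 records would come to about 250 MiB.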


Test reading the dataset through TensorFlow

In [4]:
udf = spark.read.parquet(DATASET_PATH)
n_expect = udf.count()
print("Have %s rows" % n_expect)

ds = spark_df_to_tf_dataset(
        udf,
        'part',
        spark_row_to_tf_element=lambda r: (r.x, r.id, r.y),
        tf_element_types=(tf.int64, tf.string, tf.float64))

n = 0
t = util.ThruputObserver(name='test_spark_df_to_tf_dataset_large')
with util.tf_data_session(ds) as (sess, iter_dataset):
  t.start_block()
  for actual in iter_dataset():
    n += 1
    t.update_tallies(n=1)
    for value in actual:
      t.update_tallies(num_bytes=sys.getsizeof(value))
    t.maybe_log_progress()
  t.stop_block()

print("Read %s records" % n)
assert n == n_expect

Have 10 rows
getting shards
10 [1, 6, 3, 5, 9, 4, 8, 7, 2, 0]


2019-12-27 21:02:21,279	oarph 336 : Reading partition 3 
2019-12-27 21:02:21,280	oarph 336 : Reading partition 0 
2019-12-27 21:02:21,281	oarph 336 : Reading partition 1 
2019-12-27 21:02:21,281	oarph 336 : Reading partition 6 
2019-12-27 21:02:21,283	oarph 336 : Reading partition 8 
2019-12-27 21:02:21,284	oarph 336 : Reading partition 4 
2019-12-27 21:02:21,287	oarph 336 : Reading partition 5 
2019-12-27 21:02:21,287	oarph 336 : Reading partition 2 
2019-12-27 21:02:21,288	oarph 336 : Reading partition 7 
2019-12-27 21:02:21,294	oarph 336 : Reading partition 9 
2019-12-27 21:02:30,044	oarph 336 : Done reading partition 5, stats:
 Partition 5 [Pid:336 Id:140641112145760]
----------  -------------------
Thruput
N thru      1
N chunks    1
Total time  8.73 seconds
Total thru  786.52 KB
Rate        90.11 KB / sec
Hz          0.11456897938921241
----------  -------------------
2019-12-27 21:02:30,047	oarph 336 : Progress for 
spark_tf_dataset [Pid:336 Id:140639257511920]
-----------------

2019-12-27 21:02:30,233	oarph 336 : 
Partition 6 [Pid:336 Id:140641112116360]
----------  ------------------
Thruput
N thru      1
N chunks    1
Total time  8.94 seconds
Total thru  786.52 KB
Rate        88.01 KB / sec
Hz          0.1118984490328969
----------  ------------------

2019-12-27 21:02:30,568	oarph 336 : Done reading partition 8, stats:
 Partition 8 [Pid:336 Id:140641112117200]
----------  -------------------
Thruput
N thru      1
N chunks    1
Total time  9.28 seconds
Total thru  786.52 KB
Rate        84.78 KB / sec
Hz          0.10779677057703793
----------  -------------------
2019-12-27 21:02:30,572	oarph 336 : Done reading partition 7, stats:
 Partition 7 [Pid:336 Id:140641112142008]
----------  -------------------
Thruput
N thru      1
N chunks    1
Total time  9.28 seconds
Total thru  786.52 KB
Rate        84.77 KB / sec
Hz          0.10777922258165155
----------  -------------------
2019-12-27 21:02:30,606	oarph 336 : Done reading partition 0, stats:
 Partition 0 [P

Read 10 records
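
The per-partition tables above report a record count, a total time, a byte total, a rate, and a frequency in Hz. A minimal sketch of how such figures relate to one another follows. It is not the implementation of `util.ThruputObserver`: the class `SimpleThruput` is hypothetical, and it only mirrors the `start_block()`, `stop_block()`, and `update_tallies()` calls used in the cell above.

```python
import time

class SimpleThruput(object):
  """Hypothetical stand-in for a throughput tally; not oarphpy's class."""

  def __init__(self):
    self.n = 0
    self.num_bytes = 0
    self.total_seconds = 0.0
    self._start = None

  def start_block(self):
    self._start = time.perf_counter()

  def stop_block(self):
    # Accumulate the wall-clock time spent since start_block()
    self.total_seconds += time.perf_counter() - self._start
    self._start = None

  def update_tallies(self, n=0, num_bytes=0):
    self.n += n
    self.num_bytes += num_bytes

  @property
  def bytes_per_sec(self):
    return self.num_bytes / self.total_seconds if self.total_seconds else 0.0

  @property
  def hz(self):
    return self.n / self.total_seconds if self.total_seconds else 0.0

# Usage: tally 100 fake 1024-byte records
t = SimpleThruput()
t.start_block()
for record in (b"x" * 1024 for _ in range(100)):
  t.update_tallies(n=1, num_bytes=len(record))
t.stop_block()
print("Read %d records, %d bytes" % (t.n, t.num_bytes))
print("Rate: %.1f bytes / sec, Hz: %.1f" % (t.bytes_per_sec, t.hz))
```

With one record read in roughly 8.73 seconds, the same arithmetic gives about 0.115 Hz, matching the value logged for partition 5. Note that the notebook tallies bytes with `sys.getsizeof`, which measures the in-memory size of each Python object rather than its size on disk.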
