In [9]:
# run first. then have fun.
from pyspark.sql.functions import col, current_timestamp, to_date, datediff
# stats and agg functions
from pyspark.sql.functions import count, session_window, window, sum, min, max, percentile_approx

from delta.tables import DeltaTable

# keep the default compression codec as zstd
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

# common dirs, paths
dataset_dir = '/opt/spark/work-dir/hitchhikers_guide/datasets/ecomm_behavior_data'
delta_path = f"{dataset_dir}/delta"

# managed table information (from 100-streaming-first-steps)
dl_unmanaged_table = "ecomm"
dl_managed_table = "ecomm_by_day"

In [10]:
spark.conf.get("spark.sql.warehouse.dir")

'file:/opt/spark/work-dir/hitchhikers_guide/warehouse'

# Intro to Delta Lake Streaming
The following section will reuse the **Delta Lake** `default.ecomm_by_day` table created during [Streaming First Steps](./streaming-first-steps.ipynb).

> note: run the following cell to check if you have the local table. You should see `[Table(name='ecomm_by_day', database='default', description=None, tableType='MANAGED', isTemporary=False)]` somewhere in the list (if you have more than one from the work in the Guide)

In [11]:
# a few helpful methods for setting the local context
# setCurrentDatabase works like `use that_database` in SQL
# (with unity catalog you would also use spark.catalog.setCurrentCatalog....)
spark.catalog.setCurrentDatabase("default")
spark.catalog.listTables()
spark.catalog.tableExists(dl_unmanaged_table)
spark.table(dl_managed_table)


DataFrame[event_time: timestamp, event_type: string, product_id: int, category_id: bigint, category_code: string, brand: string, price: float, user_id: int, user_session: string, event_date: date]

> Note: If you see `java.sql.SQLException: Failed to start database 'metastore_db' with class loader jdk.internal.loader.ClassLoaders$AppClassLoader...` then you need to detach the `kernel` from the other notebook you have open. You can only have one notebook running with the local Metastore.

## Successful Streaming Begins with Metadata (lots and lots of metadata)
> In other words, you can't stream successfully from a table you don't understand. How is the table laid out? What is its structure (columns, types)? Is the table narrow or wide? Do you know what any of the columns actually are?

Remember, when lost or in doubt, always consult the data (metadata). To peek at the table metadata, use `detail()`:
* Use `DeltaTable.forName(spark, 'catalog.schema.table|schema.table|table').detail()`
* or `DeltaTable.forPath(spark, '/path/to/table/').detail()` for Unmanaged tables.

In [12]:
## Starting Small (Baby Steps)

dt_ecomm = DeltaTable.forName(spark, dl_managed_table)
table_details = dt_ecomm.detail()

# go on, take a peek (no one's looking)
table_details.printSchema()

root
 |-- format: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- location: string (nullable = true)
 |-- createdAt: timestamp (nullable = true)
 |-- lastModified: timestamp (nullable = true)
 |-- partitionColumns: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- numFiles: long (nullable = true)
 |-- sizeInBytes: long (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- minReaderVersion: integer (nullable = true)
 |-- minWriterVersion: integer (nullable = true)
 |-- tableFeatures: array (nullable = true)
 |    |-- element: string (containsNull = true)



### Table Details. Providing you with all the ... well details
Scanning the StructType of the `detail()` dataframe gives you a lot of data. The following use cases can be solved with the metadata:

```
root
 |-- format: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- description: string (nullable = true)
 |-- location: string (nullable = true)
 |-- createdAt: timestamp (nullable = true)
 |-- lastModified: timestamp (nullable = true)
 |-- partitionColumns: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- numFiles: long (nullable = true)
 |-- sizeInBytes: long (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- minReaderVersion: integer (nullable = true)
 |-- minWriterVersion: integer (nullable = true)
 |-- tableFeatures: array (nullable = true)
 |    |-- element: string (containsNull = true)
```

1. **Calculate Table Freshness**: `abs(current_time() - {table.lastModified})` answers the universal question, "How Fresh Is It?"
2. **How Fast is the Table Growing?**: Size does matter. If we have two tables, where tableA is 100gb with a `createdAt` of one year ago and tableB is 100gb but was created yesterday, then tableB is a scalability monster. Using the `freshness` technique, you can calculate the number of `days` a table has `existed`, then divide `sizeInBytes` by that number to get the `avg` bytes per day.
3. **What is the Table Telling Us?**: Using the `properties` map, we can easily view ALL Table Properties, including those used to `automate` Delta Lake, like `delta.logRetentionDuration`, and those *we bring to the table* (pun truly intended), like `catalog.team_name`.
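
The first two use cases boil down to simple date arithmetic. The following is a minimal sketch in plain Python, with no Spark session required. The helper name `table_freshness_and_growth` is hypothetical, the two tables are the made-up tableA and tableB from the list above, and a table younger than one day is treated as one day old so the division never hits zero.

```python
from datetime import datetime

def table_freshness_and_growth(created_at, last_modified, size_in_bytes, now):
    """Return (stale_seconds, age_days, avg_bytes_per_day) for a table."""
    # freshness: how long ago was the table last modified?
    stale_seconds = abs((now - last_modified).total_seconds())
    # assumption: a table younger than one day counts as one day old
    age_days = max((now - created_at).days, 1)
    avg_bytes_per_day = size_in_bytes / age_days
    return stale_seconds, age_days, avg_bytes_per_day

# hypothetical tables: both hold 100gb (decimal), only createdAt differs
now = datetime(2024, 6, 3, 22, 0, 0)
hundred_gb = 100 * 1000**3
_, old_age, old_rate = table_freshness_and_growth(
    datetime(2023, 6, 3, 22, 0, 0), now, hundred_gb, now)
_, new_age, new_rate = table_freshness_and_growth(
    datetime(2024, 6, 2, 22, 0, 0), now, hundred_gb, now)

print(f"tableA: {old_age} days old, ~{old_rate / 1000**2:.0f} MB/day")
print(f"tableB: {new_age} days old, ~{new_rate / 1000**2:.0f} MB/day")
```

The same three inputs (`createdAt`, `lastModified`, `sizeInBytes`) come straight from the `detail()` row, so the helper could be fed from `table_details.first()` once you have a live table.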

In [13]:
# Feel Free to Mess with the following cell to get used to the data available to you about the ecomm_by_day table.
from pyspark.sql.functions import col, current_timestamp, to_date, datediff
tbl_dets = (
    table_details
    .withColumn("now", current_timestamp())
    .withColumn("todaysDate", to_date(col("now")))
    .withColumn("ageInDays", datediff(col("todaysDate"),to_date("createdAt")))
    .withColumn("staleDays", datediff(col("todaysDate"),to_date("lastModified")))
)
# view all the time-based info on the table.
(tbl_dets
 .select(
     "now",
     "todaysDate",
     "createdAt",
     "lastModified",
     "ageInDays",
     "staleDays")
 .show(truncate=False)
)

# fetch the dataframe as a local Row
dets = tbl_dets.first()
# see it's a Row...<class 'pyspark.sql.types.Row'>
#print(type(dets))
team_name = dets['properties']['catalog.team_name']
team_slack = dets['properties']['catalog.engineering.comms.slack']
table_classification = dets['properties']['catalog.table.classification']

# stick to the details
print(f"""
Don't Panic!\n
The table '{dets.name}' has a known classification of '{table_classification}'.\n
The table is owned by the following team '{team_name}'.\n
If we need contact them via slack @ {team_slack}
""")

# or remember not to panic, everything is under control
#print(f"""
#I am no longer panicking.\n 
#Why you ask?\n
#I know that I can count on {team_name} to deliver gold data, otherwise...\n
#to slack ({team_slack}) we ride questions in hand about the TABLE {dets.name}.\n
#Which happened to be created on {dets.createdAt} and last updated at {dets.lastModified}...
#""")

+--------------------------+----------+-----------------------+-----------------------+---------+---------+
|now                       |todaysDate|createdAt              |lastModified           |ageInDays|staleDays|
+--------------------------+----------+-----------------------+-----------------------+---------+---------+
|2024-06-03 22:00:18.006635|2024-06-03|2024-06-03 21:51:13.192|2024-06-03 21:51:16.791|0        |0        |
+--------------------------+----------+-----------------------+-----------------------+---------+---------+


Don't Panic!

The table 'spark_catalog.default.ecomm_by_day' has a known classification of 'all-access'.

The table is owned by the following team 'dldg_authors'.

If we need contact them via slack @ https://delta-users.slack.com/archives/CG9LR6LN4



## Inspecting the Volume, Size, and Characteristics of a Delta Table
> There are many uses for math in a career as a data engineer. One of them is back-of-the-envelope (or mostly right) maths!

In [14]:
bytesToMB = 1000000
(tbl_dets
 .select(
     col("numFiles"),
     (col("sizeInBytes")/bytesToMB)
       .alias("TableSizeInMegaBytes"),
     ((col("sizeInBytes")/bytesToMB)/col("numFiles"))
       .alias("avgMBPerFile")
 ).show()
)

+--------+--------------------+------------+
|numFiles|TableSizeInMegaBytes|avgMBPerFile|
+--------+--------------------+------------+
|       1|            0.005801|    0.005801|
+--------+--------------------+------------+



## What We've Learned about the Dataset
> Note: The following information is based on the 'complete' ecomm dataset (the full 15gb csv). 807mb is the size on disk after zstd compression and Delta encoding.

1. The naive average megabytes per file is around `17mb`. If you run `ls -lh` across any given day, you'll see more of an odd split, between say 3mb and 18mb, because the table data on disk is neither optimized nor bin-packed.
    - (or, very *improbably*, you may see exactly the value `0.0058685MB` if you are using the ***-sm*** dataset)
2. There are `142` files taking up `~2.4gb` for the `entire` table.
    - (or, even more improbably, exactly `4` files taking up `0.023474MB` for the whole table)
3. There are probably many more `rows` of data in the table, so getting a 'quick' count is a good idea too. That gives us more `approximate` math to work with (rows/day). Even if we are off, we are better informed with approximate math than with wild guesses and hopes and dreams.
    - (unless we are looking for Magrathea or being attacked by Vogons)
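
The approximate math described above needs only four numbers. Here is a small sketch in plain Python; the helper name `envelope_math` is hypothetical, and it assumes one partition equals one day (true for a table partitioned by `event_date`). The sample inputs are the small-dataset values reported in this notebook's cell outputs (64 rows, 1 partition, 1 file, 5801 bytes); your numbers may differ.

```python
def envelope_math(total_rows, total_partitions, num_files, size_in_bytes):
    """Cheap approximations; assumes one partition holds one day of data."""
    def safe_div(a, b):
        # return 0 instead of raising when the denominator is 0
        return a / b if b else 0

    rows_per_day = safe_div(total_rows, total_partitions)
    return {
        "rows_per_day": rows_per_day,
        "files_per_partition": safe_div(num_files, total_partitions),
        "avg_row_bytes": safe_div(size_in_bytes, total_rows),
        "rows_per_file": safe_div(total_rows, num_files),
        "rows_per_second": rows_per_day / 86400,
    }

# values taken from the small dataset's cell outputs in this notebook
stats = envelope_math(total_rows=64, total_partitions=1, num_files=1, size_in_bytes=5801)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Keeping the math in a plain function makes it easy to re-run the estimate whenever `numFiles` or `sizeInBytes` changes, without recounting the table unless you want a fresh row count.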


In [15]:
# convert the DeltaTable reference to a DataFrame
seconds_in_day = 86400
dt_as_df = dt_ecomm.toDF()
total_rows = dt_as_df.count()

# calculate the total number of partitions
# cheating since we are just taking the first of many (or fail if none - we know it is event_date, but still)
partitionCol = (dt_ecomm
                .detail()
                .first()["partitionColumns"][0])
total_partitions = (dt_as_df
                    .select(col(partitionCol))
                    .distinct()
                    .count())
avg_files_per_partition = dets['numFiles']/total_partitions if total_partitions != 0 else 0
# each partition is one event_date, so rows per day = total rows / total partitions
rows_per_day = total_rows/total_partitions if total_partitions != 0 else 0
avg_row_size = dets['sizeInBytes']/total_rows if total_rows != 0 else 0

# the * denotes maybe accurate, maybe really off. 
# - This is mostly cheap approximations (and maybe accurate math)
print(f"""
The Table has {total_rows} rows.\n
*Daily Rows of {rows_per_day}\n
Total Partitions in Table {total_partitions}\n
*Avg Files per Partition {avg_files_per_partition}\n
*Avg Row Size {avg_row_size} in Bytes\n
*Avg Rows per Delta Lake File: {rows_per_day/avg_files_per_partition if avg_files_per_partition != 0 else 0}\n
Records per Second: {rows_per_day/seconds_in_day}\n
Records per Hour: {rows_per_day/24}
""")


The Table has 64 rows.

*Daily Rows of 64.0

Total Partitions in Table 1

*Avg Files per Partition 1.0

*Avg Row Size 90.640625 in Bytes

*Avg Rows per Delta Lake File: 64.0

Records per Second: 0.0007407407407407407

Records per Hour: 2.6666666666666665



# Our First Delta Lake Streaming Operation
> Clap your Hands! Or Celebrate However you want. It's time to be Streaming

Because we potentially have a gigantic amount of data (or, depending on the adventure you chose, a smaller set of 60; yes, it should have been 42, but time...), it is time to create our first streaming application.

## What We'll Need
1. A Place to Store our Application Metadata. Luckily we have our Local File System, so we can just store the application data there for now. (See the [common application directory](../../applications/README.md) to understand a little more.)
2. A [Way of Restricting the Volume of Data We Read](https://docs.delta.io/latest/delta-streaming.html#limit-input-rate)
3. A [Means of Ignoring Things](https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes) we don't currently care about.
4. A Way of Limiting the Frequency at which our Application Runs. Just as we want to limit the volume of data when we start learning how to work with Streaming Data, it is better to start slowly and increase the rate gradually, which we will learn how to do.
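
Items 2 through 4 map onto a handful of options. The sketch below collects them in plain dictionaries so you can see them side by side before touching Spark. The helper name `delta_source_options` is hypothetical; the option keys (`maxFilesPerTrigger`, `maxBytesPerTrigger`, `ignoreDeletes`, `ignoreChanges`) are the ones described in the Delta Lake streaming documentation linked above, and the values are stored as strings because Spark treats reader options as strings.

```python
def delta_source_options(max_files_per_trigger=None, max_bytes_per_trigger=None,
                         ignore_deletes=False, ignore_changes=False):
    """Build reader options for a rate-limited Delta streaming source."""
    options = {}
    # limit how much data each micro-batch reads
    if max_files_per_trigger is not None:
        options["maxFilesPerTrigger"] = str(max_files_per_trigger)
    if max_bytes_per_trigger is not None:
        options["maxBytesPerTrigger"] = str(max_bytes_per_trigger)
    # ignore upstream changes we don't currently care about
    if ignore_deletes:
        options["ignoreDeletes"] = "true"
    if ignore_changes:
        options["ignoreChanges"] = "true"
    return options

source_options = delta_source_options(max_files_per_trigger=4, ignore_changes=True)
# limit how often the application runs (or use {"availableNow": True})
trigger_options = {"processingTime": "30 seconds"}
print(source_options)
print(trigger_options)

# With a live SparkSession these would be applied roughly like this:
#   df = spark.readStream.format("delta").options(**source_options).table("ecomm_by_day")
#   df.writeStream.trigger(**trigger_options)...
```

Starting with a small `maxFilesPerTrigger` and a slow trigger keeps each micro-batch easy to reason about; both can be raised once the pipeline behaves as expected.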

In [16]:
#spark.sql("drop table default.ecomm_aggs_table")

In [17]:
# read from the `default.ecomm_by_day` table, modify the read options to limit the maxFilesPerTrigger
# read up to 4 files per micro-batch, do a simple projection (select colA, colB),
# then count the events per 30 minute window and write out to a new Delta Lake table.
# Checkpoint the progress so we can `pick up where we left off`

app_name = "dl_streaming_aggs"
app_version = "v0.0.1"
checkpoint_dir = "../../applications"
checkpoint_path = f"{checkpoint_dir}/{app_name}/{app_version}/_checkpoints"
print(f"checkpoint_path={checkpoint_path}")
ecomm_aggs_table = 'default.ecomm_aggs_table'

spark.conf.set("spark.sql.shuffle.partitions", "32")
# create the streaming Delta source object
ecomm_by_day_limited = (
    spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 4)
    .option("ignoreChanges", True)
    .option("withEventTimeOrder", True)
    .table(dl_managed_table)
)

# view the schema for the table (since we know everything else about it now too)
ecomm_by_day_limited.printSchema()

# next select the columns we care about (feel free to switch things up here too)
ecomm_aggs = (
    ecomm_by_day_limited
    .withWatermark("event_time", '10 minutes')
    .select("event_time", "event_type", "product_id", "user_session", "user_id", "event_date")
    .groupBy(window("event_time", "30 minutes"), "user_id", "product_id", "event_date")
    .agg(count("event_type").alias('session_events'))
)

# next create the streaming sink

streamingQuery = (
    ecomm_aggs.writeStream
    .format("delta")
    .queryName("ecomm_aggs")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .partitionBy("event_date")
    .option("overwriteSchema", False)
    # triggers allow us to control the frequency at which a job will run.
    # For the java nerds (me included) triggers run like scheduledThreadPools when using `processingTime`,
    # while `availableNow` processes everything that is currently available and then the job will complete.
    #.trigger(processingTime='42 seconds')
    .trigger(availableNow=True)
    .toTable(ecomm_aggs_table)
)

checkpoint_path=../../applications/dl_streaming_aggs/v0.0.1/_checkpoints
root
 |-- event_time: timestamp (nullable = true)
 |-- event_type: string (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- category_id: long (nullable = true)
 |-- category_code: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- price: float (nullable = true)
 |-- user_id: integer (nullable = true)
 |-- user_session: string (nullable = true)
 |-- event_date: date (nullable = true)



24/06/03 22:00:21 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `spark_catalog`.`default`.`ecomm_aggs_table` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
24/06/03 22:00:21 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


## Controlling the StreamingQuery
1. We returned a `streamingQuery` object when we executed the previous cell. The StreamingQuery object provides you with a gateway into the realtime metrics and runtime behavior of your Delta-Spark based application.

2. With `.trigger(availableNow=True)`, as used in the previous cell, the job processes everything that is currently available and then stops. If you switch to a `processingTime` trigger of `30s` instead, the application triggers twice a minute, so every tick we'll have more data as the job slowly chews through the 72 files of the data set, pulling in around 600k records per tick.

Take a look at the metadata provided to you by the `streamingQuery`. Think about how impressive the numbers are.

In [28]:
streamingQuery.lastProgress

{'id': '3c9e757e-e2ee-4930-bcf2-82568d9c2084',
 'runId': '84eb0639-e11a-4214-b8b5-ae265245f82b',
 'name': 'ecomm_aggs',
 'timestamp': '2024-06-03T22:00:27.137Z',
 'batchId': 1,
 'numInputRows': 0,
 'inputRowsPerSecond': 0.0,
 'processedRowsPerSecond': 0.0,
 'durationMs': {'addBatch': 2067,
  'commitOffsets': 41,
  'getBatch': 0,
  'latestOffset': 35,
  'queryPlanning': 18,
  'triggerExecution': 2209,
  'walCommit': 43},
 'eventTime': {'watermark': '2019-09-30T23:50:50.000Z'},
 'stateOperators': [{'operatorName': 'stateStoreSave',
   'numRowsTotal': 58,
   'numRowsUpdated': 0,
   'allUpdatesTimeMs': 29,
   'numRowsRemoved': 0,
   'allRemovalsTimeMs': 47,
   'commitTimeMs': 2426,
   'memoryUsedBytes': 31480,
   'numRowsDroppedByWatermark': 0,
   'numShufflePartitions': 32,
   'numStateStoreInstances': 32,
   'customMetrics': {'loadedMapCacheHitCount': 64,
    'loadedMapCacheMissCount': 0,
    'stateOnCurrentVersionSizeBytes': 17496}}],
 'sources': [{'description': 'DeltaSource[file:/opt/

In [29]:
lprog = streamingQuery.lastProgress
input_rows_sec = lprog['inputRowsPerSecond']
processed_rows_sec = lprog['processedRowsPerSecond']

print(f"""
input_rows_a_second:{input_rows_sec}\n
processed_rows_a_second: {processed_rows_sec}\n
""")


input_rows_a_second:0.0

processed_rows_a_second: 0.0




^^ The prior output from `streamingQuery.lastProgress` (the same progress report a StreamingQueryListener receives) is an aggregation of the runtime metadata and statistics captured during the last microBatch. In the `sources` section of the full output, you'll notice that we started at index 16 and the end offset was index 17.
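
The progress report is a nested dictionary, so the headline numbers are easy to pull out. The following is a minimal sketch, assuming `lastProgress` returns a plain `dict` (as it does in the cells above) or `None` when no micro-batch has completed yet. The helper name `summarize_progress` is hypothetical, and the sample is a trimmed copy of the output shown earlier.

```python
def summarize_progress(progress):
    """Pick a few headline numbers out of a lastProgress-style dictionary."""
    if progress is None:
        # no micro-batch has reported progress yet
        return None
    state_rows = sum(op.get("numRowsTotal", 0)
                     for op in progress.get("stateOperators", []))
    return {
        "batch_id": progress.get("batchId"),
        "input_rows": progress.get("numInputRows", 0),
        "trigger_ms": progress.get("durationMs", {}).get("triggerExecution"),
        "watermark": progress.get("eventTime", {}).get("watermark"),
        "state_rows": state_rows,
    }

# trimmed copy of the progress output shown above
sample = {
    "name": "ecomm_aggs",
    "batchId": 1,
    "numInputRows": 0,
    "durationMs": {"addBatch": 2067, "triggerExecution": 2209},
    "eventTime": {"watermark": "2019-09-30T23:50:50.000Z"},
    "stateOperators": [{"operatorName": "stateStoreSave", "numRowsTotal": 58}],
}
summary = summarize_progress(sample)
print(summary)
```

In the notebook you could call `summarize_progress(streamingQuery.lastProgress)` after each run to keep an eye on batch duration and state size.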

# Viewing the Delta Lake Information in the Streaming Query Stats
```
'startOffset': {
  'sourceVersion': 1,
  'reservoirId': '027b3701-5c07-46d4-9d96-e5539f81e8bf',
  'reservoirVersion': 33,
  'index': 16,
  'isStartingVersion': True},
'endOffset': {
  'sourceVersion': 1,
  'reservoirId': '027b3701-5c07-46d4-9d96-e5539f81e8bf',
  'reservoirVersion': 33,
  'index': 17,
  'isStartingVersion': True
}
```

This means we can take a look at the operations in the `/_checkpoints/offsets/17` file.

```
v1
{"batchWatermarkMs":1570578599000,"batchTimestampMs":1687853100013,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.join.stateFormatVersion":"2","spark.sql.streaming.stateStore.compression.codec":"lz4","spark.sql.streaming.stateStore.rocksdb.formatVersion":"5","spark.sql.streaming.statefulOperator.useStrictDistribution":"true","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
{"sourceVersion":1,"reservoirId":"027b3701-5c07-46d4-9d96-e5539f81e8bf","reservoirVersion":33,"index":17,"isStartingVersion":true}
```
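
The offset file is plain text: a version line, one JSON line of batch metadata, then one line per source. The sketch below parses that layout in plain Python. The helper name `parse_offset_file` is hypothetical, the sample is a shortened copy of the file above (the `conf` map is trimmed to one entry), and treating a lone `-` as "no offset for this source" is an assumption about Spark's offset log format.

```python
import json

def parse_offset_file(text):
    """Split a checkpoint offset file into (version, batch metadata, source offsets)."""
    lines = [line for line in text.splitlines() if line.strip()]
    version = lines[0]
    batch_metadata = json.loads(lines[1])
    # assumption: a lone "-" marks a source that has no offset for this batch
    source_offsets = [None if line.strip() == "-" else json.loads(line)
                      for line in lines[2:]]
    return version, batch_metadata, source_offsets

# shortened copy of the offset file shown above
sample = """v1
{"batchWatermarkMs":1570578599000,"batchTimestampMs":1687853100013,"conf":{"spark.sql.shuffle.partitions":"200"}}
{"sourceVersion":1,"reservoirId":"027b3701-5c07-46d4-9d96-e5539f81e8bf","reservoirVersion":33,"index":17,"isStartingVersion":true}
"""
version, batch_metadata, source_offsets = parse_offset_file(sample)
delta_offset = source_offsets[0]
print(version)
print(batch_metadata["batchWatermarkMs"])
print(delta_offset["reservoirVersion"], delta_offset["index"])
```

The `reservoirVersion` and `index` pair tells you which Delta table version, and which position within it, the stream will resume from.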

In [30]:
streamingQuery.status

{'message': 'Stopped', 'isDataAvailable': False, 'isTriggerActive': False}

In [31]:
streamingQuery.stop()

## Applications have State in the form of Checkpoints. 
> Delta maintains its state in terms of completed atomic transactions.

The application checkpoints track where the application last successfully read from the Delta Lake table (the source). The application also keeps track of the Delta version that results from the transformation and insert into the sink. In our case we read from `default.ecomm_by_day`, did some windowed aggregations for events per session, and then recorded the results in a new table named `default.ecomm_aggs_table`.

Let's peek at the checkpoint data. Open up `

## View the Checkpoint Data
> When you are managing a streaming application, you will need to be familiar with both the Delta Log and your application's own 'transaction' history, which is stored in the 'checkpoints'.
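
If you would rather inspect the checkpoint from Python than from the shell, the following sketch finds the most recent commit by batch number. The helper name `latest_commit` is hypothetical, and the demo runs against a throwaway directory that mimics the `commits` layout; the contents of the fake batch `0` file and the `.crc` placeholder are made up for illustration.

```python
from pathlib import Path
import tempfile

def latest_commit(checkpoint_path):
    """Return (batch_id, file contents) for the highest-numbered commit, or None."""
    commits_dir = Path(checkpoint_path) / "commits"
    # commit files are named by micro-batch id; skip hidden files such as .crc checksums
    batch_files = [p for p in commits_dir.iterdir() if p.name.isdigit()]
    if not batch_files:
        return None
    newest = max(batch_files, key=lambda p: int(p.name))
    return int(newest.name), newest.read_text()

# demo against a temporary directory that mimics the checkpoint layout
with tempfile.TemporaryDirectory() as tmp:
    commits = Path(tmp) / "commits"
    commits.mkdir()
    (commits / "0").write_text('v1\n{"nextBatchWatermarkMs":0}')
    (commits / "1").write_text('v1\n{"nextBatchWatermarkMs":1569887450000}')
    (commits / ".1.crc").write_text("checksum placeholder")
    batch_id, contents = latest_commit(tmp)
    print(batch_id)
    print(contents)
```

Sorting by the numeric file name rather than by modification time keeps the result stable even if the files are copied or restored from a backup.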

In [32]:
%%bash
ls -l ../../applications/dl_streaming_aggs/v0.0.1/_checkpoints/
# ls -l ../../applications/dl_streaming_aggs/v0.0.1/_checkpoints/commits/
# get the last modified file in the dir (limit to 1) - this is the last commit version (microbatch number - for structured streaming)

# find the latest checkpoint in the checkpoints directory for the streaming application
unset -v last_checkpoint
for file in "../../applications/dl_streaming_aggs/v0.0.1/_checkpoints/commits"/*; do
  [[ "$file" -nt "$last_checkpoint" ]] && last_checkpoint="$file"
done

echo "${last_checkpoint}"
# view the commit info
cat "$last_checkpoint"
echo "---"
echo "TOTAL FILES IN CHECKPOINT DIRECTORY"
# view the total number of commits in the commits dir
# (note: `ls -l` also prints a leading `total` line, so the count is one higher than the number of commit files)
ls -l ../../applications/dl_streaming_aggs/v0.0.1/_checkpoints/commits/ | wc -l

# cat ../../applications/dl_streaming_aggs/v0.0.1/_checkpoints/metadata 
# ls -l ../../applications/dl_streaming_aggs/v0.0.1/_checkpoints/offsets
#cat ../../applications/dl_streaming_aggs/v0.0.1/_checkpoints/offsets/1

total 4
drwxr-xr-x 6 NBuser NBuser 192 Jun  3 22:00 commits
-rw-r--r-- 1 NBuser NBuser  45 Jun  3 22:00 metadata
drwxr-xr-x 6 NBuser NBuser 192 Jun  3 22:00 offsets
drwxr-xr-x 3 NBuser NBuser  96 Jun  3 22:00 state
../../applications/dl_streaming_aggs/v0.0.1/_checkpoints/commits/1
v1
{"nextBatchWatermarkMs":1569887450000}---
TOTAL FILES IN CHECKPOINT DIRECTORY
3


## The Fruits of our Quick Labor
The shopping aggregations are our own take on 'sessionization', based on what works for the hitchhikers guide to Delta Lake streaming. Have we learned a lot from the data? Maybe. Have we learned a lot more about how Delta Lake works? Surely.

In [33]:
(spark.read
 .table("default.ecomm_aggs_table")
 .where(col("event_date").isin("2019-10-01","2019-10-02"))
 .show(10, truncate=False))

+------+-------+----------+----------+--------------+
|window|user_id|product_id|event_date|session_events|
+------+-------+----------+----------+--------------+
+------+-------+----------+----------+--------------+



## Extra Homework: Finding Neat Patterns in the Data
> Shopping is fun. We all do it, some of us even enjoy it. Regardless of your style, the one thing we have in common is that no two of us really shop the same way. Investigate the 42 million shopping data points from this dataset to understand how people are shopping.

In [34]:
(spark.read
 .table(dl_managed_table)
 .select("event_time", "event_type", "product_id", "user_session", "user_id")
 .show(50, truncate=False)
)

+-------------------+----------+----------+------------------------------------+---------+
|event_time         |event_type|product_id|user_session                        |user_id  |
+-------------------+----------+----------+------------------------------------+---------+
|2019-10-01 00:00:00|view      |44600062  |72d76fde-8bb3-4e00-8c23-a032dfed738c|541312140|
|2019-10-01 00:00:00|view      |3900821   |9333dfbd-b87a-4708-9857-6336556b0fcc|554748717|
|2019-10-01 00:00:01|view      |17200506  |566511c2-e2e3-422b-b695-cf8e6e792ca8|519107250|
|2019-10-01 00:00:01|view      |1307067   |7c90fc70-0e80-4590-96f3-13c02c18c713|550050854|
|2019-10-01 00:00:04|view      |1004237   |c6bd7419-2748-4c56-95b4-8cec9ff8b80d|535871217|
|2019-10-01 00:00:05|view      |1480613   |0d0d91c2-c9c2-4e81-90a5-86594dec0db9|512742880|
|2019-10-01 00:00:08|view      |17300353  |4fe811e9-91de-46da-90c3-bbd87ed3a65d|555447699|
|2019-10-01 00:00:08|view      |31500053  |6280d577-25c8-4147-99a7-abc6048498d6|550978835|

In [35]:
# find a user who has an interesting shopping pattern
# this user comes back frequently, views, comes back, and 10 days from the first
# view finally makes a purchase

(spark.read
 .table(dl_managed_table)
 .select("event_time", "event_type", "product_id", "user_id", "user_session")
 .where(col("user_id").eqNullSafe(516224384))
 .show(50, truncate=False)
)

+----------+----------+----------+-------+------------+
|event_time|event_type|product_id|user_id|user_session|
+----------+----------+----------+-------+------------+
+----------+----------+----------+-------+------------+



# Cleaning up with Vacuum.
We are done with the introduction to Streaming. The first steps covered creating tables and modifying table properties, as well as understanding a little more about the structure of a Delta Lake table. During normal processing, you most likely overwrote or deleted some data. Each transaction that affects the data in a given Delta Lake table leaves behind some artifacts (call it orphaned data or files) that are no longer needed for the *CURRENT* version of the Delta Lake table. We will learn more about using `vacuum` while preserving enough history to `undo`, `rewind`, or `time-travel` to a particular point in Table Time under 

In [36]:
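# retentionHours=0 removes every file that the CURRENT table version no longer needs,
# which also removes the ability to time-travel to older versions.
# Delta blocks retention windows this short by default, so the safety check is disabled
# for this local cleanup only and re-enabled right after.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled","false")
DeltaTable.forName(spark, ecomm_aggs_table).vacuum(retentionHours=0)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled","true")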

                                                                                

Deleted 0 files and directories in a total of 1 directories.


24/06/03 22:33:43 WARN JavaUtils: Attempt to delete using native Unix OS command failed for path = /tmp/blockmgr-c84de906-7921-4b7a-90a7-8b2c3a0f1cd0. Falling back to Java IO way
java.io.IOException: Failed to delete: /tmp/blockmgr-c84de906-7921-4b7a-90a7-8b2c3a0f1cd0
	at org.apache.spark.network.util.JavaUtils.deleteRecursivelyUsingUnixNative(JavaUtils.java:173)
	at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:109)
	at org.apache.spark.network.util.JavaUtils.deleteRecursively(JavaUtils.java:90)
	at org.apache.spark.util.SparkFileUtils.deleteRecursively(SparkFileUtils.scala:121)
	at org.apache.spark.util.SparkFileUtils.deleteRecursively$(SparkFileUtils.scala:120)
	at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1126)
	at org.apache.spark.storage.DiskBlockManager.$anonfun$doStop$1(DiskBlockManager.scala:368)
	at org.apache.spark.storage.DiskBlockManager.$anonfun$doStop$1$adapted(DiskBlockManager.scala:364)
	at scala.collection.IndexedSeqOptimize