## Hitchhiker's Guide to Hyperspace - An Indexing Subsystem for Apache Spark™
Hyperspace introduces the ability for Apache Spark™ users to create indexes on their datasets (e.g., CSV, JSON, Parquet) and leverage them for potential query and workload acceleration.

In this notebook, we highlight the basics of Hyperspace, emphasize its simplicity, and show how just about anyone can use it.

**Disclaimer**: Hyperspace helps accelerate your workloads/queries under two circumstances:

  1. Queries contain filters on predicates with high selectivity (e.g., you want to select 100 matching rows from a million candidate rows)
  2. Queries contain a join that requires heavy shuffles (e.g., you want to join a 100 GB dataset with a 10 GB dataset)

You may want to carefully monitor your workloads and determine whether indexing is helping you on a case-by-case basis.
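
To put a number on the first case, the fraction of candidate rows that a filter keeps can be computed directly. The helper name `match_fraction` is ours and is used only for illustration; the smaller the fraction, the more selective the filter.

```python
def match_fraction(matching_rows: int, total_rows: int) -> float:
    """Fraction of candidate rows that a filter keeps (smaller = more selective)."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return matching_rows / total_rows


# The example from the text: 100 matching rows out of a million candidates.
fraction = match_fraction(100, 1_000_000)
print(f"{fraction:.4%}")  # prints 0.0100%
```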

## Setup
To begin with, let's start a new Spark™ session. Since this notebook is a tutorial that merely illustrates what Hyperspace can offer, we make a configuration change that allows us to highlight what Hyperspace is doing on small datasets. By default, Spark™ uses a *broadcast join* to optimize join queries when the data on one side of the join is small (which is the case for the sample data we use in this tutorial). Therefore, we disable broadcast joins so that Spark™ uses a *sort-merge* join when we run join queries later. This is mainly to show how Hyperspace indexes would be used at scale to accelerate join queries.

The output of running the cell below shows a reference to the successfully created Spark™ session and prints out '-1' as the value for the modified join config which indicates that broadcast join is successfully disabled.

In [16]:
# Start your Spark session
spark

# Disable BroadcastHashJoin, so Spark will use standard SortMergeJoin. Currently Hyperspace indexes utilize SortMergeJoin to speed up query.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Verify that BroadcastHashJoin is set correctly 
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

-1
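
For intuition about the sort-merge strategy that this setting forces, here is a toy, pure-Python version of the algorithm. It is only a sketch of the general technique and not Spark™'s implementation; the function name and the toy rows are ours. Both inputs are sorted on the join key and then merged in a single pass.

```python
def sort_merge_join(left, right, left_key, right_key):
    """Inner equi-join of two lists of tuples using the sort-merge strategy."""
    left_sorted = sorted(left, key=left_key)
    right_sorted = sorted(right, key=right_key)
    result, i, j = [], 0, 0
    while i < len(left_sorted) and j < len(right_sorted):
        lk, rk = left_key(left_sorted[i]), right_key(right_sorted[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right_sorted) and right_key(right_sorted[j_end]) == lk:
                j_end += 1
            # Pair every left row having this key with each right row in the run.
            while i < len(left_sorted) and left_key(left_sorted[i]) == lk:
                for right_row in right_sorted[j:j_end]:
                    result.append(left_sorted[i] + right_row)
                i += 1
            j = j_end
    return result


toy_employees = [(7369, "SMITH", 20), (7499, "ALLEN", 30), (7566, "JONES", 20)]
toy_departments = [(10, "Accounting"), (20, "Research"), (30, "Sales")]

joined = sort_merge_join(toy_employees, toy_departments, lambda e: e[2], lambda d: d[0])
for row in joined:
    print(row)
```

If an input is already sorted on the join key, the sorting step for that side has nothing left to do, which is one reason pre-organized data can make this kind of join cheaper.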

## Data Preparation

To prepare our environment, we will create sample data records and save them as parquet data files. While we use Parquet for illustration, you can use other formats such as CSV. In the subsequent cells, we will also demonstrate how you can create several Hyperspace indexes on this sample dataset and how one can make Spark™ use them when running queries. 

Our example records correspond to two datasets: *department* and *employee*. You should configure the "emp_Location" and "dept_Location" paths so that they point to the location on your storage account where you want to save the generated data files. 

The output of the cell below shows the contents of our datasets as lists of triplets, followed by references to the DataFrames created to save the content of each dataset in our preferred location.

In [17]:
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

# Sample department records
departments = [(10, "Accounting", "New York"), (20, "Research", "Dallas"), (30, "Sales", "Chicago"), (40, "Operations", "Boston")]

# Sample employee records
employees = [(7369, "SMITH", 20), (7499, "ALLEN", 30), (7521, "WARD", 30), (7566, "JONES", 20), (7698, "BLAKE", 30)]

# Create a schema for the dataframe
dept_schema = StructType([StructField('deptId', IntegerType(), True), StructField('deptName', StringType(), True), StructField('location', StringType(), True)])
emp_schema = StructType([StructField('empId', IntegerType(), True), StructField('empName', StringType(), True), StructField('deptId', IntegerType(), True)])

departments_df = spark.createDataFrame(departments, dept_schema)
employees_df = spark.createDataFrame(employees, emp_schema)

#TODO ** customize this location path **
emp_Location = "/<yourpath>/employees.parquet"
dept_Location = "/<yourpath>/departments.parquet"

employees_df.write.mode("overwrite").parquet(emp_Location)
departments_df.write.mode("overwrite").parquet(dept_Location)

Let's verify the contents of parquet files we created above to make sure they contain expected records in correct format. We later use these data files to create Hyperspace indexes and run sample queries.

When you run the cell below, the output displays the rows of the employee and department DataFrames in tabular form. There should be 5 employees and 4 departments, each matching one of the triplets we created in the previous cell.

In [18]:
# emp_Location and dept_Location are the user defined locations above to save parquet files
emp_DF = spark.read.parquet(emp_Location)
dept_DF = spark.read.parquet(dept_Location)

# Verify the data is available and correct
emp_DF.show()
dept_DF.show()

+-----+-------+------+
|empId|empName|deptId|
+-----+-------+------+
| 7369|  SMITH|    20|
| 7499|  ALLEN|    30|
| 7566|  JONES|    20|
| 7698|  BLAKE|    30|
| 7521|   WARD|    30|
+-----+-------+------+

+------+----------+--------+
|deptId|  deptName|location|
+------+----------+--------+
|    10|Accounting|New York|
|    40|Operations|  Boston|
|    20|  Research|  Dallas|
|    30|     Sales| Chicago|
+------+----------+--------+

## Hello Hyperspace Index!
Hyperspace lets users create indexes on records scanned from persisted data files. Once an index is successfully created, an entry corresponding to it is added to Hyperspace's metadata. This metadata is later used by Apache Spark™'s optimizer (with our extensions) during query processing to find and use proper indexes. 

Once indexes are created, users can perform several actions:
  - **Refresh** If the underlying data changes, users can refresh an existing index to capture the change. 
  - **Delete** If the index is not needed, users can perform a soft delete, i.e., the index is not physically deleted but is marked as 'deleted' so that it is no longer used in your workloads.
  - **Vacuum** If an index is no longer required, users can vacuum it, which physically deletes the index contents and removes the associated entry from Hyperspace's metadata.
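
The delete, restore and vacuum transitions can be pictured with a toy state model. This is only a sketch of the states described in this notebook and not Hyperspace's implementation: the class `ToyIndexCatalog` is hypothetical, refresh is not modeled, and the rule that only a deleted index can be vacuumed is an assumption of the sketch.

```python
class ToyIndexCatalog:
    """Toy model of the index states described above (not Hyperspace's code)."""

    def __init__(self):
        self._states = {}  # index name -> "ACTIVE" or "DELETED"

    def _require(self, name, expected):
        actual = self._states.get(name)
        if actual != expected:
            raise ValueError(f"{name!r} must be {expected}, but is {actual}")

    def create(self, name):
        if name in self._states:
            raise ValueError(f"{name!r} already exists")
        self._states[name] = "ACTIVE"

    def delete(self, name):
        # Soft delete: the entry stays, only its state changes.
        self._require(name, "ACTIVE")
        self._states[name] = "DELETED"

    def restore(self, name):
        # Undo a soft delete.
        self._require(name, "DELETED")
        self._states[name] = "ACTIVE"

    def vacuum(self, name):
        # Hard delete: the entry is gone and cannot be restored.
        self._require(name, "DELETED")
        del self._states[name]

    def states(self):
        return dict(self._states)


catalog = ToyIndexCatalog()
catalog.create("deptIndex1")
catalog.create("deptIndex2")

catalog.delete("deptIndex2")
print(catalog.states())  # {'deptIndex1': 'ACTIVE', 'deptIndex2': 'DELETED'}

catalog.restore("deptIndex2")  # usable again
catalog.delete("deptIndex2")
catalog.vacuum("deptIndex2")   # irreversible
print(catalog.states())  # {'deptIndex1': 'ACTIVE'}
```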

The sections below show how these index management operations can be done in Hyperspace.

First, we need to import the required libraries and create an instance of Hyperspace. We later use this instance to invoke different Hyperspace APIs to create indexes on our sample data and modify those indexes.

The output of the cell below shows a reference to the created instance of Hyperspace.

In [20]:
from hyperspace import *

# Create an instance of Hyperspace
hyperspace = Hyperspace(spark)

### Create Indexes
To create a Hyperspace index, the user needs to provide two pieces of information:
* An Apache Spark™ DataFrame which references the data to be indexed.
* An index configuration object, IndexConfig, which specifies the *index name* and the *indexed* and *included* columns of the index. 

We start by creating three Hyperspace indexes on our sample data: two indexes on the department dataset named "deptIndex1" and "deptIndex2", and one index on the employee dataset named "empIndex1". 
For each index, we need a corresponding IndexConfig to capture the name along with the lists of indexed and included columns. Running the cell below creates these index configurations and its output lists them.

**Note**: An *indexed column* is a column that appears in your filters or join conditions. An *included column* is a column that appears in your select/project.

For instance, in the following query:
```sql
SELECT X
FROM Table
WHERE Y = 2
```
Y can be an *indexed column* and X can be an *included column*.
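
In the query above, Y appears in the filter and X in the projection. The toy structure below keys rows on Y and stores only X, so the query can be answered without touching the full table. It is a plain-Python sketch for intuition, not how Hyperspace stores indexes; the function name and the sample rows are ours.

```python
from collections import defaultdict


def build_covering_index(rows, indexed_column, included_columns):
    """Group the included-column values of each row by its indexed-column value."""
    index = defaultdict(list)
    for row in rows:
        index[row[indexed_column]].append({c: row[c] for c in included_columns})
    return index


table = [
    {"X": "a", "Y": 1, "Z": 0.5},
    {"X": "b", "Y": 2, "Z": 1.5},
    {"X": "c", "Y": 2, "Z": 2.5},
]

# Y is filtered on (indexed column); X is projected (included column).
index = build_covering_index(table, indexed_column="Y", included_columns=["X"])

# SELECT X FROM Table WHERE Y = 2, answered from the index alone.
lookup = index.get(2, [])
print(lookup)  # [{'X': 'b'}, {'X': 'c'}]
```

Column Z never enters the index, so a query that also selects Z could not be served by it; that is the reason the included columns should cover what your queries project.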

In [21]:
# Create index configurations

emp_IndexConfig = IndexConfig("empIndex1", ["deptId"], ["empName"])
dept_IndexConfig1 = IndexConfig("deptIndex1", ["deptId"], ["deptName"])
dept_IndexConfig2 = IndexConfig("deptIndex2", ["location"], ["deptName"])

Now, we create three indexes using our index configurations. For this purpose, we invoke the "createIndex" command on our Hyperspace instance. This command requires an index configuration and the DataFrame containing the rows to be indexed.
Running the cell below creates the three indexes.


In [22]:
# Create indexes from configurations

hyperspace.createIndex(emp_DF, emp_IndexConfig)
hyperspace.createIndex(dept_DF, dept_IndexConfig1)
hyperspace.createIndex(dept_DF, dept_IndexConfig2)

### List Indexes

The code below shows how a user can list all available indexes in a Hyperspace instance. It uses the "indexes" API, which returns information about existing indexes as a Spark™ DataFrame so that you can perform additional operations. For instance, you can invoke valid operations on this DataFrame to check its content or analyze it further (for example, filtering specific indexes or grouping them according to some desired property). 

The cell below uses the DataFrame's 'show' action to print the rows and show details of our indexes in tabular form. For each index, we can see the information Hyperspace has stored about it in the metadata. You will immediately notice the following:
  - "name", "indexedColumns", "includedColumns" and "state" are the fields that a user normally refers to. 
  - "dfSignature" is automatically generated by Hyperspace and is unique for each index. Hyperspace uses this signature internally to maintain the index and exploit it at query time. 
  
In the output below, all three indexes should have "ACTIVE" as status and their name, indexed columns, and included columns should match with what we defined in index configurations above.


In [23]:
hyperspace.indexes().show()

+----------+--------------+---------------+----------+--------------------+--------------------+------+
|      name|indexedColumns|includedColumns|numBuckets|              schema|       indexLocation| state|
+----------+--------------+---------------+----------+--------------------+--------------------+------+
|deptIndex1|      [deptId]|     [deptName]|       200|{"type":"struct",...|/location/...|ACTIVE|
|deptIndex2|    [location]|     [deptName]|       200|{"type":"struct",...|/location/...|ACTIVE|
| empIndex1|      [deptId]|      [empName]|       200|{"type":"struct",...|/location/...|ACTIVE|
+----------+--------------+---------------+----------+--------------------+--------------------+------+
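
Because "indexes" returns an ordinary DataFrame, the usual DataFrame operations apply to it. The helper below is a small sketch that keeps only the indexes in a given state. The function name `index_names_in_state` is ours and not part of Hyperspace, and the sketch assumes the "name" and "state" columns shown in the output above.

```python
def index_names_in_state(indexes_df, state="ACTIVE"):
    """Return the names of the indexes whose "state" column equals `state`.

    `indexes_df` is expected to be the DataFrame returned by hyperspace.indexes().
    """
    # DataFrame.filter accepts a SQL-style condition string.
    rows = indexes_df.filter(f"state = '{state}'").select("name").collect()
    return [row["name"] for row in rows]


# In this notebook it could be called as:
#   index_names_in_state(hyperspace.indexes())
```

At this point in the notebook, the call should list the three indexes created above, since all of them are active.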

### Delete Indexes
A user can drop an existing index by using the "deleteIndex" API and providing the index name. Index deletion is a soft delete: it mainly updates the index's status in the Hyperspace metadata from "ACTIVE" to "DELETED". This excludes the dropped index from any future query optimization, so Hyperspace no longer picks that index for any query. However, the index files of a deleted index remain available (since it is a soft delete), so the index can be restored if the user asks for it.

The cell below deletes the index named "deptIndex2" and then lists the Hyperspace metadata. The output should be similar to that of the cell above for "List Indexes", except for "deptIndex2", whose status should now be "DELETED".

In [24]:
hyperspace.deleteIndex("deptIndex2")

hyperspace.indexes().show()

+----------+--------------+---------------+----------+--------------------+--------------------+-------+
|      name|indexedColumns|includedColumns|numBuckets|              schema|       indexLocation|  state|
+----------+--------------+---------------+----------+--------------------+--------------------+-------+
|deptIndex1|      [deptId]|     [deptName]|       200|{"type":"struct",...|/location/...| ACTIVE|
|deptIndex2|    [location]|     [deptName]|       200|{"type":"struct",...|/location/...|DELETED|
| empIndex1|      [deptId]|      [empName]|       200|{"type":"struct",...|/location/...| ACTIVE|
+----------+--------------+---------------+----------+--------------------+--------------------+-------+

### Restore Indexes
A user can use the "restoreIndex" API to restore a deleted index. This brings the latest version of the index back into the "ACTIVE" status and makes it usable again for queries. The cell below shows an example of "restoreIndex" usage: we delete "deptIndex1" and then restore it. The output shows that "deptIndex1" first went into the "DELETED" status after the "deleteIndex" command and came back to the "ACTIVE" status after "restoreIndex" was called.


In [25]:
hyperspace.deleteIndex("deptIndex1")

hyperspace.indexes().show()

hyperspace.restoreIndex("deptIndex1")

hyperspace.indexes().show()

+----------+--------------+---------------+----------+--------------------+--------------------+-------+
|      name|indexedColumns|includedColumns|numBuckets|              schema|       indexLocation|  state|
+----------+--------------+---------------+----------+--------------------+--------------------+-------+
|deptIndex1|      [deptId]|     [deptName]|       200|{"type":"struct",...|/location/...|DELETED|
|deptIndex2|    [location]|     [deptName]|       200|{"type":"struct",...|/location/...|DELETED|
| empIndex1|      [deptId]|      [empName]|       200|{"type":"struct",...|/location/...| ACTIVE|
+----------+--------------+---------------+----------+--------------------+--------------------+-------+

+----------+--------------+---------------+----------+--------------------+--------------------+-------+
|      name|indexedColumns|includedColumns|numBuckets|              schema|       indexLocation|  state|
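+----------+--------------+---------------+----------+--------------------+--------------------+-------+
|deptIndex1|      [deptId]|     [deptName]|       200|{"type":"struct",...|/location/...| ACTIVE|
|deptIndex2|    [location]|     [deptName]|       200|{"type":"struct",...|/location/...|DELETED|
| empIndex1|      [deptId]|      [empName]|       200|{"type":"struct",...|/location/...| ACTIVE|
+----------+--------------+---------------+----------+--------------------+--------------------+-------+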

### Vacuum Indexes
The user can perform a hard delete, i.e., fully remove the files and the metadata entry of a deleted index, using the "vacuumIndex" command. This action is irreversible because it physically deletes all the index files (which is why it is a hard delete).
 
The cell below vacuums the "deptIndex2" index and shows the Hyperspace metadata after vacuuming. You should see metadata entries for two indexes, "deptIndex1" and "empIndex1", both with "ACTIVE" status, and no entry for "deptIndex2".

In [72]:
hyperspace.vacuumIndex("deptIndex2")
hyperspace.indexes().show()

+----------+--------------+---------------+----------+--------------------+--------------------+--------------------+------+
|      name|indexedColumns|includedColumns|numBuckets|              schema|       indexLocation|           queryPlan| state|
+----------+--------------+---------------+----------+--------------------+--------------------+--------------------+------+
|deptIndex1|      [deptId]|     [deptName]|       200|{"type":"struct",...|/datasets/idx...|Relation[deptId#1...|ACTIVE|
| empIndex1|      [deptId]|      [empName]|       200|{"type":"struct",...|/datasets/idx...|Relation[empId#19...|ACTIVE|
+----------+--------------+---------------+----------+--------------------+--------------------+--------------------+------+

## Enable/Disable Hyperspace

Hyperspace provides APIs to enable or disable index usage with Spark™.

  - By using the "enable" command, Hyperspace optimization rules become visible to the Apache Spark™ optimizer, which then exploits existing Hyperspace indexes to optimize user queries.
  - By using the "disable" command, Hyperspace rules no longer apply during query optimization. Note that disabling Hyperspace has no impact on created indexes; they remain intact.

The cell below shows how you can use these commands to enable or disable Hyperspace. The output simply shows a reference to the existing Spark™ session whose configuration is updated.

In [26]:
# Enable Hyperspace
Hyperspace.enable(spark)

# Disable Hyperspace
Hyperspace.disable(spark)

<pyspark.sql.session.SparkSession object at 0x7f30d4c90dd8>

## Index Usage
In order to make Spark use Hyperspace indexes during query processing, the user needs to make sure that Hyperspace is enabled. 

The cell below enables Hyperspace and creates two DataFrames containing our sample data records which we use for running example queries. For each DataFrame, a few sample rows are printed.

In [27]:
# Enable Hyperspace
Hyperspace.enable(spark)

emp_DF = spark.read.parquet(emp_Location)
dept_DF = spark.read.parquet(dept_Location)

emp_DF.show(5)
dept_DF.show(5)

+-----+-------+------+
|empId|empName|deptId|
+-----+-------+------+
| 7369|  SMITH|    20|
| 7499|  ALLEN|    30|
| 7566|  JONES|    20|
| 7698|  BLAKE|    30|
| 7521|   WARD|    30|
+-----+-------+------+

+------+----------+--------+
|deptId|  deptName|location|
+------+----------+--------+
|    10|Accounting|New York|
|    40|Operations|  Boston|
|    20|  Research|  Dallas|
|    30|     Sales| Chicago|
+------+----------+--------+

## Hyperspace's Index Types

Currently, Hyperspace has rules to exploit indexes for two groups of queries: 
* Selection queries with lookup or range selection filtering predicates.
* Join queries with an equality join predicate (i.e., equi-joins).


## Indexes for Accelerating Filters

Our first example query does a lookup on department records (see the cell below). In SQL, this query looks as follows:

```sql
SELECT deptName 
FROM departments
WHERE deptId = 20
```

The output of running the cell below shows: 
- query result, which is a single department name.
- query plan that Spark™ used to run the query. 

In the query plan, the "FileScan" operator at the bottom of the plan shows the data source the records were read from. The location of this file is the path to the latest version of the "deptIndex1" index. This shows that, based on the query and the Hyperspace optimization rules, Spark™ decided to exploit the proper index at runtime.


In [28]:
# Filter with equality predicate

eqFilter = dept_DF.filter("""deptId = 20""").select("""deptName""")
eqFilter.show()

eqFilter.explain(True)

+--------+
|deptName|
+--------+
|Research|
+--------+

== Parsed Logical Plan ==
'Project [unresolvedalias('deptName, None)]
+- Filter (deptId#492 = 20)
   +- Relation[deptId#492,deptName#493,location#494] parquet

== Analyzed Logical Plan ==
deptName: string
Project [deptName#493]
+- Filter (deptId#492 = 20)
   +- Relation[deptId#492,deptName#493,location#494] parquet

== Optimized Logical Plan ==
Project [deptName#493]
+- Filter (isnotnull(deptId#492) && (deptId#492 = 20))
   +- Relation[deptId#492,deptName#493] parquet

== Physical Plan ==
*(1) Project [deptName#493]
+- *(1) Filter (isnotnull(deptId#492) && (deptId#492 = 20))
   +- *(1) FileScan parquet [deptId#492,deptName#493] Batched: true, Format: Parquet, Location: InMemoryFileIndex[/location/deptIndex1/v__=0], PartitionFilters: [], PushedFilters: [IsNotNull(deptId), EqualTo(deptId,20)], ReadSchema: struct<deptId:int,deptName:string>
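
Reading the scan location by eye works for one query, but the same check can be done on the plan text itself. The sketch below extracts the "Location" entries from a physical plan string and tests whether one of them passes through a directory named after the index. Both helper names are ours, and the assumption that the index directory carries the index name is taken from the sample output above rather than from any guarantee.

```python
import re


def scanned_locations(plan_text):
    """Return the contents of each `Location: InMemoryFileIndex[...]` entry."""
    return re.findall(r"Location: InMemoryFileIndex\[([^\]]*)\]", plan_text)


def uses_index(plan_text, index_name):
    """True if any scanned location has a path segment equal to `index_name`."""
    return any(index_name in loc.split("/") for loc in scanned_locations(plan_text))


# Shortened copy of the FileScan line from the plan above.
plan = (
    "*(1) FileScan parquet [deptId#492,deptName#493] Batched: true, Format: Parquet, "
    "Location: InMemoryFileIndex[/location/deptIndex1/v__=0], PartitionFilters: []"
)

print(scanned_locations(plan))         # ['/location/deptIndex1/v__=0']
print(uses_index(plan, "deptIndex1"))  # True
print(uses_index(plan, "deptIndex2"))  # False
```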

Our second example is a range selection query on department records. In SQL, this query looks as follows:

```sql
SELECT deptName 
FROM departments
WHERE deptId > 20
```
Similar to our first example, the output of the cell below shows the query results (names of two departments) and the query plan. The location of the data file in the FileScan operator shows that "deptIndex1" was used to run the query.


In [29]:
# Filter with range selection predicate

rangeFilter = dept_DF.filter("""deptId > 20""").select("deptName")
rangeFilter.show()

rangeFilter.explain(True)

+----------+
|  deptName|
+----------+
|Operations|
|     Sales|
+----------+

== Parsed Logical Plan ==
'Project [unresolvedalias('deptName, None)]
+- Filter (deptId#492 > 20)
   +- Relation[deptId#492,deptName#493,location#494] parquet

== Analyzed Logical Plan ==
deptName: string
Project [deptName#493]
+- Filter (deptId#492 > 20)
   +- Relation[deptId#492,deptName#493,location#494] parquet

== Optimized Logical Plan ==
Project [deptName#493]
+- Filter (isnotnull(deptId#492) && (deptId#492 > 20))
   +- Relation[deptId#492,deptName#493] parquet

== Physical Plan ==
*(1) Project [deptName#493]
+- *(1) Filter (isnotnull(deptId#492) && (deptId#492 > 20))
   +- *(1) FileScan parquet [deptId#492,deptName#493] Batched: true, Format: Parquet, Location: InMemoryFileIndex[/location/deptIndex1/v__=0], PartitionFilters: [], PushedFilters: [IsNotNull(deptId), GreaterThan(deptId,20)], ReadSchema: struct<deptId:int,deptName:string>

## Indexes for Accelerating Joins

Our third example is a query joining department and employee records on the department id. The equivalent SQL statement is shown below:

```sql
SELECT empName, deptName
FROM   employees, departments 
WHERE  employees.deptId = departments.deptId
```

The output of running the cell below shows the query results, which are the names of 5 employees and the name of the department each employee works in. The query plan is also included in the output. Notice how the file locations for the two FileScan operators show that Spark™ used the "empIndex1" and "deptIndex1" indexes to run the query.


In [30]:
# Join

eqJoin = emp_DF.join(dept_DF, emp_DF.deptId == dept_DF.deptId).select(emp_DF.empName, dept_DF.deptName)

eqJoin.show()

eqJoin.explain(True)

+-------+--------+
|empName|deptName|
+-------+--------+
|  SMITH|Research|
|  JONES|Research|
|  ALLEN|   Sales|
|  BLAKE|   Sales|
|   WARD|   Sales|
+-------+--------+

== Parsed Logical Plan ==
Project [empName#487, deptName#493]
+- Join Inner, (deptId#488 = deptId#492)
   :- Relation[empId#486,empName#487,deptId#488] parquet
   +- Relation[deptId#492,deptName#493,location#494] parquet

== Analyzed Logical Plan ==
empName: string, deptName: string
Project [empName#487, deptName#493]
+- Join Inner, (deptId#488 = deptId#492)
   :- Relation[empId#486,empName#487,deptId#488] parquet
   +- Relation[deptId#492,deptName#493,location#494] parquet

== Optimized Logical Plan ==
Project [empName#487, deptName#493]
+- Join Inner, (deptId#488 = deptId#492)
   :- Project [empName#487, deptId#488]
   :  +- Filter isnotnull(deptId#488)
   :     +- Relation[empName#487,deptId#488] parquet
   +- Project [deptId#492, deptName#493]
      +- Filter isnotnull(deptId#492)
         +- Relation[deptId#492,

## Support for SQL Semantics

Index usage is transparent whether the user uses the DataFrame API or Spark™ SQL. The following example shows the same join as before, in SQL form, and shows that indexes are used when applicable.

In [32]:
from pyspark.sql import SparkSession

emp_DF.createOrReplaceTempView("EMP")
dept_DF.createOrReplaceTempView("DEPT")

joinQuery = spark.sql("SELECT EMP.empName, DEPT.deptName FROM EMP, DEPT WHERE EMP.deptId = DEPT.deptId")

joinQuery.show()
joinQuery.explain(True)

+-------+--------+
|empName|deptName|
+-------+--------+
|  SMITH|Research|
|  JONES|Research|
|  ALLEN|   Sales|
|  BLAKE|   Sales|
|   WARD|   Sales|
+-------+--------+

== Parsed Logical Plan ==
'Project ['EMP.empName, 'DEPT.deptName]
+- 'Filter ('EMP.deptId = 'DEPT.deptId)
   +- 'Join Inner
      :- 'UnresolvedRelation `EMP`
      +- 'UnresolvedRelation `DEPT`

== Analyzed Logical Plan ==
empName: string, deptName: string
Project [empName#487, deptName#493]
+- Filter (deptId#488 = deptId#492)
   +- Join Inner
      :- SubqueryAlias `emp`
      :  +- Relation[empId#486,empName#487,deptId#488] parquet
      +- SubqueryAlias `dept`
         +- Relation[deptId#492,deptName#493,location#494] parquet

== Optimized Logical Plan ==
Project [empName#487, deptName#493]
+- Join Inner, (deptId#488 = deptId#492)
   :- Project [empName#487, deptId#488]
   :  +- Filter isnotnull(deptId#488)
   :     +- Relation[empName#487,deptId#488] parquet
   +- Project [deptId#492, deptName#493]
      +- Filt

## Explain API
Indexes are great, but how do you know if they are being used? Hyperspace allows users to compare their original plan with the updated index-dependent plan before running their query. You can choose among html, plaintext and console modes to display the command output. 

The following cell shows an example with HTML. The highlighted section represents the difference between original and updated plans along with the indexes being used.

In [42]:
eqJoin = emp_DF.join(dept_DF, emp_DF.deptId == dept_DF.deptId).select(emp_DF.empName, dept_DF.deptName)

spark.conf.set("spark.hyperspace.explain.displayMode", "html")
hyperspace.explain(eqJoin, True, displayHTML)
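
The third argument in the call above is a function that receives the rendered comparison. Assuming any callable that takes one string works in that position, the sketch below collects the text into a Python string instead of displaying it, so that it can be searched or logged. The helper name `explain_to_string` is ours.

```python
def explain_to_string(hyperspace, df, verbose=True):
    """Run hyperspace.explain and return the rendered text instead of displaying it."""
    chunks = []
    # The third argument is treated as a callback that receives the rendered text,
    # mirroring how `displayHTML` is passed in the cell above (an assumption).
    hyperspace.explain(df, verbose, chunks.append)
    return "".join(chunks)


# In this notebook it could be called as:
#   plan_text = explain_to_string(hyperspace, eqJoin)
```

Setting "spark.hyperspace.explain.displayMode" to "plaintext" beforehand should keep the captured text free of HTML markup.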

## Refresh Indexes
If the original data on which an index was created changes, the index no longer captures the latest state of the data. The user can refresh such a stale index using the "refreshIndex" command. This causes the index to be fully rebuilt and updates it according to the latest data records (don't worry, we will show you how to *incrementally refresh* your index in other notebooks).

The two cells below show an example for this scenario:
- The first cell adds two more departments to the original departments data. It reads and prints the list of departments to verify that the new departments are added correctly. On a fresh run, the output shows 6 departments in total: four old ones and two new ones (rerunning the cell appends the new departments again, so they can appear more than once). Invoking "refreshIndex" updates "deptIndex1" so that the index captures the new departments.
- The second cell runs our range selection query example. The results should now contain four departments: the two we saw when we ran the query above, and the two new departments we just added.

In [43]:
extra_Departments = [(50, "Inovation", "Seattle"), (60, "Human Resources", "San Francisco")]

extra_departments_df = spark.createDataFrame(extra_Departments, dept_schema)
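extra_departments_df.write.mode("Append").parquet(dept_Location)

# Rebuild "deptIndex1" so that it covers the newly appended departments
hyperspace.refreshIndex("deptIndex1")

dept_DFrame_Updated = spark.read.parquet(dept_Location)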

dept_DFrame_Updated.show(10)

+------+---------------+-------------+
|deptId|       deptName|     location|
+------+---------------+-------------+
|    60|Human Resources|San Francisco|
|    60|Human Resources|San Francisco|
|    10|     Accounting|     New York|
|    50|      Inovation|      Seattle|
|    50|      Inovation|      Seattle|
|    40|     Operations|       Boston|
|    20|       Research|       Dallas|
|    30|          Sales|      Chicago|
+------+---------------+-------------+

In [45]:
newRangeFilter = dept_DFrame_Updated.filter("deptId > 20").select("deptName")
newRangeFilter.show()

newRangeFilter.explain(True)

+---------------+
|       deptName|
+---------------+
|Human Resources|
|Human Resources|
|      Inovation|
|      Inovation|
|     Operations|
|          Sales|
+---------------+

== Parsed Logical Plan ==
'Project [unresolvedalias('deptName, None)]
+- Filter (deptId#905 > 20)
   +- Relation[deptId#905,deptName#906,location#907] parquet

== Analyzed Logical Plan ==
deptName: string
Project [deptName#906]
+- Filter (deptId#905 > 20)
   +- Relation[deptId#905,deptName#906,location#907] parquet

== Optimized Logical Plan ==
Project [deptName#906]
+- Filter (isnotnull(deptId#905) && (deptId#905 > 20))
   +- Relation[deptId#905,deptName#906,location#907] parquet

== Physical Plan ==
*(1) Project [deptName#906]
+- *(1) Filter (isnotnull(deptId#905) && (deptId#905 > 20))
   +- *(1) FileScan parquet [deptId#905,deptName#906] Batched: true, Format: Parquet, Location: InMemoryFileIndex[abfss://data@location../departments.parquet], PartitionFilters: [], PushedFilters: [IsNotNull(deptId), Greater