# Hyperspace ZOrderCoveringIndex

In [None]:
val sessionId = scala.util.Random.nextInt(1000000)
val dataPath = s"/hyperspacetest/data-$sessionId"
val indexPath = s"/hyperspacetest/index-$sessionId"
spark.conf.set("spark.hyperspace.system.path", indexPath)

val numFiles = 100

### Data preparation

In [None]:
spark.range(50000000).map { _ =>
    (scala.util.Random.nextInt(10000000).toLong, scala.util.Random.nextInt(1000000000), scala.util.Random.nextInt(10000000))
}.toDF("colA", "colB", "colC").repartition(numFiles).write.mode("overwrite").format("parquet").save(dataPath)

// Writes 50M rows of random integers into numFiles Parquet files.

### Create index

In [None]:
import com.microsoft.hyperspace.index.zordercovering._
import com.microsoft.hyperspace._
import com.microsoft.hyperspace.util.FileUtils
import org.apache.hadoop.fs.Path

val totalSizeInBytes = FileUtils.getDirectorySize(new Path(dataPath))
val sizePerPartition = totalSizeInBytes / numFiles
// Default: 1G. For this demonstration, the target is lowered to the average source file size.
spark.conf.set("spark.hyperspace.index.zorder.targetSourceBytesPerPartition", sizePerPartition)

val df = spark.read.parquet(dataPath)
val hs = new Hyperspace(spark)
hs.createIndex(df, ZOrderCoveringIndexConfig("zorderTestIndex", Seq("colA", "colB"), Seq("colC")))

In [None]:
// Runs the given block once and prints its wall-clock duration in milliseconds.
def measureDuration(f: => Unit): Unit = {
    val start = System.nanoTime
    f
    val durationInMS = (System.nanoTime - start) / 1000 / 1000
    println("duration(ms): " + durationInMS)
}

### Check performance with and without ZOrderCoveringIndex

NOTE: The performance gain depends on the query type, data size, and computing environment.
Because the test data is not large, use a small amount of computing resources to see the improvement from Z-ordering.

In [None]:
spark.disableHyperspace
val filterQuery = df.filter("colA > 758647 AND colA < 779999 AND colB > 10537919 AND colB < 10599715")
println(filterQuery.queryExecution.sparkPlan)
measureDuration(filterQuery.count)
measureDuration(filterQuery.count)

In [None]:
spark.enableHyperspace
val filterQuery = df.filter("colA > 758647 AND colA < 779999 AND colB > 10537919 AND colB < 10599715")
println(filterQuery.queryExecution.sparkPlan)
measureDuration(filterQuery.count)
measureDuration(filterQuery.count)

### Utility function for min/max skipping analysis

We provide a min/max-based analysis utility function that works on any DataFrame.
The analysis function supports numeric columns only.
It collects the min/max values of each data file and generates an analysis result.
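
To illustrate why per-file min/max values matter, here is a minimal sketch of range-based file skipping in plain Scala. It does not use or reproduce the Hyperspace API: `FileStats`, `mayContain`, and the sample statistics are hypothetical and made up for illustration. Only the query bounds are taken from the `colA` predicate of the filter query above.

```scala
// Hypothetical per-file statistics; not part of the Hyperspace API.
final case class FileStats(name: String, min: Long, max: Long)

// A file has to be read only if its [min, max] range overlaps
// the open query interval (lower, upper).
def mayContain(f: FileStats, lower: Long, upper: Long): Boolean =
  f.max > lower && f.min < upper

// Randomly distributed data: every file spans almost the whole value domain.
val randomFiles = Seq(
  FileStats("part-0", 12L, 9999871L),
  FileStats("part-1", 5L, 9999990L),
  FileStats("part-2", 40L, 9999902L))

// Ordered data: each file covers a narrow slice of the domain.
val orderedFiles = Seq(
  FileStats("part-0", 0L, 3333332L),
  FileStats("part-1", 3333333L, 6666665L),
  FileStats("part-2", 6666666L, 9999999L))

// Bounds of the colA predicate: colA > 758647 AND colA < 779999
val (lower, upper) = (758647L, 779999L)

val readRandom = randomFiles.count(mayContain(_, lower, upper))
val readOrdered = orderedFiles.count(mayContain(_, lower, upper))
println(s"files to read: random=$readRandom, ordered=$readOrdered")
// prints "files to read: random=3, ordered=1"
```

With the made-up statistics, none of the randomly distributed files can be skipped, while two of the three ordered files can. The analysis utility reports this kind of information for real data files.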


In [None]:
import com.microsoft.hyperspace.util.MinMaxAnalysisUtil
val df = spark.read.parquet(dataPath)  

// Since the source data is randomly generated, every file has to be checked to find a value.
// Available formats: "text" and "html".
displayHTML(MinMaxAnalysisUtil.analyze(df, Seq("colA", "colB"), format = "html"))


In [None]:
// As the index data is Z-ordered, we can skip reading unnecessary files based on min/max statistics.
displayHTML(MinMaxAnalysisUtil.analyzeIndex(spark, "zorderTestIndex", Seq("colA", "colB"), format = "html")) 
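
As background on why Z-ordered files end up with narrow min/max ranges on both indexed columns, here is a minimal sketch of a Z-order (Morton) value computed by bit interleaving. This is the textbook technique, not necessarily how Hyperspace computes its Z-addresses; `zValue` and its `bits` parameter are hypothetical names used only for this illustration.

```scala
// Interleave the low `bits` bits of x and y:
// bit i of x goes to position 2*i + 1, bit i of y goes to position 2*i.
// Assumes non-negative inputs that fit in `bits` bits.
def zValue(x: Int, y: Int, bits: Int = 16): Long = {
  require(bits >= 1 && bits <= 31, "bits must be in [1, 31]")
  (0 until bits).foldLeft(0L) { (acc, i) =>
    val xBit = ((x >> i) & 1).toLong
    val yBit = ((y >> i) & 1).toLong
    acc | (xBit << (2 * i + 1)) | (yBit << (2 * i))
  }
}

// Sorting a 4x4 grid by z-value visits the grid in 2x2 blocks, so
// consecutive rows stay close to each other in both dimensions.
val points = for (x <- 0 until 4; y <- 0 until 4) yield (x, y)
val ordered = points.sortBy { case (x, y) => zValue(x, y) }
println(ordered.take(4).mkString(" "))
// prints "(0,0) (0,1) (1,0) (1,1)"
```

Because rows that are adjacent in this order are close in both columns, a file holding a contiguous run of rows covers a small range of each column, which is what makes min/max skipping effective for predicates on either column.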