[PROPOSAL]: Gold Standard #334

apoorvedave1 · 2021-01-26T06:35:10Z

Problem Statement

#282 Gold Standard

A design for query plan validation on TPC-DS queries, extensible to more data sets and to various TPC-DS configurations.

Proposed solution

The class diagram and end-to-end workflow of the Gold Standard test setup for Hyperspace are attached. The plan provides a basic setup for TPC-DS query plan validation, plus extension points for more query/data/config combinations.

File Structure:

./src/test/resources/tpcds-spark-2.4/ >>> Without Hyperspace: this directory is fully used by TPCDSSparkSuite.
    indexconfigs/                       >>> Empty index config. This captures how default Spark would perform.
    flags/                              >>> Specialized configs for the setup without Hyperspace.
    approvedSimplifiedPlans/            >>> Simplified plans, generated once and stored for validation.
    queries/                            >>> Query files, one for each query.
./src/test/resources/tpcdsBasic/      >>> This directory is fully used by TPCDSBasicSuite. Other setups get similar directories.
    indexconfigs/                       >>> Index configs. Could be a conf file with index definitions.
    flags/                              >>> Specialized configs for every setup (e.g. with/without hybrid scan).
    approvedSimplifiedPlans/            >>> Simplified plans, generated once and stored for validation.
    queries/                            >>> Query files, one for each query.
./src/test/resources/tpcdsOther/
    ...                                 >>> Similar setup as above.

/approvedSimplifiedPlans/
The approvedSimplifiedPlans directory will contain two files for each TPC-DS query: explain.txt and simplified.txt.
explain.txt: contains the df.explain() output of the query. It exists only so that a user can read and compare plans by eye; the tests do not use it for comparison.
simplified.txt: contains the simplified plan, with references normalized and locations cleaned up. The tests compare against this file, and a test fails if the string match fails.
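
As a rough illustration of what "normalizes references and cleans up locations" could mean, here is a minimal string-based sketch. It is not the proposed implementation: `PlanNormalizer`, the two regular expressions, and the `<location>` placeholder are all assumptions made for this example. Expression IDs are renumbered in order of first appearance, and `file:` URIs are replaced by a fixed token, so that two runs on different machines yield the same text.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Hyperspace or Spark.
object PlanNormalizer {
  // Expression IDs as printed in Spark plans, e.g. "c3#12" or "sum#345L".
  private val exprId = "#\\d+L?".r
  // Machine-specific file URIs, e.g. "file:/C:/Users/me/repo/indexLocation".
  private val location = "file:/[^\\s,\\]\\)]*".r

  def normalize(plan: String): String = {
    // Maps each original ID to a small number assigned on first appearance.
    val ids = mutable.HashMap.empty[String, Int]
    val renumbered = exprId.replaceAllIn(plan, m =>
      "#" + ids.getOrElseUpdate(m.matched, ids.size + 1))
    location.replaceAllIn(renumbered, "<location>")
  }
}
```

With this sketch, two plans that differ only in expression IDs and file paths normalize to the same string, which is the property the string comparison of simplified.txt relies on.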

Test Class Diagram and File Structure

Workflow of DataGenerator and IndexGenerator

End to End Test Workflow

Complete pdf with above diagrams in high def:

Class Diagrams and Test Workflow.pdf

Implementation

Who/When: @apoorvedave1 , 3 weeks from date of start (5-6 weeks for merge) not including interruptions.

PRs

Copy TPC-DS v1.4 queries and approved plans from the Spark codebase: [Gold Standard] Add resources files for spark queries from spark's plan stability suite #383
[Gold Standard]: Initial code for spark only setup with a single query #384
[Gold Standard] Updated plans for all tpcds queries with spark-only setup #377
[Gold Standard] Initial Code showing Hyperspace Indexes with a sample query #385
[Gold Standard] Add index definitions and approved plans for TPCDSHyperspace Plan Stability Suite #386
Code improvements: use external conf files instead of hardcoding index configs and test configs

Tasks:

Class ConfigReader: Create config file for basic config setup. (Simple config. Parquet, static data, bucket config, index root location etc.)
Get tpcds query files
Create Index Config file (CSV or JSON, still to be decided), with one row per index: "TableName, IndexName, IndexCols, IncludedCols"
Class/Trait PlanStabilitySuite
Trait DataGenerator (use pre-created data and configs to generate data in test required format)
Trait IndexGenerator (use index config, other configs and output of dataGenerator to create indexes)
Class MockTPCDSDataGeneratorImpl: creates mock data files (create empty tables with schema same as oss spark)
Class MockIndexGeneratorImpl: creates index metadata files only. No actual index files created
Class Comparator (compares normalized plan files)
Class TPCDSSimple extends PlanStabilitySuite
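
To make the index config row format from the task list concrete, here is a minimal sketch of a CSV reader. The file format is still an open question in the proposal, so everything here is an assumption: the names `IndexDefinition` and `IndexConfigReader`, the use of `;` to separate multiple columns inside one field, and the `#` comment prefix are all hypothetical.

```scala
// Hypothetical model of one row: "TableName, IndexName, IndexCols, IncludedCols".
final case class IndexDefinition(
    tableName: String,
    indexName: String,
    indexedCols: Seq[String],
    includedCols: Seq[String])

object IndexConfigReader {
  // Assumption: several columns in one field are separated by ';'.
  private def cols(field: String): Seq[String] =
    field.split(";").map(_.trim).filter(_.nonEmpty).toList

  def parseLine(line: String): Either[String, IndexDefinition] =
    line.split(",", -1).map(_.trim) match {
      case Array(table, name, indexed, included)
          if table.nonEmpty && name.nonEmpty && cols(indexed).nonEmpty =>
        Right(IndexDefinition(table, name, cols(indexed), cols(included)))
      case _ =>
        Left(s"Malformed index config row: $line")
    }

  // Skips blank lines and '#' comments; fails fast on the first malformed row.
  def parse(lines: Seq[String]): Seq[IndexDefinition] =
    lines.map(_.trim).filter(l => l.nonEmpty && !l.startsWith("#")).map { l =>
      parseLine(l).fold(msg => throw new IllegalArgumentException(msg), identity)
    }
}
```

For example, the row `lineitem, filterIndex, c3, c1` would describe the `filterIndex` used in the sample metadata file later in this proposal.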

MockTPCDSDataGenerator Tasks:

def generateData(src/test/ root folder name, system config files, data destination folder name)
generateData impl: same as the OSS Spark TPCDSBase suite. It just creates tables in the catalog, which also creates empty folders in the spark-warehouse dir.

Index Generator Tasks:

def generateIndex(data destination folder name, system config files, index config files, index destination folder name)
generateIndex impl: create index metadata files only; no index data files are created.
class MockSignatureProvider: returns the table name of the source data. This table name is used as the signature in the index.

Indexcreator.createIndex(sourceTable, indexConfig): Unit => creates <index_storage_location>/<index_name>/_hyperspace_log/0

A sample metadata file is shown below. The labels in the left margin (name, indexCols, schema, ...) are annotations, not part of the file.

<index_storage_location>/<index_name>/_hyperspace_log/0:
                {
name              "name" : "filterIndex",
                  "derivedDataset" : {
                    "properties" : {
                      "columns" : {
indexCols               "indexed" : [ "c3" ],
included cols           "included" : [ "c1" ]
                      },
schema                "schemaString" : "{\"type\":\"struct\",\"fields\":[{\"name\":\"c3\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}",
                      "numBuckets" : 200,
                      "properties" : {
                        "hasParquetAsSourceFormat" : "true"
                      }
                    },
                    "kind" : "CoveringIndex"
                  },
                  "content" : {
index_storage       "root" : {
                      "name" : "file:/C:/",
                      "files" : [ ],
                      "subDirs" : [ {
                        "name" : "Users",
                        "files" : [ ],
                        "subDirs" : [ {
                          "name" : "apdave",
                          "files" : [ ],
                          "subDirs" : [ {
                            "name" : "github",
                            "files" : [ ],
                            "subDirs" : [ {
                              "name" : "hyperspace-1",
                              "files" : [ ],
                              "subDirs" : [ {
                                "name" : "src",
                                "files" : [ ],
                                "subDirs" : [ {
                                  "name" : "test",
                                  "files" : [ ],
                                  "subDirs" : [ {
                                    "name" : "resources",
                                    "files" : [ ],
                                    "subDirs" : [ {
                                      "name" : "indexLocation",
                                      "files" : [ ],
                                      "subDirs" : [ {
                                        "name" : "filterIndex",
                                        "files" : [ ],
                                        "subDirs" : [ {
                                          "name" : "v__=0",
                                          "files" : [ {
arbitrary file info                                  "name" : "somefile.parquet",
                                                      "size" : 10,
                                                      "modifiedTime" : 1612989388690,
                                                      "id" : 0
                                                    }],
                                          "subDirs" : [ ]
                                        } ]
                                      } ]
                                    } ]
                                  } ]
                                } ]
                              } ]
                            } ]
                          } ]
                        } ]
                      } ]
                    },
                    "fingerprint" : {
                      "kind" : "NoOp",
                      "properties" : { }
                    }
                  },
                  "source" : {
                    "plan" : {
                      "properties" : {
                        "relations" : [ {
                          "rootPaths" : [ "file:/C:/Users/apdave/github/hyperspace-1/src/test/resources/e2eTests/lineitem" ],
                          "data" : {
                            "properties" : {
                              "content" : {
                                "root" : {
                                  "name" : "file:/C:/",
                                  "files" : [ ],
                                  "subDirs" : [ {
                                    "name" : "Users",
                                    "files" : [ ],
                                    "subDirs" : [ {
                                      "name" : "apdave",
                                      "files" : [ ],
                                      "subDirs" : [ {
                                        "name" : "github",
                                        "files" : [ ],
                                        "subDirs" : [ {
                                          "name" : "hyperspace-1",
                                          "files" : [ ],
                                          "subDirs" : [ {
                                            "name" : "src",
                                            "files" : [ ],
                                            "subDirs" : [ {
                                              "name" : "test",
                                              "files" : [ ],
                                              "subDirs" : [ {
                                                "name" : "resources",
                                                "files" : [ ],
                                                "subDirs" : [ {
                                                  "name" : "e2eTests",
                                                  "files" : [ ],
                                                  "subDirs" : [ {
                                                    "name" : "lineitem",
empty Content object                                "files" : [ ],
                                                    "subDirs" : [ ]
                                                  } ]
                                                } ]
                                              } ]
                                            } ]
                                          } ]
                                        } ]
                                      } ]
                                    } ]
                                  } ]
                                },
                                "fingerprint" : {
                                  "kind" : "NoOp",
                                  "properties" : { }
                                }
                              },
                              "update" : null
                            },
                            "kind" : "HDFS"
                          },
schema from catalog       "dataSchemaJson" : "{\"type\":\"struct\",\"fields\":[{\"name\":\"c1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c2\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c3\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c4\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c5\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}",
                          "fileFormat" : "parquet",
                          "options" : { }
                        } ],
                        "rawPlan" : null,
                        "sql" : null,
                        "fingerprint" : {
                          "properties" : {
                            "signatures" : [ {
fixed provider                "provider" : "com.microsoft.hyperspace.index.MockSignatureProvider",
returns source table name     "value" : "lineitem"
                            } ]
                          },
                          "kind" : "LogicalPlan"
                        }
                      },
                      "kind" : "Spark"
                    }
                  },
                  "properties" : { },
                  "version" : "0.1",
                  "id" : 1,
                  "state" : "ACTIVE",
                  "timestamp" : 1612998769321,
                  "enabled" : true
                }
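
The nested "content" / "root" structure in the sample above is a chain of single-child directories that ends at the index version folder. A mock index generator has to build that chain from a plain path. The sketch below shows one way to do it; `FileEntry`, `Directory`, and `MockContent` are hypothetical stand-ins defined only for this example, not the actual Hyperspace metadata classes.

```scala
// Hypothetical stand-ins for the directory tree seen in the sample log entry.
final case class FileEntry(name: String, size: Long, modifiedTime: Long, id: Long)
final case class Directory(name: String, files: Seq[FileEntry], subDirs: Seq[Directory])

object MockContent {
  // Builds root -> seg1 -> seg2 -> ...; only the last segment holds files.
  def fromPath(root: String, segments: Seq[String], leafFiles: Seq[FileEntry]): Directory =
    segments.foldRight(Option.empty[Directory]) { (segment, child) =>
      Some(Directory(segment, if (child.isEmpty) leafFiles else Nil, child.toList))
    } match {
      case Some(top) => Directory(root, Nil, List(top))
      case None => Directory(root, leafFiles, Nil)
    }

  // Location of a log entry, following the path layout given in this proposal.
  def logEntryPath(indexStorageLocation: String, indexName: String, id: Int = 0): String =
    s"$indexStorageLocation/$indexName/_hyperspace_log/$id"
}
```

Calling `MockContent.fromPath("file:/C:/", Seq("indexLocation", "filterIndex", "v__=0"), files)` yields the same shape as the sample, shortened to three levels.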

Comparator Tasks:

def compare(queryId, approvedSimplifiedPlansLocation, testSimplifiedPlansLocation): Boolean
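
A minimal sketch of how `compare` might behave, under two assumptions that the proposal does not state: each query has its own sub-folder containing simplified.txt, and line endings are normalized before comparison so that Windows and Unix checkouts compare equal. `PlanComparator` is a hypothetical name.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical comparator; assumes the layout <location>/<queryId>/simplified.txt.
object PlanComparator {
  // Reads a plan file, ignoring CRLF/LF differences and surrounding whitespace.
  private def read(path: Path): String =
    new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
      .replace("\r\n", "\n")
      .trim

  def compare(
      queryId: String,
      approvedSimplifiedPlansLocation: String,
      testSimplifiedPlansLocation: String): Boolean = {
    val approved = Paths.get(approvedSimplifiedPlansLocation, queryId, "simplified.txt")
    val actual = Paths.get(testSimplifiedPlansLocation, queryId, "simplified.txt")
    // A missing file on either side counts as a mismatch.
    Files.exists(approved) && Files.exists(actual) && read(approved) == read(actual)
  }
}
```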

PlanStabilitySuite Tasks:

protected def testRootLocation.
    Implemented by subclasses. E.g. for TPCDSBasicSuite (extends PlanStabilitySuite), this would be "src/test/resources/tpcdsBasic/".
def dataGenerator. Uses configs and test location to create the dataGenerator.
def indexGenerator. Uses configs and test location to create the indexGenerator.
protected def setupDataAndIndex
def normalizePlan(query plan): returns the simplified query plan.
def generateSimplifiedPlan(queryId).
    Use configs and query id to get the Spark query from the query file at "src/test/resources/tpcdsBasic/queries/".
    Run query.explain to generate simplified plan.
    Normalize and return.
def createAndSaveQueryPlans
    For all queries to test:
        generateSimplifiedPlan and save at test location.
def comparePlan(plan1, plan2): Boolean
def testQuery(query, regenerateApprovedPlans = false)
    Create normalized plan for query.
    If (regenerateApprovedPlans), copy the plan to the approvedPlans location;
    else compare with the plan at the approvedPlans location and return the result.
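
The `testQuery` flow above can be sketched without Spark by leaving plan generation abstract. This is an illustration only: `PlanStabilitySketch` is a hypothetical class, and the approved-plan layout `<testRootLocation>/approvedSimplifiedPlans/<queryId>/simplified.txt` is an assumption based on the file structure described earlier.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path, Paths}

// Hypothetical base class; a real suite would produce the plan with Spark.
abstract class PlanStabilitySketch {
  protected def testRootLocation: String
  protected def generateSimplifiedPlan(queryId: String): String

  private def approvedFile(queryId: String): Path =
    Paths.get(testRootLocation, "approvedSimplifiedPlans", queryId, "simplified.txt")

  def testQuery(queryId: String, regenerateApprovedPlans: Boolean = false): Boolean = {
    val actual = generateSimplifiedPlan(queryId)
    val file = approvedFile(queryId)
    if (regenerateApprovedPlans) {
      // Overwrite the approved plan with the freshly generated one.
      Files.createDirectories(file.getParent)
      Files.write(file, actual.getBytes(UTF_8))
      true
    } else {
      // A missing approved plan is treated as a failure.
      Files.exists(file) && new String(Files.readAllBytes(file), UTF_8) == actual
    }
  }
}
```

A concrete suite would override `testRootLocation` and `generateSimplifiedPlan`, then call `testQuery` once per query.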

TPCDSBasicSuite Tasks:

def testRootLocation: "src/test/resources/tpcdsBasic/"
def setupDataAndIndex:
Use dataGenerator and indexGenerator to create data and index. >> E.g. for refresh index/hybrid scan, use this differently.
queries.foreach(q => super.testQuery(q, testRootLocation))

Updating Approved Plans

With the addition of new rules or indexes, query plans are expected to change. Tests will fail unless the approved simplified plans for the affected queries are updated.

To re-generate golden files for the entire suite, run:
{{{
  SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite"
}}}

To re-generate the golden file for a single test, run:
{{{
  SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite -- -z (tpcds-v1.4/q49)"
}}}
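
Inside the suite, the environment variable used in the commands above would be read once and passed along as the `regenerateApprovedPlans` flag. A minimal sketch, assuming the same variable name as in those commands; `GoldenFileMode` is a hypothetical helper, and the `env` parameter exists only so the check can be exercised without touching the real environment.

```scala
// Hypothetical helper: regeneration is on only when the variable is exactly "1".
object GoldenFileMode {
  def shouldRegenerate(env: Map[String, String] = sys.env): Boolean =
    env.get("SPARK_GENERATE_GOLDEN_FILES").contains("1")
}
```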

Regression: Defining Regression and Test Failure

If a test starts failing, the actual plan for the failed query differs from the expected (approved) plan. For now, resolving this is a manual step.
There are two options at this point:

If the new plan is an expected improvement, regenerate the golden file for this plan and push it with your PR.
If the change is not expected, find the bug in your rule and fix it so that the test passes.

Adding New Test Suites

Based on the current design, adding a new suite to Gold Standard takes three steps:

Add the required resources to the `src/test/resources/<your_cool_new_suite>/` folder. Make sure to include information on how to generate data and indexes, any required configs, approved plans, etc.
Extend PlanStabilitySuite and implement the functionality to generate data and indexes.
Run the suite.

Performance Implications (if applicable)

None

Open issues (if applicable)

Additional context (if applicable)

Similar to Spark's Plan Stability Suite
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/PlanStabilitySuite.scala


rapoth · 2021-01-27T01:19:49Z

Thank you! Can you follow the outline from here please? I think you have most of it already.

imback82 · 2021-01-27T05:12:49Z

Can you explain tpcdsBasic vs. tpcdsOther?

apoorvedave1 · 2021-01-27T17:55:50Z

Thanks @rapoth , yeah I will fix it before merging.

The general idea is that we will have lots of approved simplified plans for different configs. For example, with and without hybrid scan enabled, we will have very different approved simplified plans. So I used tpcdsBasic to mean TPC-DS with config 1, and tpcdsOther to mean TPC-DS with config 2.

e.g.

tpcdsBasic/
  approvedPlans/
    q1simplified.txt
tpcdsOther/
  approvedPlans/
    q1simplified.txt

tpcdsBasic/.../q1Simplified.txt:

Project..
  Filter..
    FileScan: <HyperspaceIndex>

tpcdsOther/.../q1Simplified.txt:

Project..
  BucketUnion
    Filter
      FileScan: <Unindexed files identified by hybrid scan>
    Filter..
      FileScan: <HyperspaceIndex>

apoorvedave1 · 2021-02-12T22:53:08Z

@imback82 @pirz @thugsatbay @sezruby please take a look at the latest changes for gold standard

apoorvedave1 added the proposal and untriaged labels · Jan 26, 2021

This was referenced Jan 27, 2021

Gold Standard #282

Open

Gold Standard for query plan check #261

Closed

apoorvedave1 mentioned this issue Feb 19, 2021

Gold Standard: spark-only version for creating and comparing golden files #361

Open

apoorvedave1 mentioned this issue Apr 15, 2021

Remove linefeed check from gold standard plan comparison #419

Merged

apoorvedave1 mentioned this issue Apr 22, 2021

[Gold Standard] Enable stats with spark 2.4 with a sample query #429

Open
