
[PPS] CreateDatum hierarchical input types #9928

Merged
merged 12 commits into master on Apr 22, 2024

Conversation


@smalyala smalyala commented Apr 4, 2024

This PR adds support for Union, Cross, Join, and Group inputs to the CreateDatum RPC. It builds upon the changes made in #9712.

One can visualize the Input as a tree. Inputs deeper in the spec are represented by deeper levels in the tree. When more datums need to be created, a datum request is propagated down the tree level-by-level via channels until it reaches the PFS inputs (leaves). The PFS nodes then send shards up one level for processing. The processed result shards are further propagated upwards, until eventually, they reach the root. The processed results at the root level are sent to the iterator, which iterates over each shard (file set) and sends the datums back to the client.

Logic by input:

  • Union
    • shards received from children inputs are simply propagated up
  • Cross, Join
    • a shard associated with a child input is received
    • a Cartesian product is computed to create all permutations pairing the shard just received with the shards from the other children inputs
      • each permutation represents a Cross/Join task (one shard per child input)
      • the comments for shardPermute() contain an example
  • Group
    • all key file sets are needed for the Group task, so this is largely unchanged from its ListDatum implementation

Jira: INT-1204

@smalyala smalyala force-pushed the smalyala/createdatum-hierarchical branch 4 times, most recently from 5647ad3 to 21b54d1 Compare April 7, 2024 18:57
@smalyala smalyala force-pushed the smalyala/createdatum-hierarchical branch 3 times, most recently from f800f2e to 968c0ff Compare April 8, 2024 18:38
codecov bot commented Apr 8, 2024

Codecov Report

Attention: Patch coverage is 74.19355%, with 72 lines in your changes missing coverage. Please review.

Project coverage is 58.09%. Comparing base (9697f86) to head (7964d98).
Report is 20 commits behind head on master.

Files Patch % Lines
src/server/worker/datum/create_stream.go 73.72% 72 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9928      +/-   ##
==========================================
+ Coverage   58.00%   58.09%   +0.08%     
==========================================
  Files         608      608              
  Lines       73924    74189     +265     
==========================================
+ Hits        42883    43099     +216     
- Misses      30485    30539      +54     
+ Partials      556      551       -5     


@smalyala smalyala force-pushed the smalyala/createdatum-hierarchical branch from 8ce580d to 781a105 Compare April 10, 2024 16:21
@smalyala smalyala marked this pull request as ready for review April 10, 2024 16:22
@smalyala smalyala requested review from a team as code owners April 10, 2024 16:22
@smalyala smalyala removed the request for review from brycemcanally April 10, 2024 16:26
@bbonenfant (Contributor) left a comment:

Approving, but we should look into why these formatter changes are happening spuriously.

@jrockway (Member) left a comment:

This is looking good. I think you want to figure out how to test each of the stream producers outside of pachd and write some unit tests that cover the actual calculations; are the joins correct, are the crosses correct, etc. That is what is currently missing.

// Consumes file set id shards from each child input as they arrive. Calls cb with the index of
// the child input. Function returns when all children channels are closed or the context is done.
func consumeChildrenFsidChans(ctx context.Context, childrenFsidChans []chan string, cb func(int, string) error) error {
cases := make([]reflect.SelectCase, len(childrenFsidChans)+1)
Member:

I don't think we should use reflection for this. Just use a single channel and if you need to retain the index information, send struct{ id int; fsid string } instead of just the fsid.

Contributor (author):

The reason I use one channel per child input is so the child input can easily indicate to the parent it's done sending all its fsids by closing its channel. With the single channel approach, we can mimic that behavior by returning an empty string for the fsid (for example) to indicate a child is done sending its fsids. Does that sound okay?

@@ -61,17 +61,17 @@ func DoOne(ctx context.Context, doer Doer, input *anypb.Any) (*anypb.Any, error)

// DoBatch executes a batch of tasks.
func DoBatch(ctx context.Context, doer Doer, inputs []*anypb.Any, cb CollectFunc) error {
var eg errgroup.Group
eg, egCtx := errgroup.WithContext(ctx)
Member:

I don't think it's a big problem to call this variable ctx. It prevents someone from accidentally using the parent context when they intend to use this one.

}
listDatumTimeToFirstDatum = time.Since(start).Seconds()
return nil
testTimeToFirstDatum := func(input *pps.Input) {
Member:

Do these tests need to use Pachyderm running in k8s? RealEnv has been replaced with pachd.NewTestPachd and can probably run tasks.

if index == len(input) {
temp := make([]string, len(result))
copy(temp, result)
*output = append(*output, temp)
Member:

Why do we need a copy of result here?

Contributor (author):

We modify the underlying array in result to generate permutations. Once we create a permutation, a copy needs to be created, because in subsequent iterations to create other permutations, the underlying array for result is modified.

fsidChan := make(chan string)
go streamingCreate(egCtx, c, taskDoer, input, fsidChan, errChan, requestDatumsChan)
for fsid := range fsidChan {
if err := renewer.Add(ctx, fsid); err != nil {
Member:

I think this should use the error group context. If the error group is done but the parent isn't, then this goroutine will live too long.

Contributor (author):

Good catch

return errors.Wrap(err, "consumeChildrenFsidChans")
}
} else {
finished[i] = true
Member:

I think you need to remove the i-th selector from cases; nothing guarantees not hitting this 'matched because the channel is closed' case on the next Select call. It could happen forever.

Contributor (author):

Is the concern that we'll never exit from the loop or that we'll unnecessarily do extra iterations by matching on the closed channel repeatedly?

{"c", "2", "x"},
{"c", "2", "y"},
}
for _, e := range expected {
Member:

You should rework this test to use require.NoDiff. I think that slicesEqualUnordered and sliceExistsInSlices only exist to get around having to sort expected and expected[], you should just sort those (with cmpopts.SortSlices for []string and [][]string). The output will be much easier to read and you don't need to maintain these helpers.

Contributor (author):

I like that better. Will do.

client.NewPFSInputOpts(repo, pfs.DefaultProjectName, repo, "master", "/file-?*(??)0", "$1", "", false, false, nil),
client.NewPFSInputOpts(repo, pfs.DefaultProjectName, repo, "master", "/file-?0(??)0", "$1", "", false, false, nil),
)
testTimeToFirstDatum(input)
})
Member:

This should be a table-driven test:

testData := []struct{ name string; input *pps.Input }{
  {"PFSInput", client.NewPFSInput(...)},
  {"UnionInput", client.NewUnionInput(...)},
  ...
}
for _, test := range testData {
   t.Run(test.name, func(t *testing.T) { 
      t.Parallel()
      var eg errgroup.Group
      var listDatumTimeToFirstDatum, createDatumTimeToFirstDatum float64
      ...
    })
}

Also, I am not sure how reliable it is to collect one timing data point each time the test is run. Nothing guarantees that each goroutine is given the same scheduling latency or the same amount of time on the CPU, so even if CreateDatum is faster than ListDatum, you aren't guaranteed to measure that every time. It's just a recipe for a flaky test. This could be a benchmark (func BenchmarkCreateDatum(b *testing.B)) if you want to make sure that you collect enough timing samples to be confident in the result.

Contributor (author):

I changed it to a benchmark to avoid flakiness. Running the benchmark a few times confirmed the improved latency in getting the first datum.

require.NoError(t, datumClient.Send(&pps.CreateDatumRequest{Body: &pps.CreateDatumRequest_Start{Start: &pps.StartCreateDatumRequest{Input: input}}}))
n, err := grpcutil.Read[*pps.DatumInfo](datumClient, make([]*pps.DatumInfo, 11))
require.True(t, stream.IsEOS(err))
require.Equal(t, 10, n)
Member:

Somewhere, there needs to be a test that actually checks that CreateDatum produces correct results. You run all the code in these tests, which is good, but you aren't checking that the results are correct. Anything that returns 10 copies of DatumInfo (perhaps nil 10 times) will pass the test, which is not good enough for someone to be able to work on this code confidently.

@smalyala (Contributor, author) commented Apr 16, 2024:

I tested correctness for the inputs manually. I'll add some of those to the test suite.

@smalyala smalyala force-pushed the smalyala/createdatum-hierarchical branch from 680a12f to 5750f05 Compare April 16, 2024 23:11
@smalyala smalyala force-pushed the smalyala/createdatum-hierarchical branch from 5750f05 to 7964d98 Compare April 17, 2024 00:03
@smalyala smalyala merged commit 02063a2 into master Apr 22, 2024
21 checks passed