Skip to content

Add bulk loader validator #3838

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from

Conversation

animesh2049
Copy link
Contributor

@animesh2049 animesh2049 commented Aug 20, 2019

bulk validator reports all the parsing error in input files


This change is Reviewable

Updates #3984

bulk validator reports all the parsing error in input files
@animesh2049 animesh2049 requested a review from a team August 20, 2019 12:04
This tool can also be used to validate input files for live loader.
@animesh2049 animesh2049 changed the title [WIP] Add bulk loader validator Add bulk loader validator Aug 22, 2019
Copy link

@pullrequest pullrequest bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ A review job has been created and sent to the PullRequest network.


@animesh2049 you can click here to see the review status or cancel the code review job.

Copy link

@pullrequest pullrequest bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some inline feedback regarding error handling improvements and consistency.


Reviewed with ❤️ by PullRequest

Copy link

@gitlw gitlw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like all the logic contained in this PR is also available in the current bulk loader. Is there anything new that this validator provides?

Reviewable status: 0 of 4 files reviewed, 9 unresolved discussions (waiting on @animesh2049 and @manishrjain)

@animesh2049
Copy link
Contributor Author

It seems like all the logic contained in this PR is also available in the current bulk loader. Is there anything new that this validator provides?

Reviewable status: 0 of 4 files reviewed, 9 unresolved discussions (waiting on @animesh2049 and @manishrjain)

Yes all the logic is already there in bulk loader but since we want a different validator tool, we thought it would be good to do it this way. What are your thoughts on this @gitlw ?

Copy link

@gitlw gitlw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought Manish wanted a more rigorous validation tool. So maybe check with him first. Thanks!

Reviewable status: 0 of 4 files reviewed, 9 unresolved discussions (waiting on @animesh2049 and @manishrjain)

Copy link

@pullrequest pullrequest bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to inactivity, PullRequest has cancelled this review job. You can reactivate the code review job from the PullRequest dashboard

Copy link
Member

@mangalaman93 mangalaman93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add test in systest

@@ -0,0 +1,122 @@
package bulkvalidator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could just call if validator instead of bulk_validator. This could be use for live loader data as well.


"github.com/dgraph-io/badger/y"
"github.com/dgraph-io/dgraph/chunker"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove newline here

"github.com/dgraph-io/dgraph/chunker"
"github.com/prometheus/common/log"

"github.com/dgraph-io/dgraph/x"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this import together with other import.

x.Fatalf("No data files found in %s\n", ld.opt.DataFiles)
}

loadType := chunker.DataFormat(files[0], ld.opt.DataFormat)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check here that all the files have correct type?

if err == io.EOF {
break
} else if err != nil {
x.Check(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it make sense to continue here instead?


for chunkBuf := range m.readerChunkCh {
if err := chunker.Parse(chunkBuf.chunk); err != nil {
m.foundError = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we syncronizing access to foundError?

Copy link
Contributor Author

@animesh2049 animesh2049 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 4 files reviewed, 11 unresolved discussions (waiting on @golangcibot, @mangalaman93, @manishrjain, and @pullrequest[bot])


dgraph/cmd/validator/mapper.go, line 29 at r2 (raw file):

Previously, pullrequest[bot] wrote…

Similar to other comment regarding fmt vs log would potentially switch this to use log since its being used concurrently

Done.


dgraph/cmd/validator/mapper.go, line 28 at r3 (raw file):

Previously, mangalaman93 (Aman Mangal) wrote…

Are we syncronizing access to foundError?

Done.


dgraph/cmd/bulk_validator/loader.go, line 56 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

printf: Println call has possible formatting directive %s (from govet)

Done.


dgraph/cmd/validator/loader.go, line 61 at r2 (raw file):

Previously, pullrequest[bot] wrote…

Since some of the errors are being handled with x.Check which appears to use log.Fatal, would it make sense to just use log.Fatalln instead of fmt.Println with an os.Exit just so that all of the error output is using log?

Done.


dgraph/cmd/validator/loader.go, line 68 at r2 (raw file):

Previously, pullrequest[bot] wrote…

Could possibly add a newline onto the end of this printf to match the others

Done.


dgraph/cmd/validator/loader.go, line 75 at r2 (raw file):

Previously, pullrequest[bot] wrote…

Might simplify this to do:

go m.run(loadType, &mapperWg)

Although it actually looks like run() does everything except initializing the chunker on a separate goroutine so it may not be necessary to do this on another goroutine at all?

Done.


dgraph/cmd/validator/loader.go, line 86 at r2 (raw file):

Previously, pullrequest[bot] wrote…

Using log here as well might be better since this is being done concurrently as the log package locks around writing to std output, where fmt doesn't.

Done.


dgraph/cmd/validator/loader.go, line 92 at r2 (raw file):

Previously, pullrequest[bot] wrote…

It looks like x.Check does a log.Fatal which I believe will actually exit without a panic so this will skip your deferred functions above. Just wanted to check if that is okay? There is another usage below as well.

Defer function cleanup() only closes the file. log.Fatal should kill the program and hence close the file also.


dgraph/cmd/validator/loader.go, line 12 at r3 (raw file):

Previously, mangalaman93 (Aman Mangal) wrote…

Move this import together with other import.

Done.


dgraph/cmd/validator/loader.go, line 63 at r3 (raw file):

Previously, mangalaman93 (Aman Mangal) wrote…

Should we check here that all the files have correct type?

It's just taking file type from the first input file. If all the files are not of the same type, chunker will give error while chunking.


dgraph/cmd/validator/loader.go, line 99 at r3 (raw file):

Previously, mangalaman93 (Aman Mangal) wrote…

Will it make sense to continue here instead?

I don't think so, because this error is likely to come in the case of file format mismatch i.e. chunker expects different format.

Copy link
Member

@mangalaman93 mangalaman93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r3.
Reviewable status: 1 of 4 files reviewed, 12 unresolved discussions (waiting on @animesh2049, @golangcibot, @manishrjain, and @pullrequest[bot])


dgraph/cmd/validator/mapper.go, line 29 at r4 (raw file):

		for chunkBuf := range m.readerChunkCh {
			if err := chunker.Parse(chunkBuf.chunk); err != nil {
				atomic.CompareAndSwapUint32(&m.foundError, 0, 1)

I think it'd be okay to not print anything in case there is no error.


dgraph/cmd/validator/mapper.go, line 30 at r4 (raw file):

			if err := chunker.Parse(chunkBuf.chunk); err != nil {
				atomic.CompareAndSwapUint32(&m.foundError, 0, 1)
				glog.Errorf("Error Found in file %s: %s\n", chunkBuf.filename, err)

found*


dgraph/cmd/validator/mapper.go, line 32 at r4 (raw file):

				glog.Errorf("Error Found in file %s: %s\n", chunkBuf.filename, err)
			}

newline


dgraph/cmd/validator/mapper.go, line 36 at r4 (raw file):

	}()

	go func() {

comment


dgraph/cmd/validator/loader.go, line 68 at r2 (raw file):

Previously, animesh2049 (Animesh Chandra Pathak) wrote…

Done.

same here


dgraph/cmd/validator/run.go, line 59 at r4 (raw file):

	}

	loader := newLoader(opt)

no need for assignment.

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is unnecessarily complex. Instead of trying to copy the bulk loader structure, think about what is the minimum you need to make things work nicely. Needs refactoring.

Reviewable status: 1 of 4 files reviewed, 19 unresolved discussions (waiting on @animesh2049, @golangcibot, @manishrjain, and @pullrequest[bot])


dgraph/cmd/validator/mapper.go, line 1 at r5 (raw file):

package validator

License?


dgraph/cmd/validator/mapper.go, line 11 at r5 (raw file):

)

type mapper struct {

You don't need this object.


dgraph/cmd/validator/mapper.go, line 21 at r5 (raw file):

}

func (m *mapper) run(inputFormat chunker.InputFormat, wg *sync.WaitGroup) {

You only need this function.


dgraph/cmd/validator/loader.go, line 1 at r5 (raw file):

package validator

License?


dgraph/cmd/validator/loader.go, line 18 at r5 (raw file):

	TmpDir        string
	NumGoroutines int
	CleanupTmp    bool

Do you need these options?


dgraph/cmd/validator/loader.go, line 33 at r5 (raw file):

}

type loader struct {

Probably don't need the loader object either.


dgraph/cmd/validator/loader.go, line 35 at r5 (raw file):

type loader struct {
	*state
	mappers []*mapper

Don't need these mapper objects.


dgraph/cmd/validator/loader.go, line 88 at r5 (raw file):

			for {
				chunkBuf, err := chunker.Chunk(r)
				if chunkBuf != nil && chunkBuf.Len() > 0 {

This should happen after err has been handled, right?


dgraph/cmd/validator/run.go, line 1 at r5 (raw file):

package validator

License?


dgraph/cmd/validator/run.go, line 20 at r5 (raw file):

func init() {
	Validator.Cmd = &cobra.Command{
		Use:   "validator",

validate


dgraph/cmd/validator/run.go, line 21 at r5 (raw file):

	Validator.Cmd = &cobra.Command{
		Use:   "validator",
		Short: "Validate input file",

files


dgraph/cmd/validator/run.go, line 60 at r5 (raw file):

	loader := newLoader(opt)
	loader.mapStage()

Add a message at the end saying the validation is done or something.

@animesh2049 animesh2049 deleted the animesh2049/bulk_loader_file_validator branch January 14, 2020 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants