Setup batch mode query #186

Closed
andrewresearch opened this issue Aug 14, 2018 · 0 comments

andrewresearch commented Aug 14, 2018

Create a query that has no submitted text, only parameters. The parameters should include:

  • a URL to an input S3 bucket to which TAP has read/write access, or read-only access if an optional output bucket is provided. Start with S3 only, but potentially allow for other URLs in future, including connections to databases or pub/sub systems
  • a sequence of pipelines with accompanying parameters, e.g. [{"pipeline": "importAndClean", "parameters": {"cleanType": "utf8"}}, {"pipeline": "moves", "parameters": {"grammar": "reflective"}}]
  • an optional URL to a separate output S3 bucket to which TAP has write access. If this parameter is not supplied, the input bucket needs to be read/write. In either case TAP will create a subdirectory __TAP_OUTPUT (if it does not already exist) and write to it
  • an optional input format: initially only UTF-8 TXT is supported, which is the default
  • an optional output format: initially only JSON is supported, which is the default, but perhaps CSV, HTML, PDF, etc. in future
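
As a rough illustration of the parameter set above, here is a minimal Python sketch of parsing and validating a batch query. The field names (inputUrl, outputUrl, inputFormat, outputFormat), the BatchParameters class and the parse_batch_parameters function are assumptions made for this example; the issue does not fix a wire format, and TAP itself may structure this differently.

```python
import json
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# Only the formats named in the issue are accepted for now.
SUPPORTED_INPUT_FORMATS = {"txt"}
SUPPORTED_OUTPUT_FORMATS = {"json"}


@dataclass
class BatchParameters:
    """Hypothetical container for the parameters of a batch query."""
    input_url: str
    pipelines: list
    output_url: Optional[str] = None
    input_format: str = "txt"
    output_format: str = "json"

    def output_location(self) -> str:
        # Output goes to the separate output bucket if given,
        # otherwise to the input bucket, always under __TAP_OUTPUT.
        base = self.output_url or self.input_url
        return base.rstrip("/") + "/__TAP_OUTPUT"


def parse_batch_parameters(raw: str) -> BatchParameters:
    """Parse a JSON parameter document and reject unsupported values."""
    data = json.loads(raw)
    params = BatchParameters(
        input_url=data["inputUrl"],
        pipelines=data["pipelines"],
        output_url=data.get("outputUrl"),
        input_format=data.get("inputFormat", "txt"),
        output_format=data.get("outputFormat", "json"),
    )
    for url in (params.input_url, params.output_url):
        if url is not None and urlparse(url).scheme != "s3":
            raise ValueError(f"only s3:// URLs are supported for now: {url}")
    if not params.pipelines:
        raise ValueError("at least one pipeline is required")
    for step in params.pipelines:
        if "pipeline" not in step:
            raise ValueError(f"pipeline entry has no name: {step}")
    if params.input_format not in SUPPORTED_INPUT_FORMATS:
        raise ValueError(f"unsupported input format: {params.input_format}")
    if params.output_format not in SUPPORTED_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {params.output_format}")
    return params
```

Keeping the formats in small sets means CSV, HTML or PDF could later be added in one place without touching the validation logic.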

NOTE: TAP should overwrite existing files. It is the user's responsibility to ensure file integrity within the buckets (for example, if TAP is called multiple times with the same URL).

A batch mode query should return a single UUID that is linked to the batch job. A subsequent query that supplies this UUID can then be made to check on the progress of the batch.
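
The submit-then-poll pattern described here could look roughly like the following Python sketch. BatchRegistry, BatchStatus and their fields are hypothetical names for illustration only, and an in-memory dictionary stands in for whatever store TAP would actually use.

```python
import uuid
from dataclasses import dataclass
from typing import Dict


@dataclass
class BatchStatus:
    """Hypothetical progress record for one batch job."""
    state: str = "queued"
    documents_done: int = 0
    documents_total: int = 0


class BatchRegistry:
    """In-memory stand-in for the store that links UUIDs to batch jobs."""

    def __init__(self) -> None:
        self._jobs: Dict[str, BatchStatus] = {}

    def submit(self, documents_total: int) -> str:
        # The batch query returns only this UUID to the caller.
        batch_id = str(uuid.uuid4())
        self._jobs[batch_id] = BatchStatus(documents_total=documents_total)
        return batch_id

    def record_progress(self, batch_id: str, documents_done: int) -> None:
        status = self._jobs[batch_id]
        status.documents_done = documents_done
        finished = documents_done >= status.documents_total
        status.state = "finished" if finished else "running"

    def progress(self, batch_id: str) -> BatchStatus:
        # The follow-up query looks the job up by its UUID.
        if batch_id not in self._jobs:
            raise KeyError(f"unknown batch job: {batch_id}")
        return self._jobs[batch_id]
```

A caller would keep the string returned by submit and pass it to progress later; an unknown UUID raises an error rather than returning an empty status.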

The batch process should:

  1. Create the __TAP_OUTPUT directory in the appropriate bucket (verifying permissions in the process)
  2. Create a metadata file with the UUID of the batch job e.g. BATCH_xxxx-xxxx-xxx-xxxx.txt
  3. Check the list of pipelines for validity
  4. Record the start time in the metadata file and spawn the job to a new process
  5. Return either a UUID for the batch job, or an appropriate error message
  6. Write the output of the batch job to the __TAP_OUTPUT directory and periodically update the metadata file with running statistics (e.g. average document size, average analysis time per document)
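
The six steps above can be sketched end to end as follows. This is only an illustration under several assumptions: a local directory stands in for the S3 bucket, a thread stands in for the separate process, the analysis is a placeholder character count, and KNOWN_PIPELINES, WORKERS, start_batch and run_job are hypothetical names. The metadata field names are likewise invented for the example.

```python
import json
import threading
import time
import uuid
from pathlib import Path

# Pipeline names taken from the example in this issue; the real list is TAP's.
KNOWN_PIPELINES = {"importAndClean", "moves"}
WORKERS = {}  # batch UUID -> worker thread, kept so callers can wait on a job


def run_job(input_dir: Path, output_dir: Path, meta_path: Path, metadata: dict) -> None:
    """Step 6: write one output file per document and refresh the metadata."""
    sizes = []
    job_started = time.time()
    for doc in sorted(input_dir.glob("*.txt")):
        text = doc.read_text(encoding="utf-8")
        result = {"file": doc.name, "characters": len(text)}  # placeholder analysis
        # Existing output files with the same name are overwritten.
        (output_dir / (doc.stem + ".json")).write_text(json.dumps(result), encoding="utf-8")
        sizes.append(len(text))
        metadata["documentsProcessed"] = len(sizes)
        metadata["averageDocumentSize"] = sum(sizes) / len(sizes)
        metadata["averageSecondsPerDocument"] = (time.time() - job_started) / len(sizes)
        meta_path.write_text(json.dumps(metadata), encoding="utf-8")
    metadata["finished"] = time.time()
    meta_path.write_text(json.dumps(metadata), encoding="utf-8")


def start_batch(bucket: Path, pipelines: list) -> dict:
    # Step 1: create __TAP_OUTPUT; a failure here signals missing permissions.
    output_dir = bucket / "__TAP_OUTPUT"
    try:
        output_dir.mkdir(exist_ok=True)
    except OSError as err:
        return {"error": f"cannot create output directory: {err}"}
    # Step 2: create the metadata file named after the batch UUID.
    batch_id = str(uuid.uuid4())
    meta_path = output_dir / f"BATCH_{batch_id}.txt"
    metadata = {"batchId": batch_id}
    meta_path.write_text(json.dumps(metadata), encoding="utf-8")
    # Step 3: check the list of pipelines for validity.
    unknown = [p.get("pipeline") for p in pipelines if p.get("pipeline") not in KNOWN_PIPELINES]
    if not pipelines or unknown:
        return {"error": f"invalid pipeline list, unknown entries: {unknown}"}
    # Step 4: record the start time and hand the work to a background worker.
    metadata["started"] = time.time()
    meta_path.write_text(json.dumps(metadata), encoding="utf-8")
    worker = threading.Thread(target=run_job, args=(bucket, output_dir, meta_path, metadata))
    WORKERS[batch_id] = worker
    worker.start()
    # Step 5: return the UUID (errors were returned above).
    return {"batchId": batch_id}
```

Calling start_batch(Path("some/dir"), [{"pipeline": "moves", "parameters": {"grammar": "reflective"}}]) returns immediately with the batch UUID while the worker continues writing into __TAP_OUTPUT.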