Setup batch mode query #186

andrewresearch · 2018-08-14T02:24:14Z

Create a query that has no submitted text, only parameters. The parameters should include:

A URL to an input S3 bucket to which TAP has read/write access, or read-only access if an optional output bucket is provided. Start with S3 only, but potentially allow other URLs in future, including connections to databases or pub/sub systems.
A sequence of pipelines with accompanying parameters, e.g. [{"pipeline": "importAndClean", "parameters": {"cleanType": "utf8"}}, {"pipeline": "moves", "parameters": {"grammar": "reflective"}}]
An optional URL to a separate output S3 bucket to which TAP has write access. If this parameter is absent, the input bucket needs to be read/write. TAP will create a subdirectory __TAP_OUTPUT (if it does not already exist) and write to it.
An optional input format: initially only UTF-8 TXT is supported, which is the default.
An optional output format: initially only JSON is supported, which is the default, but perhaps CSV, HTML, PDF, etc. in future.
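
The parameter list above could be modelled roughly as follows. This is only a sketch in Python, not TAP's actual query schema: the class name `BatchQuery`, its field names, the format strings "txt" and "json", and the rule that at least one pipeline must be given are all assumptions made for illustration. Only the `__TAP_OUTPUT` name and the pipeline/parameters shape come from the description.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional
from urllib.parse import urlparse

OUTPUT_DIR = "__TAP_OUTPUT"  # subdirectory name taken from the issue text


@dataclass
class BatchQuery:
    """Hypothetical container for the batch parameters (names are assumptions)."""
    input_url: str
    pipelines: List[Dict[str, Any]]
    output_url: Optional[str] = None
    input_format: str = "txt"    # UTF-8 text, the only initial input format
    output_format: str = "json"  # the only initial output format

    def validate(self) -> None:
        """Raise ValueError if the parameters cannot be accepted."""
        for url in (self.input_url, self.output_url):
            if url is not None and urlparse(url).scheme != "s3":
                raise ValueError(f"only s3:// URLs are supported for now: {url}")
        if not self.pipelines:  # assumption: an empty sequence is rejected
            raise ValueError("at least one pipeline is required")
        for step in self.pipelines:
            if "pipeline" not in step:
                raise ValueError(f"pipeline step is missing a name: {step}")
        if self.input_format != "txt" or self.output_format != "json":
            raise ValueError("only txt input and json output are supported initially")

    def output_location(self) -> str:
        """Use the output bucket if given, otherwise fall back to the input bucket."""
        base = self.output_url or self.input_url
        return base.rstrip("/") + "/" + OUTPUT_DIR
```

With no output URL, `output_location()` falls back to the input bucket, which is why that bucket must then be read/write.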

NOTE: TAP should overwrite existing files. It is the user's responsibility to ensure file integrity within the buckets (for example, if TAP is called multiple times with the same URL).

A batch mode query should return a single UUID that is linked to the batch job. A subsequent query (with UUID) can be made to check on the progress of the batch.
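
A minimal sketch of that submit-then-poll interaction, in Python. The function names `submit_batch` and `batch_status`, the in-memory `_jobs` dictionary, and the status fields are hypothetical; the description only requires that a UUID is returned and can be used in a later progress query.

```python
import uuid
from typing import Dict

# Hypothetical in-memory job table; a real service would persist this state.
_jobs: Dict[str, Dict[str, object]] = {}


def submit_batch(parameters: dict) -> str:
    """Register a batch job and return the UUID that identifies it."""
    job_id = str(uuid.uuid4())
    _jobs[job_id] = {"status": "running", "processed": 0, "parameters": parameters}
    return job_id


def batch_status(job_id: str) -> Dict[str, object]:
    """Answer a follow-up progress query for a previously returned UUID."""
    job = _jobs.get(job_id)
    if job is None:
        return {"error": f"unknown batch job: {job_id}"}
    return {"status": job["status"], "processed": job["processed"]}
```

The caller keeps only the returned string, so the initial query can return immediately while the job continues in the background.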

The batch process should:

Create the __TAP_OUTPUT directory in the appropriate bucket (verifying permissions in the process)
Create a metadata file with the UUID of the batch job e.g. BATCH_xxxx-xxxx-xxx-xxxx.txt
Check the list of pipelines for validity
Record the start time in the metadata file and spawn the job to a new process
Return either a UUID for the batch job, or an appropriate error message
Write the output of the batch job to the __TAP_OUTPUT directory and update metadata periodically (e.g. average document size, average analysis time per document, etc)
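
The steps above can be sketched as follows, with heavy simplifications that are assumptions rather than TAP behaviour: a local directory stands in for the S3 bucket, a thread stands in for the new process, the `PIPELINES` registry and its `wordCount` entry are invented for the example, and each pipeline is applied to the document text independently (the description does not say whether pipelines chain).

```python
import json
import threading
import time
import uuid
from pathlib import Path
from typing import Callable, Dict, List, Tuple

OUTPUT_DIR = "__TAP_OUTPUT"
# Hypothetical registry: pipeline name -> function from text to a JSON-able result.
PIPELINES: Dict[str, Callable[[str], object]] = {
    "importAndClean": lambda text: text.strip(),
    "wordCount": lambda text: len(text.split()),
}


def _run(bucket: Path, out: Path, pipelines: List[str], meta: dict, meta_path: Path) -> None:
    """Process every .txt document and refresh the metadata after each one."""
    for doc in sorted(bucket.glob("*.txt")):
        text = doc.read_text(encoding="utf-8")
        results = {name: PIPELINES[name](text) for name in pipelines}
        # write_text replaces any existing file of the same name
        (out / (doc.stem + ".json")).write_text(json.dumps(results))
        meta["documents"] += 1
        meta["elapsed_seconds"] = time.time() - meta["started"]
        meta_path.write_text(json.dumps(meta))


def start_batch(bucket: Path, pipelines: List[str]) -> Tuple[str, threading.Thread]:
    """Return (job UUID, worker thread); raise an error if the job cannot start."""
    out = bucket / OUTPUT_DIR
    out.mkdir(exist_ok=True)  # raises if the location is not writable
    job_id = str(uuid.uuid4())
    meta_path = out / f"BATCH_{job_id}.txt"
    meta = {"id": job_id, "documents": 0}
    meta_path.write_text(json.dumps(meta))
    unknown = [name for name in pipelines if name not in PIPELINES]
    if unknown:
        raise ValueError(f"unknown pipelines: {unknown}")
    meta["started"] = time.time()
    meta_path.write_text(json.dumps(meta))
    worker = threading.Thread(target=_run, args=(bucket, out, pipelines, meta, meta_path))
    worker.start()
    return job_id, worker
```

The worker is returned here only so a caller can wait on it; in the design described above the caller would receive just the UUID and poll for progress.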

andrewresearch added enhancement Integration & deployment labels Aug 14, 2018

andrewresearch added this to the 3.3-M0 milestone Aug 14, 2018

This was referenced Aug 14, 2018

Users need to be able to initiate analysis on selected S3 corpora #76

Closed

Need to be able to load custom models and lexicons from S3 #152

Open

andrewresearch added the priority label Aug 14, 2018

andrewresearch closed this as completed Nov 12, 2018
