
Add an operator to check for _SUCCESS files #406

Merged: 4 commits, Jan 14, 2019

Conversation

acmiyaguchi
Contributor

This PR adds an operator version of the sensor written in #395. It works the same way as the S3FSCheckSuccessSensor, except it is not a long-lived job.
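As a rough illustration of the operator-style approach (this is not the code in this PR; the class name, constructor arguments, and the direct use of boto3 instead of the repo's existing S3 helpers are assumptions for the sketch), a one-shot check could look like:

```python
import boto3
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class S3FSCheckSuccessOperator(BaseOperator):
    """Fail the task if the expected _SUCCESS markers are missing.

    Unlike the sensor, this performs a single check per try and relies on
    task-level retries instead of long-lived pokes.
    """

    template_fields = ("prefix",)

    @apply_defaults
    def __init__(self, bucket, prefix, num_partitions=100, *args, **kwargs):
        super(S3FSCheckSuccessOperator, self).__init__(*args, **kwargs)
        self.bucket = bucket
        self.prefix = prefix
        self.num_partitions = num_partitions

    def execute(self, context):
        # List every key under the prefix and count the _SUCCESS markers.
        paginator = boto3.client("s3").get_paginator("list_objects_v2")
        pages = paginator.paginate(Bucket=self.bucket, Prefix=self.prefix)
        success = sum(
            1
            for page in pages
            for obj in page.get("Contents", [])
            if obj["Key"].endswith("_SUCCESS")
        )
        if success != self.num_partitions:
            raise AirflowException(
                "Expected {} _SUCCESS files under s3://{}/{}, found {}".format(
                    self.num_partitions, self.bucket, self.prefix, success
                )
            )
```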

I noticed a few things when watching the sensor last night.

  • The sensor worked as expected during a catch-up backfill when other jobs were not being run
  • The sensor would fail when it ran into unexpected partitions (the __HIVE_DEFAULT_PARTITION__ case from #404, "S3FSCheckSuccess fails when __HIVE_DEFAULT_PARTITION__ exists")
  • The sensor timed out during the run for 2019-01-07. The logs show that the poke method was never called during the lifetime of the sensor.
    • This might be caused by the low concurrency limit, where the pokes are treated as low priority and never given a chance to run.

In light of this, I've disabled the dataset_alerts DAG. Instead of using a long-lived sensor, I think a better solution in our current Airflow configuration is to run an operator once (with a few retries) at a point when the dataset should definitely have been written to disk. This way, the scheduler should pick up the check once the upstream task has finished. This reuses most of the code from the last PR.
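A sketch of how such a check could be wired into a DAG is below, using the hypothetical S3FSCheckSuccessOperator sketched above. The retry settings and the name of the upstream write task are illustrative; the task_id and S3 prefix mirror the ones mentioned later in this thread.

```python
from datetime import timedelta

# Hypothetical usage: run the check once per day with a few retries,
# only after the task that writes main_summary has finished.
check_main_summary = S3FSCheckSuccessOperator(
    task_id="check_main_summary",
    bucket="telemetry-parquet",
    prefix="main_summary/v4/submission_date_s3={{ ds_nodash }}",
    num_partitions=100,          # expected count; illustrative
    retries=3,
    retry_delay=timedelta(minutes=30),
    dag=dag,                     # assumes a `dag` object defined elsewhere
)

# `main_summary` stands in for the upstream task that writes the dataset.
main_summary >> check_main_summary
```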

There are a couple of small things in this PR too.

@acmiyaguchi
Contributor Author

The third observation is only partially correct; there were a total of 3 tries during the run for 2019-01-07. https://r3e51de5feeeb5619-tp.appspot.com/admin/airflow/log?task_id=check_main_summary&dag_id=dataset_alerts&execution_date=2019-01-07T01:00:00

  1. This was the initial run, which failed to start executing. It was cleared around 7:00 UTC.
  2. This log shows the pokes correctly running at 30-minute intervals. However, this didn't succeed because there were actually 101 partitions (a boto3 equivalent of this check is sketched after the list).
    • aws s3 ls s3://telemetry-parquet/main_summary/v4/submission_date_s3=20190107/ --recursive | grep _SUCCESS | wc -l
  3. This log was created at 13:00 UTC, overlapping with run 2. Not sure why this exists.
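For reference, a boto3 equivalent of the aws CLI count above might look like the following sketch. The bucket and prefix come from the command itself; credential and region configuration are assumed to come from the environment.

```python
import boto3

# Count and list the _SUCCESS markers under the day's prefix, mirroring
# the `aws s3 ls ... | grep _SUCCESS | wc -l` check above.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(
    Bucket="telemetry-parquet",
    Prefix="main_summary/v4/submission_date_s3=20190107/",
)

success_keys = [
    obj["Key"]
    for page in pages
    for obj in page.get("Contents", [])
    if obj["Key"].endswith("_SUCCESS")
]
print(len(success_keys))  # the CLI run above reported 101
```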

The concurrency limit may actually not be an issue here, but something strange did happen on the run last night.


Contributor

@sunahsuh left a comment

lgtm! were you planning on putting the new operator into use anywhere?

@acmiyaguchi
Contributor Author

I'm going to enable the dataset_alerts DAG again once this is merged in. If there are issues with the sensors for mysterious reasons, I'm going to swap them out with the operator instead, since it's simpler.

@acmiyaguchi acmiyaguchi merged commit 672983c into mozilla:master Jan 14, 2019
Merging this pull request may close: S3FSCheckSuccess fails when __HIVE_DEFAULT_PARTITION__ exists (#404)