
Add an operator to check for _SUCCESS files #406

Merged: 4 commits, Jan 14, 2019

Conversation

acmiyaguchi
Contributor

This PR adds an operator version of the sensor written in #395. It works the same way as the S3FSCheckSuccessSensor, except it is not a long-lived job.
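As a rough illustration of the operator-style approach (this is not the code in this PR; the class name, constructor arguments, and the direct use of boto3 instead of the repo's existing S3 helpers are assumptions for the sketch), a one-shot check could look like:

```python
import boto3
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class S3FSCheckSuccessOperator(BaseOperator):
    """Fail the task if the expected _SUCCESS markers are missing.

    Unlike the sensor, this performs a single check per try and relies on
    task-level retries instead of long-lived pokes.
    """

    template_fields = ("prefix",)

    @apply_defaults
    def __init__(self, bucket, prefix, num_partitions=100, *args, **kwargs):
        super(S3FSCheckSuccessOperator, self).__init__(*args, **kwargs)
        self.bucket = bucket
        self.prefix = prefix
        self.num_partitions = num_partitions

    def execute(self, context):
        # List every key under the prefix and count the _SUCCESS markers.
        paginator = boto3.client("s3").get_paginator("list_objects_v2")
        pages = paginator.paginate(Bucket=self.bucket, Prefix=self.prefix)
        success = sum(
            1
            for page in pages
            for obj in page.get("Contents", [])
            if obj["Key"].endswith("_SUCCESS")
        )
        if success != self.num_partitions:
            raise AirflowException(
                "Expected {} _SUCCESS files under s3://{}/{}, found {}".format(
                    self.num_partitions, self.bucket, self.prefix, success
                )
            )
```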

I noticed a few things when watching the sensor last night.

  • The sensor worked as expected during a catch-up backfill when other jobs were not being run
  • The sensor would fail when it ran into unexpected partitions (the __HIVE_DEFAULT_PARTITION__ case from #404, "S3FSCheckSuccess fails when __HIVE_DEFAULT_PARTITION__ exists")
  • The sensor timed out during the run for 2019-01-07. The logs show that the poke method was never called during the lifetime of the sensor.
    • This might be caused by the low concurrency limit, where the pokes are treated as low priority and never given a chance to run.

In light of this, I've disabled the dataset_alerts DAG. Instead of using a long-lived sensor, I think a better solution in our current Airflow configuration is to run an operator once (with a few retries) at a point when the dataset should definitely have been written to disk. This way, the scheduler should pick up the check once the upstream task has finished. This reuses most of the code from the last PR.
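A sketch of how such a check could be wired into a DAG is below, using the hypothetical S3FSCheckSuccessOperator sketched above. The retry settings and the name of the upstream write task are illustrative; the task_id and S3 prefix mirror the ones mentioned later in this thread.

```python
from datetime import timedelta

# Hypothetical usage: run the check once per day with a few retries,
# only after the task that writes main_summary has finished.
check_main_summary = S3FSCheckSuccessOperator(
    task_id="check_main_summary",
    bucket="telemetry-parquet",
    prefix="main_summary/v4/submission_date_s3={{ ds_nodash }}",
    num_partitions=100,          # expected count; illustrative
    retries=3,
    retry_delay=timedelta(minutes=30),
    dag=dag,                     # assumes a `dag` object defined elsewhere
)

# `main_summary` stands in for the upstream task that writes the dataset.
main_summary >> check_main_summary
```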

There are a couple of small things in this PR too.

@acmiyaguchi
Contributor Author

The third observation is only partially correct; there were a total of 3 tries during the run for 2019-01-07. https://r3e51de5feeeb5619-tp.appspot.com/admin/airflow/log?task_id=check_main_summary&dag_id=dataset_alerts&execution_date=2019-01-07T01:00:00

  1. This was the initial run, which failed to start executing. It was cleared around 7:00 UTC.
  2. This log shows the pokes correctly running at 30-minute intervals. However, this didn't succeed because there were actually 101 partitions (a boto3 equivalent of this check is sketched after the list).
    • aws s3 ls s3://telemetry-parquet/main_summary/v4/submission_date_s3=20190107/ --recursive | grep _SUCCESS | wc -l
  3. This log was created at 13:00 UTC, overlapping with run 2. Not sure why this exists.
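For reference, a boto3 equivalent of the aws CLI count above might look like the following sketch. The bucket and prefix come from the command itself; credential and region configuration are assumed to come from the environment.

```python
import boto3

# Count and list the _SUCCESS markers under the day's prefix, mirroring
# the `aws s3 ls ... | grep _SUCCESS | wc -l` check above.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(
    Bucket="telemetry-parquet",
    Prefix="main_summary/v4/submission_date_s3=20190107/",
)

success_keys = [
    obj["Key"]
    for page in pages
    for obj in page.get("Contents", [])
    if obj["Key"].endswith("_SUCCESS")
]
print(len(success_keys))  # the CLI run above reported 101
```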

The concurrency limit may actually not be an issue here, but something strange did happen on the run last night.


Contributor

@sunahsuh left a comment

lgtm! were you planning on putting the new operator into use anywhere?

@acmiyaguchi
Contributor Author

I'm going to enable the dataset_alerts DAG again once this is merged in. If there are issues with the sensors for mysterious reasons, I'm going to swap them out with the operator instead, since it's simpler.

@acmiyaguchi acmiyaguchi merged commit 672983c into mozilla:master Jan 14, 2019
Merging this pull request may close: S3FSCheckSuccess fails when __HIVE_DEFAULT_PARTITION__ exists (#404)