
S3FSCheckSuccess fails when __HIVE_DEFAULT_PARTITION__ exists #404

Closed
acmiyaguchi opened this issue Jan 7, 2019 · 0 comments · Fixed by #406

Comments

@acmiyaguchi (Contributor) commented:

import boto3


def check_s3fs_success(bucket, prefix, num_partitions):
    """Check the S3 filesystem for the existence of `_SUCCESS` files in dataset partitions.

    :param bucket: Name of the S3 bucket containing the table
    :param prefix: Bucket prefix of the table
    :param num_partitions: Number of expected partitions
    """
    s3 = boto3.resource("s3")
    objects = s3.Bucket(bucket).objects.filter(Prefix=prefix)
    # Collect every key that marks a completed partition write
    success = {obj.key for obj in objects if "_SUCCESS" in obj.key}
    return len(success) == num_partitions

This fails when there is a `__HIVE_DEFAULT_PARTITION__` in the dataset: the default partition contributes an extra `_SUCCESS` file, so the count no longer equals the expected number of partitions and the check returns False.

$ aws s3 ls s3://telemetry-parquet/main_summary/v4/submission_date_s3=20190105/ --recursive | grep _SUCCESS | wc -l

101
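
For illustration, a hypothetical call against this dataset (the bucket and prefix come from the listing above; the expected count of 100 is an assumption, with `__HIVE_DEFAULT_PARTITION__` supplying the 101st marker):

# Hypothetical invocation; 100 is an assumed expected-partition count.
# The equality check returns False because __HIVE_DEFAULT_PARTITION__
# adds a 101st _SUCCESS file.
check_s3fs_success(
    bucket="telemetry-parquet",
    prefix="main_summary/v4/submission_date_s3=20190105/",
    num_partitions=100,
)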

The check should be modified so that `num_partitions` is treated as a lower bound on the number of `_SUCCESS` files that must be written, or replaced with an explicit list of expected partitions.
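
A minimal sketch of the lower-bound variant (the commit message below confirms #406 took this approach, but the actual merged code may differ in detail):

import boto3


def check_s3fs_success(bucket, prefix, num_partitions):
    """Check that at least `num_partitions` partitions wrote a `_SUCCESS` file.

    :param bucket: Name of the S3 bucket containing the table
    :param prefix: Bucket prefix of the table
    :param num_partitions: Minimum number of expected partitions
    """
    s3 = boto3.resource("s3")
    objects = s3.Bucket(bucket).objects.filter(Prefix=prefix)
    success = {obj.key for obj in objects if "_SUCCESS" in obj.key}
    # >= instead of ==: an extra __HIVE_DEFAULT_PARTITION__ no longer
    # causes a false failure.
    return len(success) >= num_partitions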

acmiyaguchi added a commit to acmiyaguchi/telemetry-airflow that referenced this issue Jan 8, 2019
acmiyaguchi added a commit that referenced this issue Jan 14, 2019
* Add a S3FSCheckSuccessOperator and fix imports

* Remove py36 from envlist due to snakebite import in sensors

* Fix #404 - Set the number of expected partitions as a lower bound