Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New manifest structure and added support for BQ manifest. #295

Merged
merged 27 commits into from Feb 22, 2023

Conversation

mahrsee1997
Copy link
Collaborator

Fixes #149 and #60.

weather_dl/README.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@alxmrs alxmrs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a first pass for a review. I recommend slightly less deviation from the existing class structure.

So far, this looks great! I really like what I've seen in the demo DB :)

weather_dl/README.md Outdated Show resolved Hide resolved
weather_dl/download_pipeline/clients.py Outdated Show resolved Hide resolved
weather_dl/download_pipeline/clients.py Outdated Show resolved Hide resolved
weather_dl/download_pipeline/fetcher.py Outdated Show resolved Hide resolved
weather_dl/download_pipeline/clients.py Outdated Show resolved Hide resolved
weather_dl/download_pipeline/manifest.py Outdated Show resolved Hide resolved
weather_dl/download_pipeline/manifest.py Outdated Show resolved Hide resolved

def __init__(self, location: Location) -> None:
super().__init__(Location(location[5:]))
TABLE_SCHEMA = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fredzyda I could use a second pair of eyes on the schema definition.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TylerTCF may have some opinions, too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we land this change, I was thinking Tyler could build a prototype dashboard out of sample data you produce. That should be a good test of the schema.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check with Tyler one last time. Otherwise, I think we're good to go.

weather_dl/download_pipeline/manifest.py Outdated Show resolved Hide resolved
weather_dl/download_pipeline/manifest.py Outdated Show resolved Hide resolved
Pass manifest instance across pipeline steps instead of dict.

Improve mock manifest client & make use of it in tests.

Remove None from Status Enum & make set its default value to None in DownloadStatus dataclass.
Instead of creating a separate transaction for each stage, now we will just set the stage using set_stage().
Copy link
Collaborator

@alxmrs alxmrs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More feedback. Overall it's looking good!

weather_dl/download_pipeline/manifest.py Outdated Show resolved Hide resolved
weather_dl/download_pipeline/manifest.py Outdated Show resolved Hide resolved
weather_dl/download_pipeline/manifest.py Outdated Show resolved Hide resolved
weather_dl/download_pipeline/manifest.py Outdated Show resolved Hide resolved
weather_dl/download_pipeline/clients.py Show resolved Hide resolved
weather_dl/download_pipeline/manifest.py Show resolved Hide resolved
Comment on lines 286 to 289
if parsed_gcs_path.scheme != 'gs' or parsed_gcs_path.netloc == '':
new_status.size = LocalSystemFileSizeStrategy().get_file_size(path)
else:
new_status.size = GCSBlobSizeStrategy().get_file_size(parsed_gcs_path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much cleaner. However, could we apply something like the strategy pattern on the manifest subclass itself? e.g. the GCS Manifest has a clac_file_size() method that gets overridden? I see no need to create a separate class hierarchy, this one should do fine.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, our _read and _update calls use the strategy pattern in my book.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the implementation a bit w.r.t to getting file size. WRYT ?


def __init__(self, location: Location) -> None:
super().__init__(Location(location[5:]))
TABLE_SCHEMA = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we land this change, I was thinking Tyler could build a prototype dashboard out of sample data you produce. That should be a good test of the schema.

weather_dl/download_pipeline/manifest.py Outdated Show resolved Hide resolved
weather_dl/download_pipeline/parsers.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@alxmrs alxmrs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few critical points of feedback. Once these are fixed, I'm happy to land these changes. Great work, Rahul!

@mahrsee1997
Copy link
Collaborator Author

Thanks Alex for reviewing.

@mahrsee1997 mahrsee1997 merged commit 197d815 into main Feb 22, 2023
@mahrsee1997 mahrsee1997 deleted the dl-bq-manifest branch February 22, 2023 10:18
This was referenced Feb 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update manifest to track pipeline stages.
2 participants