
[KED-928] Full Google Cloud Platform Support #11

Closed
8 tasks
ulli-snyman opened this issue Jun 7, 2019 · 17 comments
Comments

@ulli-snyman

Description

GCP is mentioned in the Kedro docs, but I cannot find any references to data connectors for it.
Please add data connectors for Google Cloud Storage (GCS) and Google BigQuery.

Context

Allow me and my team of data scientists to use Google Cloud products. GCP is a large, very popular cloud provider used by many teams. GCS is Google's equivalent of AWS S3, and BigQuery is Google's hosted data-warehouse service.

Possible Implementation

Using the Google Python client library, create the following:
kedro.io.gcs_csv
kedro.io.gcs_parquet
kedro.io.gcs_hdfs
kedro.io.gcs_pickle
kedro.io.gcs_json
kedro.io.gbq [load df from GBQ]

Possible Alternatives

In the S3 implementations, replace s3fs with boto3 to allow access to both S3 and GCS with the same code. See the simple GCP migration method outlined here; however, this does not allow full access to GCS features (e.g. service accounts) and could break some functions in the S3 implementations.

Todo

  • Set up authentication via service accounts and default credentials
  • Duplicate functionality in io.s3 connectors for GCS
  • Duplicate functionality in io.sql connector for GBQ
  • Test connectors against public datasets
  • Test connectors on private datasets

Checklist

  • [Component: I/O]
  • [Type: Enhancement]
  • [Priority: Medium]
@ulli-snyman ulli-snyman added the Issue: Feature Request New feature or improvement to existing feature label Jun 7, 2019
@idanov
Member

idanov commented Jun 7, 2019

Hi @ulli-snyman, thank you for contributing to Kedro by adding a feature request and potentially adding support for new data sets. Adding support for GCS is something we would love to have as part of kedro.contrib.io, and we will be more than happy to welcome contributions for the datasets. I will mark this with the good first issue label, so anyone interested can pick it up.

@ulli-snyman
Author

Hey @idanov, I'm getting ready for my first PR towards this issue, covering the CSVDataSet method.
I am busy writing tests and have run into a bit of a roadblock: currently there is no functionality to mock a GCS bucket the way the moto package does for S3.

The common approach for testing against GCP services is to actually read/write to the service.
I'm scratching my head a bit here, as I can run the tests with my credentials in a testing project, but that is specific to me; anyone else wanting to run the tests would need their own GCP project.

I've set the tests to take GCP configuration from environment variables, which is the best way I can see this working out. Would you be fine with this, or do you have any other ideas on how we could test this?

@nakhan98
Contributor

Hi @ulli-snyman, thank you for your interest in contributing to Kedro! I'm the QA on the Kedro team. You should be able to mock out calls to the GCP library. An example is shown here. You can also see how the developers test their client code here. If you need any further assistance, please let us know.
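For illustration, calls into the google-cloud-storage client can be stubbed with the standard library's unittest.mock, so no credentials or network access are needed. The helper function below is hypothetical, but the `client.bucket(...).blob(...).download_as_string()` chain mirrors the client library's call pattern:

```python
from unittest import mock


def load_blob_text(client, bucket_name, blob_name):
    """Download a blob's contents as text via a storage client."""
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    return blob.download_as_string().decode("utf-8")


# Stand in a fake client; configure the nested call chain's return value.
mock_client = mock.Mock()
mock_client.bucket.return_value.blob.return_value.download_as_string.return_value = (
    b"a,b\n1,2\n"
)

csv_text = load_blob_text(mock_client, "my-bucket", "data.csv")
```

The same pattern lets a test assert which bucket and blob names the dataset asked for, without a real GCP project.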

@lorenabalan lorenabalan changed the title Add Support for GCP Services. i.e. Google Cloud Storage and Big Query [KED-928] Add Support for GCP Services. i.e. Google Cloud Storage and Big Query Aug 6, 2019
@lorenabalan
Contributor

lorenabalan commented Aug 6, 2019

I've updated the title with our internal ticket number to keep track of this more easily. :)
@ulli-snyman how is this coming along? Do you need any help from our side?

@ulli-snyman
Author

Hey @lorenabalan,
Things have been busy on my side; I'll try to wrap up the PR by the end of the month.

@lorenabalan
Contributor

Things have been busy on my side; I'll try to wrap up the PR by the end of the month.

Totally fine, just wanted to check in and make sure that you're not stuck on something from our end. :)

@plauto

plauto commented Aug 29, 2019

Hey there,
if that's OK, can I start working on it? I have some experience with GCP and I'd be happy to implement those features :)

@yetudada
Contributor

Hi @plauto! We would love the help! But it might be a good idea to just sync with @ulli-snyman as he mentioned that he has started working on a PR. Let's give him until the end of the week to reply about how far he's gotten and whether or not he needs help. If there's no status update then it's all yours.

@plauto

plauto commented Aug 29, 2019

Sounds good to me! Thanks @yetudada

@plauto

plauto commented Sep 2, 2019

Hey! If that’s ok, can I start working on it this week?

@Flid
Contributor

Flid commented Sep 2, 2019

Looks like there's still no reply, and sure, go for it! @plauto
Thank you in advance for the contribution.

@921kiyo
Contributor

921kiyo commented Sep 12, 2019

@plauto How's the development coming along? If you would like our early feedback/comments, feel free to open a draft PR so we can see if you are on the right track :)

@plauto

plauto commented Sep 12, 2019 via email

@plauto

plauto commented Sep 23, 2019

@921kiyo I am going to push a draft PR. Sorry for being a bit late on this; I could only find time to work on it at the end of last week. There are still a couple of things to finish (e.g. unit tests for the versioned dataset, which have some complexity due to the way I have structured the unit tests). I look forward to getting feedback from you when you have some time. After that it shouldn't take long to finish up the rest!


@yetudada yetudada changed the title [KED-928] Add Support for GCP Services. i.e. Google Cloud Storage and Big Query [KED-928] Full GCP Support Oct 29, 2019
@yetudada yetudada removed Issue: Feature Request New feature or improvement to existing feature good first issue labels Oct 29, 2019
@yetudada yetudada changed the title [KED-928] Full GCP Support [KED-928] Full Google Cloud Platform Support Oct 29, 2019
@yetudada
Contributor

yetudada commented Dec 10, 2019

@ulli-snyman and everyone who has been watching this issue. We're excited to announce that kedro 0.15.5 will have CSVGCSDataSet, ParquetGCSDataSet and JSONGCSDataSet.

In a subsequent release of Kedro, we will have:

  • Support for Google Big Query
  • A new series of file-storage-agnostic datasets (CSVDataSet, ParquetDataSet, JSONDataSet, ExcelDataSet, HDFDataSet and PickleDataSet), made possible by fsspec, which we stumbled into while looking at Dask integration. These datasets will support GCS, S3, etc. and will simplify our data catalog.

I'll close this issue when we have finished full support of GCS.
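For illustration only, a data catalog entry for such an fsspec-based dataset might look like the snippet below; the dataset type, the `gcs://` scheme handling and the credentials key are assumptions based on the announcement above, not the final API:

```yaml
# conf/base/catalog.yml (illustrative sketch, names are assumptions)
weather:
  type: CSVDataSet
  filepath: gcs://my-bucket/weather.csv   # fsspec resolves the gcs:// scheme
  credentials: gcp_credentials            # defined in conf/local/credentials.yml
```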

@yetudada
Contributor

yetudada commented Feb 5, 2020

@ulli-snyman This issue has been addressed, and full Google Cloud support will be available in the next release. The datasets are already available on the develop branch: https://github.com/quantumblacklabs/kedro/blob/develop/kedro/extras/datasets/

They all use fsspec to resolve filepath:, and GCS is included in the fsspec registry: https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/registry.html
