
remote: implement Google Drive #2040

Open · ei-grad wants to merge 2 commits into master from ei-grad:google-drive

Conversation

@ei-grad (Collaborator) commented May 22, 2019

  • Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.


Fix #2018

@ei-grad ei-grad changed the title remote: implement Google Drive [WIP] remote: implement Google Drive May 22, 2019

@ei-grad ei-grad force-pushed the ei-grad:google-drive branch 5 times, most recently from 4524712 to d249f69 May 23, 2019

@ei-grad (Collaborator, Author) commented May 24, 2019

Review can be started: dvc pull / dvc push are working, and I'm proceeding with tests.

Resolved: dvc/config.py (outdated)
@efiop (Member) left a comment

@ei-grad Looks great! 🔥 Is it possible to do some func tests locally for gdrive? Or do we need a real drive account? Also, looks like upload/download could be refactored a bit so they are easier to read ;) I know that many other Remotes have a similar problem with upload/download methods, but just while we are at it, we could enhance this particular one a little bit by splitting it into separate sub-methods.

@ei-grad (Collaborator, Author) commented May 27, 2019

@efiop Thanks for the review! :)

Is it possible to do some func tests locally for gdrive? Or do we need a real drive account?

I think we need a real account. And IIRC I read somewhere in their API docs that Google's policy does not prohibit creating a test account for this purpose. But I think it is also good to have full unit-test coverage of this implementation.

upload/download could be refactored a bit so they are easier to read

Sure. Btw, it is also untestable right now. In progress.

@ei-grad ei-grad force-pushed the ei-grad:google-drive branch from 646f928 to 679ad9a May 30, 2019

@ei-grad ei-grad changed the title [WIP] remote: implement Google Drive remote: implement Google Drive May 30, 2019

@efiop (Member) left a comment

Looking good!

Resolved: tests/func/test_gdrive.py (outdated)
Resolved: dvc/path/gdrive.py (outdated)
Resolved: tests/func/test_gdrive.py (outdated)
Resolved: tests/func/test_gdrive.py (outdated)
Resolved: tests/unit/remote/test_gdrive.py (outdated)
Resolved: tests/unit/remote/test_gdrive.py (outdated)
Resolved: dvc/remote/gdrive/__init__.py (outdated)
Resolved: dvc/remote/gdrive/__init__.py (outdated)

@ei-grad ei-grad force-pushed the ei-grad:google-drive branch 2 times, most recently from 87ad701 to 6d15884 Jun 1, 2019

@ei-grad ei-grad requested a review from efiop Jun 1, 2019

@ei-grad ei-grad force-pushed the ei-grad:google-drive branch 5 times, most recently from cfea397 to 620bb4e Jun 1, 2019

@ei-grad ei-grad force-pushed the ei-grad:google-drive branch from 7dd7810 to bdf03f9 Jul 3, 2019

@efiop (Member) commented Jul 3, 2019

Sure, @ei-grad ! Let's give PyDrive a try.

@ei-grad ei-grad force-pushed the ei-grad:google-drive branch from bdf03f9 to 3caafdb Jul 5, 2019

@ei-grad ei-grad force-pushed the ei-grad:google-drive branch 4 times, most recently from ac19c8a to 6b758c6 Jul 5, 2019

@ei-grad ei-grad force-pushed the ei-grad:google-drive branch from 6b758c6 to fd1ab6e Jul 16, 2019

@ei-grad (Collaborator, Author) commented Jul 16, 2019

Just a status update: the latest feedback points were addressed, and only a couple of questions remain. It was probably not right of me to mark the review conversations as resolved myself, sorry. Anyway, I'm in the middle of a PyDrive-related refactoring that changes a notable portion of the code, and I'll probably come up with a new code review request later this week.

@vmarkovtsev commented Jul 17, 2019

Hey @ei-grad thank you so much for working on this! If you would benefit from any help, e.g. writing tests or coding, please let me know.

@ei-grad ei-grad force-pushed the ei-grad:google-drive branch from 439495b to 9541bc1 Jul 26, 2019

@ei-grad ei-grad force-pushed the ei-grad:google-drive branch from 9541bc1 to 6afe310 Jul 29, 2019

@efiop (Member) commented Aug 2, 2019

@ei-grad Please take a look at DeepSource complaints, I believe there are some valid ones.



@pytest.fixture()
def repo():

@efiop (Member) commented Aug 2, 2019

Do we need to use repo from the dvc code repo itself? Or can we use dvc_repo fixture, as we do everywhere else?

@ei-grad (Author, Collaborator) commented Aug 6, 2019

These unit tests don't need a real repo, so it is a bit excessive to set up and tear down the dvc_repo fixture for each test. But it may be a good idea to create one temporary repo for them all. Would it be OK to make this fixture scope="module" and take/return dvc_repo?

@efiop (Member) commented Aug 7, 2019

Sounds good :)
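The scope="module" idea above could be sketched roughly like this (a sketch only: the real change would wrap DVC's existing dvc_repo fixture, and gdrive_test_repo is a hypothetical name for illustration):

```python
import pytest

# Sketch only: the real fixture would wrap DVC's dvc_repo fixture;
# "gdrive_test_repo" is a hypothetical name used for illustration.
@pytest.fixture(scope="module")
def gdrive_test_repo(tmp_path_factory):
    # One temporary repo directory is created once per test module and
    # shared by all tests in it, instead of a setup/teardown per test.
    repo_dir = tmp_path_factory.mktemp("gdrive_repo")
    yield repo_dir
    # module-level teardown (if any) runs here, after the last test
```

With scope="module", pytest caches the fixture value for every test in the module, so the setup cost is paid once.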

"{} is not a folder".format("/".join(current_path))
)
parent = metadata["id"]
to_create = [part] + list(parts)

@shcheklein (Member) commented Aug 6, 2019

should it be a sublist of parts here?

@ei-grad (Author, Collaborator) commented Aug 6, 2019

parts is a partially consumed iterator, but yeah, I'll rewrite it

@ei-grad (Author, Collaborator) commented Aug 6, 2019

The exception/break condition was also valid, but DeepSource also suggests that iterating over an iterator with a break/else construct and then using the loop variable afterwards is unreadable and bug-prone. :)
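A rewrite along these lines could avoid the break/else construct entirely by materializing the iterator first. This is a sketch, not the PR's actual code: split_existing and the get_metadata callback are hypothetical names.

```python
# Sketch of the rewrite discussed above: split the path parts into the
# prefix that already exists on the remote and the suffix still to be
# created, without a for/else over a partially consumed iterator.
# `get_metadata` stands in for a lookup that returns None for a missing
# folder; both names are hypothetical.
def split_existing(parts, get_metadata):
    remaining = list(parts)  # materialize: nothing is half-consumed
    existing = []
    while remaining:
        if get_metadata(existing + [remaining[0]]) is None:
            break
        existing.append(remaining.pop(0))
    return existing, remaining

# Tiny fake lookup: only "a" and "a/b" exist on the "remote".
known = {("a",), ("a", "b")}
fake_metadata = lambda path: {"id": "/".join(path)} if tuple(path) in known else None
```

For example, split_existing(["a", "b", "c", "d"], fake_metadata) yields (["a", "b"], ["c", "d"]): the first element is the existing prefix, the second is what still needs to be created.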

TIMEOUT = (5, 60)

def __init__(
self,

@shcheklein (Member) commented Aug 6, 2019

use kwargs instead of a long list here?

@ei-grad (Author, Collaborator) commented Aug 6, 2019

Hm. I'd rather pass the OAuth2 instance instead of its arguments. And this method also needs a docstring, probably.
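Passing a single auth object instead of its arguments could look like this (a sketch under assumed names: GDriveClient and the oauth2 parameter are illustrative, not the PR's actual signature):

```python
# Sketch of the suggestion above: accept one pre-built OAuth2/credentials
# object instead of forwarding each of its parameters through __init__.
class GDriveClient:
    TIMEOUT = (5, 60)  # (connect, read) timeouts, as in the snippet above

    def __init__(self, oauth2, timeout=TIMEOUT):
        """Google Drive API client (illustrative sketch).

        oauth2 -- an object responsible for producing an authorized
        session; constructing it is the caller's concern, which keeps
        this signature short.
        """
        self.oauth2 = oauth2
        self.timeout = timeout
```

The design trade-off is that the caller now owns credential construction, and the client signature no longer grows with every new OAuth2 option.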

Security notice:
It always adds the Authorization header to the requests, not paying
attention is request is for googleapis.com or not. It is just how

@shcheklein (Member) commented Aug 6, 2019

typo: is -> if

def session(self):
"""AuthorizedSession to communicate with https://googleapis.com
Security notice:

@shcheklein (Member) commented Aug 6, 2019

it's probably easy to add a test/assert to check that the domain is intact

@ei-grad (Author, Collaborator) commented Aug 6, 2019

Great idea, thanks! What do you think about just overriding the request() of AuthorizedSession?
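Overriding request() could be sketched like this. The real base class would be google-auth's AuthorizedSession; a trivial stand-in is used here so the example stays self-contained, and the class name is hypothetical.

```python
from urllib.parse import urlparse

# Stand-in for google.auth.transport.requests.AuthorizedSession, so the
# sketch runs without google-auth installed.
class _BaseSession:
    def request(self, method, url, **kwargs):
        return "response"

class GoogleOnlySession(_BaseSession):
    """Refuses to attach credentials to requests for non-Google hosts."""

    def request(self, method, url, **kwargs):
        host = urlparse(url).hostname or ""
        if host != "googleapis.com" and not host.endswith(".googleapis.com"):
            raise ValueError("refusing to send credentials to %r" % host)
        return super().request(method, url, **kwargs)
```

With this in place, the Authorization header can never leak to an arbitrary host, addressing the security notice in the docstring above.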

@@ -0,0 +1,17 @@
from dvc.remote.gdrive.utils import response_error_message

@shcheklein (Member) commented Aug 6, 2019

we should include from __future__ import unicode_literals everywhere

creds_id = self._get_creds_id(info["installed"]["client_id"])
return os.path.join(creds_storage_dir, creds_id)

def _get_storage_lock(self):

@shcheklein (Member) commented Aug 6, 2019

Could you clarify this a little bit? Why do we need the lock, and how does it work?

self._thread_lock.acquire()
while time() - t0 < self.timeout:
try:
self._lock = zc.lockfile.LockFile(self.lock_file)

@shcheklein (Member) commented Aug 6, 2019

Is it a regular lockfile, or is there something specific? We already use a different lockfile implementation elsewhere, so do we need zc here? Should we then specify the dependency explicitly? Also, what happens if execution is interrupted in the middle: will it start raising an exception? We should at least explain how to recover from it.

@efiop (Member) commented Aug 8, 2019

@shcheklein We are using zc.lockfile in other places too :) Not sure about the purpose of this lock though; do we actually write to the file it is protecting anywhere, @ei-grad?
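For reference, the acquire loop quoted above follows this retry-until-deadline pattern. In this self-contained sketch an os.O_EXCL file stands in for zc.lockfile.LockFile, which raises zc.lockfile.LockError while another process holds the lock; the function names are illustrative.

```python
import os
import time

class LockError(Exception):
    """Raised when the lock cannot be acquired before the timeout."""

def acquire_lock(path, timeout=5.0, poll=0.1):
    # Same shape as the quoted loop: keep retrying until the deadline.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # O_CREAT|O_EXCL fails if the file already exists -- a crude
            # cross-process lock, standing in for zc.lockfile here.
            return os.open(path, os.O_CREAT | os.O_EXCL)
        except FileExistsError:
            time.sleep(poll)  # lock held elsewhere; retry shortly
    raise LockError("could not acquire %s within %ss" % (path, timeout))
```

Note that with this crude O_EXCL scheme an interrupted run leaves a stale file behind and later attempts time out, which is exactly the recovery question raised above; zc.lockfile itself relies on OS-level file locks that are released when the owning process exits, so stale-lock recovery is less of an issue there.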

break
params["pageToken"] = data["nextPageToken"]

def get_metadata(self, path_info, fields=None):

@shcheklein (Member) commented Aug 6, 2019

Could you please add a comment about what this returns? It's really hard to understand; I'm just trying to see why the logic here is so complicated.

errors_count += 1
if errors_count >= 10:
raise
sleep(1.0)

@shcheklein (Member) commented Aug 6, 2019

do we actually need this sleep here?

@ei-grad (Author, Collaborator) commented Aug 12, 2019

We must wait some time between consecutive resumable upload requests. If one request failed due to a brief network problem, all retry attempts would otherwise fail within a short time. Maybe the same exponential backoff policy should be used here as for error handling in self.request, though it is not clear to me whether that would be the right solution for resumable uploads. The hardcoded 1-second sleep with 10 retries looks better, imho, but it is definitely not the best behavior either.

One possible solution could be to store the upload state in DVC's state database to make it possible to resume uploads between dvc runs, but this feels like overkill to me. Other backends don't handle connection interruptions or server errors during large-file uploads at all, if I'm not mistaken.

@shcheklein (Member) commented Aug 12, 2019

Yep, a DB is overkill for sure. As for retries: is it possible to use one of the many existing retry decorators? In both cases.
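An existing library such as tenacity provides such a decorator out of the box; a minimal hand-rolled version of the idea (a sketch, not the PR's code) might look like:

```python
import functools
import time

def retry(tries=10, base_delay=1.0, backoff=2.0, exc=Exception, sleep=time.sleep):
    """Retry decorator with exponential backoff (illustrative sketch)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(tries):
                try:
                    return func(*args, **kwargs)
                except exc:
                    if attempt == tries - 1:
                        raise  # attempts exhausted: propagate the error
                    sleep(delay)
                    delay *= backoff  # 1s, 2s, 4s, ... between attempts
        return wrapper
    return decorator

# A flaky function that succeeds on its third call:
calls = []

@retry(tries=5, base_delay=0.01)
def flaky_upload():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("transient network error")
    return "ok"
```

The same decorator could then wrap both request() and the resumable-upload loop, replacing the hardcoded 1-second sleep discussed above.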

@shcheklein (Member) left a comment

a few questions to address

@efiop @ei-grad what else do we need to get it done, guys?

@ei-grad you wanted to try some library as far as I remember, what's your take on it?

@shcheklein (Member) commented Aug 7, 2019

@ei-grad please check DeepSource stuff as well, let's fix it (except obvious false positives).

def exists(self, path_info):
return self.client.exists(path_info)

def batch_exists(self, path_infos, callback):

@shcheklein (Member) commented Aug 8, 2019

@efiop Do we need to update anything to support threading here (I mean status, etc.)? You are changing something with @pared, as far as I understand.

@efiop (Member) commented Aug 8, 2019

Yes, batch_exists will no longer be needed after #2375. As for threads in general, as long as self.client is thread-safe (looks like it is, but maybe @ei-grad could confirm/deny that), we will be fine.
