
support for webhdfs file system #217

Closed

Conversation

4 participants
@harsham4026 (Contributor) commented Sep 27, 2018

No description provided.

@harsham4026 (Contributor, Author) commented Sep 27, 2018

@MSeal I developed the basic things that we do with HDFS. Please have a look and let me know about any changes or additional features :). Once it looks fine I'll start writing the test cases.

Harsha Mandai added some commits Sep 27, 2018

@codecov commented Sep 27, 2018

Codecov Report

Merging #217 into master will decrease coverage by 3.46%.
The diff coverage is 48.73%.

@@            Coverage Diff             @@
##           master     #217      +/-   ##
==========================================
- Coverage    77.9%   74.43%   -3.47%     
==========================================
  Files          11       13       +2     
  Lines        1136     1244     +108     
==========================================
+ Hits          885      926      +41     
- Misses        251      318      +67

Harsha Mandai added some commits Sep 27, 2018

@harsham4026 (Contributor, Author) commented Sep 28, 2018

@MSeal did you get a chance to review this PR? Looking forward to your suggestions or changes :)

@betatim (Member) commented Sep 28, 2018

Could you please add some tests for the HDFS support, similar to the ones in papermill/tests/test_s3.py, to make sure things work as planned and won't get broken by accident in the future when other people change things? That would be great.

Edit: just saw your comment about writing tests. So ignore my comment above. Sorry about that.

@harsham4026 (Contributor, Author) commented Sep 28, 2018

@betatim sure. I'm preparing the test cases and will commit them soon.

@MSeal (Member) left a review comment

Thanks for putting this together!

I think we should use pyarrow to avoid needing the webhdfs client installed on the hadoop cluster. See http://wesmckinney.com/blog/python-hdfs-interfaces/

The good news is that I think a lot of the code here won't be needed with the pyarrow client: `with hdfs.open('/path/to/file') as f:` and `hdfs.mkdir` already do what we need, so we can move our logic into the iorw class without much overhead.

Also, lots of tests will be needed for a merge here, as it's a relatively complex interface. You'll want to mock the actual hdfs object and just prove that the inputs match the documented inputs for each wrapper. We also need to make sure papermill still operates on non-HDFS-enabled clients; Travis should prove this for us.
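The direction described in this review could look roughly like the sketch below: a thin handler that delegates everything to a filesystem object exposing `open()` and `mkdir()` (as pyarrow's HDFS client does), injected so it is trivial to mock in tests. The class and method names here are hypothetical, not the final papermill API.

```python
class HDFSHandler(object):
    """Thin wrapper over a pyarrow-style filesystem object.

    `fs` is anything with open(path, mode) and mkdir(path); on a real
    cluster it would be a connected pyarrow HDFS client.
    """

    def __init__(self, fs):
        self._fs = fs

    def read(self, path):
        # the filesystem's open() already raises on missing files,
        # so no separate existence checks are needed here
        with self._fs.open(path, 'rb') as f:
            return f.read().decode('utf-8')

    def write(self, path, contents):
        with self._fs.open(path, 'wb') as f:
            f.write(contents.encode('utf-8'))

    def mkdir(self, path):
        self._fs.mkdir(path)
```

Because the client is injected rather than constructed internally, unit tests can pass a fake filesystem and never touch a cluster.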

@@ -0,0 +1,108 @@
# -*- coding: utf-8 -*-
"""Utilities for working with S3."""

@MSeal (Member) commented Sep 29, 2018

with HDFS

from pywebhdfs.errors import FileNotFound
from pywebhdfs.webhdfs import PyWebHdfsClient

from .s3 import retry

@MSeal (Member) commented Sep 29, 2018

Let's move retry to a utils.py so it can be shared by the two

FileNotFoundError = IOError


class HDFS(object):

@MSeal (Member) commented Sep 29, 2018

We probably don't need to make all the HDFS operators prefixed with _. I'd say most of these are fine to expose without the private prefix.

@harsham4026 (Author, Contributor) commented Oct 1, 2018

removed the private prefix.

return False
else:
return True
except Exception as e:

@MSeal (Member) commented Sep 29, 2018

This is redundant, removing it has the same behavior

""":param path:str
:returns reads contents of the given file path and returns"""
if self._exists(path) and not self._is_file(path):
raise IOError("file {path} doesn't exist".format(path=path))

@MSeal (Member) commented Sep 29, 2018

Let's raise FileNotFoundError for this

:returns reads contents of the given file path and returns"""
if self._exists(path) and not self._is_file(path):
raise IOError("file {path} doesn't exist".format(path=path))
read = self.__webhdfs.read_file(path, offset=offset, length='null' if size < 0 else size)

@MSeal (Member) commented Sep 29, 2018

Not sure length='null' is right?

else:
raise FileNotFound("directory {directory} doesn't exist".format(directory=path))

def _read(self, path, size=-1, offset=0, binary=False):

@MSeal (Member) commented Sep 29, 2018

Better to use size=None for python interfaces
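One way to apply this suggestion is to accept `size=None` at the Python level and translate it to the webhdfs `length` parameter only at the call boundary. The sketch below uses an injected client; the `'null'` sentinel mirrors the diff above (whether that sentinel is actually correct is an open question in this review), and the function name is illustrative.

```python
def read_file(client, path, size=None, offset=0):
    """Read `size` bytes from `path` via a webhdfs-style client.

    size=None means "read to the end of the file"; the translation to
    the webhdfs length parameter happens here, not in callers.
    """
    length = 'null' if size is None else size
    return client.read_file(path, offset=offset, length=length)
```

Keeping the sentinel in one place means the Python-facing signature stays idiomatic even if the webhdfs convention turns out to need fixing.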

read = read.decode('utf-8')
return read

def _write(self, path, s=""):

@MSeal (Member) commented Sep 29, 2018

`s` doesn't need `=""`; also, let's name this better as `contents`.

else:
raise FileNotFound("parent directory {} doesn't exist".format(self._path_basename(path)))

def _mkdir(self, path, recursive, exist_ok=False):

@MSeal (Member) commented Sep 29, 2018

You don't actually use recursive except to raise exceptions? I'd drop exist_ok and just assume it's True for execution purposes
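The simplification suggested here might look like the sketch below: drop the `exist_ok` flag and treat an already-existing directory as success. The client is injected, and the `exists_file_dir`/`make_dir` method names follow the pywebhdfs usage in this PR but should be treated as assumptions.

```python
def mkdir(client, path):
    """Create `path` on HDFS, behaving as if exist_ok were always True."""
    # an existing directory is simply a no-op, not an error
    if not client.exists_file_dir(path):
        client.make_dir(path)
```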

Harsha Mandai added some commits Oct 1, 2018

@todo commented Oct 1, 2018

verify what to raise

https://github.com/nteract/papermill/blob/dac7606771645dd3da03092207295fab923c8f9c/papermill/utils.py#L17-L22


This comment was generated by todo based on a TODO comment in dac7606 in #217. cc @harsham4026.

Harsha Mandai added some commits Oct 1, 2018

import logging

from pywebhdfs.errors import FileNotFound
from pywebhdfs.webhdfs import PyWebHdfsClient

@MSeal (Member) commented Oct 1, 2018

Before you get too far fixing things, I do want to point out that we should probably use pyarrow for hdfs:// paths and pywebhdfs for webhdfs:// paths. Not all hadoop clusters will have webhdfs installed/enabled so we'll want the register to differentiate the two.

@harsham4026 (Author, Contributor) commented Oct 1, 2018

okay. Will make changes for pyarrow :)

@MSeal (Member) commented Oct 1, 2018

If you want to leave this branch for webhdfs that's fine too, just wanted to let you know there was a difference between the two.

@harsham4026 (Author, Contributor) commented Oct 1, 2018

sure

@harsham4026 (Author, Contributor) commented Oct 1, 2018

Added the code for supporting hdfs. PR #224 addresses this.


@MSeal changed the title from "support for hdfs file system" to "support for webhdfs file system" on Oct 2, 2018

@harsham4026 (Contributor, Author) commented Oct 6, 2018

@MSeal are we planning to add webhdfs support in the near future? If so, I'll try to add the test cases for the pywebhdfs code I committed.


def __init__(self, host=None, port=None, user_name=None,
path_to_hosts=None, **kwargs):
self.__webhdfs = PyWebHdfsClient.__init__(self, host, port, user_name, **kwargs)

@MSeal (Member) commented Oct 6, 2018

Use single _ for private variables, double __ is usually for hidden system attributes.
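Beyond the underscore naming, the diff above has a deeper problem: calling `PyWebHdfsClient.__init__(self, ...)` directly returns `None`, so the attribute never holds a client. A sketch of the composition this comment is steering toward, with a single leading underscore (`FakeWebHdfsClient` is an illustrative stand-in for pywebhdfs's `PyWebHdfsClient`):

```python
class FakeWebHdfsClient(object):
    """Stand-in for pywebhdfs.webhdfs.PyWebHdfsClient (illustrative only)."""

    def __init__(self, host=None, port=None, user_name=None, **kwargs):
        self.host = host
        self.port = port
        self.user_name = user_name


class HDFS(object):
    def __init__(self, host=None, port=None, user_name=None, **kwargs):
        # instantiate the client (composition) instead of calling its
        # __init__ by hand, and use a single underscore for the attribute
        self._webhdfs = FakeWebHdfsClient(host=host, port=port,
                                          user_name=user_name, **kwargs)
```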

except Exception as e:
logger.debug('retrying: {}'.format(e))
else:
raise Exception # TODO verify what to raise

@MSeal (Member) commented Oct 6, 2018

I know this was just copied, but let's improve this some. This should `raise e`, and the line `e = Exception("No attempts made")` should be at the top of `wrapper`.
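A sketch of how the shared retry decorator might look with both fixes applied: `e` is initialized before the loop so the final raise is always defined, and the last real exception is re-raised rather than a bare `Exception`. This is a reconstruction, not papermill's exact code.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def retry(num):
    """Retry the wrapped function up to `num` times, re-raising the
    last failure if every attempt fails."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            e = Exception("No attempts made")  # defined before any attempt
            for _ in range(num):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    logger.debug('retrying: {}'.format(exc))
                    e = exc
            raise e  # the last real failure, or "No attempts made" if num <= 0
        return wrapper
    return decorate
```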

@MSeal (Member) commented Oct 6, 2018

> @MSeal are we planning to webhdfs support in near future? If we are planning to add I'll try to add the test cases for the code of pywebhdfs committed.

@harsham4026 I see no problem with adding this. There are times you need to use this over native hdfs calls. It will require extensive testing though, so you'll need at least one test per function you added (more for the functions with more complex logic). That'll be a little bit of work to complete, but if you get stuck on any particular test / mock pattern I'll respond with how to achieve desired testing patterns.

@harsham4026 (Contributor, Author) commented Nov 6, 2018

@MSeal I was exploring libraries to mock webhdfs and hdfs. Can you suggest a library to mock HDFS for writing the test cases?

@MSeal (Member) commented Nov 6, 2018

@harsham4026 To test these you'll probably need to mock `PyWebHdfsClient` or the `__webhdfs` object and prove it's receiving calls with the right arguments. You can see how we do the latter pattern in `test_engines` some. You likely won't be able to have the unit tests prove data goes to/from HDFS directly, but I'd encourage manually testing that aspect at least once as well.
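A minimal example of the mocking pattern described here: replace the webhdfs client with a `MagicMock` and assert the wrapper forwards the documented arguments. No cluster (and no pywebhdfs install) is needed. The `HDFS` wrapper and `listdir` method below are illustrative stand-ins for the code in this PR, and `list_dir` mirrors the pywebhdfs method name as an assumption.

```python
from unittest.mock import MagicMock


class HDFS(object):
    """Illustrative wrapper with an injected webhdfs-style client."""

    def __init__(self, client):
        self._webhdfs = client

    def listdir(self, path):
        return self._webhdfs.list_dir(path)


def test_listdir_forwards_path():
    client = MagicMock()
    client.list_dir.return_value = {'FileStatuses': {'FileStatus': []}}
    hdfs = HDFS(client)
    result = hdfs.listdir('/tmp/notebooks')
    # prove the wrapper passed the path through unchanged
    client.list_dir.assert_called_once_with('/tmp/notebooks')
    assert result == {'FileStatuses': {'FileStatus': []}}
```

The test proves only that the wrapper forwards inputs correctly; as noted above, whether data actually round-trips through HDFS still deserves at least one manual check.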

@rgbkrk (Member) commented Jun 26, 2019

This PR has been open long enough that I'm going to recommend trying it again as a new PR, given how much has changed since November. Feel free to mention this PR in any new PRs related to this one.

Cheers!

@rgbkrk closed this Jun 26, 2019
