-
Notifications
You must be signed in to change notification settings - Fork 138
dffml: source: file: Add support for .zip file #38
dffml: source: file: Add support for .zip file #38
Conversation
Codecov Report
@@ Coverage Diff @@
## master #38 +/- ##
==========================================
- Coverage 89.29% 88.95% -0.35%
==========================================
Files 46 46
Lines 3186 3213 +27
Branches 334 337 +3
==========================================
+ Hits 2845 2858 +13
- Misses 291 303 +12
- Partials 50 52 +2
Continue to review full report at Codecov.
|
I guess you won't need a text wrapper, ZipFile defaults to text type. I was actually working on it since morning and was going to open a PR. The issue I am facing is it isn't working with iris_training case, did you try the same? EDIT: My bad, got the text wrapper. |
Without text wrapper, it throws the following error:
This is why i used text wrapper |
@yashlamba To fix your error, make sure that you change the headers using |
Also, this works only for a zip file that contains a single file inside. Lines 55 to 59 in 32389f3
Since a zip folder can contain multiple files, when I try to load them all inside the for-loop, only the last file in the I think this happens because of: Lines 29 to 34 in 32389f3
TestTest cases havent been included yet. @pdxjohnny any alternative to using EDIT: After having a discussion with Yash, we have decided that I'll be continuing to add the test cases as well. |
@yashlamba were you able to resolve the error? |
Just wondering, why can't we use the regular file extension check? I tried and it seems to work fine. |
It was happening with my code actually, I didn't try yours then. It's working great with yours! |
@yashlamba just checking the file extension works fine in most cases but it is not the best possible way. Any file extension can be modified. So I felt a more strict way is using the available functions to check them. |
dffml/source/file.py
Outdated
elif zipfile.is_zipfile(self.filename): | ||
archive = zipfile.ZipFile(self.filename) | ||
for file in archive.infolist(): | ||
file = io.TextIOWrapper(archive.open(file, 'r')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TextIOWrapper inherits from https://docs.python.org/3/library/io.html#io.IOBase io.IOBase
so we can use the with
keyword here as well.
Hence we have the opportunity to keep the convention of using opener = something
.
However, since we'll use the with keyword just on opener on line 65, we need to define a little helper function so that both the zipfile and the textiowrapper are closed at the end of the with block.
from contextlib import contextmanager
...
elif zipfile.is_zipfile(self.filename):
@contextmanager
def opener_helper():
with zipfile.ZipFile(self.filename) as archive:
for file in archive.infolist():
with io.TextIOWrapper(archive.open(filename, 'r')) as fd:
yield fd
# Only care about the one file
break
opener = opener_helper()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
dffml/source/file.py
Outdated
@@ -66,10 +77,19 @@ def __repr__(self): | |||
elif self.filename[::-1].startswith(('.xz')[::-1]) or \ | |||
self.filename[::-1].startswith(('.lzma')[::-1]): | |||
close = lzma.open(self.filename, 'wt') | |||
elif zipfile.is_zipfile(self.filename): | |||
tmp = tempfile.NamedTemporaryFile(suffix='.csv') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets avoid the use of a tempfile here. I think a helper method like with load might be useful, see https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.write or maybe https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.writestr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a helper method similar to the one in _open method.
@pdxjohnny Could you suggest how to write test cases for this feature? Right now, a |
I think you can mock the return_value of is_zipfile to return True, then check that it tries to open. I think you may need to mock return_value's for textiowrapper too, but not sure |
The usage of
Because of this, I have changed the condition to check only the extension similar to other file types. Also, I tried using FakeFileSource object but certain inbuilt functions in zipfile was different which required specific attributes. So, I created a MockZipFile class in test_file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great! Thanks for all your work on this, I think these are my final comments on this.
tests/source/test_file.py
Outdated
source.close = m_open | ||
with patch('os.path.exists', return_value=True), \ | ||
patch('zipfile.ZipFile',m_open): | ||
source.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The await
keyword needs to be used in front of any async
functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Travis:
test_close_xz (tests.source.test_file.TestFileSource) ... ok
test_close_zip (tests.source.test_file.TestFileSource) ... /home/travis/build/intel/dffml/tests/source/test_file.py:153: RuntimeWarning: coroutine 'FileSource.close' was never awaited
source.close()
ok
test_filename (tests.source.test_file.TestFileSource) ... ok
test_filename_readonly (tests.source.test_file.TestFileSource) ... ok
...
ok
test_open_xz (tests.source.test_file.TestFileSource) ... ok
test_open_zip (tests.source.test_file.TestFileSource) ... /home/travis/build/intel/dffml/tests/source/test_file.py:104: RuntimeWarning: coroutine 'FileSource.open' was never awaited
source.open()
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using await throws the following error:
TypeError: object MagicMock can't be used in 'await' expression
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I was using a MockZipFile
object. I changed the approach to using FakeFileSource
object similar to how it is done already. Even then the error persists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, okay I'll check it out
Signed-off-by: John Andersen <john.s.andersen@intel.com>
…o zipfile_feature
Thanks for your work on this @sudharsana-kjl! After trying my hand at this I see how confusing |
@pdxjohnny I was looking at the commit |
Damn! Good catch. Ideally we figure out how to test these lines: https://codecov.io/gh/intel/dffml/src/master/dffml/source/file.py#L83...94 and that probably still involves using the MockZipFile. If you want to fix my error there I'd appreciate it, it would also mean getting rid of yield42 (it think). |
The
zipfile
module has been used to support .zip files