This repository has been archived by the owner on Aug 25, 2024. It is now read-only.

dffml: source: file: Add support for .zip file #38

Merged on Apr 11, 2019 (13 commits)

sudharsana-kjl
Contributor

The zipfile module has been used to support .zip files

@codecov-io

codecov-io commented Mar 25, 2019

Codecov Report

Merging #38 into master will decrease coverage by 0.34%.
The diff coverage is 48.14%.

@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
- Coverage   89.29%   88.95%   -0.35%     
==========================================
  Files          46       46              
  Lines        3186     3213      +27     
  Branches      334      337       +3     
==========================================
+ Hits         2845     2858      +13     
- Misses        291      303      +12     
- Partials       50       52       +2
Impacted Files               Coverage Δ
tests/source/test_file.py    100% <100%> (ø) ⬆️
dffml/source/file.py         79.1% <17.64%> (-20.9%) ⬇️
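
As a rough cross-check of the figures in this report, here is a minimal sketch. It assumes that coverage is hits divided by lines, that the diff coverage corresponds to the added hits over the added lines, and that percentages are truncated to two decimals rather than rounded. None of these assumptions is stated in the report itself, and the `pct` helper is made up for illustration.

```python
import math


def pct(hits, lines):
    """Coverage percentage truncated (not rounded) to two decimals."""
    return math.floor(hits / lines * 10000) / 100


# Figures copied from the coverage diff in the report.
before = pct(2845, 3186)  # master
after = pct(2858, 3213)   # this pull request
diff_coverage = pct(2858 - 2845, 3213 - 3186)  # added hits / added lines

print(before, after, diff_coverage)
```

Under those assumptions the three values come out as 89.29, 88.95 and 48.14, which agrees with the numbers Codecov printed.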

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@yashlamba
Contributor

yashlamba commented Mar 25, 2019

I guess you won't need a text wrapper, ZipFile defaults to text type. I was actually working on this since morning and was going to open a PR. The issue I'm facing is that it isn't working with the iris_training case. Did you try that one?

EDIT: My bad, got the text wrapper.

@sudharsana-kjl
Contributor Author

Without the text wrapper, it throws the following error:

Traceback (most recent call last):
  File "/home/sudharsana/GSoC/dffml/env/bin/dffml", line 11, in <module>
    load_entry_point('dffml==0.1.1', 'console_scripts', 'dffml')()
  File "/home/sudharsana/GSoC/dffml/env/lib/python3.6/site-packages/dffml-0.1.1-py3.6.egg/dffml/util/cli.py", line 145, in main
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/home/sudharsana/GSoC/dffml/env/lib/python3.6/site-packages/dffml-0.1.1-py3.6.egg/dffml/util/cli.py", line 127, in cli
  File "/home/sudharsana/GSoC/dffml/env/lib/python3.6/site-packages/dffml-0.1.1-py3.6.egg/dffml/cli.py", line 203, in run
  File "/home/sudharsana/GSoC/dffml/env/lib/python3.6/site-packages/dffml-0.1.1-py3.6.egg/dffml/util/asynchelper.py", line 20, in __aenter__
  File "/home/sudharsana/GSoC/dffml/env/lib/python3.6/site-packages/dffml-0.1.1-py3.6.egg/dffml/source/source.py", line 67, in __aenter__
  File "/home/sudharsana/GSoC/dffml/env/lib/python3.6/site-packages/dffml-0.1.1-py3.6.egg/dffml/source/file.py", line 39, in open
  File "/home/sudharsana/GSoC/dffml/env/lib/python3.6/site-packages/dffml-0.1.1-py3.6.egg/dffml/source/file.py", line 60, in _open
  File "/home/sudharsana/GSoC/dffml/env/lib/python3.6/site-packages/dffml-0.1.1-py3.6.egg/dffml/source/csvfile.py", line 35, in load_fd
  File "/usr/lib/python3.6/csv.py", line 111, in __next__
    self.fieldnames
  File "/usr/lib/python3.6/csv.py", line 98, in fieldnames
    self._fieldnames = next(self.reader)
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

This is why I used the text wrapper.
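
To illustrate the error, here is a minimal, self-contained sketch (not the DFFML code itself). It builds a tiny archive in memory, shows that `ZipFile.open()` hands back bytes, and then wraps the same member in `io.TextIOWrapper` so that `csv.DictReader` receives strings. The member name `data.csv` and the column names are made up for the example.

```python
import csv
import io
import zipfile

# Build a small zip archive in memory; "data.csv" is a made-up member name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "SepalLength,classification\n5.1,0\n")

with zipfile.ZipFile(buffer) as archive:
    # ZipFile.open() yields bytes, which the csv module rejects.
    with archive.open("data.csv", "r") as binary_fd:
        first_line = binary_fd.readline()

    # Wrapping the binary stream decodes it to str, which csv accepts.
    with io.TextIOWrapper(
        archive.open("data.csv", "r"), encoding="utf-8", newline=""
    ) as text_fd:
        rows = list(csv.DictReader(text_fd))

print(type(first_line).__name__, rows)
```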

@sudharsana-kjl
Contributor Author

sudharsana-kjl commented Mar 25, 2019

@yashlamba To fix your error, make sure that you change the headers using sed -i 's/.*setosa,versicolor,virginica/SepalLength,SepalWidth,PetalLength,PetalWidth,classification/g' *.csv and then compress it. Add the text wrapper as well.

@sudharsana-kjl
Contributor Author

sudharsana-kjl commented Mar 25, 2019

Also, this works only for a zip file that contains a single file inside.

elif zipfile.is_zipfile(self.filename):
    archive = zipfile.ZipFile(self.filename)
    for file in archive.infolist():
        file = io.TextIOWrapper(archive.open(file, 'r'))
        await self.load_fd(file)

Since a zip archive can contain multiple files, when I try to load them all inside the for loop, only the last file in archive.infolist() ends up in memory.

I think this happens because of:

async def load_fd(self, fd):
    '''
    Parses a CSV stream into Repo instances
    '''
    i = 0
    self.mem = {}

self.mem = {} runs at the beginning of every call to load_fd(), and I'm operating on a single source object, so the contents get overwritten on each call.
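
For illustration only, one way around the overwrite is to create the dictionary once, outside the per-file loader. This is a sketch under that assumption, not what the PR does (the PR keeps the single-file restriction). `load_fd` and `load_zip` here are hypothetical synchronous stand-ins, and the member names are made up.

```python
import csv
import io
import zipfile


def load_fd(fd, mem):
    """Add each CSV row from fd to mem, keyed by a running index."""
    for row in csv.DictReader(fd):
        mem[len(mem)] = row


def load_zip(zip_source):
    """Load every member of the archive into a single dict."""
    mem = {}  # Created once, so later members do not wipe earlier ones.
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            with io.TextIOWrapper(
                archive.open(info, "r"), encoding="utf-8", newline=""
            ) as fd:
                load_fd(fd, mem)
    return mem


# Two-member archive built in memory; the member names are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.csv", "x,y\n1,2\n")
    archive.writestr("b.csv", "x,y\n3,4\n")

records = load_zip(buffer)
print(records)
```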

Test

Test cases haven't been included yet. @pdxjohnny is there any alternative to using FakeFileSource in tests? I use is_zipfile() to check whether the file is a zip file, so the usual file extension check can't be used here. Can we import the zipfile package and create a new zip file inside the test?

EDIT: After having a discussion with Yash, we have decided that I'll be continuing to add the test cases as well.

@sudharsana-kjl
Contributor Author

Without the text wrapper, it throws the same _csv.Error traceback shown in my earlier comment (iterator should return strings, not bytes). This is why I used the text wrapper.

@yashlamba were you able to resolve the error?

@yashlamba
Contributor

Just wondering, why can't we use the regular file extension check? I tried and it seems to work fine.

@yashlamba
Contributor

Quoting @sudharsana-kjl: "Without text wrapper, it throws the following error: [same _csv.Error traceback as above] ... @yashlamba were you able to resolve the error"

It was happening with my code actually, I didn't try yours then.

It's working great with yours!

@sudharsana-kjl
Contributor Author

sudharsana-kjl commented Mar 25, 2019

@yashlamba just checking the file extension works fine in most cases, but it is not the most reliable way, since any file's extension can be changed. So I felt a stricter approach is to use the available functions to check the file's contents.
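
A small sketch of the difference between the two checks, using in-memory data so nothing touches the disk. The helper `has_zip_extension` and the filename `renamed.zip` are hypothetical; `zipfile.is_zipfile()` is the standard library call and accepts file-like objects as well as paths.

```python
import io
import zipfile

# A real zip archive, built in memory.
real_zip = io.BytesIO()
with zipfile.ZipFile(real_zip, "w") as archive:
    archive.writestr("data.csv", "a,b\n1,2\n")

# Plain CSV bytes that someone might have saved under a ".zip" name.
renamed_csv = io.BytesIO(b"a,b\n1,2\n")


def has_zip_extension(filename):
    """Extension check: trusts the name, never looks at the content."""
    return filename.endswith(".zip")


extension_says = has_zip_extension("renamed.zip")  # hypothetical filename
content_says = zipfile.is_zipfile(renamed_csv)     # inspects the bytes
real_says = zipfile.is_zipfile(real_zip)

print(extension_says, content_says, real_says)
```

The trade-off, which comes up later in this thread, is that a content check needs a readable file, which makes it harder to use with mocked file objects in tests.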

elif zipfile.is_zipfile(self.filename):
    archive = zipfile.ZipFile(self.filename)
    for file in archive.infolist():
        file = io.TextIOWrapper(archive.open(file, 'r'))

TextIOWrapper inherits from io.IOBase (https://docs.python.org/3/library/io.html#io.IOBase), so we can use the with keyword here as well.

Hence we have the opportunity to keep the convention of using opener = something.

However, since we'll use the with keyword only on opener (line 65), we need to define a small helper function so that both the ZipFile and the TextIOWrapper are closed at the end of the with block.

from contextlib import contextmanager

...

elif zipfile.is_zipfile(self.filename):
    @contextmanager
    def opener_helper():
        with zipfile.ZipFile(self.filename) as archive:
            for file in archive.infolist():
                with io.TextIOWrapper(archive.open(file, 'r')) as fd:
                    yield fd
                # Only care about the one file
                break
    opener = opener_helper()
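
For reference, a standalone, runnable variant of the same idea. It is a sketch rather than the code that was merged: `open_first_member` is a hypothetical name, it takes the first entry of `infolist()` directly instead of looping and breaking, and it raises on an empty archive (a generator-based context manager that never yields would otherwise fail with a less helpful error).

```python
import io
import zipfile
from contextlib import contextmanager


@contextmanager
def open_first_member(zip_source):
    """Yield a text stream for the first member of a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        members = archive.infolist()
        if not members:
            raise ValueError("zip archive contains no files")
        # Only the first member is read, matching the single-file assumption.
        with io.TextIOWrapper(
            archive.open(members[0], "r"), encoding="utf-8"
        ) as fd:
            yield fd
    # Leaving the with blocks above closes both the wrapper and the archive.


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "a,b\n1,2\n")  # made-up member name

with open_first_member(buffer) as fd:
    content = fd.read()
stream_closed = fd.closed
```

Because the helper returns a context manager, the caller can keep a single `with opener as fd:` statement regardless of which file type was detected.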

Contributor Author

Done 👍

@@ -66,10 +77,19 @@ def __repr__(self):
        elif self.filename[::-1].startswith(('.xz')[::-1]) or \
                self.filename[::-1].startswith(('.lzma')[::-1]):
            close = lzma.open(self.filename, 'wt')
        elif zipfile.is_zipfile(self.filename):
            tmp = tempfile.NamedTemporaryFile(suffix='.csv')

Let's avoid the use of a tempfile here. I think a helper method like the one suggested for loading might be useful; see ZipFile.write (https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.write) or maybe ZipFile.writestr (https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.writestr).
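
A minimal sketch of the `writestr` route, assuming the data can be serialised to a string in memory first. The function name `dump_csv_to_zip`, the member name and the columns are hypothetical, and this is not necessarily how the merged code ended up writing the archive.

```python
import csv
import io
import zipfile


def dump_csv_to_zip(zip_target, member_name, fieldnames, rows):
    """Write rows as CSV straight into a zip member, with no temp file."""
    text = io.StringIO()
    writer = csv.DictWriter(text, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    with zipfile.ZipFile(
        zip_target, "w", compression=zipfile.ZIP_DEFLATED
    ) as archive:
        # writestr() accepts a str and encodes it as UTF-8.
        archive.writestr(member_name, text.getvalue())


target = io.BytesIO()
dump_csv_to_zip(target, "data.csv", ["a", "b"], [{"a": 1, "b": 2}])

# Read the member back to confirm what was written.
with zipfile.ZipFile(target) as archive:
    written = archive.read("data.csv").decode("utf-8")
```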

Contributor Author

I have added a helper method similar to the one in _open method.

@sudharsana-kjl
Contributor Author

@pdxjohnny Could you suggest how to write test cases for this feature?

Right now, a FakeFileSource object is created in the test. Since support for most of the other file types only checks the extension, this works fine. But here I've used the built-in function is_zipfile() for the check. Because of this, I won't be able to use the existing FakeFileSource class.

@johnandersen777

I think you can mock is_zipfile with return_value=True, then check that it tries to open the file. You may need to mock return values for TextIOWrapper too, but I'm not sure.
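
A minimal sketch of that suggestion using `unittest.mock.patch`. The function `pick_format` is a hypothetical stand-in for the branch in `_open` that detects zip files, and `records.zip` does not need to exist, because the patched `is_zipfile` never touches the disk.

```python
import zipfile
from unittest.mock import patch


def pick_format(filename):
    """Hypothetical stand-in for the zip-detection branch in _open."""
    if zipfile.is_zipfile(filename):
        return "zip"
    return "plain"


# No file named "records.zip" exists; the patch keeps is_zipfile off the disk.
with patch("zipfile.is_zipfile", return_value=True) as m_is_zipfile:
    chosen = pick_format("records.zip")

# The mock records how it was called, so the test can check the argument.
called_with = m_is_zipfile.call_args[0]
```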

@sudharsana-kjl
Contributor Author

The usage of zipfile.is_zipfile() throws the following error while running existing tests:

(env) sudharsana@sudharsana-HP-15-Notebook-PC:~/GSoC/dffml(zipfile_feature)$ python -m unittest discover
.E.......E......................
add 40 and 2 {
    "get_single":{
        "result":42
    }
}


multiply 42 and 10 {
    "get_single":{
        "result":420
    }
}

............................................................
..........................
======================================================================
ERROR: test_close (tests.source.test_file.TestFileSource)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/sudharsana/GSoC/dffml/dffml/util/asynctestcase.py", line 39, in run_it
    result = self.loop.run_until_complete(coro(*args, **kwargs))
  File "/usr/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
    return future.result()
  File "/home/sudharsana/GSoC/dffml/tests/source/test_file.py", line 108, in test_close
    await source.close()
  File "/home/sudharsana/GSoC/dffml/dffml/source/file.py", line 71, in close
    await asyncio.shield(self._close())
  File "/home/sudharsana/GSoC/dffml/dffml/source/file.py", line 82, in _close
    elif zipfile.is_zipfile(self.filename):
  File "/usr/lib/python3.7/zipfile.py", line 204, in is_zipfile
    result = _check_zipfile(fp)
  File "/usr/lib/python3.7/zipfile.py", line 187, in _check_zipfile
    if _EndRecData(fp):
  File "/usr/lib/python3.7/zipfile.py", line 289, in _EndRecData
    maxCommentStart = max(filesize - (1 << 16) - sizeEndCentDir, 0)
TypeError: '>' not supported between instances of 'int' and 'MagicMock'

======================================================================
ERROR: test_open (tests.source.test_file.TestFileSource)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/sudharsana/GSoC/dffml/dffml/util/asynctestcase.py", line 39, in run_it
    result = self.loop.run_until_complete(coro(*args, **kwargs))
  File "/usr/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
    return future.result()
  File "/home/sudharsana/GSoC/dffml/tests/source/test_file.py", line 63, in test_open
    await source.open()
  File "/home/sudharsana/GSoC/dffml/dffml/source/file.py", line 39, in open
    await asyncio.shield(self._open())
  File "/home/sudharsana/GSoC/dffml/dffml/source/file.py", line 55, in _open
    elif zipfile.is_zipfile(self.filename):
  File "/usr/lib/python3.7/zipfile.py", line 204, in is_zipfile
    result = _check_zipfile(fp)
  File "/usr/lib/python3.7/zipfile.py", line 187, in _check_zipfile
    if _EndRecData(fp):
  File "/usr/lib/python3.7/zipfile.py", line 289, in _EndRecData
    maxCommentStart = max(filesize - (1 << 16) - sizeEndCentDir, 0)
TypeError: '>' not supported between instances of 'int' and 'MagicMock'

----------------------------------------------------------------------
Ran 118 tests in 0.685s

FAILED (errors=2)

Because of this, I have changed the condition to check only the extension similar to other file types.
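
The TypeError in the traceback can be reproduced in isolation. This sketch assumes the file size reached `zipfile` as a MagicMock, presumably because the existing tests replace the file object with a mock; the value 22 stands in for the size of the end-of-central-directory record.

```python
from unittest.mock import MagicMock

# Stand-in for the file size that zipfile obtained from a mocked file.
filesize = MagicMock()

# Arithmetic on a MagicMock succeeds and simply returns another MagicMock...
shifted = filesize - (1 << 16) - 22

# ...but ordering comparisons are not configured by default, so max() fails.
try:
    max(shifted, 0)
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)
```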

Also, I tried using a FakeFileSource object, but certain built-in functions in zipfile behaved differently and required specific attributes. So I created a MockZipFile class in test_file.


@johnandersen777 johnandersen777 left a comment

This is looking great! Thanks for all your work on this, I think these are my final comments on this.

source.close = m_open
with patch('os.path.exists', return_value=True), \
        patch('zipfile.ZipFile', m_open):
    source.close()

The await keyword needs to be used in front of calls to any async functions.

Travis:

test_close_xz (tests.source.test_file.TestFileSource) ... ok
test_close_zip (tests.source.test_file.TestFileSource) ... /home/travis/build/intel/dffml/tests/source/test_file.py:153: RuntimeWarning: coroutine 'FileSource.close' was never awaited
  source.close()
ok
test_filename (tests.source.test_file.TestFileSource) ... ok
test_filename_readonly (tests.source.test_file.TestFileSource) ... ok
...
ok
test_open_xz (tests.source.test_file.TestFileSource) ... ok
test_open_zip (tests.source.test_file.TestFileSource) ... /home/travis/build/intel/dffml/tests/source/test_file.py:104: RuntimeWarning: coroutine 'FileSource.open' was never awaited
  source.open()
ok

Contributor Author

Using await throws the following error:
TypeError: object MagicMock can't be used in 'await' expression
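
The error arises because calling a plain MagicMock returns another MagicMock, which is not awaitable. A minimal sketch of the problem and one possible remedy follows. Note the assumption: `AsyncMock` only exists in Python 3.8 and later, which postdates the Python 3.6/3.7 interpreters seen in this thread; on older versions a hand-written `async def` stub would serve the same purpose. `FakeSource` is a hypothetical stand-in, not the DFFML class.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


class FakeSource:
    """Hypothetical stand-in for a source with an async close()."""

    async def close(self):
        raise AssertionError("real close() should be replaced in the test")


async def run_checks():
    source = FakeSource()

    # A plain MagicMock returns a MagicMock when called: not awaitable.
    source.close = MagicMock()
    try:
        await source.close()
        plain_error = None
    except TypeError as exc:
        plain_error = exc

    # AsyncMock (Python 3.8+) returns an awaitable, so await works.
    source.close = AsyncMock()
    await source.close()
    return plain_error, source.close.await_count


plain_error, await_count = asyncio.run(run_checks())
```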

Contributor Author

Initially I was using a MockZipFile object. I changed the approach to using FakeFileSource object similar to how it is done already. Even then the error persists.

Hm, okay I'll check it out

@johnandersen777 johnandersen777 merged commit 1a66085 into intel:master Apr 11, 2019
@johnandersen777

Thanks for your work on this @sudharsana-kjl! After trying my hand at this I see how confusing zipfile is! I had to move things around a bit.

@sudharsana-kjl
Contributor Author

@pdxjohnny I was looking at the commit a58acb6. This is great! Dumping the data back as bzip makes it less complex. Also, in the tests, the use of the yield_42 function is a great idea. One question: since the MockZipFile object is not used anymore, could you please remove it? Or do you have any future plans for using it?

@johnandersen777

Damn! Good catch. Ideally we figure out how to test these lines: https://codecov.io/gh/intel/dffml/src/master/dffml/source/file.py#L83...94

and that probably still involves using the MockZipFile. If you want to fix my error there I'd appreciate it; it would also mean getting rid of yield_42 (I think).

@johnandersen777 johnandersen777 mentioned this pull request May 21, 2019