-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Parquet, HDF5, and CSV interfaces #1194
Conversation
|
@wesm is it better map parquet logical types directly to ibis types, or convert to arrow types which then almost trivially map to ibis types? |
|
@jreback it's probably better to use the Arrow types, which will be more consistent with the underlying execution. How large do we anticipate each of the file implementations being? Any reason to have the module vs. |
I added an arrow converter (not used, but its in the tests) & a parquet one; certainly can change to just convert parquet types to arrow to ibis.
was organizing like the pandas backend. but yes we could combine all of these into a single file. though this is not user visible anyhow. |
|
ok revamped to use a single file per (csv, hdf5, parquet). |
62b8051
to
9a4080e
Compare
|
👍 |
ibis/file/parquet.py
Outdated
|
|
||
| def parquet_types_to_ibis_schema(schema): | ||
| pairs = [] | ||
| for cs in schema: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wesm is there (maybe internal) function to convert a ParquetColumnSchema to an Arrow schema?
would prefer to not have these parquet routines directly exposed here, rather have arrow types (which trivially convert to ibis types)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code exists, but needs to be exposed: https://issues.apache.org/jira/browse/ARROW-1759
|
@wesm @cpcloud I am not sure how parquet represents in python2 but this seems odd using the example table from https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_parquet.py#L72 I am not seeing a logical_type for strings in py2; in py3 these are UTF8 (both are BYTE_ARRAY as physical_type). note that the last field is passed in as bytes, so this looks as expected. PY2 PY3 |
|
Hm I would expect strings to be UTF8 there for both Pythons. That seems like a bug. |
|
revised, I guess I missed |
|
|
|
Yep, @xhochy when I said "strings" I meant "string types coming from parquet". All string types should be represented by unicode in python2 and str in python3. |
|
note that |
| 'float': float, | ||
| 'halffloat': float16, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to have two spellings of float16 or can we just leave it as float16?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i can take it out, but this is a common spelling, e.g. like float, double.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, that's fine.
ibis/file/client.py
Outdated
| class FileClient(ibis.client.Client): | ||
|
|
||
| def __init__(self, root): | ||
| super(FileClient, self).__init__() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need this call here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ibis/file/client.py
Outdated
| def __dir__(self): | ||
| dbs = self.list_databases(path=self.path) | ||
| tables = self.list_tables(path=self.path) | ||
| return sorted(list(set(dbs).union(set(tables)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can just call sorted here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ibis/file/client.py
Outdated
|
|
||
| new_name = "{}.{}".format(name, self.extension) | ||
| if (self.root / name).is_dir(): | ||
| path = path / name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do augmented assignment here as: path /= name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ibis/file/client.py
Outdated
|
|
||
| def __getattr__(self, name): | ||
| try: | ||
| return object.__getattribute__(self, name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A call to __getattribute__ isn't necessary because you're only ever inside __getattr__ if a call to __getattribute__ has raised AttributeError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ibis/file/parquet.py
Outdated
|
|
||
|
|
||
| _ARROW_DTYPE_TO_IBIS_TYPE = { | ||
| 'int8': dt.int8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use the arrow objects here rather than strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems not implemented :>
In [5]: hash(pa.int8())
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-bca0e6e2f6af> in <module>()
----> 1 hash(pa.int8())
TypeError: unhashable type: 'pyarrow.lib.DataType'
| } | ||
|
|
||
|
|
||
| def arrow_types_to_ibis_schema(schema): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make a note somewhere that this doesn't handle complex types yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ibis/file/parquet.py
Outdated
| return t | ||
|
|
||
| def list_tables(self, path=None): | ||
| # tables are files in a dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again this looks pretty similar, I think there's some opportunity to factor this code out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
ibis/file/parquet.py
Outdated
|
|
||
|
|
||
| @pre_execute.register(ParquetTable, ParquetClient) | ||
| def parquet_data_preload_uri_client(op, client, scope=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See csv pre_execute comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comment above
ibis/file/utils.py
Outdated
| @@ -0,0 +1,6 @@ | |||
| try: | |||
| import pathlib # noqa | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put the specific flake8 error code here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i remove this file, only used in 1 place
| @@ -415,6 +415,32 @@ def valid_literal(self, value): | |||
| return isinstance(value, six.string_types + (datetime.datetime,)) | |||
|
|
|||
|
|
|||
| class SignedInteger(Integer): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Int8, Int16, Int32, and Int64 classes should inherit from this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
@cpcloud looks good to go |
|
@cpcloud this looks ready to merge. |
|
ping |
ibis/expr/types.py
Outdated
| pass | ||
|
|
||
|
|
||
| class HalffloatScalar(ScalarExpr, FloatValue): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to subclass from halffloat value not float value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ibis/expr/types.py
Outdated
| pass | ||
|
|
||
|
|
||
| class HalffloatColumn(NumericColumn, FloatValue): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as previous comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ibis/file/csv.py
Outdated
| usecols = None | ||
|
|
||
| else: | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove some of this extra whitespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| if str(d).endswith(self.extension): | ||
| tables.append(d.stem) | ||
| elif path.is_file(): | ||
| # by definition we are at the db level at this point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for HDF5 a file is a database, as well as a dir.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be in the HDF client then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, if you look above you requested that the common code be moved here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I definitely remember that request, but this method is only used in one place and makes assumptions that are specific to HDF5 which is why I'm asking. This isn't a blocker though.
remove halffloatingvalue superclass
|
Merging! Thanks @jreback! |
closes #1175
closes #1165
on top of #1167