ENH: Parquet, HDF5, and CSV interfaces #1194

Closed
wants to merge 16 commits into
base: master
from

4 participants
@jreback
Contributor

jreback commented Oct 27, 2017

closes #1175
closes #1165

on top of #1167

@jreback jreback added the enhancement label Oct 27, 2017

@jreback

Contributor

jreback commented Oct 27, 2017

@wesm is it better to map Parquet logical types directly to ibis types, or to convert them to Arrow types, which then map almost trivially to ibis types?

(Pdb) p parquet_file.schema
<pyarrow._parquet.ParquetSchema object at 0x10bcf6c48>
ticker: BYTE_ARRAY UTF8
time: INT64 TIMESTAMP_MICROS
open: DOUBLE
__index_level_0__: INT64
 
(Pdb) p list(parquet_file.schema)[0].logical_type
'UTF8'
@wesm

Member

wesm commented Oct 27, 2017

@jreback it's probably better to use the Arrow types, which will be more consistent with the underlying execution.

How large do we anticipate each of the file implementations being? Any reason to have the module vs. ibis/file/parquet.py?

@jreback jreback force-pushed the jreback:parquet branch from 6be2ce2 to dc02d00 Oct 27, 2017

@jreback

Contributor

jreback commented Oct 27, 2017

> @jreback it's probably better to use the Arrow types, which will be more consistent with the underlying execution.

I added an Arrow converter (not used, but it's in the tests) and a Parquet one; I can certainly change this to convert Parquet types to Arrow types and then to ibis types.

> How large do we anticipate each of the file implementations being? Any reason to have the module vs. ibis/file/parquet.py?

I was organizing it like the pandas backend, but yes, we could combine all of these into a single file. This is not user-visible anyhow.

@jreback jreback force-pushed the jreback:parquet branch from dc02d00 to b786d88 Oct 27, 2017

@jreback jreback changed the title from WIP: Parquet file interface to ENH: Parquet file interface Oct 27, 2017

@jreback jreback force-pushed the jreback:parquet branch from b786d88 to 99e11c5 Oct 27, 2017

@jreback jreback referenced this pull request Oct 27, 2017

Open

ENH: file back end enhancements #1195

0 of 3 tasks complete

@jreback jreback self-assigned this Oct 27, 2017

@jreback jreback added this to the 0.12 milestone Oct 27, 2017

@jreback

Contributor

jreback commented Oct 27, 2017

OK, revamped to use a single file per format (csv, hdf5, parquet).

@jreback jreback force-pushed the jreback:parquet branch 3 times, most recently from 62b8051 to 9a4080e Oct 27, 2017

@wesm

Member

wesm commented Oct 30, 2017

👍

@jreback jreback force-pushed the jreback:parquet branch from 9a4080e to 3e49b37 Oct 30, 2017

@jreback jreback modified the milestones: 0.12, 0.13 Oct 30, 2017

@jreback jreback force-pushed the jreback:parquet branch from 3e49b37 to 5812d8d Oct 31, 2017

def parquet_types_to_ibis_schema(schema):
    pairs = []
    for cs in schema:

@jreback

jreback Oct 31, 2017

Contributor

@wesm is there a (maybe internal) function to convert a ParquetColumnSchema to an Arrow schema?
I would prefer not to have these Parquet routines directly exposed here, and to work with Arrow types instead (which trivially convert to ibis types).

@wesm

wesm Oct 31, 2017

Member

The code exists, but needs to be exposed: https://issues.apache.org/jira/browse/ARROW-1759

@cpcloud cpcloud changed the title from ENH: Parquet file interface to ENH: Parquet, HDF5, and CSV interfaces Oct 31, 2017

@jreback

Contributor

jreback commented Oct 31, 2017

@wesm @cpcloud I am not sure how Parquet represents strings in Python 2, but this seems odd:

using the example table from https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_parquet.py#L72

I am not seeing a logical_type for strings in py2; in py3 these are UTF8 (both are BYTE_ARRAY as physical_type). Note that the last field is passed in as bytes, so that one looks as expected.

PY2

<pyarrow._parquet.ParquetSchema object at 0x1047bf908>
uint8: INT32 UINT_8
uint16: INT32 UINT_16
uint32: INT64
uint64: INT64 UINT_64
int8: INT32 INT_16
int16: INT32 INT_16
int32: INT32
int64: INT64
float32: FLOAT
float64: DOUBLE
bool: BOOLEAN
datetime: INT64 TIMESTAMP_MICROS
str: BYTE_ARRAY
str_with_nulls: BYTE_ARRAY
empty_str: BYTE_ARRAY
bytes: BYTE_ARRAY
__index_level_0__: INT64

PY3

uint8: INT32 UINT_8
uint16: INT32 UINT_16
uint32: INT64
uint64: INT64 UINT_64
int8: INT32 INT_16
int16: INT32 INT_16
int32: INT32
int64: INT64
float32: FLOAT
float64: DOUBLE
bool: BOOLEAN
datetime: INT64 TIMESTAMP_MICROS
str: BYTE_ARRAY UTF8
str_with_nulls: BYTE_ARRAY UTF8
empty_str: BYTE_ARRAY UTF8
bytes: BYTE_ARRAY
__index_level_0__: INT64
@cpcloud

Member

cpcloud commented Oct 31, 2017

Hm I would expect strings to be UTF8 there for both Pythons. That seems like a bug.

@jreback

Contributor

jreback commented Nov 1, 2017

revised, I guess I missed parquet_file.schema.to_arrow_types()

@xhochy

xhochy commented Nov 1, 2017

unicode in Python 2 should be UTF8; str has no encoding and is only regarded as a byte array.

@cpcloud

Member

cpcloud commented Nov 1, 2017

Yep, @xhochy when I said "strings" I meant "string types coming from parquet". All string types should be represented by unicode in python2 and str in python3.
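
For illustration, the distinction above can be sketched in Python 3 terms: text values would carry the UTF8 annotation, while raw bytes stay a plain BYTE_ARRAY. The helper name `parquet_annotation` is hypothetical and is not a pyarrow API; it only mirrors the schema printouts earlier in this thread.

```python
def parquet_annotation(value):
    """Return (physical_type, logical_type) for a text or bytes value.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    if isinstance(value, str):
        # Text has a known encoding, so it can be annotated as UTF8.
        return ("BYTE_ARRAY", "UTF8")
    if isinstance(value, (bytes, bytearray)):
        # Raw bytes carry no encoding, so no logical type is attached.
        return ("BYTE_ARRAY", None)
    raise TypeError("expected text or bytes, got {}".format(type(value).__name__))


print(parquet_annotation("ticker"))
print(parquet_annotation(b"ticker"))
```

In Python 2 the same split would fall between `unicode` (text) and `str` (bytes), which is why a py2 `str` column shows no logical type.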

@jreback

Contributor

jreback commented Nov 1, 2017

Note that .to_arrow_types() handles this the correct way. I suspect there is not enough metadata in Parquet and Arrow is doing some kind of workaround here.

'float': float,
'halffloat': float16,

@cpcloud

cpcloud Nov 1, 2017

Member

Do we need to have two spellings of float16 or can we just leave it as float16?

@jreback

jreback Nov 2, 2017

Contributor

I can take it out, but this is a common spelling, just like float and double.

@cpcloud

cpcloud Nov 2, 2017

Member

Ok, that's fine.

class FileClient(ibis.client.Client):
    def __init__(self, root):
        super(FileClient, self).__init__()

@cpcloud

cpcloud Nov 1, 2017

Member

Don't need this call here.

@jreback

jreback Nov 2, 2017

Contributor

done

def __dir__(self):
    dbs = self.list_databases(path=self.path)
    tables = self.list_tables(path=self.path)
    return sorted(list(set(dbs).union(set(tables))))

@cpcloud

cpcloud Nov 1, 2017

Member

Can just call sorted here.

@jreback

jreback Nov 2, 2017

Contributor

done
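
A minimal sketch of the simplification suggested here: `sorted` accepts any iterable and returns a new list, so the intermediate `list(...)` and the second `set(...)` are not needed. The function name `combined_names` is made up for this example.

```python
def combined_names(dbs, tables):
    """Return the unique names from both sequences in sorted order."""
    # set.union() accepts any iterable, and sorted() already returns a list.
    return sorted(set(dbs).union(tables))


names = combined_names(["prices", "meta"], ["meta", "trades"])
print(names)
```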

new_name = "{}.{}".format(name, self.extension)
if (self.root / name).is_dir():
    path = path / name

@cpcloud

cpcloud Nov 1, 2017

Member

You can do augmented assignment here as: path /= name

@jreback

jreback Nov 2, 2017

Contributor

done
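
For reference, a short sketch of the augmented form with `pathlib`. The path and name values are placeholders. Path objects are immutable, so `/=` rebinds the variable to a new path rather than modifying the old one in place.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/data/root")
name = "prices"

# Equivalent to: path = path / name
path /= name
print(path)
```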

def __getattr__(self, name):
    try:
        return object.__getattribute__(self, name)

@cpcloud

cpcloud Nov 1, 2017

Member

A call to __getattribute__ isn't necessary because you're only ever inside __getattr__ if a call to __getattribute__ has raised AttributeError

@jreback

jreback Nov 2, 2017

Contributor

done
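
The lookup order described in this review comment can be checked with a toy class. This is a sketch, not the client code from this PR; `Namespace` and its attributes are hypothetical.

```python
class Namespace:
    """Toy object that resolves unknown attributes from a dict of tables."""

    def __init__(self, tables):
        self.tables = tables
        self.fallback_calls = 0

    def __getattr__(self, name):
        # Python reaches this method only after normal lookup
        # (object.__getattribute__) has already raised AttributeError.
        self.fallback_calls += 1
        try:
            return self.tables[name]
        except KeyError:
            raise AttributeError(name)


ns = Namespace({"prices": "table:prices"})
ns.tables   # found by normal lookup, so __getattr__ is not called
ns.prices   # not a real attribute, so __getattr__ handles it
print(ns.fallback_calls)
```

Because normal lookup has already failed by the time `__getattr__` runs, calling `object.__getattribute__` again inside it would only raise the same `AttributeError` a second time.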

_ARROW_DTYPE_TO_IBIS_TYPE = {
    'int8': dt.int8,

@cpcloud

cpcloud Nov 1, 2017

Member

Can we use the arrow objects here rather than strings?

@jreback

jreback Nov 2, 2017

Contributor

This seems not implemented :>

In [5]: hash(pa.int8())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-bca0e6e2f6af> in <module>()
----> 1 hash(pa.int8())

TypeError: unhashable type: 'pyarrow.lib.DataType'
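
A sketch of why string keys work around the error shown above, using a stand-in class rather than pyarrow itself. `FakeDataType`, `NAME_TO_IBIS`, and `lookup` are hypothetical names, and the string values only stand in for ibis types. A Python class that defines `__eq__` without `__hash__` becomes unhashable, so its instances cannot be dict keys, but their string form can.

```python
class FakeDataType:
    """Stand-in for a type object that defines equality but no hash."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Defining __eq__ without __hash__ sets __hash__ to None,
        # which makes instances unusable as dict keys.
        return isinstance(other, FakeDataType) and self.name == other.name

    def __str__(self):
        return self.name


# Keys are type names; values are placeholders for the ibis types.
NAME_TO_IBIS = {"int8": "dt.int8", "int16": "dt.int16"}


def lookup(dtype):
    """Map a type object to its target by string name."""
    return NAME_TO_IBIS[str(dtype)]


try:
    hash(FakeDataType("int8"))
except TypeError as exc:
    print("unhashable:", exc)

print(lookup(FakeDataType("int8")))
```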
}
def arrow_types_to_ibis_schema(schema):

@cpcloud

cpcloud Nov 1, 2017

Member

Can you make a note somewhere that this doesn't handle complex types yet?

@jreback

jreback Nov 2, 2017

Contributor

done
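
One way such a note could be backed up in code is an explicit failure for unmapped types. The following is a sketch only: plain strings stand in for Arrow type names and ibis types, and `PRIMITIVE_TO_IBIS` and `fields_to_schema_pairs` are hypothetical names, not the functions from this PR.

```python
# Primitive type names only; nested types (list, struct, map) are not handled yet.
PRIMITIVE_TO_IBIS = {
    "int8": "int8",
    "int64": "int64",
    "double": "double",
    "string": "string",
}


def fields_to_schema_pairs(fields):
    """Convert (column, type_name) pairs to (column, ibis_type) pairs."""
    pairs = []
    for column, type_name in fields:
        try:
            pairs.append((column, PRIMITIVE_TO_IBIS[type_name]))
        except KeyError:
            # Fail loudly instead of silently dropping the column.
            raise NotImplementedError(
                "no ibis mapping for {!r} (column {!r})".format(type_name, column)
            )
    return pairs


print(fields_to_schema_pairs([("ticker", "string"), ("open", "double")]))
```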

return t
def list_tables(self, path=None):
    # tables are files in a dir

@cpcloud

cpcloud Nov 1, 2017

Member

Again this looks pretty similar, I think there's some opportunity to factor this code out.

@jreback

jreback Nov 2, 2017

Contributor

fixed

@pre_execute.register(ParquetTable, ParquetClient)
def parquet_data_preload_uri_client(op, client, scope=None, **kwargs):

@cpcloud

cpcloud Nov 1, 2017

Member

See csv pre_execute comment

@jreback

jreback Nov 2, 2017

Contributor

see my comment above

@@ -0,0 +1,6 @@
try:
    import pathlib # noqa

@cpcloud

cpcloud Nov 1, 2017

Member

Can you put the specific flake8 error code here?

@jreback

jreback Nov 2, 2017

Contributor

I removed this file; it was only used in one place.

@@ -415,6 +415,32 @@ def valid_literal(self, value):
return isinstance(value, six.string_types + (datetime.datetime,))
class SignedInteger(Integer):

@cpcloud

cpcloud Nov 2, 2017

Member

The Int8, Int16, Int32, and Int64 classes should inherit from this.

@jreback

jreback Nov 2, 2017

Contributor

done
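
A sketch of the hierarchy this comment asks for, reduced to bare classes. Only `Integer`, `SignedInteger`, and the `Int8` to `Int64` names come from the thread; `UnsignedInteger` and `UInt8` are assumptions added here for contrast, and the real ibis classes carry more behaviour than shown.

```python
class Integer:
    pass


class SignedInteger(Integer):
    """Common base so callers can test for signedness with isinstance()."""


class UnsignedInteger(Integer):
    """Assumed counterpart for contrast; not taken from this thread."""


class Int8(SignedInteger):
    pass


class Int64(SignedInteger):
    pass


class UInt8(UnsignedInteger):
    pass


print(isinstance(Int8(), SignedInteger))
print(isinstance(UInt8(), SignedInteger))
```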

@jreback jreback force-pushed the jreback:parquet branch from 42ef51d to 6b37516 Nov 3, 2017

@jreback jreback force-pushed the jreback:parquet branch from 6b37516 to 33cd1da Nov 7, 2017

@jreback

Contributor

jreback commented Nov 7, 2017

@cpcloud looks good to go

@jreback

Contributor

jreback commented Nov 14, 2017

@cpcloud this looks ready to merge.

@jreback

Contributor

jreback commented Nov 19, 2017

ping

pass
class HalffloatScalar(ScalarExpr, FloatValue):

@cpcloud

cpcloud Nov 19, 2017

Member

This needs to subclass the halffloat value class, not the float value class.

@jreback

jreback Nov 19, 2017

Contributor

done

pass
class HalffloatColumn(NumericColumn, FloatValue):

@cpcloud

cpcloud Nov 19, 2017

Member

Same as previous comment

@jreback

jreback Nov 19, 2017

Contributor

done

jreback added some commits Nov 19, 2017

usecols = None
else:

@cpcloud

cpcloud Nov 19, 2017

Member

Can you remove some of this extra whitespace?

@jreback

jreback Nov 19, 2017

Contributor

fixed

        if str(d).endswith(self.extension):
            tables.append(d.stem)
elif path.is_file():
    # by definition we are at the db level at this point

@cpcloud

cpcloud Nov 19, 2017

Member

Why is this true?

@jreback

jreback Nov 19, 2017

Contributor

For HDF5, a file is a database, just as a directory is.

@cpcloud

cpcloud Nov 20, 2017

Member

Should this be in the HDF client then?

@jreback

jreback Nov 20, 2017

Contributor

no, if you look above you requested that the common code be moved here.

@cpcloud

cpcloud Nov 20, 2017

Member

Right, I definitely remember that request, but this method is only used in one place and makes assumptions that are specific to HDF5 which is why I'm asking. This isn't a blocker though.
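
To make the two cases in this exchange concrete, here is a simplified sketch of a shared `list_tables` helper. The standalone signature and the `NotImplementedError` branch are assumptions for illustration; the method in the PR lives on the client class and is only partly shown above.

```python
import tempfile
from pathlib import Path


def list_tables(path, extension):
    """List table names under `path` (hypothetical, simplified signature)."""
    path = Path(path)
    tables = []
    if path.is_dir():
        # Directory as database: each file with the extension is a table.
        for d in path.iterdir():
            if d.is_file() and str(d).endswith(extension):
                tables.append(d.stem)
    elif path.is_file():
        # File as database (the HDF5 case): tables live inside the file,
        # so a format-specific reader would be needed here.
        raise NotImplementedError("reading tables from a single file")
    return sorted(tables)


with tempfile.TemporaryDirectory() as root:
    for filename in ("prices.csv", "trades.csv", "notes.txt"):
        (Path(root) / filename).touch()
    print(list_tables(root, "csv"))
```

The `elif` branch is the part that is specific to one format, which is the concern raised about keeping it in the shared code.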

jreback added some commits Nov 19, 2017

remove implicit casting tests
remove halffloatingvalue superclass
@cpcloud

Member

cpcloud commented Nov 20, 2017

Merging! Thanks @jreback!

@cpcloud cpcloud closed this Nov 20, 2017

@cpcloud cpcloud reopened this Nov 20, 2017

@cpcloud cpcloud closed this in 7b777fd Nov 20, 2017
