ENH: Parquet, HDF5, and CSV interfaces #1194

Closed
wants to merge 16 commits into from


jreback
Contributor

@jreback jreback commented Oct 27, 2017

closes #1175
closes #1165

on top of #1167

@jreback jreback added the feature Features or general enhancements label Oct 27, 2017
@jreback
Contributor Author

jreback commented Oct 27, 2017

@wesm is it better to map Parquet logical types directly to ibis types, or to convert them to Arrow types, which then map almost trivially to ibis types?

(Pdb) p parquet_file.schema
<pyarrow._parquet.ParquetSchema object at 0x10bcf6c48>
ticker: BYTE_ARRAY UTF8
time: INT64 TIMESTAMP_MICROS
open: DOUBLE
__index_level_0__: INT64
 
(Pdb) p list(parquet_file.schema)[0].logical_type
'UTF8'

@wesm
Member

wesm commented Oct 27, 2017

@jreback it's probably better to use the Arrow types, which will be more consistent with the underlying execution.

How large do we anticipate each of the file implementations being? Any reason to have the module vs. ibis/file/parquet.py?

@jreback
Contributor Author

jreback commented Oct 27, 2017

> @jreback it's probably better to use the Arrow types, which will be more consistent with the underlying execution.

I added an Arrow converter (not used, but it's in the tests) and a Parquet one; I can certainly change this to convert Parquet types to Arrow types and then to ibis types.
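
The two-step conversion discussed here (Parquet to Arrow, then Arrow to ibis) could be sketched roughly as follows. This is only an illustration, not the PR's code: plain strings stand in for real Arrow and ibis type objects, the two lookup tables cover only the columns from the schema printed earlier, and `parquet_columns_to_ibis_pairs` is a hypothetical name.

```python
# Parquet (physical_type, logical_type) -> Arrow type name.
# Only the combinations from the schema printed earlier are covered.
_PARQUET_TO_ARROW = {
    ('BYTE_ARRAY', 'UTF8'): 'string',
    ('INT64', 'TIMESTAMP_MICROS'): 'timestamp[us]',
    ('DOUBLE', None): 'double',
    ('INT64', None): 'int64',
}

# Arrow type name -> ibis type name (strings stand in for real type objects).
_ARROW_TO_IBIS = {
    'string': 'string',
    'timestamp[us]': 'timestamp',
    'double': 'double',
    'int64': 'int64',
}


def parquet_columns_to_ibis_pairs(columns):
    """Map (name, physical_type, logical_type) triples to (name, ibis type name)."""
    pairs = []
    for name, physical, logical in columns:
        # Step 1: Parquet -> Arrow; step 2: Arrow -> ibis.
        arrow_name = _PARQUET_TO_ARROW[(physical, logical)]
        pairs.append((name, _ARROW_TO_IBIS[arrow_name]))
    return pairs


columns = [
    ('ticker', 'BYTE_ARRAY', 'UTF8'),
    ('time', 'INT64', 'TIMESTAMP_MICROS'),
    ('open', 'DOUBLE', None),
    ('__index_level_0__', 'INT64', None),
]
pairs = parquet_columns_to_ibis_pairs(columns)
```

Going through Arrow keeps the Parquet-specific knowledge in one small table, so the ibis side only ever needs to understand Arrow types.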

> How large do we anticipate each of the file implementations being? Any reason to have the module vs. ibis/file/parquet.py?

I was organizing this like the pandas backend, but yes, we could combine all of these into a single file. This is not user-visible anyhow.

@jreback jreback changed the title WIP: Parquet file interface ENH: Parquet file interface Oct 27, 2017
@jreback jreback mentioned this pull request Oct 27, 2017
3 tasks
@jreback jreback self-assigned this Oct 27, 2017
@jreback jreback added this to the 0.12 milestone Oct 27, 2017
@jreback
Contributor Author

jreback commented Oct 27, 2017

ok revamped to use a single file per (csv, hdf5, parquet).

@jreback jreback force-pushed the parquet branch 3 times, most recently from 62b8051 to 9a4080e Compare October 28, 2017 00:47
@wesm
Member

wesm commented Oct 30, 2017

👍


def parquet_types_to_ibis_schema(schema):
    pairs = []
    for cs in schema:
Contributor Author

@wesm is there a (maybe internal) function to convert a ParquetColumnSchema to an Arrow schema?
I would prefer not to have these Parquet routines directly exposed here, and to work with Arrow types instead (which trivially convert to ibis types).

Member

The code exists, but needs to be exposed: https://issues.apache.org/jira/browse/ARROW-1759

@cpcloud cpcloud changed the title ENH: Parquet file interface ENH: Parquet, HDF5, and CSV interfaces Oct 31, 2017
@jreback
Contributor Author

jreback commented Oct 31, 2017

@wesm @cpcloud I am not sure how Parquet represents strings in Python 2, but this seems odd.

using the example table from https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_parquet.py#L72

I am not seeing a logical_type for strings in py2; in py3 these are UTF8 (both are BYTE_ARRAY as physical_type). note that the last field is passed in as bytes, so this looks as expected.

PY2

<pyarrow._parquet.ParquetSchema object at 0x1047bf908>
uint8: INT32 UINT_8
uint16: INT32 UINT_16
uint32: INT64
uint64: INT64 UINT_64
int8: INT32 INT_16
int16: INT32 INT_16
int32: INT32
int64: INT64
float32: FLOAT
float64: DOUBLE
bool: BOOLEAN
datetime: INT64 TIMESTAMP_MICROS
str: BYTE_ARRAY
str_with_nulls: BYTE_ARRAY
empty_str: BYTE_ARRAY
bytes: BYTE_ARRAY
__index_level_0__: INT64

PY3

uint8: INT32 UINT_8
uint16: INT32 UINT_16
uint32: INT64
uint64: INT64 UINT_64
int8: INT32 INT_16
int16: INT32 INT_16
int32: INT32
int64: INT64
float32: FLOAT
float64: DOUBLE
bool: BOOLEAN
datetime: INT64 TIMESTAMP_MICROS
str: BYTE_ARRAY UTF8
str_with_nulls: BYTE_ARRAY UTF8
empty_str: BYTE_ARRAY UTF8
bytes: BYTE_ARRAY
__index_level_0__: INT64

@cpcloud
Member

cpcloud commented Oct 31, 2017

Hm I would expect strings to be UTF8 there for both Pythons. That seems like a bug.

@jreback
Contributor Author

jreback commented Nov 1, 2017

revised, I guess I missed parquet_file.schema.to_arrow_types()

@xhochy

xhochy commented Nov 1, 2017

unicode in Python 2 should be UTF8; str has no encoding and is only regarded as a byte array.

@cpcloud
Member

cpcloud commented Nov 1, 2017

Yep, @xhochy when I said "strings" I meant "string types coming from parquet". All string types should be represented by unicode in python2 and str in python3.
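
The text/bytes distinction behind the two schema listings can be illustrated with a small Python 3 sketch. `parquet_annotation` is a hypothetical helper written for this note, not a pyarrow function; it only mirrors the physical and logical types shown in the listings.

```python
def parquet_annotation(value):
    """Return a (physical_type, logical_type) pair for a string-like value.

    Text (``str`` in Python 3, ``unicode`` in Python 2) carries an encoding,
    so it gets the UTF8 logical type. Raw bytes carry no encoding, so they
    stay a plain BYTE_ARRAY with no logical type.
    """
    if isinstance(value, str):
        return ('BYTE_ARRAY', 'UTF8')
    if isinstance(value, bytes):
        return ('BYTE_ARRAY', None)
    raise TypeError('expected str or bytes, got {}'.format(type(value).__name__))


text_annotation = parquet_annotation('foo')
bytes_annotation = parquet_annotation(b'foo')
```

Under Python 2, a bare `'foo'` literal is `bytes`, which is consistent with the `str` columns in the PY2 listing having no logical type.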

@jreback
Contributor Author

jreback commented Nov 1, 2017

Note that .to_arrow_types() handles this correctly. I am thinking that there might not be enough metadata in Parquet, and Arrow is doing some kind of workaround here.

'float': float,
'halffloat': float16,
Member

Do we need to have two spellings of float16 or can we just leave it as float16?

Contributor Author

I can take it out, but this is a common spelling, e.g. like float and double.

Member

Ok, that's fine.

class FileClient(ibis.client.Client):

    def __init__(self, root):
        super(FileClient, self).__init__()
Member

Don't need this call here.

Contributor Author

done

    def __dir__(self):
        dbs = self.list_databases(path=self.path)
        tables = self.list_tables(path=self.path)
        return sorted(list(set(dbs).union(set(tables))))
Member

Can just call sorted here.
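
A quick sketch of the simplification being suggested: `sorted` accepts any iterable and always returns a new list, and `set.union` accepts any iterable as well, so both the `list(...)` wrapper and the second `set(...)` call are redundant. The sample names below are made up for illustration.

```python
dbs = ['db1', 'shared']
tables = ['t1', 'shared']

# Original form from the diff.
verbose = sorted(list(set(dbs).union(set(tables))))

# Equivalent, shorter form.
concise = sorted(set(dbs).union(tables))
```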

Contributor Author

done


        new_name = "{}.{}".format(name, self.extension)
        if (self.root / name).is_dir():
            path = path / name
Member

You can do augmented assignment here as: path /= name
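
For reference, a small sketch of how `/=` behaves with `pathlib` paths. Path objects are immutable, so the augmented form simply rebinds the name to a new path; the directory names here are illustrative only.

```python
import pathlib

root = pathlib.PurePosixPath('/data')

path = root
path /= 'mydb'  # same result as: path = path / 'mydb'
```

After this runs, `root` is unchanged and `path` refers to a new object for `/data/mydb`.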

Contributor Author

done


    def __getattr__(self, name):
        try:
            return object.__getattribute__(self, name)
Member

A call to __getattribute__ isn't necessary because you're only ever inside __getattr__ if a call to __getattribute__ has raised AttributeError

Contributor Author

done



_ARROW_DTYPE_TO_IBIS_TYPE = {
    'int8': dt.int8,
Member

Can we use the arrow objects here rather than strings?

Contributor Author

This seems not implemented :>

In [5]: hash(pa.int8())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-bca0e6e2f6af> in <module>()
----> 1 hash(pa.int8())

TypeError: unhashable type: 'pyarrow.lib.DataType'
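
A sketch of why string keys work around this, assuming the type objects cannot be hashed but do have a stable string form. `FakeArrowType` is a stand-in written for this note (it is not a pyarrow class), and the lookup table uses plain strings in place of ibis types.

```python
class FakeArrowType(object):
    """Stand-in for an unhashable type object.

    In Python 3, defining __eq__ without __hash__ makes instances
    unhashable, so they cannot be used as dict keys.
    """

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, FakeArrowType) and self.name == other.name

    def __str__(self):
        return self.name


# Keyed by the type's string form instead of the type object itself.
_TYPE_NAME_TO_IBIS = {'int8': 'int8', 'double': 'double'}


def lookup(arrow_type):
    return _TYPE_NAME_TO_IBIS[str(arrow_type)]


try:
    hash(FakeArrowType('int8'))
    hashable = True
except TypeError:
    hashable = False

result = lookup(FakeArrowType('int8'))
```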

}


def arrow_types_to_ibis_schema(schema):
Member

Can you make a note somewhere that this doesn't handle complex types yet?

Contributor Author

@jreback jreback Nov 2, 2017

done

        return t

    def list_tables(self, path=None):
        # tables are files in a dir
Member

Again this looks pretty similar, I think there's some opportunity to factor this code out.
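
One way the shared listing logic could be factored out, sketched under the assumption that databases are subdirectories and tables are files carrying the backend's extension. `_list_entries` and the two wrappers are hypothetical and simplified; they are not the code that ended up in the PR.

```python
import pathlib
import tempfile


def _list_entries(path, extension=None):
    """List subdirectory names, or stems of files with the given extension."""
    path = pathlib.Path(path)
    if extension is None:
        return sorted(d.name for d in path.iterdir() if d.is_dir())
    suffix = '.' + extension
    return sorted(
        d.stem for d in path.iterdir() if d.is_file() and d.suffix == suffix
    )


def list_databases(path):
    # databases are directories
    return _list_entries(path)


def list_tables(path, extension):
    # tables are files in a dir
    return _list_entries(path, extension)


# Build a throwaway layout to exercise both helpers.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / 'db1').mkdir()
    (root / 'prices.parquet').touch()
    (root / 'notes.txt').touch()
    found_dbs = list_databases(root)
    found_tables = list_tables(root, 'parquet')
```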

Contributor Author

fixed



@pre_execute.register(ParquetTable, ParquetClient)
def parquet_data_preload_uri_client(op, client, scope=None, **kwargs):
Member

See csv pre_execute comment

Contributor Author

see my comment above

@@ -0,0 +1,6 @@
try:
    import pathlib  # noqa
Member

Can you put the specific flake8 error code here?
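
A sketch of what a code-specific suppression might look like. `F401` is the pyflakes code for "module imported but unused", which is the usual warning for a re-export shim like this one. The diff hunk above is truncated, so the `except` branch with the `pathlib2` backport is an assumption added for illustration.

```python
try:
    # Imported only so other modules can import it from here.
    import pathlib  # noqa: F401
except ImportError:
    # Assumed fallback for Python 2: the pathlib2 backport package.
    import pathlib2 as pathlib  # noqa: F401
```

Naming the code keeps other warnings on the same line (for example a syntax or naming issue) from being silenced by accident.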

Contributor Author

I removed this file; it was only used in one place.

@@ -415,6 +415,32 @@ def valid_literal(self, value):
return isinstance(value, six.string_types + (datetime.datetime,))


class SignedInteger(Integer):
Member

@cpcloud cpcloud Nov 2, 2017

The Int8, Int16, Int32, and Int64 classes should inherit from this.
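
A minimal sketch of the hierarchy being requested, using empty placeholder classes rather than the real ibis datatype definitions. `UnsignedInteger` and `UInt8` are assumptions included only to show the contrast with the signed branch.

```python
class Integer(object):
    pass


class SignedInteger(Integer):
    pass


class UnsignedInteger(Integer):  # assumed counterpart, for contrast only
    pass


class Int8(SignedInteger):
    pass


class Int16(SignedInteger):
    pass


class Int32(SignedInteger):
    pass


class Int64(SignedInteger):
    pass


class UInt8(UnsignedInteger):  # assumed, for contrast only
    pass


signed_types = [Int8, Int16, Int32, Int64]
all_signed = all(issubclass(t, SignedInteger) for t in signed_types)
```

With this layout, code can test `issubclass(t, SignedInteger)` instead of enumerating the concrete widths.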

Contributor Author

done

@jreback
Contributor Author

jreback commented Nov 7, 2017

@cpcloud looks good to go

@jreback
Contributor Author

jreback commented Nov 14, 2017

@cpcloud this looks ready to merge.

@jreback
Contributor Author

jreback commented Nov 19, 2017

ping

pass


class HalffloatScalar(ScalarExpr, FloatValue):
Member

This needs to subclass from halffloat value not float value

Contributor Author

done

pass


class HalffloatColumn(NumericColumn, FloatValue):
Member

Same as previous comment

Contributor Author

done

ibis/file/csv.py Outdated
usecols = None

else:

Member

Can you remove some of this extra whitespace?

Contributor Author

fixed

if str(d).endswith(self.extension):
tables.append(d.stem)
elif path.is_file():
# by definition we are at the db level at this point
Member

Why is this true?

Contributor Author

For HDF5, a file is a database, and so is a directory.

Member

Should this be in the HDF client then?

Contributor Author

no, if you look above you requested that the common code be moved here.

Member

Right, I definitely remember that request, but this method is only used in one place and makes assumptions that are specific to HDF5 which is why I'm asking. This isn't a blocker though.

@cpcloud
Member

cpcloud commented Nov 20, 2017

Merging! Thanks @jreback!

@cpcloud cpcloud closed this Nov 20, 2017
@cpcloud cpcloud reopened this Nov 20, 2017
@cpcloud cpcloud closed this in 7b777fd Nov 20, 2017
Labels: feature (Features or general enhancements)
Linked issues: Parquet back end; HDF5 & CSV backends
4 participants