
WIP/ENH: Writer #13

Closed
wants to merge 22 commits into from
Conversation

martindurant

I don't know if you saw this, but I attempted to make a parquet writer. It works well enough to produce parquet files that this same reader can read back. The files cannot be read by true parquet tools (e.g., spark...), however. Any thoughts on what is missing?

Martin Durant and others added 22 commits November 6, 2015 11:39
- Removing logging commands greatly increases read speed.
- Previously had only been fixed for fixed-width dictionary pages.
- Full list from parquet-thrift definitions. In addition, recognize that spark timestamps are not even in the list and will require separate handling. Changed INT96 to return bytestrings rather than immediately attempting to convert to int, since it almost always actually holds some converted type (e.g., spark timestamp).
- Not automatically applied, since spark uses a custom type not defined in parquet, detailed in the footer metadata rather than specified in the schema directly.
- Read into numpy arrays.
- Uses numpy trickery.
- Was using the incorrect counter, not updating position in the array. Also fixed the spark time mapper for the case where the byte-strings end in null bytes. This slows things down; may have to find a better solution (don't know how many people will want to use spark timestamps).
- Not tested yet, having no such data.
- Uses some absolute values which won't work for multiple columns or various data types.
- Still doesn't work, but got further along the chain.
- Still doesn't work... but implemented RLE/hybrid writing, required for definition levels (because most fields are 'optional', but we actually keep every value, so need to generate a big array of 1s); see the sketch after this commit list.
- Were slowing things down and not helping.
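For readers following along: the RLE/hybrid writing mentioned above refers to parquet's RLE/bit-packed hybrid encoding, which the writer needs for definition levels. Below is a minimal sketch assuming the common case of max definition level 1 (bit width 1) with every value present, so the levels are nrows copies of 1; the function names are illustrative, not the PR's code.

import struct

def uleb128(value):
    # Unsigned varint used for the run header.
    out = b''
    while True:
        byte = value & 0x7F
        value >>= 7
        out += struct.pack('B', byte | (0x80 if value else 0))
        if not value:
            return out

def rle_run_of_ones(nrows):
    # One RLE run: header = (run length << 1), LSB 0 marking an RLE run
    # (LSB 1 would mean a bit-packed run), then the repeated value stored
    # in ceil(bit_width / 8) = 1 byte.
    return uleb128(nrows << 1) + struct.pack('B', 1)

As far as I understand the format, in a v1 data page the definition-level bytes are additionally preceded by their total length as a 4-byte little-endian int.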
@martindurant martindurant changed the title Writer WIP/ENH: Writer Mar 29, 2016
@jcrobak
Owner

jcrobak commented Mar 29, 2016

Awesome! Will try to take a look this weekend.

@martindurant
Author

@jcrobak have you had a look through what I did here?
You may want to take out the vectorised reading and py3 adaptation (see #15) to consider merging separately, and leave only the writing stuff in this WIP.

@martindurant
Author

... or anybody else out there?
I don't think my code is far off being a real and fast parquet writer for simple column types; can any experts tell me what I'm doing wrong?

@peterbe peterbe mentioned this pull request May 31, 2016
@peterbe

peterbe commented Jun 3, 2016

How are you supposed to use it? I'm a newbie. I tried:

>>> grouped = pd.DataFrame([{'foo':'A', 'bar':'B'}, {'foo':'C', 'bar':'D'}], columns=['foo', 'bar'])
>>> df_to_parquet(grouped, '/tmp/foo.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "parquet/writer.py", line 42, in df_to_parquet
    rle_string = make_rle_string(nrows, 1)
  File "parquet/writer.py", line 87, in make_rle_string
    header_bytes = make_unsigned_var_int(header)
  File "parquet/writer.py", line 82, in make_unsigned_var_int
    return bit + result.to_bytes(1, 'little')
AttributeError: 'int' object has no attribute 'to_bytes'
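For what it's worth, the AttributeError above is a Python 2 issue: int.to_bytes only exists on Python 3. A version-agnostic sketch of the unsigned varint (ULEB128) encoder that make_unsigned_var_int appears to implement could look like this (illustrative, not the PR's code):

import struct

def make_unsigned_var_int(value):
    # ULEB128-encode a non-negative int; works on Python 2 and 3 because it
    # avoids the Python-3-only int.to_bytes call that raises above.
    out = b''
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out += struct.pack('B', byte | 0x80)  # high bit set: more bytes follow
        else:
            return out + struct.pack('B', byte)   # final byte, high bit clear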

typcode = 5
start = fo.tell()
rle_string = make_rle_string(nrows, 1)
fmd.schema.append(SchemaElement(type=typcode, name=col.encode(), repetition_type=1))

You'll get a NameError here if typ not in ('int64','int32','float64').
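A defensive sketch (not from the PR): resolving the parquet type code from the column dtype up front, and failing loudly for anything unsupported, would turn that NameError into a clearer error. The numeric codes below follow the parquet-thrift Type enum (INT32 = 1, INT64 = 2, DOUBLE = 5):

# Map of the supported pandas dtype names to parquet-thrift Type enum values.
PARQUET_TYPE_CODES = {'int32': 1, 'int64': 2, 'float64': 5}

def type_code_for(dtype_name):
    # Look up typcode before any bytes are written, so unsupported dtypes
    # raise a clear ValueError instead of a NameError further down.
    try:
        return PARQUET_TYPE_CODES[dtype_name]
    except KeyError:
        raise ValueError('unsupported column dtype: %r' % dtype_name)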

@martindurant
Author

martindurant commented Jun 3, 2016

Correct, only int32/int64 and float64 are supported. And when I say supported, I couldn't get anything to read the parquet files created except the reader right here... which is why this is WIP, and not recommended for newbies. Apache feather/arrow shows promise for providing parquet support for pandas; I stopped trying here because of the forthcoming work described in http://wesmckinney.com/blog/pandas-and-apache-arrow/

@jcrobak
Owner

jcrobak commented Jun 26, 2016

Thanks again for your work @martindurant. I haven't had a chance to debug the write path, but I'm certainly interested in trying to.

The main reason I haven't tried to integrate this code is that I believe there's value in a pure python implementation with few dependencies. For instance, folks tend to use the avro python tools for quick data checks locally. A pip install takes a few seconds and the tool can be used to dump small files. I think that's the primary value of this tool in its current shape, since it's far from production-ready.

For high-performance/production use, Wes' arrow + cpp bindings will likely be the best bet... and will integrate with pandas/numpy AFAICT. I'm also interested in pandas/numpy bindings, but not as part of the core implementation (not sure how much code reuse there could be...).

I'm going to close this PR for now so that folks have a better sense of viability. Hopefully someone will come along at some point and rebase on top of all the api refactors. I'll certainly try to do so if I ever have some spare cycles!

@jcrobak jcrobak closed this Jun 26, 2016
@jcrobak jcrobak mentioned this pull request Sep 5, 2016
@martindurant martindurant deleted the writer branch November 4, 2016 00:09