
WIP/ENH: Writer #13

Closed
wants to merge 22 commits into from
Conversation

martindurant

I don't know if you saw this, but I attempted to make a parquet writer. It works well enough to produce parquet files that this same reader can read back. The files cannot be read by true parquet tools (e.g., spark...), however. Any thoughts on what is missing?

Martin Durant and others added 22 commits November 6, 2015 11:39
- Removing logging commands greatly increases read speed.
- Previously had only been fixed for fixed-width dictionary pages.
- Full list from parquet-thrift definitions. In addition, recognize that spark timestamps are not even in the list and will require separate handling. Changed INT96 to return bytestrings rather than immediately attempting to convert to int, since it almost always actually holds some converted type (e.g., spark timestamp).
- Not automatically applied, since spark uses a custom type not defined in parquet, detailed in the footer metadata rather than specified in the schema directly.
- Read into numpy arrays.
- Uses numpy trickery.
- Was using the incorrect counter, not updating position in the array. Also fixed the spark time mapper for the case where the byte-strings end in null bytes. This slows things down; may have to find a better solution (don't know how many people will want to use spark timestamps).
- Not tested yet, having no such data.
- Uses some absolute values which won't work for multiple columns or various data types.
- Still doesn't work, but got further along the chain.
- Still doesn't work... but implemented RLE/hybrid writing, required for definition levels (because most fields are 'optional', but we actually keep every value, so need to generate a big array of 1s); see the sketch after this commit list.
- Were slowing things down and not helping.
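For readers following along: the RLE/hybrid writing mentioned above refers to parquet's RLE/bit-packed hybrid encoding, which the writer needs for definition levels. Below is a minimal sketch assuming the common case of max definition level 1 (bit width 1) with every value present, so the levels are nrows copies of 1; the function names are illustrative, not the PR's code.

import struct

def uleb128(value):
    # Unsigned varint used for the run header.
    out = b''
    while True:
        byte = value & 0x7F
        value >>= 7
        out += struct.pack('B', byte | (0x80 if value else 0))
        if not value:
            return out

def rle_run_of_ones(nrows):
    # One RLE run: header = (run length << 1), LSB 0 marking an RLE run
    # (LSB 1 would mean a bit-packed run), then the repeated value stored
    # in ceil(bit_width / 8) = 1 byte.
    return uleb128(nrows << 1) + struct.pack('B', 1)

As far as I understand the format, in a v1 data page the definition-level bytes are additionally preceded by their total length as a 4-byte little-endian int.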
@martindurant martindurant changed the title Writer WIP/ENH: Writer Mar 29, 2016
@jcrobak
Owner

jcrobak commented Mar 29, 2016

Awesome! Will try to take a look this weekend.

@martindurant
Author

@jcrobak have you had a look through what I did here?
You may want to take out the vectorised reading and py3 adaptation (see #15) to consider merging separately, and leave only the writing stuff in this WIP.

@martindurant
Author

... or anybody else out there?
I don't think my code is far off being a real and fast parquet writer for simple column types; can any experts tell me what I'm doing wrong?

@peterbe peterbe mentioned this pull request May 31, 2016
@peterbe

peterbe commented Jun 3, 2016

How are you supposed to use it? I'm a newbie. I tried:

>>> grouped = pd.DataFrame([{'foo':'A', 'bar':'B'}, {'foo':'C', 'bar':'D'}], columns=['foo', 'bar'])
>>> df_to_parquet(grouped, '/tmp/foo.parquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "parquet/writer.py", line 42, in df_to_parquet
    rle_string = make_rle_string(nrows, 1)
  File "parquet/writer.py", line 87, in make_rle_string
    header_bytes = make_unsigned_var_int(header)
  File "parquet/writer.py", line 82, in make_unsigned_var_int
    return bit + result.to_bytes(1, 'little')
AttributeError: 'int' object has no attribute 'to_bytes'
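For what it's worth, the AttributeError above is a Python 2 issue: int.to_bytes only exists on Python 3. A version-agnostic sketch of the unsigned varint (ULEB128) encoder that make_unsigned_var_int appears to implement could look like this (illustrative, not the PR's code):

import struct

def make_unsigned_var_int(value):
    # ULEB128-encode a non-negative int; works on Python 2 and 3 because it
    # avoids the Python-3-only int.to_bytes call that raises above.
    out = b''
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out += struct.pack('B', byte | 0x80)  # high bit set: more bytes follow
        else:
            return out + struct.pack('B', byte)   # final byte, high bit clear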

typcode = 5
start = fo.tell()
rle_string = make_rle_string(nrows, 1)
fmd.schema.append(SchemaElement(type=typcode, name=col.encode(), repetition_type=1))

You'll get a NameError here if typ not in ('int64','int32','float64').
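A defensive sketch (not from the PR): resolving the parquet type code from the column dtype up front, and failing loudly for anything unsupported, would turn that NameError into a clearer error. The numeric codes below follow the parquet-thrift Type enum (INT32 = 1, INT64 = 2, DOUBLE = 5):

# Map of the supported pandas dtype names to parquet-thrift Type enum values.
PARQUET_TYPE_CODES = {'int32': 1, 'int64': 2, 'float64': 5}

def type_code_for(dtype_name):
    # Look up typcode before any bytes are written, so unsupported dtypes
    # raise a clear ValueError instead of a NameError further down.
    try:
        return PARQUET_TYPE_CODES[dtype_name]
    except KeyError:
        raise ValueError('unsupported column dtype: %r' % dtype_name)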

@martindurant
Author

martindurant commented Jun 3, 2016

Correct, only int32/int64 and float64 are supported. And when I say supported, I couldn't get anything to read the parquet files created except the reader right here... which is why this is WIP, and not recommended for newbies. Apache feather/arrow shows promise for providing parquet support for pandas; I stopped trying here because of the forthcoming work described in http://wesmckinney.com/blog/pandas-and-apache-arrow/

@jcrobak
Owner

jcrobak commented Jun 26, 2016

Thanks again for your work @martindurant. I haven't had a chance to debug the write path, but I'm certainly interested in trying to.

The main reason I haven't tried to integrate this code is that I believe there's value in a pure python implementation with few dependencies. For instance, folks tend to use the avro python tools for quick data checks locally. A pip install takes a few seconds and the tool can be used to dump small files. I think that's the primary value of this tool in its current shape, since it's far from production-ready.

For high-performance/production use, Wes' arrow + cpp bindings will likely be the best bet... and will integrate with pandas/numpy AFAICT. I'm also interested in pandas/numpy bindings, but not as part of the core implementation (not sure how much code reuse there could be...).

I'm going to close this PR for now so that folks have a better sense of viability. Hopefully someone will come along at some point and rebase on top of all the api refactors. I'll certainly try to do so if I ever have some spare cycles!

@jcrobak jcrobak closed this Jun 26, 2016
@jcrobak jcrobak mentioned this pull request Sep 5, 2016
@martindurant martindurant deleted the writer branch November 4, 2016 00:09