WIP/ENH: Writer #13
Conversation
Removing logging commands greatly increases read speed
Previously this had only been fixed for fixed-width dictionary pages
Full list taken from the parquet-thrift definitions. In addition, note that spark timestamps are not in the list at all and will require separate handling. Changed INT96 to return byte-strings rather than immediately attempting to convert to int, since INT96 almost always actually holds some converted type (e.g., a spark timestamp).
Not automatically applied, since spark uses a custom type not defined in parquet, detailed in the footer metadata rather than specified in the schema directly.
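For reference, a minimal sketch of what the INT96-to-timestamp conversion could look like under the usual Impala/Spark layout (8 bytes of little-endian nanoseconds-within-day followed by a 4-byte little-endian Julian day). The function name and input format here are illustrative, not part of this PR:

```python
import numpy as np

def int96_to_datetime64(raw_values):
    """Illustrative only: convert a list of raw 12-byte INT96 strings
    (as the reader now returns them) to numpy datetime64[ns], assuming
    the common Impala/Spark layout: nanoseconds within the day as a
    little-endian int64, then the Julian day as a little-endian int32."""
    dt = np.dtype([('nanos', '<i8'), ('jday', '<i4')])
    rec = np.frombuffer(b''.join(raw_values), dtype=dt)
    # 2440588 is the Julian day number of the Unix epoch (1970-01-01)
    days = rec['jday'].astype('i8') - 2440588
    return (days * 86_400_000_000_000 + rec['nanos']).view('M8[ns]')
```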
Read into numpy arrays
Uses numpy trickery
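To illustrate the kind of trickery meant here (a sketch, not the PR's actual code): plain-encoded fixed-width values can be viewed directly from the page buffer instead of being struct-unpacked one value at a time:

```python
import numpy as np

# Sketch: interpret the raw bytes of a plain-encoded data page directly
# as a typed array, avoiding a per-value struct.unpack loop.
# Function name and default dtype are illustrative.
def read_plain_values(page_bytes, count, dtype='<i8'):
    return np.frombuffer(page_bytes, dtype=dtype, count=count)
```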
Was using the incorrect counter, so the position in the array was not being updated. Also fixed the spark time mapper for the case where the byte-strings end in null bytes. This slows things down; may have to find a better solution (I don't know how many people will want to use spark timestamps).
Not tested yet, as I have no such data.
Uses some hard-coded absolute values which won't work for multiple columns or various data types.
Still doesn't work, but got further along the chain.
Still doesn't work... but implemented RLE/hybrid writing, required for definition levels (because most fields are 'optional', but we actually keep every value, so we need to generate a big array of 1s).
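For context, a sketch of what an RLE run encoder along the lines of `make_rle_string` (used in the diff below) might look like, following the parquet RLE/bit-packed hybrid spec; the helper names here are illustrative:

```python
import struct

def encode_varint(x):
    """ULEB128 varint, as used in the RLE/bit-packed hybrid header."""
    out = b''
    while True:
        byte = x & 0x7F
        x >>= 7
        if x:
            out += bytes([byte | 0x80])
        else:
            return out + bytes([byte])

def make_rle_string(count, value, bit_width=1):
    """Encode `count` repetitions of `value` as a single RLE run.
    For definition levels with a max level of 1 (every value present),
    bit_width is 1 and value is 1. In a v1 data page the encoded levels
    are preceded by their byte length as a 4-byte little-endian int."""
    header = encode_varint(count << 1)  # low bit 0 marks an RLE run
    body = header + value.to_bytes((bit_width + 7) // 8, 'little')
    return struct.pack('<i', len(body)) + body
```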
Were slowing things down and not helping.
Awesome! Will try to take a look this weekend.
... or anybody else out there?
How are you supposed to use it? I'm a newbie. I tried:
```python
typcode = 5
start = fo.tell()
rle_string = make_rle_string(nrows, 1)
fmd.schema.append(SchemaElement(type=typcode, name=col.encode(), repetition_type=1))
```
You'll get a `NameError` here if `typ not in ('int64', 'int32', 'float64')`.
Correct, only int32/int64 and float64 are supported. When I say supported: I couldn't get anything to read the parquet files created except the reader right here... which is why this is WIP, and not recommended for newbies. The Apache feather/arrow work shows promise for providing parquet support for pandas; I stopped trying here because of the forthcoming work described at http://wesmckinney.com/blog/pandas-and-apache-arrow/
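A small sketch of how the dtype dispatch could fail loudly instead of with a `NameError` (type codes from the parquet-thrift `Type` enum, matching `typcode = 5` for DOUBLE in the diff above: INT32 = 1, INT64 = 2, DOUBLE = 5; the helper name is hypothetical):

```python
# Hypothetical helper: explicit dtype -> parquet-thrift type-code map,
# raising a clear error for unsupported dtypes instead of a NameError.
TYPE_CODES = {'int32': 1, 'int64': 2, 'float64': 5}

def type_code_for(typ):
    try:
        return TYPE_CODES[typ]
    except KeyError:
        raise ValueError("dtype %r is not supported by the writer" % typ)
```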
Thanks again for your work @martindurant. I haven't had a chance to debug the write path, but I'm certainly interested in trying to. The main reason I haven't tried to integrate this code is that I believe there's value in a pure python implementation with few dependencies. For instance, folks tend to use the avro python tools for quick data checks locally. For high-performance/production use, Wes' arrow + cpp bindings will likely be the best bet... and will integrate with pandas/numpy AFAICT. I'm also interested in pandas/numpy bindings, but not as part of the core implementation (not sure how much code reuse there could be...). I'm going to close this PR for now so that folks have a better sense of viability. Hopefully someone will come along at some point and rebase on top of all the API refactors. I'll certainly try to do so if I ever have some spare cycles!
I don't know if you saw this, but I attempted to make a parquet writer. It works well enough to make parquet files which the same reader can read. The files cannot be read by true parquet tools (e.g., spark...), however. Any thoughts on what is missing?