Support for GZ format #118

Closed
hpoit opened this issue May 11, 2017 · 8 comments

Comments

@hpoit
Contributor

hpoit commented May 11, 2017

I'd like to let you know that I am working on making Libz compatible with FileIO.

@hpoit hpoit changed the title Support for `load("*.csv.gz") with Libz Support for load("*.csv.gz") with Libz May 11, 2017
@hpoit hpoit changed the title Support for load("*.csv.gz") with Libz Support for GZ file extension with Libz May 11, 2017
@hpoit
Contributor Author

hpoit commented May 12, 2017

Understanding the GZ format specification (https://tools.ietf.org/html/rfc1952#page-6) and turning it into Julia for the registry is a very painful process.

https://gist.github.com/hpoit/007ace3657b94fffd68530464a7d5ad9

If anyone knows of existing code that already does this, please let me know! Otherwise I'll keep working on my gist.

About the magic number, I'm not sure whether this is true:
"There is a magic number at the beginning of the file. Just read the first two bytes and check if they are equal to 0x1f8b."

From the spec
"ID2 (IDentification 2)
These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139
(0x8b, \213), to identify the file as being in gzip format."
I believe this is what I have to look for in io to detect GZ.
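
For reference, here is a rough sketch of how the fixed part of the header described in RFC 1952 could be read in Julia. The names GzipHeader and read_gzip_header are made up for illustration and are not part of Libz or FileIO. It assumes Julia 1.x and only covers the first 10 bytes (ID1, ID2, CM, FLG, MTIME, XFL, OS), not the optional fields that may follow.

```julia
# Hypothetical sketch: parse the fixed 10-byte gzip member header (RFC 1952).
# GzipHeader and read_gzip_header are illustrative names, not an existing API.
struct GzipHeader
    cm::UInt8      # compression method; 8 means deflate
    flg::UInt8     # flag bits (FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT)
    mtime::UInt32  # modification time in seconds since the Unix epoch
    xfl::UInt8     # extra flags
    os::UInt8      # operating system on which compression took place
end

# Returns a GzipHeader, or `nothing` if the stream is too short
# or does not start with ID1 = 0x1f, ID2 = 0x8b.
function read_gzip_header(io::IO)
    bytes = read(io, 10)                      # reads at most 10 bytes
    length(bytes) == 10 || return nothing
    (bytes[1] == 0x1f && bytes[2] == 0x8b) || return nothing
    # MTIME is stored least-significant byte first.
    mtime = UInt32(bytes[5]) |
            (UInt32(bytes[6]) << 8) |
            (UInt32(bytes[7]) << 16) |
            (UInt32(bytes[8]) << 24)
    return GzipHeader(bytes[3], bytes[4], mtime, bytes[9], bytes[10])
end
```

For detection alone, only the ID1 and ID2 check matters; the remaining fields are shown just to make the layout in the spec concrete.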

@hpoit
Contributor Author

hpoit commented May 12, 2017

Wait, what?

julia> using DataFrames

julia> airline_df = readtable("/Users/randyzwitch/airline/1987.csv.gz");

julia> size(airline_df)
(1311826,29)

julia> typeof(airline_df)
DataFrame  (use methods(DataFrame) to see constructors)

I think I just have to see how readtable identifies it, if it does.

@hpoit
Contributor Author

hpoit commented May 12, 2017

    # (2) Path is GZip file
    elseif endswith(pathname, ".gz")
        io = gzopen(pathname, "r")
        nbytes = 2 * filesize(pathname)

https://github.com/JuliaStats/DataFrames.jl/blob/19821b9060a691646114c064e1bc0b8e290c8f66/src/dataframe/io.jl#L932-L935

Hi @SimonDanisch, would this be enough for the registry to identify a file as GZIP?

@hpoit
Contributor Author

hpoit commented May 13, 2017

@SimonDanisch don't worry about it. I actually tried julia> df = readtable("/file.csv.gz") and it takes forever. Back to binaries.

@hpoit
Contributor Author

hpoit commented May 13, 2017

Okay, I think I've got the right question now: how do I read the first two bytes of a file and check whether they are equal to 0x1f and 0x8b? That's what I'm trying to answer.

If the first two bytes are equal to those, then the file is in GZIP format.
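
A minimal sketch of that check, assuming Julia 1.x; has_gzip_magic and is_gzip_file are hypothetical helper names, not functions from FileIO or Libz. It compares the two bytes one at a time instead of reading a single UInt16, since on a little-endian machine read(io, UInt16) would yield 0x8b1f rather than 0x1f8b.

```julia
# Hypothetical helpers: detect the gzip magic bytes 0x1f 0x8b.
const GZIP_MAGIC = (0x1f, 0x8b)

# Works on any readable IO and consumes up to two bytes from it.
function has_gzip_magic(io::IO)
    eof(io) && return false
    id1 = read(io, UInt8)
    eof(io) && return false
    id2 = read(io, UInt8)
    return (id1, id2) == GZIP_MAGIC
end

# Convenience method for a path; `open(f, path)` closes the file afterwards.
is_gzip_file(path::AbstractString) = open(has_gzip_magic, path)
```

With this, has_gzip_magic(IOBuffer(UInt8[0x1f, 0x8b, 0x08])) returns true, while a stream that is empty, one byte long, or has the bytes in the other order returns false.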

@hpoit
Contributor Author

hpoit commented May 15, 2017

Hi @SimonDanisch, can I PR this? Is it correct?
add_format(format"GZIP", [0x1f, 0x8b], ".gz", [:Libz])
Let me test it.

@hpoit
Contributor Author

hpoit commented May 15, 2017

Hi @SimonDanisch, here is the test run; I'm not sure whether I passed it.

julia> Pkg.test("FileIO")
INFO: Testing FileIO
these tests will print warnings: 
Library "NotInstalled" is not installed but is recommended as a library to load format: ".not_installed"
Should we install "NotInstalled" for you? (y/n):
Library "NotInstalled" is not installed but is recommended as a library to load format: ".not_installed"
Should we install "NotInstalled" for you? (y/n):
invalid is not a valid choice. Try typing y or n
Should we install "NotInstalled" for you? (y/n):
this test will print warnings: 
Errors encountered while loading "test.multierr".
All errors:
   ErrorException("1")
   ErrorException("2")
Fatal error:
Test Summary: | Pass  Total
  FileIO      |  108    108
INFO: FileIO tests passed

julia> 

Should I just open a PR so you can see what I've done?

@hpoit hpoit changed the title Support for GZ file extension with Libz Support for GZ format May 16, 2017
@timholy
Member

timholy commented Mar 3, 2021

A PR would be great!

@timholy timholy closed this as completed Mar 3, 2021