COMPAT: pyarrow >= 0.7.0 compat #17588

Merged
merged 1 commit into from
Sep 19, 2017

Conversation

jreback
Contributor

@jreback jreback commented Sep 19, 2017

closes #17581

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions IO Parquet parquet, feather labels Sep 19, 2017
@jreback jreback added this to the 0.21.0 milestone Sep 19, 2017
@codecov

codecov bot commented Sep 19, 2017

Codecov Report

Merging #17588 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #17588      +/-   ##
==========================================
- Coverage   91.22%    91.2%   -0.02%     
==========================================
  Files         163      163              
  Lines       49625    49625              
==========================================
- Hits        45270    45261       -9     
- Misses       4355     4364       +9

Flag         Coverage Δ
#multiple    88.99% <ø> (ø) ⬆️
#single      40.19% <ø> (-0.07%) ⬇️

Impacted Files          Coverage Δ
pandas/io/gbq.py        25% <0%> (-58.34%) ⬇️
pandas/core/frame.py    97.77% <0%> (-0.1%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0e85ca7...5bc3def.

@jreback
Contributor Author

jreback commented Sep 19, 2017

In [13]: import pyarrow

In [14]: pyarrow.__version__
Out[14]: '0.7.0'

In [15]: df = pd.DataFrame({'a': pd.Categorical(list('abc'))})

In [16]: df.dtypes
Out[16]: 
a    category
dtype: object

In [17]: df.to_parquet('foo.pq', engine='pyarrow')

In [18]: pd.read_parquet('foo.pq', engine='pyarrow')
Out[18]: 
   a
0  a
1  b
2  c

In [19]: pd.read_parquet('foo.pq', engine='pyarrow').dtypes
Out[19]: 
a    object
dtype: object

@wesm de-serializing does not seem to preserve the categorical dtype?

@wesm
Member

wesm commented Sep 19, 2017

Correct, it comes back as non-categorical. Parquet does not have a categorical type. We can try to emulate it as best we can (not foolproof; when dictionaries grow too big, encoding is turned off) but it will take some more work in parquet-cpp.
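
To illustrate the fallback mentioned above, here is a minimal pure-Python sketch of dictionary encoding that gives up once the dictionary grows past a limit. The function name `encode_chunk`, the `max_dict_size` parameter, and the returned dict layout are all hypothetical and are not parquet-cpp APIs. Real writers typically bound the dictionary by size in bytes rather than by entry count, and they can switch encoding partway through a chunk rather than re-encoding it whole, so treat this only as a simplified model.

```python
def encode_chunk(values, max_dict_size=4):
    """Dictionary-encode values, or fall back to plain if too many are distinct.

    Hypothetical illustration only; not how parquet-cpp is implemented.
    """
    dictionary = []   # distinct values in first-seen order
    positions = {}    # value -> index into dictionary
    codes = []
    for value in values:
        if value not in positions:
            if len(dictionary) >= max_dict_size:
                # Dictionary would grow too big: store the raw values instead.
                return {"encoding": "plain", "values": list(values)}
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return {"encoding": "dictionary", "dictionary": dictionary, "codes": codes}


small = encode_chunk(list("abcab"))    # 3 distinct values, fits the limit
large = encode_chunk(list("abcdefg"))  # 7 distinct values, exceeds the limit
print(small["encoding"], large["encoding"])
```

Once a chunk has fallen back to plain values there is no dictionary left to read the categories from, which is why emulating a categorical type on top of this encoding is not foolproof.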

@jreback
Contributor Author

jreback commented Sep 19, 2017

No need to emulate; I will make a note in the docs.

@jreback jreback merged commit 6630c4e into pandas-dev:master Sep 19, 2017
@wesm
Member

wesm commented Sep 19, 2017

We hope to support reading columns directly as categorical soon, using the dictionary page if there is one, but that has some edge cases:

  • Dictionaries generally will be different from file to file
  • Dictionaries may be different within each row group
  • A column chunk may switch to plain encoding mid-stream (if the dictionary got too big)
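
The edge cases above can be sketched in plain Python. The example below assumes a hypothetical in-memory representation of column chunks (dicts with an `encoding` key) and a hypothetical helper `unify_chunks`; neither is part of pyarrow or parquet-cpp. It shows one possible way to merge chunks whose dictionaries differ, including a chunk that carries plain values, into a single category list with consistent codes.

```python
def unify_chunks(chunks):
    """Merge per-chunk dictionaries into one category list with global codes.

    Hypothetical illustration; the chunk layout is an assumption made here.
    """
    categories = []   # unified categories in first-seen order
    positions = {}    # value -> index into categories

    def global_code(value):
        if value not in positions:
            positions[value] = len(categories)
            categories.append(value)
        return positions[value]

    codes = []
    for chunk in chunks:
        if chunk["encoding"] == "dictionary":
            # Map each local dictionary slot to its global code, then remap.
            remap = [global_code(v) for v in chunk["dictionary"]]
            codes.extend(remap[c] for c in chunk["codes"])
        else:
            # Plain chunk: no dictionary, so look up every value directly.
            codes.extend(global_code(v) for v in chunk["values"])
    return categories, codes


chunks = [
    {"encoding": "dictionary", "dictionary": ["a", "b"], "codes": [0, 1, 0]},
    {"encoding": "dictionary", "dictionary": ["b", "c"], "codes": [1, 0]},
    {"encoding": "plain", "values": ["d", "a"]},
]
categories, codes = unify_chunks(chunks)
print(categories, codes)
```

The same local code can mean different values in different chunks (code 0 is "a" in the first chunk and "b" in the second), so a reader cannot simply concatenate codes; it has to remap them against a unified dictionary as done here.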

alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
This pull request was closed.
Development

Successfully merging this pull request may close these issues.

TST/CI: PyArrow Test Failures