Running query over a dataframe read by read_parquet errors out with Modin 0.5.1 #642

Closed
ddutt opened this issue May 29, 2019 · 27 comments

@ddutt (Contributor) commented May 29, 2019

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 19.04
  • Modin installed from (source or binary): pip install modin
  • Modin version: 0.5.1
  • Python version: 3.7.3
  • Exact command to reproduce:
    final_df = pd.read_parquet(folder, columns=fields, filters=filters or None).query(query_str)
    where query_str was of the form "col1 == ['bar']"

Describe the problem

Executing the code snippet above errors out under Modin with the traceback shown below; the same code runs without error in plain pandas. The issue occurs with Modin 0.5.1 and Ray 0.7.
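
For context, a self-contained reproduction sketch, under the assumption that the trigger is a partitioned Parquet dataset (all paths, columns, and values below are hypothetical):

# Hypothetical reproduction: write a small partitioned Parquet dataset
# with plain pandas, then read and query it with Modin.
import pandas
import modin.pandas as pd

source = pandas.DataFrame({
    "col1": ["bar", "baz", "bar", "qux"],
    "col2": [1, 2, 3, 4],
    "part": ["a", "a", "b", "b"],  # becomes a partition column on disk
})
source.to_parquet("/tmp/repro_642", partition_cols=["part"])

final_df = pd.read_parquet("/tmp/repro_642", columns=["col1", "col2"])
print(final_df.query("col1 == ['bar']"))  # raises ValueError under Modin 0.5.1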

Source code / logs

File "/home/ddutt/work/a/engines/modin/engine.py", line 91, in get_table_df
.query(query_str)
File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/modin/pandas/dataframe.py", line 1260, in query
new_query_compiler = self._query_compiler.query(expr, **kwargs)
File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/modin/backends/pandas/query_compiler.py", line 1669, in query
new_index = self.compute_index(0, new_data, True)
File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/modin/backends/pandas/query_compiler.py", line 93, in compute_index
old_blocks=old_blocks,
File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/modin/engines/base/frame/partition_manager.py", line 535, in get_indices
if len(self._partitions_cache.T)
File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/modin/engines/base/frame/partition_manager.py", line 534, in <listcomp>
[idx.apply(func).get() for idx in self._partitions_cache.T[0]]
File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/modin/engines/ray/pandas_on_ray/frame/partition.py", line 34, in get
handle_ray_task_error(e)
File "/home/ddutt/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/modin/engines/ray/utils.py", line 13, in handle_ray_task_error
raise getattr(builtins, s.split(":")[0])("".join(s.split(":")[1:]))
ValueError: Length mismatch: Expected axis has 28 elements, new values have 10 elements

@devin-petersohn (Member) commented May 29, 2019

Thanks @ddutt for the report!

We're looking into it; others have reported running into a related issue as well. We will get this fixed as soon as possible.

@devin-petersohn (Member) commented May 29, 2019

@ddutt Would you be able to tell me if installing this way fixes your issue?

pip install git+https://github.com/devin-petersohn/modin@issues/642

@ddutt (Contributor, Author) commented May 29, 2019

The fix breaks support for predicate pushdown.

@devin-petersohn (Member) commented May 29, 2019

I tried running pandas.read_parquet with filters=None, but I get this error in pandas:

In [2]: df = pandas.read_parquet("big.parquet", columns=["col1", "col2"], filters=None)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-51e112ad24f2> in <module>
----> 1 df = pandas.read_parquet("big.parquet", columns=["col1", "col2"], filters=None)

~/.local/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    280
    281     impl = get_engine(engine)
--> 282     return impl.read(path, columns=columns, **kwargs)

~/.local/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    127         kwargs['use_pandas_metadata'] = True
    128         result = self.api.parquet.read_table(path, columns=columns,
--> 129                                              **kwargs).to_pandas()
    130         if should_close:
    131             try:

TypeError: read_table() got an unexpected keyword argument 'filters'

@ddutt (Contributor, Author) commented May 29, 2019

Exactly. Predicate pushdown is only supported with ParquetDataset().
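
A minimal sketch of what that looks like with pyarrow directly (dataset path, columns, and filter values are hypothetical); ParquetDataset accepts a filters argument that prunes partitions before any data is read, while pyarrow's read_table at the time did not:

import pyarrow.parquet as pq

# Only ParquetDataset supports predicate pushdown via `filters`;
# matching partitions are selected before any data is read.
dataset = pq.ParquetDataset(
    "/data/events",                # hypothetical partitioned dataset
    filters=[("part", "=", "a")],  # keep only partitions where part == "a"
)
df = dataset.read(columns=["col1", "col2"]).to_pandas()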

@ddutt (Contributor, Author) commented May 29, 2019

Also, setting use_pandas_metadata to True in read() didn't remove the original error.

@ddutt (Contributor, Author) commented May 29, 2019

I'm sure you know this already, but just in case: the fix to use ParquetDataset along with @williamma12's patches worked in 0.5.0. Something else has changed that's causing this issue.

@devin-petersohn (Member) commented May 29, 2019

So in order to support this predicate pushdown, we should add an experimental API. Anything that deviates from the pandas API should go there, because it is important that users understand when they can move to/from Modin: if someone uses the pandas API in Modin, it should also work in pandas. This is a little beside the point, though.

For internal use, predicate pushdown is important for the query-planning use case. We should definitely fix it, but it will have to go through the experimental API because we don't want it to be impossible to move to/from pandas.

Related questions to the bug:

  • Does a print before the query work? It may be an error during reading.
  • Does the vectorized version df[df["col1"].isin(["bar"])] work for these? (See the sketch below.)
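
For reference, a small sketch of the two equivalent filters (DataFrame contents assumed for illustration); in pandas query syntax, comparing a column to a list with == behaves like isin:

import modin.pandas as pd

df = pd.DataFrame({"col1": ["bar", "baz", "bar"], "col2": [1, 2, 3]})

via_query = df.query("col1 == ['bar']")  # list comparison acts like isin
via_mask = df[df["col1"].isin(["bar"])]  # vectorized equivalent mask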

@devin-petersohn (Member) commented May 29, 2019

Also, will it work if you remove filters from the read_parquet args?

@devin-petersohn (Member) commented May 29, 2019

Alright, I re-added the predicate pushdown functionality in modin.experimental.pandas in the branch above. We have done this before for read_sql, and it worked pretty well for that user.

The only difference is the import:

import modin.experimental.pandas as pd
...

This doesn't fix the issue, but if you can install from the branch with pip install git+https://github.com/devin-petersohn/modin@issues/642 and then try to print the dataset after loading, that would help a lot.
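
A hedged sketch of what using the experimental branch might look like (dataset path, columns, and filter values are hypothetical):

import modin.experimental.pandas as pd

# The experimental namespace keeps the pandas API but adds extensions
# such as the `filters` keyword on read_parquet (values below made up).
df = pd.read_parquet(
    "/data/events",
    columns=["col1", "col2"],
    filters=[("part", "=", "a")],  # predicate pushdown: prune partitions
)
print(df.query("col1 == ['bar']"))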

Also, I have been working with @williamma12 to reproduce the issue you saw, but we haven't been able to yet.

@ddutt (Contributor, Author) commented May 29, 2019

The change didn't fix the problem. I get this error now:

In [3]: import modin.pandas as pd
2019-05-29 13:13:03,132 WARNING worker.py:1341 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
2019-05-29 13:13:03,133 INFO node.py:497 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-29_13-13-03_133142_24822/logs.
2019-05-29 13:13:03,239 INFO services.py:409 -- Waiting for redis server at 127.0.0.1:56705 to respond...
2019-05-29 13:13:03,369 INFO services.py:409 -- Waiting for redis server at 127.0.0.1:27748 to respond...
2019-05-29 13:13:03,375 INFO services.py:806 -- Starting Redis shard with 3.31 GB max memory.
2019-05-29 13:13:03,412 INFO node.py:511 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-29_13-13-03_133142_24822/logs.
2019-05-29 13:13:03,413 WARNING services.py:1318 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 6854029312 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
2019-05-29 13:13:03,413 INFO services.py:1441 -- Starting the Plasma object store with 9.0 GB memory using /tmp.

In [4]: df = pd.read_parquet('/home/ddutt/work/parquet-out/foo/')                                                                                                                                              
                                     
In [5]: df                          
Out[5]: ---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)                     
~/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)                                                                                                                                                
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)                                                                                                 
    400                         if cls is not object \
    401                                 and callable(cls.__dict__.get('__repr__')):
--> 402                             return _repr_pprint(obj, self, cycle)
    403
    404             return _default_pprint(obj, self, cycle)

~/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)                                                                                       
    695     """A pprint that just redirects to the normal repr function."""
    696     # Find newlines and replace them with p.break_()
--> 697     output = repr(obj)
    698     for idx,output_line in enumerate(output.splitlines()):
    699         if idx:

~/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/modin/pandas/dataframe.py in __repr__(self)                                                                                                
     91         num_cols = pandas.get_option("max_columns") or 20
     92
---> 93         result = repr(self._build_repr_df(num_rows, num_cols))
     94         if len(self.index) > num_rows or len(self.columns) > num_cols:
     95             # The split here is so that we don't repr pandas row lengths.

~/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/modin/pandas/base.py in _build_repr_df(self, num_rows, num_cols)                                                                           
     59
     60         if not hasattr(self, "columns") or len(self.columns) <= num_cols:
---> 61             head_front = head.to_pandas()
     62             # Creating these empty to make the concat logic simpler
     63             head_back = pandas.DataFrame()

~/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/modin/backends/pandas/query_compiler.py in to_pandas(self)                                                                                 
    485         else:
    486             ErrorMessage.catch_bugs_and_request_email(
--> 487                 len(df.index) != len(self.index) or len(df.columns) != len(self.columns)
    488             )
    489             df.index = self.index

~/.local/share/virtualenvs/a-cFmMv4Vf/lib/python3.7/site-packages/modin/error_message.py in catch_bugs_and_request_email(cls, failure_condition)                                                             
     36         if failure_condition:
     37             raise Exception(
---> 38                 "Internal Error. "
     39                 "Please email bug_reports@modin.org with the traceback and command that"
     40                 " caused this error."

Exception: Internal Error. Please email bug_reports@modin.org with the traceback and command that caused this error.    

@ddutt (Contributor, Author) commented May 29, 2019

To be clear, I made the change to use read_table instead of ParquetDataset, and used import modin.pandas as pd. I removed the use of filters just to test whether this works.

@devin-petersohn (Member) commented May 29, 2019

That helps a lot. It looks like a miscalculation of the metadata. There were some cleanup changes in #632 that changed the way metadata is calculated. I will revert those and see if that fixes the problem for you.

cc @williamma12

@ddutt (Contributor, Author) commented May 29, 2019

Alas, the fix didn't work; I get the same error.

@devin-petersohn (Member) commented May 29, 2019

Is there any way to share a minimal set of fake data that reproduces this? Debugging asynchronously is expensive, and I hate to keep asking you to try this or that.

@ddutt (Contributor, Author) commented May 30, 2019

Yes, I can. How can I send it to you?

Dinesh

@devin-petersohn (Member) commented May 30, 2019

You can send an attachment to bug_reports@modin.org

@devin-petersohn added this to the 0.5.2 milestone May 30, 2019

@ddutt (Contributor, Author) commented May 30, 2019

Sorry, I've been busy giving a webinar and couldn't get to this. Will get to it later today.

@ddutt (Contributor, Author) commented May 31, 2019

Sent

@devin-petersohn (Member) commented May 31, 2019

Got it, thanks! I'm looking into it now.

@devin-petersohn (Member) commented May 31, 2019

Ok, it should be fixed. Can you try

pip install git+https://github.com/devin-petersohn/modin@issues/642

If you want to use filters with that branch, you need to import modin.experimental.pandas as pd; otherwise you can import modin.pandas as pd. I tried both, and it should work.

@ddutt (Contributor, Author) commented May 31, 2019

Yep, the fix works.

@devin-petersohn (Member) commented May 31, 2019

So the problem is this weird feature (bug?) in Parquet.

With your dataset, if I read with columns=["something"], the result also contains the 3 partition columns. It doesn't matter whether you read via read_table or ParquetDataset.read; the partition columns are always part of the end result. That caused a mismatch in the metadata, because we were not filtering those columns out; we had assumed that columns was an exclusive list. It's not a problem in pandas because reading the entire dataset at once still returns the full column set.
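
A small sketch of the behavior described (dataset path and column names hypothetical):

import pyarrow.parquet as pq

# Reading an explicit column list from a partitioned dataset still
# returns the partition columns appended to the result.
table = pq.read_table("/data/events", columns=["something"])
print(table.column_names)
# e.g. ['something', 'year', 'month', 'day'] rather than just ['something']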

I will dig into this a little more to see if this is the intended behavior, but I haven't found it documented anywhere. If necessary, I will file a bug report in Arrow.

I will release 0.5.2 later tonight, because others have emailed the bug report address with similar problems.

@devin-petersohn referenced this issue May 31, 2019: Fix Parquet reader for partitioned files #644 (Merged)

@devin-petersohn (Member) commented May 31, 2019

The tests still need to run for another couple of hours before the release, and it's getting late, so I'll get it out first thing in the morning.

Thanks again so much for the diligence!

@ddutt (Contributor, Author) commented May 31, 2019

You're welcome, and thank you for the rapid responses to fix the issues.
