This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Analysis on issue #36 #71

Closed
wants to merge 5 commits into from

Conversation

Aimaanhasan

Analysis on efficiency and usage of extension arrays in dask

Issue #36

ah02887 and others added 2 commits March 19, 2019 01:23
Analysis on efficiency and usage of extension arrays in dask

Issue mozilla#36
@birdsarah
Contributor

birdsarah commented Mar 19, 2019

Hi @Aimaanhasan - this is a great start. Congrats on getting an analysis PR up. Now the back and forth starts :D.

Some next steps:

  1. I would like to see this run on more than just one parquet file to get a more meaningful understanding of the speed ups. Can you run your analysis on https://public-data.telemetry.mozilla.org/bigcrawl/sample_10percent_value_1000_only.parquet.tar.bz2 or https://public-data.telemetry.mozilla.org/bigcrawl/value_1000_only.parquet.tar.bz2
  2. The notebook gets a little hard to follow due to the fletcher errors. These are good to keep in the notebook, and useful to see. But I think it would be good to pull your write-up and analysis to the top of the notebook. You can then hyperlink to headers further down, so your write-up links to the sections of code where you've produced certain results.
  3. I think a summary table and perhaps a plot would be helpful too.
  4. Please add installation instructions or a link to the fletcher docs for people trying to run your code.
  5. Think about how and when you're going to use fletcher and how to measure / document performance changes.
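
One way to approach items 3 and 5, sketched with only the standard library: time each variant of an operation several times, keep the best run, and print a small summary table. The helper names (`time_call`, `summarize`, `format_table`) and the two toy workloads are hypothetical stand-ins, not part of the notebook or of fletcher. In a real analysis the callables would presumably wrap the same dask computation on an object-dtype column and on a fletcher-dtype column.

```python
import time
from typing import Callable, Dict, List, Tuple


def time_call(func: Callable[[], object], repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def summarize(timings: Dict[str, float], baseline: str) -> List[Tuple[str, float, float]]:
    """Build (label, seconds, speedup relative to the baseline) rows."""
    base = timings[baseline]
    return [
        (label, secs, base / secs if secs > 0 else float("inf"))
        for label, secs in timings.items()
    ]


def format_table(rows: List[Tuple[str, float, float]]) -> str:
    """Render the rows as a fixed-width plain-text table."""
    header = f"{'variant':<20}{'seconds':>12}{'speedup':>10}"
    lines = [header, "-" * len(header)]
    for label, secs, speedup in rows:
        lines.append(f"{label:<20}{secs:>12.6f}{speedup:>9.2f}x")
    return "\n".join(lines)


# Hypothetical stand-in workloads; replace with the real computations.
words = ["example.com/script.js"] * 50_000


def baseline_variant() -> int:
    return sum(1 for w in words if w.endswith(".js"))


def candidate_variant() -> int:
    return sum(w.endswith(".js") for w in words)


timings = {
    "baseline": time_call(baseline_variant),
    "candidate": time_call(candidate_variant),
}
rows = summarize(timings, baseline="baseline")
print(format_table(rows))
```

Taking the best of several runs reduces noise from other processes. The same rows could feed a bar plot of seconds per variant.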

It seems like you might be struggling to convert your columns. To say the fletcher docs are limited would be an understatement. I had to dig around in the fletcher codebase to figure this out, but now that I have, here's some pseudocode that might be useful:

import fletcher as fr
import pyarrow as pa

# Arrow-backed string dtype; cast the existing column to it
fletcher_string_dtype = fr.FletcherDtype(pa.string())
df[col] = df[col].astype(fletcher_string_dtype)

Contributor

@birdsarah birdsarah left a comment

see note above

@Aimaanhasan
Author

Aimaanhasan commented Mar 19, 2019

Hi, @birdsarah! Thank you so much for your feedback.

I am facing some issues and want to ask some questions regarding the changes.

  1. The pseudocode you gave is not working for me. A TypeError occurs:
    "TypeError: _from_sequence() got an unexpected keyword argument 'copy'"
    I also tried the code below:
    df[col] = fr.FletcherDtype(df[col])
    This gives the error:
    TypeError: Column assignment doesn't support type FletcherDtype
    If I convert the dask.DataFrame to a pandas.DataFrame, I am able to convert the columns, but converting the pandas.DataFrame back to a dask.DataFrame then crashes the kernel and forces a restart.

  2. I have made the arrangements to make the notebook more readable. I have also added the instructions and link for fletcher docs. Should I commit the changes?

  3. Can you please elaborate more about the kind of summary table and plots?

@birdsarah
Contributor

Part 1

df[col] = fr.FletcherDtype(df[col])

Converting the data to pandas and back again will never be a solution: this data does not fit in memory. That is why I gave you code showing how to set the column's type to the fletcher dtype, as opposed to creating a whole new column of data, which is what your code does.

I can't debug your error without a full traceback.

Part 2

Yes, always be committing and pushing.

Part 3

I'd like to see you work on that yourself. Just think about how to present the information you have gathered carefully.

@birdsarah
Contributor

I've just been reading this, which gives some context for fletcher, so I thought I'd share: https://www.dataschool.io/future-of-pandas/

The trick with dask vs pandas is to remember that dask ends up being lots of little bits of pandas but we have to let dask manage that itself.
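
As a rough mental model of that point, here is a plain-Python toy (not dask's API, and it uses neither pandas nor fletcher). A partitioned column is a recipe for producing small chunks, and a conversion is described once as a per-chunk function that only runs when the chunks are actually requested, one chunk at a time. `PartitionedColumn`, `read_chunks`, and `to_bytes` are hypothetical names chosen for illustration. The method name `map_partitions` is borrowed from dask, but this is a simplified stand-in for it.

```python
from typing import Callable, Iterable, Iterator


class PartitionedColumn:
    """Toy stand-in for a partitioned column: a recipe that yields chunks."""

    def __init__(self, make_chunks: Callable[[], Iterable[list]]):
        self._make_chunks = make_chunks

    def map_partitions(self, func: Callable[[list], list]) -> "PartitionedColumn":
        # Nothing is computed here; we only record a new recipe that
        # applies `func` to each chunk when the chunks are requested.
        return PartitionedColumn(
            lambda: (func(chunk) for chunk in self._make_chunks())
        )

    def compute_chunks(self) -> Iterator[list]:
        # Chunks are produced and transformed one at a time.
        return iter(self._make_chunks())


def read_chunks() -> Iterator[list]:
    # Hypothetical data source: three small chunks of strings.
    for i in range(3):
        yield [f"value-{i}-{j}" for j in range(4)]


calls = []  # records each time the per-chunk conversion actually runs


def to_bytes(chunk: list) -> list:
    # Per-chunk type conversion, standing in for a per-partition astype.
    calls.append(len(chunk))
    return [s.encode("utf-8") for s in chunk]


col = PartitionedColumn(read_chunks)
converted = col.map_partitions(to_bytes)  # lazy: no chunk converted yet
calls_before_compute = list(calls)

sizes = [len(chunk) for chunk in converted.compute_chunks()]
print(calls_before_compute, sizes)
```

The conversion is expressed against the partitioned object and executed per chunk by the framework. Pulling everything into one in-memory frame, converting, and pushing it back defeats that design.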

Don't get completely stuck, keep trying things and reaching out.

Aimaanhasan and others added 3 commits March 20, 2019 23:00
Added link to the fletcher docs and gave an example for usability. 
Relocated the analyses for readability

Issue mozilla#36
@Aimaanhasan
Author

Hello @birdsarah, I've tried many ways to convert the columns of a dask.DataFrame, but each attempt gives me an error. The two main approaches are below.

Approach 1

Used the code below to implement:

import pyarrow as pa

fletcher_string_dtype = fr.FletcherDtype(pa.string())
df[df.columns[0]] = df[df.columns[0]].astype(fletcher_string_dtype)

This gives me the following error
TypeError Traceback (most recent call last)
in
1 import pyarrow as pa
2 fletcher_string_dtype = fr.FletcherDtype(pa.string())
----> 3 df[df.columns[0]] = df[df.columns[0]].astype(fletcher_string_dtype)
4
5

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in astype(self, dtype)
1646 meta = self._meta_nonempty.astype(dtype)
1647 else:
-> 1648 meta = self._meta.astype(dtype)
1649 if hasattr(dtype, 'items'):
1650 # Pandas < 0.21.0, no categories attribute, so unknown

C:\ProgramData\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
176 else:
177 kwargs[new_arg_name] = new_arg_value
--> 178 return func(*args, **kwargs)
179 return wrapper
180 return _deprecate_kwarg

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
4999 # else, only a single dtype is given
5000 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 5001 **kwargs)
5002 return self._constructor(new_data).__finalize__(self)
5003

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
3712
3713 def astype(self, dtype, **kwargs):
-> 3714 return self.apply('astype', dtype=dtype, **kwargs)
3715
3716 def convert(self, **kwargs):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
3579
3580 kwargs['mgr'] = self
-> 3581 applied = getattr(b, f)(**kwargs)
3582 result_blocks = _extend_blocks(applied, result_blocks)
3583

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, errors, values, **kwargs)
573 def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
574 return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 575 **kwargs)
576
577 def _astype(self, dtype, copy=False, errors='raise', values=None,

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
634
635 # astype processing
--> 636 dtype = np.dtype(dtype)
637 if self.dtype == dtype:
638 if copy:

TypeError: data type not understood

Approach 2

Used the code below to implement:
df[df.columns[0]] = fr.FletcherDtype(df[df.columns[0]])

This gives me the following error

TypeError Traceback (most recent call last)
in
----> 1 df[df.columns[0]]=fr.FletcherDtype(df[df.columns[0]])
2
3

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in __setitem__(self, key, value)
2499 df = self.assign(**{k: value for k in key})
2500 else:
-> 2501 df = self.assign(**{key: value})
2502
2503 self.dask = df.dask

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in assign(self, **kwargs)
2694 callable(v) or pd.api.types.is_scalar(v)):
2695 raise TypeError("Column assignment doesn't support type "
-> 2696 "{0}".format(type(v).__name__))
2697 if callable(v):
2698 kwargs[k] = v(self)

TypeError: Column assignment doesn't support type FletcherDtype


It would be very helpful if you could guide me here. I have searched the docs for a solution but could not find one. Fletcher arrays do work fine with a pandas.DataFrame: converting the dask.DataFrame to a pandas.DataFrame and then applying fletcher arrays is easier and doesn't give an error.
Please help me!
Thank you.

@birdsarah
Contributor

I'm sorry you're having struggles and it's great that you tried a bunch of options. Unfortunately this issue is about figuring out how to work with fletcher. I feel that if I start guiding further from where you are, I'll just be working on the issue myself, which is not the point. I'm going to close this PR for now.

@birdsarah birdsarah closed this Mar 22, 2019
3 participants