Analysis on issue #36 #71

Aimaanhasan · 2019-03-18T20:24:24Z

Analysis on efficiency and usage of extension arrays in dask

Issue #36


birdsarah · 2019-03-19T00:07:05Z

Hi @Aimaanhasan - this is a great start. Congrats on getting an analysis PR up. Now the back and forth starts :D.

Some next steps:

I would like to see this run on more than just one parquet file to get a more meaningful understanding of the speed ups. Can you run your analysis on https://public-data.telemetry.mozilla.org/bigcrawl/sample_10percent_value_1000_only.parquet.tar.bz2 or https://public-data.telemetry.mozilla.org/bigcrawl/value_1000_only.parquet.tar.bz2
The notebook gets a little hard to follow due to the fletcher errors. These are good to keep in the notebook and useful to see, but I think it would be good to pull your write-up and analysis to the top of the notebook. You can then hyperlink to headers further down, so your write-up links to the sections of code where you've produced certain results.
I think a summary table and perhaps a plot would be helpful too.
Please add installation instructions or a link to the fletcher docs for people trying to run your code.
Think about how and when you're going to use fletcher and how to measure / document performance changes.
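
One way to approach the summary-table and performance-measurement points, as a minimal sketch rather than a prescribed method: time the same operation for each variant, repeat it a few times, and collect the medians in a small text table. The helper names `time_call` and `summary_table` and the stand-in workloads are hypothetical; in the notebook the callables would wrap the real dask computations.

```python
import statistics
import time


def time_call(func, repeats=5):
    """Return the median wall-clock seconds of calling func() `repeats` times."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def summary_table(results, baseline):
    """Format {label: seconds} as text rows with speedup relative to `baseline`."""
    base = results[baseline]
    lines = [f"{'variant':<12}{'median (s)':>12}{'speedup':>10}"]
    for label, seconds in results.items():
        lines.append(f"{label:<12}{seconds:>12.6f}{base / seconds:>9.2f}x")
    return "\n".join(lines)


# Stand-in workloads (assumption): replace these with callables that run
# the real computation on object-dtype and extension-dtype columns.
words = ["example"] * 100_000
results = {
    "baseline": time_call(lambda: sum(len(w) for w in words)),
    "candidate": time_call(lambda: len("".join(words))),
}
print(summary_table(results, baseline="baseline"))
```

Using the median rather than a single run keeps one slow outlier from dominating the comparison; the same `results` dictionary could also feed a bar plot.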

It seems like you might be struggling to convert your columns. To say the fletcher docs are limited would be an understatement. I had to dig around in the fletcher codebase to figure this out, but now that I have, here's some pseudocode that might be useful:

import fletcher as fr
import pyarrow as pa

# Build the fletcher extension dtype once, then cast the column to it.
fletcher_string_dtype = fr.FletcherDtype(pa.string())
df[col] = df[col].astype(fletcher_string_dtype)

birdsarah

see note above

Aimaanhasan · 2019-03-19T10:48:49Z

Hi, @birdsarah! Thank you so much for your feedback.

I am facing some issues and want to ask some questions regarding the changes.

The pseudocode you gave is not working for me. A TypeError occurs:
"TypeError: _from_sequence() got an unexpected keyword argument 'copy'"
I then tried the code below:
df[col] = fr.FletcherDtype(df[col])
This gives the error:
TypeError: Column assignment doesn't support type FletcherDtype
After converting the dask.DataFrame to a pandas.DataFrame, I am able to convert the columns, but converting the pandas.DataFrame back to a dask.DataFrame crashes the kernel and forces a restart.
I have rearranged the notebook to make it more readable. I have also added the instructions and a link to the fletcher docs. Should I commit the changes?
Can you please elaborate on the kind of summary table and plots you have in mind?

birdsarah · 2019-03-19T13:28:56Z

Part 1

df[col] = fr.FletcherDtype(df[col])

Converting the data to pandas and back again will never be a solution. This data does not fit in memory. That is why I gave you code that sets the column's type to the fletcher dtype, as opposed to creating a whole new column of data, which is what your code does.

I can't debug your error without a full traceback.

Part 2

Yes, always be committing and pushing.

Part 3

I'd like to see you work on that yourself. Just think about how to present the information you have gathered carefully.

birdsarah · 2019-03-20T04:14:14Z

I've just been reading this, which gives some context for fletcher, so I thought I'd share: https://www.dataschool.io/future-of-pandas/

The trick with dask vs pandas is to remember that dask ends up being lots of little bits of pandas but we have to let dask manage that itself.
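
To make the "lots of little bits of pandas" point concrete, here is a minimal pandas-only sketch; it is not the dask or fletcher API. It assumes pandas' built-in category dtype as a stand-in for a fletcher dtype, and `convert_partition` is a hypothetical helper name. The conversion is written once for a single piece and then applied to each piece in turn.

```python
import pandas as pd


def convert_partition(part, col, dtype):
    """Return a copy of one pandas piece with `col` cast to `dtype`."""
    return part.assign(**{col: part[col].astype(dtype)})


# Three small pandas DataFrames standing in for the partitions dask manages.
partitions = [
    pd.DataFrame({"value": ["a", "b"]}),
    pd.DataFrame({"value": ["b", "c"]}),
    pd.DataFrame({"value": ["a", "a"]}),
]

# Stand-in extension dtype (assumption); the thread uses a fletcher dtype here.
target = pd.CategoricalDtype(categories=["a", "b", "c"])

# Apply the same per-piece conversion to every partition.
converted = [convert_partition(p, "value", target) for p in partitions]
print([str(p["value"].dtype) for p in converted])
```

In dask itself the pieces are not held in a Python list like this; `map_partitions` is the documented way to apply a per-partition function lazily, so a helper of this shape could plausibly be passed to it.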

Don't get completely stuck, keep trying things and reaching out.

Added link to the fletcher docs and gave an example for usability. Relocated the analyses for readability Issue mozilla#36

Aimaanhasan · 2019-03-22T03:45:43Z

Hello @birdsarah, I've tried several ways to convert the columns of the dask.DataFrame, but each one gives me an error.

Approach 1

Used the code below to implement:

import pyarrow as pa

fletcher_string_dtype = fr.FletcherDtype(pa.string())
df[df.columns[0]] = df[df.columns[0]].astype(fletcher_string_dtype)
This gives me the following error
TypeError Traceback (most recent call last)
in
1 import pyarrow as pa
2 fletcher_string_dtype = fr.FletcherDtype(pa.string())
----> 3 df[df.columns[0]] = df[df.columns[0]].astype(fletcher_string_dtype)
4
5

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in astype(self, dtype)
1646 meta = self._meta_nonempty.astype(dtype)
1647 else:
-> 1648 meta = self._meta.astype(dtype)
1649 if hasattr(dtype, 'items'):
1650 # Pandas < 0.21.0, no categories attribute, so unknown

C:\ProgramData\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
176 else:
177 kwargs[new_arg_name] = new_arg_value
--> 178 return func(*args, **kwargs)
179 return wrapper
180 return _deprecate_kwarg

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
4999 # else, only a single dtype is given
5000 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 5001 **kwargs)
5002 return self._constructor(new_data).__finalize__(self)
5003

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
3712
3713 def astype(self, dtype, **kwargs):
-> 3714 return self.apply('astype', dtype=dtype, **kwargs)
3715
3716 def convert(self, **kwargs):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
3579
3580 kwargs['mgr'] = self
-> 3581 applied = getattr(b, f)(**kwargs)
3582 result_blocks = _extend_blocks(applied, result_blocks)
3583

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, errors, values, **kwargs)
573 def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
574 return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 575 **kwargs)
576
577 def _astype(self, dtype, copy=False, errors='raise', values=None,

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
634
635 # astype processing
--> 636 dtype = np.dtype(dtype)
637 if self.dtype == dtype:
638 if copy:

TypeError: data type not understood

Approach 2

Used the code below to implement:
df[df.columns[0]] = fr.FletcherDtype(df[df.columns[0]])

This gives me the following error

TypeError Traceback (most recent call last)
in
----> 1 df[df.columns[0]]=fr.FletcherDtype(df[df.columns[0]])
2
3

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in __setitem__(self, key, value)
2499 df = self.assign({k: value for k in key})
2500 else:
-> 2501 df = self.assign({key: value})
2502
2503 self.dask = df.dask

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in assign(self, **kwargs)
2694 callable(v) or pd.api.types.is_scalar(v)):
2695 raise TypeError("Column assignment doesn't support type "
-> 2696 "{0}".format(type(v).__name__))
2697 if callable(v):
2698 kwargs[k] = v(self)

TypeError: Column assignment doesn't support type FletcherDtype

It would be very helpful if you could guide me here. I searched the docs for a solution but could not find one. Fletcher arrays do work fine with a pandas.DataFrame, though: converting the dask.DataFrame to a pandas.DataFrame and then applying fletcher arrays is easier and doesn't give an error.
Please help me!
Thank you.

birdsarah · 2019-03-22T19:15:37Z

I'm sorry you're having struggles and it's great that you tried a bunch of options. Unfortunately this issue is about figuring out how to work with fletcher. I feel that if I start guiding further from where you are, I'll just be working on the issue myself, which is not the point. I'm going to close this PR for now.

ah02887 and others added 2 commits March 19, 2019 01:23

Analysis on issue mozilla#36

f81c97a

Analysis on efficiency and usage of extension arrays in dask Issue mozilla#36

Author correction from previous commit

4b5e559

birdsarah suggested changes Mar 19, 2019


Aimaanhasan and others added 3 commits March 20, 2019 23:00

Restructured Analysis

ad591ca

Added Fletcher Docs and restructured analysis

f587fde

Added link to the fletcher docs and gave an example for usability. Relocated the analyses for readability Issue mozilla#36

Delete 2019_03_Aimaanhasan__issue#36_array_extension - Copy.ipynb

0470002

birdsarah closed this Mar 22, 2019
