Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After upgrading from 0.3 to 0.8.1, one of my notebook cells with resampler runs indefinitely #115

Closed
Alexander-Serov opened this issue Aug 25, 2022 · 25 comments · Fixed by #116

Comments

@Alexander-Serov
Copy link

Alexander-Serov commented Aug 25, 2022

I have several figures in a notebook. All other figures plot correctly and I can wrap PlotlyResampler around and show them. However, one particular figure plots just fine, but when I wrap it in PlotlyResampler my cell keeps running indefinitely. This unfortunately blocks my update to 0.8.1. Do you have any idea @jonasvdd ?

#32

FigureResampler(fig, default_n_shown_samples=MAX_POINTS).show_dash(mode="inline")
image

Other observations:

  • if I downgrade to 0.3.0 it still does not work,
  • if I downgrade to 0.3.0 and remove the dash show, it works fine
    image
@jonasvdd
Copy link
Member

jonasvdd commented Aug 25, 2022

Hi @Alexander-Serov,

I'm a little confused; does it keep running because of the show_dash or the wrapping with the FigureResampler?

Could you also provide a minimalistic code example of the plotly- figure (data) that gets stuck?

Cheers,
Jonas

@Alexander-Serov
Copy link
Author

Alexander-Serov commented Aug 25, 2022

Hey @jonasvdd,

No, I think it is the FigureResampler itself. Here are my latest observations for the same code, same data:

  • v0.7.2, no dash: plots fine,
  • v0.8.1, no dash: gets stuck.

I will see if I can provide a minimal example, the problem is I am working with copyright code and data, so I cannot just copy-paste... I will see what I can do. But the problem clearly appears in the 0.8.0 release.

@jonasvdd
Copy link
Member

Hi, is it possible to create a summary of your data properties (e.g. which trace sizes are we speaking of, what dtype does your data have) and share your plotting code! Than I can try to look whether I can replicate it!

@Alexander-Serov
Copy link
Author

Alexander-Serov commented Aug 25, 2022

The plotting code for this function is very simple, the problem likely comes from the data 🤔. Here is a simplified version of the code:

fig = go.Figure()
for param in sorted(data["param"].unique()):
    param_data = data.loc[data["param"] == param,
                ["param1", "param2"]].set_index(
        "param1", drop=True
    )
    param_data = param_data.sort_index()
    fig.add_trace(
        go.Scattergl(
            x=param_data.index,
            y=param_data["param2"],
            name=f"{param}",
            mode="markers",
            marker=dict(color='b', size=5),
        )
    )
fig.update_xaxes(title_text="D")
fig.update_yaxes(title_text='L')
fig.update_layout(title="Av")

Basically grouping a dataset with time index by a parameter and plotting several curves.

@Alexander-Serov
Copy link
Author

And here is a sample of data. I hope you are able to reproduce. Remember to convert the index to a time index (perhaps it has an effect?).
sample_data.csv

@jonasvdd
Copy link
Member

jonasvdd commented Aug 25, 2022

For me, this still works, when I run this in an isolated notebook. What is your behavior with this code snippet (note that it uses the sample_data you shared with me)?

from plotly_resampler import FigureResampler
import plotly.graph_objects as go
import pandas as pd

data = pd.read_csv('sample_data.csv', index_col='time')
fig = go.Figure()
for param in sorted(data["param"].unique()):
    param_data = data.loc[data["param"] == param,
                ["param1", "param2"]].set_index(
        "param1", drop=True
    )
    param_data = param_data.sort_index()
    fig.add_trace(
        go.Scattergl(
            x=param_data.index,
            y=param_data["param2"],
            name=f"{param}",
            mode="markers",
            marker=dict(color='lightblue', size=5),
        )
    )
fig.update_xaxes(title_text="D")
fig.update_yaxes(title_text='L')
fig.update_layout(title="Av")
fig.show()
print("-"*80)

FigureResampler(fig, default_n_shown_samples=10_000).show_dash(mode='inline')

📷 ⬇️

image

@Alexander-Serov
Copy link
Author

Hey!

It does work indeed, but the index in your example is of object type. Once I convert it to the DateTimeIndex, it gets stuck. Here is what's missing in your snippet.

data = pd.read_csv('sample_data.csv', index_col='time')
data.index = pd.to_datetime(data.index)

Let me know.

@Alexander-Serov
Copy link
Author

Alexander-Serov commented Aug 30, 2022

Hey!

It does work indeed, but the index in your example is of object type. Once I convert it to the DateTimeIndex, it gets stuck. Here is what's missing in your snippet.

data = pd.read_csv('sample_data.csv', index_col='time')
data.index = pd.to_datetime(data.index)

Let me know.

From your silence, I understand you were able to reproduce the issue @jonasvdd? Let me know if you need anything else.

@jvdd
Copy link
Member

jvdd commented Sep 7, 2022

Hi @Alexander-Serov,

I was able to find 2 errors based on minor adjustments of your snippet and implemented a solution in PR #116

  • Error 1: we should parse hf_x arrays to numeric or datetime when they are of dtype "object"
  • Error 2: to work with multiple time zones in the same hf_x array (we should ignore the time zone information away in the visualization)

We can create a pre-release so that you can test on your end whether this resolves the issue or not :)

PS: we (@jonasvdd & me) were both afk because we were backpacking in the Alps! ⛺ ⛰️

@Alexander-Serov
Copy link
Author

PS: we (@jonasvdd & me) were both afk because we were backpacking in the Alps! ⛺ ⛰️

Hey!

No worries, it seems you guys were closer to my place than I expected. 🏔️

Glad to hear you found the problems and no problem testing the pre-release, let me know!

Just a short question for error 2 though. You are saying you are considering dropping the TZ information and using UTC. This might render the FigureResampler unusable for our applications, for example, because UTC time could be hard to interpret for specific projects. E.g. for this particular data set, with a -7h UTC offset, the daytime data would be shown at night. Perhaps, there is a way to keep the TZ? It seems plotly is able to plot the original figure just fine. Or could we use UTC internally, but show the original labels on the x axis? It is plotting non-UTC data in the UTC timezone that would create the most confusion for us.

Also, a remark just to be sure we are talking about the same things. In the data set I've sent you, you will see the -07:00 and -08:00 UTC offsets. However, it is still the same timezone and it is just the result of the daylight saving time (DST) change. To properly process this kind of DST data, one normally casts them to a named TZ, for instance, for the data I've sent you, you could use

import pandas as pd
data = pd.read_csv("sample_data.csv", index_col='time')
data.index = pd.to_datetime(data.index, utc=True)
data.index = data.index.tz_convert('America/Los_Angeles')
data

Would this adjustment solve the problem you were having? (Surely, not all input data can be converted to such format if you don't know the TZ name, but this is a very common case. And perhaps you could process differently [try-catch] those cases, when this is not a problem?).

@jvdd
Copy link
Member

jvdd commented Sep 9, 2022

Hey! It was a long (night)train-ride away to get there, but what a coincidence indeed 😃

Internally, when there are multiple timezones in the x-data, we now omit the timezone information (but do not convert the time) In the other cases nothing special happens - see additions to our parsing code;

# Try to parse the hf_x data if it contains string values
if hf_x.dtype.type is np.str_ or hf_x.dtype == "object":
if len(hf_x) and isinstance(hf_x[0], str):
try:
# Try to parse to numeric
hf_x = pd.to_numeric(hf_x, errors="raise")
except:
try:
# Try to parse to datetime
hf_x = np.asarray([pd.Timestamp(x) for x in hf_x])
except:
raise ValueError(
"plotly-resampler requires the x-data to be numeric or datetime-like"
)
# Check and update timezones of the hf_x data when there are multiple
# timezones in the data
if hf_x.dtype == "object":
if len(hf_x) and isinstance(hf_x[0], pd.Timestamp):
# Assumes that all values in hf_x are pd.Timestamps
if len(set(x.tz for x in hf_x)) > 1:
# Remove the timezone data for plotting when multiple timezones
warnings.warn(
"x-data of multiple timezones / fixedoffsets is passed, "
+ "omitting the timezone data for plotting",
UserWarning,
)
hf_x = np.asarray(
list(map(lambda x: x.replace(tzinfo=None), hf_x))
)

To answer your two questions;

Perhaps, there is a way to keep the TZ? It seems plotly is able to plot the original figure just fine.

Indeed, plotly can cope well with this! I've checked in tests if omitting timezone information (when there are multiple timezones in the x-data) results in the same x-values as vanilla plotly would show. You are more than welcome to validate this yourself :)

Would this adjustment solve the problem you were having? (Surely, not all input data can be converted to such format if you don't know the TZ name, but this is a very common case. And perhaps you could process differently [try-catch] those cases, when this is not a problem?)

Yes, because in that case the data has only 1 timezone (and not two fixedoffsets). Perhaps we should also include this in the warning message 🤔


I've created pre-release 0.8.2rc1, let us know if this fixes the issue correctly & results in the same behavior as plain plotly :)

Also, feel free to check out the pull request for this issue #116!
I believe you are more experienced with timezones than @jonasvdd & me 🙃

@jvdd
Copy link
Member

jvdd commented Sep 24, 2022

Hey @Alexander-Serov, does the pre-release fixes the issue?

@Alexander-Serov
Copy link
Author

Hey @Alexander-Serov, does the pre-release fixes the issue?

Hey @jvdd,
Sorry for the delay, I will check this today!

@Alexander-Serov
Copy link
Author

Hey @jvdd,

Sorry again for the delay. I had wanted to test but had forgotten about it! But I've found time to test it now. Here is what I see:

  • the snippet I've originally posted works great when upgraded to 0.8.2rc1, good job!
  • however, when I try to test a different one, plotting the same data set, but using its time index, my kernel dies 💀 . Strange...

image

Here is the new (even shorter snippet).

from plotly_resampler import FigureResampler
import plotly.graph_objects as go
import pandas as pd

data = pd.read_csv("sample_data.csv", index_col='time')
data.index = pd.to_datetime(data.index)
fig = go.Figure()
fig.add_trace(
        go.Scattergl(
            x=data.index,
            y=data["param2"],
            mode="markers",
            marker=dict(color='lightblue', size=5),
        )
    )
fig.show()
print("-"*80)

FigureResampler(fig).show_dash(mode='inline')

As before, the original plotly plot works fine. Do you have any idea of what could go wrong?

Also, I see that you were testing with default_n_shown_samples=10_000, but I've just checked, it works even with v0.8.1. However, when I remove it, it doesn't. So I performed all my tests without this option.

@jvdd
Copy link
Member

jvdd commented Sep 27, 2022

Hi @Alexander-Serov,

First of all thanks for the thorough review on PR #116, @jonasvdd and me will take your remarks into account and update the PR accordingly in the following days. It is really nice to dive into & discuss code with other (more) experienced people - this is the power of open source 😄

Regarding your comment above: I was also able to reproduce the kernel crash (actually I encountered this issue as well when creating the first implementation in PR #116, but I thought the issue was fixed in the pre-release - which is apparently not the case 🙃 ). We'll try to properly find the root cause for this issue and hopefully fix it as well in PR #116

@jvdd
Copy link
Member

jvdd commented Sep 27, 2022

@Alexander-Serov, taking your comments into account I updated the code and created a new pre-release 0.8.2rc2. Now arrays of datetime.datetime will be parsed as well (which should fix the kernel crash issue).

P.S., I still have to look into finding the most optimal way to drop (or converting?) the tzinfo when there are multiple timezones before we can create a new major release

@Alexander-Serov
Copy link
Author

Hey!
You are welcome, it has been a great pleasure to work with you guys so far. 😃 I wouldn't say I am very knowledgeable in plotly in general (you guys should be much better), but I do indeed manipulate datetimes every day.

Considering the question of dropping/conversion of datetimes with different TZs, I think it's important to ask oneself why different TZs got there in the first place:

  • As I mentioned, the most common case I see is that it's the same named TZ, which - if read as unnamed - will appear as different TZs before and after DST (daylight saving time change) date. For example, for Los Angeles (the dataset I sent you) you will see UTC-07:00 and UTC-08:00. In this case, the only way to show it correctly would be to ask the user to convert to a named TZ. A priori, you cannot really infer you. For me, this is 99% of cases you will see as resampler input with different TZs.
  • if the data really come from different named TZs, there is not much we can do either. We could convert them to UTC or drop the TZ. Conversion to UTC has the disadvantage that the apparent datetime may be wrong. What would have happened at 6 pm in LA would be at ~midninght in UTC. The figures may be confusing. However, if you drop the TZ altogether, you may have an incorrect time shift around the DST dates.

This said, I know plotly itself manages some of these cases without converting all to UTC. I wonder if we can look up what they do in their code.

I will now try the rc2 to see if it fixes the kernel crash and have a look at the PR replies. :) If all goes well, we should have a new version soon. :)

@Alexander-Serov
Copy link
Author

After first tests, it seems like I am having this AssertionError although all my index datetimes are in the same TZ:

image

I am trying to plot

fig = go.Figure()
fig.add_trace(
        go.Scattergl(
            x=data.index,
            y=data["param2"],
            mode="markers",
            marker=dict(color='lightblue', size=5),
        )
    )
fig.show()
print("-"*80)

FigureResampler(fig).show_dash(mode='inline')

on the above data set converted to a single named TZ by applying

data = pd.read_csv("sample_data.csv", index_col='time')
data.index = pd.to_datetime(data.index,utc=True).tz_convert('America/Los_Angeles')

Do you confirm?

@Alexander-Serov
Copy link
Author

Alexander-Serov commented Sep 28, 2022

At the same time, if I plot this plot (the one discussed above):

fig = go.Figure()
for param in sorted(data["param"].unique()):
    param_data = data.loc[data["param"] == param,
                ["param1", "param2"]].set_index(
        "param1", drop=True
    )
    param_data = param_data.sort_index()
    fig.add_trace(
        go.Scattergl(
            x=param_data.index,
            y=param_data["param2"],
            name=f"{param}",
            mode="markers",
            marker=dict(color='lightblue', size=5),
        )
    )
fig.update_xaxes(title_text="D")
fig.update_yaxes(title_text='L')
fig.update_layout(title="Av")
fig.show()
print("-"*80)

FigureResampler(fig).show_dash(mode='inline')

on the dataset with a single named TZ, but DST, this cell continues to run indefinitely.

Let me know if you can confirm. I will continue looking at it tomorrow. 😃 Thanks for your efforts!

@jvdd
Copy link
Member

jvdd commented Sep 28, 2022

After first tests, it seems like I am having this AssertionError although all my index datetimes are in the same TZ.
...
Do you confirm?

I do not get an AssertionError when all the there is only a single named TZ.
image

Code from screenshot above:

data = pd.read_csv("sample_data.csv", index_col='time')
data.index = pd.to_datetime(data.index,utc=True).tz_convert('America/Los_Angeles')

fig = go.Figure()
fig.add_trace(
        go.Scattergl(
            x=data.index,
            y=data["param2"],
            mode="markers",
            marker=dict(color='lightblue', size=5),
        )
    )
fig.update_layout(height=300)
fig.show()
print("-"*80)

FigureResampler(fig).show_dash(mode='inline')

However, I did get an AssertionError when the data was not converted to a single named TZ.

...
on the dataset with a single named TZ, but DST, this cell continues to run indefinitely.

This plot works perfectly on my computer as well
image

It is rather bizarre that I cannot reproduce your errors.. To be sure that we are comparing apples to apples I created a new pre-release rc3 (which contains the fixes from earlier today)

@jvdd
Copy link
Member

jvdd commented Sep 28, 2022

This PR still needs some more thought - I tend to agree with your comment @Alexander-Serov #115 (comment)

At the moment, I'm unsure if dropping the timezone information is the responsibility of this library - as users can convert it to UTC or drop them themselves..
Perhaps we should (just as you explained in your comment)

  • support the 99% of the cases by supporting single named TZs (which is currently supported - if you are able to confirm with the new rc)
  • raise an error with a clear message that we do accept this single named TZ (with DST) when there are multiple TZs detected (but leave them untouched)

@Alexander-Serov
Copy link
Author

Alexander-Serov commented Sep 30, 2022

Hey guys!

I have tested the latest 0.8.2rc3 version, thanks for releasing it! I confirm that the plots behavior has changed, but I am still having a problem. When I launch this plot (with datetime index with a single named TZ), the resampled plot never plots and just runs indefinitely. I think the problem may be due to the long cycle I have commented on in the PR, I will try to investigate (if I have time). Here is the code that never terminates on my system:

from plotly_resampler import FigureResampler
import plotly.graph_objects as go
import pandas as pd

data = pd.read_csv("sample_data.csv", index_col='time')
data.index = pd.to_datetime(data.index,utc=True).tz_convert('America/Los_Angeles')

fig = go.Figure()
fig.add_trace(
        go.Scattergl(
            x=data.index,
            y=data["param2"],
            mode="markers",
            marker=dict(color='lightblue', size=5),
        )
    )
fig.show()
print("-"*80)

FigureResampler(fig).show_dash(mode='inline')

I think it pretty much fits into our 99% discussed above.

I also see that this conde is exactly the same as your code above. Are we running different plotly/numpy versions? 🤔

Here is my simplified pip freeze.

black==22.6.0
bleach==5.0.1
cachetools==5.2.0
certifi==2022.6.15
jupyter-client==7.3.4
jupyter-core==4.11.1
jupyter-dash==0.4.2
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.2
lttbc==0.2.0
matplotlib==3.5.3
matplotlib-inline==0.1.6
mypy-extensions==0.4.3
nbclient==0.6.7
nbconvert==7.0.0
nbformat==5.4.0
nbmake==1.3.4
notebook==6.4.12
numpy==1.21.6
pandas==1.3.5
plotly==5.9.0
plotly-resampler==0.8.2rc3
pydivsufsort==0.0.6
pytz==2022.2.1
scikit-learn==1.0.2
scipy==1.7.3
seaborn==0.10.1
widgetsnbextension==4.0.2
xarray==0.20.2
xarray-einstats==0.2.2

I am running py3.7 in a Windows environment.

@Alexander-Serov
Copy link
Author

I am running py3.7 in a Windows environment.

Update.
I have installed the plotly resampler in a separate environment and the cell works fine. I must have a package incompatibility somewhere. I am checking.

@Alexander-Serov
Copy link
Author

Alexander-Serov commented Sep 30, 2022

Update.
I have investigated further and it is the .show_dash(mode='inline') command that hangs for some reason when x is provided as index. If I use a simple show(), everything is fine. It only occurs in my other environment. When I've created one specifically for the resampler, I do not reproduce with the same code.

Do you have an idea why it could hang on show_dash if it's an index and not a Series?
Specifically, the command app.run_server(mode=mode, **kwargs) is launched by show_dash() and hangs instead of plotting a figure.

@jvdd
Copy link
Member

jvdd commented Oct 7, 2022

Hi @Alexander-Serov,

Thx for investigating this - we both appreciate your efforts :)

I could not reproduce your issue in a Python 3.7.14 environment (with the same packages as from the pip freeze output).. :/


To further answer your questions / comments:

Update.
I have investigated further and it is the .show_dash(mode='inline') command that hangs for some reason when x is provided as index.

So the issue only occurs when you pass data.index as x?
E.g. passing data.index.values as x works fine? 🤯

Do you have an idea why it could hang on show_dash if it's an index and not a Series?

Nope... really boggles my mind

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants