example/latent_sde_lorenz different from what's written in the paper? #92
Thanks for your interest!
You're absolutely right!
The GRU outputs at intermediate times can also be used for practical performance benefits (e.g. check this out).
I think it's fair to put it that way, and to say that points in the latent space are mapped to the observed space via an identity transform.
I think you have a fair point. I agree that I was somewhat sloppy with the re-implementation. Essentially, the things that are simplified are 1) variational inference at time t0 (I didn't include the KL penalty, or a prior), and 2) an actual non-trivial decoder that maps points in the latent space back to the observed space. To properly do 1), one would first select a good prior (say N(mu, sigma); mu and sigma can be optimized during training). To compute the KL penalty, one would also need the variational distribution given by the encoder (e.g. dependent on the last output of the GRU or the return value of some other encoder).
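A minimal sketch of what 1) could look like in PyTorch (illustrative names only, not the repo's code; `h` stands in for the last GRU hidden state, and `qz0_net` is an assumed encoder head):

```python
import torch
import torch.distributions as dist

latent_size, hidden_size = 4, 16
h = torch.randn(1, hidden_size)                          # assumed last GRU output

qz0_net = torch.nn.Linear(hidden_size, 2 * latent_size)  # encoder head -> (mean, log-std)
pz0_mean = torch.nn.Parameter(torch.zeros(latent_size))  # trainable prior N(mu, sigma)
pz0_logstd = torch.nn.Parameter(torch.zeros(latent_size))

qz0_mean, qz0_logstd = qz0_net(h).chunk(2, dim=-1)
qz0 = dist.Normal(qz0_mean, qz0_logstd.exp())            # variational distribution at t0
pz0 = dist.Normal(pz0_mean, pz0_logstd.exp())            # learnable prior at t0

z0 = qz0.rsample()                                       # reparameterized initial state
kl_t0 = dist.kl_divergence(qz0, pz0).sum(-1)             # KL penalty to add to the loss
```

`kl_t0` would be added to the ELBO alongside the path-wise KL from the SDE integration.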
Thanks for the suggestion, and I totally agree! Happy to spend time documenting the model better in the future, though quite unfortunately my schedule in the near future seems quite packed.
Thanks for answering! That cleared things up!
What I meant was: in the paper a context size of 1 is mentioned, while the latent space is 4-dimensional. I assumed context size of 1 was referring to the latent dimension, so that you took only one of the 4 latent variables at the corresponding timestep of the GRU output sequence as the context. But it refers to the time dimension, so all four latent variables, got it!
Awesome! I am looking forward to it, whenever it may be! Even though this is probably not the right place, as GitHub issues are not meant as a forum to ask for help, I'll still try to sneak in a few practical questions; I hope you don't mind. I would be thrilled if you could give some answers! However, if that is not appropriate, feel free to shoot me down and close. In my potential use case, I have a bunch of time series data. I would like to fit one SDE model to it to make predictions into the future, based on a given portion of timesteps. So it is pretty similar to the latent_sde_lorenz example. Even more to the example of the geometric Brownian motion in your paper, I think.
The context vector is an extra piece of information that isn't really related to the latent space. It doesn't really refer to the timestamps either. On the other hand, the timestamp is used to select which context vector we should use for integrating the SDE in a specific time interval.
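That selection-by-time can be sketched like this (a standalone illustration, assuming PyTorch; `select_context` and the toy values are hypothetical, not the repo's actual implementation):

```python
import torch

# Given one context vector per encoder time point in `ts`, pick the context
# governing the drift while the SDE is integrated over the interval containing t.
def select_context(t, ts, ctx):
    # Index of the interval [ts[i], ts[i+1]) containing t; clamp at the ends.
    i = torch.searchsorted(ts, t.reshape(1), right=True).item() - 1
    i = max(0, min(i, len(ts) - 1))
    return ctx[i]

ts = torch.tensor([0.0, 1.0, 2.0, 3.0])
ctx = torch.arange(4.0).unsqueeze(-1)   # 4 context vectors of size 1
c = select_context(torch.tensor(1.5), ts, ctx)   # context for the interval [1, 2)
```

The drift `f(t, z)` would then concatenate `c` onto `z` before its forward pass.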
I'm closing the issue for now after the fixes, but I'm also happy to keep chatting here if that may be helpful.
I'm splitting this reply into several segments, as one giant wall of text may seem intimidating.
If you're familiar with Gaussian processes, I'd say that it's reasonable to think that the prior here is analogous to the prior there when you're only fitting a single time series sequence. In fact, the OU process is a Gaussian process, and this is a special case where the two model classes somewhat coincide. Notably, things are a bit different when one is fitting multiple time series sequences and trying to do interpolation/extrapolation for each sequence individually.
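As a small illustration of the OU process being Gaussian (my own NumPy sketch, not from the repo): its transition density is exactly Gaussian, so it can be simulated without discretization error, and its marginals converge to a Gaussian stationary distribution.

```python
import numpy as np

# OU process dX = theta * (mu - X) dt + sigma dW, simulated via its exact
# Gaussian transition (no Euler discretization error).
rng = np.random.default_rng(0)
theta, mu, sigma, dt = 1.0, 2.0, 0.5, 0.1
n_paths, n_steps = 10000, 100

a = np.exp(-theta * dt)                        # mean-decay factor per step
std = sigma * np.sqrt((1 - a**2) / (2 * theta))  # exact conditional std per step

x = np.zeros(n_paths)
for _ in range(n_steps):
    x = a * x + (1 - a) * mu + std * rng.standard_normal(n_paths)

# By t = 10, the marginals are close to the stationary N(mu, sigma^2 / (2 theta)).
```

Here the stationary law is N(2, 0.125), which the sample mean and variance of `x` approach.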
If the goal is extrapolation based on observations of a single time series sequence, I'd recommend using the posterior drift.
I'd agree the naive method would be to estimate the statistics with samples. I'm aware of works that intend to approximately simulate SDEs by only simulating the marginal mean and covariance ODEs. I may not be up-to-date on the latest developments there, but I haven't seen a paper that convincingly demonstrates that such a method is consistently accurate and leads to models of good utility. More generally, the problem is related to simulating the Fokker-Planck equation, which is known to be difficult aside from special cases.
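The naive sample-based approach mentioned above can be sketched as follows (an illustrative NumPy example on a toy 2-D linear SDE; the model and all names are my own assumptions, not from the repo):

```python
import numpy as np

# Estimate the marginal mean and covariance of dX = A X dt + g dW at a fixed
# time by Euler-Maruyama over many independent paths.
rng = np.random.default_rng(0)
A = np.array([[-1.0, 0.5], [-0.5, -1.0]])   # stable drift matrix
g = 0.3                                      # scalar diffusion coefficient
dt, n_steps, n_paths = 0.01, 200, 5000

x = np.ones((n_paths, 2))                    # all paths start at (1, 1)
for _ in range(n_steps):
    dw = rng.standard_normal((n_paths, 2)) * np.sqrt(dt)
    x = x + x @ A.T * dt + g * dw

mean = x.mean(axis=0)                        # marginal mean estimate at t = 2
cov = np.cov(x.T)                            # marginal covariance estimate at t = 2
```

The cost is one full path per sample, which is what the marginal-moment-ODE methods try to avoid.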
Our adjoint implementation is in this file. The core functions of interest are the reverse drift and diffusions (e.g. here and here). What you listed in the example is something very different, and comes from another paper. I implemented it for MNIST before they released their codebase, and it was mostly for fun for myself. Notably, the reverse SDE formulation in that paper is totally different from ours. The backward/time-reverse-SDE formulation in our paper ensures that individual sample paths can be reversed given a fixed Brownian motion sample. The time-reverse SDE in their paper only ensures that the marginal distributions can be reconstructed. Note that one could get the same marginals even with different sample paths. For their purposes and applications, their reverse-time SDE formulation was sufficient.
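To make the "fixed Brownian motion sample" notion concrete (a toy NumPy sketch of my own, not the adjoint implementation itself): once the Brownian increments are fixed, the sample path is a deterministic function of them, so re-integrating with the same increments reproduces the path exactly.

```python
import numpy as np

rng = np.random.default_rng(42)
dt, n_steps = 0.01, 100
dw = rng.standard_normal(n_steps) * np.sqrt(dt)   # one fixed Brownian sample

def simulate(x0, dw):
    # Euler-Maruyama for dX = -X dt + 0.5 dW, driven by the given increments.
    x, path = x0, [x0]
    for w in dw:
        x = x + (-x) * dt + 0.5 * w
        path.append(x)
    return np.array(path)

p1 = simulate(1.0, dw)
p2 = simulate(1.0, dw)   # same noise sample -> exactly the same path
```

Path-wise reversal in the adjoint relies on this determinism; marginal-only reversal (as in the other paper) does not.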
Thank you for the answers! I think I'll have to read a little further into the topic and try to set up a small model for my use case when I have the time (it's just a little side project right now, out of interest) before asking any more questions here. Thanks for offering to keep on chatting!
Hi there,
first of all, thank you for your work! This is great!
Disclaimer: I don't know much about SDEs, but I would love to make use of neural SDEs, since I have some use cases that I think could work with them. But mostly I find it highly interesting! So I hope I didn't get it all wrong entirely.
I think there are quite a few differences between your code example and what's described in your paper.
In the paper it says that you have a 1-layer GRU to recover the dynamics, and f_net and h_net are both 1-layer MLPs, while f_net takes an additional context variable of size 1. Then there's a decoder to map back from latent space to feature space.
So if I understand correctly, this model would work just like shown in figure 4 of your paper:
The GRU consumes a (time-reversed) sequence of inputs in observation space and outputs a sequence in a 4D latent space. The final output (at t0) then is your initial condition (z0) for the SDE, which is integrated through time, producing another sequence in latent space. So the dynamics would happen in latent space. Then the decoder maps back to observation space. What's a little unclear to me is what the context would be. Just one (the last) latent variable from each of the GRU outputs?
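The architecture just described could be sketched roughly like this in PyTorch (all names, sizes, and wiring here are my illustrative assumptions, not the repo's actual code):

```python
import torch
import torch.nn as nn

obs_size, latent_size = 3, 4

gru = nn.GRU(obs_size, latent_size, batch_first=True)  # 1-layer GRU encoder
f_net = nn.Linear(latent_size + 1, latent_size)        # posterior drift with context of size 1
decoder = nn.Linear(latent_size, obs_size)             # latent space -> observation space

xs = torch.randn(1, 50, obs_size)                      # one observed sequence
out, _ = gru(torch.flip(xs, dims=(1,)))                # encode in reverse time
z0 = out[:, -1]                                        # encoder output at t0 -> initial latent state

# One step of the drift, concatenating a size-1 context slice onto the latent state.
drift = f_net(torch.cat([z0, out[:, -1, :1]], dim=-1))
# z0 would then be integrated through the SDE, and `decoder` maps samples back.
```

This only mirrors the paper's description; the repo's example wires things differently, as discussed below.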
Anyway, this describes an actual latent sde, since all the dynamics are happening in latent space.
However, in your implementation in the example, all the dynamics happen in observation space directly. f_net and h_net both map from observation space to drifts in observation space (well, f_net again sees an additional context, which is of size 64 here). So the GRU encoder "only" provides the context, but not the initial latent state. Meaning it does not recover the dynamics, if I understood it right.
So this is not an actual "Latent SDE" model, is it? More like a "Latent informed/controlled SDE"?
Similar thing for the latent_sde example. The dynamics are learned in observation space as well, so there is no latent space involved. One could claim that the latent space is equal to the observation space here, of course. Or do I simply have the wrong idea of what "Latent SDE/ODE" actually means?
In general, I think this package could greatly benefit from more in-depth documentation regarding training models with it and/or better-explained examples. One or two examples of standard use cases in Jupyter notebooks with detailed explanations could help a lot to make this more accessible to people that don't have much background in SDEs (like me).