This is the repository for a Python implementation of the algorithm presented in the book chapter

H. Meinlschmidt, M. Sons, & M. Stemmler: A Variational Approach to Continuous Time Dynamic Models.

The implementation was written by Hannes Meinlschmidt ([lastname] at math.fau.de). See below for instructions on reproducing the results from the book chapter. Otherwise, feel free to play with the `-sim` CLI option and various settings to randomly generate a data set using the stochastic differential equation integrator `sdeint`.
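As a point of reference, simulating trajectories of a linear SDE with `sdeint` takes only a few lines. This is a minimal sketch, not the repository code; the drift matrix, intercept, and noise level below are made up for illustration:

```python
import numpy as np
import sdeint

# Illustrative 2x2 drift matrix, intercept, and noise level
# (not the repository's defaults)
A = np.array([[-0.5,  0.3],
              [ 0.1, -0.4]])
b = np.array([1.0, 0.5])
sigma = 0.1

def f(x, t):
    # Drift of the linear SDE dx = (A x + b) dt + sigma dW
    return A @ x + b

def G(x, t):
    # Constant diagonal diffusion
    return sigma * np.eye(2)

tspan = np.linspace(0.0, 15.0, 100)
x0 = np.random.randn(2)
trajectory = sdeint.itoint(f, G, x0, tspan)  # array of shape (100, 2)
```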
At its core, this is an optimization algorithm which, given sets of data, each consisting of a number of points in space, determines a square matrix driving a linear ordinary differential equation such that the trajectories generated by the differential equation match the points in space in a least-squares optimal way over the whole data set. It can also optimize for so-called intercepts, that is, forcing functions in the differential equation, either per class or individually.
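Concretely, writing the model as x'(t) = A x(t) + b with a square matrix A and intercept b, the objective is the sum of squared distances between the ODE trajectories at the observation times and the observed points. A rough sketch of such an objective, using a generic ODE solver rather than the repository's internals (the data layout below is an assumption for illustration):

```python
import numpy as np
from scipy.integrate import solve_ivp

def objective(A, b, data):
    """Least-squares misfit of the linear ODE x' = A x + b over all subjects.

    Assumes `data` is a list of (x0, times, points) triples, one per subject:
    initial value, observation times, and observed points (len(times) x d).
    """
    total = 0.0
    for x0, times, points in data:
        sol = solve_ivp(lambda t, x: A @ x + b,
                        (times[0], times[-1]), x0, t_eval=times)
        total += np.sum((sol.y.T - points) ** 2)
    return total
```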
The optimizer used is a BFGS algorithm from scipy, fed with a gradient of the least-squares objective function that was calculated analytically by adjoint calculus.
Details can be found in the book chapter or in forthcoming publications.
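Schematically, and leaving out all specifics of this repository, the optimization amounts to handing the objective and its analytic gradient to scipy; the toy objective below only stands in for the real one:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-ins: the real objective is the least-squares misfit and the
# real gradient comes from the adjoint calculus.
def J(p):
    return float(np.sum((p - 1.0) ** 2))

def grad_J(p):
    return 2.0 * (p - 1.0)

p0 = np.zeros(4)  # e.g. a flattened 2x2 matrix A
result = minimize(J, p0, jac=grad_J, method="BFGS")
print(result.x)  # converges to the minimizer, here all ones
```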
Important
In the book chapter, we applied the algorithm to a particular set of data from the German Socio-Economic Panel (SOEP), which is maintained by the German Institute for Economic Research (DIW Berlin). Usage of the SOEP data is free of charge for anyone working at a scientific research institution. Unfortunately, per the contract with DIW Berlin, I am not allowed to supply the data file directly within this repository, so you need to request access to the SOEP data yourself.
That being said, the data_processed folder contains a Python snippet which processes the rather unwieldy, large SOEP pl.csv file to filter out only the columns relevant to our application case. The algorithm then works on the resulting pl_filtered.csv file.
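For orientation, such a filtering step boils down to a pandas column selection along these lines; the column names here are placeholders, and the actual selection is in the snippet in data_processed:

```python
import pandas as pd

# Placeholder column names -- see the snippet in data_processed for the
# columns actually used. pl.csv is large, hence the chunked read.
COLUMNS = ["pid", "syear", "plh0182"]

chunks = pd.read_csv("pl.csv", usecols=COLUMNS, chunksize=100_000)
pd.concat(chunks).to_csv("pl_filtered.csv", index=False)
```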
Once you have the pl_filtered.csv file in the data_processed folder, you can reproduce the results in the book chapter with
python VarCT_optimize.py
This will randomly select the default of 1% of the participants in the data and use these in the algorithm. To change this number, use the `-td PERCENT` command line option, e.g.

python VarCT_optimize.py -td 10

to select 10% of participants randomly.
The following CLI options are related to simulating data (see the example after this list):

- `-sim` or `--simulate`: Do not use the data file, but simulate data.
- `-n N` or `--num_ivs N`: If simulating data (`-sim`), number N (integer) of trajectories/subjects to simulate. Default is 50.
- `-ni N` or `--num_ints N`: If simulating data (`-sim`), sets the number of intercept classes to N (integer). Each trajectory/subject is assigned an intercept class randomly. Default is the number of trajectories/subjects.
- `-t N` or `--timehorizon N`: If simulating data, overall time N (integer) over which to simulate. Default is 15.
- `-ex` or `--exportdata`: If simulating data, export the simulated data to export_data.csv, which can then be used as a data file.
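For example, to simulate 100 subjects in 4 intercept classes over a time horizon of 20 and export the result, you would combine these options as

python VarCT_optimize.py -sim -n 100 -ni 4 -t 20 -ex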
The following CLI options are related to learning of intercepts; they can be used with either the data file or simulated data.
- `-li` or `--learnintercepts`: Switches learning of intercepts on. Each trajectory/subject is aware of which intercept class it belongs to.
- `-all` or `--learn_all`: If learning intercepts (`-li`), each trajectory/subject tries to learn its own intercept, ignorant of which intercept class it belongs to. Does nothing if the number of trajectories/subjects is equal to the number of intercept classes.
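For example, simulating 50 subjects in 5 intercept classes while letting each subject learn its own intercept reads

python VarCT_optimize.py -sim -n 50 -ni 5 -li -all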
The following general CLI options are also available:

- `-td N` or `--trainingdata N`: N percent of the trajectories supplied in the data file are used for optimization. Does not apply if data is simulated.
- `-np N` or `--num_plots N`: Number N (integer) of trajectories plotted in the visualization plots. Default is 50.
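For example, to optimize on 5% of the participants and plot 20 trajectories:

python VarCT_optimize.py -td 5 -np 20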
The code was written for and tested with Python 3.10; I give no guarantees that it works on other versions.
Warning
A word of warning: I have tried to put reasonable comments in the code, but it has grown organically and I am not a Python professional. There are probably several rather inefficient and inelegant solutions in there, not to speak of possible bugs. However, the main functionality should definitely be correct. Eventually, I would consider rewriting large parts of the code, especially in order to facilitate usage on other data sets as well.
There is also a gradient check routine in VarCT_checkgrad.py which checks the accuracy and speed of the adjoint gradient method used in the calculations against difference quotients of the objective function. Use
python VarCT_checkgrad.py -sim
and see what it says. In principle it is enough to check this on only one trajectory.
If intercepts are not learned, there is also a comparison with the full differential calculated explicitly (this is feasible because the parameter space is very small in this case) and with numerical differentiation via the numdifftools package.
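The idea behind such a check is the standard one: compare the analytic gradient against finite differences. On a toy function (not the repository code), both comparisons look like this:

```python
import numpy as np
from scipy.optimize import check_grad
import numdifftools as nd

# Toy function standing in for the least-squares objective
def J(p):
    return float(np.sum(np.sin(p) ** 2))

def grad_J(p):
    return 2.0 * np.sin(p) * np.cos(p)

p = np.random.randn(4)

# Norm of the difference between the analytic gradient and a
# forward-difference approximation; should be small (roughly 1e-7
# for well-scaled problems)
print(check_grad(J, grad_J, p))

# Higher-order numerical differentiation via numdifftools for comparison
print(np.linalg.norm(nd.Gradient(J)(p) - grad_J(p)))
```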