
[MRG] Improve the VAR module with: i) regression tests against statsmodels, ii) selecting order and iii) improved lags handling #46

Merged
adam2392 merged 27 commits into mne-tools:main from the var branch on Oct 26, 2021

Conversation

@adam2392 (Member) commented on Sep 18, 2021:

PR Description

Fixes: #45
Fixes: #28

TODO:

  1. Add tests for trend handling -> left for the future.
  2. Consolidate the way lags are handled in the dataset when storing as a Connectivity object

Merge checklist

Maintainer, please confirm the following before merging:

  • All comments resolved
  • This is not your own PR
  • All CIs are happy
  • PR title starts with [MRG]
  • whats_new.rst is updated
  • PR description includes phrase "closes <#issue-number>"

@adam2392 (Member, author) commented:

To handle lags of the VAR model and easily form the companion matrix of the VAR coefficient matrices, we will need to see whether this issue can be resolved in scipy: scipy/scipy#13124

This would form the multivariate block-companion matrix, which can then be leveraged to check the stability of the VAR model. This would be a useful attribute to expose to users in DynamicMixin.
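
For reference, forming the companion matrix and checking stability is straightforward once the lag coefficients are available. This is just a minimal NumPy sketch of the standard construction (not the DynamicMixin API), assuming coefs is a list of the p lag coefficient matrices A_1, ..., A_p, each of shape (n_channels, n_channels):

import numpy as np


def companion_matrix(coefs):
    """Stack VAR(p) coefficients [A_1, ..., A_p] into block-companion form.

    Each A_i has shape (n, n); the result has shape (n * p, n * p).
    """
    p = len(coefs)
    n = coefs[0].shape[0]
    # top block row: [A_1, A_2, ..., A_p]
    top = np.hstack(coefs)
    # sub-diagonal identity blocks shift the lagged state down by one
    bottom = np.hstack([np.eye(n * (p - 1)), np.zeros((n * (p - 1), n))])
    return np.vstack([top, bottom])


def is_stable(coefs):
    # stable iff all companion-matrix eigenvalues lie inside the unit circle
    eigvals = np.linalg.eigvals(companion_matrix(coefs))
    return bool(np.all(np.abs(eigvals) < 1))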

@adam2392 changed the title from "[WIP] Improve the VAR module with: i) regression tests against statsmodels, ii) selecting order and iii) improved lags/trend handling" to "[WIP] Improve the VAR module with: i) regression tests against statsmodels, ii) selecting order and iii) improved lags handling" on Sep 20, 2021
@adam2392 changed the title prefix from "[WIP]" to "[MRG]" on Sep 21, 2021
@larsoner (Member) left a review comment:

I'm going to stop where I am so far, because I'm noticing a lot of uncovered lines. Is codecov wrong here? If not, then I think there is a lot of code that should either be cut out, or tested properly if possible

Review threads (outdated, resolved) on: mne_connectivity/base.py, mne_connectivity/vector_ar/model_selection.py (×3), mne_connectivity/vector_ar/utils.py
@adam2392 (Member, author) commented:

> I'm going to stop where I am so far, because I'm noticing a lot of uncovered lines. Is codecov wrong here? If not, then I think there is a lot of code that should either be cut out, or tested properly if possible

Ah, sorry about that. I was just hoping to get your thoughts on the current implementation ported over from the statsmodels backend. I'll get rid of the unnecessary portions of the code, based on the discussion in #45.

@agramfort (Member) left a review comment:

The code you copied from statsmodels should be cleaned up to follow MNE's strict code style requirements.

thx !

Review threads (outdated, resolved) on: mne_connectivity/vector_ar/model_selection.py (×2), mne_connectivity/vector_ar/tests/test_var.py (×2)
adam2392 and others added 3 commits on September 25, 2021 (Co-authored-by: Alexandre Gramfort <alexandre.gramfort@m4x.org>)
@adam2392 (Member, author) commented:

Okay, I removed a bunch of the unnecessary code taken from statsmodels and added a few more tests. Checking CI now.

@agramfort (Member) commented:

Can you give me evidence that it works? Did you test it on simulations? Do you see a speed or memory gain?

@adam2392 (Member, author) commented:

Current memory profiler output:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    18    780.3 MiB    780.3 MiB           1   @profile
    19                                         def run_experiment(data, times):
    20                                             """Run RAM experiment.
    21                                         
    22                                             python -m memory_profiler ./benchmarks/bench_var.py
    23                                             """
    24                                             # compute time-varying var
    25   1112.3 MiB    332.0 MiB           1       conn = vector_auto_regression(data, times=times)


Filename: ./benchmarks/bench_var.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    28   1116.4 MiB   1116.4 MiB           1   @profile
    29                                         def run_sm_experiment(sample_data):
    30                                             # statsmodels feeds in (n_samples, n_channels)
    31   1116.5 MiB      0.1 MiB           1       sm_var = VAR(endog=sample_data.squeeze().T)
    32   1166.9 MiB     50.4 MiB           1       sm_params = sm_var.fit(maxlags=1, trend='n')
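
For context, bench_var.py wraps both fits with memory_profiler's @profile decorator. A hypothetical skeleton of that file (the actual data loading and array sizes in the repo differ):

import numpy as np
from memory_profiler import profile
from statsmodels.tsa.api import VAR

from mne_connectivity import vector_auto_regression

# synthetic stand-in for the real benchmark data:
# shape (n_epochs, n_channels, n_times)
rng = np.random.default_rng(0)
data = rng.standard_normal((1, 20, 50_000))
times = np.arange(data.shape[-1])


@profile
def run_experiment(data, times):
    """Run RAM experiment.

    python -m memory_profiler ./benchmarks/bench_var.py
    """
    # compute time-varying var
    vector_auto_regression(data, times=times)


@profile
def run_sm_experiment(sample_data):
    """Run RAM experiment with statsmodels."""
    # statsmodels feeds in (n_samples, n_channels)
    sm_var = VAR(endog=sample_data.squeeze().T)
    sm_var.fit(maxlags=1, trend='n')


if __name__ == '__main__':
    run_experiment(data, times)
    run_sm_experiment(data)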

With a for loop over samples (TBD):

@agramfort (Member) commented on Oct 11, 2021 via email.

@adam2392 (Member, author) commented:

Sorry, can you unpack this?

The total RAM usage is basically comparable to that of statsmodels (a bit lower overall, probably because we're not storing as many variables).

In #28 (comment), though, you suggested using a for loop to prevent high RAM usage with a large number of n_chs/n_samples. I'm not sure how best to incorporate that suggestion with lags, so I was wondering if you had any tips on how to incorporate lags into the for loop over samples?

@agramfort (Member) commented on Oct 11, 2021 via email.

@adam2392 (Member, author) commented:

import numpy as np


def _test_forloop(X, lags, offset=0, l2_reg=0):
    """Fit VAR coefficients by accumulating the normal equations per sample."""
    # possibly offset the endogenous variable over the samples
    endog = X[offset:, :]

    # get the number of time points and equations (channels)
    n_times, n_equations = endog.shape

    # X.T @ X coefficient matrix of the normal equations
    n_channels = n_equations * lags
    XdotX = np.zeros((n_channels, n_channels))

    # X.T @ Y ordinate / dependent-variable matrix
    XdotY = np.zeros((n_channels, n_channels))

    # loop over sample points and aggregate the
    # necessary elements of the normal equations
    first_component = np.zeros((n_channels, 1))
    second_component = np.zeros((1, n_channels))
    y_component = np.zeros((1, n_channels))
    for idx in range(n_times - lags):
        for jdx in range(lags):
            block = slice(jdx * n_equations, (jdx + 1) * n_equations)
            first_component[block, :] = endog[idx + jdx, :][:, np.newaxis]
            second_component[:, block] = endog[idx + jdx, :][np.newaxis, :]
            y_component[:, block] = endog[idx + 1 + jdx, :][np.newaxis, :]

        # increment for X.T @ X
        XdotX += first_component @ second_component

        # increment for X.T @ Y
        XdotY += first_component @ y_component

    # solve the (optionally ridge-regularized) normal equations
    if l2_reg != 0:
        final_params = np.linalg.lstsq(
            XdotX + l2_reg * np.eye(n_channels), XdotY, rcond=1e-15)[0].T
    else:
        final_params = np.linalg.lstsq(XdotX, XdotY, rcond=1e-15)[0].T

    # format the final matrix as (lags * n_equations, n_equations),
    # reordering the lag blocks from last to first
    params = np.empty((lags * n_equations, n_equations))
    for idx in range(lags):
        start_col = n_equations * idx
        stop_col = n_equations * (idx + 1)
        start_row = n_equations * (lags - idx - 1)
        stop_row = n_equations * (lags - idx)
        params[start_row:stop_row, :] = \
            final_params[n_equations * (lags - 1):, start_col:stop_col].T
    return params
The for loop over the samples is considerably slower and, for some reason, uses a lot more memory. This was measured by running bench_var.py. It's probably slower because you end up with n_times * lags loop iterations; I'm not sure why more RAM would be used, though...

# for the current implementation
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     8    784.3 MiB    784.3 MiB           1   @profile
     9                                         def run_experiment(data, times):
    10                                             """Run RAM experiment.
    11                                         
    12                                             python -m memory_profiler ./benchmarks/bench_var.py
    13                                             """
    14                                             # compute time-varying var
    15   1389.2 MiB    604.8 MiB           1       conn = vector_auto_regression(data, times=times, lags=5)

# statsmodels implementation
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    18    780.2 MiB    780.2 MiB           1   @profile
    19                                         def run_sm_experiment(sample_data):
    20                                             """Run RAM experiment with statsmodels."""
    21                                             # statsmodels feeds in (n_samples, n_channels)
    22    780.3 MiB      0.1 MiB           1       sm_var = VAR(endog=sample_data.squeeze().T)
    23   1233.4 MiB    453.1 MiB           1       sm_params = sm_var.fit(maxlags=5, trend='n')

# for loop implementation
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     8    789.0 MiB    789.0 MiB           1   @profile
     9                                         def run_experiment(data, times):
    10                                             """Run RAM experiment.
    11                                         
    12                                             python -m memory_profiler ./benchmarks/bench_var.py
    13                                             """
    14                                             # compute time-varying var
    15   2107.6 MiB   1318.6 MiB           1       conn = vector_auto_regression(data, times=times, lags=5)
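
For contrast with _test_forloop() above, the approach the PR ends up keeping builds the full lagged design matrix once and solves a single least-squares problem, which is roughly what statsmodels does internally. A minimal sketch of that idea (not the PR's exact code):

import numpy as np


def _vectorized_fit(X, lags, l2_reg=0):
    """Solve the VAR least-squares problem with one lstsq call.

    X has shape (n_times, n_equations); returns the stacked lag
    coefficients with shape (n_equations * lags, n_equations).
    """
    n_times, n_equations = X.shape
    # design matrix: row t holds [x_{t+lags-1}, ..., x_t], i.e. lag 1
    # through lag p of the target sample x_{t+lags}
    Z = np.hstack([X[lags - k - 1:n_times - k - 1] for k in range(lags)])
    y = X[lags:]
    if l2_reg != 0:
        # ridge-regularized normal equations
        A = Z.T @ Z + l2_reg * np.eye(n_equations * lags)
        return np.linalg.solve(A, Z.T @ y)
    return np.linalg.lstsq(Z, y, rcond=1e-15)[0]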

@adam2392 (Member, author) commented:

If that looks okay, then I can revert to the original method, clean up the code, and keep the bench_var.py file in there for future people to test against statsmodels.

@adam2392 (Member, author) commented:

Running bench_var.py, I see pretty similar RAM performance to statsmodels.

When we do a for loop over samples and lags, the code is significantly slower, so I think it's best to stick with an algorithm similar to the one statsmodels uses. This is good to go now from my end. Once you approve this, @agramfort, I can remove the _test_forloop() inside var.py, which shows how we compute the VAR model using your suggested for loop.

As a result, I also removed scikit-learn from our dependencies.

@agramfort (Member) left a review comment:

I trust your judgement @adam2392

If you have tests against statsmodels you should be good!

thx

@adam2392 merged commit 8373167 into mne-tools:main on Oct 26, 2021
@adam2392 deleted the var branch on October 26, 2021 at 23:57
tsbinns pushed a commit to tsbinns/mne-connectivity that referenced this pull request on Dec 15, 2023: [MRG] Improve the VAR module with: i) regression tests against statsmodels, ii) selecting order and iii) improved lags handling (mne-tools#46)

* Improve VAR ram usage

* Adding updated functionality from statsmodels

* Fix

* Fix whatsnew

* Clean rebase

* Clean rebase

* Clean up

* Adding utils function for block companion

* Adding companion matrix formulation

* Nest imports

* Fix docs

* Apply suggestions from code review

Co-authored-by: Alexandre Gramfort <alexandre.gramfort@m4x.org>

* Fix coverage

* Fix coverage

* Fix unit tests

* Update mne_connectivity/base.py

Co-authored-by: Eric Larson <larson.eric.d@gmail.com>

* Try again

* Fix unit tests coverage

* Fix

* Fix

* Adding benchmarks

* Fixing benchmarks

* Try again

* Add benchmarks to date

* Fix manifest

* Fix manifest

* Clean up sklearn

Co-authored-by: Alexandre Gramfort <alexandre.gramfort@m4x.org>
Co-authored-by: Eric Larson <larson.eric.d@gmail.com>
Successfully merging this pull request may close these issues:

  • Tests and consolidation of VAR options
  • RAM Usage in VAR Models