FIX: reading large CNT files (more than 2Gb files) #6537

Merged
merged 4 commits into from
Jul 10, 2019

Conversation

massich
Contributor

@massich massich commented Jul 8, 2019

Fixes #6535

@massich
Contributor Author

massich commented Jul 8, 2019

@LM-thinking can you try this in your files? Thx.

@massich
Contributor Author

massich commented Jul 8, 2019

Let's see how the CIs turn out.

@lgtm-com

lgtm-com bot commented Jul 8, 2019

This pull request introduces 1 alert when merging a34a635 into 7d64af3 - view on LGTM.com

new alerts:

  • 1 for Unused local variable

@codecov

codecov bot commented Jul 9, 2019

Codecov Report

Merging #6537 into master will decrease coverage by 1.4%.
The diff coverage is 62.5%.

@@            Coverage Diff             @@
##           master    #6537      +/-   ##
==========================================
- Coverage   89.34%   87.93%   -1.41%     
==========================================
  Files         416      416              
  Lines       74865    74882      +17     
  Branches    12341    12343       +2     
==========================================
- Hits        66885    65845    -1040     
- Misses       5137     6194    +1057     
  Partials     2843     2843

@lgtm-com

lgtm-com bot commented Jul 9, 2019

This pull request introduces 2 alerts when merging 85e376d into 70789c4 - view on LGTM.com

new alerts:

  • 2 for Unused local variable

@AimerLee

After reviewing the commit history of cnt.py, I found a similar problem in #4520, where the reporter mentioned that files bigger than 2 GB result in a negative event_offset.

Yeah, I have the problematic file. It's 2GB. The event_offset field in the file header was a negative value. The rest of the header seems okay. The first if condition in the following code only serves to detect the event_offset overflow, and in fact is not needed once the bug is fixed.

        if event_offset < data_offset:  # no events
            data_size = n_samples * n_channels
        else:
            data_size = event_offset - (data_offset + 75 * n_channels)
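
To illustrate the overflow described above: the event table offset is read from the header as a little-endian int32 ('<i4'), so a true offset beyond 2**31 - 1 comes back wrapped. A minimal sketch, assuming the field keeps only the low 4 bytes of the true offset (as_int32_field is a hypothetical helper, not part of mne):

```python
import numpy as np


def as_int32_field(true_offset):
    """Return what reading the low 4 bytes of true_offset as '<i4' gives.

    Hypothetical helper for illustration only.
    """
    low_4_bytes = (true_offset % 2**32).to_bytes(4, 'little')
    (stored,) = np.frombuffer(low_4_bytes, dtype='<i4')
    return int(stored)


# Offset of the >2 GB file in the trace later in this thread.
wrapped = as_int32_field(4971618850)
# An assumed offset between 2**31 and 2**32, which reads back negative.
negative = as_int32_field(3_000_000_000)
print(wrapped, negative)
```

With these inputs, wrapped is 676651554, the value the trace reports as read from the header, and negative is below zero, which is the situation the event_offset < data_offset check catches.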

@AimerLee

AimerLee commented Jul 10, 2019

In _utils, the CNT file size is used to decide whether to calculate the event table position. This is not very robust, because a 2 GB file size does not coincide exactly with the overflow threshold of an int32.

One possible solution is the following code, which obtains the event table position by checking whether the binary representation of the calculated event_table_pos ends with the binary representation of the value read directly from the header. It infers n_bytes at the same time.

fid.seek(SETUP_NSAMPLES_OFFSET)
(n_samples,) = np.frombuffer(fid.read(4), dtype='<i4')

fid.seek(SETUP_NCHANNELS_OFFSET)
(n_channels,) = np.frombuffer(fid.read(2), dtype='<u2')

fid.seek(SETUP_EVENTTABLEPOS_OFFSET)
(event_table_pos,) = np.frombuffer(fid.read(4), dtype='<i4')


def _infer_n_bytes_event_table_pos(readed_event_table_pos):
    readed_event_table_pos_feature = np.binary_repr(readed_event_table_pos).lstrip('-')

    for n_bytes in [2, 4]:
        event_table_pos = (900 + 75 * int(n_channels) + n_bytes * int(n_channels) * int(n_samples))
        if np.binary_repr(event_table_pos).endswith(readed_event_table_pos_feature):
            return n_bytes, event_table_pos
    raise Exception("event_table_dismatch") 
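
A possible variant of the same idea, sketched here under the assumption that the stored value is the true position truncated to 32 bits: compare the two values modulo 2**32 instead of comparing binary strings. This also covers negative header values, for which np.binary_repr returns a minus sign followed by the magnitude rather than the two's-complement bits. The function name and the sample numbers are hypothetical.

```python
def infer_n_bytes_mod32(read_event_table_pos, n_channels, n_samples):
    """Return (n_bytes, event_table_pos) whose low 32 bits match the header.

    Hypothetical variant of _infer_n_bytes_event_table_pos above.
    """
    for n_bytes in (2, 4):
        event_table_pos = (900 + 75 * int(n_channels) +
                           n_bytes * int(n_channels) * int(n_samples))
        # Python's % returns a non-negative result for a positive modulus,
        # so a negative int32 maps back to its unsigned bit pattern.
        if event_table_pos % 2**32 == int(read_event_table_pos) % 2**32:
            return n_bytes, event_table_pos
    raise ValueError('event table position matches neither n_bytes candidate')


# Made-up header values: 64 channels, 12 million samples, 32-bit data.
# The true position 3072005700 would read back as -1222961596 from '<i4'.
result = infer_n_bytes_mod32(-1222961596, 64, 12_000_000)
print(result)
```

Like the original, this still depends on the n_samples value from the header being correct.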

Note that everything above relies on the n_samples value being reliable, but the comment in cnt.py says it is not.

    # Header has a field for number of samples, but it does not seem to be
    # too reliable. That's why we have option for setting n_bytes manually.
    fid.seek(864)
    n_samples = np.fromfile(fid, dtype='<i4', count=1)[0]

However, the n_samples metadata in the SETUP part is trusted in the EEGLAB source code, so it seems unreasonable to say it is not reliable.

@massich
Contributor Author

massich commented Jul 10, 2019

@AimerLee makes sense. However, it is not working for me. I'm sure it is a silly mistake.

@massich massich force-pushed the fix_cnt branch 2 times, most recently from 660eb9d to b16c2f3 July 10, 2019 11:21
@massich
Contributor Author

massich commented Jul 10, 2019

I liked 660eb9d: it used event_table_pos to compute n_bytes and it was clean. But I don't understand why it does not work.

I've put together this toy example from the private CNT data people have been sharing with me over time.

import os.path as op

import numpy as np
import pytest

from mne import __file__ as _mne_file
from mne.utils import run_tests_if_main
from mne.io.cnt import read_raw_cnt
from mne.io.cnt._utils import CNTEventType1, CNTEventType2, CNTEventType3

def foo(fname):
    SETUP_NCHANNELS_OFFSET = 370
    SETUP_NSAMPLES_OFFSET = 864
    SETUP_EVENTTABLEPOS_OFFSET = 886

    def _compute(xx):
        return (900 + 75 * int(n_channels) +
                xx * int(n_channels) * int(n_samples))

    with open(fname, 'rb') as fid:

        fid.seek(SETUP_NSAMPLES_OFFSET)
        (n_samples,) = np.frombuffer(fid.read(4), dtype='<i4')

        fid.seek(SETUP_NCHANNELS_OFFSET)
        (n_channels,) = np.frombuffer(fid.read(2), dtype='<u2')

        fid.seek(SETUP_EVENTTABLEPOS_OFFSET)
        (readed_event_table_pos,) = np.frombuffer(fid.read(4), dtype='<i4')

    print('readed_event_table_pos: ', readed_event_table_pos)
    print('readed     b:', np.binary_repr(readed_event_table_pos))
    print('readed_    b:', np.binary_repr(readed_event_table_pos).lstrip('-'))
    print('compute(2) b:', np.binary_repr(_compute(2)))
    print('compute(4) b:', np.binary_repr(_compute(4)))
    print('compute(2) :', _compute(2))
    print('compute(4) :', _compute(4))

    readed_event_table_pos_feature = np.binary_repr(
        readed_event_table_pos).lstrip('-')

    for n_bytes_candidate in [2, 4]:
        computed_event_table_pos = _compute(n_bytes_candidate)
        if (
            np.binary_repr(computed_event_table_pos)
            .endswith(readed_event_table_pos_feature)
        ):
            n_bytes = n_bytes_candidate
            event_table_pos = computed_event_table_pos
            print('match found: (', 'n_bytes: ', n_bytes,
                  'event_table_pos: ', computed_event_table_pos, ')')
            break
    else:
        # The loop ended without a break: neither candidate matched.
        n_bytes, event_table_pos = None, None
        print('No match')

    return n_channels, n_samples, event_table_pos, n_bytes


pp = pytest.param
@pytest.mark.parametrize((
    'fname,expected_n_bytes,expected_event_type,expected_event_table_pos'), [
    pp(op.join(op.dirname(_mne_file),
               '../sandbox/data/914flankers.cnt'),
       4, CNTEventType2, 156474479),

    pp(op.join(op.dirname(_mne_file),
               '../sandbox/data/confidential/cnt/cont_68chan_32bit.cnt'),
       4, CNTEventType2, 57267440),

    pp(op.join(op.dirname(_mne_file),
               '../sandbox/data/confidential/cnt/'
               'pilote_resting_01_neurospin_2019-03-04_15-18-40.cnt'),
       2, CNTEventType2, 15518747),

    pp(op.join(op.dirname(_mne_file),
               '../sandbox/data/confidential/cnt/'
               'cont_22chan_4gb_32bit_toolong.cnt'),
       4, CNTEventType3, 4971618850),

    pp(op.join(op.dirname(_mne_file),
               '../sandbox/data/confidential/cnt/SampleCNTFile_16bit.cnt'),
       2, CNTEventType2, 133700),

    pp(op.join(op.dirname(_mne_file),
               '../sandbox/data/confidential/cnt/cnt_files/'
               'SampleCNTFile_16bit.cnt'),
       2, CNTEventType2, 133700),

    pp(op.join(op.dirname(_mne_file),
               '../sandbox/data/confidential/cnt/cnt_files/'
               'BoyoAEpic1_16bit.cnt'),
       2, CNTEventType2, 78536260),

    pp(op.join(op.dirname(_mne_file),
               '../sandbox/data/confidential/cnt/cnt_files/'
               'cont_67chan_resp_32bit.cnt'),
       4, CNTEventType2, 54570725),

    pp(op.join(op.dirname(_mne_file),
               '../sandbox/data/confidential/cnt/BoyoAEpic1_16bit.cnt'),
       2, CNTEventType2, 78536260),

    pp(op.join(op.dirname(_mne_file),
               '../sandbox/data/confidential/cnt/cont_67chan_resp_32bit.cnt'),
       4, CNTEventType2, 54570725),
], ids=[str(n) for n in range(10)]
)
def test_foo(fname, expected_n_bytes, expected_event_type,
             expected_event_table_pos):
    """Test reading raw cnt files."""
    print('\n')
    print('expected_n_bytes: ', expected_n_bytes)
    print('expected_event_table_pos: ', expected_event_table_pos)
    n_channels, n_samples, event_table_pos, n_bytes = foo(fname)
    assert True


run_tests_if_main()

Check out the trace:

rootdir: /home/sik/code/mne-python, inifile: setup.cfg
plugins: timeout-1.3.3, sugar-0.9.2, pudb-0.7.0, mock-1.10.3, faulthandler-1.5.0, cov-2.6.1
collecting ... 

expected_n_bytes:  4
expected_event_table_pos:  156474479
readed_event_table_pos:  156474479
readed     b: 1001010100111001110001101111
readed_    b: 1001010100111001110001101111
compute(2) b: 100101001110011110011100101
compute(4) b: 1001010011100110001010100101
compute(2) : 78068965
compute(4) : 156132005
No match

 sandbox/mwe/6535_test_cnt.py ✓                                                            10% █         

expected_n_bytes:  4
expected_event_table_pos:  57267440
readed_event_table_pos:  57267440
readed     b: 11011010011101010011110000
readed_    b: 11011010011101010011110000
compute(2) b: 1101101001111011000110000
compute(4) b: 11011010011101010011110000
compute(2) : 28636720
compute(4) : 57267440
match found: ( n_bytes:  4 event_table_pos:  57267440 )
 sandbox/mwe/6535_test_cnt.py ✓✓                                                           20% ██        

expected_n_bytes:  2
expected_event_table_pos:  15518747
readed_event_table_pos:  15518747
readed     b: 111011001100110000011011
readed_    b: 111011001100110000011011
compute(2) b: 101000100111011
compute(4) b: 1001100000110101
compute(2) : 20795
compute(4) : 38965
No match
 sandbox/mwe/6535_test_cnt.py ✓✓✓                                                          30% ███       

expected_n_bytes:  4
expected_event_table_pos:  4971618850
readed_event_table_pos:  676651554
readed     b: 101000010101001110001000100010
readed_    b: 101000010101001110001000100010
compute(2) b: 10010100001010100111011010100010
compute(4) b: 100101000010101001110001000100010
compute(2) : 2485810850
compute(4) : 4971618850
match found: ( n_bytes:  4 event_table_pos:  4971618850 )
 sandbox/mwe/6535_test_cnt.py ✓✓✓✓                                                         40% ████      

expected_n_bytes:  2
expected_event_table_pos:  133700
readed_event_table_pos:  133700
readed     b: 100000101001000100
readed_    b: 100000101001000100
compute(2) b: 100000101001000100
compute(4) b: 111111111001000100
compute(2) : 133700
compute(4) : 261700
match found: ( n_bytes:  2 event_table_pos:  133700 )
 sandbox/mwe/6535_test_cnt.py ✓✓✓✓✓                                                        50% █████     

expected_n_bytes:  2
expected_event_table_pos:  133700
readed_event_table_pos:  133700
readed     b: 100000101001000100
readed_    b: 100000101001000100
compute(2) b: 100000101001000100
compute(4) b: 111111111001000100
compute(2) : 133700
compute(4) : 261700
match found: ( n_bytes:  2 event_table_pos:  133700 )
 sandbox/mwe/6535_test_cnt.py ✓✓✓✓✓✓                                                       60% ██████    

expected_n_bytes:  2
expected_event_table_pos:  78536260
readed_event_table_pos:  78536260
readed     b: 100101011100101111001000100
readed_    b: 100101011100101111001000100
compute(2) b: 1111111111111111111010100110111000100
compute(4) b: 11111111111111111110101000010101000100
compute(2) : 137438776772
compute(4) : 274877547844
No match
 sandbox/mwe/6535_test_cnt.py ✓✓✓✓✓✓✓                                                      70% ███████   

expected_n_bytes:  4
expected_event_table_pos:  54570725
readed_event_table_pos:  54570725
readed     b: 11010000001010111011100101
readed_    b: 11010000001010111011100101
compute(2) b: 1101000000110001100000101
compute(4) b: 11010000001010111011100101
compute(2) : 27288325
compute(4) : 54570725
match found: ( n_bytes:  4 event_table_pos:  54570725 )
 sandbox/mwe/6535_test_cnt.py ✓✓✓✓✓✓✓✓                                                     80% ████████  

expected_n_bytes:  2
expected_event_table_pos:  78536260
readed_event_table_pos:  78536260
readed     b: 100101011100101111001000100
readed_    b: 100101011100101111001000100
compute(2) b: 1111111111111111111010100110111000100
compute(4) b: 11111111111111111110101000010101000100
compute(2) : 137438776772
compute(4) : 274877547844
No match
 sandbox/mwe/6535_test_cnt.py ✓✓✓✓✓✓✓✓✓                                                    90% █████████ 

expected_n_bytes:  4
expected_event_table_pos:  54570725
readed_event_table_pos:  54570725
readed     b: 11010000001010111011100101
readed_    b: 11010000001010111011100101
compute(2) b: 1101000000110001100000101
compute(4) b: 11010000001010111011100101
compute(2) : 27288325
compute(4) : 54570725
match found: ( n_bytes:  4 event_table_pos:  54570725 )
 sandbox/mwe/6535_test_cnt.py ✓✓✓✓✓✓✓✓✓✓                                                  100% ██████████

It works for some cases; in particular, it fixes the >2Gb case:

expected_n_bytes:  4
expected_event_table_pos:  4971618850
readed_event_table_pos:  676651554
readed     b: 101000010101001110001000100010
readed_    b: 101000010101001110001000100010
compute(2) b: 10010100001010100111011010100010
compute(4) b: 100101000010101001110001000100010
compute(2) : 2485810850
compute(4) : 4971618850
match found: ( n_bytes:  4 event_table_pos:  4971618850 )
 sandbox/mwe/6535_test_cnt.py ✓✓✓✓                                                         40% ████      

But it fails to find a match in cases where it should.
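
One way to read the failing cases, sketched under the assumption that the 900 + 75 * n_channels + n_bytes * n_channels * n_samples layout used in foo is right: recover the header's n_channels and n_samples from the two printed compute values, then ask what n_samples the event table position read from the header would imply. Both helpers are hypothetical and only do arithmetic on numbers printed in the trace.

```python
def header_fields_from_trace(compute_2, compute_4):
    """Recover (n_channels, n_samples) from the compute(2)/compute(4) lines."""
    # compute(4) - compute(2) == 2 * n_channels * n_samples
    n_channels_times_n_samples = (compute_4 - compute_2) // 2
    # What remains of compute(2) is 900 + 75 * n_channels.
    header_bytes = compute_2 - 2 * n_channels_times_n_samples
    n_channels = (header_bytes - 900) // 75
    n_samples = n_channels_times_n_samples // n_channels
    return n_channels, n_samples


def implied_n_samples(event_table_pos, n_channels, n_bytes):
    """Return (n_samples, leftover_bytes) implied by an event table position."""
    data_bytes = event_table_pos - (900 + 75 * n_channels)
    return divmod(data_bytes, n_bytes * n_channels)


# Third file in the trace (expected_n_bytes: 2, 'No match').
n_channels, header_n_samples = header_fields_from_trace(20795, 38965)
print(n_channels, header_n_samples)
print(implied_n_samples(15518747, n_channels, 2))
print(implied_n_samples(15518747, n_channels, 4))
```

For the third file the header fields come out as 23 channels and 395 samples, while the event table position read from the file corresponds to exactly 337307 samples of 2 bytes (and leaves a remainder for 4 bytes). That would be consistent with the cnt.py comment that the n_samples field is not always reliable, although it is only an inference from this trace.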

@massich
Contributor Author

massich commented Jul 10, 2019

@agramfort, @AimerLee maybe we should merge with <2Gb for the moment and continue this conversation in #6550.

@agramfort agramfort merged commit 7f77f3c into mne-tools:master Jul 10, 2019
@agramfort
Member

thx @massich

Development

Successfully merging this pull request may close these issues.

Error reading big CNT file
3 participants