Fix of #3240 (Drop Temporary file usage while reading data) #3346

rizac · 2023-08-30T08:06:38Z

What does this PR do?

This PR tries to fix the first issue (point 1) in #3240: read can now seamlessly handle both file-like objects and file paths, and the highly inefficient fallback of writing data to disk on a TypeError is removed (if such an error occurs now, there is no need for the fallback, and read simply raises it).
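To illustrate the idea, here is a minimal sketch of how a reader can accept either a path or an already open binary stream without touching the disk. It is not the code in this PR: the names `as_binary_stream` and `read_header` are hypothetical and are not obspy functions.

```python
import io
import os
from contextlib import contextmanager


@contextmanager
def as_binary_stream(source):
    """Yield a readable binary stream for a path or a file-like object.

    Hypothetical helper for illustration only; not part of obspy.
    """
    if isinstance(source, (str, bytes, os.PathLike)):
        # We opened the file, so we are responsible for closing it.
        with open(source, "rb") as fh:
            yield fh
    elif hasattr(source, "read"):
        # The caller owns the stream: read from it but leave it open.
        yield source
    else:
        raise TypeError(
            f"expected a path or file-like object, got {type(source)!r}")


def read_header(source, size=4):
    """Read the first `size` bytes without ever writing to disk."""
    with as_binary_stream(source) as fh:
        return fh.read(size)


buf = io.BytesIO(b"abcdefgh")
header = read_header(buf)   # bytes come straight from memory
still_open = not buf.closed  # the caller's stream is left open
```

With this shape an unsupported input surfaces as a plain TypeError, which matches the behaviour described above: the error is raised rather than triggering a write to a temporary file.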

Note for reviewers:

(fixed, see thread on this page) ~~I could not write support for file-like objects in the case of 'GCF' format because the code in the related module was quite cryptic: as such, this format still allows only file paths as argument~~
The point 2. mentioned in Drop TemporaryFiles usage while reading data #3240, i.e. removing temporary files when reading from URL, requires more work, so I have dropped it for the moment

In view of the two points above, I am available to improve this PR or provide an additional one, but some discussion is probably needed. I am also available via Zoom if needed, as the problem might require a round table and some pair programming

Why was it initiated? Any relevant Issues?

See point 1. of #3240

PR Checklist


paitor · 2023-08-30T17:57:25Z

Hi

I don't have any objections to the changes made to the gcf reader, although the added guard in lines 214 and 217 seems a bit superfluous; perhaps I am missing something here.

Could you please give some more details on what was cryptic in the code base? Perhaps I can walk you through it.

rizac · 2023-08-31T07:15:46Z

Hi
thanks for the reply.

> I don't have any objections to the changes made to the gcf reader, although the added guard in lines 214 and 217 seems a bit superfluous; perhaps I am missing something here.

The try/except in lines 214-217 (I guess that is what you refer to) is a simple workaround to prevent a NameError: in the old code, obj might not have been initialized by the time memory is released
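The failure mode can be reproduced in isolation. This is a sketch, assuming a cleanup step in `finally` that refers to a name bound inside `try`; `make_obj`, `release` and `failing` are hypothetical stand-ins, not functions from obspy/io/gcf/core.py.

```python
def release(obj):
    """Stand-in for freeing the memory held by `obj`."""
    obj.clear()


def unguarded(make_obj):
    try:
        obj = make_obj()  # if this raises, `obj` is never bound
        return len(obj)
    finally:
        # Raises UnboundLocalError (a NameError subclass) when `obj` was
        # never bound, which hides the original exception.
        release(obj)


def guarded(make_obj):
    obj = None  # bind the name before anything can fail
    try:
        obj = make_obj()
        return len(obj)
    finally:
        if obj is not None:
            release(obj)


def failing():
    """Stand-in for a constructor that fails, e.g. on an unreadable file."""
    raise OSError("cannot read")
```

Calling `unguarded(failing)` ends in a NameError, while `guarded(failing)` lets the original OSError through, which is the behaviour the guard is meant to restore.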

> Could you please give some more details on what was cryptic in the code base? Perhaps I can walk you through it.

The problem with gcf is that I do not see where the passed file (currently a str) is opened, so I cannot work out a solution for file-like objects. Take the code for _is_gcf (read_gcf in the same module works the same way):

obspy/obspy/io/gcf/core.py

Lines 210 to 212 in 417b047

    
    obj = _GcfFile()
    b_filename = filename.encode('utf-8')
    ret = gcf_io.read_gcf(b_filename, obj, 3)

It looks like everything is delegated to gcf_io.read_gcf, which accepts a file path (encoded as bytes) and the _GcfFile object. If the file path filename is a file-like object, how could the code work here?

Thank you very much in advance
(PS: I am not an expert in the format, so keep in mind I might be missing something that is trivial for you)

PS: Writing to a TemporaryFile for gcf files only might be a workaround for the function read_gcf, but within _is_gcf it is highly inefficient and defeats the purpose of this PR

paitor · 2023-08-31T08:34:04Z

Hi

First of all, thanks for working on this; I think it'll be a nice contribution.

The Python part of the code sends the file path to the underlying C code, where the file is opened, see:

obspy/obspy/io/gcf/src/gcf_io.c

Lines 534 to 540 in 417b047

    
    /* open a file for reading */
    int opengcf(const char *fname, int32 *fid) {
       if ((*fid = open_r(fname, ORFLAG)) < 0) {
          return 1;
       }
       return 0;
    }

where open_r and ORFLAG are macros defined in:

https://github.com/obspy/obspy/blob/417b047a2c13491232a3b472a9cf3c0a78785f22/obspy/io/gcf/src/gcf_io.h

in order to have the C code compile on (most) platforms. So what needs to be done to the code is:

1. move the opening of the file from the C code to the Python code (i.e. into read_gcf)
2. figure out how to convert the binary stream into a proper file descriptor to send to the C code
3. remove the call to opengcf in the C code.

I've never done 2., so I can't really advise you on how to go about it (but perhaps have a look at the other readers that have C code under the hood, e.g. mseed).
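One practical obstacle for step 2, shown as a small sketch rather than a solution: only streams backed by an operating-system file expose a descriptor through `fileno()`. An in-memory `io.BytesIO` raises `io.UnsupportedOperation` instead, so there is nothing to hand to C code that expects a descriptor. The helper name `get_descriptor` is hypothetical.

```python
import io
import tempfile


def get_descriptor(stream):
    """Return the OS-level file descriptor of `stream`, or None if it has none."""
    try:
        return stream.fileno()
    except (AttributeError, io.UnsupportedOperation):
        # In-memory streams (and objects without fileno) end up here.
        return None


with tempfile.TemporaryFile() as real_file:
    fd_real = get_descriptor(real_file)  # an integer descriptor

# No descriptor exists for data that lives only in memory.
fd_memory = get_descriptor(io.BytesIO(b"\x00" * 1024))
```

A reader that must support both kinds of stream therefore needs a path that works on the bytes themselves whenever `get_descriptor` returns None.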

Also, _is_gcf checks that the input is a file and then checks its size, since the size of a gcf file must always be an integer multiple of 1024 bytes. This is a fairly quick first test to discard files that do not fit the criteria, so I think it would be nice to keep it. I don't know how to extract this information from an arbitrary file-like object, but I assume it can be done.
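For seekable streams the size can be obtained with `seek` and `tell`, so the quick check can be kept. A minimal sketch, assuming the stream is seekable; `stream_size` and `has_whole_blocks` are hypothetical names, and rejecting empty input is an extra assumption not stated above.

```python
import io
import os

BLOCK_SIZE = 1024  # block length in bytes, as stated in the comment above


def stream_size(stream):
    """Return the total size in bytes of a seekable stream.

    The current position is restored before returning.
    """
    position = stream.tell()
    try:
        # seek() returns the new absolute position, i.e. the size.
        return stream.seek(0, os.SEEK_END)
    finally:
        stream.seek(position)


def has_whole_blocks(stream, block_size=BLOCK_SIZE):
    """Cheap pre-check: size is a non-zero integer multiple of `block_size`.

    Treating an empty stream as invalid is an assumption of this sketch.
    """
    size = stream_size(stream)
    return size > 0 and size % block_size == 0


ok = has_whole_blocks(io.BytesIO(b"\x00" * 2048))
bad = has_whole_blocks(io.BytesIO(b"\x00" * 1000))
```

Non-seekable streams (for example a socket wrapper) would need a different strategy, such as reading the data into memory first.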


rizac · 2023-09-03T16:35:05Z

Thanks @paitor for the suggestion, it really helped.

Eventually I opted to keep away from the rabbit hole of binary streams, file descriptors, C and O/S compatibility (which I tried to sort out without success).

I therefore:

- made the gcf C implementation more modular, and therefore read_gcf easier to translate into Python
- translated read_gcf into pure Python, and from there managed file and file-like objects more easily

Tests are passing on my local machine (by the way, why are they not passing on the GitHub CI?)

paitor · 2023-09-04T08:30:30Z

Hi

I don't see any problems with the changed code, but out of curiosity, did you benchmark the performance? One of the reasons for asking is that I moved the code from a pure Python implementation to an underlying C implementation because the C implementation gave an approximately 80-fold speed-up in reading gcf data. It would be interesting to see whether the update affects performance and, if so, whether the loss outweighs the gain.

I have never done a review before, so I'll leave this to the other reviewers.


rizac · 2023-09-05T07:56:01Z

@paitor On my computer (macOS, 2.6 GHz 6-Core Intel Core i7):

Code:

    python -m timeit -n 200 -s "..." "read('.../obspy/obspy/io/gcf/tests/data/20160603_1910n.gcf', format='GCF')"

Results:

obspy prior to this PR (version 1.4.0, installed for another project):

    200 loops, best of 5: 1.45 msec per loop
    200 loops, best of 5: 941 usec per loop

this PR version:

    200 loops, best of 5: 1.11 msec per loop
    200 loops, best of 5: 1.08 msec per loop

So I do not see any hint of a significant performance difference. However, we should always weigh any performance change against the improvements this PR aims to deliver, not only in computing speed but also in the time saved by not having to implement custom code or wrapper functions outside obspy to avoid needlessly writing several files to disk (as happened to us).
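For reference, the same kind of measurement can be scripted with the standard `timeit` module, which also makes it easy to compare a path against an in-memory stream. This is a sketch under assumptions: `load` is a trivial stand-in for the reader under test, not obspy's read, and `best_per_loop` is a hypothetical helper, so the numbers it prints say nothing about obspy itself.

```python
import io
import os
import tempfile
import timeit


def load(source):
    """Stand-in reader: return all bytes from a path or a binary stream."""
    if hasattr(source, "read"):
        source.seek(0)
        return source.read()
    with open(source, "rb") as fh:
        return fh.read()


def best_per_loop(func, number=200, repeat=5):
    """Best of `repeat` runs of `number` calls, in seconds per call.

    This mirrors how `python -m timeit -n 200` reports "best of 5".
    """
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


payload = b"\x00" * 1024 * 16

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.bin")
    with open(path, "wb") as fh:
        fh.write(payload)
    from_path = best_per_loop(lambda: load(path))

memory = io.BytesIO(payload)
from_memory = best_per_loop(lambda: load(memory))

print(f"path:   {from_path * 1e6:.1f} usec per loop")
print(f"memory: {from_memory * 1e6:.1f} usec per loop")
```

Swapping `load` for the real reader and `payload` for a real data file would give figures comparable to the ones quoted above.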

I am currently refactoring the handling of text files because I saw some improvements that could be made to the code. After that I will wait for other reviewers, as the PR will probably need some feedback

rizac · 2023-11-25T16:34:32Z

Any feedback?

megies · 2023-11-30T09:41:46Z

Sorry for not seeing this earlier. I'll give this a proper look and review soon! 😬

megies

Looks like a nice further improvement on unifying reading/writing routines. I tried to be as thorough as possible since this touches the very core of things, but in some instances it's hard to think everything through, so to some extent we'll have to rely on our test suite.

There are some comments that need addressing, but overall I don't see a reason not to merge this after that. 👍

obspy/core/tests/test_waveform_plugins.py

obspy/core/util/base.py

obspy/io/gcf/core.py

obspy/io/gse2/core.py

obspy/io/mseed/tests/test_mseed_reading_and_writing.py

megies · 2024-02-21T18:23:50Z

Oh and this might need a rebase eventually too

rizac · 2024-02-22T13:37:38Z

Thanks @megies, I'll try to go through all your comments in the next few days and commit the changes to this PR
(hope there isn't any particular forthcoming deadline)

megies · 2024-02-27T13:45:23Z

> (hope there isn't any particular forthcoming deadline)

It's in the github milestone for 1.5.0 and it won't get left out for sure, no worries

rizac · 2024-03-16T10:54:02Z

Fixed the last comments of @megies (thanks for the review), apart from the ascii question (see comment to the only unresolved issue above).

As a side note, just for safety: as @megies pointed out, this PR touches the very core of things and none of us is an expert in everything. If you know anyone experienced in specific code blocks who could check the modifications (or run this branch in their usual workflow) to see whether additional tests should be added, I would be happy to have them invited to the discussion

rizac added 17 commits August 29, 2023 11:15

add open_bytes_stream convenience function

7b0935b

obspy.io.gcf._is_gcf returns False for in-memory binary streams

2c3f4c3

obspy.io.ah: add support for in-memory binary streams

40c1fb0

obspy.io.alsep: add support for in-memory binary streams

b48ba64

obspy.io.gse2: add support for in-memory binary streams

3d56f8c

obspy.io.segy: add support for in-memory binary streams

55cb8b6

improve function docstrings

8b4e54d

remove catching TypeError and writing to file while reading file-like objects in obspy.core.util.base._generic_reader

1d72b60

obspy/io/reftek/core.py: add support for in-memory binary data

71f9e81

obspy/io/mseed: small code refactor and test fix (BytesIO.seek(0))

8eb0c96

obspy/io/mseed/tests/test_mseed_special_issues.py fix (BytesIO.seek(0))

a61c3db

fix obspy/io/nied/tests/test_knet_reading.py::test_read_knet_ascii_from_open_files

1a54b52

fix obspy/io/nordic/tests/test_nordic.py: change test_reading_bytes_io into test_reading_string_io (as implemented in nordic/core.py

bc04c42

fix stream check with Path/str

68b7e48

fix file path + file-like objects for ascii files

82bf162

fix obspy.core.util.base.open_bytes_stream

66527bc

fix obspy/core/tests/test_waveform_plugins.py

646dc30

rizac requested review from trichter, megies, flixha, paitor, s-schneider and d-chambers as code owners August 30, 2023 08:06

rizac changed the title ~~Fix of #3240 (DropTemporary file usage while reading data, point 1.)~~ Fix of #3240 (Drop Temporary file usage while reading data, point 1.) Aug 30, 2023

rizac added 3 commits September 3, 2023 12:29

obspy.io.gcf._is_gcf returns False for in-memory binary streams

548def1

obspy.io.gcf._is_gcf returns False for in-memory binary streams / code refactor

b1c6d66

simplify code in obspy/io/alsep/wt/tape.py + small docstring fixes

0dd4cac

rizac added 6 commits September 4, 2023 11:15

obspy/core/util/base: simplify code

eccea9b

obspy/io/alesp restore np.fromfile and remove from_bytes_stream + improve tests

6631b17

obspy/io/alsep improve comments and docstrings

3608a90

obspy/io/gse2: leave stream open while checking format

219f22a

obspy/core/util/base: redefinition of get_bytes_stream + open_bytes_stream + fix several docstrings

4269f39

better docstrings fixing nordic I/O

b81e9a1

rizac added 3 commits September 5, 2023 12:29

implement open_text_stream and fix related modules: ascii, nordic

3eda840

remove leftover in obspy.core.util.base.py

9e1d4d7

minor fixes for pylint / flake8

961bb09

rizac changed the title ~~Fix of #3240 (Drop Temporary file usage while reading data, point 1.)~~ Fix of #3240 (Drop Temporary file usage while reading data) Sep 12, 2023

megies added this to the 1.5.0 milestone Nov 30, 2023

megies added the .core issues affecting our functionality at the very core label Nov 30, 2023

megies requested changes Feb 21, 2024


This was referenced Feb 22, 2024

core: Unify "_is_xxx()" file format checks regarding file handling #2459

Closed

file like objects can be used to read and write tspair and slist #3372

Open

fixes after review 2024-02

50396e0
