Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix of #3240 (Drop Temporary file usage while reading data) #3346

Open
wants to merge 30 commits into
base: master
Choose a base branch
from

Conversation

rizac
Copy link
Contributor

@rizac rizac commented Aug 30, 2023

What does this PR do?

This PR tries to fix the first issue (point 1) in #3240, allowing now to seamlessly read from both file-like objects and file paths and eliminating the highly inefficient fallback of writing data to disk on TypeErrors (if these errors happen now, there is not need for the fallback, and read simply raises them)

Note for reviewers:

  1. (fixed, see thread on this page) I could not write support for file-like objects in the case of 'GCF' format because the code in the related module was quite criptic: as such, this format still allows only file paths as argument
  2. The point 2. mentioned in Drop TemporaryFiles usage while reading data #3240, i.e. removing temporary files when reading from URL, requires more workaround and I dropped it for the moment

In account of the two points above, I am available to improve my PR or provide an additional one, but probably some discussion is needed. I am available even via zoom in case as the problem might require some round table and pair programming

Why was it initiated? Any relevant Issues?

See point 1. of #3240

PR Checklist

  • Correct base branch selected? master for new features, maintenance_... for bug fixes
  • This PR is not directly related to an existing issue (which has no PR yet).
  • While the PR is still work-in-progress, the no_ci label can be added to skip CI builds
  • If the PR is making changes to documentation, docs pages can be built automatically.
    Just add the build_docs tag to this PR.
    Docs will be served at docs.obspy.org/pr/{branch_name} (do not use master branch).
    Please post a link to the relevant piece of documentation.
  • If all tests including network modules (e.g. clients.fdsn) should be tested for the PR,
    just add the test_network tag to this PR.
  • All tests still pass (locally).
  • Any new features or fixed regressions are covered via new tests.
  • Any new or changed features are fully documented.
  • Significant changes have been added to CHANGELOG.txt .
  • First time contributors have added your name to CONTRIBUTORS.txt .
  • If the changes affect any plotting functions you have checked that the plots
    from all the CI builds look correct. Add the "upload_plots" tag so that plotting
    outputs are attached as artifacts.
  • New modules, add the module to CODEOWNERS with your github handle
  • Add the yellow ready for review label when you are ready for the PR to be reviewed.

@rizac rizac changed the title Fix of #3240 (DropTemporary file usage while reading data, point 1.) Fix of #3240 (Drop Temporary file usage while reading data, point 1.) Aug 30, 2023
@paitor
Copy link
Contributor

paitor commented Aug 30, 2023

Hi

Don't have any objections to the changes done to the gcf reader but perhaps that the added guard in lines 214 and 217 seems a bit superfluous, but perhaps I missing something here.

Could you please give some more details what was cryptic in the code base and I can perhaps walk you through this.

@rizac
Copy link
Contributor Author

rizac commented Aug 31, 2023

Hi
thanks for the reply.

Don't have any objections to the changes done to the gcf reader but perhaps that the added guard in lines 214 and 217 seems a bit superfluous, but perhaps I missing something here.

The try catch in lines 214-217 (I guess you refer to that) is a simple workaround to prevent a NameError because obj in the old code might not have been initialized when releasing memory

Could you please give some more details what was cryptic in the code base and I can perhaps walk you through this.

The problem with gcf is that I do not see where the passed file (currently a str) is open, so I cannot work out a solution for file-like objects. If you look at the code for _is_gcf (read_gcf in the same module works the same way):

obspy/obspy/io/gcf/core.py

Lines 210 to 212 in 417b047

obj = _GcfFile()
b_filename = filename.encode('utf-8')
ret = gcf_io.read_gcf(b_filename, obj, 3)

It looks like everything is delegated to gcf_io.read_gcf, which accepts a file path (encoded as bytes) and the _GcfFile object. If the file path filename is a file-like object, how could the code work here?

Thank you very much in advance
(Ps: I am not an expert of the format so keep in mind I might miss something trivial for you)

Ps: Writing to TemporaryFile for gcf files only might be a workaround for the function read_cgf, but within _is_gcf is highly inefficient and defeats the purpose of this PR

@paitor
Copy link
Contributor

paitor commented Aug 31, 2023

Hi

First of all, thanks for working on this think it'll be a nice contrib.

The python part of the code sends the file path to the underlying c-code where the file is open, see:

/* open a file for reading */
int opengcf(const char *fname, int32 *fid) {
if ((*fid = open_r(fname, ORFLAG)) < 0 ) {
return 1;
}
return 0;
}

where open_r and ORFLAG are macros defined in:

https://github.com/obspy/obspy/blob/417b047a2c13491232a3b472a9cf3c0a78785f22/obspy/io/gcf/src/gcf_io.h

in order to have the c-code compile on (most) platforms. So what needs to be done to the code i:

  1. to move the opening of the file from the C-code to the python code (i.e. into read_gcf)
  2. figure out how to convert the binary stream into a proper file descriptor to send to the c-code
  3. in the c-code remove the call to opengcf.
    I've never done 2. so can't really advice you how to go about with this (but perhaps have a look in the other readers that have c-code under the hood, e.g. mseed).

Also in _is_gcf there is a check that the input is a file and then the size of the file as a gcf file always must be an integer multiple of 1024 bytes. This is a fairly quick first test to discard files that does not fit the criteria hence I think it would be nice to keep it. It's not known to me though how to extract this information from a from any type of file-like object but I assume that this can be done.

@rizac
Copy link
Contributor Author

rizac commented Sep 3, 2023

Thanks @paitor for the suggestion, it really helped.

Eventually I opted to keep away from the rabbit hole of binary streams, file descriptors, C and O/S compatibility (which I tried to sort out without success).

I therefore proceeded to:

  1. Made the gcf C implementation more modular, and therefore read_gcf easier to translate into Python
  2. Translated read_gcf into pure Python, and from there manage the file and file-like objects more easily

Tests are passing on my local machine (btw, why are they not passing on Github CLI?)

@paitor
Copy link
Contributor

paitor commented Sep 4, 2023

Hi

Don't see any problems with the changed code but out of curiosity, did you try to benchmark the performance? One of the reason for asking is that my decision to update the code from a pure python implementation to an underlying C-implementation were that the C-implementation resulted in an approx 80-fold speed-up in reading gcf data. Would be interesting to see if the update affects the performance and if so if this is a greater loss than the gain.

I have never done a review before so leave this to the other reviewers.

@rizac
Copy link
Contributor Author

rizac commented Sep 5, 2023

@patior On my computer (macOS 2.6 GHz 6-Core Intel Core i7):
Code:

python -m timeit -n 200 -s "..." "read('.../obspy/obspy/io/gcf/tests/data/20160603_1910n.gcf', format='GCF')" 

Results:
obspy prior to this PR (version 1.4.0, installed for another project):

200 loops, best of 5: 1.45 msec per loop
200 loops, best of 5: 941 usec per loop

this PR version:

200 loops, best of 5: 1.11 msec per loop
200 loops, best of 5: 1.08 msec per loop

So I do not see any hint for a significant performance difference. However, we should always keep in mind to benchmark any performance change with the performance improvements that this PR aims to give, not only in terms of computing speed, but also in terms of time released from implementing custom code or wrapper functions outside obpsy, in order to prevent writing several files to disk needlessly (as it happened to us).

I am trying to refactor text files because I saw some improvement that could be done to the code, then I would also wait for other reviewers because the PR will probably need some feedback I guess

@rizac rizac changed the title Fix of #3240 (Drop Temporary file usage while reading data, point 1.) Fix of #3240 (Drop Temporary file usage while reading data) Sep 12, 2023
@rizac
Copy link
Contributor Author

rizac commented Nov 25, 2023

Any feedback?

@megies
Copy link
Member

megies commented Nov 30, 2023

Sorry for not seeing this earlier. I'll give this a proper look and review soon! 😬

@megies megies added this to the 1.5.0 milestone Nov 30, 2023
@megies megies added the .core issues affecting our functionality at the very core label Nov 30, 2023
Copy link
Member

@megies megies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice further improvement on unifying reading/writing routines. I tried to be as thorough as possible since this touches the very core of things, but in some instances it's hard to think everything through, so to some extent we'll have to rely on our test suite.

There's some comments that need addressing but overall I don't see a reason not to merge this after that. 👍

obspy/core/tests/test_waveform_plugins.py Outdated Show resolved Hide resolved
obspy/core/tests/test_waveform_plugins.py Outdated Show resolved Hide resolved
obspy/core/util/base.py Outdated Show resolved Hide resolved
obspy/core/util/base.py Outdated Show resolved Hide resolved
obspy/core/util/base.py Show resolved Hide resolved
obspy/io/gcf/core.py Show resolved Hide resolved
obspy/io/gcf/core.py Outdated Show resolved Hide resolved
obspy/io/gse2/core.py Outdated Show resolved Hide resolved
obspy/io/gse2/core.py Outdated Show resolved Hide resolved
@megies
Copy link
Member

megies commented Feb 21, 2024

Oh and this might need a rebase eventually too

@rizac
Copy link
Contributor Author

rizac commented Feb 22, 2024

Thanks @megies , I'll try to go through all your comments the next days and commit changes to this PR
(hope there isn't any particular forthcoming deadline)

@megies
Copy link
Member

megies commented Feb 27, 2024

(hope there isn't any particular forthcoming deadline)

It's in the github milestone for 1.5.0 and it won't get left out for sure, no worries

@rizac
Copy link
Contributor Author

rizac commented Mar 16, 2024

Fixed the last comments of @megies (thanks for the review), apart from the ascii question (see comment to the only unresolved issue above).

As a side note, just for safety, as @megies pointed out this PR touches the very core of things and nobody of us is expert in everything, if you guys know any person experienced in specific code blocks that can check the modifications (or run this branch in their usual workflow) to see if there are additional tests to add, I would be happy to have them invited to the discussion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
.core issues affecting our functionality at the very core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants