
Unify download for antenna models / proposal tables / shower library with additional fallback servers #673

Merged
merged 14 commits into develop on Jun 14, 2024

Conversation

shallmann
Collaborator

  • Removed code duplication by adding a download function to NuRadioReco.utilities.dataservers.py
  • Added the Chicago server as a backup; this should simplify changing/adding/removing servers in the future
  • Added an option to rank servers, not enabled by default
  • Improved the exception-handling printout (in the past, people asked whether the antenna model was misspelled or the server was down, when in fact they didn't have an internet connection)

Things one could discuss:

  • Do we want to rank servers by proximity/response time by default? I added two options, although the ranking by timezone requires additional Python packages. (@cozzyd, am I missing a much simpler solution than the function I added in dataservers.py?)
  • Are the pretabulated ARZ tables still used/useful? ./NuRadioMC/SignalGen/ARZ/ARZ.py: URL = 'http://arianna.ps.uci.edu/~arianna/data/ce_shower_library/ARZ_library_v{:d}.{:d}.pkl'.format(*self._version) I think this server does not exist any more?

cg-laser (Collaborator) left a comment:

I think this is great. You only missed one line (see in-line comment).

Let's discuss the automatic ordering in our meeting. It is fairly rare to download something, as files are always buffered locally, so it is not very important to optimize this.

More important might be a safeguard against multiple downloads of the same file at the same time from cluster jobs.

if r.status_code != requests.codes.ok:
    logger.error("error in download of proposal tables")
    raise IOError
"downloading pre-calculated proposal tables for {}. This can take a while...".format(config_file))
Collaborator

did you forget to delete this line?

Collaborator Author

I intended to leave it here as is. The logger call starts on the line above, so just this line, starting with a string, looks odd here.

shallmann requested a review from cg-laser on June 3, 2024, 11:06
from datetime import datetime
from geolite2 import geolite2

geo = geolite2.reader()

rather than using geolocation, why not just store the server time zone along with the server names? I don't think we're going to have significantly more than two ever?


I guess the response time test is more robust anyway though...

Collaborator Author

Ok, I removed the function entirely. Just having the response-time ranking around is enough if we are not enabling the option for now anyway (we agreed with @cg-laser that, since a download only happens once for new files, we won't request ordering but leave DESY as the "master").

response_times = []
available_dataservers = []

for dataserver in dataservers:

it could make sense to parallelize this, though in practice it probably doesn't matter very much with just two servers...

Collaborator Author

Yeah... right. I don't expect that we will ever have many more than two servers. I suggest leaving it as is.
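
For readers skimming the thread, here is a minimal sketch of what such a response-time ranking could look like; the function name, the timeout value, and the use of HEAD requests are illustrative assumptions, not necessarily what dataservers.py does:

import requests

def rank_dataservers_by_response_time(dataservers, timeout=5):
    """Return the reachable dataservers sorted by response time, fastest first.

    Sketch only: servers that do not answer within `timeout` seconds are skipped.
    """
    response_times = []
    available_dataservers = []

    for dataserver in dataservers:
        try:
            # a HEAD request is enough to measure latency without downloading data
            response = requests.head(dataserver, timeout=timeout)
            response_times.append(response.elapsed.total_seconds())
            available_dataservers.append(dataserver)
        except requests.exceptions.RequestException:
            continue  # unreachable or too slow, drop this server

    # pair up and sort by measured response time
    return [server for _, server in sorted(zip(response_times, available_dataservers))]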

@@ -0,0 +1,115 @@
import requests
import os
import filelock

filelock claims it requires python >= 3.8, is that a problem?

Collaborator Author

Well, the tests download without problems. We currently require Python 3.6 for NuRadioMC. Do you know of alternatives? Or would you pin filelock to v3.4.1, from when they dropped Python 3.6 support?


I guess pip should do the right thing and automatically download an older version if needed (and it looks like the old versions still exist on pypi), so I think this should be fine.
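
Should pinning ever become necessary, one common way to express it would be environment markers in the requirements. This is purely illustrative; the version boundary below is an assumption based on the discussion above, not a verified cut-off:

filelock; python_version >= "3.7"
filelock<=3.4.1; python_version < "3.7"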

if os.path.isfile(target_path):
    logger.warning(f"{target_path} already exists. Maybe download was already completed by another instance?")
    return
elif unpack_tarball and (len(glob(os.path.dirname(target_path) + "/*.dat")) > 0):  # just check if any .dat files are present (similar to NuRadioProposal.py)

are the data files ever updated? Or will the name get changed in that case?

Collaborator Author

The download function only gets called if the file needs downloading (either because the hash sum is not the latest version or because the file is not there). The if/elif here just catches the case when 1000 jobs are started and the first one that acquired the lock has already performed the download...
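
As a rough illustration of that locking pattern (the function name and details here are placeholders, not the exact code added in dataservers.py):

import os
import requests
import filelock

def download_with_lock(url, target_path, lock_timeout=600):
    # one lock file per target serializes concurrent cluster jobs
    lock = filelock.FileLock(target_path + ".lock", timeout=lock_timeout)
    with lock:
        # another job may have completed the download while we waited for the lock
        if os.path.isfile(target_path):
            return
        r = requests.get(url)
        r.raise_for_status()
        with open(target_path, "wb") as f:
            f.write(r.content)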

    code.write(r.content)
logger.warning("...download finished.")

if unpack_tarball and target_path.endswith(".tar.gz"):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, unpack_archive supports additional archive formats. Not sure if it would make sense to support .tar.xz to save some bandwidth in some cases?

Collaborator Author

Good, but I don't think this is really needed. The proposal tarballs are the only tarballed files, and we chose to have them as tar.gz... They are not so big, and you only need to download them once per NuRadioMC installation.
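
For reference, a sketch of how unpacking with shutil.unpack_archive would look if additional formats were ever wanted. The PR itself may unpack the tar.gz differently, and target_path is assumed to be defined as in the snippet above:

import os
import shutil

# unpack_archive infers the format from the extension (.tar.gz, .tar.xz, .zip, ...),
# so supporting further formats would not need extra code here
shutil.unpack_archive(target_path, extract_dir=os.path.dirname(target_path))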

@shallmann
Collaborator Author

@cozzyd @cg-laser I think the only remaining question is whether we need to change anything about the filelock version we are requiring, since they dropped Python 3.6 support a while ago. But the tests are running fine.

Otherwise I would like to merge this.

cozzyd commented Jun 8, 2024

If the tests are running fine, pip must be resolving the correct version automatically, so I think it's fine to merge?

shallmann merged commit ff7db6e into develop on Jun 14, 2024
5 checks passed