Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update mc readers #693

Merged
merged 87 commits into from
May 25, 2020
Merged

Update mc readers #693

merged 87 commits into from
May 25, 2020

Conversation

andLaing
Copy link
Collaborator

Updates the pandas DataFrame based readers
for the MC hits, particles and sensors taking into
account the new nexus output structure.

Adapts existing functions to return the same structure
for both old and new format files.

@jmunozv
Copy link
Collaborator

jmunozv commented Feb 26, 2020

With the last "load_mcsensor_response_df functionality extended" commit we get:

  • The DB_name and run_number is not mandatory, so new geometries not stored in the DataBase can be handled without the need to pass arbitrary values to these variables.
  • A new parameter called sns_name has been added for just the input files with the new hdf5 persistency format. This sns_name, if set, is used by the function to only return the sns_response of that type of sensors.

@jmunozv
Copy link
Collaborator

jmunozv commented Feb 26, 2020

Currently load_mcsensor_response_df returns a Tuple with 2 pandas DataFrames: sns_response and bin_widths. This is the only mc-reader with this behavior. The rest of the mc-readers just return what you ask (hits, particles).
Given that the bin_widths are easily got via get_sensor_binning() function provided by IC, my proposal, for consistency, is that this function should return only the sns_response.

@andLaing
Copy link
Collaborator Author

With the commit 6ca6062 we remove the old readers which returned dictionaries of evm.event_model objects. @paolafer suggests a tag be made before merging these changes into the head.

@mmkekic
Copy link
Collaborator

mmkekic commented Feb 28, 2020

The MC structure table has been bothering me for a while and I am very happy to see it improving! However I think we should all agree on what we want to do and try and keep the code as clean as possible.
Since the MC table structure is changing we need duplicated readers to be able to read different data format during this transition period, is a bit ugly but I dont see a way around it.
Seems that the new readers are always returning pandas dataframe, so the question is do we still want to be able to read the MC information as our custom types (ie dictionaries)? @msorel , @ausonandres , @paolafer , @Aretno might want to comment on this? If the answer is no, I suggest we make an IC tag and clean the unused types from the event model (and all the functions using those types). If the answer is yes we would need to provide a functions that transform dataframes to needed types.

@paolafer
Copy link
Collaborator

I think that eliminating completely the event model requires a more thorough discussion, including @jjgomezcadenas. For the MC readers, we agreed on making a tag and then move on eliminating the readers that convert tables into dictionaries.

@jacg
Copy link
Collaborator

jacg commented Feb 29, 2020

... eliminating completely the event model ...

:-)

I wish I had the time to pay attention to this, or even come over and discuss it with you all in person. Unfortunately, I've just unleashed a deluge of activity in my life, so I'm completely out of the picture for a while.

@paolafer
Copy link
Collaborator

I wish I had the time to pay attention to this, or even come over and discuss it with you all in person. Unfortunately, I've just unleashed a deluge of activity in my life, so I'm completely out of the picture for a while.

I'm glad you commented :-), I didn't mention you because we know you're out of the picture, but your interventions whenever you find the time are greatly appreciated!

@mmkekic
Copy link
Collaborator

mmkekic commented Feb 29, 2020

I think that eliminating completely the event model requires a more thorough discussion, including @jjgomezcadenas.

No, I didnt mean to eliminate the whole event model, that doesnt seem feasible at the moment (though note that we are duplicating some functionalities to work with pandas directly when the performance is the issue... ). I meant to declutter it, ie remove types that are simply not used anywhere in IC (as MCParticle would be in this case we eliminate the reader)

For the MC readers, we agreed on making a tag and then move on eliminating the readers that convert tables into dictionaries.

I think that we should at least keep the possibility for a user to create MCHit type since it is needed to be fed in the paolina functions for example. My suggestion would be to have a function that transforms hits dataframe to a dictionary {event : List[MCHit]}. I am not familiar if the output of the other readers (sns responses etc) is used anywhere.

@mmkekic
Copy link
Collaborator

mmkekic commented Feb 29, 2020

I wish I had the time to pay attention to this, or even come over and discuss it with you all in person. Unfortunately, I've just unleashed a deluge of activity in my life, so I'm completely out of the picture for a while.

I'm glad you commented :-), I didn't mention you because we know you're out of the picture, but your interventions whenever you find the time are greatly appreciated!

Indeed, your suggestion, even sporadic, are more than welcome! Thanks for still keeping an eye on us! :)

@andLaing
Copy link
Collaborator Author

While adjusting the cities to use the new paradigm for the copying of MC info I've come across a problem in a few tests that I'm not sure the best way to solve. The file used (electrons_40keV_z250_RWF.h5) seems to have an error in the MC tables as can be seen in the attached screenshot. The extents row for the last event gives the last hit and last particle one less than the total number of rows in the corresponding table. This causes an error when trying to read either of these tables into a coherent dataframe since the last value of the event_id column is a nan. I could probably catch the exception caused and replace the nan with the last valid event_id but I was worried that this would cover up other corrupt files. It's not relevant for new MC but I thought I'd better ask for opinions before I start rewriting functions or changing test files
Screenshot 2020-03-28 at 19 31 18

@jmunozv
Copy link
Collaborator

jmunozv commented Mar 28, 2020

Andrew, may be the error comes from the event number one (the one with id = 1) ... according to the plot you have sent it wouldn't have any particle

@andLaing
Copy link
Collaborator Author

Nope it's definitely the last one, there are 50 hit rows and the last index is 48 not 49 as it needs to be

@paolafer
Copy link
Collaborator

Hi! The electrons_40keV_z250_RWF.h5 file has clearly an error in it, since, as Javi says, in the extents table you can see that event 0 and event 1 have the same last_particle value. I've checked that the MCRD correspondent file has the correct MC true information and that, if I run diomira on it, a correct table in the RWF file is produced, therefore, the IC code seems to be fine. For some reason, the electrons_40keV_z250_RWF.h5 file was produced in a wrong way.

I would propose to use another RWF file in the tests that use it, we have some of them already in the test_data folder, if those tests don't assume anything in particular for the content of the file.

@jmunozv
Copy link
Collaborator

jmunozv commented Mar 30, 2020

That was exactly my point ... The event with id=1 has 4 hits and 0 particles what has no sense at all.

@andLaing andLaing mentioned this pull request Mar 30, 2020
carmenromo added a commit that referenced this pull request Mar 31, 2020
#713

[author: andLaing]

Removes the use of the file electrons_40keV_z250_RWF.h5 in
the isidora tests and in the commandline tests for isidora
and irene.

This file was found to have badly written MC info in PR #693

[reviewer: paolafer]

This PR eliminates a test file that had wrong MC true information. The
tests that used it have been modified to use a different file with
similar content and the file has been removed. All the tests succeed
now, so I approve this PR.
invisible_cities/io/mcinfo_io.py Outdated Show resolved Hide resolved

with tb.open_file(file_name, 'r') as h5in:
mc_tbls = ['hits', 'particles', 'sns_response']
evt_list = list(_try_unique_evt_itr(h5in.root.MC, mc_tbls))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we understand why are there events that are not in particles or hits table?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a question for the NEXUS authors really.
There are certain types of simulations that have some and not other MC types saved, a simulation of optical photons for example would have an empty hits table as far as I understand.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are different kinds of simulations, where we save different kinds of output (some will save only sensor response information and not true particles, sometimes the opposite happen, for instance). We want to be able to read all kind of MC files.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An option would be to resurrect the old events table, which contained two columns, namely the eventID and the total deposited true energy, but I don't know if it's worth it...

invisible_cities/io/mcinfo_io.py Outdated Show resolved Hide resolved
def sensor_binning_old(file_name : str) -> pd.DataFrame:
# Is a pre-Feb 2020 MC file
config = pd.read_hdf(file_name, 'MC/configuration').set_index('param_key')
bins = config[config.index.str.contains('binning')].copy()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since config was read in a previous line and not given as a mutable input, you don't have to copy it

invisible_cities/io/mcinfo_io.py Show resolved Hide resolved
@mmkekic mmkekic mentioned this pull request Apr 6, 2020
Slight adjustment to comparisons in
test_copy_mc_info_which_events_out_of_range
Makes df index types the same in old and
new format reads
Copy link
Collaborator

@mmkekic mmkekic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR modifies MC io module to adapt for the new nexus output format, ensuring that the main readers/writer can be applied to the old format as well. The new functions use extensively dataframes that are faster to manipulate with than the previous dictionary based readers/writer. All the functionalities are well tested, and new MC test dataset is introduced to guarantee the equality in information read from the old and new format.
Some functionalities are repeated due to different old/new table structures, and are planned to be removed in the future once there is no need for the 'old' files readers.
The road to user-transparent compatibility was bumpy but it sparks joy to finally approve this PR! Well done!

@carmenromo carmenromo merged commit db62205 into next-exp:master May 25, 2020
@andLaing andLaing deleted the UpdateMCReaders branch July 24, 2020 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants