Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch missing cols #158

Merged
merged 10 commits into from Dec 11, 2022
Merged

Patch missing cols #158

merged 10 commits into from Dec 11, 2022

Conversation

gaede
Copy link
Contributor

@gaede gaede commented Dec 7, 2022

BEGINRELEASENOTES

  • add utility class CheckCollections that allows to parse lcio files for collections that are not present in every event and to patch such events with empty collections of the given (Name,Type) for further processing
  • add example tool check_missing_cols that checks for collections that are not in every event in a set of files and prints a summary to stdout:
    • usage: check_missing_cols <input-file1> [[input-file2],...]
  • add example tool patch_missing_cols that creates a copy of the input file with the same set of collections in all events:
    • usage: patch_missing_cols <input-file> <output-file>
  • these tools are needed in cases where code expects consistent sets of collections in every event, as for example in a conversion to EDM4hep

ENDRELEASENOTES


if( cnt == 0 ){ // enter type only for first occurence
auto col = evt->getCollection( name ) ;
typeMap.insert( std::make_pair( name, col->getTypeName() ) ) ;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
typeMap.insert( std::make_pair( name, col->getTypeName() ) ) ;
const auto [it, inserted] = typeMap.emplace( name, col->getTypeName() ) ;

slightly less cluttered and easier to read IMHO. (Plus we save a few "copies", aka moves, by constructing in-place).

Checking the return type here would give us the possibility to catch (the unlikely) case where collections with different types are stored under the same name in different events. (Would potentially make it possible to get rid of the cntMap (if the main purpose is error checking).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always happy to learn more modern C++ - the above used to be the 'canonical' way ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS: the counting is crucial here - as in the end I need to compare the count to the total number of events read. Could possibly merge both maps into one, though...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I follow why that comparison is strictly necessary. The main purpose is to collect all names (and types) of the collections stored in all events in the files, which we do by filling the typeMap. Having a counter for each collection just would give us the possibility of also displaying information on which collections were missing in some events, right?

Comment on lines 91 to 93
auto it_c = cntMap.begin();
auto it_t = typeMap.begin();
for( ; it_c != cntMap.end() ; it_c++, it_t++ ){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work as intended? I.e. do the keys of both maps match in every iteration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why they wouldn't - but can check...

Copy link
Contributor

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we put the creation of the name to type map into a function and simply call that in check_missing_cols? That would make it much easier to use the same functionality in e.g. a standalone converter.

Comment on lines 51 to 52
std::map<std::string,unsigned> cntMap ;
std::map<std::string,std::string> typeMap ;
Copy link
Contributor

@tmadlener tmadlener Dec 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::map<std::string,unsigned> cntMap ;
std::map<std::string,std::string> typeMap ;
std::unordered_map<std::string,unsigned> cntMap ;
std::unordered_map<std::string,std::string> typeMap ;

For most use cases we are probably at a number of collection names, where we can profit from using a hash map. This will almost definitely break the loop over both maps using the iterators below.

@gaede
Copy link
Contributor Author

gaede commented Dec 8, 2022

Could we put the creation of the name to type map into a function and simply call that in check_missing_cols? That would make it much easier to use the same functionality in e.g. a standalone converter.

Yes, that's the plan - working on it ...

Copy link
Contributor

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The majority of my comments is suggestions to sprinkle around some consts, which should also make it possible to use this on multiple threads once configured. Also some &s to avoid unnecessary copies.

As we just discussed, making the Set a vector instead of a set would probably make sense, since the main purpose is to iterate in the end, with the downside that manipulating the internal _patchCols during configuration is a bit more work, because merging vectors is not as straight forward.

src/cpp/include/UTIL/CheckCollections.h Outdated Show resolved Hide resolved
src/cpp/include/UTIL/CheckCollections.h Outdated Show resolved Hide resolved
src/cpp/src/UTIL/CheckCollections.cc Outdated Show resolved Hide resolved
src/cpp/src/UTIL/CheckCollections.cc Outdated Show resolved Hide resolved
src/cpp/src/UTIL/CheckCollections.cc Outdated Show resolved Hide resolved
src/cpp/src/UTIL/CheckCollections.cc Outdated Show resolved Hide resolved
src/cpp/include/UTIL/CheckCollections.h Outdated Show resolved Hide resolved
src/cpp/include/UTIL/CheckCollections.h Outdated Show resolved Hide resolved
src/cpp/include/UTIL/CheckCollections.h Outdated Show resolved Hide resolved
src/cpp/include/UTIL/CheckCollections.h Outdated Show resolved Hide resolved
*/
void addPatchCollections(Vector cols){
for(const auto& p : cols)
_patchCols.push_back( p ) ;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will now no longer work as before with the set, as it is now possible to add the same collections multiple times, so this would have to be something like

Suggested change
_patchCols.push_back( p ) ;
if (const auto it = std::find(_patchCols.begin(), _patchCols.end(), p); it == _patchCols.end()) {
_patchCols.push_back( p ) ;
}

/** Add a collection with (name,type) that should be added to events in patchEvent().
*/
void addPatchCollection(const std::string name, std::string type){
_patchCols.push_back( {name, type} ) ;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_patchCols.push_back( {name, type} ) ;
addPatchCollections({{name, type}});

Not sure if this initializer list works, but to avoid having to duplicate the deduplication logic, I would just forward this to the overload taking the vector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you really would want to put such a check in, I'd do it the other way round, i.e. check in the individual addPatchCollection for duplicates and use this in the other function. But I think this is really overdoing it, as duplicates would not cause any harm (except tiny extra runtime). Much more critical would be to check for viable LCIO type names as that would cause crashes. Not sure we need this for this 'expert' tool, though...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason I pointed this out is the changed behavior wrt the previous version that had a set. In any case, this will most likely only be called once in any case, and with an empty _patchCols as well, so I agree, deduplicating here is probably not necessary.

The LCIO type check is something we could/should actually do (also to catch typos in steering files), I think. It depends a bit on whether the crash is an exception or a segmentation fault in the collection creation factory. The latter might be hard to diagnose if there is no backtrace. The former, I would consider "good enough" for this, as we will have something printed to the screen with enough information, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess any container will do here, as all we need here is to get all elements. I had put in the vector as you suggested for further downstream use in the converter. Thinking about the type checking, again: this also is not needed or even useful to have here. We should get neither crash or exception, as the type string is really just used as 'information' and no code should depend on it. For empty collection no one will (or should) access this ever. Collections with incorrect type might get lost in routines that list/dump collections as LCTOOLS::dumpEventDetailed - but no crashes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the 'WIP', so from my side it would be good to go.

@gaede gaede changed the title WIP: Patch missing cols Patch missing cols Dec 10, 2022
gaede and others added 2 commits December 11, 2022 19:33
@tmadlener tmadlener enabled auto-merge (rebase) December 11, 2022 18:38
@tmadlener tmadlener merged commit 73efd54 into iLCSoft:master Dec 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants