Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AhoCorasick #545

Merged
merged 30 commits into from
May 29, 2024
Merged

Conversation

Joseph-Edwards
Copy link
Collaborator

@Joseph-Edwards Joseph-Edwards commented May 21, 2024

This PR includes the AhoCorasick class in a state ready for V3. In particular:

  • split into hpp/tpp
  • use of no_checks functions
  • TODOs and FIXMEs reviewed
  • Rule of 5
  • asan/usan/tsan
  • doc
  • Python bindings
  • iwyu

@Joseph-Edwards Joseph-Edwards marked this pull request as draft May 21, 2024 15:43
@Joseph-Edwards
Copy link
Collaborator Author

Presently, there are a few things that I'm not sure about when writing the doc. Specifically:

  • Which functions need \complexity
  • Should descriptions be "This function calculates ..." or "Calculate ..."? Is this true for both brief descriptions and long descriptions?
  • How do you document a "checks" version of a function, assuming that the no_checks version has documentation already?
  • When should you use \c vs backticks vs \ref?
  • Should the return description always be "A value of type X"?

Any insight would be much appreciated!

//! This function puts an AhoCorasick object back into the same state as if
//! it had been newly default constructed.
//!
//! \parameters (None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think we agreed to not document no parameters or return value in this way.

AhoCorasick& init();

size_t number_of_nodes() const noexcept {
//! Returns the number of nodes in the forest.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forest -> trie?

//! \brief Add a word to the trie.
//!
//! Calling this function immediately adds the given word to the trie, and
//! sets the final node to terminal (if it is not already the case). After
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! sets the final node to terminal (if it is not already the case). After
//! makes the final node on the path labelled by this word terminal (if it wasn't already). After

for (auto const& w : words) {
aho_corasick::add_word(ac, w);
}
std::cout << "\n" << dot(ac).to_string() << "\n";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to check that REQUIRE(dot(ac).to_string() == "") instead. Or if you really want to insert it in a stream then make a stringstream so that the output is swallowed by that.

Copy link
Member

@james-d-mitchell james-d-mitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, few minor quibbles, but happy to merge this more or less as is

@james-d-mitchell
Copy link
Member

Correction will be happy to merge when the CI passes :)

@Joseph-Edwards
Copy link
Collaborator Author

Thanks for the review @james-d-mitchell. Will finish up the docs, do the Python bindings then un-draft-ify the PR.

@james-d-mitchell
Copy link
Member

  • Which functions need \complexity

Any where you can more or less straightforwardly determine what the complexity is.

  • Should descriptions be "This function calculates ..." or "Calculate ..."? Is this true for both brief descriptions and long descriptions?

I find that often in the brief docs it's too long to write "This function calculates..." or equivalent, so often just remove it. In the detailed description, it's probably better to write something like "This function ...".

  • How do you document a "checks" version of a function, assuming that the no_checks version has documentation already?

By describing when it throws, or maybe more accurately by describing when libsemigroups throws (through LIBSEMIGROUPS_EXCEPTION).

  • When should you use \c vs backticks vs \ref?

\ref is for reference only, doxygen claims to automatically run word that has an underscore, "::", or a capital letter through its cross referencing, so \ref shouldn't be necessary unless you're trying to link to a single word like size. If doxygen complains that it can't find the \ref, then sometimes this is cured by giving the fully qualified name instead libsemigroups::AhoCorasick::size for example.

IIRC \c only works on the next word, so if you want to put something that has a space in it in code (like x == 0), then you have to use back ticks.

  • Should the return description always be "A value of type X"?

I've written that in lots of places, because the detail description often say what is returned already, and it's a little less repetitive. In the python bindings these returns statements are useless, so maybe it's better to just briefly repeat what is returned the rank of the transformation, or something.

Hope this helps!

@Joseph-Edwards Joseph-Edwards marked this pull request as ready for review May 27, 2024 17:52
Copy link
Member

@james-d-mitchell james-d-mitchell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@james-d-mitchell james-d-mitchell merged commit f712222 into libsemigroups:v3 May 29, 2024
25 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants