Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parseLongPath() to handle namespaces #479

Merged
merged 1 commit into from
Feb 3, 2021

Conversation

maneeshpm
Copy link
Collaborator

Fixes #477
If a path of length 1 is passed and qualifies as a namespace character, return (ns, "")

@maneeshpm maneeshpm marked this pull request as draft January 25, 2021 07:08
@kelson42 kelson42 requested review from mgautierfr and veloman-yunkan and removed request for mgautierfr and veloman-yunkan January 25, 2021 09:15
@maneeshpm maneeshpm marked this pull request as ready for review January 25, 2021 09:20
@maneeshpm
Copy link
Collaborator Author

@veloman-yunkan The changes have been included.

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you run unit-tests? The parseLongPath unit test is failing. Please review and modify that unit test respectively, to reflect the updated functionality of parseLongPath(). Try to add more corner cases if needed.

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good! We are about to converge. Only need to take consistency seriously.

test/parseLongPath.cpp Show resolved Hide resolved
Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

The two comments below are just suggestions. Since it's a matter of personal taste, feel free to ignore them. But we also need to get back to your observation regarding Archive::findByPath(). Since parseLongPath() now doesn't reject inputs of the form A/ or /A/, Archive::findByPath() will work incorrectly in those cases. Please check that hypothesis in the test/find.cpp unit test and propose a fix for it, too.

src/tools.cpp Outdated Show resolved Hide resolved
src/tools.cpp Show resolved Hide resolved
@maneeshpm
Copy link
Collaborator Author

maneeshpm commented Jan 26, 2021

@veloman-yunkan archive::findByPath() fails for any parameter that has a trailing slash because of a path.back()++ in the function. I think, removing the trailing slash solves the issue.

Archive::EntryRange<EntryOrder::pathOrder> Archive::findByPath(std::string path) const
{
    /* Removing trailing slash */
    if(path.back() == '/') path.pop_back();
    entry_index_t begin_idx, end_idx;
    if (m_impl->hasNewNamespaceScheme()) {
      ...
    } else {
      ...
    }
    return Archive::EntryRange<EntryOrder::pathOrder>(m_impl, begin_idx.v, end_idx.v);
}

Works as expected and passes the build tests.

@mgautierfr
Copy link
Collaborator

archive::findByPath() fails for any parameter that has a trailing slash because of a path.back()++ in the function. I think, removing the trailing slash solves the issue.

Why it is failing ? (And how ?)

This should not. The path of a item is just a string of bytes. There is no semantics associated to it (except the namespace part).
The fact that paths look like a file path or a url is because of how we use it (in kiwix). At libzim level, we don't care.

If we search for a C/foo/ we must return all entries starting by C/foo/ and we should NOT return C/foo or C/foothing

@kelson42 kelson42 modified the milestone: libzim 7.0.0 Jan 26, 2021
Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. Please squash all your commits into one - we don't need all the history of this PR.

@veloman-yunkan
Copy link
Collaborator

@mgautierfr makes a good point

If we search for a C/foo/ we must return all entries starting by C/foo/ and we should NOT return C/foo or C/foothing

Please also add corresponding test cases to findByPath()'s unit test.

@maneeshpm
Copy link
Collaborator Author

maneeshpm commented Jan 26, 2021

@mgautierfr Sorry, I meant it fails for all namespace urls with trailing slash. This is an implementation issue in findByPath() when we want to use it only for namespace parameters such as A/ or /A/.

Suppose we pass A/ as the parameter, findByPath() assigns begin_idx properly, then performs a path.back()++ which makes the path A/->A0 which is an invalid path for parseLongPath() and throws a runtime error when finding end_idx.

$ zimdump list --ns=M/ khan-academy-videos_fr_amine_2020-06.zim                                                                                            
M/Counter
M/Creator
...
M/Tags
M/Title
X/fulltext/xapian
X/title/xapian
Exception: entry index out of range

The list should only output entries starting with M/ but it continues further until an exception is encountered.

@mgautierfr
Copy link
Collaborator

Sorry, I meant it fails for all namespace urls with trailing slash. This is an implementation issue in findByPath() when we want to use it only for namespace parameters such as A/ or /A/.

You are right. But then, your proposed fix (if(path.back() == '/') path.pop_back();) is not good as it removes all trailing slash.

@maneeshpm
Copy link
Collaborator Author

@mgautierfr I understand now, thanks for pointing it out! Perhaps I should modify the condition to check if it's actually a namespace url and only then remove the trailing slash if required.

src/archive.cpp Outdated Show resolved Hide resolved
@kelson42
Copy link
Contributor

@mgautierfr I think your review is again needed here.

@kelson42
Copy link
Contributor

@maneeshpm Please rebase your branch on latest HEAD origin/master.

@maneeshpm maneeshpm force-pushed the 477-parseLongPath-return-ns branch 2 times, most recently from 0172ab3 to c90fe22 Compare January 29, 2021 13:41
@kelson42
Copy link
Contributor

@maneeshpm You have only one commit now in the PR, not sure if you have squased everything together or if you have lost some commits... but to do a rebase you need to resync your fork (https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork) and then make in your feature branch:

$git fetch
$git rebase origin/master

@maneeshpm
Copy link
Collaborator Author

maneeshpm commented Jan 29, 2021

@kelson42 Thanks for the info. I am never going to forget that 😅, sorry for the clutter. I have squashed the commits and this pr contains all the intended changes.

@mgautierfr
Copy link
Collaborator

There is another use case not handled : A// (Namespace A, path: /).
In this case, we want to increment the last /.
And the test if(path.size() <= 3 && path.back() == '/') path.pop_back(); is not good.

The increment of the last char is a bit more complex that it seems at first sight, maybe we need a specific method.
Or simply parse the long path first and then increment the last char of the parsed short path.


It is a matter of style, but I prefer to have the test written on several lines :

if(path.size() <= 3 && path.back() == '/') {
  path.pop_back();
}

@maneeshpm
Copy link
Collaborator Author

@mgautierfr Thats a valid point, parsing the path before beforehand seems to be the correct approach. I agree, in the future having a more sophisticated method for this would be a nice idea.
For now, I am planning the following changes:

Archive::EntryRange<EntryOrder::pathOrder> Archive::findByPath(std::string path) const
{
    entry_index_t begin_idx, end_idx;
    if (path.empty() || path == "/") {
      begin_idx = m_impl->getStartUserEntry();
      end_idx = m_impl->getEndUserEntry(); 
    } else if (m_impl->hasNewNamespaceScheme()) {
      begin_idx = m_impl->findx('C', path).second;
      path.back()++;
      end_idx = m_impl->findx('C', path).second;
    } else {
      char ns;
      std::tie(ns, path) = parseLongPath(path);
      begin_idx = m_impl->findx(ns, path).second;
      if (path.empty()){
        ns++;
        end_idx = m_impl->findx(ns, path).second;
      } else {
        path.back()++;
        end_idx = m_impl->findx(ns, path).second;
      }
    } 
    return Archive::EntryRange<EntryOrder::pathOrder>(m_impl, begin_idx.v, end_idx.v);
  }

I think this is pretty much exhaustive and covers all the unit tests. After this modification, all the invalid paths like the one you mentioned will be handled by parseLongPath by throwing an std::runtime_error. One unit test

auto range0 = archive.findByPath("unkwonUrl");
ASSERT_EQ(range0.begin(), range0.end());

will be changed from ASSERT_EQ to ASSERT_THROW due to the nature of parseLongPath.
Is this the right approach?

@mgautierfr
Copy link
Collaborator

will be changed from ASSERT_EQ to ASSERT_THROW due to the nature of parseLongPath.

We should not change the API based on the nature of a internal method.
findByPath searches for a range of entries starting by the prefix. If no entries start with the given prefix, we return an empty range.
The question is more about what we should do if the user give a invalid path (and what is a invalid path) ?

For a long time, a (long) path was composed of a namespace and a short path. The new api remove this.
Now, we hide the namespace, so a path is simply a (short path). We still have a namespace for compatibility with old zim file but the idea is to hide it. (And it is hidden as a subdirectory in the path)
So there is no invalid path. At worst a path is wrong (pointing to a non existing entry).
And if there is no invalid path, for this method, we must not throw an exception (at least not because the namespace is missing, we may throw other exception as ZimFileFormatError).

@maneeshpm
Copy link
Collaborator Author

@mgautierfr That makes sense. Since we are trying to hide the old namespace scheme, I think we should place parseLongPath() inside a try-catch block and return an empty range as you mentioned if an error is encountered. This way, the API remains unchanged and we can facilitate a smooth change to the new scheme.

@maneeshpm
Copy link
Collaborator Author

@mgautierfr This fix will prevent any error from being thrown in parseLongPath and rather, return an empty range if an "unknown" URL is passed to the function.

@mgautierfr
Copy link
Collaborator

We should try/catch only the parsing of the url.
If there is a ZimFileFormatError thrown when we findx we don't want to discard it.

@maneeshpm
Copy link
Collaborator Author

maneeshpm commented Feb 3, 2021

@mgautierfr Understood, I've made the necessary changes. Is this code structure fine?

@mgautierfr
Copy link
Collaborator

We are good.
I would have written it this way to avoid the parseOk variable and the extra indentation

char ns;
try {
  std::tie(ns, path) = parseLongPath(path);
} catch (...) {
  Archive::EntryRange<EntryOrder::pathOrder>(m_impl, 0, 0);
}
begin_idx = m_impl->findx(ns, path).second;
if (path.empty()) {
  ns++;
} else {
  path.back()++;
}
end_idx = m_impl->findx(ns, path).second;

But it is a matter of preference, it works anyway.

If @veloman-yunkan is ok, we can merge.

@maneeshpm
Copy link
Collaborator Author

Thanks for the suggestion @mgautierfr. This looks much more neater and concise. I will try to follow it in the future as well.

@kelson42
Copy link
Contributor

kelson42 commented Feb 3, 2021

@maneeshpm Your branch would benefit to be rebased on our git master, so we can merge it.

Return empty range for unknown paths
@maneeshpm
Copy link
Collaborator Author

@kelson42 rebased to master. Thanks!

@kelson42 kelson42 merged commit 6340e80 into openzim:master Feb 3, 2021
@kelson42
Copy link
Contributor

kelson42 commented Feb 3, 2021

@maneeshpm It seems you latest PR breaks on windows, see https://github.com/openzim/libzim/runs/1825653239?check_suite_focus=true

@maneeshpm
Copy link
Collaborator Author

maneeshpm commented Feb 4, 2021

@kelson42 Looks like windows.h header file has macros for min & max which are interfering with: See ref
auto shortPath = longPath.substr(std::min(i+2, (unsigned int)longPath.size()));. I think the best way to fix this is by explicitly mentioning our function type like int k = std::min<int>(3, 4);. Do I need to open a new issue to fix this?

@kelson42
Copy link
Contributor

kelson42 commented Feb 4, 2021

@maneeshpm yes please. You should have write permission now on this repo, please male you PR here (and not in your fork).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

zim::parseLongPath() should handle paths consisting of the namespace only
4 participants