Fix parseLongPath() to handle namespaces #479

maneeshpm · 2021-01-25T06:47:31Z

Fixes #477
If a path of length 1 is passed and qualifies as a namespace character, return (ns, "")
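
A minimal sketch of the parsing rule described above, for readers who do not have the diff at hand. The name parseLongPathSketch and its body are assumptions made for illustration and are not the libzim implementation. The sketch only encodes the behaviours mentioned in this thread: a bare namespace character yields (ns, ""), inputs such as A/ and /A/ are accepted, and anything else that lacks a namespace throws std::runtime_error.

```cpp
#include <stdexcept>
#include <string>
#include <tuple>

// Hypothetical stand-in for parseLongPath(); NOT the libzim source.
std::tuple<char, std::string> parseLongPathSketch(const std::string& longPath)
{
  // Skip the leading '/' of an absolute path such as "/A/foo".
  const std::string::size_type i =
      (!longPath.empty() && longPath[0] == '/') ? 1 : 0;

  // Reject "", "/" and any input whose namespace char is not followed by '/'.
  if (i >= longPath.size()
      || (i + 1 < longPath.size() && longPath[i + 1] != '/')) {
    throw std::runtime_error("Cannot parse path");
  }

  const char ns = longPath[i];
  // Everything after "<ns>/" is the short path; it is empty for "A", "A/" and "/A/".
  const std::string shortPath =
      (i + 2 <= longPath.size()) ? longPath.substr(i + 2) : std::string();
  return std::make_tuple(ns, shortPath);
}
```

Under these assumptions, "A", "A/" and "/A/" all parse to namespace A with an empty short path, "A//" parses to namespace A with short path "/", and an input without a namespace separator throws.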

maneeshpm · 2021-01-25T09:22:45Z

@veloman-yunkan The changes have been included.

veloman-yunkan

Did you run the unit tests? The parseLongPath unit test is failing. Please review and modify that unit test accordingly, to reflect the updated functionality of parseLongPath(). Try to add more corner cases if needed.

veloman-yunkan

Good! We are about to converge. Only need to take consistency seriously.

test/parseLongPath.cpp

veloman-yunkan

Great!

The two comments below are just suggestions. Since it's a matter of personal taste, feel free to ignore them. But we also need to get back to your observation regarding Archive::findByPath(). Since parseLongPath() now doesn't reject inputs of the form A/ or /A/, Archive::findByPath() will work incorrectly in those cases. Please check that hypothesis in the test/find.cpp unit test and propose a fix for it, too.

src/tools.cpp

maneeshpm · 2021-01-26T12:50:52Z

@veloman-yunkan archive::findByPath() fails for any parameter that has a trailing slash because of a path.back()++ in the function. I think, removing the trailing slash solves the issue.

Archive::EntryRange<EntryOrder::pathOrder> Archive::findByPath(std::string path) const
{
    /* Removing trailing slash */
    if(path.back() == '/') path.pop_back();
    entry_index_t begin_idx, end_idx;
    if (m_impl->hasNewNamespaceScheme()) {
      ...
    } else {
      ...
    }
    return Archive::EntryRange<EntryOrder::pathOrder>(m_impl, begin_idx.v, end_idx.v);
}

Works as expected and passes the build tests.

mgautierfr · 2021-01-26T16:11:37Z

archive::findByPath() fails for any parameter that has a trailing slash because of a path.back()++ in the function. I think, removing the trailing slash solves the issue.

Why is it failing? (And how?)

It should not. The path of an item is just a string of bytes. There are no semantics associated with it (except the namespace part).
The fact that paths look like file paths or URLs is because of how we use them (in kiwix). At the libzim level, we don't care.

If we search for a C/foo/ we must return all entries starting by C/foo/ and we should NOT return C/foo or C/foothing
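
A minimal sketch of the prefix semantics described here, assuming a plain sorted std::vector<std::string> stands in for the archive's path-ordered entries. prefixRange is a hypothetical helper and not part of libzim. The upper bound is found by incrementing the last character of the prefix, the same trick as the path.back()++ discussed in this thread; the sketch assumes that character is not already the largest char value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Paths = std::vector<std::string>;

// Returns the [begin, end) index range of the entries that start with `prefix`.
// Hypothetical helper: `sorted` must be sorted bytewise, like a path-ordered index.
std::pair<std::size_t, std::size_t> prefixRange(const Paths& sorted, std::string prefix)
{
  assert(std::is_sorted(sorted.begin(), sorted.end()));
  if (prefix.empty()) {
    return std::make_pair(std::size_t{0}, sorted.size());
  }

  // First entry that is not less than the prefix.
  const auto begin = std::lower_bound(sorted.begin(), sorted.end(), prefix);

  // "C/foo/" becomes "C/foo0": the smallest string that sorts after every
  // string starting with "C/foo/".
  prefix.back()++;
  const auto end = std::lower_bound(sorted.begin(), sorted.end(), prefix);

  return std::make_pair(static_cast<std::size_t>(begin - sorted.begin()),
                        static_cast<std::size_t>(end - sorted.begin()));
}
```

With the entries C/foo, C/foo/, C/foo/a, C/foo/b and C/foothing, searching for C/foo/ selects only the three middle entries, and a prefix that matches nothing gives an empty range rather than an error.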

veloman-yunkan

Approving. Please squash all your commits into one - we don't need all the history of this PR.

veloman-yunkan · 2021-01-26T16:45:33Z

@mgautierfr makes a good point

If we search for a C/foo/ we must return all entries starting by C/foo/ and we should NOT return C/foo or C/foothing

Please also add corresponding test cases to findByPath()'s unit test.

maneeshpm · 2021-01-26T16:45:43Z

@mgautierfr Sorry, I meant it fails for all namespace urls with trailing slash. This is an implementation issue in findByPath() when we want to use it only for namespace parameters such as A/ or /A/.

Suppose we pass A/ as the parameter, findByPath() assigns begin_idx properly, then performs a path.back()++ which makes the path A/->A0 which is an invalid path for parseLongPath() and throws a runtime error when finding end_idx.

$ zimdump list --ns=M/ khan-academy-videos_fr_amine_2020-06.zim                                                                                            
M/Counter
M/Creator
...
M/Tags
M/Title
X/fulltext/xapian
X/title/xapian
Exception: entry index out of range

The list should only output entries starting with M/ but it continues further until an exception is encountered.

mgautierfr · 2021-01-26T16:50:56Z

Sorry, I meant it fails for all namespace urls with trailing slash. This is an implementation issue in findByPath() when we want to use it only for namespace parameters such as A/ or /A/.

You are right. But then your proposed fix (if(path.back() == '/') path.pop_back();) is not good, as it removes every trailing slash, not only the one that follows a bare namespace.

maneeshpm · 2021-01-26T16:54:25Z

@mgautierfr I understand now, thanks for pointing it out! Perhaps I should modify the condition to check if it's actually a namespace url and only then remove the trailing slash if required.

src/archive.cpp

kelson42 · 2021-01-29T12:36:41Z

@mgautierfr I think your review is again needed here.

kelson42 · 2021-01-29T12:37:35Z

@maneeshpm Please rebase your branch on latest HEAD origin/master.

kelson42 · 2021-01-29T13:52:58Z

@maneeshpm You now have only one commit in the PR; I'm not sure whether you have squashed everything together or lost some commits... To rebase, you need to resync your fork (https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork) and then run the following in your feature branch:

$ git fetch
$ git rebase origin/master

maneeshpm · 2021-01-29T13:57:10Z

@kelson42 Thanks for the info. I am never going to forget that 😅, sorry for the clutter. I have squashed the commits and this PR contains all the intended changes.

mgautierfr · 2021-02-02T09:48:34Z

There is another use case that is not handled: A// (namespace A, path /).
In this case, we want to increment the last /.
And the test if(path.size() <= 3 && path.back() == '/') path.pop_back(); is not good.

Incrementing the last char is a bit more complex than it seems at first sight; maybe we need a specific method.
Or simply parse the long path first and then increment the last char of the parsed short path.

It is a matter of style, but I prefer to have the test written on several lines:

if(path.size() <= 3 && path.back() == '/') {
  path.pop_back();
}

maneeshpm · 2021-02-02T12:57:25Z

@mgautierfr That's a valid point; parsing the path beforehand seems to be the correct approach. I agree that a more sophisticated method for this would be a nice idea in the future.
For now, I am planning the following changes:

Archive::EntryRange<EntryOrder::pathOrder> Archive::findByPath(std::string path) const
{
  entry_index_t begin_idx, end_idx;
  if (path.empty() || path == "/") {
    begin_idx = m_impl->getStartUserEntry();
    end_idx = m_impl->getEndUserEntry();
  } else if (m_impl->hasNewNamespaceScheme()) {
    begin_idx = m_impl->findx('C', path).second;
    path.back()++;
    end_idx = m_impl->findx('C', path).second;
  } else {
    char ns;
    std::tie(ns, path) = parseLongPath(path);
    begin_idx = m_impl->findx(ns, path).second;
    if (path.empty()) {
      ns++;
      end_idx = m_impl->findx(ns, path).second;
    } else {
      path.back()++;
      end_idx = m_impl->findx(ns, path).second;
    }
  }
  return Archive::EntryRange<EntryOrder::pathOrder>(m_impl, begin_idx.v, end_idx.v);
}

I think this is pretty much exhaustive and covers all the unit tests. After this modification, all invalid paths like the one you mentioned will be handled by parseLongPath, which throws a std::runtime_error. One unit test

auto range0 = archive.findByPath("unkwonUrl");
ASSERT_EQ(range0.begin(), range0.end());

will be changed from ASSERT_EQ to ASSERT_THROW due to the nature of parseLongPath.
Is this the right approach?

mgautierfr · 2021-02-02T13:52:26Z

will be changed from ASSERT_EQ to ASSERT_THROW due to the nature of parseLongPath.

We should not change the API based on the nature of an internal method.
findByPath searches for a range of entries starting with the prefix. If no entries start with the given prefix, we return an empty range.
The question is more about what we should do if the user gives an invalid path (and what is an invalid path?).

For a long time, a (long) path was composed of a namespace and a short path. The new API removes this.
Now we hide the namespace, so a path is simply a (short) path. We still have a namespace for compatibility with old ZIM files, but the idea is to hide it. (And it is hidden as a subdirectory in the path.)
So there is no invalid path. At worst a path is wrong (pointing to a non-existent entry).
And if there is no invalid path, this method must not throw an exception (at least not because the namespace is missing; we may throw other exceptions such as ZimFileFormatError).

maneeshpm · 2021-02-02T14:04:20Z

@mgautierfr That makes sense. Since we are trying to hide the old namespace scheme, I think we should place parseLongPath() inside a try-catch block and return an empty range as you mentioned if an error is encountered. This way, the API remains unchanged and we can facilitate a smooth change to the new scheme.

maneeshpm · 2021-02-03T08:25:09Z

@mgautierfr This fix will prevent any error from being thrown in parseLongPath and rather, return an empty range if an "unknown" URL is passed to the function.

mgautierfr · 2021-02-03T09:06:18Z

We should try/catch only the parsing of the URL.
If a ZimFileFormatError is thrown when we call findx, we don't want to discard it.

maneeshpm · 2021-02-03T14:45:08Z

@mgautierfr Understood, I've made the necessary changes. Is this code structure fine?

mgautierfr · 2021-02-03T16:15:59Z

We are good.
I would have written it this way to avoid the parseOk variable and the extra indentation

char ns;
try {
  std::tie(ns, path) = parseLongPath(path);
} catch (...) {
  return Archive::EntryRange<EntryOrder::pathOrder>(m_impl, 0, 0);
}
begin_idx = m_impl->findx(ns, path).second;
if (path.empty()) {
  ns++;
} else {
  path.back()++;
}
end_idx = m_impl->findx(ns, path).second;

But it is a matter of preference, it works anyway.

If @veloman-yunkan is ok, we can merge.

maneeshpm · 2021-02-03T16:35:10Z

Thanks for the suggestion @mgautierfr. This looks much neater and more concise. I will try to follow it in the future as well.

kelson42 · 2021-02-03T17:28:24Z

@maneeshpm Your branch would benefit from being rebased on our git master, so we can merge it.

Return empty range for unknown paths

maneeshpm · 2021-02-03T17:33:44Z

@kelson42 rebased to master. Thanks!

kelson42 · 2021-02-03T20:43:11Z

@maneeshpm It seems your latest PR breaks on Windows, see https://github.com/openzim/libzim/runs/1825653239?check_suite_focus=true

maneeshpm · 2021-02-04T05:09:42Z

@kelson42 It looks like the windows.h header defines min and max macros, which interfere with this line:
auto shortPath = longPath.substr(std::min(i+2, (unsigned int)longPath.size()));
I think the best way to fix this is to specify the template argument explicitly, as in int k = std::min<int>(3, 4);. Do I need to open a new issue to fix this?
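
A minimal sketch of the two call forms that are commonly used to keep a function-like min macro from being expanded. shortPathAfter is a hypothetical helper written for this note, not libzim code. A function-like macro is only expanded when its name is directly followed by an opening parenthesis, so both std::min<unsigned int>(...) and (std::min)(...) are left alone by the preprocessor. Defining NOMINMAX before including windows.h is another widely used way to suppress the macros altogether.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper mirroring the line quoted above: returns the part of
// longPath that starts two characters after index i, clamped to the end.
std::string shortPathAfter(const std::string& longPath, unsigned int i)
{
  const unsigned int size = static_cast<unsigned int>(longPath.size());

  // Form 1: explicit template argument. "min" is followed by '<', not '(',
  // so a function-like macro named min is not expanded here.
  const unsigned int start = std::min<unsigned int>(i + 2, size);

  // Form 2: parenthesised function name, which has the same effect.
  const unsigned int sameStart = (std::min)(i + 2, size);
  assert(start == sameStart);

  return longPath.substr(start);
}
```

For example, shortPathAfter("A/foo", 0) gives "foo", while shortPathAfter("/A", 1) gives an empty string because the start index is clamped to the string length.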

kelson42 · 2021-02-04T06:15:14Z

@maneeshpm Yes, please. You should now have write permission on this repo; please make your PR here (and not in your fork).

maneeshpm mentioned this pull request Jan 25, 2021

Fixes #171 zimdump list --ns=<N> doesn't work as a filter openzim/zim-tools#216

Merged

maneeshpm marked this pull request as draft January 25, 2021 07:08

kelson42 requested review from mgautierfr and veloman-yunkan and removed request for mgautierfr and veloman-yunkan January 25, 2021 09:15

maneeshpm marked this pull request as ready for review January 25, 2021 09:20

veloman-yunkan requested changes Jan 25, 2021

maneeshpm requested a review from veloman-yunkan January 25, 2021 18:21

veloman-yunkan requested changes Jan 25, 2021

maneeshpm requested a review from veloman-yunkan January 26, 2021 07:03

veloman-yunkan requested changes Jan 26, 2021

maneeshpm requested a review from veloman-yunkan January 26, 2021 13:02

kelson42 modified the milestone: libzim 7.0.0 Jan 26, 2021

veloman-yunkan reviewed Jan 26, 2021

maneeshpm force-pushed the 477-parseLongPath-return-ns branch from 16df835 to 9134973 Compare January 27, 2021 12:33

maneeshpm requested a review from veloman-yunkan January 27, 2021 12:43

veloman-yunkan requested changes Jan 27, 2021

maneeshpm force-pushed the 477-parseLongPath-return-ns branch from 9134973 to 01f08fa Compare January 27, 2021 16:05

veloman-yunkan approved these changes Jan 27, 2021

maneeshpm requested a review from mgautierfr January 28, 2021 16:33

maneeshpm force-pushed the 477-parseLongPath-return-ns branch 2 times, most recently from 0172ab3 to c90fe22 Compare January 29, 2021 13:41

maneeshpm force-pushed the 477-parseLongPath-return-ns branch from d41a7bc to e7a26f7 Compare February 3, 2021 08:28

maneeshpm force-pushed the 477-parseLongPath-return-ns branch from e7a26f7 to 24330dd Compare February 3, 2021 14:42

maneeshpm force-pushed the 477-parseLongPath-return-ns branch from 24330dd to 7662fa4 Compare February 3, 2021 16:34

Fixes openzim#477 Modify parseLongPath, findByPath

6d8de41

Return empty range for unknown paths

maneeshpm force-pushed the 477-parseLongPath-return-ns branch from 7662fa4 to 6d8de41 Compare February 3, 2021 17:32

kelson42 merged commit 6340e80 into openzim:master Feb 3, 2021

maneeshpm mentioned this pull request Feb 4, 2021

windows.h min/max macros interfere with std::min/max #489

Closed

veloman-yunkan mentioned this pull request Feb 7, 2021

zim::windows::FD::close() buggy for a FD constructed from a POSIX file descriptor #478

Closed
