New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comprehensive, Native Windows Support #519

Open
Alexhuszagh opened this Issue Oct 17, 2017 · 21 comments

Comments

Projects
None yet
9 participants
@Alexhuszagh

Alexhuszagh commented Oct 17, 2017

Now, before you tell me this is a lot of work: I know, and am working on it (and almost done). Ideally, I would like to have my changes merged here, so I have a few questions and concerns for my current port.

Questions

Should I target a specific C++ standard?

Currently, my code depends on a few C++11 features, which can be easily removed with a few macros. This makes the code less readable, however, if C++03 support is desired, I will gladly change my implementation to conform to an older standard.

How to handle Unicode filesystem support?

Currently, LevelDB uses char-based (narrow) strings for for all filesystem operations, which does not translate well for Windows systems (since narrow strings use the ANSI, or OEM legacy codepages, and not UTF-8, for backwards compatibility). This means paths using international characters, or emojis, are therefore not supported with a simple port, something I consider to be an undesirable solution for a modern library. All the current forks of levelDB do not solve this fundamental issue, leading me to create my own implementation. Possible solutions include:

  1. A narrow (UTF-8) API on *Nix, and a wide (UTF-16) API on Windows, using a typedef to determine the proper path type.
  2. Converting all narrow strings from UTF-8 to UTF-16 before calling WinAPI functions.
  3. Providing both a narrow (ANSI) and wide (UTF-16) API on Windows.

The 2nd option, although the least amount of work, is the least amenable for me since the expected encoding for paths from levelDB would then conflict with the entirety of the WinAPI. The 3rd option, however, duplicates code to support both the narrow and wide WinAPI, which would increase the amount of work required to maintain levelDB. The first option is a happy median: it minimizes redundancy and is consistent with expectations about *Nix and Windows paths. I am, however, amenable to any suggestions the levelDB authors may have.

Intellectual Property

To emulate the behavior of mmap on Windows, I used a very lightweight library (<250 lines of code) from Steven Lee, mman-win32. However, looking over your contributor license agreement, it seems that my port would not satisfy Google's CLA until I remove this code from my implementation. If this is the case, I could easily use the raw WinAPI functions rather than the emulated mmap in my Windows port. Please notify me if I should remove this code prior to submitting a pull request.

Other Changes

CMake Build System

I introduced a CMake build system, which retains most of the same logic as the existing Makefile. The existing Makefile has not been deprecated.

AppVeyor Continual Integration

To ensure builds do not break the Windows builds, I am planning to add an AppVeyor configuration, which allows continual integration on Windows using MSVC.

Summary

If there is still interest for native Windows support, and the proposed changes are amenable to the levelDB authors, I would gladly submit a pull request.

@ghemawat

This comment has been minimized.

Contributor

ghemawat commented Oct 17, 2017

@pwnall

This comment has been minimized.

Member

pwnall commented Oct 17, 2017

Work on CMake support is already underway. Please use https://github.com/pwnall/leveldb/tree/cmake as a starting point, to avoid rework. Also, please use the Travis CI and AppVeyor configurations in https://github.com/google/snappy as a starting point for yours.

In general, I recommend following the approach taken by Chromium's LevelDB integration. Chromium builds (and runs) on Windows, and does not require modifications to the rest of the LevelDB.

@Alexhuszagh

This comment has been minimized.

Alexhuszagh commented Oct 17, 2017

Thank you for the feedback. This would be very feasible to do @ghemawat, especially if we use UTF-8 paths and just convert them in the Windows environment. As for CMake support, I will use that as a starting point (thank you). I will remove the mmap compatibility and use the raw WinAPI calls. Due to my other work, I should be able to finish this later this week.

@Alexhuszagh

This comment has been minimized.

Alexhuszagh commented Oct 20, 2017

@ghemawat and @pwnall, a quick question: When I asked the C++11 features and limiting myself to C++03, did you mean limit the codebase to C++98 or C++11? I have a few situations where std::chrono is dramatically more convenient than other code, however, I can remove this (it's only for NowMicros and SleepForMicroseconds). Other than that, the port should not require any new features.

Thank you and I am effectively done with my port, other than this minor question.

@pwnall

This comment has been minimized.

Member

pwnall commented Oct 20, 2017

We've recently decided that the next release will require C++11, so it's OK to use C++11. Sorry for the code churn on your end... this decision was not taken lightly.

@Alexhuszagh

This comment has been minimized.

Alexhuszagh commented Oct 21, 2017

@pwnall No worries, I understand that such fundamental choices do not lend themselves to casual decisions. Thank you for all the help.

@andschwa

This comment has been minimized.

andschwa commented Oct 26, 2017

Hi @Alexhuszagh,

As a soon-to-be user of the leveldb Windows support (for Mesos), here are are my thoughts:

Should I target a specific C++ standard?

For us, C++11 is fine, we already target this.

How to handle Unicode filesystem support?

This is an annoying problem, I had to fix this for Mesos. I went with:

  1. Converting all narrow strings from UTF-8 to UTF-16 before calling WinAPI functions.

It also leads into long path issues on Windows. I took an approach similar to the one CMake took for their Windows port, a longpath helper to translate all paths as they reach WinAPI functions from UTF-8 to UTF-16 with \\?\ prepended if necessary. My helper is here (note that the max path is not 255 despite documentation).

I need to stress: native long path support is probably a must for most Windows projects nowadays. It's not terribly difficult to do, it's just really annoying.

Also, for your comment:

is the least amenable for me since the expected encoding for paths from levelDB would then conflict with the entirety of the WinAPI

I'm not sure I entirely agree, there is no data loss going from UTF-8 to UTF-16. Both encodings are Unicode, it's just an implementation difference. Plus, you can do the conversion natively in C++11. Anyway, I've not had any problems on Windows having gone this route so far.

Thanks for your work! I personally know the trouble it is 😉

@andschwa

This comment has been minimized.

andschwa commented Oct 26, 2017

Also:

I introduced a CMake build system

Yay, a million times, yay! We'll pull it into Mesos with ExternalProject_Add.

@Alexhuszagh

This comment has been minimized.

Alexhuszagh commented Oct 30, 2017

Hi @pwnall and @ghemawat, I've had a few issues I cannot currently debug. I will attempt to use the Boost-based "windows" branch as a reference-point in short order.

Specifically, I've had 3 major issues:

  1. ~~issue178_test fails intermittently. ~10% of the time, it succeeds, without issue. 25% of the time, it produces a slightly lower number of keys than the 1.1m requires (almost always greater than 1.09m). The rest of the time, it produces the error Assertion Failed: r->options.comparator->Compare(key, Slice(r->last_key)) > 0, file level-db\table\table_builder.cc, line 97.~~

  2. db_test fails intermittently (at about the same frequency) during the random_read_counter_ section (from env_->random_read_counter_.Reset(); to ASSERT_LE(reads, N + 2*N/100);

  3. Most severely though, however, is the multi-threaded section seems to produce the bad block type error consistently, which would defeat the entire purpose of a multi-thread access.

Otherwise, all the test cases and benchmarks work. Just a heads up for the major delay.

(Items with strikethrough have been patched).

EDIT: Everything has been patched.

@pwnall

This comment has been minimized.

Member

pwnall commented Oct 30, 2017

No worries about delays. Honestly, I'm backed up with other work for the next two of weeks, and the odds that I'll be able to look at this are very low.

@cmumford cmumford added the enhancement label Nov 3, 2017

@Alexhuszagh

This comment has been minimized.

Alexhuszagh commented Nov 7, 2017

Everything has been patched, and I am adding extended file length support currently. @andschwa, for the extended file length support, since the "\\?\" prefix effectively removes all path parsing, do you know if there's anything else I need to consider other than:

  1. Relative paths (which cannot use the extended file length prefix).
  2. /, ., or .. in paths (somewhat tricky, see below).

For the . and .. operators, manually parsing them is somewhat tricky, since directory symbolic links may be in play. The only time-tested strategy for this is to iterate over all roots, parent directories, check to see if the item is a directory symbolic link or junction, get the real path of the directory if it is a symlink or junction, and then continue from there. This is because C:\leveldb\..\leveldb-1\README.md does not actually point to C:\leveldb-1\README.md if C:\leveldb is a symlink or junction.

This is very doable (and is relatively easy to implement), but it is fairly expensive since it requires filesystem calls. It requires a temporary vector to store each path component. Step-wise, the general approach is as follows:

  1. Check if the path is absolute (skip remaining steps otherwise).
  2. Replace all forward separators with backslashes.
  3. Recurse over each parent directory, from the root (drive letter, such as C:, or UNC root, such as \\host-name\share-name), to the (and excluding the) file (we don't care if the file is a symlink, since relative path components cannot follow it).
  4. If the directory basename is ., ignore the directory.
  5. If the directory basename is .., remove the preceding directory component.
  6. Otherwise, check if the directory is a symlink by calling CreateFile with the FILE_FLAG_OPEN_REPARSE_POINT and FILE_FLAG_BACKUP_SEMANTICS flags. If the handle is successfully created, it's a symlink, otherwise, it is not.
  7. If the directory is a symlink, read the proper path using DeviceIoControl with the FSCTL_GET_REPARSE_POINT code, and reset the vector using the absolute new path.

I would be amenable to implementing this (I've done this before, as may be obvious due to the detail of my explanation on how to implement such functionality), but this may add a lot of expense for a feature that application developer should have to be aware of themselves (that is, leveldb will already support an extended length path, if provided by the end-user).

Another major caveat is leveldb's filenames are short (the longest being the MANIFEST-00000X files, at 15 characters). Since the Windows documentation clearly states the maximum directory length must be MAX_PATH - 12, even with the extended length path prefix, this seems like a lot of work for an added 3 characters.

@pwnall, any thoughts? Should extended file length support be added, including with the caveats mentioned above?

@andschwa

This comment has been minimized.

andschwa commented Nov 8, 2017

@Alexhuszagh I don't believe you missed anything.

Re: bullet 6: I'm using this logic in Mesos:

  const DWORD access_flags = resolved_path_is_directory
    ? (FILE_FLAG_OPEN_REPARSE_POINT | FILE_FLAG_BACKUP_SEMANTICS)
    : FILE_FLAG_OPEN_REPARSE_POINT;

  const HANDLE handle = ::CreateFileW(
      longpath(absolute_path).data(),
      GENERIC_READ,     // Open the file for reading only.
      FILE_SHARE_READ,  // Just reading this file, allow others to do the same.
      nullptr,          // Ignored.
      OPEN_EXISTING,    // Open existing symlink.
      access_flags,     // Open symlink, not the file it points to.
      nullptr);         // Ignored.

adding the FILE_FLAG_BACKUP_SEMANTICS flag only if its a directory. I also specifically use OPEN_EXISTING, though you probably got that.

Re: point 7: I resolve the path using GetFinalPathNameByHandleW with the FILE_NAME_NORMALIZED flag, after using:

  const DWORD access_flags = resolved_path_is_directory
    ? FILE_FLAG_BACKUP_SEMANTICS
    : FILE_ATTRIBUTE_NORMAL;

  const HANDLE handle = ::CreateFileW(
      longpath(absolute_path).data(),
      GENERIC_READ,     // Open the file for reading only.
      FILE_SHARE_READ,  // Just reading this file, allow others to do the same.
      nullptr,          // Ignored.
      OPEN_EXISTING,    // Open existing file.
      access_flags,     // Open file, not the symlink itself.
      nullptr);         // Ignored.

to get a handle to the file/directory at the resolved path.

Re:

but this may add a lot of expense for a feature that application developer should have to be aware of themselves (that is, leveldb will already support an extended length path, if provided by the end-user)

I agree that just letting the end-user provide \\?\C:\long\paths without any extra handling from leveldb might be just fine, so long as all the Windows APIs used are the Unicode versions (and specifically listed as supporting long paths; though this is most of them).

@Alexhuszagh

This comment has been minimized.

Alexhuszagh commented Nov 8, 2017

@andschwa All the Window APIs are the Unicode versions. Currently, I use FILE_FLAG_BACKUP_SEMANTICS for files and directories, but that is easily changed. As for GetFinalPathNameByHandleW, unfortunately it somewhat raises a chicken/egg problem.

@andschwa

This comment has been minimized.

andschwa commented Nov 8, 2017

If FILE_FLAG_BACKUP_SEMANTICS is working for both, don't let me tell you it's wrong! I was using GetFinalPathNameByHandle specifically for resolution of symlinks; I see where it wouldn't quite work for you here.

All the Window APIs are the Unicode versions.

Perfecto.

@jenokizm

This comment has been minimized.

jenokizm commented May 1, 2018

Hi, I wanted to ask how your work is going? It's been about six months since your last messages, but I do not see any result. I need to build a library under Windows and I do not know how.

@chrismorfos

This comment has been minimized.

chrismorfos commented May 2, 2018

Hi how are you today

@ketnoimai

This comment has been minimized.

ketnoimai commented May 2, 2018

@Alexhuszagh

This comment has been minimized.

Alexhuszagh commented May 3, 2018

@jenokizm Sorry, I got extremely busy with work and have submitted a few PRs to this extent but it still needs work. My Windows development PC just arrived after breaking in March, so I should be able to finish this soon.

If you would to use this branch, it currently works on Windows:
https://github.com/Alexhuszagh/leveldb

@pwnall

This comment has been minimized.

Member

pwnall commented May 10, 2018

I think the ball is currently in our court. I need to find time to reconcile the various Windows PRs we've received with what we think this should look like.

@thomasjm

This comment has been minimized.

thomasjm commented Jul 29, 2018

Hi -- no pressure, but is there any ETA for when Windows support will land? Thanks!

@pwnall

This comment has been minimized.

Member

pwnall commented Jul 30, 2018

We have no timeline for this, sorry.

cmumford added a commit to cmumford/leveldb that referenced this issue Aug 31, 2018

Added native support for Windows.
This change adds a native Windows port (port_windows.h) and a
Windows Env (WindowsEnv).

Note1: "small" is defined when including <Windows.h> so some
parameters were renamed to avoid conflict.

Note2: leveldb::Env defines the method: "DeleteFile" which is
also a constant defined when including <Windows.h>. The solution
was to ensure this macro is defined in env.h which forces
the function, when compiled, to be either DeleteFileA or
DeleteFileW when building for MBCS or UNICODE respectively.

This fixes issue google#519.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment