klient/mount: machine mount index #10134

Merged
merged 5 commits into from Dec 31, 2016

Projects

None yet

3 participants

@ppknap
Member
ppknap commented Dec 28, 2016 edited

This PR adds index type.

⚠️ NOTE: this version is NOT optimized.

Depends on: -

Code coverage: 79.7% of statements

Description

This is an index-based machine mount architecture. It is intended to solve all(known) performance and stability problems with current implementation. It will allow for:

  • multiple mounts per remote machine.
  • two-way synchronization using FUSE for fast local->remote synchronization and Poller for remote->local->remote sync.
  • FUSE logic is loosely coupled - it is possible to easily replace/extend it if we want to support different VFS(may be useful for Windows support).
  • "lazy" mounts - in order to create complete FS on local machine, we only need to fetch remote index. Other files can be downloaded when requested by Kernel.
  • Context based synchronization - all sync processes will be stopped if remote machine is unreachable.
  • Star network mounts - local machines A & B can mount directory from machine C and changes made by A will be seen by machine B(not in real time).
  • etc.

Architecture

image

  • Mount - abstract representation of single mount.

  • Local/remote Index - index of files stored in mounted directory.

  • Mounts - storage of mounts per machine.

  • Sync - type responsible for syncing remote and local indexes.

  • Syncer - interface responsible for transferring files between local and remote machine.

  • RSync - RSync process used for syncing files.

  • Notifier - interface responsible for creating index changes.

  • FUSE - real time FS notification system based on FUSE.

  • Poller - periodically checks remote and local index in order to keep them synced.

How Has This Been Tested?

Unit tests.

Screenshots (if appropriate):

none

Types of changes

  • New feature (non-breaking change which adds functionality)
@ppknap ppknap requested review from cihangir and rjeczalik Dec 28, 2016
+
+ if maxsize < 0 {
+ return len(idx.entries)
+ } else if maxsize == 0 {
@rjeczalik
rjeczalik Dec 28, 2016 Member

question: Why do we treat 0 as nop?

@ppknap
ppknap Dec 28, 2016 Member

because we know that that there are no files(on UNIX) with size == 0 this is a small optimization made to not iterate over the index.

@rjeczalik
rjeczalik Dec 28, 2016 Member

I meant 0 and < 0 arguments are treated as all in stdlib. Tried to understand why we treat -1 and 0 differently here.

@ppknap
ppknap Dec 28, 2016 Member

I took the idea from strings(regex) packages eg. https://golang.org/pkg/strings/#Replace where only negative numbers are treated as all. If this is a case I can change it.

@rjeczalik
rjeczalik Dec 28, 2016 Member

If a nop would be handy (e.g. no need for conditional Count) then let it be. Did not think of it as an optimisation.

+// UnmarshalJSON satisfies json.Unmarshaler interface. It is used to unmarshal
+// data into private index fields.
+func (idx *Index) UnmarshalJSON(data []byte) error {
+ return json.Unmarshal(data, &idx.entries)
@rjeczalik
rjeczalik Dec 28, 2016 edited Member

nit: It should take a lock, for consistency with MarshalJSON.

+ mu sync.RWMutex
+ entries map[string]*Entry
+}
+
@rjeczalik
rjeczalik Dec 28, 2016 Member

nit: You can show here explicitly that Index implements json.Marshaler/json.Unmarshaler.

+ ChangeMetaRemove // File was removed.
+ ChangeMetaAdd // File was added.
+
+ ChangeMetaLarge ChangeMeta = 1 << (8 + iota) // File size is above 4GB.
@rjeczalik
rjeczalik Dec 28, 2016 edited Member

question: Is there any reason this is separate from the ChangeMeta group?

@ppknap
ppknap Dec 28, 2016 Member

I made this for future usage. I plan to add more change types like ChangeMetaChmod ChangeMetaMovedFrom/To. This will allow to select better sync strategy based on observed operation. Eg. RSync is not very good with moves, so Sync can use KD/Kite(remotely) & mv(locally) instead.

+
+// NewEntryFile creates new Entry from a file stored under path argument.
+// Info is optional and, if given, should store LStat result on the given file.
+func NewEntryFile(root, path string, info os.FileInfo) (e *Entry, err error) {
@rjeczalik
rjeczalik Dec 28, 2016 Member

From what I understand:

  • we store relative paths, relative to the mount dir, so it's possible to use the same index for two mounts with different mount directories; the mount dir here is root (can we add a comment for this argument?)
  • we store paths converted to Unix paths, so it'd possible to use window index on linux

Is it correct?

@ppknap
ppknap Dec 28, 2016 Member

@rjeczalik you are correct. root is different on remote machine than on local one. And yes, the plan is to make possible to create mounts between Windows <--> Linux.

I took root name from https://golang.org/pkg/path/filepath/#Walk

+
+// Index stores a virtual working tree state. It recursively records objects in
+// a given root path and allows to efficiently detect changes on it.
+type Index struct {
@rjeczalik
rjeczalik Dec 28, 2016 edited Member

question: Is it possible for a single index to index files from different mounts?

I'm trying to find out whether it would be feasible to makeroot a field of index instead of passing it to most functions as an argument.

@ppknap
ppknap Dec 28, 2016 Member

A: Everything is possible:). But, do we want one index for all mounts? It is guarded by mutex so eg. unmounting will stop all other mounts making it slow.

Other reason is that when index get broken we can just drop it and recreate. It's better to make this for a single mount rather than all of them(some of other ones might be even unreachable at that moment - machine is powered off).

@rjeczalik
rjeczalik Dec 28, 2016 Member

I don't want to store multiple mounts in a single index. Can we make a root member of index?

@ppknap
ppknap Dec 30, 2016 Member

Two identical indexes can use different roots. That's why I didn't add Root field to Index struct. Eg.
you're sending Index from remote machine to local. Root value from remote will be (almost) always incorrect on local.

@rjeczalik
rjeczalik Dec 30, 2016 Member

Oh, got it. I assumed remote paths are already relative, so it's easier to work with index locally if the side owning an index presents paths in relative form.

@rjeczalik
Member
rjeczalik commented Dec 28, 2016 edited

I really like your diagrams and designs, makes reading PR much more pleasurable.

I have 2 questions if you wouldn't mind:

NOTE: this version is NOT optimized.

  1. What optimisations do you have in mind? Just curious.
  2. How we would handle corrupted mount dir? E.g. a directory tree on remote has one directory with no access rights so it's not possible to index it and user wants to mount it locally - do we want to show it's not possible to have a full mount due to access problems? Right now we ignore all errors.
@ppknap
Member
ppknap commented Dec 28, 2016

What optimisations do you have in mind? Just curious.

  • when scanning the directory for changes we don't need to check directory when its mtime didn't change. (idea is taken from git index - however, not all operating systems update mtime of directory when file inside it changes). This can greatly reduce the number of files we need to stat.
  • binary format of index/compression - will reduce the amount of data we need to transfer over the network.
  • custom filepath.Walk function - we don't need to walk directories in lexical order.
  • if we have two indexes to sync. we can send just SHA-1 of the entry, if second index doesn't have it, this means that we need to sync it. This gives 200Kb over network for index that manages 10000files.

I'm going to check more ways for opt after POC.

How we would handle corrupted mount dir? E.g. a directory tree on remote has one directory with no access rights so it's not possible to index it and user wants to mount it locally - do we want to show it's not possible to have a full mount due to access problems? Right now we ignore all errors.

All corrupted indexes can be dropped and recreated without harm to the actual mount. If there is a directory with no access rights it won't be added to the index. If someone change its permissions on the remote machine and it become accessible, Pooler sync will eventually fetch the changes to local machine.

@rjeczalik
Member
rjeczalik commented Dec 28, 2016 edited

binary format of index/compression - will reduce the amount of data we need to transfer over the network.

We should introduce msgpack2 to koding/kite instead and agree on used encoding and transport during handshake (koding/kite#158).

All corrupted indexes can be dropped and recreated without harm to the actual mount. If there is a directory with no access rights it won't be added to the index. If someone change its permissions on the remote machine and it become accessible, Pooler sync will eventually fetch the changes to local machine.

I'm only concerned whether there should exist an API that would tell if there are any difficulties with remote dir (inaccessible directories etc.) that would cause incomplete index / mount.

+ // Compute file's SHA-1 sum.
+ var sum []byte
+ if !info.IsDir() {
+ if sum, err = readSHA1(path); err != nil {
@cihangir
cihangir Dec 29, 2016 Member

lets use crc32 ( i have tried with IEEETable ) here instead of sha1, it is way faster than any of crc64 ( 6x ) , sha1( 11x), md5 (11x), sha256 (22x) both in big files & small files

@ppknap
ppknap Dec 30, 2016 Member

I'd like to keep SHA-1 here because I have some interest in keeping hashes consistent with these stored in git index(which means that I need SHA-1).

I have (long-term) plan to add another Syncer interface implementation. Lets call it GitSync that can detect git repository in mounted folder and improve git operations on it. Eg.

User runs git checkout my_other_branch on mounted directory. This can create a lot of FUSE changes and require as much of Rsync syncer work. Now, assuming we have GitSync that can examine git index/reflog/etc. and detect eg. checkout. Thus, it will be able to just run git checkout mu_other_branch on remote machine without copying anything.

These are just plans but, if we ever start implementing this, it would be nice to have SHA-1 sums in order to perform sanity checks on git index.

@cihangir
cihangir Dec 30, 2016 Member

you are calculating the hash just from the file content, git hashing uses other kind of info ( commit message, author, committer, previous hash etc ), how are you planning to keep them insync? or from another angle, git has history of commit orders, this implementation of index does not have any kind of ordering/sorting. Another point for git indexes would be, files could be in the folder (uncommitted) without a hash value, but in any case you need to keep them in the index for various reasons.

I dont have a strong opinion at this moment since we will see what is waiting for us in the future, i just wanted to give you a speed bump of 10x :)

@ppknap
ppknap Dec 31, 2016 Member

I dont have a strong opinion at this moment since we will see what is waiting for us in the future..

Exactly, I have the same feeling and I don't mind changing this to CRC32 when SHA-1 turns out to be useless :)

you are calculating the hash just from the file content, git hashing uses other kind of info ( commit message, author, committer, previous hash etc ),

You are partially right I'm speaking of git index itself which stores SHA-1 of file content

20 byte SHA-1 over the previous content of the entry. Note that according to my experiments with the assume valid flag, the flags that follow it are not considered in this SHA-1.
+func (idx *Index) Apply(root string, cs ChangeSlice) {
+ for i := range cs {
+ switch {
+ case cs[i].Meta&(ChangeMetaUpdate|ChangeMetaAdd) != 0:
@cihangir
cihangir Dec 29, 2016 Member

could you explain the cases as comments

@@ -0,0 +1,14 @@
+// +build !darwin,!linux
@cihangir
cihangir Dec 29, 2016 Member

it seems this file is not needed, if fi.Sys returns nil, linux implementation will result with the same, if sometime in the future Sys is implemented any other OS, it will work without requiring any code change

@rjeczalik
rjeczalik Dec 29, 2016 Member

This file stubs ctimeFromSys for platforms other than darwin/linux (e.g. Windows). Just to make it possible to build the index package.

@cihangir
cihangir Dec 29, 2016 Member

ok i missed index_darwin file, what keeps us from implementing for others or using a package for that ( i remember there was a package just for unifying them )

@cihangir

Could you update the description with the coverage results?

@ppknap
Member
ppknap commented Dec 30, 2016

@cihangir Done

ppknap added some commits Dec 22, 2016
@ppknap ppknap klient/mount: initial index implementation 3af988d
@ppknap ppknap klient/mount: add inline docs to index package a5775ee
@ppknap ppknap klient/mount: add tests for index package 1dd9e4f
@ppknap ppknap klient/mount: add JSON support for index ae3a83b
@ppknap ppknap klient/mount: better way of getting ctime from file info
e0e9c13
@ppknap ppknap referenced this pull request Dec 31, 2016
Merged

Machine mount index cache #10188

1 of 1 task complete
@rjeczalik rjeczalik merged commit 2f57d6f into master Dec 31, 2016

2 checks passed

continuous-integration/wercker Wercker build passed
Details
licence/cla Contributor License Agreement is signed.
Details
@rjeczalik rjeczalik deleted the machine_mount_index branch Dec 31, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment