Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File changed during synchronization is not synced #1

Open
MarSoft opened this issue Aug 18, 2022 · 3 comments
Open

File changed during synchronization is not synced #1

MarSoft opened this issue Aug 18, 2022 · 3 comments

Comments

@MarSoft
Copy link
Contributor

MarSoft commented Aug 18, 2022

Consider the following: 2 files were changed almost simultaneously - A, then B. File A is large enough.
fswatch detects that file A is changed and tells us. We find out that file A was changed and start rsync.
Now, rsync runs for long enough, and during that files B and C are changed too.
After rsync has finished, we set last_synced time to the current time (i.e. to the time when rsync has finished) and find out the next line from fswatch telling us that file B is changed, too.
We check current time and find that just a little moment has passed since we did the last sync (as we check synchronization end time, not synchronization start time). So we don't run rsync.

Now, nothing is changed for a long time. And directories are out of sync: file A is synchronized but file B isn't.
This will only be fixed after some other file is changed, which may happen not very soon.

What we should do IMO: if no new changes happen for rsync_interval seconds but we know about some non-synchronized data then still run rsync.
This will complicate the code logic, and simple for line in ... loop won't work in this scenario. But it will avoid having directories unsynchronized for too long.

@MarSoft MarSoft changed the title Multi-file simultaneous change sync File changed during synchronization is not synced Aug 18, 2022
@axkibe
Copy link

axkibe commented Aug 20, 2022

Original but defunct Lsyncd author here, I skimmed over the code and it seemed to me as well there is a race condition in here. You need to set last_rsynced time before you run rsync, because for changes during the rsync command you just don't know if rsync got them over the line or not. You need to call it another time for everything that happened in the meantime. Unless I'm mistaken this should be easily fixable by simply exchanging line 45 and 46 main.py -- but not tested.

I still suggest making a heavy duty unit test that randomly creates/changes/deletes and moves a ton of files and directory in rapid succession and then runs a diff to check if source and target are the same. Like churn-rsync.lua but it's Lua and I guess you want to redo it Python. Took me a while to remove all race conditions (AFAIK). Funny enough when I first published it, multiple people told me that you just can't reliable create an asynchronous live program, but it is possible, but there are pitfalls.

Further short notes, IMO you should consider network error codes of rsync to retry, otherwise you run out of sync on a network hick up. And what I learned from a lot of feedback, don't do --delete by default config, I know Lsyncd has that, but that was less ideal in hindsight and would do different.

Two minor difference things, moves via ssh feature isn't there, but that was not so important optimization requested and Lsyncd always calls rsync with a custom filter setup so rsync only has to inspect those files that actually changes, that can be a big difference on large file trees. On the flipside using fswatch as established peer program I guess you got fanotify functionality out of the box, something Lsyncd never got. Back the day the kernel using fanotify didn't report moves thus was not usable and had to rely on inotify which uses kernel memory for every directory watched. It has been fixed since then.

Honestly if I would redo everything from scratch, I'd use a database like leveldb to store the timestamps of the source tree, thus the need to always run a full initial rsync on startup should fall flat.

Wish you much success on your project! Back the day when I needed a live asynchronization functionality I was surprised there was no opensource solution available, we certainly can use more diversity here. A few projects flickered up but unfortunally most seem to died of again. There was also one in Python -- unfortunally I forget the name -- it used multithreading with Pythons built in semaphore mechanics the code even seemed not very complicated.

@axkibe
Copy link

axkibe commented Aug 20, 2022

Ah no, one thing I forget which indeed makes the loop more complicated (as MarSoft suggested) you should first fully exhaust the message queue from your notification mechanism, then set the timestamp (for everything covered) and then call rsync. Dunno how your mechanism works here, put you'd need some peek functionality to see if another event is incoming before you try to receive it (and thus block if it isn't)

@ngocbh
Copy link
Owner

ngocbh commented Aug 22, 2022

Thank @MarSoft @axkibe for your great suggestions. This package was built within a night since I needed a quick solution to synchronize the source code in my local machine to a remote server where I could compile and run my code. Many things have not been considered yet.

Unfortunately, I am not using the output of fswatch for rsync. fswatch is just a signal to activate rsync (rsync still has to run on the entire source tree). Besides, it seems that setting a time interval for two consecutive rsync is not the ideal, efficient solution also.

My actual experience when using this package is that it's too slow. After changing the source code, I often go directly to the server's console and run the code, however, sometimes, the code wasn't there, making the debugging experience awful. Probably, this package or even lsyncd would be more suitable for backing up purposes.

I think this package needs a deeper thought about the use cases and especially, its efficiency. Probably, some things like mirror would be more suitable here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants