
Trash Does Not Cleanup #702

Closed
mwaeckerlin opened this issue May 18, 2018 · 36 comments

Comments

mwaeckerlin commented May 18, 2018

I have a strange situation.

I create hourly, daily, weekly and monthly snapshots. After each snapshot, old snapshots are removed using lizardfs rremove. Since I have a lot of files and chunks, there are a lot of removals each hour. By default the trash-bin time is 24h; I reduced it yesterday to 1h.
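
For illustration, the rotation looks roughly like this (paths and snapshot names here are hypothetical, not my actual script):

# take the new hourly snapshot
lizardfs makesnapshot /mnt/lizardfs/live /mnt/lizardfs/snapshots/hourly-$(date +%Y%m%d%H)
# recursively remove the snapshot taken 24 hours ago
lizardfs rremove /mnt/lizardfs/snapshots/hourly-$(date -d '24 hours ago' +%Y%m%d%H)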

But the trash space seems to be constantly growing. Current status:

LizardFS v3.12.0
Memory usage:   20GiB
Total space:    89TiB
Available space:        27TiB
Trash space:    382TiB
Trash files:    28424063
Reserved space: 0B
Reserved files: 0
FS objects:     50357273
Directories:    1309839
Files:  47818938
Chunks: 8350029
Chunk copies:   16699906
Regular copies (deprecated):    16699906

So many trash files, but where are they? I cannot see a single file in trash:

# mfsmount -o mfsmeta,mfsmaster=universum /mnt
mfsmaster accepted connection with parameters: read-write,restricted_ip
# ls /mnt
reserved  trash
# cd /mnt/trash
# time ls -lA
total 0

real    8m53.309s
user    0m0.000s
sys     0m0.004s

Questions:

  1. Why is there no restore folder?
  2. Why do I see no trash files?
  3. How can I force immediate trash bin cleanup?

Edit: ls -lA takes ~9 minutes before it shows 0 files.

@mwaeckerlin

My current configuration:

In /etc/mfs/mfsmaster.cfg:

LOAD_FACTOR_PENALTY = 0.5
ENDANGERED_CHUNKS_PRIORITY = 0.6
REJECT_OLD_CLIENTS = 1
CHUNKS_WRITE_REP_LIMIT = 20
CHUNKS_READ_REP_LIMIT = 100

In /etc/mfs/mfschunkserver.cfg:

MASTER_HOST = universum
HDD_TEST_FREQ = 3600
ENABLE_LOAD_FACTOR = 1
NR_OF_NETWORK_WORKERS = 10
NR_OF_HDD_WORKERS_PER_NETWORK_WORKER = 4
PERFORM_FSYNC = 0

mwaeckerlin commented May 18, 2018

It's still growing, currently:

Trash space:    394TiB
Trash files:    29234626

Now it seems to be cleaning up the trash:

Two hours later:

Trash space:    390TiB
Trash files:    29197844

One more hour later:

Trash space:    381TiB
Trash files:    29331928

Three more hours later:

Trash space:    354TiB
Trash files:    29553804

But the questions above still remain: How can I immediately clear the trash?

4Dolio commented May 18, 2018

I think your trash is perhaps too large for ls. Try using find instead. find is faster as it does not attempt to stat each file as it goes. If you just want to purge the trash, find can -delete (or -exec rm) as it goes. 28 million files in one folder is quite a lot, so it can easily overwhelm things like ls.

@mwaeckerlin

Do you mean:

find /mnt/trash -exec rm {} \;

Or which command do you mean I should execute?

Ok, it has been running (for ½h now), but it does not seem to be a success:

# time find /mnt/trash
/mnt/trash
 [… still running …]

@guestisp

find /mnt/trash -print -delete

4Dolio commented May 18, 2018

Good call with the print... it is probably working, but some verbose output is helpful to see what it is doing. It could take a very long time... you should quote your '{}' if you do it the original way; guestisp's way is 'better'. Do you see any change in the CGI? Fewer trash files maybe?

@guestisp

With the above command, every deleted file is also printed.

Anyway, rsync is even faster; just sync an empty directory:

mkdir /tmp/empty
rsync -av --delete /tmp/empty/ /mnt/trash/

@mwaeckerlin

The SSH connection aborts before find ends, but not a single file is found in this time. So there's something wrong with the meta filesystem!

Anyway, at least, trash space is still decreasing:

Trash space:    332TiB
Trash files:    28902103

@mwaeckerlin

@guestisp, @4Dolio, unfortunately it does not do anything at all. The meta filesystem is somehow defective!

$ time sudo rsync -av --delete /tmp/empty/ /mnt/trash/
sending incremental file list
./

sent 59 bytes  received 19 bytes  0.18 bytes/sec
total size is 0  speedup is 0.00

real    7m9.107s
user    0m0.008s
sys     0m0.000s

But still more or less the same as before (it was 328TiB / 28715421 just before I started rsync):

Trash space:    328TiB
Trash files:    28680974

4Dolio commented May 19, 2018

Try this:
find /mnt/trash/ | head

And does your mfsexports.cfg have a valid metadata entry for that client?
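
For reference, a metadata export in mfsexports.cfg is just a line whose path column is "."; the network below is only an example and has to match the client:

# mfsexports.cfg on the master
192.168.0.0/24    .    rw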

@mwaeckerlin

@4Dolio, head just takes the first lines of the output. Since there is no output at all, it wouldn't change anything.

4Dolio commented May 19, 2018

You are working on a massive set; it can't hurt.

4Dolio commented May 19, 2018

Can you tab complete /mnt/trash/und^t or /mnt/res^t?

mwaeckerlin commented May 19, 2018

Absolutely nothing: not with tab completion, not with ls, nor with find, nor with rsync. Directly on the host (no SSH), find terminates with no result after ~8 minutes.

So there is some massive bug in the meta file system!

4Dolio commented May 19, 2018

shrugs... there should be a /mnt/reserved, and I feel like it should still exist and be visible. It seems like you might not be connected properly, like an export problem. Sometimes a normal data mount will act oddly if, say, no chunkservers happen to be present. Just grasping for clues.

4Dolio commented May 19, 2018

If you set trash time to zero on a non-empty file, lock it, and delete it, I think it will show up in reserved until you let go of the lock (maybe from a different client?). .oO( I once managed to delete all data except the locked reserved files because LIO iSCSI was still active and holding locks... manually rolled back the text changelog to before the deletion, rebuilt, restarted, saved the reserved chunks and was back online. Was lucky, don't try this in production ;)

Maybe reserved no longer exists? I just rolled back my 3.10/3.12 to 2.6, so I cannot check.

Maybe mfsmeta is bugged out? It is normally magical and awesome. You should just be able to find $magic delete to purge all or some of the trash. Sigh.

I would say no more snapshots until trashtime is 0 and the trash purges in more or less real time...

4Dolio commented May 19, 2018

Maybe try to mount the mfsmeta from a different client?

@mwaeckerlin

Yes, @4Dolio, /mnt/reserved exists, but is also empty. But the restore folder does not exist.

And yes, I already mounted it directly on the master host and on other hosts too. It's the same everywhere.

The normal filesystem works. All data seem to be there. Just meta is strange.

@mwaeckerlin

Could it be that deleting after the timeout is an expensive and slow operation? Could it be that I snapshot and delete faster (all data once per hour) than the data can be cleaned up? Could this be the reason for my crash two days ago, see #700?

Now I have removed the hourly snapshot and reduced to daily snapshots.

4Dolio commented May 19, 2018

The existence of metadata.mfs.tmp indicates the master was dumping that file, but it did not finish the dump and rename it to metadata.mfs. And the lock indicates it did not exit cleanly and clear the lock.

We don't know why it died, though, nor what unusual state that resulted in.

I have never used rremove, which you mentioned using, so I don't know what it does; it is too new for my experience. If you didn't modify trash times first, then you may still have 24 hours of retention after deletion, so you accumulate 24 × (object count per snapshot). Perhaps your trash object count is correct?

Maybe you need to settrashtime 0 before the rremove so the files get purged more quickly, and do not stack up and overload mfsmeta into your broken state?
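
Something along these lines (the snapshot path is just an example, not your actual layout):

# set trash time to 0 recursively on the old snapshot, then remove it,
# so its files skip the trash instead of sitting there for 24 hours
lizardfs settrashtime -r 0 /mnt/lizardfs/snapshots/hourly-2018051700
lizardfs rremove /mnt/lizardfs/snapshots/hourly-2018051700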

Maybe the broken state is only related to the crash?

The only thing I can suggest is to wait for it to clear out the trash... maybe a dev can jump in with other things to try...

mwaeckerlin commented May 19, 2018

Well, I had to hard-reset the computer, since it was no longer responsive.

rremove was introduced as a fast `rm -f` replacement. There is an issue about this topic somewhere here.

Yesterday I decreased the trash time from 24h to 1h. But there are still files in the trash that I removed on Thursday. I can't see those files, but I know they are there, because they have missing chunks and I can still get the list of missing chunks, and they are still listed. The missing chunks are from February, when my hard disk ran full.

Yeah, still waiting…

@guestisp

Do you have free space available?

@mwaeckerlin

Now, plenty:

$ lizardfs-admin info universum 9421
LizardFS v3.12.0
Memory usage:   9.9GiB
Total space:    89TiB
Available space:        27TiB
Trash space:    218TiB
Trash files:    18714600
Reserved space: 0B
Reserved files: 0
FS objects:     35706013
Directories:    1144199
Files:  33335611
Chunks: 8366008
Chunk copies:   16729323
Regular copies (deprecated):    16729323

I added two more 10TB disks… :)

4Dolio commented May 20, 2018

Your trash object count is lower at least... 18 million... it would be interesting to know if mfsmeta begins to work, and when...

@mwaeckerlin

I'll keep you up to date. All my services have been down since Thursday. I hope I'll get them back up today. It's a Docker swarm running on top of LizardFS, where the nodes are both LizardFS chunk-/master-/logger-servers and swarm nodes at the same time. That used to work well for some months, but currently Docker on the swarm master is no longer responsive, so I migrated to another node. The migration is still in progress.

You'll find my configuration here.

@guestisp

Are you using Lizard replicating over a Powerline? Seriously?

Powerline has huge and unstable latency...

mwaeckerlin commented May 20, 2018

@guestisp

Are you using Lizard replicating over a Powerline?

Not any more. After the problems described in #659, all nodes are now in the same location, connected through the same 1Gb switch.

mwaeckerlin commented May 20, 2018

→ I updated my blog post.

@mwaeckerlin

Now, trash space begins to stabilize:

LizardFS v3.12.0
Memory usage:   12GiB
Total space:    89TiB
Available space:        27TiB
Trash space:    82TiB
Trash files:    6042804
Reserved space: 0B
Reserved files: 0
FS objects:     22029983
Directories:    1109053
Files:  19695237
Chunks: 8377626
Chunk copies:   16749271
Regular copies (deprecated):    16749271

@mwaeckerlin

Now that I no longer create hourly backups and only delete one backup a day instead of one every hour, it's back to normal:

Trash space:    69MiB
Trash files:    6743

So: Deletion of snapshots is much too slow!

4Dolio commented May 22, 2018

Did your mfsmeta start working?

True. And unfortunate. Long, long ago, continuous snapshotting was floated. It can't be done yet, nor can snapshots really be used as rapidly as hourly.

@mwaeckerlin

Did your mfsmeta start working?

Yes!

4Dolio commented May 22, 2018

Ok, good. I wonder at what point it becomes broken?

dminca commented Jul 12, 2018

Another way of deleting trash files is to mount the MFSMETA filesystem and purge them manually.

First of all, allow that server to mount LizardFS by adding the server's IPv4 address to the exports file.

On LizardFS master

# mfsexports.cfg
172.26.3.1/32 .                rw

On server X

sudo mkdir -p /mnt/mfsmeta
sudo mfsmount -m /mnt/mfsmeta/ -o mfsmaster=lizardfs-master.org

You can trash/purge the deleted files from here:

/mnt/mfsmeta/trash
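
For example, to purge everything currently sitting in the trash (assuming the mount above; this is the same find -delete approach suggested earlier in this thread):

find /mnt/mfsmeta/trash -type f -print -delete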

@mwaeckerlin

Final conclusion: LizardFS is too slow at cleaning up for hourly snapshots (where the oldest snapshot is removed each hour). Daily snapshots (incl. cleanup) are not a problem.

If you need a backup script for your cron job, try mine:
https://mrw.sh/admin-scripts/backup/src/branch/master/lizardfs
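
A daily rotation along the lines of this conclusion could look roughly like the following crontab entry (paths and retention are illustrative; the actual script is the one linked above):

# snapshot once a day at 03:00 and drop the snapshot from a week ago
0 3 * * * lizardfs makesnapshot /mnt/lizardfs/live /mnt/lizardfs/snapshots/daily-$(date +\%F) && lizardfs rremove /mnt/lizardfs/snapshots/daily-$(date -d '7 days ago' +\%F)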

mwaeckerlin commented Nov 9, 2018

In the last update of my backup scripts, I accidentally re-enabled hourly snapshots, so it creates and deletes one snapshot per hour. Now my master server has been down for two days.

As additional information worth mentioning: even though the server has 40GB of RAM and normally needs ~23GB, memory is exhausted and it is swapping. So memory seems to be the limit.

I suppose this is not because of the snapshots, but due to the delete operations?

I just ordered another 32GB of RAM and I'll upgrade the server to 72GB this evening. Let's see how much this helps.
