
Strange 'find' tool behaviour when scanning mounted .zip file #104

Closed
Vadiml1024 opened this issue Feb 16, 2023 · 29 comments

@Vadiml1024

I'm mounting a .zip file as follows:

sudo -b ./ratarmount-x86_64.AppImage -o ro,allow_other ~/tmp/viruses/000-test-001-003-302.zip /tmp/mnt
Then I'm doing:

vadim@vadim-tp:~/tmp/viruses$ find /tmp/mnt/003-virus -name '*.exe'
/tmp/mnt/003-virus/gozi_2.exe
/tmp/mnt/003-virus/louyue_liandianqi2.1.exe
/tmp/mnt/003-virus/virus2.exe
/tmp/mnt/003-virus/virus3.exe

And

vadim@vadim-tp:~/tmp/viruses$ find  /tmp/mnt/ -name '*.exe'
vadim@vadim-tp:~/tmp/viruses$

For some reason find does not recurse into the mount point.
ls -lR, however, has no problem recursing into it.

@mxmlnkn
Owner

mxmlnkn commented Feb 16, 2023

Could it be that your find command is aliased or overridden to automatically add the -mount option?

Does it happen for other FUSE mounts, e.g., archivemount, sshfs, etc.? Does it happen for mounted tar archives as opposed to zip?

@mxmlnkn
Owner

mxmlnkn commented Feb 16, 2023

I cannot reproduce the problem, even when using sudo, the AppImage, and your specified options. In addition to my previous suggestions: what do the permissions on /tmp/mnt look like? Does find work without any arguments like -name?

@Vadiml1024
Author

Vadiml1024 commented Feb 17, 2023

vadim@vadim-tp:~/work/ratarmount$ find /tmp/mnt
/tmp/mnt
/tmp/mnt/001-eicar
/tmp/mnt/003-virus
/tmp/mnt/302-positifs-PWD-Protected
vadim@vadim-tp:~/work/ratarmount$ ls -l /tmp/mnt
total 2
dr-xr-xr-x 1 root root 0 nov.  29 17:01 001-eicar
dr-xr-xr-x 1 root root 0 nov.  29 17:01 003-virus
dr-xr-xr-x 1 root root 0 nov.  29 17:01 302-positifs-PWD-Protected

@Vadiml1024
Author

Vadiml1024 commented Feb 17, 2023

I'm attaching my .zip file. ATTN: some of the files inside it contain Windows viruses.

@mxmlnkn
Owner

mxmlnkn commented Feb 17, 2023

Hm. It might be better if you deleted the attachment again 😅 . But I can indeed reproduce the problem. One idea that comes to mind is the 0 shown in ls -l. Maybe find is optimized to not descend into directories that advertise their own size as 0?

I tried setting the size to 1 but it doesn't help ... Maybe this would be more helpful to report to the find tool itself? At this point, I'll have to compile find myself and debug it to see why it doesn't descend into the folders.

@Vadiml1024
Author

find -L /tmp/mnt does go into subdirs

@Vadiml1024
Author

Vadiml1024 commented Feb 17, 2023

This small Qt program does not recurse into subdirs either...

#include <QCoreApplication>
#include <QDirIterator>
#include <QDebug>

int main(int argc, char *argv[])
{

	QDirIterator it(argv[1], QDir::Files|QDir::Hidden, QDirIterator::Subdirectories);
	QString path;
	for( path = it.next(); !path.isEmpty(); path = it.next()) {

		QFileInfo fileInfo(path);
		if(!fileInfo.isFile() || fileInfo.isSymLink()) {
			qDebug() << "Maybe Ignoring" << path <<
			 "fileInfo:" << fileInfo.isFile() << "isSymlink:" << fileInfo.isSymLink() <<
			 "isDir:" << fileInfo.isDir();
			if (fileInfo.isDir()) {
				qDebug() << path << "is a directory. Will process it.";
			}
			continue;
		}

		qDebug() << path;
	}

	return 0;
}

@mxmlnkn
Owner

mxmlnkn commented Feb 17, 2023

Thank you for the minimal reproducer in Qt! I can reproduce it with that. find -L doesn't work for me though. And the output of the Qt reproducer makes it all the weirder:

Maybe Ignoring "mount-test/subdir/subdir" fileInfo: false isSymlink: false isDir: true
"mount-test/subdir/subdir" is a directory. Will process it.

Everything looks fine here and it still does not show the contained file.

Test archive with dummy data (probably any archive with a subfolder is fine. I might have only tested an archive with a single file initially):

mkdir subdir
echo foo > subdir/mimi.exe
zip subdir.zip subdir/mimi.exe

One possible problem that comes to mind is the readdir interface. There are two possible FUSE implementations for it: one that only returns the names and one that returns the names plus all stats for each entry. Maybe something is wrong with that.
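For reference, the two readdir result shapes can be illustrated with plain data. This is a sketch, not ratarmount's actual code; the entries table is invented, and the tuple shape follows what, e.g., fusepy accepts (either a list of names or (name, attrs, offset) tuples):

```python
import stat

# Hypothetical directory table, as a FUSE readdir implementation might hold it.
entries = {
    "subdir": {"st_mode": stat.S_IFDIR | 0o555, "st_size": 0},
    "mimi.exe": {"st_mode": stat.S_IFREG | 0o444, "st_size": 4},
}

# Variant 1: names only -- the kernel issues a separate getattr per entry.
def readdir_names():
    return [".", ".."] + list(entries)

# Variant 2: names plus attributes. If st_mode here lacked the S_IFDIR bit,
# the entry's d_type would be reported as DT_REG and tools that trust d_type
# (like find) would not descend into the directory.
def readdir_with_attrs():
    return [(name, attrs, 0) for name, attrs in entries.items()]

for name, attrs, _ in readdir_with_attrs():
    print(name, "dir" if stat.S_ISDIR(attrs["st_mode"]) else "file")
```

The second variant is faster (no per-entry getattr round trips) but gives the filesystem a second place where the mode can come out wrong.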

@Vadiml1024
Author

Vadiml1024 commented Feb 17, 2023

The following program (C++17) shows the same behaviour:

#include <filesystem>
#include <iostream>

void ls_recursive(const std::filesystem::path& path) {
    for (const auto& p : std::filesystem::recursive_directory_iterator(path)) {
        std::cout << "seeing: " << p.path() << '\n';
        if (!std::filesystem::is_directory(p)) {
            std::cout << p.path() << '\n';
        }
    }
}

int main(int argc, char *argv[])
{
    ls_recursive(argv[1]);
    return 0;
}

@Vadiml1024
Author

Vadiml1024 commented Feb 17, 2023

Even this one does not work correctly:

#include <stdio.h>
#include <dirent.h>
#include <sys/stat.h>
#include <string.h>

void list_dir(const char *path) {
    DIR *dir = opendir(path);
    if (dir == NULL) {
        perror("opendir");
        return;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_type == DT_DIR) {
            // Ignore the "." and ".." directories
            if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0) {
                continue;
            }

            // Recurse into the subdirectory
            char new_path[1024];
            snprintf(new_path, sizeof(new_path), "%s/%s", path, entry->d_name);
            list_dir(new_path);
        } else {
            // Print the file name
            printf("%s/%s\n", path, entry->d_name);
        }
    }

    closedir(dir);
}

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <directory>\n", argv[0]);
        return 1;
    }

    list_dir(argv[1]);

    return 0;
}

Apparently entry->d_type == DT_DIR is FALSE !!!!
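A defensive directory walker sidesteps the problem by never trusting d_type and asking lstat() for the real mode (presumably also why find -L worked above, since -L must stat to resolve symlinks). A minimal sketch, using a throwaway directory tree rather than the FUSE mount:

```python
import os
import stat
import tempfile

def list_files(path):
    """Recursively list regular files without trusting readdir's d_type.

    A DT_UNKNOWN -- or, as with the buggy FUSE mount above, a wrong --
    d_type is worked around by always asking lstat() for the real mode.
    """
    results = []
    with os.scandir(path) as it:
        for entry in it:
            mode = os.lstat(entry.path).st_mode  # full stat, ignores d_type
            if stat.S_ISDIR(mode):
                results.extend(list_files(entry.path))
            elif stat.S_ISREG(mode):
                results.append(entry.path)
    return results

# Quick self-check on a temporary tree mirroring the reproducer archive.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "subdir"))
    open(os.path.join(root, "subdir", "mimi.exe"), "w").close()
    found = list_files(root)
    print(found)
```

The extra lstat() per entry is exactly the cost that the d_type optimization in find and friends is meant to avoid.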

@Vadiml1024
Author

Vadiml1024 commented Feb 17, 2023

This is probably related:
ziglang/zig#5123

@Vadiml1024
Author

Well, I see the d_type for the offending subdirs is 8, which is DT_REG -- regular file.
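That value 8 is consistent with how Linux derives d_type: it is simply the file-type nibble of st_mode, so a filesystem layer that stamps directory entries with S_IFREG yields DT_REG. A quick check of the correspondence:

```python
import stat

# On Linux, d_type == (st_mode & S_IFMT) >> 12 for most filesystems.
DT_DIR, DT_REG, DT_LNK = 4, 8, 10

assert stat.S_IFDIR >> 12 == DT_DIR  # directories
assert stat.S_IFREG >> 12 == DT_REG  # regular files -- the 8 observed above
assert stat.S_IFLNK >> 12 == DT_LNK  # symlinks
print("d_type/st_mode correspondence holds")
```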

@mxmlnkn
Owner

mxmlnkn commented Feb 17, 2023

I don't get it!? I added debug output for the returned mode and I also tried modifying the mode to match that of a mounted identical tar file. The tar file works but not the zip file ... The output of stat also looks fine and shows the d flag. I'm going insane.

Fortunately, the fact that the TAR backend works, but not the zip backend, gave me an idea:

Please try the development version, which works for me:

python3 -m pip install --user --force-reinstall 'git+https://github.com/mxmlnkn/ratarmount.git@develop#egg=ratarmountcore&subdirectory=core'
python3 -m pip install --user --force-reinstall 'git+https://github.com/mxmlnkn/ratarmount.git@develop#egg=ratarmount'

The development version refactors the zip backend to use the same index backend as the tar backend. And for some reason it works there.

@Vadiml1024
Author

It still does not work here:

vadim@vadim-tp:~/work/ratarmount$ ratarmount --version
ratarmount 0.12.0
ratarmountcore 0.4.0

System Software:

Python 3.8.10
FUSE 2.9.9
libsqlite3 3.31.1

Compression Backends:

indexed_bzip2 1.4.0
indexed_gzip 1.7.0
xz 0.4.0
indexed_zstd 1.1.3
rarfile 4.0

@Vadiml1024
Author

Btw, another small bug in an exception message:

vadim@vadim-tp:~/work/ratarmount$ ratarmount  -f -d 3  -o ro,allow_other ~/tmp/viruses/000-test-001-003-302.zip /tmp/mnt/
[Info] Detected compression None for file object: <_io.BufferedReader name='/home/vadim/tmp/viruses/000-test-001-003-302.zip'>
[Info] File object <_io.BufferedReader name='/home/vadim/tmp/viruses/000-test-001-003-302.zip'> is not a TAR.
[Info] Checking for (compressed) TAR file raised an exception: File object (<_io.BufferedReader name='/home/vadim/tmp/viruses/000-test-001-003-302.zip'>) could not be opened as a TAR file!
Traceback (most recent call last):
  File "/home/vadim/.local/lib/python3.8/site-packages/ratarmountcore/factory.py", line 58, in openMountSource
    return SQLiteIndexedTar(fileOrPath, **options)
  File "/home/vadim/.local/lib/python3.8/site-packages/ratarmountcore/SQLiteIndexedTar.py", line 736, in __init__
    raise RatarmountError("File object (" + str(fileObject) + ") could not be opened as a TAR file!")
ratarmountcore.utils.RatarmountError: File object (<_io.BufferedReader name='/home/vadim/tmp/viruses/000-test-001-003-302.zip'>) could not be opened as a TAR file!


@mxmlnkn
Owner

mxmlnkn commented Feb 17, 2023

Unfortunately, the --version output is insufficient for verification because I did not yet increment the file version on the develop branch. Please try python3 -c 'import ratarmountcore; print(ratarmountcore.SQLiteIndexedTar.isDir)'. For the develop branch it should return: AttributeError: type object 'SQLiteIndexedTar' has no attribute 'isDir'. Did you mean: 'isdir'? while in ratarmountcore 0.4.0 it should return: <function SQLiteIndexedTar.isDir at 0x7ff0a191f370>.

With the develop branch version I cannot reproduce the issue, not even with your uploaded zip file even though I could reproduce that issue with ratarmountcore 0.4.0.

@Vadiml1024
Author

Yes, you were right; somehow I had an old version of ratarmountcore...

Now it works correctly

@Vadiml1024
Author

Can you please build an AppImage for this version?

@mxmlnkn
Owner

mxmlnkn commented Feb 17, 2023

I've sent a link in the Telegram group because GitHub does not allow attachments over 25 MB. It seems like the addition of pragzip or the update to Python 3.11 pushed me over that limit.

I'll try to do a new official release this weekend. I think I wanted to hear back from the original issue reporter who wanted the index for zip archives, but he did not write back and I forgot about this because of pragzip.

@mxmlnkn mxmlnkn closed this as completed Feb 19, 2023
@Vadiml1024
Author

I did not perform any formal testing yet, but I have the impression that access to a big .zip file is muuuuch slower.

@mxmlnkn
Owner

mxmlnkn commented Feb 22, 2023

:( That is frustrating. I did some benchmarks for #98 and it was vastly faster there. But maybe I did the wrong kind of benchmarks. Please specify your exact conditions. Does "big" mean size-wise or number of files? Is it slow for ls/find or for cat file? The index should only affect metadata queries. Reading from files should still use the usual zipfile module...

@Vadiml1024
Author

Vadiml1024 commented Feb 22, 2023

After some testing: my bad, I was wrong; the new version is actually faster...
I have a .zip file with the following content:

sudo ratarmount -o ro,allow_other /media/vadim/Elements/01-Malware\ Cleaner\ ISO\ UPDATES/Iso.zip /tmp/mnt
ls -lR /tmp/mnt
/tmp/mnt:
total 1
dr-xr-xr-x 1 root root 0 nov.   3  2021 Iso

/tmp/mnt/Iso:
total 22896900
-r-xr-xr-x 1 root root 7610609664 nov.  29  2018 mc-3.0.22.iso
-r-xr-xr-x 1 root root 8327004160 janv. 25  2019 mc-3.0.30.iso
-r-xr-xr-x 1 root root 7508809728 janv. 30  2020 mc-3.0.35.iso

I'm running this command in 3 windows:

dd if=/tmp/mnt/Iso/mc-3.0.22.iso of=/dev/null status=progress bs=1M

With only one window the copy runs at 115-120 MB/s.
With 3 windows it varies: 25.8 MB/s in one window, ~45 MB/s in another, 70 MB/s in the third...
The total rate is still 115-120 MB/s.
When launching a new instance of the command, the other running instances are blocked for some amount of time...
This is probably because the new instance causes Python's zipfile to seek to the beginning of the file, so the other instances need to reread/decompress the file from the beginning.
To fix this, one would need to implement checkpoints on the decompressor context -- not very easy.
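The "checkpoints on the decompressor context" idea can be sketched with Python's zlib, whose decompressobj supports copy(): snapshot the decompressor periodically and resume a random read from the nearest snapshot instead of from offset 0. This is only an illustration under invented names (data, CHUNK, read_at); pragzip and indexed_gzip implement the same idea at scale:

```python
import random
import zlib

random.seed(0)
data = bytes(random.getrandbits(8) for _ in range(1 << 20))  # 1 MiB dummy payload
compressed = zlib.compress(data)

CHUNK = 4096
checkpoints = []  # (input_offset, output_offset, decompressor snapshot)
d = zlib.decompressobj()
out_pos = 0
for in_pos in range(0, len(compressed), CHUNK):
    checkpoints.append((in_pos, out_pos, d.copy()))  # snapshot BEFORE this chunk
    out_pos += len(d.decompress(compressed[in_pos:in_pos + CHUNK]))

def read_at(offset, size):
    """Serve a read by resuming from the nearest checkpoint, not from offset 0."""
    in_pos, chk_out, state = max(
        (c for c in checkpoints if c[1] <= offset), key=lambda c: c[1])
    d = state.copy()  # copy again so the stored checkpoint stays reusable
    need = offset - chk_out + size
    result = b""
    while len(result) < need and in_pos < len(compressed):
        result += d.decompress(compressed[in_pos:in_pos + CHUNK])
        in_pos += CHUNK
    return result[offset - chk_out:offset - chk_out + size]
```

Each snapshot carries the 32 KiB deflate window, so real implementations space checkpoints by output distance and bound the total memory.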

@mxmlnkn
Owner

mxmlnkn commented Feb 22, 2023

What kind of compression does this zip use? Could you send the output of zipinfo your.zip? I would only expect such a seek-from-beginning if the members are stored with some kind of compression. If they are stored as "store", then it shouldn't be that slow. If they are stored as deflate and unencrypted, then I might be able to plug in pragzip. I would have to parse the zip file format myself to some extent, but I took a look at it already and it does not seem impossible (for unencrypted zips). I would probably still only do this for large (> 100 MB) members.
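Locating the raw deflate stream of an unencrypted zip member is indeed mostly offset arithmetic over the local file header. A sketch with an invented member name and payload; the field layout follows the ZIP appnote, and header_offset/compress_size come from zipfile's central-directory parsing:

```python
import io
import struct
import zipfile
import zlib

# Build a small test zip with one deflate-compressed member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("member.bin", b"hello zip " * 1000)

raw = buf.getvalue()
info = zipfile.ZipFile(io.BytesIO(raw)).infolist()[0]

# Local file header: 30 fixed bytes, then the file name and the extra field.
sig, _, _, method, _, _, _, _, _, name_len, extra_len = struct.unpack(
    "<IHHHHHIIIHH", raw[info.header_offset:info.header_offset + 30])
assert sig == 0x04034b50 and method == zipfile.ZIP_DEFLATED

data_start = info.header_offset + 30 + name_len + extra_len
deflate_stream = raw[data_start:data_start + info.compress_size]

# The member body is a bare deflate stream -- the input a parallel
# decompressor like pragzip could be pointed at.
print(zlib.decompress(deflate_stream, -15) == b"hello zip " * 1000)
```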

@Vadiml1024
Author

I did some more testing, this time on .tar.gz files:

vadim@vadim-tp:~/work/ratarmount$ sudo ../enka-focal/src/service/admin/package/usr/bin/ratarmount -o ro,allow_other --index-folders /tmp/index --use-backend pragzip /media/vadim/Elements/Test100G/bigtar100g.tar.gz /tmp/mnt
Creating new SQLite index database at /tmp/index/_media_vadim_Elements_Test100G_bigtar100g.tar.gz.index.sqlite
Creating offset dictionary for /media/vadim/Elements/Test100G/bigtar100g.tar.gz ...
Position 3541733376 of 35477926483 (9.98%). Remaining time: 19 min 22 s (current rate), 19 min 22 s (average rate). Spent time: 2 min 8 s
Position 7086178304 of 35477926483 (19.97%). Remaining time: 17 min 6 s (current rate), 17 min 9 s (average rate). Spent time: 4 min 16 s
Position 10654289920 of 35477926483 (30.03%). Remaining time: 14 min 43 s (current rate), 14 min 54 s (average rate). Spent time: 6 min 24 s
Position 14188863488 of 35477926483 (39.99%). Remaining time: 12 min 36 s (current rate), 12 min 44 s (average rate). Spent time: 8 min 29 s
Position 17723379712 of 35477926483 (49.96%). Remaining time: 9 min 33 s (current rate), 10 min 24 s (average rate). Spent time: 10 min 23 s
Position 21256552448 of 35477926483 (59.91%). Remaining time: 7 min 36 s (current rate), 8 min 13 s (average rate). Spent time: 12 min 17 s
Position 24815681536 of 35477926483 (69.95%). Remaining time: 5 min 41 s (current rate), 6 min 5 s (average rate). Spent time: 14 min 11 s
Position 28369866752 of 35477926483 (79.96%). Remaining time: 4 min 13 s (current rate), 4 min 5 s (average rate). Spent time: 16 min 18 s
Position 31903862784 of 35477926483 (89.93%). Remaining time: 2 min 12 s (current rate), 2 min 4 s (average rate). Spent time: 18 min 29 s
Position 35468808192 of 35477926483 (99.97%). Remaining time: 0 min 0 s (current rate), 0 min 0 s (average rate). Spent time: 20 min 35 s
Creating offset dictionary for /media/vadim/Elements/Test100G/bigtar100g.tar.gz took 1236.55s
[Info] Reopening the gzip with the pragzip backend...
[Info] Reopened the gzip with the pragzip backend.
Writing out TAR index to /tmp/index/_media_vadim_Elements_Test100G_bigtar100g.tar.gz.index.sqlite took 0s and is sized 198512640 B
vadim@vadim-tp:~/work/ratarmount$ sudo umount /tmp/mnt 
vadim@vadim-tp:~/work/ratarmount$ sudo ../enka-focal/src/service/admin/package/usr/bin/ratarmount -o ro,allow_other --index-folders /tmp/index --use-backend pragzip /media/vadim/Elements/Test100G/bigtar100g.tar.gz /tmp/mnt
[sudo] password for vadim: 
Successfully loaded offset dictionary from /tmp/index/_media_vadim_Elements_Test100G_bigtar100g.tar.gz.index.sqlite
Loading gzip block offsets took 1.22s
[Info] Reopening the gzip with the pragzip backend...
[Info] Reopened the gzip with the pragzip backend.
vadim@vadim-tp:~/work/ratarmount$ sudo umount /tmp/mnt 
vadim@vadim-tp:~/work/ratarmount$ sudo ../enka-focal/src/service/admin/package/usr/bin/ratarmount -o ro,allow_other --index-folders /tmp/index  /media/vadim/Elements/Test100G/bigtar100g.tar.gz /tmp/mnt
Successfully loaded offset dictionary from /tmp/index/_media_vadim_Elements_Test100G_bigtar100g.tar.gz.index.sqlite
Loading gzip block offsets took 1.02s
vadim@vadim-tp:~/work/ratarmount$ time dd if=/tmp/mnt/Iso/mc-3.0.22.iso of=/dev/null status=progress bs=1M
dd: failed to open '/tmp/mnt/Iso/mc-3.0.22.iso': No such file or directory

real	0m0,019s
user	0m0,000s
sys	0m0,010s
vadim@vadim-tp:~/work/ratarmount$ ls /tmp/mnt/
level2      usr1-hl-1.tar  usr1-hl-3.tar  usr1-hl-5.tar  usr1-hl-7.tar  usr1-hl-9.tar  vectors
level2.tar  usr1-hl-2.tar  usr1-hl-4.tar  usr1-hl-6.tar  usr1-hl-8.tar  usr1.tar
vadim@vadim-tp:~/work/ratarmount$ ls -l /tmp/mnt/
total 98979037
drwxrwxr-x 1 1001 nx           0 juin   1  2022 level2
-rw-rw-r-- 1 1001 nx    67307520 juin   1  2022 level2.tar
-rw-r--r-- 1 1001 nx 10128721920 mai   31  2022 usr1-hl-1.tar
-rw-r--r-- 1 1001 nx 10128721920 mai   31  2022 usr1-hl-2.tar
-rw-r--r-- 1 1001 nx 10128721920 mai   31  2022 usr1-hl-3.tar
-rw-r--r-- 1 1001 nx 10128721920 mai   31  2022 usr1-hl-4.tar
-rw-r--r-- 1 1001 nx 10128721920 mai   31  2022 usr1-hl-5.tar
-rw-r--r-- 1 1001 nx 10128721920 mai   31  2022 usr1-hl-6.tar
-rw-r--r-- 1 1001 nx 10128721920 mai   31  2022 usr1-hl-7.tar
-rw-r--r-- 1 1001 nx 10128721920 mai   31  2022 usr1-hl-8.tar
-rw-r--r-- 1 1001 nx 10128721920 mai   31  2022 usr1-hl-9.tar
-rw-r--r-- 1 1001 nx 10128721920 mai   31  2022 usr1.tar
drwxr-xr-x 1 1001 nx           0 juil. 19  2019 vectors
vadim@vadim-tp:~/work/ratarmount$ ls -lh /tmp/mnt/
total 95G
drwxrwxr-x 1 1001 nx    0 juin   1  2022 level2
-rw-rw-r-- 1 1001 nx  65M juin   1  2022 level2.tar
-rw-r--r-- 1 1001 nx 9,5G mai   31  2022 usr1-hl-1.tar
-rw-r--r-- 1 1001 nx 9,5G mai   31  2022 usr1-hl-2.tar
-rw-r--r-- 1 1001 nx 9,5G mai   31  2022 usr1-hl-3.tar
-rw-r--r-- 1 1001 nx 9,5G mai   31  2022 usr1-hl-4.tar
-rw-r--r-- 1 1001 nx 9,5G mai   31  2022 usr1-hl-5.tar
-rw-r--r-- 1 1001 nx 9,5G mai   31  2022 usr1-hl-6.tar
-rw-r--r-- 1 1001 nx 9,5G mai   31  2022 usr1-hl-7.tar
-rw-r--r-- 1 1001 nx 9,5G mai   31  2022 usr1-hl-8.tar
-rw-r--r-- 1 1001 nx 9,5G mai   31  2022 usr1-hl-9.tar
-rw-r--r-- 1 1001 nx 9,5G mai   31  2022 usr1.tar
drwxr-xr-x 1 1001 nx    0 juil. 19  2019 vectors
vadim@vadim-tp:~/work/ratarmount$ time dd if=/tmp/mnt/usr1.tar of=/dev/null status=progress bs=1M
10066329600 bytes (10 GB, 9,4 GiB) copied, 63 s, 159 MB/s
9659+1 records in
9659+1 records out
10128721920 bytes (10 GB, 9,4 GiB) copied, 63,4207 s, 160 MB/s

real	1m3,431s
user	0m0,042s
sys	0m4,983s
vadim@vadim-tp:~/work/ratarmount$ sudo umount /tmp/mnt 
vadim@vadim-tp:~/work/ratarmount$ sudo ../enka-focal/src/service/admin/package/usr/bin/ratarmount -o ro,allow_other --index-folders /tmp/index --use-backend pragzip /media/vadim/Elements/Test100G/bigtar100g.tar.gz /tmp/mnt
Successfully loaded offset dictionary from /tmp/index/_media_vadim_Elements_Test100G_bigtar100g.tar.gz.index.sqlite
Loading gzip block offsets took 1.06s
[Info] Reopening the gzip with the pragzip backend...
[Info] Reopened the gzip with the pragzip backend.
vadim@vadim-tp:~/work/ratarmount$ time dd if=/tmp/mnt/usr1.tar of=/dev/null status=progress bs=1M
9982443520 bytes (10 GB, 9,3 GiB) copied, 57 s, 175 MB/s 
9659+1 records in
9659+1 records out
10128721920 bytes (10 GB, 9,4 GiB) copied, 57,9802 s, 175 MB/s

real	0m57,998s
user	0m0,055s
sys	0m6,358s
vadim@vadim-tp:~/work/ratarmount$ 

I see that it does not use the pragzip backend when generating the SQLite index file...
I wonder whether this phase could be accelerated somehow?

@Vadiml1024
Author

What kind of compression does this zip use? Could you send the output of zipinfo your.zip? I would only expect such a seek-from-beginning if the members are stored with some kind of compression. If they are stored as "store", then it shouldn't be that slow. If they are stored as deflate and unencrypted, then I might be able to plug in pragzip. I would have to parse the zip file format myself to some extent, but I took a look at it already and it does not seem impossible (for unencrypted zips). I would probably still only do this for large (> 100 MB) members.

It is compressed:
vadim@vadim-tp:$ zipinfo /media/vadim/Elements/01-Malware\ Cleaner\ ISO\ UPDATES/Iso.zip
Archive: /media/vadim/Elements/01-Malware Cleaner ISO UPDATES/Iso.zip
Zip file size: 23359237096 bytes, number of entries: 4
drwx--- 6.3 fat 0 bx stor 21-Nov-03 09:27 Iso/
-rw-a-- 6.3 fat 7610609664 bx defN 18-Nov-29 17:25 Iso/mc-3.0.22.iso
-rw-a-- 6.3 fat 8327004160 bx defN 19-Jan-25 10:34 Iso/mc-3.0.30.iso
-rw-a-- 6.3 fat 7508809728 bx defN 20-Jan-30 09:46 Iso/mc-3.0.35.iso
4 files, 23446423552 bytes uncompressed, 23359236304 bytes compressed: 0.4%
vadim@vadim-tp:~/work/ratarmount$

@mxmlnkn
Owner

mxmlnkn commented Feb 22, 2023

I see that it does not use the pragzip backend when generating the SQLite index file...

Sorry about that. This is intended as of now. The reason is this singular issue. :/ Without implementing that, memory usage could grow up to 2 * 1032 * 4 MiB * 2 * numberOfCores. By limiting parallel decompression to compression ratios of up to 20x, this requirement could be reduced to ~820 MiB per core, which is kinda alright in my opinion. Or maybe limit the compression ratio to 10x. With maybe 20% of performance slowdown, the 4 MiB could also be reduced to 2 or 1 MiB for ~100 MiB per core, which would definitely sound fine to me. So much to do ;)...
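For concreteness, the quoted worst case works out as follows (numbers taken directly from the comment above; the variable names are mine):

```python
MiB = 1024 ** 2

# Worst case per core: 2 * 1032 * 4 MiB * 2 (the numberOfCores factor scales it).
per_core = 2 * 1032 * 4 * MiB * 2
print(per_core / 1024 ** 3, "GiB per core")       # about 16.1 GiB per core

# Capping the handled compression ratio at 20x shrinks it proportionally,
# landing at the "~820 MiB per core" figure mentioned above.
print(per_core / 20 / MiB, "MiB per core capped")
```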

@mxmlnkn
Copy link
Owner

mxmlnkn commented Feb 22, 2023

It is compressed:

Ok, the defN is "deflate normal", so definitely a use case for pragzip. This is worth a separate issue.

@Vadiml1024
Author

I see that it does not use the pragzip backend when generating the SQLite index file...

Sorry about that. This is intended as of now. The reason is this singular issue. :/ Without implementing that, memory usage could grow up to 2 * 1032 * 4 MiB * 2 * numberOfCores. By limiting parallel decompression to compression ratios of up to 20x, this requirement could be reduced to ~820 MiB per core, which is kinda alright in my opinion. Or maybe limit the compression ratio to 10x. With maybe 20% of performance slowdown, the 4 MiB could also be reduced to 2 or 1 MiB for ~100 MiB per core, which would definitely sound fine to me. So much to do ;)...

I've personally rarely seen a .zip file achieving more than 2x compression.
I would say: fall back to non-parallel if the ratio is more than 4x (maybe control this behaviour with a CLI option?).
I could be wrong though.
BTW, I see that one way to accelerate decompression is to use parallel (SSE4-based) CRC32 computation:
https://github.com/anandsuresh/sse4_crc32
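As a side note on that CRC32 bottleneck: the checksum streams fine chunk by chunk, but each chunk's result depends on the previous one, so naive parallelization doesn't work. A small illustration with made-up data:

```python
import zlib

data = bytes(range(256)) * 1000

# CRC32 can be computed incrementally with a running value...
crc = 0
for i in range(0, len(data), 4096):
    crc = zlib.crc32(data[i:i + 4096], crc)
print(crc == zlib.crc32(data))

# ...but because each step feeds the previous result in, truly parallel
# per-chunk CRCs need a combine step afterwards (zlib's crc32_combine,
# which Python's zlib module does not expose) or hardware-assisted CRC
# instructions as in the SSE4-based library linked above.
```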

@mxmlnkn
Owner

mxmlnkn commented Feb 22, 2023

I have multiple cases with a compression ratio of more than 4 but still under 10, maybe around 8. E.g., (build) logs, notably the Chrome Trace Event Format, and in general similar JSON files with lots of redundancy.

Pragzip does not yet compute the CRC32 in parallel decompression mode. Using SIMD adds complexity with dynamic dispatch based on supported CPU instruction sets... Then again, SSE4 might be old enough that almost any x86 CPU of the last 10 years supports it.
