Add libarchive backend #130

Closed
wants to merge 31 commits into from
31 commits
6749a02
[feature] Add --transform option
mxmlnkn Mar 6, 2023
12646d9
[refactor] SQLiteIndex: Make _openPath a static method
mxmlnkn Feb 27, 2024
3b01df4
[feature] Print rapidgzip parallelization and slow-drive-detection re…
mxmlnkn Feb 28, 2024
1c8f878
[style] Get rid of pylint warning
mxmlnkn Feb 29, 2024
9fc39ba
[feature] Do not check file header for zip, only for the footer, to d…
mxmlnkn Mar 16, 2024
e147a89
[fix] Better error message when trying to use empty index
mxmlnkn Mar 23, 2024
ef937e9
[fix] The index should not be created for very small archives
mxmlnkn Mar 24, 2024
fc5e15f
[performance] Do not use UnionMountSource for a single input
mxmlnkn Mar 31, 2024
6882ff6
[fix] SingleFileMountSource: Joined files did not work because of an …
mxmlnkn Mar 31, 2024
4472c2b
[test] Add more tests with chimera file
mxmlnkn Mar 24, 2024
35f5828
[refactor] SQLiteIndex: Add generic method for storing key-value meta…
mxmlnkn Mar 24, 2024
3d019cf
[fix] Apply specified priorities for opening all archives not just gzip
mxmlnkn Mar 23, 2024
134746a
[refactor] SQLiteIndex: Add helper for checking arguments in metadata
mxmlnkn Mar 24, 2024
8e59e73
[fix] Do not check for consistency of folder because parent folders g…
mxmlnkn Mar 24, 2024
79af276
[fix] Index validation did fail for TAR entries with more than 2 meta…
mxmlnkn Mar 26, 2024
20ac7ab
[performance] Determine incremental archives from index rows to avoid…
mxmlnkn Mar 26, 2024
da1ae6d
[fix] Check the index compatibility if backend name is not stored in …
mxmlnkn Mar 24, 2024
a5ff32b
[API] Store isGnuIncremental to index
mxmlnkn Mar 26, 2024
04d78cf
[API] Store backendName to index
mxmlnkn Mar 26, 2024
7c3d921
[refactor] Move _createFileInfo out of MountSource class to fix "prot…
mxmlnkn Mar 29, 2024
43e07dc
[fix] Root file info userdata was not initialized correctly
mxmlnkn Mar 29, 2024
f6d4fce
[fix] Do not check mount point, simply try fusermount
mxmlnkn Mar 29, 2024
f871505
[style] Fix unspecified-encoding warning
mxmlnkn Mar 29, 2024
bc631fa
[style] Suppress "Using the global statement" warning
mxmlnkn Mar 29, 2024
a6d907b
[test] Add more test archives intended for libarchive backend
mxmlnkn Mar 23, 2024
bb14cdb
[wip][feature] Add libarchive backend
mxmlnkn Mar 23, 2024
658724a
[performance] LibarchiveFile: Do not buffer large files fully in memory
mxmlnkn Apr 1, 2024
85a783c
[refactor] LibarchiveMountSource: Introduce nextEntry method to repla…
mxmlnkn Apr 3, 2024
a5a1955
[performance] LibarchiveMountSource: Reuse IterableArchive object for…
mxmlnkn Apr 3, 2024
1fc7511
[test] Take more care to clean up temporary files
mxmlnkn Apr 4, 2024
c43bc5b
[test] Parallelize long-running pytest files
mxmlnkn Apr 4, 2024
25 changes: 20 additions & 5 deletions .github/workflows/tests.yml
@@ -96,6 +96,7 @@ jobs:
run: |
echo "uname -a: $( uname -a )"
echo "Shell: $SHELL"
echo "Cores: $( nproc )"
echo "Mount points:"; mount

- uses: msys2/setup-msys2@v2
@@ -106,7 +107,11 @@ jobs:
- name: Install Dependencies (Linux)
if: startsWith( matrix.os, 'ubuntu' )
run: |
sudo apt-get -y install fuse bzip2 pbzip2 pixz zstd unar
# Libarchive calls the grzip, lrzip, lzop binaries for lrzip support. Others, such as bzip2, gzip, lz4, lzma,
# zstd, may also call external binaries depending on how libarchive was compiled!
# https://github.com/libarchive/libarchive/blob/ad5a0b542c027883d7069f6844045e6788c7d70c/libarchive/
# archive_read_support_filter_lrzip.c#L68
sudo apt-get -y install fuse bzip2 pbzip2 pixz zstd unar lrzip lzop
set -x

- name: Install Dependencies (MacOS)
@@ -183,11 +188,21 @@ jobs:
- name: Unit Tests
if: ${{ !startsWith( matrix.os, 'macos' ) }}
run: |
python3 -m pip install pytest
python3 -m pip install pytest pytest-xdist
for file in core/tests/test_*.py tests/test_*.py; do
# Fusepy warns about usage of use_ns because the implicit behavior is deprecated.
# But there has been no development to fusepy for 4 years, so I think it should be fine to ignore.
pytest --disable-warnings "$file"
case "$file" in
"core/tests/test_AutoMountLayer.py"\
|"core/tests/test_BlockParallelReaders.py"\
|"core/tests/test_LibarchiveMountSource.py"\
|"core/tests/test_SQLiteIndexedTar.py")
echo "$file" # pytest-xdist seems to omit the test file name
pytest -n auto --disable-warnings "$file"
;;
*)
# Fusepy warns about usage of use_ns because the implicit behavior is deprecated.
# But there has been no development to fusepy for 4 years, so I think it should be fine to ignore.
pytest --disable-warnings "$file"
esac
done
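The `case` dispatch in this hunk can be restated as a small Python sketch, which may make the selection rule easier to review. The helper name `pytest_command` and the constant `PARALLEL_TEST_FILES` are hypothetical and not part of the PR; only the four file names and the `-n auto` / `--disable-warnings` flags are taken from the workflow above.

```python
from typing import List

# Test files that the workflow runs in parallel via pytest-xdist ("-n auto").
# The list is copied from the case statement in the workflow hunk.
PARALLEL_TEST_FILES = {
    "core/tests/test_AutoMountLayer.py",
    "core/tests/test_BlockParallelReaders.py",
    "core/tests/test_LibarchiveMountSource.py",
    "core/tests/test_SQLiteIndexedTar.py",
}


def pytest_command(test_file: str) -> List[str]:
    """Hypothetical helper: builds the pytest command line for one test file."""
    command = ["pytest"]
    if test_file in PARALLEL_TEST_FILES:
        # Long-running files get distributed over all available cores.
        command += ["-n", "auto"]
    command += ["--disable-warnings", test_file]
    return command


if __name__ == "__main__":
    for name in ["core/tests/test_SQLiteIndexedTar.py", "tests/test_cli.py"]:
        print(" ".join(pytest_command(name)))
```

Keeping the parallel files in an explicit allow-list, as the workflow does, means short test files avoid the worker start-up cost of pytest-xdist.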

- name: Regression Tests
21 changes: 12 additions & 9 deletions .pylintrc
@@ -3,7 +3,7 @@ init-hook='import sys; sys.path.append("./core")'
# A comma-separated list of package or module names from where C extensions may
# be loaded. Extensions are loading into the active Python interpreter and may
# run arbitrary code.
extension-pkg-whitelist=indexed_gzip,indexed_bzip2,indexed_zstd,lzmaffi,rapidgzip
extension-pkg-whitelist=indexed_gzip,indexed_bzip2,indexed_zstd,libarchive,libarchive.ffi,lzmaffi,rapidgzip

# Specify a score threshold to be exceeded before program exits with error.
fail-under=10.0
@@ -62,14 +62,14 @@ confidence=
# --disable=W".
disable=invalid-name,
broad-except,
too-many-branches,
too-many-statements,
broad-exception-raised,
chained-comparison, # Only available since Python 3.8
too-many-arguments,
too-many-instance-attributes,
too-many-locals,
too-many-lines,
too-few-public-methods,
unnecessary-lambda,
# I don't need the style checker to bother me with missing docstrings and todos.
missing-class-docstring,
missing-function-docstring,
missing-module-docstring,
@@ -433,7 +433,7 @@ max-attributes=7
max-bool-expr=5

# Maximum number of branch for function / method body.
max-branches=12
max-branches=40

# Maximum number of locals for function / method body.
max-locals=15
@@ -442,16 +442,19 @@ max-locals=15
max-parents=7

# Maximum number of public methods for a class (see R0904).
max-public-methods=20
max-public-methods=50

# Maximum number of return / yield for function / method body.
max-returns=6
# The default limit was too low when considering guards as good style.
max-returns=20

# Maximum number of statements in function / method body.
max-statements=50
max-statements=200

# Minimum number of public methods for a class (see R0903).
min-public-methods=2
# Even no public methods make sense when writing pure context managers.
# Tests also often may have classes with only a single public method.
min-public-methods=1


[CLASSES]
29 changes: 27 additions & 2 deletions README.md
@@ -46,6 +46,28 @@ And in contrast to [tarindexer](https://github.com/devsnd/tarindexer), which als

- **Rar** as provided by [rarfile](https://github.com/markokr/rarfile) by Marko Kreen. See also the [RAR 5.0 archive format](https://www.rarlab.com/technote.htm).
- **Zip** as provided by [zipfile](https://docs.python.org/3/library/zipfile.html), which is distributed with Python itself. See also the [ZIP File Format Specification](https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT).
- **Many Others** as provided by [libarchive](https://github.com/libarchive/libarchive) via [python-libarchive-c](https://github.com/Changaco/python-libarchive-c).
- Formats with tests:
[7z](https://github.com/ip7z/7zip/blob/main/DOC/7zFormat.txt),
ar,
[cab](https://download.microsoft.com/download/4/d/a/4da14f27-b4ef-4170-a6e6-5b1ef85b1baa/[ms-cab].pdf),
compress, cpio,
[iso](http://www.brankin.com/main/technotes/Notes_ISO9660.htm),
[lrzip](https://github.com/ckolivas/lrzip),
[lzma](https://www.7-zip.org/a/lzma-specification.7z),
[lz4](https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md),
[lzip](https://www.ietf.org/archive/id/draft-diaz-lzip-09.txt),
lzo,
[warc](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/),
xar.
- Untested formats that may or may not work: deb, grzip,
[rpm](https://refspecs.linuxbase.org/LSB_4.1.0/LSB-Core-generic/LSB-Core-generic/pkgformat.html),
[uuencoding](https://en.wikipedia.org/wiki/Uuencoding).
- Beware that libarchive has no performant random access to files or to file contents.
In general, seeking inside or opening a file requires parsing the archive from the beginning.
If you have a performance-critical use case for a format only supported via libarchive,
then please open a feature request for a faster customized archive format implementation.
The hope would be to add suitable stream compressors such as "short"-distance LZ-based compressions to [rapidgzip](https://github.com/mxmlnkn/rapidgzip).


# Table of Contents
@@ -131,10 +153,13 @@ On macOS, you have to install [macFUSE](https://osxfuse.github.io/) with:
brew install macfuse
```

If you are installing on a system for which there exists no manylinux wheel, then you'll have to install dependencies required to build from source:
If you are installing on a system for which there exists no manylinux wheel, then you'll have to install further dependencies that are required to build some of the Python packages that ratarmount depends on from source:

```bash
sudo apt install python3 python3-pip fuse build-essential software-properties-common zlib1g-dev libzstd-dev liblzma-dev cffi
sudo apt install \
python3 python3-pip fuse \
build-essential software-properties-common \
zlib1g-dev libzstd-dev liblzma-dev cffi libarchive-dev
```

## PIP Package Installation
84 changes: 84 additions & 0 deletions benchmarks/scripts/createLargeRar.sh
@@ -0,0 +1,84 @@
#!/usr/bin/env bash

set -e

echoerr() { echo "$@" 1>&2; }


function createLargeRar()
{
local folder iFolder firstSubFolder subFolder iFile

# Creates an archive with many files with long names making file names the most memory consuming part of the index.
if [[ ! "$nFolders" -eq "$nFolders" ]]; then
echoerr "Argument 1 must be a number specifying the number of folders, each containing 1k files, but is: $nFolders"
return 1
fi

echoerr "Creating an archive with $(( nFolders * nFilesPerFolder )) files..."
folder="$( mktemp -d -p "$( pwd )" )"

iFolder=0
firstSubFolder="$folder/$( printf "%0${nameLength}d" "$iFolder" )"
mkdir -p -- "$firstSubFolder"

for (( iFile = 0; iFile < nFilesPerFolder; ++iFile )); do
base64 /dev/urandom | head -c "$nBytesPerFile" > "$firstSubFolder/$( printf "%0${nameLength}d" "$iFile" )"
done

for (( iFolder = 1; iFolder < nFolders; ++iFolder )); do
subFolder="$folder/$( printf "%0${nameLength}d" "$iFolder" )"
ln -s -- "$firstSubFolder" "$subFolder"
done

file="$nFolders-folders-with-$nFilesPerFolder-files-${nBytesPerFile}B-files.rar"
( cd -- "$folder" && rar a "../$file" -r . )

#file="$nFolders-folders-with-$nFilesPerFolder-files-${nBytesPerFile}B-files.qo+.rar"
#( cd -- "$folder" && rar a -qo+ "../$file" -r . )

#file="$nFolders-folders-with-$nFilesPerFolder-files-${nBytesPerFile}B-files.qo-.rar"
#( cd -- "$folder" && rar a -qo- "../$file" -r . )

'rm' -rf -- "$folder"
}


mountFolder=$( mktemp -d )

nameLength=32
nFilesPerFolder=1000
extendedBenchmarks=1

nFolders=100
nBytesPerFile=$(( 64 * 1024 ))

createLargeRar

rmdir "$mountFolder"


python3 -c 'import sys
import time
import rarfile

t0 = time.time()
f = rarfile.RarFile(sys.argv[1])
t1 = time.time()
print(f"Opening the RAR took: {t1-t0:.3f} s")
print("File Count:", len(f.infolist())) # This is always instant. Seems to get initialized during open
t2 = time.time()
print(f"Getting infolist took: {t2-t1:.3f} s")
' "$file"

# Creating the RAR is VERY slow. Takes many minutes for the 100 folders case.
# 10-folders-with-1000-files-65536B-files.rar 487 MiB -> 0.224
# 100-folders-with-1000-files-65536B-files.rar 4.8 GiB -> 4.821 2.242 2.311 2.317
# 100-folders-with-1000-files-65536B-files.qo+.rar 4.8 GiB -> 16.309 2.332 2.361 2.276
# 100-folders-with-1000-files-65536B-files.qo-.rar 4.8 GiB -> 4.072 2.283 2.245
# -qo[+-] option for the quick open service block seems to not impact open-performance at all after the file is cached.
# Unfortunately, these benchmarks also imply that opening a RAR file (and also a zip file) will always have the overhead
# for opening the file even if an index exists. I'm not even sure whether adding the index for the ZIP did any good for
# the user requesting it.
# In order to profit from an index, I would have to implement my own RAR/ZIP layer that is at least able to read the
# local records given a record/header offset.
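Since the closing comment says the same open overhead applies to ZIP files, a comparable measurement can be sketched with the standard `zipfile` module. This is a hypothetical companion to the inline `rarfile` snippet above, not part of the PR: `build_zip` and `time_open` are made-up names, the archive is built in memory, and the timings will not be comparable to the multi-GiB RAR numbers listed in the comments.

```python
import io
import time
import zipfile
from typing import Tuple


def build_zip(file_count: int) -> bytes:
    """Creates an in-memory ZIP with many tiny files and 32-character names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for i in range(file_count):
            archive.writestr(f"{i:032d}", b"x")
    return buffer.getvalue()


def time_open(archive_bytes: bytes) -> Tuple[int, float, float]:
    """Returns (file count, seconds to open, seconds to call infolist)."""
    t0 = time.perf_counter()
    archive = zipfile.ZipFile(io.BytesIO(archive_bytes))
    t1 = time.perf_counter()
    count = len(archive.infolist())
    t2 = time.perf_counter()
    archive.close()
    return count, t1 - t0, t2 - t1


if __name__ == "__main__":
    count, open_time, list_time = time_open(build_zip(10_000))
    print(f"Opening the ZIP took: {open_time:.3f} s")
    print(f"File Count: {count}, getting infolist took: {list_time:.3f} s")
```

`zipfile.ZipFile` reads the central directory when the archive is opened, so a fast `infolist()` afterwards would match the observation made for `rarfile` in the comment above.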
4 changes: 2 additions & 2 deletions core/ratarmountcore/AutoMountLayer.py
@@ -12,7 +12,7 @@
from .compressions import stripSuffixFromTarFile
from .factory import openMountSource
from .FolderMountSource import FolderMountSource
from .MountSource import FileInfo, MountSource
from .MountSource import FileInfo, MountSource, createRootFileInfo
from .SQLiteIndexedTar import SQLiteIndexedTar, SQLiteIndexedTarUserData
from .utils import overrides

@@ -38,7 +38,7 @@ def __init__(self, mountSource: MountSource, **options) -> None:
self.lazyMounting: bool = self.options.get('lazyMounting', False)
self.printDebug = int(options.get("printDebug", 0)) if isinstance(options.get("printDebug", 0), int) else 0

rootFileInfo = MountSource._createRootFileInfo(userdata=['/'])
rootFileInfo = createRootFileInfo(userdata=['/'])

# Mount points are specified without trailing slash and with leading slash
# representing root of this mount source.
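The change in this hunk replaces a call to an underscore-prefixed static method on `MountSource` with a module-level function. A minimal sketch of that shape follows, assuming the motivation is pylint's `protected-access` check. The `FileInfo` dataclass here is a simplified, hypothetical stand-in, and the field values chosen for the root entry are assumptions rather than the values used in ratarmountcore.

```python
import stat
import time
from dataclasses import dataclass
from typing import Any, List


@dataclass
class FileInfo:
    """Hypothetical, simplified stand-in for the real FileInfo class."""
    size: int
    mtime: float
    mode: int
    linkname: str
    uid: int
    gid: int
    userdata: List[Any]


def createRootFileInfo(userdata: List[Any]) -> FileInfo:
    """Module-level helper, so other modules can call it without touching a protected member."""
    return FileInfo(
        size=0,
        mtime=time.time(),
        mode=0o777 | stat.S_IFDIR,  # Assumed permissions for the root folder.
        linkname="",
        uid=0,  # Assumption: placeholder owner.
        gid=0,
        userdata=userdata,
    )


if __name__ == "__main__":
    print(createRootFileInfo(userdata=['/']))
```

A free function needs no access to class state, so moving it out of the class does not change behavior for callers such as `AutoMountLayer`.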
2 changes: 2 additions & 0 deletions core/ratarmountcore/BlockParallelReaders.py
@@ -286,6 +286,7 @@ def _tryOpenGlobalFile(filename):
# This is not thread-safe! But it will be executed in a process pool, in which each worker has its own
# global variable set. Using a global variable for this is safe because we know that there is one process pool
# per BlockParallelReader, meaning the filename is a constant for each worker.
# pylint: disable=global-statement
global _parallelXzReaderFile
if _parallelXzReaderFile is None:
_parallelXzReaderFile = xz.open(filename, 'rb')
@@ -316,6 +317,7 @@ def _decodeBlock(filename, offset, size):
# This is not thread-safe! But it will be executed in a process pool, in which each worker has its own
# global variable set. Using a global variable for this is safe because we know that there is one process pool
# per BlockParallelReader, meaning the filename is a constant for each worker.
# pylint: disable=global-statement
global _parallelZstdReaderFile
if _parallelZstdReaderFile is None:
_parallelZstdReaderFile = indexed_zstd.IndexedZstdFile(filename)