
Compatibility issue with soundfile #787

@alexjbuck

Problem description

Somewhere between smart_open and soundfile, data is getting lost when writing to or reading from S3 (MinIO in this case).

The following is a near-minimal example that reproduces the issue. When a FLAC file is read from S3 by passing a file handle from smart_open to the soundfile library, soundfile appears to find one sample fewer (2 bytes in this case) than was written. You can successfully read len(input)-1 samples from the file, where input is the original array used to create the audio file. Trying to read every sample makes libsndfile fail with: LibsndfileError: Internal psf_fseek() failed.

I do not know whether this is a soundfile, libsndfile, or smart_open issue. The audio library seems to think it is being shorted on data: the seek fails for a length that should work (the length of the original signal).

To confirm that this isn't an indexing issue: the actual number of samples returned is one short (24999 versus 25000 in my example).

There is a related issue that I'll file separately that happens when you don't specify the number of frames/samples to read from the audio file when reading it through smart_open.

I also included an example of using smart_open with a filesystem target to demonstrate that it works for local filesystem objects but not for remote S3 connections, which also makes me think the fault might lie in smart_open's behavior.
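To narrow down which layer drops data, it may help to compare the raw bytes of the local file with the raw bytes of the S3 object before soundfile ever decodes them. The following is a minimal sketch of such a comparison; `compare_blobs` is a hypothetical helper, not part of smart_open or soundfile, and the commented usage assumes the variables from the reproduction script.

```python
import hashlib


def compare_blobs(a: bytes, b: bytes) -> dict:
    """Summarise how two byte strings differ (hypothetical helper)."""
    limit = min(len(a), len(b))
    # Offset of the first differing byte within the common prefix, if any.
    first_diff = next((i for i in range(limit) if a[i] != b[i]), None)
    # If the common prefix matches but the lengths differ, the difference
    # starts where the shorter blob ends.
    if first_diff is None and len(a) != len(b):
        first_diff = limit
    return {
        "len_a": len(a),
        "len_b": len(b),
        "first_diff": first_diff,
        "sha256_equal": hashlib.sha256(a).digest() == hashlib.sha256(b).digest(),
    }


# Usage sketch, assuming the names from the reproduction script:
# with smart_open.open('test.flac', 'rb') as f:
#     local_bytes = f.read()
# with smart_open.open(filepath.as_uri(), 'rb', **transport_params) as f:
#     s3_bytes = f.read()
# print(compare_blobs(local_bytes, s3_bytes))
```

If the two blobs differ only near the start of the file, that would point at the header rather than the audio frames; if they are identical, the problem would more likely sit on the read side.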

My final objective is to read and write audio files (preferably FLAC) to and from S3 storage from within Python.

Steps/code to reproduce the problem

import io
import os
import boto3
import numpy as np
import smart_open
import soundfile as sf
from numpy import pi
from s3path import S3Path
from soundfile import SoundFile

session = boto3.Session(
    aws_access_key_id='<id>',
    aws_secret_access_key='<key>',
)
client = session.client('s3',endpoint_url="http://localhost:9000")

stop = 5
sample_rate = 5000
start=0
t = np.linspace(start=start, stop=stop, num=stop*sample_rate)
x1 = np.sin(2*pi*100*t)
x2 = np.sin(2*pi*200*t)
signal = x1+x2

path = S3Path.from_uri('s3://etl')
filepath = path / 'test.flac'

transport_params = {'transport_params':{'client':client}}

# Writing to local filesystem through soundfile - Apparent Success
with smart_open.open('test.flac', "wb", **transport_params) as file:
    with sf.SoundFile(file,mode='w', samplerate=sample_rate,channels=1,format='flac') as f:
        f.write(signal)
        print(f"Filesystem IO: {f.frames=}")
# > Filesystem IO: f.frames=25000

# Reading full sample from filesystem - Success
with smart_open.open('test.flac', 'rb', **transport_params) as file:
    with sf.SoundFile(io.BytesIO(file.read())) as fin:
        samples = fin.read(len(signal))
        print(f"Filesystem IO: {len(samples)=}")
# > Filesystem IO: len(samples)=25000

# Writing to S3 through soundfile - Apparent Success
with smart_open.open(filepath.as_uri(), "wb", **transport_params) as file:
    with sf.SoundFile(file,mode='w', samplerate=sample_rate,channels=1,format='flac') as f:
        f.write(signal)
        print(f"S3 IO: {f.frames=}")
# > S3 IO: f.frames=25000

# Reading 1 less sample than was written - Success
with smart_open.open(filepath.as_uri(), 'rb', **transport_params) as file:
    with sf.SoundFile(io.BytesIO(file.read())) as fin:
        samples = fin.read(len(signal)-1)
        print(f"S3 IO: {len(samples)=}")
# > S3 IO: len(samples)=24999

# Reading the same number of samples as was written - Failure
with smart_open.open(filepath.as_uri(), 'rb', **transport_params) as file:
    with sf.SoundFile(io.BytesIO(file.read())) as fin:
        samples = fin.read(len(signal))
        print(f"S3 IO: {len(samples)=}")
# > LibsndfileError: Internal psf_fseek() failed.
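One possible workaround, under the unverified assumption that the encoder needs to seek back and patch its header while the S3 write stream only accepts forward writes, is to encode into a seekable in-memory buffer first and then copy the finished bytes to the destination. The sketch below shows the idea; `write_via_buffer` and the `encode` callback are hypothetical names introduced for illustration.

```python
import io


def write_via_buffer(dest, encode):
    """Encode into a seekable BytesIO, then copy the result to dest.

    dest   -- any object with a write(bytes) method (hypothetical target,
              e.g. the handle returned by smart_open.open(..., 'wb'))
    encode -- callable that writes to the file-like object it is given and
              may seek freely within it
    """
    buffer = io.BytesIO()
    encode(buffer)            # the encoder can seek back to patch headers
    data = buffer.getvalue()  # the complete, final byte stream
    dest.write(data)          # a single forward-only write to the target
    return len(data)


# Usage sketch, assuming the names from the reproduction script:
# with smart_open.open(filepath.as_uri(), 'wb', **transport_params) as file:
#     write_via_buffer(
#         file,
#         lambda buf: sf.write(buf, signal, sample_rate, format='FLAC'),
#     )
```

This holds the whole encoded file in memory, so it suits short clips like the 25000-sample signal here better than very large recordings.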

Versions

Please provide the output of:

import platform, sys, smart_open
print(platform.platform())
print("Python", sys.version)
print("smart_open", smart_open.__version__)
macOS-14.0-arm64-arm-64bit
Python 3.11.5 (main, Aug 24 2023, 15:09:45) [Clang 14.0.3 (clang-1403.0.22.14.1)]
smart_open 6.4.0

Full Error

---------------------------------------------------------------------------
LibsndfileError                           Traceback (most recent call last)
Cell In[330], line 58
     56 with smart_open.open(filepath.as_uri(), 'rb', **transport_params) as file:
     57     with sf.SoundFile(io.BytesIO(file.read())) as fin:
---> 58         samples = fin.read(len(signal))
     59         print(f"S3 IO: {len(samples)=}")
     60 # > LibsndfileError: Internal psf_fseek() failed.

File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:895, in SoundFile.read(self, frames, dtype, always_2d, fill_value, out)
    893     if frames < 0 or frames > len(out):
    894         frames = len(out)
--> 895 frames = self._array_io('read', out, frames)
    896 if len(out) > frames:
    897     if fill_value is None:

File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:1344, in _array_io(self, action, array, frames)
   1342 ctype = self._check_dtype(array.dtype.name)
   1343 assert array.dtype.itemsize == _ffi.sizeof(ctype)
-> 1344 cdata = _ffi.cast(ctype + '*', array.__array_interface__['data'][0])
   1345 return self._cdata_io(action, cdata, ctype, frames)

File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:1356, in _cdata_io(self, action, data, ctype, frames)
   1354 frames = func(self._file, data, frames)
   1355 _error_check(self._errorcode)
-> 1356 if self.seekable():
   1357     self.seek(curr + frames, SEEK_SET)  # Update read & write position
   1358 return frames

File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:802, in SoundFile.seek(self, frames, whence)
    800 self._check_if_closed()
    801 position = _snd.sf_seek(self._file, frames, whence)
--> 802 _error_check(self._errorcode)
    803 return position

File ~/Library/Caches/pypoetry/virtualenvs/hermes-raNvVsk4-py3.11/lib/python3.11/site-packages/soundfile.py:1407, in _error_check(err, prefix)
   1405 def _error_check(err, prefix=""):
   1406     """Raise LibsndfileError if there is an error."""
-> 1407     if err != 0:
   1408         raise LibsndfileError(err, prefix=prefix)

LibsndfileError: Internal psf_fseek() failed.

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software
