Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: File Download/Preview/Streaming Gets Truncated/Corrupted when Served from SMB/CIFS Share (Not Using NC smbclient) #31361

Closed
6 of 8 tasks
GuyPaddock opened this issue Feb 25, 2022 · 7 comments
Labels
0. Needs triage Pending check for reproducibility or if it fits our roadmap bug needs info

Comments

@GuyPaddock
Copy link

GuyPaddock commented Feb 25, 2022

⚠️ This issue respects the following points: ⚠️

  • This is a bug, not a question or a configuration/webserver/proxy issue.
  • This issue is not already reported on Github (I've searched it).
  • Nextcloud Server is up to date. See Maintenance and Release Schedule for supported versions.
  • I agree to follow Nextcloud's Code of Conduct.

Bug description

We recently upgraded an installation of Nextcloud from 18.0.14 to 19.0.13 and now are experiencing a serious problem where most files that are larger than about 2 MB do not download properly from Nextcloud when they are sourced from an SMB/CIFS share. Almost all of our files are hosted on Azure Files over SMB, so this is a big problem for us. (We can reproduce the issue with ALL VERSIONS OF NEXTCLOUD after 19 -- see below).

When working with image files that were previously uploaded (it does not matter if the image was uploaded before or after the upgrade), we will see one of the following happen:

  1. The image file won't load at all when being previewed: image image

  2. OR - The image loads, but is truncated or corrupt:
    image
    image

  3. When we try to download the image, the download fails with a "Network error".
    image

With video files, we find that the video either does not play or plays with artifacts and corruption throughout the image (the source video does NOT have this corruption).
image

Oddly, if we select multiple files and select "Download" (so that they download as a single ZIP), the resulting ZIP downloads without issue and the files inside are not corrupt, so this is only affecting the behavior of streaming/loading individual files from the web UI directly.

I saw #26457 and #26933 but our issue differs from those issues in significant ways:

  • We are not using any of the SMB functionality of Nextcloud itself. Rather, we are mounting storage from SMB into the Docker container via the Kubernetes Azure Files and SMB drivers.
  • Both of those issues were against 21.0.1.0 which did not have the fix from update icewind/smb to 3.4.1 #26627 / [stable21] update icewind/smb to 3.4.1 #26704, which got released in 21.0.2 and 20.0.9, but we are seeing this in all recent versions of Nextcloud that already have the Icewind fixes.

When nginx is being used to host Nextcloud, it is significantly more strict about the issue and will terminate the connection if something isn't right, failing with the following error:

[alert] 32#32: *1 readv() failed (14: Bad address) while reading upstream

image

If we switch to Apache-based Docker containers, Apache does not report any errors but the file it serves-up has corruption throughout, which is arguably way worse (because users might not realize that the file is corrupt since they still get the right number of bytes that they asked for, but the bytes are corrupt).

Steps to reproduce

  1. Setup Nextcloud to run in a Kubernetes-powered container (PHP-FPM or Apache; it does not matter).

  2. Setup the container to mount a volume over an Azure Files or SMB to the path /mnt/share/test.
    For example, a container with a volume mount like (the rest of the container boilerplate has been omitted):

        volumeMounts:
        - mountPath: /mnt/share/test
          name: volume-nextcloud-test

    Declared as:

      volumes:
      - name: volume-nextcloud-test
        persistentVolumeClaim:
          claimName: claim-nextcloud-test

    With PVs and PVCs of (secrets omitted):

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      annotations:
      name: vol-share-nextcloud-test
    spec:
      accessModes:
      - ReadWriteMany
      azureFile:
        secretName: nextcloud-storage-azure-files-creds
        shareName: nextcloud-test
      capacity:
        storage: 1Ti
      mountOptions:
      - uid=33
      - gid=33
      - dir_mode=0770
      - file_mode=0770
      persistentVolumeReclaimPolicy: Retain
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      annotations:
      name: claim-nextcloud-test
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 1Ti
      storageClassName: ""
      volumeName: vol-nextcloud-test
    status:
      accessModes:
      - ReadWriteMany
      capacity:
        storage: 1Ti
      phase: Bound
  3. Configure Nextcloud with a local External storage that points to /mnt/share/test:
    image

  4. Turn image previews OFF for the share, so that the preview lightbox shows full-resolution images (makes this easier to reproduce).

  5. Upload an image or video file that is larger than 5 MB to the share (for example, the 25 MB Rome, Italy image from https://effigis.com/en/solutions/satellite-images/satellite-image-samples/).

  6. Attempt to open the image or video file in the preview lightbox. Repeat this step at least 5 times, clearing browser cache in between attempts.

  7. Attempt to download the image or video file through the Nextcloud web UI. Repeat this step at least 5 times, clearing browser cache in between attempts.

Expected behavior

The image or video should load without error.

Installation method

Official Docker image

Operating system

Debian/Ubuntu

PHP engine version

PHP 7.4

Web server

Nginx

Database engine version

MariaDB

Is this bug present after an update or on a fresh install?

Updated to a major version (ex. 22.2.3 to 23.0.1)

Are you using the Nextcloud Server Encryption module?

Encryption is Disabled

What user-backends are you using?

  • Default user-backend (database)
  • LDAP/ Active Directory
  • SSO - SAML
  • Other

Configuration report

{
    "system": {
        "simpleSignUpLink.shown": false,
        "instanceid": "***REMOVED SENSITIVE VALUE***",
        "passwordsalt": "***REMOVED SENSITIVE VALUE***",
        "secret": "***REMOVED SENSITIVE VALUE***",
        "trusted_domains": [
            "dev.cloudvault.inveniem.com"
        ],
        "datadirectory": "***REMOVED SENSITIVE VALUE***",
        "dbtype": "mysql",
        "version": "23.0.2.1",
        "overwrite.cli.url": "https:\/\/dev.cloudvault.inveniem.com",
        "dbname": "***REMOVED SENSITIVE VALUE***",
        "dbhost": "***REMOVED SENSITIVE VALUE***",
        "dbport": "",
        "dbtableprefix": "",
        "dbuser": "***REMOVED SENSITIVE VALUE***",
        "dbpassword": "***REMOVED SENSITIVE VALUE***",
        "installed": true,
        "htaccess.RewriteBase": "\/",
        "maintenance": false,
        "appstoreenabled": false,
        "enable_previews": true,
        "preview_max_filesize_image": 50,
        "apps_paths": [
            {
                "path": "\/var\/www\/html\/apps",
                "url": "\/apps",
                "writable": false
            },
            {
                "path": "\/var\/www\/html\/custom_apps",
                "url": "\/custom_apps",
                "writable": false
            }
        ],
        "logfile": "\/var\/log\/nextcloud.log",
        "loglevel": 1,
        "filelocking.enabled": false,
        "memcache.local": "\\OC\\Memcache\\APCu",
        "memcache.distributed": "\\OC\\Memcache\\Redis",
        "memcache.locking": "\\OC\\Memcache\\Redis",
        "redis": {
            "host": "***REMOVED SENSITIVE VALUE***",
            "port": "6379",
            "timeout": 1.5,
            "password": "***REMOVED SENSITIVE VALUE***"
        },
        "dbdriveroptions": {
            "1009": "\/etc\/ssl\/certs\/Baltimore_CyberTrust_Root.pem"
        },
        "config_is_read_only": true,
        "skeletondirectory": "",
        "mysql.utf8mb4": true,
        "preview_max_x": 1024,
        "preview_max_y": 1024,
        "trusted_proxies": "***REMOVED SENSITIVE VALUE***",
        "app_install_overwrite": [
            "user_external",
            "user_saml"
        ]
    }
}

List of activated Apps

Enabled:
  - accessibility: 1.9.0
  - activity: 2.15.0
  - admin_audit: 1.13.0
  - checksum: 1.1.3
  - circles: 23.0.1
  - cloud_federation_api: 1.6.0
  - comments: 1.13.0
  - contactsinteraction: 1.4.0
  - dav: 1.21.0
  - federatedfilesharing: 1.13.0
  - federation: 1.13.0
  - files: 1.18.0
  - files_downloadactivity: 1.12.0
  - files_external: 1.15.0
  - files_pdfviewer: 2.4.0
  - files_rightclick: 1.2.0
  - files_sharing: 1.15.0
  - files_trashbin: 1.13.0
  - files_versions: 1.16.0
  - files_videoplayer: 1.12.0
  - logreader: 2.8.0
  - lookup_server_connector: 1.11.0
  - metadata: 0.15.0
  - music: 1.5.1
  - nextcloud_announcements: 1.12.0
  - notifications: 2.11.1
  - oauth2: 1.11.0
  - password_policy: 1.13.0
  - photos: 1.5.0
  - privacy: 1.7.0
  - provisioning_api: 1.13.0
  - recommendations: 1.2.0
  - serverinfo: 1.13.0
  - settings: 1.5.0
  - sharebymail: 1.13.0
  - support: 1.6.0
  - survey_client: 1.11.0
  - systemtags: 1.13.0
  - text: 3.4.0
  - theming: 1.14.0
  - twofactor_backupcodes: 1.12.0
  - updatenotification: 1.13.0
  - user_external: 2.1.0
  - user_saml: 4.1.1
  - viewer: 1.7.0
  - workflowengine: 2.5.0
Disabled:
  - dashboard: 7.0.0
  - encryption
  - files_antivirus: 2.4.1
  - files_automatedtagging
  - firstrunwizard: 2.4.0
  - previewgenerator
  - user_ldap
  - user_status: 1.0.1
  - weather_status: 1.0.0

Nextcloud Signing status

No errors have been found.

Nextcloud Logs

N/A (container logs)

Additional info

Configurations Attempted

This issue appears when using files > 2 MB over SMB with ALL of the following configurations:

  • Nextcloud 19.0.13, 20.0.14, 21.0.9, 22.2.5, and 23.0.2
  • nginx 1.19.6, 1.21.0, 1.21.3, 1.21.6:
    • nginx with sendfile off and sendfile on.
    • nginx with sendfile on and sendfile_max_chunk 1m.
  • Apache 2.4.52 (Debian) and PHP/8.0.16.
  • SMB version 3.0.0 and version 3.1.1.
  • SMB with cache=none and cache=strict.
  • Firefox 97.0.1 (64-bit) and Microsoft Edge 98.0.1108.56.
  • Kubernetes 1.15.10 and 1.22.6
  • Both the Azure Files and SMB CSI drivers
  • Both Azure Files and a Qumulo-based SMB cluster.

The issue does NOT appear if we create a folder inside the container (e.g., mkdir -p /mnt/local/test) and then mount that as external storage. The issue also was not reproducible with Nextcloud 18.0.14.

@GuyPaddock GuyPaddock added 0. Needs triage Pending check for reproducibility or if it fits our roadmap bug labels Feb 25, 2022
@GuyPaddock
Copy link
Author

UPDATE: I tried rolling Icewind back to the version that was in Nextcloud 20.0.18, per comments in #26457, but the rollback had no effect on the behavior, so this is not looking like an Icewind issue per say.

@GuyPaddock
Copy link
Author

GuyPaddock commented Feb 26, 2022

Another update: It appears that this issue is either related to a change in PHP or a change in Sabre/HTTP.

It is documented elsewhere that mmap combined with SMB volumes can lead to corrupt file reads:

Now, it looks like Sabre HTTP 5.0.3 added enhancements to make better use of mmap() through the stream_copy_to_stream() method:
image

For more context, the MR referenced by the Sabre HTTP changelog is:
sabre-io/http#119

And that version was pulled into Nextcloud 19.0.0 (coincidentally, we started seeing this issue upon upgrading from 18.0.14 to 19.0.13 as mentioned in the issue description):
nextcloud/3rdparty@b0afba6

Prior to NC 19, NC was using version 4.2.4 of Sabre HTTP.


Now, I thought I had a slam dunk here because this theory would have worked perfectly to explain the behavior that we are seeing, except that the code in sabre-io/http#119 doesn't actually call mmap() at all. Rather, it ensures that data is read in chunks of 4 MB so that under the hood, PHP uses mmap() within its implementation of stream_copy_to_stream(). In our container image, I created a patch to Nextcloud 23.0.2 that reverted Sabre HTTP to version 5.0.2 (the version right before sabre-io/http#119 was merged) and tried that but we got the same behavior.

HOWEVER one last thing I tried was -- in addition to reverting 119 -- I modified \Sabre\HTTP\Sapi::sendResponse() so that instead of this:

        if (null !== $contentLength) {
            $output = fopen('php://output', 'wb');
            if (is_resource($body) && 'stream' == get_resource_type($body)) {
                if (PHP_INT_SIZE !== 4) {
                    // use the dedicated function on 64 Bit systems
                    stream_copy_to_stream($body, $output, (int) $contentLength);
                } else {
                    // workaround for 32 Bit systems to avoid stream_copy_to_stream
                    while (!feof($body)) {
                        fwrite($output, fread($body, 8192));
                    }
                }
            } else {
                fwrite($output, $body, (int) $contentLength);
            }
        } else {
            file_put_contents('php://output', $body);
        }

It read as:

        if (null !== $contentLength) {
            $output = fopen('php://output', 'wb');
            if (is_resource($body) && 'stream' == get_resource_type($body)) {
                // workaround for 32 Bit systems to avoid stream_copy_to_stream
                while (!feof($body)) {
                    fwrite($output, fread($body, 8192));
                }
            } else {
                fwrite($output, $body, (int) $contentLength);
            }
        } else {
            file_put_contents('php://output', $body);
        }

... and that works. Attached is the resulting patch (for debugging only). This is obviously not a fix because it seems to have a significant performance hit, but it does seem to narrow down where the issue is.

sabre-http-revert-5.0.2-and-mmap.patch.txt

I can't take credit for this idea, as it was suggested by #26457 (comment).

@GuyPaddock
Copy link
Author

More surgical patch that just removes calls to stream_copy_to_stream() from Sabre/HTTP.
sabre-http-avoid-mmap.patch.txt

@GuyPaddock
Copy link
Author

GuyPaddock commented Feb 26, 2022

It looks like our Nextcloud v18.0.14 image was running PHP 7.3.26 whereas the Nextcloud v19.0.13 image runs PHP 7.4.21. That's significant because, in addition to the mmap() enhancement that was made to Sabre/HTTP, the maintainers filed an enhancement request to the PHP core team, which resulted in this change to PHP code:
php/php-src@333d607.

Which... rolled out in PHP 7.4.

My hypothesis is that files on an SMB share that are < 4 MB can be read with minimal or no side effects through mmap(), while files > 4 MB are unstable via mmap(). Since PHP 5.3 through 7.3 avoided using mmap() for files > 4 MB, this defect was avoided. Now that PHP 7.4 no longer has that logic, it's not safe to use stream_copy_to_stream() on files from SMB until the underlying kernel bug or whatever is causing these reads to be unstable is fixed.

@AndyXheli

This comment was marked as resolved.

@GuyPaddock

This comment was marked as resolved.

GuyPaddock added a commit to Inveniem/nextcloud-azure-aks that referenced this issue Mar 6, 2022
@szaimen
Copy link
Contributor

szaimen commented Jan 23, 2023

Hi, please update to 24.0.9 or better 25.0.3 and report back if it fixes the issue. Thank you!

My goal is to add a label like e.g. 25-feedback to this ticket of an up-to-date major Nextcloud version where the bug could be reproduced. However this is not going to work without your help. So thanks for all your effort!

If you don't manage to reproduce the issue in time and the issue gets closed but you can reproduce the issue afterwards, feel free to create a new bug report with up-to-date information by following this link: https://github.com/nextcloud/server/issues/new?assignees=&labels=bug%2C0.+Needs+triage&template=BUG_REPORT.yml&title=%5BBug%5D%3A+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0. Needs triage Pending check for reproducibility or if it fits our roadmap bug needs info
Projects
None yet
Development

No branches or pull requests

3 participants