Incomplete GZIP Block During Interrupted Writes Corrupts Files #45

@asmith-elastic

Description

Description of the problem including expected versus actual behavior:

Expected Behavior:

When writing to HDFS using the WebHDFS output plugin with GZIP compression enabled, the plugin should handle interrupted writes gracefully so that the resulting files stay consistent and readable.

Actual Behavior:

Currently, the compress_gzip method in the WebHDFS output plugin does not account for interrupted writes or retries. When a write operation is interrupted, the GZIP block being appended is left incomplete. Because the GZIP format does not allow more data to be appended to an incomplete block, the file is effectively corrupted from that point onward.
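
For illustration, here is a minimal Ruby sketch (not the plugin's exact code) of the per-flush compression pattern described above: each flush is compressed into a complete, self-contained GZIP member that is then appended to the HDFS file. The helper name compress_chunk is hypothetical.

```ruby
require 'zlib'
require 'stringio'

# Compress one flush worth of event data into a self-contained GZIP member.
def compress_chunk(data)
  buffer = StringIO.new
  gz = Zlib::GzipWriter.new(buffer)
  begin
    gz.write(data)
  ensure
    gz.close # finalizes the member: writes the trailer (CRC32 + length)
  end
  buffer.string
end

# The resulting byte string is sent to WebHDFS as an APPEND request.
# If that request is interrupted part-way, HDFS keeps only a prefix of the
# member: the trailer is missing, and the next successful append starts a
# brand-new member in the middle of the truncated one, so readers fail at
# that offset. GZIP offers no way to complete the broken member later.
```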


Steps to Reproduce:

  1. Configure Logstash with the WebHDFS output plugin and enable GZIP compression.
  2. Start ingesting data to HDFS through Logstash.
  3. Simulate a WebHDFS failure or maintenance activity during a write operation (for example, by killing the process mid-write).
  4. Observe the resulting HDFS file. It will contain an incomplete GZIP block, making it corrupted and unreadable (a verification sketch follows this list).
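
One way to confirm the corruption in step 4 is to copy the file out of HDFS (for example with hdfs dfs -get) and walk every GZIP member in it; a truncated member shows up as a stream that never reaches its trailer. This is a rough sketch, and the local path below is hypothetical.

```ruby
require 'zlib'

path = '/tmp/logstash-output.gz' # hypothetical local copy of the HDFS file
data = File.binread(path)
pos  = 0

begin
  while pos < data.bytesize
    zi = Zlib::Inflate.new(Zlib::MAX_WBITS + 16) # decode exactly one gzip member
    zi.inflate(data.byteslice(pos, data.bytesize - pos))
    raise Zlib::DataError, 'truncated gzip member' unless zi.finished?
    pos += zi.total_in # advance past the member just decoded
    zi.close
  end
  puts 'all gzip members are intact'
rescue Zlib::Error => e
  puts "corrupt gzip data at byte offset #{pos}: #{e.message}"
end
```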

Logstash Information:

  1. Logstash version: 7.17.1
  2. Logstash installation source: tar
  3. How is Logstash being run: systemd

OS Version: RHEL 7
