Skip to content

Proposal: compress/lzw: extend package for correctly handling tiff files and pdf LZWDecode filters  #25409

@hhrutter

Description

@hhrutter

We have both lzw reader and writer under compress/lzw.

There is also an lzw reader hidden under https://github.com/golang/image/blob/master/tiff/lzw/reader.go
that is supposed to be used for reading tiff files. (There is no writer.)
The difference being a slight modification of the algorithm as stated in https://github.com/golang/image/blob/master/tiff/lzw/reader.go

I am working on a pdf processor and have a use case for encoding/decoding LZW data using both variations. The PDF spec defines the LZWDecode filter with the parameter EarlyChange that decides about the variation of the algorithm to use:

if EarlyChange = 1 => use golang/image/tiff/lzw/reader.go from the experimental repos at golang.org/x and a correspondingwriter.

if EarlyChange = 0 => use compress.lzw.reader.go and compress/lzw/writer.go from the std lib.

The default value for EarlyChange is 1.

This proposal is about extending compress/lzw so it can be used by any package that needs to implement lzw compression in any way. While reading/writing tiff files may be an exotic usecase these days, upcoming PDF readers,writers,processors would greatly benefit from this. In addition this would be the first time to introduce tiff encoding and decoding within the stdlib.

These are the proposed code changes. The modifications of the algorithm are minor but
but for now I don't see any better way than extending the API:

  • Introduce a bool parameter oneOff :
// oneOff makes code length increases occur one code early. It should be true
// for tiff files or pdf LZWDecode filters with earlyChange=1 which is also the default.
func NewReader(r io.Reader, order Order, litWidth int, oneOff bool) io.ReadCloser
func NewWriter(w io.Writer, order Order, litWidth int, oneOff bool) io.WriteCloser
  • Extend encoder and decoder structs:
// oneOff makes code length increases occur one code early.
oneOff bool
  • Extend the encoding algorithm:
func (e *encoder) incHi() error {
	e.hi++

        // Code length increases dependent on d.oneOff.
	ui := e.hi
	if e.oneOff {
		ui++
	}

	if ui == e.overflow {
		e.width++
		e.overflow <<= 1
	}

	if ui == maxCode {
		clear := uint32(1) << e.litWidth
		if err := e.write(e, clear); err != nil {
			return err
		}
		e.width = e.litWidth + 1
		e.hi = clear + 1
		e.overflow = clear << 1
		for i := range e.table {
			e.table[i] = invalidEntry
		}
		return errOutOfCodes
	}
	return nil
}
  • Extend the decoding algorithm:
// decode decompresses bytes from r and leaves them in d.toRead.
// read specifies how to decode bytes into codes.
// litWidth is the width in bits of literal codes.
func (d *decoder) decode() {
	// Loop over the code stream, converting codes into decompressed bytes.
loop:
	for {
		code, err := d.read(d)
		if err != nil {
			if err == io.EOF {
				err = io.ErrUnexpectedEOF
			}
			d.err = err
			break
		}
		switch {
		case code < d.clear:
			// We have a literal code.
			d.output[d.o] = uint8(code)
			d.o++
			if d.last != decoderInvalidCode {
				// Save what the hi code expands to.
				d.suffix[d.hi] = uint8(code)
				d.prefix[d.hi] = d.last
			}
		case code == d.clear:
			d.width = 1 + uint(d.litWidth)
			d.hi = d.eof
			d.overflow = 1 << d.width
			d.last = decoderInvalidCode
			continue
		case code == d.eof:
			d.err = io.EOF
			break loop
		case code <= d.hi:
			c, i := code, len(d.output)-1
			if code == d.hi && d.last != decoderInvalidCode {
				// code == hi is a special case which expands to the last expansion
				// followed by the head of the last expansion. To find the head, we walk
				// the prefix chain until we find a literal code.
				c = d.last
				for c >= d.clear {
					c = d.prefix[c]
				}
				d.output[i] = uint8(c)
				i--
				c = d.last
			}
			// Copy the suffix chain into output and then write that to w.
			for c >= d.clear {
				d.output[i] = d.suffix[c]
				i--
				c = d.prefix[c]
			}
			d.output[i] = uint8(c)
			d.o += copy(d.output[d.o:], d.output[i:])
			if d.last != decoderInvalidCode {
				// Save what the hi code expands to.
				d.suffix[d.hi] = uint8(c)
				d.prefix[d.hi] = d.last
			}
		default:
			d.err = errors.New("lzw: invalid code")
			break loop
		}
		d.last, d.hi = code, d.hi+1
                 
                // Code length increases dependent on d.oneOff.
		ui := d.hi
		if d.oneOff {
			ui++
		}
		if ui >= d.overflow {
			if d.width == maxWidth {
				d.last = decoderInvalidCode
				// Undo the d.hi++ a few lines above, so that (1) we maintain
				// the invariant that d.hi <= d.overflow, and (2) d.hi does not
				// eventually overflow a uint16.
				if !d.oneOff {
					d.hi--
				}
			} else {
				d.width++
				d.overflow <<= 1
			}
		}
		if d.o >= flushBuffer {
			break
		}
	}
	// Flush pending output.
	d.toRead = d.output[:d.o]
	d.o = 0
}

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions