Proposal: compress/lzw: extend package for correctly handling tiff files and pdf LZWDecode filters 

We have both lzw reader and writer under compress/lzw.

There is also an lzw reader hidden under https://github.com/golang/image/blob/master/tiff/lzw/reader.go
that is supposed to be used for reading tiff files. (There is no writer.)
The difference being a slight modification of the algorithm as stated in https://github.com/golang/image/blob/master/tiff/lzw/reader.go

I am working on a [pdf processor](https://github.com/hhrutter/pdfcpu) and have a use case for encoding/decoding LZW data using both variations. The [PDF spec](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf) defines the LZWDecode filter with the parameter EarlyChange that decides about the variation of the algorithm to use:

if EarlyChange = 1 => use [golang/image/tiff/lzw/reader.go](https://github.com/golang/image/blob/master/tiff/lzw/reader.go) from the experimental repos at golang.org/x and a correspondingwriter.

if EarlyChange = 0 => use [compress.lzw.reader.go](https://github.com/golang/go/blob/master/src/compress/lzw/reader.go) and [compress/lzw/writer.go](https://github.com/golang/go/blob/master/src/compress/lzw/writer.go) from the std lib.

The default value for EarlyChange is 1.

This proposal is about extending compress/lzw so it can be used by any package that needs to implement lzw compression in any way. While reading/writing tiff files may be an exotic usecase these days, upcoming PDF readers,writers,processors would greatly benefit from this. In addition this would be the first time to introduce tiff encoding and decoding within the stdlib.

These are the proposed code changes. The modifications of the  algorithm are minor but
but for now I don't see any better way than extending the API:

- Introduce a bool parameter _oneOff_ : 
```
// oneOff makes code length increases occur one code early. It should be true
// for tiff files or pdf LZWDecode filters with earlyChange=1 which is also the default.
func NewReader(r io.Reader, order Order, litWidth int, oneOff bool) io.ReadCloser
func NewWriter(w io.Writer, order Order, litWidth int, oneOff bool) io.WriteCloser
```

- Extend encoder and decoder structs:
```
// oneOff makes code length increases occur one code early.
oneOff bool
```

- Extend the encoding algorithm:
```
func (e *encoder) incHi() error {
	e.hi++

        // Code length increases dependent on d.oneOff.
	ui := e.hi
	if e.oneOff {
		ui++
	}

	if ui == e.overflow {
		e.width++
		e.overflow <<= 1
	}

	if ui == maxCode {
		clear := uint32(1) << e.litWidth
		if err := e.write(e, clear); err != nil {
			return err
		}
		e.width = e.litWidth + 1
		e.hi = clear + 1
		e.overflow = clear << 1
		for i := range e.table {
			e.table[i] = invalidEntry
		}
		return errOutOfCodes
	}
	return nil
}
```

- Extend the decoding algorithm:
```
// decode decompresses bytes from r and leaves them in d.toRead.
// read specifies how to decode bytes into codes.
// litWidth is the width in bits of literal codes.
func (d *decoder) decode() {
	// Loop over the code stream, converting codes into decompressed bytes.
loop:
	for {
		code, err := d.read(d)
		if err != nil {
			if err == io.EOF {
				err = io.ErrUnexpectedEOF
			}
			d.err = err
			break
		}
		switch {
		case code < d.clear:
			// We have a literal code.
			d.output[d.o] = uint8(code)
			d.o++
			if d.last != decoderInvalidCode {
				// Save what the hi code expands to.
				d.suffix[d.hi] = uint8(code)
				d.prefix[d.hi] = d.last
			}
		case code == d.clear:
			d.width = 1 + uint(d.litWidth)
			d.hi = d.eof
			d.overflow = 1 << d.width
			d.last = decoderInvalidCode
			continue
		case code == d.eof:
			d.err = io.EOF
			break loop
		case code <= d.hi:
			c, i := code, len(d.output)-1
			if code == d.hi && d.last != decoderInvalidCode {
				// code == hi is a special case which expands to the last expansion
				// followed by the head of the last expansion. To find the head, we walk
				// the prefix chain until we find a literal code.
				c = d.last
				for c >= d.clear {
					c = d.prefix[c]
				}
				d.output[i] = uint8(c)
				i--
				c = d.last
			}
			// Copy the suffix chain into output and then write that to w.
			for c >= d.clear {
				d.output[i] = d.suffix[c]
				i--
				c = d.prefix[c]
			}
			d.output[i] = uint8(c)
			d.o += copy(d.output[d.o:], d.output[i:])
			if d.last != decoderInvalidCode {
				// Save what the hi code expands to.
				d.suffix[d.hi] = uint8(c)
				d.prefix[d.hi] = d.last
			}
		default:
			d.err = errors.New("lzw: invalid code")
			break loop
		}
		d.last, d.hi = code, d.hi+1
                 
                // Code length increases dependent on d.oneOff.
		ui := d.hi
		if d.oneOff {
			ui++
		}
		if ui >= d.overflow {
			if d.width == maxWidth {
				d.last = decoderInvalidCode
				// Undo the d.hi++ a few lines above, so that (1) we maintain
				// the invariant that d.hi <= d.overflow, and (2) d.hi does not
				// eventually overflow a uint16.
				if !d.oneOff {
					d.hi--
				}
			} else {
				d.width++
				d.overflow <<= 1
			}
		}
		if d.o >= flushBuffer {
			break
		}
	}
	// Flush pending output.
	d.toRead = d.output[:d.o]
	d.o = 0
}
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Proposal: compress/lzw: extend package for correctly handling tiff files and pdf LZWDecode filters #25409

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Proposal: compress/lzw: extend package for correctly handling tiff files and pdf LZWDecode filters #25409

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions