New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specializing on an input seems much slower than making the input a GeneratorParam #3120

Open
yiming9 opened this Issue Jul 12, 2018 · 0 comments

Comments

Projects
None yet
1 participant
@yiming9

yiming9 commented Jul 12, 2018

Release: 2018_02_15
Platform: Ubuntu Linux 16.04
Compiler: clang 4.0
LLVM: 4.0

I'm learning halide by writing an area downsampler for packed rgb image. The generator looks like this:

class AreaDownsampler : public Halide::Generator<AreaDownsampler> {
 public:
  GeneratorParam<int> num_channels{"num_channels", 3};
  GeneratorParam<int> scale{"scale", 3};

  Input<Buffer<uint8_t>> input{"input", 3};
  Output<Buffer<uint8_t>> output{"output", 3};

  void generate() {
    RDom r(0, scale);
    sum_y(c, x, y) = sum(cast<uint16_t>(input(c, x, y * scale + r)));

    Expr scale_sqr = cast<uint16_t>(scale) * cast<uint16_t>(scale);
    Expr offset = scale_sqr / 2;

    sum_xy(c, x, y) = sum(sum_y(c, x * scale + r, y));
    output(c, x, y) = cast<uint8_t>((sum_xy(c, x, y) + offset) / scale_sqr);
  }

  void schedule() {
    input
      .dim(1).set_stride(num_channels)
      .dim(0).set_stride(1);
    output
      .dim(1).set_stride(num_channels)
      .dim(0).set_stride(1);

    input.dim(0).set_bounds(0, num_channels);

    Var t, xo, xi;
    output.bound(c, 0, num_channels);
    output.split(x, xo, xi, 64)
        .fuse(c, xi, t)
        .vectorize(t, 16 * num_channels);

    sum_xy.bound(c, 0, num_channels);
    sum_xy.compute_at(output, xo).store_root();

    sum_y.bound(x, 0, input.dim(1).extent()).bound(c, 0, num_channels);
    sum_y.compute_at(output, y).store_root();
    sum_y.fuse(c, x, t).vectorize(t, 32 * num_channels);
  }

 private:
  Var x{"x"};
  Var y{"y"};
  Var c{"c"};

  Func sum_y{"sum_y"};
  Func sum_xy{"sum_xy"};
};

}  // namespace

This generator works pretty well, except that I have to specify the downsampling factor as a generator parameter. I tried to change this generator parameter to a input parameter, and specialize when it equals 3. I think the behavior of the generated function should be exactly same when scale==3, because the compiler has as much knowledge as when we scale as a GeneratorParam.

class AreaDownsampler : public Halide::Generator<AreaDownsampler> {
 public:
  GeneratorParam<int> num_channels{"num_channels", 3};

  Input<Buffer<uint8_t>> input{"input", 3};
  Input<int> scale{"scale"};
  Output<Buffer<uint8_t>> output{"output", 3};

  void generate() {
     ...
  }

  void schedule() {
    ...
    output.specialize(scale == 3);
  }
};

However, the performance drops ~40% after this change. I took a look at the generated lowered stmt, I find that behavior around this line changes:

    sum_y.compute_at(output, y).store_root();

In both versions, sum_y is allocated outside the outermost loop. However, in the fast version (in which scale is used as a GeneratorParam), it figures out that sum_y just needs one row, which is great.

  allocate sum_y[uint16 * 3 * input.extent.1 * 1]

But in the slow version (in which 'scale' is used as an input parameter but specialized when scale==3), it fails to figure out that sum_y just needs one row. Instead, it was allocated for the whole image.

  allocate sum_y[uint16 * 3 * input.extent.1 * output.extent.2]

The two lowered stmts are here:
fast_stmt.txt
slow_stmt.txt

I'm curious that whether it is possible to make the slow version as fast as the fast version. Can I find a way to give halide more hints to enable it to figure out that sum_y just needs one row even if scale is used as an input param as opposed to a GeneratorParam?

UPDATE:
By the way, when using scale as an input parameter and specialize it when scale=3, I have also tried

    sum_y.compute_at(output, y);

which implicitly specifies store_at(output, y). This way, halide can figure out that sum_y just needs one row. However, its allocation and deallocation happens inside the outermost loop, which makes it still about ~25% slower than the fast version (in which scale is specified as a GeneratorParam, and sum_y.compute_at(output, y).store_root()).

Thanks a lot!

@yiming9 yiming9 changed the title from Is specializing an input equivalent to make the input a GeneratorParam? to Specializing on an input seems much slower than making the input a GeneratorParam Jul 14, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment