Squaring a number x (computing x*x) can be implemented more efficiently than a general multiplication x*y, because the cross products of the operand's words are symmetric and each needs to be computed only once. Instead of providing an additional Sqr function, recognize calls of the form z.Mul(x, x) and use squaring code internally. This speeds up every squaring expressed through Mul, without requiring callers to switch to an explicit square function.