Fixed an endianness issue that caused rotations to be performed incorrectly. Also added optimized PTX for rotr32: no meaningful effect on sm_32+, but sm_30 and below may benefit. No block has been found yet, so this may still not work.
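
For reference, a minimal host-side sketch of why byte order matters for rotations. This is not the PTX from this change; `bswap32` and the sample value are illustrative assumptions, and `rotr32` here is a plain portable version. Rotating a word and byte-swapping it do not commute, so a word loaded in the wrong byte order rotates to a wrong result.

```cpp
#include <cstdint>

// Portable 32-bit rotate right. The count is masked so that
// n == 0 and n >= 32 stay well-defined (no shift by 32).
static inline uint32_t rotr32(uint32_t x, unsigned n) {
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}

// Hypothetical helper: reverse the byte order of a 32-bit word.
static inline uint32_t bswap32(uint32_t x) {
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
           ((x & 0x00FF0000u) >> 8)  | ((x & 0xFF000000u) >> 24);
}

// Rotating a byte-swapped word is not the same as byte-swapping
// the rotated word, which is how an endian mix-up corrupts a rotation.
static inline bool rotation_depends_on_byte_order(uint32_t x, unsigned n) {
    return rotr32(bswap32(x), n) != bswap32(rotr32(x, n));
}
```

For example, `rotr32(0x12345678, 8)` gives `0x78123456`, while rotating the byte-swapped input `0x78563412` by the same amount gives `0x12785634`.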