Runtime selection of fp16/fp32 #1649

Merged 6 commits on Jul 25, 2018
Changes from 3 commits
6 changes: 6 additions & 0 deletions src/GTP.cpp
@@ -64,6 +64,9 @@ bool cfg_dumbpass;
 std::vector<int> cfg_gpus;
 bool cfg_sgemm_exhaustive;
 bool cfg_tune_only;
+#ifdef USE_HALF
+bool cfg_use_half;
+#endif
 #endif
 float cfg_puct;
 float cfg_softmax_temp;
@@ -101,6 +104,9 @@ void GTP::setup_default_parameters() {
     cfg_gpus = { };
     cfg_sgemm_exhaustive = false;
     cfg_tune_only = false;
+#ifdef USE_HALF
+    cfg_use_half = false;
+#endif
 #endif
     cfg_puct = 0.8f;
     cfg_softmax_temp = 1.0f;
3 changes: 3 additions & 0 deletions src/GTP.h
@@ -48,6 +48,9 @@ extern bool cfg_dumbpass;
 extern std::vector<int> cfg_gpus;
 extern bool cfg_sgemm_exhaustive;
 extern bool cfg_tune_only;
+#ifdef USE_HALF
+extern bool cfg_use_half;
+#endif
 #endif
 extern float cfg_puct;
 extern float cfg_softmax_temp;
9 changes: 9 additions & 0 deletions src/Leela.cpp
@@ -90,6 +90,9 @@ static void parse_commandline(int argc, char *argv[]) {
          "ID of the OpenCL device(s) to use (disables autodetection).")
         ("full-tuner", "Try harder to find an optimal OpenCL tuning.")
         ("tune-only", "Tune OpenCL only and then exit.")
+#ifdef USE_HALF
+        ("use-half", "Use half-precision OpenCL code. Trades off some accuracy for higher performance")
+#endif
         ;
 #endif
po::options_description selfplay_desc("Self-play options");
@@ -318,6 +321,12 @@ static void parse_commandline(int argc, char *argv[]) {
     if (vm.count("tune-only")) {
         cfg_tune_only = true;
     }
+
+#ifdef USE_HALF
+    if (vm.count("use-half")) {
+        cfg_use_half = true;
+    }
+#endif
 #endif
 
     if (vm.count("benchmark")) {
26 changes: 21 additions & 5 deletions src/Network.cpp
@@ -345,26 +345,42 @@ void Network::initialize(int playouts, const std::string & weightsfile) {
 
     std::vector<ForwardPipe*> to_init;
 
+    bool use_selfcheck = true;
 #ifdef USE_OPENCL
     if (cfg_cpu_only) {
         myprintf("Initializing CPU-only evaluation.\n");
         m_forward = std::make_unique<CPUPipe>();
+        use_selfcheck = false;
     } else {
-        myprintf("Initializing OpenCL.\n");
-        m_forward = std::make_unique<OpenCLScheduler>();
-    }
+#ifdef USE_HALF
+        if (cfg_use_half) {
+            myprintf("Initializing OpenCL (half precision).\n");
+            m_forward = std::make_unique<OpenCLScheduler<half_float::half>>();
+            use_selfcheck = false;
Member:

How badly would the self-check need to be relaxed for HALF to pass?

gcp (Member) on Jul 24, 2018:

The reason I'm asking is that the comment for USE_HALF says "please test before enabling it".

And the user is going to wonder: test what? 😁

Member (Author):

I did a bit of experimenting and ran the whole thing overnight (with self-checking on everything), and I recall that it eventually failed even with 100% margins. In most cases the error was less than 1%, so it was okay.

I even moved the self-check to the final results (from the output of the forward() call), and it still failed with 100% margins. The failures didn't seem to yield anything too problematic (e.g., the policy net moving a probability from 0.05 to 0.1), which is something the tree search can fix.

I guess if we still want the USE_HALF self-check, we would have to do something like "N cases with an average error of XXX%", but I am not sure what the right value or the right way to do this is.

Member:

The self-check has a catch that "rounds" all small values to zero (because the relative error gets very big on a very small value). Maybe it's just a matter of pulling that threshold up slightly.
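For illustration, a minimal sketch of the kind of comparison being discussed, assuming a per-element relative error with a small-value cutoff; the function name and both thresholds are placeholders, not the actual Leela Zero self-check code:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Hypothetical thresholds; the real values would have to be tuned.
    constexpr float SMALL_VALUE_CUTOFF = 1e-3f;  // values below this are "rounded" to zero
    constexpr float MAX_RELATIVE_ERROR = 0.2f;   // hypothetical 20% margin

    bool outputs_agree(const std::vector<float>& ref,
                       const std::vector<float>& test) {
        for (auto i = size_t{0}; i < ref.size(); i++) {
            // Treat tiny values as zero so relative error does not blow up on them.
            auto a = ref[i] < SMALL_VALUE_CUTOFF ? 0.0f : ref[i];
            auto b = test[i] < SMALL_VALUE_CUTOFF ? 0.0f : test[i];
            if (a == 0.0f && b == 0.0f) continue;  // both "rounded" away
            auto rel_err = std::fabs(a - b) / std::max(std::fabs(a), std::fabs(b));
            if (rel_err > MAX_RELATIVE_ERROR) return false;
        }
        return true;
    }

Raising SMALL_VALUE_CUTOFF is the "pulling that up" suggested above: more near-zero outputs get excluded from the relative-error test.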

Member:

I think Leela Zero Chess has a probabilistic self-check, in that it tolerates the occasional failure. I'm not sure how that combines with us already doing the self-check once every xxx nodes, though. At some point you're not going to catch buggy drivers anymore either.

Member (Author):

Technically, what we should test is the strength of the engine (which option yields a better win rate: speed vs. NN accuracy), and that can differ quite a bit depending on which GPU the user has. The last time I tested, I found that high-bandwidth GPUs (e.g., a Tesla P100 on Google Cloud) don't gain much additional performance, so we were sacrificing accuracy for nothing. In those cases --use-half would be worthless.

Maybe all I can say for now is to delete the comment? :)

Member:

One option would be to use KL divergence or some other measure to calculate the error in the self-check. KL divergence doesn't care if the low probabilities are a little off, unlike the current method.
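A sketch of what that could look like, assuming both policy outputs are already normalized probability distributions; the margin in the usage comment is a placeholder, not a tested value:

    #include <cmath>
    #include <vector>

    // KL divergence D(p || q) between the reference (e.g., fp32) policy p and
    // the half-precision policy q. Terms where either probability is zero are
    // skipped, a simplification to keep the sketch finite.
    float kl_divergence(const std::vector<float>& p, const std::vector<float>& q) {
        auto kl = 0.0f;
        for (auto i = size_t{0}; i < p.size(); i++) {
            if (p[i] > 0.0f && q[i] > 0.0f) {
                kl += p[i] * std::log(p[i] / q[i]);
            }
        }
        return kl;
    }

    // A large term requires both a sizable p[i] and a sizable log-ratio, so
    // low-probability moves that are slightly off contribute almost nothing:
    // bool ok = kl_divergence(cpu_policy, gpu_policy) < 0.01f;  // placeholder margin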

Member (Author):

Okay, what I will do is:

  • If it fails 2 out of the 5 most recent checks, it will assert-fail.
  • Regarding the error margin, I will test a bit to see what is acceptable.
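A sketch of that policy, assuming a simple sliding window over the most recent self-check outcomes; the class and names are illustrative, not the code that eventually landed:

    #include <cassert>
    #include <deque>

    class SelfCheckHistory {
    public:
        void add_result(bool passed) {
            m_recent.push_back(passed);
            if (m_recent.size() > 5) {
                m_recent.pop_front();  // keep only the 5 most recent checks
            }
            auto failures = 0;
            for (const auto ok : m_recent) {
                if (!ok) failures++;
            }
            // Tolerate one outlier; two failures in the window suggests
            // something genuinely wrong (e.g., a buggy driver), so abort.
            assert(failures < 2);
        }
    private:
        std::deque<bool> m_recent;
    };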

+        } else {
+            myprintf("Initializing OpenCL (single precision).\n");
+            m_forward = std::make_unique<OpenCLScheduler<float>>();
+        }
+#else
+        myprintf("Initializing OpenCL (single precision).\n");
+        m_forward = std::make_unique<OpenCLScheduler<float>>();
+#endif
+    }
 
 #else //!USE_OPENCL
     myprintf("Initializing CPU-only evaluation.\n");
     m_forward = std::make_unique<CPUPipe>();
+    use_selfcheck = false;
 #endif
 
     to_init.emplace_back(m_forward.get());
 
 #ifdef USE_OPENCL_SELFCHECK
-    if (!cfg_cpu_only) {
+    if (use_selfcheck) {
         m_forward_cpu = std::make_unique<CPUPipe>();
         to_init.emplace_back(m_forward_cpu.get());
     }
+#else
+    (void)use_selfcheck;
 #endif

     for (const auto& p : to_init) {
66 changes: 44 additions & 22 deletions src/OpenCL.cpp
@@ -41,11 +41,19 @@
 
 using namespace Utils;
 
-static std::string cl_args =
+template <typename net_t> static std::string getClArgs();
+
+template <> std::string getClArgs<float>() {
+    return
+        "-cl-mad-enable -cl-fast-relaxed-math -cl-no-signed-zeros -cl-denorms-are-zero";
+}
 #ifdef USE_HALF
-    "-DUSE_HALF "
+template <> std::string getClArgs<half_float::half>() {
+    return
+        "-DUSE_HALF "
+        "-cl-mad-enable -cl-fast-relaxed-math -cl-no-signed-zeros -cl-denorms-are-zero";
+}
 #endif
-    "-cl-mad-enable -cl-fast-relaxed-math -cl-no-signed-zeros -cl-denorms-are-zero";
 
 static std::string sourceCode_config = R"(
 #ifdef USE_HALF
@@ -487,25 +495,23 @@ __kernel void out_transform_fused_bn_in(
 }
 )";
 
-#ifdef USE_HALF
 const std::string sourceCode_sgemm =
+    "#ifdef USE_HALF\n"
     #include "clblast_level3_half/common.opencl"
     #include "clblast_level3_half/xgemm_part1.opencl"
     #include "clblast_level3_half/xgemm_part2.opencl"
     #include "clblast_level3_half/xgemm_part3.opencl"
     #include "clblast_level3_half/xgemm_batched.opencl"
-;
-#else
-const std::string sourceCode_sgemm =
+    "#else\n"
     #include "clblast_level3/common.opencl"
     #include "clblast_level3/xgemm_part1.opencl"
     #include "clblast_level3/xgemm_part2.opencl"
     #include "clblast_level3/xgemm_part3.opencl"
     #include "clblast_level3/xgemm_batched.opencl"
-;
-#endif
+    "#endif\n";
 
-void OpenCL::ensure_context_initialized(OpenCLContext &opencl_context) {
+template <typename net_t>
+void OpenCL<net_t>::ensure_context_initialized(OpenCLContext &opencl_context) {
     if (!opencl_context.m_is_initialized) {
         // Make kernels
         opencl_context.m_convolve1_kernel =
@@ -526,7 +532,8 @@ void OpenCL::ensure_context_initialized(OpenCLContext &opencl_context) {
     }
 }
 
-void OpenCL_Network::add_weights(size_t layer,
+template <typename net_t>
+void OpenCL_Network<net_t>::add_weights(size_t layer,
                                  size_t size,
                                  const float * weights) {
     if (layer >= m_layers.size()) {
@@ -538,15 +545,16 @@ void OpenCL_Network::add_weights(size_t layer,
         converted_weights.emplace_back(weights[i]);
     }
 
-    auto weightSize = size * sizeof(decltype(converted_weights)::value_type);
+    auto weightSize = size * sizeof(typename decltype(converted_weights)::value_type);
     m_layers.back().weights.emplace_back(
         m_opencl.m_context,
         CL_MEM_COPY_HOST_PTR | CL_MEM_READ_ONLY,
         weightSize,
         const_cast<net_t*>(converted_weights.data()));
 }
 
-void OpenCL_Network::forward(const std::vector<float>& input,
+template <typename net_t>
+void OpenCL_Network<net_t>::forward(const std::vector<float>& input,
                              std::vector<float>& output_pol,
                              std::vector<float>& output_val,
                              OpenCLContext & opencl_context,
@@ -729,7 +737,8 @@ void OpenCL_Network::forward(const std::vector<float>& input,
 
 }
 
-void OpenCL_Network::convolve3(OpenCLContext & opencl_context,
+template <typename net_t>
+void OpenCL_Network<net_t>::convolve3(OpenCLContext & opencl_context,
                                int channels, int outputs,
                                cl::Buffer& bufferIn,
                                cl::Buffer& bufferOut,
@@ -877,7 +886,8 @@ void OpenCL_Network::convolve3(OpenCLContext & opencl_context,
     }
 }
 
-void OpenCL_Network::convolve1(OpenCLContext & opencl_context,
+template <typename net_t>
+void OpenCL_Network<net_t>::convolve1(OpenCLContext & opencl_context,
                                int channels, int outputs,
                                cl::Buffer& bufferInput,
                                cl::Buffer& bufferOutput,
@@ -966,7 +976,8 @@ static std::string trim(std::string trim_me) {
     return trim_me;
 }
 
-void OpenCL::process_tuners(std::string tuners) {
+template <typename net_t>
+void OpenCL<net_t>::process_tuners(std::string tuners) {
     std::string buf;
     std::stringstream ss(tuners);
     std::size_t found;
@@ -1043,7 +1054,8 @@ void OpenCL::process_tuners(std::string tuners) {
     }
 }
 
-std::vector<size_t> OpenCL::get_sgemm_tuners(void) {
+template <typename net_t>
+std::vector<size_t> OpenCL<net_t>::get_sgemm_tuners(void) {
     std::vector<size_t> tuners;
 
     tuners.emplace_back(m_sgemm_tuners.mwg);
@@ -1057,7 +1069,8 @@ std::vector<size_t> OpenCL::get_sgemm_tuners(void) {
     return tuners;
 }
 
-void OpenCL::initialize(const int channels, int gpu, bool silent) {
+template <typename net_t>
+void OpenCL<net_t>::initialize(const int channels, int gpu, bool silent) {
     std::vector<cl::Platform> platforms;
     try {
         cl::Platform::get(&platforms);
@@ -1183,9 +1196,9 @@ void OpenCL::initialize(const int channels, int gpu, bool silent) {
         throw std::runtime_error("Error getting OpenCL kernels.");
     }
 
-    m_cl_args = cl_args;
+    m_cl_args = getClArgs<net_t>();
 
-    auto t = Tuner(*this, m_context, m_device);
+    auto t = Tuner<net_t>(*this, m_context, m_device);
     auto sgemm_tuners =
         t.load_sgemm_tuners(channels, WINOGRAD_P, channels, WINOGRAD_TILE);
 
@@ -1198,7 +1211,7 @@ void OpenCL::initialize(const int channels, int gpu, bool silent) {
 
     // Build program for these specific devices
     try {
-        std::string args = cl_args;
+        std::string args = m_cl_args;
         // Intel iGPUs need vector types for math for best performance
         if (m_device.getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT>() > 1) {
             args += " -DWINOGRAD_SIMD";
@@ -1234,7 +1247,8 @@ void OpenCL::initialize(const int channels, int gpu, bool silent) {
     m_init_ok = true;
 }
 
-std::string OpenCL::get_device_name() {
+template <typename net_t>
+std::string OpenCL<net_t>::get_device_name() {
     std::stringstream ss;
 
     ss << "OpenCL: ";
@@ -1244,4 +1258,12 @@ std::string OpenCL::get_device_name() {
 
     return ss.str();
 }
+
+template class OpenCL<float>;
+template class OpenCL_Network<float>;
+#ifdef USE_HALF
+template class OpenCL<half_float::half>;
+template class OpenCL_Network<half_float::half>;
+#endif
 
 #endif
21 changes: 12 additions & 9 deletions src/OpenCL.h
@@ -34,10 +34,11 @@
 
 #include "Tuner.h"
 
-class OpenCL;
+template <typename net_t> class OpenCL;
+template <typename net_t> class OpenCL_Network;
 
 class Layer {
-    friend class OpenCL_Network;
+    template <typename> friend class OpenCL_Network;
 private:
     unsigned int channels{0};
     unsigned int outputs{0};
@@ -49,8 +50,8 @@ class Layer {
 };
 
 class OpenCLContext {
-    friend class OpenCL;
-    friend class OpenCL_Network;
+    template <typename> friend class OpenCL;
+    template <typename> friend class OpenCL_Network;
 private:
     bool m_is_initialized{false};
     cl::CommandQueue m_commandqueue;
@@ -69,10 +70,11 @@ class OpenCLContext {
     bool m_buffers_allocated{false};
 };
 
+template <typename net_t>
 class OpenCL_Network {
 public:
-    OpenCL_Network(OpenCL & opencl) : m_opencl(opencl) {}
-    OpenCL & getOpenCL() {
+    OpenCL_Network(OpenCL<net_t> & opencl) : m_opencl(opencl) {}
+    OpenCL<net_t> & getOpenCL() {
         return m_opencl;
     }
 
@@ -166,7 +168,7 @@ class OpenCL_Network {
                    weight_slice_t weights,
                    int batch_size);
 
-    OpenCL & m_opencl;
+    OpenCL<net_t> & m_opencl;
 
     // this mutex is not required for correctness, but this exists simply
     // because queue.finish() is a busy wait and having a lot of threads
@@ -176,9 +178,10 @@ class OpenCL_Network {
     std::vector<Layer> m_layers;
 };
 
+template <typename net_t>
 class OpenCL {
-    friend class OpenCL_Network;
-    friend class Tuner;
+    friend class OpenCL_Network<net_t>;
+    friend class Tuner<net_t>;
 public:
     void initialize(const int channels, int gpu, bool silent = false);
     void ensure_context_initialized(OpenCLContext & opencl_context);