Jin Wang edited this page Mar 27, 2015 · 1 revision

Introduction

The functional emulator provided by Ocelot can be bound to one or more trace generators, which are allowed to inspect the entire system state after each instruction is executed. Ocelot provides a high-level API that can be integrated into existing CUDA programs to bind user-defined trace generators to CUDA kernels.

Built In Trace Generators

Ocelot currently has four supported trace generators (InteractiveDebugger, MemoryChecker, MemoryRaceDetector, and KernelTimer) that are integrated with the runtime. There are also other trace generators in the trunk/trace-generators directory that are not built by default and are not kept up to date with the rest of the project, but which may be useful as examples. Upon startup, Ocelot reads the configure.ocelot file from the current directory of any CUDA program. This file lets you specify which trace generators to use; any trace generator selected here is bound to every kernel that is invoked. For example,

{
	ocelot: "ocelot-refactored",
	version: "1.0.65",
	trace: { 
		enabled: true,
		database: "traces/database.trace",
		memory: true,
		branch: false,
		sharedComputation: true,
		instruction: false,
		parallelism: false,
		cacheSimulator: false,
		memoryChecker: true,
		raceDetector: true
	},
	cuda: {
		implementation: "CudaRuntime",
		runtimeApiTrace: "trace/CudaAPI.trace"
	},
	executive: {
		devices: [ nvidia, emulated, llvm ],
		optimizationLevel: basic,
		workerThreadLimit: 1
	}
}

This config file specifies that Ocelot should attach the MemoryTraceGenerator and the SharedComputationGenerator to each kernel that is launched. These trace generators record their results in a database file that keeps track of traces for individual kernels; its path is given by the database parameter.

Once traces have been generated, they appear alongside the database file. For example:

gdiamos@cuda:~/checkout/gpuocelot/trunk/tests/cuda2.3$ ls traces/
database.trace                                   SortingNetwork__Z18bitonicMergeGlob_1_20.header  _Z17bitonicSortShare_2_0.trace    _Z18bitonicMergeGlob_3_22.header
SortingNetwork__Z17bitonicSortShare_0_0.header   SortingNetwork__Z18bitonicMergeGlob_1_21.header  _Z17bitonicSortShare_2_1.header   _Z18bitonicMergeGlob_3_25.header
SortingNetwork__Z17bitonicSortShare_0_0.trace    SortingNetwork__Z18bitonicMergeGlob_1_22.header  _Z17bitonicSortShare_2_1.trace    _Z18bitonicMergeGlob_3_27.header

Analyzing Traces

Generated traces can be examined with a trace analyzer; Ocelot likewise has four trace analyzers built in.

For example, you can use the MemoryTraceAnalyzer to print information about the memory behavior of kernels. To list the memory traces for each kernel:

gdiamos@cuda:~/checkout/gpuocelot/trunk/tests/cuda2.3$ ../../ocelot/MemoryTraceAnalyzer -i traces/database.trace -l
There are 88 kernels referenced in the database
 /home/gdiamos/checkout/gpuocelot/trunk/tests/cuda2.3/traces/_Z17bitonicSortShare_2_0.trace
 /home/gdiamos/checkout/gpuocelot/trunk/tests/cuda2.3/traces/_Z17bitonicSortShare_2_1.trace
 /home/gdiamos/checkout/gpuocelot/trunk/tests/cuda2.3/traces/_Z17bitonicSortShare_2_2.trace
 /home/gdiamos/checkout/gpuocelot/trunk/tests/cuda2.3/traces/_Z17bitonicSortShare_2_3.trace
 /home/gdiamos/checkout/gpuocelot/trunk/tests/cuda2.3/traces/_Z17bitonicSortShare_2_4.trace
 /home/gdiamos/checkout/gpuocelot/trunk/tests/cuda2.3/traces/_Z18bitonicSortShare_2_5.trace

and print out statistics about them:

gdiamos@cuda:~/checkout/gpuocelot/trunk/tests/cuda2.3$ ../../ocelot/MemoryTraceAnalyzer -i traces/database.trace --Global

Kernel _Z17bitonicSortSharedPjS_S_S_jj
         thread count: 512
  memory instructions: 512
            halfwarps: 1152720
         transactions: 3430300
  average access size: 2048 bytes
                  t/h: 2.97583

A New Trace Generator

The base trace generator class is defined in ocelot/trace/interface/TraceGenerator.h. A generator is passed a handle to the kernel before it is executed, so that it can read input parameters, and is then passed a TraceEvent after each instruction is executed.

An Example

This example shows how to use the trace generator API. Unfortunately, TraceEvent uses parts of Boost that cannot be compiled by NVCC, so the program has to be split into two parts: trace.cpp, which is compiled by g++, and tracegen.cu, which is compiled with nvcc.

The Code

trace.cpp

#include <ocelot/api/interface/ocelot.h>
#include <ocelot/trace/interface/TraceGenerator.h>
#include <ocelot/trace/interface/TraceEvent.h>
#include <iostream>

// A user-defined trace generator: event() is invoked by the emulator
// after each instruction is executed.
class TraceGenerator : public trace::TraceGenerator
{
	public:
		virtual void event(const trace::TraceEvent & event)
		{
			std::cout << "Got event " << event.instruction->toString() << "\n";
		}
};

extern void sampleKernel();

int main()
{
	TraceGenerator generator;
	ocelot::addTraceGenerator( generator );
	
	sampleKernel();
}

tracegen.cu

// Device kernel: each thread adds its thread index to one element.
__global__ void sampleKernel( int* a )
{
	a[ threadIdx.x ] += threadIdx.x;
}

// Host wrapper called from trace.cpp (kept here because NVCC must
// compile the kernel launch).
void sampleKernel()
{
	int* a;
	cudaMalloc( (void**) &a, sizeof( int ) * 128 );
	cudaMemset( a, 0, sizeof( int ) * 128 );
	
	sampleKernel<<< 1, 128, 0 >>>( a );
	
	cudaFree( a );
}

Building

I did the following to build this example on my machine:

nvcc -c tracegen.cu

g++ -std=c++0x trace.cpp -c -I .

g++ -o tracegen tracegen.o trace.o `OcelotConfig -l`

Note the backticks around OcelotConfig -l: the command prints the linker flags needed to pull in the Ocelot libraries, and the shell substitutes its output into the link line.

Running

Running the program with:

./tracegen

produced the following output:

Got event cvt.u32.u16 %r1, %tid.x
Got event ld.param.u64 %rd1, [__cudaparm__Z12sampleKernelPi_a]
Got event cvt.u64.u32 %rd2, %r1
Got event mul.lo.u64 %rd3, %rd2, 4
Got event add.u64 %rd4, %rd1, %rd3
Got event ld.global.u32 %r2, [%rd4 + 0]
Got event add.u32 %r3, %r2, %r1
Got event st.global.s32 [%rd4 + 0], %r3
Got event exit