
What is a KPU? Why do you need one?

The KPU, or Knowledge Processing Unit, is a neural network processor; it is the core of the AI processing part of the K210. So how does the KPU handle AI algorithms? First of all, the current (2019Q1) so-called AI algorithms are mainly neural network models derived from neural network research, such as VGG, ResNet, Inception, Xception, SqueezeNet, and MobileNet.
Then why not use an ordinary CPU/MCU to compute these neural network algorithms?
Because for most application scenarios, the amount of computation in a neural network is simply too large. Take RGB image analysis at 640×480 pixels: assume the first layer of the network applies 16 3×3 convolution kernels per color channel, so that layer alone performs 640×480×3×16 ≈ 15M kernel evaluations. Each 3×3 kernel costs 9 multiply-accumulates, and each multiply-accumulate needs two operand loads (3 cycles each), one multiply (1 cycle), one add (1 cycle), and a loop compare and branch (2 cycles), about 10 cycles in all, so one kernel takes roughly 9 × 10 = 90 cycles. Computing a single layer therefore needs about 15M × 90 = 1.35G cycles!
Rounding down to 1G cycles: an STM32 running at 100MHz takes about 10s, and even a Cortex-A7 running at 1GHz needs 1s, just to compute one layer! A practical neural network model usually requires more than 10 layers of computation, so on an unoptimized CPU a single inference takes seconds or even minutes!
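
As a ballpark illustration of where those cycles go, here is the naive per-pixel inner loop such a CPU would run (a minimal sketch; the buffer layout and names are ours):

```c
#include <stdint.h>

/* Naive 3x3 convolution of one input channel at one pixel position.
 * Each of the 9 taps does two loads, a multiply, an add and the loop
 * bookkeeping -- the ~10 cycles per tap estimated above. */
int32_t conv3x3(const uint8_t *img, int stride, const int8_t k[9])
{
    int32_t acc = 0;
    for (int ky = 0; ky < 3; ky++)
        for (int kx = 0; kx < 3; kx++)
            acc += img[ky * stride + kx] * k[ky * 3 + kx];
    return acc;
}
```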
Therefore, in general, computing neural networks on a CPU/MCU is too time-consuming to be practical. Neural network workloads further divide into a training side and an inference side. The high computing power required to train models is already well served by NVIDIA's high-performance graphics cards. Model inference, however, usually runs on consumer/industrial electronic terminals, i.e. AIoT devices, which have strict volume and energy-consumption requirements, so a dedicated acceleration module must be introduced to speed up model inference. This is where the KPU comes in.

KPU basic architecture

Let us review the basic operations of classical neural networks:

  1. Convolution: 1×1, 3×3, 5×5 and larger kernels
  2. Batch normalization
  3. Activation
  4. Pooling
  5. Matrix operations: matrix multiplication and addition

Basic neural network structures need only operations 1, 2, 3, and 4. Newer structures such as ResNet add a shortcut tensor to the convolution result, which requires the fifth category, general matrix operations.

The Kendryte K210 has built-in hardware acceleration for convolution, batch normalization, activation, and pooling, but it does not implement general matrix operations, so there are restrictions on the network structures it can run. For network structures that require additional operations, the user must manually insert CPU-processed layers after the hardware completes the basic operations, which reduces the frame rate.
Users are therefore advised to optimize their network structure down to the basic form. Fortunately, the second generation of the chip will support general matrix computation and hard-wire more types of network structure.
In the KPU, the four basic operations above are not separate acceleration modules but one integrated acceleration pipeline. This effectively avoids the overhead of CPU intervention between stages, at the cost of some operational flexibility.
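
Conceptually, each output element flows through the fused stages back to back. A minimal fixed-point sketch of that pipeline, built from the formulas quoted in the register comments below (the function and parameter names are ours, not the hardware's):

```c
#include <stdint.h>

/* One convolution result through the fused KPU stages (conceptual sketch).
 * conv_acc: raw convolution accumulator for one pixel/channel. */
int64_t kpu_fused_element(int64_t conv_acc,
                          int64_t arg_x, int shr_x,   /* conv scaling: y = (x*arg_x)>>shr_x    */
                          int64_t norm_mul, int64_t norm_add,
                          int norm_shift)             /* batchnorm: y = ((x*mul)>>shift)+add   */
{
    int64_t x = (conv_acc * arg_x) >> shr_x;          /* 1. convolution output scaling */
    x = ((x * norm_mul) >> norm_shift) + norm_add;    /* 2. batch normalization        */
    /* 3. activation: 16-segment piecewise-linear lookup (see the table below) */
    /* 4. pooling: e.g. 2x2 max pooling across neighboring results            */
    return x;
}
```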
We reconstructed the workings of the KPU acceleration module from the standalone sdk/demo and the Model Compiler, as analyzed below.

KPU register configuration instructions

The chip manufacturer has not published a register manual; we deduced each register definition from kpu.c, kpu.h, and the Model Compiler. KPU register configuration is written via the kpu_layer_argument_t structure. We use gencode_output.c from the kpu demo in the standalone demo repository for the analysis (https://github.com/kendryte/kendryte-standalone-demo/blob/master/kpu/gencode_output.c).

```c
// Layer parameter list, 16 layers in total
kpu_layer_argument_t la[] __attribute__((aligned(128))) = {
// layer 0
{
 .kernel_offset.data = {
  .coef_row_offset = 0,           // fixed to 0
  .coef_column_offset = 0         // fixed to 0
 },
 .image_addr.data = {             // image input/output addresses: one at each end of KPU RAM; successive layers swap them, avoiding copies
  .image_dst_addr = (uint64_t)0x6980, // image output address: int((0 if idx & 1 else (img_ram_size - img_output_size)) / 64)
  .image_src_addr = (uint64_t)0x0     // image load address
 },
 .kernel_calc_type_cfg.data = {
  .load_act = 1,                  // enable the activation function; must be 1 (the hardware is designed that way), otherwise the output is all 0
  .active_addr = 0,               // address of the activation parameters, set to the activation table in kpu_task_init
  .row_switch_addr = 0x5,         // units occupied by one image row, one unit = 64 bytes: ceil(width/64) = ceil(320/64) = 5
  .channel_switch_addr = 0x4b0,   // units occupied by one channel: row_switch_addr*height = 5*240 = 1200 = 0x4b0
  .coef_size = 0,                 // fixed to 0
  .coef_group = 1                 // number of groups calculated at a time (one unit is 64 bytes):
                                  // width > 32: set to 1; width 17~32: set to 2; width <= 16: set to 4
 },
 .interrupt_enabe.data = {
  .depth_wise_layer = 0,          // regular convolutional layer: set to 0
  .ram_flag = 0,                  // fixed to 0
  .int_en = 0,                    // disable interrupt
  .full_add = 0                   // fixed to 0
 },
 .dma_parameter.data = {          // DMA transfer parameters
  .dma_total_byte = 307199,       // this layer outputs 16 channels: 19200*16 = 307200, minus 1
  .send_data_out = 0,             // enable output data
  .channel_byte_num = 19199       // bytes per output channel: after 2x2 pooling, 160*120 = 19200, minus 1
 },
 .conv_value.data = {             // convolution scaling: y = (x*arg_x)>>shr_x
  .arg_x = 0x809179,              // 24-bit multiplier
  .arg_w = 0x0,
  .shr_x = 8,                     // 4-bit shift
  .shr_w = 0
 },
 .conv_value2.data = {            // arg_add = kernel_size * kernel_size * bw_div_sw * bx_div_sx = 3*3*?*?
  .arg_add = 0
 },
 .write_back_cfg.data = {         // write-back configuration
  .wb_row_switch_addr = 0x3,      // ceil(160/64) = 3
  .wb_channel_switch_addr = 0x168,// 120*3 = 360 = 0x168
  .wb_group = 1                   // row width > 32: set to 1
 },
 .image_size.data = {             // input 320*240, output 160*120; registers hold the dimension minus 1
  .o_col_high = 0x77,             // 119
  .i_col_high = 0xef,             // 239
  .i_row_wid = 0x13f,             // 319
  .o_row_wid = 0x9f               // 159
 },
 .kernel_pool_type_cfg.data = {
  .bypass_conv = 0,               // the hardware cannot skip convolution: fixed to 0
  .pad_value = 0x0,               // boundary padding value 0
  .load_para = 1,                 // the hardware cannot skip normalization: fixed to 1
  .pad_type = 0,                  // use the padding value
  .kernel_type = 1,               // 3x3 kernel: set to 1 (1x1: set to 0)
  .pool_type = 1,                 // pooling type: 2x2 max pooling with stride 2
  .dma_burst_size = 15,           // DMA burst transfer size, 16 bytes; fixed to 16 in the generator script
  .bwsx_base_addr = 0,            // batch normalization table address, set in kpu_task_init
  .first_stride = 0               // 0: image height does not exceed 255; 1: image height up to 512
 },
 .image_channel_num.data = {
  .o_ch_num_coef = 0xf,           // output channels computed per parameter load, minus 1: 16 channels (4KB / single-channel kernel size)
                                  // o_ch_num_coef = math.floor(weight_buffer_size / o_ch_weights_size_pad)
  .i_ch_num = 0x2,                // input channels minus 1: 3-channel RGB
  .o_ch_num = 0xf                 // output channels minus 1: 16 channels
 },
 .kernel_load_cfg.data = {
  .load_time = 0,                 // kernel load count: parameters not exceeding 72KB are loaded only once
  .para_size = 864,               // kernel parameter size in bytes: 864 = 3 (RGB) * 9 (3x3) * 2 (bytes) * 16
  .para_start_addr = 0,           // start address
  .load_coor = 1                  // allow loading of convolution parameters
 }
},
// ... layer 0 parameters end; layers 1..15 follow in the same form ...
};
```
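
The address and size fields above follow simple formulas of the image geometry; they can be recomputed to sanity-check the comments (a standalone sketch using layer 0's numbers; variable names are ours):

```c
#include <math.h>
#include <stdio.h>

/* Recompute layer 0's derived register fields (320x240 RGB in,
 * 160x120x16 out) from the formulas in the comments above. */
int main(void)
{
    int in_w = 320, in_h = 240, out_w = 160, out_h = 120, out_ch = 16;

    printf("row_switch_addr     = %d\n", (int)ceil(in_w / 64.0));           /* 5      */
    printf("channel_switch_addr = %#x\n", (int)ceil(in_w / 64.0) * in_h);   /* 0x4b0  */
    printf("wb_row_switch_addr  = %d\n", (int)ceil(out_w / 64.0));          /* 3      */
    printf("wb_channel_switch   = %#x\n", (int)ceil(out_w / 64.0) * out_h); /* 0x168  */
    printf("channel_byte_num    = %d\n", out_w * out_h - 1);                /* 19199  */
    printf("dma_total_byte      = %d\n", out_w * out_h * out_ch - 1);       /* 307199 */
    /* the image-size registers hold the dimension minus one: */
    printf("i_row_wid=%#x i_col_high=%#x o_row_wid=%#x o_col_high=%#x\n",
           in_w - 1, in_h - 1, out_w - 1, out_h - 1);  /* 0x13f 0xef 0x9f 0x77 */
    return 0;
}
```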

Some fields of the structure above are left unfilled; they are filled in by the KPU initialization function:

```c
kpu_task_t* kpu_task_init(kpu_task_t* task){
 la[0].kernel_pool_type_cfg.data.bwsx_base_addr = (uint64_t)&bwsx_base_addr_0; // set the batch normalization table
 la[0].kernel_calc_type_cfg.data.active_addr = (uint64_t)&active_addr_0;       // set the activation table
 la[0].kernel_load_cfg.data.para_start_addr = (uint64_t)&para_start_addr_0;    // set the kernel parameter load address
 ......                                          // likewise for all 16 layers, layer by layer
 task->layers = la;
 task->layers_length = sizeof(la)/sizeof(la[0]); // 16 layers
 task->eight_bit_mode = 0;                       // 16-bit mode
 task->output_scale = 0.12349300010531557;       // output scale and bias
 task->output_bias = -13.528212547302246;
 return task;
}
```
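
The output_scale/output_bias pair suggests that raw fixed-point KPU outputs are mapped back to real values with a linear transform. A hedged sketch of that final step (the formula is our reading of the field names, not vendor-documented):

```c
#include <stdint.h>

/* Map a raw KPU output back to a real value using the layer's
 * output_scale/output_bias from kpu_task_init above. The exact
 * formula is our assumption based on the field names. */
static inline float kpu_dequantize(uint16_t raw, float scale, float bias)
{
    return (float)raw * scale + bias;
}
```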

As can be seen, the batch normalization table, the activation table, and the convolution parameter load address are initialized here. The activation function is a piecewise-linear table with 16 intervals:

```c
// y = (uint8_t)((((uint64_t)(x - x_start) * y_mul) >> shift) + bias);
kpu_activate_table_t active_addr_0 __attribute__((aligned(128))) = {
 .activate_para = { // shift_number 8bit, y_mul 16bit, x_start 36bit; 8+16+36 = 60, packed into a 64-bit register
  {.data = {.shift_number=0,  .y_mul=0,     .x_start=0x800000000 }},
  {.data = {.shift_number=39, .y_mul=29167, .x_start=0xfe1a77234 }},
  {.data = {.shift_number=39, .y_mul=29167, .x_start=0xff4c0c897 }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0xfffffafbb }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0xc90319 }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0x2b1f223 }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0x49ae12d }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0x683d037 }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0x86cbf41 }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0xa55ae4b }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0xc3e9d54 }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0xe278c5e }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0x10107b68 }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0x11f96a72 }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0x13e2597c }},
  {.data = {.shift_number=35, .y_mul=18229, .x_start=0x15cb4886 }}
 },
 .activate_para_bias0.data = { // bias 8bit; 8 biases packed into one 64-bit register
  .result_bias = {0,0,17,27,34,51,68,85}
 },
 .activate_para_bias1.data = {
  .result_bias = {102,119,136,153,170,187,204,221}
 }
};
```
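
Putting the table to use: the formula in the comment above is applied with the parameters of whichever interval x falls into. A minimal software model (the interval-selection rule is our assumption; x_start is a 36-bit signed field, so it is sign-extended before comparison):

```c
#include <stdint.h>

/* Sign-extend a 36-bit x_start value to 64 bits. */
static int64_t sext36(uint64_t v)
{
    return (int64_t)(v << 28) >> 28;
}

/* Evaluate the 16-interval piecewise-linear activation per the formula
 * y = (uint8_t)((((uint64_t)(x - x_start) * y_mul) >> shift) + bias).
 * We assume the hardware picks the last interval whose x_start <= x. */
uint8_t kpu_activate(int64_t x, const uint64_t x_start[16],
                     const uint16_t y_mul[16], const uint8_t shift[16],
                     const uint8_t bias[16])
{
    int seg = 0;
    for (int i = 1; i < 16; i++)
        if (x >= sext36(x_start[i]))
            seg = i;
    return (uint8_t)((((uint64_t)(x - sext36(x_start[seg])) * y_mul[seg])
                      >> shift[seg]) + bias[seg]);
}
```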

Batch normalization table, 16 channels:

```c
// y = ((x * norm_mul) >> norm_shift) + norm_add
// the generator fixes norm_shift at 15
kpu_batchnorm_argument_t bwsx_base_addr_0[] __attribute__((aligned(128))) = {
 {.batchnorm.data = {.norm_mul = 0x4c407,  .norm_add = 0x23523f0,  .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0x79774,  .norm_add = 0x493a3e,   .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0x4bd72,  .norm_add = 0xf58bae,   .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0x10a7ae, .norm_add = 0x99cf06,   .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0xe1ea4,  .norm_add = 0x289634,   .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0x150a0,  .norm_add = 0x2428afc,  .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0xa72e4,  .norm_add = 0xffd850ff, .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0x7b54b,  .norm_add = 0x71a3b5,   .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0x1cb84b, .norm_add = 0x13fef34,  .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0x1b8a86, .norm_add = 0x342b07,   .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0x5dd03,  .norm_add = 0x965b43,   .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0xb2607,  .norm_add = 0x259e2c0,  .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0xa1abb,  .norm_add = 0x1b68398,  .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0x25a89,  .norm_add = 0x202e81c,  .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0x54d31,  .norm_add = 0x61c1e20,  .norm_shift = 15}},
 {.batchnorm.data = {.norm_mul = 0x62b56,  .norm_add = 0x6cd3fc,   .norm_shift = 15}}
};
```
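
For reference, the per-channel normalization step these entries drive can be modeled in a few lines (a sketch of the formula above; operand widths and signedness are our assumptions):

```c
#include <stdint.h>

/* Fixed-point batch normalization as encoded in the table above:
 * y = ((x * norm_mul) >> norm_shift) + norm_add, one entry per channel. */
static inline int64_t kpu_batchnorm(int64_t x, uint32_t norm_mul,
                                    int32_t norm_add, unsigned norm_shift)
{
    return ((x * (int64_t)norm_mul) >> norm_shift) + norm_add;
}
```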
The convolution kernel parameter table:

```c
uint16_t para_start_addr_0[] __attribute__((aligned(128))) = {
0x51d4, 0x560f, 0x4496, 0x555b, 0x5119, 0x5a03, 0x566f, 0x53c6, 0x498f, 0xb5ef, 0xbf72, 0xa7ab, 0x9d7e, 0x9035, 0xa15d, 0x8e32, 0x9507, 0x85d2, 0x70b1, 0x806f, 0x79c0, 0x8b4d, 0x98fe, 0x95ee, 0x9c96, 0x9bfc, 0x9f36, 0xdb30, 0x33ef, 0x6032, 0xebe6, 0x39d3, 0x633b, 0xd744, 0x4194, 0x6707, 0xcb4e, 0x34ba, 0x7687, 0xdfb0, 0x30bb, 0x7927, 0xb97d, 0x40d3, 0x7fe4, 0xb72b, 0x523d, 0x7104, 0xc994, 0x50be, 0x70e3, 0xb16a, 0x58dd, 0x6914, 0x8afb, 0x7f23, 0x7e6f, 0x7fdc, 0x4bf7, 0x7835, 0x80bf, 0x7dc3, 0x7ba0, 0x70db,
0x774a, 0x7f8f, 0x791c, 0x5f55, 0x82b8, 0x8066, 0x83f0, 0x820b, 0x825d, 0x8649, 0x7df9, 0x7a0e, 0x558a, 0x8ae2, 0x7f27, 0x7f64, 0x79a9, 0x615e, 0x6635, 0x65f2, 0x824f, 0x816a, 0x8680, 0x98e6, 0x9884, 0x933f, 0x680a, 0x6a0d, 0x6b9e, 0x9035, 0x87a4, 0x8779, 0x87f4, 0x8c33, 0x84bb, 0x6415, 0x7002, 0x6db9, 0x99cc, 0x8e8d, 0x9150, 0x8556, 0x8298, 0x82e6, 0x872e, 0x7ff5, 0x7c8a, 0x81e7, 0x4df1, 0xadaf, 0xb520, 0xc1b9, 0x0, 0x8093, 0x812b, 0x82d4, 0x7b23, 0x53f7, 0xb5e5, 0xa308, 0xc0fc, 0xd2e, 0x7f08, 0x8090,
0x7ac9, 0x7b27, 0x5049, 0xb1f0, 0xa683, 0xc544, 0x1633, 0x73b7, 0x6d6e, 0x7597, 0x7b5c, 0x71c0, 0x7b5d, 0x7561, 0x7153, 0x7ec1, 0x74af, 0x6acf, 0x7898, 0x7ee8, 0x73be, 0x7e1a, 0x856e, 0x7fe0, 0x8b5d, 0x78f3, 0x77b6, 0x7fd6, 0x77d0, 0x73c8, 0x8384, 0x70ab, 0x7638, 0x8448, 0x5e13, 0x41d6, 0x5742, 0xd6fd, 0xf185, 0xd8ff, 0x52ac, 0x3afd, 0x531b, 0x674c, 0x4db2, 0x5a31, 0xc677, 0xe222, 0xbd9b, 0x64ce, 0x494b, 0x5a67, 0x82e9, 0x721e, 0x7b5b, 0xae49, 0xbedb, 0xac77, 0x5161, 0x41bb, 0x56f4, 0xb5e4, 0xb0e6, 0x942f,
0x8681, 0x8714, 0x8395, 0x4160, 0x4763, 0x5e49, 0xbae2, 0xb877, 0x940d, 0x9473, 0x9238, 0x91d7, 0x3023, 0x33e8, 0x56ec, 0xa9d7, 0xa6de, 0x8f28, 0x94c0, 0x9261, 0x8ba5, 0x452b, 0x4c9c, 0x5ad7, 0x93df, 0x80e4, 0x685c, 0x887f, 0x85e8, 0x5ae7, 0x6a0a, 0x715e, 0xb7fb, 0x8c45, 0x7f99, 0x6077, 0x8768, 0x8bed, 0x6308, 0x70c2, 0x72cf, 0xb400, 0x7731, 0x7b42, 0x76eb, 0x7f80, 0x899d, 0x68f0, 0x7aec, 0x7948, 0xa766, 0x6cf7, 0x9a9c, 0x848c, 0x8f6a, 0x8f23, 0x64ce, 0x9288, 0x6d6e, 0x779b, 0x6d4b, 0x986d, 0x81ce, 0x9b3c,
0x8ee0, 0x64bb, 0x8cda, 0x5922, 0x6a11, 0x596b, 0x9142, 0x86e6, 0x9107, 0x95c2, 0x7b8a, 0x9113, 0x73df, 0x6fc0, 0x4482, 0x5aef, 0xddf4, 0x43b3, 0x39a5, 0xffff, 0x43db, 0x4dc9, 0xe663, 0x50eb, 0x5bea, 0xd0a1, 0x5395, 0x42ce, 0xeb37, 0x5f02, 0x54b9, 0xc84f, 0x4b78, 0x697c, 0xc693, 0x5686, 0x4e78, 0xdd55, 0x53c2, 0x6351, 0xc0fe, 0x8eb1, 0x817c, 0x7590, 0x7a66, 0x7168, 0x74f3, 0x7d86, 0x6f2d, 0x8b15, 0x7f21, 0x80a5, 0x6c26, 0x7561, 0x7661, 0x726d, 0x8272, 0x7d32, 0x87e9, 0x90a0, 0x85e5, 0x7229, 0x7ff5, 0x7c3c,
0x7095, 0x83f7, 0x7424, 0x7eac, 0x81b8, 0x7245, 0xa0b1, 0x777e, 0x73e2, 0x74b5, 0x7f83, 0x73c2, 0x68b1, 0x85b2, 0x715e, 0x957b, 0x83d2, 0x7c75, 0x71d2, 0x8525, 0x830d, 0x6fc2, 0x76f8, 0x7454, 0x8f1f, 0x7cbb, 0x7867, 0x714e, 0x82bb, 0x80af, 0x705a, 0x4ef2, 0x492d, 0x487b, 0x5ed4, 0x5c4a, 0x60f8, 0x9158, 0x8a70, 0x90a5, 0x6cdd, 0x7c1d, 0x78a6, 0x71fe, 0x6fae, 0x680d, 0x59e7, 0x4e69, 0x6926, 0xafcb, 0xbffc, 0xbaa5, 0xb21c, 0xbaa3, 0xa6f3, 0x98f3, 0x9715, 0x96ff, 0x823e, 0x80ce, 0x77d4, 0x80c3, 0x74d0, 0x6a80,
0x8556, 0x6202, 0x7250, 0x860a, 0x8417, 0x8168, 0x892b, 0x7612, 0x6c7b, 0x8c31, 0x6669, 0x7b0f, 0x7f76, 0x835f, 0x7188, 0x842f, 0x7e1c, 0x7227, 0x7ef1, 0x678d, 0x7b64, 0x4bbd, 0x37fa, 0x4cf3, 0xa1cf, 0x819b, 0x699b, 0xc2c3, 0xc53e, 0x94da, 0x5049, 0x354e, 0x553e, 0xa78b, 0x8ccc, 0x647e, 0xba65, 0xbd12, 0x8b34, 0x4b5b, 0x35b1, 0x4562, 0xa49e, 0x8aec, 0x703c, 0xbb96, 0xc214, 0xa3f5};
```