qat engine segfault issue in multi-thread. #110

dhkim88 · 2019-06-20T05:42:54Z

HI
I was check the maximum performance about QAT-Engine using IXIA in multi-thread env. And I found a problem that caused segfault. So, debugging and found the source code of the problem.

gdb trace
#0 0x00007f3fe71fed20 in ASYNC_get_wait_ctx () from /secui/lib/QAT/libcrypto.so.1.1
#1 0x00007f3fd8aa042d in qat_wake_job () from /secui/lib/QAT/lib/engines-1.1/qat.so
#2 0x00007f3fd8aa489c in qat_chained_callbackFn () from /secui/lib/QAT/lib/engines-1.1/qat.so
#3 0x00007f3fd8800485 in LacSymCb_ProcessCallback () from /secui/lib/QAT/libqat_s.so
#4 0x00007f3fd8816532 in adf_user_notify_msgs_poll () from /secui/lib/QAT/libqat_s.so
#5 0x00007f3fd88126ac in adf_pollRing () from /secui/lib/QAT/libqat_s.so
#6 0x00007f3fd88129da in icp_adf_pollInstance () from /secui/lib/QAT/libqat_s.so
#7 0x00007f3fd880d436 in icp_sal_CyPollInstance () from /secui/lib/QAT/libqat_s.so
#8 0x00007f3fd8aa08e9 in timer_poll_func () from /secui/lib/QAT/lib/engines-1.1/qat.so
#9 0x00007f3fe22ee50b in start_thread () from /lib64/libpthread.so.0
#10 0x00007f3fe0b0038f in clone () from /lib64/libc.so.6

The below code is in the qat_chained_callbackFn function.

opdone->opDone.flag = 1;
if (opdone->opDone.job) {                  <<--- Check async mode
    qat_wake_job(opdone->opDone.job, 0);
}

I don't use asynchronous mode, but it seems to be entering asynchronous paths.

This seems to be a synchronization problem between the polling thread and the packet handling thread in the QAT engine. The opdone variable used in the polling thread is a local variable declared in the packet-handling thread. When the callback function is called, If the opdone-> opDone.flag = 1 flag is set and interrupted then opdone address can be released in the packet handle thread. This problem may also occur in async mode.

I did some testing and added the following defensive code to confirm that there was no problem.

in qat_chained_callbackFn function

opdone->opDone.flag = 1;
if (opdone->opDone.job) {
    qat_wake_job(opdone->opDone.job, 0);
}
opdone->callback_done = 1;   <----- Add

in qat_chained_ciphers_do_cipher function

    qctx->npipes_last_used = qctx->numpipes > qctx->npipes_last_used
        ? qctx->numpipes : qctx->npipes_last_used;
}
while (!done.callback_done)      <------ Add
    pthread_yield();

return outlen;

}

please check this

Thank you.

The text was updated successfully, but these errors were encountered:

paulturx · 2019-06-21T13:25:33Z

Hi dhkim88
Thanks very much for your input. I will test out your proposal and get back to you presently.
Regards,
paulturx

dhkim88 · 2019-07-01T02:27:19Z

Hi paulturx
Do you have any updates that you can share?
Thank you.

stevelinsell · 2019-07-01T10:09:14Z

Hi @dhkim88

I have been thinking about the issue you have raised.
I agree it certainly is an issue in synchronous mode as you have described and demonstrated.
In asynchronous mode I don't believe it will cause a problem.
This is because of the way the calls are sequenced. When running asynchronously the caller is blocked not just by checking the opdone->opDone.flag but the fact the userspace thread needs to be switched to to run (by qat_wake_job being called and the caller looking for the eventfd being set). This means it will not get switched to run until qat_wake_job has already used the value of opdone->opDone.job so there should be no possibility for the opDone structure to get cleaned up before it has been finished with in the polling thread.

Interestingly if you look in the file qat_callback.c you'll see the callback that is used for the asymmetric crypto callbacks (qat_crypto_callbackFn()). This has been written in the following way:

    if (opDone->job) {
        opDone->flag = 1;
        qat_wake_job(opDone->job, ASYNC_STATUS_OK);
    } else {
        opDone->flag = 1;
    }

It is purposely written this way so that opDone->job is checked first before the flag is set. This means in the synchronous use case a race condition does not happen and in the asynchronous use case a race condition cannot happen anyway, so it does not matter that job is used in the qat_wake_job call.
My proposal would be to fix qat_chained_callbackFn() in the same way by checking the job first, so that a race condition is not possible.

Would this be acceptable to you?

Kind Regards,

Steve.

dhkim88 · 2019-07-02T02:00:12Z

Hi Steve

Thanks for your update.
I've already tested your suggestion and confirmed that there is no problem in sync mode. But testing in the proposed way will cause problems in async mode. And I'm sorry. The synchronization issue I talked about in async mode is another synchronization issue. I actually confirmed that the problem occurred in async mode. The following describes synchronization issues in asynchronous mode.

Packet Handling thread

do {
    if (done.opDone.job != NULL) {
        if ((job_ret = qat_pause_job(done.opDone.job, 0)) == 0)
            pthread_yield();
    } else {
        pthread_yield();
    }
} while (!done.opDone.flag ||
         QAT_CHK_JOB_RESUMED_UNEXPECTEDLY(job_ret));
....
qat_cleanup_op_done_pipe(&done);

Polling thread

if (opdone->opDone.job) {
    opdone->opDone.flag = 1;
    qat_wake_job(opdone->opDone.job, 0);
} else {
    opdone->opDone.flag = 1;
}

Please just follow the number sequence.

Packet handling thread side.
1). If call the qat_pause_job and qat_pause_job fails, call pthread_yield.
4). Wake up and call the qat_cleanup_op_done_pipe function and set opDone-> job = NULL.

Polling thread side.
2). Check opdone->opDone.job and entering async paths.
3). If opdone-> opDone.flag is set and an interrupt or scheduling occurs.
5). The qat_wake_job function is handling the null pointer value of opdone-> opDone.job. And segfault occurs.

There seem to be some similar synchronization issues in asynchronous mode.

Thank you.

stevelinsell · 2019-07-02T13:53:28Z

Hi @dhkim88

I agree in the failure case you mention the asynchronous path effectively becomes like the synchronous path and is spinning waiting for the flag to get set. In that case the race condition could occur.
In reality if you are failing to pause the job you have bigger issues, Really if you can't pause the job it is a terminal error but because allocated memory has passed ownership from the engine to the QAT driver/hardware on submission of the request we do not want to free up the memory until a response has come back from the hardware/driver or we might get a later crash. That is why we sit and spin in a synchronous fashion to try and still get a response. I would not expect this path to get exercised during normal operation. That is not to say we shouldn't eliminate the race condition.

The simplest fix for the issue would be to refactor the callback using a local variable to effectively cache the value of job. As such we will still allow the opdone structure to be cleaned up before the callback has finished if the thread is interrupted but it would have no bad effects in the callback as nothing would be using it.

An example of the fix would be:

static void qat_chained_callbackFn(void *callbackTag, CpaStatus status,
                                   const CpaCySymOp operationType,
                                   void *pOpData, CpaBufferList *pDstBuffer,
                                   CpaBoolean verifyResult)
{
    ASYNC_JOB *job = NULL;
    op_done_pipe_t *opdone = (op_done_pipe_t *)callbackTag;
    CpaBoolean res = CPA_FALSE;

    ....

    /* Cache job pointer to avoid a race condition if opdone gets cleaned up
     * in the calling thread.
     */
    job = opdone->opDone.job;

    /* Mark job as done when all the requests have been submitted and
     * subsequently processed.
     */
    opdone->opDone.flag = 1;
    if (job) {
        qat_wake_job(job, ASYNC_STATUS_OK);
    }
}

It would also need to be fixed in the same way in the qat_crypto_callbackFn().
Would this meet your needs for both the synchronous and asynchronous use cases?

Kind Regards,

Steve.

dhkim88 · 2019-07-04T07:22:45Z

Hi steve.

I agree with your suggestion. Thank you for your consideration.
I'll try some test and close this case if I don't have any problems.

And I have another question. Can I get performance comparison reports for multi-threaded and multi-process? Can you tell me roughly?

Thank you.

dhkim88 changed the title ~~qat engine sagfault issue in multi-thread.~~ qat engine segfault issue in multi-thread. Jun 25, 2019

dhkim88 closed this as completed Jul 31, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

qat engine segfault issue in multi-thread. #110

qat engine segfault issue in multi-thread. #110

dhkim88 commented Jun 20, 2019

paulturx commented Jun 21, 2019

dhkim88 commented Jul 1, 2019

stevelinsell commented Jul 1, 2019

dhkim88 commented Jul 2, 2019 •

edited

Loading

stevelinsell commented Jul 2, 2019

dhkim88 commented Jul 4, 2019

qat engine segfault issue in multi-thread. #110

qat engine segfault issue in multi-thread. #110

Comments

dhkim88 commented Jun 20, 2019

paulturx commented Jun 21, 2019

dhkim88 commented Jul 1, 2019

stevelinsell commented Jul 1, 2019

dhkim88 commented Jul 2, 2019 • edited Loading

stevelinsell commented Jul 2, 2019

dhkim88 commented Jul 4, 2019

dhkim88 commented Jul 2, 2019 •

edited

Loading