
QAT support for AES-GCM #7282

Merged: merged 1 commit into openzfs:master on Mar 9, 2018

Conversation

@tcaputi (Contributor) commented Mar 7, 2018

This patch adds support for acceleration of AES-GCM encryption
with Intel Quick Assist Technology.

Signed-off-by: Chengfeix Zhu chengfeix.zhu@intel.com
Signed-off-by: Weigang Li weigang.li@intel.com
Signed-off-by: Tom Caputi tcaputi@datto.com

Motivation and Context

This code allows ZFS to utilize QAT for the purpose of accelerating encryption. Unfortunately, QAT only supports AES-CCM with 128-bit keys, so the acceleration applies only to AES-GCM.

How Has This Been Tested?

Tested on a system with a QAT DH895x PCIe card installed and with the card disabled to ensure that the implementations with and without the card are compatible. I will have performance numbers soon.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@behlendorf (Contributor)

@wli5 would you mind reviewing this?

@codecov (bot) commented Mar 8, 2018

Codecov Report

Merging #7282 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7282      +/-   ##
==========================================
+ Coverage   76.53%   76.56%   +0.03%     
==========================================
  Files         328      327       -1     
  Lines      104006   103874     -132     
==========================================
- Hits        79600    79536      -64     
+ Misses      24406    24338      -68
Flag Coverage Δ
#kernel 76.27% <100%> (-0.09%) ⬇️
#user 65.87% <100%> (-0.05%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8e5d148...ea1644b.

@wli5 (Contributor) commented Mar 8, 2018

@behlendorf Yes, I will do it, thanks!

@wli5 (Contributor) commented Mar 8, 2018

@tcaputi sorry for the confusion, I just checked the doc and you are right, CCM only supports 128-bit keys:
Cipher key length in bytes. For AES it can be 128 bits (16 bytes), 192 bits (24 bytes) or 256 bits (32 bytes). For the CCM mode of operation, the only supported key length is 128 bits (16 bytes).
https://01.org/sites/default/files/downloads/intelr-quickassist-technology/qacyapiv201public.pdf

@rlaager (Member) commented Mar 8, 2018

This commit seems to mix a lot of what must be general refactoring in with the encryption. Ideally, those should be separate commits, which would make review and troubleshooting (e.g. bisection) easier.

@rlaager (Member) commented Mar 8, 2018

My ability to give a reasonable review on this code is limited, but from a read-through, I don't see anything glaringly wrong.

I see there's a 4kB limit on buffer size. How does this interact with recordsize=1M? Does that get encrypted on the CPU, and is that faster than trying to offload it?

Is the performance break-even point a function of particular hardware (e.g. as CPUs get faster, even 4kB should be run in software and/or as QAT cards get faster, it would make sense to offload 8kB blocks)? If so, should those numbers be conditionalized on the card model? Or should we do something where we benchmark it on module load (which I think the assembly-optimized versions of checksumming and RAIDZ do)? Or should it be user-configurable (e.g. as a module parameter)?

module/zfs/qat.c Outdated

ret = qat_crypt_init();
if (ret != 0)
goto error;

Review comment (Contributor):

I'm just wondering about users who don't want QAT encryption and only have DC instances in the configuration file; in that case qat_init should not return failure. Only when both qat_dc_init and qat_cy_init fail should qat_init return failure. Does that make sense?
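
A minimal sketch of the behavior being proposed, using the qat_dc_init()/qat_cy_init() names from this comment (the actual contents of each init are elided and assumed): qat_init() would only report failure when neither service could be started.

/* Sketch only: report failure only when both services fail to start. */
int
qat_init(void)
{
	int dc_ret = qat_dc_init();	/* compression instances */
	int cy_ret = qat_cy_init();	/* crypto instances */

	/*
	 * Users who only list DC instances in the configuration file
	 * still get compression offload even if crypto init fails.
	 */
	if (dc_ret != 0 && cy_ret != 0)
		return (dc_ret);

	return (0);
}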

@wli5 (Contributor) commented Mar 8, 2018

@rlaager the 4KB buffer size is a DMA restriction. Each offload request can actually be very large, e.g. a full 128KB record, but the 128KB record buffer will be split into 32 pages and sent to the device as one sg-list buffer; this ensures each address is physically contiguous for DMA access.
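
A rough sketch of that splitting, using the QAT CpaBufferList/CpaFlatBuffer types; the function and variable names are illustrative, and the real code also DMA-maps each page rather than passing virtual addresses straight through.

/*
 * Sketch: chop a record buffer into page-sized chunks so every
 * CpaFlatBuffer is physically contiguous, then submit the whole record
 * as one scatter-gather list (a 128KB record becomes 32 entries).
 */
static void
build_sg_list(CpaBufferList *list, CpaFlatBuffer *flat_bufs,
    Cpa8U *data, Cpa32U total_len)
{
	Cpa32U n = 0;

	while (total_len > 0) {
		Cpa32U chunk = (total_len > PAGE_SIZE) ?
		    PAGE_SIZE : total_len;

		flat_bufs[n].pData = data;
		flat_bufs[n].dataLenInBytes = chunk;

		data += chunk;
		total_len -= chunk;
		n++;
	}

	list->pBuffers = flat_bufs;
	list->numBuffers = n;
}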

#define QAT_CRYPT_MAX_INSTANCES 48

#define MAX_PAGE_NUM 1024
#define DIGEST_LENGTH 8

Review comment (Contributor):

Seems these 4 consts are not used.

}
}

static inline struct page *

Review comment (Contributor):

Those 3 inline functions exist twice, once in qat_crypt.c and once in qat_compress.c; can we merge them into qat.h?
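
For reference, a sketch of what one of those shared helpers could look like once it lives only in qat.h; the qat_ prefix follows the later review suggestion and the body is assumed, based on the usual kmalloc/vmalloc address distinction.

/* Sketch: a single shared copy in qat.h instead of one per .c file. */
static inline struct page *
qat_mem_to_page(void *addr)
{
	if (!is_vmalloc_addr(addr))
		return (virt_to_page(addr));

	return (vmalloc_to_page(addr));
}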

status = init_cy_session_ctx(dir, cy_inst_handle, &cy_session_ctx, key,
crypt, aad_len);
if (status != CPA_STATUS_SUCCESS)
goto error;

Review comment (Contributor):

This can return the failure directly (safely, as all memory has already been freed); if it takes the goto error branch, the call to cpaCySymRemoveSession will cause an issue.


Review comment (Contributor):

Yes, return (status) here.

flat_src_buf = flat_src_buf_array;
flat_dst_buf = flat_dst_buf_array;
while (bytes_left > 0) {
in_pages[page_num] = mem_to_page(in);

Review comment (Contributor):

page_num++ was forgotten here; this will cause the pages not to be unmapped.
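
In other words, the increment belongs in the store itself (a one-line sketch using the names from the snippet):

in_pages[page_num++] = mem_to_page(in);	/* count the page so it gets unmapped later */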

@tcaputi (Contributor, Author) commented Mar 8, 2018

I see there's a 4kB limit on buffer size. How does this interact with recordsize=1M? Does that get encrypted on the CPU, and is that faster than trying to offload it?

For clarification, 4kB is the minimum buffer size. The maximum is 128k. As far as how this interacts with recordsize=1M, right now QAT will only accelerate blocks that are compressed below the 128k threshold. I received that upper bound from Intel. @wli5 is there any reason we should limit this functionality to 128k?
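
A sketch of how such a size gate could look; the macro and helper names here are assumptions, only the 4kB/128k thresholds and the zfs_qat_disable tunable come from this PR.

/* Sketch, names assumed: offload only buffers inside the supported window. */
#define	QAT_MIN_BUF_SIZE	(4 * 1024)	/* below this, the CPU is cheaper */
#define	QAT_MAX_BUF_SIZE	(128 * 1024)	/* current software upper bound  */

static inline boolean_t
qat_crypt_use_accel(size_t s_len)
{
	return (!zfs_qat_disable &&
	    s_len >= QAT_MIN_BUF_SIZE && s_len <= QAT_MAX_BUF_SIZE);
}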

@tcaputi (Contributor, Author) commented Mar 8, 2018

@wli5 Thanks for your suggestions; your comments have been addressed. Take a look when you get a chance.

};

static kstat_t *qat_ksp = NULL;
int zfs_qat_disable = 0;

Review comment (Contributor):

You should pull the module_param() for this into qat.c as well.
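
For illustration, a minimal sketch of what moving the tunable into qat.c would look like (the description string is made up):

/* In qat.c: declare the tunable and register it as a module parameter. */
int zfs_qat_disable = 0;

module_param(zfs_qat_disable, int, 0644);
MODULE_PARM_DESC(zfs_qat_disable, "Disable QAT acceleration");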

module/zfs/qat.c Outdated

ret = qat_dc_init();
if (ret != 0)
goto error;

Review comment (Contributor):

nit: please just return (ret) here, and for the qat_crypt_init() failure below call qat_dc_fini() and return. Then we can drop the error label and its contents.
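
Sketched out, the suggested structure would look roughly like this (function names taken from the diff; the remaining init body is elided):

int
qat_init(void)
{
	int ret;

	ret = qat_dc_init();
	if (ret != 0)
		return (ret);

	ret = qat_crypt_init();
	if (ret != 0) {
		qat_dc_fini();
		return (ret);
	}

	/* ... remaining initialization (kstat setup, etc.) ... */
	return (0);
}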

module/zfs/qat.c Outdated
}

qat_dc_fini();
qat_crypt_fini();

Review comment (Contributor):

nit: qat_crypt_fini() before qat_dc_fini().

module/zfs/qat.h Outdated
* Note: when qat fail happens, it doesn't mean a critical hardware
* issue, sometimes it is because the output buffer is not big enough,
* and the compression job will be transfered to gzip software again,
* so the functionality of ZFS is not impacted.

Review comment (Contributor):

The last sentence needs to be updated to refer to the encryption job and not the compression job.

module/zfs/qat.h Outdated
kfree(*pp_mem_addr);
*pp_mem_addr = NULL;
}
}

Review comment (Contributor):

Rather than maintaining mem_alloc_contig() and mem_free_contig() as static inlines, they can be moved to qat.c, converted to normal functions, and renamed qat_mem_alloc_contig() and qat_mem_free_contig(). Let's add the qat_ prefix to mem_to_page as well, but leave it a static inline since this one might actually have an impact on performance.
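
A sketch of the suggested move; the allocation side is assumed (kmalloc-based, matching the kfree visible in the quoted hunk and the physically contiguous requirement), so treat the bodies as illustrative.

/* In qat.c: ordinary functions instead of static inlines in qat.h. */
CpaStatus
qat_mem_alloc_contig(void **pp_mem_addr, Cpa32U size_bytes)
{
	*pp_mem_addr = kmalloc(size_bytes, GFP_KERNEL);
	if (*pp_mem_addr == NULL)
		return (CPA_STATUS_RESOURCE);

	return (CPA_STATUS_SUCCESS);
}

void
qat_mem_free_contig(void **pp_mem_addr)
{
	if (*pp_mem_addr != NULL) {
		kfree(*pp_mem_addr);
		*pp_mem_addr = NULL;
	}
}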

module/zfs/qat.h Outdated
#define PHYS_CONTIG_ALLOC(pp_mem_addr, size_bytes) \
mem_alloc_contig((void *)(pp_mem_addr), (size_bytes))
#define PHYS_CONTIG_FREE(p_mem_addr) \
mem_free_contig((void *)&(p_mem_addr))

Review comment (Contributor):

QAT_PHYS_CONTIG_ALLOC / QAT_PHYS_CONTIG_FREE, to head off any future namespace collision.


for (i = 0; i < num_inst; i++) {
cpaCyStopInstance(cy_inst_handles[i]);
}

Review comment (Contributor):

nit: { } can be dropped, and the Cpa32U i declaration moved into the for initializer.

for (Cpa32U i = 0; i < num_inst; i++)
	cpaCyStopInstance(cy_inst_handles[i]);



return (CPA_STATUS_SUCCESS);

error:

Review comment (Contributor):

Let's model this after qat_compress(), which shares a common cleanup path for success and failure. Lines 351-363 are virtually identical to 370-383.
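
The shape being suggested is roughly the following; this is only a sketch, with variable names taken from context in this PR and the freed fields shown as examples rather than the exact list.

	/* Early failures, before anything is allocated, can still return directly. */
	status = init_cy_session_ctx(dir, cy_inst_handle, &cy_session_ctx,
	    key, crypt, aad_len);
	if (status != CPA_STATUS_SUCCESS)
		return (status);

	/*
	 * ... build the src/dst buffer lists; any failure in here does
	 * "goto fail" instead of duplicating the cleanup below ...
	 */

	status = cpaCySymPerformOp(cy_inst_handle, &complete, &op_data,
	    &src_buffer_list, &dst_buffer_list, NULL);

fail:
	/* One cleanup path shared by success and failure, as in qat_compress(). */
	cpaCySymRemoveSession(cy_inst_handle, cy_session_ctx);
	QAT_PHYS_CONTIG_FREE(src_buffer_list.pPrivateMetaData);
	QAT_PHYS_CONTIG_FREE(dst_buffer_list.pPrivateMetaData);

	return (status);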

This patch adds support for acceleration of AES-GCM encryption
with Intel Quick Assist Technology.

Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
@tcaputi (Contributor, Author) commented Mar 8, 2018

@behlendorf your comments have been addressed. Thank you for the feedback.

@wli5 (Contributor) commented Mar 9, 2018

is there any reason we should limit this functionality to 128k?

@tcaputi for QAT hardware there is no such limitation; we limit it in software for performance reasons. The user can change the 128K to 1M in the code and it should be OK. We could also change it in this PR. One concern is for compression: there is a list of intermediate buffers, each pre-allocated based on the max buffer size. If in most cases the record size is < 1M, allocating so many "big" intermediate buffers is a waste. I don't know how many people use a 1M record size; if you think 1M is common, we can change the value to 1M in this PR.

			status = QAT_PHYS_CONTIG_ALLOC(
			    &buffer_array[i][buff_num]->pBuffers->
			    pData, 2 * QAT_MAX_BUF_SIZE);

@tcaputi (Contributor, Author) commented Mar 9, 2018

These graphs (courtesy of Intel) show the performance improvements granted when using QAT for encryption. I verified these results myself with a spinning disk pool and the results were very similar. As expected, QAT greatly reduces the CPU overhead associated with encryption and provides a marginal bump to throughput (encryption is usually not a bottleneck here).

[image005, image006: performance graphs]

@behlendorf behlendorf merged commit cf63739 into openzfs:master Mar 9, 2018
@adilger (Contributor) commented Mar 26, 2018

for QAT hardware there is no such limitation; we limit it in software for performance reasons. The user can change the 128K to 1M in the code and it should be OK. We could also change it in this PR. One concern is for compression: there is a list of intermediate buffers, each pre-allocated based on the max buffer size. If in most cases the record size is < 1M, allocating so many "big" intermediate buffers is a waste. I don't know how many people use a 1M record size; if you think 1M is common, we can change the value to 1M in this PR.

@wli5 I think that recordsize=1MB will become increasingly common for new datasets, as this improves performance significantly on faster storage devices, and is also the default for Lustre+ZFS data servers. Since recordsize is a dataset parameter, it should be fairly straightforward to determine at runtime whether a 128KB or 1MB buffer is needed for the intermediate compression buffers.

Either the buffers can be allocated based on recordsize, or they could be reallocated at 1MB (or larger, as needed) if larger blocks are compressed/encrypted. It would seem to me that offloading larger blocks to QAT would be more efficient than offloading smaller blocks due to fixed per-block overhead being amortized over more data, so this would be even better for QAT.

@wli5 (Contributor) commented Mar 27, 2018

@adilger thanks for the comments.
Per our discussion with @tcaputi, the record size is dynamic and can change for each operation, while the intermediate buffers are allocated in the QAT initialization phase. Maybe we can think about having different buffer size limits for compression and encryption, since encryption doesn't need intermediate buffers.

@adilger (Contributor) commented Mar 27, 2018

Is it possible to have the QAT intermediate compression buffers be reallocated if a larger request is submitted? At worst, if the larger buffer allocation fails it can return an error and fall back to CPU-based compression algorithm for larger buffers (i.e. what it is doing now), but in most cases the allocation should succeed and allow QAT to compress the larger buffers. That avoids the overhead of allocating too-large buffers if they are not needed, while allowing larger blocks to be compressed in the common cases.

A more complex buffer management system might allow "small" and "large" buffers separately (we do this for Lustre routers), only using "large" buffers if needed, and using "small" buffers for other requests. That would also reduce the amount of RAM pinned by the intermediate buffers to the number of concurrent "large" requests, rather than guessing in advance how many buffers might be needed and forcing all buffers to be the maximum size.
