support data compression & delta-encoding of posting lists #297

Merged: 28 commits, Jun 9, 2022

suiguoxin
Member

@suiguoxin suiguoxin commented May 31, 2022

New Features in This PR

  • Delta-encoding
  • Data Compression/Decompression with zstd
    • with or without a shared dictionary
  • Rearrange vid/vector in the posting list:
    • vid0, vector0, vid1, vector1... -> vector0, vector1..., vid0, vid1...

All of these features are disabled by default.
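
To make the layout change and the delta step concrete, here is a minimal Python sketch. It is not the code from this PR. The 4-byte little-endian vid, the one-byte vector components, and the choice of a per-posting-list reference vector (`head`) for the deltas are all assumptions made for illustration; the helper names are hypothetical.

```python
import struct

VID_FORMAT = "<i"  # assumption: 4-byte little-endian vector ID
VID_SIZE = struct.calcsize(VID_FORMAT)


def pack_interleaved(vids, vectors):
    """Original layout: vid0, vector0, vid1, vector1, ..."""
    out = bytearray()
    for vid, vec in zip(vids, vectors):
        out += struct.pack(VID_FORMAT, vid)
        out += bytes(vec)
    return bytes(out)


def pack_rearranged(vids, vectors):
    """Rearranged layout: vector0, vector1, ..., vid0, vid1, ..."""
    out = bytearray()
    for vec in vectors:
        out += bytes(vec)
    for vid in vids:
        out += struct.pack(VID_FORMAT, vid)
    return bytes(out)


def unpack_rearranged(buf, count, dim):
    """Read `count` vectors of `dim` bytes, then `count` vids."""
    vectors = [list(buf[i * dim:(i + 1) * dim]) for i in range(count)]
    base = count * dim
    vids = [struct.unpack_from(VID_FORMAT, buf, base + i * VID_SIZE)[0]
            for i in range(count)]
    return vids, vectors


def delta_encode(vec, reference):
    """Assumed scheme: store each component as (value - reference) mod 256."""
    return [(v - r) % 256 for v, r in zip(vec, reference)]


def delta_decode(delta, reference):
    """Inverse of delta_encode under the same wrap-around arithmetic."""
    return [(d + r) % 256 for d, r in zip(delta, reference)]


vids = [7, 42, 1000]
vectors = [[10, 11, 12, 13], [10, 12, 12, 14], [9, 11, 13, 13]]
head = [10, 11, 12, 13]  # hypothetical reference vector for this posting list

deltas = [delta_encode(v, head) for v in vectors]
buf = pack_rearranged(vids, deltas)
out_vids, out_deltas = unpack_rearranged(buf, len(vids), len(head))
restored = [delta_decode(d, head) for d in out_deltas]
```

Grouping the vectors together and storing small deltas tends to produce longer runs of similar bytes, which is the kind of input a general-purpose compressor handles well.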

Config format:

[BuildSSDIndex]
EnableDeltaEncoding=true
EnablePostingListRearrange=true
EnableDataCompression=true
EnableDictTraining=true
MinDictTrainingBufferSize=1024000
DictBufferCapacity=10240
ZstdCompressLevel=19
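
For readers who want to inspect such a section from a script, the following sketch parses the fragment above with Python's standard `configparser`. SPTAG reads this file with its own parser; the Python code only illustrates the option types, and the `False` fallbacks reflect the statement that the features are disabled by default.

```python
import configparser

CONFIG_TEXT = """
[BuildSSDIndex]
EnableDeltaEncoding=true
EnablePostingListRearrange=true
EnableDataCompression=true
EnableDictTraining=true
MinDictTrainingBufferSize=1024000
DictBufferCapacity=10240
ZstdCompressLevel=19
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # keep the option names case-sensitive
parser.read_string(CONFIG_TEXT)
section = parser["BuildSSDIndex"]

BOOL_KEYS = [
    "EnableDeltaEncoding",
    "EnablePostingListRearrange",
    "EnableDataCompression",
    "EnableDictTraining",
]
INT_KEYS = ["MinDictTrainingBufferSize", "DictBufferCapacity", "ZstdCompressLevel"]

settings = {}
for key in BOOL_KEYS:
    # A missing flag is treated as disabled, matching the stated default.
    settings[key] = section.getboolean(key, fallback=False)
for key in INT_KEYS:
    settings[key] = section.getint(key)
```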

Evaluation

tail_30M

  • No shared dictionary, no delta-encoding, CompressLevel=0: compression ratio 0.7326
  • MinDictTrainingBufferSize=1024000, DictBufferCapacity=1024, CompressLevel=19: compression ratio 0.7235
  • MinDictTrainingBufferSize=102400, DictBufferCapacity unknown, CompressLevel=19: compression ratio 0.7250
  • MinDictTrainingBufferSize=1024000, DictBufferCapacity=10240, CompressLevel=19: compression ratio 0.7254
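
The sketch below shows how a compression ratio with and without a shared dictionary can be measured. It deliberately swaps in Python's standard `zlib` with a preset dictionary (`zdict`) as a stand-in for zstd dictionary training, so the numbers it produces are not comparable to the ones above. The ratio is taken as compressed size divided by original size, which is an assumption consistent with the uncompressed baseline being reported as 1. The sample data is made up.

```python
import zlib


def compress(payload, level=9, zdict=None):
    """Compress with zlib, optionally primed with a shared dictionary."""
    if zdict:
        comp = zlib.compressobj(level, zdict=zdict)
    else:
        comp = zlib.compressobj(level)
    return comp.compress(payload) + comp.flush()


def decompress(blob, zdict=None):
    """Decompress; the same dictionary must be supplied if one was used."""
    if zdict:
        dec = zlib.decompressobj(zdict=zdict)
    else:
        dec = zlib.decompressobj()
    return dec.decompress(blob) + dec.flush()


def compression_ratio(original, compressed):
    """Compressed size over original size (lower is better)."""
    return len(compressed) / len(original)


# Made-up stand-in for a dumped posting list: a repeating 64-byte pattern.
pattern = bytes(range(64))
posting_list = pattern * 8
shared_dict = pattern  # stand-in for a trained, shared dictionary

plain_blob = compress(posting_list)
dict_blob = compress(posting_list, zdict=shared_dict)

ratio_plain = compression_ratio(posting_list, plain_blob)
ratio_dict = compression_ratio(posting_list, dict_blob)
```

A shared dictionary mostly helps when individual posting lists are small and resemble each other; on large inputs the compressor learns the same patterns on its own, which may be why the ratios above differ only in the third decimal place.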

precision_30M

| EnableDeltaEncoding | EnableDataCompression | Compression Ratio | Avg Search Latency | Latency Regression |
|---|---|---|---|---|
| False | False | 1 | 1.953 | 0 |
| False | True | 0.7437 | 2.107 | ~8% |
| True | True | 0.7314 | 2.438 | ~25% |
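
The regression percentages follow from the average latencies relative to the uncompressed baseline. A short check, using only the numbers reported above (the labels and helper name are illustrative):

```python
BASELINE_LATENCY = 1.953  # no delta-encoding, no compression

avg_latency = {
    "compression only": 2.107,
    "delta-encoding + compression": 2.438,
}


def regression_pct(baseline, measured):
    """Relative latency increase over the baseline, in percent."""
    return (measured - baseline) / baseline * 100.0


regressions = {
    name: round(regression_pct(BASELINE_LATENCY, value), 1)
    for name, value in avg_latency.items()
}

for name, pct in regressions.items():
    print(f"{name}: +{pct}%")
```

This gives roughly 7.9% and 24.8%, matching the rounded ~8% and ~25% entries.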

Key Observations

  • Search latency regresses with delta-encoding and data compression both enabled:
    • avg: 1.953 -> 2.438 (~25%)
  • Fewer disk page accesses:
    • avg: 63.368 -> 52.865

Notes

  • Config used: MinDictTrainingBufferSize=1024000, DictBufferCapacity=10240, CompressLevel=19
  • gzip compression ratio: 0.7401 (measured on dumped posting lists, with delta-encoding)

Evaluation Details on Precision_30M

  • Without delta-encoding or data compression
Head Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 1.751       1.621   2.132   2.290   3.519   3.981   4.284
[1]
Ex Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 0.202       0.196   0.235   0.249   0.318   0.394   0.453
[1]
Total Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 1.953       1.815   2.359   2.519   3.843   4.291   4.603
[1]
Total Disk Page Access Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 63.368        63      74      78      83      89      92
[1]
Total Disk IO Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 32.000        32      32      32      32      32      32

  • Without delta-encoding, with data compression

BuildIndex: Total used time: 121.88 minutes (about 2.03 hours)

Head Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 1.720       1.599   2.129   2.224   3.381   3.740   4.138
[1]
Ex Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 0.387       0.382   0.454   0.478   0.555   0.794   1.925
[1]
Total Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 2.107       1.983   2.546   2.652   3.972   4.350   4.690
[1]
Total Disk Page Access Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 53.318        53      63      66      72      81      87
[1]
Total Disk IO Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 32.000        32      32      32      32      32      32

  • With delta-encoding and data compression

BuildIndex: Total used time: 121.60 minutes (about 2.03 hours)

Head Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 1.706       1.607   2.073   2.180   2.407   3.611   4.305
[1]
Ex Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 0.732       0.724   0.878   0.925   1.029   1.323   3.567
[1]
Total Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 2.438       2.352   2.846   2.973   3.272   4.693   6.239
[1]
Total Disk Page Access Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 52.847        52      63      66      72      80      85
[1]
Total Disk IO Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles       Max
[1] 32.000        32      32      32      32      32      32
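
The logs above allow a quick breakdown of where the extra latency comes from. The sketch below uses only the Avg columns from the three runs and checks that Head plus Ex equals Total in each. Reading "Ex" as the stage that loads and decodes posting lists is an assumption; the run labels are illustrative.

```python
# Avg values copied from the three runs above:
# (head latency, ex latency, total latency, disk page accesses)
runs = {
    "baseline": (1.751, 0.202, 1.953, 63.368),
    "compression": (1.720, 0.387, 2.107, 53.318),
    "delta+compression": (1.706, 0.732, 2.438, 52.847),
}

base_head, base_ex, base_total, base_pages = runs["baseline"]

breakdown = {}
for name, (head, ex, total, pages) in runs.items():
    # In every run the total is the sum of the two stages.
    assert abs((head + ex) - total) < 1e-6
    breakdown[name] = {
        "head_delta": round(head - base_head, 3),
        "ex_delta": round(ex - base_ex, 3),
        "total_delta": round(total - base_total, 3),
        "page_access_change_pct": round((pages - base_pages) / base_pages * 100, 1),
    }

for name, row in breakdown.items():
    print(name, row)
```

Head latency is slightly lower in both compressed runs, so the whole regression sits in the Ex stage, while average disk page accesses drop by roughly 16 to 17 percent.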

@suiguoxin changed the title from "support data compression & delta-encoding of posting list" to "support data compression & delta-encoding of posting lists" on Jun 2, 2022
@PhilipBAdams
Contributor

@MaggieQi, for the CI to build with Guoxin's PR, we need to turn on 'Checkout submodules' option in the 'Get Sources' step of the SPTAG-GITHUB pipeline. I don't have access to do it - can you take a look?

PhilipBAdams previously approved these changes Jun 8, 2022
@PhilipBAdams
Contributor

Please also change SPTAG.nuspec to include your new files, it is what we use to generate the nuget package

@suiguoxin
Member Author

suiguoxin commented Jun 9, 2022

> Please also change SPTAG.nuspec to include your new files, it is what we use to generate the nuget package

Added.

@PhilipBAdams PhilipBAdams merged commit f0579d4 into microsoft:main Jun 9, 2022
@suiguoxin suiguoxin deleted the fb-data-compress branch June 9, 2022 23:44