Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JVector recall regression vs HNSW #33

Closed
jbellis opened this issue Aug 31, 2023 · 3 comments
Closed

JVector recall regression vs HNSW #33

jbellis opened this issue Aug 31, 2023 · 3 comments
Assignees

Comments

@jbellis
Copy link
Owner

jbellis commented Aug 31, 2023

HNSW (concurrent5 branch + hnswrecall)

hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
HNSW   M=16 ef=100: top 100/1 recall 0.7170, build 37.37s, query 11.68s. 209069030 nodes visited

hdf5/glove-100-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 100
HNSW   M=16 ef=100: top 100/1 recall 0.6954, build 81.76s, query 8.32s. 241306820 nodes visited

hdf5/glove-200-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 200
HNSW   M=16 ef=100: top 100/1 recall 0.6329, build 142.23s, query 12.33s. 251840840 nodes visited

JVector

hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
Index   M=16 ef=100: top 100/1 recall 0.6393, build 45.96s, query 10.84s. 165252200 nodes visited

hdf5/glove-100-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 100
Index   M=16 ef=100: top 100/1 recall 0.5328, build 133.32s, query 9.93s. 229974700 nodes visited

hdf5/glove-200-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 200
Index   M=16 ef=100: top 100/1 recall 0.3820, build 210.34s, query 11.57s. 183856980 nodes visited
@jbellis jbellis self-assigned this Aug 31, 2023
@jbellis
Copy link
Owner Author

jbellis commented Aug 31, 2023

At least as far back as 0401011 when we first merged Bench into JVector

@jbellis
Copy link
Owner Author

jbellis commented Sep 1, 2023

It looks like the problem is with enforceMaxConnLimit. I took a shortcut and just applied the max-alpha diversity check, which means that often all the edges in the list are fine and we just end up removing the farthest-away edge.

If instead we apply increase the diversity incrementally, like we do in insertDiverse, then we do a better job of prioritizing removal of nodes that are okay wrt max-alpha, but not with alpha=1.0.

hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
Average degree is 31.9973529909132
Index   M=16 ef=100: top 100/1 recall 0.7327, build 38.62s, query 11.71s. 219005090 nodes visited

hdf5/glove-100-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 100
Average degree is 31.99960879212244
Index   M=16 ef=100: top 100/1 recall 0.7116, build 128.20s, query 8.44s. 262719410 nodes visited

hdf5/glove-200-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 200
Average degree is 31.989657072075193
Index   M=16 ef=100: top 100/1 recall 0.6451, build 208.60s, query 13.42s. 281149900 nodes visited

@jbellis
Copy link
Owner Author

jbellis commented Sep 1, 2023

Also tested neighborOverflow. 1.2 is the sweet spot for speed. Recall keeps creeping up past that though. Could be interesting to allow significantly larger M overall during construction (and then trim it back when complete), not just a small amount of overflow. Here's the raw data:

hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
Index   M=16 ef=100 ov=1.0: top 100/1 recall 0.7248, build 41.33s, query 10.30s. 216100630 nodes visited
Index   M=16 ef=100 ov=1.1: top 100/1 recall 0.7253, build 34.46s, query 10.22s. 216682080 nodes visited
Index   M=16 ef=100 ov=1.2: top 100/1 recall 0.7273, build 34.09s, query 10.19s. 217711860 nodes visited
Index   M=16 ef=100 ov=1.3: top 100/1 recall 0.7282, build 34.87s, query 10.00s. 217189300 nodes visited
Index   M=16 ef=100 ov=1.4: top 100/1 recall 0.7300, build 34.40s, query 10.06s. 217574420 nodes visited
Index   M=16 ef=100 ov=1.5: top 100/1 recall 0.7310, build 35.67s, query 9.92s. 216313450 nodes visited
Index   M=16 ef=100 ov=1.6: top 100/1 recall 0.7318, build 36.42s, query 9.95s. 217324750 nodes visited
Index   M=16 ef=100 ov=1.7: top 100/1 recall 0.7324, build 37.23s, query 10.00s. 216880550 nodes visited
Index   M=16 ef=100 ov=1.8: top 100/1 recall 0.7331, build 38.25s, query 10.10s. 218359990 nodes visited
Index   M=16 ef=100 ov=1.9: top 100/1 recall 0.7349, build 39.12s, query 9.98s. 217950400 nodes visited
Index   M=16 ef=100 ov=2.0: top 100/1 recall 0.7361, build 39.77s, query 10.15s. 219628900 nodes visited

hdf5/glove-100-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 100
Index   M=16 ef=100 ov=1.0: top 100/1 recall 0.7083, build 129.28s, query 7.96s. 262328150 nodes visited
Index   M=16 ef=100 ov=1.1: top 100/1 recall 0.7099, build 112.93s, query 7.89s. 259868660 nodes visited
Index   M=16 ef=100 ov=1.2: top 100/1 recall 0.7107, build 111.35s, query 7.95s. 262270020 nodes visited
Index   M=16 ef=100 ov=1.3: top 100/1 recall 0.7115, build 114.18s, query 7.86s. 262288160 nodes visited
Index   M=16 ef=100 ov=1.4: top 100/1 recall 0.7115, build 120.58s, query 8.23s. 262504090 nodes visited
Index   M=16 ef=100 ov=1.5: top 100/1 recall 0.7112, build 121.68s, query 8.32s. 264554650 nodes visited
Index   M=16 ef=100 ov=1.6: top 100/1 recall 0.7124, build 125.39s, query 7.89s. 261228060 nodes visited
Index   M=16 ef=100 ov=1.7: top 100/1 recall 0.7130, build 120.96s, query 7.87s. 263518890 nodes visited
Index   M=16 ef=100 ov=1.8: top 100/1 recall 0.7112, build 123.28s, query 7.92s. 266836720 nodes visited
Index   M=16 ef=100 ov=1.9: top 100/1 recall 0.7137, build 128.29s, query 7.86s. 263501200 nodes visited
Index   M=16 ef=100 ov=2.0: top 100/1 recall 0.7148, build 127.33s, query 7.93s. 263909890 nodes visited

hdf5/glove-200-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 200
Index   M=16 ef=100 ov=1.0: top 100/1 recall 0.6385, build 214.84s, query 12.92s. 273614200 nodes visited
Index   M=16 ef=100 ov=1.1: top 100/1 recall 0.6390, build 191.94s, query 12.99s. 276782920 nodes visited
Index   M=16 ef=100 ov=1.2: top 100/1 recall 0.6376, build 192.05s, query 13.21s. 278794360 nodes visited
Index   M=16 ef=100 ov=1.3: top 100/1 recall 0.6392, build 194.30s, query 13.36s. 283579970 nodes visited
Index   M=16 ef=100 ov=1.4: top 100/1 recall 0.6422, build 199.53s, query 13.16s. 278914420 nodes visited
Index   M=16 ef=100 ov=1.5: top 100/1 recall 0.6445, build 204.94s, query 13.13s. 278050930 nodes visited
Index   M=16 ef=100 ov=1.6: top 100/1 recall 0.6445, build 207.79s, query 13.21s. 277796700 nodes visited
Index   M=16 ef=100 ov=1.7: top 100/1 recall 0.6422, build 209.80s, query 13.06s. 277976660 nodes visited
Index   M=16 ef=100 ov=1.8: top 100/1 recall 0.6435, build 216.14s, query 13.20s. 280327590 nodes visited
Index   M=16 ef=100 ov=1.9: top 100/1 recall 0.6435, build 219.54s, query 13.03s. 277604230 nodes visited
Index   M=16 ef=100 ov=2.0: top 100/1 recall 0.6458, build 225.36s, query 13.19s. 279667700 nodes visited

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant