
Conversation


@aytekinar aytekinar commented Oct 23, 2023

Rationale

Binary distributions of pgvector often need to disable the -march=native flag (e.g., the Debian patch) because, at build time, one cannot know in advance where the binary will eventually run. As a result, the loops inside the vector operations are not auto-vectorized (contrary to the assumption in the code), which leads to mediocre performance wherever binary distributions are used (e.g., on cloud providers).

To alleviate this, I have implemented SSE, AVX, and AVX-512F versions of the vector operations and added a CPU dispatching mechanism (run at extension load time) that picks the most recent instruction set the underlying CPU supports, following the best practices in Chapters 12 and 13 of Agner Fog's Optimizing Software in C++ manual.

Benchmark

I have created a separate simd-playground repository where I host the different SIMD implementations of the relevant vector operations, together with their unit tests and benchmarks.

When doing the benchmarks, I have used the following flags:

  • GCC/Clang (non-native): -Wall -Wpedantic -O2 -DNDEBUG -ftree-vectorize -fassociative-math -fno-trapping-math -fno-math-errno -fno-signed-zeros -funroll-loops
  • GCC/Clang (native): -Wall -Wpedantic -O2 -DNDEBUG -ftree-vectorize -fassociative-math -fno-trapping-math -fno-math-errno -fno-signed-zeros -funroll-loops -march=native
  • MSVC (non-native): /Wall /O2 /fp:fast
  • MSVC (native): /Wall /O2 /fp:fast /arch:AVX512

C Benchmark

Below, I provide the benchmark results on my machine (11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz). The benchmark name has the form BM_<vector_op>/{1,2,3,4}/<vector_len>, where 1 = scalar, 2 = SSE, 3 = AVX, and 4 = AVX-512F. The scalar version is simply a for-loop (which is auto-vectorized in native builds); the rest are hand-written vectorized versions for the respective targets:

Non-native builds
Benchmark Time (MSVC) CPU (MSVC) Iterations (MSVC) Time (GCC) CPU (GCC) Iterations (GCC) Time (Clang) CPU (Clang) Iterations (Clang)
BM_add/1/1024 128 ns 126 ns 5600000 90.6 ns 90.6 ns 7677939 124 ns 124 ns 5583785
BM_add/2/1024 83 ns 83.7 ns 7466667 89.7 ns 89.7 ns 7573045 134 ns 134 ns 5201665
BM_add/3/1024 46.9 ns 47.1 ns 14933333 51.1 ns 51 ns 13665016 52.6 ns 52.6 ns 13217498
BM_add/4/1024 39.7 ns 38.1 ns 17230769 41.8 ns 41.8 ns 17133409 37.6 ns 37.6 ns 18376567
BM_add/1/2048 204 ns 199 ns 3446154 175 ns 175 ns 4002587 243 ns 243 ns 2799188
BM_add/2/2048 220 ns 220 ns 3200000 180 ns 180 ns 3895405 262 ns 262 ns 2545064
BM_add/3/2048 111 ns 112 ns 6400000 98 ns 98 ns 7077248 109 ns 109 ns 6344137
BM_add/4/2048 114 ns 109 ns 5600000 78.9 ns 78.9 ns 8771963 76.1 ns 76.1 ns 9199390
BM_add/1/4096 456 ns 449 ns 1600000 370 ns 370 ns 1897552 504 ns 504 ns 1360571
BM_add/2/4096 516 ns 531 ns 1000000 373 ns 373 ns 1889332 546 ns 546 ns 1204813
BM_add/3/4096 265 ns 261 ns 2635294 209 ns 209 ns 3339430 227 ns 227 ns 2922807
BM_add/4/4096 151 ns 146 ns 4480000 168 ns 168 ns 4184801 162 ns 162 ns 4124729
BM_add/1/8192 1177 ns 1172 ns 560000 1035 ns 1035 ns 669624 1088 ns 1088 ns 627519
BM_add/2/8192 1375 ns 1245 ns 640000 1077 ns 1077 ns 661930 1230 ns 1230 ns 539861
BM_add/3/8192 822 ns 795 ns 746667 737 ns 737 ns 935601 832 ns 832 ns 840245
BM_add/4/8192 983 ns 977 ns 896000 763 ns 763 ns 919993 769 ns 769 ns 838225
BM_add/1/16384 3214 ns 3181 ns 235789 1943 ns 1943 ns 361022 2162 ns 2162 ns 317620
BM_add/2/16384 3164 ns 3069 ns 224000 1972 ns 1972 ns 357117 2467 ns 2467 ns 274078
BM_add/3/16384 1834 ns 1814 ns 344615 1463 ns 1463 ns 459819 1644 ns 1644 ns 418869
BM_add/4/16384 1620 ns 1573 ns 407273 1521 ns 1521 ns 461963 1548 ns 1548 ns 458229
BM_sub/1/1024 147 ns 145 ns 5600000 87.7 ns 87.7 ns 7951808 124 ns 124 ns 5680735
BM_sub/2/1024 139 ns 138 ns 4977778 94.7 ns 94.7 ns 7756491 133 ns 133 ns 5063914
BM_sub/3/1024 60.5 ns 61.4 ns 11200000 51.2 ns 51.2 ns 11758678 52.7 ns 52.7 ns 13341999
BM_sub/4/1024 93.8 ns 92.1 ns 7466667 41.9 ns 41.9 ns 16740797 38.1 ns 38.1 ns 18706934
BM_sub/1/2048 274 ns 279 ns 2635294 175 ns 175 ns 4003315 245 ns 245 ns 2853324
BM_sub/2/2048 290 ns 279 ns 2635294 181 ns 181 ns 3861775 267 ns 267 ns 2625535
BM_sub/3/2048 120 ns 117 ns 5600000 99.5 ns 99.5 ns 7032434 107 ns 107 ns 6480215
BM_sub/4/2048 183 ns 184 ns 4072727 79.6 ns 79.6 ns 8745474 76.1 ns 76.1 ns 8960849
BM_sub/1/4096 462 ns 446 ns 1120000 367 ns 367 ns 1888530 501 ns 501 ns 1382872
BM_sub/2/4096 469 ns 465 ns 1445161 372 ns 372 ns 1874261 545 ns 545 ns 1187354
BM_sub/3/4096 260 ns 257 ns 2488889 210 ns 210 ns 3326323 226 ns 226 ns 3079691
BM_sub/4/4096 158 ns 157 ns 4977778 168 ns 168 ns 4154352 162 ns 162 ns 4366883
BM_sub/1/8192 1151 ns 1161 ns 497778 1044 ns 1044 ns 665623 1071 ns 1071 ns 645419
BM_sub/2/8192 1171 ns 1172 ns 640000 1046 ns 1046 ns 662160 1238 ns 1238 ns 558128
BM_sub/3/8192 780 ns 711 ns 746667 735 ns 735 ns 944414 823 ns 823 ns 789451
BM_sub/4/8192 791 ns 802 ns 896000 756 ns 756 ns 873892 770 ns 770 ns 893439
BM_sub/1/16384 2541 ns 2567 ns 280000 1938 ns 1938 ns 363417 2193 ns 2193 ns 265749
BM_sub/2/16384 2512 ns 2448 ns 248889 2031 ns 2031 ns 332417 2443 ns 2443 ns 287533
BM_sub/3/16384 1702 ns 1709 ns 448000 1475 ns 1475 ns 470820 1664 ns 1664 ns 401319
BM_sub/4/16384 1700 ns 1674 ns 448000 1537 ns 1537 ns 455991 1531 ns 1531 ns 440341
BM_dot_product/1/1024 111 ns 103 ns 6400000 188 ns 188 ns 3698332 102 ns 102 ns 6738661
BM_dot_product/2/1024 210 ns 205 ns 3200000 188 ns 188 ns 3694425 146 ns 146 ns 4747852
BM_dot_product/3/1024 84.2 ns 83.7 ns 8960000 79.8 ns 79.8 ns 8554048 66.2 ns 66.2 ns 10392464
BM_dot_product/4/1024 49.1 ns 48 ns 16592593 38 ns 38 ns 18404189 38.4 ns 38.4 ns 18331705
BM_dot_product/1/2048 217 ns 220 ns 3200000 404 ns 404 ns 1724995 213 ns 213 ns 3272856
BM_dot_product/2/2048 423 ns 424 ns 1659259 404 ns 404 ns 1735376 311 ns 311 ns 2277208
BM_dot_product/3/2048 189 ns 193 ns 3733333 186 ns 186 ns 3700616 147 ns 147 ns 4701893
BM_dot_product/4/2048 96.3 ns 92.1 ns 5600000 79.5 ns 79.5 ns 8760611 80.8 ns 80.8 ns 8602193
BM_dot_product/1/4096 462 ns 465 ns 1544828 836 ns 836 ns 834767 426 ns 426 ns 1636841
BM_dot_product/2/4096 907 ns 879 ns 746667 834 ns 834 ns 818919 635 ns 635 ns 1083863
BM_dot_product/3/4096 421 ns 417 ns 1723077 404 ns 404 ns 1726134 312 ns 312 ns 2163755
BM_dot_product/4/4096 217 ns 222 ns 3733333 183 ns 183 ns 3793057 186 ns 186 ns 3734224
BM_dot_product/1/8192 919 ns 921 ns 746667 1696 ns 1696 ns 411496 881 ns 881 ns 781927
BM_dot_product/2/8192 1796 ns 1758 ns 373333 1706 ns 1706 ns 408267 1276 ns 1276 ns 540025
BM_dot_product/3/8192 846 ns 854 ns 896000 842 ns 842 ns 833770 650 ns 650 ns 1080996
BM_dot_product/4/8192 489 ns 500 ns 1000000 529 ns 529 ns 1301488 598 ns 598 ns 1172999
BM_dot_product/1/16384 1796 ns 1765 ns 407273 3420 ns 3420 ns 204340 1757 ns 1757 ns 389621
BM_dot_product/2/16384 3619 ns 3530 ns 194783 3431 ns 3431 ns 204736 2583 ns 2583 ns 265834
BM_dot_product/3/16384 1831 ns 1883 ns 448000 1699 ns 1699 ns 406700 1295 ns 1295 ns 519284
BM_dot_product/4/16384 1228 ns 1228 ns 560000 1089 ns 1089 ns 629957 1172 ns 1172 ns 591139
BM_cosine_distance/1/1024 173 ns 171 ns 4480000 211 ns 211 ns 3259795 167 ns 167 ns 4177995
BM_cosine_distance/2/1024 224 ns 225 ns 2986667 211 ns 211 ns 3297316 218 ns 218 ns 3262108
BM_cosine_distance/3/1024 123 ns 120 ns 5600000 111 ns 111 ns 6199282 114 ns 114 ns 6073426
BM_cosine_distance/4/1024 81.5 ns 83.7 ns 8960000 59.5 ns 59.5 ns 11877432 65.7 ns 65.7 ns 10613216
BM_cosine_distance/1/2048 359 ns 353 ns 1947826 433 ns 433 ns 1642719 332 ns 332 ns 2118325
BM_cosine_distance/2/2048 450 ns 439 ns 1493333 426 ns 426 ns 1638678 432 ns 432 ns 1606191
BM_cosine_distance/3/2048 239 ns 229 ns 2800000 225 ns 225 ns 3110254 229 ns 229 ns 3046923
BM_cosine_distance/4/2048 140 ns 136 ns 4480000 122 ns 122 ns 5746390 125 ns 125 ns 5432138
BM_cosine_distance/1/4096 706 ns 680 ns 896000 851 ns 851 ns 818869 664 ns 664 ns 1021273
BM_cosine_distance/2/4096 917 ns 900 ns 746667 860 ns 860 ns 812051 865 ns 865 ns 792026
BM_cosine_distance/3/4096 458 ns 454 ns 1445161 447 ns 447 ns 1562938 502 ns 502 ns 1000000
BM_cosine_distance/4/4096 265 ns 267 ns 2635294 241 ns 241 ns 2901303 252 ns 252 ns 2797007
BM_cosine_distance/1/8192 1425 ns 1395 ns 448000 1718 ns 1718 ns 405313 1332 ns 1332 ns 523563
BM_cosine_distance/2/8192 1782 ns 1758 ns 373333 1728 ns 1728 ns 399779 1758 ns 1758 ns 399745
BM_cosine_distance/3/8192 935 ns 935 ns 1120000 913 ns 913 ns 762412 920 ns 920 ns 730501
BM_cosine_distance/4/8192 501 ns 502 ns 1120000 580 ns 580 ns 1208493 631 ns 631 ns 1108676
BM_cosine_distance/1/16384 2694 ns 2773 ns 298667 3440 ns 3440 ns 203346 2673 ns 2673 ns 263766
BM_cosine_distance/2/16384 3731 ns 3575 ns 179200 3466 ns 3466 ns 202783 3508 ns 3508 ns 198097
BM_cosine_distance/3/16384 1946 ns 2009 ns 373333 1839 ns 1839 ns 382622 1837 ns 1837 ns 381716
BM_cosine_distance/4/16384 1399 ns 1395 ns 560000 1240 ns 1240 ns 569119 1267 ns 1267 ns 554124
BM_l1_distance/1/1024 128 ns 125 ns 6400000 199 ns 199 ns 3509361 110 ns 110 ns 6408907
BM_l1_distance/2/1024 213 ns 213 ns 3446154 199 ns 199 ns 3421705 151 ns 151 ns 4542396
BM_l1_distance/3/1024 98.7 ns 100 ns 7466667 90.8 ns 90.8 ns 7434131 114 ns 114 ns 9181592
BM_l1_distance/4/1024 52.8 ns 53 ns 11200000 51.9 ns 51.9 ns 13334502 44.4 ns 44.4 ns 12730004
BM_l1_distance/1/2048 231 ns 220 ns 2986667 415 ns 415 ns 1690429 219 ns 219 ns 2986092
BM_l1_distance/2/2048 435 ns 429 ns 1493333 415 ns 415 ns 1682359 319 ns 319 ns 2170577
BM_l1_distance/3/2048 210 ns 218 ns 3733333 205 ns 204 ns 3376457 159 ns 159 ns 4421009
BM_l1_distance/4/2048 112 ns 113 ns 7466667 117 ns 117 ns 5973547 93.4 ns 93.4 ns 7619189
BM_l1_distance/1/4096 462 ns 471 ns 1493333 847 ns 847 ns 819386 440 ns 440 ns 1610553
BM_l1_distance/2/4096 865 ns 879 ns 640000 845 ns 845 ns 816821 646 ns 646 ns 1070798
BM_l1_distance/3/4096 437 ns 439 ns 1600000 419 ns 419 ns 1673956 318 ns 318 ns 2175732
BM_l1_distance/4/4096 229 ns 225 ns 2986667 257 ns 257 ns 2717743 198 ns 198 ns 3533783
BM_l1_distance/1/8192 935 ns 942 ns 896000 1707 ns 1707 ns 398941 875 ns 875 ns 779632
BM_l1_distance/2/8192 1789 ns 1779 ns 448000 1707 ns 1707 ns 402967 1296 ns 1296 ns 540974
BM_l1_distance/3/8192 864 ns 858 ns 746667 850 ns 850 ns 810368 663 ns 663 ns 1026355
BM_l1_distance/4/8192 484 ns 487 ns 1445161 558 ns 558 ns 1240620 605 ns 605 ns 1152246
BM_l1_distance/1/16384 1829 ns 1800 ns 373333 3426 ns 3426 ns 203558 1780 ns 1780 ns 383330
BM_l1_distance/2/16384 3466 ns 3530 ns 194783 3444 ns 3444 ns 201842 2594 ns 2594 ns 265590
BM_l1_distance/3/16384 1908 ns 1918 ns 407273 1711 ns 1711 ns 407681 1324 ns 1324 ns 529667
BM_l1_distance/4/16384 1321 ns 1318 ns 497778 1141 ns 1141 ns 612057 1232 ns 1232 ns 562264
BM_l1_norm/1/1024 108 ns 105 ns 7466667 182 ns 182 ns 3818439 99.9 ns 99.9 ns 7006081
BM_l1_norm/2/1024 196 ns 195 ns 3446154 181 ns 181 ns 3815030 138 ns 138 ns 5013680
BM_l1_norm/3/1024 79.4 ns 80.2 ns 8960000 70.8 ns 70.8 ns 9654999 55.6 ns 55.6 ns 12349883
BM_l1_norm/4/1024 45.1 ns 45.2 ns 16592593 42.4 ns 42.4 ns 16562034 34.3 ns 34.3 ns 20678210
BM_l1_norm/1/2048 209 ns 209 ns 3446154 397 ns 397 ns 1753457 208 ns 208 ns 3356520
BM_l1_norm/2/2048 420 ns 410 ns 1600000 429 ns 429 ns 1766138 303 ns 303 ns 2308119
BM_l1_norm/3/2048 193 ns 186 ns 3446154 188 ns 188 ns 3391691 141 ns 141 ns 4815792
BM_l1_norm/4/2048 99.4 ns 98.4 ns 7466667 87 ns 87 ns 7966876 69.7 ns 69.7 ns 10107150
BM_l1_norm/1/4096 462 ns 449 ns 1600000 834 ns 834 ns 839654 425 ns 425 ns 1652737
BM_l1_norm/2/4096 892 ns 879 ns 746667 828 ns 828 ns 833251 625 ns 625 ns 1102482
BM_l1_norm/3/4096 416 ns 405 ns 1659259 398 ns 398 ns 1762030 303 ns 303 ns 2307003
BM_l1_norm/4/4096 226 ns 227 ns 3446154 223 ns 223 ns 3121718 166 ns 166 ns 4196180
BM_l1_norm/1/8192 872 ns 854 ns 640000 1672 ns 1672 ns 411089 868 ns 868 ns 795247
BM_l1_norm/2/8192 1791 ns 1803 ns 407273 1659 ns 1659 ns 409753 1272 ns 1272 ns 544230
BM_l1_norm/3/8192 895 ns 879 ns 640000 830 ns 830 ns 842833 631 ns 631 ns 1096417
BM_l1_norm/4/8192 443 ns 439 ns 1493333 499 ns 499 ns 1367345 359 ns 359 ns 1969440
BM_l1_norm/1/16384 1733 ns 1726 ns 407273 3409 ns 3409 ns 204824 1730 ns 1730 ns 394826
BM_l1_norm/2/16384 3450 ns 3449 ns 194783 3422 ns 3422 ns 205418 2586 ns 2586 ns 271846
BM_l1_norm/3/16384 1795 ns 1803 ns 407273 1695 ns 1695 ns 412955 1324 ns 1324 ns 515807
BM_l1_norm/4/16384 930 ns 907 ns 896000 1056 ns 1056 ns 652319 764 ns 764 ns 906384
BM_l2_distance/1/1024 116 ns 115 ns 6400000 197 ns 197 ns 3483591 111 ns 111 ns 6350992
BM_l2_distance/2/1024 205 ns 201 ns 3733333 196 ns 196 ns 3541562 154 ns 154 ns 4587694
BM_l2_distance/3/1024 98.2 ns 98.4 ns 7466667 94.9 ns 94.9 ns 7333137 77.2 ns 77.2 ns 9068567
BM_l2_distance/4/1024 51.7 ns 51.6 ns 11200000 45.7 ns 45.7 ns 15655878 45.2 ns 45.2 ns 15490828
BM_l2_distance/1/2048 230 ns 225 ns 2986667 416 ns 416 ns 1671239 219 ns 219 ns 3171136
BM_l2_distance/2/2048 434 ns 443 ns 1659259 415 ns 415 ns 1683423 315 ns 315 ns 2197536
BM_l2_distance/3/2048 210 ns 205 ns 3200000 204 ns 204 ns 3427226 161 ns 161 ns 4352365
BM_l2_distance/4/2048 104 ns 103 ns 5600000 94 ns 94 ns 7538262 96.7 ns 96.7 ns 7308980
BM_l2_distance/1/4096 445 ns 445 ns 1544828 833 ns 833 ns 812645 439 ns 439 ns 1598507
BM_l2_distance/2/4096 905 ns 907 ns 896000 840 ns 840 ns 830059 646 ns 646 ns 1062386
BM_l2_distance/3/4096 442 ns 443 ns 1659259 426 ns 426 ns 1653865 329 ns 329 ns 2150032
BM_l2_distance/4/4096 237 ns 234 ns 3200000 210 ns 210 ns 3336856 216 ns 216 ns 3267460
BM_l2_distance/1/8192 902 ns 854 ns 640000 1700 ns 1700 ns 409054 892 ns 892 ns 766577
BM_l2_distance/2/8192 1820 ns 1807 ns 320000 1717 ns 1717 ns 409019 1309 ns 1309 ns 545699
BM_l2_distance/3/8192 887 ns 889 ns 896000 845 ns 845 ns 818212 667 ns 667 ns 924030
BM_l2_distance/4/8192 505 ns 516 ns 1000000 556 ns 556 ns 1255504 605 ns 605 ns 1177922
BM_l2_distance/1/16384 1815 ns 1800 ns 373333 3431 ns 3431 ns 204036 1798 ns 1798 ns 392743
BM_l2_distance/2/16384 3640 ns 3606 ns 203636 3379 ns 3379 ns 205775 2573 ns 2573 ns 270791
BM_l2_distance/3/16384 1783 ns 1803 ns 407273 1688 ns 1688 ns 408888 1330 ns 1330 ns 507839
BM_l2_distance/4/16384 1316 ns 1350 ns 497778 1128 ns 1128 ns 617136 1248 ns 1248 ns 564954
BM_l2_norm/1/1024 101 ns 100 ns 7466667 175 ns 175 ns 3958920 96.8 ns 96.8 ns 7244464
BM_l2_norm/2/1024 192 ns 192 ns 4072727 180 ns 180 ns 3920431 137 ns 137 ns 5110201
BM_l2_norm/3/1024 83 ns 82 ns 8960000 69.6 ns 69.6 ns 9813983 57.2 ns 57.2 ns 11693718
BM_l2_norm/4/1024 48.7 ns 48.8 ns 16000000 29.4 ns 29.4 ns 24060919 29.3 ns 29.3 ns 23755105
BM_l2_norm/1/2048 210 ns 199 ns 2986667 392 ns 392 ns 1764290 203 ns 203 ns 3421385
BM_l2_norm/2/2048 405 ns 396 ns 1659259 393 ns 393 ns 1782037 301 ns 301 ns 2305953
BM_l2_norm/3/2048 200 ns 200 ns 3200000 178 ns 178 ns 3927075 138 ns 138 ns 5052382
BM_l2_norm/4/2048 92.7 ns 92.8 ns 6400000 68.8 ns 68.8 ns 10070711 70.2 ns 70.2 ns 9953319
BM_l2_norm/1/4096 458 ns 465 ns 1544828 819 ns 819 ns 840595 422 ns 422 ns 1651462
BM_l2_norm/2/4096 887 ns 879 ns 746667 809 ns 809 ns 850374 628 ns 628 ns 1125505
BM_l2_norm/3/4096 409 ns 408 ns 1723077 387 ns 387 ns 1781291 301 ns 301 ns 2322359
BM_l2_norm/4/4096 218 ns 218 ns 3446154 167 ns 167 ns 4187913 168 ns 168 ns 4133173
BM_l2_norm/1/8192 888 ns 879 ns 1120000 1675 ns 1675 ns 419021 860 ns 860 ns 798707
BM_l2_norm/2/8192 1734 ns 1716 ns 373333 1683 ns 1683 ns 408809 1274 ns 1274 ns 543401
BM_l2_norm/3/8192 913 ns 942 ns 746667 823 ns 823 ns 845670 628 ns 628 ns 1105021
BM_l2_norm/4/8192 426 ns 430 ns 1600000 387 ns 387 ns 1807098 388 ns 388 ns 1792908
BM_l2_norm/1/16384 1837 ns 1800 ns 373333 3403 ns 3403 ns 205434 1745 ns 1745 ns 397115
BM_l2_norm/2/16384 3756 ns 3683 ns 203636 3410 ns 3410 ns 204378 2599 ns 2599 ns 268432
BM_l2_norm/3/16384 1810 ns 1716 ns 373333 1686 ns 1686 ns 414183 1330 ns 1330 ns 533271
BM_l2_norm/4/16384 917 ns 900 ns 746667 831 ns 831 ns 834923 839 ns 839 ns 828062
Native builds
Benchmark Time (MSVC) CPU (MSVC) Iterations (MSVC) Time (GCC) CPU (GCC) Iterations (GCC) Time (Clang) CPU (Clang) Iterations (Clang)
BM_add/1/1024 57.5 ns 56.2 ns 10000000 51.7 ns 51.7 ns 13234391 54.7 ns 54.7 ns 12958619
BM_add/2/1024 82.3 ns 83.7 ns 8960000 102 ns 102 ns 6624177 133 ns 133 ns 5065284
BM_add/3/1024 44.6 ns 46.1 ns 16592593 61.8 ns 61.8 ns 11280734 69.8 ns 69.8 ns 9732685
BM_add/4/1024 25.2 ns 25.1 ns 28000000 45.7 ns 45.7 ns 15291231 57.9 ns 57.9 ns 11891293
BM_add/1/2048 99.8 ns 100 ns 7466667 102 ns 102 ns 6788656 119 ns 119 ns 5798412
BM_add/2/2048 217 ns 214 ns 2986667 201 ns 201 ns 3486827 266 ns 266 ns 2556640
BM_add/3/2048 115 ns 114 ns 5600000 124 ns 124 ns 5618302 157 ns 157 ns 4326618
BM_add/4/2048 88.9 ns 87.9 ns 7466667 89.3 ns 89.3 ns 7682860 115 ns 115 ns 6005367
BM_add/1/4096 157 ns 157 ns 4480000 215 ns 215 ns 3260921 249 ns 249 ns 2835786
BM_add/2/4096 351 ns 353 ns 2036364 411 ns 411 ns 1695882 542 ns 542 ns 1269726
BM_add/3/4096 177 ns 180 ns 4072727 254 ns 254 ns 2748763 326 ns 326 ns 2144973
BM_add/4/4096 129 ns 128 ns 5600000 186 ns 186 ns 3748768 244 ns 244 ns 2874838
BM_add/1/8192 726 ns 725 ns 1120000 667 ns 667 ns 1056132 643 ns 643 ns 1071193
BM_add/2/8192 1168 ns 1172 ns 640000 1209 ns 1209 ns 571610 1144 ns 1144 ns 579137
BM_add/3/8192 821 ns 820 ns 896000 737 ns 737 ns 948779 854 ns 854 ns 821051
BM_add/4/8192 808 ns 820 ns 896000 773 ns 773 ns 868640 776 ns 776 ns 908474
BM_add/1/16384 1734 ns 1726 ns 497778 1290 ns 1290 ns 545510 1356 ns 1356 ns 541797
BM_add/2/16384 2402 ns 2407 ns 298667 2102 ns 2102 ns 333210 2305 ns 2305 ns 302586
BM_add/3/16384 1656 ns 1573 ns 407273 1305 ns 1305 ns 538326 1698 ns 1697 ns 405094
BM_add/4/16384 1678 ns 1650 ns 407273 1517 ns 1517 ns 460136 1548 ns 1548 ns 444384
BM_sub/1/1024 40.3 ns 40.8 ns 17230769 51.6 ns 51.6 ns 13200424 58.1 ns 58.1 ns 11428105
BM_sub/2/1024 90.2 ns 87.9 ns 7466667 102 ns 102 ns 6830101 134 ns 134 ns 5122880
BM_sub/3/1024 46.2 ns 45.5 ns 15448276 62 ns 62 ns 11205594 70 ns 70 ns 9909680
BM_sub/4/1024 33.8 ns 33.8 ns 23578947 45.9 ns 45.9 ns 15340243 58.6 ns 58.6 ns 11731051
BM_sub/1/2048 103 ns 103 ns 7466667 103 ns 103 ns 6715498 119 ns 119 ns 5741399
BM_sub/2/2048 237 ns 237 ns 2635294 201 ns 201 ns 3331268 270 ns 270 ns 2595869
BM_sub/3/2048 121 ns 119 ns 7466667 124 ns 124 ns 5550468 155 ns 155 ns 4427659
BM_sub/4/2048 93.4 ns 92.4 ns 8960000 89.6 ns 89.6 ns 7598397 116 ns 116 ns 6017385
BM_sub/1/4096 172 ns 171 ns 4480000 215 ns 215 ns 3260775 245 ns 245 ns 2797614
BM_sub/2/4096 370 ns 381 ns 1723077 408 ns 408 ns 1712427 538 ns 538 ns 1283161
BM_sub/3/4096 185 ns 185 ns 4480000 252 ns 252 ns 2765698 326 ns 326 ns 2091898
BM_sub/4/4096 127 ns 129 ns 4977778 186 ns 186 ns 3758012 240 ns 240 ns 2902249
BM_sub/1/8192 716 ns 732 ns 896000 659 ns 659 ns 1051694 690 ns 690 ns 1090776
BM_sub/2/8192 1248 ns 1193 ns 497778 1219 ns 1219 ns 574599 1140 ns 1140 ns 588787
BM_sub/3/8192 781 ns 785 ns 896000 730 ns 730 ns 894365 856 ns 856 ns 815508
BM_sub/4/8192 809 ns 816 ns 746667 777 ns 777 ns 890595 784 ns 784 ns 910068
BM_sub/1/16384 1670 ns 1688 ns 407273 1299 ns 1299 ns 540988 1287 ns 1287 ns 542477
BM_sub/2/16384 2407 ns 2407 ns 298667 2105 ns 2105 ns 332456 2294 ns 2294 ns 300641
BM_sub/3/16384 1698 ns 1639 ns 448000 1302 ns 1302 ns 540345 1735 ns 1735 ns 409342
BM_sub/4/16384 1589 ns 1569 ns 497778 1508 ns 1508 ns 468978 1563 ns 1563 ns 393596
BM_dot_product/1/1024 48.9 ns 48.8 ns 11200000 69.9 ns 69.9 ns 9723073 37.6 ns 37.6 ns 18730154
BM_dot_product/2/1024 188 ns 184 ns 3733333 174 ns 174 ns 3962483 103 ns 103 ns 6715298
BM_dot_product/3/1024 80.4 ns 80.2 ns 8960000 69.7 ns 69.7 ns 9969209 49.4 ns 49.4 ns 14335523
BM_dot_product/4/1024 43.9 ns 43.3 ns 16592593 39.1 ns 39.1 ns 17698363 37.9 ns 37.9 ns 18657691
BM_dot_product/1/2048 109 ns 107 ns 6400000 177 ns 177 ns 3960353 83.3 ns 83.3 ns 8250533
BM_dot_product/2/2048 412 ns 417 ns 1723077 397 ns 397 ns 1794637 210 ns 210 ns 3305595
BM_dot_product/3/2048 190 ns 188 ns 4072727 177 ns 177 ns 3939651 107 ns 107 ns 6597506
BM_dot_product/4/2048 88.2 ns 87.9 ns 7466667 81.8 ns 81.8 ns 8485383 79.5 ns 79.5 ns 8699229
BM_dot_product/1/4096 213 ns 215 ns 3200000 397 ns 397 ns 1785594 167 ns 167 ns 4229183
BM_dot_product/2/4096 881 ns 889 ns 896000 822 ns 822 ns 819394 430 ns 430 ns 1606082
BM_dot_product/3/4096 414 ns 408 ns 1723077 396 ns 396 ns 1782694 218 ns 218 ns 3171811
BM_dot_product/4/4096 211 ns 209 ns 3733333 189 ns 189 ns 3685825 186 ns 186 ns 3829154
BM_dot_product/1/8192 503 ns 500 ns 1000000 826 ns 826 ns 835478 428 ns 428 ns 1619272
BM_dot_product/2/8192 1892 ns 1880 ns 407273 1656 ns 1656 ns 408369 865 ns 865 ns 802446
BM_dot_product/3/8192 910 ns 889 ns 896000 815 ns 815 ns 838213 554 ns 554 ns 1257114
BM_dot_product/4/8192 465 ns 450 ns 1493333 526 ns 526 ns 1340246 522 ns 522 ns 1317260
BM_dot_product/1/16384 958 ns 928 ns 640000 1670 ns 1670 ns 417705 913 ns 913 ns 753537
BM_dot_product/2/16384 3585 ns 3610 ns 194783 3409 ns 3409 ns 205000 1779 ns 1779 ns 399019
BM_dot_product/3/16384 1695 ns 1678 ns 344615 1685 ns 1685 ns 415396 1124 ns 1124 ns 634655
BM_dot_product/4/16384 1248 ns 1200 ns 560000 1072 ns 1072 ns 644320 1101 ns 1101 ns 630868
BM_cosine_distance/1/1024 72.1 ns 69.8 ns 11200000 100 ns 100 ns 6685054 54 ns 54 ns 12935535
BM_cosine_distance/2/1024 237 ns 231 ns 3446154 205 ns 205 ns 3416694 167 ns 167 ns 4198687
BM_cosine_distance/3/1024 122 ns 122 ns 6400000 100 ns 100 ns 6964744 94.5 ns 94.5 ns 7379272
BM_cosine_distance/4/1024 82.2 ns 83.7 ns 11200000 71.1 ns 71.1 ns 9807658 65.9 ns 65.9 ns 10574018
BM_cosine_distance/1/2048 135 ns 134 ns 5600000 209 ns 209 ns 3329408 102 ns 102 ns 6969592
BM_cosine_distance/2/2048 447 ns 446 ns 1120000 421 ns 421 ns 1669333 330 ns 330 ns 2138860
BM_cosine_distance/3/2048 222 ns 225 ns 2986667 207 ns 207 ns 3349447 182 ns 182 ns 3749543
BM_cosine_distance/4/2048 145 ns 140 ns 5600000 131 ns 131 ns 5316951 125 ns 125 ns 5543948
BM_cosine_distance/1/4096 262 ns 267 ns 2635294 423 ns 423 ns 1634376 194 ns 194 ns 3605839
BM_cosine_distance/2/4096 903 ns 907 ns 896000 849 ns 849 ns 811078 664 ns 664 ns 1055979
BM_cosine_distance/3/4096 475 ns 439 ns 1493333 424 ns 424 ns 1621854 358 ns 358 ns 1965163
BM_cosine_distance/4/4096 262 ns 268 ns 2800000 251 ns 251 ns 2772106 245 ns 245 ns 2864407
BM_cosine_distance/1/8192 538 ns 544 ns 1120000 861 ns 861 ns 806758 478 ns 478 ns 1476230
BM_cosine_distance/2/8192 1905 ns 1803 ns 407273 1717 ns 1717 ns 407036 1325 ns 1325 ns 528420
BM_cosine_distance/3/8192 897 ns 900 ns 746667 862 ns 862 ns 800247 734 ns 734 ns 957514
BM_cosine_distance/4/8192 508 ns 502 ns 1120000 585 ns 585 ns 1184507 580 ns 580 ns 1212552
BM_cosine_distance/1/16384 1068 ns 1036 ns 497778 1757 ns 1757 ns 403980 992 ns 992 ns 728100
BM_cosine_distance/2/16384 3744 ns 3735 ns 213333 3439 ns 3439 ns 202559 2666 ns 2666 ns 263941
BM_cosine_distance/3/16384 1820 ns 1800 ns 373333 1721 ns 1721 ns 406468 1473 ns 1473 ns 471534
BM_cosine_distance/4/16384 1455 ns 1465 ns 448000 1189 ns 1189 ns 570071 1189 ns 1189 ns 592421
BM_l1_distance/1/1024 56.5 ns 55.8 ns 11200000 91.3 ns 91.3 ns 7245664 46.8 ns 46.8 ns 14801858
BM_l1_distance/2/1024 200 ns 201 ns 3733333 202 ns 202 ns 3452904 109 ns 109 ns 5932032
BM_l1_distance/3/1024 99.3 ns 100 ns 7466667 91 ns 91 ns 7363584 55.2 ns 55.2 ns 12853777
BM_l1_distance/4/1024 51.4 ns 50 ns 10000000 47.1 ns 47.1 ns 14880408 37.9 ns 37.9 ns 18475585
BM_l1_distance/1/2048 116 ns 112 ns 6400000 204 ns 204 ns 3389982 90.8 ns 90.8 ns 7484518
BM_l1_distance/2/2048 450 ns 424 ns 1659259 418 ns 418 ns 1678127 219 ns 219 ns 3214164
BM_l1_distance/3/2048 217 ns 215 ns 3200000 204 ns 204 ns 3413314 113 ns 113 ns 6183959
BM_l1_distance/4/2048 113 ns 112 ns 6400000 103 ns 103 ns 6772275 73.8 ns 73.8 ns 9443316
BM_l1_distance/1/4096 233 ns 234 ns 2800000 419 ns 419 ns 1667891 179 ns 179 ns 3933727
BM_l1_distance/2/4096 867 ns 837 ns 746667 860 ns 860 ns 812015 434 ns 434 ns 1602184
BM_l1_distance/3/4096 452 ns 453 ns 1723077 418 ns 418 ns 1666553 226 ns 226 ns 3125710
BM_l1_distance/4/4096 232 ns 234 ns 2800000 218 ns 218 ns 3210311 145 ns 145 ns 4831735
BM_l1_distance/1/8192 520 ns 519 ns 1445161 861 ns 861 ns 783456 444 ns 444 ns 1570762
BM_l1_distance/2/8192 1818 ns 1814 ns 448000 1707 ns 1707 ns 409140 878 ns 878 ns 772444
BM_l1_distance/3/8192 913 ns 921 ns 746667 851 ns 851 ns 810576 556 ns 556 ns 1240558
BM_l1_distance/4/8192 470 ns 465 ns 1445161 546 ns 546 ns 1290908 552 ns 552 ns 1233519
BM_l1_distance/1/16384 987 ns 1004 ns 746667 1714 ns 1714 ns 403530 945 ns 945 ns 756923
BM_l1_distance/2/16384 3760 ns 3683 ns 203636 3433 ns 3433 ns 203633 1778 ns 1778 ns 386926
BM_l1_distance/3/16384 1861 ns 1842 ns 373333 1720 ns 1720 ns 405158 1175 ns 1175 ns 581446
BM_l1_distance/4/16384 1311 ns 1311 ns 560000 1166 ns 1166 ns 600541 1154 ns 1154 ns 620035
BM_l1_norm/1/1024 50 ns 50 ns 10000000 64.1 ns 64.1 ns 10724337 31.9 ns 31.9 ns 22010094
BM_l1_norm/2/1024 190 ns 188 ns 3733333 181 ns 181 ns 3854390 94.3 ns 94.3 ns 7378067
BM_l1_norm/3/1024 92.1 ns 87.9 ns 7466667 70 ns 70 ns 9620248 39.7 ns 39.7 ns 17783188
BM_l1_norm/4/1024 44.5 ns 44.3 ns 16592593 38.3 ns 38.3 ns 18346003 25.3 ns 25.3 ns 27631735
BM_l1_norm/1/2048 103 ns 97.7 ns 6400000 170 ns 170 ns 4052450 60.8 ns 60.8 ns 11567017
BM_l1_norm/2/2048 414 ns 405 ns 1659259 388 ns 388 ns 1796565 201 ns 201 ns 3489763
BM_l1_norm/3/2048 197 ns 200 ns 3200000 182 ns 182 ns 3796985 95.1 ns 95.1 ns 7241819
BM_l1_norm/4/2048 97.2 ns 96.3 ns 7466667 78.4 ns 78.4 ns 8761390 47.9 ns 47.9 ns 14808409
BM_l1_norm/1/4096 227 ns 230 ns 2986667 387 ns 387 ns 1819920 116 ns 116 ns 6016552
BM_l1_norm/2/4096 904 ns 889 ns 896000 846 ns 846 ns 833455 418 ns 418 ns 1674749
BM_l1_norm/3/4096 433 ns 415 ns 1544828 421 ns 421 ns 1652946 203 ns 203 ns 3439508
BM_l1_norm/4/4096 217 ns 218 ns 3733333 196 ns 196 ns 3550265 109 ns 109 ns 6357914
BM_l1_norm/1/8192 438 ns 439 ns 1493333 851 ns 851 ns 781811 230 ns 230 ns 3033368
BM_l1_norm/2/8192 1824 ns 1758 ns 373333 1689 ns 1689 ns 414056 849 ns 849 ns 801551
BM_l1_norm/3/8192 883 ns 879 ns 640000 832 ns 832 ns 815085 418 ns 418 ns 1668147
BM_l1_norm/4/8192 472 ns 455 ns 1544828 409 ns 409 ns 1708226 231 ns 231 ns 3026840
BM_l1_norm/1/16384 1002 ns 959 ns 896000 1678 ns 1678 ns 416647 538 ns 538 ns 1220433
BM_l1_norm/2/16384 3772 ns 3749 ns 179200 3438 ns 3438 ns 204660 1687 ns 1687 ns 408022
BM_l1_norm/3/16384 1897 ns 1880 ns 407273 1692 ns 1692 ns 405762 861 ns 861 ns 811964
BM_l1_norm/4/16384 914 ns 921 ns 746667 854 ns 854 ns 817550 628 ns 628 ns 1108765
BM_l2_distance/1/1024 51.1 ns 50 ns 10000000 82.6 ns 82.6 ns 8324513 44.6 ns 44.6 ns 15166406
BM_l2_distance/2/1024 204 ns 201 ns 3733333 189 ns 189 ns 3712265 108 ns 108 ns 6462954
BM_l2_distance/3/1024 88.6 ns 88.9 ns 8960000 82.5 ns 82.5 ns 8198171 57.4 ns 57.4 ns 12152124
BM_l2_distance/4/1024 51.2 ns 51.6 ns 10000000 46.6 ns 46.6 ns 15030439 44.5 ns 44.5 ns 15713973
BM_l2_distance/1/2048 105 ns 105 ns 6400000 191 ns 191 ns 3638607 89.4 ns 89.4 ns 7815579
BM_l2_distance/2/2048 432 ns 439 ns 1600000 406 ns 406 ns 1735215 214 ns 214 ns 3254081
BM_l2_distance/3/2048 202 ns 204 ns 3446154 190 ns 190 ns 3633012 116 ns 116 ns 6120524
BM_l2_distance/4/2048 110 ns 111 ns 7466667 96.8 ns 96.8 ns 7103815 94.9 ns 94.9 ns 7346382
BM_l2_distance/1/4096 222 ns 220 ns 3200000 410 ns 410 ns 1723814 173 ns 173 ns 3917039
BM_l2_distance/2/4096 909 ns 879 ns 746667 835 ns 835 ns 827758 435 ns 435 ns 1626498
BM_l2_distance/3/4096 423 ns 420 ns 1600000 408 ns 408 ns 1720496 230 ns 230 ns 3028207
BM_l2_distance/4/4096 222 ns 220 ns 2986667 214 ns 214 ns 3278108 211 ns 211 ns 3309542
BM_l2_distance/1/8192 501 ns 502 ns 1493333 840 ns 840 ns 831937 438 ns 438 ns 1597999
BM_l2_distance/2/8192 1695 ns 1688 ns 407273 1669 ns 1669 ns 414166 878 ns 878 ns 784525
BM_l2_distance/3/8192 863 ns 858 ns 746667 838 ns 838 ns 820267 569 ns 569 ns 1204036
BM_l2_distance/4/8192 493 ns 465 ns 1445161 535 ns 535 ns 1312708 543 ns 543 ns 1274720
BM_l2_distance/1/16384 987 ns 984 ns 746667 1715 ns 1715 ns 410271 940 ns 940 ns 746765
BM_l2_distance/2/16384 3916 ns 3836 ns 179200 3429 ns 3429 ns 204097 1756 ns 1756 ns 395332
BM_l2_distance/3/16384 1853 ns 1768 ns 344615 1700 ns 1700 ns 411560 1166 ns 1166 ns 598408
BM_l2_distance/4/16384 1347 ns 1311 ns 560000 1166 ns 1166 ns 600533 1149 ns 1149 ns 599830
BM_l2_norm/1/1024 41.4 ns 41 ns 17920000 63.4 ns 63.4 ns 11178305 26.3 ns 26.3 ns 27063945
BM_l2_norm/2/1024 179 ns 180 ns 4072727 159 ns 159 ns 4339438 94.3 ns 94.3 ns 7344786
BM_l2_norm/3/1024 76.1 ns 75 ns 8960000 65.4 ns 65.4 ns 10630834 43.1 ns 43.1 ns 16270976
BM_l2_norm/4/1024 43.2 ns 43 ns 16000000 34.3 ns 34.3 ns 20535513 28.1 ns 28.1 ns 24914233
BM_l2_norm/1/2048 90 ns 87.9 ns 7466667 160 ns 160 ns 4319870 51.4 ns 51.4 ns 13508011
BM_l2_norm/2/2048 411 ns 401 ns 1792000 375 ns 375 ns 1867021 203 ns 203 ns 3435485
BM_l2_norm/3/2048 183 ns 184 ns 4072727 163 ns 163 ns 4347032 97.8 ns 97.8 ns 7161646
BM_l2_norm/4/2048 91.4 ns 92.1 ns 7466667 78.4 ns 78.4 ns 8927877 67.9 ns 67.9 ns 10221694
BM_l2_norm/1/4096 212 ns 199 ns 3446154 377 ns 377 ns 1862238 108 ns 108 ns 6594200
BM_l2_norm/2/4096 901 ns 858 ns 746667 804 ns 804 ns 841235 417 ns 417 ns 1674381
BM_l2_norm/3/4096 403 ns 408 ns 1723077 378 ns 378 ns 1842725 209 ns 209 ns 3381721
BM_l2_norm/4/4096 211 ns 209 ns 3446154 188 ns 188 ns 3708922 165 ns 165 ns 4223085
BM_l2_norm/1/8192 482 ns 500 ns 1000000 808 ns 808 ns 851276 224 ns 224 ns 3274797
BM_l2_norm/2/8192 1833 ns 1814 ns 448000 1671 ns 1671 ns 416141 847 ns 847 ns 796679
BM_l2_norm/3/8192 891 ns 879 ns 1120000 807 ns 807 ns 850240 427 ns 427 ns 1636746
BM_l2_norm/4/8192 431 ns 424 ns 1659259 408 ns 408 ns 1720236 387 ns 387 ns 1807035
BM_l2_norm/1/16384 902 ns 921 ns 1120000 1676 ns 1676 ns 403521 454 ns 454 ns 1526590
BM_l2_norm/2/16384 3480 ns 3530 ns 203636 3507 ns 3507 ns 206377 1684 ns 1684 ns 413647
BM_l2_norm/3/16384 1967 ns 1953 ns 448000 1678 ns 1678 ns 405899 869 ns 869 ns 799324
BM_l2_norm/4/16384 958 ns 942 ns 746667 850 ns 850 ns 821211 827 ns 827 ns 839082

In short, the manual SIMD implementations outperform GCC's auto-vectorized code in both non-native (by up to ~70%) and native (~50%) builds. They also outperform Clang's and MSVC's code in non-native builds (by up to ~70%), while staying within 5-10% of them in native builds (Clang is a different beast here: it performs much better in native builds, especially as the vector lengths grow).

pgbench

To see whether the above results from pure (in-memory) benchmarks also carry over to SQL operations, I created the following benchmark data:

-- 00-dump-data.sql
\set dim 1536
\set nrows 1000

drop extension if exists vector cascade ;
create extension vector ;

drop table if exists vector_benchmark cascade ;
create table vector_benchmark (
  id int generated always as identity primary key,
  embedding vector(:dim)
) ;

insert into vector_benchmark(embedding)
     select array_agg(random_normal())
       from generate_series(1, :dim * :nrows) i
   group by i % :nrows ;

and the vector operations:

-- 01-vector-ops.sql
\set dim 1536

select array_fill(1, array[:dim])::vector <-> embedding as l2_distance,
       1 - (array_fill(1, array[:dim])::vector <=> embedding) as cosine_similarity,
      -1 * (array_fill(1, array[:dim])::vector <#> embedding) as inner_product,
       l1_distance(array_fill(1, array[:dim])::vector, embedding) as l1_distance,
       vector_norm(array_fill(1, array[:dim])::vector) as myvec_norm,
       vector_norm(embedding) as embedding_norm
  from vector_benchmark ;

Below are the results (v0.5.1; non-native build):

$ pgbench --host localhost --username postgres --no-vacuum --client 10 --jobs 4 --time 30 --file 01-vector-ops.sql postgres
pgbench (16.0)
transaction type: 01-vector-ops.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 4
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 11298
number of failed transactions: 0 (0.000%)
latency average = 26.567 ms
initial connection time = 5.215 ms
tps = 376.402088 (without initial connection time)
$

and the patch in this PR (again, non-native build):

$ pgbench --host localhost --username postgres --no-vacuum --client 10 --jobs 4 --time 30 --file 01-vector-ops.sql postgres
pgbench (16.0)
transaction type: 01-vector-ops.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 4
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 13196
number of failed transactions: 0 (0.000%)
latency average = 22.739 ms
initial connection time = 6.215 ms
tps = 439.765535 (without initial connection time)
$

As can be seen, there is a ~17% relative speed-up compared to the non-native build (which is the case when binary distributions of this extension are / need to be used). Obviously, this is far from the in-memory benchmark results I have reported above, which is due to the relatively low density of vector operations in a typical PostgreSQL session (see, e.g., the below flame graph for a sample session that runs the benchmark script manually, in which the portion of the vector operations is around 12%).

flame graph

@ankane
Member

ankane commented Oct 24, 2023

Hi @aytekinar, thanks for the PR! It's a neat idea, and the code looks really clean. However, it adds a good amount of complexity.

I think a minimal version would be:

  1. AVX only (SSE is always available for x86-64, and AVX is more widely available than AVX-512F)
  2. Only if AVX is not enabled by compilation flags
  3. Only functions that provide the most benefit (L2 distance squared, inner product, and cosine distance)
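A minimal AVX-only dispatch along these lines could be sketched as below (a hypothetical sketch, not pgvector's actual code; it assumes GCC/Clang's `__builtin_cpu_supports`, and the function names are illustrative):

```c
#include <stddef.h>

/* Scalar fallback, auto-vectorized or not depending on build flags. */
static float
l2_distance_squared_scalar(const float *a, const float *b, size_t n)
{
	float		sum = 0.0f;

	for (size_t i = 0; i < n; i++)
	{
		float		d = a[i] - b[i];

		sum += d * d;
	}
	return sum;
}

/* An AVX variant, compiled with __attribute__((target("avx"))), would be
 * assigned to this pointer when the CPU supports it. */
static float (*l2_distance_squared) (const float *, const float *, size_t) =
	l2_distance_squared_scalar;

/* Called once at extension load time (e.g., from _PG_init). */
static void
dispatch_init(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx"))
	{
		/* l2_distance_squared = l2_distance_squared_avx; */
	}
#endif
}
```

Callers always go through the function pointer, so only the single load-time check pays the dispatch cost.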

For benchmarking, it'd be best to use ann-benchmarks to see the impact on the total time.

Overall, I'm not sure it's something I'd like to move forward with right now, but will think on it for a bit (and welcome other comments).

@xfalcox

xfalcox commented Oct 24, 2023

To alleviate the above, I have implemented the SSE, AVX and AVX512F versions of the vector operations, and added a CPU dispatching mechanism (during extension load time) to pick the most recent version the underlying CPU supports (by following the best practices mentioned in Agner Fog's Optimizing Software in C++ (Chapters 12 and 13) manual).

That's refreshing to hear. At @discourse we had problems with dependencies shipping AVX improvements, which were followed by a random user running the application on a 13-year-old CPU, frustrated because it broke.

Overall, I'm not sure it's something I'd like to move forward with right now, but will think on it for a bit (and welcome other comments).

For us, the CPU cost of the "search" (inner_product) is what is holding us back on enabling this for more of our hosted forums, as most of the 500M+ monthly pageviews we serve will do one search to show related topics when we enable our embeddings feature for a site. We do cache the result of the query for a whole week, but over a million topics are created each month, and with anonymous visits coming from Google Search, the cache isn't always there.

That is all to say that a (17% ~ 50%) speed-up would be very welcome for us. I do appreciate the concern over complexity, though.

@jkatz
Contributor

jkatz commented Oct 24, 2023

I want to work on benchmarking over the coming days. I do think there's merit to being explicit about SIMD, but I'm not convinced on adding the complexity based on the above data. That said, I do generally think this is a good idea and, at least on paper, supportive of doing this.

The main issues I've seen as of late with the latest version of pgvector is calculations getting CPU bound on high levels of concurrency with higher levels of hnsw.ef_search. Specifically, I've observed that HNSW searches scale linearly quite well but they start to fall off with hnsw.ef_search that's 200+ with 48+ clients connecting (this is on a dataset of 10MM 128-dim vectors).

That said, the benchmarks above don't convince me yet around adding them -- these are very short tests on very small data sets. I agree with @ankane on running these against ANN Benchmarks (I'm happy to do so), but I also want to test something similar to the proposed pgbench method at a much higher concurrency scale over a longer duration. If this continues to show a significant gain, I would be in favor of getting this in sooner.

On the types, I would still suggest having AVX-512 available ("in for a penny, in for a pound") unless it truly adds that much more complexity. For my own benefit, is the work for AVX/AVX2 similar? From what I read for floating point calculations, they're pretty similar.

On the code, I'd recommend keeping the SIMD implementation work in its own file (upstream PostgreSQL does this in a simd.c) so that vector.c can just contain the implementation of the vector calculation functions. That way it's easier to maintain/grok the SIMD work.

Thanks!

@ashvardanian

Hello everyone,

A colleague recently highlighted the potential benefits of integrating SIMD similarity measures into pgvector. To my delight, I discovered this ongoing implementation, which aligns perfectly with a project of mine.

I am the creator of a library named SimSIMD. It is tailored to implement various similarity measures, including L2, Cosine, Inner Product, Jaccard, Hamming, and Jensen Shannon distances. It uses AVX2 and AVX-512 techniques, including VNNI for int8 and FP16 extensions for half-precision floating points on x86 platforms, including the most recent Sapphire Rapids. Furthermore, it supports Arm NEON and SVE, making it compatible with Graviton 3 and other recent chipsets.

One of the major deployments of SimSIMD is in USearch. This tool has gained traction and is now integrated into platforms like ClickHouse and various data lakes.

I've chronicled some benchmarks in the following articles. While they provide insight into the performance of SimSIMD, I'm keen to determine their relevance to this context:

  1. GCC 12 is 119x slower than AVX-512 SIMD on this task
  2. SciPy distances... up to 200x faster with AVX-512 & SVE

@ankane and @jkatz, please let me know if the integration makes sense. It should be as easy as adding a git submodule and invoking simsimd_metric_punned to get the appropriate function pointer. I'm happy to help with the implementation and testing.

@aytekinar force-pushed the feature/simd-operations branch from 9e519b4 to 406eccf on October 26, 2023 11:31
@ankane
Member

ankane commented Oct 26, 2023

Hi @ashvardanian, thanks for sharing. Looks really neat, but don't want to take on any external dependencies or different licenses (however, if someone decides to fork and benchmark, I'd be curious to see the results).

@ashvardanian

ashvardanian commented Oct 27, 2023

@ankane, the license is not an issue; I can dual-license to be compatible, and have previously done that for StringZilla, my string-processing library. As for benchmarking, it shouldn't be hard, just let me know how you generally do that - same way as in the first message of this thread? 🤗

@aytekinar force-pushed the feature/simd-operations branch 4 times, most recently from 0f4113a to 40a355e on October 27, 2023 10:15
@aytekinar
Author

aytekinar commented Oct 27, 2023

Hello all!

First, apologies for responding late --- I have been going through your comments and trying to understand the compilation failure in the CI. I tested the code in pgenv-managed PostgreSQL installations, and I used the configuration flags --with-openssl and --enable-debug (so, --with-llvm was not there, resulting in the lack of bitcode generation, and hence, no compilation issues).

Let me first answer the specific question:

For my own benefit, is the work for AVX/AVX2 similar? From what I read for floating point calculations, they're pretty similar.

Correct: AVX brought float support, and AVX2 built int capabilities on top of AVX. So, for the purposes of our functions mentioned above, the AVX requirement is enough. That said, AVX does not bring FMA (fused multiply-add), which is what turns c = _mmXXX_add_ps(c, _mmXXX_mul_ps(a, b)) into c = _mmXXX_fmadd_ps(a, b, c). If we would like to have FMA support on 256-bit registers as well, the corresponding target is fma. Note that the fma capability is already implied by AVX512F (AVX-512 Foundation). Also, AVX512F brings support for _mm512_abs_ps (absolute value on float variables), which has no counterpart for 128-bit or 256-bit registers. This is the reason why I needed to implement the L1 distance using bit-masking in the 128-bit and 256-bit versions, whereas I could use _mm512_abs_ps for AVX512F.

One final comment is that when we use FMA, we get slightly different results, which is the expected behavior because FMA performs a single rounding instead of two.

New Changeset

When the --with-llvm configuration option is present (in the PostgreSQL installation), we receive a compile/link-time error during bitcode generation. I have already created an issue under llvm/llvm-project#70184 and posted both the reproduced tarball (built directly from the pgvector objects) and a minimal, reproducible example. Note that, when we write our resolver/dispatcher function ourselves and use -flto=thin (i.e., thin LTO), we get a segmentation fault.

During my trials, I have observed the following:

  1. If -flto=thin is not present, the bitcode files are generated and work OK (which is what we don't have much control over due to the way PostgreSQL installations are handled), and/or,
  2. If we keep the function names/signatures the same (which is only supported in clang and g++ (the C++ frontend in GCC)), then LLVM LTO does not have issues, either.

The portable way is to have different function names (as was the case in the original PR of mine), but then, we cannot compile for LLVM-enabled PostgreSQL installations (due to bitcode generation and LTO). If, on the other hand, we choose to go for Option 2 above, we cannot support GCC.

As a result, I have opted for this version of the PR. Namely, I am using the target_clones function attribute (cf. GCC and Clang), which enables the compiler to generate different versions of the function, together with a proper resolver function, for the corresponding targets. In the PR, I have chosen to target AVX, FMA and AVX512F. Unfortunately, MSVC does not support such attributes, so I defined it as a no-op macro (meaning we are currently leaving MSVC, which lacks function multi-versioning, without runtime dispatch).

You can see how the code is generated for, e.g., l1_distance in Godbolt. Note the lack of -march=native or -mavx512f kind of flags.
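For reference, a stripped-down version of this pattern (a sketch of the approach, not the PR's exact code) guards the attribute the same way the PR does and otherwise degrades to a plain function:

```c
#include <math.h>
#include <stddef.h>

#if !(defined(__x86_64__) && defined(__gnu_linux__) && \
	defined(__has_attribute) && __has_attribute(target_clones))
#define target_clones(...)	/* no-op where multi-versioning is unsupported */
#endif

/* The compiler emits one clone per target plus an ifunc resolver that
 * selects the best clone at load time; callers just call the function. */
__attribute__((target_clones("default", "avx", "fma", "avx512f")))
static float
l1_distance_impl(const float *a, const float *b, size_t n)
{
	float		distance = 0.0f;

	for (size_t i = 0; i < n; i++)
		distance += fabsf(a[i] - b[i]);
	return distance;
}
```

On unsupported targets the macro erases the attribute arguments, leaving an empty (and harmless) `__attribute__(())`, so the same source builds everywhere.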

Benchmarks

I have repeated the same C and pgbench benchmarks, and I have also done ann-benchmarks (over the fashion-mnist-784-euclidean dataset).

C Benchmark

I have used the following commit when repeating the C benchmarks: aytekinar/simd-playground@cbe8ac27

Below are the results from my laptop:

clang-release (no attribute; non-native build)
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_dot_product/1024             107 ns          107 ns      6475605
BM_dot_product/2048             247 ns          247 ns      3209992
BM_dot_product/4096             441 ns          441 ns      1506644
BM_dot_product/8192             870 ns          870 ns       775244
BM_dot_product/16384           1877 ns         1877 ns       382151
BM_cosine_distance/1024         166 ns          166 ns      4158748
BM_cosine_distance/2048         329 ns          329 ns      2123591
BM_cosine_distance/4096         654 ns          654 ns       993028
BM_cosine_distance/8192        1305 ns         1305 ns       527415
BM_cosine_distance/16384       2644 ns         2644 ns       267665
BM_l1_distance/1024             109 ns          109 ns      6268004
BM_l1_distance/2048             219 ns          219 ns      3211458
BM_l1_distance/4096             432 ns          432 ns      1616881
BM_l1_distance/8192             874 ns          874 ns       771417
BM_l1_distance/16384           1823 ns         1823 ns       378044
BM_l2_distance/1024             107 ns          107 ns      6402242
BM_l2_distance/2048             215 ns          215 ns      3228364
BM_l2_distance/4096             430 ns          430 ns      1628083
BM_l2_distance/8192             870 ns          870 ns       791861
BM_l2_distance/16384           1825 ns         1825 ns       381786
BM_l2_norm/1024                98.5 ns         98.5 ns      6997782
BM_l2_norm/2048                 205 ns          205 ns      3391447
BM_l2_norm/4096                 425 ns          425 ns      1639000
BM_l2_norm/8192                 852 ns          852 ns       795043
BM_l2_norm/16384               1718 ns         1718 ns       401015
clang-release (target_clones attribute; non-native build)
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_dot_product/1024            30.5 ns         30.5 ns     22957587
BM_dot_product/2048            47.9 ns         47.9 ns     14712766
BM_dot_product/4096            97.0 ns         97.0 ns      7244225
BM_dot_product/8192             445 ns          445 ns      1581668
BM_dot_product/16384            879 ns          879 ns       790080
BM_cosine_distance/1024        54.4 ns         54.4 ns     12739433
BM_cosine_distance/2048        99.6 ns         99.6 ns      6977678
BM_cosine_distance/4096         188 ns          188 ns      3657308
BM_cosine_distance/8192         500 ns          500 ns      1000000
BM_cosine_distance/16384        997 ns          997 ns       697808
BM_l1_distance/1024            35.7 ns         35.7 ns     19479876
BM_l1_distance/2048            70.0 ns         70.0 ns      9948538
BM_l1_distance/4096             138 ns          138 ns      5014898
BM_l1_distance/8192             470 ns          470 ns      1495663
BM_l1_distance/16384            933 ns          933 ns       745000
BM_l2_distance/1024            32.3 ns         32.3 ns     21667922
BM_l2_distance/2048            62.4 ns         62.4 ns     11130121
BM_l2_distance/4096             123 ns          123 ns      5664727
BM_l2_distance/8192             469 ns          469 ns      1491221
BM_l2_distance/16384            939 ns          939 ns       754665
BM_l2_norm/1024                18.0 ns         18.0 ns     38544091
BM_l2_norm/2048                33.9 ns         33.9 ns     20352724
BM_l2_norm/4096                65.6 ns         65.6 ns     10516859
BM_l2_norm/8192                 128 ns          128 ns      5433454
BM_l2_norm/16384                594 ns          594 ns      1173546
gcc-release (no attribute; non-native build)
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_dot_product/1024             191 ns          191 ns      3596584
BM_dot_product/2048             412 ns          412 ns      1711020
BM_dot_product/4096             836 ns          836 ns       824775
BM_dot_product/8192            1695 ns         1695 ns       407793
BM_dot_product/16384           3414 ns         3414 ns       201811
BM_cosine_distance/1024         215 ns          215 ns      3272195
BM_cosine_distance/2048         432 ns          432 ns      1619827
BM_cosine_distance/4096         861 ns          861 ns       790842
BM_cosine_distance/8192        1720 ns         1720 ns       396751
BM_cosine_distance/16384       3424 ns         3424 ns       204935
BM_l1_distance/1024             217 ns          217 ns      3307375
BM_l1_distance/2048             419 ns          419 ns      1672571
BM_l1_distance/4096             881 ns          881 ns       803390
BM_l1_distance/8192            1727 ns         1727 ns       404133
BM_l1_distance/16384           3444 ns         3444 ns       195895
BM_l2_distance/1024             199 ns          199 ns      3524444
BM_l2_distance/2048             417 ns          417 ns      1688875
BM_l2_distance/4096             843 ns          843 ns       793325
BM_l2_distance/8192            1715 ns         1715 ns       405079
BM_l2_distance/16384           3456 ns         3456 ns       196678
BM_l2_norm/1024                 176 ns          176 ns      3974035
BM_l2_norm/2048                 392 ns          392 ns      1778498
BM_l2_norm/4096                 823 ns          823 ns       811369
BM_l2_norm/8192                1689 ns         1689 ns       407585
BM_l2_norm/16384               3402 ns         3402 ns       204944
gcc-release (target_clones attribute; non-native build)
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_dot_product/1024            37.5 ns         37.5 ns     18757227
BM_dot_product/2048            78.9 ns         78.9 ns      8737113
BM_dot_product/4096             185 ns          185 ns      3779932
BM_dot_product/8192             479 ns          479 ns      1456847
BM_dot_product/16384            961 ns          961 ns       717565
BM_cosine_distance/1024        59.6 ns         59.6 ns     11353315
BM_cosine_distance/2048         118 ns          118 ns      5900889
BM_cosine_distance/4096         237 ns          237 ns      2968991
BM_cosine_distance/8192         525 ns          525 ns      1321178
BM_cosine_distance/16384       1077 ns         1077 ns       647748
BM_l1_distance/1024            49.3 ns         49.3 ns     13903233
BM_l1_distance/2048             115 ns          115 ns      6031727
BM_l1_distance/4096             256 ns          256 ns      2741119
BM_l1_distance/8192             551 ns          551 ns      1272843
BM_l1_distance/16384           1125 ns         1125 ns       617751
BM_l2_distance/1024            43.6 ns         43.6 ns     15906860
BM_l2_distance/2048            90.0 ns         90.0 ns      7712385
BM_l2_distance/4096             204 ns          204 ns      3414657
BM_l2_distance/8192             494 ns          494 ns      1433472
BM_l2_distance/16384            987 ns          987 ns       692659
BM_l2_norm/1024                27.9 ns         27.9 ns     24960010
BM_l2_norm/2048                68.4 ns         68.4 ns     10236372
BM_l2_norm/4096                 168 ns          168 ns      4070314
BM_l2_norm/8192                 390 ns          390 ns      1791165
BM_l2_norm/16384                831 ns          831 ns       821427

We still get substantial speed-ups.

pgbench

The new results are as follows for non-native builds:

pgbench (16.0)
transaction type: 01-vector-ops.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 4
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 9346
number of failed transactions: 0 (0.000%)
latency average = 32.117 ms
initial connection time = 6.148 ms
tps = 311.366130 (without initial connection time)

and function multi-versioning:

pgbench (16.0)
transaction type: 01-vector-ops.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 4
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 10125
number of failed transactions: 0 (0.000%)
latency average = 29.645 ms
initial connection time = 5.628 ms
tps = 337.321905 (without initial connection time)

Note that, this time, both the non-native and SIMD results are lower compared to my previous experiments.

I also attach the flame graph here:

perf

If you download the graph locally and reopen it in your browser (from the local file system), you should be able to zoom in and search for avx512f. The resolver function did its job and dispatched the function call to the corresponding implementation. However, this time, we observe only 5.63% of the time spent in the vector computations. As a result, the speed-up observed in pgbench is reasonable (i.e., around 8%).

ANN Benchmarks

I have applied the following patchset to ann-benchmarks:

patch.diff
diff --git a/ann_benchmarks/algorithms/pgvector/module.py b/ann_benchmarks/algorithms/pgvector/module.py
index b98a6ce..90ccdc3 100644
--- a/ann_benchmarks/algorithms/pgvector/module.py
+++ b/ann_benchmarks/algorithms/pgvector/module.py
@@ -22,8 +22,8 @@ class PGVector(BaseANN):
             raise RuntimeError(f"unknown metric {metric}")
 
     def fit(self, X):
-        subprocess.run("service postgresql start", shell=True, check=True, stdout=sys.stdout, stderr=sys.stderr)
-        conn = psycopg.connect(user="ann", password="ann", dbname="ann", autocommit=True)
+        # subprocess.run("service postgresql start", shell=True, check=True, stdout=sys.stdout, stderr=sys.stderr)
+        conn = psycopg.connect(host="localhost", user="ann", password="ann", dbname="ann", autocommit=True)
         pgvector.psycopg.register_vector(conn)
         cur = conn.cursor()
         cur.execute("DROP TABLE IF EXISTS items")
diff --git a/ann_benchmarks/algorithms/pgvector_simd/Dockerfile b/ann_benchmarks/algorithms/pgvector_simd/Dockerfile
new file mode 100644
index 0000000..7ba7ede
--- /dev/null
+++ b/ann_benchmarks/algorithms/pgvector_simd/Dockerfile
@@ -0,0 +1,25 @@
+FROM ann-benchmarks
+
+# https://github.com/pgvector/pgvector/blob/master/Dockerfile
+
+RUN git clone https://github.com/pgvector/pgvector /tmp/pgvector
+
+RUN DEBIAN_FRONTEND=noninteractive apt-get -y install tzdata
+RUN apt-get update && apt-get install -y --no-install-recommends build-essential postgresql postgresql-server-dev-all
+RUN sh -c 'echo "local all all trust" > /etc/postgresql/14/main/pg_hba.conf'
+RUN cd /tmp/pgvector && \
+	make clean && \
+	make OPTFLAGS="-march=native -mprefer-vector-width=512" && \
+	make install
+
+USER postgres
+RUN service postgresql start && \
+    psql -c "CREATE USER ann WITH ENCRYPTED PASSWORD 'ann'" && \
+    psql -c "CREATE DATABASE ann" && \
+    psql -c "GRANT ALL PRIVILEGES ON DATABASE ann TO ann" && \
+    psql -d ann -c "CREATE EXTENSION vector" && \
+    psql -c "ALTER USER ann SET maintenance_work_mem = '4GB'" && \
+    psql -c "ALTER SYSTEM SET shared_buffers = '4GB'"
+USER root
+
+RUN pip install psycopg[binary] pgvector
diff --git a/ann_benchmarks/algorithms/pgvector_simd/config.yml b/ann_benchmarks/algorithms/pgvector_simd/config.yml
new file mode 100644
index 0000000..6472fec
--- /dev/null
+++ b/ann_benchmarks/algorithms/pgvector_simd/config.yml
@@ -0,0 +1,17 @@
+float:
+  any:
+  - base_args: ['@metric']
+    constructor: PGVector
+    disabled: false
+    docker_tag: ann-benchmarks-pgvector_simd
+    module: ann_benchmarks.algorithms.pgvector_simd
+    name: pgvector_simd
+    run_groups:
+      M-16:
+        arg_groups: [{M: 16, efConstruction: 200}]
+        args: {}
+        query_args: [[10, 20, 40, 80, 120, 200, 400, 800]]
+      M-24:
+        arg_groups: [{M: 24, efConstruction: 200}]
+        args: {}
+        query_args: [[10, 20, 40, 80, 120, 200, 400, 800]]
diff --git a/ann_benchmarks/algorithms/pgvector_simd/module.py b/ann_benchmarks/algorithms/pgvector_simd/module.py
new file mode 100644
index 0000000..90ccdc3
--- /dev/null
+++ b/ann_benchmarks/algorithms/pgvector_simd/module.py
@@ -0,0 +1,63 @@
+import subprocess
+import sys
+
+import pgvector.psycopg
+import psycopg
+
+from ..base.module import BaseANN
+
+
+class PGVector(BaseANN):
+    def __init__(self, metric, method_param):
+        self._metric = metric
+        self._m = method_param['M']
+        self._ef_construction = method_param['efConstruction']
+        self._cur = None
+
+        if metric == "angular":
+            self._query = "SELECT id FROM items ORDER BY embedding <=> %s LIMIT %s"
+        elif metric == "euclidean":
+            self._query = "SELECT id FROM items ORDER BY embedding <-> %s LIMIT %s"
+        else:
+            raise RuntimeError(f"unknown metric {metric}")
+
+    def fit(self, X):
+        # subprocess.run("service postgresql start", shell=True, check=True, stdout=sys.stdout, stderr=sys.stderr)
+        conn = psycopg.connect(host="localhost", user="ann", password="ann", dbname="ann", autocommit=True)
+        pgvector.psycopg.register_vector(conn)
+        cur = conn.cursor()
+        cur.execute("DROP TABLE IF EXISTS items")
+        cur.execute("CREATE TABLE items (id int, embedding vector(%d))" % X.shape[1])
+        cur.execute("ALTER TABLE items ALTER COLUMN embedding SET STORAGE PLAIN")
+        print("copying data...")
+        with cur.copy("COPY items (id, embedding) FROM STDIN") as copy:
+            for i, embedding in enumerate(X):
+                copy.write_row((i, embedding))
+        print("creating index...")
+        if self._metric == "angular":
+            cur.execute(
+                "CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops) WITH (m = %d, ef_construction = %d)" % (self._m, self._ef_construction)
+            )
+        elif self._metric == "euclidean":
+            cur.execute("CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WITH (m = %d, ef_construction = %d)" % (self._m, self._ef_construction))
+        else:
+            raise RuntimeError(f"unknown metric {self._metric}")
+        print("done!")
+        self._cur = cur
+
+    def set_query_arguments(self, ef_search):
+        self._ef_search = ef_search
+        self._cur.execute("SET hnsw.ef_search = %d" % ef_search)
+
+    def query(self, v, n):
+        self._cur.execute(self._query, (v, n), binary=True, prepare=True)
+        return [id for id, in self._cur.fetchall()]
+
+    def get_memory_usage(self):
+        if self._cur is None:
+            return 0
+        self._cur.execute("SELECT pg_relation_size('items_embedding_idx')")
+        return self._cur.fetchone()[0] / 1024
+
+    def __str__(self):
+        return f"PGVector(m={self._m}, ef_construction={self._ef_construction}, ef_search={self._ef_search})"

and run the benchmarks with python3 run.py --dataset fashion-mnist-784-euclidean --algorithm {pgvector|pgvector_simd} --local. Below is the result:

fashion-mnist-784-euclidean

Summary

Now, we have only 15 lines of addition to the codebase and benefit from compiler-generated/supported function multi-versioning in both GCC and Clang. MSVC users will still compile their code with, e.g., /arch:AVX512. We can also add support for more architectures/targets easily as they come along and get supported by the compiler(s).

What do you think?

Note. If we are going to go down the path of pulling external dependencies, I would also recommend Agner Fog's very own vectorclass library.

@aytekinar force-pushed the feature/simd-operations branch 4 times, most recently from c80103b to d8ddf62 on October 27, 2023 21:26
@aytekinar
Copy link
Author

I think I will need help for the build / mac one.

The define macros should be fine: https://godbolt.org/z/WbKfPrYoG

What's the output of the following:

$ clang -dM -E -x c /dev/null | grep -iE '(__gnuc__|__clang__|__clang_m..or__)'
#define __GNUC__ 4
#define __clang__ 1
#define __clang_major__ 15
#define __clang_minor__ 0

@ankane
Member

ankane commented Oct 30, 2023

@ashvardanian The best way to benchmark is ann-benchmarks.

@aytekinar target_clones is neat, but looks like it may not be trivial to detect when it's supported. With Apple clang 15 on Mac ARM, I'm seeing:

error: function multiversioning is not supported on the current target

Also, there may be issues with musl.

I'm seeing a similar performance difference with ann-benchmarks, which I don't think justifies added complexity.

@aytekinar force-pushed the feature/simd-operations branch from d8ddf62 to f84a633 on October 31, 2023 11:15
Comment on lines +37 to +43
#if defined _MSC_VER /* MSVC only */
#define __attribute__(x)
#elif defined __APPLE__ /* Apple/OSX only */
#define target_clones(...)
#elif !defined __has_attribute || !__has_attribute(target_clones) || !defined __gnu_linux__ /* target_clones not supported */
#define target_clones(...)
#endif
Author


These, now, handle the following situations:

  • for MS Visual C compiler, __attribute__ is defined to be no-op,
  • for Apple targets, target_clones is defined to be no-op, and,
  • whenever target_clones is not found as an attribute or __gnu_linux__ is not defined (i.e., for Alpine), target_clones is defined to be no-op.

You can verify the implementation for l2_distance_impl on Godbolt: https://godbolt.org/z/jzMjKbhKe

I have also tested the compilation in alpine:latest images. GCC builds without the SIMD support whereas Clang 16 builds with it. However, when I try to run the compiled code, I receive an error as mentioned in llvm/llvm-project#64631, which is closed by llvm/llvm-project-release-prs#615. In fact, when I test the code in alpine:edge, instead, GCC still builds the code without the SIMD support whereas Clang 17 builds with it, and I am able to see the benefits (speed improvements of up to 70%, again in pure C benchmarks).

@ankane
Member

ankane commented Oct 31, 2023

Thanks @aytekinar, looks like CI is now passing.

However, I don't think it makes sense to spend more time on this unless benchmarks show significant improvement in overall performance.

@ankane
Member

ankane commented Nov 1, 2023

A few other places to look for performance gains would be:

  1. Index build times
  2. IVFFlat index scans
  3. Table scans

Based on past SIMD optimizations (#180), index build times might benefit the most.

@jkatz
Contributor

jkatz commented Nov 8, 2023

From some recent testing, I do think there may be some headroom on HNSW scans with higher hnsw.ef_search values (e.g. hnsw.ef_search > 200 and esp. hnsw.ef_search > 800). I've observed that the dropoff for pgvector is more significant here than in other implementations. I do need to drill into it more to see if CPU is the primary bottleneck, but intuitively it'd be a good area to explore.

Additionally, it'd be good to stress test this against some larger vectors (e.g. 1536-dim) where the CPU is used more. I've found that a 10,000,000 1536-dim test data set can be very revealing in terms of where one can tweak performance. (1,000,000 is good too, if you have space/resource constraints).

@ankane
Member

ankane commented Mar 28, 2024

Hi all, I'm considering including a version of this in 0.7.0 (and removing -march=native by default).

I did some initial benchmarking with this branch on HNSW build time (single process) with 128 dimensions and found:

  • ~10% improvement compared to OPTFLAGS=""
  • around the same performance as -march=native with target_clones("default", "avx", "fma")
  • worse performance than -march=native with "avx512f" (possibly due to downclocking)
  • not much difference with -fno-math-errno and/or -funroll-loops

If anyone is able to test or has thoughts on the above, please share.

@aytekinar
Author

Great news!

I can try it. In the meantime, do you want anything from my end regarding this PR?

The diff looks neat -- I see that you're going to apply runtime dispatching to only the L2 norm computations for the time being. Would you be interested in doing the same for L1, cosine and dot, as well? If so, I can modify my PR to include your changeset (i.e., the OPTFLAGS = change and the RUNTIME_DISPATCH macro/definition) and make it ready to be merged.

@tureba

tureba commented Mar 28, 2024

How were those targets "default", "avx", "fma" chosen for x86-64? I see some concerns around AVX not implying FMA upthread, but AFAIK, in practice they show up together more often than not.

In fact, nowadays it's common not to care all that much about the specific instruction sets, but to target the coarser microarchitecture levels instead.

So I'd have gone with something like this instead:

target_clones("default", "arch=x86-64-v2", "arch=x86-64-v3", "arch=x86-64-v4")

In my testing, it generates variations of code with increasing sizes of registers used, as well as different instructions.

Even after that, it's possible that with the AVX512 instruction set explosion, we might have to think about adding an extra variation for AVX512FP16 (specifically for halfvec) and for AVX512BF16 (for bfloat16, if we ever use it).

@tureba

tureba commented Mar 28, 2024

I see the macro is:

#if defined(__x86_64__) && defined(__gnu_linux__) && defined(__has_attribute) && __has_attribute(target_clones)

It only applies to GNU, which is very sane: target_clones depends on ifunc, a GNU-specific feature that is not available in non-GNU libcs such as macOS's or musl (the libc used widely by Alpine Linux, itself a popular lightweight container image option).

I see it's also limited to x86-64, which is a very safe bet. IIRC, ARM64 and PowerPC64 might have working target_clones too, but S390X didn't, last time I checked. Support for this depends on the compilers generating the resolver functions and on glibc calling them at load time, so adoption is slow.

I imagine we can worry about adding other architectures in the future, but I do wonder what architectures we'd like to keep an eye out for, since surely not all extensions have to cover the entirety of the postgres platforms.

I think ARM64 will likely be the next ask, given its current popularity. Does anyone have thoughts on other architectures?

@ankane
Member

ankane commented Mar 28, 2024

@aytekinar No need to do anything with the PR. Will add the other distance functions, and include you as a co-author when merging (thanks for the idea and all of the work so far).

@tureba Thanks, will try those out. If you have ideas for what to use for target_clones on ARM, please share.

Here's what I'm thinking overall:

Platform       Current        0.7.0          Notes
Linux x86-64   -march=native  target_clones  In branch
Linux aarch64  -march=native  target_clones  TODO
Mac x86-64     -march=native  -march=native  Since target_clones not supported
Mac arm64      none           none           Not needed
Windows        none           none           Open to ideas
PowerPC        none           none           Don't have a way to test

Also, if anyone has ideas for optimizing distance functions in the halfvec and bitvector branches, please share those as well (a new issue would probably be best).

ankane added a commit that referenced this pull request Apr 8, 2024
Co-authored-by: Arda Aytekin <arda.aytekin@microsoft.com>
@ankane
Member

ankane commented Apr 8, 2024

Added CPU dispatching for halfvec distance functions in the commit above (likely still needs a few tweaks).

@aytekinar, added you as a co-author since your work in this PR was very helpful for this.

The reason for intrinsics / dispatching in this case was a significant difference in performance on x86-64. With the SIFT 1M dataset (128 dimensions):

Benchmark              Before    After
HNSW build time        1431 sec  334 sec
HNSW query time (10k)  17 sec    11 sec

Still looking at dispatching for vector functions.

@ankane ankane closed this in 0030849 Apr 15, 2024
@ankane
Member

ankane commented Apr 15, 2024

Added CPU dispatching for key vector distance functions in the commit above. A few findings:

  1. arch=x86-64-v* isn't available with GCC < 12
  2. function multiversioning is in beta on ARM (__HAVE_FUNCTION_MULTI_VERSIONING). It's available in LLVM 16 and was recently added to GCC.

Also, some benchmarks with GIST 1M (960-dimensions):

Benchmark                              OPTFLAGS=""  fma dispatching
HNSW index build (8 processes)         342 sec      294 sec
HNSW query time (1k), ef_search = 40   2.3 sec      2.1 sec
HNSW query time (1k), ef_search = 200  10.4 sec     9.8 sec

Happy to consider other functions and targets if there are benchmarks to support it.

Thanks again @aytekinar.

@aytekinar
Author

Thank you, @ankane, for integrating all these changes and responding positively to our request. This was needed for us and our customers.

As for @tureba's comment

It only applies to GNU, which is very very sane, as target_clones depends on ifunc, which is a GNU-ism feature, that is not available in non-GNU libc, like MacOS's, for one, and musl (which is a libc used widely in Alpine Linux, which is itself a popular lightweight container image option).

I had tested the changeset in this PR on alpine:edge (cf. #311 (comment)). If I am not mistaken, Clang 16 should be able to handle Alpine properly.

apavenis pushed a commit to apavenis/pgdg that referenced this pull request May 3, 2024
ettanany pushed a commit to aiven/pgrpms that referenced this pull request May 10, 2024