Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check_bytes_arrays_within_dist function added. #10

Merged
merged 10 commits into from Sep 6, 2021
Merged

check_bytes_arrays_within_dist function added. #10

merged 10 commits into from Sep 6, 2021

Conversation

bigcat88
Copy link
Contributor

Also hamming_distance_sse41_byte_max_dist and hamming_distance_loop_byte_max_dist added as underlying functions.
Added 2 tests(for input args checks and calculations check) and 1 benchmark(with result check too).
Version changed from 1.4 to 2.1

Also `hamming_distance_sse41_byte_max_dist` and `hamming_distance_loop_byte_max_dist` added as underlying functions.
Added 2 tests(for input args checks and calculations check) and 1 benchmark(with result check too).
Version changed from 1.4 to 2.1
Copy link
Owner

@mrecachinas mrecachinas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, just one comment in test_check_bytes_arrays_within_dist_bench. Once you address it, I'll merge and cut a release!

)
def test_check_bytes_arrays_within_dist_bench(benchmark, bytes1, bytes2, max_dist, result):
result = benchmark(check_bytes_arrays_within_dist, bytes1, bytes2, max_dist)
assert result == result
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can either omit this assert or rename the passed in result to expected. I'm happy for you to just omit the assert given that you're already testing check_bytes_arrays_within_dist.

@mrecachinas mrecachinas added the enhancement New feature or request label Aug 28, 2021
remove `result` from `test_check_bytes_arrays_within_dist_bench`
@bigcat88
Copy link
Contributor Author

done

@mrecachinas mrecachinas added this to the 2.1 milestone Aug 29, 2021
@bigcat88
Copy link
Contributor Author

bigcat88 commented Sep 2, 2021

Tomorrow and probably next day after too I will have free time, and I will rewrite my code. So don't merge please.

@bigcat88
Copy link
Contributor Author

bigcat88 commented Sep 3, 2021

How can i detect during compilation(or execution) that module was compiled on this machine?(to safely call __builtin_popcount on x86 CPU?)
__builtin_popcount generates the most optimised version with march=native, so on one PC it uses AVX, and second PC hasn't it for example.
Wish to do:
If module was compiled than __builtin for popcount can be used, if not(binary from wheel) then use the fastest algo that CPU supports.
AVX512 > AVX2 > cpu_popcount > sse_popcount > arithmetic popcount.
For non x86: Neon > cpu_popcount > arithmetic popcount.
When user builds hexhamming from source, __builtin_popcount will be the fastest and can be used...

Need this is for binary distribution w/o problems.
Maybe some flag during compilation for binary wheels, that I can check...

@mrecachinas
Copy link
Owner

Perhaps you could use Function Multiversioning.

@bigcat88
Copy link
Contributor Author

bigcat88 commented Sep 3, 2021

No, that a bit different thing....

Can be there addition arg in extra_compile_args that will be present only if user compiles hexhamming from source?

@bigcat88
Copy link
Contributor Author

bigcat88 commented Sep 3, 2021

something like:
https://docs.microsoft.com/en-us/cpp/build/reference/d-preprocessor-definitions?view=msvc-160
https://www.rapidtables.com/code/linux/gcc/gcc-d.html

and i will check it(assume it will have name FROM_SRC) with:
#ifdef FROM_SRC or #if defined(FROM_SRC)

or opposite name like BIN_DISTR

@mrecachinas
Copy link
Owner

Just so I understand, are you asking how to

  • automatically determine what's available (and most optimal) when someone does a python setup.py build/install, or to
  • automatically pick the fastest function variant available on your machine at runtime (i.e., you simply pip install hexhamming and it auto-magically figures it out)?

or both?

In the case where you're on a platform where a matching wheel exists, it won't build it from source. Therefore, the function multi-versioning might actually be useful because at runtime it will choose the most optimal available function at runtime (provided the targets are defined properly). Note: this is essentially how Tensorflow and Pytorch work with their SIMD accelerated function versions (they actually likely use dynamic library loading instead or in addition, but the concept is similar).

However, if you're on a platform where a matching wheel doesn't exist, then I believe it uses the tarball to build (i.e., it basically runs python setup.py build). In that case, it should build it according to what's available on your machine (ideally using the most optimal function, provided the macro-checking is setup accordingly).

@bigcat88
Copy link
Contributor Author

bigcat88 commented Sep 3, 2021

"automatically determine what's available (and most optimal) when someone does a python setup.py build/install"

PyMODINIT_FUNC 
...
    #if defined(FROM_SRC)
       // HERE ONLY when tarball building it(pip with flag --no-binary)
        p_hamming_distance_bytes = hamming_distance_bytes__builtin;
    #else
        #if defined(CPU_X86_64)
            cpu_capabilities = get_cpuid();
            if ((cpu_capabilities & bit_AVX512) == bit_AVX512)
                p_hamming_distance_bytes = hamming_distance_bytes__avx512;
            else if ((cpu_capabilities & bit_AVX2) == bit_AVX2)
                p_hamming_distance_bytes = hamming_distance_bytes__avx2;
            else if ((cpu_capabilities & bit_AVX2) == bit_POPCNT)
                p_hamming_distance_bytes = hamming_distance_bytes__asm; //cpu_popcount
            else if ((cpu_capabilities & bit_AVX2) == bit_SSE41)
                p_hamming_distance_bytes = hamming_distance_bytes__sse41;
            else
                p_hamming_distance_bytes = hamming_distance_bytes__classic;
        #else //non x86 CPU, in progress...
            p_hamming_distance_bytes = hamming_distance_bytes__classic;
        #endif
    #endif

@mrecachinas
Copy link
Owner

You could use CFLAGS, for example:

CFLAGS="-DFROM_SRC" python setup.py build

Off-hand, I don't know whether that would clobber existing flags (you should check). Otherwise, I suppose you could do something in setup.py like

...

extra_args = [...]
if os.environ.get("FROM_SRC"):
    extra_args.append("-DFROM_SRC")

setup(
    ...
    extra_args=extra_args,
)

need to check how it will compile on MVSC.
later will write why i rewrited what was a;ready written..
@bigcat88
Copy link
Contributor Author

bigcat88 commented Sep 3, 2021

those what i pushed need some review and testing, i tested it with hands as i could.
Replaced ifdef SSE4.1 to ifdef CPU_X86_64 , cause there is no CPU now days that didn't support SSE4(13 years passed)...
But can revert check for SSE4.1 or implement it dynamically, if you wish...

a few benchs of this(i changed your popcnt128 to sse_popcnt128 with realization on SSE3, it was faster on my two PC(82xx,109xxk Intel CPU's):

avx2: 
elem size=512, elements before match=4.0kk, in MB=1953.125,min=0.2221665220000002, avg=0.24850279833333344
elem size=256, elements before match=4.0kk, in MB=976.5625,min=0.10947152799999982, avg=0.113585916
elem size=128, elements before match=4.0kk, in MB=488.28125,min=0.05709911199999995, avg=0.0619054171
elem size=64, elements before match=4.0kk, in MB=244.140625,min=0.028964258999999992, avg=0.03135880983333331
popcount:
elem size=512, elements before match=4.0kk, in MB=1953.125,min=0.2470361539999999, avg=0.27512743159999997
elem size=256, elements before match=4.0kk, in MB=976.5625,min=0.12591829200000015, avg=0.13021109633333328
elem size=128, elements before match=4.0kk, in MB=488.28125,min=0.06370064799999975, avg=0.06827986276666669
elem size=64, elements before match=4.0kk, in MB=244.140625,min=0.03282513900000006, avg=0.035752803400000005
sse:
elem size=512, elements before match=4.0kk, in MB=1953.125,min=0.2810797640000011, avg=0.3101610192333334
elem size=256, elements before match=4.0kk, in MB=976.5625,min=0.1433219910000001, avg=0.15382660596666664
elem size=128, elements before match=4.0kk, in MB=488.28125,min=0.07159850000000034, avg=0.0757703425666667
elem size=64, elements before match=4.0kk, in MB=244.140625,min=0.036107665000000067, avg=0.04193896580000001

And there is something a bug(dont pass all tests), need some times to found & fix it...

@bigcat88
Copy link
Contributor Author

bigcat88 commented Sep 3, 2021

That what i pushed is working, but want to try one more thing for hamming_distance_bytes function for sse/avx realization.
After that will post benchmarks compared to original code in master branch(hamming_distance_bytes function).
P.SL i haven't visual studio currently and didnt make compilation test on it...(only on gcc/clang)

…n `int`(this is a high cost operation for CPU) during distance calculation, and store it only at end.
@bigcat88
Copy link
Contributor Author

bigcat88 commented Sep 4, 2021

that's all, finished.
there also can be done imorovement for avx calculation(use _mm256_extract_epi64 only at end after cycle) - like in my last commit, but avx already the fastest from all, so maybe later...
i currently have no arm cpu, and from tomorrow will have no free time for a week or two, so maybe later will add arm neon realiztion(need cpu or install an emulator to check what will be written)

this code is ready for binary distribution throw wheel(i think), and is not slower than previous.

@bigcat88
Copy link
Contributor Author

bigcat88 commented Sep 5, 2021

That what i pushed is working, but want to try one more thing for hamming_distance_bytes function for sse/avx realization.
After that will post benchmarks compared to original code in master branch(hamming_distance_bytes function).
P.SL i haven't visual studio currently and didnt make compilation test on it...(only on gcc/clang)

To compile it with studio add to start of "python_hexhamming.h" after line number 2, this strings, after that VS will ok to build it.

#if defined(_MSC_VER)
#include <cstdint>
#include <BaseTsd.h>
typedef SSIZE_T ssize_t;
#endif

@mrecachinas mrecachinas merged commit 384cc1d into mrecachinas:master Sep 6, 2021
@mrecachinas
Copy link
Owner

I pulled this into master to cut a new release with it. For further optimizations, feel free to open a new PR. Thanks for your work @bigcat88!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants