Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add function to find closest bytestring/string in list #9

Closed
mrecachinas opened this issue Aug 11, 2021 · 12 comments
Closed

Add function to find closest bytestring/string in list #9

mrecachinas opened this issue Aug 11, 2021 · 12 comments
Labels
enhancement New feature or request

Comments

@mrecachinas
Copy link
Owner

mrecachinas commented Aug 11, 2021

In #7, @bigcat88 wrote:

Also imho, will be interesting such function for python:

int check_bytes_arrays_within_dist(
    byte* arrays_to_compare,
    int nElementsInFirstArg,
    byte* what_to_compare,
    int nBytesInElement,
    int max_dist
)

Return index of first array for which distance <(or <=) max_dist, or -1 otherwise.

Like current check_hexstrings_within_dist with ability to compare many arrays(arg1) to one(arg=what_to_compare) to see if for specified distance there is any equality.
Thanks, for your work and library - anyway 👍

@mrecachinas mrecachinas added the enhancement New feature or request label Aug 11, 2021
@bigcat88
Copy link
Contributor

bigcat88 commented Aug 24, 2021

static char check_bytes_arrays_within_dist_docstring[] =
    "Check if any element of byte array are within a specified Hamming Distance\n"
    "and return it's index or -1 otherwise.\n\n"
    "Size in bytes of `array_of_elems` and `elem_to_compare` must be multiple of 16.\n"
    "Size in bytes of array_of_elems must be equal to `elem_size` * `elem_count`. \n\n"
    ":param elem_size: size in bytes of one element of array\n"
    ":type elem_size: int\n"
    ":param elem_count: number of elements in array_of_elems to search within\n"
    ":type elem_count: int\n"
    ":param array_of_elems: array of bytes to search within\n"
    ":type array_of_elems: bytes\n"
    ":param elem_to_compare: will compare to each element in array_of_elems\n"
    ":type elem_to_compare: bytes\n"
    ":param max_dist: maximum allowable Hamming Distance\n"
    ":type max_dist: int\n"
    ":returns: index of first element in array_of_elems for which hamming distance <= `max_dist` or -1. \n"
    ":rtype: int\n"
    ":raises nothing: if input parameters are invalid, internal exception will occur. Use wisely!"

I can write such function. Do you ok with this?

Also, have you any plans to build binary wheels for different OS?(Centos,Debian,Ubuntu)(manylinux1-..-manylinux_x_y) ?
And maybe some plans to support arm cpu's in far feature?(sse2neon project will have it first release soon btw)

@mrecachinas
Copy link
Owner Author

Definitely happy to support as many platforms as possible!

Also very happy to have you add that function! That signature looks good to me.

@bigcat88
Copy link
Contributor

bigcat88 commented Aug 24, 2021

static char check_bytes_arrays_within_dist_docstring[] =
    "Check if any element of byte array are within a specified Hamming Distance\n"
    "and return it's index or -1 otherwise.\n\n"
    "Size in bytes of `array_of_elems` and `elem_to_compare` must be multiple of 16.\n"
    "Size of `elem_to_compare ` must be multiplier of `array_of_elems` size. \n\n"
    ":param array_of_elems: array of bytes to search within\n"
    ":type array_of_elems: bytes\n"
    ":param elem_to_compare: will compare to each element in array_of_elems\n"
    ":type elem_to_compare: bytes\n"
    ":param max_dist: maximum allowable Hamming Distance\n"
    ":type max_dist: int\n"
    ":returns: index of first element in array_of_elems for which hamming distance <= `max_dist` or -1. \n"
    ":rtype: int\n"
    ":raises ValueError or native exception: if input parameters are invalid, internal exception will occur. Use wisely!";

Done it locally on my mac, need only understand how to add tests and benchmarks.
After that will push it to my github and make a pull request...

@mrecachinas
Copy link
Owner Author

At the moment I'm testing it via the Python extension (i.e., I don't have C++ unit tests specifically for the C++ functions).

To add a test, you can simply add a couple test functions to test/test_hexhamming.py. Don't worry about parametrizing the tests at first, something like the following would suffice:

from hexhamming import check_bytes_arrays_within_dist

def test_check_bytes_arrays_within_dist():
    array_of_elems = ["...", "...", ...]
    elem_to_compare = ...
    max_dist = ...
    expected = ...
    assert check_bytes_arrays_within_dist(...) == expected

For benchmarking, you can do

from hexhamming import check_bytes_arrays_within_dist

def test_check_bytes_arrays_within_dist_bench(benchmark):
    array_of_elems = ["...", "...", ...]
    elem_to_compare = ...
    max_dist = ...
    expected = ...
    assert check_bytes_arrays_within_dist(...) == expected

Let me know if you have any questions about this.

@bigcat88
Copy link
Contributor

Is there any problems with my pull request?

@mrecachinas
Copy link
Owner Author

Ah sorry! I didn’t get notified that there was a PR. I’ll review it this morning.

@bigcat88
Copy link
Contributor

bigcat88 commented Aug 28, 2021

I will rewrite maybe code for non SSE, need look deeper at it, it seems not so optimised for my func when it takes array with multiply 16.(i just copied your code)
And add Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS to my func, but need to benchmarks to see how it will feel.
But it will be later, first when I have time(probably on Monday-Wednesday), with next pull request I plan to add building wheels.
Like this, but with more wheels and with python 3.9 support:
https://github.com/carsales/pyheif/blob/38468c5963f7c7e40f482c71d9662f9c7412895f/docker/Dockerfile.manylinux2014#L165

@mrecachinas
Copy link
Owner Author

Sounds good. I'd like to tie the wheel-building and pushing to PyPI with GitHub Actions (probably using this action https://github.com/RalfG/python-wheels-manylinux-build). I can try to add that today.

@bigcat88
Copy link
Contributor

Sounds good. I'd like to tie the wheel-building and pushing to PyPI with GitHub Actions (probably using this action https://github.com/RalfG/python-wheels-manylinux-build). I can try to add that today.

Looks very promising, if I can help with this - just say.

@bigcat88
Copy link
Contributor

bigcat88 commented Sep 5, 2021

Say if there is anything else that I can put for that pull request.
When binary wheels(manylinux2014/manylinux2010) will appear, I can test them on different CPUs(with/without AVX support), but pretty sure there will not be any problems.

@mrecachinas
Copy link
Owner Author

I'll need to spend a little more time on the wheel building for i686 and ARM, but manylinux wheels are now being built in CI as of 4fc4fb8. For some reason, I'm getting a Docker compatibility error with ARM: https://github.com/mrecachinas/hexhamming/runs/3526835994?check_suite_focus=true. Error copied below:

WARNING: The requested image's platform (linux/arm64) does not match the detected host platform (linux/amd64)
and no specific platform was requested
standard_init_linux.go:228: exec user process caused: exec format error

@mrecachinas
Copy link
Owner Author

Closing this since #10 has been pulled in and passed CI. For further optimizations, just go ahead and submit a PR! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants