Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Function to Automatically Transform Input To Use The Smallest dtype #6

Closed
seanlaw opened this issue Feb 27, 2020 · 8 comments
Closed

Comments

@seanlaw
Copy link
Collaborator

seanlaw commented Feb 27, 2020

According to #3

a type that uses k bits will need approximately k more time so try to use the smallest type you can

It would be nice to add a function to handle this automatically behind the scenes

@louisabraham
Copy link
Owner

The optimal solution (in terms of dtype size) would be to use np.unique. However, a simpler solution that just tests on inp.max() - inp.min() will be faster.
Since most users don't have a small number of different values covering a big range, this solution is the best.

@seanlaw
Copy link
Collaborator Author

seanlaw commented Feb 27, 2020

I would agree with this. I think we just need to show some documentation or a good example that demonstrates how much of a waste it is to have an array like:

x = np.random.randint(0, 2, size=100)
x[0] = 1e30

And one should convert this to integers between 0 and 2 (inclusive).

@seanlaw seanlaw reopened this Feb 27, 2020
@seanlaw seanlaw closed this as completed Feb 27, 2020
Repository owner deleted a comment from seanlaw Feb 27, 2020
louisabraham added a commit that referenced this issue Feb 27, 2020
@louisabraham
Copy link
Owner

from time import time
from pydivsufsort import divsufsort


n = 1_000_000

random_string = np.random.randint(255, size=n, dtype=np.uint8)

d = time()
divsufsort(random_string)
print(time() - d)

random_string = random_string.astype(np.uint64)

d = time()
divsufsort(random_string)
print(time() - d)

random_string[0] = -1

d = time()
divsufsort(random_string)
print(time() - d)
0.2969942092895508
0.3129754066467285
1.618661880493164

On such a string, np.unique takes about 120 ms so it's not worth it.

@louisabraham
Copy link
Owner

Sorry, github is currently experiencing problems and I deleted your double comment. It looks like both are gone now.

@seanlaw
Copy link
Collaborator Author

seanlaw commented Feb 27, 2020

On such a string, np.unique takes about 120 ms so it's not worth it.

I can't tell from this comment if you think this is too fast or slow? IIRC, np.unique sorts the array first but, technically, we don't need sorting

@seanlaw seanlaw reopened this Feb 27, 2020
@louisabraham
Copy link
Owner

120 ms is comparable to the cost of divsufsort on a string with the same dtype. Hence the user will pay in most cases a non negligible cost that there is no way to avoid. There is no way to make np.unique faster or to use it without sorting.

The current version includes an optimization that will work in most cases and doesn't cost much.

Why did you reopen the issue? To remember to put that in an example?

@louisabraham
Copy link
Owner

I could add a parameter to launch np.unique but just doing it will take the same amount of code, and I want the interface to be simple.

@seanlaw
Copy link
Collaborator Author

seanlaw commented Feb 27, 2020

Cool. I'm with you and agree that a simple interface is preferred.

Sorry for re-opening. I couldn't comment without re-opening. For issues that aren't closed by an actual commit, I tend to prefer to leave it open for at least 14 days (2 weeks) to allow for discussion and additional thoughts/considerations. After 2 weeks, I usually close and make a comment like "feel free to re-open..."

Not trying to dictate here but this approach might be useful in order to avoid constantly opening and closing issues just to be able to comment. It would reduce the amount of email notifications.

@seanlaw seanlaw closed this as completed Mar 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants