Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Qihui/store many vectors #76

Merged
merged 7 commits into from
Oct 21, 2018

Conversation

xieqihui
Copy link
Contributor

@xieqihui xieqihui commented Feb 7, 2018

  1. Implement store_many_vectors() for RedisStorage() using pipeline(). This has significantly increased speed.
  2. Add store_many_vectors() support for Storage() and MemoryStorage() by simply using a for loop of store_vector().

Store a batch of vectors.
Stores vector and JSON-serializable data in bucket with specified key.
"""
for idx, v in enumerate(vs):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest

if data is None:
    data = itertools.repeat(data)
for v, k, d in zip(vs, bucket_keys, data):
    self.store_vector(hash_name, k, v, d)

Copy link
Collaborator

@amorgun amorgun Feb 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way you can use this code as default implementation at the base Storage class. I am not sure if this is a good idea though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have revised the storage_memory following your suggestion.
However, I think it may not be good to implement this as default implementation at the base Storage. At least we find that Redis can use a more natural way to achieve it more efficiently. Leaving store_many_vectors() as an abstract method may force future contributors to think about efficient ways when implementing for other storage options.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xieqihui I agree. Let's keep it an abstract method.

"""
if data is None:
data = itertools.repeat(data)
for v, k, d in zip(vs, bucket_keys, data):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better performance in Python 2 use

from future.builtins import zip

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this line in commit 8d97980

"""
with self.redis_object.pipeline() as pipeline:
for idx, v in enumerate(vs):
redis_key = self._format_redis_key(hash_name, bucket_keys[idx])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest extracting internal method _add_vector(self, hash_name, bucket_keys, vs, data, redis) and using it as

def store_vector(self, hash_name, bucket_key, v, data):
    self._add_vector(hash_name, bucket_key, v, data, self.redis)

def store_many_vectors(self, hash_name, bucket_keys, vs, data):
    with self.redis_object.pipeline() as pipeline:
        for bucket_key, v in zip(bucket_keys, vs):
            self._add_vector(hash_name, bucket_key, v, data, pipeline)
        pipeline.execute()

It should help with code duplication.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amorgun @xieqihui Are you happy with the state of the pull request? Would merge it in then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long delay. I haven't added unit tests for this PR. Pls give me one week's time, i will finish it up.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xieqihui I was inactive for a long time here as well, no problem. We all have real jobs right :)

Take your time! Anything between 1 and 4 weeks would be nice, so that we move one a bit.

Thanks for responding so fast!

Cheers

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pixelogik Two of the issues I pointed out in comments are still here and I would like to see them fixed.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amorgun Thanks for the feedback. It's in progress.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this suggestion. I included this modification in commit 4e001a1

@xieqihui
Copy link
Contributor Author

xieqihui commented Aug 6, 2018

@pixelogik @amorgun I have addressed all your feedback and added test cases. However, the CI checks failed in setting up environment for Python 3.3 tests. It doesn't seem to be related to my code. Could you please help check it?

@pixelogik
Copy link
Owner

@xieqihui I have no time this week because of work. Do you have capacity @amorgun ?

@amorgun
Copy link
Collaborator

amorgun commented Aug 7, 2018

@pixelogik The request looks good to me. I fixed the issue with CI in #80

@pixelogik pixelogik merged commit efb6d13 into pixelogik:master Oct 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants