
Partition data more explicitly during indexing? #39

Closed
skystrife opened this issue Mar 31, 2014 · 1 comment

Comments

@skystrife
Member

Based on our observations of indexing speed with the reddit dataset, I wonder whether we would get better performance if we pre-split the data before passing documents off to the analyzer in each thread, as opposed to the current situation where all of the threads potentially compete for the mutex that guards the shared queue (roughly the pattern sketched below).
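For illustration, here is a minimal sketch of the shared-queue pattern described above, in plain C++ rather than MeTA's actual indexing code; `document` and `tokenize` are placeholder names. Every pop of the queue takes the same lock, so very small documents mean the lock is acquired very frequently:

```cpp
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct document { std::string content; };

void tokenize(const document&) { /* analyzer work would go here */ }

int main() {
    std::queue<document> shared_queue;           // filled by the corpus reader
    for (int i = 0; i < 1000; ++i)
        shared_queue.push(document{"tiny doc"});

    std::mutex queue_mutex;                      // guards the shared queue
    unsigned num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0) num_threads = 4;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&]() {
            while (true) {
                document doc;
                {
                    // every worker competes for this one lock per document
                    std::lock_guard<std::mutex> lock{queue_mutex};
                    if (shared_queue.empty())
                        return;                  // no more work
                    doc = std::move(shared_queue.front());
                    shared_queue.pop();
                }
                tokenize(doc);                   // contention-free part
            }
        });
    }
    for (auto& w : workers)
        w.join();
}
```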

Basically, what I'm imagining is lazily loading each document's content instead of loading it when it is read from the corpus, then creating one large vector of all of the documents, partitioning it into num_threads parts, and having each thread tokenize only its own segment, as in the sketch below. Perhaps that would eliminate the contention for the mutex? This is likely to be a bigger concern when documents are very small.
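A hypothetical sketch of that scheme (again not MeTA's actual API; `lazy_document`, `load`, and `tokenize` are made-up names): the vector is split into contiguous ranges, one per thread, so no shared queue or mutex is needed during tokenization.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

struct lazy_document {
    std::string path;                            // content is loaded on demand
    std::string load() const { return "..."; /* read from disk here */ }
};

void tokenize(const std::string&) { /* analyzer work would go here */ }

int main() {
    std::vector<lazy_document> docs(1000, lazy_document{"doc.txt"});

    unsigned num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0) num_threads = 4;
    std::size_t chunk = (docs.size() + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(docs.size(), begin + chunk);
        if (begin >= end)
            break;                               // fewer chunks than threads
        workers.emplace_back([&docs, begin, end]() {
            // each thread touches only its own slice: no locking required
            for (std::size_t i = begin; i < end; ++i)
                tokenize(docs[i].load());
        });
    }
    for (auto& w : workers)
        w.join();
}
```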

@skystrife
Member Author

Going to close this as it's no longer relevant. I'm satisfied with the current state of indexing performance in the develop branch.
