Based on our observations of indexing speed with the reddit dataset, I wonder whether we would get better performance if we pre-split the data before handing documents to the analyzer in each thread, as opposed to the current situation where the threads all potentially compete for the mutex that guards the shared queue.

Basically, what I'm imagining is lazily loading each document's content instead of loading it when it's read from the corpus, then creating one big vector of all the documents, partitioning it into `num_threads` parts, and having each thread tokenize just its segment. Perhaps that would eliminate the contention on the mutex? This is likely to matter most when documents are very small, since each one spends so little time in the analyzer relative to the time spent acquiring the lock.