fix for issue #20 regarding performance/speed and the use of the spaCy Matcher #21
Conversation
…edit in README regarding the default_general_domain.en.csv
Thanks for pointing out the performance issue with reinitializing a matcher every time, for benchmarking it, and for making the PR. It looks like the Travis builds failed because the matcher is no longer found as a class-level property; perhaps initializing the matcher at the instance level could work. I'm not sure why the multiprocessing does not lead to a performance increase, as I remember from previous benchmarking that 6 cores gave about a 3x speedup. A bit unrelated to your specific issue, but the following could potentially lead to a further increase in performance. What I was originally intending with the

Currently, it looks something like the following (current implementation):

For a set of sufficiently similar documents, which I'm not sure applies to your case, it is probably possible to find most candidates with only a fraction of the documents, which could greatly accelerate step 1 of the desired implementation. Anyway, thanks for making the PR. It would be nice if you could deal with Travis not passing (probably by setting the matcher to an instance property and editing the
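A minimal sketch of what initializing the matcher at the instance level might look like (the class body here is illustrative, not pyate's actual internals; the three-argument `Matcher.add` call matches the spaCy 2.x API used elsewhere in this thread):

```python
from spacy.matcher import Matcher

class TermExtraction:
    # illustrative patterns only; pyate defines its own pattern list
    patterns = [
        [{"POS": "ADJ"}, {"POS": "NOUN"}],
        [{"POS": "NOUN"}, {"POS": "NOUN"}],
    ]

    def __init__(self, nlp):
        # build the Matcher once per instance and reuse it for every
        # document, instead of re-adding the patterns on each call
        self.matcher = Matcher(nlp.vocab)
        for i, pattern in enumerate(self.patterns):
            # spaCy 2.x signature: add(key, on_match, *patterns)
            self.matcher.add("term{}".format(i), None, pattern)

    def extract(self, doc):
        return self.matcher(doc)
```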
This snippet:

```python
import pandas as pd
import tqdm
from pyate import combo_basic

if __name__ == '__main__':
    series = pd.read_csv('src/pyate/default_general_domain.en.csv')["SECTION_TEXT"]  # type: pd.Series
    for document_str in tqdm.tqdm(series):
        top_keywords = combo_basic(document_str).sort_values(ascending=False).head(5).index.tolist()
```

now takes 1 minute and 20 seconds; previously, it took 18 minutes. This snippet, however, which passes the whole series to combo_basic at once:

```python
import pandas as pd
from pyate import combo_basic

if __name__ == '__main__':
    series = pd.read_csv('src/pyate/default_general_domain.en.csv')["SECTION_TEXT"]  # type: pd.Series
    top_keywords = combo_basic(series.tolist(), verbose=True).sort_values(ascending=False).head(5).index.tolist()
```

used to take 5 minutes and 20 seconds; now it takes 5 minutes and 6 seconds, so roughly the same. It is strange, but I have not investigated it yet. Also, I am noticing that the callback of the
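One way to isolate whether spaCy's own multiprocessing helps on this corpus, independent of pyate, is to time `nlp.pipe` directly. A rough sketch (the `n_process` argument exists in spaCy 2.2.2+; the model name, process count, and batch size are assumptions):

```python
import time

import pandas as pd
import spacy

if __name__ == "__main__":  # required for multiprocessing on Windows
    nlp = spacy.load("en_core_web_sm")
    texts = pd.read_csv("src/pyate/default_general_domain.en.csv")["SECTION_TEXT"].tolist()

    # compare single-process against multi-process parsing of the same corpus
    for n_process in (1, 4):
        start = time.time()
        docs = list(nlp.pipe(texts, n_process=n_process, batch_size=50))
        print("n_process={}: parsed {} docs in {:.1f}s".format(
            n_process, len(docs), time.time() - start))
```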
That's great, and thanks for benchmarking once again. Let's fix the Travis build to get this PR in and then we can look into taking advantage of multiprocessing. I will benchmark it on my machine too when I get the chance.
That's a good point. I will probably set this to
```python
for i, pattern in enumerate(TermExtraction.patterns):
    TermExtraction.matcher.add("term{}".format(i), add_to_counter, pattern)
matches = TermExtraction.matcher(doc)
```
@kevinlu1248 I changed the way the Matcher is initialized because I noticed that `TermExtraction.matcher` always uses the English model. But if a user does something like:

```python
nlp = spacy.load('de_core_news_sm')
nlp.add_pipe(TermExtractionPipeline(func))
```

then the Matcher would use the English vocab (as defined in the `TermExtraction` class) and not the German vocabulary. Hence, I added the `nlp` object as a parameter to `TermExtractionPipeline`'s `__init__` and initialized the Matcher as `self.matcher = Matcher(nlp.vocab)`.
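A condensed sketch of the change described above (simplified relative to the real class in src/pyate/term_extraction.py; the argument order of `__init__`, the placeholder pattern, and the use of `func` as the on_match callback are assumptions):

```python
from spacy.matcher import Matcher

# placeholder pattern for illustration; the real patterns live in TermExtraction
PATTERNS = [[{"POS": "ADJ"}, {"POS": "NOUN"}]]

class TermExtractionPipeline:
    def __init__(self, nlp, func):
        self.func = func
        # build the Matcher from the caller's vocab so it works with
        # whatever language model the pipeline is added to, not only English
        self.matcher = Matcher(nlp.vocab)
        for i, pattern in enumerate(PATTERNS):
            self.matcher.add("term{}".format(i), self.func, pattern)

    def __call__(self, doc):
        self.matcher(doc)  # fires self.func for each match
        return doc
```

With that change, `nlp.add_pipe(TermExtractionPipeline(nlp, func))` in the German example above would build the Matcher from the German vocab rather than the English one.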
Awesome, thanks @stelmath. Sorry I didn't see this message earlier. I will merge the PR now.
This pull request stems from issue #20
My actions:

- `python setup.py install`
- `pip install -r requirements.txt`
- working on Windows 10
So, I made the edits in src/pyate/term_extraction.py and created a test script inside the pyate project folder to do some testing (please read the comments to understand the rationale behind the proposed code edits):
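The script itself is not reproduced in this thread; here is a sketch of the kind of timing comparison it runs, reconstructed from the snippets above:

```python
import time

import pandas as pd
from pyate import combo_basic

if __name__ == "__main__":
    series = pd.read_csv("src/pyate/default_general_domain.en.csv")["SECTION_TEXT"]

    # per-document extraction: ~18 min before the fix, ~1 min 20 s after
    start = time.time()
    for document_str in series:
        combo_basic(document_str).sort_values(ascending=False).head(5)
    print("per-document: {:.1f}s".format(time.time() - start))

    # whole-corpus extraction: roughly 5 min both before and after
    start = time.time()
    combo_basic(series.tolist(), verbose=True).sort_values(ascending=False).head(5)
    print("batch: {:.1f}s".format(time.time() - start))
```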
A note: with multiprocessing, there is no real difference between the code before and the code after this change, but I can't explain why.