This repository has been archived by the owner on Jun 1, 2023. It is now read-only.

fix for issue #20 regarding performance/speed and the use of the spaCy Matcher #21

Merged
merged 2 commits into kevinlu1248:master on Nov 11, 2020

Conversation

@stelmath (Contributor) commented Nov 1, 2020

This pull request stems from issue #20

My actions:

  • cloned my fork of pyate
  • created Conda environment
  • python setup.py install
  • pip install -r requirements.txt

Working on Windows 10

So, I made the edits in src/pyate/term_extraction.py and created a test script inside the pyate project folder to do some testing (please read the comments to understand the rationale behind the proposed code edits):

import pandas as pd
from pyate import combo_basic
import tqdm

# I am on Windows, so I need to wrap this in a __name__ == '__main__' for the multiprocessing to work properly
if __name__ == '__main__':
    series = pd.read_csv('src/pyate/default_general_domain.en.csv')["SECTION_TEXT"]  # type: pd.Series

    # In this case, individual strings are passed into combo_basic().
    # Notice that for each of those strings the Matcher patterns are now added to a brand-new Matcher object
    # instead of a Matcher object defined at the TermExtraction class level.
    # Also notice that the iterations/second in the tqdm progress bar no longer diminishes over time, since
    # the Matcher object is now created anew per instance of the TermExtraction class. In the previous
    # implementation, the same Matcher object was shared among all documents and, more importantly, the same
    # pattern rules were added to that Matcher again and again. The Matcher does not ignore rules that have
    # already been added, which is why the document loop kept getting slower and slower as it progressed.
    for document_str in tqdm.tqdm(series):
        top_keywords = combo_basic(document_str).sort_values(ascending=False).head(5).index.tolist()

    # Alternatively, you can pass all the strings at once and leverage a multiprocessing Pool.
    # This also works with the new proposed code. The previous code had the same Matcher issue: each Matcher,
    # per process, was being fed the same rules again and again. However, on my machine, the for loop above
    # runs much faster than the multiprocessing approach.
    top_keywords = combo_basic(series.tolist(), verbose=True).sort_values(ascending=False).head(5).index.tolist()

    print(top_keywords)

A note: with multiprocessing, there is no real difference between the old code and the new code, but I can't explain why.
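The instance-level initialization described above can be sketched as follows. This is a minimal illustration, not the actual pyate source: the class name and patterns are placeholders, and spaCy v3's `Matcher.add(key, [pattern])` signature is assumed.

```python
import spacy
from spacy.matcher import Matcher


class TermExtractionSketch:
    # Placeholder patterns; pyate's real patterns match part-of-speech tags.
    patterns = [
        [{"LOWER": "machine"}, {"LOWER": "learning"}],
    ]

    def __init__(self, nlp):
        # The Matcher is built once per instance, so each rule is added
        # exactly once instead of being re-added for every document.
        self.matcher = Matcher(nlp.vocab)
        for i, pattern in enumerate(self.patterns):
            self.matcher.add("term{}".format(i), [pattern])

    def count_candidates(self, doc):
        return len(self.matcher(doc))


nlp = spacy.blank("en")
extractor = TermExtractionSketch(nlp)
n = extractor.count_candidates(nlp("machine learning is fun"))  # one match
```

Because the rules live on the instance, processing a second document reuses the same already-populated Matcher instead of growing its rule set.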

@kevinlu1248 (Owner) commented:
Thanks for pointing out the performance issue with reinitializing a matcher every time, benchmarking, and making the PR. It looks like the Travis builds failed because the matcher can no longer be found as a class-level property. Perhaps initializing the matcher at the instance level could work. I'm not too sure why the multiprocessing does not lead to an increase in performance, as I remember from previous benchmarking that 6 cores led to about a 3X performance increase.

A bit unrelated to your specific issue but the following could potentially lead to a further increase in performance.

What I was originally intending with the technical_counts parameter of combo_basic (and the other algorithms) was to have the algorithm run in the following manner for a large number of documents: (desired implementation)

  1. Find all candidate terms from the documents using spaCy's pattern matcher (this is the bottle-neck)
  2. Count the frequency that each candidate appears in every document individually
  3. Perform calculations specified by combo_basic for every document

Currently, it looks something like the following: (current implementation)

  1. For every document, find all candidate terms, then count the frequency for these terms
  2. Perform calculations specified by combo_basic

For a set of sufficiently similar documents (I'm not sure whether that applies to your case), it's probably possible to find most candidates using only a fraction of the documents, which could greatly accelerate step 1 of the desired implementation.
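The desired two-pass flow could be sketched roughly like this. It is a pure-Python stand-in for illustration only: a bigram set plays the role of the spaCy pattern matcher, and all function names are hypothetical.

```python
from collections import Counter


def find_candidates(documents):
    # Step 1 (the bottleneck in the real pipeline): run candidate extraction
    # once over the whole corpus. Here any adjacent word pair stands in for
    # a spaCy Matcher hit.
    candidates = set()
    for doc in documents:
        words = doc.lower().split()
        candidates.update(zip(words, words[1:]))
    return candidates


def count_per_document(documents, candidates):
    # Step 2: count how often each candidate appears in each document
    # individually; step 3 would feed these counts into combo_basic's scoring.
    counts = []
    for doc in documents:
        words = doc.lower().split()
        bigrams = Counter(zip(words, words[1:]))
        counts.append({c: n for c, n in bigrams.items() if c in candidates})
    return counts


docs = ["deep learning models", "deep learning beats shallow learning"]
cands = find_candidates(docs)
per_doc = count_per_document(docs, cands)
```

The point of the split is that the expensive matcher pass runs once over the corpus (or a sample of it), while the cheap per-document counting runs everywhere.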

Anyways, thanks for making the PR. It would be nice if you could deal with Travis not passing (probably by setting the matcher to an instance property and editing term_extraction_pipeline.py), but if not, I will work on it when I get the chance. Also, how much did this fix improve the performance by?

@stelmath (Contributor, Author) commented Nov 2, 2020

Also, how much did this fix improve the performance by?

This snippet:

import pandas as pd
from pyate import combo_basic
import tqdm

if __name__ == '__main__':
    series = pd.read_csv('src/pyate/default_general_domain.en.csv')["SECTION_TEXT"]  # type: pd.Series
    for document_str in tqdm.tqdm(series):
        top_keywords = combo_basic(document_str).sort_values(ascending=False).head(5).index.tolist()

Takes 1 min and 20 seconds now. Previously, it took 18 minutes.

This snippet, however, which uses multiprocessing

import pandas as pd
from pyate import combo_basic
import tqdm

if __name__ == '__main__':
    series = pd.read_csv('src/pyate/default_general_domain.en.csv')["SECTION_TEXT"]  # type: pd.Series
    top_keywords = combo_basic(series.tolist(), verbose=True).sort_values(ascending=False).head(5).index.tolist()

used to take 5 minutes and 20 seconds. Now it takes 5 minutes and 6 seconds, so roughly the same. It is strange, but I have not investigated it yet.

Also, I am noticing that the callback of the Pool() uses np.int64 for the Series dtype. Isn't that overkill? Wouldn't int32 or int16 suffice?

@kevinlu1248 (Owner) commented:

Takes 1 min and 20 seconds now. Previously, it took 18 minutes.
Takes 1 min and 20 seconds now. Previously, it took 18 minutes.

That's great, and thanks for benchmarking once again. Let's fix the Travis build to get this PR in and then we can look into taking advantage of multiprocessing. I will benchmark it on my machine too when I get the chance.

Also, I am noticing that the callback of the Pool() uses np.int64 for the Series dtype. Isn't that overkill? Wouldn't int32 or int16 suffice?

That's a good point. I will probably set this to np.int16 in a separate PR and make it an optional parameter for those with large datasets.
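For scale, the dtype difference is easy to quantify with a generic NumPy illustration (not pyate code):

```python
import numpy as np

# One million term counts: int64 spends 8 bytes per element, int16 only 2,
# so the narrower dtype uses a quarter of the memory.
counts64 = np.zeros(1_000_000, dtype=np.int64)
counts16 = np.zeros(1_000_000, dtype=np.int16)
size64, size16 = counts64.nbytes, counts16.nbytes  # 8000000 vs 2000000

# The trade-off: int16 overflows once a term appears more than 32767 times,
# which is why an optional parameter for large datasets makes sense.
int16_max = np.iinfo(np.int16).max  # 32767
```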


for i, pattern in enumerate(TermExtraction.patterns):
    TermExtraction.matcher.add("term{}".format(i), add_to_counter, pattern)
matches = TermExtraction.matcher(doc)
@stelmath (Contributor, Author) commented on this diff, Nov 3, 2020
@kevinlu1248
I changed the way the Matcher was initialized because I noticed that the TermExtraction.matcher is always using the English model. But, if a user does something like:

nlp = spacy.load('de_core_news_sm')
nlp.add_pipe(TermExtractionPipeline(func))

then the Matcher would use the English vocab (as defined in the TermExtraction class) and not the German vocabulary. Hence, I added the nlp object as a parameter to TermExtractionPipeline's __init__ and initialized the Matcher as self.matcher = Matcher(nlp.vocab)
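A minimal sketch of that change (illustrative only; the real TermExtractionPipeline carries more machinery, and the class body here is a placeholder):

```python
import spacy
from spacy.matcher import Matcher


class PipelineSketch:
    def __init__(self, nlp):
        # Share the user's vocab, whatever the language, instead of a
        # hard-coded English one.
        self.matcher = Matcher(nlp.vocab)

    def __call__(self, doc):
        # A real pipeline component would act on the matches; here we
        # just run the matcher and pass the doc through.
        self.matcher(doc)
        return doc


nlp = spacy.blank("de")  # a blank pipeline needs no model download
component = PipelineSketch(nlp)
doc = component(nlp("maschinelles Lernen"))
```

With this wiring, the Matcher's vocab is always the same object as the user's nlp.vocab, so German (or any other language) tokens hash consistently.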

@kevinlu1248 (Owner) replied:

Awesome, thanks @stelmath. Sorry I didn't see this message earlier. I will merge the PR now.

@kevinlu1248 kevinlu1248 merged commit 5d7f357 into kevinlu1248:master Nov 11, 2020