This repository has been archived by the owner on Jun 1, 2023. It is now read-only.

fix for issue #20 regarding performance/speed and the use of the spaCy Matcher #21

Merged
merged 2 commits into kevinlu1248:master on Nov 11, 2020

Conversation

@stelmath (Contributor) commented Nov 1, 2020

This pull request stems from issue #20

My actions:

  • cloned my fork of pyate
  • created Conda environment
  • python setup.py install
  • pip install -r requirements.txt

Working on Windows 10

So, I made the edits in src/pyate/term_extraction.py and created a test script inside the pyate project folder to do some testing (please read the comments to understand the rationale behind the proposed code edits):

import pandas as pd
from pyate import combo_basic
import tqdm

# I am on Windows, so I need to wrap this in a __name__ == '__main__' for the multiprocessing to work properly
if __name__ == '__main__':
    series = pd.read_csv('src/pyate/default_general_domain.en.csv')["SECTION_TEXT"]  # type: pd.Series

    # In this case, individual strings are passed into combo_basic().
    # Notice that for each of those strings the Matcher patterns are now added to a brand-new Matcher object
    # instead of a Matcher object defined at the TermExtraction class level.
    # Also notice that the iterations/second in the tqdm progress bar no longer diminishes over time, since
    # the Matcher object is now created anew per instance of the TermExtraction class. In the previous
    # implementation, the same Matcher object was shared among all documents and, more importantly, the same
    # pattern rules were added to that Matcher again and again. The Matcher does not ignore rules that have
    # already been added, which is why the document loop kept getting slower and slower as it progressed.
    for document_str in tqdm.tqdm(series):
        top_keywords = combo_basic(document_str).sort_values(ascending=False).head(5).index.tolist()

    # Alternatively, you can pass all the strings at once and leverage a multiprocessing Pool.
    # This also works with the new proposed code. The previous code had the same Matcher issue: each Matcher,
    # per process, was being fed the same rules again and again. However, on my machine, the for loop above
    # runs much faster than the multiprocessing approach.
    top_keywords = combo_basic(series.tolist(), verbose=True).sort_values(ascending=False).head(5).index.tolist()

    print(top_keywords)

A note: with multiprocessing, there is no real difference between the old code and the new code, but I can't explain why.
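The instance-level initialization described above can be sketched as follows. This is a minimal illustration, not the actual pyate source: the class name and patterns are placeholders, and spaCy v3's `Matcher.add(key, [pattern])` signature is assumed.

```python
import spacy
from spacy.matcher import Matcher


class TermExtractionSketch:
    # Placeholder patterns; pyate's real patterns match part-of-speech tags.
    patterns = [
        [{"LOWER": "machine"}, {"LOWER": "learning"}],
    ]

    def __init__(self, nlp):
        # The Matcher is built once per instance, so each rule is added
        # exactly once instead of being re-added for every document.
        self.matcher = Matcher(nlp.vocab)
        for i, pattern in enumerate(self.patterns):
            self.matcher.add("term{}".format(i), [pattern])

    def count_candidates(self, doc):
        return len(self.matcher(doc))


nlp = spacy.blank("en")
extractor = TermExtractionSketch(nlp)
n = extractor.count_candidates(nlp("machine learning is fun"))  # one match
```

Because the rules live on the instance, processing a second document reuses the same already-populated Matcher instead of growing its rule set.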

@kevinlu1248 (Owner) commented:
Thanks for pointing out the performance issue with reinitializing a matcher every time, benchmarking, and making the PR. It looks like the Travis builds failed because the matcher can no longer be found as a class-level property. Perhaps initializing the matcher at the instance level could work. I'm not too sure why the multiprocessing does not lead to an increase in performance, as I remember from previous benchmarking that 6 cores led to about a 3X performance increase.

A bit unrelated to your specific issue but the following could potentially lead to a further increase in performance.

What I was originally intending with the technical_counts parameter of combo_basic (and the other algorithms) was to have the algorithm run in the following manner for a large number of documents: (desired implementation)

  1. Find all candidate terms from the documents using spaCy's pattern matcher (this is the bottle-neck)
  2. Count the frequency that each candidate appears in every document individually
  3. Perform calculations specified by combo_basic for every document

Currently, it looks something like the following: (current implementation)

  1. For every document, find all candidate terms, then count the frequency for these terms
  2. Perform calculations specified by combo_basic

For a set of sufficiently similar documents (I'm not sure whether that applies to your case), it's probably possible to find most candidates using only a fraction of the documents, which could greatly accelerate step 1 of the desired implementation.
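The desired two-pass flow could be sketched roughly like this. It is a pure-Python stand-in for illustration only: a bigram set plays the role of the spaCy pattern matcher, and all function names are hypothetical.

```python
from collections import Counter


def find_candidates(documents):
    # Step 1 (the bottleneck in the real pipeline): run candidate extraction
    # once over the whole corpus. Here any adjacent word pair stands in for
    # a spaCy Matcher hit.
    candidates = set()
    for doc in documents:
        words = doc.lower().split()
        candidates.update(zip(words, words[1:]))
    return candidates


def count_per_document(documents, candidates):
    # Step 2: count how often each candidate appears in each document
    # individually; step 3 would feed these counts into combo_basic's scoring.
    counts = []
    for doc in documents:
        words = doc.lower().split()
        bigrams = Counter(zip(words, words[1:]))
        counts.append({c: n for c, n in bigrams.items() if c in candidates})
    return counts


docs = ["deep learning models", "deep learning beats shallow learning"]
cands = find_candidates(docs)
per_doc = count_per_document(docs, cands)
```

The point of the split is that the expensive matcher pass runs once over the corpus (or a sample of it), while the cheap per-document counting runs everywhere.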

Anyways, thanks for making the PR. It would be nice if you could deal with Travis not passing (probably by setting the matcher to an instance property and editing term_extraction_pipeline.py), but if not, I will work on it when I get the chance. Also, how much did this fix improve the performance by?

@stelmath (Contributor, Author) commented Nov 2, 2020

Also, how much did this fix improve the performance by?

This snippet:

import pandas as pd
from pyate import combo_basic
import tqdm

if __name__ == '__main__':
    series = pd.read_csv('src/pyate/default_general_domain.en.csv')["SECTION_TEXT"]  # type: pd.Series
    for document_str in tqdm.tqdm(series):
        top_keywords = combo_basic(document_str).sort_values(ascending=False).head(5).index.tolist()

Takes 1 min and 20 seconds now. Previously, it took 18 minutes.

This snippet, however, which uses multiprocessing

import pandas as pd
from pyate import combo_basic
import tqdm

if __name__ == '__main__':
    series = pd.read_csv('src/pyate/default_general_domain.en.csv')["SECTION_TEXT"]  # type: pd.Series
    top_keywords = combo_basic(series.tolist(), verbose=True).sort_values(ascending=False).head(5).index.tolist()

used to take 5 minutes and 20 seconds. Now it takes 5 minutes and 6 seconds, so roughly the same. It is strange, but I have not investigated it yet.

Also, I am noticing that the callback of the Pool() uses np.int64 for the Series dtype. Isn't that overkill? Wouldn't int32 or int16 suffice?

@kevinlu1248 (Owner) commented:

Takes 1 min and 20 seconds now. Previously, it took 18 minutes.
Takes 1 min and 20 seconds now. Previously, it took 18 minutes.

That's great, and thanks for benchmarking once again. Let's fix the Travis build to get this PR in and then we can look into taking advantage of multiprocessing. I will benchmark it on my machine too when I get the chance.

Also, I am noticing that the callback of the Pool() uses np.int64 for the Series dtype. Isn't that overkill? Wouldn't int32 or int16 suffice?

That's a good point. I will probably set this to np.int16 in a separate PR and make it an optional parameter for those with large datasets.
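For scale, the dtype difference is easy to quantify with a generic NumPy illustration (not pyate code):

```python
import numpy as np

# One million term counts: int64 spends 8 bytes per element, int16 only 2,
# so the narrower dtype uses a quarter of the memory.
counts64 = np.zeros(1_000_000, dtype=np.int64)
counts16 = np.zeros(1_000_000, dtype=np.int16)
size64, size16 = counts64.nbytes, counts16.nbytes  # 8000000 vs 2000000

# The trade-off: int16 overflows once a term appears more than 32767 times,
# which is why an optional parameter for large datasets makes sense.
int16_max = np.iinfo(np.int16).max  # 32767
```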


for i, pattern in enumerate(TermExtraction.patterns):
    TermExtraction.matcher.add("term{}".format(i), add_to_counter, pattern)
matches = TermExtraction.matcher(doc)
@stelmath (Contributor, Author) commented on this diff, Nov 3, 2020
@kevinlu1248
I changed the way the Matcher was initialized because I noticed that the TermExtraction.matcher is always using the English model. But, if a user does something like:

nlp = spacy.load('de_core_news_sm')
nlp.add_pipe(TermExtractionPipeline(func))

then the Matcher would use the English vocab (as defined in the TermExtraction class) and not the German vocabulary. Hence, I added the nlp object as a parameter to TermExtractionPipeline's __init__ and initialized the Matcher as self.matcher = Matcher(nlp.vocab)
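A minimal sketch of that change (illustrative only; the real TermExtractionPipeline carries more machinery, and the class body here is a placeholder):

```python
import spacy
from spacy.matcher import Matcher


class PipelineSketch:
    def __init__(self, nlp):
        # Share the user's vocab, whatever the language, instead of a
        # hard-coded English one.
        self.matcher = Matcher(nlp.vocab)

    def __call__(self, doc):
        # A real pipeline component would act on the matches; here we
        # just run the matcher and pass the doc through.
        self.matcher(doc)
        return doc


nlp = spacy.blank("de")  # a blank pipeline needs no model download
component = PipelineSketch(nlp)
doc = component(nlp("maschinelles Lernen"))
```

With this wiring, the Matcher's vocab is always the same object as the user's nlp.vocab, so German (or any other language) tokens hash consistently.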

@kevinlu1248 (Owner) replied:

Awesome, thanks @stelmath. Sorry I didn't see this message earlier. I will merge the PR now.

@kevinlu1248 kevinlu1248 merged commit 5d7f357 into kevinlu1248:master Nov 11, 2020