Improving Context to follow FastContext #70

Closed
MichalMalyska opened this issue May 25, 2021 · 1 comment
Labels
enhancement New feature or request

Comments

@MichalMalyska
Contributor

https://www.sciencedirect.com/science/article/pii/S1532046418301576
This paper makes a pretty strong claim that their implementation of ConText is two orders of magnitude faster (which I doubt is achievable in spacy) and also more accurate. I think it would be worth trying to match that performance:
https://github.com/jianlins/FastContext

@turbosheep
Member

I think this is an interesting idea. Performance of medspacy has been a concern for us, but I don't think any of our processes have really stress-tested it. The "extreme example" in the intro is the team I work on, and we expect our completed systems to be able to process millions of records per day.

We also work closely with the authors of this paper and they have contributed to other components of medspacy. We can easily ask them if they have any thoughts for speeding up the code inside each component.

I do have a few initial thoughts on it, though:

  1. One of our goals with medspacy was to be able to leverage the optimizations and work that others were doing (which was largely done in python for the broader NLP/ML community). Implementing the specialized trie for FastContext would require fully abandoning the spacy matchers and may need to be entirely custom.
  2. Medspacy components are not currently compatible with spacy's built-in multiprocessing due to (we believe) some missing or incorrect serialization methods. Fixing that is most likely a much quicker path to significant performance increases than reimplementing the current context algorithm. Parallelization is not a substitute for FastContext, as mentioned in the paper, but it may be a quicker fix for performance than changing the component significantly.
  3. We created common internal structures for medspacy's context, sectionizer and target matcher in the last major release. If this change were made, it could possibly be done inside this framework and benefit all three components. This would be more involved than simply implementing FastContext, but it is still a point in favor of eventually making the change.
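
To make point 1 a bit more concrete, here is a minimal sketch of the general idea of matching context rules with a token-level trie instead of the spacy matchers. This is only an illustration of the technique, not FastContext's actual data structure or medspacy's code: the `RuleTrie` class, its methods, and the example rules and category labels are all hypothetical names chosen for this example.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One node per token; `category` is set when a rule phrase ends here."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.category: Optional[str] = None


class RuleTrie:
    """Hypothetical token-level trie holding context rule phrases."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add_rule(self, phrase: str, category: str) -> None:
        # Walk (and create) one node per lowercased, whitespace-split token.
        node = self.root
        for token in phrase.lower().split():
            node = node.children.setdefault(token, TrieNode())
        node.category = category

    def find_all(self, tokens: List[str]) -> List[Tuple[int, int, str]]:
        """Return (start, end, category) for the longest match at each start."""
        matches: List[Tuple[int, int, str]] = []
        for start in range(len(tokens)):
            node = self.root
            best: Optional[Tuple[int, int, str]] = None
            for end in range(start, len(tokens)):
                node = node.children.get(tokens[end].lower())
                if node is None:
                    break  # no rule continues with this token
                if node.category is not None:
                    best = (start, end + 1, node.category)
            if best is not None:
                matches.append(best)
        return matches


# Illustrative rules only; the labels are assumptions for this example.
trie = RuleTrie()
trie.add_rule("no", "NEGATED_EXISTENCE")
trie.add_rule("no evidence of", "NEGATED_EXISTENCE")
trie.add_rule("history of", "HISTORICAL")

tokens = "There is no evidence of pneumonia".split()
matches = trie.find_all(tokens)
print(matches)  # [(2, 5, 'NEGATED_EXISTENCE')]
```

Each lookup follows one dictionary hop per token, so the cost at a given start position depends on the length of the longest matching rule, not on how many rules are loaded. The trade-off from point 1 still applies: a structure like this sits outside the spacy matchers and would have to be maintained as custom code.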

@turbosheep added the enhancement label May 26, 2021
@MichalMalyska closed this as not planned Aug 23, 2023