Make markovify.Text accept a file-like object or generator to reduce memory footprint when using large files? #54

Closed
Oliver2213 opened this issue Jan 21, 2017 · 3 comments

@Oliver2213

I've written a script to recursively find all files with a given extension, generate a chain for each, and (once all files have an associated chain) combine them into one mega-chain and store it.
I'm running this on a very large directory (~1.4 GB), and while coding my script I was aware that holding all of that in RAM (as markovify.Text only accepts strings) would probably be an issue.
I was correct: not two seconds after I ran it, the process was killed.
Is there a way to modify .Text and .NewlineText so they can accept (and properly process, of course) a generator or file-like object to iterate over?
I have no problem implementing this myself and filing a pull request; I'm just unsure how to deal with sentence splitting along chunks.

@danthedaniel

danthedaniel commented Feb 24, 2017

> I'm just unsure how to deal with sentence splitting along chunks.

I'd recommend using a generator internally for this: it runs over an iterable (a generator, a list, or something else) and only yields a new sentence when it encounters a `!`, `?`, or `.`.

That way you're relying on Python to maintain state for you, rather than tracking it yourself in explicit variables.
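
A minimal sketch of that idea, not markovify's actual implementation: the names `iter_sentences` and `read_chunks` are hypothetical, and the splitting rule (terminal punctuation followed by whitespace) is a simplifying assumption that will mis-split abbreviations such as "Dr. Smith". The generator keeps any unfinished text in a buffer between chunks, so a sentence that straddles a chunk boundary is only yielded once its ending has been seen.

```python
import io
import re
from typing import Iterable, Iterator, TextIO

# Assumption for this sketch: a sentence ends at one or more of . ! ?
# followed by whitespace. Real corpora need a more careful rule.
_SENTENCE_END = re.compile(r"[.!?]+\s+")


def read_chunks(file_obj: TextIO, size: int = 65536) -> Iterator[str]:
    """Yield fixed-size text chunks from a file-like object opened in text mode."""
    # Two-argument iter() calls read() repeatedly until it returns "".
    return iter(lambda: file_obj.read(size), "")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from an iterable of arbitrary text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        start = 0
        for match in _SENTENCE_END.finditer(buffer):
            sentence = buffer[start:match.end()].strip()
            if sentence:
                yield sentence
            start = match.end()
        # Keep the unfinished remainder for the next chunk.
        buffer = buffer[start:]
    # Whatever is left at end of input is treated as the final sentence.
    tail = buffer.strip()
    if tail:
        yield tail


# Usage: the chunk size is deliberately tiny so sentences span chunks.
demo = io.StringIO("One. Two? Three!")
sentences = list(iter_sentences(read_chunks(demo, size=4)))
```

Only the current buffer is held in memory, so peak usage is bounded by the chunk size plus the longest sentence rather than by the size of the file.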

@ghost

ghost commented Feb 27, 2017

Also, you could try PyPy for some speedups.

@jsvine
Owner

jsvine commented Sep 2, 2017

Thanks for suggesting this. It's an improvement I'd been meaning to make, and it's now available in v0.6.1. Fetch the latest with `pip install -U markovify`, and see the instructions here: https://github.com/jsvine/markovify#generating-markovifytext-models-from-very-large-corpora

Does that improve performance for you?

jsvine closed this as completed on Sep 29, 2017