Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Temporary file cleanup #18

Closed
barrust opened this issue Nov 10, 2017 · 7 comments · Fixed by #22
Closed

Temporary file cleanup #18

barrust opened this issue Nov 10, 2017 · 7 comments · Fixed by #22
Assignees

Comments

@barrust
Copy link
Collaborator

barrust commented Nov 10, 2017

When pulling images, goose does not clean up the temporary directory of files. It would be beneficial to properly clean up the temp files when goose closes.

Perhaps, turning goose into a context manager would help.

@barrust barrust self-assigned this Nov 10, 2017
@barrust
Copy link
Collaborator Author

barrust commented Nov 11, 2017

I see two possible solutions to this issue:

  1. Generate a unique sub folder in the desired temp directory for each instance of Goose and on close remove the sub folder. I.e., use a uuid to make it unique enough that if someone had multiple instances running it wouldn't cause an issue on cleanup.
  2. Keep track of all temp files generated (specifically for the images) and then release them at the end

Are there preferences on how best to ensure we are freeing the temporary files?

@lababidi
Copy link
Contributor

These seem like a good idea with the unique id. I'm not sure if you need to keep track of the files if you already have a pointer to the folder I believe that should be good enough.

@barrust
Copy link
Collaborator Author

barrust commented Nov 11, 2017

@lababidi you are correct, with the unique folder, we would not need to keep track of individual files. The down side is that separate instances of Goose would not be able to leverage already downloaded images. I actually started implementing option 2 this morning after thinking about it a bit more.

What I have so far allows for the following syntax:

Standard format

from goose3 import Goose

sites = [site1, site2, site3,...]
goose = Goose()
for site in sites:
    goose.extract(url=site)
goose.close()  # it will attempt to close out when garbage collected also

Context manager

from goose3 import Goose

sites = [site1, site2, site3,...]
with Goose() as goose:
    for site in sites:
       article = goose.extract(url=site)

Multithreaded

Also, one could do the following to parallelize the extraction while still leveraging the images downloaded by other threads - this is rough pseudo code and I have not tested any race conditions:

import threading
from goose import Goose

def worker(sites):
    goose = Goose()  # Could also use a context manager
    for site in sites:
        article.extract(url=site)
    goose.close()

threads = []
all_sites = [site1, site2, site3, ....]
for i in range(5):
    t = threading.Thread(target=worker, args=([all_sites[x:x+seg_length] for x in range(0,len(all_sites), seg_length)]))
    threads.append(t)
    t.start()

Switching back to option 1 will not allow different threads to use the same files. It is kinda a cart before the horse issue, do we care about that use case?

@lababidi
Copy link
Contributor

lababidi commented Nov 11, 2017 via email

@barrust
Copy link
Collaborator Author

barrust commented Nov 11, 2017

I agree. I will submit a PR with the tracking of files solution.

We can figure out multithreading (possibly need something like grequests to accomplish it.

@barrust
Copy link
Collaborator Author

barrust commented Nov 17, 2017

I submitted a different PR that implements option 1. Either one should work well. I added more tests to ensure the option 1 version works as expected.

@barrust
Copy link
Collaborator Author

barrust commented Nov 21, 2017

While working on #21 I found why the relase_resources() function was not working (yes, it is misspelled). With the fix in #22 it is working correctly and this could be closed without either of the previous PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants