Temporary file cleanup #18
I see two possible solutions to this issue. Are there preferences on how best to ensure we are freeing the temporary files?
The unique id seems like a good idea. I'm not sure you need to keep track of the files: if you already have a pointer to the folder, I believe that should be good enough.
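To illustrate the "folder pointer is enough" point, here is a minimal sketch (hypothetical class and names, not the real goose3 code) where each instance owns a uniquely named temp directory, so cleanup is a single `rmtree` on the folder rather than per-file bookkeeping:

```python
import os
import shutil
import tempfile

class ImageCache:
    """Hypothetical sketch: one unique temp folder per instance, so
    cleanup needs only the folder pointer, not a list of files."""

    def __init__(self):
        # mkdtemp creates a uniquely named directory for this instance
        self.local_storage_path = tempfile.mkdtemp(prefix="goose-")

    def store(self, name, data):
        path = os.path.join(self.local_storage_path, name)
        with open(path, "wb") as fobj:
            fobj.write(data)
        return path

    def close(self):
        # removing the folder removes every file inside it
        shutil.rmtree(self.local_storage_path, ignore_errors=True)

cache = ImageCache()
saved = cache.store("img1.jpg", b"\xff\xd8")
cache.close()
```

The trade-off is exactly the one noted below: a per-instance unique folder means separate instances cannot share already-downloaded images.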
@lababidi you are correct: with the unique folder, we would not need to keep track of individual files. The downside is that separate instances of Goose would not be able to leverage already downloaded images. I actually started implementing option 2 this morning after thinking about it a bit more. What I have so far allows for the following syntax:

Standard format

```python
from goose3 import Goose

sites = [site1, site2, site3, ...]
goose = Goose()
for site in sites:
    goose.extract(url=site)
goose.close()  # it will also attempt to close out when garbage collected
```

Context manager

```python
from goose3 import Goose

sites = [site1, site2, site3, ...]
with Goose() as goose:
    for site in sites:
        article = goose.extract(url=site)
```

Multithreaded

Also, one could do the following to parallelize the extraction while still leveraging the images downloaded by other threads. This is rough pseudocode and I have not tested for race conditions:

```python
import threading

from goose3 import Goose

def worker(sites):
    goose = Goose()  # could also use a context manager
    for site in sites:
        article = goose.extract(url=site)
    goose.close()

all_sites = [site1, site2, site3, ...]
seg_length = len(all_sites) // 5

threads = []
for chunk in [all_sites[x:x + seg_length] for x in range(0, len(all_sites), seg_length)]:
    t = threading.Thread(target=worker, args=(chunk,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
```

Switching back to option 1 would not allow different threads to use the same files. It is kind of a cart-before-the-horse issue: do we care about that use case?
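For comparison, the same chunk-and-thread bookkeeping can be delegated to `concurrent.futures`, which starts and joins the threads for you. This is only an untested sketch; `extract_one` is a hypothetical stand-in for `goose.extract` so the structure can be shown without network access:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_one(url):
    # hypothetical stand-in for goose.extract(url=url)
    return "article:" + url

all_sites = ["site1", "site2", "site3", "site4", "site5"]

# the executor distributes the urls across threads and joins them on exit
with ThreadPoolExecutor(max_workers=5) as pool:
    articles = list(pool.map(extract_one, all_sites))
```

`pool.map` preserves input order, so `articles` lines up with `all_sites`.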
It sounds daunting, but personally I'd rather bake multithreading inside goose. Maybe leverage urllib or requests if they have multithreading built in; otherwise we can implement it in a basic way. I need to sit down and think about the goose object and see if we can implement multithreading within the context so that the extract method is thread safe.

Either way, I think this feature is more of a nice-to-have and not a dire necessity. Users could encapsulate a whole process end to end in a function and handle the multithreading themselves.
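One way the shared-cache part of a thread-safe extract could work is to guard the check-then-download step with a lock, so concurrent threads never fetch the same image twice or race on the same file. This is a hypothetical sketch with made-up names, not goose3 code:

```python
import threading

class SharedImageStore:
    """Hypothetical sketch: a lock serializes the check-then-download
    step so each image is fetched exactly once across all threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._paths = {}

    def fetch(self, url, downloads):
        with self._lock:
            if url not in self._paths:
                downloads.append(url)  # stands in for the real download
                self._paths[url] = "tmp/" + str(abs(hash(url)))
            return self._paths[url]

store = SharedImageStore()
downloads = []
threads = [threading.Thread(target=store.fetch,
                            args=("http://x/img.jpg", downloads))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# all eight threads asked for the same image; it was downloaded once
```

Without the lock, two threads could both see the url missing and download it twice, which is exactly the race the pseudocode above has not been tested for.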
I agree. I will submit a PR with the file-tracking solution. We can figure out multithreading later (we may need something like grequests to accomplish it).
I submitted a different PR that implements option 1. Either one should work well; I added more tests to ensure the option 1 version works as expected.
When pulling images, goose does not clean up the temporary directory of files. It would be beneficial to properly clean up the temp files when goose closes. Perhaps turning goose into a context manager would help.
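A minimal sketch of what that context manager could look like (hypothetical names and a placeholder `extract`; the real goose3 API may differ): `__exit__` delegates to `close()`, so the temp directory is freed even if extraction raises.

```python
import shutil
import tempfile

class Goose:
    """Hypothetical minimal sketch of a context-managed Goose."""

    def __init__(self):
        self.local_storage_path = tempfile.mkdtemp(prefix="goose-")
        self.closed = False

    def extract(self, url=None):
        return "article for %s" % url  # placeholder for real extraction

    def close(self):
        # delete the temporary image directory and everything in it
        shutil.rmtree(self.local_storage_path, ignore_errors=True)
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
        return False  # do not swallow exceptions from the with-body

with Goose() as g:
    article = g.extract(url="http://example.com")
```

Exiting the `with` block always calls `close()`, which is the cleanup guarantee this issue is asking for.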