Temporary file cleanup #18
I see two possible solutions to this issue. Are there preferences on how best to ensure we are freeing the temporary files?
The unique id seems like a good idea. I'm not sure you need to keep track of the files: if you already have a pointer to the folder, I believe that should be good enough.
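To illustrate the "folder pointer is enough" point, here is a minimal sketch (hypothetical class and names, not the real goose3 code) where each instance owns a uniquely named temp directory, so cleanup is a single `rmtree` on the folder rather than per-file bookkeeping:

```python
import os
import shutil
import tempfile

class ImageCache:
    """Hypothetical sketch: one unique temp folder per instance, so
    cleanup needs only the folder pointer, not a list of files."""

    def __init__(self):
        # mkdtemp creates a uniquely named directory for this instance
        self.local_storage_path = tempfile.mkdtemp(prefix="goose-")

    def store(self, name, data):
        path = os.path.join(self.local_storage_path, name)
        with open(path, "wb") as fobj:
            fobj.write(data)
        return path

    def close(self):
        # removing the folder removes every file inside it
        shutil.rmtree(self.local_storage_path, ignore_errors=True)

cache = ImageCache()
saved = cache.store("img1.jpg", b"\xff\xd8")
cache.close()
```

The trade-off is exactly the one noted below: a per-instance unique folder means separate instances cannot share already-downloaded images.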
@lababidi you are correct: with the unique folder, we would not need to keep track of individual files. The downside is that separate instances of Goose would not be able to leverage already downloaded images. I actually started implementing option 2 this morning after thinking about it a bit more. What I have so far allows for the following syntax:

Standard format

```python
from goose3 import Goose

sites = [site1, site2, site3, ...]
goose = Goose()
for site in sites:
    goose.extract(url=site)
goose.close()  # it will also attempt to close out when garbage collected
```

Context manager

```python
from goose3 import Goose

sites = [site1, site2, site3, ...]
with Goose() as goose:
    for site in sites:
        article = goose.extract(url=site)
```

Multithreaded

Also, one could do the following to parallelize the extraction while still leveraging the images downloaded by other threads. This is rough pseudocode and I have not tested for race conditions:

```python
import threading

from goose3 import Goose

def worker(sites):
    goose = Goose()  # could also use a context manager
    for site in sites:
        article = goose.extract(url=site)
    goose.close()

all_sites = [site1, site2, site3, ...]
seg_length = len(all_sites) // 5

threads = []
for chunk in [all_sites[x:x + seg_length] for x in range(0, len(all_sites), seg_length)]:
    t = threading.Thread(target=worker, args=(chunk,))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
```

Switching back to option 1 would not allow different threads to use the same files. It is kind of a cart-before-the-horse issue: do we care about that use case?
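For comparison, the same chunk-and-thread bookkeeping can be delegated to `concurrent.futures`, which starts and joins the threads for you. This is only an untested sketch; `extract_one` is a hypothetical stand-in for `goose.extract` so the structure can be shown without network access:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_one(url):
    # hypothetical stand-in for goose.extract(url=url)
    return "article:" + url

all_sites = ["site1", "site2", "site3", "site4", "site5"]

# the executor distributes the urls across threads and joins them on exit
with ThreadPoolExecutor(max_workers=5) as pool:
    articles = list(pool.map(extract_one, all_sites))
```

`pool.map` preserves input order, so `articles` lines up with `all_sites`.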
It sounds daunting, but personally I'd rather bake multithreading inside goose. Maybe leverage urllib or requests if they have multithreading built in; otherwise we can implement it in a basic way. I need to sit down and think about the goose object and see if we can implement multithreading within the context so that the extract method is thread safe.

Either way, I think this feature is more of a nice-to-have and not a dire necessity. Users could encapsulate a whole process end to end in a function and handle the multithreading themselves.
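One way the shared-cache part of a thread-safe extract could work is to guard the check-then-download step with a lock, so concurrent threads never fetch the same image twice or race on the same file. This is a hypothetical sketch with made-up names, not goose3 code:

```python
import threading

class SharedImageStore:
    """Hypothetical sketch: a lock serializes the check-then-download
    step so each image is fetched exactly once across all threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._paths = {}

    def fetch(self, url, downloads):
        with self._lock:
            if url not in self._paths:
                downloads.append(url)  # stands in for the real download
                self._paths[url] = "tmp/" + str(abs(hash(url)))
            return self._paths[url]

store = SharedImageStore()
downloads = []
threads = [threading.Thread(target=store.fetch,
                            args=("http://x/img.jpg", downloads))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# all eight threads asked for the same image; it was downloaded once
```

Without the lock, two threads could both see the url missing and download it twice, which is exactly the race the pseudocode above has not been tested for.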
I agree. I will submit a PR with the file-tracking solution. We can figure out multithreading later (we may need something like grequests to accomplish it).
I submitted a different PR that implements option 1. Either one should work well; I added more tests to ensure the option 1 version works as expected.
When pulling images, goose does not clean up the temporary directory of files. It would be beneficial to properly clean up the temp files when goose closes. Perhaps turning goose into a context manager would help.
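A minimal sketch of what that context manager could look like (hypothetical names and a placeholder `extract`; the real goose3 API may differ): `__exit__` delegates to `close()`, so the temp directory is freed even if extraction raises.

```python
import shutil
import tempfile

class Goose:
    """Hypothetical minimal sketch of a context-managed Goose."""

    def __init__(self):
        self.local_storage_path = tempfile.mkdtemp(prefix="goose-")
        self.closed = False

    def extract(self, url=None):
        return "article for %s" % url  # placeholder for real extraction

    def close(self):
        # delete the temporary image directory and everything in it
        shutil.rmtree(self.local_storage_path, ignore_errors=True)
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
        return False  # do not swallow exceptions from the with-body

with Goose() as g:
    article = g.extract(url="http://example.com")
```

Exiting the `with` block always calls `close()`, which is the cleanup guarantee this issue is asking for.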