Commit
Add Download Resume Feature
mxamin committed Jul 11, 2015
1 parent 5c233c8 commit afa350a
Showing 4 changed files with 136 additions and 47 deletions.
23 changes: 23 additions & 0 deletions README.md
@@ -222,6 +222,29 @@ instead. This is especially convenient, as typing usernames (email
addresses) and passwords directly on the command line can get tiresome (even
more if you happened to choose a "strong" password).

By default, when you interrupt the download process by pressing
CTRL+C, partially downloaded files are deleted from your disk and
you have to start the download from the beginning. If the download
was interrupted by something other than a KeyboardInterrupt
(CTRL+C), such as a sudden system crash, the partially downloaded
files remain on your disk, and the next time you start the process
these files are discarded from the download list, so it's your

@rbrito commented on Aug 5, 2015:

s/therefor/therefore/

@mxamin (author) commented on Aug 8, 2015:

Fixed

job to delete them manually before the next start. For this reason we
added an option called `--resume` which continues your downloads from
where they stopped:

coursera-dl -u <user> -p <pass> --resume sdn1-001

@rbrito commented on Aug 5, 2015:

First of all, sorry for not merging your great pull request earlier. I've been busy and I think that we may try to approach this differently.

What if we always allowed resuming (by default, like youtube-dl), by using the following strategy:

  • we always download the files with the .part suffix.
  • once we are done, we (atomically) rename the file to remove the .part suffix.
  • we assume that a file that doesn't have a .part suffix is already finished.

How does that sound?

@mxamin (author) commented on Aug 8, 2015:

No problem, it actually gave me more time to test my pull request on more courses and make sure it won't break the code :)

On adding .part suffix, GREAT idea, I'll work on that.
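The .part strategy rbrito sketches above could look roughly like this (a sketch only; `fetch` is a hypothetical helper standing in for the actual transfer code, not a function from this codebase):

```python
import os


def download_with_part_suffix(fetch, url, filename):
    """Download url to filename via a temporary .part file.

    A file that already exists without the .part suffix is assumed
    to be complete; otherwise we (re)download to filename.part and
    rename it once the transfer finishes.
    """
    if os.path.exists(filename):
        return  # no .part suffix: treated as already finished

    part = filename + '.part'
    fetch(url, part)           # hypothetical helper doing the transfer
    os.rename(part, filename)  # atomic on POSIX within one filesystem
```

Because the rename happens only after the transfer completes, an interrupted run leaves only a `.part` file behind, and a bare filename can safely be trusted as finished.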


This option can also be used with external downloaders:

coursera-dl --wget -u <user> -p <pass> --resume sdn1-001

*Note 1*: Some external downloaders use their own built-in resume feature,
which may not be compatible with others, so use them at your own risk.

@rbrito commented on Aug 5, 2015:

s/compatibale/compatible/

I feel that this point would become void if we used the .part strategy. Or wouldn't it?

@mxamin (author) commented on Aug 8, 2015:

Yes, that's correct, but there is one flaw: with most external downloaders, once you start downloading a course you can't switch to a different downloader.


*Note 2*: Remember that in resume mode, interrupted files **WON'T** be deleted from
your disk.

**NOTE**: If your password contains punctuation, quotes or other "funny
characters" (e.g., `<`, `>`, `#`, `&`, `|` and so on), then you may have to
escape them from your shell. With bash or other Bourne-shell clones (and
19 changes: 14 additions & 5 deletions coursera/coursera_dl.py
@@ -491,7 +491,8 @@ def download_lectures(downloader,
hooks=None,
playlist=False,
intact_fnames=False,
ignored_formats=None):
ignored_formats=None,
resume=False):
"""
Downloads lecture resources described by sections.
Returns True if the class appears completed.
@@ -533,10 +534,10 @@ def download_lectures(downloader,
lecfn = os.path.join(
sec, format_resource(lecnum + 1, lecname, title, fmt))

if overwrite or not os.path.exists(lecfn):
if overwrite or not os.path.exists(lecfn) or resume:
if not skip_download:
logging.info('Downloading: %s', lecfn)
downloader.download(url, lecfn)
downloader.download(url, lecfn, resume=resume)
else:
open(lecfn, 'w').close() # touch
last_update = time.time()
@@ -727,6 +728,12 @@ def parse_args(args=None):
help='use axel for downloading,'
' optionally specify axel bin')

parser.add_argument('--resume',
dest='resume',
action='store_true',
default=False,
help='resume incomplete downloads (default: False)')

parser.add_argument('-o',
'--overwrite',
dest='overwrite',
@@ -944,7 +951,8 @@ def download_class(args, class_name):
args.hooks,
args.playlist,
args.intact_fnames,
ignored_formats)
ignored_formats,
args.resume)

return completed

@@ -996,7 +1004,8 @@ def download_on_demand_class(args, class_name):
args.hooks,
args.playlist,
args.intact_fnames,
ignored_formats
ignored_formats,
args.resume
)
completed = completed and result

135 changes: 96 additions & 39 deletions coursera/downloaders.py
@@ -33,28 +33,30 @@ class Downloader(object):
>>> d.download('http://example.com', 'save/to/this/file')
"""

def _start_download(self, url, filename):
def _start_download(self, url, filename, resume):
"""
Actual method to download the given url to the given file.
This method should be implemented by the subclass.
"""
raise NotImplementedError("Subclasses should implement this")

def download(self, url, filename):
def download(self, url, filename, resume=False):
"""
Download the given url to the given file. When the download
is aborted by the user, the partially downloaded file is also removed.
"""

try:
self._start_download(url, filename)
self._start_download(url, filename, resume)
except KeyboardInterrupt as e:
logging.info(
'Keyboard Interrupt -- Removing partial file: %s', filename)
try:
os.remove(filename)
except OSError:
pass
# keep the file if resume is True
if not resume:
logging.info('Keyboard Interrupt -- Removing partial file: %s',
filename)
try:
os.remove(filename)
except OSError:
pass
raise e
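The keep-the-partial-file behaviour of `download` above can be exercised with a tiny stand-in subclass. This is a self-contained sketch (the base class is re-stated and `InterruptingDownloader` is invented for illustration):

```python
import os


class Downloader(object):
    """Minimal re-statement of the base class, for illustration only."""

    def _start_download(self, url, filename, resume):
        raise NotImplementedError("Subclasses should implement this")

    def download(self, url, filename, resume=False):
        try:
            self._start_download(url, filename, resume)
        except KeyboardInterrupt:
            # keep the partial file when resuming was requested
            if not resume:
                try:
                    os.remove(filename)
                except OSError:
                    pass
            raise


class InterruptingDownloader(Downloader):
    """Writes a few bytes, then simulates the user pressing CTRL+C."""

    def _start_download(self, url, filename, resume):
        with open(filename, 'wb') as f:
            f.write(b'partial')
        raise KeyboardInterrupt
```

With `resume=False` the partial file is removed before the interrupt propagates; with `resume=True` it survives so a later run can pick up where it stopped.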


@@ -94,6 +96,13 @@ def _prepare_cookies(self, command, url):
if cookie_values:
self._add_cookies(command, cookie_values)

def _enable_resume(self, command):
"""
Enable resume feature
"""

raise NotImplementedError("Subclasses should implement this")

def _add_cookies(self, command, cookie_values):
"""
Add the given cookie values to the command
@@ -107,9 +116,12 @@ def _create_command(self, url, filename):
"""
raise NotImplementedError("Subclasses should implement this")

def _start_download(self, url, filename):
def _start_download(self, url, filename, resume):
command = self._create_command(url, filename)
self._prepare_cookies(command, url)
if resume:
self._enable_resume(command)

logging.debug('Executing %s: %s', self.bin, command)
try:
subprocess.call(command)
@@ -126,6 +138,9 @@ class WgetDownloader(ExternalDownloader):

bin = 'wget'

def _enable_resume(self, command):
command.extend(['-c', ])

@rbrito commented on Aug 5, 2015:

Just an append would do it, instead of creating a list and then calling extend, but that's a really minor issue.

@mxamin (author) commented on Aug 8, 2015:

This is known as a copy-paste mistake :)
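rbrito's point is that a single flag needs no list wrapper; both forms below leave the command list identical (a trivial sketch with a hypothetical command):

```python
# Hypothetical command lists, mirroring how the patch builds the invocation.
command = ['wget', '-O', 'lecture.mp4']
command.extend(['-c', ])   # what the patch does

command2 = ['wget', '-O', 'lecture.mp4']
command2.append('-c')      # the plain append rbrito suggests

assert command == command2 == ['wget', '-O', 'lecture.mp4', '-c']
```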


def _add_cookies(self, command, cookie_values):
command.extend(['--header', "Cookie: " + cookie_values])

@@ -141,6 +156,9 @@ class CurlDownloader(ExternalDownloader):

bin = 'curl'

def _enable_resume(self, command):
command.extend(['-C', '-'])

@rbrito commented on Aug 5, 2015:

Thanks for reminding me why I don't like curl very much. :)

@mxamin (author) commented on Aug 8, 2015:

You're welcome


def _add_cookies(self, command, cookie_values):
command.extend(['--cookie', cookie_values])

@@ -156,6 +174,9 @@ class Aria2Downloader(ExternalDownloader):

bin = 'aria2c'

def _enable_resume(self, command):
command.extend(['-c', ])

@rbrito commented on Aug 5, 2015:

Same comment as for wget. :)

@mxamin (author) commented on Aug 8, 2015:

Fixed


def _add_cookies(self, command, cookie_values):
command.extend(['--header', "Cookie: " + cookie_values])

@@ -173,6 +194,10 @@ class AxelDownloader(ExternalDownloader):

bin = 'axel'

def _enable_resume(self, command):
logging.warn('Resume download not implemented for this '
'downloader!')

@rbrito commented on Aug 5, 2015:

Ah, I didn't know that!

BTW, it seems like axel is going to be "revived", at least in debian: https://bugs.debian.org/794421


def _add_cookies(self, command, cookie_values):
command.extend(['-H', "Cookie: " + cookie_values])

@@ -278,43 +303,75 @@ class NativeDownloader(Downloader):
def __init__(self, session):
self.session = session

def _start_download(self, url, filename):
logging.info('Downloading %s -> %s', url, filename)
def _start_download(self, url, filename, resume=False):
# resume has no meaning if the file doesn't exist
resume = resume and os.path.exists(filename)

headers = {}
filesize = None
if resume:
filesize = os.path.getsize(filename)
headers['Range'] = 'bytes={}-'.format(filesize)
logging.info('Resume downloading %s -> %s', url, filename)
else:
logging.info('Downloading %s -> %s', url, filename)

@rbrito commented on Aug 5, 2015:

Excellent work here!

@mxamin (author) commented on Aug 8, 2015:

Appreciate it.


attempts_count = 0
error_msg = ''
while attempts_count < 5:
r = self.session.get(url, stream=True)

if r.status_code is not 200:
logging.warn(
'Probably the file is missing from the AWS repository...'
' waiting.')

if r.reason:
error_msg = r.reason + ' ' + str(r.status_code)
r = self.session.get(url, stream=True, headers=headers)

if r.status_code != 200:
# because in resume state we are downloading only a

@rbrito commented on Aug 5, 2015:

s/downloadig/downloading/

# portion of requested file, server may return
# following HTTP codes:
# 206: Partial Content
# 416: Requested Range Not Satisfiable

@rbrito commented on Aug 5, 2015:

Should this comment here be 416?

And if we have a requested range not satisfiable, I think that we may have to download the file from scratch? I'm not really that knowledgeable, but I vaguely remember that that's what wget does (or doesn't it?).

@mxamin (author) commented on Aug 8, 2015:

Yes, 416 is correct.

I guessed the same thing about 416 (Requested Range Not Satisfiable), but after running the code some files didn't download completely, or they started downloading from the beginning.
So I looked up the wget code, and this is exactly what it does when facing HTTP code 416.
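The resume logic under discussion boils down to sending a `Range` header and classifying the status code. A minimal sketch of just that decision (the function name is illustrative; the real code interleaves this with the retry loop):

```python
def classify_resume_response(status_code, resume):
    """Decide what to do with a response while (possibly) resuming.

    Returns 'write' (stream the body to disk), 'done' (the file is
    already complete) or 'retry' (treat as an error and back off).
    """
    if status_code == 200:
        # Full body: either we never sent a Range header, or the
        # server ignored it (callers must then write from scratch,
        # not append).
        return 'write'
    if resume and status_code == 206:
        return 'write'  # Partial Content: exactly the requested range
    if resume and status_code == 416:
        return 'done'   # Range not satisfiable: nothing left to fetch
    return 'retry'
```

The header itself would be built as `{'Range': 'bytes=%d-' % os.path.getsize(filename)}`, matching the code above.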

# which are OK for us.
if resume and r.status_code == 206:
pass
elif resume and r.status_code == 416:
logging.info('%s already downloaded', filename)
r.close()
return True
else:
error_msg = 'HTTP Error ' + str(r.status_code)

wait_interval = 2 ** (attempts_count + 1)
msg = 'Error downloading, will retry in {0} seconds ...'
print(msg.format(wait_interval))
time.sleep(wait_interval)
attempts_count += 1
continue
print(r.status_code)
print(url)
print(filesize)
logging.warn('Probably the file is missing from the AWS '
'repository... waiting.')

if r.reason:
error_msg = r.reason + ' ' + str(r.status_code)
else:
error_msg = 'HTTP Error ' + str(r.status_code)

wait_interval = 2 ** (attempts_count + 1)

@rbrito commented on Aug 5, 2015:

Is this retrying thing really useful for people? I always disliked it, but kept it for the sake of those that may need it...

@mxamin (author) commented on Aug 8, 2015:

Me neither, but my goal in adding a new feature is to add a feature, not to remove an already implemented one. It's actually up to you (the project maintainer) to remove unneeded parts.
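For reference, the retry schedule being debated is a plain exponential backoff; the waits grow as `2 ** (attempts_count + 1)` seconds. A sketch of just the arithmetic (the function is illustrative, not part of the codebase):

```python
def backoff_schedule(max_attempts=5):
    """Seconds waited before each retry, as in the loop above."""
    return [2 ** (attempts_count + 1) for attempts_count in range(max_attempts)]
```

`backoff_schedule()` evaluates to `[2, 4, 8, 16, 32]`, i.e. about a minute of total waiting across the five attempts before the downloader gives up.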

msg = 'Error downloading, will retry in {0} seconds ...'
print(msg.format(wait_interval))
time.sleep(wait_interval)
attempts_count += 1
continue

if resume and r.status_code == 200:
# if the server returns HTTP code 200 while we are in
# resume mode, it means that the server does not support
# partial downloads.
resume = False

content_length = r.headers.get('content-length')
progress = DownloadProgress(content_length)
chunk_sz = 1048576
with open(filename, 'wb') as f:
progress.start()
while True:
data = r.raw.read(chunk_sz, decode_content=True)
if not data:
progress.stop()
break
progress.report(r.raw.tell())
f.write(data)
progress = DownloadProgress(content_length)
progress.start()
f = open(filename, 'ab') if resume else open(filename, 'wb')
while True:
data = r.raw.read(chunk_sz, decode_content=True)
if not data:
progress.stop()
break
progress.report(r.raw.tell())
f.write(data)
f.close()
r.close()

@rbrito commented on Aug 5, 2015:

Double close?

@mxamin (author) commented on Aug 8, 2015:

The first one closes the file handle and the second one closes the request handle. Should I change anything?
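Nothing is wrong with the two `close()` calls, but the same cleanup can be expressed with context managers so neither handle leaks if writing raises. A sketch with a stand-in response object (`FakeResponse` and `write_body` are invented for illustration):

```python
from contextlib import closing


class FakeResponse(object):
    """Stand-in for the requests response; only close() matters here."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def write_body(filename, response, chunks):
    # both the file and the response are closed even if a write fails
    with open(filename, 'wb') as f, closing(response):
        for data in chunks:
            f.write(data)
```

`contextlib.closing` wraps any object with a `close()` method, so the real `requests` response could be handled the same way.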

return True

@rbrito commented on Aug 5, 2015:

This is a very nice contribution. I think that we should split it into a separate file. I would certainly like to beat it a bit more in other projects, so that we can expose it to more widespread usage.

@mxamin (author) commented on Aug 8, 2015:

I agree, happy to help.


6 changes: 3 additions & 3 deletions coursera/test/test_downloaders.py
@@ -70,7 +70,7 @@ def test_bin_not_found_raises_exception():
d = downloaders.ExternalDownloader(None, bin='no_way_this_exists')
d._prepare_cookies = lambda cmd, cv: None
d._create_command = lambda x, y: ['no_way_this_exists']
assertRaises(OSError, d._start_download, 'url', 'filename')
assertRaises(OSError, d._start_download, 'url', 'filename', False)


def test_bin_is_set():
@@ -186,7 +186,7 @@ class IObject(object):

class MockSession(object):

def get(self, url, stream=True):
def get(self, url, stream=True, headers={}):
object_ = IObject()
object_.status_code = 400
object_.reason = None
@@ -197,7 +197,7 @@ def get(self, url, stream=True):

session = MockSession()
d = downloaders.NativeDownloader(session)
assert d._start_download('download_url', 'save_to') is False
assert d._start_download('download_url', 'save_to', False) is False

time.sleep = _sleep


0 comments on commit afa350a
