Add a parser for a list of links (see issue #39): #58
Conversation
- `basic_link_info` method in parse.py to avoid code duplication.
- Fetch method to get missing titles.
- Fix some HTML syntax errors in the templates.
- Simpler version of `html_appended_url` in utils.py using urlsplit.
- Update the README accordingly; remove some repetition and make minor style improvements.
- Add instructions to install requests (which was already used in utils.py). Add requirements.txt.
- Fix a bug where a KeyError would be thrown when generating the templates if screenshots or PDFs are disabled.
- If FETCH_WGET_REQUISITES is disabled, wget stores its output in a different layout, which was not handled and resulted in bugs. This has been corrected.
@frinkelpi thanks so much for this PR! I'm really glad to see people using this project enough that they are willing to contribute back! There's a bunch for me to look over here. Some of these changes I can approve immediately, but some of them I want to clean up a bit, mind splitting this PR into a few separate ones so that I can review the changes independently? PRs:
You're very welcome! As the edits are quite interdependent, it would be difficult to split them meaningfully. To make the review easier nonetheless, here are more detailed comments on the modifications; you should be able to edit the commits.
```python
        return {'output': html_appended_url(link), 'status': 'skipped'}
```

```diff
     CMD = [
         *'wget --timestamping --adjust-extension --no-parent'.split(' '),  # Docs: https://www.gnu.org/software/wget/manual/wget.html
-        *(('--page-requisites', '--convert-links') if FETCH_WGET_REQUISITES else ()),
+        *(('--page-requisites', '--convert-links') if requisites else ()),
```
Use the argument passed to the method instead of the default value
```diff
@@ -185,6 +201,12 @@ def fetch_wget(link_dir, link, requisites=FETCH_WGET_REQUISITES, timeout=TIMEOUT
     result = run(CMD, stdout=PIPE, stderr=PIPE, cwd=link_dir, timeout=timeout + 1)  # index.html
     end()
     output = html_appended_url(link)
+    if not requisites:
```
If the requisites option is not used, the output of wget is a single file rather than a directory tree. In that case, we create the tree and move the file into it.
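A minimal sketch of what this fix might look like. The helper name and layout are assumptions for illustration, not the PR's actual code:

```python
import os
import shutil

def move_to_tree(link_dir, domain, filename):
    # Hypothetical helper: without --page-requisites, wget writes a single
    # file (e.g. "example.com.html") directly into link_dir instead of a
    # domain/... directory tree. Recreate the tree so the rest of the code
    # can rely on a single layout.
    tree_dir = os.path.join(link_dir, domain)
    os.makedirs(tree_dir, exist_ok=True)
    destination = os.path.join(tree_dir, filename)
    shutil.move(os.path.join(link_dir, filename), destination)
    return destination
```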
```diff
@@ -139,7 +140,9 @@ def write_html_link_index(out_dir, link):
     'bookmarked': datetime.fromtimestamp(float(link['timestamp'])).strftime('%Y-%m-%d %H:%M'),
     'updated': datetime.fromtimestamp(float(link['updated'])).strftime('%Y-%m-%d %H:%M'),
     'archive_org': link['latest'].get('archive_org') or 'https://web.archive.org/save/{}'.format(link['url']),
-    'wget': link['latest'].get('wget') or link['domain'],
+    'wget': quote_plus(link['latest'].get('wget')) or link['domain'],
     'screenshot': link['latest'].get('screenshot') or '',
```
If the wget, screenshot, or PDF options were disabled, a KeyError was thrown when generating the templates.
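The usual Python idiom for this kind of fix is `dict.get()`, which returns `None` for a missing key instead of raising. A tiny illustration (the key names mirror the diff but the dict contents are invented for the example):

```python
# Simulate the 'latest' dict when screenshot/PDF fetching is disabled:
# those keys were simply never written.
latest = {'archive_org': 'https://web.archive.org/save/example'}

# latest['wget'] would raise KeyError here; .get() returns None,
# letting `or` substitute a fallback value instead.
wget_path = latest.get('wget') or 'example.com'
print(wget_path)  # example.com
```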
```python
        # parser not supported on this file
        pass

    return links


def basic_link_info(url, f, title=None, time=datetime.now(), tags=""):
```
Added this method to avoid code duplication in the `*_export` methods below.
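The shared helper presumably looks something like the sketch below. The signature follows the diff above, but the body and field names are assumptions for illustration (with `time=None` instead of a `datetime.now()` default, which Python evaluates only once at definition time):

```python
from datetime import datetime

def basic_link_info(url, f, title=None, time=None, tags=""):
    # Sketch: each *_export parser previously built this dict by hand,
    # so one helper removes the duplication. `f` is the open export file
    # the link was parsed from; only its name is recorded.
    time = time or datetime.now()
    return {
        'url': url,
        'timestamp': str(time.timestamp()),
        'title': title,
        'tags': tags,
        'sources': [f.name],
    }
```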
```diff
         'sources': [html_file.name],
     }
-    info['type'] = get_link_type(info)
+    info = basic_link_info(fixed_url,
```
Use `basic_link_info` to avoid code duplication.
```diff
@@ -240,7 +248,7 @@ def merge_links(a, b):
     'domain': domain(url),
     'base_url': base_url(url),
     'tags': longer('tags'),
-    'title': longest_title if '://' not in longest_title else cleanest_title,
+    'title': longest_title if longest_title is not None and '://' not in longest_title else cleanest_title,
```
Handle undefined titles.
```diff
@@ -396,59 +404,60 @@ def cleanup_archive(archive_path, links):
     print(' '+ '\n '.join(unmatched))


-def html_appended_url(link):
+def html_appended_url(link, requisites=False):
```
Partially rewrote this method using `urllib.parse.urlsplit`. The path returned is the location of the file wget produces, which depends on whether the requisites option is used.
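As an illustration of the approach (this is a guess at the logic, not the PR's implementation): `urlsplit` yields the host and path, from which the wget output location can be assembled for either mode.

```python
from urllib.parse import urlsplit

def wget_output_path(url, requisites=False):
    # Illustrative sketch, not the PR's actual code.
    split = urlsplit(url)
    # Last path segment, falling back to the host for bare domain URLs.
    filename = split.path.rstrip('/').split('/')[-1] or split.netloc
    if not filename.endswith('.html'):
        filename += '.html'  # wget --adjust-extension appends .html
    if requisites:
        # --page-requisites mirrors the page into a host-named tree.
        return '{}/{}'.format(split.netloc, filename)
    # Without requisites, wget leaves a single file at the top level.
    return filename
```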
```diff
     'screenshot_link': './archive/{timestamp}/screenshot.png'.format(**link),
     'favicon_url': '{timestamp}/favicon.ico'.format(**link),
     'files_url': '{timestamp}/index.html'.format(**link),
+    'archive_url': '{}/{}'.format(link['timestamp'], quote_plus(html_appended_url(link))),
```
Without quoting, the generated HTML links were invalid when the path contained special characters.
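For example, `quote_plus` escapes the characters that would otherwise break an `href` attribute (the filename below is invented for the demo):

```python
from urllib.parse import quote_plus

# A wget output filename containing a space and query characters would
# produce a broken href if embedded raw in the template.
filename = 'my page?id=1.html'
print(quote_plus(filename))  # my+page%3Fid%3D1.html
```

Note that `quote_plus` also encodes `/` as `%2F`, so it is suited to quoting a single filename rather than a multi-segment path.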
```python
    link_info = {
        **link,
        'date': datetime.fromtimestamp(float(link['timestamp'])).strftime('%Y-%m-%d %H:%M'),
        'google_favicon_url': 'https://www.google.com/s2/favicons?domain={domain}'.format(**link),
        'favicon_url': './archive/{timestamp}/favicon.ico'.format(**link),
```
Fix relative links.
```python
        'archive_org_url': 'https://web.archive.org/web/{base_url}'.format(**link),
    }

    # PDF and images are handled slightly differently
    # wget, screenshot, & pdf urls all point to the same file
    if link['type'] in ('PDF', 'image'):
        url = '{}/{}'.format(link['timestamp'], html_appended_url(link))
```
Avoid code duplication.
Thanks for the comments, but unfortunately I can't re-open this specific PR because GitHub prevents re-opening closed PRs after a force-push. I definitely want these changes though! You're welcome to re-submit them as-is or as separate PRs. Separate, smaller PRs would help me review and merge faster, but that's not a necessity if you're OK waiting longer. I will work on it when I next have a few spare hours for this project. I usually only get a chance to work on this for a few hours per month, so apologies if it's a long wait!