
Add a parser for a list of links (see issue #39): #58

Closed
wants to merge 1 commit

Conversation

frinkelpi

  • basic_link_info method in parse.py to avoid code duplication.

  • Fetch method to get missing titles.

  • Fix some HTML syntax errors in the templates.

  • Simpler version of html_appended_url in utils.py using urlsplit.

  • Update the README accordingly; remove some repetition and make minor style improvements.

  • Add instructions to install requests (which was already used in utils.py). Add requirements.txt.

  • Fix a bug where a KeyError would be thrown when generating the templates if screenshots or PDFs are disabled.

  • If FETCH_WGET_REQUISITES is disabled, wget stores its output differently; this case was not handled and caused bugs. This has been corrected.

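The list parser itself is not quoted in this thread. As a rough, hypothetical illustration of the idea (function and field names are assumptions, not the PR's actual code), a plain text file with one URL per line could be turned into the same kind of link dicts the other *_export parsers build:

```python
# Hypothetical sketch only -- not the code from this PR. One URL per line is
# turned into the same kind of link dict the other *_export parsers produce.
from datetime import datetime

def parse_link_list_export(list_file):
    """Parse a plain-text file containing one URL per line (illustrative)."""
    links = []
    for line in list_file:
        url = line.strip()
        if not url or not url.startswith(('http://', 'https://')):
            continue  # skip blank lines and non-URL text
        links.append({
            'url': url,
            'timestamp': str(datetime.now().timestamp()),
            'title': None,   # left empty; the fetch step for missing titles fills it in
            'tags': '',
            'sources': [list_file.name],
        })
    return links
```
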
@pirate
Member

pirate commented Dec 27, 2017

@frinkelpi thanks so much for this PR! I'm really glad to see people using this project enough that they are willing to contribute back!

There's a bunch for me to look over here. Some of these changes I can approve immediately, but some of them I want to clean up a bit. Mind splitting this PR into a few separate ones so that I can review the changes independently?

PRs:

  1. README readability/style changes
  2. FETCH_WGET_REQUISITES fix, plus misc refactoring (basic_link_info, html_appended_url, screenshot/PDF template generation)
  3. New parser for a list of links

pirate closed this Dec 27, 2017
pirate reopened this Dec 27, 2017
pirate closed this Dec 27, 2017

frinkelpi (Author) left a comment

You're very welcome! As the edits are quite interdependent, it would be rather difficult to split them meaningfully. To make the review easier nonetheless, here are more detailed comments on the modifications; you should also be able to edit the commits directly.

return {'output': html_appended_url(link), 'status': 'skipped'}

CMD = [
*'wget --timestamping --adjust-extension --no-parent'.split(' '), # Docs: https://www.gnu.org/software/wget/manual/wget.html
- *(('--page-requisites', '--convert-links') if FETCH_WGET_REQUISITES else ()),
+ *(('--page-requisites', '--convert-links') if requisites else ()),

Use the argument passed to the method instead of the default value

@@ -185,6 +201,12 @@ def fetch_wget(link_dir, link, requisites=FETCH_WGET_REQUISITES, timeout=TIMEOUT
result = run(CMD, stdout=PIPE, stderr=PIPE, cwd=link_dir, timeout=timeout + 1) # index.html
end()
output = html_appended_url(link)
if not requisites:

If the requisites option is not used, the output of wget is a single file, not a tree. In that case, we create the tree and move the file.
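
A minimal sketch of that fallback, with assumed helper and variable names (not the PR's exact code):

```python
# Sketch: without --page-requisites, wget writes one flat file instead of a
# host/path/ tree, so the file is moved to where the archive index expects it.
# Helper name and arguments are assumptions, not the PR's actual code.
import os
import shutil

def move_flat_wget_output(link_dir, flat_filename, tree_path):
    src = os.path.join(link_dir, flat_filename)
    dst = os.path.join(link_dir, tree_path)
    os.makedirs(os.path.dirname(dst), exist_ok=True)  # create the missing directory tree
    shutil.move(src, dst)
```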

@@ -139,7 +140,9 @@ def write_html_link_index(out_dir, link):
'bookmarked': datetime.fromtimestamp(float(link['timestamp'])).strftime('%Y-%m-%d %H:%M'),
'updated': datetime.fromtimestamp(float(link['updated'])).strftime('%Y-%m-%d %H:%M'),
'archive_org': link['latest'].get('archive_org') or 'https://web.archive.org/save/{}'.format(link['url']),
'wget': link['latest'].get('wget') or link['domain'],
'wget': quote_plus(link['latest'].get('wget')) or link['domain'],
'screenshot': link['latest'].get('screenshot') or '',

If the wget, screenshot, or pdf options were disabled, a KeyError was thrown.
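
In other words, indexing `link['latest']` directly fails for any archive method that never ran, while `.get()` falls back cleanly. A small illustration (values are made up):

```python
# Illustration only: 'screenshot' is absent because screenshots were disabled.
latest = {'wget': 'example.com/index.html'}

# latest['screenshot']                  # would raise KeyError and abort template generation
print(latest.get('screenshot') or '')   # prints '' so the template can still render
```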

# parser not supported on this file
pass

return links

def basic_link_info(url, f, title=None, time=datetime.now(), tags=""):

Added this method to avoid code duplication in the *_export methods below
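
A rough sketch of what such a helper could look like; the field names follow the link dicts visible elsewhere in this diff, but the body is an assumption:

```python
# Sketch of a shared helper for the *_export parsers in parse.py.
# Field names follow the link dicts seen in this diff; the body is assumed.
from datetime import datetime

def basic_link_info(url, f, title=None, time=None, tags=""):
    time = time or datetime.now()  # resolved at call time in this sketch
    return {
        'url': url,
        'timestamp': str(time.timestamp()),
        'title': title,
        'tags': tags,
        'sources': [f.name],
        # the real parse.py also derives 'type' via get_link_type(info)
    }
```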

'sources': [html_file.name],
}
info['type'] = get_link_type(info)
info = basic_link_info(fixed_url,

Use basic_link_info to avoid code duplication.

@@ -240,7 +248,7 @@ def merge_links(a, b):
'domain': domain(url),
'base_url': base_url(url),
'tags': longer('tags'),
- 'title': longest_title if '://' not in longest_title else cleanest_title,
+ 'title': longest_title if longest_title is not None and '://' not in longest_title else cleanest_title,

Handle undefined titles.

@@ -396,59 +404,60 @@ def cleanup_archive(archive_path, links):
print(' '+ '\n '.join(unmatched))


- def html_appended_url(link):
+ def html_appended_url(link, requisites=False):

Partially rewrote this method using urllib.parse.urlsplit.

The returned path is the location of the file written by wget, which depends on whether the requisites option is used.
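
A rough sketch of the urlsplit approach (the exact naming rules in the PR may differ):

```python
# Illustrative sketch only -- the PR's exact rules may differ. urlsplit is used
# instead of manual string slicing to locate the file wget wrote for this link.
from urllib.parse import urlsplit

def html_appended_url(link, requisites=False):
    split = urlsplit(link['url'])
    path = split.path or '/'
    if path.endswith('/'):
        path += 'index.html'      # wget names directory URLs .../index.html
    elif not path.endswith('.html'):
        path += '.html'           # --adjust-extension appends .html
    if requisites:
        return split.netloc + path      # --page-requisites recreates the host/path tree
    return path.rsplit('/', 1)[-1]      # otherwise a single flat file in the link folder
```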

'screenshot_link': './archive/{timestamp}/screenshot.png'.format(**link),
'favicon_url': '{timestamp}/favicon.ico'.format(**link),
'files_url': '{timestamp}/index.html'.format(**link),
'archive_url': '{}/{}'.format(link['timestamp'], quote_plus(html_appended_url(link))),

Without quoting, the generated HTML links were invalid when the path contained special characters.
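
For example (illustrative path):

```python
from urllib.parse import quote_plus

# Illustration: unquoted spaces/ampersands would produce a broken href.
path = 'example.com/some page?q=a&b.html'
print(quote_plus(path))
# example.com%2Fsome+page%3Fq%3Da%26b.html
```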

link_info = {
**link,
'date': datetime.fromtimestamp(float(link['timestamp'])).strftime('%Y-%m-%d %H:%M'),
'google_favicon_url': 'https://www.google.com/s2/favicons?domain={domain}'.format(**link),
'favicon_url': './archive/{timestamp}/favicon.ico'.format(**link),

Fix relative links.

'archive_org_url': 'https://web.archive.org/web/{base_url}'.format(**link),
}

# PDF and images are handled slightly differently
# wget, screenshot, & pdf urls all point to the same file
if link['type'] in ('PDF', 'image'):
url = '{}/{}'.format(link['timestamp'], html_appended_url(link))

Avoid code duplication.

@pirate
Member

pirate commented Dec 28, 2017

Thanks for the comments, but unfortunately I can't re-open this specific PR because GitHub prevents re-opening closed PRs after a force-push.

I definitely want these changes though! You're welcome to re-submit it as-is or as separate PRs. Separate, smaller PRs would help me review and merge faster, but it's not a necessity if you're OK waiting longer. I will work on it when I next have a few spare hours for this project. I usually only get a chance to work on this for a few hours per month, so apologies if it's a long wait!
