Add a parser for a list of links (see issue #39): #58
Conversation
- `basic_link_info` method in parse.py to avoid code duplication.
- Fetch method to get missing titles.
- Fix some HTML syntax errors in the templates.
- Simpler version of `html_appended_url` in utils.py using urlsplit.
- Update the README accordingly; remove some repetition and make minor style improvements.
- Add instructions to install requests (which was already used in utils.py). Add requirements.txt.
- Fix a bug where a KeyError would be thrown when generating the templates if screenshots or PDFs are disabled.
- If FETCH_WGET_REQUISITES is disabled, wget stores its output in a different layout, which was not handled and resulted in bugs. This has been corrected.
@frinkelpi thanks so much for this PR! I'm really glad to see people using this project enough that they are willing to contribute back! There's a bunch for me to look over here. Some of these changes I can approve immediately, but some of them I want to clean up a bit, mind splitting this PR into a few separate ones so that I can review the changes independently? PRs:
You're very welcome! As the edits are quite interdependent, it would be difficult to split them meaningfully. To make the review easier nonetheless, here are more detailed comments on the modifications; you should be able to edit the commits.
```python
        return {'output': html_appended_url(link), 'status': 'skipped'}
```

```diff
     CMD = [
         *'wget --timestamping --adjust-extension --no-parent'.split(' '),  # Docs: https://www.gnu.org/software/wget/manual/wget.html
-        *(('--page-requisites', '--convert-links') if FETCH_WGET_REQUISITES else ()),
+        *(('--page-requisites', '--convert-links') if requisites else ()),
```
Use the argument passed to the method instead of the default value
```diff
@@ -185,6 +201,12 @@ def fetch_wget(link_dir, link, requisites=FETCH_WGET_REQUISITES, timeout=TIMEOUT
     result = run(CMD, stdout=PIPE, stderr=PIPE, cwd=link_dir, timeout=timeout + 1)  # index.html
     end()
     output = html_appended_url(link)
+    if not requisites:
```
If the requisites option is not used, the output of wget is a single file rather than a directory tree. In that case, we create the tree and move the file into it.
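A minimal sketch of what this fix might look like. The helper name and layout are assumptions for illustration, not the PR's actual code:

```python
import os
import shutil

def move_to_tree(link_dir, domain, filename):
    # Hypothetical helper: without --page-requisites, wget writes a single
    # file (e.g. "example.com.html") directly into link_dir instead of a
    # domain/... directory tree. Recreate the tree so the rest of the code
    # can rely on a single layout.
    tree_dir = os.path.join(link_dir, domain)
    os.makedirs(tree_dir, exist_ok=True)
    destination = os.path.join(tree_dir, filename)
    shutil.move(os.path.join(link_dir, filename), destination)
    return destination
```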
```diff
@@ -139,7 +140,9 @@ def write_html_link_index(out_dir, link):
     'bookmarked': datetime.fromtimestamp(float(link['timestamp'])).strftime('%Y-%m-%d %H:%M'),
     'updated': datetime.fromtimestamp(float(link['updated'])).strftime('%Y-%m-%d %H:%M'),
     'archive_org': link['latest'].get('archive_org') or 'https://web.archive.org/save/{}'.format(link['url']),
-    'wget': link['latest'].get('wget') or link['domain'],
+    'wget': quote_plus(link['latest'].get('wget')) or link['domain'],
     'screenshot': link['latest'].get('screenshot') or '',
```
If the wget, screenshot, or PDF options were disabled, a KeyError was thrown when generating the templates.
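The usual Python idiom for this kind of fix is `dict.get()`, which returns `None` for a missing key instead of raising. A tiny illustration (the key names mirror the diff but the dict contents are invented for the example):

```python
# Simulate the 'latest' dict when screenshot/PDF fetching is disabled:
# those keys were simply never written.
latest = {'archive_org': 'https://web.archive.org/save/example'}

# latest['wget'] would raise KeyError here; .get() returns None,
# letting `or` substitute a fallback value instead.
wget_path = latest.get('wget') or 'example.com'
print(wget_path)  # example.com
```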
```python
        # parser not supported on this file
        pass

    return links


def basic_link_info(url, f, title=None, time=datetime.now(), tags=""):
```
Added this method to avoid code duplication in the `*_export` methods below.
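The shared helper presumably looks something like the sketch below. The signature follows the diff above, but the body and field names are assumptions for illustration (with `time=None` instead of a `datetime.now()` default, which Python evaluates only once at definition time):

```python
from datetime import datetime

def basic_link_info(url, f, title=None, time=None, tags=""):
    # Sketch: each *_export parser previously built this dict by hand,
    # so one helper removes the duplication. `f` is the open export file
    # the link was parsed from; only its name is recorded.
    time = time or datetime.now()
    return {
        'url': url,
        'timestamp': str(time.timestamp()),
        'title': title,
        'tags': tags,
        'sources': [f.name],
    }
```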
```diff
         'sources': [html_file.name],
     }
-    info['type'] = get_link_type(info)
+    info = basic_link_info(fixed_url,
```
Use `basic_link_info` to avoid code duplication.
```diff
@@ -240,7 +248,7 @@ def merge_links(a, b):
     'domain': domain(url),
     'base_url': base_url(url),
     'tags': longer('tags'),
-    'title': longest_title if '://' not in longest_title else cleanest_title,
+    'title': longest_title if longest_title is not None and '://' not in longest_title else cleanest_title,
```
Handle undefined titles.
```diff
@@ -396,59 +404,60 @@ def cleanup_archive(archive_path, links):
     print(' '+ '\n '.join(unmatched))


-def html_appended_url(link):
+def html_appended_url(link, requisites=False):
```
Partially rewrote this method using `urllib.parse.urlsplit`. The path returned is the location of the file wget produces, which depends on whether the requisites option is used.
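As an illustration of the approach (this is a guess at the logic, not the PR's implementation): `urlsplit` yields the host and path, from which the wget output location can be assembled for either mode.

```python
from urllib.parse import urlsplit

def wget_output_path(url, requisites=False):
    # Illustrative sketch, not the PR's actual code.
    split = urlsplit(url)
    # Last path segment, falling back to the host for bare domain URLs.
    filename = split.path.rstrip('/').split('/')[-1] or split.netloc
    if not filename.endswith('.html'):
        filename += '.html'  # wget --adjust-extension appends .html
    if requisites:
        # --page-requisites mirrors the page into a host-named tree.
        return '{}/{}'.format(split.netloc, filename)
    # Without requisites, wget leaves a single file at the top level.
    return filename
```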
```diff
     'screenshot_link': './archive/{timestamp}/screenshot.png'.format(**link),
     'favicon_url': '{timestamp}/favicon.ico'.format(**link),
     'files_url': '{timestamp}/index.html'.format(**link),
+    'archive_url': '{}/{}'.format(link['timestamp'], quote_plus(html_appended_url(link))),
```
Without quoting, the generated HTML links were invalid when the path contained special characters.
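For example, `quote_plus` escapes the characters that would otherwise break an `href` attribute (the filename below is invented for the demo):

```python
from urllib.parse import quote_plus

# A wget output filename containing a space and query characters would
# produce a broken href if embedded raw in the template.
filename = 'my page?id=1.html'
print(quote_plus(filename))  # my+page%3Fid%3D1.html
```

Note that `quote_plus` also encodes `/` as `%2F`, so it is suited to quoting a single filename rather than a multi-segment path.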
```python
    link_info = {
        **link,
        'date': datetime.fromtimestamp(float(link['timestamp'])).strftime('%Y-%m-%d %H:%M'),
        'google_favicon_url': 'https://www.google.com/s2/favicons?domain={domain}'.format(**link),
        'favicon_url': './archive/{timestamp}/favicon.ico'.format(**link),
```
Fix relative links.
```python
        'archive_org_url': 'https://web.archive.org/web/{base_url}'.format(**link),
    }

    # PDF and images are handled slightly differently
    # wget, screenshot, & pdf urls all point to the same file
    if link['type'] in ('PDF', 'image'):
        url = '{}/{}'.format(link['timestamp'], html_appended_url(link))
```
Avoid code duplication.
Thanks for the comments, but unfortunately I can't re-open this specific PR because GitHub prevents re-opening closed PRs after a force-push. I definitely want these changes though! You're welcome to re-submit them as-is or as separate PRs. Separate, smaller PRs would help me review and merge faster, but that's not a necessity if you're OK waiting longer. I will work on it when I next have a few spare hours for this project. I usually only get a chance to work on this for a few hours per month, so apologies if it's a long wait!