
Site Config File Explained

m5n edited this page Dec 1, 2014 · 6 revisions

This page describes the various options available in a site config file. All examples are for Pinterest (the site as of November 2014).

Overall guidance:

actual_username_regex

Regular expression that is run on the landing page (the page loaded after a successful login) to determine the user's username. This is needed because on some sites the site username differs from the login username. Test the regex thoroughly to ensure it extracts only the username and nothing else, and properly escape characters that have a special meaning in regular expressions.

Example:

"actual_username_regex": "P\\.start\\.start\\(.*?\"username\":\\s+\"(.+?)\""
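
One way to test the pattern before committing it is to run it against a saved copy of the landing page. The sketch below is a minimal check in Python, assuming the tool applies the regex as an unanchored search and takes the first capture group as the username. The sample page snippet, the username `janedoe42` and the `extract_username` helper are hypothetical, and the pattern is the Pinterest one with its literal dots escaped.

```python
import json
import re

# The option as it would appear in the config file, JSON escaping included.
# The literal dots in "P.start.start" are escaped so they match only a dot.
config_text = r'''
{
    "actual_username_regex": "P\\.start\\.start\\(.*?\"username\":\\s+\"(.+?)\""
}
'''

# Hypothetical landing page fragment; a real test should use the actual page source.
sample_landing_page = (
    '<script>P.start.start({"resourceDataCache": [], '
    '"viewer": {"full_name": "Jane Doe", "username": "janedoe42", "id": "123"}});</script>'
)


def extract_username(pattern, page_source):
    """Return the first capture group of the first match, or None if nothing matches."""
    match = re.search(pattern, page_source)
    return match.group(1) if match else None


# json.loads undoes the JSON escaping, leaving the regex the tool would see.
pattern = json.loads(config_text)["actual_username_regex"]
username = extract_username(pattern, sample_landing_page)
print(username)
```

Checking the pattern against a page where the user is not logged in is just as useful: it should yield no match rather than a wrong value.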

click_if_present_on_paths_selector

Selector for the element to click if it is visible on a page that will be saved. This is needed to close in-page popups that may appear on page load (e.g. new-user help content). The same check is performed on every page that will be saved, but not on the login or landing pages; there is no per-page setting. (See the intro atop this Wiki page for the properties that control which pages get saved.)

Example:

"click_if_present_on_paths_selector": ".Modal.show .Button.primary"

content_to_exclude

A list of absolute URLs, paths (relative to the home page URL) or regular expressions matching pages to exclude. The only exceptions are the absolute URLs included in the content_to_save list, which are saved regardless. Use the macro {{home_page}} to substitute the user's home page URL as determined by home_page, or the macro {{username}} to substitute the user's username as extracted by actual_username_regex.

Example:

"content_to_exclude": [
    "/{{username}}/pins/",
    "/{{username}}/followers/",
    "/{{username}}/following/"
]
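
To make the macro substitution and path handling concrete, here is a minimal Python sketch. It assumes that macros are replaced by plain string substitution, that paths are resolved against the home page URL the way a browser resolves a link, and that regular expressions carry the `regex:` prefix seen in the content_to_save example. `expand_entry` is a hypothetical helper, not part of the tool, and the username `janedoe42` is made up.

```python
from urllib.parse import urljoin


def expand_entry(entry, home_page_template, username):
    """Expand the {{home_page}} and {{username}} macros in one list entry.

    Paths are turned into absolute URLs; "regex:" entries stay patterns.
    """
    home_page = home_page_template.replace("{{username}}", username)
    expanded = entry.replace("{{home_page}}", home_page)
    expanded = expanded.replace("{{username}}", username)
    if expanded.startswith("regex:"):
        return expanded
    # urljoin leaves absolute URLs untouched and resolves paths against home_page.
    return urljoin(home_page, expanded)


home_page_template = "http://www.pinterest.com/{{username}}/"
content_to_exclude = [
    "/{{username}}/pins/",
    "/{{username}}/followers/",
    "/{{username}}/following/",
]

expanded_entries = [
    expand_entry(entry, home_page_template, "janedoe42")
    for entry in content_to_exclude
]
for url in expanded_entries:
    print(url)
```

Under these assumptions, the first entry expands to `http://www.pinterest.com/janedoe42/pins/`.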

content_to_save

A list of absolute URLs, paths (relative to the home page URL) or regular expressions matching pages to save. All absolute URLs in this list will be saved regardless of the content_to_exclude setting. The regular expressions denote additional pages to save: any link (a[href]) found on a page that will be saved is matched against the regexes in this list, and if there is a match and the link is not on the content_to_exclude list, the linked-to page will also be saved. Use the macro {{home_page}} to substitute the user's home page URL as determined by home_page, or the macro {{username}} to substitute the user's username as extracted by actual_username_regex. The user's home page (see home_page) will always be saved and does not need to be specified in this list.

Example:

"content_to_save": [
    "regex:^{{home_page}}",
    "/settings/"
]

content_to_save_only_if_linked_from_other_content

Regular expressions for additional "reference" pages to save. A "reference" page is a page that is linked from another page that will be saved and that is not itself a "reference" page. In the example below, "pin" pages will be saved if they are linked from the main pages, but not if they are linked from other "pin" pages. Any link (a[href]) found on a saved page that is not itself a "reference" page is matched against this list of regexes, and if there is a match and the link is not on the content_to_exclude list, the linked-to page will also be saved. These pages differ from the other saved pages in two ways: lazy_load_on_paths is not applied to them, and links from "reference" pages are not followed. Use the macro {{home_page}} to substitute the user's home page URL as determined by home_page, or the macro {{username}} to substitute the user's username as extracted by actual_username_regex.

Example:

"content_to_save_only_if_linked_from_other_content": [
    "regex:^http://www.pinterest.com/pin/\\d+/$"
]
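
The interplay between content_to_save, content_to_exclude and the "reference" list can be sketched as a small decision function. This is only an illustration of the rules as described on this page, not the tool's actual code: the `classify_link` and `matches` helpers, the order of the checks, the expanded example values and the username `janedoe42` are all assumptions.

```python
import re

HOME_PAGE = "http://www.pinterest.com/janedoe42/"  # hypothetical user

# Entries as they might look after macro expansion (assumed to happen earlier).
CONTENT_TO_SAVE = [
    "regex:^http://www.pinterest.com/janedoe42/",
    "http://www.pinterest.com/settings/",
]
CONTENT_TO_EXCLUDE = [
    "http://www.pinterest.com/janedoe42/pins/",
    "http://www.pinterest.com/janedoe42/followers/",
    "http://www.pinterest.com/janedoe42/following/",
]
REFERENCE_PAGES = ["regex:^http://www.pinterest.com/pin/\\d+/$"]


def matches(url, entries):
    """True if url equals a plain entry or matches a 'regex:' entry."""
    for entry in entries:
        if entry.startswith("regex:"):
            if re.search(entry[len("regex:"):], url):
                return True
        elif entry == url:
            return True
    return False


def classify_link(url, found_on_reference_page):
    """Return 'save', 'reference' or 'skip' for a link found on a saved page."""
    if found_on_reference_page:
        return "skip"  # links from "reference" pages are not followed
    if url == HOME_PAGE or url in CONTENT_TO_SAVE:
        return "save"  # explicitly listed URLs win over content_to_exclude
    if matches(url, CONTENT_TO_EXCLUDE):
        return "skip"
    if matches(url, CONTENT_TO_SAVE):
        return "save"
    if matches(url, REFERENCE_PAGES):
        return "reference"
    return "skip"


print(classify_link("http://www.pinterest.com/janedoe42/boards/", False))
print(classify_link("http://www.pinterest.com/pin/12345/", False))
print(classify_link("http://www.pinterest.com/pin/12345/", True))
```

With these values, a board page under the home page is classified as saved, a "pin" page linked from a main page as a "reference" page, and the same "pin" page linked from another "pin" page as skipped.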

home_page

URL of the user's home page. Use the macro {{username}} to substitute the user's username as extracted by actual_username_regex. The user's home page will always be saved and does not need to be specified in the content_to_save list.

Example:

"home_page": "http://www.pinterest.com/{{username}}/"

lazy_load_on_paths

Indicates whether "lazy load" is in effect on pages that will be saved. This is needed to load all content on a page before saving it. The same check is performed on every page that will be saved, but not on the login or landing pages; there is no per-page setting. (See the intro atop this Wiki page for the properties that control which pages get saved. As an exception, "lazy load" is not applied to pages matching content_to_save_only_if_linked_from_other_content.)

Example:

"lazy_load_on_paths": true

login_error_text_selector

Selector for the visible element that contains the login error message if login fails. Unlike login_success_element_selector, this cannot be just any element that is visible only when login fails; it must contain the error message. If found, this error message is displayed "as is" in the tool's command line interface.

Example:

"login_error_text_selector": ".loginError p"

login_form_password_field_name

Name of the form's password input field, as in <input type="password" name="password">.

Example:

"login_form_password_field_name": "password"

login_form_selector

Selector for the login form element that contains the username and password input fields and the submit button. Make sure the targeted element is visible on page load, which may mean a dedicated login page must exist. See also: login_page.

Example:

"login_form_selector": "form"

login_form_submit_button_selector

Selector for the form's submit button.

Example:

"login_form_submit_button_selector": "button[type='submit']"

login_form_username_field_name

Name of the form's username input field, as in <input type="text" name="username">.

Example:

"login_form_username_field_name": "username_or_email"

login_page

URL of the login page. Most sites have a dedicated login page; if a site does not, the page at this URL must have visible username and password input fields. If no such page exists, open an issue so that it can be investigated.

Example:

"login_page": "https://www.pinterest.com/login/"

login_success_element_selector

Selector for the element that is visible when login is successful. If the process of logging in redirects to another page, this element must be visible on the target page. The element can be anything, but make sure it's only visible if login is successful.

Example:

"login_success_element_selector": ".RightHeader"