Python Library

To install:

$ pip install -e python/
# Optional requirements:
$ pip install -r python/requirements.txt

Usage

You'll probably want to get an instance of Archive:

from pha import Archive
archive = Archive.default_location()

You can also use Archive(path), but a normal installation always puts the data in the data/ directory.
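
For example (a sketch; the path is a placeholder for wherever your checkout keeps its data):

from pha import Archive

# Point the archive at an explicit data directory instead of the default:
archive = Archive("data/")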

The key objects are all implemented in __init__.py: Archive, Activity, and Page.

  • Activity is one visit in the browser. This includes any changes to the location hash. This represents both old activity fetched from browser history (from HistoryItem and VisitItem), as well as new activity (with more complete information available).
  • Page is a fetched page. By default only one version of a page will be created for a given URL (though the code/database allows for multiple pages fetched over time). A page is stored both in the database and in a JSON file in data/pages/ (the library tries to be resilient when the two sources don't match).

Note that URLs do include the fragment/hash, so http://example.com/ and http://example.com/#header are treated as different.
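
For example (a sketch using archive.get_activity(), described just below; the URLs are placeholders):

# The fragment is significant, so these are looked up as distinct URLs:
plain = archive.get_activity("http://example.com/")
with_hash = archive.get_activity("http://example.com/#header")
# plain and with_hash are independent lists of Activity objects.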

Typically you'll call one of these methods (a combined sketch follows the list):

  • archive.get_activity(url): get a list of activities for the URL
  • archive.activity(): get a list of ALL activities
  • archive.activity_with_page(): get a list of all activities that also have a fetched page
  • archive.sample_activity_with_page(number, unique_url=True, unique_domain=False): fetch a random sample of pages. Because some domains (e.g., gmail.com) tend to produce many pages, this tries to sample "unique" pages. With unique_url it looks at the entire URL, normalizes its segments, and treats number and non-number segments differently, so the sample would include a homepage and an article page but probably not multiple article pages from the same site. unique_domain returns at most one page per domain.
  • archive.get_activity_by_source(activity.id): get every activity that came from the given activity (typically through navigation).
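
A combined sketch (hedged: it assumes Activity objects expose the id attribute used above; the URL is a placeholder):

from pha import Archive

archive = Archive.default_location()

# All recorded visits to one URL:
visits = archive.get_activity("http://example.com/")

# Roughly 20 activities with fetched pages, at most one per domain:
sample = archive.sample_activity_with_page(20, unique_url=True, unique_domain=True)

# Activities reached from the first visit (typically through navigation):
if visits:
    children = archive.get_activity_by_source(visits[0].id)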

Pages

You might spend most of your time with the Page objects, at least if you are interested in content parsing and interpretation.

A few highlights (a short sketch using them follows the list):

  • page.html: returns a viewable HTML representation of the page.
  • page.lxml: returns the page, having been parsed with lxml.html.
  • page.full_text: tries to get the full text of the page.
  • page.readable_text: if the page was parseable with Readability then this will contain the text extracted for the article view (excluding navigation, etc.).
  • page.readable_html: an HTML view of the readable portion of the page.
  • page.display_page(): run in a Jupyter Notebook, this will show the page in an iframe (see also notebooktools).
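
A sketch using these attributes (hedged: it assumes page is a Page obtained from the archive, e.g. via one of the activity methods above):

# Assume `page` is a Page instance obtained from the archive.
print(page.full_text[:200])      # raw text of the page
if page.readable_text:           # Readability-extracted article text
    print(page.readable_text[:200])
doc = page.lxml                  # parsed lxml.html document
print(doc.findtext(".//title"))  # e.g., read the <title> from the parse tree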

Helpers

There are several helper modules; notebooktools (used by page.display_page() above) is one of them.

Notebooks

I'm collecting notebooks in this directory as examples, and hopefully they'll grow into both documentation and interesting data interpretation. It would be cool to have more!
