MODULES.org

This file is an overview of documented modules (which I’m progressively expanding).

There are many more, see:

  • “What’s inside” for the full list of modules.
  • you can also run hpi modules to list what’s available on your system
  • source code is always the primary source of truth

If you have some issues with the setup, see “Troubleshooting”.

Intro

See SETUP to find out how to set up your own config.

Some explanations:

  • MY_CONFIG is the path where you are keeping your private configuration (usually ~/.config/my/)
  • Path is a standard Python object to represent paths
  • PathIsh is a helper type to allow using either str, or a Path
  • Paths is another helper type for paths.

    It’s ‘smart’, allows you to be flexible about your config:

    • simple str or a Path
    • /a/path/to/directory/, so the module will consume all files from this directory
    • a list of files/directories (it will be flattened)
    • a glob string, so you can be flexible about the format of your data on disk (e.g. if you want to keep it compressed)
    • empty string (e.g. export_path = ''), this will prevent the module from consuming any data

      This can be useful for modules that merge multiple data sources (for example, my.twitter or my.github)

    Typically, such variable will be passed to get_files to actually extract the list of real files to use. You can see usage examples here.

  • if the field has a default value, you can omit it from your private config altogether
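The flexibility of Paths described above can be illustrated with a small sketch. This is not HPI's get_files; expand_paths is a hypothetical helper written only for this example, and details such as the sorting and the non-recursive directory listing are assumptions.

```python
import glob
from pathlib import Path
from typing import List, Sequence, Tuple, Union

PathIsh = Union[str, Path]
Paths = Union[PathIsh, Sequence[PathIsh]]


def expand_paths(paths: Paths) -> Tuple[Path, ...]:
    """Hypothetical helper: flatten a Paths-like value into a sorted tuple of paths."""
    # An empty string means "consume no data".
    if paths == '':
        return ()
    # A single str/Path is treated as a one-element list.
    items = [paths] if isinstance(paths, (str, Path)) else list(paths)
    found: List[Path] = []
    for item in items:
        text = str(item)
        if any(ch in text for ch in '*?['):
            # Glob string: expand it against the filesystem.
            found.extend(Path(p) for p in glob.glob(text, recursive=True))
        elif Path(text).is_dir():
            # Directory: take the files directly inside it (assumption: not recursive).
            found.extend(p for p in Path(text).iterdir() if p.is_file())
        else:
            # Anything else is taken as a path to a single file.
            found.append(Path(text))
    return tuple(sorted(found))
```

With a helper like this, a glob such as '/exports/*.json*' would pick up both plain and compressed exports, while export_path = '' yields an empty tuple.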

For more thoughts on modules and their structure, see MODULE_DESIGN

all.py

Some modules have lots of different sources for data. For example, my.location (location data) has lots of possible sources: my.google.takeout.parser, the gpslogger Android app, or geolocating my.ip addresses. If you only plan on using one of the modules, you can import from that individual module (e.g. my.google.takeout.parser), or you can disable the others using the core config. See the MODULE_DESIGN docs for more details.
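The idea of merging several sources while leaving some of them out can be sketched in a few lines. This is only an illustration of the pattern, not how HPI's all.py files are implemented; merged, the source names and the disabled set are all hypothetical.

```python
from itertools import chain
from typing import Callable, Dict, Iterable, Iterator, Set

# A source is any zero-argument callable that returns an iterable of items.
Source = Callable[[], Iterable[str]]


def merged(sources: Dict[str, Source], disabled: Set[str]) -> Iterator[str]:
    """Yield items from every source whose name is not in `disabled`."""
    active = (fn() for name, fn in sources.items() if name not in disabled)
    return chain.from_iterable(active)


# Hypothetical stand-ins for the location sources mentioned above.
sources = {
    'takeout': lambda: ['takeout-point'],
    'gpslogger': lambda: ['gpx-point'],
    'via_ip': lambda: ['ip-point'],
}
print(list(merged(sources, disabled={'via_ip'})))
```

Because the sources are only called when their name is not disabled, a disabled source never needs its data or dependencies to be present.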

Configs

The config snippets below are meant to be modified to fit your setup and pasted into your private configuration, e.g. $MY_CONFIG/my/config.py.

You don’t have to set up all modules at once, it’s recommended to do it gradually, to get the feel of how HPI works.

For an extensive/complex example, you can check out @seanbreckenridge’s config
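At the other end of the scale, a minimal private config might set up a single module and leave everything else out. The sketch below is an assumption about what such a file could look like: the path is a placeholder, and the values are plain assignments so that the file needs no imports.

```python
# Hypothetical minimal config.py; replace the placeholder path with your own.
class reddit:
    class rexport:
        # glob matching the exported JSON files
        export_path = '/path/to/rexport/*.json'


class hypothesis:
    # empty string: the module will not consume any data
    export_path = ''
```

Further modules can then be added one class at a time as you set up their exports.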

Reddit data: saved items/comments/upvotes/etc.

class reddit:
    class rexport:
        '''
        Uses [[https://github.com/karlicoss/rexport][rexport]] output.
        '''

        # path[s]/glob to the exported JSON data
        export_path: Paths

    class pushshift:
        '''
        Uses [[https://github.com/seanbreckenridge/pushshift_comment_export][pushshift]] to get access to old comments
        '''

        # path[s]/glob to the exported JSON data
        export_path: Paths

Parses browser history using browserexport

class browser:
    class export:
        # path[s]/glob to your backed up browser history sqlite files
        export_path: Paths

    class active_browser:
        # paths to sqlite database files which you use actively
        # to read from. For example:
        # from browserexport.browsers.all import Firefox
        # export_path = Firefox.locate_database()
        export_path: Paths

Merged location history from lots of sources.

The main sources here are gpslogger .gpx (XML) files, and google takeout (using my.google.takeout.parser), with a fallback on manually defined home locations.

You might also be able to use my.location.via_ip, which uses my.ip.all to provide geolocation data for IP addresses (though no IPs are provided by any of the sources here). For an example of usage, see here

from datetime import date, datetime, timezone

class location:
    home = (
        # supports ISO strings
        ('2005-12-04'                                       , (42.697842, 23.325973)), # Bulgaria, Sofia
        # supports date/datetime objects
        (date(year=1980, month=2, day=15)                   , (40.7128  , -74.0060 )), # NY
        (datetime.fromtimestamp(1600000000, tz=timezone.utc), (55.7558  , 37.6173  )), # Moscow, Russia
    )
    # note: order doesn't matter, will be sorted in the data provider

    class gpslogger:
        # path[s]/glob to the exported gpx files
        export_path: Paths

        # default accuracy for gpslogger
        accuracy: float = 50.0

    class via_ip:
        # guess ~15km accuracy for IP addresses
        accuracy: float = 15_000
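Since home accepts ISO strings, date objects and datetime objects in any order, a data provider has to normalize and sort them before use. The sketch below shows one way that could be done; to_datetime and sorted_home are hypothetical names, and treating naive values as UTC is an assumption of this example rather than documented behaviour.

```python
from datetime import date, datetime, time, timezone
from typing import List, Sequence, Tuple, Union

DateIsh = Union[str, date, datetime]
LatLon = Tuple[float, float]


def to_datetime(value: DateIsh) -> datetime:
    """Convert an ISO string, date or datetime into a tz-aware datetime."""
    if isinstance(value, str):
        dt = datetime.fromisoformat(value)
    elif isinstance(value, datetime):
        # Checked before `date`, because datetime is a subclass of date.
        dt = value
    else:
        dt = datetime.combine(value, time.min)
    # Assumption: values without tz info are treated as UTC.
    return dt if dt.tzinfo is not None else dt.replace(tzinfo=timezone.utc)


def sorted_home(home: Sequence[Tuple[DateIsh, LatLon]]) -> List[Tuple[datetime, LatLon]]:
    """Normalize every entry and sort the result chronologically."""
    return sorted(((to_datetime(d), loc) for d, loc in home), key=lambda pair: pair[0])
```

Applied to the three entries in the config above, this would order them as NY (1980), Sofia (2005), Moscow (2020).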

Uses the my.location module to determine the timezone for a location.

This can be used to ‘localize’ timezones. Most modules here return datetimes in UTC, to avoid confusion about whether a datetime is in a local timezone, in UTC, or in your timezone.

Depending on the specific data provider and your level of paranoia you might expect different behaviour. E.g.:

  • if your objects already have tz info, you might not need to call localize() at all
  • it’s safer when either all of your objects are tz aware or all are tz unaware, not a mixture
  • you might trust your original timezone, or it might just be UTC, and you want to use something more reasonable

TzPolicy = Literal[
    'keep'   , # if datetime is tz aware, just preserve it
    'convert', # if datetime is tz aware, convert to provider's tz
    'throw'  , # if datetime is tz aware, throw exception
]

This is still a work in progress; the plan is to integrate it with hpi query so that you can easily convert/localize timezones for some module/data.
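To make the three TzPolicy values concrete, here is a minimal sketch of how a policy could be applied to a single datetime. apply_tz_policy is a hypothetical function, not the module's actual localize(); in particular, attaching the provider's timezone to naive datetimes is an assumption of this example.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Literal

TzPolicy = Literal['keep', 'convert', 'throw']


def apply_tz_policy(dt: datetime, provider_tz: tzinfo, policy: TzPolicy = 'keep') -> datetime:
    """Hypothetical illustration of the keep/convert/throw policies."""
    if dt.tzinfo is None:
        # Assumption: a naive datetime is interpreted in the provider's timezone.
        return dt.replace(tzinfo=provider_tz)
    if policy == 'keep':
        # tz aware: preserve it unchanged
        return dt
    if policy == 'convert':
        # tz aware: same instant, expressed in the provider's timezone
        return dt.astimezone(provider_tz)
    if policy == 'throw':
        raise RuntimeError(f'{dt} is already tz aware')
    raise ValueError(f'unknown policy: {policy!r}')


# Usage with a fixed UTC+2 offset standing in for a location-derived timezone.
provider = timezone(timedelta(hours=2))
aware = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
print(apply_tz_policy(aware, provider, 'convert'))
```

Note that 'convert' changes only how the instant is displayed, so the converted value still compares equal to the original.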

class time:
    class tz:
        policy = 'keep'

        class via_location:
            # less precise, but faster
            fast: bool = True

            # sort locations by date
            # in case multiple sources provide them out of order
            sort_locations: bool = True

            # skip locations whose accuracy value is more than 5km:
            # such a location is too imprecise to determine the
            # timezone from
            require_accuracy: float = 5_000
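The sort_locations and require_accuracy options can be pictured as a filter followed by a sort. The sketch below is only an illustration under stated assumptions: the Fix record and the usable_fixes function are hypothetical, and accuracy is assumed to be a radius in metres where a larger number means a less precise location.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass
class Fix:
    """Hypothetical location record, not HPI's own Location type."""
    dt: datetime
    lat: float
    lon: float
    accuracy: float  # metres; larger means less precise


def usable_fixes(fixes: Iterable[Fix], require_accuracy: float = 5_000,
                 sort_locations: bool = True) -> List[Fix]:
    """Drop fixes that are too imprecise, then optionally sort by date."""
    kept = [f for f in fixes if f.accuracy <= require_accuracy]
    if sort_locations:
        kept.sort(key=lambda f: f.dt)
    return kept
```

With the defaults from the config above, a gpslogger fix at 50 m would be kept, while an IP-based guess at 15 km would be skipped.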

Parses Google Takeout using google_takeout_parser

See google_takeout_parser for more information about how to export and organize your takeouts

If the DISABLE_TAKEOUT_CACHE environment variable is set, this won’t cache individual exports in ~/.cache/google_takeout_parser

The paths set as takeout_path can be unpacked directories or zip files of the exports; zip files are temporarily unpacked while creating the cachew cache.

class google(user_config):
    # directory which includes unpacked/zipped takeouts
    takeout_path: Paths

    error_policy: ErrorPolicy = 'yield'

    # experimental flag to use core.kompress.ZipPath
    # instead of unpacking to a tmp dir via match_structure
    _use_zippath: bool = False
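The DISABLE_TAKEOUT_CACHE switch mentioned above depends only on whether the variable is set. How the module itself performs that check is not shown here, so the following is a sketch under that assumption, with takeout_cache_enabled as a hypothetical name.

```python
import os
from typing import Mapping


def takeout_cache_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical check: caching stays on unless DISABLE_TAKEOUT_CACHE is set."""
    # Assumption: any value counts as "set", including an empty string.
    return 'DISABLE_TAKEOUT_CACHE' not in environ
```

Passing the environment in as a parameter keeps the check easy to exercise without touching the real process environment.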

Hypothes.is highlights and annotations

class hypothesis:
    '''
    Uses [[https://github.com/karlicoss/hypexport][hypexport]] outputs
    '''

    # path[s]/glob to the exported JSON data
    export_path: Paths

Pocket bookmarks and highlights

class pocket:
    '''
    Uses [[https://github.com/karlicoss/pockexport][pockexport]] outputs
    '''

    # path[s]/glob to the exported JSON data
    export_path: Paths

Twitter data (tweets and favorites).

Uses Twint data export.

Requirements: pip3 install --user dataset

class twint:
    export_path: Paths # path[s]/glob to the twint SQLite database

Twitter data (uses official twitter archive export)

class twitter_archive:
    export_path: Paths # path[s]/glob to the twitter archive takeout

Last.fm scrobbles

class lastfm:
    """
    Uses [[https://github.com/karlicoss/lastfm-backup][lastfm-backup]] outputs
    """
    export_path: Paths

Polar articles and highlights

class polar:
    '''
    Polar config is optional, you only need it if you want to specify custom 'polar_dir'
    '''
    polar_dir: PathIsh = Path('~/.polar').expanduser()
    defensive: bool = True # pass False if you want it to fail faster on errors (useful for debugging)

Instapaper bookmarks, highlights and annotations

class instapaper:
    '''
    Uses [[https://github.com/karlicoss/instapexport][instapexport]] outputs.
    '''
    # path[s]/glob to the exported JSON data
    export_path: Paths

GitHub data (uses official GDPR export)

class github:
    gdpr_dir: PathIsh  # path to unpacked GDPR archive

GitHub data: events, comments, etc. (API data)

class github:
    '''
    Uses [[https://github.com/karlicoss/ghexport][ghexport]] outputs.
    '''
    # path[s]/glob to the exported JSON data
    export_path: Paths

    # path to a cache directory
    # if omitted, will use /tmp
    cache_dir: Optional[PathIsh] = None

Kobo e-ink reader: annotations and reading stats

class kobo:
    '''
    Uses [[https://github.com/karlicoss/kobuddy#as-a-backup-tool][kobuddy]] outputs.
    '''
    # path[s]/glob to the exported databases
    export_path: Paths