v0.4 (first Django release) #207

Open
wants to merge 167 commits into
base: master
from

3 participants
@pirate
Owner

commented Apr 2, 2019

The v0.4 Release

A bunch of big changes:

  • pip install archivebox is now available
  • beginnings of transition to Django while maintaining a mostly backwards-compatible CLI
  • using argparse instead of hand-written CLI system: see archivebox/cli/archivebox.py
  • new subcommands-based CLI for archivebox (see below)

For more info, see: https://github.com/pirate/ArchiveBox/wiki/Roadmap

Released in this version:

Install Methods:

Command Line Interface:

Web UI:

  • / Main index
  • /add Page to add new links to the archive
  • /archive/<timestamp>/ Snapshot details page
  • /archive/<timestamp>/<url> live wget archive of page
  • /archive/<timestamp>/<extractor> get a specific extractor output for a given snapshot
  • /archive/<url> shortcut to view most recent snapshot of given url
  • /archive/<url_hash> shortcut to view most recent snapshot of given url
  • /admin Admin interface to view and edit archive data

Python API:

(Red features are still unfinished and will be released in later versions)

pirate added some commits Mar 26, 2019

pirate added some commits Apr 28, 2019

@pirate pirate referenced this pull request Apr 30, 2019

Open

Architecture: Serverless oneshot archiving #223

4 of 8 tasks complete
@datakid

commented May 5, 2019

Here are some notes I have taken on my attempts to get this happening.

Quick reference:

  • archiving seems to work
  • archivebox command is impressive
  • django admin screen is up and running.

But there are errors too. Here are my notes:

  • Installed python3, pip3
  • created and activated venv (/home/ubuntu/ab)
  • git clone (/home/ubuntu/ArchiveBox)
  • git checkout django
  • pip installed archivebox
  • create new folder (/home/ubuntu/archives)
  • archivebox init

export OUTPUT_DIR=/home/ubuntu/archives

Small change I've made: updated to Django 2.2.1 and have edited setup.py to
reflect that.

[i] Dependency versions:
 √  PYTHON_BINARY          /home/ubuntu/ab/bin/python3                                                  v3.6            valid 
 √  DJANGO_BINARY          /home/ubuntu/ab/lib/python3.6/site-packages/django/bin/django-admin.py       v2.2.1          valid 
 √  CURL_BINARY            /usr/bin/curl                                                                v7.58.0         valid 
 √  WGET_BINARY            /usr/bin/wget                                                                v1.19.4         valid 
 √  GIT_BINARY             /usr/bin/git                                                                 v2.17.1         valid 
 √  YOUTUBEDL_BINARY       /home/ubuntu/ab/bin/youtube-dl                                               v2019.04.30     valid 
 √  CHROME_BINARY          /usr/bin/chromium-browser                                                    v73.0.3683.86   valid 

[i] Code locations:
 √  REPO_DIR               /home/ubuntu/ab/lib/python3.6/site-packages                                  66 files        valid 
 √  PYTHON_DIR             /home/ubuntu/ab/lib/python3.6/site-packages/archivebox                       15 files        valid 
 √  TEMPLATES_DIR          /home/ubuntu/ab/lib/python3.6/site-packages/archivebox/themes/legacy         1 files         valid 

[i] External locations:
 -  CHROME_USER_DATA_DIR                                                                                -               disabled 
 -  COOKIES_FILE                                                                                        -               disabled 

[i] Data locations:
 √  OUTPUT_DIR             /home/ubuntu/archives                                                        10 files        valid 
 √  SOURCES_DIR            /home/ubuntu/archives/sources                                                2 files         valid 
 √  LOGS_DIR               /home/ubuntu/archives/logs                                                   0 files         valid 
 √  ARCHIVE_DIR            /home/ubuntu/archives/archive                                                2 files         valid 
 √  CONFIG_FILE            /home/ubuntu/archives/ArchiveBox.conf                                        436.0 Bytes     valid 
 √  SQL_INDEX              /home/ubuntu/archives/index.sqlite3                                          188.0 KB        valid 
 √  JSON_INDEX             /home/ubuntu/archives/index.json                                             65.7 KB         valid 
 √  HTML_INDEX             /home/ubuntu/archives/index.html                                             163.7 KB        valid 

There are two obvious errors/sticking points.

  1. Still getting that datetime error
(ab) ubuntu@archive-box:~/archives$ archivebox add http://discworld.atuin.net/lpc/
[*] [2019-05-05 02:23:18] Downloading http://discworld.atuin.net/lpc/
    > ./sources/discworld.atuin.net-1557022998.txt                                                                                                      
Traceback (most recent call last):
  File "/home/ubuntu/ab/bin/archivebox", line 10, in <module>
    sys.exit(main())
  File "/home/ubuntu/ab/lib/python3.6/site-packages/archivebox/__main__.py", line 10, in main
    archivebox.main(args=sys.argv[1:], stdin=sys.stdin)
  File "/home/ubuntu/ab/lib/python3.6/site-packages/archivebox/cli/archivebox.py", line 58, in main
    pwd=pwd or OUTPUT_DIR,
  File "/home/ubuntu/ab/lib/python3.6/site-packages/archivebox/cli/__init__.py", line 55, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/ubuntu/ab/lib/python3.6/site-packages/archivebox/cli/archivebox_add.py", line 55, in main
    out_dir=pwd or OUTPUT_DIR,
  File "/home/ubuntu/ab/lib/python3.6/site-packages/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)
  File "/home/ubuntu/ab/lib/python3.6/site-packages/archivebox/main.py", line 504, in add
    all_links = load_main_index(out_dir=out_dir)
  File "/home/ubuntu/ab/lib/python3.6/site-packages/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)
  File "/home/ubuntu/ab/lib/python3.6/site-packages/archivebox/index/__init__.py", line 249, in load_main_index
    all_links = list(parse_json_main_index(out_dir))
  File "/home/ubuntu/ab/lib/python3.6/site-packages/archivebox/index/json.py", line 52, in parse_json_main_index
    yield Link.from_json(link_json)
  File "/home/ubuntu/ab/lib/python3.6/site-packages/archivebox/index/schema.py", line 189, in from_json
    info['updated'] = parse_date(info.get('updated'))
  File "/home/ubuntu/ab/lib/python3.6/site-packages/archivebox/util.py", line 104, in typechecked_function
    return func(*args, **kwargs)
  File "/home/ubuntu/ab/lib/python3.6/site-packages/archivebox/util.py", line 188, in parse_date
    raise ValueError('Tried to parse invalid date! {}'.format(date))
ValueError: Tried to parse invalid date! 2019-04-28T07:26:43.256025
  2. If I change into the Django directory and run the server manually, then go
    to a browser and point it at that IP address, I get an error.

I've set debug=True, and this is what I'm seeing in the browser. I'm pretty sure
this is a PEBKAC, but I can't work it out yet.

http://IPADDRESS:8000/

'NoneType' object is not subscriptable
/home/ubuntu/ArchiveBox/archivebox/core/views.py in get, line 23

BUT

http://IPADDRESS:8000/admin gives me the admin screen.

OK. I use the django method to create a user

./manage createsuperuser

After that I see I can also run archivebox createsuperuser <- cool

Once I've logged in, I click Snapshot and I see this error

Request URL: 	http://IPADDRESS:8000/admin/core/snapshot/
Django Version: 	2.2.1
Exception Type: 	OperationalError
Exception Value: 	

no such table: core_snapshot
/home/ubuntu/ab/lib/python3.6/site-packages/django/db/backends/sqlite3/base.py in execute, line 383

I tried ./manage migrate, but there are no migrations to apply. Hmm. Not sure what to do next.
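
One way to narrow down a "no such table" error is to look directly at which tables a given SQLite file contains. The following is a minimal sketch using only the Python standard library; the helper names list_tables and has_snapshot_table are hypothetical and not part of ArchiveBox, and the path you pass in (for example the index.sqlite3 shown under "Data locations" above) is up to you.

```python
import sqlite3
from pathlib import Path
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in the SQLite file at db_path."""
    # Open read-only so that a wrong path raises an error instead of
    # silently creating a new, empty database file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def has_snapshot_table(db_path: str) -> bool:
    """True if the table named in the error message exists in this file."""
    return "core_snapshot" in list_tables(db_path)
```

If has_snapshot_table returns False for the file the server is actually using, the server is probably pointed at a database that never had the core app's migrations applied.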

@datakid

commented May 5, 2019

Regarding the date parsing issue, is it worth using dateparser?
(docs)

It seems to be pretty efficient at date parsing - and is an active repo.

>>> now="2019-04-28T07:26:43.256025"
>>> dateparser.parse(now)
datetime.datetime(2019, 4, 28, 7, 26, 43, 256025)
@pirate

Owner Author

commented May 6, 2019

These notes are very helpful, thanks @datakid! FYI once you run archivebox init, all Django commands must be run with archivebox manage in that collection directory (not ./manage.py in the code directory). This is needed in order to load the config and database that are defined in the collection dir. I just added a warning to manage.py to make that clearer to devs in the future: ca9c9ef. I'll also double-check the snapshot migrations in case there's a missing migration that could've caused the missing table error you encountered.
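
To illustrate the kind of guard described above, here is a rough sketch of a check for "am I inside a collection directory?". It is not the code from ca9c9ef. The function names are hypothetical, and the marker file names are an assumption taken from the "Data locations" output earlier in this thread.

```python
import os
import sys

# Assumption: a collection directory is recognized by these two files, which
# appear as SQL_INDEX and CONFIG_FILE in the output posted above.
COLLECTION_MARKERS = ("index.sqlite3", "ArchiveBox.conf")


def is_collection_dir(path: str) -> bool:
    """True if every marker file exists directly inside path."""
    return all(
        os.path.isfile(os.path.join(path, name)) for name in COLLECTION_MARKERS
    )


def warn_if_not_collection_dir(path: str) -> bool:
    """Print a warning to stderr and return False if path is not a collection."""
    if is_collection_dir(path):
        return True
    print(
        "[!] {} does not look like a collection directory; run "
        "'archivebox manage ...' from the folder where you ran "
        "'archivebox init'.".format(path),
        file=sys.stderr,
    )
    return False
```

Calling warn_if_not_collection_dir(os.getcwd()) at the top of a management entry point would surface the problem before Django loads the wrong database.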

Re: date parsing, I've considered using dateparser previously (I've used it many times in other projects). The reason I didn't is that all time formats ArchiveBox needs to parse are known ahead of time, so we should be able to write deterministic 1:1 parsers and dumpers for each format instead of using a library designed to parse arbitrary date formats. Given that my hand-written date parsing is buggy and causing an issue in your index, I'm considering adding dateparser just so I don't have to keep maintaining potentially buggy parsers that change across multiple ArchiveBox versions.
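
For comparison, a deterministic parser/dumper pair for a fixed set of formats could look roughly like the sketch below. The format list is an assumption for illustration only (the first entry matches the string from the traceback above), and parse_known_date and dump_date are hypothetical names, not the functions in archivebox/util.py.

```python
from datetime import datetime
from typing import Optional

# Assumption: an illustrative list of formats; the real set of formats that
# ArchiveBox needs to support is not spelled out in this thread.
CANONICAL_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"   # e.g. 2019-04-28T07:26:43.256025
KNOWN_FORMATS = (
    CANONICAL_FORMAT,
    "%Y-%m-%dT%H:%M:%S",                    # same, without microseconds
    "%Y-%m-%d %H:%M:%S",                    # e.g. 2019-05-05 02:23:18
)


def parse_known_date(text: Optional[str]) -> Optional[datetime]:
    """Parse text against each known format in order; None passes through."""
    if text is None:
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("Tried to parse invalid date! {}".format(text))


def dump_date(value: datetime) -> str:
    """Serialize a datetime in the canonical format, so parse/dump round-trip."""
    return value.strftime(CANONICAL_FORMAT)
```

Because every accepted format is listed explicitly, an unexpected input fails loudly instead of being guessed at, which is the trade-off against a general-purpose library such as dateparser.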

@datakid

commented May 6, 2019

AHA! Great - thanks for clearing that up.

Re date parser, I know what you mean about preferring small, tight code over potentially bloated Python modules. Maybe the best bet is to put in dateparser, then make a ticket ("low hanging fruit for new devs") to extract just the bits that matter. Might be worth spelling out the known date formats you are expecting.

@pirate pirate changed the title from "v0.4.0 (first Django release)" to "v0.4 (first Django release)" May 7, 2019

@pirate

Owner Author

commented May 13, 2019

Quick update: I'm busy with work and an event in NYC this whole week, but will try to get v0.4 launched and up on PyPI sometime next week. Apologies for the delay!

@hank

commented May 13, 2019

Just tried this out and found that the documentation saying Python 3.5 works is wrong. F-strings are 3.6+, so Debian 9 users will need to find a 3.6 solution.

@pirate

Owner Author

commented May 13, 2019

Yeah, v0.4 is >=3.6 only. Users on older OSs have to install a more recent Python version (which only takes 3 or 4 shell commands) or run an older ArchiveBox version. I have no intention of supporting Python versions below 3.6. The docs you saw are for the current version, and will all be updated once the new version is released.
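
A startup check along these lines would give users on older interpreters a clear message. This is only a sketch: MIN_PYTHON and both function names are hypothetical, and nothing here is taken from the ArchiveBox code. The check avoids f-strings and variable annotations on purpose, so that a 3.5 interpreter can still parse the file and print the message.

```python
import sys

# Assumption: 3.6 is the minimum, as stated in the comment above.
MIN_PYTHON = (3, 6)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Compare (major, minor) of version_info against the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


def require_supported_python():
    """Exit with a readable message when the interpreter is too old."""
    if not python_is_supported():
        sys.exit(
            "Python {}.{} or newer is required, but this is Python {}.{}.".format(
                MIN_PYTHON[0], MIN_PYTHON[1],
                sys.version_info[0], sys.version_info[1],
            )
        )
```

For the check to help, it has to run from a module that contains no 3.6-only syntax itself, before any module using f-strings is imported. Setting python_requires='>=3.6' in setup.py is the packaging-level equivalent, which lets pip refuse the install on older interpreters.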

@hank

commented May 13, 2019

Cool beans, dude. I'll try on Debian Buster, which is python 3.7. Thanks!

@pirate

Owner Author

commented May 20, 2019

Unfortunately, I'm going to have to delay this release a bit longer, as it's imperative that the security model is improved before this is released as an easy pip install. I've already reserved the package name on PyPI, I just haven't publicly released the next version on the main PyPI index yet. If you absolutely cannot wait and aren't using this project for private pages that need to be secure, you can start testing it by running pip install --index-url https://test.pypi.org/simple/ archivebox, or by cloning the repo and running pip install . inside it.

My day job is getting quite busy so it may be slow, but I appreciate y'all's patience! Expect several months 😿 before this is resolved and the release is ready, it's quite complex.

  • #237 Strip JS from static archives by default, force people to replay via isolated WARC proxy/GUI if they want interactivity preserved
  • #239 Disable index JS entirely (apart from DataTable) with CSP headers and title bleaching
  • #235 Fix URL parsing to be more robust against non-url-encoded symbols

Any help with PRs that strip JavaScript / add CSP headers would be super helpful! (check out the bleach python library to get started)
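
As a starting point for anyone picking this up, here is a small sketch of the two ideas from #239: building a restrictive Content-Security-Policy header value and neutralizing markup in page titles. It uses the standard library's html.escape rather than bleach, so it escapes tags instead of stripping them. The function names and the example policy are assumptions for illustration, not the policy ArchiveBox will ship.

```python
import html
from typing import Iterable, List, Tuple


def build_csp_header(directives: Iterable[Tuple[str, List[str]]]) -> str:
    """Join (directive, sources) pairs into a Content-Security-Policy value."""
    return "; ".join(
        "{} {}".format(name, " ".join(sources)) for name, sources in directives
    )


# Assumption: an example policy that blocks all scripts. Allowing a single
# script such as DataTable would need an extra script-src source (for
# example a hash or 'self'), which is left out here.
EXAMPLE_POLICY = [
    ("default-src", ["'none'"]),
    ("img-src", ["'self'"]),
    ("style-src", ["'self'"]),
    ("script-src", ["'none'"]),
]


def escape_title(title: str) -> str:
    """Escape HTML special characters so a title renders as plain text."""
    return html.escape(title, quote=True)
```

The header value from build_csp_header(EXAMPLE_POLICY) would be sent as the Content-Security-Policy response header on the index page, while escape_title would be applied to each scraped title before it is written into the HTML.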
