Reading long-form content on the internet is a shitty experience.
This is a web-proxy that tries to make it better.
Specifically, it is a rewriting proxy: it proxies arbitrary web content, and rewrites the remote content as driven by a set of rule-files. The goal is to allow complete customization of any existing website through predefined rules.
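The rule-file format is specific to this project and not reproduced here. As a purely hypothetical illustration of what a per-site rule might express, it could name the element holding the article body and the page furniture to strip:

```python
# Purely hypothetical illustration of a per-site rewriting rule; the
# project's actual rule-file format is different and not shown here.
EXAMPLE_RULE = {
    "domain": "example.com",
    # Keep only this subtree as the page's content body.
    "content_selector": "div#article-body",
    # Remove these elements entirely before rendering.
    "strip_selectors": ["div.sidebar", "aside", "script", "style"],
}
```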
Functionally, it's used for extracting just the actual content body of a site and reproducing it in a clean layout. It also rewrites all links on the page to point to internal addresses, so following a link loads the proxied version of the page rather than the original.
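The link-rewriting step could be done along these lines. This is a minimal sketch: BeautifulSoup, the `/view?url=...` path scheme, and the function name are illustrative assumptions, not the project's actual internals.

```python
# Minimal sketch of the link-rewriting step. BeautifulSoup, the
# "/view?url=..." scheme, and the function name are illustrative
# assumptions, not the project's actual internals.
import urllib.parse
from bs4 import BeautifulSoup

def rewrite_links(html, base_url):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("a", href=True):
        # Resolve relative links, then point them back at the proxy so
        # the next click loads the archived copy, not the origin site.
        absolute = urllib.parse.urljoin(base_url, tag["href"])
        tag["href"] = "/view?url=" + urllib.parse.quote(absolute, safe="")
    return str(soup)
```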
While the above was the original scope, the project has mutated heavily. At this point, it includes a complete web spider and archives entire websites to local storage. Additionally, multiple versions of each page are kept, with an overall rolling refresh of the entire database at configurable intervals (settable on a per-domain or global basis).
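As a sketch of the rolling-refresh idea (the function name, the seven-day default, and the per-domain table are illustrative assumptions, not the project's actual settings), a page is due for re-fetching once its stored copy is older than the domain's interval:

```python
# Illustrative sketch of per-domain vs. global refresh intervals; the
# names and the seven-day default are assumptions, not project settings.
import datetime

GLOBAL_REFRESH_INTERVAL = datetime.timedelta(days=7)
PER_DOMAIN_REFRESH = {
    "example.com": datetime.timedelta(days=1),
}

def needs_refresh(domain, last_fetched):
    """True if the stored copy (naive UTC timestamp) is stale."""
    interval = PER_DOMAIN_REFRESH.get(domain, GLOBAL_REFRESH_INTERVAL)
    return datetime.datetime.utcnow() - last_fetched > interval
```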
Quick installation overview:
- Install PostgreSQL >= 9.5. This release is currently alpha, so you will (probably) have to build it from source. (The version requirement exists because this project uses the new `ON CONFLICT` clause; see the upsert sketch after this list.)
- Build the community extensions for PostgreSQL.
- Create a database for the project.
- In the project database, install the `pg_trgm` and `citext` extensions from the community extensions modules.
- Copy `settings.example.py` to `settings.py`.
- Set up the virtualenv by running `build-venv.sh`.
- Activate the virtualenv: `source flask/bin/activate`
- Bootstrap the DB: `create_db.sh`
- Run the server: `python3 run.py`
- (Optional) Start the scraper: `python runScrape.py`
- (Optional) Start the scraper's periodic scheduler: `python runScrape.py scheduler`
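As referenced in the first step above, here is a minimal sketch of why PostgreSQL >= 9.5 matters, assuming psycopg2 and a hypothetical `web_pages` table with a unique index on `url` (the project's real schema is built by `create_db.sh`):

```python
# Minimal sketch of the PostgreSQL >= 9.5 "ON CONFLICT" upsert that
# motivates the version requirement. The "web_pages" table and its
# unique index on "url" are hypothetical; create_db.sh builds the
# project's real schema.
import psycopg2

conn = psycopg2.connect("dbname=webproxy")  # connection string is illustrative
with conn, conn.cursor() as cur:
    # Install the community extensions (once per database, as superuser).
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")
    cur.execute("CREATE EXTENSION IF NOT EXISTS citext;")

    # Insert a fetched page, or refresh the stored copy if the URL is
    # already archived. This upsert is the 9.5-only feature.
    cur.execute("""
        INSERT INTO web_pages (url, content, fetched)
        VALUES (%s, %s, now())
        ON CONFLICT (url) DO UPDATE
            SET content = EXCLUDED.content,
                fetched = EXCLUDED.fetched;
    """, ("http://example.com/page", "<html>...</html>"))
```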