wikicrawl

This is a crawler to report on pages in a Confluence wiki.

Create a Python 3.8 virtualenv, then install the requirements:

python -m pip install -r requirements.txt

To run it, create a file called keys.py like this:

USER = 'myemail@company.com'
PASSWORD = 'VouKOgWgS1xBiVMHtsGQD349'
SITE = 'https://openedx.atlassian.net/wiki'

or define the equivalent environment variables, shown here in POSIX shell syntax:

export CRAWL_USER='myemail@company.com'
export CRAWL_PASSWORD='VouKOgWgS1xBiVMHtsGQD349'
export CRAWL_SITE='https://openedx.atlassian.net/wiki'
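
One way a script could pick up these settings is sketched below. This is not taken from crawl.py: the load_settings helper is hypothetical, and the rule that keys.py wins over the CRAWL_ variables is an assumption made for illustration.

```python
import os
from types import SimpleNamespace


def load_settings(keys_module=None, environ=os.environ):
    """Resolve USER, PASSWORD and SITE (hypothetical helper, not from crawl.py).

    Assumption: a value in keys.py takes precedence over the matching
    CRAWL_-prefixed environment variable.
    """
    values = {}
    for name in ("USER", "PASSWORD", "SITE"):
        # getattr on None simply returns the default, so keys.py is optional.
        value = getattr(keys_module, name, None) or environ.get("CRAWL_" + name)
        if not value:
            raise RuntimeError(
                f"Set {name} in keys.py or CRAWL_{name} in the environment"
            )
        values[name] = value
    return SimpleNamespace(**values)


# Example wiring, with keys.py treated as optional:
#
#     try:
#         import keys
#     except ImportError:
#         keys = None
#     settings = load_settings(keys)
#     print(settings.SITE)
```

Raising an error for a missing value keeps a misconfigured run from failing later with a less obvious authentication error.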

The PASSWORD is an API token you can get from https://id.atlassian.com/manage-profile/security/api-tokens
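
Atlassian Cloud accepts an account email plus an API token as HTTP basic auth credentials. The sketch below builds such a request using only the standard library. The helper names are hypothetical, the rest/api/space path is assumed to be the Confluence Cloud REST endpoint for listing spaces, and crawl.py may do this differently.

```python
import base64
import urllib.request


def auth_headers(user, api_token):
    """Build an HTTP basic auth header from an email and an API token."""
    raw = f"{user}:{api_token}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def build_space_request(site, user, api_token):
    """Prepare (but do not send) a request for one wiki space.

    Assumption: SITE ends in /wiki and rest/api/space lists spaces.
    """
    url = site.rstrip("/") + "/rest/api/space?limit=1"
    return urllib.request.Request(url, headers=auth_headers(user, api_token))
```

Passing the prepared request to urllib.request.urlopen would perform the call; an HTTP 401 response would suggest that the email or the token is wrong.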

To collect visit data for your pages, add CLOUD_SESSION_COOKIE_TOKEN to keys.py like this:

CLOUD_SESSION_COOKIE_TOKEN = "sdljfslajdflashdflasjdflkajsldfjalsndamvosjdmiweryoweiurasnasdvosdueursasdkhasohdfasuioyfasjfioehsanfsflksajfioe"

Copy the value from the browser cookie named cloud.session.token, which is scoped to a domain similar to .atlassian.net. Your actual value will be much longer than the example above (about 900 characters).
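
As a rough illustration of how a copied cookie value could be attached to a request, the sketch below turns it into a Cookie header. The session_cookie_header helper is hypothetical; how get_visits.py actually sends the token is not shown in this README.

```python
def session_cookie_header(token):
    """Return a Cookie header carrying the cloud.session.token value.

    Hypothetical helper: strips whitespace that is easy to pick up when
    copying the value out of the browser, and rejects an empty token.
    """
    token = token.strip()
    if not token:
        raise ValueError("CLOUD_SESSION_COOKIE_TOKEN is empty")
    return {"Cookie": "cloud.session.token=" + token}
```

The session cookie expires, so a value that worked earlier may need to be copied from the browser again.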

Then run:

python crawl.py --all --pages

This creates an html directory in the current directory and populates it with the report. Open html/index.html to see the list of wiki spaces, each linked to a report on the pages in that space.
