Command line tool to allow for finding all urls from a website and extract data.
This tool is similar to a recursive grep using grep --extended-regexp --recursive
on a directory tree. Maxtract
searching a website for all explicit links (links provided via <a href="...">...</a>
) and pulls information from each
page.
When limiting the search depth via --max-depth
be aware that a depth of 0 will only search the given url and no others.
It may be minimally faster to simply use curl <url> | grep -E <pattern>
to extract information.
This tool can provide output in one of 3 ways as specified in the usage:
Output:
-o, --data-only only print the extracted data, without the source url
-f, --full print the url as a heading before the found data (default)
-j, --json print the data as json
--pretty-json print the data as pretty json
Prints the parent url and the extracted data in a tree-like format
$> maxtract https://sample.com/index.html --email
https://sample.com/child_00.html
├─ test.email@test.org
https://sample.com/child_01.html
├─ test.email@test.edu
https://sample.com/child_10.html
├─ test.email@test.co
https://sample.com/index.html
├─ i_am_an_email@some_school.edu
├─ test.email@test.com
Only prints the extracted data on separate lines
$> maxtract https://sample.com/index.html --email --data-only
test.email@test.org
test.email@test.edu
test.email@test.co
i_am_an_email@some_school.edu
test.email@test.com
Prints the entire map as json maintaining a list of the child urls from each parent. The output can be nicely formatted
(--json-pretty
) or output on one line (--json
).
$> maxtract https://sample.com/index.html --email --json-pretty
{
"https://sample.com/child_00.html": {
"url": "https://sample.com/child_00.html",
"data": [
"test.email@test.org"
],
"children": [
"https://sample.com/child_10.html"
]
},
"https://sample.com/child_01.html": {
"url": "https://sample.com/child_01.html",
"data": [
"test.email@test.edu"
],
"children": []
},
"https://sample.com/child_10.html": {
"url": "https://sample.com/child_10.html",
"data": [
"test.email@test.co"
],
"children": [
"https://sample.com/index.html"
]
},
"https://sample.com/index.html": {
"url": "https://sample.com/index.html",
"data": [
"i_am_an_email@some_school.edu",
"test.email@test.com"
],
"children": [
"https://sample.com/child_00.html",
"https://sample.com/child_01.html"
]
}
}