tq is a command-line utility that selects elements from HTML content passed on stdin, using the CSS selectors that everybody knows.
Since input comes from stdin and output is sent to stdout, it can easily be used inside traditional UNIX pipelines to extract content from webpages and HTML files.
tq provides extra formatting options such as JSON encoding and newline squashing, so it can play nicely with everyone's favourite command-line tooling.
sudo pip install https://github.com/plainas/tq/zipball/stable
WARNING: tq requires Python 3. On some systems, pip installs Python 2 packages. In that case, you will need to use pip3 instead.
Get headlines from Hacker News
curl https://news.ycombinator.com/news | tq -tj ".title a"
Get the title of an html document stored in a file
cat mydocument.html | tq -t title
Get all the images from a webpage
curl -s 'http://example.com/' | tq "img" -a src | wget -i -
Note that tq does not make HTTP requests or read files itself. Use your favorite HTTP client, or supply the HTML from any other source you want.
For a modern, user-friendly HTTP client, check out httpie. Or you can just use curl, wget, netcat, etc.
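Because tq only reads stdin, any program that can write to a pipe can feed it. As a minimal sketch, assuming tq is installed and on the PATH, a Python script could pass an HTML string to it roughly as follows. The helper names build_tq_command and run_tq are hypothetical and not part of tq; only the -t, -j and -a flags documented here are used.

```python
import subprocess


def build_tq_command(selector, text=False, json_lines=False, attr=None):
    """Build the argv list for a tq call (hypothetical helper, not part of tq)."""
    cmd = ["tq"]
    if text:
        cmd.append("-t")          # inner text only
    if json_lines:
        cmd.append("-j")          # JSON encode each match
    if attr is not None:
        cmd += ["-a", attr]       # contents of one attribute only
    cmd.append(selector)
    return cmd


def run_tq(html, selector, **options):
    """Pipe an HTML string to tq on stdin and return whatever it prints.

    Assumes the tq executable is installed and on the PATH.
    """
    result = subprocess.run(
        build_tq_command(selector, **options),
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With tq installed, run_tq(page_source, "img", attr="src") would correspond to the shell form tq -a src "img" with page_source on stdin.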
SELECTOR                         A css selector
-a ATTRIBUTE, --attr=ATTRIBUTE   Outputs only the contents of the html ATTRIBUTE.
-t, --text                       Outputs only the inner text of the selected elements.
-q, --squash                     Squash lines.
-s, --squash-space               Squash spaces.
-j, --json-lines                 JSON encode each match.
-J, --json                       Output as json array of strings.
-v, --version                    Prints tq version
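
The two JSON options differ in how the matches are framed. As an illustration only, assuming -j writes one JSON-encoded string per line and -J writes a single JSON array (which is how the descriptions above read), a consumer written in Python might decode each form as sketched here. The function names and the sample data are hypothetical, not real tq output.

```python
import json


def decode_json_lines(output):
    """Decode output assumed to hold one JSON-encoded string per line."""
    return [json.loads(line) for line in output.split("\n") if line.strip()]


def decode_json_array(output):
    """Decode output assumed to be a single JSON array of strings."""
    return json.loads(output)


# Hypothetical sample data: the second match contains an embedded newline,
# which JSON encoding represents as the two characters backslash and n.
lines_sample = '"First headline"\n"Second\\nheadline"\n'
array_sample = '["First headline", "Second\\nheadline"]'

# Both framings carry the same list of matches.
assert decode_json_lines(lines_sample) == decode_json_array(array_sample)
```

A JSON string never contains a raw newline, so under this assumption each match stays on a single line even when the matched text spans several, which keeps line-oriented tools downstream from splitting a match in two.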