Findlinks

Findlinks is a script that dumps a list of Wikipedia page names where the article contains a URL from a given domain.

It answers the question: which pages on enwiki include a cnn.com URL?

It can operate on one wiki, two or more wikis, or all 800+ wikis.

It's useful for bot operators who need to know which articles to process for a given domain.

It's useful for dumping all URLs from a given domain, for whatever purpose.

It's useful for building other queries against the replication database to answer other questions.
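For instance, a minimal sketch of the kind of replica query involved, run through the ssh tunnel described under "How it works" below (4711 is an arbitrary local tunnel port; the column names assume the current MediaWiki externallinks schema with its reversed domain index, and older replicas used el_to/el_index instead):

  mysql --defaults-file=replica.my.cnf -h 127.0.0.1 -P 4711 enwiki_p -e "
    SELECT page_title
    FROM externallinks
    JOIN page ON page_id = el_from
    WHERE el_to_domain_index LIKE 'https://com.cnn.%'
      AND page_namespace = 0;"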

Running

findlinks - list page names that contain a domain

  -d <domain>   (required) Domain to search for eg. cnn.com
  -s <site>     (required) One or more site codes [space separated] - see allwikis.txt for the list
                           If "ALL" then process all sites (800+) in allwikis.txt
                           If "<whatever>.txt" then process all site codes listed in the file <whatever>.txt
                           Use of a trailing "_p" in the site code is supported but optional - see Examples below
  -n <ns>       (optional) Namespace(s) to target [space separated]. Default is "0 6"
                           eg. -n "0 6 10" will check these 3 namespaces 
                           0 = mainspace, 6 = File: and 10 = Template:
  -r <regex>    (optional) Only report URLs that match the given regex
  -k            (optional) Keep raw output file. Useful for viewing the URLs
  -a            (optional) Generate a fresh copy of allwikis.txt - ie. a list of all wiki site codes

  Examples:

    Find all pages on enwiki in namespace 4 & 5 that contain archive.md
      ./findlinks -d archive.md -s enwiki -n '4 5'
    Find all pages on enwiki and eswiki in namespace 0 that contain archive.md
      ./findlinks -d archive.md -s 'enwiki eswiki' -n 0
    Find all pages on the sites listed in mylist.txt in namespace 0 & 6 that contain archive.md
      ./findlinks -d archive.md -s mylist.txt
    Find all pages on enwiki in namespace 0 & 6 that contain an archive.today URL starting with 'http:'
      ./findlinks -d archive.today -s enwiki -r '^http:'

How it works

The script uses ssh to establish a tunnel to the replication server on Toolforge and then runs queries through the tunnel. It can run from any computer; it doesn't need to be hosted on Toolforge.
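In outline, and assuming your Toolforge shell name is "you" (4711 is an arbitrary local port), the tunnel looks something like:

  # Forward a local port to the enwiki replica through the Toolforge bastion
  ssh -N -L 4711:enwiki.analytics.db.svc.wikimedia.cloud:3306 you@login.toolforge.org &

  # Queries then run locally through the tunnel with the replica.my.cnf credentials
  mysql --defaults-file=replica.my.cnf -h 127.0.0.1 -P 4711 enwiki_p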

Dependencies

The script is written in awk (findlinks.awk) and also needs a MySQL client, ssh, and a Toolforge account for replica access - see Setup below. It uses the BotWikiAwk framework - see Credits.

Setup

  • Clone the repo

      cd ~
      git clone 'https://github.com/greencardamom/Findlinks'
    
  • Install a MySQL client if not already:

      sudo apt-get install mysql-client
    
  • findlinks.awk has a hard-coded path at the top of the file for the "Home" directory - edit it to match where you cloned the repo.
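
    For example, to locate the line to edit:

      grep -n 'Home' ~/Findlinks/findlinks.awk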

  • You will need a Toolforge account (free registration). Copy your replica.my.cnf file to the Findlinks local directory (it contains your SQL login ID and password)
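
    For example, one way to fetch it (assuming your Toolforge shell name is "you" and the file sits in your Toolforge home directory):

      scp you@login.toolforge.org:~/replica.my.cnf ~/Findlinks/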

  • You will need passwordless ssh access. Run 'ssh-keygen' and copy-paste the content of ~/.ssh/id_rsa.pub to your toolforge account at https://admin.toolforge.org/ under "Add a ssh public key"
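
    For example ("you" is a placeholder; newer ssh-keygen versions may default to ed25519 and write id_ed25519.pub instead):

      ssh-keygen                      # accept the defaults
      cat ~/.ssh/id_rsa.pub           # paste this at https://admin.toolforge.org/
      ssh you@login.toolforge.org     # confirm passwordless login works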

Credits

by User:GreenC (en.wikipedia.org)

MIT License Copyright 2024

Findlinks uses the BotWikiAwk framework of tools and libraries for building and running bots on Wikipedia:

https://github.com/greencardamom/BotWikiAwk
