Findlinks

Findlinks is a script that dumps a list of Wikipedia page names where the article contains a URL from a given domain.

It answers the question: which pages on enwiki include a cnn.com URL?

It can operate on one wiki, two or more wikis, or all 800+ wikis.

It's useful for bot operators who need to know which articles to process for a given domain.

It's useful for dumping all URLs from a given domain, for whatever purpose.

It's useful for building other queries against the replication database to answer other questions.
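For instance, a minimal sketch of the kind of replica query involved, run through the ssh tunnel described under "How it works" below (4711 is an arbitrary local tunnel port; the column names assume the current MediaWiki externallinks schema with its reversed domain index, and older replicas used el_to/el_index instead):

  mysql --defaults-file=replica.my.cnf -h 127.0.0.1 -P 4711 enwiki_p -e "
    SELECT page_title
    FROM externallinks
    JOIN page ON page_id = el_from
    WHERE el_to_domain_index LIKE 'https://com.cnn.%'
      AND page_namespace = 0;"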

Running

findlinks - list page names that contain a domain

  -d <domain>   (required) Domain to search for eg. cnn.com
  -s <site>     (required) One or more site codes [space separated] - see allwikis.txt for the list
                           If "ALL" then process all sites (800+) in allwikis.txt
                           If "<whatever>.txt" then process all site codes listed in the file <whatever>.txt
                           Use of a trailing "_p" in the site code is supported but optional - see Examples below
  -n <ns>       (optional) Namespace(s) to target [space separated]. Default is "0 6"
                           eg. -n "0 6 10" will check these 3 namespaces 
                           0 = mainspace, 6 = File: and 10 = Template:
  -r <regex>    (optional) Only report URLs that match the given regex
  -k            (optional) Keep raw output file. Useful for viewing the URLs
  -a            (optional) Generate a fresh copy of allwikis.txt - ie. a list of all wiki site codes

  Examples:

    Find all pages on enwiki in namespace 4 & 5 that contain archive.md
      ./findlinks -d archive.md -s enwiki -n '4 5'
    Find all pages on enwiki and eswiki in namespace 0 that contain archive.md
      ./findlinks -d archive.md -s 'enwiki eswiki' -n 0
    Find all pages on the sites listed in mylist.txt in namespace 0 & 6 that contain archive.md
      ./findlinks -d archive.md -s mylist.txt
    Find all pages on enwiki in namespace 0 & 6 that contain an archive.today URL starting with 'http:'
      ./findlinks -d archive.today -s enwiki -r '^http:'

How it works

The script uses ssh to establish a tunnel to the replication server on Toolforge and then runs queries through the tunnel. It can run from any computer; it doesn't need to be hosted on Toolforge.
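In outline, and assuming your Toolforge shell name is "you" (4711 is an arbitrary local port), the tunnel looks something like:

  # Forward a local port to the enwiki replica through the Toolforge bastion
  ssh -N -L 4711:enwiki.analytics.db.svc.wikimedia.cloud:3306 you@login.toolforge.org &

  # Queries then run locally through the tunnel with the replica.my.cnf credentials
  mysql --defaults-file=replica.my.cnf -h 127.0.0.1 -P 4711 enwiki_p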

Dependencies

The script is written in awk (findlinks.awk) and also needs a MySQL client, ssh, and a Toolforge account for replica access - see Setup below. It uses the BotWikiAwk framework - see Credits.

Setup

  • Clone the repo

      cd ~
      git clone 'https://github.com/greencardamom/Findlinks'
    
  • Install a MySQL client if not already:

      sudo apt-get install mysql-client
    
  • findlinks.awk has a hard-coded path at the top of the file for the "Home" directory - edit it to match where you cloned the repo.
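
    For example, to locate the line to edit:

      grep -n 'Home' ~/Findlinks/findlinks.awk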

  • You will need a Toolforge account (free registration). Copy your replica.my.cnf file to the Findlinks local directory (it contains your SQL login ID and password)
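
    For example, one way to fetch it (assuming your Toolforge shell name is "you" and the file sits in your Toolforge home directory):

      scp you@login.toolforge.org:~/replica.my.cnf ~/Findlinks/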

  • You will need passwordless ssh access. Run 'ssh-keygen' and copy-paste the content of ~/.ssh/id_rsa.pub to your toolforge account at https://admin.toolforge.org/ under "Add a ssh public key"
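
    For example ("you" is a placeholder; newer ssh-keygen versions may default to ed25519 and write id_ed25519.pub instead):

      ssh-keygen                      # accept the defaults
      cat ~/.ssh/id_rsa.pub           # paste this at https://admin.toolforge.org/
      ssh you@login.toolforge.org     # confirm passwordless login works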

Credits

by User:GreenC (en.wikipedia.org)

MIT License Copyright 2024

Findlinks uses the BotWikiAwk framework of tools and libraries for building and running bots on Wikipedia:

https://github.com/greencardamom/BotWikiAwk
