<a data-flickr-embed="true"  href="https://www.flickr.com/photos/pfctdayelise/371603584" title="Sania Mirza"><img src="https://farm1.staticflickr.com/136/371603584_b2127a2671_n.jpg" width="198" height="320" alt="Sania Mirza" align="left" style="padding-right:30px;"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script> 
In 2006 and 2007 I went to the Australian Open to watch the tennis, and source freely-licensed photographs of tennis players for Wikipedia. I took around [100 photos](https://en.wikipedia.org/wiki/User:Pfctdayelise/Bragsheet#Tennis_photos), and happily many of them survive in Wikipedia articles [to](https://en.wikipedia.org/wiki/Tim_Henman) [this](https://en.wikipedia.org/wiki/Anne_Kremer) [day](https://en.wikipedia.org/wiki/Cyril_Saulnier). 

Seeing your images on articles is a pretty feel-good way of [contributing](https://en.wikipedia.org/wiki/Wikipedia:Ten_things_you_may_not_know_about_images_on_Wikipedia) to Wikipedia. So I am thinking about going again this year (the [Open](http://www.ausopen.com/index.html) started today), but the scattershot approach I used in 2007 isn't going to cut it any more. So I need to figure out which players don't have _any_ photos on their Wikipedia bio.

   * [Who's playing?](#Who%27s-playing?)
   * [Using the Wikipedia API](#Using-the-Wikipedia-API)
   * [Redirects](#Redirects)
   * [Page doesn't exist](#Page-doesn%27t-exist)
   * [Disambiguation pages](#Disambiguation-pages)
   * [hasImage](#hasImage)
   * [Full results](#Full-results)

_Also this photo of Sania Mirza is the third most popular image I have on Flickr - no idea why._

## Who's playing?

The official website has a [list of players](http://www.ausopen.com/en_AU/players/profiles.html). That's pretty quick to manually copy into a text file and delete a few stray lines.

There are 546 players, so I'm going to work with a shorter sample until I get things vaguely working, to speed up development and avoid hitting the API unnecessarily often.
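
One way to build such a sample without hand-picking lines is sketched below. This is only an illustration: `samplePlayers`, the sample size and the fixed seed are assumptions, not part of the original script.

```python
import random


def samplePlayers(lines, size, seed=0):
    """Return up to `size` non-blank lines, chosen reproducibly."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if size >= len(cleaned):
        return cleaned
    # A fixed seed means every rerun picks the same players.
    rng = random.Random(seed)
    return sorted(rng.sample(cleaned, size))


players = ["Baghdatis, Marcos", "Bai, Yan", "", "Baker, Brian",
           "Barrere, Gregoire", "Basic, Mirza"]
sample = samplePlayers(players, 3)
print(sample)
```

Keeping the sample stable between runs makes it easier to tell whether a change in the output came from the code or from a different set of names.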

## Using the Wikipedia API

Actually there is no Wikipedia API. But there is a [MediaWiki API](https://www.mediawiki.org/wiki/API:Main_page). It's very powerful, too - all kinds of [bots](http://www.technologyreview.com/view/524751/the-shadowy-world-of-wikipedias-editing-bots/) are powered by it. And there is a good Python client library, called [mwclient](https://github.com/mwclient/mwclient). OK, so I just want something a bit like this \*cracks knuckles\* ...

    # sampleplayers.txt
    
    Baghdatis, Marcos
    Bai, Yan
    Baker, Brian
    Barrere, Gregoire
    Basic, Mirza


In [60]:
import mwclient
site = mwclient.Site('en.wikipedia.org')

PLAYERSFILE = 'sampleplayers.txt'


def getPage(name):
    return site.Pages[name]


def hasImage(page):
    # TODO
    return False


hasimage = []
needsimage = []

with open(PLAYERSFILE) as players:
    for player in players:
        page = getPage(player)
        if hasImage(page):
            hasimage.append(player)
        else:
            needsimage.append(player)
            
print("Has image:", hasimage)
print("Needs image:", needsimage)


Has image: []
Needs image: ['Baghdatis, Marcos\n', 'Bai, Yan\n', 'Baker, Brian\n', 'Barrere, Gregoire\n', 'Basic, Mirza\n']


I'll worry about `hasImage` in a minute. Right now there is a more pressing problem: fixing up the names. I need to get rid of those newlines and make them 'firstname lastname' to match the Wikipedia naming convention.

In [61]:
def normaliseName(name):
    last, first = name.strip().split(', ')
    return ' '.join([first, last])


hasimage = []
needsimage = []

with open(PLAYERSFILE) as players:
    for player in players:
        forwardname = normaliseName(player)
        page = getPage(forwardname)
        if hasImage(page):
            hasimage.append(forwardname)
        else:
            needsimage.append(forwardname)
            
print("Has image:", hasimage)
print("Needs image:", needsimage)

Has image: []
Needs image: ['Marcos Baghdatis', 'Yan Bai', 'Brian Baker', 'Gregoire Barrere', 'Mirza Basic']
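
`normaliseName` assumes every line contains exactly one `', '`. A slightly more defensive variant might look like the sketch below. `normaliseNameSafe` is a hypothetical name, and treating a comma-less line as already being in display order is an assumption about the full player list.

```python
def normaliseNameSafe(line):
    """Turn 'Last, First' into 'First Last'; return None for blank lines."""
    line = line.strip()
    if not line:
        return None
    if ',' not in line:
        # Assumption: a line without a comma is already in display order.
        return line
    # Split on the first comma only, and tolerate a missing space after it.
    last, first = line.split(',', 1)
    return '{} {}'.format(first.strip(), last.strip())


print(normaliseNameSafe('Baghdatis, Marcos\n'))  # Marcos Baghdatis
print(normaliseNameSafe('Basic,Mirza'))          # Mirza Basic
print(normaliseNameSafe('\n'))                   # None
```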


OK. Maybe now I should verify the pages look like what I think they do...

In [62]:
with open(PLAYERSFILE) as players:
    for player in players:
        forwardname = normaliseName(player)
        page = getPage(forwardname)
        print(forwardname.upper())
        print(page.text()[:200])

MARCOS BAGHDATIS
{{Use dmy dates|date=June 2013}}
{{Infobox tennis biography
|name= Marcos Baghdatis<br><small>Μάρκος Παγδατής</small>
|image= Marcos Baghdatis Olympics 2012.jpg
|country= {{CYP}}
|residence= [[Limasso
YAN BAI
#redirect [[Bai Yan]]
BRIAN BAKER
'''Brian Baker''' may refer to:

* [[Brian Baker (musician)]] (born 1965), American guitarist for punk bands Minor Threat, Dag Nasty, and Bad Religion, among others
* [[Brian Baker (actor)]] (born 196
GREGOIRE BARRERE

MIRZA BASIC
#REDIRECT [[Mirza Bašić]]
{{R from title without diacritics}}


This reveals a few issues I need to deal with before I start looking at images. The [Marcos Baghdatis](https://en.wikipedia.org/wiki/Marcos_Baghdatis) article seems [legit](https://en.wikipedia.org/wiki/Marcos_Baghdatis?action=edit&veswitched=1). Yan Bai and Mirza Basic are redirects. Brian Baker is a disambiguation page, and Gregoire Barrere maybe doesn't have a page yet. 😢 Can anyone [fix that](https://en.wikipedia.org/wiki/Gr%C3%A9goire_Barr%C3%A8re)?

## Redirects

If I type in "Yan Bai" at Wikipedia, I get whisked off to https://en.wikipedia.org/wiki/Bai_Yan . There is a visual hint that something happened:

<img src="blog/redirect.png" />

Happily, the API knows about redirects and can automatically resolve them for me.

In [63]:
def getPage(name):
    return site.Pages[name].resolve_redirect()


In [64]:
for player in ['Yan Bai', 'Mirza Basic']:
    page = getPage(player)
    print(player.upper())
    print(page.text()[:200])

YAN BAI
{{chinese name|[[Bo (Chinese name)|Bai/Bo (柏)]]}}
{{Infobox tennis biography
|name = Bai Yan
|image = 
|country = {{CHN}}<ref name=ATPProfile>{{cite news|title=ATP.com - Yan Bai Profile|url=http://www
MIRZA BASIC
{{Infobox tennis biography
| name                  = Mirza Bašić
| image                 = <!-- Commented out because image was deleted: [[File:Mirza-Basic.jpg|200px]] -->
| nickname              =
| 


Looks better! In the second case, the correct name is Mirza Bašić. It's embarrassing that the official Australian Open website can't cope with diacritics tbh.

To record the correct name of the page, I can do the following:

In [65]:
page.name

'Mirza Bašić'
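
To see how the official site's spelling relates to the Wikipedia title, here is a minimal sketch using the standard library's `unicodedata` module. `stripDiacritics` is a hypothetical helper, not something from the original script. It only handles letters that decompose into a base letter plus combining marks, so characters such as 'đ' or 'ø' would pass through unchanged.

```python
import unicodedata


def stripDiacritics(text):
    """Drop combining marks so that 'š' becomes 's'."""
    # NFD splits each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed
                   if not unicodedata.combining(ch))


print(stripDiacritics('Mirza Bašić'))       # Mirza Basic
print(stripDiacritics('Grégoire Barrère'))  # Gregoire Barrere
```

A check like this could confirm that a redirect target really is the same name with diacritics restored, rather than a redirect to some unrelated page.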

## Page doesn't exist

Gregoire Barrere (or rather Grégoire Barrère) doesn't have a page yet. The API also copes with this pretty well:

In [66]:
site.Pages['Gregoire Barrere'].exists

False

So I update my `getPage` function:

In [67]:
def getPage(name):
    page = site.Pages[name].resolve_redirect()
    if not page.exists:
        # No article under this title, even after following redirects
        return None
    return page


needspage = []
hasimage = []
needsimage = []

with open(PLAYERSFILE) as players:
    for player in players:
        forwardname = normaliseName(player)
        page = getPage(forwardname)
        if not page:
            needspage.append(forwardname)
            continue

        if hasImage(page):
            hasimage.append(page.name)
        else:
            needsimage.append(page.name)

print("Needs page:", needspage)
print("Has image:", hasimage)
print("Needs image:", needsimage)

Needs page: ['Gregoire Barrere']
Has image: []
Needs image: ['Marcos Baghdatis', 'Bai Yan', 'Brian Baker', 'Mirza Bašić']


## Disambiguation pages

Disambiguation or "dab" pages are what I would call part of the Wikipedia "API": they're built on editing community conventions rather than technical capabilities of MediaWiki. But I need to deal with them, otherwise the results will be nonsense.

So let's look at the full content of the Brian Baker page and see what there is to play with:

In [68]:
page = site.Pages['Brian Baker']
print(page.text())

'''Brian Baker''' may refer to:

* [[Brian Baker (musician)]] (born 1965), American guitarist for punk bands Minor Threat, Dag Nasty, and Bad Religion, among others
* [[Brian Baker (actor)]] (born 1967), American actor and former Sprint spokesman
* [[Brian Baker (tennis)]] (born 1985), American professional tennis player
* [[Brian Baker (The Wire)]], police officer on the HBO drama ''The Wire''
* [[Brian Baker (diplomat)]] (born 1944), former Canadian diplomat and Ambassador to Denmark
* [[Brian Baker (politician)]], American politician and Missouri State Representative
* [[Brian Baker (producer)]], American engineer and producer for bands including Blue October
* Brian Baker, Australian singer for [[The Makers (Australian band)|The Makers]] and others
* [[Brian Baker (runner)]] (born 1970), American track and field athlete and coach
* [[Brian Edmund Baker]] (1896–1979), World War I flying ace

==See also==
* [[Bryan Baker (disambiguation)]]

{{hndis|Baker, Brian}}

{{DEFAULTSORT:Baker

Hmm ok...kind of not that useful. If I look at the page on Wikipedia, I can see there is a bit more structure that is not evident in the page wikitext:

<img src="blog/brianbakerdisambig.png" />

At the bottom there is a category which seems pretty definitive in terms of identifying a disambiguation page. Categories _are_ part of the MediaWiki API:

In [69]:
page = site.Pages['Brian Baker']
cats = page.categories()
for cat in cats:
    print(cat['title'])

Category:All disambiguation pages
Category:All article disambiguation pages
Category:Human name disambiguation pages


There are some bonus categories, because MediaWiki supports [hidden categories](https://www.mediawiki.org/wiki/Help:Categories#Hidden_categories). This is one of those features that you don't need unless your wiki has millions of pages and a crowd of obsessive sorters. If you have your preferences arranged just-so you can actually get these categories to show up.

<img src="blog/disambighiddencategories.png" />

OK so... to detect a disambiguation page, I can probably just look for one of these categories. In the API there are two ways to do this: check if the page is in the category, or check if the category is attached to the page. Sounds much of a muchness, but the [category All disambiguation pages](https://en.wikipedia.org/wiki/Category:All_disambiguation_pages) has over 265,000 members, and listing all of them just to find one page would be wasteful. So I have a hunch I shouldn't do it that way 😉

In [70]:
def isDisambiguation(page):
    cats = page.categories()
    disambigCat = 'Category:All disambiguation pages'
    return disambigCat in [cat['title'] for cat in cats]

page = site.Pages['Brian Baker']
print(isDisambiguation(page))

page = site.Pages['Marcos Baghdatis']
print(isDisambiguation(page))

True
False


(Another task is to try to resolve the disambiguation page to the correct page, but I'll tackle that later.)

In [71]:
needspage = []
disambigs = []
hasimage = []
needsimage = []

with open(PLAYERSFILE) as players:
    for player in players:
        forwardname = normaliseName(player)
        page = getPage(forwardname)
        if not page:
            needspage.append(forwardname)
            continue

        if isDisambiguation(page):
            disambigs.append(page.name)
        elif hasImage(page):
            hasimage.append(page.name)
        else:
            needsimage.append(page.name)

print("Needs page:", needspage)
print("Disambig:", disambigs)
print("Has image:", hasimage)
print("Needs image:", needsimage)

Needs page: ['Gregoire Barrere']
Disambig: ['Brian Baker']
Has image: []
Needs image: ['Marcos Baghdatis', 'Bai Yan', 'Mirza Bašić']


## hasImage

Now that I'm confident I'm on a player's biography, I can check for images.

I could try and parse the wikitext and see if the tennis player infobox has an image value filled out, but it seems simpler to start with the images API.

In [72]:
page = site.Pages['Marcos Baghdatis']
images = page.images()
for image in images:
    print(image['title'])

File:Flag of Finland.svg
File:Ambox important.svg
File:Baghdatis 2009 Delray 1.jpg
File:Flag of France.svg
File:Flag of Croatia.svg
File:Flag of Argentina.svg
File:Flag of Belgium (civil).svg
File:Flag of Cyprus.svg
File:Commons-logo.svg
File:Flag of Chile.svg
File:Flag of Russia.svg
File:Marcos Baghdatis Serve.JPG
File:Flag of Serbia.svg
File:Flag of Switzerland.svg
File:Flag of the United States.svg
File:Marcos Baghdatis 2004 US Open.JPG
File:Flag of Thailand.svg
File:Flag of the Czech Republic.svg
File:Marcos Baghdatis Olympics 2012.jpg
File:Marcos Baghdatis2007USOPEN.jpg


That's... a lot of flags. It's because editors like to do this kind of thing:

<img src="blog/flags.png" />

So to filter them out, what do they have in common?

What jumps out at me is that they are SVGs. SVGs are not normally used for photographs (which is what we are trying to detect), so that will be a good start.

In [73]:
def isBoring(imagename):
    # Flags, Increase2.svg, Decrease2.svg
    # Compare case-insensitively so names ending in '.SVG' are caught too
    return imagename.lower().endswith('.svg')


def hasImage(page):
    images = page.images()
    imgnames = [image['title'] for image in images]
    interestingImages = [imgname for imgname in imgnames
                         if not isBoring(imgname)]
    return bool(interestingImages)


for player in ['Marcos Baghdatis', 'Bai Yan', 'Mirza Bašić']:
    page = getPage(player)
    print(player, hasImage(page))

Marcos Baghdatis True
Bai Yan False
Mirza Bašić False
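
For comparison, the wikitext-parsing alternative mentioned at the start of this section might look roughly like the sketch below. It is not what the script does. `infoboxImage` is a hypothetical helper, and it assumes one infobox parameter per line, as in the page text printed earlier. The sample strings are trimmed copies of those pages.

```python
import re

# Assumes one infobox parameter per line, e.g. "|image = Something.jpg".
IMAGE_PARAM = re.compile(r'^\s*\|\s*image\s*=(.*)$', re.MULTILINE)
HTML_COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)


def infoboxImage(wikitext):
    """Return the infobox image value, or None if it is empty or missing."""
    # Strip comments first so a commented-out image counts as empty.
    uncommented = HTML_COMMENT.sub('', wikitext)
    match = IMAGE_PARAM.search(uncommented)
    if match is None:
        return None
    return match.group(1).strip() or None


baghdatis = ('{{Infobox tennis biography\n'
             '|name= Marcos Baghdatis\n'
             '|image= Marcos Baghdatis Olympics 2012.jpg\n')
baiYan = ('{{Infobox tennis biography\n'
          '|name = Bai Yan\n'
          '|image = \n'
          '|country = {{CHN}}\n')
basic = ('{{Infobox tennis biography\n'
         '| image = <!-- Commented out because image was deleted: '
         '[[File:Mirza-Basic.jpg|200px]] -->\n'
         '| nickname =\n')

print(infoboxImage(baghdatis))  # Marcos Baghdatis Olympics 2012.jpg
print(infoboxImage(baiYan))     # None
print(infoboxImage(basic))      # None
```

A regex like this is brittle against infoboxes written on a single line or using a different parameter name, which is one reason the images API is the simpler starting point.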


Now we can put it all together:

In [74]:
needspage = []
disambigs = []
hasimage = []
needsimage = []

with open(PLAYERSFILE) as players:
    for player in players:
        forwardname = normaliseName(player)
        page = getPage(forwardname)
        if not page:
            needspage.append(forwardname)
            continue

        if isDisambiguation(page):
            disambigs.append(page.name)
        elif hasImage(page):
            hasimage.append(page.name)
        else:
            needsimage.append(page.name)
            
print("Has image:", hasimage)
print("Disambig:", disambigs)
print("Needs image:", needsimage)
print("No page:", needspage)

Has image: ['Marcos Baghdatis']
Disambig: ['Brian Baker']
Needs image: ['Bai Yan', 'Mirza Bašić']
No page: ['Gregoire Barrere']


🙌

## Full results

I modified the above a little bit to write results to files rather than store them in lists. You can see the final script on [GitHub](https://github.com/pfctdayelise/aomp/blob/master/script.py).

    (aomp)blaugher@scorpion:~/workspace/aomp$ python script.py 
    ...............................................................................................
    ...............................................................................................
    ...............................................................................................
    ...............................................................................................
    ...............................................................................................
    .......................................................................
    Has image: 473
    See disambigs.txt, needspage.txt and needsimage.txt for further work!
    
    (aomp)blaugher@scorpion:~/workspace/aomp$ wc -l *txt
       25 disambigs.txt
       30 needsimage.txt
       18 needspage.txt
      546 players.txt

Overall quite impressive: only around 3% of the players at the 2016 Australian Open (18 of 546) don't have a Wikipedia article, and only another 5% or so (30 of 546) need images. (Give or take resolving the 25 name disambiguations and doing a bit more manual verification that the results seem reasonable.)
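
The counts in the script output and the `wc -l` listing above can be turned into shares of the 546 players with a few lines of arithmetic. This is just a worked check of the numbers, and the dictionary labels are my own.

```python
counts = {
    'has image': 473,
    'disambiguation': 25,
    'needs image': 30,
    'needs page': 18,
}
total = sum(counts.values())  # should equal the 546 lines in players.txt

for label, count in counts.items():
    print('{:<15} {:>3}  {:.1%}'.format(label, count, count / total))
```

The four groups add up to 546, and the shares come out at roughly 86.6%, 4.6%, 5.5% and 3.3% respectively.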

That leaves some todos:
   - Encourage people to [start articles](https://github.com/pfctdayelise/aomp/blob/master/needspage.txt) for the missing players
   - Try to resolve [disambiguation pages](https://github.com/pfctdayelise/aomp/blob/master/disambigs.txt) - look for a link on a line that mentions tennis, or even just tennis in the link name
   - For [missing photos](https://github.com/pfctdayelise/aomp/blob/master/needsimage.txt), check interwiki links (articles on the same topic in other languages) and see if any of them have images.
   - Put together my tennis schedule - seems like I will have a lot of time to enjoy the tennis for itself and not have to worry about photo ops! 🎾🎾🎾🎾🎾
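
The disambiguation idea in the list above (look for a link on a line that mentions tennis) could be sketched like this. `findTennisLink` is a hypothetical helper working on plain wikitext, and the sample text is a trimmed copy of the Brian Baker page printed earlier.

```python
import re

# Captures the target of [[Target]] or [[Target|label]].
WIKILINK = re.compile(r'\[\[([^\]|]+)(?:\|[^\]]*)?\]\]')


def findTennisLink(wikitext):
    """Return the first link target on a line mentioning tennis, else None."""
    for line in wikitext.splitlines():
        if 'tennis' not in line.lower():
            continue
        match = WIKILINK.search(line)
        if match:
            return match.group(1).strip()
    return None


dab = """'''Brian Baker''' may refer to:

* [[Brian Baker (actor)]] (born 1967), American actor and former Sprint spokesman
* [[Brian Baker (tennis)]] (born 1985), American professional tennis player
* [[Brian Edmund Baker]] (1896–1979), World War I flying ace
"""
print(findTennisLink(dab))  # Brian Baker (tennis)
```

It returns only the first match, so a page listing two tennis players, or a line that mentions table tennis, would still need a manual look.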

_Written by [Brianna Laugher](http://brianna.laugher.id.au) ([@pfctdayelise](https://twitter.com/pfctdayelise)), code is available on [github](https://github.com/pfctdayelise/aomp)._