## Using BeautifulSoup to Scrape a Paid Online Chess Database for Free

#### Disclaimer: I'm a fan of paying for content that you enjoy, but in this case, I would like to avoid paying a yearly membership for one simple query that I'm trying to do.

So, I have a chess match in a few hours. I'm playing as Black, and I happen to know what kind of opening moves my opponent generally plays (1. d4). I'm hoping to do some preparation for the game by grabbing a list of chess games played by grandmasters that follow similar opening moves. Specifically, chess games are recorded in [PGN notation](https://en.wikipedia.org/wiki/Portable_Game_Notation), so I want to save a bunch of pgns to a text file that I can open with a chess program to allow me to play out the games and study lines in the opening. Unfortunately, [the website I'm trying to query](http://www.chessgames.com) for such a custom database is asking me to sign up as a paid member for this. 

The website allows you to look at individual games for free, and even to search for a list of games based on specific conditions (for example, which opening was played), but it won't allow you to batch-download the games into a custom database without passing a paywall. Therefore, I'm going to write a quick script to pull down the grandmaster games from their website individually and save them to my own database (a .pgn text file, in this case).

[Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a python library for pulling data out of HTML and XML files. I've only recently started using it and it seems like a very powerful tool.

In [1]:
import urllib2
from BeautifulSoup import BeautifulSoup

I found the base url below by going to the main chess database site and entering a basic query: I want games following the opening known as the "[Benko Gambit](https://en.wikipedia.org/wiki/Benko_Gambit)" (which I am hoping to play tonight), and the games have to be played after the year 2010. I add the year requirement because chess opening theory is constantly evolving so specific lines in the opening before this year may be outdated (generally meaning they've been refuted by computers).

In [2]:
maxgames = 50 #I won't have time to look through tons of games, 
              #so let's only save 50
def getSoup(pagenum = 1):
    #Split the url into two lines just to make it easier to see
    baseurl = 'http://www.chessgames.com/perl/chess.pl?page=%d&'% pagenum
    baseurl += 'playercomp=either&year=2010&yearcomp=ge&eco=A57-A59' 
    ufile = urllib2.urlopen(baseurl)
    return BeautifulSoup(ufile)

In [3]:
soup = getSoup()

Now I have the HTML from the first page (which contains a table of 25 games, each of which is a hyperlink) in the form of a BeautifulSoup instance. The page also contains a string like "page 1 of 10", meaning I can sift through more than just the first page if I want to extract more than 25 games. Here's how I determine the maximum number of pages I can loop through:

In [4]:
#Let's split up the page's text, look for a specific string
#There happen to be multiple instances on each page, 
#so let's just use sets to identify unique elements.
pagenumstr = list(set([x for x in soup.getText().split(';') if 'page ' in x]))[0]
npages = int(pagenumstr.split(' ')[-1])
print "There are %d maximum pages to sift through." % npages

There are 46 maximum pages to sift through.


I can use some of the basic BeautifulSoup utilities to keep only the hyperlinks, and furthermore, keep only the ones that point to actual chess games (there are other hyperlinks that point to various other places). I will identify the relevant links by requiring they contain a specific string within them.

The function below will extract and build a list of URLs (as strings) that point to individual chess games. Later, I'll loop through these URLs and extract what I want (the PGN) from each of these URLs

In [5]:
#Note I'm using an older version of BeautifulSoup, so I use 
# findAll instead of find_all
#Pull out all the links in the soup (they start with <a href= ...)
def getAbsoluteURLs(mysoup):
    """
    Given a soup object, do some munging and return me the list of URLs that
    point to pages containing PGNs of individual games
    """
    #all_links is a list of type 'BeautifulSoup.Tag'
    all_links = mysoup.findAll('a')
    
    #From all_links, extract the ones that link directly to an actual game
    #and store the unicode that can be used to build a full URL that game
    rel_links = [x.get('href') for x in all_links if '/perl/chessgame?gid=' in x.get('href')]
    #An element in rel_link looks like: /perl/chessgame?gid=1805362
    
    #I notice that when I acutally click on the link, it takes me to 
    #the following url: http://www.chessgames.com/perl/chessgame?gid=1000223. 
    #That's amazingly convenient... 
    #let's build a list of full URLs, each of which 
    #points to an individual game.
    abs_urls = ['http://www.chessgames.com' + str(x) for x in rel_links]
    #An element in abs_urls looks like: 
    # "http://www.chessgames.com/perl/chessgame?gid=1805362"
    
    return abs_urls

I see the HTML of each game page contains the PGN I'm after, always surrounded by a block like

`<textarea style="display: none;" id="pgnText">`

so I'll find that in each of the soups and scrape out the PGN! Additionally, the PGN contains the playing strength of the Black player, and I'd like to make a cut on that to ensure I'm only looking at games from the strongest players.

In [6]:
def scrapePGN(myurl):
    """
    Function that, given a URL to a specific chess game, extracts the PGN and
    returns it as a string.
    """
    mysoup = BeautifulSoup(urllib2.urlopen(myurl))
    return mysoup.find('textarea').getText()

def verifyPGN(pgnstring):
    """
    I want to make sure that the player using the Black pieces 
    (as I will be tonight) is very strong. Chess players have 
    a "rating", and I'll make sure the Black player is rated 
    at least 2200 if I'm going to save this game.
    """
    min_allowable_ELO = 2200
    
    if 'BlackElo' not in pgnstring: return False
    else: 
        elostr = [x for x in pgnstring.split("[") if 'BlackElo' in x][0]
        #Sometimes BlackELO is unknown... skip those games!
        if "?" in elostr: return False
        return elostr.strip("BlackElo")[2:-3] >= min_allowable_ELO

I also notice that the games are reverse-time-ordered... meaning, the most recent games are on the last page. I'd like to view them starting with the most recent, so let's start by grabbing some links on the back page, and loop in reverse.

Below is the main function that does all of the looping, scraping, munging, etc. The final output is a list of PGN strings, which I will write to a file.

In [7]:
def genPGNs():
    """
    Main function: Loops through the table pages in reverse order, 
    generates a list of links from each page, goes to each of those
    links, and extracts the PGN. It continues looping until it has 
    scraped "maxgames" PGNs, then stops and returns a list of the 
    PGNs (again, stored as strings).
    """
    pgns = []
    n_games_scraped = 0
    for ipage in xrange(npages,0,-1):
        soup = getSoup(ipage)
        for url in getAbsoluteURLs(soup):
            pgn = scrapePGN(url)
            if not verifyPGN(pgn): continue
            pgns.append(pgn)
            n_games_scraped += 1
            if n_games_scraped >= maxgames:
                print "Done! Scraped %d games." % n_games_scraped
                return pgns
    print "Done! Didn't reach maximum number of games. Scraped %d games." % n_games_scraped
    return pgns

In [8]:
pgns = genPGNs()

Done! Scraped 50 games.


Now to create an output file! [My chess program](http://scidvspc.sourceforge.net/) fortunately just reads in a plain text file with a bunch of pgns copied and pasted, so making the output file should be a cinch.

In [9]:
outfile = 'benko_games_db.pgn'
f = open(outfile, 'w')
for pgn in pgns:
    f.write(pgn + '\n\n')
f.close()

Let's have a quick look to make sure everything was saved alright in my output file:

In [10]:
!head  -n 25 $outfile

[Event "British Championships"]
[Site "Coventry ENG"]
[Date "2015.08.07"]
[EventDate "2015.07.27"]
[Round "11.36"]
[Result "1/2-1/2"]
[White "Michael Ashworth"]
[Black "John Garnett"]
[ECO "A59"]
[WhiteElo "1912"]
[BlackElo "1984"]
[PlyCount "82"]

1. d4 Nf6 2. c4 c5 3. d5 b5 4. cxb5 a6 5. bxa6 g6 6. Nc3 Bxa6
7. e4 Bxf1 8. Kxf1 d6 9. g3 Bg7 10. Kg2 O-O 11. Nf3 Na6
12. Qe2 Qb6 13. Nd2 Nc7 14. Nc4 Qa6 15. a4 Rfb8 16. Bd2 Nd7
17. Rhe1 Nb6 18. Nxb6 Qxb6 19. Reb1 e6 20. Qd3 exd5 21. Nxd5
Nxd5 22. Qxd5 Qb7 23. Qd3 Qc6 24. Qc2 Re8 25. Re1 d5 26. exd5
Qxd5+ 27. f3 Bd4 28. Rxe8+ Rxe8 29. Re1 Rb8 30. Bc3 Qc4
31. Re2 Bxc3 32. bxc3 Rb3 33. Re4 Qxc3 34. Qxc3 Rxc3 35. Re8+
Kg7 36. a5 Rc2+ 37. Kh3 Ra2 38. Ra8 c4 39. a6 c3 40. a7 c2
41. Rc8 Rxa7 1/2-1/2

[Event "PokerStars IoM Masters"]
[Site "Douglas ENG"]


Hooray! I've successfully created my output PGN file. I'm going to go load it into my chess program and start studying for my game tonight. Wish me luck!