
## Download board game IDs from boardgamegeek.com

### Springboard Capstone 2 project: building a recommendation engine
### John Burt


Procedure:
- Scrape the "Browse" section of BGG to acquire a list of game IDs.
- Keep games above a minimum number of ratings (threshold: 100).
- Save game name and ID to CSV.
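
The procedure above amounts to a paging loop with a stop condition, followed by a filter. Here is a minimal sketch of that control flow, using hypothetical in-memory pages in place of live requests. The function name `collect_games` and the sample data are assumptions for illustration and are not part of the notebook. The explicit filter at the end matters because the last scraped page can straddle the threshold.

```python
def collect_games(pages, min_nrate=100):
    """Collect rows page by page until a page holds a game at or below min_nrate.

    `pages` is any iterable of pages; each page is a list of
    (game_id, name, nrate) tuples, sorted by nrate in descending order overall.
    """
    collected = []
    for page in pages:
        if not page:  # an empty page means the listing is exhausted
            break
        collected.extend(page)
        lowest = min(nrate for _, _, nrate in page)
        if lowest <= min_nrate:
            break  # later pages can only contain games with fewer ratings
    # The last page may straddle the threshold, so filter explicitly
    return [row for row in collected if row[2] > min_nrate]


# Hypothetical stand-in data, not real BGG numbers
fake_pages = [
    [(1, "Game A", 900), (2, "Game B", 450)],
    [(3, "Game C", 120), (4, "Game D", 80)],
    [(5, "Game E", 40)],
]
games = collect_games(fake_pages)
print(games)
```

In this sketch the third page is never read, and "Game D" is dropped by the final filter.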

Notes:

- The [BGG XML API2](https://boardgamegeek.com/wiki/page/BGG_XML_API2) is an alternative (not used here).
    - Python client installation: pip install boardgamegeek2


- Code used in this notebook is modified from [Building a boardgamegeek.com Data Set with Scraping and APIs in Python](https://sdsawtelle.github.io/blog/output/boardgamegeek-data-scraping.html)


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep

datadir = './data/'

In [2]:
# example of standard API get method:

# r = requests.get("http://www.boardgamegeek.com/xmlapi2/user?name=Zazz&top=1")
# soup = BeautifulSoup(r.text, "xml")  # Use the "xml" parser for API responses and "html.parser" for scraping
# print(r.status_code)  # 200 means success; 404 means not found, and so on

### "safe" request function

Sometimes a server fails to handle a request, or the connection drops while a response is being read, so it's good to wrap requests.get() in something a little more fault tolerant:

In [3]:
def request(msg, slp=1):
    '''A wrapper to make robust HTTP GET requests; retries until the server returns 200.'''
    status_code = 500  # Placeholder until a real response arrives; we want 200
    while status_code != 200:
        sleep(slp)  # Don't ping the server too often
        try:
            r = requests.get(msg)
            status_code = r.status_code
            if status_code != 200:
                print("Server Error! Response Code %i. Retrying..." % (r.status_code))
        except requests.exceptions.RequestException:
            print("An exception has occurred, probably a momentary loss of connection. Waiting one second...")
            sleep(1)
    return r
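
The wrapper above retries forever, which can hang a notebook if a URL keeps failing. As a hedged alternative, here is a sketch of a bounded retry with exponential backoff. The name `fetch_with_retries` and its parameters are assumptions for illustration. It takes a zero-argument callable, so it does not depend on any particular HTTP library.

```python
import time


def fetch_with_retries(fetch, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a response whose status_code is 200.

    Waits base_delay * 2**attempt seconds between attempts and raises
    RuntimeError once max_tries attempts have failed.
    """
    last_error = None
    for attempt in range(max_tries):
        try:
            response = fetch()
            if response.status_code == 200:
                return response
            last_error = "status code %i" % response.status_code
        except OSError as exc:  # requests' exceptions derive from OSError
            last_error = repr(exc)
        if attempt < max_tries - 1:
            sleep(base_delay * 2 ** attempt)  # back off before the next try
    raise RuntimeError("giving up after %i tries (%s)" % (max_tries, last_error))
```

With requests installed, a call might look like `fetch_with_retries(lambda: requests.get(url, timeout=30))`. The `sleep` parameter exists only so the delay can be replaced in tests.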

### Scrape the "Browse" section of BGG to acquire a list of game IDs.

In [5]:
# Initialize a DF to hold all our scraped game info
df_all = pd.DataFrame(columns=["id", "name", "nrate", "pic_url"])
min_nrate = 100 # min ratings threshold
lowest_nrate = 1000000
npage = 1

# Scrape successive pages of the results until we get down to games with <= min_nrate ratings each
while lowest_nrate > min_nrate:
    # Get full HTML for a specific page in the full listing of boardgames sorted by number of voters (descending)
    r = request("https://boardgamegeek.com/browse/boardgame/page/%i?sort=numvoters&sortdir=desc" % (npage,))
    soup = BeautifulSoup(r.text, "html.parser")   
    
    # Get rows for the table listing all the games on this page
    table = soup.find_all("tr", attrs={"id": "row_"})  # Get list of all the rows (tags) in the list of games on this page
    df = pd.DataFrame(columns=["id", "name", "nrate", "pic_url"], index=range(len(table)))  # DF to hold this page's results
    
    # Loop through each row and pull out the info for that game
    for idx, row in enumerate(table):
        # Row may or may not start with a "boardgame rank" anchor; if it does, strip it
        links = row.find_all("a")
        if "name" in links[0].attrs:
            del links[0]
        gamelink = links[1]  # Link to the specific game (relative URL)
        gameid = int(gamelink["href"].split("/")[2])  # Get the game ID by parsing the relative URL
        gamename = gamelink.contents[0]  # Get the actual name of the game as the link contents
        try:
            imlink = links[0]  # Link wrapping the game thumbnail image
            thumbnail = imlink.contents[0]["src"]  # URL of the thumbnail
        except (IndexError, KeyError, TypeError):
            thumbnail = ''
            print('idx =', idx, ' error: no thumbnail')

        ratings_str = row.find_all("td", attrs={"class": "collection_bggrating"})[2].contents[0]
        nratings = int("".join(ratings_str.split()))

        df.iloc[idx, :] = [gameid, gamename, nratings, thumbnail]
        
    lowest_nrate = df["nrate"].min()  # The smallest number of ratings of any game on the page
    print("Page %i scraped, minimum number of ratings was %i" % (npage, lowest_nrate))
    # Concatenate the results of this page to the master dataframe
    df_all = pd.concat([df_all, df], axis=0)
    npage += 1
    sleep(2) # Keep the BGG server happy.
    


[<tr id="row_">
<td align="center" class="collection_rank">
<a name="335"></a>			335			
					</td>
<td class="collection_thumbnail">
<a href="/boardgame/13/catan"><img src="https://cf.geekdo-images.com/micro/img/e0y6BognJpgqdsgn2mXP5AARp98=/fit-in/64x64/pic2419375.jpg"/></a>
</td>
<td class="collection_objectname" id="CEcell_objectname1">
<div id="status_objectname1"></div>
<div id="results_objectname1" onclick="" style="z-index:1000;">
<a href="/boardgame/13/catan">Catan</a>
<span class="smallerfont dull">(1995)</span>
</div>
</td>
<td align="center" class="collection_bggrating">
			7.022		</td>
<td align="center" class="collection_bggrating">
			7.18		</td>
<td align="center" class="collection_bggrating">
			88197		</td>
... (output truncated)
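
The sample row above shows the two string formats the scraping loop relies on: a relative link such as `/boardgame/13/catan`, and ratings cells padded with tabs and newlines. The parsing steps can be checked in isolation with a small sketch. The helper names `parse_game_href` and `parse_count` are assumptions for illustration and do not appear in the notebook.

```python
def parse_game_href(href):
    """Return the integer game ID from a relative link like '/boardgame/13/catan'."""
    parts = href.split("/")
    # For the link above, parts == ['', 'boardgame', '13', 'catan']
    if len(parts) < 3 or not parts[2].isdigit():
        raise ValueError("unexpected game link: %r" % href)
    return int(parts[2])


def parse_count(cell_text):
    """Turn the whitespace-padded text of a ratings cell into an int."""
    return int("".join(cell_text.split()))


print(parse_game_href("/boardgame/13/catan"))  # 13
print(parse_count("\n\t\t\t88197\t\t"))        # 88197
```

Raising `ValueError` on an unexpected link is a design choice in this sketch: it makes a change in the page layout fail with a clear message, where the bare index lookup would raise an `IndexError` or a less specific `ValueError`.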

### Save the ratings data to CSV 

Note that I also save the URL to a thumbnail pic so that I can use it later in the app.

In [29]:
df = df_all.copy()
# Reset the index since we concatenated a bunch of DFs with the same index into one DF
df.reset_index(inplace=True, drop=True)

# Number of full pages of ratings for each game (100 ratings per page).
#  - this is used by the next step in the process, which downloads the ratings
pagesize = 100
df['nrating_pages'] = (df['nrate']/pagesize).astype(int)

# Write the DF to .csv for future use
df.to_csv(datadir+"bgg_gamelist.csv", index=False, encoding="utf-8")
print(df.shape)
df.head()


(12600, 5)


|   | id | name | nrate | pic_url | nrating_pages |
|---|---|---|---|---|---|
| 0 | 13 | Catan | 87850 | https://cf.geekdo-images.com/micro/img/e0y6Bog... | 878 |
| 1 | 822 | Carcassonne | 87558 | https://cf.geekdo-images.com/micro/img/z0tTaij... | 875 |
| 2 | 30549 | Pandemic | 86396 | https://cf.geekdo-images.com/micro/img/0m3-oqB... | 863 |
| 3 | 68448 | 7 Wonders | 71600 | https://cf.geekdo-images.com/micro/img/h-Ejv31... | 716 |
| 4 | 36218 | Dominion | 69929 | https://cf.geekdo-images.com/micro/img/VYp2s2f... | 699 |
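
The `nrating_pages` values above come from truncating `nrate / pagesize`, so they count only full pages: Catan's 87850 ratings give 878. Whether the next step also needs the partial last page is an assumption not settled in this notebook. The sketch below shows both counts side by side so the difference is explicit. The function name `rating_pages` is hypothetical.

```python
def rating_pages(nrate, pagesize=100):
    """Return (full_pages, total_pages) for a game with nrate ratings."""
    full_pages = nrate // pagesize                     # what the notebook stores
    total_pages = (nrate + pagesize - 1) // pagesize   # includes a partial last page
    return full_pages, total_pages


print(rating_pages(87850))  # (878, 879)
print(rating_pages(500))    # (5, 5)
```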
