## Download board game metadata from boardgamegeek.com

### Springboard Capstone 2 project: building a recommendation engine
### John Burt


#### Procedure:
- Load the previously downloaded game ID list, which contains every game with more than 100 ratings.
- Use [BGG API 2 interface](https://boardgamegeek.com/wiki/page/BGG_XML_API2) to collect game metadata.
- Save game metadata to a CSV file.

Notes:

- The [BGG API package](https://boardgamegeek.com/wiki/page/BGG_XML_API2) (not used here) is an alternative.
    - Installation: `pip install boardgamegeek2`


- Code used in this notebook is modified from [Building a boardgamegeek.com Data Set with Scraping and APIs in Python](https://sdsawtelle.github.io/blog/output/boardgamegeek-data-scraping.html)


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import pickle
from time import sleep
import timeit

%matplotlib inline

datadir = './data/'

### Load previously downloaded game ID data

In [2]:
# load previously downloaded game id data.
games = pd.read_csv(datadir+'bgg_gamelist.csv')
print(games.shape)
games.head()

(12600, 5)


| | id | name | nrate | pic_url | nrating_pages |
|---|---|---|---|---|---|
| 0 | 13 | Catan | 87850 | https://cf.geekdo-images.com/micro/img/e0y6Bog... | 878 |
| 1 | 822 | Carcassonne | 87558 | https://cf.geekdo-images.com/micro/img/z0tTaij... | 875 |
| 2 | 30549 | Pandemic | 86396 | https://cf.geekdo-images.com/micro/img/0m3-oqB... | 863 |
| 3 | 68448 | 7 Wonders | 71600 | https://cf.geekdo-images.com/micro/img/h-Ejv31... | 716 |
| 4 | 36218 | Dominion | 69929 | https://cf.geekdo-images.com/micro/img/VYp2s2f... | 699 |


### "safe" request function

Sometimes a server fails to handle a request, or the connection drops in the middle of reading a response, so it's good to wrap `requests.get()` in something a little more fault tolerant:

In [3]:
def request(msg, slp=1):
    '''A wrapper to make robust HTTP requests: retry until the server returns 200.'''
    status_code = 500  # Placeholder; we want to get a status code of 200
    while status_code != 200:
        sleep(slp)  # Don't ping the server too often
        try:
            r = requests.get(msg)
            status_code = r.status_code
            if status_code != 200:
                print("Server Error! Response Code %i. Retrying..." % (r.status_code))
        except requests.exceptions.RequestException:
            # Catch only network/HTTP errors so that Ctrl-C still stops the loop
            print("An exception has occurred, probably a momentary loss of connection. Waiting one second...")
            sleep(1)
    return r
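
The wrapper above retries forever, which is fine for a short interactive run but can hang if the server never recovers. A minimal sketch of a bounded variant follows. The name `fetch_with_limit` and the parameters `fetch`, `max_tries`, and `pause` are hypothetical and not part of the original notebook; the download function is passed in as an argument so the retry logic can be tried out without a network connection.

```python
import time


def fetch_with_limit(url, fetch, max_tries=5, pause=1.0):
    """Call fetch(url) until it returns a response with status_code 200.

    Gives up with RuntimeError after max_tries attempts instead of looping forever.
    """
    for attempt in range(1, max_tries + 1):
        try:
            r = fetch(url)
            if r.status_code == 200:
                return r
            print("Server Error! Response Code %i. Retrying..." % r.status_code)
        except OSError as err:  # requests' RequestException derives from OSError
            print("Connection problem (%s). Retrying..." % err)
        time.sleep(pause * attempt)  # wait a little longer after each failure
    raise RuntimeError("No 200 response after %i tries: %s" % (max_tries, url))
```

In this notebook it could be called as `fetch_with_limit(url, requests.get)`.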

### Collect game information

- Request metadata in blocks of 100 game IDs per API call
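
As a rough illustration of the blocking step, the sketch below splits a list of IDs into blocks and builds one request URL per block. The helper names `id_blocks` and `thing_url` are hypothetical; the URL pattern is the one used in this notebook.

```python
def id_blocks(ids, blocksize=100):
    """Yield consecutive slices of ids, each at most blocksize long."""
    for start in range(0, len(ids), blocksize):
        yield ids[start:start + blocksize]


def thing_url(block):
    """Build one XML API2 'thing' request for a block of game IDs."""
    gids = ",".join(str(gid) for gid in block)
    return "http://www.boardgamegeek.com/xmlapi2/thing?id=" + gids + "&stats=1"


ids = [13, 822, 30549, 68448, 36218]  # the first five IDs from the game list
urls = [thing_url(block) for block in id_blocks(ids, blocksize=2)]
print(urls[0])  # http://www.boardgamegeek.com/xmlapi2/thing?id=13,822&stats=1
```

With 12,600 games and a block size of 100, this gives 126 requests.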

In [6]:
minplayers = []
maxplayers = []
minage = []
mean_rating = []
weight = []
categories = []
mechanics = []

blocksize = 100

for i in range(0, games.shape[0], blocksize):
    # Comma-separated IDs for this block (gid avoids shadowing the built-in id)
    gids = ','.join([str(gid) for gid in games['id'].iloc[i:i+blocksize].values])
    r = request("http://www.boardgamegeek.com/xmlapi2/thing?id="+gids+"&stats=1")
    soup = BeautifulSoup(r.text, "xml")

    # Appending assumes the response holds one <item> per requested ID, in request order.
    for item in soup('item'):

        minplayers.append(int(item("minplayers")[0]['value']))
        maxplayers.append(int(item("maxplayers")[0]['value']))
        minage.append(int(item("minage")[0]['value']))
        mean_rating.append(float(item("average")[0]['value']))
        weight.append(float(item("averageweight")[0]['value']))

        # Names are joined with commas, but some contain commas themselves
        # (e.g. "Deck, Bag, and Pool Building").
        cats = [obj['value'] for obj in item.find_all(type='boardgamecategory')]
        categories.append(','.join(cats))

        mechs = [obj['value'] for obj in item.find_all(type='boardgamemechanic')]
        mechanics.append(','.join(mechs))

    print(i // blocksize, end=',')  # progress: index of the block just finished

    sleep(2) # Keep the BGG server happy.

print('\n done')


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,
 done
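
The loop above appends to parallel lists, so it relies on every requested ID coming back exactly once and in order. A possible safeguard, sketched below, is to key each parsed record by the game ID taken from the response itself. This sketch swaps BeautifulSoup for the standard-library `xml.etree.ElementTree` so that it runs on its own. The XML is a hand-written, heavily abbreviated sample in the shape of a `thing` response, not real API output, and `parse_items` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated sample; a real response carries many more elements.
SAMPLE = """<items>
  <item type="boardgame" id="822">
    <minplayers value="2"/>
    <maxplayers value="5"/>
    <minage value="8"/>
    <link type="boardgamecategory" value="Medieval"/>
    <link type="boardgamemechanic" value="Tile Placement"/>
    <statistics><ratings>
      <average value="7.42375"/>
      <averageweight value="1.9219"/>
    </ratings></statistics>
  </item>
</items>"""


def parse_items(xml_text):
    """Return {game_id: record} for every <item> in the response text."""
    records = {}
    for item in ET.fromstring(xml_text).iter("item"):

        def value(tag):
            # First descendant element with this tag, read from its 'value' attribute
            return item.find(".//" + tag).get("value")

        def links(kind):
            # All <link> children of the given type, e.g. 'boardgamecategory'
            return [l.get("value") for l in item.findall("link") if l.get("type") == kind]

        records[int(item.get("id"))] = {
            "minplayers": int(value("minplayers")),
            "maxplayers": int(value("maxplayers")),
            "minage": int(value("minage")),
            "mean_rating": float(value("average")),
            "weight": float(value("averageweight")),
            "categories": links("boardgamecategory"),
            "mechanics": links("boardgamemechanic"),
        }
    return records


records = parse_items(SAMPLE)
print(records[822]["maxplayers"], records[822]["categories"])
```

Records keyed this way can be joined back to the game list on `id`, so a missing or reordered item shows up as a gap rather than silently shifting every later row.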


### Load info into a dataframe and save it to CSV

In [8]:
games['minplayers'] = minplayers
games['maxplayers'] = maxplayers
games['minage'] = minage
games['mean_rating'] = mean_rating
games['weight'] = weight
games['categories'] = categories
games['mechanics'] = mechanics

# Write the DF to .csv for future use
games.to_csv(datadir+"bgg_game_info.csv", index=False, encoding="utf-8")
print(games.shape)
games.head()


(12600, 12)


| | id | name | nrate | pic_url | nrating_pages | minplayers | maxplayers | minage | mean_rating | weight | categories | mechanics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | Catan | 87850 | https://cf.geekdo-images.com/micro/img/e0y6Bog... | 878 | 3 | 4 | 10 | 7.18061 | 2.3357 | Negotiation | Dice Rolling,Hexagon Grid,Income,Modular Board... |
| 1 | 822 | Carcassonne | 87558 | https://cf.geekdo-images.com/micro/img/z0tTaij... | 875 | 2 | 5 | 8 | 7.42375 | 1.9219 | City Building,Medieval,Territory Building | Area Majority / Influence,Tile Placement |
| 2 | 30549 | Pandemic | 86396 | https://cf.geekdo-images.com/micro/img/0m3-oqB... | 863 | 2 | 4 | 8 | 7.62865 | 2.4211 | Medical | Action Points,Cooperative Game,Hand Management... |
| 3 | 68448 | 7 Wonders | 71600 | https://cf.geekdo-images.com/micro/img/h-Ejv31... | 716 | 2 | 7 | 10 | 7.77245 | 2.3389 | Ancient,Card Game,City Building,Civilization | Card Drafting,Drafting,Hand Management,Set Col... |
| 4 | 36218 | Dominion | 69929 | https://cf.geekdo-images.com/micro/img/VYp2s2f... | 699 | 2 | 4 | 13 | 7.63822 | 2.3616 | Card Game,Medieval | Deck, Bag, and Pool Building,Hand Management,V... |
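
One thing to watch when reading this file back: the `categories` and `mechanics` columns are comma-joined, but some names contain commas themselves, as the Dominion row shows. The sketch below illustrates the problem and one possible workaround, a `|` delimiter. That delimiter is an assumption (it requires that no name contains `|`) and is not what this notebook saves.

```python
mechs = ["Deck, Bag, and Pool Building", "Hand Management"]

# Comma-joined, as in the saved CSV: splitting no longer recovers the names.
joined = ",".join(mechs)
print(joined.split(","))  # ['Deck', ' Bag', ' and Pool Building', 'Hand Management']

# Joined with '|' instead (assumes no name contains '|'): the round trip is exact.
safe = "|".join(mechs)
print(safe.split("|") == mechs)  # True
```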


In [18]:
print('Number of ratings for all games =', format(games['nrate'].sum(), ','))



Number of ratings for all games = 15,230,683
