# BGG Board Game Recommendation Engine

<img src="nb-assets/shelf.jpeg">
<p style="text-align:center"> [Pictured: my (main) game shelf] </p>

Let me describe a common scenario in my life: I find myself at a mall as a result of marital obligation, and I'm fortunate enough to locate the _game store_. Gazing upon the glorious bounty on the shelves, I reach for an article of particular interest and begin reading the description...

45 minutes later, I still haven't bought anything. Why? I have no idea which of these grossly overpriced items will actually be worth the investment (seriously, board games are expensive). So to save my marriage (I lied - we're at the mall because I _wanted_ to look at board games. Sue me) I decided to design a recommendation system so I can stop buying games by accident, and start looking for the games I will actually like.

## Part 1: Data Collection


<img src="nb-assets/catan.jpg">

The first step of this project is data collection. Anyone who is into board games will be familiar with the source I've chosen: [Board Game Geek](https://boardgamegeek.com/). It's a very popular website for documenting, discussing, rating and reviewing all sorts of board games and tabletop games. Each game has detailed statistics reflecting user-submitted ratings, genre and category information, a "complexity score" metric, and more - lots of potential features to feed an ML model. I'm going to collect data on the top 20,000 games as listed on their [rankings page](https://boardgamegeek.com/browse/boardgame/page/100?sort=rank). But first, I need to collect the list of games and the links to their individual pages. Here we go!

## Imports

In [398]:
import requests # To make HTTP requests
import re # For pattern matching some of the results
import json # For parsing (certain) results
import time # To implement scraping delay
import pandas as pd # To view the data in DataFrame format
from bs4 import BeautifulSoup # To parse the HTML from request objects

## Targeting Data - CSS Selectors and HTML Tags

<img src="nb-assets/devtools.png" height="400" width="680">

<p style="text-align:center"> "Developer tools are your friend." <br> - Some guy on StackOverflow</p>

This section represents the trial and error approach of finding the right HTML tags and CSS selectors to access the page info we need. I'm using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse the HTML from the request objects. 

### Note on robots.txt:

Most websites have a robots.txt file that can be found by adding `/robots.txt` to the base URL. This is where the site's administrators designate their preferred webscraping/webcrawling activity. I checked for BGG, and they list a crawl delay of 5s, so that's what I have adhered to.
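
For anyone who would rather read the crawl delay programmatically than by eye, here is a minimal sketch using the standard library's `urllib.robotparser`. The robots.txt text below is an illustrative sample that I made up for the example (it is not a copy of BGG's file), and `read_crawl_delay` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (an assumption, not a copy of BGG's file)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

def read_crawl_delay(robots_txt, user_agent="*"):
    """Return the Crawl-delay that applies to user_agent, or None if unset."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)

delay = read_crawl_delay(SAMPLE_ROBOTS_TXT)
print(delay)
```

Against a live site, the same parser can fetch the file itself: call `parser.set_url(...)` with the robots.txt address, then `parser.read()`, before asking for `crawl_delay`.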

In [2]:
# Designate the URL of the target page
url = 'https://boardgamegeek.com/browse/boardgame/page/1?sort=rank'
content_1 = requests.get(url)
# Initialize the BeautifulSoup object
rankings = BeautifulSoup(content_1.content, 'html.parser')

In [55]:
# Name
rankings.select('tr#row_')[0].find('a', class_='primary').string

'Gloomhaven'

In [75]:
# Year
rankings.select('tr#row_')[0].find('span').string[1:5]

'2017'
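
The `[1:5]` slice above assumes the label is a four-digit year wrapped in parentheses, like `(2017)`. A slightly more defensive sketch pulls the number out with a regular expression, so a missing label or a year that isn't four characters long doesn't get silently truncated. `parse_year` is a hypothetical helper, and the negative-year input in the comments is an assumed edge case rather than something I verified on the rankings page.

```python
import re

def parse_year(label):
    """Extract the integer year from a label such as '(2017)'.

    Returns None when the label is missing or holds no parenthesized number.
    """
    match = re.search(r'\((-?\d+)\)', label or '')
    return int(match.group(1)) if match else None

# Example inputs (the negative year is a hypothetical edge case)
for text in ['(2017)', '(-2200)', None]:
    print(repr(text), parse_year(text))
```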

In [71]:
# Rel filepath (part of the URL that leads to the individual game's page)
rankings.select('tr#row_')[0].find('a', class_='primary').get('href')

'/boardgame/174430/gloomhaven'

In [73]:
# ID
rankings.select('tr#row_')[0].find('a', class_='primary').get('href').split('/')[2]

'174430'

In [62]:
# Average user rating
rankings.select('tr#row_')[0].find_all('td', class_='collection_bggrating')[1].string.strip()
# Number of votes
rankings.select('tr#row_')[0].find_all('td', class_='collection_bggrating')[2].string.strip()

'51469'

---
Here I set up my first function to scrape rankings pages. I set it up to save the data in a csv every 10 pages (every 1000 records). It did not go as planned.

---

## Getting past the login wall

<img src="nb-assets/loginwall.png" height="240" width="480">
<p style="text-align:center">Well that's not right.</p>

After running a first pass through the top 200 rankings pages, I noticed something pretty weird - my DataFrames were all the same size! It seems that to view rankings pages past 20 (the first 2000 results), you need to be logged in. I will skip over the hours of Google and StackOverflow and get to the point: to log in to this site with only the requests module, I needed two things: 
 - My login credentials formatted in JSON with correct keyword names
 - The URL of the API endpoint to make a POST request 
 
Fortunately, I found the answer through, you guessed it: Chrome Developer tools. By recording network activity and logging in, I was able to find the required formatting for my login details and the correct URL to send my POST request to. 

In [190]:
# This time, initialize a requests Session object that can store
# the authentication cookie and access the remaining pages to scrape.
s = requests.Session()

headers = {"Accept-Language":"en-US,en;q=0.9",
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36"}
data = {"credentials": {"username": "", "password": ""}}
url = "https://boardgamegeek.com/login/api/v1"

r = s.post(url, json=data, headers=headers)
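
The credentials in the cell above are left blank on purpose. One way to keep them out of the notebook entirely is to read them from environment variables. This is only a sketch: `BGG_USERNAME` and `BGG_PASSWORD` are hypothetical variable names, and `build_login_payload` is a helper I made up. The payload shape simply mirrors the `data` dict above.

```python
import os

def build_login_payload(env=os.environ):
    """Build the JSON login body from environment variables.

    BGG_USERNAME and BGG_PASSWORD are hypothetical names chosen for
    this sketch; the nested 'credentials' shape matches the cell above.
    """
    try:
        return {"credentials": {"username": env["BGG_USERNAME"],
                                "password": env["BGG_PASSWORD"]}}
    except KeyError as missing:
        # Fail early with a readable message instead of posting blanks
        raise RuntimeError(f"Set {missing.args[0]} before logging in") from None

# Demonstration with an in-memory mapping instead of the real environment
demo = build_login_payload({"BGG_USERNAME": "someone", "BGG_PASSWORD": "secret"})
print(sorted(demo["credentials"]))
```

With the variables set, the login call becomes `s.post(url, json=build_login_payload(), headers=headers)`, and calling `raise_for_status()` on the response would surface a rejected login early.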

In [200]:
# Assign variables
count = 1 # Starting page
url = 'https://boardgamegeek.com/browse/boardgame/page/' # Base URL
params = {'sort':'rank'} # Sort method

# Initialize an empty list to store game data
game_data = []

# Create the main scraping loop 
while count <= 200:
    # Make the request
    content = s.get(url+str(count), params=params)
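    # Error handling: if the response code is not 200 'OK', print a note
    # to the console and move on to the next page. The counter has to
    # advance here, otherwise `continue` would retry the same page forever.
    if content.status_code != 200:
        print(f'Failed on page {count}')
        count += 1
        time.sleep(5)
        continue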
        
    # Parse the data from our response object
    page = BeautifulSoup(content.content, 'html.parser')
    
    # Begin collecting the targeted data
    for idx, game in enumerate(page.select('tr#row_')):
        
        # Error handling: if any of these selections throw an error, we
        # print a note to the console and move on to the next game
        try:
            id_ = game.find('a', class_='primary').get('href').split('/')[2]
            name = game.find('a', class_='primary').string
            year = game.find('span').string[1:5]
            filepath = game.find('a', class_='primary').get('href')
            rating = game.find_all('td', class_='collection_bggrating')[1].string.strip()
            votecount = game.find_all('td', class_='collection_bggrating')[2].string.strip()
        except Exception:
            print(f'error with game {idx} on page {count}')
            continue
            
        # If everything worked, we update game_data
        data = [id_, name, year, filepath, rating, votecount]
        game_data.append(data)
            
    # Failsafe: save a copy of the data every 10 pages
    if count % 10 == 0:
        pd.DataFrame(game_data).to_csv(f'data{count}.csv', index=False)
    
    # Sleep 5s per the crawl delay on the site's robots.txt
    time.sleep(5)
    
    count += 1

In [None]:
# Build the DataFrame from the rows collected above.
# Only two entries were invalid - not bad! 
# I also ended up with a duplicate, so I'll drop it
df = pd.DataFrame(game_data)
df.columns = ['id', 'title', 'year', 'url', 'user_score', 'votes']
df = df.drop_duplicates()

In [223]:
# df.to_csv('data/ranks.csv', index=False)

In [225]:
url = 'https://boardgamegeek.com/boardgame/174430/gloomhaven/'
url_credits = 'https://boardgamegeek.com/boardgame/174430/gloomhaven/credits'

game_page = BeautifulSoup(s.get(url).content, 'html.parser')
credits = BeautifulSoup(s.get(url_credits).content, 'html.parser')

## Scraping the Game Details Page

<img src="nb-assets/page2.png" height="400" width="600">

This info looks easy to get, but it turns out not to be. Rather than sitting directly in the HTML, it is populated via JavaScript. A JSON file is served when the page is loaded - this JSON file has the data I'm looking for. 
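
To make that concrete, here is a small standalone sketch of pulling a JSON object out of a script body. The variable name `GEEK.geekitemPreload` is the one used in the scraping cell later in this notebook; the sample script text and its values are made up for illustration, and `extract_preload` is a hypothetical helper. It uses `json.JSONDecoder.raw_decode` rather than a greedy regex, as one possible alternative.

```python
import json
import re

# Made-up stand-in for a <script> body; the real payload is far larger
SAMPLE_SCRIPT = '''
var GEEK = GEEK || {};
GEEK.geekitemPreload = {"item": {"name": "Gloomhaven", "minplayers": "1"}};
var somethingElse = {};
'''

def extract_preload(script_text):
    """Return the object assigned to GEEK.geekitemPreload, or None if absent."""
    match = re.search(r'GEEK\.geekitemPreload\s*=\s*', script_text)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever JavaScript follows it
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj

data = extract_preload(SAMPLE_SCRIPT)
print(data['item']['name'])
```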

After some poking around, I ended up scraping the whole set twice - the first time, using an exposed API (URL redacted from this notebook) found using Chrome Developer Tools (duh). The second time I scraped the individual pages to get the leftover data.

And again, per the crawl-delay stated on the site's [robots.txt](https://boardgamegeek.com/robots.txt), I scraped at a delay of 5s. I timed it: each request took about 0.5s to process, hence the 4.5s sleep calls.
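
Instead of hard-coding 4.5s, the wait can be computed from how long each iteration actually took. The sketch below is one way to do that and is not what I ran for this scrape. `paced` and `remaining_delay` are hypothetical helpers, and the `clock` and `sleep` parameters exist only so the timing can be faked when trying it out.

```python
import time

CRAWL_DELAY = 5.0  # seconds, per the robots.txt crawl delay

def remaining_delay(started, now, delay=CRAWL_DELAY):
    """Seconds still to wait so consecutive requests start `delay` apart."""
    return max(0.0, delay - (now - started))

def paced(items, delay=CRAWL_DELAY, clock=time.monotonic, sleep=time.sleep):
    """Yield items, then sleep off whatever is left of `delay`.

    The sleep runs when the caller asks for the next item, so it also
    covers iterations that end early with `continue`.
    """
    for item in items:
        started = clock()
        yield item
        sleep(remaining_delay(started, clock(), delay))

# Dry run with a fake clock: the first iteration "took" 0.5s, the second 7s
ticks = iter([0.0, 0.5, 10.0, 17.0])
waits = []
for _ in paced(['a', 'b'], clock=lambda: next(ticks), sleep=waits.append):
    pass
print(waits)
```

In the scraping loops, this would look like `for id_ in paced(df.id.values): ...` with the explicit `time.sleep(4.5)` calls removed.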

In [None]:
# First pass, taking advantage of the API

details = []
for count, id_ in enumerate(df.id.values[16792:]):
    url = f'[API URL]'
    response = requests.get(url)
    # Error handling - prints an error message each time data can't be collected
    # (decoding the JSON sits inside the try so a bad response is skipped too)
    try:
        data = response.json()
        # Max # of players
        players = data['item']['maxplayers']
        # Min # of players (solo yes/no)
        solo = data['item']['minplayers'] == '1'
        # Min-age (categorize: kid, teen, adult)
        age = data['item']['minage']
        # Minimum playtime
        playtime = data['item']['minplaytime']
        # Description
        description = BeautifulSoup(data['item']['description'], 'html.parser').text.strip()
        # Designer(s)
        designers = [designer['name'] for designer in data['item']['links']['boardgamedesigner']]
        # Publisher (main)
        publisher = data['item']['links']['boardgamepublisher'][0]['name']
        # Number of awards
        awards = data['item']['linkcounts']['boardgamehonor']
        # Categories
        categories = [category['name'] for category in data['item']['links']['boardgamecategory']]
        # Game mechanics ## credits
        mechanics = [mechanic['name'] for mechanic in data['item']['links']['boardgamemechanic']]
        # Genres
        genres = [subgenre['name'] for subgenre in data['item']['links']['boardgamesubdomain']]
    except Exception:
        print(f'error on index {count}')
        # Keep honoring the crawl delay even when a record is skipped
        time.sleep(4.5)
        continue
    
    # Update the list (genres included, so the value collected above is kept)
    details.append([id_, players, solo, age, playtime, description, designers, publisher, awards, categories, mechanics, genres])
    
    # Failsafe - save the data every 100 entries
    if count % 100 == 0:
        pd.DataFrame(details, columns = ['id', 'players', 'solo', 'age', 'playtime', 'description', 'designers', 'publisher', 'awards', 'categories', 'mechanics', 'genres']).to_csv(f'data/details{count}.csv')
    # Be a good citizen!
    time.sleep(4.5)
    

In [None]:
# Second pass - scraping the pages directly and getting
# remaining data from the JSON files served on each page

# New list to hold details
details_again = []
# Base url
base_url = 'https://boardgamegeek.com'
count = 0
for game in df.url.values:
    # Added error handling for potential network issues
    try:
        url = f'{base_url}{game}'
        response = requests.get(url)
        game_page = BeautifulSoup(response.content, 'html.parser') 
        match = re.search(r'GEEK.geekitemPreload = ({.*})', game_page.find_all('script')[2].string)
        data = json.loads(match.groups()[0])
    except Exception:
        print(f'Error in requests on index {count}')
        # Advance the counter so df.iloc[count] stays aligned with `game`
        count += 1
        time.sleep(4.5)
        continue
        
    # Changed the error handling pattern here: I opted to include NaN entries 
    # when taking in this data after I noticed a very high rate of errors late
    # in the scraping process.
    
    try:
        # Best # players
        best_players = data['item']['polls']['userplayers']['best'][0]['max']
    except (KeyError, IndexError, TypeError):
        best_players = None
    try:
        # Complexity score
        weight = data['item']['stats']['avgweight']
    except (KeyError, IndexError, TypeError):
        weight = None
    try:
        # Min # of players 
        min_players = data['item']['minplayers']
    except (KeyError, IndexError, TypeError):
        min_players = None
    try:
        # Maximum playtime
        max_playtime = data['item']['maxplaytime']
    except (KeyError, IndexError, TypeError):
        max_playtime = None
        
    # Update the list
    details_again.append([df.iloc[count].id, best_players, min_players, max_playtime, weight])
    
    # Again, save the data every 100 entries as a failsafe
    if count % 100 == 0:
        current_data = pd.DataFrame(details_again, columns = ['id', 'best_players', 'min_players', 'max_playtime', 'weight'])
        current_data.to_csv(f'data/extra_details{count}.csv', index=False)
    count += 1
    if count > 19998:
        break
    # Be a good citizen!
    time.sleep(4.5)
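
The four try/except blocks in the second pass all do the same thing: walk a nested path and fall back to `None` when a step is missing. As an alternative (not what I ran), that pattern could be folded into a small helper. `dig` is a hypothetical name, and the sample dict below is made up to mimic the shape of the real payload.

```python
def dig(data, *path, default=None):
    """Follow the keys/indexes in `path` through nested dicts and lists.

    Returns `default` as soon as any step is missing or not subscriptable.
    """
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current

# Made-up sample shaped like the payload used in the second pass
sample = {'item': {'polls': {'userplayers': {'best': [{'min': 3, 'max': 3}]}},
                   'stats': {'avgweight': '3.87'},
                   'minplayers': '1'}}

best_players = dig(sample, 'item', 'polls', 'userplayers', 'best', 0, 'max')
weight = dig(sample, 'item', 'stats', 'avgweight')
max_playtime = dig(sample, 'item', 'maxplaytime')  # absent in the sample
print(best_players, weight, max_playtime)
```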