In [1]:
from boardgamegeek import BGGClient
import pandas as pd
import numpy as np

To collect board game data from boardgamegeek.com (BGG), we will use BGG's own API.

In addition to the API, BGG provides a csv file containing general data for the games listed on the site. We will use this csv file as the base for any further data collection.

The link to the API and the csv file can be found [here](https://boardgamegeek.com/wiki/page/BGG_XML_API2). Documentation for the `boardgamegeek` Python library used to access the API can be found [here](https://lcosmin.github.io/boardgamegeek/modules.html#boardgamegeek.api.BGGClient.collection).

In [2]:
bgg_df = pd.read_csv('../data/boardgames_ranks.csv')
bgg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154683 entries, 0 to 154682
Data columns (total 15 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   id                   154683 non-null  int64  
 1   name                 154683 non-null  object 
 2   yearpublished        154683 non-null  int64  
 3   rank                 154683 non-null  int64  
 4   bayesaverage         154683 non-null  float64
 5   average              154683 non-null  float64
 6   usersrated           154683 non-null  int64  
 7   abstracts_rank       1427 non-null    float64
 8   cgs_rank             359 non-null     float64
 9   childrensgames_rank  1055 non-null    float64
 10  familygames_rank     3205 non-null    float64
 11  partygames_rank      891 non-null     float64
 12  strategygames_rank   2930 non-null    float64
 13  thematic_rank        1646 non-null    float64
 14  wargames_rank        4240 non-null    float64
dtypes: float64(10), i

### Dropping games with no rank
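We will first drop all rows with no rank. Only 26 360 of the 154 683 games in the dataframe have a listed ranking, which is approximately 17% of all the games. Note that `rank` shows no null entries in the output above: in this csv file, unranked games appear to be stored with a `rank` of 0 rather than a missing value, so we filter on that value.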

An explanation for how game rankings are determined can be found in [this video](https://youtu.be/UqFvEzhjSfI?si=L45YhNwB4jwjd9_M).

#### A brief summary of the scoring system:

BGG users can give each game on the site a rating from 1 to 10. The site averages these ratings to get the average score, and also adds a number of dummy votes to calculate the Bayesian average, using the formula below:

$$ \frac{\sum \text{user scores} + (\text{number of dummies}*\text{dummy value} )}{\text{number of user ratings} + \text{number of dummies}} $$

The dummy value is taken to be the average score of all games on BGG (<b>5.5</b>), while the number of dummies was empirically found to be approximately <b>1450</b> in the video linked above.
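
The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `bayesian_average` is hypothetical, and the defaults of 5.5 and 1450 are the estimates quoted above rather than official BGG parameters, so the result should not be expected to reproduce the `bayesaverage` column exactly.

```python
def bayesian_average(user_mean, n_ratings, dummy_value=5.5, n_dummies=1450):
    """Blend a game's mean user rating with a fixed number of dummy votes.

    dummy_value and n_dummies are assumptions taken from the text above,
    not values published by BGG.
    """
    total_score = user_mean * n_ratings + dummy_value * n_dummies
    return total_score / (n_ratings + n_dummies)


# A game with few ratings is pulled strongly towards the dummy value...
few = bayesian_average(8.0, 50)
# ...while the same mean over many ratings stays close to the raw average.
many = bayesian_average(8.0, 50_000)
print(round(few, 3), round(many, 3))
```

The effect of the dummy votes shrinks as the number of real ratings grows, which is why heavily rated games keep a Bayesian average close to their plain average.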

Only games with at least 30 user ratings are eligible for a ranking on BGG ([sourced from the BGG site](https://boardgamegeek.com/wiki/page/ratings)), which is why only a fraction of all games on the site are given a ranking. Limiting our data to ranked games ensures that we focus on games with more user interaction, whose scores are less sensitive to a handful of individual ratings.



In [3]:
# Filtering out all games with no ranking
bgg_df = bgg_df.loc[bgg_df['rank']!=0,:].reset_index(drop=True)

### Filtering to 'Thematic Games'

We then filter our dataframe to games within the 'thematic' category (i.e. rows with a non-null value for `thematic_rank`).

[Thematic Games](https://boardgamegeek.com/boardgamesubdomain/5496/thematic-games): As described on the website, thematic games are games in which the overall experience is driven primarily by a strong theme, much like a book or a movie. Their mechanics and rules are purposefully designed to simulate and complement the theme, leading to a more immersive and engaging experience. This can be contrasted with [strategy games](https://boardgamegeek.com/boardgamesubdomain/5497/strategy-games), whose mechanics prioritise player agency and decision-making instead. Strategy games may compromise on theming and lean into abstraction (see also [abstract games](https://boardgamegeek.com/boardgamesubdomain/4666/abstract-games)) to provide a more strategic and intellectually engaging experience.

Since we are filtering out all non-thematic games, we can also drop all other columns for categorical ranks, excluding `thematic_rank`.

We also convert `thematic_rank` from float to integer at this stage, since rankings are whole numbers and the column no longer contains null values.

In [4]:
thematic_df = bgg_df[bgg_df['thematic_rank'].notnull()].reset_index(drop=True)
thematic_df = thematic_df.drop(columns=['wargames_rank', 'strategygames_rank', 'partygames_rank', 'familygames_rank', 'childrensgames_rank', 'cgs_rank', 'abstracts_rank'])
thematic_df['thematic_rank'] = thematic_df['thematic_rank'].astype(int)
thematic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1646 entries, 0 to 1645
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             1646 non-null   int64  
 1   name           1646 non-null   object 
 2   yearpublished  1646 non-null   int64  
 3   rank           1646 non-null   int64  
 4   bayesaverage   1646 non-null   float64
 5   average        1646 non-null   float64
 6   usersrated     1646 non-null   int64  
 7   thematic_rank  1646 non-null   int32  
dtypes: float64(2), int32(1), int64(4), object(1)
memory usage: 96.6+ KB


#### *Scraping by `id` instead of `name`

Initially, the data collection process looked up each game in our dataframe by name. However, this resulted in a dataframe with 16 fewer rows (1630 instead of 1646). Further investigation showed that the 1646 games in the dataset include 16 pairs of games that share the same listed name. This could be because the games are reimplementations or expansions of existing games, among other potential reasons.

Since these rows are not duplicates and the games are technically distinct, we will instead collect data based on each game's ID, which differs even for games with the same name.
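
As a small illustration of why `id` is the safer key, the sketch below uses three rows taken from the dataset (two of them share the name 'Dune'). Keying a lookup on `name` silently collapses the two 'Dune' entries into one, which is one way rows can go missing; the variable names here are hypothetical and the original name-based collection code may have lost rows differently.

```python
import pandas as pd

# Three rows from the dataset; the two 'Dune' entries are distinct games
toy_df = pd.DataFrame({
    'id': [283355, 121, 15987],
    'name': ['Dune', 'Dune', 'Arkham Horror'],
    'yearpublished': [2019, 1979, 2005],
})

# Keying on name: the second 'Dune' overwrites the first
by_name = dict(zip(toy_df['name'], toy_df['yearpublished']))

# Keying on id: every game keeps its own entry
by_id = dict(zip(toy_df['id'], toy_df['yearpublished']))

print(len(by_name), len(by_id), toy_df['id'].is_unique)
```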

In [5]:
# Checking for duplicate names within the dataset. Note the 32 entries, 16 pairs of games with the same name that differ in all remaining features.
thematic_df[thematic_df.duplicated(keep=False,subset=['name'])]

Unnamed: 0,id,name,yearpublished,rank,bayesaverage,average,usersrated,thematic_rank
40,283355,Dune,2019,192,7.34565,7.94778,8619,44
42,39463,Cosmic Encounter,2008,196,7.34058,7.52305,32845,63
83,15987,Arkham Horror,2005,427,7.02934,7.23892,39436,145
92,121,Dune,1979,473,6.99062,7.59133,5875,99
124,34119,Tales of the Arabian Nights,2009,608,6.87588,7.16451,12694,160
134,135219,The Battle of Five Armies,2014,657,6.83111,7.86622,2928,96
139,256999,Project: ELITE,2020,671,6.82546,7.98139,2532,79
209,224,History of the World,1991,1073,6.56666,7.11741,4568,254
299,15,Cosmic Encounter,1977,1480,6.38128,6.92605,3968,359
306,22038,Warrior Knights,2006,1524,6.36093,6.86433,4054,352


### Data collection using the BGG API

We will add the following features:
- `rating_num_weights`
- `rating_average_weight`
- `playing_time`
- `min_age`
- `min


Additionally, the game mechanics and game categories will be added as categorical features (prefixed with `mech_` and `cat_` respectively).

Initially, the collection process retrieved the 'game mechanics' data and appended it to our dataset using pandas' `.append` method. However, the code raised an error after approximately 5 minutes, potentially due to the BGG API itself. The number of rows collected before the error also varied between collection attempts.

To accommodate these interruptions, the collected data is instead stored in a dictionary inside a for loop, with a try/except block that breaks the loop when an error occurs. This allows us to pick up and continue from the point of the error when collecting the data.
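
The resumable pattern described above can be sketched independently of the BGG API. In the example below, `collect_resumable` and `flaky_fetch` are hypothetical names, and `flaky_fetch` is a stand-in that simulates a single API failure; it is not part of the `boardgamegeek` library.

```python
def collect_resumable(ids, fetch, results, start=0):
    """Fetch each id from `start` onwards and store it in `results`.

    Stops at the first error and returns the index to resume from;
    returns len(ids) once every id has been collected.
    """
    for ind in range(start, len(ids)):
        try:
            results[ids[ind]] = fetch(ids[ind])
        except Exception as err:
            print(f'Stopped at index {ind}: {err!r}')
            return ind
    return len(ids)


# Hypothetical stand-in for an API call that fails on its third call only
calls = {'n': 0}

def flaky_fetch(game_id):
    calls['n'] += 1
    if calls['n'] == 3:
        raise ConnectionError('simulated API error')
    return {'game_id': game_id}


ids = [101, 102, 103, 104]
collected = {}
first_stop = collect_resumable(ids, flaky_fetch, collected)               # interrupted
final_stop = collect_resumable(ids, flaky_fetch, collected, first_stop)   # resumed
print(first_stop, final_stop, len(collected))
```

Returning the resume index from the function, rather than updating a global counter, keeps each collection attempt easy to rerun from wherever the previous one stopped.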

In [6]:
'''Function below was used to collect the additional data using the API and need not be rerun.'''

# # Initialising the dictionary, start_ind(to keep track of the index between collection processes), and the list of game ids to loop through
# id_list = thematic_df['id'].tolist()
# thematic_dict = {}
# start_ind = 0

# # Defining a function to collect all the necessary data for a given game id, then adding it to the dictionary
# # The game is fetched once and its attributes reused, instead of calling the API once per attribute
# def bgg_scrape(ind):
#     game_id = thematic_df['id'][ind]
#     game = bgg.game(game_id=id_list[ind])
#     thematic_dict[game_id] = {
#         'rating_num_weights': game.rating_num_weights,
#         'rating_average_weight': game.rating_average_weight,
#         'playing_time': game.playing_time,
#         'min_age': game.min_age,
#         'min_players': game.min_players,
#         'max_players': game.max_players,
#         'mechanics': game.mechanics,
#         'categories': game.categories,
#         'families': game.families,
#     }

'Function below was used to collect the additional data using the API and need not be rerun.'

In [7]:
'''The for loop below was used to collect the data and need not be rerun.'''

# # Initialising BGGClient object for data collection using the API
# bgg = BGGClient()

# # The loop keeps track of the index before the error occurs, so that it can continue collection from that point after an error breaks the loop
# for ind in range(start_ind, len(id_list)):
#     try:
#         bgg_scrape(ind)
#         start_ind += 1
#         print(start_ind)
#     except Exception as err:
#         # Report the error before stopping, so the cause of the interruption is visible
#         print(err)
#         break

'The for loop below was used to collect the data and need not be rerun.'

In [8]:
# # Exporting the data to .json for further work
# Note that the dict keys must be converted to string before being able to export as json
# json_dict = {}
# for key, value in thematic_dict.items():
#     json_dict[key.astype(str)] = value

# import json
# with open('../data/thematic_dict.json', 'w') as fp:
#     json.dump(json_dict, fp)

In [9]:
# Importing the collected data in dictionary format from the json file
import json
with open('../data/thematic_dict.json') as json_file:
    json_dict = json.load(json_file)

# Converting the keys from string to int
thematic_dict = {}
for key, value in json_dict.items():
    thematic_dict[int(key)] = value
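
The key conversion matters because JSON object keys are always strings. The round trip below is a minimal sketch using two ids and their `min_players` values from the collected data; it shows that integer keys come back as strings after loading and must be converted again before merging on `id`.

```python
import json

original = {161936: {'min_players': 2}, 174430: {'min_players': 1}}

# Plain Python int keys are written out as strings by json.dumps
restored_raw = json.loads(json.dumps(original))

# Convert the keys back to int so they match the dtype of the 'id' column
restored = {int(key): value for key, value in restored_raw.items()}

print(list(restored_raw), list(restored))
```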

In [10]:
# Converting the dict to df format 
thematic_scraped_df = pd.DataFrame.from_dict(thematic_dict, orient='index').reset_index().rename(columns={'index':'id'})
thematic_scraped_df.head()

Unnamed: 0,id,rating_num_weights,rating_average_weight,playing_time,min_age,min_players,max_players,mechanics,categories,families
0,161936,1453,2.8314,60,13,2,4,"[Action Points, Cooperative Game, Hand Managem...","[Environmental, Medical]","[Components: Map (Global Scale), Components: M..."
1,174430,2557,3.9112,120,14,1,4,"[Action Queue, Action Retrieval, Campaign / Ba...","[Adventure, Exploration, Fantasy, Fighting, Mi...","[Category: Dungeon Crawler, Components: Map (C..."
2,233078,1156,4.314,480,14,3,6,"[Action Drafting, Area-Impulse, Dice Rolling, ...","[Civilization, Economic, Exploration, Negotiat...","[Components: Hexagonal Tiles, Components: Map ..."
3,115746,1174,4.2155,180,13,2,4,"[Action Drafting, Area Majority / Influence, A...","[Fantasy, Fighting, Miniatures, Novel-based, T...","[Authors: J.R.R. Tolkien, Category: Dized Tuto..."
4,187645,1127,3.7418,240,14,2,4,"[Area Majority / Influence, Area Movement, Are...","[Civil War, Miniatures, Movies / TV / Radio th...","[Category: Two Player Fighting Games, Componen..."


In [11]:
# Comparing number of rows for thematic_df (existing data) and number of entries in thematic_scraped_df (collected data).
print('Initial number of rows = {}\nNumber of rows scraped = {}\nRows missing = {}'.format(len(thematic_df), len(thematic_scraped_df), len(thematic_df) - len(thematic_scraped_df)))

Initial number of rows = 1646
Number of rows scraped = 1646
Rows missing = 0
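
Matching row counts do not by themselves guarantee that both tables contain the same games. A slightly stricter check, sketched here on toy data with hypothetical names, compares the sets of ids and lets `pd.merge` verify the one-to-one relationship through its `validate` and `indicator` parameters.

```python
import pandas as pd

# Toy stand-ins for the existing data and the collected data
left = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [3, 1, 2], 'min_age': [10, 12, 14]})

# Ids present in one table but not in the other
missing_ids = set(left['id']) - set(right['id'])
extra_ids = set(right['id']) - set(left['id'])

# validate raises MergeError if an id is repeated on either side;
# indicator adds a '_merge' column showing where each row came from
checked = pd.merge(left, right, on='id', how='outer',
                   validate='one_to_one', indicator=True)
print(missing_ids, extra_ids)
print(checked['_merge'].value_counts())
```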


In [12]:
merged_df = pd.merge(thematic_df, thematic_scraped_df, on='id', how='outer')
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1646 entries, 0 to 1645
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     1646 non-null   int64  
 1   name                   1646 non-null   object 
 2   yearpublished          1646 non-null   int64  
 3   rank                   1646 non-null   int64  
 4   bayesaverage           1646 non-null   float64
 5   average                1646 non-null   float64
 6   usersrated             1646 non-null   int64  
 7   thematic_rank          1646 non-null   int32  
 8   rating_num_weights     1646 non-null   int64  
 9   rating_average_weight  1646 non-null   float64
 10  playing_time           1646 non-null   int64  
 11  min_age                1646 non-null   int64  
 12  min_players            1646 non-null   int64  
 13  max_players            1646 non-null   int64  
 14  mechanics              1646 non-null   object 
 15  cate

In [13]:
test = (merged_df[merged_df['name']=='Gloomhaven']['families']).iloc[-1]

### The `families` feature
[Families](https://boardgamegeek.com/wiki/page/Families) in BGG are groups of games sharing particular aspects. [At the point of writing](https://boardgamegeek.com/browse/boardgamefamily/page/1), there are 5084 families on BGG. Including a categorical feature for each family would create a large but extremely sparse dataset, which may lead to overfitting. By comparison, `mechanics` ([192 total](https://boardgamegeek.com/browse/boardgamemechanic)) and `categories` ([84 total](https://boardgamegeek.com/browse/boardgamecategory)) have far fewer unique entries.

Instead of including all features from `families`, we will pick a specific feature of interest for our further data analysis: <b>whether or not the board game was crowdfunded</b>.
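
To see how many distinct families actually occur in a list-valued column, the lists can be flattened with `explode`. The sketch below runs on a toy frame: the first two lists are taken from the dataset, the empty list is a hypothetical edge case, and the variable names are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({'families': [
    ['Crowdfunding: Kickstarter', 'Theme: Anime / Manga'],
    ['Creatures: Zombies', 'Crowdfunding: Kickstarter'],
    [],  # hypothetical game with no families listed
]})

# One row per (game, family) pair; empty lists become NaN and are dropped
family_counts = toy['families'].explode().dropna().value_counts()
n_unique = family_counts.size

print(n_unique)
print(family_counts)
```

Running the same two lines on the full `families` column would give the number of one-hot columns that encoding every family would add.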

#### Crowdfunding for board games
[Tabletop games raised 1.5 billion dollars on Kickstarter](https://www.dicebreaker.com/companies/kickstarter/news/tabletop-games-raised-one-and-a-half-billion-dollars-on-kickstarter)


In [14]:
# Feature engineering: adding a 'crowdfunded' feature
# Each row in the 'families' feature contains a list of families; for each list we check whether any family name contains the string 'Crowdfunding:'
merged_df['crowdfunded'] = merged_df['families'].apply(lambda x: int(any('Crowdfunding:' in fam for fam in x)))

# Checking updated df for all rows with crowdfunded games
merged_df[merged_df['crowdfunded']==1]

Unnamed: 0,id,name,yearpublished,rank,bayesaverage,average,usersrated,thematic_rank,rating_num_weights,rating_average_weight,playing_time,min_age,min_players,max_players,mechanics,categories,families,crowdfunded
42,392,Brawl,1999,4460,5.79023,6.36033,1186,845,111,1.3694,15,10,2,7,"[Real-Time, Variable Player Powers]","[Card Game, Fighting, Print & Play, Real-time]","[Crowdfunding: Kickstarter, Theme: Anime / Manga]",1
61,713,Nuclear War,1965,3794,5.86329,6.21635,2944,832,272,1.4449,60,10,2,6,"[Action Queue, Hand Management, Take That]","[Card Game, Humor, Modern Warfare, Negotiation...","[Crowdfunding: Kickstarter, Digital Implementa...",1
173,2324,Last Frontier: The Vesuvius Incident,1993,7280,5.62971,7.01500,200,1036,36,2.9167,120,13,1,2,"[Chit-Pull System, Dice Rolling, Events, Grid ...","[Horror, Science Fiction, Wargame]","[Creatures: Aliens / Extraterrestrials, Crowdf...",1
267,5641,Nightmare,1991,10548,5.56461,5.92491,1061,1517,72,1.7778,60,12,3,6,"[Dice Rolling, Elapsed Real Time Ending, Roll ...","[Adventure, Dice, Horror, Party Game]","[Components: Videocassettes, Crowdfunding: Kic...",1
293,7514,Zombie Plague,2001,6357,5.66620,6.67448,458,1028,64,1.8125,60,10,2,6,"[Action Points, Dice Rolling, Roll / Spin and ...","[Fighting, Horror, Miniatures, Print & Play, S...","[Creatures: Zombies, Crowdfunding: Kickstarter]",1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1618,358320,Sleeping Gods: Distant Skies,2023,1824,6.25694,8.59426,726,265,23,2.7391,600,13,1,4,"[Action Points, Cooperative Game, Deck, Bag, a...","[Adventure, Exploration, Fantasy, Mythology]","[Crowdfunding: Gamefound, Game: Sleeping Gods,...",1
1622,360899,Harrow County: The Game of Gothic Conflict,2024,5131,5.73578,8.37220,205,818,18,3.8333,90,12,1,3,"[Chit-Pull System, Cube Tower, Deduction, Hexa...","[Comic Book / Strip, Horror, Mythology]","[Crowdfunding: Kickstarter, Theme: Witches]",1
1623,363204,FLOE,2025,14349,5.53257,7.57097,62,1376,23,2.9565,90,10,1,4,"[Action Points, Area Movement, Contracts, Crit...","[Adventure, Animals, Exploration, Fantasy]","[Admin: Upcoming Releases, Crowdfunding: Kicks...",1
1635,371433,Terrorscape,2023,2685,6.03704,8.39552,617,487,16,2.3125,45,14,2,4,"[Deduction, Dice Rolling, Hand Management, Hid...","[Deduction, Dice, Fighting, Horror, Miniatures]","[Crowdfunding: Kickstarter, Digital Implementa...",1


In [15]:
# Dropping 'families' column
merged_df.drop(columns='families', inplace=True)

### MultiLabelBinarizer

In [16]:
from sklearn.preprocessing import MultiLabelBinarizer

In [34]:
mlb_mech = MultiLabelBinarizer()

In [35]:
test = mlb_mech.fit_transform(merged_df['mechanics'])

In [43]:
test2 = pd.DataFrame(test, columns='mech_'+mlb_mech.classes_).rename(str.lower, axis='columns')
test2.columns = test2.columns.str.replace(' ', '_').str.replace('_/_','/')

In [44]:
test2

Unnamed: 0,mech_acting,mech_action_drafting,mech_action_points,mech_action_queue,mech_action_retrieval,mech_action_timer,mech_action/event,mech_advantage_token,mech_alliances,mech_area_majority/influence,...,mech_turn_order:_time_track,mech_variable_phase_order,mech_variable_player_powers,mech_variable_set-up,mech_victory_points_as_a_resource,mech_voting,mech_worker_placement,mech_worker_placement_with_dice_workers,"mech_worker_placement,_different_worker_types",mech_zone_of_control
0,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1641,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1642,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1643,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1644,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
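
`MultiLabelBinarizer` turns a column of label lists into one 0/1 column per distinct label. The steps used above for `mechanics` can be wrapped in a small helper so that the same column naming is applied to `categories` with the `cat_` prefix. This is a sketch on toy data: `binarize_list_column` is a hypothetical name, not part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def binarize_list_column(df, column, prefix):
    """One-hot encode a column of label lists, one prefixed column per label."""
    mlb = MultiLabelBinarizer()
    encoded = mlb.fit_transform(df[column])
    # Same naming scheme as above: lowercase, ' / ' becomes '/', spaces become '_'
    names = [
        f'{prefix}{label}'.lower().replace(' / ', '/').replace(' ', '_')
        for label in mlb.classes_
    ]
    return pd.DataFrame(encoded, columns=names, index=df.index)


# Toy frame using category labels that appear in the dataset
toy = pd.DataFrame({'categories': [
    ['Card Game', 'Fighting'],
    ['Horror', 'Science Fiction'],
    ['Fighting', 'Horror'],
]})
cat_df = binarize_list_column(toy, 'categories', 'cat_')
print(cat_df)
```

Keeping the original index on the encoded frame means it can be joined back to the source dataframe with `pd.concat(..., axis=1)` without realigning rows.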
