### Introduction
In another post, I scraped board game data from [boardgamegeek.com](https://boardgamegeek.com/) using its API, as well as beautifulsoup4. I also processed the data to put it into a more usable format.

This post is an attempt to address my original goal: finding games similar to the ones I already like. We'll try to do this with a K nearest neighbours algorithm.

In [54]:
# import the basic packages
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

In [55]:
# get the data
games = pd.read_csv('bggdata/bgg_games_clean.csv', index_col=0)
games_mechanics = pd.read_csv('bggdata/games_mechanics.csv',index_col=0)
games_categories = pd.read_csv('bggdata/games_categories.csv',index_col=0)

In [56]:
# dropping the year and the text columns from games, keeping the 8 numeric features
games_drop = games.drop(['Year','title','description'],axis=1)

### Nearest Neighbours

Here we have a naive idea of simply bundling all features together. We will use 8 columns from *games*, 182 columns from *games_mechanics*, 83 columns from *games_categories* for a total of 273 columns. 

Moreover, since the numbers in *games* differ by several orders of magnitude, we will scale them using `StandardScaler`, so that each column has mean 0 and standard deviation 1.

In [57]:
scaler = StandardScaler()
scaled_games = scaler.fit_transform(games_drop)
# putting index and column names back
scaled_games = pd.DataFrame(scaled_games,columns = games_drop.columns,
                            index = games_drop.index)
scaled_games.head()

| game_id | avg_rating | num_voters | owners | complexity | minplayers | maxplayers | minplaytime | maxplaytime |
|---|---|---|---|---|---|---|---|---|
| 174430 | 2.560079 | 11.536048 | 13.102335 | 2.19968 | -1.481667 | -0.108935 | -0.010778 | 0.053977 |
| 161936 | 2.364589 | 11.528012 | 12.632545 | 0.999323 | -0.030304 | -0.108935 | -0.010778 | -0.05673 |
| 224517 | 2.413167 | 5.023684 | 5.203451 | 2.275277 | -0.030304 | -0.108935 | -0.010778 | 0.053977 |
| 167791 | 2.170114 | 17.926426 | 16.72603 | 1.472366 | -1.481667 | -0.043388 | 0.120966 | 0.053977 |
| 233078 | 2.451964 | 3.52167 | 3.010162 | 2.635219 | 1.421058 | 0.022158 | 0.384454 | 0.718217 |
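
To make concrete what the scaler does, here is a minimal sketch on made-up numbers (the toy array is not taken from the dataset). It checks that `StandardScaler` matches the manual formula (x - column mean) / column standard deviation, where the standard deviation is the population one (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two columns on very different scales,
# e.g. a rating out of 10 and an owner count in the thousands.
toy = np.array([[6.0, 1000.0],
                [7.0, 5000.0],
                [8.0, 90000.0]])

toy_scaled = StandardScaler().fit_transform(toy)

# Manual equivalent: subtract the column mean, divide by the column std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print(col_means, col_stds)
```

Scaling does not remove skew: a column with a few huge values still produces large scaled values for those rows, which is consistent with the double-digit `num_voters` and `owners` entries for the most popular games.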


In [58]:
# putting all features together
df_all = pd.concat([scaled_games,games_mechanics,games_categories],axis=1)
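
One thing worth checking at this step: `pd.concat(..., axis=1)` aligns rows by index label, not by position, and fills in `NaN` wherever a label is missing from one of the frames. The sketch below uses made-up toy frames (the ids, column names and values are illustrative only) to show both behaviours. On the real data, a quick `df_all.isna().any().any()` would be a cheap sanity check, since scikit-learn estimators generally refuse input containing `NaN`.

```python
import pandas as pd

# Toy frames (made up): same game ids, but listed in a different order
left = pd.DataFrame({"complexity": [0.5, -0.5]},
                    index=pd.Index([13, 93], name="game_id"))
right = pd.DataFrame({"Dice Rolling": [1, 0]},
                     index=pd.Index([93, 13], name="game_id"))

# Rows are matched on game_id, so the row order of `right` does not matter
combined = pd.concat([left, right], axis=1)

# If one frame lacks an id, the missing cells become NaN
right_short = right.loc[[93]]
combined_short = pd.concat([left, right_short], axis=1)
has_missing = bool(combined_short.isna().any().any())
print(combined)
print(has_missing)
```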

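Now we are ready to use the nearest neighbours algorithm. One thing to note is that we are in a high-dimensional space (273 features). Because of the curse of dimensionality, Euclidean distances tend to become less informative there, so it makes more sense to compare games by the angle between their feature vectors. In scikit-learn, `metric='cosine'` means the cosine *distance*, i.e. 1 minus the cosine similarity, so smaller values mean more similar games.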
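
As a small illustration of how the two metrics differ, the sketch below computes both by hand on made-up vectors. The helper `cosine_distance` is written here for illustration and is not part of the notebook. Two vectors pointing in the same direction have cosine distance 0 even if their lengths differ a lot, while their Euclidean distance is large.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors (made up)
a = np.array([1.0, 2.0, 0.0])
b = np.array([10.0, 20.0, 0.0])  # same direction as a, ten times longer
c = np.array([0.0, 0.0, 3.0])    # orthogonal to a

d_ab = cosine_distance(a, b)     # approximately 0: same direction
d_ac = cosine_distance(a, c)     # 1: no overlap at all
e_ab = np.linalg.norm(a - b)     # Euclidean distance is large despite same direction
print(d_ab, d_ac, e_ab)
```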

In [59]:
nbrs_all = NearestNeighbors(n_neighbors=10,metric='cosine',algorithm='brute')
nbrs_all.fit(df_all)
distances, indices = nbrs_all.kneighbors(df_all)

In [60]:
# shape of indices
indices.shape

(20115, 10)

Now, each row of *indices* corresponds to one game (say, a game X) and contains the positions of the nearest neighbours of X. For example, the first row corresponds to the first game in *games*, namely *Gloomhaven*. It is important not to confuse the positional indices stored in *indices* with the database ids (`game_id`) that label the rows of our tables.
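
The id-versus-position distinction can be sketched with a toy frame. Everything below is illustrative: the titles, the small `toy_indices` array and the helper `neighbours_by_id` are made up for this example. The idea is to translate an id into a position with `index.get_loc`, and positions back into rows with `iloc`.

```python
import numpy as np
import pandas as pd

# Toy table indexed by database id (titles are placeholders)
toy_games = pd.DataFrame(
    {"title": ["Game A", "Game B", "Game C"]},
    index=pd.Index([174430, 161936, 9209], name="game_id"),
)

# Pretend output of kneighbors: row i holds the positions of the neighbours of row i
toy_indices = np.array([[0, 2],
                        [1, 0],
                        [2, 1]])

def neighbours_by_id(game_id, frame, nn_indices):
    """Return the neighbour rows of the game with the given database id."""
    pos = frame.index.get_loc(game_id)   # database id -> row position
    return frame.iloc[nn_indices[pos]]   # row positions -> rows

result = neighbours_by_id(9209, toy_games, toy_indices)
neighbour_ids = result.index.tolist()
print(result)
```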

In [61]:
# indices of games close to index 0
indices[0]

array([  0,  54,  25,  21,   1,  13, 327,  81,   3, 721], dtype=int64)
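
Note that the first entry is 0, the game itself: when the fitted data is passed back to `kneighbors`, each point is normally its own nearest neighbour (exact duplicates could change the order). The sketch below shows this on random toy data, which is an assumption for illustration only. It also shows that calling `kneighbors()` with no argument makes scikit-learn leave each point out of its own neighbour list.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy feature matrix (random, purely illustrative): 6 "games", 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

nn = NearestNeighbors(n_neighbors=3, metric="cosine", algorithm="brute").fit(X)

# Querying with the fitted data: column 0 is each point itself, at distance ~0
dist_with_self, idx_with_self = nn.kneighbors(X)

# Querying with no argument: a point is not counted as its own neighbour
dist_no_self, idx_no_self = nn.kneighbors()

print(idx_with_self[:, 0])
print(idx_no_self)
```

With `n_neighbors=10` as in this post, each game therefore gets itself plus 9 actual neighbours.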

In [62]:
# neighbours of Gloomhaven
games.iloc[indices[0]]

| game_id | title | Year | description | avg_rating | num_voters | owners | complexity | minplayers | maxplayers | minplaytime | maxplaytime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 174430 | Gloomhaven | 2017 | Gloomhaven is a game of Euro-inspired tactica... | 8.7977 | 41028 | 66800 | 3.8575 | 1 | 4 | 60 | 120 |
| 121921 | Robinson Crusoe: Adventures on the Cursed Island | 2012 | Robinson Crusoe: Adventures on the Cursed Isla... | 7.84268 | 33658 | 52052 | 3.7781 | 1 | 4 | 60 | 120 |
| 96848 | Mage Knight Board Game | 2011 | The Mage Knight board game puts you in control... | 8.10245 | 27601 | 39168 | 4.3086 | 1 | 4 | 60 | 240 |
| 205637 | Arkham Horror: The Card Game | 2016 | Something evil stirs in Arkham, and only you c... | 8.18056 | 27905 | 49710 | 3.4279 | 1 | 2 | 60 | 120 |
| 161936 | Pandemic Legacy: Season 1 | 2015 | Pandemic Legacy is a co-operative campaign gam... | 8.61484 | 41000 | 64455 | 2.8397 | 2 | 4 | 60 | 60 |
| 169786 | Scythe | 2016 | It is a time of unrest in 1920s Europa. The as... | 8.23946 | 56761 | 74108 | 3.4098 | 1 | 5 | 90 | 115 |
| 15987 | Arkham Horror | 2005 | &#10; The year is 1926, and it is the h... | 7.25816 | 36647 | 48324 | 3.5767 | 1 | 8 | 120 | 240 |
| 40834 | Dominion: Intrigue | 2009 | In Dominion: Intrigue (as with Dominion), each... | 7.72203 | 29932 | 43306 | 2.4228 | 2 | 4 | 30 | 30 |
| 167791 | Terraforming Mars | 2016 | In the 2400s, mankind begins to terraform the ... | 8.43293 | 63292 | 84888 | 3.2408 | 1 | 5 | 120 | 120 |
| 65244 | Forbidden Island | 2010 | Forbidden Island is a visually stunning cooper... | 6.79259 | 41580 | 71849 | 1.7406 | 2 | 4 | 30 | 30 |


Let us try out another very popular game, say, *Ticket to Ride*, with id 9209.

In [63]:
# find the position index of Ticket to ride
df_all.index.get_loc(9209)

170

In [66]:
# neighbours of Ticket to Ride
games.iloc[indices[170]]

| game_id | title | Year | description | avg_rating | num_voters | owners | complexity | minplayers | maxplayers | minplaytime | maxplaytime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9209 | Ticket to Ride | 2004 | With elegantly simple gameplay, Ticket to Ride... | 7.42598 | 70853 | 96261 | 1.8496 | 2 | 5 | 30 | 60 |
| 14996 | Ticket to Ride: Europe | 2005 | Ticket to Ride: Europe takes you on a new trai... | 7.54833 | 57709 | 81626 | 1.9399 | 2 | 5 | 30 | 60 |
| 68448 | 7 Wonders | 2010 | You are the leader of one of the 7 great citie... | 7.75314 | 83484 | 111185 | 2.3302 | 2 | 7 | 30 | 30 |
| 30549 | Pandemic | 2008 | In Pandemic, several virulent diseases have br... | 7.60782 | 100934 | 153172 | 2.4117 | 2 | 4 | 45 | 45 |
| 822 | Carcassonne | 2000 | Carcassonne is a tile-placement game in which ... | 7.41866 | 100666 | 147490 | 1.913 | 2 | 5 | 30 | 45 |
| 36218 | Dominion | 2008 | &quot;You are a monarch, like your parents bef... | 7.62049 | 77432 | 101070 | 2.3582 | 2 | 4 | 30 | 30 |
| 148228 | Splendor | 2014 | Splendor is a game of chip-collecting and card... | 7.45242 | 57826 | 81510 | 1.7987 | 2 | 4 | 30 | 30 |
| 13 | Catan | 1995 | In CATAN (formerly The Settlers of Catan), pla... | 7.15563 | 100403 | 152507 | 2.3226 | 3 | 4 | 60 | 120 |
| 230802 | Azul | 2017 | Introduced by the Moors, azulejos (originally ... | 7.83234 | 52102 | 77182 | 1.7696 | 2 | 4 | 30 | 45 |
| 70323 | King of Tokyo | 2011 | In King of Tokyo, you play mutant monsters, gi... | 7.18558 | 57089 | 82000 | 1.4939 | 2 | 6 | 30 | 30 |


Let us try yet another game, a less well-known one and a personal favourite: *El Grande*, with id 93.

In [73]:
# position index of El Grande
df_all.index.get_loc(93)

75

In [74]:
# neighbours of El Grande
games.iloc[indices[75]]

| game_id | title | Year | description | avg_rating | num_voters | owners | complexity | minplayers | maxplayers | minplaytime | maxplaytime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 93 | El Grande | 1995 | In this award-winning game, players take on th... | 7.75546 | 23419 | 22471 | 3.0531 | 2 | 5 | 60 | 120 |
| 170216 | Blood Rage | 2015 | &quot;Life is Battle; Battle is Glory; Glory i... | 7.99685 | 33425 | 39940 | 2.8801 | 2 | 4 | 60 | 90 |
| 2651 | Power Grid | 2004 | Power Grid is the updated release of the Fried... | 7.85678 | 55881 | 65978 | 3.2723 | 2 | 6 | 120 | 120 |
| 3076 | Puerto Rico | 2002 | In Puerto Rico, players assume the roles of co... | 7.99212 | 62738 | 73011 | 3.28 | 3 | 5 | 90 | 150 |
| 31260 | Agricola | 2007 | Description from BoardgameNews&#10;&#10;In Agr... | 7.94674 | 63103 | 75261 | 3.6397 | 1 | 5 | 30 | 150 |
| 12333 | Twilight Struggle | 2005 | &quot;Now the trumpet summons us again, not as... | 8.29353 | 40433 | 55623 | 3.5847 | 2 | 2 | 120 | 180 |
| 164928 | Orléans | 2014 | During the medieval goings-on around Orl&eacut... | 8.08263 | 21791 | 26514 | 3.0525 | 2 | 4 | 90 | 90 |
| 6249 | Alhambra | 2003 | Granada, 1278. At the foot of the Sierra Neva... | 7.03029 | 28386 | 34658 | 2.1046 | 2 | 6 | 45 | 60 |
| 28143 | Race for the Galaxy | 2007 | 2018 UPDATE: The second edition of the game is... | 7.75971 | 46051 | 55731 | 2.9824 | 2 | 4 | 30 | 60 |
| 40834 | Dominion: Intrigue | 2009 | In Dominion: Intrigue (as with Dominion), each... | 7.72203 | 29932 | 43306 | 2.4228 | 2 | 4 | 30 | 30 |


### Discussion
As someone who has played a few board games, I find these results quite interesting.

Neighbours of *Gloomhaven* are fairly popular games, and many of them are cooperative games, i.e., games where players play together as a team against the game. They are also generally of higher complexity. 

Neighbours of *Ticket to Ride* are also very popular, but they are of fairly low complexity, making them ideal for new board game players.

Neighbours of *El Grande* seem to be more strategic games, the kind people call *Euro games*, with *Blood Rage* as a possible (and interesting) exception.

### What now?
An advantage of KNN is that it is easy to implement. The disadvantage is that, used this way, it comes with no performance metric: all we could do was eyeball the neighbours of a few specific games. It would be nice to have an approach with a sensible measure of success.