In [3]:
import pandas as pd
import numpy as np
import plotly.express as px
from scipy import stats
# Similarity scoring
from sklearn.metrics import jaccard_score

import seaborn as sns
from matplotlib import pyplot as plt


# Import functions from scipy
from scipy.spatial.distance import pdist, squareform

# Making a Content-Based Recommendation System

Content-based recommendation systems suggest items whose features are similar to those of items a user already enjoys. This is important for surfacing new items that have not yet received any reviews.

## Themes 

We will start with finding similar themes!

In [4]:
themes_df = pd.read_csv('data/modern_themes.csv')
# Use the BGG ID as the index
themes_df.set_index('BGGId', inplace=True)
# Due to memory limits, use a sample as a proof of concept; run the full dataset on a virtual machine
themes_sampled = themes_df.sample(2000, random_state=42)
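
The sample of 2,000 games keeps the pairwise comparison that follows small enough to fit in memory, because a dense pairwise matrix for n games has n × n entries. The sketch below gives a rough size estimate. It assumes 8-byte floats (the NumPy default), and the 20,000-row figure is a hypothetical stand-in for the full dataset, whose actual row count is not stated here.

```python
import numpy as np

def square_matrix_megabytes(n_rows, dtype=np.float64):
    """Approximate size in MB (10**6 bytes) of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * np.dtype(dtype).itemsize / 1e6

# The 2,000-game sample used in this notebook
print(square_matrix_megabytes(2000))    # 32.0

# Hypothetical full-size dataset (assumed row count, for illustration only)
print(square_matrix_megabytes(20000))   # 3200.0
```

Memory grows with the square of the number of games, so ten times as many rows needs roughly a hundred times as much memory.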

## Using Jaccard Similarity

Our dataset comes with two tables that are already one-hot encoded: one for each game's **themes** and one for its **mechanics**. We can compute the Jaccard Similarity between games from these tables to determine how similar they are.

Developed by Paul Jaccard, the index ranges from 0 to 1. The closer to 1, the more similar the two sets of data.

Jaccard Similarity = (number of features present in both games) / (number of features present in either game)

If two games have exactly the same set of features, their Jaccard Similarity Index will be 1. Conversely, if they have nothing in common, their similarity will be 0.
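
As a small worked example of the definition above, here are two made-up games described by five one-hot flags. The vectors are invented for illustration and are not taken from the dataset. The manual count is compared against `scipy.spatial.distance.jaccard`, which returns a distance, so similarity is one minus that value.

```python
import numpy as np
from scipy.spatial.distance import jaccard

# Two hypothetical games, each described by five one-hot theme flags
game_a = np.array([1, 1, 0, 1, 0], dtype=bool)
game_b = np.array([1, 0, 0, 1, 1], dtype=bool)

# Features present in both games, and features present in at least one
shared = np.logical_and(game_a, game_b).sum()   # 2
either = np.logical_or(game_a, game_b).sum()    # 4

manual_similarity = shared / either             # 2 / 4 = 0.5

# SciPy reports Jaccard *distance*; similarity is 1 - distance
scipy_similarity = 1 - jaccard(game_a, game_b)

print(manual_similarity, scipy_similarity)
```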

In [5]:
# Calculate all pairwise distances
jaccard_distances_themes = pdist(themes_sampled.values, metric='jaccard')

# Convert the condensed distances to a square matrix of similarities (1 - distance)
jaccard_similarity_array_themes = 1 - squareform(jaccard_distances_themes)

# Wrap the array in a pandas DataFrame
themes_jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array_themes, index=themes_sampled.index, columns=themes_sampled.index)
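
The same three steps (pairwise distances, square similarity matrix, DataFrame wrapper) are repeated later for mechanics and for the combined data, so they could be wrapped in a helper. This is a sketch rather than part of the original notebook. The function name `jaccard_similarity_frame` and the toy data are hypothetical, and the cast to `bool` is an added precaution so the metric treats each column as a present/absent flag.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

def jaccard_similarity_frame(one_hot_df):
    """Return a square DataFrame of pairwise Jaccard similarities between rows."""
    values = one_hot_df.to_numpy().astype(bool)
    distances = pdist(values, metric="jaccard")   # condensed pairwise distances
    similarity = 1 - squareform(distances)        # square matrix, diagonal = 1
    return pd.DataFrame(similarity, index=one_hot_df.index, columns=one_hot_df.index)

# Toy one-hot table with made-up IDs and theme columns
toy = pd.DataFrame(
    {"Theme_A": [1, 1, 0], "Theme_B": [0, 1, 1], "Theme_C": [0, 0, 1]},
    index=pd.Index([101, 102, 103], name="BGGId"),
)
toy_similarity = jaccard_similarity_frame(toy)
print(toy_similarity)
```

With such a helper, the themes step would reduce to something like `jaccard_similarity_frame(themes_sampled)`.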

In [6]:
themes_jaccard_similarity_df.head()

| BGGId | 218460 | 587 | 84671 | 144958 | 24977 | 260037 | 10537 | 178147 | 110864 | 170669 | ... | 10968 | 173337 | 257766 | 280162 | 161926 | 230650 | 1105 | 9610 | 262114 | 241 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 218460 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 587 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 84671 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 144958 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 24977 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


## Let's look at this in practice

We'll use Giant Uno (BGGId 218460) as the example.

In [9]:
# Find the values for a specific game (Giant Uno)
giant_uno_theme_series = themes_jaccard_similarity_df.loc[218460]

# Sort these values from highest to lowest and keep the top 10
theme_ordered_similarities = giant_uno_theme_series.sort_values(ascending=False)[:10]

# Print the results
print(theme_ordered_similarities)

BGGId
218460    1.0
172542    1.0
2763      1.0
2043      1.0
207898    1.0
4178      1.0
113301    1.0
204142    1.0
58003     1.0
4382      1.0
Name: 218460, dtype: float64
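
The list above includes game 218460 itself, and all ten scores are tied at 1.0, so themes alone do not separate these games. A possible refinement, sketched below, drops the query game before ranking and counts how many other games tie at a perfect score. The helper name `most_similar` and the toy matrix are hypothetical and not part of the original notebook.

```python
import pandas as pd

def most_similar(similarity_df, game_id, n=10):
    """Return the n highest-scoring games for game_id, excluding the game itself."""
    scores = similarity_df.loc[game_id].drop(index=game_id)
    return scores.sort_values(ascending=False).head(n)

# Toy similarity matrix with made-up IDs and scores
ids = pd.Index([1, 2, 3, 4], name="BGGId")
toy_similarity = pd.DataFrame(
    [[1.00, 0.25, 0.75, 0.00],
     [0.25, 1.00, 0.50, 0.10],
     [0.75, 0.50, 1.00, 0.20],
     [0.00, 0.10, 0.20, 1.00]],
    index=ids, columns=ids,
)

top_two = most_similar(toy_similarity, game_id=1, n=2)
print(top_two)

# Count how many *other* games tie at a perfect score for game 1
perfect_ties = (toy_similarity.loc[1].drop(index=1) == 1.0).sum()
print(perfect_ties)
```

When many games tie at 1.0, the order among them after sorting carries no information, so the count of ties is often more informative than the top-10 list.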


## Mechanics

In [31]:
mechanics_df = pd.read_csv('data/modern_mechanics.csv')
# Use the BGG ID as the index
mechanics_df.set_index('BGGId', inplace=True)
# Due to memory limits, use a sample as a proof of concept; run the full dataset on a virtual machine
mechanics_sampled = mechanics_df.sample(2000, random_state=42)

## Jaccard Similarity 

Same approach as before, this time on the mechanics data.

In [1]:
# Re-import in case this cell runs before the imports cell (avoids a NameError on pdist)
from scipy.spatial.distance import pdist, squareform

# Calculate all pairwise distances
jaccard_distances_mechs = pdist(mechanics_sampled.values, metric='jaccard')

# Convert the condensed distances to a square matrix of similarities (1 - distance)
jaccard_similarity_array_mechs = 1 - squareform(jaccard_distances_mechs)

# Wrap the array in a pandas DataFrame
mechanics_jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array_mechs, index=mechanics_sampled.index, columns=mechanics_sampled.index)

## Check with a specific example

Werewolf is a popular social game about secret roles, sometimes called 'Mafia'.

The game alternates between night and day phases. At night, the Werewolves secretly choose a Villager to kill. During the day, the Villager who was killed is revealed and is out of the game. The remaining Villagers then vote on the player they suspect is a Werewolf. That player reveals his/her role and is out of the game.

Werewolves win when there are an equal number of Villagers and Werewolves. Villagers win when they have killed all Werewolves.

In [44]:
# Find the values for a specific game 
werewolf_mechanics_series = mechanics_jaccard_similarity_df.loc[925]

# Sort these values from highest to lowest
mechanic_ordered_similarities = werewolf_mechanics_series.sort_values(ascending=False)[:10]

# Print the results
print(mechanic_ordered_similarities)

BGGId
925       1.000000
67148     0.727273
140457    0.636364
144464    0.583333
24068     0.500000
225167    0.500000
166384    0.400000
162660    0.363636
224212    0.363636
127024    0.333333
Name: 925, dtype: float64


Now when we look at our most similar games:

[Ultimate Werewolf](https://boardgamegeek.com/boardgame/67148/ultimate-werewolf-compact-edition) is a modern interpretation of Werewolf. Some of the other similar games are also variations of Werewolf.

[Shadow Hunters](https://boardgamegeek.com/boardgame/24068/shadow-hunters) is a survival board game set in a devil-filled forest in which three groups of characters—the Shadows, creatures of the night; the Hunters, humans who try to destroy supernatural creatures; and the Neutrals, civilians caught in the middle of this ancient battle—struggle against each other to survive.

## Can we do both at once?

In [49]:
combined_df = mechanics_df.join(themes_df)
combined_df

| BGGId | Alliances | Area Majority / Influence | Auction/Bidding | Dice Rolling | Hand Management | Simultaneous Action Selection | Trick-taking | Hexagon Grid | Once-Per-Game Abilities | Set Collection | ... | Theme_Fashion | Theme_Geocaching | Theme_Ecology | Theme_Chernobyl | Theme_Photography | Theme_French Foreign Legion | Theme_Cruise ships | Theme_Apache Tribes | Theme_Rivers | Theme_Flags identification |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 342010 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 342207 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 342942 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 343905 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
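
`DataFrame.join` performs a left join on the index by default, so any game present in the mechanics table but missing from the themes table would get `NaN` in every theme column. The sketch below, using made-up toy tables, shows one way to check for that and fill the gaps with zeros before computing distances. Whether the real tables have mismatched IDs is not known from the output above, so treat this as a precaution rather than a required step.

```python
import pandas as pd

# Toy tables with made-up values; game 4 is missing from the themes table
mechanics_toy = pd.DataFrame(
    {"Dice Rolling": [1, 0, 1]}, index=pd.Index([1, 3, 4], name="BGGId")
)
themes_toy = pd.DataFrame(
    {"Theme_Rivers": [0, 1]}, index=pd.Index([1, 3], name="BGGId")
)

# Default join is a left join on the index
combined_toy = mechanics_toy.join(themes_toy)

# Number of games with at least one missing value after the join
missing_rows = combined_toy.isna().any(axis=1).sum()
print(missing_rows)

# Treat a missing flag as "feature absent" and restore integer dtype
combined_filled = combined_toy.fillna(0).astype(int)
print(combined_filled)
```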


In [50]:
# Take a test sample
combined_sampled = combined_df.sample(2000, random_state=42)

In [62]:
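# Calculate all pairwise distances on the full combined data.
# This is memory-heavy: if it does not fit, use combined_sampled here
# and for the index/columns of the DataFrame below.
jaccard_distances = pdist(combined_df.values, metric='jaccard')

# Convert the condensed distances to a square matrix of similarities (1 - distance)
jaccard_similarity_array = 1 - squareform(jaccard_distances)

# Wrap the array in a pandas DataFrame
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, index=combined_df.index, columns=combined_df.index)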
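
If only one game's neighbours are needed, the full square matrix can be avoided by comparing that single row against all rows with `scipy.spatial.distance.cdist`. This is an alternative sketch, not the approach used in this notebook. The helper name `similarity_to_all` and the toy data are hypothetical.

```python
import pandas as pd
from scipy.spatial.distance import cdist

def similarity_to_all(one_hot_df, game_id):
    """Jaccard similarity between one game and every game in one_hot_df."""
    values = one_hot_df.to_numpy().astype(bool)
    query = one_hot_df.loc[[game_id]].to_numpy().astype(bool)   # shape (1, n_features)
    distances = cdist(query, values, metric="jaccard")[0]       # one row of distances
    return pd.Series(1 - distances, index=one_hot_df.index, name=game_id)

# Toy one-hot table with made-up IDs and feature columns
toy = pd.DataFrame(
    {"Theme_A": [1, 1, 0], "Theme_B": [0, 1, 1], "Theme_C": [0, 0, 1]},
    index=pd.Index([101, 102, 103], name="BGGId"),
)

scores = similarity_to_all(toy, 101)
print(scores.sort_values(ascending=False))
```

This stores one value per game instead of one per pair of games, at the cost of recomputing for each new query.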

In [61]:
jaccard_similarity_df.sample(10)

| BGGId | 3320 | 240100 | 241724 | 142889 | 209166 | 7349 | 172662 | 38980 | 244333 | 256669 | ... | 99437 | 70149 | 200456 | 127024 | 130556 | 29109 | 180205 | 1326 | 5957 | 192296 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 42207 | 0.0 | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.125 | 0.0 | 0.058824 |
| 175121 | 0.2 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.0 | 0.0 | 0.0 | ... | 0.111111 | 0.0 | 0.166667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.0 | 0.055556 |
| 109105 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.125 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.125 | 0.0 | 0.0 | 0.058824 | 0.0 | 0.0 | 0.0 | 0.125 | 0.0 | 0.058824 |
| 154515 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.142857 | 0.0 | 0.0625 |
| 232139 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.090909 | 0.444444 | 0.0 | 0.0 | 0.1 | 0.0 | ... | 0.083333 | 0.0 | 0.0 | 0.047619 | 0.125 | 0.083333 | 0.0 | 0.083333 | 0.375 | 0.1 |
| 142124 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.083333 | 0.0 | 0.2 | ... | 0.090909 | 0.0 | 0.125 | 0.105263 | 0.0 | 0.090909 | 0.0 | 0.0 | 0.0 | 0.105263 |
| 294514 | 0.0 | 0.166667 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.090909 | 0.0 | 0.0 | 0.105263 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.05 |
| 265684 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.125 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.052632 |
| 212839 | 0.0 | 0.0 | 0.090909 | 0.0 | 0.375 | 0.0 | 0.0 | 0.3 | 0.0 | 0.090909 | ... | 0.0 | 0.153846 | 0.125 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.0 | 0.0 | 0.0 |
| 104006 | 0.0 | 0.0 | 0.181818 | 0.0 | 0.090909 | 0.0 | 0.0 | 0.166667 | 0.1 | 0.083333 | ... | 0.0 | 0.142857 | 0.0 | 0.0 | 0.0 | 0.083333 | 0.0 | 0.0 | 0.0 | 0.0 |


## Check this with Werewolf: 

In [None]:
# Find the values for a specific game 
werewolf_series = jaccard_similarity_df.loc[925]

# Sort these values from highest to lowest
ordered_similarities = werewolf_series.sort_values(ascending=False)[:10]

# Print the results
print(ordered_similarities)

BGGId
925       1.000000
63539     0.769231
67148     0.769231
56885     0.692308
140457    0.692308
168680    0.692308
25821     0.692308
144464    0.642857
166019    0.642857
255293    0.642857
Name: 925, dtype: float64


Using both gives us:
- [Lupus in Tabula](https://boardgamegeek.com/boardgame/63539/lupus-in-tabula)
- [Ultimate Werewolf: Compact Edition](https://boardgamegeek.com/boardgame/67148/ultimate-werewolf-compact-edition)
- [Werewolves of Miller's Hollow](https://boardgamegeek.com/boardgame/56885/the-werewolves-of-millers-hollow-the-village)

All of these are social deduction games, specifically themed around werewolves!

## Check this with Hnefatafl:

In [66]:
# Find the values for a specific game 
hnefatafl_series = jaccard_similarity_df.loc[2932]

# Sort these values from highest to lowest
ordered_similarities = hnefatafl_series.sort_values(ascending=False)[:10]

# Print the results
print(ordered_similarities)

BGGId
2932      1.000000
26952     0.500000
1960      0.500000
315       0.500000
10213     0.428571
25727     0.400000
272380    0.400000
1337      0.400000
2211      0.400000
2065      0.400000
Name: 2932, dtype: float64


Here we get: 
- [International Checkers](https://boardgamegeek.com/boardgame/26952/international-checkers)
- [Last Word](https://boardgamegeek.com/boardgame/1960/last-word)
- [Bagh Chal](https://boardgamegeek.com/boardgame/315/bagh-chal)
- [Fox and Geese](https://boardgamegeek.com/boardgame/10213/fox-and-geese)

All of these are tile-based games about capturing pieces. Bagh Chal and Fox and Geese are the most thematically relevant!
