# Meeple Matchmaker

## The Data 

[BoardGameGeek](https://boardgamegeek.com/) (BGG) is a game database with over 125,600 different tabletop games, including European-style board games, wargames, and card games. In addition to the game database, the site allows users to rate games on a 1–10 scale and publishes a ranked list of board games, as rated by the users. 

The dataset being used for this project is from [kaggle](https://www.kaggle.com/datasets/threnjen/board-games-database-from-boardgamegeek), sourced from the BGG API. 

As part of this project, I'm using the 
- games
- themes
- mechanics 
- and user ratings

in order to make predictions! 

# Python Imports

In [285]:
import pandas as pd
import numpy as np
from scipy import stats
#Similarity Scoring
from sklearn.metrics import jaccard_score
from scipy.spatial.distance import pdist, squareform
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error
from math import sqrt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px

In [2]:
!pip install surprise

UnboundLocalError: cannot access local variable 'child' where it is not associated with a value

In [2]:
from surprise import SVD, accuracy
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate
from surprise.model_selection.split import train_test_split
from collections import defaultdict

ModuleNotFoundError: No module named 'surprise'

In [195]:
#define a common color pallete for all graphs
color_pallete = px.colors.qualitative.Pastel
color_pallete_continuous = color_pallete[0:2]

# EDA & Data Preparation 

## Boardgames

In [196]:
boardgames_df = pd.read_csv('data/games.csv')

There are a lot of games! However, some are VERY old. 

While it's very cool to look at how long humans have been making board games (and how someone has mislabled 'Dog-opoly' as having been published in 0 BC) we are looking to show users modern games. To start, we will only look at games published after 1900*.

[Settlers of Catan (1995)] is often credited as popularizing board games in the United States... but is that true? 

**This includes modern publishings of ancient games*

In [197]:
modern_boardgames_df = boardgames_df.loc[boardgames_df['YearPublished']>=1900]

In [198]:
fig = px.histogram(modern_boardgames_df, x= 'YearPublished', color_discrete_sequence = color_pallete)

fig.add_annotation(x=1995, y=252,
            text="Settlers of Catan Released",
            showarrow=True,
            arrowhead=1)

fig.update_layout(
    xaxis_title_text='Year', # xaxis label
    yaxis_title_text='Number of Games Published', # yaxis label
)

fig.show()

That is a pretty conclusive spike! 

As there are so few games published before 1960, we will use that as our cutoff for a 'modern' game. 

In [199]:
modern_boardgames_df = boardgames_df.loc[boardgames_df['YearPublished']>=1960]

There are also a lot of columns **we don't want**, such as BayesAvgRating. Let's drop them. 

In [200]:
boardgames_df = boardgames_df.drop(columns=['Rank:boardgame', 'Rank:strategygames', 'Rank:abstracts', 'Rank:familygames', 'Rank:thematic', 'Rank:cgs', 'Rank:wargames', 'Rank:partygames', 'Rank:childrensgames',
                                            'Cat:Thematic', 'Cat:Strategy', 'Cat:War', 'Cat:Family', 'Cat:CGS', 'Cat:Abstract', 'Cat:Party', 'Cat:Childrens'])
#remove reimplementations
boardgames_df = boardgames_df.loc[boardgames_df['IsReimplementation'] == 0]
boardgames_df = boardgames_df.drop(columns=['NumImplementations', 'IsReimplementation', 'NumAlternates', 'NumExpansions', 'NumComments', 'LanguageEase', 'NumWish'])

BGG also provides an 'adjusted' rating based on the number of ratings a board game has overall. You can read more [here](https://boardgamegeek.com/wiki/page/ratings)

This rating causes games with *few votes* but very high ratings to rank lower than games with *many more votes* but a lower Average Rating. 

It also pushes the scores overall closer to the average rating of all games on the site, around 5.5-6 ("Ok game, some fun or challenge at least, will play sporadically if in the right mood.") using dummy variables : with no way to verify these, I will be using average rating instead. 

In [201]:
modern_boardgames_df = modern_boardgames_df.drop(columns=['BayesAvgRating', 'StdDev'])

In [202]:
boardgames_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19365 entries, 0 to 21924
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   BGGId           19365 non-null  int64  
 1   Name            19365 non-null  object 
 2   Description     19364 non-null  object 
 3   YearPublished   19365 non-null  int64  
 4   GameWeight      19365 non-null  float64
 5   AvgRating       19365 non-null  float64
 6   BayesAvgRating  19365 non-null  float64
 7   StdDev          19365 non-null  float64
 8   MinPlayers      19365 non-null  int64  
 9   MaxPlayers      19365 non-null  int64  
 10  ComAgeRec       14356 non-null  float64
 11  BestPlayers     19365 non-null  int64  
 12  GoodPlayers     19365 non-null  object 
 13  NumOwned        19365 non-null  int64  
 14  NumWant         19365 non-null  int64  
 15  NumWeightVotes  19365 non-null  int64  
 16  MfgPlaytime     19365 non-null  int64  
 17  ComMinPlaytime  19365 non-null  int6

In [203]:
modern_boardgames_df.describe()

Unnamed: 0,BGGId,YearPublished,GameWeight,AvgRating,MinPlayers,MaxPlayers,ComAgeRec,LanguageEase,BestPlayers,NumOwned,...,Rank:partygames,Rank:childrensgames,Cat:Thematic,Cat:Strategy,Cat:War,Cat:Family,Cat:CGS,Cat:Abstract,Cat:Party,Cat:Childrens
count,21461.0,21461.0,21461.0,21461.0,21461.0,21461.0,16044.0,15700.0,21461.0,21461.0,...,21461.0,21461.0,21461.0,21461.0,21461.0,21461.0,21461.0,21461.0,21461.0,21461.0
mean,119377.349052,2007.498719,1.988929,6.438295,2.005918,5.697265,10.051841,217.36664,0.311309,1467.22902,...,21300.718792,21099.255953,0.056847,0.107591,0.163646,0.104562,0.014119,0.047295,0.028936,0.038442
std,104690.383864,12.401029,0.849512,0.922438,0.692263,15.095857,3.258665,235.622276,1.065583,5291.142784,...,3622.467691,4135.19854,0.231556,0.30987,0.369962,0.305995,0.117983,0.212274,0.167631,0.192265
min,1.0,1960.0,0.0,1.04133,0.0,0.0,2.0,1.0,0.0,2.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,12902.0,2002.0,1.3333,5.85417,2.0,4.0,8.0,26.0,0.0,152.0,...,21926.0,21926.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,112373.0,2012.0,2.0,6.46591,2.0,4.0,10.0,141.0,0.0,325.0,...,21926.0,21926.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,207814.0,2017.0,2.5385,7.06179,2.0,6.0,12.0,352.90625,0.0,908.0,...,21926.0,21926.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,349131.0,2021.0,5.0,9.59524,10.0,999.0,21.0,1757.0,15.0,166497.0,...,21926.0,21926.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


##### Rounding the Ratings and Complexity for Visuals

Having average ratings is great, but not for quickly visualizing how 'good' a game is on a graph. It's great to know one game is reviewed 7.61428 and the other is 7.45601, but when the main point is to compare average rating, they should both be a 7.5. 

In [204]:
# Function to round to nearest whole number or .5
def round_to_whole_or_pfive(x):
    return round(x * 2) / 2

In [205]:
modern_boardgames_df['Rounded_Rating'] = modern_boardgames_df['AvgRating'].map(round_to_whole_or_pfive)
modern_boardgames_df['Rounded_Weight'] = modern_boardgames_df['GameWeight'].map(round_to_whole_or_pfive)

### Examining Popularity 

Board Game Geek Uses a 1-10 Rating system with the following values: 

- 10 - Outstanding. Always want to play and expect this will never change.

- 9 - Excellent game. Always want to play it.

- 8 - Very good game. I like to play. Probably I'll suggest it and will never turn down a game.

- 7 - Good game, usually willing to play.

- 6 - Ok game, some fun or challenge at least, will play sporadically if in the right mood.

- 5 - Average game, slightly boring, take it or leave it.

- 4 - Not so good, it doesn't get me but could be talked into it on occasion.

- 3 - Likely won't play this again although could be convinced. Bad.

- 2 - Extremely annoying game, won't play this ever again.

- 1 - Defies description of a game. You won't catch me dead playing this. Clearly broken.


Only games that have at least 30 User Ratings are eligible to join the site Ranking for top games.


#### Most-Reviewed Games

There is a big range of number of reviews! The most reviewed games are some of the most popular games on the market. 

At the lower end of the spectrum, a game needs 30 user ratings to be Ranked on BoardGameGeek, so it is not suprising that that is our minimum. 

In [206]:
most_reviews = modern_boardgames_df.sort_values(by= 'NumUserRatings',ascending=False)[:25]

In [207]:
names = most_reviews['Name']

fig = px.histogram(most_reviews, x= 'Name', y= 'NumUserRatings', color='Rounded_Rating', color_discrete_sequence = color_pallete,     category_orders = {'Name': names}
)
fig.update_layout(
    xaxis_title_text='Game', # xaxis label
    yaxis_title_text='Number of Reviews', # yaxis label
)

fig.show()

The games that have the most reviews are not always the most popular pick, but all rate above a 7! 

That brings us to the next question: what is the 'normal' number of reviews? 

In [208]:
modern_boardgames_df['NumUserRatings'].describe()

count     21461.000000
mean        859.055216
std        3641.646430
min          30.000000
25%          56.000000
50%         123.000000
75%         397.000000
max      108101.000000
Name: NumUserRatings, dtype: float64

#### Most-Popular Games

Since we want the cream of the crop, let's examine most popular only within the 75th percentile. 

In [209]:
top_percentile = modern_boardgames_df.loc[modern_boardgames_df['NumUserRatings'] >= 397]
most_popular = top_percentile.sort_values(by= 'AvgRating',ascending=False)[:25]

In [210]:
names = most_popular['Name']

fig = px.histogram(most_popular, x= 'Name', y= 'AvgRating', color='Rounded_Rating', color_discrete_sequence = color_pallete, category_orders = {'Name': names})
fig.update_layout(
    xaxis_title_text='Game', # xaxis label
    yaxis_title_text='Average Rating', # yaxis label
)

fig.show()

### Examining Complexity 

Weight is a personal opinion expressing how difficult the game is to play - "Weight" is not actually defined by BGG so different people have different ideas of what it means. The general consensus is that lightweight games are easier to learn, while heavyweight games are more difficult to learn. 

There are only 13 unrated games, so I've made the decision to drop them. 

In [211]:
modern_boardgames_df = modern_boardgames_df.loc[modern_boardgames_df['GameWeight']>0]

In [212]:
fig = px.scatter(modern_boardgames_df, x= 'GameWeight', y = 'AvgRating', color='AvgRating', color_continuous_scale = color_pallete_continuous)

fig.update_layout(
    xaxis_title_text='Average Game Weight', # xaxis label
    yaxis_title_text='Average Rating', # yaxis label
)

fig.show()

There are good games in each game weight, however we can see that the less 'weighty' games have a lower values, while no game above 3 is rated below a 5. 

In [213]:
fig = px.histogram(modern_boardgames_df, x= 'GameWeight', color_discrete_sequence = color_pallete)

fig.update_layout(
    xaxis_title_text='Average Game Weight', # xaxis label
    yaxis_title_text='Number of Games', # yaxis label
)

fig.show()

In [214]:
modern_boardgames_df[['Name', 'GameWeight', 'NumWeightVotes']].sample(10, random_state=2)

Unnamed: 0,Name,GameWeight,NumWeightVotes
20544,NICE TRY: The Challenge Party Game,1.0,2
247,Bazaar,2.0352,142
11899,Flames of War: Open Fire!,3.75,4
13352,Agents Secrets,2.0,1
19589,WTF?,1.3333,3
8068,Chaos,1.75,4
7587,Venedig,2.0769,39
18388,Last Stand,1.75,4
21607,Death Valley,1.875,8
12028,Vem aí a Troika,1.9167,12


This can partially be explained by how the majority of games are less than 3 complexity. You can also see that there are wayyyyyyy fewer votes for weight, so it is more common for the weights to be a whole number. 

## Useful Functions

### Game Lookup

In [215]:
def get_game_from_ID(game_id):
    return modern_boardgames_df.loc[modern_boardgames_df['BGGId']== game_id]

    
def get_game_from_Name(game_name):
    return modern_boardgames_df.loc[modern_boardgames_df['Name'] == game_name]

### Keeping a list of our games. 

We've narrowed down our games so we can now make a list of the 'keepers'. This is super helpful for getting rid of excess data, as we want to filter our other files to only include our modern game list. 

In [216]:
boardgame_list = modern_boardgames_df['BGGId']

### Reducing the Size of One-Hot Encoded Matrixes 

(and adding them to our main dataframe as text. )


In [217]:
# Generalized function to categorize games based on majority of either mechanics or themes
def categorize_game_based_on_majority(data_df, categories):
    """
    Categorize games based on the majority of either mechanics or themes.
    
    Parameters:
    - data_df: DataFrame with one-hot encoded columns, rows are games, columns are mechanics or themes.
    - categories: Dictionary where keys are categories (e.g., 'Adventure', 'Dice Rolling'), and values are lists of associated mechanics or themes.
    
    Returns:
    - pd.Series: A pandas Series with the primary category assigned to each game.
    """
    category_assignments = []
    
    for game in data_df.index:
        # Get the list of mechanics or themes for this game (assumed to be a series of 0s and 1s)
        game_items = data_df.loc[game]
        
        # Initialize a dictionary to count how many items belong to each category
        category_count = {category: 0 for category in categories.keys()}
        
        # Count how many items for each category are present for this game
        for category, items in categories.items():
            for item in items:
                if item in data_df.columns:  # Check if item exists in the DataFrame columns
                    # If the item exists and is present in the game (i.e., value is 1), count it
                    if game_items.get(item, 0) == 1:
                        category_count[category] += 1
        
        # Assign the category with the majority of items
        majority_category = max(category_count, key=category_count.get)
        category_assignments.append(majority_category)
    
    return pd.Series(category_assignments, index=data_df.index)

In [218]:
def reduce_ohe_by_grouping(data_df, categories):
    """
    Reduces the size of a one-hot encoded DataFrame by grouping similar things into categories.
    
    Parameters:
    - data_df: DataFrame with one-hot encoded columns where each row represents a game.
    - category_mapping: Dictionary where keys are new category names and values are lists of mechanics/themes to be grouped.
    
    Returns:
    - pd.DataFrame: A DataFrame with grouped categories as columns.
    """
    # Initialize a new DataFrame to store the reduced categories
    reduced_df = data_df[['BGGId']].copy()  # Copy BGGId column to the reduced DataFrame    
    # Iterate through each category in the category_mapping
    for category, mechanics in categories.items():
        # Check if all mechanics in the group exist in the DataFrame columns
        missing_columns = [mechanic for mechanic in mechanics if mechanic not in data_df.columns]
           
        # Check if any mechanic in the group is active (1) for the games
        active_mechanics = [mechanic for mechanic in mechanics if mechanic in data_df.columns]
        
        # Create the new category column by checking if any mechanic is active
        reduced_df[category] = data_df[active_mechanics].any(axis=1).astype(int)
    
    # Return the DataFrame with the reduced categories
    return reduced_df

### Un hot Encode

In [219]:
def un_hot_encode(row, one_hot_df):
    # get the BGGId from the row
    game_bggid = row['BGGId']
    
    # check if the BGGId exists in the one-hot encoded DataFrame index
    if game_bggid in one_hot_df.index:
        # retrieve the row corresponding to the BGGId from the one-hot encoded DataFrame
        thing_row = one_hot_df.loc[game_bggid]
        
        # get the column names (mechanics) where the value is 1 (indicating the game has that mechanic)
        active_thing = thing_row[thing_row == 1].index.tolist()
        
        # return the mechanics as a comma-separated string
        return ', '.join(active_thing)
    else:
        # If BGGId not found in the one-hot encoded DataFrame, return an empty string to handle gracefully :> 
        return ''

## Themes

In [220]:
game_themes_df = pd.read_csv('data/themes.csv')

Only keep games in our game list. 

In [221]:
game_themes_df = game_themes_df[game_themes_df['BGGId'].isin(boardgame_list)]

In [222]:
game_themes_df

Unnamed: 0,BGGId,Adventure,Fantasy,Fighting,Environmental,Medical,Economic,Industry / Manufacturing,Transportation,Science Fiction,...,Theme_Fashion,Theme_Geocaching,Theme_Ecology,Theme_Chernobyl,Theme_Photography,Theme_French Foreign Legion,Theme_Cruise ships,Theme_Apache Tribes,Theme_Rivers,Theme_Flags identification
0,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21918,346703,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21919,346965,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21921,347521,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21922,348955,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Thinning our themes

Since there are 218 columns, it's a little hard to see how many themes each game has. 

In [223]:
themes_series = pd.Series()
for col in game_themes_df.columns:
    if col != 'BGGId':
        themes_series[col] = sum(game_themes_df[col])

In [224]:
themes_series = themes_series.sort_values(ascending=False)
fig = px.histogram(themes_series, x= themes_series.index, y=themes_series, color_discrete_sequence = color_pallete)

fig.update_layout(
    xaxis_title_text='Theme', # xaxis label
    yaxis_title_text='Number of Games', # yaxis label
)

fig.show()

Many of these themes have VERY FEW games assosiated with them! 

Looking at the most common themes may help us determine how to reduce the width.

In [225]:
themes_popular = themes_series[:25]

fig = px.histogram(themes_popular, x= themes_popular.index, y=themes_popular, color_discrete_sequence = color_pallete)

fig.update_layout(
    xaxis_title_text='Top Themes', # xaxis label
    yaxis_title_text='Number of Games', # yaxis label
)

fig.show()

...Yeah! All of these are pretty important, and can be generalized into groups for all those pesky tiny themes. 

In [226]:
game_themes_df

Unnamed: 0,BGGId,Adventure,Fantasy,Fighting,Environmental,Medical,Economic,Industry / Manufacturing,Transportation,Science Fiction,...,Theme_Fashion,Theme_Geocaching,Theme_Ecology,Theme_Chernobyl,Theme_Photography,Theme_French Foreign Legion,Theme_Cruise ships,Theme_Apache Tribes,Theme_Rivers,Theme_Flags identification
0,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21918,346703,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21919,346965,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21921,347521,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21922,348955,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [227]:
themes = {
    # Adventure
    "Adventure": [
        "Adventure", "Theme_Exploration", "Theme_Treasure", "Theme_Camping", "Theme_Hiking", "Theme_Survival", "Theme_Tropical Islands"
    ],

    # Fantasy
    "Fantasy": [
        "Fantasy", "Theme_Fantasy", "Theme_Vikings", "Theme_Magic", "Theme_Elves", "Theme_Witches", "Theme_Mythology","Theme_Alchemy", "Mythology", "Theme_Gods", "Theme_Myth", "Theme_Mythical Creatures", "Theme_Magic"

    ],

    # Novels
    "Books / Libraries": [
        "Novel-based", "Theme_School / College / University"
    ],

    # Fighting
    "Fighting / Combat": [
        "Fighting", "Theme_Fighting", "Theme_Karate", "Theme_Boxing", "Theme_Combat", "Theme_Samurai", "Theme_Battles"
    ],

    # Environmental
    "Environmental": [
        "Environmental", "Theme_Ecology", "Theme_Nature", "Theme_Trees and Forests", "Theme_Gardening", "Theme_Farming", "Theme_Survival", "Theme_Rivers", "Theme_Hike-Hiking", "Theme_Mushrooms", "Theme_Earthquakes"
    ],

    # Medical
    "Medical / Science": [
        "Medical", "Theme_Healthcare", "Theme_Doctors", "Theme_Nurses", "Theme_Hospitals", "Theme_Mad Science / Mad Scientist", "Theme_Experiment", "Theme_Lab", "Theme_Chemistry", 

    ],

    # Economic
    "Economic": [
        "Economic", "Theme_Money", "Theme_Economy", "Theme_Banking", "Theme_Investment"
    ],

    # Industry / Manufacturing
    "Industry / Manufacturing": [
        "Industry / Manufacturing", "Theme_Industry", "Theme_Manufacturing", "Theme_Construction", "Theme_Oil / Gas / Petroleum", "Theme_Mining", "Theme_Resources", "Theme_Digging", "Theme_Mines",
                "Theme_Construction", "Theme_Building", "Theme_Engineers", "Theme_Architecture"


    ],

    # Transportation
    "Transportation": [
        "Transportation", "Theme_Trains", "Theme_Aviation / Flight", "Theme_Airships / Blimps / Dirigibles", "Theme_Cars", "Theme_Traffic"
    ],

    # Science Fiction
    "Science Fiction": [
        "Science Fiction", "Space Exploration", "Theme_Futuristic", "Theme_Robots", "Theme_Astronomy", "Theme_Cyberpunk", "Theme_Sci-Fi Sports", 
        "Theme_Cyberpunk", "Theme_Urban Future", "Theme_Hackers", "Theme_Artificial Intelligence",
        "Theme_Robots", "Theme_Mech", "Theme_Time Travel", "Theme_Time", "Theme_Future"
    ],

    # Civilization
    "Civilization": [
        "Civilization", "Theme_Civilization", "Theme_Age of Reason", "Theme_Ancient", "Theme_Medieval", "Theme_Renaissance", "Theme_Gladiators", "Theme_Coliseum", "Theme_Arena", "Theme_Battles"

    ],

    # History
    "History": [
        "History", "Theme_History", "Theme_Political"
    ],

    # Movies / TV / Radio Theme
    "Movies / TV / Radio": [
        "Movies / TV / Radio theme", "Theme_Movies", "Theme_TV", "Theme_Film", "Theme_Radio", "Theme_Actors", "Theme_Directors"
    ],

    # Renaissance
    "Art / Photography": [
        "Renaissance", "Theme_Renaissance", "Theme_Art", "Theme_Gardening", "Theme_Painting / Paintings", "Theme_Art", "Theme_Painting / Paintings", 
        "Theme_Photos", "Theme_Photography", "Theme_Sewing / Knitting / Cloth-Making", "Theme_Fashion", "Theme_Jewelry", "Theme_Design", "Theme_Architecture", "Theme_Crafting"
    ],

    # American West
    "American West": [
        "American West", "Theme_Wild West", "Theme_Cowboys", "Theme_Trains", "Theme_Indians"
    ],

    # Animals
    "Animals": [
        "Animals", "Theme_Zoology", "Theme_Wildlife", "Theme_Safari", "Theme_Farming"
    ],

    #Warfare
    "War": [
        "Modern Warfare", "Theme_War", "Theme_Soldiers", "Theme_Battle", "Theme_Tactics", "Theme_World War II", "Theme_World War I", "Theme_American Revolutionary War", "Theme_Vietnam War", "Civil War", "Post-Napoleonic"
    ],

    # Medieval
    "Medieval": [
        "Medieval", "Theme_King Arthur / The Knights of the Round Table / Camelot", "Theme_Medieval Europe", "Theme_Samurai", "Theme_Chivalry", "Pike and Shot"
    ],

    # Nautical
    "Nautical": [
        "Nautical", "Theme_Pirates", "Theme_Ships", "Theme_Seafaring", "Theme_Exploration", "Theme_Submarines"
    ],

    # Horror
    "Horror": [
        "Horror", "Theme_Zombies", "Theme_Ghosts", "Theme_Vampires", "Theme_Spooky Old Houses", "Theme_Murder/Mystery",  "Theme_Zombie Apocalypse", "Theme_Undead", "Theme_Resurrection", "Theme_Dreams / Nightmares"
    ],

    # Farming
    "Farming": [
        "Farming", "Theme_Agriculture", "Theme_Crops", "Theme_Fruit", "Theme_Vegetables"
    ],

    # Religious
    "Religious": [
        "Religious", "Theme_Spiritual", "Theme_Church", "Theme_Paganism"
    ],

    # Travel
    "Travel": [
        "Travel", "Theme_Exploration", "Theme_Tropical Islands", "Theme_Deserts", "Theme_Safaris", "Theme_Adventure"
    ],

    # Murder / Mystery
    "Murder / Mystery": [
        "Murder/Mystery", "Theme_Mystery / Crime", "Theme_Investigation", "Theme_Sleuth", "Theme_Detectives"
    ],

    # Pirates
    "Pirates": [
        "Pirates", "Theme_Pirates", "Theme_Captain", "Theme_Ships", "Theme_Treasure"
    ],

    # Comic Book / Strip
    "Comic Book / Strip": [
        "Comic Book / Strip", "Theme_Comics", "Theme_Superheroes", "Theme_Villains", "Theme_Villainy", "Theme_Evil", "Theme_Supervillains", "Theme_Villains"
    ],

    # Mature / Adult
    "Mature / Adult": [
        "Mature / Adult", "Theme_Erotica", "Theme_Adult", "Theme_Mature", "Strip,Mature"
    ],

    # Video Game Theme
    "Video Games": [
        "Video Game Theme", "Theme_Video Game Theme: Nintendo", "Theme_Video Game Theme: Pokémon", "Theme_Video Game Theme: Super Mario Bros.",
        "Theme_Video Game Theme: Minecraft", "Theme_Video Game Theme: Resident Evil", "Theme_Video Game Theme: Tetris", "Theme_Video Game Theme: SEGA", "Theme_Teaching Programming", "Theme_Video Game Theme: The Oregon Trail",
        "Theme_Video Game Theme: Carmen Sandiego", "Theme_Video Game Theme: Tetris", "Theme_Video Game Theme: Doo"
    ],

    # Spies / Secret Agents
    "Spies / Secret Agents": [
        "Spies/Secret Agents", "Theme_Spies", "Theme_CIA", "Theme_Surveillance"
    ],

    # Arabian
    "Arabian": [
        "Arabian", "Theme_Arabian Nights", "Theme_Deserts", "Theme_Sand", "Theme_Oasis"
    ],

    # Prehistoric
    "Prehistoric": [
        "Prehistoric", "Theme_Dinosaurs", "Theme_Cavemen", "Theme_Stone Age", "Theme_Prehistoric"
    ],

    # Trains
    "Trains": [
        "Trains", "Theme_Railroad", "Theme_Train", "Theme_Steam Engine"
    ],

    # Aviation / Flight
    "Aviation / Flight": [
        "Aviation / Flight", "Theme_Airplanes", "Theme_Pilots", "Theme_Airports", "Theme_Aviation"
    ],

    # Racing
    "Racing": [
        "Racing", "Theme_Racing", "Theme_Race Cars", "Theme_Formula 1", "Theme_Motorcycles"
    ],

    # Humor
    "Humor": [
        "Humor", "Theme_Satire", "Theme_Comedy", "Theme_Funny", "Theme_Laughter"
    ],

    # Sports
    "Sports": [
        "Sports", "Theme_Basketball", "Theme_Football", "Theme_Soccer", "Theme_Tennis", "Theme_Golf", "Theme_FIFA World Cup"
    ],

    # Mafia
    "Mafia": [
        "Mafia", "Theme_Gangsters", "Theme_Mob", "Theme_Organized Crime"
    ],

    # Political
    "Political": [
        "Political", "Theme_Politics", "Theme_Elections", "Theme_Campaigns", "Theme_Party Politics", "Theme_Latin American Political Games", "Theme_Spanish Political Games"
    ],

    # Math
    "Math": [
        "Math", "Theme_Puzzle", "Theme_Numbers", "Theme_Mathematics", "Theme_Geometry"
    ],

    # Trivia
    "Trivia": [
        "Trivia", "Theme_Trivia", "Theme_Questions", "Theme_Puzzle", "Theme_Quiz"
    ],

    # Food / Cooking
    "Food / Cooking": [
        "Theme_Food / Cooking", "Theme_Cooking", "Theme_Restaurants", "Theme_Chef", "Theme_Baking"
    ],

    # Retro
    "Retro": [
        "Theme_Retro", "Theme_Nostalgia", "Theme_Classic", "Theme_80s", "Theme_90s"
    ],


    # Archaeology / Paleontology
    "Archaeology / Paleontology": [
        "Theme_Archaeology / Paleontology", "Theme_Dinosaurs", "Theme_Treasure Hunt"
    ],

    # Circus
    "Circus": [
        "Theme_Circus", "Theme_Performers", "Theme_Ringmaster", "Theme_Clowns"
    ],

    # Weather
    "Weather": [
        "Theme_Weather", "Theme_Storms", "Theme_Tornado", "Theme_Hurricanes", "Theme_Flooding"
    ]
}


Now all we have to do is group. 

In [228]:
# Reduce the matrix by grouping mechanics
themes_reduced = reduce_ohe_by_grouping(game_themes_df, themes)

# Show the reduced DataFrame
themes_reduced.head()

Unnamed: 0,BGGId,Adventure,Fantasy,Books / Libraries,Fighting / Combat,Environmental,Medical / Science,Economic,Industry / Manufacturing,Transportation,...,Sports,Mafia,Political,Math,Trivia,Food / Cooking,Retro,Archaeology / Paleontology,Circus,Weather
0,1,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,0
1,2,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [229]:
themes_reduced = themes_reduced[:25]

fig = px.histogram(themes_popular, x= themes_popular.index, y=themes_popular, color_discrete_sequence = color_pallete)

fig.update_layout(
    xaxis_title_text='Top Themes', # xaxis label
    yaxis_title_text='Number of Games', # yaxis label
)

fig.show()

45 is a lot less than 218!

Just in case, we will also grab the themes as a string. 

In [232]:
game_themes_df['themes'] = themes_reduced.drop(columns='BGGId').apply(un_hot_encode, axis=1)
#also add them to our board game export 
game_themes_df

TypeError: un_hot_encode() missing 1 required positional argument: 'one_hot_df'

### Primary Themes

In [231]:
# Get the category assignment for each game
theme_series = categorize_game_based_on_majority(game_themes_df, themes)
modern_boardgames_df["Primary_Theme"] = theme_series
modern_boardgames_df[:5]

Unnamed: 0,BGGId,Name,Description,YearPublished,GameWeight,AvgRating,MinPlayers,MaxPlayers,ComAgeRec,LanguageEase,...,Cat:Strategy,Cat:War,Cat:Family,Cat:CGS,Cat:Abstract,Cat:Party,Cat:Childrens,Rounded_Rating,Rounded_Weight,Primary_Theme
0,1,Die Macher,die macher game seven sequential political rac...,1986,4.3206,7.61428,3,5,14.366667,1.395833,...,1,0,0,0,0,0,0,7.5,4.5,Economic
1,2,Dragonmaster,dragonmaster tricktaking card game base old ga...,1981,1.963,6.64537,3,4,,27.0,...,1,0,0,0,0,0,0,6.5,2.0,Fantasy
2,3,Samurai,samurai set medieval japan player compete gain...,1998,2.4859,7.45601,2,4,9.307692,1.0,...,1,0,0,0,0,0,0,7.5,2.5,Medieval
3,4,Tal der Könige,triangular box luxurious large block tal der k...,1992,2.6667,6.60006,2,4,13.0,256.0,...,0,0,0,0,0,0,0,6.5,2.5,Adventure
4,5,Acquire,acquire player strategically invest business t...,1964,2.5031,7.33861,2,6,11.410256,21.152941,...,1,0,0,0,0,0,0,7.5,2.5,Economic


In [233]:
# Apply the un_hot_encode function to each row in modern_boardgames_df
modern_boardgames_df["All_Themes"] = modern_boardgames_df.apply(lambda row: un_hot_encode(row, themes_reduced), axis=1)
modern_boardgames_df["All_Themes"]

0                            Fantasy
1        Fighting / Combat, Medieval
2                                   
3                           Economic
4             Civilization, Nautical
                    ...             
21918                               
21919                               
21921                               
21922                               
21923                               
Name: All_Themes, Length: 20971, dtype: object

## Mechanics 

In [234]:
game_mechanics_df = pd.read_csv('data/mechanics.csv')

Only keep games in our game list. 

In [235]:
game_mechanics_df = game_mechanics_df[game_mechanics_df['BGGId'].isin(boardgame_list)]

In [236]:
game_mechanics_df.head()

Unnamed: 0,BGGId,Alliances,Area Majority / Influence,Auction/Bidding,Dice Rolling,Hand Management,Simultaneous Action Selection,Trick-taking,Hexagon Grid,Once-Per-Game Abilities,...,Contracts,Passed Action Token,King of the Hill,Action Retrieval,Force Commitment,Rondel,Automatic Resource Growth,Legacy Game,Dexterity,Physical
0,1,1,1,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,0,1,0,0,1,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
3,4,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Thinning the Mechanics

158 mechanics is also a lot of mechanics. Let's reduce.

In [237]:
mechanics_series = pd.Series()
for col in game_mechanics_df.columns:
    if col != 'BGGId':
        mechanics_series[col] = sum(game_mechanics_df[col])

In [238]:
mechanics_series = mechanics_series.sort_values(ascending=False)
fig = px.histogram(mechanics_series, x= mechanics_series.index, y=mechanics_series, color_discrete_sequence = color_pallete)

fig.update_layout(
    xaxis_title_text='Mechanic Featured', # xaxis label
    yaxis_title_text='Number of Games', # yaxis label
)

fig.show()

With so many mechanics, it's difficult to see popularity. Lets look at the top picks: 

In [239]:
mechanics_popular = mechanics_series[:35]
fig = px.histogram(mechanics_popular, x= mechanics_popular.index, y=mechanics_popular, color_discrete_sequence = color_pallete)

fig.update_layout(
    xaxis_title_text='Mechanic Featured', # xaxis label
    yaxis_title_text='Number of Games', # yaxis label
)

fig.show()

Now lets reduce this into something more manageable. 

In [240]:
mechanics = {
    "Dice Rolling / Randomness": [
        "Dice Rolling",
        "Roll / Spin and Move",
        "Re-rolling and Locking",
        "Random Production",
        "Different Dice Movement",
        "Critical Hits and Failures",
        "Die Icon Resolution"
    ],
    
    "Movement Mechanics": [
        "Movement Points",
        "Programmed Movement",
        "Grid Movement",
        "Measurement Movement",
        "Area Movement",
        "Track Movement",
        "Point to Point Movement",
        "Three Dimensional Movement",
        "Movement Template",
        "Ladder Climbing",
        "Pattern Movement",
        "Chaining",
        "Relative Movement",
        "Static Capture",
        "Impulse Movement",
        "Resource to Move",
        "Area-Impulse",
        "Movement Points",
        "Zone of Control",
        "Measurement Movement",
        "Push Your Luck",
        "Tug of War",
        "Secret Unit Deployment",
        "Multiple Maps",
        "Network and Route Building",
        "Race"
    ],
    "Set Collection": [
        "Set Collection",
        "Connections"
    ],
    
    "Social Interaction": [
        "Cooperative Game",
        "Negotiation",
        "Trading",
        "Player Elimination",
        "Role Playing",
        "Voting",
        "Bribery",
        "Bidding and Bluffing",
        "Take That",
        "Team-Based Game",
        "I Cut, You Choose",
        "Hidden Roles",
        "Traitor Game",
        "Prisoner's Dilemma",
        "Communication Limits",
        "Catch the Leader",
        "Roles with Asymmetric Information",
        "Social Game",
        "Player Judge",
        "Social Interaction", 
        "Storytelling",
        "Alliances",
        "Bias"

    ],
    
    "Deck/Card Mechanics": [
        "Deck Construction",
        "Card Play Conflict Resolution",
        "Drafting",
        "Tableau Building",
        "Deck, Bag, and Pool Building",
        "Move Through Deck",
        "Action Queue",
        "Command Cards",
        "Action/Event",
        "Card Play Conflict Resolution",
        "Pick-up and Deliver",
        "Campaign / Battle Card Driven",
        "Deck Construction",
        "Deck/Bag/Pools"
    ],
    
    "Board & Grid Mechanics": [
        "Hexagon Grid",
        "Square Grid",
        "Modular Board",
        "Map Addition",
        "Map Deformation",
        "Multiple Maps",
        "Map Reduction",
        "Tile Placement",
        "Layering",
        "Grid Coverage",
        "Pieces as Map",
        "Static Capture",
        "Connections",
        "Minimap Resolution",
        "Crayon Rail System",
        "Multiple Maps"
    ],
    
    "Economics & Resource Management": [
        "Investment",
        "Market",
        "Loans",
        "Stock Holding",
        "Victory Points as a Resource",
        "Commodity Speculation",
        "Income",
        "Enclosure",
        "Resource to Move",
        "Increase Value of Unchosen Resources",
        "Automatic Resource Growth",
        "Delayed Purchase",
        "Ratio / Combat Results Table",
        "Resource Management",
        "Economic Resource",
        "Tech Trees / Tech Tracks",
        "Predictive Bid",
        "Bribery",
        "Ownership"
    ],
    
    "Physical / Dexterity": [
        "Dexterity",
        "Physical Removal",
        "Mancala",
        "Physical",
        "Cube Tower",
        "Layering",
        "Real-Time"
    ],
    
    "Puzzle Solving / Memory": [
        "Pattern Building",
        "Pattern Recognition",
        "Memory",
        "Induction",
        "Matching",
        "Deduction",
        "Targeted Clues",
        "Critical Hits and Failures",
        "Bingo"
    ],
    
    "Action Mechanics": [
        "Action Points",
        "Action Drafting",
        "Simultaneous Action Selection",
        "Action Timer",
        "Action Retrieval",
        "Action/Event",
        "Action Queue",
        "I Cut, You Choose",
        "Stat Check Resolution",
        "Advantage Token",
        "Time Track",
        "Orders Counters",
        "Order Counters",
        "Force Commitment",
        "Rondel",
        "Speed Matching",
        "Once-Per-Game Abilities",
        "Simulation",
        "Area-Impulse"
    ],
    "Legacy Games": [
        "Scenario / Mission / Campaign Game",
        "Legacy Game"
    ],
    "Drawing": [
        "Paper-and-Pencil",
        "Crayon Rail System",
        "Line Drawing"
    ],
    "Special End Of Game Conditions" : [
        "Sudden Death Ending",
        "Highest-Lowest Scoring",
        "Score-and-Reset Game",
        "End Game Bonuses",
        "Finale Ending",
        "Single Loser Game",
        "King of the Hill"
    ],
    "Solo" : [
        "Solo / Solitaire Game"
    ],
    "Secrets / Deception": [
        "Hidden Roles", 
        "Hidden Victory Points", 
        "Hidden Movement", 
        "Secret Unit Deployment", 
        "Traitor Game",
        "Deduction"
    ]
}

In [241]:
# Reduce the matrix by grouping mechanics
mechanics_reduced_df = reduce_ohe_by_grouping(game_mechanics_df, mechanics)

# Show the reduced DataFrame
mechanics_reduced_df.head()

Unnamed: 0,BGGId,Dice Rolling / Randomness,Movement Mechanics,Set Collection,Social Interaction,Deck/Card Mechanics,Board & Grid Mechanics,Economics & Resource Management,Physical / Dexterity,Puzzle Solving / Memory,Action Mechanics,Legacy Games,Drawing,Special End Of Game Conditions,Solo,Secrets / Deception
0,1,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,3,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0
3,4,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0
4,5,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0


### Primary & Secondary Mechanics 

We want to keep our primary mechanic for future classification problems, but we will store the rest as features!

In [242]:
# Get the category assignment for each game
category_series = categorize_game_based_on_majority(game_mechanics_df, mechanics)
modern_boardgames_df["Primary_Mechanic"] = category_series
modern_boardgames_df[:5]

Unnamed: 0,BGGId,Name,Description,YearPublished,GameWeight,AvgRating,MinPlayers,MaxPlayers,ComAgeRec,LanguageEase,...,Cat:Family,Cat:CGS,Cat:Abstract,Cat:Party,Cat:Childrens,Rounded_Rating,Rounded_Weight,Primary_Theme,All_Themes,Primary_Mechanic
0,1,Die Macher,die macher game seven sequential political rac...,1986,4.3206,7.61428,3,5,14.366667,1.395833,...,0,0,0,0,0,7.5,4.5,Economic,Fantasy,Social Interaction
1,2,Dragonmaster,dragonmaster tricktaking card game base old ga...,1981,1.963,6.64537,3,4,,27.0,...,0,0,0,0,0,6.5,2.0,Fantasy,"Fighting / Combat, Medieval",Dice Rolling / Randomness
2,3,Samurai,samurai set medieval japan player compete gain...,1998,2.4859,7.45601,2,4,9.307692,1.0,...,0,0,0,0,0,7.5,2.5,Medieval,,Board & Grid Mechanics
3,4,Tal der Könige,triangular box luxurious large block tal der k...,1992,2.6667,6.60006,2,4,13.0,256.0,...,0,0,0,0,0,6.5,2.5,Adventure,Economic,Set Collection
4,5,Acquire,acquire player strategically invest business t...,1964,2.5031,7.33861,2,6,11.410256,21.152941,...,0,0,0,0,0,7.5,2.5,Economic,"Civilization, Nautical",Economics & Resource Management


In [243]:
# Apply the un_hot_encode function to each row in modern_boardgames_df
modern_boardgames_df["All_Mechanics"] = modern_boardgames_df.apply(lambda row: un_hot_encode(row, mechanics_reduced_df), axis=1)
modern_boardgames_df["All_Mechanics"]

0                                                         
1        Set Collection, Board & Grid Mechanics, Action...
2                         Set Collection, Action Mechanics
3        Board & Grid Mechanics, Economics & Resource M...
4                                Dice Rolling / Randomness
                               ...                        
21918                                                     
21919                                                     
21921                                                     
21922                                                     
21923                                                     
Name: All_Mechanics, Length: 20971, dtype: object

# Examining Correlations between game features 

# Users 

In [244]:
users_df = pd.read_csv('data/user_ratings.csv')

In [245]:
users_df

Unnamed: 0,BGGId,Rating,Username
0,213788,8.0,Tonydorrf
1,213788,8.0,tachyon14k
2,213788,8.0,Ungotter
3,213788,8.0,brainlocki3
4,213788,8.0,PPMP
...,...,...,...
18942210,165521,3.0,rseater
18942211,165521,3.0,Bluefox86
18942212,165521,3.0,serginator
18942213,193488,1.0,CaptainCattan


We also only want to include the board games ratings we have in our database. (Sorry to all the players of [The Royal Game of Ur](https://en.wikipedia.org/wiki/Royal_Game_of_Ur)) 

In [246]:
users_filtered_df = users_df[users_df['BGGId'].isin(modern_boardgames_df['BGGId'])]
users_filtered_df.sample(10)

Unnamed: 0,BGGId,Rating,Username
7029609,175117,6.5,Karadin42
7314515,62871,7.0,DGSonicO15
12918404,314040,8.0,StuRutten
3720216,74,4.0,stroutqb22
12628033,90137,10.0,Oigelb
17745044,870,10.0,mtnman69
14954830,247367,9.0,fmirza23
12002853,147020,10.0,sincerneret
14936300,205317,8.7,Synnelus
5936975,125618,7.0,Tabaraloro


In [247]:
users_filtered_df['BGGId'].isin(modern_boardgames_df['BGGId']).value_counts()

BGGId
True    18463833
Name: count, dtype: int64

In [248]:
modern_boardgames_df.loc[modern_boardgames_df['BGGId']==161882]

Unnamed: 0,BGGId,Name,Description,YearPublished,GameWeight,AvgRating,MinPlayers,MaxPlayers,ComAgeRec,LanguageEase,...,Cat:CGS,Cat:Abstract,Cat:Party,Cat:Childrens,Rounded_Rating,Rounded_Weight,Primary_Theme,All_Themes,Primary_Mechanic,All_Mechanics
13693,161882,Irish Gauge,irish gauge mdash title winsome games essen ...,2014,2.3667,7.33808,3,5,11.5,1.333333,...,0,0,0,0,7.5,2.5,Economic,,Economics & Resource Management,


We have a lot of users, so lets only pick users with over 10 reviews. 

In [249]:
# filter for users with more than x ratings
min_user_ratings = 10

rating_counter = users_filtered_df['Username'].value_counts()
user_mask = users_filtered_df['Username'].isin(rating_counter[rating_counter < min_user_ratings].index)

users_filtered_df.drop(index=users_filtered_df[user_mask].index, inplace=True)



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [257]:
users_filtered_df

Unnamed: 0,BGGId,Rating,Username
0,213788,8.0,Tonydorrf
1,213788,8.0,tachyon14k
2,213788,8.0,Ungotter
3,213788,8.0,brainlocki3
4,213788,8.0,PPMP
...,...,...,...
18942177,165522,4.0,vadgil
18942178,165522,3.0,rseater
18942179,165522,1.0,Bernytheking
18942213,193488,1.0,CaptainCattan


### Finding Favorites

In [258]:
def get_favorite_mechanic (userID, rating_threshold=8):
    """ 
    Given a user ID / Username, this function returns the most frequent mechanic 
    of the games that the user has rated above a certain rating threshold.
    """
     # Filter UserRatings for the given user and the games with a rating above the threshold
    user_ratings = users_filtered_df[users_filtered_df['Username'] == userID]
    rated_games = user_ratings[user_ratings['Rating'] >= rating_threshold]

    # Merge rated games with the Games DataFrame to get primary mechanics 
    rated_games_with_details = pd.merge(rated_games, modern_boardgames_df[['BGGId', 'Primary_Mechanic']], on='BGGId')

    favorite_mechanic = rated_games_with_details['Primary_Mechanic'].mode()[0] if not rated_games_with_details.empty else None
    return favorite_mechanic

In [259]:
def get_favorite_theme (userID, rating_threshold=8):
    """ 
    Given a user ID / Username, this function returns the most frequent mechanic 
    of the games that the user has rated above a certain rating threshold.
    """
     # Filter UserRatings for the given user and the games with a rating above the threshold
    user_ratings = users_filtered_df[users_filtered_df['Username'] == userID]
    rated_games = user_ratings[user_ratings['Rating'] >= rating_threshold]

    # Merge rated games with the Games DataFrame to get primary themes 
    rated_games_with_details = pd.merge(rated_games, modern_boardgames_df[['BGGId', 'Primary_Theme']], on='BGGId')

    favorite_theme = rated_games_with_details['Primary_Theme'].mode()[0] if not rated_games_with_details.empty else None
    return favorite_theme

In [260]:
def get_favorite_game (userID):
    user_ratings = users_filtered_df[users_filtered_df['Username'] == userID].sort_values(by='Rating', ascending=False)
    top_pick = user_ratings['BGGId'][:1].values[0]
    game = get_game_from_ID(top_pick)
    return game

def get_favorite_game_name (userID):
    user_ratings = users_filtered_df[users_filtered_df['Username'] == userID].sort_values(by='Rating', ascending=False)
    top_pick = user_ratings['BGGId'][:1].values[0]
    name = get_game_from_ID(top_pick)
    return name['Name']

In [261]:
def get_favorite_games (userID, num_games=10):
    """ 
    Given a user ID / Username, this function returns the ids of the top n rated games
    """
     # Filter UserRatings for the given user and the games with a rating above the threshold
    user_ratings = users_filtered_df[users_filtered_df['Username'] == userID].sort_values(by='Rating', ascending=False)
    top_picks = user_ratings['BGGId'][:num_games].values

    return top_picks


In [262]:
def get_all_reviews (userID):
    """ 
    Given a user ID / Username, this function returns the highest rated game
    """
     # Filter UserRatings for the given user and the games with a rating above the threshold
    user_ratings = users_filtered_df[users_filtered_df['Username'] == userID]

    return user_ratings

In [263]:
user_groups = users_filtered_df.groupby('Username').mean()
user_groups = user_groups.drop(columns='BGGId')
user_groups = user_groups.reset_index()

user_groups

Unnamed: 0,Username,Rating
0,mycroft,7.521429
1,-=Yod@=-,7.286424
2,-Johnny-,5.313043
3,-Loren-,7.400000
4,-LucaS-,7.717391
...,...,...
222281,zzzoren,5.872928
222282,zzzuzu,7.236842
222283,zzzvone,6.855000
222284,zzzxxxyyy,7.814286


### Examining Trends 

# Exports

For expiriments like NLP, K-NN, ETC

In [62]:
modern_boardgames_df.to_csv("data/modern_games.csv")

In [63]:
users_filtered_df.to_csv("data/users_encoded.csv")

In [64]:
game_themes_df.to_csv("data/modern_themes.csv", index=False)

In [65]:
game_mechanics_df.to_csv("data/modern_mechanics.csv",  index=False)

# Baseline Model - Collaborative 

[Surprise](https://surpriselib.com/) is a Python scikit for building and analyzing recommender systems that deal with explicit rating data. It provides a variety of ready-to-use prediction algorithms, as well as tools to evaluate their performance. 

In [61]:
# preprocessing the data
reader = Reader(rating_scale=(1,10))
data = Dataset.load_from_df(users_df[['Username','BGGId','Rating']], reader)
train, test = train_test_split(data, test_size=.2, random_state=42)

NameError: name 'Reader' is not defined

## SVD as our baseline 
The SVD algorithm []

In [62]:
# baseline model
algo = SVD(random_state = 42)
algo.fit(train)
predictions = algo.test(test)

NameError: name 'SVD' is not defined

## Getting Top Picks 

In [None]:
def get_top_picks(predictions, num_recs = 10):
    """Return the top-N recommendation for each user from a set of predictions.

    predictions(list of Prediction objects): The list of predictions, as returned by suprise's algorithm.
    num_recs (int): The number of recommendation to output for each user. Default is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size num_recs.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:num_recs]

    return top_n

top_n = get_top_picks(predictions, num_recs=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

NameError: name 'predictions' is not defined

## RMSE

In [None]:
# evaluate the rmse result of the prediction and ground thuth
accuracy.rmse(predictions)

 On average, the predicted ratings differ from the true ratings by approximately 4.05 points. This suggests that our baseline is model's predictions are relatively inaccurate, with a large gap between the predicted and actual ratings for the majority of the items.

## Pros and Cons
Pros: 
- Easy to set up
- Easy to evaluate / understand 

Cons: 
- Entirely focused on "Users who liked X, also liked Y"  
- Struggles with new users who may have no ratings, and new items with no reviews.  
- Does not encorporate any data about the items themselves. 

This means that someone who loves train games like Ticket To Ride and Whistle Stop may be reccomended games that are popular and don't use trains like Pandemic, just because more people have rated Ticket To Ride and Pandemic together, and thus they may miss out on niche train games they might love. 

# Content-Based Reccomendations 

Content based recommendation systems provide users with items with similar features to those they already enjoy. These are important for showing users new items who have not yet gotten reviews!

### Using Jaccard Similarity

Our dataset comes with two pre one-hot-encoded datasets about our games **themes** and **mechanics**. We can use these to find the Jaccard Similarity between games to determine how similar they are. 

Developed by Paul Jaccard, the index ranges from 0 to 1. The closer to 1, the more similar the two sets of data.
```
Jaccard Similarity = (number of observations in both sets) / (number in either set)
```
If two games are exactly the same, their Jaccard Similarity Index will be 1. Conversely, if they have nothing in common then their similarity will be 0

In [264]:
# Calculate Jaccard similarity between games
def jaccard_similarity(set1, set2):
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union != 0 else 0

#### Setting up our features 

We have a lot of features we can use to compare! For this, I'll be focusing on mechanics and themes. 

In [265]:
#preprocess the mechanics and themes into one combined set per game
modern_boardgames_df['Mechanics_Set'] = modern_boardgames_df['All_Mechanics'].apply(lambda x: set(x.split(', ')))
modern_boardgames_df['Themes_Set'] = modern_boardgames_df['All_Themes'].apply(lambda x: set(x.split(', ')))

#### Making the Matrix

Jaccard similarity is SLOW (O(n2) where n is the number of games), so we will precompute where possible.

We can also speed up our function by calculating for each [i,j] once, as [j,i] will have the same value! 

In [304]:
def compute_jaccard_similarity_matrix_optimized(data):
    #n for size!
    n = len(data)
    similarity_matrix = np.zeros((n, n))
    
    # Precompute the combined feature sets (mechanics + themes) for all games
    combined_sets = [
        data.iloc[i]['Mechanics_Set'].union(data.iloc[i]['Themes_Set']) 
        for i in range(n)
    ]
    
    # calculate the pairwise Jaccard similarity between all game pairs
    for i in range(n):
        for j in range(i + 1, n):  # calculate only for half of the matrix
            # Compute the Jaccard similarity
            similarity = jaccard_similarity(combined_sets[i], combined_sets[j])
            similarity_matrix[i, j] = similarity
            similarity_matrix[j, i] = similarity  # symmetric matrix - don't need to calculate twice!
    
    return similarity_matrix

In [305]:
jaccard_matrix = compute_jaccard_similarity_matrix_optimized(modern_boardgames_df)

### Creating Content-Based recomendations with Jaccard Similarity

In order to select games for our users, we will examine the features of their highest rated games. 

To improve this, we can also pre-filter by primary mechanic.

In [306]:
def get_content_based_recommendations_jaccard(game_id, jaccard_sim_matrix, n=5):
    ''' 
    Function to get content-based recommendations based on Jaccard similarity for the same primary mechanic
    '''
    # Ensure data is zero-indexed for correct mapping
    data = modern_boardgames_df.reset_index(drop=True)
    
    # Get index of the game in the DataFrame
    idx = data[data['BGGId'] == game_id].index[0]
    
    # Get the Primary Mechanic for the game
    primary_mechanic = data.iloc[idx]['Primary_Mechanic']
    
    # Filter games that have the same Primary Mechanic
    same_mechanic_indices = data[data['Primary_Mechanic'] == primary_mechanic].index.tolist()
    #... and the same theme
    
    # Get the pairwise similarity scores for the game, but only for games with the same Primary Mechanic
    sim_scores = [(i, jaccard_sim_matrix[idx, i]) for i in same_mechanic_indices if i != idx]
    
    # Sort the games based on the similarity scores (in descending order, high to low)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the top n+10 most similar games
    sim_scores = sim_scores[:n]
    
    # Get the game indices
    game_indices = [i[0] for i in sim_scores]
    
    # Return the game IDs of the most similar games
    return data.iloc[game_indices]['BGGId'].tolist()

In [307]:
n = get_content_based_recommendations_jaccard(925, jaccard_matrix)

print(get_game_from_ID(925)['Name'])

for i in n:
    print(get_game_from_ID(i)['Name'])

742    Werewolf
Name: Name, dtype: object
32    Arkham Horror
Name: Name, dtype: object
111    Quo Vadis?
Name: Name, dtype: object
171    Kremlin
Name: Name, dtype: object
244    Lords of the Sierra Madre (Second Edition)
Name: Name, dtype: object
263    Koalition
Name: Name, dtype: object


### Using Cosine Distance

Cosine similarity measures the cosine of the angle between two vectors. It is commonly used for text data, where the vectors represent the term frequencies (TF) or TF-IDF values of words or terms in documents.
It focuses on the direction of the vectors and ignores their magnitudes. To use this, we merge our themes and mechanics together and process. 

The value ranges from -1 (completely dissimilar) to 1 (completely similar), with 0 indicating no similarity

In [279]:
# Merge the dataframes on 'BGGId'
combined = pd.merge(game_themes_df, game_mechanics_df, on='BGGId', how='inner')

In [308]:
def compute_cosine_similarity_matrix(data):
   # Compute the pairwise Cosine similarity for the one-hot encoded data
    similarity_matrix = cosine_similarity(data)
    return similarity_matrix

In [310]:
cosine_matrix = compute_cosine_similarity_matrix(combined)

#### Getting Recommendations

In [311]:
def get_content_based_recommendations_cosine(game_id, n=5):
    """
    Provide content-based recommendations based on cosine similarity.

    Parameters:
    - game_id: The BGGId of the game for which we are recommending similar games.
    - n: Number of top recommendations to return.

    Returns:
    - list of BGGIds: List of recommended game BGGIds.
    """
    
    # Check if the game_id exists in the combined dataframe
    if game_id not in combined['BGGId'].values:
        raise ValueError(f"Game ID {game_id} not found in the dataset.")
    
    # Extract the feature vector for the input game_id
    game_vector = combined[combined['BGGId'] == game_id].drop('BGGId', axis=1)
    
    # Calculate the cosine similarity between the game vector and all other games
    similarity_matrix = cosine_similarity(game_vector, combined.drop('BGGId', axis=1))
    
    # Create a DataFrame for similarities
    similarity_df = pd.DataFrame(similarity_matrix.T, columns=['similarity'], index=combined['BGGId'])
    
    # Sort the similarity scores in descending order and exclude the game itself
    recommendations = similarity_df.drop(index=game_id).sort_values(by='similarity', ascending=False)
    
    # Return the top N most similar games
    return recommendations.head(n).index.tolist()

In [312]:
n = get_content_based_recommendations_cosine(925)

print(get_game_from_ID(925)['Name'])

for i in n:
    print(get_game_from_ID(i)['Name'])

742    Werewolf
Name: Name, dtype: object
8656    Ultimate Werewolf: Ultimate Edition
Name: Name, dtype: object
13070    Ultimate Werewolf: Deluxe Edition
Name: Name, dtype: object
9617    Lupus in Tabula
Name: Name, dtype: object
9795    Ultimate Werewolf: Compact Edition
Name: Name, dtype: object
13069    Ultimate Werewolf
Name: Name, dtype: object


## Pros and Cons:

Pros:
- Users are given more diverse recomendations that may be new or less well known 
- We could choose to pick themes or mechanics for an even more specific reccomendation 

Cons:
- Space Intensive (Distances stored In Memory)
- Does not account for average review / User could be reccomended a substandard product

## Cosine or Jaccard? 

In order to determine which content-based method to choose, we need to generate predictive scores for users. 

This will allow us to examine if our proposed score for a user is accurate (ie, do they accully like what is being recommended to them?)

In [396]:
def generate_predictions(game_id, username, similarity_matrix, data, users_data, n=5):
    try:
        # Find the index of the game in the data DataFrame
        idx = data[data['BGGId'] == game_id].index[0]
    except IndexError:
        # print(f"Error: Game with ID {game_id} not found in data.")
        return None
    
    # Ensure the index is within the bounds of the similarity matrix
    if idx >= similarity_matrix.shape[0]:
        # print(f"Error: Index {idx} is out of bounds for the similarity matrix.")
        return None

    # Get the pairwise similarity scores for the game
    sim_scores = [(i, similarity_matrix[idx, i]) for i in range(len(data)) if i != idx]
    
    # Sort the games based on the similarity scores (in descending order)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the top n most similar games
    top_n_games = [i[0] for i in sim_scores[:n]]
    
    # print(f"Top {n} similar games for game {game_id}: {top_n_games}")
    
    # Get the ratings for these games by the given user
    user_ratings = users_data[users_data['Username'] == username]
    
    # print(f"User {username} has rated the following games: {user_ratings[['BGGId', 'Rating']]}")
    
    # Check if 'Rating' and 'BGGId' exist in users_data
    if 'Rating' not in user_ratings.columns or 'BGGId' not in user_ratings.columns:
        # print(f"Error: Missing 'Rating' or 'BGGId' column in user ratings for user {username}.")
        return None
    
    # Get the ratings for the top n games that this user has rated
    rated_games = user_ratings[user_ratings['BGGId'].isin(data.iloc[top_n_games]['BGGId'])]
    
    # print(f"User {username} has rated the following top {n} similar games: {rated_games[['BGGId', 'Rating']]}")
    
    if len(rated_games) > 0:
        # If the user has rated any of the top n games, calculate the predicted rating as the mean of those ratings
        predicted_rating = np.mean(rated_games['Rating'])
        # print(f"Predicted Rating for User {username}, Game {game_id}: {predicted_rating}")
    else:
        # If the user hasn't rated any of the top n games, fallback to the mean rating of the top n games
        top_n_ratings = [data.iloc[i]['AvgRating'] for i in top_n_games if pd.notnull(data.iloc[i]['AvgRating'])]
        
        # print(f"Ratings for top {n} similar games: {top_n_ratings}")
        
        if len(top_n_ratings) > 0:
            # If there are valid ratings for the top n games, use the mean of those as the fallback prediction
            predicted_rating = np.mean(top_n_ratings)
            # print(f"Fallback Predicted Rating (mean of top n similar games) for User {username}, Game {game_id}: {predicted_rating}")
        else:
            # If none of the top n games have ratings, fallback to the global average rating
            global_avg_rating = np.mean([rating for rating in data['AvgRating'] if pd.notnull(rating)])
            # print(f"Fallback Predicted Rating (global avg) for User {username}, Game {game_id}: {global_avg_rating}")
            predicted_rating = global_avg_rating
    
    return predicted_rating

## Calculate RMSE

In [397]:
def compute_rmse(predictions, actual_ratings):
    mse = mean_squared_error(actual_ratings, predictions)
    rmse = sqrt(mse)
    return rmse

def evaluate_rmse_sampled(users_df, data, similarity_matrix, similarity_type='Cosine', n=5, sample_size=200):
    """
    Evaluate RMSE for a random sample of users. This function computes the RMSE for predictions generated
    using either Jaccard or Cosine similarity matrices.
    """
    # Randomly sample users for evaluation
    sampled_users = users_df.sample(n=sample_size, random_state=42)
    
    actual_ratings = []
    predicted_ratings = []
    
    # Loop through the sampled users and predict the rating for each game they rated
    for _, row in sampled_users.iterrows():
        game_id = row['BGGId']
        username = row['Username']
        actual_rating = row['Rating']
        
        # Check if 'Rating' and 'BGGId' exist in the sampled user data
        if 'Rating' not in row or 'BGGId' not in row:
            # print(f"Missing 'Rating' or 'BGGId' for user {username} (game {game_id}). Skipping.")
            continue
        
        # print(f"Evaluating: User {username}, Game {game_id}, Actual Rating: {actual_rating}")
        
        # Generate predictions using the specified similarity matrix
        predicted_rating = generate_predictions(game_id, username, similarity_matrix, data, users_df, n)
        
        # If a prediction was successfully made, append it to the lists
        if predicted_rating is not None:
            actual_ratings.append(actual_rating)
            predicted_ratings.append(predicted_rating)
    
    # Check if we have ratings to compute RMSE
    if len(actual_ratings) == 0:
        # print("No valid ratings to compute RMSE.")
        return None
    
    # Compute RMSE
    rmse = compute_rmse(predicted_ratings, actual_ratings)
    return rmse

# Evaluate

In [None]:
cosine_rmse = evaluate_rmse_sampled(users_filtered_df, modern_boardgames_df, cosine_matrix, similarity_type='Cosine', n=3, sample_size=2000)

In [None]:
print(f'Cosine Similarity RMSE (Sampled): {cosine_rmse}')

Cosine Similarity RMSE (Sampled): 1.6861576556826539


In [None]:
jaccard_rmse = evaluate_rmse_sampled(users_filtered_df, modern_boardgames_df, jaccard_matrix, similarity_type='Jaccard', n=1, sample_size=2000)

In [None]:
print(f'Jaccard Similarity RMSE (Sampled): {jaccard_rmse}')

Jaccard Similarity RMSE (Sampled): 2.2969081340086284


Conclusion 

# Hybrid System 

So: we have one system that can provide solid reccomendations based on other users, and one system that can provide solid reccomendations based on the items. 

To solve our problems, we can combine these methods for a hybrid approach!  

In [None]:
def hybrid_predict(user_bggid, game_id, themes_df, mechanics_df, model_cf, top_n=5):
    """
    Hybrid prediction by averaging content-based and collaborative filtering predictions.
    """
    # Content-based prediction (game recommendations)
    content_based_recommendations = predict_content_based(themes_df, mechanics_df, user_bggid, top_n)
    
    # Collaborative filtering prediction (SVD-based)
    # Get the collaborative filtering prediction for the specific user-item pair
    cf_prediction = model_cf.predict(user_bggid, game_id).est
    
    # Now, we combine the two recommendations (simple average in this case)
    # If the game_id is in the top N recommended games, then average the predictions
    if game_id in content_based_recommendations:
        content_based_score = 1  # Arbitrary score for presence in the top N
    else:
        content_based_score = 0  # Game not recommended
    
    hybrid_score = (cf_prediction + content_based_score) / 2  # Average of CF and Content-based
    
    return hybrid_score


In [None]:
def evaluate_hybrid_rmse(testset, themes_df, mechanics_df, model_cf, top_n=5):
    """
    Evaluate the hybrid recommender system using RMSE.
    """
    predictions = []
    true_ratings = []
    
    for user_bggid, game_id, actual_rating in testset:
        # Get the hybrid prediction for this user-item pair
        predicted_rating = hybrid_predict(user_bggid, game_id, themes_df, mechanics_df, model_cf, top_n)
        
        # Collect the true ratings and predicted ratings for RMSE calculation
        true_ratings.append(actual_rating)
        predictions.append(predicted_rating)
    
    # Calculate RMSE
    rmse = np.sqrt(np.mean((np.array(predictions) - np.array(true_ratings))**2))
    return rmse


# Practical Examples

Meet: _____ and ______. 



# Further Steps

 - Using ________ to examine more features 
 - 