<a href="https://colab.research.google.com/github/onertartan/recommender-systems-board-games/blob/main/explanatory_4_content_based_data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CONTENT BASED RECOMMENDATION

The original dataset is taken from <a href="https://www.kaggle.com/datasets/jvanelteren/boardgamegeek-reviews">https://www.kaggle.com/datasets/jvanelteren/boardgamegeek-reviews</a>

Download and unzip the file **games_detailed_info.zip**.

In [None]:
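# File link: https://drive.google.com/file/d/1v9p1P7Sauw285MEEU9swpdtmQnLPJpqk/view?usp=drive_link
# Quote the argument: an unquoted "&" makes the shell run gdown in the background and drop "confirm=t".
!gdown "1v9p1P7Sauw285MEEU9swpdtmQnLPJpqk&confirm=t"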

Import packages  

In [None]:
import numpy as np
import pandas as pd
from functools import partial
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from zipfile import ZipFile

In [None]:
with ZipFile("games_detailed_info.zip") as zipFile:
    zipFile.extractall()

## 1- EXAMINE DATA

Check the head of **df_games_detailed**.

In [None]:
df_games_detailed= pd.read_csv("games_detailed_info.csv",index_col = 2,low_memory=False) # use game id as index
df_games_detailed.head(2)

id,Unnamed: 0,type,thumbnail,image,primary,alternate,description,yearpublished,minplayers,maxplayers,...,War Game Rank,Customizable Rank,Children's Game Rank,RPG Item Rank,Accessory Rank,Video Game Rank,Amiga Rank,Commodore 64 Rank,Arcade Rank,Atari ST Rank
30549,0,boardgame,https://cf.geekdo-images.com/S3ybV1LAp-8SnHIXL...,https://cf.geekdo-images.com/S3ybV1LAp-8SnHIXL...,Pandemic,"['EPIZOotic', 'Pandemia', 'Pandemia 10 Anivers...","In Pandemic, several virulent diseases have br...",2008,2,4,...,,,,,,,,,,
822,1,boardgame,https://cf.geekdo-images.com/okM0dq_bEXnbyQTOv...,https://cf.geekdo-images.com/okM0dq_bEXnbyQTOv...,Carcassonne,"['Carcassonne Jubilee Edition', 'Carcassonne: ...",Carcassonne is a tile-placement game in which ...,2000,2,5,...,,,,,,,,,,


Let's create a lookup dataframe that maps game ids to game names.

In [None]:
df_id2game = df_games_detailed[[ "primary"]].copy()
df_id2game.head()

id,primary
30549,Pandemic
822,Carcassonne
13,Catan
68448,7 Wonders
36218,Dominion


Select the columns **boardgamecategory** and **boardgamemechanic** as content columns, and keep **Board Game Rank** alongside them.<br>

In [None]:
df_content= df_games_detailed[["Board Game Rank","boardgamecategory","boardgamemechanic"]].copy()

Check missing values for each column.

In [None]:
df_content.isna().sum()

Board Game Rank         0
boardgamecategory     283
boardgamemechanic    1590
dtype: int64

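Number of rows with at least one missing value. It is lower than the sum of the per-column counts (283 + 1590 = 1873) because 52 games are missing both columns.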

In [None]:
df_content.isna().any(axis=1).sum()

1821
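The difference between the two counts can be reproduced on a toy frame. This is a minimal sketch with made-up rows; `toy`, `per_column`, and `rows_with_missing` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Three hypothetical games: the second one is missing both content columns.
toy = pd.DataFrame({
    "boardgamecategory": ["['Medical']", np.nan, np.nan],
    "boardgamemechanic": ["['Dice Rolling']", np.nan, "['Voting']"],
})

# Missing cells counted per column (a row missing both is counted twice in total).
per_column = toy.isna().sum()

# Rows with at least one missing cell (each row is counted once).
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(per_column.sum(), rows_with_missing)
```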

## 2- DATA CLEANING

Drop the games with missing values

In [None]:
df_content.dropna(axis=0,inplace=True)

Check missing values again.

In [None]:
df_content.isna().sum().any()

False

## 3- DATA PREPROCESSING

The values in the "*boardgamecategory*" and "*boardgamemechanic*" columns are strings that look like Python lists. <br>They can be parsed into actual lists.

In [None]:
from ast import literal_eval
df_content["boardgamecategory"]  = df_content["boardgamecategory"].apply(lambda x: literal_eval(str(x)))
df_content["boardgamemechanic"] = df_content["boardgamemechanic"].apply(lambda x: literal_eval(str(x)))
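As a quick illustration of what this conversion does to a single cell, the sketch below parses one stringified list. The sample string is made up to mirror the dataset's format, and `raw` and `parsed` are illustrative names.

```python
from ast import literal_eval

# A cell value as it is stored in the CSV: a string that looks like a Python list.
raw = "['City Building', 'Medieval', 'Territory Building']"

# literal_eval only evaluates Python literals (lists, strings, numbers, ...),
# so, unlike eval, it does not execute arbitrary code.
parsed = literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[0])
```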

**Board Game Rank** was kept because we will need game ranks later to order games that are at equal distance. Check the first rows of the converted columns.

In [None]:
df_content["boardgamecategory"].head(2)

id
30549                                        [Medical]
822      [City Building, Medieval, Territory Building]
Name: boardgamecategory, dtype: object

In [None]:
df_content["boardgamemechanic"].head(2)

id
30549    [Action Points, Cooperative Game, Hand Managem...
822      [Area Majority / Influence, Map Addition, Tile...
Name: boardgamemechanic, dtype: object

Get unique categories as a list.

In [None]:
category_cols = sorted(set(sum(df_content["boardgamecategory"].tolist(),[])))
print(category_cols)
print("Number of categories:",len(category_cols))

['Abstract Strategy', 'Action / Dexterity', 'Adventure', 'Age of Reason', 'American Civil War', 'American Indian Wars', 'American Revolutionary War', 'American West', 'Ancient', 'Animals', 'Arabian', 'Aviation / Flight', 'Bluffing', 'Book', 'Card Game', "Children's Game", 'City Building', 'Civil War', 'Civilization', 'Collectible Components', 'Comic Book / Strip', 'Deduction', 'Dice', 'Economic', 'Educational', 'Electronic', 'Environmental', 'Expansion for Base-game', 'Exploration', 'Fan Expansion', 'Fantasy', 'Farming', 'Fighting', 'Game System', 'Horror', 'Humor', 'Industry / Manufacturing', 'Korean War', 'Mafia', 'Math', 'Mature / Adult', 'Maze', 'Medical', 'Medieval', 'Memory', 'Miniatures', 'Modern Warfare', 'Movies / TV / Radio theme', 'Murder/Mystery', 'Music', 'Mythology', 'Napoleonic', 'Nautical', 'Negotiation', 'Novel-based', 'Number', 'Party Game', 'Pike and Shot', 'Pirates', 'Political', 'Post-Napoleonic', 'Prehistoric', 'Print & Play', 'Puzzle', 'Racing', 'Real-time', 'Rel
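Flattening with `sum(list_of_lists, [])` builds a new list at every step, so its cost grows quadratically with the number of games. A linear-time alternative, sketched here on made-up lists, is `itertools.chain.from_iterable`; `nested` and `unique_sorted` are illustrative names.

```python
from itertools import chain

# Toy stand-in for df_content["boardgamecategory"].tolist()
nested = [["Medical"], ["City Building", "Medieval"], ["Medieval", "Economic"]]

# chain.from_iterable walks the inner lists lazily without building intermediate lists.
unique_sorted = sorted(set(chain.from_iterable(nested)))

print(unique_sorted)
print("Number of categories:", len(unique_sorted))
```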

Get unique mechanics as a list.

In [None]:
mechanics_cols = sorted(set(sum(df_content["boardgamemechanic"].tolist(),[])))
print(mechanics_cols)
print("Number of mechanics types:",len(mechanics_cols))

['Acting', 'Action Drafting', 'Action Points', 'Action Queue', 'Action Retrieval', 'Action Timer', 'Action/Event', 'Advantage Token', 'Alliances', 'Area Majority / Influence', 'Area Movement', 'Area-Impulse', 'Auction/Bidding', 'Auction: Dexterity', 'Auction: Dutch', 'Auction: Dutch Priority', 'Auction: English', 'Auction: Fixed Placement', 'Auction: Once Around', 'Auction: Sealed Bid', 'Auction: Turn Order Until Pass', 'Automatic Resource Growth', 'Betting and Bluffing', 'Bias', 'Bingo', 'Bribery', 'Campaign / Battle Card Driven', 'Card Drafting', 'Card Play Conflict Resolution', 'Catch the Leader', 'Chaining', 'Chit-Pull System', 'Closed Economy Auction', 'Command Cards', 'Commodity Speculation', 'Communication Limits', 'Connections', 'Constrained Bidding', 'Contracts', 'Cooperative Game', 'Crayon Rail System', 'Critical Hits and Failures', 'Cube Tower', 'Deck Construction', 'Deck, Bag, and Pool Building', 'Deduction', 'Delayed Purchase', 'Dice Rolling', 'Die Icon Resolution', 'Diffe

### Create **df_category** one-hot encoded dataframe as empty (initialized with zeros).

In [None]:
df_category = pd.concat([df_content[["boardgamecategory"]], pd.DataFrame(columns= category_cols)])
df_category.fillna(0,inplace=True)
df_category.head(2)

Unnamed: 0,boardgamecategory,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,American Civil War,American Indian Wars,American Revolutionary War,American West,Ancient,...,Transportation,Travel,Trivia,Video Game Theme,Vietnam War,Wargame,Word Game,World War I,World War II,Zombies
30549,[Medical],0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,"[City Building, Medieval, Territory Building]",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


For each game, set the cell to 1 in every category column that appears in the game's category list; all other cells keep their initial value of 0.

In [None]:
for game_id in df_category.index:  # game_id avoids shadowing the built-in id()
    for cat in df_category.loc[game_id,"boardgamecategory"]:
        df_category.loc[game_id,cat]=1
df_category.head(2)

Unnamed: 0,boardgamecategory,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,American Civil War,American Indian Wars,American Revolutionary War,American West,Ancient,...,Transportation,Travel,Trivia,Video Game Theme,Vietnam War,Wargame,Word Game,World War I,World War II,Zombies
30549,[Medical],0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,"[City Building, Medieval, Territory Building]",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
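The cell-by-cell loop is easy to follow but can be slow on roughly 20,000 games. One possible vectorized alternative, shown here on a toy Series rather than the real data, combines `explode` with `pd.get_dummies`. This is a sketch, not the notebook's method; `categories`, `exploded`, and `one_hot` are illustrative names.

```python
import pandas as pd

# Toy stand-in for df_content["boardgamecategory"] (ids and lists are illustrative).
categories = pd.Series(
    {30549: ["Medical"], 822: ["City Building", "Medieval"]},
    name="boardgamecategory",
)

# explode() gives one row per (game, category) pair, repeating the game id.
exploded = categories.explode()

# get_dummies() one-hot encodes each pair; taking the max per game id
# collapses the pairs back into one row per game.
one_hot = (
    pd.get_dummies(exploded)
    .groupby(level=0, sort=False)
    .max()
    .astype("int8")
)

print(one_hot)
```

If a fixed column order is needed, the result could then be reindexed against the full category list with `one_hot.reindex(columns=category_cols, fill_value=0)`.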


### Create **df_mechanics** one-hot encoded dataframe as empty (initialized with zeros).

In [None]:
df_mechanics = pd.concat([df_content[["boardgamemechanic"]],pd.DataFrame(columns= mechanics_cols)])
df_mechanics.fillna(0,inplace=True)
df_mechanics.head(2)

Unnamed: 0,boardgamemechanic,Acting,Action Drafting,Action Points,Action Queue,Action Retrieval,Action Timer,Action/Event,Advantage Token,Alliances,...,Turn Order: Stat-Based,Variable Phase Order,Variable Player Powers,Variable Set-up,Victory Points as a Resource,Voting,Worker Placement,Worker Placement with Dice Workers,"Worker Placement, Different Worker Types",Zone of Control
30549,"[Action Points, Cooperative Game, Hand Managem...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,"[Area Majority / Influence, Map Addition, Tile...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
for game_id in df_mechanics.index:  # game_id avoids shadowing the built-in id()
    for mechanics in df_mechanics.loc[game_id,"boardgamemechanic"]:
        df_mechanics.loc[game_id,mechanics]=1
df_mechanics.head(2)

Unnamed: 0,boardgamemechanic,Acting,Action Drafting,Action Points,Action Queue,Action Retrieval,Action Timer,Action/Event,Advantage Token,Alliances,...,Turn Order: Stat-Based,Variable Phase Order,Variable Player Powers,Variable Set-up,Victory Points as a Resource,Voting,Worker Placement,Worker Placement with Dice Workers,"Worker Placement, Different Worker Types",Zone of Control
30549,"[Action Points, Cooperative Game, Hand Managem...",0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
822,"[Area Majority / Influence, Map Addition, Tile...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we can drop the original columns.

In [None]:
df_category.drop("boardgamecategory", inplace=True, axis=1)
df_mechanics.drop("boardgamemechanic", inplace=True, axis=1)

Check shapes

In [None]:
df_category.shape

(19810, 84)

In [None]:
df_mechanics.shape

(19810, 182)

Now we can create the final **content dataframe**, which has **Board Game Rank**, the **one-hot category columns**, and the **one-hot mechanics columns**.

In [None]:
df_content = pd.concat([df_content["Board Game Rank"],df_category,df_mechanics],axis=1)
df_content.shape

(19810, 267)

In [None]:
df_content.head(2)

Unnamed: 0,Board Game Rank,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,American Civil War,American Indian Wars,American Revolutionary War,American West,Ancient,...,Turn Order: Stat-Based,Variable Phase Order,Variable Player Powers,Variable Set-up,Victory Points as a Resource,Voting,Worker Placement,Worker Placement with Dice Workers,"Worker Placement, Different Worker Types",Zone of Control
30549,106,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
822,191,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_content.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19810 entries, 30549 to 165946
Columns: 267 entries, Board Game Rank to Zone of Control
dtypes: int64(266), object(1)
memory usage: 40.5+ MB


Check the **Board Game Rank** column.

A few games are not ranked.

In [None]:
sum(df_content["Board Game Rank"]=="Not Ranked")

3

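We can replace "Not Ranked" with a number greater than the number of games, so that unranked games sort after every ranked game. The column still holds strings at this point, so the replacement value is a string as well.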

In [None]:
df_content["Board Game Rank"]=df_content["Board Game Rank"].replace({"Not Ranked":"22000"})
df_content.rename(columns={"Board Game Rank":"Board_Game_Rank"},inplace=True)
df_content.head(2)

Unnamed: 0,Board_Game_Rank,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,American Civil War,American Indian Wars,American Revolutionary War,American West,Ancient,...,Turn Order: Stat-Based,Variable Phase Order,Variable Player Powers,Variable Set-up,Victory Points as a Resource,Voting,Worker Placement,Worker Placement with Dice Workers,"Worker Placement, Different Worker Types",Zone of Control
30549,106,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
822,191,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
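An alternative to replacing the label by hand is to let pandas coerce anything non-numeric to NaN and then fill it with the sentinel. The sketch below assumes a toy Series of rank strings; the ids, `UNRANKED_SENTINEL`, and `ranks_clean` are illustrative names, and 22000 simply mirrors the value used above.

```python
import pandas as pd

# Toy stand-in for the rank column: strings, with one unranked game.
ranks = pd.Series(["106", "Not Ranked", "191"], index=[30549, 1, 822])

UNRANKED_SENTINEL = 22000  # assumed to exceed the number of games

# errors="coerce" turns any value that is not a number into NaN.
numeric = pd.to_numeric(ranks, errors="coerce")

# Fill the unranked games and downcast to an unsigned integer type.
ranks_clean = numeric.fillna(UNRANKED_SENTINEL).astype("uint32")

print(ranks_clean)
```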


We can downcast the column types to reduce memory usage, then save the resulting dataframe for deployment.

In [None]:
# Assigning through .iloc may keep the original dtypes, so rebuild the frame from converted parts.
rank = df_content.iloc[:, 0].astype("uint32")    # Board_Game_Rank
one_hot = df_content.iloc[:, 1:].astype("int8")  # 0/1 category and mechanics columns
df_content = pd.concat([rank, one_hot], axis=1)
df_content.info()
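To get a feel for how much the downcast saves, the sketch below compares the memory footprint of the same 0/1 data stored as `int64` and as `int8`. The frame size is arbitrary, and `flags_int64` and `flags_int8` are illustrative names.

```python
import numpy as np
import pandas as pd

# 1000 rows x 10 one-hot style columns, stored with the default 8-byte integers.
flags_int64 = pd.DataFrame(np.zeros((1000, 10), dtype="int64"))

# The same values stored with 1-byte integers.
flags_int8 = flags_int64.astype("int8")

# memory_usage() reports bytes per column; index=False leaves the index out.
bytes_int64 = int(flags_int64.memory_usage(index=False).sum())
bytes_int8 = int(flags_int8.memory_usage(index=False).sum())

print(bytes_int64, bytes_int8)
```

Since one-hot columns only ever hold 0 or 1, `int8` loses no information while using one eighth of the space.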

In [None]:
df_content.index =df_content.index.astype("int32")

In [None]:
# Save for the deployment
df_content.to_csv("df_content.csv")
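One caveat worth keeping in mind: a CSV file stores only text, so the compact dtypes are not saved with it. Whoever loads the file can request them again through the `dtype` argument of `pd.read_csv`. The sketch below round-trips a tiny made-up frame through an in-memory buffer instead of a file; the column names and `compact_dtypes` are illustrative.

```python
import io
import pandas as pd

compact_dtypes = {"Board_Game_Rank": "uint32", "Medical": "int8"}

# Tiny stand-in for df_content with compact dtypes.
df = pd.DataFrame(
    {"Board_Game_Rank": [106, 191], "Medical": [1, 0]},
    index=pd.Index([30549, 822], name="id"),
).astype(compact_dtypes)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)

# Reading without dtype falls back to 64-bit integers.
buffer.seek(0)
plain = pd.read_csv(buffer, index_col=0)

# Passing dtype restores the compact types at load time.
buffer.seek(0)
typed = pd.read_csv(buffer, index_col=0, dtype=compact_dtypes)

print(plain.dtypes.tolist(), typed.dtypes.tolist())
```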