<a href="https://colab.research.google.com/github/onertartan/recommender-systems-board-games/blob/main/explanatory_4_content_based_data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CONTENT BASED RECOMMENDATION

The original dataset is taken from <a>https://www.kaggle.com/datasets/jvanelteren/boardgamegeek-reviews</a>

Download and unzip the file **games_detailed_info.zip**.

In [1]:
# File link: https://drive.google.com/file/d/1v9p1P7Sauw285MEEU9swpdtmQnLPJpqk/view?usp=drive_link
!gdown 1v9p1P7Sauw285MEEU9swpdtmQnLPJpqk&confirm=t

Downloading...
From: https://drive.google.com/uc?id=1v9p1P7Sauw285MEEU9swpdtmQnLPJpqk
To: /content/games_detailed_info.zip
100% 19.5M/19.5M [00:00<00:00, 43.5MB/s]


Import packages  

In [2]:
import numpy as np
import pandas as pd
from functools import partial
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from zipfile import ZipFile

In [3]:
with ZipFile("games_detailed_info.zip") as zipFile:
    zipFile.extractall()

## 1- EXAMINE DATA

Check the head of df_games_detailed.

In [4]:
df_games_detailed= pd.read_csv("games_detailed_info.csv",index_col = 2,low_memory=False) # use game id as index
df_games_detailed.head(2)

Unnamed: 0_level_0,Unnamed: 0,type,thumbnail,image,primary,alternate,description,yearpublished,minplayers,maxplayers,...,War Game Rank,Customizable Rank,Children's Game Rank,RPG Item Rank,Accessory Rank,Video Game Rank,Amiga Rank,Commodore 64 Rank,Arcade Rank,Atari ST Rank
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
30549,0,boardgame,https://cf.geekdo-images.com/S3ybV1LAp-8SnHIXL...,https://cf.geekdo-images.com/S3ybV1LAp-8SnHIXL...,Pandemic,"['EPIZOotic', 'Pandemia', 'Pandemia 10 Anivers...","In Pandemic, several virulent diseases have br...",2008,2,4,...,,,,,,,,,,
822,1,boardgame,https://cf.geekdo-images.com/okM0dq_bEXnbyQTOv...,https://cf.geekdo-images.com/okM0dq_bEXnbyQTOv...,Carcassonne,"['Carcassonne Jubilee Edition', 'Carcassonne: ...",Carcassonne is a tile-placement game in which ...,2000,2,5,...,,,,,,,,,,


Let's check columns and the shape.

In [5]:
print(df_games_detailed.columns.tolist())

['Unnamed: 0', 'type', 'thumbnail', 'image', 'primary', 'alternate', 'description', 'yearpublished', 'minplayers', 'maxplayers', 'suggested_num_players', 'suggested_playerage', 'suggested_language_dependence', 'playingtime', 'minplaytime', 'maxplaytime', 'minage', 'boardgamecategory', 'boardgamemechanic', 'boardgamefamily', 'boardgameexpansion', 'boardgameimplementation', 'boardgamedesigner', 'boardgameartist', 'boardgamepublisher', 'usersrated', 'average', 'bayesaverage', 'Board Game Rank', 'Strategy Game Rank', 'Family Game Rank', 'stddev', 'median', 'owned', 'trading', 'wanting', 'wishing', 'numcomments', 'numweights', 'averageweight', 'boardgameintegration', 'boardgamecompilation', 'Party Game Rank', 'Abstract Game Rank', 'Thematic Rank', 'War Game Rank', 'Customizable Rank', "Children's Game Rank", 'RPG Item Rank', 'Accessory Rank', 'Video Game Rank', 'Amiga Rank', 'Commodore 64 Rank', 'Arcade Rank', 'Atari ST Rank']


In [6]:
df_games_detailed.shape

(21631, 55)

Let's create a dictionary dataframe mapping ids to game names. We will use this df to access game names using game ids.

In [7]:
df_id2game = df_games_detailed[[ "primary"]].copy()
df_id2game.head(2)

Unnamed: 0_level_0,primary
id,Unnamed: 1_level_1
30549,Pandemic
822,Carcassonne


We select the columns **boardgamecategory**, **boardgamemechanic** and **boardgamefamily** as content columns.<br>
We will also need game ranks to order games that are at equal distance.

In [8]:
df_content= df_games_detailed[["Board Game Rank","boardgamecategory","boardgamemechanic","boardgamefamily"]].copy()
df_content.rename(columns={"Board Game Rank":"Board_Game_Rank"},inplace = True)
df_content.head(3)

Unnamed: 0_level_0,Board_Game_Rank,boardgamecategory,boardgamemechanic,boardgamefamily
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30549,106,['Medical'],"['Action Points', 'Cooperative Game', 'Hand Ma...","['Components: Map (Global Scale)', 'Components..."
822,191,"['City Building', 'Medieval', 'Territory Build...","['Area Majority / Influence', 'Map Addition', ...","['Cities: Carcassonne (France)', 'Components: ..."
13,429,"['Economic', 'Negotiation']","['Dice Rolling', 'Hexagon Grid', 'Income', 'Mo...","['Animals: Sheep', 'Components: Hexagonal Tile..."


We consider that moving from boardgamecategory to boardgamemechanic and then to boardgamefamily, the attributes become more detailed.<br>
In other words, boardgamecategory provides the broadest groups.

Let's check missing values for each column.

In [9]:
df_content.isna().sum()

Board_Game_Rank         0
boardgamecategory     283
boardgamemechanic    1590
boardgamefamily      3761
dtype: int64

We can see that missing values increase as we move to the more detailed columns, because some games provide fewer details on boardgamemechanic and boardgamefamily<br>
(missing values increase in this order: boardgamecategory -> boardgamemechanic -> boardgamefamily).<br>

## 2- DATA CLEANING

We consider the existence of boardgamecategory values as the minimum requirement to provide recommendation. <br>**Therefore we drop rows with missing values in boardgamecategory.**

In [10]:
df_content.dropna(subset=["boardgamecategory"],inplace=True)

We dropped the rows with missing values in "boardgamecategory" because we consider this column the fundamental feature. <br>
Now let's check missing values again.

In [11]:
df_content.isna().sum()

Board_Game_Rank         0
boardgamecategory       0
boardgamemechanic    1538
boardgamefamily      3676
dtype: int64

We will fill the missing values in columns **boardgamemechanic** and **boardgamefamily** with a placeholder list holding a single empty string, `[""]`.<br>
(These placeholders are going to be represented as all zeros in one-hot encoding.)<br>
In short, we don't discard games with missing values in the **boardgamemechanic** and **boardgamefamily** columns. Instead, we rely on **boardgamecategory** for their similarity calculation.

Check the new shape.

In [12]:
df_content.shape

(21348, 4)

We can now fill the NA values in **boardgamemechanic** and **boardgamefamily** with the placeholder list and check missing values again.

In [13]:
# Fill missing cells with a list holding one empty string.
# At this point every cell is either a string or NaN, so pd.isna(x) is a plain bool.
# Assigning the whole column avoids chained indexing (and SettingWithCopyWarning).
for col in ["boardgamemechanic", "boardgamefamily"]:
    df_content[col] = df_content[col].apply(lambda x: [""] if pd.isna(x) else x)
df_content.isna().sum()

Board_Game_Rank      0
boardgamecategory    0
boardgamemechanic    0
boardgamefamily      0
dtype: int64

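In Data Preprocessing, the filled-in placeholders become all-zero rows in the one-hot encoded attribute columns.<br>
Euclidean distance can be misleading here: two games whose attributes are all zeros have distance 0 and therefore look identical.<br>
Jaccard similarity counts only attributes present in at least one of the two games, so as long as the empty case (0/0) is defined as 0, two games with no attributes are not considered similar.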
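The difference between the two measures on all-zero rows can be illustrated with a minimal sketch. The helper names `jaccard_similarity` and `euclidean_distance` and the toy vectors are assumptions made for this illustration, and defining the empty case (0/0) as 0 is a choice, not something every library does by default.

```python
import numpy as np

def jaccard_similarity(a, b):
    """Jaccard similarity of two binary vectors; the empty case (0/0) is defined as 0 here."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # assumption: two attribute-less games share no evidence
    return float(np.logical_and(a, b).sum() / union)

def euclidean_distance(a, b):
    """Plain Euclidean distance between two vectors."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

# Two hypothetical games with no attributes at all
empty_a = [0, 0, 0, 0]
empty_b = [0, 0, 0, 0]
# Two hypothetical games sharing one of three attributes
game_x = [1, 1, 0, 0]
game_y = [1, 0, 1, 0]

print(euclidean_distance(empty_a, empty_b))  # distance 0: the games look identical
print(jaccard_similarity(empty_a, empty_b))  # similarity 0: the games are not treated as similar
print(jaccard_similarity(game_x, game_y))    # one shared attribute out of three present
```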

## 3- DATA PREPROCESSING

### 3.1 Extract categorical content columns

We will extract the categorical content values which are embedded in lists in the rows of boardgamecategory, boardgamemechanic and boardgamefamily.<br>
Then, in the next step, we will represent them as one-hot columns.

In [14]:
df_content.head(2)

Unnamed: 0_level_0,Board_Game_Rank,boardgamecategory,boardgamemechanic,boardgamefamily
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30549,106,['Medical'],"['Action Points', 'Cooperative Game', 'Hand Ma...","['Components: Map (Global Scale)', 'Components..."
822,191,"['City Building', 'Medieval', 'Territory Build...","['Area Majority / Influence', 'Map Addition', ...","['Cities: Carcassonne (France)', 'Components: ..."


Each column holds categorical attributes in list format, but stored as str.<br>
We have to interpret these strings as lists. We can do this using literal_eval.<br>
For example, <code>literal_eval("[1,2,3]")</code> yields the list <code>[1,2,3]</code>.<br>

In [16]:
from ast import literal_eval
df_content["boardgamecategory"]  = df_content["boardgamecategory"].apply(lambda x: literal_eval(str(x)))
df_content["boardgamemechanic"] = df_content["boardgamemechanic"].apply(lambda x: literal_eval(str(x)))
df_content["boardgamefamily"] = df_content["boardgamefamily"].apply(lambda x: literal_eval(str(x)))
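As a small standalone illustration of the conversion above (the sample strings are taken from the table output, everything else is just for this sketch): `literal_eval` parses Python literals only, a placeholder list survives the `str(...)` round trip, and a bare `nan` is not a literal, which is one reason the missing cells had to be filled first.

```python
from ast import literal_eval

# A cell as it is stored in the CSV: a list written out as a string
raw = "['City Building', 'Medieval', 'Territory Building']"
parsed = literal_eval(raw)
print(type(parsed).__name__, parsed)

# The placeholder list survives the str(...) -> literal_eval round trip
placeholder = literal_eval(str([""]))
print(placeholder)

# A missing value would not: "nan" is a name, not a literal
try:
    literal_eval(str(float("nan")))
    nan_failed = False
except ValueError:
    nan_failed = True
print("literal_eval rejects 'nan':", nan_failed)
```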

Once we have converted each cell to list type, we can extract the unique values to be used as one-hot encoding columns.

#### 3.1.1 Get unique values of boardgamecategory as a list.

In [17]:
category_cols = sorted(set(sum(df_content["boardgamecategory"].tolist(),[])))
print(category_cols)
print("Number of categories:",len(category_cols))

['Abstract Strategy', 'Action / Dexterity', 'Adventure', 'Age of Reason', 'American Civil War', 'American Indian Wars', 'American Revolutionary War', 'American West', 'Ancient', 'Animals', 'Arabian', 'Aviation / Flight', 'Bluffing', 'Book', 'Card Game', "Children's Game", 'City Building', 'Civil War', 'Civilization', 'Collectible Components', 'Comic Book / Strip', 'Deduction', 'Dice', 'Economic', 'Educational', 'Electronic', 'Environmental', 'Expansion for Base-game', 'Exploration', 'Fan Expansion', 'Fantasy', 'Farming', 'Fighting', 'Game System', 'Horror', 'Humor', 'Industry / Manufacturing', 'Korean War', 'Mafia', 'Math', 'Mature / Adult', 'Maze', 'Medical', 'Medieval', 'Memory', 'Miniatures', 'Modern Warfare', 'Movies / TV / Radio theme', 'Murder/Mystery', 'Music', 'Mythology', 'Napoleonic', 'Nautical', 'Negotiation', 'Novel-based', 'Number', 'Party Game', 'Pike and Shot', 'Pirates', 'Political', 'Post-Napoleonic', 'Prehistoric', 'Print & Play', 'Puzzle', 'Racing', 'Real-time', 'Rel
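The `sum(list_of_lists, [])` idiom used above builds a new list at every step, which can become slow for many rows. A possible alternative, sketched here on toy rows rather than on `df_content`, is `itertools.chain.from_iterable`, which yields the same unique values.

```python
from itertools import chain

# Toy rows standing in for df_content["boardgamecategory"].tolist()
rows = [["Medical"], ["City Building", "Medieval"], ["Medieval", "Economic"]]

# Flatten lazily, deduplicate with a set, then sort
unique_sorted = sorted(set(chain.from_iterable(rows)))

# Same result as the sum(..., []) idiom
same_as_sum = sorted(set(sum(rows, [])))

print(unique_sorted)
print(unique_sorted == same_as_sum)
```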

#### 3.1.2 Get unique values of boardgamemechanic as a list.

In [18]:
mechanic_cols = sorted(set(sum(df_content["boardgamemechanic"].tolist(),[])))
print(mechanic_cols)

['', 'Acting', 'Action Drafting', 'Action Points', 'Action Queue', 'Action Retrieval', 'Action Timer', 'Action/Event', 'Advantage Token', 'Alliances', 'Area Majority / Influence', 'Area Movement', 'Area-Impulse', 'Auction/Bidding', 'Auction: Dexterity', 'Auction: Dutch', 'Auction: Dutch Priority', 'Auction: English', 'Auction: Fixed Placement', 'Auction: Once Around', 'Auction: Sealed Bid', 'Auction: Turn Order Until Pass', 'Automatic Resource Growth', 'Betting and Bluffing', 'Bias', 'Bingo', 'Bribery', 'Campaign / Battle Card Driven', 'Card Drafting', 'Card Play Conflict Resolution', 'Catch the Leader', 'Chaining', 'Chit-Pull System', 'Closed Economy Auction', 'Command Cards', 'Commodity Speculation', 'Communication Limits', 'Connections', 'Constrained Bidding', 'Contracts', 'Cooperative Game', 'Crayon Rail System', 'Critical Hits and Failures', 'Cube Tower', 'Deck Construction', 'Deck, Bag, and Pool Building', 'Deduction', 'Delayed Purchase', 'Dice Rolling', 'Die Icon Resolution', 'D

In [19]:
# Remove empty string
mechanic_cols = mechanic_cols[1:]
print("Number of unique boardgamemechanic values:",len(mechanic_cols))

Number of unique boardgamemechanic values: 182


#### 3.1.3 Get unique values of boardgamefamily as a list.
Note that, unlike the previous two steps, we exclude some values (attributes) of boardgamefamily from family_cols: **those starting with "Admin", "Game", "Crowdfunding", "Digital Implementations" or "Trivia"**.

In [20]:
family_cols = sorted( {x for row in df_content["boardgamefamily"]  for x in row
               if not (x.startswith("Admin") or x.startswith("Game") or x.startswith("Crowdfunding")
                   or x.startswith("Digital Implementations") or x.startswith("Trivia")) } )
print(family_cols)

['', 'Ancient: Babylon', 'Ancient: Carthage', 'Ancient: Corinth', 'Ancient: Egypt', 'Ancient: Greece', 'Ancient: Indus Valley', 'Ancient: Jericho', 'Ancient: Magna Graecia', 'Ancient: Mesopotamia', 'Ancient: Pompeii', 'Ancient: Rome', 'Ancient: Sparta', 'Animals: Alligators / Crocodiles', 'Animals: Ants', 'Animals: Apes / Monkeys', 'Animals: Badgers', 'Animals: Bats', 'Animals: Bears', 'Animals: Beavers', 'Animals: Bees', 'Animals: Birds', 'Animals: Butterflies', 'Animals: Camels', 'Animals: Cats', 'Animals: Cattle / Cows', 'Animals: Chameleons', 'Animals: Chickens', 'Animals: Cockroaches', 'Animals: Coral / Jellyfish / Anemones', 'Animals: Coyotes', 'Animals: Crabs', 'Animals: Crows / Ravens / Magpies', 'Animals: Deer / Antelope', 'Animals: Dinosaurs', 'Animals: Dogs', 'Animals: Dolphins', 'Animals: Donkeys', 'Animals: Ducks', 'Animals: Eagles', 'Animals: Elephants', 'Animals: Emus', 'Animals: Fish / Fishes', 'Animals: Fleas', 'Animals: Flies', 'Animals: Foxes', 'Animals: Frogs / Toad

In [21]:
# Remove the empty string
family_cols = family_cols[1:]
print("Number of unique boardgamefamily values:",len(family_cols))

Number of unique boardgamefamily values: 2643
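The prefix filter can also be written more compactly, because `str.startswith` accepts a tuple of prefixes. The sketch below uses toy rows; the constant name `EXCLUDED_PREFIXES` and the two excluded sample values are assumptions made for illustration, not values checked against the dataset.

```python
# Prefixes excluded from the family columns in this notebook
EXCLUDED_PREFIXES = ("Admin", "Game", "Crowdfunding", "Digital Implementations", "Trivia")

# Toy rows standing in for df_content["boardgamefamily"]
rows = [
    ["Admin: Better Description Needed!", "Animals: Cats"],  # first value is illustrative
    ["Crowdfunding: Kickstarter", "Ancient: Rome"],          # first value is illustrative
    [""],                                                     # placeholder for a missing cell
]

# str.startswith returns True if the string starts with any prefix in the tuple
kept = sorted({x for row in rows for x in row if not x.startswith(EXCLUDED_PREFIXES)})
print(kept)

# The empty string sorts first, so it can be dropped by slicing, as in the notebook
print(kept[1:])
```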


### 3.2 One-hot encode boardgamecategory, boardgamemechanic and boardgamefamily columns.
We will create three one-hot encoded dataframes:
* df_category
* df_mechanic
* df_family

We will keep them as separate dataframes and store them together in a dictionary at the end.
#### 3.2.1 Create **df_category** one-hot encoded dataframe

In this step we spread  elements of the lists in boardgamecategory column to new one-hot categorical columns.<br>
We initialize values in these columns as zeros.

In [22]:
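# Alternative initializer kept for reference; the cells below build the
# zero-filled frames directly with pd.DataFrame(0, ...) and do not call it.
def initialize_ohe(df_content, unpacked_column, ohe_columns):
    df_ohe = pd.concat([df_content[[unpacked_column]], pd.DataFrame(columns= ohe_columns)])
    df_ohe.fillna(0,inplace=True)
    df_ohe.iloc[:,1:] = df_ohe.iloc[:,1:].astype("int8")
    return df_ohe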

In [23]:
df_category =  pd.DataFrame(0,index=df_content.index,columns= category_cols,dtype="int8")
df_category.head(2)

Unnamed: 0_level_0,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,American Civil War,American Indian Wars,American Revolutionary War,American West,Ancient,Animals,...,Transportation,Travel,Trivia,Video Game Theme,Vietnam War,Wargame,Word Game,World War I,World War II,Zombies
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
30549,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


For each game, set the cells of the categories that the game belongs to from 0 to 1.

In [24]:
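def complete_ohe(df_content, df_ohe, unpacked_column):
    """Set df_ohe cells to 1 for every attribute listed in df_content[unpacked_column]."""
    known_columns = set(df_ohe.columns)
    for game_id in df_content.index:
        attributes = df_content.loc[game_id, unpacked_column]
        if attributes != [""]:  # [""] marks a game with no attributes in this column
            # We excluded some values (like Admin, Game) when building the
            # df_family columns, so keep only attributes that have a column.
            columns_for_ohe = list(set(attributes) & known_columns)
            df_ohe.loc[game_id, columns_for_ohe] = 1
    return df_ohe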
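The row-by-row loop above is easy to follow but visits every game individually. As a possible alternative, not the approach used in this notebook, the same kind of table can be built with `Series.explode` and `pd.get_dummies`. The function name `one_hot_from_lists` and the toy data are assumptions made for this sketch.

```python
import pandas as pd

def one_hot_from_lists(series, allowed):
    """One-hot encode a Series of lists, keeping only the values in `allowed`."""
    exploded = series.explode()                    # one row per (game, attribute)
    exploded = exploded[exploded.isin(allowed)]    # drops "", NaN and excluded values
    dummies = pd.get_dummies(exploded, dtype="int8")
    ohe = dummies.groupby(level=0).max()           # collapse back to one row per game
    # Restore the original row order and add all-zero rows/columns where needed
    return ohe.reindex(index=series.index, columns=allowed, fill_value=0)

# Toy data: ids 30, 10, 20; the last game has only the placeholder [""]
toy = pd.Series([["A", "B"], ["B"], [""]], index=[30, 10, 20])
toy_ohe = one_hot_from_lists(toy, ["A", "B", "C"])
print(toy_ohe)
```

A game holding only the placeholder ends up as an all-zero row, which matches how the loop treats `[""]`.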

In [25]:
df_category = complete_ohe(df_content, df_ohe=df_category, unpacked_column = "boardgamecategory")
df_category.head(2)

Unnamed: 0_level_0,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,American Civil War,American Indian Wars,American Revolutionary War,American West,Ancient,Animals,...,Transportation,Travel,Trivia,Video Game Theme,Vietnam War,Wargame,Word Game,World War I,World War II,Zombies
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
30549,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
print(df_category.columns.tolist())

['Abstract Strategy', 'Action / Dexterity', 'Adventure', 'Age of Reason', 'American Civil War', 'American Indian Wars', 'American Revolutionary War', 'American West', 'Ancient', 'Animals', 'Arabian', 'Aviation / Flight', 'Bluffing', 'Book', 'Card Game', "Children's Game", 'City Building', 'Civil War', 'Civilization', 'Collectible Components', 'Comic Book / Strip', 'Deduction', 'Dice', 'Economic', 'Educational', 'Electronic', 'Environmental', 'Expansion for Base-game', 'Exploration', 'Fan Expansion', 'Fantasy', 'Farming', 'Fighting', 'Game System', 'Horror', 'Humor', 'Industry / Manufacturing', 'Korean War', 'Mafia', 'Math', 'Mature / Adult', 'Maze', 'Medical', 'Medieval', 'Memory', 'Miniatures', 'Modern Warfare', 'Movies / TV / Radio theme', 'Murder/Mystery', 'Music', 'Mythology', 'Napoleonic', 'Nautical', 'Negotiation', 'Novel-based', 'Number', 'Party Game', 'Pike and Shot', 'Pirates', 'Political', 'Post-Napoleonic', 'Prehistoric', 'Print & Play', 'Puzzle', 'Racing', 'Real-time', 'Rel

#### 3.2.2 Create **df_mechanic** one-hot encoded dataframe
In this step we spread  elements of the lists in boardgamemechanic column to new one-hot categorical columns.<br>
We initialize values in these columns as zeros.

In [27]:
df_mechanic =  pd.DataFrame(0,index=df_content.index,columns= mechanic_cols, dtype="int8")
df_mechanic.head(2)

Unnamed: 0_level_0,Acting,Action Drafting,Action Points,Action Queue,Action Retrieval,Action Timer,Action/Event,Advantage Token,Alliances,Area Majority / Influence,...,Turn Order: Stat-Based,Variable Phase Order,Variable Player Powers,Variable Set-up,Victory Points as a Resource,Voting,Worker Placement,Worker Placement with Dice Workers,"Worker Placement, Different Worker Types",Zone of Control
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
30549,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


For each game, set the cells of the mechanics that the game uses from 0 to 1.

In [28]:
df_mechanic = complete_ohe(df_content, df_ohe= df_mechanic, unpacked_column = "boardgamemechanic")
df_mechanic.head(2)

Unnamed: 0_level_0,Acting,Action Drafting,Action Points,Action Queue,Action Retrieval,Action Timer,Action/Event,Advantage Token,Alliances,Area Majority / Influence,...,Turn Order: Stat-Based,Variable Phase Order,Variable Player Powers,Variable Set-up,Victory Points as a Resource,Voting,Worker Placement,Worker Placement with Dice Workers,"Worker Placement, Different Worker Types",Zone of Control
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
30549,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
822,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


#### 3.2.3 Create **df_family** one-hot encoded dataframe
In this step we spread  elements of the lists in boardgamefamily column to new one-hot categorical columns.<br>
We initialize values in these columns as zeros.

In [29]:
df_family =  pd.DataFrame(0,index=df_content.index,columns= family_cols, dtype="int8")
df_family.head(2)

Unnamed: 0_level_0,Ancient: Babylon,Ancient: Carthage,Ancient: Corinth,Ancient: Egypt,Ancient: Greece,Ancient: Indus Valley,Ancient: Jericho,Ancient: Magna Graecia,Ancient: Mesopotamia,Ancient: Pompeii,...,Video Game Theme: SEGA,Video Game Theme: Sonic the Hedgehog,Video Game Theme: Super Mario Bros.,Video Game Theme: Tetris,Video Game Theme: The Oregon Trail,Webcomics: Dork Tower,Webcomics: Penny Arcade,Word Games: First Letter Given,Word Games: Guess the Word,Word Games: Spelling / Letters
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
30549,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
df_family = complete_ohe(df_content, df_ohe=df_family, unpacked_column = "boardgamefamily")
pd.options.display.max_columns = 100
df_family.head(2)

Unnamed: 0_level_0,Ancient: Babylon,Ancient: Carthage,Ancient: Corinth,Ancient: Egypt,Ancient: Greece,Ancient: Indus Valley,Ancient: Jericho,Ancient: Magna Graecia,Ancient: Mesopotamia,Ancient: Pompeii,Ancient: Rome,Ancient: Sparta,Animals: Alligators / Crocodiles,Animals: Ants,Animals: Apes / Monkeys,Animals: Badgers,Animals: Bats,Animals: Bears,Animals: Beavers,Animals: Bees,Animals: Birds,Animals: Butterflies,Animals: Camels,Animals: Cats,Animals: Cattle / Cows,Animals: Chameleons,Animals: Chickens,Animals: Cockroaches,Animals: Coral / Jellyfish / Anemones,Animals: Coyotes,Animals: Crabs,Animals: Crows / Ravens / Magpies,Animals: Deer / Antelope,Animals: Dinosaurs,Animals: Dogs,Animals: Dolphins,Animals: Donkeys,Animals: Ducks,Animals: Eagles,Animals: Elephants,Animals: Emus,Animals: Fish / Fishes,Animals: Fleas,Animals: Flies,Animals: Foxes,Animals: Frogs / Toads,Animals: Geese,Animals: Giraffes,Animals: Goats,Animals: Gophers,...,Traditional Games: Knucklebones / Jacks,Traditional Games: Mahjong,Traditional Games: Mancala,Traditional Games: Morris,Traditional Games: Pachisi / Ludo,Traditional Games: Petteia,Traditional Games: Shogi,Traditional Games: Shut the Box,Traditional Games: Snakes and Ladders,Traditional Games: Spoons,Traditional Games: Sudoku,Traditional Games: Tafl,Traditional Games: Tiddlywinks,Traditional Games: Yut Nori,Versions & Editions: Adult Versions of Family-Friendly Games,Versions & Editions: Big Box Versions of Individual Games,Versions & Editions: Board Game Versions of Role-Playing Games,Versions & Editions: Card Versions of Non-Card Games,Versions & Editions: Dice Versions of Non-Dice Games,Versions & Editions: Disney Theme Park Editions,Versions & Editions: Electronic Versions of Non-Electronic Games,Versions & Editions: Junior Versions of Grown-Up Games,Versions & Editions: Legacy Versions of Non-Legacy Games,Versions & Editions: Roll- or Flip-and-Write Versions of Non-Writing Games,Versions & Editions: Travel Versions of Non-Travel Games,Versions & Editions: Two-Player Versions of More-Player Games,Video Game Theme: Angry Birds,Video Game Theme: Carmen Sandiego,Video Game Theme: Doom,Video Game Theme: Dragon Quest,Video Game Theme: Final Fantasy,Video Game Theme: Fruit Ninja,Video Game Theme: Honfoglaló,Video Game Theme: Kingdom Hearts,Video Game Theme: Minecraft,Video Game Theme: Nintendo,Video Game Theme: Pac-Man,Video Game Theme: Pokémon,Video Game Theme: Project Shrine Maiden,Video Game Theme: Resident Evil,Video Game Theme: SEGA,Video Game Theme: Sonic the Hedgehog,Video Game Theme: Super Mario Bros.,Video Game Theme: Tetris,Video Game Theme: The Oregon Trail,Webcomics: Dork Tower,Webcomics: Penny Arcade,Word Games: First Letter Given,Word Games: Guess the Word,Word Games: Spelling / Letters
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1
30549,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
822,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Let's check the shape of df_family

In [31]:
df_family.shape

(21348, 2643)

We can drop columns that have a value in only one game, since such a column can never be shared by two games.

In [32]:
(df_family.sum()==1).sum()

448

In [33]:
columns_to_drop = df_family.columns[df_family.sum()==1]
print(columns_to_drop.tolist())

['Ancient: Magna Graecia', 'Animals: Sloths', 'Authors: Astrid Lindgren', 'Authors: Ellery Queen', 'Authors: Karl May', 'Authors: Michael Moorcock', 'Books: Alfons Åberg', 'Books: Angelina Ballerina', 'Books: Charlie and Lola', 'Books: Charlie and the Chocolate Factory', 'Books: Der Dativ ist dem Genitiv sein Tod', 'Books: Der Kleine König', 'Books: Felix', 'Books: Goosebumps', 'Books: Hopalong Cassidy', 'Books: Inkworld', 'Books: Jim Button', 'Books: Le Petit Poucet', 'Books: Lilly the Witch', 'Books: Little Bear', 'Books: Little Raven Socks', 'Books: My Secret Unicorn', 'Books: The Berenstain Bears', "Brands: Campbell's", 'Brands: Harley-Davidson', 'Brands: John Deere', "Brands: McDonald's", 'Category: Drinking Games', 'Celebrities: Don Adams', 'Celebrities: Lucille Ball', 'Celebrities: The Wiggles', 'Characters:  The Addams Family', 'Characters: Bozo the Clown', 'Characters: Buck Rogers', 'Characters: Caillou', "Characters: Capt'n Sharky", "Characters: Käpt'n Blaubär", 'Characters: 

These columns are very specific: each one belongs to a single game.<br>
We could merge them into broader groups, but df_category already has such broad columns.<br>
For example, df_category contains columns such as "Ancient", "Book", "Dice" and "Video Game Theme".<br>
Therefore we remove these columns.

In [34]:
df_family.drop(columns=columns_to_drop,inplace=True)
df_family.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21348 entries, 30549 to 165946
Columns: 2195 entries, Ancient: Babylon to Word Games: Spelling / Letters
dtypes: int8(2195)
memory usage: 45.4 MB
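The column filter used above can be seen in isolation on a toy frame. The column names are borrowed from the outputs above, while the 0/1 values are made up for this sketch.

```python
import pandas as pd

# Toy one-hot frame: three games, three family columns
toy_family = pd.DataFrame(
    {
        "Animals: Cats": [1, 1, 0],
        "Animals: Sloths": [0, 0, 1],   # set for a single game only
        "Ancient: Rome": [1, 0, 1],
    },
    index=[1, 2, 3],
)

# Column sums count how many games carry each attribute
games_per_column = toy_family.sum()
singleton_columns = toy_family.columns[games_per_column == 1]

# Drop attributes that only one game has
toy_family_kept = toy_family.drop(columns=singleton_columns)
print(singleton_columns.tolist())
print(toy_family_kept.columns.tolist())
```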


### 3.3 Convert the three one-hot dataframes df_category, df_mechanic and df_family to a sparse dtype

In the previous part we initialized the dataframes with zeros and then set cells to 1 for the attributes listed in the original columns.<br>
We kept a dense dtype while assigning the ones; now that the assignment is complete, we can convert the dataframes
to a sparse dtype.

In [35]:
df_category = df_category.astype(pd.SparseDtype("bool"))

In [36]:
df_mechanic = df_mechanic.astype(pd.SparseDtype("bool"))
df_mechanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21348 entries, 30549 to 165946
Columns: 182 entries, Acting to Zone of Control
dtypes: Sparse[bool, False](182)
memory usage: 987.5 KB


In [37]:
df_family = df_family.astype(pd.SparseDtype("bool"))
df_family.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21348 entries, 30549 to 165946
Columns: 2195 entries, Ancient: Babylon to Word Games: Spelling / Letters
dtypes: Sparse[bool, False](2195)
memory usage: 871.4 KB
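The memory saving reported above can be reproduced on synthetic data. This is a minimal sketch on a random, mostly-zero frame (about 2% ones, an arbitrary choice), not on the notebook's dataframes, so the byte counts will differ from the outputs above.

```python
import numpy as np
import pandas as pd

# Synthetic one-hot frame: 1000 rows, 50 columns, roughly 2% ones
rng = np.random.default_rng(0)
dense = pd.DataFrame((rng.random((1000, 50)) < 0.02).astype("int8"))

# Same conversion as in the notebook: only the True cells are stored
sparse = dense.astype(pd.SparseDtype("bool"))

dense_bytes = int(dense.memory_usage(deep=True).sum())
sparse_bytes = int(sparse.memory_usage(deep=True).sum())
print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
print("density:     ", sparse.sparse.density)  # fraction of stored (non-fill) cells

# Converting back gives the original values
roundtrip = sparse.sparse.to_dense().astype("int8")
```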


### 3.4 A Final Adjustment: Assign a Value to Not Ranked Games in the Board_Game_Rank Column

* We will use game rankings to break ties when similarities (distances) are equal.
* In such cases we will suggest the better-ranked games (smaller rank numbers) first.
* If we inspect the *Board_Game_Rank* column, we notice that a few games are not ranked.

In [38]:
sum(df_content["Board_Game_Rank"]=="Not Ranked")

5

 We can replace them with a number greater than the number of games. Let's simply use a placeholder rank: 100000.

In [39]:
df_content["Board_Game_Rank"]=df_content["Board_Game_Rank"].replace({"Not Ranked":"100000"})
df_content.head(2)

Unnamed: 0_level_0,Board_Game_Rank,boardgamecategory,boardgamemechanic,boardgamefamily
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30549,106,[Medical],"[Action Points, Cooperative Game, Hand Managem...","[Components: Map (Global Scale), Components: M..."
822,191,"[City Building, Medieval, Territory Building]","[Area Majority / Influence, Map Addition, Tile...","[Cities: Carcassonne (France), Components: Mee..."


In [40]:
df_content["Board_Game_Rank"].dtype

dtype('O')

Since we have replaced "Not Ranked" with 100000, now we can convert the data type of this column from Object to integer.

In [41]:
df_content["Board_Game_Rank"]= df_content["Board_Game_Rank"].astype("uint32")
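How the numeric rank is meant to break ties can be sketched as follows. The distance values and the id 999999 are hypothetical; only the rank handling mirrors the cells above.

```python
import pandas as pd

# Toy candidates: three games share the same (hypothetical) distance
candidates = pd.DataFrame(
    {
        "distance": [0.25, 0.25, 0.10, 0.25],
        "Board_Game_Rank": ["191", "106", "429", "Not Ranked"],
    },
    index=pd.Index([822, 30549, 13, 999999], name="id"),  # 999999 is a made-up id
)

# Same two steps as in the notebook: placeholder rank, then integer dtype
candidates["Board_Game_Rank"] = (
    candidates["Board_Game_Rank"].replace({"Not Ranked": "100000"}).astype("uint32")
)

# Sort by distance first; among equal distances the smaller rank number comes first
ordered = candidates.sort_values(["distance", "Board_Game_Rank"])
print(ordered)
```

With the placeholder rank of 100000, a game that is not ranked ends up last among games at the same distance.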

## 4- SAVE THE DATAFRAMES IN A DICTIONARY

Now we can save the dataframes below in a dictionary **df_content_dict**, which holds:
* **Board Game Rank**
* **one-hot boardgamecategory**
* **one-hot boardgamemechanic**
* **one-hot boardgamefamily**

We save the dataframes separately because we will calculate distances for each of them separately.

In [43]:
df_content_dict ={"rank":df_content[["Board_Game_Rank"]],"category":df_category,"mechanic":df_mechanic,"family":df_family}

In [44]:
import pickle
# Save for the deployment
with open('df_content_dict.pkl', 'wb') as file:
    pickle.dump(df_content_dict, file)
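For completeness, reading such a pickle back could look like the sketch below. It uses a small stand-in dictionary and a temporary directory instead of the real `df_content_dict.pkl`; the toy values and the file name are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small stand-in for df_content_dict with the same keys
toy_dict = {
    "rank": pd.DataFrame({"Board_Game_Rank": [106, 191]}, index=[30549, 822]),
    "category": pd.DataFrame({"Medical": [1, 0], "Medieval": [0, 1]}, index=[30549, 822]),
    "mechanic": pd.DataFrame({"Action Points": [1, 0]}, index=[30549, 822]),
    "family": pd.DataFrame({"Animals: Cats": [0, 0]}, index=[30549, 822]),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_content_dict.pkl")
    # Write the dictionary, then read it back in binary mode
    with open(path, "wb") as file:
        pickle.dump(toy_dict, file)
    with open(path, "rb") as file:
        loaded = pickle.load(file)

print(sorted(loaded))
print(loaded["category"])
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.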