<a href="https://colab.research.google.com/github/onertartan/recommender-systems-board-games/blob/main/explanatory_4_content_based_data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CONTENT BASED RECOMMENDATION

The original dataset is taken from <a>https://www.kaggle.com/datasets/jvanelteren/boardgamegeek-reviews</a>.

Download and unzip the file **games_detailed_info.zip**.

In [None]:
# File link: https://drive.google.com/file/d/1v9p1P7Sauw285MEEU9swpdtmQnLPJpqk/view?usp=drive_link
# Note: an unquoted "&confirm=t" suffix would be read by the shell as "run in background", so it is omitted
!gdown 1v9p1P7Sauw285MEEU9swpdtmQnLPJpqk

Downloading...
From: https://drive.google.com/uc?id=1v9p1P7Sauw285MEEU9swpdtmQnLPJpqk
To: /content/games_detailed_info.zip
  0% 0.00/19.5M [00:00<?, ?B/s]100% 19.5M/19.5M [00:00<00:00, 235MB/s]


Import packages  

In [None]:
import numpy as np
import pandas as pd
from functools import partial
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from zipfile import ZipFile

In [None]:
with ZipFile("games_detailed_info.zip") as zipFile:
    zipFile.extractall()

## 1- EXAMINE DATA

Check the head of df_games_detailed.

In [None]:
df_games_detailed = pd.read_csv("games_detailed_info.csv", index_col=2, low_memory=False)  # use game id as index
df_games_detailed.head(2)

id,Unnamed: 0,type,thumbnail,image,primary,alternate,description,yearpublished,minplayers,maxplayers,...,War Game Rank,Customizable Rank,Children's Game Rank,RPG Item Rank,Accessory Rank,Video Game Rank,Amiga Rank,Commodore 64 Rank,Arcade Rank,Atari ST Rank
30549,0,boardgame,https://cf.geekdo-images.com/S3ybV1LAp-8SnHIXL...,https://cf.geekdo-images.com/S3ybV1LAp-8SnHIXL...,Pandemic,"['EPIZOotic', 'Pandemia', 'Pandemia 10 Anivers...","In Pandemic, several virulent diseases have br...",2008,2,4,...,,,,,,,,,,
822,1,boardgame,https://cf.geekdo-images.com/okM0dq_bEXnbyQTOv...,https://cf.geekdo-images.com/okM0dq_bEXnbyQTOv...,Carcassonne,"['Carcassonne Jubilee Edition', 'Carcassonne: ...",Carcassonne is a tile-placement game in which ...,2000,2,5,...,,,,,,,,,,


Let's check columns and the shape.

In [None]:
df_games_detailed.columns

Index(['Unnamed: 0', 'type', 'thumbnail', 'image', 'primary', 'alternate',
       'description', 'yearpublished', 'minplayers', 'maxplayers',
       'suggested_num_players', 'suggested_playerage',
       'suggested_language_dependence', 'playingtime', 'minplaytime',
       'maxplaytime', 'minage', 'boardgamecategory', 'boardgamemechanic',
       'boardgamefamily', 'boardgameexpansion', 'boardgameimplementation',
       'boardgamedesigner', 'boardgameartist', 'boardgamepublisher',
       'usersrated', 'average', 'bayesaverage', 'Board Game Rank',
       'Strategy Game Rank', 'Family Game Rank', 'stddev', 'median', 'owned',
       'trading', 'wanting', 'wishing', 'numcomments', 'numweights',
       'averageweight', 'boardgameintegration', 'boardgamecompilation',
       'Party Game Rank', 'Abstract Game Rank', 'Thematic Rank',
       'War Game Rank', 'Customizable Rank', 'Children's Game Rank',
       'RPG Item Rank', 'Accessory Rank', 'Video Game Rank', 'Amiga Rank',
       'Commodore 64

In [None]:
df_games_detailed.shape

(21631, 55)

Let's create a lookup dataframe that maps game ids to game names. We will use it to retrieve a game's name from its id.

In [None]:
df_id2game = df_games_detailed[[ "primary"]].copy()
df_id2game.head()

id,primary
30549,Pandemic
822,Carcassonne
13,Catan
68448,7 Wonders
36218,Dominion
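
As a minimal sketch of how such a lookup table might be used, the toy frame below mirrors the first rows of the output above. The names `toy_id2game` and `game_name` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy stand-in for df_id2game, using ids and names from the output above
toy_id2game = pd.DataFrame(
    {"primary": ["Pandemic", "Carcassonne", "Catan"]},
    index=pd.Index([30549, 822, 13], name="id"),
)

def game_name(game_id):
    """Return the primary name for a game id (hypothetical helper)."""
    return toy_id2game.loc[game_id, "primary"]

names = [game_name(game_id) for game_id in (822, 13)]
print(names)
```

Label-based access with `.loc` works here because the game id is the index of the frame.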


We select the columns **boardgamecategory**, **boardgamemechanic** and **boardgamefamily** as content columns.<br>
We will also need the game ranks to order games that end up at equal distances.

In [None]:
df_content= df_games_detailed[["Board Game Rank","boardgamecategory","boardgamemechanic","boardgamefamily"]].copy()
df_content.rename(columns={"Board Game Rank":"Board_Game_Rank"},inplace = True)
df_content.head(3)

id,Board_Game_Rank,boardgamecategory,boardgamemechanic,boardgamefamily
30549,106,['Medical'],"['Action Points', 'Cooperative Game', 'Hand Ma...","['Components: Map (Global Scale)', 'Components..."
822,191,"['City Building', 'Medieval', 'Territory Build...","['Area Majority / Influence', 'Map Addition', ...","['Cities: Carcassonne (France)', 'Components: ..."
13,429,"['Economic', 'Negotiation']","['Dice Rolling', 'Hexagon Grid', 'Income', 'Mo...","['Animals: Sheep', 'Components: Hexagonal Tile..."
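
To illustrate why the rank column is kept, here is a small sketch of tie-breaking by rank. The ids and ranks come from the output above, but the distance values are made up for the example, and the tuple layout is an assumption rather than the notebook's actual data structure.

```python
# (game_id, distance, rank) triples; the distances are invented for illustration
candidates = [
    (13, 0.25, 429),
    (30549, 0.40, 106),
    (822, 0.25, 191),
]

# Sort by distance first, then by rank, so the better-ranked game wins a tie
ordered = sorted(candidates, key=lambda item: (item[1], item[2]))
ordered_ids = [game_id for game_id, _, _ in ordered]
print(ordered_ids)
```

Games 822 and 13 share the same distance, so the lower rank number (191) places 822 first.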


Moving from boardgamecategory to boardgamemechanic and then to boardgamefamily, the attributes become more detailed.<br>
In other words, boardgamecategory provides the broadest groups.

Let's check missing values for each column.

In [None]:
df_content.isna().sum()

Board_Game_Rank         0
boardgamecategory     283
boardgamemechanic    1590
boardgamefamily      3761
dtype: int64

The number of missing values increases in the order boardgamecategory -> boardgamemechanic -> boardgamefamily,<br>
because some games provide fewer details on their mechanics and families.<br>

## 2- DATA CLEANING

We consider the existence of boardgamecategory values as the minimum requirement to provide a recommendation.<br>**Therefore, we drop rows with missing values in boardgamecategory.**

In [None]:
df_content.dropna(subset=["boardgamecategory"],inplace=True)

The rows with missing values in boardgamecategory are now dropped. Check missing values again.

In [None]:
df_content.isna().sum()

Board_Game_Rank         0
boardgamecategory       0
boardgamemechanic    1538
boardgamefamily      3676
dtype: int64

The remaining missing values in **boardgamemechanic** and **boardgamefamily** will be filled below. First, check the new shape.

In [None]:
df_content.shape

(21348, 4)

We can fill the missing values in **boardgamemechanic** and **boardgamefamily** with a placeholder, the string form of a list holding a single empty string (<code>['']</code>), and check missing values again.

In [None]:
# Fill missing cells with the string form of [""]; literal_eval turns it into a list in section 3.1.
# (Chained assignment such as df[col][mask] = ... may silently fail to update df_content.)
for col in ["boardgamemechanic", "boardgamefamily"]:
    df_content[col] = df_content[col].fillna("['']")
df_content.isna().sum()

Board_Game_Rank      0
boardgamecategory    0
boardgamemechanic    0
boardgamefamily      0
dtype: int64

## 3- DATA PREPROCESSING

### 3.1 Extract categorical content columns

We will extract the categorical values that are embedded in the lists stored in the boardgamecategory, boardgamemechanic and boardgamefamily columns.<br>
Then, in the next step, we will represent them as one-hot columns.

In [None]:
df_content.head(2)

id,Board_Game_Rank,boardgamecategory,boardgamemechanic,boardgamefamily
30549,106,['Medical'],"['Action Points', 'Cooperative Game', 'Hand Ma...","['Components: Map (Global Scale)', 'Components..."
822,191,"['City Building', 'Medieval', 'Territory Build...","['Area Majority / Influence', 'Map Addition', ...","['Cities: Carcassonne (France)', 'Components: ..."


Each cell looks like a list of categorical attributes but is actually stored as a str.<br>
We have to interpret these strings as lists. We can do this using literal_eval.<br>
For example, <code>literal_eval("[1,2,3]")</code> will yield the list <code>[1, 2, 3]</code>.<br>
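
A minimal sketch of the same conversion on a string shaped like the cells of this dataset (the variable names are only for illustration):

```python
from ast import literal_eval

# The cell content is a str that merely looks like a list
raw_cell = "['City Building', 'Medieval', 'Territory Building']"
parsed = literal_eval(raw_cell)

print(type(raw_cell).__name__, "->", type(parsed).__name__)
print(parsed[0])
```

Unlike `eval`, `literal_eval` only accepts Python literals, which makes it the safer choice for data read from a file.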

In [None]:
from ast import literal_eval
df_content["boardgamecategory"]  = df_content["boardgamecategory"].apply(lambda x: literal_eval(str(x)))
df_content["boardgamemechanic"] = df_content["boardgamemechanic"].apply(lambda x: literal_eval(str(x)))
df_content["boardgamefamily"] = df_content["boardgamefamily"].apply(lambda x: literal_eval(str(x)))

Once we have converted each cell to a list, we can extract the unique values to be used as one-hot encoding columns.
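
The cells below flatten the per-game lists with `sum(..., [])`. As a hedged alternative sketch, `itertools.chain.from_iterable` gives the same unique values without repeatedly copying intermediate lists; the toy `rows` list is an assumption standing in for a column of parsed lists.

```python
from itertools import chain

# Toy stand-in for a column of per-game label lists
rows = [["Medical"], ["City Building", "Medieval"], ["Economic", "Medieval"]]

# Flatten the per-game lists and keep each label once, in sorted order
unique_labels = sorted(set(chain.from_iterable(rows)))
print(unique_labels)
```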

#### 3.1.1 Get unique values of boardgamecategory as a list.

In [None]:
category_cols = sorted(set(sum(df_content["boardgamecategory"].tolist(),[])))
print(category_cols)
print("Number of categories:",len(category_cols))

['Abstract Strategy', 'Action / Dexterity', 'Adventure', 'Age of Reason', 'American Civil War', 'American Indian Wars', 'American Revolutionary War', 'American West', 'Ancient', 'Animals', 'Arabian', 'Aviation / Flight', 'Bluffing', 'Book', 'Card Game', "Children's Game", 'City Building', 'Civil War', 'Civilization', 'Collectible Components', 'Comic Book / Strip', 'Deduction', 'Dice', 'Economic', 'Educational', 'Electronic', 'Environmental', 'Expansion for Base-game', 'Exploration', 'Fan Expansion', 'Fantasy', 'Farming', 'Fighting', 'Game System', 'Horror', 'Humor', 'Industry / Manufacturing', 'Korean War', 'Mafia', 'Math', 'Mature / Adult', 'Maze', 'Medical', 'Medieval', 'Memory', 'Miniatures', 'Modern Warfare', 'Movies / TV / Radio theme', 'Murder/Mystery', 'Music', 'Mythology', 'Napoleonic', 'Nautical', 'Negotiation', 'Novel-based', 'Number', 'Party Game', 'Pike and Shot', 'Pirates', 'Political', 'Post-Napoleonic', 'Prehistoric', 'Print & Play', 'Puzzle', 'Racing', 'Real-time', 'Rel

#### 3.1.2 Get unique values of boardgamemechanic as a list.

In [None]:
mechanic_cols = sorted(set(sum(df_content["boardgamemechanic"].tolist(),[])))
print(mechanic_cols)

['', 'Acting', 'Action Drafting', 'Action Points', 'Action Queue', 'Action Retrieval', 'Action Timer', 'Action/Event', 'Advantage Token', 'Alliances', 'Area Majority / Influence', 'Area Movement', 'Area-Impulse', 'Auction/Bidding', 'Auction: Dexterity', 'Auction: Dutch', 'Auction: Dutch Priority', 'Auction: English', 'Auction: Fixed Placement', 'Auction: Once Around', 'Auction: Sealed Bid', 'Auction: Turn Order Until Pass', 'Automatic Resource Growth', 'Betting and Bluffing', 'Bias', 'Bingo', 'Bribery', 'Campaign / Battle Card Driven', 'Card Drafting', 'Card Play Conflict Resolution', 'Catch the Leader', 'Chaining', 'Chit-Pull System', 'Closed Economy Auction', 'Command Cards', 'Commodity Speculation', 'Communication Limits', 'Connections', 'Constrained Bidding', 'Contracts', 'Cooperative Game', 'Crayon Rail System', 'Critical Hits and Failures', 'Cube Tower', 'Deck Construction', 'Deck, Bag, and Pool Building', 'Deduction', 'Delayed Purchase', 'Dice Rolling', 'Die Icon Resolution', 'D

In [None]:
# Remove empty string
mechanic_cols = mechanic_cols[1:]
print("Number of unique boardgamemechanic values:",len(mechanic_cols))

Number of unique boardgamemechanic values: 182


#### 3.1.3 Get unique values of boardgamefamily as a list.

In [None]:
# Families starting with these prefixes are administrative or too generic to describe content
excluded_prefixes = ("Admin", "Game", "Crowdfunding", "Digital Implementations", "Trivia")
family_cols = sorted({x for row in df_content["boardgamefamily"] for x in row
                      if not x.startswith(excluded_prefixes)})
print(family_cols)

['', 'Ancient: Babylon', 'Ancient: Carthage', 'Ancient: Corinth', 'Ancient: Egypt', 'Ancient: Greece', 'Ancient: Indus Valley', 'Ancient: Jericho', 'Ancient: Magna Graecia', 'Ancient: Mesopotamia', 'Ancient: Pompeii', 'Ancient: Rome', 'Ancient: Sparta', 'Animals: Alligators / Crocodiles', 'Animals: Ants', 'Animals: Apes / Monkeys', 'Animals: Badgers', 'Animals: Bats', 'Animals: Bears', 'Animals: Beavers', 'Animals: Bees', 'Animals: Birds', 'Animals: Butterflies', 'Animals: Camels', 'Animals: Cats', 'Animals: Cattle / Cows', 'Animals: Chameleons', 'Animals: Chickens', 'Animals: Cockroaches', 'Animals: Coral / Jellyfish / Anemones', 'Animals: Coyotes', 'Animals: Crabs', 'Animals: Crows / Ravens / Magpies', 'Animals: Deer / Antelope', 'Animals: Dinosaurs', 'Animals: Dogs', 'Animals: Dolphins', 'Animals: Donkeys', 'Animals: Ducks', 'Animals: Eagles', 'Animals: Elephants', 'Animals: Emus', 'Animals: Fish / Fishes', 'Animals: Fleas', 'Animals: Flies', 'Animals: Foxes', 'Animals: Frogs / Toad

In [None]:
# Ignore the empty string
family_cols = family_cols[1:]
print("Number of unique boardgamefamily values:",len(family_cols))

Number of unique boardgamefamily values: 2643


### 3.2 One-hot encode boardgamecategory, boardgamemechanic and boardgamefamily columns.
We will create three one-hot encoded dataframes: df_category, df_mechanic and df_family. Then we will merge them.
#### 3.2.1 Create **df_category** one-hot encoded dataframe

In this step we spread the elements of the lists in the boardgamecategory column into new one-hot categorical columns.<br>
We initialize the values in these columns as zeros.
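
Conceptually, each game's list becomes a row of 0s and 1s over the full set of unique values. The sketch below shows this idea in plain Python on a toy three-label vocabulary; `multi_hot` is a hypothetical helper, not part of the notebook.

```python
category_vocab = ["City Building", "Medical", "Medieval"]  # toy vocabulary

def multi_hot(labels, vocab):
    """Return a 0/1 list with a 1 at each position whose label is present."""
    present = set(labels)
    return [1 if label in present else 0 for label in vocab]

row_pandemic = multi_hot(["Medical"], category_vocab)
row_carcassonne = multi_hot(["City Building", "Medieval"], category_vocab)
print(row_pandemic, row_carcassonne)
```

Because a game can carry several labels at once, a row may contain more than one 1.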

In [None]:
df_category = pd.concat([df_content[["boardgamecategory"]], pd.DataFrame(columns=category_cols)])
df_category.fillna(0, inplace=True)  # initialize the new one-hot columns with zeros
df_category = df_category.astype({col: "int8" for col in category_cols})  # compact integer dtype
df_category.head(2)


Unnamed: 0,boardgamecategory,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,American Civil War,American Indian Wars,American Revolutionary War,American West,Ancient,...,Transportation,Travel,Trivia,Video Game Theme,Vietnam War,Wargame,Word Game,World War I,World War II,Zombies
30549,[Medical],0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,"[City Building, Medieval, Territory Building]",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


For each game, set the cells of the categories it belongs to from 0 to 1.

In [None]:
for game_id in df_category.index:
    columns_for_ohe = df_category.loc[game_id, "boardgamecategory"]
    df_category.loc[game_id, columns_for_ohe] = 1
df_category.head(2)

Unnamed: 0,boardgamecategory,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,American Civil War,American Indian Wars,American Revolutionary War,American West,Ancient,...,Transportation,Travel,Trivia,Video Game Theme,Vietnam War,Wargame,Word Game,World War I,World War II,Zombies
30549,[Medical],0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,"[City Building, Medieval, Territory Building]",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
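
The row-by-row loop above is easy to follow. As an alternative sketch (not the notebook's method), the same encoding can be expressed with `explode` and `pd.get_dummies` on a toy frame; the three rows reuse the categories shown earlier for these ids.

```python
import pandas as pd

toy = pd.DataFrame(
    {"boardgamecategory": [
        ["Medical"],
        ["City Building", "Medieval", "Territory Building"],
        ["Economic", "Negotiation"],
    ]},
    index=[30549, 822, 13],
)

# One row per (game, category) pair, then one indicator column per category
exploded = toy["boardgamecategory"].explode()
one_hot = (
    pd.get_dummies(exploded)
    .groupby(level=0, sort=False)
    .max()  # collapse the repeated game ids back to one row per game
    .astype("int8")
)
print(one_hot)
```

The `groupby(level=0, sort=False)` step keeps the original row order while merging the indicator rows that belong to the same game.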


#### 3.2.2 Create **df_mechanic** one-hot encoded dataframe
In this step we spread the elements of the lists in the boardgamemechanic column into new one-hot categorical columns.<br>
We initialize the values in these columns as zeros.

In [None]:
df_mechanic = pd.concat([df_content[["boardgamemechanic"]], pd.DataFrame(columns=mechanic_cols)])
df_mechanic.fillna(0, inplace=True)  # initialize the new one-hot columns with zeros
df_mechanic = df_mechanic.astype({col: "int8" for col in mechanic_cols})  # compact integer dtype
df_mechanic.head(2)


Unnamed: 0,boardgamemechanic,Acting,Action Drafting,Action Points,Action Queue,Action Retrieval,Action Timer,Action/Event,Advantage Token,Alliances,...,Turn Order: Stat-Based,Variable Phase Order,Variable Player Powers,Variable Set-up,Victory Points as a Resource,Voting,Worker Placement,Worker Placement with Dice Workers,"Worker Placement, Different Worker Types",Zone of Control
30549,"[Action Points, Cooperative Game, Hand Managem...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,"[Area Majority / Influence, Map Addition, Tile...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


For each game, set the cells of the mechanics it uses from 0 to 1. Games with the empty-string placeholder are skipped.

In [None]:
for game_id in df_mechanic.index:
    columns_for_ohe = df_mechanic.loc[game_id, "boardgamemechanic"]
    if columns_for_ohe != [""]:
        df_mechanic.loc[game_id, columns_for_ohe] = 1
df_mechanic.head(2)

Unnamed: 0,boardgamemechanic,Acting,Action Drafting,Action Points,Action Queue,Action Retrieval,Action Timer,Action/Event,Advantage Token,Alliances,...,Turn Order: Stat-Based,Variable Phase Order,Variable Player Powers,Variable Set-up,Victory Points as a Resource,Voting,Worker Placement,Worker Placement with Dice Workers,"Worker Placement, Different Worker Types",Zone of Control
30549,"[Action Points, Cooperative Game, Hand Managem...",0,0,1,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
822,"[Area Majority / Influence, Map Addition, Tile...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
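# Exploratory check: expand every 0/1 mechanic column into explicit _0/_1 dummy columns
dataset = df_mechanic.iloc[:, 1:].astype("int")
dataset = pd.get_dummies(dataset, columns=dataset.columns)
dataset = dataset.astype("float")
dataset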

Unnamed: 0,Acting_0,Acting_1,Action Drafting_0,Action Drafting_1,Action Points_0,Action Points_1,Action Queue_0,Action Queue_1,Action Retrieval_0,Action Retrieval_1,...,Voting_0,Voting_1,Worker Placement_0,Worker Placement_1,Worker Placement with Dice Workers_0,Worker Placement with Dice Workers_1,"Worker Placement, Different Worker Types_0","Worker Placement, Different Worker Types_1",Zone of Control_0,Zone of Control_1
30549,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
822,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
13,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
68448,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
36218,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296892,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
217378,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
18063,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
10052,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


#### 3.2.3 Create **df_family** one-hot encoded dataframe
In this step we spread the elements of the lists in the boardgamefamily column into new one-hot categorical columns.<br>
We initialize the values in these columns as zeros.

In [None]:
df_family = pd.concat([df_content[["boardgamefamily"]], pd.DataFrame(columns=family_cols)])
df_family.fillna(0, inplace=True)  # initialize the new one-hot columns with zeros
df_family = df_family.astype({col: "int8" for col in family_cols})  # compact integer dtype
df_family.head(2)

Unnamed: 0,boardgamefamily,Ancient: Babylon,Ancient: Carthage,Ancient: Corinth,Ancient: Egypt,Ancient: Greece,Ancient: Indus Valley,Ancient: Jericho,Ancient: Magna Graecia,Ancient: Mesopotamia,...,Video Game Theme: SEGA,Video Game Theme: Sonic the Hedgehog,Video Game Theme: Super Mario Bros.,Video Game Theme: Tetris,Video Game Theme: The Oregon Trail,Webcomics: Dork Tower,Webcomics: Penny Arcade,Word Games: First Letter Given,Word Games: Guess the Word,Word Games: Spelling / Letters
30549,"[Components: Map (Global Scale), Components: M...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,"[Cities: Carcassonne (France), Components: Mee...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Unlike the previous two steps, here we have to skip the boardgamefamily values that were left out of family_cols (those starting with prefixes such as "Admin", "Game" or "Trivia").
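
The prefix-based exclusion can be sketched on a few labels. `str.startswith` accepts a tuple of prefixes, which keeps the condition short. The two excluded labels below are illustrative examples and are not taken from the notebook output.

```python
excluded_prefixes = ("Admin", "Game", "Crowdfunding", "Digital Implementations", "Trivia")

# Toy family labels; the "Admin" and "Crowdfunding" entries are illustrative only
labels = [
    "Animals: Sheep",
    "Admin: Better Description Needed!",
    "Crowdfunding: Kickstarter",
    "Cities: Carcassonne (France)",
]

# Keep only the labels that do not start with any excluded prefix
kept = [label for label in labels if not label.startswith(excluded_prefixes)]
print(kept)
```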

In [None]:
family_set = set(family_cols)  # set membership is faster than scanning the list
for game_id in df_family.index:
    columns_for_ohe = df_family.loc[game_id, "boardgamefamily"]
    if columns_for_ohe != [""]:
        columns_for_ohe = [col for col in columns_for_ohe if col in family_set]
        df_family.loc[game_id, columns_for_ohe] = 1
df_family.head(2)

Unnamed: 0,boardgamefamily,Ancient: Babylon,Ancient: Carthage,Ancient: Corinth,Ancient: Egypt,Ancient: Greece,Ancient: Indus Valley,Ancient: Jericho,Ancient: Magna Graecia,Ancient: Mesopotamia,...,Video Game Theme: SEGA,Video Game Theme: Sonic the Hedgehog,Video Game Theme: Super Mario Bros.,Video Game Theme: Tetris,Video Game Theme: The Oregon Trail,Webcomics: Dork Tower,Webcomics: Penny Arcade,Word Games: First Letter Given,Word Games: Guess the Word,Word Games: Spelling / Letters
30549,"[Components: Map (Global Scale), Components: M...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,"[Cities: Carcassonne (France), Components: Mee...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 3.3 Feature Engineering

### 3.4 Merge the three one-hot dataframes (df_category, df_mechanic and df_family) into df_content

Now we can drop the original list columns.

In [None]:
df_category.drop("boardgamecategory", inplace=True, axis=1)
df_mechanic.drop("boardgamemechanic", inplace=True, axis=1)
df_family.drop("boardgamefamily", inplace=True, axis=1)

Check shapes

In [None]:
df_category.shape

(21348, 84)

In [None]:
df_mechanic.shape

(21348, 182)

In [None]:
df_family.shape

(21348, 2643)

Now we can create the final **content dataframe**, which has the following columns:
* **Board_Game_Rank**
* **one-hot boardgamecategory**
* **one-hot boardgamemechanic**
* **one-hot boardgamefamily**

In [None]:
df_content = pd.concat([df_content["Board_Game_Rank"],df_category,df_mechanic,df_family],axis=1)
df_content.shape

(21348, 2910)
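
The column count can be checked against the parts: one rank column plus 84, 182 and 2643 one-hot columns. The sketch below shows the same side-by-side concatenation on toy frames; all toy names and values are assumptions for illustration.

```python
import pandas as pd

ids = [30549, 822]
rank = pd.Series(["106", "191"], index=ids, name="Board_Game_Rank")
cat = pd.DataFrame({"Medical": [1, 0], "Medieval": [0, 1]}, index=ids)
mech = pd.DataFrame({"Action Points": [1, 0]}, index=ids)

# axis=1 places the pieces side by side, aligning rows on the shared id index
combined = pd.concat([rank, cat, mech], axis=1)
print(combined.shape)

# The same bookkeeping for the real frames: 1 rank column plus the three blocks
expected_columns = 1 + 84 + 182 + 2643
print(expected_columns)
```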

In [None]:
df_content.head(2)

Unnamed: 0,Board Game Rank,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,American Civil War,American Indian Wars,American Revolutionary War,American West,Ancient,...,Video Game Theme: SEGA,Video Game Theme: Sonic the Hedgehog,Video Game Theme: Super Mario Bros.,Video Game Theme: Tetris,Video Game Theme: The Oregon Trail,Webcomics: Dork Tower,Webcomics: Penny Arcade,Word Games: First Letter Given,Word Games: Guess the Word,Word Games: Spelling / Letters
30549,106,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,191,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df_content.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21348 entries, 30549 to 165946
Columns: 2910 entries, Board_Game_Rank to Word Games: Spelling / Letters
dtypes: int8(2909), object(1)
memory usage: 60.1+ MB


### 3.5 Assign values to unranked games in Board_Game_Rank

A few games are not ranked.

In [None]:
sum(df_content["Board_Game_Rank"]=="Not Ranked")

5

We can replace them with a number greater than the number of games, so that unranked games sort after all ranked ones.

In [None]:
df_content["Board_Game_Rank"]=df_content["Board_Game_Rank"].replace({"Not Ranked":"22000"})
df_content.head(2)

Unnamed: 0,Board_Game_Rank,Abstract Strategy,Action / Dexterity,Adventure,Age of Reason,American Civil War,American Indian Wars,American Revolutionary War,American West,Ancient,...,Video Game Theme: SEGA,Video Game Theme: Sonic the Hedgehog,Video Game Theme: Super Mario Bros.,Video Game Theme: Tetris,Video Game Theme: The Oregon Trail,Webcomics: Dork Tower,Webcomics: Penny Arcade,Word Games: First Letter Given,Word Games: Guess the Word,Word Games: Spelling / Letters
30549,106,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
822,191,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
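
As an alternative sketch to the string replacement above, `pd.to_numeric` with `errors="coerce"` turns any non-numeric entry into NaN, which can then be filled with the same placeholder. The toy series and the id `1` are made up for the example.

```python
import pandas as pd

# Toy rank column stored as strings; id 1 is a made-up unranked game
ranks = pd.Series(["106", "Not Ranked", "191"], index=[30549, 1, 822])

sentinel = 22000  # same placeholder value as in the cell above
numeric_ranks = (
    pd.to_numeric(ranks, errors="coerce")  # "Not Ranked" becomes NaN
    .fillna(sentinel)
    .astype("uint32")
)
print(numeric_ranks.tolist())
```

This variant would also catch other unexpected non-numeric strings, which may or may not be desirable depending on how strictly the column should be validated.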


We can apply type conversions to reduce memory usage. Then we save the resulting dataframe for deployment.

In [None]:
# Assign the converted column by name so the new dtype is actually stored
df_content["Board_Game_Rank"] = df_content["Board_Game_Rank"].astype("uint32")
df_content.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21348 entries, 30549 to 165946
Columns: 2910 entries, Board_Game_Rank to Word Games: Spelling / Letters
dtypes: int8(2909), uint32(1)
memory usage: 60.0 MB
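
To see why the small integer dtypes matter, the sketch below compares the memory of the same toy frame of zeros stored as int64 and as int8. The frame size is arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# 1000 rows x 50 columns of zeros, first with the default 8-byte integers
wide = pd.DataFrame(np.zeros((1000, 50), dtype="int64"))
narrow = wide.astype("int8")  # 1 byte per cell is enough for 0/1 flags

bytes_int64 = wide.memory_usage(index=False).sum()
bytes_int8 = narrow.memory_usage(index=False).sum()
print(bytes_int64, bytes_int8)
```

The int8 version needs one eighth of the space, which adds up quickly with thousands of one-hot columns.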



In [None]:
# Save for the deployment
df_content.to_csv("df_content.csv")
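
CSV files do not store dtypes, so the compact types are lost on reload unless they are requested again. A minimal sketch of reloading with explicit dtypes, using an in-memory buffer in place of the real file and a toy two-column frame:

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"Board_Game_Rank": [106, 191], "Medical": [1, 0]},
    index=pd.Index([30549, 822], name="id"),
).astype({"Board_Game_Rank": "uint32", "Medical": "int8"})

buffer = io.StringIO()
toy.to_csv(buffer)  # in-memory stand-in for df_content.csv
buffer.seek(0)

# Request the compact dtypes explicitly when reading the file back
reloaded = pd.read_csv(
    buffer, index_col=0, dtype={"Board_Game_Rank": "uint32", "Medical": "int8"}
)
print(reloaded.dtypes)
```

For the full frame, a dtype mapping could be built from the column names in the same way; whether that is worth the effort depends on how the deployed application loads the file.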