# Board Games Rank Prediction

# Project description
Aim of this project is to predict what will be the rank of a board game with given attributes.

# Libraries and dependencies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Dataset
Original dataset comes from [Kaggle](https://www.kaggle.com/datasets/joebeachcapital/board-games).
Dataset used in the project consists of two merged files: 
- File *details* containing basic information about board games available on the site,
- File *ratings* containing games' ratings

Data has been gathered on Feb 2024.

In [2]:
df_details = pd.read_csv("Data/details.csv", 
                         index_col="id"
                        )
df_ratings = pd.read_csv("Data/ratings.csv", 
                        index_col="id",
                         usecols=["id", "year", "rank", "average", "bayes_average", "users_rated"]
                        )
df = pd.merge(df_details, df_ratings, how="left", on="id")

In [3]:
df.head()

Unnamed: 0_level_0,num,primary,description,yearpublished,minplayers,maxplayers,playingtime,minplaytime,maxplaytime,minage,...,boardgamepublisher,owned,trading,wanting,wishing,year,rank,average,bayes_average,users_rated
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
30549,0,Pandemic,"In Pandemic, several virulent diseases have br...",2008,2,4,45,45,45,8,...,"['Z-Man Games', 'Albi', 'Asmodee', 'Asmodee It...",168364,2508,625,9344,2008,106,7.59,7.487,108975
822,1,Carcassonne,Carcassonne is a tile-placement game in which ...,2000,2,5,45,30,45,7,...,"['Hans im Glück', '999 Games', 'Albi', 'Bard C...",161299,1716,582,7383,2000,190,7.42,7.309,108738
13,2,Catan,"In CATAN (formerly The Settlers of Catan), pla...",1995,3,4,120,60,120,10,...,"['KOSMOS', '999 Games', 'Albi', 'Asmodee', 'As...",167733,2018,485,5890,1995,429,7.14,6.97,108024
68448,3,7 Wonders,You are the leader of one of the 7 great citie...,2010,2,7,30,30,30,10,...,"['Repos Production', 'ADC Blackfire Entertainm...",120466,1567,1010,12105,2010,73,7.74,7.634,89982
36218,4,Dominion,"&quot;You are a monarch, like your parents bef...",2008,2,4,30,30,30,13,...,"['Rio Grande Games', '999 Games', 'Albi', 'Bar...",106956,2009,655,8621,2008,104,7.61,7.499,81561


In [4]:
df.shape

(21631, 27)

In [5]:
df.describe()

Unnamed: 0,num,yearpublished,minplayers,maxplayers,playingtime,minplaytime,maxplaytime,minage,owned,trading,wanting,wishing,year,rank,average,bayes_average,users_rated
count,21631.0,21631.0,21631.0,21631.0,21631.0,21631.0,21631.0,21631.0,21631.0,21631.0,21631.0,21631.0,21631.0,21631.0,21631.0,21631.0,21631.0
mean,10815.0,1986.09491,2.007027,5.709491,90.509177,63.647774,90.509177,9.611391,1487.924553,43.585965,42.030373,233.655587,1988.10129,10879.522352,6.417249,5.683664,874.548518
std,6244.476172,210.042496,0.688957,15.102385,534.826511,447.213702,534.826511,3.640562,5395.077773,102.410851,117.940355,800.657809,190.115056,6311.917913,0.929345,0.366096,3695.946026
min,0.0,-3500.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.04,0.0,30.0
25%,5407.5,2001.0,2.0,4.0,25.0,20.0,25.0,8.0,150.0,5.0,3.0,14.0,2001.0,5408.5,5.83,5.51,57.0
50%,10815.0,2011.0,2.0,4.0,45.0,30.0,45.0,10.0,322.0,13.0,9.0,39.0,2011.0,10839.0,6.45,5.546,124.0
75%,16222.5,2017.0,2.0,6.0,90.0,60.0,90.0,12.0,903.5,38.0,29.0,131.0,2017.0,16356.5,7.04,5.678,397.0
max,21630.0,2023.0,10.0,999.0,60000.0,60000.0,60000.0,25.0,168364.0,2508.0,2011.0,19325.0,3500.0,21831.0,9.57,8.511,108975.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21631 entries, 30549 to 165946
Data columns (total 27 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   num                      21631 non-null  int64  
 1   primary                  21631 non-null  object 
 2   description              21630 non-null  object 
 3   yearpublished            21631 non-null  int64  
 4   minplayers               21631 non-null  int64  
 5   maxplayers               21631 non-null  int64  
 6   playingtime              21631 non-null  int64  
 7   minplaytime              21631 non-null  int64  
 8   maxplaytime              21631 non-null  int64  
 9   minage                   21631 non-null  int64  
 10  boardgamecategory        21348 non-null  object 
 11  boardgamemechanic        20041 non-null  object 
 12  boardgamefamily          17870 non-null  object 
 13  boardgameexpansion       5506 non-null   object 
 14  boardgameimplemen

In [7]:
df.isnull().sum()

num                            0
primary                        0
description                    1
yearpublished                  0
minplayers                     0
maxplayers                     0
playingtime                    0
minplaytime                    0
maxplaytime                    0
minage                         0
boardgamecategory            283
boardgamemechanic           1590
boardgamefamily             3761
boardgameexpansion         16125
boardgameimplementation    16769
boardgamedesigner            596
boardgameartist             5907
boardgamepublisher             1
owned                          0
trading                        0
wanting                        0
wishing                        0
year                           0
rank                           0
average                        0
bayes_average                  0
users_rated                    0
dtype: int64

In [8]:
df.duplicated().sum()

0

Dataset consists **mainly** of numeric columns with non-null values. There are some strings and lists objects as well, which may be null. There are no duplicated rows.

# Exploratory Data analysis (EDA)

## Data cleaning and Preprocessing

In [None]:
# Convert pd.NA to empty string
df['game_category'].fillna('', inplace=True)

### Drop unnecessary columns

Remove any columns that are not relevant to your analysis. Use .drop() in pandas.

In [9]:
df.columns

Index(['num', 'primary', 'description', 'yearpublished', 'minplayers',
       'maxplayers', 'playingtime', 'minplaytime', 'maxplaytime', 'minage',
       'boardgamecategory', 'boardgamemechanic', 'boardgamefamily',
       'boardgameexpansion', 'boardgameimplementation', 'boardgamedesigner',
       'boardgameartist', 'boardgamepublisher', 'owned', 'trading', 'wanting',
       'wishing', 'year', 'rank', 'average', 'bayes_average', 'users_rated'],
      dtype='object')

In [10]:
df = df.drop(["num", "description", "boardgameexpansion", "boardgameimplementation",
             "boardgameartist", "boardgamepublisher", "year"], axis=1)

### Rename columns

In [11]:
cols_dict = {"primary": "name",
            "yearpublished": "year_published",
            "minplayers": "min_players",
            "maxplayers": "max_players",
            "playingtime": "playing_time",
            "minplaytime": "min_playtime",
            "maxplaytime": "max_playtime",
            "minage": "min_age",
            "boardgamecategory": "game_category",
            "boardgamemechanic": "game_mechanic",
            "boardgamefamily": "game_family",
            "boardgamedesigner": "game_designer"
            }
df = df.rename(cols_dict, axis=1)
df.columns

Index(['name', 'year_published', 'min_players', 'max_players', 'playing_time',
       'min_playtime', 'max_playtime', 'min_age', 'game_category',
       'game_mechanic', 'game_family', 'game_designer', 'owned', 'trading',
       'wanting', 'wishing', 'rank', 'average', 'bayes_average',
       'users_rated'],
      dtype='object')

### Reorder columns

In [20]:
# List of columns to reorder
reordered_columns = ["rank", "name", "average", "bayes_average", "owned", "wanting", "wishing", "users_rated"]

# List of columns that are not being reordered
remaining_columns = [col for col in df.columns if col not in reordered_columns]

# Concatenate reordered columns with the remaining columns
df_reordered = pd.concat([df[reordered_columns], df[remaining_columns]], axis=1)

In [97]:
df = df_reordered.sort_values("rank")
df.head()

Unnamed: 0_level_0,rank,name,average,bayes_average,owned,wanting,wishing,users_rated,year_published,min_players,max_players,playing_time,min_playtime,max_playtime,min_age,game_category,game_mechanic,game_family,game_designer,trading
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
174430,1,Gloomhaven,8.74,8.511,77758,1346,17658,47827,2017,1,4,120,60,120,14,"['Adventure', 'Exploration', 'Fantasy', 'Fight...","['Action Queue', 'Action Retrieval', 'Campaign...","['Category: Dungeon Crawler', 'Components: Min...",['Isaac Childres'],648
161936,2,Pandemic Legacy: Season 1,8.59,8.442,70830,831,11729,45041,2015,2,4,60,60,60,13,"['Environmental', 'Medical']","['Action Points', 'Cooperative Game', 'Hand Ma...","['Components: Map (Global Scale)', 'Components...","['Rob Daviau', 'Matt Leacock']",327
224517,3,Brass: Birmingham,8.66,8.418,38126,1522,11846,25484,2018,2,4,120,60,120,14,"['Economic', 'Industry / Manufacturing', 'Post...","['Hand Management', 'Income', 'Loans', 'Market...","['Cities: Birmingham (England)', 'Country: Eng...","['Gavan Brown', 'Matt Tolman', 'Martin Wallace']",128
167791,4,Terraforming Mars,8.42,8.274,101872,2011,19227,74216,2016,1,5,120,120,120,12,"['Economic', 'Environmental', 'Industry / Manu...","['Drafting', 'End Game Bonuses', 'Hand Managem...","['Components: Map (Global Scale)', 'Components...",['Jacob Fryxelius'],538
233078,5,Twilight Imperium: Fourth Edition,8.68,8.262,20542,986,8984,16025,2017,3,6,480,240,480,14,"['Civilization', 'Economic', 'Exploration', 'N...","['Action Drafting', 'Area Majority / Influence...","['Components: Hexagonal Tiles', 'Components: M...","['Dane Beltrami', 'Corey Konieczka', 'Christia...",120


In [43]:
# Create dummy variables for 'game_category'
dummy_categories = pd.get_dummies(df['game_category'].apply(lambda x: pd.Series(x)).stack()).groupby(level=0).sum()

# Rename the dummy columns with the prefix "cat_"
dummy_categories.columns = ['cat_' + col.lower().replace(' ', '_') for col in dummy_categories.columns]

# Replace NaN values with 0
dummy_categories.fillna(0, inplace=True)

# Concatenate the dummy variables with the original DataFrame
df_dummy_categories = pd.concat([df, dummy_categories], axis=1)

In [49]:
# Create dummy variables for 'game_category'
dummy_mechanic = pd.get_dummies(df['game_mechanic'].apply(lambda x: pd.Series(x)).stack()).groupby(level=0).sum()

# Rename the dummy columns with the prefix "cat_"
dummy_mechanic.columns = ['cat_' + col.lower().replace(' ', '_') for col in dummy_mechanic.columns]

# Replace NaN values with 0
dummy_mechanic.fillna(0, inplace=True)

# Concatenate the dummy variables with the original DataFrame
df_dummy_mechanic = pd.concat([df, dummy_mechanic], axis=1)
df_dummy_mechanic

Unnamed: 0_level_0,rank,name,average,bayes_average,owned,wanting,wishing,users_rated,year_published,min_players,...,"cat_['variable_player_powers',_'variable_set-up']","cat_['variable_player_powers',_'voting']","cat_['variable_player_powers',_'worker_placement']",cat_['variable_player_powers'],cat_['victory_points_as_a_resource'],cat_['voting'],"cat_['worker_placement',_'worker_placement_with_dice_workers']","cat_['worker_placement',_'worker_placement,_different_worker_types']",cat_['worker_placement'],"cat_['worker_placement,_different_worker_types']"
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,318,Die Macher,7.61,7.100,7532,506,2055,5362,1986,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4036,Dragonmaster,6.64,5.782,1289,72,191,562,1981,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,231,Samurai,7.45,7.239,15634,805,3456,15203,1998,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5382,Tal der Könige,6.60,5.679,640,55,123,341,1992,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,292,Acquire,7.34,7.141,23795,557,2696,18719,1964,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
349129,19045,Snowhere,4.97,5.492,79,7,37,35,2021,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
349131,12471,Splitter,6.46,5.533,130,11,50,65,2021,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
349805,11624,Bismarck Solitaire,7.42,5.539,199,10,51,38,2021,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
350736,3924,Voyages,7.80,5.793,1157,14,164,294,2021,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [79]:
import ast
df['game_category'].fillna('[]', inplace=True)

# Convert string representations of lists to actual lists using ast.literal_eval
df['game_category'] = df['game_category'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else [])

# Split the categories into separate columns
df_categories = pd.DataFrame(df['game_category'].tolist())

# Add prefix "cat_" to each column
df_categories.columns = ['cat_' + col.lower().replace(' ', '_') for col in df_categories.columns]

# Create dummy variables for each category
dummy_categories = pd.get_dummies(df_categories.apply(lambda x: x.str.lower().str.replace(' ', '_')))
dummy_categories
# Check if df_categories is empty
if not df_categories.empty:
    # Concatenate the dummy variables with df_categories
    df1 = pd.concat([df, dummy_categories], axis=1)
else:
    # Assign dummy_categories to df
    df1 = pd.concat([df, dummy_categories], axis=1)

df1


AttributeError: 'int' object has no attribute 'lower'

In [57]:
dummy_categories

Unnamed: 0,0_abstract_strategy,0_action_/_dexterity,0_adventure,0_age_of_reason,0_american_civil_war,0_american_indian_wars,0_american_revolutionary_war,0_american_west,0_ancient,0_animals,...,11_renaissance,11_sports,11_wargame,11_world_war_ii,12_real-time,12_science_fiction,12_territory_building,12_world_war_ii,13_space_exploration,13_territory_building
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21626,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21627,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21628,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
21629,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [52]:
df['game_category'].tolist()

["['Adventure', 'Exploration', 'Fantasy', 'Fighting', 'Miniatures']",
 "['Environmental', 'Medical']",
 "['Economic', 'Industry / Manufacturing', 'Post-Napoleonic', 'Transportation']",
 "['Economic', 'Environmental', 'Industry / Manufacturing', 'Science Fiction', 'Space Exploration', 'Territory Building']",
 "['Civilization', 'Economic', 'Exploration', 'Negotiation', 'Political', 'Science Fiction', 'Space Exploration', 'Wargame']",
 "['Adventure', 'Exploration', 'Fantasy', 'Fighting', 'Miniatures']",
 "['Economic', 'Science Fiction', 'Space Exploration', 'Territory Building']",
 "['Civil War', 'Fighting', 'Miniatures', 'Movies / TV / Radio theme', 'Science Fiction', 'Wargame']",
 "['Card Game', 'Civilization', 'Economic']",
 "['Adventure', 'Fantasy', 'Fighting', 'Miniatures', 'Novel-based', 'Territory Building', 'Wargame']",
 "['Age of Reason', 'Environmental', 'Fantasy', 'Fighting', 'Mythology', 'Renaissance', 'Territory Building']",
 "['American West', 'Animals', 'Economic']",
 "['Mo

In [51]:
# Split the categories into separate columns
df_categories = pd.DataFrame(df['game_category'].tolist(), columns=['cat_{}'.format(i+1) for i in range(len(df['game_category'].iloc[0]))])
df_categories

ValueError: Shape of passed values is (21631, 1), indices imply (21631, 65)

In [50]:
# Create dummy variables for each category
dummy_categories = pd.get_dummies(df_categories.apply(lambda x: x.str.lower().str.replace(' ', '_')))
dummy_categories

ValueError: Shape of passed values is (21631, 1), indices imply (21631, 65)

In [105]:
df = df_reordered.sort_values("rank")

df['game_category'] = df['game_category'].fillna('')

# Remove square brackets and empty quotes from the string representation
df['game_category'] = df['game_category'].str.replace(r"\[|\]|\'", "")

# Split the string into a list
df['game_category'] = df['game_category'].str.split(", ")

# Split the categories into separate columns
df_categories = pd.DataFrame(df['game_category'].tolist())

# Rename columns, replacing NaN values with an empty string
df_categories.columns = ['cat_' + str(col).lower().replace(' ', '_') if pd.notna(col) else '' for col in df_categories.columns]

# Create dummy variables for each category
dummy_categories = pd.get_dummies(df_categories.apply(lambda x: x.str.lower().str.replace(' ', '_')))



  


In [106]:
dummy_categories

Unnamed: 0,cat_0_,"cat_0_""childrens_game""",cat_0_abstract_strategy,cat_0_action_/_dexterity,cat_0_adventure,cat_0_age_of_reason,cat_0_american_civil_war,cat_0_american_indian_wars,cat_0_american_revolutionary_war,cat_0_american_west,...,cat_11_renaissance,cat_11_sports,cat_11_wargame,cat_11_world_war_ii,cat_12_real-time,cat_12_science_fiction,cat_12_territory_building,cat_12_world_war_ii,cat_13_space_exploration,cat_13_territory_building
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21626,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21627,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21628,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
21629,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [48]:
df_dummy_categories[df_dummy_categories["cat_['video_game_theme']"]==1].sort_values("rank")

Unnamed: 0_level_0,rank,name,average,bayes_average,owned,wanting,wishing,users_rated,year_published,min_players,...,cat_['video_game_theme'],"cat_['vietnam_war',_'wargame']","cat_['wargame',_'world_war_i']","cat_['wargame',_'world_war_ii',_'zombies']","cat_['wargame',_'world_war_ii']","cat_['wargame',_'zombies']",cat_['wargame'],cat_['word_game'],cat_['world_war_ii'],cat_['zombies']
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
293889,1299,Fallout Shelter: The Board Game,7.3,6.376,3948,198,1288,1734,2020,2,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
269603,1688,Minecraft: Builders & Biomes,7.06,6.226,4109,65,459,1613,2019,2,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
254088,3767,Gravity Superstar,6.76,5.811,782,29,162,574,2018,2,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
278120,3899,God of War: The Card Game,6.75,5.795,1723,31,281,625,2019,1,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
279741,8139,Devil May Cry: The Bloody Palace,7.98,5.585,316,12,79,112,2021,1,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
285992,9441,Pac-Man: The Board Game,6.2,5.562,675,7,38,197,2019,2,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
283713,13915,Final Fantasy XIV: Gold Saucer Cactpot Party,6.68,5.524,98,3,15,40,2019,2,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
276264,15602,Dominate Grail War: Fate/Stay night on Board Game,6.94,5.514,49,4,15,40,2019,3,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
223762,16612,Atari's Missile Command,5.85,5.509,274,2,37,61,2018,3,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
223763,16845,Atari's Centipede,5.79,5.507,583,9,53,143,2017,2,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Feature engineering

In [23]:
# Explode the list in the 'Designer' column into separate rows
df_exploded = df.explode('game_category')

# Apply one-hot encoding using get_dummies()
game_category_dummies = pd.get_dummies(df_exploded['game_category'], prefix='cat')

# Concatenate the original DataFrame with the one-hot encoded DataFrame
df_encoded = pd.concat([df, game_category_dummies], axis=1)

# # Drop the original 'Designer' column as it's no longer needed
# df_encoded.drop('Designer', axis=1, inplace=True)

df_encoded


Unnamed: 0_level_0,rank,name,average,bayes_average,owned,wanting,wishing,users_rated,year_published,min_players,...,cat_['Video Game Theme'],"cat_['Vietnam War', 'Wargame']","cat_['Wargame', 'World War I']","cat_['Wargame', 'World War II', 'Zombies']","cat_['Wargame', 'World War II']","cat_['Wargame', 'Zombies']",cat_['Wargame'],cat_['Word Game'],cat_['World War II'],cat_['Zombies']
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
174430,1,Gloomhaven,8.74,8.511,77758,1346,17658,47827,2017,1,...,0,0,0,0,0,0,0,0,0,0
161936,2,Pandemic Legacy: Season 1,8.59,8.442,70830,831,11729,45041,2015,2,...,0,0,0,0,0,0,0,0,0,0
224517,3,Brass: Birmingham,8.66,8.418,38126,1522,11846,25484,2018,2,...,0,0,0,0,0,0,0,0,0,0
167791,4,Terraforming Mars,8.42,8.274,101872,2011,19227,74216,2016,1,...,0,0,0,0,0,0,0,0,0,0
233078,5,Twilight Imperium: Fourth Edition,8.68,8.262,20542,986,8984,16025,2017,3,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7316,21827,Bingo,2.88,3.971,1681,2,27,2282,1530,2,...,0,0,0,0,0,0,0,0,0,0
5048,21828,Candy Land,3.18,3.801,6247,4,67,4238,1949,2,...,0,0,0,0,0,0,0,0,0,0
5432,21829,Chutes and Ladders,2.87,3.614,4818,4,61,4035,-200,2,...,0,0,0,0,0,0,0,0,0,0
11901,21830,Tic-Tac-Toe,2.70,3.575,1447,9,29,3436,-1300,2,...,0,0,0,0,0,0,0,0,0,0


In [13]:
pd.get_dummies(df, columns=['waterfront', 'view', 'condition', 'grade'])

### Encode categorical variables

If you have categorical variables like ‘gender’ or ‘color’, encode them as integers to prepare for modeling. You can use label encoding, one-hot encoding, or target encoding.

### Normalize/standardize numeric variables

If you have numeric features on different scales, normalize or standardize them to the same scale. This helps prevent features with larger ranges from dominating.

### Handle outliers

You may choose to cap outliers at a certain value, winsorize them, or remove them, depending on your analysis goals.

Those are some of the main techniques for preprocessing your data as part of exploratory data analysis. The goal is to transform your raw data into a form that’s easier to analyze and model.

## Data Visualization

Now we can visualize the data to gain insights:

    .hist() for histograms
    .plot() for line plots, scatter plots, etc.
    .value_counts() to see counts of categorical variables
    .describe() for summary statistics

### Histograms

Use .hist() in pandas to get a visual representation of the distribution of a numeric variable. This can reveal outliers, skewness, and other patterns.

### Box plots

Use .boxplot() in pandas to visualize the distribution through quartiles, extremes, and outliers for a numeric variable.

### Scatter plots

Use .plot(kind=’scatter’) to visualize the relationship between two numeric variables. This can reveal correlations, clusters, and outliers.

### Bar plots

Use .plot(kind=’bar’) to compare categorical variables or the counts of categorical variables. This gives a quick visual summary.

### Line plots

Use .plot(kind=’line’) to visualize trends over time for time series data.

### Pair plots

Use seaborn pairplot() to visualize the relationships between all variables in a dataset.

### Correlation heatmaps

Use a seaborn heatmap() to visualize the correlation between all numeric variables.

### Pie charts

Use .plot(kind=’pie’) to visualize the proportional breakdown of a categorical variable.

### Word clouds

Generate a word cloud to visualize the most common words in a text column.

### Descriptive statistics

Use .describe() to get summary stats like count, mean, standard deviation, minimum, and maximum for numeric columns.

These techniques help you gain a quick understanding of your data and reveal patterns, outliers, and relationships that you can then investigate further. Data visualization is a critical part of the exploratory data analysis process.

# Model selection