# Combine Description with Board Game Dataset

In this notebook, we attach each game's text description to the board game dataset and then clean the description text.

Our approach follows the procedures outlined in the Week 1 exercise.

To begin, we unescape HTML entities (such as `&mdash;` and `&quot;`) and convert all text to lowercase, then remove punctuation and stop words. Finally, we apply stemming to reduce the remaining words to their stems.

In [9]:
import pandas as pd
import numpy as np

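# Attach the descriptions in description.csv to the games in board-game.csv.
# pd.concat(axis=1) pairs rows by index, so this assumes both files
# list the same games in the same order.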
df_description = pd.read_csv("data/description.csv")
df = pd.read_csv("data/board-game.csv")
df = pd.concat([df, df_description], axis=1)
df.head(10)

Unnamed: 0,Name,Year,min_players,max_players,min_playtime,max_playtime,min_age,category,mechanic,userrated,avg_rate,rank,owned,trading,wanting,wishing,num_comments,num_weights,avg_weight,Description
0,Brian Boru: High King of Ireland,2021,3,5,60,90,14,"['Card Game', 'Medieval', 'Territory Building']","['Area Majority / Influence', 'Closed Drafting...",3003,7.55005,774,4200,76,359,1561,522,93,2.4516,"In Brian Boru: High King of Ireland, you striv..."
1,Jump Drive,2017,2,4,10,30,13,"['Card Game', 'Science Fiction', 'Space Explor...","['Hand Management', 'Simultaneous Action Selec...",4510,6.99382,1123,6348,229,215,1233,1097,106,2.0189,"With the invention of Jump Drive, the race for..."
2,DVONN,2001,2,2,30,30,9,['Abstract Strategy'],"['Grid Movement', 'Map Reduction']",4960,7.45702,591,6917,130,291,1318,1481,386,2.6632,DVONN is played on an elongated hexagonal boar...
3,Point Salad,2019,2,6,15,30,8,['Card Game'],"['Open Drafting', 'Set Collection']",17480,7.17658,462,29303,300,304,2505,2431,353,1.153,Point Salad is a fast and fun card drafting ga...
4,Linko,2014,2,5,20,20,10,['Card Game'],"['Hand Management', 'Move Through Deck', 'Open...",4926,6.99363,1072,7998,167,116,662,1020,240,1.3917,"In Linko! (a.k.a. Abluxxen), you take turns pl..."
5,Irish Gauge,2014,3,5,60,60,12,"['Economic', 'Trains']","['Auction/Bidding', 'Hexagon Grid', 'Income', ...",3162,7.25133,1018,4984,108,235,1092,679,77,2.3506,Irish Gauge &mdash; one of three titles in Win...
6,Balloon Cup,2003,2,2,30,30,8,"['Aviation / Flight', 'Card Game']","['Hand Management', 'Set Collection', 'Take Th...",6004,6.67301,1453,6329,144,329,988,2085,654,1.4587,"In Balloon Cup, the players compete in several..."
7,Call of Cthulhu: The Card Game,2008,2,2,30,30,13,"['Card Game', 'Collectible Components', 'Fanta...","['Deck, Bag, and Pool Building', 'Hand Managem...",3016,6.89349,1648,5400,283,107,546,794,218,2.8945,&quot;The oldest and strongest emotion of mank...
8,Schotten Totten,1999,2,2,20,20,8,['Card Game'],"['Card Play Conflict Resolution', 'Hand Manage...",10788,7.35095,400,18275,208,274,1996,1995,450,1.7044,"In Schotten Totten, nine boundary stones lie b..."
9,Navegador,2010,2,5,60,90,12,"['Economic', 'Exploration', 'Nautical', 'Renai...","['Advantage Token', 'Area Movement', 'Market',...",9056,7.53886,289,8887,180,602,2116,1805,573,3.0855,This game is inspired by the Portuguese Age of...
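
Because `pd.concat(..., axis=1)` pairs rows by index label rather than by game, a quick sanity check before combining can be useful. The following is a minimal sketch, assuming both files are meant to hold the same games in the same order; `concat_columns_checked` is a hypothetical helper and the two small frames are toy data, not the real CSV contents.

```python
import pandas as pd

def concat_columns_checked(left, right):
    """Hypothetical helper: column-wise concat that refuses mismatched row counts."""
    if len(left) != len(right):
        raise ValueError(f"row counts differ: {len(left)} vs {len(right)}")
    # Reset both indexes so rows are paired purely by position.
    return pd.concat(
        [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
    )

# Toy stand-ins for board-game.csv and description.csv.
games = pd.DataFrame({"Name": ["DVONN", "Linko"], "Year": [2001, 2014]})
descriptions = pd.DataFrame({"Description": ["first text", "second text"]})
combined = concat_columns_checked(games, descriptions)
```

If the two files shared a key column, `DataFrame.merge` on that key would be a more robust choice than positional pairing; the output above does not show whether such a column exists in `description.csv`.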


In [10]:
# Unescape HTML entities (e.g. &mdash;, &quot;) and convert to lowercase
import html

df['Description'] = df['Description'].apply(html.unescape).str.lower()
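
As a hedged illustration of this step, the sketch below applies the same unescape-and-lowercase logic to two fragments taken from the table above. The helper `unescape_and_lower` is a hypothetical name, not part of the notebook. It also passes non-string values (such as a missing description read as `NaN`) through unchanged, since `html.unescape` expects a string. Whether this dataset contains missing descriptions is not known here, so treat the guard as a precaution.

```python
import html

def unescape_and_lower(value):
    """Hypothetical helper: unescape HTML entities, then lowercase."""
    # Assumption: missing descriptions may arrive as non-string values (e.g. NaN).
    if not isinstance(value, str):
        return value
    return html.unescape(value).lower()

quoted = unescape_and_lower("&quot;The oldest and strongest emotion")
dashed = unescape_and_lower("Irish Gauge &mdash; one of three titles")
missing = unescape_and_lower(float("nan"))  # returned unchanged
```

In the notebook this could be applied as `df['Description'].apply(unescape_and_lower)`.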

In [11]:
# Remove ASCII punctuation (the characters in string.punctuation)
import re
import string

regex = re.compile('[%s]' % re.escape(string.punctuation))
df['Description'] = df['Description'].apply(lambda x: regex.sub('', x))
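
One detail worth checking at this step: `string.punctuation` contains only ASCII characters, so Unicode marks introduced by unescaping (for example the em dash from `&mdash;`) pass through the regex untouched. The sketch below demonstrates this on a made-up sample string and shows one possible alternative that drops every character whose Unicode category starts with `P`. The function names are hypothetical, and the Unicode-aware variant is a swapped-in technique rather than what the notebook does.

```python
import re
import string
import unicodedata

ascii_punct = re.compile('[%s]' % re.escape(string.punctuation))

def strip_ascii_punctuation(text):
    """Same approach as the notebook cell: remove ASCII punctuation only."""
    return ascii_punct.sub('', text)

def strip_unicode_punctuation(text):
    """Alternative sketch: remove any character in a Unicode 'P*' category."""
    return ''.join(
        ch for ch in text if not unicodedata.category(ch).startswith('P')
    )

# "\u2014" is the em dash that html.unescape produces from "&mdash;".
sample = "irish gauge \u2014 one of three titles, (a.k.a. abluxxen)!"
ascii_only = strip_ascii_punctuation(sample)       # em dash survives
unicode_aware = strip_unicode_punctuation(sample)  # em dash removed
```

Note that the two variants are not interchangeable: `string.punctuation` also includes symbols such as `$`, `+` and `~`, which belong to Unicode symbol categories and would be kept by the `P*` filter.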

In [14]:
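# Remove stop words
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stop-word lists used by stopwords.words
stop_words = set(stopwords.words('english'))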

def remove_stopwords(text):
    words = word_tokenize(text)
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)

df['Description'] = df['Description'].apply(remove_stopwords)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\byx10\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
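
One ordering effect is worth keeping in mind here: because punctuation was stripped in the previous step, a contraction such as "don't" has already become "dont" by the time stop words are filtered, and as far as we can tell the NLTK English list only contains the apostrophe form. The sketch below mimics the filter with a tiny hand-written stop-word set, which is a stand-in assumption for `stopwords.words('english')`, and uses `str.split` in place of `word_tokenize` so that it runs without any NLTK data.

```python
# Stand-in for stopwords.words('english'); a tiny illustrative subset only.
toy_stop_words = {"you", "if", "the", "don't"}

def remove_stopwords_toy(text, stop_words=toy_stop_words):
    """Drop tokens that appear in stop_words (whitespace tokenization)."""
    return ' '.join(word for word in text.split() if word not in stop_words)

# "dont" is what "don't" looks like after the punctuation step.
filtered = remove_stopwords_toy("you dont score if you dont play the cards")
```

If such leftovers matter for later analysis, one option would be to add the apostrophe-free forms to `stop_words` before filtering.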


In [16]:
# Stemming
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
df['Description'] = df['Description'].apply(lambda x: " ".join(stemmer.stem(w) for w in x.split()))

In [20]:
# Save the merged dataset with the cleaned descriptions
df.to_csv("data/board-game-with-desc.csv", index=False)