# 01 · Feature Engineering

This notebook loads `games.parquet` (downloaded by the scraper) and produces:
- **TF-IDF vectors** from categories and mechanics
- **Bayesian popularity** (a shrunken mean weighted by the number of ratings)
- Normalized numeric features

The result is saved as `data/games_features.parquet` and used in later steps.

In [1]:
import os

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
import joblib

DATA_DIR = "data"
games_path = os.path.join(DATA_DIR, "games.parquet")
assert os.path.exists(games_path), "Run 00_bgg_scraper.ipynb first to create games.parquet"

games = pd.read_parquet(games_path)
print("Games loaded:", len(games))
games.head(2)

Games loaded: 5


Unnamed: 0,bgg_id,name,min_players,max_players,playing_time,year,categories,mechanics,description,rating,rating_count,weight
0,173346,7 Wonders Duel,2,2,30,2015,ancient card game city building civilization e...,end game bonuses income melding and splaying m...,In many ways 7 Wonders Duel resembles its pare...,8.08275,104654,2.2267
1,9209,Ticket to Ride,2,5,60,2004,trains,connections contracts end game bonuses hand ma...,"With elegantly simple gameplay, Ticket to Ride...",7.38782,95778,1.8216


## TF-IDF from categories and mechanics

In [2]:
games["tags"] = (games["categories"].fillna("") + " " + games["mechanics"].fillna("")).str.lower()

vec = TfidfVectorizer(token_pattern=r"[a-zA-Z\-]+")
X_tags = vec.fit_transform(games["tags"])
print("Matrix shape:", X_tags.shape)

# save the fitted vectorizer for later use
joblib.dump(vec, os.path.join(DATA_DIR, "tfidf_vectorizer.pkl"))

Matrix shape: (5, 101)


['data\\tfidf_vectorizer.pkl']
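
To make the tokenization concrete, here is a minimal sketch on made-up tag strings (they are not taken from `games.parquet`). It reuses the same `token_pattern`, so a hyphenated tag stays a single token, and it uses cosine similarity only as one possible way to inspect the resulting vectors; the names `toy_tags`, `vec_demo`, `X_demo` and `sim` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy tag strings, invented for illustration only
toy_tags = [
    "card-game drafting set-collection",
    "card-game drafting civilization",
    "trains route-building",
]

# Same token pattern as in the cell above: runs of letters and hyphens form one token
vec_demo = TfidfVectorizer(token_pattern=r"[a-zA-Z\-]+")
X_demo = vec_demo.fit_transform(toy_tags)

# Games that share tags get a positive similarity; games with no shared tags get 0
sim = cosine_similarity(X_demo)

print("Vocabulary:", sorted(vec_demo.vocabulary_))
print("Matrix shape:", X_demo.shape)
print(sim.round(2))
```

In this toy corpus the first two games share two tags and so score as similar, while the third shares none with either.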

## Bayesian popularity (shrunken mean)
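
Written out, the score computed in the next cell is a weighted average of a game's own rating and the global mean. The subscript $i$ for a single game is notation added here; $R$, $v$, $C$ and $m$ match the variable names in the code.

```latex
\text{bayes}_i \;=\; \frac{v_i}{v_i + m}\, R_i \;+\; \frac{m}{v_i + m}\, C
```

Here $R_i$ is the game's `rating`, $v_i$ its `rating_count`, $C$ the mean of all nonzero ratings, and $m$ the 60th percentile of the positive rating counts. As $v_i$ grows the score approaches $R_i$; as $v_i$ approaches zero it approaches $C$.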

In [3]:
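# R: per-game rating, v: number of ratings (missing values treated as 0)
R = games["rating"].fillna(0)
v = games["rating_count"].fillna(0)
# C: global mean rating, ignoring games with no rating
C = R.replace(0, np.nan).mean()
# m: prior weight, the 60th percentile of the positive rating counts
m = np.percentile(v[v>0], 60) if (v>0).any() else 0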

bayes = ((v/(v+m))*R + (m/(v+m))*C).fillna(0)
scaler = MinMaxScaler()
games["pop_norm"] = scaler.fit_transform(bayes.values.reshape(-1,1)).ravel()

games[["name","rating","rating_count","pop_norm"]].head(10)

Unnamed: 0,name,rating,rating_count,pop_norm
0,7 Wonders Duel,8.08275,104654,0.68676
1,Ticket to Ride,7.38782,95778,0.0
2,Pandemic,7.52026,132190,0.075527
3,7 Wonders,7.66761,110220,0.259467
4,Gloomhaven,8.55433,65736,1.0
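
The effect of the shrinkage is hard to see in the table above because all five games have very large rating counts. The following sketch uses hypothetical values for the global mean and the prior weight (`C_demo`, `m_demo`) and a hypothetical helper `shrunken_mean` to show how a game with few ratings is pulled toward the mean, and how min-max scaling then maps the lowest score to 0 and the highest to 1.

```python
def shrunken_mean(rating, count, prior_mean, prior_weight):
    """Weighted average of a game's own rating and the global mean."""
    total = count + prior_weight
    if total == 0:
        return 0.0  # mirrors the .fillna(0) fallback in the cell above
    return (count / total) * rating + (prior_weight / total) * prior_mean


C_demo = 7.5   # hypothetical global mean rating
m_demo = 1000  # hypothetical prior weight (number of "virtual" ratings)

few = shrunken_mean(9.0, 50, C_demo, m_demo)        # rated 9.0 by only 50 users
many = shrunken_mean(9.0, 100_000, C_demo, m_demo)  # rated 9.0 by 100,000 users
third = shrunken_mean(7.0, 20_000, C_demo, m_demo)  # rated 7.0 by 20,000 users

# Min-max scaling: the lowest score maps to 0.0, the highest to 1.0
scores = [few, many, third]
lo, hi = min(scores), max(scores)
scaled = [(s - lo) / (hi - lo) for s in scores]

print([round(s, 3) for s in scores])
print([round(s, 3) for s in scaled])
```

The same 9.0 rating yields a score close to the global mean with 50 ratings and close to 9.0 with 100,000 ratings. This is also why `pop_norm` in the table above is exactly 0.0 for one game and 1.0 for another: the scaling is relative to the games in the dataset.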


## Saving the dataset for subsequent notebooks

In [4]:
out_path = os.path.join(DATA_DIR, "games_features.parquet")
games.to_parquet(out_path, index=False)
print("Saved:", out_path)
games.head(3)

Saved: data\games_features.parquet


Unnamed: 0,bgg_id,name,min_players,max_players,playing_time,year,categories,mechanics,description,rating,rating_count,weight,tags,pop_norm
0,173346,7 Wonders Duel,2,2,30,2015,ancient card game city building civilization e...,end game bonuses income melding and splaying m...,In many ways 7 Wonders Duel resembles its pare...,8.08275,104654,2.2267,ancient card game city building civilization e...,0.68676
1,9209,Ticket to Ride,2,5,60,2004,trains,connections contracts end game bonuses hand ma...,"With elegantly simple gameplay, Ticket to Ride...",7.38782,95778,1.8216,trains connections contracts end game bonuses ...,0.0
2,30549,Pandemic,2,4,45,2008,medical travel,action points chaining contracts cooperative g...,"In Pandemic, several virulent diseases have br...",7.52026,132190,2.3956,medical travel action points chaining contract...,0.075527
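
A later notebook could pick up the saved vectorizer roughly as follows. This is a sketch under assumptions, not the actual downstream code: it fits a small stand-in vectorizer and round-trips it through a temporary directory so the example runs on its own, whereas the real pipeline would load `data/tfidf_vectorizer.pkl` directly. The names `fitted`, `restored` and `query_vec` are illustrative.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for the vectorizer fitted earlier in this notebook (toy corpus)
fitted = TfidfVectorizer(token_pattern=r"[a-zA-Z\-]+").fit(
    ["trains connections contracts", "medical travel cooperative"]
)

# Round-trip through joblib, the same way the notebook persists its vectorizer
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")
    joblib.dump(fitted, pkl_path)
    restored = joblib.load(pkl_path)

# transform() reuses the stored vocabulary and IDF weights;
# tokens that were not seen during fitting ("dragons") are ignored
query_vec = restored.transform(["trains travel dragons"])
print("Query vector shape:", query_vec.shape)
print("Nonzero entries:", query_vec.nnz)
```

Using `transform` rather than `fit_transform` on new text keeps the columns aligned with the matrix built in this notebook.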
