# Chess Opening Recommender : Feature Engineering

We will extract style features from both the user's games and the elite games.  
Each row in the dataframe should have:  
- 'moves' (list of UCI strings or iterable)  
- 'result' (string: '1-0', '0-1', '1/2-1/2')

For each game, we compute:  
- `ply_count`: total number of plies (half-moves)  
- `avg_trades`: number of captures  
- `first_queen_ply`: ply index of first queen move (or ply_count+1 if never moved)  
- `castled_early`: true if castled by ply 20  
- `checks`: number of checks delivered  
- `result_score`: numeric game result (1.0=win, 0.5=draw, 0.0=loss)  
- `sacrifice_count`: number of material sacrifices  
- `queen_traded_early`: true if queen traded by ply 40  
- `dev_ply_avg`: average ply of minor piece development  
- `endgame_reached`: true if game reached endgame (by ply count or piece count)

We will then return a new dataframe with these features added to the original columns.

We will then, we summarize these per-game features into a single style vector for each player.

We will do this for the user and for all elite players. Inn the next notebook, we will cluster the elite style vectors and compare the user to their closest style neighbors.

In [9]:
from collections import defaultdict
from typing import Dict, List
from pathlib import Path

import chess
import numpy as np
import pandas as pd
from tqdm import tqdm

In [10]:
DATA_DIR = Path("/Users/nicholasvega/Downloads/chess-opening-recommender/data")
ELITE_PGN = DATA_DIR / "lichess_elite_2025-05.pgn"

## Extract Style Features for the User

Compute per‑game features for the target user:

In [5]:
PIECE_VALUES: Dict[int, int] = {
    chess.PAWN: 1,
    chess.KNIGHT: 3,
    chess.BISHOP: 3,
    chess.ROOK: 5,
    chess.QUEEN: 9,
    chess.KING: 0,
}

In [6]:
def _material_value(board: chess.Board) -> int:
    """Return simple material balance (white – black) using PIECE_VALUES."""
    total = 0
    for square, piece in board.piece_map().items():
        sign = 1 if piece.color == chess.WHITE else -1
        total += sign * PIECE_VALUES[piece.piece_type]
    return total

In [25]:
def extract_style_features(games_df: pd.DataFrame) -> pd.DataFrame:
    score_map = {"1-0": 1.0, "0-1": 0.0, "1/2-1/2": 0.5, "½-½": 0.5}
    recs = []

    for _, row in tqdm(games_df.iterrows(), total=len(games_df),
                    desc="Extracting style features"):

        moves = row.get("moves", [])
        if isinstance(moves, np.ndarray):
            moves = moves.tolist()
        elif moves is None:
            moves = []

        board = chess.Board()
        trades = checks = sacrifice_count = 0
        first_q = castled_ply = queen_trade_ply = None
        dev_first_plys = []
        prev_material = _material_value(board)

        for ply, uci in enumerate(moves, start=1):
            try:
                move = chess.Move.from_uci(uci)
            except ValueError:
                # malformed UCI (rare) – skip
                continue

            # **NEW** – skip moves that aren’t legal in the current position
            if not board.is_legal(move):
                continue

            # capture count
            if board.is_capture(move):
                trades += 1

            piece = board.piece_at(move.from_square)
            board.push(move)

            # queen movement
            if first_q is None and piece and piece.piece_type == chess.QUEEN:
                first_q = ply

            # castling detection
            if castled_ply is None and board.castling_rights == 0:
                castled_ply = ply

            # checks
            if board.is_check():
                checks += 1

            # queen trade detection
            if queen_trade_ply is None and (
                board.pieces(chess.QUEEN, chess.WHITE) == 0
                and board.pieces(chess.QUEEN, chess.BLACK) == 0
            ):
                queen_trade_ply = ply

        ply_count = len(moves)
        endgame_reached = ply_count >= 80 or len(board.piece_map()) <= 10

        recs.append(
            {
                **row.to_dict(),
                "ply_count": ply_count,
                "avg_trades": trades,
                "first_queen_ply": first_q or ply_count + 1,
                "castled_early": bool(castled_ply and castled_ply <= 20),
                "checks": checks,
                "result_score": score_map.get(row.get("result", ""), 0.0),
                "queen_traded_early": bool(queen_trade_ply and queen_trade_ply <= 40),
                "endgame_reached": endgame_reached
            }
        )

    return pd.DataFrame(recs)

Example Usage

In [26]:
PATH_TO_USER_GAMES = DATA_DIR / "Chessanonymous1_games.parquet"
user_games_df = pd.read_parquet(PATH_TO_USER_GAMES)

user_features_df = extract_style_features(user_games_df)
user_features_df.to_parquet(DATA_DIR/"Chessanonymous1_features.parquet")
display(user_features_df.head())

Extracting style features: 100%|██████████| 8499/8499 [00:04<00:00, 1849.51it/s]


Unnamed: 0,white,black,result,eco,opening,utc_date,utc_time,time_control,moves,ply_count,avg_trades,first_queen_ply,castled_early,checks,result_score,queen_traded_early,endgame_reached
0,Chessanonymous1,Miki000,1-0,A45,Trompowsky Attack: Classical Defense,2025.08.04,12:49:00,180+0,"[d2d4, g8f6, c1g5, e7e6, e2e3, f8e7, f1d3, e8g...",31,5,13,False,1,1.0,False,False
1,Miki000,Chessanonymous1,1-0,D15,Slav Defense: Chebanenko Variation,2025.08.04,12:44:17,180+0,"[d2d4, d7d5, c2c4, c7c6, b1c3, g8f6, g1f3, a7a...",75,17,33,False,6,1.0,False,False
2,Chessanonymous1,AnnePiecy,1-0,A80,Dutch Defense: Krejcik Gambit,2025.08.04,12:28:10,180+0,"[d2d4, f7f5, g2g4, d7d5, g4g5, b8c6, c1f4, e7e...",43,9,12,False,0,1.0,False,False
3,Chessanonymous1,micfel,1-0,D00,Queen's Pawn Game: Levitsky Attack,2025.08.04,00:13:13,180+0,"[d2d4, d7d5, c1g5, c7c5, c2c3, b8c6, e2e3, d8b...",61,15,8,False,2,1.0,True,False
4,micfel,Chessanonymous1,1-0,B12,Caro-Kann Defense: Modern Variation,2025.08.04,00:10:07,180+0,"[e2e4, c7c6, d2d4, d7d5, b1d2, a7a6, c2c3, e7e...",57,15,14,False,6,1.0,False,False


### Summarize User Style

In [27]:
def summarize_player_features(features_df: pd.DataFrame) -> pd.Series:
    summary = {
        "avg_moves": features_df["ply_count"].mean(),
        "pct_long_games": (features_df["ply_count"] > 80).mean(),
        "avg_trades": features_df["avg_trades"].mean(),
        "avg_queen_move": features_df["first_queen_ply"].mean(),
        "pct_castled_early": features_df["castled_early"].mean(),
        "avg_checks": features_df["checks"].mean(),
        "win_rate": features_df["result_score"].mean(),
        "pct_wins": (features_df["result_score"] == 1.0).mean(),
        "pct_draws": (features_df["result_score"] == 0.5).mean(),
        "pct_losses": (features_df["result_score"] == 0.0).mean(),
        "queen_trade_freq": features_df["queen_traded_early"].mean(),
        "pct_games_endgame": features_df["endgame_reached"].mean(),
        "opening_variety": features_df["eco"].nunique() / len(features_df),
    }

    return pd.Series(summary, dtype="float32")

In [28]:
user_style_vector = summarize_player_features(user_features_df)
user_style_vector.to_frame(name="value").to_csv(DATA_DIR/"Chessanonymous1_style_vector.csv")
user_style_vector

avg_moves            78.543945
pct_long_games        0.418049
avg_trades           16.153666
avg_queen_move       19.997293
pct_castled_early     0.039416
avg_checks            5.149312
win_rate              0.535828
pct_wins              0.514531
pct_draws             0.042593
pct_losses            0.442876
queen_trade_freq      0.083304
pct_games_endgame     0.438522
opening_variety       0.012001
dtype: float32

## Extract Style features per user in elite df

In [29]:
elite_df = pd.read_parquet(DATA_DIR / "lichess_elite_2025-05.parquet")
elite_features_df = extract_style_features(elite_df)
elite_features_df.to_parquet(DATA_DIR/"elite_features.parquet")
elite_features_df.head()

Extracting style features: 100%|██████████| 310142/310142 [03:15<00:00, 1582.93it/s]


Unnamed: 0,white,black,result,eco,opening,utc_date,utc_time,time_control,moves,ply_count,avg_trades,first_queen_ply,castled_early,checks,result_score,queen_traded_early,endgame_reached
0,eNErGyOFbEiNGbOT,Nikitosik-ai,1/2-1/2,A00,Clemenz Opening,2025.05.01,00:00:15,180+0,"[h2h3, e7e5, e2e4, g8f6, b1c3, f8b4, a2a3, b4a...",98,19,44,True,6,0.5,False,True
1,Chessanonymous1,Ariel_mlr,1-0,A45,Trompowsky Attack,2025.05.01,00:00:54,180+0,"[d2d4, g8f6, c1g5, d7d5, g5f6, e7f6, e2e3, f8d...",81,16,17,False,5,1.0,False,True
2,Kyreds_pet,OlympusCz,1-0,B90,"Sicilian Defense: Najdorf Variation, English A...",2025.05.01,00:00:45,180+0,"[e2e4, c7c5, g1f3, d7d6, d2d4, c5d4, f3d4, g8f...",191,25,17,True,14,1.0,False,True
3,rtahmass,Mettigel,0-1,C72,"Ruy Lopez: Morphy Defense, Modern Steinitz Def...",2025.05.01,00:01:09,180+0,"[e2e4, e7e5, g1f3, b8c6, f1b5, a7a6, b5a4, d7d...",30,3,20,False,2,0.0,False,False
4,CruelKen,tomlesspit,1/2-1/2,D38,"Queen's Gambit Declined: Ragozin Defense, Alek...",2025.05.01,00:01:12,180+2,"[g1f3, d7d5, d2d4, g8f6, c2c4, e7e6, b1c3, f8b...",54,15,9,False,6,0.5,False,False


In [30]:
def build_elite_style_vectors(elite_games_df: pd.DataFrame) -> pd.DataFrame:

    required_cols = {"white", "black", "ply_count"}
    if not required_cols.issubset(elite_games_df.columns):
        raise ValueError("elite_games_df missing required columns")

    # treat each game twice: once from White perspective, once from Black
    white_df = elite_games_df.copy()
    white_df["player"] = white_df["white"]
    black_df = elite_games_df.copy()
    black_df["player"] = black_df["black"]

    all_games = pd.concat([white_df, black_df], ignore_index=True)

    style_vectors = all_games.groupby("player", sort=False).apply(
        summarize_player_features
    )
    style_vectors.index.name = None
    return style_vectors.reset_index().rename(columns={"index": "player"})

In [31]:
elite_style_vectors = build_elite_style_vectors(elite_features_df)
print(f"Computed {len(elite_style_vectors)} player style vectors")
display(elite_style_vectors.head())
elite_style_vectors.to_csv(DATA_DIR/"elite_style_vectors.csv", index=False)

Computed 16232 player style vectors


  style_vectors = all_games.groupby("player", sort=False).apply(


Unnamed: 0,player,avg_moves,pct_long_games,avg_trades,avg_queen_move,pct_castled_early,avg_checks,win_rate,pct_wins,pct_draws,pct_losses,queen_trade_freq,pct_games_endgame,opening_variety
0,eNErGyOFbEiNGbOT,136.996445,0.803318,21.172985,20.030806,0.304502,11.930095,0.498223,0.027251,0.941943,0.030806,0.232227,0.813981,0.120853
1,Chessanonymous1,82.669174,0.453634,17.383459,18.478697,0.035088,5.670426,0.536967,0.512531,0.048872,0.438596,0.080201,0.469925,0.035088
2,Kyreds_pet,134.287704,0.805104,20.967518,18.464037,0.457077,12.329467,0.515081,0.062645,0.904872,0.032483,0.310905,0.819026,0.12065
3,rtahmass,88.449211,0.546597,19.250261,18.4911,0.33089,6.361257,0.516754,0.46178,0.109948,0.428272,0.212565,0.568586,0.114136
4,CruelKen,135.0047,0.719875,20.596245,21.690142,0.660407,10.286385,0.531299,0.092332,0.877934,0.029734,0.15493,0.733959,0.106416


## Next Steps 

With these style vectors, we can:

1. **Cluster** the elite players by their feature vectors (e.g. K‑Means) to discover style archetypes.
2. **Compute distances** between the user’s vector and each elite player’s vector to find stylistic neighbors.