# Chess Opening Recommender — Data Collection

The main idea:

- Fetch raw PGNs (the actual move-by-move game records) from Lichess for a specific user.
- Load a sample of elite-level PGNs for comparison and reference.

By the end, I’ll have two pandas DataFrames:

- `user_games_df`: all the games for the target user, ready for feature extraction.
- `elite_df`: a sample of elite games, which I’ll use to build style vectors and do clustering.

This notebook is meant to be run interactively as I iterate on the pipeline, but the functions here will also be used in the automated backend later.

In [9]:
import os
import io
import requests
from pathlib import Path
from typing import Optional
import chess.pgn
import pandas as pd
from tqdm import tqdm

In [None]:
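# Assumption: the API token is exported as the LICHESS_TOKEN environment variable
# rather than hard-coded in the notebook.
LICHESS_TOKEN = os.environ.get("LICHESS_TOKEN", "")
# Send the Authorization header only when a token is set; public games can be
# exported without one, and an empty Bearer value is not a valid credential.
HEADERS = {"Authorization": f"Bearer {LICHESS_TOKEN}"} if LICHESS_TOKEN else {}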
DATA_DIR = Path("/Users/nicholasvega/Downloads/chess-opening-recommender/data")

## Fetch User Games from Lichess

We will define `fetch_user_games(username, save_to)`, which does the following:
1. Builds the request to `https://lichess.org/api/games/user/{username}`, asking for complete move list, openings used, an
2. Handles the HTTP response, raising an error if something goes wrong.
3. Returns the raw PGN text and optionally saves it to disk for later reuse.

In [16]:
def fetch_user_games(username: str, save_to: Optional[Path] = None) -> str:
    "Fetch all games for a given Lichess user in PGN format."
    url = f"https://lichess.org/api/games/user/{username}"
    params = {
        "moves": "true",
        "evals": "true",
        "opening": "true",
        "clocks": "false",
        "format": "pgn",
    }
    resp = requests.get(url, headers=HEADERS, params=params, timeout=60)
    resp.raise_for_status()
    text = resp.text
    if save_to:
        save_to.parent.mkdir(parents=True, exist_ok=True)
        save_to.write_text(text, encoding="utf-8")
    return text
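
Because the full export is large and slow to download, it can help to read the saved file on later runs instead of calling the API again. The following is a minimal sketch of that idea, not part of the notebook: `load_or_fetch` is a hypothetical helper that takes any zero-argument callable returning PGN text.

```python
from pathlib import Path
from typing import Callable


def load_or_fetch(cache_path: Path, fetch: Callable[[], str]) -> str:
    """Return the cached text if cache_path exists, else call fetch() and cache it.

    Hypothetical helper for illustration; the notebook does not define it.
    """
    if cache_path.exists():
        # Cache hit: skip the network call entirely.
        return cache_path.read_text(encoding="utf-8")
    text = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text, encoding="utf-8")
    return text
```

With the function above, a call might look like `load_or_fetch(SAVE_PATH, lambda: fetch_user_games(USERNAME))`. The cache never expires in this sketch, so the file has to be deleted by hand to pick up newly played games.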

### Example of use for fetch_user_games

In [17]:
USERNAME  = "Chessanonymous1"
SAVE_PATH = DATA_DIR / f"{USERNAME}_games.pgn"

pgn_text  = fetch_user_games(USERNAME, save_to=SAVE_PATH)
print(f"Fetched {len(pgn_text)//1024:.1f} KB of PGN → {SAVE_PATH}")

Fetched 10145.0 KB of PGN → /Users/nicholasvega/Downloads/chess-opening-recommender/data/Chessanonymous1_games.pgn


The user's games have been successfully saved to the data folder in the repository root.

### Example of output from fetch_user_games

[Event "rated blitz game"]

[Site "https://lichess.org/P6YKILRd"]

[Date "2025.08.04"]

[White "Chessanonymous1"]

[Black "Miki000"]

[Result "1-0"]

[GameId "P6YKILRd"]

[UTCDate "2025.08.04"]

[UTCTime "12:49:00"]

[WhiteElo "2419"]

[BlackElo "2383"]

[WhiteRatingDiff "+6"]

[BlackRatingDiff "-5"]

[Variant "Standard"]

[TimeControl "180+0"]

[ECO "A45"]

[Opening "Trompowsky Attack: Classical Defense"]

[Termination "Normal"]

1. d4 Nf6 2. Bg5 e6 3. e3 Be7 4. Bd3 O-O 5. Nd2 c5 6. c3 b6 7. Qf3 d5 8. Ne2 h6 9. h4 Bb7 10. Bxf6 Bxf6 11. g4 g6 12. g5 hxg5 13. hxg5 Bxg5 14. Qh3 Kg7 15. Qh7+ Kf6 16. f4 1-0
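
Each bracketed line in the output above is a PGN tag pair: a tag name followed by a quoted value. The notebook relies on `chess.pgn` for parsing, but the structure is simple enough to illustrate with the standard library alone. This is a rough sketch: `read_tag_pairs` is a hypothetical helper, and the sample keeps only a few of the headers and a shortened move list from the game shown above.

```python
import re

# One tag pair per line, e.g. [White "Chessanonymous1"]
TAG_PAIR = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def read_tag_pairs(game_text: str) -> dict:
    """Collect the tag pairs of a single PGN game into a dict (hypothetical helper)."""
    tags = {}
    for line in game_text.splitlines():
        match = TAG_PAIR.match(line.strip())
        if match:
            tags[match.group(1)] = match.group(2)
    return tags


# Subset of the headers above; the movetext is shortened for illustration.
sample = '''[Event "rated blitz game"]
[White "Chessanonymous1"]
[Black "Miki000"]
[Result "1-0"]
[ECO "A45"]

1. d4 Nf6 2. Bg5 e6 1-0
'''

tags = read_tag_pairs(sample)
print(tags["ECO"], tags["Result"])
```

Movetext lines do not match the pattern, so they are skipped. For real parsing `chess.pgn.read_game` remains the better choice, since it also validates the moves.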

## Fetch User Profile

We will define `get_user_profile(username, save_to)`, which fetches a player's profile metadata from Lichess and optionally saves the raw JSON to disk:

In [None]:
def get_user_profile(username: str, save_to: Optional[Path] = None) -> dict:
    url = f"https://lichess.org/api/user/{username}"
    r = requests.get(url, headers=HEADERS, timeout=10)
    r.raise_for_status()
    if save_to:
        save_to.parent.mkdir(parents=True, exist_ok=True)
        save_to.write_text(r.text, encoding="utf-8")
    return r.json()

### Example of use for get_user_profile

In [19]:
USERNAME  = "Chessanonymous1"
SAVE_PATH = DATA_DIR / f"{USERNAME}_profile.json"

# Pass save_to so the profile is actually written to the path reported below.
profile = get_user_profile(USERNAME, save_to=SAVE_PATH)
print(f"Fetched profile for {USERNAME} → {SAVE_PATH}")

Fetched profile for Chessanonymous1 → /Users/nicholasvega/Downloads/chess-opening-recommender/data/Chessanonymous1_profile.json
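
The returned dict can then be mined for per-time-control ratings. The sketch below assumes the response has a `perfs` mapping whose entries may carry a `rating` field. `rating_summary` is a hypothetical helper, and `example_profile` is made-up placeholder data shaped like a profile, not a real response.

```python
def rating_summary(profile: dict) -> dict:
    """Map each entry of profile["perfs"] to its rating, skipping entries without one."""
    perfs = profile.get("perfs", {})
    return {name: perf["rating"] for name, perf in perfs.items() if "rating" in perf}


# Placeholder data for illustration only; the values are invented.
example_profile = {
    "username": "example_user",
    "perfs": {
        "blitz": {"games": 100, "rating": 1500, "rd": 60, "prog": 0},
        "rapid": {"games": 10, "rating": 1600, "rd": 90, "prog": 5},
        "storm": {"runs": 3, "score": 20},  # no rating field, so it is skipped
    },
}

print(rating_summary(example_profile))
```

With the real data this would be called as `rating_summary(profile)`.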


## Parse PGN Games to a DataFrame

In [None]:
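def pgn_to_games_df(pgn_text: str) -> pd.DataFrame:
    """Parse PGN text into a DataFrame with one row per game."""
    records = []
    stream = io.StringIO(pgn_text)
    while True:
        game = chess.pgn.read_game(stream)
        if game is None:  # end of input
            break
        hdr = game.headers
        moves, evals = [], []
        node = game
        # Walk the mainline only; side variations are ignored.
        while node.variations:
            nxt = node.variation(0)
            moves.append(nxt.move.uci())
            # eval() is a method: it returns the PovScore parsed from a
            # [%eval ...] comment, or None when the move has no annotation.
            score = nxt.eval()
            if score is not None:
                # Assumption: store centipawns from White's point of view,
                # mapping mate scores to +/-100000 so the column stays numeric.
                evals.append(score.white().score(mate_score=100_000))
            node = nxt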
        records.append(
            {
                "white": hdr.get("White"),
                "black": hdr.get("Black"),
                "result": hdr.get("Result"),
                "eco": hdr.get("ECO"),
                "opening": hdr.get("Opening", ""),
                "utc_date": hdr.get("UTCDate"),
                "utc_time": hdr.get("UTCTime"),
                "time_control": hdr.get("TimeControl"),
                "moves": moves,
                "evals": evals,
            }
        )
    return pd.DataFrame(records)

## Parse User PGN Games to a DataFrame

In [21]:
def parse_user_pgn(pgn_text: str) -> pd.DataFrame:
    df = pgn_to_games_df(pgn_text)
    return df

### Example of use for parse_user_pgn

In [22]:
user_games_df = parse_user_pgn(pgn_text)
print(f"User games parsed: {len(user_games_df)}")
display(user_games_df.head())

# Save the DataFrame to Parquet
user_games_df.to_parquet(DATA_DIR / "Chessanonymous1_games.parquet")

User games parsed: 8499


| | white | black | result | eco | opening | utc_date | utc_time | time_control | moves |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Chessanonymous1 | Miki000 | 1-0 | A45 | Trompowsky Attack: Classical Defense | 2025.08.04 | 12:49:00 | 180+0 | [d2d4, g8f6, c1g5, e7e6, e2e3, f8e7, f1d3, e8g... |
| 1 | Miki000 | Chessanonymous1 | 1-0 | D15 | Slav Defense: Chebanenko Variation | 2025.08.04 | 12:44:17 | 180+0 | [d2d4, d7d5, c2c4, c7c6, b1c3, g8f6, g1f3, a7a... |
| 2 | Chessanonymous1 | AnnePiecy | 1-0 | A80 | Dutch Defense: Krejcik Gambit | 2025.08.04 | 12:28:10 | 180+0 | [d2d4, f7f5, g2g4, d7d5, g4g5, b8c6, c1f4, e7e... |
| 3 | Chessanonymous1 | micfel | 1-0 | D00 | Queen's Pawn Game: Levitsky Attack | 2025.08.04 | 00:13:13 | 180+0 | [d2d4, d7d5, c1g5, c7c5, c2c3, b8c6, e2e3, d8b... |
| 4 | micfel | Chessanonymous1 | 1-0 | B12 | Caro-Kann Defense: Modern Variation | 2025.08.04 | 00:10:07 | 180+0 | [e2e4, c7c6, d2d4, d7d5, b1d2, a7a6, c2c3, e7e... |
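
Since the target user appears as White in some rows and Black in others, the `result` column has to be read from the user's side before it is useful as a feature. Here is a minimal sketch of that conversion, assuming the standard PGN result strings. `user_outcome` is a hypothetical helper and is not part of the notebook.

```python
from typing import Optional, Tuple

# Score for White encoded in the PGN Result tag.
WHITE_SCORE = {"1-0": 1.0, "0-1": 0.0, "1/2-1/2": 0.5}


def user_outcome(
    white: Optional[str], black: Optional[str], result: Optional[str], username: str
) -> Optional[Tuple[str, float]]:
    """Return (color, score) from the user's side, or None if the row cannot be scored."""
    if result not in WHITE_SCORE:  # e.g. "*" for unfinished games
        return None
    name = username.lower()  # compare usernames case-insensitively
    if white is not None and white.lower() == name:
        return "white", WHITE_SCORE[result]
    if black is not None and black.lower() == name:
        return "black", 1.0 - WHITE_SCORE[result]
    return None


# First two rows of the table above.
print(user_outcome("Chessanonymous1", "Miki000", "1-0", "Chessanonymous1"))
print(user_outcome("Miki000", "Chessanonymous1", "1-0", "Chessanonymous1"))
```

On the DataFrame this could be applied row by row, for example with `user_games_df.apply(lambda r: user_outcome(r.white, r.black, r.result, USERNAME), axis=1)`.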


## Parse Elite PGN Games to a DataFrame

In [27]:
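def parse_elite_pgn(pgn_path: Path) -> pd.DataFrame:
    """Parse a PGN file of elite games into a DataFrame with one row per game."""
    records = []
    # Read games one at a time from the file handle so the whole file
    # never has to be held in memory as a single string.
    with pgn_path.open(encoding="utf-8", errors="ignore") as fh:
        while True:
            game = chess.pgn.read_game(fh)
            if game is None:  # end of file
                break
            hdr = game.headers
            moves = []  # only moves are kept; the output has no evals column
            node = game
            # Walk the mainline only; side variations are ignored.
            while node.variations:
                nxt = node.variation(0)
                moves.append(nxt.move.uci())
                node = nxt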
            records.append({
                "white": hdr.get("White"),
                "black": hdr.get("Black"),
                "result": hdr.get("Result"),
                "eco": hdr.get("ECO"),
                "opening": hdr.get("Opening", ""),
                "utc_date": hdr.get("UTCDate"),
                "utc_time": hdr.get("UTCTime"),
                "time_control": hdr.get("TimeControl"),
                "moves": moves
            })
    return pd.DataFrame(records)

### Example of use for parse_elite_pgn

In [28]:
import time

ELITE_PGN = DATA_DIR / "lichess_elite_2025-05.pgn"

start = time.time()
elite_df = parse_elite_pgn(ELITE_PGN)
# Report how long parsing took; the elite file is much larger than a single user's export.
print(f"Parsed {len(elite_df)} elite games in {time.time() - start:.1f} s")
display(elite_df.head())

# Save the elite games to Parquet
elite_df.to_parquet(DATA_DIR / "lichess_elite_2025-05.parquet")

| | white | black | result | eco | opening | utc_date | utc_time | time_control | moves |
|---|---|---|---|---|---|---|---|---|---|
| 0 | eNErGyOFbEiNGbOT | Nikitosik-ai | 1/2-1/2 | A00 | Clemenz Opening | 2025.05.01 | 00:00:15 | 180+0 | [h2h3, e7e5, e2e4, g8f6, b1c3, f8b4, a2a3, b4a... |
| 1 | Chessanonymous1 | Ariel_mlr | 1-0 | A45 | Trompowsky Attack | 2025.05.01 | 00:00:54 | 180+0 | [d2d4, g8f6, c1g5, d7d5, g5f6, e7f6, e2e3, f8d... |
| 2 | Kyreds_pet | OlympusCz | 1-0 | B90 | Sicilian Defense: Najdorf Variation, English A... | 2025.05.01 | 00:00:45 | 180+0 | [e2e4, c7c5, g1f3, d7d6, d2d4, c5d4, f3d4, g8f... |
| 3 | rtahmass | Mettigel | 0-1 | C72 | Ruy Lopez: Morphy Defense, Modern Steinitz Def... | 2025.05.01 | 00:01:09 | 180+0 | [e2e4, e7e5, g1f3, b8c6, f1b5, a7a6, b5a4, d7d... |
| 4 | CruelKen | tomlesspit | 1/2-1/2 | D38 | Queen's Gambit Declined: Ragozin Defense, Alek... | 2025.05.01 | 00:01:12 | 180+2 | [g1f3, d7d5, d2d4, g8f6, c2c4, e7e6, b1c3, f8b... |
