# Chess Opening Recommender — Data Collection

**Overview:**  
This notebook fetches raw PGN from Lichess for a single user and loads a sample of games from the elite PGN file.

**Goal:**  
Produce two pandas DataFrames, `user_games_df` and `elite_df`, for use in later steps.

**Purpose:**  
Validate that PGN fetching and parsing work before building features.


In [1]:
import os
import io
import requests
from pathlib import Path
from typing import Optional
import chess.pgn
import pandas as pd
from tqdm import tqdm

In [2]:
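# Read the token from the environment instead of hard-coding a secret in the notebook.
LICHESS_TOKEN = os.environ.get("LICHESS_TOKEN", "")
HEADERS = {"Authorization": f"Bearer {LICHESS_TOKEN}"} if LICHESS_TOKEN else {}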
DATA_DIR = Path("/Users/nicholasvega/Downloads/chess-opening-recommender/src/data")

## Fetch a user's games from Lichess

### 1.1 Fetch User Games from Lichess

Define `fetch_user_games(username, max_games, save_to)`:

1. Builds the request to `https://lichess.org/api/games/user/{username}`, asking for up to `max_games` games, full move lists, engine evaluations, and ECO tags in PGN format.
2. Handles the HTTP response; if the request fails, it prints the error and returns an empty string.
3. Returns the raw PGN text and optionally saves it to disk for later reuse.

In [3]:
def fetch_user_games(username: str,
                    max_games: int = 300,
                    save_to: Path | None = None) -> str:
    """
    Fetch up to `max_games` recent games for a Lichess user as PGN (with evals & ECO).
    Optionally saves the PGN to disk if `save_to` is provided.
    Returns the PGN text as a string.
    """
    url = f"https://lichess.org/api/games/user/{username}"
    params = {
        "max": max_games,
        "moves": "true",
        "evals": "true",
        "opening": "true",
        "clocks": "false",
        "format": "pgn",
    }
    try:
        response = requests.get(url, headers=HEADERS, params=params, stream=True, timeout=30)
        response.raise_for_status()
        pgn_text = response.text
        if save_to:
            save_to.write_text(pgn_text, encoding="utf-8")
        return pgn_text
    except requests.RequestException as e:
        print(f"Error fetching games for {username}: {e}")
        return ""
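
To check what `fetch_user_games` sends without making a network call, the query string can be rebuilt with the standard library. This is a minimal sketch: the `build_games_url` helper is illustrative and not part of the notebook, and it assumes `requests` encodes the same parameters in the same order.

```python
from urllib.parse import urlencode

def build_games_url(username: str, max_games: int = 300) -> str:
    """Rebuild the URL that the games-export request resolves to."""
    # Same parameters as fetch_user_games; the helper itself is hypothetical.
    params = {
        "max": max_games,
        "moves": "true",
        "evals": "true",
        "opening": "true",
        "clocks": "false",
        "format": "pgn",
    }
    return f"https://lichess.org/api/games/user/{username}?{urlencode(params)}"

print(build_games_url("Chessanonymous1"))
```

The printed URL can be compared against the one that appears in any error message from `requests`.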

### 1.1a Usage example of `fetch_user_games`

In [4]:
USERNAME  = "Chessanonymous1"
SAVE_PATH = Path("/Users/nicholasvega/Downloads/chess-opening-recommender/src/data") / f"{USERNAME}_games.pgn"

pgn_text  = fetch_user_games(USERNAME, max_games=300, save_to=SAVE_PATH)
# True division, so the .1f format can show fractional kilobytes.
print(f"Fetched {len(pgn_text) / 1024:.1f} KB of PGN → {SAVE_PATH}")

Error fetching games for Chessanonymous1: 401 Client Error: Unauthorized for url: https://lichess.org/api/games/user/Chessanonymous1?max=300&moves=true&evals=true&opening=true&clocks=false&format=pgn
Fetched 0.0 KB of PGN → /Users/nicholasvega/Downloads/chess-opening-recommender/src/data/Chessanonymous1_games.pgn


### 1.1b Example output

[Event "rated blitz game"]
[Site "https://lichess.org/TdpWM9JA"]
[Date "2025.07.22"]
[White "Chessanonymous1"]
[Black "yasinka2016"]
[Result "0-1"]
[GameId "TdpWM9JA"]
[UTCDate "2025.07.22"]
[UTCTime "03:05:34"]
[WhiteElo "2455"]
[BlackElo "2423"]
[WhiteRatingDiff "-6"]
[BlackRatingDiff "+6"]
[BlackTitle "FM"]
[Variant "Standard"]
[TimeControl "180+0"]
[ECO "D00"]
[Opening "Queen's Pawn Game: Levitsky Attack"]
[Termination "Time forfeit"]
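
Each game opens with bracketed tag pairs like these, and they are the source of most DataFrame columns. The notebook reads them through `chess.pgn`; the stdlib-only sketch below just illustrates the structure. `parse_tag_pairs` is a hypothetical helper, and the simple regular expression assumes values contain no escaped quotes.

```python
import re

# A subset of the header block shown above.
SAMPLE_HEADERS = '''[Event "rated blitz game"]
[White "Chessanonymous1"]
[Black "yasinka2016"]
[Result "0-1"]
[ECO "D00"]
[Opening "Queen's Pawn Game: Levitsky Attack"]
[TimeControl "180+0"]
'''

# One tag pair per line: [Name "value"]
TAG_PAIR = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')

def parse_tag_pairs(text: str) -> dict:
    """Collect PGN tag pairs into a dict (illustration only)."""
    headers = {}
    for line in text.splitlines():
        match = TAG_PAIR.match(line.strip())
        if match:
            headers[match.group(1)] = match.group(2)
    return headers

headers = parse_tag_pairs(SAMPLE_HEADERS)
print(headers["ECO"], "|", headers["Opening"])
```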

A quick profile snapshot (ratings per time‑control, total games, creation date) gives contextual metadata:

* Confirms you fetched the correct account.  
* Lets you log the user’s current strength, which can be handy when interpreting engine accuracy metrics later.  

In [5]:
def get_user_profile(username: str) -> dict:
    url = f"https://lichess.org/api/user/{username}"
    r = requests.get(url, headers=HEADERS, timeout=10)
    r.raise_for_status()
    return r.json()


## 1.2 Convert PGN games to a DataFrame

In [6]:
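def pgn_to_games_df(pgn_text: str) -> pd.DataFrame:
    """Parse PGN text into one DataFrame row per game (headers plus UCI moves)."""
    records = []
    stream = io.StringIO(pgn_text)
    while True:
        game = chess.pgn.read_game(stream)
        if game is None:  # end of input
            break
        hdr = game.headers
        moves, evals = [], []
        node = game
        # Walk the mainline only; side variations are ignored.
        while node.variations:
            nxt = node.variation(0)
            moves.append(nxt.move.uci())
            # eval() is a method, so it must be called; None means no eval annotation.
            score = nxt.eval() if hasattr(nxt, "eval") else None
            if score is not None:
                evals.append(score)  # collected but not yet stored in the DataFrame
            node = nxt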
        records.append({
            "white":        hdr.get("White"),
            "black":        hdr.get("Black"),
            "result":       hdr.get("Result"),
            "eco":          hdr.get("ECO"),
            "opening":      hdr.get("Opening"),
            "utc_date":     hdr.get("UTCDate"),
            "utc_time":     hdr.get("UTCTime"),
            "time_control": hdr.get("TimeControl"),
            "moves":        moves
        })
    return pd.DataFrame(records)
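
When evals are requested, Lichess embeds engine evaluations in move comments such as `{ [%eval 0.17] }`, with `#N` marking a forced mate. The sketch below pulls these out with a regular expression to show what the raw annotations look like. The sample movetext is made up for illustration, and `extract_evals` is a hypothetical helper, not something the notebook defines.

```python
import re
from typing import List, Union

# Made-up movetext in the style of a Lichess export with evals enabled.
SAMPLE_MOVETEXT = (
    "1. d4 { [%eval 0.17] } 1... d5 { [%eval 0.25] } "
    "2. Bg5 { [%eval -0.1] } 2... f6 { [%eval #-3] } 0-1"
)

# Matches pawn-unit scores like 0.17 or -0.1 and mate scores like #-3.
EVAL_TAG = re.compile(r"\[%eval\s+(#?-?\d+(?:\.\d+)?)\]")

def extract_evals(movetext: str) -> List[Union[float, str]]:
    """Return pawn-unit evals as floats and mate scores as strings like '#-3'."""
    evals: List[Union[float, str]] = []
    for raw in EVAL_TAG.findall(movetext):
        evals.append(raw if raw.startswith("#") else float(raw))
    return evals

print(extract_evals(SAMPLE_MOVETEXT))
```

Mate scores are kept as strings here so that they are not confused with pawn-unit values. How to encode them numerically is left open.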


### 1.2a Parse the user's PGN into a DataFrame

In [7]:
def parse_user_pgn(pgn_text: str) -> pd.DataFrame:
    df = pgn_to_games_df(pgn_text)
    return df

Example usage

In [8]:
user_games_df = parse_user_pgn(pgn_text)
print(f"User games parsed: {len(user_games_df)}")
display(user_games_df.head())

# Save the DataFrame to Parquet
user_games_df.to_parquet(DATA_DIR / f"{USERNAME}_games.parquet")

User games parsed: 0


OSError: Cannot save file into a non-existent directory: '/Users/nicholasvega/Downloads/chess-opening-recommender/src/data'
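
The `OSError` above indicates that `DATA_DIR` did not exist when `to_parquet` ran, and the write does not create missing parent directories. Below is a minimal sketch of guarding against that. It uses a temporary directory and a plain text write so that it runs anywhere; the paths are placeholders, not the notebook's real ones.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "src" / "data"  # placeholder standing in for DATA_DIR
    # Create the directory and any missing parents before writing into it;
    # exist_ok=True makes the call safe to repeat on later runs.
    data_dir.mkdir(parents=True, exist_ok=True)
    out_path = data_dir / "example_games.txt"
    out_path.write_text("placeholder", encoding="utf-8")
    created = out_path.exists()

print(created)
```

In the notebook, the equivalent `DATA_DIR.mkdir(parents=True, exist_ok=True)` call would go before any write into `DATA_DIR`.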

### 1.2b Parse elite PGN games into a DataFrame

In [None]:
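def parse_elite_pgn_fast(pgn_path: Path, n_games: int = 500) -> pd.DataFrame:
    """Parse at most `n_games` games from the start of a PGN file into a DataFrame."""
    records = []
    with pgn_path.open(encoding="utf-8", errors="ignore") as fh:
        for _ in tqdm(range(n_games), desc="Parsing elite PGN"):
            game = chess.pgn.read_game(fh)
            if game is None:  # the file holds fewer than n_games games
                break

            hdr = game.headers
            moves, evals = [], []
            node = game
            # Walk the mainline only; side variations are ignored.
            while node.variations:
                nxt = node.variation(0)
                moves.append(nxt.move.uci())
                # eval() is a method, so it must be called; None means no eval annotation.
                score = nxt.eval() if hasattr(nxt, "eval") else None
                if score is not None:
                    evals.append(score)  # collected but not yet stored in the DataFrame
                node = nxt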

            records.append({
                "white":        hdr.get("White"),
                "black":        hdr.get("Black"),
                "result":       hdr.get("Result"),
                "eco":          hdr.get("ECO"),
                "opening":      hdr.get("Opening", ""),
                "utc_date":     hdr.get("UTCDate"),
                "utc_time":     hdr.get("UTCTime"),
                "time_control": hdr.get("TimeControl"),
                "moves":        moves
            })

    return pd.DataFrame(records)
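
`parse_elite_pgn_fast` reads games one at a time from the open file and stops after `n_games`, rather than loading the whole file first. The same lazy pattern can be sketched with the standard library alone by splitting on `[Event ` lines. `iter_raw_games` and the three-game sample are illustrative assumptions (every game is assumed to start with an `[Event` tag), and real parsing should still go through `chess.pgn`.

```python
import io
from itertools import islice
from typing import Iterator, List, TextIO

def iter_raw_games(handle: TextIO) -> Iterator[str]:
    """Lazily yield one game's raw PGN text at a time."""
    chunk: List[str] = []
    for line in handle:
        # A new "[Event" tag marks the start of the next game.
        if line.startswith("[Event ") and chunk:
            yield "".join(chunk).strip()
            chunk = []
        chunk.append(line)
    # Emit the final game if anything non-blank is left over.
    if any(part.strip() for part in chunk):
        yield "".join(chunk).strip()

# Made-up sample with three short games.
SAMPLE = io.StringIO(
    '[Event "Rated Blitz game"]\n[Result "1-0"]\n\n1. d4 Nf6 2. Bg5 1-0\n\n'
    '[Event "Rated Blitz game"]\n[Result "0-1"]\n\n1. e4 c5 0-1\n\n'
    '[Event "Rated Blitz game"]\n[Result "1/2-1/2"]\n\n1. h3 e5 1/2-1/2\n'
)

# islice stops the generator after two games, so later games are never assembled.
first_two = list(islice(iter_raw_games(SAMPLE), 2))
print(len(first_two))
```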

Example usage 

In [None]:
ELITE_PGN = Path("/Users/nicholasvega/Downloads/chess-opening-recommender/src/data/lichess_elite_2025-05.pgn")

import time
start = time.time()
elite_df = parse_elite_pgn_fast(ELITE_PGN, n_games=500)
print(f"Fast parse took {time.time() - start:.1f}s — parsed {len(elite_df)} games")
display(elite_df.head())

# Save elite games to Parquet
elite_df.to_parquet(DATA_DIR / "lichess_elite_2025-05.parquet")

Parsing elite PGN: 100%|██████████| 500/500 [00:00<00:00, 1118.13it/s]

Fast parse took 0.5s — parsed 500 games





| | white | black | result | eco | opening | utc_date | utc_time | time_control | moves |
|---|---|---|---|---|---|---|---|---|---|
| 0 | eNErGyOFbEiNGbOT | Nikitosik-ai | 1/2-1/2 | A00 | Clemenz Opening | 2025.05.01 | 00:00:15 | 180+0 | [h2h3, e7e5, e2e4, g8f6, b1c3, f8b4, a2a3, b4a... |
| 1 | Chessanonymous1 | Ariel_mlr | 1-0 | A45 | Trompowsky Attack | 2025.05.01 | 00:00:54 | 180+0 | [d2d4, g8f6, c1g5, d7d5, g5f6, e7f6, e2e3, f8d... |
| 2 | Kyreds_pet | OlympusCz | 1-0 | B90 | Sicilian Defense: Najdorf Variation, English A... | 2025.05.01 | 00:00:45 | 180+0 | [e2e4, c7c5, g1f3, d7d6, d2d4, c5d4, f3d4, g8f... |
| 3 | rtahmass | Mettigel | 0-1 | C72 | Ruy Lopez: Morphy Defense, Modern Steinitz Def... | 2025.05.01 | 00:01:09 | 180+0 | [e2e4, e7e5, g1f3, b8c6, f1b5, a7a6, b5a4, d7d... |
| 4 | CruelKen | tomlesspit | 1/2-1/2 | D38 | Queen's Gambit Declined: Ragozin Defense, Alek... | 2025.05.01 | 00:01:12 | 180+2 | [g1f3, d7d5, d2d4, g8f6, c2c4, e7e6, b1c3, f8b... |


## 1.3 Fetch user profile 

Example

In [None]:
import json

In [None]:
profile = get_user_profile(USERNAME)

profile_path = DATA_DIR / f"{USERNAME}_profile.json"
profile_path.write_text(json.dumps(profile, indent=2), encoding="utf-8")

ratings = { fmt: v["rating"] 
            for fmt, v in profile["perfs"].items() 
            if "rating" in v }
print("Current ratings:", ratings)

play_counts = { fmt: v["games"] 
                for fmt, v in profile["perfs"].items() 
                if "games" in v }
print("Total games played:", play_counts)

Current ratings: {'ultraBullet': 1607, 'bullet': 2686, 'blitz': 2449, 'rapid': 2447, 'classical': 1500, 'correspondence': 1500, 'chess960': 2141, 'puzzle': 2519}
Total games played: {'ultraBullet': 106, 'bullet': 291, 'blitz': 7446, 'rapid': 127, 'classical': 0, 'correspondence': 0, 'chess960': 406, 'puzzle': 522}
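
Some of these ratings look like defaults rather than earned values: `classical` and `correspondence` both sit at 1500 with zero games played. The sketch below keeps only formats with enough games behind them, using the numbers printed above. The `established_ratings` helper and the `min_games` threshold of 100 are arbitrary illustrations.

```python
# Values copied from the printed output above.
ratings = {"ultraBullet": 1607, "bullet": 2686, "blitz": 2449, "rapid": 2447,
           "classical": 1500, "correspondence": 1500, "chess960": 2141, "puzzle": 2519}
play_counts = {"ultraBullet": 106, "bullet": 291, "blitz": 7446, "rapid": 127,
               "classical": 0, "correspondence": 0, "chess960": 406, "puzzle": 522}

def established_ratings(ratings: dict, counts: dict, min_games: int = 100) -> dict:
    """Keep ratings only for formats with at least `min_games` games played."""
    return {fmt: r for fmt, r in ratings.items() if counts.get(fmt, 0) >= min_games}

reliable = established_ratings(ratings, play_counts)
print(reliable)
```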
