Game Analysis
===

Analysis of chess games from the [Lichess Elite Database](https://database.nikonoel.fr/), which is a subset of the [Lichess Database](https://database.lichess.org).

[python-chess](https://python-chess.readthedocs.io/en/latest/index.html) was used to extract each game's PGN header tags into `game_tags.jsonl`, one JSON object per game.
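
The extraction script itself is not part of this notebook. A minimal sketch of what it might look like, assuming the goal is one JSON object of header tags per line; the function name `write_game_tags` and the input file name in the usage comment are hypothetical. It relies on `chess.pgn.read_headers`, which reads only the tag section of each game and skips the moves.

```python
import json

import chess.pgn


def write_game_tags(pgn_handle, out_handle):
    """Write one JSON object of PGN header tags per game; return the game count.

    Hypothetical helper, not the script that produced game_tags.jsonl.
    """
    count = 0
    while True:
        # read_headers() parses the tag pairs, skips the movetext,
        # and returns None once the input is exhausted.
        headers = chess.pgn.read_headers(pgn_handle)
        if headers is None:
            break
        out_handle.write(json.dumps(dict(headers)) + "\n")
        count += 1
    return count


# Usage (file names are placeholders):
# with open("games.pgn") as pgn, open("game_tags.jsonl", "w") as out:
#     write_game_tags(pgn, out)
```

Skipping the movetext keeps the pass over a multi-million-game PGN file much cheaper than parsing every move, which is enough when only the tags are analyzed.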

This notebook was run with uvx: `uvx --with pandas --with matplotlib jupyter lab`

See: https://docs.astral.sh/uv/concepts/tools/#including-additional-dependencies

In [1]:
import json
import pandas as pd

In [2]:
games = []
with open("game_tags.jsonl") as infile:
    for line in infile:
        games.append(json.loads(line))
len(games)

3819130

In [3]:
columns = [
    "Event",
    "Date",
    "White",
    "Black",
    "Result",
    "BlackTitle",
    "WhiteElo",
    "BlackElo",
    "TimeControl",
    "UTCDate",
    "UTCTime",
    "Termination",
    "WhiteTitle"
]
gdf = pd.DataFrame(games, columns=columns)
gdf.shape

(3819130, 13)

In [4]:
del games

In [5]:
gdf.sample(n=3)

,Event,Date,White,Black,Result,BlackTitle,WhiteElo,BlackElo,TimeControl,UTCDate,UTCTime,Termination,WhiteTitle
652109,Rated Blitz game,2020.02.28,hairniko,WalterIalternativo,1/2-1/2,,2440,2400,300+0,2020.02.28,20:35:26,Normal,
189596,Rated Blitz game,2020.03.15,Den78,AMIRREZALH,0-1,,2459,2448,180+0,2020.03.15,11:59:39,Normal,FM
1078171,Rated Blitz game,2020.01.30,Chyngys,buchari11,1-0,,2357,2463,180+0,2020.01.30,06:15:02,Normal,
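
PGN header values are text, so if the tags were stored as strings the rating and date columns need converting before any numeric work. A small sketch on two made-up rows shaped like the sample above; the `demo` frame and the `UTCDateTime` column name are illustrative, and whether `gdf` actually holds strings depends on how the JSONL was written.

```python
import pandas as pd

# Two made-up rows shaped like the sample above (not real games).
demo = pd.DataFrame({
    "WhiteElo": ["2440", "2459"],
    "BlackElo": ["2400", "?"],
    "UTCDate": ["2020.02.28", "2020.03.15"],
    "UTCTime": ["20:35:26", "11:59:39"],
})

for col in ("WhiteElo", "BlackElo"):
    # errors="coerce" turns unparseable values such as "?" into NaN
    demo[col] = pd.to_numeric(demo[col], errors="coerce")

# Combine the PGN-style date (YYYY.MM.DD) and time into one timestamp column.
demo["UTCDateTime"] = pd.to_datetime(
    demo["UTCDate"] + " " + demo["UTCTime"], format="%Y.%m.%d %H:%M:%S"
)
```

The same two steps could be applied to `gdf` directly; the explicit `format` avoids per-row format inference on a frame of this size.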


In [6]:
gdf["Event"].value_counts(dropna=False)

Event
Rated Blitz game        3774133
Rated Rapid game          24152
Rated Classical game      20845
Name: count, dtype: int64

In [7]:
gdf["Result"].value_counts(dropna=False)

Result
1-0        1835842
0-1        1666151
1/2-1/2     315977
*             1160
Name: count, dtype: int64

In [13]:
gdf["Termination"].value_counts(dropna=False)

Termination
Normal              2896673
Time forfeit         921226
Abandoned              1160
Rules infraction         71
Name: count, dtype: int64

In [8]:
gdf["TimeControl"].value_counts(dropna=False).head(n=20)

TimeControl
180+0     3312509
180+2      296190
300+0      122822
600+0       19361
180+1       18252
480+0        7375
120+3        4704
300+5        4692
120+2        3845
300+3        3143
240+0        2492
180+3        1971
300+8        1485
60+4         1232
300+2        1168
900+15       1159
600+5        1158
60+3         1001
45+4          966
240+2         870
Name: count, dtype: int64

In [14]:
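# sdf = gdf.sample(n=200000)  # alternative: a smaller random sample for quick experiments
# Keep only games that ended normally or on time; drops "Abandoned" and "Rules infraction".
sdf = gdf[gdf["Termination"].isin({"Normal", "Time forfeit"})]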
len(sdf)

3817899

In [15]:
pd.crosstab(sdf["Event"], sdf["Result"], normalize="index")

Event,0-1,1-0,1/2-1/2
Rated Blitz game,0.436492,0.481149,0.082359
Rated Classical game,0.434236,0.451613,0.114151
Rated Rapid game,0.423405,0.457867,0.118728


In [18]:
# TODO: explore the relationship between Elo rating and draw rate.
# Notes on possible modeling approaches: https://chatgpt.com/share/686ade5e-8f1c-8009-8e18-76b597132f00
# Candidate methods to try: pyGAM or LOWESS.
# LOWESS example (statsmodels): https://www.statsmodels.org/dev/examples/notebooks/generated/lowess.html
# Stack Overflow: https://stackoverflow.com/questions/42637579/how-to-compute-and-plot-a-lowess-curve-in-python
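
Before fitting a GAM or LOWESS curve, a binned draw rate gives a rough first look at the same question. This is a simpler stand-in for those smoothers, not a replacement. The sketch below uses made-up rows, so its numbers mean nothing; the 100-point bin width and the use of the two players' mean rating are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only. In the notebook this would be `sdf`,
# with WhiteElo and BlackElo already numeric.
demo = pd.DataFrame({
    "WhiteElo": [2310, 2350, 2420, 2480, 2510, 2590],
    "BlackElo": [2330, 2370, 2440, 2460, 2530, 2570],
    "Result": ["1-0", "0-1", "0-1", "1/2-1/2", "1/2-1/2", "1/2-1/2"],
})

# Summarize each game by the mean rating of the two players.
mean_elo = (demo["WhiteElo"] + demo["BlackElo"]) / 2
is_draw = demo["Result"].eq("1/2-1/2")

# 100-point bins, closed on the left: [2300, 2400), [2400, 2500), [2500, 2600)
bins = list(range(2300, 2700, 100))
elo_bin = pd.cut(mean_elo, bins=bins, right=False)

# The mean of a boolean Series is the fraction of True values,
# so this is the share of draws within each rating bin.
draw_rate = is_draw.groupby(elo_bin, observed=True).mean()
```

On the real data, `draw_rate.plot(kind="bar")` would show whether the draw share rises with rating, and the bin midpoints could later serve as a sanity check against a LOWESS fit.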