# Collecting chess games from the chess.com API using the chessdotcom wrapper

We are going to use games from players of different nationalities to create a more diverse dataset. This will provide us with the opportunity to discover whether players from different countries tend to use different openings, etc.

For this dataset, I will use the top 20 strongest countries in chess, as listed by FIDE (https://ratings.fide.com/topfed.phtml).

First, let's discuss a major limitation of the API, at least as it stood when I worked with it: it is very rigidly structured. You cannot filter players by rating, which will likely result in an imbalanced dataset. One could keep fetching players until enough of a given rating are found, but with millions of players on the platform this comes with the risk of being rate-limited by the API.

In addition, when requesting players from a country, the API response is in alphabetical order and only includes the first 10000 results. This produces a small bias because, depending on the number of players you request, your whole dataset might consist of players with usernames such as "aaaaaaaaa," "aaaaaaaaaaaaaaaaa," "aaaaaaaaaaaab," which could be correlated with spam accounts or lower-rated players.
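One way to soften the alphabetical bias is to draw usernames at random from the returned list instead of taking the first ones. The snippet below is a minimal sketch of that idea; `sample_usernames` and the example usernames are hypothetical and not part of the original notebook. It cannot remove the bias entirely, because the list itself only contains the alphabetically first results.

```python
import random


def sample_usernames(usernames, k, seed=None):
    """Return up to k usernames drawn uniformly at random, without replacement."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    usernames = list(usernames)
    return rng.sample(usernames, min(k, len(usernames)))


# Hypothetical stand-in for the `players` list returned for one country.
players = ["aaaaaaaaa", "aaaaaaaab", "bobby", "magnus_fan", "zugzwang99"]
picked = sample_usernames(players, k=3, seed=42)
print(picked)
```

In the notebook's loop, this would replace taking the first `max_players_per_country` entries of the response.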

That said, to prevent the code from becoming convoluted, we will first only read the username and their rating from the top 20 countries. Then, I will separate these players by their rating, and finally fetch the games. I suggest saving to a file between these three steps. While this is not necessary, especially for the first two steps, it might be good to avoid making millions of requests in a row (the API might rate-limit you). Additionally, depending on the amount of data, it might not fit into memory.
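As a rough illustration of saving between steps, the sketch below writes the collected rows to a CSV file and reads them back using only the standard library. The helper names, the file name, and the example rows are assumptions made for this example; the original notebook does not show how its intermediate files were written.

```python
import csv
import os
import tempfile


def save_rows(path, rows, header):
    """Write a header line and the data rows to a CSV checkpoint file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def load_rows(path):
    """Read a checkpoint back; every value comes back as a string."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, list(reader)


header = ["country", "username", "rating"]
rows = [["US", "example_user_1", 1500], ["FR", "example_user_2", 900]]  # made-up rows

checkpoint = os.path.join(tempfile.gettempdir(), "players_checkpoint.csv")
save_rows(checkpoint, rows, header)
loaded_header, loaded_rows = load_rows(checkpoint)
print(loaded_header, loaded_rows)
```

Since CSV stores everything as text, the ratings need to be converted back to integers after loading.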

In this notebook, I will be using much smaller values than the ones I actually used in my original dataset for the sake of simplicity and speed. My dataset will be publicly available (refer to the `data_exploration` notebook).

Chess.com API: https://www.chess.com/club/chess-com-developer-community

Python chessdotcom wrapper: https://chesscom.readthedocs.io/en/latest/ 

### Getting players by country

I will use ISO 3166-1 alpha-2 country codes (e.g., US for the United States, RU for Russia), as specified in the API's documentation, and take the first 10 players returned for each country. The data is first stored in a list of three-element lists, [`Country`, `Username`, `Rating`], and only afterwards converted to a dataframe, which is more efficient than repeatedly appending to a dataframe (as explained [here](https://stackoverflow.com/a/56746204/3585576)).

In [1]:
import asyncio
from io import StringIO
import requests
from typing import Optional

import chess.pgn
import pandas as pd

from chessdotcom.aio import Client, get_player_game_archives, get_player_profile, get_player_stats, get_country_players



# Identify the application with the same User-Agent for the chessdotcom
# client and for the direct `requests` calls made later on.
USER_AGENT = "Python data collection application. Contact me at <your@email.com>"
Client.request_config["headers"]["User-Agent"] = USER_AGENT
request_headers = {"User-Agent": USER_AGENT}

In [3]:
countries = ['US', 'RU', 'CN', 'IN', 'UA', 'AZ', 'FR', 'AM', 'DE', 'ES', 'PL', 'NL', 'HU', 'NO', 'IL', 'GB', 'CZ', 'HR', 'UZ', 'AR']
max_players_per_country = 10
country_player_rating_list = []


async def gather_country_players(country):
    responses = await get_country_players(country)
    return responses

async def gather_player_rating(player_username):
    responses = await get_player_stats(player_username)
    return responses

print('Fetching players from country ', end='')
for country in countries:
    print(country, '', end='')
    country_players = await gather_country_players(country)
    # The response is alphabetical, so this keeps the first usernames only.
    for player_username in country_players.json['players'][:max_players_per_country]:
        try:
            player_stats = await gather_player_rating(player_username)
            rating = player_stats.json['stats']['chess_blitz']['last']['rating']
        except Exception:
            # Skip players whose stats request fails or who have no blitz rating.
            continue
        country_player_rating_list.append([country, player_username, rating])

Fetching players from country US RU CN IN UA AZ FR AM DE ES PL NL HU NO IL GB CZ HR UZ AR 
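The loop above silently skips any player whose stats request fails or whose payload has no `['chess_blitz']['last']['rating']` entry, which is presumably why the resulting table has fewer rows than 20 countries × 10 players. A minimal sketch of making that lookup explicit, rather than relying on a broad `except`, is shown below; `extract_blitz_rating` and the two payloads are made up for illustration.

```python
from typing import Optional


def extract_blitz_rating(stats: dict) -> Optional[int]:
    """Return the last blitz rating from a stats payload, or None if it is absent."""
    blitz = stats.get("chess_blitz") or {}
    last = blitz.get("last") or {}
    return last.get("rating")


# Made-up payloads shaped like the `['stats']` part of the response used above.
with_blitz = {"chess_blitz": {"last": {"rating": 1836}}}
without_blitz = {"chess_rapid": {"last": {"rating": 1200}}}

print(extract_blitz_rating(with_blitz))     # 1836
print(extract_blitz_rating(without_blitz))  # None
```

Returning `None` makes it possible to count or log the skipped players instead of dropping them without a trace.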

Creating a dataframe from the collected list, with the columns `country`, `username` and `rating`.

In [6]:
country_player_rating_df = pd.DataFrame(country_player_rating_list, columns=["country", "username", "rating"])
country_player_rating_df

Unnamed: 0,country,username,rating
0,US,--untouchable--,1836
1,US,-blackburne-,610
2,US,-drake,897
3,US,-jg-,970
4,US,-kennethstriker,400
...,...,...,...
133,AR,009l2,287
134,AR,00nit00,929
135,AR,011gaston,201
136,AR,01seky,1029


### Fetching games

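Since this minimal example involves only a small number of players, I will not run the rating split as a separate step and will fetch games for all of them directly. To keep it simple, I'll retrieve only 5 games from each player and store the data in a new dataframe. Note that the function below takes the first games it finds in each player's archives and does not filter by time control, so the sample is not limited to blitz. Additionally, the `chess` library (python-chess) will be used to parse the games, which the API provides in PGN format.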
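For the rating split itself, one simple option is to group usernames into fixed-width rating bands. The sketch below assumes a band width of 400 points, which is an arbitrary choice for illustration and not a value taken from the original dataset; the helper names are hypothetical, and the example rows are copied from the table shown earlier.

```python
from collections import defaultdict


def rating_band(rating, width=400):
    """Return a label such as '800-1199' for the band that contains `rating`."""
    lower = (rating // width) * width
    return f"{lower}-{lower + width - 1}"


def group_by_band(rows, width=400):
    """Group [country, username, rating] rows into {band label: [usernames]}."""
    bands = defaultdict(list)
    for _country, username, rating in rows:
        bands[rating_band(rating, width)].append(username)
    return dict(bands)


rows = [
    ["US", "--untouchable--", 1836],
    ["US", "-blackburne-", 610],
    ["US", "-drake", 897],
    ["AR", "01seky", 1029],
]
bands = group_by_band(rows)
print(bands)
```

With the bands in hand, an equal number of players can be drawn from each one to balance the dataset.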

Let's create a function called `get_games` that fetches games for a given username and returns a dataframe with up to `max_games` games and their metadata:

In [33]:
async def gather_player_game_archives(username: str):
    responses = await get_player_game_archives(username)
    return responses

async def gather_player_profile(username: str):
    responses = await get_player_profile(username)
    return responses

GAME_COLUMNS = ['white', 'white_elo', 'white_country', 'black', 'black_elo', 'black_country',
                'result', 'uuid', 'fen', 'URL', 'ECO', 'ECO_URL', 'UTC_date', 'end_date',
                'start_time', 'end_time', 'time_control', 'tcn', 'moves']

async def get_country_code(username: str) -> Optional[str]:
    """Return the country code from a player's profile, or None on any failure."""
    try:
        profile = await gather_player_profile(username)
        country_url = profile.json['player']['country']
        return requests.get(country_url, headers=request_headers).json()['code']
    except Exception:
        return None

async def get_games(username: str, max_games: int) -> Optional[pd.DataFrame]:
    """Fetch up to `max_games` games for `username`, in the order the API lists them.

    Returns None if the archive list or a monthly archive cannot be read.
    """
    try:
        archives = await gather_player_game_archives(username)
    except Exception:
        return None
    n_games = 0
    all_games = []
    for archive_url in archives.json['archives']:
        try:
            monthly_games = requests.get(archive_url, headers=request_headers).json()
        except Exception:
            return None
        try:
            for game in monthly_games['games']:
                n_games += 1  # counts every game seen, including those skipped below
                try:
                    game_pgn = chess.pgn.read_game(StringIO(game['pgn']))
                except Exception:
                    continue
                if game_pgn is None:  # read_game returns None for an empty PGN
                    continue
                headers = game_pgn.headers
                white_code = await get_country_code(headers.get("White"))
                black_code = await get_country_code(headers.get("Black"))
                try:
                    all_games.append({'white': headers.get("White"), 'white_elo': headers.get("WhiteElo"),
                                      'white_country': white_code, 'black': headers.get("Black"),
                                      'black_elo': headers.get("BlackElo"), 'black_country': black_code,
                                      'result': headers.get("Result"), 'uuid': game['uuid'], 'fen': game['fen'],
                                      'URL': game['url'], 'ECO': headers.get("ECO"), 'ECO_URL': headers.get("ECOUrl"),
                                      'UTC_date': headers.get("UTCDate"), 'end_date': headers.get("EndDate"),
                                      'start_time': headers.get("StartTime"), 'end_time': headers.get("EndTime"),
                                      'time_control': headers.get("TimeControl"), 'tcn': game['tcn'],
                                      'moves': str(game_pgn.mainline_moves())})
                except Exception:  # e.g. a key missing from the game JSON
                    continue
                if n_games >= max_games:
                    return pd.DataFrame(all_games, columns=GAME_COLUMNS)
        except Exception:
            return None

    return pd.DataFrame(all_games, columns=GAME_COLUMNS)
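`get_games` makes a profile request and a country request for both players of every game, so the same player is looked up again for each of their games. The sketch below shows two ways this could be reduced, under assumptions that should be checked against the live API: that the profile's `country` field is a URL whose last path segment is the country code (as in `https://api.chess.com/pub/country/US`), and that a per-username cache is acceptable. `country_code_from_url`, `cached_country_code`, and `fake_lookup` are hypothetical names introduced for this example.

```python
import asyncio
from urllib.parse import urlparse


def country_code_from_url(country_url):
    """Take the last path segment of a country URL as the code (assumed URL shape)."""
    if not country_url:
        return None
    path = urlparse(country_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1] or None


async def cached_country_code(username, lookup, cache):
    """Resolve a username to a country code, calling `lookup` at most once per user."""
    if username not in cache:
        cache[username] = country_code_from_url(await lookup(username))
    return cache[username]


calls = []


async def fake_lookup(username):
    # Stand-in for the real profile request; records how often it is called.
    calls.append(username)
    return "https://api.chess.com/pub/country/US"


async def demo():
    cache = {}
    first = await cached_country_code("--untouchable--", fake_lookup, cache)
    second = await cached_country_code("--untouchable--", fake_lookup, cache)
    return first, second


codes = asyncio.run(demo())
print(codes, len(calls))
```

Inside a notebook, where an event loop is already running, `await demo()` would be used instead of `asyncio.run(demo())`.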

Now it's possible to iterate over the players' usernames and retrieve up to 5 games for each, storing all the data in one dataframe.

In [41]:
max_games = 5
usernames = country_player_rating_df['username'].tolist()
games_from_users = []
print('Fetching games from player ', end = '')
for user in usernames:
    print(user, '', end = '')
    games_df = await get_games(user, max_games)
    if games_df is not None:  # get_games returns None when a request fails
        games_from_users.append(games_df)

chess_data = pd.concat(games_from_users).reset_index(drop=True)

Fetching games from player --untouchable-- -blackburne- -drake -jg- -kennethstriker -osmium- -pantherunner- -redak- -roman- -texast- -aqe- 000000000000000000000900r 0000001ugh 0000searcher 0007hacker 000icq 000mamba 00long 01lz10 0411chess 06081952 071122jjjjjjjj 0726koreyoshi -random -whitetiger- 0-0krishnaveni0-0 0-dhairya-0 000000e00000 000007batman 00000000000000000000000hd 000000013 001alehandro001 006hd 007edward007 0001thomas 007kbv 007mehdi 01cavid 075akif --esteban-- 0-krasus 0000000e 0000000ok 00000m08 0000alolo 000dorian 000ilian000 000kskqiq 001aka 001hayk 001xachatur 00cross00 01garri 0harik 0-nummerinschach 00000100 0000jk 000aiko 0000000000jr 000777gg 000patrick000 000sarius000 0010001jose 00234alex110 0064figgdiinsknia 007egregorien 0002kuba 000sup 001em 001robert 007anik 00gaala00 00junior00 -eugene- -japie- -michael 000donpablo000 00111david11100 006yannick 007amin 002hollow 00benji 00luckyluke 00szabolcs 012120 00-7 0001010111 000lucky000 007inge 00pop00 00qwerty 01j

In [42]:
chess_data

Unnamed: 0,white,white_elo,white_country,black,black_elo,black_country,result,uuid,fen,URL,ECO,ECO_URL,UTC_date,end_date,start_time,end_time,time_control,tcn,moves
0,--untouchable--,1417,US,tamilmaran,1378,IN,1-0,bc75f314-8156-11dd-8000-000000010001,r4bnr/2n1q1p1/4p1kp/p3PpNP/5P2/P7/1BP1Q1P1/1R1...,https://www.chess.com/game/live/213199842,B07,https://www.chess.com/openings/Pirc-Defense-Ha...,2011.12.01,2011.12.01,01:35:43,01:36:55,60,mCZRnD0Sgv1TfAYIArWOlBXHiqIAriOGbsHzegzsiAsjcj...,1. e4 { [%clk 0:01:00] } 1... d6 { [%clk 0:01:...
1,--untouchable--,1270,US,thugz111226,1400,PH,0-1,cf88c36e-8156-11dd-8000-000000010001,8/2B4p/1R3p1n/4p2k/4P3/5N2/4K1P1/8 w - -,https://www.chess.com/game/live/213200363,C30,https://www.chess.com/openings/Kings-Gambit-2....,2011.12.01,2011.12.01,01:37:10,01:39:32,60,mC0KnDZRgv70fA1Teg6SAS0SbsYQlBS0DKRKBJQIsHWOJR...,1. e4 { [%clk 0:01:00] } 1... e5 { [%clk 0:01:...
2,adrian07161978,1181,US,--untouchable--,1130,US,1-0,f05118da-8156-11dd-8000-000000010001,2kr3r/pQpqn1p1/4b3/PP6/3P1p2/2P3b1/6B1/RNB2RK1...,https://www.chess.com/game/live/213201241,A00,https://www.chess.com/openings/Kings-Fianchett...,2011.12.01,2011.12.01,01:40:07,01:41:28,60,owZJfo0Kgv5Qeg!Tks9Rjz6Siy7ZzHQ0lBKCvl86nvCvov...,1. g3 { [%clk 0:01:00] } 1... d5 { [%clk 0:01:...
3,ringgo27,952,MY,--untouchable--,1184,US,0-1,06f76f80-8157-11dd-8000-000000010001,6k1/p3pp1p/3p2p1/4b3/6P1/P4R2/5PKP/2r5 w - -,https://www.chess.com/game/live/213201840,A00,https://www.chess.com/openings/Mieses-Opening-...,2011.12.01,2011.12.01,01:42:27,01:44:28,60,ltYImuZRbs5Qsm2Uks92dr!Tab6Srq8!jrXHqjTZgvHzoE...,1. d3 { [%clk 0:01:00] } 1... c5 { [%clk 0:01:...
4,polo_the_lolipop,1126,FR,--untouchable--,1249,US,0-1,38703b78-8157-11dd-8000-000000010001,4kb1r/pp2pppp/3p1n2/8/5B2/3BP3/1qr2PPP/1K4NR w...,https://www.chess.com/game/live/213203180,A43,https://www.chess.com/openings/Old-Benoni-Defense,2011.12.01,2011.12.01,01:46:05,01:48:09,60,lBYIcDIBksBsbs5QsHZRdJ6SJt!Tec46tqQBcbBHqz7Pzr...,1. d4 { [%clk 0:01:00] } 1... c5 { [%clk 0:01:...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
681,dante4247k_inactive,480,AR,01xzRezNoTengoElBarraBaja,556,AR,1-0,2373a01c-c73f-11ec-b84c-78ac4409ff3c,1Q2k3/R7/8/4N3/3R1p2/P4P1p/2P2P1P/6K1 b - -,https://www.chess.com/game/live/44922565581,C00,https://www.chess.com/openings/French-Defense-...,2022.04.28,2022.04.28,22:04:26,22:17:47,3600,mC0Sgv7FbsFVvK1TKEVDdvDvov87fA5QegZJCJSJAJ7Zlt...,1. e4 { [%clk 1:00:00] } 1... e6 { [%clk 1:00:...
682,01xzRezNoTengoElBarraBaja,346,AR,dante4247k_inactive,489,AR,0-1,1135c209-c741-11ec-b84c-78ac4409ff3c,rnbqk2r/pppp1ppp/8/P3p3/4n3/3P4/1PP1PbPP/R1BQK...,https://www.chess.com/game/live/44923212311,A00,https://www.chess.com/openings/Ware-Opening-1....,2022.04.28,2022.04.28,22:18:10,22:19:22,3600,iy0KyG!Tbs9IsHTCHBIBltBn,1. a4 { [%clk 1:00:00] } 1... e5 { [%clk 1:00:...
683,dante4247k_inactive,494,AR,01xzRezNoTengoElBarraBaja,271,AR,1-0,5797077a-c741-11ec-b84c-78ac4409ff3c,8/8/8/5P2/1P6/3Q4/P1k1NPPP/R1BR2K1 b - -,https://www.chess.com/game/live/44923236211,B00,https://www.chess.com/openings/Kings-Pawn-Open...,2022.04.28,2022.04.28,22:19:58,22:27:31,3600,mC1LCLZJgv5OvKOIfHYQHQXQKQ8ZQ7Z7eg0KbsKCsJCulu...,1. e4 { [%clk 1:00:00] } 1... f5 { [%clk 1:00:...
684,01xzRezNoTengoElBarraBaja,325,AR,lucakfc,100,AR,1-0,2df1eb0c-d9f0-11ec-9e4a-78ac4409ff3c,r4rk1/2p2ppp/p2p3n/1pP1p1B1/3n4/1PN4P/P1P1PP1P...,https://www.chess.com/game/live/46977622173,A00,https://www.chess.com/openings/Sodium-Attack-1...,2022.05.22,2022.05.22,16:57:07,17:01:21,600,bq0Klt!VqH5QHs9ItB8!BIWOsyQBjrXHysZRgx6xox7McM,1. Na3 { [%clk 0:10:00] } 1... e5 { [%clk 0:10...
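The `time_control` column above contains values such as 60, 600 and 3600 seconds, so this sample mixes several time classes. If only blitz games are wanted, the game objects in the monthly archive JSON carry, as far as I know, a `time_class` field that can be used as a filter. The sketch below assumes that field name and uses made-up game dictionaries; `first_games_of_class` is a hypothetical helper, not part of the original notebook.

```python
def first_games_of_class(games, time_class="blitz", limit=5):
    """Return the first `limit` games whose `time_class` matches."""
    selected = []
    for game in games:
        if game.get("time_class") == time_class:
            selected.append(game)
            if len(selected) == limit:
                break
    return selected


# Made-up game objects; only the fields needed for the filter are included.
games = [
    {"url": "g1", "time_class": "bullet", "time_control": "60"},
    {"url": "g2", "time_class": "blitz", "time_control": "300"},
    {"url": "g3", "time_class": "rapid", "time_control": "600"},
    {"url": "g4", "time_class": "blitz", "time_control": "180"},
]
blitz = first_games_of_class(games, limit=5)
print([g["url"] for g in blitz])
```

Applied to each month's `games` list before parsing, a filter like this would make the `max_games` limit count blitz games only.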


This is just a sample of the whole dataset that is going to be used in subsequent steps (data exploration, etc.).