# Initial Filtering of the Data

This notebook is part of the initial data analysis, in which we select the subset of the data to focus on. We ultimately chose games whose event is "Rated Blitz game", in which the white player's Elo rating was above 2000 and the result was decisive (1-0 or 0-1). We collected 100,000 games of this type.
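
As a rough sketch of the selection rule, the three checks applied by `pgn_to_json_filtered` later in this notebook can be written as a single predicate over a dictionary of PGN headers. The name `keep_game` is hypothetical and used only for illustration; the defaults mirror the arguments used further down.

```python
def keep_game(headers, event_filter="Rated Blitz game", min_white_elo=2000):
    """Return True if the headers pass all three selection criteria (illustrative helper)."""
    # 1. The event must match exactly (tournament and swiss games are excluded).
    if headers.get("Event", "") != event_filter:
        return False
    # 2. White's rating must be an integer strictly above the threshold.
    try:
        white_elo = int(headers.get("WhiteElo", ""))
    except ValueError:
        return False  # missing or non-numeric rating
    if white_elo <= min_white_elo:
        return False
    # 3. Only decisive results are kept.
    return headers.get("Result", "") in ("1-0", "0-1")


example = {"Event": "Rated Blitz game", "WhiteElo": "2055", "Result": "1-0"}
print(keep_game(example))
```

Keeping the rule in one function makes it easy to check edge cases (a rating of exactly 2000, a draw, a missing rating) without reading the whole PGN file.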



In [1]:
import os
import chess.pgn # might need to pip install 'chess'
import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, count, col, regexp_extract, regexp_replace
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
import zstandard as zstd
import io




In [2]:
file_path = '/scratch/zrc3hc/lichess_db_standard_rated_2024-08.pgn'
# pgn = 'portable game notation', the standard format for recording chess games.

In [3]:
# Checking if file can be accessed 

if os.path.exists(file_path):
    print("File exists and can be accessed.")
else:
    print("File does not exist or cannot be accessed.")

File exists and can be accessed.


In [4]:
with open(file_path, 'r', encoding='utf-8') as file:
    for _ in range(20):
        line = file.readline()
        if line:
            print(line.strip())
        else:
            break

[Event "Rated Bullet game"]
[Site "https://lichess.org/nQ1xYNSF"]
[Date "2024.08.01"]
[Round "-"]
[White "kingskreamer"]
[Black "mysteryvabs"]
[Result "1-0"]
[UTCDate "2024.08.01"]
[UTCTime "00:00:09"]
[WhiteElo "2148"]
[BlackElo "2155"]
[WhiteRatingDiff "+6"]
[BlackRatingDiff "-6"]
[ECO "B10"]
[Opening "Caro-Kann Defense: Accelerated Panov Attack"]
[TimeControl "60+0"]
[Termination "Time forfeit"]

1. e4 { [%clk 0:01:00] } 1... c6 { [%clk 0:01:00] } 2. c4 { [%clk 0:00:59] } 2... d5 { [%clk 0:01:00] } 3. cxd5 { [%clk 0:00:59] } 3... cxd5 { [%clk 0:01:00] } 4. exd5 { [%clk 0:00:58] } 4... Qxd5 { [%clk 0:00:59] } 5. Nc3 { [%clk 0:00:58] } 5... Qd8 { [%clk 0:00:59] } 6. Bc4 { [%clk 0:00:58] } 6... Nf6 { [%clk 0:00:59] } 7. Qb3 { [%clk 0:00:57] } 7... e6 { [%clk 0:00:58] } 8. Nf3 { [%clk 0:00:57] } 8... Nc6 { [%clk 0:00:57] } 9. Bb5 { [%clk 0:00:55] } 9... Bd7 { [%clk 0:00:57] } 10. O-O { [%clk 0:00:54] } 10... Rc8 { [%clk 0:00:56] } 11. Re1 { [%clk 0:00:52] } 11... a6 { [%clk 0:00:56] } 1

### 1. PGN Headers

- **[Event "Rated Bullet game"]** - The event is a rated bullet chess game.
- **[Site "https://lichess.org/nQ1xYNSF"]** - The game URL.
- **[Date "2024.08.01"]** - The date when the game occurred.
- **[Round "-"]** - The round of the event; "-" means a round number does not apply to this game.
- **[White "kingskreamer"]** - The white player.
- **[Black "mysteryvabs"]** - The black player.
- **[Result "1-0"]** - The result of the game (White won).
- **[UTCDate "2024.08.01"]** - The date of the game in UTC.
- **[UTCTime "00:00:09"]** - The time of the game in UTC.
- **[WhiteElo "2148"]** - White player's Elo rating before the game.
- **[BlackElo "2155"]** - Black player's Elo rating before the game.
- **[WhiteRatingDiff "+6"]** - Change in White's Elo after the game (+6).
- **[BlackRatingDiff "-6"]** - Change in Black's Elo after the game (-6).
- **[ECO "B10"]** - The Encyclopaedia of Chess Openings code for the opening; 'B10' is the Caro-Kann Defense.
- **[Opening "Caro-Kann Defense: Accelerated Panov Attack"]** - The specific opening played.
- **[TimeControl "60+0"]** - The time control was 60 seconds per player with 0 seconds increment.
- **[Termination "Time forfeit"]** - The game ended because one player ran out of time.
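
To make the header layout concrete, here is a minimal sketch that reads tag lines of the form `[Name "value"]` into a dictionary using only the standard library. This notebook itself relies on `chess.pgn` for parsing; the regex and the `parse_headers` name below are illustrative assumptions, and the pattern does not handle escaped quotes inside values.

```python
import re

# Matches one PGN tag line such as: [WhiteElo "2148"]
HEADER_RE = re.compile(r'^\[(\w+) "(.*)"\]$')


def parse_headers(lines):
    """Collect PGN tag pairs from an iterable of lines into a dict."""
    headers = {}
    for line in lines:
        match = HEADER_RE.match(line.strip())
        if match:
            headers[match.group(1)] = match.group(2)
    return headers


sample = [
    '[Event "Rated Bullet game"]',
    '[WhiteElo "2148"]',
    '[BlackElo "2155"]',
    '[TimeControl "60+0"]',
]
headers = parse_headers(sample)

# TimeControl is "<base seconds>+<increment seconds>"
base, increment = (int(part) for part in headers["TimeControl"].split("+"))
print(headers["Event"], base, increment)
```

All header values arrive as strings, which is why ratings such as `WhiteElo` need an explicit `int(...)` conversion before any numeric comparison.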


### 2. The Moves
This part records the actual moves played in the game. Moves are written in standard algebraic notation, and each move is followed by a `[%clk h:mm:ss]` annotation giving the time left on the mover's clock after that move.

For example:

- **1. e4 { [%clk 0:01:00] }**: White plays pawn to e4, and their clock is at 1 minute.
- **1... c6 { [%clk 0:01:00] }**: Black responds with c6 (a pawn move), and their clock remains at 1 minute.
- **2. c4 { [%clk 0:00:59] }**: White moves pawn to c4, with their clock now at 59 seconds.
- **3. cxd5 { [%clk 0:00:59] }**: White captures the pawn on d5, clock at 59 seconds.
- **5. Nc3 { [%clk 0:00:58] }**: White develops their knight to c3, clock at 58 seconds.

The moves continue with annotations showing how much time each player has left after their moves.
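
As a small illustration of the `[%clk]` annotations, the sketch below pulls the clock values out of a movetext string and converts them to seconds. The helper name `clock_seconds` is hypothetical, and the code assumes every move carries exactly one `h:mm:ss` annotation, as in the sample above.

```python
import re

# Matches annotations such as [%clk 0:00:59]
CLK_RE = re.compile(r"\[%clk (\d+):(\d{2}):(\d{2})\]")


def clock_seconds(movetext):
    """Return the remaining clock time in seconds after each half-move."""
    return [
        int(h) * 3600 + int(m) * 60 + int(s)
        for h, m, s in CLK_RE.findall(movetext)
    ]


movetext = "1. e4 { [%clk 0:01:00] } 1... c6 { [%clk 0:01:00] } 2. c4 { [%clk 0:00:59] }"
clocks = clock_seconds(movetext)

# Annotations alternate between the two players, starting with White.
white_clocks = clocks[0::2]
black_clocks = clocks[1::2]
print(white_clocks, black_clocks)
```

Differences between consecutive values for the same player give the time spent on each move (plus any increment, which is zero for `60+0`).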


In [5]:
def pgn_to_json(pgn_file, limit=10):
    games_list = []
    game_count = 0  # Counter to keep track of how many games have been parsed
    
    with open(pgn_file, 'r', encoding='utf-8') as file:
        while True:
            game = chess.pgn.read_game(file)
            if game is None:
                break
            
            game_info = {
                "Event": game.headers.get("Event", ""),
                "Date": game.headers.get("Date", ""),
                "Result": game.headers.get("Result", ""),
                "WhiteElo": game.headers.get("WhiteElo", ""),
                "BlackElo": game.headers.get("BlackElo", ""),
                "Moves": [move.uci() for move in game.mainline_moves()]
            }
            
            games_list.append(game_info)
            game_count += 1
            
            if game_count % 1000 == 0:
                print(f"{game_count} games have been converted...")
            
            if game_count >= limit:
                break
                
    return games_list

In [6]:
games_json = pgn_to_json(file_path, limit=1000)


1000 games have been converted...


In [7]:
# Showing the first game

print(json.dumps(games_json[0], indent=4))


{
    "Event": "Rated Bullet game",
    "Date": "2024.08.01",
    "Result": "1-0",
    "WhiteElo": "2148",
    "BlackElo": "2155",
    "Moves": [
        "e2e4",
        "c7c6",
        "c2c4",
        "d7d5",
        "c4d5",
        "c6d5",
        "e4d5",
        "d8d5",
        "b1c3",
        "d5d8",
        "f1c4",
        "g8f6",
        "d1b3",
        "e7e6",
        "g1f3",
        "b8c6",
        "c4b5",
        "c8d7",
        "e1g1",
        "a8c8",
        "f1e1",
        "a7a6",
        "b5a4",
        "b7b5",
        "c3b5",
        "a6b5",
        "a4b5",
        "f8e7",
        "f3e5",
        "e8g8",
        "e5d7",
        "d8d7",
        "b3f3",
        "d7c7",
        "b5e2",
        "c6e5",
        "f3g3",
        "e5g6",
        "f2f4",
        "c7f4",
        "g3f4",
        "g6f4",
        "d2d3",
        "f4e2",
        "e1e2",
        "c8c7",
        "c1e3",
        "f8c8",
        "a1f1",
        "f6d5",
        "e3d4",
        "c7c1",
        "g1f2",
      

In [8]:
spark = SparkSession.builder \
    .appName("Game Evaluation") \
    .getOrCreate()



Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/20 07:48:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [9]:
schema = StructType([
    StructField("Event", StringType(), True),
    StructField("Site", StringType(), True),
    StructField("Date", StringType(), True),
    StructField("Result", StringType(), True),
    StructField("WhiteElo", StringType(), True),  
    StructField("BlackElo", StringType(), True),  
    StructField("Moves", ArrayType(StringType()), True)  
])


In [10]:
df = spark.createDataFrame(games_json, schema=schema)


In [11]:
df.show(5, truncate=10) 


+----------+----+----------+------+--------+--------+----------+
|     Event|Site|      Date|Result|WhiteElo|BlackElo|     Moves|
+----------+----+----------+------+--------+--------+----------+
|Rated B...|null|2024.08.01|   1-0|    2148|    2155|[e2e4, ...|
|Rated B...|null|2024.08.01|   1-0|    1103|    1106|[e2e4, ...|
|Rated B...|null|2024.08.01|   0-1|     674|     629|[e2e4, ...|
|Rated B...|null|2024.08.01|   0-1|    2459|    2556|[e2e4, ...|
|Rated B...|null|2024.08.01|   0-1|    1527|    1500|[e2e4, ...|
+----------+----+----------+------+--------+--------+----------+
only showing top 5 rows




In [12]:
event_distribution = df.groupBy("Event").count()


In [13]:
event_distribution.show(truncate=False)



+---------------------------------------------------------------+-----+
|Event                                                          |count|
+---------------------------------------------------------------+-----+
|Classical swiss https://lichess.org/swiss/bRGB9CDW             |6    |
|Rated Blitz tournament https://lichess.org/tournament/x2P6n5ZA |25   |
|Rated Blitz tournament https://lichess.org/tournament/MIDvhT7D |20   |
|Rated Bullet tournament https://lichess.org/tournament/CReo1mLT|1    |
|Rated Classical game                                           |5    |
|Rated Blitz tournament https://lichess.org/tournament/9ItBGKQi |3    |
|Rated Blitz tournament https://lichess.org/tournament/3YFuR01B |34   |
|Rated Rapid tournament https://lichess.org/tournament/lPeiflQY |2    |
|Rated Bullet tournament https://lichess.org/tournament/FjIuM2tM|3    |
|Blitz swiss https://lichess.org/swiss/YqQwqdKB                 |4    |
|Rated Blitz game                                               


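Some event names include the URL of a specific tournament, so every tournament is counted as its own event. We want to treat all tournaments of the same type as one event, so the next cell strips the trailing hyperlink from the `Event` column.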

In [14]:

df1 = df.withColumn("Event", regexp_replace("Event", r' http.*$', ''))

event_distribution = df1.groupBy("Event").count()

event_distribution.show(truncate=False)

+-----------------------+-----+
|Event                  |count|
+-----------------------+-----+
|Rated Bullet tournament|22   |
|Rated Classical game   |5    |
|Rapid swiss            |3    |
|Rated Blitz tournament |93   |
|Rated Rapid tournament |6    |
|Rated Blitz game       |421  |
|Classical swiss        |6    |
|Rated UltraBullet game |3    |
|Rated Rapid game       |115  |
|Blitz swiss            |10   |
|Rated Bullet game      |316  |
+-----------------------+-----+
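
To see what the pattern `' http.*$'` does on individual values, here is the same substitution sketched with Python's `re` module instead of Spark's `regexp_replace`. Spark uses Java regular expressions, so this is an illustration under the assumption that the two engines agree for a pattern this simple; `strip_event_url` is a hypothetical helper name.

```python
import re


def strip_event_url(event):
    """Remove a trailing ' http...' suffix from an event name, if present."""
    return re.sub(r" http.*$", "", event)


print(strip_event_url("Rated Blitz tournament https://lichess.org/tournament/x2P6n5ZA"))
print(strip_event_url("Rated Blitz game"))
```

Event names without a URL pass through unchanged, which is why "Rated Blitz game" and "Rated Blitz tournament" remain separate categories after cleaning.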



In [15]:
# Getting the games from the pgn file that are Rated Blitz games only


In [20]:
# filtering through the file to only include Rated Blitz Games
# and only games where WhiteElo > 2000
# and only games where there is a result (either 1-0 or 0-1)

def pgn_to_json_filtered(pgn_file, event_filter="Rated Blitz game", min_white_elo=2000, limit=10):
    games_list = []
    total_games_parsed = 0  
    filtered_games_count = 0 
    
    with open(pgn_file, 'r', encoding='utf-8') as file:
        while True:
            game = chess.pgn.read_game(file)
            if game is None:
                break
            
            total_games_parsed += 1
            
            # Extract relevant headers
            event = game.headers.get("Event", "")
            white_elo = game.headers.get("WhiteElo", "")
            result = game.headers.get("Result", "")
            
            # only include rated blitz games
            
            if event != event_filter:
                continue  
            
            # only include games where white elo is strictly above the minimum value
            
            try:
                white_elo = int(white_elo) # first making sure the elo value is an integer
            except ValueError:
                continue  
            
            if white_elo <= min_white_elo:
                continue  
            
            # keep only decisive games (excludes draws and games without a result)
            if result not in ["1-0", "0-1"]:
                continue  
            
            # Construct game dictionary
            game_info = {
                "Event": event,
                "Site": game.headers.get("Site", ""),
                "Date": game.headers.get("Date", ""),
                "Result": result,
                "WhiteElo": white_elo,
                "BlackElo": game.headers.get("BlackElo", ""),
                "Moves": [move.uci() for move in game.mainline_moves()] 
            }
            
            games_list.append(game_info)
            filtered_games_count += 1
            
            # Print progress for every 1000 games filtered
            if filtered_games_count % 1000 == 0:
                print(f"{filtered_games_count} games filtered...")
            
            # Stop when the limit is reached
            if filtered_games_count >= limit:
                break

    print(f"Total games parsed: {total_games_parsed}")
    print(f"Filtered games count: {filtered_games_count}")
    return games_list


In [23]:
filtered_games_json = pgn_to_json_filtered(file_path, event_filter="Rated Blitz game", limit=100000)
print(json.dumps(filtered_games_json[0], indent=4))


1000 games filtered...
2000 games filtered...
3000 games filtered...
4000 games filtered...
5000 games filtered...
6000 games filtered...
7000 games filtered...
8000 games filtered...
9000 games filtered...
10000 games filtered...
11000 games filtered...
12000 games filtered...
13000 games filtered...
14000 games filtered...
15000 games filtered...
16000 games filtered...
17000 games filtered...
18000 games filtered...
19000 games filtered...
20000 games filtered...
21000 games filtered...
22000 games filtered...
23000 games filtered...
24000 games filtered...
25000 games filtered...
26000 games filtered...
27000 games filtered...
28000 games filtered...
29000 games filtered...
30000 games filtered...
31000 games filtered...
32000 games filtered...
33000 games filtered...
34000 games filtered...
35000 games filtered...
36000 games filtered...
37000 games filtered...
38000 games filtered...
39000 games filtered...
40000 games filtered...
41000 games filtered...
42000 games filtered...
4

In [24]:
df_blitz = spark.createDataFrame(filtered_games_json, schema=schema)
df_blitz.show(10, truncate=20) 


24/11/20 09:56:55 WARN TaskSetManager: Stage 8 contains a task of very large size (58787 KiB). The maximum recommended task size is 1000 KiB.

+----------------+--------------------+----------+------+--------+--------+--------------------+
|           Event|                Site|      Date|Result|WhiteElo|BlackElo|               Moves|
+----------------+--------------------+----------+------+--------+--------+--------------------+
|Rated Blitz game|https://lichess.o...|2024.08.01|   1-0|    2055|    2069|[d2d4, e7e6, e2e4...|
|Rated Blitz game|https://lichess.o...|2024.08.01|   0-1|    2124|    2252|[e2e4, c7c5, b1c3...|
|Rated Blitz game|https://lichess.o...|2024.08.01|   0-1|    2010|    1972|[d2d4, d7d5, c2c4...|
|Rated Blitz game|https://lichess.o...|2024.08.01|   0-1|    2156|    2147|[e2e4, e7e5, g1f3...|
|Rated Blitz game|https://lichess.o...|2024.08.01|   1-0|    2028|    2049|[d2d4, d7d5, g1f3...|
|Rated Blitz game|https://lichess.o...|2024.08.01|   0-1|    2072|    2091|[d2d4, d7d5, c2c4...|
|Rated Blitz game|https://lichess.o...|2024.08.01|   0-1|    2006|    1966|[e2e4, d7d5, e4d5...|
|Rated Blitz game|https://lich

24/11/20 09:56:59 WARN PythonRunner: Detected deadlock while completing task 0.0 in stage 8 (TID 6): Attempting to kill Python Worker

In [25]:
df_blitz_count = df_blitz.count()
print(df_blitz_count)

24/11/20 09:59:49 WARN TaskSetManager: Stage 9 contains a task of very large size (58787 KiB). The maximum recommended task size is 1000 KiB.

100000



In [26]:
df_blitz.write.json("/scratch/zrc3hc/filtered_games_total")

24/11/20 10:00:00 WARN TaskSetManager: Stage 12 contains a task of very large size (58787 KiB). The maximum recommended task size is 1000 KiB.