# Dataset Analysis: "60,000+ Chess Game Dataset (Chess.com)"

## Author

Analysis by Ivan Lamparelli

Written in October 2021

## Dataset Description

The dataset gathers information regarding chess games on the [Chess.com](https://www.chess.com/) platform. 

The dataset has been retrieved from [Kaggle](https://www.kaggle.com/adityajha1504/chesscom-user-games-60000-games). It consists of a single 160 MB .csv file with 14 columns and 66,879 rows.

Each row is related to a single game, and contains data like:
- username and ELO score of both players
- which player has won
- type of victory/draw (e.g. victory by checkmate or on time)
- time format
- game format
- game FEN and PGN

## Analysis goals

These are the questions I set out to answer when starting this analysis:
- How is ELO distributed among players?
- What are the most common opening moves? 
- How do openings change at different ELO ranges?
- How many turns is an average game?
- How does the time format influence the number of turns and the piece advantage between players?
- Do high ELO players end the game with a lower or higher piece advantage on average?
- What are the most common result types overall (e.g. win by checkmate, win by timeout, draw by stalemate)? Are these influenced by ELO and time format?

Further questions may emerge during the analysis depending on what I discover.

## Project setup

### GitHub Repository

The project has been tracked with Git and is hosted on GitHub. The repository can be explored on [my project's GitHub page](https://github.com/lamparelli/chess-data-analysis).

### Project libraries

In [491]:
import pandas as pd
import os

### Dataset import

In [492]:
# os.path.join builds the path with the right separator on both Windows and Unix environments
games = pd.read_csv(os.path.join("data", "club_games_data.csv"))

## Data Exploration

### Dataset structure

In [493]:
print("Dataset columns:")
print(list(games.columns))
print("Number of columns: ", games.columns.size)
print("Number of rows: ", len(games))

Dataset columns:
['white_username', 'black_username', 'white_id', 'black_id', 'white_rating', 'black_rating', 'white_result', 'black_result', 'time_class', 'time_control', 'rules', 'rated', 'fen', 'pgn']
Number of columns:  14
Number of rows:  66879


### Sample data

In [494]:
# To see all the table columns, execute the code or read the HTML version of the document; the PDF version truncates the table
games.head()

Unnamed: 0,white_username,black_username,white_id,black_id,white_rating,black_rating,white_result,black_result,time_class,time_control,rules,rated,fen,pgn
0,-Amos-,miniman2804,https://api.chess.com/pub/player/-amos-,https://api.chess.com/pub/player/miniman2804,1708,1608,win,checkmated,daily,1/259200,chess,True,r2r4/p2p1p1p/b6R/n1p1kp2/2P2P2/3BP3/PP5P/4K2R b K f3 1 22,"[Event ""Enjoyable games 2 - Round 1""]\n[Site ""Chess.com""]\n[Date ""2013.01.30""]\n[Round ""-""]\n[White ""-Amos-""]\n[Black ""miniman2804""]\n[Result ""1-0""]\n[Tournament ""https://www.chess.com/tournament/enjoyable-games-2""]\n[CurrentPosition ""r2r4/p2p1p1p/b6R/n1p1kp2/2P2P2/3BP3/PP5P/4K2R b K f3 1 22""]\n[Timezone ""UTC""]\n[ECO ""E22""]\n[ECOUrl ""https://www.chess.com/openings/Nimzo-Indian-Defense-Spielmann-Variation""]\n[UTCDate ""2013.01.30""]\n[UTCTime ""16:35:14""]\n[WhiteElo ""1708""]\n[BlackElo ""1608""]\n[TimeControl ""1/259200""]\n[Termination ""-Amos- won by checkmate""]\n[StartTime ""16:35:14""]\n[EndDate ""2013.02.01""]\n[EndTime ""18:14:48""]\n[Link ""https://www.chess.com/game/daily/64629816""]\n\n1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 4. Qb3 Bxc3+ 5. Qxc3 O-O 6. Bg5 c5 7. dxc5 Nc6 8. Nf3 Qa5 9. Bxf6 gxf6 10. Qxa5 Nxa5 11. e3 Rd8 12. Rd1 Kg7 13. Be2 b6 14. Rd4 bxc5 15. Rg4+ Kh6 16. Bd3 f5 17. Rh4+ Kg6 18. g4 Ba6 19. gxf5+ exf5 20. Ne5+ Kf6 21. Rh6+ Kxe5 22. f4# 1-0\n"
1,-Amos-,koltcho69,https://api.chess.com/pub/player/-amos-,https://api.chess.com/pub/player/koltcho69,1726,1577,win,resigned,daily,1/172800,chess,True,8/5Q1k/4n1pp/8/7P/2N2b2/PP3P2/5K2 b - - 1 33,"[Event ""Rapid Rats - Board 5""]\n[Site ""Chess.com""]\n[Date ""2013.01.19""]\n[Round ""-""]\n[White ""-Amos-""]\n[Black ""koltcho69""]\n[Result ""1-0""]\n[Match ""https://www.chess.com/club/matches/219602""]\n[CurrentPosition ""8/5Q1k/4n1pp/8/7P/2N2b2/PP3P2/5K2 b - - 1 33""]\n[Timezone ""UTC""]\n[ECO ""C53""]\n[ECOUrl ""https://www.chess.com/openings/Giuoco-Piano-Game-Main-Line""]\n[UTCDate ""2013.01.19""]\n[UTCTime ""14:29:25""]\n[WhiteElo ""1726""]\n[BlackElo ""1577""]\n[TimeControl ""1/172800""]\n[Termination ""-Amos- won by resignation""]\n[StartTime ""14:29:25""]\n[EndDate ""2013.02.01""]\n[EndTime ""18:22:03""]\n[Link ""https://www.chess.com/game/daily/64070770""]\n\n1. e4 e5 2. Nf3 Nc6 3. Bc4 Bc5 4. c3 a6 5. d4 exd4 6. cxd4 Be7 7. Qb3 Na5 8. Qc2 Nxc4 9. Qxc4 d6 10. Nc3 c6 11. O-O h6 12. Re1 Nf6 13. d5 c5 14. e5 dxe5 15. Nxe5 O-O 16. Ng6 Re8 17. Rxe7 Rxe7 18. Nxe7+ Qxe7 19. Bf4 b5 20. d6 Qd7 21. Qxc5 Bb7 22. Qc7 Qxc7 23. dxc7 Nd5 24. Rd1 Nxf4 25. Rd8+ Kh7 26. Rxa8 Bxa8 27. c8=Q Bxg2 28. Qxa6 Bf3 29. Qxb5 Nh3+ 30. Kf1 g6 31. Qd7 Ng5 32. h4 Ne6 33. Qxf7+ 1-0\n"
2,-Amos-,enhmandah,https://api.chess.com/pub/player/-amos-,https://api.chess.com/pub/player/enhmandah,1727,842,win,resigned,daily,1/172800,chess,True,rn1q1b1r/kb2p1pp/2p5/p1Q5/N1BP2n1/4PN2/1P3PPP/R1B1K2R b KQ - 5 15,"[Event ""CHESS BOARD CLASH - Round 1""]\n[Site ""Chess.com""]\n[Date ""2013.02.01""]\n[Round ""-""]\n[White ""-Amos-""]\n[Black ""enhmandah""]\n[Result ""1-0""]\n[Tournament ""https://www.chess.com/tournament/just-another-clash""]\n[CurrentPosition ""rn1q1b1r/kb2p1pp/2p5/p1Q5/N1BP2n1/4PN2/1P3PPP/R1B1K2R b KQ - 5 15""]\n[Timezone ""UTC""]\n[ECO ""D00""]\n[ECOUrl ""https://www.chess.com/openings/Queens-Pawn-Opening-1...d5-2.e3""]\n[UTCDate ""2013.02.01""]\n[UTCTime ""11:24:19""]\n[WhiteElo ""1727""]\n[BlackElo ""842""]\n[TimeControl ""1/172800""]\n[Termination ""-Amos- won by resignation""]\n[StartTime ""11:24:19""]\n[EndDate ""2013.02.02""]\n[EndTime ""17:58:11""]\n[Link ""https://www.chess.com/game/daily/64714474""]\n\n1. d4 d5 2. e3 c6 3. c4 dxc4 4. Bxc4 b5 5. Bb3 a5 6. Qf3 Bb7 7. Bxf7+ Kd7 8. Qf5+ Kc7 9. Nf3 Nh6 10. Qe5+ Kb6 11. a4 bxa4 12. Nc3 Ng4 13. Nxa4+ Ka6 14. Bc4+ Ka7 15. Qc5+ 1-0\n"
3,enhmandah,-Amos-,https://api.chess.com/pub/player/enhmandah,https://api.chess.com/pub/player/-amos-,819,1727,checkmated,win,daily,1/172800,chess,True,r3kb1r/pp3ppp/3p1n2/2pKp3/P3P3/1P6/4qP1P/QNB5 w kq - 3 17,"[Event ""CHESS BOARD CLASH - Round 1""]\n[Site ""Chess.com""]\n[Date ""2013.02.01""]\n[Round ""-""]\n[White ""enhmandah""]\n[Black ""-Amos-""]\n[Result ""0-1""]\n[Tournament ""https://www.chess.com/tournament/just-another-clash""]\n[CurrentPosition ""r3kb1r/pp3ppp/3p1n2/2pKp3/P3P3/1P6/4qP1P/QNB5 w kq - 3 17""]\n[Timezone ""UTC""]\n[ECO ""B20""]\n[ECOUrl ""https://www.chess.com/openings/Sicilian-Defense-Snyder-Variation""]\n[UTCDate ""2013.02.01""]\n[UTCTime ""11:24:17""]\n[WhiteElo ""819""]\n[BlackElo ""1727""]\n[TimeControl ""1/172800""]\n[Termination ""-Amos- won by checkmate""]\n[StartTime ""11:24:17""]\n[EndDate ""2013.02.02""]\n[EndTime ""18:29:41""]\n[Link ""https://www.chess.com/game/daily/64714344""]\n\n1. e4 c5 2. b3 Nc6 3. a4 d6 4. Bb5 Bd7 5. Qf3 Nd4 6. Qf4 e5 7. Qg3 Nxc2+ 8. Kd1 Nxa1 9. Qc3 Bxb5 10. Qxa1 Qg5 11. d3 Qxg2 12. Ne2 Qxh1+ 13. Kd2 Bxd3 14. Kxd3 Qd1+ 15. Kc4 Qxe2+ 16. Kd5 Nf6# 0-1\n"
4,-Amos-,Shalllow-Blue,https://api.chess.com/pub/player/-amos-,https://api.chess.com/pub/player/shalllow-blue,1729,1116,win,resigned,daily,1/172800,chess,True,r3b2r/pp6/2pPpR1k/4n3/2P3Q1/3B4/PP4PP/R5K1 b - - 2 23,"[Event ""CHESS BOARD CLASH - Round 1""]\n[Site ""Chess.com""]\n[Date ""2013.02.01""]\n[Round ""-""]\n[White ""-Amos-""]\n[Black ""Shalllow-Blue""]\n[Result ""1-0""]\n[Tournament ""https://www.chess.com/tournament/just-another-clash""]\n[CurrentPosition ""r3b2r/pp6/2pPpR1k/4n3/2P3Q1/3B4/PP4PP/R5K1 b - - 2 23""]\n[Timezone ""UTC""]\n[ECO ""A40""]\n[ECOUrl ""https://www.chess.com/openings/Queens-Pawn-Opening-Horwitz-Defense-2.c4""]\n[UTCDate ""2013.02.01""]\n[UTCTime ""11:24:19""]\n[WhiteElo ""1729""]\n[BlackElo ""1116""]\n[TimeControl ""1/172800""]\n[Termination ""-Amos- won by resignation""]\n[StartTime ""11:24:19""]\n[EndDate ""2013.02.02""]\n[EndTime ""21:12:59""]\n[Link ""https://www.chess.com/game/daily/64714476""]\n\n1. d4 e6 2. c4 Qf6 3. Nf3 d6 4. Bg5 Qg6 5. Nc3 c6 6. e3 f6 7. Bd3 f5 8. e4 f4 9. e5 Qh5 10. Ne2 Be7 11. Nxf4 Bxg5 12. Nxh5 h6 13. exd6 Nd7 14. Nxg7+ Kf7 15. Ne5+ Kxg7 16. Qh5 Nxe5 17. dxe5 Bd7 18. O-O Be8 19. Qg4 Ne7 20. f4 Ng6 21. fxg5 Nxe5 22. gxh6+ Kxh6 23. Rf6+ 1-0\n"


The nature of the content in most columns is pretty clear:
- white_username and black_username: the players' usernames
- white_id and black_id: the players' unique IDs
- white_rating and black_rating: the players' ELO scores
- white_result and black_result: contain information regarding the game result
- time_class and time_control: game time format
- rated: whether a game is rated or not
- FEN: FEN notation of the game (indicates the last position of the game)

Two columns require further exploration:
- rules: Each row in the sample data contains "chess" for this column. What does "chess" mean in this context?
- PGN: Why doesn't the sample data show PGN notations?

### Data format of the rules column

In [495]:
# Possible values for the rules column
print(games["rules"].unique())

['chess' 'chess960' 'threecheck' 'crazyhouse' 'kingofthehill']


The rules column indicates whether the game followed traditional chess rules or a chess variant.

### Data format of the PGN column

In [496]:
# Sample data in the PGN column
print(games.loc[0, "pgn"])

[Event "Enjoyable games 2 - Round 1"]
[Site "Chess.com"]
[Date "2013.01.30"]
[Round "-"]
[White "-Amos-"]
[Black "miniman2804"]
[Result "1-0"]
[Tournament "https://www.chess.com/tournament/enjoyable-games-2"]
[CurrentPosition "r2r4/p2p1p1p/b6R/n1p1kp2/2P2P2/3BP3/PP5P/4K2R b K f3 1 22"]
[Timezone "UTC"]
[ECO "E22"]
[ECOUrl "https://www.chess.com/openings/Nimzo-Indian-Defense-Spielmann-Variation"]
[UTCDate "2013.01.30"]
[UTCTime "16:35:14"]
[WhiteElo "1708"]
[BlackElo "1608"]
[TimeControl "1/259200"]
[Termination "-Amos- won by checkmate"]
[StartTime "16:35:14"]
[EndDate "2013.02.01"]
[EndTime "18:14:48"]
[Link "https://www.chess.com/game/daily/64629816"]

1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 4. Qb3 Bxc3+ 5. Qxc3 O-O 6. Bg5 c5 7. dxc5 Nc6 8. Nf3 Qa5 9. Bxf6 gxf6 10. Qxa5 Nxa5 11. e3 Rd8 12. Rd1 Kg7 13. Be2 b6 14. Rd4 bxc5 15. Rg4+ Kh6 16. Bd3 f5 17. Rh4+ Kg6 18. g4 Ba6 19. gxf5+ exf5 20. Ne5+ Kf6 21. Rh6+ Kxe5 22. f4# 1-0



The PGN column contains nested data. During the data manipulation phase, we'll extract these data.
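As a rough preview of what that extraction could look like, here is a minimal sketch that splits a PGN string into its tag pairs and its move list using only the standard library. The helper name `parse_pgn` and the regex are my own assumptions, not part of the notebook, and the sample string is a shortened version of the first game.

```python
import re

# Hypothetical helper (not part of the original notebook).
# A tag pair is a line in the form [Name "value"]
TAG_PAIR_REGEX = re.compile(r'^\[(\w+) "(.*)"\]$', re.MULTILINE)

def parse_pgn(pgn_text):
    """Returns (tags, movetext): a dict of tag pairs and the remaining move list."""
    tags = dict(TAG_PAIR_REGEX.findall(pgn_text))
    # The movetext is whatever remains once the tag lines are removed
    movetext = TAG_PAIR_REGEX.sub("", pgn_text).strip()
    return tags, movetext

# Shortened sample taken from the first game of the dataset
sample_pgn = (
    '[Event "Enjoyable games 2 - Round 1"]\n'
    '[Date "2013.01.30"]\n'
    '[ECO "E22"]\n'
    '\n'
    '1. d4 Nf6 2. c4 e6 3. Nc3 Bb4\n'
)

tags, movetext = parse_pgn(sample_pgn)
print(tags["Date"])
print(tags["ECO"])
print(movetext)
```

This assumes that every tag sits on its own line, as in the sample above; PGN strings laid out differently would need a more tolerant parser.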

### Missing values

#### Number of NA values

We can check the number of NA values per column by using `isna().sum()`.

Then we can count the total number of NA values in the dataset by summing the per-column counts.

In [497]:
print("NA values: ", games.isna().sum().sum())

NA values:  0


There are no NA values in the dataset.

#### Number of empty strings

To check for missing values expressed as empty strings, we must first trim the leading and trailing spaces in string values.

We can achieve this by applying the `strip()` function to every value that is a string.

In [498]:
games = games.applymap(lambda colData: colData.strip() if isinstance(colData, str) else colData)

We can then transform empty strings into NA values.

In [499]:
games = games.replace("", pd.NA)

Now we check again for NA values.

In [500]:
print("Empty strings: ", games.isna().sum().sum())

Empty strings:  0


There are no empty strings in the dataset.

## Data Manipulation

After an initial exploration of the data structure, I've set an initial list of manipulations I want to perform on the raw data to prepare for the analysis:
- Determine if we can drop the player id columns
- Extract the winning player and game result type from the white_result and black_result columns
- Extract piece advantage, point advantage, number of turns at end of game from the FEN column
- Extract date and PGN notation from the PGN column

### Players manipulation

#### Checking username uniqueness

We want to determine whether we can remove the white_id and black_id columns. In order to do so, we must check whether there are players with different IDs that have the same username.

A single player (identified by id) could be present in both the white_id and black_id columns, if they have played sometimes as black and other times as white. So, the first step is identifying the unique players by id; then we can check for duplicate usernames.

In [501]:
# Extracts id and username of every white/black player
whitePlayers = games[["white_id", "white_username"]]
blackPlayers = games[["black_id", "black_username"]]

# Renames the columns with equal names in order to join them
whitePlayers = whitePlayers.rename(columns={"white_id": "id", "white_username": "username"})
blackPlayers = blackPlayers.rename(columns={"black_id": "id", "black_username": "username"})

# Joins the id and username of every player in a single dataframe (to then identify unique players)
players = pd.concat([whitePlayers, blackPlayers])

# Identifies unique players
players.drop_duplicates(subset=["id"], inplace=True)

# Prints the number of duplicated usernames
numOfPlayerByUsername = players["username"].value_counts()
print("Duplicated usernames: ", numOfPlayerByUsername[numOfPlayerByUsername > 1].size)

Duplicated usernames:  0


#### Removing redundant data

Having determined that usernames are valid unique identifiers, we can drop the white_id and black_id columns.

In [502]:
games.drop(columns=["white_id", "black_id"], inplace=True)
print(list(games.columns))

['white_username', 'black_username', 'white_rating', 'black_rating', 'white_result', 'black_result', 'time_class', 'time_control', 'rules', 'rated', 'fen', 'pgn']


### Game results manipulation

Currently, the types of game result are divided between two columns (white_result and black_result):
- for drawn games: both white_result and black_result contain the type of draw
- for won games: if white won, white_result contains the word `win` and black_result contains the type of victory; vice versa if black won

This structure is unfit for exploring the different game result types. We'll proceed to extract two new columns:
- winner: white, black, neither (in case of draw)
- result type: the win/draw cause (e.g. victory by checkmate, victory by timeout, draw by stalemate)

#### Extracting winning player

In [503]:
# Defined constants to avoid using hardwired strings in code
winnerInCaseOfDraw = "neither"
winResult = "win"
whitePlayer = "white"
blackPlayer = "black"

# Sets a unique column with the information on who won
games["winner"] = winnerInCaseOfDraw
games.loc[games["white_result"] == winResult, "winner"] = whitePlayer
games.loc[games["black_result"] == winResult, "winner"] = blackPlayer

print("Sample mapping of the winner column:")
print(games[["white_result", "black_result", "winner"]].head())

Sample mapping of the winner column:
  white_result black_result winner
0          win   checkmated  white
1          win     resigned  white
2          win     resigned  white
3   checkmated          win  black
4          win     resigned  white


#### Extracting game result type

In [504]:
# Defines how to read the result type from each game
def getResultType(game):
    return game["black_result"] if game["winner"] == whitePlayer else game["white_result"]

# Sets the result type in a new column
games["result_type"] = games.apply(lambda game: getResultType(game), axis=1)

print("Sample mapping of the new columns:")
print(games[["white_result", "black_result", "winner", "result_type"]].head())

Sample mapping of the new columns:
  white_result black_result winner result_type
0          win   checkmated  white  checkmated
1          win     resigned  white    resigned
2          win     resigned  white    resigned
3   checkmated          win  black  checkmated
4          win     resigned  white    resigned


#### Dropping old columns

In [477]:
games.drop(columns=["white_result", "black_result"], inplace=True)
print("Columns: ", list(games.columns))

Columns:  ['white_username', 'black_username', 'white_rating', 'black_rating', 'time_class', 'time_control', 'rules', 'rated', 'fen', 'pgn', 'winner', 'result_type']


### FEN data manipulation

#### Hints about FEN notation

A FEN string contains 6 pieces of information (separated by spaces) regarding a particular state of the board, in the format:

`board_setup next_player castling_options en_passant_options number_of_moves_since_last_capture number_of_turns`

Example: 

`r2qk2r/1pp2ppp/p3pn2/1b6/2NP4/8/PP1BQPPP/R4RK1 b kq - 1 15`

The main interest of the analysis regarding this data is in the piece and points advantage (readable from the board setup) and the number of turns. We'll proceed to extract these pieces of information first. 

#### Hints about board setup in FEN notation

The board setup is represented as a series of information on the board rows:
- each row is separated by a slash
- each row is represented as a series of letters and numbers (a letter is a piece, a number is a run of adjacent empty squares)
- K is king, Q is queen, R is rook, N is knight, B is bishop, P is pawn
- white pieces are in uppercase, black pieces are in lowercase

Example:

`r2qk2r/1pp2ppp/p3pn2/1b6/2NP4/8/PP1BQPPP/R4RK1`
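To make the encoding easier to read, here is a minimal sketch that expands a board setup into 8 rows of 8 characters, with `.` marking an empty square. The function name `expand_board` is hypothetical and not part of the notebook. The first row of a FEN board is the eighth rank, where Black's pieces start.

```python
def expand_board(board_setup):
    """Expands a FEN board setup into 8 strings of 8 characters ('.' = empty square)."""
    rows = []
    for row in board_setup.split("/"):
        expanded = ""
        for symbol in row:
            # A digit stands for that many adjacent empty squares; a letter is a piece
            expanded += "." * int(symbol) if symbol.isdigit() else symbol
        rows.append(expanded)
    return rows

board = expand_board("r2qk2r/1pp2ppp/p3pn2/1b6/2NP4/8/PP1BQPPP/R4RK1")
for row in board:
    print(row)
```

For the example above, the first row expands to `r..qk..r` and the last one to `R....RK.`.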

#### Hints about piece and points advantage

Each player starts with 16 pieces. Each piece has a different score, which is used to evaluate which player has an advantage in material:
- king: no score (since it can't be captured, it can only be checkmated)
- queen: 9 points
- rook: 5 points
- bishop: 3 points
- knight: 3 points
- pawn: 1 point

Sometimes the two players have the same total score in terms of points, but one of them has a higher number of pieces (pieces with a smaller score). This can mean occupying more physical squares on the board, which, according to chess literature, can be an important factor during the endgame (the final phase of the game).

Part of the analysis goals is understanding how differences in terms of available pieces and point advantage reflect on winrates.
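As a worked example of the scoring above, here is a minimal standard-library sketch that counts pieces and points for both players directly from a board setup. The names `PIECE_POINTS` and `material` are my own and hypothetical; the board is the final position of the first game in the dataset.

```python
# Scores from the list above; the king has no score
PIECE_POINTS = {"q": 9, "r": 5, "b": 3, "n": 3, "p": 1}

def material(board_setup):
    """Returns (white_pieces, white_points, black_pieces, black_points)."""
    white_pieces = white_points = black_pieces = black_points = 0
    for symbol in board_setup:
        if not symbol.isalpha():
            continue  # Skips digits (empty squares) and slashes (row separators)
        points = PIECE_POINTS.get(symbol.lower(), 0)
        if symbol.isupper():
            # Uppercase letters are white pieces
            white_pieces += 1
            white_points += points
        else:
            # Lowercase letters are black pieces
            black_pieces += 1
            black_points += points
    return white_pieces, white_points, black_pieces, black_points

print(material("r2r4/p2p1p1p/b6R/n1p1kp2/2P2P2/3BP3/PP5P/4K2R"))
```

In this position White has 10 pieces worth 19 points and Black has 11 pieces worth 22 points, so White delivered checkmate while behind in both counts.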

#### How to extract info from FEN notation

Each piece of information is separated by a space. We could simply extract different pieces by using the `split()` function, like this:

In [505]:
sampleFenString = "r2qk2r/1pp2ppp/p3pn2/1b6/2NP4/8/PP1BQPPP/R4RK1 b kq - 1 15"
sampleFenString.split()

['r2qk2r/1pp2ppp/p3pn2/1b6/2NP4/8/PP1BQPPP/R4RK1', 'b', 'kq', '-', '1', '15']

But this wouldn't give us any control over data validity. By using regular expressions to extract the data, we can detect rows that don't follow the expected format.

In [479]:
# Each piece of info is enclosed in a capture group (parenthesis) so that we can then extract them

# A single row can contain 1 to 8 values, letters or numbers
# A letter indicates a piece
# A number indicates a run of adjacent empty squares (there can be max 8, in which case it's an empty row)
rowBoardRegex = "[rnbqkpRNBQKP1-8]{1,8}"

# There are 8 rows. Each row (except the last one) has a slash to the right
boardRegex = "(" + ((rowBoardRegex + "/") * 7) + rowBoardRegex + ")"

# Either white or black
nextPlayerRegex = "(w|b)"

# Shows availability of kingside/queenside castling for white and black; If there are no available options, there's a dash 
castlingRegex = "([KQkq]{1,4}|-)"

# Shows square coordinates where the next player can move en passant (or a dash if there is no en passant move)
enPassantRegex = "([a-h][1-8]|-)"

# Number
movesSinceLastCaptureRegex = "([0-9]+)"

# Number
numberOfTurnsRegex = "([0-9]+)"

# Joins the different controls to create the regex for the FEN strings
fenRegex = " ".join([boardRegex, nextPlayerRegex, castlingRegex, enPassantRegex, movesSinceLastCaptureRegex, numberOfTurnsRegex])

# Sets symbols ^ and $ to specify start and end of string
fenRegex = "^" + fenRegex + "$"

#### Examining FEN data validity

In [480]:
# Checks for every game if the FEN data is valid
games["fen_check"] = games["fen"].str.extract(fenRegex)[0]
print("Number of games with invalid FEN data: ", games["fen_check"].isna().sum())

Number of games with invalid FEN data:  64934


We can see that a lot of games have invalid FEN data. Let's explore the issue by checking a sample of invalid data.

In [507]:
# Sample of invalid data
games[games["fen_check"].isna()]["fen"].head(3)

651               8/5kbR/3r1p2/P1pPp1p1/r5P1/8/8/3R1K2 w D - 4 34
664        1kn2r1n/4p1pp/2B5/1pP2p2/7q/2Q3N1/3K2P1/3R4 w f - 0 27
671    rkbbnrqn/pppppppp/8/8/8/3P4/PPP1PPPP/RKBBNRQN b FQfq - 0 1
Name: fen, dtype: object

As explained in the hints on FEN, a complete FEN string has 6 pieces of data separated by spaces. We can see that, in the sample of invalid data, there are only 4 pieces of information. 

Some of the FEN strings are missing the last 2 pieces of information (number of moves since last capture, number of turns).

This means that:
- the FEN column can't be used reliably to read the number of turns (we'll have to read it from the PGN data)
- the regex needs to be slightly altered to consider optionality of the last 2 parameters

In [508]:
# Fixed regex with last 2 parameters set as optional
fenRegex = "^" + boardRegex + " " + nextPlayerRegex + " " + castlingRegex + " " + enPassantRegex + \
    "( " + movesSinceLastCaptureRegex + " " + numberOfTurnsRegex + "){0,1}"

# Checks for every game if the FEN data is valid
games["fen_check"] = games["fen"].str.extract(fenRegex)[0]
print("Number of games with invalid FEN data: ", games["fen_check"].isna().sum())

Number of games with invalid FEN data:  28


In [509]:
# Sample of invalid data
pd.set_option('display.max_colwidth', None)
games[games["fen_check"].isna()][["fen", "fen_check"]].head(3)

Unnamed: 0,fen,fen_check
651,8/5kbR/3r1p2/P1pPp1p1/r5P1/8/8/3R1K2 w D - 4 34,
664,1kn2r1n/4p1pp/2B5/1pP2p2/7q/2Q3N1/3K2P1/3R4 w f - 0 27,
671,rkbbnrqn/pppppppp/8/8/8/3P4/PPP1PPPP/RKBBNRQN b FQfq - 0 1,


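Another issue with some of the FEN strings, as we can see in the sample, is that the castling parameter contains characters like `D` and `F`, which the regex doesn't accept. These may come from Chess960 games (row 671 is not a standard starting position), where castling rights are sometimes written with the file letter of the rook instead of `K` and `Q`.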

Given that the most important piece of information we want to extract is the board setup (to calculate the piece and points advantage), let's change approach and check whether all the FEN strings contain valid board setups.

#### Extracting board setup

In [510]:
games["fen_board_setup"] = games["fen"].str.extract(boardRegex)[0]

print("Number of games with invalid board information: ", games["fen_board_setup"].isna().sum())

Number of games with invalid board information:  0


In [511]:
print("Sample board setups:")
print(games["fen_board_setup"].head(3))

Sample board setups:
0            r2r4/p2p1p1p/b6R/n1p1kp2/2P2P2/3BP3/PP5P/4K2R
1                        8/5Q1k/4n1pp/8/7P/2N2b2/PP3P2/5K2
2    rn1q1b1r/kb2p1pp/2p5/p1Q5/N1BP2n1/4PN2/1P3PPP/R1B1K2R
Name: fen_board_setup, dtype: object
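Note that the board regex only checks which characters appear in each row, so a row like `88` (16 squares) would still pass. A stricter check could verify that every row describes exactly 8 squares. The sketch below is one way to do it; the function name `board_rows_are_complete` is hypothetical and not part of the notebook.

```python
def board_rows_are_complete(board_setup):
    """True if the board has 8 rows and every row describes exactly 8 squares."""
    rows = board_setup.split("/")
    if len(rows) != 8:
        return False
    for row in rows:
        # A digit counts as that many empty squares, a letter counts as one occupied square
        squares = sum(int(symbol) if symbol.isdigit() else 1 for symbol in row)
        if squares != 8:
            return False
    return True

# Final position of the second game in the dataset
print(board_rows_are_complete("8/5Q1k/4n1pp/8/7P/2N2b2/PP3P2/5K2"))
# Made-up board whose last row describes 16 squares
print(board_rows_are_complete("8/8/8/8/8/8/8/88"))
```

The first call returns `True` and the second `False`. Assuming the `fen_board_setup` column from above, the check could be applied to the dataset with `games["fen_board_setup"].apply(board_rows_are_complete)`.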


#### Extracting number of pieces

To calculate the piece and points advantage, first we need to count how many pieces each player has remaining.

In [484]:
# Counts white pieces
games["white_pieces_kings"] = games["fen_board_setup"].str.count("K")
games["white_pieces_queens"] = games["fen_board_setup"].str.count("Q")
games["white_pieces_rooks"] = games["fen_board_setup"].str.count("R")
games["white_pieces_knights"] = games["fen_board_setup"].str.count("N")
games["white_pieces_bishops"] = games["fen_board_setup"].str.count("B")
games["white_pieces_pawns"] = games["fen_board_setup"].str.count("P")

# Counts black pieces
games["black_pieces_kings"] = games["fen_board_setup"].str.count("k")
games["black_pieces_queens"] = games["fen_board_setup"].str.count("q")
games["black_pieces_rooks"] = games["fen_board_setup"].str.count("r")
games["black_pieces_knights"] = games["fen_board_setup"].str.count("n")
games["black_pieces_bishops"] = games["fen_board_setup"].str.count("b")
games["black_pieces_pawns"] = games["fen_board_setup"].str.count("p")

# Counts totals
games["white_pieces"] = games.loc[:, "white_pieces_kings":"white_pieces_pawns"].sum(axis=1)
games["black_pieces"] = games.loc[:, "black_pieces_kings":"black_pieces_pawns"].sum(axis=1)

# Sample data
sampleGame = games.iloc[0, :]
print("Sample comparison between FEN board and number of pieces:")
print(sampleGame["fen_board_setup"])
print("White pieces: ", sampleGame["white_pieces"])
print("White knights: ", sampleGame["white_pieces_knights"])
print("Black pieces: ", sampleGame["black_pieces"])
print("Black knights: ", sampleGame["black_pieces_knights"])

Sample comparison between FEN board and number of pieces:
r2r4/p2p1p1p/b6R/n1p1kp2/2P2P2/3BP3/PP5P/4K2R
White pieces:  10
White knights:  0
Black pieces:  11
Black knights:  1


#### Extracting piece advantage

Now we'll calculate the difference in number of pieces between the winner and the loser.
If the number is negative, it means that the winner has won with fewer pieces remaining than the loser.

In [485]:
# Can be used to calculate points or pieces advantage by passing the desired column name
# If White has won, we calculate normally; otherwise, we flip the result
# Note: drawn games (winner is "neither") are also flipped, so their value is from Black's point of view
def getAdvantage(game, columnName):
    return (game["white_" + columnName] - game["black_" + columnName]) \
        * (1 if game["winner"] == whitePlayer else -1)

games["piece_advantage"] = games.apply(lambda game: getAdvantage(game, "pieces"), axis=1)

# Sample data
print("Sample piece advantage data:")
print(games[["winner", "white_pieces", "black_pieces", "piece_advantage"]].head(3))

Sample piece advantage data:
  winner  white_pieces  black_pieces  piece_advantage
0  white            10            11               -1
1  white             7             5                2
2  white            14            13                1


#### Extracting points advantage

Now we'll calculate the difference in piece points between the winner and the loser.
If the number is negative, it means that the winner has won with a weaker set of pieces than the loser.

In [486]:
def getPoints(game, player):
    return game[player + "_pieces_queens"] * 9 + game[player + "_pieces_rooks"] * 5 + \
    game[player + "_pieces_knights"] * 3 + game[player + "_pieces_bishops"] * 3 + game[player + "_pieces_pawns"] * 1

games["white_points"] = games.apply(lambda game: getPoints(game, whitePlayer), axis=1)
games["black_points"] = games.apply(lambda game: getPoints(game, blackPlayer), axis=1)
games["points_advantage"] = games.apply(lambda game: getAdvantage(game, "points"), axis=1)

# Sample data
sampleGame = games.iloc[0, :]
print("Sample point calculation: ")
print("White points: ", sampleGame["white_points"])
print("White queens: ", sampleGame["white_pieces_queens"], " * 9 points each")
print("White rooks: ", sampleGame["white_pieces_rooks"], " * 5 points each")
print("White knights: ", sampleGame["white_pieces_knights"], " * 3 points each")
print("White bishops: ", sampleGame["white_pieces_bishops"], " * 3 points each")
print("White pawns: ", sampleGame["white_pieces_pawns"], " * 1 point each")

print("\nSample points advantage data:")
print(games[["winner", "white_points", "black_points", "points_advantage"]].head(3))

Sample point calculation: 
White points:  19
White queens:  0  * 9 points each
White rooks:  2  * 5 points each
White knights:  0  * 3 points each
White bishops:  1  * 3 points each
White pawns:  6  * 1 point each

Sample points advantage data:
  winner  white_points  black_points  points_advantage
0  white            19            22                -3
1  white            16             8                 8
2  white            37            36                 1


#### Extracting number of turns

In [487]:
nextPlayerRegex = "(w|b)"
castlingRegex = "([KQkq]{1,4}|-)"
enPassantRegex = "([a-h][1-8]|-)"
movesSinceLastCaptureRegex = "([0-9]+)"
numberOfTurnsRegex = "([0-9]+)$"

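# Board setup, next player, any fields in between, then the number of turns at the end of the string
fenRegex = boardRegex + " (?:w|b) .* " + numberOfTurnsRegex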

### PGN column manipulation

As seen in the initial data exploration, the PGN column contains nested data.

In [488]:
# Sample data
print(games.loc[0, "pgn"])

[Event "Enjoyable games 2 - Round 1"]
[Site "Chess.com"]
[Date "2013.01.30"]
[Round "-"]
[White "-Amos-"]
[Black "miniman2804"]
[Result "1-0"]
[Tournament "https://www.chess.com/tournament/enjoyable-games-2"]
[CurrentPosition "r2r4/p2p1p1p/b6R/n1p1kp2/2P2P2/3BP3/PP5P/4K2R b K f3 1 22"]
[Timezone "UTC"]
[ECO "E22"]
[ECOUrl "https://www.chess.com/openings/Nimzo-Indian-Defense-Spielmann-Variation"]
[UTCDate "2013.01.30"]
[UTCTime "16:35:14"]
[WhiteElo "1708"]
[BlackElo "1608"]
[TimeControl "1/259200"]
[Termination "-Amos- won by checkmate"]
[StartTime "16:35:14"]
[EndDate "2013.02.01"]
[EndTime "18:14:48"]
[Link "https://www.chess.com/game/daily/64629816"]

1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 4. Qb3 Bxc3+ 5. Qxc3 O-O 6. Bg5 c5 7. dxc5 Nc6 8. Nf3 Qa5 9. Bxf6 gxf6 10. Qxa5 Nxa5 11. e3 Rd8 12. Rd1 Kg7 13. Be2 b6 14. Rd4 bxc5 15. Rg4+ Kh6 16. Bd3 f5 17. Rh4+ Kg6 18. g4 Ba6 19. gxf5+ exf5 20. Ne5+ Kf6 21. Rh6+ Kxe5 22. f4# 1-0


Some of the data in it (e.g. white/black ELO, termination cause) is redundant with other columns in the dataset. Other pieces of data are less interesting from an analysis standpoint (e.g. timezone, link to the game page).

The two pieces of data we're interested in extracting are:
- game date (to perform analysis on dates)
- game PGN (to perform analysis on moves)

The first thing we'll do is rename the PGN column to a less ambiguous name (so that we can then use the `pgn` name for the column that will contain only the PGN moves).
We won't remove the current PGN column, since we may want to extract additional data from it depending on the results of the analysis.

In [489]:
games.rename(columns={"pgn": "game_information"}, inplace=True)

#### Extracting game date from PGN column

We'll use a regular expression to extract the game date from the game_information column, looking for the values contained within the [Date] tag.

In [490]:
# A date in the game_information column is composed as year.month.day
# Raw strings (r"...") keep the regex backslashes from being read as Python escape sequences
dateStringRegex = r"[0-9]{4}\.[0-9]{2}\.[0-9]{2}"
# The date is contained in a tag in the structure [Date "my_date"]
# We only want the my_date part, so we put it in a capture group (in parentheses)
dateFieldRegex = r'\[Date "(' + dateStringRegex + r')"\]'

games["game_date"] = games["game_information"].str.extract(dateFieldRegex)[0]

# Sample data
print("Number of games with invalid date information: ", games["game_date"].isna().sum())
print("Sample dates:")
print(games["game_date"].head(3))

Number of games with invalid date information:  0
Sample dates:
0    2013.01.30
1    2013.01.19
2    2013.02.01
Name: game_date, dtype: object
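The extracted dates are still strings (`dtype: object`). To analyse them as dates they need to be converted. Here is a minimal standard-library sketch of the conversion for a single value; the helper name `parse_game_date` is hypothetical and not part of the notebook.

```python
from datetime import datetime

def parse_game_date(date_string):
    """Converts a 'year.month.day' string into a date object."""
    # "%Y.%m.%d" matches the layout of the [Date] tag, e.g. "2013.01.30"
    return datetime.strptime(date_string, "%Y.%m.%d").date()

game_date = parse_game_date("2013.01.30")
print(game_date.isoformat())
print(game_date.year, game_date.month)
```

On the whole column, the same format string could probably be passed to pandas instead, as in `pd.to_datetime(games["game_date"], format="%Y.%m.%d")`.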


#### Extracting PGN moves data

A valid PGN move list is composed of a series of numbers (representing turns), each followed by the white and black moves for that turn.

Example: `1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 [...]`
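Since the FEN column can't be relied on for the number of turns, one option is to read it from the move list. The sketch below takes the highest move number found in the text. The function name `count_turns` is hypothetical, and the step that removes `{...}` comments is an assumption about annotated PGNs, which don't appear in the samples shown here.

```python
import re

def count_turns(movetext):
    """Returns the highest move number found in a PGN move list (0 if there is none)."""
    # Drops {...} comments first, since they could contain digits followed by a dot
    without_comments = re.sub(r"\{[^}]*\}", "", movetext)
    # A move number is a run of digits followed by a dot, at the start or after whitespace
    move_numbers = re.findall(r"(?:^|\s)(\d+)\.", without_comments)
    return max((int(number) for number in move_numbers), default=0)

# Move list of the third game in the dataset
sample_moves = (
    "1. d4 d5 2. e3 c6 3. c4 dxc4 4. Bxc4 b5 5. Bb3 a5 6. Qf3 Bb7 "
    "7. Bxf7+ Kd7 8. Qf5+ Kc7 9. Nf3 Nh6 10. Qe5+ Kb6 11. a4 bxa4 "
    "12. Nc3 Ng4 13. Nxa4+ Ka6 14. Bc4+ Ka7 15. Qc5+ 1-0"
)
print(count_turns(sample_moves))
```

For this game the function returns 15, which agrees with the last field of its FEN string (`... b KQ - 5 15`). The trailing result (`1-0`) is not counted, because its digits are not followed by a dot.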