# Dataset Analysis: "60,000+ Chess Game Dataset (Chess.com)"

## Author

Analysis by Ivan Lamparelli

Written in October 2021

## Dataset Description

The dataset gathers information regarding chess games on the [Chess.com](https://www.chess.com/) platform. 

The dataset has been retrieved from [Kaggle](https://www.kaggle.com/adityajha1504/chesscom-user-games-60000-games). It consists of a single 160 MB .csv file with 14 columns and 66,879 rows.

Each row is related to a single game, and contains data like:
- username and ELO score of both players
- which player has won
- type of victory/draw (e.g. victory by checkmate or on time)
- time format
- game format
- game FEN and PGN

## Analysis goals

The questions I set out to answer when starting this analysis are the following:
- How is ELO distributed among players?
- What are the most common opening moves? 
- How do openings change at different ELO ranges?
- How many turns is an average game?
- How does the time format influence the number of turns and the piece advantage between players?
- Do high ELO players end the game with a lower or higher piece advantage on average?
- What are the most common win/draw conditions overall? Are these influenced by ELO and time format?

Further questions may emerge during the analysis depending on what I discover.

## Project setup

### Project libraries

In [110]:
import pandas as pd
import os

### Dataset import

In [111]:
# os.path.join builds the path with the correct separator on both Windows and Unix environments
games = pd.read_csv(os.path.join("data", "club_games_data.csv"))

## Dataset structure

### Column list

In [112]:
print(list(games.columns))

['white_username', 'black_username', 'white_id', 'black_id', 'white_rating', 'black_rating', 'white_result', 'black_result', 'time_class', 'time_control', 'rules', 'rated', 'fen', 'pgn']


### Sample data

In [113]:
games.head()

,white_username,black_username,white_id,black_id,white_rating,black_rating,white_result,black_result,time_class,time_control,rules,rated,fen,pgn
0,-Amos-,miniman2804,https://api.chess.com/pub/player/-amos-,https://api.chess.com/pub/player/miniman2804,1708,1608,win,checkmated,daily,1/259200,chess,True,r2r4/p2p1p1p/b6R/n1p1kp2/2P2P2/3BP3/PP5P/4K2R ...,"[Event ""Enjoyable games 2 - Round 1""]\n[Site ""..."
1,-Amos-,koltcho69,https://api.chess.com/pub/player/-amos-,https://api.chess.com/pub/player/koltcho69,1726,1577,win,resigned,daily,1/172800,chess,True,8/5Q1k/4n1pp/8/7P/2N2b2/PP3P2/5K2 b - - 1 33,"[Event ""Rapid Rats - Board 5""]\n[Site ""Chess.c..."
2,-Amos-,enhmandah,https://api.chess.com/pub/player/-amos-,https://api.chess.com/pub/player/enhmandah,1727,842,win,resigned,daily,1/172800,chess,True,rn1q1b1r/kb2p1pp/2p5/p1Q5/N1BP2n1/4PN2/1P3PPP/...,"[Event ""CHESS BOARD CLASH - Round 1""]\n[Site ""..."
3,enhmandah,-Amos-,https://api.chess.com/pub/player/enhmandah,https://api.chess.com/pub/player/-amos-,819,1727,checkmated,win,daily,1/172800,chess,True,r3kb1r/pp3ppp/3p1n2/2pKp3/P3P3/1P6/4qP1P/QNB5 ...,"[Event ""CHESS BOARD CLASH - Round 1""]\n[Site ""..."
4,-Amos-,Shalllow-Blue,https://api.chess.com/pub/player/-amos-,https://api.chess.com/pub/player/shalllow-blue,1729,1116,win,resigned,daily,1/172800,chess,True,r3b2r/pp6/2pPpR1k/4n3/2P3Q1/3B4/PP4PP/R5K1 b -...,"[Event ""CHESS BOARD CLASH - Round 1""]\n[Site ""..."


The content of most columns is self-explanatory:
- white_username and black_username: the players' usernames
- white_id and black_id: the players' unique IDs (Chess.com API URLs)
- white_rating and black_rating: the players' ELO scores
- white_result and black_result: the outcome of the game for each player (e.g. win, checkmated, resigned)
- time_class and time_control: the game's time format
- rated: whether the game is rated
- fen: the FEN string describing the final position of the game

Two columns require further exploration:
- rules: every row in the sample data contains "chess" in this column. What does "chess" mean in this context?
- pgn: the sample data shows only truncated values that start with bracketed tags. What does the full column contain?

### Data format of the rules column

In [114]:
# Possible values for the rules column
print(games["rules"].unique())

['chess' 'chess960' 'threecheck' 'crazyhouse' 'kingofthehill']


The rules column indicates whether the game followed traditional chess rules or a chess variant.
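To see how the games are split across rule sets, the unique values can be complemented with their frequencies. The sketch below runs on a small made-up dataframe (`sample` and its values are illustrative, not taken from the dataset); on the real data the same calls would be applied to `games`.

```python
import pandas as pd

# Toy data standing in for the real `games` dataframe (values are illustrative only)
sample = pd.DataFrame({"rules": ["chess", "chess", "chess960", "chess", "crazyhouse"]})

# Number of games per rule set, most frequent first
rule_counts = sample["rules"].value_counts()

# Keeps only the games played with traditional chess rules
standard_only = sample[sample["rules"] == "chess"]

print(rule_counts)
print("Standard games:", len(standard_only))
```

Filtering on `rules == "chess"` may be useful later on, since opening statistics are probably not comparable across variants.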

### Data format of the PGN column

In [115]:
# Sample data in the PGN column
print(games.loc[0, "pgn"])

[Event "Enjoyable games 2 - Round 1"]
[Site "Chess.com"]
[Date "2013.01.30"]
[Round "-"]
[White "-Amos-"]
[Black "miniman2804"]
[Result "1-0"]
[Tournament "https://www.chess.com/tournament/enjoyable-games-2"]
[CurrentPosition "r2r4/p2p1p1p/b6R/n1p1kp2/2P2P2/3BP3/PP5P/4K2R b K f3 1 22"]
[Timezone "UTC"]
[ECO "E22"]
[ECOUrl "https://www.chess.com/openings/Nimzo-Indian-Defense-Spielmann-Variation"]
[UTCDate "2013.01.30"]
[UTCTime "16:35:14"]
[WhiteElo "1708"]
[BlackElo "1608"]
[TimeControl "1/259200"]
[Termination "-Amos- won by checkmate"]
[StartTime "16:35:14"]
[EndDate "2013.02.01"]
[EndTime "18:14:48"]
[Link "https://www.chess.com/game/daily/64629816"]

1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 4. Qb3 Bxc3+ 5. Qxc3 O-O 6. Bg5 c5 7. dxc5 Nc6 8. Nf3 Qa5 9. Bxf6 gxf6 10. Qxa5 Nxa5 11. e3 Rd8 12. Rd1 Kg7 13. Be2 b6 14. Rd4 bxc5 15. Rg4+ Kh6 16. Bd3 f5 17. Rh4+ Kg6 18. g4 Ba6 19. gxf5+ exf5 20. Ne5+ Kf6 21. Rh6+ Kxe5 22. f4# 1-0



The PGN column contains nested data: a block of tag pairs (event, date, ECO code, termination, and so on) followed by the move list. During the data manipulation phase, we'll extract these data.
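As a rough idea of how that extraction could work, the sketch below pulls the tag pairs and the move list out of a PGN string with the standard `re` module. The helper names (`parse_pgn_headers`, `extract_movetext`) are hypothetical, the sample string is a shortened version of the game printed above, and the code assumes that headers and moves are separated by a blank line, as in that output.

```python
import re

# Matches header lines such as: [ECO "E22"]
HEADER_PATTERN = re.compile(r'^\[(\w+) "(.*)"\]$', re.MULTILINE)

def parse_pgn_headers(pgn: str) -> dict:
    """Returns the PGN tag pairs as a {tag: value} dictionary."""
    return dict(HEADER_PATTERN.findall(pgn))

def extract_movetext(pgn: str) -> str:
    """Returns the text after the first blank line (assumed to be the move list)."""
    _, _, movetext = pgn.partition("\n\n")
    return movetext.strip()

# Shortened illustration: only three tags and the first three moves are kept
sample_pgn = (
    '[Event "Enjoyable games 2 - Round 1"]\n'
    '[ECO "E22"]\n'
    '[Termination "-Amos- won by checkmate"]\n'
    '\n'
    '1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 1-0\n'
)

headers = parse_pgn_headers(sample_pgn)
print(headers["ECO"])
print(extract_movetext(sample_pgn))
```

A dedicated PGN parser would be more robust for edge cases such as comments inside the move list; the regular expression is only meant to show the shape of the data.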

## Missing values

### Number of NA values

We can check the number of NA values per column by using `isna().sum()`.

Then we can count the total number of NA values in the dataset by summing the per-column counts.

In [116]:
print("NA values: ", games.isna().sum().sum())

NA values:  0


There are no NA values in the dataset.
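Since the real dataset has no NA values, the two-step count is easier to follow on a toy dataframe that does contain some. The data below is made up for illustration (`toy` is not part of the dataset).

```python
import pandas as pd

# Toy dataframe with two missing values (illustrative data only)
toy = pd.DataFrame({
    "white_rating": [1708, None, 1727],
    "time_class": ["daily", "blitz", None],
    "rated": [True, True, False],
})

# First sum: number of NA values in each column
na_per_column = toy.isna().sum()

# Second sum: total number of NA values in the whole dataframe
total_na = na_per_column.sum()

print(na_per_column)
print("NA values: ", total_na)
```

Inspecting `na_per_column` before collapsing it to a single number shows which columns would need attention if the total were not zero.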

### Number of empty strings

To check for missing values expressed as empty strings, we must first trim leading and trailing whitespace from the string values.

We can achieve this by applying the strip() function to every string value in the dataset.

In [117]:
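# Strips leading and trailing whitespace from every string cell, leaving other types untouched
# (pandas 2.1+ deprecates applymap in favor of DataFrame.map, which takes the same function)
games = games.applymap(lambda colData: colData.strip() if isinstance(colData, str) else colData)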

We can then transform empty strings into NA values.

In [118]:
games = games.replace("", pd.NA)

Now we check again for NA values.

In [119]:
print("Empty strings: ", games.isna().sum().sum())

Empty strings:  0


There are no empty strings in the dataset.
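The same check can be sketched with the vectorized `str.strip()` accessor applied column by column, which is an alternative to the cell-by-cell approach used above. The toy dataframe is made up so that one value consists only of spaces.

```python
import pandas as pd

# Toy dataframe where one username contains only whitespace (illustrative data only)
toy = pd.DataFrame({
    "white_username": ["  -Amos-", "   ", "enhmandah "],
    "white_rating": [1708, 1726, 819],
})

# Strips whitespace in text columns only; numeric columns are skipped
for name in toy.columns:
    if pd.api.types.is_string_dtype(toy[name]):
        toy[name] = toy[name].str.strip()

# Whitespace-only values are now empty strings and can be turned into NA
toy = toy.replace("", pd.NA)

missing_count = int(toy.isna().sum().sum())
print("Empty strings: ", missing_count)
```

Without the stripping step, the value made of three spaces would not equal `""` and would go unnoticed.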

## Players

### Checking username uniqueness

We want to determine whether we can remove the white_id and black_id columns.

To do so, we must check whether any username is associated with more than one ID.

In [120]:
# Extracts id-username pairs for white and black players
whitePlayers = games[["white_username", "white_id"]].rename(columns={"white_username": "username", "white_id": "id"})
blackPlayers = games[["black_username", "black_id"]].rename(columns={"black_username": "username", "black_id": "id"})

# Gathers the id-username data of every game in a common dataframe
players = pd.concat([whitePlayers, blackPlayers])

# Gathers the unique id-username pairs for each player
players = players.drop_duplicates()

# Prints the number of duplicated usernames
numOfPlayerByUsername = players["username"].value_counts()
print("Duplicated usernames: ", numOfPlayerByUsername[numOfPlayerByUsername > 1].size)

Duplicated usernames:  0
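An equivalent way to express this check is to count the distinct IDs per username with `groupby` and `nunique`. The sketch uses entirely hypothetical usernames and IDs, built so that one username maps to two IDs and the check has something to report.

```python
import pandas as pd

# Hypothetical id-username pairs: "player_b" appears with two different IDs
toy_players = pd.DataFrame({
    "username": ["player_a", "player_a", "player_b", "player_b"],
    "id": ["id-1", "id-1", "id-2", "id-3"],
})

# Number of distinct IDs associated with each username
ids_per_username = toy_players.groupby("username")["id"].nunique()

# Usernames that cannot be used as unique identifiers
ambiguous = ids_per_username[ids_per_username > 1]

print("Duplicated usernames: ", ambiguous.size)
print(list(ambiguous.index))
```

This version skips the explicit `drop_duplicates()` step, because `nunique()` already ignores repeated pairs.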


### Removing redundant data

Having determined that usernames are valid unique identifiers, we can drop the white_id and black_id columns.

In [121]:
games = games.drop(columns=["white_id", "black_id"])
print(list(games.columns))

['white_username', 'black_username', 'white_rating', 'black_rating', 'white_result', 'black_result', 'time_class', 'time_control', 'rules', 'rated', 'fen', 'pgn']
