# Loading, Cleaning, EDA, Feature Engineering

## Introduction

### Problem Statement

As technology has advanced, it has become harder to catch chess cheaters by finding the hardware they carry. Detecting that hardware is estimated to cost chess tournaments about 10x more than the cheating hardware itself. Because of this, statistical methods for predicting when a player is cheating at chess (i.e. using a chess engine to find the best move in a position) are paramount to preserving the integrity of chess competitions.


Predicting whether someone is cheating is, at its root, a classification problem. The scope of this project is to develop methods that can predict whether white (i.e. the player of the white pieces) or black (i.e. the player of the black pieces) is cheating. The ideal dataset would contain games played between humans, each labelled by whether white, black or neither was cheating. Unfortunately, such datasets are not publicly available: chess cheaters are reluctant to identify themselves, and chess platforms with sophisticated cheater-detection methods do not share their datasets, to minimise the risk of cheaters learning how they are being detected.


To circumvent this issue, games were collected where humans played against a chess engine and the engines were labelled 'cheaters'. Games were also collected where titled chess players (high-rated players) played against one another and were assumed to be games where neither player was cheating.


The games were collected from the website 'FICS Games' (www.ficsgames.org), a free chess database.


-----

### Data Collection

Games were downloaded as PGN files and converted to JSON files (using a script adapted from https://github.com/Assios/pgn-to-json).
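
The conversion script itself is not reproduced here. As a rough illustration only, the sketch below turns one PGN game into a JSON-ready dict using just the standard library. The function name `pgn_game_to_dict`, the sample game and the assumption that move times appear as `{[%emt …]}` comments are all illustrative rather than taken from the original script; a full parser such as python-chess's `chess.pgn` module handles variations and edge cases that this does not.

```python
import json
import re

# Header tags look like: [White "alice"]
HEADER_RE = re.compile(r'\[(\w+) "([^"]*)"\]')
# Assumed move-time comment format: {[%emt 1.5]}
EMT_RE = re.compile(r'\[%emt (\d+(?:\.\d+)?)\]')
RESULT_TOKENS = {"1-0", "0-1", "1/2-1/2", "*"}


def pgn_game_to_dict(pgn_game):
    """Convert a single PGN game (headers + movetext) into a plain dict."""
    pgn_game = pgn_game.strip()
    headers = dict(HEADER_RE.findall(pgn_game))
    # The movetext is separated from the header block by a blank line
    movetext = pgn_game.split("\n\n", 1)[1] if "\n\n" in pgn_game else ""
    emt = [float(t) for t in EMT_RE.findall(movetext)]
    # Remove comments, then drop move numbers and the result token
    tokens = re.sub(r"\{[^}]*\}", " ", movetext).split()
    moves = [t for t in tokens
             if not re.fullmatch(r"\d+\.+", t) and t not in RESULT_TOKENS]
    return {**headers, "moves": moves, "emt": emt}


sample_pgn = """
[Event "FICS rated blitz game"]
[White "alice"]
[Black "bob"]
[Result "1-0"]

1. e4 {[%emt 0.0]} e5 {[%emt 1.5]} 2. Nf3 {[%emt 2.25]} 1-0
"""

record = pgn_game_to_dict(sample_pgn)
json_text = json.dumps(record)
```

The `moves` and `emt` keys mirror the column names that are dropped from `X` later in this notebook.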

----

## Loading and Cleaning

In [None]:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import chess
import io
# import chess.pgn
from io import StringIO
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go

import joblib
import re

In [None]:
import detecting_cheaters_in_chess_helpers as hp

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

----

### Loading in Human vs Computer

In [None]:
# NOTE: np.arange(2021, 2019, -1) yields only 2021 and 2020; loading 2022 down to
# 2018 inclusive would need np.arange(2022, 2017, -1)
raw_pickle_list = [joblib.load(f'./data/raw/{year}_CvH.pkl')
                   for year in np.arange(2021, 2019, -1)]

In [None]:
# first pass grouping and concatenation
big_df = hp.concatenate_cleaned_pickles(raw_pickle_list)

In [None]:
X, y = hp.X_y_split_simple(big_df)

y = hp.y_convert_to_ints(y)

In [None]:
X_ = X.drop(columns=['emt', 'moves'])

----

### Loading in Human vs Human

In [None]:
raw_pickle_list_titled = [pd.read_json(
    f'./data/raw/json/Titled/ficsgamesdb_{year}_titled_movetimes_26{val}.json') \
                   for year, val in zip(np.arange(2021, 2016, -1), [4827, 5091, 5092, 5093, 5094])]

In [None]:
titled_df = hp.concatenate_cleaned_pickles(raw_pickle_list_titled)

In [None]:
X_titled, y_titled = hp.X_y_split_simple(titled_df)

In [None]:
y_titled.head(2)

In [None]:
y_titled = hp.y_convert_to_ints(y_titled)

Dropping emt and moves:

In [None]:
X_titled_ = X_titled.drop(columns=['emt', 'moves'])

In [None]:
X_titled_.head()

----

### Joining Cheater vs Human and Human vs Human

In [None]:
X_CvH_HvH = pd.concat([X_, X_titled_]).reset_index(drop=True)
y_CvH_HvH = pd.concat([y, y_titled]).reset_index(drop=True)

----

## EDA

### Looking at CvH games

In [None]:
df_2022_2018_C_distinv = joblib.load(
    './data/preprocessed/df_2022_2018_C_distinv.pkl')
df_2022_2018_H_distinv = joblib.load(
    './data/preprocessed/df_2022_2018_H_distinv.pkl')

In [None]:
# cheaters (rated and unrated)
fig = px.histogram(
    data_frame=df_2022_2018_C_distinv,
    x='CheaterElo',
    nbins=(int(df_2022_2018_C_distinv.CheaterElo.describe()['max'] - \
          df_2022_2018_C_distinv.CheaterElo.describe()['min'])),
    title='Distribution of Cheater ELO'
    
    )

fig.show()

In [None]:
# non-cheaters (rated and unrated)
fig = px.histogram(
    data_frame=df_2022_2018_H_distinv,
    x='NonCheaterElo',
    nbins=(int(df_2022_2018_H_distinv.NonCheaterElo.describe()['max'] - \
          df_2022_2018_H_distinv.NonCheaterElo.describe()['min'])),
    title='Distribution of Non-Cheater ELO'
    
    )

fig.show()

The spike at 1720-1721 above is probably caused by unrated games that are still in this dataframe: a player's Elo does not change after an unrated game, so many games can pile up at a single rating. The spike could also arise because this is the Elo assigned to new players for their first game. Many of these games should probably be removed from the dataset to reduce the skew in the distribution.
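
One way to test the unrated-games explanation is to find the most frequent Elo value and see what share of the games at that value are unrated. The sketch below uses a small made-up dataframe; the `Event` strings and Elo values are illustrative and are not taken from the real data.

```python
import pandas as pd

# Toy stand-in for the non-cheater dataframe (values are made up)
toy = pd.DataFrame({
    "Event": [
        "FICS rated blitz game",
        "FICS unrated blitz game",
        "FICS unrated blitz game",
        "FICS rated standard game",
        "FICS unrated blitz game",
    ],
    "NonCheaterElo": [1850, 1720, 1720, 1643, 1720],
})

# Flag unrated games with a plain substring match
is_unrated = toy["Event"].str.contains("unrated", regex=False)

# Most frequent Elo value, i.e. the location of the spike
spike_elo = toy["NonCheaterElo"].value_counts().idxmax()

# Share of the games at the spike that are unrated
share_unrated_at_spike = is_unrated[toy["NonCheaterElo"] == spike_elo].mean()

print(spike_elo, share_unrated_at_spike)
```

A share close to 1 on the real data would support the unrated-games explanation, while a low share would point towards the default-rating explanation instead.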


Another thing to investigate is how much changes in an individual player's Elo affect this distribution, i.e. how different would it look if the data were grouped by unique player and each player's Elo were either averaged across all their games or taken from their most recent game (by date)?
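
Both per-player summaries can be sketched with a pandas `groupby`. The column names `Player`, `Date` and `Elo` below are hypothetical placeholders, and the data is made up, so this only illustrates the idea rather than the actual schema.

```python
import pandas as pd

# Toy game log: one row per game per player (values are made up)
toy = pd.DataFrame({
    "Player": ["alice", "alice", "alice", "bob", "bob"],
    "Date": pd.to_datetime(
        ["2021-01-01", "2021-03-01", "2021-02-01", "2020-05-01", "2020-06-01"]
    ),
    "Elo": [1700, 1800, 1750, 1500, 1520],
})

# Option 1: average Elo across all of a player's games
mean_elo = toy.groupby("Player")["Elo"].mean()

# Option 2: Elo from each player's most recent game (sort by date first)
last_elo = toy.sort_values("Date").groupby("Player")["Elo"].last()

print(mean_elo)
print(last_elo)
```

Plotting a histogram of either series gives one observation per player, which removes the weight that very active players otherwise have in the per-game distribution.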

In [None]:
# both rated and unrated
print('Rated and Unrated:')
display(df_2022_2018_C_distinv.CheaterElo.describe().to_frame().T)

# only rated (the leading space in ' rated' avoids matching 'unrated')
df_2022_2018_C_distinv_rated = df_2022_2018_C_distinv[[' rated' in x for x in df_2022_2018_C_distinv.Event]]
print('Rated:')
display(df_2022_2018_C_distinv_rated.CheaterElo.describe().to_frame().T)

# # only unrated
df_2022_2018_C_distinv_unrated = df_2022_2018_C_distinv[['unrated' in x for x in df_2022_2018_C_distinv.Event]]
print('Unrated:')
display(df_2022_2018_C_distinv_unrated.CheaterElo.describe().to_frame().T)

In [None]:
# both rated and unrated
print('Rated and Unrated:')
display(df_2022_2018_H_distinv.NonCheaterElo.describe().to_frame().T)

# only rated (the leading space in ' rated' avoids matching 'unrated')
df_2022_2018_H_distinv_rated = df_2022_2018_H_distinv[[' rated' in x for x in df_2022_2018_H_distinv.Event]]
print('Rated:')
display(df_2022_2018_H_distinv_rated.NonCheaterElo.describe().to_frame().T)

# # only unrated
df_2022_2018_H_distinv_unrated = df_2022_2018_H_distinv[['unrated' in x for x in df_2022_2018_H_distinv.Event]]
print('Unrated:')
display(df_2022_2018_H_distinv_unrated.NonCheaterElo.describe().to_frame().T)

In [None]:
fig = go.Figure()
fig.add_trace(go.Histogram(x=df_2022_2018_C_distinv_rated['CheaterElo'],
                          name='Cheater'))
fig.add_trace(go.Histogram(x=df_2022_2018_H_distinv_rated['NonCheaterElo'],
                          name='Non-cheater'))

fig.update_layout(barmode='overlay',
                 title='Distribution of Cheater and Non-Cheater Elo in Rated Games',
                 xaxis_title='Elo',
                 yaxis_title='Count')
fig.update_traces(opacity=0.75)

fig.show()


fig = go.Figure()
fig.add_trace(go.Histogram(x=df_2022_2018_C_distinv_unrated['CheaterElo'],
                          name='Cheater'))
fig.add_trace(go.Histogram(x=df_2022_2018_H_distinv_unrated['NonCheaterElo'],
                          name='Non-cheater'))

fig.update_layout(barmode='overlay',
                 title='Distribution of Cheater and Non-Cheater Elo in Unrated Games',
                 xaxis_title='Elo',
                 yaxis_title='Count')
fig.update_traces(opacity=0.75)


fig.show()

----

#### Investigating the causes of the bimodal Elo distribution of rated cheaters

##### Event

In [None]:
# cheaters (rated)
fig = px.histogram(
    data_frame=df_2022_2018_C_distinv_rated,
    x='CheaterElo',
    nbins=(int(df_2022_2018_C_distinv.CheaterElo.describe()['max'] - \
          df_2022_2018_C_distinv.CheaterElo.describe()['min'])),
    title='Distribution of Cheater ELO in Rated Games by Event',
    color='Event'
    
    )

fig.show()

----

### Looking at Titled

In [None]:
df_2021_2017_titled_distinv = joblib.load(
    './data/preprocessed/2021_2017_titled_distinv.pkl')

In [None]:
df_2021_2017_titled_distinv = hp.drop_no_move_games(df_2021_2017_titled_distinv)

In [None]:
hp.any_missing_emt(df_2021_2017_titled_distinv)

In [None]:
df_2021_2017_titled_distinv.head(1)

#### Elo distribution for all games (rated and unrated)

In [None]:
BlackElo_=df_2021_2017_titled_distinv[['Event', 'Date', 'BlackElo', 'BlackRD']].rename(columns={
    'BlackElo': 'Elo',
    'BlackRD': 'RD'})
WhiteElo_=df_2021_2017_titled_distinv[['Event', 'Date', 'WhiteElo', 'WhiteRD']].rename(columns={
    'WhiteElo': 'Elo',
    'WhiteRD': 'RD'})

Titled_Elos = pd.concat([BlackElo_, WhiteElo_]).reset_index()

Titled_Elos.head()

In [None]:
# fig = px.histogram(
#     data_frame=Titled_Elos,
#     x='Elo',
#     nbins=(int(Titled_Elos.Elo.describe()['max'] - \
#           Titled_Elos.Elo.describe()['min'])),
#     title='Distribution of Titled Player ELO'
    
#     )

# fig.show()

fig = px.histogram(
    data_frame=Titled_Elos,
    x='Elo',
#     nbins=(int(Titled_Elos.Elo.describe()['max'] - \
#           Titled_Elos.Elo.describe()['min'])),
    title='Distribution of Titled Player ELO'
    
    )

fig.show()

In [None]:
fig = px.histogram(
    data_frame=Titled_Elos[[' rated' in event for event in Titled_Elos.Event]],
    x='Elo',
#     nbins=(int(Titled_Elos.Elo.describe()['max'] - \
#           Titled_Elos.Elo.describe()['min'])),
    title='Distribution of Titled Player ELO in Rated Games'
    
    )

fig.show()

fig = px.histogram(
    data_frame=Titled_Elos[['unrated' in event for event in Titled_Elos.Event]],
    x='Elo',
#     nbins=(int(Titled_Elos.Elo.describe()['max'] - \
#           Titled_Elos.Elo.describe()['min'])),
    title='Distribution of Titled Player ELO in Unrated Games'
    
    )

fig.show()

---

In [None]:
df_2021_2017_titled_distinv_rated = hp.keep_rated_games(df_2021_2017_titled_distinv)

df_2021_2017_titled_distinv_rated.shape # (58478, 18)

In [None]:
df_2021_2017_titled_distinv_rated = hp.get_abs_elo_diff(df_2021_2017_titled_distinv_rated)

In [None]:
HvH_elo_diff = df_2021_2017_titled_distinv_rated.abs_elo_diff.describe().to_frame().rename(
    columns={'abs_elo_diff': 'rated_HvH_abs_elo_df'}
)

HvH_elo_diff

# count 	58478.000000
# mean 	198.632631
# std 	134.015483
# min 	0.000000
# 25% 	92.000000
# 50% 	180.000000
# 75% 	283.000000
# max 	1168.000000
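
The helper `hp.get_abs_elo_diff` is not shown in this notebook. A minimal sketch of what such a helper might compute, assuming it takes the absolute difference of the `WhiteElo` and `BlackElo` columns, is given below; the function name `abs_elo_diff_sketch` and the toy values are illustrative only.

```python
import pandas as pd


def abs_elo_diff_sketch(df):
    """Return a copy of df with an 'abs_elo_diff' column added.

    Assumption: the gap is |WhiteElo - BlackElo| for each game.
    """
    out = df.copy()
    out["abs_elo_diff"] = (out["WhiteElo"] - out["BlackElo"]).abs()
    return out


# Toy games (values are made up)
toy = pd.DataFrame({
    "WhiteElo": [2100, 1950, 2300],
    "BlackElo": [2000, 2130, 2300],
})

result = abs_elo_diff_sketch(toy)
print(result["abs_elo_diff"].tolist())
```

Working on a copy keeps the input dataframe unchanged, which makes the cell safe to re-run.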

----

## Creating Features

### Using Engine Evaluations 

#### Getting Evaluations

##### Settings and functions

In [None]:
# import chess.engine  # needed for SimpleEngine and Limit below
# engine = chess.engine.SimpleEngine.popen_uci('./stockfish_15_win_x64_ssse/stockfish_15_x64_ssse.exe')

# engine.configure({"Hash": 4096})

# engine.configure({"Threads": 8})

# # engine.configure({"MultiPV": 5})

# engine.configure({"Skill Level": 20})

# engine.configure({"Debug Log File": \
#                   'C:/Users/Emanuel/Desktop/data/capstone/preprocessed/evaluations/log.txt'})

# engine.configure({"Move Overhead": 100})

# engine.configure({"Slow Mover": 20})

# limit = chess.engine.Limit(time=0.1, depth=25)

In [None]:
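# def evaluate_game(game):
#     # Analyse the position before each move is played, so that the engine's
#     # top choices can later be compared with the move actually made
#     evaluations_ = []
#     board = chess.Board()
#     for move in game:
#         # multipv must be an int: request the top 5 lines for each position
#         evaluation_ = engine.analyse(board, limit, multipv=5)
#         board.push_san(move)
#
#         evaluations_.append(evaluation_)
#
#     return evaluations_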

In [None]:
# import time  # used for the progress timestamps below
# def evaluate_games(df, save_rate=1000, path='./'): # risky to use function on many games in case something goes wrong
#     list_of_evaluations = []
#     game_count = 0
    
#     for game in zip(df.moves):    
#         try:
#             game_eval_ = evaluate_game(game[0])
#             list_of_evaluations.append(game_eval_)

#             game_count+=1

#             if game_count%save_rate==0:
#                 print(f'{game_count} games completed\nSaving now...')
#                 joblib.dump(list_of_evaluations, 
#                             f'{path}{game_count}_.pkl',
#                            compress=3)
#                 t=time.localtime()[0:6]
#                 print(f'Saved at {t[0]}/{t[1]}/{t[2]} {t[3]}:{t[4]}:{t[5]}')
#             elif game_count%100==0:
#                 print(f'{game_count}')
#                 t=time.localtime()[0:6]
#                 print(f'At {t[0]}/{t[1]}/{t[2]} {t[3]}:{t[4]}:{t[5]}')
#             else:
#                 pass
    
#         except KeyboardInterrupt:
#             print('Keyboard Interrupt')
#             print(f'{game_count}')
#             return list_of_evaluations
# #             break
    
#         except:
#             list_of_evaluations.append(['Error occured'])

#             game_count+=1        

#             print(f'Error occured on game {game_count}')
        
#     return list_of_evaluations

In [None]:
# ## Evaluating multiple games, starting at 0, with try-except

# df_=
# list_of_evaluations = []
# game_count = 0
# path='C:/Users/Emanuel/Desktop/data/capstone/preprocessed/evaluations/'
# for game in zip(df_.moves):
#     try:
#         game_eval_ = evaluate_game(game[0])
#         list_of_evaluations.append(game_eval_)

#         game_count+=1

#         if game_count%1000==0:
#             print(f'{game_count} games completed\nSaving now...')
#             joblib.dump(list_of_evaluations, 
#                         f'{path}{game_count}_.pkl',
#                        compress=3)
#             t=time.localtime()[0:6]
#             print(f'Saved at {t[0]}/{t[1]}/{t[2]} {t[3]}:{t[4]}:{t[5]}')
#         elif game_count%100==0:
#             print(f'{game_count}')
#             t=time.localtime()[0:6]
#             print(f'At {t[0]}/{t[1]}/{t[2]} {t[3]}:{t[4]}:{t[5]}')
#         else:
#             pass
        
#     except KeyboardInterrupt:
#         print('Keyboard Interrupt')
#         print(f'{game_count}')
#         break
        
#     except:
#         list_of_evaluations.append(['Error occured'])
                
#         game_count+=1
        
#         print(f'Error occured on game {game_count}')

-----

#### Using Evaluations to Make Features

##### Loading in 

In [None]:
base_clean_titled = joblib.load('data/cleaned/Titled/base_clean_titled.pkl')
base_clean_CvH = joblib.load('data/cleaned/CvH/base_clean_CvH.pkl')

-----

##### Top recommendation checking

In [None]:
# titled
base_clean_titled = hp.convert_all_game_moves_to_uci(base_clean_titled)
base_clean_titled = hp.get_eval_top_move_for_all_games(base_clean_titled)
base_clean_titled = hp.percent_of_top_moves_played_in_all_games_by_white_and_black(base_clean_titled)

In [None]:
# CvH
base_clean_CvH = hp.convert_all_game_moves_to_uci(base_clean_CvH)
base_clean_CvH = hp.get_eval_top_move_for_all_games(base_clean_CvH)
base_clean_CvH = hp.percent_of_top_moves_played_in_all_games_by_white_and_black(base_clean_CvH)
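
The `hp` helpers used above are not shown in this notebook. As a hedged illustration of the underlying idea, the sketch below computes the percentage of moves that match the engine's top recommendation, separately for white and black. It assumes that the played moves and the top moves are already aligned lists of UCI strings with white moving first; the function name `percent_top_moves` and the sample moves are hypothetical.

```python
def percent_top_moves(played_uci, top_uci):
    """Return (white_pct, black_pct) of moves matching the engine's top move.

    Assumption: both lists are aligned ply by ply and white moves first,
    so even indices belong to white and odd indices to black.
    """
    pairs = list(zip(played_uci, top_uci))

    def pct(side_pairs):
        if not side_pairs:
            return float("nan")  # this side made no moves
        matches = sum(played == top for played, top in side_pairs)
        return 100.0 * matches / len(side_pairs)

    return pct(pairs[0::2]), pct(pairs[1::2])


# Toy game: white matches 2 of 3 moves, black matches 1 of 2
played = ["e2e4", "e7e5", "g1f3", "b8c6", "f1c4"]
top = ["e2e4", "c7c5", "g1f3", "b8c6", "f1b5"]

white_pct, black_pct = percent_top_moves(played, top)
print(white_pct, black_pct)
```

An engine playing its own first choice would score close to 100 on this measure, which is what makes it a candidate feature for separating engines from humans.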

-----

### Calculating Average Time Per Move

In [None]:
incltopmove_titled = joblib.load('data/cleaned/Titled/incl_top_move_perc_titled.pkl')
incltopmove_CvH = joblib.load('data/cleaned/CvH/incl_top_move_perc_CvH.pkl')

In [None]:
incltopmove_titled = hp.drop_no_move_games(incltopmove_titled, min_game_len=2)
incltopmove_CvH = hp.drop_no_move_games(incltopmove_CvH, min_game_len=2)

In [None]:
incltopmove_titled = hp.separate_all_white_and_black_average_emts(incltopmove_titled)
incltopmove_CvH = hp.separate_all_white_and_black_average_emts(incltopmove_CvH)
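
The helper `hp.separate_all_white_and_black_average_emts` is not shown here. The sketch below illustrates one plausible reading of it for a single game: splitting the elapsed move times (`emt`) by colour and averaging each side. It assumes that the `emt` list alternates white, black, white, and so on; the function name `average_emt_by_colour` and the sample times are hypothetical.

```python
def average_emt_by_colour(emt):
    """Return (white_mean, black_mean) of elapsed move times in seconds.

    Assumption: emt alternates between white and black, starting with white.
    """
    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(emt[0::2]), mean(emt[1::2])


# Toy game with five plies (times in seconds, made up)
emt = [1.0, 2.0, 3.0, 6.0, 5.0]

white_avg, black_avg = average_emt_by_colour(emt)
print(white_avg, black_avg)
```

Returning `nan` for a side with no moves avoids a division by zero on very short games, although games that short are already filtered out by `drop_no_move_games` with `min_game_len=2`.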

-----