### Games data cleanup

This notebook performs some cleanup operations on the raw stats dump obtained from running `stats_dump.ipynb`, geared toward use in a later statistics/data science application.

Not all cleanup operations here are universal. You may wish to check them individually and modify them to suit your own needs.

In [None]:
import json
import pandas as pd
from datetime import datetime
from dateutil import parser

Replace the following `player_id` value with the ID of the player whose game data you wish to clean.

In [None]:
player_id = dummy_id

In [None]:
df = pd.read_json('./output/'+str(player_id)+'/datasets/raw_stats.json')

This cell drops various rows and columns. It is not intended to be universally applicable. You should check which rows/columns are dropped and customize this cell to your particular needs.

In [None]:
# Drop annulled games, then drop annulled column and re-index
df.drop(df.index[df['annulled'] == True].tolist(), inplace = True)
df.drop(['annulled'], axis = 1, inplace = True)
df.reset_index(drop = True, inplace = True)

# Convert [width, height] information to a single column (assuming you only care about square board sizes)
df.drop(['width'], axis = 1, inplace = True)
df = df.rename(columns = {'height':'size'})

# Define irrelevant columns - customize these according to individual need

game_metadata = ['players', 'name', 'sgf_filename', 'flags', 'creator', 'source', 'mode']
# these date from before the Glicko adoption (pre-2017)
deprecated = ['black_player_rating', 'white_player_rating', 'black_player_rank', 'white_player_rank']
# keep these if you want to keep rengo info
rengo = ['rengo', 'rengo_black_team', 'rengo_white_team', 'rengo_casual_mode']
# keep these if you want to keep tournament info
tournament = ['tournament', 'tournament_round', 'ladder']
# keep this if you play correspondence with this setting and think it's relevant
correspondence = ['pause_on_weekends']

# Drop above columns
df.drop(deprecated, axis = 1, inplace = True)
df.drop(rengo, axis = 1, inplace = True)
df.drop(game_metadata, axis = 1, inplace = True)
df.drop(tournament, axis = 1, inplace = True)
df.drop(correspondence, axis = 1, inplace = True)
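
As a hedged illustration of the same drop-and-rename pattern, the sketch below runs on a small invented dataframe (the values and the reduced column set are assumptions, not real dump data). It also shows `errors='ignore'`, which lets `drop` skip any listed column that is absent instead of raising a `KeyError`.

```python
import pandas as pd

# Toy stand-in for the raw stats dump; all values are invented for illustration.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'annulled': [False, True, False],
    'width': [19, 19, 9],
    'height': [19, 19, 9],
    'rengo': [False, False, False],
})

# Keep only non-annulled games, then drop the helper column and re-index.
toy = toy[~toy['annulled']].drop(columns=['annulled']).reset_index(drop=True)

# Collapse [width, height] into one 'size' column (square boards assumed).
toy = toy.drop(columns=['width']).rename(columns={'height': 'size'})

# 'ladder' does not exist in the toy frame; errors='ignore' skips it quietly.
toy = toy.drop(columns=['rengo', 'ladder'], errors='ignore')

print(toy)
```

Whether silently skipping missing columns is desirable depends on your data; raising may be preferable if a missing column signals a problem upstream.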

This cell reorganizes the dataframe in terms of a player-vs-opponent perspective rather than a black-vs-white perspective. For instance, instead of `{'black':'player_id', 'white':'opponent_id'}`, we have `{'player_color':'black'}`. This is applied to:
* player colors
* player IDs
* game results

In [None]:
playedBlack = df.index[df['black'] == player_id].tolist()
playedWhite = df.index[df['white'] == player_id].tolist()

player_color = ['white'] * df.shape[0]
opponent_id = df.get('black')
player_won = [False] * df.shape[0]

df.insert(3, 'player_color', player_color)
df.insert(4, 'opponent_id', opponent_id)
df.insert(5, 'player_won', player_won)
df.loc[playedBlack, 'player_color'] = 'black'
df.loc[playedBlack, 'opponent_id'] = df.loc[playedBlack, 'white']
df.loc[playedWhite, 'opponent_id'] = df.loc[playedWhite, 'black']
df.drop(['black', 'white'], axis = 1, inplace = True)

df.loc[playedBlack, 'player_won'] = df.loc[playedBlack, 'white_lost'] 
df.loc[playedWhite, 'player_won'] = df.loc[playedWhite, 'black_lost']
df.drop(['black_lost', 'white_lost'], axis = 1, inplace = True)
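
The same black-vs-white to player-vs-opponent conversion can also be written in a vectorized way. The following is a minimal sketch on invented data, assuming a hypothetical `player_id` of 100; it uses `numpy.where` with a single boolean mask rather than index lists.

```python
import numpy as np
import pandas as pd

player_id = 100  # hypothetical id, used only for this illustration

# Toy games: the player is black in rows 0 and 2, white in row 1.
toy = pd.DataFrame({
    'black': [100, 200, 100],
    'white': [300, 100, 400],
    'black_lost': [True, True, False],
    'white_lost': [False, False, True],
})

# True wherever the player held the black stones.
is_black = toy['black'] == player_id

toy['player_color'] = np.where(is_black, 'black', 'white')
# The opponent is whoever sits on the other side of the board.
toy['opponent_id'] = np.where(is_black, toy['white'], toy['black'])
# The player won exactly when the other color lost.
toy['player_won'] = np.where(is_black, toy['white_lost'], toy['black_lost'])

toy = toy.drop(columns=['black', 'white', 'black_lost', 'white_lost'])
print(toy)
```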

This cell extracts the live/blitz/correspondence information from the `'time_control_parameters'` column. There is other potentially relevant information in here but I have chosen to drop it.

In [None]:
speed = [json.loads(df['time_control_parameters'][i])['speed'] for i in range(df.shape[0])]
df.insert(9, 'speed', speed)
df.drop(['time_control_parameters'], axis = 1, inplace = True)
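
For reference, here is a small sketch of the same extraction using `Series.apply` instead of positional indexing. The JSON strings are invented, and the `system` key is an assumption included only to show that extra fields are ignored.

```python
import json
import pandas as pd

# Toy column of JSON strings; the contents are invented for illustration.
toy = pd.DataFrame({
    'time_control_parameters': [
        '{"speed": "live", "system": "byoyomi"}',
        '{"speed": "correspondence", "system": "fischer"}',
    ]
})

# Parse each JSON string into a dict, then pull out the 'speed' entry.
parsed = toy['time_control_parameters'].apply(json.loads)
toy['speed'] = parsed.apply(lambda params: params.get('speed'))

toy = toy.drop(columns=['time_control_parameters'])
print(toy)
```

Using `dict.get` yields `None` for a record without a `speed` key rather than raising, which may or may not be what you want.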

This cell computes the length of each game (to the minute, rounded down), then drops the `'ended'` column.

In [None]:
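# Sanity check: show the raw timestamp format and confirm that it parses
print(df['started'][0])
parser.parse(df['started'][0])
# Use total_seconds() so that games spanning more than a day are measured in full
# (timedelta.seconds holds only the sub-day remainder)
game_length = [int((parser.parse(df['ended'][i]) - parser.parse(df['started'][i])).total_seconds() // 60) for i in range(df.shape[0])]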
df.insert(0, 'game_length', game_length)
df.drop(['ended'], axis = 1, inplace = True)
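
One subtlety worth keeping in mind with timedelta arithmetic: the `seconds` attribute holds only the part of the duration below one day, while `total_seconds()` covers the whole span. This matters for correspondence games. The sketch below uses invented timestamps (the ISO format is an assumption about the dump) to compare the two.

```python
import pandas as pd

# Toy timestamps: a 25-minute live game and a game lasting just over 3 days.
toy = pd.DataFrame({
    'started': ['2021-03-01T10:00:00Z', '2021-03-01T10:00:00Z'],
    'ended':   ['2021-03-01T10:25:40Z', '2021-03-04T11:30:00Z'],
})

elapsed = pd.to_datetime(toy['ended']) - pd.to_datetime(toy['started'])

# .dt.seconds ignores whole days, so the long game looks like 90 minutes.
minutes_sub_day = (elapsed.dt.seconds // 60).tolist()

# .dt.total_seconds() counts the full duration, rounded down to the minute.
minutes_total = (elapsed.dt.total_seconds() // 60).astype(int).tolist()

print(minutes_sub_day)
print(minutes_total)
```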

This cell organizes the ratings data at the time of each game.

In [None]:
# Extract historical ratings data and store in ['player_rating', 'player_deviation', 'player_volatility',
#                                               'opponent_rating', 'opponent_deviation', 'opponent_volatility']
black_rating = pd.DataFrame([df['historical_ratings'][i]['black']['ratings']['overall']['rating'] 
                             for i in range(df.shape[0])], columns = ['black_rating'])
black_deviation = pd.DataFrame([df['historical_ratings'][i]['black']['ratings']['overall']['deviation'] 
                                for i in range(df.shape[0])], columns = ['black_deviation'])
black_volatility = pd.DataFrame([df['historical_ratings'][i]['black']['ratings']['overall']['volatility'] 
                                 for i in range(df.shape[0])], columns = ['black_volatility'])
white_rating = [df['historical_ratings'][i]['white']['ratings']['overall']['rating'] for i in range(df.shape[0])]
white_deviation = [df['historical_ratings'][i]['white']['ratings']['overall']['deviation'] for i in range(df.shape[0])]
white_volatility = [df['historical_ratings'][i]['white']['ratings']['overall']['volatility'] for i in range(df.shape[0])]

df.insert(0, 'player_rating', white_rating)
df.insert(0, 'opponent_rating', white_rating)
df.insert(0, 'player_deviation', white_deviation)
df.insert(0, 'opponent_deviation', white_deviation)
df.insert(0, 'player_volatility', white_volatility)
df.insert(0, 'opponent_volatility', white_volatility)

df.loc[playedBlack, 'player_rating'] = black_rating.loc[playedBlack, 'black_rating']
df.loc[playedWhite, 'opponent_rating'] = black_rating.loc[playedWhite, 'black_rating']
df.loc[playedBlack, 'player_deviation'] = black_deviation.loc[playedBlack, 'black_deviation']
df.loc[playedWhite, 'opponent_deviation'] = black_deviation.loc[playedWhite, 'black_deviation']
df.loc[playedBlack, 'player_volatility'] = black_volatility.loc[playedBlack, 'black_volatility']
df.loc[playedWhite, 'opponent_volatility'] = black_volatility.loc[playedWhite, 'black_volatility']

df.drop(['historical_ratings'], axis = 1, inplace = True)
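
As an alternative to indexing into each nested record by hand, `pandas.json_normalize` can flatten the whole structure in one call. This is a sketch under the assumption that every record follows the `color -> 'ratings' -> 'overall'` nesting used above; the numbers are invented.

```python
import pandas as pd

# Toy records mirroring the nesting of 'historical_ratings'; values are invented.
records = [
    {'black': {'ratings': {'overall': {'rating': 1500.0, 'deviation': 350.0, 'volatility': 0.06}}},
     'white': {'ratings': {'overall': {'rating': 1620.5, 'deviation': 80.0, 'volatility': 0.059}}}},
    {'black': {'ratings': {'overall': {'rating': 1480.0, 'deviation': 120.0, 'volatility': 0.061}}},
     'white': {'ratings': {'overall': {'rating': 1455.2, 'deviation': 95.0, 'volatility': 0.06}}}},
]

# json_normalize joins nested keys with '.', e.g. 'black.ratings.overall.rating'.
flat = pd.json_normalize(records)

# Shorten the column names to 'black_rating', 'white_deviation', and so on.
flat.columns = [name.replace('.ratings.overall.', '_') for name in flat.columns]

print(flat)
```

On the real dataframe, the input would be something like `df['historical_ratings'].tolist()`, after which the black/white columns can be mapped to player/opponent as in the cell above.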

This cell reorganizes the columns according to the following paradigm:

1. Game start time
2. Game id information
3. Player id information
4. Result information
5. Rules information
6. Time information
7. Player rating information

It then sorts the dataframe by game start time (descending order). The start times are converted from `str` to `datetime` objects.

In [None]:
cols = ['started',
        'id', 'ranked',
        'opponent_id', 'player_color',
        'player_won', 'outcome',
        'size', 'rules', 'handicap', 'komi', 'disable_analysis',
        'time_control', 'time_per_move', 'speed', 'game_length',
        'player_rating', 'opponent_rating', 'player_deviation', 'opponent_deviation', 'player_volatility', 'opponent_volatility',
        'related']
df = df[cols].copy()  # copy so the assignment below does not trigger SettingWithCopyWarning
df['started'] = pd.to_datetime(df['started']).values
df.sort_values(by = 'started', inplace = True, ascending = False)
df.reset_index(drop = True, inplace = True)

In [None]:
df.head()

This cell stores our cleaned dataframe in a .json file for easy access during future analysis.

In [None]:
df.to_json(r'./output/'+str(player_id)+'/datasets/clean_stats.json')
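
When loading the cleaned file later, note that `to_json` writes datetimes as epoch values by default, and `read_json` only auto-converts columns whose labels look date-like (for example, those ending in `_at` or `_time`). A column named `started` is therefore likely to come back as integers unless it is listed in `convert_dates`. The sketch below shows a round trip on invented data, written to a temporary directory rather than the real output path.

```python
import os
import tempfile
import pandas as pd

# Toy cleaned frame; values are invented for illustration.
toy = pd.DataFrame({
    'started': pd.to_datetime(['2021-03-01 10:00:00', '2021-03-02 18:30:00']),
    'player_won': [True, False],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'clean_stats.json')
    toy.to_json(path)
    # Ask read_json explicitly to parse 'started' back into datetimes.
    restored = pd.read_json(path, convert_dates=['started'])

print(restored.dtypes)
```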