# Predicting StarCraft II Win Probability - Notebook 1

## Introduction

StarCraft II is a Real-Time Strategy game (RTS) which has been a popular eSport since its release in 2010. The game has progressed through a number of iterations, the most recent being named "Legacy of the Void" (LOTV), which was released in 2015.

### How does the game work?
The game is played on a map by two players (more than two players are possible but the primary focus of the competitive amateur and professional scene is 1 vs 1 games). It is a largely successful game, with around 500,000 players per season at its peak.

![Average Player Population per Season](img/avg-player-population.png)

<center>
<b>Average Player Population per Season</b> (<i>Source: <a href="https://www.rankedftw.com/stats/population/1v1/#v=2&r=-2&sy=c&sx=a">rankedftw.com</a></i>)
</center>

#### What is MMR?
A player queues for a "ladder game" (a game against another human competitor), and matchmaking is performed using MMR (Matchmaking Rating), a measure of player skill similar to the Elo rating in chess. A win or loss in a ladder game updates the MMR of both players, and the size of the adjustment depends on the MMR difference between the two players before the game (i.e. beating a player with a much higher MMR results in a large adjustment). MMR ranges between 0 and around 7,500. Although there is no technical upper limit, once a player reaches such high levels the MMR gained per win becomes very small because the gap to their opponents is so large. A player with an MMR around 7,500 will gain only 1 or 2 MMR for a win but can lose 50 or more MMR for a loss, making it difficult to increase their MMR further.
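
To illustrate why the adjustment depends on the rating gap, the sketch below uses the standard Elo formula from chess. This is an assumption made purely for illustration: Blizzard's actual MMR formula is not described here, and the `scale` and `k` values are the conventional chess defaults rather than StarCraft II parameters.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Elo expected score (win probability) of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a, rating_b, a_won, k=32.0):
    """Return the updated (rating_a, rating_b) after one game.

    k is the maximum adjustment per game (an assumed value, not Blizzard's).
    """
    actual = 1.0 if a_won else 0.0
    delta = k * (actual - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta
```

Under this formula an underdog who wins gains more than `k / 2`, while a strong favorite who wins gains very little, which mirrors the asymmetry described for players near the top of the ladder.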

#### Game setup
Each player chooses one of three races: Protoss, Terran, or Zerg - allowing for a total of 6 possible match-ups. Each player begins the game with 12 worker units and a main base. The basic strategy for the game is outlined by:
* Use workers to mine the two resources on the map: minerals and vespene gas.
* Use workers to construct buildings to create military units, research upgrades, collect more resources, and defend their base.
* Expand bases to new locations on the map. There are typically 5 to 7 bases available for each player to expand to.
* Use military units to attack enemy units, while maintaining economic balance.

#### Winning the game
A game is won when a player has destroyed all enemy buildings. However, games typically end earlier than this as one player concedes when it becomes clear that they can no longer win.

## Goal of the project
The goal of this Jupyter Notebook series is to model the win-probability as the game progresses. This would have 3 major use-cases:
* *Broadcasting*: Professional games are widely broadcast, with the 2019 WCS Global Finals having an average live viewership of 36,430, a peak viewership of 80,030 and a prize pool of $500,000 __[source](https://escharts.com/tournaments/sc2/wcs-global-finals-2019)__. An overlay which occasionally updates viewers with the probability of one player winning would be an interesting feature to add, highlighting "big moments" in the game where the win probability swings from one player's favor to the other's.
* *Player analysis*: It is common practice for players and professionals to re-watch their own games in order to understand how they performed, and what improvements could be made. This tool would allow for additional insights into which moments were most significant in affecting the outcome of the game, as well as helping players identify where they have conceded too early.
* *AI training*: StarCraft II is already a popular field of study for AI, with Google's __[AlphaStar](https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii)__ having been trained to the point where it could even defeat professional players. Predicting the probability of victory for a given game state could provide an additional reward metric for training AIs.

In addition to the above, this methodology could also be applied to other real-time strategy games such as Age of Empires, or even to popular turn-based strategy games such as Civilization VI.

## Data Sources
Data are sourced from `SC2REPLAY` files, which are generated automatically by the game client every time a game is played. StarCraft II is a fully deterministic game, so a replay file simply contains a list of every action taken in the game. To find out what is happening at any given moment, one has to simulate every action up to that point in time.

A Python package named __[sc2reader](https://github.com/ggtracker/sc2reader)__ is used to read the replay files, converting each replay into an object whose attributes contain both metadata and game events. This package is used in a number of publicly available replay analysis tools, as well as tools that analyze the actions most often taken by players in-game.

A replay file contains data at a frame by frame level. Each frame represents 1/16th of an in-game second, and each in-game second is 1/1.4 of a real-time second (i.e. there are 22.4 frames per real-time second).
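
The frame arithmetic above can be sketched as a pair of conversion helpers. The function and constant names are hypothetical and are not part of `sc2reader`; only the factors 16 and 1.4 come from the text.

```python
FRAMES_PER_GAME_SECOND = 16   # one frame is 1/16th of an in-game second
GAME_SPEED_FACTOR = 1.4       # in-game seconds elapsed per real-time second


def frames_to_game_seconds(frames):
    """Convert a frame count to in-game seconds."""
    return frames / FRAMES_PER_GAME_SECOND


def frames_to_real_seconds(frames):
    """Convert a frame count to real-time seconds (22.4 frames per second)."""
    return frames / (FRAMES_PER_GAME_SECOND * GAME_SPEED_FACTOR)
```

For example, a frame count of 224 corresponds to 14 in-game seconds, or about 10 seconds of real time.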

### Datasets
Two datasets are collected:
* __[Blizzard](https://github.com/Blizzard/s2client-proto#downloads)__: A collection of ~1.2 million replays provided by Blizzard (the game developers) specifically for the purpose of training AI agents to play the game. These replay files are anonymized and contain only the actions taken by the players, and not the game state. In this study we use the Blizzard dataset to train a metadata model that attempts to predict the outcome of the game based only on the information available before the game starts.
* __[SpawningTool](https://lotv.spawningtool.com/replays/)__: 48,479 replay files downloaded from a collection of user-uploaded replay files. These replay files are not anonymized and contain the game state at every moment in the game. In this study we use the SpawningTool dataset to train a neural network that predicts the probability of victory for a player as the game progresses. The dataset is also used as the test data for the Metadata Model.

## Methodology
* **Notebook 1 (this notebook): Data Collection** - Process replay files to extract metadata and game events.
    * Collect data from `SC2REPLAY` files:    
        * *Metadata*: Information about the game, such as the map; player names, races, and MMR; and game outcome, length and region. Metadata for each dataset are stored in a single `csv` file per dataset. Data are collected with `sc2reader`, using the `load_level=2` setting to speed up replay parsing (each load level increases the amount of detail extracted at the cost of additional processing time and memory).
        * *Event data*: The actions taken by the players in the game, such as the units that were created, the actions taken by the players, their resource collection rates and spending, etc. Game events for each game are stored in a `pkl` file, named after the filehash of the original replay file. This allows easy referencing to a particular game events file from the metadata file.
    * Clean Event Data
        * Create a dataframe from each `pkl` file.
        * Remove all irrelevant actions.
        * Divide game events into actions taken by each player.
        * Store a final dataframe that consists of a single row per frame of the game.
* **Notebook 2: Metadata Analysis** 
    * Visually explore the metadata.
    * Understand the split of player levels, races, regions, etc. that are available - particularly in the SpawningTool dataset that will be used to train the neural network.
    * Model the outcome of the game based on metadata alone. Train the model using the Blizzard dataset and test it using the SpawningTool dataset.
* **Notebook 3: Event data Analysis**
    * Visually explore the event data.
    * Understand the types of actions that are taken, and create any new features that could be used to improve modeling.
    * Construct and train a Recurrent Neural Network (RNN) to predict the probability of victory for a player as the game progresses.
    * Compare the RNN to the Metadata Model to understand whether the RNN offers an improved understanding of the game.
    * Conclusions and further work to be done.

For Notebook 1 an environment named `sc2` is created in Conda which contains the following required packages:
* `pandas`
* `numpy`
* `scipy`
* `matplotlib`
* `seaborn`
* `scikit-learn`
* `sc2reader`
* `multiprocessing` (part of the Python standard library, so no separate installation is needed)

For Notebooks 2 and 3 a new environment named `tflow` is used:
* `tensorflow`

`yml` files are included in this repository to set up each environment.

---


Package imports and settings:


In [1]:
# initial imports and settings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sc2reader
import re
import random
import datetime
import json
from scripts.classes import ReplayInfo
import time
import math
import multiprocessing as mp

# sklearn
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# import svm
from sklearn.svm import SVC
from sklearn.svm import LinearSVC

# import random forest
from sklearn.ensemble import RandomForestClassifier

# import KNN
from sklearn.neighbors import KNeighborsClassifier


# import warnings
import warnings

In [2]:
#### Matplotlib settings
%matplotlib inline
import matplotlib as mpl

# specify default rcParams so that font size, weight and style do not need to be
# set for each plot
# Titles in bold, fontsize 20
mpl.rcParams['figure.titleweight'] = 'bold'
mpl.rcParams['figure.titlesize'] = 20
mpl.rcParams['axes.titleweight'] = 'bold'
mpl.rcParams['axes.titlesize'] = 20
# Axis labels in bold, fontsize 13
mpl.rcParams['axes.labelweight'] = 'bold'
mpl.rcParams['axes.labelsize'] = 13
# set figure size
mpl.rcParams['figure.figsize'] = (15, 8)

# race list and colors that will be used to represent them
RACE_LIST = [
    'Protoss',
    'Terran',
    'Zerg'
    ]
COLOR_DICT = {
    'Protoss': 'goldenrod',
    'Terran': 'firebrick',
    'Zerg': 'darkviolet'
    }


---

# Data Collection



## Section 1: Metadata Collection

This process is broken down into two phases:
1. **Metadata collection** - Extract the metadata from both SpawningTool and Blizzard datasets.
2. **Data cleaning** - Explore the data to determine how much complete information is available. The data should be relatively clean, as duplicates are already removed during replay parsing.

### 1.1 - Processing replay files

To process the metadata, the following steps are followed:
* A Python script `process_replays.py` is written to the scripts folder using the IPython magic `%%writefile`. The separate script enables the use of multiprocessing to greatly improve the speed of execution.
* A replay file is loaded using `load_replay` from the `sc2reader` package. Load level 2 is selected as it contains the relevant player data and metadata.
* The replay is parsed using the `ReplayInfo` class (found in `scripts/classes/ReplayInfo.py`, and also written using `%%writefile`).
* The parsed replay is added to a list of replays which is then converted to a pandas dataframe.
* The dataframe is then written to a csv file. Separate csv files are created for SpawningTool and Blizzard replays in the data folder (`spawningtool_replays.csv` and `blizzard_replays.csv` respectively). 

The metadata collected are:
* `Map hash` - this is used in preference to the map name, as different regions have different translations for the same map.
* `Player races`
* `Player MMR` (if available)
* `Player highest league`
* `Game length`
* `Game outcome`
* `Game type` (to ensure that the game is 1v1 instead of a 4 player game etc.)
* `Game speed` (just as an additional check that the game format is correct)
* `Game FPS` (just as an additional check that the game format is correct)
* `Game timestamp`
* `Region` of the game server that the replay was played on
* `Filehash` - this is used to ensure that replays are not duplicated, and as an identifier for event data `pkl` files.
* `Is_ladder` - If the game is a ladder game (again checking that the game format is correct).
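
Several of the fields above exist only as format checks. A minimal sketch of how they might be combined into one filter is shown below. The function name is hypothetical, the column names follow the metadata `csv`, and the expected values (`'1v1'`, `'Faster'`, 16 FPS, ladder games only) are assumptions about what counts as a standard game.

```python
import pandas as pd


def filter_standard_games(df):
    """Keep only rows that look like standard 1v1 ladder games.

    The expected values below are assumptions, not rules taken from sc2reader.
    """
    mask = (
        (df["game_type"] == "1v1")
        & (df["game_speed"] == "Faster")
        & (df["fps"] == 16.0)
        & df["is_ladder"].astype(bool)
    )
    return df.loc[mask].reset_index(drop=True)
```

Whether non-ladder games should be dropped depends on the use case; the check can be relaxed by removing the `is_ladder` term.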

#### Script to extract metadata from replay files
A script is written to allow the use of multiprocessing to greatly improve the speed of execution. The settings in `replay_settings.json` allow the user to specify the number of cores that will be used during execution. Setting `n_jobs` to -1 will use all available cores minus one (to prevent locking up when all cores are occupied).
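
The `n_jobs` convention can be sketched as a small helper. The function name is hypothetical, and two details are design assumptions rather than documented behavior: the result never drops below one worker, and an explicit `n_jobs` is capped at the number of available cores.

```python
import multiprocessing as mp


def resolve_worker_count(n_jobs, available=None):
    """Translate the n_jobs setting into a number of worker processes."""
    if available is None:
        available = mp.cpu_count()
    if n_jobs == -1:
        # all cores but one, so the machine stays responsive
        return max(1, available - 1)
    # explicit request, capped at the available cores (an assumption)
    return max(1, min(n_jobs, available))
```

On a 6 core machine, `n_jobs = -1` resolves to 5 workers under these assumptions.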

In [None]:
%%writefile scripts/process_replays.py
"""
This file is used to process replays and extract the data
It is used as a separate script to allow for parallel processing
Settings can be found in replay_settings.json
"""
import json
import os
import multiprocessing as mp
import random
import sc2reader
from scripts.classes import ReplayInfo
import pandas as pd
import time

def process_replay(filename):
    """
    process_replay
    ___________________________________________________________________________
    Processes a replay file and returns a ReplayInfo object. Returns None if
    the replay cannot be loaded by sc2reader or parsed by ReplayInfo.

    Args:
        filename (string): Path to the replay file.

    Returns:
        ReplayInfo or None: The parsed metadata, or None on failure.
    """
    # load replay
    try:
        replay = sc2reader.load_replay(
            filename,
            load_level=2 # level 2 is all that is required for metadata
            )
    except Exception: # catch exceptions raised by sc2reader
        return None

    try:
        replay_object = ReplayInfo(replay)
    except Exception: # catch exceptions raised by ReplayInfo
        return None

    return replay_object

if __name__ == "__main__":

    __spec__ = None # suppress warnings
    
    # start timer
    timer = time.time()

    # load settings
    with open("scripts/replay_settings.json", "r") as f:
        settings = json.load(f)

    # get replay directory from settings
    replay_dir = settings["replay_dir"]

    # get sample size from settings
    sample_size = settings["sample_size"]

    # get n_jobs from settings
    n_jobs = settings["n_jobs"]

    # get output_file from settings
    output_file = settings["output_file"]
    if output_file == "":
        output_file = 'data/replays.csv'

    # get random seed from settings
    # check if random_seed key exists
    if "random_seed" in settings:
        random_seed = settings["random_seed"]
    else:
        random_seed = None

    replays_list = []
    # loop through replay directory and get list of .SC2Replay files
    for dirpath, dirnames, filenames in os.walk(replay_dir):
        for filename in filenames:
            if filename.endswith('.SC2Replay'):
                filepath = os.path.join(dirpath, filename)
                replays_list.append(filepath)

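    if sample_size != -1:
        # take a random sample of replays (capped at the number available)
        random.seed(random_seed)
        replays_list = random.sample(
            replays_list, min(sample_size, len(replays_list))
        )

    # process replays
    if n_jobs == -1:
        # use all cores but one, and never fewer than one worker
        cpu_total = max(1, mp.cpu_count() - 1)
    else:
        cpu_total = n_jobs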

    print(f'Processing {len(replays_list)} replays')

    with mp.Pool(processes=cpu_total) as pool:
        replay_collection = pool.map(
            process_replay,
            replays_list
        )

    # remove all None from replay_collection
    replay_collection = [x for x in replay_collection if x is not None]

    # convert replay collection to dataframe
    replay_df = pd.DataFrame({
        'filename':[x.filename for x in replay_collection],
        'map':[x.map_hash for x in replay_collection],
        'player1_race':[x.player_races[0] for x in replay_collection],
        'player2_race':[x.player_races[1] for x in replay_collection],
        'player1_mmr':[x.player_mmrs[0] for x in replay_collection],
        'player2_mmr':[x.player_mmrs[1] for x in replay_collection],
        'game_length':[x.game_length for x in replay_collection],
        'game_type':[x.game_type for x in replay_collection],
        'game_speed':[x.game_speed for x in replay_collection],
        'game_winner':[x.game_winner for x in replay_collection],
        'timestamp':[x.timestamp for x in replay_collection],
        'fps':[x.fps for x in replay_collection],
        'is_ladder':[x.is_ladder for x in replay_collection],
        'region':[x.region for x in replay_collection],
        'player1_highest_league':[
            x.highest_league[0] for x in replay_collection
        ],
        'player2_highest_league':[
            x.highest_league[1] for x in replay_collection
        ],
        'filehash':[x.filehash for x in replay_collection]
    })

    # remove rows with duplicate filehashes
    replay_df = replay_df.drop_duplicates(subset='filehash')

    # write replay_collection to csv with no index
    replay_df.to_csv(output_file, index=False)

    # found x valid replays
    print(f'Found {replay_df.shape[0]} unique valid replays')
    # print time elapsed as HH:MM:SS
    print(f'Time elapsed: {time.strftime("%H:%M:%S", time.gmtime(time.time() - timer))}')


Overwriting scripts/process_replays.py


#### Class definition for ReplayInfo
This class is used to parse the replay files and extract the metadata to more easily convert it into a pandas dataframe.

In [None]:
%%writefile scripts/classes/ReplayInfo.py
"""
Classes used in the analysis of replays.

Current list:
    - ReplayInfo - Extract metadata from replay files. (e.g. map, player names, etc.)
"""
import re

class ReplayInfo:

    def __init__(self, replay):
        # self.__Replay = replay
        self.map_hash = replay.map_hash
        self.player_races = self._get_player_races(replay)
        self.filename = replay.filename
        self.player_mmrs = self._get_player_mmrs(replay)
        self.game_length = self._get_game_length(replay)
        self.game_winner = self._get_winner(replay)
        self.timestamp = replay.unix_timestamp
        self.game_type = replay.type
        self.game_speed = replay.speed
        self.fps = replay.game_fps
        self.is_ladder = replay.is_ladder
        self.region = replay.region
        self.highest_league = self._get_player_highest_league(replay)
        self.filehash = replay.filehash


    def _get_game_length(self, replay):

        # str(replay.game_length) is formatted as minutes.seconds
        # (assumed to become hours.minutes.seconds for games over an hour)
        length_string = str(replay.game_length)

        # split on '.' and accumulate, treating each part as a base-60 digit
        total_seconds = 0
        for part in length_string.split('.'):
            total_seconds = total_seconds * 60 + int(part)

        # game length in seconds
        return total_seconds


    def _get_winner(self, replay):

        winner_string = str(replay.winner)
        if 'Player 1' in winner_string:
            return 1
        elif 'Player 2' in winner_string:
            return 2
        else:
            return 0


    def _get_player_highest_league(self, replay):

        str_a = 'replay.initData'
        str_b = 'user_initial_data'
        str_c = 'highest_league'

        # check that replay.initData exists in key
        if str_a not in replay.raw_data.keys():
            str_a = 'replay.initData.backup'

        return (
            replay.raw_data[str_a][str_b][0][str_c],
            replay.raw_data[str_a][str_b][1][str_c]
            )


    def _get_player_mmrs(self, replay):

        str_a = 'replay.initData'
        str_b = 'user_initial_data'
        str_c = 'scaled_rating'

        # check that replay.initData exists in key
        if str_a not in replay.raw_data.keys():
            str_a = 'replay.initData.backup'

        return (
            replay.raw_data[str_a][str_b][0][str_c],
            replay.raw_data[str_a][str_b][1][str_c]
            )


    # get the player races
    def _get_player_races(self, replay):
        """
        _get_player_races
        Iterate through players in replay.players and extract player
        races as strings from the info

        Returns:
            tuple - Length 2, containing the races of both players
        """

        player_string = []

        for player in replay.players:
            # convert player to string
            player_string.append(str(player))


        return (
            self._get_race(player_string[0]),
            self._get_race(player_string[1])
            )


    def _get_race(self, player):
        """
        _get_race Extract race from string. String is assumed to be of the form:
        'Player x - Race'.

        Args:
            player (str): A string of form 'Player x - Race'

        Returns:
            str: Race of the player
        """

        RACE_LIST = [
            'Protoss',
            'Terran',
            'Zerg'
        ]

        # assert that player is a string
        assert isinstance(player, str), 'player should be a string'


        for race in RACE_LIST:

            race_string = '('+race+')'

            if race_string.lower() in player.lower():

                # check that the string has the expected formatting before
                # returning the race

                # assert that race_string is in player (case-sensitive)
                assert race_string in player, \
                    f'{player} does not adhere to {race_string} formatting'
                # use replace to delete the player race
                player = player.replace(race_string, '')

                # regex matching the 'Player 1 - ' prefix
                reg_str = r'Player\s\d\s\-\s'

                # assert that reg_str is in player string
                assert re.search(reg_str, player), \
                    f'{player} does not adhere to {reg_str} formatting'

                return race

Now that the process for extracting the metadata is defined, the script can be run using the `%run` magic. For each of the two runs, only the `replay_settings.json` file found in the scripts folder needs to be modified.

#### Extract SpawningTool metadata
This script took approximately 1 minute to run on a 6 core machine.

In [5]:
# specify json settings in scripts/replay_settings.json
replay_settings = {
    "replay_dir": "data/SpawningTool",
    "sample_size": -1,
    "n_jobs": -1,
    "output_file": "data/spawningtool_replays.csv"
}

# write json settings to file
with open("scripts/replay_settings.json", "w") as file:
    json.dump(replay_settings, file, indent=4)

# run the script to extract metadata from replays
%run ./scripts/process_replays.py 

Processing 48479 replays
Found 36812 unique valid replays
Time elapsed: 00:01:00


In [10]:
# inspect the dataframe
df = pd.read_csv('data/spawningtool_replays.csv')
display(df)

Unnamed: 0,filename,map,player1_race,player2_race,player1_mmr,player2_mmr,game_length,game_type,game_speed,game_winner,timestamp,fps,is_ladder,region,player1_highest_league,player2_highest_league,filehash
0,data/SpawningTool\Other\page1\2-liberator-tank...,abb20e9c587f5f6ce279a6aa8c7e9be2897607abb00434...,Terran,Protoss,-36400.0,,402,1v1,Normal,1,1633598712,16.0,False,us,4,0,9384fda8c370ea7d130ac20244f7d0fda9a9b834004445...
1,data/SpawningTool\Other\page1\2-liberators-1-t...,abb20e9c587f5f6ce279a6aa8c7e9be2897607abb00434...,Terran,Zerg,-36400.0,,371,1v1,Normal,1,1633603684,16.0,False,us,4,0,f9864054498acf297aacf2be80896ba131716a341983de...
2,data/SpawningTool\Other\page1\2000-atmospheres...,c2bcc137d9ce43fbf9a5ddd04f5167892c779c29616de9...,Zerg,Zerg,4022.0,3502.0,371,1v1,Faster,1,1633788999,16.0,True,eu,6,6,178a82fa5045e82a3920688274d7a524595f0ef6dc91ae...
3,data/SpawningTool\Other\page1\2021-09-05-zserr...,e83bd9ddb33fcdc21a801f9bd598ed081f8ca60f810522...,Zerg,Protoss,,,1803,1v1,Faster,1,1630608640,16.0,False,eu,7,6,99ba80721116d18f965c141581baa878abf02526f48bc6...
4,data/SpawningTool\Other\page1\2021-10-01-zvaev...,e83bd9ddb33fcdc21a801f9bd598ed081f8ca60f810522...,Zerg,Protoss,3615.0,3567.0,502,1v1,Faster,1,1633138764,16.0,True,us,5,5,b9e96a1dabcc0de0cfd89e8f0d02b5b8e064ec3f3c07c1...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36807,data/SpawningTool\Pro\page973\zvp-5.SC2Replay,5bcae311df60fac557b25c2889f26a9d10bb602f2a34d7...,Zerg,Protoss,,,804,1v1,Faster,2,1427952943,16.0,True,xx,8,8,86edadabea6cede88a70ed9102e0d45cdea7850d23ade8...
36808,data/SpawningTool\Pro\page973\zvp-6.SC2Replay,27675af51780a8617d291dcb8322c98e316ca8f753c6cf...,Protoss,Zerg,,,381,1v1,Faster,1,1427953453,16.0,True,xx,8,8,42930fdc21561395bcefc960b2d5525f7b6adf11dc0eab...
36809,data/SpawningTool\Pro\page973\zvp.SC2Replay,b588030794f4d6f2b6eac01b50a97dd94a93169a137632...,Zerg,Protoss,,,1272,1v1,Faster,2,1427949111,16.0,True,xx,8,8,8d1c2a7a406cdedfdd8db8ae24968b16edfde4c1a38230...
36810,data/SpawningTool\Pro\page973\zvt-2.SC2Replay,b588030794f4d6f2b6eac01b50a97dd94a93169a137632...,Zerg,Terran,,,957,1v1,Faster,1,1427954447,16.0,True,xx,8,8,d6430bacdb3bc5a22e09d445655486cb06cf7e9d23b479...


#### Extract Blizzard metadata
This script took approximately 16 minutes to run on a 6 core machine.

In [3]:
# specify json settings in scripts/replay_settings.json
replay_settings = {
    "replay_dir": "data/Blizzard",
    "sample_size": -1,
    "n_jobs": -1,
    "output_file": "data/blizzard_replays.csv"
}

# write json settings to file
with open("scripts/replay_settings.json", "w") as file:
    json.dump(replay_settings, file, indent=4)

# run the script to extract metadata from replays
%run ./scripts/process_replays.py 

Processing 1225046 replays
Found 1203339 unique valid replays
Time elapsed: 00:16:05


In [9]:
# inspect the dataframe
df = pd.read_csv('data/blizzard_replays.csv')
display(df)

Unnamed: 0,filename,map,player1_race,player2_race,player1_mmr,player2_mmr,game_length,game_type,game_speed,game_winner,timestamp,fps,is_ladder,region,player1_highest_league,player2_highest_league,filehash
0,data/Blizzard\set_1\0000e057beefc9b1e9da959ed9...,37bd8ab1409dddf9ee2d2630cabddec5c6ffeab113e836...,Protoss,Zerg,5402,5564,932,1v1,Faster,2,1502208724,16.0,True,eu,6,6,9706f355243444e1666cd19a3a6cc1a315957e885f2ded...
1,data/Blizzard\set_1\0002b71a92623234bf67fac85e...,89b9c8252bd9bb72175be78d66280d68ef2a525143db27...,Terran,Terran,3882,3951,324,1v1,Faster,1,1502197141,16.0,True,eu,5,5,e60e9fcdde514415eb2466ceab3c7d23af2b4549d74e89...
2,data/Blizzard\set_1\0002c4f2d94ba7aaf2f71b9d8c...,be0af789b8cef0379fd32602b5730096bb0b0138fe7aba...,Terran,Zerg,4088,3365,667,1v1,Faster,2,1502280613,16.0,True,eu,5,0,82f4f6bf5d2c0af9b5a8f5c0c9606a0dc6819b09ae9a7a...
3,data/Blizzard\set_1\000309f32db5b1e65312208cca...,89b9c8252bd9bb72175be78d66280d68ef2a525143db27...,Zerg,Terran,3234,3063,775,1v1,Faster,1,1502262699,16.0,True,cn,3,3,2cc2384ffd09b3f6a63aa4fd2223db082dad6917c0f3b6...
4,data/Blizzard\set_1\000484de77e46af60af75a20b8...,37bd8ab1409dddf9ee2d2630cabddec5c6ffeab113e836...,Zerg,Zerg,4254,4121,537,1v1,Faster,1,1502290022,16.0,True,us,5,5,537f1b82a364599638da021fabb2c9d975c94ca2f59eb7...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1203334,data/Blizzard\set_2\ffffdbb8fc30435e69a1fe20d9...,37bd8ab1409dddf9ee2d2630cabddec5c6ffeab113e836...,,,2780,2314,428,1v1,Faster,2,1503455719,16.0,True,eu,1,0,b8044f1380c12d222459208e846aecf620313e4e081d7b...
1203335,data/Blizzard\set_2\fffff0871993994afd8ef921f0...,8e8a27bc43f7c310705c16e4a564f64f228bf17085aaf9...,Zerg,Terran,3277,3340,338,1v1,Faster,1,1501307715,16.0,True,eu,3,4,3307446025ff8031b6bb7d1dd3aec6b8fd85a12284281c...
1203336,data/Blizzard\set_2\fffff6ab18bdfbf9afd9a672a1...,8e8a27bc43f7c310705c16e4a564f64f228bf17085aaf9...,Terran,Protoss,5638,5479,500,1v1,Faster,2,1501283007,16.0,True,us,6,6,983c2a9b2003480188f19c976426ea468a4a619763eb51...
1203337,data/Blizzard\set_2\fffff7acc07773147979f493ab...,c3df4517b78fb0c6042f76667341403faf9b0fc479548f...,Terran,Terran,4536,4522,630,1v1,Faster,2,1502044421,16.0,True,us,5,5,5c029414352f98f3a0615bc3b3b26ca3df26ad8f8905fb...


### 1.2 - Metadata cleaning
Metadata will be used for modeling the outcome of the game. It is expected that the most important features will relate to player MMRs and races, so the completeness of these columns will be checked.

This step will not make any changes to the data stored in the `csv` files, but will rather be used as a template to create a function that can be imported in Notebook 2 to clean the data as needed, before it is used in the modeling.

In [15]:
# load both dataframes from the csv files
spawningtool_df = pd.read_csv('data/spawningtool_replays.csv')
blizzard_df = pd.read_csv('data/blizzard_replays.csv')

print("The shape of the SpawningTool dataframe is", spawningtool_df.shape)
print("The shape of the Blizzard dataframe is    ", blizzard_df.shape)

The shape of the SpawningTool dataframe is (36812, 17)
The shape of the Blizzard dataframe is     (1203339, 17)


The same inspections are applied to both datasets. There are 36,812 records in the SpawningTool dataset and 1,203,339 records in the Blizzard dataset, and each dataset has 17 columns.

#### Identify amount of missing data

In [16]:
print("Fraction of missing values in SpawningTool dataframe:")
spawningtool_df.isna().sum()/spawningtool_df.shape[0]

Fraction of missing values in SpawningTool dataframe:


filename                  0.000000
map                       0.000000
player1_race              0.012333
player2_race              0.012333
player1_mmr               0.700532
player2_mmr               0.701782
game_length               0.000000
game_type                 0.000000
game_speed                0.000000
game_winner               0.000000
timestamp                 0.000000
fps                       0.000000
is_ladder                 0.000000
region                    0.000000
player1_highest_league    0.000000
player2_highest_league    0.000000
filehash                  0.000000
dtype: float64

In [17]:
print("Fraction of missing values in Blizzard dataframe:")
blizzard_df.isna().sum()/blizzard_df.shape[0]

Fraction of missing values in Blizzard dataframe:


filename                  0.000000
map                       0.000000
player1_race              0.053712
player2_race              0.053712
player1_mmr               0.000000
player2_mmr               0.000000
game_length               0.000000
game_type                 0.000000
game_speed                0.000000
game_winner               0.000000
timestamp                 0.000000
fps                       0.000000
is_ladder                 0.000000
region                    0.000000
player1_highest_league    0.000000
player2_highest_league    0.000000
filehash                  0.000000
dtype: float64

The SpawningTool dataset is missing roughly 70% of its MMR values, which is unfortunate, as MMR is a key feature for metadata modeling. However, the primary purpose of this dataset is to train the neural network, and for that only the event data are required. The Blizzard dataset has complete MMR data but is missing player races for about 5% of replays.
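
The missing-value fractions for several datasets can also be viewed side by side. The sketch below is one way to do this; the helper name is hypothetical, and `isna().mean()` is used as an equivalent of `isna().sum() / shape[0]`.

```python
import numpy as np
import pandas as pd


def missing_fraction(frames):
    """Return one column of missing-value fractions per named dataframe.

    frames is a dict mapping a label to a dataframe.
    """
    return pd.DataFrame(
        {name: df.isna().mean() for name, df in frames.items()}
    )


# Toy data for illustration only (not taken from the real datasets).
toy = pd.DataFrame({
    "player1_mmr": [4000, np.nan, np.nan, 3500],
    "player1_race": ["Zerg", "Terran", None, "Protoss"],
})
summary = missing_fraction({"toy": toy})
```

With the real data, the call would take the form `missing_fraction({"SpawningTool": spawningtool_df, "Blizzard": blizzard_df})`.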

As a secondary check, let us define a valid MMR range and check the number of replays in both datasets that would be considered valid.
* Minimum MMR: 2,000 - This value is selected because a player below this MMR would be in "Bronze" league, and likely very new to the game. Their actions may not be as relevant to predicting the outcome of the game, and may confuse the model.
* Maximum MMR: 8,000 - The maximum MMR is selected as 8,000 as this should include all games. The top players in all regions tend to have MMRs of 7,500 and below.

We will perform a quick check on the MMR ranges to confirm the above assumptions are valid.

In [20]:
# get the max and min MMR for each region for SpawningTool
spawningtool_df.groupby('region')[
    ['player1_mmr', 'player2_mmr']
].agg(['max', 'min'])

region,player1_mmr max,player1_mmr min,player2_mmr max,player2_mmr min
cn,5828.0,-36400.0,6141.0,-36400.0
eu,7372.0,-36400.0,7416.0,-36400.0
kr,6883.0,-36400.0,6830.0,-36400.0
sea,,,,
us,6591.0,-36400.0,6798.0,-36400.0
xx,,,,


In [21]:
# get the max and min MMR for each region for Blizzard
blizzard_df.groupby('region')[
    ['player1_mmr', 'player2_mmr']
].agg(['max', 'min'])


region,player1_mmr max,player1_mmr min,player2_mmr max,player2_mmr min
cn,6299,-36400,6922,-36560
eu,6901,-36560,7271,-36926
kr,6984,-36560,7020,-36786
us,6874,-36560,6990,-36560


Both datasets have negative minimum MMRs, indicating some error in the data. These replays will be removed by applying the minimum MMR threshold. The maximum MMR of 8,000 appears to be reasonable, as the highest in either dataset is 7,416.

In [26]:
# min and max values for MMR
min_mmr, max_mmr = 2000, 8000

# count the number of replays with both players MMRs between min and max

print("The number of valid MMR replays in SpawningTool is", 
    spawningtool_df[
        (spawningtool_df['player1_mmr'] >= min_mmr) &
        (spawningtool_df['player1_mmr'] <= max_mmr) &
        (spawningtool_df['player2_mmr'] >= min_mmr) &
        (spawningtool_df['player2_mmr'] <= max_mmr)
    ].shape[0]
)

print("The number of valid MMR replays in Blizzard is  ", 
    blizzard_df[
        (blizzard_df['player1_mmr'] >= min_mmr) &
        (blizzard_df['player1_mmr'] <= max_mmr) &
        (blizzard_df['player2_mmr'] >= min_mmr) &
        (blizzard_df['player2_mmr'] <= max_mmr)
    ].shape[0]    
)


The number of valid MMR replays in SpawningTool is 10162
The number of valid MMR replays in Blizzard is   1165910


Only ~10,000 replays from SpawningTool have valid MMRs (28% of 36,812), while 1,165,910 replays from Blizzard have valid MMRs (97% of 1,203,339). Because the valid SpawningTool replays are such a small fraction of the Blizzard dataset (~1%), we will use some Blizzard data to test as well as train the Metadata model, and then use the SpawningTool data as a supplementary test set.
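
The validity filter above can also be written as a small reusable helper. The following is a minimal sketch, assuming the column names `player1_mmr` and `player2_mmr` from the dataframes above; the helper name `valid_mmr_mask` and the toy values are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd

def valid_mmr_mask(df, min_mmr=2000, max_mmr=8000):
    """Boolean mask: True where both players' MMRs lie in [min_mmr, max_mmr]."""
    # Series.between is inclusive on both ends by default; NaN evaluates to False
    return (
        df['player1_mmr'].between(min_mmr, max_mmr)
        & df['player2_mmr'].between(min_mmr, max_mmr)
    )

# toy data (illustrative): one negative MMR, two missing values, one below the minimum
toy_df = pd.DataFrame({
    'player1_mmr': [4022.0, -36400.0, 3615.0, np.nan, 1500.0],
    'player2_mmr': [3502.0, np.nan, 3567.0, 3900.0, 2500.0],
})

mask = valid_mmr_mask(toy_df)
n_valid = int(mask.sum())
frac_valid = n_valid / len(toy_df)
print(f"{n_valid} valid replays ({frac_valid:.0%} of {len(toy_df)})")
```

Applied to the counts reported above, the same arithmetic gives 10,162 / 36,812 ≈ 0.28 for SpawningTool and 1,165,910 / 1,203,339 ≈ 0.97 for Blizzard.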

#### Missing race data
It is worth noting that ~5% of the Blizzard data and ~1% of the SpawningTool data are missing race data for the players. This is peculiar, but it may be caused by players who play as "random".

Random players are players who, instead of choosing a single race to play with, are randomly assigned a race at the beginning of each match. 

It may be possible to impute this missing data from the event data for the SpawningTool replays. For the Blizzard data, these replays will be dropped wherever race is used in modeling.
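
Dropping replays with missing race data could look like the sketch below. It assumes the column names `player1_race` and `player2_race` from the dataframes above; the toy rows are made up for illustration.

```python
import numpy as np
import pandas as pd

race_cols = ['player1_race', 'player2_race']

# toy data (illustrative): two of the four replays are missing a race
toy_df = pd.DataFrame({
    'player1_race': ['Terran', np.nan, 'Zerg', 'Protoss'],
    'player2_race': ['Protoss', 'Zerg', np.nan, 'Zerg'],
    'game_winner': [1, 2, 1, 2],
})

# keep only replays where the race of both players is known
with_race_df = toy_df.dropna(subset=race_cols)
print(f"Dropped {len(toy_df) - len(with_race_df)} of {len(toy_df)} replays")
```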

#### Final checks on data quality
The pandas `describe()` method will be used to determine the value ranges for all numerical data in both datasets, to potentially identify abnormal values that may not be NaN (much like the negative MMR values).

Categorical columns (`race`, `game_type`, `map` and `region`) will be inspected using the pandas `value_counts()` method.

In [29]:
print("SpawningTool Numerical Data")
# use describe to get the mean, std, min, max, and count of each column
# use an apply function to get the values in readable format
# as suggested here: 
# https://stackoverflow.com/questions/40347689/dataframe-describe-suppress-scientific-notation
display(spawningtool_df.describe().T.apply(lambda x: x.map('{:,.2f}'.format)))

print("Blizzard Numerical Data")
display(blizzard_df.describe().T.apply(lambda x: x.map('{:,.2f}'.format)))

SpawningTool Numerical Data


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
player1_mmr,11024.0,2297.5,7008.45,-36400.0,2716.0,3184.0,4066.0,7372.0
player2_mmr,10978.0,2472.88,6505.29,-36400.0,2738.0,3182.0,4078.75,7416.0
game_length,36812.0,708.34,358.35,0.0,472.0,653.0,880.0,3552.0
game_winner,36812.0,1.38,0.59,0.0,1.0,1.0,2.0,2.0
timestamp,36812.0,1553203249.69,52457726.74,1401465118.0,1519662179.75,1563111735.5,1594694470.75,1634008784.0
fps,36812.0,16.0,0.0,16.0,16.0,16.0,16.0,16.0
player1_highest_league,36812.0,2.73,2.92,0.0,0.0,2.0,6.0,8.0
player2_highest_league,36812.0,2.88,2.93,0.0,0.0,2.0,6.0,8.0


Blizzard Numerical Data


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
player1_mmr,1203339.0,3647.7,1692.64,-36560.0,3190.0,3628.0,4129.0,6984.0
player2_mmr,1203339.0,2670.85,6456.05,-36926.0,3174.0,3622.0,4126.0,7271.0
game_length,1203339.0,658.05,429.6,0.0,362.0,631.0,895.0,3596.0
game_winner,1203339.0,1.49,0.51,0.0,1.0,1.0,2.0,2.0
timestamp,1203339.0,1502451556.91,2409097.84,997941480.0,1501808938.5,1502473096.0,1503132401.5,1510862330.0
fps,1203339.0,16.0,0.0,16.0,16.0,16.0,16.0,16.0
player1_highest_league,1203339.0,3.76,1.75,0.0,3.0,4.0,5.0,7.0
player2_highest_league,1203339.0,3.67,1.85,0.0,3.0,4.0,5.0,7.0
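
The categorical inspection with `value_counts()` mentioned earlier can be sketched as follows. This is a minimal example on made-up rows; the column names follow the metadata dataframes, and `map` is left out of the toy data only to keep it short.

```python
import pandas as pd

# toy data (illustrative only)
toy_df = pd.DataFrame({
    'player1_race': ['Terran', 'Zerg', 'Zerg', None],
    'player2_race': ['Protoss', 'Zerg', 'Protoss', 'Terran'],
    'game_type': ['1v1', '1v1', '1v1', '1v1'],
    'region': ['us', 'eu', 'eu', 'us'],
})

categorical_cols = ['player1_race', 'player2_race', 'game_type', 'region']

# dropna=False keeps missing values as their own category, so gaps stay visible
counts = {col: toy_df[col].value_counts(dropna=False) for col in categorical_cols}

for col, col_counts in counts.items():
    print(f"--- {col} ---")
    print(col_counts)
```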


## Section 2: Event Data Collection
See __[this link](https://web.archive.org/web/20201031184319/https://miguelgondu.github.io/python/ai/video%20games/2018/09/04/a-tutorial-on-sc2reader-events-and-units.html)__ for the tutorial that was used in establishing how to work with, and extract data from an SC2 replay file, using `sc2reader`, and __[this link](https://github.com/IBM/starcraft2-replay-analysis)__ for an example of analyzing a single SC2 replay file using `sc2reader` and Jupyter Notebooks.

Event data in a replay file is stored in a JSON-like format. An initial exploration of a single replay file was used to extract the list of available action categories. These are referred to as columns because they make up the initial columns of the dataframe that is saved to a `pkl` file.

See `/info/raw_columns_list.csv` for the list of columns. For each column, the following information is saved:
* `column_name` - Name of the event type.
* `include_bool` - Whether the event type is to be included in the extracted data (1 or 0).
* `categorical_bool` - If the event type is categorical, this is set to 1.
* `agg_type` - The type of aggregation that would be most appropriate for the event type.
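
To illustrate how such a checklist can be used, the sketch below reads a small inline CSV with the four fields listed above and derives the columns to keep. The rows and the `agg_type` values are hypothetical and are not taken from the actual `raw_columns_list.csv`.

```python
import io

import pandas as pd

# hypothetical excerpt; the rows and agg_type values are made up for illustration
csv_text = """column_name,include_bool,categorical_bool,agg_type
frame,1,0,max
name,1,1,first
pid,1,1,first
location,0,0,mean
"""

columns_checklist = pd.read_csv(io.StringIO(csv_text))

# names of the event attributes flagged for inclusion
columns_to_keep = columns_checklist.loc[
    columns_checklist['include_bool'] == 1, 'column_name'
].tolist()

# subset of the included attributes that are flagged as categorical
categorical_columns = columns_checklist.loc[
    (columns_checklist['include_bool'] == 1)
    & (columns_checklist['categorical_bool'] == 1),
    'column_name'
].tolist()

print(columns_to_keep)
print(categorical_columns)
```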

### 2.1 - Processing replay files

Similar to the metadata extraction, event data is collected using a script to allow for multiprocessing. The script extracts the event data for a single replay file, and multiprocessing is used to map this function over the list of all replays.

The Python package `sc2reader` is again used to extract the data; however, for event data `load_level=4` is used. This causes a significant increase in processing time, from 1 minute for the metadata extraction to 80 minutes for the event data extraction.

In [4]:
%%writefile scripts/process_events.py
"""
Functions that require separate modules (generally for multiprocessing) are found here

Currently, these are:
    - get_all_events - load a replay using sc2reader and run _get_event_data on each event

    - _get_event_data - extract the data from an event and return it as a dictionary
"""

# function to extract all attribute values from a single replay event
# defined at module level so that multiprocessing can be used
def _get_event_data(event):
    """
    _get_event_data
    Extracts the data from the event and returns it as a dictionary
    Ignores attributes that start with '_', i.e., special attributes and dunder types

    Args:
        event (sc2reader.event): Event object extracted from sc2reader.events

    Returns:
        dictionary: A dictionary containing the event data
    """
    # ignore attributes that are not needed (special or dunder)
    event_attributes = [attr for attr in dir(event) if not attr.startswith('_')]

    # initialize a dictionary to store the values of each attribute
    d = dict()

    # loop through each attribute and store the value in the dictionary
    for attr in event_attributes:
        # ignore attributes if they do not contain a value type
        if type(getattr(event, attr)) in [int, float, str, bool]:
            d[attr] = getattr(event, attr)

    return d

# function to extract all events using _get_event_data
def get_all_events(filename, output_dir='data/events'):
    """
    get_all_events

    Extracts all events from a replay file and stores them in a pickle file

    Returns the dataframe of the extracted events

    Args:
        filename (string): Absolute path to the replay file
        output_dir (str, optional): Directory where the output pkl file should be stored. Defaults to 'data/events'.

    Returns:
        pandas.DataFrame: A dataframe containing all events from the replay file
    """
    import sc2reader
    import os
    import pandas as pd

    # assert that output_dir exists
    assert os.path.isdir(output_dir), f'{output_dir} does not exist'

    # use sc2reader to extract replay data, load_level=4
    replay = sc2reader.load_replay(filename, load_level=4)

    # get events as a list from replay object
    events = replay.events

    # loop through each event and extract the data
    event_data = [_get_event_data(event) for event in events]

    # convert event_data to a dataframe
    df = pd.DataFrame(event_data)

    # get the list of columns to be kept
    columns_checklist = pd.read_csv('info/raw_columns_list.csv')
    column_keep_df = columns_checklist.loc[
        columns_checklist['include_bool'] == 1
    ].drop(['include_bool'], axis=1)

    # create a list of columns to be kept
    columns_to_keep = column_keep_df['column_name'].tolist()
    columns_to_keep = [col for col in columns_to_keep if col in df.columns]
    df = df[columns_to_keep]

    # remove all rows where pid is not NaN and pid is neither 1 nor 2
    # (i.e. keep rows with no pid, and rows belonging to player 1 or 2)
    df = df.loc[~(
            (df['pid'].notnull())
            & (df['pid'] != 1)
            & (df['pid'] != 2)
        )
    ]

    # name replay pkl with the filehash of the replay
    output_name = replay.filehash + '.pkl'
    output_path = os.path.join(output_dir, output_name)

    # save the dataframe to a pkl file
    df.to_pickle(output_path)

    return df

Writing scripts/process_events.py
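
The two filtering steps in the script (keeping only scalar, public attributes, then keeping only rows without a `pid` or with a `pid` of 1 or 2) can be tried without a replay file. The sketch below uses a hypothetical `FakeEvent` class as a stand-in for an `sc2reader` event; it is not part of `sc2reader`, and `extract_scalars` is an illustrative re-statement of `_get_event_data` rather than the function itself.

```python
import pandas as pd

class FakeEvent:
    """Hypothetical stand-in for an sc2reader event; not a real sc2reader class."""
    def __init__(self, frame, name, pid=None):
        self.frame = frame          # int, kept
        self.name = name            # str, kept
        self.pid = pid              # int is kept, None is skipped
        self.player = object()      # non-scalar attribute, skipped
        self._private = 'hidden'    # leading underscore, skipped

def extract_scalars(event):
    # same idea as _get_event_data: keep public attributes holding scalar values
    return {
        attr: getattr(event, attr)
        for attr in dir(event)
        if not attr.startswith('_')
        and type(getattr(event, attr)) in (int, float, str, bool)
    }

events = [
    FakeEvent(0, 'PlayerSetupEvent', pid=1),
    FakeEvent(0, 'UnitBornEvent'),             # no pid, becomes NaN in the dataframe
    FakeEvent(16, 'PlayerStatsEvent', pid=3),  # pid other than 1 or 2, filtered out
]
df = pd.DataFrame([extract_scalars(e) for e in events])

# keep rows whose pid is missing or belongs to player 1 or 2
df = df.loc[df['pid'].isna() | df['pid'].isin([1, 2])]
print(df)
```

The condition `isna() | isin([1, 2])` is logically equivalent to the negated form used in the script, and may be easier to read.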


Now that the script has been written, we can import it as a module and call its function under an `if __name__ == '__main__'` statement to allow multiprocessing, as described in __[this](https://stackoverflow.com/a/47374811/16854204)__ StackOverflow post.

In [5]:
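# use multiprocessing to extract all events from all replays
# NOTE: the functions are defined in scripts/process_events.py

# imports used in this cell and in the processing loop below
import datetime
import multiprocessing as mp
import time

import pandas as pd

import scripts.process_events as process_events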

# use spawningtool_replays.csv to get valid replays and their filepaths
replays_info = pd.read_csv('data/spawningtool_replays.csv')

replays_info.head()

Unnamed: 0,filename,map,player1_race,player2_race,player1_mmr,player2_mmr,game_length,game_type,game_speed,game_winner,timestamp,fps,is_ladder,region,player1_highest_league,player2_highest_league,filehash
0,data/SpawningTool\Other\page1\2-liberator-tank...,abb20e9c587f5f6ce279a6aa8c7e9be2897607abb00434...,Terran,Protoss,-36400.0,,402,1v1,Normal,1,1633598712,16.0,False,us,4,0,9384fda8c370ea7d130ac20244f7d0fda9a9b834004445...
1,data/SpawningTool\Other\page1\2-liberators-1-t...,abb20e9c587f5f6ce279a6aa8c7e9be2897607abb00434...,Terran,Zerg,-36400.0,,371,1v1,Normal,1,1633603684,16.0,False,us,4,0,f9864054498acf297aacf2be80896ba131716a341983de...
2,data/SpawningTool\Other\page1\2000-atmospheres...,c2bcc137d9ce43fbf9a5ddd04f5167892c779c29616de9...,Zerg,Zerg,4022.0,3502.0,371,1v1,Faster,1,1633788999,16.0,True,eu,6,6,178a82fa5045e82a3920688274d7a524595f0ef6dc91ae...
3,data/SpawningTool\Other\page1\2021-09-05-zserr...,e83bd9ddb33fcdc21a801f9bd598ed081f8ca60f810522...,Zerg,Protoss,,,1803,1v1,Faster,1,1630608640,16.0,False,eu,7,6,99ba80721116d18f965c141581baa878abf02526f48bc6...
4,data/SpawningTool\Other\page1\2021-10-01-zvaev...,e83bd9ddb33fcdc21a801f9bd598ed081f8ca60f810522...,Zerg,Protoss,3615.0,3567.0,502,1v1,Faster,1,1633138764,16.0,True,us,5,5,b9e96a1dabcc0de0cfd89e8f0d02b5b8e064ec3f3c07c1...


`replays_info` now contains a dataframe of the metadata and, in particular, the filename of each replay with a relative path to the replay file. This will be used as the list of files to be processed.

Each file will be output to a `pkl` file in the data/events folder. The name of the file will be the hash of the replay file as found in `replays_info`.

A for loop is used to process the replays in batches, allowing the user to track the progress of the process. This script runs in approximately 1 hour and 20 minutes on a 6-core machine, and generates approximately 130GB of `pkl` files. These files are very sparse, and can be compressed to around 7GB using 7-Zip on normal speed.

In [8]:
# use if __name__ == '__main__' to run in parallel
if __name__ == '__main__':

    # calculate total replays to process to be used in progress tracking
    total_replays = replays_info.shape[0]
    
    # set batch size
    batch_size = 500

    # get filenames as a list
    filenames = replays_info['filename'].tolist()

    # start a timer for ETA calculation
    start_time = time.time()

    for i in range(0, total_replays, batch_size):

        # set max_i to not exceed total_replays
        max_i = min(i + batch_size, total_replays)

        # get the filepaths of the replays in this batch
        replay_paths = filenames[i:max_i]

        # calculate progress as a percentage
        progress = round((i/total_replays)*100,1)

        # calculate eta
        eta = round((time.time() - start_time)/(i+1)*(total_replays-i),0)

        # convert eta to a string in hh:mm:ss format
        eta_str = str(datetime.timedelta(seconds=eta))
        
        # print progress and eta
        print(f'\rFetching ({i+1} to {max_i})/{total_replays} replays. Progress {progress}%. ETA {eta_str}', end='')

        # specify the number of processes to use
        # leave one core free, but always use at least one process
        n_jobs = max(1, mp.cpu_count() - 1)
        # initialize pool using context manager
        with mp.Pool(processes=n_jobs) as pool:
            # use pool to extract all events from all replays in the batch
            batch_replays = pool.map(
                process_events.get_all_events,
                replay_paths
            )


print(f'\rFinished fetching {max_i}/{total_replays} replays. Progress 100%.                            ', end='')

Finished fetching 36812/36812 replays. Progress 100%.                            
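
The batching and ETA logic of the loop above can be isolated into two small functions. This is a minimal sketch using only the standard library; the function names `batch_bounds` and `estimate_eta` are hypothetical. Unlike the cell above, which divides by `i+1`, this version returns `None` until at least one item has been processed.

```python
import datetime

def batch_bounds(total, batch_size):
    """Yield (start, stop) index pairs covering range(total) in batches."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

def estimate_eta(elapsed_seconds, n_done, n_total):
    """Linear extrapolation: remaining items times average seconds per item so far."""
    if n_done == 0:
        return None  # no information yet
    seconds_left = elapsed_seconds / n_done * (n_total - n_done)
    return datetime.timedelta(seconds=round(seconds_left))

# three batches for 1200 items with a batch size of 500
bounds = list(batch_bounds(total=1200, batch_size=500))
print(bounds)

# 500 of 1200 items done after 60 seconds
eta = estimate_eta(elapsed_seconds=60.0, n_done=500, n_total=1200)
print(eta)
```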

In [11]:
# inspect the last replay of the batch
batch_replays[-1].head()

Unnamed: 0,frame,name,pid,count,upgrade_type_name,control_pid,unit_type_name,upkeep_pid,food_made,food_used,...,vespene_used_current_army,vespene_used_current_economy,vespene_used_current_technology,vespene_used_in_progress,vespene_used_in_progress_army,vespene_used_in_progress_economy,vespene_used_in_progress_technology,workers_active_count,ability_id,killer_pid
0,0,PlayerSetupEvent,1.0,,,,,,,,...,,,,,,,,,,
1,0,PlayerSetupEvent,2.0,,,,,,,,...,,,,,,,,,,
2,0,UpgradeCompleteEvent,1.0,1.0,RewardDanceMule,,,,,,...,,,,,,,,,,
3,0,UnitBornEvent,,,,0.0,MineralField,0.0,,,...,,,,,,,,,,
4,0,UnitBornEvent,,,,0.0,MineralField750,0.0,,,...,,,,,,,,,,


In [14]:
# fraction of NaNs in each column, averaged over all columns
(batch_replays[-1].isna().sum()/batch_replays[-1].shape[0]).mean()

0.9103763583355421

The fraction of NaNs above shows why the event data `pkl` files are so large: on average, 91% of the values in each column are NaN, and only 9% are actual data. `pkl` files are less space-efficient than `csv` files for such sparse datasets; however, `pkl` is still preferred because it loads faster, which matters when reading a large number of separate files.
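
As an illustration of how well such sparse data compresses, the sketch below writes a toy dataframe with roughly 90% NaNs to an uncompressed and a gzip-compressed pickle and compares the file sizes. Note that this uses the `compression` argument of pandas `to_pickle`, which is a different technique from the 7-Zip archive mentioned earlier and is not what the notebook does; the data is randomly generated for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# toy sparse dataframe: 20 columns, only ~10% of cells populated (illustrative)
rng = np.random.default_rng(0)
values = rng.random((5000, 20))
values[rng.random((5000, 20)) > 0.1] = np.nan
sparse_df = pd.DataFrame(values, columns=[f'col_{i}' for i in range(20)])

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, 'events.pkl')
    gz_path = os.path.join(tmp_dir, 'events.pkl.gz')

    # write the same dataframe with and without compression
    sparse_df.to_pickle(raw_path)
    sparse_df.to_pickle(gz_path, compression='gzip')

    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)

    # the compressed file round-trips to an identical dataframe
    restored_df = pd.read_pickle(gz_path, compression='gzip')

print(f"uncompressed: {raw_size:,} bytes, gzip: {gz_size:,} bytes")
```

The trade-off is that compressed pickles take longer to read and write, so whether this is worthwhile depends on how often the files are loaded.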