# Exploratory Data Analysis on a single SC2 Replay File

This notebook details the process for loading and exploring a single SC2 replay file. The following goals are achieved:
1. Load the replay file using `sc2reader`
2. Extract the events stored in the `.events` attribute of the replay file into a pandas dataframe.
3. Extract other pertinent information from the replay file such as: 
    * Filehash.
    * Datetime the game was played.
    * The map on which the game was played.
    * Length of the game.
    * Races of the players.
    * Skill levels of the players (by league).
    * The winner of the game.
4. Drop all unnecessary columns, then convert the remaining categorical columns into dummy variables: one column per action category, holding 1 where that action was taken.
5. Divide the data up into time chunks (10 seconds initially, to be revised later). Sum the number of 1s in each chunk to get the number of actions taken in each chunk.
6. Add columns to the dataframe indicating the number of actions taken in all previous chunks.
7. Add columns to the dataframe indicating player races, skill levels, and game winner.
8. Attempt a logistic regression using all chunks as inputs.
9. Investigate an optimization method that finds the earliest prediction that stays unchanged until the end of the game, i.e., how early the regression can make a prediction that it never revises.
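
Goal 9 can be made concrete with a small helper. The following is only a sketch: `earliest_stable_index` and the sample predictions are hypothetical and not part of this notebook's pipeline. It assumes one prediction per time chunk and returns the first chunk from which the prediction equals the final one and never changes again.

```python
def earliest_stable_index(predictions):
    """Return the index of the first chunk after which the prediction never changes."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    final = predictions[-1]
    idx = len(predictions) - 1
    # walk backwards while the previous chunk still agrees with the final prediction
    while idx > 0 and predictions[idx - 1] == final:
        idx -= 1
    return idx


# hypothetical predicted winner (player 1 or 2) for each 10-second chunk
chunk_predictions = [1, 2, 1, 1, 2, 2, 2, 2]
stable_from = earliest_stable_index(chunk_predictions)
print(f"stable from chunk {stable_from}, i.e. second {stable_from * 10}")
```

A backwards scan is used because a prediction that matches the final one early on (chunk 1 in the sample) does not count if it flips again later.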

See __[this tutorial](https://web.archive.org/web/20201031184319/https://miguelgondu.github.io/python/ai/video%20games/2018/09/04/a-tutorial-on-sc2reader-events-and-units.html)__, which was used to work out how to load an SC2 replay file and extract data from it with `sc2reader`.

In [1]:
# imports and settings

import pandas as pd
import sc2reader
import json
import datetime
from multiprocessing import Pool # for parallel processing of replay files

## Load a Replay
We will use the `load_level=4` setting to extract the most detailed data possible from the replay file. This includes all events, units, and upgrades.

The replay selected is a professional game between Serral (Zerg) and Showtime (Protoss). At 42 minutes it is long for a StarCraft II game, so it should give a good indication of the upper end of the processing time and resources needed for a replay.

In [2]:
# Load a replay
# note: no trailing comma here, otherwise replay_path becomes a tuple instead of a string
replay_path = '../data/replays/SpawningTool/Pro/page1/2021-09-05-zserral-vs-pshowtime.SC2Replay'
replay = sc2reader.load_replay(replay_path, load_level=4)

# Extract a list of replay events
events_list = replay.events

## Extract Event Data from Replay
The following code extracts the events from the replay file and stores them in a pandas dataframe. Multiprocessing is used to speed up the process, as the replay file contains over 150,000 events.

Steps:
1. `events_list` now contains a list of stored event objects. `get_event_data` is a function that takes a specific event, and returns a dictionary of the values for each attribute of that event object. 
2. A multiprocessing pool iterates through all events returning a dictionary for each.
3. The output from the pool is stored in a list of dictionaries, which can then be converted into a pandas dataframe. A list of dictionaries is used so that the column names for all event types do not need to be created manually.
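
To illustrate step 3, here is a minimal sketch with two made-up event dictionaries (the keys and values are illustrative, not taken from the replay). When dictionaries have different keys, `pd.DataFrame` builds the union of all keys as columns and fills the gaps with `NaN`.

```python
import pandas as pd

# two hypothetical event dictionaries with different sets of keys
rows = [
    {'name': 'CameraEvent', 'second': 3, 'x': 41.5},
    {'name': 'PlayerStatsEvent', 'second': 10, 'minerals_current': 40},
]

df = pd.DataFrame(rows)

# every key seen in any dictionary becomes a column
print(sorted(df.columns))
# keys missing from a dictionary show up as NaN in that row
print(df.isna().sum().to_dict())
```

This is also why a dataframe built this way tends to be sparse: each event type only populates the columns for the attributes it actually has.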

In [3]:
# create function to get the event data so that multiprocessing can be used
def get_event_data(event):
    """
    get_event_data
    Extracts the data from the event and returns it as a dictionary.
    Ignores attributes whose names start with '_', i.e., private and dunder
    attributes, and keeps only attributes holding an int, float, str, or bool.

    Args:
        event (sc2reader.event): Event object extracted from sc2reader.events

    Returns:
        [dict]: A dictionary containing the event data
    """    
    # ignore attributes that are not needed (special or dunder)
    event_attributes = [attr for attr in dir(event) if not attr.startswith('_')]

    # initialize a dictionary to store the values of each attribute
    d = dict()

    # loop through each attribute and store the value in the dictionary
    for attr in event_attributes:
        value = getattr(event, attr)
        # keep only plain scalar values (skips methods, objects, lists, etc.)
        if type(value) in (int, float, str, bool):
            d[attr] = value
    
    return d

In [4]:
# run the function on each event in the list, returns a list of dictionaries
with Pool() as pool:
    dict_list = pool.map(get_event_data, events_list)

# convert the list of dictionaries to a dataframe
event_df = pd.DataFrame(dict_list)
event_df.shape

(151759, 112)

We now have an `event_df` with 151,759 rows and 112 columns.

## Processing the Dataframe
The major steps to accomplish here are:
1. Identify all columns.
2. Drop all columns that are not needed. We will also store a list of columns that are left after dropping in a csv file in the info directory. This list can be used later as a master list of columns for processing of all dataframes.
3. Check if columns are continuous or categorical (discrete integers are considered categorical).
4. Convert all categorical columns into dummy variables, dropping the original column and the first dummy column (using the pandas `drop_first` option).
5. Remove all rows with non-player PID values (the players are the first two PIDs in the replay, all other PIDs are observers).
6. Summarise the dataframe by dividing it up into chunks of 10 seconds, summing up the count of all events during that time.
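
Steps 4 to 6 can be sketched on a toy dataframe as follows. The rows are made up, the players are assumed to have PIDs 1 and 2, and `drop_first` is left out only so that every category stays visible in the small output. This is an illustration of the idea, not the final processing code.

```python
import pandas as pd

# toy stand-in for event_df; all values are made up for illustration
toy = pd.DataFrame({
    'second': [1, 4, 12, 15, 18, 25],
    'pid': [1, 2, 1, 3, 2, 1],
    'name': ['CameraEvent', 'SelectionEvent', 'CameraEvent',
             'ChatEvent', 'CameraEvent', 'SelectionEvent'],
})

# step 5: keep only the two players (assumed here to be PIDs 1 and 2)
players = toy[toy['pid'].isin([1, 2])]

# step 4: one indicator column per event name
dummies = pd.get_dummies(players['name'], prefix='name', dtype=int)

# step 6: label each row with its 10-second chunk, then sum per player and chunk
chunk = (players['second'] // 10).rename('chunk')
summary = dummies.groupby([players['pid'], chunk]).sum()
print(summary)
```

Integer division of `second` by the chunk length gives the chunk label, so changing the chunk size later only means changing the divisor.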


In [5]:
# # print the names of the columns to a csv
# pd.DataFrame(event_df.columns).to_csv(
#     '../info/raw_columns_list.csv',
#     index=False)

In [6]:
# read the csv into columns_checklist
columns_checklist = pd.read_csv('../info/raw_columns_list.csv')
columns_checklist

Unnamed: 0,column_name,include_bool,categorical_bool
0,frame,1,0
1,name,1,1
2,pid,1,1
3,second,1,0
4,sid,0,0
...,...,...,...
107,target,0,0
108,text,0,0
109,to_all,0,0
110,to_allies,0,0


In [7]:
# create a dataframe of all columns to be included
column_keep_df = columns_checklist.loc[
    columns_checklist['include_bool'] == 1].drop(['include_bool'], axis=1)
column_keep_df

Unnamed: 0,column_name,categorical_bool
0,frame,0
1,name,1
2,pid,1
3,second,0
5,type,1
...,...,...
92,target_unit_id,1
93,target_unit_type,1
101,first_unit_index,1
102,killer_pid,1


In [9]:
# We now have a list of 68 columns that we want to keep
# We can use this list to create a new dataframe with only the columns we want
event_df = event_df[column_keep_df['column_name']]
event_df.shape

(151759, 68)

### Create Masks for Categorical and Continuous Columns
We will create two masks: one for all categorical columns and one for all continuous columns.

First we need to identify which columns are continuous and which columns are categorical.

In [18]:
# identify which columns are continuous and which are categorical
categorical_columns = column_keep_df.loc[
        column_keep_df['categorical_bool'] == 1, 'column_name'].tolist()
continuous_columns = column_keep_df.loc[
        column_keep_df['categorical_bool'] == 0, 'column_name'].tolist()

In [19]:
print('<<<<Categorical columns>>>>\n',categorical_columns)

<<<<Categorical columns>>>>
 ['name', 'pid', 'type', 'uid', 'upgrade_type_name', 'unit_id', 'mask_type', 'ability_id', 'target_unit_id', 'target_unit_type', 'first_unit_index', 'killer_pid', 'killing_unit_id']


In [20]:
print('<<<<Continuous columns>>>>\n',continuous_columns)

<<<<Continuous columns>>>>
 ['frame', 'second', 'count', 'ff_minerals_lost_army', 'ff_minerals_lost_economy', 'ff_minerals_lost_technology', 'ff_vespene_lost_army', 'ff_vespene_lost_economy', 'ff_vespene_lost_technology', 'food_made', 'food_used', 'minerals_collection_rate', 'minerals_current', 'minerals_killed', 'minerals_killed_army', 'minerals_killed_economy', 'minerals_killed_technology', 'minerals_lost', 'minerals_lost_army', 'minerals_lost_economy', 'minerals_lost_technology', 'minerals_used_active_forces', 'minerals_used_current', 'minerals_used_current_army', 'minerals_used_current_economy', 'minerals_used_current_technology', 'minerals_used_in_progress', 'minerals_used_in_progress_army', 'minerals_used_in_progress_economy', 'minerals_used_in_progress_technology', 'resources_killed', 'resources_lost', 'resources_used_current', 'resources_used_in_progress', 'vespene_collection_rate', 'vespene_current', 'vespene_killed', 'vespene_killed_army', 'vespene_killed_economy', 'vespene_k

There are 13 categorical columns and 55 continuous columns. 

One of the continuous columns is `count`, which appears to record how many times an event occurs during a frame. This column will supply the values when the categorical columns are dummified: each dummy value in a row is multiplied by that row's `count`. In general `count` appears to stay at 1, but the multiplication is done in case that does not hold for every replay. After the multiplication, the `count` column should be dropped.
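
A minimal sketch of that multiplication on made-up rows is shown here. One assumption is added that the text does not state: a missing `count` is treated as a single occurrence via `fillna(1)`. Without some such rule, multiplying by `NaN` would blank out the dummy values, and the summary statistics for this replay show `count` populated for only 117 rows.

```python
import pandas as pd

# toy rows; the event names and counts are made up for illustration
toy = pd.DataFrame({
    'name': ['UnitBornEvent', 'SelectionEvent', 'UnitBornEvent'],
    'count': [2.0, None, 1.0],  # NaN where the event carries no count
})

dummies = pd.get_dummies(toy['name'], prefix='name', dtype=int)

# assumption: a missing count means the event happened once
weights = toy['count'].fillna(1)

# multiply every dummy column row-wise by the count; 'count' itself is not carried over
weighted = dummies.mul(weights, axis=0)
print(weighted)
```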

### Continuous Columns
The values in the continuous columns should be investigated to determine the best way to aggregate each one. For example, the mineral collection rate (stored as minerals collected per minute) would need to be averaged over each time chunk, whereas the number of units killed would need to be summed. It may be best to transform any continuous column that requires a different aggregation method (e.g., mineral collection rate) so that it can simply be summed over frames instead.
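
As a sketch of what per-column aggregation could look like, the example below averages a rate column and sums a counter column within each 10-second chunk using pandas named aggregation. The data is made up, and `units_killed` is a hypothetical column name used only for illustration; which aggregation suits each real column still has to be checked.

```python
import pandas as pd

# toy data; 'units_killed' is a hypothetical column used only for illustration
toy = pd.DataFrame({
    'second': [0, 7, 14, 21, 28],
    'minerals_collection_rate': [300.0, 340.0, 400.0, 420.0, 500.0],
    'units_killed': [0, 2, 1, 0, 3],
})

# label each row with its 10-second chunk
chunk = (toy['second'] // 10).rename('chunk')

# rates are averaged within a chunk, counters are summed
summary = toy.groupby(chunk).agg(
    minerals_collection_rate=('minerals_collection_rate', 'mean'),
    units_killed=('units_killed', 'sum'),
)
print(summary)
```

Keeping the column-to-aggregation mapping in one place makes it easy to revise once each continuous column has been inspected.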

In [24]:
event_df[continuous_columns].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
frame,151759.0,20436.349119,11404.383827,0.0,10572.0,20413.0,30259.0,40406.0
second,151759.0,1276.807497,712.771601,0.0,660.0,1275.0,1891.0,2525.0
count,117.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
ff_minerals_lost_army,505.0,263.089109,482.600904,0.0,0.0,0.0,300.0,1400.0
ff_minerals_lost_economy,505.0,146.435644,167.044715,0.0,0.0,150.0,400.0,400.0
ff_minerals_lost_technology,505.0,64.752475,119.378301,0.0,0.0,0.0,100.0,400.0
ff_vespene_lost_army,505.0,131.336634,241.408823,0.0,0.0,0.0,150.0,700.0
ff_vespene_lost_economy,505.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ff_vespene_lost_technology,505.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
food_made,505.0,183.641584,76.416207,14.0,152.0,212.0,230.0,313.0


In [28]:
event_df.loc[600:610, ['minerals_collection_rate', 'minerals_current', 'minerals_lost', 'minerals_used_current']]

Unnamed: 0,minerals_collection_rate,minerals_current,minerals_lost,minerals_used_current
600,,,,
601,335.0,40.0,0.0,1050.0
602,251.0,30.0,0.0,1000.0
603,,,,
604,,,,
605,,,,
606,,,,
607,,,,
608,,,,
609,,,,
