# Exploratory Data Analysis on a single SC2 Replay File

This notebook details the process for loading and exploring a single SC2 replay file. The following goals are achieved:
1. Load the replay file using `sc2reader`
2. Extract the events stored in the `.events` attribute of the replay file into a pandas dataframe.
3. Extract other pertinent information from the replay file such as: 
    * Filehash.
    * Datetime the game was played.
    * The map on which the game was played.
    * Length of the game.
    * Races of the players.
    * Skill levels of the players (by league).
    * The winner of the game.
4. Drop all unnecessary columns and convert the remaining columns into dummy variables, one per action category, where a 1 indicates that the action was taken.
5. Divide the data up into time chunks (10 seconds initially, to be revised later). Sum the number of 1s in each chunk to get the number of actions taken in each chunk.
6. Add columns to the dataframe indicating the number of actions taken in all previous chunks.
7. Add columns to the dataframe indicating player races, skill levels, and game winner.
8. Attempt a logistic regression using all chunks as inputs.
9. Investigate an optimization method that finds the earliest prediction that does not change until the end of the game, i.e., how early the regression can make a prediction that remains constant for the rest of the game.

See __[this link](https://web.archive.org/web/20201031184319/https://miguelgondu.github.io/python/ai/video%20games/2018/09/04/a-tutorial-on-sc2reader-events-and-units.html)__ for the tutorial used to establish how to work with, and extract data from, an SC2 replay file using `sc2reader`.

In [1]:
# imports and settings

import pandas as pd
import sc2reader
import json
import datetime
from multiprocessing import Pool # for parallel processing of replay files

## Load a Replay
We will use the `load_level=4` setting to extract the most detailed data possible from the replay file. This includes all events, units, and upgrades.

The replay selected is a professional game between Serral (Zerg) and Showtime (Protoss). At 42 minutes it is long for a StarCraft II game, so it should give a good indication of the upper end of the processing time and resource requirements for a single replay.

In [2]:
# Load a replay
replay = sc2reader.load_replay(
    '../data/replays/SpawningTool/Pro/page1/2021-09-05-zserral-vs-pshowtime.SC2Replay',
    load_level=4)

# Extract a list of replay events
events_list = replay.events
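
The game-level information listed in goal 3 can be gathered in a similar way. The following is a minimal sketch, not a definitive implementation: `summarize_replay` is a hypothetical helper, and the attribute names it reads (`filehash`, `date`, `map_name`, `game_length`, `players`, `play_race`, `highest_league`, `winner`) are assumed to be available on the loaded replay and its player objects. Each lookup falls back to `None` if an attribute is missing. The stand-in object below holds made-up values and only shows the shape of the output.

```python
from types import SimpleNamespace


def summarize_replay(replay):
    """Collect game-level metadata from a loaded replay into a flat dict.

    Attribute names are assumptions about the replay object; any that are
    missing fall back to None instead of raising.
    """
    players = list(getattr(replay, "players", []))
    winner = getattr(replay, "winner", None)
    game_length = getattr(replay, "game_length", None)
    return {
        "filehash": getattr(replay, "filehash", None),
        "date": getattr(replay, "date", None),
        "map_name": getattr(replay, "map_name", None),
        "game_length": str(game_length) if game_length is not None else None,
        "races": [getattr(p, "play_race", None) for p in players],
        "leagues": [getattr(p, "highest_league", None) for p in players],
        "winner": str(winner) if winner is not None else None,
    }


# Stand-in replay with placeholder values (NOT taken from the real file).
demo_replay = SimpleNamespace(
    filehash="abc123",
    date="2021-09-05",
    map_name="Demo Map",
    game_length="42.00",
    players=[
        SimpleNamespace(play_race="Zerg", highest_league=None),
        SimpleNamespace(play_race="Protoss", highest_league=None),
    ],
    winner=None,
)

summary = summarize_replay(demo_replay)
print(summary)
```

With a real replay, the call would be `summarize_replay(replay)`. Returning a flat dictionary keeps the result easy to attach to the event dataframe later as constant columns.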

## Extract Event Data from Replay
The following code extracts the events from the replay file and stores them in a pandas dataframe. Multiprocessing is used to speed up the process, as the replay file contains over 150,000 events.

Steps:
1. `events_list` now contains a list of stored event objects. `get_event_data` is a function that takes a specific event, and returns a dictionary of the values for each attribute of that event object. 
2. A multiprocessing pool iterates through all events returning a dictionary for each.
3. The output from the pool is stored in a list of dictionaries, which can then be converted into a pandas dataframe. A list of dictionaries is used so that the column names for all event types do not need to be created manually.

In [3]:
# create function to get the event data so that multiprocessing can be used
def get_event_data(event):
    """
    Extract the data from an event and return it as a dictionary.
    Ignores attributes that start with '_', i.e., private and dunder attributes.

    Args:
        event (sc2reader.event): Event object extracted from sc2reader.events

    Returns:
        [dict]: A dictionary containing the event data
    """    
    # ignore attributes that are not needed (special or dunder)
    event_attributes = [attr for attr in dir(event) if not attr.startswith('_')]

    # initialize a dictionary to store the values of each attribute
    d = dict()

    # loop through each attribute and store the value in the dictionary
    for attr in event_attributes:
        value = getattr(event, attr)
        # keep only attributes that hold a plain scalar value
        if type(value) in (int, float, str, bool):
            d[attr] = value
    
    return d

In [4]:
# run the function on each event in the list, returns a list of dictionaries
with Pool() as pool:
    dict_list = pool.map(get_event_data, events_list)

# convert the list of dictionaries to a dataframe
event_df = pd.DataFrame(dict_list)
event_df.shape

(151759, 112)

We now have an `event_df` with 151,759 rows and 112 columns.
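
The wide column count follows from how pandas builds a dataframe from a list of dictionaries: the columns are the union of all keys, and any key a given event lacks is filled with `NaN`. A small sketch with two made-up events shows the effect. The event and attribute names here are illustrative only, not taken from the replay above.

```python
import pandas as pd

# Two toy "events" with different attribute sets (all values are made up).
toy_dicts = [
    {"name": "CameraEvent", "second": 1, "x": 10.5, "y": 20.0},
    {"name": "UnitBornEvent", "second": 2, "unit_type_name": "Drone"},
]

# Columns become the union of all keys; missing values become NaN.
toy_df = pd.DataFrame(toy_dicts)

print(toy_df.shape)
print(toy_df.isna().sum().to_dict())
```

Each event type therefore only fills the columns for its own attributes, which is why most cells in `event_df` can be expected to be empty.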

## Processing the Dataframe
The major steps to accomplish here are:
1. Identify all columns.
2. Drop all columns that are not needed. We will also store a list of columns that are left after dropping in a csv file in the info directory. This list can be used later as a master list of columns for processing of all dataframes.
3. Check if columns are continuous or categorical (discrete integers are considered categorical).
4. Convert all categorical columns into dummy variables, dropping the original column and the first dummy column (using the pandas `drop_first` option).
5. Remove all rows with non-player PID values (the players are the first two PIDs in the replay, all other PIDs are observers).
6. Summarise the dataframe by dividing it up into chunks of 10 seconds, summing up the count of all events during that time.
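
Steps 4 to 6 can be sketched on a toy dataframe as follows. This is an illustration under stated assumptions rather than the final pipeline: the column names `second`, `pid`, and `name`, the player PIDs 1 and 2, and all values are assumptions made for the example. The last line also derives the "actions in all previous chunks" counts described in the goals.

```python
import pandas as pd

# Toy event table (made-up values); pid 5 stands in for an observer.
toy_events = pd.DataFrame({
    "second": [1, 4, 9, 12, 15, 27, 3],
    "pid": [1, 2, 1, 1, 2, 2, 5],
    "name": ["CameraEvent", "SelectionEvent", "SelectionEvent",
             "CameraEvent", "TargetPointCommandEvent", "SelectionEvent",
             "CameraEvent"],
})

CHUNK_SECONDS = 10    # initial chunk size, to be revised later
PLAYER_PIDS = [1, 2]  # assumption: the two players hold the first two PIDs

# Step 5: keep only rows that belong to the two players.
players_only = toy_events[toy_events["pid"].isin(PLAYER_PIDS)]

# Step 4: one-hot encode the categorical column, dropping the first level.
dummies = pd.get_dummies(players_only["name"], prefix="name",
                         drop_first=True, dtype=int)
encoded = pd.concat([players_only[["second", "pid"]], dummies], axis=1)

# Step 6: assign each event to a 10-second chunk and count events per chunk.
encoded["chunk"] = encoded["second"] // CHUNK_SECONDS
chunked = encoded.groupby(["pid", "chunk"])[list(dummies.columns)].sum()

# Counts from all previous chunks: running total minus the current chunk.
previous = chunked.groupby(level="pid").cumsum() - chunked

print(chunked)
print(previous)
```

Note that `drop_first=True` removes the alphabetically first category (`CameraEvent` in this toy data), so its counts no longer appear as a column of their own.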
