# Exploratory Data Analysis

This notebook contains the code for exploratory data analysis. It focuses on extracting data that describes each replay as a whole. The steps are:
1. Create a list of all replays with a relative link to their files.
    * Replays will be identified by their filehash; duplicate filehashes will not be added to the list.
    * First create the table as a pandas DataFrame, after which it can be written to CSV.
    * The list will be saved in the root of the `data/` folder, which will serve as the reference point for relative file paths.
    * Only 1v1 replays will be considered. This will be confirmed from the `game_type` attribute.
2. Extract relevant data from replay files. The following data will be extracted:
    * Map
    * Player names
    * Player races
    * Player levels (if available)
    * Player highest league (if available)
    * Date
    * Length
    * Winner
3. Add all relevant data to the CSV.
4. Data exploration
    * Initial data visualization.
    * Cleaning any missing data.
5. Initial modeling of metadata to observe any interesting trends.
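
As a rough illustration of step 1, the sketch below builds the replay list from records that are assumed to have been read from the replays already. The function name `build_replay_index`, the column names, the example hashes and paths, and the output filename `replays.csv` are all hypothetical; only the filehash de-duplication, the 1v1 filter, and the paths relative to `data/` come from the plan above.

```python
import os
import pandas as pd

def build_replay_index(records, data_root):
    """Build a de-duplicated table of 1v1 replays.

    records: iterable of (filehash, absolute_path, game_type) tuples
             (assumed to have been extracted from the replays beforehand).
    data_root: folder that relative paths are measured from.
    """
    rows = [
        {"filehash": filehash,
         "path": os.path.relpath(path, start=data_root)}
        for filehash, path, game_type in records
        if game_type == "1v1"  # keep 1v1 replays only
    ]
    index = pd.DataFrame(rows, columns=["filehash", "path"])
    # keep the first file seen for each filehash, drop later duplicates
    index = index.drop_duplicates(subset="filehash", keep="first")
    return index.reset_index(drop=True)

# made-up records for illustration only
records = [
    ("abc", "/tmp/data/replays/a.SC2Replay", "1v1"),
    ("abc", "/tmp/data/replays/copy_of_a.SC2Replay", "1v1"),  # duplicate hash
    ("def", "/tmp/data/replays/b.SC2Replay", "2v2"),          # not 1v1
    ("ghi", "/tmp/data/replays/sub/c.SC2Replay", "1v1"),
]
replay_index = build_replay_index(records, "/tmp/data")
# replay_index.to_csv("/tmp/data/replays.csv", index=False)  # hypothetical filename
```

Dropping duplicates on the DataFrame rather than while looping keeps the filtering logic in one place, at the cost of holding every row in memory first.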

In [14]:
# initial imports and settings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sc2reader
from multiprocessing import Pool # for parallel processing of replay files
import re

## Function for extracting data from a single replay file
A function will be created to extract the relevant data from a single replay file. This function can then be used in a loop to extract data from all replays.

The steps of the function are first built up interactively in the notebook, before the function itself is written.

In [3]:
replays_dir = '/home/jared/sc2_modeling/data/replays/SpawningTool'

def get_replays(replays_dir):
    # loop through each file in replays_dir using os.walk
    for dirpath, dirnames, filenames in os.walk(replays_dir):
        # loop through each file in the directory
        for filename in filenames:
            # if the file is a replay file
            if filename.endswith('.SC2Replay'):
                # open the replay file
                filepath = os.path.join(dirpath, filename)
                replay = sc2reader.load_replay(
                    filepath, load_level=4)
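                return replay, filepath # return the first replay found
    # fail loudly instead of returning None, which would break the unpacking below
    raise FileNotFoundError(f'No .SC2Replay files found under {replays_dir}')
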
replay, file_path = get_replays(replays_dir)
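
The cell above stops at the first replay it finds. To loop over all replays later, one option is a generator that yields every replay path and leaves the loading to the caller. This is a minimal sketch; the name `iter_replay_paths` is hypothetical and is not part of the notebook.

```python
import os

def iter_replay_paths(replays_dir):
    """Yield the path of every .SC2Replay file under replays_dir."""
    for dirpath, _dirnames, filenames in os.walk(replays_dir):
        # sort file names so the order within a directory is reproducible
        for filename in sorted(filenames):
            if filename.endswith('.SC2Replay'):
                yield os.path.join(dirpath, filename)
```

Each yielded path could then be passed to `sc2reader.load_replay(path, load_level=4)`, as in the cell above. Keeping the walk separate from the loading also makes it straightforward to hand the paths to a `multiprocessing.Pool` later.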

In [46]:
# check which of Protoss, Terran, Zerg is contained in player1 string
def player_name_and_race(players):
    """
    player_name_and_race
    Takes players object from sc2reader object and extracts individual
    player names and races

    Args:
        players (sc2reader.replay.players): Players attribute from sc2reader 
        replay object.

    Returns:
        tuple: A tuple with name and race, respectively for each player
    """    
    # empty list to store player names and races
    player_name = []
    player_race = []

    # names of the races
    race_list = ['Protoss', 'Terran', 'Zerg']

    for player in players:
        
        # convert player to string
        player_string = str(player)

        for race in race_list:
            
            if race in player_string:

                # delete race from the player1 string
                race_string = ' ('+race+')'
                # assert that race_string is in player_string
                assert race_string in player_string, \
                    f'{player_string} does not adhere to {race_string} formatting'
                # use replace to delete the player_string race
                player_string = player_string.replace(race_string, '')

                # create regex to find 'Player 1 - ' leaving only actual name
                reg_str = r'Player\s\d\s\-\s'

                # assert that reg_str is in player_string string
                assert re.search(reg_str, player_string), \
                    f'{player_string} does not adhere to {reg_str} formatting'

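                player_name.append(re.sub(reg_str, '', player_string))
                player_race.append(race)
                # race found for this player; skip the remaining races so a
                # name that contains another race word is not re-checked
                break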

    # convert player_name and player_race lists into tuple for return
    return tuple(
        [val for pair in zip(player_name, player_race) for val in pair])



In [54]:
# if replay.game_type == '1v1':
# extract filehash
file_hash = replay.filehash
# extract map name
file_map = replay.map_name
# get players
player1_name, player1_race, player2_name, player2_race = \
                                player_name_and_race(replay.players)

dir(replay.players[0])


['URL_TEMPLATE',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'archon_leader_id',
 'attribute_data',
 'clan_tag',
 'color',
 'combined_race_levels',
 'commander',
 'commander_level',
 'commander_mastery_level',
 'commander_mastery_talents',
 'detail_data',
 'difficulty',
 'events',
 'format',
 'handicap',
 'hero_mount',
 'hero_name',
 'hero_skin',
 'highest_league',
 'init_data',
 'is_human',
 'is_observer',
 'is_referee',
 'killed_units',
 'messages',
 'name',
 'pick_race',
 'pid',
 'play_race',
 'recorder',
 'region',
 'result',
 'sid',
 'slot_data',
 'subregion',
 'team',
 'team_id',
 'toon_handle',
 'toon_id',
 'trophy_id',
 'uid',
 'units',
 'url']
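
The attribute listing above includes `name`, `play_race`, `result`, and `highest_league`, so the name and race could probably be read directly instead of being parsed out of `str(player)`. The sketch below shows the idea with stand-in objects, so that it runs without a replay file. The helper name `player_fields` and all example values are hypothetical, and the values these attributes actually hold in real replays have not been checked here.

```python
from types import SimpleNamespace

def player_fields(player):
    """Return (name, race, result, highest_league) from player attributes."""
    return (
        player.name,
        player.play_race,
        getattr(player, "result", None),          # None if the attribute is absent
        getattr(player, "highest_league", None),  # None if the attribute is absent
    )

# stand-ins for sc2reader player objects (made-up values)
players = [
    SimpleNamespace(name="Alice", play_race="Zerg",
                    result="Win", highest_league=5),
    SimpleNamespace(name="TerranBob", play_race="Protoss",
                    result="Loss", highest_league=None),
]

# same flat layout as player_name_and_race: name1, race1, name2, race2
flat = tuple(val for p in players for val in (p.name, p.play_race))
```

Reading attributes avoids the case where a player name itself contains a race word, such as the made-up `TerranBob` above, which string matching on `str(player)` would have to handle separately.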

In [35]:
a = [1,2]
b = [3,4]
[val for pair in zip(a, b) for val in pair]

[1, 3, 2, 4]
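
The interleaving checked above can also be written with `itertools.chain.from_iterable`, which may read more easily than the nested comprehension. This is only an equivalent spelling of the same idea, using the same toy lists.

```python
from itertools import chain

a = [1, 2]
b = [3, 4]
# zip pairs the items up, chain flattens the pairs into one sequence
interleaved = list(chain.from_iterable(zip(a, b)))
```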