# Exploratory Data Analysis

This notebook contains the code for the exploratory data analysis. It focuses on extracting data that describes each replay as a whole. The analysis proceeds in the following steps:
1. Create a list of all replays with a relative link to their files.
    * Replays will be identified by their filehash; duplicate filehashes will not be added to the list.
    * The table will first be created as a pandas DataFrame and then written to CSV.
    * The list will be saved in the root of the `data/` folder, which will serve as the reference point for relative file paths.
    * Only 1v1 replays will be considered. This will be confirmed from the `game_type` attribute.
2. Extract relevant data from replay files. The following data will be extracted:
    * Map
    * Player names
    * Player races
    * Player levels (if available)
    * Player highest league (if available)
    * Date
    * Length
    * Winner
3. Add all relevant data to the CSV.
4. Data exploration
    * Initial data visualization.
    * Cleaning any missing data.
5. Initial modeling of metadata to observe any interesting trends.
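
Step 1 above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's final implementation: it assumes the filehash can be approximated by a SHA-256 digest of the file's bytes, and the function names (`file_sha256`, `build_replay_index`, `write_replay_index`) and column names are hypothetical. It uses the `csv` module instead of pandas only to stay dependency-free.

```python
import csv
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file's bytes (assumed stand-in for the replay's own filehash)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # read in chunks so large files are never loaded into memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_replay_index(data_root, extension=".SC2Replay"):
    """Return {filehash: path relative to data_root}, first file per hash wins."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(data_root):
        for filename in sorted(filenames):
            if not filename.endswith(extension):
                continue
            full_path = os.path.join(dirpath, filename)
            file_hash = file_sha256(full_path)
            # setdefault keeps an existing entry, so duplicate hashes are skipped
            index.setdefault(file_hash, os.path.relpath(full_path, data_root))
    return index


def write_replay_index(index, csv_path):
    """Write the index as a two-column CSV (hypothetical column names)."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filehash", "relative_path"])
        for file_hash, rel_path in sorted(index.items()):
            writer.writerow([file_hash, rel_path])
```

Keying the dictionary on the hash makes the duplicate check a constant-time lookup, and storing paths relative to `data_root` keeps the list valid if the `data/` folder is moved.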

In [1]:
# initial imports and settings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sc2reader
from multiprocessing import Pool, cpu_count # for parallel processing of replay files
import re
import random

## Function for extracting data from a single replay file
A function will be created to extract the relevant data from a single replay file. This function can then be called in a loop to extract data from all replays.

The individual steps are first prototyped in notebook cells before being combined into the function itself.

In [2]:
replays_dir = '/home/jared/sc2_modeling/data/replays/Replays'

def get_replays(replays_dir):
    # loop through each file in replays_dir using os.walk
    for dirpath, dirnames, filenames in os.walk(replays_dir):
        # loop through each file in the directory
        for filename in filenames:
            # if the file is a replay file
            if filename.endswith('.SC2Replay'):
                filepath = os.path.join(dirpath, filename)
                replay = sc2reader.load_replay(
                    filepath,
                    load_level=4
                    )
                
                return replay, filepath # return the first replay found
    
replay, file_path = get_replays(replays_dir)

In [20]:
my_replay = sc2reader.load_replay(
    '/home/jared/sc2_modeling/data/replays/Replays/14500c67c7c85f019874c47bc0101d803dab2b908ec5bc777d22f74f59405d2f.SC2Replay',
    load_level=4
    )


In [21]:
my_replay.filename

'/home/jared/sc2_modeling/data/replays/Replays/14500c67c7c85f019874c47bc0101d803dab2b908ec5bc777d22f74f59405d2f.SC2Replay'

In [24]:
my_replay.real_length

Length(0)

In [22]:
str(my_replay.game_length)

'00.00'

In [6]:
class ReplayInfo:

    def __init__(self, replay):
        # self.__Replay = replay
        self.map_name = replay.map_name
        self.player_races = self._get_player_races(replay)
        self.filename = replay.filename
        self.player_mmrs = self._get_player_mmrs(replay)
        self.game_length = self._get_game_length(replay)
        self.game_winner = self._get_winner(replay)
        self.timestamp = replay.unix_timestamp
        self.game_type = replay.type
        self.game_speed = replay.speed
        self.fps = replay.game_fps
        self.is_ladder = replay.is_ladder
        self.region = replay.region

        
    def _get_game_length(self, replay):

        # str() renders the length as 'MM.SS' (minutes.seconds)
        length_string = str(replay.game_length) 

        # split on the dot to separate the minutes from the seconds
        minutes = length_string.split('.')[0]
        seconds = length_string.split('.')[1]

        # convert to int in seconds
        return int(minutes)*60 + int(seconds)

        
    def _get_winner(self, replay):
        
        winner_string = str(replay.winner)
        if 'Player 1' in winner_string:
            return 1
        elif 'Player 2' in winner_string:
            return 2
        else:
            return 0

    def _get_player_mmrs(self, replay):
        
        str_a = 'replay.initData.backup'
        str_b = 'user_initial_data'
        str_c = 'scaled_rating'

        return (
            replay.raw_data[str_a][str_b][0][str_c],
            replay.raw_data[str_a][str_b][1][str_c]
            )


    # get the player races
    def _get_player_races(self, replay):
        """
        _get_player_races
        Iterate through the players in replay.players and extract the player
        races as strings from the info

        Returns:
            tuple - Length 2, containing the races of both players
        """        

        player_string = []

        for player in replay.players:
            # convert player to string
            player_string.append(str(player))


        return (
            self._get_race(player_string[0]),
            self._get_race(player_string[1])
            )


    def _get_race(self, player):
        """
        _get_race Extract race from string. String is assumed to be of the form:
        'Player x - Name (Race)'. 

        Args:
            player (str): A string of form 'Player x - Name (Race)'

        Returns:
            str: Race of the player
        """

        # assert that player is a string
        assert isinstance(player, str), 'player should be a string'

        # names of the races
        race_list = ['Protoss', 'Terran', 'Zerg']

        for race in race_list:

            if race in player:

                # the race is expected to appear as a ' (Race)' suffix
                race_string = ' ('+race+')'
                # assert that race_string is in player
                assert race_string in player, \
                    f'{player} does not adhere to {race_string} formatting'
                # use replace to delete the player race
                player = player.replace(race_string, '')

                # regex matching the 'Player 1 - ' prefix that precedes the name
                reg_str = r'Player\s\d\s\-\s'

                # assert that reg_str is in player string
                assert re.search(reg_str, player), \
                    f'{player} does not adhere to {reg_str} formatting'

                return race
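
The `_get_game_length` method above splits the length string into exactly two fields. The variant below is a small sketch that also copes with a leading hours field; it assumes that games longer than an hour are rendered as 'H.MM.SS', which should be checked against real replays. The name `length_to_seconds` is hypothetical.

```python
def length_to_seconds(length_string):
    """Convert 'MM.SS' (or, by assumption, 'H.MM.SS') into total seconds."""
    total = 0
    for part in length_string.split("."):
        # every further field is a smaller unit, so scale the running total by 60
        total = total * 60 + int(part)
    return total


print(length_to_seconds("04.32"))    # 272
print(length_to_seconds("00.00"))    # 0
print(length_to_seconds("1.02.03"))  # 3723
```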
                        


In [7]:
def _get_race(player):
    """
    Takes in a string of the form 'Player x - Name (Race)' and returns the
    race as a string
    """

    # assert that player is a string
    assert isinstance(player, str), 'player should be a string'

    # names of the races
    race_list = ['Protoss', 'Terran', 'Zerg']

    for race in race_list:

        if race in player:

            # the race is expected to appear as a ' (Race)' suffix
            race_string = ' ('+race+')'
            # assert that race_string is in player
            assert race_string in player, \
                f'{player} does not adhere to {race_string} formatting'
            # use replace to delete the player race
            player = player.replace(race_string, '')

            # regex matching the 'Player 1 - ' prefix that precedes the name
            reg_str = r'Player\s\d\s\-\s'

            # assert that reg_str is in player string
            assert re.search(reg_str, player), \
                f'{player} does not adhere to {reg_str} formatting'

            return race
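
The substring check above can trip over a player whose name itself contains a race name (for example a name starting with 'Protoss' on a Zerg player). As a hedged alternative, a single anchored regular expression can parse the whole 'Player x - Name (Race)' string in one step. The helper `parse_player` and the example strings are made up for illustration.

```python
import re

# slot number, free-form name, and a race suffix in parentheses at the very end
PLAYER_PATTERN = re.compile(
    r"^Player\s(?P<slot>\d+)\s-\s(?P<name>.*)\s\((?P<race>Protoss|Terran|Zerg)\)$"
)


def parse_player(player_string):
    """Return (slot, name, race) parsed from 'Player x - Name (Race)'."""
    match = PLAYER_PATTERN.match(player_string)
    if match is None:
        raise ValueError(f"{player_string!r} is not of the form 'Player x - Name (Race)'")
    return int(match["slot"]), match["name"], match["race"]


print(parse_player("Player 1 - Alice (Zerg)"))        # (1, 'Alice', 'Zerg')
print(parse_player("Player 2 - ZergRush (Terran)"))   # (2, 'ZergRush', 'Terran')
```

Because the race is only accepted inside the trailing parentheses, a race name inside the player name does not affect the result.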



In [8]:
# check which of Protoss, Terran, Zerg is contained in player string
def _get_players_race(players):
     
    # empty dict to store players metadata
    players_info = {}

    for i, player in enumerate(players):
        
        # convert player to string
        player_string = str(player)

        # create player id
        player_id = 'player' + str(i+1) 

        # get player race and store in dict with player id as key
        players_info[player_id+'_race'] = _get_race(player_string)

    return players_info



In [9]:
# if replay.game_type == '1v1':
# extract filehash
file_hash = replay.filehash
# extract map name
file_map = replay.map_name
# get players race
players = _get_players_race(replay.players)

# get players mmr
# players = _get_players_mmr(
#     players,
#     replay.raw_data['replay.initData.backup']['user_initial_data']
#     )



In [10]:
for person in replay.raw_data['replay.initData.backup']['user_initial_data']:
    print(person['scaled_rating'])
    print(person['highest_league'])
        

5193
6
5189
6
None
0
None
0
None
0
None
0
None
0
None
0
None
0
None
0
None
0
None
0
None
0
None
0
None
0
None
0
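
The output above shows sixteen slots, of which only the first two carry a rating. The sketch below reads the rating and league for the first two slots and reports whether both ratings are present. It assumes, as `_get_player_mmrs` does, that slots 0 and 1 correspond to the two players; the function names are hypothetical and the sample data simply mirrors the printed output.

```python
def player_ratings(user_initial_data, n_players=2):
    """Return [(scaled_rating, highest_league), ...] for the first n_players slots."""
    return [
        (slot.get("scaled_rating"), slot.get("highest_league"))
        for slot in user_initial_data[:n_players]
    ]


def has_complete_ratings(user_initial_data, n_players=2):
    """True when every one of the first n_players slots has a rating."""
    pairs = player_ratings(user_initial_data, n_players)
    return len(pairs) == n_players and all(rating is not None for rating, _ in pairs)


# sample shaped like the output above: two rated players, fourteen empty slots
sample = [
    {"scaled_rating": 5193, "highest_league": 6},
    {"scaled_rating": 5189, "highest_league": 6},
] + [{"scaled_rating": None, "highest_league": 0} for _ in range(14)]

print(player_ratings(sample))        # [(5193, 6), (5189, 6)]
print(has_complete_ratings(sample))  # True
```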


In [11]:
def process_replay(filename):

    # load replay
    replay = sc2reader.load_replay(
        filename,
        load_level=4
        )

    return ReplayInfo(replay)
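
When `process_replay` is mapped over many files, a single unreadable replay raises and aborts the whole batch. One possible guard, sketched here with hypothetical helper names, wraps the call and returns the error as data instead of raising.

```python
def safe_call(func, filename):
    """Run func(filename) and return (filename, result, error_message)."""
    try:
        return filename, func(filename), None
    except Exception as exc:  # deliberately broad: any failure is recorded, not raised
        return filename, None, f"{type(exc).__name__}: {exc}"


def split_outcomes(outcomes):
    """Separate successful results from (filename, error_message) failures."""
    results = [result for _, result, error in outcomes if error is None]
    failures = [(name, error) for name, _, error in outcomes if error is not None]
    return results, failures
```

With a process pool, `functools.partial(safe_call, process_replay)` could be passed to `map` in place of `process_replay`, since both pieces are defined at module level and can be pickled.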

In [12]:
replays_dir = '/home/jared/sc2_modeling/data/replays/Replays'
replays_list = []
for dirpath, dirnames, filenames in os.walk(replays_dir):
    for filename in filenames:
        if filename.endswith('.SC2Replay'):
            filepath = os.path.join(dirpath, filename)
            replays_list.append(filepath)


In [13]:
# set random seed
random.seed(42)

# take a random sample of replays_list
replays_list_sample = random.sample(replays_list, 1000)

In [14]:
len(replays_list_sample)

1000
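
Two details can make the sample above harder to reproduce: `os.walk` does not guarantee a file order, and `random.sample` raises if fewer than 1000 paths exist. The sketch below, with a hypothetical helper name, sorts the paths first, uses a local generator instead of the global seed, and caps the sample size.

```python
import random


def sample_paths(paths, k, seed=42):
    """Draw up to k paths reproducibly, independent of the input order."""
    rng = random.Random(seed)      # local generator, leaves the global state untouched
    ordered = sorted(paths)        # fixed order so the same seed gives the same sample
    return rng.sample(ordered, min(k, len(ordered)))
```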

In [15]:
with Pool() as p:
    replay_collection = p.map(process_replay, replays_list_sample)


In [16]:
len(replay_collection)

1000

In [17]:
# convert replay collection to dataframe
replay_df = pd.DataFrame({
    'filename':[x.filename for x in replay_collection],
    'map':[x.map_name for x in replay_collection],
    'player1_race':[x.player_races[0] for x in replay_collection],
    'player2_race':[x.player_races[1] for x in replay_collection],
    'player1_mmr':[x.player_mmrs[0] for x in replay_collection],
    'player2_mmr':[x.player_mmrs[1] for x in replay_collection],
    'game_length':[x.game_length for x in replay_collection],
    'game_type':[x.game_type for x in replay_collection],
    'game_speed':[x.game_speed for x in replay_collection],
    'game_winner':[x.game_winner for x in replay_collection],
    'timestamp':[x.timestamp for x in replay_collection],
    'fps':[x.fps for x in replay_collection],
    'is_ladder':[x.is_ladder for x in replay_collection],
    'region':[x.region for x in replay_collection]
    })
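
The construction above walks `replay_collection` once per column. An alternative sketch is to flatten each object into a dictionary once and build the table from the list of rows. `replay_to_row` is a hypothetical helper; the stand-in object uses the values of row 1 of the table where they are shown, and placeholder values (marked below) for the fields that are not.

```python
from types import SimpleNamespace


def replay_to_row(info):
    """Flatten one ReplayInfo-like object into a dict of column -> value."""
    race_1, race_2 = info.player_races
    mmr_1, mmr_2 = info.player_mmrs
    return {
        "filename": info.filename,
        "map": info.map_name,
        "player1_race": race_1,
        "player2_race": race_2,
        "player1_mmr": mmr_1,
        "player2_mmr": mmr_2,
        "game_length": info.game_length,
        "game_type": info.game_type,
        "game_speed": info.game_speed,
        "game_winner": info.game_winner,
        "timestamp": info.timestamp,
        "fps": info.fps,
        "is_ladder": info.is_ladder,
        "region": info.region,
    }


# stand-in with the same attribute names as ReplayInfo
example = SimpleNamespace(
    filename="example.SC2Replay",   # placeholder
    map_name="Ascension to Aiur LE",
    player_races=("Terran", "Terran"),
    player_mmrs=(3573, 3509),
    game_length=351,
    game_type="1v1",
    game_speed="Faster",
    game_winner=1,                  # placeholder
    timestamp=0,                    # placeholder
    fps=16.0,                       # placeholder
    is_ladder=True,                 # placeholder
    region="us",                    # placeholder
)
row = replay_to_row(example)
```

A list of such rows can be passed directly to `pd.DataFrame`, for example `pd.DataFrame([replay_to_row(x) for x in replay_collection])`, which yields the same columns in the same order.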

In [18]:
# preview the first rows of the dataframe
replay_df.head()

| | filename | map | player1_race | player2_race | player1_mmr | player2_mmr | game_length | game_type | game_speed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | /home/jared/sc2_modeling/data/replays/Replays/... | 卡塔莉娜 - 天梯版 （虚空） | Zerg | Protoss | 3722 | 3409 | 272 | 1v1 | Faster |
| 1 | /home/jared/sc2_modeling/data/replays/Replays/... | Ascension to Aiur LE | Terran | Terran | 3573 | 3509 | 351 | 1v1 | Faster |
| 2 | /home/jared/sc2_modeling/data/replays/Replays/... | Abyssal Reef LE | Terran | Protoss | 2821 | 2618 | 401 | 1v1 | Faster |
| 3 | /home/jared/sc2_modeling/data/replays/Replays/... | Mech Depot LE | Zerg | Terran | 4082 | 3927 | 560 | 1v1 | Faster |
| 4 | /home/jared/sc2_modeling/data/replays/Replays/... | 深海暗礁 - 天梯版 | Terran | Terran | 3876 | 4006 | 0 | 1v1 | Faster |


In [19]:
replay_df.loc[4, 'filename']

'/home/jared/sc2_modeling/data/replays/Replays/14500c67c7c85f019874c47bc0101d803dab2b908ec5bc777d22f74f59405d2f.SC2Replay'
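
Row 4 has a `game_length` of 0, which matches the `Length(0)` reported for the same file earlier. A simple way to set such rows aside during cleaning is sketched below on plain dictionaries; the helper name and the threshold are assumptions, and the sample lengths are the five values from the table.

```python
def split_by_length(rows, min_seconds=1):
    """Partition rows into (kept, flagged) by their game_length in seconds."""
    kept, flagged = [], []
    for row in rows:
        # rows shorter than the threshold are flagged for inspection, not dropped
        (kept if row["game_length"] >= min_seconds else flagged).append(row)
    return kept, flagged


rows = [{"index": i, "game_length": n} for i, n in enumerate([272, 351, 401, 560, 0])]
kept, flagged = split_by_length(rows)
print([row["index"] for row in flagged])  # [4]
```

On the dataframe itself, the equivalent filter is a boolean mask such as `replay_df[replay_df['game_length'] > 0]`.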