# Exploratory Data Analysis

This notebook contains the code for exploratory data analysis. It focuses on extracting data that describes each replay as a whole. The steps are:
1. Create a list of all replays with a relative link to their files.
    * Replays will be identified by their filehash; duplicate filehashes will not be added to the list.
    * First create the table as a pandas DataFrame, which can then be written to CSV.
    * The list will be saved in the root of the `data/` folder, and this will be used as the reference point for relative file paths.
    * Only 1v1 replays will be considered. This will be confirmed from the `game_type` attribute.
2. Extract relevant data from replay files. The following data will be extracted:
    * Map
    * Player names
    * Player races
    * Player levels (if available)
    * Player highest league (if available)
    * Date
    * Length
    * Winner
3. Add all relevant data to the CSV.
4. Data exploration
    * Initial data visualization.
    * Cleaning any missing data.
5. Initial modeling of metadata to observe any interesting trends.
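
Step 1 can be sketched as follows. This is a minimal illustration rather than the notebook's implementation: it assumes the filehash is a SHA-256 digest of the file contents, and the helper names `file_sha256` and `build_replay_index` are hypothetical. The `game_type` check is not shown because it needs a parsed replay.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_replay_index(data_root, extension=".SC2Replay"):
    """Walk data_root and return one row per unique file hash.

    Paths are stored relative to data_root so the index stays valid
    if the project directory is moved.
    """
    seen = set()
    rows = []
    for dirpath, dirnames, filenames in os.walk(data_root):
        dirnames.sort()  # make the walk order deterministic
        for filename in sorted(filenames):
            if not filename.endswith(extension):
                continue
            full_path = os.path.join(dirpath, filename)
            filehash = file_sha256(full_path)
            if filehash in seen:
                continue  # duplicate contents: keep the first copy only
            seen.add(filehash)
            rows.append({
                "filehash": filehash,
                "relative_path": os.path.relpath(full_path, data_root),
            })
    return rows
```

The returned list of dicts can be passed to `pd.DataFrame(rows)` and written out with `to_csv`.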

In [1]:
# initial imports and settings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sc2reader
from multiprocessing import Pool, cpu_count # for parallel processing of replay files
import re
import random

## Function for extracting data from a single replay file
A function will be created to extract the relevant data from a single replay file. This function can then be applied to every replay to extract data from all of them.

The steps will first be prototyped in the notebook before the function itself is written.

In [2]:
class ReplayInfo:

    def __init__(self, replay):
        # self.__Replay = replay
        self.map_name = replay.map_name
        self.player_races = self._get_player_races(replay)
        self.filename = replay.filename
        self.player_mmrs = self._get_player_mmrs(replay)
        self.game_length = self._get_game_length(replay)
        self.game_winner = self._get_winner(replay)
        self.timestamp = replay.unix_timestamp
        self.game_type = replay.type
        self.game_speed = replay.speed
        self.fps = replay.game_fps
        self.is_ladder = replay.is_ladder
        self.region = replay.region

        
    def _get_game_length(self, replay):

        # str(replay.game_length) is dot-separated, e.g. 'MM.SS'; it is
        # assumed to gain a leading hours field for games of an hour or more
        length_string = str(replay.game_length)

        # fold the fields left to right so that both 'MM.SS' and
        # 'HH.MM.SS' convert correctly
        total_seconds = 0
        for field in length_string.split('.'):
            total_seconds = total_seconds * 60 + int(field)

        # game length as an int in seconds
        return total_seconds

        
    def _get_winner(self, replay):
        
        winner_string = str(replay.winner)
        if 'Player 1' in winner_string:
            return 1
        elif 'Player 2' in winner_string:
            return 2
        else:
            return 0

    def _get_player_mmrs(self, replay):
        
        str_a = 'replay.initData.backup'
        str_b = 'user_initial_data'
        str_c = 'scaled_rating'

        return (
            replay.raw_data[str_a][str_b][0][str_c],
            replay.raw_data[str_a][str_b][1][str_c]
            )


    # get the player races
    def _get_player_races(self, replay):
        """
        _get_player_races
        Iterate through the players in replay.players and extract the
        player races as strings.

        Returns:
            tuple - Length 2, containing the races of both players
        """

        # str(player) is expected to look like 'Player 1 - Name (Race)'
        player_strings = [str(player) for player in replay.players]

        return (
            self._get_race(player_strings[0]),
            self._get_race(player_strings[1])
            )


    def _get_race(self, player):
        """
        _get_race Extract race from string. String is assumed to be of the form:
        'Player x - Name (Race)'.

        Args:
            player (str): A string of form 'Player x - Name (Race)'

        Returns:
            str: Race of the player, or None if no known race is found
        """

        # assert that player is a string
        assert isinstance(player, str), 'player should be a string'

        # names of the races
        race_list = ['Protoss', 'Terran', 'Zerg']

        for race in race_list:
            # match the parenthesised suffix so that a race name inside a
            # player name (e.g. 'ZergRush') is not mistaken for the race
            race_string = ' (' + race + ')'
            if race_string in player:
                # strip the race, leaving 'Player 1 - Name'
                name_part = player.replace(race_string, '')

                # check that the 'Player 1 - ' prefix is present
                reg_str = r'Player\s\d\s\-\s'
                assert re.search(reg_str, name_part), \
                    f'{player} does not adhere to {reg_str} formatting'

                return race

        # no known race name found in the string
        return None
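
As a cross-check on the string handling in `_get_race`, the same parsing can be sketched with a single regular expression. This is an illustration rather than the notebook's method: the names `PLAYER_PATTERN` and `parse_player_string` are hypothetical, and the `'Player x - Name (Race)'` layout is the format the class above assumes, not something verified against every replay.

```python
import re

# 'Player <number> - <name> (<race>)', the layout ReplayInfo assumes
PLAYER_PATTERN = re.compile(
    r'^Player\s(?P<pid>\d+)\s-\s(?P<name>.+)\s\((?P<race>Protoss|Terran|Zerg)\)$'
)


def parse_player_string(player):
    """Return (name, race) for a well-formed player string, else None."""
    match = PLAYER_PATTERN.match(player)
    if match is None:
        return None
    return match.group('name'), match.group('race')
```

Strings that do not match, such as one with an unrecognised race label, return `None` instead of raising, which makes the unparsed rows easy to count afterwards.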
                        


In [3]:
def process_replay(filename):

    # load replay
    try:
        replay = sc2reader.load_replay(
            filename,
            load_level=2 # level 2 is all that is required for metadata
            )
    except Exception: # skip replays that sc2reader fails to load
        return None

    return ReplayInfo(replay)
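
Note that in `process_replay` only the load call sits inside the `try`, so an error raised while constructing `ReplayInfo` (for example a missing key in `raw_data`) would propagate out of the worker. The sketch below shows one way to guard both steps and keep the failure reason. It is written under assumptions: `safe_process`, `loader`, and `builder` are hypothetical names, with `loader` standing in for a call to `sc2reader.load_replay` and `builder` for `ReplayInfo`.

```python
def safe_process(filename, loader, builder):
    """Load and summarise one file, returning a (result, error) pair.

    On success the error is None; on failure the result is None and
    the error is a short description of the exception.
    """
    try:
        replay = loader(filename)
        return builder(replay), None
    except Exception as exc:  # deliberately broad, but not a bare except
        # keep the exception type and message so failures can be counted
        return None, f'{type(exc).__name__}: {exc}'
```

Keeping the error string alongside the result makes it possible to tell unreadable files apart from replays that load but lack an expected field.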

In [4]:
replays_dir = '/home/jared/sc2_modeling/data/replays/Replays'
replays_list = []
for dirpath, dirnames, filenames in os.walk(replays_dir):
    for filename in filenames:
        if filename.endswith('.SC2Replay'):
            filepath = os.path.join(dirpath, filename)
            replays_list.append(filepath)


In [5]:
with Pool() as p:
    replay_collection = p.map(process_replay, replays_list)


In [6]:
print(len(replay_collection))

# remove all None from replay_collection
replay_collection = [x for x in replay_collection if x is not None]

64396


In [7]:
# convert replay collection to dataframe
replay_df = pd.DataFrame({
    'filename':[x.filename for x in replay_collection],
    'map':[x.map_name for x in replay_collection],
    'player1_race':[x.player_races[0] for x in replay_collection],
    'player2_race':[x.player_races[1] for x in replay_collection],
    'player1_mmr':[x.player_mmrs[0] for x in replay_collection],
    'player2_mmr':[x.player_mmrs[1] for x in replay_collection],
    'game_length':[x.game_length for x in replay_collection],
    'game_type':[x.game_type for x in replay_collection],
    'game_speed':[x.game_speed for x in replay_collection],
    'game_winner':[x.game_winner for x in replay_collection],
    'timestamp':[x.timestamp for x in replay_collection],
    'fps':[x.fps for x in replay_collection],
    'is_ladder':[x.is_ladder for x in replay_collection],
    'region':[x.region for x in replay_collection]
    })
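
The column-by-column construction above loops over `replay_collection` once per column. An alternative sketch is to flatten each object into one dict and build the frame from the list of rows. `replay_to_row` is a hypothetical helper, and `SimpleNamespace` stands in for a `ReplayInfo` instance with made-up values so the example runs without replay files; only some of the attributes are shown.

```python
from types import SimpleNamespace


def replay_to_row(info):
    """Flatten one ReplayInfo-like object into a dict of scalar columns."""
    return {
        'filename': info.filename,
        'map': info.map_name,
        'player1_race': info.player_races[0],
        'player2_race': info.player_races[1],
        'player1_mmr': info.player_mmrs[0],
        'player2_mmr': info.player_mmrs[1],
        'game_length': info.game_length,
        'game_winner': info.game_winner,
    }


# stand-in object with made-up values, for illustration only
example = SimpleNamespace(
    filename='example.SC2Replay',
    map_name='Example Map',
    player_races=('Terran', 'Zerg'),
    player_mmrs=(4000, 4100),
    game_length=600,
    game_winner=1,
)

rows = [replay_to_row(example)]
```

Passing `rows` to `pd.DataFrame` yields one row per replay, with the dict keys as column names.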

In [8]:
replay_df.head()

| | filename | map | player1_race | player2_race | player1_mmr | player2_mmr | game_length | game_type | game_speed | game_winner | timestamp | fps | is_ladder | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | /home/jared/sc2_modeling/data/replays/Replays/... | Ascension to Aiur LE | Terran | Zerg | 5193 | 5189 | 1056 | 1v1 | Faster | 2 | 1502229019 | 16.0 | True | eu |
| 1 | /home/jared/sc2_modeling/data/replays/Replays/... | 奥德赛 - 天梯版 | Terran | Terran | 2756 | -36400 | 1108 | 1v1 | Faster | 1 | 1502284117 | 16.0 | True | cn |
| 2 | /home/jared/sc2_modeling/data/replays/Replays/... | Eindringling LE | Zerg | Protoss | 4078 | 4216 | 840 | 1v1 | Faster | 2 | 1502218549 | 16.0 | True | eu |
| 3 | /home/jared/sc2_modeling/data/replays/Replays/... | 深海暗礁 - 天梯版 | Terran | Terran | 3140 | 3133 | 3 | 1v1 | Faster | 2 | 1502199591 | 16.0 | True | cn |
| 4 | /home/jared/sc2_modeling/data/replays/Replays/... | Ascension to Aiur LE | Protoss | Terran | 3506 | 3492 | 482 | 1v1 | Faster | 1 | 1502212147 | 16.0 | True | eu |


In [12]:
replay_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64395 entries, 0 to 64394
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   filename      64395 non-null  object 
 1   map           64395 non-null  object 
 2   player1_race  61009 non-null  object 
 3   player2_race  61009 non-null  object 
 4   player1_mmr   64395 non-null  int64  
 5   player2_mmr   64395 non-null  int64  
 6   game_length   64395 non-null  int64  
 7   game_type     64395 non-null  object 
 8   game_speed    64395 non-null  object 
 9   game_winner   64395 non-null  int64  
 10  timestamp     64395 non-null  int64  
 11  fps           64395 non-null  float64
 12  is_ladder     64395 non-null  bool   
 13  region        64395 non-null  object 
dtypes: bool(1), float64(1), int64(5), object(7)
memory usage: 6.4+ MB


In [13]:
# write replay_df to csv with no index
replay_df.to_csv('data/replays.csv', index=False)