# Data Collection

[HLTV](https://www.hltv.org) is a news site and forum covering CS:GO esports. It is also the main archive of statistics for the professional Counter-Strike scene. Since 2015, it has also maintained performance-based rankings for both individual players and teams, which the CS:GO community widely accepts as the most accurate available. For simplicity, we'll only look at the top 3 tiers of CS (there is no official hierarchy, but these tiers are widely recognized).
* T1 CS - Favorites to win any event; typically ranked top 5, with exceptional strategy, thorough preparation for opponents, high individual skill, and a deep map pool
* T2 CS - Winning a T1 event would not be a massive upset; these teams consistently go deep into tournaments and are typically ranked 6 to ~15
* T3 CS - Teams that compete internationally and have highly skilled individual players but are never expected to excel; they typically lack strategy, preparation, etc. Upsets are a big deal.

In [1]:
filepath = 'C:/Users/Tim/Desktop/lighthouse/w11,12 - final project/'
data_filepath = filepath+'data/'

import pandas as pd
import numpy as np
import re
import requests
import datetime
from datetime import date, timedelta
from bs4 import BeautifulSoup
import pickle
import copy

In [2]:
# -- first scrape for team rankings
# -- HLTV rankings are released weekly, each Monday, as a new webpage

# get days for each ranking page
# first ranking page = https://www.hltv.org/ranking/teams/2015/october/1

# method of getting data -- http://krishnan.io/cmsc320.html

def get_days():
    # the first ranking page is dated 2015-10-01
    first = pd.Timestamp(year=2015, month=10, day=1)
    # a single range avoids listing 2018-01-01 twice: it is a Monday, so a
    # per-year loop with inclusive ends returns it for both 2017 and 2018
    mondays = pd.date_range(start=first, end='2021-01-01', freq='W-MON')
    days = [monday.to_pydatetime() for monday in mondays]
    return days
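
As a quick offline check of the date logic, the sketch below lists Mondays using only the standard library and builds a ranking URL for each one, following the URL pattern noted in the comments above. The helper names `mondays_between` and `ranking_url` are illustrative and not part of the notebook, and no request is sent.

```python
from datetime import date, timedelta

MONTH_NAMES = ('january', 'february', 'march', 'april', 'may', 'june', 'july',
               'august', 'september', 'october', 'november', 'december')

def mondays_between(start, end):
    """Return every Monday from start to end, inclusive (hypothetical helper)."""
    # weekday() is 0 for Monday, so this steps forward to the first Monday
    current = start + timedelta(days=(7 - start.weekday()) % 7)
    days = []
    while current <= end:
        days.append(current)
        current += timedelta(weeks=1)
    return days

def ranking_url(day):
    """Build a URL in the /year/month-name/day form (hypothetical helper)."""
    return 'https://www.hltv.org/ranking/teams/{}/{}/{}'.format(
        day.year, MONTH_NAMES[day.month - 1], day.day)

mondays = mondays_between(date(2015, 10, 1), date(2015, 10, 31))
for monday in mondays:
    print(ranking_url(monday))
```

Whether a page actually exists for a given Monday is a separate question that only a live request can answer.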

In [3]:
# all ranking pages follow this format: https://www.hltv.org/ranking/teams/2015/october/1
# https://www.hltv.org/ranking/teams/xxxx(year)/month/x(day)

months = {
    1:'january',
    2:'february',
    3:'march',
    4:'april',
    5:'may',
    6:'june',
    7:'july',
    8:'august',
    9:'september',
    10:'october',
    11:'november',
    12:'december'
}

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

def rank_data():
    days = get_days()
    data = []
    for i in range(len(days)):
        day = days[i]
        next_day = days[i + 1] - datetime.timedelta(days=1) if i < len(days) - 1 else date.today()
        url = 'https://www.hltv.org/ranking/teams/' + \
            str(day.year) + '/' + \
            months[day.month] + '/' + str(day.day)
        
        res = requests.get(url,headers=headers)
        
#         if res.status_code != 200:
#             continue
#         else:
        soup = BeautifulSoup(res.content, 'html.parser')
        team_ranks = soup.findAll("div",class_="ranked-team standard-box")
        pattern = re.compile(r'#(\d+)')  # raw string; position text looks like "#1"
        teams = []
        for team in team_ranks:
            rank = pattern.match(team.find("span",class_="position").text).group(1)
            name = team.find("span",class_="name").text
            players = team.findAll("div",class_="nick")
            playernames = [player.text for player in players]
            date_range = pd.date_range(start=day,end=next_day)
            for d in date_range:
                teams.append([d, name, rank, playernames]) # have to do every day or else it only pulls 2019-12-30
        
        data+=teams
        
    # column order must match the order of the values appended above
    df = pd.DataFrame(data=data, columns=['date','team','rank','player_names'])
    return df

In [10]:
# rank_df_pickle = 'rank_df.pickle'
rank_df_csv = 'rank_df.csv'
rank_df = rank_data()

print(rank_df.shape)
rank_df.head()

# with open(data_filepath+rank_df_pickle, 'wb') as f:
#     pickle.dump(rank_df, f)
    
# to reopen pickle file   
# with open(data_filepath+rank_df_pickle, 'rb') as f:
#     rankings = pickle.load(f) 

(55650, 4)


Unnamed: 0,date,team,rank,player_names
0,2015-10-05,fnatic,1,"[pronax, olofmeister, flusha, JW, KRIMZ]"
1,2015-10-06,fnatic,1,"[pronax, olofmeister, flusha, JW, KRIMZ]"
2,2015-10-07,fnatic,1,"[pronax, olofmeister, flusha, JW, KRIMZ]"
3,2015-10-08,fnatic,1,"[pronax, olofmeister, flusha, JW, KRIMZ]"
4,2015-10-09,fnatic,1,"[pronax, olofmeister, flusha, JW, KRIMZ]"


In [11]:
rank_df.to_csv(data_filepath+rank_df_csv)
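
One thing to watch when reloading this CSV: the `player_names` column holds Python lists, and a CSV round trip turns each list into its string representation. The sketch below shows the effect with the standard-library `csv` module on a made-up row, on the assumption that pandas behaves the same way when it writes the column; with pandas, passing `converters={'player_names': ast.literal_eval}` to `pd.read_csv` is one way to restore the lists.

```python
import ast
import csv
import io

# hypothetical row in the same shape as rank_df
row = ['2015-10-05', 'fnatic', 1, ['pronax', 'olofmeister', 'flusha', 'JW', 'KRIMZ']]

buffer = io.StringIO()
csv.writer(buffer).writerow(row)   # non-string values are written via str()
buffer.seek(0)
restored = next(csv.reader(buffer))

print(type(restored[3]))           # the roster comes back as a plain string
players = ast.literal_eval(restored[3])  # parse the string back into a list
print(players)
```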

Here we collect data that attempts to capture a player's impact in a game, taken from each match they play. The obvious starting point is kills versus deaths, since more kills and fewer deaths directly affect the state of the game. Rating is HLTV's statistic for a player's overall performance in the game.

However, these statistics don't show the whole picture. As is common in team games, roles are prevalent, especially within structured top teams. Stats such as average damage per round or kills per round are inherently biased towards star players: because they are so good, star players are generally put in situations where they are more likely than their teammates to find kills. They are expected to post higher numbers than their teammates because of the sacrifices those teammates make to set them up to dominate, such as where to play, baiting the enemy in, or even giving up their own buy to get the star player a more expensive weapon. After all, the star player's role IS to carry the game. All of this inflates the star player's HLTV rating and lowers the others'.

But that doesn't mean the rest of the team contributes any less to the win. The support player will often take the most vulnerable spot on defense, and the entry fragger will often go in first to create room when attacking. Roles like these are designed specifically to create favorable situations for the star player, and more often than not the players who fill them will post numbers that most will read as a lack of performance.

Thankfully, HLTV also provides the 'impact' stat. It is measured from a player's multikills, opening kills, and clutches: plays that can be (but are not necessarily) unfavorable to attempt, yet heavily increase the team's chance of winning. It doesn't capture things such as the ability to buy time for your team, but having a way to measure a player's impact on the game helps a lot in offsetting this bias, and it says a little about how cerebral a player is. It takes brains, not just brawn, to pull a round win out of an unfavorable situation.

'KAST' (kill, assist, survived, or traded in a round) is another one of these stats. Kills and assists have an immediate impact on the round and show that the player contributed damage; surviving has an economic impact, since the player does not have to buy again in the subsequent round; and being traded represents teamplay, where the player dies in a way that lets a teammate get a kill in return (a trade). (This is a good measure of a player's playstyle and impact, but it is not included here because it will be taken from the data on teams.)
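
To make the definition concrete, here is a minimal sketch of a KAST-style percentage computed from per-round flags. The round records and field names are invented for illustration, and HLTV's exact computation may differ; the only assumption is that a round counts if at least one of the four conditions holds.

```python
# hypothetical per-round records for one player; field names are illustrative
rounds = [
    {'kill': True,  'assist': False, 'survived': True,  'traded': False},
    {'kill': False, 'assist': False, 'survived': False, 'traded': True},
    {'kill': False, 'assist': False, 'survived': False, 'traded': False},
    {'kill': False, 'assist': True,  'survived': False, 'traded': False},
]

KAST_FIELDS = ('kill', 'assist', 'survived', 'traded')

def kast_percent(rounds):
    """Percentage of rounds with at least one of kill, assist, survival or trade."""
    if not rounds:
        return 0.0
    counted = sum(1 for r in rounds if any(r[field] for field in KAST_FIELDS))
    return 100.0 * counted / len(rounds)

kast = kast_percent(rounds)
print(kast)
```

In this made-up sample, three of the four rounds satisfy at least one condition, so only the third round fails to count.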

As a final note, the role that is statistically hardest to pin down, yet perhaps the most important on the team, is the in-game leader. This is the person who prepares the team's strategy, prepares anti-strategies for other teams, plans good utility usage, assigns positions to the other role players, and is generally the brains of the team. Yet more often than not, this player's own gameplay suffers, as they focus more on the map and on directing the movement of their players. This role is important to mention because a good in-game leader may not have high impact numbers at all, yet will have a huge effect on pushing their team to the finish line. Unfortunately, there is no data on this: nothing on time spent on VOD reviews, and no tangible number for how much a leader micro-manages.
* This perhaps applies just as much to the team's coach as to the in-game leader.

In [4]:
# -- second scrape for match history

# try to capture individual skill: take each player per team -- take their stats per match

players = 'https://www.hltv.org/stats/players?startDate=all&matchType=Lan&rankingFilter=Top30'

# returns a list of links for each player
def get_links():
    res = requests.get(players,headers=headers)
    soup = BeautifulSoup(res.content,'html.parser')
    cells = soup.find('table',class_='stats-table player-ratings-table').find('tbody').findAll('tr')
    # table_body = players_table.find('tbody')
    # player_cells = table.body.findAll('tr')

    links = {}
    for cell in cells:
        link_tag = cell.find('td',class_='playerCol').find('a')
        link = link_tag['href']
        name = link_tag.text
        links[name] = 'https://www.hltv.org' + link
    return links

def get_players(links):
    data = []
    # how score is formatted on this page (score = how many rounds team won vs lost)
    score_re = re.compile(r"\((\d+)\)")
    # how kills and deaths are formatted on this page ("kills - deaths")
    kd_re = re.compile(r"(\d+) - (\d+)")
    
    for player, link in links.items():
        res = requests.get(link,headers=headers)
        
#         if res.status_code != 200:
#             continue
#         else:
        soup = BeautifulSoup(res.content,'html.parser')
        impact = soup.findAll('div',class_='summaryStatBreakdownRow')[1].find('div',class_='summaryStatBreakdownDataValue').text.strip()
        
        match_link = link.replace('/players','/players/matches')
        res2 = requests.get(match_link, headers=headers)
        
#         if res2.status_code != 200:
#             continue
#         else:
        soup2 = BeautifulSoup(res2.content,'html.parser')
        rows = soup2.find('table').find('tbody').findAll('tr')
            
        for row in rows:
            cells = row.findAll('td')
            date = cells[0].find('div',class_='time').text.strip()
            team = cells[1].findAll('span')[0].text.strip()
            rounds_text = cells[1].findAll('span')[1].text.strip()
            team_rounds = score_re.match(rounds_text).group(1)
            opposing_team = cells[2].findAll('span')[0].text.strip()
            opposing_team_rounds = score_re.match(cells[2].findAll('span')[1].text.strip()).group(1)
            map_played = cells[3].text.strip()
            kills = kd_re.match(cells[4].text.strip()).group(1)
            deaths = kd_re.match(cells[4].text.strip()).group(2)
            differential = cells[5].text.strip()
            rating = cells[6].text.strip()
            data.append([player, date, team, team_rounds, opposing_team,
                         opposing_team_rounds, map_played, kills, deaths, differential, rating, impact])
            
        # data.append([impact])
    columns = ["player", "date", "team", "team_rounds", "opposing_team",
               "opposing_team_rounds", "map", "kills", "deaths", "differential", "rating", "avg_impact"]
    df = pd.DataFrame(data=data, columns=columns)
    
    return df
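
The two regular expressions in `get_players` can be checked offline. The sketch below runs them on made-up cell texts that follow the formats described in the comments; the sample strings and the conversion to `int` are assumptions for illustration, since the function itself stores the values as strings.

```python
import re

score_re = re.compile(r"\((\d+)\)")    # rounds won, written in parentheses
kd_re = re.compile(r"(\d+) - (\d+)")   # kills and deaths separated by " - "

# sample cell texts, invented to match the formats above
rounds_text = "(16)"
kd_text = "20 - 15"

team_rounds = int(score_re.match(rounds_text).group(1))
kd_match = kd_re.match(kd_text)
kills, deaths = int(kd_match.group(1)), int(kd_match.group(2))

# assumes the differential column is simply kills minus deaths
differential = kills - deaths
print(team_rounds, kills, deaths, differential)
```

If a cell ever deviates from these formats, `match` returns `None` and the `.group` call raises an `AttributeError`, so a guard may be worth adding before a long scrape.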

In [None]:
# match_df_pickle = 'match_df.pickle'
match_df_csv = 'match_df.csv'
player_links = get_links()
match_df = get_players(player_links)

print(match_df.shape)
match_df.head()

# with open(data_filepath+match_df_pickle, 'wb') as f:
#     pickle.dump(match_df, f)
    
# to reopen pickle file   
# with open(data_filepath+match_df_pickle, 'rb') as f:
#     matches = pickle.load(f) 

In [None]:
match_df.to_csv(data_filepath+match_df_csv)

As with any team game, individual prowess doesn't always net a win. Good strategic teamwork is extremely important at the highest levels of CS, as is preparation for the enemy's strategy. For example, good utility usage (good flashes, good grenade usage), which often costs a hefty share of a round's money, goes a long way in changing how a round plays out. The problem is that "teamwork" is very intangible; it is hard to express in concrete values that can be examined. Luckily, the people at HLTV have tried to do so anyway with stats such as % times traded, 4v5 win %, flash assists %, etc.

* % rounds won - a big-picture view of how good a team is
* opening duels - captures how well a team sets up to gain an early advantage
* team5v4 - captures a team's ability to win with a player advantage
* team4v5 - captures a team's ability to win at a player disadvantage
* team traded - captures a team's grasp of basics and teamplay; trading is where one player goes in and trades his life for another's, weakening the defense/attack to buy time for a setup
* utility adr - damage per round with nades; at the highest levels of CS, nade damage comes down to setups and reads, so this captures team preparation
* utility flash - flash assists, i.e. kills where the enemy is flashed by a teammate; captures teamplay and reads

We will try to model strategy talent within teams using these statistics.

In [7]:
def get_month():
    # start from October 2015, when the first ranking page was published
    first = pd.Timestamp(year=2015, month=10, day=1)
    # first day of every month since then; a single range avoids listing each
    # January 1 twice, as a per-year loop with inclusive ends would
    month_starts = pd.date_range(start=first, end='2021-01-01', freq='MS')
    rmonths = [month.to_pydatetime() for month in month_starts]
    return rmonths

def team_data():
    months = get_month()
    data = []
    for i in range(len(months)):
        month = months[i]
        nextmonth = months[i+1] if i < len(months) - 1 else months[i]
        url = 'https://www.hltv.org/stats/teams/ftu?startDate={}&endDate={}&rankingFilter=Top30'.format(months[i].date(),nextmonth.date())
        res = requests.get(url,headers=headers)
        soup = BeautifulSoup(res.content,'html.parser')
        stats = soup.find('table').find('tbody').findAll('tr')
        team_ = []
        for stat in stats:
            cells = stat.findAll('td')
            team = cells[0].find('a').text.strip()
            p_rounds_won = cells[2].text.strip()
            opening_duels = cells[3].text.strip()
            multi_kills = cells[4].text.strip()
            team_5v4 = cells[5].text.strip()
            team_4v5 = cells[6].text.strip()
            team_traded = cells[7].text.strip()
            utility_adr = cells[8].text.strip()
            utility_flash = cells[9].text.strip()
            date_range = pd.date_range(start=month,end=nextmonth)
            for m in date_range:
                team_.append([m,team,p_rounds_won,opening_duels,multi_kills,
                              team_5v4,team_4v5,team_traded,utility_adr,utility_flash])
        
        data+=team_
    
    columns = ['month','team','p_rounds_won','opening_duels','multi_kills',
               'team_5v4','team_4v5','team_traded','utility_adr','utility_flash']
    df = pd.DataFrame(data=data,columns=columns)
    return df

In [8]:
# team_df_pickle = 'team_df.pickle'
team_df_csv = 'team_df.csv'
team_df = team_data()

print(team_df.shape)
team_df.head()

# with open(data_filepath+team_df_pickle, 'wb') as f:
#     pickle.dump(team_df, f)
    
# to reopen pickle file   
# with open(data_filepath+team_df_pickle, 'rb') as f:
#     team_stats = pickle.load(f) 

(46245, 10)


Unnamed: 0,month,team,p_rounds_won,opening_duels,multi_kills,team_5v4,team_4v5,team_traded,utility_adr,utility_flash
0,2015-10-01,Envy,59.1%,53.7%,0.96,78.7%,36.5%,-,-,-
1,2015-10-02,Envy,59.1%,53.7%,0.96,78.7%,36.5%,-,-,-
2,2015-10-03,Envy,59.1%,53.7%,0.96,78.7%,36.5%,-,-,-
3,2015-10-04,Envy,59.1%,53.7%,0.96,78.7%,36.5%,-,-,-
4,2015-10-05,Envy,59.1%,53.7%,0.96,78.7%,36.5%,-,-,-


In [9]:
team_df.to_csv(data_filepath+team_df_csv)
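
The team stats are scraped as text, so percentages arrive as strings such as `59.1%` and missing values as `-`. Before modeling, they would need to be converted to numbers. The sketch below is one possible cleaning step, using values from the first output row above; the helper name `to_number` and the choice to map `-` to `None` are assumptions, not part of the notebook.

```python
def to_number(text):
    """Convert '59.1%' or '0.96' to a float; return None for the '-' placeholder."""
    text = text.strip()
    if text in ('', '-'):
        return None
    if text.endswith('%'):
        return float(text[:-1]) / 100.0   # express percentages as fractions
    return float(text)

# stat values from the first row of the team_df output
sample_row = ['59.1%', '53.7%', '0.96', '78.7%', '36.5%', '-', '-', '-']
cleaned = [to_number(value) for value in sample_row]
print(cleaned)
```

Keeping missing values as `None` (or `NaN` once in pandas) rather than zero avoids treating an unrecorded stat as a genuinely poor one.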