# ALGS Data Scraping

This notebook scrapes match data for the ALGS Year 2 season from https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021-22.

We want to explore this data for the IronViz Tableau Competition.

The data we want to collect is the following: 
- Region
- Round Number (set of games)
- Game Number (referred to as round in table)

For preseason:
- Qualifier Round (for preseason qualifiers)
- Lobby number 

For splits:
- Circuit Round (for challenger circuit)
- Rounds (set of games) for pro league
- Game number (referred to as round in table)
- Playoffs games

For championship:
- LCQs
- Winners Bracket, Losers Bracket, Finals
- Round number

For the games with group stages:
- Groups A, B, C, D

In [104]:
from collections import defaultdict
import os
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

To make the process more straightforward, we will create lists to iterate over. We might not use all of these lists, but they will at least help keep things organized.

The lists will include for all data:
- Region
- Round Number (set of games)

For preseason:
- Qualifier Round (for preseason qualifiers)

For splits:
- Circuit Round (for challenger circuit)
- Rounds (set of games) for pro league
- Playoffs games

For championship:
- Winners Bracket, Losers Bracket, Finals
- Round number

For the games with group stages:
- Groups A, B, C, D
<hr>

We will go in the following order for our data scraping:

- Preseason
- Split 1
- Split 2
- Championships

In [2]:
# Let's create our lists, starting with the generic ones shared by most of the Liquipedia links

# These are the main regions for ALGS
regions = ["North_America", "South_America", "EMEA", "APAC_North", "APAC_South"]

# There are usually 6 rounds (set of games) starting with round 1
rounds = [1, 2, 3, 4, 5, 6]

In [3]:
# First attempt at scraping data before we create our loops

URL = "https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021/Preseason_Qualifier_1/North_America/Round_1"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
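Since we plan to loop over regions and rounds, it may help to build each URL from its parts instead of typing it out. The following is a minimal sketch; `BASE_URL`, `build_round_url`, and the `example_` lists are names made up for illustration, and only the North America Round 1 URL above is confirmed to exist. The other combinations assume the same URL pattern.

```python
from itertools import product

# Assumption: every preseason page follows the same pattern as the URL above
BASE_URL = "https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021"


def build_round_url(qualifier, region, round_name):
    """Join the qualifier, region, and round into one Liquipedia URL."""
    return f"{BASE_URL}/{qualifier}/{region}/{round_name}"


# Illustrative inputs, kept separate from the notebook's own lists
example_regions = ["North_America", "South_America", "EMEA", "APAC_North", "APAC_South"]
example_rounds = ["Round_1", "Round_2"]

# One URL per (region, round) pair
example_urls = [
    build_round_url("Preseason_Qualifier_1", region, round_name)
    for region, round_name in product(example_regions, example_rounds)
]
```

The first entry of `example_urls` is the same address we requested above, so the helper can replace the hard-coded string once we start looping.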

In [4]:
# This is a whole lot of info so we want to focus on two regions: the lobby tab and the results table
# I inspected the website using dev tools and found that there is a class for the results of each lobby
# That class is "table-battleroyale-results" meaning we just need to tap into that to get all the info we need

results = soup.find_all(class_="table-battleroyale-results")

In [5]:
data = []
#table = soup.find('table', attrs={'class':['table-battleroyale-results', ]})

for table in results:
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])

In [6]:
data

[[],
 [],
 ['1.', 'CLX', '76', '76'],
 ['2.', 'OA', '55', '55'],
 ['3.', 'RCO', '44', '44'],
 ['4.', 'PWM', '39', '39'],
 ['5.', 'TEAM Workaholics', '37', '37'],
 ['6.', 'Insane Xboxsers', '33', '33'],
 ['7.', 'Chud Bungus', '15', '15'],
 ['8.', 'Ampedant', '15', '15'],
 ['9.', 'Team Shadow Wolf', '14', '14'],
 ['10.', 'JustSomeNewGuys', '12', '12'],
 ['11.', '2B1R', '12', '12'],
 ['12.', 'ControllerLegend', '12', '12'],
 ['13.', 'Carbon Esports', '7', '7'],
 ['14.', 'Tooshbags', '6', '6'],
 ['15.', 'Aces Team', '3', '3'],
 [],
 [],
 ['1.', 'BW', '102', '102'],
 ['2.', 'Spooky Scary', '66', '66'],
 ['3.', 'HololiveDXD', '26', '26'],
 ['4.', '6ide', '23', '23'],
 ['5.', 'ParkingLotBirds', '21', '21'],
 ['6.', 'Azakana', '20', '20'],
 ['7.', 'Bot Squad', '17', '17'],
 ['8.', 'Team Animo', '17', '17'],
 ['9.', 'The Not Squad', '16', '16'],
 ['10.', 'KingsVictoryClub', '15', '15'],
 ['11.', 'AVS', '12', '12'],
 ['12.', 'Remember Reach', '11', '11'],
 ['13.', 'Themeathouse', '10', '10'],
 [

This gets us the data for each lobby, but it's kind of messy, so let's try to turn it into a dict of lists. This way we keep the data for each team and still have it neatly stored in a dictionary, ready to output into CSV tables.

In [7]:
# Try using tbody instead of class to get our info
results_body_all = soup.find_all('tbody')

In [8]:
# Create defaultdict of list and iterate through each result to create our dictionary
data_all = defaultdict(list)

for i, body in enumerate(results_body_all):
    rows = body.find_all('tr')
    cols = [ele.text.strip() for ele in rows]
    data_all[i] = [ele for ele in cols if ele]

In [9]:
data_all

defaultdict(list,
            {0: ['vdeApex Legends Global Series 21-22',
              'Preseason\n20-21\n21-22\n22-23',
              'Preseason QualifiersNorth America\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4\nSouth America\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4\nEMEA\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4\nAPAC North\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4\nAPAC South\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4',
              'Preseason Qualifiers',
              'North America\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4',
              'South America\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4',
              'EMEA\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4',
              'APAC North\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4',
              'APAC South\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4',
              'Split 1North America\nPlayoffs\nPro League (Matches)\nChallenger Circ

This is closer but maybe we can avoid having to clean up the first few unnecessary tables. We should stick with our class find_all and create a temporary table which stores results as we iterate through each lobby.

In [10]:
# This creates a list of results for each lobby which contains the scores.
results = soup.find_all(class_="table-battleroyale-results")

In [11]:
data_all = defaultdict(list)

# Iterate through each item as a lobby
for i, lobby in enumerate(results):
    # Create a temporary empty table to store data for that lobby
    temp_table = []
    rows = lobby.find_all('tr')
    # Tidy up the info for each lobby by splitting each team into its own list
    for row in rows:
        cols = [ele.text.strip() for ele in row]
        temp_table.append([ele for ele in cols if ele])
    # Add the list of lists to the dictionary
    data_all[i] = temp_table

In [12]:
data_all

defaultdict(list,
            {0: [['Standings'],
              ['Team', 'Total', 'Round 1'],
              ['1.', 'CLX', '76', '76'],
              ['2.', 'OA', '55', '55'],
              ['3.', 'RCO', '44', '44'],
              ['4.', 'PWM', '39', '39'],
              ['5.', 'TEAM Workaholics', '37', '37'],
              ['6.', 'Insane Xboxsers', '33', '33'],
              ['7.', 'Chud Bungus', '15', '15'],
              ['8.', 'Ampedant', '15', '15'],
              ['9.', 'Team Shadow Wolf', '14', '14'],
              ['10.', 'JustSomeNewGuys', '12', '12'],
              ['11.', '2B1R', '12', '12'],
              ['12.', 'ControllerLegend', '12', '12'],
              ['13.', 'Carbon Esports', '7', '7'],
              ['14.', 'Tooshbags', '6', '6'],
              ['15.', 'Aces Team', '3', '3']],
             1: [['Standings'],
              ['Team', 'Total', 'Round 1'],
              ['1.', 'BW', '102', '102'],
              ['2.', 'Spooky Scary', '66', '66'],
              ['3.', 'H

Now we've figured out how to get all the scores for each preseason lobby per round. We just need to repeat that for each round and each region! First, let's figure out how to get this dictionary into a neat pandas DataFrame and save that to a CSV.

In [13]:
# Each lobby is a list of rows, so the plain DataFrame constructor is enough
data_df = pd.DataFrame(data_all[0])

data_df.head()

Unnamed: 0,0,1,2,3
0,Standings,,,
1,Team,Total,Round 1,
2,1.,CLX,76,76.0
3,2.,OA,55,55.0
4,3.,RCO,44,44.0
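The preview above still carries the "Standings" title row and a header row with one fewer label than the data has columns. As a rough sketch of one way to tidy this, we could keep only the rows that start with a rank and give every column an explicit name. `tidy_lobby` and the `game_n` column names are hypothetical, and the sketch assumes every data row looks like `[rank, team, total, score, ...]` as in the output above.

```python
def tidy_lobby(rows):
    """Turn raw scraped rows into dicts with explicit column names."""
    records = []
    for row in rows:
        # Data rows start with a rank such as '1.'; title and header rows do not
        if len(row) < 3 or not row[0].rstrip('.').isdigit():
            continue
        rank, team, total, *games = row
        record = {
            'rank': int(rank.rstrip('.')),
            'team': team,
            'total': int(total),
        }
        # Any remaining cells are per-game scores, numbered from 1
        for n, score in enumerate(games, start=1):
            record[f'game_{n}'] = int(score)
        records.append(record)
    return records


sample_rows = [
    ['Standings'],
    ['Team', 'Total', 'Round 1'],
    ['1.', 'CLX', '76', '76'],
    ['2.', 'OA', '55', '55'],
]
sample_records = tidy_lobby(sample_rows)
```

A list of dicts like `sample_records` can be passed straight to `pd.DataFrame`, which then uses the dict keys as column names.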


In [42]:
folder = 'Round 1'
# makedirs also creates ../Outputs if it is missing and does nothing if the folder already exists
os.makedirs(f'../Outputs/{folder}', exist_ok=True)

In [43]:
# Save each lobby as its own CSV, numbering the lobbies from 1
for i in range(len(data_all)):
    file = f'Lobby {i+1}'
    data_df = pd.DataFrame(data_all[i])
    data_df.to_csv(f'../Outputs/{folder}/{file}.csv')

This code will be rewritten to create a new folder name for each round and make that directory before saving each lobby's data as an individual CSV. These CSVs will later be combined for visualization.

Since this will be repeated, it is worth creating a function to cut down on the repetition. First, let's work through the preseason data for NA, then expand to the other regions before moving on to other competitions.

In [49]:
def folder_gen(region, game_type, round_number):
    directory = f'{region}_{game_type}'
    if not os.path.exists(f'../Outputs/{directory}/{round_number}'):
        os.makedirs(f'../Outputs/{directory}/{round_number}')
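As an alternative sketch, the same folder logic can be written with `pathlib`, which builds the path once and returns it for later use. `make_round_dir` and its `base` parameter are hypothetical names; `base` defaults to the `../Outputs` folder used above and is only there so the function can be tried out somewhere harmless.

```python
import tempfile
from pathlib import Path


def make_round_dir(region, game_type, round_number, base="../Outputs"):
    """Create <base>/<region>_<game_type>/<round_number> and return the path."""
    path = Path(base) / f"{region}_{game_type}" / str(round_number)
    # parents=True creates missing parent folders; exist_ok=True makes reruns safe
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demo in a throwaway directory so nothing is written next to the notebook
with tempfile.TemporaryDirectory() as tmp:
    created = make_round_dir("North_America", "Preseason_Qualifier_1", "Round_1", base=tmp)
    demo_exists = created.is_dir()
    demo_parts = created.relative_to(tmp).parts
```

Returning the path means the caller can reuse it when saving CSVs instead of rebuilding the same string.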

## Preseason Data

Let's start by trying to scrape our preseason data. Although we used BeautifulSoup earlier to better understand the structure of the website, we will now use pandas to build a more efficient pipeline and help with some post-processing.

In [58]:
# We will start with our NA region
# First preseason qualifier
# All rounds

region = 'North_America'
qualifier = 'Preseason_Qualifier_1'
rounds = ['Round_1', 'Round_2', 'Round_3', 'Quarterfinals', 'Semifinals', 'Finals']

In [60]:
# Let's start by making our folders

for round_name in rounds:
    folder_gen(region, qualifier, round_name)

In [55]:
# Pandas Testing

directory = f'../Outputs/{region}_{qualifier}'
URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021/{qualifier}/{region}/Round_1"

# Pandas reading HTML creates a list of tables
table_list = pd.read_html(URL)
table_list

In [53]:
# We have a list of tables so now we should look for where the tables begin to give results
table_list[6] # This is the first index where we see results so we want to go from 6-> len(table_list)

Unnamed: 0_level_0,Standings,Standings,Standings,Standings
Unnamed: 0_level_1,Team,Team.1,Total,Round 1
0,1.0,CLX,76,76
1,2.0,OA,55,55
2,3.0,RCO,44,44
3,4.0,PWM,39,39
4,5.0,TEAM Workaholics,37,37
5,6.0,Insane Xboxsers,33,33
6,7.0,Chud Bungus,15,15
7,8.0,Ampedant,15,15
8,9.0,Team Shadow Wolf,14,14
9,10.0,JustSomeNewGuys,12,12


In [66]:
# We find that the finals is different so let's see what's going on there
URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021/{qualifier}/{region}/Finals"
table_list = pd.read_html(URL)
df = table_list[5] # Seems like finals table is one short so we need to make sure to index that instead of 6
df.head()

Unnamed: 0_level_0,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints
Unnamed: 0_level_1,Unnamed: 0_level_1,Team,Total,"Round 1 Round 1 September 13, 2021 - 18:00 PDTWorld's Edge Absolute Monarchy CLX xD","Round 1 Round 1 September 13, 2021 - 18:00 PDTWorld's Edge Absolute Monarchy CLX xD","Round 2 Round 2 September 13, 2021 - 18:30 PDTWorld's Edge Lazarus Neanderthals Noble","Round 2 Round 2 September 13, 2021 - 18:30 PDTWorld's Edge Lazarus Neanderthals Noble","Round 3 Round 3 September 13, 2021 - 19:00 PDTWorld's Edge Dudes Night Out Estral Esports BenchWarmers","Round 3 Round 3 September 13, 2021 - 19:00 PDTWorld's Edge Dudes Night Out Estral Esports BenchWarmers","Round 4 Round 4 September 13, 2021 - 19:30 PDTWorld's Edge Noble Absolute Monarchy pub stars","Round 4 Round 4 September 13, 2021 - 19:30 PDTWorld's Edge Noble Absolute Monarchy pub stars","Round 5 Round 5 September 13, 2021 - 20:00 PDTWorld's Edge Noble Lazarus pub stars","Round 5 Round 5 September 13, 2021 - 20:00 PDTWorld's Edge Noble Lazarus pub stars","Round 6 Round 6 September 13, 2021 - 20:30 PDTWorld's Edge SMP Legacy MX Dudes Night Out","Round 6 Round 6 September 13, 2021 - 20:30 PDTWorld's Edge SMP Legacy MX Dudes Night Out"
Unnamed: 0_level_2,Unnamed: 0_level_2,Team,Total,P,K,P,K,P,K,P,K,P,K,P,K
0,1.0,DNO,75,92,55,200,0,112,1818,54,66,45,1313,37,33
1,2.0,NBL,72,63,11,37,88,73,44,112,1010,112,99,82,11
2,3.0,AM,60,112,1414,180,0,170,0,29,77,73,44,73,88
3,4.0,pub stars,49,141,33,54,77,63,22,37,66,37,55,92,22
4,5.0,BW,44,45,1111,170,22,37,55,73,66,92,22,160,11


One issue above is that the values appear doubled because of how the HTML is structured. Each P cell holds both the placement a team finished in and the points awarded for that placement, so a first-place finish worth 12 points is read as 112. Each K cell holds the kill count twice, so 8 kills is read as 88. In both cases the two numbers sit in separate "span" elements inside the same cell, and the parser joins them into one value. We will now create a pipeline to tidy that up for games from the Quarterfinals, Semifinals, and Finals.

This may take further research.


In [95]:
# Checking Quarterfinals
URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021/{qualifier}/{region}/Quarterfinals"
table_list = pd.read_html(URL)
df = table_list[6] # Quarterfinals results start at index 6, as in the earlier rounds
row = df.iloc[0:5, 3:11:2]
row # Here are the selected columns with just the placement/points

Unnamed: 0_level_0,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints
Unnamed: 0_level_1,"Round 1 Round 1 September 12, 2021 - 12:00 PDTWorld's Edge Titanes The Semi Pros SMP","Round 2 Round 2 September 12, 2021 - 12:30 PDTWorld's Edge BenchWarmers SXG Dudes Night Out","Round 3 Round 3 September 12, 2021 - 13:00 PDTWorld's Edge Noble Optimal Ambition SMP","Round 4 Round 4 September 12, 2021 - 13:30 PDTWorld's Edge Dudes Night Out Noble The Semi Pros"
Unnamed: 0_level_2,P,P,P,P
0,102,37,45,112
1,54,180,112,29
2,73,112,111,45
3,29,121,170,37
4,82,54,29,92


In [94]:
kills = df.iloc[0:5, 4:11:2] 
kills # Here are the select columns with just the kill points

Unnamed: 0_level_0,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints
Unnamed: 0_level_1,"Round 1 Round 1 September 12, 2021 - 12:00 PDTWorld's Edge Titanes The Semi Pros SMP","Round 2 Round 2 September 12, 2021 - 12:30 PDTWorld's Edge BenchWarmers SXG Dudes Night Out","Round 3 Round 3 September 12, 2021 - 13:00 PDTWorld's Edge Noble Optimal Ambition SMP","Round 4 Round 4 September 12, 2021 - 13:30 PDTWorld's Edge Dudes Night Out Noble The Semi Pros"
Unnamed: 0_level_2,K,K,K,K
0,88,55,1010,1414
1,66,0,99,44
2,44,99,11,77
3,88,66,0,44
4,0,55,77,55


In [98]:
# Let's test replacing a value
# A team in first place is marked as position 1 with a point-value of 12. This gives a final "output" of 112
# We should replace all values of 112 with the value of 1
test = df.iloc[0:5, 3:11:2].replace(112, 1)
test.head()

Unnamed: 0_level_0,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints
Unnamed: 0_level_1,"Round 1 Round 1 September 12, 2021 - 12:00 PDTWorld's Edge Titanes The Semi Pros SMP","Round 2 Round 2 September 12, 2021 - 12:30 PDTWorld's Edge BenchWarmers SXG Dudes Night Out","Round 3 Round 3 September 12, 2021 - 13:00 PDTWorld's Edge Noble Optimal Ambition SMP","Round 4 Round 4 September 12, 2021 - 13:30 PDTWorld's Edge Dudes Night Out Noble The Semi Pros"
Unnamed: 0_level_2,P,P,P,P
0,102,37,45,1
1,54,180,1,29
2,73,1,111,45
3,29,121,170,37
4,82,54,29,92


In [108]:
# Let's test fixing the kills issue
# Each value is the kill count written twice (1010 means 10 kills), so we only need the first half of the digits
# Zero shows up as a single 0, so it is the one value without an even number of digits
test = df.iloc[0:5, 4:11:2]
test.head()

Unnamed: 0_level_0,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints
Unnamed: 0_level_1,"Round 1 Round 1 September 12, 2021 - 12:00 PDTWorld's Edge Titanes The Semi Pros SMP","Round 2 Round 2 September 12, 2021 - 12:30 PDTWorld's Edge BenchWarmers SXG Dudes Night Out","Round 3 Round 3 September 12, 2021 - 13:00 PDTWorld's Edge Noble Optimal Ambition SMP","Round 4 Round 4 September 12, 2021 - 13:30 PDTWorld's Edge Dudes Night Out Noble The Semi Pros"
Unnamed: 0_level_2,K,K,K,K
0,88,55,1010,1414
1,66,0,99,44
2,44,99,11,77
3,88,66,0,44
4,0,55,77,55


KeyError: 'K'

Now that we know how to access these values, we can create functions to clean them.
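For the kills columns, a minimal sketch of such a cleaning function could split the digits in half and check that both halves match. `halve_doubled` is a hypothetical name, and the sketch assumes every cell is an integer that is either a doubled kill count or a plain 0, which is what the outputs above show.

```python
def halve_doubled(value):
    """Recover a kill count from its doubled form, e.g. 1010 -> 10."""
    digits = str(int(value))
    half, odd = divmod(len(digits), 2)
    # A doubled value has an even number of digits and two identical halves
    if not odd and digits[:half] == digits[half:]:
        return int(digits[:half])
    # Zero kills appears as a single 0 in the scraped tables
    if digits == "0":
        return 0
    raise ValueError(f"{value!r} is not a doubled kill count")


# Values taken from the Quarterfinals kills output above
sample_kills = [88, 55, 1010, 1414, 0]
cleaned_kills = [halve_doubled(k) for k in sample_kills]
```

Raising an error on anything unexpected, rather than guessing, makes it easier to spot a placement value or a malformed cell that ends up in the wrong column.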

For reference, the following values are associated with placement:

- 1st: 12
- 2nd: 9
- 3rd: 7
- 4th: 5
- 5th: 4
- 6th - 7th: 3
- 8th - 10th: 2
- 11th - 15th: 1
- 16th - 20th: 0

These values will show up as unique combinations of placement followed by points, such as:
- 112
- 29
- 37
- 45
- 54
- 63
- 73
- 82
- 92
- 102
- 111
- 121
- 131
- 141
- 151
- 160
- 170
- 180
- 190
- 200

These combinations can be used to translate each combined number back to its placement.
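One way to do that translation, sketched here under the assumption that the points table above is complete for placements 1 to 20, is to build the combined values from the table and look them up in reverse. `PLACEMENT_POINTS`, `FUSED_TO_PLACEMENT`, and `decode_placement` are hypothetical names.

```python
# Points per placement, taken from the reference list above
PLACEMENT_POINTS = {1: 12, 2: 9, 3: 7, 4: 5, 5: 4}
PLACEMENT_POINTS.update({place: 3 for place in (6, 7)})
PLACEMENT_POINTS.update({place: 2 for place in range(8, 11)})
PLACEMENT_POINTS.update({place: 1 for place in range(11, 16)})
PLACEMENT_POINTS.update({place: 0 for place in range(16, 21)})

# Placement followed by points, read as one integer: (1, 12) -> 112
FUSED_TO_PLACEMENT = {
    int(f"{place}{points}"): place
    for place, points in PLACEMENT_POINTS.items()
}


def decode_placement(fused):
    """Split a combined value such as 112 into (placement, points)."""
    place = FUSED_TO_PLACEMENT[int(fused)]
    return place, PLACEMENT_POINTS[place]


first = decode_placement(112)
last = decode_placement(200)
```

Because each of the 20 combined values is distinct, the lookup avoids ambiguous cases such as 111, which could otherwise be misread as placement 1 rather than placement 11.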

In [None]:
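# Function for fixing placement points
def placement_replace(p):
    # 112 is placement 1 joined with its 12 points, so return just the placement
    if p == 112:
        return 1
    # Other values are left unchanged for now
    return p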

In [65]:
# We can use the pandas to_csv function to save the above table so let's start building our for loop
directory = f'../Outputs/{region}_{qualifier}'

for round_name in rounds:
    URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021/{qualifier}/{region}/{round_name}"
    table_list = pd.read_html(URL)
    
    # directory already starts with ../Outputs, so we do not prepend it again
    folder = f'{directory}/{round_name}'
    lobby = 1
    
    if round_name == 'Finals':
        # The Finals page has its results table at index 5
        df = table_list[5]
        file = f'Lobby {lobby}'
        df.to_csv(f'{folder}/{file}.csv')
    else:
        # Other rounds have one results table per lobby, starting at index 6
        for table in range(6, len(table_list)):
            df = table_list[table]
            file = f'Lobby {lobby}'
            df.to_csv(f'{folder}/{file}.csv')
            lobby += 1
        

In [None]:
# Now for each round we collect lobby data
# We scrape with BeautifulSoup here and only use pandas to save each lobby as a CSV
directory = f'../Outputs/{region}_{qualifier}'

for round_name in rounds:
    URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021/{qualifier}/{region}/{round_name}"
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")
    
    results = soup.find_all(class_="table-battleroyale-results")
    
    data_all = defaultdict(list)
    
    # Iterate through each item as a lobby
    for i, lobby in enumerate(results):
        if round_name in ['Quarterfinals', 'Semifinals', 'Finals']:
            # Untested assumption: the spans with data-toggle-area-content="2" hold the
            # duplicated points view, so we drop them before reading the cell text
            for span in lobby.find_all("span", attrs={'data-toggle-area-content': "2"}):
                span.decompose()
        # Create a temporary empty table to store data for that lobby
        temp_table = []
        rows = lobby.find_all('tr')
        # Tidy up the info for each lobby by splitting each team into its own list
        for row in rows:
            cols = [ele.text.strip() for ele in row]
            temp_table.append([ele for ele in cols if ele])
        # Add the list of lists to the dictionary
        data_all[i] = temp_table

    # directory already starts with ../Outputs, so we do not prepend it again
    folder = f'{directory}/{round_name}'
    
    for j in range(len(data_all)):
        file = f'Lobby {j+1}'
        data_df = pd.DataFrame(data_all[j])
        data_df.to_csv(f'{folder}/{file}.csv')
    
    
