# ALGS Data Scraping

This notebook scrapes Liquipedia (https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021-22) for match data from the ALGS Year 2 season.

We want to explore this data for the IronViz Tableau Competition.

The data we want to collect is the following: 
- Region
- Round Number (set of games)
- Game Number (referred to as a round in the table)

For preseason:
- Qualifier Round (for preseason qualifiers)
- Lobby number 

For splits:
- Circuit Round (for challenger circuit)
- Rounds (set of games) for pro league
- Game number (referred to as a round in the table)
- Playoffs games

For championship:
- LCQs
- Winners Bracket, Losers Bracket, Finals
- Round number

For the games with group stages:
- Groups A, B, C, D

In [1]:
# import necessary packages 
 
from collections import defaultdict
import os
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

To make the process more straightforward, we will create lists to iterate over. We might not use all of these lists, but they will at least help keep things organized.

The lists will include for all data:
- Region
- Round Number (set of games)

For preseason:
- Qualifier Round (for preseason qualifiers)

For splits:
- Circuit Round (for challenger circuit)
- Rounds (set of games) for pro league
- Playoffs games

For championship:
- Winners Bracket, Losers Bracket, Finals
- Round number

For the games with group stages:
- Groups A, B, C, D
<hr>

We will go in the following order for our data scraping:

- Preseason
- Split 1
- Split 2
- Championships

In [2]:
# First attempt at scraping data before we create our loops

URL = "https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021/Preseason_Qualifier_1/North_America/Round_1"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

In [3]:
# This is a whole lot of info so we want to focus on two regions: the lobby tab and the results table
# I inspected the website using dev tools and found that there is a class for the results of each lobby
# That class is "table-battleroyale-results" meaning we just need to tap into that to get all the info we need

results = soup.find_all(class_="table-battleroyale-results")

In [4]:
data = []
#table = soup.find('table', attrs={'class':['table-battleroyale-results', ]})

for _ in results:
    table_body = _.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])

In [5]:
data

[[],
 [],
 ['1.', 'CLX', '76', '76'],
 ['2.', 'OA', '55', '55'],
 ['3.', 'RCO', '44', '44'],
 ['4.', 'PWM', '39', '39'],
 ['5.', 'TEAM Workaholics', '37', '37'],
 ['6.', 'Insane Xboxsers', '33', '33'],
 ['7.', 'Chud Bungus', '15', '15'],
 ['8.', 'Ampedant', '15', '15'],
 ['9.', 'Team Shadow Wolf', '14', '14'],
 ['10.', 'JustSomeNewGuys', '12', '12'],
 ['11.', '2B1R', '12', '12'],
 ['12.', 'ControllerLegend', '12', '12'],
 ['13.', 'Carbon Esports', '7', '7'],
 ['14.', 'Tooshbags', '6', '6'],
 ['15.', 'Aces Team', '3', '3'],
 [],
 [],
 ['1.', 'BW', '102', '102'],
 ['2.', 'Spooky Scary', '66', '66'],
 ['3.', 'HololiveDXD', '26', '26'],
 ['4.', '6ide', '23', '23'],
 ['5.', 'ParkingLotBirds', '21', '21'],
 ['6.', 'Azakana', '20', '20'],
 ['7.', 'Bot Squad', '17', '17'],
 ['8.', 'Team Animo', '17', '17'],
 ['9.', 'The Not Squad', '16', '16'],
 ['10.', 'KingsVictoryClub', '15', '15'],
 ['11.', 'AVS', '12', '12'],
 ['12.', 'Remember Reach', '11', '11'],
 ['13.', 'Themeathouse', '10', '10'],
 [

<hr>
This gets us the data for each lobby, but it's kind of messy: the lobbies run together and are separated only by empty lists. Let's try to make it into a dict of lists. That way each lobby's rows stay together under one key, ready to be written out as CSV tables. BeautifulSoup already gives us the data as rows and columns, so we should be able to make it more legible.
<hr>

In [6]:
# Try using tbody instead of class to get our info
results_body_all = soup.find_all('tbody')

# Create defaultdict of list and iterate through each result to create our dictionary
data_all = defaultdict(list)
i = 0

for _ in results_body_all:
    rows = _.find_all('tr')
    cols = [ele.text.strip() for ele in rows]
    data_all[i] = [ele for ele in cols if ele]
    
    i += 1

data_all

defaultdict(list,
            {0: ['vdeApex Legends Global Series 21-22',
              'Preseason\n20-21\n21-22\n22-23',
              'Preseason QualifiersNorth America\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4\nSouth America\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4\nEMEA\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4\nAPAC North\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4\nAPAC South\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4',
              'Preseason Qualifiers',
              'North America\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4',
              'South America\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4',
              'EMEA\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4',
              'APAC North\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4',
              'APAC South\nQualifier 1\nQualifier 2\nQualifier 3\nQualifier 4',
              'Split 1North America\nPlayoffs\nPro League (Matches)\nChallenger Circ

<hr>
This is closer, but we would have to clean up the first few unnecessary tables (the site navigation). Instead, we should stick with the class-based `find_all` and build a temporary table that stores the results as we iterate through each lobby. This gives us a dictionary of lists with one entry per lobby.
<hr>

In [7]:
# This creates a list of results for each lobby which contains the scores.
results = soup.find_all(class_="table-battleroyale-results")

In [8]:
data_all = defaultdict(list)
i = 0

# Iterate through each item as a lobby
for lobby in results:
    # Create a temporary empty table to store data for that lobby
    temp_table = []
    rows = lobby.find_all('tr')
    # Tidy up the info for each lobby by splitting each team into its own list
    for row in rows:
        cols = [ele.text.strip() for ele in row]
        temp_table.append([ele for ele in cols if ele])
    # Add the list of lists to the dictionary
    data_all[i] = temp_table
    i += 1

In [9]:
data_all

defaultdict(list,
            {0: [['Standings'],
              ['Team', 'Total', 'Round 1'],
              ['1.', 'CLX', '76', '76'],
              ['2.', 'OA', '55', '55'],
              ['3.', 'RCO', '44', '44'],
              ['4.', 'PWM', '39', '39'],
              ['5.', 'TEAM Workaholics', '37', '37'],
              ['6.', 'Insane Xboxsers', '33', '33'],
              ['7.', 'Chud Bungus', '15', '15'],
              ['8.', 'Ampedant', '15', '15'],
              ['9.', 'Team Shadow Wolf', '14', '14'],
              ['10.', 'JustSomeNewGuys', '12', '12'],
              ['11.', '2B1R', '12', '12'],
              ['12.', 'ControllerLegend', '12', '12'],
              ['13.', 'Carbon Esports', '7', '7'],
              ['14.', 'Tooshbags', '6', '6'],
              ['15.', 'Aces Team', '3', '3']],
             1: [['Standings'],
              ['Team', 'Total', 'Round 1'],
              ['1.', 'BW', '102', '102'],
              ['2.', 'Spooky Scary', '66', '66'],
              ['3.', 'H

Now we've figured out how to get all the scores for each preseason lobby per round. We just need to repeat that for each round and each region! First, let's figure out how to get this dictionary into a neat pandas DataFrame and save that to a CSV.

In [10]:
data_df = pd.DataFrame.from_dict(data_all[0]) #look at the first list of lists for table info

data_df.head()

Unnamed: 0,0,1,2,3
0,Standings,,,
1,Team,Total,Round 1,
2,1.,CLX,76,76.0
3,2.,OA,55,55.0
4,3.,RCO,44,44.0
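
The frame above still carries the `Standings` title row and the header row as data. As a rough sketch of one way to tidy that, the rows could be filtered and the columns named explicitly. The helper `lobby_to_frame` and the column name `Rank` are assumptions made for illustration, and the toy rows stand in for `data_all[0]`.

```python
import pandas as pd

# Toy stand-in for data_all[0]: a title row, a header row, then one row per team.
lobby_rows = [
    ["Standings"],
    ["Team", "Total", "Round 1"],
    ["1.", "CLX", "76", "76"],
    ["2.", "OA", "55", "55"],
]

def lobby_to_frame(rows):
    """Keep only the four-field team rows and label the columns (names are assumed)."""
    body = [row for row in rows if len(row) == 4]
    frame = pd.DataFrame(body, columns=["Rank", "Team", "Total", "Round 1"])
    # '1.' -> 1, and the score columns become integers
    frame["Rank"] = frame["Rank"].str.rstrip(".").astype(int)
    for column in ["Total", "Round 1"]:
        frame[column] = frame[column].astype(int)
    return frame

lobby_df = lobby_to_frame(lobby_rows)
print(lobby_df)
```

This only fits single-round tables like the one shown here; tables with more rounds would need a different column list.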


In [11]:
# manually create folder for now

folder = 'Round 1'

# Create the folder (and any missing parent folders) if it does not exist yet
os.makedirs(f'../Outputs/{folder}', exist_ok=True)

In [12]:
# For each entry in the dictionary, make a file of the round results

for i in range(len(data_all)): 
    file = f'Lobby {i+1}'
    data_df = pd.DataFrame.from_dict(data_all[i])
    data_df.to_csv(f'../Outputs/{folder}/{file}.csv')

This code will be rewritten to create a new name for each round and make that directory before saving each lobby's data as an individual CSV. These CSVs will later be combined for the visualization.

Since this will be repeated, it may be worth creating a function to save us some effort, but first let's work through the preseason data for NA and then expand to the other regions before moving on to other competitions.
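
A minimal sketch of what such a function might look like, assuming a dictionary shaped like `data_all`. The name `save_lobbies` is hypothetical, and a temporary directory is used here so the example does not touch `../Outputs`.

```python
import os
import tempfile

import pandas as pd

def save_lobbies(lobbies, out_dir):
    """Write one CSV per lobby into out_dir, creating the folder if needed."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, rows in enumerate(lobbies.values(), start=1):
        path = os.path.join(out_dir, f"Lobby {number}.csv")
        # index=False is a choice made for this sketch; the cells above keep the index
        pd.DataFrame(rows).to_csv(path, index=False)
        paths.append(path)
    return paths

# Toy data standing in for data_all
toy_lobbies = {0: [["1.", "CLX", "76", "76"]], 1: [["1.", "BW", "102", "102"]]}

with tempfile.TemporaryDirectory() as tmp:
    written = save_lobbies(toy_lobbies, os.path.join(tmp, "North_America", "Round_1"))
    written_names = [os.path.basename(path) for path in written]
    all_exist = all(os.path.exists(path) for path in written)

print(written_names)
```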

## Preseason Data

Let's start by scraping our preseason data. Although we used BeautifulSoup earlier to better understand the structure of the website, we will now use pandas to build a more efficient pipeline and help with some post-processing.

Pandas' built-in `read_html()` function is very useful here: it returns the tables on a page as a list of DataFrames, so we don't have to build a dictionary of lists by hand. BeautifulSoup was useful for exploring the website and creating our initial framework, but pandas is much more convenient for this step.

In [13]:
# Function for making folders
def folder_gen(region, game_type, round_number): 
    """ 
    Takes inputs for creating file structure for ALGS results based on region, game_type, and round_number.
    
    region: region of interest
    game_type: defines whether it is a preseason qualifier, last-chance qualifier, etc.
    round_number: based on the number of rounds in each game_type
    """
    directory = f'{region}/{game_type}'
    if not os.path.exists(f'../Outputs/{directory}/{round_number}'):
        os.makedirs(f'../Outputs/{directory}/{round_number}')


In [14]:
# We will start with our NA region
# First preseason qualifier
# All rounds

region = 'North_America'
qualifier = 'Preseason_Qualifier_1'
rounds = ['Round_1', 'Round_2', 'Round_3', 'Quarterfinals', 'Semifinals', 'Finals']

In [15]:
# Lets start by making our folders

for _ in rounds:
    folder_gen(region, qualifier, _)

In [16]:
# Pandas Testing

directory = f'../Outputs/{region}_{qualifier}' # select directory to save to
URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021/{qualifier}/{region}/Round_1" # manually set URL
 
# Pandas reading HTML creates a list of tables
table_list = pd.read_html(URL)
table_list

[                  vdeApex Legends Global Series 21-22  \
 0                         Preseason 20-21 21-22 22-23   
 1   Preseason QualifiersNorth America Qualifier 1 ...   
 2                                Preseason Qualifiers   
 3                                       North America   
 4                                       South America   
 5                                                EMEA   
 6                                          APAC North   
 7                                          APAC South   
 8   Split 1North America Playoffs Pro League (Matc...   
 9                                             Split 1   
 10                                      North America   
 11                                      South America   
 12                                               EMEA   
 13                                         APAC North   
 14                                         APAC South   
 15  Split 2 - PlayoffsNorth America Pro League (Ma...   
 16           

In [17]:
# We have a list of tables so now we should look for where the tables begin to give results
table_list[6].head() # This is the first index where we see results so we want to go from 6-> len(table_list)

Unnamed: 0_level_0,Standings,Standings,Standings,Standings
Unnamed: 0_level_1,Team,Team.1,Total,Round 1
0,1.0,CLX,76,76
1,2.0,OA,55,55
2,3.0,RCO,44,44
3,4.0,PWM,39,39
4,5.0,TEAM Workaholics,37,37
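
Hard-coding the starting index works for this page but is fragile if the number of navigation tables differs between pages. One possible alternative, sketched below on toy frames, is to keep only the tables whose column labels mention `Standings`. The helper name `standings_tables` is an assumption for illustration, not something from the notebook or from pandas.

```python
import pandas as pd

def standings_tables(tables, marker="Standings"):
    """Return only the tables whose column labels mention the marker text."""
    selected = []
    for table in tables:
        # MultiIndex columns iterate as tuples, flat columns as plain labels
        labels = " ".join(
            str(part)
            for column in table.columns
            for part in (column if isinstance(column, tuple) else (column,))
        )
        if marker in labels:
            selected.append(table)
    return selected

# Toy stand-ins for the list returned by pd.read_html
nav = pd.DataFrame({"vdeApex Legends Global Series 21-22": ["Preseason"]})
results = pd.DataFrame(
    [[1.0, "CLX", 76, 76]],
    columns=pd.MultiIndex.from_tuples([
        ("Standings", "Team"), ("Standings", "Team.1"),
        ("Standings", "Total"), ("Standings", "Round 1"),
    ]),
)

found = standings_tables([nav, results])
print(len(found))
```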


In [18]:
# We find that the finals is different so lets see what's going on there 
URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021/{qualifier}/{region}/Finals"
table_list = pd.read_html(URL)
df = table_list[5] # Seems like finals table is one short so we need to make sure to index that instead of 6
df.head()

Unnamed: 0_level_0,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints
Unnamed: 0_level_1,Unnamed: 0_level_1,Team,Total,"Round 1 Round 1 September 13, 2021 - 18:00 PDTWorld's Edge Absolute Monarchy CLX xD","Round 1 Round 1 September 13, 2021 - 18:00 PDTWorld's Edge Absolute Monarchy CLX xD","Round 2 Round 2 September 13, 2021 - 18:30 PDTWorld's Edge Lazarus Neanderthals Noble","Round 2 Round 2 September 13, 2021 - 18:30 PDTWorld's Edge Lazarus Neanderthals Noble","Round 3 Round 3 September 13, 2021 - 19:00 PDTWorld's Edge Dudes Night Out Estral Esports BenchWarmers","Round 3 Round 3 September 13, 2021 - 19:00 PDTWorld's Edge Dudes Night Out Estral Esports BenchWarmers","Round 4 Round 4 September 13, 2021 - 19:30 PDTWorld's Edge Noble Absolute Monarchy pub stars","Round 4 Round 4 September 13, 2021 - 19:30 PDTWorld's Edge Noble Absolute Monarchy pub stars","Round 5 Round 5 September 13, 2021 - 20:00 PDTWorld's Edge Noble Lazarus pub stars","Round 5 Round 5 September 13, 2021 - 20:00 PDTWorld's Edge Noble Lazarus pub stars","Round 6 Round 6 September 13, 2021 - 20:30 PDTWorld's Edge SMP Legacy MX Dudes Night Out","Round 6 Round 6 September 13, 2021 - 20:30 PDTWorld's Edge SMP Legacy MX Dudes Night Out"
Unnamed: 0_level_2,Unnamed: 0_level_2,Team,Total,P,K,P,K,P,K,P,K,P,K,P,K
0,1.0,DNO,75,92,55,200,0,112,1818,54,66,45,1313,37,33
1,2.0,NBL,72,63,11,37,88,73,44,112,1010,112,99,82,11
2,3.0,AM,60,112,1414,180,0,170,0,29,77,73,44,73,88
3,4.0,pub stars,49,141,33,54,77,63,22,37,66,37,55,92,22
4,5.0,BW,44,45,1111,170,22,37,55,73,66,92,22,160,11


One issue above is that values in the table are duplicated because of the page's HTML. The P column is both "placement" and "points": the position a team finished in and the points awarded for that placement, run together into one number. The K column is the number of kills, which shows up as the same number written twice (for example `88` for 8 kills). The duplication comes from the "span" elements in both columns. We will now create a pipeline to tidy that up for games from the Quarterfinals, Semifinals, and Finals.

To do that, we replace the affected values using a few pandas tools, which we explore below.
For reference, each placement is paired with the following points, which produces these unique combinations:

| Placement | Points|Combination(s)           |
|-----------|-------|-------------------------|
|1st        | 12    | 112                     |
|2nd        | 9     | 29                      |
|3rd        | 7     | 37                      | 
|4th        | 5     | 45                      |
|5th        | 4     | 54                      |
|6th-7th    | 3     | 63, 73                  |
|8th-10th   | 2     | 82, 92, 102             |
|11th-15th  | 1     | 111, 121, 131, 141, 151 |
|16th-20th  | 0     | 160, 170, 180, 190, 200 |



These combinations can be used to translate each combined number back to a placement. Since each combination is unique, we can replace it with the appropriate value. For instance, `112` can be replaced by `1`, since we are more interested in the placement than in the points awarded for it.
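
As a small sketch of the idea, the combinations can be generated from the points table instead of being typed out by hand. The names `POINTS_TABLE` and `build_placement_map` are made up for this illustration; the placement ranges and points are copied from the table above.

```python
# (first_place, last_place, points) rows copied from the table above
POINTS_TABLE = [
    (1, 1, 12), (2, 2, 9), (3, 3, 7), (4, 4, 5), (5, 5, 4),
    (6, 7, 3), (8, 10, 2), (11, 15, 1), (16, 20, 0),
]

def build_placement_map(points_table):
    """Map each 'placement followed by points' number back to its placement."""
    mapping = {}
    for first, last, points in points_table:
        for place in range(first, last + 1):
            # e.g. placement 1 with 12 points reads as 112 in the scraped table
            mapping[int(f"{place}{points}")] = place
    return mapping

placement_map = build_placement_map(POINTS_TABLE)
print(placement_map[112], placement_map[29], placement_map[200])
```

Because every combination is distinct, the resulting dictionary has exactly 20 entries, one per placement.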


In [19]:
# Checking Quarterfinals
URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021/{qualifier}/{region}/Quarterfinals"
table_list = pd.read_html(URL)
df = table_list[6]

In [20]:
# Lets test replacing a value
# A person in first place is marked as position 1 with a point-value of 12. This gives a final "output" of 112
# We should replace all values of 112 with the value of 1

test = df.iloc[0:5, 3:11:2].replace(112,1) 
test.head()

Unnamed: 0_level_0,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints
Unnamed: 0_level_1,"Round 1 Round 1 September 12, 2021 - 12:00 PDTWorld's Edge Titanes The Semi Pros SMP","Round 2 Round 2 September 12, 2021 - 12:30 PDTWorld's Edge BenchWarmers SXG Dudes Night Out","Round 3 Round 3 September 12, 2021 - 13:00 PDTWorld's Edge Noble Optimal Ambition SMP","Round 4 Round 4 September 12, 2021 - 13:30 PDTWorld's Edge Dudes Night Out Noble The Semi Pros"
Unnamed: 0_level_2,P,P,P,P
0,102,37,45,1
1,54,180,1,29
2,73,1,111,45
3,29,121,170,37
4,82,54,29,92


In [21]:
# Create list of points and their replacement
# Alternatively, use a dictionary here

unique_points = [112, 29, 37, 45, 54, 63, 73, 82, 92, 102, 111, 121, 131, 141, 151, 160, 170, 180, 190, 200]
replacement = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

# Create temporary copy of the dataframe to ensure we replace only the values we care about
temp = df.iloc[:, 3:11:2]

# Go through each point in unique_points and replace
for i in range(len(unique_points)):
    temp = temp.replace(unique_points[i], replacement[i])

# replace those columns with the temp column
df.iloc[:, 3:11:2] = temp
df.head()

Unnamed: 0_level_0,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints
Unnamed: 0_level_1,Unnamed: 0_level_1,Team,Total,"Round 1 Round 1 September 12, 2021 - 12:00 PDTWorld's Edge Titanes The Semi Pros SMP","Round 1 Round 1 September 12, 2021 - 12:00 PDTWorld's Edge Titanes The Semi Pros SMP","Round 2 Round 2 September 12, 2021 - 12:30 PDTWorld's Edge BenchWarmers SXG Dudes Night Out","Round 2 Round 2 September 12, 2021 - 12:30 PDTWorld's Edge BenchWarmers SXG Dudes Night Out","Round 3 Round 3 September 12, 2021 - 13:00 PDTWorld's Edge Noble Optimal Ambition SMP","Round 3 Round 3 September 12, 2021 - 13:00 PDTWorld's Edge Noble Optimal Ambition SMP","Round 4 Round 4 September 12, 2021 - 13:30 PDTWorld's Edge Dudes Night Out Noble The Semi Pros","Round 4 Round 4 September 12, 2021 - 13:30 PDTWorld's Edge Dudes Night Out Noble The Semi Pros"
Unnamed: 0_level_2,Unnamed: 0_level_2,Team,Total,P,K,P,K,P,K,P,K
0,1.0,DNO,63,10,88,3,55,4,1010,1,1414
1,2.0,NBL,44,5,66,18,0,1,99,2,44
2,3.0,BW,42,7,44,1,99,11,11,4,77
3,4.0,The Semi Pros,35,2,88,12,66,17,0,3,44
4,5.0,OA,34,8,0,5,55,2,77,9,55
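
The cell above notes that a dictionary could be used instead of two parallel lists. A sketch of that variant on a toy frame follows; the column names `P1`, `K1`, and `P2` are invented for the example, and `DataFrame.replace` is given the whole mapping in one call.

```python
import pandas as pd

unique_points = [112, 29, 37, 45, 54, 63, 73, 82, 92, 102,
                 111, 121, 131, 141, 151, 160, 170, 180, 190, 200]
# Pair each combined value with its placement (1 through 20)
placement_map = dict(zip(unique_points, range(1, 21)))

# Toy frame: two placement columns with a kills column in between
toy = pd.DataFrame({"P1": [102, 54, 73], "K1": [88, 66, 44], "P2": [37, 180, 112]})

placement_cols = ["P1", "P2"]
# Only the placement columns are touched; the kills column is left alone
toy[placement_cols] = toy[placement_cols].replace(placement_map)
print(toy)
```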


In [22]:
# Lets test fixing the kills issues
# Each value is the kill count written twice, so its length is always even and we can keep the first half
# First, we cast the kill columns as str so we can slice for half the values

kills = df.iloc[:, 4:11:2].astype(str)
kills.head()

Unnamed: 0_level_0,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints
Unnamed: 0_level_1,"Round 1 Round 1 September 12, 2021 - 12:00 PDTWorld's Edge Titanes The Semi Pros SMP","Round 2 Round 2 September 12, 2021 - 12:30 PDTWorld's Edge BenchWarmers SXG Dudes Night Out","Round 3 Round 3 September 12, 2021 - 13:00 PDTWorld's Edge Noble Optimal Ambition SMP","Round 4 Round 4 September 12, 2021 - 13:30 PDTWorld's Edge Dudes Night Out Noble The Semi Pros"
Unnamed: 0_level_2,K,K,K,K
0,88,55,1010,1414
1,66,0,99,44
2,44,99,11,77
3,88,66,0,44
4,0,55,77,55


In [23]:
# we slice the first half of the values
# replace the values in the dataframe and cast them back as ints

kills = kills.applymap(lambda points: points[:len(points)//2] if len(points) > 1 else 0)
df.iloc[:, 4:11:2] = kills.astype(int)
df.head()

Unnamed: 0_level_0,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints,StandingsPlacements and killsPoints
Unnamed: 0_level_1,Unnamed: 0_level_1,Team,Total,"Round 1 Round 1 September 12, 2021 - 12:00 PDTWorld's Edge Titanes The Semi Pros SMP","Round 1 Round 1 September 12, 2021 - 12:00 PDTWorld's Edge Titanes The Semi Pros SMP","Round 2 Round 2 September 12, 2021 - 12:30 PDTWorld's Edge BenchWarmers SXG Dudes Night Out","Round 2 Round 2 September 12, 2021 - 12:30 PDTWorld's Edge BenchWarmers SXG Dudes Night Out","Round 3 Round 3 September 12, 2021 - 13:00 PDTWorld's Edge Noble Optimal Ambition SMP","Round 3 Round 3 September 12, 2021 - 13:00 PDTWorld's Edge Noble Optimal Ambition SMP","Round 4 Round 4 September 12, 2021 - 13:30 PDTWorld's Edge Dudes Night Out Noble The Semi Pros","Round 4 Round 4 September 12, 2021 - 13:30 PDTWorld's Edge Dudes Night Out Noble The Semi Pros"
Unnamed: 0_level_2,Unnamed: 0_level_2,Team,Total,P,K,P,K,P,K,P,K
0,1.0,DNO,63,10,8,3,5,4,10,1,14
1,2.0,NBL,44,5,6,18,0,1,9,2,4
2,3.0,BW,42,7,4,1,9,11,1,4,7
3,4.0,The Semi Pros,35,2,8,12,6,17,0,3,4
4,5.0,OA,34,8,0,5,5,2,7,9,5
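
Slicing off the first half assumes every value really is doubled. A slightly more defensive sketch checks that the two halves match before collapsing them. The function name `dedupe_kills` is hypothetical, and unlike the cell above it returns any non-doubled value unchanged rather than 0.

```python
def dedupe_kills(raw):
    """Collapse a doubled kill value such as '1010' to 10; leave anything else alone."""
    text = str(raw)
    half, remainder = divmod(len(text), 2)
    # Only halve when the string is two identical halves
    if remainder == 0 and text[:half] == text[half:]:
        return int(text[:half])
    return int(text)

examples = {value: dedupe_kills(value) for value in ["88", "1010", "1414", "0", "99"]}
print(examples)
```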


<hr>
Now that we know how to access and change these values, we can write functions to clean them.

There will also be functions which scrape the data depending on the situation at hand.
<hr>

In [24]:
### ------ Data Clean-up Functions -------- ###

def placement_replace(df):
    """
    Function for fixing placement points
    Takes values based on columns and replaces with appropriate placement
    
    Takes in a dataframe, slices it based on which columns are placement columns and 
    replaces unique point combination with placement. 
    
    """
    unique_points = [112, 29, 37, 45, 54, 63, 73, 82, 92, 102, 111, 121, 131, 141, 151, 160, 170, 180, 190, 200]
    replacement = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

    temp = df.iloc[:, 3:len(df.columns):2]
    
    for i in range(len(unique_points)):
        temp = temp.replace(unique_points[i], replacement[i])
        
    df.iloc[:, 3:len(df.columns):2] = temp
    return df
    
def kill_fix(df):
    """
    Function for fixing kill points
    Takes values based on columns and slices first half to remove duplicate copy of KP
    
    Takes in a dataframe, slices it based on which columns are kill columns and 
    replaces duplicated KP value with appropriate KP
    
    """
    temp = df.iloc[:, 4:len(df.columns):2].astype(str)
    # applymap is named DataFrame.map in pandas 2.1+; a lone '0' is not doubled, so it maps to 0
    temp = temp.applymap(lambda points: points[:len(points)//2] if len(points) > 1 else 0)
    df.iloc[:, 4:len(df.columns):2] = temp.astype(int)
    return df

### ------ Data gathering Functions -------- ###

def algs_data(year, region, qualifier, round_):
    
    """
    Takes in year, region, qualifier stage, and round number as inputs for scraping Liquipedia ALGS data
    Designed primarily for preseason qualifiers
    
    year: year of ALGS
    region: region of interest (NA, EMEA, APAC N, APAC S, SA) - Must match the spelling used in the site URL
    qualifier: which qualifier stage the games are in
    round_: round of interest 
    """
    URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/{year}/{qualifier}/{region}/{round_}"
    table_list = pd.read_html(URL)

    # Output folder for this round (created beforehand with folder_gen)
    folder = f'../Outputs/{region}/{qualifier}/{round_}'
    lobby = 1

    if round_ == 'Finals':
        # On Finals pages the results table sits at index 5 and needs cleaning
        df = table_list[5]
        placement_replace(df)
        kill_fix(df)
        df.to_csv(f'{folder}/Lobby {lobby}.csv')
    elif round_ in ['Quarterfinals', 'Semifinals']:
        # Results start at index 6; these tables have doubled P/K values to clean
        for table in range(6, len(table_list)):
            df = table_list[table]
            placement_replace(df)
            kill_fix(df)
            df.to_csv(f'{folder}/Lobby {lobby}.csv')
            lobby += 1
    else:
        # Earlier rounds have no doubled values, so the tables are saved as-is
        for table in range(6, len(table_list)):
            df = table_list[table]
            df.to_csv(f'{folder}/Lobby {lobby}.csv')
            lobby += 1
            
def algs_pro(year, region, qualifier):
    """
    Takes in year, region, and qualifier stage as inputs for scraping Liquipedia ALGS data
    Designed primarily for ALGS pro league
    
    year: year of ALGS
    region: region of interest (NA, EMEA, APAC N, APAC S, SA) - Must match the spelling used in the site URL
    qualifier: which stage the games are in (e.g. 'Split_1/Pro_League')
    """
    URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/{year}/{qualifier}/{region}/Matches"
    table_list = pd.read_html(URL)

    # Output folder for the match rounds (created beforehand with folder_gen)
    folder = f'../Outputs/{region}/{qualifier}/Matches'
    lobby = 1
    
    # Results tables start at index 8 on the Matches page
    for table in range(8, len(table_list)):
        df = table_list[table]
        placement_replace(df)
        kill_fix(df)
        df.to_csv(f'{folder}/Round {lobby}.csv')
        lobby += 1

def algs_playoffs(year, region, split):
    """
    Takes in year, region, and split as inputs for scraping Liquipedia ALGS data
    Designed primarily for ALGS playoffs
    
    year: year of ALGS
    region: region of interest (NA, EMEA, APAC N, APAC S, SA) - Must match the spelling used in the site URL
    split: which split of ALGS
    """
    # Save into a 'Rounds' folder; formatting the global `rounds` list into the path would break it
    folder = f'../Outputs/{region}/{split}/Rounds'
    os.makedirs(folder, exist_ok=True)
    
    URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/{year}/{split}/{region}/"
    table_list = pd.read_html(URL)
    
    points = table_list[28]
    placement_replace(points)
    kill_fix(points)
    points.to_csv(f'{folder}/Points.csv')
    
    matchpoint = table_list[29]
    matchpoint.to_csv(f'{folder}/Matchpoint.csv')
    
def bracket_data(year, qual, region, bracket):
    """
    Takes in year, qualifying stage, region, and bracket round as inputs for scraping Liquipedia ALGS data
    Designed primarily for bracket rounds
    
    year: year of ALGS
    region: region of interest (NA, EMEA, APAC N, APAC S, SA) - Must match the spelling used in the site URL
    qual: which round of games we are in
    bracket: which bracket round of interest we are in
    """
    folder = f'../Outputs/{region}/{qual}/{bracket}'

    URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/{year}/{qual}/{region}/{bracket}"
    table_list = pd.read_html(URL)
    points = table_list[6]
    placement_replace(points)
    kill_fix(points)
    points.to_csv(f'{folder}/Points.csv')

In [25]:
# We can use the pandas to_csv function to save the above table, so let's start building our for loop
directory = f'../Outputs/{region}/{qualifier}'

# This is the initial version of the loop:
# go through each round and get the data
for _ in rounds:
    URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2021/{qualifier}/{region}/{_}"
    table_list = pd.read_html(URL)
    
    folder = f'{directory}/{_}'
    lobby = 1
    
# if there is a finals select the right table and do appropriate cleaning
    if _ == 'Finals':
        df = table_list[5]
        placement_replace(df)
        kill_fix(df)
        file = f'Lobby {lobby}'
        df.to_csv(f'../Outputs/{folder}/{file}.csv')
# if there is a quarterfinals or semifinals choose the right table
    elif _ in ['Quarterfinals', 'Semifinals']:
         for table in range(6,len(table_list)):
            df = table_list[table]
            placement_replace(df)
            kill_fix(df)
            file = f'Lobby {lobby}'
            df.to_csv(f'../Outputs/{folder}/{file}.csv')
            lobby += 1
    else:
        for table in range(6,len(table_list)):
            df = table_list[table]
            file = f'Lobby {lobby}'
            df.to_csv(f'../Outputs/{folder}/{file}.csv')
            lobby += 1

In [26]:
# Now to expand to other regions and qualifiers
# Start by creating our folders
regions = ["North_America", "South_America", "EMEA", "APAC_North", "APAC_South" ]
qualifiers = ['Preseason_Qualifier_1', 'Preseason_Qualifier_2', 'Preseason_Qualifier_3', 'Preseason_Qualifier_4']
rounds = ['Round_1', 'Round_2', 'Round_3', 'Quarterfinals', 'Semifinals', 'Finals']

for region in regions:
    for qualifier in qualifiers:
        for _ in rounds:
                folder_gen(region, qualifier, _)

In [27]:
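# Preseason qualifiers sit under the 2021 URL, and algs_data expects the year first
year = 2021

for region in regions:
    for qualifier in qualifiers:
        for _ in rounds:
            try:
                algs_data(year, region, qualifier, _)
            except Exception as error:
                # Not every region has every round; report the miss and move on
                print(f'Skipped {region}/{qualifier}/{_}: {error}')
                continue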
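
Skipping failed pages silently makes it hard to tell a missing round from a real bug. As a rough sketch, the failures could be collected and reviewed afterwards. The names `scrape_all` and `fake_scrape` are hypothetical; `fake_scrape` stands in for `algs_data` so the example runs without network access.

```python
from itertools import product

def scrape_all(scrape, year, regions, qualifiers, rounds):
    """Call scrape(year, region, qualifier, round_) for every combination and return the failures."""
    failures = []
    for region, qualifier, round_ in product(regions, qualifiers, rounds):
        try:
            scrape(year, region, qualifier, round_)
        except Exception as error:
            # Record what failed and why instead of dropping it
            failures.append((region, qualifier, round_, repr(error)))
    return failures

def fake_scrape(year, region, qualifier, round_):
    # Stand-in for algs_data: pretend Round_3 pages do not exist
    if round_ == "Round_3":
        raise ValueError("No tables found")

failed = scrape_all(
    fake_scrape, 2021,
    ["North_America", "EMEA"], ["Preseason_Qualifier_1"], ["Round_1", "Round_3"],
)
print(failed)
```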

# Split 1

Now to collect Split 1 data which consists of Playoffs, Pro League, and Challenger Circuit.

We will start with the Challenger Circuit, then the Pro League, and finish with the Playoffs, since the Playoffs score tables are formatted somewhat differently.

With some small changes to the URL, we should be able to do the Challenger Circuit pretty easily.

In [28]:
# Create our folders
# Not every region has these exact rounds but we can make the folders
regions = ["North_America", "South_America", "EMEA", "APAC_North", "APAC_South" ]
challenger = ['Split_1/Challenger_Circuit_1', 'Split_1/Challenger_Circuit_2', 'Split_1/Challenger_Circuit_3', 'Split_1/Challenger_Circuit_4']
rounds = ['Round_1','Quarterfinals', 'Semifinals', 'Finals']
year = 2021

for region in regions:
    for circuit in challenger:
        for _ in rounds:
            folder_gen(region, circuit, _)

In [29]:
for region in regions:
    for circuit in challenger:
        for _ in rounds:
            try:
                algs_data(year, region, circuit, _)
            except:
                continue

## Split 1 Pro League Matches

Now we try to get the Pro League Match data. There is a slight format difference and this is reflected in the `algs_pro` function that allows us to capture that data.

In [30]:
regions = ["North_America", "South_America", "EMEA", "APAC_North", "APAC_South" ]
split = 'Split_1/Pro_League'
matches = 'Matches'
year = 2021

for region in regions:
    folder_gen(region, split, matches)

In [31]:
for region in regions:
        algs_pro(year, region, split)

## Split 1 Playoffs

Let's get the playoffs data! There is a bit of a difference here as well. There is only one split and the rounds are all in the match data.

We will have a new `algs_playoffs` function to help with that.

In [32]:
regions = ["North_America", "South_America", "EMEA", "APAC_North", "APAC_South" ]
split = 'Split_1/Playoffs'
rounds = ['Rounds']
year = 2022

# Make the playoffs folders from `split`, not the leftover `challenger` list
for region in regions:
    for _ in rounds:
        folder_gen(region, split, _)

In [33]:
for region in regions:
    try:
        algs_playoffs(year, region, split)
    except:
        continue

In [34]:
# Playoffs have a slightly different URL structure

table_list = pd.read_html('https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2022/Split_1/Playoffs/North_America')


# Split 2

We will now do the same for Split 2! The one difference is that Split 2 has Last Chance Qualifiers, which we will do at the end.
Let's start with the Challenger Circuit!

## Split 2 Challenger Circuit

In [36]:
# Create our folders
# Not every region has these exact rounds but we can make the folders
regions = ["North_America", "South_America", "EMEA", "APAC_North", "APAC_South" ]
challenger = ['Split_2/Challenger_Circuit_1', 'Split_2/Challenger_Circuit_2', 'Split_2/Challenger_Circuit_3', 'Split_2/Challenger_Circuit_4']
rounds = ['Quarterfinals', 'Semifinals', 'Finals']
year = 2022

for region in regions:
    for circuit in challenger:
        for _ in rounds:
            folder_gen(region, circuit, _)

In [37]:
for region in regions:
    for circuit in challenger:
        for _ in rounds:
            try:
                algs_data(year, region, circuit, _)
            except Exception:  # some region/circuit/round pages do not exist
                continue

## Split 2 Pro League

We can reuse the `algs_pro` function for the Split 2 Pro League.

In [38]:
regions = ["North_America", "South_America", "EMEA", "APAC_North", "APAC_South" ]
split = 'Split_2/Pro_League'
matches = 'Matches'
year = 2022

for region in regions:
    folder_gen(region, split, matches)

In [39]:
for region in regions:
    try:
        algs_pro(year, region, split)
    except Exception:  # skip regions whose Pro League page is missing or has no tables
        continue

## Split 2 Playoffs

The Split 2 playoffs were a single event with no regions, so the URL has no region segment. Instead, each stage (group stage, bracket stage, finals) has its own page. Rather than reuse `algs_playoffs`, we adapt its steps inline here.
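
The URLs used in this notebook differ mainly in which path segments follow the year. A small sketch of a builder that joins whatever segments a page needs is shown below. `BASE_URL` and `build_url` are hypothetical names introduced for illustration, and the segment order is taken from the URLs that already appear in the cells.

```python
BASE_URL = "https://liquipedia.net/apexlegends/Apex_Legends_Global_Series"


def build_url(year, *segments):
    """Join the year and any number of path segments onto the base URL."""
    return "/".join([BASE_URL, str(year), *segments])


# Split 2 playoffs: no region, one page per stage
print(build_url(2022, "Split_2/Playoffs", "Bracket_Stage"))

# Split 1 playoffs: one page per region
print(build_url(2022, "Split_1/Playoffs", "North_America"))
```

Keeping the segments as plain arguments means the same helper covers pages with a region, a stage, both, or neither.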

In [40]:
split = 'Split_2/Playoffs'
rounds = ['Group_Stage', 'Bracket_Stage','Finals']
year = 2022

for _ in rounds:
    # exist_ok=True makes the cell safe to re-run
    os.makedirs(f'../Outputs/Split 2 Playoffs/{_}', exist_ok=True)

In [41]:
URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/{year}/{split}/Bracket_Stage"
# print(URL)
table_list = pd.read_html(URL)

points = table_list[10]
points.head()

https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/2022/Split_2/Playoffs/Bracket_Stage


All columns sit under the top-level header `StandingsPlacements and killsPoints`; each round contributes a P and a K column.

| | Unnamed | Team | Total | Round 1 P | Round 1 K | Round 2 P | Round 2 K | Round 3 P | Round 3 K | Round 4 P | Round 4 K | Round 5 P | Round 5 K | Round 6 P | Round 6 K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | UNI | 60 | 45 | 88 | 92 | 22 | 180 | 0 | 29 | 77 | 112 | 1313 | 121 | 11 |
| 1 | 2.0 | GMT | 59 | 54 | 77 | 131 | 0 | 112 | 99 | 131 | 11 | 37 | 88 | 45 | 44 |
| 2 | 3.0 | SSG | 56 | 112 | 99 | 45 | 66 | 200 | 0 | 92 | 22 | 131 | 11 | 29 | 99 |
| 3 | 4.0 | BRGR | 47 | 63 | 33 | 29 | 66 | 92 | 22 | 121 | 22 | 29 | 1010 | 200 | 0 |
| 4 | 5.0 | V3 | 46 | 111 | 55 | 170 | 11 | 63 | 55 | 112 | 1515 | 102 | 22 | 170 | 0 |

Round details carried in the scraped column headers (time, map, then team names run together as scraped):
- Round 1: April 30, 2022 - 20:45 CEST, Storm Point, Spacestation Gaming GameWith REJECT
- Round 2: April 30, 2022 - 21:20 CEST, Storm Point, Team Empire Team Burger Players
- Round 3: April 30, 2022 - 21:55 CEST, Storm Point, GMT Esports NRG FENNEL
- Round 4: April 30, 2022 - 22:30 CEST, World's Edge, V3 VEGA Team UNITE αDRaccoon
- Round 5: April 30, 2022 - 23:05 CEST, World's Edge, Team UNITE Team Burger GMT Esports
- Round 6: April 30, 2022 - 23:40 CEST, World's Edge, Elevate Spacestation Gaming FENNEL
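
The three-level column header in the output above is awkward to work with once the table is saved to CSV. A minimal sketch of collapsing each column label into a short name such as `Round 1 P` follows. `flatten_header` is a hypothetical helper, not one of the notebook's functions, and treating the unnamed column as a rank is an assumption based on its values (1.0, 2.0, ...).

```python
import re


def flatten_header(column):
    """Collapse one (top, middle, bottom) column label into a short name.

    Hypothetical helper. It assumes the middle level starts with "Round N"
    for game columns, as in the scraped output above.
    """
    _top, middle, bottom = column
    match = re.match(r"Round (\d+)", middle)
    if match:
        return f"Round {match.group(1)} {bottom}"  # e.g. "Round 1 P"
    if middle.startswith("Unnamed"):
        return "Rank"  # assumed meaning of the unnamed standings column
    return middle  # "Team", "Total"


top = "StandingsPlacements and killsPoints"
round_1 = ("Round 1 Round 1 April 30, 2022 - 20:45 CESTStorm Point "
           "Spacestation Gaming GameWith REJECT")
columns = [
    (top, "Unnamed: 0_level_1", "Unnamed: 0_level_2"),
    (top, "Team", "Team"),
    (top, "Total", "Total"),
    (top, round_1, "P"),
    (top, round_1, "K"),
]

flat = [flatten_header(col) for col in columns]
print(flat)
```

On a real table this could be applied as `points.columns = [flatten_header(col) for col in points.columns]`, since iterating over a three-level `MultiIndex` yields tuples. Whether to do so before or after `placement_replace` and `kill_fix` depends on how those functions address the columns.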


In [42]:
# The Split 2 playoffs had no regions, so the URL has no region segment
# We build the URL inline here; the remaining steps mirror the algs_playoffs function

for _ in rounds:
    directory = f'../Outputs/Split 2 Playoffs/{_}'
    
    URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/{year}/{split}/{_}"
    #print(URL)
    table_list = pd.read_html(URL)
    
    if _ == 'Finals':    
        points = table_list[6]
        placement_replace(points)
        kill_fix(points)
        # directory already starts with ../Outputs, so write to it directly
        points.to_csv(f'{directory}/Points.csv')

        matchpoint = table_list[7]
        matchpoint.to_csv(f'{directory}/Matchpoint.csv')
    elif _ == 'Bracket_Stage':
        # Tables 6, 8 and 10 hold the points tables; number the files 1, 2, 3
        for n, i in enumerate(range(6, 12, 2), start=1):
            points = table_list[i]
            placement_replace(points)
            kill_fix(points)
            points.to_csv(f'{directory}/Round {n}.csv')
    else:
        # Group stage: tables 8 to 13 become Round 1 to Round 6
        for i in range(8, 14):
            points = table_list[i]
            placement_replace(points)
            kill_fix(points)
            points.to_csv(f'{directory}/Round {i-7}.csv')

## Split 2 Qualifiers

These are the playoff qualifiers. Unlike the Split 2 playoffs, they are split by region, and each region's page is divided into brackets (winners bracket, losers bracket, finals) rather than a long list of varied matches. We can adapt the previous functions to this layout, reading one points table per bracket.

In [43]:
regions = ["North_America", "South_America", "EMEA", "APAC_North", "APAC_South" ]
qual = 'Split_2/Pro_League/Qualifiers'
brackets = ['Winners_Bracket', 'Losers_Bracket', 'Finals']
year = 2022

for region in regions:
    for bracket in brackets:
        folder_gen(region, qual, bracket)

In [44]:
for region in regions:
    for bracket in brackets:
        directory = f'../Outputs/{region}/{qual}/{bracket}'
    
        URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/{year}/{qual}/{region}/{bracket}"
        table_list = pd.read_html(URL)
        points = table_list[6]
        placement_replace(points)
        kill_fix(points)
        # directory already starts with ../Outputs, so write to it directly
        points.to_csv(f'{directory}/Points.csv')

In [45]:
for region in regions:
    for bracket in brackets:
        bracket_data(year, qual, region, bracket)

# Championship

All that's left is the championship, which has the Last Chance Qualifiers (LCQs) and the championship games themselves!

## Last Chance Qualifiers

These use a bracket format! The `bracket_data` function lets us collect the bracket data for each LCQ, just as it did for the Split 2 qualifiers above.

In [46]:
regions = ["North_America", "South_America", "EMEA", "APAC_North", "APAC_South" ]
lcqs = ['Championship/Last_Chance_Qualifier_1', 'Championship/Last_Chance_Qualifier_2']
brackets = ['Winners_Bracket', 'Losers_Bracket', 'Finals']
year = 2022

for region in regions:
    for lcq in lcqs:
        for bracket in brackets:
            folder_gen(region, lcq, bracket)

In [47]:
for region in regions:
    for lcq in lcqs:
        for bracket in brackets:
            bracket_data(year, lcq, region, bracket)

## Championship Games

These championship games are the last bit of data that we need! We will reuse the approach from the Split 2 playoffs to collect the data, and then we can move on to visualization!

In [48]:
champ = 'Championship'
brackets = ['Group_Stage', 'Bracket_Stage', 'Finals']
year = 2022

for bracket in brackets:
    # exist_ok=True makes the cell safe to re-run
    os.makedirs(f'../Outputs/Championship/{bracket}', exist_ok=True)

In [49]:
URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/{year}/{champ}/Group_Stage"
table_list = pd.read_html(URL)
points = table_list[13]
points.head()

All columns sit under the top-level header `B vs C StandingsPlacements and killsPoints`; each round contributes a P and a K column.

| | Unnamed | Team | Total | Round 1 P | Round 1 K | Round 2 P | Round 2 K | Round 3 P | Round 3 K | Round 4 P | Round 4 K | Round 5 P | Round 5 K | Round 6 P | Round 6 K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | FUR | 68 | 112 | 55 | 29 | 1212 | 111 | 66 | 160 | 33 | 121 | 33 | 63 | 1313 |
| 1 | 2.0 | EXO | 57 | 37 | 44 | 160 | 0 | 112 | 1414 | 111 | 0 | 29 | 99 | 121 | 0 |
| 2 | 3.0 | FNX | 51 | 29 | 88 | 190 | 0 | 29 | 55 | 54 | 11 | 45 | 33 | 131 | 66 |
| 3 | 4.0 | SNG | 49 | 63 | 55 | 112 | 88 | 45 | 77 | 200 | 11 | 151 | 33 | 54 | 0 |
| 4 | 5.0 | E6 | 48 | 54 | 77 | 170 | 11 | 190 | 0 | 170 | 0 | 112 | 1212 | 29 | 33 |

Round details carried in the scraped column headers (time, map, then team names run together as scraped):
- Round 1: July 08, 2022 - 13:00 EDT, World's Edge, FURIA Esports Fênix Team EXO Clan
- Round 2: July 08, 2022 - 13:30 EDT, World's Edge, Team Singularity FURIA Esports Luminosity Gaming
- Round 3: July 08, 2022 - 14:00 EDT, World's Edge, EXO Clan Fênix Team NRG
- Round 4: July 08, 2022 - 14:30 EDT, Storm Point, Cloud9 Sutoraiku NRG
- Round 5: July 08, 2022 - 15:00 EDT, Storm Point, Element 6 EXO Clan Luminosity Gaming
- Round 6: July 08, 2022 - 15:30 EDT, Storm Point, Team Liquid Element 6 GØDFIRE
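
The K columns in the raw output above contain values such as 1313 and 1212, which look like the kill count written twice in a row. Cleaning this up is presumably the job of the notebook's `kill_fix` helper. The sketch below is not that function; `dedupe_kills` is a hypothetical illustration of the idea, and it assumes every value is a non-empty string of digits that is either a single `0` or an exact doubling.

```python
def dedupe_kills(raw):
    """Return the kill count from a value that may be written twice.

    Hypothetical helper, assuming doubled values such as 1313 (13 kills)
    or 88 (8 kills). Values that are not an exact doubling are returned
    unchanged as integers.
    """
    text = str(raw)
    half, odd = divmod(len(text), 2)
    if not odd and text[:half] == text[half:]:
        return int(text[:half])
    return int(text)


# Sample K values taken from the output above
samples = [88, 22, 0, 1313, 11, 1010]
print([dedupe_kills(value) for value in samples])
```

Checking that both halves match, rather than simply halving the string, keeps an unexpected value from being truncated without notice.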


In [50]:
# Similar to Split 2 playoffs
for bracket in brackets:
    directory = f'../Outputs/Championship/{bracket}'
    
    URL = f"https://liquipedia.net/apexlegends/Apex_Legends_Global_Series/{year}/{champ}/{bracket}"
    #print(URL)
    table_list = pd.read_html(URL)
    
    if bracket == 'Finals':    
        points = table_list[6]
        placement_replace(points)
        kill_fix(points)
        # directory already starts with ../Outputs, so write to it directly
        points.to_csv(f'{directory}/Points.csv')

        matchpoint = table_list[7]
        matchpoint.to_csv(f'{directory}/Matchpoint.csv')
    elif bracket == 'Bracket_Stage':
        # Tables 6, 8 and 10 hold the points tables; number the files 1, 2, 3
        for n, i in enumerate(range(6, 12, 2), start=1):
            points = table_list[i]
            placement_replace(points)
            kill_fix(points)
            points.to_csv(f'{directory}/Round {n}.csv')
    else:
        # Group stage: tables 8 to 13 become Round 1 to Round 6
        for i in range(8, 14):
            points = table_list[i]
            placement_replace(points)
            kill_fix(points)
            points.to_csv(f'{directory}/Round {i-7}.csv')