# Web Scraping Pokemon Tournament Data With BeautifulSoup

The goal of this project is to scrape data from regional-level tournaments for the Pokémon Trading Card Game and leave it in a format that will be useful for future projects. Eventually, I want a notebook where I can type in the name of a tournament, press run, and get a CSV that I can upload to Tableau to build a dashboard with some insight into the tournament.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
import copy
import numpy as np

Here, I set the tournament URL by specifying the city and the year. It would be nice to turn this into a web app at some point, where the user is prompted for the year and the city and everything else is automated. Better yet, the rk9 page listing all tournaments could be scraped to provide a dropdown menu of tournaments. However, that is beyond the scope of this project (for now).

In [2]:
# Set the tournament name with the year and city of the regional

tournament = 'pokemon-merida-2025'

homepage = 'https://rk9.gg/event/' + tournament
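
As a small sketch of how the city and year could be turned into the homepage URL automatically, the helper below builds the same string as the cell above. The function name `build_homepage` is hypothetical, and the slug pattern `pokemon-<city>-<year>` is an assumption based only on the single example used here; multi-word cities in particular may be formatted differently on rk9.

```python
def build_homepage(city, year, base="https://rk9.gg/event/"):
    """Build the event homepage URL from a city and a year.

    Assumes the slug pattern 'pokemon-<city>-<year>' seen in the example
    above; hyphenating spaces in the city name is a guess.
    """
    city_slug = city.strip().lower().replace(" ", "-")
    return "{}pokemon-{}-{}".format(base, city_slug, year)


homepage_example = build_homepage("Merida", 2025)
print(homepage_example)
```

If the pattern holds for a given event, the result can be passed anywhere `homepage` is used in the rest of the notebook.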

## Fetching URLs from the homepage

Now, using BeautifulSoup, we can locate the links for the roster and the pairings of the specified tournament. Note that the box with the links for TCG is sometimes labelled "indigo" instead of "blue". This function returns a list with the URL for the roster as the first element and the URL for the pairings as the second.

In [3]:
def get_rk9_urls(homepage):
    soup = BeautifulSoup(requests.get(homepage).text, 'html.parser') # name the parser explicitly

    tcg = soup.find("div", class_ = 'card h-100 mt-3 p-2 shadow bg-blue-050') # Locate tcg box (sometimes indigo)

    # Find url extensions for roster and pairings
    # (the pattern is a regex searched anywhere in the href, so no trailing * is needed)
    roster_code = tcg.find('a', {'href': re.compile(r'/roster')})['href']
    pairings_code = tcg.find('a', {'href': re.compile(r'/pairings')})['href']

    roster_url = 'https://rk9.gg' + roster_code
    pairings_url = 'https://rk9.gg' + pairings_code
    
    return [roster_url,pairings_url]
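
The function above builds the full URLs by string concatenation, which works as long as every `href` is a path starting with `/`. A slightly more defensive alternative, sketched below, is `urllib.parse.urljoin` from the standard library, which also copes with an `href` that is already absolute. The helper name `absolute_url` and the `EXAMPLE` path segments are made up for illustration and are not real rk9 identifiers.

```python
from urllib.parse import urljoin

BASE_URL = "https://rk9.gg"


def absolute_url(href, base=BASE_URL):
    """Resolve an href found on the page against the site's base URL."""
    return urljoin(base, href)


# Hypothetical hrefs; real ones come from the tournament homepage
print(absolute_url("/roster/EXAMPLE"))
print(absolute_url("https://rk9.gg/pairings/EXAMPLE"))
```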

## Scraping the roster of all players

The roster url leads to a table of all players in the tournament. As columns we have: player ID (only first and last digits are displayed), First Name, Last Name, Country, Division, Deck List (as a link), Standing.

The following function will output this table either as a CSV or as a pandas DataFrame for further processing. In another project I'll write a classifier function that takes the deck list URL and returns the archetype. This data can be combined with the pairings data to generate a matchup chart. Even better would be to filter the games included in the matchup data according to player Elo, so that all matchup data comes from high-quality gameplay.

In [4]:
def get_roster(homepage, csv = False, filename = tournament + 'roster.csv'):
    
    url = get_rk9_urls(homepage)[0] #set player roster url
    
    soup = BeautifulSoup(requests.get(url).text, 'html.parser') #load in the soup

    table = soup.find('table') #roster page only has one table
    
    headers = table.find_all('th') #Find the column headers
    headers = [heading.string for heading in headers]
    
    body = table.find('tbody') # isolate the body of the table
    
    rows = body.find_all('tr') # get rows
    
    all_roster_data = []
    for row in rows:  # This loop isolates the text in each cell

        row_data = row.find_all('td')
        individual_row_data = [data.text.strip() for data in row_data]

        dlist = row_data[-2]
        dlist_url = dlist.find('a')['href']  # This is grabbing the decklist url, otherwise you just get "view"

        individual_row_data[-2] = dlist_url

        all_roster_data.append(individual_row_data)
        
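    # Build the DataFrame once, after every row has been collected
    df = pd.DataFrame(all_roster_data, columns = headers)

    if csv:
        return df.to_csv(filename) # writes the file; to_csv returns None when given a path
    return df #returns the desired table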
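
To make the row-parsing step easier to follow without hitting the live site, here is a minimal offline sketch that runs the same idea on a tiny hand-written table. The HTML, the column subset, and the `EXAMPLE1` link are invented for illustration and are not real rk9 data. The sketch also assumes that some rows might have no deck list link and stores `None` in that case, which the function above does not handle.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the roster page; the real table has more columns
SAMPLE_HTML = """
<table>
  <thead><tr><th>First Name</th><th>Deck List</th><th>Standing</th></tr></thead>
  <tbody>
    <tr><td> Ash </td><td><a href="/decklist/public/EXAMPLE1">View</a></td><td>1</td></tr>
    <tr><td>Misty</td><td></td><td>2</td></tr>
  </tbody>
</table>
"""


def parse_roster_rows(html):
    """Return (headers, rows), where each row is a dict keyed by column header."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]

    rows = []
    for tr in table.find("tbody").find_all("tr"):
        cells = tr.find_all("td")
        values = [td.get_text(strip=True) for td in cells]
        # Swap the link text ("View") for the link target, if there is one
        link = cells[-2].find("a")
        values[-2] = link["href"] if link is not None else None
        rows.append(dict(zip(headers, values)))
    return headers, rows


sample_headers, sample_rows = parse_roster_rows(SAMPLE_HTML)
for row in sample_rows:
    print(row)
```

The list of dicts can be handed to `pd.DataFrame(sample_rows)` if a DataFrame is wanted.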

## Scraping the matchup data from each round

First I'll create a function that queries the pairings page for a given round, and returns a dataframe of each game from that round along with the results.

Some cleaning is also done in this function to deal with players dropping, being disqualified, or not showing up to the round.

In [5]:
def get_round_pairings(round_number, total_rounds, pairings_url):
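    # Request the pairings for the given round; the query parameters are passed explicitly as params
    response = requests.get(pairings_url, params = {'pod' : '2', 'rnd' : str(round_number)})
    pairing_soup = BeautifulSoup(response.text, 'html.parser')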
    games = pairing_soup.find_all('div', class_ = "row row-cols-3 match no-gutter complete")
    P1_names = [game.find('span', class_ = 'name').text for game in games]
    P2_names = [game.find_all('span', class_ = 'name')[-1].text for game in games]

    # Grab the match results, and deal with dropped players

    P1_result = [game.find('div', class_ = re.compile("col-5 text-center player*"))['class'][-1] for game in games]
    for i in range(len(P1_result)):
        game = games[i]
        if P1_result[i] == 'dropped':
            P1_result[i] = game.find('div', class_ = re.compile("col-5 text-center player*"))['class'][4]

        if P1_result[i] == 'dropped':
            P1_result[i] = 'double game loss'

    # Generate a DataFrame for the round

    round_dict = {
        'Player 1' : P1_names,
        'Player 2' : P2_names,
        'Result' : P1_result
        }

    round_df = pd.DataFrame(round_dict)
    
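    # Drop rows where both names are identical, which happens when only one
    # name is found in a match (for example, a player without an opponent)
    mask = (round_df['Player 1'] == round_df['Player 2'])
    
    round_df = round_df[~mask]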
    
    return round_df
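
The result clean-up inside the function can be hard to read, so here is the same decision rule pulled out into a small standalone function. It takes the list of CSS classes on Player 1's cell and returns the result string. The helper name `resolve_result` and the sample class lists (including `"x"`, `"winner"`, and `"loser"`) are hypothetical placeholders; the real class names on rk9 may differ.

```python
def resolve_result(classes):
    """Return Player 1's result from the CSS classes on their cell.

    Mirrors the rule used above: the last class is normally the result;
    if it is 'dropped', the class at index 4 is used instead; if that is
    also 'dropped', the game is recorded as a double game loss.
    """
    result = classes[-1]
    if result == "dropped" and len(classes) > 4:
        result = classes[4]
    if result == "dropped":
        result = "double game loss"
    return result


# Hypothetical class lists, for illustration only
print(resolve_result(["col-5", "text-center", "player1", "x", "winner"]))
print(resolve_result(["col-5", "text-center", "player1", "x", "loser", "dropped"]))
print(resolve_result(["col-5", "text-center", "player1", "x", "dropped"]))
```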

Now I'll use the above function to generate the dataframe for every round and concatenate them together. This is what we want for calculating matchup percentages, since we don't care during which round each game was played.

However, calculating Elo ratings will need a different approach, since the order in which the games were played does matter.

In [6]:
pairings_url = get_rk9_urls(homepage)[1]
    
total_soup = BeautifulSoup(requests.get(pairings_url).text, 'html.parser')
total_rounds = int(total_soup.find_all('a', id = re.compile('P2R*'))[-2].text[1:])
all_matches_df = pd.DataFrame({}, columns = ['Player 1', 'Player 2', 'Result'])

for round_number in range(1, total_rounds+1):
    #print(round_number)
    round_df = get_round_pairings(round_number, total_rounds, pairings_url)
    all_matches_df = pd.concat([all_matches_df,round_df],axis = 0,ignore_index = True)
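
To illustrate why game order matters for Elo, here is a minimal sketch of the standard Elo update applied to games sorted by round. It is not part of this notebook's pipeline: the tuple layout `(round_number, player_1, player_2, score_for_player_1)`, the starting rating of 1500, and the K-factor of 32 are all assumptions, and mapping the scraped `Result` strings to scores of 1, 0.5, or 0 would still have to be done separately.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Return both players' new ratings after one game (score_a: 1, 0.5 or 0)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b


def run_elo(games, start=1500.0, k=32):
    """Process games in round order and return a dict of final ratings."""
    ratings = {}
    for _, p1, p2, score in sorted(games, key=lambda game: game[0]):
        r1 = ratings.get(p1, start)
        r2 = ratings.get(p2, start)
        ratings[p1], ratings[p2] = update_elo(r1, r2, score, k)
    return ratings


# Two hypothetical games between the same players, listed out of order
example_games = [(2, "A", "B", 0.0), (1, "A", "B", 1.0)]
print(run_elo(example_games))
```

Because each update depends on the ratings at the time of the game, the round number has to be kept alongside each game, which the concatenated DataFrame above does not do.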

Now I can save everything as CSV files for a Tableau dashboard.

In [7]:
get_roster(homepage, csv = True, filename = 'roster_for_'+tournament+'.csv')
all_matches_df.to_csv('all_games_from_'+tournament+'.csv')