# Yugioh TCG Complete Card Database
This notebook performs web scraping to load all available cards and their sets from scratch. The work is divided into 2 main processing tasks, both done in this notebook, plus an additional merge from 3rd party sources for the collectibles market.

  1. **Indexing**: Index every unique card, where "unique" means a card that the official rules of the trading card game treat identically regardless of printing.
  2. **Expanding**: Each unique card has multiple rarities, releases, and sets; this task finds them all.
  3. **Merging**: Once we have the data from official sources, we can merge data from 3rd parties like TCGPlayer to understand things like the price.
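
As a rough illustration of how the three stages relate, the sketch below uses plain dictionaries. The card values, set ids, the `prices` lookup, and the `expand` and `merge_prices` helpers are all made up for illustration; none of them come from the Konami database or TCGPlayer.

```python
# Stage 1 (Indexing): one record per unique card. Values are illustrative.
index = [{'index': '0000', 'name': 'Example Dragon', 'type': 'MONSTER'}]

# Stage 2 (Expanding): every printing of a card, keyed by its index id.
printings = {
    '0000': [
        {'set_id': 'EX01-EN001', 'rarity': 'Ultra Rare'},
        {'set_id': 'EX02-EN010', 'rarity': 'Common'},
    ]
}

def expand(index, printings):
    """Return one row per (card, printing) pair."""
    rows = []
    for card in index:
        for printing in printings.get(card['index'], []):
            rows.append({**card, **printing})
    return rows

# Stage 3 (Merging): hypothetical third-party prices keyed by set_id and rarity.
prices = {('EX01-EN001', 'Ultra Rare'): 12.50}

def merge_prices(rows, prices):
    """Attach a price where one is known, otherwise None."""
    return [{**row, 'price': prices.get((row['set_id'], row['rarity']))}
            for row in rows]

expanded = expand(index, printings)
merged = merge_prices(expanded, prices)
print(len(index), len(expanded), merged[0]['price'], merged[1]['price'])
```

One indexed card becomes two expanded rows here, and only the row with a known price gets a value after the merge.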

## Rules
The code is meant to be not only compliant but also respectful to the web servers that host the data. The scraper follows robots.txt and goes further by keeping its request volume low. The Appendix contains an aggressive alternative that is far faster (it runs up to 1000 requests in parallel) but will overwhelm the data sources.

In [None]:
import requests
print(requests.get('https://tcgplayer.com/robots.txt').text)

In [None]:
# Returns a 404 (file not found): no robots.txt is published, so the site states no crawl rules
requests.get('https://www.yugioh-card.com/robots.txt')
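
Printing robots.txt is useful for reading the rules by eye. To check a URL programmatically, the standard library's `urllib.robotparser` can be used, as in the minimal sketch below. The `SAMPLE_ROBOTS` text and the `example.com` URLs are hypothetical and only demonstrate the mechanics offline; `build_parser` is a helper name introduced here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only to show the mechanics offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS)
allowed = parser.can_fetch('*', 'https://example.com/cards/1')
blocked = parser.can_fetch('*', 'https://example.com/private/1')
delay = parser.crawl_delay('*')
print(allowed, blocked, delay)
```

In the notebook, the text passed to `build_parser` would be the body of the robots.txt response, and `crawl_delay` could feed a pause between requests when the site declares one.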

## 1. Index Configuration & Example
Configuration details, including the URLs that need to be predefined, are set here. A basic example is provided to show how the parsing works and to confirm that it runs in this environment. We then loop through all cards using the same approach as the example.

In [None]:
# this is the card search from the front end where we are going to start loading the cards
config = {
    'base-url' : 'https://www.db.yugioh-card.com/yugiohdb/card_search.action?ope=1&sess=1&rp=10&mode=&sort=1&keyword=&stype=1&ctype=&othercon=2&starfr=&starto=&pscalefr=&pscaleto=&linkmarkerfr=&linkmarkerto=&link_m=2&atkfr=&atkto=&deffr=&defto=&releaseDStart=1&releaseMStart=1&releaseYStart=1999&releaseDEnd=&releaseMEnd=&releaseYEnd=&page=',
    'index-url' : 'https://www.db.yugioh-card.com/yugiohdb/card_search.action?ope=2&cid=',
    'link-monster-img': 'https://www.db.yugioh-card.com/yugiohdb/external/image/parts/link_pc/link{P}.png'
}
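
The search URL packs its options into the query string, and the page number is appended at the end. The sketch below splits a shortened copy of that URL into its parameters with `urllib.parse`. Reading `rp` as "results per page" is an assumption based on its value of 10, and `page_url` and `query_params` are helper names introduced here.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the search URL; the full one carries many more empty filters.
base_url = ('https://www.db.yugioh-card.com/yugiohdb/card_search.action'
            '?ope=1&sess=1&rp=10&sort=1&keyword=&page=')

def page_url(base, page):
    """Append the page number to the base URL by string concatenation."""
    return base + str(page)

def query_params(url):
    """Return the query string as a dict of lists, keeping empty values."""
    return parse_qs(urlsplit(url).query, keep_blank_values=True)

params = query_params(page_url(base_url, 2))
print(params['rp'], params['page'])
```

Inspecting the parameters this way makes it easier to see which filters are left empty and which ones control paging.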

In [None]:
from bs4 import BeautifulSoup

In [None]:
example = requests.get(config['base-url'])

In [None]:
soup = BeautifulSoup(example.text, 'html.parser')

In [None]:
soup

In [None]:
hand = soup.find_all('div', class_ = 't_row')

## Scrape Index
The Yugioh card index from Konami is paginated with 10 cards on each page. We go through every page and parse all of its cards. The following function processes one page, or `hand`, and returns its cards as a list of results.

In [None]:
def transform_hand(hand):
    '''
    Takes in the HTML rows of one page and returns a list of dictionaries, one per card

    Description: Description of the card's effects or abilities.
    Name: The card's name.
    Attribute: The attribute of the monster (e.g., LIGHT, DARK).
    Level/Rank: Level for normal monsters, rank for Xyz monsters, or link rating plus marker code for Link monsters.
    Attack: Attack points of the monster.
    Defense: Defense points of the monster.
    Type: Card type (e.g., Monster, Spell, Trap).
    SubType: Specific type within the card type (e.g., Dragon, Warrior).
    '''
    results = []
    for card in hand:
        soup = BeautifulSoup(str(card), 'html.parser')
        # these are fields shared by all cards
        
        result = {
            'index': soup.find('input', {'class': 'cid'})['value'],
            'name': soup.find('span', {'class': 'card_name'}).text,
            'description': soup.find('dd', {'class': 'box_card_text'}).text.strip(),
            'type': soup.find('span', {'class': 'box_card_attribute'}).text.strip()
        }

        # based on the type, each card can have different fields
        if result['type'] in ['SPELL', 'TRAP'] :
            sub_type = soup.find('span', {'class': 'box_card_effect'})
            result.update({
                'sub_type': sub_type.text.strip() if sub_type else '',
                'attribute': '',
                'rank': '',
                'attack': '',
                'defense': ''
            })
        else : # MONSTER card
            # check if link card
            link = soup.find('span', {'class': 'box_card_linkmarker'})
            if link :
                # link rating text plus the marker code taken from the image filename
                rank = link.text.strip()
                links = soup.find('img', {'title': 'Link'})['src'].split('/')[-1].replace('.png', '').replace('link', '')
                rank = f'{rank} P{links}'
            else:
                rank = soup.find('span', {'class': 'box_card_level_rank'}).text.strip()
            result.update({
                'type': 'MONSTER',
                'attribute': result['type'],
                'sub_type': soup.find('span', {'class': 'card_info_species_and_other_item'}).text.replace('\r\n', '').replace('\t', ''),
                'rank': rank,
                'attack': soup.find('span', {'class': 'atk_power'}).text.strip().split()[-1],
                'defense': soup.find('span', {'class': 'def_power'}).text.strip().split()[-1]
            })
        
        results.append(result)
    
    return results
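
For Link monsters, `transform_hand` builds the rank from the link rating text and a code taken from the marker image filename. The sketch below isolates that string handling. The rating text `'Link 2'` and the marker code `'246'` are made-up sample values, and `link_code` and `link_rank` are helper names introduced here.

```python
LINK_IMG = 'https://www.db.yugioh-card.com/yugiohdb/external/image/parts/link_pc/link{P}.png'

def link_code(src):
    """Pull the marker code out of an image path such as '.../link246.png'."""
    filename = src.split('/')[-1]
    return filename.replace('.png', '').replace('link', '')

def link_rank(rank_text, src):
    """Combine the link rating text with the marker code."""
    return f'{rank_text} P{link_code(src)}'

sample_src = LINK_IMG.format(P='246')   # '246' is a made-up marker code
print(link_rank('Link 2', sample_src))
```

Only the last path segment is processed, so the `link_pc` folder name in the URL does not interfere with removing the `link` prefix.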

In [None]:
cards = transform_hand(hand)

In [None]:
page = 2 # the example above already loaded the first page, so continue from page 2
hand = cards # non-empty, so the while loop below runs at least once

In [None]:
# iterate through database and collect links to cards
while len(hand) > 0:
  print(page)
  data = None
  
  while data is None:
    try: # connect
      data = requests.get(config['base-url'] + str(page))
    except requests.RequestException:
      # network error: retry the same page
      print('trying again')
  
  soup = BeautifulSoup(data.text, 'html.parser')
  hand = transform_hand(soup.find_all('div', class_ = 't_row'))
  cards.extend(hand)
  page += 1
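
To stay gentle on the server, a pagination loop can pause between pages and give up after a bounded number of retries. The sketch below shows one way to do that under stated assumptions: `crawl_pages` and `fetch_page` are hypothetical names, the delay and retry values are arbitrary, and a fake page source stands in for the network.

```python
import time

def crawl_pages(fetch_page, start=1, delay=1.0, max_retries=3, sleep=time.sleep):
    """Collect rows page by page until an empty page is returned.

    fetch_page(page) is a hypothetical callable that returns a list of rows
    and raises ConnectionError on a network failure.
    """
    rows, page = [], start
    while True:
        for attempt in range(1, max_retries + 1):
            try:
                hand = fetch_page(page)
                break
            except ConnectionError:
                if attempt == max_retries:
                    raise                    # give up instead of looping forever
                sleep(delay * attempt)       # back off a little more each time
        if not hand:
            return rows
        rows.extend(hand)
        page += 1
        sleep(delay)                         # pause between pages

# Usage with a fake source: two pages of data, then an empty page.
fake_pages = {1: ['a', 'b'], 2: ['c']}
result = crawl_pages(lambda p: fake_pages.get(p, []), delay=0, sleep=lambda s: None)
print(result)
```

Passing `sleep` as a parameter keeps the function easy to try out without waiting; in real use the default `time.sleep` applies.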

In [None]:
len(cards)

In [None]:
deck = cards

## Save Index
We will save the index to a CSV file. Change the filename below before running so that an earlier export is not overwritten.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(deck)

In [None]:
df

In [None]:
df.to_csv('202408141845.csv')
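
The filename above looks like a `YYYYMMDDHHMM` timestamp. Assuming that is the intended convention, the name can be generated instead of typed by hand. `timestamped_name` is a helper name introduced for this sketch.

```python
from datetime import datetime

def timestamped_name(now=None, suffix='.csv'):
    """Build a name such as '202408141845.csv' from a datetime."""
    now = now or datetime.now()
    return now.strftime('%Y%m%d%H%M') + suffix

print(timestamped_name(datetime(2024, 8, 14, 18, 45)))
```

Calling `df.to_csv(timestamped_name())` would then write a new file on each run, as long as runs are at least a minute apart.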

## 2. Expand Configuration and Example
Each card has a variety of information that can be parsed from its detail page. This includes the release dates and the sets the card was part of.

In [None]:
example = requests.get(config['index-url'] + df.iloc[456]['index'])

In [None]:
deck = list(df['index'])
print(len(deck))
print(deck[:10])

In [None]:
with open('example.html', 'w+') as f :
    f.write(example.text)

## Expand Scrape
Based on the index parameters, we can now query the official Konami database to request all of the details available about each card. Because there is so much data, we first save every detail page and then process the saved files. Each page is the same HTML served by the card website and is saved into a folder called `output`.

In [None]:
import os

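os.makedirs('output', exist_ok=True) # create the folder on a first run
completed = set(os.listdir('output/')) # a set gives fast membership checks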
for i, index in enumerate(deck) :
    print(f'\rCard {i} ID: {index}{" "*50}', end='')
    
    filename = f'output/{index}.html'
    
    if f'{index}.html' in completed :
        print(f'Found {filename} already, skipping.')
        continue

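    response = None
    # compare against None: a Response is falsy for 4xx/5xx statuses,
    # so `not response` would retry a failing page forever
    while response is None :
        try:
            response = requests.get(config['index-url'] + index)
        except requests.RequestException:
            print('Retrying')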

    if response.ok :
        with open(filename, 'w+') as f:
            f.write(response.text)
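
The download step is resumable because pages already on disk are skipped. The sketch below shows that pattern in isolation, with a fake fetcher and a temporary folder so it runs offline. `download_missing` and its `fetch` argument are hypothetical names, not part of the notebook.

```python
import os
import tempfile

def download_missing(ids, fetch, folder):
    """Save fetch(card_id) to folder/<card_id>.html unless it already exists.

    fetch is a hypothetical callable returning HTML text, or None on failure.
    Returns the ids that were written during this call.
    """
    os.makedirs(folder, exist_ok=True)
    done = set(os.listdir(folder))          # files saved by earlier runs
    written = []
    for card_id in ids:
        name = f'{card_id}.html'
        if name in done:
            continue
        html = fetch(card_id)
        if html is None:
            continue                        # leave it for the next run
        with open(os.path.join(folder, name), 'w', encoding='utf-8') as f:
            f.write(html)
        written.append(card_id)
    return written

# Usage with a fake fetcher and a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    first = download_missing(['1', '2'], lambda i: f'<html>{i}</html>', tmp)
    second = download_missing(['1', '2', '3'], lambda i: f'<html>{i}</html>', tmp)
print(first, second)
```

The second call only fetches the id that was not saved by the first, which is what makes an interrupted run cheap to restart.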

In [None]:
card_files = os.listdir('output/')
len(card_files)
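
Comparing the file count with the index size shows whether anything is missing, but not which cards. A small sketch that lists the missing ids is given below. `missing_ids` is a helper name introduced here, and the ids are made-up sample values.

```python
def missing_ids(deck, filenames):
    """Return ids from deck that have no '<id>.html' entry in filenames, in order."""
    saved = {name[:-len('.html')] for name in filenames if name.endswith('.html')}
    return [card_id for card_id in deck if card_id not in saved]

print(missing_ids(['4007', '4008', '4009'], ['4007.html', '4009.html', 'notes.txt']))
```

In the notebook this could be called as `missing_ids(deck, os.listdir('output/'))` to decide whether the download cell needs another run.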

In [None]:
# define relevant columns
index_df = df
df = df[['index', 'name', 'description', 'type',
       'sub_type', 'attribute', 'rank', 'attack', 'defense']]

In [None]:
from bs4 import BeautifulSoup

cards = []
for i, index in enumerate(df['index']) :
    # extract card from database
    card_default = df[df['index'] == index].to_dict(orient='records')[0]

    # extract card details
    file_path = f'output/{index}.html'
    with open(file_path, 'r') as f:
        data = f.read()
        soup = BeautifulSoup(data, 'html.parser')
        sets = soup.find('div', {'id': 'update_list'}).find_all('div', class_='t_row')
    # loaded all sets, now parse
    if len(sets) < 1 :
        print('error , no sets found: ' + card_default['name'])
    for card_set in sets : # avoid shadowing the built-in `set`
        card = card_default.copy()
        # gets set name, release, card set id
        set_soup = BeautifulSoup(str(card_set), 'html.parser')
        card.update({
            'set_name': set_soup.find('div', {'class': 'pack_name'}).text.strip(),
            'set_id': set_soup.find('div', {'class': 'card_number'}).text.strip(),
            'set_release': set_soup.find('div', {'class': 'time'}).text.strip(),
            'rarity': ' '.join(set_soup.find('div', {'class': 'lr_icon'}).text.split())
        })
        cards.append(card)
        print(f"\rGenerated {card['set_id']} {card['rarity']} {i+1} {len(df)} {' '*50}", end='')

In [None]:
len(cards)

## Finalize Complete Card Database
We will now save the expanded version; make sure to change the filename. It is considered complete in terms of official data sources. Additional data such as market prices can be merged in afterwards, but that is considered 3rd party.

In [None]:
full_df = pd.DataFrame(cards)
full_df

In [None]:
full_df.to_csv('202408142150.csv')

## Appendix

In [None]:
'''
# reference https://www.scrapingbee.com/tutorials/make-concurrent-requests-in-python/
import concurrent.futures
import requests

MAX_RETRIES = 5 # Setting the maximum number of retries if we have failed requests to 5.
MAX_THREADS = 1000

def scrape(url):
    for _ in range(MAX_RETRIES):
        response = requests.get(config['index-url'] + url) # Scrape!

        if response.ok: # If we get a successful request
            index = url.split('=')[-1] #the cid parameter in the url
            print(index)
            with open(f'output/{index}.html', 'w+') as f:
                f.write(response.text)
            break # stop retrying once the page is saved

        else: # If we get a failed request, then we continue the loop
            print(response.content)

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    executor.map(scrape, list(set(deck)))
'''
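
A gentler middle ground between the sequential loop and the 1000-thread version is a small, bounded thread pool. The sketch below assumes a worker count of 4 and an optional per-request delay; both values are arbitrary assumptions, not measured limits. `polite_fetch`, `fetch_all`, and the fake fetcher are hypothetical names used so the example runs offline.

```python
import concurrent.futures
import time

MAX_WORKERS = 4      # deliberately small; an assumption, not a measured limit
DELAY = 0.0          # seconds each worker waits before a request; raise for real use

def polite_fetch(card_id, fetch, delay=DELAY):
    """Wait, then fetch one card; returns (card_id, result)."""
    time.sleep(delay)
    return card_id, fetch(card_id)

def fetch_all(ids, fetch, max_workers=MAX_WORKERS):
    """Fetch every unique id with a bounded thread pool."""
    unique = sorted(set(ids))
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(executor.map(lambda i: polite_fetch(i, fetch), unique))

pages = fetch_all(['2', '1', '2'], lambda i: f'<html>{i}</html>')
print(pages)
```

Duplicate ids are fetched once, and the worker count caps how many requests can be in flight at the same time.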

In [None]:
import os
outputs = os.listdir('output/')
print(len(outputs))