# TCG Player Yugioh Cards Scraper
This notebook scrapes Yugioh card data from TCG Player in two steps:

  1. Search cards by card name
  2. Grab the latest sales data for each card


## Setup

In [1]:
try_count = 0  # set to a value greater than 0 when rerunning or resuming after a crash

In [2]:
# fetch and print robots.txt to see the site's crawling rules
import requests
import datetime
import sqlite3
import utils
print(requests.get('https://tcgplayer.com/robots.txt').text)


User-agent: *
Crawl-Delay: 10
Allow: /
Disallow: /*?*seller=*
Disallow: /login
Disallow: /search/articles
Disallow: /content/magic-the-gathering/deck/
Disallow: /content/disney-lorcana/deck/
Disallow: /content/yugioh/deck/
Disallow: /content/pokemon/deck/
Disallow: /content/flesh-and-blood/deck/
Sitemap: https://www.tcgplayer.com/sitemap/index.xml



In [3]:
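# set the output SQLite database

# when resuming (try_count > 0), fill in the timestamp of the existing database by hand
if try_count > 0:
    target_timestamp = ''  # timestamp portion of the earlier database file name
    db = f'yugioh-tcgplayer-{target_timestamp}.db'
else:
    db = f'yugioh-tcgplayer-{utils.timestamp()}.db'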
print(db)

yugioh-tcgplayer-2025SEP05-185755.db


In [4]:
# Configure logging
logging = utils.logging
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(f"{db}-scraper.log"),
        #logging.StreamHandler() outputs to cell, can lead to huge file sizes
    ]
)

In [5]:
# Setup SQLite database
conn = sqlite3.connect(db)
cursor = conn.cursor()

# Create table if it doesn't exist
cursor.execute('''
CREATE TABLE IF NOT EXISTS tcgplayer_files (
    ygo_index TEXT,
    valid TEXT,
    content TEXT
)
''')
cursor.execute('''
CREATE TABLE IF NOT EXISTS tcgplayer_html_files (
    ygo_index TEXT,
    valid TEXT,
    content TEXT
)
''')
conn.close()
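
The two tables above hold one row per scraped page. As a minimal sketch of how rows could be written and read back, the example below uses an in-memory database and bound parameters (`?`) rather than string formatting. The helpers `save_page` and `load_page` are hypothetical and are not part of `utils`, and the value stored in `valid` is an assumption made for illustration.

```python
import sqlite3


def save_page(conn, ygo_index, valid, content):
    """Insert one scraped page using bound parameters."""
    conn.execute(
        "INSERT INTO tcgplayer_html_files (ygo_index, valid, content) VALUES (?, ?, ?)",
        (ygo_index, valid, content),
    )
    conn.commit()


def load_page(conn, ygo_index):
    """Return the stored content for an index, or None if it is missing."""
    row = conn.execute(
        "SELECT content FROM tcgplayer_html_files WHERE ygo_index = ?",
        (ygo_index,),
    ).fetchone()
    return row[0] if row else None


# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tcgplayer_html_files "
    "(ygo_index TEXT, valid TEXT, content TEXT)"
)
save_page(conn, "90804", "True", "<html>placeholder</html>")  # placeholder values
print(load_page(conn, "90804"))
print(load_page(conn, "missing"))
```

Bound parameters keep quoting correct even when an index or page content contains a quote character.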

## Web Scraping Limits & Rules
Our goal is not only to stay compliant but also to be considerate of the servers we scrape. The requirements in robots.txt must be adhered to. Here are some of the fields and what they mean.

### User-agent: *
The User-agent header is sent with every request and identifies the client, such as a web browser or a crawler. The wildcard, *, means 'all', so the rules that follow apply to every client rather than to one specific crawler.
### Crawl-Delay: 10
The number of seconds to wait between requests. This matters because it keeps web scrapers from overloading the server. There are many ways to scrape, and some of the fastest can send requests by the million, far more than a server expects from a single client.
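
As a minimal sketch of honoring this delay, the helper below blocks until at least 10 seconds have passed since the previous request. The class name `CrawlDelay` is hypothetical and is not how `utils` necessarily does it. The clock and sleep functions are injectable only so the behavior can be checked without actually waiting.

```python
import time


class CrawlDelay:
    """Wait until `delay` seconds have passed since the previous request."""

    def __init__(self, delay=10.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep for the remaining delay and return the seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited


# Demonstration with a fake clock so nothing really sleeps.
fake_now = [0.0]
limiter = CrawlDelay(
    delay=10.0,
    clock=lambda: fake_now[0],
    sleep=lambda seconds: fake_now.__setitem__(0, fake_now[0] + seconds),
)
first = limiter.wait()   # first request goes out immediately
fake_now[0] += 3.0       # pretend the request itself took 3 seconds
second = limiter.wait()  # waits out the remaining 7 seconds
print(first, second)
```

Calling `wait()` right before each request keeps the spacing correct even when a page takes several seconds to load.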
### Allow: /
This lists the parts of the web application, by URL path, that may be scraped. A '/' is the root directory, which means everything is allowed except what the Disallow rules exclude.
### Disallow: /*?*seller=*
Together with the Allow rule, Disallow rules list the endpoints or URL patterns that must not be requested. Query parameters appear in a URL as ?parameter_name=parameter_value, and the wildcards in this rule match any URL whose query string contains a seller parameter, so TCGPlayer does not allow requests that filter by seller. We will also respect seller privacy even when seller data appears on a page that is allowed.
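
To make the wildcard matching concrete, here is a small hand-rolled sketch that checks a URL against the rules printed above. It treats `*` as "any run of characters", matches each rule as a prefix of the path plus query string, and lets the longest matching rule win with ties going to Allow, which is how RFC 9309 resolves conflicts. The function `is_allowed` is hypothetical, and the sketch ignores other robots.txt features such as the `$` end anchor.

```python
import re
from urllib.parse import urlsplit

# Rules copied from the robots.txt output above.
ALLOW = ['/']
DISALLOW = [
    '/*?*seller=*',
    '/login',
    '/search/articles',
    '/content/magic-the-gathering/deck/',
    '/content/disney-lorcana/deck/',
    '/content/yugioh/deck/',
    '/content/pokemon/deck/',
    '/content/flesh-and-blood/deck/',
]


def _pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a regex where '*' matches anything."""
    return re.compile('.*'.join(re.escape(piece) for piece in pattern.split('*')))


def is_allowed(url):
    """Return True if the longest matching rule for this URL is an Allow rule."""
    parts = urlsplit(url)
    target = parts.path or '/'
    if parts.query:
        target += '?' + parts.query
    best_length, verdict = -1, True
    # Allow rules are checked first so that they win ties of equal length.
    for rules, allowed in ((ALLOW, True), (DISALLOW, False)):
        for pattern in rules:
            if len(pattern) > best_length and _pattern_to_regex(pattern).match(target):
                best_length, verdict = len(pattern), allowed
    return verdict


product = ('https://www.tcgplayer.com/product/285222/'
           'yugioh-2022-tin-of-the-pharaohs-gods-ghost-mourner-and-moonlit-chill')
print(is_allowed(product))
print(is_allowed(product + '?seller=example'))
print(is_allowed('https://www.tcgplayer.com/login'))
```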
### Sitemap: https://www.tcgplayer.com/sitemap/index.xml
This points to the sitemap index, a "hardcoded" list of links and metadata that helps a crawler discover the site's pages and understand what each one is about.

In [6]:
from xml.dom.minidom import parseString
sitemap = parseString(requests.get('https://www.tcgplayer.com/sitemap/index.xml').content)

In [7]:
# get all yugioh sitemaps
yugioh_maps = []
for loc in sitemap.getElementsByTagName('loc') : # all TCGPlayer collections like Pokemon or Magic
    current_map = loc.childNodes[0].data
    if "yugioh" in current_map :
        yugioh_maps.append(current_map)

## TCGPlayer Yugioh Sitemaps
Not all sites define a sitemap. From the structure of this one, we can draw some conclusions:

- Each yugioh map XML file has the list of cards
- We're interested in the URLs that define "product" rather than others like "categories" or "search"
- The links seem to be sorted by newest to oldest products
- The newest products may not have any sales
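
The filtering described in the list above can be sketched as follows. The sample XML is a hand-written stand-in for one of the yugioh sitemap files (its second URL is made up for illustration), and `product_urls` is a hypothetical helper, not necessarily what `utils.get_all_urls` does.

```python
from xml.dom.minidom import parseString

PRODUCT_PREFIX = 'https://www.tcgplayer.com/product/'

# Hand-written sample in the standard sitemap layout, not real sitemap content.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.tcgplayer.com/product/285222/yugioh-2022-tin-of-the-pharaohs-gods-ghost-mourner-and-moonlit-chill</loc></url>
  <url><loc>https://www.tcgplayer.com/search/yugioh/product</loc></url>
</urlset>"""


def product_urls(xml_text):
    """Return the <loc> values that point at product pages, in document order."""
    document = parseString(xml_text)
    urls = []
    for loc in document.getElementsByTagName('loc'):
        if loc.firstChild is None:  # skip empty <loc/> elements
            continue
        url = loc.firstChild.data.strip()
        if url.startswith(PRODUCT_PREFIX):
            urls.append(url)
    return urls


print(product_urls(SAMPLE_SITEMAP))
```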


In [8]:
print(yugioh_maps)

['https://www.tcgplayer.com/sitemap/yugioh.0.xml', 'https://www.tcgplayer.com/sitemap/yugioh.1.xml']


## Advanced Web Scraping
If we're lucky, we can get all the data we need with simple Python tools like **requests**. However, some web apps load the data we need with JavaScript. In our case, some pertinent sales data is loaded dynamically by TCG Player. We could break the page load down into multiple simple requests, but that would multiply our processing time by the same factor, because **robots.txt** asks for at most 1 request per 10 seconds. Advanced web scraping libraries close the gap between what a modern web browser shows for a URL and what a simple Python request returns.

These libraries drive the "engine" of a real web browser, through what is referred to as a "web driver", so we can process the page much as we would experience it in the browser. Different web browsers use different engines. Here, we install the Python library **playwright** and a few web drivers for compatibility. Other libraries are available, including **Selenium**, but it has traditionally required manual web driver installation. Loading the full page in one visit halves the scraping time, because simple requests would need 2 requests per card, and at more than 33,000 cards to scrape the time saved can amount to days.
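
A rough back-of-the-envelope check of that saving, assuming exactly 33,000 cards, a 10 second crawl delay, and no time spent on anything except waiting between requests:

```python
SECONDS_PER_DAY = 86_400


def scrape_days(cards, requests_per_card, crawl_delay_s=10):
    """Days spent waiting if every request is spaced crawl_delay_s apart."""
    return cards * requests_per_card * crawl_delay_s / SECONDS_PER_DAY


one_request = scrape_days(33_000, 1)   # one browser-rendered page per card
two_requests = scrape_days(33_000, 2)  # two simple requests per card
print(f'{one_request:.2f} days vs {two_requests:.2f} days')  # 3.82 days vs 7.64 days
```

Under these assumptions the second request per card costs a little under four extra days.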

### References
https://playwright.dev/python/docs/intro

In [9]:
!pip install pytest-playwright



In [10]:
# install web drivers
!playwright install

╔════════════════════════════════════════════════════════╗
║ Host system is missing dependencies to run browsers.   ║
║ Please install them with the following command:        ║
║                                                        ║
║     sudo playwright install-deps                       ║
║                                                        ║
║ Alternatively, use apt:                                ║
║     sudo apt-get install libgstreamer-plugins-bad1.0-0 ║
║                                                        ║
║ <3 Playwright Team                                     ║
╚════════════════════════════════════════════════════════╝
    at validateDependenciesLinux (/home/hurricane-server/venv/lib/python3.12/site-packages/playwright/driver/package/lib/server/registry/dependencies.js:269:9)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
    at async Registry._validateHostRequirements (/home/hurricane-server/venv/lib/python3.12/site-p

In [11]:
import nest_asyncio
nest_asyncio.apply()

In [12]:
example_url = 'https://www.tcgplayer.com/product/285222/yugioh-2022-tin-of-the-pharaohs-gods-ghost-mourner-and-moonlit-chill'
sales_uri = 'https://mpapi.tcgplayer.com/v2/product/26720/latestsales?mpfev=2631'

In [13]:
await utils.tcgplayer_yugioh_sales(db, example_url, 'test')

True

In [14]:
urls = utils.get_all_urls()        

In [None]:
# parse and save outputs in time descending order
while len(utils.check_if_complete(db_name = db)) > 0:
    todo = utils.check_if_complete(db_name = db)
    for index, url in enumerate(todo):
        product_id = utils.to_id(url)
        result = await utils.get_url(db, url, product_id)
        if result :
            log_text = f'CID {product_id} done. {index+1} out of {len(todo)} ({(index + 1) / len(todo):.3%})'
            logging.info(log_text)
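
The loop above keys each page by a product ID taken from its URL. The implementation of `utils.to_id` is not shown in this notebook, so the following is only a hypothetical equivalent, assuming URLs of the form `/product/<id>/<slug>` like the example URL used earlier.

```python
from urllib.parse import urlsplit


def product_id_from_url(url):
    """Return the numeric ID from a /product/<id>/<slug> URL, or None."""
    segments = [segment for segment in urlsplit(url).path.split('/') if segment]
    if len(segments) >= 2 and segments[0] == 'product' and segments[1].isdigit():
        return segments[1]
    return None


sample = ('https://www.tcgplayer.com/product/285222/'
          'yugioh-2022-tin-of-the-pharaohs-gods-ghost-mourner-and-moonlit-chill')
print(product_id_from_url(sample))  # 285222
print(product_id_from_url('https://www.tcgplayer.com/search/yugioh/product'))  # None
```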

In [None]:
# run all above here 

## Example Outputs
The following cells show some example outputs.

In [52]:
import importlib
importlib.reload(utils)
example = utils.q("select * from tcgplayer_html_files where ygo_index = '90804';", db)

In [53]:
utils.card_webpage_extract(example[0][2])

{'name': 'Uria, Lord of Searing Flames - 2006 Collectors Tin (CT03)',
 'number': 'CT03-EN005',
 'rarity': 'Secret Rare',
 'price': '$6.97',
 'quantity': '0',
 'quantity_sellers': '0',
 'price_change_3m': None,
 'volatility': 'Med'}

## Transform Data
Now that we have the data, we can extract the information and transform it.
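
The extracted fields arrive as strings, such as `'$6.97'` for a price and `'-'` when no price is listed. As one possible transformation step, the sketch below converts them to numbers. The helpers `parse_price` and `normalize_card` are hypothetical, and the handling of thousands separators is an assumption since no such price appears in the sample outputs.

```python
def parse_price(text):
    """Convert '$6.97' to 6.97; return None for '-', '' or None."""
    if text is None:
        return None
    cleaned = text.strip().replace('$', '').replace(',', '')  # ',' handling is assumed
    if cleaned in ('', '-'):
        return None
    return float(cleaned)


def normalize_card(card):
    """Return a copy of an extracted card dict with numeric fields converted."""
    normalized = dict(card)
    normalized['price'] = parse_price(card.get('price'))
    for key in ('quantity', 'quantity_sellers'):
        value = card.get(key)
        normalized[key] = int(value) if value not in (None, '') else None
    return normalized


# Dict copied from the example output shown earlier in this notebook.
example_card = {
    'name': 'Uria, Lord of Searing Flames - 2006 Collectors Tin (CT03)',
    'number': 'CT03-EN005',
    'rarity': 'Secret Rare',
    'price': '$6.97',
    'quantity': '0',
    'quantity_sellers': '0',
    'price_change_3m': None,
    'volatility': 'Med',
}
print(normalize_card(example_card))
```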

### Card Data
The card data is from each webpage (e.g. 57114.html) which represents 1 card. This extracts metadata from the card. Note that full card details are meant to be joined from the TCG Complete Card Database because it contains attack, defense, descriptions, and other information from the official Konami source.

In [60]:
# process cards.
card_ids_sql = utils.q('select ygo_index from tcgplayer_html_files;', db)
card_ids = [index[0] for index in card_ids_sql if 'test' not in index[0]]
# warning: rerunning this cell after reprocessing malformed cards may take a long time; use the designated cell instead
card_data = []
malformed_cards = []
print("Please reference log file")
for index, card_id in enumerate(card_ids) :
    logging.info(f'Processing {card_id} ({round(((index+1)/len(card_ids)), 4)*100}%){" "*50}')
    # open HTML file
    sql_result = utils.q(f"select content from tcgplayer_html_files where ygo_index='{card_id}';", db)
    card_webpage = sql_result[0][0]
    # extract and save
    try:
        card_extract = utils.card_webpage_extract(card_webpage)
        card_extract.update({
            'index': card_id
        })
        card_data.append(card_extract)
    except Exception as e:
        logging.info(f'\n{card_id}\n{e}')  # was `logger`, which is never defined
        malformed_cards.append((card_id, e))

Please reference log file


In [61]:
result = utils.q('select count(*) from tcgplayer_html_files;', db)
print(len(card_data))
print(result)

44747
[(44748,)]


In [62]:
# only expected value from errors is {"'NoneType' object has no attribute 'text'"}
set([str(card[1]) for card in malformed_cards])

set()

In [63]:
len(malformed_cards)

0

In [64]:
# get the urls for the malformed cards
malformed_urls = []
for redo in malformed_cards:
    cid = redo[0].split('.')[0] # example: 57114.html
    for url in urls:
        if cid == url[len('https://www.tcgplayer.com/product/'):].split('/')[0] :
            malformed_urls.append(url)
            break  # stop searching once the matching URL is found
print(len(malformed_urls))

0


In [65]:
# run again here if done processing malformed
todo = malformed_urls.copy()
while len(todo) > 0:
    url = todo.pop(0) # get the first element in todo
    product_id = utils.to_id(url)
    result = await utils.get_url(db, url, product_id, True) # set overwrite to True
    try:
        sql_result = utils.q(f"select content from tcgplayer_html_files where ygo_index='{product_id}';", db)
        logging.info(f"Database returned {len(sql_result)} row(s) for {product_id}")
        card_webpage = sql_result[0][0]
        card_extract = utils.card_webpage_extract(card_webpage)
        card_extract.update({
            'index': product_id  # card_id here would be a stale value from the earlier loop
        })
        card_data.append(card_extract)
    except Exception as e:
        logging.info(f'\n{product_id}\n{e}')
        todo.append(url) # if it fails, put back in queue
        continue
    log_text = f'CID {product_id} done. {len(todo)} left. ({(len(malformed_urls) - len(todo)) / len(malformed_urls):.3%})'
    logging.info(log_text)

In [66]:
import pandas as pd
card_df = pd.DataFrame(card_data)

In [67]:
card_df

|  | name | number | rarity | price | quantity | quantity_sellers | price_change_3m | volatility | index |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Dark Magician - 2003 Collectors Tin (BPT) | BPT-007 | Secret Rare | $46.05 | 0 | 0 |  | Indeterminate | 22800 |
| 1 | Buster Blader - 2003 Collectors Tin (BPT) | BPT-008 | Secret Rare | $26.44 | 0 | 0 |  | Indeterminate | 22801 |
| 2 | Blue-Eyes White Dragon - 2003 Collectors Tin (... | BPT-009 | Secret Rare | $29.01 | 0 | 0 |  | Indeterminate | 22802 |
| 3 | XYZ-Dragon Cannon - 2003 Collectors Tin (BPT) | BPT-010 | Secret Rare | $13.08 | 0 | 0 |  | Indeterminate | 22803 |
| 4 | Jinzo - 2003 Collectors Tin (BPT) | BPT-011 | Secret Rare | $21.63 | 0 | 0 |  | Low | 22804 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 44742 | Mad Sword Beast - Retro Pack 2 (2020 Date Repr... | RP02-EN023 | Common | - | 0 | 0 |  |  | 649833 |
| 44743 | Melchid the Four-Face Beast - Retro Pack 2 (20... | RP02-EN029 | Common | - | 0 | 0 |  |  | 649839 |
| 44744 | Harpie's Pet Dragon - Retro Pack 2 (2020 Date ... | RP02-EN093 | Secret Rare | - | 0 | 0 |  |  | 649903 |
| 44745 | Dragon Master Knight - Retro Pack 2 (2020 Date... | RP02-EN097 | Secret Rare | $49.98 | 0 | 0 |  | Indeterminate | 649907 |


In [69]:
card_df.to_csv(f'{db}.csv')