# Grab MTGJSON Card Data

Here I download and clean the card data for Magic: The Gathering (MTG).

First, I download the data from [MTGJSON](https://mtgjson.com/downloads/all-files/). The `AllPrintings` card data comes in several formats, including JSON, SQL, CSV, and Parquet.

I will use the [Parquet format](https://parquet.apache.org/), since it is well suited to data analysis. It is a compressed, columnar format that loads quickly and lets tools read only the columns they need from disk, which reduces both disk space and memory usage.
The archive is available at https://mtgjson.com/api/v5/AllPrintingsParquetFiles.tar.gz

The following code downloads and decompresses the data.

In [1]:
DOWNLOAD_MTGJSON = False

In [2]:
import os
import pathlib
import platform
from datetime import datetime

# Determine the operating system
os_name = platform.system()
print(f"Operating system: {os_name}")

# Define paths and file names
save_path = pathlib.Path("../../data/raw/mtgjson")
file_base = "AllPrintingsParquetFiles"  # extension added later
file_ext = "tar.gz" if os_name == "Linux" else "zip"
file = f"{file_base}.{file_ext}"
url = f"https://mtgjson.com/api/v5/{file}"
filepath = save_path / file
final_path = save_path / file_base

# Create the directory if it doesn't exist
os.makedirs(save_path, exist_ok=True)


Operating system: Windows


In [3]:
if DOWNLOAD_MTGJSON:
    print("Downloading Data")
    print(f"Starting datetime: {datetime.now()}")

    if os_name == "Linux":
        # Linux commands
        os.system(f"wget -P {save_path} --progress=dot:giga {url}")
        os.system(f"tar -xzf {filepath} -C {save_path}")
        os.system(f"rm {filepath}")
    elif os_name == "Windows":
        # Windows commands (using PowerShell)
        os.system(f"powershell Invoke-WebRequest -Uri {url} -OutFile {filepath}")
        os.system(f"powershell Expand-Archive -Path {filepath} -DestinationPath {save_path / file_base}")
        os.system(f"powershell Remove-Item {filepath}")
    else:
        raise NotImplementedError(f"No download commands defined for {os_name}")

    # os.system returns an exit status, not the command output,
    # so measure the size of the downloaded files in Python instead
    size = sum(f.stat().st_size for f in save_path.rglob("*") if f.is_file()) / 1e6
    print(f"Disk usage (MB): {size:.1f}")
    print(f"Finished datetime: {datetime.now()}")

else:
    print("Skipping download of data")

Skipping download of data
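
The shell commands above differ per operating system. As an alternative, here is a minimal sketch of the same download, extract, and measure steps using only the Python standard library. The function names are hypothetical, and the sketch assumes the server accepts a plain `urllib` request.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest):
    """Stream a URL to disk without loading the whole file into memory."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def extract_archive(archive, dest_dir):
    """Unpack a .zip or .tar.gz archive; the format is inferred from the extension."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(dest_dir))


def dir_size_mb(path):
    """Total size of all files under path, in megabytes (10**6 bytes)."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e6
```

With the variables from the cells above, usage would look like `download_file(url, filepath)`, then `extract_archive(filepath, save_path)`, then `dir_size_mb(save_path)`.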


# Review Tables

We have 16 Parquet files associated with the card data. Let's take a quick tour.

In [4]:
# list of files
path = final_path
files = os.listdir(path)
files.sort()
files

['cardForeignData.parquet',
 'cardIdentifiers.parquet',
 'cardLegalities.parquet',
 'cardPrices.parquet',
 'cardPurchaseUrls.parquet',
 'cardRulings.parquet',
 'cards.parquet',
 'meta.parquet',
 'setBoosterContentWeights.parquet',
 'setBoosterContents.parquet',
 'setBoosterSheetCards.parquet',
 'setBoosterSheets.parquet',
 'setTranslations.parquet',
 'sets.parquet',
 'tokenIdentifiers.parquet',
 'tokens.parquet']

## Card Files:
- `cards.parquet`: The primary file that contains card data, such as card name, mana cost, type, and text.
- `tokens.parquet`: Same for tokens.
- `cardForeignData.parquet`: Foreign language translations of cards.
- `cardLegalities.parquet`: Legality of cards for various play formats.
- `cardPrices.parquet`: Latest prices for cards on various platforms, including retail and buylist prices.
- `cardPurchaseUrls.parquet`: URLs to various retail platforms.
- `cardRulings.parquet`: The rulings for cards.

## Set Files:
- `sets.parquet`: Data on various released sets, such as set code (10E, OTJ...), set size, and release date.
- `setTranslations.parquet`: Translations for set names in various languages.

## Identifier Files:
- `cardIdentifiers.parquet`: Identifiers for various MTG data platforms (TCGplayer, Scryfall, Cardmarket...).
- `tokenIdentifiers.parquet`: Same for tokens.

## Set Booster Files:
- `setBoosterContents.parquet`: The possible sheet mixes for a booster pack (for example, 1 card from theList plus 13 others versus 0 from theList plus 14 others).
- `setBoosterContentWeights.parquet`: The weight of each booster mix (for example, 1 in 10 boosters contains a card from theList).
- `setBoosterSheets.parquet`: Card sheet information.
- `setBoosterSheetCards.parquet`: Card composition of each sheet, including counts.
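
To make the idea of booster weights concrete, here is a small sketch that turns weights into probabilities and draws a booster mix at random. The mix names and the 9-to-1 weights are hypothetical values chosen to match the "1 in 10" example, not data read from the MTGJSON files.

```python
import random

# Hypothetical weights: 9 of every 10 boosters use the mix without theList.
config_weights = {"no_theList": 9, "with_theList": 1}


def config_probabilities(weights):
    """Convert raw weights into probabilities that sum to 1."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def sample_config(weights, rng=random):
    """Draw one booster mix at random, proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


probs = config_probabilities(config_weights)
print(probs["with_theList"])  # 0.1
```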

## Meta File:
- `meta.parquet`: Version and date for current MTGJSON build.


# Unique Identifiers

Most of the files have a `uuid` column. This is the universally unique identifier (UUID v5) for each card printing. It is the primary key for the `cards.parquet` file and will be used to join data across tables.
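
As a minimal sketch of such a join, the example below merges two toy tables on `uuid` with `pandas.DataFrame.merge`. The rows and the `modern` column are illustrative assumptions, not real MTGJSON data.

```python
import pandas as pd

# Toy stand-ins for cards.parquet and cardLegalities.parquet (illustrative values).
cards = pd.DataFrame({"uuid": ["u1", "u2"], "name": ["Card A", "Card B"]})
legalities = pd.DataFrame({"uuid": ["u1", "u2"], "modern": ["Legal", "Banned"]})

# validate="one_to_one" raises an error if uuid is duplicated on either side,
# which guards against accidentally multiplying rows during the join.
merged = cards.merge(legalities, on="uuid", how="left", validate="one_to_one")
print(merged)
```

A left join keeps every card even when a matching row is missing in the other table.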

## MTGJSON
 - `uuid`:
   - Reprinted card editions: Unique id
   - [Double-faced cards](https://mtg.fandom.com/wiki/Double-faced_card) (DFC): Each face has a unique `uuid`.
   - Foreign languages: Same id

## WOTC Gatherer
 - `multiverseId`: The WOTC card identifier used in their [Gatherer](https://gatherer.wizards.com) card database.
    - Reprinted card editions: Unique id
    - Double-faced cards: Same id
    - Foreign languages: Different id

## Scryfall
 - `scryfallId`: The [Scryfall](https://scryfall.com/) uuid. Its rules differ from those of the MTGJSON `uuid`. For example, the faces of a DFC share one `scryfallId`.
    - Reprinted card editions: Unique id
    - Double-faced cards: Same id.  See `scryfallCardBackId`.
    - Foreign languages: Different id
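
The difference between the identifiers can be checked directly. The sketch below uses toy rows, assumed to resemble a join of `cards.parquet` with `cardIdentifiers.parquet`, to confirm that `uuid` is unique per row and to find each `scryfallId` that covers more than one face.

```python
import pandas as pd

# Toy rows (illustrative values): the first two rows are the faces of one DFC.
ids = pd.DataFrame(
    {
        "uuid": ["u1", "u2", "u3"],
        "name": ["Front Face", "Back Face", "Single Card"],
        "scryfallId": ["s1", "s1", "s2"],
    }
)

# Every row should have its own MTGJSON uuid.
uuid_is_unique = ids["uuid"].is_unique

# Count distinct uuids per scryfallId; more than one means a multi-face card.
faces_per_scryfall = ids.groupby("scryfallId")["uuid"].nunique()
multi_face = faces_per_scryfall[faces_per_scryfall > 1].index.tolist()

print(uuid_is_unique)  # True
print(multi_face)  # ['s1']
```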