# Summary

Wrangle the booster set data.

# Introduction

This notebook will wrangle the data needed to understand booster pack composition.

This data is sourced by MTGJSON from [https://github.com/taw/magic-sealed-data](https://github.com/taw/magic-sealed-data).  See the intro notebook [00-10-introduction](../00-intro/00-10-introduction.ipynb) for background.


I made a dataset for simulating booster packs.  It has one row for each combination
of booster pack config, sheet name, and card.  Three rates affect the expected pull rate:
- Booster Pack Config Rate: How often each unique booster pack mix occurs, such as mixes with 
"The List" cards or foil lands.
- Sheet Picks: Number of picks from each sheet type in the booster config, such as some packs having 
6 picks from the commons sheets, while others get 5 picks.
- Card Sheet Rate: The rate at which a card uuid occurs on the sheet.

These rates are multiplied together to create an expected pull rate.  I break this into two expected pull 
rates.  One is per uuid, which has different values for alternate versions of a card.  The other is per name, 
which lumps all card alternates with the same gameplay mechanics together.

I then collapse the above dataset to the expected pull rates for each card name and 
uuid, aggregating over the pack configurations and sheet names.  This can be added to other card stats.
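
As a rough illustration of how the three rates combine, here is a minimal plain-Python sketch.  The numbers are made up for the example and are not taken from any real set, and the helper name `expected_pull_rate` is hypothetical.  For one card, each config contributes config rate × sheet picks × card sheet rate, and the contributions are summed.

```python
# Toy numbers for illustration only (assumption: not real set data).
# Each booster config is (boosterConfigRate, {sheetName: sheetPicks}).
configs = [
    (0.75, {"common": 7}),
    (0.25, {"common": 6, "theList": 1}),
]

# cardSheetRate of one hypothetical card uuid on each sheet it appears on.
card_sheet_rate = {"common": 0.0125}


def expected_pull_rate(configs, card_sheet_rate):
    """Sum config rate * sheet picks * card sheet rate over configs and sheets."""
    total = 0.0
    for config_rate, picks in configs:
        for sheet, n_picks in picks.items():
            # Sheets the card is not printed on contribute nothing.
            total += config_rate * n_picks * card_sheet_rate.get(sheet, 0.0)
    return total


rate = expected_pull_rate(configs, card_sheet_rate)
print(round(rate, 6))
```

With these toy values the card sees an average of 6.75 common picks per pack, so the expected number of copies per pack is 6.75 × 0.0125.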


# Initial Load and Transforms



In [31]:
# Setup Notebook
import os
if os.path.basename(os.getcwd()) != "mtg-modeling":
    get_ipython().run_line_magic("run", '-i "../../scripts/notebook_header.py"')  # type: ignore

In [32]:
from pathlib import Path  # used for the data paths below

import pandas as pd
import polars as pl

Set paths.  Using the Bloomburrow (BLB) set code and play booster type.  Cells whose output shows OTJ (Outlaws of Thunder Junction) were last run with that set code.

In [42]:
SET_CODE = "BLB"  # Bloomburrow
BOOSTER_NAME = "play"  # play boosters

paths = {
    "raw": Path("data/raw/mtgjson/AllPrintingsParquetFiles"),
    "interim": Path("data/interim/mtgjson/Boosters"),
    "processed": Path("data/processed/mtgjson/Boosters"),
}

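# Fail early if the raw MTGJSON parquet files are missing.
assert paths["raw"].exists(), f"Missing raw data directory: {paths['raw']}"
# Create the output directories if they do not exist yet.
os.makedirs(paths["interim"], exist_ok=True)
os.makedirs(paths["processed"], exist_ok=True)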

Scan the files into lazy dataframes.  Nothing is computed until the .collect() method is called.

In [43]:
cards = pl.scan_parquet(
    Path(f"data/processed/mtgjson/AllPrintings/{SET_CODE}_std_cards.parquet")
)
boost_content = pl.scan_parquet(paths["raw"] / "setBoosterContents.parquet")
boost_weight = pl.scan_parquet(paths["raw"] / "setBoosterContentWeights.parquet")
boost_sheets = pl.scan_parquet(paths["raw"] / "setBoosterSheets.parquet")
boost_cards = pl.scan_parquet(paths["raw"] / "setBoosterSheetCards.parquet")

Tidy the card data.

In [44]:
df_cards = (
    cards.select(["name", "setCode", "number", "uuid"])
    .filter(pl.col("setCode") == SET_CODE)
    # Attach the booster sheet rows for each card uuid (all booster types).
    .join(boost_cards, left_on="uuid", right_on="cardUuid", how="inner")
    # Zero-pad collector numbers so they sort correctly as strings.
    .with_columns(pl.col("number").str.zfill(3))
)
df_cards.collect().head()

name,setCode,number,uuid,boosterName,cardWeight,setCode_right,sheetName
str,str,str,str,str,i64,str,str
"""Raccoon Rallier""","""BLB""","""148""","""0303bce7-92ec-53d3-8dc4-d27dce…","""collector""",1,"""BLB""","""foilCommon"""
"""Tempest Angler""","""BLB""","""235""","""039d071b-cfed-55df-ba49-51b209…","""collector""",1,"""BLB""","""foilCommon"""
"""Bonebind Orator""","""BLB""","""084""","""04c6c11d-d510-566e-b2bd-5a9992…","""collector""",1,"""BLB""","""foilCommon"""
"""Carrot Cake""","""BLB""","""007""","""06ee1d2b-9cb8-5aa6-93b5-fdf8c2…","""collector""",1,"""BLB""","""foilCommon"""
"""Intrepid Rabbit""","""BLB""","""017""","""0e9c0952-72c4-5ded-bdfa-9ec5d5…","""collector""",1,"""BLB""","""foilCommon"""


Tidy the booster composition, which shows which sheet names occur in a given booster config and how many picks each sheet gets.

Display the pivot to compare different booster compositions.

In [45]:
df_bc = boost_content.filter(pl.col("setCode") == SET_CODE).filter(
    pl.col("boosterName") == BOOSTER_NAME
)
df_bc.collect().pivot(values="sheetPicks", on="boosterIndex")

boosterName,setCode,sheetName,0,1,2,3
str,str,str,i64,i64,i64,i64
"""play""","""BLB""","""common""",7.0,6.0,7.0,6.0
"""play""","""BLB""","""foil""",1.0,1.0,1.0,1.0
"""play""","""BLB""","""land""",1.0,1.0,,
"""play""","""BLB""","""rareMythicWithShowcase""",1.0,1.0,1.0,1.0
"""play""","""BLB""","""uncommon""",3.0,3.0,3.0,3.0
"""play""","""BLB""","""wildcard""",1.0,1.0,1.0,1.0
"""play""","""BLB""","""theList""",,1.0,,1.0
"""play""","""BLB""","""foilLand""",,,1.0,1.0


Tidy the booster weight.  Shows the frequency at which each booster config occurs.

In [37]:
df_wgt = (
    boost_weight.filter(pl.col("setCode") == SET_CODE)
    .filter(pl.col("boosterName") == BOOSTER_NAME)
    # Normalize the weights so the config rates sum to 1.
    .with_columns(
        (pl.col("boosterWeight") / pl.col("boosterWeight").sum()).alias("boosterConfigRate"),
    )
)
df_wgt.collect().to_pandas()

Unnamed: 0,boosterIndex,boosterName,boosterWeight,setCode,boosterConfigRate
0,0,play,16,OTJ,0.64
1,1,play,4,OTJ,0.16
2,2,play,4,OTJ,0.16
3,3,play,1,OTJ,0.04
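
As a quick check of the boosterConfigRate column, the same normalization can be done by hand on the four weights shown in the output above.  This is a plain-Python sketch for illustration, not part of the pipeline; the variable names are my own.

```python
# boosterIndex -> boosterWeight, copied from the output above.
booster_weights = {0: 16, 1: 4, 2: 4, 3: 1}

# Each config's rate is its weight divided by the total weight (25 here).
total_weight = sum(booster_weights.values())
booster_config_rate = {
    index: weight / total_weight for index, weight in booster_weights.items()
}

for index, config_rate in booster_config_rate.items():
    print(index, round(config_rate, 2))
```

The rates come out to 0.64, 0.16, 0.16, and 0.04, matching the table, and they sum to 1.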


Tidy the booster sheets data.  This has some data for each sheet type.

In [38]:
df_bs = boost_sheets.filter(pl.col("setCode") == SET_CODE).filter(
    pl.col("boosterName") == BOOSTER_NAME
)
df_bs.collect().to_pandas()

Unnamed: 0,boosterName,setCode,sheetHasBalanceColors,sheetIsFoil,sheetName
0,play,OTJ,False,False,breakingNews
1,play,OTJ,False,False,common
2,play,OTJ,False,True,foil
3,play,OTJ,False,True,foilLand
4,play,OTJ,False,False,land
5,play,OTJ,False,False,rareMythicWithShowcase
6,play,OTJ,False,False,theList
7,play,OTJ,False,False,uncommon
8,play,OTJ,False,False,wildcard


Tidy the booster sheet card data.  This contains the card uuid frequencies on each sheet.

In [39]:
df_bcards = (
    boost_cards.filter(pl.col("setCode") == SET_CODE)
    .filter(pl.col("boosterName") == BOOSTER_NAME)
    .join(
        cards.select(["uuid", "name", "number", "rarity"]),
        left_on="cardUuid",
        right_on="uuid",
        how="inner",
    )
    # Total weight of all cards on the same sheet.
    .with_columns(pl.col("cardWeight").sum().over("sheetName").alias("totalCardWeight"))
    # Share of the sheet taken up by this card uuid.
    .with_columns((pl.col("cardWeight") / pl.col("totalCardWeight")).alias("cardSheetRate"))
)

df_bcards.collect().to_pandas().sample(10).sort_values("cardSheetRate")

Unnamed: 0,boosterName,cardUuid,cardWeight,setCode,sheetName,name,number,rarity,totalCardWeight,cardSheetRate
205,play,7c2f8f1d-085e-5cb9-8bc6-018d3ca8c5b5,162,OTJ,foil,Double Down,44,mythic,149634,0.001083
822,play,f4409d13-2bcf-5d83-80e0-abdb796c070f,8100,OTJ,wildcard,Caustic Bronco,82,rare,4496850,0.001801
124,play,2ca6b276-df48-572a-8f6e-cfae9c3dc73b,378,OTJ,foil,Rictus Robber,102,uncommon,149634,0.002526
152,play,49e5eaf9-9a3f-5561-9982-58db55a14f06,378,OTJ,foil,Mobile Homestead,245,uncommon,149634,0.002526
282,play,c80f2a52-32ca-5203-a3c3-7fcc6856c2eb,378,OTJ,foil,Lavaspur Boots,243,uncommon,149634,0.002526
174,play,602d5db0-db76-5c56-9f36-1eff1c00ad66,1120,OTJ,foil,Consuming Ashes,83,common,149634,0.007485
186,play,6a1018f9-ca3b-5c28-a357-de3e8c3d07da,1120,OTJ,foil,Armored Armadillo,3,common,149634,0.007485
756,play,b3ed033c-9928-5a73-8da0-ff232311af60,39200,OTJ,wildcard,Djinn of Fool's Fall,43,common,4496850,0.008717
75,play,ec54a1af-e697-5576-82af-7e04eb05dfaf,1,OTJ,common,Sterling Keykeeper,32,common,81,0.012346
364,play,e8c6f29c-ed4e-574b-ba61-c6564ad84b8b,3,OTJ,foilLand,Creosote Heath,255,common,60,0.05
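
To make the cardSheetRate column concrete, here is a small sketch that recomputes it for three rows of the sample output above.  It only uses the cardWeight and totalCardWeight values shown there; the variable names are my own.

```python
# (sheetName, name, cardWeight, totalCardWeight), copied from the sample output above.
rows = [
    ("foil", "Double Down", 162, 149634),
    ("common", "Sterling Keykeeper", 1, 81),
    ("foilLand", "Creosote Heath", 3, 60),
]

# cardSheetRate is the card's weight as a share of the sheet's total weight.
card_sheet_rate = {
    name: round(card_weight / total_weight, 6)
    for _sheet, name, card_weight, total_weight in rows
}

for name, sheet_rate in card_sheet_rate.items():
    print(name, sheet_rate)
```

The rounded values agree with the cardSheetRate column: 0.001083, 0.012346, and 0.05.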


# Draw Rates per Card-Sheet-Booster Combos

Use this for simulating booster pack draws, given rates for booster configs, sheets, and cards.

In [40]:
df = (
    df_bc.join(df_wgt, on=["boosterIndex", "boosterName", "setCode"], how="inner")
    .join(df_bs, on=["sheetName", "boosterName", "setCode"], how="inner")
    .join(df_bcards, on=["sheetName", "boosterName", "setCode"], how="inner")
    .with_columns(
        (
            pl.col("cardSheetRate") * pl.col("boosterConfigRate") * pl.col("sheetPicks")
        ).alias("expectedConfigCardPullRate"),
    )
    .with_columns(
        pl.col("expectedConfigCardPullRate")
        .sum()
        .over("cardUuid")
        .alias("expectedCardUuidPullRate")
    )
    .with_columns(
        pl.col("expectedConfigCardPullRate")
        .sum()
        .over("name")
        .alias("expectedCardNamePullRate")
    )
    .sort(pl.col("expectedCardNamePullRate"), descending=False)
)
# Collect once and reuse the result for writing, the shape check, and the display.
df_collected = df.collect()
df_collected.write_parquet(paths["processed"] / f"{SET_CODE}_booster_sheet_card_rates.parquet")
print(df_collected.shape)
df_collected.filter(
    pl.col("expectedCardUuidPullRate") != pl.col("expectedCardNamePullRate")
).to_pandas().sample(5)

(3232, 19)


Unnamed: 0,boosterIndex,boosterName,setCode,sheetName,sheetPicks,boosterWeight,boosterConfigRate,sheetHasBalanceColors,sheetIsFoil,cardUuid,cardWeight,name,number,rarity,totalCardWeight,cardSheetRate,expectedConfigCardPullRate,expectedCardUuidPullRate,expectedCardNamePullRate
55,1,play,OTJ,land,1,4,0.16,False,False,d625c38f-3b9c-59b7-85eb-e4bdfda3a8d0,2,Swamp,281,common,60,0.033333,0.005333,0.033333,0.1
13,3,play,OTJ,foilLand,1,1,0.04,False,True,6932a15f-4028-5ffb-8a25-9cf7a3a78df3,2,Plains,278,common,60,0.033333,0.001333,0.033333,0.1
10,2,play,OTJ,foilLand,1,4,0.16,False,True,4a1a324d-b930-5afb-845c-2410c98d4b28,2,Swamp,274,common,60,0.033333,0.005333,0.033333,0.1
0,2,play,OTJ,foilLand,1,4,0.16,False,True,053ded5c-2cea-5a75-a55e-daf37e2ea91e,2,Mountain,275,common,60,0.033333,0.005333,0.033333,0.1
38,0,play,OTJ,land,1,16,0.64,False,False,452d4eaf-d736-5dbf-9fc2-cb35022ee7d9,2,Forest,276,common,60,0.033333,0.021333,0.033333,0.1


# Draw Rates per Card, Aggregated

Use this for per-card draw rates.  Takes the above dataset and aggregates over the booster configs and sheet types.

In [41]:
df_card_rates = (
    df.group_by(["cardUuid", "boosterName"])
    .first()
    .select(
        [
            "cardUuid",
            "name",
            "number",
            "setCode",
            "boosterName",
            "expectedCardUuidPullRate",
            "expectedCardNamePullRate",
        ]
    )
)
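# Collect once and reuse the result for writing and display.
df_card_rates_collected = df_card_rates.collect()
df_card_rates_collected.write_parquet(
    paths["processed"] / f"{SET_CODE}_card_pull_rates.parquet"
)
df_card_rates_collected.sort(
    "expectedCardUuidPullRate", pl.col("number").str.zfill(3)
).to_pandas().head(5)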

Unnamed: 0,cardUuid,name,number,setCode,boosterName,expectedCardUuidPullRate,expectedCardNamePullRate
0,53a66eef-e857-53da-b23c-93a9926f8a9e,"Geralf, the Fleshwright",50,OTJ,play,0.006437,0.006437
1,da46f56f-f4a8-58de-8a30-f6bd1de92add,"Gisa, the Hellraiser",89,OTJ,play,0.006437,0.006437
2,d9ddb794-3cb2-5a53-97d2-20375d1c607d,"Tinybones, the Pickpocket",109,OTJ,play,0.006437,0.006437
3,63c69341-9c62-550f-994e-8ec0ab0c0c46,"Annie Flash, the Veteran",190,OTJ,play,0.006437,0.006437
4,83d76712-309f-532d-bbd6-d92a3b0870d4,"Kellan, the Kid",213,OTJ,play,0.006437,0.006437
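
In the rows above the per-uuid and per-name rates are equal, which is what happens when a name has a single uuid in the booster.  The sketch below shows how the two columns diverge when one name has several printings.  The rows and uuids are hypothetical toy values, and the aggregation is written in plain Python rather than polars.

```python
from collections import defaultdict

# (cardUuid, name, expectedConfigCardPullRate); hypothetical toy rows.
rows = [
    ("u1", "Swamp", 0.02),  # printing 1, one booster config
    ("u1", "Swamp", 0.01),  # printing 1, another booster config
    ("u2", "Swamp", 0.03),  # alternate printing of the same card
    ("u3", "Bolt", 0.05),   # card with a single printing
]

uuid_rate = defaultdict(float)
name_rate = defaultdict(float)
for card_uuid, name, config_card_rate in rows:
    uuid_rate[card_uuid] += config_card_rate  # like .sum().over("cardUuid")
    name_rate[name] += config_card_rate       # like .sum().over("name")

for card_uuid, name in [("u1", "Swamp"), ("u2", "Swamp"), ("u3", "Bolt")]:
    print(card_uuid, name, round(uuid_rate[card_uuid], 4), round(name_rate[name], 4))
```

Each Swamp printing has its own per-uuid rate, while the per-name rate is their sum; for the single-printing card the two rates are the same.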
