# Summary

Wrangle the booster set data.

# Introduction

This notebook will wrangle the data needed to understand booster pack composition.

This data is sourced by MTGJSON from [https://github.com/taw/magic-sealed-data](https://github.com/taw/magic-sealed-data). See the intro notebook [00-10-introduction.ipynb](../00-intro/00-10-introduction.ipynb).


I made a dataset for simulating booster packs. It has one row for each combination
of booster pack config, sheet name, and card. Three rates affect the expected pull rate:
- Booster Pack Config: How often each unique mix of sheets occurs, such as packs with
"The List" cards or foil lands.
- Sheet Picks: Number of picks from each sheet type in the booster config, such as some packs having
6 picks from the commons sheet, while others get 5 picks.
- Card Sheet Rate: The rate at which a card uuid occurs on the sheet.

These rates are multiplied together to give an expected pull rate. I break this into two expected pull
rates. One is per uuid, which has different values for alternate cards. The other is per name,
which lumps all card alternates with the same gameplay mechanics together.
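
As a rough check of how the three rates combine, here is a minimal sketch in plain Python. The figures match the OTJ play booster outputs later in this notebook for a basic land printing. The variable names and the count of three printings per name are my own assumptions for illustration, not part of the MTGJSON data model.

```python
# Card sheet rate: a printing with weight 2 on a sheet whose weights total 60.
card_sheet_rate = 2 / 60

# (booster config rate, sheet picks) for each config/sheet pair where this
# uuid appears: the "land" sheet in configs 0 and 1, "foilLand" in 2 and 3.
appearances = [
    (0.64, 1),  # config 0, land
    (0.16, 1),  # config 1, land
    (0.16, 1),  # config 2, foilLand
    (0.04, 1),  # config 3, foilLand
]

# Expected copies of this uuid per pack: sum the product over all appearances.
uuid_rate = sum(rate * picks * card_sheet_rate for rate, picks in appearances)

# Assumption: three printings share the name, each with the same per-uuid rate.
printings_per_name = 3
name_rate = printings_per_name * uuid_rate

print(round(uuid_rate, 6), round(name_rate, 6))
```

The per-name rate is always at least as large as the per-uuid rate, since it sums over every printing that shares the name.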

I then collapse the above dataset to the expected pull rates for each card name and
uuid, aggregating over the pack configurations and sheet names. The result can be joined to other card stats.


# Initial Load and Transforms



In [190]:
# Setup Notebook
import os
if os.path.basename(os.getcwd()) != "mtg-modeling":
    get_ipython().run_line_magic("run", '-i "../../scripts/notebook_header.py"')  # type: ignore

In [191]:
from pathlib import Path

import pandas as pd
import polars as pl

Set the paths, set code, and booster type. This run uses OUTLAWS OF THUNDER JUNCTION (OTJ) play boosters.

In [192]:
SET_CODE = "OTJ" # OUTLAWS OF THUNDER JUNCTION
BOOSTER_NAME = "play" # play boosters

paths = {
    "raw": Path("data/raw/mtgjson/AllPrintingsParquetFiles"),
    "interim": Path("data/interim/mtgjson/Boosters"),
    "processed": Path("data/processed/mtgjson/Boosters"),
}

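# Fail early if the raw MTGJSON parquet files are missing.
assert paths["raw"].exists(), f"Missing raw data directory: {paths['raw']}"
os.makedirs(paths["interim"], exist_ok=True)
os.makedirs(paths["processed"], exist_ok=True)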

Scan files into lazy dataframes. Nothing is computed until `.collect()` is called.

In [193]:
cards = pl.scan_parquet(
    Path(f"data/processed/mtgjson/AllPrintings/{SET_CODE}_std_cards.parquet")
)
boost_content = pl.scan_parquet(paths["raw"] / "setBoosterContents.parquet")
boost_weight = pl.scan_parquet(paths["raw"] / "setBoosterContentWeights.parquet")
boost_sheets = pl.scan_parquet(paths["raw"] / "setBoosterSheets.parquet")
boost_cards = pl.scan_parquet(paths["raw"] / "setBoosterSheetCards.parquet")

Tidy the card data.

In [194]:
df_cards = (
    cards.select(["name", "setCode", "number", "uuid"])
    .filter(pl.col("setCode") == SET_CODE)
    # Attach every booster sheet entry for each card (all booster types).
    .join(boost_cards, left_on="uuid", right_on="cardUuid", how="inner")
    # Zero-pad collector numbers to three characters, e.g. "30" becomes "030".
    .with_columns(pl.col("number").str.zfill(3))
)
df_cards.collect().head()

name,setCode,number,uuid,boosterName,cardWeight,setCode_right,sheetName
str,str,str,str,str,i64,str,str
Trick Shot,OTJ,151,0577fa46-115e-5605-8710-4d5da5…,collector,1,OTJ,foilCommon
Irascible Wolverine,OTJ,130,0d0a0017-bc07-5bc7-9680-4a27d9…,collector,1,OTJ,foilCommon
Soured Springs,OTJ,264,0e5f7df0-c992-5920-9cd5-7f3cd3…,collector,1,OTJ,foilCommon
Stagecoach Security,OTJ,030,0f8534c3-19d7-59b6-9d73-3751b8…,collector,1,OTJ,foilCommon
Explosive Derailment,OTJ,122,0fb87202-3591-59e3-951b-41d1ab…,collector,1,OTJ,foilCommon


Tidy the booster composition, which shows which sheet names occur in a given booster config and how many picks each gets.

Display the pivot to compare different booster compositions.

In [195]:
df_bc = boost_content.filter(pl.col("setCode") == SET_CODE).filter(
    pl.col("boosterName") == BOOSTER_NAME
)
df_bc.collect().pivot(values="sheetPicks", on="boosterIndex")

boosterName,setCode,sheetName,0,1,2,3
str,str,str,i64,i64,i64,i64
play,OTJ,breakingNews,1.0,1.0,1.0,1.0
play,OTJ,common,6.0,5.0,6.0,5.0
play,OTJ,foil,1.0,1.0,1.0,1.0
play,OTJ,land,1.0,1.0,,
play,OTJ,rareMythicWithShowcase,1.0,1.0,1.0,1.0
play,OTJ,uncommon,3.0,3.0,3.0,3.0
play,OTJ,wildcard,1.0,1.0,1.0,1.0
play,OTJ,theList,,1.0,,1.0
play,OTJ,foilLand,,,1.0,1.0


Tidy the booster weight.  Shows the frequency at which each booster config occurs.

In [196]:
df_wgt = (
    boost_weight.filter(pl.col("setCode") == SET_CODE)
    .filter(pl.col("boosterName") == BOOSTER_NAME)
    # Normalize the weights so the config rates sum to 1.
    .with_columns(
        (pl.col("boosterWeight") / pl.col("boosterWeight").sum()).alias("boosterConfigRate"),
    )
)
df_wgt.collect().to_pandas()

Unnamed: 0,boosterIndex,boosterName,boosterWeight,setCode,boosterConfigRate
0,0,play,16,OTJ,0.64
1,1,play,4,OTJ,0.16
2,2,play,4,OTJ,0.16
3,3,play,1,OTJ,0.04
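
The normalization is plain arithmetic, so it can be checked by hand. This sketch reuses the four `boosterWeight` values from the table above; the dictionary layout is only for illustration.

```python
# boosterIndex -> boosterWeight, copied from the output above.
weights = {0: 16, 1: 4, 2: 4, 3: 1}

total = sum(weights.values())

# Each config's rate is its share of the total weight.
rates = {idx: weight / total for idx, weight in weights.items()}

for idx, rate in rates.items():
    print(idx, rate)
```

The rates sum to 1, so they can be read as the probability that a random pack uses each config.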


Tidy the booster sheets data.  This holds per-sheet flags, such as whether the sheet is foil or color balanced.

In [197]:
df_bs = boost_sheets.filter(pl.col("setCode") == SET_CODE).filter(
    pl.col("boosterName") == BOOSTER_NAME
)
df_bs.collect().to_pandas()

Unnamed: 0,boosterName,setCode,sheetHasBalanceColors,sheetIsFoil,sheetName
0,play,OTJ,False,False,breakingNews
1,play,OTJ,False,False,common
2,play,OTJ,False,True,foil
3,play,OTJ,False,True,foilLand
4,play,OTJ,False,False,land
5,play,OTJ,False,False,rareMythicWithShowcase
6,play,OTJ,False,False,theList
7,play,OTJ,False,False,uncommon
8,play,OTJ,False,False,wildcard


Tidy the booster sheet card data.  This contains the card uuid frequencies on each sheet.

In [198]:
df_bcards = (
    boost_cards.filter(pl.col("setCode") == SET_CODE)
    .filter(pl.col("boosterName") == BOOSTER_NAME)
    .join(
        cards.select(["uuid", "name", "number", "rarity"]),
        left_on="cardUuid",
        right_on="uuid",
        how="inner",
    )
    .with_columns(pl.col("cardWeight").sum().over("sheetName").alias("totalCardWeight"))
    .with_columns((pl.col("cardWeight") / pl.col("totalCardWeight")).alias("cardSheetRate"))
)

df_bcards.collect().to_pandas().sample(10).sort_values("cardSheetRate")

Unnamed: 0,boosterName,cardUuid,cardWeight,setCode,sheetName,name,number,rarity,totalCardWeight,cardSheetRate
307,play,dd1288f4-e995-5448-b1c2-6ce723204fb3,216,OTJ,foil,"Kaervek, the Punisher",92,rare,149634,0.001444
588,play,14db4347-7490-512e-acd9-c1799fb8338a,7938,OTJ,wildcard,Forsaken Miner,88,uncommon,4496850,0.001765
781,play,cf710bb7-5f65-5e52-9560-955bd96c36a3,7938,OTJ,wildcard,Slick Sequence,233,uncommon,4496850,0.001765
826,play,f759dba3-5b11-5765-bce1-36996cd4e1ee,7938,OTJ,wildcard,Scorching Shot,145,uncommon,4496850,0.001765
141,play,42586fcf-de1a-5e18-aeea-f04c5c83f8c1,324,OTJ,foil,"Bruse Tarl, Roving Rancher",198,rare,149634,0.002165
152,play,49e5eaf9-9a3f-5561-9982-58db55a14f06,378,OTJ,foil,Mobile Homestead,245,uncommon,149634,0.002526
103,play,17d238bd-d4cd-53fd-a907-70248c201c2d,378,OTJ,foil,Ruthless Lawbringer,229,uncommon,149634,0.002526
64,play,cd21e7cd-d035-56b7-8355-61f564cafdb2,1,OTJ,common,Blacksnag Buzzard,79,common,81,0.012346
31,play,6a1018f9-ca3b-5c28-a357-de3e8c3d07da,1,OTJ,common,Armored Armadillo,3,common,81,0.012346
423,play,698e6144-a907-577d-babc-9511c405e2d8,12,OTJ,rareMythicWithShowcase,"Wylie Duke, Atiin Hero",239,rare,782,0.015345


# Draw Rates per Card-Sheet-Booster Combos

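Use this for simulating booster pack draws, given rates for booster configs, sheets, and cards.  Each row's expected pull rate is `cardSheetRate * boosterConfigRate * sheetPicks`, which is then summed per card uuid and per card name.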

In [199]:
df = (
    df_bc.join(df_wgt, on=["boosterIndex", "boosterName", "setCode"], how="inner")
    .join(df_bs, on=["sheetName", "boosterName", "setCode"], how="inner")
    .join(df_bcards, on=["sheetName", "boosterName", "setCode"], how="inner")
    .with_columns(
        (
            pl.col("cardSheetRate") * pl.col("boosterConfigRate") * pl.col("sheetPicks")
        ).alias("expectedConfigCardPullRate"),
    )
    .with_columns(
        pl.col("expectedConfigCardPullRate")
        .sum()
        .over("cardUuid")
        .alias("expectedCardUuidPullRate")
    )
    .with_columns(
        pl.col("expectedConfigCardPullRate")
        .sum()
        .over("name")
        .alias("expectedCardNamePullRate")
    )
    .sort(pl.col("expectedCardNamePullRate"), descending=False)
)
# Collect once and reuse the result instead of re-running the lazy query.
df_rates = df.collect()
df_rates.write_parquet(paths["processed"] / f"{SET_CODE}_booster_sheet_card_rates.parquet")
print(df_rates.shape)
df_rates.filter(
    pl.col("expectedCardUuidPullRate") != pl.col("expectedCardNamePullRate")
).to_pandas().sample(5)

(3232, 19)


Unnamed: 0,boosterIndex,boosterName,setCode,sheetName,sheetPicks,boosterWeight,boosterConfigRate,sheetHasBalanceColors,sheetIsFoil,cardUuid,cardWeight,name,number,rarity,totalCardWeight,cardSheetRate,expectedConfigCardPullRate,expectedCardUuidPullRate,expectedCardNamePullRate
8,2,play,OTJ,foilLand,1,4,0.16,False,True,452d4eaf-d736-5dbf-9fc2-cb35022ee7d9,2,Forest,276,common,60,0.033333,0.005333,0.033333,0.1
18,2,play,OTJ,foilLand,1,4,0.16,False,True,a7fda38b-455a-50b3-8408-eeddeaeb19f2,2,Forest,285,common,60,0.033333,0.005333,0.033333,0.1
58,0,play,OTJ,land,1,16,0.64,False,False,f40527c4-8a3b-5cc8-99c3-f81fd94757bf,2,Island,280,common,60,0.033333,0.021333,0.033333,0.1
36,0,play,OTJ,land,1,16,0.64,False,False,3d2e9367-5afd-560b-869a-c7c612d0db53,2,Island,279,common,60,0.033333,0.021333,0.033333,0.1
37,1,play,OTJ,land,1,4,0.16,False,False,3d2e9367-5afd-560b-869a-c7c612d0db53,2,Island,279,common,60,0.033333,0.005333,0.033333,0.1
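
A pack simulation could use these rates roughly as follows. This is a sketch with the standard library only and invented configs, sheets, and card ids. `open_pack` is a hypothetical helper, and `random.choices` samples with replacement, which is a simplification since a real pack would not normally repeat a card from the same sheet.

```python
import random

# Hypothetical toy data: config weights, picks per sheet, and card weights.
config_weights = {0: 16, 1: 4}
config_picks = {
    0: {"common": 2, "land": 1},
    1: {"common": 1, "land": 1, "theList": 1},
}
sheet_cards = {
    "common": {"c1": 1, "c2": 1, "c3": 1},
    "land": {"forest": 2, "island": 2},
    "theList": {"list1": 1, "list2": 3},
}


def open_pack(rng: random.Random) -> list[str]:
    """Draw one pack: pick a config by weight, then cards from each sheet."""
    config = rng.choices(
        list(config_weights), weights=list(config_weights.values())
    )[0]
    pack = []
    for sheet, picks in config_picks[config].items():
        cards = sheet_cards[sheet]
        # Weighted draw with replacement (simplification, see above).
        pack.extend(rng.choices(list(cards), weights=list(cards.values()), k=picks))
    return pack


rng = random.Random(0)
print(open_pack(rng))
```

Averaging card counts over many simulated packs should approach the expected pull rates computed above, which gives a way to sanity check either side.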


# Draw Rates per Card, Aggregated

Use this for per-card draw rates.  Takes the above dataset and aggregates over the booster configs and sheet types.

In [200]:
df_card_rates = (
    df.group_by(["cardUuid", "boosterName"])
    .first()
    .select(
        [
            "cardUuid",
            "name",
            "number",
            "setCode",
            "boosterName",
            "expectedCardUuidPullRate",
            "expectedCardNamePullRate",
        ]
    )
)
# Collect once and reuse the result for both the file and the preview.
card_rates = df_card_rates.collect()
card_rates.write_parquet(paths["processed"] / f"{SET_CODE}_card_pull_rates.parquet")
card_rates.sort(
    "expectedCardUuidPullRate", pl.col("number").str.zfill(3)
).to_pandas().head(5)

Unnamed: 0,cardUuid,name,number,setCode,boosterName,expectedCardUuidPullRate,expectedCardNamePullRate
0,53a66eef-e857-53da-b23c-93a9926f8a9e,"Geralf, the Fleshwright",50,OTJ,play,0.006437,0.006437
1,da46f56f-f4a8-58de-8a30-f6bd1de92add,"Gisa, the Hellraiser",89,OTJ,play,0.006437,0.006437
2,d9ddb794-3cb2-5a53-97d2-20375d1c607d,"Tinybones, the Pickpocket",109,OTJ,play,0.006437,0.006437
3,63c69341-9c62-550f-994e-8ec0ab0c0c46,"Annie Flash, the Veteran",190,OTJ,play,0.006437,0.006437
4,83d76712-309f-532d-bbd6-d92a3b0870d4,"Kellan, the Kid",213,OTJ,play,0.006437,0.006437
