**Author:** Lorenzo Bortolussi (Bachelor Thesis), academic year 2024/2025

# Contextual Preference Ranking for Card Recommendations in Magic: The Gathering
This notebook demonstrates the complete, end-to-end pipeline for the Card Recommender System developed for this thesis. It covers every stage of the project, from initial data acquisition to the final training and deployment of the models.

The pipeline is divided into eight distinct stages:
1.  **Corpus Creation:** Scraping and assembling a diverse text corpus for domain adaptation of an LLM.
2.  **Card Data Retrieval:** Downloading and filtering all card data, then downloading the card images.
3.  **Language Model Training:** Fine-tuning the foundation and classifier models.
4.  **Card Representations:** Vectorizing every card into a multi-modal representation.
5.  **Decklist & Dataset Creation:** Scraping decklists and creating the final training datasets.
6.  **CPR Model Training:** Training the core recommender models with various configurations.
7.  **Vector Database Construction:** Populating the ChromaDB collections for efficient search.
8.  **Gradio Demo Launch:** Running the interactive web application.

## Stage 1: Corpus Creation for Domain Adaptation
The first step is to create a rich, domain-specific text corpus. This involves scraping thousands of paragraphs from online articles related to *Magic: The Gathering* strategy. This colloquial text is essential for teaching a language model how players actually talk about the game. This data will be combined with official rules and card texts in the next stage.


In [None]:
# --- WARNING --- 
# This is a long-running process that scrapes hundreds of web pages.
# It is designed to be restartable. If it stops, run it again to resume.

import article_scraper
from pathlib import Path

base_dir = Path.cwd()
output_file = str(base_dir / "data" / "scraped_articles.txt")

articles_scraper = article_scraper.ArticlesScraper(
    stop_phrases=article_scraper.Params.stop_phrases, 
    lang_check=article_scraper.Params.lang_check,
    requests_per_second=article_scraper.Params.requests_per_second
)

articles_scraper.scrape_articles_into_paragraphs(output_file)
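
The cell above depends on two behaviours: throttling (`requests_per_second`) and resuming after an interruption. The sketch below shows one common way to obtain both. It is not the code inside `article_scraper`; the names `RateLimiter`, `scrape_resumable`, `fetch` and the progress-file layout are hypothetical and chosen only for illustration.

```python
import time
from pathlib import Path


class RateLimiter:
    """Block so that successive wait() calls are at least 1/rate seconds apart."""

    def __init__(self, requests_per_second: float):
        self.min_interval = 1.0 / requests_per_second
        self._last_call = None

    def wait(self) -> None:
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def scrape_resumable(urls, fetch, output_file, done_file, limiter):
    """Append fetched text to output_file, skipping URLs already listed in done_file.

    `fetch` is any callable mapping a URL to a string (hypothetical here).
    Returns the number of URLs fetched during this run.
    """
    done_path = Path(done_file)
    done = set()
    if done_path.exists():
        done = set(done_path.read_text(encoding="utf-8").splitlines())

    fetched = 0
    with open(output_file, "a", encoding="utf-8") as out, \
         open(done_path, "a", encoding="utf-8") as log:
        for url in urls:
            if url in done:
                continue  # already scraped in a previous run
            limiter.wait()
            out.write(fetch(url) + "\n")
            out.flush()
            log.write(url + "\n")
            log.flush()  # record progress immediately so a crash loses little work
            fetched += 1
    return fetched
```

Calling `scrape_resumable` a second time with the same `done_file` fetches only the URLs that were not completed before, which is the property that makes a long scraping job safe to restart.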

## Stage 2: Card Data Retrieval
Next, we download the latest card data from Scryfall, apply filters to create a clean dataset of Commander-legal cards, and download their images.

In [None]:
# --- WARNING ---
# This stage downloads a large JSON file and potentially thousands of images.

import utils
import preprocess_cards
from pathlib import Path


base_dir = Path.cwd()
data_dir = base_dir / "data"
this = str(Path().resolve()) # Emulate os.path.dirname(__file__)
raw_data = str(data_dir / "raw_data.json")
clean_data = str(data_dir / "clean_data.json")
img_dir = str(data_dir / "images")

download = input("Download fresher data? (Y/N): ").strip().lower() == "y"
if download:
    preprocess_cards.download_data(raw_data)
else:
    print("Download skipped.")
    
preprocess_cards.filter_data(raw_data, clean_data)
preprocess_cards.download_images(clean_data, output_folder=img_dir)

utils.generate_and_save_dict(this)
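
To make the filtering step concrete, here is a minimal sketch of selecting Commander-legal cards from Scryfall card objects. It relies only on the `legalities` field that Scryfall attaches to each card; the function names and the choice to keep one entry per card name are assumptions, and the real `preprocess_cards.filter_data` may apply further rules that are not shown in this notebook.

```python
def is_commander_legal(card: dict) -> bool:
    """True if the Scryfall card object is marked legal in Commander."""
    return card.get("legalities", {}).get("commander") == "legal"


def filter_commander_cards(cards: list) -> list:
    """Keep Commander-legal cards, one entry per card name (assumed deduplication)."""
    seen = set()
    kept = []
    for card in cards:
        name = card.get("name")
        if is_commander_legal(card) and name not in seen:
            seen.add(name)
            kept.append(card)
    return kept


# Heavily trimmed card objects, for illustration only.
sample_cards = [
    {"name": "Sol Ring", "legalities": {"commander": "legal"}},
    {"name": "Sol Ring", "legalities": {"commander": "legal"}},  # second printing
    {"name": "Black Lotus", "legalities": {"commander": "banned"}},
    {"name": "Island", "legalities": {"commander": "legal"}},
]

clean_names = [card["name"] for card in filter_commander_cards(sample_cards)]
print(clean_names)
```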

## Stage 3: Language Model Training
To build the multi-modal vector representations for the cards, we first need to fine-tune a pretrained language model so that it processes card text appropriately.

Using the corpus assembled in **Stage 1**, we execute the three-stage pipeline to train two expert language models:
1.  **Foundation Model:** A `distilbert-base-uncased` model is fine-tuned on the composite corpus using Masked Language Modeling.
2.  **Pseudo-Labeling:** A zero-shot NLI model generates role labels for all cards.
3.  **Classifier Model:** The foundation model is further fine-tuned on the pseudo-labels to become an expert card role classifier.

To complete the text corpus for this training, download the *Magic: The Gathering* Comprehensive Rules from [this link](https://media.wizards.com/2025/downloads/MagicCompRules%2020250919.txt) and place the file in the `data` folder as **mtg_rules.txt**, the name the following cell expects:




In [None]:
# --- WARNING ---
# This is a very long-running, GPU-intensive process.
# The script will automatically use a CUDA-enabled GPU if available.

import text_embeddings
from pathlib import Path

base_dir = Path.cwd()
data_dir = base_dir / "data"
models_dir = base_dir / "models"
text_embeddings.Paths.clean_data_json = str(data_dir / "clean_data.json")
text_embeddings.Paths.rules_txt =  str(data_dir / "mtg_rules.txt")
text_embeddings.Paths.fundation_model =  str(models_dir / "magic-distilbert-base-v1")
text_embeddings.Paths.pseudo_labeled_dataset = str(data_dir / "card_roles_dataset.jsonl")
text_embeddings.Paths.final_classifier = str(models_dir / "card-role-classifier-final")
text_embeddings.Paths.scraped_articles_txt = str(data_dir / "scraped_articles.txt")

# Call the three stage pipeline
text_embeddings.main()
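
The Masked Language Modeling objective used for the foundation model can be illustrated without any deep-learning library. The sketch below applies the standard BERT masking recipe: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. The function name, the toy vocabulary and the whitespace tokenization are assumptions for illustration; the `text_embeddings` module is not shown here and may well delegate this step to a library.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for MLM training.

    labels[i] holds the original token at selected positions and None elsewhere,
    so the loss is computed only on the selected positions.
    """
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_TOKEN         # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


card_text = "destroy target creature with flying".split()
toy_vocab = ["draw", "card", "counter", "spell"]
masked, targets = mask_tokens(card_text, toy_vocab, rng=random.Random(0))
print(masked)
print(targets)
```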

## Stage 4: Card Representations
We can now process the clean data from **Stage 2** to create the multi-modal vector representations for every card, which are the fundamental building blocks for our recommender model.

In [None]:
# Reuses `preprocess_cards` and `this` defined in the Stage 2 cell.
preprocess_cards.build_card_representations(this, batch_size=16, use_img=False)

## Stage 5: Decklist & Dataset Creation
This stage scrapes thousands of Commander decklists from Archidekt. The raw decklists are then processed into two distinct training datasets (`_all` and `_div`) that the CPR model will be trained on.


In [None]:
# --- WARNING ---
# Scraping 100,000 decks can take over 15 hours. The process can be interrupted,
# but it cannot resume from where it stopped.
# It is highly recommended to run this once and save the output files.

import edh_scraper
from pathlib import Path

base_dir = Path.cwd()
data_dir = base_dir / "data"
card_dict_path = str(data_dir / "card_dict.pt")
decks_path_all = str(data_dir / "edh_decks_all.jsonl")
decks_path_div = str(data_dir / "edh_decks_div.jsonl")

dataset_path_all = str(data_dir / "cpr_dataset_v1_all.pt")
dataset_path_div = str(data_dir / "cpr_dataset_v1_div.pt")
card_feat_map_path = str(data_dir / "card_repr_dict_v1.pt")
cat_feat_map_path = str(data_dir / "type_and_keyw_dict.pt")

edh_scraper.main(
    this=this,
    card_dict=card_dict_path,
    out_jsonl=decks_path_all,
    out_jsonl_diversified=decks_path_div,
    max_archidekt=100000,
    per_color_bucket=3000,
    n_duplicates_per_strategy = 10,
    rate_per_sec=4.0
)

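# --- Create Datasets from Scraped Decklists ---
# This part is much faster and can be run independently, provided the .jsonl files are present.
# NOTE: `create_and_save_CPRdataset` is not imported anywhere in this notebook;
# import it from the project module that defines it before running the calls below.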
create_and_save_CPRdataset(decks_path_div, dataset_path_div, card_feat_map_path, cat_feat_map_path)
create_and_save_CPRdataset(decks_path_all, dataset_path_all, card_feat_map_path, cat_feat_map_path)

## Stage 6: CPR Model Training
This is the core training stage. Eight versions of the CPR recommender model are trained, one for each combination of dataset (`all` or `div`), loss function (triplet or InfoNCE), and number of epochs (20 or 200). Once trained, each model is used to generate a corresponding dictionary of card embeddings.

In [None]:
# --- WARNING ---
# This is a long-running, GPU-intensive process that trains 8 separate models.

from pathlib import Path
import torch
import torch.nn as nn
import train

base_dir = Path.cwd()
data_dir = base_dir / "data"
models_dir = base_dir / "models"

dataset_path_all = str(data_dir / "cpr_dataset_v1_all.pt")
dataset_path_div = str(data_dir / "cpr_dataset_v1_div.pt")
cat_feat_map_path = str(data_dir / "type_and_keyw_dict.pt")
card_feat_map_path = str(data_dir / "card_repr_dict_v1.pt")

# Determine number of types/keywords from data
types_and_keyw_dict = torch.load(cat_feat_map_path, map_location="cpu")
first_entry = next(iter(types_and_keyw_dict.values()))
num_types = len(first_entry['types'])
num_keyw = len(first_entry['keywords'])

# --- Training Loop for all configurations ---
for ds_path, ds_name in [(dataset_path_all, "all"), (dataset_path_div, "div")]:
    for epochs in [20, 200]:
        # Train with Triplet Loss
        triplet_checkpoint_path = str(models_dir / f"cpr_checkpoint_v1_{ds_name}_{epochs}_3.pt")
        train.main_cpr_training(
            cpr_dataset_path=ds_path,
            cpr_checkpoint_path=triplet_checkpoint_path,
            loss_fn=nn.TripletMarginLoss(margin=0.3),
            step_fn=train.cpr_step_fn_triplet,
            NUM_EPOCHS=epochs,
            NUM_TYPES=num_types, NUM_KEYW=num_keyw
        )

        # Train with InfoNCE Loss
        infonce_checkpoint_path = str(models_dir / f"cpr_checkpoint_v1_{ds_name}_{epochs}_nce.pt")
        train.main_cpr_training(
            cpr_dataset_path=ds_path,
            cpr_checkpoint_path=infonce_checkpoint_path,
            loss_fn=nn.CrossEntropyLoss(),
            step_fn=train.cpr_step_fn_infonce,
            NUM_EPOCHS=epochs,
            NUM_TYPES=num_types, NUM_KEYW=num_keyw
        )

# --- Generate Embedding Dictionaries for all trained models ---
for ds_name in ["all", "div"]:
    for epochs in ["20", "200"]:
        for loss_name in ["3", "nce"]:
            emb_dict_path = str(data_dir / f"emb_dict_v1_{ds_name}_{epochs}_{loss_name}.pt")
            cpr_checkpoint_path = str(models_dir / f"cpr_checkpoint_v1_{ds_name}_{epochs}_{loss_name}.pt")
            train.generate_and_save_emb_dict(
                card_feat_map_path, cat_feat_map_path, cpr_checkpoint_path,
                num_types, num_keyw, batch_size=64, out_path=emb_dict_path
            )

## Stage 7: Vector Database Construction
With the embedding dictionaries created for all eight models, we now populate the ChromaDB vector database. A separate collection is created for each model, allowing the final application to query them independently.

In [None]:
import vector_database
import chromadb
from pathlib import Path

base_dir = Path.cwd()
data_dir = base_dir / "data"
db_path = str(base_dir / "card_db")
client = chromadb.PersistentClient(path=db_path)
clean_data_path = str(data_dir / "clean_data.json")

for ds_name in ["all", "div"]:
    for epochs in ["20", "200"]:
        for loss_abr in ["nce", "3"]:
            emb_dict_path = str(data_dir / f"emb_dict_v1_{ds_name}_{epochs}_{loss_abr}.pt")
            db_name = f"mtg_cards_v1_{ds_name}_{epochs}_{loss_abr}"
            vector_database.build_and_save_chroma_db(emb_dict_path, clean_data_path, client, db_name)
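
Conceptually, querying one of these collections means finding the stored card embeddings closest to a query vector. The sketch below does this by brute force with cosine similarity. It is a stand-in for illustration, not ChromaDB itself, which uses an index rather than an exhaustive scan; the metric configured in `vector_database` is not shown here, and the card names and two-dimensional vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, embeddings, k=3):
    """Return the k names whose embeddings are most similar to the query."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embedding dictionary with hypothetical card names.
toy_embeddings = {
    "card_a": [1.0, 0.0],
    "card_b": [0.0, 1.0],
    "card_c": [1.0, 1.0],
}
print(top_k([1.0, 0.1], toy_embeddings, k=2))
```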

## Stage 8: Gradio Demo Launch
The final stage is to launch the interactive Gradio web application. This app provides an interface for querying the recommender system and collecting the valuable human feedback needed for model evaluation.

**Note:** Running a Gradio app within a Jupyter Notebook can sometimes be unstable or consume excessive memory, potentially causing the kernel to crash. It is highly recommended to run the demo by executing `python app.py` from the terminal.

In [None]:
import app
from pathlib import Path

base_dir = Path.cwd()
models_dir = base_dir / "models"

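# For this example, only two models are made available.
# NOTE: the file and collection names below differ from the ones written by
# Stages 6 and 7 (e.g. `cpr_checkpoint_v1_div_20_3.pt`); adjust them to whatever exists on disk.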
app.model_versions = {
    "Diversified Dataset (20 Epochs, Triplet)": {
        "checkpoint": str(models_dir / "cpr_checkpoint_v1_div_20_triplet_s2.pt"), # _s2.pt
        "db_name": "mtg_cards_v1_div_20_triplet_s2"
    },
    "Complete Dataset (200 Epochs, InfoNCE)": {
        "checkpoint": str(models_dir / "cpr_checkpoint_v1_all_200_nce_s2.pt"),
        "db_name": "mtg_cards_v1_all_200_nce_s2"
    }
}
 
for model_version in list(app.model_versions.keys()):
    app.get_retriever(model_version)

app.demo.launch(share=False)

Loading retriever for 'Diversified Dataset (20 Epochs, Triplet)' for the first time...


No sentence-transformers model found with name c:\A\UNIVERSITA'\AIDA\III ANNO\thesis\Card Recommender System\models\magic-distilbert-base-v1. Creating a new one with mean pooling.


Loading complete.
Loading retriever for 'Complete Dataset (200 Epochs, InfoNCE)' for the first time...


No sentence-transformers model found with name c:\A\UNIVERSITA'\AIDA\III ANNO\thesis\Card Recommender System\models\magic-distilbert-base-v1. Creating a new one with mean pooling.


Loading complete.
* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.




Processing Deck ID: 9151052, Model: Diversified Dataset (20 Epochs, Triplet), Prompt: ''
Using cached retriever for 'Diversified Dataset (20 Epochs, Triplet)'
--- Generating synergy recommendations ---

