# Finding Magic: The Gathering archetypes with LDA: Code

This notebook is meant as a supplement for [this article](https://medium.com/@hlynurd/finding-magic-the-gathering-archetypes-with-latent-dirichlet-allocation-729112d324a6). The results were obtained by working with [this data](Modern.htm). 
You can try this method on data from other formats as well. There is an API on <a href="https://mtgdecks.net" rel="follow">MTG Decks</a> to access the latest 500 tournament decklists from <a href="https://mtgdecks.net/decks/csv/Standard" rel="follow">Standard</a>, <a href="https://mtgdecks.net/decks/csv/Modern" rel="follow">Modern</a>,
<a href="https://mtgdecks.net/decks/csv/Legacy" rel="follow">Legacy</a>, <a href="https://mtgdecks.net/decks/csv/Vintage" rel="follow">Vintage</a>, <a href="https://mtgdecks.net/decks/csv/Commander" rel="follow">Commander</a>, <a href="https://mtgdecks.net/decks/csv/Pauper" rel="follow">Pauper</a>, <a href="https://mtgdecks.net/decks/csv/Frontier" rel="follow">Frontier</a>, <a href="https://mtgdecks.net/decks/csv/Peasant" rel="follow">Peasant</a>  or <a href="https://mtgdecks.net/decks/csv/Highlander" rel="follow">Highlander</a>.

## Preparing the data

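The usual first step in a machine learning task is making sure that the data is in the right form for our algorithms. The raw data is a csv file where each line represents a decklist, containing a main deck and a sideboard. The cells below read a preprocessed version of this data, `validated_decks.json`, whose `decks` entry holds one mapping from card identifier to copy count per deck.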

We feed the data into a gensim Dictionary, much as in [this tutorial](https://radimrehurek.com/gensim/tut1.html). We split each decklist into individual cards, ignoring the card counts, and drop cards that appear in only one deck.

In [None]:
import gensim
import json
import re 
from six import iteritems

In [None]:
with open('validated_decks.json') as f:
    j = json.load(f)

# build the dictionary from the card identifiers of each deck (copy counts are ignored here)
card_dictionary = gensim.corpora.Dictionary([card.strip() for card in deck] for deck in j['decks'])

# remove cards that appear in only one deck
once_ids = [tokenid for tokenid, docfreq in card_dictionary.dfs.items() if docfreq == 1]
card_dictionary.filter_tokens(once_ids)

# remove gaps in the id sequence left by the removed cards
card_dictionary.compactify()

unique_cards = len(card_dictionary)
print('unique cards: ', unique_cards)
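The filtering step can be tried without gensim. The sketch below uses a few made-up toy decks (hypothetical data, not taken from the dataset) and counts in how many decks each card appears, which is the quantity the Dictionary stores in `dfs`. Cards found in a single deck are dropped.

```python
from collections import Counter

# Hypothetical toy decks: card name -> number of copies (not real data).
toy_decks = [
    {"Lightning Bolt": 4, "Mountain": 20, "Goblin Guide": 4},
    {"Lightning Bolt": 4, "Mountain": 18, "Monastery Swiftspear": 4},
    {"Island": 20, "Counterspell": 4},
]

# Document frequency: in how many decks does each card appear?
# Iterating over a deck yields its card names, so copy counts are ignored.
doc_freq = Counter(card for deck in toy_decks for card in deck)

# Keep only the cards that show up in more than one deck.
kept_cards = sorted(card for card, freq in doc_freq.items() if freq > 1)
print(kept_cards)  # ['Lightning Bolt', 'Mountain']
```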

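Next we create a gensim Corpus. Rather than treating each deck as a set of distinct cards, we take note of how many times each card appears in a deck and "uncompress" the decklist description, repeating every card once per copy before converting the deck to bag-of-words counts.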

In [None]:
import numpy as np

In [None]:
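class MyCorpus(object):
    """Yield each decklist as a gensim bag-of-words vector."""
    def __iter__(self):
        with open('validated_decks.json') as f:
            j = json.load(f)

        for deck in j['decks']:
            # repeat each card once per copy so that doc2bow counts the copies
            cleaned_decklist = []
            for card_name, card_count in deck.items():
                # strip whitespace so the names match the dictionary entries
                cleaned_decklist.extend([card_name.strip()] * card_count)
            yield card_dictionary.doc2bow(cleaned_decklist)

corpus_memory_friendly = MyCorpus()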
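As a small illustration of the "uncompress" step, here is a sketch on a hypothetical toy deck (the card names and counts are made up). It expands the count mapping into a flat list of tokens and then counts them again, which is roughly what `doc2bow` does, except that `doc2bow` works with integer ids instead of names.

```python
from collections import Counter

# Hypothetical toy deck: card name -> number of copies (not real data).
toy_deck = {"Lightning Bolt": 4, "Goblin Guide": 4, "Mountain": 2}

# "Uncompress": repeat every card name once per copy.
expanded = [name for name, count in toy_deck.items() for _ in range(count)]

# Counting the tokens recovers the per-card copy counts.
bag = Counter(expanded)

print(len(expanded))          # 10
print(bag["Lightning Bolt"])  # 4
```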

## Training the model
Now that the data is ready, we set the number of archetypes to be found. Setting it to 30 gave me good results. Try varying this and see what happens!

In [None]:
archetypes = 30

Since there are stochastic steps in the training of the model, you might get slightly different results each time. Having the seed set to 1 allows you to recreate my results.

In [None]:
np.random.seed(1)

The "Latent Dirichlet" part of the method name comes from the assumption that the latent [priors](https://en.wikipedia.org/wiki/Prior_probability) on the per-archetype card distribution and per-decklist archetype distributions are [Dirichlet](https://en.wikipedia.org/wiki/Dirichlet_distribution). This allows us to steer the learning of the model.

By incorporating such priors, we can tell the model what we believe the data actually looks like. If we have a large number of archetypes and are confident that each decklist falls under only one archetype, then setting a low alpha indicates that we prefer each decklist to belong to a few dominating archetypes. We can similarly control the archetype-card sparsity with beta.


In [None]:
alpha_prior = [1.0 / archetypes] * archetypes
beta_prior = [1.0 / archetypes] * unique_cards
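To get a feel for what a low alpha does, the sketch below draws samples from a symmetric Dirichlet using only the standard library and compares the average share of the largest component for a small and a large concentration value. The helper names, the number of trials, and the value 10.0 are illustrative assumptions and are not part of the notebook.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        if total > 0:  # guard against underflow when alpha is very small
            return [d / total for d in draws]

def mean_largest_share(concentration, num_archetypes=30, trials=200, seed=1):
    """Average, over several samples, of the largest archetype share."""
    rng = random.Random(seed)
    alpha = [concentration] * num_archetypes
    shares = (max(sample_dirichlet(alpha, rng)) for _ in range(trials))
    return sum(shares) / trials

sparse = mean_largest_share(1.0 / 30)  # same value as alpha_prior above
flat = mean_largest_share(10.0)        # hypothetical, much larger value

print(sparse, flat)
```

With the low value, one archetype tends to take most of the probability mass of a sampled deck, while the large value spreads the mass far more evenly.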

We finally train the model. This could take a couple of minutes.

In [None]:
iterations = 30
lda = gensim.models.ldamodel.LdaModel(corpus=corpus_memory_friendly,
                                      id2word=card_dictionary,
                                      num_topics=archetypes,
                                      passes=iterations,
                                      alpha=alpha_prior,
                                      eta=beta_prior)

## Checking the results

**Define Functions to convert card IDs to their name and picture**

In [None]:
import requests
from PIL import Image
from io import BytesIO

def getCard(card_id):
    """
    Returns the card name and "normal" image URL for a multiverse ID,
    looked up through the Scryfall API. Missing fields come back as ''.
    """
    r = requests.get('https://api.scryfall.com/cards/multiverse/' + str(card_id))
    data = r.json()
    try:
        name = data['name']
    except KeyError:
        name = ''
    try:
        url = data['image_uris']['normal']
    except KeyError:
        url = ''
    return name, url

def getImage(url):
    """
    Returns the image at the given URL (e.g. a "normal" JPEG from the
    Scryfall Image Library) as a PIL Image
    """
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    return img

A good rule of thumb while doing machine learning work is to do regular sanity checks. Anything from simple output prints to beautiful visualizations will help you understand what's going on. After the training is finished, we can explore the archetypes that it finds. Gensim offers a nice way to see the probability-card pairs in each archetype. 

In [None]:
number_of_top_cards = 16
archetypes_to_inspect = 3
for i in range(archetypes_to_inspect):
    print(("Archetype %i \n %s \n") % (i, lda.print_topic(i, topn=number_of_top_cards)))

**Take a look at archetypes**

In [None]:
def print_archetype_topn(archetype_id, topn, show_name=True):
    """
    Print the top n most probable cards for an archetype.
    
    This function prints the card name when 'show_name'==True and the
    card_id when 'show_name'==False.
    """
    # show_topic returns (card_id, probability) pairs, most probable first
    for card_id, card_prob in lda.show_topic(archetype_id, topn=topn):
        card_name = getCard(card_id)[0] if show_name else card_id
        print('{:.4f} {}'.format(float(card_prob), card_name))

In [None]:
archetype_id = 13
topn = 30

print_archetype_topn(archetype_id, topn, show_name=False)

**Generate a Deck**

Since the model is generative, we can generate new decks as well. Here's an example of how to make a metagame-altering Affinity deck:

In [None]:
archetype_id = 10
# (card_id, probability) pairs for every card in the archetype
archetype_topic = lda.show_topic(archetype_id, topn=9999)
archetype_cards = [card_id for card_id, _ in archetype_topic]

# float64 reduces the rounding error that can push the sum slightly above 1
archetype_distribution = np.array([prob for _, prob in archetype_topic], dtype="float64")
archetype_distribution = archetype_distribution / np.sum(archetype_distribution)

archetype_indices = np.zeros(len(archetype_distribution))
main_deck = 60
sideboard = 15
while np.sum(archetype_indices) < main_deck + sideboard:
    # draw a single card from the archetype's card distribution
    new_card = np.random.multinomial(1, archetype_distribution)
    archetype_indices += new_card
    # reject the draw if it produced a fifth copy of any card
    if 5 in archetype_indices:
        archetype_indices -= new_card

minimum_cards = 1.0
deck_title = 'Archetype: {}'.format(archetype_id)
print(deck_title)
for i in range(len(archetype_distribution)):
    if archetype_indices[i] >= minimum_cards:
        print('%i %s' % (archetype_indices[i], getCard(archetype_cards[i])[0]))
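The sampling loop above can be tried without a trained model. The following is a minimal sketch of the same idea, assuming a made-up card distribution: cards are drawn one at a time in proportion to their weight, and a draw is rejected when it would exceed four copies. The function name `sample_deck`, the card names, and the weights are hypothetical.

```python
import random

def sample_deck(card_probs, deck_size=75, max_copies=4, seed=1):
    """Sample deck_size cards, never exceeding max_copies of any card."""
    names = list(card_probs)
    if deck_size > max_copies * len(names):
        raise ValueError("not enough distinct cards to fill the deck")
    weights = [card_probs[name] for name in names]
    rng = random.Random(seed)
    counts = {name: 0 for name in names}
    drawn = 0
    while drawn < deck_size:
        # draw one card in proportion to its weight
        card = rng.choices(names, weights=weights, k=1)[0]
        # reject the draw if the card already has the maximum number of copies
        if counts[card] < max_copies:
            counts[card] += 1
            drawn += 1
    return {name: count for name, count in counts.items() if count > 0}

# Hypothetical archetype: 25 placeholder cards with decreasing weights.
toy_archetype = {"Card {}".format(i): 1.0 / (i + 1) for i in range(25)}

deck = sample_deck(toy_archetype)
for name, count in deck.items():
    print(count, name)
```

Like the notebook's loop, this sketch applies the four-copy limit to every card, so it does not treat basic lands specially.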

## Exporting Archetypes

In [None]:
def export_archetypes():
    """
    Export the most probable cards per archetype to a json file.
    """
    num_archetypes = lda.num_topics
    num_cards = 30
    archetypes_list = []
    for archetype_id in range(num_archetypes):
        archetype_json = {}
        archetype_json['archetype_id'] = archetype_id
        archetype_json['num_cards'] = num_cards
        archetype_json['cards'] = []
        for card_id, prob in lda.show_topic(archetype_id, topn=num_cards):
            card_name, image_url = getCard(card_id)
            # store the probability as a plain float so that json can serialize it
            archetype_json['cards'].append({'card_id': card_id,
                                            'probability': float(prob),
                                            'card_name': card_name,
                                            'image_url': image_url})
        archetypes_list.append(archetype_json)
    with open('30_archetypes.json', 'w') as f:
        json.dump(archetypes_list, f)

In [None]:
# Export the archetypes; the cells below read 30_archetypes.json
export_archetypes()

In [None]:
def print_deck_freq(archetype_num, filename):
    """
    Print how often each card name occurs among an exported archetype's cards.
    """
    with open(filename, 'r') as f:
        data = json.load(f)
    card_dict = {}
    deck = data[archetype_num]
    for card in deck['cards']:
        # the export already stores the name, so no API call is needed here
        name = card['card_name']
        card_dict[name] = card_dict.get(name, 0) + 1
    print(card_dict)

In [None]:
print_deck_freq(15, '30_archetypes.json')

## Judging Deck Archetypes

In [None]:
def get_deck_archetypes(deck, topn=5):
    """
    Return the top n archetypes of a deck as (archetype_id, probability) pairs.
    """
    deck_corpus = card_dictionary.doc2bow(deck)
    archetype_probs = lda.get_document_topics(deck_corpus)
    # sort by probability, highest first; if the model reports fewer than
    # topn archetypes for this deck, all of them are returned
    return sorted(archetype_probs, key=lambda pair: pair[1], reverse=True)[:topn]

query_deck = []
query_deck.append('447176')    # Top card from Archetype 10
query_deck.append('447148')    # 2nd top card from Archetype 10
print(get_deck_archetypes(query_deck))
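The top-n selection can also be tried on its own. The sketch below uses hypothetical `(archetype_id, probability)` pairs in the shape that `get_document_topics` returns and picks the most probable ones with `heapq.nlargest`. The numbers are made up for illustration.

```python
import heapq

# Hypothetical (archetype_id, probability) pairs, not real model output.
archetype_probs = [(3, 0.05), (10, 0.70), (17, 0.20), (22, 0.05)]

# The two most probable archetypes, highest probability first.
top2 = heapq.nlargest(2, archetype_probs, key=lambda pair: pair[1])
print(top2)  # [(10, 0.7), (17, 0.2)]

# Asking for more archetypes than are available returns all of them.
top10 = heapq.nlargest(10, archetype_probs, key=lambda pair: pair[1])
print(len(top10))  # 4
```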