# Finding Magic: The Gathering archetypes with LDA: Code

This notebook is meant as a supplement for [this article](https://medium.com/@hlynurd/finding-magic-the-gathering-archetypes-with-latent-dirichlet-allocation-729112d324a6). The results were obtained by working with [this data](Modern.htm). 
You can try this method on data from other formats as well. There is an API on <a href="https://mtgdecks.net" rel="follow">MTG Decks</a> to access the latest 500 tournament decklists from <a href="https://mtgdecks.net/decks/csv/Standard" rel="follow">Standard</a>, <a href="https://mtgdecks.net/decks/csv/Modern" rel="follow">Modern</a>,
<a href="https://mtgdecks.net/decks/csv/Legacy" rel="follow">Legacy</a>, <a href="https://mtgdecks.net/decks/csv/Vintage" rel="follow">Vintage</a>, <a href="https://mtgdecks.net/decks/csv/Commander" rel="follow">Commander</a>, <a href="https://mtgdecks.net/decks/csv/Pauper" rel="follow">Pauper</a>, <a href="https://mtgdecks.net/decks/csv/Frontier" rel="follow">Frontier</a>, <a href="https://mtgdecks.net/decks/csv/Peasant" rel="follow">Peasant</a>  or <a href="https://mtgdecks.net/decks/csv/Highlander" rel="follow">Highlander</a>.

## Preparing the data

The usual first step of machine learning tasks is making sure that the data is in the right form for our algorithms. The raw data is a csv file where each line represents a decklist. Each line contains a main deck and sideboard:

In [1]:
with open('Modern.htm', 'r') as f:
    print(f.readline())

"4 Celestial Colonnade 1 Celestial Purge 3 Cryptic Command 2 Detention Sphere 1 Disdainful Stroke 1 Dispel 1 Elspeth, Sun's Champion 2 Field of Ruin 4 Flooded Strand 1 Geist of Saint Traft 2 Ghost Quarter 1 Gideon Jura 2 Gideon of the Trials 1 Gideon, Ally of Zendikar 3 Glacial Fortress 1 Grafdigger's Cage 1 Hallowed Fountain 5 Island 1 Jace, Architect of Thought 3 Leyline of Sanctity 3 Mana Leak 2 Negate 4 Path to Exile 3 Plains 2 Rest in Peace 1 Search for Azcanta 4 Serum Visions 3 Snapcaster Mage 1 Sphinx's Revelation 4 Spreading Seas 2 Stony Silence 3 Supreme Verdict 1 Temple of Enlightenment 1 Think Twice 1 Vendilion Clique"



We feed the data into a gensim Dictionary, much as in [this tutorial](https://radimrehurek.com/gensim/tut1.html). We split each decklist into individual card names, ignoring the card counts, and then drop cards that appear in only one decklist.

In [2]:
import gensim
import re 
from six import iteritems

Using TensorFlow backend.


In [3]:
dictionary = gensim.corpora.Dictionary([x.strip() for x in re.split(r"[\d]+", line.replace("\"", ""))] for line in open('Modern.htm'))
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(once_ids)  # remove cards that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
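
To make the splitting step concrete, here is a minimal sketch that applies the same regular expression to a shortened, made-up decklist line (the line is an illustration, not taken from the data). It only uses the standard library, so it shows what the tokens look like before gensim sees them. Note that the split leaves an empty string in front of the first card, which the sketch filters out separately.

```python
import re

# A shortened, made-up decklist line in the same format as the raw data.
line = '"4 Path to Exile 3 Mana Leak 5 Island"'

# Same split as in the dictionary cell: cut on runs of digits, strip whitespace.
tokens = [x.strip() for x in re.split(r"[\d]+", line.replace('"', ""))]
print(tokens)

# The line starts with a number, so the first token is an empty string.
card_names = [t for t in tokens if t]
print(card_names)
```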

There are almost 700 cards in our Modern card dictionary:

In [4]:
unique_cards = len(dictionary.keys())
print(unique_cards)

698


Next we create a gensim Corpus. Instead of treating each deck as a plain set of card names, we take note of how many times each card appears in the deck and "uncompress" the decklist description, so that a card played as a four-of shows up four times.

In [5]:
import numpy as np

In [6]:
class MyCorpus(object):
    def __iter__(self):
        for line in open('Modern.htm'):
            decklist = line.replace("\"", "") # remove the surrounding quotes
            decklist = re.split(r"([\d]+)", decklist) # split into counts and card names
            decklist = [x.strip() for x in decklist] # remove whitespace
            decklist = [x for x in decklist if x] # remove empty strings (a list, so len() works in Python 3)
            cleaned_decklist = [] 
            for i in range(len(decklist) // 2): # drop the counts, repeat each card by its count
                for j in range(int(decklist[i*2])):
                    cleaned_decklist.append(decklist[i*2+1])
            yield dictionary.doc2bow(cleaned_decklist)
corpus_memory_friendly = MyCorpus()  
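
As a rough illustration of the "uncompressing" step, the sketch below expands a made-up decklist line into one entry per physical card and counts the entries with `collections.Counter`. The helper name `expand_decklist` and the sample line are assumptions for this example only. The counts it prints play the same role as the (id, count) pairs that `doc2bow` produces, except that they are keyed by card name rather than by dictionary id.

```python
import re
from collections import Counter

def expand_decklist(line):
    """Turn '4 Path to Exile 3 Mana Leak' into a list with one entry per copy."""
    parts = [p.strip() for p in re.split(r"([\d]+)", line.replace('"', ""))]
    parts = [p for p in parts if p]  # drop empty strings left by the split
    cards = []
    # parts now alternates: count, name, count, name, ...
    for count, name in zip(parts[0::2], parts[1::2]):
        cards.extend([name] * int(count))
    return cards

cards = expand_decklist('"4 Path to Exile 3 Mana Leak 1 Dispel"')
counts = Counter(cards)
print(len(cards))
print(counts)
```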



## Training the model
Now that the data is ready, we set the number of archetypes to be found. Setting it to 30 gave me good results. Try varying this and see what happens! 

In [7]:
archetypes = 30

Since there are stochastic steps in the training of the model, you might get slightly different results each time. Having the seed set to 1 allows you to recreate my results.

In [8]:
np.random.seed(1)

The "Latent Dirichlet" part of the method name reflects two assumptions: the archetypes are latent, meaning they are never observed directly and must be inferred from the decklists, and the [priors](https://en.wikipedia.org/wiki/Prior_probability) on the per-archetype card distributions and per-decklist archetype distributions are [Dirichlet](https://en.wikipedia.org/wiki/Dirichlet_distribution). These priors allow us to steer the learning of the model.

By incorporating such priors, we can tell the model what we believe the data actually looks like. If we have a large number of archetypes and are confident that each decklist falls under only one archetype, then setting a low alpha tells the model that we prefer each decklist to belong to a few dominant archetypes. We can similarly control the archetype-card sparsity with beta, which gensim calls `eta`.
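
To get a feel for what a low or high alpha implies, the following sketch draws archetype mixtures from a symmetric Dirichlet prior and reports how much weight the single largest archetype receives on average. It is a standalone illustration rather than part of the training code: the helper names, the number of trials and the comparison value of 10 are all assumptions, and the sampling uses the standard construction of a Dirichlet draw as normalized gamma draws.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over `size` components."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, size=30, trials=200, seed=0):
    """Average weight of the dominant component over several prior draws."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(alpha, size, rng)) for _ in range(trials)]
    return sum(shares) / trials

low = mean_largest_share(alpha=1.0 / 30)  # same value as 1.0 / archetypes
high = mean_largest_share(alpha=10.0)     # arbitrary large value for contrast
print("alpha = 1/30: largest archetype share ~ %.2f" % low)
print("alpha = 10:   largest archetype share ~ %.2f" % high)
```

With the low alpha, most of a decklist's weight tends to land on one archetype, while the high alpha spreads it almost evenly across all 30.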


In [None]:
alpha_prior = [1.0 / archetypes] * archetypes
beta_prior = [1.0 / archetypes] * unique_cards

We finally train the model. This could take a couple of minutes.

In [None]:
iterations = 30
lda = gensim.models.ldamodel.LdaModel(
    corpus=corpus_memory_friendly,
    id2word=dictionary,
    num_topics=archetypes,
    passes=iterations,
    alpha=alpha_prior,
    eta=beta_prior,
)

## Checking the results
A good rule of thumb while doing machine learning work is to do regular sanity checks. Anything from simple output prints to beautiful visualizations will help you understand what's going on. After the training is finished, we can explore the archetypes that it finds. Gensim offers a nice way to see the probability-card pairs in each archetype. 

In [None]:
number_of_top_cards = 16
archetypes_to_inspect = 3
for i in range(archetypes_to_inspect):
    print(("Archetype %i \n %s \n") % (i, lda.print_topic(i, topn=number_of_top_cards)))

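Since the model is generative, we can generate new decks as well. Here's an example of how to make a metagame-altering Affinity deck by sampling 75 cards (60 main deck, 15 sideboard) from one archetype, with at most four copies of any card: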

In [None]:
archetype_id = 13
# show_topic returns (card, probability) pairs; topn=9999 asks for every card
archetype_topic = np.array(lda.show_topic(archetype_id, topn=9999))

# float64 avoids rounding error that can make np.random.multinomial reject the probabilities
archetype_distribution = np.array(archetype_topic[:,1], dtype="float64")
archetype_distribution = archetype_distribution / np.sum(archetype_distribution)

archetype_indices = np.zeros(len(archetype_distribution))
main_deck = 60
sideboard = 15
while np.sum(archetype_indices) < main_deck+sideboard:
    new_card = np.random.multinomial(1, archetype_distribution)  # one-hot draw of a single card
    archetype_indices += new_card
    if 5 in archetype_indices:  # enforce the four-copy limit by undoing the draw
        archetype_indices -= new_card
archetype_cards = np.array(archetype_topic[:,0], dtype=str)  # dtype="string" is not valid in Python 3
minimum_cards = 1.0
deck_title = "Affinity for AI"
print(deck_title)
for i in range(len(archetype_distribution)):
    if archetype_indices[i] >= minimum_cards:        
        print('%i %s' % (archetype_indices[i], archetype_cards[i]))
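
The sampling loop above can also be tried without a trained model. The sketch below is a simplified, standard-library version of the same idea: draw one card at a time in proportion to its probability and reject the draw if it would exceed four copies. The function name `sample_deck` and the toy distribution of placeholder cards are assumptions made for illustration, not output of the LDA model.

```python
import random
from collections import Counter

def sample_deck(card_probs, deck_size=75, max_copies=4, seed=1):
    """Sample a deck card by card, rejecting draws beyond `max_copies`."""
    cards = list(card_probs)
    weights = [card_probs[c] for c in cards]
    if deck_size > max_copies * len(cards):
        raise ValueError("not enough distinct cards to fill the deck")
    rng = random.Random(seed)
    deck = Counter()
    while sum(deck.values()) < deck_size:
        card = rng.choices(cards, weights=weights, k=1)[0]
        if deck[card] < max_copies:  # skip the draw if the card is already a four-of
            deck[card] += 1
    return deck

# Hypothetical archetype: 25 placeholder cards with decreasing probabilities.
toy_probs = {"Card %02d" % i: 1.0 / (i + 1) for i in range(25)}
deck = sample_deck(toy_probs)
for card, copies in sorted(deck.items()):
    print("%i %s" % (copies, card))
```

Like the notebook cell, this sketch applies the copy limit to every card, so basic lands would also be capped at four.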

In [None]:
# Print the 32 most probable cards of every archetype
cards_per_archetype = 32
for j in range(archetypes):
    print("Archetype %i " % (j))
    # show_topic returns (card, probability) pairs sorted by probability
    for card, probability in lda.show_topic(j, topn=cards_per_archetype):
        print('%.3f %s' % (probability, card))
    print("\n")