# Topic Modeling and Magic: The Gathering

This notebook is meant to be read alongside this blog post. Here I use [all the Legacy decks registered in 2020 on mtgtop8](https://www.mtgtop8.com/format?f=LE&meta=199). I obtained this file from a companion project, [spider_mtg](https://github.com/pfr974/spider_mtg), and have done the same for other years here: https://github.com/pfr974/mtg-legacy-data. Please feel free to use them, and point out any mistakes you see (シ_ _)シ 

## Prerequisites:

You will need the following libraries:
- [NumPy](https://numpy.org/doc/stable/user/quickstart.html) 
- [six](https://six.readthedocs.io/) 
- [gensim](https://radimrehurek.com/gensim/)
- [pyLDAvis](https://pyldavis.readthedocs.io/en/latest/readme.html#installation)

## Acknowledgments ೕ(･ㅂ･ ):

The starting point for this project was [this article](https://towardsdatascience.com/finding-magic-the-gathering-archetypes-with-latent-dirichlet-allocation-729112d324a6) by [hlynurd](https://github.com/hlynurd), which I read a while ago. Please give it a read and check [his notebook](https://github.com/hlynurd/lda-for-magic)!

Both [hlynurd](https://github.com/hlynurd) and I pretty much followed [this tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html), which explains the core concepts needed to understand and use [gensim](https://radimrehurek.com/gensim/).

## Disclaimer (`Д´)ゞ:

The information presented here about Magic: The Gathering is copyrighted by Wizards of the Coast. This project is not produced, endorsed, supported, or affiliated with Wizards of the Coast.

https://www.mtgtop8.com/ is the source of my data. This project would not have been possible without their amazing work!

**I by no means claim to be a data science expert. Feel free to criticize if you don't agree with something**.

In [2]:
# Adapted from: https://github.com/hlynurd/lda-for-magic/blob/master/lda-mtg-notebook.ipynb

#import pandas as pd
#import itertools as it

import gensim 
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.nmf import Nmf
from gensim.models.hdpmodel import HdpModel
from gensim.models.wrappers import LdaVowpalWabbit, LdaMallet

import json
import numpy as np
import re 
from six import iteritems

import logging
try:
    import pyLDAvis.gensim
except ImportError:
    # The exception must be raised, otherwise a missing pyLDAvis goes unnoticed
    raise ValueError("SKIP: please install pyLDAvis")
    
import warnings
warnings.filterwarnings('ignore')  # To ignore all warnings that arise here to enhance clarity

In [3]:
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
logging.debug("test")

DEBUG:root:test


## Importing and processing the data

The documents (decklists) are stored in a single file, one document per line. We have 75 cards in a deck: 60 cards mainboard, 15 cards sideboard.

In [4]:
with open('single_legacy_2020.txt', 'r') as f:
    print(f.readline())

"3  Bayou 1  Dryad Arbor 2  Marsh Flats 3  Misty Rainforest 3  Polluted Delta 1  Snow-Covered Swamp 3  Underground Sea 4  Verdant Catacombs 4  Bloodghast 4  Gravecrawler 4  Hedron Crab 4  Hogaak, Arisen Necropolis 2  Putrid Imp 4  Stitcher\\'s Supplier 4  Vengevine 4  Cabal Therapy 2  Careful Study 4  Altar of Dementia 4  Bridge from Below 3  Chain of Vapor 4  Force of Vigor 4  Leyline of the Void 1  Oko, Thief of Crowns 3  Thoughtseize "



We need to know the set of all words that will be used in the corpus, i.e. the **vocabulary**. Here, it corresponds to the card names. Fortunately, gensim has a class which can do that: **gensim.corpora.Dictionary**. We construct a memory-friendly dictionary without loading all the decklists into memory; see [core concepts of gensim](https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html). Note that we also remove the card names that appear in only one decklist (document frequency of 1).

In [5]:
dictionary = gensim.corpora.Dictionary([x.strip() for x in re.split(r"[\d]+", line.replace("\"", ""))] for line in open('single_legacy_2020.txt'))
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(once_ids)  # remove cards that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(1639 unique tokens: ['', 'Altar of Dementia', 'Bayou', 'Bloodghast', 'Bridge from Below']...) from 3718 documents (total 124854 corpus positions)
DEBUG:gensim.corpora.dictionary:rebuilding dictionary, shrinking gaps
DEBUG:gensim.corpora.dictionary:rebuilding dictionary, shrinking gaps


Dictionary(1206 unique tokens: ['', 'Altar of Dementia', 'Bayou', 'Bloodghast', 'Bridge from Below']...)


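We obtain a vocabulary of 1206 unique tokens. Note that one of them is the empty string `''`, an artifact of splitting each line on the card counts, so this amounts to 1205 card names.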

Now, in terms of preprocessing steps, we do not have as much to do as for, let's say, a collection of newspaper articles. No stop words here! Looking at what we have above for a line, we need to remove:
- the **\"** character at the start and end of the line;
- the number of cards.
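As a minimal sketch of these two steps (not the notebook's actual code), the hypothetical helper `parse_decklist` below strips the quotes and separates each count from its card name. It assumes that card names contain no digits, since the split is done on runs of digits.

```python
import re

def parse_decklist(line):
    """Turn '"3  Bayou 1  Dryad Arbor "' into [(3, 'Bayou'), (1, 'Dryad Arbor')]."""
    # Remove the surrounding quotes, then split on the counts (kept via the capture group)
    parts = re.split(r"(\d+)", line.replace('"', ""))
    # Drop surrounding whitespace and the empty strings produced by the split
    parts = [p.strip() for p in parts if p.strip()]
    # parts now alternates count, name, count, name, ...
    return [(int(count), name) for count, name in zip(parts[0::2], parts[1::2])]

sample = '"3  Bayou 1  Dryad Arbor 4  Hogaak, Arisen Necropolis "'
print(parse_decklist(sample))
```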

Similarly to the vocabulary dictionary, we want a **memory-friendly corpus**. Following [the core concepts of gensim](https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html) and [hlynurd's original notebook](https://github.com/hlynurd/lda-for-magic), we define a class **MyCorpus** that yields documents one at a time and preprocesses them on the fly.

In [6]:
class MyCorpus(object):
    """Stream decklists from disk and yield one bag-of-words per deck."""

    def __iter__(self):
        with open('single_legacy_2020.txt') as f:
            for line in f:
                decklist = line.replace("\"", "") # remove start and end tokens
                decklist = re.split(r"([\d]+)", decklist) # split by numbers and card names
                decklist = [x.strip() for x in decklist] # remove whitespace
                decklist = list(filter(None, decklist)) # remove empty words
                cleaned_decklist = []
                # decklist alternates [count, name, count, name, ...]
                for count, name in zip(decklist[0::2], decklist[1::2]):
                    # repeat each card name as many times as it appears in the deck
                    # (the count must be converted with int(), not measured with len())
                    cleaned_decklist.extend([name] * int(count))
                yield dictionary.doc2bow(cleaned_decklist)

corpus_memory_friendly = MyCorpus()

A gensim corpus contains the word id and its frequency. With the line <code> <i>yield dictionary.doc2bow(cleaned_decklist)</i> </code>, we convert a list of tokenized words to their ids via the dictionary and yield the resulting bag-of-words (bow) representation. Put simply, we count how many times each card name, identified by its **id**, appears in a decklist. 
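To make the bag-of-words idea concrete, here is a plain-Python sketch that mimics what `doc2bow` returns: a sorted list of `(id, count)` pairs, with unknown tokens ignored. The helper `to_bow` and the ids in `token2id` are hypothetical and only serve as illustration; gensim's real implementation lives in `Dictionary.doc2bow`.

```python
from collections import Counter

def to_bow(cards, token2id):
    """Count how often each known card id occurs in a decklist."""
    counts = Counter(token2id[card] for card in cards if card in token2id)
    return sorted(counts.items())

# Hypothetical ids, for illustration only
token2id = {"Bayou": 0, "Dryad Arbor": 1}
cards = ["Bayou", "Bayou", "Bayou", "Dryad Arbor", "Unknown Card"]
print(to_bow(cards, token2id))  # [(0, 3), (1, 1)]
```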

## Analysis

We can now proceed and search for different archetypes using LDA. We have below a function called <code><i>compute_models_coherence</i></code> to do so. It returns a list of models and their u_mass coherence values for various numbers of topics. **The coherence lets us evaluate quantitatively how good a model is**, i.e. how well it finds patterns in the corpus. Sure, we could simply read all the weighted card names associated with a topic to see if they make sense, but I don't think you would like to go through hundreds of topics. Moreover, human interpretation is subjective. To compare our different models, we will therefore look at their coherence scores.
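To give an intuition for what a u_mass score measures, here is a toy sketch of the UMass formula as I understand it: for each ordered pair of a topic's top cards, take the log of how often the two cards appear together in a deck (plus 1 for smoothing), divided by how often the earlier-ranked card appears. This is only an illustration on made-up decks; gensim's `CoherenceModel` differs in details such as smoothing and aggregation, and the function `umass_coherence` is a hypothetical name.

```python
import math

def umass_coherence(top_cards, decks):
    """UMass-style score for one topic: sum of log((D(wi, wj) + 1) / D(wj))."""
    deck_sets = [set(deck) for deck in decks]

    def doc_freq(*cards):
        # Number of decks that contain all the given cards
        return sum(1 for deck in deck_sets if all(c in deck for c in cards))

    score = 0.0
    for m in range(1, len(top_cards)):
        for l in range(m):
            # Assumes every top card appears in at least one deck
            joint = doc_freq(top_cards[m], top_cards[l])
            score += math.log((joint + 1) / doc_freq(top_cards[l]))
    return score

# Made-up decks, for illustration only
decks = [
    ["Bayou", "Thoughtseize", "Hogaak, Arisen Necropolis"],
    ["Bayou", "Thoughtseize"],
    ["Brainstorm", "Ponder"],
    ["Brainstorm", "Ponder", "Force of Will"],
]
coherent = umass_coherence(["Bayou", "Thoughtseize"], decks)
mixed = umass_coherence(["Bayou", "Ponder"], decks)
print(coherent > mixed)  # cards that share decks score higher
```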

In [7]:
def compute_models_coherence(dictionary, corpus_memory_friendly, model, limit, start=2, step=3):
    """
    Return topic modeling models and u_mass coherence values for various number of topics.
    For more info about coherence, see:
    - https://radimrehurek.com/gensim/models/coherencemodel.html
    - https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus_memory_friendly : Gensim corpus (any iterable yielding bag-of-words documents)
    model : the topic modeling model, as a string ('lda' or 'nmf')
    start : Starting number of topics
    limit : Max number of topics (exclusive)
    step : increment

    Returns:
    -------
    model_list : List of LDA or NMF topic models
    coherence_values : u_mass Coherence values corresponding to the LDA or NMF model with respective number of topics
    """
    
    coherence_values = []
    model_list = []
    
    # Number of passes over the corpus. LdaModel's `iterations` keeps its default (50),
    # so iterations and passes end up with the same value.
    iterations = 50
    # See https://groups.google.com/g/gensim/c/z0wG3cojywM to read about the difference between passes and iterations 
    
    np.random.seed(1) # For reproducibility
    unique_cards = len(dictionary.keys())
    
    if model == 'nmf':

        for archetypes in range(start, limit, step):

            # random_state expects a seed; np.random.seed(1) returns None, so pass the seed directly.
            # A separate name (topic_model) avoids overwriting the `model` argument.
            topic_model = Nmf(corpus=corpus_memory_friendly, num_topics=archetypes, id2word=dictionary, chunksize=2000,
                              passes=iterations, kappa=.1, minimum_probability=0.01, w_max_iter=300,
                              w_stop_condition=0.0001, h_max_iter=100,
                              h_stop_condition=0.001, eval_every=10,
                              normalize=True, random_state=1)

            model_list.append(topic_model)
            coherencemodel = CoherenceModel(model=topic_model, corpus=corpus_memory_friendly, dictionary=dictionary, coherence='u_mass')
            coherence_values.append(coherencemodel.get_coherence())

    elif model == 'lda':

        for archetypes in range(start, limit, step):

            # Symmetric priors: alpha over the topics, eta (beta) over the card names
            alpha_prior = [1.0 / archetypes] * archetypes
            beta_prior = [1.0 / archetypes] * unique_cards

            topic_model = gensim.models.ldamodel.LdaModel(corpus=corpus_memory_friendly, id2word=dictionary,
                                                          num_topics=archetypes, passes=iterations,
                                                          alpha=alpha_prior, eta=beta_prior)
            model_list.append(topic_model)
            coherencemodel = CoherenceModel(model=topic_model, corpus=corpus_memory_friendly, dictionary=dictionary, coherence='u_mass')
            coherence_values.append(coherencemodel.get_coherence())

    else:
        raise ValueError("model must be 'lda' or 'nmf'")
    
    return model_list, coherence_values

We train the LDA model for 2, 8 and 14 topics (`start=2`, `limit=20`, `step=6`).

In [None]:
model_list_lda, coherence_values_lda = compute_models_coherence(dictionary=dictionary,
                                                                corpus_memory_friendly = corpus_memory_friendly, 
                                                                model='lda', start=2, limit=20, step=6)
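Once the models are trained, one possible way to compare them is to pair each number of topics with its coherence value and keep the highest one (u_mass scores are typically negative, and values closer to zero are usually read as more coherent). The sketch below assumes that reading; `pick_best` is a hypothetical helper, and the scores are placeholders, not results from this notebook.

```python
def pick_best(topic_counts, coherence_values):
    """Return the (num_topics, coherence) pair with the highest coherence."""
    return max(zip(topic_counts, coherence_values), key=lambda pair: pair[1])

# Same grid as the call above: start=2, limit=20, step=6
topic_counts = list(range(2, 20, 6))
print(topic_counts)  # [2, 8, 14]

# Placeholder scores for illustration only; these are NOT results from the notebook
placeholder_scores = [-1.5, -0.9, -1.2]
print(pick_best(topic_counts, placeholder_scores))  # (8, -0.9)
```

In the notebook itself, `coherence_values_lda` would take the place of the placeholder scores, and the matching entry of `model_list_lda` would be the model to inspect.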