### Topic Modeling with NNMF

In [1]:
import pandas as pd
import re

In [2]:
# import modern magic cards from file 
modern = pd.read_pickle('data/5color_modern_no_name_hardmode.pkl')

def nolist(x):
    # 'types' holds a list per card; keep only its last entry
    return x[-1]

modern['type'] = modern['types'].apply(nolist)

modern.head(2)

| | artist | cmc | colors | flavor | manaCost | name | power | rarity | text | toughness | type | types | set | releaseDate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Michael Sutfin | 4.0 | Black | To gaze under its hood is to invite death. | {2}{B}{B} | Abyssal Specter | 2 | Uncommon | Flying Whenever This deals damage to a player,... | 3 | Creature | [Creature] | Eighth Edition | 2003-07-28 |
| 1 | Wayne England | 5.0 | Blue | Pray that it doesn't seek the safety of your l... | {3}{U}{U} | Air Elemental | 4 | Uncommon | Flying | 4 | Creature | [Creature] | Eighth Edition | 2003-07-28 |


If anything, we have too much data in this case, and none of it is unlabeled or missing.

#### CountVectorizer Time 
Starting simple. Just a term frequency matrix. 

In [4]:
# vocab size 
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english")
vectorized_data = vectorizer.fit_transform(modern.text) 
names = vectorizer.get_feature_names()

print "There are {:,} words in the vocabulary.".format(len(vectorizer.vocabulary_))

There are 1,020 words in the vocabulary.
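
To make the term frequency matrix concrete, here is a minimal pure-Python sketch of the counting step on three made-up card texts. It is only an illustration: `toy_texts` and `term_frequency_matrix` are hypothetical names, and the plain whitespace tokenizer skips the stop-word removal and token pattern that `CountVectorizer` applies.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for modern.text.
toy_texts = [
    "flying",
    "draw a card then discard a card",
    "target creature gains flying",
]

def term_frequency_matrix(texts):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in texts[i]."""
    counts = [Counter(text.lower().split()) for text in texts]
    vocabulary = sorted(set(word for c in counts for word in c))
    # A Counter returns 0 for words a text does not contain.
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

vocabulary, rows = term_frequency_matrix(toy_texts)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is one card and each column is one vocabulary word, which is the same layout as `vectorized_data`, only dense and much smaller.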


#### Non-negative matrix factorization 

In [23]:
# Non-Negative Matrix factorization

# code modified from documentation

from sklearn.decomposition import NMF

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print
        print "Topic #%d:" % (topic_idx + 1)
        print " ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]])
    
    
nmf = NMF(n_components=9, random_state=41).fit(vectorized_data)

nmf_feature_names = vectorizer.get_feature_names()

print_top_words(nmf, nmf_feature_names, 20)


Topic #1:
creature sacrifice token dies power green white blocked black attacks battlefield tokens deals control damage blocks attacking additional zombie regenerate

Topic #2:
card library cards hand graveyard player search reveal draw shuffle mana land exile cost puts cast return revealed reveals look

Topic #3:
damage player deals combat dealt flying prevent turn target equal deal controls instead cards number source way players discards prevented

Topic #4:
turn end gets gains flying haste trample gain strike beginning cast dealt target attacks creatures activate ability prevent able step

Topic #5:
battlefield enters flying token return counter tokens green counters land white control exile graveyard sacrifice tapped artifact haste leaves owners

Topic #6:
control creatures number power gain equal long opponents tap flying toughness counter thiss dont lands untap owners greater trample end

Topic #7:
life gain loses player opponent equal beginning upkeep lose total pay cards toug
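
The `argsort()[:-n_top_words - 1:-1]` slice inside `print_top_words` is compact but easy to misread: it orders the indices by ascending weight, then walks backwards from the end to pick the `n_top_words` largest. A small pure-Python sketch of the same idea, using made-up weights and a hypothetical helper name `top_indices`:

```python
def top_indices(weights, n_top):
    """Indices of the n_top largest weights, largest first."""
    # Same ordering that numpy's argsort would give: ascending by weight.
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Step backwards from the last index and stop after n_top items.
    return ascending[:-n_top - 1:-1]

# Made-up terms and topic weights, for illustration only.
toy_terms = ["creature", "flying", "library", "damage"]
toy_weights = [0.1, 0.9, 0.4, 0.7]

top_terms = [toy_terms[i] for i in top_indices(toy_weights, 2)]
print(top_terms)
```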

In [7]:
H = nmf.transform(vectorized_data)
W = nmf.components_
print H.shape
print W.shape

(7874L, 9L)
(9L, 1020L)


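Here the H matrix holds 7874 individual cards by 9 labels (topics),  
and the W matrix holds those 9 labels by the 1020 unique words in the vocabulary. Note that scikit-learn's own documentation uses the opposite naming: there W is the output of `transform` and H is `components_`.  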
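
To show where those two shapes come from, here is a toy factorization in pure Python. It is a sketch under stated assumptions, not what scikit-learn runs: it uses the classic multiplicative update rule, which is not necessarily the solver `NMF` uses by default, and the names `toy_nmf`, `doc_topic`, `topic_term` and `toy_counts` are all hypothetical. The neutral names `doc_topic` (cards by topics) and `topic_term` (topics by words) sidestep the W/H naming question.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(x, doc_topic, topic_term):
    """Frobenius norm of x minus the product doc_topic * topic_term."""
    approx = matmul(doc_topic, topic_term)
    return sum((x[i][j] - approx[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0]))) ** 0.5

def toy_nmf(x, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor x (docs by terms) into two non-negative matrices."""
    rng = random.Random(seed)
    n_docs, n_terms = len(x), len(x[0])
    doc_topic = [[rng.random() + 0.1 for _ in range(n_topics)]
                 for _ in range(n_docs)]
    topic_term = [[rng.random() + 0.1 for _ in range(n_terms)]
                  for _ in range(n_topics)]
    for _ in range(n_iter):
        # Multiplicative update for the topics-by-terms factor.
        dt = transpose(doc_topic)
        num = matmul(dt, x)
        den = matmul(matmul(dt, doc_topic), topic_term)
        topic_term = [[topic_term[k][j] * num[k][j] / (den[k][j] + eps)
                       for j in range(n_terms)] for k in range(n_topics)]
        # Multiplicative update for the docs-by-topics factor.
        tt = transpose(topic_term)
        num = matmul(x, tt)
        den = matmul(doc_topic, matmul(topic_term, tt))
        doc_topic = [[doc_topic[i][k] * num[i][k] / (den[i][k] + eps)
                      for k in range(n_topics)] for i in range(n_docs)]
    return doc_topic, topic_term

# Made-up counts: 4 cards by 5 words, with two obvious word groups.
toy_counts = [
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
]

doc_topic, topic_term = toy_nmf(toy_counts, n_topics=2)
print("{} x {}".format(len(doc_topic), len(doc_topic[0])))    # cards x topics
print("{} x {}".format(len(topic_term), len(topic_term[0])))  # topics x words
```

With 7874 cards, 9 topics and 1020 words in place of the toy sizes, the two factors have the shapes printed above.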

In [22]:
# modified code from documentation

def print_top_cards(model, feature_names, n_top_art):
    # model is the card-by-topic matrix (H above); feature_names is unused
    human_labels = [' enchant ', ' deck manipulation ', ' damage ', 
                    ' planeswalkers/choose one ', ' creature tokens ', 
                    ' counters ', 
                    ' life total tricks ', '---- ', ' enchant ']
    for topic_idx, topic in enumerate(model.T):
        print "Topic: -----------%s---------" % human_labels[topic_idx]
        indices = [i for i in topic.argsort()[:-n_top_art - 1:-1]]
        for i in indices:
            print "Actual color =", modern.colors[i]
            print "Actual type =", modern.type[i]
            print modern.text[i][:158]
            print
        print
        print

    print
    
print_top_cards(H, nmf_feature_names, 4)

Topic: ----------- enchant ---------
Actual color = Green
Actual type = Enchantment
Enchant creature Enchanted creature has "Tap : This creature deals damage equal to its power to target creature. That creature deals damage equal to its power

Actual color = Black
Actual type = Enchantment
Enchant creature Enchanted creature has "At the beginning of your upkeep, sacrifice this creature." When enchanted creature dies, its controller chooses targe

Actual color = Green
Actual type = Enchantment
Enchant creature Enchanted creature gets +2/+2. When enchanted creature dies, you may return This from your graveyard to the battlefield attached to a creatur

Actual color = Black
Actual type = Instant
Destroy target nonwhite, nonblack creature. Put a 1/1 white Spirit creature token with flying onto the battlefield. Haunt  When the creature This haunts dies,



Topic: ----------- deck manipulation ---------
Actual color = Black
Actual type = Sorcery
Reveal a card from your hand. Search your libra

Nearly everything is misclassified. Working from a munged dataset of just the 5 colors, I expected to get 5 colors back out. What I got instead were rules topics, not color topics: different types of cards and actions you can take in the game.  

### To test: let's throw in all the cards.

The previous dataset had colorless and multicolor cards removed.

In [24]:
# import all modern cards 
all_modern = pd.read_pickle('data/all_cards_modern_no_name.pkl')

# vectorize 
vectorizer = CountVectorizer(stop_words='english')
vectorized_data = vectorizer.fit_transform(all_modern.text) 

print "There are {:,} words in the vocabulary.".format(len(vectorizer.vocabulary_))

# NMF
nmf = NMF(n_components=9, random_state=41).fit(vectorized_data)

nmf_feature_names = vectorizer.get_feature_names()

print_top_words(nmf, nmf_feature_names, 20)

There are 5,723 words in the vocabulary.

Topic #1:
creature enchanted enchant target gets equip flying equipped tap sacrifice token destroy blocked untap long dies attached attach block counter

Topic #2:
card graveyard hand draw target return exile cost player discard opponent library reveals converted face discards reveal beginning owners sorcery

Topic #3:
turn end gets target gains haste gain attacks trample dealt strike ability activate beginning step flying able untap face blocks

Topic #4:
damage player deals target combat life dealt destroy equal controls creature counters choose tap opponent prevent players cards deal defending

Topic #5:
battlefield enters counter token control counters tapped life return land sacrifice owners beginning flying green gain tokens leaves opponent permanent

Topic #6:
mana tap pool add color tapped land sacrifice untap charge converted cost activate enters life spend remove ability colorless pay

Topic #7:
control creatures flying gain blocked p

In [25]:
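# NOTE: H still holds the weights computed in In [7] from the 5-color data,
# and print_top_cards reads from `modern`, so this repeats the earlier output.
# Recompute H from the refit model and index all_modern to score every card.
print_top_cards(H, nmf_feature_names, 4)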

Topic: ----------- enchant ---------
Actual color = Green
Actual type = Enchantment
Enchant creature Enchanted creature has "Tap : This creature deals damage equal to its power to target creature. That creature deals damage equal to its power

Actual color = Black
Actual type = Enchantment
Enchant creature Enchanted creature has "At the beginning of your upkeep, sacrifice this creature." When enchanted creature dies, its controller chooses targe

Actual color = Green
Actual type = Enchantment
Enchant creature Enchanted creature gets +2/+2. When enchanted creature dies, you may return This from your graveyard to the battlefield attached to a creatur

Actual color = Black
Actual type = Instant
Destroy target nonwhite, nonblack creature. Put a 1/1 white Spirit creature token with flying onto the battlefield. Haunt  When the creature This haunts dies,



Topic: ----------- deck manipulation ---------
Actual color = Black
Actual type = Sorcery
Reveal a card from your hand. Search your libra

Results:
- Nearly everything is misclassified. Working from a munged dataset of just the 5 colors, I expected to get 5 colors back out. What I got instead were rules topics, not color topics: different types of cards and actions you can take in the game. To test this, I threw in all the cards.
- The topics can be assigned to a few categories, which are labeled above.
- Little to no change when the cards I had carefully munged out were put back in. Hypothesis confirmed: topic modeling is not affected by color. It is finding something else to group by.
- English stop words were removed.
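
One way to put numbers on the first result is to count, for each color, which topic dominates each card. The sketch below is pure Python on made-up values; `dominant_topic`, `topic_by_color` and the toy lists are hypothetical names, not part of the notebook.

```python
from collections import Counter

def dominant_topic(weights):
    """Index of the largest topic weight for one card."""
    return max(range(len(weights)), key=lambda k: weights[k])

def topic_by_color(doc_topic, colors):
    """Count (color, dominant topic) pairs across all cards."""
    return Counter((color, dominant_topic(row))
                   for row, color in zip(doc_topic, colors))

# Made-up topic weights and colors, for illustration only.
toy_doc_topic = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.6]]
toy_colors = ["Black", "Black", "Blue", "Blue"]

table = topic_by_color(toy_doc_topic, toy_colors)
for (color, topic), n in sorted(table.items()):
    print("{:<6} topic {} : {}".format(color, topic, n))
```

If topics tracked color, each color's counts would pile up under a single topic. An even spread across topics, as in the toy data, is what "rules topics, not color topics" looks like in a table.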

### LDA visualization with GraphLab

In [14]:
# Helper functions copied from pyLDAvis to visualize a GraphLab Create TopicModel

from __future__ import absolute_import

import funcy as fp
import numpy as np
import pandas as pd
import graphlab as gl
import pyLDAvis

def _topics_as_df(topic_model):
    tdf = topic_model['topics'].to_dataframe()
    return pd.DataFrame(np.vstack(tdf['topic_probabilities'].values), index=tdf['vocabulary'])

def _sum_sarray_dicts(sarray):
    counts_sf = gl.SFrame({'count_dicts': sarray}).stack('count_dicts').groupby(key_columns='X1',
                                              operations={'count': gl.aggregate.SUM('X2')})
    return counts_sf.unstack(column=['X1', 'count'])[0].values()[0]

def _extract_doc_data(docs):
    doc_lengths = list(docs.apply(lambda d: np.array(d.values()).sum()))
    term_freqs_dict = _sum_sarray_dicts(docs)

    vocab = term_freqs_dict.keys()
    term_freqs = term_freqs_dict.values()

    return {'doc_lengths': doc_lengths, 'vocab': vocab, 'term_frequency': term_freqs}

def _extract_model_data(topic_model, docs, vocab):
    doc_topic_dists = np.vstack(topic_model.predict(docs, output_type='probabilities'))

    topics = _topics_as_df(topic_model)
    topic_term_dists = topics.T[vocab].values

    return {'topic_term_dists': topic_term_dists, 'doc_topic_dists': doc_topic_dists}

def _extract_data(topic_model, docs):
    doc_data = _extract_doc_data(docs)
    model_data = _extract_model_data(topic_model, docs, doc_data['vocab'])
    return fp.merge(doc_data, model_data)

def prepare(topic_model, docs, **kargs):
    """Transforms the GraphLab TopicModel and related corpus data into
    the data structures needed for the visualization.
    Parameters
    ----------
    topic_model : graphlab.toolkits.topic_model.topic_model.TopicModel
        An already trained GraphLab topic model.
    docs : SArray of dicts
        The corpus in bag of word form, the same docs used to train the model.
    **kargs :
        additional keyword arguments are passed through to :func:`pyLDAvis.prepare`.
    Returns
    -------
    prepared_data : PreparedData
        the data structures used in the visualization
    Example
    --------
    For example usage please see this notebook:
    http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/GraphLab.ipynb
    """
    opts = fp.merge(_extract_data(topic_model, docs), kargs)
    return pyLDAvis.prepare(**opts)
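
The helpers above assemble the five inputs that `pyLDAvis.prepare` expects: `doc_lengths`, `vocab` and `term_frequency` from the corpus, plus `topic_term_dists` and `doc_topic_dists` from the model. As a rough illustration of the corpus half, here is a pure-Python sketch on made-up bag-of-words dicts. `toy_docs` and `extract_doc_data` are hypothetical stand-ins, not the GraphLab-backed `_extract_doc_data`.

```python
from collections import Counter

# Hypothetical bag-of-words documents, one dict per card.
toy_docs = [
    {"flying": 1},
    {"draw": 1, "card": 2},
    {"flying": 1, "card": 1},
]

def extract_doc_data(docs):
    """Compute per-document lengths and corpus-wide term frequencies."""
    doc_lengths = [sum(doc.values()) for doc in docs]
    totals = Counter()
    for doc in docs:
        totals.update(doc)  # adds this document's counts to the running totals
    vocab = sorted(totals)
    term_frequency = [totals[word] for word in vocab]
    return {"doc_lengths": doc_lengths, "vocab": vocab,
            "term_frequency": term_frequency}

doc_data = extract_doc_data(toy_docs)
print(doc_data)
```

`vocab` and `term_frequency` have to stay in the same order, which is why the real helper also reuses `vocab` to pick the columns of `topic_term_dists`.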

In [13]:
# LDA vis

import graphlab as gl
import pyLDAvis 

pyLDAvis.enable_notebook()

modernSFrame = gl.SFrame(modern[["text", "colors"]])

# modernSFrame.show()

modernSFrame['features'] = gl.text_analytics.count_ngrams(modernSFrame['text'], 1)

topicModel = gl.topic_model.create(modernSFrame['features'], num_topics=14, num_iterations=50)

prepare(topicModel, modernSFrame['features'])



PROGRESS: Learning a topic model
PROGRESS:        Number of documents      7874
PROGRESS:            Vocabulary size      1299
PROGRESS:    Running collapsed Gibbs sampling
PROGRESS: +-----------+---------------+----------------+-----------------+
PROGRESS: | Iteration | Elapsed Time  | Tokens/Second  | Est. Perplexity |
PROGRESS: +-----------+---------------+----------------+-----------------+
PROGRESS: | 10        | 384.863ms     | 4.33239e+06    | 0               |
PROGRESS: | 20        | 732.8ms       | 4.52632e+06    | 0               |
PROGRESS: | 30        | 1.06s         | 4.37516e+06    | 0               |
PROGRESS: | 40        | 1.41s         | 4.70569e+06    | 0               |
PROGRESS: | 50        | 1.76s         | 4.22612e+06    | 0               |
PROGRESS: +-----------+---------------+----------------+-----------------+


In [33]:
from collections import Counter

Counter(modern['type'])

Counter({u'Artifact': 23,
         u'Creature': 4400,
         u'Enchantment': 972,
         u'Instant': 1321,
         u'Planeswalker': 59,
         u'Sorcery': 1099})

In [34]:
modernSFrame2 = gl.SFrame(modern[["text", "type"]])

# modernSFrame.show()

modernSFrame2['features'] = gl.text_analytics.count_ngrams(modernSFrame2['text'], 1)

topicModel2 = gl.topic_model.create(modernSFrame2['features'], num_topics=6, num_iterations=50)

prepare(topicModel2, modernSFrame2['features'])

PROGRESS: Learning a topic model
PROGRESS:        Number of documents      7874
PROGRESS:            Vocabulary size      1299
PROGRESS:    Running collapsed Gibbs sampling
PROGRESS: +-----------+---------------+----------------+-----------------+
PROGRESS: | Iteration | Elapsed Time  | Tokens/Second  | Est. Perplexity |
PROGRESS: +-----------+---------------+----------------+-----------------+
PROGRESS: | 10        | 407.097ms     | 3.51787e+06    | 0               |
PROGRESS: | 20        | 785.577ms     | 3.77455e+06    | 0               |
PROGRESS: | 30        | 1.16s         | 4.03943e+06    | 0               |
PROGRESS: | 40        | 1.55s         | 3.46689e+06    | 0               |
PROGRESS: | 50        | 1.93s         | 3.87689e+06    | 0               |
PROGRESS: +-----------+---------------+----------------+-----------------+


---
Scikit-learn's LDA
------

Use [Scikit-learn's LDA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) to find topics.

In [8]:
# vocab size 
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english")
vdf = vectorizer.fit_transform(modern.text) 
names = vectorizer.get_feature_names()

print "There are {:,} words in the vocabulary.".format(len(vectorizer.vocabulary_))

There are 1,020 words in the vocabulary.


In [9]:
# code from documentation

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print "Topic #%d:" % topic_idx
        print " ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]])
    print

In [11]:
%%time

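# import from the package root; the online_lda submodule path is not stable
# across scikit-learn releases
from sklearn.decomposition import LatentDirichletAllocation

# n_topics is called n_components in newer scikit-learn releases;
# no random_state is set, so the topics vary from run to run
lda = LatentDirichletAllocation(n_topics=9).fit(vdf)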

lda_feature_names = vectorizer.get_feature_names()

print_top_words(lda, lda_feature_names, 20)


Topic #0:
creature enchanted enchant untap control creatures gets block pay long flying regenerate flash tap lose controllers step dont attack doesnt
Topic #1:
damage creature deals target player flying tap turn dealt controls opponent dies vigilance deathtouch attacking reach creatures control instead counter
Topic #2:
battlefield enters creature control token green flying black prevent tokens red counters combat permanents time trample deal haste dragon kicked
Topic #3:
life gain creatures strike control loses flying lifelink choose double level goblin total hexproof equal attacking attached aura sliver intimidate
Topic #4:
library mana card land battlefield search shuffle reveal choose cost pool tapped ability tap add sorcery hand instant converted color
Topic #5:
beginning upkeep blocked players combat defender step spells opponents flying artifacts player cast blocks transform infect colorless regenerated win countered
Topic #6:
target counter spell creature destroy cast power art
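
The top-word lists come from the rows of `lda.components_`. Per the scikit-learn documentation, these rows can be read as unnormalized pseudo-counts, so dividing each row by its sum turns it into a distribution over words for that topic. A minimal sketch on made-up numbers, with `normalize_rows` and `toy_components` as hypothetical names:

```python
def normalize_rows(matrix):
    """Scale each row so that it sums to 1."""
    normalized = []
    for row in matrix:
        total = float(sum(row))
        normalized.append([value / total for value in row])
    return normalized

# Made-up pseudo-counts for 2 topics over 4 terms.
toy_components = [[8.0, 1.0, 0.5, 0.5], [1.0, 1.0, 6.0, 2.0]]

topic_term_dists = normalize_rows(toy_components)
for row in topic_term_dists:
    print(row)
```

Normalizing does not change the order of the words within a topic, so the lists printed by `print_top_words` stay the same either way.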

## GraphLab LDA

In [13]:
# graphlab

import graphlab as gl
import pandas as pd
import pyLDAvis 
import ftfy

2016-05-09 14:23:17,078 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.8.5 started. Logging: /tmp/graphlab_server_1462828995.log


In [14]:
pyLDAvis.enable_notebook()

In [None]:
# nyArticlesDF = pd.read_pickle("nyt_articles.pkl")

# nyArticlesDF['content'] = nyArticlesDF['content'].apply(lambda x: ftfy.fix_encoding(x)
#                                                         if isinstance(x, unicode)
#                                                         else "Warning: not Unicode")

modern.text


In [21]:
text = gl.SFrame(modern.text)

In [16]:
text.show()



In [26]:
text['features'] = gl.text_analytics.count_ngrams(text['X1'],1)


topicModel = gl.topic_model.create(text['features'], 
                                   num_topics=10, num_iterations=50)

In [29]:
prepare(topicModel, text['features'])