# Using PCA to visualize the MtG Universe

In this notebook, we're going to scrape Magic the Gathering's <a href="http://gatherer.wizards.com/Pages/Default.aspx" target="_blank">Gatherer</a> card database and then perform <a href="http://en.wikipedia.org/wiki/Principal_component_analysis" target="_blank">principal components analysis</a> to visualize hidden structure in this network of trading cards. Magic the Gathering is a very popular trading card game (and personal favorite of mine) and presents an interesting high-dimensional structured dataset to analyze.

<img src="http://thisisinfamous.com/wp-content/uploads/2014/01/Magic-The-Gathering-Duels-of-the-Planeswalkers-2012.jpg" style="max-height:350px">

We'll do this in 4 steps:
1. scrape + clean the data
- extract features from the data
- perform PCA
- visualize + interpret results

In [428]:
# boring imports

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pylab as plt

import requests
from pattern import web
import re, string
from sets import Set
from collections import Counter
from fuzzywuzzy import fuzz
database = {}

# Silly helper functions

def isInt(s):
    try: 
        int(s)
        return True
    except ValueError:
        return False

def anyIntOrColor(l):
    for val in l:
        if isInt(val) | (val in ['Black', 'Red', 'Green', 'Blue', 'White']) : return True
    return False

In [429]:
# Some global variables we might want to change later

textFeatures = True
functionalFeatures = True
cardsToScrape = 1000

# (1) -- Scrape baby, scrape

First order of business is scraping the data from the Gatherer database. Here, we use HTML ids for each card attribute (i.e. name, text, etc.) that we wish to grab, based on a cards `multiverse_id` which we pass as a parameter to the URL.

In [388]:
# scrapes:

# name, types, text (lowered, alphanumeritized), mana cost,
# cmc, power and toughness,

# and adds it to the global card database

def grabCard(multiverse_id):
    xml = "http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=" + str(multiverse_id)
    dom = web.Element(requests.get(xml).text)
    
    # card name, card type
    
    cardName = dom('div.cardImage img')[0].attributes['alt'] if dom('div .cardImage img') else ''
        
    cardType = [element.strip() for element in \
                dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_typeRow div.value')[0].content.split(u'\u2014')]
    
    # extract, parse, clean text into a list
    
    cardText = []
    pattern = re.compile('[\W_]+')
    for line in dom('div.cardtextbox'):
        for element in line:
            cardText.append(element)
    
    for i in xrange(len(cardText)):
        if cardText[i].type == 'element' and cardText[i].tag == 'img':
            cardText[i] = cardText[i].attributes['alt']
        else:
            cardText[i] = str(cardText[i]).strip().lower()
        pattern.sub('', cardText[i]) 
    
    # mana symbols
    
    manaCost = [element.attributes['alt'] for element in dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_manaRow div.value img')]
    cmc = int(dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_cmcRow div.value')[0].content.strip()) \
            if dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_cmcRow div.value') else np.nan
    
    # p/t
    
    power = np.nan
        
    power = [_.strip() for _ in dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_ptRow div.value')[0].content.split(' / ')][0] \
                if dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_ptRow div.value') else np.nan
    power = float(power) if power != '*' and power != np.nan else np.nan
    
    toughness = [_.strip() for _ in dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_ptRow div.value')[0].content.split(' / ')][1] \
                    if dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_ptRow div.value') else np.nan
    toughness = float(toughness) if (toughness != '*' and toughness != '7-*' and toughness != np.nan) else np.nan
      
    # add data
    
    database[cardName] = {
                            'cardType' : cardType,
                            'cardText' : cardText,
                            'manaCost' : manaCost,
                            'cmc' : cmc,
                            'power' : power,
                            'toughness' : toughness
                         }
    

In [432]:
# do the scraping!

for i in xrange(1, cardsToScrape):
    if (i % 25 == 0): print "grabbed " + str(i)
    grabCard(i)

print "Done!"

grabbed 25
grabbed 50
grabbed 75
grabbed 100
grabbed 125
grabbed 150
grabbed 175
grabbed 200
grabbed 225
grabbed 250
grabbed 275
grabbed 300
grabbed 325
grabbed 350
grabbed 375
grabbed 400
grabbed 425
grabbed 450
grabbed 475
grabbed 500
grabbed 525
grabbed 550
grabbed 575
grabbed 600
grabbed 625
grabbed 650
grabbed 675
grabbed 700
grabbed 725
grabbed 750
grabbed 775
grabbed 800
grabbed 825
grabbed 850
grabbed 875
grabbed 900
grabbed 925
grabbed 950
grabbed 975
Done!


At this point, a card in our database looks like this:

In [452]:
database['Black Lotus']

{'cardText': [u'Tap',
  ', sacrifice black lotus: add three mana of any one color to your mana pool.'],
 'cardType': [u'Artifact'],
 'cmc': 0,
 'manaCost': [u'0'],
 'power': nan,
 'toughness': nan}

We'll put this into a pandas dataframe and be good to go.

In [440]:
data = pd.DataFrame.from_dict(database, orient='index')
data['cardName'] = data.index
data

Unnamed: 0,toughness,power,cmc,cardType,cardText,manaCost,cardName
Abu Ja'far,1,0,1,"[Creature, Human]","[when abu ja'far dies, destroy all creatures b...",[White],Abu Ja'far
Air Elemental,4,4,5,"[Creature, Elemental]",[flying],"[3, Blue, Blue]",Air Elemental
...,...,...,...,...,...,...,...
Ydwen Efreet,6,3,3,"[Creature, Efreet]","[whenever ydwen efreet blocks, flip a coin. if...","[Red, Red, Red]",Ydwen Efreet
Zombie Master,3,2,3,"[Creature, Zombie]","[other zombie creatures have swampwalk., other...","[1, Black, Black]",Zombie Master


# (2) -- Feature extraction

Based on our domain knowledge, we're going to extract features in 4 main areas:
1. Mana costs/amounts of a card
2. Type of card (i.e. Artifact, Creature, etc.)
3. Text features based on the card's text (i.e. "When this creature enters the battlefield...")
4. Functional features: special thinga like having a Tap ability or being a global effect (i.e. has the word 'all' or 'any' in the text box)

In [441]:
# 1. Mana features

colorlessMana = []
colorless = []

for row in data['manaCost']:
    found = 0
    for val in row:
        if isInt(val):
            colorlessMana.append(float(val))
            found = 1
    if found == 0:
        colorlessMana.append(0)

data['colorlessMana'] = colorlessMana 
data['Variable Colorless'] = [1 if 'Variable Colorless' in text else 0 for text in data['manaCost']]

# count mana symbols

manaSymbols = []

manaSymbols = ['Blue', 'Black', 'Red', 'Green', 'White']
manaVars = ['mana_' + _ for _ in manaSymbols]

for i in xrange(len(manaSymbols)):
    data[manaVars[i]] = [text.count(manaSymbols[i]) for text in data['manaCost']]
    data[manaSymbols[i]] = [1 if text.count(manaSymbols[i]) > 0 else 0 for text in data['manaCost']]

# find color (ignores multicolor)

def isColorless(l):
    for val in l:
        if val in manaSymbols: return False
    return True

data['Artifact'] = [1 if isColorless(x) else 0 for x in data['manaCost']]

def findColor(l):
    for val in l:
        if not isInt(val) and val != 'Variable Colorless': return val
    return 'Artifact'

data['color'] = [findColor(l) for l in data['manaCost']]

data.groupby(data['color']).describe().to_csv('colorSummary.csv')
data.groupby(data['color']).describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,Artifact,Black,Blue,Green,Red,Variable Colorless,White,cmc,colorlessMana,mana_Black,mana_Blue,mana_Green,mana_Red,mana_White,power,toughness
color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
Artifact,count,92,92,92,92,92,92,92,69.000000,92.000000,92,92,92,92,92,8.000,8.00
Artifact,mean,1,0,0,0,0,0,0,2.811594,2.108696,0,0,0,0,0,1.875,4.25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
White,75%,0,0,0,0,0,0,1,3.000000,1.000000,0,0,0,0,2,3.000,3.00
White,max,0,0,0,0,0,1,1,6.000000,3.000000,0,0,0,0,3,6.000,6.00


In [442]:
# 2. Type features

data['Primary Type'] = [cardType[0] for cardType in data['cardType']]
data = pd.concat([data, pd.get_dummies(data['Primary Type'])], axis=1)

data.groupby(data['Primary Type']).describe().to_csv('typeSummaries.csv')
data.groupby(data['Primary Type']).describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,toughness,power,cmc,colorlessMana,Variable Colorless,mana_Blue,Blue,mana_Black,Black,mana_Red,...,White,Artifact,Artifact,Artifact Creature,Basic Land,Creature,Enchantment,Instant,Land,Sorcery
Primary Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Artifact,count,1,1,62.000000,62.000000,62,62,62,62,62,62,...,62,62,62,62,62,62,62,62,62,62
Artifact,mean,6,3,2.693548,2.693548,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Sorcery,75%,,,3.000000,2.000000,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
Sorcery,max,,,6.000000,4.000000,1,3,1,3,1,1,...,1,0,0,0,0,0,0,0,0,1


In [443]:
# 3. Text-based features

def partialMatch(s, l, threshold=95):
    fuzzVals = [fuzz.partial_ratio(s, x) for x in l]
    if not fuzzVals: fuzzVals = [0]
    return max(fuzzVals) >= threshold

if textFeatures:

    data['Damage'] = [1 if partialMatch('damage', l) else 0 for l in data['cardText']]
    data['Hand'] = [1 if partialMatch('hand', l) else 0 for l in data['cardText']]
    data['Draw'] = [1 if partialMatch('draw', l, 80) else 0 for l in data['cardText']]
    data['Upkeep'] = [1 if partialMatch('draw', l, 80) else 0 for l in data['cardText']]
    data['Library'] = [1 if partialMatch('library', l) else 0 for l in data['cardText']]
    data['Sacrifice'] = [1 if partialMatch('sacrifice', l) else 0 for l in data['cardText']]
    data['Destroy'] = [1 if partialMatch('destroy', l) else 0 for l in data['cardText']]
    data['Discard'] = [1 if partialMatch('discard', l) else 0 for l in data['cardText']]
    data['Prevent'] = [1 if partialMatch('prevent', l) else 0 for l in data['cardText']]
    data['Life'] = [1 if partialMatch('life', l) else 0 for l in data['cardText']]
    data['Attack'] = [1 if partialMatch('attack', l) else 0 for l in data['cardText']]
    data['Block'] = [1 if partialMatch('block', l) else 0 for l in data['cardText']]
    data['Search'] = [1 if partialMatch('search', l) else 0 for l in data['cardText']]
    data['Choose'] = [1 if partialMatch('choose', l) else 0 for l in data['cardText']]
    data['Copy'] = [1 if partialMatch('copy', l) else 0 for l in data['cardText']]
    data['Change'] = [1 if partialMatch('change', l) else 0 for l in data['cardText']]
    data['Turn'] = [1 if partialMatch('turn', l) else 0 for l in data['cardText']]
    data['End of turn'] = [1 if partialMatch('end of turn', l, 80) else 0 for l in data['cardText']]
    data['Beginning of turn'] = [1 if partialMatch('beginning of turn', l, 80) else 0 for l in data['cardText']]
    data['Spell ref'] = [1 if partialMatch('spell', l) else 0 for l in data['cardText']]
    data['Creature ref'] = [1 if partialMatch('creature', l) else 0 for l in data['cardText']]
    data['Land'] = [1 if partialMatch('land', l) else 0 for l in data['cardText']]
    data['Mana'] = [1 if partialMatch('mana', l) else 0 for l in data['cardText']]
    data['Battlefield'] = [1 if partialMatch('battlefield', l) else 0 for l in data['cardText']]

    data['Blue ref'] = [1 if partialMatch('blue', l) else 0 for l in data['cardText']]
    data['Black ref'] = [1 if partialMatch('black', l) else 0 for l in data['cardText']]
    data['Green ref'] = [1 if partialMatch('green', l) else 0 for l in data['cardText']]
    data['Red ref'] = [1 if partialMatch('red', l) else 0 for l in data['cardText']]
    data['White ref'] = [1 if partialMatch('white', l) else 0 for l in data['cardText']]
    data['Colorless ref'] = [1 if partialMatch('colorless', l) else 0 for l in data['cardText']]

In [444]:
# 4. Special functional features

def isBuff(str, l):
    found = 0
    for val in l:
        if str in val:
            found += 1
    if found > 0: return True
    else: return False

if functionalFeatures:

    data['Untap'] = [1 if partialMatch('untap', l) else 0 for l in data['cardText']]
    data['All'] = [1 if partialMatch('all', l) | partialMatch('any', l) else 0 for l in data['cardText']]

    data['Tap ability'] = [1 if 'Tap' in x else 0 for x in data['cardText']]
    data['Mana symbol'] = [1 if anyIntOrColor(x) else 0 for x in data['cardText']]
    data['Mana related'] = [1 if partialMatch('add mana', l) | partialMatch('your mana pool', l) \
                                  else 0 for l in data['cardText']]

    data['Buff'] = [1 if isBuff('+', l) else 0 for l in data['cardText']]
    data['Debuff'] = [1 if isBuff('-', l) else 0 for l in data['cardText']]

# (3) -- Perform PCA

Surprisingly, the PCA itself is the easiest part of this entire thing.

In [445]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler

import plotly.plotly as py
py.sign_in('nhuber', 'bmopo8hk40')
from plotly.graph_objs import *
import plotly.tools as tls

numericData = data.copy()
numericData_std = scale(numericData.fillna(0).select_dtypes(include=['float64', 'int64']))

pca = PCA(n_components=10)
Y_pca = pca.fit_transform(numericData_std)

print pca.explained_variance_ratio_

[ 0.07565148  0.06063182  0.05227842  0.04867915  0.04141488  0.04095507
  0.0387418   0.03632631  0.03336077  0.03125994]


# (4a) -- Visualize results

In [446]:
traces = []

for color in set(data['color']):

    matches = []
    for i in xrange(len(data['color'])):
        if data['color'].irow(i) == color:
            matches.append(i)
    
    graphColor = color
    if color == 'White': graphColor = '#B2B2B2'
    if color == 'Artifact': graphColor = '#996633'
    if color == 'Red' : graphColor = '#E50000'
    if color == 'Blue': graphColor = '#0000FF'
    if color == 'Green' : graphColor = '#006400'
    if color == 'Black' : graphColor = '#000000'
    
    trace = Scatter(
        x=Y_pca[matches,0],
        y=Y_pca[matches,1],
        mode='text',
        name=color,
        marker=Marker(
            size=12,
            color=graphColor,
            line=Line(
                color='rgba(0, 0, 0, 0)',
                width=1),
            opacity=0.9),
        text = data['cardName'].irow(matches),
        textfont = Font(
            family='Georgia',
            size=11,
            color=graphColor
            )
        )
    
    traces.append(trace)

dataToGraph = Data(traces)

xRange = max(np.percentile(np.array([x[0] for x in Y_pca]), 2.5),
                np.percentile(np.array([x[0] for x in Y_pca]), 97.5))
yRange = max(np.percentile(np.array([x[1] for x in Y_pca]), 2.5),
                np.percentile(np.array([x[1] for x in Y_pca]), 97.5))

layout = Layout(title="PCA on MtG",
                titlefont=Font(family='Georgia', size=26),
                autosize = False,
                height = 750,
                width = 850,
                xaxis=XAxis(
                    range=[-xRange, +xRange],
                    title='PC1', showline=False),
                yaxis=YAxis(
                    range=[-yRange, +yRange],
                    title='PC2', showline=False))
fig = Figure(data=dataToGraph, layout=layout)
py.iplot(fig)





# (4b) -- Interpret Results

In [447]:
# why is mountain such an outlier

In [448]:
# 3 axes, with examples
# color affinities