# Using PCA to visualize the MtG universe

In this notebook, we're going to scrape Magic the Gathering's <a href="http://gatherer.wizards.com/Pages/Default.aspx" target="_blank">Gatherer</a> card database and then perform <a href="http://en.wikipedia.org/wiki/Principal_component_analysis" target="_blank">principal components analysis</a> to visualize hidden relationships between cards. Ultimately, our goal will be to see how much of variations between cards can be simplified and then plotted in two-dimensions.

<img src="http://thisisinfamous.com/wp-content/uploads/2014/01/Magic-The-Gathering-Duels-of-the-Planeswalkers-2012.jpg" style="max-height:300px">

A dataset of Magic cards is incredibly high-dimensional -- there are over a 100 unique mechanics in the game and the game state has many different elements (hand, battlefield, mana pool, etc.). Being able to translate the <a href="http://mtgsalvation.gamepedia.com/Magic:_The_Gathering_statistics_and_trivia?cookieSetup=true" target="_blank">13,000 unique card texts</a> into structured data is also a challenge NLP-related task.

##Warning: This notebook is long...so, for the impatient:

Here is what we will be working towards:

<img src="./pca2.png">

Pretty baller, right? We will interpret and grok this graph later, but for now, let's do this...

LEERRRROOYYY JENNNKIINNNNNSSS

## Outline
Here's a breakdown of the four steps that we'll go through to accomplish this task:
1. Scrape + clean the data using `requests`, `web` from `pattern`, and `pandas`
- Extract features from the data using `fuzzywuzzy` and domain knowledge
- Perform and analyze PCA using `sklearn`
- Visualize + interpret results using the `plotly` graphing library

*First some boring imports and settings (feel free to skip over)*

In [764]:
# boring imports

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pylab as plt

import requests
from pattern import web
requests.packages.urllib3.disable_warnings()

import re, string
from sets import Set
from collections import Counter
from fuzzywuzzy import fuzz

database = {}

# Silly helper functions

def isInt(s):
    try: 
        int(s)
        return True
    except ValueError:
        return False

def anyIntOrColor(l):
    for val in l:
        if isInt(val) | (val in ['Black', 'Red', 'Green', 'Blue', 'White']) : return True
    return False

# (1) -- Scrape baby, scrape

Our first order of business is scraping the data from the Gatherer database using `requests` and `web` from `pattern`. In it's simplest form, every Magic card has a name, text, type, mana cost, and power/toughness (if it's a creature). An example is Hypnotic Specter, a powerful creature in the early days of Magic:

<img src="./hyppy.png" style="margin: auto; display:block; height: 300px;">

To scrape the relevant card features, we will construct card URLs using the card's `multiverse_id` and on the page we load will look for unique HTML elements that correspond to each of the features we will to obtain.

In [765]:
# grabCard scrapes:

# name, types, text (lowered, alphanumeritized), mana cost,
# cmc, power and toughness, and rarity.

# and adds it to the global card database

def grabCard(multiverse_id):
    xml = "http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=" + str(multiverse_id)
    dom = web.Element(requests.get(xml).text)
    
    # card name, card type
    cardName = dom('div.cardImage img')[0].attributes['alt'] if dom('div .cardImage img') else ''
    cardType = [element.strip() for element in \
                dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_typeRow div.value')[0].content.split(u'\u2014')]
    
    # extract, parse, clean text into a list
    cardText = []
    pattern = re.compile('[\W_]+')
    for line in dom('div.cardtextbox'):
        for element in line:
            cardText.append(element)
    
    for i in xrange(len(cardText)):
        if cardText[i].type == 'element' and cardText[i].tag == 'img':
            cardText[i] = cardText[i].attributes['alt']
        else:
            cardText[i] = str(cardText[i]).strip().lower()
        pattern.sub('', cardText[i]) 
    
    # mana symbols
    manaCost = [element.attributes['alt'] for element in dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_manaRow div.value img')]
    cmc = int(dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_cmcRow div.value')[0].content.strip()) \
            if dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_cmcRow div.value') else np.nan
    
    # rarity
    rarity = dom('div #ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_rarityRow div.value span')[0].content.lower()
    
    # p/t
    power = np.nan
    power = [_.strip() for _ in dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_ptRow div.value')[0].content.split(' / ')][0] \
                if dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_ptRow div.value') else np.nan
    power = float(power) if power != '*' and power != np.nan else np.nan
    toughness = [_.strip() for _ in dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_ptRow div.value')[0].content.split(' / ')][1] \
                    if dom('div#ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_ptRow div.value') else np.nan
    toughness = float(toughness) if (toughness != '*' and toughness != '7-*' and toughness != np.nan) else np.nan
      
    # add data
    database[cardName] = {
                            'cardType' : cardType,
                            'cardText' : cardText,
                            'manaCost' : manaCost,
                            'cmc' : cmc,
                            'rarity': rarity,
                            'power' : power,
                            'toughness' : toughness
                         }

## Do the scraping

We'll iterate through a range of `multiverse_id`s to scrape a desired amount of cards. Note that it takes around 1 minute/500 `multiverse_id`s. Given that there are 13k+ cards (and multiple versions of each -- see below), we'll limit our scraping to ~500 cards from the very first Magic set: <a href="http://en.wikipedia.org/wiki/Limited_Edition_%28Magic:_The_Gathering%29" target="_blank">Alpha</a>.

In [766]:
cardsToScrape = 600

for i in xrange(1, cardsToScrape):
    if (i % 100 == 0): print "grabbed " + str(i)
    grabCard(i)

print "Done!"

grabbed 100
grabbed 200
grabbed 300
grabbed 400
grabbed 500
Done!


At this point, we now have roughly `cardsToScrape` cards and associated values in a local `dict` using the `cardName` as the key. (Note that we have less than `cardsToScrape` as we're iterating over `multiverse_id`s and some ids don't actually match to a card page.)

### Note for potential future work

*There are other aspects represented on the Gatherer database such as set and community ratings but we leave this to future work. Annoyingly, for cards in multiple sets, the card will have a different page (and subsequently different set of ratings) for each set; though this would require more work, it'd be super interesting if you could predict a card's community interest (# ratings) and favorability (average rating).*

### Making the data usable

We'll now put this into a `pandas` dataframe for cleaning, variable creation and initial analysis/spot checking/understanding.

In [767]:
data = pd.DataFrame.from_dict(database, orient='index')
data['cardName'] = data.index
data

Unnamed: 0,toughness,power,cmc,rarity,cardType,cardText,manaCost,cardName
Air Elemental,4,4,5,uncommon,"[Creature, Elemental]",[flying],"[3, Blue, Blue]",Air Elemental
Ancestral Recall,,,1,rare,[Instant],[target player draws three cards.],[Blue],Ancestral Recall
...,...,...,...,...,...,...,...,...
Wrath of God,,,4,rare,[Sorcery],[destroy all creatures. they can't be regenera...,"[2, White, White]",Wrath of God
Zombie Master,3,2,3,rare,"[Creature, Zombie]","[other zombie creatures have swampwalk., other...","[1, Black, Black]",Zombie Master


# (2) -- Feature extraction

Based on our domain knowledge, we're going to extract four main types of features for each card:
1. **Mana cost**/amounts of a card
2. **Categorical features** -- type (i.e. Artifact, Creature, etc.) and rarity (i.e. Common, Uncommon, etc.)
3. **Text features** based on the card's text (i.e. "When this creature enters the battlefield...")
4. **Functional features** -- having a Tap ability, being a mana generator, etc.

In [768]:
# Which features do we want to use?
# All enabled by default, mana features required

categoricalFeatures = True
textFeatures = True
functionalFeatures = True

## (2.1) -- Mana features

In [769]:
# Create mana features

colorlessMana = []
colorless = []

for row in data['manaCost']:
    found = 0
    for val in row:
        if isInt(val):
            colorlessMana.append(float(val))
            found = 1
    if found == 0:
        colorlessMana.append(0)

data['colorlessMana'] = colorlessMana 
data['Variable Colorless'] = [1 if 'Variable Colorless' in text else 0 for text in data['manaCost']]

In [770]:
# Find color (ignores multicolor)

def isColorless(l):
    for val in l:
        if val in manaSymbols: return False
    return True

data['Artifact'] = [1 if isColorless(x) else 0 for x in data['manaCost']]

def findColor(l):
    for val in l:
        if not isInt(val) and val != 'Variable Colorless': return val
    return 'Artifact'

data['color'] = [findColor(l) for l in data['manaCost']]

In [771]:
# Count mana symbols

manaSymbols = []

manaSymbols = ['Blue', 'Black', 'Red', 'Green', 'White']
manaVars = ['mana_' + _ for _ in manaSymbols]

for i in xrange(len(manaSymbols)):
    data[manaVars[i]] = [text.count(manaSymbols[i]) for text in data['manaCost']]
    data[manaSymbols[i]] = [1 if text.count(manaSymbols[i]) > 0 else 0 for text in data['manaCost']]

data.groupby(data['color']).describe().to_csv('colorSummary.csv')
data.groupby(data['color']).describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,Artifact,Black,Blue,Green,Red,Variable Colorless,White,cmc,colorlessMana,mana_Black,mana_Blue,mana_Green,mana_Red,mana_White,power,toughness
color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
Artifact,count,62,62,62,62,62,62,62,47.000000,62.000000,62,62,62,62,62.00,5.0,5.0
Artifact,mean,1,0,0,0,0,0,0,2.361702,1.790323,0,0,0,0,0.00,2.4,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
White,75%,0,0,0,0,0,0,1,3.000000,1.000000,0,0,0,0,1.75,3.0,4.5
White,max,0,0,0,0,0,1,1,6.000000,3.000000,0,0,0,0,3.00,6.0,6.0


## (2b) -- Categorical features

In [772]:
# Create categorical features

if categoricalFeatures:

    data['Primary Type'] = [cardType[0] for cardType in data['cardType']]
    
    data = pd.concat([data, pd.get_dummies(data['Primary Type'])], axis=1)
    data = pd.concat([data, pd.get_dummies(data['rarity'])], axis=1)

    data.groupby(data['rarity']).describe().to_csv('byRarity.csv')
    data.groupby(data['rarity']).describe()
    
    data.groupby(data['Primary Type']).describe().to_csv('byType.csv')
    data.groupby(data['Primary Type']).describe()

## (2c) -- Text features

A helper function from `fuzzywuzzy` to find partial word matches in card text boxes:

In [773]:
def partialMatch(s, l, threshold=95):
    fuzzVals = [fuzz.partial_ratio(s, x) for x in l]
    if not fuzzVals: fuzzVals = [0]
    return max(fuzzVals) >= threshold

Based on domain knowledge, we'll fuzzy match if certain important words are in a card's text box that will give us a hint of what the card does.

In [774]:
# Create text-based features

if textFeatures:

    data['Damage'] = [1 if partialMatch('damage', l) else 0 for l in data['cardText']]
    data['Hand'] = [1 if partialMatch('hand', l) else 0 for l in data['cardText']]
    data['Draw'] = [1 if partialMatch('draw', l, 80) else 0 for l in data['cardText']]
    data['Upkeep'] = [1 if partialMatch('draw', l, 80) else 0 for l in data['cardText']]
    data['Library'] = [1 if partialMatch('library', l) else 0 for l in data['cardText']]
    data['Sacrifice'] = [1 if partialMatch('sacrifice', l) else 0 for l in data['cardText']]
    data['Destroy'] = [1 if partialMatch('destroy', l) else 0 for l in data['cardText']]
    data['Discard'] = [1 if partialMatch('discard', l) else 0 for l in data['cardText']]
    data['Prevent'] = [1 if partialMatch('prevent', l) else 0 for l in data['cardText']]
    data['Life'] = [1 if partialMatch('life', l) else 0 for l in data['cardText']]
    data['Attack'] = [1 if partialMatch('attack', l) else 0 for l in data['cardText']]
    data['Block'] = [1 if partialMatch('block', l) else 0 for l in data['cardText']]
    data['Search'] = [1 if partialMatch('search', l) else 0 for l in data['cardText']]
    data['Choose'] = [1 if partialMatch('choose', l) else 0 for l in data['cardText']]
    data['Copy'] = [1 if partialMatch('copy', l) else 0 for l in data['cardText']]
    data['Change'] = [1 if partialMatch('change', l) else 0 for l in data['cardText']]
    data['Turn'] = [1 if partialMatch('turn', l) else 0 for l in data['cardText']]
    data['End of turn'] = [1 if partialMatch('end of turn', l, 80) else 0 for l in data['cardText']]
    data['Beginning of turn'] = [1 if partialMatch('beginning of turn', l, 80) else 0 for l in data['cardText']]
    data['Spell ref'] = [1 if partialMatch('spell', l) else 0 for l in data['cardText']]
    data['Creature ref'] = [1 if partialMatch('creature', l) else 0 for l in data['cardText']]
    data['Land'] = [1 if partialMatch('land', l) else 0 for l in data['cardText']]
    data['Mana'] = [1 if partialMatch('mana', l) else 0 for l in data['cardText']]
    data['Battlefield'] = [1 if partialMatch('battlefield', l) else 0 for l in data['cardText']]
    data['Blue ref'] = [1 if partialMatch('blue', l) else 0 for l in data['cardText']]
    data['Black ref'] = [1 if partialMatch('black', l) else 0 for l in data['cardText']]
    data['Green ref'] = [1 if partialMatch('green', l) else 0 for l in data['cardText']]
    data['Red ref'] = [1 if partialMatch('red', l) else 0 for l in data['cardText']]
    data['White ref'] = [1 if partialMatch('white', l) else 0 for l in data['cardText']]
    data['Colorless ref'] = [1 if partialMatch('colorless', l) else 0 for l in data['cardText']]

## (2d) -- Functional features

In [775]:
# 4. Special functional features

def isBuff(str, l):
    found = 0
    for val in l:
        if str in val:
            found += 1
    if found > 0: return True
    else: return False

if functionalFeatures:

    data['Untap'] = [1 if partialMatch('untap', l) else 0 for l in data['cardText']]
    data['All'] = [1 if partialMatch('all', l) | partialMatch('any', l) else 0 for l in data['cardText']]

    data['Tap ability'] = [1 if 'Tap' in x else 0 for x in data['cardText']]
    data['Mana symbol'] = [1 if anyIntOrColor(x) else 0 for x in data['cardText']]
    data['Mana related'] = [1 if partialMatch('add mana', l) | partialMatch('your mana pool', l) \
                                  else 0 for l in data['cardText']]

    data['Buff'] = [1 if isBuff('+', l) else 0 for l in data['cardText']]
    data['Debuff'] = [1 if isBuff('-', l) else 0 for l in data['cardText']]

## Note

Some of this might have been able to be done automatically, especially the **text features**, which could have been done by finding the most common words referred to in text boxes. Again, I leave this to future work and am really curious about what the literature on automatic feature creation says about this.

# (3) -- Perform PCA

Surprisingly, the PCA itself is the easiest part of this entire thing. We'll use `sklearn` to perform a 10-component PCA to see how much of the entire data's dimensional variation can be reduced to 10 dimensions.

In [776]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler

numericData = data.copy()
# scale to mean 0, variance 1
numericData_std = scale(numericData.fillna(0).select_dtypes(include=['float64', 'int64']))

pca = PCA(n_components=10)
Y_pca = pca.fit_transform(numericData_std)

###So, how well did we do?

Well, based on the explained variance vector below it doesn't look like we did very well. The first two principal components only combined for **14% of the total variance** in the data; though, of note, is that the first 10 factors do account for **44% of the total variance**. Considering we're working with a database of 64 features though, this is pretty decent.

In [777]:
# PCA analysis

print
print "Variance explained by each factor:"
print [round(x, 3) for x in pca.explained_variance_ratio_]
print
print "Variance explained by all 10 factors:"
print round(sum(pca.explained_variance_ratio_), 3)
print
print "Num features:"
print len(numericData_std[0])


Variance explained by each factor:
[0.079, 0.061, 0.055, 0.051, 0.041, 0.038, 0.037, 0.036, 0.032, 0.03]

Variance explained by all 10 factors:
0.46

Num features:
64


# (4a) -- Visualize results

Now time to see if it was all worth it -- and apply the PCA projection to our data set. We're going to make a scatterplot of all of our cards with the PCA two-dimensional projection applied grouped by card color.

In [778]:
import plotly.plotly as py
py.sign_in('nhuber', 'bmopo8hk40')
from plotly.graph_objs import *
import plotly.tools as tls

# create the data to graph, grouped by card color

traces = []
colorSeries = {}

for color in set(data['color']):

    matches = []
    for i in xrange(len(data['color'])):
        if data['color'].irow(i) == color:
            matches.append(i)
    
    graphColor = color
    if color == 'White': graphColor = '#B2B2B2'
    if color == 'Artifact': graphColor = '#996633'
    if color == 'Red' : graphColor = '#E50000'
    if color == 'Blue': graphColor = '#0000FF'
    if color == 'Green' : graphColor = '#006400'
    if color == 'Black' : graphColor = '#000000'
    
    trace = Scatter(
        x=Y_pca[matches,0],
        y=Y_pca[matches,1],
        mode='text',
        name=color,
        marker=Marker(
            size=12,
            color=graphColor,
            line=Line(
                color='rgba(0, 0, 0, 0)',
                width=1),
            opacity=0.9),
        text = data['cardName'].irow(matches),
        textfont = Font(
            family='Georgia',
            size=8.5,
            color=graphColor
            )
        )

    colorSeries[color] = (Y_pca[matches,0], Y_pca[matches,0])
    traces.append(trace)

In [779]:
# Set up the scatter plot layout

dataToGraph = Data(traces)

# auto-focus on where most of the data is clustered
xRange = max(abs(np.percentile(np.array([x[0] for x in Y_pca]), 2.5)),
                abs(np.percentile(np.array([x[0] for x in Y_pca]), 97.5)))
yRange = max(abs(np.percentile(np.array([x[1] for x in Y_pca]), 2.5)),
                abs(np.percentile(np.array([x[1] for x in Y_pca]), 97.5)))

layout = Layout(title="PCA on MtG",
                titlefont=Font(family='Georgia', size=26),
                showlegend = True,
                autosize = False,
                height = 600,
                width = 700,
                xaxis=XAxis(
                    range=[-xRange, +xRange],
                    title='PC1', showline=False),
                yaxis=YAxis(
                    range=[-yRange, +yRange],
                    title='PC2', showline=False))

### Let's graph this shiznit

In [780]:
fig = Figure(data=dataToGraph, layout=layout)
py.iplot(fig)

Ok a few notes on this graph:
- It's interactive: you can zoom into an area on the graph by dragging to create a rectangle
- Also note that you can click the labels on the top right to turn on and off showing cards of different colors
- It will probably have more meaning if you know about Magic and what each of these cards do; so, I'll offer my analysis below but if you do play and have alternate interpretations about how these cards are grouped, please do lmk
- There appear to be two large clusters/corridors along diagonal to each axis
- Cards of the same color are clustered in similar locations, whereas artifacts seem to be in their own space

# (4b) -- Interpret Results

<img src="./pca2.png">


## Result 1: A story of two psuedo-axes

## Going through salient examples

## Result 2: Exploring the color identities

In [781]:
centers = {}

for color in set(data['color']):
    centers[color] = ( np.mean(colorSeries[color][0]), np.mean(colorSeries[color][1])

    

SyntaxError: invalid syntax (<ipython-input-781-11667bdc7454>, line 6)

# Conclusions and future work

dfsf

graph by type

@nhuber | nicholas.e.huber@gmail.com

<style>

.p {
    font-family: 'Georgia'
}

</style>