### Ingest

Goal: Get all cards read in from JSON and saved as a pickle. Keep as much data as will be useful, drop the rest. 

In [1]:
import numpy as np
import pandas as pd
import json
import re

To download all the cards run:

In [7]:
# import all Magic cards ever printed from the web (about 14 MB)

# allsets = pd.read_json('http://mtgjson.com/json/AllSetsArray.json')
# allsets.info()

Otherwise, run this to load a locally stored file (and be nice to our kind friend's API).

In [8]:
# import all magic cards from file 

allsets = pd.read_json('data/AllSetsArray.json')
allsets.tail(2)

Unnamed: 0,block,booster,border,cards,code,gathererCode,magicCardsInfoCode,magicRaritiesCodes,name,oldCode,onlineOnly,releaseDate,type
185,Battle for Zendikar,"[[rare, mythic rare], uncommon, uncommon, unco...",black,"[{u'layout': u'normal', u'name': u'Bane of Bal...",BFZ,,bfz,,Battle for Zendikar,,,2015-10-02,expansion
186,,,black,"[{u'layout': u'normal', u'name': u'Prairie Str...",EXP,,exp,,Zendikar Expeditions,,,2015-10-02,promo






The top-level dataframe has one row per set; the last index shown above is 186. The cards for each set are stored in the "cards" column as a nested JSON array. Sets tend to have 200-350 cards each.
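To make the nested shape concrete, here is a minimal sketch that flattens a sets-with-cards array into one row per card using only the standard library. The sample JSON, the set names and the card names are made up for illustration; only the field names (`name`, `releaseDate`, `type`, `cards`, `rarity`) come from the output above.

```python
import json

# Hypothetical two-set sample in the same shape as the real file:
# a JSON array of sets, each carrying its own "cards" array.
raw = '''[
  {"name": "Set A", "releaseDate": "2003-07-28", "type": "core",
   "cards": [{"name": "Card 1", "rarity": "Common"},
             {"name": "Card 2", "rarity": "Rare"}]},
  {"name": "Set B", "releaseDate": "2003-10-02", "type": "expansion",
   "cards": [{"name": "Card 3", "rarity": "Uncommon"}]}
]'''

sets = json.loads(raw)

# Build one flat row per card, tagged with the set it came from.
rows = []
for s in sets:
    for card in s["cards"]:
        row = dict(card)                      # copy so the input stays untouched
        row["set"] = s["name"]
        row["releaseDate"] = s["releaseDate"]
        rows.append(row)
```

A list of dicts like `rows` can be handed to `pd.DataFrame` to get one card per row, which is the shape the rest of this notebook works toward.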

In [9]:
# drop columns that are full of nan and/or junk

allsets = allsets.drop(['booster', 'gathererCode', 'magicCardsInfoCode', 
                        'code', 'oldCode', 'onlineOnly', 'magicRaritiesCodes'
                        ], axis=1)

### Grab the most recent set and check it out

In [99]:
# zoom in on cards from latest set, 'Battle for Zendikar'

zen = pd.read_json(json.dumps(allsets.cards[185]))

# add set name and date
zen['set'] = allsets.name[185]
zen['releaseDate'] = allsets.releaseDate[185]

zen.head(2)

Unnamed: 0,artist,cmc,colors,flavor,id,imageName,layout,loyalty,manaCost,multiverseid,...,rarity,subtypes,supertypes,text,toughness,type,types,variations,set,releaseDate
0,Chase Stone,7,,The continent of Bala Ged was ravaged by massi...,b2baccd791f5ada8bbd505f11bf81e70c8ea27ab,bane of bala ged,normal,,{7},401814,...,Uncommon,[Eldrazi],,"Whenever Bane of Bala Ged attacks, defending p...",5,Creature — Eldrazi,[Creature],,Battle for Zendikar,2015-10-02
1,Todd Lockwood,5,,,cfff5e44e0263f10015fcf613b0f40065acf00b7,blight herder,normal,,{5},401819,...,Rare,"[Eldrazi, Processor]",,"When you cast Blight Herder, you may put two c...",5,Creature — Eldrazi Processor,[Creature],,Battle for Zendikar,2015-10-02


In [100]:
# drop junk 
zen.drop(['id', 'layout', 'multiverseid', 'imageName', 'subtypes', 
          'supertypes', 'variations', 'loyalty', 'number'], 
         axis=1, inplace=True)

zen.head(2)

Unnamed: 0,artist,cmc,colors,flavor,manaCost,name,power,rarity,text,toughness,type,types,set,releaseDate
0,Chase Stone,7,,The continent of Bala Ged was ravaged by massi...,{7},Bane of Bala Ged,7,Uncommon,"Whenever Bane of Bala Ged attacks, defending p...",5,Creature — Eldrazi,[Creature],Battle for Zendikar,2015-10-02
1,Todd Lockwood,5,,,{5},Blight Herder,4,Rare,"When you cast Blight Herder, you may put two c...",5,Creature — Eldrazi Processor,[Creature],Battle for Zendikar,2015-10-02


In [101]:
# grab the text from card 1

zen['text'][1]

u'When you cast Blight Herder, you may put two cards your opponents own from exile into their owners\' graveyards. If you do, put three 1/1 colorless Eldrazi Scion creature tokens onto the battlefield. They have "Sacrifice this creature: Add {1} to your mana pool."'

In [102]:
# clean up the text: turn newlines into spaces and strip apostrophes

zen['text'] = zen['text'].replace("\n", " ", regex=True)
zen['text'] = zen['text'].replace("'", "", regex=True)

zen['text'][1]

u'When you cast Blight Herder, you may put two cards your opponents own from exile into their owners graveyards. If you do, put three 1/1 colorless Eldrazi Scion creature tokens onto the battlefield. They have "Sacrifice this creature: Add {1} to your mana pool."'
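The cleanup above can be sketched on a single string with the standard library. The helper name `clean_text` and the sample sentence are assumptions for illustration. Note the side effect visible in the output: stripping apostrophes also rewrites possessives and contractions ("owners'" becomes "owners", "can't" becomes "cant").

```python
import re

def clean_text(text):
    """Turn newlines into spaces and strip apostrophes (hypothetical helper)."""
    text = re.sub(r"\n", " ", text)
    return re.sub(r"'", "", text)

# Made-up sample with a newline and a possessive apostrophe.
sample = "Flying\nWhen you cast it, put cards into their owners' graveyards."
cleaned = clean_text(sample)
```

Whether losing the apostrophes is acceptable depends on the downstream tokenizer; the sketch only mirrors what the cell above does.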

### More cards

Now that we know everything works for one set, let's grab a bunch of sets and make one large table of cards. "Modern" is a major rules and format change the creators of the game enacted in 2003. It marks a turning point in the design, so, to start, we're only looking at cards released on or after that date.

In [10]:
# slice sets for all of modern (arbitrary rules change date in 2003)

modern_sets = allsets[ (allsets['releaseDate'] >= '2003-07-28' ) ]
modern_sets = modern_sets.loc[modern_sets['type'].isin(['core', 'expansion'])]
modern_sets = modern_sets.loc[modern_sets['border'].isin(['white', 'black'])]

modern_sets.reset_index(inplace=True)
modern_sets

Unnamed: 0,index,block,border,cards,name,releaseDate,type
0,71,,white,"[{u'layout': u'normal', u'name': u'Abyssal Spe...",Eighth Edition,2003-07-28,core
1,72,Mirrodin,black,"[{u'layout': u'normal', u'name': u'Æther Spell...",Mirrodin,2003-10-02,expansion
2,73,Mirrodin,black,"[{u'layout': u'normal', u'name': u'Æther Snap'...",Darksteel,2004-02-06,expansion
3,74,Mirrodin,black,"[{u'layout': u'normal', u'name': u'Abuna's Cha...",Fifth Dawn,2004-06-04,expansion
4,75,Kamigawa,black,"[{u'layout': u'normal', u'name': u'Akki Avalan...",Champions of Kamigawa,2004-10-01,expansion
5,77,Kamigawa,black,"[{u'layout': u'normal', u'name': u'Akki Blizza...",Betrayers of Kamigawa,2005-02-04,expansion
6,78,Kamigawa,black,"[{u'layout': u'normal', u'name': u'Adamaro, Fi...",Saviors of Kamigawa,2005-06-03,expansion
7,79,,white,"[{u'layout': u'normal', u'name': u'Adarkar Was...",Ninth Edition,2005-07-29,core
8,80,Ravnica,black,"[{u'layout': u'normal', u'name': u'Agrus Kos, ...",Ravnica: City of Guilds,2005-10-07,expansion
9,83,Ravnica,black,"[{u'layout': u'normal', u'name': u'Absolver Th...",Guildpact,2006-02-03,expansion


In [12]:
# groups cards from sets into one data frame 

cards = None

for s in xrange(len(modern_sets)):
    print "Reading in set:", modern_sets.name[s]
    target = pd.read_json(json.dumps(modern_sets.cards[s]))
    
    # slice to just the good stuff 
    target = target[['artist', 'cmc', 'colors', 'flavor', 'manaCost', 'name', 
                    'power', 'rarity', 'text', 'toughness', 'type', 'types']]
    
    # add set name and date
    target['set'] = modern_sets.name[s]
    target['releaseDate'] = modern_sets.releaseDate[s]    
    
    # add to cards df 
    cards = pd.concat([cards, target])

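# clean up the text: turn newlines into spaces and strip apostrophes
cards['text'] = cards['text'].replace("\n", " ", regex=True)
cards['text'] = cards['text'].replace("'", "", regex=True)
cards['flavor'] = cards['flavor'].replace("\n", " ", regex=True)

print 
# info() prints its own summary and returns None, so it needs no print
cards.info(verbose=False)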

Reading in set: Eighth Edition
Reading in set: Mirrodin
Reading in set: Darksteel
Reading in set: Fifth Dawn
Reading in set: Champions of Kamigawa
Reading in set: Betrayers of Kamigawa
Reading in set: Saviors of Kamigawa
Reading in set: Ninth Edition
Reading in set: Ravnica: City of Guilds
Reading in set: Guildpact
Reading in set: Dissension
Reading in set: Coldsnap
Reading in set: Time Spiral
Reading in set: Time Spiral "Timeshifted"
Reading in set: Planar Chaos
Reading in set: Future Sight
Reading in set: Tenth Edition
Reading in set: Lorwyn
Reading in set: Morningtide
Reading in set: Shadowmoor
Reading in set: Eventide
Reading in set: Shards of Alara
Reading in set: Conflux
Reading in set: Alara Reborn
Reading in set: Magic 2010
Reading in set: Zendikar
Reading in set: Worldwake
Reading in set: Rise of the Eldrazi
Reading in set: Magic 2011
Reading in set: Scars of Mirrodin
Reading in set: Mirrodin Besieged
Reading in set: New Phyrexia
Reading in set: Magic 2012
Reading in set: Inni

This gives us 11502 cards, but that is too many: the total includes lands and promotional tokens that we do not want.

In [13]:
# filter out lands and tokens 

cards = cards.loc[cards['rarity'].isin(['Common', 'Uncommon', 'Rare', 
                                        'Mythic Rare'])]
cards.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10765 entries, 0 to 248
Columns: 14 entries, artist to releaseDate
dtypes: float64(1), object(13)
memory usage: 1.2+ MB


This gives 10765 cards.

The next problem is that some cards have no color (artifacts) and some have multiple colors (gold, hybrid). Let's leave them out for now.

In [14]:
# only keep cards with a 'color' attribute
cards_no_nulls = cards[cards['colors'].notnull()]

# only keep cards with text
cards_no_nulls = cards_no_nulls[cards_no_nulls['text'].notnull()]

# only keep cards with exactly one color (drops gold and hybrid cards)
cards_no_nulls = cards_no_nulls[cards_no_nulls['colors'].map(len) == 1]
cards_no_nulls.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7874 entries, 0 to 197
Columns: 14 entries, artist to releaseDate
dtypes: float64(1), object(13)
memory usage: 922.7+ KB
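The order of these filters matters: missing values have to be dropped before the length check, since a missing `colors` entry is a float NaN and `len()` fails on it. A minimal sketch on plain dicts, where the card rows and the `is_missing` helper are assumptions for illustration:

```python
nan = float("nan")

# Hypothetical cards: one mono-colored, one gold, one colorless, one without text.
cards = [
    {"name": "Mono", "colors": ["Red"], "text": "Haste"},
    {"name": "Gold", "colors": ["Red", "Green"], "text": "Trample"},
    {"name": "Artifact", "colors": nan, "text": "Tap: Add {1}."},
    {"name": "Vanilla", "colors": ["White"], "text": nan},
]

def is_missing(value):
    # NaN is the only float that is not equal to itself.
    return isinstance(value, float) and value != value

# `and` short-circuits, so len() is never called on a missing value.
kept = [c for c in cards
        if not is_missing(c["colors"])
        and not is_missing(c["text"])
        and len(c["colors"]) == 1]
kept_names = [c["name"] for c in kept]

# With exactly one color left per card, the list can be unwrapped safely.
kept_colors = [c["colors"][0] for c in kept]
```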


In [15]:
# reset index and drop old index vals

cards_no_nulls.reset_index(drop=True, inplace=True)
cards_no_nulls

Unnamed: 0,artist,cmc,colors,flavor,manaCost,name,power,rarity,text,toughness,type,types,set,releaseDate
0,Michael Sutfin,4,[Black],To gaze under its hood is to invite death.,{2}{B}{B},Abyssal Specter,2,Uncommon,Flying Whenever Abyssal Specter deals damage t...,3,Creature — Specter,[Creature],Eighth Edition,2003-07-28
1,Wayne England,5,[Blue],Pray that it doesn't seek the safety of your l...,{3}{U}{U},Air Elemental,4,Uncommon,Flying,4,Creature — Elemental,[Creature],Eighth Edition,2003-07-28
2,Junko Taguchi,4,[Black],"""Knowledge demands sacrifice.""",{3}{B},Ambition's Cost,,Uncommon,You draw three cards and you lose 3 life.,,Sorcery,[Sorcery],Eighth Edition,2003-07-28
3,Alex Horley-Orlandelli,4,[Red],Just try taking this bull by the horns.,{3}{R},Anaba Shaman,2,Common,"{R}, {T}: Anaba Shaman deals 1 damage to targe...",2,Creature — Minotaur Shaman,[Creature],Eighth Edition,2003-07-28
4,Melissa A. Benson,5,[White],A song of life soars over fields of blood.,{4}{W},Angel of Mercy,3,Uncommon,Flying When Angel of Mercy enters the battlefi...,3,Creature — Angel,[Creature],Eighth Edition,2003-07-28
5,Marc Fishman,2,[White],If only every message were as perfect as its b...,{1}{W},Angelic Page,1,Common,Flying {T}: Target attacking or blocking creat...,1,Creature — Angel Spirit,[Creature],Eighth Edition,2003-07-28
6,Donato Giancola,4,[Blue],"""Words—so innocent and powerless are they, as ...",{2}{U}{U},Archivist,1,Rare,{T}: Draw a card.,1,Creature — Human Wizard,[Creature],Eighth Edition,2003-07-28
7,Paolo Parente,5,[White],Knights fight for honor and mercenaries fight ...,{4}{W},Ardent Militia,2,Uncommon,Vigilance,5,Creature — Human Soldier,[Creature],Eighth Edition,2003-07-28
8,Mark Zug,8,[White],,{6}{W}{W},Avatar of Hope,4,Rare,"If you have 3 or less life, Avatar of Hope cos...",9,Creature — Avatar,[Creature],Eighth Edition,2003-07-28
9,Justin Sweet,4,[White],,{3}{W},Aven Cloudchaser,2,Common,Flying (This creature cant be blocked except b...,2,Creature — Bird Soldier,[Creature],Eighth Edition,2003-07-28


In [16]:
# datamunge to get "color" out of a list format

def nolist(x):
    return x[0]

cards_no_nulls['colors'] = cards_no_nulls['colors'].apply(nolist)

Some cards have resource (mana) symbols in their text that give away their color. This might give a classifier an unfair advantage.

In [17]:
# flatten single-letter symbols on one card (the tap symbol {T} is caught too)

a = cards_no_nulls.text[3]
re.sub(r"\{[A-Z]\}" , "{1}", a)

u'{1}, {1}: Anaba Shaman deals 1 damage to target creature or player.'

In [18]:
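# remove resource symbols from all cards 

def tap(x):
    # spell out the tap symbol so it survives the mana flattening below
    return re.sub(r"\{T\}" , "Tap ", x)

def nomana(x):
    # flatten every remaining single-letter symbol (colored mana, {X}) to {1}
    return re.sub(r"\{[A-Z]\}" , "{1}", x)

# order matters: tap must run first, or nomana would turn {T} into {1}
cards_no_nulls.text = cards_no_nulls.text.apply(tap)
cards_no_nulls.text = cards_no_nulls.text.apply(nomana)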
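The two substitutions are order-sensitive, which a small standalone sketch makes visible. The functions mirror `tap` and `nomana` from the cell above; the `{W/U}` symbol at the end is a hypothetical multi-character symbol, included only to show what the pattern does not cover.

```python
import re

def tap(text):
    return re.sub(r"\{T\}", "Tap ", text)

def nomana(text):
    return re.sub(r"\{[A-Z]\}", "{1}", text)

card = "{R}, {T}: Anaba Shaman deals 1 damage to target creature or player."

# Intended order: the tap symbol is spelled out before the flattening.
tap_first = nomana(tap(card))

# Reversed order: {T} is flattened to {1} and the tap information is lost.
mana_first = tap(nomana(card))

# The pattern only matches a single capital letter between braces.
untouched = nomana("{W/U}")
```

If multi-character symbols do occur in the data, the character class would need widening; that is outside what the cell above attempts.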

In [19]:
cards_no_nulls.text

0       Flying Whenever Abyssal Specter deals damage t...
1                                                  Flying
2               You draw three cards and you lose 3 life.
3       {1}, Tap : Anaba Shaman deals 1 damage to targ...
4       Flying When Angel of Mercy enters the battlefi...
5       Flying Tap : Target attacking or blocking crea...
6                                      Tap : Draw a card.
7                                               Vigilance
8       If you have 3 or less life, Avatar of Hope cos...
9       Flying (This creature cant be blocked except b...
10      Flying (This creature cant be blocked except b...
11      Flying (This creature cant be blocked except b...
12      If target opponent has more cards in hand than...
13      Flying Tap : Add one mana of any color to your...
14      Enchant creature (Target a creature as you cas...
15      Blaze deals X damage to target creature or pla...
16       You gain 3 life for each creature attacking you.
17      Flying

In [20]:
# store to disk

# cards_no_nulls.reset_index(inplace=True)
# cards_no_nulls.pop('index')

cards_no_nulls.to_pickle('data/cards_modern.pkl')

Some cards have their name in the text. Names have no predictive ability and could only lead to overfitting.

In [21]:
# remove name from one card 

re.sub(re.escape(cards_no_nulls.name[3]) , "This", cards_no_nulls.text[3] )

u'{1}, Tap : This deals 1 damage to target creature or player.'

In [24]:
# remove card names from all cards  

import time

t0 = time.time()

for i in xrange(len(cards_no_nulls)):
    # escape the name so regex metacharacters in it are matched literally
    cards_no_nulls['text'][i] = re.sub(re.escape(cards_no_nulls['name'][i]) , "This", cards_no_nulls['text'][i]) 

t1 = time.time()

print round((t1-t0)/60, 2), "minutes"
cards_no_nulls.text

9.96 minutes


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


0       Flying Whenever This deals damage to a player,...
1                                                  Flying
2               You draw three cards and you lose 3 life.
3       {1}, Tap : This deals 1 damage to target creat...
4       Flying When This enters the battlefield, you g...
5       Flying Tap : Target attacking or blocking crea...
6                                      Tap : Draw a card.
7                                               Vigilance
8       If you have 3 or less life, This costs {6} les...
9       Flying (This creature cant be blocked except b...
10      Flying (This creature cant be blocked except b...
11      Flying (This creature cant be blocked except b...
12      If target opponent has more cards in hand than...
13      Flying Tap : Add one mana of any color to your...
14      Enchant creature (Target a creature as you cas...
15      This deals X damage to target creature or player.
16       You gain 3 life for each creature attacking you.
17      Flying
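A card name is a plain string rather than a pattern, so a literal `str.replace` is enough and sidesteps regex metacharacters altogether. The sketch below also strips apostrophes from the name first, since the text column had them removed earlier and a name such as "Ambition's Cost" would otherwise never match. The helper `strip_name` and the last two sample cards are made up for illustration.

```python
import re

def strip_name(name, text):
    # Apply the same apostrophe removal to the name that the text already went through.
    return text.replace(name.replace("'", ""), "This")

names = ["Anaba Shaman", "Hypothetical (Test) Card", "Tester's Card"]
texts = [
    "{1}, Tap : Anaba Shaman deals 1 damage to target creature or player.",
    "Hypothetical (Test) Card gets +1/+1.",
    "Testers Card has haste.",
]

# One pass over both columns; no per-row assignment into a DataFrame.
replaced = [strip_name(n, t) for n, t in zip(names, texts)]

# For comparison: an unescaped name is read as a regex, so the parentheses
# become a group and the literal name is no longer found.
unescaped = re.sub(names[1], "This", texts[1])
escaped = re.sub(re.escape(names[1]), "This", texts[1])
```

Assigning a list like `replaced` back to the `text` column in one step would also avoid the row-by-row chained assignment that triggers the warning above, and is likely much faster than the loop, though that has not been timed here.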

In [26]:
# store to disk

# cards_no_nulls.reset_index(inplace=True)
# cards_no_nulls.pop('index')

cards_no_nulls.to_pickle('data/cards_modern_no_name.pkl')

Some cards have parentheses explaining what a rules term means. This text is redundant and could only mislead a classifier.

In [41]:
# remove helper text from one card 

a = cards_no_nulls.text[9]
re.sub(r'\([^)]*\)' , "", a)

u'Flying  When This enters the battlefield, destroy target enchantment.'

In [42]:
# remove helper text 

def hardmode(x):
    return re.sub(r'\([^)]*\)' , "", x)

cards_no_nulls.text = cards_no_nulls.text.apply(hardmode)
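Removing a parenthesised span leaves the spaces on both sides behind, which is why the single-card output above shows a double space after "Flying". A minimal sketch of an optional follow-up step that collapses runs of whitespace; the `squeeze` helper and the reminder text in the sample are assumptions, not part of the notebook.

```python
import re

def hardmode(text):
    return re.sub(r"\([^)]*\)", "", text)

def squeeze(text):
    # Collapse any run of whitespace to one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

# Made-up reminder text in the same shape as the card shown above.
card = ("Flying (It cant be blocked except by creatures with flying.) "
        "When This enters the battlefield, destroy target enchantment.")

stripped = hardmode(card)
tidy = squeeze(stripped)

# The pattern stops at the first closing parenthesis, so nested
# parentheses would leave a stray tail behind.
nested = hardmode("a (b (c) d) e")
```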

In [45]:
# store to disk

cards_no_nulls.to_pickle('data/5color_modern_no_name_hardmode.pkl')

### New DF with all modern cards



In [46]:
# groups all modern cards of all colors from sets into one data frame 

cards = None

for s in xrange(len(modern_sets)):
    print "Reading in set:", modern_sets.name[s]
    target = pd.read_json(json.dumps(modern_sets.cards[s]))
    
    # slice to just the good stuff 
    target = target[['artist', 'cmc', 'colors', 'flavor', 'manaCost', 'name', 
                    'power', 'rarity', 'text', 'toughness', 'type', 'types']]
    
    # add set name and date
    target['set'] = modern_sets.name[s]
    target['releaseDate'] = modern_sets.releaseDate[s]    
    
    # add to cards df 
    cards = pd.concat([cards, target])

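# clean up the text: turn newlines into spaces and strip apostrophes
cards['text'] = cards['text'].replace("\n", " ", regex=True)
cards['text'] = cards['text'].replace("'", "", regex=True)

print 
# info() prints its own summary and returns None, so it needs no print
cards.info(verbose=False)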

# drop tokens 
cards_no_nulls = cards.loc[cards['rarity'].isin(['Common', 'Uncommon', 
                                                 'Rare', 'Mythic Rare'])]

# only keep cards with text
cards_no_nulls = cards_no_nulls[cards_no_nulls['text'].notnull()]

# spell out the "tap" symbol first so the mana flattening does not erase it
cards_no_nulls.text = cards_no_nulls.text.apply(tap)

# flatten all resource symbols from all cards 
cards_no_nulls.text = cards_no_nulls.text.apply(nomana)

# remove helper text 
cards_no_nulls.text = cards_no_nulls.text.apply(hardmode)

# reset index and drop the old index values
cards_no_nulls.reset_index(drop=True, inplace=True)

print "Done!"
cards_no_nulls.info()

Reading in set: Eighth Edition
Reading in set: Mirrodin
Reading in set: Darksteel
Reading in set: Fifth Dawn
Reading in set: Champions of Kamigawa
Reading in set: Betrayers of Kamigawa
Reading in set: Saviors of Kamigawa
Reading in set: Ninth Edition
Reading in set: Ravnica: City of Guilds
Reading in set: Guildpact
Reading in set: Dissension
Reading in set: Coldsnap
Reading in set: Time Spiral
Reading in set: Time Spiral "Timeshifted"
Reading in set: Planar Chaos
Reading in set: Future Sight
Reading in set: Tenth Edition
Reading in set: Lorwyn
Reading in set: Morningtide
Reading in set: Shadowmoor
Reading in set: Eventide
Reading in set: Shards of Alara
Reading in set: Conflux
Reading in set: Alara Reborn
Reading in set: Magic 2010
Reading in set: Zendikar
Reading in set: Worldwake
Reading in set: Rise of the Eldrazi
Reading in set: Magic 2011
Reading in set: Scars of Mirrodin
Reading in set: Mirrodin Besieged
Reading in set: New Phyrexia
Reading in set: Magic 2012
Reading in set: Inni
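Since this cell repeats the text-cleaning steps from earlier, one option is to fold them into a single function so both tables go through an identical pipeline. A minimal sketch, where `clean_rules_text` and the sample string are hypothetical and the three step functions mirror `tap`, `nomana` and `hardmode` from above:

```python
import re

def tap(text):
    return re.sub(r"\{T\}", "Tap ", text)

def nomana(text):
    return re.sub(r"\{[A-Z]\}", "{1}", text)

def hardmode(text):
    return re.sub(r"\([^)]*\)", "", text)

def clean_rules_text(text):
    # Apply the steps in the same order as the cell above.
    for step in (tap, nomana, hardmode):
        text = step(text)
    return text

sample = "{G}, {T}: Add {1} (like this)."
cleaned = clean_rules_text(sample)
```

A function like this could be passed to `.apply` on the `text` column in place of the three separate calls.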

In [47]:
# store to disk

cards_no_nulls.to_pickle('data/all_cards_modern_no_name_hardmode.pkl')