# Deep Learning Based Card Classification
## Introduction
The goal of this notebook is to classify the cards based on their *oracle_text* with deep learning models

## Text Preprocessing

In [1]:
import pandas as pd
from os.path import join
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import re, string
from spacy.lang.en.stop_words import STOP_WORDS

df_path = join(*['..', '..', 'data', 'cards-tags', 'tagged_cards.csv'])
word2vec_path = join(*['../../data/word2vec.txt.gz'])

word_dim = 100

Create numerical labels from the cards tags:

In [3]:
cards_df = pd.read_csv(df_path)

# Get numerical encoding of the card
le = LabelEncoder()
cards_df['label'] = le.fit_transform(cards_df['tag'])

cards_df.head()

Unnamed: 0,name,oracle_text,oracleid,tag,type_line,label
0,Abomination of Gudul,Flying\nWhenever Abomination of Gudul deals co...,3d98af5f-7a0b-4a5a-b3e4-f3c9d150c993,discard-outlet,Creature — Horror,0
1,Academy Elite,Academy Elite enters the battlefield with X +1...,ba6c3c72-c014-45c6-a0b4-59eb9a65303e,discard-outlet,Creature — Human Wizard,0
2,Academy Raider,Intimidate (This creature can't be blocked exc...,75131d75-0703-44d0-b503-35190be8e66f,discard-outlet,Creature — Human Warrior,0
3,Akoum Flameseeker,"Cohort — {T}, Tap an untapped Ally you control...",efae637f-3232-46f2-9839-f3386e2f447d,discard-outlet,Creature — Human Shaman Ally,0
4,"Alexi, Zephyr Mage","{X}{U}, {T}, Discard two cards: Return X targe...",3f60de36-ed63-4d08-a012-fc16e91da46d,discard-outlet,Legendary Creature — Human Spellshaper,0


Create functions to normalize text (remove carriage return, tabs, punctuation...) and filter out English stop words:

In [4]:
def normalize(text):
    text = text.replace('\n', ' ').replace('\t', '')
    text = re.split(r'\W+', text)
    table = str.maketrans('', '', string.punctuation)
    text = [word.translate(table) for word in text]
    text = ' '.join([word.lower() for word in text if word != ''])
    return text

def filter_stop_words(text):
    text = re.split(r'\W+', text)
    text = ' '.join([word.lower() for word in text if word not in STOP_WORDS])
    return text

Apply those functions on the cards *oracle_text*:

In [5]:
cards_df = cards_df.loc[:, ['oracle_text', 'label']].dropna()
cards_df.loc[:,'normalized_oracle_text'] = cards_df['oracle_text'].apply(lambda x: filter_stop_words(normalize(x)))

Result:

In [6]:
cards_df.head()

Unnamed: 0,oracle_text,label,normalized_oracle_text
0,Flying\nWhenever Abomination of Gudul deals co...,0,flying abomination gudul deals combat damage p...
1,Academy Elite enters the battlefield with X +1...,0,academy elite enters battlefield x 1 1 counter...
2,Intimidate (This creature can't be blocked exc...,0,intimidate creature t blocked artifact creatur...
3,"Cohort — {T}, Tap an untapped Ally you control...",0,cohort t tap untapped ally control discard car...
4,"{X}{U}, {T}, Discard two cards: Return X targe...",0,x u t discard cards return x target creatures ...


## Create *word2vec* Embedding for *oracle_text*
We create a 100 dimensional word embedding for all the cards *oracle_text*

**NB: those word embeddings can be created just once, skip this section if already done**

First we fetch the *oracle_text* for all the available cards:

In [18]:
import psycopg2

conn = psycopg2.connect(database="mtg", user="postgres", password="postgres", port=5432, host='localhost')
cur = conn.cursor()
cur.execute("select oracle_text from cards where exists (select 1 from jsonb_each_text(cards.legalities) j where j.value not like '%not_legal%') and lang='en';")

cards = []
card = cur.fetchone()
 
while card is not None:
    card = cur.fetchone()
    cards.append(card)
 
cur.close()

cards = cards[:-1]

Second we normalize all those *oracle_text*s:

In [27]:
cards = [filter_stop_words(normalize(card[0])) for card in cards if card[0]]

Third we create the *Word2Vec* representations for all the words in the normalized *oracle_text*s and persist them:

In [31]:
from gensim.models import Word2Vec
from gensim.test.utils import get_tmpfile
import gzip


path = get_tmpfile("./data/word2vec.model")

model = Word2Vec(cards, size=word_dim, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("../../data/word2vec.txt")

# gzip the model
f_in = open('../../data/word2vec.txt', 'rb')
f_out = gzip.open('../../data/word2vec.txt.gz', 'wb')
f_out.writelines(f_in)
f_out.close()
f_in.close()

# Then command line:
# python3.6 -m spacy init-model en ./data/spacy.word2vec.model --vectors-loc data/word2vec.txt.gz

And the last step is to create a SpaCy model based on those persisted *Word2Vec* representations by executing:

`python3.6 -m spacy init-model en ./data/spacy.word2vec.model --vectors-loc data/word2vec.txt.gz`

in the terminal

## Modeling

The *SpaCy* model aggregates the (100-dimensional) vectors of the words in the cards (resp. in their *oracle_text*) into a single (100-dimensional) vector. For each card, we create this vector:

In [7]:
from spacy import load
from numpy import zeros


nlp_mtg = load('../../data/spacy.word2vec.model')

X = zeros((cards_df.shape[0], word_dim))
for i, text in enumerate(cards_df['normalized_oracle_text']):
    X[i,:] = nlp_mtg(text).vector

The model will, for each card it will predict, a 6 dimensional array. The value in each dimension will correspond to the probability for the card to belong to the corresponding tag.

For this, we have to provide `y` as a 6 dimensional vector. We *one-hot-encode* the labels: 
* *0* => [1, 0, 0, 0, 0, 0]
* *1* => [0, 1, 0, 0, 0, 0]
* etc.

In [39]:
from sklearn.preprocessing import OneHotEncoder


y = cards_df['label'].values.reshape(-1, 1)
enc = OneHotEncoder()
y =  enc.fit_transform(y)

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


We split the data into train and test, with 90% allocated to train and 10% to test:

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=7)

In [41]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Bidirectional, GlobalMaxPool1D, Dropout

Now we create a model

**NB: here this model is not deep at all, it has just one input layer with 10 cells and one output layer, so it's not deep at all. There is also no "wizardry" involved concerning the learning rates etc. At his stage, it's just for illustration purpose and as a basis for further developments**

In [42]:
n_labels = y.shape[1]

model = Sequential()

model.add(Dense(10, input_dim=word_dim, activation='relu'))
model.add(Dense(n_labels, activation="sigmoid"))

In [43]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [44]:
model.fit(X_train, y_train, epochs=50, verbose=0)

<tensorflow.python.keras.callbacks.History at 0x7fe49a2eea20>

In [45]:
from numpy import argmax, mean


y_predict = model.predict(X_test)
y_predict = argmax(y_predict, 1)
y_test = argmax(y_test, 1)

accuracy = mean(y_predict == y_test)
print(f'accuracy: {accuracy}')

accuracy: 0.46010972067840367


Ok, the accuracy is "just" 3X better than the random baseline, and 2X worse than what we could obtain with machin learning models, but again, this is just a dummy model to illustrate how to use deep learning for classifying cards