# Deep Learning Based Card Classification
## Introduction
The goal of this notebook is to classify cards based on their *oracle_text* with deep learning models.

## Text Preprocessing

In [57]:
import pandas as pd
from os.path import join
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import re, string
from spacy.lang.en.stop_words import STOP_WORDS

df_path = join('..', '..', 'data', 'cards-tags', 'tagged_cards.csv')
word2vec_path = join('..', '..', 'data', 'word2vec.txt.gz')

word_dim = 100

Create numerical labels from the card tags:

In [58]:
cards_df = pd.read_csv(df_path)

# Encode each card's tag as an integer label
le = LabelEncoder()
cards_df['label'] = le.fit_transform(cards_df['tag'])

cards_df.head()

| | name | oracle_text | oracleid | tag | type_line | label |
|---|---|---|---|---|---|---|
| 0 | Abomination of Gudul | Flying\nWhenever Abomination of Gudul deals co... | 3d98af5f-7a0b-4a5a-b3e4-f3c9d150c993 | discard-outlet | Creature — Horror | 0 |
| 1 | Academy Elite | Academy Elite enters the battlefield with X +1... | ba6c3c72-c014-45c6-a0b4-59eb9a65303e | discard-outlet | Creature — Human Wizard | 0 |
| 2 | Academy Raider | Intimidate (This creature can't be blocked exc... | 75131d75-0703-44d0-b503-35190be8e66f | discard-outlet | Creature — Human Warrior | 0 |
| 3 | Akoum Flameseeker | Cohort — {T}, Tap an untapped Ally you control... | efae637f-3232-46f2-9839-f3386e2f447d | discard-outlet | Creature — Human Shaman Ally | 0 |
| 4 | Alexi, Zephyr Mage | {X}{U}, {T}, Discard two cards: Return X targe... | 3f60de36-ed63-4d08-a012-fc16e91da46d | discard-outlet | Legendary Creature — Human Spellshaper | 0 |
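`LabelEncoder` maps each distinct tag to an integer, assigning the codes in sorted order of the tag names. A minimal plain-Python sketch of that mapping, where every tag name other than `discard-outlet` is made up for illustration:

```python
# Hypothetical tags; only 'discard-outlet' appears in the data shown above
tags = ['ramp', 'discard-outlet', 'removal', 'ramp', 'discard-outlet']

# Sorted unique tags play the role of LabelEncoder.classes_
classes = sorted(set(tags))
tag_to_label = {tag: idx for idx, tag in enumerate(classes)}

labels = [tag_to_label[tag] for tag in tags]
print(classes)
print(labels)

# Inverse mapping, comparable to LabelEncoder.inverse_transform
decoded = [classes[label] for label in labels]
```

Keeping the fitted encoder around is what later allows a predicted integer to be turned back into a tag name.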


Create functions to normalize text (remove newlines, tabs, punctuation...) and filter out English stop words:

In [59]:
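def normalize(text):
    # Replace newlines with spaces, drop tabs and apostrophes ("can't" -> "cant")
    text = text.replace('\n', ' ').replace('\t', '').replace('\'', '')
    # Split on runs of non-word characters, then strip any remaining punctuation
    words = re.split(r'\W+', text)
    table = str.maketrans('', '', string.punctuation)
    words = [word.translate(table) for word in words]
    # Lowercase and drop empty tokens
    return ' '.join([word.lower() for word in words if word != ''])

def filter_stop_words(text):
    words = re.split(r'\W+', text)
    # Compare in lowercase so that capitalised stop words are filtered too
    return ' '.join([word.lower() for word in words if word and word.lower() not in STOP_WORDS])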
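To see what the two steps do to a card, here is a self-contained sketch on a sample text. It re-implements the two functions and uses a tiny, hypothetical stop-word set (`DEMO_STOP_WORDS`) in place of the spaCy `STOP_WORDS`, so the exact words removed in the notebook may differ:

```python
import re
import string

# Hypothetical, tiny stop-word set standing in for spaCy's STOP_WORDS
DEMO_STOP_WORDS = {'the', 'of', 'a', 'to', 'whenever'}

def normalize(text):
    text = text.replace('\n', ' ').replace('\t', '').replace('\'', '')
    words = re.split(r'\W+', text)
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table) for w in words]
    return ' '.join(w.lower() for w in words if w != '')

def filter_stop_words(text, stop_words=DEMO_STOP_WORDS):
    return ' '.join(w for w in re.split(r'\W+', text) if w and w not in stop_words)

sample = "Flying\nWhenever Abomination of Gudul deals combat damage to a player"
normalized = normalize(sample)
filtered = filter_stop_words(normalized)
print(normalized)
print(filtered)

# Mana symbols and apostrophes collapse to bare tokens
symbols = normalize("{X}{U}, {T}, Discard two cards: can't")
print(symbols)
```

Note that symbols such as `{T}` end up as single-letter tokens (`t`), which is visible in the normalized output of the notebook as well.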

Apply those functions to the cards' *oracle_text*:

In [60]:
cards_df = cards_df.loc[:, ['oracle_text', 'label']].dropna()
cards_df.loc[:,'normalized_oracle_text'] = cards_df['oracle_text'].apply(lambda x: filter_stop_words(normalize(x)))

Result:

In [61]:
cards_df.head()

| | oracle_text | label | normalized_oracle_text |
|---|---|---|---|
| 0 | Flying\nWhenever Abomination of Gudul deals co... | 0 | flying abomination gudul deals combat damage p... |
| 1 | Academy Elite enters the battlefield with X +1... | 0 | academy elite enters battlefield x 1 1 counter... |
| 2 | Intimidate (This creature can't be blocked exc... | 0 | intimidate creature cant blocked artifact crea... |
| 3 | Cohort — {T}, Tap an untapped Ally you control... | 0 | cohort t tap untapped ally control discard car... |
| 4 | {X}{U}, {T}, Discard two cards: Return X targe... | 0 | x u t discard cards return x target creatures ... |


## Create *word2vec* Embedding for *oracle_text*
We create a 100-dimensional word embedding from the *oracle_text* of all the cards.

**NB: those word embeddings only need to be created once; skip this section if already done.**

First we fetch the *oracle_text* for all the available cards:

In [82]:
import psycopg2

conn = psycopg2.connect(database="mtg", user="postgres", password="postgres", port=5432, host='localhost')
cur = conn.cursor()
cur.execute("select oracle_text from cards where exists (select 1 from jsonb_each_text(cards.legalities) j where j.value not like '%not_legal%') and lang='en';")

# fetchall() returns every row as a 1-tuple: (oracle_text,).
# (Calling fetchone() before the loop and again inside it skipped the first row.)
cards = cur.fetchall()

cur.close()
conn.close()
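Each fetched row is a tuple with one element per selected column, which is why the text is later accessed as `card[0]`, and cards without text come back as `None`. A minimal sketch of that row shape, using an in-memory `sqlite3` database as a stand-in for the Postgres connection (the table content is made up):

```python
import sqlite3

# In-memory SQLite database standing in for the Postgres "mtg" database
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('create table cards (oracle_text text)')
cur.executemany('insert into cards values (?)', [('Flying',), (None,), ('Trample',)])

cur.execute('select oracle_text from cards')
rows = cur.fetchall()      # list of 1-tuples, one per row
cur.close()
conn.close()

# Keep the text of the rows whose oracle_text is neither NULL nor empty
texts = [row[0] for row in rows if row[0]]
print(rows)
print(texts)
```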

Second, we normalize all those *oracle_text*s:

In [83]:
cards = [filter_stop_words(normalize(card[0])) for card in cards if card[0]]
cards = [card.split(' ') for card in cards]

How many distinct words are there in the cards? We will need this number later.

In [85]:
import itertools as it
card_words = list(set(list(it.chain.from_iterable(cards))))
print(f'There are {len(card_words)} distinct words in the cards')

There are 8949 distinct words in the cards
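The distinct-word count is the vocabulary size used later in the notebook when sizing the embedding matrix. A small sketch of the same computation on made-up tokenised cards, with a `Counter` added to show how often each word occurs:

```python
import itertools as it
from collections import Counter

# Toy tokenised cards (made up), in the same list-of-lists shape as `cards`
toy_cards = [
    ['flying', 'target', 'creature'],
    ['draw', 'card'],
    ['target', 'creature', 'gets', 'flying'],
]

all_words = list(it.chain.from_iterable(toy_cards))
distinct_words = set(all_words)
word_counts = Counter(all_words)

print(f'{len(all_words)} tokens, {len(distinct_words)} distinct words')
print(word_counts['flying'])
```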


Third, we create the *Word2Vec* representations for all the words in the normalized *oracle_text*s and persist them:

In [86]:
from gensim.models import Word2Vec
from gensim.test.utils import get_tmpfile
import gzip


path = get_tmpfile("./data/word2vec.model")

# `size` is the gensim < 4.0 name of this argument (renamed `vector_size` in gensim 4.0)
model = Word2Vec(cards, size=word_dim, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("../../data/word2vec.txt")

# gzip the model
with open('../../data/word2vec.txt', 'rb') as f_in, gzip.open('../../data/word2vec.txt.gz', 'wb') as f_out:
    f_out.writelines(f_in)

# Then command line:
# python3.6 -m spacy init-model en ./data/spacy.word2vec.model --vectors-loc data/word2vec.txt.gz
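By default `save_word2vec_format` writes a plain-text file: a header line with the vocabulary size and the vector dimension, followed by one line per word. The sketch below writes a toy file in that layout, compresses it and reads it back; the words, values and temporary paths are made up for illustration:

```python
import gzip
import os
import shutil
import tempfile

# Toy vectors in the word2vec text format: a "<vocab size> <dimension>" header,
# then one "<word> <v1> <v2> ..." line per word (values made up)
lines = ['2 3', 'flying 0.1 0.2 0.3', 'trample 0.4 0.5 0.6']

tmp_dir = tempfile.mkdtemp()
txt_path = os.path.join(tmp_dir, 'word2vec.txt')
gz_path = txt_path + '.gz'

with open(txt_path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(lines) + '\n')

# Stream the text file into a gzip archive without loading it all in memory
with open(txt_path, 'rb') as f_in, gzip.open(gz_path, 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

# Read the archive back and parse the header and the words
with gzip.open(gz_path, 'rt', encoding='utf-8') as f:
    vocab_size, dim = (int(n) for n in f.readline().split())
    words = [line.split()[0] for line in f]

print(vocab_size, dim, words)
shutil.rmtree(tmp_dir)
```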

The last step is to create a SpaCy model from those persisted *Word2Vec* representations by running the following command in a terminal:

`python3.6 -m spacy init-model en ./data/spacy.word2vec.model --vectors-loc data/word2vec.txt.gz`

## Modeling
### Card text as a vector

The *SpaCy* model aggregates the 100-dimensional vectors of the words in a card's *oracle_text* into a single 100-dimensional vector (by default, the average of the word vectors). For each card, we create this vector:

In [87]:
from spacy import load
from numpy import zeros


nlp_mtg = load('../../data/spacy.word2vec.model')

X = zeros((cards_df.shape[0], word_dim, 1))
for i, text in enumerate(cards_df['normalized_oracle_text']):
    X[i,:] = nlp_mtg(text).vector.reshape(-1, 1)
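As a rough illustration of that aggregation, the sketch below averages made-up 4-dimensional word vectors into one card vector and reshapes it the way the loop above does. Treating words without a vector as all-zero vectors is an assumption of this sketch, and `card_vector` is a hypothetical helper, not a SpaCy function:

```python
import numpy as np

# Made-up 4-dimensional word vectors (the notebook uses 100 dimensions)
toy_vectors = {
    'flying': np.array([1.0, 0.0, 2.0, 0.0]),
    'target': np.array([0.0, 1.0, 0.0, 2.0]),
    'creature': np.array([2.0, 2.0, 1.0, 1.0]),
}

def card_vector(text, vectors, dim=4):
    """Average the word vectors of a text into a single vector."""
    words = text.split()
    if not words:
        return np.zeros(dim)
    # Assumption: words without a vector contribute an all-zero vector
    return np.mean([vectors.get(w, np.zeros(dim)) for w in words], axis=0)

vec = card_vector('flying target creature', toy_vectors)
column = vec.reshape(-1, 1)   # same reshape as in the loop above
print(vec)
print(column.shape)
```

One consequence of averaging is that word order is lost: two cards with the same words in a different order get the same vector.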

For each card, the model predicts a 6-dimensional array. The value in each dimension corresponds to the probability that the card belongs to the corresponding tag.

To train it, we have to provide `y` as 6-dimensional vectors, so we *one-hot-encode* the labels:
* *0* => [1, 0, 0, 0, 0, 0]
* *1* => [0, 1, 0, 0, 0, 0]
* etc.
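The same encoding can be sketched with plain NumPy, which also shows how `argmax` maps a one-hot (or probability) vector back to its integer label. The labels below are toy values:

```python
import numpy as np

n_classes = 6
labels = np.array([0, 1, 5, 1])          # toy integer labels

# Row i of the identity matrix is the one-hot vector of class i
one_hot = np.eye(n_classes)[labels]
print(one_hot[0])
print(one_hot[1])

# argmax recovers the integer label from a one-hot (or probability) vector
recovered = one_hot.argmax(axis=1)
print(recovered)
```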

In [117]:
from sklearn.preprocessing import OneHotEncoder


y = cards_df['label'].values.reshape(-1, 1)
enc = OneHotEncoder()
y =  enc.fit_transform(y)


In [118]:
y.shape

(3334, 6)

We split the data into train and test, with 90% allocated to train and 10% to test:

In [89]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=7)

In [90]:
y_test

<334x6 sparse matrix of type '<class 'numpy.float64'>'
	with 334 stored elements in Compressed Sparse Row format>
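The sample counts in this notebook follow from simple arithmetic. The sketch below assumes that `train_test_split` rounds the test size up and that Keras holds out the last `validation_split` fraction of the training data, which is consistent with the 334 test rows shown above and with the 2700/300 split reported in the training log:

```python
import math

n_samples = 3334          # rows in y, from y.shape above
test_size = 0.1
validation_split = 0.1

# The test set size is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# The validation set is taken from the end of the training data
n_fit = int(n_train * (1 - validation_split))
n_val = n_train - n_fit

print(n_test, n_train, n_fit, n_val)
```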

In [91]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, GRU, Bidirectional, SpatialDropout1D, Conv1D, GlobalMaxPooling1D, GlobalAveragePooling1D, Dropout, concatenate
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

Now we create a model consisting of several dense layers with pooling in the middle:

In [92]:
n_labels = y.shape[1]
batch_size = 32
n_epochs = 50

In [93]:
sequence_input = Input(shape=(word_dim, 1, ))
x = Dense(60, activation='relu')(sequence_input)
x = Dense(30, activation='relu')(x)
# x = Conv1D(15, kernel_size=3, padding='valid', kernel_initializer='glorot_uniform')(x)
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
x = concatenate([avg_pool, max_pool])
x = Dense(60, activation='relu')(x)
preds = Dense(n_labels, activation='sigmoid')(x)
model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy', optimizer=Adam(lr=1e-3),metrics=['accuracy'])
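Because the input has shape `(word_dim, 1)`, each `Dense` layer is applied independently to every one of the 100 positions, and the two pooling layers then collapse that axis. A NumPy sketch of the shapes involved, using random toy activations rather than the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations with shape (batch, steps, features), as produced by
# applying Dense(30) to an input of shape (batch, 100, 1)
x = rng.normal(size=(2, 100, 30))

avg_pool = x.mean(axis=1)    # what GlobalAveragePooling1D computes
max_pool = x.max(axis=1)     # what GlobalMaxPooling1D computes
pooled = np.concatenate([avg_pool, max_pool], axis=-1)

print(avg_pool.shape, max_pool.shape, pooled.shape)
```

The concatenated tensor has 60 features per card, which is what the following `Dense(60)` layer receives.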

In [94]:
model_filepath = '../../data/weights_dl_model.hdf5'
checkpoint = ModelCheckpoint(model_filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
early = EarlyStopping(monitor='val_acc', mode='max', patience=5)
callbacks = [checkpoint, early]

model.fit(X_train, y_train, batch_size=batch_size, epochs=n_epochs, validation_split=.1, callbacks=callbacks, verbose=1)

Train on 2700 samples, validate on 300 samples
Epoch 1/50
Epoch 00001: val_acc improved from -inf to 0.83333, saving model to ../../data/weights_dl_model.hdf5
Epoch 2/50
Epoch 00002: val_acc did not improve from 0.83333
Epoch 3/50
Epoch 00003: val_acc improved from 0.83333 to 0.85056, saving model to ../../data/weights_dl_model.hdf5
Epoch 4/50
Epoch 00004: val_acc did not improve from 0.85056
Epoch 5/50
Epoch 00005: val_acc improved from 0.85056 to 0.85278, saving model to ../../data/weights_dl_model.hdf5
Epoch 6/50
Epoch 00006: val_acc improved from 0.85278 to 0.86611, saving model to ../../data/weights_dl_model.hdf5
Epoch 7/50
Epoch 00007: val_acc improved from 0.86611 to 0.86722, saving model to ../../data/weights_dl_model.hdf5
Epoch 8/50
Epoch 00008: val_acc improved from 0.86722 to 0.86778, saving model to ../../data/weights_dl_model.hdf5
Epoch 9/50
Epoch 00009: val_acc improved from 0.86778 to 0.86778, saving model to ../../data/weights_dl_model.hdf5
Epoch 10/50
Epoch 00010: val_

<tensorflow.python.keras.callbacks.History at 0x7ff7a50925f8>

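This reaches roughly 86% validation accuracy. With a sigmoid output and a `binary_crossentropy` loss, Keras most likely reports an accuracy averaged over the 6 outputs of each card rather than a multi-class accuracy: the 0.83333 (5/6) of the first epoch is what a model predicting no tag at all would score. The figure is therefore not directly comparable with the score of the next model.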
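When reading this figure, it helps to know how an accuracy computed per output unit behaves on one-hot targets. The sketch below (NumPy only, toy data) compares an element-wise accuracy over all 6 outputs with the usual argmax-based accuracy for a degenerate model that always outputs zeros. Whether Keras used the element-wise variant here depends on the version and the loss, so treat that as an assumption:

```python
import numpy as np

n_classes = 6
labels = np.arange(n_classes)                 # one toy sample per class
y_true = np.eye(n_classes)[labels]

# A degenerate "model" that outputs 0 for every label
y_pred = np.zeros_like(y_true)

# Element-wise accuracy: every output unit is scored on its own
elementwise_acc = np.mean((y_pred > 0.5) == (y_true > 0.5))

# Multi-class accuracy: only the highest-scoring class is compared
multiclass_acc = np.mean(y_pred.argmax(axis=1) == labels)

print(elementwise_acc, multiclass_acc)
```

With 6 classes, the element-wise accuracy of this useless model is 5/6, while its multi-class accuracy on the balanced toy set is 1/6.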

### Card text as word embeddings
Here the embeddings of the words of a text are stacked, which gives one matrix per text (padded with 0 if the text is too short, truncated if it is too long):

#### Build word embedding

In [160]:
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, SpatialDropout1D, LSTM, Dropout, Dense
from keras.models import Model
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import accuracy_score

In [152]:
n_words = len(card_words)

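# Row idx holds the vector of card_words[idx]; the extra last row stays all-zero
embeddings_index = np.zeros((n_words + 1, word_dim))
for idx, word in enumerate(card_words):
    try:
        embeddings_index[idx] = nlp_mtg.vocab[word].vector
    except Exception:
        # Leave an all-zero row for words whose vector cannot be retrieved
        pass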
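An `Embedding` layer looks up row `i` of its weight matrix for token index `i`, so the rows have to follow the same word-to-index mapping as the tokenizer. The Keras `Tokenizer` builds its own `word_index` (starting at 1, most frequent word first), which is not guaranteed to match the order of `card_words`. A minimal sketch of a matrix built from a tokenizer-style index, with made-up vectors and indices:

```python
import numpy as np

emb_dim = 4
# Made-up word vectors and a made-up tokenizer index (index 0 is reserved for padding)
toy_vectors = {'flying': np.full(emb_dim, 0.1), 'target': np.full(emb_dim, 0.2)}
toy_word_index = {'target': 1, 'creature': 2, 'flying': 3}

# Row i must hold the vector of the word whose tokenizer index is i
embedding_matrix = np.zeros((len(toy_word_index) + 1, emb_dim))
for word, idx in toy_word_index.items():
    vector = toy_vectors.get(word)
    if vector is not None:          # words without a vector keep an all-zero row
        embedding_matrix[idx] = vector

print(embedding_matrix)
```

In the notebook this would mean filling the matrix from the fitted tokenizer's `word_index` rather than from `enumerate(card_words)`; since the embedding layer is frozen, a mismatch cannot be corrected during training.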

In [153]:
max_length_card = max([len(card) for card in cards])
print(f'The maximal length of a card is {max_length_card}')

The maximal length of a card is 58
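The classifier below pads or truncates every token sequence to a fixed `input_length`. With its default settings, Keras `pad_sequences` pads short sequences with zeros at the front and truncates long ones from the front. A plain-Python sketch of that behaviour, where `pad_pre` is a hypothetical helper and not a Keras function:

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate `seq` to exactly `maxlen` items (maxlen >= 1)."""
    seq = list(seq)[-maxlen:]                 # keep the last `maxlen` tokens
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([5, 8, 2], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long)
```

With the longest card at 58 tokens and an `input_length` of 20 used further down, longer cards keep only their last 20 tokens.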


The following class wraps a Keras text classification model in a scikit-learn style estimator:

In [168]:
class CardClassifier(BaseEstimator, TransformerMixin):
    '''Wrapper class for keras text classification models that takes raw text as input.'''
  
    def __init__(self, max_words=10000, input_length=30, emb_dim=100, n_classes=6, epochs=100, batch_size=32, emb_idx=0, lr=1e-3, model_path='/tmp/text_classification.hdf5'):
        self.max_words = max_words
        self.input_length = input_length
        self.emb_dim = emb_dim
        self.n_classes = n_classes
        self.epochs = epochs
        self.bs = batch_size
        self.embeddings_index = emb_idx
        self.lr = lr
        self.model_path = model_path
        self.tokenizer = Tokenizer(num_words=self.max_words+1, lower=True, split=' ')
        self.model = self._get_model()
        # __init__ must not return a value; just print the summary
        self.model.summary()
    
    def _get_model(self):
        input_text = Input((self.input_length,))
        text_embedding = Embedding(input_dim=self.max_words+1, output_dim=self.emb_dim, input_length=self.input_length, 
                                   mask_zero=False, weights=[self.embeddings_index], trainable=False)(input_text)
        text_embedding = SpatialDropout1D(0.4)(text_embedding)
        # Three stacked (unidirectional) LSTM layers with dropout in between
        x = LSTM(units=50, recurrent_dropout=0.2, return_sequences=True)(text_embedding)
        x = Dropout(0.2)(x)
        x = LSTM(units=50, recurrent_dropout=0.2, return_sequences=True)(x)
        x = Dropout(0.2)(x)
        x = LSTM(units=50, recurrent_dropout=0.2)(x)
        out = Dense(units=self.n_classes, activation="softmax")(x)
        model = Model(inputs=[input_text],outputs=[out])
        
        model.compile(optimizer=Adam(lr=self.lr), loss='categorical_crossentropy', metrics=['accuracy'])
        return model
  
    def _get_sequences(self, texts):
        seqs = self.tokenizer.texts_to_sequences(texts)
        return pad_sequences(seqs, maxlen=self.input_length, value=0)
  
    def fit(self, X, y):
        self.tokenizer.fit_on_texts(X)
        self.tokenizer.word_index = {e: i for e,i in self.tokenizer.word_index.items() if i <= self.max_words}
        self.tokenizer.word_index[self.tokenizer.oov_token] = self.max_words + 1
        seqs = self._get_sequences(X)
        
        checkpoint = ModelCheckpoint(self.model_path, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
        early = EarlyStopping(monitor='val_acc', mode='max', patience=5)
        callbacks = [checkpoint, early]
        self.model.fit([seqs ], y, batch_size=self.bs, epochs=self.epochs, validation_split=0.1, callbacks=callbacks)
  
    def predict_proba(self, X, y=None):
        seqs = self._get_sequences(X)
        return self.model.predict(seqs)
  
    def predict(self, X, y=None):
        return np.argmax(self.predict_proba(X), axis=1)
  
    def score(self, X, y):
        y_pred = self.predict(X)
        return accuracy_score(np.argmax(y, axis=1), y_pred)

In [171]:
batch_size = 16
model_path = '../../data/dl_text_classification.hdf5'
card_model = CardClassifier(emb_idx=embeddings_index, max_words=n_words, input_length=20, batch_size=batch_size, lr=1e-2, model_path=model_path)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_16 (InputLayer)        (None, 20)                0         
_________________________________________________________________
embedding_16 (Embedding)     (None, 20, 100)           895000    
_________________________________________________________________
spatial_dropout1d_14 (Spatia (None, 20, 100)           0         
_________________________________________________________________
lstm_39 (LSTM)               (None, 20, 50)            30200     
_________________________________________________________________
dropout_27 (Dropout)         (None, 20, 50)            0         
_________________________________________________________________
lstm_40 (LSTM)               (None, 20, 50)            20200     
_________________________________________________________________
dropout_28 (Dropout)         (None, 20, 50)            0         
__________

In [172]:
X, y = cards_df['normalized_oracle_text'], cards_df['label'].values.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state=7, stratify=y)
# Fit the encoder on the training labels only, then reuse it for the test labels
y_train, y_test = enc.fit_transform(y_train), enc.transform(y_test)
card_model.fit(X_train, y_train)

Train on 2700 samples, validate on 300 samples
Epoch 1/100

Epoch 00001: val_acc improved from -inf to 0.64333, saving model to ../../data/dl_text_classification.hdf5
Epoch 2/100

Epoch 00002: val_acc did not improve from 0.64333
Epoch 3/100

Epoch 00003: val_acc improved from 0.64333 to 0.66333, saving model to ../../data/dl_text_classification.hdf5
Epoch 4/100

Epoch 00004: val_acc improved from 0.66333 to 0.70333, saving model to ../../data/dl_text_classification.hdf5
Epoch 5/100

Epoch 00005: val_acc improved from 0.70333 to 0.76333, saving model to ../../data/dl_text_classification.hdf5
Epoch 6/100

Epoch 00006: val_acc improved from 0.76333 to 0.79000, saving model to ../../data/dl_text_classification.hdf5
Epoch 7/100

Epoch 00007: val_acc did not improve from 0.79000
Epoch 8/100

Epoch 00008: val_acc did not improve from 0.79000
Epoch 9/100

Epoch 00009: val_acc improved from 0.79000 to 0.80333, saving model to ../../data/dl_text_classification.hdf5
Epoch 10/100

Epoch 00010: va

In [173]:
card_model.score(X_test, y_test)

0.8083832335329342

With this model we reach roughly 81% accuracy on the test set.