# TODO  
## Card Similarity Search  
### Data Prep (raw card data > clean card data) 
  - ~~merge card, set, and legality data~~  
  - ~~concat token cards (later - may not want to)~~  
  - ensure clean  
    - fillna, non-english issues, maybe replace symbols ({T}...) w/ text  
  - **AWS**  
    - ~~Load raw MTGJson from S3~~  
    - ~~Lambda (or Glue): Prep > to S3~~  
### Similarity (clean card data > embeddings, similarity matrix)
  - USE embeddings from card text  
  - explore including other card props as text (color, type, mana cost...)  
  - similarity matrix  
  - pre-sort and save each card (50K+ cards)
  - **AWS**  
    - ~~Load clean card data from S3~~  
    - ~~SM Processing Job: USE embeddings & similarity matrix > to S3~~  
    - Lambda: embeddings/sim_matrix from S3 > EFS (?)  
    - Lambda: sort each card by similarity EFS > EFS (?)
    - Data Pipeline: cards data from S3 to Dynamo (?)  
### App Backend  
  - API accepts card name and returns topK similar cards (w/ some metadata for filtering)  
  - **AWS**  
    - Lambda: API queries EFS or Dynamo  
    - StepFunctions or Lambda destinations: refresh data pipeline as needed
### App Frontend  
  - Home page  
    - Search box and results  
    - A few filters (color, type, mana cost...)  
    - ad placement  
    - sign in (eventually)  
### Deploy  
  - Backend
    - Serverless  
    - Seed  
  - Frontend  
    - React  
    - Amplify

# Data Prep

In [1]:
import os
import json
import numpy as np
import pandas as pd

#import tensorflow as tf
#import tensorflow_hub as hub
#import matplotlib.pyplot as plt

In [75]:
cards_df = pd.read_csv('../data/mtgjson/cards.csv')\
    .drop(columns=['index'])

print(cards_df.shape)
print('{} MB'.format(round(cards_df.memory_usage().sum()/1000000, 2)))
cards_df.head(1)

(55943, 74)
33.12 MB


Unnamed: 0,id,artist,asciiName,availability,borderColor,cardKingdomFoilId,cardKingdomId,colorIdentity,colorIndicator,colors,...,subtypes,supertypes,tcgplayerProductId,text,toughness,type,types,uuid,variations,watermark
0,1,Rebecca Guay,,"mtgo,paper",black,123335.0,122967.0,G,,G,...,,,15023.0,"If you would draw a card, you may instead choo...",,Enchantment,Enchantment,38513fa0-ea83-5642-8ecd-4f0b3daa6768,,


In [76]:
sets_df = pd.read_csv('../data/mtgjson/sets.csv')[['code','name']]\
    .rename(columns={'name': 'setName', 'code':'setCode'})

print(sets_df.shape)
print('{} MB'.format(round(sets_df.memory_usage().sum()/1000000, 2)))
sets_df.head(1)

(545, 2)
0.01 MB


Unnamed: 0,setCode,setName
0,10E,Tenth Edition


In [77]:
# Merge set names into cards
cards_df = cards_df\
    .merge(sets_df, how='left', on='setCode')

print(cards_df.shape)
print('{} MB'.format(round(cards_df.memory_usage().sum()/1000000, 2)))
cards_df.head(1)

(55943, 75)
34.01 MB


Unnamed: 0,id,artist,asciiName,availability,borderColor,cardKingdomFoilId,cardKingdomId,colorIdentity,colorIndicator,colors,...,supertypes,tcgplayerProductId,text,toughness,type,types,uuid,variations,watermark,setName
0,1,Rebecca Guay,,"mtgo,paper",black,123335.0,122967.0,G,,G,...,,15023.0,"If you would draw a card, you may instead choo...",,Enchantment,Enchantment,38513fa0-ea83-5642-8ecd-4f0b3daa6768,,,Tenth Edition


In [78]:
legs_df = pd.read_csv('../data/mtgjson/legalities.csv')\
    .pivot(index='uuid', columns='format', values='status')\
    .reset_index()\
    .fillna('Blank')

print(legs_df.shape)
print('{} MB'.format(round(legs_df.memory_usage().sum()/1000000, 2)))
legs_df.head(1)

(54718, 14)
6.13 MB


format,uuid,brawl,commander,duel,future,historic,legacy,modern,oldschool,pauper,penny,pioneer,standard,vintage
0,00010d56-fe38-5e35-8aed-518019aa36a5,Blank,Legal,Legal,Blank,Blank,Legal,Legal,Blank,Blank,Blank,Legal,Blank,Legal


In [79]:
# Merge legalities into cards
cards_df = cards_df\
    .merge(legs_df, how='left', on='uuid')

print(cards_df.shape)
print('{} MB'.format(round(cards_df.memory_usage().sum()/1000000, 2)))
cards_df.head(1)

(55943, 88)
39.83 MB


Unnamed: 0,id,artist,asciiName,availability,borderColor,cardKingdomFoilId,cardKingdomId,colorIdentity,colorIndicator,colors,...,future,historic,legacy,modern,oldschool,pauper,penny,pioneer,standard,vintage
0,1,Rebecca Guay,,"mtgo,paper",black,123335.0,122967.0,G,,G,...,Blank,Blank,Legal,Legal,Blank,Blank,Legal,Blank,Blank,Legal
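
The pivot-then-merge pattern used for legalities can be checked on toy data. This is a minimal sketch with made-up uuids (`a`, `b`, `c`), not real MTGJSON rows. It surfaces one thing worth checking: `fillna('Blank')` on the pivoted frame only fills formats for cards that appear in `legalities.csv`. A card with no legality rows at all (here `c`) still comes out of the left merge as `NaN`. The shapes above (54,718 pivoted rows vs 55,943 cards) suggest some cards may fall in that group, so a second `fillna` limited to the format columns might be wanted.

```python
import pandas as pd

# Toy long-format legalities: one row per (card, format) pair
legalities = pd.DataFrame({
    'uuid':   ['a', 'a', 'b'],
    'format': ['modern', 'legacy', 'legacy'],
    'status': ['Legal', 'Legal', 'Banned'],
})

# Long -> wide: one row per uuid, one column per format
wide = (
    legalities
    .pivot(index='uuid', columns='format', values='status')
    .reset_index()
    .fillna('Blank')
)

# Card 'c' has no legality rows at all
cards = pd.DataFrame({'uuid': ['a', 'b', 'c'], 'name': ['A', 'B', 'C']})
merged = cards.merge(wide, how='left', on='uuid')

# Fill the gaps introduced by the left merge, limited to the format columns
format_cols = [c for c in wide.columns if c != 'uuid']
merged[format_cols] = merged[format_cols].fillna('Blank')

print(merged)
```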


In [82]:
cards_df.to_csv('cards.csv', index=False)
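
For the open "ensure clean" TODO item, one option is to fill missing text and spell out brace symbols before embedding. A rough sketch follows. The `SYMBOLS` mapping and the `clean_text` helper are illustrative names, and the mapping covers only a handful of symbols (tap, untap, the five colors, colorless); numeric costs such as `{2}` are rendered as generic mana. Whether spelled-out symbols actually help the USE embeddings is untested here.

```python
import re
import pandas as pd

# Hypothetical, partial mapping of MTG brace symbols to words
SYMBOLS = {
    'T': 'tap', 'Q': 'untap',
    'W': 'white mana', 'U': 'blue mana', 'B': 'black mana',
    'R': 'red mana', 'G': 'green mana', 'C': 'colorless mana',
}

def clean_text(text):
    """Return card text with NaN replaced by '' and {X} symbols spelled out."""
    if pd.isna(text):
        return ''

    def spell_out(match):
        symbol = match.group(1)
        if symbol.isdigit():
            return ' {} generic mana '.format(symbol)
        # Unknown symbols are left as they are
        return ' {} '.format(SYMBOLS.get(symbol, match.group(0)))

    text = re.sub(r'\{([^}]+)\}', spell_out, text)
    text = re.sub(r'[ ]+', ' ', text)          # collapse doubled spaces
    text = re.sub(r' ([:.,;])', r'\1', text)   # no space before punctuation
    return text.strip()

sample = pd.Series(['{2}{G}: Draw a card.', '{T}: Add {G}.', None])
print(sample.apply(clean_text).tolist())
```

In the notebook this could be applied as `cards_df['text'].apply(clean_text)` before the frame is saved.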

### Handle token cards later

In [53]:
tokens_df = pd.read_csv('../data/mtgjson/tokens.csv')

print(tokens_df.shape)
print('{} MB'.format(round(tokens_df.memory_usage().sum()/1000000, 2)))
tokens_df.head(1)

(1704, 45)
0.61 MB


Unnamed: 0,index,id,artist,asciiName,availability,borderColor,colorIdentity,colors,edhrecRank,faceName,...,side,subtypes,supertypes,tcgplayerProductId,text,toughness,type,types,uuid,watermark
0,0,1,Jim Pavelec,,paper,black,R,R,,,...,,Dragon,,78608.0,Flying,5,Token Creature — Dragon,"Token,Creature",7decf258-eb10-50da-83f7-c7eba74adbfb,


In [3]:
print(cards_df.shape)
cards_df.head(2)

(55943, 74)


Unnamed: 0,id,artist,asciiName,availability,borderColor,cardKingdomFoilId,cardKingdomId,colorIdentity,colorIndicator,colors,...,subtypes,supertypes,tcgplayerProductId,text,toughness,type,types,uuid,variations,watermark
0,1,Rebecca Guay,,"mtgo,paper",black,123335.0,122967.0,G,,G,...,,,15023.0,"If you would draw a card, you may instead choo...",,Enchantment,Enchantment,38513fa0-ea83-5642-8ecd-4f0b3daa6768,,
1,2,Stephen Daniele,,"mtgo,paper",black,123149.0,122781.0,U,,U,...,"Human,Wizard",,15024.0,When Academy Researchers enters the battlefiel...,2.0,Creature — Human Wizard,Creature,b8a68840-4044-52c0-a14e-0a1c630ba42c,,


In [55]:
cards_df.columns

Index(['id', 'artist', 'asciiName', 'availability', 'borderColor',
       'cardKingdomFoilId', 'cardKingdomId', 'colorIdentity', 'colorIndicator',
       'colors', 'convertedManaCost', 'duelDeck', 'edhrecRank',
       'faceConvertedManaCost', 'faceName', 'flavorName', 'flavorText',
       'frameEffects', 'frameVersion', 'hand', 'hasAlternativeDeckLimit',
       'isFullArt', 'isOnlineOnly', 'isOversized', 'isPromo', 'isReprint',
       'isReserved', 'isStarter', 'isStorySpotlight', 'isTextless',
       'isTimeshifted', 'keywords', 'layout', 'leadershipSkills', 'life',
       'loyalty', 'manaCost', 'mcmId', 'mcmMetaId', 'mtgArenaId',
       'mtgjsonV4Id', 'mtgoFoilId', 'mtgoId', 'multiverseId', 'name', 'number',
       'originalReleaseDate', 'originalText', 'originalType', 'otherFaceIds',
       'power', 'printings', 'promoTypes', 'purchaseUrls', 'rarity',
       'scryfallId', 'scryfallIllustrationId', 'scryfallOracleId', 'setCode',
       'side', 'subtypes', 'supertypes', 'tcgplayerProd

In [7]:
cards_df.type.nunique()

1971

In [8]:
cards_df.setCode.nunique()

531

In [9]:
cards_df.memory_usage().sum()/1000000

33.118384

In [56]:
for col in cards_df.columns:
    print(col + ': ' + str(cards_df[col][0]) + '\n')

id: 1

artist: Rebecca Guay

asciiName: nan

availability: mtgo,paper

borderColor: black

cardKingdomFoilId: 123335.0

cardKingdomId: 122967.0

colorIdentity: G

colorIndicator: nan

colors: G

convertedManaCost: 4.0

duelDeck: nan

edhrecRank: 1111.0

faceConvertedManaCost: nan

faceName: nan

flavorName: nan

flavorText: nan

frameEffects: nan

frameVersion: 2003

hand: nan

hasAlternativeDeckLimit: 0


hasFoil: 1

hasNonFoil: 1

isAlternative: 0

isFullArt: 0

isOnlineOnly: 0

isOversized: 0

isPromo: 0

isReprint: 1

isReserved: 0

isStarter: 0

isStorySpotlight: 0

isTextless: 0

isTimeshifted: 0

keywords: nan

layout: normal

leadershipSkills: nan

life: nan

loyalty: nan

manaCost: {2}{G}{G}

mcmId: 16413.0

mcmMetaId: 19.0

mtgArenaId: nan

mtgjsonV4Id: 1669af17-d287-5094-b005-4b143441442f

mtgoFoilId: 27283.0

mtgoId: 27282.0

multiverseId: 130483.0

name: Abundance

number: 249

originalReleaseDate: nan

originalText: If you would draw a card, you may instead choose land or

# Download and Store USE Model from TFHub

In [5]:
import tensorflow_hub as hub

module_url = 'https://tfhub.dev/google/universal-sentence-encoder-large/5'
model = hub.load(module_url)
print('module {} loaded'.format(module_url))

module https://tfhub.dev/google/universal-sentence-encoder-large/5 loaded


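## Save the downloaded USE-Large model

In [2]:
import tensorflow as tf

# Requires `model` from the download cell above; in a fresh kernel this raises
# NameError: name 'model' is not defined
# The "1" version subdirectory matches the load path and tar step used below.
tf.saved_model.save(model, '../models/use-large/1')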

## Load USE-Large from local disk

In [30]:
import tensorflow as tf

In [31]:
use_embed = tf.saved_model.load('../models/use-large/1')

In [27]:
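# Upload each model file to S3. Assumes `model_files` holds pathlib.Path objects
# found under ../models and `s3` is a boto3 S3 client (both defined elsewhere).
for local_path in model_files:
    # Drop the leading '..' and 'models' parts; works with either path separator
    s3_key = '/'.join(local_path.parts[2:])
    # Skip directories; only files can be uploaded
    if local_path.is_file():
        s3.upload_file(str(local_path), 'magicml-models.dev', s3_key)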

***
# Get USE Embeddings

In [70]:
corr = np.inner(embeddings, embeddings)
print(corr.shape)

(5, 5)
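
`np.inner(embeddings, embeddings)` yields cosine similarities only when every row has unit length. USE vectors are reported to be approximately normalized, but this notebook does not check that, so the sketch below normalizes explicitly. The random matrix stands in for real embeddings (512 is the USE output width), and `cosine_similarity_matrix` is an illustrative name.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Row-normalize, then take all pairwise inner products."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero rows
    return np.inner(unit, unit)

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 512))  # stand-in for 5 card embeddings

corr = cosine_similarity_matrix(fake_embeddings)
print(corr.shape)
```

If the real embeddings turn out to be unit length already, the normalization is a no-op and the result matches the plain `np.inner` call.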


In [72]:
import plotly.express as px

In [73]:
card_df = pd.DataFrame(corr, columns=arena_name, index=arena_name)
card_df.head()

NameError: name 'arena_name' is not defined

In [107]:
test_card = 'Golos,_Tireless_Pilgrim'
test_card

'Golos,_Tireless_Pilgrim'

In [106]:
[card for card in card_df.columns if card.startswith('Golos')]

['Golos,_Tireless_Pilgrim']

In [108]:
card_df[[test_card]].sort_values(by=test_card, ascending=False)

Unnamed: 0,"Golos,_Tireless_Pilgrim"
3059,1.000000
1687,0.805584
1583,0.801413
2606,0.798853
835,0.796900
...,...
5170,-0.020036
3215,-0.020036
5212,-0.041830
4441,-0.041830
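
Sorting one column of the similarity matrix, as above, gives a full ranking, while the API only needs the top K. A small sketch with a toy 4x4 matrix and made-up card names: `top_k_similar` is a hypothetical helper, and it assumes the frame is indexed by card name on both axes (the frame above shows an integer row index instead).

```python
import pandas as pd

def top_k_similar(sim_df, card, k=3):
    """Return the k highest-scoring cards for `card`, excluding the card itself."""
    scores = sim_df[card].drop(labels=card, errors='ignore')
    return scores.nlargest(k)

names = ['A', 'B', 'C', 'D']
sim_df = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.5],
     [0.8, 1.0, 0.3, 0.2],
     [0.1, 0.3, 1.0, 0.7],
     [0.5, 0.2, 0.7, 1.0]],
    index=names, columns=names,
)

print(top_k_similar(sim_df, 'A', k=2))
```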


In [109]:
test_card = test_card.replace('_',' ')
arena_df.query('name == @test_card').text.values

array(['When Golos, Tireless Pilgrim enters the battlefield, you may search your library for a land card, put that card onto the battlefield tapped, then shuffle your library.\n{2}{W}{U}{B}{R}{G}: Exile the top three cards of your library. You may play them this turn without paying their mana costs.'],
      dtype=object)

In [112]:
test_name = arena_name[1583].replace('_',' ')
test_name

'Emergent Ultimatum'

In [113]:
arena_df.query('name == @test_name').text.values

array(['Search your library for up to three monocolored cards with different names and exile them. An opponent chooses one of those cards. Shuffle that card into your library. You may cast the other cards without paying their mana costs. Exile Emergent Ultimatum.'],
      dtype=object)

# Sagemaker Processing Workflow

In [6]:
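import os
import pathlib
import boto3
from boto3.session import Session
from boto3.exceptions import ResourceNotExistsError

def aws_connect(service, profile='default', session=False):
    """Connect to an AWS service using a named credentials profile.

    Returns (resource, client) when the service has a boto3 resource API,
    otherwise just the client. With session=True the Session is appended.
    """
    sess = Session(profile_name=profile)

    try:
        resource = sess.resource(service)
    except ResourceNotExistsError:
        # Services such as SageMaker only expose a low-level client
        client = sess.client(service)
        return (client, sess) if session else client

    client = resource.meta.client
    return (resource, client, sess) if session else (resource, client)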

In [7]:
_, s3, boto_sess = aws_connect('s3', 'lw2134', session=True)

In [None]:
# Package the SavedModel version directory ("1") as model.tar.gz.
# Run from the models/use-large directory in a shell:
#   tar -czvf model.tar.gz 1
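
The same archive can be built from Python with the standard-library `tarfile` module, which avoids depending on the shell's working directory. The sketch below runs against a throwaway temp directory rather than the real `../models/use-large`; `pack_model` and `unpack_model` are illustrative names. Unpacking is included on the assumption that a Processing job copies `model.tar.gz` to the local path as-is, in which case the processing script would have to extract it itself.

```python
import pathlib
import tarfile
import tempfile

def pack_model(model_dir, archive_path):
    """Create a gzip tarball holding `model_dir` under its own directory name."""
    model_dir = pathlib.Path(model_dir)
    with tarfile.open(archive_path, 'w:gz') as tar:
        tar.add(model_dir, arcname=model_dir.name)

def unpack_model(archive_path, dest_dir):
    """Extract the tarball into `dest_dir`, creating it if needed."""
    pathlib.Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(dest_dir)

# Demo on a temp directory shaped like models/use-large/1
work = pathlib.Path(tempfile.mkdtemp())
version_dir = work / 'use-large' / '1'
version_dir.mkdir(parents=True)
(version_dir / 'saved_model.pb').write_bytes(b'placeholder')

archive = work / 'model.tar.gz'
pack_model(version_dir, archive)
unpack_model(archive, work / 'extracted')
print(sorted(p.name for p in (work / 'extracted' / '1').iterdir()))
```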

## SM Processing Job

In [36]:
import os
import boto3
#import sagemaker
#from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

In [8]:
_, s3, boto_sess = aws_connect('s3', 'lw2134', session=True)

In [9]:
#sagemaker_session = sagemaker.Session(boto_session=boto_sess)
role = 'arn:aws:iam::553371509391:role/magicml-sagemaker'
image_uri = '763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.3.1-cpu-py37-ubuntu18.04'

model_bucket = 'magicml-models.dev'
model_prefix = 'use-large'
model_data = 's3://{}/{}/model.tar.gz'.format(model_bucket, model_prefix)

input_bucket = 'magicml-clean-data.dev'
input_prefix = 'cards'
input_data = 's3://{}/{}/cards.csv'.format(input_bucket, input_prefix)

src_bucket = 'magicml-src.dev'
src_prefix = 'sm_processing'
src_code = 's3://{}/{}/process_embeddings.py'.format(src_bucket, src_prefix)

output_bucket = 'magicml-inference.dev'
output_prefix = 'use-large'
output_data = 's3://{}/{}'.format(output_bucket, output_prefix)

In [11]:
# Reuse the profile-based session from aws_connect so credentials match the S3 client
sm_client = boto_sess.client('sagemaker')

In [10]:
s3.upload_file(
    '../services/similarity/src/process_embeddings.py',
    src_bucket,
    '{}/process_embeddings.py'.format(src_prefix)
)

In [13]:
sm_client.create_processing_job(
    ProcessingJobName='use-large-embeddings',
    RoleArn=role,
    StoppingCondition={
        'MaxRuntimeInSeconds': 7200
    },
    AppSpecification={
        'ImageUri': image_uri,
        'ContainerEntrypoint': [
            'python3',
            '-v',
            '/opt/ml/processing/code/process_embeddings.py'
        ]
    },
    ProcessingResources={
        'ClusterConfig': {
            'InstanceCount': 1,
            'InstanceType': 'ml.m5.2xlarge',
            'VolumeSizeInGB': 30
        }
    },
    ProcessingInputs=[
        {
            'InputName': 'model',
            'AppManaged': True,
            'S3Input': {
                'S3Uri': model_data,
                'LocalPath': '/opt/ml/processing/model',
                'S3DataType': 'S3Prefix',
                'S3InputMode': 'File',
                'S3DataDistributionType': 'FullyReplicated'
            }
        },
        {
            'InputName': 'cards',
            'AppManaged': True,
            'S3Input': {
                'S3Uri': input_data,
                'LocalPath': '/opt/ml/processing/input',
                'S3DataType': 'S3Prefix',
                'S3InputMode': 'File',
                'S3DataDistributionType': 'FullyReplicated',
            }
        },
        {
            'InputName': 'code',
            'AppManaged': True,
            'S3Input': {
                'S3Uri': src_code,
                'LocalPath': '/opt/ml/processing/code',
                'S3DataType': 'S3Prefix',
                'S3InputMode': 'File',
                'S3DataDistributionType': 'FullyReplicated'
            }
        }
    ],
    ProcessingOutputConfig={
        'Outputs': [
            {
                'OutputName': 'embeddings',
                'AppManaged': True,
                'S3Output': {
                    'S3Uri': output_data,
                    'LocalPath': '/opt/ml/processing/output',
                    'S3UploadMode': 'EndOfJob'
                }
            }
        ]
    }
)

SyntaxError: keyword argument repeated (<ipython-input-13-544c28de7600>, line 35)

In [38]:
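import sagemaker
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# Build the SageMaker session on top of the profile-based boto3 session
sagemaker_session = sagemaker.Session(boto_session=boto_sess)

tf_processor = ScriptProcessor(
    sagemaker_session=sagemaker_session,
    role=role,
    image_uri=image_uri,
    instance_type="ml.m5.2xlarge",
    instance_count=1,
    command=['python3', '-v'],
    max_runtime_in_seconds=7200,
    base_job_name='use-large-embeddings'
)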

In [39]:
tf_processor.run(
    code='../services/similarity/process_embeddings.py',
    inputs=[
        ProcessingInput(
            input_name='model',
            source=model_data,
            destination='/opt/ml/processing/model'
        ),
        ProcessingInput(
            input_name='cards',
            source=input_data,
            destination='/opt/ml/processing/input'
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name='embeddings',
            source='/opt/ml/processing/output',
            destination=output_data
        )
    ],
    wait=False
)


Job Name:  use-large-embeddings-2020-12-22-21-09-33-238
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://magicml-models.dev/use-large/model.tar.gz', 'LocalPath': '/opt/ml/processing/model', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'input-2', 'S3Input': {'S3Uri': 's3://magicml-clean-data.dev/cards/cards.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-553371509391/use-large-embeddings-2020-12-22-21-09-33-238/input/code/process_embeddings.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'S3Output': {'S3Uri': 's3://magicml-infer
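
Because `run(..., wait=False)` returns immediately, something has to poll for completion before the outputs are read from S3. A minimal sketch using the boto3 SageMaker client's `describe_processing_job`, whose response carries a `ProcessingJobStatus` field. `wait_for_processing_job` is a hypothetical helper, and `FakeClient` exists only so the loop can be exercised without AWS credentials.

```python
import time

TERMINAL_STATES = {'Completed', 'Failed', 'Stopped'}

def wait_for_processing_job(sm_client, job_name, poll_seconds=30, max_polls=240):
    """Poll until the job reaches a terminal state; return that state."""
    for _ in range(max_polls):
        response = sm_client.describe_processing_job(ProcessingJobName=job_name)
        status = response['ProcessingJobStatus']
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError('{} still running after {} polls'.format(job_name, max_polls))

class FakeClient:
    """Stand-in for a SageMaker client; replays a fixed status sequence."""
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_processing_job(self, ProcessingJobName):
        return {'ProcessingJobStatus': next(self._statuses)}

fake = FakeClient(['InProgress', 'InProgress', 'Completed'])
print(wait_for_processing_job(fake, 'use-large-embeddings', poll_seconds=0))
```

With the real client, the job name printed above (`use-large-embeddings-2020-12-22-21-09-33-238`) would be passed as `job_name`.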

# Stage in Lambda - Sort, Merge, and Save in EFS

In [41]:
s3.download_file('magicml-inference.dev', 'use-large/arena_embeddings.csv', 'embeddings.csv')

In [139]:
pred_df = pd.read_csv('embeddings.csv')\
    .rename(columns={'Unnamed: 0': 'Names'})

print(pred_df.shape)
pred_df.head(3)

(5419, 5420)


Unnamed: 0,Names,Archon_of_Sun's_Grace-AJMP,Audacious_Thief-AJMP,Banishing_Light-AJMP,Bond_of_Revival-AJMP,Carnifex_Demon-AJMP,Doomed_Necromancer-AJMP,Dryad_Greenseeker-AJMP,Fanatic_of_Mogis-AJMP,"Gadwick,_the_Wizened-AJMP",...,Veteran_Adventurer-ZNR,Vine_Gecko-ZNR,Wayward_Guide-Beast-ZNR,Windrider_Wizard-ZNR,"Yasharn,_Implacable_Earth-ZNR","Zagras,_Thief_of_Heartbeats-ZNR","Zareth_San,_the_Trickster-ZNR",Zof_Consumption_//_Zof_Bloodbog-ZNR,Zof_Consumption_//_Zof_Bloodbog-ZNR.1,Zulaport_Duelist-ZNR
0,Archon_of_Sun's_Grace-AJMP,1.0,0.388027,0.540133,0.626441,0.514713,0.497357,0.414993,0.587006,0.667408,...,0.372712,0.545192,0.698247,0.554204,0.654834,0.693039,0.552945,0.199106,0.36539,0.57649
1,Audacious_Thief-AJMP,0.388027,1.0,0.364457,0.447599,0.297466,0.377156,0.339213,0.379261,0.452389,...,0.386959,0.421869,0.409473,0.473218,0.464308,0.39208,0.448119,0.156082,0.608094,0.399204
2,Banishing_Light-AJMP,0.540133,0.364457,1.0,0.583114,0.45885,0.504666,0.293982,0.581624,0.707415,...,0.250604,0.421688,0.53511,0.520927,0.534659,0.525378,0.570414,0.242288,0.342716,0.529742


In [123]:
merge_cols = [
    'Names','id','mtgArenaId','scryfallId','name','colorIdentity','colors','setName',
    'convertedManaCost','manaCost','life','loyalty','power','toughness',
    'type','types','subtypes','supertypes','text','purchaseUrls',
    'brawl','commander','duel','future','historic','legacy','modern',
    'oldschool','pauper','penny','pioneer','standard','vintage'
]

In [140]:
cards_df = pd.read_csv('cards.csv')\
    .query('mtgArenaId.notnull()')\
    .assign(Names=lambda df: (df['name'] + '-' + df['setCode']).str.replace(' ', '_', regex=False))\
    [merge_cols]

print(cards_df.shape)
cards_df.head(3)

(5419, 33)


Unnamed: 0,Names,id,mtgArenaId,scryfallId,name,colorIdentity,colors,setName,convertedManaCost,manaCost,...,future,historic,legacy,modern,oldschool,pauper,penny,pioneer,standard,vintage
4753,Archon_of_Sun's_Grace-AJMP,4754,74983.0,94f05268-0d4f-4638-aec3-a85fc339e3a7,Archon of Sun's Grace,W,W,Jumpstart Arena Exclusives,4.0,{2}{W}{W},...,Legal,Legal,Legal,Legal,Blank,Blank,Blank,Legal,Legal,Legal
4754,Audacious_Thief-AJMP,4755,74991.0,ba315deb-d5a9-4013-b6ef-e4efe652e569,Audacious Thief,B,B,Jumpstart Arena Exclusives,3.0,{2}{B},...,Blank,Legal,Legal,Legal,Blank,Legal,Blank,Legal,Blank,Legal
4755,Banishing_Light-AJMP,4756,74986.0,ca112bae-6ac5-4cdf-9e8c-1b99f7396995,Banishing Light,W,W,Jumpstart Arena Exclusives,3.0,{2}{W},...,Legal,Legal,Legal,Legal,Blank,Blank,Legal,Legal,Legal,Legal


In [137]:
cards_df.id.nunique()

5419

In [141]:
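os.makedirs('sorted', exist_ok=True)

# Only the first two cards for now. Names containing '/' (split cards such as
# 'Zof_Consumption_//_Zof_Bloodbog-ZNR') need sanitizing before use as file names.
for card in pred_df.columns[1:3]:
    pred_df[['Names', card]]\
        .merge(cards_df, how='left', on='Names')\
        .sort_values(by=card, ascending=False)\
        .to_csv('sorted/{}.csv'.format(card), index=False)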

In [142]:
a_card = pd.read_csv('sorted/Banishing_Light-AJMP.csv')
print(a_card.shape)

(7639, 33)
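
The sorted file has 7,639 rows although `pred_df` has 5,419, which is what a left merge produces when join keys repeat in the right-hand frame. The `.1` suffix on `Zof_Consumption_//_Zof_Bloodbog-ZNR.1` above hints that `Names` is not unique. Below is a toy sketch of the effect and of two ways to surface it: `duplicated` to list repeated keys, and the `validate` argument of `merge`, which raises `MergeError` when the stated relationship does not hold. The frames and values are made up.

```python
import pandas as pd

scores = pd.DataFrame({'Names': ['X-SET', 'Y-SET', 'Y-SET'], 'sim': [1.0, 0.4, 0.4]})
cards = pd.DataFrame({'Names': ['X-SET', 'Y-SET', 'Y-SET'], 'id': [1, 2, 3]})

# Each left row pairs with every right row that shares its key
merged = scores.merge(cards, how='left', on='Names')
print(len(scores), '->', len(merged))

# List the keys responsible
dupes = cards.loc[cards['Names'].duplicated(keep=False), 'Names'].unique()
print(dupes)

# Ask pandas to enforce uniqueness on the right-hand side
try:
    scores.merge(cards, how='left', on='Names', validate='many_to_one')
    validated = True
except pd.errors.MergeError:
    validated = False
print(validated)
```

Joining on a key that is unique per printing, such as `id` or `uuid`, would likely avoid the inflation, at the cost of carrying that key through the embeddings file.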


In [149]:
a_card.head(3)

Unnamed: 0,Banishing_Light-AJMP,id,mtgArenaId,scryfallId,name,colorIdentity,colors,setName,convertedManaCost,manaCost,...,future,historic,legacy,modern,oldschool,pauper,penny,pioneer,standard,vintage
0,1.0,4756,74986.0,ca112bae-6ac5-4cdf-9e8c-1b99f7396995,Banishing Light,W,W,Jumpstart Arena Exclusives,3.0,{2}{W},...,Legal,Legal,Legal,Legal,Blank,Blank,Legal,Legal,Legal,Legal
1,1.0,49136,70515.0,a1ddd113-140f-49c9-b45c-cf1b0d1dffd8,Banishing Light,W,W,Theros Beyond Death,3.0,{2}{W},...,Legal,Legal,Legal,Legal,Blank,Blank,Legal,Legal,Legal,Legal
2,0.773027,28775,67708.0,197743cd-249c-42ba-ac8d-027c088f8418,Hieromancer's Cage,W,W,Core Set 2019,4.0,{3}{W},...,Blank,Legal,Legal,Legal,Blank,Blank,Blank,Legal,Blank,Legal
