# Preface
The goal is to build a recommendation system for Magic: The Gathering (MTG) players.

~~~
import requests

response = requests.get('https://api.scryfall.com/bulk-data')  # lists every bulk-data download endpoint
~~~
~~~
import json

def jprint(obj):
    # Pretty-print a JSON-serializable object with sorted keys
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

jprint(response.json())
~~~
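As a rough sketch of how a download link can be pulled out of that response, the helper below scans the `data` list for an entry with a given `type`. The name `find_download_uri` and the trimmed `sample_payload` are illustrative assumptions; the payload only mirrors fields that appear in the `bulk-data` response, so the snippet runs without a network call.

```python
from typing import Optional


def find_download_uri(payload: dict, bulk_type: str) -> Optional[str]:
    """Return the download_uri of the first entry whose 'type' matches, else None."""
    for entry in payload.get("data", []):
        if entry.get("type") == bulk_type:
            return entry.get("download_uri")
    return None


# Hypothetical, trimmed-down payload; a real response carries more fields and entries.
sample_payload = {
    "data": [
        {
            "name": "Oracle Cards",
            "type": "oracle_cards",
            "download_uri": "https://c2.scryfall.com/file/scryfall-bulk/oracle-cards/oracle-cards-20220421090147.json",
        }
    ]
}

oracle_uri = find_download_uri(sample_payload, "oracle_cards")
missing_uri = find_download_uri(sample_payload, "no_such_type")  # no match, so None
print(oracle_uri)
```

Returning `None` for an unknown type lets the caller decide how to handle it, rather than failing inside the lookup.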

# Observations
* Brought in both the oracle and card datasets. On inspection, the oracle dataset is the better choice: it has fewer repeated cards, is easier to work with, and has unique identifiers.
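One way to check the "fewer repeated cards" observation is to count how often each `oracle_id` repeats. The sketch below uses a made-up miniature frame (the ids, names, and sets are placeholders, not real Scryfall rows), and `duplicate_summary` is a hypothetical helper name.

```python
import pandas as pd

# Placeholder data: a per-printing dataset repeats oracle_id once per printing.
card_printings = pd.DataFrame({
    "oracle_id": ["a1", "a1", "b2", "c3", "c3", "c3"],
    "name": ["Card A", "Card A", "Card B", "Card C", "Card C", "Card C"],
    "set_name": ["Set 1", "Set 2", "Set 1", "Set 1", "Set 2", "Set 3"],
})

# Keeping one row per oracle_id mimics the shape of an oracle-style dataset.
oracle_like = card_printings.drop_duplicates(subset="oracle_id")


def duplicate_summary(df: pd.DataFrame, key: str = "oracle_id") -> dict:
    """Count rows, distinct keys, and rows that repeat an earlier key."""
    return {
        "rows": len(df),
        "unique": int(df[key].nunique()),
        "repeated_rows": int(df[key].duplicated().sum()),
    }


printing_summary = duplicate_summary(card_printings)
oracle_summary = duplicate_summary(oracle_like)
print(printing_summary)
print(oracle_summary)
```

When `repeated_rows` is zero, the key can safely be used as an index.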

In [52]:
response = requests.get('https://api.scryfall.com/bulk-data')
import json
def jprint(obj):
    text = json.dumps(obj, sort_keys = True, indent = 4)
    print(text)
    
jprint(response.json())

{
    "data": [
        {
            "compressed_size": 13763246,
            "content_encoding": "gzip",
            "content_type": "application/json",
            "description": "A JSON file containing one Scryfall card object for each Oracle ID on Scryfall. The chosen sets for the cards are an attempt to return the most up-to-date recognizable version of the card.",
            "download_uri": "https://c2.scryfall.com/file/scryfall-bulk/oracle-cards/oracle-cards-20220421090147.json",
            "id": "27bf3214-1271-490b-bdfe-c0be6c23d02e",
            "name": "Oracle Cards",
            "object": "bulk_data",
            "type": "oracle_cards",
            "updated_at": "2022-04-21T09:01:47.636+00:00",
            "uri": "https://api.scryfall.com/bulk-data/27bf3214-1271-490b-bdfe-c0be6c23d02e"
        },
        {
            "compressed_size": 17472621,
            "content_encoding": "gzip",
            "content_type": "application/json",
            "description": "A JSON file

Percentage of missing values = (number of missing values / total number of observations) * 100
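A minimal sketch of this calculation in pandas, assuming a small made-up frame (`toy` and `missing_percentage` are illustrative names). The 35% cutoff matches the threshold used in `drop_cols` later in this notebook.

```python
import numpy as np
import pandas as pd


def missing_percentage(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column: missing / total observations * 100."""
    return df.isna().sum() / len(df) * 100


# Made-up example data, not real card rows.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "flavor_text": [np.nan, np.nan, "some text", np.nan],
    "power": ["1", np.nan, "2", "3"],
})

pct = missing_percentage(toy)

# Columns above the 35% cutoff are candidates for dropping.
high_missing = pct[pct > 35].index.tolist()
print(pct)
print(high_missing)
```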


* Dropped for modeling:
    * mtgo_foil_id: not needed.
    * flavor_text: not needed. We can add a note on where to find all flavor text, or import it into a database later for search recommendations.
    * security_stamp: not needed.
    * preview: not needed for modeling purposes.
    * arena_id: could be useful for an online recommendation system.
    * watermark: not needed.
    * produced_mana: can be removed. If it is kept, NaN values need to be replaced with "Not Applicable".
    * all_parts: could be used as a target for a combo model.
    * object: not needed. Every object is a card-type object.
    * lang: contains only 10 Japanese cards out of 26000. Removed column.
    * type: can help fix the color column. We can use the type to tell whether a card is an artifact, then create a colorless condition.
    * mana_cost: could be fixed. A few cards were duplicates, and those were dropped. The remaining problem rows are double-faced cards, where each face has a different cost. We could fill these with cmc values or look at the cards individually. For now, dropped column.
* Fixed:
    * power: set NA to 0. A missing value indicates a card that has no power value (e.g. enchantments, sorceries, instants).
    * toughness: set NA to 0. A missing value indicates a card that has no toughness value (e.g. enchantments, sorceries, instants).
    * edhrec_rank: filled NA values with incrementing ranks that start at the maximum existing rank + 1, so each unranked card receives the next unused rank. This does not reflect the true ranking of those cards, but it still lets us make sense of the cards that do have a rank.
    * mana_cost: stripped the curly braces and separated the values with a /.
    * keywords: removed from list and replaced NA values with None, so the program understands that no keywords are associated with that card.
    * colors: removed from list and filled NA values with C for colorless, meaning that the card is cast with colorless mana only.
    * color_identity: removed from list and filled NA values with C for colorless, meaning that the card is cast with colorless mana only.
    * oracle_id: identified as a unique ID. Set as the index to provide a unique key.
    * cmc: a single card was listed with a mana value of 0.5. Replaced it with 1, then converted the column to integer.
    * oracle_text: dropped NA values. Not only were these values missing, but most were improper and would lead to skewed results.
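The `edhrec_rank` and `cmc` fixes described above can be sketched on toy data as follows. This is an illustration under stated assumptions rather than the notebook's exact code: `fill_missing_ranks` is a hypothetical helper, the values are made up, and it assumes at least one rank is present so that a maximum exists.

```python
import numpy as np
import pandas as pd


def fill_missing_ranks(ranks: pd.Series) -> pd.Series:
    """Give each missing rank the next unused integer, starting at max rank + 1."""
    filled = ranks.copy()
    missing = filled.isna()
    start = int(filled.max()) + 1  # assumes at least one non-missing rank
    filled[missing] = np.arange(start, start + missing.sum())
    return filled.astype("int64")


# Made-up ranks: two cards have no rank yet.
ranks = pd.Series([2.0, np.nan, 5.0, np.nan])
filled_ranks = fill_missing_ranks(ranks)

# Made-up mana values: replace the lone 0.5 with 1 before converting to integer.
cmc = pd.Series([3.0, 0.5, 2.0])
fixed_cmc = cmc.mask(cmc == 0.5, 1).astype("int64")

print(filled_ranks.tolist())
print(fixed_cmc.tolist())
```

Deriving the starting rank from the data avoids a hard-coded counter that goes stale when the bulk file is refreshed.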
    

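To illustrate the `mana_cost` cleanup from the notes above, here is a small sketch that strips the curly braces and joins the symbols with a `/`. The helper name `format_mana_cost` is an assumption, and the inputs are simple brace-wrapped costs. Hybrid symbols such as `{W/U}` already contain a slash, so their output would be ambiguous and is not handled here.

```python
def format_mana_cost(raw: str) -> str:
    """Turn a brace-wrapped cost such as '{1}{U}' into '1/U'."""
    without_open = raw.replace("{", "")               # '{1}{U}' -> '1}U}'
    return without_open.replace("}", "/").strip("/")  # '1}U}'  -> '1/U'


costs = ["{3}", "{G}{G}", "{1}{U}", ""]
formatted = {cost: format_mana_cost(cost) for cost in costs}
print(formatted)
```

An empty cost stays empty, which is why the wrangling code falls back to `cmc` for those rows.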
In [45]:
import numpy as np
import pandas as pd
import re
import warnings
warnings.filterwarnings("ignore")
import requests

In [61]:
class Data_Scraping:
    @staticmethod
    def create_data_frame():
        # One row per bulk-data file listed by the Scryfall API
        response = requests.get('https://api.scryfall.com/bulk-data')
        j = response.json()
        df = pd.DataFrame(j['data'])
        return df

    @staticmethod
    def get_download_uri(bulk_type):
        # .iloc[0] selects by position. A plain [0] looks up the label 0,
        # which only exists when the matching row happens to be the first one.
        df = Data_Scraping.create_data_frame()
        filepath = df.loc[df['type'] == bulk_type, 'download_uri'].iloc[0]
        return filepath

    @staticmethod
    def scrape_oracle_uri():
        return Data_Scraping.get_download_uri('oracle_cards')

    @staticmethod
    def scrape_default_uri():
        return Data_Scraping.get_download_uri('default_cards')

    @staticmethod
    def get_all_cards():
        return Data_Scraping.get_download_uri('all_cards')

    @staticmethod
    def get_artwork():
        return Data_Scraping.get_download_uri('unique_artwork')
        
class Data_Handling(Data_Scraping):
    def __init__(self, filepath=None):
        # Resolve the default lazily so no HTTP request runs when the class is defined
        if filepath is None:
            filepath = Data_Scraping.scrape_oracle_uri()
        self.filepath = filepath

    def wrangle_oracle_uri(self):
        if self.filepath.endswith(".json"):
            df = pd.read_json(self.filepath)
        else:
            df = pd.read_csv(self.filepath)

        # Fix NA values for edhrec_rank: number unranked cards from max rank + 1
        if 'edhrec_rank' in df.columns:
            missing = df['edhrec_rank'].isna()
            counter = int(df['edhrec_rank'].max()) + 1
            df.loc[missing, 'edhrec_rank'] = np.arange(counter, counter + missing.sum())
            df['edhrec_rank'] = df['edhrec_rank'].astype('int64')

        # Fix power column: cards without a power value get 0
        if 'power' in df.columns:
            df['power'] = df['power'].fillna(0)

        # Fix toughness column: cards without a toughness value get 0
        if 'toughness' in df.columns:
            df['toughness'] = df['toughness'].fillna(0)

        # Fix cmc column: replace the 0.5 value with 1, then convert to integer
        if 'cmc' in df.columns:
            df.loc[df['cmc'] == 0.5, 'cmc'] = 1
            df['cmc'] = df['cmc'].astype('int64')

        if 'oracle_id' in df.columns:
            df.set_index('oracle_id', inplace=True)

        # .str[0] keeps only the first element of each list; missing or empty becomes 'C'
        if 'colors' in df.columns:
            df['colors'] = df['colors'].str[0].fillna('C')

        if 'color_identity' in df.columns:
            df['color_identity'] = df['color_identity'].str[0].fillna('C')

        # Keeps only the first keyword; cards without keywords get the string 'None'
        if 'keywords' in df.columns:
            df['keywords'] = df['keywords'].str[0].fillna('None')

        if 'mana_cost' in df.columns:
            df['mana_cost'] = df['mana_cost'].fillna(df['cmc'].astype(str))

            # Strip the braces and separate symbols with '/': '{1}{U}' -> '1/U'
            df['mana_cost'] = (df['mana_cost'].astype(str)
                               .str.replace('{', '', regex=False)
                               .str.replace('}', '/', regex=False)
                               .str.strip('/'))
            # Fall back to cmc where the cost string is empty
            df['mana_cost'] = np.where(df['mana_cost'] == '', df['cmc'], df['mana_cost'])

        # if 'oracle_text' in df.columns:
            # df['oracle_text'].dropna(inplace=True)

        return df
    
    @staticmethod
    def drop_cols(df):
        # Drops all columns with more than 35% NA values.
        # Also drops mtgo_id, which has many NA values and is not needed for modeling.
        cols_to_drop = [col for col in df.columns if (df[col].isna().sum() / len(df) * 100) > 35]
        if 'mtgo_id' in df.columns and 'mtgo_id' not in cols_to_drop:
            cols_to_drop.append('mtgo_id')
        df.drop(columns=cols_to_drop, inplace=True)

        return df
    
    @staticmethod
    def modeling_prep_mtg_oracle(df):
        # Drop columns for modeling purposes
        drop_cols = ['id', 'multiverse_ids', 'tcgplayer_id', 'cardmarket_id', 'lang', 'object',
                     'released_at', 'uri', 'scryfall_uri', 'layout', 'highres_image', 'image_status',
                     'image_uris', 'games', 'frame', 'full_art', 'textless', 'booster', 'story_spotlight', 'prices',
                     'legalities', 'reserved', 'foil', 'nonfoil', 'card_back_id', 'artist', 'artist_ids', 'illustration_id',
                     'border_color', 'oversized', 'finishes', 'scryfall_set_uri', 'rulings_uri', 'promo', 'set', 'set_uri', 'set_search_uri',
                     'reprint', 'variation', 'set_id', 'prints_search_uri', 'collector_number', 'digital']
        # Set related_uris aside before dropping it; None if the column is absent
        target_cards = None
        if 'related_uris' in df.columns:
            target_cards = df['related_uris']
            drop_cols.append('related_uris')
        # errors='ignore' skips any listed column that is not present
        df.drop(columns=drop_cols, errors='ignore', inplace=True)

        # Remove token creatures; na=False treats a missing type_line as no match
        if 'type_line' in df.columns:
            a = df[df['type_line'].str.contains('Token Creature', na=False)]
            df.drop(labels=a.index, inplace=True)

        # Remove Art Series cards
        if 'set_name' in df.columns:
            c = df[df['set_name'].str.contains('Art Series', na=False)]
            df.drop(labels=c.index, inplace=True)

        return df, target_cards

In [62]:
df = Data_Handling().wrangle_oracle_uri()
df, target_cards = Data_Handling.modeling_prep_mtg_oracle(df)
df = Data_Handling.drop_cols(df)
df.head()

| oracle_id | name | mana_cost | cmc | type_line | oracle_text | colors | color_identity | keywords | set_name | set_type | rarity | edhrec_rank | power | toughness |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0004ebd0-dfd6-4276-b4a6-de0003e94237 | Static Orb | 3 | 3 | Artifact | As long as Static Orb is untapped, players can... | C | C | | Seventh Edition | core | rare | 2643 | 0 | 0 |
| 0006faf6-7a61-426c-9034-579f2cfcfa83 | Sensory Deprivation | U | 1 | Enchantment — Aura | Enchant creature\nEnchanted creature gets -3/-0. | U | U | Enchant | Magic 2014 | core | common | 21524 | 0 | 0 |
| 0007c283-5b7a-4c00-9ca1-b455c8dff8c3 | Road of Return | G/G | 2 | Sorcery | Choose one —\n• Return target permanent card f... | G | G | Entwine | Commander 2019 | commander | rare | 4101 | 0 | 0 |
| 000d5588-5a4c-434e-988d-396632ade42c | Storm Crow | 1/U | 2 | Creature — Bird | Flying (This creature can't be blocked except ... | U | U | Flying | Ninth Edition | core | common | 12416 | 1 | 2 |
| 000e5d65-96c3-498b-bd01-72b1a1991850 | Walking Sponge | 1/U | 2 | Creature — Sponge | {T}: Target creature loses your choice of flyi... | U | U | | Urza's Legacy | expansion | uncommon | 18960 | 1 | 1 |


In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24469 entries, 0004ebd0-dfd6-4276-b4a6-de0003e94237 to ffff90c3-63c4-4dee-a21d-6b2b113f4f80
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            24469 non-null  object
 1   mana_cost       24469 non-null  object
 2   cmc             24469 non-null  int64 
 3   type_line       24469 non-null  object
 4   oracle_text     24005 non-null  object
 5   colors          24469 non-null  object
 6   color_identity  24469 non-null  object
 7   keywords        24469 non-null  object
 8   set_name        24469 non-null  object
 9   set_type        24469 non-null  object
 10  rarity          24469 non-null  object
 11  edhrec_rank     24469 non-null  int64 
 12  power           24469 non-null  object
 13  toughness       24469 non-null  object
dtypes: int64(2), object(12)
memory usage: 2.8+ MB
