# Magic DB
## Data Engineering Capstone Project

### Project Summary

Magic: The Gathering (MTG) is a popular card game from the 1990s with a solid fan base and an active card-trading community. The game's competitiveness and complexity attract fans from all around the world, creating high demand for cards in the market.

The MTG card trading market is complex given the number of players, the geographical spread of the game and the professional scene. Even a new deck built by the fan base can make the price of a single card skyrocket within a day. In addition, official cards are released seasonally, and many game stores buy and sell MTG products to casual and professional players, which makes card prices volatile.

In the following sections we will explore the data and explain the steps taken.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

> **NOTE:**
This is an unofficial project for academic purposes only and should not be used for monetary gain. It is not funded or endorsed by any company.

In [1]:
# All imports
import pandas as pd
import gzip
import json
import requests

from collections.abc import Mapping
from operator import add

### Step 1: Scope the Project and Gather Data

#### Scope
This project proposes building a Magic database that holds card dimensional data and card prices, which change every day. The data is gathered and optimized so that a fictional Data Science team can use it to predict card prices over the following days.

To accomplish this, we created an Airflow pipeline that extracts the data from the public hosts [Scryfall](https://scryfall.com) and [MTGJson](https://mtgjson.com), creates the source datasets for the project, loads the datasets into staging tables in Redshift and finally populates the dimension and fact tables of a star schema, also in Redshift. The schema is designed to optimize card price queries.

#### Describe and Gather Data
In this project, we will collect data from two distinct sources:

- [Scryfall](https://scryfall.com): following the [guidelines](https://scryfall.com/docs/api) of the [MTG policy](https://company.wizards.com/en/legal/fancontentpolicy), this site provides an API to search MTG cards and their detailed information; even the card images can be requested. In this project, we programmatically request the [bulk data](https://scryfall.com/docs/api/bulk-data) containing all the cards.

- [MTGJson](https://mtgjson.com): for the card prices, we resort to MTGJson, which provides a download link to all the card prices that it collects from major stores in Europe and the United States. Its guidelines and licensing are available [here](https://github.com/mtgjson/mtgjson).

#### Scryfall
Let's first collect the data from Scryfall and have a look at it. To do this, we will request the bulk-data endpoint of the API:

In [2]:
response = requests.get("https://api.scryfall.com/bulk-data")
response.json()['data']

[{'object': 'bulk_data',
  'id': '27bf3214-1271-490b-bdfe-c0be6c23d02e',
  'type': 'oracle_cards',
  'updated_at': '2021-12-17T10:03:55.620+00:00',
  'uri': 'https://api.scryfall.com/bulk-data/27bf3214-1271-490b-bdfe-c0be6c23d02e',
  'name': 'Oracle Cards',
  'description': 'A JSON file containing one Scryfall card object for each Oracle ID on Scryfall. The chosen sets for the cards are an attempt to return the most up-to-date recognizable version of the card.',
  'compressed_size': 13179613,
  'download_uri': 'https://c2.scryfall.com/file/scryfall-bulk/oracle-cards/oracle-cards-20211217100355.json',
  'content_type': 'application/json',
  'content_encoding': 'gzip'},
 {'object': 'bulk_data',
  'id': '6bbcf976-6369-4401-88fc-3a9e4984c305',
  'type': 'unique_artwork',
  'updated_at': '2021-12-17T10:13:12.364+00:00',
  'uri': 'https://api.scryfall.com/bulk-data/6bbcf976-6369-4401-88fc-3a9e4984c305',
  'name': 'Unique Artwork',
  'description': 'A JSON file of Scryfall card objects that t

In [3]:
# Let's see all the data types:
[x['type'] for x in response.json()['data']]

['oracle_cards', 'unique_artwork', 'default_cards', 'all_cards', 'rulings']

In [4]:
# Now we keep only the 'all_cards' entry because it is the most complete dataset.
[x for x in response.json()['data'] if x['type'] == 'all_cards']

[{'object': 'bulk_data',
  'id': '922288cb-4bef-45e1-bb30-0c2bd3d3534f',
  'type': 'all_cards',
  'updated_at': '2021-12-17T10:11:57.749+00:00',
  'uri': 'https://api.scryfall.com/bulk-data/922288cb-4bef-45e1-bb30-0c2bd3d3534f',
  'name': 'All Cards',
  'description': 'A JSON file containing every card object on Scryfall in every language.',
  'compressed_size': 212613215,
  'download_uri': 'https://c2.scryfall.com/file/scryfall-bulk/all-cards/all-cards-20211217101157.json',
  'content_type': 'application/json',
  'content_encoding': 'gzip'}]
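
The selection above can also be wrapped in a small helper so that the dated `download_uri` does not have to be copied by hand. This is only a minimal sketch: `find_bulk_entry` and the toy listing are hypothetical names and values, shaped like the `data` field shown above.

```python
def find_bulk_entry(entries, bulk_type):
    """Return the first bulk-data entry whose 'type' matches, or raise."""
    for entry in entries:
        if entry.get("type") == bulk_type:
            return entry
    raise LookupError(f"no bulk-data entry of type {bulk_type!r}")


# Toy listing with the same keys as the real response (URLs are placeholders).
sample_entries = [
    {"type": "oracle_cards", "download_uri": "https://example.invalid/oracle.json"},
    {"type": "all_cards", "download_uri": "https://example.invalid/all.json"},
]

all_cards_uri = find_bulk_entry(sample_entries, "all_cards")["download_uri"]
print(all_cards_uri)
```

With the real response, `sample_entries` would be replaced by `response.json()['data']`.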

In [5]:
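# With the download_uri we can gather the JSON data.
# NOTE: this dated URL is a snapshot from an earlier listing; the current one is
# the 'download_uri' field of the 'all_cards' entry returned above.
scryfall_json = requests.get('https://c2.scryfall.com/file/scryfall-bulk/all-cards/all-cards-20211216221311.json')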
scryfall_json = scryfall_json.json()

In [6]:
# The data returned is a list with all the information per card.
print(type(scryfall_json))
print(len(scryfall_json))

<class 'list'>
349905


In [7]:
# Saving the data gathered in a gzip format
with gzip.open('scryfall_json.json.gz', 'wt') as file:
    json.dump(scryfall_json, file)
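
Before relying on the saved file, it can be worth confirming that it reads back unchanged. The sketch below shows the write and read round trip on a tiny made-up list in a temporary directory, so it does not touch `scryfall_json.json.gz`; the records and file name are illustrative only.

```python
import gzip
import json
import os
import tempfile

# Tiny stand-in for the card list (not real Scryfall records).
records = [{"id": "a", "name": "Fury Sliver"}, {"id": "b", "name": "Spirit"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json.gz")

    # 'wt' and 'rt' open the gzip stream in text mode, which json expects.
    with gzip.open(path, "wt", encoding="utf-8") as file:
        json.dump(records, file)

    with gzip.open(path, "rt", encoding="utf-8") as file:
        restored = json.load(file)

print(restored == records)
```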

Now that we have the MTG card information, let's collect the prices.

#### MTGJson
For this dataset, we have to make some transformations to meet our goals. First, let's collect two separate datasets from the MTGJson endpoint:

- `Prices`: https://mtgjson.com/api/v5/AllPrices.json
- `Prints`: https://mtgjson.com/api/v5/AllPrintings.json

In [None]:
# Download prices data
prices = requests.get("https://mtgjson.com/api/v5/AllPrices.json")
prices = prices.json()
prices = prices['data']

# Download prints data
prints = requests.get("https://mtgjson.com/api/v5/AllPrintings.json")
prints = prints.json()
prints = prints['data']

The `prices` data is a complex nested dictionary that we need to dig into to find the prices. First, let's explore all the possible keys in the dictionary and a sample of its data.

In [None]:
# first key is a card id
list(prices.keys())[0:10]

['00010d56-fe38-5e35-8aed-518019aa36a5',
 '0001e0d0-2dcd-5640-aadc-a84765cf5fc9',
 '0003caab-9ff5-5d1a-bc06-976dd0457f19',
 '0003d249-25d9-5223-af1e-1130f09622a7',
 '0004a4fb-92c6-59b2-bdbe-ceb584a9e401',
 '00054115-b2b6-5e22-a694-76fc8639eeb2',
 '00059c8d-868a-53ef-a1b0-fcfaabed2570',
 '0005d268-3fd0-5424-bc6b-573ecd713aa1',
 '0005f481-f2d4-53fa-ba37-cfcf5a5f87f1',
 '0006172e-304e-5f7b-ba48-f21b8da92178']

In [None]:
# Second-level keys, let's call them `online_paper`
aux = []
for key in prices:
    aux.extend(prices.get(key).keys())
set(aux)

{'mtgo', 'paper'}

In [None]:
# Collect the keys found at each nesting level below the card id
aux1, aux2, aux3, aux4, aux5 = [], [], [], [], []
for key1 in prices:
    aux1.extend(prices.get(key1))
    for key2 in prices.get(key1):
        aux2.extend(prices.get(key1).get(key2).keys())
        for key3 in prices.get(key1).get(key2):
            aux3.extend(prices.get(key1).get(key2).get(key3).keys())
            for key4 in prices.get(key1).get(key2).get(key3):
                if(key4 == 'currency'):
                    aux4.extend([prices.get(key1).get(
                        key2).get(key3).get(key4)])
                else:
                    aux4.extend(prices.get(key1).get(
                        key2).get(key3).get(key4).keys())

                    for key5 in prices.get(key1).get(key2).get(key3).get(key4):
                        aux5.extend(prices.get(key1).get(key2).get(
                            key3).get(key4).get(key5).keys())

print(100*'-')
print('First level')
print(set(aux1))

print(100*'-')
print('Second level')
print(set(aux2))

print(100*'-')
print('Third level')
print(set(aux3))

print(100*'-')
print('Fourth level')
print(set(aux4))

print(100*'-')
print('Fifth level')
# Use the dates collected across all cards rather than only the last one visited
print(sorted(set(aux5)))


----------------------------------------------------------------------------------------------------
First level
{'paper', 'mtgo'}
----------------------------------------------------------------------------------------------------
Second level
{'tcgplayer', 'cardkingdom', 'cardhoarder', 'cardmarket'}
----------------------------------------------------------------------------------------------------
Third level
{'retail', 'buylist', 'currency'}
----------------------------------------------------------------------------------------------------
Fourth level
{'normal', 'foil', 'EUR', 'USD'}
----------------------------------------------------------------------------------------------------
Fifth level
['2021-09-16', '2021-09-17', '2021-09-18', '2021-09-19', '2021-09-20', '2021-09-21', '2021-09-22', '2021-09-23', '2021-09-24', '2021-09-25', '2021-09-26', '2021-09-27', '2021-09-28', '2021-09-29', '2021-09-30', '2021-10-01', '2021-10-02', '2021-10-05', '2021-10-06', '2021-10-07', '2021-10-
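
The same exploration can be written without hard-coding one loop per level. The following is a minimal recursive sketch on a toy dictionary; `keys_by_depth` and `toy_prices` are hypothetical names, and the toy data only mimics the shape of the real `prices` object. Unlike the cell above, it records keys only, so leaf values such as the currency codes do not appear, and depth 0 holds the card ids.

```python
from collections import defaultdict
from collections.abc import Mapping


def keys_by_depth(node, depth=0, found=None):
    """Collect the set of keys seen at each nesting depth of a dict."""
    if found is None:
        found = defaultdict(set)
    for key, value in node.items():
        found[depth].add(key)
        if isinstance(value, Mapping):
            keys_by_depth(value, depth + 1, found)
    return found


# Toy data shaped like one MTGJson price entry (values are made up).
toy_prices = {
    "card-1": {
        "paper": {
            "cardmarket": {
                "currency": "EUR",
                "retail": {"normal": {"2021-09-16": 1.5}},
            }
        }
    }
}

levels = keys_by_depth(toy_prices)
for depth in sorted(levels):
    print(depth, sorted(levels[depth]))
```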

In [None]:
# Checking currency for each store
d = {}
d['tcgplayer'] = []
d['cardhoarder'] = []
d['cardkingdom'] = []
d['cardmarket'] = []

for key1 in prices:
    for key2 in prices.get(key1):
        for key3 in prices.get(key1).get(key2):
            for key4 in prices.get(key1).get(key2).get(key3):
                if(key4 == 'currency'):
                    d[key3].extend([prices.get(key1).get(
                        key2).get(key3).get(key4)])

d['tcgplayer']   = set(d['tcgplayer'])
d['cardhoarder'] = set(d['cardhoarder'])
d['cardkingdom'] = set(d['cardkingdom'])
d['cardmarket']  = set(d['cardmarket'])
d

{'tcgplayer': {'USD'},
 'cardhoarder': {'USD'},
 'cardkingdom': {'USD'},
 'cardmarket': {'EUR'}}

Let's unnest the JSON into the following format:

```
  card_id: card id for MTGJSON.
    values: uuid

  online_paper: indicates if it is the price of paper or online card
    values: {'mtgo', 'paper'}
    
  store: store from which the price was extracted.
    values: {'tcgplayer', 'cardhoarder', 'cardkingdom', 'cardmarket'}

  price_type: indicates if price is a buylist (similar to buy bid price) or retail.
    values: {'retail', 'buylist'}

  currency: price currency. Here only 'cardmarket' is in EUR.
    values: {'USD', 'EUR'}

  card_type: indicates if the card is normal or foil.
    values: {'foil', 'normal'}
```

In [None]:
# Auxiliary function to unnest a dictionary
def flattenDict(d, join=add, lift=lambda x: (x,)):
    results = []
    _FLAG_FIRST = object()

    def visit(subdict, results, partialKey):
        for k, v in subdict.items():
            newKey = lift(k) if partialKey == _FLAG_FIRST else join(
                partialKey, lift(k))
            if isinstance(v, Mapping):
                visit(v, results, newKey)
            else:
                results.append(add(newKey, lift(v)))
    visit(d, results, _FLAG_FIRST)
    return results

currency = {
    'tcgplayer':  'USD',
    'cardmarket':  'EUR',
    'cardkingdom':  'USD',
    'cardhoarder':  'USD'
}
columns = ["card_id", "online_paper", "store", "price_type",
            "card_type", "dt", "price"]

df = pd.DataFrame(flattenDict(prices), columns=columns)
df = df[~df.card_type.isin(['USD', 'EUR'])]
df["currency"] = df.store.map(currency)
df.head()

Unnamed: 0,card_id,online_paper,store,price_type,card_type,dt,price,currency
0,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-16,4.8,USD
1,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-17,4.8,USD
2,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-18,4.8,USD
3,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-19,4.8,USD
4,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-20,4.8,USD
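
To make the row shape concrete, here is a small self-contained sketch of the same unnesting idea on a toy dictionary. `iter_paths` is a hypothetical generator, not the function used above, and the values are made up. It illustrates why the `currency` leaf needs separate handling: it sits two levels higher than the prices, so it yields a 5-field row whose fifth field (the position of `card_type`) is the currency code.

```python
from collections.abc import Mapping


def iter_paths(node, prefix=()):
    """Yield one tuple per leaf: the keys on the path followed by the value."""
    for key, value in node.items():
        path = prefix + (key,)
        if isinstance(value, Mapping):
            yield from iter_paths(value, path)
        else:
            yield path + (value,)


# Toy data shaped like one MTGJson price entry (values are made up).
toy_prices = {
    "card-1": {
        "paper": {
            "cardmarket": {
                "currency": "EUR",
                "retail": {"normal": {"2021-09-16": 1.5, "2021-09-17": 1.6}},
            }
        }
    }
}

rows = list(iter_paths(toy_prices))

# Price rows carry 7 fields; the 'currency' leaf produces a shorter row.
price_rows = [row for row in rows if len(row) == 7]
for row in price_rows:
    print(row)
```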


In [None]:
# saving the dataset
df.to_csv('mtgjson_prices.csv.gz', compression='gzip', index=False)

Now let's get the card names from `prints` so that we can merge the data with the Scryfall dataset.

In [None]:
# The first key is the card edition. An edition is a set in which a group of cards
# is released together and shares some characteristics
list(prints.keys())[0:10]

['10E', '2ED', '2XM', '3ED', '4BB', '4ED', '5DN', '5ED', '6ED', '7ED']

In [None]:
# All the data from the edition
prints.get('10E').keys()

dict_keys(['baseSetSize', 'block', 'booster', 'cards', 'code', 'isFoilOnly', 'isOnlineOnly', 'keyruneCode', 'mcmId', 'mcmName', 'mtgoCode', 'name', 'releaseDate', 'sealedProduct', 'tcgplayerGroupId', 'tokens', 'totalSetSize', 'translations', 'type'])

In [None]:
# To simplify the project, let's only get the card info.
# Each edition has a set of cards. Each card has the following information:
prints.get('10E').get('cards')[0].keys()

dict_keys(['artist', 'availability', 'borderColor', 'colorIdentity', 'colors', 'convertedManaCost', 'edhrecRank', 'finishes', 'foreignData', 'frameVersion', 'hasFoil', 'hasNonFoil', 'identifiers', 'isReprint', 'keywords', 'layout', 'legalities', 'manaCost', 'manaValue', 'name', 'number', 'originalText', 'originalType', 'power', 'printings', 'purchaseUrls', 'rarity', 'rulings', 'setCode', 'subtypes', 'supertypes', 'text', 'toughness', 'type', 'types', 'uuid', 'variations'])

In [46]:
# Let's loop through each `edition` and get the `id`, `name` and `collector_number`.
# This information will be important to join the Scryfall and MTGJson data.
columns = ['card_id', 'name', 'collector_number', 'edition']

# Collect one frame per edition and concatenate once at the end,
# which avoids re-copying the growing DataFrame on every iteration.
frames = []
for edition in prints:
    cards = prints.get(edition).get('cards')
    if len(cards) > 0:
        aux = pd.DataFrame(cards)[['uuid', 'name', 'number']].copy()
        aux['edition'] = edition
        aux.columns = columns
        frames.append(aux)

df = pd.concat(frames, axis=0)

In [47]:
# dataset with card_id, name, collector_number, edition
df.head()

Unnamed: 0,card_id,name,collector_number,edition
0,5f8287b1-5bb6-5f4c-ad17-316a40d5bb0c,Ancestor's Chosen,1,10E
1,57aaebc1-850c-503d-9f6e-bb8d00d8bf7c,Angel of Mercy,2,10E
2,8ac972b5-9f6e-5cc8-91c3-b9a40a98232e,Aven Cloudchaser,7,10E
3,a69b404f-144a-5317-b10e-7d9dce135b24,Ballista Squad,8,10E
4,6d268c95-c176-5766-9a46-c14f739aba1c,Bandage,9,10E


In [None]:
# saving the dataset
df.to_csv('mtgjson_prints.csv.gz', compression='gzip', index=False)

### Step 2: Explore and Assess the Data

#### Explore the Data 
The data comes from three well-governed datasets:
- `scryfall_json.json.gz`: has ~ 350_000 rows with card information.
- `mtgjson_prices.csv.gz`: has ~ 28_000_000 rows with all card prices.
- `mtgjson_prints.csv.gz`: has ~ 64_000 rows and contains the card names and editions for the cards in the mtgjson_prices dataset.

In [None]:
def get_df_information(df):
    pd.set_option("display.max_rows", None, "display.max_columns", None)
    df_final = pd.DataFrame()
    for column in df.columns:
        aux = pd.DataFrame()
        aux['columns'] = [column]
        aux['dtpye']   = df[column].dtype
        aux['%duplicates'] = df[column].duplicated().sum()/df.shape[0]
        aux['%null']   = df[column].isnull().sum()/df.shape[0]
        aux['sample'] = ' | '.join(
            df[column][~df[column].isnull()].astype(str).unique()[0:5])
        df_final = pd.concat([df_final, aux], axis=0)
    return(df_final.reset_index(drop=True))


##### Let's start with `scryfall_json.json.gz`

In [None]:
# Let's create a DataFrame and have a look.
scryfall_df = pd.read_json('scryfall_json.json.gz', compression='gzip')

In [None]:
print('Shape: ', scryfall_df.shape)
scryfall_df.head()

Shape:  (349905, 82)


Unnamed: 0,object,id,oracle_id,multiverse_ids,mtgo_id,mtgo_foil_id,tcgplayer_id,cardmarket_id,name,lang,released_at,uri,scryfall_uri,layout,highres_image,image_status,image_uris,mana_cost,cmc,type_line,oracle_text,power,toughness,colors,color_identity,keywords,legalities,games,reserved,foil,nonfoil,finishes,oversized,promo,reprint,variation,set_id,set,set_name,set_type,set_uri,set_search_uri,scryfall_set_uri,rulings_uri,prints_search_uri,collector_number,digital,rarity,flavor_text,card_back_id,artist,artist_ids,illustration_id,border_color,frame,full_art,textless,booster,story_spotlight,edhrec_rank,prices,related_uris,printed_name,printed_type_line,printed_text,security_stamp,all_parts,promo_types,arena_id,loyalty,watermark,preview,frame_effects,produced_mana,card_faces,color_indicator,tcgplayer_etched_id,content_warning,life_modifier,hand_modifier,variation_of,flavor_name
0,card,0000579f-7b35-4ed3-b44c-db2a538066fe,44623693-51d6-49ad-8cd7-140505caf02f,[109722],25527.0,25528.0,14240.0,13850.0,Fury Sliver,en,2006-10-06,https://api.scryfall.com/cards/0000579f-7b35-4...,https://scryfall.com/card/tsp/157/fury-sliver?...,normal,True,highres_scan,{'small': 'https://c1.scryfall.com/file/scryfa...,{5}{R},6.0,Creature — Sliver,All Sliver creatures have double strike.,3,3,[R],[R],[],"{'standard': 'not_legal', 'future': 'not_legal...","[paper, mtgo]",False,True,True,"[nonfoil, foil]",False,False,False,False,c1d109bc-ffd8-428f-8d7d-3f8d7e648046,tsp,Time Spiral,expansion,https://api.scryfall.com/sets/c1d109bc-ffd8-42...,https://api.scryfall.com/cards/search?order=se...,https://scryfall.com/sets/tsp?utm_source=api,https://api.scryfall.com/cards/0000579f-7b35-4...,https://api.scryfall.com/cards/search?order=re...,157,False,uncommon,"""A rift opened, and our arrows were abruptly s...",0aeebaf5-8c7d-4636-9e82-8c27447861f7,Paolo Parente,[d48dd097-720d-476a-8722-6a02854ae28b],2fcca987-364c-4738-a75b-099d8a26d614,black,2003,False,False,True,False,5411.0,"{'usd': '0.43', 'usd_foil': '1.73', 'usd_etche...",{'gatherer': 'https://gatherer.wizards.com/Pag...,,,,,,,,,,,,,,,,,,,,
1,card,00006596-1166-4a79-8443-ca9f82e6db4e,8ae3562f-28b7-4462-96ed-be0cf7052ccc,[189637],34586.0,34587.0,33347.0,21851.0,Kor Outfitter,en,2009-10-02,https://api.scryfall.com/cards/00006596-1166-4...,https://scryfall.com/card/zen/21/kor-outfitter...,normal,True,highres_scan,{'small': 'https://c1.scryfall.com/file/scryfa...,{W}{W},2.0,Creature — Kor Soldier,"When Kor Outfitter enters the battlefield, you...",2,2,[W],[W],[],"{'standard': 'not_legal', 'future': 'not_legal...","[paper, mtgo]",False,True,True,"[nonfoil, foil]",False,False,False,False,eb16a2bd-a218-4e4e-8339-4aa1afc0c8d2,zen,Zendikar,expansion,https://api.scryfall.com/sets/eb16a2bd-a218-4e...,https://api.scryfall.com/cards/search?order=se...,https://scryfall.com/sets/zen?utm_source=api,https://api.scryfall.com/cards/00006596-1166-4...,https://api.scryfall.com/cards/search?order=re...,21,False,common,"""We take only what we need to survive. Believe...",0aeebaf5-8c7d-4636-9e82-8c27447861f7,Kieran Yanner,[aa7e89ed-d294-4633-9057-ce04dacfcfa4],de0310d1-e97f-46e0-bc16-c980c2adedee,black,2003,False,False,True,False,12019.0,"{'usd': '0.24', 'usd_foil': '2.64', 'usd_etche...",{'gatherer': 'https://gatherer.wizards.com/Pag...,,,,,,,,,,,,,,,,,,,,
2,card,00009878-d086-46f0-a964-15734d8368ac,30cd69a8-7893-4075-94ca-04450ff821d3,[433932],,,,,Spirit of the Hearth,fr,2017-08-25,https://api.scryfall.com/cards/00009878-d086-4...,https://scryfall.com/card/c17/73/fr/esprit-du-...,normal,False,lowres,{'small': 'https://c1.scryfall.com/file/scryfa...,{4}{W}{W},6.0,Creature — Cat Spirit,Flying\nYou have hexproof. (You can't be the t...,4,5,[W],[W],[Flying],"{'standard': 'not_legal', 'future': 'not_legal...",[paper],False,False,True,[nonfoil],False,False,True,False,5caec427-0c78-4c37-b4ec-30f7e0ba9abf,c17,Commander 2017,commander,https://api.scryfall.com/sets/5caec427-0c78-4c...,https://api.scryfall.com/cards/search?order=se...,https://scryfall.com/sets/c17?utm_source=api,https://api.scryfall.com/cards/00009878-d086-4...,https://api.scryfall.com/cards/search?order=re...,73,False,rare,Les voleurs savent qu'un grognement dans la nu...,0aeebaf5-8c7d-4636-9e82-8c27447861f7,Jason Chan,[8062d5a9-51b6-4822-933f-fa9e9dba8416],e8a09f86-98f6-46df-bf53-46a9f4117f36,black,2015,False,False,False,False,7162.0,"{'usd': None, 'usd_foil': None, 'usd_etched': ...",{'gatherer': 'https://gatherer.wizards.com/Pag...,Esprit du foyer,Créature : chat et esprit,Vol\nVous avez la défense talismanique. (Vous ...,oval,,,,,,,,,,,,,,,,
3,card,0000a54c-a511-4925-92dc-01b937f9afad,dc4e2134-f0c2-49aa-9ea3-ebf83af1445c,[],,,98659.0,,Spirit,en,2015-05-22,https://api.scryfall.com/cards/0000a54c-a511-4...,https://scryfall.com/card/tmm2/5/spirit?utm_so...,token,True,highres_scan,{'small': 'https://c1.scryfall.com/file/scryfa...,,0.0,Token Creature — Spirit,Flying,1,1,[W],[W],[Flying],"{'standard': 'not_legal', 'future': 'not_legal...",[paper],False,False,True,[nonfoil],False,False,True,False,f7aa47c6-c1e2-4de5-9a68-4406d84bd6bb,tmm2,Modern Masters 2015 Tokens,token,https://api.scryfall.com/sets/f7aa47c6-c1e2-4d...,https://api.scryfall.com/cards/search?order=se...,https://scryfall.com/sets/tmm2?utm_source=api,https://api.scryfall.com/cards/0000a54c-a511-4...,https://api.scryfall.com/cards/search?order=re...,5,False,common,,0aeebaf5-8c7d-4636-9e82-8c27447861f7,Mike Sass,[155bc2cb-038d-4b1f-9990-6178db1d1a21],1dbe0618-dd47-442c-acf6-ac5e4b136e5a,black,2015,False,False,True,False,,"{'usd': '0.15', 'usd_foil': None, 'usd_etched'...",{'tcgplayer_infinite_articles': 'https://infin...,,,,,"[{'object': 'related_card', 'id': '9964629e-d7...",[setpromo],,,,,,,,,,,,,,
4,card,0000cd57-91fe-411f-b798-646e965eec37,9f0d82ae-38bf-45d8-8cda-982b6ead1d72,[435231],65170.0,65171.0,145764.0,301766.0,Siren Lookout,en,2017-09-29,https://api.scryfall.com/cards/0000cd57-91fe-4...,https://scryfall.com/card/xln/78/siren-lookout...,normal,True,highres_scan,{'small': 'https://c1.scryfall.com/file/scryfa...,{2}{U},3.0,Creature — Siren Pirate,Flying\nWhen Siren Lookout enters the battlefi...,1,2,[U],[U],"[Flying, Explore]","{'standard': 'not_legal', 'future': 'not_legal...","[arena, paper, mtgo]",False,True,True,"[nonfoil, foil]",False,False,False,False,fe0dad85-54bc-4151-9200-d68da84dd0f2,xln,Ixalan,expansion,https://api.scryfall.com/sets/fe0dad85-54bc-41...,https://api.scryfall.com/cards/search?order=se...,https://scryfall.com/sets/xln?utm_source=api,https://api.scryfall.com/cards/0000cd57-91fe-4...,https://api.scryfall.com/cards/search?order=re...,78,False,common,,0aeebaf5-8c7d-4636-9e82-8c27447861f7,Chris Rallis,[a8e7b854-b15a-421a-b66d-6e68187ae285],e0a40a54-9216-4c86-b9e3-daed04abc310,black,2015,False,False,True,False,10469.0,"{'usd': '0.04', 'usd_foil': '0.26', 'usd_etche...",{'gatherer': 'https://gatherer.wizards.com/Pag...,,,,,,,66119.0,,,,,,,,,,,,,


In [39]:
get_df_information(scryfall_df)

Unnamed: 0,columns,dtpye,%null,sample
0,object,object,0.0,card
1,id,object,0.0,0000579f-7b35-4ed3-b44c-db2a538066fe | 0000659...
2,oracle_id,object,1.4e-05,44623693-51d6-49ad-8cd7-140505caf02f | 8ae3562...
3,multiverse_ids,object,0.0,[109722] | [189637] | [433932] | [] | [435231]
4,mtgo_id,float64,0.900307,25527.0 | 34586.0 | 65170.0 | 78170.0 | 81171.0
5,mtgo_foil_id,float64,0.93067,25528.0 | 34587.0 | 65171.0 | 9702.0 | 38805.0
6,tcgplayer_id,float64,0.836393,14240.0 | 33347.0 | 98659.0 | 145764.0 | 1623.0
7,cardmarket_id,float64,0.852108,13850.0 | 21851.0 | 301766.0 | 5664.0 | 400134.0
8,name,object,0.0,Fury Sliver | Kor Outfitter | Spirit of the He...
9,lang,object,0.0,en | fr | pt | ja | ru


In [38]:
print("Scryfall is unique in name, set, collector number and language: ",
scryfall_df[['name', 'set', 'collector_number','lang']].duplicated().sum())

Scryfall is unique in name, set, collector number and language:  0
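
When such a check returns a non-zero count, it helps to see which keys are repeated. The sketch below does this with the standard library on a few made-up records; `duplicate_keys` is a hypothetical helper and the records are not real Scryfall rows.

```python
from collections import Counter


def duplicate_keys(records, key_fields):
    """Return {key tuple: count} for every composite key seen more than once."""
    counts = Counter(
        tuple(record[field] for field in key_fields) for record in records
    )
    return {key: n for key, n in counts.items() if n > 1}


# Made-up records: the first and third share the same composite key.
cards = [
    {"name": "Spirit", "set": "tmm2", "collector_number": "5", "lang": "en"},
    {"name": "Spirit", "set": "tmm2", "collector_number": "5", "lang": "fr"},
    {"name": "Spirit", "set": "tmm2", "collector_number": "5", "lang": "en"},
]

key = ["name", "set", "collector_number", "lang"]
dupes = duplicate_keys(cards, key)
print(dupes)
```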


##### Now let's have a look at `mtgjson_prices.csv.gz` and `mtgjson_prints.csv.gz`

##### Prices

In [None]:
# Let's create a DataFrame and have a look at mtgjson_prices.
mtgjson_prices = pd.read_csv('mtgjson_prices.csv.gz', compression='gzip')

In [None]:
mtgjson_prices.head()

Unnamed: 0,card_id,online_paper,store,price_type,card_type,dt,price,currency
0,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-16,4.8,USD
1,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-17,4.8,USD
2,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-18,4.8,USD
3,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-19,4.8,USD
4,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-20,4.8,USD


In [None]:
print('Shape: ', mtgjson_prices.shape)
mtgjson_prices.head()

Shape:  (28211466, 8)


Unnamed: 0,card_id,online_paper,store,price_type,card_type,dt,price,currency
0,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-16,4.8,USD
1,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-17,4.8,USD
2,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-18,4.8,USD
3,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-19,4.8,USD
4,00010d56-fe38-5e35-8aed-518019aa36a5,paper,cardkingdom,buylist,foil,2021-09-20,4.8,USD


In [None]:
get_df_information(mtgjson_prices)

Unnamed: 0,columns,dtpye,%null,sample
0,card_id,object,0.0,00010d56-fe38-5e35-8aed-518019aa36a5 | 0001e0d...
1,online_paper,object,0.0,paper | mtgo
2,store,object,0.0,cardkingdom | cardmarket | tcgplayer | cardhoa...
3,price_type,object,0.0,buylist | retail
4,card_type,object,0.0,foil | normal
5,dt,object,0.0,2021-09-16 | 2021-09-17 | 2021-09-18 | 2021-09...
6,price,float64,0.0,4.8 | 5.2 | 5.5 | 6.0 | 7.99
7,currency,object,0.0,USD | EUR


##### Prints

In [41]:
# Let's create a DataFrame and have a look at mtgjson_prints.
mtgjson_prints = pd.read_csv('mtgjson_prints.csv.gz', compression='gzip')

In [42]:
print('Shape: ', mtgjson_prints.shape)
mtgjson_prints.head()

Shape:  (64455, 4)


Unnamed: 0,card_id,name,collector_number,edition
0,5f8287b1-5bb6-5f4c-ad17-316a40d5bb0c,Ancestor's Chosen,1,10E
1,57aaebc1-850c-503d-9f6e-bb8d00d8bf7c,Angel of Mercy,2,10E
2,8ac972b5-9f6e-5cc8-91c3-b9a40a98232e,Aven Cloudchaser,7,10E
3,a69b404f-144a-5317-b10e-7d9dce135b24,Ballista Squad,8,10E
4,6d268c95-c176-5766-9a46-c14f739aba1c,Bandage,9,10E


In [43]:
get_df_information(mtgjson_prints)

Unnamed: 0,columns,dtpye,%null,sample
0,card_id,object,0.0,5f8287b1-5bb6-5f4c-ad17-316a40d5bb0c | 57aaebc...
1,name,object,0.0,Ancestor's Chosen | Angel of Mercy | Aven Clou...
2,collector_number,object,0.0,1 | 2 | 7 | 8 | 9
3,edition,object,0.0,10E | 2ED | 2XM | 3ED | 4BB


In [45]:
print("MTGJson Prints is unique in card id: ",
      mtgjson_prints['card_id'].duplicated().sum())

MTGJson Prints is unique in card id:  0


#### Cleaning Steps

The Scryfall data stores some fields in complex formats. For example, `finishes` is an array and `legalities` is a dictionary. In addition, some columns of the dataset are not useful for our project; for example, `mtgo_id` is an id from the online game and should not be used in any price forecast model.
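
As an illustration of what handling these fields could look like, the sketch below turns one card record into scalar columns in plain Python. This is an assumption for illustration only, since in this project the treatment happens in Redshift; `flatten_card`, the `|` delimiter and the `legal_` prefix are hypothetical choices, and the sample values are taken from the first row shown earlier.

```python
def flatten_card(card):
    """Turn the list and dict fields of one card record into scalar columns."""
    flat = {"id": card["id"], "name": card["name"]}
    # Array field: join the values into one delimited string.
    flat["finishes"] = "|".join(card.get("finishes", []))
    # Dict field: one column per format, prefixed to avoid name clashes.
    for game_format, status in card.get("legalities", {}).items():
        flat[f"legal_{game_format}"] = status
    return flat


sample_card = {
    "id": "0000579f-7b35-4ed3-b44c-db2a538066fe",
    "name": "Fury Sliver",
    "finishes": ["nonfoil", "foil"],
    "legalities": {"standard": "not_legal", "future": "not_legal"},
}

flat_card = flatten_card(sample_card)
print(flat_card)
```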

We should keep some ids in order to find the correct price for each card. In this project we will create dimensions out of the dataset to serve the proposed goal. We will first load all the data into Redshift and process it there.

In the MTGJson data, we need to JOIN the two tables and then find the corresponding cards in Scryfall. Since all the data is gathered from APIs and websites that already pre-clean it, and we are constructing the dataset ourselves, we believe no further cleaning is necessary in this step. We will handle some particularities of the dataset in Redshift after loading the files as staging tables.
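
The join described here can be sketched in pandas on toy frames. Several things below are assumptions rather than facts established in this notebook: that the MTGJson `edition` code matches the Scryfall `set` code once lowercased, that `collector_number` is comparable as a string on both sides, and that only English Scryfall rows should be matched. All ids and prices are made up.

```python
import pandas as pd

# Toy stand-ins for the three datasets (ids and prices are made up).
prices = pd.DataFrame({
    "card_id": ["u1", "u1", "u2"],
    "dt": ["2021-09-16", "2021-09-17", "2021-09-16"],
    "price": [4.8, 4.9, 0.25],
})
prints = pd.DataFrame({
    "card_id": ["u1", "u2"],
    "name": ["Fury Sliver", "Kor Outfitter"],
    "collector_number": ["157", "21"],
    "edition": ["TSP", "ZEN"],
})
scryfall = pd.DataFrame({
    "id": ["s1", "s2"],
    "name": ["Fury Sliver", "Kor Outfitter"],
    "collector_number": ["157", "21"],
    "set": ["tsp", "zen"],
    "lang": ["en", "en"],
})

# Step 1: attach name, collector number and edition to every price row.
priced_prints = prices.merge(
    prints, on="card_id", how="inner", validate="many_to_one"
)

# Step 2 (assumption): the edition code equals the Scryfall set code, lowercased.
priced_prints["set"] = priced_prints["edition"].str.lower()

# Step 3: look up the Scryfall card; `indicator` flags rows without a match.
joined = priced_prints.merge(
    scryfall[scryfall["lang"] == "en"],
    on=["name", "set", "collector_number"],
    how="left",
    indicator=True,
)
unmatched = int((joined["_merge"] == "left_only").sum())
print(len(joined), unmatched)
```

Counting the `left_only` rows gives a quick measure of how many priced cards fail to find a Scryfall counterpart under these assumptions.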

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks
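
A minimal sketch of what such checks could look like is given below. It uses an in-memory SQLite database purely as a stand-in for Redshift, and the table names `staging_prints` and `dim_card` are hypothetical. Query-based checks are relevant here because Redshift treats primary key and unique constraints as informational and does not enforce them.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_prints (card_id TEXT, name TEXT);
    CREATE TABLE dim_card (card_id TEXT, name TEXT);
    INSERT INTO staging_prints VALUES ('u1', 'Fury Sliver'), ('u2', 'Kor Outfitter');
    INSERT INTO dim_card SELECT card_id, name FROM staging_prints;
""")


def scalar(query):
    """Run a query that returns a single value and return that value."""
    return conn.execute(query).fetchone()[0]


checks = {
    "dim_card is not empty":
        scalar("SELECT COUNT(*) FROM dim_card") > 0,
    "row counts match staging":
        scalar("SELECT COUNT(*) FROM dim_card")
        == scalar("SELECT COUNT(*) FROM staging_prints"),
    "card_id is unique":
        scalar("SELECT COUNT(*) - COUNT(DISTINCT card_id) FROM dim_card") == 0,
    "card_id has no NULLs":
        scalar("SELECT COUNT(*) FROM dim_card WHERE card_id IS NULL") == 0,
}

failed = [name for name, passed in checks.items() if not passed]
print(failed)
```

In a pipeline, a non-empty `failed` list would typically be turned into a task failure so that downstream steps do not run on incomplete data.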

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.