# How to Import Data with Pandas

In this notebook we discuss how to load large data sets with pandas. In particular, we consider .csv and .json files and give three examples that will come up in the following notebooks.

## Before we start ...

... we have to decide which data we want to work on. Of course, there are plenty of databases available online. In the following, we will use databases about a collectible card game called Magic: The Gathering.

There are multiple reasons for this. First of all, it is a complex game and, as such, there is a lot to learn and analyse. What makes it so complex? It has over 20,000 different printed cards with unique names and in-game rules text. And it is Turing complete. (So one can simulate an abstract machine that is capable of implementing any computer algorithm.) The second reason is that I love playing this game. Ever since I was little, I have been fascinated by the variety and complexity of the strategies one can come up with in this game.

There are multiple sources of data about Magic: The Gathering (MTG). One data project that stands out is the open-source project MTGJSON, thanks to its daily updates and direct downloads (see [here](https://mtgjson.com/)).

The idea for my data project is inspired by an article by Gabriel Pierobon (see [here](https://towardsdatascience.com/artificial-intelligence-in-magic-the-gathering-4367e88aee11)). (At some point I will use his pre-processed data.)

## Importing Data with Pandas

The files we work with are .csv or .json files. Since these files generally contain a lot of data, we have to be careful about how we load them. There are multiple options. One is to split the data into chunks and process these one at a time. Another is to specify the data types of the individual columns up front, so that pandas does not have to guess them (which can trigger a DtypeWarning for columns with mixed types) and can store the data more compactly. We follow the second strategy, since we want to train our models on all cards at once.
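For comparison, here is a minimal sketch of the first strategy, which we do not pursue further. It assumes a small in-memory table as a stand-in for a large file; the column names `name` and `power` are only illustrative. Passing `chunksize` to `pd.read_csv` returns an iterator of DataFrames instead of one big DataFrame.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large .csv file (not the MTG data).
csv_text = "name,power\n" + "\n".join(f"card_{i},{i % 5}" for i in range(10))

total_power = 0
n_rows = 0

# With chunksize set, read_csv yields DataFrames of at most 4 rows each,
# so only one chunk has to be held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_power += int(chunk["power"].sum())
    n_rows += len(chunk)

print(n_rows, total_power)
```

This works well for aggregations that can be accumulated chunk by chunk, but it is less convenient when a model needs to see all rows at once.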

In [1]:
import json
import pandas as pd

In [2]:
# .csv file

csv_path = '../data/mtg_cards_data/datasets_vow_20211220_FULL.csv'

# Loading without explicit dtypes raises a DtypeWarning:
# db = pd.read_csv(csv_path)

# Read only the header row to get the column names
columns = pd.read_csv(csv_path, nrows=0).columns
columns_with_obj_dtype = ['name','oracle_text','set','edhrec_rank','cmc_grp','usd','eur','tix',
               'normal_image','normal_image_1','normal_image_2','multiclass_colrs','multiclass_rarty',
               'multiclass_binusd','multiclass_bineur','multiclass_bintix']
# Every column not listed above is loaded as int8
dtype = {col: ('int8' if col not in columns_with_obj_dtype else 'object') for col in columns}

db = pd.read_csv(csv_path, dtype=dtype)
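To see what the explicit `dtype` mapping buys us, the following sketch repeats the same pattern on a small made-up table and compares the memory used by the numeric columns. The data and the column names `flag_a` and `flag_b` are assumptions for illustration only, not columns of the file above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the .csv file: one text column, two 0/1 columns.
csv_text = "name,flag_a,flag_b\n" + "\n".join(
    f"card_{i},{i % 2},1" for i in range(1000)
)

# Read only the header row, then build the dtype mapping from it.
columns = pd.read_csv(io.StringIO(csv_text), nrows=0).columns
object_columns = ["name"]
dtype = {col: ("object" if col in object_columns else "int8") for col in columns}

default_df = pd.read_csv(io.StringIO(csv_text))            # pandas infers int64
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=dtype)

numeric = ["flag_a", "flag_b"]
default_bytes = default_df[numeric].memory_usage(index=False).sum()
compact_bytes = compact_df[numeric].memory_usage(index=False).sum()

print(default_bytes, compact_bytes)
```

Note that `int8` only holds values from -128 to 127, so this choice is safe only for columns whose values are known to stay in that range.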

In [3]:
# .csv file

# Loading without explicit dtypes raises a DtypeWarning:
# db = pd.read_csv('../data/mtg_cards_data/AllPrintingsCSVFiles/cards.csv')

# selected columns of interest for our data project
columns = ['id','name','text','manaCost','manaValue','colorIdentity','colors','convertedManaCost','type','types','loyalty','power','toughness','keywords',
    'edhrecRank','life','defense','scryfallId','scryfallIllustrationId','scryfallOracleId','relatedCards']

dtype = {'id': 'int64','name': str,'text': str,'manaCost': str,'manaValue': float,'colorIdentity': str,'colors': str,
        'convertedManaCost': float,'type': str,'types': str,'loyalty': 'object','power': 'object','toughness': 'object',
        'keywords': str,'edhrecRank': 'object','life': 'object','defense': 'object','scryfallId': str,'scryfallIllustrationId': str,
        'scryfallOracleId': str,'relatedCards': str }

db = pd.read_csv('../data/mtg_cards_data/AllPrintingsCSVFiles/cards.csv', usecols=columns, dtype=dtype)

In [4]:
# .json file

with open('../data/mtg_cards_data/AllDeckFiles/NecronDynasties_40K.json', encoding='utf-8') as f:
    data = json.load(f)

# selected key of interest for our data project
data = data['data']['mainBoard']

# flatten nested dictionaries one level deep into separate columns
db = pd.json_normalize(data, max_level=1)
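The effect of `max_level=1` is easiest to see on a toy example. The sketch below uses two made-up deck entries; the field names are illustrative and not meant to reproduce the actual MTGJSON schema. Dictionaries nested one level deep become dotted columns, while anything nested deeper stays a dictionary inside a single column.

```python
import pandas as pd

# Hypothetical deck entries (illustrative field names, not the real schema).
main_board = [
    {"name": "Card A", "count": 2,
     "identifiers": {"scryfallId": "abc", "extra": {"k": 1}}},
    {"name": "Card B", "count": 1,
     "identifiers": {"scryfallId": "def", "extra": {"k": 2}}},
]

# Flatten only the first level of nesting.
flat = pd.json_normalize(main_board, max_level=1)

print(list(flat.columns))
```

Leaving out `max_level` would flatten all levels, which can produce a very wide table for deeply nested records.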