# Ingesting Data

In Python there are multiple ways to ingest data into a pandas DataFrame. We often work with data from several sources, even within a single project, so being able to ingest and format data easily is crucial to being productive.

We will look at six different ways to ingest data today:
- via CSV
- via JSON
- via API
- via Google Sheets
- via S3
- via SQL

## CSV ingestion

One of the most common data formats you'll deal with in the real world is CSV. A CSV file is a simple, text-based representation of tabular data. It is easy to save (it's just a text file) and to pass around, and while it's not the most efficient way to store data, it is less verbose than other formats like JSON.

There are a few ways to read CSV data into a pandas DataFrame. The easiest way is to use `.read_csv`, e.g.:

In [1]:
import pandas as pd

In [7]:
market_caps = pd.read_csv('market_caps.csv')
market_caps.head()

,Rank,Name,Symbol,MarketCap,Price,VolumeUSD
0,1,Bitcoin,BTC,932695400000.0,49368.85,37198200000.0
1,2,Ethereum,ETH,498007900000.0,4198.32,25533060000.0
2,3,Binance Coin,BNB,93038860000.0,557.78,2394236000.0
3,4,Tether,USDT,75159690000.0,1.0,82054150000.0
4,5,Solana,SOL,59983320000.0,196.17,3402592000.0


We can also see that `.read_csv` tries to infer a type for each column. E.g. it detects that MarketCap, Price and VolumeUSD are floats, and that Rank is an integer:

In [10]:
market_caps.dtypes

Rank           int64
Name          object
Symbol        object
MarketCap    float64
Price        float64
VolumeUSD    float64
dtype: object
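
When the inferred types are not what we want, `.read_csv` also accepts a `dtype` argument that overrides inference per column. The snippet below is a minimal sketch of that idea: it reads a small inline sample (two rows copied from the table above, standing in for `market_caps.csv`), and the choice of `int32` and `category` is purely illustrative.

```python
import io

import pandas as pd

# Inline stand-in for market_caps.csv so the example runs on its own.
sample_csv = io.StringIO(
    "Rank,Name,Symbol,MarketCap,Price,VolumeUSD\n"
    "1,Bitcoin,BTC,932695400000.0,49368.85,37198200000.0\n"
    "2,Ethereum,ETH,498007900000.0,4198.32,25533060000.0\n"
)

# Override the inferred types for two columns; the rest are still inferred.
typed = pd.read_csv(
    sample_csv,
    dtype={"Rank": "int32", "Symbol": "category"},
)
print(typed.dtypes)
```

Columns that are not listed in `dtype` keep their inferred types, so this can be applied only where it matters.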

However, we can customize this further. For example, let's say we want custom column names and want to use rank as the index. We can do so directly from `.read_csv`:

In [15]:
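market_caps_custom = pd.read_csv(
    'market_caps.csv',
    header=None,   # don't take column names from the file...
    skiprows=1,    # ...and skip the file's own header row
    # our own column names, in file order
    names=['rank', 'name', 'sym', 'market_size', 'price', 'volume_usd'],
    index_col=0    # use the first column (rank) as the index
)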
market_caps_custom.head()

rank,name,sym,market_size,price,volume_usd
1,Bitcoin,BTC,932695400000.0,49368.85,37198200000.0
2,Ethereum,ETH,498007900000.0,4198.32,25533060000.0
3,Binance Coin,BNB,93038860000.0,557.78,2394236000.0
4,Tether,USDT,75159690000.0,1.0,82054150000.0
5,Solana,SOL,59983320000.0,196.17,3402592000.0
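
As a variation on the call above, passing `header=0` together with `names` tells pandas to replace the file's header row with our names, which should give the same result without `skiprows`. This is a sketch on a small inline sample (rows copied from the table above) rather than the real `market_caps.csv`; here the index column is selected by its new name instead of its position.

```python
import io

import pandas as pd

# Inline stand-in for market_caps.csv so the example runs on its own.
sample_csv = io.StringIO(
    "Rank,Name,Symbol,MarketCap,Price,VolumeUSD\n"
    "1,Bitcoin,BTC,932695400000.0,49368.85,37198200000.0\n"
    "2,Ethereum,ETH,498007900000.0,4198.32,25533060000.0\n"
)

renamed = pd.read_csv(
    sample_csv,
    header=0,  # row 0 is the header, and `names` replaces it
    names=["rank", "name", "sym", "market_size", "price", "volume_usd"],
    index_col="rank",  # refer to the index column by its new name
)
print(renamed.head())
```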


## JSON ingestion

JSON files have similar properties to CSVs, in that they are simple text files that can be passed around easily. However, they can be a bit harder to work with for data analysis.

While JSON can represent a table, it can also represent any arbitrary bundle of data. This means that we may need to parse the JSON first and extract the relevant values before we can convert it into a pandas DataFrame.
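
To make the "parse first, then convert" step concrete, here is a small sketch. The payload shape (a `status` field, with the rows nested under a `data` key and the price inside a `quote` object) is hypothetical and only meant to resemble a typical API response. It uses the standard `json` module to parse the text and `pd.json_normalize` to flatten the nested records.

```python
import json

import pandas as pd

# Hypothetical payload: the tabular part is nested under "data".
raw = """
{
  "status": "ok",
  "data": [
    {"rank": 1, "name": "Bitcoin", "quote": {"price": 49368.85}},
    {"rank": 2, "name": "Ethereum", "quote": {"price": 4198.32}}
  ]
}
"""

payload = json.loads(raw)  # parse the text into Python dicts and lists
# Keep only the list of records and flatten nested keys into columns.
coins = pd.json_normalize(payload["data"])
print(coins.columns.tolist())
```

Nested keys are joined with a dot by default, so the price ends up in a column named `quote.price`.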

Let's start with a simple example where the JSON is nicely formatted and represents a table. In this case, like with `.read_csv`, we can simply call `.read_json`:

In [30]:
market_caps_json = pd.read_json('market_caps.json', orient='columns')
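
The `orient` argument tells `.read_json` how the table is laid out inside the JSON text. As a rough illustration, the sketch below builds two tiny inline JSON strings (made up for this example, not taken from `market_caps.json`) that describe the same two rows in the `columns` layout and in the `records` layout, and reads both.

```python
import io

import pandas as pd

# orient="columns": {column -> {index -> value}}
columns_json = '{"Rank": {"0": 1, "1": 2}, "Symbol": {"0": "BTC", "1": "ETH"}}'
# orient="records": [{column -> value}, ...]
records_json = '[{"Rank": 1, "Symbol": "BTC"}, {"Rank": 2, "Symbol": "ETH"}]'

from_columns = pd.read_json(io.StringIO(columns_json), orient="columns")
from_records = pd.read_json(io.StringIO(records_json), orient="records")

print(from_columns)
print(from_records)
```

Both calls should produce a two-row table with `Rank` and `Symbol` columns; which `orient` to use depends on how the file was written.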

In [31]:
market_caps_json

,Rank,Name,Symbol,MarketCap,Price,VolumeUSD
0,1,Bitcoin,BTC,9.326954e+11,49368.85000,3.719820e+10
1,2,Ethereum,ETH,4.980079e+11,4198.32000,2.553306e+10
2,3,Binance Coin,BNB,9.303886e+10,557.78000,2.394236e+09
3,4,Tether,USDT,7.515969e+10,1.00000,8.205415e+10
4,5,Solana,SOL,5.998332e+10,196.17000,3.402592e+09
...,...,...,...,...,...,...
95,96,Livepeer,LPT,8.771234e+08,41.44000,8.792440e+07
96,97,Audius,AUDIO,8.818012e+08,1.73000,3.114046e+07
97,98,Ankr,ANKR,8.784732e+08,0.10760,1.308197e+08
98,99,yearn.finance,YFI,8.726712e+08,23818.93000,1.832332e+08
