# Importing Data

In [15]:
import numpy as np
import pandas as pd

pd.options.display.max_rows = 20
pd.options.display.max_columns = 15
pd.__version__ 

'0.17.1'

Input data often arrives in a variety of formats:

- CSV
- Excel
- SQL
- JSON
- HDF5
- pickle
- msgpack
- Stata
- BigQuery

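This is a subset of the data from beeradvocate.com, via [Stanford](https://snap.stanford.edu/data/web-RateBeer.html). It's strangely formatted: each record is a block of `key: value` lines, and a blank line separates one record from the next.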

This dataset is no longer available!

<p style="font-size:20px; font-family:Courier">
beer/name: Sausa Weizen<br>
beer/beerId: 47986<br>
beer/brewerId: 10325<br>
beer/ABV: 5.00<br>
beer/style: Hefeweizen<br>
review/appearance: 2.5<br>
review/aroma: 2<br>
review/time: 1234817823<br>
review/profileName: stcules<br>
review/text: A lot of foam. But a lot.	In the smell some banana, and then lactic and tart. Not a good start.	Quite dark orange in color, with a lively carbonation (now visible, under the foam).	Again tending to lactic sourness.	Same for the taste. With some yeast and banana.<br>
<br>
beer/name: Red Moon<br>
beer/beerId: 48213<br>
beer/brewerId: 10325<br>
beer/ABV: 6.20<br>
 ...<br>
</p>
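
A layout like this can be turned into a table with a few lines of plain Python. The following is only a minimal sketch, assuming every line has the form `key: value` and records are separated by blank lines. The helper name `parse_records` is hypothetical, and the inline sample reuses a few fields from the excerpt above.

```python
import pandas as pd

# Inline sample using fields from the excerpt above.
raw = """\
beer/name: Sausa Weizen
beer/beerId: 47986
beer/brewerId: 10325
beer/ABV: 5.00

beer/name: Red Moon
beer/beerId: 48213
beer/brewerId: 10325
beer/ABV: 6.20
"""

def parse_records(text):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        # Split on the first ": " only, since review text may contain colons.
        key, _, value = line.partition(": ")
        # "beer/name" becomes "beer_name".
        record[key.replace("/", "_")] = value
    if record:
        yield record

beers = pd.DataFrame(list(parse_records(raw)))
# Every value arrives as a string; convert numeric columns explicitly.
beers["beer_ABV"] = pd.to_numeric(beers["beer_ABV"])
print(beers)
```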


# CSV

http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files

In [13]:
df = pd.read_csv('data/beer2.csv.gz', 
                 index_col=0,
                 parse_dates=['time'],
                 encoding='utf-8')

In [3]:
df

Unnamed: 0,abv,beer_id,brewer_id,beer_name,...,profile_name,review_taste,text,time
0,7.0,2511,287,Bell's Cherry Stout,...,blaheath,4.5,Batch 8144\tPitch black in color with a 1/2 f...,2009-10-05 21:31:48
1,5.7,19736,9790,Duck-Rabbit Porter,...,GJ40,4.0,Sampled from a 12oz bottle in a standard pint...,2009-10-05 21:32:09
2,4.8,11098,3182,Fürstenberg Premium Pilsener,...,biegaman,3.5,Haystack yellow with an energetic group of bu...,2009-10-05 21:32:13
...,...,...,...,...,...,...,...,...,...
49997,8.1,21950,2372,Terrapin Coffee Oatmeal Imperial Stout,...,ugaterrapin,4.5,Poured a light sucking crude oil beckoning bl...,2009-12-25 17:23:52
49998,4.6,5453,1306,Badger Original Ale,...,MrHurmateeowish,3.5,"500ml brown bottle, 4.0% ABV. Pours a crystal...",2009-12-25 17:25:06
49999,9.4,47695,14879,Barrel Aged B.O.R.I.S. Oatmeal Imperial Stout,...,strictly4DK,4.5,"22 oz bottle poured into a flute glass, share...",2009-12-25 17:26:06


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 0 to 49999
Data columns (total 13 columns):
abv                  48389 non-null float64
beer_id              50000 non-null int64
brewer_id            50000 non-null int64
beer_name            50000 non-null object
beer_style           50000 non-null object
review_appearance    50000 non-null float64
review_aroma         50000 non-null float64
review_overall       50000 non-null float64
review_palate        50000 non-null float64
profile_name         50000 non-null object
review_taste         50000 non-null float64
text                 49991 non-null object
time                 50000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(6), int64(2), object(4)
memory usage: 5.3+ MB


In [7]:
# we have some unicode
df.loc[2,'beer_name']

'Fürstenberg Premium Pilsener'

In [16]:
df.loc[2]

abv                                                                4.8
beer_id                                                          11098
brewer_id                                                         3182
beer_name                                 Fürstenberg Premium Pilsener
beer_style                                             German Pilsener
review_appearance                                                    4
review_aroma                                                         3
review_overall                                                       3
review_palate                                                        3
profile_name                                                  biegaman
review_taste                                                       3.5
text                  Haystack yellow with an energetic group of bu...
time                                               2009-10-05 21:32:13
Name: 2, dtype: object

In [17]:
row = df.loc[2]

In [18]:
type(row)

pandas.core.series.Series

In [19]:
row['abv']

4.7999999999999998

In [20]:
row.abv

4.7999999999999998

In [22]:
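# `row` is a copy of the DataFrame row, so this assignment
# changes `row` only and leaves `df` untouched
row.abv = 4.2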

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [24]:
row.abv

4.2

In [25]:
df.loc[2]

abv                                                                4.8
beer_id                                                          11098
brewer_id                                                         3182
beer_name                                 Fürstenberg Premium Pilsener
beer_style                                             German Pilsener
review_appearance                                                    4
review_aroma                                                         3
review_overall                                                       3
review_palate                                                        3
profile_name                                                  biegaman
review_taste                                                       3.5
text                  Haystack yellow with an energetic group of bu...
time                                               2009-10-05 21:32:13
Name: 2, dtype: object
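
The cells above show that assigning to `row` leaves `df` unchanged. One way to make the intent explicit is sketched below on a toy frame (values borrowed from the first three rows shown earlier): take an explicit `.copy()` when a detached row is wanted, and assign through `.loc` with both labels when the DataFrame itself should change.

```python
import pandas as pd

# Toy frame standing in for the beer data.
df = pd.DataFrame({"abv": [7.0, 5.7, 4.8],
                   "beer_id": [2511, 19736, 11098]})

# An explicit copy makes it clear that edits are local to `row`.
row = df.loc[2].copy()
row["abv"] = 4.2

# To change the DataFrame itself, assign through .loc with row and column labels.
df.loc[1, "abv"] = 6.0

print(row["abv"])        # the copy was changed
print(df.loc[2, "abv"])  # the original row was not
print(df.loc[1, "abv"])  # the in-place assignment is visible
```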

In [29]:
df.to_csv?

In [30]:
df.to_csv('data/beer.csv', index=False, encoding='utf-8')

In [32]:
df.size

650000

In [33]:
50e3 * 13

650000.0

In [36]:
!ls -Fla data/*.csv

-rw-r--r--  1 ijstokes  staff  42070683 Jan 13 11:44 data/beer.csv
-rw-r--r--@ 1 ijstokes  staff  42359574 Jan 13 08:06 data/beer2.csv


In [37]:
ls data/*.csv

data/beer.csv   data/beer2.csv
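
CSV stores everything as text, so column types have to be recovered when reading. The sketch below uses a small in-memory frame (an assumption made to keep the example self-contained) to show what `parse_dates` changes on the way back in.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "beer_name": ["Bell's Cherry Stout", "Fürstenberg Premium Pilsener"],
    "abv": [7.0, 4.8],
    "time": pd.to_datetime(["2009-10-05 21:31:48", "2009-10-05 21:32:13"]),
})

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
df.to_csv(buf, index=False)

# Without parse_dates the timestamp column is read back as plain text.
plain = pd.read_csv(io.StringIO(buf.getvalue()))
# With parse_dates it is converted back to a datetime column.
parsed = pd.read_csv(io.StringIO(buf.getvalue()), parse_dates=["time"])

print(plain.dtypes)
print(parsed.dtypes)
```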


# Excel

http://pandas.pydata.org/pandas-docs/stable/io.html#excel-files

In [38]:
df.to_excel('data/beer.xls', index=False, encoding='utf-8')

In [39]:
# sheetname=0 reads the first sheet (the parameter is called
# `sheet_name` in pandas 0.21 and later)
xldata = pd.read_excel('data/beer.xls', sheetname=0, encoding='utf-8')

In [40]:
xldata.loc[2]

abv                                                                4.8
beer_id                                                          11098
brewer_id                                                         3182
beer_name                                 Fürstenberg Premium Pilsener
beer_style                                             German Pilsener
review_appearance                                                    4
review_aroma                                                         3
review_overall                                                       3
review_palate                                                        3
profile_name                                                  biegaman
review_taste                                                       3.5
text                  Haystack yellow with an energetic group of bu...
time                                               2009-10-05 21:32:13
Name: 2, dtype: object

In [41]:
df.loc[2] == xldata.loc[2]

abv                  True
beer_id              True
brewer_id            True
beer_name            True
beer_style           True
review_appearance    True
review_aroma         True
review_overall       True
review_palate        True
profile_name         True
review_taste         True
text                 True
time                 True
Name: 2, dtype: bool

In [42]:
row == xldata.loc[2]

abv                  False
beer_id               True
brewer_id             True
beer_name             True
beer_style            True
review_appearance     True
review_aroma          True
review_overall        True
review_palate         True
profile_name          True
review_taste          True
text                  True
time                  True
Name: 2, dtype: bool

# SQL

http://pandas.pydata.org/pandas-docs/stable/io.html#sql-queries

In [43]:
!rm -f data/beer.sqlite

In [44]:
from sqlalchemy import create_engine

engine = create_engine('sqlite:///data/beer.sqlite')

In [45]:
# take our dataframe and write it to a database table
# called "beer" using SQL, and to the database accessed
# via "engine"
df.to_sql('beer', engine)

In [46]:
dbdata = pd.read_sql('beer', engine)

In [47]:
dbdata.loc[2]

index                                                                2
abv                                                                4.8
beer_id                                                          11098
brewer_id                                                         3182
beer_name                                 Fürstenberg Premium Pilsener
beer_style                                             German Pilsener
review_appearance                                                    4
review_aroma                                                         3
review_overall                                                       3
review_palate                                                        3
profile_name                                                  biegaman
review_taste                                                       3.5
text                  Haystack yellow with an energetic group of bu...
time                                               2009-10-05 21:32:13
Name: 
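
Reading a whole table back is only one option; a database can also filter before anything reaches pandas. The following sketch assumes an in-memory SQLite database and a three-row toy table rather than the full dataset. Passing `index=False` to `to_sql` avoids the extra `index` column seen in the output above.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({
    "beer_name": ["Bell's Cherry Stout", "Duck-Rabbit Porter",
                  "Fürstenberg Premium Pilsener"],
    "abv": [7.0, 5.7, 4.8],
})

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
df.to_sql("beer", conn, index=False)

# Let the database do the filtering and return only matching rows.
strong = pd.read_sql_query(
    "SELECT beer_name, abv FROM beer WHERE abv > ? ORDER BY abv DESC",
    conn,
    params=(5.0,),
)
conn.close()
print(strong)
```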

# JSON

http://pandas.pydata.org/pandas-docs/stable/io.html#json

In [49]:
df.to_json?

In [48]:
df.to_json('data/beer.json')

In [50]:
jsdata = pd.read_json('data/beer.json')
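
The JSON layout is controlled by the `orient` argument. As a rough illustration on a toy frame (not the full dataset), `orient="records"` writes one JSON object per row, which is a shape many non-pandas consumers expect.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "beer_name": ["Duck-Rabbit Porter", "Fürstenberg Premium Pilsener"],
    "beer_id": [19736, 11098],
    "abv": [5.7, 4.8],
})

# orient="records" produces a list of row objects.
text = df.to_json(orient="records")
print(text)

# Read the same text back, telling pandas which layout to expect.
back = pd.read_json(io.StringIO(text), orient="records")
print(back.dtypes)
```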

# HDF

http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables

In [51]:
# fixed format
df.to_hdf('data/beer_mixed.hdf',
          'df',
          mode='w',
          format='fixed',
          encoding='utf-8')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block3_values] [items->['beer_name', 'beer_style', 'profile_name', 'text']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)


In [57]:
# find empty fields
df.abv.isnull()

0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
         ...  
49990    False
49991    False
49992    False
49993    False
49994    False
49995    False
49996    False
49997    False
49998    False
49999    False
Name: abv, dtype: bool

In [59]:
df[df.abv.isnull()]

Unnamed: 0,abv,beer_id,brewer_id,beer_name,beer_style,review_appearance,review_aroma,review_overall,review_palate,profile_name,review_taste,text,time
12,,4765,1500,Brown Hound Brown Ale,English Brown Ale,4.0,4.0,4.0,4.0,CheckMate,4.0,1/2 gallon growler obtained at Whole Foods in...,2009-10-05 21:38:21
52,,37149,5408,Gose,Gose,4.0,4.5,4.5,4.5,Mistofminn,4.5,Had this on tap at the Herkimer.\t\tServed in...,2009-10-05 22:27:57
58,,53133,5408,Oktoberfest,Märzen / Oktoberfest,4.0,4.0,4.5,4.0,Mistofminn,4.0,Had this on tap during happy hour at the Herk...,2009-10-05 22:33:38
62,,37920,5408,Sky Pilot,Keller Bier / Zwickel Bier,3.5,3.5,4.0,4.5,Mistofminn,3.5,Had this one as the third beer I ordered for ...,2009-10-05 22:40:29
63,,40841,2674,Stout,American Stout,4.0,3.0,3.0,3.0,duffextracold,3.5,"A- Poured deep dark brown, not quite black, p...",2009-10-05 22:43:39
65,,42363,898,Oxford Class Organic Amber Ale,American Amber / Red Ale,3.0,4.0,3.0,3.0,mountdew1,3.5,Poured into an imperial pint glass. Color was...,2009-10-05 22:45:20
117,,53138,6421,Blueberry Lager,Fruit / Vegetable Beer,3.0,4.0,4.5,4.0,t0rin0,3.5,On tap Friday (10/2/09)\t\tThey put a scoop o...,2009-10-05 23:42:11
126,,627,183,Sterkens Dubbel Ale,Dubbel,3.5,3.5,4.0,3.5,mdwalsh,4.0,A: Pours a dark ruby or cola color with an al...,2009-10-05 23:49:34
127,,53139,6421,Amber Ale,American Amber / Red Ale,4.0,4.0,4.0,4.0,t0rin0,3.5,"Sampled last Friday (10/2/09)\t\tPours clear,...",2009-10-05 23:53:29
134,,53141,6421,Hefeweizen,American Pale Wheat Ale,3.0,3.0,3.0,3.0,t0rin0,3.0,Sampled on Friday (10/2/09)\t\tEverything abo...,2009-10-06 00:05:45
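
Once the missing values have been located, they can be counted, dropped, or filled. The sketch below uses a three-row toy frame, and filling with the median is just one possible choice, not a recommendation from the original notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "beer_name": ["Brown Hound Brown Ale", "Duck-Rabbit Porter", "Gose"],
    "abv": [np.nan, 5.7, np.nan],
})

# Number of missing values in each column.
missing = df.isnull().sum()
print(missing)

# Keep only the rows where abv is present.
known = df.dropna(subset=["abv"])

# Or fill the gaps with a placeholder value.
filled = df["abv"].fillna(df["abv"].median())
print(filled)
```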


In [60]:
data = pd.read_hdf('data/beer_mixed.hdf', 'df', encoding='utf-8')

In [61]:
# wildly varying strings
df.text.str.len().describe()

count    49991.000000
mean       733.792003
std        392.219226
min         16.000000
25%        458.000000
50%        642.000000
75%        900.000000
max       4902.000000
Name: text, dtype: float64

In [62]:
len(None)

TypeError: object of type 'NoneType' has no len()
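
Plain `len` fails on a missing value, whereas the `.str` accessor passes missing entries through. That is why the `describe()` count above is 49991 rather than 50000. A small sketch with a made-up three-element Series:

```python
import pandas as pd

text = pd.Series(["A lot of foam.", None, "Pitch black"])

lengths = text.str.len()
print(lengths)          # the missing entry stays missing instead of raising
print(lengths.count())  # count() ignores missing values
```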

# Timings

In [63]:
%timeit pd.read_excel('data/beer.xls', sheetname=0)

1 loops, best of 3: 2.62 s per loop


In [66]:
%%timeit

pd.read_sql('beer', engine)

1 loops, best of 3: 603 ms per loop


In [67]:
%timeit pd.read_json('data/beer.json')

1 loops, best of 3: 990 ms per loop


In [68]:
%timeit pd.read_csv('data/beer.csv', parse_dates=['time'])

1 loops, best of 3: 497 ms per loop


In [69]:
%timeit pd.read_hdf('data/beer_mixed.hdf','df')

10 loops, best of 3: 127 ms per loop


In [70]:
%timeit df.to_pickle('data/beer.pkl')
%timeit df.to_msgpack('data/beer.msgpack',encoding='utf-8')

10 loops, best of 3: 89.1 ms per loop
10 loops, best of 3: 132 ms per loop


In [71]:
%timeit pd.read_pickle('data/beer.pkl')

10 loops, best of 3: 41.2 ms per loop


In [72]:
%timeit pd.read_msgpack('data/beer.msgpack', encoding='utf-8')

10 loops, best of 3: 51.4 ms per loop
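
`%timeit` is an IPython magic. Outside a notebook, a similar comparison can be sketched with the standard-library `timeit` module. The frame below is synthetic (sizes and values are arbitrary assumptions), so the numbers it prints are not comparable to the timings above.

```python
import io
import pickle
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the beer data.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "abv": rng.uniform(3, 12, size=5000),
    "beer_id": rng.randint(1, 60000, size=5000),
})

# Serialize once up front so only the reading step is timed.
csv_text = df.to_csv(index=False)
pickled = pickle.dumps(df)

def read_csv():
    return pd.read_csv(io.StringIO(csv_text))

def read_pickle():
    return pickle.loads(pickled)

# Best of 3 repeats of 5 calls each, reported per call in milliseconds.
for name, func in [("csv", read_csv), ("pickle", read_pickle)]:
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print("%-6s %.2f ms" % (name, best * 1e3))
```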


# Storing Text vs Data
http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/

# Operating on Large Data

In [73]:
chunks = pd.read_csv('data/beer2.csv.gz', 
                      index_col=0,
                      parse_dates=['time'],
                      chunksize=10000)
for i, chunk in enumerate(chunks):
    print("%d -> %d" % (i, len(chunk)))

0 -> 10000
1 -> 10000
2 -> 10000
3 -> 10000
4 -> 10000
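
The loop above only prints chunk sizes. In practice each chunk is usually reduced to a small summary that is combined across chunks, so the full file never has to sit in memory. The sketch below assumes a tiny in-memory CSV (styles and ABV values borrowed from rows shown in this notebook) and keeps a running mean plus per-style counts.

```python
import io
from collections import Counter

import pandas as pd

# Tiny in-memory CSV standing in for data/beer2.csv.gz.
csv_text = "beer_style,abv\n" + "\n".join(
    ["American Stout,7.0", "American Porter,5.7", "German Pilsener,4.8",
     "American Stout,7.0", "Witbier,4.8", "American IPA,6.7"]
)

total = 0.0
count = 0
style_counts = Counter()

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # A running sum and count give the overall mean without holding all rows.
    total += chunk["abv"].sum()
    count += chunk["abv"].count()
    # Merge this chunk's per-style counts into the running tally.
    style_counts.update(chunk["beer_style"].value_counts().to_dict())

mean_abv = total / count
print(mean_abv)
print(style_counts.most_common())
```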


In [76]:
%%timeit
chunks = pd.read_csv('data/beer2.csv.gz', 
                      index_col=0,
                      parse_dates=['time'],
                      chunksize=10000)

100 loops, best of 3: 3.68 ms per loop


In [75]:
chunks

<pandas.io.parsers.TextFileReader at 0x1130eed68>

In [78]:
chunks_it = iter(chunks)

In [80]:
mychunk = next(chunks_it)

In [84]:
mychunk

Unnamed: 0,abv,beer_id,brewer_id,beer_name,beer_style,review_appearance,review_aroma,review_overall,review_palate,profile_name,review_taste,text,time
0,7.00,2511,287,Bell's Cherry Stout,American Stout,4.5,4.0,4.5,4.0,blaheath,4.5,Batch 8144\tPitch black in color with a 1/2 f...,2009-10-05 21:31:48
1,5.70,19736,9790,Duck-Rabbit Porter,American Porter,4.5,4.0,4.5,4.0,GJ40,4.0,Sampled from a 12oz bottle in a standard pint...,2009-10-05 21:32:09
2,4.80,11098,3182,Fürstenberg Premium Pilsener,German Pilsener,4.0,3.0,3.0,3.0,biegaman,3.5,Haystack yellow with an energetic group of bu...,2009-10-05 21:32:13
3,9.50,28577,3818,Unearthly (Imperial India Pale Ale),American Double / Imperial IPA,4.0,4.0,4.0,4.0,nick76,4.0,"The aroma has pine, wood, citrus, caramel, an...",2009-10-05 21:32:37
4,5.80,398,119,Wolaver's Pale Ale,American Pale Ale (APA),4.0,3.0,4.0,3.5,champ103,3.0,A: Pours a slightly hazy golden/orange color....,2009-10-05 21:33:14
5,7.00,966,365,Pike Street XXXXX Stout,American Stout,4.0,4.0,3.5,4.0,sprucetip,4.5,"From notes. Pours black, thin mocha head fade...",2009-10-05 21:33:48
6,6.20,53128,1114,Smokin' Amber Kegs Gone Wild,American Amber / Red Ale,3.5,4.0,4.5,4.0,Deuane,4.5,An American amber with the addition of smoked...,2009-10-05 21:34:24
7,4.80,1669,256,Great White,Witbier,4.5,4.5,4.5,4.5,n0rc41,4.5,"Ok, for starters great white I believe will b...",2009-10-05 21:34:29
8,6.70,6549,140,Northern Hemisphere Harvest Wet Hop Ale,American IPA,4.0,4.0,4.0,4.0,david18,4.0,I like all of Sierra Nevada's beers but felt ...,2009-10-05 21:34:31
9,6.50,13824,743,Oktoberfest,Vienna Lager,3.0,2.5,2.5,2.5,Seanibus,2.5,This actually winds up coming out like a ligh...,2009-10-05 21:35:09


In [85]:
mychunk = next(chunks_it)

In [86]:
mychunk

Unnamed: 0,abv,beer_id,brewer_id,beer_name,beer_style,review_appearance,review_aroma,review_overall,review_palate,profile_name,review_taste,text,time
10000,6.0,33236,26,Munsterfest,Märzen / Oktoberfest,4.0,3.0,3.5,4.0,BedetheVenerable,3.0,Presentation: 22oz brown pop-top with cool la...,2009-10-21 03:35:55
10001,7.0,19686,9629,Short's Black Cherry Porter,American Porter,5.0,4.0,3.5,5.0,adamette,4.5,This was the bartender's recommendation at Ho...,2009-10-21 03:36:47
10002,,47185,16533,Amber Ale,American Amber / Red Ale,3.0,3.0,3.5,3.5,Tone,3.5,"Pours a clear, copper color. 1/5 inch head of...",2009-10-21 03:37:13
10003,6.7,52371,140,Sierra Nevada Estate Brewers Harvest Ale,American IPA,4.5,4.0,4.0,3.5,mothman,4.0,Poured into Bud American Ale glass\t\tPours a...,2009-10-21 03:37:39
10004,9.0,18021,9629,Short's The Soft Parade,Fruit / Vegetable Beer,4.0,3.5,4.0,4.0,adamette,3.5,"On tap at HopCat in Grand Rapids, MI two days...",2009-10-21 03:38:07
10005,6.4,50,139,Tremont Freedom Trail IPA,English India Pale Ale (IPA),4.0,3.5,4.0,4.0,jwinship83,4.0,12oz bottle given to me by my sister who got ...,2009-10-21 03:38:42
10006,7.3,22505,2743,Green Flash West Coast I.P.A.,American IPA,4.0,4.5,4.5,4.0,augustgarage,4.5,On cask during Green Flash night at Blue Palm...,2009-10-21 03:40:02
10007,5.7,33159,199,Ballast Point Rocktoberfest Lager,Märzen / Oktoberfest,4.0,4.0,4.0,4.0,HopHead84,4.0,Consumed at the Old Grove location on 10/17/2...,2009-10-21 03:43:26
10008,6.2,51631,158,Hoss,Märzen / Oktoberfest,4.0,4.0,4.0,4.0,cswhitehorse,4.5,The Hoss poured a light amber like color with...,2009-10-21 03:47:19
10009,4.8,31282,132,Morimoto Black Obi Soba Ale,Fruit / Vegetable Beer,4.5,4.0,4.0,4.0,BigDank,4.0,Beer poured a pretty dark brown but not too d...,2009-10-21 03:50:02


# Using Odo
http://odo.readthedocs.org/

# Questions

- which formats provide good fidelity
  - hdf5, pickle, msgpack
  
- which formats can you query
  - hdf5, sql
  
- which formats can you iterate
  - csv, hdf5, sql
  
- which formats provide better interoperability
  - csv, json, excel
  
- which formats can you transmit over the wire
  - json, msgpack
  
- which formats have better compression
  - hdf5, pickle, msgpack
  
- which formats allow multiple datasets in the same file
  - hdf5, msgpack