# Blaze

There's a very interesting set of libraries under [The Blaze Ecosystem](http://blaze.pydata.org/) project.

[Blaze](http://blaze.readthedocs.io/en/latest/index.html) is a library for doing _computation_ on large data sets. It provides a syntax very similar to Pandas or NumPy, while letting us access different storage backends: databases, in-memory data, distributed systems, etc.

It's interesting to read the documentation on [what Blaze doesn't do](http://blaze.readthedocs.io/en/latest/what-blaze-isnt.html):

> Blaze is a query system that looks like NumPy/Pandas. You write Blaze queries, Blaze translates those queries to something else (like SQL), and ships those queries to various database to run on other people’s fast code. It smoothes out this process to make interacting with foreign data as accessible as using Pandas. This is actually quite difficult.
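
To make the quoted idea concrete, here is a toy sketch of what "translating a query to SQL and shipping it to a database" could look like, using only Python's standard library. The `Filter` class and the `run` helper are hypothetical names invented for this illustration; this is not how Blaze is implemented.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    """A deferred 'column > threshold' predicate on a table (hypothetical)."""
    table: str
    column: str
    threshold: float

    def to_sql(self):
        # Identifiers are quoted; the threshold is passed as a bound parameter.
        sql = f'SELECT * FROM "{self.table}" WHERE "{self.column}" > ?'
        return sql, (self.threshold,)


def run(conn, query):
    """Translate the deferred query to SQL and let SQLite execute it."""
    sql, params = query.to_sql()
    return conn.execute(sql, params).fetchall()


# A tiny in-memory table standing in for a "foreign" data source.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE prices ("Date" TEXT, "Open" REAL)')
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("2016-07-01", 692.200012), ("2016-06-30", 685.469971), ("2016-06-29", 683.0)],
)

# Nothing is computed until the query is translated and shipped to SQLite.
query = Filter(table="prices", column="Open", threshold=685.0)
rows = run(conn, query)
```

The point of the sketch is the division of labour: Python only builds a description of the query, and the database does the actual filtering.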


In [1]:
from blaze import *



In [2]:
t = data([
    (1, "Alice"), 
    (2, "Bob"),
    (3, "Charlie")], 
    fields=['id', 'name'])
t.peek()

Unnamed: 0,id,name
0,1,Alice
1,2,Bob
2,3,Charlie


## Loading Data

Our data set has a _Date_ column that is not properly formatted. We have to create a dshape with `date` instead of `datetime` to make sure Python handles the column properly.
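
The distinction matters because a date-only string such as `2016-07-01` has no time component. The small standard-library sketch below is not Blaze or datashape code, and the strict timestamp format is an assumption made purely for illustration. It shows how such a value parses cleanly as a date but is rejected by a parser that expects a full timestamp:

```python
from datetime import date, datetime

raw = "2016-07-01"  # a date-only value, like those in the Date column

# Parsing as a plain date succeeds.
as_date = date.fromisoformat(raw)

# A strict timestamp format (assumed here for illustration) rejects it.
try:
    datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    strict_ok = True
except ValueError:
    strict_ok = False
```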

In [3]:
goog = data('goog.csv.gz')
goog[:5]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2016-07-01,692.200012,700.650024,692.130005,699.210022,1342700,699.210022
1,2016-06-30,685.469971,692.320007,683.650024,692.099976,1590500,692.099976
2,2016-06-29,683.0,687.429016,681.409973,684.109985,1928500,684.109985
3,2016-06-28,678.969971,680.330017,673.0,680.039978,2116600,680.039978
4,2016-06-27,671.0,672.299988,663.283997,668.26001,2629000,668.26001


## Queries

In [4]:
goog[goog.Open > 1220]

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
585,2014-03-07,1226.802152,1226.992071,1211.442033,1214.792073,3041500,606.789247
586,2014-03-06,1222.282091,1226.152141,1218.602083,1219.612082,2545600,609.196844
590,2014-02-28,1220.342104,1224.192059,1206.222092,1215.652097,4644500,607.21883
592,2014-02-26,1224.002141,1228.882066,1213.762102,1220.172036,3979100,609.476541


In [5]:
goog[["Open", "Close"]][:2]

Unnamed: 0,Open,Close
0,692.200012,699.210022
1,685.469971,692.099976


In [6]:
goog.Close.mean()

## Grouping

In [7]:
by(merge(goog.Date.year, goog.Date.month), mean=goog.Close.mean())

Unnamed: 0,Date_year,Date_month,mean
0,2004,8,105.262402
1,2004,9,113.227337
2,2004,10,153.231214
3,2004,11,177.495544
4,2004,12,181.770309
5,2005,1,192.846331
6,2005,2,195.014017
7,2005,3,181.158493
8,2005,4,199.215105
9,2005,5,239.710411


In [8]:
by(goog.Date.year, min=goog.Close.min(), max=goog.Close.max())

Unnamed: 0,Date_year,max,min
0,2004,197.600333,100.010169
1,2005,432.040752,174.990304
2,2006,509.65086,337.060574
3,2007,741.791259,438.680763
4,2008,685.331181,257.440455
5,2009,622.871087,282.750497
6,2010,626.771094,436.070761
7,2011,645.901096,474.880824
8,2012,768.051291,559.050931
9,2013,1120.711956,702.871197


## CSV to SQLite

Conversion is done using the [Odo package](http://odo.readthedocs.io/en/latest/).

More examples can be found [here](http://blaze.readthedocs.io/en/latest/csv.html#migrate-to-binary-storage-formats).

In [9]:
from odo import odo
%time table = odo("goog.csv", "sqlite:///goog.sqlite::prices")

CPU times: user 35.3 ms, sys: 15.6 ms, total: 50.9 ms
Wall time: 109 ms
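
Conceptually, this migration reads rows from the CSV file and inserts them into a SQLite table. The sketch below does that by hand with the standard library, using an in-memory CSV and an in-memory database so that it is self-contained. It is not Odo's implementation, and the table schema is written by hand as an assumption.

```python
import csv
import io
import sqlite3

# Two rows in the same layout as goog.csv (values copied from the output above).
csv_text = """Date,Open,High,Low,Close,Volume,Adj Close
2016-07-01,692.200012,700.650024,692.130005,699.210022,1342700,699.210022
2016-06-30,685.469971,692.320007,683.650024,692.099976,1590500,692.099976
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE prices ("Date" TEXT, "Open" REAL, "High" REAL, "Low" REAL, '
    '"Close" REAL, "Volume" INTEGER, "Adj Close" REAL)'
)

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
# SQLite's column affinity converts the numeric strings to REAL/INTEGER on insert.
conn.executemany("INSERT INTO prices VALUES (?, ?, ?, ?, ?, ?, ?)", reader)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
first_close = conn.execute(
    'SELECT "Close" FROM prices WHERE "Date" = ?', ("2016-07-01",)
).fetchone()[0]
```

Note that the dates are stored as plain text here, which is one reason date handling deserves attention after a migration like this.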


In [10]:
from sqlalchemy import create_engine
engine = create_engine("sqlite:///goog.sqlite")
goog = data(engine)
goog.dshape

dshape("""{
  prices: var * {
    Date: ?datetime,
    Open: float64,
    High: float64,
    Low: float64,
    Close: float64,
    Volume: int64,
    'Adj Close': float64
    }
  }""")

In [11]:
goog.prices[["Open", "Close"]]

Unnamed: 0,Open,Close
0,692.200012,699.210022
1,685.469971,692.099976
2,683.0,684.109985
3,678.969971,680.039978
4,671.0,668.26001
5,675.169983,675.219971
6,697.450012,701.869995
7,699.059998,697.460022
8,698.400024,695.940002
9,698.77002,693.710022


### Manually defining input format

In [12]:
schema = dshape("""{
  prices: var * {
    Date: ?datetime,
    Open: float64,
    High: float64,
    Low: float64,
    Close: float64,
    Volume: int64,
    'Adj Close': float64
    }
  }""")
goog = data('sqlite:///goog.sqlite::prices', dshape=schema)
goog.Open

Unnamed: 0,Open
0,692.200012
1,685.469971
2,683.0
3,678.969971
4,671.0
5,675.169983
6,697.450012
7,699.059998
8,698.400024
9,698.77002


And kids, this is how one gets bitten by the `datetime` monster:

In [13]:
# DO NOT TRY THIS AT HOME
# goog.Date