# h5 Data Exploration

In this notebook, I explore some data provided to me in the .h5 (HDF5) format, a hierarchical binary format commonly used to store large numerical datasets.

In [2]:
import numpy as np
import pandas as pd
import h5py
import os

In [3]:
os.listdir('../../')

['.DS_Store',
 'universe_sp500.h5',
 'parsed.pickle',
 'HKML Quant Exercise.pdf',
 'doublecheck.py',
 'project_overview.pdf',
 '10K-10Q-NLP-Project',
 'returns_sp500.h5']

In [4]:
f = h5py.File('../../returns_sp500.h5', 'r')

In [5]:
f.keys()

<KeysViewHDF5 ['returns']>

In [6]:
h = f['returns']

In [7]:
h.keys()

<KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values']>

In [8]:
dset = h['axis0']
dset.shape

(1110,)

In [9]:
dset[:20]

array([b'A', b'AABA', b'AAL', b'AAMRQ', b'AAP', b'AAPL', b'ABBV', b'ABC',
       b'ABI', b'ABKFQ', b'ABMD', b'ABS', b'ABT', b'ABX', b'ACAS',
       b'ACKH', b'ACN', b'ACS', b'ACV', b'ADBE'], dtype='|S5')
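
The entries come back as fixed-width byte strings (dtype `|S5`) rather than Python `str` objects. A minimal sketch of decoding them, assuming the symbols are plain ASCII; the `raw` array below is a hand-copied stand-in for the first few values of `dset`, not a read from the file:

```python
import numpy as np

# Stand-in for dset[:5]; values copied from the output above.
raw = np.array([b'A', b'AABA', b'AAL', b'AAMRQ', b'AAP'], dtype='|S5')

# Each element is a bytes object, so it can be decoded individually.
tickers = [t.decode('ascii') for t in raw]
print(tickers)
```

On the real dataset the same comprehension could be applied to `dset[:]`.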

These appear to be ticker symbols for S&P 500 stocks. With 1110 entries, the list presumably covers past as well as current index members. Let us continue exploring:

In [10]:
dset2 = h['axis1']
dset2.shape

(6451,)

In [11]:
dset2[:20]

array([820540800000000000, 820627200000000000, 820713600000000000,
       820800000000000000, 821059200000000000, 821145600000000000,
       821232000000000000, 821318400000000000, 821404800000000000,
       821664000000000000, 821750400000000000, 821836800000000000,
       821923200000000000, 822009600000000000, 822268800000000000,
       822355200000000000, 822441600000000000, 822528000000000000,
       822614400000000000, 822873600000000000])

These numbers are Unix timestamps expressed in nanoseconds since 1970-01-01 UTC. The first value, 820540800000000000, corresponds to 1996-01-02.
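
One way to check the timestamp reading is to convert a few values to dates. The sketch below assumes the integers count nanoseconds since the Unix epoch; `stamps` is a hand-copied stand-in for the first five values of `dset2`:

```python
import numpy as np
import pandas as pd

# Stand-in for dset2[:5]; values copied from the output above.
stamps = np.array([820540800000000000, 820627200000000000,
                   820713600000000000, 820800000000000000,
                   821059200000000000])

# Interpret the integers as nanoseconds since 1970-01-01.
dates = pd.to_datetime(stamps, unit='ns')
print(dates)
```

If the assumption holds, the dates should start in early January 1996 and skip the weekend between the fourth and fifth entries, which would fit a series of trading days.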

In [12]:
dset3 = h['block0_items']
dset3.shape

(1110,)

In [13]:
dset3[:20]

array([b'A', b'AABA', b'AAL', b'AAMRQ', b'AAP', b'AAPL', b'ABBV', b'ABC',
       b'ABI', b'ABKFQ', b'ABMD', b'ABS', b'ABT', b'ABX', b'ACAS',
       b'ACKH', b'ACN', b'ACS', b'ACV', b'ADBE'], dtype='|S5')

In [14]:
dset4 = h['block0_values']
dset4.shape

(6451, 1110)

In [15]:
dset4[0]

array([nan, nan, nan, ..., nan, nan, nan])
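
The shapes suggest that `block0_values` holds one row per timestamp in `axis1` and one column per ticker in `block0_items`. Under that assumption, the pieces could be reassembled into a labeled table roughly as follows. `build_frame` is a hypothetical helper, and the small arrays are synthetic stand-ins rather than data from the file:

```python
import numpy as np
import pandas as pd

def build_frame(values, items, index_ns):
    """Assemble a DataFrame from raw arrays laid out like the group above.

    Assumes rows follow the timestamps and columns follow the tickers.
    """
    columns = [t.decode('ascii') for t in items]
    index = pd.to_datetime(np.asarray(index_ns), unit='ns')
    return pd.DataFrame(np.asarray(values), index=index, columns=columns)

# Synthetic stand-ins with the same layout (2 dates x 2 tickers).
values = np.array([[np.nan, 0.01],
                   [0.02, np.nan]])
items = np.array([b'A', b'AAPL'], dtype='|S5')
index_ns = np.array([820540800000000000, 820627200000000000])

df = build_frame(values, items, index_ns)

# Fraction of missing values per ticker.
print(df.isna().mean())
```

With the real file, the call would be along the lines of `build_frame(h['block0_values'][:], h['block0_items'][:], h['axis1'][:])`. An all-NaN first row may simply mean that few tickers have a return recorded on the first date, though that is a guess until the whole array is inspected.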

# Other File

Let's take a look at the other file, universe_sp500.h5, as well.

In [16]:
f = h5py.File('../../universe_sp500.h5', 'r')

In [17]:
f.keys()

<KeysViewHDF5 ['sp500']>

In [19]:
h = f['sp500']
h.keys()

<KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values']>

In [20]:
dset = h['axis0']
dset.shape

(1110,)

In [21]:
dset

<HDF5 dataset "axis0": shape (1110,), type "|S5">

In [22]:
dset[:20]

array([b'A', b'AABA', b'AAL', b'AAMRQ', b'AAP', b'AAPL', b'ABBV', b'ABC',
       b'ABI', b'ABKFQ', b'ABMD', b'ABS', b'ABT', b'ABX', b'ACAS',
       b'ACKH', b'ACN', b'ACS', b'ACV', b'ADBE'], dtype='|S5')

In [23]:
dset2 = h['axis1']
dset2.shape

(9352,)

In [24]:
dset2[:20]

array([820540800000000000, 820627200000000000, 820713600000000000,
       820800000000000000, 820886400000000000, 820972800000000000,
       821059200000000000, 821145600000000000, 821232000000000000,
       821318400000000000, 821404800000000000, 821491200000000000,
       821577600000000000, 821664000000000000, 821750400000000000,
       821836800000000000, 821923200000000000, 822009600000000000,
       822096000000000000, 822182400000000000])
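
This time axis is longer than the one in the returns file (9352 entries versus 6451), and the values are spaced differently. A quick comparison of the gaps between consecutive timestamps, using hand-copied stand-ins for the first five values of each axis:

```python
import numpy as np

NS_PER_DAY = 86_400 * 10**9  # nanoseconds in one day

# First five timestamps from returns_sp500.h5 (copied from the earlier output).
returns_axis = np.array([820540800000000000, 820627200000000000,
                         820713600000000000, 820800000000000000,
                         821059200000000000])

# First five timestamps from universe_sp500.h5 (copied from the output above).
universe_axis = np.array([820540800000000000, 820627200000000000,
                          820713600000000000, 820800000000000000,
                          820886400000000000])

# Gap between consecutive entries, in whole days.
returns_gaps = np.diff(returns_axis) // NS_PER_DAY
universe_gaps = np.diff(universe_axis) // NS_PER_DAY
print(returns_gaps, universe_gaps)
```

In this sample the returns axis jumps three days at one point while the universe axis advances one day at a time, which would be consistent with trading days in the first file and calendar days in the second.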

In [25]:
dset3 = h['block0_items']
dset3.shape

(1110,)

In [26]:
dset3[:20]

array([b'A', b'AABA', b'AAL', b'AAMRQ', b'AAP', b'AAPL', b'ABBV', b'ABC',
       b'ABI', b'ABKFQ', b'ABMD', b'ABS', b'ABT', b'ABX', b'ACAS',
       b'ACKH', b'ACN', b'ACS', b'ACV', b'ADBE'], dtype='|S5')

In [30]:
dset4 = h['block0_values']
dset4.shape

(1,)

In [32]:
dset4

<HDF5 dataset "block0_values": shape (1,), type "|O">
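
Here `block0_values` is a single object-typed entry rather than a numeric array, so it cannot be sliced like the returns data. The group layout (`axis0`, `axis1`, `block0_items`, `block0_values`) resembles what pandas writes in its fixed HDF5 format, so one option worth trying is to let pandas read the file directly. This is a sketch under that assumption: `load_frame` is a hypothetical wrapper, and `pd.read_hdf` needs the optional PyTables package installed.

```python
import os
import pandas as pd

def load_frame(path, key):
    """Read one pandas object from an HDF5 file (assumes pandas wrote it)."""
    return pd.read_hdf(path, key=key)

path = '../../universe_sp500.h5'  # same path as used above
if os.path.exists(path):
    universe = load_frame(path, 'sp500')
    print(universe.shape)
```

If the assumption is right, the result should be a table with 9352 rows and 1110 columns, matching `axis1` and `axis0`.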