# SPS Python Tutorial, Day 1
---
*January 29/30, 2020*

Luc Le Pottier, University of Michigan

## **2: data storage and loading**
This section will introduce tensor libraries and compressed file format libraries, such as
- the `numpy` module:
    - `ndarray` data storage object and associated functions
    - basic numerical constants and mathematical functions
    - loading and saving datasets in `npz` format

- the `pandas` module:
    - `DataFrame` object and associated functions
    - data loading and processing functionality

- the `h5py` module:
    - what an HDF5 file is/why it is useful
    - loading and saving datasets to HDF5 format with `h5py`
    

### **2.1: numpy**

We'll start with (probably) the most fundamental Python library, `numpy`. I won't pretend that anyone is unfamiliar with numpy, but I will go over a few things.

**what is a python tensor?**

Tensors in Python are objects with shape $n_1\times n_2 \times ... \times n_m$, containing $\prod_{i=1}^m n_i$ elements total. They are important to us because they are

- fixed in size (very different from the C++ vector, for instance)
- extremely fast to do math with
- *very* useful for machine learning
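
As a quick illustration of the shape and element-count relationship above, here is a minimal sketch (the variable names are only for illustration). It also shows what "fixed in size" means in practice: appending does not grow an array, it builds a new one.

```python
import numpy as np

# a rank-3 tensor with shape 5 x 10 x 2
t = np.zeros((5, 10, 2))

n_elements = t.size                     # total number of elements
shape_product = int(np.prod(t.shape))   # product of the dimensions

# arrays are fixed in size: "appending" returns a new, longer array
flat = t.ravel()
longer = np.append(flat, 1.0)

print(t.shape, n_elements, shape_product)
print(flat.size, longer.size)
```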

There are plenty of python libraries full of functions and objects which provide easy creation, manipulation, and interpretation of tensors. Numpy, pandas, and many of the machine learning libraries we will see later all fall into this classification. 

numpy itself is useful because it is 
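- implemented mostly in C (with linear algebra delegated to BLAS/LAPACK libraries, historically written in Fortran), meaning it is a *lot* faster than pure Python loops (often 10-100 times)
- full of useful but esoteric data processing functions
- the proud owner of its own binary file formats, `.npy` and `.npz`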

Let's look at some numpy features:

In [None]:
# the conventional alias
import numpy as np

##### new arrays
New arrays can be created from zeros, ones, raw memory (empty), random distributions, etc.

In [None]:
# zeros
z0 = np.zeros((5,10,2))

# ones
z1 = np.ones((3,2))

# empty
z_empty = np.empty((10,7))

# uniform data
uniform = np.random.uniform(3, 4, 10)

# gaussian data in any shape
gaussian = np.random.normal(loc=1, scale=0.5, size=(3, 5))

# ones tensor in the shape of the gaussian tensor
gaussian0 = np.ones_like(gaussian)

# z_empty is uninitialized, so it may contain arbitrary leftover values
z_empty

##### conversion from other types

Many types can be converted; in most cases you can simply pass the object to `np.asarray` and inspect the result.

Generators are a special case, handled by `np.fromiter`. This is useful when you want to inspect the output of a generator, e.g. for debugging. Generators are usually used for massive amounts of data, e.g. when you don't want to (or cannot) load all of your data into memory at once, but you still need to analyze / train models on that data.

This example is not that case, but is still interesting. Note that `count` is passed to `np.fromiter`: numpy arrays are fixed in size, so knowing the length up front lets numpy allocate the output once (if `count` is omitted, numpy grows the array as it consumes the generator):

In [None]:
# from lists
a = [1, 5, 2, 3, -1]
np_arr = np.asarray(a)

# convert from generators
def data_generator(N):
    for i in range(N):
        # some resource intensive calculation
        yield i**2 - np.log(i + 1)
        
n = 50
gen = data_generator(n)
np_gen = np.fromiter(gen, dtype=float, count=n)
np_gen
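
To make the role of `count` concrete, the sketch below calls `np.fromiter` with and without it. The `squares` generator is a hypothetical stand-in for an expensive data source, not part of the tutorial's data.

```python
import numpy as np

def squares(n):
    # hypothetical stand-in for an expensive data source
    for i in range(n):
        yield float(i**2)

# count given: numpy can allocate the output array once
with_count = np.fromiter(squares(50), dtype=float, count=50)

# count omitted: still works, the output is grown while the generator is consumed
without_count = np.fromiter(squares(50), dtype=float)

# count smaller than the generator length: only the first items are read
first_ten = np.fromiter(squares(50), dtype=float, count=10)

print(with_count.shape, without_count.shape, first_ten.shape)
```

The last call is handy for debugging, since it peeks at the start of a long generator without exhausting it.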

##### vectorizing operations
There are many parts of numpy which can provide *huge* time savings (in exchange for memory) by vectorizing for-loops. `np.tile` is specifically useful for this goal.

An example of when this might be necessary is if you wanted to calculate all combinations of sums of two vectors, which would return a matrix. This can be done with a nested for-loop or in vectorized form. Both are shown here:

In [None]:
from time import time

n1, n2 = 1000, 600

v1 = np.random.uniform(0, 10, n1)
v2 = np.random.uniform(-5, -1, n2)

# for-loop method:
t0 = time()
for_result = []

for i in range(len(v1)):
    for_result.append([])
    for j in range(len(v2)):
        for_result[i].append(v1[i] + v2[j])
for_result = np.asarray(for_result)

print('{:>15}: {:.8f} s'.format('for-loop method', time() - t0))
    
# vectorized method:
t0 = time()
result = np.tile(v1, [len(v2), 1]).T + np.tile(v2, [len(v1), 1])
print('{:>15}: {:.8f} s'.format('vector method', time() - t0))

print(np.isclose(result, for_result).all())

We can see that the vectorized method is *generally* 1-2 orders of magnitude faster, especially as the vector sizes $n_1$ and $n_2$ grow large.
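
As an aside, the same matrix of pairwise sums can be built without `np.tile`, using numpy broadcasting or `np.add.outer`. This is a sketch of an alternative, not the method used above; it avoids materializing the two tiled copies before adding.

```python
import numpy as np

v1 = np.random.uniform(0, 10, 1000)
v2 = np.random.uniform(-5, -1, 600)

# the tile-based version, for comparison
tiled = np.tile(v1, [len(v2), 1]).T + np.tile(v2, [len(v1), 1])

# broadcasting: a (1000, 1) column plus a (1, 600) row gives a (1000, 600) matrix
broadcast = v1[:, np.newaxis] + v2[np.newaxis, :]

# the same operation expressed as an outer sum
outer = np.add.outer(v1, v2)

print(broadcast.shape)
print(np.allclose(tiled, broadcast), np.allclose(broadcast, outer))
```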

##### data formats

We can also save huge datasets using numpy's custom data format, `.npy`.

We can save one array:

In [None]:
big_data = np.random.normal(loc=5.0, scale=2., size=(10000,200))
np.save('test', big_data)

reloaded = np.load('test.npy')

np.isclose(reloaded, big_data).all()

Or, we can save multiple arrays to a zip-style `.npz` file, which can then be reloaded as an archive:

In [None]:
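# random rather than np.empty data: uninitialized memory may contain NaN,
# and NaN never compares as close to itself in np.isclose
other_data = np.random.uniform(size=(100, 20))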

np.savez('test', big_data=big_data, other_data=other_data)

archive = np.load('test.npz')
np.isclose(archive['other_data'], other_data).all()

##### using .npy and .npz files
pros:
- significantly faster than using .csv files
- `.npz` bundles several named arrays into a single archive file
- lazy-loaded; i.e. loading an archive does not read every array into memory, only the ones you access

cons:
- specific to python & numpy
- partial loading must be planned at save time: an array in an archive is read whole, so data has to be split into separate arrays when saving
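
The sketch below illustrates the lazy-loading point, and one partial workaround for the last con: a standalone `.npy` file (not an archive member) can be memory-mapped with `mmap_mode`, so that only the requested slice is read. The file names and the temporary directory are assumptions made for the demonstration.

```python
import os
import tempfile
import numpy as np

tmp_dir = tempfile.mkdtemp()  # scratch directory for this demonstration

big = np.random.normal(size=(1000, 20))
small = np.arange(10)

# several named arrays in one archive
npz_path = os.path.join(tmp_dir, 'demo.npz')
np.savez(npz_path, big=big, small=small)

with np.load(npz_path) as archive:
    names = sorted(archive.files)     # listing the names reads no array data
    small_loaded = archive['small']   # only this member is read from disk

# a single .npy file can be memory-mapped, so a slice is read without loading the rest
npy_path = os.path.join(tmp_dir, 'big.npy')
np.save(npy_path, big)
mapped = np.load(npy_path, mmap_mode='r')
first_rows = np.array(mapped[:5])

print(names, small_loaded.shape, first_rows.shape)
```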

### 2.2: pandas

Pandas is an awesome module built on top of numpy. It provides
- category visualization
- plotting
- fast numpy-based math
- builtin file-reading functions

... among other things. We will briefly show how useful it can be by constructing a pandas DataFrame, the main matrix-like object. 

In [None]:
import pandas as pd


# make random data of size N
N = 150
year = np.random.randint(1950, 1990, N)

income = 1e3*np.round(np.random.normal(loc=62, scale=20, size=N))
income[income < 1000] = 1000

# some fake relation between income and # cars
n_cars = (income**3./(income**3.).mean() + np.random.uniform(-.5, 1.0, N)).astype(int)

data = pd.DataFrame({'birth_year': year, 'income': income, 'n_cars': n_cars})

We can then do tons of things with the dataset.

In [None]:
# first 10 entries
data.head(10)

In [None]:
# specific variable access
data.birth_year.head(10)

In [None]:
# value counting
data.n_cars.value_counts()

##### composite columns
It is easy to add columns to a dataframe, like a dictionary:

In [None]:
data['income_per_car'] = data.income/data.n_cars

data.income_per_car.head(10)

We can also replace bad entries, e.g. the infinities produced by division by zero, and fill NaN entries.

In [None]:
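# assign the result back: an in-place replace on a selected column may not update `data` in recent pandas versions
data['income_per_car'] = data.income_per_car.replace(np.inf, np.nan)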
data.income_per_car.head()

In [None]:
data.fillna(0, inplace=True)
data.head(10)
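
To see the two clean-up steps in isolation, here is a minimal sketch on a tiny hand-made DataFrame (the values are invented for illustration and are not from the dataset above). Dividing a float column by zero gives `inf` rather than raising an error, which is why the replacement step is needed.

```python
import numpy as np
import pandas as pd

# tiny illustrative table; the middle row has zero cars
df = pd.DataFrame({'income': [50000., 62000., 30000.], 'n_cars': [2, 0, 1]})

# float division by zero yields inf, not an exception
df['income_per_car'] = df['income'] / df['n_cars']

# step 1: inf -> NaN, step 2: NaN -> 0
cleaned = df['income_per_car'].replace(np.inf, np.nan).fillna(0)

print(df['income_per_car'].tolist())
print(cleaned.tolist())
```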

##### pandas plotting
pandas dataframes have some built-in plotting functionality, though it is slightly limited. Some of the more interesting ones are boxplots:

In [None]:
_ = data.boxplot('income', 'n_cars')

histograms:

In [None]:
_ = data.income_per_car[data.income_per_car > 0].hist(bins=20)

... and scatter plots:

In [None]:
_ = data.plot('income', 'birth_year', kind='scatter')

We will get more into the plotting later.

##### loading/saving data

pandas provides tons of functions for loading and saving data; the loading functions are listed below:

In [None]:
for elt in dir(pd):
    if elt.startswith('read_'):
        print(elt)

This can be super nice when dealing with unusual data sources, e.g. Excel files, the clipboard, SQL databases, etc.
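
Many of the `read_*` functions have a matching `to_*` method on the DataFrame for saving. As a minimal sketch, the round trip below writes a small made-up table to CSV and reads it back; an in-memory `io.StringIO` buffer stands in for a file on disk so the example leaves nothing behind.

```python
import io
import pandas as pd

# small made-up table for the demonstration
df = pd.DataFrame({'n_cars': [1, 2, 3], 'income': [0.5, 1.5, 2.5]})

# write to CSV; index=False leaves out the row labels
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# rewind the buffer and read it back
buffer.seek(0)
loaded = pd.read_csv(buffer)

print(loaded)
print(loaded.equals(df))
```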

Let's load up a .csv file (provided by Google Colab):

In [None]:
housing = pd.read_csv('sample_data/california_housing_train.csv')
housing.head(10)

We will leave the statistical analysis of this file for the machine learning portion (tomorrow).

### 2.3: HDF5 files

HDF5 is a file format which essentially fixes the aforementioned numpy file issues. It is

- cross-language
- able to load any subset of data
- very fast
- organized in an awesome hierarchical structure

Let's look at these things in demonstration.

##### hierarchical structure

We can use the test dataset we had before (housing) as an example of this. Say we want to save the median_house_value data only, along with a matrix of latitude and longitude pair coordinates.

H5 files have a group/dataset structure. You can create
- groups: named directories for datasets
- datasets: named tensors of values

H5 files are edited in place and stay open for as long as the `File` object is open. You cannot create a dataset or group under a name that already exists, and re-opening a file in `'w'` mode while it is still open will typically fail, so call `f.close()` before running this cell a second time.

In [None]:
# API module for h5 files
import h5py

f = h5py.File('test.h5', 'w')

pos_group = f.create_group('position')
pos_group.create_dataset('latitude', data=housing.latitude.values)
pos_group.create_dataset('longitude', data=housing.longitude.values)

f.create_dataset('value', data=housing.median_house_value.values)

print(f)

once we have an h5 file, we can inspect its contents:

In [None]:
print(f.keys())
print(f['position'])
print(f['position'].keys())
print(f['position']['latitude'])
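
One way to avoid leaving a file open is to use `h5py.File` as a context manager, which closes the file when the block ends. The sketch below is self-contained and uses made-up latitude values and a temporary path (both assumptions for the demonstration); it also shows that nested items can be addressed with a path-like key.

```python
import os
import tempfile
import numpy as np
import h5py

# temporary file and made-up data for the demonstration
path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
latitude = np.linspace(32.0, 42.0, 500)

# writing: the file is closed automatically when the with-block ends
with h5py.File(path, 'w') as h5:
    group = h5.create_group('position')
    group.create_dataset('latitude', data=latitude)

# reading: 'position/latitude' is equivalent to ['position']['latitude']
with h5py.File(path, 'r') as h5:
    top_level = list(h5.keys())
    n_rows = h5['position/latitude'].shape[0]
    first = h5['position/latitude'][:5]

print(top_level, n_rows, first)
```

Because each block closes the file on exit, this version can be re-run as often as needed.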

##### subset loading

Datasets will NOT be loaded unless called for - super useful for big files. To give an example of this, we can load slices of the `latitude` dataset without actually loading the whole thing into memory:

In [None]:
f['position']['latitude'][0:100]

In fact, this makes it trivial to write a function for loading data in batches for processing. As an example, the following function loads a single batch of a given size, and is called here with batches of 100:

In [None]:
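def get_next_housing_batch(n_per_batch, batch_n, file):
    min_ = n_per_batch*batch_n
    max_ = min_ + n_per_batch
    idx = slice(min_, max_)

    # read only the requested slice of each dataset from disk
    columns = {key: file['position'][key][idx] for key in file['position'].keys()}
    columns['value'] = file['value'][idx]

    # index by original row number; a short final batch simply gets a shorter index
    n_rows = len(columns['value'])
    return pd.DataFrame(columns, index=range(min_, min_ + n_rows))

# batch number 52 (counting from zero), i.e. rows 5200-5299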
batch = get_next_housing_batch(n_per_batch=100, batch_n=52, file=f)
batch.head(10)
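
The same slicing idea extends to looping over a whole dataset. Below is a minimal sketch of a batch generator; `iter_batches` is a hypothetical helper, not part of `h5py`. It only relies on `len()` and slicing, which h5py datasets support, and it is demonstrated on a plain numpy array so that it runs without an h5 file.

```python
import numpy as np

def iter_batches(dataset, n_per_batch):
    """Yield consecutive slices of `dataset`, each with at most n_per_batch rows."""
    for start in range(0, len(dataset), n_per_batch):
        # for an h5py dataset, this slice is the only part read from disk
        yield dataset[start:start + n_per_batch]

# stand-in for a dataset such as f['value']
values = np.arange(250)

batch_sizes = [len(batch) for batch in iter_batches(values, 100)]
print(batch_sizes)
```

The final batch is simply shorter when the dataset length is not a multiple of the batch size.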

##### speed
We won't show that directly here, but generally h5 files are both faster and more configurable than the `npz` format.

### section 2 summary

That wraps up the (boring) information about data loading and storage. While this isn't the most interesting topic, it gets **super** important for machine learning.

NEXT, the fun part - plotting!!