In [None]:
!pip install git+https://github.com/pswpswpsw/nif.git

Collecting git+https://github.com/pswpswpsw/nif.git
  Cloning https://github.com/pswpswpsw/nif.git to /tmp/pip-req-build-fcp_ean3
  Running command git clone -q https://github.com/pswpswpsw/nif.git /tmp/pip-req-build-fcp_ean3


# How to train on a large dataset (>5GB) that cannot fit in memory?

### Note 

- GPU memory is precious; don't waste it on holding the dataset.
- It can be super interesting to train NIF on a HUGE dataset!
- In this Google Colab, I will demonstrate this using [tensorflow record](https://www.tensorflow.org/tutorials/load_data/tfrecord),
which is a simple format for storing a sequence of binary records.
- First you will learn how to obtain `tfrecord` files from an `npz` file, then you will
learn how to train across different `tfrecord` files.

## 0. Prepare a "large" numpy `npz` to play with

- This "fictitious" dataset has 7 dimensions,
$$(t,x,y,z,u,v,w)$$
where $u,v,w$ are the three components of the velocity field.
- Thus, we have **4 features, 3 targets, and no weight**. We will need this information to create the dataset.

- Let's call it `Big_npz_file.npz`. It is about 54 MB and holds $10^6$ points, each containing 7 real values.
- The array is stored under the key `data`.
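
As a quick sanity check on the file size quoted above, the arithmetic can be sketched in plain Python. It assumes the values are stored as float64 (the dtype `np.random.uniform` returns) and that `np.savez` writes the array uncompressed, so the file is only slightly larger than the raw array.

```python
# Back-of-the-envelope size of the array written by np.savez (uncompressed).
n_points = 1_000_000      # rows in the demo array
n_columns = 7             # t, x, y, z, u, v, w
bytes_per_value = 8       # float64 (assumed dtype of the stored array)

n_bytes = n_points * n_columns * bytes_per_value
size_mb = n_bytes / 1e6       # decimal megabytes
size_mib = n_bytes / 2**20    # binary mebibytes, roughly what `ls -h` reports

print(f"{n_bytes} bytes = {size_mb:.1f} MB = {size_mib:.1f} MiB")
```

The same formula gives a rough estimate for a real dataset before deciding how many points to put in each file.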


In [None]:
import numpy as np

data = np.random.uniform(0,1,(1000000,7))
np.savez('Big_npz_file.npz', data=data)

## 1. Generate `tfrecord` files from a big `npz`

First, you need to import the `TFRDataset` class from `nif.data.tfr_dataset`.

In [None]:
from nif.data.tfr_dataset import TFRDataset

1 Physical GPUs, 1 Logical GPUs


Second, create an instance, a "file handler", and call it `fh`, given `n_feature = 4` and `n_target = 3`.

In [None]:
fh = TFRDataset(n_feature=4, n_target=3)

### Then, you need to decide 
1. How many points to store in one single **tfrecord** file ----> `num_pts_per_file`
    - it doesn't have to divide the total number of points evenly; if it doesn't, the last file is simply made smaller.
    - The official TensorFlow documentation [suggests](https://www.tensorflow.org/tutorials/load_data/tfrecord) around 100 MB per file.
    
2. What's the `npz` filepath? ----> `npz_path`
3. What's the key to access the 2-D matrix of pointwise data? ----> `npz_key`
4. Where do you want to put the generated tfrecord files? ----> `tfr_path`
5. What's the prefix you want to put in the filenames of the tfrecord files? ----> `prefix`
    - they will be something like `prefix_0.tfrecord`, ..., `prefix_10.tfrecord`, ...
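
To make the choice of `num_pts_per_file` concrete, here is a minimal sketch of how a sequential split could be planned. The helper `plan_tfr_files` is hypothetical and is not part of `nif`; it only illustrates the "last file is smaller" rule, and it assumes the `prefix_i.tfrecord` naming that appears in the directory listing of `TFR_dir`.

```python
import math

def plan_tfr_files(n_points, num_pts_per_file, prefix):
    """Return (file_name, n_points_in_file) pairs for a sequential split.

    Hypothetical helper for illustration only, not the nif implementation.
    """
    num_pts_per_file = int(num_pts_per_file)  # accept floats such as 1e4
    n_files = math.ceil(n_points / num_pts_per_file)
    plan = []
    for i in range(n_files):
        start = i * num_pts_per_file
        stop = min(start + num_pts_per_file, n_points)  # last file may be shorter
        plan.append((f"{prefix}_{i}.tfrecord", stop - start))
    return plan

# The demo setting: 10^6 points, 10^4 points per file.
plan = plan_tfr_files(n_points=1_000_000, num_pts_per_file=1e4, prefix="case1")
print(len(plan), plan[0], plan[-1])

# An uneven split: the last file holds the remainder.
uneven = plan_tfr_files(n_points=25, num_pts_per_file=10, prefix="demo")
print(uneven)
```

With the demo settings this plan has 100 entries, which agrees with the `total number of TFR files =  100` line printed by `create_from_npz`.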

### If you have answers to the above in mind, you can now just call
### `fh.create_from_npz(num_pts_per_file, npz_path, npz_key, tfr_path, prefix)`

In [None]:
fh.create_from_npz(num_pts_per_file=1e4, npz_path='Big_npz_file.npz',
                   npz_key='data', tfr_path='TFR_dir',
                   prefix="case1")

total number of TFR files =  100
working in 1-th file... total 100
working in 2-th file... total 100
working in 3-th file... total 100
working in 4-th file... total 100
working in 5-th file... total 100
working in 6-th file... total 100
working in 7-th file... total 100
working in 8-th file... total 100
working in 9-th file... total 100
working in 10-th file... total 100
working in 11-th file... total 100
working in 12-th file... total 100
working in 13-th file... total 100
working in 14-th file... total 100
working in 15-th file... total 100
working in 16-th file... total 100
working in 17-th file... total 100
working in 18-th file... total 100
working in 19-th file... total 100
working in 20-th file... total 100
working in 21-th file... total 100
working in 22-th file... total 100
working in 23-th file... total 100
working in 24-th file... total 100
working in 25-th file... total 100
working in 26-th file... total 100
working in 27-th file... total 100
working in 28-th file... total 

### Now, you can see the **tfrecord** files that you generated! 

In [None]:
%ls TFR_dir -lrth

total 27M
-rw-r--r-- 1 root root 274K May  3 22:21 case1_0.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_1.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_2.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_3.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_4.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_5.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_6.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_7.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_8.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_9.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_10.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_11.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_12.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_13.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_14.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_15.tfrecord
-rw-r--r-- 1 root root 274K May  3 22:21 case1_16.tfreco

## 2. How to train a Keras model on the tfrecord files generated above?

### Let's create a toy Model first

In [None]:
import tensorflow as tf 

# 4 input features: t, x, y, z
x = tf.keras.Input(shape=(4,))
# a single dense layer mapping to the 3 targets: u, v, w
l1 = tf.keras.layers.Dense(3, activation='tanh')
model = tf.keras.Model([x], [l1(x)])
model.compile(optimizer='adam', loss='mse')

Next, we generate a **meta dataset** (or, if you prefer, a **hyper dataset**). We will do something I call **sub-dataset batching**.

- First, you need to decide how many epochs you want to run ----> `epoch`
- Then we call the method `.get_tfr_meta_dataset(tfr_path, epoch)`

In [None]:
epoch = 2
meta_dataset = fh.get_tfr_meta_dataset(tfr_path='TFR_dir', epoch=epoch)

### Finally, we do sub-dataset batching with `meta_dataset`

- we obtain `batch_file`, which is basically a set of Tensors, from `meta_dataset`
- we create a `sub` dataset from `batch_file`

- you can add your favourite callbacks if you want
- the default `epochs` in `model.fit` is 1, which makes sense in this context

In [None]:
callbacks = []
batch_size = 128

for batch_file in meta_dataset:
    batch_dataset = fh.gen_dataset_from_batch_file(batch_file, batch_size)
    model.fit(batch_dataset, verbose='auto', callbacks=callbacks)

 1/79 [..............................] - ETA: 1s - loss: 0.0830
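
The `79` in the progress bar can be checked by hand. The sketch below assumes that each `batch_file` covers exactly one tfrecord file of $10^4$ points; that assumption is consistent with the step count shown above but is not confirmed by the `nif` source. Under the same assumption it also estimates how many times `model.fit` is called in total.

```python
import math

points_per_file = 10_000   # num_pts_per_file used when writing the files
batch_size = 128           # batch size passed to gen_dataset_from_batch_file
n_files = 100              # number of tfrecord files in TFR_dir
epoch = 2                  # passes over all files

# Steps shown by the Keras progress bar for one fit call.
steps_per_fit = math.ceil(points_per_file / batch_size)

# Total number of fit calls, ASSUMING one tfrecord file per batch_file.
fit_calls = n_files * epoch

print(steps_per_fit, fit_calls)
```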

### Caution!
- This is not perfect: since we call `.fit` once per `tfrecord` file, the epoch information is lost after each call to `.fit`.
- We can work around it by using a callback that explicitly appends the loss to a file on disk.
- **Again, what is presented here is not perfect, and I am open to your solutions!**
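
One possible shape for such a workaround is sketched below. The class `AppendLossLog` is a hypothetical helper, not part of `nif` or Keras: it keeps its own global step counter and appends one row per call, so the log keeps growing across repeated `.fit` calls. The demo uses made-up loss values in place of real training.

```python
import csv
import os
import tempfile

class AppendLossLog:
    """Append (global_step, loss) rows to a CSV file across many fit calls.

    Hypothetical helper for illustration; the call signature mirrors the
    (epoch, logs) arguments Keras passes to on_epoch_end.
    """

    def __init__(self, path):
        self.path = path
        self.step = 0
        if os.path.exists(path):
            # Resume the counter from an existing log (minus the header row).
            with open(path, newline="") as f:
                self.step = max(sum(1 for _ in csv.reader(f)) - 1, 0)
        else:
            with open(path, "w", newline="") as f:
                csv.writer(f).writerow(["global_step", "loss"])

    def __call__(self, epoch, logs=None):
        # `epoch` restarts at 0 on every model.fit call, so it is ignored.
        loss = (logs or {}).get("loss")
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow([self.step, loss])
        self.step += 1

# Demo with made-up losses standing in for two separate model.fit calls.
with tempfile.TemporaryDirectory() as tmp:
    log = AppendLossLog(os.path.join(tmp, "loss_log.csv"))
    for fake_loss in (0.5, 0.25):
        log(0, {"loss": fake_loss})
    with open(log.path, newline="") as f:
        rows = list(csv.reader(f))

print(rows)
```

To hook it into the training loop, one option is to wrap an instance in the standard Keras `LambdaCallback`, e.g. `callbacks = [tf.keras.callbacks.LambdaCallback(on_epoch_end=AppendLossLog('loss_log.csv'))]`, created once before the `for` loop so that the counter is shared by all `.fit` calls.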

### Question:

- **Why do you chop the tfrecord files like this?** It does not seem to be what the official documentation suggests, and the syntax here does not look very software-optimized.

### **Answer:** 

- Because the standard way of using `tfrecord` takes each data point, i.e., $(t,x,y,z,u,v,w)$, as a `tf.train.Example`. When you have tens of millions or hundreds of millions of data points, **generating such a dataset (not using it for training)** has been reported to have performance issues if your algorithm contains any native Python loop, which is not avoidable at the current stage.

- In `nif.data.tfr_dataset.TFRDataset`, we subsample the npz file in sequential order and take the data along each dimension (e.g., $(t_0,...,t_M)$) as a `tf.train.Example`, instead of treating each single point `(t,x,y,z,u,v,w)` as an example. So we have 7 `Example`s in this demo.

- In this way, the performance issue is greatly reduced. I believe the reason is that we leave the inner looping over all datapoints to the low-level

- So far, I can generate tfrecord files from tens of gigabytes of data within 30 minutes to 1 hour. Whereas if I tried to make each data point a training example, it may take 
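
The difference between the two layouts can be illustrated without TensorFlow. The sketch below is only an analogy: it uses the standard-library `array` module as a stand-in for serializing a `tf.train.Example`, and it is not how `TFRDataset` is implemented. It shows that the per-point layout needs one Python-level iteration per data point, while the per-dimension layout needs only 7, with each column packed in a single low-level call.

```python
from array import array

# Tiny stand-in for the (N, 7) array: 5 points, columns t, x, y, z, u, v, w.
rows = [[float(10 * i + j) for j in range(7)] for i in range(5)]

# Layout A: one record per data point -> the Python loop runs N times.
per_point = [array("d", row).tobytes() for row in rows]

# Layout B: one record per dimension -> the Python loop runs 7 times,
# and each whole column is packed by a single C-level call.
columns = list(zip(*rows))
per_column = [array("d", col).tobytes() for col in columns]

print(len(per_point), len(per_column))

# Reading layout B back: decode the 7 columns, then re-assemble the points.
decoded = [array("d", blob).tolist() for blob in per_column]
restored = [list(point) for point in zip(*decoded)]
print(restored == rows)
```

With $N$ in the tens of millions, layout A runs tens of millions of Python iterations per dataset, whereas layout B still runs 7 per file.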

### Let me know <shawnpan@uw.edu> if you have a better solution for this, i.e., one that does not use the meta-dataset for sub-dataset batching