## Data Loaders

This notebook demonstrates the use of different data loaders in `tgml`. The job of a data loader is to pull data from the TigerGraph database. Currently, the following data loaders are provided:
* EdgeLoader, which returns either the whole edgelist or batches of edges. Edge attributes are not supported currently.
* VertexLoader, which returns either all the vertices or batches of vertices. Vertex attributes are supported.
* GraphLoader, which returns the whole graph in `PyG` format.
* NeighborLoader, which returns subgraphs using neighbor sampling.

Every data loader above can either stream data directly from the server to the user or cache it in cloud storage. In the latter case, data is first moved to cloud storage and then downloaded to the local machine, so it is slower than streaming directly from the server. However, when multiple consumers need the same data, such as when trying out different models in parallel or tuning hyperparameters, cloud caching reduces the load on the server and may therefore be faster than having every consumer hit the server at the same time.

Note: For the data loaders to work, the [Graph Data Processing Service](https://github.com/TigerGraph-DevLabs/GDPS) has to be running on the TigerGraph server.

### Define Graph

Conceptually, the `TigerGraph` class represents the graph stored in the database. Under the hood, it stores the necessary information to communicate with the TigerGraph database. It can read `username` and `password` from environment variables `TGUSERNAME` and `TGPASSWORD`. Hence, we recommend storing those credentials in the environment variables or in a `.env` file instead of hardcoding them in code. However, if you do provide `username` and `password` to this class constructor, the environment variables will be ignored.
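
The precedence between constructor arguments and environment variables can be pictured with a few lines of plain Python. This is only a sketch, not `tgml`'s actual implementation: `resolve_credentials` is a hypothetical helper, and falling back per field when only one argument is given is an assumption. Only the variable names `TGUSERNAME` and `TGPASSWORD` come from the text above.

```python
import os


def resolve_credentials(username=None, password=None):
    """Hypothetical helper: explicit arguments win over environment variables."""
    # Assumption: each field falls back to its environment variable on its own.
    if username is None:
        username = os.environ.get("TGUSERNAME")
    if password is None:
        password = os.environ.get("TGPASSWORD")
    return username, password
```

With both variables set in the environment, `resolve_credentials()` returns them, while `resolve_credentials("alice", "secret")` ignores the environment entirely.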

In [None]:
from tgml.data import TigerGraph

Args to the `TigerGraph` class:
*    host (str, optional): Address of the server. Defaults to "http://localhost".
*    graph (str, optional): Name of the graph. Defaults to None.
*    username (str, optional): Username. Defaults to None.
*    password (str, optional): Password for the user. Defaults to None.
*    rest_port (str, optional): Port for the REST endpoint. Defaults to "9000".
*    gs_port (str, optional): Port for GraphStudio. Defaults to "14240".

In [None]:
tgraph = TigerGraph(host = "http://35.230.92.92",
                    graph = "Cora",
                    username = "tigergraph",
                    password = "tigergraphml")

In [None]:
tgraph.info()

In [None]:
tgraph.number_of_vertices()

In [None]:
tgraph.number_of_vertices("Paper")

In [None]:
tgraph.number_of_vertices(filter_by = "train_mask")

In [None]:
tgraph.number_of_vertices(vertex_type = "Paper", filter_by = "train_mask")

In [None]:
tgraph.number_of_edges()

In [None]:
tgraph.number_of_edges("Cite")

### Edge Loader

In [None]:
from tgml.dataloaders import EdgeLoader

The first time you initialize the loader on a graph in TigerGraph, the initialization might take a minute because it installs the corresponding query in the database and optimizes it. The query only needs to be installed once, so initializing the loader on the same TigerGraph graph again takes almost no time.

There are two ways to use the data loader. 
* First, it can be used as an iterator, which means you can loop through it to get every batch of data. If you load all edges at once, there will be only one batch (of all the edges) in the iterator. 
* Second, you can access the `data` property of the class directly. If there is only one batch of data to load, it will give you the batch directly instead of an iterator, which might make more sense in that case. If there are multiple batches of data to load, it will return the loader itself, which you can then iterate over.
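
The two access patterns can be illustrated with a toy class. This is a sketch of the behaviour described above, not the real `EdgeLoader`: `ToyLoader` is a hypothetical stand-in that batches an in-memory list instead of talking to the database.

```python
class ToyLoader:
    """Toy stand-in for a data loader; not part of tgml."""

    def __init__(self, rows, num_batches=1):
        self.rows = list(rows)
        self.num_batches = num_batches

    def __iter__(self):
        # Split the rows into at most num_batches contiguous chunks.
        size = max(1, -(-len(self.rows) // self.num_batches))  # ceiling division
        for start in range(0, len(self.rows), size):
            yield self.rows[start:start + size]

    @property
    def data(self):
        # One batch: hand back the batch itself.
        # Several batches: hand back the loader, which is iterable.
        if self.num_batches == 1:
            return next(iter(self))
        return self


edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
single = ToyLoader(edges)
batched = ToyLoader(edges, num_batches=2)
```

Here `single.data` is the full list of edges, while `batched.data` is the loader object and must be looped over to obtain the two batches.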

Args to `EdgeLoader` class:
* graph (TigerGraph): Connection to the TigerGraph database.
* batch_size (int, optional): Size of each batch. If given, `num_batches` will be recalculated based on batch size. Defaults to None.
* num_batches (int, optional): Number of batches to split the whole dataset. Defaults to 1.
* local_storage_path (str, optional): Place to store data locally. Defaults to "./tmp".
* cloud_storage_path (str, optional): S3 path used for cloud caching. Defaults to None.
* buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.
* output_format (str, optional): Format of the output data of the loader. Defaults to "dataframe".
* aws_access_key_id (str, optional): AWS access key. Defaults to None.
* aws_secret_access_key (str, optional): AWS access key secret. Defaults to None.
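
The text does not spell out how `num_batches` is recalculated from `batch_size`. A natural assumption is ceiling division, with the last batch possibly smaller than the rest. The helper below is hypothetical and only shows that arithmetic.

```python
import math


def batches_needed(num_items, batch_size):
    """Number of batches when every batch but the last holds batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Assumption: a partial final batch still counts as a batch.
    return math.ceil(num_items / batch_size)
```

For example, 1000 items with `batch_size = 256` would give 4 batches under this assumption: three full ones and a final batch of 232 items.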

If you use cloud caching, cloud storage access keys need to be provided. For AWS S3, `aws_access_key_id` and `aws_secret_access_key` are required. However, the class can read them from the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, and again it is recommended to store those credentials in a `.env` file instead of hardcoding them.

#### Load all edges at once directly to local (default)

In [None]:
%%time
edge_loader = EdgeLoader(tgraph)

In [None]:
%%time
# Use case 1: iterator
data = []
for batch in edge_loader:
    data.append(batch)

In [None]:
data

In [None]:
%%time
# Use case 2: `data` property
data = edge_loader.data

In [None]:
data

#### Stream batches of edges directly to local.

In [None]:
%%time
edge_loader = EdgeLoader(tgraph, batch_size = 256)

In [None]:
%%time
# Use case 1: as an iterator
data = []
for batch in edge_loader:
    data.append(batch)

In [None]:
print("Number of batches: ", len(data))
data

In [None]:
# Use case 2: `data` property
# Since there are multiple batches of data,
# the `data` property will return the loader itself.
data = edge_loader.data

In [None]:
%%time
print("Number of batches: ", sum(1 for batch in data))

### Vertex Loader

In [None]:
from tgml.dataloaders import VertexLoader

The first time you initialize the loader on a graph in TigerGraph, the initialization might take half a minute because it installs the corresponding query in the database and optimizes it. The query only needs to be installed once, so initializing the loader on the same TigerGraph graph again takes almost no time.

There are two ways to use the data loader. 
* First, it can be used as an iterator, which means you can loop through it to get every batch of data. If you load all vertices at once, there will be only one batch of data (of all the vertices) in the iterator. 
* Second, you can access the `data` property of the class directly. If there is only one batch of data, it will give you the batch directly instead of an iterator, which might make more sense in that case. If there are multiple batches of data to load, it will return the loader again.

Args to `VertexLoader` class:
* graph (TigerGraph): Connection to the TigerGraph database.
* batch_size (int, optional): Size of each batch. If given, `num_batches` will be recalculated based on batch size. Defaults to None.
* num_batches (int, optional): Number of batches to split the whole dataset. Defaults to 1.
* attributes (str, optional): Vertex attributes to get, separated by comma. Defaults to "".
* local_storage_path (str, optional): Place to store data locally. Defaults to "./tmp".
* cloud_storage_path (str, optional): S3 path used for cloud caching. Defaults to None.
* buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.
* output_format (str, optional): Format of the output data of the loader. Only pandas dataframe is supported. Defaults to "dataframe".
* aws_access_key_id (str, optional): AWS access key. Defaults to None.
* aws_secret_access_key (str, optional): AWS access key secret. Defaults to None.
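
The `buffer_size` argument describes prefetching batches into memory. One common way to get that behaviour is a bounded queue filled by a background thread. The sketch below assumes that design; it is not taken from `tgml`, the name `prefetch` is hypothetical, and error handling in the producer is left out.

```python
import queue
import threading


def prefetch(batches, buffer_size=4):
    """Yield items from `batches` while a background thread fills a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()  # blocks until the producer has something ready
        if batch is done:
            return
        yield batch
```

Because the queue is bounded, at most `buffer_size` batches wait in memory at any time, and the consumer sees the batches in their original order.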

#### Load all vertices at once directly to local (default)

In [None]:
%%time
vertex_loader = VertexLoader(tgraph, attributes="x,y")
# Note: vertex primary ID will be extracted automatically. 
# No need to specify it as an attribute.

In [None]:
%%time
# Use case 1: as an iterator
data = []
for batch in vertex_loader:
    data.append(batch)

In [None]:
data

In [None]:
%%time
# Use case 2: `data` property
data = vertex_loader.data

In [None]:
data

#### Stream batches of vertices directly to local.

In [None]:
%%time
vertex_loader = VertexLoader(tgraph, 
                             batch_size=100,
                             attributes="x,y")

In [None]:
%%time
# Use case 1: as an iterator
data = []
for batch in vertex_loader:
    data.append(batch)

In [None]:
print("Number of batches: ", len(data))
data

In [None]:
# Use case 2: `data` property
# Since there are multiple batches of data,
# the `data` property will return the loader itself.
data = vertex_loader.data

In [None]:
%%time
print("Number of batches: ", sum(1 for batch in data))

### Graph Loader

#### Load the whole graph directly to local

In [None]:
from tgml.dataloaders import GraphLoader

The first time you initialize the loader on a graph in TigerGraph, the initialization might take half a minute because it installs the corresponding query in the database and optimizes it. The query only needs to be installed once, so initializing the loader on the same TigerGraph graph again takes almost no time.

There are two ways to use the data loader. 
* First, it can be used as an iterator, which means you can loop through it to get every batch of data. Since this loader loads the whole graph at once, there will be only one batch of data (of the whole graph) in the iterator. 
* Second, you can access the `data` property of the class directly. Since there is only one batch of data (the whole graph), it will give you the batch directly instead of an iterator.

Args to `GraphLoader` class:
* graph (TigerGraph): Connection to the TigerGraph database.
* v_in_feats (str, optional): Attributes to be used as input features and their types. Attributes should be separated by ',' and an attribute and its type should be separated by ':'. The type of an attribute can be omitted together with the separator ':', in which case the attribute defaults to type "float32". Defaults to "".
* v_out_labels (str, optional): Attributes to be used as labels for prediction. It follows the same format as 'v_in_feats'. Defaults to "".
* v_extra_feats (str, optional): Other attributes to get such as indicators of train/test data. It follows the same format as 'v_in_feats'. Defaults to "".
* local_storage_path (str, optional): Place to store data locally. Defaults to "./tmp".
* cloud_storage_path (str, optional): S3 path used for cloud caching. Defaults to None.
* buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.
* output_format (str, optional): Format of the output data of the loader. Only "PyG" is supported. Defaults to "PyG".
* reindex (bool, optional): Whether to reindex the vertices. Defaults to False.
* aws_access_key_id (str, optional): AWS access key. Defaults to None.
* aws_secret_access_key (str, optional): AWS access key secret. Defaults to None.
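
The `name:type` format used by `v_in_feats`, `v_out_labels` and `v_extra_feats` can be made concrete with a small parser. This is an illustrative sketch of the format as described above, not the parser `tgml` uses; `parse_feature_spec` is a hypothetical name.

```python
def parse_feature_spec(spec, default_type="float32"):
    """Turn "x:float32,y" into {"x": "float32", "y": "float32"}."""
    parsed = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate the empty default "" and stray commas
        # partition() leaves dtype empty when the ":" separator is omitted.
        name, _, dtype = item.partition(":")
        parsed[name.strip()] = dtype.strip() or default_type
    return parsed
```

Under this reading, `"x,y:int"` maps `x` to the default type `"float32"` and `y` to `"int"`, and the empty default `""` yields no attributes at all.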

In [None]:
%%time
graph_loader = GraphLoader(
                 graph = tgraph,
                 v_in_feats = "x:float32",
                 v_out_labels = "y:int",
                 v_extra_feats = "train_mask:bool,val_mask:bool,test_mask:bool",
                 output_format = "PyG",
                 reindex=False)

In [None]:
%%time
# Use case 1: as an iterator.
data = []
for batch in graph_loader:
    data.append(batch)

In [None]:
data

In [None]:
%%time
# Use case 2: `.data` property
data = graph_loader.data

In [None]:
data

#### Stream subgraphs with neighbor sampling

In [None]:
from tgml.dataloaders import NeighborLoader

`NeighborLoader` is a data loader that performs neighbor sampling as introduced in the paper [Inductive Representation Learning on Large Graphs](https://arxiv.org/abs/1706.02216).

Specifically, it first chooses `batch_size` vertices as seeds, then picks `num_neighbors` neighbors of each seed at random, then `num_neighbors` neighbors of each of those neighbors, and repeats this for `num_hops` hops. This generates one subgraph. As you loop through this data loader, every vertex will eventually be chosen as a seed, and you will get all the subgraphs expanded from those seeds.
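
The sampling procedure can be sketched on a small in-memory adjacency list. This is a simplified illustration, not the query that runs inside TigerGraph: `sample_subgraph` and its arguments are hypothetical, and as a simplification only newly discovered vertices are expanded at the next hop.

```python
import random


def sample_subgraph(adjacency, seeds, num_neighbors=10, num_hops=2, rng=None):
    """Return the vertices and edges reached by neighbor sampling from the seeds."""
    rng = rng or random.Random()
    vertices = set(seeds)
    edges = []
    frontier = list(seeds)
    for _ in range(num_hops):
        next_frontier = []
        for v in frontier:
            neighbors = adjacency.get(v, [])
            # Keep every neighbor if there are few, else draw a random subset.
            k = min(num_neighbors, len(neighbors))
            for u in rng.sample(neighbors, k):
                edges.append((v, u))
                if u not in vertices:
                    vertices.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return vertices, edges


adjacency = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0, 5], 4: [1], 5: [3]}
vertices, edges = sample_subgraph(adjacency, seeds=[0], num_neighbors=2, num_hops=2)
```

With `num_neighbors=2` and `num_hops=2`, a single seed can pull in at most 2 vertices at the first hop and 4 more at the second, which is what keeps each subgraph small regardless of the size of the full graph.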

If you want to limit seeds to certain vertices, pass the name of a boolean vertex attribute to `filter_by`. Only vertices for which that attribute is true can be chosen as seeds.
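
Seed selection with a boolean filter can be pictured as follows. The sketch assumes the filter is applied before the seeds are split into batches; `seed_batches` is a hypothetical helper and `mask` stands in for an attribute such as `train_mask`.

```python
def seed_batches(vertex_ids, mask, batch_size):
    """Yield lists of at most batch_size seeds, keeping only vertices whose mask is True."""
    seeds = [v for v, keep in zip(vertex_ids, mask) if keep]
    for start in range(0, len(seeds), batch_size):
        yield seeds[start:start + batch_size]


train_mask = [True, False, True, True, False, True]
batches = list(seed_batches(range(6), train_mask, batch_size=2))
```

In this toy case four of the six vertices pass the filter, so two batches of two seeds each are produced.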

The first time you initialize the loader on a graph in TigerGraph, the initialization might take half a minute because it installs the corresponding query in the database and optimizes it. The query only needs to be installed once, so initializing the loader on the same TigerGraph graph again takes almost no time.

Args to `NeighborLoader` class:
* graph (TigerGraph): Connection to the TigerGraph database.
* tmp_id (str, optional): Attribute name that holds the temporary ID of vertices. Defaults to "tmp_id".
* v_in_feats (str, optional): Attributes to be used as input features and their types. Attributes should be separated by ',' and an attribute and its type should be separated by ':'. The type of an attribute can be omitted together with the separator ':', in which case the attribute defaults to type "float32". Defaults to "".
* v_out_labels (str, optional): Attributes to be used as labels for prediction. It follows the same format as 'v_in_feats'. Defaults to "".
* v_extra_feats (str, optional): Other attributes to get such as indicators of train/test data. It follows the same format as 'v_in_feats'. Defaults to "".
* local_storage_path (str, optional): Place to store data locally. Defaults to "./tmp".
* cloud_storage_path (str, optional): S3 or GCP path used for cloud caching. Defaults to None.
* buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.
* output_format (str, optional): Format of the output data of the loader. Only "PyG" is supported. Defaults to "PyG".
* batch_size (int, optional): Number of vertices as seeds in each batch. Defaults to None.
* num_batches (int, optional): Number of batches to split the vertices. Defaults to 1.
* num_neighbors (int, optional): Number of neighbors to sample for each vertex. Defaults to 10.
* num_hops (int, optional): Number of hops to traverse when sampling neighbors. Defaults to 2.
* cache_id (str, optional): A tag attached to data generated. Defaults to None.
* shuffle (bool, optional): Whether to shuffle the vertices after every epoch. Defaults to False.
* filter_by (str, optional): A boolean attribute used to indicate which vertices can be included as seeds. Defaults to None.
* aws_access_key_id (str, optional): AWS access key. Defaults to None.
* aws_secret_access_key (str, optional): AWS access key secret. Defaults to None.

In [None]:
%%time
graph_loader = NeighborLoader(
                 graph = tgraph,
                 tmp_id = "tmp_id",
                 v_in_feats = "x:float32",
                 v_out_labels = "y:int",
                 v_extra_feats = "train_mask:bool,val_mask:bool,test_mask:bool",
                 output_format = "PyG",
                 batch_size = 64,
                 num_neighbors = 10,
                 num_hops = 2)

In [None]:
%%time
data = []
for batch in graph_loader:
    data.append(batch)
print("Number of batches: ", len(data))

In [None]:
data

In [None]:
%%time
graph_loader = NeighborLoader(
                 graph = tgraph,
                 tmp_id = "tmp_id",
                 v_in_feats = "x:float32",
                 v_out_labels = "y:int",
                 v_extra_feats = "train_mask:bool,val_mask:bool,test_mask:bool",
                 output_format = "PyG",
                 batch_size = 16,
                 num_neighbors = 10,
                 num_hops = 2,
                 filter_by = "train_mask")

In [None]:
%%time
data = []
for batch in graph_loader:
    data.append(batch)
print("Number of batches: ", len(data))

In [None]:
data

### Smart Cloud Caching

When you provide `cloud_storage_path` when creating a loader (any of the vertex, edge, or graph loaders), data is first moved to cloud storage and then downloaded to the local machine, so it is slower than streaming directly from the server. However, when multiple consumers need the same data, such as when trying out different models in parallel or tuning hyperparameters, cloud caching reduces the load on the server and may therefore be faster than having every consumer hit the server at the same time.

To share the cloud cache between different consumers, provide the same `cache_id` when creating the loaders. Below we create two loaders in the same Python session to demonstrate cloud caching; in practice, you would run parallel Python sessions, each with its own loader.

In [None]:
%%time
vertex_loader = VertexLoader(tgraph, 
                             batch_size=100,
                             attributes="x,y",
                             cache_id="test_smart_cache",
                             cloud_storage_path="s3://graph-export-dev/cora_vertices")

In [None]:
%%time
data = []
for batch in vertex_loader:
    data.append(batch)

In [None]:
data

In [None]:
%%time
vertex_loader2 = VertexLoader(tgraph, 
                             batch_size=100,
                             attributes="x,y",
                             cache_id="test_smart_cache",
                             cloud_storage_path="s3://graph-export-dev/cora_vertices")

In [None]:
%%time
data2 = []
for batch in vertex_loader2:
    data2.append(batch)

In [None]:
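# Compare the two runs batch by batch. `all(d1 == d2)` would iterate over
# column labels rather than values, so use `DataFrame.equals` instead
# (assuming the default "dataframe" output format).
for d1, d2 in zip(data, data2):
    assert d1.equals(d2)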
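
The effect of sharing a `cache_id` can be pictured with an in-memory stand-in for cloud storage. How `tgml` manages the cache internally is not described here, so this sketch only assumes that the first consumer of a given `cache_id` triggers the export and later consumers reuse it; `ToyCloudCache` and its methods are hypothetical.

```python
class ToyCloudCache:
    """In-memory stand-in for cloud storage shared between consumers; not part of tgml."""

    def __init__(self):
        self._store = {}
        self.server_hits = 0  # how many times data had to be exported from the server

    def get_or_export(self, cache_id, export_from_server):
        # Only the first consumer of a cache_id triggers an export from the server.
        if cache_id not in self._store:
            self.server_hits += 1
            self._store[cache_id] = export_from_server()
        return self._store[cache_id]


cache = ToyCloudCache()
first = cache.get_or_export("test_smart_cache", lambda: [1, 2, 3])
second = cache.get_or_export("test_smart_cache", lambda: [1, 2, 3])
```

Both consumers receive the same data, but `cache.server_hits` stays at 1, which is the reduction in server workload that cloud caching is meant to provide.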