# Data Processing

This notebook demonstrates how to use `tgml` for common data processing tasks on graphs stored in `TigerGraph`.

### Define Graph

Conceptually, the `TigerGraph` class represents the graph stored in the database. Under the hood, it stores the information needed to communicate with the TigerGraph database. It can read `username` and `password` from the environment variables `TGUSERNAME` and `TGPASSWORD`, so we recommend storing credentials there or in a `.env` file instead of hardcoding them. If you do pass `username` and `password` to the class constructor, the environment variables are ignored.
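The precedence rule described above can be illustrated with a small standalone sketch. The helper `resolve_credentials` is hypothetical and is not part of `tgml`; in particular, falling back to the environment one field at a time is an assumption, since the text only describes the case where both values are passed explicitly.

```python
import os


def resolve_credentials(username=None, password=None, env=os.environ):
    """Hypothetical helper (not tgml's code): explicit arguments win over
    the TGUSERNAME / TGPASSWORD environment variables.

    Assumption: each field falls back to the environment independently.
    """
    if username is None:
        username = env.get("TGUSERNAME")
    if password is None:
        password = env.get("TGPASSWORD")
    return username, password


# A plain dict stands in for the real environment in this example.
fake_env = {"TGUSERNAME": "alice", "TGPASSWORD": "secret"}

from_env = resolve_credentials(env=fake_env)
explicit = resolve_credentials("bob", "hunter2", env=fake_env)

# With TGUSERNAME and TGPASSWORD set, the text above implies the graph can be
# defined without credentials in code, e.g.:
# tgraph = TigerGraph(host="http://localhost", graph="Cora")
```

Keeping the lookup in one place makes it easy to see that nothing secret needs to appear in the notebook itself.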

In [1]:
from tgml.data import TigerGraph

Args to the `TigerGraph` class:
*    host (str, optional): Address of the server. Defaults to "http://localhost".
*    graph (str, optional): Name of the graph. Defaults to None.
*    username (str, optional): Username. Defaults to None.
*    password (str, optional): Password for the user. Defaults to None.
*    rest_port (str, optional): Port for the REST endpoint. Defaults to "9000".
*    gs_port (str, optional): Port for GraphStudio. Defaults to "14240".
*    token_auth (bool, optional): Whether to use token authentication. If True, token authentication must be turned on in the TigerGraph database server. Defaults to True.

In [2]:
tgraph = TigerGraph(host = "http://35.230.92.92",
                    graph = "Cora",
                    username = "tigergraph",
                    password = "tigergraphml")

In [3]:
tgraph.info()

Using graph 'Cora'
---- Graph Cora
Vertex Types: 
  - VERTEX Paper(PRIMARY_ID id INT, x LIST<INT>, y INT, train_mask BOOL, val_mask BOOL, test_mask BOOL) WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
Edge Types: 
  - DIRECTED EDGE Cite(FROM Paper, TO Paper)

Graphs: 
  - Graph Cora(Paper:v, Cite:e)
Jobs: 
  - CREATE LOADING JOB load_cora_data {
      DEFINE FILENAME edge_csv = "/home/tigergraph/data/Cora/edges.csv";
      DEFINE FILENAME node_csv = "/home/tigergraph/data/Cora/nodes.csv";
      LOAD node_csv TO VERTEX Paper VALUES($"id", SPLIT($"x", " "), $"y", $"train", $"valid", $"test") USING SEPARATOR=",", HEADER="true", EOL="\n";
      LOAD edge_csv TO EDGE Cite VALUES($"source", $"target") USING SEPARATOR=",", HEADER="true", EOL="\n";
    }

Queries: 
  - export_vertex_train_mask_val_mask_test_mask(string output_path) (installed v2)
  - get_vertex_number(string v_type, string filter_by) (installed v2)
  - train_test_vertex_split(string train_attr, string test_

In [4]:
tgraph.number_of_vertices()

2708

In [5]:
tgraph.number_of_vertices("Paper")

2708

In [6]:
tgraph.number_of_edges()

10556

In [7]:
tgraph.number_of_edges("Cite")

10556

### Train/Validation/Test Split

In [13]:
from tgml.utils import split_vertices

`tgml` provides a utility function, `split_vertices`, that splits vertices into training, validation, and test sets. More precisely, it creates three boolean attributes, each indicating whether a vertex belongs to the corresponding set. For example, to split the vertices into 80% train, 10% validation, and 10% test, pass `train_mask=0.8, val_mask=0.1, test_mask=0.1` to the function. This creates the attributes `train_mask`, `val_mask`, and `test_mask` in the graph if they don't already exist. At random, 80% of the vertices are set to `train_mask=1`, 10% to `val_mask=1`, and 10% to `test_mask=1`, with no overlap between the partitions. You can name the attributes however you like as long as you follow the same `name=fraction` format, such as `yesterday=0.8, today=0.1, tomorrow=0.1`, but we recommend meaningful names.
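To make the idea of non-overlapping random masks concrete, here is a minimal pure-Python sketch. It is only an illustration of the concept: `random_split_masks` is a hypothetical name, the shuffle-and-slice approach is an assumption, and `tgml` performs the actual split inside the database rather than this way.

```python
import random


def random_split_masks(num_vertices, seed=None, **fractions):
    """Illustrative only: build one boolean mask per named partition so that
    each vertex lands in at most one partition, chosen at random."""
    if sum(fractions.values()) > 1 + 1e-9:
        raise ValueError("fractions must sum to at most 1")

    # Shuffle the vertex indices once, then hand out consecutive slices.
    order = list(range(num_vertices))
    random.Random(seed).shuffle(order)

    masks = {name: [False] * num_vertices for name in fractions}
    start = 0
    for name, fraction in fractions.items():
        count = round(fraction * num_vertices)
        for vertex in order[start:start + count]:
            masks[name][vertex] = True
        start += count
    return masks


# Same keyword format as the call to split_vertices, on 2708 toy vertices.
masks = random_split_masks(2708, seed=0,
                           train_mask=0.8, val_mask=0.1, test_mask=0.1)
sizes = {name: sum(mask) for name, mask in masks.items()}
```

Because the slices of the shuffled index list are disjoint, no vertex can be flagged in two masks.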

In [14]:
split_vertices(tgraph, train_mask=0.8, val_mask=0.1, test_mask=0.1)

Now the split is done. Load all vertices and check if the split is correct. See the next tutorial for details on `VertexLoader` and other data loaders.

In [15]:
from tgml.dataloaders import VertexLoader

In [16]:
%%time
vertex_loader = VertexLoader(tgraph, attributes="train_mask,val_mask,test_mask")

CPU times: user 6.39 ms, sys: 2.45 ms, total: 8.85 ms
Wall time: 24 s


In [17]:
%%time
data = vertex_loader.data

CPU times: user 79.2 ms, sys: 23.6 ms, total: 103 ms
Wall time: 16.8 s


In [18]:
data.train_mask.sum()/len(data), data.val_mask.sum()/len(data), data.test_mask.sum()/len(data)

(0.7996034615755472, 0.1000024105679298, 0.10039111464661074)
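The fractions above confirm the sizes of the partitions but not that they are disjoint. The sketch below checks both on plain Python lists. `split_report` is a hypothetical helper, and feeding it the loaded masks (for example via something like `data.train_mask.tolist()`) assumes the loaded attributes can be converted to lists of booleans, which this notebook does not confirm.

```python
def split_report(**masks):
    """Hypothetical helper: return the fraction of items flagged in each mask
    and whether any item is flagged in more than one mask."""
    lengths = {len(mask) for mask in masks.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("masks must be non-empty and of equal length")
    total = lengths.pop()

    fractions = {name: sum(bool(flag) for flag in mask) / total
                 for name, mask in masks.items()}
    # Walk the masks row by row; a row with two or more True values overlaps.
    overlap = any(sum(bool(flag) for flag in row) > 1
                  for row in zip(*masks.values()))
    return fractions, overlap


# Toy masks over 10 items: 8 train, 1 validation, 1 test, no overlap.
train = [True] * 8 + [False, False]
val = [False] * 8 + [True, False]
test = [False] * 9 + [True]

fractions, overlap = split_report(train_mask=train, val_mask=val, test_mask=test)
```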