# Schema Transform Example
In this example we will:

1. ~~persist a source graph's metadata in Grakn~~
2. perform a motif query to 'transform' the graph, and 
3. document the updated schema (versioning)

Ideally, generated code snippets will apply the transforms to the source graph in Spark.

In [114]:
# ! pip install findspark

In [115]:
# !pip install graphframes
# https://towardsdatascience.com/graphframes-in-jupyter-a-practical-guide-9b3b346cebc5

In [117]:
# imports and libraries
import findspark
findspark.init()

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext, DataFrame
from pyspark.conf import SparkConf

In [118]:
# Spark runtime boilerplate

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
sqlC = SQLContext(sc)
sc.addPyFile("/Users/josephhaaga/.ivy2/jars/graphframes_graphframes-0.6.0-spark2.3-s_2.11.jar")

## Source Graph
First we create a GraphFrame with the data generated by GeneratePeopleAndCompanies.ipynb

In [119]:
edges = sqlC.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("./data/peopleAndCompanies_edges.csv") 
    
vertices = sqlC.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("./data/peopleAndCompanies_vertices.csv") 

Some example vertices

In [135]:
vertices.toPandas()[17:23]

Unnamed: 0,id,type,name,address
17,17,company,"Diaz, Sanchez and Williams",00807 Meadows Prairie Apt. 382 East Melaniebur...
18,18,company,Jones Inc,Unit 5674 Box 1556 DPO AA 42502
19,19,company,Holmes Ltd,"90746 Beasley Shoal Suite 136 New Joseph, WA 4..."
20,20,person,Margaret Khan,"1533 Frederick Alley Jamesmouth, SD 62854"
21,21,person,Terry Herring,"7340 Simmons Square Apt. 770 Isabellaville, MS..."
22,22,person,Mariah Peterson,"829 Combs Expressway Jamesstad, VT 68867"


In [121]:
from graphframes import *
# https://stackoverflow.com/a/50404308

In [122]:
g = GraphFrame(vertices, edges)

An example GraphFrame query:

In [123]:
# Dependents claiming dependents
g.find("(a)-[r]->(b); (b)-[r2]->(c)") \
    .filter("r.relationship == 'claims_dependent'") \
    .filter("r2.relationship == 'claims_dependent'").show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                   a|                   r|                   b|                  r2|                   c|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|[22, person, Mari...|[22, 66, claims_d...|[66, person, Andr...|[66, 69, claims_d...|[69, person, Javi...|
|[22, person, Mari...|[22, 88, claims_d...|[88, person, Caro...|[88, 34, claims_d...|[34, person, Tyle...|
|[22, person, Mari...|[22, 88, claims_d...|[88, person, Caro...|[88, 92, claims_d...|[92, person, Anit...|
|[22, person, Mari...|[22, 25, claims_d...|[25, person, Mary...|[25, 58, claims_d...|[58, person, Dona...|
|[26, person, Thom...|[26, 45, claims_d...|[45, person, Terr...|[45, 47, claims_d...|[47, person, Geor...|
|[29, person, Lisa...|[29, 71, claims_d...|[71, person, Evan...|[71, 58, claims_d...|[58, person, Dona...|
|[35, person, Devi...|[35, 36, claims

## Describe Source Graph in Grakn Metamodel
The source graph, a GraphFrame named `g`, depicts a network of people claiming eachother as dependents. We need methods to extract the relevant features of this graph so that it can be depicted in the Grakn metamodel. Some examples of things we need to describe include:

### graphVertex
- Attributes
    - Name
- Relationships
    - has-type
    - has-graphobjects
    - has-concept
    - has-vertexid
    - has_property
    - has-attribute

In [124]:
import uuid

A simple list comprehension over the unique `types` of GraphFrame vertices can be translated into Graql `insert` statements.

In [125]:
vertexTypes = g.vertices.select("type").distinct().rdd.flatMap(lambda x: x).collect()

createVertices = ['insert $'+str(uuid.uuid4())+' isa graphVertex has name "'+v+'";' \
 for v in vertexTypes]

# insert statements may need to become match-insert statements if we want to update existing metamodel graphs

createVertices

['insert $031d2573-1b62-4e35-81c8-5566e0972437 isa graphVertex has name "person";',
 'insert $89eb77c4-ae0c-4967-8f7a-ea3732ee9c18 isa graphVertex has name "company";']

### graphEdge
- Attributes
    - Name
- Relationships
    - has-type
    - has-graphobjects
    - has-concept
    - has-edgeids
    - has_property
    - has-attribute


A simple list comprehension over the unique `relationship` values of GraphFrame edges can be translated into Graql `insert` statements.

In [126]:
edgeTypes = g.edges.select("relationship").distinct().rdd.flatMap(lambda x: x).collect()
createEdges = ['$'+str(uuid.uuid4())+' isa graphEdge has name "'+e+'";' \
 for e in edgeTypes]

createEdges

['$3fc6cf2e-448e-4255-bc4b-878c8f6d4f11 isa graphEdge has name "owned_by";',
 '$9c70d463-8478-46b3-9c79-a24fa9f712d8 isa graphEdge has name "employed_by";',
 '$e57e66cf-87d3-4b20-9b26-76eb0eba26e2 isa graphEdge has name "claims_dependent";']

### graphTriplet
We can query the GraphFrame for an explicit list of all triples in the graph, and `insert` relationships between the nodes and edges we've created.
___
We should try making `owned_by` an ambigious relationship. This can be changed in the GeneratePeopleAndCompanies.ipynb notebook.

e.g. 
1. `(Company)-[owned_by]->(Company)`
2. `(Company)-[owned_by]->(Person)`

In [127]:
s = g.edges

In [136]:
tripleTypes = s.join(vertices, s.src == vertices.id) \
    .select(["src","type","dst","relationship"]) \
    .withColumnRenamed('type','src_type') \
    .join(vertices, s.dst == vertices.id) \
    .select(["src","src_type","dst","type","relationship"]) \
    .withColumnRenamed('type', 'dst_type') \
    .select(['src_type', 'relationship', 'dst_type']) \
    .distinct() \
    .collect()

In [137]:
tripleTypes

[Row(src_type='company', relationship='owned_by', dst_type='company'),
 Row(src_type='person', relationship='employed_by', dst_type='company'),
 Row(src_type='person', relationship='claims_dependent', dst_type='person')]

In [129]:
tripleTypes = [a.asDict() for a in tripleTypes]

Mixing `match` and `insert` statements leads to syntax errors; let's try persisting the UUID for all the nodes/edges we create so we don't have to do any `match`ing during the triplet relationship insertion.

In [138]:
createTriplets = [ \
    """
    ${0} isa graphTriplet has name "{1}";
    match 
        $src isa graphVertex has name "{2}";
        $dst isa graphVertex has name "{3}";
        $rel isa graphEdge has name "{4}";
    insert (src-vertex-owned: $src, dst-vertex-owned: $dst, edge-owned: $rel) isa has-graphobjects;
    """.format( \
            str(uuid.uuid4()), \
            a['src_type']+" "+a['relationship']+" "+a['dst_type'], \
            a['src_type'], \
            a['dst_type'], \
            a['relationship'] \
        ).replace("\n","").replace("\t", ' ')
    for e in tripleTypes \
]


In [139]:
createTriplets[0]

'    $d2072927-4eeb-454d-a794-cfa6debc5d09 isa graphTriplet has name "company owned_by company";    match         $src isa graphVertex has name "company";        $dst isa graphVertex has name "company";        $rel isa graphEdge has name "owned_by";    insert (src-vertex-owned: $src, dst-vertex-owned: $dst, edge-owned: $rel) isa has-graphobjects;    '

## Review of generated Graql statements

In [132]:
" ".join(createVertices)

'insert $031d2573-1b62-4e35-81c8-5566e0972437 isa graphVertex has name "person"; insert $89eb77c4-ae0c-4967-8f7a-ea3732ee9c18 isa graphVertex has name "company";'

In [133]:
"insert " + " ".join(createEdges)

'insert $3fc6cf2e-448e-4255-bc4b-878c8f6d4f11 isa graphEdge has name "owned_by"; $9c70d463-8478-46b3-9c79-a24fa9f712d8 isa graphEdge has name "employed_by"; $e57e66cf-87d3-4b20-9b26-76eb0eba26e2 isa graphEdge has name "claims_dependent";'

In [134]:
"insert " + " ".join(createTriplets)

'insert     $59f626fa-6d7d-454a-9e70-4e38da77da3c isa graphTriplet has name "company owned_by company";    match         $src isa graphVertex has name "company";        $dst isa graphVertex has name "company";        $rel isa graphEdge has name "owned_by";    insert (src-vertex-owned: $src, dst-vertex-owned: $dst, edge-owned: $rel) isa has-graphobjects;         $74361372-4de2-4c9e-9a6e-58929056fe5e isa graphTriplet has name "company owned_by company";    match         $src isa graphVertex has name "company";        $dst isa graphVertex has name "company";        $rel isa graphEdge has name "owned_by";    insert (src-vertex-owned: $src, dst-vertex-owned: $dst, edge-owned: $rel) isa has-graphobjects;         $eb1fe3c1-2ecd-4e57-80d5-25a86e48e28f isa graphTriplet has name "company owned_by company";    match         $src isa graphVertex has name "company";        $dst isa graphVertex has name "company";        $rel isa graphEdge has name "owned_by";    insert (src-vertex-owned: $src, dst-