# Schema Transform Example
In this example we will:

1. ~~persist a source graph's metadata in Grakn~~
2. perform a motif query to 'transform' the graph, and 
3. document the updated schema (versioning)

Ideally, generated code snippets will apply the transforms to the source graph in Spark.

In [2]:
# ! pip install findspark

In [3]:
# !pip install graphframes
# https://towardsdatascience.com/graphframes-in-jupyter-a-practical-guide-9b3b346cebc5

In [1]:
# imports and libraries
import findspark
findspark.init()

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext, DataFrame

In [2]:
# Spark runtime boilerplate

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
sqlC = SQLContext(sc)
sc.addPyFile("/Users/josephhaaga/.ivy2/jars/graphframes_graphframes-0.6.0-spark2.3-s_2.11.jar")

## Source Graph
First, we create a GraphFrame from the data generated by GeneratePeopleAndCompanies.ipynb.

In [3]:
edges = sqlC.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("./data/peopleAndCompanies_edges.csv") 
    
vertices = sqlC.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("./data/peopleAndCompanies_vertices.csv") 

Some example vertices

In [4]:
vertices.toPandas()[17:23]

| Unnamed: 0 | id | type | name | address |
|---|---|---|---|---|
| 17 | 17 | company | Diaz, Sanchez and Williams | 00807 Meadows Prairie Apt. 382 East Melaniebur... |
| 18 | 18 | company | Jones Inc | Unit 5674 Box 1556 DPO AA 42502 |
| 19 | 19 | company | Holmes Ltd | 90746 Beasley Shoal Suite 136 New Joseph, WA 4... |
| 20 | 20 | person | Margaret Khan | 1533 Frederick Alley Jamesmouth, SD 62854 |
| 21 | 21 | person | Terry Herring | 7340 Simmons Square Apt. 770 Isabellaville, MS... |
| 22 | 22 | person | Mariah Peterson | 829 Combs Expressway Jamesstad, VT 68867 |


In [4]:
from graphframes import GraphFrame
# https://stackoverflow.com/a/50404308

In [5]:
g = GraphFrame(vertices, edges)

An example GraphFrame query:

In [6]:
# Dependents claiming dependents
g.find("(a)-[r]->(b); (b)-[r2]->(c)") \
    .filter("r.relationship == 'claims_dependent'") \
    .filter("r2.relationship == 'claims_dependent'").show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                   a|                   r|                   b|                  r2|                   c|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|[22, person, Mari...|[22, 66, claims_d...|[66, person, Andr...|[66, 69, claims_d...|[69, person, Javi...|
|[22, person, Mari...|[22, 88, claims_d...|[88, person, Caro...|[88, 34, claims_d...|[34, person, Tyle...|
|[22, person, Mari...|[22, 88, claims_d...|[88, person, Caro...|[88, 92, claims_d...|[92, person, Anit...|
|[22, person, Mari...|[22, 25, claims_d...|[25, person, Mary...|[25, 58, claims_d...|[58, person, Dona...|
|[26, person, Thom...|[26, 45, claims_d...|[45, person, Terr...|[45, 47, claims_d...|[47, person, Geor...|
|[29, person, Lisa...|[29, 71, claims_d...|[71, person, Evan...|[71, 58, claims_d...|[58, person, Dona...|
|[35, person, Devi...|[35, 36, claims

In [9]:
motifs = g.find("(a)-[e]->(b); (c)-[e2]->(b)")

results = motifs.filter("a.type == 'person'") \
    .filter("c.type == 'person'") \
    .filter("b.type == 'company'") \
    .filter("e.relationship == 'employed_by'") \
    .filter("e2.relationship == 'employed_by'")
    


AnalysisException: 'No such struct field type in src, dst, relationship; line 1 pos 0'

## Describe Source Graph in Grakn Metamodel
The source graph, a GraphFrame named `g`, depicts a network of people and companies, including people claiming each other as dependents. We need methods to extract the relevant features of this graph so that it can be described in the Grakn metamodel. Things we need to describe include:

### graphVertex
- Attributes
    - Name
- Relationships
    - has-type
    - has-graphobjects
    - has-concept
    - has-vertexid
    - has_property
    - has-attribute

A simple list comprehension over the unique `type` values of GraphFrame vertices can be translated into Graql `insert` statements.

In [33]:
vertexTypes = g.vertices.select("type").distinct().rdd.flatMap(lambda x: x).collect()

createVertices = ['$' + v + ' isa graphVertex has name "' + v + '";'
                  for v in vertexTypes]

# insert statements may need to become match-insert statements if we want to update existing metamodel graphs

createVertices

['$person isa graphVertex has name "person";',
 '$company isa graphVertex has name "company";']
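
The string concatenation above works for simple labels such as `person`, but a label containing spaces, hyphens, or quotes would produce a malformed statement. A minimal sketch of a more defensive builder follows. The helper names `graql_var`, `graql_string`, and `insert_statement` are hypothetical, and two things are assumptions rather than confirmed Graql rules: that variable names are safest when limited to letters, digits, and underscores, and that string literals accept backslash-escaped quotes.

```python
import re


def graql_var(label):
    """Build a conservative variable name from an arbitrary label.

    Assumption: only letters, digits and underscores are kept; every
    other character is replaced by an underscore.
    """
    return "$" + re.sub(r"\W", "_", label)


def graql_string(value):
    """Quote a value as a string literal.

    Assumption: backslash-escaping of backslashes and double quotes is
    accepted by the Graql version in use.
    """
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def insert_statement(metatype, label):
    """Return one '<var> isa <metatype> has name "<label>";' statement."""
    return "{} isa {} has name {};".format(
        graql_var(label), metatype, graql_string(label)
    )


# Hypothetical labels; the third one is not in the source data and is
# included only to show the sanitising step.
example_types = ["person", "company", "non-profit org"]
statements = [insert_statement("graphVertex", t) for t in example_types]

for s in statements:
    print(s)
```

For the two vertex types in this notebook the builder yields the same statements as the list comprehension, so it could be swapped in without changing the generated Graql.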

### graphEdge
- Attributes
    - Name
- Relationships
    - has-type
    - has-graphobjects
    - has-concept
    - has-edgeids
    - has_property
    - has-attribute


A simple list comprehension over the unique `relationship` values of GraphFrame edges can be translated into Graql `insert` statements.

In [9]:
edgeTypes = g.edges.select("relationship").distinct().rdd.flatMap(lambda x: x).collect()
createEdges = ['$' + e + ' isa graphEdge has name "' + e + '";'
               for e in edgeTypes]

createEdges

['$owned_by isa graphEdge has name "owned_by";',
 '$employed_by isa graphEdge has name "employed_by";',
 '$claims_dependent isa graphEdge has name "claims_dependent";']

### graphTriplet
We can query the GraphFrame for an explicit list of all triples in the graph, and `insert` relationships between the nodes and edges we've created.
___
We should try making `owned_by` an ambiguous relationship. This can be changed in the GeneratePeopleAndCompanies.ipynb notebook.

e.g. 
1. `(Company)-[owned_by]->(Company)`
2. `(Company)-[owned_by]->(Person)`
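
One way to check for this kind of ambiguity is to group the distinct triple types by relationship and flag any relationship that connects more than one pair of vertex types. The sketch below uses a hand-written sample rather than the real graph. The function name `ambiguous_relationships` and the sample rows are hypothetical, and the dictionary keys (`src_type`, `relationship`, `dst_type`) are assumed to match the triple-type columns built in this notebook.

```python
from collections import defaultdict


def ambiguous_relationships(triple_types):
    """Map each relationship to its endpoint type pairs, keeping only
    relationships that appear with more than one (src_type, dst_type) pair."""
    endpoints = defaultdict(set)
    for t in triple_types:
        endpoints[t["relationship"]].add((t["src_type"], t["dst_type"]))
    return {
        rel: sorted(pairs)
        for rel, pairs in endpoints.items()
        if len(pairs) > 1
    }


# Hypothetical sample in which owned_by has two different targets.
sample = [
    {"src_type": "company", "relationship": "owned_by", "dst_type": "company"},
    {"src_type": "company", "relationship": "owned_by", "dst_type": "person"},
    {"src_type": "person", "relationship": "employed_by", "dst_type": "company"},
]

ambiguous = ambiguous_relationships(sample)
print(ambiguous)
```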

In [10]:
s = g.edges

In [11]:
tripleTypes = s.join(vertices, s.src == vertices.id) \
    .select(["src","type","dst","relationship"]) \
    .withColumnRenamed('type','src_type') \
    .join(vertices, s.dst == vertices.id) \
    .select(["src","src_type","dst","type","relationship"]) \
    .withColumnRenamed('type', 'dst_type') \
    .select(['src_type', 'relationship', 'dst_type']) \
    .distinct() \
    .collect()

In [12]:
tripleTypes

[Row(src_type='company', relationship='owned_by', dst_type='company'),
 Row(src_type='person', relationship='employed_by', dst_type='company'),
 Row(src_type='person', relationship='claims_dependent', dst_type='person')]
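
The double join above looks up the vertex type for each edge's `src` and `dst`, then keeps the distinct `(src_type, relationship, dst_type)` combinations. The same logic can be sketched in plain Python on a toy graph, which may help when checking the Spark result by hand. The function name `triple_types` and the toy rows are hypothetical and are not taken from the generated CSV files.

```python
def triple_types(vertices, edges):
    """Return the sorted distinct (src_type, relationship, dst_type) tuples."""
    # Equivalent of joining edges to vertices on id: a lookup table.
    type_by_id = {v["id"]: v["type"] for v in vertices}
    return sorted({
        (type_by_id[e["src"]], e["relationship"], type_by_id[e["dst"]])
        for e in edges
    })


# Toy data for illustration only.
toy_vertices = [
    {"id": 1, "type": "person"},
    {"id": 2, "type": "person"},
    {"id": 3, "type": "company"},
]
toy_edges = [
    {"src": 1, "dst": 3, "relationship": "employed_by"},
    {"src": 2, "dst": 3, "relationship": "employed_by"},
    {"src": 1, "dst": 2, "relationship": "claims_dependent"},
]

toy_triples = triple_types(toy_vertices, toy_edges)
print(toy_triples)
```

The set comprehension plays the role of `.distinct()`: the two `employed_by` edges collapse into a single triple type.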

In [129]:
tripleTypes = [a.asDict() for a in tripleTypes]

Mixing `match` and `insert` statements leads to syntax errors; let's try persisting the UUID for all the nodes/edges we create so we don't have to do any `match`ing during the triplet relationship insertion.

In [49]:
createTriplets = [ \
    """${0} isa graphTriplet has name "{1}";
    (src-vertex-owned: ${2}, dst-vertex-owned: ${3}, edge-owned: ${4}, object-owner: ${0}) isa has-graphobjects;
    """.format( \
            a['src_type']+"_"+a['relationship']+"_"+a['dst_type'], \
            a['src_type']+" "+a['relationship']+" "+a['dst_type'], \
            a['src_type'], \
            a['dst_type'], \
            a['relationship'] \
        ).replace("\n","").replace("\t", ' ')
    for a in tripleTypes \
]


In [50]:
createTriplets[0]

'$company_owned_by_company isa graphTriplet has name "company owned_by company";    (src-vertex-owned: $company, dst-vertex-owned: $company, edge-owned: $owned_by, object-owner: $company_owned_by_company) isa has-graphobjects;    '

In [51]:
createTriplets[1]

'$person_employed_by_company isa graphTriplet has name "person employed_by company";    (src-vertex-owned: $person, dst-vertex-owned: $company, edge-owned: $employed_by, object-owner: $person_employed_by_company) isa has-graphobjects;    '

## Review of generated Graql statements

In [52]:
verts = " ".join(createVertices)

In [53]:
edges = " ".join(createEdges)

In [54]:
trips = " ".join(createTriplets)

In [55]:
print(verts + '\n' + edges + "\n" + trips)

$person isa graphVertex has name "person"; $company isa graphVertex has name "company";
$owned_by isa graphEdge has name "owned_by"; $employed_by isa graphEdge has name "employed_by"; $claims_dependent isa graphEdge has name "claims_dependent";
$company_owned_by_company isa graphTriplet has name "company owned_by company";    (src-vertex-owned: $company, dst-vertex-owned: $company, edge-owned: $owned_by, object-owner: $company_owned_by_company) isa has-graphobjects;     $person_employed_by_company isa graphTriplet has name "person employed_by company";    (src-vertex-owned: $person, dst-vertex-owned: $company, edge-owned: $employed_by, object-owner: $person_employed_by_company) isa has-graphobjects;     $person_claims_dependent_person isa graphTriplet has name "person claims_dependent person";    (src-vertex-owned: $person, dst-vertex-owned: $person, edge-owned: $claims_dependent, object-owner: $person_claims_dependent_person) isa has-graphobjects;    


## Insert Company-[owned_by]->Person
To see graph evolution in action, we will add a new triple using existing vertices and edges.


```
match
    $person isa graphVertex has name "person"; $company isa graphVertex has name "company";
    $owned_by isa graphEdge has name "owned_by";
insert
    $company_owned_by_person isa graphTriplet has name "company owned_by person";
    (src-vertex-owned: $company, dst-vertex-owned: $person, edge-owned: $owned_by, object-owner: $company_owned_by_person) isa has-graphobjects;
```
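
A statement of this shape could also be generated from a triple type instead of being written by hand. The following is a minimal sketch under the same naming conventions used earlier in this notebook (`graphVertex`, `graphEdge`, `graphTriplet`, `has-graphobjects`). The function name `match_insert_triplet` is hypothetical, and the sketch assumes type and relationship labels are already valid as variable names.

```python
def match_insert_triplet(src_type, relationship, dst_type):
    """Build a match-insert statement that links existing vertex and edge
    entries to a new graphTriplet entry."""
    triplet_var = "${}_{}_{}".format(src_type, relationship, dst_type)
    triplet_name = "{} {} {}".format(src_type, relationship, dst_type)

    # Match each vertex type once, even when source and destination coincide.
    vertex_types = [src_type] if src_type == dst_type else [src_type, dst_type]

    match_lines = [
        '    ${0} isa graphVertex has name "{0}";'.format(v) for v in vertex_types
    ]
    match_lines.append(
        '    ${0} isa graphEdge has name "{0}";'.format(relationship)
    )

    insert_lines = [
        '    {} isa graphTriplet has name "{}";'.format(triplet_var, triplet_name),
        "    (src-vertex-owned: ${}, dst-vertex-owned: ${}, edge-owned: ${}, "
        "object-owner: {}) isa has-graphobjects;".format(
            src_type, dst_type, relationship, triplet_var
        ),
    ]
    return "\n".join(["match"] + match_lines + ["insert"] + insert_lines)


query = match_insert_triplet("company", "owned_by", "person")
print(query)
```

Whether the resulting statement is accepted unchanged depends on the Graql version in use, so treat the output as a starting point to validate against a running Grakn instance.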