# Schema Transform Example
In this example, we persist a source graph's metadata in Grakn, perform a motif query to 'transform' the graph, and document the updated schema. Ideally, generated code snippets will apply the transforms to the source graph in Spark.

## Source Graph


In [1]:
# ! pip install findspark



In [27]:
# !pip install graphframes
# https://towardsdatascience.com/graphframes-in-jupyter-a-practical-guide-9b3b346cebc5

Collecting graphframes
  Could not find a version that satisfies the requirement graphframes (from versions: )
No matching distribution found for graphframes


In [1]:
import findspark
findspark.init()


In [2]:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext, DataFrame
# SparkSession.builder.config(conf=SparkConf())

In [3]:
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
sqlC = SQLContext(sc)
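# pip could not install graphframes, so load its Python wrapper from the local jar (machine-specific path).
sc.addPyFile("/Users/josephhaaga/.ivy2/jars/graphframes_graphframes-0.6.0-spark2.3-s_2.11.jar")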

In [4]:
edges = sqlC.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("./data/thousand_nodes-edges.csv") 
    
vertices = sqlC.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("./data/thousand_nodes-nodes.csv") 

In [5]:
vertices.toPandas()

| Unnamed: 0 | id | name | itin | ein | street_address | city | state |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Robert Lynch | 934-78-2061 | 42-9154549 | 158 Jones Locks | East Amyville | Connecticut |
| 1 | 1 | Jacob Lawson | 978-94-2682 | 98-0774013 | 25747 Katelyn Circles | Davisbury | Texas |
| 2 | 2 | James Jordan | 982-92-7526 | 88-1312358 | 936 Anna Crest | Rodriguezport | Montana |
| 3 | 3 | Kevin Berg | 985-75-9439 | 60-4259842 | 6630 Gregory Turnpike Suite 907 | West Connieshire | Wisconsin |
| 4 | 4 | Deborah Peterson | 948-79-5038 | 10-5618461 | 04830 Sheila Glen Suite 271 | Port Annside | Florida |
| 5 | 5 | Steve Simpson | 998-82-9153 | 90-5215367 | 93162 Nicholas Point | Vasquezhaven | Louisiana |
| 6 | 6 | Brittany Keith | 947-87-6379 | 50-5623049 | 396 Adams Orchard | Lake Nicholas | Pennsylvania |
| 7 | 7 | Thomas Guerrero | 989-79-6916 | 25-7863586 | 77787 Michael Center | North Christopher | Arkansas |
| 8 | 8 | Erica Robbins | 967-97-4442 | 76-8332209 | 2211 Waters Mission Suite 878 | New Crystal | Maryland |
| 9 | 9 | Brandon Smith DDS | 929-77-7914 | 54-0148597 | 5916 Michael Green | Lindsaybury | Missouri |


In [6]:
from graphframes import GraphFrame
# https://stackoverflow.com/a/50404308

In [17]:
g = GraphFrame(vertices, edges)

In [20]:
# Dependents claiming dependents
g.find("(a)-[]->(b); (b)-[]->(c)").show()

+--------------------+--------------------+--------------------+
|                   a|                   b|                   c|
+--------------------+--------------------+--------------------+
|[1, Jacob Lawson,...|[148, Stephanie E...|[1, Jacob Lawson,...|
|[1, Jacob Lawson,...|[148, Stephanie E...|[2, James Jordan,...|
|[1, Jacob Lawson,...|[148, Stephanie E...|[3, Kevin Berg, 9...|
|[1, Jacob Lawson,...|[148, Stephanie E...|[5, Steve Simpson...|
|[1, Jacob Lawson,...|[148, Stephanie E...|[8, Erica Robbins...|
|[1, Jacob Lawson,...|[148, Stephanie E...|[9, Brandon Smith...|
|[1, Jacob Lawson,...|[148, Stephanie E...|[10, Kirsten Coll...|
|[1, Jacob Lawson,...|[148, Stephanie E...|[11, Dylan Edward...|
|[1, Jacob Lawson,...|[148, Stephanie E...|[12, Ashley Brown...|
|[1, Jacob Lawson,...|[148, Stephanie E...|[13, Katherine Fl...|
|[1, Jacob Lawson,...|[148, Stephanie E...|[16, Susan Roth, ...|
|[1, Jacob Lawson,...|[148, Stephanie E...|[17, Samantha Lam...|
|[1, Jacob Lawson,...|[14
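
To make the motif's semantics concrete, here is a minimal plain-Python sketch of what a two-hop pattern such as `(a)-[]->(b); (b)-[]->(c)` matches. It is not how GraphFrames evaluates motifs internally. The function name `two_hop_motif` and the toy edge list are hypothetical, chosen only for illustration, and the edges are not rows from the CSV files.

```python
from collections import defaultdict

def two_hop_motif(edges):
    """Return every (a, b, c) such that a->b and b->c are both edges."""
    # Index outgoing neighbours so the second hop is a simple lookup.
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
    return [(a, b, c)
            for a, b in edges
            for c in successors.get(b, [])]

# Hypothetical toy edges as (src, dst) pairs; not taken from the CSV files.
toy_edges = [(1, 148), (148, 1), (148, 2)]
matches = two_hop_motif(toy_edges)
print(matches)
```

Note that `a` and `c` are not required to be distinct, which is why a row such as `a = 1, b = 148, c = 1` can appear in the result. If such round trips are unwanted, they can be filtered out after the match.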

## Describe Source Graph in Grakn Metamodel
The source graph, a GraphFrame named `g`, depicts a network of people claiming each other as dependents. We need methods to extract the relevant features of this graph so that it can be described in the Grakn metamodel. The things we need to describe include:

### graphVertex
- Attributes
    - Name
- Relationships
    - has-type
    - has-graphobjects
    - has-concept
    - has-vertexid
    - has_property
    - has-attribute

In [63]:
import uuid

#### Can GraphFrames give a list of the different vertex types?
e.g., something like `g.vertices.types`

GraphFrames appears to have only weak support for multiple vertex types in the same graph, so the list of vertex types has to be supplied manually when creating the metamodel. See:
https://forums.databricks.com/questions/7792/with-graphframes-are-there-ways-of-dealing-with-mu.html
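
One possible manual workaround, sketched here in plain Python rather than Spark, is to tag every vertex record with its type before merging everything into a single vertex table. The type list can then be read back from that tag. The helper `tag_vertices`, the `type` field, and the `companies` record are all assumptions made for illustration, since the example graph only contains people.

```python
# Hypothetical per-type vertex records; only "Person" exists in this example.
people = [{"id": 0, "name": "Robert Lynch"}, {"id": 1, "name": "Jacob Lawson"}]
companies = [{"id": 1000, "name": "Acme LLC"}]  # made-up record, for illustration

def tag_vertices(records_by_type):
    """Merge per-type record lists into one list, adding a 'type' field."""
    return [dict(record, type=vertex_type)
            for vertex_type, records in records_by_type.items()
            for record in records]

vertices_tagged = tag_vertices({"Person": people, "Company": companies})

# Distinct types in first-seen order (dicts preserve insertion order).
vertex_types = list(dict.fromkeys(v["type"] for v in vertices_tagged))
print(vertex_types)
```

With Spark DataFrames the same idea would presumably be a literal column added to each per-type DataFrame before a union, followed by a distinct select on that column.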

In [65]:
types = ["Person"]
# types = ["Person", "Return", "Company"]

# One Graql insert per vertex type; a fresh UUID serves as the variable name.
createVertices = ['insert ${} isa graphVertex has name "{}";'.format(uuid.uuid4(), v)
                  for v in types]

createVertices

['insert $10faabf8-660e-4bd9-b970-2ffba0838bd2 isa graphVertex has name "Person";']

### graphEdge
- Attributes
    - Name
- Relationships
    - has-type
    - has-graphobjects
    - has-concept
    - has-edgeids
    - has_property
    - has-attribute


In [None]:
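# Distinct values of the edges' "relationship" column are the edge types.
edgeTypes = [row["relationship"] for row in g.edges.select("relationship").distinct().collect()]
# One Graql insert per edge type, mirroring createVertices.
createEdges = ['insert ${} isa graphEdge has name "{}";'.format(uuid.uuid4(), e)
               for e in edgeTypes]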

createEdges
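
Both `createVertices` and `createEdges` interpolate names directly into the Graql string, so a name containing a double quote would produce a malformed statement. The sketch below factors the pattern into one helper. The function `graql_insert` is hypothetical, and the backslash escaping of quotes is an assumption about Graql string literals that should be checked against the Grakn version in use. The hyphen-free variable name is a conservative choice, not a known requirement.

```python
import uuid

def graql_insert(concept_type, name, var=None):
    """Build one Graql insert statement for a named metamodel concept."""
    # Assumption: Graql string literals accept backslash-escaped quotes.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    # Hex form has no hyphens; the leading "v" avoids starting with a digit.
    var = var or "v" + uuid.uuid4().hex
    return 'insert ${} isa {} has name "{}";'.format(var, concept_type, escaped)

# Fixed variable names are passed here only to make the output reproducible.
statements = [graql_insert("graphVertex", t, var="v{}".format(i))
              for i, t in enumerate(["Person"])]
print(statements)
```

Calling `graql_insert("graphEdge", e)` for each entry of `edgeTypes` would cover the edge case in the same way, with a random variable name generated per statement.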

### graphAttribute

### graphTriplet