## Analyzing Click Stream Data with the GraphX Library

In [1]:
val maxEdgesFirstShell = 25
val maxEdgesNthShell = 20

maxEdgesFirstShell = 25
maxEdgesNthShell = 20


20

We will take a look at the Wikipedia ClickStream Dataset (TODO:  Link to original dataset).  This dataset reports the number of times users went from one Wikipedia page to another.  The first two columns are the page IDs of the "from" page and the "to" page, respectively.  The third column is the number of clicks (the edge values).  The fourth and fifth columns are the titles of the "from" and "to" pages, and the last column is the type of the transition (`link` or `other` in the sample below).  If a "from" page is outside of the Wikipedia corpus, its ID is empty and its title is one of the `other-*` sources (for example `other-google`).  We will remove these rows, because they are not interesting for this analysis and their counts are much larger than the clickstream traffic within Wikipedia.

In [35]:
val lines = sc.textFile("headPlusWatsonPlusTeslaPlusApple.tsv")
lines.take(20).foreach(println)

	3632887	93	other-wikipedia	!!	other
	3632887	46	other-empty	!!	other
	3632887	10	other-other	!!	other
64486	3632887	11	!_(disambiguation)	!!	other
2061699	2556962	19	Louden_Up_Now	!!!_(album)	link
	2556962	25	other-empty	!!!_(album)	other
	2556962	16	other-google	!!!_(album)	other
	2556962	44	other-wikipedia	!!!_(album)	other
64486	2556962	15	!_(disambiguation)	!!!_(album)	link
600744	2556962	297	!!!	!!!_(album)	link
	6893310	11	other-empty	!Hero_(album)	other
1921683	6893310	26	!Hero	!Hero_(album)	link
	6893310	16	other-wikipedia	!Hero_(album)	other
	6893310	23	other-google	!Hero_(album)	other
8127304	22602473	16	Jericho_Rosales	!Oka_Tokat	link
35978874	22602473	20	List_of_telenovelas_of_ABS-CBN	!Oka_Tokat	link
	22602473	57	other-google	!Oka_Tokat	other
	22602473	12	other-wikipedia	!Oka_Tokat	other
7360687	22602473	10	Rica_Peralejo	!Oka_Tokat	link
37104582	22602473	11	Jeepney_TV	!Oka_Tokat	link


lines = headPlusWatsonPlusTeslaPlusApple.tsv MapPartitionsRDD[3940] at textFile at <console>:58


headPlusWatsonPlusTeslaPlusApple.tsv MapPartitionsRDD[3940] at textFile at <console>:58
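To make the column layout concrete, here is a minimal sketch that parses two of the rows printed above with plain Scala (no Spark needed). The `ClickRow` case class and the `parseLine` helper are hypothetical names introduced only for this illustration, and the column meanings are inferred from the sample rows.

```scala
// Hypothetical record type for one clickstream row (names are illustrative).
case class ClickRow(prevId: Option[Long], currId: Long, n: Int,
                    prevTitle: String, currTitle: String, linkType: String)

// Split on tabs; an empty first field means the source lies outside Wikipedia.
def parseLine(line: String): ClickRow = {
  val f = line.split("\t")
  val prevId = if (f(0).isEmpty) None else Some(f(0).toLong)
  ClickRow(prevId, f(1).toLong, f(2).toInt, f(3), f(4), f(5))
}

// Two rows copied from the sample output above.
val external = parseLine("\t3632887\t93\tother-wikipedia\t!!\tother")
val internal = parseLine("600744\t2556962\t297\t!!!\t!!!_(album)\tlink")

println(external)
println(internal)
```

Note that `split("\t")` keeps the leading empty field, which is what lets the notebook detect rows whose source is outside Wikipedia.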

## Defining Some ETL Functions

The raw rows need some ETL before they form a graph model that is worth using.  The first helper is a simple indexing of the "other" sources, which gives each of them a small integer ID so that we can filter them out later on.

In [36]:
// simple mapping of the 'other' source names to a unique index
def getIndex(x: String): Int = x match {
    case "other-wikipedia" => 1
    case "other-empty"     => 2
    case "other-internal"  => 3    
    case "other-google"    => 4
    case "other-yahoo"     => 5
    case "other-bing"      => 6
    case "other-facebook"  => 7
    case "other-twitter"   => 8 
    case "other-other"     => 9 
    case _ => 0
}

getIndex: (x: String)Int


Here are some more ETL functions.  We need to populate empty fields with placeholder values of the same type as the rest of the column.  The RDD API is a bit less forgiving here than the DataFrames API, but it's nothing we can't handle!

In [37]:
// Replace an empty source id with the index of its "other-*" category,
// so that those rows can be filtered out later on.
def assignOtherSourceIndex(partsLine: Array[String]): String = {
    if (partsLine(0) == "") getIndex(partsLine(3)).toString
    else partsLine(0)
}

// filling empty number fields with a large number
def fillWithNum(element: String): String = {
    if (element == "") 999999999.toString
    else element
}

// filling empty column fields with a filler string
def fillWithString(element: String, colNum: Int): String = {
    if (element == "") "column" + colNum + "Fill"
    else element
}

assignOtherSourceIndex: (partsLine: Array[String])String
fillWithNum: (element: String)String
fillWithString: (element: String, colNum: Int)String


Now we're going to take our raw RDD and apply all of our ETL functions.  We're filtering out redlinks, as well as all clickstreams from outside Wikipedia, and also handling empty fields.

In [38]:
// Split on tabs, drop redlinks, fill empty fields, then keep only rows whose
// source id is a real page id (ids 0-9 are reserved for the "other-*" sources).
val parts = lines.map(l => l.split("\t")).
    filter(l => !l.contains("redlink")).
    map(l => Array(assignOtherSourceIndex(l),
                   fillWithNum(l(1)), fillWithNum(l(2)),
                   fillWithString(l(3), 4), fillWithString(l(4), 5))).
    filter(y => y(0).toInt > 9)

parts.take(20).foreach(x => println(x(0)+"\t" + x(1) + "\t" + x(2) + "\t" + x(3) + "\t" + x(4)))

Name: org.apache.spark.SparkException
Message: Task not serializable
StackTrace:   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:371)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.map(RDD.scala:370)
  ... 60 elided
Caused by: java.io.NotSerializableException: java.io.PrintWriter
Serialization stack:
	- object not serializable (class: java.io.PrintWriter, value: java.io.
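
The filtering logic itself can be checked on a handful of rows without a cluster. The sketch below is a local approximation, not the notebook's Spark job: `getIndexLocal`, `assignLocal`, `sampleRows` and `keptRows` are hypothetical names, only two of the "other" sources are mapped, and the redlink row is made up for the example.

```scala
// Simplified local copies of the notebook helpers (only two "other" sources shown).
def getIndexLocal(x: String): Int = x match {
  case "other-wikipedia" => 1
  case "other-google"    => 4
  case _                 => 0
}
def assignLocal(parts: Array[String]): String =
  if (parts(0) == "") getIndexLocal(parts(3)).toString else parts(0)

// Three rows copied from the sample output, plus one made-up redlink row.
val sampleRows = Seq(
  "\t3632887\t93\tother-wikipedia\t!!\tother",
  "64486\t3632887\t11\t!_(disambiguation)\t!!\tother",
  "600744\t2556962\t297\t!!!\t!!!_(album)\tlink",
  "123\t456\t12\tSome_Page\tMissing_Page\tredlink"   // hypothetical row
)

// Same steps as the RDD pipeline: split, drop redlinks, index, filter on id > 9.
val keptRows = sampleRows.map(_.split("\t")).
  filter(l => !l.contains("redlink")).
  map(l => Array(assignLocal(l), l(1), l(2), l(3), l(4))).
  filter(y => y(0).toInt > 9)

keptRows.foreach(r => println(r.mkString("\t")))
```

The external row receives the small index 1 and is dropped by the final filter, and the redlink row is dropped earlier, so only the two Wikipedia-internal rows remain.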

We can convert this into a DataFrame.  For this notebook, the DataFrame is only used briefly to show the column labels.  The `parts` RDD is used to construct the graph.

In [39]:
case class Fields(prev_id: Int, curr_id: Int, n: Int, 
           prev_title: String, curr_title: String)

defined class Fields




In [40]:
val clicksDataFrame = parts.map(
     p => Fields(p(0).toInt, p(1).toInt, p(2).toInt, p(3), p(4))).toDF

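// registering the DataFrame as a temporary view
// (createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0)
clicksDataFrame.createOrReplaceTempView("clicks")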
clicksDataFrame.show(20)

+--------+--------+---+--------------------+----------------+
| prev_id| curr_id|  n|          prev_title|      curr_title|
+--------+--------+---+--------------------+----------------+
|   64486| 3632887| 11|  !_(disambiguation)|              !!|
| 2061699| 2556962| 19|       Louden_Up_Now|     !!!_(album)|
|   64486| 2556962| 15|  !_(disambiguation)|     !!!_(album)|
|  600744| 2556962|297|                 !!!|     !!!_(album)|
| 1921683| 6893310| 26|               !Hero|   !Hero_(album)|
| 8127304|22602473| 16|     Jericho_Rosales|      !Oka_Tokat|
|35978874|22602473| 20|List_of_telenovel...|      !Oka_Tokat|
| 7360687|22602473| 10|       Rica_Peralejo|      !Oka_Tokat|
|37104582|22602473| 11|          Jeepney_TV|      !Oka_Tokat|
|34376590|22602473| 22|Oka_Tokat_(2012_T...|      !Oka_Tokat|
|31976181| 6810768| 51|List_of_death_met...|      !T.O.O.H.!|
| 1337475| 3243047|208|The_Dismemberment...|       !_(album)|
| 3284285| 3243047| 78|The_Dismemberment...|       !_(album)|
| 209829

clicksDataFrame = [prev_id: int, curr_id: int ... 3 more fields]




[prev_id: int, curr_id: int ... 3 more fields]

We're also going to explicitly deduplicate our vertices before building the graph.  The GraphX API would handle this for us, but doing it up front saves us some time later on.

In [41]:
// Nodes should be deduplicated.  Arrays compare by reference, so each (id, title)
// pair is joined into one string before calling distinct, then split again.
val nodes1 = parts.map(p => Array(p(0), p(3)))   // "from" pages
nodes1.cache()
val uniqueNodes1 = nodes1.map(x => x(0) + "-0-" + x(1)).distinct.
    map(x => Array(x.split("-0-")(0), x.split("-0-")(1)))  // resplitting to original structure

val nodes2 = parts.map(p => Array(p(1), p(4)))   // "to" pages
nodes2.cache()
val uniqueNodes2 = nodes2.map(x => x(0) + "-0-" + x(1)).distinct.
    map(x => Array(x.split("-0-")(0), x.split("-0-")(1)))  // resplitting to original structure

uniqueNodes1.cache
uniqueNodes2.cache
// Merge both sides, deduplicate again, and convert to (VertexId, attribute) pairs
// in the shape expected for a vertex RDD.
val uniqueNodesBoth = (uniqueNodes1 ++ uniqueNodes2).map(x => x(0) + "-0-" + x(1)).distinct.
    map(x => Array(x.split("-0-")(0), x.split("-0-")(1))).
    map(x => (x(0).toInt.toLong, (x(1), x(1))))
uniqueNodesBoth.count

nodes1 = MapPartitionsRDD[3948] at map at <console>:71
uniqueNodes1 = MapPartitionsRDD[3953] at map at <console>:74
nodes2 = MapPartitionsRDD[3954] at map at <console>:76
uniqueNodes2 = MapPartitionsRDD[3959] at map at <console>:79
uniqueNodesBoth = MapPartitionsRDD[3966] at map at <console>:87


8204
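
The string-joining trick is needed because Scala arrays compare by reference, so `distinct` cannot tell that two `Array` rows hold the same values. Tuples compare by value, which suggests a simpler alternative. The sketch below shows the difference on local collections with hypothetical names (`asArrays`, `asTuples`, `vertexPairs`); Spark's `distinct` relies on the same equality semantics, but this variant has not been run against the notebook's data.

```scala
// The same (id, title) pair appearing twice, once per representation.
val asArrays = Seq(Array("64486", "!_(disambiguation)"), Array("64486", "!_(disambiguation)"))
val asTuples = Seq(("64486", "!_(disambiguation)"), ("64486", "!_(disambiguation)"))

// Arrays use reference equality, so both copies survive.
println(asArrays.distinct.length)
// Tuples use structural equality, so the duplicate is removed.
println(asTuples.distinct.length)

// Converting to the (id, attribute) shape used for the vertex RDD.
val vertexPairs = asTuples.distinct.map { case (id, title) => (id.toLong, (title, title)) }
println(vertexPairs)
```

A tuple-based version would also avoid mis-splitting any title that happens to contain the `-0-` separator.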

Now, we construct the graph.  A GraphX graph consists of an RDD of vertices and an RDD of edges, plus a default vertex attribute that is assigned to any vertex referenced by an edge but missing from the vertex RDD (a dangling edge).  The first 10 vertices and edges are printed below.  Note that the vertices carry the page title as an extra label, while the edges carry only the integer vertex IDs and the edge value (the click count).  This keeps the edges RDD compact, since it is much larger than the vertices RDD.

In [42]:
// importing the graphx library
import org.apache.spark.graphx._
case class nodeFields(nodeID: Int, nodeName: String)
 
val edges = parts.map(x => Edge(x(0).toInt.toLong,x(1).toInt.toLong, x(2)) )
val defaultNode = ("default Node", "Missing")

//Graph wraps a vertex list and an edge list; defaultNode is the attribute given to any
//vertex that is referenced by an edge but missing from the vertex list.
//It does some internal bookkeeping to maintain consistency before packaging.
val graph = Graph(uniqueNodesBoth,edges,defaultNode)
//the main graph will be searched very frequently, so keep it cached
graph.cache()

println("\nfirst 10 vertices")
graph.vertices.take(10).foreach(println)

println("\nfirst 10 edges")
graph.edges.take(10).foreach(println)


first 10 vertices
(150824,(Maiden,_North_Carolina,Maiden,_North_Carolina))
(148720,(Quiet_Riot,Quiet_Riot))
(612052,(Spider-Man_2,Spider-Man_2))
(24319476,(Billy_Horschel,Billy_Horschel))
(56892,(Classic_Environment,Classic_Environment))
(32300828,(Ultimate_Fallout,Ultimate_Fallout))
(1406584,(CCR_and_CAR_algebras,CCR_and_CAR_algebras))
(24752844,(Français_Pour_une_Nuit,Français_Pour_une_Nuit))
(1647936,(Luria–Delbrück_experiment,Luria–Delbrück_experiment))
(1140076,(1982_Formula_One_season,1982_Formula_One_season))

first 10 edges
Edge(878,1424517,20)
Edge(1130,1766908,14)
Edge(1162,18938265,33)
Edge(1348,821939,28)
Edge(1495,155375,79)
Edge(1869,175149,308)
Edge(2678,2676,11)
Edge(2724,4864529,148)
Edge(2824,2001051,25)
Edge(3382,3411,4720)


defined class nodeFields
edges = MapPartitionsRDD[3967] at map at <console>:39
defaultNode = (default Node,Missing)
graph = org.apache.spark.graphx.impl.GraphImpl@6cf3ae72


org.apache.spark.graphx.impl.GraphImpl@6cf3ae72
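
The role of `defaultNode` can be pictured with ordinary collections: any vertex ID that appears in an edge but not in the vertex list receives the default attribute. The sketch below only illustrates that behaviour with hypothetical local names (`localVertices`, `localEdges`, `localDefault`, `resolved`) and a made-up title; GraphX does the equivalent bookkeeping internally when the graph is built.

```scala
// Hypothetical miniature vertex and edge lists (ids and count from the edge output above).
val localVertices = Map(878L -> "Known_Page")          // title is made up
val localEdges = Seq((878L, 1424517L, 20))             // (srcId, dstId, clicks)
val localDefault = "Missing"

// Every id mentioned by an edge, looked up with the default as fallback.
val resolved = localEdges.
  flatMap { case (src, dst, _) => Seq(src, dst) }.
  distinct.
  map(id => id -> localVertices.getOrElse(id, localDefault))

resolved.foreach(println)
```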

## Graph Processing Functions

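We want to keep only the top N edges of a (sub)graph, ranked by click count, for example the heaviest edges connected to one particular node, and discard the remaining edges.  This is known as graph pruning.  The function below finds the click count of the Nth heaviest edge and keeps every edge at or above that threshold, so ties can leave slightly more than N edges.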

In [43]:
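def pruneGraphByMaxEdges(maxEdges: Int,  bigGraph:  Graph[(String, String),String]): 
                                                    Graph[(String, String),String] = {
    // click counts of the maxEdges heaviest edges, in descending order
    val topCounts = bigGraph.edges.map(_.attr.toInt).top(maxEdges)

    // nothing to prune when the graph has no edges
    if (topCounts.isEmpty) bigGraph
    else {
        // keep every edge at least as heavy as the lightest of the top edges
        val minCount = topCounts.last
        bigGraph.subgraph(epred = x => x.attr.toInt >= minCount)
    }
}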

pruneGraphByMaxEdges: (maxEdges: Int, bigGraph: org.apache.spark.graphx.Graph[(String, String),String])org.apache.spark.graphx.Graph[(String, String),String]
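
The threshold rule can be tried out on a plain list of click counts. This is a local sketch under the assumption that edges are reduced to their weights; `pruneByMaxEdges` and `sampleClicks` are hypothetical names, and the counts are illustrative.

```scala
// Keep every weight that is at least the maxEdges-th largest weight.
def pruneByMaxEdges(maxEdges: Int, weights: Seq[Int]): Seq[Int] = {
  val top = weights.sorted(Ordering[Int].reverse).take(maxEdges)
  if (top.isEmpty) weights else weights.filter(_ >= top.last)
}

val sampleClicks = Seq(117, 119, 99, 178, 259, 113, 897, 412, 53)   // illustrative click counts

// The three heaviest edges survive, in their original order.
println(pruneByMaxEdges(3, sampleClicks))
// A tie at the threshold keeps more than maxEdges edges.
println(pruneByMaxEdges(2, Seq(5, 9, 5, 1)))
```

Filtering by threshold instead of taking exactly N items is what allows the pruning to be expressed as an edge predicate for `subgraph`.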


With our pruning function in hand, we want to start with a particular node and build a *shell* of N nodes around that central node.  Once that graph is built, we want to add a shell to each node of that graph.  We can do this as many times as we like, but we're only creating 3 shells in this notebook.  Ultimately, this will give us a list of all important Wikipedia pages that are 4 clicks or fewer away from the source page.

In [44]:
def addShellToGraph(thisGraph: Graph[(String, String),String], shellNum:Int ): Graph[(String, String),String]  = {

    // titles of every page that is the source of an edge in the current graph
    // (deduplicated so that each page is processed only once)
    val searchStringList = thisGraph.triplets.map(x => x.srcAttr._1).distinct.collect

    // For each source page, add its heaviest outgoing edges from the full graph.
    // Only pages that already appear as an edge source are expanded here; a page
    // reached solely as a destination is not expanded by a later call.
    // (shellNum is currently unused and kept only for the callers below.)
    searchStringList.foldLeft(thisGraph) { (prevGraph, title) =>
        val currentGraph = pruneGraphByMaxEdges(maxEdgesNthShell,
                            graph.subgraph(epred = x => x.srcAttr._1 == title))

        // only *edges* are merged; the full vertex list is reused every time
        Graph(graph.vertices, (prevGraph.edges ++ currentGraph.edges).distinct)
    }
}


addShellToGraph: (thisGraph: org.apache.spark.graphx.Graph[(String, String),String], shellNum: Int)org.apache.spark.graphx.Graph[(String, String),String]
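
One expansion step can be pictured on a tiny in-memory edge list. The sketch below is an assumption-laden simplification rather than the GraphX code: `LocalEdge`, `toyEdges` and `addShellLocal` are hypothetical names, the data is made up, and it uses a plain top-N per source instead of the threshold rule.

```scala
// Hypothetical edge list: (source title, destination title, clicks).
type LocalEdge = (String, String, Int)
val toyEdges: Seq[LocalEdge] = Seq(
  ("A", "Center", 50), ("B", "Center", 40),
  ("A", "X", 30), ("A", "Y", 20), ("A", "Z", 10),
  ("B", "Y", 25), ("X", "W", 5)
)

// For every page that is already an edge source, add its heaviest outgoing edges.
def addShellLocal(current: Seq[LocalEdge], maxPerNode: Int): Seq[LocalEdge] = {
  val sources = current.map(_._1).distinct
  sources.foldLeft(current) { (acc, src) =>
    val heaviest = toyEdges.filter(_._1 == src).sortBy(e => -e._3).take(maxPerNode)
    (acc ++ heaviest).distinct
  }
}

val shell0 = toyEdges.filter(_._2 == "Center")   // edges pointing at the center
val shell1 = addShellLocal(shell0, 2)
shell1.foreach(println)
```

Folding over the source list accumulates the edge set one page at a time, and `distinct` prevents an edge from being added twice.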


Now let's build our clickstream graph!  We'll start with the Watson page.  Some other pages are listed there as well (commented out).  Feel free to return to this cell and generate graphs with those after running through the first example.

In [46]:
// MAIN QUERY  ============================================
//Here is a list of pages that work well for the prototype dataset

val centerVertex = "Watson_(computer)"
//val centerVertex = "Heroes"
//val centerVertex = "Tesla_Motors"
//val centerVertex = "Apple_Inc."

centerVertex = Watson_(computer)


Watson_(computer)

First, we generate the first shell of pages around the center vertex (Watson).

In [47]:
val smallGraph = pruneGraphByMaxEdges(maxEdgesFirstShell, 
                 graph.subgraph(epred = x => x.dstAttr._1.contains(centerVertex) && 
                               !(x.srcAttr._1.contains("other-")) && 
                               !(x.srcAttr._1 == "Main_Page"))
                )
smallGraph.cache
//smallGraph.triplets.count

smallGraph = org.apache.spark.graphx.impl.GraphImpl@6357d812


org.apache.spark.graphx.impl.GraphImpl@6357d812

Here is the list of the vertices linking to Watson that survived pruning.  We'll build our larger graph from this one.  We're calling the `triplets` method, which returns each edge together with the attributes of its source and destination vertices (the destination is always Watson).

In [48]:
smallGraph.triplets.collect.foreach(println)

((1164,(Artificial_intelligence,Artificial_intelligence)),(22584291,(Watson_(computer),Watson_(computer))),117)
((2142,(List_of_artificial_intelligence_projects,List_of_artificial_intelligence_projects)),(22584291,(Watson_(computer),Watson_(computer))),119)
((23485,(Prolog,Prolog)),(22584291,(Watson_(computer),Watson_(computer))),99)
((30657,(Terabyte,Terabyte)),(22584291,(Watson_(computer),Watson_(computer))),178)
((49387,(Deep_Blue_(chess_computer),Deep_Blue_(chess_computer))),(22584291,(Watson_(computer),Watson_(computer))),259)
((136764,(Blue_Gene,Blue_Gene)),(22584291,(Watson_(computer),Watson_(computer))),113)
((753973,(Ken_Jennings,Ken_Jennings)),(22584291,(Watson_(computer),Watson_(computer))),897)
((886996,(Watson,Watson)),(22584291,(Watson_(computer),Watson_(computer))),412)
((1400125,(Thomas_J._Watson_Research_Center,Thomas_J._Watson_Research_Center)),(22584291,(Watson_(computer),Watson_(computer))),53)
((1813537,(POWER7,POWER7)),(22584291,(Watson_(computer),Watson_(computer

Now we'll build out the additional shells from this original graph.  We're using a very small dataset here, so it takes well under a minute (about 32 seconds in the run shown below).

In [49]:
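//smallGraph.triplets.count
// NOTE: these two values are not read by the functions above, which use
// maxEdgesFirstShell and maxEdgesNthShell from the first cell.
val maxEdgesPerFirstShell = 50
val maxEdgesPerNthShell = 20

val t0 = System.nanoTime
// called 3 times (manually)
val graph1p1 = addShellToGraph(smallGraph,2)
val graph2p1 = addShellToGraph(graph1p1,3)
// fourth click takes longer and is not very informative (so far)
val graph3p1 = addShellToGraph(graph2p1,4)
//val graph2 = graph3p1
graph3p1.edges.count   // forces evaluation so that the timing is meaningful
// elapsed time in seconds, with millisecond resolution
val dt = ((System.nanoTime - t0) / 1000000L / 1.0e3).toString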

maxEdgesPerFirstShell = 50
maxEdgesPerNthShell = 20
t0 = 135585393432676
graph1p1 = org.apache.spark.graphx.impl.GraphImpl@7013683b
graph2p1 = org.apache.spark.graphx.impl.GraphImpl@42f6ccbc
graph3p1 = org.apache.spark.graphx.impl.GraphImpl@46a6d121
dt = 32.389


32.389

## Visualizing the Graph

Since we're using Scala here, we're just going to output the graph as a web page that we will view in the browser.  The rest of the code here just reformats the graph and writes a JSON file and an HTML file that uses the D3 library.  After this has been run, go to the `site` directory and type:

`(py35) python -m http.server`

and go to `localhost:8000` (or `remotehost:8000`) to see your graph!


In [26]:
val outputVertexList = (graph3p1.triplets.map(x=>x.srcAttr._1) ++ graph3p1.triplets.map(x=>x.dstAttr._1)).distinct

// getting unique indices for vertices that will be referenced in the 'links' section of the json output.
val outputVertexListZipped=outputVertexList.collect.zipWithIndex

def getLinkIndex(name: String): Int = { outputVertexListZipped.filter(x=>x._1==name)(0)._2}
centerVertex


outputVertexList = MapPartitionsRDD[3919] at distinct at <console>:77
outputVertexListZipped = Array((Terabyte,0), (Google_DeepMind,1), (Watson,2), (List_of_artificial_intelligence_projects,3), (IBM,4), (Brad_Rutter,5), (TOP500,6), (Thomas_Watson,_Jr.,7), (Ginni_Rometty,8), (Artificial_intelligence,9), (Thomas_J._Watson,10), (Timeline_of_electrical_and_electronic_engineering,11), (Computer_performance_by_orders_of_magnitude,12), (Hawthorne,_New_York,13), (Actavis,14), (Wii_U,15), (Watson_(computer),16), (Apple_Inc.,17), (Outline_of_natural_language_processing,18), (Deep_learning,19), (Deep_Blue_(chess_computer),20), (List_of_Jeopardy!_tournaments_and_events,21), (Molecular_Structure_of_Nucleic_Acids:_A_Structure_for_Deoxyribose_Nuc...
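
`getLinkIndex` scans the whole array on every call, which is fine for a few dozen vertices. For larger graphs a `Map` built once would be the usual alternative. The sketch below compares the two on a few names from the output above; `zippedNames`, `indexByScan` and `indexByName` are hypothetical names for this illustration.

```scala
// A few vertex names from the output above, in the same (name, index) shape.
val zippedNames = Array("Terabyte", "Google_DeepMind", "Watson").zipWithIndex

// Linear scan, as in getLinkIndex: one pass over the array per lookup.
def indexByScan(name: String): Int = zippedNames.filter(_._1 == name)(0)._2

// Map lookup: built once, then constant-time on average per lookup.
val indexByName: Map[String, Int] = zippedNames.toMap

println(indexByScan("Watson"))
println(indexByName("Watson"))
```

Using `indexByName.get(name)` would additionally return `None` for an unknown title instead of throwing.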



In [28]:
// JSON WRITING ===========================================
import sys.process._
import java.io._

val dirname = "site"
val pw = new PrintWriter(new File(dirname + "/" + centerVertex + ".json"))

// a quick way to normalize edges (not rigorous):
val maxEdgeVal = graph3p1.triplets.sortBy(_.attr.toInt, ascending=false).
                                        map(x=>x.attr.toInt).take(1)(0)

// recall that only *edges* are filtered, the full node list is kept at all times. 
// (it saves time when rebuilding Graphs)
       
//formatting vertices (lots of delimiter issues)                        
val jsonVertices = outputVertexList.map(x=>x.replace("""\""" ,"""\\""")). // backslash
                                    map(x=>x.replace("\"","\\\"")). // quote delimiters
                                    map(x=>"    {\"name\":\"" + x + "\",\"group\":1}").
                                    collect

dirname = site
pw = java.io.PrintWriter@26ac3a2c
maxEdgeVal = 1173
jsonVertices = Array("    {"name":"Terabyte","group":1}", "    {"name":"Google_DeepMind","group":1}", "    {"name":"Watson","group":1}", "    {"name":"List_of_artificial_intelligence_projects","group":1}", "    {"name":"IBM","group":1}", "    {"name":"Brad_Rutter","group":1}", "    {"name":"TOP500","group":1}", "    {"name":"Thomas_Watson,_Jr.","group":1}", "    {"name":"Ginni_Rometty","group":1}", "    {"name":"Artificial_intelligence","group":1}", "    {"name":"Thomas_J._Watson","group":1}", "    {"name":"Timeline_of_electrical_and_electronic_engineering","group":1}", "    {"name":"Computer_performance_by_orders_of_magnitude","group"...
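
The order of the two `replace` calls matters: backslashes have to be escaped before quotes, otherwise the backslashes added for the quotes would be escaped a second time. Here is a small sketch of the same idea as standalone functions. `escapeJsonString` and `vertexJson` are hypothetical names, and only the two characters handled in the notebook are covered (control characters are not).

```scala
// Escape backslashes first, then double quotes.
def escapeJsonString(s: String): String =
  s.replace("\\", "\\\\").replace("\"", "\\\"")

// Format one vertex the same way as the jsonVertices entries.
def vertexJson(name: String): String =
  "    {\"name\":\"" + escapeJsonString(name) + "\",\"group\":1}"

println(vertexJson("Terabyte"))
println(vertexJson("Say_\"Hi\""))   // hypothetical title containing quotes
```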



In [29]:
val jsonEdges = graph3p1.triplets.map(x => (x.srcAttr._1,x.dstAttr._1,x.attr)).collect.map(y =>  
                                    "    {\"source\":" + getLinkIndex(y._1).toString +
                                    ",\"target\":" + getLinkIndex(y._2).toString +
                                    ",\"value\":" + (y._3.toFloat/maxEdgeVal*100).ceil.toInt.
                                    toString  + "}"  
)

jsonEdges = Array("    {"source":3,"target":16,"value":11}", "    {"source":40,"target":16,"value":9}", "    {"source":28,"target":23,"value":2}", "    {"source":2,"target":31,"value":6}", "    {"source":2,"target":43,"value":5}", "    {"source":2,"target":23,"value":2}", "    {"source":2,"target":22,"value":1}", "    {"source":2,"target":42,"value":9}", "    {"source":2,"target":36,"value":3}", "    {"source":23,"target":10,"value":4}", "    {"source":23,"target":39,"value":4}", "    {"source":23,"target":28,"value":3}", "    {"source":23,"target":25,"value":6}", "    {"source":23,"target":29,"value":2}", "    {"source":23,"target":6,"value":2}", "    {"source":23,"target":16,"value":5}", "    {"source":41,"target":16,"value":6}", "    {"source":33,"target":16,"value":12...
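
The `value` written for each link is the click count scaled against the largest count and rounded up. A minimal sketch of that arithmetic follows; `scaledValue` and `maxClicks` are hypothetical names, and 1173 is the `maxEdgeVal` reported earlier in this notebook.

```scala
// Scale a click count relative to the largest count (as a percentage), rounding up.
def scaledValue(clicks: Int, maxClicks: Int): Int =
  (clicks.toFloat / maxClicks * 100).ceil.toInt

val maxClicks = 1173   // maxEdgeVal from the earlier output

println(scaledValue(117, maxClicks))
println(scaledValue(maxClicks, maxClicks))
```

Rounding up means that even the lightest edge gets a value of at least 1, so no link is written with a zero value.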



In [32]:
// writing json

pw.write("{\n")                             // main header
pw.write("  \"nodes\":[\n")                 // nodes header

for (i<-0 until jsonVertices.length){       // nodes
    if(i == jsonVertices.length-1){pw.write(jsonVertices(i))}
    else {pw.write(jsonVertices(i)+",")}
    pw.write("\n")}
pw.write("  ],\n")                          // end nodes header
pw.write("  \"links\":[\n")      

In [33]:
for (i<-0 until jsonEdges.length){          // links
    if(i == jsonEdges.length-1){pw.write(jsonEdges(i))}
    else {pw.write(jsonEdges(i)+",")}
    pw.write("\n")
}
pw.write("  ]\n") 
pw.write("}")                               // main footer
pw.close
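
The comma handling in the two loops above can also be expressed with `mkString`, which places the separator only between elements. The sketch below assembles the same document layout as a single string from tiny stand-in arrays; `nodeLines`, `linkLines` and `jsonDocument` are hypothetical names, and the result would still need to be written to the file.

```scala
// Tiny stand-ins for jsonVertices and jsonEdges.
val nodeLines = Array("    {\"name\":\"A\",\"group\":1}", "    {\"name\":\"B\",\"group\":1}")
val linkLines = Array("    {\"source\":0,\"target\":1,\"value\":10}")

// mkString puts ",\n" between entries, so the last entry has no trailing comma.
val jsonDocument =
  "{\n  \"nodes\":[\n" + nodeLines.mkString(",\n") + "\n  ],\n" +
  "  \"links\":[\n" + linkLines.mkString(",\n") + "\n  ]\n}"

println(jsonDocument)
```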

In [34]:

// writing new html file from template
import scala.io.Source._
val lines = fromFile(dirname +"/template.html").getLines.toArray
val pw = new java.io.PrintWriter(new File(dirname+"/"+centerVertex+".html"))

//lines.map(x=>x.replace("NAME_OF_SITE", centerVertex)).foreach(y=>pw.write(y+"\n"))
lines.map(x => x.replace("NAME_OF_SITE", centerVertex)). 
      map(x => x.replace("TIME_FOR_QUERY", " (took " + dt + "s)")).
      foreach(y=>pw.write(y+"\n"))

pw.close

lines = Array(<!DOCTYPE html>, <meta charset="utf-8">, <style>, "", .node {, "  stroke: #fff;", "  stroke-width: 1.5px;", }, "", .link {, "  stroke: #999;", "  stroke-opacity: .6;", }, "", </style>, <body>, <title>NAME_OF_SITE</title>, "", <h1>NAME_OF_SITE </h1>, "<h3>TIME_FOR_QUERY </h3> ", <script type="text/javascript" src="http://mbostock.github.com/d3/d3.js?2.6.0"></script>, <script type="text/javascript" src="http://mbostock.github.com/d3/d3.layout.js?2.6.0"></script>, <script type="text/javascript" src="http://mbostock.github.com/d3/d3.geom.js?2.6.0"></script>, <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min.js"></script>, <script type="text/javascript" charset="utf-8">, "", "    ", var width = 1920,, "    height = 1000;...

