[ScaDaMaLe, Scalable Data Science and Distributed Machine Learning](https://lamastex.github.io/scalable-data-science/sds/3/x/)
==============================================================================================================================

Old Bailey Online Data Analysis in Apache Spark
===============================================

2016, by Raaz Sainudiin and James Smithies is licensed under [Creative
Commons Attribution-NonCommercial 4.0 International
License](http://creativecommons.org/licenses/by-nc/4.0/).

#### Old Bailey, London's Central Criminal Court, 1674 to 1913

-   The full XML data is available for another great project. This
    notebook is a starting point for ETL of Old Bailey Online data from
    <http://lamastex.org/datasets/public/OldBailey/index.html>.

This work merely builds on [Old Bailey Online by Clive Emsley, Tim
Hitchcock and Robert Shoemaker](https://www.oldbaileyonline.org/) that
is licensed under a Creative Commons Attribution-NonCommercial 4.0
International License. Permissions beyond the scope of this license may
be available at https://www.oldbaileyonline.org/static/Legal-info.jsp.

In [None]:
//This allows easy embedding of publicly available information into any other notebook
//when viewing in git-book just ignore this block - you may have to manually chase the URL in frameIt("URL").
//Example usage:
// displayHTML(frameIt("https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Topics_in_LDA",250))
def frameIt( u:String, h:Int ) : String = {
      """<iframe 
 src=""""+ u+""""
 width="95%" height="""" + h + """"
 sandbox>
  <p>
    <a href="http://spark.apache.org/docs/latest/index.html">
      Fallback link for browsers that, unlikely, don't support frames
    </a>
  </p>
</iframe>"""
   }
displayHTML(frameIt("https://www.oldbaileyonline.org/", 450))

  

### This exciting dataset is here for a course project in digital humanities

#### To understand the extraction job we are about to do here:

-   see [Jasper Mackenzie, Raazesh Sainudiin, James Smithies and Heather
    Wolffram, A nonparametric view of the civilizing process in London's
    Old Bailey, Research Report UCDMS2015/1, 32 pages,
    2015](http://lamastex.org/preprints/20150828_civilizingProcOBO.pdf).

The data is already loaded in dbfs (see the downloading and loading
section below for details).

Analysing the Full Old Bailey Online Sessions Papers Dataset
============================================================

First, **Step 0: Downloading and Loading Data (The Full Dataset)** below
should have been done on the shard.  
This currently cannot be done in Community Edition (CE) as the dataset is
not loaded into the dbfs available in CE yet. But the dataset is in the
academic shard, and this is a walkthrough of the Old Bailey Online data
in the academic shard.

Let's first check that the datasets are there in the distributed file
system.

In [None]:
display(dbutils.fs.ls("dbfs:/datasets/obo/tei/")) // full data if you have it - not in CE!!

In [None]:
display(dbutils.fs.ls("dbfs:/datasets/obo/tei/ordinarysAccounts"))

In [None]:
display(dbutils.fs.ls("dbfs:/datasets/obo/tei/sessionsPapers"))

In [None]:
displayHTML(frameIt("https://en.wikipedia.org/wiki/XML", 450))

  

Step 1: Exploring data first: xml parsing in scala
--------------------------------------------------

But, first let's understand the data and its structure.

**Step 0: Downloading and Loading Data (The Full Dataset)** should have
been done already, so that the data is in dbfs.

In [None]:
val raw = sc.wholeTextFiles("dbfs:/datasets/obo/tei/ordinarysAccounts/OA17070912.xml")

In [None]:
val raw = sc.wholeTextFiles("dbfs:/datasets/obo/tei/sessionsPapers/17280717.xml") // has data on crimes and punishments

In [None]:
//val oboTest = sc.wholeTextFiles("dbfs:/datasets/obo/tei/ordinarysAccounts/OA1693072*.xml")
val xml = raw.map( x => x._2 )
val x = xml.take(1)(0) // getting content of xml file as a string

In [None]:
val elem = scala.xml.XML.loadString(x)

In [None]:
elem

  

Quick Preparation
-----------------

#### Some examples to learn xml and scala in a hurry

In [None]:
val p = new scala.xml.PrettyPrinter(80, 2)

p.format(elem)
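
As a quick illustration of the two projection operators used throughout this notebook, here is a minimal sketch on a made-up fragment. The fragment, its ids and its values are hypothetical and only loosely modelled on the `div0`/`div1`/`interp` layout of the real files; it assumes the `scala-xml` library is on the classpath, as it is wherever `scala.xml.XML.loadString` works.

```scala
import scala.xml.XML

// Hypothetical fragment, NOT taken from the Old Bailey files.
val tiny = XML.loadString(
  """<div0 type="sessionsPaper" id="s1">
    |  <div1 type="frontMatter" id="f1"/>
    |  <div1 type="trialAccount" id="t1">
    |    <rs type="offenceDescription">
    |      <interp type="offenceCategory" value="theft"/>
    |    </rs>
    |  </div1>
    |</div0>""".stripMargin)

// `\` looks at direct children only (or, with "@name", at the node's own attributes).
val rootType = (tiny \ "@type").text
val childIds = (tiny \ "div1").map(n => (n \ "@id").text)

// `interp` is nested two levels down, so `\` does not find it from the root ...
val directInterps = (tiny \ "interp").size
// ... whereas `\\` searches the node itself and all of its descendants.
val allInterps = (tiny \\ "interp").size
val offenceValues = (tiny \\ "interp").map(n => (n \ "@value").text)
```

This is why the cells that follow use `\\` to reach `div1` and `interp` from the root of a file, and `\` with `"@type"` or `"@id"` to read attributes of the node at hand.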

  

### Better examples:

http://alvinalexander.com/scala/how-to-extract-data-from-xml-nodes-in-scala

http://alvinalexander.com/scala/scala-xml-xpath-example

#### More advanced topics:

https://alvinalexander.com/scala/serializing-deserializing-xml-scala-classes

#### XML to JSON, if you want to go this route:

https://stackoverflow.com/questions/9516973/xml-to-json-with-scala

Our Parsing Problem
-------------------

Let's dive deep into this data right away. See the links above to learn
XML more systematically, so that you can parse other subsets of the data
for your own project.

For now, we will jump in to parse the input data of counts used in
[Jasper Mackenzie, Raazesh Sainudiin, James Smithies and Heather
Wolffram, A nonparametric view of the civilizing process in London's Old
Bailey, Research Report UCDMS2015/1, 32 pages,
2015](http://lamastex.org/preprints/20150828_civilizingProcOBO.pdf).

In [None]:
(elem \\ "div0").map(Node => (Node \ "@type").text) // types of div0 node, the singleton root node for the file

In [None]:
(elem \\ "div1").map(Node => (Node \ "@type").text) // types of div1 node

In [None]:
(elem \\ "div1")

In [None]:
(elem \\ "div1").filter(Node => ((Node \ "@type").text == "trialAccount"))
                 .map(Node => (Node \ "@type", Node \ "@id" ))

In [None]:
val trials = (elem \\ "div1").filter(Node => ((Node \ "@type").text == "trialAccount"))
                 .map(Node => (Node \ "@type", Node \ "@id", (Node \\ "rs" \\ "interp").map( n => ((n \\ "@type").text, (n \\ "@value").text ))))

In [None]:
val wantedFields = Seq("verdictCategory","punishmentCategory","offenceCategory").toSet

In [None]:
val trials = (elem \\ "div1").filter(Node => ((Node \ "@type").text == "trialAccount"))
                 .map(Node => ((Node \ "@type").text, (Node \ "@id").text, (Node \\ "rs" \\ "interp")
                                                               .filter(n => wantedFields.contains( (n \\ "@type").text))
                                                               .map( n => ((n \\ "@type").text, (n \\ "@value").text ))))

  

Since there can be more than one defendant in a trial, we need to reduce
by key as follows.

In [None]:
def reduceByKey(collection: Traversable[Tuple2[String, Int]]) = {    
    collection
      .groupBy(_._1)
      .map { case (group: String, traversable) => traversable.reduce{(a,b) => (a._1, a._2 + b._2)} }
  }
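
To see what reducing by key does to the `(value, 1)` pairs of a trial, here is a small self-contained sketch. The helper name `countByKey` and the sample pairs are made up for illustration; it sums per key with `groupBy` and `sum` rather than with the pairwise `reduce` used in the cell above, which should give the same counts.

```scala
// Hypothetical helper: sums the Int counts attached to each distinct key.
def countByKey(pairs: Iterable[(String, Int)]): Map[String, Int] =
  pairs.groupBy(_._1).map { case (key, group) => key -> group.map(_._2).sum }

// Made-up trial: two defendants charged with theft, one found guilty and one not.
val interps = Seq(("theft", 1), ("guilty", 1), ("theft", 1), ("notGuilty", 1))
val counts = countByKey(interps)
```

Here `counts` maps `theft` to 2 and each verdict to 1, so a trial with several defendants contributes one row with summed counts rather than several rows.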

  

Let's process the coarsest data on the trial as json strings.

In [None]:
val trials = (elem \\ "div1").filter(Node => ((Node \ "@type").text == "trialAccount"))
                 .map(Node => {val trialId = (Node \ "@id").text;
                               val trialInterps = (Node \\ "rs" \\ "interp")
                                                                 .filter(n => wantedFields.contains( (n \\ "@type").text))
                                                                 //.map( n => ((n \\ "@type").text, (n \\ "@value").text ));
                                                                 .map( n => ((n \\ "@value").text , 1 ));
                               val trialCounts = reduceByKey(trialInterps).toMap;
                               //(trialId, trialInterps, trialCounts)
                               scala.util.parsing.json.JSONObject(trialCounts updated ("id", trialId))
                              })

In [None]:
trials.foreach(println)

  

Step 2: Extract, Transform and Load XML files to get DataFrame of counts
------------------------------------------------------------------------

We have played enough (see **Step 1: Exploring data first: xml parsing
in scala** above first) to understand what to do now with our xml data
in order to get it converted to counts of crimes, verdicts and
punishments.

Let's parse the xml files and turn them into a DataFrame in one block.

In [None]:
val rawWTF = sc.wholeTextFiles("dbfs:/datasets/obo/tei/sessionsPapers/*.xml") // has all data on crimes and punishments
val raw = rawWTF.map( x => x._2 )
val trials = raw.flatMap( x => { 
                       val elem = scala.xml.XML.loadString(x);
                       val outJson = (elem \\ "div1").filter(Node => ((Node \ "@type").text == "trialAccount"))
                           .map(Node => {val trialId = (Node \ "@id").text;
                               val trialInterps = (Node \\ "rs" \\ "interp")
                                                                 .filter(n => wantedFields.contains( (n \\ "@type").text))
                                                                 //.map( n => ((n \\ "@type").text, (n \\ "@value").text ));
                                                                 .map( n => ((n \\ "@value").text , 1 ));
                               val trialCounts = reduceByKey(trialInterps).toMap;
                               //(trialId, trialInterps, trialCounts)
                               scala.util.parsing.json.JSONObject(trialCounts updated ("id", trialId)).toString()
                              })
  outJson
})
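
The block above calls `scala.xml.XML.loadString` on every file, and that call throws if a file is not well-formed XML. Whether any file in this dataset is malformed is not something this notebook establishes, so the following is only a defensive sketch: a hypothetical helper `parseOrNone` wraps the parse in `scala.util.Try`, so that a bad input contributes no trials instead of failing the whole job. The two input strings are made up.

```scala
import scala.util.Try
import scala.xml.{Elem, XML}

// Hypothetical helper: None for text that is not well-formed XML, Some(elem) otherwise.
def parseOrNone(content: String): Option[Elem] =
  Try(XML.loadString(content)).toOption

// Made-up inputs: the first is well-formed, the second is missing its closing tags.
val contents = Seq(
  """<div0><div1 type="trialAccount" id="t1"/></div0>""",
  "<div0><div1>"
)

// Same flatMap shape as the Spark block: each input yields zero or more trial ids.
val trialIds = contents.flatMap { c =>
  parseOrNone(c).toSeq.flatMap(e => (e \\ "div1").map(n => (n \ "@id").text))
}
```

The same lambda shape could be passed to `raw.flatMap` on the RDD. Note that silently dropping files can hide data problems, so counting the `None` cases may be worthwhile.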

In [None]:
dbutils.fs.rm("dbfs:/datasets/obo/processed/trialCounts",recurse=true) // let's remove the files from the previous analysis
trials.saveAsTextFile("dbfs:/datasets/obo/processed/trialCounts") // now let's save the trial counts - about 220 seconds to parse all data and get counts

In [None]:
display(dbutils.fs.ls("dbfs:/datasets/obo/processed/trialCounts"))

In [None]:
val trialCountsDF = sqlContext.read.json("dbfs:/datasets/obo/processed/trialCounts")

In [None]:
trialCountsDF.printSchema

In [None]:
trialCountsDF.count // total number of trials = 197751

In [None]:
display(trialCountsDF)

In [None]:
val trDF = trialCountsDF.na.fill(0) // filling nulls with 0
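
The nulls arise because each JSON record only carries the categories that occur in that trial, while the DataFrame schema is the union of the keys seen across all records. The following Spark-free sketch mimics that with plain Scala maps; the two sample trials and their category names are made up for illustration.

```scala
// Hypothetical per-trial counts; each trial only mentions the categories it contains.
val perTrial = Seq(
  Map("theft" -> 2, "guilty" -> 1),
  Map("kill" -> 1, "death" -> 1, "guilty" -> 1)
)

// The union of all keys plays the role of the inferred DataFrame columns.
val columns = perTrial.flatMap(_.keys).distinct.sorted

// A category absent from a trial becomes 0, as na.fill(0) does for null numeric cells.
val rows = perTrial.map(counts => columns.map(c => counts.getOrElse(c, 0)))
```

Filling with 0 is the right reading here because a missing category means that no defendant in that trial had that offence, verdict or punishment.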

In [None]:
display(trDF)

  

This is already available as the following csv file:

-   <http://lamastex.org/datasets/public/OldBailey/oboOffencePunnishmentCountsFrom-sds-2-2-ApacheSparkScalaProcessingOfOBOXMLDoneByRaazOn20180405.csv>

Please cite this URL if you use this data or the Apache licensed code
in the databricks notebook above for your own non-commercial analysis:

-   <http://lamastex.org/datasets/public/OldBailey/>

Raazesh Sainudiin generated this header **Old bailey Processing in
Apache Spark** on Thu Apr 5 18:22:43 CEST 2018 in Uppsala, Sweden.

Step 0: Downloading and Loading Data (The Full Dataset)
-------------------------------------------------------

First we will be downloading data from
<http://lamastex.org/datasets/public/OldBailey/index.html>.

The steps below need to be done only once for a given shard!

You can download the tiny dataset
`obo-tiny/OB-tiny_tei_7-2_CC-BY-NC.zip` **to save time and space in db
CE**

**Optional TODOs:**

-   one could just read the zip files directly (see week 10 on Beijing
    taxi trajectories example from the scalable-data-science course in
    2016 or read 'importing zip files' in the Guide).
-   one could just download from s3 directly
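
For the first optional TODO, here is a minimal sketch of reading zip entries directly with the JDK's `java.util.zip`, without unpacking to disk. It is not necessarily the approach taken in the course material referred to above. The helper name `readZipEntries`, the entry name and its content are all made up, and the archive is built in memory so that the sketch runs without downloading anything.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}
import scala.collection.mutable.ArrayBuffer

// Build a tiny in-memory archive (hypothetical entry name and content).
val buffer = new ByteArrayOutputStream()
val zipOut = new ZipOutputStream(buffer)
zipOut.putNextEntry(new ZipEntry("tei/sessionsPapers/example.xml"))
zipOut.write("<div0/>".getBytes(UTF_8))
zipOut.closeEntry()
zipOut.close()

// Hypothetical helper: returns every file entry as (name, content), skipping directories.
def readZipEntries(bytes: Array[Byte]): Seq[(String, String)] = {
  val zipIn = new ZipInputStream(new ByteArrayInputStream(bytes))
  val entries = ArrayBuffer.empty[(String, String)]
  try {
    var entry = zipIn.getNextEntry
    while (entry != null) {
      if (!entry.isDirectory) {
        val out = new ByteArrayOutputStream()
        val chunk = new Array[Byte](8192)
        var n = zipIn.read(chunk) // returns -1 at the end of the current entry
        while (n != -1) { out.write(chunk, 0, n); n = zipIn.read(chunk) }
        entries += ((entry.getName, new String(out.toByteArray, UTF_8)))
      }
      entry = zipIn.getNextEntry
    }
  } finally zipIn.close()
  entries.toList
}

val files = readZipEntries(buffer.toByteArray)
```

On a cluster, the bytes of each archive could plausibly come from `sc.binaryFiles`, whose values expose `toArray()`. Since a zip archive is not splittable, each archive would then be read by a single task.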

In [None]:
# if you want to download the tiny dataset
wget https://raw.githubusercontent.com/raazesh-sainudiin/scalable-data-science/master/datasets/obo-tiny/OB-tiny_tei_7-2_CC-BY-NC.zip

In [None]:
# this is the full dataset - necessary for a project on this dataset
wget http://lamastex.org/datasets/public/OldBailey/OB_tei_7-2_CC-BY-NC.zip

In [None]:
pwd && ls -al

  

Make sure you comment/uncomment the right files depending on whether you
have downloaded the tiny dataset or the big one.

In [None]:
unzip OB-tiny_tei_7-2_CC-BY-NC.zip
#unzip OB_tei_7-2_CC-BY-NC.zip

  

Let's put the files in dbfs.

In [None]:
dbutils.fs.mkdirs("dbfs:/datasets/obo/tei") //need not be done again!

In [None]:
//dbutils.fs.rm("dbfs:/datasets/obo/tei",true)

In [None]:
ls 
#ls obo-tiny/tei

In [None]:
 dbutils.fs.cp("file:/databricks/driver/obo-tiny/tei", "dbfs:/datasets/obo/tei/",recurse=true) // already done and it takes 1500 seconds - a while!
 //dbutils.fs.cp("file:/databricks/driver/tei", "dbfs:/datasets/obo/tei/",recurse=true) // already done and it takes 19 minutes - a while!

In [None]:
//dbutils.fs.rm("dbfs:/datasets/tweets",true) // remove files to make room for the OBO dataset

In [None]:
display(dbutils.fs.ls("dbfs:/datasets/"))

In [None]:
display(dbutils.fs.ls("dbfs:/datasets/obo/tei/"))

In [None]:
util.Properties.versionString // check scala version