# Archaeological Data Analysis: lab module 1

### Author:  KENDALL SWANSON

# Exploring a data set

In this notebook, you'll download a data set derived from the openly licensed content of the [Online Coins of the Roman Empire](http://numismatics.org/ocre/) (OCRE). The original data set is available from <http://nomisma.org/> in RDF XML format. We'll work with a version formatted as a delimited-text file, using `#` as the column delimiter, with a header line labelling each column.

As with any data set, our first task is to figure out what kinds of data it contains and what range of values appears in each category. We'll examine the contents of several columns.




## Download delimited-text data

We'll make the standard Scala `Source` object available by `import`ing it, then use it to retrieve the content of a URL.

In [1]:
import scala.io.Source
val ocreCex = "https://raw.githubusercontent.com/neelsmith/nomisma/master/cex/ocre-cite-ids.cex"

import scala.io.Source

ocreCex: String = "https://raw.githubusercontent.com/neelsmith/nomisma/master/cex/ocre-cite-ids.cex"

We'll extract a sequence of lines from the URL source, and convert them to our favorite type of Scala collection, a `Vector`.

(The following cell downloads the data:  depending on your internet connection, this might take a moment.)

In [2]:
val lines = Source.fromURL(ocreCex).getLines.toVector

lines: Vector[String] = Vector(
  "ID#Label#Denomination#Metal#Authority#Mint#Region#ObvType#ObvLegend#ObvPortraitId#RevType#RevLegend#RevPortraitId#StartDate#EndDate",
  "3.com.43#RIC III Commodus 43#denarius#ar#commodus#rome#italy#Head of Commodus, laureate, right#M COMMODVS ANTONINVS AVG#http://nomisma.org/id/commodus#Roma, helmeted, draped, standing left, holding Victory in extended right hand and vertical spear in left hand#TR P VII IMP V COS III P P#http://collection.britishmuseum.org/id/person-institution/60208#182#182",
  "9.thes.27B.iii#RIC IX Thessalonica 27B: Subtype iii#ae3#ae#valentinian_i#thessalonica#macedonia#Bust of Valens, pearl-diademed, draped and cuirassed, right#D N VALEN-S P F AVG#http://nomisma.org/id/valens#Victory advancing left, holding wreath and palm#SECVRITAS-REIPVBLICAE#http://collection.britishmuseum.org/id/person-institution/60915#367#375",
  "9.thes.27B.ii#RIC IX Thessalonica 27

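`Source.fromURL` opens a connection that the cell above never closes. In a short notebook session that is rarely a problem, but here is a sketch of a more careful pattern. It assumes Scala 2.13 or later, where `scala.util.Using` is available, and the helper name `readLines` is hypothetical. It is demonstrated on an in-memory string so that it runs without a network connection.

```scala
import scala.io.Source
import scala.util.{Try, Using}

// Hypothetical helper (not part of the original notebook): read every line
// from a Source into a Vector, closing the Source afterwards.
def readLines(src: => Source): Try[Vector[String]] =
  Using(src)(_.getLines().toVector)

// In-memory stand-in for the downloaded file.
val sample = readLines(Source.fromString("ID#Label\n3.com.43#RIC III Commodus 43"))

// With the real data, the call would be: readLines(Source.fromURL(ocreCex))
```

The result is wrapped in a `Try`, so a failed download shows up as a `Failure` value rather than an exception thrown in the middle of the cell.
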
## Examine header line

To start with, let's see what the first line looks like, and compare it with the first data line.

In [4]:
lines.head // same as lines(0)

res3: String = "ID#Label#Denomination#Metal#Authority#Mint#Region#ObvType#ObvLegend#ObvPortraitId#RevType#RevLegend#RevPortraitId#StartDate#EndDate"

In [3]:
lines(1)

res2: String = "3.com.43#RIC III Commodus 43#denarius#ar#commodus#rome#italy#Head of Commodus, laureate, right#M COMMODVS ANTONINVS AVG#http://nomisma.org/id/commodus#Roma, helmeted, draped, standing left, holding Victory in extended right hand and vertical spear in left hand#TR P VII IMP V COS III P P#http://collection.britishmuseum.org/id/person-institution/60208#182#182"

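Long `#`-delimited lines are hard to compare by eye. As a small sketch, splitting both the header and a data line on `#` and pairing the pieces with `zip` labels each value with its column name. The two strings below are shortened, hand-copied stand-ins for `lines.head` and `lines(1)`, and the names `headerLine`, `firstRow` and `labelled` are only illustrative.

```scala
// Shortened stand-ins for lines.head and lines(1): first four columns only.
val headerLine = "ID#Label#Denomination#Metal"
val firstRow = "3.com.43#RIC III Commodus 43#denarius#ar"

// Pair each column name with the value in the same position.
val labelled = headerLine.split("#").toVector.zip(firstRow.split("#").toVector)

// Print one "name: value" pair per line.
labelled.foreach { case (name, value) => println(s"$name: $value") }
```

With the downloaded data, the same pairing would be `lines.head.split("#").zip(lines(1).split("#"))`.
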
## Split data strings into columns

Every line is a `String`. If we break it up using the `split` method, we get an `Array` of `String`s, which we'll convert to a `Vector` of `String`s. We use `lines.tail` to skip the header line, so that `data` contains only the records. The end result is that from a Vector of Strings we create a Vector of Vectors of Strings. Notice that Scala identifies the type of the new `data` value as `Vector[Vector[String]]`.
 

In [5]:
val data = lines.tail.map(ln => ln.split("#").toVector)

data: Vector[Vector[String]] = Vector(
  Vector(
    "3.com.43",
    "RIC III Commodus 43",
    "denarius",
    "ar",
    "commodus",
    "rome",
    "italy",
    "Head of Commodus, laureate, right",
    "M COMMODVS ANTONINVS AVG",
    "http://nomisma.org/id/commodus",
    "Roma, helmeted, draped, standing left, holding Victory in extended right hand and vertical spear in left hand",
    "TR P VII IMP V COS III P P",
    "http://collection.britishmuseum.org/id/person-institution/60208",
    "182",
    "182"
  ),
  Vector(
    "9.thes.27B.iii",
    "RIC IX Thessalonica 27B: Subtype iii",
    "ae3",
    "ae",
    "valentinian_i",
    "thessalonica",
    "macedonia",
    "Bust of Valens, pearl-diade

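One detail is worth checking before trusting column indexes: `split` on a `String` follows Java's behavior and drops trailing empty strings, so a line whose last columns are empty yields a shorter Vector than the header. Whether any line in this data set ends with empty columns is not something the output above shows; the row below is invented purely to illustrate the behavior. Passing a negative limit as a second argument keeps the empty strings.

```scala
// Invented example row: the last two of four columns are empty.
val row = "a#b##"

// Default split: trailing empty strings are removed.
val dropped = row.split("#").toVector

// A negative limit keeps trailing empty strings.
val kept = row.split("#", -1).toVector

println(dropped.size)
println(kept.size)
```

A quick check on the real data would be `data.map(_.size).distinct`, which contains a single value only if every row was split into the same number of columns.
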
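Mapping each Vector to the item at a fixed index is equivalent to extracting one column from the data. The header line told us that the first column (index 0) contains ID values. Note that the cell below uses index 10, so despite the name `ids` it extracts the eleventh column, `RevType`, as the output shows; use `columns(0)` to extract the IDs.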

In [7]:
val ids = data.map(columns => columns(10))

ids: Vector[String] = Vector(
  "Roma, helmeted, draped, standing left, holding Victory in extended right hand and vertical spear in left hand",
  "Victory advancing left, holding wreath and palm",
  "Victory advancing left, holding wreath and palm",
  "Pax, draped, standing left, holding olive branch in right hand and sceptre in left hand",
  "Victory advancing left, holding wreath and palm",
  "Sol, chlamys hanging behind, standing left, raising right hand and holding globe close to body in left hand",
  "Sol, chlamys hanging behind, standing left, raising right hand and holding globe close to body in left hand",
  "Sol, chlamys hanging behind, standing left, raising right hand and holding up globe in left hand",
  "Sol, chlamys hanging behind, standing left, raising right hand and holding up globe in left hand",
  "Sol, chlamys hanging behind, standi
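
Counting positions in the header by hand is error-prone. As a sketch, the column index can instead be looked up from the header by name, and the extracted column summarized with `distinct` or `groupBy` to see the range of values it contains. The rows below are a hand-made sample with only three columns (the first two rows copy values shown earlier, the third is invented), and `columnIndex` and `columnValues` are hypothetical helper names.

```scala
// Hand-made sample: three columns only; the last row is invented.
val header = "ID#Denomination#Metal".split("#").toVector
val rows = Vector(
  "3.com.43#denarius#ar",
  "9.thes.27B.iii#ae3#ae",
  "x.example.1#denarius#ar"
).map(_.split("#").toVector)

// Hypothetical helpers: look a column up by its header label.
// indexOf returns -1 for an unknown label, which would make the
// lookup in columnValues throw an exception.
def columnIndex(name: String): Int = header.indexOf(name)
def columnValues(name: String): Vector[String] =
  rows.map(r => r(columnIndex(name)))

val metals = columnValues("Metal")

// The set of values that occur, in order of first appearance.
val distinctMetals = metals.distinct

// How often each value occurs.
val metalCounts = metals.groupBy(identity).map { case (m, ms) => (m, ms.size) }
```

With the downloaded data, `header` would be `lines.head.split("#").toVector` and `rows` would be `data`.
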