# Downloading a web archive dataset as WARC/CDX from the Wayback Machine

In [1]:
import org.archive.webservices.archivespark._
import org.archive.webservices.archivespark.functions._
import org.archive.webservices.archivespark.specific.warc._

## Loading the dataset from the Wayback Machine

ArchiveSpark provides a [Data Specification (DataSpec)](https://github.com/helgeho/ArchiveSpark/blob/master/docs/DataSpecs.md) to load records remotely from the Wayback Machine, with metadata fetched from the [CDX Server](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server). For more details about this and other [DataSpecs](https://github.com/helgeho/ArchiveSpark/blob/master/docs/DataSpecs.md), please read the [docs](https://github.com/helgeho/ArchiveSpark/blob/master/docs/README.md).

The following example loads all archived resources under the domain `helgeholzmann.de` (`matchPrefix = true`) captured between May and June 2019 (`from = 201905`, `to = 201906`), requesting 5 blocks per page for at most 50 pages (please read [the CDX server documentation](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server) for more information on these parameters):

In [2]:
val records = ArchiveSpark.load(
  WarcSpec.fromWayback(
    "helgeholzmann.de",
    matchPrefix = true,
    from = 201905,
    to = 201906,
    blocksPerPage = 5,
    pages = 50
  )
)
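To make the paging parameters more concrete, here is a minimal sketch of what a per-page CDX server query could look like. The `CdxQuery` helper is hypothetical and not part of ArchiveSpark. The query parameters `url`, `matchType`, `from`, `to`, `pageSize` and `page` are taken from the CDX server documentation, but the assumption that `blocksPerPage` corresponds to `pageSize` is ours and may not reflect how ArchiveSpark builds its requests internally.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of ArchiveSpark: builds a CDX server query URL
// for a single result page from parameters similar to those used above.
object CdxQuery {
  // Public CDX endpoint of the Wayback Machine.
  val Endpoint = "http://web.archive.org/cdx/search/cdx"

  def pageUrl(url: String, matchPrefix: Boolean, from: Long, to: Long,
              blocksPerPage: Int, page: Int): String = {
    val params = Seq(
      "url" -> url,
      "matchType" -> (if (matchPrefix) "prefix" else "exact"),
      "from" -> from.toString,
      "to" -> to.toString,
      "pageSize" -> blocksPerPage.toString, // assumed mapping of blocksPerPage
      "page" -> page.toString
    )
    val query = params.map { case (k, v) =>
      k + "=" + URLEncoder.encode(v, StandardCharsets.UTF_8.name)
    }.mkString("&")
    s"$Endpoint?$query"
  }
}

// One URL per requested page: pages = 50 gives the page indices 0 to 49.
val urls = (0 until 50).map { p =>
  CdxQuery.pageUrl("helgeholzmann.de", matchPrefix = true, from = 201905, to = 201906, blocksPerPage = 5, page = p)
}
println(urls.head)
```

Each page can be fetched independently, which is what makes a paged query convenient to distribute across Spark tasks.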

### Peeking at the first record as JSON

As usual, the records in this dataset can be printed as JSON, and all common [operations](https://github.com/helgeho/ArchiveSpark/blob/master/docs/Operations.md) as well as [Enrichment Functions](https://github.com/helgeho/ArchiveSpark/blob/master/docs/EnrichFuncs.md) can be applied as shown in [other recipes](https://github.com/helgeho/ArchiveSpark/blob/master/docs/Recipes.md).

In [3]:
records.peekJson

{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152652",
        "digest" : "HCHVDRUSN7WDGNZFJES2Y4KZADQ6KINN",
        "originalUrl" : "https://www.helgeholzmann.de/",
        "surtUrl" : "de,helgeholzmann)/",
        "mime" : "warc/revisit",
        "compressedSize" : 771,
        "meta" : "-",
        "status" : -1
    }
}

This is a revisit record, which marks a duplicate capture in the Wayback Machine. When downloaded locally, it will be stored as the original `text/html` resource.
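As a minimal sketch of how such entries can be told apart, the snippet below models the CDX fields from the JSON output in a plain case class and checks the MIME type. `CdxFields` and `Revisits` are hypothetical names for illustration only; they are not ArchiveSpark's own record classes.

```scala
// Hypothetical model of the CDX fields shown in the JSON output;
// this is not ArchiveSpark's record class.
final case class CdxFields(
  surtUrl: String,
  timestamp: String,
  originalUrl: String,
  mime: String,
  status: Int,
  digest: String
)

object Revisits {
  // Revisit entries carry the special MIME type "warc/revisit".
  def isRevisit(r: CdxFields): Boolean = r.mime == "warc/revisit"
}

// The record peeked at above, copied field by field from the JSON.
val peeked = CdxFields(
  surtUrl = "de,helgeholzmann)/",
  timestamp = "20190528152652",
  originalUrl = "https://www.helgeholzmann.de/",
  mime = "warc/revisit",
  status = -1,
  digest = "HCHVDRUSN7WDGNZFJES2Y4KZADQ6KINN"
)

println(Revisits.isRevisit(peeked))
```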

## Saving as WARC / CDX

Save the records as local `*.warc.gz` and `*.cdx.gz` files. Adding the `.gz` extension to the path compresses the output with gzip:

In [4]:
records.saveAsWarc("/data/helgeholzmann-de.warc.gz", WarcFileMeta(publisher = "Helge Holzmann"), generateCdx = true)

48
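To inspect a gzip-compressed text file such as the generated CDX index outside of Spark, the JDK alone is enough. The following is a sketch under assumptions: `GzipText` is a hypothetical helper, and the exact name and layout of the CDX output (a single file or several part files) depends on how ArchiveSpark writes it, which is not shown here.

```scala
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream

// Hypothetical helper, not part of ArchiveSpark.
object GzipText {
  // Reads up to `n` lines from a gzip-compressed text file.
  def head(path: String, n: Int): List[String] = {
    val reader = new BufferedReader(new InputStreamReader(
      new GZIPInputStream(new FileInputStream(path)), StandardCharsets.UTF_8))
    // toList forces the read before the reader is closed.
    try Iterator.continually(reader.readLine()).takeWhile(_ != null).take(n).toList
    finally reader.close()
  }
}
```

Assuming the index ends up at a path like `/data/helgeholzmann-de.cdx.gz`, the first entries could be printed with `GzipText.head("/data/helgeholzmann-de.cdx.gz", 5).foreach(println)`.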

## Loading from WARC / CDX

Now, with the dataset available in local WARC / CDX files, we can load it from there:

In [5]:
val records = ArchiveSpark.load(WarcSpec.fromFilesWithCdx("/data/helgeholzmann-de.warc.gz"))

In [6]:
records.peekJson

{
    "record" : {
        "redirectUrl" : "-",
        "timestamp" : "20190528152652",
        "digest" : "sha1:HCHVDRUSN7WDGNZFJES2Y4KZADQ6KINN",
        "originalUrl" : "https://www.helgeholzmann.de/",
        "surtUrl" : "de,helgeholzmann)/",
        "mime" : "text/html",
        "compressedSize" : 2087,
        "meta" : "-",
        "status" : 200
    }
}
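Compared with the record peeked at from the Wayback Machine, the locally loaded record reports `text/html` with status 200, and its digest carries a `sha1:` prefix. When matching records across both sources, it can help to normalize such values first. The following is a minimal sketch with hypothetical helper names (`CdxValues` is not part of ArchiveSpark); it relies only on the value formats visible in the two JSON outputs.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical helpers, not part of ArchiveSpark.
object CdxValues {
  private val TimestampFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Parses a 14-digit capture timestamp such as "20190528152652".
  def parseTimestamp(ts: String): LocalDateTime =
    LocalDateTime.parse(ts, TimestampFormat)

  // Drops an optional "sha1:" prefix so digests from both sources compare equal.
  def normalizeDigest(digest: String): String = digest.stripPrefix("sha1:")
}

val remoteDigest = "HCHVDRUSN7WDGNZFJES2Y4KZADQ6KINN"
val localDigest = "sha1:HCHVDRUSN7WDGNZFJES2Y4KZADQ6KINN"

println(CdxValues.normalizeDigest(remoteDigest) == CdxValues.normalizeDigest(localDigest))
println(CdxValues.parseTimestamp("20190528152652"))
```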