# Extracting embedded resources from webpages

In [1]:
import org.archive.archivespark._
import org.archive.archivespark.implicits._
import org.archive.archivespark.enrich.functions._
import org.archive.archivespark.specific.warc._
import org.archive.archivespark.specific.warc.enrichfunctions._
import org.archive.archivespark.specific.warc.implicits._
import org.archive.archivespark.specific.warc.specs._
import org.apache.hadoop.io.compress.GzipCodec

## Loading the dataset

In this example, the Web archive dataset is loaded from local WARC / CDX files using `WarcCdxHdfsSpec`. Any other [Data Specification (DataSpec)](https://github.com/helgeho/ArchiveSpark/blob/master/docs/DataSpecs.md) can be used here instead to load your records from different local or remote sources.

In [2]:
val path = "/data/archiveit/ArchiveIt-Collection-2950"
val cdxPath = path + "/cdx/*.cdx.gz"
val warcPath = path + "/warc"

In [3]:
val records = ArchiveSpark.load(WarcCdxHdfsSpec(cdxPath, warcPath))

### Filtering irrelevant records

Embeds are specific to webpages, so we can filter out videos, images, stylesheets and any other files that are not webpages ([mime type](https://en.wikipedia.org/wiki/Media_type) *text/html*). We also drop webpages that were unavailable when they were crawled, keeping only those with [status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) 200. In addition, we filter the URLs to keep only pages under *occupylowell.org*, which reduces the size of the dataset for this example.

*It is important to note that this filtering is done only based on metadata, so up to this point ArchiveSpark does not even touch the actual Web archive records, which is the core efficiency feature of ArchiveSpark.*

In [4]:
val pages = records.filter(r => r.mime == "text/html" && r.status == 200 && r.surtUrl.startsWith("org,occupylowell)"))
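The idea of filtering on metadata alone can be illustrated with plain Scala collections. The following is only a sketch under stated assumptions: `LazyRecord` and `payloadLoads` are hypothetical names that do not exist in ArchiveSpark, and the lazily loaded payload merely stands in for the expensive access to the archived content.

```scala
// Hypothetical record type (not an ArchiveSpark class): the metadata fields are
// plain values, while the payload is only produced on first access.
var payloadLoads = 0 // counts how often a payload is actually read

case class LazyRecord(mime: String, status: Int, surtUrl: String) {
  lazy val payload: String = { payloadLoads += 1; s"<html><!-- $surtUrl --></html>" }
}

val sample = Seq(
  LazyRecord("text/html", 200, "org,occupylowell)/"),
  LazyRecord("text/html", 404, "org,occupylowell)/missing"),
  LazyRecord("text/css", 200, "org,occupylowell)/wp-content/themes/coraline/style.css"),
  LazyRecord("text/html", 200, "org,example)/")
)

// Same predicate as in the cell above; it reads metadata fields only.
val kept = sample.filter(r => r.mime == "text/html" && r.status == 200 && r.surtUrl.startsWith("org,occupylowell)"))
val loadsAfterFilter = payloadLoads // no payload has been read so far

// Only an explicit access to the content triggers a load.
val firstPayload = kept.head.payload
val loadsAfterAccess = payloadLoads
```

In this sketch, `loadsAfterFilter` stays at 0 and `loadsAfterAccess` becomes 1, which mirrors the point made above: the filter itself never needs the record content.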

By looking at the first record in our remaining dataset, we can see that this indeed is of type *text/html* and was *online* (status 200) at the time of crawl:

In [5]:
pages.peekJson

{
  "record" : {
    "surtUrl" : "org,occupylowell)/",
    "timestamp" : "20120107063734",
    "originalUrl" : "http://occupylowell.org/",
    "mime" : "text/html",
    "status" : 200,
    "digest" : "NG6JBJ5VEZRU6FRYULHYBRHVNDFCPFR7",
    "redirectUrl" : "-",
    "meta" : "-",
    "compressedSize" : 8020
  }
}
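The `surtUrl` field holds the URL in SURT form, with the host labels reversed, which is why the filter above matches on the prefix `org,occupylowell)`. The following sketch shows the gist of that transformation. `toSurt` is a hypothetical, heavily simplified helper and not ArchiveSpark's implementation; real canonicalization applies many more rules.

```scala
import java.net.URI

// Hypothetical helper, NOT ArchiveSpark's implementation: a simplified
// SURT-style transformation that ignores most canonicalization rules.
def toSurt(url: String): String = {
  val uri = new URI(url)
  // Reverse the host labels: occupylowell.org -> org,occupylowell
  val host = uri.getHost.toLowerCase.stripPrefix("www.").split('.').reverse.mkString(",")
  val path = Option(uri.getRawPath).filter(_.nonEmpty).getOrElse("/")
  // Sort the query parameters so that equivalent URLs map to the same key
  val query = Option(uri.getRawQuery).map(q => "?" + q.split('&').sorted.mkString("&")).getOrElse("")
  host + ")" + path + query
}

val home = toSurt("http://occupylowell.org/")
val thread = toSurt("http://occupylowell.org/forums/showthread.php?tid=15&action=nextnewest")
val sub = toSurt("http://blog.occupylowell.org/") // hypothetical subdomain, for illustration only

// The closing parenthesis in the prefix restricts the match to the exact host.
val prefix = "org,occupylowell)"
val matches = Seq(home, thread, sub).map(_.startsWith(prefix))
```

Here `home` evaluates to `org,occupylowell)/`, as in the record shown above, while the hypothetical subdomain yields `org,occupylowell,blog)/` and is therefore not matched by the prefix.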

### Remove duplicates

To save processing time, we remove duplicate webpages (based on the digest in the CDX records) and keep only the earliest snapshot of each distinct content. The result is cached, so that it does not need to be recomputed every time we access that collection.

In [6]:
val earliest = pages.distinctValue(_.digest) {(a, b) => if (a.time < b.time) a else b}.cache
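The deduplication step above groups records by a key and reduces each group to a single record. A minimal sketch of that semantics with plain Scala collections could look as follows. `Snapshot` and `keepOnePerKey` are hypothetical names for illustration, and timestamps are compared as strings here rather than through the `time` field used above.

```scala
// Hypothetical stand-in for a CDX record; only the fields needed here.
case class Snapshot(url: String, timestamp: String, digest: String)

// Plain-collection analogue of the pattern above:
// group by a key, then reduce each group to a single element.
def keepOnePerKey[T, K](items: Seq[T])(key: T => K)(pick: (T, T) => T): Seq[T] =
  items.groupBy(key).values.map(_.reduce(pick)).toSeq

val snapshots = Seq(
  Snapshot("http://example.org/", "20120501044456", "AAA"),
  Snapshot("http://example.org/", "20120107063734", "AAA"),
  Snapshot("http://example.org/about", "20120424033905", "BBB")
)

// 14-digit timestamps (yyyyMMddHHmmss) sort chronologically as strings.
val earliestSnapshots = keepOnePerKey(snapshots)(_.digest) { (a, b) =>
  if (a.timestamp < b.timestamp) a else b
}
```

The order of the resulting elements is not defined, just as the order of records in a distributed dataset should not be relied upon.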

## Extract embedded resources

In this example we want to extract stylesheets, hence we are interested in `link` tags with attribute `rel="stylesheet"`. Similarly, we could also extract images or other resources.

We first need to define the required [Enrich Function](https://github.com/helgeho/ArchiveSpark/blob/master/docs/EnrichFuncs.md) to enrich our metadata with the URLs (in SURT format) of the embedded stylesheets.

In [7]:
val Stylesheets = Html.all("link").mapMulti("stylesheets") { linkTags: Seq[String] => linkTags.filter(_.contains("rel=\"stylesheet\""))}
val StylesheetUrls = SURT.of(HtmlAttribute("href").ofEach(Stylesheets))
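Note that the substring test `rel="stylesheet"` in the cell above only matches the double-quoted form of the attribute. The sketch below shows, independently of ArchiveSpark, what the extraction amounts to and how single quotes could be tolerated as well. `LinkTag`, `attribute` and `stylesheetHrefs` are hypothetical helpers based on regular expressions; a real HTML parser would be more robust.

```scala
// Hypothetical regex-based helpers, for illustration only.
val LinkTag = """(?is)<link\b[^>]*>""".r

// Returns the value of a quoted attribute inside a single tag, if present.
def attribute(tag: String, name: String): Option[String] = {
  val pattern = ("""(?i)\b""" + name + """\s*=\s*["']([^"']*)["']""").r
  pattern.findFirstMatchIn(tag).map(_.group(1))
}

// Keeps the href of every link tag whose rel attribute contains "stylesheet".
def stylesheetHrefs(html: String): Seq[String] =
  LinkTag.findAllIn(html).toList
    .filter(tag => attribute(tag, "rel").exists(_.toLowerCase.split("\\s+").contains("stylesheet")))
    .flatMap(tag => attribute(tag, "href").toList)

val html =
  """<html><head>
    |<link rel="stylesheet" href="/forums/cache/themes/theme3/css3.css">
    |<link rel='stylesheet' type='text/css' href='style.css' />
    |<link rel="icon" href="/favicon.ico">
    |</head><body></body></html>""".stripMargin

val hrefs = stylesheetHrefs(html)
```

For this sample document, `hrefs` contains the two stylesheet references and skips the icon link.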

In [8]:
earliest.enrich(StylesheetUrls).peekJson

{
  "record" : {
    "surtUrl" : "org,occupylowell)/forums/showthread.php?action=nextnewest&tid=15",
    "timestamp" : "20120501044456",
    "originalUrl" : "http://occupylowell.org/forums/showthread.php?tid=15&action=nextnewest",
    "mime" : "text/html",
    "status" : 200,
    "digest" : "TBED6JFUUP4MM3I2PTUZW3ZOZGM4RQ4L",
    "redirectUrl" : "-",
    "meta" : "-",
    "compressedSize" : 3009
  },
  "payload" : {
    "string" : {
      "html" : {
        "link" : {
          "stylesheets" : [ {
            "attributes" : {
              "href" : {
                "SURT" : "org,occupylowell)/forums/cache/themes/theme3/css3.css"
              }
            }
          }, {
            "attributes" : {
              "href" : {
                "SURT" : "org,occupylowell)/f...

## Identify the relevant embeds / stylesheets in the dataset

At this point, we have to access the original dataset `records` again, as the stylesheets are not among the filtered `pages`.
A `join` operation is used to filter the records in the dataset and keep only the previously extracted stylesheet files. As a `join` is performed on the keys in the dataset, we introduce a dummy value (`true`) here to make the URL the key of the records. For more information please read the [Spark Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html).

In [9]:
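// Note: this collects all URLs to the driver as a Set; the next cell redefines stylesheetUrls as an RDD of pairs, which is what the join uses.
val stylesheetUrls = earliest.flatMapValues(StylesheetUrls).collect.toSet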

In [10]:
val stylesheetUrls = earliest.flatMapValues(StylesheetUrls).distinct.map(url => (url, true))

In [11]:
val stylesheets = records.map(r => (r.surtUrl, r)).join(stylesheetUrls).map{case (url, (record, dummy)) => record}
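The role of the dummy value becomes clearer when the join is written out on plain Scala collections. The following sketch mimics an inner join on keys. `Rec` and `innerJoin` are hypothetical names, and the `logo.png` record is made up for illustration.

```scala
// Hypothetical stand-in for an archive record; only the URL key matters here.
case class Rec(surtUrl: String, mime: String)

// Plain-collection analogue of an inner join on keys: every pair of
// left and right elements with the same key yields one output element.
def innerJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] = {
  val rightByKey = right.groupBy(_._1)
  for {
    (k, a) <- left
    (_, b) <- rightByKey.getOrElse(k, Seq.empty)
  } yield (k, (a, b))
}

val allRecords = Seq(
  Rec("org,occupylowell)/", "text/html"),
  Rec("org,occupylowell)/wp-content/themes/coraline/style.css", "text/css"),
  Rec("org,occupylowell)/logo.png", "image/png") // hypothetical record
)
val wantedUrls = Seq(
  "org,occupylowell)/wp-content/themes/coraline/style.css",
  "org,occupylowell)/wp-content/themes/coraline/style.css" // referenced by two pages
)

val keyed = allRecords.map(r => (r.surtUrl, r))           // URL becomes the key
val wanted = wantedUrls.distinct.map(url => (url, true))  // dummy value, only the key is needed
val joined = innerJoin(keyed, wanted).map { case (_, (record, _)) => record }
```

Without `distinct`, a stylesheet referenced by several pages would appear on the right side multiple times and the join would emit the same record once per occurrence.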

As above, we again remove duplicates from the stylesheet dataset:

In [12]:
val distinctStylesheets = stylesheets.distinctValue(_.digest) {(a, b) => if (a.time < b.time) a else b}.cache

In [13]:
distinctStylesheets.peekJson

{
  "record" : {
    "surtUrl" : "org,occupylowell)/wp-content/themes/coraline/style.css",
    "timestamp" : "20120424033905",
    "originalUrl" : "http://occupylowell.org/wp-content/themes/coraline/style.css",
    "mime" : "text/css",
    "status" : 200,
    "digest" : "D6BMRWYNLTNVOMT32BHJ362K3YXNO5VO",
    "redirectUrl" : "-",
    "meta" : "-",
    "compressedSize" : 5765
  }
}

## Save the relevant embeds

There are different options for saving the dataset of embeds. One way is to save the embeds as WARC records as follows:

In [14]:
distinctStylesheets.saveAsWarc("stylesheets.warc.gz", WarcMeta(publisher = "Internet Archive"))

Another option is to enrich the metadata of the stylesheets with their actual content and save it as JSON:

In [15]:
val enriched = distinctStylesheets.enrich(StringContent)

In [16]:
enriched.peekJson

{
  "record" : {
    "surtUrl" : "org,occupylowell)/wp-content/themes/coraline/style.css",
    "timestamp" : "20120424033905",
    "originalUrl" : "http://occupylowell.org/wp-content/themes/coraline/style.css",
    "mime" : "text/css",
    "status" : 200,
    "digest" : "D6BMRWYNLTNVOMT32BHJ362K3YXNO5VO",
    "redirectUrl" : "-",
    "meta" : "-",
    "compressedSize" : 5765
  },
  "payload" : {
    "string" : "/*\nTheme Name: Coraline\nTheme URI: http://wordpress.org/extend/themes/coraline/\nDescription: A squeaky-clean theme featuring a custom menu, header, background, and layout. Coraline supports 7 widget areas (up to 3 in the sidebar, four in the footer) and featured images (thumbnails for gallery posts and custom header images for posts and pages). It includes style...

*By adding a .gz extension to the output path, the data will automatically be compressed with gzip.*

In [17]:
enriched.saveAsJson("stylesheets.json.gz")
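Gzip-compressed text output can be read back with the standard Java streams. The sketch below round-trips a few lines in memory. It assumes one JSON document per line, which is an assumption that should be checked against the files actually written, and `gzipLines` / `gunzipLines` are hypothetical helper names.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, ByteArrayOutputStream, InputStreamReader}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Compresses the given lines into a gzip byte array (one line per entry).
def gzipLines(lines: Seq[String]): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out = new GZIPOutputStream(bytes)
  try lines.foreach(line => out.write((line + "\n").getBytes(UTF_8)))
  finally out.close()
  bytes.toByteArray
}

// Decompresses gzip data and returns its lines.
def gunzipLines(data: Array[Byte]): Seq[String] = {
  val reader = new BufferedReader(
    new InputStreamReader(new GZIPInputStream(new ByteArrayInputStream(data)), UTF_8))
  try Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
  finally reader.close()
}

// Made-up sample lines, assuming one JSON document per line.
val jsonLines = Seq(
  """{"record":{"mime":"text/css","status":200}}""",
  """{"record":{"mime":"text/css","status":200}}"""
)
val compressed = gzipLines(jsonLines)
val restored = gunzipLines(compressed)
```

To read a real output file, the `ByteArrayInputStream` would be replaced by a `FileInputStream` on the corresponding file.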

To learn how to convert and save the dataset to some custom format, please see the recipe on [Extracting title + text from a selected set of URLs](Selected_Title-and-Text.ipynb).

For more recipes, please check the [ArchiveSpark documentation](https://github.com/helgeho/ArchiveSpark/blob/master/docs/README.md).