preliminary new docs
helgeho committed Feb 20, 2018
1 parent 4e05af4 commit acd0590
Showing 130 changed files with 1,659 additions and 1,066 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -41,4 +41,3 @@ project/boot/
## Put local stuff here that you don't want to commit
ignore/
*.ignore.*
docs/
19 changes: 19 additions & 0 deletions docs/Build.md
@@ -0,0 +1,19 @@
[< Table of Contents](README.md) | [Use ArchiveSpark as a Library (advanced) >](Use_Library.md)
:---|---:

# Build ArchiveSpark

To build the ArchiveSpark JAR files from source, you need to have Scala 2.11 as well as SBT installed.
Then simply run the following build commands from within the ArchiveSpark folder:

1. `sbt assembly`
2. `sbt assemblyPackageDependency`

These commands will create two JAR files under `target/scala-2.11`, one for ArchiveSpark and one for the required dependencies.
Please include these files in the project that depends on ArchiveSpark, or add them to your JVM classpath.
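
If the dependent project is itself built with SBT, one way to pick up the generated JARs is to reference them as unmanaged dependencies. This is just a sketch under the assumption that you copied the two files into a local `lib/` folder; the file names are placeholders for whatever the assembly steps above produce:

```scala
// build.sbt (sketch): reference the locally built ArchiveSpark JARs directly.
// Adjust the paths and names to the files generated under target/scala-2.11.
unmanagedJars in Compile ++= Seq(
  file("lib/archivespark-assembly.jar"),
  file("lib/archivespark-assembly-deps.jar")
)
```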

There are also pre-built versions available that you can add as dependencies to your projects.
For more information, please read [Use ArchiveSpark as a library](Use_Library.md).

[< Table of Contents](README.md) | [Use ArchiveSpark as a Library (advanced) >](Use_Library.md)
:---|---:
28 changes: 28 additions & 0 deletions docs/Config.md
@@ -0,0 +1,28 @@
[< Table of Contents](README.md) | [ArchiveSpark Operations >](Operations.md)
:---|---:

# Configuration

ArchiveSpark provides two ways to configure certain options:

1. Parameters that are relevant for the driver, which is in charge of running a job and distributing your code to the executors, can be set directly on the global `ArchiveSpark` object.
2. Parameters that are relevant for each executor can be set through a distributed config, which can be accessed with `ArchiveSpark.conf`.

## Driver Options

Option| Description
:--------|:---
**parallelism** | Sets the degree of parallelism, i.e., the number of partitions, used by most distributed ArchiveSpark operations. If this is not set, ArchiveSpark falls back to Spark's `defaultParallelism`, which can be configured through the `spark.default.parallelism` property.
&nbsp; | *Example:* `ArchiveSpark.parallelism = 1000`

## Executor Options

Option| Description
:--------|:---
**catchExceptions** | Defines whether or not exceptions that occur while running an Enrich Function should be caught by ArchiveSpark. This is `true` by default; in that case, exceptions are silently caught and made accessible through `rdd.lastException`, and affected records can be filtered out using `rdd.filterNoException()` (see [Dataset Operations](Operations.md)). Otherwise, exceptions will cause the job to fail.
&nbsp; | *Example:* `ArchiveSpark.conf.catchExceptions = false`
**maxWarcDecompressionSize** | Limits the number of bytes to be read when extracting a (W)ARC.gz record. This can help to prevent failures due to out-of-memory errors. By default, it is set to 0, which means there is no limit.
&nbsp; | *Example:* `ArchiveSpark.conf.maxWarcDecompressionSize = 100 * 1024 // 100KB`
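
Putting both together, a job could configure ArchiveSpark as follows. This is a minimal sketch that simply combines the options documented above; the values are arbitrary placeholders, and the import path is an assumption based on the package layout used throughout these docs:

```scala
import de.l3s.archivespark.ArchiveSpark

// Driver option: number of partitions used by distributed ArchiveSpark operations
ArchiveSpark.parallelism = 1000

// Executor options, set through the distributed config
ArchiveSpark.conf.catchExceptions = false                // let enrichment exceptions fail the job
ArchiveSpark.conf.maxWarcDecompressionSize = 100 * 1024  // read at most 100 KB per (W)ARC.gz record
```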

[< Table of Contents](README.md) | [ArchiveSpark Operations >](Operations.md)
:---|---:
17 changes: 17 additions & 0 deletions docs/Contribute.md
@@ -0,0 +1,17 @@
[< Table of Contents](README.md) | [How to Implement DataSpecs >](Dev_DataSpecs.md)
:---|---:

# Contribute

Everyone is very welcome to contribute to ArchiveSpark. If you encounter any bugs that you would like to report, please use [GitHub Issues](https://github.com/helgeho/ArchiveSpark/issues). Of course, as everything is open source, you can also fix bugs yourself and file a pull request. We will then review your changes and potentially integrate your fix.

In order to extend the functionality of ArchiveSpark with additional [Enrich Functions](EnrichFuncs.md) and [Data Specifications](DataSpecs.md), we strongly encourage you to share these with others in separate projects. ArchiveSpark's flexible architecture makes it easy to integrate such modules from other projects, while keeping the core repository clean and focused on the basic features.

As an example, we have created a project that demonstrates how to extend ArchiveSpark and can be used as a template. It includes some very simple DataSpecs as well as Enrich Functions to remotely analyze digitized books from the Internet Archive with ArchiveSpark using local XML metadata: [IABooksOnArchiveSpark](https://github.com/helgeho/IABooksOnArchiveSpark)

For more information, please read:
* [How to Implement DataSpecs](Dev_DataSpecs.md)
* [How to Implement Enrich Functions](Dev_EnrichFuncs.md)

[< Table of Contents](README.md) | [How to Implement DataSpecs >](Dev_DataSpecs.md)
:---|---:
52 changes: 52 additions & 0 deletions docs/DataSpecs.md
@@ -0,0 +1,52 @@
[< Table of Contents](README.md) | [Enrich Functions >](EnrichFuncs.md)
:---|---:

# Data Specifications (DataSpecs)

Data Specifications (DataSpecs) are abstractions of the load and read logic for metadata as well as data records.
Depending on your data source and type, you need to select an appropriate one.
As part of core ArchiveSpark, we provide DataSpecs for Web Archives (CDX/(W)ARC format) as well as some raw data types, such as text.
More DataSpecs for different data types and sources can be found in separate projects, contributed by independent developers or by us (see below).

For more information on the usage of DataSpecs, please read [General Usage](General_Usage.md).
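
To give a rough idea of where a DataSpec fits into a job, the following sketch loads a CDX/(W)ARC collection using the `WarcCdxHdfsSpec` described below and then applies dataset operations to it. The enrichment (`enrich` with the `HtmlText` Enrich Function) and output (`saveAsJson`) calls stand in for the operations covered in [General Usage](General_Usage.md) and [Enrich Functions](EnrichFuncs.md); the paths and import lines are assumptions to be adapted to your setup:

```scala
import de.l3s.archivespark._
import de.l3s.archivespark.implicits._
import de.l3s.archivespark.specific.warc.specs._
import de.l3s.archivespark.enrich.functions._

// Step 1: load the dataset through a DataSpec (metadata from CDX, data from (W)ARC)
val rdd = ArchiveSpark.load(WarcCdxHdfsSpec("/path/to/*.cdx.gz", "/path/to/warc_dir"))

// Step 2: enrich the metadata records with derived information and save them as JSON
rdd.enrich(HtmlText).saveAsJson("/path/to/output.json.gz")
```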

## Web Archive DataSpecs

The following DataSpecs are specific to Web archive datasets. They become available through the following import: `import de.l3s.archivespark.specific.warc.specs._`

DataSpec| Description
:-------|:---
**[WarcCdxHdfsSpec](../src/main/scala/de/l3s/archivespark/specific/warc/specs/WarcCdxHdfsSpec.scala)**(*cdxPaths*, *warcPath*) | Loads a Web archive collection that is available in CDX and (W)ARC format from a (distributed) filesystem, like HDFS.
&nbsp; | *Example:* `val rdd = ArchiveSpark.load(WarcCdxHdfsSpec("/path/to/*.cdx.gz", "/path/to/warc_dir"))`
**[CdxHdfsSpec](../src/main/scala/de/l3s/archivespark/specific/warc/specs/CdxHdfsSpec.scala)**(*paths*) | Loads a collection of CDX records (metadata only) from a (distributed) filesystem, like HDFS. This is helpful to resolve *revisit records* before loading the corresponding (W)ARC records using `WarcHdfsCdxRddSpec`.
&nbsp; | *Example:* `val cdxRdd = ArchiveSpark.load(CdxHdfsSpec("/path/to/*.cdx.gz"))`
**[CdxHdfsWaybackSpec](../src/main/scala/de/l3s/archivespark/specific/warc/specs/CdxHdfsWaybackSpec.scala)**(*cdxPath*) | Loads a Web archive collection from local CDX records with the corresponding data being fetched from the [Internet Archive's Wayback Machine](http://web.archive.org) remotely.
&nbsp; | *Example:* `val rdd = ArchiveSpark.load(CdxHdfsWaybackSpec("/path/to/*.cdx.gz"))`
**[WarcHdfsSpec](../src/main/scala/de/l3s/archivespark/specific/warc/specs/WarcHdfsSpec.scala)**(*paths*) | Loads a Web archive dataset from (W)ARC files without corresponding CDX records. Please note that this may be much slower for most operations, except for batch processing jobs that involve the whole collection. It is therefore highly recommended to use this DataSpec only to generate corresponding CDX records and then reload the dataset using `WarcCdxHdfsSpec`, in order to benefit from ArchiveSpark's optimized two-step loading approach.
&nbsp; | *Example:* `val rdd = ArchiveSpark.load(WarcHdfsSpec("/path/to/*.*arc"))`
**[WarcGzHdfsSpec](../src/main/scala/de/l3s/archivespark/specific/warc/specs/WarcGzHdfsSpec.scala)**(*cdxPath*, *warcPath*) | An optimized version of `WarcHdfsSpec` for datasets stored in WARC.gz with each record compressed individually, making use of the [*HadoopConcatGz*](https://github.com/helgeho/HadoopConcatGz) input format.
&nbsp; | *Example:* `val rdd = ArchiveSpark.load(WarcGzHdfsSpec("/path/to/warc.gz"))`
**[WarcHdfsCdxPathRddSpec](../src/main/scala/de/l3s/archivespark/specific/warc/specs/WarcHdfsCdxPathRddSpec.scala)**(*cdxWithPathsRdd*) | Loads a Web archive dataset from a (distributed) filesystem, like HDFS, given an RDD with tuples of the form `(CdxRecord, WarcPath)`. After loading the CDX records using `CdxHdfsSpec`, an RDD of this form can be created using [`rdd.mapInfo(...)`](../src/main/scala/de/l3s/archivespark/specific/warc/implicits/ResolvableRDD.scala), given another RDD that maps metadata to corresponding (W)ARC paths.
&nbsp; | *Example:* `val cdxRdd = ArchiveSpark.load(CdxHdfsSpec("/path/to/*.cdx.gz"))`<br>`val cdxWithPathsRdd = cdxRdd.mapInfo(_.digest, digestWarcPathRdd)`<br>`val rdd = ArchiveSpark.load(WarcHdfsCdxPathRddSpec(cdxWithPathsRdd))`
**[WarcHdfsCdxRddSpec](../src/main/scala/de/l3s/archivespark/specific/warc/specs/WarcHdfsCdxRddSpec.scala)**(*cdxRdd*, *warcPath*) | Loads a Web archive dataset from a (distributed) filesystem, like HDFS, given an RDD of corresponding CDX records (e.g., loaded using `CdxHdfsSpec`).
&nbsp; | *Example:* `val rdd = ArchiveSpark.load(WarcHdfsCdxRddSpec(cdxRdd, "/path/to/warc"))`
**[WaybackSpec](../src/main/scala/de/l3s/archivespark/specific/warc/specs/WaybackSpec.scala)**(*url*, [*matchPrefix*], [*from*], [*to*], [*blocksPerPage*], [*pages*]) | Loads a Web archive dataset completely remotely from the [Internet Archive's Wayback Machine](http://web.archive.org) with the CDX metadata being fetched from their CDX server. More details on the parameters for this DataSpec can be found on the [CDX server documentation](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server).
&nbsp; | *Example:* `val rdd = ArchiveSpark.load(WaybackSpec("l3s.de", matchPrefix = true, from = 2010, to = 2012, pages = 100))`

## Additional DataSpecs for more data types

In addition to the Web archive specs, we also provide some additional specs for raw files, available through `import de.l3s.archivespark.specific.raw._`

DataSpec| Description
:-------|:---
**[HdfsFileSpec](../src/main/scala/de/l3s/archivespark/specific/raw/HdfsFileSpec.scala)**(*path*, [*filePatterns*]) | Loads raw data records ([`FileStreamRecord`](../src/main/scala/de/l3s/archivespark/specific/raw/HdfsFileSpec.scala)) from the given path, matching the specified file patterns. This is an alternative to Spark's native `sc.textFile(...)`, but offers more flexibility, as files can be filtered by name before they are loaded, and it also provides raw data stream access.
&nbsp; | *Example:* `val textLines = ArchiveSpark.load(HdfsFileSpec("/path/to/data", Seq("*.txt.gz"))).flatMap(_.lineIterator)`

More DataSpecs for additional data types can be found in the following projects:

* Also for Web archives, but using the temporal Web archive search engine [Tempas](http://tempas.L3S.de/v2) to fetch metadata by keyword, with the data records loaded remotely from the Internet Archive's Wayback Machine: [Tempas2ArchiveSpark](https://github.com/helgeho/Tempas2ArchiveSpark)
* DataSpecs to remotely analyze digitized books from the Internet Archive with ArchiveSpark using local XML metadata. The main purpose of this project is to demonstrate how easily ArchiveSpark can be extended: [IABooksOnArchiveSpark](https://github.com/helgeho/IABooksOnArchiveSpark)
* The [Medical Heritage Library (MHL)](http://www.medicalheritage.org/) on ArchiveSpark project contains the required components for ArchiveSpark to work with MHL collections. It includes three DataSpecs to load data remotely through MHL's full-text search as well as from local files: [MHLonArchiveSpark](https://github.com/helgeho/MHLonArchiveSpark)

[< Table of Contents](README.md) | [Enrich Functions >](EnrichFuncs.md)
:---|---:
11 changes: 11 additions & 0 deletions docs/Dev_DataSpecs.md
@@ -0,0 +1,11 @@
[< Table of Contents](README.md) | [How to Implement Enrich Functions >](Dev_EnrichFuncs.md)
:---|---:

# How to Implement DataSpecs

ArchiveSpark comes with a base class for Data Specifications, called [`DataSpec`](../src/main/scala/de/l3s/archivespark/dataspecs/DataSpec.scala). It accepts two type parameters: the first, `Raw`, is the type of the metadata as loaded from disk or a remote source by the `load` method, e.g., `String` for raw text. Each loaded metadata record is then passed to the `parse` method, which has to be implemented with the logic to transform the raw metadata into a record of your dataset type `Record`, the second type parameter. This can be any custom class derived from [`EnrichRoot`](../src/main/scala/de/l3s/archivespark/enrich/EnrichRoot.scala). These records store and provide access to the metadata and include the logic to access the actual data records.
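
A bare-bones sketch of such an implementation is shown below. It assumes a hypothetical record class `MyTextRecord` (derived from `EnrichRoot`) defined elsewhere, and the exact signatures of `load` and `parse` may differ slightly from the actual `DataSpec` base class, so please consult the linked sources before building on it:

```scala
import de.l3s.archivespark.dataspecs.DataSpec
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical DataSpec: Raw = String (one line of text), Record = MyTextRecord,
// a custom EnrichRoot subclass assumed to be defined elsewhere.
class MyTextSpec(path: String) extends DataSpec[String, MyTextRecord] {
  // load: read the raw metadata from disk (or a remote source)
  def load(sc: SparkContext, minPartitions: Int): RDD[String] =
    sc.textFile(path, minPartitions)

  // parse: transform one raw metadata record into a dataset record (None drops it)
  def parse(raw: String): Option[MyTextRecord] =
    if (raw.trim.nonEmpty) Some(new MyTextRecord(raw)) else None
}
```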

For examples, please have a look at the included DataSpecs, such as [`HdfsFileSpec`](../src/main/scala/de/l3s/archivespark/specific/raw/HdfsFileSpec.scala) or the external [IABooksOnArchiveSpark](https://github.com/helgeho/IABooksOnArchiveSpark) project. For more information on how to deploy and share your DataSpecs, please read [Contribute](Contribute.md).

[< Table of Contents](README.md) | [How to Implement Enrich Functions >](Dev_EnrichFuncs.md)
:---|---:
13 changes: 13 additions & 0 deletions docs/Dev_EnrichFuncs.md
@@ -0,0 +1,13 @@
[< Table of Contents](README.md) | [Contribute >](Contribute.md)
:---|---:

# How to Implement Enrich Functions

ArchiveSpark comes with multiple base classes to implement your custom Enrich Functions. These can provide completely new derivation / extraction logic or expose an interface to your own libraries. In order to deploy and share your Enrich Functions, please create a separate project, or include an additional class within your library's project that serves as an API to your library.

All Enrich Functions need to be of type `EnrichFunc`, which is a generic class that accepts two types: the `Root` type, which defines which type of records it is applicable to by default, and the `Source` type, which defines its input type. To define a default result field with a default output type, we provide the [`DefaultField`](../src/main/scala/de/l3s/archivespark/enrich/DefaultField.scala) trait as well as the [`SingleField`](../src/main/scala/de/l3s/archivespark/enrich/SingleField.scala) trait for Enrich Functions that produce only one result field. For the most common types of Enrich Functions, we provide simplified base classes that are usually sufficient: [BasicEnrichFunc](../src/main/scala/de/l3s/archivespark/enrich/BasicEnrichFunc.scala), [BasicDependentEnrichFunc](../src/main/scala/de/l3s/archivespark/enrich/BasicDependentEnrichFunc.scala), [BasicMultiValEnrichFunc](../src/main/scala/de/l3s/archivespark/enrich/BasicMultiValEnrichFunc.scala), [BasicMultiValDependentEnrichFunc](../src/main/scala/de/l3s/archivespark/enrich/BasicMultiValDependentEnrichFunc.scala).
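
Purely as an illustration of the shape described above, the following sketch defines an Enrich Function that derives a single integer field from a string value. The base classes and member names (`source`, `resultField`, `derive`, `Derivatives`) are assumptions based on the description and may not match the actual traits exactly; please refer to the linked sources and the included Enrich Functions for the authoritative signatures:

```scala
import de.l3s.archivespark.enrich._

// Hypothetical Enrich Function: Root is any EnrichRoot, Source is a String value,
// and it produces one integer field ("length") holding the string's length.
object StringLength extends EnrichFunc[EnrichRoot, String] with SingleField[Int] {
  def source: Seq[String] = Seq("string")   // path to the input field (assumed)
  def resultField: String = "length"        // name of the produced field (assumed)

  def derive(source: TypedEnrichable[String], derivatives: Derivatives): Unit = {
    derivatives << source.get.length        // append the derived value (assumed API)
  }
}
```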

For examples, please have a look at the included Enrich Functions, such as [`LowerCase`](../src/main/scala/de/l3s/archivespark/enrich/functions/LowerCase.scala), or the external [IABooksOnArchiveSpark](https://github.com/helgeho/IABooksOnArchiveSpark) project. For more information on how to deploy and share your Enrich Functions, please read [Contribute](Contribute.md).

[< Table of Contents](README.md) | [Contribute >](Contribute.md)
:---|---: