We currently support metadata in the form of XML files with the contents in text files, either both stored locally or with the contents fetched remotely from the Internet Archive. In addition, we support fully remote queries that use MHL's advanced full-text search to retrieve the required metadata.
The following ArchiveSpark DataSpecs are included:
- MhlHdfsSpec (with local data): Loads data from local files (from HDFS). It takes two parameters:
- MhlHdfsSpec (with remote data): Loads metadata from local files (from HDFS) and fetches contents remotely:
- MhlSearchSpec: Queries the MHL search system and fetches contents remotely. A query can be specified by passing in MhlSearchOptions:
MhlSearchOptions lets you specify a query, whether it is conjunctive or disjunctive, which fields should be queried, and the desired languages, mediatypes and collections. The available options are provided by specialized objects under edu.harvard.countway.mhl.archivespark.search.
Please make sure that both ArchiveSpark and this MHLonArchiveSpark library are on your classpath.
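As a rough illustration of the classpath requirement, the sketch below assumes you work in `spark-shell`, whose `--jars` option takes a comma-separated list of JAR files. Both JAR paths are hypothetical placeholders, not the actual file names produced by either project, and the command is only printed rather than executed.

```shell
#!/bin/sh
# Hypothetical JAR locations; replace with the actual paths on your system.
ARCHIVESPARK_JAR="/path/to/archivespark-assembly.jar"
MHL_JAR="/path/to/mhl-on-archivespark-assembly.jar"

# spark-shell expects --jars as a single comma-separated list.
JARS="${ARCHIVESPARK_JAR},${MHL_JAR}"

# Dry run: print the command instead of launching it.
# Remove the leading `echo` to actually start the shell.
echo spark-shell --jars "$JARS"
```

The same `--jars` list should also work with `spark-submit`. If you use a notebook kernel instead, the JARs need to be added through that kernel's own configuration.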
To build a JAR file of this project, we recommend using SBT. Because the project depends on the Internet Archive Books on ArchiveSpark project, you first need to publish that project locally before you can build this one:
    git clone https://github.com/helgeho/IABooksOnArchiveSpark.git
    cd IABooksOnArchiveSpark
    sbt publishLocal
    cd ..
    git clone https://github.com/helgeho/MHLonArchiveSpark.git
    cd MHLonArchiveSpark
    sbt assembly
Now the resulting JAR file should be located under