SPARQL Crawl Server
SparqlCrawlServer
=================

Version: 1.6
Author: netEstate GmbH, http://www.netestate.de/, email@example.com
License: Apache License version 2 <http://www.apache.org/licenses/>

Sparql Crawl Server is a Java Web application that crawls a list of URLs
containing RDF and executes a SPARQL query on the resulting graphs. Every
document gets its own named graph with the final document URL as graph name.
Already crawled documents are cached in memory and on disk.

The crawler supports RDF/XML, RDFa, N3 and Turtle and will open a maximum of
one HTTP connection per server IP. It respects the Robots Exclusion Standard.

HTTP POST request parameters (UTF-8 encoded):

query: the SPARQL query.
graphs: a comma-separated list of URLs in <>: <url1>,<url2>,...
timeout: number of seconds to wait for the RDF documents.

HTTP Response: application/sparql-results+json

If the timeout is reached, the SPARQL query will be executed with the current
results. RDF from unfinished HTTP requests is added to the cache when finished.

Build instructions:

Configure the path to your settings file in
SparqlCrawlServer/WebContent/WEB-INF/web.xml.
Build by invoking ant. Minimum Java version is 7.0, minimum Servlet version
is 3.0.
Deploy the file SparqlCrawlServer.war in your Servlet container (e.g. Tomcat 7).
The context-path of the servlet is /sparqlcrawl (so, normally the endpoint URL
is http://<your-server>/SparqlCrawlServer/sparqlcrawl).

Settings file:

The settings file is a Groovy file, see example-settings.groovy.

cacheLifetimeOk: number of seconds to cache crawled RDF documents.
cacheLifetimeContentTooLong: number of seconds to cache 'content-too-long' crawl results (see maxRdfXmlSize).
cacheLifetimeHttpError: number of seconds to cache HTTP error crawl results.
cacheLifetimeNetworkError: number of seconds to cache network error crawl results.
cacheLifetimeNotAllowedByRobots: number of seconds to cache 'disallowed' crawl results.
cacheLifetimeParseError: number of seconds to cache RDF parse error crawl results.
cacheLifetimeWrongContentType: number of seconds to cache wrong content type error crawl results.
datasetCacheLifetimeComplete: number of seconds to cache completely crawled document lists in memory.
datasetCacheLifetimeIncomplete: number of seconds to cache incompletely crawled document lists in memory.
logFilePath: path to the log file.
crawlThreads: number of threads used for crawling.
crawlThreadStackSize: stack size for crawl threads.
dnsThreads: number of threads for DNS lookup.
dnsThreadStackSize: stack size for DNS lookup threads.
dnsTimeout: DNS timeout.
maxRdfXmlSize: maximum size of RDF documents.
maxRobotsAge: maximum cache time of robots.txt files.
maxRobotsSize: maximum allowed size of a robots.txt file.
modelCachePath: path to the disk cache folder.
socketTimeout: socket timeout.
unixOs: is the underlying OS Unix-like? (true/false)
userAgent: user-agent header value for crawl requests.
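
Example client (sketch):

The HTTP POST interface described above (query, graphs, timeout) can be
exercised with a small Java client. The following is a minimal sketch, not part
of the project: the class name CrawlRequest, its method names and the
example.org URLs are hypothetical, and it assumes the servlet reads its
parameters from a standard application/x-www-form-urlencoded body, which the
README does not state explicitly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class; not part of SparqlCrawlServer itself.
class CrawlRequest {

    // Formats graph URLs as <url1>,<url2>,... as required by the "graphs" parameter.
    static String formatGraphs(List<String> urls) {
        StringBuilder sb = new StringBuilder();
        for (String url : urls) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append('<').append(url).append('>');
        }
        return sb.toString();
    }

    // Builds the UTF-8, form-encoded request body (form encoding is an assumption).
    static String buildBody(String query, List<String> graphUrls, int timeoutSeconds) {
        try {
            return "query=" + URLEncoder.encode(query, "UTF-8")
                    + "&graphs=" + URLEncoder.encode(formatGraphs(graphUrls), "UTF-8")
                    + "&timeout=" + timeoutSeconds;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Sends the body to the endpoint and returns the raw response text,
    // which the README says is application/sparql-results+json.
    static String post(String endpoint, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded; charset=UTF-8");
        conn.setRequestProperty("Accept", "application/sparql-results+json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8")) {
            scanner.useDelimiter("\\A"); // read the whole stream as one token
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        String body = buildBody(
                "SELECT * WHERE { GRAPH ?g { ?s ?p ?o } } LIMIT 10",
                Arrays.asList("http://example.org/a.rdf", "http://example.org/b.rdf"),
                30);
        System.out.println(body);
        // Pass the endpoint URL as the first argument to actually send the request,
        // e.g. http://<your-server>/SparqlCrawlServer/sparqlcrawl
        if (args.length > 0) {
            System.out.println(post(args[0], body));
        }
    }
}
```

Run without arguments, the program only prints the encoded request body. Run
with the endpoint URL as its first argument, it also sends the request and
prints the response. The timeout of 30 seconds is an arbitrary example value.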