
Added Parallel Loading and Retrieval

1 parent 28e3a54 commit 317f53d001fd490e0264bd3631a8a17055c5b115 @ldodds committed Aug 18, 2011
2 src/application-patterns.xml
@@ -12,6 +12,8 @@
&composite-descriptions;
&follow-your-nose;
&missing-isnt-broken;
+&parallel-loading;
+&parallel-retrieval;
&resource-caching;
&schema-annotation;
&smushing;
1 src/blackboard.xml
@@ -33,6 +33,7 @@
<sect2><title>Related</title>
<itemizedlist>
<listitem><link linkend="progressive-enrichment">Progressive Enrichment</link></listitem>
+ <listitem><link linkend="parallel-loading">Parallel Loading</link></listitem>
</itemizedlist>
</sect2>
</sect1>
4 src/follow-your-nose.xml
@@ -37,7 +37,8 @@
limit the data retrieved, e.g. by only following certain types of relationship or restricting the domains from which data will be retrieved.
The former allows a more directed "crawl" to find related information, while the latter allows simple white/black-listing to only obtain data
from trusted sources.</para>
- <para>An application might also want to limit network traffic by performing <link linkend="resource-caching">Resource Caching</link>.</para>
+ <para>An application might also want to limit network traffic by performing <link linkend="resource-caching">Resource Caching</link>.
+ <link linkend="parallel-retrieval">Parallel Retrieval</link> can also improve performance.</para>
<para>The retrieved data will often be parsed into one RDF graph that can then be queried or manipulated within the application. This "working set"
might be cached as well as the original source descriptions, to allow for the fact that the same data may be repeatedly referenced.</para>
<para>Some additional processing may also be carried out on the retrieved data, e.g. to apply <link linkend="smushing">Smushing</link>
@@ -50,6 +51,7 @@
<listitem><link linkend="see-also">See Also</link></listitem>
<listitem><link linkend="smushing">Smushing</link></listitem>
<listitem><link linkend="resource-caching">Resource Caching</link></listitem>
+ <listitem><link linkend="parallel-retrieval">Parallel Retrieval</link></listitem>
</itemizedlist>
</sect2>
2 src/linked-data-patterns.xml
@@ -52,6 +52,8 @@
<!ENTITY missing-isnt-broken SYSTEM "missing-isnt-broken.xml">
<!ENTITY follow-your-nose SYSTEM "follow-your-nose.xml">
<!ENTITY smushing SYSTEM "smushing.xml">
+ <!ENTITY parallel-retrieval SYSTEM "parallel-retrieval.xml">
+ <!ENTITY parallel-loading SYSTEM "parallel-loading.xml">
<!ENTITY date SYSTEM "date.txt">
41 src/parallel-loading.xml
@@ -0,0 +1,41 @@
+<sect1 id="parallel-loading">
+ <title>Parallel Loading</title>
+ <para>
+ <emphasis>How can we reduce loading times for a web-accessible triple store?</emphasis>
+ </para>
+
+ <sect2><title>Context</title>
+ <para>It is quite common for triple stores to expose an HTTP-based API to support data loading, e.g. via
+ SPARQL 1.1 Update or the SPARQL 1.1 Uniform HTTP Protocol. It can be inefficient or difficult to POST very large datasets over HTTP,
+ e.g. due to protocol time-outs, network errors, etc.</para>
+ </sect2>
+
+ <sect2><title>Solution</title>
+ <para>Chunk the data to be loaded into smaller files and use a number of worker processes to submit the data via parallel
+ HTTP requests.</para>
+ </sect2>
+
+ <sect2><title>Example(s)</title>
+ <para>Most good HTTP client libraries will support parallelisation of HTTP requests, e.g. PHP's
+ <ulink url="http://us.php.net/manual/en/function.curl-multi-init.php">curl_multi</ulink> functions or Ruby's
+ <ulink url="https://github.com/dbalatero/typhoeus">typhoeus</ulink> library.</para>
+ </sect2>
+
+ <sect2><title>Discussion</title>
+ <para>Parallelisation can substantially reduce total load times. Because an RDF graph is an unordered set of triples, there is no
+ required ordering when adding statements to a store. This means that it is usually possible to divide an RDF data dump into a number of smaller files or
+ chunks for loading via parallel POST requests.</para>
+ <para>This approach works best when the RDF data is made available as N-Triples, because the chunking can be done by simply splitting
+ the file on line boundaries. This isn't possible with RDF/XML, or with Turtle files that use prefixes or other syntactic short-cuts.</para>
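The chunk-and-POST approach can be sketched as follows. This is a minimal illustration, not taken from the book: the endpoint URL, chunk size, and media type are assumptions to adapt to a store's actual loading API.

```python
# Sketch of Parallel Loading: split an N-Triples dump on line boundaries
# and POST the chunks concurrently. ENDPOINT is a hypothetical graph
# store URL; a real store may require authentication or a graph parameter.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

ENDPOINT = "http://localhost:8080/data"  # hypothetical endpoint


def chunk_ntriples(lines, size=10000):
    """N-Triples is line-oriented, so a dump can be split on line boundaries."""
    for i in range(0, len(lines), size):
        yield "".join(lines[i:i + size])


def post_chunk(chunk):
    """Submit one chunk of triples via an HTTP POST request."""
    req = Request(ENDPOINT, data=chunk.encode("utf-8"),
                  headers={"Content-Type": "application/n-triples"})
    with urlopen(req) as resp:
        return resp.status


def load_parallel(lines, workers=4):
    """Load all chunks using a pool of parallel worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(post_chunk, chunk_ntriples(lines)))
```

Splitting on line boundaries is only safe for N-Triples; as noted above, prefixed Turtle or RDF/XML cannot be chunked this way.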
+ <para>The one caveat to this approach is if the data contains blank nodes. It is important that all statements about a single blank
+ node are submitted in the same batch. Either avoid using bnodes, or split the file based on a
+ <link linkend="bounded-description">Bounded Description</link> of each resource.</para>
+ </sect2>
+
+ <sect2><title>Related</title>
+ <itemizedlist>
+ <listitem><link linkend="parallel-retrieval">Parallel Retrieval</link></listitem>
+ </itemizedlist>
+ </sect2>
+
+</sect1>
40 src/parallel-retrieval.xml
@@ -0,0 +1,40 @@
+<sect1 id="parallel-retrieval">
+ <title>Parallel Retrieval</title>
+ <para>
+ <emphasis>How can we improve performance of an application dynamically retrieving Linked Data?</emphasis>
+ </para>
+
+ <sect2><title>Context</title>
+ <para>An application that draws on data from the web will typically retrieve a number of
+ different resources. This is especially true if it uses the <link linkend="follow-your-nose">Follow Your Nose</link>
+ pattern to discover data.</para>
+ </sect2>
+
+ <sect2><title>Solution</title>
+ <para>Use several workers to make parallel GET requests, with each worker writing into a shared RDF graph.</para>
+ </sect2>
+
+ <sect2><title>Example(s)</title>
+ <para>Most good HTTP client libraries will support parallelisation of HTTP requests, e.g. PHP's
+ <ulink url="http://us.php.net/manual/en/function.curl-multi-init.php">curl_multi</ulink> functions or Ruby's
+ <ulink url="https://github.com/dbalatero/typhoeus">typhoeus</ulink> library.</para>
+ </sect2>
+
+ <sect2><title>Discussion</title>
+ <para>Parallelisation of HTTP requests can greatly reduce retrieval times, potentially to little more than the time of the single
+ longest GET request.</para>
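The worker-pool approach can be sketched as follows. This is an illustrative sketch only: the URLs are placeholders, and a real application would parse each response with an RDF library into a proper graph rather than collecting raw response bodies.

```python
# Sketch of Parallel Retrieval: fetch several Linked Data resources
# concurrently and merge the results into one shared collection, which
# stands in for the application's working RDF graph.
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import Request, urlopen


def fetch(url):
    """GET one resource, asking for an RDF serialisation."""
    req = Request(url, headers={"Accept": "text/turtle"})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8")


def retrieve_all(urls, fetch_fn=fetch, workers=8):
    """Fetch all URLs in parallel; total time approaches that of the
    slowest single GET rather than the sum of all requests."""
    graph = []  # stands in for a shared RDF graph
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_fn, u): u for u in urls}
        for future in as_completed(futures):
            graph.append(future.result())
    return graph
```

Passing the fetch function as a parameter also makes it easy to plug in a caching fetcher, combining this pattern with Resource Caching.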
+ <para>By combining this approach with <link linkend="resource-caching">Resource Caching</link> of the individual
+ responses, an application can maintain a local cache of the most frequently requested data, which can then be combined
+ and parsed into a single RDF graph for driving application behaviour.</para>
+ <para>Parallelisation is particularly useful for AJAX-based applications, as browsers are well optimised
+ for making a large number of parallel HTTP requests.</para>
+ </sect2>
+
+ <sect2><title>Related</title>
+ <itemizedlist>
+ <listitem><link linkend="follow-your-nose">Follow Your Nose</link></listitem>
+ <listitem><link linkend="parallel-loading">Parallel Loading</link></listitem>
+ </itemizedlist>
+ </sect2>
+
+</sect1>
1 src/resource-caching.xml
@@ -21,6 +21,7 @@
<sect2><title>Related</title>
<itemizedlist>
<listitem><link linkend="follow-your-nose">Follow Your Nose</link></listitem>
+ <listitem><link linkend="parallel-retrieval">Parallel Retrieval</link></listitem>
</itemizedlist>
</sect2>
