This solution was ranked 1st place for the original hidden test. This solution was ranked 2nd place for the final, readjusted test.
(a) Team name Campers
(b) For each member in the team: name, email, institution, department, and degree Henrik Mühe, email@example.com, PhD student, Technische Universität München (Germany), Institute of Computer Science, Master of Science with honors Florian Funke, firstname.lastname@example.org, PhD student, Technische Universität München (Germany), Institute of Computer Science, Diplom Informatiker (former German equivalent to Master of Science in Computer Science)
(c) Supervisor name (if your team has one) Our general PhD Supervisors are Professor Alfons Kemper, Ph.D. and Professor Dr. Thomas Neumann
(d) Brief description of your approach Our approach is based on massive parallelism, architecture-aware optimizations, efficient computation of the distance metrics and clever indexing/filtering of query words/document words to minimize the number of distance computations. We briefly sketch out approach here, details can be found in the comments.
(0) Terminology and Overview - Match Type: Exact Matching, Hamming-Distance Matching or Edit-/Levenshtein-Distance Matching - Metric: Used to compute the distance of two words using a given Match Type. - Matcher: Object of a given match type that allows the (de-)registration of query words (StartQuery/EndQuery) and that can determine the set of active queries that match a given document. - Query Word: A query word is a word occurring in one or more queries. For each match type, the query words are unique and contain the set of queries they originate from. Each query word has a minimum and a maximum distance (in case of the Hamming/Edit match types). - core.cpp: Accept API calls dispatch the associated operations to a "processor" (see processor.hpp and tbb_processor.hpp). - Tokenizer: Represents a (tokenized) document. The document's words are deduplicated and are indexed using a hash table. Words are stored ordered by the first two Haar Wavelet coefficients of their frequency histogram (see (1.2)). The first coefficient corresponds to the word length, the second coefficient corresponds to the distance between the sum of the first part of the histogram and the sum of the second part. (1) Indexing/Filtering/Pruning We apply various techniques to reduce the number of times each metric (Exact, Hamming, Edit) is computed as they are computationally expensive. This is particularly beneficial for the more involved Hamming-/Edit-distance computation. (1.1) Cover Pruning We construct death lists which can be used to prune query words. If a certain query word QW1 has not matched any word in the document, there is a set of query words QWs that can be skipped as they will never add a query to the resulting list of matching queries because there are no queries that contain any query word from QWs and not QW1. (1.2) Frequency Filter We use two character frequency histograms to efficiently determine whether two words (a query word and a document word) can be within a certain Edit-/Hamming-distance. The two histograms are Frequency and FrequencyFast (which is faster to evaluate but less accurate than Frequency). The filter itself is described in detail in frequency.hpp. The frequency filters are leveraged in two ways when searching for matches for a given query word in a document: First, we have ordered document words according to the first and second Haar Wavelet coefficient (see Tokenizer::parse) and can skip a large part of this sequence when testing the query word against document words (see edit_distance.cpp). Second, in order for two words to be within edit distance t, their frequency delta must be <= 2*t-length_difference. This can be checked efficiently before computing the actual metric. (1.3) Caching In the Edit-distance matcher, we maintain a cache of document words for each query word, that have previously matched this query word with distance <= 3. We probe each document's hash table with the caches to see if a previous match is contained in the document. This may either directly yield to a good enough match of the query word or might help to tighten the bounds of the edit distance matcher's inner loop that computes the distances of a query word and the potentially matching document words. (2) Parallel Processing We expose a very high degree of parallelism in our approach to leverage the full potential of multi-core architectures. The most computationally-intensive task is MatchDocument(). Thus calls to this routine launch asynchronous tasks which in turn spawn separate sub-tasks to determine the matching queries for each match type (Exact, Hamming, Edit). The computationally-intensive Hamming- and Edit-distance sub-tasks are in turn parallelized: The set of query words that potentially have matches in the document is tested against various filters (see section (1)) and eventually against potentially matching document words. This set is split into subsets that are processed in parallel. While we heavily parallelize the computation, we require only a minium number of synchronization points. (3) Architecture-Aware Optimizations Besides leveraging the available hardware threads, we carefully designed our solution to fit modern architectures. SIMD (data-level) parallelism is exploited when computing Hamming- and Edit-distances, hash values, frequency filter deltas (see (1)) and when parsing the document. We optimize for high data- and instruction-cache hit rates in all performance critical sections of our code: Objects are either stored inside the containing data structure (e.g. Matcher::CompactQW) or reside in consecutive memory chunks when pointers are used (e.g. Pool) and are carefully aligned. Hot loops are kept compact to avoid instruction-cache misses.
(e) Clear note about any third party code used in your submission We use the following static third party libraries as per the information given at https://groups.google.com/forum/#!msg/sigmod2013contest/KNu4jbrng7M/q_mv4V1zx6UJ 1. Intel Thread Building Blocks (http://threadingbuildingblocks.org/): We use this popular parallel programming template library to parallelize loops, control the execution of asynchronous tasks, replace STL containers with concurrent implementations, etc. It is licensed under GPLv2 with the (libstdc++) runtime exception (see http://threadingbuildingblocks.org/licensing for details). 2. Boost (www.boost.org): We use Boost's string_ref class (www.boost.org/libs/utility/doc/html/string_ref.html), an non-owning reference to a string. The Boost libraries are liccensed under Boost Software License (see http://www.boost.org/LICENSE_1_0.txt). 3. Lockless Allocator (http://locklessinc.com/): We statically link against this allocator to efficiently handle memory (de-)allocations by concurrent threads. The Lockless Allocator is licensed under GPL 3 (see http://locklessinc.com/gpl3.shtml).