Sparser

Exploratory big data applications that run on un- or semi-structured data spend a considerable amount of their execution time parsing it. Research has shown that most such queries select only a small fraction of records. Parsing, however, is expensive, so CPU cycles are wasted parsing records the query will ultimately discard. Sparser addresses this bottleneck by filtering data before parsing, so that only records that may satisfy the query predicate are handed to the parser.

Given a query predicate, Sparser, a system that filters out "unnecessary" records, either uses its optimizer to build an efficient filtering strategy or turns filtering off entirely when it judges that filtering will not help.

Therefore, Sparser is not a replacement but a complement to existing parsers.
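
As a rough illustration of the idea, consider the sketch below (hypothetical names, not Sparser's actual API): a cheap byte-level substring "raw filter" derived from the predicate rejects records before the expensive parser ever runs.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch of the filter-before-parse idea; none of these
// names come from Sparser's actual API.
public class FilterBeforeParse {

    // A "raw filter": a cheap byte-level substring scan that decides
    // whether a record could possibly satisfy the predicate.
    static boolean mightMatch(byte[] data, byte[] needle) {
        outer:
        for (int i = 0; i <= data.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return true;  // candidate: must be parsed and re-checked
        }
        return false;     // definite non-match: skip parsing entirely
    }

    public static void main(String[] args) {
        List<String> records = List.of(
                "{\"user\":\"ada\",\"lang\":\"Java\"}",
                "{\"user\":\"bob\",\"lang\":\"Go\"}");
        // Substring derived from the (hypothetical) predicate lang == "Java".
        byte[] needle = "\"lang\":\"Java\"".getBytes(StandardCharsets.UTF_8);

        for (String rec : records) {
            if (mightMatch(rec.getBytes(StandardCharsets.UTF_8), needle)) {
                // Only candidates reach the expensive parser; the predicate
                // is evaluated exactly on the parsed record afterwards.
                System.out.println("parse + verify: " + rec);
            }
        }
    }
}
```

Such a raw filter can produce false positives (a record may contain the substring without satisfying the full predicate), so the predicate is still evaluated exactly after parsing; but it produces no false negatives, so no matching record is lost.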

Refer to the project's wiki for more insights.

Project Structure

This multi-module project includes the following modules:

  • benchmark - a separate utility project to facilitate benchmarking Sparser
  • core - Sparser's core functionality
  • jacoco-report - contains no code and exists solely to aggregate test coverage metrics from the other modules

Learnings

Link to the Sparser paper: Filter Before You Parse: Faster Analytics on Raw Data with Sparser.

Link to the video presentation by the paper's authors: Faster Parsing of Unstructured Data Formats in Apache Spark.

To-Do

  • implement isDNF
  • change the implementation of subssearch
  • re-optimization strategy
  • raw filtering