Sparser

Exploratory big data applications that run on un- or semi-structured data spend a considerable amount of their execution time parsing it. Research has shown that most such queries select only a small fraction of records. Parsing, however, is expensive, so CPU cycles are wasted parsing records the query will ultimately discard. Sparser addresses this bottleneck by filtering data before parsing, so that only records that may satisfy the query predicate are handed to the parser.

Given a query predicate, Sparser, a system that filters out "unnecessary" records, either uses its optimizer to build an efficient filtering strategy or turns filtering off entirely when it judges that filtering will not help.

Therefore, Sparser is not a replacement but a complement to existing parsers.
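
As a rough illustration of the idea, consider the sketch below (hypothetical names, not Sparser's actual API): a cheap byte-level substring "raw filter" derived from the predicate rejects records before the expensive parser ever runs.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch of the filter-before-parse idea; none of these
// names come from Sparser's actual API.
public class FilterBeforeParse {

    // A "raw filter": a cheap byte-level substring scan that decides
    // whether a record could possibly satisfy the predicate.
    static boolean mightMatch(byte[] data, byte[] needle) {
        outer:
        for (int i = 0; i <= data.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return true;  // candidate: must be parsed and re-checked
        }
        return false;     // definite non-match: skip parsing entirely
    }

    public static void main(String[] args) {
        List<String> records = List.of(
                "{\"user\":\"ada\",\"lang\":\"Java\"}",
                "{\"user\":\"bob\",\"lang\":\"Go\"}");
        // Substring derived from the (hypothetical) predicate lang == "Java".
        byte[] needle = "\"lang\":\"Java\"".getBytes(StandardCharsets.UTF_8);

        for (String rec : records) {
            if (mightMatch(rec.getBytes(StandardCharsets.UTF_8), needle)) {
                // Only candidates reach the expensive parser; the predicate
                // is evaluated exactly on the parsed record afterwards.
                System.out.println("parse + verify: " + rec);
            }
        }
    }
}
```

Such a raw filter can produce false positives (a record may contain the substring without satisfying the full predicate), so the predicate is still evaluated exactly after parsing; but it produces no false negatives, so no matching record is lost.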

Refer to the project's wiki for more insights.

Project Structure

This multi-module project includes the following modules:

  • benchmark - a separate utility project to facilitate benchmarking Sparser
  • core - Sparser's core functionality
  • jacoco-report - contains no code and exists solely to aggregate test coverage metrics from the other modules

Learnings

Link to the Sparser paper: Filter Before You Parse: Faster Analytics on Raw Data with Sparser.

Link to the video presentation by the paper's authors: Faster Parsing of Unstructured Data Formats in Apache Spark.

To-Do

  • implement isDNF
  • change the implementation of subssearch
  • re-optimization strategy
  • raw filtering