About

Cascalog is a tool for processing data on Hadoop with Clojure in a concise, expressive, and highly readable manner. Cascalog combines two cutting-edge technologies, Clojure and Hadoop, and resurrects an old one, Datalog. Cascalog is high performance, flexible, and robust.

Most query languages, like SQL, Pig, and Hive, are custom languages, and this leads to huge amounts of accidental complexity. Constructing queries dynamically through string manipulation is error-prone and invites further problems such as SQL injection attacks. Because Cascalog is a domain-specific language embedded in Clojure, it avoids this accidental complexity and lets a programmer manipulate queries as first-class entities within the language. Cascalog's Datalog syntax is also simpler and more expressive than SQL-based languages.
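To make the "first-class entities" point concrete, here is a minimal sketch of a query built by an ordinary Clojure function instead of by string concatenation. The namespace, the function names, and the inline `age` dataset are hypothetical; the sketch also assumes that an in-memory vector of tuples can serve as a generator, as the tutorial's playground datasets do.

```clojure
(ns example.first-class-queries
  (:use cascalog.api))

;; Hypothetical in-memory dataset of [person age] tuples.
(def age [["alice" 28] ["bob" 33] ["chris" 40]])

;; An ordinary function that returns a query. `max-age` is a plain
;; Clojure value spliced into the query, so no query string is ever
;; assembled by hand.
(defn people-younger-than [max-age]
  (<- [?person ?a]
      (age ?person ?a)
      (< ?a max-age)))

;; Executing the query is a separate step: `?-` runs it and writes
;; the resulting tuples to standard output.
(defn print-people-younger-than [max-age]
  (?- (stdout) (people-younger-than max-age)))
```

Because `people-younger-than` returns the query as a value, it can be passed to other functions, stored, or used as a subquery before anything is executed.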

Follow the getting started steps, check out the tutorial, and you'll be running Cascalog queries on your local computer within 5 minutes.

Getting started

Make sure you have Java 1.6 installed.
Set the JVM heap size: export JAVA_OPTS=-Xmx768m
Install Leiningen.
Clone the repository: git clone git://github.com/nathanmarz/cascalog.git
Fetch dependencies and compile: cd cascalog && lein deps && lein compile
Optionally, run "lein test" to make sure the tests pass.

The entire Cascalog API is defined within src/clj/cascalog/api.clj. Helpers for testing queries can be found in src/clj/cascalog/testing.clj.

Tutorials

Running Cascalog queries on a Hadoop cluster

Cascalog includes Hadoop as a dependency so that you can experiment with it easily. When building a jar to run on a cluster, don't bundle the Hadoop jars in the jar that contains Cascalog.
Cascalog requires Cascading 1.1.
Any custom operations must be compiled into the jar you give to Hadoop for running jobs.
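As an illustration of the kind of code the last point covers, here is a minimal sketch of a custom operation and a query that uses it. The namespace, the `split-words` helper, the `split` operation name, and the `sentences` dataset are all hypothetical, and the sketch assumes `defmapcatop` is available from `cascalog.api`.

```clojure
(ns example.ops
  (:use cascalog.api))

;; Plain Clojure helper: split a sentence on runs of whitespace.
(defn split-words [sentence]
  (seq (.split sentence "\\s+")))

;; Custom operation that emits one output tuple per word.
(defmapcatop split [sentence]
  (split-words sentence))

;; Hypothetical in-memory input of one-field tuples.
(def sentences [["hello world"] ["hello cascalog"]])

;; A query that applies the custom operation to every sentence.
(defn words-query []
  (<- [?word]
      (sentences ?sentence)
      (split ?sentence :> ?word)))
```

Since `split` is a custom operation, the namespace that defines it would need to be compiled into the job jar before the query can run on a cluster.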

Questions?

Google group: cascalog-user

IRC: Come chat in the #cascading room on freenode

Priorities for Cascalog development

Replicated and bloom joins
Cross query optimization: push constants and filters down into subqueries when possible
Negation, e.g. "people who like dogs and don't like cats": (<- [?p] (likes ?p "dogs") (likes ?p "cats" :> false)) [implement with a multigroupby of some sort]
Disjunction, e.g. "all people over 30 years old and all males": (<- [?p] [(age ?p ?a) (> ?a 30)] [(gender ?p "m")])
Recursion, e.g. "all ancestry relations": (<- [?a ?p] [(parent ?a ?p)] [(parent ?a ?p2) (recur ?p2 ?p)])

Acknowledgements

YourKit is kindly supporting open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler.

Cascalog is based off of a very early branch of cascading-clojure project (http://github.com/clj-sys/cascading-clojure). Special thanks to Bradford Cross and Mark McGranaghan for their work on that project. Much of that code appears within Cascalog in either its original form or a modified form.
