CHANGES.txt

Cascading Change Log

0.6.0

  Changed default argument selector for c.p.Every to be Fields.ALL, to be consistent with the default value of c.p.Each.

  Added support for assembly traps. If an exception is thrown from inside an c.o.Operation, the offending Tuple
  can be saved to a file for later processing, allowing the job to complete.

  Added support for stream assertions. STRICT and VALID assertions can be built into a pipe assembly, and optionally
  planned out during runtime. Assertions will throw exceptions if they fail.

  Changed c.o.a.First, Last, Min, and Max to optionally ignore specified values. Useful if you do not wish
  for a 'default' value to be considered first, or last in a set.

  Changed c.o.a.Sum to take a Class for coercion of the result value.

  Changes c.o.Max and Min to use infinity as initial values so zero is bigger than a really small number
  for Max, and zero is smaller than a really big number for Min.

  Changed order of JobConf initialization. c.f.FlowStep now is added to the JobConf last in order to catch
  all lazily configured values.

  Changed compile to include debug info by default.

  Fixed bug in c.t.MultiTap where super scheme was not returned if available.

0.5.0

  Added skipIfSinkExists property to c.f.Flow. Set to true if the c.c.Cascade should skip the Flow instance even
  if the sink is stale and not set to be deleted on initialization.

  Fixed bug in c.t.h.HttpFileSystem that URL escaped the ? prefixing the query string.

  Fixed bug where a join with duplicate taps was not recognized during job planning. Now an appropriate error
  message is displayed, instead of jobs completing with only one instance of the resource stream.

  Fixed c.t.h.HttpFileSystem to remember authority information in the url and prefix it when missing.

  Changed c.s.TextLine to accept either on or two source fields. If one, only the 'line' value
  is sourced from the value, discarding the 'offset' value.

  Added c.o.r.RegexSplitGenerator to support splitting single tuple values into multiple tuples based on a regex
  delimiter. Includes new tests.

  Added c.s.CascadeStats and c.s.FlowStats to provide access to current state and statistics of particular
  Cascade, Flow, or the child Flows of a Cascade.

  Added ability to sort grouping values with sort argument on c.p.GroupBy. Sorts can be reversed.

  Added c.o.e.ExpressionFilter, the c.o.Filter analog to c.o.e.ExpressionFunction.

0.4.1

  Fixed path normalization regex in c.u.Util where it munged any path starting with file:///.

0.4.0

  Changed c.p.GroupBy default grouping fields to c.t.Fields.ALL from Fields.FIRST. This change provides a simple
  way to sort a tuple stream based on the order of the tuple fields.

  Changed c.f.FlowConnector to create c.f.Flow instances that will bypass the reducer if no c.p.Group is participating
  in the assembly. Previoiusly Group instances were inserted if missing. This allows a chain of c.p.Every instances
  to be used to process/filter a tuple stream without the invoking the reducer needlessly (if a sort isn't required).
  This change also supports bypassing the default Hadoop OutputCollector in the mapper via the sink c.t.Tap instance.

  Changed c.f.FlowStep behavior to run in 'local' mode if either the sink or source tap is a c.t.Lfs instance. This
  allows for c.f.Flow instances to run mixed if configured to execute on a particular cluster by default. This behavior
  supports complex import/export processes against the HDFS or other supported remote filesystem.

  Changed behavior of c.t.Dfs to force use of HDFS. Previously Dfs would default to the local FileSystem
  if the job was run in 'local'mode. Now a Dfs instance will cause failures if it cannot connect to a HDFS cluster.
  Using c.t.Hfs will provide previous Dfs behavior. Hfs will use the 'default' filesystem if a scheme is not present
  in the 'stringPath' (i.e. hdfs://host:port/some/path).

  Added c.stats package to allow for collecting statics of Cascades, Flows, and FlowSteps.

  Updated c.f.Flow and c.c.Cascade log messages to be easier to follow when executing many flow instances
  simultaneously.

  Added compression flag to c.s.TextLine. Can now toggle compression (Hadoop style compression) per Tap instance.
  This prevents clusters with compression enabled by default to export text files with a .deflate extension.

  Added support for bypassing Hadoop OutputCollector via Tap.setUseTapCollector() method. Setting to true will force
  Cascading to use the c.t.TapCollector instead. This bypasses bugs in Hadoop with custom FileSystem types. This will
  always be true for http(s) and s3tp filesystems when using a c.t.Hfs Tap type (atleast until HADOOP-3021 is resolved).

  Added c.t.TupleCollector, complementing c.t.TupleIterator, for directly writing Tuple instances out via a c.t.Tap
  instance.

  Added c.f.FlowListener so that c.f.Flow instances can fire events on starting, completed, and throwable.

  Changed c.t.h.S3HttpFileSystem so it can now create files remotely.

  Renamed cascading.spill.threshold to cascading.cogroup.spill.threshold, so there is less a chance of collision.

  Made numerous optimizations to improve overall performance. Namely split and merge of key/value tuples to remove
  redundancy in the stream between the mapper and reducer.

  Changed c.p.Operators to push c.o.Operation results directly through to next operation without intermediate
  collection. This should improve pipelining of large result streams and lower runtime memory footprint.

  Changed c.c.Cascade so it now runs Flows in parallel if Hadoop is clustered, and there are no dependencies between the
  Flows.

  Moved c.Cascade and related classed to c.cascade package. Wanted to preempt any future ugliness.

  Added support in c.t.h.S3HttpFileSystem for these properties: fs.s3tp.awsAccessKeyId and fs.s3tp.awsSecretAccessKey

0.3.0

  Added ability to push Log4j logger properties to mapper/reducer via JobConf.
  Use jobConf.set("log4j.logger","logger1=LEVEL,logger2=LEVEL")

  Added missing equals() and hashCode() in c.t.MultiTap.

  Added c.t.h.ZipInputFormat (and ZipSplit) to support zip files. c.s.TextLine supports transparent
  reading of zip files if the filename ends with .zip, but cannot write to them. This code is
  loosely based on HADOOP-1824. If the underlying filesystem is hdfs or file, splits will be created
  for each ZipEntry. Otherwise ZipEntries are iterated over to be more stream friendly. Progress status is
  supported.

  Added http, https, and s3tp read-only file systems to Hadoop. Use these URLs, respectively:
  http://, https://, and s3tp://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@bucket-name/key

  Added c.o.t.DateFormatter supporting text formatting of time stamps created by c.o.t.DateParser.

  Fixed bug where in complex assemblies, some Scopes were not resolved.

  Fixed bug where tap instances were not being inserted before some CoGroup joins if there was a previous Group in the
  assembly.

  Upgraded JGraphT to 0.7.3

  Changed c.t.SpillableTupleList allows for iteration across entries.

  Changed c.f.FlowException to optionally allow for printing of underlying pipe graph for debugging.

  Added c.o.t.FieldFormatter function to format Tuples into complex strings using j.u.Formatter formatting.

  Added c.o.a.Last aggregator to find the last value encountered in a group.

  Changed c.o.a.Max and c.o.a.Min to maintain original value type. Will return null if no values are encountered.

  Changed c.o.a.First to use Fields.ARG by default. Removed Fields constructor.

  Added c.t.Fields.join(Fields...) method to allow for joining multiple Fields instances into a new instance.

  Can retrieve Tuple values by field name through the TupleEntry class via the get(String) method.

  Added c.t.TupleCollector interface to simplify the operation interfaces.

  Added a Debug filter that will print to either stderr or stdout. Useful for debugging stream transformations.

  Added CascadingTestCase base test class

  Added Insert Function that allows for literal values to be inserted into the Tuple stream.

0.2.0

  CoGroup will now spill to disk on extremely large co-groupings. Configurable via "cascading.spill.threshold".
  Defaults to 10k elements.

  java.util.Properties instances can be used to set defauls for FlowConnectors.

  Fix for InnerJoin, the default join for CoGroup.

  Introduced MultiTap to support concatenation of files into a pipe assembly.

  RegexParser now fails on a failed match. Prevents it being used or behaving as a filter.

  Fixed bug with PipeAssembly instances not properly being assimiliated into the pipeGraph.

  Fixed assertion error thrown by JGraphT.

  Renamed Tap method deleteOnInit to deleteOnSinkInit.


0.1.0

  First release.