forked from cwensel/cascading
/
CHANGES.txt
181 lines (108 loc) · 8.61 KB
/
CHANGES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
Cascading Change Log
0.6.0
Changed default argument selector for c.p.Every to be Fields.ALL, to be consistent with the default value of c.p.Each.
Added support for assembly traps. If an exception is thrown from inside an c.o.Operation, the offending Tuple
can be saved to a file for later processing, allowing the job to complete.
Added support for stream assertions. STRICT and VALID assertions can be built into a pipe assembly, and optionally
planned out during runtime. Assertions will throw exceptions if they fail.
Changed c.o.a.First, Last, Min, and Max to optionally ignore specified values. Useful if you do not wish
for a 'default' value to be considered first, or last in a set.
Changed c.o.a.Sum to take a Class for coercion of the result value.
Changes c.o.Max and Min to use infinity as initial values so zero is bigger than a really small number
for Max, and zero is smaller than a really big number for Min.
Changed order of JobConf initialization. c.f.FlowStep now is added to the JobConf last in order to catch
all lazily configured values.
Changed compile to include debug info by default.
Fixed bug in c.t.MultiTap where super scheme was not returned if available.
0.5.0
Added skipIfSinkExists property to c.f.Flow. Set to true if the c.c.Cascade should skip the Flow instance even
if the sink is stale and not set to be deleted on initialization.
Fixed bug in c.t.h.HttpFileSystem that URL escaped the ? prefixing the query string.
Fixed bug where a join with duplicate taps was not recognized during job planning. Now an appropriate error
message is displayed, instead of jobs completing with only one instance of the resource stream.
Fixed c.t.h.HttpFileSystem to remember authority information in the url and prefix it when missing.
Changed c.s.TextLine to accept either on or two source fields. If one, only the 'line' value
is sourced from the value, discarding the 'offset' value.
Added c.o.r.RegexSplitGenerator to support splitting single tuple values into multiple tuples based on a regex
delimiter. Includes new tests.
Added c.s.CascadeStats and c.s.FlowStats to provide access to current state and statistics of particular
Cascade, Flow, or the child Flows of a Cascade.
Added ability to sort grouping values with sort argument on c.p.GroupBy. Sorts can be reversed.
Added c.o.e.ExpressionFilter, the c.o.Filter analog to c.o.e.ExpressionFunction.
0.4.1
Fixed path normalization regex in c.u.Util where it munged any path starting with file:///.
0.4.0
Changed c.p.GroupBy default grouping fields to c.t.Fields.ALL from Fields.FIRST. This change provides a simple
way to sort a tuple stream based on the order of the tuple fields.
Changed c.f.FlowConnector to create c.f.Flow instances that will bypass the reducer if no c.p.Group is participating
in the assembly. Previoiusly Group instances were inserted if missing. This allows a chain of c.p.Every instances
to be used to process/filter a tuple stream without the invoking the reducer needlessly (if a sort isn't required).
This change also supports bypassing the default Hadoop OutputCollector in the mapper via the sink c.t.Tap instance.
Changed c.f.FlowStep behavior to run in 'local' mode if either the sink or source tap is a c.t.Lfs instance. This
allows for c.f.Flow instances to run mixed if configured to execute on a particular cluster by default. This behavior
supports complex import/export processes against the HDFS or other supported remote filesystem.
Changed behavior of c.t.Dfs to force use of HDFS. Previously Dfs would default to the local FileSystem
if the job was run in 'local'mode. Now a Dfs instance will cause failures if it cannot connect to a HDFS cluster.
Using c.t.Hfs will provide previous Dfs behavior. Hfs will use the 'default' filesystem if a scheme is not present
in the 'stringPath' (i.e. hdfs://host:port/some/path).
Added c.stats package to allow for collecting statics of Cascades, Flows, and FlowSteps.
Updated c.f.Flow and c.c.Cascade log messages to be easier to follow when executing many flow instances
simultaneously.
Added compression flag to c.s.TextLine. Can now toggle compression (Hadoop style compression) per Tap instance.
This prevents clusters with compression enabled by default to export text files with a .deflate extension.
Added support for bypassing Hadoop OutputCollector via Tap.setUseTapCollector() method. Setting to true will force
Cascading to use the c.t.TapCollector instead. This bypasses bugs in Hadoop with custom FileSystem types. This will
always be true for http(s) and s3tp filesystems when using a c.t.Hfs Tap type (atleast until HADOOP-3021 is resolved).
Added c.t.TupleCollector, complementing c.t.TupleIterator, for directly writing Tuple instances out via a c.t.Tap
instance.
Added c.f.FlowListener so that c.f.Flow instances can fire events on starting, completed, and throwable.
Changed c.t.h.S3HttpFileSystem so it can now create files remotely.
Renamed cascading.spill.threshold to cascading.cogroup.spill.threshold, so there is less a chance of collision.
Made numerous optimizations to improve overall performance. Namely split and merge of key/value tuples to remove
redundancy in the stream between the mapper and reducer.
Changed c.p.Operators to push c.o.Operation results directly through to next operation without intermediate
collection. This should improve pipelining of large result streams and lower runtime memory footprint.
Changed c.c.Cascade so it now runs Flows in parallel if Hadoop is clustered, and there are no dependencies between the
Flows.
Moved c.Cascade and related classed to c.cascade package. Wanted to preempt any future ugliness.
Added support in c.t.h.S3HttpFileSystem for these properties: fs.s3tp.awsAccessKeyId and fs.s3tp.awsSecretAccessKey
0.3.0
Added ability to push Log4j logger properties to mapper/reducer via JobConf.
Use jobConf.set("log4j.logger","logger1=LEVEL,logger2=LEVEL")
Added missing equals() and hashCode() in c.t.MultiTap.
Added c.t.h.ZipInputFormat (and ZipSplit) to support zip files. c.s.TextLine supports transparent
reading of zip files if the filename ends with .zip, but cannot write to them. This code is
loosely based on HADOOP-1824. If the underlying filesystem is hdfs or file, splits will be created
for each ZipEntry. Otherwise ZipEntries are iterated over to be more stream friendly. Progress status is
supported.
Added http, https, and s3tp read-only file systems to Hadoop. Use these URLs, respectively:
http://, https://, and s3tp://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@bucket-name/key
Added c.o.t.DateFormatter supporting text formatting of time stamps created by c.o.t.DateParser.
Fixed bug where in complex assemblies, some Scopes were not resolved.
Fixed bug where tap instances were not being inserted before some CoGroup joins if there was a previous Group in the
assembly.
Upgraded JGraphT to 0.7.3
Changed c.t.SpillableTupleList allows for iteration across entries.
Changed c.f.FlowException to optionally allow for printing of underlying pipe graph for debugging.
Added c.o.t.FieldFormatter function to format Tuples into complex strings using j.u.Formatter formatting.
Added c.o.a.Last aggregator to find the last value encountered in a group.
Changed c.o.a.Max and c.o.a.Min to maintain original value type. Will return null if no values are encountered.
Changed c.o.a.First to use Fields.ARG by default. Removed Fields constructor.
Added c.t.Fields.join(Fields...) method to allow for joining multiple Fields instances into a new instance.
Can retrieve Tuple values by field name through the TupleEntry class via the get(String) method.
Added c.t.TupleCollector interface to simplify the operation interfaces.
Added a Debug filter that will print to either stderr or stdout. Useful for debugging stream transformations.
Added CascadingTestCase base test class
Added Insert Function that allows for literal values to be inserted into the Tuple stream.
0.2.0
CoGroup will now spill to disk on extremely large co-groupings. Configurable via "cascading.spill.threshold".
Defaults to 10k elements.
java.util.Properties instances can be used to set defauls for FlowConnectors.
Fix for InnerJoin, the default join for CoGroup.
Introduced MultiTap to support concatenation of files into a pipe assembly.
RegexParser now fails on a failed match. Prevents it being used or behaving as a filter.
Fixed bug with PipeAssembly instances not properly being assimiliated into the pipeGraph.
Fixed assertion error thrown by JGraphT.
Renamed Tap method deleteOnInit to deleteOnSinkInit.
0.1.0
First release.