
Fundamental data operations

1 parent e24b866 commit da761640e9068314a14f357c04e4974717422fc9 Philip (flip) Kromer committed Dec 20, 2013
Showing with 604 additions and 218 deletions.
  1. +230 −37 00a-about.asciidoc
  2. +3 −3 00a-intro-more_outlines.asciidoc
  3. +168 −27 05d-intro_to_pig.asciidoc
  4. +203 −151 19a-advanced_pig.asciidoc
267 00a-about.asciidoc
@@ -20,56 +20,218 @@ Hello and thanks, courageous and farsighted early released-to'er! We want to mak
This is the plan. We'll roll material out over the next few months. Should we find we need to cut things (I hope not to), I've flagged a few chapters as _(bubble)_.
-1. *First Exploration*: A walkthrough of problem you'd use Hadoop to solve, showing the workflow and thought process. Hadoop asks you to write code poems that compose what we'll call _transforms_ (process records independently) and _pivots_ (restructure data).
-2. *Simple Transform*: Chimpanzee and Elephant are hired to translate the works of Shakespeare to every language; you'll take over the task of translating text to Pig Latin. This is an "embarrasingly parallel" problem, so we can learn the mechanics of launching a job and a coarse understanding of the HDFS without having to think too hard.
+1. *First Exploration*:
+Objective: Show you a thing you couldn’t do without Hadoop and couldn’t do any other way. Your mind should be blown, and when you’re slogging through the data munging chapter you should think back to this and remember why you started this mess in the first place.
+
+A walkthrough of a problem you'd use Hadoop to solve, showing the workflow and thought process. Hadoop asks you to write code poems that compose what we'll call _transforms_ (process records independently) and _pivots_ (restructure data).
+2. *Simple Transform*:
+Chimpanzee and Elephant are hired to translate the works of Shakespeare to every language; you'll take over the task of translating text to Pig Latin. This is an "embarrassingly parallel" problem, so we can learn the mechanics of launching a job and a coarse understanding of the HDFS without having to think too hard.
- Chimpanzee and Elephant start a business
- Pig Latin translation
- Your first job: test at commandline
- Run it on cluster
- Input Splits
- Why Hadoop I: Simple Parallelism
+3. *Transform-Pivot Job*:
-3. *Transform-Pivot Job*: C&E help SantaCorp optimize the Christmas toymaking process, demonstrating the essential problem of data locality (the central challenge of Big Data). We'll follow along with a job requiring map and reduce, and learn a bit more about Wukong (a Ruby-language framework for Hadoop).
+C&E help SantaCorp optimize the Christmas toymaking process, demonstrating the essential problem of data locality (the central challenge of Big Data). We'll follow along with a job requiring map and reduce, and learn a bit more about Wukong (a Ruby-language framework for Hadoop).
- Locality: the central challenge of distributed computing
- The Hadoop Haiku
-4. *First Exploration: Geographic Flavor* pt II
-
-5. *The Hadoop Toolset*
+Note: we will almost certainly re-arrange the content in chapters 1-3.
+They will be centered around the following 3 explorations
+(in pig and wukong)
+- Pig Latin exploration (map only job)
+- Count UFO visits by month
+ - might visit the jobtracker
+- Counting Wikipedia pageviews by hour (or whatever)
+ - should be same as UFO exploration, but:
+ - will actually require Hadoop
+ - also do a total sort at the end
+ - will visit the jobtracker
+
+By this point in the book you should:
+- Have your mind blown
+- See some compelling enough data and a compelling enough question, and a Wukong job that answers that question using only a mapper
+- See some compelling enough data and a compelling enough question that requires a map-and-reduce job, written in both Pig and Wukong
+- believe the mapreduce story, i.e. you know, in general, the high-level conceptual mechanics of a mapreduce job
+
+You should have:
+- whimsical & concrete explanations of mapreduce, what’s happening as a job is born and run, and HDFS
+
+You shouldn’t have:
+- a lot of words in Pig
+- a good feel for how to deftly use Wukong yet
+5. *The Hadoop Toolset and Other Practical Matters*
- toolset overview
+- It’s a necessarily polyglot sport
+- Pig is a language that excels at describing dataflows
+- we think you are doing it wrong if you are not using:
+- a declarative orchestration language, and a high-level scripting language for the dirty stuff (e.g. parsing, contacting external APIs, etc.)
+- UDFs (without saying UDFs) are for accessing a Java-native library (e.g. geospatial libraries), for when you really care about performance, to gift Pig with a new ability, for custom loaders, etc.
+- there are a lot of tools, and they all have merits: Hive, Pig, Cascading, Scalding, Wukong, MrJob, R, Julia (with your eyes open), Crunch. There aren’t others that we would recommend for production use, although we see enough momentum from Impala and Spark that you can adopt them with confidence that they will mature.
- launching and debugging jobs
- - overview of wukong
- - overview of pig
+ - overview of Wukong
+ - overview of Pig
+
+6. *Fundamental Data Operations in Hadoop*
+Here’s the stuff you’d like to be able to do with data, in Wukong and in Pig:
+ - Foreach/filter operations (messing around inside a record)
+ - reading data (brief practical directions on the level of “this is what you type in”)
+ - pig schema
+ - wukong model
+ - loading TSV
+- loading generic JSON
+- storing JSON
+- loading schematized JSON
+- loading parquet or Trevni
+- (Reference the section on working with compressed files; call back to the points about splitability and performance/size tradeoffs)
+- TSV, JSON, not XML; Protobufs, Thrift, Avro; Trevni, Parquet; Sequence Files; HAR
+- compression: gz, bz2, snappy, LZO
+ - subsetting your data
+- limit
+- filter
+- sample
+- using a hash digest function to take a signature
+- top k and reservoir sampling
+- refer to subuniverse which is probably elsewhere
+- group
+- join
+- ??cogroup?? (does this go with group? Does it go anywhere?)
+- sort, etc.: cross, cube
+- total sort
+- partitioner
+- basic UDFs
+- ?using ruby or python within a pig dataflow?
+7. *Filesystem Mojo and `cat` herding*
-6. *Filesystem Mojo and `cat` herding*
- dumping, listing, moving and manipulating files on the HDFS and local filesystems
- total sort
- transformations from the commandline (grep, cut, wc, etc)
- pivots from the commandline (head, sort, etc)
- commandline workflow tips
- advanced hadoop filesystem (chmod, setrep, fsck)
-7. *Intro to Storm+Trident*
+8. *Intro to Storm+Trident*
-8. *Statistics*:
+9. *Statistics*:
- (this is first deep experience with Storm+Trident)
- Summarizing: Averages, Percentiles, and Normalization
+ - running / windowed stream summaries
+ - make a "SummarizingTap" trident operation that collects {Sum Count Min Max Avg Stddev SomeExampleValuesReservoirSampled} (fill in the details of what exactly this means)
+ - also, maybe: Median+Deciles, Histogram
+ - understand the flow of data going on in preparing such an aggregate, by either making sure the mechanics of working with Trident don't overwhelm that or by retracing the story of records in an aggregation
+ - you need a group operation -> everything in a group goes to exactly one executor on exactly one machine, and the aggregator sees everything in that group
+ - combiner-aggregators (in particular) do some aggregation beforehand and send an intermediate aggregation to the executor that hosts the group operation
+ - by default, always use persistent aggregate until we find out why you wouldn’t
+
+ - (BUBBLE) highlight the corresponding map/reduce dataflow and illuminate the connection
+ - (BUBBLE) Median / calculation of quantiles at large enough scale that doing so is hard
+ - (in next chapter we can do histogram)
+ - Use a sketching algorithm to get an approximate but distributed answer to a holistic aggregation problem eg most frequent elements
+ - Rolling timeseries averages
- Sampling responsibly: it's harder and more important than you think
+ - consistent sampling using hashing
+ - don’t use an RNG
+ - appreciate that external data sources may have changed
+ - reservoir sampling
+ - connectivity sampling (BUBBLE)
+ - subuniverse sampling (LOC?)
- Statistical aggregates and the danger of large numbers
+ - numerical stability
+ - overflow/underflow
+ - working with distributions at scale
+ - your intuition is often incomplete
+ - with trillions of things, one-in-a-billion events happen thousands of times
-9. *Time Series and Event Log Processing*:
- - Parsing logs and using regular expressions with Hadoop
- - Histograms and time series of pageviews
- - Geolocate visitors based on IP (with Storm+Trident)
- - (Ab)Using Hadoop to stress-test your web server
+ - weather temperature histogram in streaming fashion
-10. *Geographic Data*:
- - Spatial join (find all UFO sightings near Airports)
+ - approximate distinct counts (using HyperLogLog)
+ - approximate percentiles (based on quantile digest)
-11. *Conceptual Model for Data Analysis*
+10. *Time Series and Event Log Processing*:
+ - Parsing logs and using regular expressions with Hadoop
+ - logline model
+ - regexp to match lines, highlighting this as a parser pattern
+ - reinforce the source blob -> source model -> domain model practice
+ - Histograms and time series of pageviews using Hadoop
+ -
+ - sessionizing
+ - flow chart throughout site?
+ - "n-views": pages viewed in sequence
+ - ?? Audience metrics:
+ - make sure that this serves the later chapter with the live recommender engine (lambda architecture)
+ - Geolocate visitors based on IP with Hadoop
+ - use World Cup data?
+ - demonstrate using lookup table,
+ - explain it as a range query
+ - use a mapper-only (replicated) join -- explain why using that (small with big) but don't explain what it's doing (will be covered later)
+ - (Ab)Using Hadoop to stress-test your web server
-12. *Data Munging (Semi-Structured Data)*: The dirty art of data munging. It's a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We'll show you street-fighting tactics that lessen the time and pain. Along the way, we'll prepare the datasets to be used throughout the book:
+Exercise: what predicts the team a country will root for next? In particular: if, say, Mexico knocks out Greece, do Greeks root for, or against, Mexico in general?
+
+11. *Geographic Data*:
+- Spatial join (find all UFO sightings near airports): points with points
+  - map points to grid cells in the mapper
+  - truncate at a certain zoom level (explain how to choose the zoom level)
+  - must send points to reducers for their own grid key and also its neighbors (9 total squares)
+  - Perhaps be clever about not having to use all 9 quad-grid neighbors by partitioning on a grid size more fine-grained than your original one, and then use that to send points only to the pertinent grid-cell reducers
+  - Perhaps generate the four points that are x away from you and use their quad cells
+  - In the reducer, do point-by-point comparisons
+  - *Maybe* a secondary sort???
+- Geospatial data model, i.e. the terms and fields that you use in, e.g., GeoJSON
+  - We choose X; we want the focus to be on data science, not on GIS
+  - Still have to explain ‘feature’, ‘region’, ‘latitude’, ‘longitude’, etc.
+- Decomposing a map into a quad-cell mapping at a constant zoom level
+  - mapper input: [name of region, GeoJSON region boundary]
+  - Goal 1: have a mapping from region -> quad cells it covers
+  - Goal 2: have a mapping from quad key to the partial GeoJSON objects on it
+  - mapper output: [thing, quadkey]; [quadkey, list of region ids, hash of region ids to GeoJSON region boundaries]
+- Spatial join of points with regions, e.g. what congressional district are you in?
+  - in the mapper for points, emit the truncated quad key and the rest of the quad key; just stream the regions through (with the result from the prior exploration)
+  - a reducer has a quadcell, all points that lie within that quadcell, and all (truncated) regions that lie on that quadcell; do a brute-force search for the regions that the points lie in
+- Nearness query
+  - suppose the set of items you want to find nearness to is not huge
+  - produce the Voronoi diagrams
+- Decomposing a map into a quad-cell mapping at multiple zoom levels
+  - in particular, use Voronoi regions to show multi-scale decomposition
+  - Re-do the spatial join with Voronoi cells in multi-scale fashion (fill in details later)
+  - Framing the problem (NYC vs the Pacific Ocean)
+  - Discuss how, given a global set of features, to decompose into a multi-scale grid representation
+- Other mechanics of working with geo data
+
+12. *Conceptual Model for Data Analysis*
+
+See bottom
+13. *Data Munging (Semi-Structured Data)*:
+
+Welcome to chapter 13. This is some f'real shit, yo.
+
+Wiki pageviews - String encoding and other bullshit
+Airport data - Reconciling two *mostly*-agreeing datasets
+Something that has errors (SW Kid) - dealing with bad records
+Weather Data - Parsing a flat pack file
+ - bear witness; explain that you DID have to temporarily become an amateur meteorologist and had to write code to work with that many fields.
+- when your schema is so complicated, it needs to be automated, too.
+- join hell, when your keys change over time
+
+Data formats
+ - airing of grievances on XML
+ - airing of grievances on CSV
+ - don’t quote, escape
+ - the only 3 formats you should use, and when to use them
+
+- Just do a data munging project from beginning to end that wasn’t too horrible
+- Talk about the specific strategies and tactics
+ - source blob to source domain object, source domain object to business object. E.g., you want your initial extraction to go into a model that closely mirrors the source domain data format, mainly because you do not want to mix your extraction logic and business logic (extraction logic will pollute the business object code). Also, you will end up building the wrong model for the business object, i.e. it will look like the source domain.
+
+Airport data - chief challenge is reconciling data sets, dealing with conflicting errors
+
+The dirty art of data munging. It's a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We'll show you street-fighting tactics that lessen the time and pain. Along the way, we'll prepare the datasets to be used throughout the book:
- Wikipedia Articles: Every English-language article (12 million) from Wikipedia.
- Wikipedia Pageviews: Hour-by-hour counts of pageviews for every Wikipedia article since 2007.
- US Commercial Airline Flights: every commercial airline flight since 1987
@@ -81,45 +243,76 @@ This is the plan. We'll roll material out over the next few months. Should we fi
- Logistic Regression
- Random Forest (using Mahout)
We'll equip you with a picture of how they work, but won't go into the math of how or why. We will show you how to choose a method, and how to cheat to win.
+14. *Full Application: Regional Flavor*
-14. *Hadoop Native Java API*
+15. *Hadoop Native Java API*
- don't
-15. *Advanced Pig*
+16. *Advanced Pig*
- Specialized joins that can dramatically speed up (or make feasible) your data transformations
- - Basic UDF
- why algebraic UDFs are awesome and how to be algebraic
- Custom Loaders
- Performance efficiency and tunables
+ - using a filter after a cogroup will get pushed up by Pig, sez Jacob
+
-16. *Data Modeling for HBase-style Database*
+17. *Data Modeling for HBase-style Database*
17. *Hadoop Internals*
+
- What happens when a job is launched
- A shallow dive into the HDFS
+===== HDFS
+
+Lifecycle of a File:
+
+* What happens as the Namenode and Datanode collaborate to create a new file.
+* How that file is replicated to and acknowledged by other Datanodes.
+* What happens when a Datanode goes down or the cluster is rebalanced.
+* Briefly, the S3 DFS facade // (TODO: check if HFS?).
+
+===== Hadoop Job Execution
+
+* Lifecycle of a job at the client level including figuring out where all the source data is; figuring out how to split it; sending the code to the JobTracker, then tracking it to completion.
+* How the JobTracker and TaskTracker cooperate to run your job, including: The distinction between Job, Task and Attempt., how each TaskTracker obtains its Attempts, and dispatches progress and metrics back to the JobTracker, how Attempts are scheduled, including what happens when an Attempt fails and speculative execution, ________, Split.
+* How the TaskTracker child and Datanode cooperate to execute an Attempt, including: what a child process is, making clear the distinction between TaskTracker and child process.
+* Briefly, how the Hadoop Streaming child process works.
+
+==== Skeleton: Map-Reduce Internals
+
+* How the mapper and Datanode handle record splitting and how and when the partial records are dispatched.
+* The mapper sort buffer and spilling to disk (maybe here or maybe later, `io.sort.record.percent`).
+* Briefly note that data is not sent from mapper-to-reducer using HDFS and so you should pay attention to where you put the Map-Reduce scratch space and how stupid it is about handling an overflow volume.
+* Briefly that combiners are a thing.
+* Briefly how records are partitioned to reducers and that custom partitioners are a thing.
+* How the Reducer accepts and tracks its mapper outputs.
+* Details of the merge/sort (shuffle and sort), including the relevant buffers and flush policies and why it can skip the last merge phase.
+* (NOTE: Secondary sort and so forth will have been described earlier.)
+* Delivery of output data to the HDFS and commit whether from mapper or reducer.
+* Highlight the fragmentation problem with map-only jobs.
+* Where memory is used, in particular, mapper-sort buffers, both kinds of reducer-merge buffers, application internal buffers.
+
18. *Hadoop Tuning*
- Tuning for the Wise and Lazy
- Tuning for the Brave and Foolish
- The USE Method for understanding performance and diagnosing problems
19. *Storm+Trident Internals*
+* Understand the lifecycle of a Storm tuple, including spout, tupletree and acking.
+* (Optional but not essential) Understand the details of its reliability mechanism and how tuples are acked.
+* Understand the lifecycle of partitions within a Trident batch and thus, the context behind partition operations such as Apply or PartitionPersist.
+* Understand Trident’s transactional mechanism, in the case of a PartitionPersist.
+* Understand how Aggregators, Statemap and the Persistence methods combine to give you _exactly once_ processing with transactional guarantees. Specifically, what an OpaqueValue record will look like in the database and why.
+* Understand how the master batch coordinator and spout coordinator for the Kafka spout in particular work together to uniquely and efficiently process all records in a Kafka topic.
+* One specific: how Kafka partitions relate to Trident partitions.
+
+
20. *Storm+Trident Tuning*
-21. *Appendix: Overview of Datasets and Scripts*
- - Datasets
- - Wikipedia (corpus, pagelinks, pageviews, dbpedia, geolocations)
- - Airline Flights
- - UFO Sightings
- - Global Hourly Weather
- - Waxy.org "Star Wars Kid" Weblogs
- - Scripts
-
-22. *Appendix: Cheatsheets*:
- - Regular Expressions
- - Sizes of the Universe
- - Hadoop Tuning & Configuration Variables
+
+
Chopping block
6 00a-intro-more_outlines.asciidoc
@@ -26,8 +26,8 @@ What should you take away from this chapter:
You should:
-* Understand the lifecycle of a Storm tuple, including spout, tupletree and acking.
-* (Optional but not essential) Understand the details of its reliability mechanism and how tuples are acked.
+* Understand the lifecycle of a Storm tuple, including spout, tupletree and acking.
+* (Optional but not essential) Understand the details of its reliability mechanism and how tuples are acked.
* Understand the lifecycle of partitions within a Trident batch and thus, the context behind partition operations such as Apply or PartitionPersist.
* Understand Trident’s transactional mechanism, in the case of a PartitionPersist.
* Understand how Aggregators, Statemap and the Persistence methods combine to give you _exactly once_ processing with transactional guarantees. Specifically, what an OpaqueValue record will look like in the database and why.
@@ -74,7 +74,7 @@ Lifecycle of a File:
* Explain using the example of weblogs. highlighting strict order and partial order.
* In the frequent case, the sequence order only somewhat corresponds to one of the horizon keys. There are several types of somewhat ordered streams: block disorder, bounded band disorder, band disorder. When those conditions hold, you can use windows to recover the power you have with ordered streams -- often, without having to order the stream.
* Unbounded band disorder only allows "convergent truth" aggregators. If you have no idea when or whether that some additional record from a horizon group might show up, then you can’t treat your aggregation as anything but a best possible guess at the truth.
-* However, what the limited disorder does get you, is the ability to efficiently cache aggregations from a practically infinite backing data store.
+* However, what the limited disorder does get you, is the ability to efficiently cache aggregations from a practically infinite backing data store.
* With bounded band or block disorder, you can perform accumulator-style aggregations.
* How to, with the abstraction of an infinite sorting buffer or an infinite binning buffer, efficiently re-present the stream as one where sequence order and horizon key directly correspond.
* Re-explain the Hadoop Map-Reduce algorithm in this window+horizon model.
195 05d-intro_to_pig.asciidoc
@@ -1,28 +1,169 @@
-------
-stream do |article|
- words = Wukong::TextUtils.tokenize(article.text, remove_stopwords: true)
- words.group_by(&:to_s).map{|word, occurs|
- yield [article.id, word, occurs.count]
- end
-end
-------
-
-Reading it as prose the script says "for each article: break it into a list of words; group all occurrences of each word and count them; then output the article id, word and count."
-
-.Snippet from the Wikipedia article on "Barbecue"
-[quote, wikipedia, http://en.wikipedia.org/wiki/Barbeque]
-____
-Each Southern locale has its own particular variety of barbecue, particularly concerning the sauce. North Carolina sauces vary by region; eastern North Carolina uses a vinegar-based sauce, the center of the state enjoys Lexington-style barbecue which uses a combination of ketchup and vinegar as their base, and western North Carolina uses a heavier ketchup base. Lexington boasts of being "The Barbecue Capital of the World" and it has more than one BBQ restaurant per 1,000 residents. In much of the world outside of the American South, barbecue has a close association with Texas. Many barbecue restaurants outside the United States claim to serve "Texas barbecue", regardless of the style they actually serve. Texas barbecue is often assumed to be primarily beef. This assumption, along with the inclusive term "Texas barbecue", is an oversimplification. Texas has four main styles, all with different flavors, different cooking methods, different ingredients, and different cultural origins. In the June 2008 issue of Texas Monthly Magazine Snow's BBQ in Lexington was rated as the best BBQ in the state of Texas. This ranking was reinforced when New Yorker Magazine also claimed that Snow's BBQ was "The Best Texas BBQ in the World".
-____
-
-The output of the first stage
-
-----
-37135 texas 8
-37135 barbecue 8
-37135 bbq 5
-37135 different 4
-37135 lexington 3
-37135 north 3
-37135 carolina 3
+TODO: Intro
+* What we want the reader to have learned
+* Call forward to the section on advanced Pig
+
+=== Overview of Pig
+
+Pig is an open-source, high-level language that enables you to create efficient Map/Reduce jobs using clear, maintainable scripts. Its interface is similar to SQL, which makes it a great choice for folks with significant experience there. (It’s not identical, though, and things that are efficient in SQL may not be so in Pig; we will try to highlight those traps.)
+
+Let’s dive in with an example using the UFO dataset to estimate whether aliens tend to visit in some months more than others:
+
+----
+PARALLEL 1; (USE 1 REDUCER) (DISABLE COMBINERS)
+LOAD UFO DATASET
+EXTRACT MONTH FROM EACH LINE
+GROUP ON MONTHS
+COUNT WITHIN GROUPS
+STORE INTO OUTPUT FILE
+----
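+
+To make that concrete, here is a sketch of what such a script could look like in actual Pig Latin. Treat it as illustrative only: the input path, output path and field layout (a tab-separated file whose first field is a `YYYYMMDD`-style `sighted_at` date) are assumptions, not the book's actual `monthly_visit_counts.pig`:
+
+----
+SET default_parallel 1;                        -- one reducer, so the output lands in a single file
+-- SET pig.exec.nocombiner true;               -- uncomment to disable combiners, as the outline suggests
+sightings = LOAD '/data/ufo_sightings.tsv' AS (
+              sighted_at:chararray, reported_at:chararray, location:chararray,
+              shape:chararray, duration:chararray, description:chararray);
+months    = FOREACH sightings GENERATE SUBSTRING(sighted_at, 4, 6) AS month;    -- extract the month
+by_month  = GROUP months BY month;                                              -- group on months
+counts    = FOREACH by_month GENERATE group AS month, COUNT(months) AS visits;  -- count within groups
+STORE counts INTO '/dataresults/monthly_visit_counts-pig';
+----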
+
+In a Wukong script or traditional Hadoop job, the focus is on the record and you’re best off thinking in terms of message passing or grouping. In Pig, the focus is much more on the structure and you should think in terms of relational and set operations. In the example above, each line described an operation on the full dataset; we declared what change to make and Pig, as you’ll see, executes those changes by dynamically assembling and running a set of Map/Reduce jobs.
+
+Here’s what you might write in Wukong to answer the same question:
+
+----
+DEFINE MODEL FOR INPUT RECORDS
+MAPPER EXTRACTS MONTHS, EMITS MONTH AS KEY WITH NO VALUE
+COUNTING REDUCER INCREMENTS ON EACH ENTRY IN GROUP AND EMITS TOTAL IN FINALIZED METHOD
+----
+
+Did you notice, by the way, that in both cases, the output was sorted? That is no coincidence -- as you saw in Chapter (TODO: REF), Hadoop sorted the results in order to group them.
+
+To run the Pig job, go into the ‘EXAMPLES/UFO’ directory and run
+
+----
+pig monthly_visit_counts.pig /data_UFO_sightings.tsv /dataresults monthly_visit_counts-pig.tsv
+----
+
+To run the Wukong job, go into the (TODO: REF) directory and run
+
+----
+wu-run monthly_visit_counts.rb --reducers_count=1 /data_UFO_sightings.tsv /dataresults monthly_visit_counts-wu.tsv
+----
+
+If you consult the output, you’ll see (TODO:CODE: INSERT CONCLUSIONS).
+
+If you consult the Job Tracker Console, you should see a single Map/Reduce for each with effectively similar statistics; the dataflow Pig instructed Hadoop to run is essentially similar to the Wukong script you ran. What Pig ran was, in all respects, a Hadoop job. It calls on some of Hadoop’s advanced features to help it operate but nothing you could not access through the standard Java API.
+
+==== Wikipedia Visitor Counts
+
+Let’s put Pig to a sterner test. Here’s the script above, modified to run on the much-larger Wikipedia dataset and to assemble counts by hour, not month:
+
+----
+LOAD SOURCE FILE
+PARALLEL 3
+TURN RECORD INTO HOUR PART OF TIMESTAMP AND COUNT
+GROUP BY HOUR
+SUM THE COUNTS BY HOUR
+ORDER THE RESULTS BY HOUR
+STORE INTO FILE
+----
+
+(TODO: If you do an order and then group, is Pig smart enough to not add an extra REDUCE stage?)
+
+Run the script just as you did above:
+
+----
+(TODO: command to run the script)
+----
+
+Up until now, we have described Pig as authoring the same Map/Reduce job you would. In fact, Pig has automatically introduced the same optimizations an advanced practitioner would have introduced, but with no effort on your part. If you compare the Job Tracker Console output for this Pig job with the earlier ones, you’ll see that, although x bytes were read by the Mapper, only y bytes were output. Pig instructed Hadoop to use a Combiner. In the naive Wukong job, every Mapper output record was sent across the network to the Reducer but in Hadoop, as you will recall from (TODO: REF), the Mapper output files have already been partitioned and sorted. Hadoop offers you the opportunity to do pre-Aggregation on those groups. Rather than send every record for, say, August 8, 2008 8 pm, the Combiner outputs the hour and sum of visits emitted by the Mapper.
+
+----
+SIDEBAR: You can write Combiners in Wukong, too. (TODO:CODE: Insert example with Combiners)
----
+
+You’ll notice that, in the second script, we introduced the additional operation of instructing Pig to explicitly sort the output by hour. We did not do that in the first example because its data was so small that we had instructed Hadoop to use a single Reducer. As you will recall from (TODO: REF), Hadoop uses a Sort to prepare the Reducer groups, so its output was naturally ordered. If there are multiple Reducers, however, that alone is not enough to give you a result file you can treat as ordered. By default, Hadoop assigns keys to Reducers using the ‘HashPartitioner’, designed to give each Reducer a roughly uniform share of the keys. This defends against the problem of one Reducer becoming overwhelmed with an unfair share of records but means the keys are distributed willy-nilly across machines. Although each Reducer’s output is sorted, you will see records from 2008 at the top of each result file and records from 2012 at the bottom of each result file.
+
+What we want instead is a total sort, the earliest records in the first numbered file in order, the following records in the next file in order, and so on until the last numbered file. Pig’s ‘ORDER’ Operator does just that. In fact, it does better than that. If you look at the Job Tracker Console, you will see Pig actually ran three Map/Reduce jobs. As you would expect, the first job is the one that did the grouping and summing and the last job is the one that sorted the output records. In the last job, all the earliest records were sent to Reducer 0, the middle range of records were sent to Reducer 1 and the latest records were sent to Reducer 2.
+
+Hadoop, however, has no intrinsic way to make that mapping happen. Even if it figured out, say, that the earliest buckets were in 2008 and the latest buckets were in 2012, if we fed it a dataset with skyrocketing traffic in 2013, we would end up sending an overwhelming portion of results to that Reducer. In the second job, Pig sampled the set of output keys, brought them to the same Reducer, and figured out the set of partition breakpoints to distribute records fairly.
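+
+For reference, the total sort itself is a one-liner; a minimal sketch, assuming a relation `pageview_counts` of `(hour, visits)` pairs produced by the grouping step:
+
+----
+sorted_counts = ORDER pageview_counts BY hour ASC PARALLEL 3;  -- Pig adds the partition-sampling job for you
+STORE sorted_counts INTO '/dataresults/hourly_pageview_counts';
+----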
+
+In general, Pig offers many more optimizations beyond these and we will talk more about them in the chapter on Advanced Pig (TODO: REF). In our experience, the only times Pig will author a significantly less-performant dataflow than would an expert come when Pig is overly aggressive about introducing an optimization. The chief example you’ll hit is that often, the intermediate stage in the total sort to calculate partitions has a larger time penalty than doing a bad job of partitioning would; you can disable that by (TODO:CODE: Describe how).
+
+=== Group and Flatten
+
+The fundamental Map/Reduce operation is to group a set of records and operate on that group. In fact, it’s a one-liner in Pig:
+
+----
+BINS = GROUP WP_pageviews BY (date, hour);
+DESCRIBE BINS;
+(TODO:CODE: SHOW OUTPUT)
+----
+
+The result is always a tuple whose first field is named `group`, holding the individual group keys in order. The next field is a bag holding the full input records, each with all its fields, even the group key. Here’s a Wukong script that illustrates what is going on:
+
+----
+(TODO:CODE: Wukong script)
+----
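+
+In the meantime, here is a hedged Pig sketch of acting on those groups; it assumes each pageview record carries a numeric `visits` field:
+
+----
+-- each BINS tuple holds group:(date, hour) plus a bag of the full pageview records
+bin_counts = FOREACH BINS GENERATE
+               FLATTEN(group) AS (date, hour),
+               SUM(WP_pageviews.visits) AS visits;
+----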
+
+You can group more than one dataset at the same time. In weather data, there is one table listing the location and other essentials of each weather station and a set of tables listing, for each hour, the weather at each station. Here’s one way to combine them into a new table, giving the explicit latitude and longitude of every observation:
+
+----
+G1=GROUP WSTNS BY (ID1, ID2), WOBS BY (ID1, ID2);
+G2=FLATTEN G1…
+G3=FOR EACH G2 …
+----
+
+This is equivalent to the following Wukong job:
+
+----
+(TODO:CODE: Wukong job)
+----
+
+(TODO: replace with an example where you would use a pure COGROUP).
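+
+Until that example lands, here is a rough Pig sketch of the station/observation combination above. The `latitude` and `longitude` field names are assumptions about the weather-station schema:
+
+----
+-- COGROUP keeps each input in its own bag, one output tuple per (ID1, ID2) key
+stn_obs   = COGROUP WSTNS BY (ID1, ID2), WOBS BY (ID1, ID2);
+-- flattening both bags pairs every observation with its station's coordinates;
+-- keys present in only one input produce an empty bag and drop out here
+decorated = FOREACH stn_obs GENERATE
+              FLATTEN(WSTNS.(latitude, longitude)) AS (latitude, longitude),
+              FLATTEN(WOBS);
+----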
+
+==== Join Practicalities
+
+The output of the Join job has one line for each discrete combination of A and B. As you will notice in our Wukong version of the Join, the job receives all the A records for a given key in order, strictly followed by all the B records for that key in order. We have to accumulate all the A records in memory so we know what rows to emit for each B record. All the A records have to be held in memory at the same time, while all the B records simply flutter by; this means that if you have two datasets of wildly different sizes or distribution, it is worth ensuring the Reducer receives the smaller group first. In Wukong, you do this by giving it an earlier-occurring field group label; in Pig, always put the table with the largest number of records per key last in the statement.
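+
+As a reminder of the rule in Pig form -- a sketch reusing the weather relations from above, where `WSTNS` has one record per key and `WOBS` has many:
+
+----
+-- the table with the most records per key (WOBS) goes last in the JOIN statement
+stn_obs = JOIN WSTNS BY (ID1, ID2), WOBS BY (ID1, ID2);
+----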
+
+==== Ready Reckoner: How fast should your Pig fly?
+
+TODO:CODE: describe for each Pig command what jobs should result.
+
+==== More
+
+There are a few more Operators we will use later in the book:
+Cube, which produces aggregations at multiple levels within a Group;
+Rank, which is sugar on top of Order to produce a numbered, total-ordered set of records;
+Split, to separate a dataset into multiple pieces; and
+Union, to produce a new dataset holding all the records from its input datasets (a quick sketch of these follows).
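+
+Here is the promised sketch of those operators in use; the relation and field names (`pageview_counts`, `visits` and so on) are stand-ins, not canonical examples from the book:
+
+----
+ranked   = RANK pageview_counts BY visits DESC;   -- prepends a rank field: a numbered, total-ordered copy
+SPLIT pageview_counts INTO busy IF visits >= 1000000L, quiet IF visits < 1000000L;
+combined = UNION busy, quiet;                     -- glue datasets with matching schemas back together
+cubed    = CUBE pageviews BY CUBE(date, hour);    -- aggregate at (date,hour), (date), (hour) and overall
+rollups  = FOREACH cubed GENERATE FLATTEN(group) AS (date, hour), SUM(cube.visits) AS visits;
+----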
+
+That’s really about it. Pig is an extremely sparse language. By having very few Operators and very uniform syntax (FOOTNOTE: Something SQL users but non-enthusiasts like your authors appreciate), the language makes it easy for the robots to optimize the dataflow and for humans to predict and reason about its performance.
+
+We won’t spend any more time introducing Pig, the language, as its usage will be fairly clear in context as you meet it later in the book. The online Pig manual at (TODO: REF) is quite good and for a deeper exploration, consult (TODO: Add name of best Pig book here).
+
+==== Pig Gotchas
+
+TODO:CODE: That one error where you use the dot or the colon when you should use the other.
+TODO:CODE: Where to look to see that Pig is telling you have either nulls, bad fields, numbers larger than your type will hold or a misaligned schema.
+
+
+// ------
+// stream do |article|
+// words = Wukong::TextUtils.tokenize(article.text, remove_stopwords: true)
+// words.group_by(&:to_s).map{|word, occurs|
+// yield [article.id, word, occurs.count]
+// end
+// end
+// ------
+//
+// Reading it as prose the script says "for each article: break it into a list of words; group all occurrences of each word and count them; then output the article id, word and count."
+//
+// .Snippet from the Wikipedia article on "Barbecue"
+// [quote, wikipedia, http://en.wikipedia.org/wiki/Barbeque]
+// ____
+// Each Southern locale has its own particular variety of barbecue, particularly concerning the sauce. North Carolina sauces vary by region; eastern North Carolina uses a vinegar-based sauce, the center of the state enjoys Lexington-style barbecue which uses a combination of ketchup and vinegar as their base, and western North Carolina uses a heavier ketchup base. Lexington boasts of being "The Barbecue Capital of the World" and it has more than one BBQ restaurant per 1,000 residents. In much of the world outside of the American South, barbecue has a close association with Texas. Many barbecue restaurants outside the United States claim to serve "Texas barbecue", regardless of the style they actually serve. Texas barbecue is often assumed to be primarily beef. This assumption, along with the inclusive term "Texas barbecue", is an oversimplification. Texas has four main styles, all with different flavors, different cooking methods, different ingredients, and different cultural origins. In the June 2008 issue of Texas Monthly Magazine Snow's BBQ in Lexington was rated as the best BBQ in the state of Texas. This ranking was reinforced when New Yorker Magazine also claimed that Snow's BBQ was "The Best Texas BBQ in the World".
+// ____
+//
+// The output of the first stage
+//
+// ----
+// 37135 texas 8
+// 37135 barbecue 8
+// 37135 bbq 5
+// 37135 different 4
+// 37135 lexington 3
+// 37135 north 3
+// 37135 carolina 3
+// ----
View
354 19a-advanced_pig.asciidoc
@@ -1,174 +1,226 @@
-=== Advanced Join Fu ===
+=== Optimizing Hadoop Dataflows
-Pig has three special-purpose join strategies: the "map-side" (aka 'fragment replicate') join
+The guidelines in this section start by repeating things your good common sense surely already knew -- you didn’t buy this book to have people tell you it’s inefficient to process more data than you need to. It’s not always obvious, however, how to get Hadoop to respect your good common sense and, if anything that follows seems obvious to you, well, it’s included here because your authors have most likely learned it the hard way.
-The map-side join have strong restrictions on the properties
+So, don’t process more data than you have to. In Pig, put filters before joins or other structural operations. If you’re doing a filter, you probably have a good sense of how much data should be allowed through. Check the Job Tracker Console and if the ratio of data read to Mapper output data size does not line up with what you expect, dig deeper. Also, project out fields you don’t need for downstream operations; surprisingly, Pig does not do this for you. Keep in mind that when you do a Group, the key fields appear in full in both the new Group field and in every one of the grouped tuples. It is a good habit to follow every Group with a FOREACH.
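+
+A minimal sketch of those habits in one place; the `weather_obs` relation and its fields are hypothetical:
+
+----
+obs_slim   = FOREACH weather_obs GENERATE stn_id, temp;   -- project away fields you won't use downstream
+obs_usable = FILTER obs_slim BY temp IS NOT NULL;         -- filter before the structural operation
+by_station = GROUP obs_usable BY stn_id;
+-- follow the GROUP with a FOREACH so the duplicated key fields don't tag along downstream
+stn_temps  = FOREACH by_station GENERATE group AS stn_id, AVG(obs_usable.temp) AS avg_temp;
+----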
-A dataflow designed to take advantage of them
-can produce order-of-magnitude scalability improvements.
+If you only need a small fraction of records from a dataset, extract them as early as possible. If it is all the same to you, a LIMIT operation (taking the first N rows) is more efficient than a SAMPLE or FILTER; if you can get everything you need from one or a few input files, that’s even more efficient. As we harped on throughout the book, when you’re developing a job, it’s almost always worth it to start by extracting a faithful, smaller sub-universe to work with.
-They're also a great illustration of three key scalability patterns.
-Once you have a clear picture of how these joins work,
-you can be confident you understand the map/reduce paradigm deeply.
+Process more data than you have to if it makes your code more readable. There is some benefit to having a uniform schema, especially if you’re working with TSV files, where the mapping from slot order to name is not intrinsic. If leaving in a couple of extraneous fields would add five minutes to a job’s production runtime but subtract five lines from the source, we prefer the inefficient script.
-[[advanced_pig_map_side_join]]
-=== Map-side Join ===
-
-A map-side (aka 'fragment replicate') join
-
-In a normal `JOIN`, the largest dataset goes on the right. In a fragement-replicate join, the largest dataset goes on the *left*, and everything to the right must be tiny.
-
-The Pig manual calls this a "fragment replicate" join, because that is how Pig thinks about it: the tiny datasets are duplicated to each machine.
-Throughout the book, I'll refer to it as a map-side join, because that's how you should think about it when you're using it.
-The other common name for it is a Hash join -- and if you want to think about what's going on inside it, that's the name you should use.
-
-==== How a Map-side (Hash) join works =====
-
-If you've been to enough large conferences you've seen at least one registration-day debacle. Everyone leaves their hotel to wait in a long line at the convention center, where they have set up different queues for some fine-grained partition of attendees by last name and conference track. Registration is a direct join of the set of attendees on the set of badges; those check-in debacles are basically the stuck reducer problem come to life.
-
-If it's a really large conference, the organizers will instead set up registration desks at each hotel. Now you don't have to move very far, and you can wait with your friends. As attendees stream past the registration desk, the 'A-E' volunteer decorates the Arazolos and Eliotts with badges, the 'F-K' volunteer decorates the Gaspers and Kellys, and so forth. Note these important differences: a) the registration center was duplicated in full to each site b) you didn't have to partition the attendees; Arazolos and Kellys and Zarebas can all use the same registration line.
-
-To do a map-side join, Pig holds the tiny table in a Hash (aka Hashmap or dictionary), indexed by the full join key.
-
-----
-
- .-------------. |
- | tiny table | | ... huge table ...
- +--+----------+ |
- |A | ...a... | | Q | ...
- | | ...a... | | B | ...
- |Q | ...q... | | B | ...
- |F | ...f... | | B | ...
- ... | A | ...
- |Z | ...z... | | B | ...
- | | ...z... | | B | ...
- |P | ...p... | | C | ...
- |_____________| | Z | ...
- | A | ...
-
-----
-
-As each row in the huge table flys by, it is decorated with the matching rows from the tiny table and emitted.
-Holding the data fully in-memory in a hash table gives you constant-time lookup speed for each key, and lets you access rows at the speed of RAM.
-
-One map-side only pass through the data is enough to do the join.
-
-See ((distribution of weather measurements)) for an example.
-
-
-==== Example: map-side join of wikipedia page metadata with wikipedia pageview stats =====
-
-
-
-
-[[merge_join]]
-=== Merge Join ===
-
-==== How a merge join works =====
-
-(explanation)
-
-Quoting Pig docs:
-
-
-____________________________________________________________________
-You will also see better performance if the data in the left table is partitioned evenly across part files (no significant skew and each part file contains at least one full block of data).
-____________________________________________________________________
-
-
-==== Example: merge join of user graph with page rank iteration ====
-
-=== Skew Join ===
-
-(explanation of when needed)
-
-==== How a skew join works ====
-
-(explanation how)
-
-==== Example: ? counting triangles in wikipedia page graph ? OR ? Pageview counts ? ====
-
-TBD
-
-=== Efficiency and Scalability ===
-
-
-==== Do's and Don'ts ====
-
-The Pig Documentation has a comprehensive section on http://pig.apache.org/docs/r0.9.2/perf.html[Performance and Efficiency in Pig]. We won't try to improve on it, but here are some highlights:
-
-* As early as possible, reduce the size of your data:
- - LIMIT
- - Use a FOREACH to reject unnecessary columns
- - FILTER
-
-* Filter out `Null`s before a join
- in a join, all the records rendezvous at the reducer
- if you reject nulls at the map side, you will reduce network load
-
-==== Join Optimizations ====
-
-__________________________________________________________________________
-"Make sure the table with the largest number of tuples per key is the last table in your query.
- In some of our tests we saw 10x performance improvement as the result of this optimization.
-
- small = load 'small_file' as (t, u, v);
- large = load 'large_file' as (x, y, z);
- C = join small by t, large by x;
-__________________________________________________________________________
-
-(explain why)
-
-(come up with a clever mnemonic that doesn't involve sex, or get permission to use the mnemonic that does.)
-
-==== Magic Combiners ====
-
-TBD
-
-==== Turn off Optimizations ====
-
-After you've been using Pig for a while, you might enjoy learning about all those wonderful optimizations, but it's rarely necessary to think about them.
-
-In rare cases,
-you may suspect that the optimizer is working against you
-or affecting results.
-
-To turn off an optimization
-
- TODO: instructions
+In general, wasting CPU to save network or disk bandwidth is a good idea. If you grew up watching a 386 try to produce a ZIP file, it probably seems counterintuitive that storing your data in compressed files not only saves disk space but also speeds up processing time. However, most Hadoop jobs are overwhelmingly throughput bound, so spending the extra CPU cycles to compress and decompress data is more than justified by the overall efficiency of streaming less data off the disk or network. The section on Hadoop Internals (TODO: REF) explains how to enable compression of data between Mapper and Reducer (which you should always do) and how to read or write compressed data (which has tradeoffs you need to understand). In the case of intermediate checkpoints within a multi-stage workflow, it almost always makes sense to use a light compression format such as LZO or Snappy. In Pig, if you set the `pig.tmpfilecompression` and `pig.tmpfilecompression.codec` configuration variables appropriately, it will do that for you.
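+
+In script form, that intermediate-checkpoint compression looks something like the following (the map-output property name is the Hadoop 1.x-era one; adjust for your cluster, and make sure the codec you name is actually installed):
+
+----
+SET pig.tmpfilecompression true;
+SET pig.tmpfilecompression.codec lzo;     -- or snappy
+SET mapred.compress.map.output true;      -- compress mapper-to-reducer traffic
+----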
+
+There are a few other cases where you should invoke this rule. If you have a large or variable text string that you only need to compare for uniqueness -- e.g., using URLs as keys -- it is worth using a string digest function to shorten its length, as described in the chapter on Sketch Functions (TODO: REF).
+
+Regular expressions are much faster than you’d think, especially in Ruby. If you only need a small part of a string and it does not cost you in readability, it might be worth slicing out only the interesting part.
+
+Use types efficiently. Always use a schema in Pig; no exceptions. In case you’re wondering, it makes your script more efficient and catches errors early, but most of all, it shows respect to your colleagues or future self.
+
+There are a couple Pig-specific tradeoffs to highlight. Wherever possible, make your UDFs algebraic or at least Accumulator-like -- as described in the section on UDFs (TODO: REF). If you use two FOREACH statements consecutively, Pig is often able to merge them, but not always. If you see Pig introduce an extra Mapside-only job where you didn’t think one was necessary, it has probably failed to combine them. Always start with the more readable code, then decide if the problem is worth solving. Most importantly, be aware of Pig’s specialized JOINs; these are important enough that they get their whole section below.
+
+As you’ve seen, Pig is extremely sugar-free; more or less every structural operation corresponds to a unique Map/Reduce plan. In principle, a JOIN is simply a Cogroup with a FLATTEN and a DISTINCT is a Cogroup with a projection of just the GROUP key. Pig offers those more specific Operators because it is able to do them more efficiently. Watch for cases where you have unwittingly spelled these out explicitly.
+
+Always remove records with a NULL Group key _before_ the JOIN; those records will never appear in the output of a JOIN statement, yet they are not eliminated until after they have been sent across the network. Even worse, since all these records share all the same (worthless) key, they are all sent to the same Reducer, almost certainly causing a hotspot.
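+
+A hedged sketch of that pre-filtering, with hypothetical relation and key names:
+
+----
+-- reject null join keys map-side, before they all pile onto one unlucky Reducer
+a_clean = FILTER table_a BY join_key IS NOT NULL;
+b_clean = FILTER table_b BY join_key IS NOT NULL;
+joined  = JOIN a_clean BY join_key, b_clean BY join_key;
+----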
+
+If you are processing a lot of small files, Pig offers the ability to process many at once per Mapper, an extremely important optimization. Set the `pig.splitCombination` and `pig.maxCombinedSplitSize` options; if you're writing a custom loader, spend the extra effort to make it compatible with this optimization.
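+
+For example (the 128 MB figure is a common starting point, not a universal rule):
+
+----
+SET pig.splitCombination true;            -- on by default in recent Pig versions
+SET pig.maxCombinedSplitSize 134217728;   -- combine small files up to roughly one 128 MB split per Mapper
+----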
+
+Do not use less or more parallelism than is reasonable. We have talked repeatedly throughout the book about the dangers of hotspots -- a few Reducers laboring to swallow the bulk of the dataset while their many comrades clock out early. Sometimes, however, your job’s configuration might unwittingly recommend to Hadoop that it use only one or too few Reducers. In this case, the Job Tracker will show only a few heavyweight Reduce tasks running; the other Reduce slots sit idle because nothing has been asked of them. Set the number of Reducers in Pig using the ‘PARALLEL’ directive, and in Wukong using the ‘--REDUCE_TASKS=N’ option (TODO: check spelling).
+
+It can also be wasteful to have too many Reducers. If your job has many Reducers uniformly processing only a few kb of records, the large fixed costs of launching and accounting for those attempts justify using the parallelism settings to limit the number of Reducers.
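+
+Concretely, you can set parallelism per statement or script-wide; a sketch:
+
+----
+by_hour = GROUP pageviews BY hour PARALLEL 20;  -- reducers for this statement only
+SET default_parallel 20;                        -- default for every reduce stage in the script
+----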
+
+==== Efficient JOINs in Pig
+
+Pig has a number of specialized JOINs that, used in their appropriate circumstances, bring enough performance improvements to organize a cult around. (TODO: make funny).
+
+===== Specialized Pig Join #1: The `REPLICATED JOIN`
+
+If you are joining a large dataset with a small-enough one, Pig can often execute the operation using a Mapper-Only job, eliminating the costly startup time and network transfer of a Reduce step. This is commonplace enough and the performance impact large enough that it is always worth considering whether this type of JOIN is applicable.
+
+Imagine visiting the opera while the United Nations is in town. The smart theater owner will prepare librettos in, say, a dozen languages, enough to delight the many thousands of attendees. A bad way to distribute them would be to arrange kiosks, one per language, throughout the lobby. Every aficionado would first have to fight their way through the crowd to find the appropriate kiosk, then navigate across the theater to find their seats. Our theater owner, being smart, instead handles the distribution as follows: Ushers stand at every entrance, armed with stacks of librettos; at every entrance, all the languages are represented. This means that, as each attendee files in, they simply select the appropriate one from what is on hand, then head to their seat without delay.
+
+A Mapper-Only JOIN works analogously. Every Mapper loads the contents of the smaller dataset in full into its own local lookup table -- a hash map keyed by the JOIN key (this is why you will also see it referred to as a HashMap JOIN, and, since the small dataset is copied to every Mapper, as a Replicated JOIN). The minor cost of replicating that dataset to every single Mapper is usually repaid many times over by eliminating the entire Reduce stage. The constraint, however, is that the smaller dataset must fit entirely in RAM. The usher’s task is manageable when there is one type of libretto for each of a dozen languages but would be unmanageable if there were one type of libretto for each of several thousand home towns.
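+
+In Pig, you ask for this with the `USING 'replicated'` clause; a sketch with assumed relation names, where `page_metadata` is the small side:
+
+----
+-- the large relation streams through the Mappers; every relation listed after it must fit in RAM
+decorated = JOIN pageviews BY page_id, page_metadata BY page_id USING 'replicated';
+----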
+
+How much is too much? Watch for excessive GC activity. (TODO: Pig probably has a warning too - find out what it is). Within the limits of available RAM, you can use fewer Mappers with more available RAM; the Hadoop tuning chapter (TODO: REF) shows you how. Don’t be too aggressive, though; datasets have a habit of growing over time and you would hate to spend Thanksgiving day reworking the jobs that process retail sales data because you realized they would not stand up to the Black Friday rush.
+
+There is a general principle here: It is obvious there is a class of problems which only crop up past a certain threshold of data. What may not be obvious, until you’ve learned it the hard way, is that the external circumstances most likely to produce that flood of extra data are also the circumstances that leave you least able to address the problem.
+
+===== Specialized Pig Join #2: The `MERGE JOIN`
+
+A JOIN of two datasets, each in strict total order by the JOIN key, can also be done using Mapper-Only by simply doing a modified Merge sort. You must ensure not only that the files are in sort order but that the lexicographic order of the file names match the order in which its parts should be read. If you do so, Pig can proceed as follows: It does a first pass to sample each file from the right-hand dataset to learn the distribution of keys throughout the files. The second stage performs the actual JOIN. Each Mapper reads from two streams: its assigned split within the left-hand dataset and the appropriate sections of the right-hand dataset. The Mapper’s job is then very simple; it grabs a group of records from the right-hand stream and a group of records from the left-hand stream and compares their keys. If they match, they are joined. If they do not match, it reads from the stream with the too-low key until it either produces the matching group or sails past it, in which case it similarly reads from the other stream.
+
+As we’ve discussed a few times, reading data in straight streams like this lets the underlying system supply data at the fastest possible rate. What’s more, the first pass indexing scheme means most tasks will be “Map-local” -- run on a machine whose data node hosts a copy of that block. In all, you require a short Mapper-Only task to sample the right-hand dataset and the network throughput cost that is ‘O(N)’ in the size of the second dataset. The constraint is, of course, that this only works with total-ordered data on the same key. For a “Gold” dataset -- one that you expect to use as source data for a number of future jobs -- we typically spend the time to do a last pass total sort of the dataset against the most likely JOIN key. It is a nice convenience for future users of the dataset, helps in sanity checking and improves the odds that you will be able to use the more efficient MERGE/JOIN.
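+
+In Pig, the `USING 'merge'` clause requests this plan; a sketch with assumed relation names:
+
+----
+-- both inputs must already be in total sort order on the join key
+joined = JOIN pages BY page_id, pageranks BY page_id USING 'merge';
+----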
+
+
+// === Advanced Join Fu ===
+//
+// Pig has three special-purpose join strategies: the "map-side" (aka 'fragment replicate') join
+//
+// The map-side join have strong restrictions on the properties
+//
+// A dataflow designed to take advantage of them
+// can produce order-of-magnitude scalability improvements.
+//
+// They're also a great illustration of three key scalability patterns.
+// Once you have a clear picture of how these joins work,
+// you can be confident you understand the map/reduce paradigm deeply.
+//
+// [[advanced_pig_map_side_join]]
+// === Map-side Join ===
+//
+// A map-side (aka 'fragment replicate') join
+//
+// In a normal `JOIN`, the largest dataset goes on the right. In a fragement-replicate join, the largest dataset goes on the *left*, and everything to the right must be tiny.
+//
+// The Pig manual calls this a "fragment replicate" join, because that is how Pig thinks about it: the tiny datasets are duplicated to each machine.
+// Throughout the book, I'll refer to it as a map-side join, because that's how you should think about it when you're using it.
+// The other common name for it is a Hash join -- and if you want to think about what's going on inside it, that's the name you should use.
+//
+// ==== How a Map-side (Hash) join works =====
+//
+// If you've been to enough large conferences you've seen at least one registration-day debacle. Everyone leaves their hotel to wait in a long line at the convention center, where they have set up different queues for some fine-grained partition of attendees by last name and conference track. Registration is a direct join of the set of attendees on the set of badges; those check-in debacles are basically the stuck reducer problem come to life.
+//
+// If it's a really large conference, the organizers will instead set up registration desks at each hotel. Now you don't have to move very far, and you can wait with your friends. As attendees stream past the registration desk, the 'A-E' volunteer decorates the Arazolos and Eliotts with badges, the 'F-K' volunteer decorates the Gaspers and Kellys, and so forth. Note these important differences: a) the registration center was duplicated in full to each site b) you didn't have to partition the attendees; Arazolos and Kellys and Zarebas can all use the same registration line.
+//
+// To do a map-side join, Pig holds the tiny table in a Hash (aka Hashmap or dictionary), indexed by the full join key.
+//
+// ----
+//
+// .-------------. |
+// | tiny table | | ... huge table ...
+// +--+----------+ |
+// |A | ...a... | | Q | ...
+// | | ...a... | | B | ...
+// |Q | ...q... | | B | ...
+// |F | ...f... | | B | ...
+// ... | A | ...
+// |Z | ...z... | | B | ...
+// | | ...z... | | B | ...
+// |P | ...p... | | C | ...
+// |_____________| | Z | ...
+// | A | ...
+//
+// ----
+//
+// As each row in the huge table flys by, it is decorated with the matching rows from the tiny table and emitted.
+// Holding the data fully in-memory in a hash table gives you constant-time lookup speed for each key, and lets you access rows at the speed of RAM.
+//
+// One map-side only pass through the data is enough to do the join.
+//
+// See ((distribution of weather measurements)) for an example.
+//
+//
+// ==== Example: map-side join of wikipedia page metadata with wikipedia pageview stats =====
+//
+//
+//
+//
+// [[merge_join]]
+// === Merge Join ===
+//
+// ==== How a merge join works =====
+//
+// (explanation)
+//
+// Quoting Pig docs:
+//
+//
+// ____________________________________________________________________
+// You will also see better performance if the data in the left table is partitioned evenly across part files (no significant skew and each part file contains at least one full block of data).
+// ____________________________________________________________________
+//
+//
+// ==== Example: merge join of user graph with page rank iteration ====
+//
+// === Skew Join ===
+//
+// (explanation of when needed)
+//
+// ==== How a skew join works ====
+//
+// (explanation how)
+//
+// ==== Example: ? counting triangles in wikipedia page graph ? OR ? Pageview counts ? ====
+//
+// TBD
+//
+// === Efficiency and Scalability ===
+//
+//
+// ==== Do's and Don'ts ====
+//
+// The Pig Documentation has a comprehensive section on http://pig.apache.org/docs/r0.9.2/perf.html[Performance and Efficiency in Pig]. We won't try to improve on it, but here are some highlights:
+//
+// * As early as possible, reduce the size of your data:
+// - LIMIT
+// - Use a FOREACH to reject unnecessary columns
+// - FILTER
+//
+// * Filter out `Null`s before a join
+// in a join, all the records rendezvous at the reducer
+// if you reject nulls at the map side, you will reduce network load
+//
+// ==== Join Optimizations ====
+//
+// __________________________________________________________________________
+// "Make sure the table with the largest number of tuples per key is the last table in your query.
+// In some of our tests we saw 10x performance improvement as the result of this optimization.
+//
+// small = load 'small_file' as (t, u, v);
+// large = load 'large_file' as (x, y, z);
+// C = join small by t, large by x;
+// __________________________________________________________________________
+//
+// (explain why)
+//
+// (come up with a clever mnemonic that doesn't involve sex, or get permission to use the mnemonic that does.)
+//
+// ==== Magic Combiners ====
+//
+// TBD
+//
+// ==== Turn off Optimizations ====
+//
+// After you've been using Pig for a while, you might enjoy learning about all those wonderful optimizations, but it's rarely necessary to think about them.
+//
+// In rare cases,
+// you may suspect that the optimizer is working against you
+// or affecting results.
+//
+// To turn off an optimization
+//
+// TODO: instructions
==== Exercises ====
1. Quoting Pig docs:
> "You will also see better performance if the data in the left table is partitioned evenly across part files (no significant skew and each part file contains at least one full block of data)."
Why is this?
-
-2. Each of the following snippets goes against the Pig documentation's recommendations in one clear way.
+
+2. Each of the following snippets goes against the Pig documentation's recommendations in one clear way.
- Rewrite it according to best practices
- compare the run time of your improved script against the bad version shown here.
-
+
things like this from http://pig.apache.org/docs/r0.9.2/perf.html --
a. (fails to use a map-side join)
-
+
b. (join large on small, when it should join small on large)
-
+
c. (many `FOREACH`es instead of one expanded-form `FOREACH`)
-
+
d. (expensive operation before `LIMIT`)
For each use weather data on weather stations.
-
-=== Pig and HBase ===
-
-TBD
-
-=== Pig and JSON ===
-
-TBD
-
-=== Refs ===
-
-* http://pig.apache.org/docs/r0.10.0/perf.html#replicated-joins:[map-side join]
+// === Pig and HBase ===
+//
+// TBD
+//
+// === Pig and JSON ===
+//
+// TBD
+//
+// === Refs ===
+//
+// * http://pig.apache.org/docs/r0.10.0/perf.html#replicated-joins:[map-side join]
