fixing preface TOC
Philip (flip) Kromer committed Jan 10, 2014
1 parent c3659e0 commit a9e1813
Showing 5 changed files with 89 additions and 152 deletions.
181 changes: 86 additions & 95 deletions 00-preface.asciidoc
@@ -140,123 +140,94 @@ Connect the structural operations you've seen Pig do with what is happening under the hood.
8. *Intro to Storm+Trident*

* Meet Nim Seadragon
* What and Why Storm and Trident
* First Storm Job

9. *Statistics*:

* (this is first deep experience with Storm+Trident)
* Summarizing: Averages, Percentiles, and Normalization
* running / windowed stream summaries
- make a "SummarizingTap" trident operation that collects {Sum Count Min Max Avg Stddev SomeExampleValuesReservoirSampled} (fill in the details of what exactly this means)
- also, maybe: Median+Deciles, Histogram
- understand the flow of data going on in preparing such an aggregate, by either making sure the mechanics of working with Trident don't overwhelm that or by retracing the story of records in an aggregation
- you need a group operation -- this means everything in a group goes to exactly one executor on exactly one machine, so the aggregator sees everything in the group
* combiner-aggregators (in particular), do some aggregation beforehand, and send an intermediate aggregation to the executor that hosts the group operation
- by default, always use persistent aggregate until we find out why you wouldn’t
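
A minimal Python sketch of the mergeable summary that the "SummarizingTap" bullet asks for, included to make the combiner-aggregator idea concrete: each partition builds a partial summary, and merging the partials at the group's executor gives the exact global answer. `SummaryStats` is a hypothetical name (this is not the Trident API), and the stddev is tracked as a running M2 rather than a raw sum of squares, which also previews the numerical-stability point further down.

[source,python]
----
import math

class SummaryStats(object):
    """Mergeable summary: count, mean, M2 (for stddev), min, max.
    One instance per partition acts as a combiner-aggregator's partial result;
    merging partials at the grouping executor yields the exact global summary."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.min, self.max = float("inf"), float("-inf")

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n            # Welford update: no giant sum of squares
        self.m2 += delta * (x - self.mean)
        self.min, self.max = min(self.min, x), max(self.max, x)

    def merge(self, other):
        """Fold another partial summary into this one (Chan et al. parallel variance)."""
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.mean += delta * other.n / n
        self.n = n
        self.min, self.max = min(self.min, other.min), max(self.max, other.max)
        return self

    def stddev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0
----

Usage follows the combiner pattern: build one `SummaryStats` per partition with `add`, then `total = a.merge(b).merge(c)` at the executor hosting the group.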

- (BUBBLE) highlight the corresponding map/reduce dataflow and illuminate the connection
* (BUBBLE) Median / calculation of quantiles at large enough scale that doing so is hard
* (in next chapter we can do histogram)
* Use a sketching algorithm to get an approximate but distributed answer to a holistic aggregation problem, e.g. most frequent elements
* Rolling timeseries averages
* Sampling responsibly: it's harder and more important than you think (see the sampling sketch after this list)
- consistent sampling using hashing
- don’t use an RNG
- appreciate that external data sources may have changed
- reservoir sampling
- connectivity sampling (BUBBLE)
- subuniverse sampling (LOC?)
* Statistical aggregates and the danger of large numbers
- numerical stability
- overflow/underflow
- working with distributions at scale
- your intuition is often incomplete
- with trillions of things, one-in-a-billion events happen thousands of times
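
Two sampling sketches for the "Sampling responsibly" bullets above, offered as assumptions about how the chapter might illustrate them rather than the book's actual code: consistent sampling keyed on a hash (so reruns and related datasets select the same records, with no RNG involved), and reservoir sampling for a fixed-size uniform sample of a stream of unknown length. The reservoir version does need randomness, so it is at least seeded for reproducibility.

[source,python]
----
import hashlib
import random

def keep_in_sample(record_key, rate=0.01):
    """Consistent 1% sample: hash the key instead of rolling an RNG, so the
    same records are kept on every run and across related datasets."""
    h = int(hashlib.md5(str(record_key).encode("utf-8")).hexdigest()[:8], 16)
    return (h / float(0xFFFFFFFF)) < rate

def reservoir_sample(stream, k, seed=1234):
    """Uniform sample of k items from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)              # fixed seed: reproducible, if not consistent
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)          # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
----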

* weather temperature histogram in streaming fashion

- approximate distinct counts (using HyperLogLog; a simpler K-minimum-values sketch follows below)
- approximate percentiles (based on quantile digest)
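
The outline names HyperLogLog for approximate distinct counts; as a stand-in, here is the simpler K-minimum-values sketch, which shows the same flavor of trading exactness for a small, mergeable state. The function names and the choice of `k` are illustrative assumptions, not the book's implementation.

[source,python]
----
import hashlib
import heapq

def _hash01(item):
    """Map an item to a pseudo-uniform value in [0, 1)."""
    digits = hashlib.md5(str(item).encode("utf-8")).hexdigest()[:15]
    return int(digits, 16) / float(16 ** 15)

def approx_distinct(items, k=256):
    """K-minimum-values estimate of the number of distinct items."""
    heap = []                         # max-heap (via negation) of the k smallest hashes
    in_heap = set()
    for item in items:
        h = _hash01(item)
        if h in in_heap:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            in_heap.add(h)
        elif h < -heap[0]:            # smaller than the current k-th smallest hash
            evicted = -heapq.heappushpop(heap, -h)
            in_heap.discard(evicted)
            in_heap.add(h)
    if len(heap) < k:
        return len(heap)              # saw fewer than k distinct values: count is exact
    return int(round((k - 1) / -heap[0]))

# approx_distinct("user%d" % (i % 50000) for i in range(1000000))
#   -> roughly 50000, give or take a few percent
----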

10. *Time Series and Event Log Processing*:
* Parsing logs and using regular expressions with Hadoop (a logline-parsing sketch follows this list)
- logline model
- regexp to match lines, highlighting this as a parser pattern
- reinforce the source blob -> source model -> domain model practice
* Histograms and time series of pageviews using Hadoop
* sessionizing (see the sketch after this list)
- flow chart throughout site?
- "n-views": pages viewed in sequence
- ?? Audience metrics:
- make sure that this serves the later chapter with the live recommender engine (lambda architecture)
* Geolocate visitors based on IP with Hadoop
- use World Cup data?
- demonstrate using a lookup table
- explain it as a range query
- use a mapper-only (replicated) join -- explain why we use it here (small table against big) but don't explain how it works (that's covered later)
* (Ab)Using Hadoop to stress-test your web server
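
A sketch of the "regexp to match lines" parser pattern for the log-processing bullets above, assuming Apache combined log format; the group names are illustrative, and unparseable lines come back as `None` so bad records can be counted rather than crash the job.

[source,python]
----
import re

# Apache "combined" log format; the group names are ours, not a standard.
LOGLINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)")?\s*$')

def parse_logline(line):
    """Source blob -> source model: a dict per hit, or None for a bad record."""
    match = LOGLINE_RE.match(line)
    return match.groupdict() if match else None

sample = ('10.1.2.3 - - [10/Jan/2014:21:33:14 -0600] "GET /waxy/sw_kid HTTP/1.0" '
          '200 5043 "http://example.com/" "Mozilla/4.0"')
print(parse_logline(sample)["status"])    # prints: 200
----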
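And a sketch of sessionizing, under the common assumption that a 30-minute gap ends a session. In a Hadoop job, the grouping by visitor and the time ordering would come from the shuffle (plus a secondary sort), so only the per-visitor logic is shown.

[source,python]
----
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)     # conventional cutoff; tune per site

def sessionize(hits):
    """hits: (timestamp, url) pairs for ONE visitor -- i.e. what a reducer sees
    after grouping pageviews by IP or cookie. Yields one list of hits per session."""
    session = []
    for ts, url in sorted(hits):        # a secondary sort would pre-order these
        if session and ts - session[-1][0] > SESSION_GAP:
            yield session
            session = []
        session.append((ts, url))
    if session:
        yield session

hits = [(datetime(2014, 1, 10, 9, 0), "/"), (datetime(2014, 1, 10, 9, 10), "/faq"),
        (datetime(2014, 1, 10, 14, 0), "/")]
print(len(list(sessionize(hits))))      # prints: 2
----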

Exercise: what predicts the team a country will root for next? In particular: if, say, Mexico knocks out Greece, do Greeks in general root for, or against, Mexico?

11. *Geographic Data*:
* Spatial join (find all UFO sightings near Airports) of points with points
- map points to grid cell in the mapper; truncate at a certain zoom level (explain how to choose the zoom level). Must send points to reducers for their own grid key and also its neighbors (9 total squares); see the quadkey sketch after this list.
- Perhaps be clever about not having to use all 9 quad-grid neighbors by partitioning on a grid size more fine-grained than your original one, and then use that to send points only to the pertinent grid-cell reducers
- Perhaps generate the four points that are x away from you and use their quad cells.
* In the reducer, do point-by-point comparisons
- *Maybe* a secondary sort???
* Geospatial data model, i.e. the terms and fields that you use in, e.g., GeoJSON
- We choose X, we want the focus to be on data science not on GIS
- Still have to explain ‘feature’, ‘region’, ‘latitude’, ‘longitude’, etc…
* Decomposing a map into quad-cell mapping at constant zoom level
- mapper input: `<name of region, GeoJSON region boundary>`; Goal 1: have a mapping from region -> quad cells it covers; Goal 2: have a mapping from quad key to partial GeoJSON objects on it. Mapper output: [thing, quadkey]; [quadkey, list of region ids, hash of region ids to GeoJSON region boundaries]
* Spatial join of points with regions, e.g. what congressional district are you in?
- in mapper for points emit truncated quad key, the rest of the quad key, just stream the regions through (with result from prior exploration); a reducer has quadcell, all points that lie within that quadcell, and all regions (truncated) that lie on that quadcell. Do a brute force search for the regions that the points lie on
* Nearness query
- suppose the set of items you want to find nearness to is not huge; produce the Voronoi diagrams
* Decomposing a map into quad-cell mapping at multiple zoom levels; in particular, use Voronoi regions to show the multi-scale decomposition
* Re-do spatial join with Voronoi cells in multi-scale fashion (fill in details later)
- Framing the problem (NYC vs Pacific Ocean)
- Discuss how, given a global set of features, to decompose into a multi-scale grid representation
- Other mechanics of working with geo data
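
To make the quad-cell bullets concrete, here is a sketch of mapping a point to a Bing-style quadkey at a chosen zoom level and of generating the keys for its own cell plus the 8 neighbors -- i.e. the reducers a point would be sent to. It illustrates the scheme described above; it is not the book's implementation.

[source,python]
----
import math

def latlng_to_tile(lat, lng, zoom):
    """Web Mercator tile (x, y) containing a point at the given zoom level."""
    lat = max(min(lat, 85.05112878), -85.05112878)
    n = 2 ** zoom
    x = int((lng + 180.0) / 360.0 * n)
    s = math.sin(math.radians(lat))
    y = int((0.5 - math.log((1 + s) / (1 - s)) / (4 * math.pi)) * n)
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def tile_to_quadkey(x, y, zoom):
    """Interleave the bits of x and y into a base-4 'quadkey' string."""
    digits = []
    for z in range(zoom, 0, -1):
        mask = 1 << (z - 1)
        digits.append(str((1 if x & mask else 0) + (2 if y & mask else 0)))
    return "".join(digits)

def cells_for_point(lat, lng, zoom):
    """Quadkeys of a point's own cell plus its 8 neighbors -- the reducers it goes to."""
    x, y = latlng_to_tile(lat, lng, zoom)
    n = 2 ** zoom
    keys = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = (x + dx) % n, y + dy      # x wraps at the date line; y is clamped
            if 0 <= ny < n:
                keys.append(tile_to_quadkey(nx, ny, zoom))
    return keys
----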

12. *Conceptual Model for Data Analysis*

* There's just one framework

13. *Data Munging (Semi-Structured Data)*:

The dirty art of data munging. It's a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We'll show you street-fighting tactics that lessen the time and pain. Along the way, we'll prepare the datasets to be used throughout the book:
- Wikipedia Articles: Every English-language article (12 million) from Wikipedia.
- Hourly Weather Data: a century of weather reports, with hourly global coverage since the 1950s.
- "Star Wars Kid" weblogs: large collection of apache webserver logs from a popular internet site (Andy Baio's waxy.org).

* Wiki pageviews - String encoding and other bullshit
* Airport data - reconciling two *mostly* agreeing datasets
* Something that has errors (SW Kid) - dealing with bad records
* Weather Data - Parsing a flat pack file (see the fixed-width parsing sketch after this list)
- bear witness, explain that you DID have to temporarily become an amateur meteorologist, and had to write code to work with that many fields.
- when your schema is so complicated, it needs to be automated, too.
- join hell, when your keys change over time
* Data formats
- airing of grievances on XML
- airing of grievances on CSV
- don’t quote, escape
- the only 3 formats you should use, and when to use them
* Just do a data munging project from beginning to end that wasn’t too horrible
- Talk about the specific strategies and tactics
- source blob to source domain object, source domain object to business object. E.g., you want your initial extraction into a model that closely mirrors the source domain data format, mainly because you do not want to mix your extraction logic and business logic (extraction logic will pollute the business-object code). Also, you will otherwise end up building the wrong model for the business object, i.e. it will look like the source domain.
* Airport data - chief challenge is reconciling data sets, dealing with conflicting errors
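
A sketch of the "flat pack" (fixed-width) parsing problem from the Weather Data bullet: the field layout is kept as data and the parser is driven from it, which is what "when your schema is so complicated, it needs to be automated" amounts to. The field names, offsets, and scalings here are made up for illustration and do not match the real NCDC layout.

[source,python]
----
# Illustrative (made-up) subset of a fixed-width weather record layout.
WEATHER_FIELDS = [                       # (name, start, end, converter)
    ("station_id",   0,  6, str),
    ("obs_date",     6, 14, str),                       # YYYYMMDD
    ("obs_time",    14, 18, str),                       # HHMM
    ("temperature", 18, 23, lambda s: int(s) / 10.0),   # tenths of a degree C
    ("dew_point",   23, 28, lambda s: int(s) / 10.0),
]

def parse_fixed_width(line, fields=WEATHER_FIELDS):
    """Slice a flat-packed line into named, typed fields; None marks a blank field."""
    record = {}
    for name, start, end, convert in fields:
        raw = line[start:end].strip()
        record[name] = convert(raw) if raw else None
    return record

# parse_fixed_width("00709919900101060000212-0034")
#   -> station 007099, 1990-01-01 06:00, 21.2 C, dew point -3.4 C (fictional layout)
----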

13. *Machine Learning without Grad School*: We'll combine the record of every commercial flight since 1987 with the hour-by-hour weather data to predict flight delays using the methods below. We'll equip you with a picture of how they work, but won't go into the math of how or why. We will show you how to choose a method, and how to cheat to win. (A toy Naive Bayes sketch follows the list.)

- Naive Bayes
- Logistic Regression
- Random Forest (using Mahout)
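
As a picture (not the math) of how the first method works, a toy categorical Naive Bayes with add-one smoothing. The features in the usage example (origin airport, departure-hour bucket, precipitation) are assumptions for illustration, not the book's actual feature set, and the real chapter runs at Hadoop scale rather than in memory.

[source,python]
----
from collections import defaultdict
import math

class NaiveBayes(object):
    def __init__(self):
        self.label_counts = defaultdict(int)    # label -> number of training examples
        self.feat_counts  = defaultdict(int)    # (label, position, value) -> count
        self.feat_values  = defaultdict(set)    # position -> distinct values seen

    def train(self, examples):
        """examples: iterable of (features, label); features is a tuple of categories."""
        for feats, label in examples:
            self.label_counts[label] += 1
            for i, v in enumerate(feats):
                self.feat_counts[(label, i, v)] += 1
                self.feat_values[i].add(v)

    def predict(self, feats):
        total = float(sum(self.label_counts.values()))
        best_label, best_logp = None, float("-inf")
        for label, n in self.label_counts.items():
            logp = math.log(n / total)
            for i, v in enumerate(feats):
                # add-one (Laplace) smoothing so unseen combinations don't zero us out
                num = self.feat_counts[(label, i, v)] + 1.0
                den = n + len(self.feat_values[i])
                logp += math.log(num / den)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

nb = NaiveBayes()
nb.train([(("ORD", "morning", "rain"),  "late"),
          (("ORD", "morning", "clear"), "on_time"),
          (("SFO", "evening", "clear"), "on_time")])
print(nb.predict(("ORD", "morning", "rain")))   # prints: late
----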

14. *Full Application: Regional Flavor*

15. *Hadoop Native Java API*
- don't

19. *Advanced Pig*

- Specialized joins that can dramatically speed up (or make feasible) your data transformations
- why algebraic UDFs are awesome and how to be algebraic (see the sketch after this list)
- Custom Loaders
- Performance efficiency and tunables
- using a filter after a cogroup will get pushed up by Pig, sez Jacob
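
Pig's Algebraic interface is a Java contract (Initial / Intermed / Final); this Python sketch only shows the decomposition that makes an AVG-style UDF algebraic, i.e. safe for Pig to partially evaluate in the combiner before the reducer finishes the job.

[source,python]
----
def avg_initial(value):
    """Map side, once per input value: emit a partial (sum, count)."""
    return (float(value), 1)

def avg_intermed(partials):
    """Combiner: fold any number of partials into a single partial."""
    return (sum(p[0] for p in partials), sum(p[1] for p in partials))

def avg_final(partials):
    """Reducer: same fold, then produce the actual average."""
    total, count = avg_intermed(partials)
    return total / count if count else None

# avg_final([avg_intermed([avg_initial(2), avg_initial(4)]), avg_initial(9)])  -> 5.0
----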


20. *Data Modeling for HBase-style Database*

21. *Hadoop Internals*

- What happens when a job is launched
- A shallow dive into the HDFS
File renamed without changes.
File renamed without changes.
56 changes: 0 additions & 56 deletions book-all.asciidoc

This file was deleted.

4 changes: 3 additions & 1 deletion book.asciidoc
@@ -30,7 +30,9 @@ include::13-data_munging.asciidoc[]

include::14-organizing_data.asciidoc[]

include::16-machine_learning.asciidoc[]
include::16-conceptual_model.asciidoc[]

include::17-machine_learning.asciidoc[]

include::18-java_api.asciidoc[]

