
Organizing files

commit eeb84e1b10b691b79f3b30835f8eb56c76c5f045 (1 parent: 038e734) · Philip (flip) Kromer committed Feb 17, 2013
@@ -86,12 +86,7 @@ INTERMEDIATE
- (?maybe?) SVD as Principal Component Analysis
- (?maybe?) Topic extraction using (to be determined)
-9. Interlude I: *Data Models, Data Formats, Data Management*:
- - How to design your data models
- - How to serialize their contents (orig, scratch, prod)
- - How to organize your scripts and your data
-
-10. *Statistics*
+9. *Statistics*
- Averages, Percentiles, and Normalization
- sum, average, standard deviation, etc (airline_flights)
- Percentiles / Median
@@ -107,14 +102,14 @@ INTERMEDIATE
- consistent sampling
- distributions
-11. *Time Series*
+10. *Time Series*
- Anomaly detection
- Wikipedia Pageviews
- windowing and rolling statistics
- (?maybe?) correlation of joint timeseries
- (?even mayber?) similar wikipedia pages based on pageview time series
-12. *Geographic*
+11. *Geographic*
- Spatial join (find all UFO sightings near Airports)
- mechanics of handling geo data
- Statistics on grid cells
@@ -130,7 +125,7 @@ INTERMEDIATE
- Clustering
- Pointwise mutual information
-13. *`cat` herding*
+12. *`cat` herding*
- total sort
- transformations
- `ruby -ne`
@@ -147,21 +142,19 @@ INTERMEDIATE
- `time`
- advanced hadoop filesystem (chmod, setrep, fsck)
-14. *Data munging (Semi-structured data)*: The dirty art of data munging. It's a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We'll show you street-fighting tactics that lessen the time and pain. Along the way, we'll prepare the datasets to be used throughout the book.
+13. *Data munging (Semi-structured data)*: The dirty art of data munging. It's a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We'll show you street-fighting tactics that lessen the time and pain. Along the way, we'll prepare the datasets to be used throughout the book.
- Wikipedia Articles: Every English-language article (12 million) from Wikipedia.
- Wikipedia Pageviews: Hour-by-hour counts of pageviews for every Wikipedia article since 2007.
- US Commercial Airline Flights: every commercial airline flight since 1987
- Hourly Weather Data: a century of weather reports, with hourly global coverage since the 1950s.
- "Star Wars Kid" weblogs: large collection of apache webserver logs from a popular internet site (Andy Baio's waxy.org).
-15. Interlude II: *Best Practices and Pedantic Points of style*
- - Pedantic Points of Style
- - Best Practices
- - How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they're equivalent; with some experience under your belt it's worth learning how to fluidly shift among these different models.
- - Why Hadoop
- - robots are cheap, people are important
+14. Interlude I: *Data Models, Data Formats, Data Management*:
+ - How to design your data models
+ - How to serialize their contents (orig, scratch, prod)
+ - How to organize your scripts and your data
-16. *Graph* -- some better-motivated subset of:
+15. *Graph* -- some better-motivated subset of:
- Adjacency List / Edge List conversion
- Undirecting a graph, Min-degree undirected graph
- Breadth-First Search
@@ -174,7 +167,7 @@ INTERMEDIATE
- Pagerank (centrality): Reconstruct pageview paths from web logs, and use them to identify important pages
- _(bubble)_
-17. *Machine Learning without Grad School*
+16. *Machine Learning without Grad School*
- weather & flight delays for prediction
- Naive Bayes
- Logistic Regression ("SGD")
@@ -187,6 +180,13 @@ INTERMEDIATE
- unreasonable effectiveness
- partition the data, recombine the models
- pointers for the person who is going to get fancy anyway
+
+17. Interlude II: *Best Practices and Pedantic Points of style*
+ - Pedantic Points of Style
+ - Best Practices
+ - How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they're equivalent; with some experience under your belt it's worth learning how to fluidly shift among these different models.
+ - Why Hadoop
+ - robots are cheap, people are important
PRACTICAL
@@ -1,2 +1,3 @@
-[[first_exploration]]== First Exploration
+[[first_exploration]]
+== First Exploration
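This hunk, repeated below for each chapter file, makes one fix: in AsciiDoc, gluing the block anchor to the heading (`[[first_exploration]]== First Exploration`) keeps the line from starting with `==`, so it renders as an ordinary paragraph rather than a section title. With the anchor on its own line, the heading renders normally and remains a cross-reference target. A minimal sketch (the `<<...>>` usage line is illustrative, not from the commit):

[[first_exploration]]
== First Exploration

// elsewhere in the book: As we saw in <<first_exploration>>, ...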
@@ -1,2 +0,0 @@
-[[data_management]]== Data Management
-
@@ -1,2 +1,3 @@
-[[simple_transform]]== Simple Transform
+[[simple_transform]]
+== Simple Transform
@@ -1,2 +1,3 @@
-[[transform_pivot]]== Transform Pivot
+[[transform_pivot]]
+== Transform Pivot
@@ -1,3 +1,3 @@
-[[regional_flavor]]== Regional Flavor
-
+[[regional_flavor]]
+== Regional Flavor
@@ -1,2 +1,3 @@
-[[toolset]]== Toolset
+[[toolset]]
+== Toolset
@@ -1,2 +1,3 @@
-[[filesystem_mojo]]== Filesystem Mojo
+[[filesystem_mojo]]
+== Filesystem Mojo
@@ -1,2 +1,3 @@
-[[server_logs]]== Server Logs
+[[server_logs]]
+== Server Logs
@@ -1,2 +1,3 @@
-[[text_processing]]== Text Processing
+[[text_processing]]
+== Text Processing
@@ -1,2 +1,3 @@
-[[statistics]]== Statistics
+[[statistics]]
+== Statistics
@@ -1,2 +1,3 @@
-[[time_series]]== Time Series
+[[time_series]]
+== Time Series
@@ -1,2 +1,3 @@
-[[geographic]]== Geographic
+[[geographic]]
+== Geographic
File renamed without changes.
@@ -1,2 +1,3 @@
-[[cat_herding]]== Cat Herding
+[[cat_herding]]
+== Cat Herding
@@ -1,2 +1,3 @@
-[[data_munging]]== Data Munging
+[[data_munging]]
+== Data Munging
@@ -0,0 +1,3 @@
+[[organizing_data]]
+== Organizing Data
+
@@ -1,2 +1,3 @@
-[[graphs]]== Graphs
+[[graphs]]
+== Graphs
@@ -1,2 +1,3 @@
-[[machine_learning]]== Machine Learning
+[[machine_learning]]
+== Machine Learning
@@ -1,2 +1,3 @@
-[[best_practices]]== Best Practices
+[[best_practices]]
+== Best Practices
@@ -1,2 +1,3 @@
-[[java_api]]== Java Api
+[[java_api]]
+== Java Api
@@ -1,2 +1,3 @@
-[[advanced_pig]]== Advanced Pig
+[[advanced_pig]]
+== Advanced Pig
@@ -1,2 +1,3 @@
-[[hbase_data_modeling]]== Hbase Data Modeling
+[[hbase_data_modeling]]
+== Hbase Data Modeling
@@ -1,2 +1,3 @@
-[[hadoop_internals]]== Hadoop Internals
+[[hadoop_internals]]
+== Hadoop Internals
@@ -1,2 +1,3 @@
-[[hadoop_tuning]]== Hadoop Tuning
+[[hadoop_tuning]]
+== Hadoop Tuning
@@ -1,2 +1,3 @@
-[[datasets_and_scripts]]== Datasets And Scripts
+[[datasets_and_scripts]]
+== Datasets And Scripts
@@ -1,2 +1,3 @@
-[[cheatsheets]]== Cheatsheets
+[[cheatsheets]]
+== Cheatsheets
@@ -1,2 +1,3 @@
-[[appendix]]== Appendix
+[[appendix]]
+== Appendix
@@ -1,37 +1,108 @@
= Big Data for Chimps
+include::00-preface.asciidoc[]
include::00a-about.asciidoc[]
+include::00c-hello_reviewers.asciidoc[]
+include::01-first_exploration.asciidoc[]
include::01a-first_exploration.asciidoc[]
-
-include::02a-simple_stream.asciidoc[]
-
-include::03a-locality-pivot.asciidoc[]
-
-include::03b-locality-saving_christmas.asciidoc[]
-
-include::03d-locality-efficient_santa.asciidoc[]
-
+include::02-simple_transform.asciidoc[]
+include::02a-simple_transform.asciidoc[]
+include::03-transform_pivot.asciidoc[]
+include::03a-locality.asciidoc[]
+include::03b-saving_christmas.asciidoc[]
+include::03c-simple_reshape.asciidoc[]
+include::03d-efficient_santa.asciidoc[]
include::03e-jt_and_nanette_at_work_for_you.asciidoc[]
-
-include::03f-locality-partition_and_sort_keys.asciidoc[]
-
-include::05a-server_logs.asciidoc[]
-
-include::09a-geographic_data.asciidoc[]
-
-include::17d-use_method_checklist.asciidoc[]
-
+include::03f-partition_and_sort_keys.asciidoc[]
+include::04-regional_flavor.asciidoc[]
+include::04a-defining_problem.asciidoc[]
+include::04b-smoothing_counts.asciidoc[]
+include::05-toolset.asciidoc[]
+include::05a-tools.asciidoc[]
+include::05b-launching_and_debugging.asciidoc[]
+include::05c-intro_to_wukong.asciidoc[]
+include::05d-intro_to_pig.asciidoc[]
+include::06-filesystem_mojo.asciidoc[]
+include::07-server_logs.asciidoc[]
+include::07a-server_logs.asciidoc[]
+include::08-text_processing.asciidoc[]
+include::08a-processing_text.asciidoc[]
+include::08f-gotchas.asciidoc[]
+include::09-statistics.asciidoc[]
+include::09a-summarizing.asciidoc[]
+include::09b-sampling.asciidoc[]
+include::09c-distribution_of_weather_measurements.asciidoc[]
+include::09e-exercises.asciidoc[]
+include::10-time_series.asciidoc[]
+include::10a-time_series_data.asciidoc[]
+include::11-geographic.asciidoc[]
+include::11a-spatial_join.asciidoc[]
+include::11b-multiscale_join.asciidoc[]
+include::11f-an_elephants_eye_view_of_the_world.asciidoc[]
+include::12-cat_herding.asciidoc[]
+include::12a-herding_cats.asciidoc[]
+include::13-data_munging.asciidoc[]
+include::13a-wikipedia_other.asciidoc[]
+include::13c-wikipedia_corpus.asciidoc[]
+include::13d-patterns.asciidoc[]
+include::13e-airline_flights.asciidoc[]
+include::13f-daily_weather.asciidoc[]
+include::13g-truth_and_error.asciidoc[]
+include::13h-other_strategies.asciidoc[]
+include::14a-data_formats.asciidoc[]
+include::14c-data_modeling.asciidoc[]
+include::15-graphs.asciidoc[]
+include::15a-representing_graphs.asciidoc[]
+include::15b-community_extractions.asciidoc[]
+include::15c-pagerank.asciidoc[]
+include::16-machine_learning.asciidoc[]
+include::16a-simple_machine_learning.asciidoc[]
+include::16d-misc.asciidoc[]
+include::17-best_practices.asciidoc[]
+include::17a-why_hadoop.asciidoc[]
+include::17b-how_to_think.asciidoc[]
+include::17d-cloud-vs-static.asciidoc[]
+include::17e-rules_of_scaling.asciidoc[]
+include::17f-best_practices_and_pedantic_points_of_style.asciidoc[]
+include::17g-tao_te_chimp.asciidoc[]
+include::18-java_api.asciidoc[]
+include::18a-hadoop_api.asciidoc[]
+include::19-advanced_pig.asciidoc[]
+include::19a-advanced_pig.asciidoc[]
+include::19b-pig_udfs.asciidoc[]
+include::20-hbase_data_modeling.asciidoc[]
+include::20b-hbase_and_databases.asciidoc[]
+include::21-hadoop_internals.asciidoc[]
+include::21a-hadoop_internals.asciidoc[]
include::21a-hbase_schema.asciidoc[]
-
-include::23a-cheatsheets.asciidoc[]
-
-include::30a-authors.asciidoc[]
-
-include::30b-colophon.asciidoc[]
-
-include::30c-references.asciidoc[]
-
-include::30f-glossary.asciidoc[]
-
-include::30g-back_cover.asciidoc[]
+include::21b-hadoop_internals-logs.asciidoc[]
+include::22-hadoop_tuning.asciidoc[]
+include::22a-tuning-wise_and_lazy.asciidoc[]
+include::22b-scripts.asciidoc[]
+include::22b-tuning-pathology.asciidoc[]
+include::22c-tuning-brave_and_foolish.asciidoc[]
+include::22d-use_method_checklist.asciidoc[]
+include::23-datasets_and_scripts.asciidoc[]
+include::23a-overview_of_datasets.asciidoc[]
+include::23c-datasets.asciidoc[]
+include::23c-wikipedia_dbpedia.asciidoc[]
+include::23d-airline_flights.asciidoc[]
+include::23e-access_logs.asciidoc[]
+include::23f-data_formats-arc.asciidoc[]
+include::23g-other_datasets_on_the_web.asciidoc[]
+include::23h-notes_for_chimpmark.asciidoc[]
+include::24-cheatsheets.asciidoc[]
+include::24a-unix_cheatsheet.asciidoc[]
+include::24b-regular_expression_cheatsheet.asciidoc[]
+include::24c-pig_cheatsheet.asciidoc[]
+include::24d-hadoop_tunables_cheatsheet.asciidoc[]
+include::25-appendix.asciidoc[]
+include::25a-authors.asciidoc[]
+include::25b-colophon.asciidoc[]
+include::25c-acquiring_a_hadoop_cluster.asciidoc[]
+include::25c-references.asciidoc[]
+include::25f-glossary.asciidoc[]
+include::25g-back_cover.asciidoc[]
+include::25h-TODO.asciidoc[]
+include::25i-asciidoc_cheatsheet_and_style_guide.asciidoc[]
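The rebuilt master file drops the blank lines between includes and pairs each numbered chapter stub (generated by foo.rb below) with its lettered section files. A minimal sketch of the pattern, assuming this stub/section split:

= Big Data for Chimps

// chapter stub: anchor plus title, generated by foo.rb
include::02-simple_transform.asciidoc[]
// chapter body, kept in the lettered section file
include::02a-simple_transform.asciidoc[]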
foo.rb (9 changed lines)
@@ -1,24 +1,23 @@
require 'gorillib/string/human'
-
chapters = %w[
preface
first_exploration
simple_transform
transform_pivot
- geographic_flavor
+ regional_flavor
toolset
filesystem_mojo
server_logs
text_processing
- data_management
statistics
time_series
geographic
cat_herding
data_munging
- best_practices
+ organizing_data
graphs
machine_learning
+ best_practices
java_api
advanced_pig
hbase_data_modeling
@@ -31,7 +30,7 @@
chapters.each_with_index do |name, idx|
File.open("#{"%02d" % (idx)}-#{name}.asciidoc", "w") do |file|
- file << "[[#{name}]]"
+ file << "[[#{name}]]\n"
file << "== #{name.titleize}\n" << "\n"
end
end
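For reference, here is the generator after this commit's two fixes (the renamed and reordered chapter list, and the newline after the anchor so the heading starts its own line), condensed into a runnable sketch. The chapter list is abbreviated here; foo.rb carries the full one. `titleize` comes from the gorillib require, per the script:

require 'gorillib/string/human'   # provides String#titleize

# Chapter slugs in book order (abbreviated; see the full list above).
chapters = %w[
  preface
  first_exploration
  simple_transform
  transform_pivot
  regional_flavor
]

# Write one numbered stub per chapter, e.g. 00-preface.asciidoc containing:
#   [[preface]]
#   == Preface
chapters.each_with_index do |name, idx|
  File.open("#{"%02d" % idx}-#{name}.asciidoc", "w") do |file|
    file << "[[#{name}]]\n"            # anchor ends with a newline...
    file << "== #{name.titleize}\n\n"  # ...so the heading begins its own line
  end
end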
