diff --git a/lesson1.html b/lesson1.html
index cc39644..58bc57d 100644
--- a/lesson1.html
+++ b/lesson1.html
@@ -32,84 +32,63 @@
-

What is Big Data?

+

Lesson 1: Building a Big Data Infrastructure Part 1

-
- Big data is the combination of infrastructure, algorithms, and visualizations around making sense of user and machine generated data. -
+

Unstructured Storage & Hadoop

-
- Big data does not necessarily mean: more data than you can effectively work with on a single computer. -
-
- -
-
- Big data is about gaining insight from data regardless of the size of the data set. -
-
- -
-

Questions Big Data can Answer

+

Unstructured Data

  1. -

    What are my users doing on my site?

    -
  2. -
  3. -

    Is something spam?

    +

    Log Files

  4. -

    What items or users are like each other?

    +

    Text

  5. -

    What items might a user like?

    +

    Unknown Formats

-

Types of Data

+

Hadoop

+
    +
  1. Open source
  2. +
  3. HDFS: Distributed file system modeled after GFS
  4. +
  5. MapReduce: Distributed batch processing modeled after Google's MapReduce
  6. +
+
+ +
+

Hadoop's Wider Ecosystem

+
    +
  1. HBase
  2. +
  3. ZooKeeper
  4. +
  5. Hive
  6. +
  7. Cascading
  8. +
  9. Pig
  10. +
  11. Flume
  12. +
+
+ +
+

Batch Processing

  1. -

    User Generated

    +

    Like cron

  2. -

    Machine Generated

    -
  3. -
  4. -

    Structured

    +

    Run once or frequently

  5. -

    Unstructured

    +

    Ship code to data

- -
-

Goals of a Big Data Infrastructure

-
    -
  1. -

    Scalability

    -
  2. -
  3. -

    Experimentation

    -
  4. -
  5. -

    Mining business intelligence

    -
  6. -
  7. -

    Making recommendations

    -
  8. -
  9. -

    Monitoring performance

    -
  10. -
-
-
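
Reviewer note: the new Hadoop slide describes MapReduce only as "distributed batch processing", so a small illustration may help readers of this change. The sketch below is a single-process Python imitation of the map, shuffle, and reduce steps, counting tokens in a few log lines. It is not Hadoop's API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (token, 1) pair for every whitespace-separated token in a line."""
    for token in line.lower().split():
        yield token, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in one process (hypothetical driver)."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical log lines standing in for unstructured input.
    log_lines = ["GET /home", "GET /about", "POST /home"]
    print(run_job(log_lines))
```

In a real cluster the map and reduce functions would be shipped to the nodes that hold the data blocks, which is the "ship code to data" point on the Batch Processing slide. Here everything runs locally, so only the shape of the computation is shown.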