---
Hadoop Ecosystem
---


<img src="http://datadotz.com/wp-content/uploads/2015/02/hadoop-stodgy-1345x1345.jpg" width="250">

Our little elephant is all grown-up and ready to go to work

---
By the end of this session, you should know:
----

- Advanced MapReduce design patterns
- MapReduce anti-patterns
- The difference between Hadoop 1.* and 2.*
- The "animals" in the current Hadoop "Zoo"
- What the hell YARN and ZooKeeper are

---
MapReduce Redux
---

__Think about the output first in terms of Key-Value.__

For example:

- Metrics (date, webpage, locale: #users, #visits, #abandonment)
- Membership: List of members
- Property: Value (userId: name, location, #transactions, purchase categories with frequencies)
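
To make the "output first" idea concrete, here is a minimal sketch of the Metrics example as a simulated map/reduce job in plain Python. The log records, field order, and function names are assumptions made up for illustration; a real job would run on Hadoop rather than in one process.

```python
from collections import defaultdict

# Hypothetical raw log records: (date, webpage, locale, user_id, abandoned)
LOG = [
    ("2016-02-01", "/home", "en_US", "u1", False),
    ("2016-02-01", "/home", "en_US", "u2", True),
    ("2016-02-01", "/home", "en_US", "u1", False),
    ("2016-02-01", "/cart", "fr_FR", "u3", True),
]

def mapper(record):
    """Emit the key the final report needs, plus the raw pieces of the value."""
    date, webpage, locale, user_id, abandoned = record
    yield (date, webpage, locale), (user_id, abandoned)

def reducer(key, values):
    """Collapse all values for one key into (#users, #visits, #abandonment)."""
    values = list(values)
    users = {user_id for user_id, _ in values}
    abandonments = sum(1 for _, abandoned in values if abandoned)
    return key, (len(users), len(values), abandonments)

def run_job(records):
    """Simulate the shuffle: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

metrics = run_job(LOG)
```

Here `metrics[("2016-02-01", "/home", "en_US")]` is `(2, 3, 1)`: two distinct users, three visits, one abandonment. Choosing the output key first tells you exactly what the mapper has to emit.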

[Learn WAY more here](https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/)

---
Join in MapReduce
---

![](https://chamibuddhika.files.wordpress.com/2012/02/reducesidejoin.jpg)
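
A reduce-side join can be sketched in a few lines of plain Python. This is a single-process simulation, not Hadoop code; the `USERS` and `ORDERS` tables and the `"U"`/`"O"` tags are hypothetical. Each mapper tags its rows with the source table, the shuffle groups rows by join key, and the reducer pairs rows from the two sides.

```python
from collections import defaultdict

# Hypothetical inputs: users(user_id, name) and orders(user_id, item)
USERS = [(1, "ada"), (2, "grace")]
ORDERS = [(1, "book"), (1, "pen"), (3, "lamp")]

def map_users(row):
    user_id, name = row
    return user_id, ("U", name)      # tag marks the source table

def map_orders(row):
    user_id, item = row
    return user_id, ("O", item)

def reduce_join(user_id, tagged_values):
    """Inner join: pair every user record with every order record for this key."""
    names = [v for tag, v in tagged_values if tag == "U"]
    items = [v for tag, v in tagged_values if tag == "O"]
    return [(user_id, name, item) for name in names for item in items]

def reduce_side_join(users, orders):
    # Simulated shuffle: all tagged rows with the same key land in one group.
    shuffled = defaultdict(list)
    mapped = [map_users(r) for r in users] + [map_orders(r) for r in orders]
    for key, value in mapped:
        shuffled[key].append(value)
    joined = []
    for key in sorted(shuffled):
        joined.extend(reduce_join(key, shuffled[key]))
    return joined

joined = reduce_side_join(USERS, ORDERS)
```

Only user 1 appears on both sides, so `joined` holds two rows. Note that every row of both tables crosses the network during the shuffle, which is why this join is expensive on large inputs.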

--- 
MapReduce anti-patterns
---

![](images/anti-pattern.png)

1) __NEVER DO GRAPH SEARCH__

aka How many friends do my friends have?
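
Why is graph search such a bad fit? A rough sketch, assuming a breadth-first search where each hop is one MapReduce job: every pass has to re-read the whole graph just to push the frontier out by a single step. The graph and function names below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical friendship graph as an adjacency list
GRAPH = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"], "e": ["d"]}

def bfs_pass(graph, distances):
    """One simulated MapReduce job: expand the frontier by exactly one hop."""
    # Map: every reached node proposes distance + 1 to each neighbour.
    proposals = defaultdict(list)
    for node, dist in distances.items():
        proposals[node].append(dist)
        for neighbour in graph[node]:
            proposals[neighbour].append(dist + 1)
    # Reduce: keep the smallest proposed distance per node.
    return {node: min(dists) for node, dists in proposals.items()}

def bfs(graph, source):
    distances, passes = {source: 0}, 0
    while True:
        updated = bfs_pass(graph, distances)
        passes += 1
        if updated == distances:      # nothing changed: converged
            return distances, passes
        distances = updated

distances, passes = bfs(GRAPH, "a")
```

This five-node chain already needs four passes (three to reach `e`, one more to detect convergence). On a real cluster each pass is a separate job with its own startup cost and a round trip through HDFS.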

2) Full Table Join

```sql
SELECT *
FROM huge_table
JOIN some_other_huge_table
  ON huge_table.id = some_other_huge_table.id  -- `id` is a placeholder join key
```

Remember "data locality": compute is pushed to storage.

![](images/slow.png)

![](images/filter.png)

![](images/merge.png)

__Extra points for MERGE JOIN__:
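
Assuming "merge join" here means the classic sort-merge join, a minimal sketch looks like this. Both inputs must already be sorted by key; the sample data is hypothetical.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are ALREADY sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on the right side...
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # ...and pair it with every left row that has the same key.
            while i < len(left) and left[i][0] == lkey:
                out.extend((lkey, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out

LEFT = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
RIGHT = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
result = merge_join(LEFT, RIGHT)
```

Because both sides are sorted, each input is scanned once from front to back and nothing has to be held in a hash table, which is what makes the approach attractive for two large inputs.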

__Optional Reading:__

[Apache Hadoop: Best Practices and Anti-Patterns](https://developer.yahoo.com/blogs/hadoop/apache-hadoop-best-practices-anti-patterns-465.html)

[The Top 7 Hadoop Patterns and Anti-patterns](https://oracleus.activeevents.com/2014/connect/fileDownload/session/FAB2683E5956C5B4E00139F262A6D055/CON3515_Holmes-javaone-2014-top-hadoop-patterns.pdf)

---
Limitations of Hadoop 1.x
---

- No horizontal scalability of NameNode
- Does not support NameNode High Availability
- Overburdened JobTracker
- Not possible to run non-MapReduce Big Data applications on HDFS
- Does not support Multi-tenancy

Points to Ponder
----

<details><summary>
Can you use hundreds of Hadoop DataNodes for any processing other than MapReduce in Hadoop 1.x? Why?
</summary>
 No. Hadoop 1.x dedicates all the DataNode resources to Map and Reduce slots
<br>

</details>

<details><summary>
Can you use Hadoop for Real-time processing? Why?
</summary>
 No. Hadoop is designed and developed for massively parallel batch processing.
<br>

</details>


<details><summary>
 Can you store 1 billion files in a Hadoop 1.x cluster? Why?
</summary>
 No. Even though you have hundreds of DataNodes in the cluster, the NameNode keeps all its metadata in memory, so the single NameNode in Hadoop 1.x limits the entire cluster to roughly 50-100M files.
<br>

</details>
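
A back-of-envelope calculation shows why the NameNode becomes the bottleneck. The 150 bytes per namespace object used below is a commonly quoted rule of thumb, not a measured value, and the one-block-per-file default is also an assumption; treat the results as rough orders of magnitude.

```python
BYTES_PER_OBJECT = 150  # ASSUMPTION: rule-of-thumb heap cost per file or block object

def namenode_heap_gb(num_files, blocks_per_file=1):
    """Rough NameNode heap needed for file + block objects (directories ignored)."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1e9

heap_100m = namenode_heap_gb(100_000_000)    # about 30 GB
heap_1b = namenode_heap_gb(1_000_000_000)    # about 300 GB
```

Under these assumptions 100M small files need about 30 GB of heap on one machine, while 1 billion files would need about 300 GB, which a single NameNode JVM cannot realistically provide.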


![](images/hadoop_1_summary.png)

---
Hadoop 1.0 vs 2.0
---

![](images/1-to-2.png)

---
YARN
---

![](http://static01.nyt.com/images/2011/05/19/fashion/Z-YARN-P1-2/Z-YARN-P1-2-articleLarge.jpg)

- __Y__et
- __A__nother
- __R__esource
- __N__egotiator

A global ResourceManager (RM) and per-application ApplicationMaster (AM)

Not tied to MapReduce (Praise the Heavens 🙌)
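
The RM/AM split can be pictured with a toy model. This is not the YARN API: the class names, the first-fit placement, and the memory-only resource model are all simplifying assumptions. The point is only that one global scheduler hands out containers to any kind of application, not just MapReduce.

```python
class ResourceManager:
    """Toy global scheduler: tracks free memory per node and grants containers."""
    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)

    def allocate(self, memory_mb):
        # First-fit placement (an assumption; real schedulers are far richer).
        for node, free in self.free.items():
            if free >= memory_mb:
                self.free[node] = free - memory_mb
                return node
        return None  # cluster is full for a container of this size

class ApplicationMaster:
    """Toy per-application master: asks the RM for containers for its tasks."""
    def __init__(self, name, rm):
        self.name, self.rm, self.containers = name, rm, []

    def request(self, count, memory_mb):
        for _ in range(count):
            node = self.rm.allocate(memory_mb)
            if node is None:
                break
            self.containers.append((node, memory_mb))
        return len(self.containers)

rm = ResourceManager({"node1": 4096, "node2": 4096})
mapreduce_am = ApplicationMaster("mapreduce-job", rm)
spark_am = ApplicationMaster("spark-job", rm)
granted_mr = mapreduce_am.request(count=3, memory_mb=1024)
granted_spark = spark_am.request(count=3, memory_mb=2048)
```

The MapReduce application gets all three of its 1 GB containers; the Spark application gets two of its three 2 GB containers before the cluster runs out of room. Both share the same pool instead of fixed map and reduce slots.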

![](http://image.slidesharecdn.com/yarn-china-131217110834-phpapp02/95/apache-hadoop-yarn-understanding-the-data-operating-system-of-hadoop-7-638.jpg?cb=1387278584)

 

---
Rearchitected for modularity
---

![](images/YARN.png)

---
Other important Hadoop 2.0 features
----

- HDFS Snapshots
- Improved (i.e., NFSv3) access to data in HDFS
- Support for running Hadoop on MS Windows (if you like that kind of thing)
- Binary Compatibility for MapReduce applications built on Hadoop 1.0
- Substantial amount of Integration testing with rest of the projects (such as PIG, HIVE) in Hadoop ecosystem


---
(YAAD) Yet Another Architecture Diagram
---

![](images/yaard.png)

---
Coordination in a distributed system
---

Coordination: An act that multiple nodes must perform together.

Examples:

- Group membership
- Locking
- Publisher/Subscriber
- Leader Election
- Synchronization


__Getting node coordination correct is very hard!__

---
ZooKeeper (ZK)
---

![](https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/7/22/1311338718286/Zookeeper-007.jpg)

![](https://www.mapr.com/sites/default/files/zookeeper-image.png)

__ZooKeeper (ZK)__ is an open-source solution that enables highly reliable distributed coordination.

It is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

It exposes common services through a simple interface:

- Naming
- Configuration management
- Locks & synchronization
- Group services

… developers don't have to write them from scratch

Build your own on it for specific needs. 
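
As an example of building your own service on top, here is a toy, in-memory imitation of the well-known leader-election recipe: every participant creates an ephemeral sequential node, and whoever holds the lowest sequence number is the leader. `ToyCoordinator` and its methods are invented for this sketch and are not the ZooKeeper client API.

```python
class ToyCoordinator:
    """In-memory stand-in for ZooKeeper's ephemeral sequential znodes."""
    def __init__(self):
        self.counter = 0
        self.znodes = {}  # sequence number -> client name

    def join(self, client):
        """Like creating an ephemeral sequential node under an election path."""
        seq = self.counter
        self.counter += 1
        self.znodes[seq] = client
        return seq

    def disconnect(self, client):
        """Ephemeral nodes vanish when their owner's session ends."""
        self.znodes = {s: c for s, c in self.znodes.items() if c != client}

    def leader(self):
        """The client holding the lowest sequence number leads."""
        return self.znodes[min(self.znodes)] if self.znodes else None

zk = ToyCoordinator()
for worker in ["worker-a", "worker-b", "worker-c"]:
    zk.join(worker)
first_leader = zk.leader()
zk.disconnect("worker-a")
second_leader = zk.leader()
```

When `worker-a` drops out, its node disappears and `worker-b` takes over without any extra negotiation code. In a real deployment the ordering and session tracking are what ZooKeeper provides for you.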

![](https://cwiki.apache.org/confluence/download/attachments/24193436/service.png?version=1&modificationDate=1295027310000&api=v2)

ZooKeeper Service is replicated over a set of machines

All machines store a copy of the data (in memory)

A leader is elected on service startup

Each client connects to a single ZooKeeper server and maintains a TCP connection to it.

A client can read from any ZooKeeper server; writes go through the leader and need majority consensus.
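
"Majority consensus" has a simple arithmetic consequence for how many servers you can lose. A small sketch (the function names are made up for illustration):

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)

# ensemble size -> (quorum size, tolerated failures)
table = {n: (quorum_size(n), tolerated_failures(n)) for n in (3, 4, 5, 7)}
```

An ensemble of 3 and an ensemble of 4 both survive only one failure, which is why ZooKeeper ensembles are usually sized with an odd number of servers.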


---
ZooKeeper Consistency Guarantees
---

- Sequential Consistency: Updates from a client are applied in the order they were sent.
- Atomicity: Updates either succeed or fail; there are no partial results.
- Single System Image: A client sees the same view of the service regardless of the ZK server it connects to.
- Reliability: Once an update has been applied, it persists until a client overwrites it.
- Timeliness: The clients’ view of the system is guaranteed to be up-to-date within a certain time bound. (Eventual Consistency)


---
ZK References
---

- "The Chubby lock service for loosely-coupled distributed systems", Google Research (7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006)
- "ZooKeeper: Wait-free coordination for Internet-scale systems", Yahoo Research (USENIX Annual Technical Conference, 2010)
- Apache ZooKeeper Home: http://zookeeper.apache.org/
- Presentations:
    - http://www.slideshare.net/mumrah/introduction-to-zookeeper-trihug-may-22-2012
    - http://www.slideshare.net/scottleber/apache-zookeeper
    - https://cwiki.apache.org/confluence/display/ZOOKEEPER/ZooKeeperPresentations


Check for understanding
----

<details><summary>
What is the relationship between YARN and ZooKeeper?
</summary>
YARN is the resource-management layer introduced in Hadoop 2 (sometimes called MRv2), and its primary job is to take jobs and run them in the Hadoop cluster. So it primarily farms out and manages the cluster workload. <br>
<br>
ZooKeeper provides a distributed configuration service, a synchronization service, and a naming registry for distributed systems. It is used by many daemons (including YARN) to manage their peers in a multi-node setup for high availability.
<br>
![](http://image.slidesharecdn.com/yarnprojectpreview-121122042407-phpapp01/95/high-availability-in-yarn-project-preview-3-638.jpg?cb=1353558314)
[Source](https://www.quora.com/What-is-the-relationship-between-YARN-and-ZooKeeper-which-both-manage-a-cluster-of-nodes)
</details>

---
Hadoop is written in Java therefore ...
---

it is __compiled!__

It matters which versions are on your cluster.

Welcome to __Hadoop Jar Hell__
![](images/hell.png)

---
Pick the right tool
---

![](http://image.slidesharecdn.com/sparkforthebusinessanalyst-150812021110-lva1-app6891/95/spark-for-the-business-analyst-25-638.jpg?cb=1439345583)

---
Summary
---

- We explored common advanced patterns
- You have been warned about anti-patterns
- Hadoop 2.0 is better: it enables YARN (and Spark)
- YARN maximizes resource use.
- I introduced ZooKeeper just to confuse you. ZK keeps clusters from being confused.