---
Hive: SQL on Hadoop
====

![](http://hortonworks.com/wp-content/uploads/2013/05/hive_logo.png)

By the end of this session, you should know:
---

- What Hive is and why it is important
- The architecture of Hive
- How to write Hive queries
- The limitations of Hive

----
Why Hive?
--------

Instead of writing MapReduce programs, what if we could write SQL?

![](https://media.giphy.com/media/B33saVGoouwNi/giphy.gif)


---
Hive Overview
---

Step 1: Write SQL in Hive

Step 2: The SQL is then translated to MapReduce for you

Step 3:
![](http://www.reactiongifs.com/r/suitm.gif)

Hive Example
------------

```sql
-- `user` is a reserved word in Hive 1.2 and later, so the table is named `users` here.
SELECT users.*
FROM users
WHERE users.active = 1;
```

![](https://media.giphy.com/media/amZfgEVrf84Xm/giphy.gif)
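
As a rough sketch of what the translation involves, an aggregation like the one below would typically compile to a single MapReduce job. The `users` table and its `country` column are hypothetical, and the comments describe the general shape of the plan, not Hive's exact output.

```sql
-- Hypothetical table: users(id, country, active)
SELECT country, COUNT(*) AS active_users
FROM users
WHERE active = 1        -- map phase: filter rows, emit country as the key
GROUP BY country;       -- shuffle on country; reduce phase: combine the counts
```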


----
Hive definition
----

> The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

![](http://d287f0h5fel5hu.cloudfront.net/blog/wp-content/uploads/2014/09/pighive3.png)
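
The "custom mappers and reducers" hook in the definition above is exposed in HiveQL as the `TRANSFORM` clause, which streams rows through an external script. A minimal sketch, assuming a hypothetical script `normalize_email.py` that reads tab-separated rows on stdin and writes tab-separated rows to stdout, and a hypothetical `users` table:

```sql
-- Ship the (hypothetical) script to the cluster nodes.
ADD FILE normalize_email.py;

-- Stream each row through the script; Hive passes columns as tab-separated text.
SELECT TRANSFORM (id, email)
       USING 'python normalize_email.py'
       AS (id, email_normalized)
FROM users;
```

Unless types are declared in the `AS` clause, the output columns come back as strings.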

---
Hive 101
---- 

- Hive was developed at Facebook.
- Data analysts (and lazy data scientists) can use SQL instead of MapReduce to process data.

---
Hive Architecture: Simple
---
![](http://image.slidesharecdn.com/hiveauthorization2-130628125713-phpapp01/95/apache-hive-authorization-models-3-638.jpg?cb=1372424367)

- Hive stores metadata about tables in the *MetaStore*

- When a HiveQL query is submitted, Hive looks up the table definitions in the
  MetaStore and compiles the query into one or more MapReduce jobs.

- It then submits those jobs to the cluster.
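
The statements below are one way to see both pieces at work, assuming a hypothetical `users` table. The first two are answered from the MetaStore alone; `EXPLAIN` prints the plan Hive compiles for a query without running it.

```sql
-- Answered from the MetaStore; no MapReduce job is launched.
SHOW TABLES;
DESCRIBE FORMATTED users;

-- Print the stages Hive would run for this query.
EXPLAIN
SELECT country, COUNT(*)
FROM users
GROUP BY country;
```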


---
Hive Architecture: Not Simple
---

![](http://image.slidesharecdn.com/hive-20110630junhocho-110701075800-phpapp01/95/replacing-telco-dbdw-to-hadoop-and-hive-40-728.jpg?cb=1309507873)

The MetaStore is an RDBMS
----

Derby by default.

Most people roll with MySQL.

Points to Ponder
--------

<details><summary>
Why does the MetaStore not use HDFS to store the metadata?
</summary>
1. The metadata changes constantly.<br>
2. The metadata changes every time you add a new table or alter a table.<br>
3. HDFS files are write-once: they can be appended to but not modified in place.<br>
</details>

<details><summary>
What is the benefit of having a shared MySQL-backed MetaStore instead
of each user having his/her own Derby-backed MetaStore?
</summary>
1. Sharing the MetaStore means everyone uses the same table and column names.<br>
2. This makes the SQL consistent and portable across the organization.<br>
</details>

---
Q: How can we analyze Big Data without writing MapReduce jobs? <br>
---
A: Dark Magic
---

![](http://randomwallpapers.net/dark-magic-fantasy-wizard-football-cool-1920x1200-wallpaper121552.jpg)

- Apache Hive runs SQL queries against large datasets.

- Datasets can be in HDFS, S3, and other Hadoop-compatible filesystems.

- Used for queries against data without writing Java/Python code.

- Used for ad hoc queries against raw files that carry no schema of their own; the schema is applied at read time.
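
A sketch of how a schema is projected onto files that already exist. The path, table name, and columns are placeholders; an `s3a://` URI could stand in for the HDFS path on a cluster configured for S3.

```sql
-- Hypothetical log files already sitting in HDFS as tab-separated text.
CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
  user_id   BIGINT,
  url       STRING,
  view_time STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/page_views';

-- The files are parsed only now, at query time (schema on read).
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```

Dropping an external table removes only the metadata; the files stay where they are.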

----
Hive vs RDBMS
-------------

What are the pros and cons of Hive vs traditional databases (RDBMS)?

Hive                                  |RDBMS
----                                  |-----
HiveQL (SQL-like, not full SQL-92)    |SQL Interface
Schema On Read                        |Schema On Write
Write Once, Read Many Times           |Read Many Times, Write Many Times
Optimal for static data               |Optimal for dynamic data
Transactions not supported            |Transactions supported
Highly distributed processing via MR  |Limited distributed processing
Handles 100s of Petabytes             |Handles 10s of Terabytes
Uses commodity hardware               |Uses proprietary hardware
Slow response time                    |Fast response time
For batch processing                  |For realtime processing
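
The "write once, read many" and batch rows usually show up in practice as bulk rewrites instead of row-level updates. A sketch, assuming hypothetical `page_views` and `daily_url_views` tables, the latter partitioned by a `dt` string column; the date is illustrative only.

```sql
-- Recompute one day's summary in a single batch job,
-- replacing whatever the partition held before.
INSERT OVERWRITE TABLE daily_url_views PARTITION (dt = '2015-01-01')
SELECT url, COUNT(*) AS views
FROM page_views
WHERE view_time LIKE '2015-01-01%'
GROUP BY url;
```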




----
Big Data SQL
-------------

What other Big Data SQL technologies are out there? 
Why should I use Hive?

Project            |Elevator Pitch
-------            |--------------
Hive               |Mature, full-featured
Hive on Tez        |Hive running on Tez instead of MapReduce
Hive on Spark      |Hive running on Spark instead of MapReduce
Spark SQL          |SQL from scratch on Spark; not as full-featured
Shark              |Hive on Spark; abandoned
Impala             |SQL without MapReduce using C++; focus on speed
Phoenix            |SQL over HBase
Trafodion          |SQL engine by HP
Presto             |SQL engine by team at Facebook
Drill              |SQL engine by MapR, does well on some queries
Flink              |Stream-processing engine with a SQL interface
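
For the Hive variants in the table, the engine is chosen per session through the `hive.execution.engine` property, assuming the corresponding engine is installed on the cluster. The `users` table is hypothetical.

```sql
-- Valid values are mr (MapReduce), tez, and spark.
SET hive.execution.engine=tez;

-- Subsequent queries in this session run on the selected engine.
SELECT COUNT(*) FROM users;
```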

---
Summary
---

- Fact: More people know SQL than MapReduce
- Fact: Declarative programming is easier than functional programming
- Solution: Hive, declarative SQL translated to MapReduce
- Hive architecture sits on top of MapReduce
- Write in SQL-ish language (HiveQL)
- Still a (sloooooow) batch system