# Hive Command Note

**Outline**

* [Introduction](#intro)
* [Syntax](#syntax)
* [Reference](#refer)

---

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to 
summarize Big Data, and makes querying and analyzing easy.

* **Access Hive**: in the command line, type `hive`
* **Run a Hive script**: `hive -f xxx.hql`

> **Database in HIVE**

Each database is a collection of tables. 
[tutorialspoint: create database](http://www.tutorialspoint.com/hive/hive_create_database.htm)

In [None]:
-- create a database (IF NOT EXISTS is optional)
CREATE DATABASE [IF NOT EXISTS] userdb;

-- show all the databases
SHOW DATABASES;

-- switch to a database; every table created afterwards belongs to it
USE databaseName;

In [None]:
-- drop database
DROP DATABASE IF EXISTS userdb;

> **Create Table**

1. employees.csv -> HDFS
2. create table & load employees.csv
3. drop the employees table. Be careful: `employees` here is a managed table, so dropping it deletes the data as well (the CSV that `LOAD DATA INPATH` moved into the Hive warehouse), not just the table definition. To keep the data, create an external table instead.
    * External tables: if you drop them, the data in HDFS will NOT be deleted.
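The difference between the two table kinds can be sketched in HiveQL as follows. This is a minimal illustration rather than the lab's exact code: the table names `employees_managed` and `employees_ext` and the HDFS directory `/user/hypothetical/employees` are assumptions made up for the example.

```sql
-- Managed table: Hive owns the data under its warehouse directory.
CREATE TABLE employees_managed(
    name STRING,
    salary FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- Dropping a managed table removes the metadata AND the data files.
DROP TABLE employees_managed;

-- External table: Hive only records where the data lives.
-- The LOCATION below is a hypothetical HDFS directory holding employees.csv.
CREATE EXTERNAL TABLE employees_ext(
    name STRING,
    salary FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hypothetical/employees';

-- Dropping an external table removes only the metadata; the files stay in HDFS.
DROP TABLE employees_ext;
```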

**Data Types**
* **Integers**
    * *TINYINT*—1 byte integer
    * *SMALLINT*—2 byte integer
    * *INT*—4 byte integer
    * *BIGINT*—8 byte integer
* **Boolean type**
    * *BOOLEAN*—TRUE/FALSE
* **Floating point numbers**
    * *FLOAT*—single precision
    * *DOUBLE*—Double precision
* **Fixed point numbers**
    * *DECIMAL*—a fixed point value of user defined scale and precision
* **String types**
    * *STRING*—sequence of characters in a specified character set
    * *VARCHAR*—sequence of characters in a specified character set with a maximum length
    * *CHAR*—sequence of characters in a specified character set with a defined length
* **Date and time types**
    * *TIMESTAMP*— a specific point in time, up to nanosecond precision
    * *DATE*—a date
* **Binary types**
    * *BINARY*—a sequence of bytes

**Complex Types**
* **Structs**: elements are accessed using dot (.) notation. For example, for a column c of type `STRUCT<a:INT, b:INT>`, the a field is accessed by the expression c.a
    * format: `<first, second>`
    * access: mystruct.first
* **Maps (key-value tuples)**: elements are accessed using ['element name'] notation. For example, in a map M containing a mapping from 'group' -> gid, the gid value can be accessed using M['group']
    * format: key based
    * access: myMap['KEY']
* **Arrays (indexable lists)**: all elements in an array must have the same type. Elements are accessed using the [n] notation, where n is a zero-based index into the array. For example, for an array A with the elements ['a', 'b', 'c'], A[1] returns 'b'.
    * format: index based
    * access: myarray[0]
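The three access notations can be combined in a single query. The sketch below assumes a table shaped like the `myemployees` table defined in this note; the map key `'tax'` is a hypothetical key chosen only for illustration.

```sql
-- Assumes columns: subordinates ARRAY<STRING>, deductions MAP<STRING, FLOAT>,
-- and address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
SELECT
    name,
    subordinates[0]   AS first_subordinate,  -- array: zero-based index
    deductions['tax'] AS tax_deduction,      -- map: lookup by key (hypothetical key)
    address.city      AS city                -- struct: dot notation
FROM myemployees;
```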

* **ROW FORMAT DELIMITED**: one row per line
* **FIELDS TERMINATED BY ','**: split column by comma

In [None]:
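-- use an external table in this example
-- `timestamp` is backtick-quoted because TIMESTAMP is a reserved keyword in recent Hive versions
CREATE EXTERNAL TABLE movies(
    userid INT,
    movieid INT,
    rating INT,
    `timestamp` TIMESTAMP)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';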

In [None]:
-- ROW FORMAT DELIMITED: the file contains one record per line, so each newline starts a new record
-- FIELDS TERMINATED BY ',': columns are separated by commas
-- COLLECTION ITEMS TERMINATED BY '#': items inside an array, struct, or map are separated by `#`
-- MAP KEYS TERMINATED BY '-': each map key is separated from its value by `-`
-- LINES TERMINATED BY '\n': records are separated by newlines
CREATE TABLE myemployees(
    name STRING,
    salary FLOAT,
    subordinates ARRAY<STRING>,
    deductions MAP<STRING, FLOAT>,
    address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '#'
MAP KEYS TERMINATED BY '-'
LINES TERMINATED BY '\n';

> **load file from hdfs into hive**

[StackOverFlow: Which is the difference between LOAD DATA INPATH and LOAD DATA LOCAL INPATH in HIVE](https://stackoverflow.com/questions/43204716/which-is-the-difference-between-load-data-inpath-and-load-data-local-inpath-in-h/43205970)

In [None]:
-- load data into table movie; without LOCAL, the path is an HDFS path
-- note that the original file in hdfs://hw5/ is moved to 'hdfs://wolf.xxx.ooo.edu:8000/user/hive/warehouse/jchiu.db/movie/u.data' by this command
LOAD DATA INPATH 'hw5/u.data' into table movie;

-- load data into table movie from the local file system
-- the optional LOCAL keyword marks the path as a local path rather than an HDFS path
-- with LOCAL, the file is copied (not moved) into the Hive directory
LOAD DATA LOCAL INPATH 'localpath' into table movie;
LOAD DATA LOCAL INPATH '/home/public/course/recommendationEngine/u.data' into table movies;

In [None]:
-- create an external table
CREATE EXTERNAL TABLE myemployees

In [None]:
LOAD DATA INPATH '...' INTO TABLE employees;

> **see column name; describe table**

In [None]:
-- method 1
DESCRIBE database.tablename;

-- method 2
USE database;
DESCRIBE tablename;
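When more than the column list is needed, `DESCRIBE FORMATTED` also reports details such as the table type (managed or external) and the storage location. `tablename` is the same placeholder as in the cell above.

```sql
-- Prints column names and types, followed by detailed table information
DESCRIBE FORMATTED tablename;
```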

> **Query**

In [None]:
SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference 
[WHERE where_condition] 
[GROUP BY col_list] 
[HAVING having_condition] 
[ORDER BY col_list] 
[LIMIT number];

In [None]:
SELECT address.city FROM employees;

> **show tables**

In [None]:
-- if a database is already in use, this shows the tables in that database; otherwise it shows the tables in the default database
show tables;

> **drop tables**

Square brackets mark optional clauses. When a clause is used, leave out the brackets themselves.

In [None]:
DROP TABLE [IF EXISTS] table_name;

> **create view in hive**

In [None]:
CREATE VIEW [IF NOT EXISTS] emp_30000 AS
SELECT * FROM employee
WHERE salary>30000;

> **drop a view**

In [None]:
DROP VIEW view_name;

> **join**

[tutorialspoint: hiveql join](https://www.tutorialspoint.com/hive/hiveql_joins.htm)

The join syntax is essentially the same as in SQL.
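A minimal sketch of an inner and a left outer join follows. The tables `customers(id, name)` and `orders(oid, customer_id, amount)` are hypothetical and exist only for this example.

```sql
-- Inner join: only customers that have at least one order
SELECT c.id, c.name, o.amount
FROM customers c
JOIN orders o
  ON (c.id = o.customer_id);

-- Left outer join: every customer; amount is NULL when there is no matching order
SELECT c.id, c.name, o.amount
FROM customers c
LEFT OUTER JOIN orders o
  ON (c.id = o.customer_id);
```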

> **hive built in aggregation functions**

[treasuredata: hive-aggregate-functions](https://docs.treasuredata.com/articles/hive-aggregate-functions)
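As a small illustration of the aggregate functions linked above, the query below summarizes salaries per state. It assumes an `employees` table with the `salary` column and `address` struct used elsewhere in this note.

```sql
-- One output row per state
SELECT
    address.state AS state,
    COUNT(*)      AS num_employees,
    AVG(salary)   AS avg_salary,
    MAX(salary)   AS max_salary
FROM employees
GROUP BY address.state;
```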

> **hive built in operators**

[tutorialspoint: built-in operators](https://www.tutorialspoint.com/hive/hive_built_in_operators.htm)

These operators cover NULL checks, equality comparisons, and so on.
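A few common cases, sketched against the `employees` table used in this note (the replacement value `'unknown'` is an arbitrary choice for the example):

```sql
-- Rows where salary is missing
SELECT name FROM employees WHERE salary IS NULL;

-- Substitute a default when the city is missing
SELECT name, COALESCE(address.city, 'unknown') AS city FROM employees;

-- `=` evaluates to NULL when either side is NULL, so rows with a NULL salary never match here
SELECT name FROM employees WHERE salary = 30000;
```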

> **writing data into the filesystem from queries**

[hive doc](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries)

* If LOCAL keyword is used, Hive will write data to the directory on the local file system.
* Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type, then those columns are serialized to JSON format.

In [None]:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1  
  SELECT ... FROM ...
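A concrete version of the template above might look like the following sketch. The output directory `/tmp/hypothetical_export` is an assumption, and the `ROW FORMAT` clause (available in Hive 0.11 and later) is added here to write comma-separated text instead of the default ^A separator.

```sql
-- OVERWRITE replaces whatever is already in the target directory
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hypothetical_export'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT name, salary
FROM employees;
```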

> **Create User Defined Functions (UDF)**

**Steps**
* write the function in Java
* package it as a jar file
* import (add) the jar file in Hive
* use the UDF in a query
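The last two steps can be sketched in HiveQL as follows. The jar path, the function name `to_upper`, and the class name `com.example.hive.ToUpperUDF` are all hypothetical; substitute the jar and class you actually built.

```sql
-- Make the compiled jar available to the session (hypothetical path)
ADD JAR /home/hypothetical/udfs/my-udfs.jar;

-- Bind a function name to the Java class inside the jar (hypothetical class name)
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.ToUpperUDF';

-- Call it like any built-in function
SELECT to_upper(name) FROM employees;
```

A temporary function only lasts for the current session, so the `ADD JAR` and `CREATE TEMPORARY FUNCTION` statements need to be rerun in each new session.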

# Lab Material

In [None]:
-- sample code from lab

CREATE EXTERNAL TABLE employees(
name STRING,
salary FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
LOAD DATA INPATH 'employees.csv' into table employees;

CREATE DATABASE msia;
SHOW DATABASES;
USE msia;
SHOW TABLES;
-- drop the database last; USE msia fails once the database is gone
DROP DATABASE msia;

CREATE TABLE employees(
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '#'
MAP KEYS TERMINATED BY '-'
LINES TERMINATED BY '\n';
LOAD DATA INPATH 'employees.csv' into table employees;

CREATE TABLE t (
s STRING,
f FLOAT,
a ARRAY<MAP<STRING, STRUCT<p1:INT, p2:INT>>>);

---

# <a id='refer'>Reference</a>

* [Tutorialspoint Hive Tutorial](https://www.tutorialspoint.com/hive/index.htm)
* [Hive tutorial doc](https://cwiki.apache.org/confluence/display/Hive/Tutorial)