
Notes on Hive and S3

mmalohlava edited this page Mar 29, 2013 · 10 revisions

Hive

Starting hive

There are two ways to use Hive:

  • local installation
  • use map-reduce functionality in AWS
    • AWS console -> map-reduce -> start new interactive hive job
    • connect to the master server (the master is tagged with the MASTER role tag - see the EC2 management console)
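
For the AWS route, connecting and starting Hive looks roughly like this (the key file and master hostname are placeholders, not values from this setup):

ssh -i ~/mykey.pem hadoop@<master-public-dns>
hive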

Using hive

Hive supports common SQL syntax: create table, show tables, describe <table>, ...

To run an HQL script, simply type: hive < script.sql
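
For example, a quick interactive session might look like this (the table name is illustrative):

show tables;
describe iris;
select count(*) from iris;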

Loading Iris data into S3 bucket

  • Create local iris table
create table iris (sepal_len float, sepal_width float, petal_len float, petal_width float, response STRING) row format delimited fields terminated by ',';
  • Load data from local file iris.data
load data local inpath 'iris.data' into table iris;
  • Create a table in S3 bucket
create external table s3iristable (sepal_len float, sepal_width float, petal_len float, petal_width float, response STRING) ROW format DELIMITED location 's3n://hive-tests/iris';
  • Insert data from local table
insert overwrite table s3iristable select * from iris;
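  • To check that the export worked, run a simple count over the S3-backed table (the standard iris dataset has 150 rows):
select count(*) from s3iristable;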

Loading airlines data into S3

The objective is to load the airlines data into an S3 bucket and partition it by year and month.

Creating local table

-- create table
create table airlines (Year int,Month int,DayofMonth int,DayOfWeek int,DepTime int,CRSDepTime int,ArrTime int,CRSArrTime int,UniqueCarrier string,FlightNum int,TailNum string,ActualElapsedTime int,CRSElapsedTime int,AirTime int,ArrDelay int,DepDelay int,Origin string,Dest string,Distance int,TaxiIn int,TaxiOut int,Cancelled int,CancellationCode string,Diverted int,CarrierDelay int,WeatherDelay int,NASDelay int,SecurityDelay int,LateAircraftDelay int,IsArrDelayed string,IsDepDelayed string) row format delimited fields terminated by ',';

-- load data
load data local inpath 'allyears1m.csv' into table airlines;

Creating partitioned S3 table and loading data

-- create S3 table:
-- create dynamic partition per years and months
create external table s3airlines (Year int,Month int,DayofMonth int,DayOfWeek int,DepTime int,CRSDepTime int,ArrTime int,CRSArrTime int,UniqueCarrier string,FlightNum int,TailNum string,ActualElapsedTime int,CRSElapsedTime int,AirTime int,ArrDelay int,DepDelay int,Origin string,Dest string,Distance int,TaxiIn int,TaxiOut int,Cancelled int,CancellationCode string,Diverted int,CarrierDelay int,WeatherDelay int,NASDelay int,SecurityDelay int,LateAircraftDelay int,IsArrDelayed string,IsDepDelayed string) partitioned by (theyear int, themonth int) row format delimited location 's3n://hive-tests/airlines/allyears1m/';

-- setup dynamic partitioning for hive queries
set hive.optimize.s3.query=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.enforce.bucketing = true;

-- insert data from local table
insert overwrite table s3airlines partition (theyear, themonth) select a.*,a.Year,a.Month from airlines a;
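
-- Once loaded, queries filtering on the partition columns only read
-- the matching S3 directories (the year/month values are illustrative):
select count(*) from s3airlines where theyear = 2008 and themonth = 1;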

Creating clustered table and loading data

--
-- Create S3 table:
--  * cluster per year and month
CREATE EXTERNAL TABLE s3airlines (Year int,Month int,DayofMonth int,DayOfWeek int,DepTime int,CRSDepTime int,ArrTime int,CRSArrTime int,UniqueCarrier string,FlightNum int,TailNum string,ActualElapsedTime int,CRSElapsedTime int,AirTime int,ArrDelay int,DepDelay int,Origin string,Dest string,Distance int,TaxiIn int,TaxiOut int,Cancelled int,CancellationCode string,Diverted int,CarrierDelay int,WeatherDelay int,NASDelay int,SecurityDelay int,LateAircraftDelay int,IsArrDelayed string,IsDepDelayed string) CLUSTERED BY (Year, Month) SORTED BY (Month) INTO 144 BUCKETS ROW FORMAT DELIMITED LOCATION 's3n://hive-tests-airlines/allyears10k/';

--
-- setup bucketing for hive queries
--
set hive.optimize.s3.query=true;
set hive.enforce.bucketing = true;

-- insert data from local table
INSERT OVERWRITE TABLE s3airlines SELECT a.* FROM airlines a;
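
-- Bucketing makes sampling cheap: TABLESAMPLE can read a single bucket
-- instead of scanning the full table (a sketch; effective bucket pruning
-- on S3 depends on the Hive version):
SELECT * FROM s3airlines TABLESAMPLE(BUCKET 1 OUT OF 144 ON Year) s LIMIT 10;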

Dropping tables

drop table airlines;
drop table s3airlines;

Notes

  • Do not use the plain s3 protocol; use s3n instead
  • Hive seems to reject a bare bucket root such as s3n://tests/; a directory inside the bucket is required, e.g. s3n://tests/table
  • Hive does not support nested directories - do not use s3n://tests/table/subtable/
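
As an illustration of the last two notes (the bucket name is made up):

-- works: table directory directly inside the bucket
create external table t1 (x int) location 's3n://tests/table';
-- problematic: bare bucket root or nested directories
-- location 's3n://tests/'
-- location 's3n://tests/table/subtable/'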
