# Basic operations in Hive

* *30 min* | Last modified: June 22, 2019

This tutorial is based on https://es.hortonworks.com/tutorial/beginners-guide-to-apache-pig/

This tutorial illustrates:

* Loading data.

* Basic use of queries.

* Exporting results.

## Setup

This tutorial uses the `bigdata` magic to run Hive interactively from a Jupyter notebook. The `timeout` parameter sets the maximum time to wait for a command to finish before an error is reported.

In [5]:
%load_ext bigdata
%timeout 300

The bigdata extension is already loaded. To reload it, use:
  %reload_ext bigdata


The data are stored in the `drivers` folder of the current directory. The following cell creates the `/tmp/drivers` folder in the Hadoop file system (HDFS), copies the files into it, and lists them to confirm the copy.

In [2]:
## Delete the folder if it exists
!hdfs dfs -rm -r -f /tmp/drivers

##
## Create the drivers folder in HDFS
##
!hdfs dfs -mkdir /tmp/drivers

##
## Copy the files to HDFS
##
!hdfs dfs -copyFromLocal drivers/*  /tmp/drivers/

##
## List the files in HDFS to verify
## that they were copied correctly.
##
!hdfs dfs -ls /tmp/drivers/*

-rw-r--r--   1 vagrant supergroup       2043 2019-06-12 19:47 /tmp/drivers/drivers.csv
-rw-r--r--   1 vagrant supergroup       4308 2019-06-12 19:47 /tmp/drivers/drivers.json
-rw-r--r--   1 vagrant supergroup      26205 2019-06-12 19:47 /tmp/drivers/timesheet.csv
-rw-r--r--   1 vagrant supergroup    2272077 2019-06-12 19:47 /tmp/drivers/truck_event_text_partition.csv
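
Outside a notebook, the same sequence of commands can be assembled from Python. The sketch below is a minimal illustration and not part of the original tutorial: the helper names `hdfs_setup_commands` and `run_all` are hypothetical, and by default the commands are only printed (a dry run), so the block works even where the `hdfs` client is not installed.

```python
import glob
import subprocess


def hdfs_setup_commands(local_dir="drivers", hdfs_dir="/tmp/drivers"):
    """Return the hdfs commands of the cell above as argument lists (hypothetical helper)."""
    # The notebook relies on the shell to expand drivers/*; here glob does it.
    local_files = sorted(glob.glob(f"{local_dir}/*"))
    commands = [
        ["hdfs", "dfs", "-rm", "-r", "-f", hdfs_dir],  # delete the folder if it exists
        ["hdfs", "dfs", "-mkdir", hdfs_dir],           # create it again
    ]
    if local_files:
        # Copy every local file into the HDFS folder.
        commands.append(["hdfs", "dfs", "-copyFromLocal", *local_files, hdfs_dir + "/"])
    commands.append(["hdfs", "dfs", "-ls", hdfs_dir])  # list the result
    return commands


def run_all(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


commands = hdfs_setup_commands()
run_all(commands)
```

Calling `run_all(commands, dry_run=False)` would execute the commands, assuming the `hdfs` client is on the `PATH`.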


## Creating the driver events table

The following code creates the driver events table; the first statement drops the table if it already exists. Note that the fields in each row must be declared as comma-delimited so that Hive imports them correctly. The `skip.header.line.count` table property tells Hive to skip the first line of the file (the header) when reading the data.

In [4]:
%%hive
DROP TABLE IF EXISTS truck_events;

CREATE TABLE truck_events (driverId       INT, 
                           truckId        INT,
                           eventTime      STRING,
                           eventType      STRING, 
                           longitude      DOUBLE, 
                           latitude       DOUBLE,
                           eventKey       STRING, 
                           correlationId  STRING, 
                           driverName     STRING,
                           routeId        BIGINT,
                           routeName      STRING,
                           eventDate      STRING)

ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");

DROP TABLE IF EXISTS truck_events;
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
CREATE TABLE truck_events (driverId       INT, 
                           truckId        INT,
                           eventTime      STRING,
                           eventType      STRING, 
                           longitude      DOUBLE, 
                           latitude       DOUBLE,
                           eventKey       STRING, 
                           correlationId  STRING, 
                           driverName     STRING,
                           routeId        BIGINT,
                           routeName      STRING,
                           eventDate      STRING)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang
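
To make the roles of `FIELDS TERMINATED BY ','` and `skip.header.line.count` concrete, the following Python sketch mimics how a comma-delimited text file is mapped onto the twelve columns declared in the `CREATE TABLE` statement. It is an illustration only, not how Hive is implemented: the helper `parse_rows` and the header line are assumptions, and the data line is taken from the `SELECT` output shown later in this tutorial.

```python
# Column names and types as declared in CREATE TABLE truck_events.
SCHEMA = [
    ("driverId", int), ("truckId", int), ("eventTime", str),
    ("eventType", str), ("longitude", float), ("latitude", float),
    ("eventKey", str), ("correlationId", str), ("driverName", str),
    ("routeId", int), ("routeName", str), ("eventDate", str),
]


def parse_rows(text, delimiter=",", skip_header_lines=1):
    """Split each line on the delimiter and cast the fields to the schema types.

    Hypothetical helper that only illustrates the idea behind
    FIELDS TERMINATED BY ',' and skip.header.line.count = 1.
    """
    rows = []
    for line in text.splitlines()[skip_header_lines:]:
        fields = line.split(delimiter)
        row = {}
        for (name, cast), value in zip(SCHEMA, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                # A value that does not fit the declared type becomes NULL.
                row[name] = None
        rows.append(row)
    return rows


# The header line is an assumption; the data line comes from the tutorial output.
sample = (
    "driverId,truckId,eventTime,eventType,longitude,latitude,eventKey,"
    "correlationId,driverName,routeId,routeName,eventDate\n"
    "14,25,59:21.4,Normal,-94.58,37.03,14|25|9223370572464814373,3.66E+18,"
    "Adis Cesir,160405074,Joplin to Kansas City Route 2,2016-05-27-22\n"
)

rows = parse_rows(sample)
print(len(rows))
print(rows[0]["driverName"], rows[0]["longitude"])
```

Calling `parse_rows(sample, skip_header_lines=0)` shows why the header has to be skipped: the header line would be read as a data row whose numeric columns cannot be parsed.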

The existing tables in the database are listed to verify the result.

In [4]:
%%hive
SHOW TABLES;

SHOW TABLES;
OK
docs
drivers
specific_columns
temp_drivers
temp_timesheet
timesheet
truck_events
truck_events_subset
word_counts
Time taken: 0.112 seconds, Fetched: 9 row(s)


The following cell shows the detailed creation statement of the `truck_events` table.

In [5]:
%%hive
SHOW CREATE TABLE truck_events;

SHOW CREATE TABLE truck_events;
OK
CREATE TABLE `truck_events`(
  `driverid` int, 
  `truckid` int, 
  `eventtime` string, 
  `eventtype` string, 
  `longitude` double, 
  `latitude` double, 
  `eventkey` string, 
  `correlationid` string, 
  `drivername` string, 
  `routeid` bigint, 
  `routename` string, 
  `eventdate` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'=',', 
  'serialization.format'=',') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://0.0.0.0:9000/user/hive/warehouse/truck_events'
TBLPROPERTIES (
  'skip.header.line.count'='1', 
  'transient_lastDdlTime'='1559010444')
Time taken: 0.074 seconds, Fetched: 27 row(s)


The columns and their types can also be displayed with the `DESCRIBE` command.

In [6]:
%%hive
DESCRIBE truck_events;

DESCRIBE truck_events;
OK
driverid            	int                 	                    
truckid             	int                 	                    
eventtime           	string              	                    
eventtype           	string              	                    
longitude           	double              	                    
latitude            	double              	                    
eventkey            	string              	                    
correlationid       	string              	                    
drivername          	string              	                    
routeid             	bigint              	                    
routename           	string              	                    
eventdate           	string              	                    
Time taken: 0.033 seconds, Fetched: 12 row(s)


## Loading the data

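The data are loaded with the following statement. Because the path refers to HDFS (there is no `LOCAL` keyword), Hive moves the file from `/tmp/drivers` into the table's storage location rather than copying it; `OVERWRITE` replaces any data already in the table.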

In [7]:
%%hive
LOAD DATA INPATH '/tmp/drivers/truck_event_text_partition.csv' OVERWRITE 
INTO TABLE truck_events;

LOAD DATA INPATH '/tmp/drivers/truck_event_text_partition.csv' OVERWRITE 
INTO TABLE truck_events;
Loading data to table default.truck_events
OK
Time taken: 0.872 seconds


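The table properties are checked after the data load. `totalSize` equals the 2272077 bytes of the CSV file copied to HDFS earlier, while `numRows` is still 0, which suggests that row statistics have not been computed at this point.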

In [8]:
%%hive
SHOW TBLPROPERTIES truck_events;

SHOW TBLPROPERTIES truck_events;
OK
numFiles	1
numRows	0
rawDataSize	0
skip.header.line.count	1
totalSize	2272077
transient_lastDdlTime	1559010445
Time taken: 0.028 seconds, Fetched: 6 row(s)


## Viewing the data

The data are viewed with `SELECT` queries.

In [9]:
%%hive
SELECT * FROM truck_events LIMIT 10;

SELECT * FROM truck_events LIMIT 10;
OK
14	25	59:21.4	Normal	-94.58	37.03	14|25|9223370572464814373	3.66E+18	Adis Cesir	160405074	Joplin to Kansas City Route 2	2016-05-27-22
18	16	59:21.7	Normal	-89.66	39.78	18|16|9223370572464814089	3.66E+18	Grant Liu	1565885487	Springfield to KC Via Hanibal	2016-05-27-22
27	105	59:21.7	Normal	-90.21	38.65	27|105|9223370572464814070	3.66E+18	Mark Lochbihler	1325562373	Springfield to KC Via Columbia Route 2	2016-05-27-22
11	74	59:21.7	Normal	-90.2	38.65	11|74|9223370572464814123	3.66E+18	Jamie Engesser	1567254452	Saint Louis to Memphis Route2	2016-05-27-22
22	87	59:21.7	Normal	-90.04	35.19	22|87|9223370572464814101	3.66E+18	Nadeem Asghar	1198242881	 Saint Louis to Chicago Route2	2016-05-27-22
22	87	59:22.3	Normal	-90.37	35.21	22|87|9223370572464813486	3.66E+18	Nadeem Asghar	1198242881	 Saint Louis to Chicago Route2	2016-05-27-22
23	68	59:22.4	Normal	-89.91	40.86	23|68|9223370572464813450	3.66E+18	Adam Diaz	160405074	Joplin to Kansas City Route 2	2016-0

## Obtaining a subset of the rows

In Hive, a subset of the data can be extracted with a query and stored in a new table. The following code creates the table `truck_events_subset` with 100 records from the table `truck_events`. Because the query has no `ORDER BY`, `LIMIT 100` does not guarantee which 100 rows are returned.

In [10]:
%%hive
DROP TABLE IF EXISTS truck_events_subset;

CREATE TABLE truck_events_subset 
AS
    SELECT *
    FROM truck_events
    LIMIT 100;

DROP TABLE IF EXISTS truck_events_subset;
OK
Time taken: 0.106 seconds
CREATE TABLE truck_events_subset 
AS
    SELECT *
    FROM truck_events
    LIMIT 100;
Query ID = vagrant_20190528022727_4214daba-470c-425a-9e92-59712ae22f81
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1558824450495_0083, Tracking URL = http://ubuntu-bionic:8088/proxy/application_1558824450495_0083/
Kill Command = /usr/local/hadoop-2.8.5/bin/hadoop job  -kill job_1558824450495_0083
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-05-28 02:27:34,788 Stage-1 map = 0%,  reduce = 0%
2019-05-28 02:27:40,297 Stage-1 map = 100%,  reduce = 0

The previous code is equivalent to the following, where `LIKE` in `CREATE TABLE` indicates that the new table `truck_events_subset` has the same structure as the existing table `truck_events`.

In [11]:
%%hive

DROP TABLE IF EXISTS truck_events_subset;

CREATE TABLE truck_events_subset LIKE truck_events;

INSERT OVERWRITE TABLE truck_events_subset
SELECT
    *
FROM
    truck_events
LIMIT
    100;

DROP TABLE IF EXISTS truck_events_subset;
OK
Time taken: 0.058 seconds
CREATE TABLE truck_events_subset LIKE truck_events;
OK
Time taken: 0.05 seconds
INSERT OVERWRITE TABLE truck_events_subset
SELECT
    *
FROM
    truck_events
LIMIT
    100;
Query ID = vagrant_20190528022747_08f7c087-a6e6-46e9-bf6a-6b6f685a943c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1558824450495_0084, Tracking URL = http://ubuntu-bionic:8088/proxy/application_1558824450495_0084/
Kill Command = /usr/local/hadoop-2.8.5/bin/hadoop job  -kill job_1558824450495_0084
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-05-28 02:27:57,147 

In [12]:
%%hive
SELECT * FROM truck_events_subset LIMIT 5;

SELECT * FROM truck_events_subset LIMIT 5;
OK
31	18	59:36.3	Normal	-94.58	37.03	31|18|9223370572464799462	3.66E+18	Rommel Garcia	1594289134	Memphis to Little Rock Route 2	2016-05-27-22
18	16	59:36.3	Normal	-92.42	39.76	18|16|9223370572464799486	3.66E+18	Grant Liu	1565885487	Springfield to KC Via Hanibal	2016-05-27-22
26	57	59:35.9	Normal	-92.74	37.6	26|57|9223370572464799895	3.66E+18	Michael Aube	1325712174	Saint Louis to Tulsa Route2	2016-05-27-22
14	25	59:35.8	Normal	-94.46	37.16	14|25|9223370572464800006	3.66E+18	Adis Cesir	160405074	Joplin to Kansas City Route 2	2016-05-27-22
27	105	59:35.6	Normal	-92.85	38.93	27|105|9223370572464800175	3.66E+18	Mark Lochbihler	1325562373	Springfield to KC Via Columbia Route 2	2016-05-27-22
Time taken: 0.105 seconds, Fetched: 5 row(s)
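
The two ways of building the subset table can be tried without a Hive installation by using Python's built-in `sqlite3` module as a stand-in. This is only an analogy under stated assumptions: SQLite has no `CREATE TABLE ... LIKE` or `INSERT OVERWRITE`, so the sketch repeats the column list by hand and uses a plain `INSERT INTO` on a fresh table. The table contents are made-up rows, and an `ORDER BY` is added so that both queries pick the same rows deterministically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Made-up stand-in for truck_events with only three of its columns.
cur.execute("CREATE TABLE truck_events (driverId INTEGER, eventTime TEXT, eventType TEXT)")
cur.executemany(
    "INSERT INTO truck_events VALUES (?, ?, ?)",
    [(i, f"59:{i:02d}.0", "Normal") for i in range(1, 21)],
)

# Approach 1: create the table directly from a query (CREATE TABLE ... AS SELECT).
cur.execute(
    "CREATE TABLE subset_ctas AS "
    "SELECT * FROM truck_events ORDER BY driverId LIMIT 5"
)

# Approach 2: create an empty table with the same structure, then fill it.
# SQLite lacks CREATE TABLE ... LIKE, so the columns are spelled out again.
cur.execute("CREATE TABLE subset_like (driverId INTEGER, eventTime TEXT, eventType TEXT)")
cur.execute(
    "INSERT INTO subset_like "
    "SELECT * FROM truck_events ORDER BY driverId LIMIT 5"
)

rows_ctas = cur.execute("SELECT * FROM subset_ctas ORDER BY driverId").fetchall()
rows_like = cur.execute("SELECT * FROM subset_like ORDER BY driverId").fetchall()
print(rows_ctas == rows_like, len(rows_ctas))
conn.close()
```

The second approach is the more verbose one, but it separates the definition of the table from the query that fills it.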


## Obtaining a subset of the columns

The following code selects some columns of the table `truck_events_subset` and stores them in a different table.

In [13]:
%%hive

DROP TABLE IF EXISTS specific_columns; 

CREATE TABLE specific_columns 
AS
    SELECT
        driverId, 
        eventTime, 
        eventType
    FROM
        truck_events_subset;

SELECT * FROM specific_columns LIMIT 5;

DROP TABLE IF EXISTS specific_columns; 
OK
Time taken: 0.052 seconds
CREATE TABLE specific_columns 
AS
    SELECT
        driverId, 
        eventTime, 
        eventType
    FROM
        truck_events_subset;
Query ID = vagrant_20190528022808_dd79fb1f-73a1-4fb2-8fe8-ea19a6a258d3
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1558824450495_0085, Tracking URL = http://ubuntu-bionic:8088/proxy/application_1558824450495_0085/
Kill Command = /usr/local/hadoop-2.8.5/bin/hadoop job  -kill job_1558824450495_0085
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-05-28 02:28:16,936 Stage-1 map = 0%,  reduce = 0%
2019-05-28 02:28:21,236 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
MapReduce Total cumulative CPU time: 1 seconds 220 msec
Ended Job = job_1558824450495_0085
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is f

## Writing the table to HDFS

Next, the contents of the table are written to the directory `/tmp/drivers/specific-columns` in HDFS.

In [14]:
%%hive
INSERT OVERWRITE DIRECTORY '/tmp/drivers/specific-columns' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
SELECT 
    * 
FROM 
    specific_columns;

INSERT OVERWRITE DIRECTORY '/tmp/drivers/specific-columns' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
SELECT 
    * 
FROM 
    specific_columns;
Query ID = vagrant_20190528022823_a4bf0a87-a469-49ed-911b-4cac85617c15
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1558824450495_0086, Tracking URL = http://ubuntu-bionic:8088/proxy/application_1558824450495_0086/
Kill Command = /usr/local/hadoop-2.8.5/bin/hadoop job  -kill job_1558824450495_0086
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-05-28 02:28:33,570 Stage-1 map = 0%,  reduce = 0%
2019-05-28 02:28:37,911 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.19 sec
MapReduce Total cumulative CPU time: 1 seconds 190 msec
Ended Job = job_1558824450495_0086
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to directory

In [15]:
##
## List the contents of the directory
##
!hdfs dfs -ls /tmp/drivers/specific-columns/

Found 1 items
-rwxr-xr-x   1 vagrant supergroup       1800 2019-05-28 02:28 /tmp/drivers/specific-columns/000000_0


In [16]:
##
## Show the last part of the file
##
!hdfs dfs -tail /tmp/drivers/specific-columns/000000_0

,59:29.6,Normal
13,59:29.5,Normal
27,59:29.3,Normal
17,59:29.2,Normal
12,59:29.1,Normal
15,59:28.8,Normal
16,59:28.8,Normal
13,59:28.5,Normal
23,59:28.4,Normal
11,59:28.3,Normal
30,59:28.0,Normal
24,59:27.9,Normal
25,59:27.8,Normal
28,59:27.7,Normal
27,59:27.7,Normal
13,59:27.6,Normal
23,59:27.4,Normal
25,59:27.0,Normal
26,59:27.0,Normal
28,59:26.9,Normal
10,59:26.8,Normal
22,59:26.6,Normal
23,59:26.6,Normal
25,59:26.2,Normal
27,59:25.9,Normal
19,59:25.9,Normal
13,59:25.9,Normal
21,59:25.7,Normal
16,59:25.3,Normal
26,59:25.2,Normal
19,59:25.1,Normal
18,59:25.0,Normal
22,59:25.0,Normal
29,59:24.7,Normal
25,59:24.3,Normal
24,59:24.3,Normal
32,59:24.2,Normal
22,59:24.2,Normal
14,59:24.2,Normal
25,59:23.5,Normal
31,59:23.5,Normal
16,59:23.4,Normal
15,59:23.4,Normal
28,59:23.3,Normal
14,59:23.3,Normal
17,59:23.2,Normal
27,59:22.6,Normal
32,59:22.5,Normal
20,59:22.5,Normal
11,59:22.5,Normal
23,59:22.4,Normal
22,59:22.3,Normal
22,59:21.7,Normal
11,59:21.7,Normal
27,59:21.7,Normal
18,59:21.7,N
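
What the export produces can be pictured with a short Python sketch: each row of `specific_columns` becomes one line whose fields are joined by the delimiter given in `FIELDS TERMINATED BY`. The function name `export_rows` and the in-memory buffer are illustrative assumptions; the three sample rows are taken from the file contents shown above.

```python
import io


def export_rows(rows, out, delimiter=","):
    """Write one delimited line per row, similar in spirit to
    INSERT OVERWRITE DIRECTORY ... FIELDS TERMINATED BY ','."""
    for row in rows:
        out.write(delimiter.join(str(value) for value in row) + "\n")


# (driverId, eventTime, eventType) rows as they appear in the exported file.
rows = [
    (13, "59:29.5", "Normal"),
    (27, "59:29.3", "Normal"),
    (17, "59:29.2", "Normal"),
]

buffer = io.StringIO()
export_rows(rows, buffer)
print(buffer.getvalue(), end="")
```

If the `ROW FORMAT` clause is omitted, Hive separates the fields with its default delimiter, the Ctrl-A character, which corresponds to `export_rows(rows, buffer, delimiter="\x01")` in this sketch.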

---

In [17]:
%%hive
DROP TABLE drivers;
DROP TABLE specific_columns;
DROP TABLE temp_drivers;
DROP TABLE temp_timesheet;
DROP TABLE timesheet;
DROP TABLE truck_events;
DROP TABLE truck_events_subset;

DROP TABLE drivers;
OK
Time taken: 0.154 seconds
DROP TABLE specific_columns;
OK
Time taken: 0.045 seconds
DROP TABLE temp_drivers;
OK
Time taken: 0.049 seconds
DROP TABLE temp_timesheet;
OK
Time taken: 0.057 seconds
DROP TABLE timesheet;
OK
Time taken: 0.056 seconds
DROP TABLE truck_events;
OK
Time taken: 0.048 seconds
DROP TABLE truck_events_subset;
OK
Time taken: 0.046 seconds
