ACID Transactions (Insert / Update / Delete) in Hive
===

* *30 min* | Last modified: June 22, 2019

Standard SQL provides statements for inserting, updating, and deleting rows in a table. This tutorial presents representative examples of these statements in Hive.

## Running Hive in a Docker container

* Using the working directory of the local machine:

```
docker run --rm -it -v "$PWD":/datalake  --name hive -p 50070:50070 -p 8088:8088 -p 8888:8888 -p 5000:5000 jdvelasq/hive:2.3.6-pseudo
```

* Using a Docker volume (named `datalake`):

```
docker run --rm -it -v datalake:/datalake --name hive  -p 50070:50070 -p 8088:8088 -p 8888:8888 -p 5000:5000 jdvelasq/hive:2.3.6-pseudo
```


* Opening a console attached to a container that is already running:

```
docker exec -it hive bash
```


In [1]:
%load_ext bigdata
%timeout 300

## Creating the table

In [2]:
%%hive
DROP DATABASE IF EXISTS demo CASCADE;
CREATE DATABASE demo;
USE demo;

CREATE TABLE persons (
    id        INT,
    firstname STRING,
    surname   STRING,
    birthday  TIMESTAMP,
    quantity  INT
)
PARTITIONED BY (color STRING)
CLUSTERED BY(id) INTO 3 BUCKETS
STORED AS ORC 
LOCATION '/tmp/hive-partitioned'
TBLPROPERTIES ('transactional'='true');

DROP DATABASE IF EXISTS demo CASCADE;
OK
Time taken: 3.867 seconds
CREATE DATABASE demo;
OK
Time taken: 0.302 seconds
USE demo;
OK
Time taken: 0.012 seconds
CREATE TABLE persons (
    id        INT,
    firstname STRING,
    surname   STRING,
    birthday  TIMESTAMP,
    quantity  INT
)
PARTITIONED BY (color STRING)
CLUSTERED BY(id) INTO 3 BUCKETS
STORED AS ORC 
LOCATION '/tmp/hive-partitioned'
TBLPROPERTIES ('transactional'='true');
OK
Time taken: 0.476 seconds


## Setup

Hive's support for ACID transactions must be enabled. The following settings do this for the current session.

In [3]:
%%hive
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;

SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;


---

## INSERT

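    INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

    values_row:
       (value [, value ...])

Note that, unlike standard SQL, the `INSERT ... VALUES` form used in this tutorial does not name the target columns, so a value must be supplied for every column, in the order in which the columns were declared. The partition column is the exception: its value is given in the `PARTITION` clause rather than in the row.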
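Because every row must carry one value per column, in declaration order, it can help to check rows before sending them to Hive. The following is a minimal Python sketch, not part of Hive or of the `bigdata` extension: the helper names `sql_literal` and `build_insert` are hypothetical, and the column tuple is copied by hand from the `persons` table definition.

```python
# Non-partition columns of `persons`, in declaration order (copied by hand).
PERSONS_COLUMNS = ("id", "firstname", "surname", "birthday", "quantity")


def sql_literal(value):
    """Render a Python value as a literal: NULL, a number, or a quoted string."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    # This sketch does not attempt escaping; it simply rejects risky strings.
    if "'" in text or "\\" in text:
        raise ValueError(f"unsupported character in string value: {text!r}")
    return f"'{text}'"


def build_insert(table, partition, rows, columns=PERSONS_COLUMNS):
    """Build one INSERT ... PARTITION ... VALUES statement as a string.

    Raises ValueError if a row does not have exactly one value per column.
    """
    part = ", ".join(f"{name}={sql_literal(val)}" for name, val in partition.items())
    rendered = []
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(
                f"row {row!r} has {len(row)} values, expected {len(columns)}"
            )
        rendered.append("    (" + ", ".join(sql_literal(v) for v in row) + ")")
    return f"INSERT INTO {table} PARTITION ({part}) VALUES\n" + ",\n".join(rendered) + ";"


if __name__ == "__main__":
    print(build_insert(
        "persons",
        {"color": "black"},
        [(4, "Roth", "Fry", "1975-01-29", 1), (10, "Kylan", "Sexton", "1975-02-28", 4)],
    ))
```

The generated text can then be pasted into a `%%hive` cell. A row with a missing value fails in Python, before any Hive job is launched.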

In [4]:
%%hive
--
-- Insert the rows into the table.
-- The values are in the same order as the columns.
--
INSERT INTO persons PARTITION (color='green') VALUES
    (1,"Vivian","Hamilton","1971-07-08",1),
    (2,"Karen","Holcomb","1974-05-23",4),
    (12,"Hope","Coffey","1973-12-24",5),
    (17,"Chanda","Boyer","1973-04-01",4);
    
INSERT INTO persons PARTITION (color='black') VALUES    
    (4,"Roth","Fry","1975-01-29",1),
    (10,"Kylan","Sexton","1975-02-28",4);

INSERT INTO persons PARTITION (color='blue') VALUES
    (5,"Zoe","Conway","1974-07-03",2),
    (7,"Driscoll","Klein","1970-10-05",5),
    (15,"Hope","Silva","1970-07-01",5);

INSERT INTO persons PARTITION (color='orange') VALUES    
    (3,"Cody","Garrett","1973-04-22",1),
    (16,"Ayanna","Jarvis","1974-02-11",5);
    
INSERT INTO persons PARTITION (color='violet') VALUES    
    (6,"Gretchen","Kinney","1974-10-18",1);

INSERT INTO persons PARTITION (color='red') VALUES    
    (8,"Karyn","Diaz","1969-02-24",1),
    (14,"Clio","Noel","1972-12-12",5);
    
INSERT INTO persons PARTITION (color='indigo') VALUES    
    (9,"Merritt","Guy","1974-10-17",4),
    (11,"Jordan","Estes","1969-12-07",4);

INSERT INTO persons PARTITION (color='gray') VALUES    
    (13,"Vivian","Crane","1970-08-27",5);

INSERT INTO persons PARTITION (color='yellow') VALUES
    (18,"Chadwick","Knight","1973-04-29",1);    


--
-- Insert the rows into the table.
-- The values are in the same order as the columns.
--
INSERT INTO persons PARTITION (color='green') VALUES
    (1,"Vivian","Hamilton","1971-07-08",1),
    (2,"Karen","Holcomb","1974-05-23",4),
    (12,"Hope","Coffey","1973-12-24",5),
    (17,"Chanda","Boyer","1973-04-01",4);
Query ID = root_20191114234858_9367c926-6f03-4376-87f2-5a11f2ab9c15
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1573774297554_0017, Tracking URL = http://0982451e3758:8088/proxy/application_1573774297554_0017/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1573774297554_0017
Hadoop job information for Stage-1: num

In [5]:
%%hive
SELECT * FROM persons;

SELECT * FROM persons;
OK
10	Kylan	Sexton	1975-02-28 00:00:00	4	black
4	Roth	Fry	1975-01-29 00:00:00	1	black
15	Hope	Silva	1970-07-01 00:00:00	5	blue
7	Driscoll	Klein	1970-10-05 00:00:00	5	blue
5	Zoe	Conway	1974-07-03 00:00:00	2	blue
13	Vivian	Crane	1970-08-27 00:00:00	5	gray
12	Hope	Coffey	1973-12-24 00:00:00	5	green
1	Vivian	Hamilton	1971-07-08 00:00:00	1	green
17	Chanda	Boyer	1973-04-01 00:00:00	4	green
2	Karen	Holcomb	1974-05-23 00:00:00	4	green
9	Merritt	Guy	1974-10-17 00:00:00	4	indigo
11	Jordan	Estes	1969-12-07 00:00:00	4	indigo
3	Cody	Garrett	1973-04-22 00:00:00	1	orange
16	Ayanna	Jarvis	1974-02-11 00:00:00	5	orange
14	Clio	Noel	1972-12-12 00:00:00	5	red
8	Karyn	Diaz	1969-02-24 00:00:00	1	red
6	Gretchen	Kinney	1974-10-18 00:00:00	1	violet
18	Chadwick	Knight	1973-04-29 00:00:00	1	yellow
Time taken: 0.282 seconds, Fetched: 18 row(s)


In [6]:
!hdfs dfs -ls /tmp/hive-partitioned

Found 9 items
drwxrwxrwx   - root supergroup          0 2019-11-14 23:49 /tmp/hive-partitioned/color=black
drwxrwxrwx   - root supergroup          0 2019-11-14 23:50 /tmp/hive-partitioned/color=blue
drwxrwxrwx   - root supergroup          0 2019-11-14 23:52 /tmp/hive-partitioned/color=gray
drwxrwxrwx   - root supergroup          0 2019-11-14 23:49 /tmp/hive-partitioned/color=green
drwxrwxrwx   - root supergroup          0 2019-11-14 23:51 /tmp/hive-partitioned/color=indigo
drwxrwxrwx   - root supergroup          0 2019-11-14 23:50 /tmp/hive-partitioned/color=orange
drwxrwxrwx   - root supergroup          0 2019-11-14 23:51 /tmp/hive-partitioned/color=red
drwxrwxrwx   - root supergroup          0 2019-11-14 23:51 /tmp/hive-partitioned/color=violet
drwxrwxrwx   - root supergroup          0 2019-11-14 23:52 /tmp/hive-partitioned/color=yellow


In [7]:
!hdfs dfs -ls /tmp/hive-partitioned/color=black

Found 1 items
drwxr-xr-x   - root supergroup          0 2019-11-14 23:49 /tmp/hive-partitioned/color=black/delta_0000002_0000002_0000
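Each write to a transactional table lands in its own delta directory. As a minimal sketch, and assuming the Hive 2.x naming scheme `delta_<min txn id>_<max txn id>_<statement id>` (an assumption inferred from the listing above, not something this notebook verifies), the directory name can be split into its parts in Python. The names `DeltaDir` and `parse_delta_dir` are hypothetical.

```python
import re
from typing import NamedTuple


class DeltaDir(NamedTuple):
    min_txn: int       # first transaction id covered by the delta
    max_txn: int       # last transaction id covered by the delta
    statement_id: int  # statement number within the transaction


# Assumed layout: three zero-padded numeric fields after the "delta" prefix.
_DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)_(\d+)$")


def parse_delta_dir(name):
    """Split a delta directory name into its numeric fields."""
    match = _DELTA_RE.match(name)
    if match is None:
        raise ValueError(f"not a delta directory name: {name!r}")
    return DeltaDir(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    print(parse_delta_dir("delta_0000002_0000002_0000"))
    # prints: DeltaDir(min_txn=2, max_txn=2, statement_id=0)
```

Under that assumption, the `color=black` delta belongs to transaction 2, which is consistent with it being the second `INSERT` statement executed above.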


In [8]:
!hdfs dfs -ls /tmp/hive-partitioned/color=black/delta_0000002_0000002_0000

Found 3 items
-rw-r--r--   1 root supergroup        223 2019-11-14 23:49 /tmp/hive-partitioned/color=black/delta_0000002_0000002_0000/bucket_00000
-rw-r--r--   1 root supergroup        929 2019-11-14 23:49 /tmp/hive-partitioned/color=black/delta_0000002_0000002_0000/bucket_00001
-rw-r--r--   1 root supergroup        223 2019-11-14 23:49 /tmp/hive-partitioned/color=black/delta_0000002_0000002_0000/bucket_00002
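The listing shows one file per bucket, and only `bucket_00001` is noticeably larger than the others. A minimal Python sketch of the bucket assignment can explain this, assuming that Hive 2.x hashes an `INT` column to its own value and takes the non-negative remainder modulo the number of buckets (newer Hive versions may use a different hash by default). The function names are hypothetical.

```python
NUM_BUCKETS = 3  # from CLUSTERED BY(id) INTO 3 BUCKETS


def bucket_for_int(value, num_buckets=NUM_BUCKETS):
    """Assumed Hive 2.x rule for INT keys: (hash & Integer.MAX_VALUE) % buckets,
    where the hash of an INT is the integer itself."""
    return (value & 0x7FFFFFFF) % num_buckets


def bucket_file_name(bucket):
    """Bucket files are named with a zero-padded five-digit index."""
    return f"bucket_{bucket:05d}"


if __name__ == "__main__":
    # ids stored in the color='black' partition
    for person_id in (4, 10):
        print(person_id, bucket_file_name(bucket_for_int(person_id)))
    # prints:
    # 4 bucket_00001
    # 10 bucket_00001
```

Both rows of the `color=black` partition map to bucket 1 under this rule, which matches the sizes in the listing: `bucket_00001` is 929 bytes, while the other two files are 223 bytes each.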


## UPDATE

    UPDATE tablename SET column = value [, column = value ...] [WHERE expression]

See https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

In [9]:
%%hive
UPDATE persons SET quantity = 100 WHERE color = 'red';

UPDATE persons SET quantity = 100 WHERE color = 'red';
Query ID = root_20191114235249_60126600-c670-4da0-a008-cca13d6b8d14
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1573774297554_0026, Tracking URL = http://0982451e3758:8088/proxy/application_1573774297554_0026/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1573774297554_0026
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2019-11-14 23:52:56,474 Stage-1 map = 0%,  reduce = 0%
2019-11-14 23:53:01,714 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 1.37 sec
2019-11-14 23:53:02,726 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 3.02 sec
2019-1

In [10]:
%%hive
SELECT * FROM persons;

SELECT * FROM persons;
OK
10	Kylan	Sexton	1975-02-28 00:00:00	4	black
4	Roth	Fry	1975-01-29 00:00:00	1	black
15	Hope	Silva	1970-07-01 00:00:00	5	blue
7	Driscoll	Klein	1970-10-05 00:00:00	5	blue
5	Zoe	Conway	1974-07-03 00:00:00	2	blue
13	Vivian	Crane	1970-08-27 00:00:00	5	gray
12	Hope	Coffey	1973-12-24 00:00:00	5	green
1	Vivian	Hamilton	1971-07-08 00:00:00	1	green
17	Chanda	Boyer	1973-04-01 00:00:00	4	green
2	Karen	Holcomb	1974-05-23 00:00:00	4	green
9	Merritt	Guy	1974-10-17 00:00:00	4	indigo
11	Jordan	Estes	1969-12-07 00:00:00	4	indigo
3	Cody	Garrett	1973-04-22 00:00:00	1	orange
16	Ayanna	Jarvis	1974-02-11 00:00:00	5	orange
14	Clio	Noel	1972-12-12 00:00:00	100	red
8	Karyn	Diaz	1969-02-24 00:00:00	100	red
6	Gretchen	Kinney	1974-10-18 00:00:00	1	violet
18	Chadwick	Knight	1973-04-29 00:00:00	1	yellow
Time taken: 0.151 seconds, Fetched: 18 row(s)


## DELETE

    DELETE FROM tablename [WHERE expression]
    
See https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

In [11]:
%%hive
DELETE FROM persons WHERE id = 10;

DELETE FROM persons WHERE id = 10;
Query ID = root_20191114235311_03b2dcc3-d721-4400-9113-07d02c42cb00
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1573774297554_0027, Tracking URL = http://0982451e3758:8088/proxy/application_1573774297554_0027/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1573774297554_0027
Hadoop job information for Stage-1: number of mappers: 27; number of reducers: 3
2019-11-14 23:53:20,532 Stage-1 map = 0%,  reduce = 0%
2019-11-14 23:53:26,844 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU 2.23 sec
2019-11-14 23:53:28,981 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 4.48 sec
2019-11-14 23:53:32,231 Sta

In [12]:
%%hive
SELECT * FROM persons ORDER BY id;

SELECT * FROM persons ORDER BY id;
Query ID = root_20191114235419_402662f8-e827-4917-97fe-e736001b7943
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1573774297554_0028, Tracking URL = http://0982451e3758:8088/proxy/application_1573774297554_0028/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1573774297554_0028
Hadoop job information for Stage-1: number of mappers: 27; number of reducers: 1
2019-11-14 23:54:29,073 Stage-1 map = 0%,  reduce = 0%
2019-11-14 23:54:34,302 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU 1.35 sec
2019-11-14 23:54:36,455 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 3.03 sec
2019-11-14 23:54:38,728 Sta

## MERGE

    MERGE INTO <target table> AS T USING <source expression/table> AS S
    ON <boolean expression1>
    WHEN MATCHED [AND <boolean expression2>] THEN UPDATE SET <set clause list>
    WHEN MATCHED [AND <boolean expression3>] THEN DELETE
    WHEN NOT MATCHED [AND <boolean expression4>] THEN INSERT VALUES<value list>
    
See https://community.hortonworks.com/articles/97113/hive-acid-merge-by-example.html
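The three `WHEN` branches can be easier to follow with a small model. The following pure-Python sketch imitates the semantics of the syntax above on lists of dictionaries. It is not how Hive implements `MERGE`, the function `merge` and its parameters are hypothetical, and it assumes that each key appears at most once in the source. The source row with id 19 is invented for illustration.

```python
def merge(target, source, key,
          matched_update=None, matched_delete=None, not_matched_insert=None):
    """Model of MERGE INTO target USING source ON target[key] == source[key].

    matched_update:     (condition(t, s), assignments(t, s) -> dict) or None
    matched_delete:     condition(t, s) or None
    not_matched_insert: (condition(s), values(s) -> dict) or None
    """
    source_by_key = {s[key]: s for s in source}  # assumes unique source keys
    matched_keys = set()
    result = []
    for t in target:
        s = source_by_key.get(t[key])
        if s is None:
            result.append(dict(t))  # no source row: target row is untouched
            continue
        matched_keys.add(t[key])
        if matched_update and matched_update[0](t, s):
            result.append({**t, **matched_update[1](t, s)})  # THEN UPDATE SET
        elif matched_delete and matched_delete(t, s):
            continue  # THEN DELETE
        else:
            result.append(dict(t))
    for s in source:
        if s[key] in matched_keys or not_matched_insert is None:
            continue
        if not_matched_insert[0](s):
            result.append(not_matched_insert[1](s))  # THEN INSERT VALUES
    return result


if __name__ == "__main__":
    target = [
        {"id": 4, "firstname": "Roth", "quantity": 1},
        {"id": 8, "firstname": "Karyn", "quantity": 100},
        {"id": 14, "firstname": "Clio", "quantity": 100},
    ]
    source = [
        {"id": 4, "firstname": "Roth", "quantity": 0},    # matched, deleted
        {"id": 8, "firstname": "Karyn", "quantity": 7},   # matched, updated
        {"id": 19, "firstname": "Nova", "quantity": 3},   # hypothetical new row
    ]
    merged = merge(
        target, source, key="id",
        matched_update=(lambda t, s: s["quantity"] > 0,
                        lambda t, s: {"quantity": s["quantity"]}),
        matched_delete=lambda t, s: s["quantity"] == 0,
        not_matched_insert=(lambda s: True, lambda s: dict(s)),
    )
    for row in merged:
        print(row)
```

In this run, row 4 is deleted, row 8 gets the new quantity, row 14 is left alone because the source has no matching key, and row 19 is inserted.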

---

In [13]:
%%hive
-- clean up the database
DROP DATABASE IF EXISTS demo CASCADE;

-- clean up the database
DROP DATABASE IF EXISTS demo CASCADE;
OK
Time taken: 0.493 seconds


In [14]:
!rm *.log