ACID Transactions (Insert / Update / Delete) in Hive
===

* Last modified: May 17, 2022

Standard SQL provides statements for inserting, updating, and deleting records in a table. This tutorial presents representative examples of these statements in Hive.

Cell magic `%%hive`
---

In [1]:
from IPython.core.magic import Magics, cell_magic, line_magic, magics_class
from pexpect import spawn

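TIMEOUT = 60  # seconds to wait for the Hive prompt
PROG = "hive"  # command that starts the Hive CLI
PROMPT = ["\r\n    > ", "\r\nhive> "]  # continuation prompt and main prompt
QUIT = "quit;"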


@magics_class
class Magic(Magics):
    def __init__(self, shell):
        super().__init__(shell)
        # Start a single Hive CLI session and wait for its first prompt.
        self.app = spawn(PROG, timeout=TIMEOUT)
        self.app.expect(PROMPT)

    @cell_magic
    def hive(self, line, cell):
        # Send the cell to Hive one stripped, non-empty line at a time.
        cell_lines = [cell_line.strip() for cell_line in cell.split("\n")]
        cell_lines = [cell_line for cell_line in cell_lines if cell_line != ""]
        for cell_line in cell_lines:
            self.app.sendline(cell_line)
            self.app.expect(PROMPT, timeout=TIMEOUT)
            # Everything Hive printed before the next prompt appeared.
            output = self.app.before.decode()
            output = output.replace("\r\n", "\n")
            output = output.split("\n")
            output = [output_line.strip() for output_line in output]
            for output_line in output:
                # Skip the terminal echo of the lines that were sent.
                if output_line not in cell_lines:
                    print(output_line)
        return None

    @line_magic
    def quit(self, line):
        self.app.sendline(QUIT)


def load_ipython_extension(ip):
    ip.register_magics(Magic(ip))


load_ipython_extension(ip=get_ipython())
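
The text handling inside the `%%hive` magic can be tried without a running Hive session. The following is a minimal sketch of that logic only; the helper names `split_statements` and `filter_echo` are hypothetical and are not part of the notebook.

```python
def split_statements(cell):
    """Return the stripped, non-empty lines of a notebook cell."""
    stripped = [line.strip() for line in cell.split("\n")]
    return [line for line in stripped if line != ""]


def filter_echo(raw_output, sent_lines):
    """Drop terminal output lines that merely echo what was sent."""
    lines = raw_output.replace("\r\n", "\n").split("\n")
    lines = [line.strip() for line in lines]
    return [line for line in lines if line not in sent_lines]


# Hypothetical cell content and terminal output, for illustration only.
cell = "USE demo;\n\n  SELECT * FROM persons;\n"
sent = split_statements(cell)
shown = filter_echo("USE demo;\r\nOK\r\nTime taken: 0.023 seconds", sent)
print(sent)
print(shown)
```

One consequence of this design is that any output line identical to a line of the cell is suppressed, which is acceptable here because Hive results rarely repeat the statement text.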

Table creation
--

In [2]:
%%hive
DROP DATABASE IF EXISTS demo CASCADE;
CREATE DATABASE demo;
USE demo;

CREATE TABLE persons (
    id        INT,
    firstname STRING,
    surname   STRING,
    birthday  TIMESTAMP,
    quantity  INT
)
PARTITIONED BY (color STRING)
CLUSTERED BY(id) INTO 3 BUCKETS
STORED AS ORC 
LOCATION '/tmp/hive-partitioned'
TBLPROPERTIES ('transactional'='true');

OK
Time taken: 5.841 seconds
OK
Time taken: 0.334 seconds
OK
Time taken: 0.023 seconds
OK
Time taken: 0.25 seconds


Setup
--

Hive's features for handling ACID transactions must be enabled for the session.

In [3]:
%%hive
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;
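
When the same settings are needed in several notebooks, it can help to keep them in one place and render the `SET` lines from it. This is only a sketch: the dictionary `ACID_SETTINGS` and the function `render_set_statements` are hypothetical names, and the values are copied from the cell above.

```python
# Session settings taken from the cell above (names of the Python objects
# are assumptions made for this example).
ACID_SETTINGS = {
    "hive.txn.manager": "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager",
    "hive.support.concurrency": "true",
    "hive.enforce.bucketing": "true",
    "hive.exec.dynamic.partition.mode": "nonstrict",
    "hive.compactor.initiator.on": "true",
    "hive.compactor.worker.threads": "1",
}


def render_set_statements(settings):
    """Return one `SET key=value;` line per setting, in insertion order."""
    return [f"SET {key}={value};" for key, value in settings.items()]


statements = render_set_statements(ACID_SETTINGS)
print("\n".join(statements))
```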

---

INSERT
--

    INSERT INTO TABLE tablename [PARTITION (partcol = value [, partcol = value ...])] VALUES values_row [, values_row ...]

    values_row:
       (value [, value ...])

Note that, unlike typical SQL usage, the form shown here has no column list, so values must always be given for every column of the table, in the order in which the columns were declared.
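
Statements of this shape can also be generated from Python data. The sketch below is one possible way to do it; `sql_literal` and `build_insert` are hypothetical helpers, and the quoting rules (single-quoted strings with backslash escapes) are an assumption about what the Hive CLI accepts, so the result should be checked before use on real data.

```python
def sql_literal(value):
    """Render a Python value as a HiveQL literal (assumed quoting rules)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes first, then single quotes.
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_insert(table, partition, rows):
    """Build INSERT INTO ... PARTITION (...) VALUES ... for one partition."""
    part = ", ".join(f"{col}={sql_literal(val)}" for col, val in partition.items())
    values = ",\n".join(
        "    (" + ", ".join(sql_literal(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} PARTITION ({part}) VALUES\n{values};"


stmt = build_insert(
    "persons",
    {"color": "black"},
    [(4, "Roth", "Fry", "1975-01-29", 1), (10, "Kylan", "Sexton", "1975-02-28", 4)],
)
print(stmt)
```

Each row must list a value for every non-partition column, in table order, because the generated statement carries no column list.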

In [4]:
%%hive
--
-- Inserts the records into the table.
-- The values are in the same order as the columns.
--
INSERT INTO persons PARTITION (color='green') VALUES
    (1,"Vivian","Hamilton","1971-07-08",1),
    (2,"Karen","Holcomb","1974-05-23",4),
    (12,"Hope","Coffey","1973-12-24",5),
    (17,"Chanda","Boyer","1973-04-01",4);
    
INSERT INTO persons PARTITION (color='black') VALUES    
    (4,"Roth","Fry","1975-01-29",1),
    (10,"Kylan","Sexton","1975-02-28",4);

INSERT INTO persons PARTITION (color='blue') VALUES
    (5,"Zoe","Conway","1974-07-03",2),
    (7,"Driscoll","Klein","1970-10-05",5),
    (15,"Hope","Silva","1970-07-01",5);

INSERT INTO persons PARTITION (color='orange') VALUES    
    (3,"Cody","Garrett","1973-04-22",1),
    (16,"Ayanna","Jarvis","1974-02-11",5);
    
INSERT INTO persons PARTITION (color='violet') VALUES    
    (6,"Gretchen","Kinney","1974-10-18",1);

INSERT INTO persons PARTITION (color='red') VALUES    
    (8,"Karyn","Diaz","1969-02-24",1),
    (14,"Clio","Noel","1972-12-12",5);
    
INSERT INTO persons PARTITION (color='indigo') VALUES    
    (9,"Merritt","Guy","1974-10-17",4),
    (11,"Jordan","Estes","1969-12-07",4);

INSERT INTO persons PARTITION (color='gray') VALUES    
    (13,"Vivian","Crane","1970-08-27",5);

INSERT INTO persons PARTITION (color='yellow') VALUES
    (18,"Chadwick","Knight","1973-04-29",1);    


Query ID = root_20220517144650_1350f9e4-7c80-4a1f-b79b-94852ac5f89f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0014, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0014/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2022-05-17 14:46:58,056 Stage-1 map = 0%,  reduce = 0%
2022-05-17 14:47:01,178 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.77 sec
2022-05-17 14:47:07,402 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 4.71 sec
2022-05-17 14:47:08,427 Stage-1 map = 100%,  reduce = 67%, Cumulative 

In [5]:
%%hive
SELECT * FROM persons;

OK
10	Kylan	Sexton	1975-02-28 00:00:00	4	black
4	Roth	Fry	1975-01-29 00:00:00	1	black
15	Hope	Silva	1970-07-01 00:00:00	5	blue
7	Driscoll	Klein	1970-10-05 00:00:00	5	blue
5	Zoe	Conway	1974-07-03 00:00:00	2	blue
13	Vivian	Crane	1970-08-27 00:00:00	5	gray
12	Hope	Coffey	1973-12-24 00:00:00	5	green
1	Vivian	Hamilton	1971-07-08 00:00:00	1	green
17	Chanda	Boyer	1973-04-01 00:00:00	4	green
2	Karen	Holcomb	1974-05-23 00:00:00	4	green
9	Merritt	Guy	1974-10-17 00:00:00	4	indigo
11	Jordan	Estes	1969-12-07 00:00:00	4	indigo
3	Cody	Garrett	1973-04-22 00:00:00	1	orange
16	Ayanna	Jarvis	1974-02-11 00:00:00	5	orange
14	Clio	Noel	1972-12-12 00:00:00	5	red
8	Karyn	Diaz	1969-02-24 00:00:00	1	red
6	Gretchen	Kinney	1974-10-18 00:00:00	1	violet
18	Chadwick	Knight	1973-04-29 00:00:00	1	yellow
Time taken: 0.207 seconds, Fetched: 18 row(s)


In [6]:
!hdfs dfs -ls /tmp/hive-partitioned

Found 9 items
drwxrwxrwx   - root supergroup          0 2022-05-17 14:47 /tmp/hive-partitioned/color=black
drwxrwxrwx   - root supergroup          0 2022-05-17 14:47 /tmp/hive-partitioned/color=blue
drwxrwxrwx   - root supergroup          0 2022-05-17 14:49 /tmp/hive-partitioned/color=gray
drwxrwxrwx   - root supergroup          0 2022-05-17 14:47 /tmp/hive-partitioned/color=green
drwxrwxrwx   - root supergroup          0 2022-05-17 14:49 /tmp/hive-partitioned/color=indigo
drwxrwxrwx   - root supergroup          0 2022-05-17 14:48 /tmp/hive-partitioned/color=orange
drwxrwxrwx   - root supergroup          0 2022-05-17 14:49 /tmp/hive-partitioned/color=red
drwxrwxrwx   - root supergroup          0 2022-05-17 14:48 /tmp/hive-partitioned/color=violet
drwxrwxrwx   - root supergroup          0 2022-05-17 14:50 /tmp/hive-partitioned/color=yellow


In [7]:
!hdfs dfs -ls /tmp/hive-partitioned/color=black

Found 1 items
drwxr-xr-x   - root supergroup          0 2022-05-17 14:47 /tmp/hive-partitioned/color=black/delta_0000002_0000002_0000


In [8]:
!hdfs dfs -ls /tmp/hive-partitioned/color=black/delta_0000002_0000002_0000

Found 3 items
-rw-r--r--   1 root supergroup        223 2022-05-17 14:47 /tmp/hive-partitioned/color=black/delta_0000002_0000002_0000/bucket_00000
-rw-r--r--   1 root supergroup        936 2022-05-17 14:47 /tmp/hive-partitioned/color=black/delta_0000002_0000002_0000/bucket_00001
-rw-r--r--   1 root supergroup        223 2022-05-17 14:47 /tmp/hive-partitioned/color=black/delta_0000002_0000002_0000/bucket_00002
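
The listings above show that each partition holds a `delta_...` directory with one file per bucket. The directory name appears to follow the pattern `delta_<min id>_<max id>_<statement id>`; that reading is an assumption based on the Hive Transactions documentation, not something this tutorial states. Under that assumption, a name can be split into its parts as follows (`parse_delta_dir` is a hypothetical helper).

```python
import re

# Assumed naming scheme: delta_<min id>_<max id>[_<statement id>],
# with delete_delta_ used for directories that record deletions.
DELTA_RE = re.compile(r"^(delta|delete_delta)_(\d+)_(\d+)(?:_(\d+))?$")


def parse_delta_dir(path):
    """Return the parts of a delta directory name, or None if it is not one."""
    name = path.rstrip("/").split("/")[-1]
    match = DELTA_RE.match(name)
    if match is None:
        return None
    kind, low, high, stmt = match.groups()
    return {
        "kind": kind,
        "min_id": int(low),
        "max_id": int(high),
        "statement_id": int(stmt) if stmt is not None else None,
    }


info = parse_delta_dir(
    "/tmp/hive-partitioned/color=black/delta_0000002_0000002_0000"
)
print(info)
```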


UPDATE
--

    UPDATE tablename SET column = value [, column = value ...] [WHERE expression]

See https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

In [9]:
%%hive
UPDATE persons SET quantity = 100 WHERE color = 'red';

Query ID = root_20220517145011_7cabef7b-52ff-4671-a383-b636bd0cb2a6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0023, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0023/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0023
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
2022-05-17 14:50:15,598 Stage-1 map = 0%,  reduce = 0%
2022-05-17 14:50:19,686 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 1.61 sec
2022-05-17 14:50:20,730 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 3.48 sec
2022-05-17 14:50:22,782 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 

In [10]:
%%hive
SELECT * FROM persons;

OK
10	Kylan	Sexton	1975-02-28 00:00:00	4	black
4	Roth	Fry	1975-01-29 00:00:00	1	black
15	Hope	Silva	1970-07-01 00:00:00	5	blue
7	Driscoll	Klein	1970-10-05 00:00:00	5	blue
5	Zoe	Conway	1974-07-03 00:00:00	2	blue
13	Vivian	Crane	1970-08-27 00:00:00	5	gray
12	Hope	Coffey	1973-12-24 00:00:00	5	green
1	Vivian	Hamilton	1971-07-08 00:00:00	1	green
17	Chanda	Boyer	1973-04-01 00:00:00	4	green
2	Karen	Holcomb	1974-05-23 00:00:00	4	green
9	Merritt	Guy	1974-10-17 00:00:00	4	indigo
11	Jordan	Estes	1969-12-07 00:00:00	4	indigo
3	Cody	Garrett	1973-04-22 00:00:00	1	orange
16	Ayanna	Jarvis	1974-02-11 00:00:00	5	orange
14	Clio	Noel	1972-12-12 00:00:00	100	red
8	Karyn	Diaz	1969-02-24 00:00:00	100	red
6	Gretchen	Kinney	1974-10-18 00:00:00	1	violet
18	Chadwick	Knight	1973-04-29 00:00:00	1	yellow
Time taken: 0.127 seconds, Fetched: 18 row(s)


DELETE
--

    DELETE FROM tablename [WHERE expression]
    
See https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

In [11]:
%%hive
DELETE FROM persons WHERE id = 10;

Query ID = root_20220517145028_7b530327-329d-4010-a61a-86454a1a66be
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0024, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0024/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0024
Hadoop job information for Stage-1: number of mappers: 27; number of reducers: 3
2022-05-17 14:50:37,402 Stage-1 map = 0%,  reduce = 0%
2022-05-17 14:50:42,561 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU 2.71 sec
2022-05-17 14:50:43,607 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 5.56 sec
2022-05-17 14:50:44,639 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 9.

In [12]:
%%hive
SELECT * FROM persons ORDER BY id;

Query ID = root_20220517145115_135ba0fe-fdea-45fc-960f-cbe253040dd8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1652793922537_0025, Tracking URL = http://4feb4ed7d52d:8088/proxy/application_1652793922537_0025/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1652793922537_0025
Hadoop job information for Stage-1: number of mappers: 27; number of reducers: 1
2022-05-17 14:51:24,736 Stage-1 map = 0%,  reduce = 0%
2022-05-17 14:51:28,845 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU 1.81 sec
2022-05-17 14:51:29,883 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 4.03 sec
2022-05-17 14:51:31,967 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 6.

MERGE
--

    MERGE INTO <target table> AS T USING <source expression/table> AS S
    ON <boolean expression1>
    WHEN MATCHED [AND <boolean expression2>] THEN UPDATE SET <set clause list>
    WHEN MATCHED [AND <boolean expression3>] THEN DELETE
    WHEN NOT MATCHED [AND <boolean expression4>] THEN INSERT VALUES <value list>

See https://community.hortonworks.com/articles/97113/hive-acid-merge-by-example.html
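
The notebook does not run a MERGE, so the following sketch only simulates its row-level semantics in plain Python to make the three branches concrete. The function `merge_rows`, the sample rows, and the chosen conditions (update when the source quantity is positive, delete otherwise) are all hypothetical, and the sketch assumes each key appears at most once in the source.

```python
def merge_rows(target, source, key="id"):
    """Simulate MERGE on lists of dicts and return the new target rows.

    Rules used in this sketch (chosen for illustration only):
      WHEN MATCHED AND S.quantity > 0 THEN UPDATE SET quantity = S.quantity
      WHEN MATCHED THEN DELETE
      WHEN NOT MATCHED THEN INSERT VALUES (S.*)
    """
    source_by_key = {row[key]: row for row in source}
    result = []
    for row in target:
        src = source_by_key.pop(row[key], None)
        if src is None:
            result.append(dict(row))  # no source row: target row is untouched
        elif src["quantity"] > 0:
            result.append({**row, "quantity": src["quantity"]})  # UPDATE
        # otherwise the matched row is dropped (DELETE)
    # Source rows that matched nothing are inserted.
    result.extend(dict(row) for row in source_by_key.values())
    return result


target = [{"id": 1, "quantity": 1}, {"id": 2, "quantity": 4}, {"id": 3, "quantity": 1}]
source = [{"id": 2, "quantity": 9}, {"id": 3, "quantity": 0}, {"id": 20, "quantity": 7}]
merged = merge_rows(target, source)
print(merged)
```

In this run, row 1 is kept as is, row 2 is updated, row 3 is deleted, and row 20 is inserted.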

---

In [13]:
%%hive
-- clean up the database
DROP DATABASE IF EXISTS demo CASCADE;

OK
Time taken: 0.55 seconds


In [14]:
%quit

In [15]:
!rm *.log