Complex data types
===

* 60 min | Last modified: November 07, 2019

In [1]:
%load_ext bigdata

In [2]:
%pig_start

In [3]:
%timeout 300

## Simple data types

Pig supports the following simple data types:

     int      long      float       double      chararray  
     boolean  datetime  biginteger  bigdecimal  bytearray


## Complex data types

Apache Pig works with the following nested data structures (http://pig.apache.org/docs/r0.17.0/basic.html#relations):

* A *tuple* is an ordered set of fields: (field1, field2, ...).
* A *bag* is a collection of tuples: {(...), (...), ...}
* A *map* is a set of key-value pairs: [key#value, ...]
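
The three structures listed above can be approximated with plain Python values. This is only an illustrative sketch, not how Pig stores data: the names `record_tuple`, `record_bag`, `record_map`, and `to_pig_text` are hypothetical and exist only for this example.

```python
# Plain-Python stand-ins for Pig's complex types (illustration only).
record_tuple = (1, 2)              # tuple: ordered fields, written (1,2)
record_bag = [(1, 2), (3, 4)]      # bag: collection of tuples, written {(1,2),(3,4)}
record_map = {"a": 1, "b": 2}      # map: key#value pairs, written [a#1,b#2]


def to_pig_text(value):
    """Render a value using the textual notation shown in the list above."""
    if isinstance(value, tuple):
        return "(" + ",".join(to_pig_text(v) for v in value) + ")"
    if isinstance(value, list):
        return "{" + ",".join(to_pig_text(v) for v in value) + "}"
    if isinstance(value, dict):
        pairs = (f"{k}#{to_pig_text(v)}" for k, v in value.items())
        return "[" + ",".join(pairs) + "]"
    return str(value)


print(to_pig_text(record_tuple))  # (1,2)
print(to_pig_text(record_bag))    # {(1,2),(3,4)}
print(to_pig_text(record_map))    # [a#1,b#2]
```

The rendered strings use the same notation as the third column of the `data.tsv` files created in the sections that follow.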



### TUPLE

In [4]:
%%writefile data.tsv
A	10	(1, 2)
B	20	(3, 4)
C	30	(5, 6)
D	40	(7, 8)

Writing data.tsv


In [5]:
!hadoop fs -put data.tsv

In [6]:
%%pig
--
-- The fields in the file are separated by
-- tabs.
--
u = LOAD 'data.tsv'
    AS (f1:CHARARRAY, f2:INT, f3:TUPLE(p:INT, q:INT));
DUMP u;


In [7]:
%%pig
--
-- Tuple fields can be accessed
-- by name or by position.
--
r = FOREACH u GENERATE f3.p, f3.$1;
DUMP r;
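
The two access styles in the cell above (by name with `f3.p`, by position with `f3.$1`) can be pictured with a Python `namedtuple`. This is a rough analogy under the assumption that the first two rows of `data.tsv` have been parsed into Python values; the name `F3` is hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in for the schema f3:TUPLE(p:INT, q:INT).
F3 = namedtuple("F3", ["p", "q"])

# First two rows of data.tsv, already parsed (assumption for this sketch).
rows = [("A", 10, F3(1, 2)), ("B", 20, F3(3, 4))]

# f3.p reads a field by name; f3.$1 reads the second field by position.
result = [(f3.p, f3[1]) for (_f1, _f2, f3) in rows]
print(result)  # [(1, 2), (3, 4)]
```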


In [8]:
%%pig
--
-- Here the tuple fields are accessed by
-- position because they have no names.
--
u = LOAD 'data.tsv' AS (f1:CHARARRAY, f2:INT, f3:TUPLE(INT, INT));
r = FOREACH u GENERATE $2.$0, $2.$1;
DUMP r;


In [9]:
%%writefile data.tsv
A	(1,  2)	(3,  4)
B	(5,  6)	(7,  8)
C	(9, 10)	(11, 12)

Overwriting data.tsv


In [10]:
!hadoop fs -rm data.tsv
!hadoop fs -put data.tsv

Deleted data.tsv


In [11]:
%%pig
--
-- Fields are selected by name
--
u = LOAD 'data.tsv'
    AS (f1: CHARARRAY, 
        t1: TUPLE(t1a: INT, t1b: INT), 
        t2: TUPLE(t2a: INT, t2b: INT)); 
r = FOREACH u GENERATE f1, t2.t2b;



### BAG

In [12]:
%%writefile data.tsv
A	10	{( 1,  2),( 3,  4)}
B	20	{( 5,  6),( 7,  8)}
C	30	{( 9, 10),(11, 12)}
D	40	{(13, 14),(15, 16)}

Overwriting data.tsv


In [13]:
!hadoop fs -rm data.tsv
!hadoop fs -put data.tsv

Deleted data.tsv


In [14]:
%%pig
--
-- The `bag` is selected by name
--
u = LOAD 'data.tsv'
    AS (f1:CHARARRAY, f2:INT, f3:BAG{t: TUPLE(p:INT, q:INT)});
r = FOREACH u GENERATE f3;
DUMP r;


In [15]:
%%pig
r = FOREACH u GENERATE f3.p;
DUMP r;
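
Projecting a field out of a bag, as `f3.p` does above, yields another bag in which every tuple keeps only that field. The following Python sketch models that behavior on the first two rows of `data.tsv`; the helper `project_first` is hypothetical and the rows are assumed to be already parsed.

```python
# Rows of relation u as (f1, f2, f3), where f3 is a bag of (p, q) tuples.
rows = [
    ("A", 10, [(1, 2), (3, 4)]),
    ("B", 20, [(5, 6), (7, 8)]),
]


def project_first(bag):
    """Keep only the first field (p) of every tuple in the bag."""
    return [(t[0],) for t in bag]


# Modeled equivalent of: r = FOREACH u GENERATE f3.p;
result = [project_first(f3) for (_f1, _f2, f3) in rows]
print(result)  # [[(1,), (3,)], [(5,), (7,)]]
```

Each output row still holds one bag, so the number of rows does not change.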


### MAP

In [16]:
%%writefile data.tsv
A	10	[a#1,b#2]
B	20	[a#3,c#4]
C	30	[b#5,c#6]
D	40	[b#7,c#8]

Overwriting data.tsv


In [17]:
!hadoop fs -rm data.tsv
!hadoop fs -put data.tsv

Deleted data.tsv


In [18]:
%%pig
u = LOAD 'data.tsv'
    AS (f1:CHARARRAY, f2:INT, f3:MAP[]);
r = FOREACH u GENERATE f3#'a', f3#'c';
DUMP r;
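
The `#` operator above looks up a key in the map, and a key that is absent from a given row produces a null. A Python dictionary with `dict.get` behaves similarly, so the lookup can be sketched as follows. The values are modeled as integers for readability, which is an assumption: with an untyped `MAP[]` schema Pig treats them as bytearrays.

```python
# Rows of data.tsv with the third column modeled as a dictionary.
rows = [
    ("A", 10, {"a": 1, "b": 2}),
    ("B", 20, {"a": 3, "c": 4}),
    ("C", 30, {"b": 5, "c": 6}),
    ("D", 40, {"b": 7, "c": 8}),
]

# Modeled equivalent of: r = FOREACH u GENERATE f3#'a', f3#'c';
# dict.get returns None for a missing key, playing the role of Pig's null.
result = [(f3.get("a"), f3.get("c")) for (_f1, _f2, f3) in rows]
print(result)  # [(1, None), (3, 4), (None, 6), (None, 8)]
```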


## Manipulating complex data: FLATTEN

In [19]:
%%writefile data.tsv
A	10	(1, 2)
B	20	(3, 4)
C	30	(5, 6)
D	40	(7, 8)

Overwriting data.tsv


In [20]:
!hadoop fs -rm data.tsv
!hadoop fs -put data.tsv

Deleted data.tsv


In [21]:
%%pig
u = LOAD 'data.tsv'
    AS (f1:CHARARRAY, f2:INT, f3:TUPLE(p:INT, q:INT));
DUMP u;


In [22]:
%%pig
r = FOREACH u GENERATE f1, FLATTEN(f3);
DUMP r;
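
Applied to a tuple, `FLATTEN` promotes the tuple's fields to top-level fields of the output row, so the row count stays the same. A minimal Python sketch of that effect, assuming the first two rows of `data.tsv` are already parsed:

```python
# Rows of relation u as (f1, f2, f3), where f3 is a (p, q) tuple.
rows = [("A", 10, (1, 2)), ("B", 20, (3, 4))]

# Modeled equivalent of: r = FOREACH u GENERATE f1, FLATTEN(f3);
# The * operator splices the fields of f3 into the enclosing row.
result = [(f1, *f3) for (f1, _f2, f3) in rows]
print(result)  # [('A', 1, 2), ('B', 3, 4)]
```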


In [23]:
%%writefile data.tsv
A	10	{(1),(2)}
B	20	{(3),(4)}
C	30	{(5),(6)}
D	40	{(7),(8)}

Overwriting data.tsv


In [24]:
!hadoop fs -rm data.tsv
!hadoop fs -put data.tsv

Deleted data.tsv


In [25]:
%%pig
u = LOAD 'data.tsv'
    AS (f1:CHARARRAY, f2:INT, f3:BAG{t:(p:INT)});
DUMP u;


In [26]:
%%pig
r = FOREACH u GENERATE f1, FLATTEN(f3);
DUMP r;
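
Applied to a bag, `FLATTEN` produces one output row per tuple in the bag and repeats the other generated fields on each of those rows. The Python sketch below models this with a nested comprehension. The row `"E"` with an empty bag is hypothetical (it is not in `data.tsv`) and is included to show that an empty bag contributes no output rows.

```python
# Rows of relation u as (f1, f2, f3), where f3 is a bag of one-field tuples.
rows = [
    ("A", 10, [(1,), (2,)]),
    ("B", 20, [(3,), (4,)]),
    ("E", 50, []),  # hypothetical row with an empty bag
]

# Modeled equivalent of: r = FOREACH u GENERATE f1, FLATTEN(f3);
# Every tuple t in the bag becomes its own row, prefixed with f1.
result = [(f1, *t) for (f1, _f2, f3) in rows for t in f3]
print(result)  # [('A', 1), ('A', 2), ('B', 3), ('B', 4)]
```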


In [27]:
%%pig
r = FOREACH u GENERATE FLATTEN(f3);
DUMP r;



In [28]:
%%pig
--
-- Several statements can be placed inside
-- a FOREACH block
--
r1 = FOREACH u {
        GENERATE FLATTEN(f3);
};
DUMP r1;


In [29]:
%%pig
r1 = FOREACH u GENERATE (DOUBLE) $1;
DUMP r1;
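
The cell above combines a positional reference (`$1`, the second field of `u`, which is `f2`) with a cast to `DOUBLE`. In Python terms this is roughly an index followed by a `float` conversion. A small sketch, assuming the first two rows of `data.tsv` are already parsed:

```python
# Rows of relation u as (f1, f2, f3); f2 is the INT column referenced by $1.
rows = [("A", 10, [(1,), (2,)]), ("B", 20, [(3,), (4,)])]

# Modeled equivalent of: r1 = FOREACH u GENERATE (DOUBLE) $1;
result = [(float(row[1]),) for row in rows]
print(result)  # [(10.0,), (20.0,)]
```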


## System cleanup

In [30]:
%pig_quit

In [31]:
!rm *.tsv
!hadoop fs -rm *.tsv

Deleted data.tsv
