Complex data types
===

* 60 min | Last modified: November 07, 2019

In [1]:
%load_ext bigdata

In [2]:
%pig_start

In [3]:
%timeout 300

## Simple data types

Pig supports the following simple (scalar) data types:

     int      long      float       double      chararray  
     boolean  datetime  biginteger  bigdecimal  bytearray


## Complex data types

Apache Pig builds its relations from the following hierarchy of structures (http://pig.apache.org/docs/r0.17.0/basic.html#relations):

* A *tuple* is an ordered set of fields: (field1, field2, ...).
* A *bag* is a collection of tuples: {(...), (...), ...}
* A *map* is a set of key-value pairs: [key#value, ...]
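
As a rough mental model only, the three complex types can be mirrored with plain Python values. This is an analogy, not Pig code: the variable names below are hypothetical, and a Python list keeps its order whereas a Pig bag has no guaranteed order.

```python
# Illustrative analogy only: plain-Python stand-ins for Pig's complex types.

# tuple: an ordered set of fields -> Python tuple
row = ("A", 10, (1, 2))

# bag: a collection of tuples -> list of tuples (a Pig bag is unordered)
bag = [(1, 2), (3, 4)]

# map: key#value pairs with chararray keys -> dict with string keys
mapping = {"a": 1, "b": 2}

# Complex types nest: a field of a tuple may itself be a tuple, a bag, or a map.
nested = ("A", 10, bag, mapping)

print(row[2])        # the inner tuple
print(len(bag))      # number of tuples in the bag
print(mapping["a"])  # value stored under key 'a'
```

The nesting is the important part: a relation is a bag of tuples, and any field of those tuples can again hold one of these structures.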



### TUPLE

In [4]:
%%writefile data.tsv
A	10	(1, 2)
B	20	(3, 4)
C	30	(5, 6)
D	40	(7, 8)

Writing data.tsv


In [5]:
!hadoop fs -put data.tsv

In [6]:
%%pig
--
-- The fields in the file are separated by
-- tabs.
--
u = LOAD 'data.tsv'
    AS (f1:CHARARRAY, f2:INT, f3:TUPLE(p:INT, q:INT));
DUMP u;


In [7]:
%%pig
--
-- The fields of the tuple can be accessed
-- by name or by position.
--
r = FOREACH u GENERATE f3.p, f3.$1 ;   
DUMP r;

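The two access styles can be sketched in plain Python. This is an analogy under stated assumptions, not Pig output: the `F3` namedtuple is a hypothetical stand-in for the schema `f3:TUPLE(p:INT, q:INT)`, and the rows are typed in by hand from `data.tsv`.

```python
from collections import namedtuple

# Hypothetical stand-in for the schema f3:TUPLE(p:INT, q:INT).
F3 = namedtuple("F3", ["p", "q"])

# Rows mirroring data.tsv: (f1, f2, f3).
u = [("A", 10, F3(1, 2)), ("B", 20, F3(3, 4)),
     ("C", 30, F3(5, 6)), ("D", 40, F3(7, 8))]

# Analogue of: FOREACH u GENERATE f3.p, f3.$1
# f3.p selects a field by name; f3.$1 selects the second field by
# position, since positions are counted from 0.
r = [(f3.p, f3[1]) for (_f1, _f2, f3) in u]
print(r)  # [(1, 2), (3, 4), (5, 6), (7, 8)]
```

Positional access keeps working when the schema declares no field names, which is why it is the only option for an unnamed `TUPLE(INT, INT)`.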

In [8]:
%%pig
--
-- Here the fields of the tuple are accessed by
-- position because they have no names.
--
u = LOAD 'data.tsv' AS (f1:CHARARRAY, f2:INT, f3:TUPLE(INT, INT));
r = FOREACH u GENERATE $2.$0, $2.$1;
DUMP r;


In [9]:
%%writefile data.tsv
A	(1,  2)	(3,  4)
B	(5,  6)	(7,  8)
C	(9, 10)	(11, 12)

Overwriting data.tsv


In [10]:
!hadoop fs -rm data.tsv
!hadoop fs -put data.tsv

Deleted data.tsv


In [11]:
%%pig
--
-- Fields are selected by name
--
u = LOAD 'data.tsv'
    AS (f1: CHARARRAY, 
        t1: TUPLE(t1a: INT, t1b: INT), 
        t2: TUPLE(t2a: INT, t2b: INT)); 
r = FOREACH u GENERATE f1, t2.t2b;
DUMP r;



### BAG

In [12]:
%%writefile data.tsv
A	10	{( 1,  2),( 3,  4)}
B	20	{( 5,  6),( 7,  8)}
C	30	{( 9, 10),(11, 12)}
D	40	{(13, 14),(15, 16)}

Overwriting data.tsv


In [13]:
!hadoop fs -rm data.tsv
!hadoop fs -put data.tsv

Deleted data.tsv


In [14]:
%%pig
--
-- The bag is selected by name
--
u = LOAD 'data.tsv'
    AS (f1:CHARARRAY, f2:INT, f3:BAG{t: TUPLE(p:INT, q:INT)});
r = FOREACH u GENERATE f3;
DUMP r;


In [15]:
%%pig
r = FOREACH u GENERATE f3.p;
DUMP r;

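Projecting a field out of a bag, as in `f3.p`, applies the projection to every tuple in the bag and returns a bag again. A minimal Python sketch of that behavior follows; the helper `project_p` is hypothetical, and the rows are copied by hand from `data.tsv`.

```python
# Rows mirroring data.tsv: (f1, f2, f3), where f3 is a bag of (p, q) tuples.
u = [("A", 10, [(1, 2), (3, 4)]),
     ("B", 20, [(5, 6), (7, 8)]),
     ("C", 30, [(9, 10), (11, 12)]),
     ("D", 40, [(13, 14), (15, 16)])]


def project_p(bag):
    """Keep only field p (position 0) of each tuple; the result is still a bag."""
    return [(p,) for (p, _q) in bag]


# Analogue of: FOREACH u GENERATE f3.p
r = [(project_p(f3),) for (_f1, _f2, f3) in u]
print(r[0])  # ([(1,), (3,)],)
```

Each output row still carries one bag; only the tuples inside it became narrower.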

### MAP

In [16]:
%%writefile data.tsv
A	10	[a#1,b#2]
B	20	[a#3,c#4]
C	30	[b#5,c#6]
D	40	[b#7,c#8]

Overwriting data.tsv


In [17]:
!hadoop fs -rm data.tsv
!hadoop fs -put data.tsv

Deleted data.tsv


In [18]:
%%pig
u = LOAD 'data.tsv'
    AS (f1:CHARARRAY, f2:INT, f3:MAP[]);
r = FOREACH u GENERATE f3#'a', f3#'c';
DUMP r;

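The `#` operator looks up a key in a map, and a key that is absent from a given row yields null. The sketch below models this with Python dictionaries. It is a simplification: with an untyped `MAP[]` schema Pig treats the values as bytearray, whereas here they are plain integers entered by hand from `data.tsv`.

```python
# Rows mirroring data.tsv: (f1, f2, f3), where f3 is a map.
u = [("A", 10, {"a": 1, "b": 2}),
     ("B", 20, {"a": 3, "c": 4}),
     ("C", 30, {"b": 5, "c": 6}),
     ("D", 40, {"b": 7, "c": 8})]

# Analogue of: FOREACH u GENERATE f3#'a', f3#'c'
# dict.get returns None for a missing key, standing in for Pig's null.
r = [(f3.get("a"), f3.get("c")) for (_f1, _f2, f3) in u]
print(r)  # [(1, None), (3, 4), (None, 6), (None, 8)]
```

Only row B contains both keys, so every other row has a null in one of the two columns.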

## Manipulating complex data: FLATTEN

In [19]:
%%writefile data.tsv
A	10	(1, 2)
B	20	(3, 4)
C	30	(5, 6)
D	40	(7, 8)

Overwriting data.tsv


In [20]:
!hadoop fs -rm data.tsv
!hadoop fs -put data.tsv

Deleted data.tsv


In [21]:
%%pig
u = LOAD 'data.tsv'
    AS (f1:CHARARRAY, f2:INT, f3:TUPLE(p:INT, q:INT));
DUMP u;


In [22]:
%%pig
r = FOREACH u GENERATE f1, FLATTEN(f3);
DUMP r;

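Applied to a tuple, `FLATTEN` removes one level of nesting: the fields of the inner tuple become top-level fields of the output row. A small Python analogy, with rows entered by hand from `data.tsv`:

```python
# Rows mirroring data.tsv: (f1, f2, f3), where f3 is a tuple (p, q).
u = [("A", 10, (1, 2)), ("B", 20, (3, 4)),
     ("C", 30, (5, 6)), ("D", 40, (7, 8))]

# Analogue of: FOREACH u GENERATE f1, FLATTEN(f3)
# Unpacking with * splices the inner tuple's fields into the output row.
r = [(f1, *f3) for (f1, _f2, f3) in u]
print(r)  # [('A', 1, 2), ('B', 3, 4), ('C', 5, 6), ('D', 7, 8)]
```

The number of rows does not change in this case; only the shape of each row does.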

In [23]:
%%writefile data.tsv
A	10	{(1),(2)}
B	20	{(3),(4)}
C	30	{(5),(6)}
D	40	{(7),(8)}

Overwriting data.tsv


In [24]:
!hadoop fs -rm data.tsv
!hadoop fs -put data.tsv

Deleted data.tsv


In [25]:
%%pig
u = LOAD 'data.tsv'
    AS (f1:CHARARRAY, f2:INT, f3:BAG{t:(p:INT)});
DUMP u;


In [26]:
%%pig
r = FOREACH u GENERATE f1, FLATTEN(f3);
DUMP r;

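Applied to a bag, `FLATTEN` behaves differently from the tuple case: it emits one output row per tuple in the bag and repeats the other generated fields on each of those rows. The following Python sketch is an analogy only, with rows entered by hand from `data.tsv`.

```python
# Rows mirroring data.tsv: (f1, f2, f3), where f3 is a bag of 1-field tuples.
u = [("A", 10, [(1,), (2,)]),
     ("B", 20, [(3,), (4,)]),
     ("C", 30, [(5,), (6,)]),
     ("D", 40, [(7,), (8,)])]

# Analogue of: FOREACH u GENERATE f1, FLATTEN(f3)
# The inner loop produces one row per tuple of the bag; f1 is repeated.
r = [(f1, *t) for (f1, _f2, f3) in u for t in f3]
print(r)
# [('A', 1), ('A', 2), ('B', 3), ('B', 4),
#  ('C', 5), ('C', 6), ('D', 7), ('D', 8)]
```

Four input rows become eight output rows here. The Pig documentation also notes that flattening an empty bag produces no output row, which matches what the inner loop does for an empty list.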

In [27]:
%%pig
r = FOREACH u GENERATE FLATTEN(f3);
DUMP r;



In [28]:
%%pig
--
-- Several statements can be placed inside a
-- FOREACH block
--
r1 = FOREACH u {
        GENERATE FLATTEN(f3);
};
DUMP r1;


In [29]:
%%pig
-- Cast the second field ($1, i.e. f2) to DOUBLE.
r1 = FOREACH u GENERATE (DOUBLE) $1;
DUMP r1;


## System cleanup

In [30]:
%pig_quit

In [31]:
!rm *.tsv
!hadoop fs -rm *.tsv

Deleted data.tsv
