Basic data analysis with Apache Pig
===

* 60 min | Last modified: November 7, 2019

Downloading the data
---

In [1]:
filenames = [
    "drivers.csv",
    "timesheet.csv",
    "truck_event_text_partition.csv",
]

url = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/"

!mkdir -p /tmp/drivers/
for filename in filenames:
    !wget --quiet {url + filename} -P /tmp/drivers/
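
The download step above relies on the `wget` command being available in the notebook environment. As a hedged alternative, the following sketch does the same job with the Python standard library only. The helper names `build_urls` and `download_all` are hypothetical (they are not part of the original notebook), and `download_all` needs network access, so it is defined here but not called.

```python
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlretrieve

BASE_URL = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/"
FILENAMES = ["drivers.csv", "timesheet.csv", "truck_event_text_partition.csv"]


def build_urls(base_url, filenames):
    """Return one absolute URL per file name (hypothetical helper)."""
    return [urljoin(base_url, name) for name in filenames]


def download_all(base_url, filenames, target_dir="/tmp/drivers"):
    """Download every file into target_dir, overwriting existing copies."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    for name, file_url in zip(filenames, build_urls(base_url, filenames)):
        urlretrieve(file_url, str(target / name))  # requires network access
    return sorted(p.name for p in target.glob("*.csv"))


# Usage (not executed here because it needs network access):
# download_all(BASE_URL, FILENAMES)
```

Unlike `wget -P`, which saves a second copy as `drivers.csv.1` when the file already exists, this sketch overwrites the previous download.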

Moving the data to HDFS
---

In [2]:
!hdfs dfs -rm -r -f drivers/ output/
!hdfs dfs -mkdir -p drivers/
!hdfs dfs -copyFromLocal /tmp/drivers/*.csv  drivers/
!hdfs dfs -ls drivers/*

Deleted drivers
Deleted output
-rw-r--r--   1 root supergroup       2043 2022-05-31 16:45 drivers/drivers.csv
-rw-r--r--   1 root supergroup      26205 2022-05-31 16:45 drivers/timesheet.csv
-rw-r--r--   1 root supergroup    2272077 2022-05-31 16:45 drivers/truck_event_text_partition.csv


Selecting a subset of the data
---

In [3]:
%%writefile truck-events.pig

truck_events = LOAD 'drivers/truck_event_text_partition.csv' USING PigStorage(',')
    AS (
            driverId:int, 
            truckId:int, 
            eventTime:chararray,
            eventType:chararray, 
            longitude:double, 
            latitude:double,
            eventKey:chararray, 
            correlationId:long, 
            driverName:chararray,
            routeId:long,
            routeName:chararray,
            eventDate:chararray
    );

truck_events_subset = LIMIT truck_events 10;
    
specific_columns = FOREACH truck_events_subset GENERATE driverId, eventTime, eventType;
    
STORE specific_columns INTO 'output/specific_columns' USING PigStorage(',');

Overwriting truck-events.pig
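
To make the script's two steps concrete, here is a minimal Python sketch of what `LIMIT` followed by `FOREACH ... GENERATE` does: keep at most N rows, then project three of the twelve columns. The sample rows are the first twelve fields of rows that appear in the join output shown later in this notebook; the function name `select_columns` is hypothetical.

```python
import csv
import io
from itertools import islice

# Column names taken from the AS (...) schema of the Pig script.
FIELDS = [
    "driverId", "truckId", "eventTime", "eventType", "longitude", "latitude",
    "eventKey", "correlationId", "driverName", "routeId", "routeName", "eventDate",
]

SAMPLE = """\
10,85,00:35.2,Normal,-92.99,37.34,10|85|9223370572464740606,3660000000000000000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-27-22
10,23,58:48.7,Normal,-90.69,38.5,10|23|9223370572126447149,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-06-02-20
10,23,59:04.1,Normal,-93.69,37.16,10|23|9223370572126431719,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-06-02-20
"""


def select_columns(text, columns, limit):
    """Keep the first `limit` rows, then project the requested columns."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    subset = islice(reader, limit)                 # LIMIT truck_events 10
    return [tuple(row[c] for c in columns) for row in subset]  # FOREACH ... GENERATE


result = select_columns(SAMPLE, ["driverId", "eventTime", "eventType"], limit=2)
print(result)
```

One difference worth keeping in mind: this sketch always takes the first rows of the input, whereas Pig's `LIMIT` without a preceding `ORDER` does not promise which rows are returned.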


In [4]:
!pig -f truck-events.pig

2022-05-31 16:45:31,859 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:45:32,582 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:45:32,655 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-05-31 16:45:32,673 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:45:33,143 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-05-31 16:45:33,272 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1654012563278_0021
2022-05-31 16:45:33,404 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2022-05-31 16:45:33,448 [JobControl] INFO  org

In [5]:
!hdfs dfs -ls output/specific_columns/

Found 2 items
-rw-r--r--   1 root supergroup          0 2022-05-31 16:46 output/specific_columns/_SUCCESS
-rw-r--r--   1 root supergroup        183 2022-05-31 16:46 output/specific_columns/part-r-00000


In [6]:
!hdfs dfs -text output/specific_columns/part-r-00000 | head

11,59:21.7,Normal
11,59:22.5,Normal
14,59:21.4,Normal
18,59:21.7,Normal
20,59:22.5,Normal
22,59:21.7,Normal
22,59:22.3,Normal
23,59:22.4,Normal
27,59:21.7,Normal
,eventTime,eventType
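
The last output line, `,eventTime,eventType`, suggests that the CSV header row was loaded as ordinary data: the text `driverId` cannot be read as an `int`, so the field becomes null and is written as an empty string, while the two `chararray` fields pass through unchanged. The sketch below mimics that behaviour in Python. It assumes the file starts with a header line using these column names, which is inferred from the output and not stated in the notebook; the helper names are hypothetical.

```python
def to_int_or_none(text):
    """Mimic a lenient cast: return None when text is not an integer."""
    try:
        return int(text)
    except ValueError:
        return None


def format_row(values):
    """Join fields with commas, writing None as an empty field."""
    return ",".join("" if v is None else str(v) for v in values)


rows = [
    ["11", "59:21.7", "Normal"],
    ["driverId", "eventTime", "eventType"],  # assumed header line
]

converted = [[to_int_or_none(r[0]), r[1], r[2]] for r in rows]
lines = [format_row(r) for r in converted]

# Dropping rows whose driverId could not be parsed removes the header,
# analogous to a Pig `FILTER ... BY driverId IS NOT NULL` step.
clean = [format_row(r) for r in converted if r[0] is not None]

print(lines)
print(clean)
```

If the header should not count against the ten-row limit, one option is to filter out rows with a null `driverId` before applying `LIMIT`.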


Running a join
---

In [7]:
%%writefile join.pig

truck_events = LOAD 'drivers/truck_event_text_partition.csv' USING PigStorage(',')
    AS (
            driverId:int, 
            truckId:int, 
            eventTime:chararray,
            eventType:chararray, 
            longitude:double, 
            latitude:double,
            eventKey:chararray, 
            correlationId:long, 
            driverName:chararray,
            routeId:long,
            routeName:chararray,
            eventDate:chararray
    );

drivers =  LOAD 'drivers/drivers.csv' USING PigStorage(',')
    AS (
            driverId:int, 
            name:chararray, 
            ssn:chararray,
            location:chararray, 
            certified:chararray,
            wage_plan:chararray
    );
    
join_data = JOIN  truck_events BY (driverId), drivers BY (driverId);

STORE join_data INTO 'output/join_data' USING PigStorage(',');

Overwriting join.pig
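
As a rough illustration of what the `JOIN` statement computes, the following Python sketch performs an inner join on `driverId` with a small lookup table. The rows are shortened versions of records that appear in this notebook's outputs, except for the event with `driverId` 99, which is a hypothetical row added to show that unmatched records are dropped. The function name `inner_join` is also hypothetical.

```python
from collections import defaultdict

# Shortened truck_events rows: (driverId, truckId, eventTime, eventType)
events = [
    (10, 85, "00:35.2", "Normal"),
    (10, 23, "58:48.7", "Normal"),
    (99, 1, "00:00.0", "Normal"),  # hypothetical: no driver with this id
]

# Shortened drivers rows: (driverId, name, wage_plan)
drivers = [
    (10, "George Vetticaden", "miles"),
    (23, "Adam Diaz", "hours"),
]


def inner_join(left, right, left_key=0, right_key=0):
    """Return left + right for every pair of rows whose keys are equal."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [l + r for l in left for r in index.get(l[left_key], [])]


joined = inner_join(events, drivers)
for row in joined:
    print(row)
```

Each joined row carries the key twice, once from each side, which is also why the Pig output contains the driver id both at the start of the record and again before the driver's name.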


In [8]:
!pig -f join.pig

2022-05-31 16:46:16,018 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:16,369 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:16,443 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-05-31 16:46:16,469 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:46:16,488 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:46:16,521 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:2
2022-05-31 16:46:16,659 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1654012563278_0023
2022-05-31 16:46:16,794 [JobControl] INFO  org.apache.hadoop.map

In [9]:
!hdfs dfs -ls output/join_data/

Found 2 items
-rw-r--r--   1 root supergroup          0 2022-05-31 16:46 output/join_data/_SUCCESS
-rw-r--r--   1 root supergroup    3283088 2022-05-31 16:46 output/join_data/part-r-00000


In [10]:
!hdfs dfs -cat output/join_data/part-r-00000 | head

10,85,00:35.2,Normal,-92.99,37.34,10|85|9223370572464740606,3660000000000000000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-27-22,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,23,58:48.7,Normal,-90.69,38.5,10|23|9223370572126447149,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-06-02-20,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,23,59:04.1,Normal,-93.69,37.16,10|23|9223370572126431719,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-06-02-20,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,43,37:06.0,Normal,-90.69,38.5,10|43|9223370572419349763,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-28-11,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,39,08:56.0,Normal,-91.44,38.09,10|39|9223370571956639801,1000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-06-02-20,10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
10,23,58:53.0,Normal,-91.44,38.09,10|23|922337057212

Sorting data with 'ORDER BY'
---

In [11]:
%%writefile sort.pig

drivers =  LOAD 'drivers/drivers.csv' USING PigStorage(',')
    AS (
            driverId:int, 
            name:chararray, 
            ssn:chararray,
            location:chararray, 
            certified:chararray,
            wage_plan:chararray
    );

ordered_data = ORDER drivers BY name asc;

STORE ordered_data INTO 'output/ordered_data' USING PigStorage(',');

Overwriting sort.pig
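
For comparison, the same ascending sort by name can be sketched in plain Python. The rows below are a subset of the drivers that appear in this notebook's outputs, reduced to `(driverId, name)`.

```python
# Shortened drivers rows: (driverId, name)
drivers = [
    (10, "George Vetticaden"),
    (23, "Adam Diaz"),
    (36, "Andrew Grande"),
    (14, "Adis Cesir"),
]

# Equivalent in spirit to: ORDER drivers BY name asc;
ordered = sorted(drivers, key=lambda row: row[1])

names = [name for _, name in ordered]
print(names)
```

Python compares strings by code point, so uppercase letters sort before lowercase ones. Whether that matches Pig's ordering for names with non-ASCII characters has not been checked here.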


In [12]:
!pig -f sort.pig

2022-05-31 16:46:38,366 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:38,702 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:46:38,774 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-05-31 16:46:38,794 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:46:38,893 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-05-31 16:46:39,050 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1654012563278_0024
2022-05-31 16:46:39,194 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2022-05-31 16:46:39,242 [JobControl] INFO  org

In [13]:
!hdfs dfs -ls output/ordered_data/ 

Found 2 items
-rw-r--r--   1 root supergroup          0 2022-05-31 16:47 output/ordered_data/_SUCCESS
-rw-r--r--   1 root supergroup       2002 2022-05-31 16:47 output/ordered_data/part-r-00000


In [14]:
!hdfs dfs -cat output/ordered_data/part-r-00000 | head 

23,Adam Diaz,928312208,P.O. Box 260- 6127 Vitae Road,Y,hours
14,Adis Cesir,820812209,Ap #810-1228 In St.,Y,hours
19,Ajay Singh,160005158,592-9430 Nonummy Avenue,Y,hours
36,Andrew Grande,245303216,Ap #685-9598 Egestas Rd.,Y,hours
20,Chris Harris,921812303,883-2691 Proin Avenue,Y,hours
30,Dan Rice,282307061,Ap #881-9267 Mollis Avenue,Y,hours
43,Dave Patton,977706052,3028 A- St.,Y,hours
39,David Kaiser,967706052,9185 At Street,Y,hours
24,Don Hilborn,254412152,4361 Ac Road,Y,hours
35,Emil Siemes,971401151,321-2976 Felis Rd.,Y,hours


Filtering and grouping with "GROUP BY"
---

In [15]:
%%writefile groupby.pig

truck_events = LOAD 'drivers/truck_event_text_partition.csv' USING PigStorage(',')
    AS (
            driverId:int, 
            truckId:int, 
            eventTime:chararray,
            eventType:chararray, 
            longitude:double, 
            latitude:double,
            eventKey:chararray, 
            correlationId:long, 
            driverName:chararray,
            routeId:long,
            routeName:chararray,
            eventDate:chararray
    );
    
filtered_events = FILTER truck_events BY NOT (eventType MATCHES 'Normal');

grouped_events = GROUP filtered_events BY driverId;

STORE grouped_events INTO 'output/grouped_events' USING PigStorage(',');

Overwriting groupby.pig
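
The script keeps only the events whose type is not `Normal` and then collects them into one bag per driver. A minimal Python sketch of the same idea follows, using shortened `(driverId, eventTime, eventType)` rows taken from this notebook's outputs. The function name `group_abnormal` is hypothetical.

```python
from collections import defaultdict

# Shortened truck_events rows: (driverId, eventTime, eventType)
events = [
    (10, "00:13.1", "Unsafe tail distance"),
    (10, "00:35.2", "Normal"),
    (11, "00:14.1", "Lane Departure"),
    (10, "00:39.7", "Overspeed"),
    (11, "00:05.4", "Unsafe following distance"),
]


def group_abnormal(rows):
    """Drop 'Normal' events, then group the remaining rows by driverId."""
    groups = defaultdict(list)
    for row in rows:
        # MATCHES tests the whole string against the pattern, so with the
        # literal pattern 'Normal' it behaves like an equality check.
        if row[2] != "Normal":
            groups[row[0]].append(row)
    return dict(groups)


grouped = group_abnormal(events)
counts = {driver_id: len(bag) for driver_id, bag in grouped.items()}
print(counts)
```

The `counts` dictionary shows a typical next step: reducing each group to a single number per driver, which is easier to store as delimited text than the nested bags themselves.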


In [16]:
!pig -f groupby.pig

2022-05-31 16:47:32,676 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:33,054 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-05-31 16:47:33,125 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-05-31 16:47:33,145 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 1
2022-05-31 16:47:33,195 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-05-31 16:47:33,344 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1654012563278_0027
2022-05-31 16:47:33,487 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2022-05-31 16:47:33,530 [JobControl] INFO  org

In [17]:
!hdfs dfs -ls output/grouped_events/

Found 2 items
-rw-r--r--   1 root supergroup          0 2022-05-31 16:47 output/grouped_events/_SUCCESS
-rw-r--r--   1 root supergroup       5613 2022-05-31 16:47 output/grouped_events/part-r-00000


In [18]:
!hdfs dfs -cat output/grouped_events/part-r-00000 | head 

10,{(10,85,00:13.1,Unsafe tail distance,-91.18,38.22,10|85|9223370572464762694,3660000000000000000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-27-22),(10,85,00:39.7,Overspeed,-94.23,37.09,10|85|9223370572464736126,3660000000000000000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-27-22),(10,85,59:46.9,Overspeed,-95.5,36.37,10|85|9223370572464788896,3660000000000000000,George Vetticaden,1390372503,Saint Louis to Tulsa,2016-05-27-22)}
11,{(11,74,00:14.1,Lane Departure,-88.77,40.76,11|74|9223370572464761716,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22),(11,74,00:49.6,Lane Departure,-89.71,37.47,11|74|9223370572464726246,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22),(11,74,00:05.4,Unsafe following distance,-89.74,39.1,11|74|9223370572464770396,3660000000000000000,Jamie Engesser,1567254452,Saint Louis to Memphis Route2,2016-05-27-22),(11,74,00:41.0,Lane Departure,-90.07,35.68,1

---

In [19]:
!rm *log *.pig *.csv