Word count in Apache Pig
===

* Last modified: May 16, 2021

Test files
---

The cells below generate three test files for exercising the system. You can also create them directly with operating system commands in the terminal and the `pico` text editor.

In [1]:
!rm -rf /tmp/wordcount
!mkdir -p /tmp/wordcount/input/
%cd /tmp/wordcount

/tmp/wordcount


In [2]:
%%writefile input/text0.txt
Analytics is the discovery, interpretation, and communication of meaningful patterns 
in data. Especially valuable in areas rich with recorded information, analytics relies 
on the simultaneous application of statistics, computer programming and operations research 
to quantify performance.

Organizations may apply analytics to business data to describe, predict, and improve business 
performance. Specifically, areas within analytics include predictive analytics, prescriptive 
analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big 
Data Analytics, retail analytics, store assortment and stock-keeping unit optimization, 
marketing optimization and marketing mix modeling, web analytics, call analytics, speech 
analytics, sales force sizing and optimization, price and promotion modeling, predictive 
science, credit risk analysis, and fraud analytics. Since analytics can require extensive 
computation (see big data), the algorithms and software used for analytics harness the most 
current methods in computer science, statistics, and mathematics.

Writing input/text0.txt


In [3]:
%%writefile input/text1.txt
The field of data analysis. Analytics often involves studying past historical data to 
research potential trends, to analyze the effects of certain decisions or events, or to 
evaluate the performance of a given tool or scenario. The goal of analytics is to improve 
the business by gaining knowledge which can be used to make improvements or changes.

Writing input/text1.txt


In [4]:
%%writefile input/text2.txt
Data analytics (DA) is the process of examining data sets in order to draw conclusions 
about the information they contain, increasingly with the aid of specialized systems 
and software. Data analytics technologies and techniques are widely used in commercial 
industries to enable organizations to make more-informed business decisions and by 
scientists and researchers to verify or disprove scientific models, theories and 
hypotheses.

Writing input/text2.txt


In [5]:
!ls -1 input/

text0.txt
text1.txt
text2.txt


Word count in local mode (writing and debugging the program)
---

**Note.** Pig Latin uses two hyphens `--` for single-line comments and `/*` ... `*/` for multi-line comments.

In [6]:
%%writefile wordcount-local.pig

-- loads the data from the local folder
lines = LOAD 'input/text*.txt' AS (line:CHARARRAY);

-- generates a relation called words with one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- groups the records that contain the same word
grouped = GROUP words BY word;

-- counts the occurrences in each group
wordcount = FOREACH grouped GENERATE group, COUNT(words);

-- keeps 15 records; without a preceding ORDER BY, LIMIT does not guarantee which ones
s = LIMIT wordcount 15;

-- writes the output to the local file system
STORE s INTO 'output';

Writing wordcount-local.pig
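
To cross-check the counts produced by the script, the same pipeline can be approximated in plain Python. This is only a sketch: the helper names `tokenize` and `word_count` are hypothetical, and the delimiter set (space, double quote, comma, parentheses, asterisk) is an assumption about the default behaviour of `TOKENIZE`, so punctuation such as a trailing period stays attached to the word.

```python
import re
from collections import Counter

# Assumed default TOKENIZE delimiters: space, double quote, comma,
# parentheses and asterisk. Periods are NOT delimiters under this assumption.
PIG_DELIMITERS = r'[ ",()*]+'


def tokenize(line):
    """Split a line into tokens, dropping the empty strings left by delimiters."""
    return [tok for tok in re.split(PIG_DELIMITERS, line) if tok]


def word_count(lines):
    """Rough equivalent of FLATTEN(TOKENIZE(line)), GROUP BY word and COUNT(words)."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


# Two lines taken from the test files above.
sample = [
    "Data analytics (DA) is the process of examining data sets",
    "in data. Especially valuable in areas rich with recorded information,",
]
counts = word_count(sample)
print(counts["in"], counts["DA"], counts["data."])
```

Counting is case-sensitive here, as it is in the Pig script, so `The` and `the` are separate words.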


In [7]:
#
# Files in the local folder
#
!ls -l

total 8
drwxr-xr-x 2 root root 4096 Jun  3 14:53 input
-rw-r--r-- 1 root root  570 Jun  3 14:53 wordcount-local.pig


In [8]:
#
# Execution in local mode (neither pseudo-distributed nor fully distributed on a cluster)
#
!pig -x local -execute 'run wordcount-local.pig'

2022-06-03 14:53:40,434 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2022-06-03 14:53:40,555 [JobControl] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2022-06-03 14:53:40,589 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-06-03 14:53:40,601 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 3
2022-06-03 14:53:40,627 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-06-03 14:53:40,740 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_local1500906503_0001
2022-06-03 14:53:40,816 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/

In [9]:
#
# Files in the local folder
#
!ls -l

total 12
drwxr-xr-x 2 root root 4096 Jun  3 14:53 input
drwxr-xr-x 2 root root 4096 Jun  3 14:53 output
-rw-r--r-- 1 root root  570 Jun  3 14:53 wordcount-local.pig


In [10]:
#
# Results obtained
#
!ls -l output/

total 4
-rw-r--r-- 1 root root  0 Jun  3 14:53 _SUCCESS
-rw-r--r-- 1 root root 81 Jun  3 14:53 part-r-00000


In [11]:
#
# Contents of part-r-*
#
!cat output/part-r-*

a	1
DA	1
be	1
by	2
in	5
is	3
of	8
on	1
or	5
to	12
Big	1
The	2
aid	1
and	15
are	1
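
The output file holds one `word<TAB>count` pair per line. A minimal Python sketch for reading those rows back and ranking them by count could look like this; the helper name `parse_pig_output` is hypothetical, and the sample rows are copied from the listing above.

```python
def parse_pig_output(text):
    """Turn tab-separated 'word<TAB>count' lines into (word, count) pairs."""
    pairs = []
    for row in text.splitlines():
        if not row.strip():
            continue  # skip blank lines
        word, count = row.split("\t")
        pairs.append((word, int(count)))
    return pairs


# A few rows copied from the output above.
sample = "a\t1\nDA\t1\nby\t2\nto\t12\nand\t15\n"
pairs = parse_pig_output(sample)

# Sort by count, most frequent first.
by_frequency = sorted(pairs, key=lambda pair: pair[1], reverse=True)
print(by_frequency[:3])
```

To apply it to the real result, the text could be read with `open("output/part-r-00000").read()`, assuming the notebook is still in `/tmp/wordcount`.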


Word count in pseudo-distributed mode (cluster)
---

In [12]:
%%writefile wordcount-pseudo.pig

-- deletes the folders in HDFS if they exist
fs -rm -r input output

-- creates the input folder in HDFS
fs -mkdir input

-- copies the files from the local system to HDFS
fs -put input/ .

-- loads the data
lines = LOAD 'input/text*.txt' AS (line:CHARARRAY);

-- generates a relation called words with one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- groups the records that contain the same word
grouped = GROUP words BY word;

-- counts the occurrences in each group
wordcount = FOREACH grouped GENERATE group, COUNT(words);

-- keeps 15 records; without a preceding ORDER BY, LIMIT does not guarantee which ones
s = LIMIT wordcount 15;

-- writes the output to HDFS
STORE s INTO 'output';

-- copies the files from HDFS to the local system
fs -get output/

Writing wordcount-pseudo.pig


In [13]:
#
# Execution in pseudo-distributed mode (Pig's default mapreduce mode, without -x local)
#
!rm -rf output/
!pig -execute 'run wordcount-pseudo.pig'

Deleted input
Deleted output
2022-06-03 14:53:45,437 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-06-03 14:53:45,696 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2022-06-03 14:53:45,764 [JobControl] WARN  org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2022-06-03 14:53:45,780 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input files to process : 3
2022-06-03 14:53:45,823 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2022-06-03 14:53:45,974 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1654265746122_0005
2022-06-03 14:53:46,066 [JobControl] INFO  org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2022-06-03 14:53:

In [14]:
#
# Contents of HDFS
#
!hdfs dfs -ls output/*

-rw-r--r--   1 root supergroup          0 2022-06-03 14:54 output/_SUCCESS
-rw-r--r--   1 root supergroup         81 2022-06-03 14:54 output/part-r-00000


In [15]:
#
# Results obtained in HDFS
#
!hdfs dfs -cat output/part-r-00000

a	1
DA	1
be	1
by	2
in	5
is	3
of	8
on	1
or	5
to	12
Big	1
The	2
aid	1
and	15
are	1


In [16]:
#
# Results obtained on the local machine
#
!ls -l 

total 16
drwxr-xr-x 2 root root 4096 Jun  3 14:53 input
drwxr-xr-x 2 root root 4096 Jun  3 14:54 output
-rw-r--r-- 1 root root  570 Jun  3 14:53 wordcount-local.pig
-rw-r--r-- 1 root root  780 Jun  3 14:53 wordcount-pseudo.pig


In [17]:
!ls -l output/

total 4
-rw-r--r-- 1 root root  0 Jun  3 14:54 _SUCCESS
-rw-r--r-- 1 root root 81 Jun  3 14:54 part-r-00000


In [18]:
#
# Contents of part-r-*
#
!cat output/part-r-*

a	1
DA	1
be	1
by	2
in	5
is	3
of	8
on	1
or	5
to	12
Big	1
The	2
aid	1
and	15
are	1


Running scripts from Grunt (the Apache Pig console)
---

Scripts are run from Grunt with the `exec` and `run` commands.

    grunt> exec script
    
    grunt> run script
    
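The difference between these commands is that `exec` runs the script without importing it into the `grunt` session, whereas `run` does import it: after `run`, the aliases defined in the script remain available in the shell.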