WordCount in Apache Pig in standalone mode
===

* 30 min | Last modified: November 15, 2019

## Problem definition

The goal is to count how often each word occurs in a set of documents using Apache Pig.

## Solution

### Setup

#### Starting the virtual machine

If you use Linux or macOS, you can skip directly to the next step. Otherwise, start the VM with:

```bash
vagrant up
```

and then change to the working directory:

```bash
cd /vagrant
```


#### Running the Docker container

To start the container with the current folder of your local machine shared as `/datalake`, use:

```bash
docker run --rm -it -v "$PWD":/datalake --name pig -p 8888:8888 jdvelasq/pig:0.17.0-standalone
```

To start the session on the `datalake` volume instead, use:

```bash
docker run --rm -it -v datalake:/datalake --name pig -p 8888:8888 jdvelasq/pig:0.17.0-standalone
```


If a container is already running, you can open a new terminal in it with:

```bash
docker exec -it pig bash
```

### Test files

Three test files are generated below to try out the system. You can also create them directly with operating system commands in the terminal and the `pico` text editor.

In [1]:
## Create the input directory
!rm -rf input tmp
!mkdir input

In [2]:
%%writefile input/text0.txt
Analytics is the discovery, interpretation, and communication of meaningful patterns 
in data. Especially valuable in areas rich with recorded information, analytics relies 
on the simultaneous application of statistics, computer programming and operations research 
to quantify performance.

Organizations may apply analytics to business data to describe, predict, and improve business 
performance. Specifically, areas within analytics include predictive analytics, prescriptive 
analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big 
Data Analytics, retail analytics, store assortment and stock-keeping unit optimization, 
marketing optimization and marketing mix modeling, web analytics, call analytics, speech 
analytics, sales force sizing and optimization, price and promotion modeling, predictive 
science, credit risk analysis, and fraud analytics. Since analytics can require extensive 
computation (see big data), the algorithms and software used for analytics harness the most 
current methods in computer science, statistics, and mathematics.

Writing input/text0.txt


In [3]:
%%writefile input/text1.txt
The field of data analysis. Analytics often involves studying past historical data to 
research potential trends, to analyze the effects of certain decisions or events, or to 
evaluate the performance of a given tool or scenario. The goal of analytics is to improve 
the business by gaining knowledge which can be used to make improvements or changes.

Writing input/text1.txt


In [4]:
%%writefile input/text2.txt
Data analytics (DA) is the process of examining data sets in order to draw conclusions 
about the information they contain, increasingly with the aid of specialized systems 
and software. Data analytics technologies and techniques are widely used in commercial 
industries to enable organizations to make more-informed business decisions and by 
scientists and researchers to verify or disprove scientific models, theories and 
hypotheses.

Writing input/text2.txt


### Apache Pig code

**Note.** Two hyphens (`--`) start a single-line comment, and `/*` ... `*/` delimit multi-line comments.

In [5]:
%%writefile script.pig

-- create the input folder in HDFS
fs -mkdir tmp
fs -mkdir tmp/input

-- copy the files from the local file system to HDFS
fs -put input/ tmp/

-- load the data, one line of text per record
lines = LOAD 'tmp/input/text*.txt' AS (line:CHARARRAY);

-- build a relation called words with one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- group the records that share the same word
grouped = GROUP words BY word;

-- count the occurrences in each group
wordcount = FOREACH grouped GENERATE group, COUNT(words);

-- keep 15 records (there is no ORDER BY, so which 15 are returned is not guaranteed)
s = LIMIT wordcount 15;

-- write the output file
STORE s INTO 'tmp/output';

-- copy the files from HDFS to the local file system (creates the output folder in the current directory)
fs -get tmp/output/ .


Overwriting script.pig
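
To make the tokenize, group and count steps of the script easier to follow, here is a minimal Python sketch of the same logic. It is an illustration and not part of the original notebook. The names `tokenize` and `word_count` are hypothetical. The delimiter set (space, double quote, comma, parentheses and star) is assumed to match the default delimiters documented for Pig's `TOKENIZE`.

```python
import re
from collections import Counter

# Assumed default TOKENIZE delimiters: space, double quote, comma,
# parentheses and star. Periods are NOT delimiters, so "data." stays
# a separate token from "data".
DELIMITERS = re.compile(r'[ ",()*]+')


def tokenize(line):
    """Split one line into words, dropping empty tokens."""
    return [token for token in DELIMITERS.split(line) if token]


def word_count(lines):
    """Count the occurrences of each token across all lines (case-sensitive)."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


# Two sentence fragments taken from the test files
sample = [
    "Data analytics (DA) is the process of examining data sets",
    "Analytics often involves studying past historical data",
]
counts = word_count(sample)
print(counts["data"], counts["Data"], counts["DA"])  # prints: 2 1 1
```

As in the Pig script, the count is case-sensitive and punctuation that is not a delimiter stays attached to the word.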


### Running the script in standalone mode

In [6]:
!pig -execute 'run script.pig'

2019-11-15 02:31:43,492 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2019-11-15 02:31:44,405 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator
2019-11-15 02:31:44,678 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - session.id is deprecated. Instead, use dfs.metrics.session-id
2019-11-15 02:31:44,679 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2019-11-15 02:31:44,701 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2019-11-15 02:31:44,704 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapredu

### Viewing the results in HDFS

In [7]:
!hadoop fs -ls tmp/output/*

-rw-r--r--   1 root root          0 2019-11-15 02:31 tmp/output/_SUCCESS
-rw-r--r--   1 root root         81 2019-11-15 02:31 tmp/output/part-r-00000


In [8]:
!hadoop fs -cat tmp/output/part-r-00000

a	1
DA	1
be	1
by	2
in	5
is	3
of	8
on	1
or	5
to	12
Big	1
The	2
aid	1
and	15
are	1


### Cleaning up HDFS and the local machine

In [9]:
!rm -rf input tmp output