Word count in Apache Hive (script)
===

* Last modified: May 17, 2022

Data
--

In [1]:
!mkdir -p /tmp/wordcount/

In [2]:
%%writefile /tmp/wordcount/text0.txt
Analytics is the discovery, interpretation, and communication of meaningful patterns 
in data. Especially valuable in areas rich with recorded information, analytics relies 
on the simultaneous application of statistics, computer programming and operations research 
to quantify performance.

Organizations may apply analytics to business data to describe, predict, and improve business 
performance. Specifically, areas within analytics include predictive analytics, prescriptive 
analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big 
Data Analytics, retail analytics, store assortment and stock-keeping unit optimization, 
marketing optimization and marketing mix modeling, web analytics, call analytics, speech 
analytics, sales force sizing and optimization, price and promotion modeling, predictive 
science, credit risk analysis, and fraud analytics. Since analytics can require extensive 
computation (see big data), the algorithms and software used for analytics harness the most 
current methods in computer science, statistics, and mathematics.

Overwriting /tmp/wordcount/text0.txt


In [3]:
%%writefile /tmp/wordcount/text1.txt
The field of data analysis. Analytics often involves studying past historical data to 
research potential trends, to analyze the effects of certain decisions or events, or to 
evaluate the performance of a given tool or scenario. The goal of analytics is to improve 
the business by gaining knowledge which can be used to make improvements or changes.

Overwriting /tmp/wordcount/text1.txt


In [4]:
%%writefile /tmp/wordcount/text2.txt
Data analytics (DA) is the process of examining data sets in order to draw conclusions 
about the information they contain, increasingly with the aid of specialized systems 
and software. Data analytics technologies and techniques are widely used in commercial 
industries to enable organizations to make more-informed business decisions and by 
scientists and researchers to verify or disprove scientific models, theories and 
hypotheses.

Overwriting /tmp/wordcount/text2.txt


Production version
--

In this second part, the application is moved to production with the following changes:

* The data is read from Hadoop's HDFS file system.

* The results are saved to a folder in HDFS.

* The script is stored in a file on the local disk for later use.

Copying the data to HDFS
--

In [5]:
#
# A temporary directory in HDFS is used. The following
# command lists the contents of that directory
#
!hdfs dfs -ls /tmp

Found 2 items
drwxrwx---   - root supergroup          0 2022-05-18 05:11 /tmp/hadoop-yarn
drwxrwxrwx   - root supergroup          0 2022-05-18 05:12 /tmp/hive


In [6]:
#
# Create the wordcount folder in HDFS
#
!hdfs dfs -mkdir /tmp/wordcount

In [7]:
#
# Verify that the folder was created
#
!hdfs dfs -ls /tmp/

Found 3 items
drwxrwx---   - root supergroup          0 2022-05-18 05:11 /tmp/hadoop-yarn
drwxrwxrwx   - root supergroup          0 2022-05-18 05:12 /tmp/hive
drwxr-xr-x   - root supergroup          0 2022-05-18 05:13 /tmp/wordcount


In [8]:
#
# Copy the files from the local directory /tmp/wordcount/
# to the /tmp/wordcount/ directory in HDFS
#
!hdfs dfs -copyFromLocal /tmp/wordcount/*  /tmp/wordcount/

In [9]:
#
# Verify that the files were copied
# to HDFS
#
!hdfs dfs -ls /tmp/wordcount

Found 3 items
-rw-r--r--   1 root supergroup       1093 2022-05-18 05:14 /tmp/wordcount/text0.txt
-rw-r--r--   1 root supergroup        352 2022-05-18 05:14 /tmp/wordcount/text1.txt
-rw-r--r--   1 root supergroup        440 2022-05-18 05:14 /tmp/wordcount/text2.txt
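
The copy steps above can also be collected into one re-runnable helper. The following is a minimal sketch, not part of the original notebook: the function names `hdfs_upload_commands` and `run_commands` are hypothetical, and it assumes the `hdfs` client is on the `PATH`. It uses `-mkdir -p` and `-copyFromLocal -f` so that a second run does not fail when the folder or the files already exist.

```python
import subprocess


def hdfs_upload_commands(local_files, hdfs_dir):
    """Build the hdfs commands that create hdfs_dir and upload local_files."""
    if not local_files:
        raise ValueError("no local files to upload")
    return [
        # -p: do not fail if the directory already exists
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # -f: overwrite files that are already present in HDFS
        ["hdfs", "dfs", "-copyFromLocal", "-f", *local_files, hdfs_dir],
        # list the directory to verify the upload
        ["hdfs", "dfs", "-ls", hdfs_dir],
    ]


def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the command lists separately from running them makes it possible to inspect them first; calling `run_commands(hdfs_upload_commands(files, "/tmp/wordcount/"))` with the list of local text files would then perform the upload.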


Script generation and code changes
--

Two changes are made. First, the line

    LOAD DATA LOCAL INPATH "wordcount/" OVERWRITE INTO TABLE docs;

is replaced with:

    LOAD DATA INPATH "/tmp/wordcount/" OVERWRITE INTO TABLE docs;

so that Hive reads the data from the `/tmp/wordcount/` directory in HDFS. Because `LOCAL` is omitted, Hive moves the files from that directory into the table's storage location instead of copying them. Second, the following statement is added

    INSERT OVERWRITE DIRECTORY '/tmp/output' 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
    SELECT * FROM word_counts;

so that the results are stored in the HDFS folder `/tmp/output` as a comma-delimited (CSV) file. The program is saved as `/tmp/wordcount.hql` on the local machine.

In [10]:
%%writefile /tmp/wordcount.hql

DROP TABLE IF EXISTS docs;
DROP TABLE IF EXISTS word_counts;

CREATE TABLE docs (line STRING);

LOAD DATA INPATH "/tmp/wordcount/" OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts 
AS
    SELECT word, count(1) AS count 
    FROM
        (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY 
    word
ORDER BY 
    word;
    
INSERT OVERWRITE DIRECTORY '/tmp/output' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
SELECT * FROM word_counts;


Writing /tmp/wordcount.hql
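
To make the query easier to follow, here is a rough plain-Python sketch of what it computes. It is an illustration only, not how Hive executes the job, and the function name `word_counts` is hypothetical. It assumes that Hive's `split` keeps the empty strings produced by trailing whitespace and blank lines, which Python's `re.split` also does.

```python
import re
from collections import Counter


def word_counts(lines):
    """Return (word, count) pairs sorted by word, like the word_counts table."""
    counts = Counter()
    for line in lines:
        # explode(split(line, '\\s')): split on every single whitespace character;
        # a trailing space or an empty line yields an empty-string token
        counts.update(re.split(r"\s", line))
    # GROUP BY word ... ORDER BY word
    return sorted(counts.items())


sample = ["Data analytics ", "", "Data science"]
for word, count in word_counts(sample):
    print(f"{word},{count}")
```

In the sample, the trailing space and the blank line each contribute one empty token, so the first printed row has an empty word. Punctuation is not stripped either, so `analytics` and `analytics,` would count as different words.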


In [11]:
!cat /tmp/wordcount.hql


Execution
---

In [12]:
!hive -f /tmp/wordcount.hql

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-2.3.9.jar!/hive-log4j2.properties Async: true
OK
Time taken: 6.625 seconds
OK
Time taken: 0.1 seconds
OK
Time taken: 0.473 seconds
Loading data to table default.docs
OK
Time taken: 0.518 seconds
Query ID = root_20220518051423_eca70e0a-6fb2-4bd3-aa04-df507d760320
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.re

Viewing the results
--

The results are stored in the `/tmp/output` folder in HDFS.

In [13]:
!hdfs dfs -ls /tmp/output

Found 1 items
-rwxrwxrwx   1 root supergroup       1653 2022-05-18 05:15 /tmp/output/000000_0


In [14]:
!hdfs dfs -cat /tmp/output/000000_0 | head

,20
(DA),1
(see,1
Analytics,2
Analytics,,1
Big,1
Data,3
Especially,1
Organizations,1
Since,1
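
Tokens keep their punctuation, so some words contain the field delimiter itself (for example `Analytics,,1`), and splitting a row on every comma would break it. The following is a minimal sketch for reading such rows, assuming the count is always the last field; the helper name `parse_row` is hypothetical.

```python
def parse_row(row):
    """Split an output row into (word, count) at the last comma only."""
    word, count = row.rstrip("\n").rsplit(",", 1)
    return word, int(count)


rows = [",20", "(DA),1", "Analytics,2", "Analytics,,1"]
for row in rows:
    print(parse_row(row))
```

Splitting from the right keeps any commas that belong to the word, and the row `,20` parses as the empty word with a count of 20.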


Copying the results to the local machine
--

In [15]:
!hadoop fs -copyToLocal /tmp/output /tmp/output
!ls /tmp/output/*


In [16]:
!cat /tmp/output/000000_0

,20
(DA),1
(see,1
Analytics,2
Analytics,,1
Big,1
Data,3
Especially,1
Organizations,1
Since,1
Specifically,,1
The,2
a,1
about,1
aid,1
algorithms,1
analysis,,1
analysis.,1
analytics,8
analytics,,8
analytics.,1
analyze,1
and,15
application,1
apply,1
are,1
areas,2
assortment,1
be,1
big,1
business,4
by,2
call,1
can,2
certain,1
changes.,1
cognitive,1
commercial,1
communication,1
computation,1
computer,2
conclusions,1
contain,,1
credit,1
current,1
data,4
data),,1
data.,1
decision,1
decisions,2
describe,,1
descriptive,1
discovery,,1
disprove,1
draw,1
effects,1
enable,1
enterprise,1
evaluate,1
events,,1
examining,1
extensive,1
field,1
for,1
force,1
fraud,1
gaining,1
given,1
goal,1
harness,1
historical,1
hypotheses.,1
improve,2
improvements,1
in,5
include,1
increasingly,1
industries,1
information,1
information,,1
interpretation,,1
involves,1
is,3
knowledge,1
make,2
management,,1
marketing,2
mathematics.,1
may,1
meaningful,1
methods,1
mix,1
modeling,,2
models,,1
more-informed,1
most,1
of,8
often,

Another option for extracting the results is to run

      $ hive -S -e 'SELECT * FROM word_counts;' > result.csv

where the file `result.csv` is stored locally.
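
One caveat with this command: the Hive CLI prints columns separated by tabs, so the redirected file is tab-delimited regardless of its extension. The sketch below, which is not part of the original notebook, converts such text to comma-separated form with Python's `csv` module; the function name `tsv_to_csv` is hypothetical.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert tab-separated text to CSV, quoting fields that contain commas."""
    # QUOTE_NONE: treat quote characters in the input as ordinary text
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerows(reader)
    return output.getvalue()


print(tsv_to_csv("Analytics,\t1\nBig\t1\n"), end="")
```

Here the word `Analytics,` is written as a quoted field, so a CSV reader can recover it unambiguously.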

In [17]:
!rm -rf *.log