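### Pig: Inverted index

Starting from the posts dataset used earlier, we are going to compute an inverted index: for each word, the sorted list of ids of the posts in which it appears, together with the number of such posts.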
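As a quick illustration of the target structure, here is a minimal Python sketch of an inverted index. The `posts` dictionary and the `build_inverted_index` helper are hypothetical, made up for this example; they are not part of the forum dataset or of the Pig script used later.

```python
from collections import defaultdict

# Hypothetical toy posts (id -> body); not taken from the forum dataset.
posts = {
    1: "pig makes hadoop easy",
    2: "hadoop runs pig scripts",
    3: "easy scripts",
}

def build_inverted_index(docs):
    """Map each word to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, body in docs.items():
        for word in body.split():
            index[word].add(doc_id)  # a set drops repeated ids automatically
    return {word: sorted(ids) for word, ids in index.items()}

toy_index = build_inverted_index(posts)
print(toy_index["pig"])   # ids of the posts that mention "pig"
print(toy_index["easy"])  # ids of the posts that mention "easy"
```

The Pig script in this notebook follows the same idea, but distributes the grouping step across a MapReduce job.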

In [1]:
! rm -fr pig-indiceinvertido
! mkdir -p pig-indiceinvertido
import os
os.chdir("pig-indiceinvertido/")
! pwd

/media/notebooks/pig-indiceinvertido


## Optional step - Installing dependencies
* We install dos2unix to clean the file and convert it from DOS to Unix format
* We install pig to run the corresponding commands

In [2]:
! yum install -y dos2unix pig hbase

Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile
 * base: ftp.uma.es
 * epel: ftp.uma.es
 * extras: ftp.uma.es
 * updates: ftp.uma.es
Package dos2unix-6.0.3-7.el7.x86_64 already installed and latest version
Package pig-0.12.0+cdh5.9.0+95-1.cdh5.9.0.p0.30.el7.noarch already installed and latest version
Package hbase-1.2.0+cdh5.9.0+205-1.cdh5.9.0.p0.30.el7.x86_64 already installed and latest version
Nothing to do


We copy the data files to the working directory

In [3]:
! cp ../dataset/forum_node.tsv.gz ../dataset/forum1.tsv .
! ls -lh

total 38M
-rw-r--r-- 1 root root 1.8K Feb  5 15:33 forum1.tsv
-rwxr-xr-x 1 root root  38M Feb  5 15:33 forum_node.tsv.gz


We decompress the first file and clean it by converting its line endings to Unix format

In [4]:
! gzip -d forum_node.tsv.gz && dos2unix -f forum_node.tsv

dos2unix: converting file forum_node.tsv to Unix format ...
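For readers unfamiliar with what the conversion does, the following Python sketch shows the core idea: DOS line endings (CRLF) are replaced by Unix ones (LF). The `dos_to_unix` helper and the sample bytes are hypothetical, and the real `dos2unix` tool handles more cases than this one-liner.

```python
def dos_to_unix(data: bytes) -> bytes:
    """Replace DOS line endings (CRLF) with Unix line endings (LF)."""
    return data.replace(b"\r\n", b"\n")

# Hypothetical two-line TSV fragment with DOS line endings.
sample = b"id\ttitle\r\n1\thello\r\n"
converted = dos_to_unix(sample)
print(converted)
```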


We remove any previous working directory from HDFS and create the user's home directory in Hadoop if it does not exist

In [5]:
! hadoop fs -rm -r /user/$(whoami)/pig-indiceinvertido
! hadoop fs -mkdir -p /user/$(whoami)

Deleted /user/root


We copy the files to Hadoop (no destination is given, so they land in the user's HDFS home directory)

In [6]:
! hadoop fs -put -p forum_node.tsv

In [7]:
! hadoop fs -put forum1.tsv

In [8]:
! hadoop fs -ls

Found 2 items
-rw-r--r--   3 root supergroup       1774 2018-02-05 15:34 forum1.tsv
-rwxr-xr-x   3 root root        120109135 2018-02-05 15:33 forum_node.tsv


In [9]:
%%writefile students-inverted-index.pig

/* 1. Load the posts file forum_node.tsv using a Piggybank extension (CSVExcelStorage) so that the header can be skipped,
instead of using PigStorage directly. */
REGISTER /usr/lib/pig/piggybank.jar;
DEFINE StringToInt InvokeForInt('java.lang.Integer.valueOf', 'String');

data =
    load 'forum_node.tsv'
    using org.apache.pig.piggybank.storage.CSVExcelStorage('\t', 'YES_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
    as (pid:chararray, title:chararray, tagnames:chararray,
        author_id:chararray,body:chararray,
        node_type:chararray, parent_id:chararray,
        abs_parent_id:chararray,added_at:chararray,
        score:chararray, state_string:chararray, last_edited_id:chararray,
        last_activity_by_id:chararray, last_activity_at:chararray,
        active_revision_id:chararray, extra:chararray,
        extra_ref_id:chararray, extra_count:chararray, marked:chararray);

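/* 2. Clean the data by removing line breaks and HTML expressions, and applying the regular expression proposed in the exercise. Note: '<*>' only matches a '>' optionally preceded by '<' characters; dropping whole tags would need a pattern such as '<[^>]*>'. */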
cleandata = foreach data generate
    REPLACE(pid, '[a-zA-Z]+', '') as post_id,
    LOWER(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(body, '\\\\n\\\\r', ''), '\\\\r', ''), '\\\\n', ''), '<*>', ''), '[^a-zA-Z0-9\'\\s]+', ' ')) AS clean_body;

/* 3. Filter out the rows whose post_id is not numeric. */
cleandata_filtered = filter cleandata by org.apache.pig.piggybank.evaluation.IsNumeric(post_id);

/* 4. Build tuples by splitting the body on spaces and converting post_id to a number through a custom function, to avoid the problems we had casting String to Integer. */
words_data = FOREACH cleandata_filtered GENERATE StringToInt(post_id) as post_id_int:int, FLATTEN(TOKENIZE(clean_body)) as word;
words_data_filtered = filter words_data by SIZE(word) > 0;

/* 5. Group by word. */
word_groups = GROUP words_data_filtered BY word;

/* 6. For each word group, apply DISTINCT to the post_ids to remove duplicates, count the number of posts in which the word appears (after deduplication) and generate a row with the index. */
index = FOREACH word_groups {
    pairs = DISTINCT $1.$0;
    cnt = COUNT(pairs);
    GENERATE $0 as word, pairs as index_bag, cnt as count;
};

/* 7. Since the index must list the post_ids in order, sort the resulting bag of posts by id. */
sorted_index = foreach index {
    sorted_bag = order index_bag by $0;
    generate word, sorted_bag, count;
};

/* 8. Store the result in the 'inverted_index' output directory. */
STORE sorted_index INTO 'inverted_index';





Writing students-inverted-index.pig
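The steps of the script can be mirrored in a few lines of plain Python, which may help when checking the logic on a small sample. This is only a sketch under several assumptions: the helper names and the `rows` sample are hypothetical, `str.isdigit` stands in for Piggybank's `IsNumeric`, `str.split` stands in for `TOKENIZE`, and the newline and HTML replacements of step 2 are omitted, so only the final regular expression and the lowercasing are reproduced.

```python
import re
from collections import defaultdict

def clean_body(body):
    """Simplified step 2: replace non-alphanumeric runs with a space, then lowercase."""
    return re.sub(r"[^a-zA-Z0-9'\s]+", " ", body).lower()

def inverted_index(rows):
    """rows: iterable of (post_id, body) string pairs."""
    postings = defaultdict(set)
    for post_id, body in rows:
        if not post_id.isdigit():              # step 3: keep numeric ids only (approximation)
            continue
        for word in clean_body(body).split():  # step 4: tokenize (approximation of TOKENIZE)
            postings[word].add(int(post_id))   # steps 5-6: group by word, distinct ids
    # step 7: sorted ids plus their count; rows ordered by word for readability
    return [(w, sorted(ids), len(ids)) for w, ids in sorted(postings.items())]

# Hypothetical sample rows; the first one plays the role of a non-numeric id.
rows = [
    ("abc", "ignored row"),
    ("12", "Pig is FUN"),
    ("7", "pig, pig and hadoop"),
]
result = inverted_index(rows)
for word, ids, count in result:
    print(word, ids, count)
```

Using a set per word plays the role of `DISTINCT` in the nested `FOREACH`, and `sorted` plays the role of the nested `ORDER`.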


In [10]:
! pig -f students-inverted-index.pig -x local

log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
2018-02-05 15:34:48,014 [main] INFO  org.apache.pig.Main - Apache Pig version 0.12.0-cdh5.9.0 (rUnversioned directory) compiled Oct 21 2016, 01:17:18
2018-02-05 15:34:48,015 [main] INFO  org.apache.pig.Main - Logging error messages to: /media/notebooks/pig-indiceinvertido/pig_1517844887974.log
2018-02-05 15:34:48,044 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - user.name is deprecated. Instead, use mapreduce.job.user.name
2018-02-05 15:34:48,486 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2018-02-05 15:34:48,601 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2018-02-05 15:34:48,601 [main] INFO  org.apache.hadoop.conf.Configurati

2018-02-05 15:34:52,431 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/
2018-02-05 15:34:52,433 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local1109379629_0001
2018-02-05 15:34:52,435 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases 1-1,cleandata,cleandata_filtered,data,index,pairs,sorted_bag,sorted_index,word_groups,words_data,words_data_filtered
2018-02-05 15:34:52,436 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: data[8,4],cleandata[-1,-1],cleandata_filtered[25,21],words_data[28,13],words_data_filtered[29,22],word_groups[32,14] C:  R: index[35,8],1-1[36,21],pairs[36,12],1-1[36,21],pairs[36,12],sorted_index[42,15],sorted_bag[43,17]
2018-02-05 15:34:52,451 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Map

2018-02-05 15:35:10,554 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Processing split: Number of splits :1
Total Length = 33554432
Input split[0]:
   Length = 33554432
  Locations:

-----------------------

2018-02-05 15:35:10,574 [LocalJobRunner Map Task Executor #0] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/media/notebooks/pig-indiceinvertido/forum_node.tsv:33554432+33554432
2018-02-05 15:35:10,728 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - (EQUATOR) 0 kvi 26214396(104857584)
2018-02-05 15:35:10,729 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - mapreduce.task.io.sort.mb: 100
2018-02-05 15:35:10,752 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - soft limit at 83886080
2018-02-05 15:35:10,752 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - bufst

2018-02-05 15:35:37,773 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - map > map
2018-02-05 15:35:40,377 [SpillThread] INFO  org.apache.hadoop.mapred.MapTask - Finished spill 0
2018-02-05 15:35:40,377 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - (RESET) equator 51163412 kv 12790848(51163392) kvi 10169420(40677680)
2018-02-05 15:35:40,774 [communication thread] INFO  org.apache.hadoop.mapred.LocalJobRunner - map > map
2018-02-05 15:35:42,068 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.LocalJobRunner - map > map
2018-02-05 15:35:42,068 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
2018-02-05 15:35:42,068 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - Spilling map output
2018-02-05 15:35:42,068 [LocalJobRunner Map Task Executor #0] INFO  org.apache.hadoop.mapred.MapTask - bufstart = 51163412; bufend = 70737927; buf

2018-02-05 15:35:58,515 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - closeInMemoryFile -> map-output of size: 68256171, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->68256171
2018-02-05 15:35:58,575 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.LocalFetcher - localfetcher#1 about to shuffle output of map attempt_local1109379629_0001_m_000001_0 decomp: 67100522 len: 67100526 to MEMORY
2018-02-05 15:35:58,891 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput - Read 67100522 bytes from map-output for attempt_local1109379629_0001_m_000001_0
2018-02-05 15:35:58,891 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl - closeInMemoryFile -> map-output of size: 67100522, inMemoryMapOutputs.size() -> 2, commitMemory -> 68256171, usedMemory ->135356693
2018-02-05 15:35:59,050 [localfetcher#1] INFO  org.apache.hadoop.mapreduce.task.reduce.LocalFetcher - localfetcher#1

2018-02-05 15:37:36,050 [pool-5-thread-1] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2018-02-05 15:37:36,050 [pool-5-thread-1] INFO  org.apache.hadoop.mapred.Task - Task attempt_local1109379629_0001_r_000000_0 is allowed to commit now
2018-02-05 15:37:36,063 [pool-5-thread-1] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local1109379629_0001_r_000000_0' to file:/media/notebooks/pig-indiceinvertido/inverted_index/_temporary/0/task_local1109379629_0001_r_000000
2018-02-05 15:37:36,064 [pool-5-thread-1] INFO  org.apache.hadoop.mapred.LocalJobRunner - reduce > reduce
2018-02-05 15:37:36,064 [pool-5-thread-1] INFO  org.apache.hadoop.mapred.Task - Task 'attempt_local1109379629_0001_r_000000_0' done.
2018-02-05 15:37:36,065 [pool-5-thread-1] INFO  org.apache.hadoop.mapred.LocalJobRunner - Finishing task: attempt_local1109379629_0001_r_000000_0
2018-02-05 15:37:36,065 [Thread-5] INFO  org.apache.hadoop.mapred.LocalJobRunne

In [11]:
! tail -40000 ./inverted_index/part-r-00000  | head -10

	{(2010230)}	1
reciprocals	{(2009620),(2009707),(2010947),(2014292),(9002395)}	5
reciprocate	{(2001902),(6004447)}	2
reciprocity	{(2008341),(2009563),(2009571),(2009804),(2010292)}	5
recitations	{(2018236),(5014249)}	2
reclamation	{(10438)}	1
recliningon	{(6022210)}	1
recognisers	{(5007198),(5007700),(5007756)}	3
recognising	{(1014977),(1034000),(6015555),(8000513)}	4
recognition	{(3982),(5301),(8066),(9102),(22789),(41152),(47260),(51450),(51559),(52330),(53802),(53881),(53962),(60657),(60916),(63420),(64321),(64470),(66804),(66874),(67092),(67121),(67242),(67488),(1000219),(1000901),(1001733),(1001920),(1002371),(1005129),(1006927),(1007272),(1008146),(1008955),(1008996),(1009010),(1009776),(1010107),(1010329),(1010351),(1012085),(1012836),(1013518),(1013848),(1014107),(1015692),(1018390),(1023592),(1025649),(1026946),(1028196),(1030214),(1030460),(1030646),(1030651),(1031191),(1031238),(1031734),(1032697),(1032720),(1033371),(1033635),(1033857),(1034081),(1034451),(1034946)

In [12]:
! pig -f students-inverted-index.pig

log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
2018-02-05 15:38:35,802 [main] INFO  org.apache.pig.Main - Apache Pig version 0.12.0-cdh5.9.0 (rUnversioned directory) compiled Oct 21 2016, 01:17:18
2018-02-05 15:38:35,803 [main] INFO  org.apache.pig.Main - Logging error messages to: /media/notebooks/pig-indiceinvertido/pig_1517845115761.log
2018-02-05 15:38:37,075 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2018-02-05 15:38:37,333 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2018-02-05 15:38:37,333 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2018-02-05 15:38:37,333 [main] INFO  org.apache.pig.backe

2018-02-05 15:38:49,933 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://yarnmaster:8088/proxy/application_1517830644613_0003/
2018-02-05 15:38:49,934 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1517830644613_0003
2018-02-05 15:38:49,934 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases 1-1,cleandata,cleandata_filtered,data,index,pairs,sorted_bag,sorted_index,word_groups,words_data,words_data_filtered
2018-02-05 15:38:49,934 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: data[8,4],cleandata[-1,-1],cleandata_filtered[25,21],words_data[28,13],words_data_filtered[29,22],word_groups[32,14] C:  R: index[35,8],1-1[36,21],pairs[36,12],1-1[36,21],pairs[36,12],sorted_index[42,15],sorted_bag[43,17]
2018-02-05 15:38:50,191 [main] INFO  org.apache.pig.backend.hadoo

In [13]:
! hadoop fs -ls inverted_index/

Found 2 items
-rw-r--r--   3 root supergroup          0 2018-02-05 15:42 inverted_index/_SUCCESS
-rw-r--r--   3 root supergroup   86936662 2018-02-05 15:42 inverted_index/part-r-00000


In [14]:
! hadoop fs -cat inverted_index/part-r-00000 | tail -40000 | head -10

	{(2010230)}	1
reciprocals	{(2009620),(2009707),(2010947),(2014292),(9002395)}	5
reciprocate	{(2001902),(6004447)}	2
reciprocity	{(2008341),(2009563),(2009571),(2009804),(2010292)}	5
recitations	{(2018236),(5014249)}	2
reclamation	{(10438)}	1
recliningon	{(6022210)}	1
recognisers	{(5007198),(5007700),(5007756)}	3
recognising	{(1014977),(1034000),(6015555),(8000513)}	4
recognition	{(3982),(5301),(8066),(9102),(22789),(41152),(47260),(51450),(51559),(52330),(53802),(53881),(53962),(60657),(60916),(63420),(64321),(64470),(66804),(66874),(67092),(67121),(67242),(67488),(1000219),(1000901),(1001733),(1001920),(1002371),(1005129),(1006927),(1007272),(1008146),(1008955),(1008996),(1009010),(1009776),(1010107),(1010329),(1010351),(1012085),(1012836),(1013518),(1013848),(1014107),(1015692),(1018390),(1023592),(1025649),(1026946),(1028196),(1030214),(1030460),(1030646),(1030651),(1031191),(1031238),(1031734),(1032697),(1032720),(1033371),(1033635),(1033857),(1034081),(1034451),(1034946)
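To work with the stored index outside Pig, each output line can be parsed back into a word, a list of post ids and a count. The sketch below assumes the tab-separated `word`, `{(id),(id),...}`, `count` layout seen in the output above; `parse_index_line` is a hypothetical helper, and the sample line is copied from that output.

```python
import re

def parse_index_line(line):
    """Split 'word<TAB>{(id),(id),...}<TAB>count' into (word, [ids], count)."""
    word, bag, count = line.rstrip("\n").split("\t")
    ids = [int(m) for m in re.findall(r"\((\d+)\)", bag)]
    return word, ids, int(count)

sample = "reciprocate\t{(2001902),(6004447)}\t2"
word, ids, count = parse_index_line(sample)
print(word, ids, count)
```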