Finding the most frequent words
===

* Last modified: May 13, 2022

Test files
--

In [1]:
#
# Create the input directory
#
!rm -rf /tmp/wordcount
!mkdir -p /tmp/wordcount/input
%cd /tmp/wordcount
!ls

/tmp/wordcount
input


In [2]:
%%writefile input/text0.txt
Analytics is the discovery, interpretation, and communication of meaningful patterns 
in data. Especially valuable in areas rich with recorded information, analytics relies 
on the simultaneous application of statistics, computer programming and operations research 
to quantify performance.

Organizations may apply analytics to business data to describe, predict, and improve business 
performance. Specifically, areas within analytics include predictive analytics, prescriptive 
analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big 
Data Analytics, retail analytics, store assortment and stock-keeping unit optimization, 
marketing optimization and marketing mix modeling, web analytics, call analytics, speech 
analytics, sales force sizing and optimization, price and promotion modeling, predictive 
science, credit risk analysis, and fraud analytics. Since analytics can require extensive 
computation (see big data), the algorithms and software used for analytics harness the most 
current methods in computer science, statistics, and mathematics.

Writing input/text0.txt


In [3]:
%%writefile input/text1.txt
The field of data analysis. Analytics often involves studying past historical data to 
research potential trends, to analyze the effects of certain decisions or events, or to 
evaluate the performance of a given tool or scenario. The goal of analytics is to improve 
the business by gaining knowledge which can be used to make improvements or changes.

Writing input/text1.txt


In [4]:
%%writefile input/text2.txt
Data analytics (DA) is the process of examining data sets in order to draw conclusions 
about the information they contain, increasingly with the aid of specialized systems 
and software. Data analytics technologies and techniques are widely used in commercial 
industries to enable organizations to make more-informed business decisions and by 
scientists and researchers to verify or disprove scientific models, theories and 
hypotheses.

Writing input/text2.txt


MapReduce implementation with two jobs
---

In [5]:
%%writefile most_frequent_words.py

import string

from mrjob.job import MRJob
from mrjob.step import MRStep

class MostFrequentWords(MRJob):
    # Two chained steps: count every word, then keep the 10 most frequent.
    def steps(self):
        return [
            MRStep(
                mapper=self.mapper_get_words,
                reducer=self.reducer_count_words,
            ),
            MRStep(
                mapper=self.mapper_group_words,
                reducer=self.reducer_group_words,
            ),
        ]

    # --< Job 1 >------------------------------------------------------------------------
    def preprocessing(self, word):
        word = word.lower()
        word = word.translate(str.maketrans("", "", string.punctuation))
        word = word.replace("\n", "")
        return word

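    def mapper_get_words(self, _, line):
        # Emit (word, 1) for every normalized token in the line.
        for word in line.split():
            word = self.preprocessing(word)
            if word:  # skip tokens that contained only punctuation
                yield word, 1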

    def reducer_count_words(self, word, counts):
        yield word, sum(counts)

    # --< Job 2 >------------------------------------------------------------------------
    def mapper_group_words(self, word, counts):
        yield "COUNTS", (counts, word)
    
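    def reducer_group_words(self, dummy_key, pairs):
        # `pairs` is a one-shot generator, so it is consumed exactly once here.
        # Sort ascending by (count, word) and keep the 10 most frequent pairs.
        top = sorted(pairs)[-10:]

        for count, word in top:
            yield word, count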


if __name__ == "__main__":
    MostFrequentWords.run()

Writing most_frequent_words.py
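
To check the two jobs on a small input without running mrjob, the same logic can be sketched in plain Python. This is only an illustration, not part of the job itself: the helper names `normalize`, `count_words` and `top_words` and the sample lines are hypothetical, and the sketch assumes the same normalization (lowercasing, stripping ASCII punctuation) and the same ascending `(count, word)` ordering used above.

```python
import string
from collections import Counter

# Hypothetical helpers that mirror the two MapReduce jobs; not part of mrjob.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(word):
    # Same idea as `preprocessing`: lowercase and strip ASCII punctuation.
    return word.lower().translate(_PUNCT_TABLE)


def count_words(lines):
    # Job 1 equivalent: count every normalized, non-empty token.
    counts = Counter()
    for line in lines:
        for token in line.split():
            word = normalize(token)
            if word:
                counts[word] += 1
    return counts


def top_words(counts, n=10):
    # Job 2 equivalent: sort ascending by (count, word), keep the last n.
    pairs = sorted((count, word) for word, count in counts.items())
    return [(word, count) for count, word in pairs[-n:]]


# Made-up sample input, used only for illustration.
sample = ["Data, data everywhere.", "Big data and more DATA!"]
print(top_words(count_words(sample), n=2))
```

Because ties are broken alphabetically by the `(count, word)` sort, words with equal counts always come out in the same order.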


Ejecución
---

In [6]:
!python3 most_frequent_words.py  input/*.txt > result.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/most_frequent_words.root.20220526.023852.129587
Running step 1 of 2...
Running step 2 of 2...
job output is in /tmp/most_frequent_words.root.20220526.023852.129587/output
Streaming final output from /tmp/most_frequent_words.root.20220526.023852.129587/output...
Removing temp directory /tmp/most_frequent_words.root.20220526.023852.129587...


In [7]:
!cat result.txt

"performance"	3
"used"	3
"in"	5
"or"	5
"of"	8
"data"	9
"the"	12
"to"	12
"and"	15
"analytics"	20

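With mrjob's default output protocol, each line of the result holds a JSON-encoded key and a JSON-encoded value separated by a tab. A minimal sketch for loading such lines back into Python and listing them from most to least frequent follows. The function name `parse_result_lines` is hypothetical, and the sample lines are copied from the output above rather than read from `result.txt`, so that the block is self-contained.

```python
import json


def parse_result_lines(lines):
    # Hypothetical helper: split each line on the first tab and decode
    # the JSON key and value (assumes mrjob's default JSON output format).
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, value = line.split("\t", 1)
        rows.append((json.loads(key), json.loads(value)))
    return rows


# Three lines copied from the output shown above.
sample_lines = ['"to"\t12\n', '"and"\t15\n', '"analytics"\t20\n']

# Sort by count, highest first.
ranked = sorted(parse_result_lines(sample_lines), key=lambda row: row[1], reverse=True)
for word, count in ranked:
    print(f"{word}\t{count}")
```

To use it on the real file, the sample list could be replaced with the lines of `result.txt`, for example via `open("result.txt")`.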