Speeding up reduce()
===

* Last modified: May 14, 2022

*Adapted from the book "Mastering Large Datasets with Python".*

Test files
---

In [1]:
!rm -rf /tmp/input /tmp/output
!mkdir /tmp/input

In [2]:
%%writefile /tmp/input/text0.txt
Analytics is the discovery, interpretation, and communication of meaningful patterns
in data. Especially valuable in areas rich with recorded information, analytics relies
on the simultaneous application of statistics, computer programming and operations research
to quantify performance.

Organizations may apply analytics to business data to describe, predict, and improve business
performance. Specifically, areas within analytics include predictive analytics, prescriptive
analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big
Data Analytics, retail analytics, store assortment and stock-keeping unit optimization,
marketing optimization and marketing mix modeling, web analytics, call analytics, speech
analytics, sales force sizing and optimization, price and promotion modeling, predictive
science, credit risk analysis, and fraud analytics. Since analytics can require extensive
computation (see big data), the algorithms and software used for analytics harness the most
current methods in computer science, statistics, and mathematics.

The field of data analysis. Analytics often involves studying past historical data to
research potential trends, to analyze the effects of certain decisions or events, or to
evaluate the performance of a given tool or scenario. The goal of analytics is to improve
the business by gaining knowledge which can be used to make improvements or changes.

Data analytics (DA) is the process of examining data sets in order to draw conclusions
about the information they contain, increasingly with the aid of specialized systems
and software. Data analytics technologies and techniques are widely used in commercial
industries to enable organizations to make more-informed business decisions and by
scientists and researchers to verify or disprove scientific models, theories and
hypotheses.

Writing /tmp/input/text0.txt


In [3]:
import shutil

# Make 9,999 copies of text0.txt, for 10,000 identical input files in total.
for i in range(1, 10000):
    shutil.copy("/tmp/input/text0.txt", f"/tmp/input/text{i}.txt")

Reading the files line by line
---

In [4]:
import fileinput
import glob
import os


def load_data(file_path):
    """Return a lazy iterator over the lines of a file, or of every file in a directory."""
    # -----------------------------------------------------------------------------------
    def make_iterator_from_single_file(file_path):
        with open(file_path, "rt") as file:
            for line in file:
                yield line

    # -----------------------------------------------------------------------------------
    def make_iterator_from_multiple_files(file_path):
        file_path = os.path.join(file_path, "*")
        files = glob.glob(file_path)
        with fileinput.input(files=files) as file:
            for line in file:
                yield line

    # -----------------------------------------------------------------------------------
    if os.path.isfile(file_path):
        return make_iterator_from_single_file(file_path)
    return make_iterator_from_multiple_files(file_path)
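
To see what the multi-file branch yields, here is a minimal sketch, assuming two throwaway files in a temporary directory (the file names and contents are made up for illustration). `fileinput.input` chains the files into a single stream of lines, and each line keeps its trailing newline. The `sorted` call is an addition that makes the order deterministic, since `glob` does not guarantee one.

```python
import fileinput
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Two small hypothetical files, created only for this demonstration.
    for name, text in [("a.txt", "first\nsecond\n"), ("b.txt", "third\n")]:
        with open(os.path.join(folder, name), "w") as handle:
            handle.write(text)

    # glob returns paths in arbitrary order; sort them for a stable result.
    files = sorted(glob.glob(os.path.join(folder, "*")))

    # fileinput reads the files one after another as a single line stream.
    with fileinput.input(files=files) as stream:
        lines = [line for line in stream]

print(lines)
```

The lines are consumed one at a time, so the same pattern should scale to directories that do not fit in memory.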

Preprocessing functions
---

In [5]:
import string


def tolower(x):
    return x.lower()


def remove_punctuation(x):
    return x.translate(str.maketrans("", "", string.punctuation))


def remove_newline(x):
    return x.replace("\n", "")


def split_lines(x):
    return x.split()
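
As a quick check of what these helpers do, the sketch below applies them to one made-up sample line (the string is an assumption for illustration, not part of the test files). The helpers are repeated so the snippet runs on its own.

```python
import string


def tolower(x):
    return x.lower()


def remove_punctuation(x):
    return x.translate(str.maketrans("", "", string.punctuation))


def remove_newline(x):
    return x.replace("\n", "")


def split_lines(x):
    return x.split()


# Hypothetical sample line, chosen only for illustration.
sample = "Data analytics (DA) is more-informed, sometimes.\n"

# Split the line into words first, then clean each word.
words = [remove_punctuation(tolower(remove_newline(w))) for w in split_lines(sample)]
print(words)
```

Note that the hyphen belongs to `string.punctuation`, so a hyphenated word such as `more-informed` collapses into the single token `moreinformed`.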

map()
---

In [6]:
from multiprocessing import Pool

from toolz.functoolz import compose
from toolz.itertoolz import concat

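# compose() applies its functions from right to left: remove_newline runs first,
# then tolower, then remove_punctuation.
compose_pipeline = compose(
    remove_punctuation,
    tolower,
    remove_newline,
)

with Pool() as pool:
    # Split every line into words in parallel, flatten the per-line lists into a
    # single stream of words, then clean each word in parallel.
    result = pool.map(split_lines, load_data("/tmp/input/"))
    result = concat(result)
    result = pool.map(compose_pipeline, result)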


result[:5]

['analytics', 'is', 'the', 'discovery', 'interpretation']
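
The order of the functions passed to `compose` matters because they are applied from right to left. The following stdlib-only sketch reproduces that behaviour with a hypothetical helper, `compose_right_to_left` (not part of toolz), and two toy functions, so the effect of the order is visible.

```python
from functools import reduce


def compose_right_to_left(*funcs):
    """Hypothetical stand-in for compose: the last function given runs first."""

    def composed(x):
        # Walk the functions from last to first, feeding each result onward.
        return reduce(lambda value, func: func(value), reversed(funcs), x)

    return composed


def add_one(x):
    return x + 1


def double(x):
    return x * 2


# double runs first, then add_one: (3 * 2) + 1
double_then_add = compose_right_to_left(add_one, double)
# add_one runs first, then double: (3 + 1) * 2
add_then_double = compose_right_to_left(double, add_one)

print(double_then_add(3), add_then_double(3))
```

For the three text-cleaning helpers the order happens not to change the result, but that is a property of those functions and not of composition in general.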

fold()
---

In [12]:
from toolz.sandbox.parallel import fold


def make_counts(acc, nxt):
    # fold() can hand the same `default` object to more than one chunk, so never
    # mutate it: start a fresh dict whenever the accumulator is still empty.
    if not acc:
        acc = {}
    acc[nxt] = acc.get(nxt, 0) + 1
    return acc


def sum_counts(left, right):
    # Merge into a copy so that no partial result is modified in place.
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0) + value
    return merged


with Pool() as pool:
    result = fold(
        binop=make_counts,
        seq=result,
        default={},
        map=pool.map,
        chunksize=100,
        combine=sum_counts,
    )

result

{'analytics': 200000,
 'is': 30000,
 'the': 120000,
 'discovery': 10000,
 'interpretation': 10000,
 'and': 150000,
 'communication': 10000,
 'of': 80000,
 ...
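
Conceptually, a parallel fold splits the sequence into chunks, reduces each chunk independently, and then combines the partial results. The sketch below shows that idea with the standard library only. `chunked_fold`, `chunks_of`, `reduce_chunk`, `count_word`, and `merge_counts` are hypothetical helpers written for this illustration, not the toolz implementation, and the built-in `map` stands in for `Pool.map`.

```python
from collections import Counter
from functools import reduce
from itertools import islice


def count_word(acc, word):
    acc[word] = acc.get(word, 0) + 1
    return acc


def merge_counts(left, right):
    # Combine two partial counts without modifying either input.
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0) + value
    return merged


def chunks_of(seq, size):
    """Yield successive lists of at most `size` items from seq."""
    iterator = iter(seq)
    while chunk := list(islice(iterator, size)):
        yield chunk


def reduce_chunk(chunk):
    # Every chunk starts from its own fresh dict, so in-place updates are safe.
    return reduce(count_word, chunk, {})


def chunked_fold(seq, chunksize, map_func=map):
    """Hypothetical sketch of a fold: reduce each chunk, then combine partials."""
    partials = map_func(reduce_chunk, chunks_of(seq, chunksize))
    return reduce(merge_counts, partials, {})


words = "to be or not to be that is the question".split() * 3
result = chunked_fold(words, chunksize=4)
print(result["to"], result["be"], result["question"])
```

The key design point is that each chunk gets its own accumulator. When several chunks share one mutable starting dict, their partial results alias each other and the combine step counts the same words more than once. Because `reduce_chunk` is a module-level function, passing a pool's `map` method as `map_func` should also work, although that is not exercised here.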

Comparison
---

In [14]:
from functools import reduce

In [15]:
%%timeit

result = map(
    tolower,
    map(
        remove_punctuation,
        map(
            remove_newline,
            concat(
                map(
                    split_lines,
                    load_data("/tmp/input/"),
                )
            ),
        ),
    ),
)


reduce(
    make_counts,
    result,
    {},
)

5.36 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [16]:
%%timeit

with Pool() as pool:

    result = pool.map(split_lines, load_data("/tmp/input/"))
    result = concat(result)
    result = pool.map(compose_pipeline, result)
    result = fold(
        make_counts,
        result,
        {},
        pool.map,
        10000,
        sum_counts,
    )

2.81 s ± 686 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
