Mapreduce Workflow #24

Merged
merged 4 commits into from
May 17, 2024

Conversation

haochenpan
Collaborator

@haochenpan haochenpan commented May 14, 2024

Description

This initial commit introduces the map-reduce word count workflow. The workflow accepts configurations for the number of map tasks, the number of words each map task handles, and the range of word lengths. The workflow generates a list of paragraphs, each serving as input for a map task. The workflow executes the map tasks, waits for their completion, and then initiates a reduce task to aggregate the results into a single counter object.
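As a rough illustration of the flow described above (not the code in this PR), the map and reduce steps could be sketched as follows. The names `map_task`, `reduce_task`, and `word_count` are hypothetical, and a plain `ThreadPoolExecutor` stands in for whichever executor the workflow is run with.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(paragraph: str) -> Counter:
    """Count the words in one paragraph (hypothetical helper)."""
    return Counter(paragraph.split())


def reduce_task(counters: list[Counter]) -> Counter:
    """Merge the per-task counters into a single counter."""
    total: Counter = Counter()
    for counter in counters:
        total.update(counter)
    return total


def word_count(paragraphs: list[str]) -> Counter:
    """Run one map task per paragraph, wait for all, then reduce."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(map_task, p) for p in paragraphs]
        # Calling result() blocks until each map task has finished.
        counters = [f.result() for f in futures]
    return reduce_task(counters)
```

Calling `word_count(["a b a", "b c"])` would return a counter with `a` and `b` counted twice and `c` once.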

Fixes #14

Type of Change

  • New feature (non-breaking change which adds functionality)

Testing

Run the workflow using standard commands, as it does not rely on external libraries or data dependencies. Below are two example commands. Additional commands and runtime results can be found in webs/wf/mapreduce/__init__.py:

    python -m webs.run mapreduce --executor thread-pool \
      --map-task-word-count 1000000 --map-task-count 2

    python -m webs.run mapreduce --executor thread-pool \
      --map-task-word-count 1000000 --map-task-count 10 \
      --word-len-min 1 --word-len-max 2

@haochenpan haochenpan added the development CI workflows, PR/issue templates, repository configurations label May 14, 2024
This was referenced May 14, 2024
@gpauloski
Contributor

Thanks, @haochenpan! This all looks great and can be merged.

Before we merge, though, do you think we could add support for a real dataset? Dask has an example of word counts on the Enron email dataset: https://distributed.dask.org/en/stable/examples/word-count.html. They download from AWS to HDFS, but it seems like we could do it easily with just txt files that the user downloads and then points to by providing the directory path. There also seem to be a few sources for the Enron email dataset, such as this page from CMU: https://www.cs.cmu.edu/~enron/.

What do you think?

@haochenpan
Collaborator Author

Hi Greg,

I would love to improve this workflow to support such a dataset! I'll work on it tomorrow.

Best wishes,
Haochen

@haochenpan
Collaborator Author

Hi @gpauloski ,

The workflow now supports two run modes: random and enron. For each run, the user must specify the run mode with --mode and sets the number of map tasks with --map-task-count. As before, the commands and runtime results can be found in webs/wf/mapreduce/__init__.py.

In the random mode, the user can set the number of words per map task (the default is 500) and the range of word lengths (the default is [1, 1]).

In the enron mode, there is one additional argument besides mode and map-task-count: mail-dir, which defaults to maildir in the user's home folder.
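One way the files under mail-dir could be divided among the map tasks is sketched below. This is an assumption about the approach rather than a description of the PR: the helper name `partition_files` and the round-robin split are hypothetical.

```python
from pathlib import Path


def partition_files(mail_dir: Path, map_task_count: int) -> list[list[Path]]:
    """Split all files under mail_dir into map_task_count groups (hypothetical)."""
    # Sort so that the assignment of files to tasks is reproducible.
    files = sorted(p for p in mail_dir.rglob("*") if p.is_file())
    # Round-robin: task i receives files i, i + n, i + 2n, ...
    return [files[i::map_task_count] for i in range(map_task_count)]
```

For example, `partition_files(Path.home() / "maildir", 10)` would return ten lists of paths, one per map task. A round-robin split balances the number of files per task but not their sizes.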

Lastly, the user can specify how many most frequent words to save (using n_freq) and where to save them (using out). The defaults are 10 and out.txt in the run directory, respectively.
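A minimal sketch of this last step, assuming the reduced result is a `collections.Counter`. The function name `save_most_frequent` and the tab-separated output format are hypothetical; only the defaults (10 and out.txt) come from the description above.

```python
from collections import Counter


def save_most_frequent(
    counts: Counter,
    n_freq: int = 10,
    out: str = "out.txt",
) -> None:
    """Write the n_freq most frequent words to out, one per line (hypothetical)."""
    with open(out, "w", encoding="utf-8") as f:
        # most_common returns (word, count) pairs in descending count order.
        for word, count in counts.most_common(n_freq):
            f.write(f"{word}\t{count}\n")
```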

@gpauloski
Contributor

Thanks, @haochenpan! This looks great. I'll give it a try and fix up those branch conflicts I just created, but then I think it should be good to merge.

@haochenpan
Collaborator Author

Thanks! Regarding pyproject.toml, one minor issue: I had to add an extra * so that tox skips checking this folder.

    omit = [
        "examples",
        "webs/wf/mapreduce/*",
    ]

If this is not the case on your side, please drop the *.

Best wishes,
Haochen

@gpauloski gpauloski merged commit d4afe24 into proxystore:main May 17, 2024
7 checks passed
@gpauloski gpauloski added enhancement New features or improvements to existing functionality and removed development CI workflows, PR/issue templates, repository configurations labels May 17, 2024
Labels
enhancement New features or improvements to existing functionality
Development

Successfully merging this pull request may close these issues.

New workflow: MapReduce
2 participants