Mapreduce Workflow #24
Thanks, @haochenpan! This all looks great and can be merged. Before we merge, though, do you think we could add support for a real dataset? Dask has an example of word counts on the Enron email dataset: https://distributed.dask.org/en/stable/examples/word-count.html. They download from AWS to HDFS, but it seems like we could do it easily with just txt files that the user downloads and then provides as a path to the directory. There also seem to be a few sources of the Enron email dataset, like this page from CMU: https://www.cs.cmu.edu/~enron/. What do you think?
Hi Greg, I would love to improve this workflow to support such a dataset! I'll work on it tomorrow. Best wishes,
Hi @gpauloski , The workflow now supports two run modes: In the In the Lastly, the user can specify how many most frequent words to save (using |
Thanks, @haochenpan! This looks great. I'll give it a try and fix up those branch conflicts I just created, but then I think it should be good to merge.
Thanks! About pyproject.toml: one minor issue is that I had to add an extra * for tox to skip checking this folder.
If that is not needed on your side, please drop the *. Best wishes,
add "webs/wf/mapreduce/" to the list of omit
Description
This initial commit introduces the map-reduce word count workflow. The workflow accepts configurations for the number of map tasks, the number of words each map task handles, and the range of word lengths. The workflow generates a list of paragraphs, each serving as input for a map task. The workflow executes the map tasks, waits for their completion, and then initiates a reduce task to aggregate the results into a single counter object.
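The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`generate_paragraph`, `map_task`, `reduce_task`, `run_word_count`), their parameters, and the use of `ThreadPoolExecutor` as the task executor are all assumptions made for the example.

```python
from __future__ import annotations

import random
import string
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def generate_paragraph(
    rng: random.Random, word_count: int, min_len: int, max_len: int,
) -> str:
    """Build one paragraph of random lowercase words (hypothetical helper)."""
    words = (
        ''.join(rng.choices(string.ascii_lowercase, k=rng.randint(min_len, max_len)))
        for _ in range(word_count)
    )
    return ' '.join(words)


def map_task(paragraph: str) -> Counter[str]:
    """Map step: count the words in a single paragraph."""
    return Counter(paragraph.split())


def reduce_task(counters: list[Counter[str]]) -> Counter[str]:
    """Reduce step: merge the per-paragraph counts into one Counter."""
    total: Counter[str] = Counter()
    for counter in counters:
        total.update(counter)  # Counter.update adds counts, it does not replace
    return total


def run_word_count(
    map_task_count: int,
    word_count: int,
    min_len: int,
    max_len: int,
    seed: int = 0,
) -> Counter[str]:
    """Generate inputs, run all map tasks, wait for them, then reduce."""
    rng = random.Random(seed)
    paragraphs = [
        generate_paragraph(rng, word_count, min_len, max_len)
        for _ in range(map_task_count)
    ]
    # ThreadPoolExecutor stands in for whatever executor the workflow uses.
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(map_task, paragraphs))
    return reduce_task(partial_counts)
```

Calling `run_word_count(4, 50, 1, 3)` would run four map tasks of 50 words each and return a single `Counter`; `Counter.most_common(n)` could then be used to pick out the most frequent words.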
Fixes #14
Type of Change
Testing
Run the workflow using standard commands, as it does not rely on external libraries or data dependencies. Below are two example commands. Additional commands and runtime results can be found in webs/wf/mapreduce/__init__.py: