<a href="https://colab.research.google.com/github/groda/big_data/blob/master/mrjob_wordcount.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://github.com/groda/big_data"><div><img src="https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true" align=right width="90"></div></a>
# A simple MapReduce job with mrjob

`mrjob` is a **Python framework** for writing and running multi-step **MapReduce jobs** and **hybrid Spark jobs** entirely in Python. Jobs can be tested locally and then run on different backends such as **Hadoop, YARN, or AWS EMR** with minimal code changes. (**Note:** although `mrjob` is robust and was developed and used extensively by Yelp, the project has not been actively maintained in recent years.)

In this notebook, we'll start with a basic wordcount example to demonstrate its core functionality.

Find the official `mrjob` documentation here: [https://mrjob.readthedocs.io/en/latest/](https://mrjob.readthedocs.io/en/latest/)

In [1]:
!pip install mrjob

Collecting mrjob
  Downloading mrjob-0.7.4-py2.py3-none-any.whl.metadata (7.3 kB)
Downloading mrjob-0.7.4-py2.py3-none-any.whl (439 kB)
   439.6/439.6 kB 5.2 MB/s eta 0:00:00
Installing collected packages: mrjob
Successfully installed mrjob-0.7.4


Let us check if there are any examples that come with the `mrjob` distribution.

In [2]:
!find /usr -name "*examples*" |grep mrjob

/usr/local/lib/python3.12/dist-packages/mrjob/examples


Here's the list of examples:

In [3]:
!ls /usr/local/lib/python3.12/dist-packages/mrjob/examples

docs-to-classify	      mr_phone_to_url.py
__init__.py		      mr_sparkaboom.py
mr_boom.py		      mr_spark_most_used_word.py
mr_count_lines_by_file.py     mr_spark_wordcount.py
mr_count_lines_right.py       mr_spark_wordcount_script.py
mr_count_lines_wrong.py       mr_text_classifier.py
mr_grep.py		      mr_wc.py
mr_jar_step_example.py	      mr_word_freq_count.py
mr_log_sampler.py	      mr_words_containing_u_freq_count.py
mr_most_used_word.py	      nicknack-1.0.1.jar
mr_next_word_stats.py	      __pycache__
mr_nick_nack_input_format.py  spark_wordcount_script.py
mr_nick_nack.py		      stop_words.txt


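Judging by its name, `mr_wc.py` looks like the classic "word count" example. Its docstring, shown in the listing that follows, describes it as an implementation of the `wc` utility, so it counts lines, words, and characters rather than word frequencies.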

In [4]:
!cat /usr/local/lib/python3.12/dist-packages/mrjob/examples/mr_wc.py

# Copyright 2009-2010 Yelp
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""An implementation of wc as an MRJob.

This is meant as an example of why mapper_final is useful."""
from mrjob.job import MRJob


class MRWordCountUtility(MRJob):

    def __init__(self, *args, **kwargs):
        super(MRWordCountUtility, self).__init__(*args, **kwargs)
        self.chars = 0
        self.words = 0
        self.lines = 0

    def mapper(self, _, line):
        # Don't actually yield anything for each line. 
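The docstring in the listing says the example illustrates why `mapper_final` is useful. As a rough, pure-Python illustration of that pattern (a sketch, not the contents of `mr_wc.py`; the class `WcSketch`, the helper `run`, and the `+ 1` per line for the stripped newline are assumptions made for this example), a mapper can accumulate counters line by line and emit them only once at the end:

```python
from collections import defaultdict


class WcSketch:
    """Hypothetical stand-in for a wc-style job; this is not mrjob's code."""

    def __init__(self):
        self.chars = 0
        self.words = 0
        self.lines = 0

    def mapper(self, _, line):
        # Accumulate counts locally instead of emitting a pair per line.
        # Assumption: lines arrive without their newline, hence the + 1.
        self.chars += len(line) + 1
        self.words += len(line.split())
        self.lines += 1
        return iter(())  # nothing is emitted per line

    def mapper_final(self):
        # Emit the accumulated totals once, after the last line was seen.
        yield "chars", self.chars
        yield "words", self.words
        yield "lines", self.lines

    def reducer(self, key, values):
        # Sum the partial totals coming from all mapper instances.
        yield key, sum(values)


def run(splits):
    """Simulate one mapper instance per input split, then a reduce phase."""
    grouped = defaultdict(list)
    for split in splits:
        job = WcSketch()
        for line in split:
            for key, value in job.mapper(None, line):
                grouped[key].append(value)
        for key, value in job.mapper_final():
            grouped[key].append(value)
    reducer = WcSketch()
    return {k: v for key in grouped for k, v in reducer.reducer(key, grouped[key])}


totals = run([["John had", "Great Big"], ["Waterproof"]])
print(totals)
```

Emitting three pairs per mapper instead of three pairs per line keeps the amount of data passed to the reducer small, which is the usual motivation for this pattern.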

Let us create a symbolic link to `/usr/local/lib/python3.12/dist-packages/mrjob/examples` so that we don't need to type long paths (and the folder is visible in the left pane).

In [5]:
!ln -s /usr/local/lib/python3.12/dist-packages/mrjob/examples examples

We are going to need some text data to run the wordcount example. It is common for Hadoop distributions to provide some toy data together with example scripts. `mrjob` does the same: it includes some data in the folder `docs-to-classify` (a subfolder of `examples`). This will do for our wordcount demonstration.

In [6]:
!ls -lh examples/docs-to-classify

total 88K
-rw-r--r-- 1 root root 9.4K Oct 10 21:18 american_feuillage-whitman-america.txt
-rw-r--r-- 1 root root  933 Oct 10 21:18 as_i_ponderd_in_silence-whitman.txt
-rw-r--r-- 1 root root 1.2K Oct 10 21:18 buckingham_palace-milne-not_america.txt
-rw-r--r-- 1 root root  20K Oct 10 21:18 chants_democratic-whitman-america.txt
-rw-r--r-- 1 root root  288 Oct 10 21:18 corner_of_the_street-milne-not_whitman.txt
-rw-r--r-- 1 root root  154 Oct 10 21:18 happiness-milne.txt
-rw-r--r-- 1 root root 1.5K Oct 10 21:18 in_cabind_ships_at_sea-whitman.txt
-rw-r--r-- 1 root root  326 Oct 10 21:18 lines_and_squares-milne-animals.txt
-rw-r--r-- 1 root root  432 Oct 10 21:18 ones_self_i_sing-whitman.txt
-rw-r--r-- 1 root root 1.2K Oct 10 21:18 puppy_and_i-milne-animals.txt
-rw-r--r-- 1 root root  415 Oct 10 21:18 the_christening-milne-animals.txt
-rw-r--r-- 1 root root  869 Oct 10 21:18 the_four_friends-milne-animals.txt
-rw-r--r-- 1 root root  500 Oct 10 21:18 to_a_historian-whitman.txt
-rw-r--r-- 1 ro

In [7]:
%%bash

DATA=examples/docs-to-classify

python examples/mr_wc.py $DATA

"lines"	660
"words"	6371
"chars"	37967


No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/mr_wc.root.20251010.211910.064207
Running step 1 of 1...
job output is in /tmp/mr_wc.root.20251010.211910.064207/output
Streaming final output from /tmp/mr_wc.root.20251010.211910.064207/output...
Removing temp directory /tmp/mr_wc.root.20251010.211910.064207...


We can verify whether the result is correct by concatenating all files in `examples/docs-to-classify` and counting lines/words/characters with the customary shell command `wc`.

In [8]:
!cat examples/docs-to-classify/* |wc

    660    6371   38444


The number of lines and words is the same but `wc` returns a different number of characters ($38444$ vs. the $37967$ from the `mrjob` example).

In [9]:
38444-37967

477

In [10]:
!python examples/mr_wc.py examples/docs-to-classify/american_feuillage-whitman-america.txt  2>/dev/null |grep chars

"chars"	9409


In [11]:
%%bash

for f in examples/docs-to-classify/*
do
  wc -c $f
  DATA=$f
  python examples/mr_wc.py $DATA 2>/dev/null |grep chars
done


9555 examples/docs-to-classify/american_feuillage-whitman-america.txt
"chars"	9409
933 examples/docs-to-classify/as_i_ponderd_in_silence-whitman.txt
"chars"	925
1151 examples/docs-to-classify/buckingham_palace-milne-not_america.txt
"chars"	1085
19857 examples/docs-to-classify/chants_democratic-whitman-america.txt
"chars"	19716
288 examples/docs-to-classify/corner_of_the_street-milne-not_whitman.txt
"chars"	280
154 examples/docs-to-classify/happiness-milne.txt
"chars"	152
1496 examples/docs-to-classify/in_cabind_ships_at_sea-whitman.txt
"chars"	1482
326 examples/docs-to-classify/lines_and_squares-milne-animals.txt
"chars"	318
432 examples/docs-to-classify/ones_self_i_sing-whitman.txt
"chars"	428
1154 examples/docs-to-classify/puppy_and_i-milne-animals.txt
"chars"	1092
415 examples/docs-to-classify/the_christening-milne-animals.txt
"chars"	403
869 examples/docs-to-classify/the_four_friends-milne-animals.txt
"chars"	867
500 examples/docs-to-classify/to_a_historian-whitman.txt
"chars"	500


`mrjob` never returns more characters than `wc -c` and in most files returns fewer (only `to_a_historian-whitman.txt` gives the same count, $500$). Let us open the smallest file and count the characters manually to understand what's going on.

The smallest file is `examples/docs-to-classify/happiness-milne.txt` with $152$ characters according to `mrjob` and $154$ according to `wc`.

In [12]:
!cat examples/docs-to-classify/happiness-milne.txt

John had
Great Big
Waterproof
Boots on;
John had a
Great Big
Waterproof
Hat;
John had a
Great Big
Waterproof
Mackintosh –
And that
(Said John)
Is
That.


Two hours later ... every time I count the characters I get a different number 🤔

Let's try using `wc`: `wc -c` counts bytes while `wc -m` counts characters, so if the result of `wc -c` is greater than the result of `wc -m`, the file contains multi-byte characters.

In [13]:
!wc -m examples/docs-to-classify/happiness-milne.txt

152 examples/docs-to-classify/happiness-milne.txt


In [14]:
!wc -c examples/docs-to-classify/happiness-milne.txt

154 examples/docs-to-classify/happiness-milne.txt
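
The same byte/character distinction can be reproduced in Python. The sketch below assumes the file is UTF-8 encoded and that the dash in the line `Mackintosh –` is an en dash (U+2013); both assumptions are consistent with the 2-byte gap between $154$ and $152$, but neither is checked against the file itself.

```python
text = "Mackintosh –\n"  # one line of the poem; "–" is assumed to be U+2013

n_chars = len(text)                  # characters, comparable to `wc -m`
n_bytes = len(text.encode("utf-8"))  # bytes, comparable to `wc -c`

# Characters outside ASCII and the number of bytes each needs in UTF-8.
multibyte = {ch: len(ch.encode("utf-8")) for ch in text if ord(ch) > 127}

print(n_chars, n_bytes, multibyte)
```

A single en dash takes three bytes in UTF-8, so it adds two to the byte count while counting as one character.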


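OK, so our `mrjob` script counts each multi-byte character as a single character, whereas `wc` without options (like `wc -c`) counts bytes. Let us verify that by comparing the script's output with `wc -m` for every file: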

In [15]:
%%bash

for f in examples/docs-to-classify/*
do
  wc -m $f
  DATA=$f
  python examples/mr_wc.py $DATA 2>/dev/null |grep chars
done

9409 examples/docs-to-classify/american_feuillage-whitman-america.txt
"chars"	9409
925 examples/docs-to-classify/as_i_ponderd_in_silence-whitman.txt
"chars"	925
1085 examples/docs-to-classify/buckingham_palace-milne-not_america.txt
"chars"	1085
19716 examples/docs-to-classify/chants_democratic-whitman-america.txt
"chars"	19716
280 examples/docs-to-classify/corner_of_the_street-milne-not_whitman.txt
"chars"	280
152 examples/docs-to-classify/happiness-milne.txt
"chars"	152
1482 examples/docs-to-classify/in_cabind_ships_at_sea-whitman.txt
"chars"	1482
318 examples/docs-to-classify/lines_and_squares-milne-animals.txt
"chars"	318
428 examples/docs-to-classify/ones_self_i_sing-whitman.txt
"chars"	428
1092 examples/docs-to-classify/puppy_and_i-milne-animals.txt
"chars"	1092
403 examples/docs-to-classify/the_christening-milne-animals.txt
"chars"	403
867 examples/docs-to-classify/the_four_friends-milne-animals.txt
"chars"	867
500 examples/docs-to-classify/to_a_historian-whitman.txt
"chars"	500


In [16]:
!cat examples/docs-to-classify/* |wc -m

37967


✅ $37967$ is the same result as the output of the `mrjob` script!

Note that this job ran only **locally**.

To see an example of a job running on a cluster, please check the tutorial below:

[Getting started with `mrjob`](https://github.com/groda/big_data/blob/master/getting_started_with_mrjob.ipynb)

## Context: `mrjob` Maintenance Status

`mrjob` was originally developed and open-sourced by **Yelp**, the well-known business review platform. Yelp created and heavily relied on `mrjob` as their primary framework for running analytical jobs across their large Hadoop clusters.

While `mrjob` is a robust and widely used tool that simplifies the development and deployment of Python-based MapReduce and Spark jobs, the project's **active maintenance has slowed significantly in recent years.**

Here's why this is relevant:

* **Yelp's Transition:** Like many tech companies, Yelp has likely evolved its data infrastructure, shifting toward newer technologies (such as pure Spark, Flink, or cloud-native solutions) that offer better performance or integration with modern cloud platforms. This reduces the immediate need for them to heavily invest resources in updating the `mrjob` core library.
* **Feature Stagnation:** The codebase generally receives fewer updates, bug fixes, and new features compared to actively maintained frameworks. Users may find that support for the **very latest versions of Hadoop, Spark, or Python** can lag behind.
* **Stability vs. Modernity:** Despite the lack of recent updates, `mrjob` remains stable and perfectly functional for environments using compatible versions of Hadoop and Spark. It serves as a strong, proven framework for those who value its **simplicity and unified Python interface** over the bleeding edge of data technology.
