<a href="https://colab.research.google.com/github/groda/big_data/blob/master/getting_started_with_mrjob.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://github.com/groda/big_data"><div><img src="https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true" align=right width="90" alt="Logo Big Data for Beginners"></div></a>
<h1>
  Getting started with mrjob
</h1>

# 🚀 Meet `mrjob`: Your Friendly Python MapReduce Library

[`mrjob`](https://mrjob.readthedocs.io/en/latest/) is a powerful yet easy-to-use Python library that lets you write and run Hadoop streaming jobs with pure Python. Whether you're crunching logs, processing large datasets, or experimenting with big data workflows, `mrjob` helps you build scalable MapReduce jobs without diving into Java or complex Hadoop configurations.

It works locally, on your own Hadoop cluster, or in the cloud via Amazon EMR — no fuss. With `mrjob`, you can focus on your logic and let the library handle the messy bits of distributed computing.

- ✅ Write MapReduce jobs in Python
- 🌍 Run them locally or at scale (Hadoop/EMR)
- 🛠️ Easily test and debug your code
- ☁️ Seamless integration with AWS

Think of it as a Swiss army knife for data processing: sharp, flexible, and ready for anything.


In [1]:
!python -V

Python 3.12.11


# 👋 Hello, MapReduce! — A Friendly `mrjob` Demo

This simple `mrjob` script, called `MRHello`, is the MapReduce equivalent of a "Hello, World!" — but with a twist! When run, it acts like a distributed greeting card that tells you:

- 🐍 Which Python version it's using  
- 🖥️ The hostname of the machine running each map task  
- 🌍 And, of course, it says "Hello, World!"

By emitting these values from the **mapper**, this job showcases how easy it is to run Python code across multiple machines using `mrjob`. It's perfect for getting started and verifying that your Hadoop or EMR setup is working correctly — with a cheerful greeting built in!


## Install `mrjob`

In [2]:
!pip install mrjob

Collecting mrjob
  Downloading mrjob-0.7.4-py2.py3-none-any.whl.metadata (7.3 kB)
Downloading mrjob-0.7.4-py2.py3-none-any.whl (439 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 439.6/439.6 kB 5.8 MB/s eta 0:00:00
Installing collected packages: mrjob
Successfully installed mrjob-0.7.4


In [3]:
!pip show mrjob

Name: mrjob
Version: 0.7.4
Summary: Python MapReduce framework
Home-page: http://github.com/Yelp/mrjob
Author: David Marin
Author-email: dm@davidmarin.org
License: Apache
Location: /usr/local/lib/python3.12/dist-packages
Requires: PyYAML
Required-by: 


## Create a `mrjob` script

In [4]:
%%writefile mr_hello.py

from mrjob.job import MRJob
import sys
import socket
import os

class MRHello(MRJob):

    def mapper(self, _, line):
        yield "Python version", sys.version
        yield "Hostname", socket.gethostname()
        yield "HOME", os.getenv("HOME")
        yield "USERNAME", os.getenv("USER")
        yield "Output", "Hello, World!"

if __name__ == '__main__':
    MRHello.run()

Writing mr_hello.py


## Run the job

The standard way to run a job would be

```
python mr_hello.py input_file.txt
```

as suggested in the [Quickstart documentation](https://mrjob.readthedocs.io/en/latest/guides/quickstart.html#writing-your-first-job).

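Since our "Hello, World!" job only emits fixed strings and does not process its input, we feed it a single line through a [_here string_](https://askubuntu.com/a/678919). The mapper is called once per input line, so one line of input produces one set of greetings.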

In [5]:
!python mr_hello.py <<<"test"

No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory /tmp/mr_hello.root.20251010.190232.280879
Running step 1 of 1...
reading from STDIN
job output is in /tmp/mr_hello.root.20251010.190232.280879/output
Streaming final output from /tmp/mr_hello.root.20251010.190232.280879/output...
"Python version"	"3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]"
"Hostname"	"87c92d6b4b45"
"HOME"	"/root"
"USERNAME"	null
"Output"	"Hello, World!"
Removing temp directory /tmp/mr_hello.root.20251010.190232.280879...
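The `null` next to `"USERNAME"` does not indicate a failure: `os.getenv` returns `None` when a variable is unset (apparently the case for `USER` in this session), and `None` is serialized as `null` in the JSON output. A minimal sketch of that behaviour with the standard library follows; the variable name `MRJOB_DEMO_UNSET_VAR` and the `"unknown"` default are illustrative assumptions, not part of the job above.

```python
import json
import os

# Hypothetical variable name, chosen so that it is almost certainly unset.
value = os.getenv("MRJOB_DEMO_UNSET_VAR")

# os.getenv returns None for an unset variable ...
print(value)

# ... and JSON serializes None as null, which is what the job output shows.
print(json.dumps(value))

# Passing a default avoids the null if a placeholder is preferred.
fallback = os.getenv("MRJOB_DEMO_UNSET_VAR", "unknown")
print(fallback)
```

If a real user name is needed in the job, passing a default to `os.getenv` as in the last lines is one simple option.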


We can write the output to a folder (let us call it `out`)

In [6]:
!python mr_hello.py --output-dir out <<<"test"

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/mr_hello.root.20251010.190232.788120
reading from STDIN
job output is in out
Removing temp directory /tmp/mr_hello.root.20251010.190232.788120...


Inspect the contents of `out`

In [7]:
!ls out

part-00000


In [8]:
!cat out/part-00000

"Python version"	"3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]"
"Hostname"	"87c92d6b4b45"
"HOME"	"/root"
"USERNAME"	null
"Output"	"Hello, World!"


## Another example: wordcount

Here's another example, the classic "wordcount". The source of the following code is https://mrjob.readthedocs.io/en/latest/guides/quickstart.html#writing-your-first-job

In [9]:
%%writefile mr_wordcount.py

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()

Writing mr_wordcount.py
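To see what the mapper and reducer compute before running the job, the same logic can be traced in plain Python. The sketch below is only an illustration of the map, group-by-key, and reduce phases; it does not use `mrjob`, and the helper `run_locally` and the two sample lines are assumptions made for the example.

```python
from collections import defaultdict

def mapper(line):
    """Emit the same three (key, value) pairs as MRWordFrequencyCount.mapper."""
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1

def reducer(key, values):
    """Sum all values that share a key."""
    yield key, sum(values)

def run_locally(lines):
    """Plain-Python stand-in for map, shuffle (group by key), and reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results

sample = ["Alice was beginning", "to get very tired"]
print(run_locally(sample))
```

The grouping dictionary plays the role of the shuffle phase: every value emitted under the same key ends up in the same call to the reducer.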


We need an input file. Let us download "Alice's Adventures in Wonderland" from Project Gutenberg to `input.txt`.

In [10]:
!wget --no-clobber https://www.gutenberg.org/files/11/11-0.txt -O input.txt

--2025-10-10 19:02:33--  https://www.gutenberg.org/files/11/11-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 151191 (148K) [text/plain]
Saving to: ‘input.txt’


2025-10-10 19:02:33 (1.96 MB/s) - ‘input.txt’ saved [151191/151191]



Count the characters, words, and lines in `input.txt`.

In [11]:
!python mr_wordcount.py --output-dir out_wordcount input.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 1...
Creating temp directory /tmp/mr_wordcount.root.20251010.190234.316778
job output is in out_wordcount
Removing temp directory /tmp/mr_wordcount.root.20251010.190234.316778...


In [12]:
!ls out_wordcount

part-00000  part-00001	part-00002
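The results are spread over several part files, so reading them back means visiting every `part-*` file. The following sketch parses such a directory into a dictionary, assuming each line has the form JSON key, TAB, JSON value. The helper name `read_job_output` is hypothetical, and the demo files are laid out for illustration only (how keys are actually distributed across part files may differ).

```python
import json
import tempfile
from pathlib import Path

def read_job_output(output_dir):
    """Parse every part-* file in output_dir into a {key: value} dict.

    Assumes each line is a JSON key and a JSON value separated by a TAB.
    """
    results = {}
    for part in sorted(Path(output_dir).glob("part-*")):
        for line in part.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            raw_key, raw_value = line.split("\t", 1)
            results[json.loads(raw_key)] = json.loads(raw_value)
    return results

# Demo on a throwaway directory that mimics three part files.
with tempfile.TemporaryDirectory() as tmp:
    for name, content in [
        ("part-00000", '"chars"\t141312\n'),
        ("part-00001", '"lines"\t3384\n'),
        ("part-00002", '"words"\t26543\n'),
    ]:
        (Path(tmp) / name).write_text(content, encoding="utf-8")
    counts = read_job_output(tmp)

print(counts)
```

On the real output, `read_job_output("out_wordcount")` would be the corresponding call.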


You should get this output:

```
"chars"	141312
"lines"	3384
"words"	26543
```

In [13]:
!cat out_wordcount/part-*

"chars"	141312
"lines"	3384
"words"	26543


## Launch job on a Hadoop cluster

All jobs until now ran locally as simple Python processes. If you have a Hadoop cluster, you can launch the job on the cluster and take advantage of distributed computing (see [running your job different ways](https://mrjob.readthedocs.io/en/latest/guides/quickstart.html#running-your-job-different-ways) in mrjob's quickstart guide).

We are going to first start a Hadoop cluster using Bigtop.

### Why Bigtop `3.3.0`?

As of 30/5/2025, the stable Bigtop release is `3.4.0`, but this version only supports Ubuntu `24.04`, while Google Colaboratory still runs on Ubuntu `22.04`. The last Bigtop version that supports Ubuntu `22.04` is `3.3.0`, released on 20/6/2024.

You can find a list of all archived Bigtop distributions at https://archive.apache.org/dist/bigtop/ while current releases can be found at https://downloads.apache.org/.

In [14]:
!cat /etc/os-release

PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy


### Install Bigtop packages

In [15]:
%%bash
# Add Bigtop repository
echo "Adding Bigtop repository..."
curl -o /etc/apt/sources.list.d/bigtop-3.3.0.list https://archive.apache.org/dist/bigtop/bigtop-3.3.0/repos/$(lsb_release -is | tr '[:upper:]' '[:lower:]')-$(lsb_release -rs)/bigtop.list

# Download and add the Bigtop GPG key
echo "Adding Bigtop GPG key..."
wget --no-clobber -qO - https://archive.apache.org/dist/bigtop/bigtop-3.3.0/repos/GPG-KEY-bigtop | sudo apt-key add -

# Update package cache
echo "Updating package cache..."
apt update

Adding Bigtop repository...
Adding Bigtop GPG key...
OK
Updating package cache...
Get:1 http://repos.bigtop.apache.org/releases/3.3.0/ubuntu/22.04/amd64 bigtop InRelease [2,502 B]
Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:4 https://cli.github.com/packages stable InRelease
Get:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:6 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:8 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:9 http://repos.bigtop.apache.org/releases/3.3.0/ubuntu/22.04/amd64 bigtop/contrib amd64 Packages [18.7 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:11 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]
Get:12 https://developer.download.



W: http://repos.bigtop.apache.org/releases/3.3.0/ubuntu/22.04/amd64/dists/bigtop/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
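The `curl` command above builds the repository path from the lowercased distributor ID and the release number. As a rough sketch, the same URL can be derived in Python from the contents of `/etc/os-release`. It assumes that the `ID` and `VERSION_ID` fields match what `lsb_release -is` (lowercased) and `lsb_release -rs` report; the function names are hypothetical.

```python
def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip('"')
    return info

def bigtop_list_url(os_info, bigtop_version="3.3.0"):
    """Build the bigtop.list URL following the pattern of the curl command."""
    distro = f"{os_info['ID']}-{os_info['VERSION_ID']}"
    return (
        "https://archive.apache.org/dist/bigtop/"
        f"bigtop-{bigtop_version}/repos/{distro}/bigtop.list"
    )

# Sample input taken from the os-release output shown earlier.
sample = 'NAME="Ubuntu"\nVERSION_ID="22.04"\nID=ubuntu\n'
print(bigtop_list_url(parse_os_release(sample)))
```

On a real machine, the text would come from reading `/etc/os-release` instead of the inline sample.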


In [16]:
%%bash
echo 'List all available packages that match "bigtop"'
apt search bigtop

echo 'List all available packages that match "hadoop"'
apt search hadoop

List all available packages that match "bigtop"
Sorting...
Full Text Search...
bigtop-groovy/stable 2.5.4-1 all
  An agile and dynamic language for the Java Virtual Machine

bigtop-jsvc/stable 1.2.4-1 amd64
  Application to launch java daemon

bigtop-utils/stable 3.3.0-1 all
  Collection of useful tools for Bigtop

List all available packages that match "hadoop"
Sorting...
Full Text Search...
hadoop/stable 3.3.6-1 amd64
  Hadoop is a software platform for processing vast amounts of data

hadoop-client/stable 3.3.6-1 amd64
  Hadoop client side dependencies

hadoop-conf-pseudo/stable 3.3.6-1 amd64
  Pseudo-distributed Hadoop configuration

hadoop-doc/stable 3.3.6-1 all
  Hadoop Documentation

hadoop-hdfs/stable 3.3.6-1 amd64
  The Hadoop Distributed File System

hadoop-hdfs-datanode/stable 3.3.6-1 amd64
  Hadoop Data Node

hadoop-hdfs-dfsrouter/stable 3.3.6-1 amd64
  Hadoop HDFS Router

hadoop-hdfs-fuse/stable 3.3.6-1 amd64
  Mountable HDFS

hadoop-hdfs-journalnode/stable 3.3.6-1 amd64
 







We are going to install the Bigtop packages needed for running:

- HDFS
  - `hadoop-hdfs-namenode`
  - `hadoop-hdfs-datanode`
- MapReduce
  - `hadoop-mapreduce`
- YARN
  - `hadoop-yarn`
  - `hadoop-yarn-nodemanager`
  - `hadoop-yarn-resourcemanager`

All these services are going to run on a single machine (the local host).

In [17]:
!apt install -y hadoop-mapreduce hadoop-hdfs-namenode hadoop-hdfs-datanode \
                hadoop-yarn hadoop-yarn-nodemanager hadoop-yarn-resourcemanager

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  bigtop-groovy bigtop-jsvc bigtop-utils hadoop hadoop-hdfs netcat-openbsd
  zookeeper
The following NEW packages will be installed:
  bigtop-groovy bigtop-jsvc bigtop-utils hadoop hadoop-hdfs
  hadoop-hdfs-datanode hadoop-hdfs-namenode hadoop-mapreduce hadoop-yarn
  hadoop-yarn-nodemanager hadoop-yarn-resourcemanager netcat-openbsd zookeeper
0 upgraded, 13 newly installed, 0 to remove and 45 not upgraded.
Need to get 469 MB of archives.
After this operation, 609 MB of additional disk space will be used.
Get:1 http://repos.bigtop.apache.org/releases/3.3.0/ubuntu/22.04/amd64 bigtop/contrib amd64 bigtop-utils all 3.3.0-1 [5,422 B]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 netcat-openbsd amd64 1.218-4ubuntu1 [39.4 kB]
Get:3 http://repos.bigtop.apache.org/releases/3.3.0/ubuntu/22.04/amd64 bigtop/contrib amd64 bigtop-groovy all 2

### Minimal configuration

We follow the "YARN on a Single Node" section of the Hadoop documentation ([https://hadoop.apache.org/docs/stable/.../SingleCluster.html#YARN_on_a_Single_Node](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#YARN_on_a_Single_Node)).


Let us set the default filesystem in Hadoop's configuration file `core-site.xml`.

In [18]:
%%bash

cat > /etc/hadoop/conf/core-site.xml << ⬅️
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
⬅️

Configure `mapred-site.xml` and `yarn-site.xml`.

In [19]:
%%bash

cat > /etc/hadoop/conf/mapred-site.xml << ⬅️
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>/usr/lib/hadoop-mapreduce:/usr/lib/hadoop:/usr/lib/hadoop/tools/lib:/usr/lib/hadoop/*</value>
    </property>
</configuration>
⬅️

It is unclear whether `yarn.application.classpath` must be set explicitly in `yarn-site.xml`. To be on the safe side, we set it to the output of `hadoop classpath`.

In [20]:
!hadoop classpath

/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/./:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*
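The Hadoop configuration files in this section all share the same `<configuration>`/`<property>` layout, with one name/value pair per property (the classpath string printed above is one such value). As a sketch, such a file could also be rendered from a Python dictionary with the standard library. The helper name `hadoop_site_xml` is hypothetical, and `ET.indent` requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

def hadoop_site_xml(properties):
    """Render a {name: value} dict as a Hadoop <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root, space="    ")  # pretty-print; available since Python 3.9
    return ET.tostring(root, encoding="unicode")

# Same setting as in core-site.xml above.
core_site = hadoop_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
print(core_site)
```

Writing the returned string to a file under `/etc/hadoop/conf/` would have the same effect as the shell cells, which remain the simpler choice for three short files.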


In [21]:
%%bash

cat > /etc/hadoop/conf/yarn-site.xml << ⬅️
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.application.classpath</name>
        <value>/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-mapreduce/.//*:/usr/lib/hadoop-yarn/./:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*</value>
        <description>output of hadoop classpath</description>
    </property>
</configuration>
⬅️

### Format HDFS

Initialize the Hadoop Filesystem (HDFS).

In [22]:
# erase the HDFS filesystem in case it already exists
!rm -rf /tmp/hadoop-hdfs/ /tmp/hadoop-root/ 2>/dev/null

# initialize the namenode
!sudo -u hdfs hdfs namenode -format

2025-10-10 19:03:32,214 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = 87c92d6b4b45/172.28.0.12
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.3.6
STARTUP_MSG:   classpath = /etc/hadoop/conf:/usr/lib/hadoop/lib/kerb-common-1.0.1.jar:/usr/lib/hadoop/lib/nimbus-jose-jwt-9.8.1.jar:/usr/lib/hadoop/lib/jersey-server-1.19.4.jar:/usr/lib/hadoop/lib/netty-transport-4.1.89.Final.jar:/usr/lib/hadoop/lib/gson-2.9.0.jar:/usr/lib/hadoop/lib/hadoop-shaded-protobuf_3_7-1.1.1.jar:/usr/lib/hadoop/lib/netty-transport-classes-epoll-4.1.89.Final.jar:/usr/lib/hadoop/lib/jetty-util-9.4.51.v20230217.jar:/usr/lib/hadoop/lib/jetty-io-9.4.51.v20230217.jar:/usr/lib/hadoop/lib/kerby-asn1-1.0.1.jar:/usr/lib/hadoop/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.36.jar:/usr/lib/hadoop/lib/snappy-java-1.1.8.2.jar:/usr/lib/hadoop/lib/slf4j-api-

### Start Hadoop services

Start the HDFS Namenode and Datanode services.

In [23]:
%%bash
service hadoop-hdfs-namenode start
service hadoop-hdfs-datanode start

 * Started Hadoop namenode: 
 * Started Hadoop datanode (hadoop-hdfs-datanode): 




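Verify that they are up and running. The output of `jps` should list a `NameNode` and a `DataNode` process (the process IDs will differ from run to run), for example: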

```
3664 NameNode
3835 DataNode
```

In [24]:
!jps

2641 DataNode
2471 NameNode
2744 Jps
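Checking the `jps` listing by eye works for two daemons, but the check can also be automated. The sketch below parses `jps`-style output and reports which required daemons are missing; the function names are hypothetical and the sample text is copied from the output above.

```python
def running_daemons(jps_output):
    """Return the set of process names listed by jps (second column)."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            names.add(parts[1])
    return names

def missing_daemons(jps_output, required=("NameNode", "DataNode")):
    """Return the required daemons that do not appear in the jps output."""
    return sorted(set(required) - running_daemons(jps_output))

sample = "2641 DataNode\n2471 NameNode\n2744 Jps\n"
print(missing_daemons(sample))
```

An empty list means that every required daemon was found. Passing a longer `required` tuple (for instance including `NodeManager` and `ResourceManager`) extends the check to the YARN services.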


You can also verify the status of the Hadoop filesystem with `hdfs dfsadmin -report`.

Note that `hdfs` is the _superuser_ for HDFS (the superuser is the account that runs the NameNode process), so we need to run this command as `hdfs` and not as `root`.

In [25]:
!sudo -u hdfs hdfs dfsadmin -report

Configured Capacity: 115658190848 (107.72 GB)
Present Capacity: 72659013632 (67.67 GB)
DFS Remaining: 72658989056 (67.67 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Replicated Blocks:
	Under replicated blocks: 0
	Blocks with corrupt replicas: 0
	Missing blocks: 0
	Missing blocks (with replication factor 1): 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0
Erasure Coded Block Groups: 
	Low redundancy block groups: 0
	Block groups with corrupt internal blocks: 0
	Missing block groups: 0
	Low redundancy blocks with highest priority to recover: 0
	Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (1):

Name: 127.0.0.1:9866 (localhost)
Hostname: 87c92d6b4b45
Decommission Status : Normal
Configured Capacity: 115658190848 (107.72 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 42982400000 (40.03 GB)
DFS Remaining: 72658989056 (67.67 GB)
DFS Used%: 0.00%
DFS Remaining%: 62.82%
Configured Cache Capacity: 0 (0 B)
Cache

Start the YARN services (node manager and resource manager).

In [26]:
%%bash
service hadoop-yarn-nodemanager start
service hadoop-yarn-resourcemanager start

 * Started Hadoop nodemanager: 
 * Started Hadoop resourcemanager: 




In [27]:
!jps

2641 DataNode
3186 Jps
2915 NodeManager
3111 ResourceManager
2471 NameNode


Link various folders for easy access (can be useful for debugging by opening the files in the left panel).

In [28]:
!ln -s /var/log ./
!ln -s /etc/hadoop ./
!ln -s /etc .
!ln -s /lib .

### Copy input file to HDFS

In [29]:
%%bash
sudo -u hdfs hdfs dfs -mkdir -p /user/hdfs
sudo -u hdfs hdfs dfs -chown hdfs:hdfs /user/hdfs
sudo -u hdfs hdfs dfs -put -f input.txt hdfs:///user/hdfs/

### Run the job on the cluster

With the option `-r hadoop`, the mrjob script is launched on the Hadoop cluster. It is necessary to specify the location of the Hadoop streaming jar (to locate it, use the command `find /usr/lib/hadoop -name "hadoop-streaming*.jar"`).

With the option `-v` the job runs in verbose mode.

In [30]:
!find /usr/lib/hadoop -name "hadoop-streaming*.jar"

/usr/lib/hadoop/tools/lib/hadoop-streaming-3.3.6.jar
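The same search can be done from Python, which is handy when the jar path has to be passed to a job programmatically. This is a minimal sketch using `pathlib`; the helper name `find_streaming_jars` is hypothetical, and the demo runs on a temporary directory that only mimics the layout shown above.

```python
import tempfile
from pathlib import Path

def find_streaming_jars(root):
    """Recursively list hadoop-streaming*.jar files under root, sorted."""
    return sorted(Path(root).rglob("hadoop-streaming*.jar"))

# Demo on a temporary tree instead of the real /usr/lib/hadoop.
with tempfile.TemporaryDirectory() as tmp:
    lib_dir = Path(tmp) / "tools" / "lib"
    lib_dir.mkdir(parents=True)
    (lib_dir / "hadoop-streaming-3.3.6.jar").touch()
    (lib_dir / "hadoop-common-3.3.6.jar").touch()
    found = [p.name for p in find_streaming_jars(tmp)]

print(found)
```

On the cluster node, `find_streaming_jars("/usr/lib/hadoop")` would be the equivalent of the `find` command.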


In [31]:
%%bash
#export HADOOP_ROOT_LOGGER=DEBUG,console
# Run the mrjob wordcount job using the hadoop runner
sudo -u hdfs \
hdfs dfs -rm -r myOutputDir 2>/dev/null

sudo -u hdfs python mr_wordcount.py -v -r hadoop \
    --output-dir myOutputDir \
    --hadoop-streaming-jar /usr/lib/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
    hdfs:///user/hdfs/input.txt

making runner: HadoopJobRunner(hadoop_streaming_jar=/usr/lib/hadoop/tools/lib/hadoop-streaming-3.3.6.jar, input_paths=['hdfs:///user/hdfs/input.txt'], mr_job_script=/content/mr_wordcount.py, output_dir=myOutputDir, stdin=<_io.BufferedReader name='<stdin>'>, steps=[{'type': 'streaming', 'mapper': {'type': 'script'}, 'reducer': {'type': 'script'}}], ...)
Looking for configs in /var/lib/hadoop-hdfs/.mrjob.conf
Looking for configs in /etc/mrjob.conf
No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Active configuration:
{'bootstrap_mrjob': None,
 'check_input_paths': True,
 'cleanup': ['ALL'],
 'cleanup_on_failure': ['NONE'],
 'cmdenv': {},
 'hadoop_bin': None,
 'hadoop_extra_args': [],
 'hadoop_log_dirs': [],
 'hadoop_streaming_jar': '/usr/lib/hadoop/tools/lib/hadoop-streaming-3.3.6.jar',
 'hadoop_tmp_dir': 'tmp/mrjob',
 'jobconf': {},
 'label': None,
 'libjars': [],
 'local_tmp_dir': None,
 'owner': 'hdfs',
 'py_files': [],
 'python_bin': None,
 

The output is in `myOutputDir`, together with an empty `_SUCCESS` file if everything went well.

In [32]:
!sudo -u hdfs hdfs dfs -ls

Found 3 items
-rw-r--r--   3 hdfs hdfs     151191 2025-10-10 19:04 input.txt
drwxr-xr-x   - hdfs hdfs          0 2025-10-10 19:05 myOutputDir
drwxr-xr-x   - hdfs hdfs          0 2025-10-10 19:04 tmp


In [33]:
!sudo -u hdfs hdfs dfs -ls myOutputDir

Found 2 items
-rw-r--r--   3 hdfs hdfs          0 2025-10-10 19:05 myOutputDir/_SUCCESS
-rw-r--r--   3 hdfs hdfs         42 2025-10-10 19:05 myOutputDir/part-00000


Verify the contents of the output file

In [34]:
!sudo -u hdfs hdfs dfs -cat myOutputDir/part-00000

"chars"	141312
"lines"	3384
"words"	26543


# ✨ Hybrid Word Count

This Python script defines a `mrjob` job called `HybridWordCount` that performs a word count with a _hybrid_ approach, using both classic MapReduce steps and a Spark step.

The method `steps(self)` defines the sequence of steps in the job:
- `MRStep(mapper=self.mapper_get_words)`: The first step is a classic MapReduce mapper (`mapper_get_words`) that takes each line of input, splits it into words, converts them to lowercase, and yields each word with a count of 1 (e.g., `"hello", 1`).
- `SparkStep(self.spark_wordcount)`: The second step is a Spark step (`spark_wordcount`). This step takes the output of the previous mapper as input and uses Spark's `reduceByKey` to sum the counts for each word. It then saves the aggregated word counts to the step's output location.
- `MRStep(reducer=self.reducer_aggregate_counts)`: The third step is a classic MapReduce reducer (`reducer_aggregate_counts`). It receives the output from the Spark step (which might be split across multiple files). For each word, it sums the counts found for that word. It yields `None` as the key and a tuple `(total_count, word)` as the value. Yielding `None` as the key sends all pairs to a single reducer in the next step.
- `MRStep(reducer=self.reducer_find_top_n)`: The final step is another classic MapReduce reducer (`reducer_find_top_n`). It receives all the `(count, word)` pairs from the previous reducer, sorts them in descending order of count, and then yields the top 20 words and their counts.

In essence, this script first uses a classic mapper to tokenize the input, then leverages Spark for efficient word counting, and finally uses classic reducers to aggregate results and find the top words.
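The four steps described above can be traced on a small input without Hadoop or Spark. The following plain-Python sketch is only an illustration of the data flow: a `Counter` stands in for both Spark's `reduceByKey` and the aggregating reducer, and the helper names `count_by_key` and `top_n` as well as the sample lines are assumptions made for the example.

```python
from collections import Counter

def mapper_get_words(line):
    """Step 1: emit (lowercased word, 1) for every whitespace-separated token."""
    for word in line.strip().split():
        yield word.lower(), 1

def count_by_key(pairs):
    """Steps 2 and 3: sum the counts per word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

def top_n(totals, n):
    """Step 4: sort (count, word) pairs in descending order, keep the first n."""
    ranked = sorted(((count, word) for word, count in totals.items()), reverse=True)
    return [(word, count) for count, word in ranked[:n]]

lines = ["The cat and the hat", "the Cat sat"]
pairs = [pair for line in lines for pair in mapper_get_words(line)]
result = top_n(count_by_key(pairs), n=2)
print(result)
```

Because the pairs are sorted as `(count, word)` tuples in reverse order, words with equal counts are ordered by the word itself, in descending alphabetical order.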



In [35]:
%%writefile hy_wordcount.py

from mrjob.job import MRJob
from mrjob.step import MRStep, SparkStep

class HybridWordCount(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words),
            SparkStep(self.spark_wordcount),
            MRStep(reducer=self.reducer_aggregate_counts),
            MRStep(reducer=self.reducer_find_top_n)
        ]

    # --- Step 1: Mapper (classic MR)
    def mapper_get_words(self, _, line):
        for word in line.strip().split():
            yield word.lower(), 1

    # --- Step 2: SparkStep
    def spark_wordcount(self, input_uri, output_uri):
        """
        Spark job: aggregate counts per word
        """
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()
        sc = spark.sparkContext

        # Load key-value pairs emitted by MR step
        rdd = sc.textFile(input_uri)

        # Each line is like: "word"<TAB>1 (the key is JSON-encoded, so it keeps its quotes)
        def parse_line(line):
            parts = line.split('\t')
            return parts[0], int(parts[1])

        word_rdd = rdd.map(parse_line)

        # Aggregate using Spark reduceByKey
        counts = word_rdd.reduceByKey(lambda a, b: a + b)

        # Convert to text output
        counts.map(lambda kv: f"{kv[0]}\t{kv[1]}").saveAsTextFile(output_uri)

        spark.stop()

    # --- Step 3: Reducer (classic MR)
    def reducer_aggregate_counts(self, word, counts):
        # The values yielded by the Spark step are counts for each word
        # in different partitions. We need to sum these up.
        total_count = sum(counts)
        yield None, (total_count, word) # Yield None as key to group all results

    def reducer_find_top_n(self, _, count_word_pairs):
        # Receive (count, word) pairs from the previous reducer
        # Sort and take the top N (e.g., top 20)
        top_n = sorted(count_word_pairs, reverse=True)[:20]
        for count, word in top_n:
            yield word, count


if __name__ == '__main__':
    HybridWordCount.run()

Writing hy_wordcount.py
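One detail of the Spark step is easy to miss: the lines it reads are assumed to be in mrjob's default JSON key, TAB, JSON value format, so the key arrives with its surrounding double quotes. `parse_line` keeps the key unchanged, which means the lines written by the Spark step are again valid input for the following reducer step. A small standalone sketch of that round trip (the sample line is an assumption about what step 1 writes):

```python
import json

def parse_line(line):
    """Same parsing as in spark_wordcount: split on TAB, keep the key as-is."""
    parts = line.split("\t")
    return parts[0], int(parts[1])

line = '"the"\t1'  # assumed shape of a line written by step 1
key, count = parse_line(line)
print(repr(key), count)  # the key still carries its JSON quotes

# Re-joining key and count gives a line that is again JSON key, TAB, JSON value.
out_line = f"{key}\t{count + 2}"
decoded = tuple(json.loads(part) for part in out_line.split("\t"))
print(decoded)
```

If the Spark step stripped the quotes from the key, the line would no longer decode as JSON, and the next step would presumably fail to read it.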


## Run locally

In [36]:
%%time
%%bash

# clean output directory
rm hyOutputDir/* 2>/dev/null

python hy_wordcount.py \
    --output-dir hyOutputDir \
    input.txt

No configs found; falling back on auto-configuration
No configs specified for inline runner
Running step 1 of 4...
Creating temp directory /tmp/hy_wordcount.root.20251010.190557.718609
Running step 2 of 4...
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/10 19:06:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Running step 4 of 4...
job output is in hyOutputDir
Removing temp directory /tmp/hy_wordcount.root.20251010.190557.718609...


CPU times: user 4.11 ms, sys: 300 µs, total: 4.41 ms
Wall time: 17.4 s


In [37]:
!ls -l hyOutputDir

total 4
-rw-r--r-- 1 root root 196 Oct 10 19:06 part-00000


In [38]:
!cat hyOutputDir/part-00000

"the"	1614
"and"	767
"to"	706
"a"	619
"she"	518
"of"	496
"said"	420
"it"	362
"in"	351
"was"	328
"you"	257
"i"	249
"as"	249
"alice"	221
"that"	216
"her"	207
"at"	204
"had"	176
"with"	170
"all"	154


## Run on the cluster

We are going to launch the same job but this time on the cluster through YARN.

In [39]:
%%time
%%bash
sudo -u hdfs \
hdfs dfs -rm -r hyOutputDir 2>/dev/null

# Copy the script to a location accessible by the hdfs user
cp /content/hy_wordcount.py /tmp/hy_wordcount.py

# Run the mrjob wordcount job using the hadoop runner with HADOOP_CONF_DIR set for the hdfs user
sudo -u hdfs env HADOOP_CONF_DIR=/etc/hadoop/conf python /tmp/hy_wordcount.py -v -r hadoop \
    --output-dir hyOutputDir \
    --hadoop-streaming-jar /usr/lib/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
    hdfs:///user/hdfs/input.txt

making runner: HadoopJobRunner(hadoop_streaming_jar=/usr/lib/hadoop/tools/lib/hadoop-streaming-3.3.6.jar, input_paths=['hdfs:///user/hdfs/input.txt'], mr_job_script=/tmp/hy_wordcount.py, output_dir=hyOutputDir, stdin=<_io.BufferedReader name='<stdin>'>, steps=[{'type': 'streaming', 'mapper': {'type': 'script'}}, {'jobconf': {}, 'spark_args': [], 'type': 'spark'}, {'type': 'streaming', 'reducer': {'type': 'script'}}, {'type': 'streaming', 'reducer': {'type': 'script'}}], ...)
Looking for configs in /var/lib/hadoop-hdfs/.mrjob.conf
Looking for configs in /etc/mrjob.conf
No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Active configuration:
{'bootstrap_mrjob': None,
 'check_input_paths': True,
 'cleanup': ['ALL'],
 'cleanup_on_failure': ['NONE'],
 'cmdenv': {},
 'hadoop_bin': None,
 'hadoop_extra_args': [],
 'hadoop_log_dirs': [],
 'hadoop_streaming_jar': '/usr/lib/hadoop/tools/lib/hadoop-streaming-3.3.6.jar',
 'hadoop_tmp_dir': 'tmp/mrjob',
 'jo

CPU times: user 22.2 ms, sys: 9 ms, total: 31.2 ms
Wall time: 3min 39s


In [40]:
!sudo -u hdfs hdfs dfs -ls hyOutputDir

Found 2 items
-rw-r--r--   3 hdfs hdfs          0 2025-10-10 19:09 hyOutputDir/_SUCCESS
-rw-r--r--   3 hdfs hdfs        196 2025-10-10 19:09 hyOutputDir/part-00000


In [41]:
!sudo -u hdfs hdfs dfs -cat hyOutputDir/part-00000

"the"	1614
"and"	767
"to"	706
"a"	619
"she"	518
"of"	496
"said"	420
"it"	362
"in"	351
"was"	328
"you"	257
"i"	249
"as"	249
"alice"	221
"that"	216
"her"	207
"at"	204
"had"	176
"with"	170
"all"	154


Well, this worked. But the job didn't _quite_ run on the cluster. The MapReduce steps did, but the Spark step ran locally (with the Spark that comes with PySpark).

To run the Spark step on the cluster as well, we need to install Spark. Let us start by installing Spark from the Bigtop distribution.


In [42]:
%%bash
echo 'List all available packages that match "spark"'
apt search  spark

List all available packages that match "spark"
Sorting...
Full Text Search...
alluxio/stable 2.9.3-1 all
  Reliable file sharing at memory speed across cluster frameworks

libjs-jquery.sparkline/jammy 2.1.2-3 all
  library for jQuery to generate sparklines

libsparkline-php/jammy 0.2-7 all
  sparkline graphing library for php

livy/stable 0.8.0-1 all
  Livy is an open source REST interface for interacting with Apache Spark from anywhere.

node-sparkles/jammy 1.0.1-2 all
  Namespaced global event emitter

nspark/jammy 1.7.8B2+git20210317.cb30779-2 amd64
  Unarchiver for Spark and ArcFS files

pcp-export-pcp2spark/jammy 5.3.6-1build1 amd64
  Tool for exporting data from PCP to Apache Spark

python3-sahara-plugin-spark/jammy 7.0.0-0ubuntu1 all
  OpenStack data processing cluster as a service - Spark plugin

python3-sparkpost/jammy 1.3.10-1 all
  SparkPost Python API client (Python 3)

r-cran-analysispipelines/jammy 1.0.2-1.ca2204.1 all
  CRAN Package 'analysisPipelines' (Compose Interoper





In [43]:
%%bash

for p in spark-core spark-master spark-worker; do
  echo "🛠️ Installing $p"
  apt install -y $p
done

🛠️ Installing spark-core
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  hadoop-client
The following NEW packages will be installed:
  hadoop-client spark-core
0 upgraded, 2 newly installed, 0 to remove and 45 not upgraded.
Need to get 263 MB of archives.
After this operation, 296 MB of additional disk space will be used.
Get:1 http://repos.bigtop.apache.org/releases/3.3.0/ubuntu/22.04/amd64 bigtop/contrib amd64 hadoop-client amd64 3.3.6-1 [5,330 B]
Get:2 http://repos.bigtop.apache.org/releases/3.3.0/ubuntu/22.04/amd64 bigtop/contrib amd64 spark-core all 3.3.4-1 [263 MB]
Fetched 263 MB in 8s (35.0 MB/s)
Selecting previously unselected package hadoop-client.









Start the Spark services

In [44]:
%%bash
for p in spark-master spark-worker; do
  echo "Starting $p"
  # systemctl start $p
  service $p start
done

Starting spark-master
 * Starting Spark master (spark-master): 
Starting spark-worker
 * Starting Spark worker (spark-worker): 


Check that the `spark-master` is running. `7077` is the default port for `spark-master`.

In [45]:
%%bash
# Retry the command up to 3 times with a 5-second delay
for i in {1..3}; do
  ss -tuln | grep 7077 2>/dev/null && exit 0
  sleep 5
done
exit 1 # Exit with an error code if nothing is listening on port 7077 after the retries

tcp   LISTEN 0      4096     172.28.0.12:7077       0.0.0.0:*          
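The same retry-until-listening check can be written in Python, for instance to wait for the master before submitting a job from a script. This is a minimal sketch using only the standard library; the function name `wait_for_port` and its defaults are hypothetical, and the demo connects to a throwaway local listener rather than to a real Spark master.

```python
import socket
import time

def wait_for_port(host, port, retries=3, delay=5.0, timeout=1.0):
    """Return True as soon as a TCP connection to host:port succeeds."""
    for attempt in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Not reachable yet: wait before the next attempt, if any.
            if attempt < retries - 1:
                time.sleep(delay)
    return False

# Demo against a temporary local listener instead of a real Spark master.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    demo_port = server.getsockname()[1]
    reachable = wait_for_port("127.0.0.1", demo_port, retries=1)

print(reachable)
```

For the Spark master above, the call would be `wait_for_port("172.28.0.12", 7077)`. Note that, according to the `ss` output, the master listens on the host's address rather than on `127.0.0.1`.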


Let us call the new script `hy_wordcount_sparkCluster.py`. The script itself is actually the same as before.

In [46]:
%%writefile hy_wordcount_sparkCluster.py

from mrjob.job import MRJob
from mrjob.step import MRStep, SparkStep

class HybridWordCount(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words),
            SparkStep(self.spark_wordcount),
            MRStep(reducer=self.reducer_aggregate_counts),
            MRStep(reducer=self.reducer_find_top_n)
        ]

    # --- Step 1: Mapper (classic MR)
    def mapper_get_words(self, _, line):
        for word in line.strip().split():
            yield word.lower(), 1

    # --- Step 2: SparkStep
    def spark_wordcount(self, input_uri, output_uri):
        """
        Spark job: aggregate counts per word
        """
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()
        sc = spark.sparkContext

        # Load key-value pairs emitted by MR step
        rdd = sc.textFile(input_uri)

        # Each line is like: "word"\t1 (mrjob JSON-encodes keys and values)
        def parse_line(line):
            parts = line.split('\t')
            return parts[0], int(parts[1])

        word_rdd = rdd.map(parse_line)

        # Aggregate using Spark reduceByKey
        counts = word_rdd.reduceByKey(lambda a, b: a + b)

        # Convert to text output
        counts.map(lambda kv: f"{kv[0]}\t{kv[1]}").saveAsTextFile(output_uri)

        spark.stop()

    # --- Step 3: Reducer (classic MR)
    def reducer_aggregate_counts(self, word, counts):
        # The Spark step already emits one total per word (reduceByKey),
        # so this sum normally runs over a single value per key.
        total_count = sum(counts)
        yield None, (total_count, word) # Yield None as key to group all results

    def reducer_find_top_n(self, _, count_word_pairs):
        # Receive (count, word) pairs from the previous reducer
        # Sort and take the top N (e.g., top 20)
        top_n = sorted(count_word_pairs, reverse=True)[:20]
        for count, word in top_n:
            yield word, count


if __name__ == '__main__':
    HybridWordCount.run()

Writing hy_wordcount_sparkCluster.py
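
To see what the four steps compute without starting Hadoop or Spark, the logic can be simulated in plain Python. This is only an illustrative sketch: the function names `aggregate_counts`, `find_top_n` and `run_pipeline` are made up for this example, and a dictionary stands in for both Spark's `reduceByKey` and the summing reducer.

```python
from collections import defaultdict


def mapper_get_words(line):
    # Step 1: emit (word, 1) for every whitespace-separated token
    for word in line.strip().split():
        yield word.lower(), 1


def aggregate_counts(pairs):
    # Steps 2 and 3: sum the counts per word (stand-in for reduceByKey)
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def find_top_n(totals, n=20):
    # Step 4: sort (count, word) pairs in descending order and keep the first n
    count_word_pairs = [(count, word) for word, count in totals.items()]
    top_n = sorted(count_word_pairs, reverse=True)[:n]
    return [(word, count) for count, word in top_n]


def run_pipeline(lines, n=20):
    pairs = (pair for line in lines for pair in mapper_get_words(line))
    return find_top_n(aggregate_counts(pairs), n)


sample = ["The cat and the hat", "the end"]
print(run_pipeline(sample, n=2))  # [('the', 3), ('hat', 1)]
```

Because the sort is on `(count, word)` pairs in reverse order, words with equal counts come out in reverse alphabetical order, which is why `hat` wins the tie in this sample.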



To run Spark on YARN, Spark must be able to find our Hadoop/YARN configuration. This is typically achieved by setting environment variables and making sure Hadoop's configuration files are accessible.

These are the core variables we need for YARN and Spark to work together.

```bash
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf # Redundant but safe
export SPARK_HOME=/usr/lib/spark
export PATH=$PATH:$SPARK_HOME/bin
```

We also need to define `PYTHONPATH`. This ensures the Python environment (where the `mrjob` script runs) can find the `pyspark` module.
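
When launching the job from Python rather than from a shell, the same variables can be collected in a dictionary and passed to `subprocess.run(..., env=...)`. The sketch below is an assumption about how one might do this, not something the notebook does: `build_spark_env` is a hypothetical helper, and its default paths are simply the values quoted above.

```python
import os


def build_spark_env(base_env, pyspark_path,
                    hadoop_conf_dir="/etc/hadoop/conf",
                    spark_home="/usr/lib/spark"):
    """Return a copy of base_env extended with the Spark-on-YARN variables.

    Hypothetical helper: the defaults are the paths used in this notebook.
    """
    env = dict(base_env)
    env["HADOOP_CONF_DIR"] = hadoop_conf_dir
    env["YARN_CONF_DIR"] = hadoop_conf_dir  # redundant but safe
    env["SPARK_HOME"] = spark_home

    # Append $SPARK_HOME/bin to PATH, skipping empty entries
    path_parts = [env.get("PATH", ""), os.path.join(spark_home, "bin")]
    env["PATH"] = os.pathsep.join(p for p in path_parts if p)

    # Prepend the pyspark location to PYTHONPATH, skipping empty entries
    py_parts = [pyspark_path, env.get("PYTHONPATH", "")]
    env["PYTHONPATH"] = os.pathsep.join(p for p in py_parts if p)
    return env

# Example (path taken from the cell further down):
# env = build_spark_env(os.environ,
#                       "/usr/local/lib/python3.12/dist-packages/pyspark/python")
```

Filtering out empty entries avoids a leading or trailing separator in `PATH` and `PYTHONPATH` when the variable was previously unset.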

In [47]:
!grep SPARK_HOME /etc/spark/conf/spark-env.sh

export SPARK_HOME=${SPARK_HOME:-/usr/lib/spark}
export SPARK_LIBRARY_PATH=${SPARK_LIBRARY_PATH:-${SPARK_HOME}/lib}
export SCALA_LIBRARY_PATH=${SCALA_LIBRARY_PATH:-${SPARK_HOME}/lib}


In [48]:
%%time
%%bash
sudo -u hdfs \
hdfs dfs -rm -r hyOutputDir_sparkCluster 2>/dev/null

# Copy the script to a location accessible by the hdfs user
cp /content/hy_wordcount_sparkCluster.py /tmp/hy_wordcount_sparkCluster.py

# Run the mrjob word count job with the hadoop runner, passing HADOOP_CONF_DIR to the hdfs user.
# The Spark master is not set explicitly here; the hadoop runner is expected to default to yarn.
# Define the paths as variables for clarity
HADOOP_CONF_DIR_VAL=/etc/hadoop/conf
SPARK_HOME_VAL=/usr/lib/spark
HADOOP_STREAMING_JAR=/usr/lib/hadoop/tools/lib/hadoop-streaming-3.3.6.jar
PYTHON_PACKAGES_PATH=/usr/local/lib/python3.12/dist-packages/pyspark/python

sudo -u hdfs \
  HADOOP_CONF_DIR=$HADOOP_CONF_DIR_VAL \
  PYSPARK_PYTHON=/usr/bin/python3 \
  PYTHONPATH=$PYTHON_PACKAGES_PATH:$PYTHONPATH \
  python /tmp/hy_wordcount_sparkCluster.py \
    -v -r hadoop \
    --output-dir hyOutputDir_sparkCluster \
    --hadoop-streaming-jar $HADOOP_STREAMING_JAR \
    hdfs:///user/hdfs/input.txt

making runner: HadoopJobRunner(hadoop_streaming_jar=/usr/lib/hadoop/tools/lib/hadoop-streaming-3.3.6.jar, input_paths=['hdfs:///user/hdfs/input.txt'], mr_job_script=/tmp/hy_wordcount_sparkCluster.py, output_dir=hyOutputDir_sparkCluster, stdin=<_io.BufferedReader name='<stdin>'>, steps=[{'type': 'streaming', 'mapper': {'type': 'script'}}, {'jobconf': {}, 'spark_args': [], 'type': 'spark'}, {'type': 'streaming', 'reducer': {'type': 'script'}}, {'type': 'streaming', 'reducer': {'type': 'script'}}], ...)
Looking for configs in /var/lib/hadoop-hdfs/.mrjob.conf
Looking for configs in /etc/mrjob.conf
No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Active configuration:
{'bootstrap_mrjob': None,
 'check_input_paths': True,
 'cleanup': ['ALL'],
 'cleanup_on_failure': ['NONE'],
 'cmdenv': {},
 'hadoop_bin': None,
 'hadoop_extra_args': [],
 'hadoop_log_dirs': [],
 'hadoop_streaming_jar': '/usr/lib/hadoop/tools/lib/hadoop-streaming-3.3.6.jar',
 'hadoop_t

CPU times: user 25.3 ms, sys: 8.66 ms, total: 33.9 ms
Wall time: 3min 49s


In [49]:
!sudo -u hdfs hdfs dfs -cat hyOutputDir_sparkCluster/part-00000

"the"	1614
"and"	767
"to"	706
"a"	619
"she"	518
"of"	496
"said"	420
"it"	362
"in"	351
"was"	328
"you"	257
"i"	249
"as"	249
"alice"	221
"that"	216
"her"	207
"at"	204
"had"	176
"with"	170
"all"	154


In [50]:
!sudo -u hdfs hdfs dfs -ls hyOutputDir_sparkCluster

Found 2 items
-rw-r--r--   3 hdfs hdfs          0 2025-10-10 19:14 hyOutputDir_sparkCluster/_SUCCESS
-rw-r--r--   3 hdfs hdfs        196 2025-10-10 19:14 hyOutputDir_sparkCluster/part-00000
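
The quoted keys in `part-00000` suggest that the job writes its results with mrjob's default JSON protocol, where each line is a JSON key, a tab, and a JSON value. Under that assumption, a small sketch for reading such lines back into Python could look like this (`parse_mrjob_line` is a hypothetical helper name, and the sample string just repeats the first three lines of the output shown earlier):

```python
import json


def parse_mrjob_line(line):
    """Split one 'key<TAB>value' line and JSON-decode both parts."""
    key_json, value_json = line.rstrip("\n").split("\t", 1)
    return json.loads(key_json), json.loads(value_json)


# First three lines of the part-00000 output shown earlier
sample_output = '"the"\t1614\n"and"\t767\n"to"\t706\n'
top_words = [parse_mrjob_line(line) for line in sample_output.splitlines()]
print(top_words)  # [('the', 1614), ('and', 767), ('to', 706)]
```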


**Note:** for small amounts of data, running on a cluster may take longer than running locally because of the overhead of setting up the job on the cluster. To really appreciate the advantages of Hadoop/Spark you need to work with large amounts of data and with more than the 2 virtual CPUs available in the Google Colab environment.