# EMR Hadoop Streaming Word Frequency (MapReduce) — All-in-One Notebook

This notebook runs an end-to-end **word frequency** job on **Amazon EMR** using **Hadoop Streaming**.

It will:
1. Create a small sample `logs.txt`
2. Write `mapper.py` and `reducer.py`
3. Upload input + code to S3
4. Run Hadoop Streaming (`hadoop-streaming.jar`)
5. Download and preview results

## Assumptions
- You are running this notebook **on an EMR cluster** (e.g., EMR Studio / JupyterLab attached to a cluster)
- `aws` CLI is available and your EMR role has S3 read/write permissions
- `hadoop` CLI is available
- Hadoop streaming jar exists at `/usr/lib/hadoop-mapreduce/hadoop-streaming.jar`


## 0) Configure your S3 paths
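Set your bucket and prefix. Hadoop will not start a job whose output path already exists, so `S3_OUTPUT` must not exist when the job starts; step 6 deletes it before each run.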
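Another way to satisfy the fresh-output requirement is to give every run its own output prefix. The sketch below is one possible approach and not part of the original notebook; the helper name `fresh_output_uri` and the example bucket are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

def fresh_output_uri(s3_base: str, now: Optional[datetime] = None) -> str:
    """Return an output URI under s3_base with a UTC timestamp suffix."""
    now = now or datetime.now(timezone.utc)
    # Second-level resolution is assumed to be enough for manual re-runs.
    return f"{s3_base.rstrip('/')}/output-{now:%Y%m%dT%H%M%SZ}/"

# Hypothetical bucket and prefix, for illustration only.
print(fresh_output_uri("s3://example-bucket/mapreduce/wordcount_demo"))
```

With a unique prefix per run, old results are kept instead of deleted, at the cost of cleaning them up later.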


In [17]:
import os, subprocess, textwrap
from pathlib import Path

S3_BUCKET = os.environ.get('S3_BUCKET', 'aws-logs-346690756907-us-east-1')
PREFIX = os.environ.get('MR_PREFIX', 'mapreduce/wordcount_demo')

S3_BASE = f"s3://{S3_BUCKET}/{PREFIX}".rstrip('/')
S3_INPUT = f"{S3_BASE}/input/"
S3_CODE = f"{S3_BASE}/code/"
S3_OUTPUT = f"{S3_BASE}/output/"

print('S3_INPUT :', S3_INPUT)
print('S3_CODE  :', S3_CODE)
print('S3_OUTPUT:', S3_OUTPUT)

if 'YOUR_BUCKET_NAME' in S3_BUCKET:
    print('\n⚠️  Set S3_BUCKET before running upload/job steps.')

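def run(cmd, check=True):
    """Run a command given as a list of args, echo its output, and raise on failure if check is True."""
    print('»', ' '.join(cmd))
    # Capture both streams so they appear in the notebook cell output
    p = subprocess.run(cmd, text=True, capture_output=True)
    if p.stdout:
        print(p.stdout)
    if p.stderr:
        print(p.stderr)
    if check and p.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {p.returncode}")
    return p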


S3_INPUT : s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/input/
S3_CODE  : s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/code/
S3_OUTPUT: s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/output/

## 1) Sanity checks
Verify we can find `aws`, `hadoop`, and the Hadoop Streaming jar.


In [18]:
run(['which','aws'], check=False)
run(['which','hadoop'], check=False)

STREAMING_JAR = Path('/usr/lib/hadoop-mapreduce/hadoop-streaming.jar')
print('Streaming jar exists:', STREAMING_JAR.exists(), '-', str(STREAMING_JAR))
if not STREAMING_JAR.exists():
    print('\n⚠️ Streaming jar path not found. Try:')
    print('   sudo find /usr/lib -name "*streaming*.jar" | head')


» which aws
/usr/bin/aws

» which hadoop
/usr/bin/hadoop

Streaming jar exists: True - /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
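The same checks can be done without spawning `which`, and collected into a single list of what is missing. This is a minimal sketch, assuming the tool names and jar path used above; `missing_prerequisites` is a hypothetical helper, not something the notebook defines.

```python
import shutil
from pathlib import Path

def missing_prerequisites(tools, jar_path, which=shutil.which):
    """Return the names of tools not on PATH, plus the jar path if it is absent."""
    missing = [tool for tool in tools if which(tool) is None]
    if not Path(jar_path).is_file():
        missing.append(str(jar_path))
    return missing

# An empty list means every prerequisite was found on this machine.
print(missing_prerequisites(
    ["aws", "hadoop"],
    "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
))
```

The `which` parameter exists only so the lookup can be replaced in a test; by default it is `shutil.which` from the standard library.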

## 2) Create sample input data


In [19]:
logs = textwrap.dedent('''\
Hello world hello
MapReduce makes scaling easier
Hello EMR world
Race conditions happen without synchronization
''')
with open('logs.txt','w',encoding='utf-8') as f:
    f.write(logs)
print(logs)


Hello world hello
MapReduce makes scaling easier
Hello EMR world
Race conditions happen without synchronization

## 3) Write the mapper and reducer
- Mapper emits `(word, 1)`
- Reducer sums counts per word
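To make the data flow concrete, the two roles can be simulated in memory: map each line to `(word, 1)` pairs, sort them (standing in for Hadoop's shuffle), then sum each group. This is an illustrative sketch only; the function names are hypothetical, and the regular expression is the one used in `mapper.py`.

```python
import re
from itertools import groupby
from operator import itemgetter

WORD_RE = re.compile(r"[A-Za-z0-9']+")

def map_phase(lines):
    """Emit (word, 1) for every lower-cased token, like mapper.py."""
    for line in lines:
        for word in WORD_RE.findall(line.lower()):
            yield word, 1

def reduce_phase(sorted_pairs):
    """Sum the counts of each run of identical words, like reducer.py."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

def word_count(lines):
    # sorted() plays the role of the shuffle/sort between map and reduce.
    return dict(reduce_phase(sorted(map_phase(lines))))

print(word_count(["Hello world hello", "Hello EMR world"]))
# → {'emr': 1, 'hello': 3, 'world': 2}
```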


In [20]:
mapper_py = textwrap.dedent('''\
    #!/usr/bin/env python3
    import sys
    import re

    WORD_RE = re.compile(r"[A-Za-z0-9']+")

    for line in sys.stdin:
        for word in WORD_RE.findall(line.lower()):
            print(f"{word}\t1")
''').lstrip()

reducer_py = textwrap.dedent('''\
    #!/usr/bin/env python3
    import sys

    current_word = None
    current_count = 0

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        word, count = line.split("\t", 1)
        count = int(count)

        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word = word
            current_count = count

    if current_word is not None:
        print(f"{current_word}\t{current_count}")
''').lstrip()

with open('mapper.py','w',encoding='utf-8') as f:
    f.write(mapper_py)
with open('reducer.py','w',encoding='utf-8') as f:
    f.write(reducer_py)

run(['chmod','+x','mapper.py','reducer.py'], check=False)
print('Wrote mapper.py and reducer.py')


» chmod +x mapper.py reducer.py
Wrote mapper.py and reducer.py

## 4) Quick local test (optional)


In [21]:
cmd = "cat logs.txt | ./mapper.py | sort | ./reducer.py | sort -k2,2nr | head"
print('»', cmd)
p = subprocess.run(cmd, shell=True, text=True, capture_output=True)
print(p.stdout)
if p.stderr:
    print(p.stderr)


» cat logs.txt | ./mapper.py | sort | ./reducer.py | sort -k2,2nr | head
hello	3
world	2
conditions	1
easier	1
emr	1
happen	1
makes	1
mapreduce	1
race	1
scaling	1
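The first `sort` in the pipeline matters: the reducer only compares each word with the previous one, so it relies on identical words arriving next to each other. The sketch below re-implements that logic as a function (hypothetical name `streaming_reduce`) to show what happens with and without sorting.

```python
def streaming_reduce(lines):
    """Apply reducer.py's run-based summing to 'word<TAB>count' lines."""
    results = []
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            # A new word starts: flush the finished run, if any.
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        results.append((current_word, current_count))
    return results

pairs = ["hello\t1", "world\t1", "hello\t1"]
print(streaming_reduce(pairs))          # → [('hello', 1), ('world', 1), ('hello', 1)]
print(streaming_reduce(sorted(pairs)))  # → [('hello', 2), ('world', 1)]
```

On the cluster, Hadoop sorts the mapper output by key before it reaches each reducer, which is why no explicit sort appears in the streaming command.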

## 5) Upload input + code to S3


In [32]:
if 'YOUR_BUCKET_NAME' in S3_BUCKET:
    raise ValueError('Set S3_BUCKET to a real bucket name first (or export S3_BUCKET).')

run(['aws','s3','cp','logs.txt', S3_INPUT])
run(['aws','s3','cp','mapper.py', S3_CODE])
run(['aws','s3','cp','reducer.py', S3_CODE])


print('Uploaded input and code to S3.')


» aws s3 cp logs.txt s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/input/
Completed 112 Bytes/112 Bytes (2.0 KiB/s) with 1 file(s) remaining
upload: ./logs.txt to s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/input/logs.txt

» aws s3 cp mapper.py s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/code/
Completed 182 Bytes/182 Bytes (1.9 KiB/s) with 1 file(s) remaining
upload: ./mapper.py to s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/code/mapper.py

» aws s3 cp reducer.py s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/code/
Completed 509 Bytes/509 Bytes (8.4 KiB/s) with 1 file(s) remaining
upload: ./reducer.py to s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/code/reducer.py

Uploaded input and code to S3.

## 6) Run the Hadoop Streaming job


In [33]:
# Optional cleanup so you can re-run without changing S3_OUTPUT
run(['aws','s3','rm', S3_OUTPUT, '--recursive'], check=False)

# -files ships the local mapper.py and reducer.py to every task;
# the copies uploaded to S3_CODE are not referenced by this command.
cmd = [
    'hadoop','jar', str(STREAMING_JAR),
    '-D','mapreduce.job.name=wordcount-streaming',
    '-files','mapper.py,reducer.py',
    '-mapper','mapper.py',
    '-reducer','reducer.py',
    '-input', S3_INPUT,
    '-output', S3_OUTPUT,
]
run(cmd)


» aws s3 rm s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/output/ --recursive
delete: s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/output/part-00001
delete: s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/output/part-00000
delete: s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/output/part-00002
delete: s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/output/_SUCCESS

» hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -D mapreduce.job.name=wordcount-streaming -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/input/ -output s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/output/
packageJobJar: [] [/usr/lib/hadoop/hadoop-streaming-3.4.1-amzn-4.jar] /tmp/streamjob11223802864915766302.jar tmpDir=null

2026-01-20 21:08:02,279 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ip-172
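Argument order matters in this command: Hadoop's generic options (`-D`, `-files`) must come before the streaming options (`-mapper`, `-reducer`, `-input`, `-output`). A small builder can enforce that ordering. This is a sketch under that assumption; `build_streaming_cmd` and its parameters are hypothetical names, and the example URIs are placeholders.

```python
def build_streaming_cmd(jar, input_uri, output_uri, files, mapper, reducer, props=None):
    """Build a 'hadoop jar' argument list with generic options placed first."""
    cmd = ["hadoop", "jar", str(jar)]
    # Generic options: -D key=value pairs, then -files.
    for key, value in (props or {}).items():
        cmd += ["-D", f"{key}={value}"]
    cmd += ["-files", ",".join(files)]
    # Streaming options follow the generic ones.
    cmd += ["-mapper", mapper, "-reducer", reducer]
    cmd += ["-input", input_uri, "-output", output_uri]
    return cmd

cmd = build_streaming_cmd(
    "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
    "s3://example-bucket/demo/input/",    # placeholder URI
    "s3://example-bucket/demo/output/",   # placeholder URI
    files=["mapper.py", "reducer.py"],
    mapper="mapper.py",
    reducer="reducer.py",
    props={"mapreduce.job.name": "wordcount-streaming"},
)
print(" ".join(cmd))
```

The returned list can be passed directly to the notebook's `run` helper.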

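## 7) Read results
The job reads every file under `S3_INPUT`, not only `logs.txt`. The captured output below contains far more words than the four-line sample would produce, so the input prefix evidently held additional text when this run was recorded.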


In [34]:
run(['aws','s3','ls', S3_OUTPUT])
# Stream a single part file to stdout ('-'); each reducer writes its own part-0000N file
p = subprocess.run(['aws','s3','cp', f"{S3_OUTPUT}part-00000", '-'], text=True, capture_output=True)
out = p.stdout.strip().splitlines()
print('\n'.join(out[:50]))

# Top-10 words by count within part-00000 only; the other part files are not read here
pairs = []
for line in out:
    w, c = line.split('\t')
    pairs.append((w, int(c)))
pairs.sort(key=lambda x: x[1], reverse=True)
print('\nTop 10:')
for w, c in pairs[:10]:
    print(f"{w:20s} {c}")


» aws s3 ls s3://aws-logs-346690756907-us-east-1/mapreduce/wordcount_demo/output/
2026-01-20 21:08:25          0 _SUCCESS
2026-01-20 21:08:21       9840 part-00000
2026-01-20 21:08:24       9780 part-00001
2026-01-20 21:08:25       9154 part-00002

'as	1
11	4
1991	1
2	3
20	1
5	4
50	1
596	1
6221541	1
8	5
809	1
84116	1
able	1
about	102
accepted	2
account	1
accounts	1
ache	1
actual	1
adjourn	1
advance	3
after	43
again	83
against	10
aged	1
agent	1
ago	2
agony	1
agreed	1
agreement	18
airs	1
alone	5
aloud	5
also	4
alteration	1
altered	1
altogether	5
am	16
ambition	1
among	12
angrily	9
angry	5
ann	4
answered	4
anxiously	14
applause	1
apple	1
archbishop	2
argument	4
arm	15

Top 10:
to                   811
it                   610
she                  553
in                   435
as                   273
t                    218
on                   204
this                 181
out                  118
down                 103

## Troubleshooting
- **S3 AccessDenied**: the role the cluster uses for S3 needs `s3:ListBucket`, `s3:GetObject`, and `s3:PutObject`, plus `s3:DeleteObject` for the cleanup in step 6.
- **Output already exists**: delete it (`aws s3 rm ... --recursive`) or change `PREFIX`.
- **Jar path missing**: locate with `sudo find /usr/lib -name "*streaming*.jar" | head`.
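The jar search can also be done from Python without `sudo`, which may be convenient inside the notebook. This is a minimal sketch; `find_streaming_jars` is a hypothetical helper, and directories the notebook user cannot read may simply be missed.

```python
from pathlib import Path

def find_streaming_jars(root="/usr/lib"):
    """Return jar files under root whose names contain 'streaming', sorted by path."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*streaming*.jar") if p.is_file())

for jar in find_streaming_jars():
    print(jar)
```

If a different path turns up, assign it to `STREAMING_JAR` before running step 6.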
